Course: Data Science with Scala

Module 1: Basic Statistics and Data Types

Question 1: Which package do you import MLlib's vectors from?

• org.apache.spark.mllib.TF
• org.apache.spark.mllib.numpy
• org.apache.spark.mllib.linalg
• org.apache.spark.mllib.pandas

Question 2: Select the types of distributed matrices:

• RowMatrix
• IndexedRowMatrix
• CoordinateMatrix

Question 3: How would you calculate the mean of the following?

val observations: RDD[Vector] = sc.parallelize(Array(
  Vectors.dense(1.0, 2.0),
  Vectors.dense(4.0, 5.0),
  Vectors.dense(7.0, 8.0)))

val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)

• summary.normL1
• summary.numNonzeros
• summary.mean
• summary.normL2
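
For reference, a column-wise mean over the three observations above can be worked out by hand. The following is a minimal plain-Scala sketch (no Spark required); the object and method names are hypothetical and are not part of MLlib.

```scala
// Hypothetical helper, not an MLlib API: column-wise mean of equally sized rows.
object ColumnMeanSketch {
  def columnMeans(rows: Seq[Seq[Double]]): Seq[Double] = {
    require(rows.nonEmpty, "need at least one row")
    require(rows.forall(_.length == rows.head.length), "rows must have equal length")
    val n = rows.length.toDouble
    // transpose turns rows into columns, then each column is averaged
    rows.transpose.map(column => column.sum / n)
  }

  def main(args: Array[String]): Unit = {
    // Same values as the observations in the question
    val observations = Seq(Seq(1.0, 2.0), Seq(4.0, 5.0), Seq(7.0, 8.0))
    println(columnMeans(observations).mkString("[", ", ", "]")) // prints [4.0, 5.0]
  }
}
```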

Question 4: What task do the following lines of code perform?

import org.apache.spark.mllib.random.RandomRDDs._

val million = poissonRDD(sc, mean=1.0, size=1000000L, numPartitions=10)

• Calculate the variance
• Calculate the mean
• Generate random samples
• Calculate the variance

Question 5: MLlib uses the compressed sparse column format for sparse matrices; as such, it only keeps the non-zero entries?

• True
• False
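
To make the compressed sparse column (CSC) layout concrete, here is a minimal plain-Scala sketch that converts a small dense matrix into the three CSC arrays. The `Csc` case class and `fromDense` helper are hypothetical names for illustration only; they are not MLlib classes.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical illustration of the CSC layout; not an MLlib class.
object CscSketch {
  final case class Csc(
      numRows: Int,
      numCols: Int,
      colPtrs: Array[Int],    // numCols + 1 entries: where each column starts in `values`
      rowIndices: Array[Int], // row index of each stored value
      values: Array[Double])  // non-zero entries, column by column

  // `dense` is indexed as dense(row)(col) and assumed rectangular.
  def fromDense(dense: Array[Array[Double]]): Csc = {
    val numRows = dense.length
    val numCols = if (numRows == 0) 0 else dense(0).length
    val colPtrs = new Array[Int](numCols + 1)
    val rowIndices = ArrayBuffer.empty[Int]
    val values = ArrayBuffer.empty[Double]
    for (j <- 0 until numCols) {
      for (i <- 0 until numRows if dense(i)(j) != 0.0) {
        rowIndices += i
        values += dense(i)(j)
      }
      // After finishing column j, record where column j + 1 begins
      colPtrs(j + 1) = values.length
    }
    Csc(numRows, numCols, colPtrs, rowIndices.toArray, values.toArray)
  }
}
```

For the 3 x 2 matrix with rows (9, 0), (0, 8), (0, 6), this sketch stores only the three non-zero values. The column pointers alone do not locate a value; they must be combined with the row indices.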

Module 2: Preparing Data

Question 1: For a DataFrame object, the method describe calculates the:

• count
• mean
• standard deviation
• max
• min
• all of the above

Question 2: Which line of code drops the rows that contain null values? Select the best answer.

• val dfnan = df.withColumn("nanUniform", halfTonNaN(df("uniform")))
• dfnan.na.replace("uniform", Map(Double.NaN -> 0.0))
• dfnan.na.drop(minNonNulls = 3)
• dfnan.na.fill(0.0)

Question 3: What task do the following lines of code perform?

val lr = new LogisticRegression()
lr.setMaxIter(10).setRegParam(0.01)
val model1 = lr.fit(training)

• Perform one-hot encoding
• Train a linear regression model
• Train a logistic regression model
• Perform PCA on the data

Question 4: The StandardScalerModel transforms the data such that:

• each feature has a max value of 1
• each feature is orthogonal
• each feature has unit standard deviation and zero mean
• each feature has a min value of -1
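
As a point of reference for what standardizing a feature means, the sketch below rescales one feature in plain Scala by subtracting its mean and dividing by its standard deviation. It assumes the corrected sample standard deviation (dividing by n - 1); the object and method names are hypothetical and not Spark APIs.

```scala
// Hypothetical helper, not a Spark API: standardize a single feature column.
object StandardizeSketch {
  def standardize(xs: Seq[Double]): Seq[Double] = {
    require(xs.length > 1, "need at least two values")
    val mean = xs.sum / xs.length
    // Assumption: corrected sample variance (divide by n - 1)
    val variance = xs.map(x => (x - mean) * (x - mean)).sum / (xs.length - 1)
    val std = math.sqrt(variance)
    require(std > 0.0, "feature must not be constant")
    xs.map(x => (x - mean) / std)
  }

  def main(args: Array[String]): Unit = {
    println(standardize(Seq(1.0, 4.0, 7.0))) // prints List(-1.0, 0.0, 1.0)
  }
}
```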

Module 3: Feature Engineering

Question 1: Spark ML works with?

• tensors
• vectors
• dataframes
• lists

Question 2: The function IndexToString() performs one-hot encoding?

• True
• False

Question 3: Principal Component Analysis is primarily used for:

• converting categorical variables to integers
• predicting discrete values
• dimensionality reduction

Question 4: One important step prior to using PCA is:

• making sure every feature is not correlated
• taking the log for your data
• subtracting the mean

Module 4: Fitting a Model

Question 1: You can use decision trees for:

• regression
• classification
• classification and regression
• data normalization

Question 2: What does the following line of code do? val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

• split the data into training and testing data
• train the model
• use 70% of the data for testing
• use 30% of the data for training
• make a prediction
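
The idea behind a weighted random split can be sketched in plain Scala: each element is assigned independently to the first or second part, so the resulting sizes are only approximately proportional to the weights. This is a simplified illustration with hypothetical names, not Spark's actual implementation of randomSplit.

```scala
import scala.util.Random

// Hypothetical helper, not Spark's randomSplit: two-way random split with a seed.
object RandomSplitSketch {
  def split[A](data: Seq[A], firstFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
    require(firstFraction >= 0.0 && firstFraction <= 1.0, "fraction must be in [0, 1]")
    val rng = new Random(seed)
    // Draw exactly one random number per element, then separate by the outcome
    val tagged = data.toVector.map(x => (x, rng.nextDouble() < firstFraction))
    val first = tagged.collect { case (x, true) => x }
    val second = tagged.collect { case (x, false) => x }
    (first, second)
  }
}
```

Calling `RandomSplitSketch.split(data, 0.7, 42L)` would place roughly 70% of the elements in the first part and the rest in the second; every element lands in exactly one part.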

Question 3: In the Random Forest Classifier, the method .setNumTrees():

• sets the max depth of trees
• sets the minimum number of classes before a split
• sets the number of trees

Question 4: Elastic net regularization uses:

• L0-norm
• L1-norm
• L2-norm
• a convex combination of the L1 norm and L2 norm
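
The elastic net penalty can be written as a small function. The sketch below follows the form alpha * lambda * ||w||_1 + (1 - alpha) * (lambda / 2) * ||w||_2^2, with alpha in [0, 1]; the object, method, and parameter names are hypothetical and chosen only for illustration.

```scala
// Hypothetical helper, not a Spark API: elastic net penalty for a weight vector.
object ElasticNetSketch {
  def penalty(weights: Seq[Double], lambda: Double, alpha: Double): Double = {
    require(alpha >= 0.0 && alpha <= 1.0, "alpha must be in [0, 1]")
    val l1 = weights.map(w => math.abs(w)).sum // L1 norm
    val l2Squared = weights.map(w => w * w).sum // squared L2 norm
    lambda * (alpha * l1 + (1.0 - alpha) * 0.5 * l2Squared)
  }

  def main(args: Array[String]): Unit = {
    val w = Seq(3.0, -4.0)
    println(penalty(w, lambda = 1.0, alpha = 1.0)) // pure L1: prints 7.0
    println(penalty(w, lambda = 1.0, alpha = 0.0)) // pure L2: prints 12.5
    println(penalty(w, lambda = 1.0, alpha = 0.5)) // mix: prints 9.75
  }
}
```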

Module 5: Pipeline and Grid Search

Question 1: What task does the following code perform: withColumn("paperscore", data("A2") * 4 + data("A") * 3)?

• add 4 columns to A2
• add 3 columns to A1
• add 4 to each element in column A2
• assign a higher weight to A2 and A journals

Question 2: In an estimator:

• there is no need to call the method fit
• the fit function is called
• only the transform function is called

Question 3: Which is not a valid type of Evaluator in MLlib?

• RegressionEvaluator
• MulticlassClassificationEvaluator
• MultiLabelClassificationEvaluator
• BinaryClassificationEvaluator
• All are valid

Question 4: In the following lines of code, the last transform in the pipeline is a:

val rf = new RandomForestClassifier().setFeaturesCol("assembled").setLabelCol("status").setSeed(42)

import org.apache.spark.ml.Pipeline

val pipeline = new Pipeline().setStages(Array(value_band_indexer, category_indexer, label_indexer, assembler, rf))

• principal component analysis
• Vector Assembler
• String Indexer
• Vector Assembler
• Random Forest Classifier

Question 1

What is not true about labeled points?

• They associate sparse vectors with a corresponding label/response
• They associate dense vectors with a corresponding label/response
• They are used in unsupervised machine learning algorithms
• All are true
• None are true

Question 2

Which is true about column pointers in sparse matrices?

• By themselves, they do not represent the specific physical location of a value in the matrix
• They never repeat values
• They have the same number of values as the number of columns
• All are true
• None are true

Question 3

What is the name of the most basic type of distributed matrix?

• CoordinateMatrix
• IndexedRowMatrix
• SparseMatrix
• SimpleMatrix
• RowMatrix

Question 4

A perfect correlation is represented by what value?

• 3
• 1
• -1
• 100
• 0
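
For context on the scale of a correlation coefficient, the following plain-Scala sketch computes the Pearson correlation of two equally long sequences. The names are hypothetical; this is an illustration, not the MLlib Statistics.corr implementation.

```scala
// Hypothetical helper, not an MLlib API: Pearson correlation coefficient.
object PearsonSketch {
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.length == ys.length && xs.length > 1, "need two equally long sequences")
    val meanX = xs.sum / xs.length
    val meanY = ys.sum / ys.length
    val covXY = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val varX = xs.map(x => (x - meanX) * (x - meanX)).sum
    val varY = ys.map(y => (y - meanY) * (y - meanY)).sum
    require(varX > 0.0 && varY > 0.0, "inputs must not be constant")
    // The result always lies in the interval [-1, 1]
    covXY / math.sqrt(varX * varY)
  }
}
```

With `xs = Seq(1.0, 2.0, 3.0)`, passing `ys = Seq(2.0, 4.0, 6.0)` gives a value at the upper end of the range, and `ys = Seq(6.0, 4.0, 2.0)` gives a value at the lower end.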

Question 5

A MinMaxScaler is a transformer which:

• Rescales each feature to a specific range
• Takes no parameters
• Makes zero values remain untransformed
• All are true
• None are true
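
Min-max rescaling of a single feature can be sketched in plain Scala as below. The formula maps the observed minimum to `lo` and the observed maximum to `hi`; the names `rescale`, `lo`, and `hi` are hypothetical, and mapping a constant feature to the midpoint of the target range is an assumption of this sketch.

```scala
// Hypothetical helper, not a Spark API: rescale one feature to the range [lo, hi].
object MinMaxSketch {
  def rescale(xs: Seq[Double], lo: Double = 0.0, hi: Double = 1.0): Seq[Double] = {
    require(xs.nonEmpty, "need at least one value")
    require(lo <= hi, "lo must not exceed hi")
    val observedMin = xs.min
    val observedMax = xs.max
    if (observedMax == observedMin) {
      // Assumption: a constant feature is mapped to the midpoint of the target range
      xs.map(_ => 0.5 * (lo + hi))
    } else {
      xs.map(x => (x - observedMin) / (observedMax - observedMin) * (hi - lo) + lo)
    }
  }

  def main(args: Array[String]): Unit = {
    println(rescale(Seq(1.0, 4.0, 7.0))) // prints List(0.0, 0.5, 1.0)
  }
}
```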

Question 6

Which is not a supported Random Data Generation distribution?

• Poisson
• Uniform
• Exponential
• Delta
• Normal

Question 7

Sampling without replacement means:

• The expected number of times each element is chosen is randomized
• The expected size of the sample is a fraction of the RDD's size
• The expected number of times each element is chosen
• The expected size of the sample is unknown
• The expected size of the sample is the same as the RDD's size

Question 8

What are the supported types of hypothesis testing?

• Pearson’s Chi-Squared Test for goodness of fit
• Pearson’s Chi-Squared Test for independence
• Kolmogorov-Smirnov test for equality of distribution
• All are supported
• None are supported

Question 9

For Kernel Density Estimation, which kernel is supported by Spark?

• KDEMultivariate
• KDEUnivariate
• Gaussian
• KernelDensity
• All are supported

Question 10

Which DataFrames statistics method computes the pairwise frequency table of the given columns?

• freqItems()
• cov()
• crosstab()
• pairwiseFreq()
• corr()

Question 11

Which is not true about the fill method for DataFrame NA functions?

• It is used for replacing NaN values
• It is used for replacing nil values
• It is used for replacing null values
• All are true
• None are true

Question 12

Which transformer listed below is used for Natural Language processing?

• StandardScaler
• OneHotEncoder
• ElementwiseProduct
• Normalizer
• None are used for Natural Language processing

Question 13

Which is true about the Mahalanobis Distance?

• It is a scale-variant distance
• It does not take into account the correlations of the dataset
• It is measured along each Principal Component axis
• It is a multi-dimensional generalization of measuring how many standard deviations a point is away from the median
• It has units of distance

Question 14

• It must be told which column to create for its output
• It creates a Sparse Vector
• It must be told which column is its input
• All are true
• None are true

Question 15

Principal Component Analysis is:

• Never used for feature engineering
• Used for supervised machine learning
• A dimension reduction technique
• All are true
• None are true

Question 16

MLlib’s implementation of decision trees:

• Supports only multiclass classification
• Does not support regressions
• Partitions data by rows, allowing distributed training
• Supports only continuous features
• None are true

Question 17

Which is not a tunable of SparkML decision trees?

• maxBins
• maxMemoryInMB
• minInstancesPerNode
• minDepth
• minInfoGain

Question 18

Which is true about Random Forests?

• They support non-categorical features
• They combine many decision trees in order to reduce the risk of overfitting
• They do not support regression
• They only support binary classification
• None are true

Question 19

When comparing Random Forests versus Gradient-Boosted Trees, what must you consider?

• How the number of trees affects the outcome
• Depth of Trees
• Parallelization abilities
• All of these
• None of these

Question 20

Which is not a valid type of Evaluator in MLlib?

• MultiLabelClassificationEvaluator
• RegressionEvaluator
• BinaryClassificationEvaluator
• MulticlassClassificationEvaluator
• All are valid