**Enroll Here: Spark MLlib Cognitive Class Exam Quiz Answers**

**Spark MLlib Cognitive Class Certification Answers**

**Module 1 – Spark MLlib Data Types Quiz Answers – Cognitive Class**

**Question 1: Sparse Data generally contains many non-zero values, and few zero values.**

- True
- **False**
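
The statement is false because sparse data is mostly zeros, which is why a sparse representation stores only the non-zero entries. A minimal pure-Python sketch of that idea follows; the helper names `to_sparse` and `to_dense` are illustrative assumptions and are not part of Spark MLlib.

```python
def to_sparse(dense):
    """Return (size, indices, values), keeping only the non-zero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def to_dense(size, indices, values):
    """Rebuild the full list, filling every unlisted position with 0.0."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# Eight entries, only two of them non-zero.
dense = [0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 7.5, 0.0]
size, indices, values = to_sparse(dense)
print(size, indices, values)
```

The sparse form here stores two indices and two values instead of eight numbers, and the saving grows with the proportion of zeros.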

**Question 2: Local matrices are generally stored in distributed systems and rarely on single machines.**

- True
- **False**

**Question 3: Which of the following are distributed matrices?**

- Row Matrix
- Column Matrix
- Coordinate Matrix
- Spherical Matrix
- **Row Matrix and Coordinate Matrix**
- All of the Above
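
As a rough illustration of how the two distributed layouts relate, a coordinate matrix can be thought of as a collection of `(row, column, value)` entries, and a row-oriented matrix as a collection of row vectors. The sketch below uses a plain Python list in place of a distributed collection, and `entries_to_rows` is a hypothetical helper rather than a Spark function.

```python
from collections import defaultdict

# Coordinate-style entries: (row index, column index, value).
entries = [(0, 1, 2.0), (2, 0, 5.0), (2, 3, 1.5)]


def entries_to_rows(entries, num_cols):
    """Group coordinate entries into one dense row per row index."""
    rows = defaultdict(lambda: [0.0] * num_cols)
    for i, j, v in entries:
        rows[i][j] = v
    return dict(rows)


rows = entries_to_rows(entries, num_cols=4)
print(rows)
```

Row 1 has no entries, so it does not appear in the result; the coordinate layout suits matrices that are very sparse in both dimensions.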

**Module 2 – Review Algorithms Quiz Answers – Cognitive Class**

**Question 1: Logistic Regression is an algorithm used for predicting numerical values.**

- True
- **False**

**Question 2: The SVM algorithm maximizes the margins between the generated hyperplane and two clusters of data.**

- **True**
- False

**Question 3: Which of the following is true about Gaussian Mixture Clustering?**

- The closer a data point is to a particular centroid, the more likely that data point is to be clustered with that centroid.
- The Gaussian of a centroid determines the probability that a data point is clustered with that centroid.
- The probability of a data point being clustered with a centroid is a function of distance from the point to the centroid.
- Gaussian Mixture Clustering uses multiple centroids to cluster data points.
- **All of the Above**

**Module 3 – Spark MLlib Decision Trees and Random Forests Quiz Answers – Cognitive Class**

**Question 1: Which of the following is a stopping parameter in a Decision Tree?**

- The number of nodes in the tree reaches a specific value.
- **The depth of the tree reaches a specific value.**
- The breadth of the tree reaches a specific value.
- All of the Above

**Question 2: When using a regression type of Decision Tree or Random Forest, the value for impurity can be measured as either ‘entropy’ or ‘variance’.**

- True
- **False**
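
The statement is false because entropy is defined over class labels, so it only applies to classification trees, whereas variance is the impurity measure for numeric regression targets. The functions below are a small standard-library sketch of the three usual measures (Gini, entropy, variance); the function names and the base-2 logarithm are choices made for this illustration.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def variance(targets):
    """Population variance of numeric regression targets."""
    mean = sum(targets) / len(targets)
    return sum((t - mean) ** 2 for t in targets) / len(targets)


class_labels = [0, 0, 1, 1]             # classification node
numeric_targets = [1.0, 2.0, 3.0, 4.0]  # regression node
print(gini(class_labels), entropy(class_labels), variance(numeric_targets))
```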

**Question 3: In a Random Forest, featureSubsetStrategy is considered a stopping parameter, but not a tunable parameter.**

- True
- **False**

**Module 4 – Spark MLlib Clustering Quiz Answers – Cognitive Class**

**Question 1: In Spark MLlib, the initialization mode for the K-Means training method is called**

- k-means--
- k-means++
- **k-means||**
- k-means

**Question 2: In K-Means, the “runs” parameter determines the number of data points allowed in each cluster.**

- True
- **False**

**Question 3: In Gaussian Mixture Clustering, the sum of all values outputted from the “weights” function must equal 1.**

- **True**
- False

**Spark MLlib Final Exam Answers – Cognitive Class**

**Question 1: In Gaussian Mixture Clustering, the predictSoft function provides membership values from the top three Gaussians only.**

- True
- **False**
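
Soft membership assigns a value to every Gaussian in the mixture, not just the top few, and those values sum to 1 for each point. The one-dimensional sketch below computes such memberships by hand; it is not Spark's `predictSoft` implementation, and the mixture parameters and the helper `soft_membership` are made up for illustration.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def soft_membership(x, weights, means, stds):
    """Membership of x in each Gaussian: weighted density, normalised to sum to 1."""
    scores = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(scores)
    return [s / total for s in scores]


# Hypothetical mixture of four Gaussians; the weights themselves sum to 1.
weights = [0.5, 0.3, 0.1, 0.1]
means = [0.0, 5.0, 10.0, 15.0]
stds = [1.0, 1.0, 2.0, 2.0]

membership = soft_membership(4.0, weights, means, stds)
print(membership)
```

The point 4.0 lies closest to the second Gaussian, so that component receives the largest membership, but all four components still get a value.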

**Question 2: In Decision Trees, what is true about the size of a dataset?**

- **Large datasets create “bins” on splits, which can be specified with the maxBins parameter.**
- Large datasets sort feature values, then use the ordered values as split calculations.
- Small datasets create split candidates based on quantile calculations on a sample of the data.
- Small datasets split on random values for the feature.

**Question 3: A Logistic Regression algorithm is ineffective as a binary response predictor.**

- True
- **False**

**Question 4: What is the Row Pointer for a Matrix with the following Row Indices: [5, 1 | 6 | 2, 8, 10]**

- [1, 6]
- **[0, 2, 3, 6]**
- [0, 2, 3, 5]
- [2, 3]
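
The pointer array can be derived by reading each `|` in the question as a boundary between groups: it starts at 0 and each later entry adds the size of the next group. A short sketch, with `pointer_from_groups` as a hypothetical helper name:

```python
def pointer_from_groups(groups):
    """Running total of group sizes, starting from 0."""
    pointer = [0]
    for group in groups:
        pointer.append(pointer[-1] + len(group))
    return pointer


# The indices from the question, split at each "|".
groups = [[5, 1], [6], [2, 8, 10]]
pointer = pointer_from_groups(groups)
print(pointer)
```

The last entry always equals the total number of stored indices, which is 6 here.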

**Question 5: For multiclass classification, try to use (M-1) Decision Tree split candidates whenever possible.**

- True
- **False**

**Question 6: In a Decision Tree, choosing a very large maxDepth value can:**

- Increase accuracy
- Increase the risk of overfitting to the training set
- Increase the cost of training
- All of the Above
- **Increase the risk of overfitting and increase the cost of training**

**Question 7: In Gaussian Mixture Clustering, a large value returned from the weights function represents a large precedence of that Gaussian.**

- **True**
- False

**Question 8: Increasing the value of epsilon when creating the K-Means Clustering model can:**

- **Decrease training cost and decrease the number of iterations that the model undergoes**
- Decrease training cost and increase the number of iterations that the model undergoes
- Increase training cost and decrease the number of iterations that the model undergoes
- Increase training cost and increase the number of iterations that the model undergoes

**Question 9: In order to train a machine learning model in Spark MLlib, the dataset must be in the form of a(n)**

- Python List
- Textfile
- CSV file
- **RDD**

**Question 10: What is true about Dense and Sparse Vectors?**

- A Dense Vector can be created using a csc_matrix, and a Sparse Vector can be created using a Python List.
- A Dense Vector can be created using a SciPy csc_matrix, and a Sparse Vector can be created using a SciPy NumPy Array.
- **A Dense Vector can be created using a Python List, and a Sparse Vector can be created using a SciPy csc_matrix.**
- A Dense Vector can be created using a SciPy NumPy Array, and a Sparse Vector can be created using a Python List.

**Question 11: In a Decision Tree, increasing the maxBins parameter allows for more splitting candidates.**

- **True**
- False

**Question 12: In classification models, the value for the numClasses parameter does not depend on the data, and can change to increase model accuracy.**

- True
- **False**

**Question 13: What is true about Labeled Points?**

- A – A labeled point is used with supervised machine learning, and can be made using a dense local vector.
- B – A labeled point is used with unsupervised machine learning, and can be made using a dense local vector.
- C – A labeled point is used with supervised machine learning, and can be made using a sparse local vector.
- D – A labeled point is used with unsupervised machine learning, and can be made using a sparse local vector.
- All of the Above
- **A and C only**

**Question 14: In the Gaussian Mixture Clustering model, the convergenceTol value is a stopping parameter that can be tuned, similar to epsilon in k-means clustering.**

- **True**
- False

**Question 15: In Gaussian Mixture Clustering, the “Gaussians” function outputs the coordinates of the largest Gaussian, as well as the standard deviation for each Gaussian in the mixture.**

- True
- **False**

**Question 16: What is true about the maxDepth parameter for Random Forests?**

- A large maxDepth value is preferred since tree averaging yields a decrease in overall bias.
- **A large maxDepth value is preferred since tree averaging yields a decrease in overall variance.**
- A large maxDepth value is preferred since tree averaging yields an increase in overall bias.
- A large maxDepth value is preferred since tree averaging yields an increase in overall variance.
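
The variance-reduction effect of averaging can be seen in a toy simulation. The sketch below assumes each tree behaves like an independent, unbiased but noisy estimate of the true value, which real forests only approximate since their trees are correlated; the function names and noise level are illustrative.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable


def noisy_prediction(true_value=10.0, noise_std=2.0):
    """Stand-in for one deep tree: right on average, but high variance."""
    return random.gauss(true_value, noise_std)


def forest_prediction(num_trees):
    """Average the predictions of num_trees independent noisy trees."""
    return sum(noisy_prediction() for _ in range(num_trees)) / num_trees


single = [forest_prediction(1) for _ in range(2000)]
averaged = [forest_prediction(25) for _ in range(2000)]

single_var = statistics.pvariance(single)
averaged_var = statistics.pvariance(averaged)
print(single_var, averaged_var)
```

Under these assumptions the variance of the averaged prediction is far smaller than that of a single tree, while the mean stays near the true value, so averaging lowers variance without changing bias.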

**Introduction to Spark MLlib**

Spark MLlib is Apache Spark’s machine learning library, designed to make practical machine learning scalable and easy. Its main features are:

- **Scalability**: Apache Spark is known for its scalability, making it capable of handling large-scale data processing tasks efficiently. MLlib leverages Spark’s distributed computing framework, enabling it to process massive datasets across a cluster of machines in parallel.
- **Machine Learning Algorithms**: Spark MLlib offers a wide range of machine learning algorithms and utilities that are commonly used for classification, regression, clustering, collaborative filtering, dimensionality reduction, and more. These algorithms are designed to work efficiently in a distributed computing environment.
- **APIs**: MLlib provides two sets of APIs for building machine learning pipelines: the RDD-based API and the DataFrame-based API. The DataFrame-based API, also known as the “high-level API,” is more user-friendly and encourages a more concise and expressive coding style. It is built on top of Spark SQL and DataFrames, allowing users to seamlessly integrate machine learning tasks with data manipulation and querying.
- **Pipelines**: Spark MLlib introduces the concept of pipelines, which enable users to chain together multiple stages of data processing and machine learning algorithms into a single workflow. Pipelines facilitate the construction, tuning, and deployment of machine learning models in a systematic and reproducible manner.
- **Feature Transformers and Selectors**: MLlib provides a rich set of feature transformers for preprocessing raw input data, including methods for feature scaling, encoding categorical variables, handling missing values, and extracting features from text and images. Additionally, it offers feature selectors to automatically identify and select the most relevant features for model training.
- **Model Evaluation**: MLlib includes utilities for evaluating the performance of machine learning models using various metrics such as accuracy, precision, recall, F1-score, and area under the ROC curve (AUC). These evaluation metrics help users assess the quality of their models and make informed decisions during model selection and tuning.
- **Integration with Spark Ecosystem**: Spark MLlib seamlessly integrates with other components of the Spark ecosystem, such as Spark SQL, Spark Streaming, and Spark GraphX. This integration allows users to perform end-to-end data processing, analysis, and machine learning tasks within a unified platform.
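
To make the evaluation metrics mentioned above concrete, the sketch below computes precision, recall, and F1-score by hand for a small binary example. It uses plain Python rather than MLlib's evaluator classes, and the labels, predictions, and function name are invented for illustration.

```python
def precision_recall_f1(labels, predictions, positive=1):
    """Compute precision, recall and F1 for one positive class."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical ground truth and model output.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 1, 1, 0, 1, 0]

precision, recall, f1 = precision_recall_f1(labels, predictions)
print(precision, recall, f1)
```

With three true positives, one false positive, and one false negative, all three metrics come out to 0.75 in this example.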

Overall, Spark MLlib empowers data scientists and engineers to build and deploy scalable machine learning solutions on large-scale datasets using Apache Spark’s distributed computing capabilities.