Topic: Auto Model
Question 1: An advantage of using Auto Model vs. building a process manually with operators is that Auto Model (Select ANY correct answer)
- is able to use GPU processors in your computer to speed up the modeling process.
- encourages users to do feature selection which is often overlooked.
- follows many data science ‘best practices’.
- uses modeling algorithms that are not available as individual operators.
Question 2: You have a new data science project and must provide a predictive model with an accuracy of at least 95%. You must also be able to interpret the model and explain how it works to decision makers in the business. Based on the Auto Model results below, which of the following four machine learning models offers a good balance of explainability and accuracy? (Select one)
- Generalized Linear Model
- Deep Learning
- Gradient Boosted Trees
- Support Vector Machine
Question 3: You run Auto Model and examine the performance of four models with this ROC curve plot:
Based on these ROC curves, which model has the best overall performance? (Select one)
- Naïve Bayes
- Logistic Regression
- Decision Tree
- Fast Large Margin
Topic: Unsupervised Techniques
Question 1: The k-means clustering algorithm works by (Select one)
- iteratively improving the position of k centroids in the sample space until an optimal placement is found.
- starting with one point in the sample space, finding more points in the space within a neighborhood ε until no more points can be found, and then repeating this process for k-1 points.
- iteratively determining the Gaussian distribution (via its mean and standard deviation) of k clusters until the probabilities of all points in the sample space are maximized.
- pairing each point with another point such that their distance is minimized, and then repeating this process with larger groups of points until there are only k clusters remaining.
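For reference, the centroid-based procedure commonly called k-means (Lloyd's algorithm) can be sketched in a few lines. This is a minimal illustration, not RapidMiner's implementation: the function names `nearest` and `kmeans`, the random-data-point initialization, and the `max_iterations` and `seed` parameters are all assumptions made for the example.

```python
import random


def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
    )


def kmeans(points, k, max_iterations=100, seed=0):
    """Cluster a list of equal-length numeric tuples into k groups."""
    rng = random.Random(seed)
    # Assumption: initialize the centroids with k randomly chosen data points.
    centroids = rng.sample(points, k)
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignments = [nearest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # no centroid moved, so the assignments are stable
        centroids = new_centroids
    return centroids, [nearest(p, centroids) for p in points]
```

Calling `kmeans(points, 2)` on two well-separated groups of points returns one centroid near the mean of each group. The result is a local optimum that depends on the starting centroids, which is why implementations typically repeat the procedure from several random starts.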
Question 2: You wish to examine the results of k-means clustering from a data set containing four attributes: a1, a2, a3, and a4. The resulting scatterplot is as follows:
- attributes a2 and a4 partition the data set well between cluster_0 and cluster_1.
- attributes a1 and a3 do not partition the data set well between cluster_0 and cluster_1.
- attributes a2 and a4 partition the data set well between cluster_0 and cluster_2.
- attributes a2 and a4 do not partition the data set between cluster_0 and cluster_2.
Question 3: Examining a correlation matrix is useful when you want to (Select one)
- ensure that a regression model is not overfitting the data.
- find attributes that may have a relationship to one another.
- eliminate data that do not fit a particular model.
- compute the accuracy of a linear regression model.
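As background, a correlation matrix holds the pairwise Pearson correlation coefficient for every pair of numerical attributes. The sketch below computes one in plain Python. The helper names `pearson` and `correlation_matrix` and the dictionary-of-columns input are assumptions for illustration and do not mirror the Correlation Matrix operator's interface.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def correlation_matrix(columns):
    """Map every pair of attribute names to its Pearson correlation."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}
```

Values near +1 or -1 indicate a strong linear relationship between two attributes, and values near 0 indicate little or none. The helper assumes no attribute is constant, since a constant column has zero spread and the coefficient is then undefined.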
Question 4: A Correlation Matrix operator can be used in feature selection by applying weights to attributes based on their correlations and then using the Select by Weights operator:
Which of the following attributes will be selected by the Select by Weights operator using the parameters given? (Select ALL correct answers)
- Wind
- Play
- Outlook
- Temperature
- Humidity
Question 5: You are performing a market basket analysis where there are over 5 million different types of electronic components for sale, and shopping carts often contain over 100 components. You have the data, but there are too many components to check all combinations, so the next step is to identify the most frequently-used combinations. Which machine learning technique would be most relevant in this scenario? (Select one)
- Decision Tree
- k-Means clustering
- Support Vector Machine
- FP-Growth
Question 6: How do you find the optimal number of clusters in k-Means? (Select ANY correct answer)
- If you are not sure, then use the default value, 5. It is almost always optimal.
- Start with X-Means instead of k-Means; it will find an optimal k according to a heuristic.
- Start with a value of k that is large relative to the number of attributes that you have and apply k-Means. Then visualize the results with a scatter plot and set k to the number of distinct clusters.
- There is no method that is consistent across all applications.
Question 7: In order to perform feature selection for a predictive machine learning model, you decide to use the Weight by Relief operator on your ExampleSet. You observe the following results:
Which of the five attributes would likely be least useful for this predictive model? (Select one)
- Year 1
- Year 2
- Year 3
- Year 4
- Year 5
Topic: Classification & Regression
Question 1: What happens to a k-NN model as you increase the value of k? (Select one)
- The bias and variance both increase.
- The bias and variance both decrease.
- The bias increases and the variance decreases.
- The bias decreases and the variance increases.
Question 2: The Naïve Bayes classifier assumes that (Select ANY correct answer)
- the attributes individually follow a Gaussian conditional probability distribution, given the class.
- the attributes individually follow a Gaussian probability distribution, independent of the class.
- the value of any attribute is statistically independent of the value of any other attribute (given the class value).
- the value of any attribute is statistically dependent on the value of any other attribute (given the class value).
Question 3: Given a small training set of 20 rows and 1000 columns, which of the following are valid reasons to NOT use a neural network with 1000s of inner nodes to build a machine learning model? (Select ANY correct answer)
- The model will likely overfit the data.
- Building the model will likely take a very long time on a standard laptop.
- The model will likely be too complex to interpret by humans.
- Building the model will require multiple GPU processors installed on a large server.
Question 4: The following is a decision tree model on the Golf data set:
For which examples will the model predict “yes”? (Select ALL correct answers)
- Outlook=rain, Wind=true, Humidity=60
- Outlook=overcast, Wind=false, Humidity=90
- Outlook=sunny, Wind=true, Humidity=60
- none of the examples above are predicted “yes”
Question 5: Logistic regression can only be used when (Select one)
- you have a numerical label and numerical attributes.
- you have a binominal label and numerical attributes.
- you have a numerical label and polynominal attributes.
- the data is from a logistics use case.
Question 6: Which of the following increases the complexity of a neural network model? (Select one)
- increasing the number of training cycles
- increasing the learning rate
- increasing the momentum
- adding more hidden layers
Question 7: Which of the following machine learning models is mathematically similar to statistical linear regression? (Select one)
- GLM
- Naïve Bayes
- k-NN
- Decision Tree
Question 8: When selecting a decision tree split criterion, which is a reason to choose Gain Ratio over Information Gain? (Select ANY correct answer)
- you have polynominal attributes with many values.
- you need to get the fastest runtime (Gain Ratio always has a shorter runtime than Information Gain).
- you have a relatively small data set (they will both take similar time to run, but Gain Ratio always gives better performance than Information Gain).
- you want a criterion that takes Information Gain, and adjusts it for each attribute based on the number of possible values.
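For reference, the two split criteria can be compared with a small sketch. It follows the textbook (C4.5-style) definitions, where gain ratio is information gain divided by the entropy of the attribute's own value distribution (the "split information"). RapidMiner's implementation may differ in detail, and the function names here are assumptions for illustration.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of nominal values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def information_gain(attribute, labels):
    """Reduction in label entropy after splitting on every attribute value."""
    total = len(labels)
    remainder = 0.0
    for value in set(attribute):
        subset = [label for a, label in zip(attribute, labels) if a == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def gain_ratio(attribute, labels):
    """Information gain normalized by the attribute's split information."""
    split_info = entropy(attribute)
    if split_info == 0:
        return 0.0  # an attribute with a single value cannot split the data
    return information_gain(attribute, labels) / split_info
```

With labels `["yes", "yes", "no", "no"]`, an ID-like attribute `["a", "b", "c", "d"]` and a two-valued attribute `["x", "x", "y", "y"]` both reach an information gain of 1.0, but their gain ratios are 0.5 and 1.0 respectively.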
Question 9: To evaluate a binominal classification machine learning model, you examine this confusion matrix:
What can you infer from this confusion matrix? (Select ALL correct answers)
- This model had 67 false positive predictions.
- This model had 67 false negative predictions.
- This model was able to correctly predict 705 “BAD” values out of a total of 772 “BAD” values in the ExampleSet.
- Data scientists would consider this a ‘balanced’ data set.
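As a reminder of how the cells of a binominal confusion matrix are defined, the sketch below tallies them from actual and predicted labels. The function name `confusion_counts` and the choice of which class counts as "positive" are assumptions for illustration, and any data passed to it here is hypothetical rather than the matrix from the question.

```python
from collections import Counter


def confusion_counts(actual, predicted, positive):
    """Count TP, FP, FN and TN for a two-class problem."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted positive: correct (TP) or a false alarm (FP).
            counts["TP" if a == positive else "FP"] += 1
        else:
            # Predicted negative: a miss (FN) or correct (TN).
            counts["FN" if a == positive else "TN"] += 1
    return counts
```

From these counts, recall for the positive class is TP / (TP + FN) and precision is TP / (TP + FP). Comparing TP + FN with FP + TN shows how evenly the two classes are represented.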
Question 10: Which of the following four modeling algorithms is least vulnerable to outlier bias? (Select one)
- Linear Regression
- Naïve Bayes
- k-NN
- GLM
Topic: Validation & Scoring
Question 1: Below is a basic RapidMiner process to build and find the performance of a machine learning model:
Which statements are correct? (Select ALL correct answers)
- Label 1 points to the training set wire.
- Label 1 points to the testing set wire.
- Operator 2 is the operator that builds the model (e.g. Decision Tree, SVM, etc.).
- Operator 3 is the operator that builds the model (e.g. Decision Tree, SVM, etc.).
Question 2: The term ‘model scoring’ in machine learning can refer to (Select ANY correct answer)
- ranking the performances of more than one model to choose the best one.
- applying a model to unseen data.
- using a model in production.
- determining whether or not a model is overfit.