Analyzing Big Data in R using Apache Spark Cognitive Class Certification Answers
Module 1: Introduction to SparkR
Question 1: What shells are available for running SparkR?
- Spark-shell
- SparkSQL shell
- SparkR shell
- RSpark shell
- None of the options is correct
Question 2: What is the entry point into SparkR?
- SRContext
- SparkContext
- RContext
- SQLContext
Question 3: When would you need to call sparkR.init?
- using the R shell
- using the SR-shell
- using the SparkR shell
- using the Spark-shell
Module 2: Data manipulation in SparkR
Question 1: True or false? DataFrames make use of Spark RDDs.
- False
- True
Question 2: True or false? You need read.df to create DataFrames from data sources.
- True
- False
Question 3: What does the groupBy function output?
- An AggregateOrder object
- A GroupedData object
- An OrderBy object
- A GroupBy object
Module 3: Machine learning in SparkR
Question 1: What is the goal of MLlib?
- Integration of machine learning into SparkSQL
- To make practical machine learning scalable and easy
- Visualization of Machine Learning in SparkR
- Provide a development workbench for machine learning
- All of the options are correct
Question 2: What would you use to create plots? (Check all that apply.)
- pandas
- Multiplot
- ggplot2
- matplotlib
- All of the above are correct
Question 3: Spark MLlib is a module of Apache Spark
- False
- True
Analyzing Big Data in R using Apache Spark Final Exam Answers – Cognitive Class
Question 1: Which of these are NOT characteristics of SparkR?
- it supports distributed machine learning
- it provides a distributed data frame implementation
- is a cluster computing framework
- a light-weight front end to use Apache Spark from R
- None of the options is correct
Question 2: True or false? The client connection to the Spark execution environment is created by the shell for users using Spark.
- True
- False
Question 3: Which of the following are not features of Spark SQL?
- performs extra optimizations
- works with RDDs
- is a distributed SQL engine
- is a Spark module for structured data processing
- None of the options is correct
Question 4: True or false? select returns a SparkR DataFrame.
- False
- True
Question 5: SparkR defines the following aggregation functions:
- sumDistinct
- Sum
- count
- min
- All of the options are correct
Question 6: We can use the SparkR sql function with the sqlContext as follows:
- head(sql(sqlContext, "SELECT * FROM cars WHERE cyl > 6"))
- SparkR:head(sql(sqlContext, "SELECT * FROM cars WHERE cyl > 6"))
- SparkR::head(sql(sqlContext, "SELECT * FROM cars WHERE cyl > 6"))
- SparkR(head(sql(sqlContext, "SELECT * FROM cars WHERE cyl > 6")))
- None of the options is correct
Question 7: Which of the following are pipeline components?
- Transformers
- Estimators
- Pipeline
- Parameter
- All of the options are correct
Question 8: Which of the following is NOT one of the steps in implementing a GLM in SparkR?
- Evaluate the model
- Train the model
- Implement model
- Prepare and load data
- All of the options are correct
Question 9: True or false? Spark MLlib is a module of SparkR that provides distributed machine learning algorithms.
- True
- False
Introduction to Analyzing Big Data in R using Apache Spark
Analyzing Big Data in R using Apache Spark is a powerful combination for handling large-scale data processing and analysis. Apache Spark is a distributed computing framework that provides fast and efficient data processing capabilities, while R is a popular programming language for statistical computing and data analysis.
Here are the general steps to analyze Big Data in R using Apache Spark:
- Setting Up Apache Spark: First, you need to set up Apache Spark on your system or cluster. You can download Apache Spark from its official website and follow the installation instructions provided.
- Connecting R to Apache Spark: There are several ways to connect R to Apache Spark. One common approach is to use the `sparklyr` package, which provides an R interface for Apache Spark. You can install `sparklyr` using the following command in R: `install.packages("sparklyr")`. Once installed, you can connect to an Apache Spark cluster using the `spark_connect()` function and specifying the Spark master URL (see the connection sketch after this list).
- Loading Data: After connecting to Apache Spark, you can load data into Spark DataFrames using the `spark_read_csv()`, `spark_read_parquet()`, or other similar functions provided by `sparklyr`. These functions allow you to read data from various sources such as CSV files, Parquet files, Hive tables, etc.
- Data Manipulation and Analysis: Once the data is loaded into Spark DataFrames, you can perform various data manipulation and analysis tasks using the `dplyr` syntax provided by `sparklyr`. This includes filtering, aggregating, joining, and summarizing data as needed for your analysis (an aggregation sketch follows the list).
- Running Analytical Algorithms: Apache Spark provides a wide range of machine learning algorithms through its MLlib library. You can train and apply these algorithms to your data directly within R using `sparklyr`. Common tasks include regression, classification, clustering, and collaborative filtering (a regression sketch is shown below).
- Visualizing Results: Finally, you can visualize the results of your analysis using R’s visualization libraries such as `ggplot2` or `plotly`. You can also use the `sparklyr` package’s integration with `dplyr` together with `ggplot2` for seamless visualization of Spark DataFrames (a plotting sketch is shown below).
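To make the connection and loading steps concrete, here is a minimal sketch using `sparklyr`, assuming a local Spark installation; the file path `cars.csv` and the table name `cars` are hypothetical:

```r
library(sparklyr)

# spark_install() can download a local copy of Spark if one is not present
# spark_install()

# Connect to a local Spark instance; pass your cluster's master URL in production
sc <- spark_connect(master = "local")

# Read a CSV file into a Spark DataFrame ("cars.csv" is a hypothetical path)
cars_tbl <- spark_read_csv(sc, name = "cars", path = "cars.csv", header = TRUE)
```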
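Data manipulation then uses ordinary `dplyr` verbs, which `sparklyr` translates to Spark SQL behind the scenes. This aggregation sketch assumes the `cars_tbl` handle from the previous snippet and mtcars-style columns (`cyl`, `mpg`), which are illustrative:

```r
library(dplyr)

# Filter, group, and summarize; the computation runs in Spark, not in R
cars_summary <- cars_tbl %>%
  filter(cyl > 4) %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  arrange(cyl)
```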
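For machine learning, `sparklyr` exposes Spark MLlib through its `ml_*()` functions. A regression sketch under the same assumptions (hypothetical `cars_tbl` with `mpg`, `wt`, and `cyl` columns):

```r
# Split the data, train a linear regression in MLlib, and score the test set
partitions <- sdf_random_split(cars_tbl, training = 0.8, test = 0.2, seed = 42)

model <- ml_linear_regression(partitions$training, mpg ~ wt + cyl)

predictions <- ml_predict(model, partitions$test)
```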
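Plotting usually means summarizing in Spark, collecting the small aggregated result into local R memory, and handing it to `ggplot2`; a sketch reusing the `cars_summary` table from above:

```r
library(ggplot2)

# collect() pulls the aggregated (and therefore small) result into R
cars_summary %>%
  collect() %>%
  ggplot(aes(x = factor(cyl), y = avg_mpg)) +
  geom_col() +
  labs(x = "Cylinders", y = "Average MPG")
```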
Remember that working with Big Data requires careful consideration of memory and computational resources. Apache Spark handles distributed computing transparently, but you need to ensure that your cluster is configured appropriately for your data and analysis requirements. Additionally, optimizing your Spark jobs for performance may involve techniques such as data partitioning, caching, and tuning Spark configuration parameters.
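As a sketch of such tuning with `sparklyr`, reusing names from the earlier snippets (the memory setting and partition count are illustrative placeholders, not recommendations):

```r
# Adjust Spark configuration before connecting
config <- spark_config()
config$spark.executor.memory <- "4G"
sc <- spark_connect(master = "local", config = config)

# Cache a registered table in memory so repeated queries avoid re-reading it
tbl_cache(sc, "cars")

# Repartition a Spark DataFrame when its parallelism doesn't suit the cluster
cars_tbl <- sdf_repartition(cars_tbl, partitions = 8)
```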