Get an overview of HBase, learn how to use the HBase API and clients, and see how it integrates with Hadoop.
HBase is the open-source Hadoop database used for random, real-time read/write access to your Big Data.
- Learn how to set it up as a source or sink for MapReduce jobs, and explore its architecture and administration, with labs for practice and hands-on learning.
- Learn how HBase runs on a distributed architecture on top of commodity hardware, with hands-on practice covering the following features:
- Linear and modular scalability
- Strictly consistent reads and writes
- Automatic and configurable sharding of tables
- Automatic failover support between RegionServers
- Easy-to-use Java API for client access (see the sketch after this list)
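For a sense of what that Java client API looks like, here is a minimal sketch. It assumes an hbase-site.xml is on the classpath; the table name `demo`, family `cf`, and qualifier `q` are illustrative, not part of the course material:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseHello {
    public static void main(String[] args) throws IOException {
        // Reads hbase-site.xml from the classpath for cluster settings
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("demo"))) {
            Get get = new Get(Bytes.toBytes("row1"));            // fetch one row by key
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("q"));
            System.out.println(value == null ? "no value" : Bytes.toString(value));
        }
    }
}
```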
COURSE SYLLABUS
- Module 1 – Introduction to HBase
- HBase Overview, CAP Theorem and ACID properties
- Roles of HBase and how it differs from an RDBMS
- HBase Shell and Tables
- Module 2 – HBase Client API – The Basics
- Use of the Java API for Put, Batch, and Scan operations
- Module 3 – Client API: Administrative and Advanced Features
- Use of administrative operations and schemas
- Use of Filters, Counters, and ImportTSV tool
- Module 4 – Available HBase Clients
- Understand how interactive and batch clients interact with HBase
- Module 5 – HBase and MapReduce Integration
- Understand how MapReduce works in the Hadoop framework
- How to set up HBase as a source and a sink
- Module 6 – HBase Configuration and Administration
- Configuring HBase to optimize for different environments
- Architecture and administrative tasks
Using HBase for Real-time Access to your Big Data – Cognitive Class Exam Quiz Answers
Module 1: Introduction to HBase
Question 1: What are some of the key properties of HBase? Select all that apply.
- All HBase data is stored as bytes
- HBase can run up to 1000 queries per second at the most
- HBase is ACID compliant across all rows and tables
- HBase is a NoSQL technology
- HBase is an open source Apache project
Question 2: Which HBase component is responsible for storing the rows of a table?
- HDFS
- Region
- API
- ZooKeeper
- Master
Question 3: What is NOT a characteristic of an HBase table?
- Columns are grouped into column families
- Columns can have multiple timestamps
- Each row must have a unique row key
- NULL columns aren’t supported
- Columns can be added on the fly
Module 2: HBase Client API – The Basics
Question 1: Which HBase command is used to update existing data in a table?
- Put
- Scan
- Get
- Batch
- Delete
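For context, Put is the command that both inserts and updates: writing to an existing row and column simply adds a new version of the cell. A minimal sketch, reusing an open Table handle as in the first example (the family `cf` and qualifier `city` are illustrative):

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

class PutExample {
    // Assumes "table" was opened as in the earlier sketch
    static void updateCity(Table table) throws IOException {
        Put put = new Put(Bytes.toBytes("row1"));                  // row key
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("city"), // family:qualifier
                      Bytes.toBytes("Toronto"));                  // re-putting adds a new version
        table.put(put);
    }
}
```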
Question 2: The batch command allows the user to determine the order of execution. True or false?
- True
- False
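The point behind this question is that Table.batch() gives no ordering guarantee across its actions. A hedged sketch of a mixed batch, assuming the same illustrative table as above:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Row;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

class BatchExample {
    // Sends mixed operations in one call; HBase decides the execution order
    static void runBatch(Table table) throws IOException, InterruptedException {
        List<Row> actions = new ArrayList<>();
        Put put = new Put(Bytes.toBytes("row1"));
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v"));
        actions.add(put);
        actions.add(new Get(Bytes.toBytes("row2")));
        actions.add(new Delete(Bytes.toBytes("row3")));
        Object[] results = new Object[actions.size()]; // one result slot per action
        table.batch(actions, results);                 // order of execution is not guaranteed
    }
}
```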
Question 3: Which of the following statements are true of the scan operation? Select all that apply.
- Scanner caching is enabled by default
- The startRow and endRow parameters are both inclusive
- The addColumn() method can be used to restrict a scan
- Scanning is a resource-intensive operation
- Scan operations are used to iterate over HBase tables
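A short sketch tying these statements together, using the HBase 1.x-style setStartRow/setStopRow calls (names are illustrative). Note that the start row is inclusive while the stop row is exclusive, and that addColumn() and scanner caching are how you keep this resource-intensive operation in check:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

class ScanExample {
    static void scanRange(Table table) throws IOException {
        Scan scan = new Scan();
        scan.setStartRow(Bytes.toBytes("row100"));                // inclusive
        scan.setStopRow(Bytes.toBytes("row200"));                 // exclusive
        scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"));  // restrict the scan to one column
        scan.setCaching(100);  // rows fetched per RPC; tune for scan-heavy workloads
        try (ResultScanner scanner = table.getScanner(scan)) {
            for (Result r : scanner) {
                System.out.println(Bytes.toString(r.getRow()));
            }
        }
    }
}
```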
Module 3: Client API: Administrative and Advanced Features
Question 1: Which statement about HBase tables is incorrect?
- HColumnDescriptor is used to describe columns, not column families
- A table requires two descriptor classes
- Performance may suffer if a table has more than three column families
- Everything in HBase is stored within tables
- Each table must contain at least one column family
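The two descriptor classes this question refers to are HTableDescriptor (the table itself) and HColumnDescriptor (a column family, not individual columns). A minimal HBase 1.x-style sketch of creating a table with its one required family (names are illustrative):

```java
import java.io.IOException;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;

class CreateTableExample {
    static void createTable(Connection connection) throws IOException {
        HTableDescriptor table = new HTableDescriptor(TableName.valueOf("demo"));
        table.addFamily(new HColumnDescriptor("cf")); // at least one family is required
        try (Admin admin = connection.getAdmin()) {   // administrative operations go through Admin
            admin.createTable(table);
        }
    }
}
```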
Question 2: When using a CompareFilter, you must specify what to include as part of the scan, rather than what to exclude. True or false?
- True
- False
Question 3: What is an example of a Dedicated Filter? Select all that apply.
- SingleColumnValueFilter
- QualifierFilter
- ColumnPrefixFilter
- TimestampsFilter
- FamilyFilter
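As a concrete illustration, here is SingleColumnValueFilter, a commonly cited dedicated filter, restricting a scan to rows whose `cf:city` value matches. This uses the HBase 1.x CompareOp API; family, qualifier, and value are illustrative:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

class FilterExample {
    // Returns only rows where cf:city equals "Toronto"
    static ResultScanner filteredScan(Table table) throws IOException {
        SingleColumnValueFilter filter = new SingleColumnValueFilter(
                Bytes.toBytes("cf"), Bytes.toBytes("city"),
                CompareFilter.CompareOp.EQUAL, Bytes.toBytes("Toronto"));
        Scan scan = new Scan();
        scan.setFilter(filter);
        return table.getScanner(scan);
    }
}
```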
Module 4: Available HBase Clients
Question 1: Which statements accurately describe the HBase interactive clients? Select all that apply.
- Thrift is included with HBase
- Thrift and Avro both support C++
- With REST, data transport is always performed in binary
- Avro has a dynamic schema
- REST needs to be compiled before it can run
Question 2: Unlike an interactive client, a batch client is used to run a large set of operations in the background. True or false?
- True
- False
Question 3: Which of the following is an example of a batch client?
- PyHBase
- HBql
- Pig
- JRuby
- AsyncHBase
Module 5: HBase and MapReduce Integration
Question 1: HBase can act both as a source and a sink of a MapReduce job. True or false?
- False
- True
Question 2: Which HBase class is responsible for splitting the source data?
- TableReducer
- TableOutputFormat
- TableMapReduceUtil
- TableMapper
- TableInputFormat
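TableInputFormat splits the source table so that each map task reads one region, and TableMapReduceUtil wires it into the job. A hedged sketch of HBase as a MapReduce source (the table name `source_table` is illustrative, and the map is an identity pass-through just to show the plumbing):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

class HBaseSourceSketch {
    // TableMapper fixes the input types: row key + Result per HBase row
    static class MyMapper extends TableMapper<ImmutableBytesWritable, Result> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            context.write(row, value); // identity map; real jobs extract what they need
        }
    }

    public static Job createJob() throws IOException {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "hbase-source-sketch");
        job.setJarByClass(HBaseSourceSketch.class);
        // Wires TableInputFormat underneath; the Scan controls which rows/columns are read
        TableMapReduceUtil.initTableMapperJob(
                "source_table", new Scan(), MyMapper.class,
                ImmutableBytesWritable.class, Result.class, job);
        job.setNumReduceTasks(0);                         // map-only sketch
        job.setOutputFormatClass(NullOutputFormat.class); // discard output for the demo
        return job;
    }
}
```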
Question 3: Which of the following is NOT a component of the MapReduce framework?
- Reducer
- OutputFormat
- Mapper
- InputFormat
- All of the above are part of the MapReduce framework
Module 6: HBase Configuration and Administration
Question 1: Which of the following statements accurately describe the HBase run modes? Select all that apply.
- The standalone mode is suited for a production environment
- The pseudo-distributed mode is used for performance evaluation
- The standalone mode uses local file systems
- The distributed mode is suited for a production environment
- The distributed mode requires the HDFS
Question 2: Which is NOT a component of a region server?
- StoreFile
- MemStore
- HFile
- ZooKeeper
- HLog
Question 3: What is an example of an operational task? Select all that apply.
- BulkImport
- CopyTable
- Adding Servers
- Node decommissioning
- Import and export
Using HBase for Real-time Access to your Big Data Final Exam – Cognitive Class
Question 1: Which statements accurately describe column families in HBase? Select all that apply.
- You aren’t required to specify any column families when declaring a table
- Each region contains multiple column families
- You typically want no more than two or three column families per table
- Column families have their own compression methods
- Column families can be defined dynamically after table creation
Question 2: Which of the following is NOT a component of HBase?
- Master
- Region
- ZooKeeper
- Pig
- Region Server
Question 3: Which programming language is supported by Thrift?
- PHP
- C#
- Python
- Perl
- All of the above
Question 4: Which HBase command is used to retrieve data from a table?
- Delete
- Get
- Scan
- Batch
- Put
Question 5: The HBase Shell and the native Java API are the only available tools for interacting with HBase. True or false?
- True
- False
Question 6: Without this filter, a scan will need to check every file to see if a piece of data exists.
- WhileMatchFilter
- TimeStampsFilter
- PageFilter
- SkipFilter
- BloomFilter
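Bloom filters are configured per column family; with a row-level bloom filter, a read can skip StoreFiles that cannot contain the requested row instead of checking every file. A sketch using the HBase 1.x descriptor API (table and family names are illustrative; ROW is already the default in recent versions):

```java
import java.io.IOException;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.regionserver.BloomType;

class BloomExample {
    static void createWithBloom(Admin admin) throws IOException {
        HColumnDescriptor family = new HColumnDescriptor("cf");
        family.setBloomFilterType(BloomType.ROW); // row-level filter; ROWCOL also exists
        HTableDescriptor table = new HTableDescriptor(TableName.valueOf("demo"));
        table.addFamily(family);
        admin.createTable(table);
    }
}
```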
Question 7: What are the characteristics of the Avro client? Select all that apply.
- Avro is included with HBase
- Data transport is performed in binary
- Avro needs to be compiled before running
- Avro is a batch client
- Avro supports Python and PHP, among others
Question 8: Deleting an internal table in Hive automatically deletes the corresponding HBase table. True or false?
- True
- False
Question 9: What is the main purpose of an HBase Counter?
- To count the number of regions
- To increment column values for statistical data collection
- To count the number of region servers
- To count the number of column families
- All of the above
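Counters increment a column value atomically on the server, which avoids the read-modify-write race a client-side update would have. A minimal sketch of collecting hit statistics; the row, family, and qualifier names are illustrative:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

class CounterExample {
    // Atomically bumps a counter column, e.g. one hit count per page
    static long countHit(Table table, String page) throws IOException {
        return table.incrementColumnValue(
                Bytes.toBytes(page),    // row key, one row per page
                Bytes.toBytes("stats"), // column family
                Bytes.toBytes("hits"),  // qualifier
                1L);                    // amount to add; returns the new total
    }
}
```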
Question 10: Which file is used to specify configurations for HBase, HDFS, and ZooKeeper?
- RegionServer
- hbase-site.xml
- log4j.properties
- hbase-default.xml
- hbase-env.sh
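hbase-site.xml carries the site-specific overrides and is layered on top of the shipped hbase-default.xml; the same keys can also be set in code. A short sketch (the ZooKeeper hosts are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

class ConfigExample {
    static Configuration load() {
        // create() layers hbase-site.xml on top of hbase-default.xml from the classpath
        Configuration conf = HBaseConfiguration.create();
        // Overrides can also be applied programmatically
        conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com");
        return conf;
    }
}
```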
Question 11: Which HBase component manages the race to add a backup master?
- Primary master
- Region
- ZooKeeper
- HDFS
- Region Server
Question 12: Which component of a region server is the actual storage file of the data?
- HFile
- Store
- StoreFile
- HLog
- HRegion
Question 13: When the master node is updated, which file can be used to automatically update the other nodes in the cluster?
- syncconf.sh
- synchbase.sh
- hbase-default.xml
- hbase-site.xml
- hbase-env.sh
Question 14: There is a single HLog for each region server. True or false?
- True
- False
Question 15: What is the main purpose of the Write-Ahead log?
- To store HBase configuration details
- To store HDFS configuration details
- To flush data when the system reaches its capacity
- To prevent data loss in the event of a system crash
- To store performance details
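Every write is appended to the region server's WAL (the HLog) before it lands in the MemStore, so edits can be replayed if the server crashes before a flush. A sketch making the default durability explicit; table and column names are illustrative:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

class DurabilityExample {
    static void writeDurably(Table table) throws IOException {
        Put put = new Put(Bytes.toBytes("row1"));
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v"));
        // SYNC_WAL is the default; SKIP_WAL trades crash safety for write speed
        put.setDurability(Durability.SYNC_WAL);
        table.put(put);
    }
}
```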