Enabling Apache Spark™ on HBase

Apache HBase is a distributed key-value store of data on HDFS. It’s modeled on Google’s Big Table, and provides APIs to query the data. The data is organized, partitioned, and distributed by its “row keys”. Per partition, the data is further physically partitioned by “column families” that specify collections of “columns” of data. The data model is well-suited for wide tables where columns are dynamic and the data is generally sparse.

Although HBase is a very useful big data store, its access mechanism is primitive and requires client-side APIs, Map/Reduce interfaces, and interactive shells. SQL accesses to HBase data are available through Map/Reduce or interface mechanisms like Apache Hive and Impala, or through some “native” SQL technologies like Apache Phoenix. While the former is usually cheaper to implement and use, the latencies and inefficiencies often cannot don’t compare well with the latter — and often are suitable only for offline analysis. The latter category, in contrast, often performs better and can really be considered online engines; they are often enabled on top of purpose-built execution engines.

By leveraging Spark as a unified big data processing engine, we provide a new approach to support HBase. We can benefit from the performance advantage of Spark compared to other approaches (for example, Apache Phoenix). Spark not only has the Spark DataFrame capability to query HBase, but also a command line interface (CLI) to support new DDL / DML commands.

  • Create table

CREATE TABLE tablename (colname datatype, ..., PRIMARY KEY(colname, ...)) MAPPED BY (htablename, COLS=[colname=family_name.qualifier])

A SQL table on HBASE is basically a logical table mapped to a HBase table. This mapping can be many-to-one in order to support “schema-on-read” for SQL access to HBase data.

The “htablename” denotes the HBase table; the “primary key” constraint denotes the HBase row key composition of columns; “colname=familyname.qualifier” denotes the mapping of the second column to the HBase tables column qualifier of “qualifier” of column family “familyname”.

The table and the column families specified have to exist in HBase in order for the CREATE TABLE statement to succeed. In addition, the columns in the primary key cannot be mapped to another column family/column qualifier combination. Other normal SQL sanity checks, such as uniqueness of logical columns, are applied as well.

  • Query table


Query table syntax will simply leverage on Spark query syntax or DataFrame syntax.

  • Alter table

ALTER TABLE tablename ADD (colname data_type, ...) MAPPED BY (expression)

ALTER TABLE tablename DROP colname

Alter table can either add more columns or drop columns to the existing table.

  • Drop table

DROP TABLE table_name

Drop table will not delete the HBase table but simply remove the SQL table with its schema.

  • Insert data

INSERT INTO TABLE table_name SELECT clause

INSERT INTO TABLE table_name VALUES (value, ...)

An HBase key must be present for insertion while inserting the data. Also the regular SQL sanity check (for example, uniqueness of logical columns), will be performed for insertion.

  • Bulk load


Bulk load provides a way for qbuser to load data into an existing HBase table. Currently CSV-type files are supported.

  • Other useful commands

We also support “SHOW TABLES” and “DESCRIBE ” for catalog information.

The current project can be found in Spark Packages:


By combining Spark with HBase, we hope to enable more Spark usage on the popular HBase data story but in a SQL way — while also providing high performance computing on the big data ecosystem.

Author’s Bio:

Bo has more than 15 years software development experience on various projects. He was an Eclipse committer for WTP project. His current interests include HBase, Spark SQL and data federation in IBM Spark Technology Center. Bo holds a Ph.D. in Computer Science from Nanjing University.


You Might Also Enjoy

Kevin Bates
Kevin Bates
9 months ago

Limit Notebook Resource Consumption by Culling Kernels

There’s no denying that data analytics is the next frontier on the computational landscape. Companies are scrambling to establish teams of data scientists to better understand their clientele and how best to evolve product solutions to the ebb and flow of today’s business ecosystem. With Apache Hadoop and Apache Spark entrenched as the analytic engine and coupled with a trial-and-error model to... Read More

Gidon Gershinsky
Gidon Gershinsky
a year ago

How Alluxio is Accelerating Apache Spark Workloads

Alluxio is fast virtual storage for Big Data. Formerly known as Tachyon, it’s an open-source memory-centric virtual distributed storage system (yes, all that!), offering data access at memory speed and persistence to a reliable storage. This technology accelerates analytic workloads in certain scenarios, but doesn’t offer any performance benefits in other scenarios. The purpose of this blog is to... Read More