
CACHE Table in Apache Spark™ SQL

For users who want to improve query performance by caching table data in memory, here are a few considerations.

You can cache a table either from an application, with sqlContext.cacheTable("tableName") or dataFrame.cache(), or from the Spark SQL shell, with CACHE TABLE tableName. Subsequent queries against the cached table use InMemoryColumnarTableScan to scan and retrieve only the required column(s).

For example:

scala> sqlContext.cacheTable("t4")

scala> val df = sqlContext.sql("select col1 from t4")
df: org.apache.spark.sql.DataFrame = [col1: int]

scala> df.explain(true)
== Parsed Logical Plan ==
'Project ['col1]
+- 'UnresolvedRelation `t4`, None

== Analyzed Logical Plan ==
col1: int
Project [col1#103]
+- MetastoreRelation default, t4, None

== Optimized Logical Plan ==
Project [col1#103]
+- InMemoryRelation [col1#103,col2#104,col3#105], true, 10000, StorageLevel(true, true, false, true, 1), HiveTableScan [col1#73,col2#74,col3#75], MetastoreRelation default, t4, None, Some(t4)

== Physical Plan ==
InMemoryColumnarTableScan [col1#103], InMemoryRelation [col1#103,col2#104,col3#105], true, 10000, StorageLevel(true, true, false, true, 1), HiveTableScan [col1#73,col2#74,col3#75], MetastoreRelation default, t4, None, Some(t4)
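
For completeness, here is a minimal sketch of the alternative caching calls, assuming the same sqlContext and table t4 as in the example above:

// Cache by table name via the SQLContext, as in the example above.
sqlContext.cacheTable("t4")

// Or mark a DataFrame for caching; it is materialized lazily, on the
// first action that scans it.
val df = sqlContext.table("t4")
df.cache()
df.count()  // forces materialization into the in-memory columnar store

// In the Spark SQL shell, the SQL equivalent is:
//   CACHE TABLE t4;

// Release the cached data when it is no longer needed.
sqlContext.uncacheTable("t4")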

It’s worth noting that prior to Apache Spark™ 1.5.2, caching a Parquet table had an issue. Specifically, a query selecting from the cached Parquet table did not actually scan via InMemoryColumnarTableScan. Instead, its physical plan scanned the ParquetRelation, which could degrade performance.

The problem was that the LogicalRelation wrapping the ParquetRelation carries expectedOutputAttributes, a list of resolved fields whose expression IDs (exprIds) are not guaranteed to be the same across resolutions. When a table is cached, the LogicalRelation wrapping the ParquetRelation becomes the key in the cache and the resulting InMemoryRelation is the value. When a new query comes in, the freshly resolved LogicalRelation wraps the same ParquetRelation, but its expectedOutputAttributes carry different exprIds than the cached key. As a result, the cache lookup misses, and the planner falls back to scanning the ParquetRelation instead of the cached InMemoryRelation.
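
To illustrate, here is a deliberately simplified, hypothetical sketch (these are stand-ins, not the real Catalyst classes) of how exprId-sensitive equality on the cache key defeats the lookup:

// Hypothetical, simplified stand-ins for Catalyst classes, for illustration only.
case class Attribute(name: String, exprId: Long)   // exprId differs per resolution
case class ParquetRelation(path: String)           // the underlying data source
case class LogicalRelation(relation: ParquetRelation,
                           expectedOutputAttributes: Seq[Attribute])

object CacheLookupDemo extends App {
  val parquet = ParquetRelation("/warehouse/t4")

  // Resolution at cache time assigns one set of exprIds...
  val cacheKey = LogicalRelation(parquet, Seq(Attribute("col1", 73)))
  val cache = Map[LogicalRelation, String](cacheKey -> "InMemoryRelation")

  // ...but a later query resolves the same table with fresh exprIds.
  val lookupKey = LogicalRelation(parquet, Seq(Attribute("col1", 103)))

  // Case-class equality includes the attributes, so the lookup misses
  // and the planner falls back to a Parquet scan.
  println(cache.get(lookupKey))  // None

  // Comparing the underlying relation instead would match (the fix).
  println(cache.keys.exists(_.relation == lookupKey.relation))  // true
}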

Instead of comparing the wrapping LogicalRelations when looking up the key in the cache, the code should compare the underlying ParquetRelations directly. This issue is fixed in 1.5.2 and 1.6.0.
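
One way to verify the behavior on a given build is to cache a Parquet-backed table and inspect the physical plan. A sketch, where the path and table name are illustrative:

// Register a Parquet file as a temporary table (path is illustrative).
sqlContext.read.parquet("/path/to/t4_parquet").registerTempTable("t4p")
sqlContext.cacheTable("t4p")

// On 1.5.2/1.6.0 the physical plan should show InMemoryColumnarTableScan;
// on an affected build it shows a scan over the ParquetRelation instead.
sqlContext.sql("select col1 from t4p").explain()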

Bio: Xin Wu is an active contributor to Apache Spark at the IBM Spark Technology Center (STC). Xin’s main focus is the Spark SQL component. Prior to joining the STC, he was a developer on Big SQL, IBM’s SQL-on-Hadoop engine.

