
How Alluxio is Accelerating Apache Spark Workloads

Alluxio is fast virtual storage for Big Data. Formerly known as Tachyon, it’s an open-source, memory-centric, virtual distributed storage system (yes, all that!), offering data access at memory speed and persistence to reliable storage. This technology accelerates analytic workloads in certain scenarios, but doesn’t offer any performance benefit in others. The purpose of this blog is to help Spark users decide when it makes sense to integrate Alluxio into their data-crunching pipeline.

Scalability and expanded storage

What can Alluxio do for you? First, its file API and command-line toolkit make it possible to store and access files in the host RAM. Alluxio workers can be deployed on multiple hosts to provide greater scalability, and more speed through proximity to the application workloads. Aside from RAM, Alluxio can also use the host HDD and any available SSD resources to expand the local storage capacity and reduce memory consumption. Beyond that, Alluxio provides a unified interface to various storage systems (such as HDFS, Swift, or S3), offering transparent access to the stored data and to its local Alluxio cache. It can also persist the cached data in a robust storage backend. The sketch below illustrates the file API.
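
As a minimal illustration of the file API, here is a sketch against the Alluxio 1.x Java client (the class and method names follow that client; the path is hypothetical, and the exact API may differ in your Alluxio version):

    import alluxio.AlluxioURI
    import alluxio.client.file.FileSystem

    // Connect to the Alluxio master configured for this client.
    val fs = FileSystem.Factory.get()

    // Write a small file into Alluxio (lands in the MEM tier by default).
    val path = new AlluxioURI("/tmp/demo.txt")
    val out = fs.createFile(path)
    out.write("hello, Alluxio".getBytes("UTF-8"))
    out.close()

    // Read it back.
    val in = fs.openFile(path)
    val buf = new Array[Byte](1024)
    val n = in.read(buf)
    in.close()
    println(new String(buf, 0, n, "UTF-8"))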

Apache Spark and other analytic frameworks can connect to Alluxio using the integration modules included in the Alluxio distribution. A growing number of companies (such as Baidu, Barclays, Qunar) deploy Spark with Alluxio in real-life scenarios; these typically involve remote or very slow data sources, where local caching can significantly accelerate data access and exchange.

We ran a series of performance benchmarks to systematically analyze the basic Alluxio capabilities, and their effect on Spark workloads. This helped us find several new scenarios in which Alluxio can accelerate Spark applications. We also found a number of configurations where using Alluxio doesn’t benefit the Spark workloads.

The default Alluxio configuration uses a MEM tier – the host memory. Alluxio tries to keep the data on the same host as the application that reads or writes it. (This is possible, of course, only if an Alluxio worker is deployed on that host.)

We found Alluxio to be extremely fast in this configuration. Applications write/read directly to/from the local RAM disk; meanwhile, the Alluxio worker performs the metadata bookkeeping and is not involved in the data exchange. We observed I/O throughput as high as 20 GBps on strong hosts (1), with multi-threaded test applications writing to and reading from Alluxio.

You can also deploy Alluxio on hosts separate from the application workloads. This is practical for running and managing Alluxio in its own cluster, without consuming the application hosts’ resources. Just keep in mind that the throughput of this kind of configuration is limited by the network/NIC speed between the hosts (e.g., 10 Gbps in our setup). We also found that the Alluxio workers’ CPU consumption becomes significant in this configuration, since they must serialize and send data to other hosts.

Running Alluxio with other tiers (HDD or SSD) allowed us to significantly expand the caching capacity and to operate without consuming the host RAM. Here, the throughput is limited by the disk speed (e.g., 0.2 GBps in our HDD tests). A sample tier configuration is sketched below.
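
For reference, a two-tier worker setup might look like the following in alluxio-site.properties (the property names follow the Alluxio 1.x documentation; the paths and sizes are illustrative, so check them against your release):

    # alluxio-site.properties (illustrative values)
    alluxio.worker.memory.size=64GB
    alluxio.worker.tieredstore.levels=2
    alluxio.worker.tieredstore.level0.alias=MEM
    alluxio.worker.tieredstore.level0.dirs.path=/mnt/ramdisk
    alluxio.worker.tieredstore.level1.alias=HDD
    alluxio.worker.tieredstore.level1.dirs.path=/mnt/hdd1
    alluxio.worker.tieredstore.level1.dirs.quota=1TB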

(Figure: Alluxio configuration)

Running Apache Spark with Alluxio

To connect Apache Spark applications to Alluxio, simply add the Alluxio client (“glue”) jar to the classpath and use the “alluxio://..” URI scheme, as shown below.
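
A minimal sketch of the hookup (the jar name and master hostname are illustrative; 19998 is the default Alluxio master port):

    // Launch the shell with the Alluxio client jar on the classpath, e.g.:
    //   spark-shell --jars /path/to/alluxio-core-client-jar-with-dependencies.jar
    // Then address Alluxio paths through the alluxio:// scheme:
    val data = sc.textFile("alluxio://alluxio-master:19998/data/input")
    data.saveAsTextFile("alluxio://alluxio-master:19998/data/output")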

To find the Spark data ingestion speed limit, we ran a lightweight “line-count” workload in a Spark shell, deployed on a single host. We read pre-cached data from in-memory Alluxio running on the same host.

With 36 execution cores (9 executors x 4 cores), the Spark line count ran at 4 GBps – on the same strong host where we previously observed the 20 GBps raw Alluxio I/O throughput. This indicates Spark’s ingestion throughput limit (object data read and deserialization). It also shows that this limit is much lower than the Alluxio I/O capacity when Alluxio is deployed in the default (local, in-memory) configuration. Clearly, Alluxio allows Spark to run at its maximum speed. Just keep in mind that the Spark ingestion limit is higher than the Alluxio throughput in separate-host or HDD configurations; this is where Alluxio can become a bottleneck for lightweight Spark workloads.
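
The probe itself is a one-liner; a sketch of how it can be timed (the path is hypothetical):

    // Count lines in a pre-cached Alluxio data set and derive throughput.
    val t0 = System.nanoTime()
    val lines = sc.textFile("alluxio://alluxio-master:19998/bench/data").count()
    val secs = (System.nanoTime() - t0) / 1e9
    println(f"$lines lines in $secs%.1f s")
    // Ingestion throughput = total input bytes / elapsed seconds.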

Using Alluxio with object storage

Object storage systems can retrieve stored data at very high speeds. Still, the delivery rate from the stores to applications is often limited by the network or the application host’s NIC bandwidth. With a 10 Gbps NIC, the Spark line-count workload doesn’t reach its maximal throughput when reading the data from an object store – even if the store is deployed in the same data center as the Spark cluster.

We observed a ~3x speedup when running a lightweight Spark sample on data from host-local Alluxio, as opposed to data from a DC-local object store. Of course, this corresponds to the difference between the 4 GBps Spark/Alluxio line-count speed mentioned above and the 1.2 GBps object store read speed (capped by the Spark host NIC).

In short, caching the data in local Alluxio can accelerate Spark workloads for remote data stores, as well as for stores deployed in the same data center as the Spark cluster. One way to pre-cache such data is sketched below.
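
For example, a job can copy the object store data into Alluxio once, so that subsequent reads stay on the host (a sketch; the swift:// connector setup and all paths are illustrative):

    // One-time pre-caching pass: read from the object store, write to Alluxio.
    val raw = sc.textFile("swift://bench.provider/data")
    raw.saveAsTextFile("alluxio://alluxio-master:19998/cache/data")

    // Subsequent jobs read the local copy instead of crossing the NIC.
    val cached = sc.textFile("alluxio://alluxio-master:19998/cache/data")
    println(cached.count())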

However, the latter applies only to lightweight applications. Alluxio doesn’t help with heavier workloads, where most of the time is spent on data processing rather than on fetching data from storage. For example, a word-count workload took exactly the same time whether the data was stored in local Alluxio or in a DC-local object store.

Working with small data objects

The network bandwidth provides a reasonable indication of data access speed when it comes to medium-sized and large objects. Here, the time of object data delivery significantly exceeds the time of object metadata operations (such as list and head). But when a data set comprises many small objects, the network speed is no longer the bottleneck; the I/O time depends primarily on the store’s overhead of per-object metadata operations.

We found this kind of overhead to be approximately five times lower for Alluxio storage than for an object store (Swift/Cleversafe in our tests). We ran a Spark line-count benchmark in an identical setup, but with tens of thousands of 1 MB objects instead of hundreds of 64 MB objects. Our results showed a 1.2 GBps throughput when reading the data from Alluxio, compared to 0.25 GBps when reading it from a DC-local object store. We measured the 0.25 GBps throughput on a host with a 10 Gbps NIC, so the storage data retrieval was not limited by the network in this scenario. For data sets comprising many small objects, Alluxio caching can therefore accelerate Spark workloads even when the storage access bandwidth is unlimited. This is true for both local (same-host) and separate-host Alluxio deployments (for the same reason – the NIC is not the bottleneck).

When is Alluxio right for your Spark workload?

Here are some simple rules of thumb to help you decide when Alluxio is useful for your Spark workload:

  1. Multiple Spark applications access (read/write) the same data, or a Spark application pre-fetches data for a single time-sensitive use.

    With a single Spark application instance, you can use Spark caching instead of Alluxio – unless you want the data pre-cached before the start of a time-sensitive application.
  2. Storage I/O time is greater than the CPU time of data processing.

    Storage I/O time refers to the end-to-end time of data set delivery, from storage to application. This includes the latency, the size/bandwidth transfer time, and the storage disk reads, as well as object listing and metadata retrieval.
    CPU time refers to the time it takes the application to process the data set. In Spark, it can be estimated by running the workload with serialized (MEMORY_ONLY_SER) caching of a sample RDD – see the sketch after this list.

  3. Storage I/O time is greater than Alluxio I/O time.

    The Alluxio I/O time (speed) depends on the Alluxio configuration (MEM/SSD/HDD tiers; same-host, separate-host, or cluster deployment) and includes the overhead of object listing and metadata delivery.

  4. The data set size is up to roughly double the Alluxio cache size.

    The size of the “hot” data sets should not exceed the Alluxio caching capacity by much. They can be somewhat larger, since partial caching still accelerates data access.
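
As promised in rule 2, here is a sketch of estimating the CPU time of a workload with a serialized in-memory sample (the path is hypothetical; MEMORY_ONLY_SER is Spark’s storage level for serialized caching):

    import org.apache.spark.storage.StorageLevel

    // Cache a sample RDD serialized in memory, and materialize it once
    // (this first pass still includes the storage I/O time).
    val sample = sc.textFile("alluxio://alluxio-master:19998/data/sample")
      .persist(StorageLevel.MEMORY_ONLY_SER)
    sample.count()

    // Time the processing alone: the data is now served from Spark's cache.
    val t0 = System.nanoTime()
    sample.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).count()
    println(s"CPU time estimate: ${(System.nanoTime() - t0) / 1e9} s")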

In all of the above scenarios, one should not ignore the costs associated with an Alluxio deployment – be it the resources (memory, disks, or separate hosts) made unavailable to other services, or the runtime maintenance costs.

You are welcome to contact me (gidon@il.ibm.com) if you have any questions on deployment or would like to share your results.

By: Gidon Gershinsky


(1) Dual Intel Xeon E5-2690 (12 hyper-threaded cores, 2.60 GHz), 256 GB RAM
