Non-Obvious Application of Spark™ as a Cloud-Sync Tool

When most people think about Apache Spark™, they think about analytics and machine learning. In my upcoming talk at Spark Summit East, I'll discuss leveraging Spark in conjunction with Kafka in a hybrid cloud environment, applying Spark's batch and micro-batch analytic capabilities to transactional data in place of performing traditional ETL. Combining these two open source tools is a non-obvious application of Spark that we are exploring to address the challenges of data sync in a polyglot hybrid cloud.

I am not referring to any specific suite of IBM products, but rather sharing insight into some of our thinking and experimentation on how to handle data replication in highly transactional systems with multiple versions of the same data, as well as in the context of an IoT platform. I will discuss both as they apply to the use case of a connected car.

The rise of hybrid cloud

Why are we exploring this application of Spark and Kafka for a hybrid cloud environment? The industry has been acting under the delusion that the race is to the public cloud. IBM, our competitors, and the IT community have been singularly focused on this end game, or have been shaping our business models to drive toward it. The end game all along should have been a private-public hybrid cloud, with a focus on getting most uses to the public portion of the hybrid cloud where possible. Some of the current partnerships demonstrate that our competitors are seeing this as well. While growth of public cloud is still inevitable, a private-public hybrid of some kind is a likely equilibrium. While 90% of companies in a survey reported using cloud in some way, the same survey reported that only 30% of businesses surveyed are on a public-cloud-only trajectory (Northbridge 2016 Cloud Survey). Additionally, Northbridge reports that 30% of respondents have a public cloud strategy, 23% a private cloud strategy, and 47% a hybrid cloud strategy moving forward.


Four main factors have led to this normalization: differentiation based on data, data security, data regulation, and organizational/industry readiness. These four factors also contribute to how an organization applies a data-science-driven strategy.

Unique challenges of hybrid cloud

Defining what cloud means for your organization is the first step in the journey to the cloud. The end result will more than likely involve a hybrid cloud of some kind. This will likely be a private-public hybrid, and the public portion of your cloud will likely involve more than one cloud provider. Additionally, the mode of operation in the cloud is polyglot persistence, meaning that the same data will live in multiple data stores depending on the use of that data. These stores can live in your private cloud or your public cloud. How do you keep that data in sync?

Keeping data in sync is relatively straightforward when only one store is being transacted on, but becomes more complicated when both stores are being transacted on. Now imagine you have three, four, five, or ten versions of this data in live transactional systems. How do you keep this data synced in some fashion?
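To make the difficulty concrete, here is a minimal sketch in plain Python of what happens when the same record is written in more than one store. It is only an illustration, not the approach covered in the talk: the `ChangeEvent` type, the store names, the keys, and the last-writer-wins rule are all assumptions chosen for the example, and a real system would need a more careful conflict policy.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class ChangeEvent:
    """One write to one copy of the data (hypothetical shape for this sketch)."""
    store: str      # name of the store where the write happened
    key: str        # record identifier shared by all copies
    value: str      # value written
    timestamp: int  # time of the write; assumes comparable clocks across stores


def merge_last_writer_wins(events: List[ChangeEvent]) -> Dict[str, ChangeEvent]:
    """Pick one winning write per key: the latest timestamp wins.

    Ties are broken by store name so the result is deterministic.
    This rule is an assumption for illustration only.
    """
    winners: Dict[str, ChangeEvent] = {}
    for event in events:
        current = winners.get(event.key)
        if current is None or (event.timestamp, event.store) > (
            current.timestamp,
            current.store,
        ):
            winners[event.key] = event
    return winners


def find_conflicts(events: List[ChangeEvent]) -> List[str]:
    """Return the keys that were written in more than one store."""
    stores_by_key: Dict[str, Set[str]] = {}
    for event in events:
        stores_by_key.setdefault(event.key, set()).add(event.store)
    return sorted(key for key, stores in stores_by_key.items() if len(stores) > 1)


# Hypothetical connected-car writes arriving from two stores.
events = [
    ChangeEvent("private", "vin-123/odometer", "42000", 10),
    ChangeEvent("public", "vin-123/odometer", "42010", 12),
    ChangeEvent("private", "vin-123/owner", "alice", 5),
]
merged = merge_last_writer_wins(events)
conflicts = find_conflicts(events)
```

With a single writable store, `find_conflicts` would always return an empty list and the merge step would be trivial. Every additional writable copy adds more keys that need a resolution rule, which is the part that gets complicated.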

Consider the connected car. Depending on the manufacturer, a connected car operates in one of three states: fully private cloud, fully public cloud, or hybrid cloud. However, they all operate with some concept of tenancy, they all have significant data replication occurring on live transactional data, and they all have some level of edge compute occurring.


This creates unique challenges for traditional ETL, which requires a single version of source data. We propose that combining tools like Kafka and Spark can modernize how we approach ETL-type problems in a modern data environment. We are coining 'Stream Transformation' to describe this process. James Spyker at the Spark Technology Center has done some initial work on this concept, and we’ll dig a little deeper into it at the keynote.

See James Spyker’s blog post on Stream Transformation.

J. White Bear from the Spark Technology Center will give a technical presentation at Spark Summit the day before this keynote on utilizing Kafka and Spark to distribute computation at the edge and in the cloud in IoT applications. Her specific application is Simultaneous Localization and Mapping (SLAM) for autonomous vehicles. This application sits at the other end of the spectrum, with extremely high transaction rates and a need for immediate transformation and integration of multiple data sets to perform near real-time analytics both at the edge and in the cloud.

See J. White Bear’s blog post on SLAM with Spark and Kafka.

Moving to the cloud takes work, but the fun and hard challenges start once you get there. Keeping data in sync across multiple environments and multiple availability zones while being able to act on data in near real-time is daunting. But the framework is in front of us today, and most of this is possible because of the power of the open source community and the work the community, including IBM, has done in readying tools like Spark, Kafka, Python, R, and others for enterprise application.

