Apache Spark™

“In 1997, IBM asked James Barry to make sense of the company’s struggling web server business. Barry found that IBM had lots of pieces of the puzzle in different parts of the company, but not an integrated product offering for web services. His idea was to put together a coordinated package, which became WebSphere. The problem was that a key piece of the package, IBM’s web server software, was technically weak. It held less than 1 percent of a market.”

“Barry approached Brian Behlendorf [President of the Apache Software Foundation] and the two quickly discovered common ground on technology issues. Building a practical relationship that worked for both sides was a more complex problem. Behlendorf’s understandable concern was that IBM would somehow dominate Apache. IBM came back with concrete reassurances: It would become a regular player in the Apache process, release its contributions to the Apache code base as open source, and earn a seat on the Apache Committee just the way any programmer would by submitting code and building a reputation on the basis of that code. At the same time, IBM would offer enterprise-level support for Apache and its related WebSphere product line, which would certainly help build the market for Apache.”

  • Reference: Steven Weber, The Success of Open Source, 2004

In the 20th century, scale effects in business were largely driven by size and distribution. A company with manufacturing operations around the world had an inherent cost and distribution advantage, leading to more competitive products. A retailer with a global base of stores had a distribution advantage that could not be matched by a smaller company. These scale effects drove competitive advantage for decades.

The Internet changed all of that.

In the modern era, there are three predominant scale effects:

  • Network: lock-in that is driven by a loyal network (Facebook, Twitter, Etsy, etc.)
  • Economies of Scale: lower unit cost, driven by volume (Apple, TSMC, etc.)
  • Data: superior machine learning and insight, driven from a dynamic corpus of data

I profiled a few of the companies that are exploiting data effects in my book Big Data Revolution—CoStar, IMS Health, Monsanto, etc. But by and large, big data is an unexploited scale effect in institutions around the world.

Spark will change all of that.

Thirty days ago we launched Hack Spark inside IBM, and we saw an immediate groundswell of innovation. We made Spark available to over 10,000 developers in IBM; 28,000 showed up for the contest. Teams formed around interest areas, moonshots were imagined, and many became real. We gave the teams ‘free time’ to work on Spark, but the interest was so great that it began to monopolize their nights and weekends. After ten days, we had over 100 submissions in our Hack Spark contest.

We saw things accomplished that we had not previously imagined. That is the power of Spark.

To give you a sampling of what we saw:

Genomics: A team built a powerful SQL/R/Scala development environment for data scientists to analyze genomic data from the web or other sources. They provided a machine learning wizard that lets scientists quickly dig into chromosome data (k-means clustering of genomes by population). This auto-scaling cloud system increased the speed of processing and analyzing massive genome data and put the power in the hands of the person who knows the data best. Exciting.
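
The clustering step the team describes can be sketched in miniature. The snippet below is a toy, pure-Python illustration of k-means grouping genome-derived feature vectors into populations; the data and dimensions are invented, and at real scale this is what Spark's MLlib k-means does across a cluster rather than on one machine.

```python
def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(cluster):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(cluster)
    return tuple(sum(coord) / n for coord in zip(*cluster))

def kmeans(points, k, iters=20):
    """Naive k-means: seed centroids with the first k points (real
    implementations use random or k-means++ initialization), then
    alternate assignment and centroid-update steps."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    labels = [min(range(k), key=lambda i: dist2(p, centroids[i]))
              for p in points]
    return centroids, labels

# Two toy "populations" of 2-D genome-derived feature vectors (invented data)
population_a = [(0.10, 0.20), (0.20, 0.10), (0.15, 0.25)]
population_b = [(5.00, 5.10), (5.10, 5.00), (4.90, 5.05)]
centroids, labels = kmeans(population_a + population_b, k=2)
```

Run on these well-separated points, the algorithm assigns each population to its own cluster; the wizard's value is doing exactly this without the scientist writing any of it.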

Traffic Planning: A team built an Internet of Things (IoT) application for urban traffic planning, providing real-time analytics with spatial and cellular data. Messaging queues could not handle the massive, continuous data inputs. Data lakes could not handle the large volume of cellular signaling data in real time. Spark could. The team used Spark as the engine of the computing pool, Oozie as the controller module, and Kafka as the messaging module. The result is an application that processes massive cellular signaling data and visualizes the analytics in real time.
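
A hypothetical miniature of that pattern, in plain Python: events arrive continuously, are grouped into fixed time windows (micro-batches), and each window is aggregated as it closes. Conceptually, this is what a Spark Streaming job does with a Kafka topic feeding it cellular records; the event shape and cell IDs here are invented for illustration.

```python
from collections import defaultdict

def micro_batch_counts(events, window_sec=10):
    """Group (timestamp, cell_id) events into fixed windows and count
    signals per cell tower in each window -- a toy stand-in for a
    Spark Streaming job consuming cellular records from Kafka."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, cell in events:
        windows[ts // window_sec][cell] += 1
    # Key each window by its start time, in order
    return {w * window_sec: dict(counts)
            for w, counts in sorted(windows.items())}

# Simulated cellular signaling events: (epoch seconds, cell tower id)
events = [(0, "cell-A"), (3, "cell-A"), (7, "cell-B"),
          (12, "cell-A"), (15, "cell-B"), (18, "cell-B")]
print(micro_batch_counts(events))
```

The in-memory, windowed aggregation is the part that queues and data lakes struggled with: the counts are available the moment each window closes, not after a batch job over the lake.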

Disaster Relief: A team built a risk assessment tool that forecasts populations at risk of being affected by disasters—typhoons, earthquakes, hurricanes—before they hit. Traditional geographic information system (GIS) platforms aren’t able to process high-resolution map datasets for a large area efficiently. But a distributed map algebra library over Apache Spark makes it a reality. The team implemented a high-throughput Web Map Service (WMS) on top of a Big Data platform to provide mapping services for web and mobile user-facing apps. This puts life-saving capability literally in the hands of humanitarian relief workers in the field: they can get scenario assessment for an entire country in real-time on a mobile device.

Spark is changing the face of innovation in IBM. We want to bring the rest of the world along with us.

Apache Spark lowers the barrier to entry for building analytics applications by reducing the time and complexity of developing analytic workflows. Simply put, it is an application framework for doing highly iterative analysis that scales to large volumes of data. Spark provides a platform to bring application developers, data scientists, and data engineers together in a unified environment that is not resource-intensive and is easy to use. This is what enterprises have been clamoring for.

An open-source, in-memory compute engine, Spark powers a stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application. Business professionals can now complement descriptive analysis—visual dashboards—with intuitive, easy-to-use applications built on Spark that learn from their surroundings and prescribe actions in the moment. Think of it as prescriptive analytics. This means that, with Spark, enterprises can benefit from applications that deploy insights at the front lines of their business exponentially faster than ever before.

Spark is highly complementary to Hadoop. Hadoop makes managing large volumes of data possible for many organizations due to its distributed file system. It has grown to a broad ecosystem of capabilities that span data integration and data discovery. It changed the speed at which data could be collected, and fundamentally changed how we make data available to people.

Spark complements Hadoop by providing an in-memory compute engine to perform non-linear analysis. Hadoop delivers mass quantities of data, fast—but the real value of data can’t always be exposed because there isn’t an engine to push it through. With Spark, there’s a way to understand which data is valuable and which is not. A client can leverage Spark to add to what they are doing with Hadoop or use Spark on a stand-alone basis; the right approach depends on the client’s needs.

While there are many dimensions to the Spark ecosystem, we are most excited by machine learning. Machine learning is better equipped to deal with the modern business environment than traditional statistical approaches, because it can adapt. IBM’s machine learning technology makes expressing algorithms at scale much faster and easier. Our data scientists, mathematicians, and engineers will work with the open source community to help push the boundaries of Spark technology. Our shared goal is to create a new era of smart applications to fuel modern and evolving enterprises.

Applications with machine learning at their core get smarter and more customized through interactions with data, devices and people—and as they learn, they provide previously untapped opportunity. We can take on what may have been seen as unsolvable problems by using all the information that surrounds us and bringing the right insight or suggestion to the right person, exactly at the moment it’s most needed.
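
That “learn through interactions” claim can be made concrete with a toy online learner: instead of being fit once, the model updates its parameters with every new observation. The sketch below uses stochastic gradient descent on a one-feature linear model; the data stream and learning rate are illustrative, not taken from any IBM system.

```python
def sgd_update(w, b, x, y, lr=0.05):
    """One stochastic-gradient step on squared error for y ~ w*x + b.
    Each new observation nudges the parameters, so the model adapts
    as data arrives instead of being trained once and frozen."""
    err = (w * x + b) - y
    return w - lr * err * x, b - lr * err

# Illustrative stream following y = 2x + 1
stream = [(i / 50, 2 * (i / 50) + 1) for i in range(100)]

w, b = 0.0, 0.0
for _ in range(50):              # replay the stream a few times
    for x, y in stream:
        w, b = sgd_update(w, b, x, y)
# w and b drift toward the true slope (2) and intercept (1)
```

The same update-on-arrival loop, run at scale and with richer models, is what lets a Spark-based application keep getting smarter from interactions rather than waiting for a periodic retraining batch.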

Over the next five years, machine learning applications will lead to breakthroughs that will amplify human abilities, assist us in making good choices, look out for us, and help us navigate our world in powerful new ways.

IBM sees Apache Spark as the analytics operating system of the future, and we are investing to grow Spark into a mature platform. We believe it’s the best technology today for attacking the toughest problems of organizations of all sizes and delivering the benefits of intelligence-based, in-time action. Our goal is to be a leading committer and technology contributor in the community. But actions speak louder than words, which brings us to today’s announcements:

  1. IBM is opening a Spark Technology Center in San Francisco. This center will be focused on working with the open source community and with entrepreneurs to provide a scalable, secure, and usable platform for innovation. The Spark Technology Center is a significant investment, designed to grow to over 300 data scientists, developers, and designers and empowered to make substantial and ongoing contributions to the community.
  2. IBM is contributing its industry-leading SystemML technology—a robust algorithm engine for large-scale analytics in any environment—to the Apache Spark movement. This contribution will serve to promote open source innovation and accelerate intelligence into every application. We are proud to be partnering with Databricks to put SystemML to work in the community.
  3. IBM will host Spark on our developer cloud, IBM Bluemix, offering a hosted service as well as system architectures and tools surrounding the core technology to make it easier to consume. Our approach is to accelerate Spark adoption.
  4. IBM will deliver software offerings and solutions built on Spark, provide infrastructure to host Spark applications such as IBM Power and Z Systems, and offer consulting services to help clients build and deploy Spark applications.

IBM is already adopting Spark throughout our business: IBM BigInsights for Apache Hadoop, a Spark service, InfoSphere Streams, DataWorks, and a number of places in IBM Commerce, among many others. And IBM Research currently has over 30 active Spark projects that address technology underneath, inside, and on top of Apache Spark.

Our own analytics platform is designed with just this sort of environment in mind: it easily blends these new technologies and solutions into existing architectures for innovation and outcomes. The IBM Analytics platform is ready-made to take advantage of whatever innovations lie ahead as more and more data scientists around the globe create solutions based on Spark.

We are just at the start of building many solutions that leverage Spark to the advantage of our clients, users, and the developer community.

IBM is among the top contributors to open source in the world. We have over 300 engineers working on Hadoop and Spark, over 100 on Docker, and over 200 on OpenStack. IBM is now and has historically been a significant force supporting open source innovation and collaboration, including a more than $1 billion investment in Linux development. We collaborate on more than 120 projects in the open source community, including Eclipse, Hadoop, Apache Spark, Apache Derby, and Apache Geronimo. IBM is also contributing to Apache Tuscany and Apache Harmony. In terms of code contributions, IBM has contributed 12.5 million lines of code to Eclipse alone, not to mention Linux: 6.3 percent of total Linux contributions are from IBM.

We see in Spark the opportunity to benefit data engineers, data scientists, and application developers by driving significant innovation into the community. As these data practitioners benefit from Spark, the innovation will make its way into business applications, as evidenced in the Genomics, Traffic Planning, and Disaster Relief solutions mentioned above. Intuitive, easy-to-use applications with prescriptive intelligence will unlock the big data scale effect for the enterprise.

Spark is about delivering the analytics operating system of the future—an analytics operating system on which new solutions will thrive. And Spark is about a community of Spark-savvy data scientists and data analysts who can quickly transform today’s problems into tomorrow’s solutions. Spark is the fastest-growing open source project in history. We are pleased to be part of the movement.

