Better Data Science Through Open Source Design

Open Source Design and Apache Zeppelin

The Apache Zeppelin notebook is an open source data analytics tool that emphasizes data visualization and has native integration with Apache Spark. It was created in 2013 by Moon Soo Lee and his co-founder at NFLabs, Sejun Ra. As adoption of Zeppelin grew over the following year, they decided to move it under the Apache Software Foundation and take it global. The popularity of Zeppelin has been growing ever since.

Why Open-Source Design Is Crucial

As a whole, data science tools do not offer the best user experience. The field of data science is growing and evolving at an accelerating pace, and the need for better tools has never been more urgent. We need tools that adapt – tools that not only help data professionals work with data, but also empower them to share insights with non-data professionals. Data is nothing if the meaning is lost in translation. The people who make decisions based on data need to feel they are in familiar territory when trying to understand these insights.

The conversation between data professionals and non-data professionals is one of the major bottlenecks in data science right now. None of the tools we currently see addresses this problem effectively. Data science is a fragmented process. It struggles to keep up with its own momentum. At STC Design, we want to make it whole. We want data to be inclusive and accessible. We feel the path to meeting this goal is in open source. The community behind Apache Zeppelin shares this vision with us, which is what sparked our collaboration on Zeppelin.

People don’t often think of design when they hear “open source”. We’re looking to change this. It’s part of our mission at STC Design. If we create better experiences for open source data consumption, analytics, and the conversation that needs to happen throughout the data science lifecycle, we have a chance to make a big impact. Design is key.

Collaborating On Apache Zeppelin

Apache Zeppelin already has a devoted community of contributors. Its pluggable backend and support for multiple languages within a single notebook encourage a more collaborative experience for data scientists and data engineers. This set the stage for our design enhancements.

We set out on a path to improve collaboration and make Zeppelin more accessible. As we dug in, we began to see huge holes in data science workflows and processes. This kick-started our research efforts. Over the next few months, we fostered relationships with schools like Galvanize and USF. We began building relationships with open source collaborators like Pivotal Open Source Hub. These efforts led to a better understanding of the needs and shortcomings at the front lines of data science. We compared our findings in the startup and open source spaces against research findings from big corporations that had larger, more seasoned data science teams. The combined understanding informed our design proposals for Apache Zeppelin.

A Better User Experience

Our research and user testing helped us pinpoint some crucial pain points data scientists and engineers face in their daily workflow, especially when using notebooks. We focused our design efforts around these pain points, looking to find effective solutions and improve the users’ experience.

Notebooks can be messy. They are often used for exploration, to work through problems and look for correlations in data. Given this method of working, it is extremely common for data scientists to move and rearrange their code, as well as insert new scripts. All of these actions were challenging in Zeppelin. The ability to add and move blocks of code (known as paragraphs in Zeppelin) was hidden inside settings menus, and restricted users to only adding new paragraphs directly below the current paragraph. Additionally, users were limited to moving paragraphs up or down only one position at a time.

One of our first contributions to Apache Zeppelin was allowing users to add paragraphs by simply clicking in the space above or below existing paragraphs. This feature greatly sped up how data scientists could work within the notebook. We’ve begun to pair these UI interactions with keyboard shortcuts, making shortcut users more efficient and productive.
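
To make the interaction concrete, here is a minimal sketch of a notebook modeled as an ordered list of paragraphs. It is not Zeppelin's actual implementation; the `Notebook` class and its method names are hypothetical and exist only to illustrate inserting a paragraph above or below any existing one, and moving a paragraph to a new position in a single step rather than one position at a time.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Notebook:
    """Hypothetical stand-in for a notebook: an ordered list of paragraph texts."""

    paragraphs: List[str] = field(default_factory=list)

    def insert_above(self, index: int, text: str = "") -> None:
        # The new paragraph takes the position of the paragraph at `index`.
        self.paragraphs.insert(index, text)

    def insert_below(self, index: int, text: str = "") -> None:
        # The new paragraph lands directly after the paragraph at `index`.
        self.paragraphs.insert(index + 1, text)

    def move(self, src: int, dst: int) -> None:
        # Relocate a paragraph in one step instead of repeated up/down moves.
        paragraph = self.paragraphs.pop(src)
        self.paragraphs.insert(dst, paragraph)


note = Notebook(["load data", "clean data", "plot"])
note.insert_above(0, "imports")   # click in the space above the first paragraph
note.insert_below(2, "explore")   # click in the space below "clean data"
note.move(4, 1)                   # move "plot" from the end to position 1
print(note.paragraphs)
```

In this toy model, every gap between paragraphs is a valid insertion point, which is the behavior the click-to-add interaction exposes in the UI.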

These improvements were well received and were released in Zeppelin 0.5.6. We are currently working on implementing a drag-and-drop model for arranging paragraphs, displaying paragraphs side by side, and storing code for later use. We are aiming to introduce these features in the 0.6.0 release, along with a complete overhaul of the visual language. The plan is to create a comprehensive design language, including visual components and an extensible style guide. Providing a design language will give developers an easy and clear guide for customizing and adapting Zeppelin to their purpose while maintaining a consistent user experience. We believe this will be a huge asset to the open source community contributing to Apache Zeppelin, and will help instill good design practices.

What’s next for Apache Zeppelin?

We want to make Zeppelin friendly and accessible for everyone, not just data scientists and data engineers. We want business people and decision makers to be able to use Zeppelin to make informed decisions. As we move forward, Zeppelin will become more flexible, more collaborative, and more extensible, further closing the gap between data science and the world around us.


Please join me at Strata + Hadoop World (http://conferences.oreilly.com/strata/hadoop-big-data-ca) in San Jose, March 30. I’ll be giving a flash talk on design thinking with data, for the good of society.

My speaker profile is here: http://conferences.oreilly.com/strata/hadoop-big-data-ca/public/schedule/speaker/239040
