
0 to Life-Changing: Getting Set-up with Spark Clusters

We are so close to finally being able to apply machine learning and deep learning to our awesome, life-changing app...

But as always, wait, there's more!

Now that you have developed your preprocessing code, you need to figure out how to use ALL of your delightful data for your Machine Learning steps.

So how do you do this?

We will be using Spark within the Data Science Experience.

So few technologies to learn and so much time… wait, reverse that.

…to start out let’s focus on Spark. What is it and why is it important? After all, I work in a Spark Technology Center bubble and I don’t want to assume everyone knows about Spark or why they may want to use it (especially for the next step in our Awesome Life-Changing Project).

What you need to know is that Apache Spark has taken the world by storm (not Storm, that’s something else): it is a cluster-computing framework that provides crazy fast processing. It grew out of the Hadoop world (though it is neither a new version of Hadoop nor dependent on it) and generalizes the MapReduce model with in-memory processing, supporting batch jobs, interactive queries and streaming.

Basically, because you cannot download giant data to your computer — like the data we’re working with for our super awesome app using Apache SystemML, which is 7.8 terabytes of breast cancer tissue data (holy Hadoop, that’s a lot of data) — you must rely on a tool like Spark! And really, anything that handles that much data deserves some kind of attention. Beyond speed, Apache Spark is also a better environment for iteration, and its core supports multiple languages, including Java, Scala and Python, as well as SQL querying, data streaming, graph processing and machine learning (though we will be using Apache SystemML for that portion).

Being a UC Berkeley student, I also can’t not mention the fact that it was developed at UC Berkeley’s AMPLab in 2009 and was later open sourced and donated to the Apache Software Foundation.

Since then Apache Spark has completely taken off and you no longer have an excuse to avoid learning it…

So let’s install it if you haven’t already.
Make sure you have homebrew:

# OS X:
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"  

You also need Java and Scala.

brew tap caskroom/cask  
brew install caskroom/cask/java  
brew install scala  

While you're at it, install some other things you'll need later on (if you haven't already).

brew install python3  
pip3 install jupyter matplotlib numpy  

Install Spark.

wget https://archive.apache.org/dist/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz  
tar -xzf spark-1.6.0-bin-hadoop2.6.tgz  

Ok, you've now installed Spark, but you're probably still wondering: why is Spark so fast? And why am I using it?

Basically, Spark is so fast because it keeps data in memory across the cluster and optimizes your computations, grouping a whole pipeline of steps into a single pass, which makes it ideal for machine learning and analytics — the focus of our Breast Cancer Project. To use Spark clusters, you need a cluster manager: typical options are Spark's Standalone mode, YARN or Mesos. You can also use tools like IBM’s Data Science Experience (in beta), Amazon Web Services or Databricks to start up a Spark cluster. There are tons of tutorials out there for most of these, but I'll go ahead and walk you through the Data Science Experience because it is so new (and so helpful).

Not detailed enough? Here is a more technical blog on how to build clusters and a more detailed explanation on the significance of Spark.

First, head to the DSX website.
You can play around and see what resources they have, but if you're in a rush, go ahead and click on "Sign Up For a Free Trial". There you will create your Bluemix id.

Once you log in, you'll be taken to a landing page called "Community". Definitely look around here: you can access technical articles and tutorials, datasets, and notebooks on various topics such as digit classification.

To start a new project — in this case, a notebook — click on the top left corner. A menu with several selections will open; choose "New" and then "Notebook". Fill in some info on what notebook you're working on, so you can find it later. Here you will also select what kind of Spark service you want, i.e. the size of the Spark instance that will automatically be created for you when you start the notebook (you can update this later under your projects, in your profile).

Once you're in your new notebook, you can connect your data within the same platform, load SystemML like we did in this tutorial, and write all of your machine learning code right in the notebook (and then share it easily with collaborators).

If you don't want to use DSX, I recommend using Apache Spark's documentation, which is actually pretty amazingly clear in walking you through all of this set up.

You can also continue to test portions of your data locally with a Jupyter or Zeppelin notebook. This is what my team and I did initially to write our algorithms; we only started using a Spark cluster when we wanted to apply them to our ENTIRE ginormous data set.
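A cheap way to do that local iteration is to carve off a small random sample of the data and develop against it, saving the cluster for the full run. A plain-Python sketch — the file names and the 1% fraction are just placeholders, not from the real project:

```python
import random

def sample_lines(src_path, dst_path, fraction=0.01, seed=42):
    """Copy the header plus roughly `fraction` of the rows to a small file."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    kept = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.write(src.readline())  # keep the header row
        for line in src:
            if rng.random() < fraction:
                dst.write(line)
                kept += 1
    return kept

# e.g. sample_lines("huge_dataset.csv", "sample.csv", fraction=0.01)
```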

As a reminder, here's how to install and start Jupyter notebook:

#Upgrade Pip
pip3 install --upgrade pip  
#Install Jupyter
pip3 install jupyter  

*Make sure you have installed Python 3 before doing this step.

Now, open a terminal, cd into the folder where your project is (or will be) stored, and type the following:

jupyter notebook  

After pressing enter, a notebook should start up in your browser. Leave the terminal tab open and running, and move to the browser. Here you can iteratively write your code against the data you have stored on your computer, before using Spark and SystemML. This way you can test and run everything before taking the time to train on your whole dataset!
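A reasonable first cell in that notebook is just a sanity check on a local slice of your data, using the numpy you installed earlier. The array below is stand-in data, not the real tissue set:

```python
import numpy as np

# Stand-in for a local slice of your data: 100 samples, 4 features
np.random.seed(0)  # reproducible fake data
data = np.random.randn(100, 4)

# Quick sanity checks before any heavy lifting
print(data.shape)         # (100, 4)
print(data.mean(axis=0))  # per-feature means; near 0 for this fake data
```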

Once that part of your project is complete (and my team's competition due date passes on October 3) I will FINALLY be able to share with you what my team and I have been working on for our super awesome life-changing application of Apache SystemML!

So stay tuned, because very soon the curtain will be pulled back and I will finally be able to walk you through the specifics of applying further preprocessing, machine learning and deep learning to your project, as well as how to troubleshoot problems that may come up along the way.

I can already feel us changing the world.

Stay tuned for more!
By Madison J. Myers

