0 to Life-Changing App: Viewing Images with Python 3 and OpenSlide

500 Images at around 7GB each. That's a lot of data.

Not sure what I am referring to? Let me fill you in.

Not too long ago I was assigned the project of my dreams here at IBM's Spark Technology Center: create a super life-changing application that incorporates Apache Spark and Apache SystemML. If you've been following along, you'll know that I am barely a year into my data science career and between my internship here at the STC and my masters degree at UC Berkeley, I have met with a steep learning curve. Because of this, I've decided to blog about every step along the way! That way every data enthusiast and fellow data scientist can follow along and build their own life-changing app. (After all, we might as well crowd source saving the world.)

My last blog post was a tutorial on how to use the new SystemML API on the Spark Shell, but before that, I looked at the frustrating step of finding big, open data. On this quest for delightful data, my team and I came across a breast cancer research competition that was an ideal use case for SystemML and Spark. I mean, it was BIG data, life-changing, and interesting. What's not to love? Let me elaborate. After entering the competition, we were given 500 digital images of breast cancer tissue on medical slides, taken from a microscope. Considering that these are huge slides (approx. 7GB each) captured at 20-40x zoom, measuring 50,000 to 100,000 pixels in each direction, we can safely say we are dealing with really big data! Because of the size, it is an excellent challenge for Apache Spark and Apache SystemML, and our goal will be to develop an automatic way, or a SystemML solution, to determine the grade of cancer in any given tissue image. In order to solve this problem, we will need to use deep learning and neural networks, but first, we have to clean up our data. That's what this blog is for!

While in this pre-processing stage, I've been able to learn about some ravishing resources for viewing large images: OpenSlide and Deep Zoom. Because of this, I'll first walk you through how to set up and use these tools. After that, we will go ahead and get started on some pre-processing steps! If you don't have access to images of your own, try this source.

First, update Homebrew.

brew update  
brew upgrade  

Install Python 3.

brew install python3  

Install the Python packages.

pip3 install -U matplotlib numpy pandas scipy jupyter scikit-learn scikit-image flask  

Install OpenSlide.

brew install openslide  
pip3 install openslide-python  

Now, create a new folder and work from there. I named mine AwesomeProject/.

*Note: Check where you installed SystemML in my first tutorial.
*Note #2: If you don't have tissue images lying around, use this source. Download the .svs files.

#Download a few images to get started. 
#Place them in a new folder within AwesomeProject/.
#I called mine data/.
#Start your Jupyter notebook.
PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark --master local[*] --driver-class-path $SYSTEMML_HOME/SystemML.jar  
#Leave this tab running and Jupyter open in your browser.
#We will come back to it later.

Make sure your Jupyter notebook starts up with Python 3 shown in the right-hand corner. If it doesn't show up automatically, go to Kernel -> Change Kernel -> Python 3. If that doesn't work, double-check that Python 3 is the version Jupyter is actually using.

Now, in a new terminal tab, go into your AwesomeProject/ folder. You'll need to clone the openslide-python repository and run its Deep Zoom example server, pointing it at your data/ folder. Don't know git? Here's a great tutorial.

git clone https://github.com/openslide/openslide-python.git  
python3 openslide-python/examples/deepzoom/deepzoom_multiserver.py data/  

Now you need to open the Deep Zoom viewer in your browser.

#After you press enter, your terminal should say:
#* Running on http://address
#Copy that http address and paste it into your browser.

Now you should have two terminal tabs occupied: one by Jupyter and one by OpenSlide. Leave both of them running. When you open the Deep Zoom viewer in your browser, you should see a list of the image files in your data/ folder.

Click on one of the images to see it.
OpenSlide Image

Once you are viewing the image you can use your mouse or track pad to zoom in and out.
OpenSlide DeepZoom

Congrats! You've now looked at all of that tissue using OpenSlide and Python 3. Now, let's do our first pre-processing step using Jupyter.

Navigate back to your Jupyter notebook, which should still be open in your browser. Remember, we are still in a bit of an exploratory phase, so our aim is to look at example tiles and tweak our processing on them before applying it to an entire slide, and most definitely before applying it to all 500 slides.

Our first step is to load everything we need.

%load_ext autoreload
%autoreload 2
%matplotlib inline

# Add SystemML PySpark API file.

from glob import glob

import matplotlib.pyplot as plt  
import matplotlib as mpl  
import numpy as np  
import openslide  
from openslide import open_slide  
from openslide.deepzoom import DeepZoomGenerator  
import pandas as pd

from scipy.ndimage.morphology import binary_fill_holes, binary_closing, binary_dilation  
from skimage.color import rgb2gray  
from skimage.morphology import closing, binary_closing, disk, remove_small_holes, dilation, remove_small_objects  
from skimage import color, morphology, filters, exposure, feature

plt.rcParams['figure.figsize'] = (10, 6)  

Now we can choose the slide we want to work with.

#Start by getting your images from your data/ folder.
files = glob("data/*.svs")  
#Specify which image/slide it is. For this example I will
#use slide 7.
slide_num = 7  
slide = open_slide(files[slide_num-1])  

Now we will generate tiles or, in other words, slice the image up into smaller squares. This will help us look at the image in more detail and will also help us process the content later. We want to do this because we can't process an entire 7GB image at once; instead, we process it tile by tile.
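To get a feel for the numbers before we call DeepZoom, here's a quick back-of-the-envelope sketch. The slide dimensions are a hypothetical example (one of the larger slides mentioned above); only the arithmetic is the point.

```python
import math

# Assumed dimensions for a large slide (hypothetical example values).
width, height = 100000, 80000
tile_size = 1024

# DeepZoom slices the slide into a grid of tile_size x tile_size squares,
# with partial tiles along the right and bottom edges.
cols = math.ceil(width / tile_size)
rows = math.ceil(height / tile_size)

print(cols, rows, cols * rows)  # 98 79 7742
```

So a single slide yields thousands of tiles at full resolution, which is why we explore on a few example tiles first.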

tile_size = 1024  
#overlap would add pixels to each side of a tile; we leave it at 0.
tiles = DeepZoomGenerator(slide, tile_size=tile_size, overlap=0, limit_bounds=False)  
#See how many tiles there are at each level of magnification.
print(tiles.level_tiles)
#Choose a tile you want to look at. You can change around 
#the coordinates to get the tile you are looking for.
#This is where OpenSlide helps.
tile = tiles.get_tile(tiles.level_count-1, (85, 35))  
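Note that get_tile returns a PIL image. If you want to poke at the pixels with NumPy (which we'll need for pre-processing), the conversion is a single call. A minimal sketch, using a synthetic stand-in tile so you can run it without a slide on hand:

```python
import numpy as np
from PIL import Image

# Stand-in for tiles.get_tile(...): a solid pink 1024x1024 RGB tile.
tile = Image.new("RGB", (1024, 1024), (220, 170, 200))

# Convert the PIL image to a NumPy array of shape (height, width, 3).
arr = np.asarray(tile)
print(arr.shape, arr.dtype)  # (1024, 1024, 3) uint8
```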

Below are examples of what I did.

Look at you! You have generated your tiles and visualized some examples! You are now officially an expert at OpenSlide after looking at images of tissue, loading your images, and visualizing some example tiles.
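As a small taste of the pre-processing ahead: stained tissue tends to be darker than the white slide background, so a simple grayscale threshold can estimate how much of a tile is actually tissue. Here's a NumPy-only sketch on a synthetic tile; the 0.9 threshold is an assumption you would tune on real slides.

```python
import numpy as np

# Synthetic 1024x1024 RGB tile: white background with a darker "tissue" patch.
tile = np.full((1024, 1024, 3), 255, dtype=np.uint8)
tile[200:600, 300:700] = (180, 120, 160)  # pinkish stained region

# Luminosity-weighted grayscale, scaled to [0, 1].
gray = (0.2125 * tile[..., 0] + 0.7154 * tile[..., 1] + 0.0721 * tile[..., 2]) / 255.0

# Pixels darker than the threshold count as tissue.
threshold = 0.9
tissue_fraction = float((gray < threshold).mean())
print(round(tissue_fraction, 3))  # 0.153
```

A check like this lets you skip tiles that are mostly background before any expensive processing.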

Next up will be further pre-processing steps and exploration. Once we have finished our pre-processing on example tiles, we will be able to apply it to all of our slides and use our Spark cluster. This will be followed by our fancy SystemML steps.

It seems we are well on our way to changing lives.

Stay tuned for more!
By Madison J. Myers

