Q & A with New Spark Committer Holden Karau

Newly appointed (anointed?) Apache Spark committer Holden Karau isn't resting on her laurels. See her talk this Thursday at Spark Summit East where she'll be presenting "a monster identification guide for Spark exceptions" in an unusual way ...

Congratulations! I know what your becoming a committer means to the Spark community: we now have, in a position of greater influence, a passionate educator and writer, someone who can both see the big picture of where Spark needs to go and sit down and code it on an international flight. But what does becoming a committer mean to you, in terms of what you can now work on or do for the project or the community that you could not do before, or to you personally? Where were you when you found out, and how did it make you feel?

Thanks, I really appreciate it. For me, being a committer means that I can help move PySpark forward faster, and there are a lot of exciting things happening around that (Bryan Cutler here at the STC has some interesting WIP stuff built around Arrow that looks very promising, although it is still in the early stages). It also means a shift to doing more reviewing.

I was coming back from a lunch handstands class at Vertical Barre in San Francisco when I found out, and I was so excited I just put in my headphones and went for a very long scooter ride.

What do you think is the most important thing you can do as a Committer?

I think mentoring new contributors. It's certainly possible to do as an experienced non-committer, but as a committer it’s easier to follow through and make sure people don’t drop off out of frustration during the review process. This of course is going to require a lot more dedication to reviews, but I’m trying to set aside some time at the start of every day (with the exception of right before Spark Summit, of course).


Your new book, High Performance Spark, is almost a primer for getting the most out of Spark at scale. Has working at IBM where we implement Spark solutions with so many large enterprise clients afforded you a different view into this than you had before, or is it more, scale is scale, and you work at the puzzle from the code end, regardless of environment?

I've certainly been exposed to different types of scaling problems than I have in the past, and some of those do make it into the book. But for the most part we've tried to stick with sort of the common denominators of scaling problems with Spark, so the things which show up regardless of the specific domain.

We make a big effort to protect — almost wall off — STC engineers so that you can work with "open source mind" in the interests of the community and the project. Has that been successful, as far as your work is concerned?

I think it has. It's a difficult balance to strike — especially at a company as large as IBM. I think both of the leads we’ve had for the STC (Steve Beier and Vijay Bommireddipalli) have done a great job allowing the STC to develop without getting pulled into too many internal projects.

That said, what is the most exciting project you've seen on the commercial side of IBM, or who are the one or two people doing work that excites you the most?

I don't think the most exciting thing I know about has a public name yet, so I'll just say that J.P.'s team seems to have some really interesting plans that I'm excited about. And on my way back from FOSDEM, I randomly ran into some wonderful people on our JVM team.

What do you think would be the greatest contribution the STC can make?

That’s a good question. I think that, much like Databricks’ experience with running large Spark deployments in the cloud, we can bring feedback and fixes for problems that are only really seen in the types of organizations IBM has the opportunity to work with. Striking the right balance here is going to be important, though.

You owned a company after graduating from the University of Waterloo. That gives you something in common with another STC committer, Nick Pentreath. Is "Founder" a title you'd ever like to have again? If yes, can you envisage what type of company it would be?

I'm not really sure about that. I think I like working with other developers a bit too much to go back to wearing all of the different hats typically required of a founder, but if I did, it would probably be a developer-tools-focused company. Although I hope one with paying customers, since living in San Francisco is expensive!

You contribute to Spark particularly fulsomely in Spark ML and PySpark. Can you pick a favorite of the two, or do you love them both, but differently? What is the appeal and the opportunity of each of those two avenues for you?

So PySpark is my favorite, and it's because of the wonderful PyData community. Python has done an excellent job building a wonderful community and I'm happy to be able to be a part of it through my work on PySpark & talks at conferences (like PyData in Amsterdam).

What do you think the future holds for PySpark performance?

Performance-wise, I think the big challenge has always been the memory barrier between the JVM and Python, and I think we are going to make progress on improving that situation. DataFrames are a great start at pushing computation into the JVM, and I alluded to one of the potential options for improving this in my first answer, with Arrow, but we could see similar things happening with making Tungsten understandable from other languages.

What’s the one piece of the API you would get rid of if you could?

If we could get rid of groupByKey in the RDD API without breaking perfectly reasonable use cases, I would. It’s one of those things where the users’ expectations and the developers’ expectations really don’t match, and this has caused no end of headaches for people.
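One way to see the mismatch is to count what crosses the shuffle. The sketch below is a plain-Python model, not Spark code: the input partitions and the helper names `shuffle_group_by_key` and `shuffle_reduce_by_key` are hypothetical, chosen only to illustrate why a per-key sum built on groupByKey moves every record, while a reduceByKey-style map-side combine moves at most one record per key per partition.

```python
from collections import defaultdict

# Hypothetical input: two partitions of (key, value) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]


def shuffle_group_by_key(parts):
    """Model of groupByKey followed by a sum: every record is shuffled."""
    # No map-side combine: all pairs leave their partition unchanged.
    shuffled = [pair for part in parts for pair in part]
    groups = defaultdict(list)
    for key, value in shuffled:
        groups[key].append(value)  # all values for a key are held together
    totals = {key: sum(values) for key, values in groups.items()}
    return totals, len(shuffled)


def shuffle_reduce_by_key(parts, func):
    """Model of reduceByKey: combine inside each partition, then shuffle."""
    shuffled = []
    for part in parts:
        combined = {}
        for key, value in part:
            combined[key] = func(combined[key], value) if key in combined else value
        shuffled.extend(combined.items())  # one record per key per partition
    totals = {}
    for key, value in shuffled:
        totals[key] = func(totals[key], value) if key in totals else value
    return totals, len(shuffled)


grouped_totals, grouped_records = shuffle_group_by_key(partitions)
reduced_totals, reduced_records = shuffle_reduce_by_key(partitions, lambda x, y: x + y)
```

In this model both approaches produce the same totals, but the first moves all seven records and the second moves four. Grouping is still the right tool when every value for a key is genuinely needed, which is presumably why removing it would break reasonable use cases.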


You studied CS and Math at the University of Waterloo. Why did CS win? I can so easily see you as a mathematician, a professor — you have absolute fluency both in the abstract language of systems and in explaining them in plain English.

That's a hard one, there are so many different factors. I think I didn't see myself doing anything “new” in mathematics, whereas I thought I could do something really exciting with CS. It felt like I could make new different things that no one else was going to do otherwise and that just captured my imagination.

Which bits of Canada do you miss? There are so many different childhoods to have in Canada. Mine involved beaches, white gloves, and bears. Did you toboggan growing up, or perhaps skate along the Rideau Canal, munching on Beaver Tails? Did your family take you to hockey games growing up, or to Cirque du Soleil, or to the symphony? And where did your interest in math come from? Family? A teacher? Sui generis?

I remember Beaver Tails with a certain fondness. I think I miss Canada's approach to multiculturalism the most (followed, of course, by Timbits and good poutine). Not that it's perfect, but I miss the communities that flourished. My Canadian childhood involved downhill skiing (I was too lazy for cross-country), a lot of the national orchestra, coffee shops, and a surprising amount of "happy hardcore" (a dance music genre).

I think my interest in math first came from a close friend of my mother's, who, while visiting us, took it upon herself to teach me some different cool things (I was quite young at the time). As someone who was having difficulty academically (I have dysgraphia, so writing by hand is challenging), finding something that I could do was really exciting. My next-door neighbors were also engineers (one of them works at IBM Canada now), which certainly helped my love of math grow.

In your author bio for High Performance Spark, you list your passions outside of software as playing with fire, welding, scooters, poutine, and dancing. Please rank them.

Dancing, scooters, poutine, fire, and welding would be the current order. I’ve had to take a break from playing with fire and welding while I recover from some personal stuff, but I’m hoping to get back to those two this year.

Is there anything else you get up to outside of software besides fire and scooters?

Since I’m trans, I try to do my (small) part to help move that forward a bit, from helping to get the transport identity screening regulations in Canada improved to organizing a small trans-focused queer scooter gang in San Francisco. If you want to join, get in touch with me, even if you don’t own a scooter: over 50% of us ride on rental “scoots,” and we have one motorcycle member too.

To reach Holden about scaling Spark, scooters, or anything else, find her on Twitter @holdenkarau.

