data-frack: Visual Mnemonics for the PySpark API

Monday, January 19, 2015

Visual Mnemonics for the PySpark API

Below is a short description of an open-source project I created called 'pyspark-pictures', a collection of visual mnemonics and code examples for the PySpark API. If you haven't seen it yet, I recommend taking a quick look at the static version on NBViewer first, because a picture is worth a thousand words.

I've been learning Apache Spark lately. Spark is a distributed computing framework that is faster and more expressive than Hadoop MapReduce. MapReduce is like a Swiss Army knife with a minimal set of tools: it gets the job done but can be awkward for certain jobs. Spark is like a knife with more tools, so you can pick the best tool for the job with minimal effort.

So what is the best way to learn all these new tools? In some cases it's easy. Functions like 'sum()', 'mean()', and 'distinct()' are immediately obvious from their names. Others are less obvious (e.g. 'flatMap()', 'glom()', 'fold()', 'combineByKey()'), especially if you are not familiar with the jargon of functional programming.

When I was learning Spark, I found it useful to create visual mnemonics for each function so I could visualize the core operation and distinguish it from other similar functions. Below are a few examples of what I ended up with. The full set is available here as an IPython notebook. In addition to the downloadable source code, a static version of the notebook is available here on NBViewer.

Each mnemonic uses blue rectangles to represent elements in the original RDD (resilient distributed dataset). The left side represents the input RDD and the right side represents the output. Elements in the output RDD may contain original elements (blue), elements with potentially different values (purple), or elements with potentially different data types (orange). In some cases, the output is a Python object on the driver (dotted rectangle). When key-value pairs are critical to the operation, the 'key' is represented by a black square in the upper left corner, and the 'value' is represented by the remaining rectangle. User-defined functions are represented by a green rectangle. When relevant to the core operation, partitions are represented by diagonal lines to the left of the RDD.

A visual mnemonic for 'map', the workhorse of distributed computing. Imagine the user function (green rectangle) iterating over all the elements in the original RDD (left). Original elements (blue) are potentially converted to a different data type (orange) by the user function. A new RDD (right) is generated from the transformed elements.
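The 'map' picture can also be sketched in a few lines of plain Python. This is only a local stand-in, not code from the notebook: a list plays the role of the RDD, and `to_label` is a hypothetical user function chosen for illustration. The PySpark call in the trailing comment assumes an existing SparkContext named `sc`.

```python
# Local stand-in for RDD.map: a plain list plays the role of the RDD.
elements = [1, 2, 3]                      # "blue" input elements (ints)


def to_label(x):
    """Hypothetical user function (the "green" rectangle)."""
    return "item-" + str(x)               # output has a different data type (str)


# One output element is produced for every input element.
mapped = [to_label(x) for x in elements]
print(mapped)  # ['item-1', 'item-2', 'item-3']

# Assumed PySpark equivalent, given an existing SparkContext `sc`:
#   sc.parallelize(elements).map(to_label).collect()
```

The point to take away is that the number of elements is unchanged; only their values or types differ.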

'flatMap' is similar to 'map', except that the final output is flattened. This means that iterable objects returned by the user-defined function are decomposed into their individual elements in the final output. In the example below, the user function returns a tuple of three elements. Therefore, there are 9 elements in the output, 3 for each of the original elements.
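The difference between 'map' and 'flatMap' can be sketched locally as well. As before, this is an illustration under assumptions rather than code from the notebook: a plain list stands in for the RDD, `expand` is a hypothetical user function that returns a tuple of three elements, and the PySpark calls in the comments assume an existing SparkContext named `sc`.

```python
from itertools import chain

elements = [1, 2, 3]                      # stand-in for the original RDD


def expand(x):
    """Hypothetical user function returning a tuple of three elements."""
    return (x, x * 10, x * 100)


# map keeps each returned tuple as a single element: 3 elements out.
mapped = [expand(x) for x in elements]

# flatMap decomposes each returned tuple into its elements: 9 elements out.
flat = list(chain.from_iterable(expand(x) for x in elements))

print(len(mapped), len(flat))  # 3 9
print(flat)  # [1, 10, 100, 2, 20, 200, 3, 30, 300]

# Assumed PySpark equivalents, given an existing SparkContext `sc`:
#   sc.parallelize(elements).map(expand).collect()
#   sc.parallelize(elements).flatMap(expand).collect()
```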

I hope some people find this project useful, either as a learning tool or perhaps as a reference to complement the Spark API docs.

17 comments:

UFO, June 25, 2015 at 4:22 AM
Thanks.
The best cheat sheet for PySpark.
In the notebook the pictures are not visible, and it complains about the v3 version instead of the v4.
Python 2.7.6, IPython 3.1.0, in a prepackaged VM with Spark (from the edX MOOC)

Unknown, December 20, 2015 at 4:11 AM
Awesome Work Jeff :-)

Tanveer, July 26, 2016 at 8:17 PM
Thank you so much !! Wonderful stuff !!

Unknown, October 24, 2016 at 11:57 PM
Very Nice representation

Unknown, November 10, 2016 at 5:08 AM
Very nice, but why Comic Sans?

Venkatjava, December 13, 2016 at 9:56 AM
Wonderful Reference doc. Thanks

Unknown, June 27, 2017 at 11:39 AM
As someone that is new to Apache Spark this guide is absolutely phenomenal for understanding exactly what each transformation / action does. Thank you so much

abhi, September 15, 2017 at 8:24 AM
WoW that was awesome

Unknown, February 1, 2018 at 12:45 AM
Thank u Sir :)
May i have ur gmail id plz. :)

Tejuteju, April 26, 2018 at 4:44 AM
It is a very excellent blog and useful article, thank you for sharing with us. Keep posting.

Anonymous, July 29, 2019 at 12:05 PM
Amazing

Jeff Thompson, June 26, 2015 at 6:03 PM

Thanks for the feedback. I created an issue for this on GitHub: https://github.com/jkthompson/pyspark-pictures/issues/1
I will try to reproduce it using the edX VM when I get a chance.

Jeff Thompson, June 28, 2015 at 9:51 PM

I think the solution is to copy the entire pyspark-pictures project folder into the VM. If you only upload the .ipynb file using the notebook, it breaks the HTML. I updated the issue on GitHub and provided more details there on how to do this. Hope this helps.

https://github.com/jkthompson/pyspark-pictures/issues/1