I've been learning Apache Spark lately. Spark is a distributed computing framework that is faster and more expressive than Hadoop MapReduce. MapReduce is like a Swiss army knife with only a few tools: it gets the job done but can be awkward for certain jobs. Spark is like a knife with more tools, so you can pick the best tool for the job with minimal effort.
So what is the best way to learn all these new tools? In some cases it's easy. Functions like 'sum()', 'mean()', and 'distinct()' are immediately obvious from their names. Others are less obvious (e.g. 'flatMap()', 'glom()', 'fold()', 'combineByKey()'), especially if you are not familiar with the jargon of functional programming.
When I was learning Spark, I found it useful to create visual mnemonics for each function so I could visualize the core operation and distinguish it from other similar functions. Below are a few examples of what I ended up with. The full set is available here as an IPython notebook. Besides the downloadable source code, a static version of the notebook is available here on NBViewer.
Each mnemonic uses blue rectangles to represent elements in the original RDD (resilient distributed dataset). The left side represents the input RDD and the right side represents the output. Elements in the output RDD may contain original elements (blue), elements with potentially different values (purple), or elements with potentially different data types (orange). In some cases, the output is a Python object on the driver (dotted rectangle). When key-value pairs are critical to the operation, the 'key' is represented by a black square in the upper left corner, and the 'value' is represented by the remaining rectangle. User-defined functions are represented by a green rectangle. When relevant to the core operation, partitions are represented by diagonal lines to the left of the RDD.
A visual mnemonic for 'map', the workhorse of distributed computing. Imagine the user function (green rectangle) iterating over all the elements in the original RDD (left). Original elements (blue) are potentially converted to a different data type (orange) by the user function. A new RDD (right) is generated from the transformed elements.
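To make the picture concrete, here is a minimal plain-Python sketch of the same idea. It does not use Spark: `local_map`, `original`, and `to_pair` are hypothetical names chosen for illustration, and the list stands in for an RDD that would normally be spread across partitions. The last comment shows the rough PySpark equivalent, assuming a SparkContext called `sc` already exists.

```python
# Plain-Python stand-in for RDD.map(); no Spark installation is needed.
def local_map(func, elements):
    """Apply func to every element, producing exactly one output per input."""
    return [func(x) for x in elements]

original = [1, 2, 3]               # the input elements (blue in the mnemonic)
to_pair = lambda x: (x, x ** 2)    # the user function (green in the mnemonic)

# Each int becomes a tuple, i.e. a different data type (orange in the mnemonic).
mapped = local_map(to_pair, original)
print(mapped)  # [(1, 1), (2, 4), (3, 9)]

# Rough PySpark equivalent, assuming an existing SparkContext named sc:
# sc.parallelize(original).map(to_pair).collect()
```

The output always has the same number of elements as the input, which is the property that separates 'map' from 'flatMap'.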
'flatMap' is similar to 'map', except that the final output is flattened. This means that iterable objects returned by the user-defined function are decomposed into their individual elements in the final output. In the example below, the user function returns a tuple of three elements. Therefore, there are 9 elements in the output, 3 for each of the original elements.
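The flattening step can be sketched in plain Python as follows. This is an illustration under assumptions, not Spark code: `local_flat_map`, `original`, and `triple` are hypothetical names, and the user function returning a three-element tuple is chosen to match the description above. The last comment shows the rough PySpark equivalent, assuming a SparkContext called `sc` already exists.

```python
from itertools import chain

# Plain-Python stand-in for RDD.flatMap(); no Spark installation is needed.
def local_flat_map(func, elements):
    """Apply func to every element, then flatten the returned iterables."""
    return list(chain.from_iterable(func(x) for x in elements))

original = [1, 2, 3]
triple = lambda x: (x, 100 * x, x ** 2)   # returns a tuple of three elements

# With map semantics there is one tuple per input element.
nested = [triple(x) for x in original]
print(nested)     # [(1, 100, 1), (2, 200, 4), (3, 300, 9)]

# With flatMap semantics the tuples are decomposed into their elements.
flattened = local_flat_map(triple, original)
print(flattened)  # [1, 100, 1, 2, 200, 4, 3, 300, 9]

# Rough PySpark equivalent, assuming an existing SparkContext named sc:
# sc.parallelize(original).flatMap(triple).collect()
```

Three inputs times three returned values gives the 9 output elements mentioned above.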
I hope some people find this project useful, either as a learning tool or perhaps as a reference to complement the Spark API docs.
Thanks.
The best cheat sheet of pyspark.
In the notebook the pictures are not visible, and it complains about the v3 version instead of v4.
Python 2.7.6 , IPython 3.1.0 , in a prepackaged VM with Spark (from Edx mooc)
Thanks for the feedback. I created an issue for this on GitHub: https://github.com/jkthompson/pyspark-pictures/issues/1
Will try to reproduce using the Edx VM when I get a chance.
I think the solution is to copy the entire pyspark-pictures project folder into the VM. If you only upload the .ipynb file using the notebook, it breaks the HTML. I updated the issue on GitHub and provided more details there on how to do this. Hope this helps.
https://github.com/jkthompson/pyspark-pictures/issues/1
Awesome Work Jeff :-)
Thanks Vishal
Thank you so much !! Wonderful stuff !!
Very Nice representation
Very nice, but why Comic Sans?
Wonderful Reference doc. Thanks
As someone that is new to Apache Spark this guide is absolutely phenomenal for understanding exactly what each transformation / action does. Thank you so much
WoW that was awesome
Thank u Sir :)
May i have ur gmail id plz. :)
It is a very excellent blog and useful article. Thank you for sharing with us, keep posting.
Amazing
Very nice post, keep sharing more posts with us.
Thank you for your info...