data-frack: January 2015

Below is a short description of an open source project I created called 'pyspark-pictures', a collection of visual mnemonics and code examples for the PySpark API. If you haven't seen it yet, I recommend taking a quick look at the static version on NBViewer first, because a picture is worth a thousand words.

I've been learning Apache Spark lately. Spark is a distributed computing framework that is faster and more expressive than Hadoop MapReduce. MapReduce is like a Swiss army knife with a minimum number of tools: it gets the job done but can be awkward for certain jobs. Spark is like a knife with more tools, so you can select the best tool for the job with minimum effort.

So what is the best way to learn all these new tools? In some cases it's easy. Functions like 'sum()', 'mean()', and 'distinct()' are immediately obvious from their names. But others are less obvious (e.g. 'flatMap()', 'glom()', 'fold()', 'combineByKey()'), especially if you are not familiar with the jargon of functional programming.

When I was learning Spark, I found it useful to create visual mnemonics for each function so I could visualize the core operation and distinguish it from other similar functions. Below are a few examples of what I ended up with. The full set is available here as an IPython notebook. If you would rather not download the source code, a static version of the notebook is available here on NBViewer.

Each mnemonic uses blue rectangles to represent elements in the original RDD (resilient distributed dataset). The left side represents the input RDD and the right side represents the output. Elements in the output RDD may be original elements (blue), elements with potentially different values (purple), or elements with potentially different data types (orange). In some cases, the output is a Python object on the driver (dotted rectangle). When key-value pairs are critical to the operation, the 'key' is represented by a black square in the upper left corner, and the 'value' is represented by the remaining rectangle. User-defined functions are represented by a green rectangle. When relevant to the core operation, partitions are represented by diagonal lines to the left of the RDD.

A visual mnemonic for 'map', the workhorse of distributed computing. Imagine the user function (green rectangle) iterating over all the elements in the original RDD (left). Original elements (blue) are potentially converted to a different data type (orange) by the user function. A new RDD (right) is generated from the transformed elements.
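For readers who prefer code to pictures, here is a rough plain-Python sketch of the same idea. It is not Spark itself: the helper name 'rdd_map' and the sample values are my own stand-ins, and an ordinary list plays the role of the RDD. With a real SparkContext named 'sc', the equivalent call would be 'sc.parallelize([1, 2, 3]).map(lambda x: str(x * 10)).collect()'.

```python
def rdd_map(elements, user_func):
    """Plain-Python stand-in for RDD.map (hypothetical helper, not part of PySpark).

    Applies user_func to each element and returns a new list with exactly
    one output element per input element.
    """
    return [user_func(x) for x in elements]


# The "original RDD": three integer elements (the blue rectangles).
original = [1, 2, 3]

# The user function (the green rectangle) changes both the value and the type.
mapped = rdd_map(original, lambda x: str(x * 10))

print(mapped)    # ['10', '20', '30']
print(original)  # [1, 2, 3]  (the input is left unchanged)
```

Note that the output has the same number of elements as the input, and that the input is not modified. A new collection is produced instead, just as 'map' produces a new RDD.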

'flatMap' is similar to 'map', except that the final output is flattened. This means that iterable objects returned by the user-defined function are decomposed into their individual elements in the final output. In the example below, the user function returns a tuple of three elements. Therefore, there are 9 elements in the output, 3 for each of the original elements.
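The flattening step can also be sketched in plain Python. As before, this is only an illustration under my own assumptions: 'rdd_flat_map' is a made-up helper, a list stands in for the RDD, and the user function is an arbitrary one that returns a tuple of three elements. With a real SparkContext named 'sc', the equivalent call would be 'sc.parallelize([1, 2, 3]).flatMap(lambda x: (x, x * 100, x + 42)).collect()'.

```python
from itertools import chain


def rdd_flat_map(elements, user_func):
    """Plain-Python stand-in for RDD.flatMap (hypothetical helper, not part of PySpark).

    Applies user_func to each element, then flattens the returned iterables
    by one level into a single list.
    """
    return list(chain.from_iterable(user_func(x) for x in elements))


def user_func(x):
    # Arbitrary example function: returns a tuple of three elements.
    return (x, x * 100, x + 42)


original = [1, 2, 3]

# What 'map' would give: three tuples, one per original element.
nested = [user_func(x) for x in original]

# What 'flatMap' gives: the tuples are decomposed into nine separate elements.
flat = rdd_flat_map(original, user_func)

print(nested)     # [(1, 100, 43), (2, 200, 44), (3, 300, 45)]
print(flat)       # [1, 100, 43, 2, 200, 44, 3, 300, 45]
print(len(flat))  # 9
```

Comparing 'nested' with 'flat' shows the only difference between the two operations: 'map' keeps each returned tuple as a single element, while 'flatMap' unpacks it.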

I hope some people find this project useful, either as a learning tool or perhaps as a reference to complement the Spark API docs.

Monday, January 19, 2015

Visual Mnemonics for the PySpark API