Apache Spark 2.0.2 tutorial with PySpark : RDD
In this tutorial, the core concept in Spark, the Resilient Distributed Dataset (RDD), will be introduced. The RDD is Spark's core abstraction for working with data. Simply put, an RDD is a distributed collection of elements.
In Spark, all work is expressed as either creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result. Under the hood, Spark automatically distributes the data contained in RDDs across our cluster and parallelizes the operations we perform on them.
An RDD is an immutable distributed collection of objects.
Each RDD is split into multiple partitions, which may be computed on different nodes.
RDDs can contain any type of objects from Python, Java, or Scala.
We can create RDDs in two ways:
- By loading an external dataset.
- By distributing a collection of objects such as a list or a set in a driver program.
Once created, RDDs provide two types of operations:
- Transformations : construct a new RDD from a previous one.
- Actions : compute a result based on an RDD, and either return it to the driver program or save it to an external storage system such as HDFS.
Let's take an example of creating RDD by loading an external dataset:
Now we've created an RDD, and the RDD offers two types of operations: transformations and actions.
Transformations construct a new RDD from a previous one. For instance, one common transformation is filtering data that matches a predicate. In our text file example, we can use this to create a new RDD holding just the strings that contain the word "Spark":
Actions, on the other hand, compute a result based on an RDD, and either return it to the driver program or save it to an external storage system such as HDFS. One example of an action is first(), which returns the first element in an RDD:
Note that Spark computes RDDs in a lazy fashion (lazy evaluation). In other words, Spark computes them only the first time they are used in an action. This approach makes a lot of sense when we're working with Big Data.
As an example, take a look at our previous code where we defined a text file and then filtered the lines that include "Spark".
If Spark were to load and store all the lines in the file as soon as we wrote
lines = sc.textFile(...)
it would waste a lot of storage space, given that we then immediately filter out many of those lines. Instead, once Spark sees the whole chain of transformations, it can compute just the data needed for the result. In fact, for the first() action, Spark scans the file only until it finds the first matching line; it doesn't even read the whole file.
Also note that Spark's RDDs are by default recomputed (not persisted) each time we run an action on them. If we want to reuse an RDD in multiple actions, we can ask Spark to persist it using RDD's persist().
Persisting RDDs on disk instead of memory is also possible.
Here is an example using RDD's persist:
map() and filter() are the two most common transformations we will likely be using.
The map() takes in a function and applies it to each element in the RDD with the result of the function being the new value of each element in the resulting RDD.
The filter() transformation takes in a function and returns an RDD that only has elements that pass the filter() function.
Here is an example of map() that squares the numbers, with filter() eliminating '1' from an RDD:
Sometimes we want to produce multiple output elements for each input element. The operation to do this is called flatMap(). As with map(), the function we provide to flatMap() is called individually for each element in our input RDD. Instead of returning a single element, we return an iterator with our return values. Rather than producing an RDD of iterators, we get back an RDD that consists of the elements from all of the iterators.
A simple usage of flatMap() is splitting up an input string into words, as shown in the following example:
Picture from "Learning Spark"
We can think of flatMap() as "flattening" the iterators returned to it, so that instead of ending up with an RDD of lists we have an RDD of the elements in those lists. In other words, a flatMap() flattens multiple arrays into one single array.
Though RDDs themselves are not proper sets, RDDs support many set operations, such as union and intersection.
Picture from "Learning Spark"
We can also compute a Cartesian product between two RDDs. The cartesian() transformation returns all possible pairs of (a, b) where a is in the source RDD and b is in the other RDD.
The codes used in this tutorial : Apache Spark 2.0.2 tutorial with PySpark : RDD - PySparkRDD.ipynb