Exercise: Learning Jobs and Partitions Using take Action

The exercise aims for introducing take action and using spark-shell and web UI. It should introduce you to the concepts of partitions and jobs.

The following snippet creates an RDD of 16 elements with 16 partitions.

scala> val r1 = sc.parallelize(0 to 15, 16)
r1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[26] at parallelize at <console>:18

scala> r1.partitions.size
res63: Int = 16

scala> r1.foreachPartition(it => println(">>> partition size: " + it.size))
...
>>> partition size: 1
>>> partition size: 1
>>> partition size: 1
>>> partition size: 1
>>> partition size: 1
>>> partition size: 1
>>> partition size: 1
>>> partition size: 1
... // the machine has 8 cores
... // so first 8 tasks get executed immediately
... // with the others after a core is free to take on new tasks.
>>> partition size: 1
...
>>> partition size: 1
...
>>> partition size: 1
...
>>> partition size: 1
>>> partition size: 1
...
>>> partition size: 1
>>> partition size: 1
>>> partition size: 1

All 16 partitions have one element.

When you execute r1.take(1) only one job gets run since it is enough to compute one task on one partition.

FIXME Snapshot from web UI - note the number of tasks

However, when you execute r1.take(2) two jobs get run as the implementation assumes one job with one partition, and if the elements didn’t total to the number of elements requested in take, quadruple the partitions to work on in the following jobs.

FIXME Snapshot from web UI - note the number of tasks

Can you guess how many jobs are run for r1.take(15)? How many tasks per job?

FIXME Snapshot from web UI - note the number of tasks

Answer: 3.