RDD API — Description of Distributed Computation

RDD is a description of a distributed computation over a set of records (of type T).

RDD takes the following to be created:

  • SparkContext

  • Parent RDDs, i.e. Dependencies (Seq[Dependency[_]]) that all have to be computed successfully before this RDD

RDD is a Scala abstract class and cannot be created directly. It is only created indirectly, by instantiating one of the concrete RDD subclasses.
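For example, concrete RDDs come into being through SparkContext and RDD operators (a sketch; the subclass names in the comments are the ones Spark uses for these operators, and the local-master setup is for illustration only):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local SparkContext for illustration only.
val sc = new SparkContext(
  new SparkConf().setMaster("local[*]").setAppName("rdd-demo"))

// RDD itself is abstract; every operator returns a concrete subclass.
val nums    = sc.parallelize(1 to 10)  // ParallelCollectionRDD
val doubled = nums.map(_ * 2)          // MapPartitionsRDD
val byMod   = doubled.groupBy(_ % 3)   // introduces a ShuffledRDD
```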

RDD is identified by an identifier (aka RDD ID) that is unique among all RDDs within the owning SparkContext.

id: Int

RDD has a storage level that…​FIXME

storageLevel: StorageLevel

The storage level of an RDD is StorageLevel.NONE by default which is…​FIXME
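A minimal sketch of reading and setting the storage level (assuming a SparkContext sc is in scope):

```scala
import org.apache.spark.storage.StorageLevel

val nums = sc.parallelize(1 to 10)
assert(nums.getStorageLevel == StorageLevel.NONE)  // default: not persisted

// persist sets the storage level (it can only be assigned once per RDD).
nums.persist(StorageLevel.MEMORY_ONLY)
assert(nums.getStorageLevel == StorageLevel.MEMORY_ONLY)
```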

An RDD can be part of a barrier stage. By default, the isBarrier flag is enabled (true) when both of the following hold:

  1. There are no ShuffleDependencies among the dependencies of the RDD

  2. There is at least one parent RDD that has the flag enabled

ShuffledRDD has the isBarrier flag always disabled.
MapPartitionsRDD is the only RDD that can have the isBarrier flag enabled.
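The flag is enabled through RDD.barrier, which gives an RDDBarrier whose mapPartitions produces a MapPartitionsRDD with isBarrier on (a sketch assuming a SparkContext sc is in scope):

```scala
// All 4 tasks of the barrier stage are launched together.
val barriered = sc.parallelize(1 to 8, numSlices = 4)
  .barrier()                // RDDBarrier[Int]
  .mapPartitions { it =>    // MapPartitionsRDD with isBarrier enabled
    it.map(_ + 1)
  }
```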

Getting Or Computing RDD Partition — getOrCompute Internal Method

getOrCompute(
  partition: Partition,
  context: TaskContext): Iterator[T]

getOrCompute creates an RDDBlockId for the RDD id and the partition index.

getOrCompute uses SparkEnv to access the current BlockManager and requests it to getOrElseUpdate for the block ID (with the RDD's storage level and a makeIterator function).

getOrCompute records whether…​FIXME (readCachedBlock)

getOrCompute branches off based on the response from the BlockManager and whether the internal readCachedBlock flag is on or off. In either case, getOrCompute wraps the result in an InterruptibleIterator.

InterruptibleIterator simply delegates to a wrapped internal Iterator, while adding support for task killing.
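The idea behind InterruptibleIterator can be sketched as follows (a simplified model, not the exact Spark source):

```scala
import org.apache.spark.TaskContext

// Delegates to a wrapped iterator, checking for a kill request on each hasNext.
class InterruptibleIterator[T](context: TaskContext, delegate: Iterator[T])
    extends Iterator[T] {

  override def hasNext: Boolean = {
    context.killTaskIfInterrupted()  // throws if the task was marked killed
    delegate.hasNext
  }

  override def next(): T = delegate.next()
}
```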

For a BlockResult available and readCachedBlock flag on, getOrCompute…​FIXME

For a BlockResult available and readCachedBlock flag off, getOrCompute…​FIXME

The BlockResult could be found in a local block manager or fetched from a remote block manager. It may also have been stored (persisted) just now. In either case, the BlockResult is available (and BlockManager.getOrElseUpdate gives a Left value with the BlockResult).

For Right(iter) (regardless of the value of readCachedBlock flag since…​FIXME), getOrCompute…​FIXME

BlockManager.getOrElseUpdate gives a Right(iter) value when a block could not be stored, handing the computed records back directly.
getOrCompute runs on Spark executors.
getOrCompute is used exclusively when an RDD is requested for the iterator over values in a partition.
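Put together, the control flow of getOrCompute can be sketched as follows (a simplified model of the Spark source, with the metrics bookkeeping left out):

```scala
private[spark] def getOrCompute(partition: Partition, context: TaskContext): Iterator[T] = {
  val blockId = RDDBlockId(id, partition.index)
  var readCachedBlock = true
  SparkEnv.get.blockManager.getOrElseUpdate(blockId, storageLevel, elementClassTag, () => {
    readCachedBlock = false  // makeIterator ran, so there was no cached block
    computeOrReadCheckpoint(partition, context)
  }) match {
    case Left(blockResult) =>
      // Block available (found locally or remotely, or just stored).
      new InterruptibleIterator(context, blockResult.data.asInstanceOf[Iterator[T]])
    case Right(iter) =>
      // Block could not be stored; hand back the computed records directly.
      new InterruptibleIterator(context, iter)
  }
}
```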

Computing Partition (in TaskContext) — compute Method

compute(split: Partition, context: TaskContext): Iterator[T]

The abstract compute method computes the input split partition in the TaskContext to produce a collection of values (of type T).

compute is implemented by every concrete RDD in Spark and is called every time the records are requested, unless the RDD is cached or checkpointed (in which case the records can be read from storage, this time closer to the compute node).

When an RDD is cached (i.e. persisted with any storage level but NONE), iterator goes through getOrCompute and the BlockManager to get or compute partitions.

compute runs on executors, as part of running tasks.
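A minimal custom RDD implementing compute might look like this (illustrative only; SimpleRangeRDD and RangePartition are hypothetical names, not Spark classes):

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

class RangePartition(override val index: Int) extends Partition

// numPartitions partitions, each producing perPartition consecutive ints.
class SimpleRangeRDD(sc: SparkContext, numPartitions: Int, perPartition: Int)
    extends RDD[Int](sc, Nil) {  // Nil: no parent dependencies

  override def getPartitions: Array[Partition] =
    (0 until numPartitions).map(new RangePartition(_)).toArray

  // Called once per task to produce the records of the given partition.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val start = split.index * perPartition
    Iterator.range(start, start + perPartition)
  }
}
```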

RDD Dependencies — dependencies Final Template Method

dependencies: Seq[Dependency[_]]

dependencies returns the dependencies of an RDD.

dependencies is a final method that no subclass can override.

Internally, dependencies checks whether the RDD is checkpointed and acts accordingly.

For an RDD that is checkpointed, dependencies returns a single-element collection with a OneToOneDependency.

For a non-checkpointed RDD, the dependencies collection is computed using the getDependencies method.

getDependencies method is an abstract method that custom RDDs are required to provide.
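For instance, a one-parent RDD can declare its dependencies at construction time, which is what the default getDependencies hands back (PlusOneRDD is a hypothetical name, for illustration):

```scala
import org.apache.spark.{OneToOneDependency, Partition, TaskContext}
import org.apache.spark.rdd.RDD

// The dependency passed to the RDD constructor is what dependencies
// returns (unless the RDD is checkpointed).
class PlusOneRDD(parent: RDD[Int])
    extends RDD[Int](parent.sparkContext, Seq(new OneToOneDependency(parent))) {

  override def getPartitions: Array[Partition] = firstParent[Int].partitions

  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    firstParent[Int].iterator(split, context).map(_ + 1)
}
```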

Accessing Records For Partition Lazily — iterator Final Method

iterator(split: Partition, context: TaskContext): Iterator[T]

iterator is a final method that, despite being public, is considered private and is only available for implementing custom RDDs.
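iterator's dispatch can be sketched as follows (a simplified model of the Spark source):

```scala
// Use the cache when a storage level is set; otherwise compute the
// partition (or read it back from a checkpoint).
final def iterator(split: Partition, context: TaskContext): Iterator[T] =
  if (storageLevel != StorageLevel.NONE) {
    getOrCompute(split, context)
  } else {
    computeOrReadCheckpoint(split, context)
  }
```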