ShuffleDependency

ShuffleDependency is a Dependency on the output of a ShuffleMapStage for a key-value pair RDD.

ShuffleDependency uses the RDD to know the number of (map-side/pre-shuffle) partitions and the Partitioner for the number of (reduce-size/post-shuffle) partitions.

ShuffleDependency is created as a dependency of ShuffledRDD. ShuffleDependency can also be created as a dependency of CoGroupedRDD and SubtractedRDD.

Creating Instance

ShuffleDependency takes the following to be created:

When created, ShuffleDependency gets shuffle id (as shuffleId).

ShuffleDependency uses the input RDD to access SparkContext and so the shuffleId.

ShuffleDependency registers itself with ShuffleManager and gets a ShuffleHandle (available as shuffleHandle property).

ShuffleDependency accesses ShuffleManager using SparkEnv.

In the end, ShuffleDependency registers itself for cleanup with ContextCleaner.

ShuffleDependency accesses the optional ContextCleaner through SparkContext.
ShuffleDependency is created when ShuffledRDD, CoGroupedRDD, and SubtractedRDD return their RDD dependencies.

Shuffle ID

Every ShuffleDependency has a unique application-wide shuffle ID that is assigned when ShuffleDependency is created (and is used throughout Spark’s code to reference a ShuffleDependency).

Shuffle IDs are tracked by SparkContext.

Parent RDD

ShuffleDependency is given the parent RDD of key-value pairs (RDD[_ <: Product2[K, V]]).

The parent RDD is available as rdd property that is part of the Dependency abstraction.

rdd: RDD[Product2[K, V]]

Partitioner

ShuffleDependency is given a Partitioner that is used to partition the shuffle output (when SortShuffleWriter, BypassMergeSortShuffleWriter and UnsafeShuffleWriter are requested to write).

ShuffleHandle

shuffleHandle: ShuffleHandle

shuffleHandle is the ShuffleHandle of a ShuffleDependency as assigned eagerly when ShuffleDependency was created.

shuffleHandle is used to compute CoGroupedRDDs, ShuffledRDD, SubtractedRDD, and ShuffledRowRDD (to get a ShuffleReader for a ShuffleDependency) and when a ShuffleMapTask runs (to get a ShuffleWriter for a ShuffleDependency).

Map-Size Partial Aggregation Flag

ShuffleDependency uses a mapSideCombine flag that controls whether to perform map-side partial aggregation (map-side combine) using an Aggregator.

mapSideCombine is disabled (false) by default and can be enabled (true) for some use cases of ShuffledRDD.

ShuffleDependency requires that the optional Aggregator is defined when the flag is enabled.

mapSideCombine is used when:

Optional Aggregator

aggregator: Option[Aggregator[K, V, C]] = None

aggregator is a map/reduce-side Aggregator (for a RDD’s shuffle).

aggregator is by default undefined (i.e. None) when ShuffleDependency is created.