Skip to content


ShuffleDependency is a Dependency on the output of a[ShuffleMapStage] for a <>.

ShuffleDependency uses the <> to know the number of (map-side/pre-shuffle) partitions and the <> for the number of (reduce-size/post-shuffle) partitions.

ShuffleDependency is created as a dependency of[ShuffledRDD]. ShuffleDependency can also be created as a dependency of CoGroupedRDD and SubtractedRDD.

== [[creating-instance]] Creating Instance

ShuffleDependency takes the following to be created:

  • <> of key-value pairs (RDD[_ <: Product2[K, V]])
  • <>
  • [[serializer]][Serializer]
  • [[keyOrdering]] Ordering for K keys (Option[Ordering[K]])
  • <> (Option[Aggregator[K, V, C]])
  • <> flag (default: false)

When created, ShuffleDependency gets[shuffle id] (as shuffleId).

NOTE: ShuffleDependency uses the[input RDD to access SparkContext] and so the shuffleId.

ShuffleDependency[registers itself with ShuffleManager] and gets a ShuffleHandle (available as <> property).

NOTE: ShuffleDependency accesses[ShuffleManager using SparkEnv].

In the end, ShuffleDependency[registers itself for cleanup with ContextCleaner].

NOTE: ShuffleDependency accesses the[optional ContextCleaner through SparkContext].

NOTE: ShuffleDependency is created when[ShuffledRDD], CoGroupedRDD, and SubtractedRDD return their RDD dependencies.

== [[shuffleId]] Shuffle ID

Every ShuffleDependency has a unique application-wide shuffle ID that is assigned when <> (and is used throughout Spark's code to reference a ShuffleDependency).

Shuffle IDs are tracked by[SparkContext].

== [[rdd]] Parent RDD

ShuffleDependency is given the parent[RDD] of key-value pairs (RDD[_ <: Product2[K, V]]).

The parent RDD is available as rdd property that is part of the Dependency abstraction.


RDD[Product2[K, V]]

== [[partitioner]] Partitioner

ShuffleDependency is given a[Partitioner] that is used to partition the shuffle output (when[SortShuffleWriter],[BypassMergeSortShuffleWriter] and[UnsafeShuffleWriter] are requested to write).

== [[shuffleHandle]] ShuffleHandle

[source, scala]

shuffleHandle: ShuffleHandle

shuffleHandle is the ShuffleHandle of a ShuffleDependency as assigned eagerly when <>.

shuffleHandle is used to compute CoGroupedRDDs,[ShuffledRDD], SubtractedRDD, and[ShuffledRowRDD] (to get a[ShuffleReader] for a ShuffleDependency) and when a[ShuffleMapTask runs] (to get a ShuffleWriter for a ShuffleDependency).

== [[mapSideCombine]] Map-Size Partial Aggregation Flag

ShuffleDependency uses a mapSideCombine flag that controls whether to perform map-side partial aggregation (map-side combine) using an <>.

mapSideCombine is disabled (false) by default and can be enabled (true) for some use cases of[ShuffledRDD].

ShuffleDependency requires that the optional <> is defined when the flag is enabled.

mapSideCombine is used when:

  • BlockStoreShuffleReader is requested to[read combined records for a reduce task]

  • SortShuffleManager is requested to[register a shuffle]

  • SortShuffleWriter is requested to[write records]

== [[aggregator]] Optional Aggregator

[source, scala]

aggregator: Option[Aggregator[K, V, C]] = None

aggregator is a[map/reduce-side Aggregator] (for a RDD's shuffle).

aggregator is by default undefined (i.e. None) when <>.

NOTE: aggregator is used when[SortShuffleWriter writes records] and[BlockStoreShuffleReader reads combined key-values for a reduce task].

Last update: 2020-11-27