ShuffleDependency is a Dependency on the output of a scheduler:ShuffleMapStage.md[ShuffleMapStage] for a <
ShuffleDependency uses the <
== [[creating-instance]] Creating Instance
ShuffleDependency takes the following to be created:
> of key-value pairs (
RDD[_ <: Product2[K, V]])
- [[serializer]] serializer:Serializer.md[Serializer]
- [[keyOrdering]] Ordering for K keys (
Option[Aggregator[K, V, C]])
> flag (default:
When created, ShuffleDependency gets ROOT:SparkContext.md#nextShuffleId[shuffle id] (as
NOTE: ShuffleDependency uses the index.md#context[input RDD to access
SparkContext] and so the
ShuffleDependency shuffle:ShuffleManager.md#registerShuffle[registers itself with
ShuffleManager] and gets a
ShuffleHandle (available as <
NOTE: ShuffleDependency accesses core:SparkEnv.md#shuffleManager[
In the end, ShuffleDependency core:ContextCleaner.md#registerShuffleForCleanup[registers itself for cleanup with
NOTE: ShuffleDependency accesses the ROOT:SparkContext.md#cleaner[optional
== [[shuffleId]] Shuffle ID
Every ShuffleDependency has a unique application-wide shuffle ID that is assigned when <
Shuffle IDs are tracked by ROOT:SparkContext.md#nextShuffleId[SparkContext].
== [[rdd]] Parent RDD
ShuffleDependency is given the parent RDD.md[RDD] of key-value pairs (
RDD[_ <: Product2[K, V]]).
The parent RDD is available as rdd property that is part of the Dependency abstraction.
== [[partitioner]] Partitioner
ShuffleDependency is given a Partitioner.md[Partitioner] that is used to partition the shuffle output (when shuffle:SortShuffleWriter.md[SortShuffleWriter], shuffle:BypassMergeSortShuffleWriter.md[BypassMergeSortShuffleWriter] and shuffle:UnsafeShuffleWriter.md[UnsafeShuffleWriter] are requested to write).
== [[shuffleHandle]] ShuffleHandle
shuffleHandle is the
ShuffleHandle of a ShuffleDependency as assigned eagerly when <
shuffleHandle is used to compute CoGroupedRDDs, ShuffledRDD.md#compute[ShuffledRDD], SubtractedRDD, and spark-sql-ShuffledRowRDD.md[ShuffledRowRDD] (to get a spark-shuffle-ShuffleReader.md[ShuffleReader] for a ShuffleDependency) and when a scheduler:ShuffleMapTask.md#runTask[
ShuffleMapTask runs] (to get a
ShuffleWriter for a ShuffleDependency).
== [[mapSideCombine]] Map-Size Partial Aggregation Flag
ShuffleDependency uses a mapSideCombine flag that controls whether to perform map-side partial aggregation (map-side combine) using an <
mapSideCombine is disabled (
false) by default and can be enabled (
true) for some use cases of ShuffledRDD.md#mapSideCombine[ShuffledRDD].
ShuffleDependency requires that the optional <
mapSideCombine is used when:
BlockStoreShuffleReader is requested to shuffle:BlockStoreShuffleReader.md#read[read combined records for a reduce task]
SortShuffleManager is requested to shuffle:SortShuffleManager.md#registerShuffle[register a shuffle]
SortShuffleWriter is requested to shuffle:SortShuffleWriter.md#write[write records]
== [[aggregator]] Optional Aggregator
aggregator: Option[Aggregator[K, V, C]] = None¶
aggregator is a Aggregator.md[map/reduce-side Aggregator] (for a RDD's shuffle).
aggregator is by default undefined (i.e.
None) when <
aggregator is used when shuffle:SortShuffleWriter.md#write[
SortShuffleWriter writes records] and shuffle:BlockStoreShuffleReader.md#read[
BlockStoreShuffleReader reads combined key-values for a reduce task].