= ShuffleDependency

ShuffleDependency is a Dependency on the output of a scheduler:ShuffleMapStage.md[ShuffleMapStage] for a <<rdd, key-value pair RDD>>.

ShuffleDependency uses the <<rdd, RDD>> to know the number of (map-side/pre-shuffle) partitions and the <<partitioner, Partitioner>> for the number of (reduce-side/post-shuffle) partitions.
ShuffleDependency is created as a dependency of ShuffledRDD.md[ShuffledRDD]. ShuffleDependency can also be created as a dependency of CoGroupedRDD and SubtractedRDD.
== [[creating-instance]] Creating Instance
ShuffleDependency takes the following to be created:

- <<rdd, RDD>> of key-value pairs (RDD[_ <: Product2[K, V]])
- <<partitioner, Partitioner>>
- [[serializer]] serializer:Serializer.md[Serializer]
- [[keyOrdering]] Ordering for K keys (Option[Ordering[K]])
- <<aggregator, Aggregator>> (Option[Aggregator[K, V, C]])
- <<mapSideCombine, mapSideCombine>> flag (default: false)
When created, ShuffleDependency gets the SparkContext.md#nextShuffleId[shuffle id] (as shuffleId).

NOTE: ShuffleDependency uses the <<rdd, input RDD>> to access index.md#context[SparkContext] and so the <<shuffleId, shuffleId>>.
ShuffleDependency shuffle:ShuffleManager.md#registerShuffle[registers itself with the ShuffleManager] and gets a ShuffleHandle (available as <<shuffleHandle, shuffleHandle>>).
NOTE: ShuffleDependency accesses the core:SparkEnv.md#shuffleManager[ShuffleManager using SparkEnv].
In the end, ShuffleDependency core:ContextCleaner.md#registerShuffleForCleanup[registers itself for cleanup with ContextCleaner].
NOTE: ShuffleDependency accesses the SparkContext.md#cleaner[optional ContextCleaner through SparkContext].
NOTE: ShuffleDependency is created when ShuffledRDD.md#getDependencies[ShuffledRDD], CoGroupedRDD, and SubtractedRDD return their RDD dependencies.
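The following is a minimal sketch (e.g. in spark-shell, where sc is the SparkContext) that uses reduceByKey to create a ShuffledRDD and inspect its ShuffleDependency and the properties described in the sections below:

[source, scala]
----
import org.apache.spark.ShuffleDependency
import org.apache.spark.rdd.RDD

// A key-value transformation such as reduceByKey produces a ShuffledRDD
// whose single dependency is a ShuffleDependency.
val pairs: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val counts = pairs.reduceByKey(_ + _)

val dep = counts.dependencies.head.asInstanceOf[ShuffleDependency[String, Int, Int]]
println(dep.shuffleId)       // unique application-wide shuffle ID
println(dep.partitioner)     // Partitioner of the shuffle output
println(dep.mapSideCombine)  // true for reduceByKey
----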
== [[shuffleId]] Shuffle ID
Every ShuffleDependency has a unique application-wide shuffle ID that is assigned when the <<creating-instance, ShuffleDependency is created>>.
Shuffle IDs are tracked by SparkContext.md#nextShuffleId[SparkContext].
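As a small illustration (again assuming a spark-shell session with sc available), two separate shuffles in the same application get distinct shuffle IDs:

[source, scala]
----
import org.apache.spark.ShuffleDependency

// Each ShuffleDependency registers a new application-wide ID, so two shuffles differ.
val dep1 = sc.parallelize(Seq((1, 1))).reduceByKey(_ + _)
  .dependencies.head.asInstanceOf[ShuffleDependency[Int, Int, Int]]
val dep2 = sc.parallelize(Seq((1, 1))).reduceByKey(_ + _)
  .dependencies.head.asInstanceOf[ShuffleDependency[Int, Int, Int]]
assert(dep1.shuffleId != dep2.shuffleId)
----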
== [[rdd]] Parent RDD
ShuffleDependency is given the parent RDD.md[RDD] of key-value pairs (RDD[_ <: Product2[K, V]]).

The parent RDD is available as the rdd property that is part of the Dependency abstraction.

[source, scala]
----
rdd: RDD[Product2[K, V]]
----
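For illustration (a sketch assuming a spark-shell session), rdd gives back the very parent RDD the shuffle was created over:

[source, scala]
----
import org.apache.spark.ShuffleDependency

val parent = sc.parallelize(Seq(("a", 1), ("b", 2)))
val dep = parent.reduceByKey(_ + _)
  .dependencies.head.asInstanceOf[ShuffleDependency[String, Int, Int]]

// rdd is the parent key-value RDD (part of the Dependency contract)
assert(dep.rdd eq parent)
----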
== [[partitioner]] Partitioner
ShuffleDependency is given a Partitioner.md[Partitioner] that is used to partition the shuffle output (when shuffle:SortShuffleWriter.md[SortShuffleWriter], shuffle:BypassMergeSortShuffleWriter.md[BypassMergeSortShuffleWriter] and shuffle:UnsafeShuffleWriter.md[UnsafeShuffleWriter] are requested to write).
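As a sketch (spark-shell, sc available), passing an explicit HashPartitioner shows that the Partitioner handed to the ShuffleDependency determines the number of shuffle output partitions:

[source, scala]
----
import org.apache.spark.{HashPartitioner, ShuffleDependency}

// The Partitioner given to the ShuffleDependency defines the post-shuffle partitioning.
val dep = sc.parallelize(Seq(("a", 1), ("b", 2)))
  .reduceByKey(new HashPartitioner(8), _ + _)
  .dependencies.head.asInstanceOf[ShuffleDependency[String, Int, Int]]

assert(dep.partitioner.numPartitions == 8)
----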
== [[shuffleHandle]] ShuffleHandle
[source, scala]
----
shuffleHandle: ShuffleHandle
----

shuffleHandle is the ShuffleHandle of a ShuffleDependency that is assigned eagerly when the <<creating-instance, ShuffleDependency is created>>.

shuffleHandle is used to compute CoGroupedRDD, ShuffledRDD.md#compute[ShuffledRDD], SubtractedRDD, and spark-sql-ShuffledRowRDD.md[ShuffledRowRDD] (to get a ShuffleReader.md[ShuffleReader] for a ShuffleDependency) and when a scheduler:ShuffleMapTask.md#runTask[ShuffleMapTask runs] (to get a ShuffleWriter for a ShuffleDependency).
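As a small check (a sketch assuming a spark-shell session), the handle is already assigned on the dependency and carries the shuffle ID it was registered under:

[source, scala]
----
import org.apache.spark.ShuffleDependency

val dep = sc.parallelize(Seq(("a", 1))).reduceByKey(_ + _)
  .dependencies.head.asInstanceOf[ShuffleDependency[String, Int, Int]]

// shuffleHandle is assigned eagerly (at construction time) by the ShuffleManager
// and records the shuffle ID of the dependency.
assert(dep.shuffleHandle.shuffleId == dep.shuffleId)
----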
== [[mapSideCombine]] Map-Side Partial Aggregation Flag
ShuffleDependency uses a mapSideCombine flag that controls whether to perform map-side partial aggregation (map-side combine) using an <<aggregator, Aggregator>>.
mapSideCombine is disabled (false) by default and can be enabled (true) for some use cases of ShuffledRDD.md#mapSideCombine[ShuffledRDD].
ShuffleDependency requires that the optional <<aggregator, Aggregator>> is defined when the flag is enabled.
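To make the flag concrete (a sketch in spark-shell), reduceByKey enables map-side combine while groupByKey does not:

[source, scala]
----
import org.apache.spark.ShuffleDependency

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

// reduceByKey combines values on the map side before shuffling them
val reduced = pairs.reduceByKey(_ + _)
  .dependencies.head.asInstanceOf[ShuffleDependency[String, Int, Int]]
assert(reduced.mapSideCombine)

// groupByKey disables map-side combine (every value has to reach the reducer)
val grouped = pairs.groupByKey()
  .dependencies.head.asInstanceOf[ShuffleDependency[String, Int, Any]]
assert(!grouped.mapSideCombine)
----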
mapSideCombine is used when:

- BlockStoreShuffleReader is requested to shuffle:BlockStoreShuffleReader.md#read[read combined records for a reduce task]
- SortShuffleManager is requested to shuffle:SortShuffleManager.md#registerShuffle[register a shuffle]
- SortShuffleWriter is requested to shuffle:SortShuffleWriter.md#write[write records]
== [[aggregator]] Optional Aggregator
[source, scala]
----
aggregator: Option[Aggregator[K, V, C]] = None
----

aggregator is an Aggregator.md[map/reduce-side Aggregator] (for an RDD's shuffle).

aggregator is undefined (None) by default when the <<creating-instance, ShuffleDependency is created>>.
NOTE: aggregator is used when shuffle:SortShuffleWriter.md#write[SortShuffleWriter writes records] and shuffle:BlockStoreShuffleReader.md#read[BlockStoreShuffleReader reads combined key-values for a reduce task].
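For contrast (a sketch in spark-shell), an aggregating transformation like reduceByKey supplies an Aggregator, while a partitioning-only shuffle such as partitionBy leaves it undefined:

[source, scala]
----
import org.apache.spark.{HashPartitioner, ShuffleDependency}

// reduceByKey hands over an Aggregator (createCombiner/mergeValue/mergeCombiners) ...
val withAggregator = sc.parallelize(Seq(("a", 1), ("a", 2))).reduceByKey(_ + _)
  .dependencies.head.asInstanceOf[ShuffleDependency[String, Int, Int]]
assert(withAggregator.aggregator.isDefined)

// ... while partitionBy only repartitions records and leaves aggregator as None
val withoutAggregator = sc.parallelize(Seq(("a", 1), ("a", 2)))
  .partitionBy(new HashPartitioner(4))
  .dependencies.head.asInstanceOf[ShuffleDependency[String, Int, Int]]
assert(withoutAggregator.aggregator.isEmpty)
----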