ShuffleStatus¶
ShuffleStatus
is a registry of MapStatuses per Partition of a ShuffleMapStage.
ShuffleStatus
is used by MapOutputTrackerMaster.
Creating Instance¶
ShuffleStatus
takes the following to be created:
- Number of Partitions (of the RDD of the ShuffleDependency of a ShuffleMapStage)
ShuffleStatus
is created when:
MapOutputTrackerMaster
is requested to register a shuffle (whenDAGScheduler
is requested to create a ShuffleMapStage)
MapStatuses per Partition¶
ShuffleStatus
creates a mapStatuses
internal registry of MapStatuses per partition (using the numPartitions) when created.
A missing partition is when there is no MapStatus
for a partition (null
at the index of the partition ID) and can be requested using findMissingPartitions.
mapStatuses
is all null
(for every partition) initially (and so all partitions are missing / uncomputed).
A new MapStatus
is added in addMapOutput and updateMapOutput.
A MapStatus
is removed (null
ed) in removeMapOutput and removeOutputsByFilter.
The number of available MapStatus
es is tracked by _numAvailableMapOutputs internal counter.
Used when:
Registering Shuffle Map Output¶
addMapOutput(
mapIndex: Int,
status: MapStatus): Unit
addMapOutput
adds the MapStatus to the mapStatuses internal registry.
In case the mapStatuses internal registry had no MapStatus
for the mapIndex
already available, addMapOutput
increments the _numAvailableMapOutputs internal counter and invalidateSerializedMapOutputStatusCache.
addMapOutput
is used when:
MapOutputTrackerMaster
is requested to registerMapOutput
Deregistering Shuffle Map Output¶
removeMapOutput(
mapIndex: Int,
bmAddress: BlockManagerId): Unit
removeMapOutput
...FIXME
removeMapOutput
is used when:
MapOutputTrackerMaster
is requested to unregisterMapOutput
Missing Partitions¶
findMissingPartitions(): Seq[Int]
findMissingPartitions
...FIXME
findMissingPartitions
is used when:
MapOutputTrackerMaster
is requested to findMissingPartitions
Serializing Shuffle Map Output Statuses¶
serializedMapStatus(
broadcastManager: BroadcastManager,
isLocal: Boolean,
minBroadcastSize: Int,
conf: SparkConf): Array[Byte]
serializedMapStatus
...FIXME
serializedMapStatus
is used when:
MessageLoop
(of the MapOutputTrackerMaster) is requested to send map output locations for shuffle
Logging¶
Enable ALL
logging level for org.apache.spark.ShuffleStatus
logger to see what happens inside.
Add the following line to conf/log4j.properties
:
log4j.logger.org.apache.spark.ShuffleStatus=ALL
Refer to Logging.