ShuffleStatus¶
ShuffleStatus is a registry of MapStatuses per Partition of a ShuffleMapStage.
ShuffleStatus is used by MapOutputTrackerMaster.
Creating Instance¶
ShuffleStatus takes the following to be created:
- Number of Partitions (of the RDD of the ShuffleDependency of a ShuffleMapStage)
ShuffleStatus is created when:
- MapOutputTrackerMaster is requested to register a shuffle (when DAGScheduler is requested to create a ShuffleMapStage)
MapStatuses per Partition¶
ShuffleStatus creates a mapStatuses internal registry of MapStatuses per partition (using the numPartitions) when created.
A partition is missing when there is no MapStatus for it (null at the index of the partition ID). Missing partitions can be requested using findMissingPartitions.
Initially, every entry of mapStatuses is null, so all partitions are missing (uncomputed).
A new MapStatus is added in addMapOutput and updateMapOutput.
A MapStatus is removed (nulled) in removeMapOutput and removeOutputsByFilter.
The number of available MapStatuses is tracked by the _numAvailableMapOutputs internal counter.
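The initial state of the registry can be illustrated with a minimal sketch. It is not Spark's implementation: SketchMapStatus and RegistrySketch are hypothetical names, locking is omitted, and the available count is derived by scanning the array, whereas the text above describes a running counter.

```scala
// Hypothetical stand-in for Spark's MapStatus (the real one carries more data).
final case class SketchMapStatus(location: String)

// Simplified, non-thread-safe model of the per-partition registry.
final class RegistrySketch(numPartitions: Int) {
  // One slot per partition; an array of references starts out all null,
  // so every partition is missing right after creation.
  val mapStatuses: Array[SketchMapStatus] =
    new Array[SketchMapStatus](numPartitions)

  // Number of non-null slots (the role of _numAvailableMapOutputs),
  // computed on demand here to keep the sketch short.
  def numAvailableMapOutputs: Int = mapStatuses.count(_ != null)
}
```

Creating a RegistrySketch for three partitions yields three null slots and an available count of zero.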
Registering Shuffle Map Output¶
addMapOutput(
  mapIndex: Int,
  status: MapStatus): Unit
addMapOutput adds the MapStatus to the mapStatuses internal registry.
If the mapStatuses internal registry had no MapStatus for the mapIndex yet, addMapOutput also increments the _numAvailableMapOutputs internal counter and invalidates the cache of serialized map output statuses (invalidateSerializedMapOutputStatusCache).
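The registration logic described above can be sketched as follows. This is a simplified model under stated assumptions, not Spark's code: AddSketchStatus and AddSketch are hypothetical names, synchronization is omitted, and the serialized cache is reduced to a counter of invalidations so the effect is observable.

```scala
// Hypothetical stand-in for Spark's MapStatus.
final case class AddSketchStatus(location: String)

final class AddSketch(numPartitions: Int) {
  private val mapStatuses = new Array[AddSketchStatus](numPartitions)
  // Plays the role of _numAvailableMapOutputs.
  private var numAvailable = 0
  // Assumption: stands in for invalidateSerializedMapOutputStatusCache.
  private var invalidations = 0

  def addMapOutput(mapIndex: Int, status: AddSketchStatus): Unit = {
    // Only a previously missing partition changes the counter and the cache.
    if (mapStatuses(mapIndex) == null) {
      numAvailable += 1
      invalidations += 1
    }
    // The status is stored in either case, replacing any earlier one.
    mapStatuses(mapIndex) = status
  }

  def numAvailableMapOutputs: Int = numAvailable
  def cacheInvalidations: Int = invalidations
  def statusAt(mapIndex: Int): Option[AddSketchStatus] =
    Option(mapStatuses(mapIndex))
}
```

Registering a status twice for the same mapIndex leaves the counter at one in this model; only the stored status changes.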
addMapOutput is used when:
- MapOutputTrackerMaster is requested to registerMapOutput
Deregistering Shuffle Map Output¶
removeMapOutput(
  mapIndex: Int,
  bmAddress: BlockManagerId): Unit
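removeMapOutput removes (nulls) the MapStatus registered for the given mapIndex in the mapStatuses internal registry, provided that the location of that MapStatus is the given bmAddress.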
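Removal can be modeled in the same simplified way. The sketch below rests on assumptions that should be checked against the Spark sources: that a status is nulled only when its location equals the given address, and that the available counter is decremented at that point. RemoveSketchStatus and RemoveSketch are hypothetical names, and a plain String replaces BlockManagerId.

```scala
// Hypothetical stand-in for MapStatus; a String replaces BlockManagerId.
final case class RemoveSketchStatus(location: String)

final class RemoveSketch(numPartitions: Int) {
  private val mapStatuses = new Array[RemoveSketchStatus](numPartitions)
  private var numAvailable = 0

  def addMapOutput(mapIndex: Int, status: RemoveSketchStatus): Unit = {
    if (mapStatuses(mapIndex) == null) numAvailable += 1
    mapStatuses(mapIndex) = status
  }

  def removeMapOutput(mapIndex: Int, bmAddress: String): Unit = {
    val current = mapStatuses(mapIndex)
    // Assumption: outputs registered at a different location are left alone.
    if (current != null && current.location == bmAddress) {
      mapStatuses(mapIndex) = null // the partition becomes missing again
      numAvailable -= 1
    }
  }

  def numAvailableMapOutputs: Int = numAvailable
  def isMissing(mapIndex: Int): Boolean = mapStatuses(mapIndex) == null
}
```

In this model, asking to remove an output with a non-matching address is a no-op, which keeps a newer registration from being dropped by a stale request.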
removeMapOutput is used when:
- MapOutputTrackerMaster is requested to unregisterMapOutput
Missing Partitions¶
findMissingPartitions(): Seq[Int]
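findMissingPartitions returns the IDs of the missing partitions, i.e. the indices of the mapStatuses internal registry that hold no MapStatus (null).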
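Given the definition of a missing partition (null at the index of the partition ID), the lookup can be sketched as a scan over the registry. MissingPartitionsSketch is a hypothetical name and the array of AnyRef stands in for the array of MapStatuses; the real method may differ in details such as locking.

```scala
object MissingPartitionsSketch {
  // Returns the indices (partition IDs) whose slot is still null.
  def findMissingPartitions(mapStatuses: Array[AnyRef]): Seq[Int] =
    mapStatuses.indices.filter(i => mapStatuses(i) == null)
}
```

For a three-slot registry where only the middle slot is filled, this scan reports partitions 0 and 2 as missing.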
findMissingPartitions is used when:
- MapOutputTrackerMaster is requested to findMissingPartitions
Serializing Shuffle Map Output Statuses¶
serializedMapStatus(
  broadcastManager: BroadcastManager,
  isLocal: Boolean,
  minBroadcastSize: Int,
  conf: SparkConf): Array[Byte]
serializedMapStatus...FIXME
serializedMapStatus is used when:
- MessageLoop (of the MapOutputTrackerMaster) is requested to send map output locations for a shuffle
Logging¶
Enable the ALL logging level for the org.apache.spark.ShuffleStatus logger to see what happens inside.
Add the following line to conf/log4j.properties:
log4j.logger.org.apache.spark.ShuffleStatus=ALL
Refer to Logging.