ShuffleMapTask is one of the two types of Tasks (the other being ResultTask). When executed, ShuffleMapTask writes the result of executing a serialized task code over the records (of an RDD partition) to the shuffle system and returns a MapStatus (with the BlockManager and estimated sizes of the result shuffle blocks).
ShuffleMapTask takes the following to be created:
- Stage ID
- Stage Attempt ID
- Broadcast variable with a serialized task binary
- Local Properties
- Serialized task metrics
- Job ID (default: `None`)
- Application ID (default: `None`)
- Application Attempt ID (default: `None`)
ShuffleMapTask is created when DAGScheduler is requested to submit tasks for all missing partitions of a ShuffleMapStage.
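The constructor arguments above can be illustrated with a simplified, hypothetical case class. This is not Spark's actual class (the name `SimpleShuffleMapTask` and the simplified types are stand-ins for illustration only); it only shows which arguments are required and which IDs default to `None`:

```scala
// Hypothetical, simplified stand-in for ShuffleMapTask (not Spark's class);
// it only illustrates required vs. defaulted constructor arguments.
case class SimpleShuffleMapTask(
    stageId: Int,
    stageAttemptId: Int,
    taskBinary: Array[Byte],               // stand-in for the broadcast task binary
    localProperties: Map[String, String],
    serializedTaskMetrics: Array[Byte],
    jobId: Option[Int] = None,
    appId: Option[String] = None,
    appAttemptId: Option[String] = None)

// The job, application, and application attempt IDs can be omitted
// and fall back to None.
val task = SimpleShuffleMapTask(
  stageId = 0,
  stageAttemptId = 0,
  taskBinary = Array.emptyByteArray,
  localProperties = Map.empty,
  serializedTaskMetrics = Array.emptyByteArray)

println(task.jobId) // None
```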
Serialized Task Binary
ShuffleMapTask is given a broadcast variable with a reference to a serialized task binary.
```scala
runTask(
  context: TaskContext): MapStatus
```
runTask writes the result (records) of executing the serialized task code over the records (in the RDD partition) to the shuffle system and returns a MapStatus (with the BlockManager and an estimated size of the result shuffle blocks).
This is the moment in a Task's lifecycle (and its corresponding RDD's) when an RDD partition is computed and in turn becomes a sequence of records (i.e. real data) on an executor.
In case of any exceptions, runTask requests the ShuffleWriter to stop (with the success flag off) and (re)throws the exception.
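The control flow just described can be sketched in plain Scala. This is a hypothetical, self-contained model, not Spark's source: `Writer`, `InMemoryWriter`, and the `String` status are stand-ins for Spark's `ShuffleWriter` and `MapStatus`:

```scala
// Minimal stand-in for Spark's ShuffleWriter (hypothetical interface).
trait Writer {
  def write(records: Iterator[(Int, String)]): Unit
  def stop(success: Boolean): Option[String] // Some(status) on success
}

// Sketch of runTask's flow: write the partition's records, stop the
// writer successfully, and return the status; on any exception, stop
// the writer with the success flag off and rethrow.
def runTask(records: Iterator[(Int, String)], writer: Writer): String =
  try {
    writer.write(records)            // shuffle-write the partition's records
    writer.stop(success = true).get  // stand-in for the returned MapStatus
  } catch {
    case e: Throwable =>
      try writer.stop(success = false) // discard partial output, best effort
      catch {
        // mirrors the DEBUG message documented above
        case _: Throwable => println("DEBUG Could not stop writer")
      }
      throw e
  }

// Toy in-memory writer used to exercise the happy path.
class InMemoryWriter extends Writer {
  private val buf = scala.collection.mutable.ArrayBuffer.empty[(Int, String)]
  def write(records: Iterator[(Int, String)]): Unit = buf ++= records
  def stop(success: Boolean): Option[String] =
    if (success) Some(s"MapStatus(${buf.size} blocks)") else None
}

val status = runTask(Iterator(1 -> "a", 2 -> "b"), new InMemoryWriter)
println(status) // MapStatus(2 blocks)
```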
runTask may also print out the following DEBUG message to the logs when the ShuffleWriter could not be stopped:

```
Could not stop writer
```
runTask is part of the Task abstraction.
preferredLocations is part of the Task abstraction.

preferredLocations returns the preferredLocs internal property that ShuffleMapTask initializes when created.
Enable ALL logging level for the org.apache.spark.scheduler.ShuffleMapTask logger to see what happens inside.

Add the following line to your log4j configuration (e.g. conf/log4j.properties):

```
log4j.logger.org.apache.spark.scheduler.ShuffleMapTask=ALL
```

Refer to Logging.