= ShuffleMapTask

ShuffleMapTask is one of the two types of tasks (the other being ResultTask) that, when executed, writes the result of executing a serialized task function over the records of an RDD partition to the shuffle system, and returns a MapStatus (information about the BlockManager and the estimated size of the result shuffle blocks).

ShuffleMapTask and DAGScheduler

== [[creating-instance]] Creating Instance

ShuffleMapTask takes the following to be created:

  • [[stageId]] Stage ID
  • [[stageAttemptId]] Stage attempt ID
  • <<taskBinary, Broadcast variable with the serialized task binary>>
  • [[partition]] Partition
  • [[locs]] TaskLocations
  • [[localProperties]] Task-specific local properties
  • [[serializedTaskMetrics]] Serialized task metrics (Array[Byte])
  • [[jobId]] Optional job ID (default: None)
  • [[appId]] Optional application ID (default: None)
  • [[appAttemptId]] Optional application attempt ID (default: None)
  • [[isBarrier]] isBarrier flag (default: false)

ShuffleMapTask is created when DAGScheduler is requested to submit tasks for all missing partitions of a ShuffleMapStage.

== [[taskBinary]] Broadcast Variable and Serialized Task Binary

ShuffleMapTask is given a broadcast variable with a reference to the serialized task binary (Broadcast[Array[Byte]]).

<<runTask, runTask>> expects that the serialized task binary is a tuple of an RDD and a ShuffleDependency.
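To illustrate the idea, the task binary is just an array of bytes that deserializes back into a two-element tuple. This toy sketch uses plain JDK serialization rather than Spark's closure serializer, with strings standing in for the RDD and the ShuffleDependency:

```scala
import java.io._

// Toy illustration only: plain JDK serialization instead of Spark's
// closure serializer; strings stand in for the RDD and ShuffleDependency.
def serialize(o: AnyRef): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(o)
  oos.close()
  bos.toByteArray
}

def deserialize[T](bytes: Array[Byte]): T = {
  val ois = new ObjectInputStream(new ByteArrayInputStream(bytes))
  try ois.readObject().asInstanceOf[T] finally ois.close()
}

// The "task binary": a serialized (rdd, shuffleDependency) pair.
val taskBinary: Array[Byte] = serialize(("rdd-lineage", "shuffle-dependency"))
val (rdd, dep) = deserialize[(String, String)](taskBinary)
```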

== [[runTask]] Running Task

[source, scala]
----
runTask(context: TaskContext): MapStatus
----

runTask writes the result (records) of executing the serialized task function over the records of the RDD partition to the shuffle system, and returns a MapStatus (with the BlockManager and an estimated size of the result shuffle blocks).

Internally, runTask requests the SparkEnv for a new instance of the closure serializer and requests it to deserialize the task binary (into a tuple of an RDD and a ShuffleDependency).

runTask measures the thread and CPU deserialization times.

runTask requests the SparkEnv for the ShuffleManager and requests it for a ShuffleWriter (for the ShuffleHandle, the RDD partition, and the TaskContext).

runTask then requests the RDD for the records of the partition and requests the ShuffleWriter to write them out (to the shuffle system).

In the end, runTask requests the ShuffleWriter to stop (with the success flag on) and returns the shuffle map output status (the MapStatus).

NOTE: This is the moment in a Task's lifecycle (and the lifecycle of its RDD) when an RDD partition is computed and in turn becomes a sequence of records (i.e. real data) on an executor.

In case of any exception, runTask requests the ShuffleWriter to stop (with the success flag off) and (re)throws the exception.

runTask may also print out the following DEBUG message to the logs when the ShuffleWriter could not be stopped:

----
Could not stop writer
----
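Putting these steps together, the body of runTask can be sketched roughly as follows. This is a simplified approximation (modeled on the Spark 2.x code path), not the exact source; the exact getWriter signature and the metrics bookkeeping vary across Spark versions:

```scala
// Simplified sketch of runTask (an approximation, not the exact Spark source).
override def runTask(context: TaskContext): MapStatus = {
  // Deserialize the task binary into the RDD and its ShuffleDependency.
  val ser = SparkEnv.get.closureSerializer.newInstance()
  val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
    ByteBuffer.wrap(taskBinary.value),
    Thread.currentThread.getContextClassLoader)

  var writer: ShuffleWriter[Any, Any] = null
  try {
    // Request a ShuffleWriter for this shuffle and partition.
    val manager = SparkEnv.get.shuffleManager
    writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
    // Compute the partition's records and write them to the shuffle system.
    writer.write(
      rdd.iterator(partition, context)
        .asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
    // Stop the writer with the success flag on and return its MapStatus.
    writer.stop(success = true).get
  } catch {
    case e: Exception =>
      if (writer != null) {
        try writer.stop(success = false)
        catch { case t: Exception => logDebug("Could not stop writer", t) }
      }
      throw e
  }
}
```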

runTask is part of the Task abstraction.

== [[preferredLocations]] preferredLocations Method

[source, scala]
----
preferredLocations: Seq[TaskLocation]
----

preferredLocations simply returns the <<preferredLocs, preferredLocs>> internal property.

preferredLocations is part of the Task abstraction.

== [[logging]] Logging

Enable ALL logging level for org.apache.spark.scheduler.ShuffleMapTask logger to see what happens inside.

Add the following line to conf/


Refer to Logging.

== [[preferredLocs]] Preferred Locations

preferredLocs are the TaskLocations that are the unique entries in the given <<locs, locs>>, with the only rule that when locs is not defined, preferredLocs is empty and no task location preferences are defined.

preferredLocs is initialized when ShuffleMapTask is created.

preferredLocs is used exclusively when ShuffleMapTask is requested for the <<preferredLocations, preferred locations>>.
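The "unique entries, empty when undefined" rule can be demonstrated with plain Scala. This is a hypothetical helper, with strings standing in for Spark's TaskLocation:

```scala
// Hypothetical helper: strings stand in for Spark's TaskLocation.
// Unique entries of locs; empty (no preferences) when locs is not defined.
def uniquePreferredLocs(locs: Seq[String]): Seq[String] =
  if (locs == null) Nil else locs.distinct

val prefs = uniquePreferredLocs(Seq("host1", "host2", "host1"))
val noPrefs = uniquePreferredLocs(null)
```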

Last update: 2020-10-10