Task

Task is an abstraction of the smallest individual units of execution that can be executed to compute an RDD partition.

Task is created when DAGScheduler is requested to submit missing tasks of a stage.

Tasks Are Runtime Representation of RDD Partitions

Contract

Running Task

runTask(
  context: TaskContext): T

Runs the task (in a TaskContext)

Used when Task is requested to run
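
The contract can be illustrated with a minimal sketch. It does not use Spark's own classes: SketchTaskContext, SketchTask and SumPartitionTask are hypothetical stand-ins introduced only to show how a concrete task supplies runTask while the base class stays abstract.

```scala
// Hypothetical stand-in for TaskContext; Spark's real TaskContext carries far more state.
final case class SketchTaskContext(stageId: Int, partitionId: Int, attemptNumber: Int)

// Hypothetical stand-in for Task: runTask is the only abstract member.
abstract class SketchTask[T](val stageId: Int, val partitionId: Int) {
  def runTask(context: SketchTaskContext): T
}

// A hypothetical concrete task that computes one partition by summing its elements.
final class SumPartitionTask(stage: Int, part: Int, data: Seq[Int])
    extends SketchTask[Int](stage, part) {
  override def runTask(context: SketchTaskContext): Int = data.sum
}
```

Calling runTask on a SumPartitionTask built over Seq(1, 2, 3) returns the sum of that single partition.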

Implementations

Creating Instance

Task takes the following to be created:

  • Stage ID
  • Stage (execution) Attempt ID
  • Partition ID
  • Local Properties
  • Serialized TaskMetrics (Array[Byte])
  • Job ID (default: None)
  • Application ID (default: None)
  • Application Attempt ID (default: None)
  • isBarrier flag (default: false)

Abstract Class

Task is an abstract class and cannot be created directly. It is created indirectly through its concrete implementations.

Serializable

Task is a Java Serializable so it can be serialized and sent over the wire from the driver to executors.

Preferred Locations

preferredLocations: Seq[TaskLocation]

TaskLocations that represent preferred locations (executors) to execute the task on.

Empty by default, which means no task location preferences are defined and the task can be launched on any executor.

Note

Defined by the concrete tasks (i.e. ShuffleMapTask and ResultTask).

preferredLocations is used when TaskSetManager is requested to register a task as pending execution and to dequeueSpeculativeTask.
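
The "empty by default, overridden by concrete tasks" pattern described above could look roughly like the following sketch. SketchLocation, HostLocation, LocatedTask, AnywhereTask and PinnedTask are hypothetical names and do not mirror Spark's TaskLocation hierarchy.

```scala
// Hypothetical location type; not Spark's TaskLocation.
sealed trait SketchLocation { def host: String }
final case class HostLocation(host: String) extends SketchLocation

abstract class LocatedTask {
  // An empty sequence means "no preference": any executor may run the task.
  def preferredLocations: Seq[SketchLocation] = Nil
}

// Keeps the default and so can be scheduled anywhere.
final class AnywhereTask extends LocatedTask

// A concrete task that overrides the default with explicit preferences.
final class PinnedTask(locations: Seq[SketchLocation]) extends LocatedTask {
  override def preferredLocations: Seq[SketchLocation] = locations
}
```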

Running Task Thread

run(
  taskAttemptId: Long,
  attemptNumber: Int,
  metricsSystem: MetricsSystem): T

run registers the task (identified as taskAttemptId) with the local BlockManager.

Note

run uses SparkEnv to access the current BlockManager.

run creates a TaskContextImpl that in turn becomes the task's TaskContext.

Note

run is a final method and so cannot be overridden.

run checks the _killed flag and, if enabled, kills the task (with interruptThread flag disabled).

run creates a Hadoop CallerContext and sets it.

run runs the task.

Note

This is the moment when the custom Task's runTask is executed.

In the end, run notifies TaskContextImpl that the task has completed (regardless of the final outcome -- a success or a failure).

In case of any exceptions, run notifies TaskContextImpl that the task has failed. run requests MemoryStore to release unroll memory for this task (for both ON_HEAP and OFF_HEAP memory modes).

Note

run uses SparkEnv to access the current BlockManager that it uses to access MemoryStore.

run requests MemoryManager to notify any tasks waiting for execution memory to be freed to wake up and try to acquire memory again.

run unsets the task's TaskContext.

Note

run uses SparkEnv to access the current MemoryManager.

run is used when TaskRunner is requested to run (which happens when Executor is requested to launch a task on the "Executor task launch worker" thread pool at some point in the future).
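
The ordering of the steps above can be sketched as a try/catch/finally block. This is a simplified model, not Spark's implementation: LifecycleTask, OkTask and FailingTask are hypothetical, and the BlockManager, TaskContextImpl and MemoryManager interactions are reduced to entries in an event log.

```scala
import scala.collection.mutable.ArrayBuffer

abstract class LifecycleTask[T] {
  // Records each lifecycle step so the ordering can be inspected afterwards.
  val events: ArrayBuffer[String] = ArrayBuffer.empty[String]

  def runTask(taskAttemptId: Long): T

  // final, so subclasses customise runTask only and never the lifecycle itself.
  final def run(taskAttemptId: Long): T = {
    events += "registered with BlockManager"
    events += "TaskContext created"
    try {
      runTask(taskAttemptId)
    } catch {
      case e: Throwable =>
        // The failure notification happens first, then the error is rethrown.
        events += "context notified of failure"
        throw e
    } finally {
      // These steps run regardless of the outcome.
      events += "context notified of completion"
      events += "unroll memory released"
      events += "memory waiters notified"
      events += "TaskContext unset"
    }
  }
}

final class OkTask extends LifecycleTask[Int] {
  override def runTask(taskAttemptId: Long): Int = 42
}

final class FailingTask extends LifecycleTask[Int] {
  override def runTask(taskAttemptId: Long): Int =
    throw new IllegalStateException("boom")
}
```

In this sketch a failing task still records the completion and cleanup events, with the failure event logged before them.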

Task States

Task can be in one of the following states (as described by TaskState enumeration):

  • LAUNCHING
  • RUNNING when the task is being started.
  • FINISHED when the task finished with the serialized result.
  • FAILED when the task fails, e.g. when FetchFailedException, CommitDeniedException or any Throwable occurs.
  • KILLED when an executor kills a task.
  • LOST

States are the values of org.apache.spark.TaskState.

Note

Task status updates are sent from executors to the driver through ExecutorBackend.

Task is finished when it is in one of the FINISHED, FAILED, KILLED or LOST states.

LOST and FAILED states are considered failures.
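
The two rules above (which states count as finished and which count as failures) can be expressed as a small enumeration. SketchTaskState is a hypothetical, self-contained model of the rules as stated here and is not a copy of org.apache.spark.TaskState.

```scala
object SketchTaskState extends Enumeration {
  val LAUNCHING, RUNNING, FINISHED, FAILED, KILLED, LOST = Value

  // Terminal states: the task will not make further progress.
  private val finishedStates = Set(FINISHED, FAILED, KILLED, LOST)

  def isFinished(state: Value): Boolean = finishedStates.contains(state)

  // Only LOST and FAILED are treated as failures; KILLED is terminal but not a failure.
  def isFailed(state: Value): Boolean = state == LOST || state == FAILED
}
```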

Collecting Latest Values of Accumulators

collectAccumulatorUpdates(
  taskFailed: Boolean = false): Seq[AccumulableInfo]

collectAccumulatorUpdates collects the latest values of internal and external accumulators from a task (and returns the values as a collection of AccumulableInfo).

Internally, collectAccumulatorUpdates takes TaskMetrics.

Note

collectAccumulatorUpdates uses TaskContextImpl to access the task's TaskMetrics.

collectAccumulatorUpdates collects the latest values of:

collectAccumulatorUpdates returns an empty collection when TaskContextImpl is not initialized.

collectAccumulatorUpdates is used when TaskRunner runs a task (and sends a task's final results back to the driver).
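
A rough sketch of this behaviour follows. The text above does not spell out the selection rules, so the filters here are assumptions: internal accumulators are reported only when non-zero, and after a failure only external accumulators that opt in (via a hypothetical countFailedValues flag) are kept. SketchAccum, SketchMetricsContext and AccumTask are hypothetical types, not Spark's AccumulableInfo or TaskMetrics.

```scala
// Hypothetical, simplified accumulator record.
final case class SketchAccum(
    name: String,
    value: Long,
    internal: Boolean,
    countFailedValues: Boolean)

// Hypothetical stand-in for the context that gives access to the task's metrics.
final class SketchMetricsContext(val accumulators: Seq[SketchAccum])

final class AccumTask(context: SketchMetricsContext) {
  def collectAccumulatorUpdates(taskFailed: Boolean = false): Seq[SketchAccum] =
    if (context == null) {
      // Context not initialized: nothing to report.
      Seq.empty
    } else {
      val (internal, external) = context.accumulators.partition(_.internal)
      // Assumption: internal accumulators are skipped while still at zero.
      // Assumption: on failure, only opted-in external accumulators are kept.
      internal.filter(_.value != 0L) ++
        external.filter(a => !taskFailed || a.countFailedValues)
    }
}
```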

Killing Task

kill(
  interruptThread: Boolean): Unit

kill marks the task to be killed, i.e. it sets the internal _killed flag to true.

kill calls TaskContextImpl.markInterrupted when context is set.

If interruptThread is enabled and the internal taskThread is available, kill interrupts it.
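
The three steps of kill can be sketched as follows. KillableTask and InterruptibleContext are hypothetical stand-ins, and the sketch assumes that the context and the task thread stay unset (null) until the task starts running, which is why both are guarded.

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical stand-in for the interruption flag kept by TaskContextImpl.
final class InterruptibleContext {
  private val interrupted = new AtomicBoolean(false)
  def markInterrupted(): Unit = interrupted.set(true)
  def isInterrupted: Boolean = interrupted.get()
}

final class KillableTask {
  // volatile because kill is typically called from a different thread than run.
  @volatile private var _killed = false
  // Assumption: both remain null until the task starts running.
  @volatile var context: InterruptibleContext = null
  @volatile var taskThread: Thread = null

  def killed: Boolean = _killed

  def kill(interruptThread: Boolean): Unit = {
    _killed = true                                      // 1. mark the task as killed
    if (context != null) context.markInterrupted()      // 2. tell the context, if any
    if (interruptThread && taskThread != null) {
      taskThread.interrupt()                            // 3. optionally interrupt the thread
    }
  }
}
```

Setting the flag first means a task that has not started yet can still observe the kill request later, even when there is no context or thread to notify.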

CAUTION: FIXME When could context and interruptThread not be set?


Last update: 2020-11-29