
== [[ResultTask]] ResultTask -- Task to Compute Result for ResultStage

ResultTask is a scheduler:Task.md[Task] that <<runTask, executes a function on the records of an RDD partition>> to compute the result of a ResultStage.

<<creating-instance, ResultTask is created>> exclusively when scheduler:DAGScheduler.md#submitMissingTasks[DAGScheduler submits missing tasks for a ResultStage].

ResultTask is created with a <<taskBinary, broadcast>> with the serialized RDD and the function to execute it on, and the <<partition, partition>> to compute.

[[internal-registries]]
.ResultTask's Internal Registries and Counters
[cols="1,2",options="header",width="100%"]
|===
| Name
| Description

| [[preferredLocs]] preferredLocs
| Collection of scheduler:TaskLocation.md[TaskLocations].

Corresponds directly to the unique entries in <<locs, locs>>; when locs is not defined, preferredLocs is empty and no task location preferences are defined.

Initialized when <<creating-instance, ResultTask is created>>.

Used exclusively when ResultTask is requested for the <<preferredLocations, preferred locations>>.

|===

=== [[creating-instance]] Creating ResultTask Instance

ResultTask takes the following when created:

* stageId -- the stage the task is executed for
* stageAttemptId -- the stage attempt id
* [[taskBinary]] ROOT:Broadcast.md[] with the serialized task (as `Array[Byte]`). The broadcast contains a serialized pair of the RDD and the function to execute.
* [[partition]] spark-rdd-Partition.md[Partition] to compute
* [[locs]] Collection of scheduler:TaskLocation.md[TaskLocations], i.e. preferred locations (executors) to execute the task on
* [[outputId]] outputId
* [[localProperties]] Local `Properties`
* [[serializedTaskMetrics]] The stage's serialized executor:TaskMetrics.md[] (as `Array[Byte]`)
* [[jobId]] (optional) spark-scheduler-ActiveJob.md[Job] id
* [[appId]] (optional) Application id
* [[appAttemptId]] (optional) Application attempt id

ResultTask initializes the <<internal-registries, internal registries and counters>>.
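A compact way to read the parameter list above is as the shape of the constructor. The sketch below is an illustration only: it uses plain Scala stand-ins for the Spark-internal types (`Broadcast`, `Partition`, `TaskLocation`), so the types here are assumptions, not Spark's actual definitions.

```scala
// Plain stand-ins for Spark-internal types (assumptions for illustration).
case class Partition(index: Int)
case class TaskLocation(host: String)

// Simplified sketch of ResultTask's constructor parameters; the broadcast
// wrapper around taskBinary is dropped and Array[Byte] is used directly.
case class ResultTaskSketch(
  stageId: Int,                          // stage the task is executed for
  stageAttemptId: Int,                   // stage attempt id
  taskBinary: Array[Byte],               // serialized (RDD, function) pair
  partition: Partition,                  // partition to compute
  locs: Seq[TaskLocation],               // preferred locations (executors)
  outputId: Int,
  localProperties: java.util.Properties,
  serializedTaskMetrics: Array[Byte],
  jobId: Option[Int] = None,             // optional job id
  appId: Option[String] = None,          // optional application id
  appAttemptId: Option[String] = None)   // optional application attempt id

val task = ResultTaskSketch(
  stageId = 0, stageAttemptId = 0, taskBinary = Array.empty,
  partition = Partition(0), locs = Nil, outputId = 0,
  localProperties = new java.util.Properties(),
  serializedTaskMetrics = Array.empty)
println(task.jobId)
```

Note that the last three parameters are optional and default to `None`, matching the "(optional)" entries in the list above.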

=== [[preferredLocations]] preferredLocations Method

[source, scala]
----
preferredLocations: Seq[TaskLocation]
----

NOTE: preferredLocations is part of scheduler:Task.md#contract[Task contract].

preferredLocations simply returns the <<preferredLocs, preferredLocs>> internal property.
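The preferred-locations behavior described above (keep the unique entries of <<locs, locs>>, or report no preferences when locs is not defined) can be sketched as follows; task locations are simplified to plain host-name strings here, which is an assumption for illustration:

```scala
// Sketch of how ResultTask derives preferredLocs from the locs it was
// created with: unique entries, or empty when locs is not defined (null),
// meaning no task location preferences.
def preferredLocs(locs: Seq[String]): Seq[String] =
  if (locs == null) Nil      // locs not defined: no preferences
  else locs.distinct         // unique entries only

assert(preferredLocs(Seq("host0", "host1", "host0")) == Seq("host0", "host1"))
assert(preferredLocs(null) == Nil)
```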

=== [[runTask]] Deserialize RDD and Function (From Broadcast) and Execute Function (on RDD Partition) -- runTask Method

[source, scala]
----
runTask(context: TaskContext): U
----

NOTE: U is the type of the result as defined when <<creating-instance, ResultTask is created>>.

runTask deserializes an RDD and a function from the <<taskBinary, taskBinary>> broadcast and then executes the function (on the records of the RDD <<partition, partition>>).

NOTE: runTask is part of scheduler:Task.md#contract[Task contract] to run a task.

Internally, runTask starts by tracking the time required to deserialize the RDD and the function to execute.

runTask serializer:Serializer.md#newInstance[creates a new closure Serializer].

NOTE: runTask uses core:SparkEnv.md#closureSerializer[SparkEnv to access the current closure Serializer].

runTask serializer:Serializer.md#deserialize[requests the closure Serializer to deserialize the RDD and the function to execute] (from the <<taskBinary, taskBinary>> broadcast).

NOTE: The <<taskBinary, taskBinary>> broadcast is defined when <<creating-instance, ResultTask is created>>.

runTask records scheduler:Task.md#_executorDeserializeTime[_executorDeserializeTime] and scheduler:Task.md#_executorDeserializeCpuTime[_executorDeserializeCpuTime] properties.

In the end, runTask executes the function (passing in the input context and the rdd:index.md#iterator[records of the partition of the RDD]).

NOTE: The partition used to access the records in the deserialized RDD is defined when <<creating-instance, ResultTask is created>>.
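The essence of the steps above (deserialize a pair of data and function from the task binary, then execute the function on the partition's records) can be sketched in a self-contained way. Spark's types are stood in by plain Scala values here, which is an assumption for illustration: the "RDD partition" is a `Seq` of records and the task binary is a Java-serialized `(records, function)` pair, mirroring the serialized (RDD, function) pair in the <<taskBinary, taskBinary>> broadcast.

```scala
import java.io._

// Serialize any object with plain Java serialization (Spark's closure
// serializer is pluggable; Java serialization stands in for it here).
def serialize(obj: AnyRef): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(obj)
  oos.close()
  bos.toByteArray
}

def deserialize[T](bytes: Array[Byte]): T =
  new ObjectInputStream(new ByteArrayInputStream(bytes))
    .readObject().asInstanceOf[T]

// "runTask": deserialize the records and the function from the task
// binary, then execute the function on the records and return its result
// (of type U, fixed when the task was created).
def runTask[T, U](taskBinary: Array[Byte]): U = {
  val (records, func) = deserialize[(Seq[T], Iterator[T] => U)](taskBinary)
  func(records.iterator)
}

// "Driver" side: serialize the (records, function) pair;
// "executor" side: run the task against it.
val taskBinary = serialize((Seq(1, 2, 3, 4), (it: Iterator[Int]) => it.sum))
println(runTask[Int, Int](taskBinary))
```

The real runTask additionally records the deserialization wall-clock and CPU times (as described above) before invoking the function.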


Last update: 2020-10-06