TaskSet¶
TaskSet is a collection of independent tasks of a stage (and a stage execution attempt) that are missing (uncomputed), i.e. for which computation results are unavailable (as RDD blocks on BlockManagers on executors).
In other words, a TaskSet represents the missing partitions of a stage that (as tasks) can be run right away based on the data already available on the cluster (e.g. map output files from previous stages), though they may fail if this data becomes unavailable.
Since only the missing tasks are included, their number is not necessarily the total number of tasks of the stage. For a brand-new stage (one that has never been attempted), the two numbers are exactly the same.
Once DAGScheduler submits the missing tasks for execution (to the TaskScheduler), the execution of the TaskSet is managed by a TaskSetManager, which allows each task to fail up to spark.task.maxFailures times before the whole TaskSet is aborted.
Creating Instance¶
TaskSet takes the following to be created:
- Tasks
- Stage ID
- Stage (Execution) Attempt ID
- FIFO Priority
- Local Properties
- Resource Profile ID
TaskSet is created when:
- DAGScheduler is requested to submit the missing tasks of a stage
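The shape of a TaskSet can be sketched as a plain Scala class with the six constructor arguments above. This is an illustrative model only: TaskSetModel is a made-up name, the real org.apache.spark.scheduler.TaskSet is private[spark], and its tasks are Array[Task[_]] rather than the simplified stand-in used here.

```scala
import java.util.Properties

// Illustrative model of TaskSet's shape (not Spark's own class).
class TaskSetModel(
    val tasks: Seq[String],        // stand-in for the stage's missing tasks
    val stageId: Int,
    val stageAttemptId: Int,
    val priority: Int,             // FIFO priority (ID of the earliest active job needing the stage)
    val properties: Properties,    // local properties of the submitting job
    val resourceProfileId: Int) {

  // Unique ID: stage ID and stage attempt ID with a dot in-between
  val id: String = s"$stageId.$stageAttemptId"

  // Textual representation: "TaskSet [stageId].[stageAttemptId]"
  override def toString: String = s"TaskSet $id"
}
```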
ID¶
id: String
TaskSet is uniquely identified by an id that is the stageId followed by the stageAttemptId with a dot (.) in-between:
[stageId].[stageAttemptId]
Textual Representation¶
toString: String
toString follows the pattern:
TaskSet [stageId].[stageAttemptId]
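As a usage sketch, the TaskSetModel from the Creating Instance section above (an illustrative stand-in, not Spark's own class) produces the ID and the textual representation as follows, for a hypothetical stage 2 at attempt 1:

```scala
val taskSet = new TaskSetModel(
  tasks = Seq("task for partition 0", "task for partition 1"),
  stageId = 2,
  stageAttemptId = 1,
  priority = 0,
  properties = new java.util.Properties(),
  resourceProfileId = 0)

println(taskSet.id)        // 2.1
println(taskSet.toString)  // TaskSet 2.1
```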
Task Scheduling Prioritization (FIFO Scheduling)¶
TaskSet is given a priority when created.
The priority is the ID of the earliest-created active job that needs the stage, and is given when DAGScheduler is requested to submit the missing tasks of a stage.
Once submitted for execution, the priority becomes the priority of the TaskSetManager (a Schedulable) and is used to prioritize the scheduling of tasks in the FIFO scheduling mode.
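As a rough sketch of what FIFO prioritization means in practice: a TaskSetManager with a lower priority value (i.e. an earlier job ID) is scheduled first, with ties broken by the lower stage ID (the actual comparison is implemented by FIFOSchedulingAlgorithm). The ManagerKey and runsBefore names below are made up for illustration:

```scala
// Illustrative keys standing in for TaskSetManagers (Schedulables).
case class ManagerKey(priority: Int, stageId: Int)

// FIFO ordering: lower priority (earlier job) first; ties broken by lower stage ID.
def runsBefore(a: ManagerKey, b: ManagerKey): Boolean =
  if (a.priority != b.priority) a.priority < b.priority
  else a.stageId < b.stageId

val pending = Seq(
  ManagerKey(priority = 1, stageId = 5),
  ManagerKey(priority = 0, stageId = 7))

println(pending.sortWith(runsBefore))
// List(ManagerKey(0,7), ManagerKey(1,5)) -- the TaskSet of the earlier job (0) runs first
```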