

TaskSet is a collection of tasks of a single stage (and a stage execution attempt) that are missing (uncomputed), i.e. for which computation results are unavailable (as RDD blocks on BlockManagers on executors).

In other words, a TaskSet represents the missing partitions of a stage that (as tasks) can be run right away based on the data that is already on the cluster, e.g. map output files from previous stages, though they may fail if this data becomes unavailable.

NOTE: Since the tasks of a TaskSet are only the missing tasks, their number does not have to equal the number of all the tasks of a stage. For a brand new stage (one that has never been attempted), the two numbers are exactly the same.
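
The relationship between a stage's partitions and the tasks of a TaskSet can be sketched as follows. This is a minimal illustration only: `SketchTask`, `MissingPartitions`, and the `computed` set are hypothetical names that are not part of Spark, and the real bookkeeping of computed partitions is more involved.

```scala
// Hypothetical stand-in for a task; not Spark's Task class.
final case class SketchTask(stageId: Int, partitionId: Int)

object MissingPartitions {
  // Partitions of a stage whose results are not available yet.
  def missing(numPartitions: Int, computed: Set[Int]): Seq[Int] =
    (0 until numPartitions).filterNot(computed.contains)

  // One task per missing partition; these are the tasks a TaskSet would carry.
  def tasksFor(stageId: Int, numPartitions: Int, computed: Set[Int]): Array[SketchTask] =
    missing(numPartitions, computed).map(p => SketchTask(stageId, p)).toArray
}
```

With an empty `computed` set (a brand new stage) `tasksFor` yields one task per partition; once some partitions are available, only the remaining ones become tasks.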

TaskSet is created exclusively when DAGScheduler is requested to submit the missing tasks of a stage.

NOTE: Once submitted for execution (to a TaskScheduler), the execution of the TaskSet is managed by a TaskSetManager that allows for spark.task.maxFailures (default: 1 for local mode and 4 for cluster mode).

[[creating-instance]] TaskSet takes the following to be created:

  • [[tasks]] Collection of tasks (Array[Task[_]])
  • [[stageId]] Stage ID
  • [[stageAttemptId]] Stage execution attempt ID
  • [[priority]] Priority (for FIFO scheduling)
  • [[properties]] Key-value properties

[[id]] TaskSet is uniquely identified by an id that is the stage ID followed by the stage execution attempt ID with a dot (.) in between, i.e. [stageId].[stageAttemptId].


[[toString]] A textual representation (toString) of TaskSet is TaskSet [id].

TaskSet [stageId].[stageAttemptId]
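
The five inputs, the id, and the textual representation described above can be put together in a simplified model. This is a sketch under assumptions, not Spark's source: `SketchTask` and `SketchTaskSet` are hypothetical names, and the Int types for the IDs and priority and `java.util.Properties` for the key-value properties are choices made for this illustration.

```scala
import java.util.Properties

// Hypothetical stand-in for Spark's Task; only here to keep the sketch self-contained.
final case class SketchTask(partitionId: Int)

// Simplified model of a TaskSet that takes the five inputs listed above.
class SketchTaskSet(
    val tasks: Array[SketchTask],
    val stageId: Int,
    val stageAttemptId: Int,
    val priority: Int,
    val properties: Properties) {

  // Stage ID and stage execution attempt ID joined by a dot.
  val id: String = s"${stageId}.${stageAttemptId}"

  // Textual representation in the form "TaskSet [id]".
  override def toString: String = s"TaskSet $id"
}
```

For example, `new SketchTaskSet(Array(SketchTask(0)), 3, 1, 0, new Properties()).toString` gives `TaskSet 3.1`.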

== [[fifo-scheduling]] Task Scheduling Prioritization in FIFO Scheduling

The priority of a TaskSet is exactly the ID of the earliest-created active job that needs the stage (that is given when DAGScheduler is requested to submit the missing tasks of a stage).

Once submitted for execution (to a TaskScheduler), the priority of a TaskSet is the priority of the TaskSetManager (which is a Schedulable) that is used for task prioritization (prioritizing scheduling of tasks) in the FIFO scheduling mode.
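
How a priority derived from a job ID leads to first-in-first-out ordering can be sketched with a plain sort. The names `SketchSchedulable` and `FifoOrder` are hypothetical, and falling back to the stage ID when two priorities are equal is an assumption of this sketch rather than something stated above.

```scala
// Hypothetical stand-in for a schedulable task set; not Spark's TaskSetManager.
final case class SketchSchedulable(name: String, priority: Int, stageId: Int)

object FifoOrder {
  // A lower priority value (an earlier job ID) is scheduled first.
  // Assumption of this sketch: ties are broken by the lower stage ID.
  val ordering: Ordering[SketchSchedulable] =
    Ordering.by((s: SketchSchedulable) => (s.priority, s.stageId))

  // Returns the queue in the order in which the task sets would be offered resources.
  def sorted(queue: Seq[SketchSchedulable]): Seq[SketchSchedulable] =
    queue.sorted(ordering)
}
```

Because the priority is the ID of the earliest-created active job, task sets that belong to older jobs sort ahead of those that belong to newer jobs.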

Last update: 2020-11-27