# Stage

`Stage` is an abstraction of steps in a physical execution plan.
!!! note
    The logical DAG or logical execution plan is the RDD lineage.
Indirectly, a `Stage` is a set of parallel tasks - one task per partition (of an RDD that computes partial results of a function executed as part of a Spark job).

In other words, a Spark job is a computation "sliced" (not to use the reserved term _partitioned_) into stages.
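The slicing can be pictured with a minimal model. The following sketch uses illustrative names (`Task`, `SimpleStage`, `StageSlicingDemo` are not Spark's classes) to show the relationship described above: a job is a sequence of stages, and each stage is a set of parallel tasks, one per partition of its RDD.

```scala
// Simplified model (illustrative names, not Spark's actual classes):
// a stage is a set of parallel tasks - one task per partition.
case class Task(stageId: Int, partitionId: Int)

case class SimpleStage(id: Int, numPartitions: Int) {
  // One parallel task per partition of the stage's RDD.
  def tasks: Seq[Task] = (0 until numPartitions).map(Task(id, _))
}

object StageSlicingDemo {
  def main(args: Array[String]): Unit = {
    // A hypothetical job "sliced" into two stages
    // (e.g. at a shuffle boundary).
    val job = Seq(
      SimpleStage(id = 0, numPartitions = 4),
      SimpleStage(id = 1, numPartitions = 2))
    println(job.flatMap(_.tasks).size) // total tasks across the job
  }
}
```

Running the demo prints `6`: four tasks for the first stage plus two for the second.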
## Contract

### Missing Partitions

```scala
findMissingPartitions(): Seq[Int]
```
Missing partitions (the IDs of the partitions of the RDD that are missing and need to be computed)

Used when:

* `DAGScheduler` is requested to submit missing tasks
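A concrete stage could satisfy this contract as in the following sketch. The `SketchStage` and `TrackedStage` classes are hypothetical (not Spark's real code); they only illustrate the shape of the contract, assuming the stage knows which partitions already have output.

```scala
// Illustrative sketch of the contract (not Spark's real code).
abstract class SketchStage(val numPartitions: Int) {
  // The contract: IDs of partitions that are missing
  // and need to be computed.
  def findMissingPartitions(): Seq[Int]
}

// A hypothetical implementation that tracks computed partitions.
class TrackedStage(numPartitions: Int, computed: Set[Int])
    extends SketchStage(numPartitions) {
  // A partition is missing when no output has been
  // recorded for it yet.
  override def findMissingPartitions(): Seq[Int] =
    (0 until numPartitions).filterNot(computed)
}
```

For example, `new TrackedStage(4, Set(0, 2)).findMissingPartitions()` yields `Seq(1, 3)`.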
## Implementations

* `ResultStage`
* `ShuffleMapStage`
## Creating Instance

`Stage` takes the following to be created:
!!! note "Abstract Class"
    `Stage` is an abstract class and cannot be created directly. It is created indirectly for the concrete Stages.
## RDD

`Stage` is given an RDD when created.
## Stage ID

`Stage` is given a unique ID when created.
!!! note
    `DAGScheduler` uses the `nextStageId` internal counter to track the number of stage submissions.
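Such a counter can be sketched as a thread-safe, monotonically increasing ID generator. The object below is illustrative only (a sketch in the spirit of `nextStageId`, not `DAGScheduler`'s actual code):

```scala
import java.util.concurrent.atomic.AtomicInteger

// Illustrative sketch of a stage ID counter
// (not DAGScheduler's actual code).
object StageIdCounter {
  private val nextStageId = new AtomicInteger(0)

  // Every new stage gets the current value;
  // the counter then advances by one.
  def newStageId(): Int = nextStageId.getAndIncrement()
}
```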
## Making New Stage Attempt

```scala
makeNewStageAttempt(
  numPartitionsToCompute: Int,
  taskLocalityPreferences: Seq[Seq[TaskLocation]] = Seq.empty): Unit
```
`makeNewStageAttempt` creates a new `TaskMetrics` and requests it to register itself with the `SparkContext` of the RDD.

`makeNewStageAttempt` creates a `StageInfo` from this `Stage` (and the `nextAttemptId`). The `StageInfo` is saved in the `_latestInfo` internal registry.

In the end, `makeNewStageAttempt` increments the `nextAttemptId` internal counter.
!!! note
    `makeNewStageAttempt` returns `Unit` (nothing); its purpose is to update the `_latestInfo` internal registry with the latest `StageInfo`.
`makeNewStageAttempt` is used when:

* `DAGScheduler` is requested to submit the missing tasks of a stage
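The bookkeeping steps above can be sketched in a few lines. The classes below (`StageInfoSketch`, `AttemptStage`) are hypothetical and elide the `TaskMetrics` registration; they only illustrate saving the latest `StageInfo` and incrementing the attempt counter, as described above.

```scala
// Illustrative sketch of makeNewStageAttempt's bookkeeping
// (not Spark's actual implementation; TaskMetrics is elided).
case class StageInfoSketch(stageId: Int, attemptId: Int, numTasks: Int)

class AttemptStage(val id: Int) {
  private var nextAttemptId = 0
  private var _latestInfo: Option[StageInfoSketch] = None

  def latestInfo: Option[StageInfoSketch] = _latestInfo

  def makeNewStageAttempt(numPartitionsToCompute: Int): Unit = {
    // Create a StageInfo from this stage and the current
    // nextAttemptId, and save it as the latest info...
    _latestInfo =
      Some(StageInfoSketch(id, nextAttemptId, numPartitionsToCompute))
    // ...then, in the end, increment the attempt counter.
    nextAttemptId += 1
  }
}
```

Calling `makeNewStageAttempt` twice on the same stage records attempt IDs `0` and then `1`, with `latestInfo` always pointing at the most recent attempt.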