Stage¶

Stage is an abstraction of steps in a physical execution plan.

Note

The logical DAG or logical execution plan is the RDD lineage.

Indirectly, a Stage is a set of parallel tasks, one task per partition of an RDD, each computing a partial result of a function executed as part of a Spark job.

Figure: Stage, tasks and submitting a job

In other words, a Spark job is a computation "sliced" (to avoid the reserved term "partitioned") into stages.
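The relationship between a job, its stages, and their tasks can be illustrated with a toy model. This is only a sketch: `SketchStage`, `SketchTask`, and `JobSketch` are hypothetical names made up for illustration and are not part of Spark.

```scala
// Toy model, not Spark's API: all names here are illustrative assumptions.
final case class SketchTask(stageId: Int, partitionId: Int)

final case class SketchStage(id: Int, numPartitions: Int) {
  // One task per partition of the stage's RDD.
  def tasks: Seq[SketchTask] =
    (0 until numPartitions).map(p => SketchTask(id, p))
}

object JobSketch {
  // A job "sliced" into two stages with 4 and 2 partitions respectively.
  val stages: Seq[SketchStage] = Seq(SketchStage(0, 4), SketchStage(1, 2))

  // All tasks of the job, stage by stage.
  val allTasks: Seq[SketchTask] = stages.flatMap(_.tasks)
}
```

In this model the job ends up with six tasks in total, four for the first stage and two for the second.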

Contract¶

Missing Partitions¶

findMissingPartitions(): Seq[Int]

Missing partitions (IDs of the partitions of the RDD that are missing and need to be computed)

Used when:

DAGScheduler is requested to submit missing tasks
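One way a stage could answer findMissingPartitions is to remember which partitions have already been computed and report the rest. The sketch below assumes a simple per-partition "finished" flag; `SketchStageWithOutputs` and `markFinished` are hypothetical names, and the concrete Stages in Spark track this state differently.

```scala
// Simplified sketch; SketchStageWithOutputs is a hypothetical class, not Spark's.
final class SketchStageWithOutputs(val numPartitions: Int) {
  // finished(i) is true once partition i has a computed result.
  private val finished: Array[Boolean] = Array.fill(numPartitions)(false)

  // Hypothetical helper: records that a partition has been computed.
  def markFinished(partitionId: Int): Unit = {
    finished(partitionId) = true
  }

  // IDs of the partitions that still need to be computed.
  def findMissingPartitions(): Seq[Int] =
    (0 until numPartitions).filter(id => !finished(id))
}
```

A scheduler would then submit tasks only for the returned partition IDs instead of recomputing the whole stage.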

Implementations¶

ResultStage
ShuffleMapStage

Creating Instance¶

Stage takes the following to be created:

Stage ID
RDD
Number of tasks
Parent Stages
First Job ID
CallSite
Resource Profile ID

Abstract Class

Stage is an abstract class and cannot be instantiated directly. It is created indirectly whenever one of the concrete Stages is created.
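The shape of such an abstract class can be sketched as follows. The constructor parameters follow the list above, but the types are stand-ins: `SketchRDD`, `SketchCallSite`, `AbstractStageSketch`, and `AllMissingStage` are hypothetical names used only for illustration, not Spark classes.

```scala
// Stand-ins for Spark's RDD and CallSite; assumptions for illustration only.
final case class SketchRDD(numPartitions: Int)
final case class SketchCallSite(shortForm: String)

abstract class AbstractStageSketch(
    val id: Int,
    val rdd: SketchRDD,
    val numTasks: Int,
    val parents: List[AbstractStageSketch],
    val firstJobId: Int,
    val callSite: SketchCallSite,
    val resourceProfileId: Int) {
  // The contract every concrete stage has to implement.
  def findMissingPartitions(): Seq[Int]
}

// A hypothetical concrete stage in which every partition is still missing.
final class AllMissingStage(stageId: Int, stageRdd: SketchRDD, jobId: Int)
    extends AbstractStageSketch(
      stageId, stageRdd, stageRdd.numPartitions, Nil, jobId,
      SketchCallSite("sketch"), 0) {
  override def findMissingPartitions(): Seq[Int] = 0 until numTasks
}
```

Only the concrete subclass can be instantiated; an attempt to write `new AbstractStageSketch(...)` without a body is rejected by the compiler.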

RDD¶

Stage is given an RDD when created.

Stage ID¶

Stage is given a unique ID when created.

Note

DAGScheduler uses the nextStageId internal counter to track the number of stage submissions.
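A counter of this kind can be sketched with an atomic integer that hands out the current value and then advances. `StageIdSketch` and `newStageId` are hypothetical names, and the use of `AtomicInteger` here is an assumption made for the sketch rather than a statement about DAGScheduler's internals.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical helper object; only the counter idea comes from the text above.
object StageIdSketch {
  // Next ID to hand out; it also equals the number of IDs issued so far.
  private val nextStageId = new AtomicInteger(0)

  // Returns a fresh ID and advances the counter in one atomic step.
  def newStageId(): Int = nextStageId.getAndIncrement()
}
```

Because the read and the increment happen atomically, two concurrent callers never receive the same ID.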

Making New Stage Attempt¶

makeNewStageAttempt(
  numPartitionsToCompute: Int,
  taskLocalityPreferences: Seq[Seq[TaskLocation]] = Seq.empty): Unit

makeNewStageAttempt creates a new TaskMetrics and requests it to register itself with the SparkContext of the RDD.

makeNewStageAttempt creates a StageInfo from this Stage (and the nextAttemptId). This StageInfo is saved in the _latestInfo internal registry.

In the end, makeNewStageAttempt increments the nextAttemptId internal counter.

Note

makeNewStageAttempt returns Unit (nothing) and its purpose is to update the latest StageInfo internal registry.

makeNewStageAttempt is used when:

DAGScheduler is requested to submit the missing tasks of a stage
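The bookkeeping described above can be sketched in a few lines. This is a simplified model: `SketchStageInfo` and `AttemptTrackingStage` are hypothetical names, the latest info is held in an `Option` for simplicity, and the TaskMetrics registration and task locality preferences are left out.

```scala
// Hypothetical stand-in; Spark's real StageInfo carries far more state.
final case class SketchStageInfo(stageId: Int, attemptId: Int, numTasks: Int)

final class AttemptTrackingStage(val id: Int) {
  // ID that the next stage attempt will get.
  private var nextAttemptId: Int = 0

  // Info about the most recent attempt (None before the first attempt).
  private var _latestInfo: Option[SketchStageInfo] = None

  def latestInfo: Option[SketchStageInfo] = _latestInfo

  def makeNewStageAttempt(numPartitionsToCompute: Int): Unit = {
    // Creating and registering a fresh TaskMetrics is omitted in this sketch.
    _latestInfo = Some(SketchStageInfo(id, nextAttemptId, numPartitionsToCompute))
    // In the end, advance the counter for the following attempt.
    nextAttemptId += 1
  }
}
```

Each call replaces the latest info with a record for the new attempt, so a resubmitted stage is distinguishable from its earlier attempts by the attempt ID.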