Skip to content

ActiveJob

ActiveJob (job, action job) is a top-level work item (computation) submitted to DAGScheduler for execution (usually to compute the result of an RDD action).

RDD actions submit jobs to DAGScheduler

Executing a job is equivalent to computing the partitions of the RDD an action has been executed upon. The number of partitions (numPartitions) to compute in a job depends on the type of a stage (ResultStage or ShuffleMapStage).

A job starts with a single target RDD, but can ultimately include other RDDs that are all part of RDD lineage.

The parent stages are always ShuffleMapStages.

Computing a job is computing the partitions of an RDD

Note

Not always all partitions have to be computed for ResultStages (e.g. for actions like first() and lookup()).

Creating Instance

ActiveJob takes the following to be created:

ActiveJob is created when:

Final Stage

ActiveJob is given a Stage when created that determines a logical type:

  1. Map-Stage Job that computes the map output files for a ShuffleMapStage (for submitMapStage) before any downstream stages are submitted
  2. Result job that computes a ResultStage to execute an action

Finished (Computed) Partitions

ActiveJob uses finished registry of flags to track partitions that have already been computed (true) or not (false).