RDD is a description of a fault-tolerant and resilient computation over a possibly distributed collection of records (of type T).
compute(split: Partition, context: TaskContext): Iterator[T]
compute is implemented by every concrete RDD in Spark and is called every time the records are requested, unless the RDD is cached or checkpointed (in which case the records are read from external storage, this time closer to the compute node).
compute runs on executors as part of running tasks (not on the driver).
compute is used when RDD is requested to computeOrReadCheckpoint.
getPartitions is used when RDD is requested for the partitions (called only once as the value is cached afterwards).
getDependencies is used when RDD is requested for the dependencies (called only once as the value is cached afterwards).
getPreferredLocations(split: Partition): Seq[String] = Nil
partitioner: Option[Partitioner] = None
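The contract above can be sketched as a minimal custom RDD. This is an illustrative sketch, not part of Spark: RangeSketchRDD and SlicePartition are hypothetical names, and the RDD simply serves the integers 0 until count from numSlices partitions.

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Illustrative partition type (not part of Spark's API)
final case class SlicePartition(index: Int) extends Partition

// A hypothetical RDD that serves the integers 0 until count from numSlices partitions.
class RangeSketchRDD(sc: SparkContext, numSlices: Int, count: Int)
  extends RDD[Int](sc, Nil) {  // Nil: no parent RDDs (no dependencies)

  // Called only once; the result is cached by RDD.partitions afterwards.
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numSlices)(SlicePartition(_))

  // Called every time the records of a partition are requested
  // (unless the RDD is cached or checkpointed).
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val perSlice = count / numSlices
    val start = split.index * perSlice
    Iterator.range(start, start + perSlice)
  }
}

// Usage (local mode for illustration):
// val sc = new SparkContext("local[2]", "rdd-contract")
// new RangeSketchRDD(sc, numSlices = 4, count = 100).collect()  // 0 until 100
```

getPreferredLocations and partitioner are left at their defaults (Nil and None), matching the base-class definitions above.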
RDD can have a Partitioner defined.
The result of calling map-like operations (e.g. map or flatMap) has no Partitioner defined, as such operations can change the keys.
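A quick sketch of this behavior (local mode, names illustrative): a shuffle such as reduceByKey installs a HashPartitioner, mapValues preserves it, while map drops it because it may change the keys.

```scala
import org.apache.spark.SparkContext

val sc = new SparkContext("local[2]", "partitioner-sketch")

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
assert(pairs.partitioner.isEmpty)          // no Partitioner before a shuffle

val reduced = pairs.reduceByKey(_ + _)     // ShuffledRDD with a HashPartitioner
assert(reduced.partitioner.isDefined)

assert(reduced.mapValues(_ * 2).partitioner == reduced.partitioner)  // keys untouched
assert(reduced.map(identity).partitioner.isEmpty)                    // keys may change

sc.stop()
```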
An RDD can be part of a barrier stage. By default, the isBarrier flag is enabled (true) when there is at least one parent RDD (reached over non-shuffle dependencies) with the flag enabled.

ShuffledRDD has the flag always disabled.

MapPartitionsRDD is the only RDD that can have the flag explicitly enabled (when created with its isFromBarrier flag on, i.e. by RDDBarrier.mapPartitions).
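A barrier stage is requested through RDD.barrier, whose RDDBarrier.mapPartitions creates a MapPartitionsRDD with the flag enabled. A sketch (local mode with at least as many cores as partitions, since all barrier tasks must launch together):

```scala
import org.apache.spark.SparkContext

val sc = new SparkContext("local[2]", "barrier-sketch")

val doubled = sc.parallelize(1 to 8, numSlices = 2)
  .barrier()                 // RDDBarrier
  .mapPartitions { it =>     // MapPartitionsRDD with isBarrier enabled
    // Inside, BarrierTaskContext.get() offers barrier() for global sync points.
    it.map(_ * 2)
  }

assert(doubled.collect().sorted.toSeq == Seq(2, 4, 6, 8, 10, 12, 14, 16))
sc.stop()
```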
getOrCompute(partition: Partition, context: TaskContext): Iterator[T]
getOrCompute records whether the block was read from the cache or had to be computed, using the readCachedBlock flag.
InterruptibleIterator simply delegates to a wrapped internal Iterator, while also allowing the task to be killed (it checks for a kill request on every hasNext).
getOrCompute branches off per the result of requesting BlockManager to getOrElseUpdate, which gives either a BlockResult (the block was found in or stored to the block store) or an iterator of records that could not be stored:

* For a BlockResult available and the readCachedBlock flag on, getOrCompute returns an InterruptibleIterator that also updates the input metrics (bytes and records read) of the task.
* For a BlockResult available and the readCachedBlock flag off (the block has just been computed and stored), getOrCompute returns a plain InterruptibleIterator over the block data.
* For Right(iter) (regardless of the value of the readCachedBlock flag, since the block could not be stored and the records have to be passed on directly), getOrCompute returns an InterruptibleIterator over iter.
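getOrCompute only comes into play with a storage level other than NONE. A sketch (local mode): the first action computes and stores the blocks; the second finds them again through BlockManager.getOrElseUpdate.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext("local[2]", "cache-sketch")

val doubled = sc.parallelize(1 to 10).map(_ * 2).persist(StorageLevel.MEMORY_ONLY)

doubled.count()  // computes the partitions and stores the blocks
doubled.count()  // getOrCompute finds the cached blocks (readCachedBlock is true)

assert(sc.getPersistentRDDs.nonEmpty)  // the RDD is registered as persistent
sc.stop()
```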
dependencies returns the dependencies of an RDD.

dependencies checks whether the RDD is checkpointed and acts accordingly.

For an RDD being checkpointed, dependencies returns a single-element collection with a OneToOneDependency.
For a non-checkpointed RDD, the dependencies collection is computed using getDependencies (and cached afterwards).
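A sketch of what dependencies gives back for a non-checkpointed RDD (local mode): a map yields a OneToOneDependency on its parent, while a shuffle operation yields a ShuffleDependency.

```scala
import org.apache.spark.{OneToOneDependency, ShuffleDependency, SparkContext}

val sc = new SparkContext("local[2]", "deps-sketch")

val mapped = sc.parallelize(1 to 10).map(n => (n % 2, n))
assert(mapped.dependencies.head.isInstanceOf[OneToOneDependency[_]])

val reduced = mapped.reduceByKey(_ + _)
assert(reduced.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]])

sc.stop()
```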
getNarrowAncestors is used when StageInfo is requested to fromStage.
persist(): this.type
persist(newLevel: StorageLevel): this.type
Refer to Persisting RDD.
localCheckpoint marks this RDD for local checkpointing using Spark’s caching layer.
getNumPartitions gives the number of partitions of an RDD.
scala> sc.textFile("README.md").getNumPartitions
res0: Int = 2

scala> sc.textFile("README.md", 5).getNumPartitions
res1: Int = 5
preferredLocations( split: Partition): Seq[String]
preferredLocations is a template method that uses getPreferredLocations that custom RDDs can override to specify placement preferences for a partition. getPreferredLocations defines no placement preferences by default.
preferredLocations is mainly used when DAGScheduler is requested to compute the preferred locations for missing partitions.
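For example (local mode), a parallelized collection created without explicit location preferences falls back to the default getPreferredLocations, i.e. Nil:

```scala
import org.apache.spark.SparkContext

val sc = new SparkContext("local[2]", "locations-sketch")

val rdd = sc.parallelize(1 to 10, numSlices = 2)
assert(rdd.preferredLocations(rdd.partitions(0)).isEmpty)  // default: Nil

sc.stop()
```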
partitions returns the Partitions of an RDD.
Partitions have the property that their internal index should be equal to their position in the owning RDD.
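The invariant can be checked directly (local mode):

```scala
import org.apache.spark.SparkContext

val sc = new SparkContext("local[2]", "partitions-sketch")

val rdd = sc.parallelize(1 to 100, numSlices = 4)
// Every partition's index equals its position in the partitions array.
assert(rdd.partitions.zipWithIndex.forall { case (p, i) => p.index == i })

sc.stop()
```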
markCheckpointed is used when…FIXME
doCheckpoint is used when SparkContext is requested to run a job (synchronously).
checkpoint is used when…FIXME
isReliablyCheckpointed is used when…FIXME
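A reliable-checkpointing sketch tying these pieces together (local mode; the checkpoint directory is a temporary one created for the example): checkpoint only marks the RDD, and the data is materialized by doCheckpoint at the end of the first job.

```scala
import java.nio.file.Files
import org.apache.spark.SparkContext

val sc = new SparkContext("local[2]", "checkpoint-sketch")
sc.setCheckpointDir(Files.createTempDirectory("ckpt").toString)

val rdd = sc.parallelize(1 to 10).map(_ + 1)
rdd.checkpoint()            // only marks the RDD for checkpointing
assert(!rdd.isCheckpointed)

rdd.count()                 // SparkContext.runJob triggers rdd.doCheckpoint()
assert(rdd.isCheckpointed)  // now materialized to the checkpoint directory

sc.stop()
```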