Skip to content

Dependency

Dependency[T] is an abstraction of dependencies between RDDs.

Any time an RDD transformation (e.g. map, flatMap) is used (and RDD lineage graph is built), Dependencyies are the edges.

Contract

RDD

rdd: RDD[T]

Used when:

Implementations

Demo

The dependencies of an RDD are available using RDD.dependencies method.

val myRdd = sc.parallelize(0 to 9).groupBy(_ % 2)
scala> myRdd.dependencies.foreach(println)
org.apache.spark.ShuffleDependency@41e38d89
scala> myRdd.dependencies.map(_.rdd).foreach(println)
MapPartitionsRDD[6] at groupBy at <console>:39

RDD.toDebugString is used to print out the RDD lineage in a developer-friendly way.

scala> println(myRdd.toDebugString)
(16) ShuffledRDD[7] at groupBy at <console>:39 []
 +-(16) MapPartitionsRDD[6] at groupBy at <console>:39 []
    |   ParallelCollectionRDD[5] at parallelize at <console>:39 []