Dependency¶
Dependency[T]
is an abstraction of dependencies between RDD
s.
Any time an RDD transformation (e.g. map
, flatMap
) is used (and RDD lineage graph is built), Dependency
ies are the edges.
Contract¶
RDD¶
rdd: RDD[T]
Used when:
DAGScheduler
is requested for the shuffle dependencies and ResourceProfiles (of anRDD
)RDD
is requested to getNarrowAncestors, cleanShuffleDependencies, firstParent, parent, toDebugString, getOutputDeterministicLevel
Implementations¶
Demo¶
The dependencies of an RDD
are available using RDD.dependencies
method.
val myRdd = sc.parallelize(0 to 9).groupBy(_ % 2)
scala> myRdd.dependencies.foreach(println)
org.apache.spark.ShuffleDependency@41e38d89
scala> myRdd.dependencies.map(_.rdd).foreach(println)
MapPartitionsRDD[6] at groupBy at <console>:39
RDD.toDebugString is used to print out the RDD lineage in a developer-friendly way.
scala> println(myRdd.toDebugString)
(16) ShuffledRDD[7] at groupBy at <console>:39 []
+-(16) MapPartitionsRDD[6] at groupBy at <console>:39 []
| ParallelCollectionRDD[5] at parallelize at <console>:39 []