Dependency¶
Dependency[T] is an abstraction of dependencies between RDDs.
Any time an RDD transformation (e.g. map, flatMap) is used (and RDD lineage graph is built), Dependencyies are the edges.
Contract¶
RDD¶
rdd: RDD[T]
Used when:
DAGScheduleris requested for the shuffle dependencies and ResourceProfiles (of anRDD)RDDis requested to getNarrowAncestors, cleanShuffleDependencies, firstParent, parent, toDebugString, getOutputDeterministicLevel
Implementations¶
Demo¶
The dependencies of an RDD are available using RDD.dependencies method.
val myRdd = sc.parallelize(0 to 9).groupBy(_ % 2)
scala> myRdd.dependencies.foreach(println)
org.apache.spark.ShuffleDependency@41e38d89
scala> myRdd.dependencies.map(_.rdd).foreach(println)
MapPartitionsRDD[6] at groupBy at <console>:39
RDD.toDebugString is used to print out the RDD lineage in a developer-friendly way.
scala> println(myRdd.toDebugString)
(16) ShuffledRDD[7] at groupBy at <console>:39 []
+-(16) MapPartitionsRDD[6] at groupBy at <console>:39 []
| ParallelCollectionRDD[5] at parallelize at <console>:39 []