Aggregator

Aggregator is a set of aggregation functions used to aggregate data using PairRDDFunctions.combineByKeyWithClassTag transformation.

Aggregator[K, V, C] is a parameterized type of K keys, V values, and C combiner (partial) values.

Aggregator transforms an RDD[(K, V)] into an RDD[(K, C)] (for a "combined type" C) using the functions:

  • createCombiner: V ⇒ C

  • mergeValue: (C, V) ⇒ C

  • mergeCombiners: (C, C) ⇒ C

Aggregator is used to create a ShuffleDependency and ExternalSorter.

combineValuesByKey Method

combineValuesByKey(
  iter: Iterator[_ <: Product2[K, V]],
  context: TaskContext): Iterator[(K, C)]

combineValuesByKey creates a new ExternalAppendOnlyMap (with the aggregation functions).

combineValuesByKey requests the ExternalAppendOnlyMap to insert all key-value pairs from the given iterator (that is the values of a partition).

combineValuesByKey updates the task metrics.

In the end, combineValuesByKey requests the ExternalAppendOnlyMap for an iterator of "combined" pairs.

combineValuesByKey is used when:

combineCombinersByKey Method

combineCombinersByKey(
  iter: Iterator[_ <: Product2[K, C]],
  context: TaskContext): Iterator[(K, C)]

combineCombinersByKey…​FIXME

combineCombinersByKey is used when BlockStoreShuffleReader is requested to read combined records for a reduce task (with the Map-Size Partial Aggregation Flag on).

Updating Task Metrics

updateMetrics(
  context: TaskContext,
  map: ExternalAppendOnlyMap[_, _, _]): Unit

updateMetrics requests the input TaskContext for the TaskMetrics to update the metrics based on the metrics of the input ExternalAppendOnlyMap:

updateMetrics is used when Aggregator is requested to combineValuesByKey and combineCombinersByKey.