RangePartitioner¶

RangePartitioner is a Partitioner that partitions sortable records by range into roughly equal ranges (that can be used for bucketed partitioning).

RangePartitioner is used for sortByKey operator (mostly).

Creating Instance¶

RangePartitioner takes the following to be created:

Hint for the number of partitions
Key-Value RDD (RDD[_ <: Product2[K, V]])
ascending flag (default: true)
samplePointsPerPartitionHint (default: 20)

Number of Partitions¶

numPartitions: Int

numPartitions is part of the Partitioner abstraction.

numPartitions is 1 more than the length of the range bounds (since the number of range bounds is 0 for 0 or 1 partitions).

Partition for Key¶

getPartition(
  key: Any): Int

getPartition is part of the Partitioner abstraction.

getPartition branches off based on the length of the range bounds.

For up to 128 range bounds, getPartition is either the first range bound (from the rangeBounds) for which the key value is greater than the value of the range bound or 128 (if no value was found among the rangeBounds). getPartition starts finding a candidate partition number from 0 and walks over the rangeBounds until a range bound for which the given key value is greater than the value of the range bound is found or there are no more rangeBounds. getPartition increments the candidate partition candidate every iteration.

For the number of the rangeBounds above 128, getPartition...FIXME

In the end, getPartition returns the candidate partition number for the ascending enabled, or flips it (to be the number of the rangeBounds minus the candidate partition number), otheriwse.

Range Bounds¶

rangeBounds: Array[K]

rangeBounds is an array of upper bounds.

For the number of partitions up to and including 1, rangeBounds is an empty array.

For more than 1 partitions, rangeBounds determines the sample size per partitions. The total sample size is the samplePointsPerPartitionHint multiplied by the number of partitions capped by 1e6. rangeBounds allows for 3x over-sample per partition.

rangeBounds sketches the keys of the input rdd (with the sampleSizePerPartition).

Note

There is more going on in rangeBounds.

In the end, rangeBounds determines the bounds.

determineBounds¶

determineBounds[K: Ordering](
  candidates: ArrayBuffer[(K, Float)],
  partitions: Int): Array[K]

determineBounds...FIXME