RangePartitioner is a Partitioner that partitions sortable records by range into roughly equal ranges (that can be used for bucketed partitioning).
RangePartitioner is used for sortByKey operator (mostly).
RangePartitioner takes the following to be created:
- Hint for the number of partitions
- Key-Value RDD (
RDD[_ <: Product2[K, V]])
- samplePointsPerPartitionHint (default:
Number of Partitions¶
numPartitions is part of the Partitioner abstraction.
Partition for Key¶
getPartition( key: Any): Int
getPartition is part of the Partitioner abstraction.
getPartition branches off based on the length of the range bounds.
For up to 128 range bounds,
getPartition is either the first range bound (from the rangeBounds) for which the
key value is greater than the value of the range bound or 128 (if no value was found among the rangeBounds).
getPartition starts finding a candidate partition number from
0 and walks over the rangeBounds until a range bound for which the given
key value is greater than the value of the range bound is found or there are no more rangeBounds.
getPartition increments the candidate partition candidate every iteration.
For the number of the rangeBounds above 128,
rangeBounds is an array of upper bounds.
For the number of partitions up to and including 1,
rangeBounds is an empty array.
For more than 1 partitions,
rangeBounds determines the sample size per partitions. The total sample size is the samplePointsPerPartitionHint multiplied by the number of partitions capped by
rangeBounds allows for 3x over-sample per partition.
There is more going on in
In the end,
rangeBounds determines the bounds.
determineBounds[K: Ordering]( candidates: ArrayBuffer[(K, Float)], partitions: Int): Array[K]