RangePartitioner is a Partitioner that partitions sortable records by range into roughly equal ranges (that can be used for bucketed partitioning).
RangePartitioner is used for sortByKey operator (mostly).
RangePartitioner takes the following to be created:
- Hint for the number of partitions
- Key-Value RDD (
RDD[_ <: Product2[K, V]])
- samplePointsPerPartitionHint (default:
Number of Partitions¶
numPartitions is part of the Partitioner abstraction.
numPartitions is 1 more than the length of the range bounds (since the number of range bounds is 0 for 0 or 1 partitions).
Partition for Key¶
getPartition( key: Any): Int
getPartition is part of the Partitioner abstraction.
getPartition branches off based on the length of the range bounds.
For up to 128 range bounds,
getPartition is either the first range bound (from the rangeBounds) for which the
key value is greater than the value of the range bound or 128 (if no value was found among the rangeBounds).
getPartition starts finding a candidate partition number from
0 and walks over the rangeBounds until a range bound for which the given
key value is greater than the value of the range bound is found or there are no more rangeBounds.
getPartition increments the candidate partition candidate every iteration.
For the number of the rangeBounds above 128,
In the end,
getPartition returns the candidate partition number for the ascending enabled, or flips it (to be the number of the rangeBounds minus the candidate partition number), otheriwse.
rangeBounds is an array of upper bounds.
For the number of partitions up to and including 1,
rangeBounds is an empty array.
For more than 1 partitions,
rangeBounds determines the sample size per partitions. The total sample size is the samplePointsPerPartitionHint multiplied by the number of partitions capped by
rangeBounds allows for 3x over-sample per partition.
rangeBounds sketches the keys of the input rdd (with the
There is more going on in
In the end,
rangeBounds determines the bounds.
determineBounds[K: Ordering]( candidates: ArrayBuffer[(K, Float)], partitions: Int): Array[K]