RangePartitioner¶
RangePartitioner is a Partitioner that partitions sortable records by range into roughly equal ranges (which can be used for bucketed partitioning).
RangePartitioner is used (mostly) by the sortByKey operator.
Creating Instance¶
RangePartitioner takes the following to be created:
- Hint for the number of partitions
- Key-Value RDD (RDD[_ <: Product2[K, V]])
- ascending flag (default: true)
- samplePointsPerPartitionHint (default: 20)
Number of Partitions¶
numPartitions: Int
numPartitions is part of the Partitioner abstraction.
numPartitions is 1 more than the length of the range bounds (there are no range bounds for 0 or 1 partitions, so numPartitions is then 1).
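The relationship can be illustrated with a minimal, Spark-free sketch. The object NumPartitionsSketch and its method signature are hypothetical (they are not Spark's API); the sketch only assumes that every range bound is the upper bound of one partition, with one extra partition for the keys above the last bound.

```scala
// Hypothetical, Spark-free illustration (not Spark's actual code).
object NumPartitionsSketch {
  // n upper bounds split the key space into n + 1 ranges.
  def numPartitions[K](rangeBounds: Array[K]): Int = rangeBounds.length + 1
}
```

For example, the bounds Array(10, 20, 30) describe four ranges (up to 10, up to 20, up to 30, and everything above 30), while an empty array describes a single partition.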
Partition for Key¶
getPartition(
key: Any): Int
getPartition is part of the Partitioner abstraction.
getPartition branches off based on the length of the range bounds.
For up to 128 range bounds, getPartition does a linear scan. The result is the index of the first range bound (from the rangeBounds) that the key value is not greater than, or the number of the rangeBounds (if the key value is greater than every range bound). getPartition starts with the candidate partition number 0 and walks over the rangeBounds for as long as the given key value is greater than the value of the current range bound and there are more rangeBounds. getPartition increments the candidate partition number every iteration.
For more than 128 range bounds, getPartition ...FIXME
In the end, getPartition returns the candidate partition number when ascending is enabled, or flips it (to be the number of the rangeBounds minus the candidate partition number) otherwise.
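The linear scan and the final flip can be sketched without Spark as follows. This is a simplified illustration rather than Spark's actual code: the object GetPartitionSketch and the explicit rangeBounds and ascending parameters are hypothetical, and the sketch assumes that the scan advances while the key is greater than the current range bound.

```scala
// Hypothetical, Spark-free sketch of the linear-scan branch (not Spark's actual code).
object GetPartitionSketch {
  def getPartition[K](key: K, rangeBounds: Array[K], ascending: Boolean = true)(
      implicit ord: Ordering[K]): Int = {
    var partition = 0
    // Advance while there are bounds left and the key is greater than the current upper bound.
    while (partition < rangeBounds.length && ord.gt(key, rangeBounds(partition))) {
      partition += 1
    }
    // Flip the partition number for descending order.
    if (ascending) partition else rangeBounds.length - partition
  }
}
```

With the bounds Array(10, 20, 30), the keys 5 and 10 land in partition 0, the key 11 in partition 1, and the key 35 in partition 3 (the number of the rangeBounds). With ascending disabled, the same keys map to 3, 3, 2 and 0, respectively.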
Range Bounds¶
rangeBounds: Array[K]
rangeBounds is an array of upper bounds.
For 0 or 1 partitions, rangeBounds is an empty array.
For more than 1 partition, rangeBounds determines the sample size per partition. The total sample size is the samplePointsPerPartitionHint multiplied by the number of partitions, capped at 1e6. rangeBounds allows for 3x over-sampling per partition.
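The arithmetic can be sketched in plain Scala. The object SampleSizeSketch and its method names are hypothetical. Dividing by the number of partitions of the input RDD (rddPartitions below) and rounding up are assumptions based on the Spark source code, not something this page states.

```scala
// Hypothetical, Spark-free sketch of the sample-size arithmetic (not Spark's actual code).
object SampleSizeSketch {
  // Total sample size: the hint multiplied by the number of partitions, capped at 1e6.
  def sampleSize(samplePointsPerPartitionHint: Int, partitions: Int): Double =
    math.min(samplePointsPerPartitionHint.toDouble * partitions, 1e6)

  // Sample size per input partition with 3x over-sampling.
  // Assumption: the total is spread over the partitions of the input RDD and rounded up.
  def sampleSizePerPartition(sampleSize: Double, rddPartitions: Int): Int =
    math.ceil(3.0 * sampleSize / rddPartitions).toInt
}
```

With the default hint of 20 and 10 requested partitions, the total sample size is 200. Under the assumption above, an input RDD with 8 partitions would be sampled with up to 75 keys per partition.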
rangeBounds sketches the keys of the input rdd (with the sampleSizePerPartition).
Note
There is more going on in rangeBounds.
In the end, rangeBounds determines the bounds (using determineBounds).
determineBounds¶
determineBounds[K: Ordering](
candidates: ArrayBuffer[(K, Float)],
partitions: Int): Array[K]
determineBounds ...FIXME