HilbertClustering¶
HilbertClustering is a SpaceFillingCurveClustering for multi-dimensional clustering with a Hilbert curve.
HilbertClustering requires between 2 and 9 columns to cluster by.
Singleton Object
HilbertClustering is a Scala object, which is a class that has exactly one instance. It is created lazily when it is first referenced, like a lazy val.
Learn more in Tour of Scala.
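The lazy creation of a Scala object can be observed with a small experiment. This is a minimal sketch: LazyInitDemo, Singleton and log are hypothetical names made up for illustration and have nothing to do with HilbertClustering itself.

```scala
object LazyInitDemo {
  // Records when the nested object's body runs.
  var log: Vector[String] = Vector.empty

  object Singleton {
    // The body of an object runs once, on first reference.
    log = log :+ "Singleton initialized"
    val name: String = "the only instance"
  }

  def main(args: Array[String]): Unit = {
    println(s"Before first reference: $log")
    println(Singleton.name) // the first reference triggers initialization
    println(Singleton.name) // later references reuse the same instance
    println(s"After two references: $log")
  }
}
```

Running main should show an empty log before the first reference and a single entry afterwards, no matter how many times Singleton is referenced.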
Clustering Expression¶
SpaceFillingCurveClustering
getClusteringExpression(
  cols: Seq[Column],
  numRanges: Int): Column
getClusteringExpression is part of the SpaceFillingCurveClustering abstraction.
getClusteringExpression creates rangeIdCols, a range_partition_id for each of the given cols columns, with numRanges as the number of partitions (buckets).
In the end, getClusteringExpression creates a hilbert_index with the following:
-  The number of bits, which is one more than the number of trailing zeros of the int value with at most a single one-bit, in the position of the highest-order ("leftmost") one-bit of the numRanges value
   Number of Bits Explained
   Given numRanges is 5, the position of the highest-order ("leftmost") one-bit is 2.
      val numRanges = 5
      scala> println(s"$numRanges in the two's complement binary representation is ${Integer.toBinaryString(numRanges)}")
      5 in the two's complement binary representation is 101
   Counting positions from right to left, starting from 0, gives 2 as the position of the highest-order ("leftmost") one-bit.
      scala> print(s"For ${numRanges}, the int value with at most a single one-bit is ${Integer.highestOneBit(numRanges)}")
      For 5, the int value with at most a single one-bit is 4
   The int value with a single one-bit in position 2 is 4 (2^2).
   The number of zero bits following the lowest-order ("rightmost") one-bit in the two's complement binary representation of the int value (4) is 2.
      scala> println(s"For ${Integer.highestOneBit(numRanges)}, the number of zero bits is ${Integer.numberOfTrailingZeros(Integer.highestOneBit(numRanges))}")
      For 4, the number of zero bits is 2
   In the end, getClusteringExpression uses 3 as the number of bits.
-  The range_partition_id columns
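The number-of-bits computation described in the list above can be sketched in plain Scala. This is a minimal sketch under stated assumptions: NumBitsSketch and numBits are hypothetical names used for illustration only, and the formula is written from the description in this page rather than copied from the Delta Lake sources. Only the java.lang.Integer methods are real library calls.

```scala
object NumBitsSketch {
  // One more than the number of trailing zeros of the int value that keeps
  // only the highest-order one-bit of numRanges (assumes numRanges > 0).
  def numBits(numRanges: Int): Int =
    Integer.numberOfTrailingZeros(Integer.highestOneBit(numRanges)) + 1

  def main(args: Array[String]): Unit = {
    // For positive inputs this equals the bit length of numRanges,
    // e.g. 5 (binary 101) needs 3 bits and 1000 needs 10 bits.
    Seq(1, 2, 5, 8, 1000).foreach { n =>
      println(s"numRanges=$n (binary ${Integer.toBinaryString(n)}) -> numBits=${numBits(n)}")
    }
  }
}
```

For positive numRanges the result is simply the number of bits needed to represent every range ID from 0 to numRanges - 1 (and numRanges itself), which is why it is a natural per-column bit width for a Hilbert index.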