Skip to content

HilbertClustering

HilbertClustering is a SpaceFillingCurveClustering for multi-dimensional clustering with hilbert curve.

HilbertClustering requires between 2 and up to 9 columns to cluster by.

Singleton Object

HilbertClustering is a Scala object which is a class that has exactly one instance. It is created lazily when it is referenced, like a lazy val.

Learn more in Tour of Scala.

Clustering Expression

SpaceFillingCurveClustering
getClusteringExpression(
  cols: Seq[Column],
  numRanges: Int): Column

getClusteringExpression is part of the SpaceFillingCurveClustering abstraction.

getClusteringExpression creates a rangeIdCols as range_partition_id for the given cols columns and the numRanges number of partitions (buckets).

In the end, getClusteringExpression hilbert_index with the following:

  • The number of bits being one more than the number of trailing zeros of the int value with at most a single one-bit, in the position of the highest-order ("leftmost") one-bit in the numRanges value

    Number of Bits Explained

    Given numRanges is 5, the position of the highest-order ("leftmost") one-bit is 2.

    val numRanges = 5
    scala> println(s"$numRanges in the two's complement binary representation is ${Integer.toBinaryString(numRanges)}")
    5 in the two's complement binary representation is 101
    

    Counting positions from left to right, starting from 0, gives 2 as the position of the highest-order ("leftmost") one-bit.

    scala> print(s"For ${numRanges}, the int value with at most a single one-bit is ${Integer.highestOneBit(numRanges)}")
    For 5, the int value with at most a single one-bit is 4
    

    The int value with at most a single one-bit in the position of the highest-order ("leftmost") one-bit being 2 is 4 (2^2).

    The number of zero bits following the lowest-order ("rightmost") one-bit in the two's complement binary representation of the int value (4) is 2.

    scala> println(s"For ${Integer.highestOneBit(numRanges)}, the number of zero bits is ${Integer.numberOfTrailingZeros(Integer.highestOneBit(numRanges))}")
    For 4, the number of zero bits is 2
    

    In the end, getClusteringExpression uses 3 as the number of bits.

  • The range_partition_id columns