Skip to content

ZOrderClustering

ZOrderClustering is a SpaceFillingCurveClustering for multi-dimensional clustering with zorder curve.

Clustering Expression

SpaceFillingCurveClustering
getClusteringExpression(
  cols: Seq[Column],
  numRanges: Int): Column

getClusteringExpression is part of the SpaceFillingCurveClustering abstraction.

getClusteringExpression creates a range_partition_id function (with the given numRanges for the number of partitions) for every Column (in the given cols).

In the end, getClusteringExpression interleave_bits with the range_partition_id columns and casts the (evaluation) result to StringType.

Demo

For some reason, getClusteringExpression is protected[skipping] so let's hop over the fence with the following hack.

Paste the following to spark-shell in :paste -raw mode:

package org.apache.spark.sql.delta.skipping
object protectedHack {
  import org.apache.spark.sql.Column
  def getClusteringExpression(
    cols: Seq[Column], numRanges: Int): Column = {
      ZOrderClustering.getClusteringExpression(cols, numRanges)
    }
}
import org.apache.spark.sql.delta.skipping.protectedHack
val clusterExpr = protectedHack.getClusteringExpression(cols = Seq($"x", $"y"), numRanges = 3)
scala> println(clusterExpr.expr.numberedTreeString)
00 cast(interleavebits(rangepartitionid('x, 3), rangepartitionid('y, 3)) as string)
01 +- interleavebits(rangepartitionid('x, 3), rangepartitionid('y, 3))
02    :- rangepartitionid('x, 3)
03    :  +- 'x
04    +- rangepartitionid('y, 3)
05       +- 'y