ZOrderClustering¶
ZOrderClustering is a SpaceFillingCurveClustering for multi-dimensional clustering with zorder curve.
Clustering Expression¶
SpaceFillingCurveClustering
getClusteringExpression(
cols: Seq[Column],
numRanges: Int): Column
getClusteringExpression is part of the SpaceFillingCurveClustering abstraction.
getClusteringExpression creates a range_partition_id function (with the given numRanges for the number of partitions) for every Column (in the given cols).
In the end, getClusteringExpression interleave_bits with the range_partition_id columns and casts the (evaluation) result to StringType.
Demo¶
For some reason, getClusteringExpression is protected[skipping] so let's hop over the fence with the following hack.
Paste the following to spark-shell in :paste -raw mode:
package org.apache.spark.sql.delta.skipping
object protectedHack {
import org.apache.spark.sql.Column
def getClusteringExpression(
cols: Seq[Column], numRanges: Int): Column = {
ZOrderClustering.getClusteringExpression(cols, numRanges)
}
}
import org.apache.spark.sql.delta.skipping.protectedHack
val clusterExpr = protectedHack.getClusteringExpression(cols = Seq($"x", $"y"), numRanges = 3)
scala> println(clusterExpr.expr.numberedTreeString)
00 cast(interleavebits(rangepartitionid('x, 3), rangepartitionid('y, 3)) as string)
01 +- interleavebits(rangepartitionid('x, 3), rangepartitionid('y, 3))
02 :- rangepartitionid('x, 3)
03 : +- 'x
04 +- rangepartitionid('y, 3)
05 +- 'y