ZOrderClustering
ZOrderClustering is a SpaceFillingCurveClustering for multi-dimensional clustering using a Z-order curve.
Clustering Expression
SpaceFillingCurveClustering
getClusteringExpression(
  cols: Seq[Column],
  numRanges: Int): Column
getClusteringExpression is part of the SpaceFillingCurveClustering abstraction.
getClusteringExpression creates a range_partition_id function (with the given numRanges for the number of partitions) for every Column (in the given cols).
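The effect of a range partition id can be sketched in plain Scala. This is only an illustration of the idea, not Delta Lake code: `RangeIdSketch`, `rangeId` and the hard-coded `bounds` are hypothetical, and the sketch assumes the range boundaries are already known (how Delta derives them from the data is not shown here).

```scala
// Illustrative sketch only, NOT Delta Lake's range_partition_id.
object RangeIdSketch {
  // `bounds` are assumed, pre-computed, sorted upper bounds.
  // numRanges - 1 bounds split the value domain into numRanges ranges.
  def rangeId(value: Int, bounds: Seq[Int]): Int =
    bounds.count(value > _) // number of bounds the value lies above

  def main(args: Array[String]): Unit = {
    val bounds = Seq(10, 20) // hypothetical bounds for numRanges = 3
    println(Seq(5, 15, 25).map(rangeId(_, bounds))) // prints List(0, 1, 2)
  }
}
```

Every value is replaced by a small integer in the range 0 until numRanges, which keeps the number of bits to interleave low.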
In the end, getClusteringExpression applies interleave_bits to the range_partition_id columns and casts the (evaluation) result to StringType.
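Bit interleaving is what turns several per-column range ids into a single Z-order value. The following plain-Scala sketch shows the idea only. `ZOrderSketch` and `interleaveBits` are hypothetical names, and the output (a string of 0s and 1s over a chosen number of bits) is an assumption made for readability, not the format Delta's interleave_bits produces.

```scala
// Illustrative sketch only, NOT Delta Lake's interleave_bits expression.
object ZOrderSketch {
  // Interleaves the lowest `bits` bits of every value, most significant
  // bit first, taking one bit from each value in turn.
  def interleaveBits(values: Seq[Int], bits: Int): String = {
    val sb = new StringBuilder
    for (bit <- (bits - 1) to 0 by -1; v <- values) {
      sb.append((v >>> bit) & 1) // appends "0" or "1"
    }
    sb.toString
  }

  def main(args: Array[String]): Unit = {
    // With 3 ranges, ids are 0, 1 or 2 and fit in 2 bits.
    // x = 2 (binary 10), y = 1 (binary 01)
    println(interleaveBits(Seq(2, 1), bits = 2)) // prints 1001
  }
}
```

Rows whose range ids are close in every dimension tend to get nearby interleaved values, which is why sorting by the result clusters data along all the columns at once.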
Demo
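For some reason, getClusteringExpression is protected[skipping] (accessible only from within the org.apache.spark.sql.delta.skipping package), so let's hop over the fence with the following hack: a helper object declared in that package.
Paste the following into spark-shell in :paste -raw mode: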
package org.apache.spark.sql.delta.skipping

// Declared in the same package as ZOrderClustering so that
// the protected[skipping] method is accessible.
object protectedHack {
  import org.apache.spark.sql.Column

  def getClusteringExpression(
      cols: Seq[Column], numRanges: Int): Column = {
    ZOrderClustering.getClusteringExpression(cols, numRanges)
  }
}

// Exit paste mode (Ctrl+D) and continue at the regular prompt.
import org.apache.spark.sql.delta.skipping.protectedHack
val clusterExpr = protectedHack.getClusteringExpression(cols = Seq($"x", $"y"), numRanges = 3)
scala> println(clusterExpr.expr.numberedTreeString)
00 cast(interleavebits(rangepartitionid('x, 3), rangepartitionid('y, 3)) as string)
01 +- interleavebits(rangepartitionid('x, 3), rangepartitionid('y, 3))
02    :- rangepartitionid('x, 3)
03    :  +- 'x
04    +- rangepartitionid('y, 3)
05       +- 'y