MultiDimClustering¶
MultiDimClustering is an abstraction of multi-dimensional clustering algorithms (for changing the data layout).
Contract¶
Multi-Dimensional Clustering¶
cluster(
df: DataFrame,
colNames: Seq[String],
approxNumPartitions: Int,
randomizationExpressionOpt: Option[Column]): DataFrame
Repartitions the given df into approxNumPartitions based on the provided colNames.
Note
randomizationExpressionOpt is always undefined (None).
Used when:
MultiDimClustering utility is requested to cluster a DataFrame
Implementations¶
cluster¶
cluster(
df: DataFrame,
approxNumPartitions: Int,
colNames: Seq[String],
curve: String): DataFrame
curve Argument and Supported Values: zorder or hilbert
curve is supplied by OptimizeExecutor and can only be one of two values: zorder or hilbert.
cluster asserts that the given colNames contains at least one column name.
AssertionError
cluster reports an AssertionError when the given colNames is empty.
assertion failed: Cannot cluster by zero columns!
cluster selects the multi-dimensional clustering algorithm based on the given curve name.
Curve Type | Clustering Algorithm |
---|---|
hilbert | HilbertClustering |
zorder | ZOrderClustering |
SparkException
cluster accepts these two curve types only and throws a SparkException for any other value:
Unknown curve ([curve]), unable to perform multi dimensional clustering.
cluster requests the clustering implementation to cluster (with no randomizationExpressionOpt).
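The curve selection described above can be sketched in plain Scala. This is a minimal illustration, not the Delta Lake source: CurveDispatchSketch, selectAlgorithm and UnknownCurveException are hypothetical names, the exception class stands in for SparkException so the sketch needs no Spark dependency, and exact (case-sensitive) matching of the curve name is an assumption.

```scala
object CurveDispatchSketch {
  // Hypothetical stand-in for SparkException (keeps the sketch free of Spark dependencies).
  final class UnknownCurveException(message: String) extends Exception(message)

  // Maps a curve type to the name of the clustering algorithm listed in the table above.
  // Any other value fails with the error message quoted above.
  def selectAlgorithm(curve: String): String = curve match {
    case "hilbert" => "HilbertClustering"
    case "zorder"  => "ZOrderClustering"
    case other =>
      throw new UnknownCurveException(
        s"Unknown curve ($other), unable to perform multi dimensional clustering.")
  }
}
```

With this sketch, CurveDispatchSketch.selectAlgorithm("zorder") returns "ZOrderClustering", while any name other than zorder or hilbert raises the exception.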
cluster is used when:
OptimizeExecutor is requested to runOptimizeBinJob (with isMultiDimClustering flag enabled)
AssertionError¶
cluster throws an AssertionError when there are no colNames specified:
Cannot cluster by zero columns!
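The precondition can be reproduced with Scala's built-in assert, which prefixes the supplied message with "assertion failed: ". The sketch below is an illustration only: ZeroColumnsSketch and requireColumns are hypothetical names, and that cluster performs the check this way is an assumption.

```scala
object ZeroColumnsSketch {
  // Hypothetical helper mirroring the precondition on colNames.
  // Predef.assert throws java.lang.AssertionError("assertion failed: " + message)
  // when the condition is false.
  def requireColumns(colNames: Seq[String]): Unit =
    assert(colNames.nonEmpty, "Cannot cluster by zero columns!")
}
```

Calling ZeroColumnsSketch.requireColumns(Seq("id")) returns normally, whereas ZeroColumnsSketch.requireColumns(Seq.empty) throws the AssertionError (unless assertions have been elided at compile time).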