MultiDimClustering¶
MultiDimClustering is an abstraction of multi-dimensional clustering algorithms (for changing the data layout).
Contract¶
Multi-Dimensional Clustering¶
cluster(
df: DataFrame,
colNames: Seq[String],
approxNumPartitions: Int,
randomizationExpressionOpt: Option[Column]): DataFrame
Repartitions the given df into approxNumPartitions partitions based on the provided colNames.
See:
Note
randomizationExpressionOpt is always undefined (None).
Used when:
MultiDimClustering utility is requested to cluster a DataFrame
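The contract above can be sketched as a Scala trait. This is a minimal sketch, assuming Spark SQL is on the classpath: `MultiDimClusteringSketch` and `NaiveRangeClustering` are hypothetical names used for illustration only, and the range-partitioning body is a stand-in, not the Z-order or Hilbert logic of the real implementations.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

// The contract as described in this section (hypothetical trait name).
trait MultiDimClusteringSketch {
  def cluster(
      df: DataFrame,
      colNames: Seq[String],
      approxNumPartitions: Int,
      randomizationExpressionOpt: Option[Column]): DataFrame
}

// Hypothetical implementation for illustration only.
// It range-partitions rows directly on the clustering columns, which is
// NOT a space-filling curve; it only shows the shape of an implementation.
object NaiveRangeClustering extends MultiDimClusteringSketch {
  override def cluster(
      df: DataFrame,
      colNames: Seq[String],
      approxNumPartitions: Int,
      randomizationExpressionOpt: Option[Column]): DataFrame = {
    // randomizationExpressionOpt is ignored here; per the note above,
    // callers always pass None.
    df.repartitionByRange(approxNumPartitions, colNames.map(col): _*)
  }
}
```

With a DataFrame `df` that has a column `id`, `NaiveRangeClustering.cluster(df, Seq("id"), 4, None)` would return the same rows spread over at most four range-partitioned partitions.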
Implementations¶
cluster¶
cluster(
df: DataFrame,
approxNumPartitions: Int,
colNames: Seq[String],
curve: String): DataFrame
curve Argument and Supported Values: zorder or hilbert
curve is determined by OptimizeExecutor and can only be one of two values: zorder or hilbert.
cluster asserts that the given colNames contains at least one column name.
AssertionError
cluster reports an AssertionError when the given colNames is empty:
assertion failed: Cannot cluster by zero columns!
cluster selects the multi-dimensional clustering algorithm based on the given curve name.
| Curve Type | Clustering Algorithm |
|---|---|
| hilbert | HilbertClustering |
| zorder | ZOrderClustering |
SparkException
cluster supports only these two curve types and throws a SparkException for any other value:
Unknown curve ([curve]), unable to perform multi dimensional clustering.
cluster requests the clustering implementation to cluster (with no randomizationExpressionOpt).
cluster is used when:
OptimizeExecutor is requested to runOptimizeBinJob (with isMultiDimClustering flag enabled)
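The validation and dispatch steps described in this section can be sketched without a Spark dependency. This is an illustrative sketch, not the actual source: `ClusterDispatchSketch`, `algorithmFor` and `UnknownCurveException` are hypothetical names, the exception class stands in for `SparkException`, and exact (case-sensitive) matching of the curve name is an assumption.

```scala
object ClusterDispatchSketch {
  // Stand-in for org.apache.spark.SparkException so the sketch
  // runs with the Scala standard library alone (assumption).
  final class UnknownCurveException(message: String) extends Exception(message)

  /** Returns the name of the clustering algorithm selected for `curve`. */
  def algorithmFor(curve: String, colNames: Seq[String]): String = {
    // Step 1: at least one column name is required.
    assert(colNames.nonEmpty, "Cannot cluster by zero columns!")
    // Step 2: select the algorithm by curve name (exact match assumed).
    curve match {
      case "hilbert" => "HilbertClustering"
      case "zorder"  => "ZOrderClustering"
      case unknown =>
        throw new UnknownCurveException(
          s"Unknown curve ($unknown), unable to perform multi dimensional clustering.")
    }
  }
}
```

Calling `ClusterDispatchSketch.algorithmFor("zorder", Seq("id"))` returns `"ZOrderClustering"`, while an empty `colNames` fails the assertion before the curve name is inspected. Scala's `assert` prefixes the message with `assertion failed: `, which is where the full error text comes from.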
AssertionError¶
cluster throws an AssertionError when there are no colNames specified:
Cannot cluster by zero columns!