MultiDimClustering

MultiDimClustering is an abstraction of multi-dimensional clustering algorithms (for changing the data layout).

Contract

cluster(
  df: DataFrame,
  colNames: Seq[String],
  approxNumPartitions: Int,
  randomizationExpressionOpt: Option[Column]): DataFrame

Repartitions the given df into approximately approxNumPartitions partitions based on the provided colNames.

See:

Note

randomizationExpressionOpt is always undefined (None).

Used when:

cluster(
  df: DataFrame,
  approxNumPartitions: Int,
  colNames: Seq[String],
  curve: String): DataFrame

curve Argument and Supported Values: zorder or hilbert

The curve value comes from OptimizeExecutor and can only be one of two values: zorder or hilbert.

cluster asserts that the given colNames contains at least one column name.

AssertionError

cluster reports an AssertionError when the given colNames is empty:

assertion failed: Cannot cluster by zero columns!

cluster selects the multi-dimensional clustering algorithm based on the given curve name.

Curve Type	Clustering Algorithm
`hilbert`	HilbertClustering
`zorder`	ZOrderClustering

SparkException

cluster supports these two curve types only. Any other value results in a SparkException:

Unknown curve ([curve]), unable to perform multi dimensional clustering.

cluster requests the clustering implementation to cluster (with no randomizationExpressionOpt).
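
The selection steps described above can be illustrated with a minimal, Spark-free sketch. It is not the actual Delta Lake source: `CurveDispatchSketch`, `selectClustering` and `UnknownCurveException` are hypothetical names introduced here (the last one stands in for `SparkException` so the example runs without Spark on the classpath), and the method returns the name of the chosen algorithm rather than clustering a DataFrame.

```scala
object CurveDispatchSketch {

  // Hypothetical stand-in for SparkException (assumption: avoids a Spark dependency).
  final class UnknownCurveException(message: String) extends Exception(message)

  // Returns the name of the clustering algorithm that would handle the given curve.
  // Assumption: the real utility goes on to call cluster on the selected implementation.
  def selectClustering(colNames: Seq[String], curve: String): String = {
    // Fails with java.lang.AssertionError("assertion failed: ...") for an empty column list.
    assert(colNames.nonEmpty, "Cannot cluster by zero columns!")

    curve match {
      case "hilbert" => "HilbertClustering"
      case "zorder"  => "ZOrderClustering"
      case unknown =>
        throw new UnknownCurveException(
          s"Unknown curve ($unknown), unable to perform multi dimensional clustering.")
    }
  }
}
```

With this sketch, `CurveDispatchSketch.selectClustering(Seq("id"), "zorder")` yields `ZOrderClustering`, an empty `colNames` trips the assertion before the curve is inspected, and any other curve name raises the exception with the message shown above.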

cluster is used when:

OptimizeExecutor is requested to runOptimizeBinJob (with isMultiDimClustering flag enabled)

cluster throws an AssertionError when there are no colNames specified:

Cannot cluster by zero columns!