MultiDimClustering

MultiDimClustering is an abstraction of multi-dimensional clustering algorithms (for changing the data layout).

Contract

Multi-Dimensional Clustering

cluster(
  df: DataFrame,
  colNames: Seq[String],
  approxNumPartitions: Int,
  randomizationExpressionOpt: Option[Column]): DataFrame

Repartitions the given df into approximately approxNumPartitions partitions based on the provided colNames.

See:

Note

randomizationExpressionOpt is always undefined (None).
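
To make the contract more concrete, here is a toy, Spark-free sketch of what "clustering by several columns" can look like. It is not Delta Lake code: `ToyZOrder`, `zKey`, `Row` and the `bits` parameter are hypothetical names, rows are plain in-memory maps rather than a DataFrame, and the key is built by simple bit interleaving (a Z-order style key), assuming non-negative column values below 2^bits.

```scala
// Toy illustration only: plain Scala collections, no Spark, no Delta Lake classes.
object ToyZOrder {
  // Hypothetical row type: column name -> small non-negative integer
  type Row = Map[String, Int]

  // Interleaves the low `bits` bits of each value, most significant bit first.
  // Assumes every value is in [0, 2^bits).
  def zKey(values: Seq[Int], bits: Int = 8): Long = {
    require(values.nonEmpty, "Cannot cluster by zero columns!")
    require(bits > 0 && bits <= 31 && bits * values.size <= 63,
      "interleaved key must fit in a Long")
    require(values.forall(v => v >= 0 && (v.toLong >> bits) == 0),
      "value out of range for the chosen number of bits")
    var key = 0L
    for (bit <- (bits - 1) to 0 by -1; v <- values) {
      key = (key << 1) | ((v >> bit) & 1).toLong
    }
    key
  }

  // Sorts rows by their interleaved key and cuts the result into
  // roughly approxNumPartitions contiguous chunks.
  def cluster(
      rows: Seq[Row],
      colNames: Seq[String],
      approxNumPartitions: Int): Seq[Seq[Row]] = {
    require(approxNumPartitions > 0, "approxNumPartitions must be positive")
    if (rows.isEmpty) Seq.empty
    else {
      val sorted = rows.sortBy(row => zKey(colNames.map(row)))
      val chunkSize =
        math.max(1, math.ceil(sorted.size.toDouble / approxNumPartitions).toInt)
      sorted.grouped(chunkSize).toList
    }
  }
}
```

In this sketch the point (1, 0) gets key 2 and (0, 3) gets key 5, so (1, 0) sorts first even though it comes later in plain lexicographic order. That is the kind of layout change a multi-dimensional clustering algorithm aims for: rows that are close in all clustering columns tend to land in the same partition.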

Used when:

Implementations

cluster

cluster(
  df: DataFrame,
  approxNumPartitions: Int,
  colNames: Seq[String],
  curve: String): DataFrame
curve Argument and Supported Values: zorder or hilbert

The curve value comes from OptimizeExecutor and can only be one of two values: zorder or hilbert.

cluster asserts that the given colNames contains at least one column name.

AssertionError

cluster reports an AssertionError when the given colNames is empty:

assertion failed : Cannot cluster by zero columns!

cluster selects the multi-dimensional clustering algorithm based on the given curve name.

Curve Type    Clustering Algorithm
hilbert       HilbertClustering
zorder        ZOrderClustering

SparkException

cluster accepts these two algorithms only or throws a SparkException:

Unknown curve ([curve]), unable to perform multi dimensional clustering.

cluster requests the clustering implementation to cluster (with no randomizationExpressionOpt).
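
The steps above (assert non-empty colNames, pick the algorithm by curve name, fail on anything else) can be sketched without Spark as follows. This is a minimal sketch under stated assumptions, not the Delta Lake source: `CurveDispatchSketch`, `selectClustering` and `UnknownCurveException` are hypothetical names, the exception is a stand-in for SparkException, the function returns the implementation name as a string instead of delegating to it, and exact (case-sensitive) matching of the curve name is assumed.

```scala
object CurveDispatchSketch {
  // Hypothetical stand-in for SparkException, so the sketch needs no Spark on the classpath
  final class UnknownCurveException(message: String) extends RuntimeException(message)

  // Returns the name of the clustering implementation that the curve maps to
  def selectClustering(colNames: Seq[String], curve: String): String = {
    // Throws java.lang.AssertionError when no column names are given
    assert(colNames.nonEmpty, "Cannot cluster by zero columns!")
    curve match {
      case "hilbert" => "HilbertClustering"
      case "zorder"  => "ZOrderClustering"
      case unknown =>
        throw new UnknownCurveException(
          s"Unknown curve ($unknown), unable to perform multi dimensional clustering.")
    }
  }
}
```

Checking colNames before matching on the curve mirrors the order in which the two failure modes are described here: an empty column list is reported first, and only then is an unsupported curve name rejected.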


cluster is used when:

AssertionError

cluster throws an AssertionError when there are no colNames specified:

Cannot cluster by zero columns!