
Liquid Clustering

Liquid Clustering (Clustered Tables) is an optimization technique in Delta Lake based on the OPTIMIZE command with Hilbert clustering.

Not Recommended for Production Use

  1. Clustered tables are currently in preview and are disabled by default.
  2. Clustered tables are not recommended for production use (e.g., incremental clustering is not supported).

Liquid Clustering can be executed on delta tables automatically (at write time, with Auto Compaction enabled) or manually (at any time, using the OPTIMIZE command).
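As a minimal sketch of the two modes, assuming a clustered delta table named delta_table already exists (the name is only illustrative) and that Auto Compaction is switched on per session with the spark.databricks.delta.autoCompact.enabled property, the statements could look as follows:

```sql
-- Automatic mode: enable Auto Compaction for writes in this session.
-- Assumption: spark.databricks.delta.autoCompact.enabled is the session-level switch.
SET spark.databricks.delta.autoCompact.enabled=true;

-- Manual mode: run the optimization on demand, at any time.
-- delta_table is a hypothetical table created with CLUSTER BY.
OPTIMIZE delta_table;
```

Either statement only has a clustering effect on a table that was created with clustering columns.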

Liquid Clustering can be enabled system-wide using the spark.databricks.delta.clusteredTable.enableClusteringTablePreview configuration property.

SET spark.databricks.delta.clusteredTable.enableClusteringTablePreview=true

Liquid Clustering can only be used on delta tables created with the CLUSTER BY clause.

CREATE TABLE IF NOT EXISTS delta_table
USING delta
CLUSTER BY (id)
AS
  SELECT * FROM values 1, 2, 3 t(id)

At write time, Delta Lake registers the AutoCompact post-commit hook (part of the Auto Compaction feature), which determines the type of optimization to run (incl. Liquid Clustering).

The clustering columns of a delta table are stored (persisted) in the table catalog (as the clusteringColumns table property).

DESC EXTENDED delta_table
+----------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
|col_name                    |data_type                                                                                                                                               |comment|
+----------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
|id                          |int                                                                                                                                                     |NULL   |
|                            |                                                                                                                                                        |       |
|# Detailed Table Information|                                                                                                                                                        |       |
|Name                        |spark_catalog.default.delta_table                                                                                                                       |       |
|Type                        |MANAGED                                                                                                                                                 |       |
|Location                    |file:/Users/jacek/dev/oss/spark/spark-warehouse/delta_table                                                                                             |       |
|Provider                    |delta                                                                                                                                                   |       |
|Owner                       |jacek                                                                                                                                                   |       |
|Table Properties            |[clusteringColumns=[["id"]],delta.feature.clustering=supported,delta.feature.domainMetadata=supported,delta.minReaderVersion=1,delta.minWriterVersion=7]|       |
+----------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
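To look at the clustering columns alone rather than the full table description, one option is to query the table property directly. This is a sketch only, assuming that SHOW TBLPROPERTIES exposes the same properties that DESC EXTENDED lists for delta_table:

```sql
-- Show only the clusteringColumns property of the table.
-- Assumption: the property is visible through SHOW TBLPROPERTIES
-- in the same way it appears under Table Properties in DESC EXTENDED.
SHOW TBLPROPERTIES delta_table ('clusteringColumns');
```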

CLUSTER BY Clause

Spark 4.0.0

With SPARK-44886 resolved, support for clustered tables is slated to be available natively in Apache Spark 4.0.0 🔥

The CLUSTER BY clause is made available in CREATE/REPLACE SQL statements using ClusterByParserUtils, which is needed until Apache Spark provides native support that catalog/datasource implementations can use for clustering.

CREATE TABLE tbl(a int, b string)
USING delta
CLUSTER BY (a, b)
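The REPLACE variant accepts the same clause. A hedged sketch, assuming the preview property is enabled in the session; the table name tbl_replaced and its columns are hypothetical:

```sql
-- Enable the clustered-table preview for this session.
SET spark.databricks.delta.clusteredTable.enableClusteringTablePreview=true;

-- Create the table, or replace its definition if it already exists.
-- tbl_replaced and its columns are illustrative only.
CREATE OR REPLACE TABLE tbl_replaced(a int, b string)
USING delta
CLUSTER BY (a, b);
```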

Limitations

  1. Liquid Clustering cannot be used with partitioning (PARTITIONED BY).
  2. Liquid Clustering cannot be used with bucketing (CLUSTERED BY INTO BUCKETS).
  3. Liquid Clustering can be used with 2 to 9 columns in CLUSTER BY.
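The first two limitations can be illustrated with statements that are expected to be rejected. This is a sketch with hypothetical table names; the exact error messages depend on the Delta Lake version and are not shown:

```sql
-- Expected to fail: CLUSTER BY combined with partitioning.
CREATE TABLE t_partitioned(a int, b string)
USING delta
PARTITIONED BY (a)
CLUSTER BY (b);

-- Expected to fail: CLUSTER BY combined with bucketing.
CREATE TABLE t_bucketed(a int, b string)
USING delta
CLUSTERED BY (a) INTO 4 BUCKETS
CLUSTER BY (b);
```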