OPTIMIZE Command¶

From Optimize performance with file management:

To improve query speed, Delta Lake on Databricks supports the ability to optimize the layout of data stored in cloud storage. Delta Lake on Databricks supports two layout algorithms: bin-packing and Z-Ordering.

As of Delta Lake 2.0.0, the above quote applies to the open source version, too.

The OPTIMIZE command can be executed using either of the following:

OPTIMIZE SQL Command
DeltaTable.optimize operator

OPTIMIZE SQL Command¶

OPTIMIZE SQL command
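As a rough illustration of what the SQL command looks like, the sketch below builds OPTIMIZE statements as plain strings. `optimize_sql` is a hypothetical helper (not part of Delta Lake), and the `events` table and its columns are made up for the example.

```python
from typing import Optional, Sequence


def optimize_sql(
    table: str,
    where: Optional[str] = None,
    zorder_by: Sequence[str] = (),
) -> str:
    """Build an OPTIMIZE statement (hypothetical helper, not a Delta Lake API)."""
    parts = [f"OPTIMIZE {table}"]
    if where:
        # Optional predicate to restrict the files to optimize
        parts.append(f"WHERE {where}")
    if zorder_by:
        # Optional columns for multi-dimensional clustering
        parts.append(f"ZORDER BY ({', '.join(zorder_by)})")
    return " ".join(parts)


# Bin-packing (compaction) only
print(optimize_sql("events"))

# Z-Ordering restricted to some partitions (table and columns are assumptions)
print(optimize_sql("events", where="date >= '2022-01-01'", zorder_by=["user_id", "event_type"]))
```

Assuming a `SparkSession` called `spark` with Delta Lake 2.0.0 or later enabled, such a statement would be run with `spark.sql(...)`. The `DeltaTable.optimize` operator should give the same results, e.g. `DeltaTable.forName(spark, "events").optimize().executeCompaction()` or `.executeZOrderBy("user_id")`.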

bin-packing¶

In bin-packing (aka file compaction) mode, the OPTIMIZE command compacts files that are smaller than spark.databricks.delta.optimize.minFileSize into files of up to spark.databricks.delta.optimize.maxFileSize size.
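To build some intuition for how the two thresholds interact, here is a minimal sketch of a greedy bin-packing pass over file sizes. It is a simplification under my own assumptions (sort by size, fill a bin until the next file would overflow it, drop single-file bins), not Delta Lake's actual implementation, and the sizes are illustrative numbers rather than real defaults.

```python
from typing import List


def bin_pack(file_sizes: List[int], min_file_size: int, max_file_size: int) -> List[List[int]]:
    """Group small files into bins of at most max_file_size (illustrative sketch)."""
    # Only files smaller than min_file_size are candidates for compaction
    candidates = sorted(size for size in file_sizes if size < min_file_size)

    bins: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in candidates:
        # Start a new bin when adding this file would exceed the target size
        if current and current_size + size > max_file_size:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)

    # A bin with a single file has nothing to compact
    return [b for b in bins if len(b) > 1]


# The 200-unit file is too large to be a candidate; 40 ends up alone and is skipped
print(bin_pack([10, 20, 30, 200, 40], min_file_size=100, max_file_size=60))
```

Each resulting bin stands for a group of small files that would be rewritten as one larger file.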

Z-Ordering¶

The OPTIMIZE command accepts a ZORDER BY clause that lists the columns to use for multi-dimensional clustering.
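The idea behind Z-ordering can be sketched with plain bit interleaving: the bits of every column value are woven into a single number, and sorting by that number keeps rows that are close in all dimensions close together. This is only an illustration of the Z-order curve on small non-negative integers; `z_value` is a hypothetical function, and the way Delta Lake maps real column values to such numbers is not shown here.

```python
from typing import Sequence


def z_value(coords: Sequence[int], bits: int = 16) -> int:
    """Interleave the lowest `bits` bits of every coordinate into one Z-value."""
    n = len(coords)
    z = 0
    for bit in range(bits):
        for dim, c in enumerate(coords):
            # Bit `bit` of dimension `dim` lands at position bit * n + dim
            z |= ((c >> bit) & 1) << (bit * n + dim)
    return z


# Sorting a 4x4 grid of points by Z-value walks the grid in the "Z" pattern
points = [(x, y) for x in range(4) for y in range(4)]
for point in sorted(points, key=z_value):
    print(point, z_value(point))
```

Points that share a 2x2 (and then a 4x4) block end up next to each other in the sorted order, which is what makes data skipping on several columns at once possible.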

optimize.maxThreads¶

The OPTIMIZE command uses spark.databricks.delta.optimize.maxThreads threads for compaction.
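A bounded thread pool is one way to picture this setting. The sketch below is an analogy in plain Python, not Delta Lake code: `compact` is a stand-in that merely sums file sizes, and `max_threads` plays the role of spark.databricks.delta.optimize.maxThreads.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def compact(bin_sizes: List[int]) -> int:
    """Stand-in for rewriting a group of small files as one file."""
    return sum(bin_sizes)


def compact_in_parallel(bins: List[List[int]], max_threads: int) -> List[int]:
    """Run one compaction job per bin with at most max_threads jobs at a time."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # map keeps the results in the same order as the input bins
        return list(pool.map(compact, bins))


print(compact_in_parallel([[10, 20, 30], [40, 50], [5, 5]], max_threads=2))
```

Assuming a `SparkSession` called `spark`, the property itself would be set like any other Spark SQL configuration, e.g. `spark.conf.set("spark.databricks.delta.optimize.maxThreads", "8")` (the value is just an example).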

Demo¶

Demo: Optimize

Learning More¶

There seem to be many articles and academic papers about clustering algorithms based on space-filling curves. I'm hoping that one day I'll have read enough to develop my own intuition about z-order multi-dimensional optimization. If you know good articles about this space (pun intended), let me know. I'll collect them here for future reference (so that others can learn along).

Thank you! 🙏

Z-order curve