OPTIMIZE Command

From the Databricks documentation page Optimize performance with file management:

To improve query speed, Delta Lake on Databricks supports the ability to optimize the layout of data stored in cloud storage. Delta Lake on Databricks supports two layout algorithms: bin-packing and Z-Ordering.

As of Delta Lake 2.0.0, the above quote applies to the open source version, too.

OPTIMIZE command can be executed using the OPTIMIZE SQL command or the DeltaTable.optimize operator.
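As a hedged illustration of the SQL form, the following Python helper assembles an OPTIMIZE statement with the optional WHERE and ZORDER BY clauses. The helper name optimize_statement, the table name events and the column names are made up for this example, and the helper does no quoting or validation of its inputs.

```python
from typing import Optional, Sequence


def optimize_statement(
    table: str,
    where: Optional[str] = None,
    zorder_by: Sequence[str] = (),
) -> str:
    """Build an OPTIMIZE statement (hypothetical helper, no input escaping)."""
    parts = [f"OPTIMIZE {table}"]
    if where:
        # Optional predicate that narrows down what gets optimized
        parts.append(f"WHERE {where}")
    if zorder_by:
        # Optional columns for Z-Ordering; without them OPTIMIZE bin-packs
        parts.append(f"ZORDER BY ({', '.join(zorder_by)})")
    return " ".join(parts)


# Plain compaction of a hypothetical table
compaction = optimize_statement("events")

# Z-Ordering restricted by a predicate (hypothetical columns)
zorder = optimize_statement(
    "events", where="date >= '2022-01-01'", zorder_by=["event_type"]
)
```

With a Delta-enabled SparkSession (assumed here to be called spark), such a string could be passed to spark.sql(...). In the Python API the programmatic counterpart is DeltaTable.forName(spark, "events").optimize() followed by executeCompaction() or executeZOrderBy("event_type").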

bin-packing

In bin-packing (a.k.a. file compaction) mode, the OPTIMIZE command compacts files that are smaller than spark.databricks.delta.optimize.minFileSize into files of up to spark.databricks.delta.optimize.maxFileSize in size.
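The idea can be sketched with a simplified greedy bin-packing routine in plain Python. This is an illustration under stated assumptions and not Delta Lake's actual code: files are represented by their sizes only, the function name plan_compaction is made up, and the two thresholds stand in for the minFileSize and maxFileSize properties.

```python
from typing import List


def plan_compaction(
    file_sizes: List[int], min_file_size: int, max_file_size: int
) -> List[List[int]]:
    """Group small files into bins whose total size stays within max_file_size."""
    # Only files below the minimum size are candidates for compaction
    candidates = sorted(size for size in file_sizes if size < min_file_size)

    bins: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in candidates:
        # Close the current bin when adding this file would exceed the limit
        if current and current_size + size > max_file_size:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)

    # A bin with a single file has nothing to be merged with
    return [b for b in bins if len(b) > 1]


# Illustrative numbers only (not the defaults of the configuration properties)
plan = plan_compaction([10, 20, 30, 200, 40], min_file_size=100, max_file_size=60)
```

Here plan is [[10, 20, 30]]: the 200 file is above the minimum size and is left alone, and the 40 file ends up alone in its bin, so there is nothing to compact it with.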

Z-Ordering

OPTIMIZE can specify ZORDER BY columns for multi-dimensional clustering.
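To build some intuition for the Z-order curve behind this clustering, here is a minimal sketch that assumes two columns holding small non-negative integers. It interleaves the bits of the two values into a single number and sorts rows by it, so rows that are close in both dimensions tend to end up near each other. The function name z_value and the bits parameter are made up for illustration, and this is not how Delta Lake implements Z-Ordering.

```python
from typing import List, Tuple


def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the lowest `bits` bits of x and y (x takes the even positions)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # bit i of x goes to position 2i
        z |= ((y >> i) & 1) << (2 * i + 1)  # bit i of y goes to position 2i + 1
    return z


# All points of a 2 x 2 grid, ordered along the Z-order curve
points: List[Tuple[int, int]] = [(1, 1), (0, 1), (1, 0), (0, 0)]
ordered = sorted(points, key=lambda p: z_value(p[0], p[1]))
```

For the 2 x 2 grid the resulting order is (0, 0), (1, 0), (0, 1), (1, 1), which traces the "Z" shape the curve is named after.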

optimize.maxThreads

OPTIMIZE command uses spark.databricks.delta.optimize.maxThreads threads for compaction.
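One way to picture this setting, as a sketch only: the groups of files to compact are independent of one another, so they can be handed to a fixed-size pool of worker threads. The names run_compaction_jobs and compact are hypothetical, and the compact callable stands in for the real work of rewriting a group of files.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def run_compaction_jobs(
    bins: Sequence[List[int]],
    compact: Callable[[List[int]], int],
    max_threads: int,
) -> List[int]:
    """Run one compaction job per bin on at most max_threads worker threads."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # map keeps the results in the same order as the input bins
        return list(pool.map(compact, bins))


# Stand-in job: "compacting" a bin just adds up the sizes of its files
merged_sizes = run_compaction_jobs([[10, 20, 30], [5, 5]], compact=sum, max_threads=2)
```

With the stand-in job above, merged_sizes is [60, 10], one merged size per bin.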

Demo

Demo: Optimize

Learning More

There seem to be so many articles and academic papers about clustering algorithms based on space-filling curves. I'm hoping that one day I'll have read enough to develop my own intuition about z-order multi-dimensional optimization. If you know good articles about this space (pun intended), let me know. I'll collect them here for future reference (so others can learn along).

Thank you! 🙏

  1. Z-order curve