OPTIMIZE Command
From Optimize performance with file management:
To improve query speed, Delta Lake on Databricks supports the ability to optimize the layout of data stored in cloud storage. Delta Lake on Databricks supports two layout algorithms: bin-packing and Z-Ordering.
As of Delta Lake 2.0.0, the above quote applies to the open source version, too.
The OPTIMIZE command can be executed as follows:
- OPTIMIZE SQL Command
- DeltaTable.optimize operator
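As a rough illustration of the two entry points, the following Python sketch builds an OPTIMIZE SQL statement and, separately, calls the DeltaTable.optimize operator through the delta-spark Python API. The helper names (optimize_statement, optimize_with_sql, optimize_with_api) are made up for this example, and the code assumes Delta Lake 2.0.0 or later with an already configured SparkSession.

```python
from typing import Optional, Sequence


def optimize_statement(
    table: str,
    where: Optional[str] = None,
    zorder_by: Sequence[str] = (),
) -> str:
    """Build an OPTIMIZE statement (hypothetical helper, not part of Delta Lake)."""
    parts = [f"OPTIMIZE {table}"]
    if where:
        # Optional predicate that limits the files considered
        # (typically a filter on partition columns).
        parts.append(f"WHERE {where}")
    if zorder_by:
        # Columns for multi-dimensional clustering.
        parts.append(f"ZORDER BY ({', '.join(zorder_by)})")
    return " ".join(parts)


def optimize_with_sql(spark, table, where=None, zorder_by=()):
    """Run the OPTIMIZE SQL command through an existing SparkSession."""
    return spark.sql(optimize_statement(table, where, zorder_by))


def optimize_with_api(spark, path, zorder_by=()):
    """Run OPTIMIZE through the DeltaTable.optimize operator."""
    # Imported lazily so this module loads even without delta-spark installed.
    from delta.tables import DeltaTable

    builder = DeltaTable.forPath(spark, path).optimize()
    if zorder_by:
        return builder.executeZOrderBy(*zorder_by)
    return builder.executeCompaction()
```

With a live session, optimize_with_sql(spark, "events", zorder_by=["event_type"]) and optimize_with_api(spark, "/tmp/delta/events", ["event_type"]) should request the same kind of work; the table name, path, and column here are placeholders.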
OPTIMIZE SQL Command
See OPTIMIZE SQL command.
bin-packing
In bin-packing (aka file compaction) mode, the OPTIMIZE command compacts files that are smaller than spark.databricks.delta.optimize.minFileSize into files of spark.databricks.delta.optimize.maxFileSize size.
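The idea behind bin-packing can be sketched in a few lines of Python. This is a simplified greedy illustration under assumed inputs (plain file sizes in arbitrary units), not Delta Lake's actual implementation; plan_compaction and its parameters are hypothetical names that only mirror the two configuration properties above.

```python
from typing import List


def plan_compaction(
    file_sizes: List[int],
    min_file_size: int,
    max_file_size: int,
) -> List[List[int]]:
    """Group small files into bins whose total size stays within max_file_size."""
    # Only files below the minimum size are candidates for compaction.
    candidates = sorted(size for size in file_sizes if size < min_file_size)

    bins: List[List[int]] = []
    current: List[int] = []
    current_total = 0
    for size in candidates:
        # Start a new bin when adding this file would exceed the target size.
        if current and current_total + size > max_file_size:
            bins.append(current)
            current, current_total = [], 0
        current.append(size)
        current_total += size
    if current:
        bins.append(current)

    # Rewriting a bin that holds a single file gains nothing, so drop it.
    return [group for group in bins if len(group) > 1]
```

Each returned group stands for a set of small files that would be rewritten as one larger file. Files at or above min_file_size are left untouched.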
Z-Ordering
OPTIMIZE can specify ZORDER BY columns for multi-dimensional clustering.
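To build some intuition for multi-dimensional clustering, here is a minimal Python sketch of a Z-order (Morton) value computed by interleaving the bits of two non-negative integers. It illustrates the general idea of a space-filling curve only and is not Delta Lake's implementation; interleave_bits and the 8-bit width are assumptions made for this example.

```python
def interleave_bits(x: int, y: int, bits: int = 8) -> int:
    """Interleave the low `bits` bits of x and y into a single Z-order value."""
    z = 0
    for i in range(bits):
        # Bit i of x goes to the odd position, bit i of y to the even one.
        z |= ((x >> i) & 1) << (2 * i + 1)
        z |= ((y >> i) & 1) << (2 * i)
    return z


# Sorting points by their Z-order value keeps points that are close
# in both dimensions near each other in the resulting order.
points = [(x, y) for x in range(4) for y in range(4)]
ordered = sorted(points, key=lambda p: interleave_bits(p[0], p[1]))
```

Sorting rows by such a value, rather than by one column after another, is what lets a single file cover a compact range of values in every clustered column at once.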
optimize.maxThreads
The OPTIMIZE command uses spark.databricks.delta.optimize.maxThreads threads for compaction.
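The configuration properties mentioned on this page can be set on a SparkSession before running OPTIMIZE. The Python sketch below assumes an existing session; optimize_settings and apply_settings are hypothetical helpers, and any values passed in are illustrative rather than recommendations.

```python
from typing import Dict


def optimize_settings(
    min_file_size: int,
    max_file_size: int,
    max_threads: int,
) -> Dict[str, str]:
    """Collect OPTIMIZE-related properties as Spark configuration entries."""
    return {
        "spark.databricks.delta.optimize.minFileSize": str(min_file_size),
        "spark.databricks.delta.optimize.maxFileSize": str(max_file_size),
        "spark.databricks.delta.optimize.maxThreads": str(max_threads),
    }


def apply_settings(spark, settings: Dict[str, str]) -> None:
    """Set every entry on the session's runtime configuration."""
    for key, value in settings.items():
        spark.conf.set(key, value)
```

A call such as apply_settings(spark, optimize_settings(256 * 1024 * 1024, 1024 * 1024 * 1024, 8)) would then affect subsequent OPTIMIZE runs in that session.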
Demo
Learning More
There seem to be many articles and academic papers about clustering algorithms based on space-filling curves. I'm hoping that one day I'll have read enough to develop my own intuition about Z-order multi-dimensional optimization. If you know good articles about this space (pun intended), let me know. I'll collect them here for future reference (for others to learn along).
Thank you! 🙏