
Data Skipping

Data Skipping is an optimization for queries with filter clauses. It uses per-file column statistics to find the set of Parquet data files that need to be scanned, and prunes the files that cannot contain any rows matching the filters. Consequently, a query with no filters gets no benefit from data skipping, since every file has to be read.
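The pruning idea can be illustrated with a minimal, self-contained sketch. It assumes each data file carries a minimum and maximum value for the filtered column; `FileStats` and `files_to_scan` are hypothetical names made up for this illustration and are not part of Delta Lake's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file statistics for a single integer column."""
    path: str
    min_value: int
    max_value: int


def files_to_scan(files: List[FileStats], lower: int, upper: int) -> List[str]:
    """Return paths of files whose [min, max] range overlaps [lower, upper].

    A file is skipped only when its statistics prove that no row can match.
    """
    return [
        f.path
        for f in files
        if f.max_value >= lower and f.min_value <= upper
    ]


# Three illustrative files with non-overlapping value ranges.
files = [
    FileStats("part-0.parquet", 0, 99),
    FileStats("part-1.parquet", 100, 199),
    FileStats("part-2.parquet", 200, 299),
]

# Equivalent of a filter such as: id BETWEEN 150 AND 210
selected = files_to_scan(files, 150, 210)
print(selected)
```

Here `part-0.parquet` is pruned because its maximum value is below the lower bound of the filter. Statistics can only rule files out: a file that is kept may still contain no matching rows.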

Data Skipping is enabled using the spark.databricks.delta.stats.skipping configuration property.
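As a rough sketch, the property can be toggled on a running session through the session's runtime configuration. The helper name `set_data_skipping` is hypothetical, and `spark` is assumed to be an existing `SparkSession` configured with Delta Lake.

```python
DATA_SKIPPING_KEY = "spark.databricks.delta.stats.skipping"


def set_data_skipping(spark, enabled: bool = True) -> None:
    """Enable or disable data skipping on the given session.

    `spark` is assumed to be a SparkSession (or anything exposing conf.set).
    The value is passed as a lowercase string ("true" or "false").
    """
    spark.conf.set(DATA_SKIPPING_KEY, str(enabled).lower())
```

Calling `set_data_skipping(spark, False)` before running a query is one way to compare query plans with and without file pruning.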

LIMIT Pushdown

LIMIT Pushdown

Internals

Data Skipping uses DataSkippingReaderBase as the main abstraction for scanning parquet data files (with getDataSkippedFiles being the crucial part).

Demo

Data Skipping

Learn More

  1. Delta Lake 1.2 - More Speed, Efficiency and Extensibility Than Ever