Data Skipping

Data Skipping is an optimization for queries with filter clauses. It uses per-file column statistics to find the set of parquet data files that need to be read, and prunes the files that cannot contain any rows matching the filters. A query with no filters therefore gets no benefit from data skipping.
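The pruning idea can be illustrated with a minimal, self-contained sketch. It is not Delta Lake's implementation: the `FileStats` class and the `files_to_scan` function are hypothetical names introduced for illustration, and the sketch assumes one integer column with per-file minimum and maximum values and a simple range filter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file statistics for a single column (illustration only)."""
    path: str
    min_value: int
    max_value: int


def files_to_scan(files, lower, upper):
    """Return paths of files whose [min_value, max_value] range overlaps [lower, upper].

    A file is skipped only when its statistics prove that no row can match the filter.
    """
    return [f.path for f in files if f.max_value >= lower and f.min_value <= upper]


files = [
    FileStats("part-0.parquet", min_value=0, max_value=99),
    FileStats("part-1.parquet", min_value=100, max_value=199),
    FileStats("part-2.parquet", min_value=200, max_value=299),
]

# Filter equivalent to: id BETWEEN 150 AND 210
selected = files_to_scan(files, 150, 210)
print(selected)  # ['part-1.parquet', 'part-2.parquet']
```

Here `part-0.parquet` is pruned because its maximum value (99) is below the lower bound of the filter, so it cannot contain a matching row. Files that are kept may still contain non-matching rows; statistics only rule files out.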

Data Skipping is enabled using the spark.databricks.delta.stats.skipping configuration property.
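As a rough sketch, the property could be toggled per session through the session's runtime configuration. The helper name `set_data_skipping` is an assumption made for this example; the only call it relies on is `conf.set(key, value)`, which a PySpark `SparkSession` provides.

```python
# Property name taken from the text above.
SKIPPING_KEY = "spark.databricks.delta.stats.skipping"


def set_data_skipping(spark, enabled):
    """Turn data skipping on or off for the given session.

    `spark` is expected to be a SparkSession (or any object exposing conf.set).
    This helper is hypothetical and not part of Delta Lake.
    """
    spark.conf.set(SKIPPING_KEY, str(enabled).lower())
```

With an active session, `set_data_skipping(spark, False)` would disable the optimization, which can help when comparing how many files a filtered query reads with and without data skipping.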

Data Skipping is available as of Delta Lake 1.2.0.

Limitations

Data Skipping supports flat filters only, i.e., filters that contain no SubqueryExpressions (Spark SQL).

Internals

Data Skipping uses DataSkippingReaderBase as the main abstraction for scanning parquet data files (with getDataSkippedFiles being the crucial part).

Demo

Data Skipping

Learn More

  1. Delta Lake 1.2 - More Speed, Efficiency and Extensibility Than Ever