Data Skipping
Data Skipping is an optimization for queries with filter clauses. It uses data skipping column statistics to find the set of parquet data files that need to be scanned, and prunes the files that cannot contain any rows matching the filters. A query with no filters therefore bypasses data skipping altogether.
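The pruning idea can be illustrated with a minimal, self-contained sketch. It is not Delta Lake's implementation: it assumes each data file carries min/max statistics for a single integer column, and the names `FileStats` and `files_to_scan` are hypothetical, introduced only for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file statistics for one integer column (assumption)."""
    path: str
    min_value: int
    max_value: int


def files_to_scan(files: List[FileStats], lower: int, upper: int) -> List[FileStats]:
    """Keep only files whose [min_value, max_value] range overlaps [lower, upper].

    A file whose range does not overlap the filter range cannot contain
    a matching row, so it can be skipped without being read.
    """
    return [f for f in files if f.max_value >= lower and f.min_value <= upper]


files = [
    FileStats("part-0.parquet", 1, 100),
    FileStats("part-1.parquet", 101, 200),
    FileStats("part-2.parquet", 201, 300),
]

# Filter equivalent to: WHERE id BETWEEN 150 AND 180
selected = files_to_scan(files, 150, 180)
print([f.path for f in selected])  # ['part-1.parquet']
```

With no filter there is no range to compare against, so nothing can be pruned, which matches the remark above that unfiltered queries do not use data skipping.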
Data Skipping is enabled using the spark.databricks.delta.stats.skipping configuration property.
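As a hedged sketch, the property could be toggled at runtime through a session's `conf.set` method, as exposed by a PySpark `SparkSession`. The helper name `set_data_skipping` is hypothetical, and the function only assumes that the object passed in has a `conf.set(key, value)` method.

```python
DATA_SKIPPING_KEY = "spark.databricks.delta.stats.skipping"


def set_data_skipping(spark, enabled: bool) -> None:
    """Set the data skipping property on a session-like object.

    `spark` is assumed to expose `conf.set(key, value)`, as a PySpark
    SparkSession does. The value is passed as the string "true" or "false".
    """
    spark.conf.set(DATA_SKIPPING_KEY, "true" if enabled else "false")
```

Given an existing session named `spark`, calling `set_data_skipping(spark, False)` would turn the optimization off for that session, which can be handy when comparing query plans with and without file pruning.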
LIMIT Pushdown
Internals
Data Skipping uses DataSkippingReaderBase as the main abstraction for scanning parquet data files (with getDataSkippedFiles being the crucial part).