ParquetScanBuilder

ParquetScanBuilder is a FileScanBuilder (of ParquetTable) that SupportsPushDownAggregates.

ParquetScanBuilder builds ParquetScans.

ParquetScanBuilder supportsNestedSchemaPruning.

Creating Instance

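ParquetScanBuilder takes the following to be created: a SparkSession, a PartitioningAwareFileIndex (fileIndex), a schema (StructType), a data schema (dataSchema, StructType), and options (CaseInsensitiveStringMap).

ParquetScanBuilder is created when ParquetTable is requested for a ScanBuilder (newScanBuilder).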

Building Scan

Signature
build(): Scan

build is part of the ScanBuilder abstraction.

build creates a ParquetScan with the following:

| ParquetScan | Value |
|---|---|
| fileIndex | the given fileIndex |
| dataSchema | the given dataSchema |
| readDataSchema | finalSchema |
| readPartitionSchema | readPartitionSchema |
| pushedFilters | pushedDataFilters |
| options | the given options |
| pushedAggregate | pushedAggregations |
| partitionFilters | partitionFilters |
| dataFilters | dataFilters |

pushedAggregations

pushedAggregations: Option[Aggregation]

ParquetScanBuilder defines pushedAggregations as an internal registry for the Aggregation pushed down to the scan, if any.

pushedAggregations is undefined (None) when ParquetScanBuilder is created and is only assigned in pushAggregation.

pushedAggregations controls the finalSchema. When pushedAggregations is undefined, build uses readDataSchema as the finalSchema of the ParquetScan.

pushedAggregations is used to create a ParquetScan.
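
The relationship between pushedAggregations and finalSchema can be sketched with a small, Spark-free model. This is only an illustration under stated assumptions: ScanBuilderModel, ScanModel, Schema, Aggregation (as defined here) and the register method are hypothetical stand-ins, not Spark classes. Only the names pushedAggregations, finalSchema and readDataSchema come from the description above.

```scala
// Simplified, Spark-free model; every type here is a hypothetical stand-in.
object PushedAggregationsSketch {
  type Schema = Seq[String] // stand-in for StructType (column names only)
  final case class Aggregation(expressions: Seq[String])
  final case class ScanModel(readDataSchema: Schema, pushedAggregate: Option[Aggregation])

  final class ScanBuilderModel(readDataSchema: Schema) {
    // Undefined when the builder is created
    private var pushedAggregations: Option[Aggregation] = None
    private var finalSchema: Schema = Seq.empty

    // Assumed shape of the registration step: record the aggregation and its schema
    def register(aggregation: Aggregation, aggregateSchema: Schema): Unit = {
      pushedAggregations = Some(aggregation)
      finalSchema = aggregateSchema
    }

    def build(): ScanModel = {
      // With no aggregation pushed down, fall back to readDataSchema
      if (pushedAggregations.isEmpty) finalSchema = readDataSchema
      ScanModel(finalSchema, pushedAggregations)
    }
  }
}
```

In this model, a builder with no registered aggregation builds a scan that reads readDataSchema, while a builder with a registered aggregation builds a scan that reads the aggregate schema instead.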

pushAggregation

Signature
pushAggregation(
  aggregation: Aggregation): Boolean

pushAggregation is part of the SupportsPushDownAggregates abstraction.

pushAggregation does nothing and returns false when spark.sql.parquet.aggregatePushdown is disabled.

pushAggregation determines the data schema for the aggregate to be pushed down.

With the schema determined, pushAggregation registers it as finalSchema and the given Aggregation as pushedAggregations. pushAggregation returns true.

Otherwise, pushAggregation returns false.
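
The decision flow of pushAggregation can be sketched as follows. This is a minimal, Spark-free model, not the Spark implementation: ScanBuilderModel, the Schema alias, this Aggregation case class and the schemaForPushedAggregation hook are hypothetical names introduced for illustration, and the boolean flag stands in for spark.sql.parquet.aggregatePushdown.

```scala
// Simplified, Spark-free model; every type here is a hypothetical stand-in.
object PushAggregationSketch {
  type Schema = Seq[String] // stand-in for StructType (column names only)
  final case class Aggregation(expressions: Seq[String])

  final class ScanBuilderModel(
      // stands in for spark.sql.parquet.aggregatePushdown
      aggregatePushdownEnabled: Boolean,
      // hypothetical hook: the aggregate schema, or None when push-down is not possible
      schemaForPushedAggregation: Aggregation => Option[Schema]) {

    var finalSchema: Schema = Seq.empty
    var pushedAggregations: Option[Aggregation] = None

    def pushAggregation(aggregation: Aggregation): Boolean = {
      if (!aggregatePushdownEnabled) {
        false // push-down disabled: do nothing
      } else {
        schemaForPushedAggregation(aggregation) match {
          case Some(schema) =>
            // Schema determined: register both and report success
            finalSchema = schema
            pushedAggregations = Some(aggregation)
            true
          case None =>
            false
        }
      }
    }
  }
}
```

The sketch keeps the two registrations (finalSchema and pushedAggregations) together so that a false result always leaves the builder unchanged.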

pushDataFilters

Signature
pushDataFilters(
  dataFilters: Array[Filter]): Array[Filter]

pushDataFilters is part of the FileScanBuilder abstraction.

pushDataFilters does nothing and returns no Filters (an empty array) when spark.sql.parquet.filterPushdown is disabled.

pushDataFilters creates a ParquetFilters with the readDataSchema (converted into the corresponding parquet schema) and the following configuration properties:

In the end, pushDataFilters requests the ParquetFilters for the convertibleFilters for the given dataFilters and returns them.
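
The overall shape of pushDataFilters can be sketched with a Spark-free model. The Filter hierarchy, ParquetFiltersModel and the rule it uses to decide which filters are convertible are hypothetical simplifications made for illustration. Only the flag (standing in for spark.sql.parquet.filterPushdown) and the convertibleFilters step come from the description above.

```scala
// Simplified, Spark-free model; every type here is a hypothetical stand-in.
object PushDataFiltersSketch {
  sealed trait Filter
  final case class EqualTo(attribute: String, value: Any) extends Filter
  final case class Opaque(description: String) extends Filter // a filter that cannot be converted

  // Stand-in for ParquetFilters: keeps only the filters it can convert.
  // Assumption: a filter is convertible when it is an EqualTo on a known column.
  final class ParquetFiltersModel(columns: Set[String]) {
    def convertibleFilters(filters: Seq[Filter]): Seq[Filter] =
      filters.collect { case f @ EqualTo(attr, _) if columns.contains(attr) => f }
  }

  def pushDataFilters(
      filterPushdownEnabled: Boolean, // stands in for spark.sql.parquet.filterPushdown
      readDataSchema: Set[String],
      dataFilters: Array[Filter]): Array[Filter] =
    if (filterPushdownEnabled) {
      new ParquetFiltersModel(readDataSchema).convertibleFilters(dataFilters.toSeq).toArray
    } else {
      Array.empty[Filter] // push-down disabled: nothing is pushed
    }
}
```

With the flag off the model returns an empty array regardless of input. With it on, only the filters the ParquetFiltersModel accepts are returned.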

supportsNestedSchemaPruning

Signature
supportsNestedSchemaPruning: Boolean

supportsNestedSchemaPruning is part of the FileScanBuilder abstraction.

supportsNestedSchemaPruning is enabled (true).