ParquetScanBuilder¶
ParquetScanBuilder
is a FileScanBuilder (of ParquetTable) that SupportsPushDownAggregates.
ParquetScanBuilder
builds ParquetScans.
ParquetScanBuilder
supportsNestedSchemaPruning.
Creating Instance¶
ParquetScanBuilder
takes the following to be created:
- SparkSession
- PartitioningAwareFileIndex
- Schema
- Data Schema
- Case-Insensitive Options
ParquetScanBuilder
is created when:
ParquetTable
is requested to newScanBuilder
Building Scan¶
build
creates a ParquetScan with the following:
ParquetScan | Value |
---|---|
fileIndex | the given fileIndex |
dataSchema | the given dataSchema |
readDataSchema | finalSchema |
readPartitionSchema | readPartitionSchema |
pushedFilters | pushedDataFilters |
options | the given options |
pushedAggregate | pushedAggregations |
partitionFilters | partitionFilters |
dataFilters | dataFilters |
pushedAggregations¶
pushedAggregations: Option[Aggregation]
ParquetScanBuilder
defines pushedAggregations
as an internal registry for the Aggregation to be pushed down (if any).
pushedAggregations
is undefined (None) when ParquetScanBuilder
is created and is only assigned in pushAggregation.
pushedAggregations
controls the finalSchema. When pushedAggregations is undefined, build uses readDataSchema as the finalSchema of the ParquetScan.
pushedAggregations
is used to create a ParquetScan.
pushAggregation¶
Signature
pushAggregation(
aggregation: Aggregation): Boolean
pushAggregation
is part of the SupportsPushDownAggregates abstraction.
pushAggregation
does nothing and returns false
when spark.sql.parquet.aggregatePushdown is disabled.
Otherwise, pushAggregation
determines the data schema for the aggregate to be pushed down.
If the schema can be determined, pushAggregation
registers it as finalSchema and the given Aggregation as pushedAggregations, and returns true.
Otherwise, pushAggregation
returns false.
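The decision flow above can be modelled in plain Scala. This is only a sketch under stated assumptions, not Spark's code: `Schema`, `Aggregation`, `ScanBuilderModel`, `schemaFor` and `schemaForScan` are hypothetical stand-ins introduced for illustration, and the `aggregatePushdownEnabled` flag stands for spark.sql.parquet.aggregatePushdown.

```scala
object AggregatePushdownSketch {
  // Hypothetical stand-ins for Spark's StructType and Aggregation (not the real classes).
  final case class Schema(fieldNames: Seq[String])
  final case class Aggregation(functions: Seq[String], groupBy: Seq[String])

  final class ScanBuilderModel(
      aggregatePushdownEnabled: Boolean,           // models spark.sql.parquet.aggregatePushdown
      readDataSchema: Schema,
      schemaFor: Aggregation => Option[Schema]) {  // assumed helper: None means "cannot be pushed down"

    private var finalSchema: Schema = Schema(Seq.empty)
    private var pushedAggregations: Option[Aggregation] = None

    def pushAggregation(aggregation: Aggregation): Boolean =
      if (!aggregatePushdownEnabled) {
        false // pushdown disabled: nothing is registered
      } else {
        schemaFor(aggregation) match {
          case Some(schema) =>
            // Register both the schema and the aggregation
            finalSchema = schema
            pushedAggregations = Some(aggregation)
            true
          case None =>
            false
        }
      }

    // With no pushed aggregation, the scan falls back to readDataSchema.
    def schemaForScan: Schema =
      if (pushedAggregations.isEmpty) readDataSchema else finalSchema
  }
}
```

In this model, a caller that receives false keeps the aggregation for itself and `schemaForScan` still yields readDataSchema, which matches the rule that an undefined pushedAggregations leaves the read schema untouched.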
pushDataFilters¶
Signature
pushDataFilters(
dataFilters: Array[Filter]): Array[Filter]
pushDataFilters
is part of the FileScanBuilder abstraction.
spark.sql.parquet.filterPushdown
pushDataFilters
does nothing and returns no filters (an empty array) when spark.sql.parquet.filterPushdown is disabled.
Otherwise, pushDataFilters
creates a ParquetFilters with the readDataSchema (converted into the corresponding Parquet schema) and the following configuration properties:
- spark.sql.parquet.filterPushdown.date
- spark.sql.parquet.filterPushdown.decimal
- spark.sql.parquet.filterPushdown.string.startsWith
- spark.sql.parquet.filterPushdown.timestamp
- spark.sql.parquet.pushdown.inFilterThreshold
- spark.sql.caseSensitive
In the end, pushDataFilters
requests the ParquetFilters
for the convertibleFilters of the given dataFilters and returns them.
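The control flow of pushDataFilters can be sketched in plain Scala as follows. Everything here is an assumption made for illustration: the `Filter` hierarchy, `PushdownConf` and the `isConvertible` rule are simplified, hypothetical stand-ins, and the real convertibleFilters of ParquetFilters handles many more cases. Only two of the configuration properties are modelled.

```scala
object DataFilterPushdownSketch {
  // Hypothetical, simplified filter model (not Spark's filter classes).
  sealed trait Filter
  final case class EqualTo(attribute: String, value: Any) extends Filter
  final case class StringStartsWith(attribute: String, prefix: String) extends Filter
  final case class Opaque(description: String) extends Filter

  // filterPushdown models spark.sql.parquet.filterPushdown,
  // pushDownStartsWith models spark.sql.parquet.filterPushdown.string.startsWith.
  final case class PushdownConf(filterPushdown: Boolean, pushDownStartsWith: Boolean)

  // Assumed convertibility rule, for illustration only:
  // the column must exist, and startsWith is accepted only when enabled.
  private def isConvertible(conf: PushdownConf, columns: Set[String], filter: Filter): Boolean =
    filter match {
      case EqualTo(attribute, _)          => columns.contains(attribute)
      case StringStartsWith(attribute, _) => conf.pushDownStartsWith && columns.contains(attribute)
      case Opaque(_)                      => false
    }

  def pushDataFilters(
      conf: PushdownConf,
      columns: Set[String],
      dataFilters: Array[Filter]): Array[Filter] =
    if (!conf.filterPushdown) Array.empty[Filter]                  // pushdown disabled: nothing is pushed
    else dataFilters.filter(f => isConvertible(conf, columns, f)) // keep only convertible filters
}
```

The point of the sketch is the shape of the result: the returned array is a subset of the given dataFilters, and it is empty whenever filter pushdown is disabled.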
supportsNestedSchemaPruning¶
Signature
supportsNestedSchemaPruning: Boolean
supportsNestedSchemaPruning
is part of the FileScanBuilder abstraction.
supportsNestedSchemaPruning
is enabled (true).