ParquetScanBuilder
ParquetScanBuilder is a FileScanBuilder (of ParquetTable) that SupportsPushDownFilters.
ParquetScanBuilder builds ParquetScans.
ParquetScanBuilder supports nested schema pruning (supportsNestedSchemaPruning).
Creating Instance
ParquetScanBuilder takes the following to be created:
- SparkSession
- PartitioningAwareFileIndex
- Schema
- Data Schema
- Case-Insensitive Options
ParquetScanBuilder is created when:
- ParquetTable is requested to newScanBuilder
Building Scan
build creates a ParquetScan with the following:
| ParquetScan | Value |
|---|---|
| fileIndex | the given fileIndex |
| dataSchema | the given dataSchema |
| readDataSchema | finalSchema |
| readPartitionSchema | readPartitionSchema |
| pushedFilters | pushedDataFilters |
| options | the given options |
| pushedAggregate | pushedAggregations |
| partitionFilters | partitionFilters |
| dataFilters | dataFilters |
pushedAggregations
pushedAggregations: Option[Aggregation]
ParquetScanBuilder defines a pushedAggregations registry for an Aggregation.
pushedAggregations is undefined when ParquetScanBuilder is created and can only be assigned by pushAggregation.
pushedAggregations controls the finalSchema. When pushedAggregations is undefined, build uses readDataSchema as the finalSchema of the ParquetScan.
pushedAggregations is used to create a ParquetScan.
pushAggregation
Signature
pushAggregation(
aggregation: Aggregation): Boolean
pushAggregation is part of the SupportsPushDownAggregates abstraction.
pushAggregation does nothing and returns false when spark.sql.parquet.aggregatePushdown is disabled.
pushAggregation determines the data schema for the aggregate to be pushed down.
If the schema can be determined, pushAggregation registers it as finalSchema and the given Aggregation as pushedAggregations, and returns true.
Otherwise, pushAggregation returns false.
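The control flow of pushAggregation, together with the way pushedAggregations selects the schema for the scan, can be sketched with a toy model. This is not Spark's code: Schema, Aggregation, AggPushdownModel, schemaForPushdown and schemaForScan are hypothetical stand-ins, and the rule that only MIN, MAX and COUNT are eligible is a simplification of the real checks.

```scala
// Simplified stand-ins for Spark's types (hypothetical, for illustration only)
final case class Schema(fields: Seq[String])
final case class Aggregation(functions: Seq[String])

// A toy model of the state a scan builder keeps. Not Spark's actual code.
final class AggPushdownModel(
    aggregatePushdownEnabled: Boolean, // models spark.sql.parquet.aggregatePushdown
    readDataSchema: Schema) {

  // Undefined until pushAggregation succeeds
  private var pushedAggregations: Option[Aggregation] = None
  private var finalSchema: Schema = Schema(Nil)

  // Assumed rule: a schema is determined only if every function is min, max or count
  private def schemaForPushdown(agg: Aggregation): Option[Schema] = {
    val supported = Set("min", "max", "count")
    val names = agg.functions.map(_.takeWhile(_ != '(').toLowerCase)
    if (names.nonEmpty && names.forall(supported)) Some(Schema(agg.functions))
    else None
  }

  def pushAggregation(agg: Aggregation): Boolean =
    if (!aggregatePushdownEnabled) false // does nothing when the property is disabled
    else schemaForPushdown(agg) match {
      case Some(schema) =>
        finalSchema = schema            // register the schema
        pushedAggregations = Some(agg)  // register the aggregation
        true
      case None => false
    }

  // The schema a scan would be built with
  def schemaForScan: Schema =
    if (pushedAggregations.isEmpty) readDataSchema else finalSchema
}
```

In this model, a builder created with the property disabled keeps returning readDataSchema from schemaForScan, while a successful pushAggregation switches it to the aggregate schema.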
pushDataFilters
Signature
pushDataFilters(
dataFilters: Array[Filter]): Array[Filter]
pushDataFilters is part of the FileScanBuilder abstraction.
pushDataFilters does nothing and returns no filters (an empty array) when spark.sql.parquet.filterPushdown is disabled.
pushDataFilters creates a ParquetFilters with the readDataSchema (converted into the corresponding Parquet schema) and the following configuration properties:
- spark.sql.parquet.filterPushdown.date
- spark.sql.parquet.filterPushdown.decimal
- spark.sql.parquet.filterPushdown.string.startsWith
- spark.sql.parquet.filterPushdown.timestamp
- spark.sql.parquet.pushdown.inFilterThreshold
- spark.sql.caseSensitive
In the end, pushDataFilters requests the ParquetFilters for the convertibleFilters of the given dataFilters.
supportsNestedSchemaPruning
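The steps above can be illustrated with a small self-contained model. It is only a sketch under stated assumptions: Filter, EqualTo, StringStartsWith, PushdownConf, ToyParquetFilters and PushDataFiltersModel are hypothetical stand-ins rather than Spark's classes, only three of the listed properties are modelled, and the conversion rules are far simpler than the real ones.

```scala
// Simplified stand-ins for data source filters (hypothetical, not Spark's classes)
sealed trait Filter
final case class EqualTo(attribute: String, value: Any) extends Filter
final case class StringStartsWith(attribute: String, prefix: String) extends Filter

// Models a few of the configuration properties listed above
final case class PushdownConf(
    filterPushdown: Boolean,     // spark.sql.parquet.filterPushdown
    pushDownStartsWith: Boolean, // spark.sql.parquet.filterPushdown.string.startsWith
    caseSensitive: Boolean)      // spark.sql.caseSensitive

// Toy counterpart of ParquetFilters: keeps only the filters it could convert
final class ToyParquetFilters(columns: Set[String], conf: PushdownConf) {
  // Column lookup honours the case-sensitivity setting
  private def known(name: String): Boolean =
    if (conf.caseSensitive) columns.contains(name)
    else columns.exists(_.equalsIgnoreCase(name))

  def convertibleFilters(filters: Seq[Filter]): Seq[Filter] = filters.filter {
    case EqualTo(attr, _)          => known(attr)
    case StringStartsWith(attr, _) => conf.pushDownStartsWith && known(attr)
  }
}

object PushDataFiltersModel {
  def pushDataFilters(
      columns: Set[String],
      conf: PushdownConf,
      dataFilters: Array[Filter]): Array[Filter] =
    if (!conf.filterPushdown) Array.empty[Filter] // pushdown disabled: nothing is pushed
    else new ToyParquetFilters(columns, conf).convertibleFilters(dataFilters.toSeq).toArray
}
```

With case-insensitive resolution and startsWith pushdown disabled, the model keeps an equality filter on a known column and drops both a startsWith filter and a filter on an unknown column.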
Signature
supportsNestedSchemaPruning: Boolean
supportsNestedSchemaPruning is part of the FileScanBuilder abstraction.
supportsNestedSchemaPruning is enabled (true).