ParquetScan¶
ParquetScan is the FileScan of Parquet Connector that uses ParquetPartitionReaderFactory with ParquetReadSupport.
Creating Instance¶
ParquetScan takes the following to be created:
- SparkSession
- Hadoop Configuration
- PartitioningAwareFileIndex
- Data schema
- Read data schema
- Read partition schema
- Pushed Filters
- Case-insensitive options
- Pushed Aggregation
- Partition filter expressions (optional)
- Data filter expressions (optional)
ParquetScan is created when:
- ParquetScanBuilder is requested to build a Scan
Pushed Aggregation¶
pushedAggregate: Option[Aggregation] = None
ParquetScan can be given an Aggregation expression (pushedAggregate) when created. The Aggregation is optional and undefined by default (None).
pushedAggregate is given the value of pushedAggregations when ParquetScanBuilder is requested to build a ParquetScan.
When defined, ParquetScan is no longer isSplitable: with an aggregate pushed down, only the file footer is read (once), so a file should not be split across multiple tasks.
The Aggregation is used in the following:
- getMetaData (as pushedAggregationsStr and pushedGroupByStr)
- readSchema
- createReaderFactory (to create a ParquetPartitionReaderFactory)
Creating PartitionReaderFactory¶
Signature
createReaderFactory(): PartitionReaderFactory
createReaderFactory is part of the Batch abstraction.
createReaderFactory creates a ParquetPartitionReaderFactory (with the Hadoop Configuration broadcast).
createReaderFactory adds the following properties to the Hadoop Configuration before broadcasting it (to executors):
Name | Value |
---|---|
ParquetInputFormat.READ_SUPPORT_CLASS | ParquetReadSupport |
others |
isSplitable¶
isSplitable is enabled (true) when all the following hold:
- pushedAggregate is not specified
- RowIndexUtil.isNeededForSchema is false for the readSchema
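The two conditions above can be sketched as a small, Spark-free model. This is only an illustration of the decision, not the actual ParquetScan code: ScanState and SplitabilityModel are hypothetical names, and the rowIndexNeeded flag stands in for the result of RowIndexUtil.isNeededForSchema on the readSchema.

```scala
// Hypothetical stand-in for the state ParquetScan consults; not a Spark class.
final case class ScanState(
    pushedAggregateDefined: Boolean, // true when pushedAggregate is Some(...)
    rowIndexNeeded: Boolean          // stand-in for RowIndexUtil.isNeededForSchema(readSchema)
)

object SplitabilityModel {
  // Splittable only when no aggregate is pushed down
  // and the read schema does not require a row index.
  def isSplitable(state: ScanState): Boolean =
    !state.pushedAggregateDefined && !state.rowIndexNeeded
}
```

In this model, a scan with a pushed aggregate is never splittable, regardless of the row-index flag.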
readSchema¶
readSchema is readDataSchema when an aggregate is pushed down. Otherwise, readSchema is the default readSchema.
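A minimal sketch of that choice, using plain Scala collections instead of Spark's StructType. Field and ReadSchemaModel are hypothetical names, and the sketch assumes that the default readSchema is the read data schema followed by the read partition schema; check that assumption against FileScan before relying on it.

```scala
// Hypothetical, simplified stand-in for a schema field; not Spark's StructField.
final case class Field(name: String, dataType: String)

object ReadSchemaModel {
  def readSchema(
      readDataSchema: Seq[Field],
      readPartitionSchema: Seq[Field],
      aggregatePushedDown: Boolean): Seq[Field] =
    if (aggregatePushedDown) readDataSchema   // aggregate pushed down: data schema only
    else readDataSchema ++ readPartitionSchema // assumed default: data then partition fields
}
```

For example, with a data field `id` and a partition field `year`, the model returns only `id` when an aggregate is pushed down, and `id` followed by `year` otherwise.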
Custom Metadata¶
Signature
getMetaData(): Map[String, String]
getMetaData is part of the SupportsMetadata abstraction.
getMetaData adds the following metadata to the default file-based metadata:
Metadata | Value |
---|---|
PushedFilters | pushedFilters |
PushedAggregation | pushedAggregationsStr |
PushedGroupBy | pushedGroupByStr |
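The table above amounts to extending a map with three extra entries. The following Spark-free sketch illustrates this; MetadataModel is a hypothetical name, the default metadata is passed in as a plain Map, and rendering the pushed filters as a bracketed, comma-separated string is an assumption made for illustration only.

```scala
object MetadataModel {
  def getMetaData(
      defaultMetadata: Map[String, String], // stand-in for the default file-based metadata
      pushedFilters: Seq[String],
      pushedAggregationsStr: String,
      pushedGroupByStr: String): Map[String, String] =
    defaultMetadata ++ Map(
      // Assumed rendering of the filters; the real formatting may differ.
      "PushedFilters" -> pushedFilters.mkString("[", ", ", "]"),
      "PushedAggregation" -> pushedAggregationsStr,
      "PushedGroupBy" -> pushedGroupByStr
    )
}
```

The default entries are kept as they are; only the three Parquet-specific keys are added (or overwritten if they were already present).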