ParquetScan¶
ParquetScan is the FileScan of the Parquet Connector. It uses ParquetPartitionReaderFactory with ParquetReadSupport.
Creating Instance¶
ParquetScan takes the following to be created:
- SparkSession
- Hadoop Configuration
- PartitioningAwareFileIndex
- Data schema
- Read data schema
- Read partition schema
- Pushed Filters
- Case-insensitive options
- Pushed Aggregation
- Partition filter expressions (optional)
- Data filter expressions (optional)
ParquetScan is created when:
- ParquetScanBuilder is requested to build a Scan
Pushed Aggregation¶
pushedAggregate: Option[Aggregation] = None
ParquetScan can be given an Aggregation expression (pushedAggregate) when created. The Aggregation is optional and undefined by default (None).
The pushedAggregate is the pushedAggregations of ParquetScanBuilder when it is requested to build a ParquetScan.
When defined, ParquetScan is not splitable (isSplitable is false): with an aggregate pushed down, only the file footer is read, and only once, so a file should not be split across multiple tasks.
The Aggregation is used in the following:
- getMetaData (as pushedAggregationsStr and pushedGroupByStr)
- readSchema
- createReaderFactory (to create a ParquetPartitionReaderFactory)
Creating PartitionReaderFactory¶
Signature
createReaderFactory(): PartitionReaderFactory
createReaderFactory is part of the Batch abstraction.
createReaderFactory creates a ParquetPartitionReaderFactory (with the Hadoop Configuration broadcast).
createReaderFactory adds the following properties to the Hadoop Configuration before broadcasting it (to executors).
| Name | Value |
|---|---|
| ParquetInputFormat.READ_SUPPORT_CLASS | ParquetReadSupport |
| others | |
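The first row of the table can be sketched without any Spark or Hadoop dependency. This is a simplified model, not Spark's code: a plain `Map` stands in for the Hadoop Configuration, and `ReadSupportConfSketch` and `withReadSupport` are hypothetical names. The key string is assumed to be the value of `ParquetInputFormat.READ_SUPPORT_CLASS` in parquet-hadoop, and the value is assumed to be the fully-qualified class name of ParquetReadSupport.

```scala
// Simplified model: a Map[String, String] stands in for the Hadoop Configuration.
// ReadSupportConfSketch and withReadSupport are hypothetical names.
object ReadSupportConfSketch {
  // Assumed value of ParquetInputFormat.READ_SUPPORT_CLASS
  val ReadSupportClassKey: String = "parquet.read.support.class"

  // Assumed fully-qualified name of Spark's ParquetReadSupport
  val ParquetReadSupportName: String =
    "org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport"

  // Returns a copy of the configuration with the read-support class registered.
  // The remaining ("others") properties are not modeled here.
  def withReadSupport(conf: Map[String, String]): Map[String, String] =
    conf + (ReadSupportClassKey -> ParquetReadSupportName)
}
```

In this sketch the returned map is what would be broadcast to executors; existing entries are kept as they are.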
isSplitable¶
isSplitable is enabled (true) when all the following hold:
- pushedAggregate is not specified
- RowIndexUtil.isNeededForSchema is false for the readSchema
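The two conditions above can be sketched as a plain Scala predicate. This is a simplified, dependency-free model rather than Spark's implementation: `SplitableSketch`, `ScanModel` and its fields are hypothetical names, and `rowIndexNeeded` stands in for the result of RowIndexUtil.isNeededForSchema on the readSchema.

```scala
// Simplified, dependency-free model of the isSplitable rule.
// SplitableSketch, ScanModel and its fields are hypothetical stand-ins, not Spark classes.
object SplitableSketch {
  final case class ScanModel(
      pushedAggregate: Option[String], // stands in for Option[Aggregation]
      rowIndexNeeded: Boolean)         // stands in for RowIndexUtil.isNeededForSchema(readSchema)

  // Splitable only when no aggregate is pushed down and no row index is needed.
  def isSplitable(scan: ScanModel): Boolean =
    scan.pushedAggregate.isEmpty && !scan.rowIndexNeeded
}
```

For example, `SplitableSketch.isSplitable(SplitableSketch.ScanModel(Some("MAX(id)"), rowIndexNeeded = false))` is `false`, because an aggregate is pushed down.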
readSchema¶
readSchema is readDataSchema when an aggregate is pushed down (pushedAggregate is defined). Otherwise, readSchema is the default readSchema.
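The choice between the two schemas can be sketched as follows. This is a simplified model with hypothetical names (`ReadSchemaSketch`, `Schema`): a schema is reduced to a sequence of column names, and the default readSchema is assumed to be the read data schema followed by the read partition schema, which this page does not state.

```scala
// Simplified model of the readSchema rule; not Spark's implementation.
// ReadSchemaSketch and Schema are hypothetical names.
object ReadSchemaSketch {
  // Column names only; stands in for StructType
  type Schema = Seq[String]

  def readSchema(
      pushedAggregate: Option[String], // stands in for Option[Aggregation]
      readDataSchema: Schema,
      readPartitionSchema: Schema): Schema =
    if (pushedAggregate.nonEmpty) readDataSchema // aggregate pushed down
    else readDataSchema ++ readPartitionSchema   // assumed default readSchema
}
```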
Custom Metadata¶
Signature
getMetaData(): Map[String, String]
getMetaData is part of the SupportsMetadata abstraction.
getMetaData adds the following metadata to the default file-based metadata:
| Metadata | Value |
|---|---|
| PushedFilters | pushedFilters |
| PushedAggregation | pushedAggregationsStr |
| PushedGroupBy | pushedGroupByStr |
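How the three entries extend the default metadata can be sketched with plain Scala maps. This is a simplified model, not Spark's code: `MetadataSketch` and the `Format` entry used in the usage note are hypothetical, all values are plain strings, and the bracketed, comma-separated rendering in `seqToString` is an assumption about how the lists are formatted.

```scala
// Simplified model of getMetaData; MetadataSketch is a hypothetical name.
object MetadataSketch {
  // Assumed rendering of a list of items: "[a, b, c]"
  private def seqToString(items: Seq[String]): String =
    items.mkString("[", ", ", "]")

  // Adds the three Parquet-specific entries on top of the default file-based metadata.
  def getMetaData(
      defaultMetadata: Map[String, String],
      pushedFilters: Seq[String],
      pushedAggregations: Seq[String],
      pushedGroupBy: Seq[String]): Map[String, String] =
    defaultMetadata ++ Map(
      "PushedFilters" -> seqToString(pushedFilters),
      "PushedAggregation" -> seqToString(pushedAggregations),
      "PushedGroupBy" -> seqToString(pushedGroupBy))
}
```

Calling `MetadataSketch.getMetaData(Map("Format" -> "parquet"), Seq("IsNotNull(id)"), Seq("MAX(id)"), Seq.empty)` keeps the default entry and adds the three pushed-down entries, with an empty list rendered as `[]`.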