
ParquetScan

ParquetScan is the FileScan of the Parquet Connector that uses ParquetPartitionReaderFactory with ParquetReadSupport.

Creating Instance

ParquetScan takes the following to be created:

ParquetScan is created when ParquetScanBuilder is requested to build a ParquetScan.

Pushed Aggregation

pushedAggregate: Option[Aggregation] = None

ParquetScan can be given an Aggregation expression (pushedAggregate) when created. The Aggregation is optional and undefined by default (None).

pushedAggregate is assigned the pushedAggregations of ParquetScanBuilder when it is requested to build a ParquetScan.

When defined, ParquetScan is no longer isSplitable (with an aggregate pushed down, only the file footer is read, and only once, so the file should not be split across multiple tasks).

The Aggregation is used in the following:

  1. Creating PartitionReaderFactory (createReaderFactory)
  2. isSplitable
  3. readSchema
  4. Custom Metadata (getMetaData)
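Below is a minimal end-user sketch of how an Aggregation ends up pushed down to ParquetScan, assuming the spark-shell SparkSession (spark), Spark 3.3 or later, and a made-up /tmp/parquet_demo path. spark.sql.sources.useV1SourceList is cleared so the parquet connector takes the DataSource V2 path (and hence ParquetScan).

```scala
// Use the DataSource V2 parquet connector and enable parquet aggregate pushdown.
spark.conf.set("spark.sql.sources.useV1SourceList", "")
spark.conf.set("spark.sql.parquet.aggregatePushdown", "true")

// Made-up demo dataset and path.
spark.range(10).write.mode("overwrite").parquet("/tmp/parquet_demo")

// MAX over the whole table can be answered from parquet footers alone.
val q = spark.read.parquet("/tmp/parquet_demo").selectExpr("max(id)")
q.explain()
// The scan node should report something like PushedAggregation: [MAX(id)].
```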

Creating PartitionReaderFactory

Signature
createReaderFactory(): PartitionReaderFactory

createReaderFactory is part of the Batch abstraction.

createReaderFactory creates a ParquetPartitionReaderFactory (with a broadcast of the Hadoop Configuration).

createReaderFactory adds the following properties to the Hadoop Configuration before broadcasting it (to executors).

| Name | Value |
|------|-------|
| ParquetInputFormat.READ_SUPPORT_CLASS | ParquetReadSupport |

(among other properties)
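A minimal sketch of that configuration step, assuming a local SparkSession and Spark 3.x (variable names are illustrative; the broadcast itself and the remaining properties are only indicated in comments):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.hadoop.ParquetInputFormat
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Start from the session's Hadoop configuration.
val hadoopConf = new Configuration(spark.sparkContext.hadoopConfiguration)

// Register Spark's ParquetReadSupport so parquet-mr uses it to turn parquet
// records into Spark's InternalRows (ParquetScan uses
// classOf[ParquetReadSupport].getName as the value).
hadoopConf.set(
  ParquetInputFormat.READ_SUPPORT_CLASS,
  "org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport")

// ParquetScan then broadcasts the configuration (wrapped in a serializable
// holder) and passes the broadcast to ParquetPartitionReaderFactory.
```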

isSplitable

Signature
isSplitable(
  path: Path): Boolean

isSplitable is part of the FileScan abstraction.

isSplitable is enabled (true) when all of the following hold (as sketched below):

  1. pushedAggregate is not specified
  2. RowIndexUtil.isNeededForSchema is false for the readSchema
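
A hypothetical standalone rendering of this decision (ParquetScan's own isSplitable takes only a Path; pushedAggregate and the read schema are fields of the scan, and the RowIndexUtil check is internal to Spark, so it is only indicated in a comment):

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.connector.expressions.aggregate.Aggregation

def isSplitable(path: Path, pushedAggregate: Option[Aggregation]): Boolean = {
  // With an aggregate pushed down, only the file footer is read (once), so the
  // file must not be split across tasks. ParquetScan additionally requires
  // RowIndexUtil.isNeededForSchema(readSchema) to be false.
  pushedAggregate.isEmpty
}
```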

readSchema

Signature
readSchema(): StructType

readSchema is part of the Scan abstraction.

With an aggregate pushed down, readSchema is the readDataSchema. Otherwise, readSchema is the default (FileScan) readSchema.
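
A condensed, hypothetical rendering of the rule, where defaultReadSchema stands in for the default FileScan readSchema (the read data schema followed by the read partition schema) and the other inputs are passed as parameters rather than read from fields:

```scala
import org.apache.spark.sql.connector.expressions.aggregate.Aggregation
import org.apache.spark.sql.types.StructType

def readSchema(
    pushedAggregate: Option[Aggregation],
    readDataSchema: StructType,
    defaultReadSchema: => StructType): StructType =
  if (pushedAggregate.nonEmpty) readDataSchema // aggregate pushed down
  else defaultReadSchema                       // default FileScan readSchema
```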

Custom Metadata

Signature
getMetaData(): Map[String, String]

getMetaData is part of the SupportsMetadata abstraction.

getMetaData adds the following metadata to the default file-based metadata:

| Metadata | Value |
|----------|-------|
| PushedFilters | pushedFilters |
| PushedAggregation | pushedAggregationsStr |
| PushedGroupBy | pushedGroupByStr |
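
A condensed, hypothetical rendering, where fileScanMetadata stands in for the default file-based metadata and the pushed-down values are passed in as already-rendered strings (in ParquetScan they come from pushedFilters and pushedAggregate):

```scala
def getMetaData(
    fileScanMetadata: Map[String, String],
    pushedFiltersStr: String,
    pushedAggregationsStr: String,
    pushedGroupByStr: String): Map[String, String] =
  fileScanMetadata ++ Map(
    "PushedFilters" -> pushedFiltersStr,
    "PushedAggregation" -> pushedAggregationsStr,
    "PushedGroupBy" -> pushedGroupByStr)
```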