
ParquetScan

ParquetScan is the FileScan of the Parquet Connector that uses ParquetPartitionReaderFactory with ParquetReadSupport.

Creating Instance

ParquetScan takes the following to be created:

ParquetScan is created when ParquetScanBuilder is requested to build a ParquetScan.

Pushed Aggregation

pushedAggregate: Option[Aggregation] = None

ParquetScan can be given an Aggregation expression (pushedAggregate) when created. The Aggregation is optional and undefined by default (None).

pushedAggregate is assigned the pushedAggregations of ParquetScanBuilder when it is requested to build a ParquetScan.

When defined, ParquetScan is no longer isSplitable (with an aggregate pushed down, only the file footer is read, and only once, so the file should not be split across multiple tasks).

The Aggregation is used in the following:

  1. Creating PartitionReaderFactory (createReaderFactory)
  2. isSplitable
  3. readSchema
  4. Custom Metadata (getMetaData)
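Below is a minimal end-user sketch of how an Aggregation ends up pushed down to ParquetScan, assuming the spark-shell SparkSession (spark), Spark 3.3 or later, and a made-up /tmp/parquet_demo path. spark.sql.sources.useV1SourceList is cleared so the parquet connector takes the DataSource V2 path (and hence ParquetScan).

```scala
// Use the DataSource V2 parquet connector and enable parquet aggregate pushdown.
spark.conf.set("spark.sql.sources.useV1SourceList", "")
spark.conf.set("spark.sql.parquet.aggregatePushdown", "true")

// Made-up demo dataset and path.
spark.range(10).write.mode("overwrite").parquet("/tmp/parquet_demo")

// MAX over the whole table can be answered from parquet footers alone.
val q = spark.read.parquet("/tmp/parquet_demo").selectExpr("max(id)")
q.explain()
// The scan node should report something like PushedAggregation: [MAX(id)].
```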

Creating PartitionReaderFactory

Signature
createReaderFactory(): PartitionReaderFactory

createReaderFactory is part of the Batch abstraction.

createReaderFactory creates a ParquetPartitionReaderFactory (with a broadcast of the Hadoop Configuration).

createReaderFactory adds the following properties to the Hadoop Configuration before broadcasting it (to executors).

| Name | Value |
|------|-------|
| ParquetInputFormat.READ_SUPPORT_CLASS | ParquetReadSupport |

(among other properties)
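A minimal sketch of that configuration step, assuming a local SparkSession and Spark 3.x (variable names are illustrative; the broadcast itself and the remaining properties are only indicated in comments):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.hadoop.ParquetInputFormat
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Start from the session's Hadoop configuration.
val hadoopConf = new Configuration(spark.sparkContext.hadoopConfiguration)

// Register Spark's ParquetReadSupport so parquet-mr uses it to turn parquet
// records into Spark's InternalRows (ParquetScan uses
// classOf[ParquetReadSupport].getName as the value).
hadoopConf.set(
  ParquetInputFormat.READ_SUPPORT_CLASS,
  "org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport")

// ParquetScan then broadcasts the configuration (wrapped in a serializable
// holder) and passes the broadcast to ParquetPartitionReaderFactory.
```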

isSplitable

Signature
isSplitable(
  path: Path): Boolean

isSplitable is part of the FileScan abstraction.

isSplitable is enabled (true) when all of the following hold (as sketched below):

  1. pushedAggregate is not specified
  2. RowIndexUtil.isNeededForSchema is false for the readSchema
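
A hypothetical standalone rendering of this decision (ParquetScan's own isSplitable takes only a Path; pushedAggregate and the read schema are fields of the scan, and the RowIndexUtil check is internal to Spark, so it is only indicated in a comment):

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.connector.expressions.aggregate.Aggregation

def isSplitable(path: Path, pushedAggregate: Option[Aggregation]): Boolean = {
  // With an aggregate pushed down, only the file footer is read (once), so the
  // file must not be split across tasks. ParquetScan additionally requires
  // RowIndexUtil.isNeededForSchema(readSchema) to be false.
  pushedAggregate.isEmpty
}
```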

readSchema

Signature
readSchema(): StructType

readSchema is part of the Scan abstraction.

With an aggregate pushed down, readSchema is the readDataSchema. Otherwise, readSchema is the default (FileScan) readSchema.
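
A condensed, hypothetical rendering of the rule, where defaultReadSchema stands in for the default FileScan readSchema (the read data schema followed by the read partition schema) and the other inputs are passed as parameters rather than read from fields:

```scala
import org.apache.spark.sql.connector.expressions.aggregate.Aggregation
import org.apache.spark.sql.types.StructType

def readSchema(
    pushedAggregate: Option[Aggregation],
    readDataSchema: StructType,
    defaultReadSchema: => StructType): StructType =
  if (pushedAggregate.nonEmpty) readDataSchema // aggregate pushed down
  else defaultReadSchema                       // default FileScan readSchema
```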

Custom Metadata

Signature
getMetaData(): Map[String, String]

getMetaData is part of the SupportsMetadata abstraction.

getMetaData adds the following metadata to the default file-based metadata:

| Metadata | Value |
|----------|-------|
| PushedFilters | pushedFilters |
| PushedAggregation | pushedAggregationsStr |
| PushedGroupBy | pushedGroupByStr |
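
A condensed, hypothetical rendering, where fileScanMetadata stands in for the default file-based metadata and the pushed-down values are passed in as already-rendered strings (in ParquetScan they come from pushedFilters and pushedAggregate):

```scala
def getMetaData(
    fileScanMetadata: Map[String, String],
    pushedFiltersStr: String,
    pushedAggregationsStr: String,
    pushedGroupByStr: String): Map[String, String] =
  fileScanMetadata ++ Map(
    "PushedFilters" -> pushedFiltersStr,
    "PushedAggregation" -> pushedAggregationsStr,
    "PushedGroupBy" -> pushedGroupByStr)
```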