ParquetPartitionReaderFactory

ParquetPartitionReaderFactory is a FilePartitionReaderFactory.

Creating Instance

ParquetPartitionReaderFactory takes the following to be created:

ParquetPartitionReaderFactory is created when:

supportColumnarReads

supportColumnarReads(
  partition: InputPartition): Boolean

supportColumnarReads is enabled (true) when all of the following hold:

  1. spark.sql.parquet.enableVectorizedReader is true
  2. spark.sql.codegen.wholeStage is true
  3. The number of the resultSchema fields is at most spark.sql.codegen.maxFields
  4. All the resultSchema fields are AtomicTypes

supportColumnarReads is part of the PartitionReaderFactory abstraction.
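
The four conditions above can be sketched as a single predicate. This is a minimal, Spark-free illustration: DataType, Field, Conf and isAtomic are simplified stand-ins invented here (they are not Spark's classes), and the value 100 in the usage lines is only a sample limit.

```scala
object ColumnarReadsSketch {
  // Simplified stand-ins for Spark's data types (hypothetical, illustration only).
  sealed trait DataType
  case object IntType extends DataType                          // treated as atomic
  case object StringType extends DataType                       // treated as atomic
  final case class ArrayOf(element: DataType) extends DataType  // treated as non-atomic

  final case class Field(name: String, dataType: DataType)

  // Hypothetical holder for the three configuration properties involved.
  final case class Conf(
      vectorizedReaderEnabled: Boolean, // spark.sql.parquet.enableVectorizedReader
      wholeStageEnabled: Boolean,       // spark.sql.codegen.wholeStage
      maxFields: Int)                   // spark.sql.codegen.maxFields

  def isAtomic(dt: DataType): Boolean = dt match {
    case IntType | StringType => true
    case ArrayOf(_)           => false
  }

  // True only when all four conditions from the list hold.
  def supportColumnarReads(conf: Conf, resultSchema: Seq[Field]): Boolean =
    conf.vectorizedReaderEnabled &&
      conf.wholeStageEnabled &&
      resultSchema.length <= conf.maxFields &&
      resultSchema.forall(f => isAtomic(f.dataType))
}
```

With this sketch, ColumnarReadsSketch.supportColumnarReads(Conf(true, true, 100), schema) is false as soon as the schema contains a single ArrayOf field, regardless of the configuration.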

buildColumnarReader

buildColumnarReader(
  file: PartitionedFile): PartitionReader[ColumnarBatch]

buildColumnarReader creates a vectorized reader (createVectorizedReader) for the given PartitionedFile and requests it to enableReturningBatches.

In the end, buildColumnarReader returns a PartitionReader that returns ColumnarBatches (when requested for records).


buildColumnarReader is part of the FilePartitionReaderFactory abstraction.
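
The wrapping step can be pictured as a thin adapter from a Hadoop-style reader to a partition reader. The sketch below is an assumption-laden illustration, not Spark's code: RecordReaderLike and PartitionReaderLike are simplified stand-ins defined here, Batch and ToyBatchReader are hypothetical, and the enableReturningBatches call is not modelled (the toy reader always serves batches).

```scala
object ColumnarReaderSketch {
  // Simplified stand-in for a Hadoop RecordReader (only the methods used here).
  trait RecordReaderLike[T] {
    def nextKeyValue(): Boolean
    def getCurrentValue: T
    def close(): Unit
  }

  // Simplified stand-in for a partition reader.
  trait PartitionReaderLike[T] {
    def next(): Boolean
    def get(): T
    def close(): Unit
  }

  // Hypothetical batch type: a batch is just a group of values.
  type Batch = Vector[Int]

  // Toy reader that serves pre-built batches one at a time.
  final class ToyBatchReader(batches: Seq[Batch]) extends RecordReaderLike[Batch] {
    private val it = batches.iterator
    private var current: Batch = Vector.empty
    var closed = false
    def nextKeyValue(): Boolean =
      if (it.hasNext) { current = it.next(); true } else false
    def getCurrentValue: Batch = current
    def close(): Unit = { closed = true }
  }

  // Delegates every call to the underlying reader.
  def buildColumnarReader(reader: RecordReaderLike[Batch]): PartitionReaderLike[Batch] =
    new PartitionReaderLike[Batch] {
      def next(): Boolean = reader.nextKeyValue()
      def get(): Batch = reader.getCurrentValue
      def close(): Unit = reader.close()
    }
}
```

A caller would loop with next() and get() until next() returns false, then call close(), which closes the underlying reader as well.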

Building PartitionReader

buildReader(
  file: PartitionedFile): PartitionReader[InternalRow]

buildReader determines the Hadoop RecordReader to use based on the enableVectorizedReader flag. When the flag is enabled, buildReader uses createVectorizedReader; otherwise, it uses createRowBaseReader.

In the end, buildReader creates a PartitionReaderWithPartitionValues (that is a PartitionReader with partition values appended).


buildReader is part of the FilePartitionReaderFactory abstraction.
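
The two steps of buildReader (choosing a reader, then appending partition values) can be sketched without Spark as follows. Row, ReaderKind, chooseReader and withPartitionValues are hypothetical names used only for illustration; real rows are InternalRows, not Vectors.

```scala
object BuildReaderSketch {
  // Hypothetical row representation: a row is a vector of column values.
  type Row = Vector[Any]

  sealed trait ReaderKind
  case object Vectorized extends ReaderKind // stands for createVectorizedReader
  case object RowBased extends ReaderKind   // stands for createRowBaseReader

  // Step 1: pick the reader based on the enableVectorizedReader flag.
  def chooseReader(enableVectorizedReader: Boolean): ReaderKind =
    if (enableVectorizedReader) Vectorized else RowBased

  // Step 2: toy equivalent of appending partition values to every data row.
  def withPartitionValues(rows: Iterator[Row], partitionValues: Row): Iterator[Row] =
    rows.map(_ ++ partitionValues)
}
```

For a file under a partition directory such as year=2024, every row read from the file would come back with the value for year appended at the end.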

enableVectorizedReader

ParquetPartitionReaderFactory uses the enableVectorizedReader flag to determine the Hadoop RecordReader to use when requested for a PartitionReader.

enableVectorizedReader is enabled (true) when both of the following hold:

  1. spark.sql.parquet.enableVectorizedReader is true
  2. All data types in the resultSchema are AtomicTypes
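
Comparing these two conditions with the four listed for supportColumnarReads shows that the flag can be true while columnar reads are not supported, for example when whole-stage code generation is disabled. The sketch below makes that relationship explicit; Inputs and its fields are hypothetical names that simply carry the values the two lists refer to.

```scala
object ReaderFlagsSketch {
  // Hypothetical inputs: configuration values plus two facts about the resultSchema.
  final case class Inputs(
      parquetVectorizedReaderEnabled: Boolean, // spark.sql.parquet.enableVectorizedReader
      wholeStageEnabled: Boolean,              // spark.sql.codegen.wholeStage
      maxFields: Int,                          // spark.sql.codegen.maxFields
      fieldCount: Int,                         // number of resultSchema fields
      allFieldsAtomic: Boolean)                // all resultSchema fields are AtomicTypes

  // The two conditions listed above.
  def enableVectorizedReader(inputs: Inputs): Boolean =
    inputs.parquetVectorizedReaderEnabled && inputs.allFieldsAtomic

  // The same two conditions plus the two codegen-related ones.
  def supportColumnarReads(inputs: Inputs): Boolean =
    enableVectorizedReader(inputs) &&
      inputs.wholeStageEnabled &&
      inputs.fieldCount <= inputs.maxFields
}
```

In this model, supportColumnarReads implies enableVectorizedReader, but not the other way round.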

Creating Row-Based RecordReader

createRowBaseReader(
  file: PartitionedFile): RecordReader[Void, InternalRow]

createRowBaseReader calls buildReaderBase (with the given PartitionedFile and createRowBaseParquetReader).

Creating Vectorized Parquet RecordReader

createVectorizedReader(
  file: PartitionedFile): VectorizedParquetRecordReader

createVectorizedReader calls buildReaderBase (with the given PartitionedFile and createParquetVectorizedReader).

In the end, createVectorizedReader requests the VectorizedParquetRecordReader to initBatch (with the partitionSchema and the partitionValues of the given PartitionedFile) and returns it.

createVectorizedReader is used when:

buildReaderBase

buildReaderBase[T](
  file: PartitionedFile,
  buildReaderFunc: (
    FileSplit,
    InternalRow,
    TaskAttemptContextImpl,
    Option[FilterPredicate],
    Option[ZoneId],
    RebaseSpec,
    RebaseSpec) => RecordReader[Void, T]): RecordReader[Void, T]

buildReaderBase...FIXME

buildReaderBase is used when: