ParquetPartitionReaderFactory¶
ParquetPartitionReaderFactory is a FilePartitionReaderFactory.
Creating Instance¶
ParquetPartitionReaderFactory takes the following to be created:
- SQLConf
- Broadcast variable with a Hadoop Configuration
- Data schema
- Read data schema
- Partition schema
- Filters
- ParquetOptions
ParquetPartitionReaderFactory is created when:
- ParquetScan is requested to create a PartitionReaderFactory
supportColumnarReads¶
supportColumnarReads(
partition: InputPartition): Boolean
supportColumnarReads is part of the PartitionReaderFactory abstraction.
supportColumnarReads is enabled (true) when all of the following hold:
- spark.sql.parquet.enableVectorizedReader is enabled
- spark.sql.codegen.wholeStage is enabled
- The number of the resultSchema fields is at most spark.sql.codegen.maxFields
- All the resultSchema fields are AtomicTypes
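The four conditions combine with a logical AND. The following is a minimal plain-Scala sketch of that decision, not Spark's own code: `Settings`, `FieldType`, `Atomic` and `Complex` are hypothetical stand-ins for SQLConf and Spark's data types, and the value 100 for the field limit is only an example.

```scala
// Plain-Scala model of the decision; no Spark dependency.
// Every type and field name below is hypothetical, not Spark's.
object ColumnarReadsSketch {
  sealed trait FieldType
  case object Atomic extends FieldType  // stands in for an AtomicType field
  case object Complex extends FieldType // stands in for a non-atomic field (struct, array, map)

  final case class Settings(
      vectorizedReaderEnabled: Boolean, // spark.sql.parquet.enableVectorizedReader
      wholeStageEnabled: Boolean,       // spark.sql.codegen.wholeStage
      maxFields: Int)                   // spark.sql.codegen.maxFields

  // True only when all four conditions hold.
  def supportColumnarReads(conf: Settings, resultSchema: Seq[FieldType]): Boolean =
    conf.vectorizedReaderEnabled &&
      conf.wholeStageEnabled &&
      resultSchema.length <= conf.maxFields &&
      resultSchema.forall(_ == Atomic)

  def main(args: Array[String]): Unit = {
    val conf = Settings(vectorizedReaderEnabled = true, wholeStageEnabled = true, maxFields = 100)
    println(supportColumnarReads(conf, Seq(Atomic, Atomic)))  // all conditions hold
    println(supportColumnarReads(conf, Seq(Atomic, Complex))) // one non-atomic field
  }
}
```

In this model a single non-atomic field, or a schema wider than the field limit, is enough to turn columnar reads off for the partition.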
Building Columnar Reader¶
buildColumnarReader(
file: PartitionedFile): PartitionReader[ColumnarBatch]
buildColumnarReader is part of the FilePartitionReaderFactory abstraction.
buildColumnarReader creates a vectorized reader (createVectorizedReader) for the given PartitionedFile and requests it to enableReturningBatches.
In the end, buildColumnarReader returns a PartitionReader that returns ColumnarBatches (when requested for records).
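One plausible shape for a PartitionReader that hands out batches from an underlying record reader is a thin adapter. The sketch below is a plain-Scala illustration under that assumption; `SimpleRecordReader`, `SimplePartitionReader`, `Batch` and `BatchingReader` are hypothetical stand-ins, not Hadoop or Spark classes.

```scala
// Adapter sketch: a record reader that yields batches, wrapped as a partition reader.
// All names here are hypothetical simplifications, not the real Hadoop/Spark APIs.
object ColumnarReaderSketch {
  trait SimpleRecordReader[T] {
    def nextKeyValue(): Boolean
    def getCurrentValue: T
    def close(): Unit
  }

  trait SimplePartitionReader[T] {
    def next(): Boolean
    def get(): T
    def close(): Unit
  }

  // A toy "columnar batch": just a group of values.
  final case class Batch(values: Vector[Int])

  // Toy reader that returns whole batches instead of single rows.
  final class BatchingReader(data: Vector[Int], batchSize: Int)
      extends SimpleRecordReader[Batch] {
    private val batches = data.grouped(batchSize)
    private var current: Batch = Batch(Vector.empty)
    def nextKeyValue(): Boolean =
      if (batches.hasNext) { current = Batch(batches.next()); true } else false
    def getCurrentValue: Batch = current
    def close(): Unit = ()
  }

  // The adapter delegates every call to the wrapped record reader.
  def asPartitionReader[T](reader: SimpleRecordReader[T]): SimplePartitionReader[T] =
    new SimplePartitionReader[T] {
      def next(): Boolean = reader.nextKeyValue()
      def get(): T = reader.getCurrentValue
      def close(): Unit = reader.close()
    }

  // Drains a partition reader and always closes it.
  def readAll[T](reader: SimplePartitionReader[T]): Vector[T] = {
    val out = Vector.newBuilder[T]
    try { while (reader.next()) out += reader.get() } finally reader.close()
    out.result()
  }
}
```

Calling `readAll(asPartitionReader(new BatchingReader(Vector(1, 2, 3, 4, 5), 2)))` yields three batches, the last one holding the single leftover value.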
Building PartitionReader¶
buildReader(
file: PartitionedFile): PartitionReader[InternalRow]
buildReader is part of the FilePartitionReaderFactory abstraction.
buildReader determines a Hadoop RecordReader to use based on the enableVectorizedReader flag. When enabled, buildReader uses createVectorizedReader, and createRowBaseReader otherwise.
In the end, buildReader creates a PartitionReaderWithPartitionValues (that is a PartitionReader with partition values appended).
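The two steps of buildReader (pick a reader by the flag, then append partition values to every row) can be modelled in a few lines of plain Scala. This is only a sketch under simplifying assumptions: rows are modelled as `Vector[Any]`, readers as iterators, and the function parameters are hypothetical stand-ins for the factory's own methods.

```scala
// Plain-Scala model of the buildReader flow; no Spark or Hadoop types involved.
object BuildReaderSketch {
  type Row = Vector[Any] // hypothetical stand-in for InternalRow

  def buildReader(
      enableVectorizedReader: Boolean,
      createVectorizedReader: () => Iterator[Row], // stand-in for the vectorized path
      createRowBaseReader: () => Iterator[Row],    // stand-in for the row-based path
      partitionValues: Row): Iterator[Row] = {
    // Step 1: choose the underlying reader based on the flag.
    val rows =
      if (enableVectorizedReader) createVectorizedReader()
      else createRowBaseReader()
    // Step 2: append the partition values to every row that is read.
    rows.map(_ ++ partitionValues)
  }
}
```

Only the selected factory function is invoked, so the reader for the other path is never created.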
enableVectorizedReader¶
ParquetPartitionReaderFactory uses the enableVectorizedReader flag to determine a Hadoop RecordReader to use when requested for a PartitionReader.
enableVectorizedReader is enabled (true) when all of the following hold:
- spark.sql.parquet.enableVectorizedReader is true
- All data types in the resultSchema are AtomicTypes
Creating Row-Based RecordReader¶
createRowBaseReader(
file: PartitionedFile): RecordReader[Void, InternalRow]
createRowBaseReader calls buildReaderBase (with the given PartitionedFile and createRowBaseParquetReader).
Creating Vectorized Parquet RecordReader¶
createVectorizedReader(
file: PartitionedFile): VectorizedParquetRecordReader
createVectorizedReader calls buildReaderBase (with the given PartitionedFile and createParquetVectorizedReader).
In the end, createVectorizedReader requests the VectorizedParquetRecordReader to initBatch (with the partitionSchema and the partitionValues of the given PartitionedFile) and returns it.
createVectorizedReader is used when:
- ParquetPartitionReaderFactory is requested to buildReader and buildColumnarReader
buildReaderBase¶
buildReaderBase[T](
file: PartitionedFile,
buildReaderFunc: (
FileSplit,
InternalRow,
TaskAttemptContextImpl,
Option[FilterPredicate],
Option[ZoneId],
RebaseSpec,
RebaseSpec) => RecordReader[Void, T]): RecordReader[Void, T]
buildReaderBase
...FIXME
buildReaderBase is used when:
- ParquetPartitionReaderFactory is requested to createRowBaseReader and createVectorizedReader