VectorizedParquetRecordReader

VectorizedParquetRecordReader is a SpecificParquetRecordReaderBase for vectorized decoding in the Parquet Data Source.

Creating Instance

VectorizedParquetRecordReader takes the following to be created:

  • ZoneId (for timezone conversion)
  • Datetime Rebase Mode
  • Datetime Rebase Timezone
  • int96 Rebase Mode
  • int96 Rebase Timezone
  • useOffHeap
  • Capacity
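
The following is a minimal sketch of creating a VectorizedParquetRecordReader with the arguments above. The exact constructor signature and the argument values (rebase modes, timezones, capacity) are assumptions for illustration only, and the reader would still have to be initialized with a Parquet file split before use.

import java.time.ZoneId
import org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader

// Hypothetical argument values; the rebase modes/timezones and the capacity
// are assumptions, not values taken from the Spark source
val reader = new VectorizedParquetRecordReader(
  ZoneId.systemDefault(), // ZoneId (for timezone conversion)
  "CORRECTED",            // Datetime Rebase Mode
  "UTC",                  // Datetime Rebase Timezone
  "CORRECTED",            // int96 Rebase Mode
  "UTC",                  // int96 Rebase Timezone
  false,                  // useOffHeap
  4096)                   // Capacity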

VectorizedParquetRecordReader is created when:

Capacity

VectorizedParquetRecordReader is given capacity when created.

The capacity is configured using spark.sql.parquet.columnarReaderBatchSize configuration property.
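
For example, the batch size can be set when building a SparkSession (a minimal sketch; the value 8192 is arbitrary):

import org.apache.spark.sql.SparkSession

// Override the number of rows per ColumnarBatch for Parquet vectorized reads
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.parquet.columnarReaderBatchSize", 8192L)
  .getOrCreate()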

Memory Mode

VectorizedParquetRecordReader uses the given useOffHeap to initialize the internal MEMORY_MODE registry when created:

useOffHeap | MEMORY_MODE | WritableColumnVector
---------- | ----------- | --------------------
true       | OFF_HEAP    | OffHeapColumnVector
false      | ON_HEAP     | OnHeapColumnVector

The MEMORY_MODE is used to initBatch (to choose the correct implementation of WritableColumnVector).
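
A minimal sketch (not the actual Spark source) of the mapping:

import org.apache.spark.memory.MemoryMode

// useOffHeap selects between the two memory modes
def memoryMode(useOffHeap: Boolean): MemoryMode =
  if (useOffHeap) MemoryMode.OFF_HEAP else MemoryMode.ON_HEAP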

useOffHeap

useOffHeap is the value of the spark.sql.columnVector.offheap.enabled configuration property when the vectorized reader is used, i.e. with spark.sql.parquet.enableVectorizedReader enabled and a schema that is supported for columnar reads.
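
A minimal sketch of enabling both configuration properties when building a SparkSession (so that off-heap column vectors are used):

import org.apache.spark.sql.SparkSession

// Vectorized Parquet decoding (enabled by default) with off-heap column vectors
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.parquet.enableVectorizedReader", true)
  .config("spark.sql.columnVector.offheap.enabled", true)
  .getOrCreate()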

WritableColumnVectors

VectorizedParquetRecordReader defines an array of allocated WritableColumnVectors.

columnVectors is allocated when initBatch is executed.


columnVectors is used when:

enableReturningBatches

void enableReturningBatches()

enableReturningBatches simply turns the returnColumnarBatch flag on.
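
A minimal sketch (an assumption of the shape, not the actual source) of the flag flip:

// returnColumnarBatch decides between whole-batch and row-by-row reads
var returnColumnarBatch = false

def enableReturningBatches(): Unit = {
  returnColumnarBatch = true
}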


enableReturningBatches is used when:

Initializing Columnar Batch

void initBatch() // (1)!
void initBatch(
  StructType partitionColumns,
  InternalRow partitionValues) // (2)!
void initBatch(
  MemoryMode memMode,
  StructType partitionColumns,
  InternalRow partitionValues) // (3)!
  1. Uses the MEMORY_MODE and neither partitionColumns nor partitionValues
  2. Uses the MEMORY_MODE
  3. A private helper method

MemoryMode

The given MemoryMode is the value of MEMORY_MODE.
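
A minimal sketch of how the two public overloads could delegate to the private three-argument variant (the wrapper class is hypothetical; only initBatch and MEMORY_MODE come from the text above):

import org.apache.spark.memory.MemoryMode
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.StructType

// Sketch only: MEMORY_MODE stands in for the reader's internal memory-mode registry
class InitBatchSketch(MEMORY_MODE: MemoryMode) {
  // (1) neither partitionColumns nor partitionValues
  def initBatch(): Unit = initBatch(MEMORY_MODE, null, null)

  // (2) partition columns and values, still with MEMORY_MODE
  def initBatch(partitionColumns: StructType, partitionValues: InternalRow): Unit =
    initBatch(MEMORY_MODE, partitionColumns, partitionValues)

  // (3) the private helper that does the actual work (elided here)
  private def initBatch(
      memMode: MemoryMode,
      partitionColumns: StructType,
      partitionValues: InternalRow): Unit = ()
}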

initBatch creates a batch schema that is the sparkSchema followed by the input partitionColumns schema (if available).
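
For example (a sketch with made-up schemas), the batch schema is the data schema followed by the partition columns:

import org.apache.spark.sql.types._

// Hypothetical file (data) schema and partition schema
val sparkSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType)))
val partitionColumns = StructType(Seq(
  StructField("date", StringType)))

// Batch schema: data columns first, then partition columns
val batchSchema = StructType(sparkSchema.fields ++ partitionColumns.fields)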

initBatch requests OffHeapColumnVector or OnHeapColumnVector to allocate column vectors per the input memMode (i.e., OFF_HEAP or ON_HEAP memory modes, respectively). initBatch records the allocated column vectors as the internal WritableColumnVectors.

spark.sql.columnVector.offheap.enabled

Whether OffHeapColumnVector or OnHeapColumnVector is used is based on the spark.sql.columnVector.offheap.enabled configuration property.

initBatch creates a ColumnarBatch (with the allocated WritableColumnVectors).
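
A minimal sketch of allocating on-heap column vectors for a schema and wrapping them in a ColumnarBatch (the schema and capacity are made up):

import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector
import org.apache.spark.sql.types._
import org.apache.spark.sql.vectorized.{ColumnVector, ColumnarBatch}

val batchSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType)))

// One writable column vector per field, with room for `capacity` rows
val capacity = 4096
val vectors = OnHeapColumnVector.allocateColumns(capacity, batchSchema)

// A ColumnarBatch simply wraps the column vectors
val batch = new ColumnarBatch(vectors.map(v => v: ColumnVector))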

initBatch then does some additional maintenance on the WritableColumnVectors: it populates the partition column vectors with the given partitionValues (marking them as constant) and fills the column vectors of missing columns with nulls.


initBatch is used when:

Allocating ColumnVectors

ColumnVector[] allocateColumns(
  int capacity,
  StructType schema,
  boolean useOffHeap,
  int constantColumnLength)

allocateColumns creates ColumnVectors (one for every field in the given StructType).

With the given useOffHeap flag enabled, allocateColumns creates OffHeapColumnVectors. Otherwise, it creates OnHeapColumnVectors.

useOffHeap

The useOffHeap flag is enabled when the memMode of initBatch is MemoryMode.OFF_HEAP.
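
A minimal sketch of the choice (ignoring constantColumnLength), assuming the static allocateColumns helpers of the two WritableColumnVector implementations:

import org.apache.spark.sql.execution.vectorized.{OffHeapColumnVector, OnHeapColumnVector, WritableColumnVector}
import org.apache.spark.sql.types.StructType

// Off-heap or on-heap column vectors, one per field of the schema
def allocateColumns(
    capacity: Int,
    schema: StructType,
    useOffHeap: Boolean): Array[_ <: WritableColumnVector] =
  if (useOffHeap) OffHeapColumnVector.allocateColumns(capacity, schema)
  else OnHeapColumnVector.allocateColumns(capacity, schema)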

nextKeyValue

Signature
boolean nextKeyValue()

nextKeyValue is part of the RecordReader (Apache Hadoop) abstraction.

nextKeyValue executes resultBatch first (to make sure the columnar batch is initialized).

With returnColumnarBatch enabled, nextKeyValue returns the result of nextBatch.

Otherwise, nextKeyValue...FIXME


nextKeyValue is used when:

resultBatch

ColumnarBatch resultBatch()

resultBatch returns the columnarBatch if already available. Otherwise, resultBatch executes initBatch first and then returns the columnarBatch.
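
A minimal sketch (an assumption of the shape, not the actual source) of the lazy initialization:

import org.apache.spark.sql.vectorized.ColumnarBatch

// Sketch only: columnarBatch and initBatch stand in for the reader's internals
class ResultBatchSketch {
  private var columnarBatch: ColumnarBatch = _

  def resultBatch(): ColumnarBatch = {
    if (columnarBatch == null) initBatch() // allocate vectors and create the batch
    columnarBatch
  }

  private def initBatch(): Unit = {
    // elided: allocate WritableColumnVectors and create the ColumnarBatch
  }
}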