DeltaParquetFileFormat¶
DeltaParquetFileFormat is a ParquetFileFormat (Spark SQL) that imposes no restrictions on column names.
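As an illustration, with column mapping enabled, a delta table can use column names that plain Parquet tables reject (a minimal sketch; the table name and the session setup are assumptions of this example):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  // Delta Lake SQL support
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

// Column names with spaces and special characters are rejected by
// plain Parquet tables, yet accepted by delta tables with column mapping
// (demo_table is a hypothetical name)
spark.sql("""
  CREATE TABLE demo_table (`my column ;{}()` INT) USING delta
  TBLPROPERTIES ('delta.columnMapping.mode' = 'name')
""")
```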
Creating Instance¶
DeltaParquetFileFormat takes the following to be created:
- Protocol
- Metadata
- nullableRowTrackingFields flag (default: false)
- optimizationsEnabled flag (default: true)
- Optional Table Path (default: None (unspecified))
- isCDCRead flag (default: false)
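A minimal sketch of creating a DeltaParquetFileFormat from a table's snapshot (the table path is hypothetical; the flags, the table path and isCDCRead keep their defaults):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.delta.{DeltaLog, DeltaParquetFileFormat}

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// /tmp/delta/demo is a hypothetical delta table path
val deltaLog = DeltaLog.forTable(spark, "/tmp/delta/demo")
val snapshot = deltaLog.update()

// Protocol and Metadata come from the table's snapshot;
// the remaining arguments keep their defaults
val fileFormat = DeltaParquetFileFormat(snapshot.protocol, snapshot.metadata)
```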
DeltaParquetFileFormat is created when:
- DeltaFileFormat is requested for the file format
- CDCReaderImpl is requested to scanIndex
__delta_internal_row_index Internal Metadata Column¶
DeltaParquetFileFormat defines __delta_internal_row_index as the name of the metadata column with the index of a row within a file.
__delta_internal_row_index is an internal column.
When defined in the schema (of a delta table), DeltaParquetFileFormat creates an iteratorWithAdditionalMetadataColumns.
Warning
__delta_internal_row_index column is only supported for delta tables with file splitting and predicate pushdown disabled.
__delta_internal_row_index is used when:
- DMLWithDeletionVectorsHelper is requested to replace a FileIndex (in all the delta tables in a logical plan)
- DeletionVectorBitmapGenerator is requested to buildRowIndexSetsForFilesMatchingCondition
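The column name itself is a constant of the DeltaParquetFileFormat companion object (a sketch; the constant name follows the Delta Lake source layout and should be treated as an assumption here):

```scala
import org.apache.spark.sql.delta.DeltaParquetFileFormat

// The internal metadata column name
// (ROW_INDEX_COLUMN_NAME is assumed to be where the name is defined)
println(DeltaParquetFileFormat.ROW_INDEX_COLUMN_NAME)
// __delta_internal_row_index
```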
Building Data Reader (with Partition Values)¶
FileFormat
buildReaderWithPartitionValues(
sparkSession: SparkSession,
dataSchema: StructType,
partitionSchema: StructType,
requiredSchema: StructType,
filters: Seq[Filter],
options: Map[String, String],
hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]
buildReaderWithPartitionValues is part of the FileFormat (Spark SQL) abstraction.
With neither __delta_internal_is_row_deleted nor row_index columns found in the given requiredSchema, buildReaderWithPartitionValues uses the default buildReaderWithPartitionValues (from ParquetFileFormat (Spark SQL)).
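The fallback decision can be pictured as a schema lookup (a standalone sketch, assuming the column names on this page; not the actual implementation):

```scala
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// The two metadata columns this page describes
val IS_ROW_DELETED = "__delta_internal_is_row_deleted"
val ROW_INDEX = "__delta_internal_row_index"

// With neither column requested, the default ParquetFileFormat reader is used
def usesDefaultReader(requiredSchema: StructType): Boolean =
  !requiredSchema.fieldNames.contains(IS_ROW_DELETED) &&
    !requiredSchema.fieldNames.contains(ROW_INDEX)

val plainSchema = StructType(Seq(StructField("id", LongType)))
assert(usesDefaultReader(plainSchema))
```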
row_index Column Name
buildReaderWithPartitionValues uses the spark.databricks.delta.deletionVectors.useMetadataRowIndex configuration property to determine the row_index column name (either row_index straight from ParquetFileFormat (Spark SQL) or __delta_internal_row_index).
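A hypothetical rendering of that choice (only the column names and the configuration property come from this page; the helper itself is an assumption):

```scala
// row_index comes straight from ParquetFileFormat (Spark SQL) when
// spark.databricks.delta.deletionVectors.useMetadataRowIndex is enabled;
// Delta's internal column is used otherwise
def rowIndexColumnName(useMetadataRowIndex: Boolean): String =
  if (useMetadataRowIndex) "row_index"
  else "__delta_internal_row_index"
```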
FIXME Other assertions
In the end, buildReaderWithPartitionValues builds a parquet data reader with additional metadata columns.
iteratorWithAdditionalMetadataColumns¶
iteratorWithAdditionalMetadataColumns(
partitionedFile: PartitionedFile,
iterator: Iterator[Object],
isRowDeletedColumnOpt: Option[ColumnMetadata],
rowIndexColumnOpt: Option[ColumnMetadata],
useOffHeapBuffers: Boolean,
serializableHadoopConf: SerializableConfiguration,
useMetadataRowIndex: Boolean): Iterator[Object]
iteratorWithAdditionalMetadataColumns...FIXME
supportFieldName¶
FileFormat
supportFieldName(
name: String): Boolean
supportFieldName is part of the FileFormat (Spark SQL) abstraction.
supportFieldName is enabled (true) when either holds:

- The DeltaColumnMappingMode is not NoMapping
- The default (parent) supportFieldName (Spark SQL) is enabled
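Sketched as a standalone predicate (with the DeltaColumnMappingMode reduced to its name; an assumption of this example):

```scala
// supportFieldName holds when column mapping is enabled (mode other
// than "none") or when the parent ParquetFileFormat supports the name
def supportFieldName(columnMappingModeName: String, parentSupports: Boolean): Boolean =
  columnMappingModeName != "none" || parentSupports
```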
metadataSchemaFields¶
FileFormat
metadataSchemaFields: Seq[StructField]
metadataSchemaFields is part of the FileFormat (Spark SQL) abstraction.
Review Me
Due to an issue in Spark SQL (to be reported), metadataSchemaFields removes row_index from the default metadataSchemaFields (Spark SQL).
ParquetFileFormat
All that ParquetFileFormat does (when requested for the metadataSchemaFields) is add the row_index column. In other words, DeltaParquetFileFormat reverts this column addition.
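The reversal amounts to filtering row_index out of the parent's fields (a sketch; parentFields stands in for the default metadataSchemaFields):

```scala
import org.apache.spark.sql.types.StructField

// Drop the row_index metadata field that ParquetFileFormat adds
def metadataSchemaFields(parentFields: Seq[StructField]): Seq[StructField] =
  parentFields.filterNot(_.name == "row_index")
```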