DeltaParquetFileFormat

DeltaParquetFileFormat is a ParquetFileFormat (Spark SQL) that imposes no restrictions on column names.
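
For example (a sketch; the table and column names are made up, and a SparkSession is assumed in scope as spark), a delta table with column mapping enabled accepts column names that vanilla parquet would reject:

// Column names with spaces and special characters fail with plain parquet,
// but work on a delta table once column mapping is enabled.
spark.sql("""
  CREATE TABLE special_names (`my column` STRING, `a,b` INT)
  USING delta
  TBLPROPERTIES (
    'delta.columnMapping.mode' = 'name',
    'delta.minReaderVersion' = '2',
    'delta.minWriterVersion' = '5')
""")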

Creating Instance

DeltaParquetFileFormat takes the following to be created:

  • Protocol
  • Metadata
  • nullableRowTrackingFields flag (default: false)
  • optimizationsEnabled flag (default: true)
  • Optional Table Path (default: None (unspecified))
  • isCDCRead flag (default: false)
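
A minimal sketch of creating a DeltaParquetFileFormat from the current snapshot of a delta table (assuming a SparkSession in scope as spark; the table path is made up):

import org.apache.spark.sql.delta.{DeltaLog, DeltaParquetFileFormat}

// Protocol and Metadata come from the table's current Snapshot.
val snapshot = DeltaLog.forTable(spark, "/tmp/delta/users").update()
val fileFormat = DeltaParquetFileFormat(
  protocol = snapshot.protocol,
  metadata = snapshot.metadata)
// The other arguments keep their defaults:
// nullableRowTrackingFields = false, optimizationsEnabled = true,
// tablePath = None, isCDCRead = false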

DeltaParquetFileFormat is created when:

__delta_internal_row_index Internal Metadata Column

DeltaParquetFileFormat defines __delta_internal_row_index as the name of the metadata column with the index of a row within a file.

__delta_internal_row_index is an internal column.

When the column is defined in the schema (of a delta table), DeltaParquetFileFormat creates an iteratorWithAdditionalMetadataColumns.
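
For illustration, a sketch of requesting the column in a read schema (the data schema is made up; the row index is a LongType field holding the 0-based position of a row within its parquet file):

import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// A made-up data schema plus the internal row-index metadata column.
val dataSchema = StructType(Seq(StructField("id", StringType)))
val rowIndexField = StructField("__delta_internal_row_index", LongType)
val requiredSchema = StructType(dataSchema.fields :+ rowIndexField)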

Warning

__delta_internal_row_index column is only supported for delta tables with the following features disabled:

  • File splitting
  • Predicate (filter) pushdown

__delta_internal_row_index is used when:

Building Data Reader (with Partition Values)

FileFormat
buildReaderWithPartitionValues(
  sparkSession: SparkSession,
  dataSchema: StructType,
  partitionSchema: StructType,
  requiredSchema: StructType,
  filters: Seq[Filter],
  options: Map[String, String],
  hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]

buildReaderWithPartitionValues is part of the FileFormat (Spark SQL) abstraction.

With neither __delta_internal_is_row_deleted nor row_index column found in the given requiredSchema, buildReaderWithPartitionValues uses the default buildReaderWithPartitionValues (from ParquetFileFormat (Spark SQL)).
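
A sketch of that check (simplified; the column names are as used on this page):

import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Use ParquetFileFormat's default reader unless one of the extra
// metadata columns is requested in requiredSchema.
val requiredSchema = StructType(Seq(StructField("id", LongType)))
val extraColumns = Set("__delta_internal_is_row_deleted", "__delta_internal_row_index")
val useDefaultReader = !requiredSchema.fieldNames.exists(extraColumns.contains)
// useDefaultReader: Boolean = true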

row_index Column Name

buildReaderWithPartitionValues uses spark.databricks.delta.deletionVectors.useMetadataRowIndex to determine the row_index column name: either row_index (straight from ParquetFileFormat) or __delta_internal_row_index.
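
A sketch of that decision (assuming a SparkSession in scope as spark; the property's default value of true is an assumption here):

// Pick the row-index column name based on
// spark.databricks.delta.deletionVectors.useMetadataRowIndex.
val useMetadataRowIndex = spark.conf
  .get("spark.databricks.delta.deletionVectors.useMetadataRowIndex", "true")
  .toBoolean
val rowIndexColumnName =
  if (useMetadataRowIndex) "row_index" // ParquetFileFormat's metadata column
  else "__delta_internal_row_index"    // DeltaParquetFileFormat's internal column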

FIXME Other assertions

In the end, buildReaderWithPartitionValues builds a parquet data reader with additional metadata columns.

iteratorWithAdditionalMetadataColumns

iteratorWithAdditionalMetadataColumns(
  partitionedFile: PartitionedFile,
  iterator: Iterator[Object],
  isRowDeletedColumnOpt: Option[ColumnMetadata],
  rowIndexColumnOpt: Option[ColumnMetadata],
  useOffHeapBuffers: Boolean,
  serializableHadoopConf: SerializableConfiguration,
  useMetadataRowIndex: Boolean): Iterator[Object]

iteratorWithAdditionalMetadataColumns...FIXME
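
Until the FIXME above is resolved, here is a conceptual sketch (not the actual implementation) of what the iterator does: decorate every row with its index within the file and an is-row-deleted flag backed by a deletion vector:

// Each row gets a running index and a deleted flag looked up in a
// deletion-vector membership test.
final case class RowWithMetadata(row: Seq[Any], rowIndex: Long, isRowDeleted: Boolean)

def withMetadataColumns(
    rows: Iterator[Seq[Any]],
    isDeleted: Long => Boolean): Iterator[RowWithMetadata] =
  rows.zipWithIndex.map { case (row, idx) =>
    RowWithMetadata(row, idx.toLong, isDeleted(idx.toLong))
  }

// Rows 1 and 3 are marked deleted by a made-up deletion vector.
val deletionVector = Set(1L, 3L)
withMetadataColumns(Iterator(Seq("a"), Seq("b"), Seq("c"), Seq("d")), deletionVector.contains)
  .foreach(println)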

supportFieldName

FileFormat
supportFieldName(
  name: String): Boolean

supportFieldName is part of the FileFormat (Spark SQL) abstraction.

supportFieldName is enabled (true) when either holds true:

  • Column mapping is enabled on the delta table (the column mapping mode is not NoMapping)
  • The default supportFieldName (of ParquetFileFormat) holds for the given name
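
A sketch of that rule (the special-character blacklist mirrors parquet's field-name check and is an assumption here):

// With column mapping on, any field name is supported; otherwise fall
// back to parquet's restriction on special characters in field names.
val invalidChars = " ,;{}()\n\t="
def supportFieldName(name: String, columnMappingEnabled: Boolean): Boolean =
  columnMappingEnabled || !name.exists(c => invalidChars.contains(c))

// supportFieldName("my column", columnMappingEnabled = true)  // true
// supportFieldName("my column", columnMappingEnabled = false) // false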

metadataSchemaFields

FileFormat
metadataSchemaFields: Seq[StructField]

metadataSchemaFields is part of the FileFormat (Spark SQL) abstraction.

Review Me

Due to an issue in Spark SQL (to be reported), metadataSchemaFields removes row_index from the default metadataSchemaFields (Spark SQL).

ParquetFileFormat

All that ParquetFileFormat does (when requested for metadataSchemaFields) is add the row_index column. In other words, DeltaParquetFileFormat reverts this column addition.
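
A sketch of the removal (the default fields are made up; the real code filters the fields from ParquetFileFormat):

import org.apache.spark.sql.types.{LongType, StringType, StructField}

// Drop the row_index field that ParquetFileFormat adds to the default
// metadata schema fields.
val defaultMetadataFields = Seq(
  StructField("file_path", StringType), // made-up default
  StructField("row_index", LongType))   // added by ParquetFileFormat
val metadataSchemaFields = defaultMetadataFields.filterNot(_.name == "row_index")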