DeltaParquetFileFormat¶
DeltaParquetFileFormat is a ParquetFileFormat (Spark SQL) that supports no restrictions on column names.
Creating Instance¶
DeltaParquetFileFormat takes the following to be created:
- Protocol
- Metadata
- nullableRowTrackingFields flag (default: false)
- optimizationsEnabled flag (default: true)
- Optional Table Path (default: None (unspecified))
- isCDCRead flag (default: false)
DeltaParquetFileFormat is created when:
- DeltaFileFormat is requested for the file format
- CDCReaderImpl is requested for the scan
__delta_internal_row_index Internal Metadata Column¶
DeltaParquetFileFormat defines __delta_internal_row_index as the name of the metadata column with the index of a row within a file.
__delta_internal_row_index is an internal column.
When defined in the schema (of a delta table), DeltaParquetFileFormat creates an iteratorWithAdditionalMetadataColumns.
Warning
__delta_internal_row_index column is only supported for delta tables with the following features disabled:
__delta_internal_row_index is used when:
- DMLWithDeletionVectorsHelper is requested to replace a FileIndex (in all the delta tables in a logical plan)
- DeletionVectorBitmapGenerator is requested to buildRowIndexSetsForFilesMatchingCondition
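The row-index sets above can be illustrated with a toy sketch (the `rowIndexesMatching` helper below is hypothetical, not part of Delta Lake): given the rows of a file and a condition, collect the per-file indexes of the matching rows.

```scala
// Hypothetical sketch: collect per-file row indexes of rows matching a condition,
// the kind of row-index set DeletionVectorBitmapGenerator builds per file.
def rowIndexesMatching[A](rows: Seq[A])(condition: A => Boolean): Seq[Long] =
  rows.zipWithIndex.collect { case (row, idx) if condition(row) => idx.toLong }
```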
Building Data Reader (with Partition Values)¶
FileFormat
```scala
buildReaderWithPartitionValues(
  sparkSession: SparkSession,
  dataSchema: StructType,
  partitionSchema: StructType,
  requiredSchema: StructType,
  filters: Seq[Filter],
  options: Map[String, String],
  hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]
```
buildReaderWithPartitionValues is part of the FileFormat (Spark SQL) abstraction.
When neither the __delta_internal_is_row_deleted nor the row_index column is found in the given requiredSchema, buildReaderWithPartitionValues uses the default buildReaderWithPartitionValues (from ParquetFileFormat (Spark SQL)).
row_index Column Name
buildReaderWithPartitionValues uses spark.databricks.delta.deletionVectors.useMetadataRowIndex to determine the row_index column name (straight from ParquetFileFormat or __delta_internal_row_index).
FIXME Other assertions
In the end, buildReaderWithPartitionValues builds a parquet data reader with additional metadata columns.
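The dispatch described above can be modeled with a standalone toy sketch (the `chooseReader` function and the string return values are hypothetical, for illustration only): with none of the metadata columns requested, the default parquet reader is used; otherwise the reader is extended with the additional metadata columns.

```scala
// Hypothetical model of the dispatch in buildReaderWithPartitionValues:
// fall back to the plain parquet reader unless a metadata column is requested.
def chooseReader(requiredColumns: Seq[String]): String = {
  val hasIsRowDeleted = requiredColumns.contains("__delta_internal_is_row_deleted")
  // The row-index column name varies (row_index or __delta_internal_row_index).
  val hasRowIndex = requiredColumns.exists(Set("row_index", "__delta_internal_row_index"))
  if (!hasIsRowDeleted && !hasRowIndex) "default-parquet-reader"
  else "reader-with-additional-metadata-columns"
}
```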
iteratorWithAdditionalMetadataColumns¶
```scala
iteratorWithAdditionalMetadataColumns(
  partitionedFile: PartitionedFile,
  iterator: Iterator[Object],
  isRowDeletedColumnOpt: Option[ColumnMetadata],
  rowIndexColumnOpt: Option[ColumnMetadata],
  useOffHeapBuffers: Boolean,
  serializableHadoopConf: SerializableConfiguration,
  useMetadataRowIndex: Boolean): Iterator[Object]
```
iteratorWithAdditionalMetadataColumns...FIXME
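At its core, attaching a row-index metadata column amounts to pairing every row from the underlying iterator with its position in the file. A minimal standalone sketch (the `withRowIndex` helper is hypothetical; the real method also handles the is-row-deleted column, off-heap buffers, and batched rows):

```scala
// Hypothetical sketch: pair each row from a file's iterator with its row index,
// as the __delta_internal_row_index metadata column does.
def withRowIndex[A](rows: Iterator[A]): Iterator[(A, Long)] =
  rows.zipWithIndex.map { case (row, idx) => (row, idx.toLong) }
```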
supportFieldName¶
FileFormat
supportFieldName is part of the FileFormat (Spark SQL) abstraction.
supportFieldName is enabled (true) when either of the following holds:
- The DeltaColumnMappingMode is not NoMapping
- The default (parent) supportFieldName (Spark SQL) is positive
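The two conditions can be sketched as a standalone model (the function signature and the `parentSupports` parameter are hypothetical; the real method is an override on DeltaParquetFileFormat):

```scala
// Hypothetical model of supportFieldName: any field name is supported
// under a non-trivial column mapping mode; otherwise defer to the parent.
sealed trait DeltaColumnMappingMode
case object NoMapping extends DeltaColumnMappingMode
case object NameMapping extends DeltaColumnMappingMode

def supportFieldName(
    mode: DeltaColumnMappingMode,
    parentSupports: String => Boolean)(fieldName: String): Boolean =
  mode != NoMapping || parentSupports(fieldName)
```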
metadataSchemaFields¶
FileFormat
metadataSchemaFields is part of the FileFormat (Spark SQL) abstraction.
Review Me
Due to an issue in Spark SQL (to be reported), metadataSchemaFields removes row_index from the default metadataSchemaFields (Spark SQL).
ParquetFileFormat
All that ParquetFileFormat does (when requested for the metadataSchemaFields) is add the row_index column. In other words, DeltaParquetFileFormat reverts this column addition.
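The reversal can be sketched as a one-liner (the standalone `metadataSchemaFields` function and its `parentFields` parameter are hypothetical; the real method overrides FileFormat and works on StructFields):

```scala
// Hypothetical sketch: drop row_index from the parent's metadata schema fields,
// undoing the addition that ParquetFileFormat makes.
def metadataSchemaFields(parentFields: Seq[String]): Seq[String] =
  parentFields.filterNot(_ == "row_index")
```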