DeltaParquetFileFormat¶
DeltaParquetFileFormat is a ParquetFileFormat (Spark SQL) that imposes no restrictions on column names.
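As an illustration, with column mapping enabled, a delta table can use column names that plain Parquet tables reject (a minimal sketch; the table name and the session setup are assumptions of this example):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  // Delta Lake SQL support
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

// Column names with spaces and special characters are rejected by
// plain Parquet tables, yet accepted by delta tables with column mapping
// (demo_table is a hypothetical name)
spark.sql("""
  CREATE TABLE demo_table (`my column ;{}()` INT) USING delta
  TBLPROPERTIES ('delta.columnMapping.mode' = 'name')
""")
```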
Creating Instance¶
DeltaParquetFileFormat takes the following to be created:
- Protocol
- Metadata
- nullableRowTrackingFields flag (default: false)
- optimizationsEnabled flag (default: true)
- Optional Table Path (default: None (unspecified))
- isCDCRead flag (default: false)
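A minimal sketch of creating a DeltaParquetFileFormat from a table's snapshot (the table path is hypothetical; the flags, the table path and isCDCRead keep their defaults):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.delta.{DeltaLog, DeltaParquetFileFormat}

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// /tmp/delta/demo is a hypothetical delta table path
val deltaLog = DeltaLog.forTable(spark, "/tmp/delta/demo")
val snapshot = deltaLog.update()

// Protocol and Metadata come from the table's snapshot;
// the remaining arguments keep their defaults
val fileFormat = DeltaParquetFileFormat(snapshot.protocol, snapshot.metadata)
```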
DeltaParquetFileFormat is created when:
- DeltaFileFormat is requested for the file format
- CDCReaderImpl is requested to scanIndex
__delta_internal_row_index Internal Metadata Column¶
DeltaParquetFileFormat defines __delta_internal_row_index as the name of the metadata column with the index of a row within a file.
__delta_internal_row_index is an internal column.
When defined in the schema (of a delta table), DeltaParquetFileFormat creates an iteratorWithAdditionalMetadataColumns.
Warning
__delta_internal_row_index column is only supported for delta tables with file splitting and predicate pushdown disabled.
__delta_internal_row_index is used when:
- DMLWithDeletionVectorsHelper is requested to replace a FileIndex (in all the delta tables in a logical plan)
- DeletionVectorBitmapGenerator is requested to buildRowIndexSetsForFilesMatchingCondition
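The column name itself is a constant of the DeltaParquetFileFormat companion object (a sketch; the constant name follows the Delta Lake source layout and should be treated as an assumption here):

```scala
import org.apache.spark.sql.delta.DeltaParquetFileFormat

// The internal metadata column name
// (ROW_INDEX_COLUMN_NAME is assumed to be where the name is defined)
println(DeltaParquetFileFormat.ROW_INDEX_COLUMN_NAME)
// __delta_internal_row_index
```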
Building Data Reader (with Partition Values)¶
FileFormat
buildReaderWithPartitionValues(
sparkSession: SparkSession,
dataSchema: StructType,
partitionSchema: StructType,
requiredSchema: StructType,
filters: Seq[Filter],
options: Map[String, String],
hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]
buildReaderWithPartitionValues is part of the FileFormat (Spark SQL) abstraction.
With neither __delta_internal_is_row_deleted nor row_index columns found in the given requiredSchema, buildReaderWithPartitionValues uses the default buildReaderWithPartitionValues (from ParquetFileFormat (Spark SQL)).
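The fallback decision can be pictured as a schema lookup (a standalone sketch, assuming the column names on this page; not the actual implementation):

```scala
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// The two metadata columns this page describes
val IS_ROW_DELETED = "__delta_internal_is_row_deleted"
val ROW_INDEX = "__delta_internal_row_index"

// With neither column requested, the default ParquetFileFormat reader is used
def usesDefaultReader(requiredSchema: StructType): Boolean =
  !requiredSchema.fieldNames.contains(IS_ROW_DELETED) &&
    !requiredSchema.fieldNames.contains(ROW_INDEX)

val plainSchema = StructType(Seq(StructField("id", LongType)))
assert(usesDefaultReader(plainSchema))
```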
row_index Column Name
buildReaderWithPartitionValues uses the spark.databricks.delta.deletionVectors.useMetadataRowIndex configuration property to determine the row_index column name (either row_index straight from ParquetFileFormat (Spark SQL) or __delta_internal_row_index).
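A hypothetical rendering of that choice (only the column names and the configuration property come from this page; the helper itself is an assumption):

```scala
// row_index comes straight from ParquetFileFormat (Spark SQL) when
// spark.databricks.delta.deletionVectors.useMetadataRowIndex is enabled;
// Delta's internal column is used otherwise
def rowIndexColumnName(useMetadataRowIndex: Boolean): String =
  if (useMetadataRowIndex) "row_index"
  else "__delta_internal_row_index"
```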
FIXME Other assertions
In the end, buildReaderWithPartitionValues builds a parquet data reader with additional metadata columns.
iteratorWithAdditionalMetadataColumns¶
iteratorWithAdditionalMetadataColumns(
partitionedFile: PartitionedFile,
iterator: Iterator[Object],
isRowDeletedColumnOpt: Option[ColumnMetadata],
rowIndexColumnOpt: Option[ColumnMetadata],
useOffHeapBuffers: Boolean,
serializableHadoopConf: SerializableConfiguration,
useMetadataRowIndex: Boolean): Iterator[Object]
iteratorWithAdditionalMetadataColumns...FIXME
supportFieldName¶
FileFormat
supportFieldName(
name: String): Boolean
supportFieldName is part of the FileFormat (Spark SQL) abstraction.
supportFieldName is enabled (true) when either holds:

- The DeltaColumnMappingMode is not NoMapping
- The default (parent) supportFieldName (Spark SQL) is enabled
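Sketched as a standalone predicate (with the DeltaColumnMappingMode reduced to its name; an assumption of this example):

```scala
// supportFieldName holds when column mapping is enabled (mode other
// than "none") or when the parent ParquetFileFormat supports the name
def supportFieldName(columnMappingModeName: String, parentSupports: Boolean): Boolean =
  columnMappingModeName != "none" || parentSupports
```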
metadataSchemaFields¶
FileFormat
metadataSchemaFields: Seq[StructField]
metadataSchemaFields is part of the FileFormat (Spark SQL) abstraction.
Review Me
Due to an issue in Spark SQL (to be reported), metadataSchemaFields removes row_index from the default metadataSchemaFields (Spark SQL).
ParquetFileFormat
All that ParquetFileFormat does (when requested for the metadataSchemaFields) is add the row_index column. In other words, DeltaParquetFileFormat reverts this column addition.
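The reversal amounts to filtering row_index out of the parent's fields (a sketch; parentFields stands in for the default metadataSchemaFields):

```scala
import org.apache.spark.sql.types.StructField

// Drop the row_index metadata field that ParquetFileFormat adds
def metadataSchemaFields(parentFields: Seq[StructField]): Seq[StructField] =
  parentFields.filterNot(_.name == "row_index")
```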