DeltaParquetFileFormat¶
DeltaParquetFileFormat is a ParquetFileFormat (Spark SQL) that supports no restrictions on column names.
Creating Instance¶
DeltaParquetFileFormat takes the following to be created:

- Protocol
- Metadata
- nullableRowTrackingFields flag (default: false)
- optimizationsEnabled flag (default: true)
- Optional Table Path (default: None (unspecified))
- isCDCRead flag (default: false)
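As an illustration only, here is a minimal sketch of creating a DeltaParquetFileFormat for a table snapshot with the defaults listed above. The constructor parameter names and the Snapshot accessors are assumptions, not taken from this page.

```scala
import org.apache.spark.sql.delta.{DeltaParquetFileFormat, Snapshot}

// A sketch: build the file format for a given table snapshot using the defaults above.
// Parameter names are assumed to match the argument list in this section.
def fileFormatFor(snapshot: Snapshot): DeltaParquetFileFormat =
  DeltaParquetFileFormat(
    protocol = snapshot.protocol,
    metadata = snapshot.metadata,
    nullableRowTrackingFields = false, // default
    optimizationsEnabled = true,       // default
    tablePath = None,                  // default: no table path (unspecified)
    isCDCRead = false)                 // default: not a CDC read
```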
DeltaParquetFileFormat is created when:

- DeltaFileFormat is requested for the file format
- CDCReaderImpl is requested for the scanIndex
__delta_internal_row_index Internal Metadata Column¶
DeltaParquetFileFormat defines __delta_internal_row_index as the name of the metadata column with the index of a row within a file.

__delta_internal_row_index is an internal column.

When the column is defined in the schema (of a delta table), DeltaParquetFileFormat creates an iteratorWithAdditionalMetadataColumns.

Warning
The __delta_internal_row_index column is only supported for delta tables with the following features disabled:

__delta_internal_row_index is used when:

- DMLWithDeletionVectorsHelper is requested to replace a FileIndex (in all the delta tables in a logical plan)
- DeletionVectorBitmapGenerator is requested to buildRowIndexSetsForFilesMatchingCondition
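For illustration, a sketch of what requesting the row-index column in a read schema could look like. The field type and nullability are assumptions, and the exact field metadata Delta attaches to the column is not shown here.

```scala
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Hypothetical helper: append Delta's internal row-index column to a read schema
// so that iteratorWithAdditionalMetadataColumns can populate it per file.
def withInternalRowIndex(dataSchema: StructType): StructType =
  dataSchema.add(StructField("__delta_internal_row_index", LongType, nullable = false))
```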
Building Data Reader (with Partition Values)¶
buildReaderWithPartitionValues(
sparkSession: SparkSession,
dataSchema: StructType,
partitionSchema: StructType,
requiredSchema: StructType,
filters: Seq[Filter],
options: Map[String, String],
hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]
buildReaderWithPartitionValues is part of the FileFormat (Spark SQL) abstraction.

When neither the __delta_internal_is_row_deleted nor the row_index column is found in the given requiredSchema, buildReaderWithPartitionValues uses the default buildReaderWithPartitionValues (from ParquetFileFormat (Spark SQL)).
row_index Column Name
buildReaderWithPartitionValues uses spark.databricks.delta.deletionVectors.useMetadataRowIndex to determine the row_index column name: either straight from ParquetFileFormat or __delta_internal_row_index.
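A sketch of that choice, for illustration. The config key and Delta's internal column name come from this page; parquetRowIndexColumn is a stand-in for the metadata row-index column name that ParquetFileFormat itself defines, and the real default resolution goes through Delta's SQL configuration.

```scala
import org.apache.spark.sql.SparkSession

// Pick the row_index column name based on the useMetadataRowIndex flag.
def rowIndexColumnName(spark: SparkSession, parquetRowIndexColumn: String): String =
  if (spark.conf.get("spark.databricks.delta.deletionVectors.useMetadataRowIndex").toBoolean)
    parquetRowIndexColumn            // straight from ParquetFileFormat
  else
    "__delta_internal_row_index"     // Delta's internal row-index column
```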
FIXME Other assertions
In the end, buildReaderWithPartitionValues builds a parquet data reader with additional metadata columns.
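The overall dispatch can be sketched as follows (the helper name is hypothetical, not part of DeltaParquetFileFormat). When the check returns false, the reader comes straight from ParquetFileFormat; otherwise the reader is wrapped with iteratorWithAdditionalMetadataColumns.

```scala
import org.apache.spark.sql.types.StructType

// Does the required schema ask for any of the extra metadata columns?
def needsAdditionalMetadataColumns(
    requiredSchema: StructType,
    rowIndexColumnName: String): Boolean = {
  val names = requiredSchema.fieldNames.toSet
  names.contains("__delta_internal_is_row_deleted") || names.contains(rowIndexColumnName)
}
```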
iteratorWithAdditionalMetadataColumns¶
iteratorWithAdditionalMetadataColumns(
partitionedFile: PartitionedFile,
iterator: Iterator[Object],
isRowDeletedColumnOpt: Option[ColumnMetadata],
rowIndexColumnOpt: Option[ColumnMetadata],
useOffHeapBuffers: Boolean,
serializableHadoopConf: SerializableConfiguration,
useMetadataRowIndex: Boolean): Iterator[Object]
iteratorWithAdditionalMetadataColumns...FIXME
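A very rough sketch of the idea only (not the actual implementation): wrap a row iterator so that every row also carries its index within the file, which is what the __delta_internal_row_index column exposes.

```scala
// Pair every row produced by a per-file iterator with its row index.
def withRowIndexes[A](rows: Iterator[A]): Iterator[(A, Long)] =
  rows.zipWithIndex.map { case (row, index) => (row, index.toLong) }
```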
supportFieldName¶
supportFieldName(
  name: String): Boolean
supportFieldName is part of the FileFormat (Spark SQL) abstraction.

supportFieldName is enabled (true) when either of the following holds:

- The DeltaColumnMappingMode is not NoMapping
- The default (parent) supportFieldName (Spark SQL) returns true
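A sketch of the rule above. parentSupports is a hypothetical stand-in for the default ParquetFileFormat.supportFieldName check, not the real call.

```scala
import org.apache.spark.sql.delta.{DeltaColumnMappingMode, NoMapping}

// Any field name is supported under column mapping; otherwise defer to the parent check.
def supportFieldName(
    name: String,
    columnMappingMode: DeltaColumnMappingMode,
    parentSupports: String => Boolean): Boolean =
  columnMappingMode != NoMapping || parentSupports(name)
```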
metadataSchemaFields¶
metadataSchemaFields: Seq[StructField]
metadataSchemaFields is part of the FileFormat (Spark SQL) abstraction.

Review Me
Due to an issue in Spark SQL (to be reported), metadataSchemaFields removes row_index from the default metadataSchemaFields (Spark SQL).

ParquetFileFormat
All that ParquetFileFormat does (when requested for metadataSchemaFields) is to add row_index. In other words, DeltaParquetFileFormat reverts this column addition.
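A sketch of that filtering, for illustration. parentFields stands in for the default metadataSchemaFields from ParquetFileFormat, and "row_index" is the name of the field being dropped.

```scala
import org.apache.spark.sql.types.StructField

// Keep all parent metadata fields except the row_index column.
def deltaMetadataSchemaFields(parentFields: Seq[StructField]): Seq[StructField] =
  parentFields.filterNot(_.name == "row_index")
```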