DeltaParquetFileFormat¶
DeltaParquetFileFormat is a ParquetFileFormat (Spark SQL) that imposes no restrictions on column names.
Creating Instance¶
DeltaParquetFileFormat takes the following to be created:

- Protocol
- Metadata
- isSplittable flag
- disablePushDowns flag
- Optional table path (default: None (unspecified))
- Optional broadcast variable with DeletionVectorDescriptorWithFilterTypes per URI (default: None (unspecified))
- Optional broadcast variable with a Hadoop Configuration (default: None (unspecified))
DeltaParquetFileFormat is created when:

- DeltaFileFormat is requested for the fileFormat
- CDCReaderImpl is requested for the scanIndex
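The parameter list above can be sketched as a case class with the same defaults. This is a minimal illustration only; the names and types are simplified stand-ins for the real Delta Lake signatures (e.g., `String` in place of the actual Protocol and Metadata types):

```scala
// Toy model of the constructor shape (illustrative, not the real API):
// two required arguments, then flags and optionals with their defaults.
case class DeltaParquetFileFormatSketch(
    protocol: String,
    metadata: String,
    isSplittable: Boolean = true,
    disablePushDowns: Boolean = false,
    tablePath: Option[String] = None,
    hadoopConf: Option[Map[String, String]] = None)

// Only protocol and metadata are required; everything else defaults.
val format = DeltaParquetFileFormatSketch("(protocol)", "(metadata)")
```

With no arguments beyond the required two, the format is splittable, pushdowns stay enabled, and no table path is set.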
isSplittable Flag¶
DeltaParquetFileFormat can be given an isSplittable flag when created.

isSplittable is part of the FileFormat (Spark SQL) abstraction to indicate whether this delta table is splittable or not.

Unless given, the isSplittable flag is enabled by default (to match the base ParquetFileFormat (Spark SQL)).
isSplittable is disabled (false) when:

- DeltaParquetFileFormat is requested to copyWithDVInfo (and created with deletion vectors enabled)
- DMLWithDeletionVectorsHelper is requested to replace a FileIndex
Note
DeltaParquetFileFormat is either splittable or supports deletion vectors.
isSplittable is also used in buildReaderWithPartitionValues (to assert the configuration of this delta table).
disablePushDowns¶
DeltaParquetFileFormat can be given a disablePushDowns flag when created.

The disablePushDowns flag indicates whether predicate pushdown is disabled for this delta table, i.e. whether buildReaderWithPartitionValues passes the filters down to the parquet data reader.

Unless given, the disablePushDowns flag is disabled (false) by default.
disablePushDowns is enabled (true) when:

- DeltaParquetFileFormat is requested to copyWithDVInfo (and created with deletion vectors enabled)
- DMLWithDeletionVectorsHelper is requested to replace a FileIndex
Note
DeltaParquetFileFormat supports either the predicate pushdown optimization (disablePushDowns is disabled) or deletion vectors.
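The two notes above describe the same coupling: attaching deletion-vector information turns off both splitting and pushdowns. A minimal sketch of that coupling in plain Scala (the `FileFormatFlags` type and `withDeletionVectors` helper are hypothetical stand-ins, loosely modeling what copyWithDVInfo does to the flags):

```scala
// Hypothetical model of the two flags (not the real DeltaParquetFileFormat).
case class FileFormatFlags(isSplittable: Boolean, disablePushDowns: Boolean)

// When deletion-vector information is attached, the format must read whole
// files (no splitting) and must not push filters down, so that rows marked
// deleted cannot slip past the deletion-vector check.
def withDeletionVectors(flags: FileFormatFlags): FileFormatFlags =
  flags.copy(isSplittable = false, disablePushDowns = true)

val default = FileFormatFlags(isSplittable = true, disablePushDowns = false)
val withDVs = withDeletionVectors(default)
```

Starting from the defaults (splittable, pushdowns enabled), attaching deletion vectors flips both flags at once.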
__delta_internal_row_index Internal Metadata Column¶
DeltaParquetFileFormat defines __delta_internal_row_index as the name of the metadata column with the index of a row within a file.

__delta_internal_row_index is an internal column.

When __delta_internal_row_index is defined in the schema (of a delta table), DeltaParquetFileFormat creates an iteratorWithAdditionalMetadataColumns.
Warning
The __delta_internal_row_index column is only supported for delta tables with certain table features disabled.
__delta_internal_row_index is used when:

- DMLWithDeletionVectorsHelper is requested to replace a FileIndex (in all the delta tables in a logical plan)
- DeletionVectorBitmapGenerator is requested to buildRowIndexSetsForFilesMatchingCondition
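The idea behind the column can be sketched without Spark: a per-file reader produces rows, and an extra column carrying each row's 0-based index within the file is appended. The `Row` alias and `withRowIndex` helper below are hypothetical, a stand-in for the real iteratorWithAdditionalMetadataColumns over InternalRows:

```scala
// Toy stand-in for Spark's InternalRow: a row is just a sequence of values.
type Row = Seq[Any]

// Append a row-index column (akin to __delta_internal_row_index) to every
// row produced by a per-file reader. Indexes are 0-based within the file.
def withRowIndex(rows: Iterator[Row]): Iterator[Row] =
  rows.zipWithIndex.map { case (row, idx) => row :+ idx.toLong }

val indexed = withRowIndex(Iterator(Seq[Any]("a"), Seq[Any]("b"))).toList
```

Note that counting from zero per file is only meaningful when a task reads the whole file, which is one reason splitting is disabled when row indexes matter (e.g., for deletion vectors).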
Building Data Reader (With Partition Values)¶
buildReaderWithPartitionValues(
sparkSession: SparkSession,
dataSchema: StructType,
partitionSchema: StructType,
requiredSchema: StructType,
filters: Seq[Filter],
options: Map[String, String],
hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]
buildReaderWithPartitionValues prepares the given schemas (dataSchema, partitionSchema and requiredSchema) before requesting the parent ParquetFileFormat to buildReaderWithPartitionValues.
buildReaderWithPartitionValues is part of the FileFormat (Spark SQL) abstraction.
Preparing Schema¶
prepareSchema(
inputSchema: StructType): StructType
prepareSchema creates a physical schema (for the inputSchema, the referenceSchema and the DeltaColumnMappingMode).
supportFieldName¶
supportFieldName(
  name: String): Boolean

supportFieldName is part of the FileFormat (Spark SQL) abstraction.
supportFieldName is enabled (true) when either holds:

- The DeltaColumnMappingMode is not NoMapping
- The default (parent) supportFieldName (Spark SQL) holds
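The two conditions can be sketched as a simple disjunction. The helper names are hypothetical; the forbidden-character set mirrors the characters Spark's Parquet support rejects in field names (` ,;{}()\n\t=`), used here as a stand-in for the parent check:

```scala
// Characters that parquet field names cannot contain (per Spark's parquet
// field-name validation); stand-in for the parent supportFieldName rule.
val forbidden: Set[Char] = " ,;{}()\n\t=".toSet

def defaultSupportFieldName(name: String): Boolean = !name.exists(forbidden)

// With column mapping enabled, logical names never reach parquet directly,
// so any field name is acceptable; otherwise fall back to the parent rule.
def supportFieldName(columnMappingEnabled: Boolean, name: String): Boolean =
  columnMappingEnabled || defaultSupportFieldName(name)
```

So a name like `my col` is rejected without column mapping but accepted with it, which is exactly why column mapping lifts the restrictions on column names.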
metadataSchemaFields¶
metadataSchemaFields: Seq[StructField]

metadataSchemaFields is part of the FileFormat (Spark SQL) abstraction.

Due to an issue in Spark SQL (to be reported), metadataSchemaFields removes row_index from the default metadataSchemaFields (Spark SQL).
Note

All that ParquetFileFormat does (when requested for the metadataSchemaFields) is add the row_index column. In other words, DeltaParquetFileFormat reverts this column addition.
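The revert amounts to filtering one field out of the parent's list. A minimal sketch with field names only (the exact set of parent metadata fields is an assumption here; the real code works with StructFields of the `_metadata` column):

```scala
// Assumed parent metadata fields (names only); the parent ParquetFileFormat
// contributes row_index on top of the base file metadata fields.
val parentMetadataFields = List(
  "file_path", "file_name", "file_size", "file_modification_time",
  "row_index")

// DeltaParquetFileFormat's override: keep everything except row_index.
def metadataSchemaFields(parent: List[String]): List[String] =
  parent.filterNot(_ == "row_index")

val fields = metadataSchemaFields(parentMetadataFields)
```

The remaining fields pass through untouched; only the parent's row_index addition is undone.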