DeltaParquetFileFormatBase¶
DeltaParquetFileFormatBase is an extension of the ParquetFileFormat (Spark SQL) abstraction and the base abstraction of delta file formats (with data files in parquet format).
Deletion Vectors-Enabled Scans¶
DeltaParquetFileFormatBase enforces a precondition (invariant) for delta table scans with Deletion Vectors enabled.
When the delta table is referenced by a path (not a name or identifier) and useMetadataRowIndexOpt is defined, the flag must match optimizationsEnabled.
If they differ, DeltaParquetFileFormatBase throws an IllegalArgumentException.
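A minimal, Spark-free sketch of such an invariant check (the class shape and the error message are assumptions; only the property names mirror this page):

```scala
// Hypothetical, simplified model of the precondition for
// Deletion Vectors-enabled scans. `require` throws an
// IllegalArgumentException when the condition does not hold.
case class DeltaParquetFileFormatBaseSketch(
    tablePath: Option[String],              // defined for path-based tables
    optimizationsEnabled: Boolean,
    useMetadataRowIndexOpt: Option[Boolean]) {

  // For path-based tables with useMetadataRowIndexOpt defined,
  // the flag must match optimizationsEnabled.
  require(
    tablePath.isEmpty ||
      useMetadataRowIndexOpt.forall(_ == optimizationsEnabled),
    "useMetadataRowIndexOpt must match optimizationsEnabled for path-based tables")
}
```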
useMetadataRowIndexOpt Flag¶
DeltaParquetFileFormatBase can be given a useMetadataRowIndexOpt flag when created, to specify the row index source for Deletion Vectors filtering.
useMetadataRowIndexOpt flag is undefined (None) by default.
useMetadataRowIndexOpt is configured using spark.databricks.delta.deletionVectors.useMetadataRowIndex configuration property (for DeltaParquetFileFormat).
useMetadataRowIndexOpt is used in buildReaderWithPartitionValues, if defined, or defaults to the spark.databricks.delta.deletionVectors.useMetadataRowIndex configuration property.
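The fall-back behaviour can be sketched as follows (a simplified model; the configuration lookup is represented by a plain boolean rather than an actual Spark SQL conf read):

```scala
// Hypothetical sketch: an explicitly given flag wins; otherwise fall back
// to the value of the
// spark.databricks.delta.deletionVectors.useMetadataRowIndex property.
def resolveUseMetadataRowIndex(
    useMetadataRowIndexOpt: Option[Boolean],
    confValue: Boolean): Boolean =
  useMetadataRowIndexOpt.getOrElse(confValue)
```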
For path-based delta tables with useMetadataRowIndexOpt defined, the value must match this optimizationsEnabled (for Deletion Vectors-Enabled Scans).
optimizationsEnabled Flag¶
DeltaParquetFileFormatBase can be given optimizationsEnabled flag when created to enable scan optimizations (file splitting and predicate pushdown).
In other words, optimizationsEnabled flag is equivalent to using scan optimizations (file splitting and predicate pushdown).
optimizationsEnabled is enabled (true) by default.
optimizationsEnabled controls isSplitable predicate.
optimizationsEnabled must be enabled when buildReaderWithPartitionValues is executed with this useMetadataRowIndex flag disabled.
When optimizationsEnabled is disabled, prepareFiltersForRead gives no filters.
For path-based delta tables, optimizationsEnabled must match this useMetadataRowIndexOpt (when defined).
optimizationsEnabled is used when:
- DeltaParquetFileFormatBase is created for Deletion Vectors-Enabled Scans
- DMLWithDeletionVectorsHelper is requested to replaceFileIndex (where this flag is disabled)
Implementations¶
Creating Instance¶
DeltaParquetFileFormatBase takes the following to be created:
- ProtocolMetadataAdapter
- nullableRowTrackingConstantFields flag (default: false)
- nullableRowTrackingGeneratedFields flag (default: false)
- optimizationsEnabled flag
- Optional Table Path (default: undefined)
- isCDCRead flag (default: false)
- useMetadataRowIndexOpt flag
While being created, DeltaParquetFileFormatBase asserts the following:
- Deletion Vectors-Enabled Scans
- delta table is readable.
- Either nullableRowTrackingConstantFields is disabled or nullableRowTrackingGeneratedFields is enabled.
- With columnMappingMode as IdMapping, ...FIXME
Abstract Class
DeltaParquetFileFormatBase is an abstract class and cannot be created directly. It is created indirectly for the concrete DeltaParquetFileFormatBases.
isSplitable¶
ParquetFileFormat
isSplitable is part of the ParquetFileFormat (Spark SQL) abstraction.
isSplitable returns this optimizationsEnabled.
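In a simplified, Spark-free form (the real override also takes a SparkSession, options and a file Path), this rule can be sketched as:

```scala
// Hypothetical sketch: parquet data files may only be split across tasks
// when scan optimizations are enabled.
class FileFormatSketch(optimizationsEnabled: Boolean) {
  def isSplitable: Boolean = optimizationsEnabled
}
```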
Build Data Reader (with Partition Values)¶
ParquetFileFormat
```scala
buildReaderWithPartitionValues(
  sparkSession: SparkSession,
  dataSchema: StructType,
  partitionSchema: StructType,
  requiredSchema: StructType,
  filters: Seq[Filter],
  options: Map[String, String],
  hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]
```
buildReaderWithPartitionValues is part of the ParquetFileFormat (Spark SQL) abstraction.
buildReaderWithPartitionValues...FIXME
prepareSchemaForRead¶
prepareSchemaForRead...FIXME
prepareFiltersForRead¶
prepareFiltersForRead...FIXME
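Given that prepareFiltersForRead gives no filters when optimizationsEnabled is disabled (as noted under optimizationsEnabled Flag above), a minimal sketch of that behaviour (the Filter type here merely stands in for Spark SQL's org.apache.spark.sql.sources.Filter):

```scala
// Hypothetical sketch: pushed-down data filters are only honoured when
// scan optimizations are enabled; otherwise no filters are used at read time.
trait Filter
case class EqualTo(attribute: String, value: Any) extends Filter

def prepareFiltersForRead(
    filters: Seq[Filter],
    optimizationsEnabled: Boolean): Seq[Filter] =
  if (optimizationsEnabled) filters else Seq.empty
```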
hasTablePath¶
hasTablePath is enabled (true) when this delta table is referenced by a path (not a name/identifier).
hasTablePath is used when:
- DeltaParquetFileFormatBase is created and requested to buildReaderWithPartitionValues
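Assuming hasTablePath simply checks whether the optional Table Path (given when created) is defined, it can be sketched as:

```scala
// Hypothetical sketch: a delta table is path-based exactly when
// an optional table path was supplied at creation time.
def hasTablePath(tablePath: Option[String]): Boolean = tablePath.isDefined
```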
supportFieldName¶
FileFormat
supportFieldName is part of the FileFormat (Spark SQL) abstraction.
supportFieldName is enabled (true) when either holds true:
- DeltaColumnMappingMode is not NoMapping
- The default (parent) supportFieldName (Spark SQL) is enabled
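The two conditions can be modelled as follows (a simplified sketch; the parent check is represented by a plain function rather than a super call):

```scala
// Hypothetical sketch: with any column mapping mode other than NoMapping,
// every field name is supported; otherwise defer to the parent check.
sealed trait DeltaColumnMappingMode
case object NoMapping extends DeltaColumnMappingMode
case object NameMapping extends DeltaColumnMappingMode

def supportFieldName(
    mode: DeltaColumnMappingMode,
    parentSupports: String => Boolean)(name: String): Boolean =
  mode != NoMapping || parentSupports(name)
```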
metadataSchemaFields¶
ParquetFileFormat
metadataSchemaFields is part of the ParquetFileFormat (Spark SQL) abstraction.
Review Me
Due to an issue in Spark SQL (to be reported), metadataSchemaFields removes row_index from the default metadataSchemaFields (Spark SQL).
ParquetFileFormat
All that ParquetFileFormat does (when requested for the metadataSchemaFields) is to add the row_index column. In other words, DeltaParquetFileFormatBase reverts this column addition.
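The removal can be sketched as a simple filter over the parent's fields (a simplified model; StructFieldSketch stands in for Spark SQL's StructField and the parent fields are passed in explicitly):

```scala
// Hypothetical sketch: drop the row_index metadata field that the parent
// ParquetFileFormat adds to the default metadata schema fields.
case class StructFieldSketch(name: String)

def metadataSchemaFields(
    parentFields: Seq[StructFieldSketch]): Seq[StructFieldSketch] =
  parentFields.filterNot(_.name == "row_index")
```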
prepareWrite¶
ParquetFileFormat
```scala
prepareWrite(
  sparkSession: SparkSession,
  job: Job,
  options: Map[String, String],
  dataSchema: StructType): OutputWriterFactory
```
prepareWrite is part of the ParquetFileFormat (Spark SQL) abstraction.
prepareWrite...FIXME
fileConstantMetadataExtractors¶
FileFormat
fileConstantMetadataExtractors is part of the FileFormat (Spark SQL) abstraction.
fileConstantMetadataExtractors...FIXME