
DeltaParquetFileFormatBase

DeltaParquetFileFormatBase is a base extension of the ParquetFileFormat (Spark SQL) abstraction for file formats of delta tables (with data files in parquet format).

Deletion Vectors-Enabled Scans

DeltaParquetFileFormatBase makes a precondition/invariant check for delta table scans with Deletion Vectors enabled.

When the delta table is referenced by a path (not a name or identifier) and useMetadataRowIndexOpt is defined, the flag must match optimizationsEnabled.

If they differ, DeltaParquetFileFormatBase throws an IllegalArgumentException:

Wrong arguments for Delta table scan with deletion vectors
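The precondition can be sketched as follows. This is a simplified model, not the actual Delta Lake code: DeltaFormatSketch and its fields stand in for DeltaParquetFileFormatBase and its hasTablePath, optimizationsEnabled and useMetadataRowIndexOpt members.

```scala
// Simplified model of the deletion-vectors scan invariant (not the actual Delta code).
// hasTablePath: the delta table is referenced by a path (not a name/identifier).
case class DeltaFormatSketch(
    hasTablePath: Boolean,
    optimizationsEnabled: Boolean,
    useMetadataRowIndexOpt: Option[Boolean]) {
  // For path-based tables with the flag defined, both flags must agree.
  require(
    !hasTablePath || useMetadataRowIndexOpt.forall(_ == optimizationsEnabled),
    "Wrong arguments for Delta table scan with deletion vectors")
}
```

Note that the check only fires for path-based tables with the flag defined; a table referenced by name, or an undefined flag, passes unconditionally.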

useMetadataRowIndexOpt Flag

DeltaParquetFileFormatBase can be given a useMetadataRowIndexOpt flag when created, which selects the row index source for Deletion Vectors filtering.

useMetadataRowIndexOpt flag is undefined (None) by default.

useMetadataRowIndexOpt is configured using spark.databricks.delta.deletionVectors.useMetadataRowIndex configuration property (for DeltaParquetFileFormat).

useMetadataRowIndexOpt is used in buildReaderWithPartitionValues when defined, and falls back to the spark.databricks.delta.deletionVectors.useMetadataRowIndex configuration property otherwise.
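The fallback can be sketched like this (a simplified stand-alone function; confLookup stands in for reading the Spark SQL configuration, and the method in the real code is a member, not a free function):

```scala
// Sketch: resolve the effective useMetadataRowIndex value.
// confLookup stands in for reading the Spark SQL configuration.
def effectiveUseMetadataRowIndex(
    useMetadataRowIndexOpt: Option[Boolean],
    confLookup: String => Boolean): Boolean =
  useMetadataRowIndexOpt.getOrElse(
    confLookup("spark.databricks.delta.deletionVectors.useMetadataRowIndex"))
```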

For path-based delta tables with useMetadataRowIndexOpt defined, the value must match optimizationsEnabled (for Deletion Vectors-Enabled Scans).

optimizationsEnabled Flag

DeltaParquetFileFormatBase can be given optimizationsEnabled flag when created to enable scan optimizations (file splitting and predicate pushdown).

In other words, optimizationsEnabled flag is equivalent to using scan optimizations (file splitting and predicate pushdown).

optimizationsEnabled is enabled (true) by default.

optimizationsEnabled controls isSplitable predicate.

optimizationsEnabled must be enabled when buildReaderWithPartitionValues is executed with the useMetadataRowIndex flag disabled.

When optimizationsEnabled is disabled, prepareFiltersForRead gives no filters.
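A minimal sketch of that behavior (Filter is a stand-in for Spark SQL's org.apache.spark.sql.sources.Filter, and the real prepareFiltersForRead is a member method, not a free function):

```scala
sealed trait Filter                                       // stand-in for Spark SQL's Filter
case class EqualTo(attribute: String, value: Any) extends Filter

// With optimizations disabled, no filters are pushed down to the parquet reader.
def prepareFiltersForRead(
    filters: Seq[Filter],
    optimizationsEnabled: Boolean): Seq[Filter] =
  if (optimizationsEnabled) filters else Seq.empty
```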

For path-based delta tables, optimizationsEnabled must match useMetadataRowIndexOpt (when defined).

optimizationsEnabled is used when:

Implementations

Creating Instance

DeltaParquetFileFormatBase takes the following to be created:

While being created, DeltaParquetFileFormatBase asserts the following:

  1. Deletion Vectors-Enabled Scans
  2. delta table is readable.
  3. Either nullableRowTrackingConstantFields is disabled or nullableRowTrackingGeneratedFields is enabled.
  4. With columnMappingMode as IdMapping, ...FIXME
Abstract Class

DeltaParquetFileFormatBase is an abstract class and cannot be created directly. It is created indirectly for the concrete DeltaParquetFileFormatBases.

isSplitable

```scala
isSplitable(
  sparkSession: SparkSession,
  options: Map[String, String],
  path: Path): Boolean
```

isSplitable is part of the ParquetFileFormat (Spark SQL) abstraction.

isSplitable returns this optimizationsEnabled.
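In effect, the override ignores its arguments (a sketch with the Spark parameters elided; SplitableSketch is a stand-in class for illustration):

```scala
// Sketch: whether parquet files may be split depends only on the flag.
class SplitableSketch(optimizationsEnabled: Boolean) {
  def isSplitable: Boolean = optimizationsEnabled
}
```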

Build Data Reader (with Partition Values)

```scala
buildReaderWithPartitionValues(
  sparkSession: SparkSession,
  dataSchema: StructType,
  partitionSchema: StructType,
  requiredSchema: StructType,
  filters: Seq[Filter],
  options: Map[String, String],
  hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]
```

buildReaderWithPartitionValues is part of the ParquetFileFormat (Spark SQL) abstraction.

buildReaderWithPartitionValues...FIXME

prepareSchemaForRead

```scala
prepareSchemaForRead(
  inputSchema: StructType): StructType
```

prepareSchemaForRead...FIXME

prepareFiltersForRead

```scala
prepareFiltersForRead(
  filters: Seq[Filter]): Seq[Filter]
```

prepareFiltersForRead...FIXME

hasTablePath

```scala
hasTablePath: Boolean
```

hasTablePath is enabled (true) when this delta table is referenced by a path (not a name/identifier).


hasTablePath is used when:

supportFieldName

```scala
supportFieldName(
  name: String): Boolean
```

supportFieldName is part of the FileFormat (Spark SQL) abstraction.

supportFieldName is enabled (true) when either holds true:

metadataSchemaFields

```scala
metadataSchemaFields: Seq[StructField]
```

metadataSchemaFields is part of the ParquetFileFormat (Spark SQL) abstraction.

Review Me

Due to an issue in Spark SQL (to be reported), metadataSchemaFields removes row_index from the default metadataSchemaFields (Spark SQL).

ParquetFileFormat

All that ParquetFileFormat does (when requested for the metadataSchemaFields) is to add the row_index column. In other words, DeltaParquetFileFormat reverts this column addition.
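The removal can be sketched as a filter over the parent's fields. StructField and the row_index column name here are simplified stand-ins for the Spark SQL types and constant:

```scala
// Stand-in for org.apache.spark.sql.types.StructField.
case class StructField(name: String)

val RowIndexColumn = "row_index" // assumed name of the metadata column

// Drop the row_index column that ParquetFileFormat adds by default.
def metadataSchemaFields(parentFields: Seq[StructField]): Seq[StructField] =
  parentFields.filterNot(_.name == RowIndexColumn)
```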

prepareWrite

```scala
prepareWrite(
  sparkSession: SparkSession,
  job: Job,
  options: Map[String, String],
  dataSchema: StructType): OutputWriterFactory
```

prepareWrite is part of the ParquetFileFormat (Spark SQL) abstraction.

prepareWrite...FIXME

fileConstantMetadataExtractors

```scala
fileConstantMetadataExtractors: Map[String, PartitionedFile => Any]
```

fileConstantMetadataExtractors is part of the FileFormat (Spark SQL) abstraction.

fileConstantMetadataExtractors...FIXME