
DeltaParquetFileFormatBase

DeltaParquetFileFormatBase is a base extension of the ParquetFileFormat (Spark SQL) abstraction for file formats of delta tables (with data files in parquet format).

Deletion Vectors-Enabled Scans

DeltaParquetFileFormatBase makes a precondition/invariant check for delta table scans with Deletion Vectors enabled.

When the delta table is referenced by a path (not a name or identifier) and useMetadataRowIndexOpt is defined, the flag must match optimizationsEnabled.

If they differ, DeltaParquetFileFormatBase throws an IllegalArgumentException:

Wrong arguments for Delta table scan with deletion vectors
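The precondition can be sketched as follows. This is a simplified model, not the actual Delta Lake code: DeltaFormatSketch and its fields stand in for DeltaParquetFileFormatBase and its hasTablePath, optimizationsEnabled and useMetadataRowIndexOpt members.

```scala
// Simplified model of the deletion-vectors scan invariant (not the actual Delta code).
// hasTablePath: the delta table is referenced by a path (not a name/identifier).
case class DeltaFormatSketch(
    hasTablePath: Boolean,
    optimizationsEnabled: Boolean,
    useMetadataRowIndexOpt: Option[Boolean]) {
  // For path-based tables with the flag defined, both flags must agree.
  require(
    !hasTablePath || useMetadataRowIndexOpt.forall(_ == optimizationsEnabled),
    "Wrong arguments for Delta table scan with deletion vectors")
}
```

Note that the check only fires for path-based tables with the flag defined; a table referenced by name, or an undefined flag, passes unconditionally.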

useMetadataRowIndexOpt Flag

DeltaParquetFileFormatBase can be given a useMetadataRowIndexOpt flag when created, which selects the row index source for Deletion Vectors filtering.

useMetadataRowIndexOpt flag is undefined (None) by default.

useMetadataRowIndexOpt is configured using spark.databricks.delta.deletionVectors.useMetadataRowIndex configuration property (for DeltaParquetFileFormat).

useMetadataRowIndexOpt is used in buildReaderWithPartitionValues when defined, and falls back to the spark.databricks.delta.deletionVectors.useMetadataRowIndex configuration property otherwise.
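The fallback can be sketched like this (a simplified stand-alone function; confLookup stands in for reading the Spark SQL configuration, and the method in the real code is a member, not a free function):

```scala
// Sketch: resolve the effective useMetadataRowIndex value.
// confLookup stands in for reading the Spark SQL configuration.
def effectiveUseMetadataRowIndex(
    useMetadataRowIndexOpt: Option[Boolean],
    confLookup: String => Boolean): Boolean =
  useMetadataRowIndexOpt.getOrElse(
    confLookup("spark.databricks.delta.deletionVectors.useMetadataRowIndex"))
```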

For path-based delta tables with useMetadataRowIndexOpt defined, the value must match optimizationsEnabled (for Deletion Vectors-Enabled Scans).

optimizationsEnabled Flag

DeltaParquetFileFormatBase can be given optimizationsEnabled flag when created to enable scan optimizations (file splitting and predicate pushdown).

In other words, optimizationsEnabled flag is equivalent to using scan optimizations (file splitting and predicate pushdown).

optimizationsEnabled is enabled (true) by default.

optimizationsEnabled controls isSplitable predicate.

optimizationsEnabled must be enabled when buildReaderWithPartitionValues is executed with the useMetadataRowIndex flag disabled.

When optimizationsEnabled is disabled, prepareFiltersForRead gives no filters.
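A minimal sketch of that behavior (Filter is a stand-in for Spark SQL's org.apache.spark.sql.sources.Filter, and the real prepareFiltersForRead is a member method, not a free function):

```scala
sealed trait Filter                                       // stand-in for Spark SQL's Filter
case class EqualTo(attribute: String, value: Any) extends Filter

// With optimizations disabled, no filters are pushed down to the parquet reader.
def prepareFiltersForRead(
    filters: Seq[Filter],
    optimizationsEnabled: Boolean): Seq[Filter] =
  if (optimizationsEnabled) filters else Seq.empty
```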

For path-based delta tables, optimizationsEnabled must match useMetadataRowIndexOpt (when defined).

optimizationsEnabled is used when:

Implementations

Creating Instance

DeltaParquetFileFormatBase takes the following to be created:

While being created, DeltaParquetFileFormatBase asserts the following:

  1. Deletion Vectors-Enabled Scans
  2. delta table is readable.
  3. Either nullableRowTrackingConstantFields is disabled or nullableRowTrackingGeneratedFields is enabled.
  4. With columnMappingMode as IdMapping, ...FIXME
Abstract Class

DeltaParquetFileFormatBase is an abstract class and cannot be created directly. It is created indirectly for the concrete DeltaParquetFileFormatBases.

isSplitable

```scala
isSplitable(
  sparkSession: SparkSession,
  options: Map[String, String],
  path: Path): Boolean
```

isSplitable is part of the ParquetFileFormat (Spark SQL) abstraction.

isSplitable returns this optimizationsEnabled.
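In effect, the override ignores its arguments (a sketch with the Spark parameters elided; SplitableSketch is a stand-in class for illustration):

```scala
// Sketch: whether parquet files may be split depends only on the flag.
class SplitableSketch(optimizationsEnabled: Boolean) {
  def isSplitable: Boolean = optimizationsEnabled
}
```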

Build Data Reader (with Partition Values)

```scala
buildReaderWithPartitionValues(
  sparkSession: SparkSession,
  dataSchema: StructType,
  partitionSchema: StructType,
  requiredSchema: StructType,
  filters: Seq[Filter],
  options: Map[String, String],
  hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]
```

buildReaderWithPartitionValues is part of the ParquetFileFormat (Spark SQL) abstraction.

buildReaderWithPartitionValues...FIXME

prepareSchemaForRead

```scala
prepareSchemaForRead(
  inputSchema: StructType): StructType
```

prepareSchemaForRead...FIXME

prepareFiltersForRead

```scala
prepareFiltersForRead(
  filters: Seq[Filter]): Seq[Filter]
```

prepareFiltersForRead...FIXME

hasTablePath

```scala
hasTablePath: Boolean
```

hasTablePath is enabled (true) when this delta table is referenced by a path (not a name/identifier).


hasTablePath is used when:

supportFieldName

```scala
supportFieldName(
  name: String): Boolean
```

supportFieldName is part of the FileFormat (Spark SQL) abstraction.

supportFieldName is enabled (true) when either holds true:

metadataSchemaFields

```scala
metadataSchemaFields: Seq[StructField]
```

metadataSchemaFields is part of the ParquetFileFormat (Spark SQL) abstraction.

Review Me

Due to an issue in Spark SQL (to be reported), metadataSchemaFields removes row_index from the default metadataSchemaFields (Spark SQL).

ParquetFileFormat

All that ParquetFileFormat does (when requested for the metadataSchemaFields) is to add the row_index column. In other words, DeltaParquetFileFormat reverts this column addition.
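The removal can be sketched as a filter over the parent's fields. StructField and the row_index column name here are simplified stand-ins for the Spark SQL types and constant:

```scala
// Stand-in for org.apache.spark.sql.types.StructField.
case class StructField(name: String)

val RowIndexColumn = "row_index" // assumed name of the metadata column

// Drop the row_index column that ParquetFileFormat adds by default.
def metadataSchemaFields(parentFields: Seq[StructField]): Seq[StructField] =
  parentFields.filterNot(_.name == RowIndexColumn)
```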

prepareWrite

```scala
prepareWrite(
  sparkSession: SparkSession,
  job: Job,
  options: Map[String, String],
  dataSchema: StructType): OutputWriterFactory
```

prepareWrite is part of the ParquetFileFormat (Spark SQL) abstraction.

prepareWrite...FIXME

fileConstantMetadataExtractors

```scala
fileConstantMetadataExtractors: Map[String, PartitionedFile => Any]
```

fileConstantMetadataExtractors is part of the FileFormat (Spark SQL) abstraction.

fileConstantMetadataExtractors...FIXME