DeltaParquetFileFormat

DeltaParquetFileFormat is a ParquetFileFormat (Spark SQL) that supports no restrictions on column names.

Creating Instance

DeltaParquetFileFormat takes the following to be created:

  • Protocol
  • Metadata
  • isSplittable Flag
  • disablePushDowns Flag
  • Optional Table Path (default: None (unspecified))
  • Optional Broadcast variable with DeletionVectorDescriptorWithFilterTypes per URI (default: None (unspecified))
  • Optional Broadcast variable with Hadoop Configuration (default: None (unspecified))
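The shape of the constructor can be sketched in plain Scala. This is a hypothetical, simplified model: `Protocol`, `Metadata` and the broadcast arguments are stand-ins for the real Delta Lake / Spark types, but the defaults follow the description above.

```scala
// Hypothetical, simplified sketch of the constructor shape.
// Protocol and Metadata are stand-ins for the real Delta Lake types;
// the broadcast arguments are simplified to plain Maps.
case class Protocol(minReaderVersion: Int, minWriterVersion: Int)
case class Metadata(id: String)

case class DeltaParquetFileFormatSketch(
  protocol: Protocol,
  metadata: Metadata,
  isSplittable: Boolean = true,                         // enabled by default
  disablePushDowns: Boolean = false,                    // disabled by default
  tablePath: Option[String] = None,
  broadcastDvMap: Option[Map[String, String]] = None,   // URI -> DV descriptor (simplified)
  broadcastHadoopConf: Option[Map[String, String]] = None)

// Only Protocol and Metadata are required; everything else has a default.
val format = DeltaParquetFileFormatSketch(Protocol(1, 2), Metadata("table-id"))
```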

DeltaParquetFileFormat is created when:

isSplittable Flag

DeltaParquetFileFormat can be given isSplittable flag when created.

FileFormat

isSplittable is part of the FileFormat (Spark SQL) abstraction to indicate whether this delta table is splittable or not.

Unless given, isSplittable flag is enabled by default (to match the base ParquetFileFormat (Spark SQL)).

isSplittable is disabled (false) when:

Note

DeltaParquetFileFormat is either splittable or supports deletion vectors.

isSplittable is also used to buildReaderWithPartitionValues (to assert the configuration of this delta table).
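The note above implies a mutual exclusion that buildReaderWithPartitionValues can assert. A minimal sketch of such a check (hypothetical helper, not the actual Delta Lake code):

```scala
// Hypothetical sketch of the configuration assertion implied by the note:
// a delta table cannot be both splittable and use deletion vectors.
def assertTableConfig(isSplittable: Boolean, hasDeletionVectors: Boolean): Unit =
  require(!(isSplittable && hasDeletionVectors),
    "A delta table can either be splittable or support deletion vectors, not both")

assertTableConfig(isSplittable = true, hasDeletionVectors = false)  // OK
assertTableConfig(isSplittable = false, hasDeletionVectors = true)  // OK
```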

disablePushDowns

DeltaParquetFileFormat can be given disablePushDowns flag when created.

disablePushDowns flag indicates whether the predicate pushdown optimization is disabled for this delta table, i.e., whether buildReaderWithPartitionValues is allowed to pass the given filters down to the parquet data reader.

Unless given, disablePushDowns flag is disabled (false) by default.

disablePushDowns is enabled (true) when:

Note

DeltaParquetFileFormat supports either the predicate pushdown optimization (disablePushDowns is disabled) or deletion vectors.
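The effect of the flag can be sketched as follows. This is a hypothetical simplification: `Filter` stands in for Spark SQL's data source filters, and `filtersToPushDown` is an illustrative helper, not the actual Delta Lake code.

```scala
// Hypothetical sketch: when push-downs are disabled, no filters reach
// the parquet data reader.
case class Filter(column: String, value: Any)  // stand-in for Spark SQL's Filter

def filtersToPushDown(disablePushDowns: Boolean, filters: Seq[Filter]): Seq[Filter] =
  if (disablePushDowns) Seq.empty else filters

val givenFilters = Seq(Filter("id", 5))
filtersToPushDown(disablePushDowns = false, givenFilters)  // filters pass through
filtersToPushDown(disablePushDowns = true, givenFilters)   // nothing is pushed down
```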

__delta_internal_row_index Internal Metadata Column

DeltaParquetFileFormat defines __delta_internal_row_index as the name of the metadata column with the index of a row within a file.

__delta_internal_row_index is an internal column.

When defined in the schema (of a delta table), DeltaParquetFileFormat creates an iteratorWithAdditionalMetadataColumns.
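The idea of such an iterator can be sketched in plain Scala. This is a hypothetical simplification (rows modeled as `Map`s), in the spirit of iteratorWithAdditionalMetadataColumns rather than its actual implementation:

```scala
// Hypothetical sketch of an iterator that appends a row-index metadata
// column, with rows simplified to Maps from column name to value.
val RowIndexColumn = "__delta_internal_row_index"

def withRowIndex(rows: Iterator[Map[String, Any]]): Iterator[Map[String, Any]] =
  rows.zipWithIndex.map { case (row, idx) =>
    row + (RowIndexColumn -> idx.toLong)  // index of the row within the file
  }

val fileRows = Iterator(Map("id" -> "a"), Map("id" -> "b"))
val indexed = withRowIndex(fileRows).toSeq
// the second row of the file gets index 1
```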

Warning

__delta_internal_row_index column is only supported for delta tables with the following features disabled:

__delta_internal_row_index is used when:

Building Data Reader (With Partition Values)

buildReaderWithPartitionValues(
  sparkSession: SparkSession,
  dataSchema: StructType,
  partitionSchema: StructType,
  requiredSchema: StructType,
  filters: Seq[Filter],
  options: Map[String, String],
  hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]

buildReaderWithPartitionValues prepares the given schemas (i.e., dataSchema, partitionSchema and requiredSchema) before requesting the parent ParquetFileFormat to buildReaderWithPartitionValues.

buildReaderWithPartitionValues is part of the ParquetFileFormat (Spark SQL) abstraction.
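The prepare-then-delegate structure can be sketched as follows. This is a hypothetical simplification: schemas are modeled as sequences of column names, and the `physical_` prefix merely stands in for whatever translation prepareSchema applies.

```scala
// Hypothetical sketch of the delegation: prepare each schema, then hand
// the prepared schemas to the parent reader (schemas simplified to names).
type Schema = Seq[String]

def prepareSchemaSketch(schema: Schema): Schema =
  schema.map(name => s"physical_$name")  // stand-in for the real translation

def buildReaderSketch(
    dataSchema: Schema,
    partitionSchema: Schema,
    requiredSchema: Schema,
    parentBuildReader: (Schema, Schema, Schema) => String): String =
  parentBuildReader(
    prepareSchemaSketch(dataSchema),
    prepareSchemaSketch(partitionSchema),
    prepareSchemaSketch(requiredSchema))
```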

Preparing Schema

prepareSchema(
  inputSchema: StructType): StructType

prepareSchema creates a physical schema (for the inputSchema, the referenceSchema and the DeltaColumnMappingMode).
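The logical-to-physical translation can be sketched as follows. This is a hypothetical model, assuming (as in Delta Lake column mapping) that a column's physical name is kept alongside its logical name and differs from it when a mapping mode is in effect; the types here are illustrative, not the real ones.

```scala
// Hypothetical sketch: under column mapping, the physical name of a column
// differs from its logical name; without mapping, logical names are used as-is.
case class Field(logicalName: String, physicalName: Option[String])

sealed trait ColumnMappingMode
case object NoMapping extends ColumnMappingMode
case object NameMapping extends ColumnMappingMode

def toPhysicalSchema(schema: Seq[Field], mode: ColumnMappingMode): Seq[String] =
  mode match {
    case NoMapping   => schema.map(_.logicalName)
    case NameMapping => schema.map(f => f.physicalName.getOrElse(f.logicalName))
  }

val tableSchema = Seq(Field("my col!", Some("col-8d1f9f")))
toPhysicalSchema(tableSchema, NameMapping)  // the physical name is used
```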

supportFieldName

FileFormat
supportFieldName(
  name: String): Boolean

supportFieldName is part of the FileFormat (Spark SQL) abstraction.

supportFieldName is enabled (true) when either holds true:
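The shape of this check can be sketched as follows. This is a hypothetical model, assuming that column mapping lifts the default restriction and that the default restriction is parquet's usual rule disallowing the characters `" ,;{}()\n\t="` in field names; hedge both assumptions against the actual sources.

```scala
// Hypothetical sketch: with column mapping enabled, any field name is
// supported; otherwise parquet's default field-name restriction applies.
val invalidParquetChars = " ,;{}()\n\t=".toSet

def defaultSupportFieldName(name: String): Boolean =
  !name.exists(invalidParquetChars.contains)

def supportFieldNameSketch(columnMappingEnabled: Boolean, name: String): Boolean =
  columnMappingEnabled || defaultSupportFieldName(name)

supportFieldNameSketch(columnMappingEnabled = true, "my col!")   // supported
supportFieldNameSketch(columnMappingEnabled = false, "my col!")  // rejected (space)
```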

metadataSchemaFields

FileFormat
metadataSchemaFields: Seq[StructField]

metadataSchemaFields is part of the FileFormat (Spark SQL) abstraction.

Due to an issue in Spark SQL (to be reported), metadataSchemaFields removes row_index from the default metadataSchemaFields (Spark SQL).

ParquetFileFormat

All that the parent ParquetFileFormat does (when requested for metadataSchemaFields) is add the row_index column. In other words, DeltaParquetFileFormat reverts this column addition.
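The reversal amounts to a simple filter over the parent's metadata columns. A minimal sketch, with a hypothetical set of default metadata fields (`StructField` simplified to just a name):

```scala
// Hypothetical sketch: drop row_index from the parent's metadata columns,
// reverting ParquetFileFormat's addition.
case class StructField(name: String)  // simplified stand-in for Spark SQL's StructField

val parentMetadataFields = Seq(
  StructField("file_path"),   // illustrative default metadata columns
  StructField("file_name"),
  StructField("row_index"))   // the column added by ParquetFileFormat

val metadataSchemaFields =
  parentMetadataFields.filterNot(_.name == "row_index")
```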