
Configuration Properties

Configuration properties are a way to control features of Delta Lake for the whole cluster (SparkSession, to be precise) and hence for all tables created by the cluster.
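
For illustration, a minimal Scala sketch of how such a session-wide property can be set, either when the SparkSession is built or at runtime (the extension and catalog settings are the usual Delta-on-Spark setup; the property values chosen here are just examples):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("delta-config-demo")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  // a session-wide Delta Lake configuration property
  .config("spark.databricks.delta.autoCompact.enabled", "true")
  .getOrCreate()

// ...or adjusted later for the same session
spark.conf.set("spark.databricks.delta.history.metricsEnabled", "true")
```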

Table Properties

Unlike configuration properties, Table Properties are per table only.
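
As a contrast, a sketch of setting a table property on a hypothetical table named events (delta.logRetentionDuration is one of the standard Delta table properties); table properties are stored in the table metadata and travel with the table, regardless of the session configuration:

```scala
spark.sql("""
  ALTER TABLE events
  SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 30 days')
""")

spark.sql("SHOW TBLPROPERTIES events").show(truncate = false)
```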

spark.databricks.delta

Configuration properties use the spark.databricks.delta prefix.

alterLocation.bypassSchemaCheck

spark.databricks.delta.alterLocation.bypassSchemaCheck

Enables Alter Table Set Location on a Delta table to go through even if the Delta table in the new location has a different schema from the original Delta table.

Default: false

alterTable.changeColumn.checkExpressions

spark.databricks.delta.alterTable.changeColumn.checkExpressions (internal)

Given an ALTER TABLE CHANGE COLUMN command, checks whether Constraints or Generated Columns use expressions that reference this column (and so would be affected by the change and need to be updated along with it).

Turn this off when there is an issue with the expression-checking logic that prevents a valid column change from going through.

Default: true

autoCompact.enabled

spark.databricks.delta.autoCompact.enabled

Enables Auto Compaction on all writes across all delta tables in this session.

Default: false

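A short sketch (hypothetical path) of turning Auto Compaction on for a session and triggering it with a small append:

```scala
import spark.implicits._

// Enable Auto Compaction for every Delta write in this session
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

// Small files written by this append become candidates for compaction
Seq((1, "a"), (2, "b")).toDF("id", "value")
  .write.format("delta").mode("append").save("/tmp/delta/autocompact-demo")
```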

autoCompact.maxFileSize

spark.databricks.delta.autoCompact.maxFileSize

(internal) Target file size produced by Auto Compaction

Default: 128 * 1024 * 1024 (128MB)

Used when:

  • AutoCompactBase is requested to compact

autoCompact.minFileSize

spark.databricks.delta.autoCompact.minFileSize

(internal) Files which are smaller than this threshold (in bytes) will be grouped together and rewritten as larger files in Auto Compaction

Default: Half of spark.databricks.delta.autoCompact.maxFileSize

autoCompact.modifiedPartitionsOnly.enabled

spark.databricks.delta.autoCompact.modifiedPartitionsOnly.enabled

(internal) When enabled, Auto Compaction works only on the partitions modified by the delta transaction that triggers compaction.

Default: true

autoCompact.nonBlindAppend.enabled

spark.databricks.delta.autoCompact.nonBlindAppend.enabled

(internal) When enabled, Auto Compaction is only triggered by non-blind-append write transactions.

Default: false

changeDataFeed.unsafeBatchReadOnIncompatibleSchemaChanges.enabled

spark.databricks.delta.changeDataFeed.unsafeBatchReadOnIncompatibleSchemaChanges.enabled

(internal) Enables (unblocks) reading change data in batch (e.g. using table_changes()) on a Delta table with column mapping schema operations.

It is currently blocked due to potential data loss and schema confusion, and hence considered risky.

Default: false

checkLatestSchemaOnRead

spark.databricks.delta.checkLatestSchemaOnRead enables a check that ensures that users won't read corrupt data if the source schema changes in an incompatible way.

Default: true

Delta always tries to give users the latest version of table data without requiring them to call REFRESH TABLE or redefine their DataFrames when used in the context of streaming. The schema of the latest version of the table may, however, be incompatible with the schema at the time of DataFrame creation.

checkpoint.partSize

spark.databricks.delta.checkpoint.partSize

(internal) The limit at which checkpoint parallelization starts. Delta attempts to write a maximum of this many actions per checkpoint.

Default: (undefined) (and assumed 1)

Must be a positive long number

clusteredTable.enableClusteringTablePreview

spark.databricks.delta.clusteredTable.enableClusteringTablePreview

(internal) Controls Liquid Clustering

Default: false

commitInfo.enabled

spark.databricks.delta.commitInfo.enabled controls whether to log commit information into a Delta log.

Default: true

commitInfo.userMetadata

spark.databricks.delta.commitInfo.userMetadata is an arbitrary user-defined metadata to include in CommitInfo (requires spark.databricks.delta.commitInfo.enabled).

Default: (empty)
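
A sketch (hypothetical path and metadata string) of attaching user metadata to commits and reading it back; Delta also accepts a per-write DataFrameWriter option named userMetadata as an alternative to the session property:

```scala
import spark.implicits._

spark.conf.set("spark.databricks.delta.commitInfo.userMetadata", "backfill-job-42")

Seq((1, "a")).toDF("id", "value")
  .write.format("delta").mode("append").save("/tmp/delta/events")

// The metadata shows up in the userMetadata column of DESCRIBE HISTORY
spark.sql("DESCRIBE HISTORY delta.`/tmp/delta/events`")
  .select("version", "userMetadata")
  .show(truncate = false)
```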

commitLock.enabled

spark.databricks.delta.commitLock.enabled (internal) controls whether or not to use a lock on a delta table at transaction commit.

Default: (undefined)

commitValidation.enabled

spark.databricks.delta.commitValidation.enabled (internal) controls whether to perform validation checks before commit or not

Default: true

convert.metadataCheck.enabled

spark.databricks.delta.convert.metadataCheck.enabled enables validation during convert to delta: if the catalog table's properties differ from the Delta table's configuration, the conversion errors out.

If disabled, the two configurations are merged with the same semantics as update and merge

Default: true

delete.deletionVectors.persistent

spark.databricks.delta.delete.deletionVectors.persistent

Enables persistent Deletion Vectors in DELETE command

Default: true

dummyFileManager.numOfFiles

spark.databricks.delta.dummyFileManager.numOfFiles (internal) controls how many dummy files to write in DummyFileManager

Default: 3

dummyFileManager.prefix

spark.databricks.delta.dummyFileManager.prefix (internal) is the file prefix to use in DummyFileManager

Default: .s3-optimization-

dynamicPartitionOverwrite.enabled

spark.databricks.delta.dynamicPartitionOverwrite.enabled

(internal) Enables Dynamic Partition Overwrite (whether to overwrite partitions dynamically when partitionOverwriteMode is dynamic in either the SQL configuration or a DataFrameWriter option). When disabled, partitionOverwriteMode will be ignored.

Default: true

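A sketch (hypothetical, date-partitioned table) of requesting dynamic partition overwrite through the DataFrameWriter option; it only takes effect while this property stays enabled:

```scala
import spark.implicits._

val updates = Seq(("2024-01-02", 10L)).toDF("date", "count")

updates.write
  .format("delta")
  .mode("overwrite")
  .option("partitionOverwriteMode", "dynamic")  // replace only the partitions present in `updates`
  .save("/tmp/delta/daily_counts")
```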

history.maxKeysPerList

spark.databricks.delta.history.maxKeysPerList

(internal) How many commits to list when performing a parallel search

Default: 1000

The default is the maximum number of keys returned by S3 per ListObjectsV2 call. Microsoft Azure can return up to 5000 blobs (including all BlobPrefix elements) in a single List Blobs API call, so the S3 limit of 1000 is the binding one and hence the default.

history.metricsEnabled

spark.databricks.delta.history.metricsEnabled

Enables metrics reporting in DESCRIBE HISTORY (CommitInfo will record the operation metrics when an OptimisticTransactionImpl is committed).

Requires spark.databricks.delta.commitInfo.enabled configuration property to be enabled

Default: true

GitHub Commit

The feature was added as part of [SC-24567][DELTA] Add additional metrics to Describe Delta History commit.
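
A sketch (hypothetical table path) of inspecting the recorded operation metrics:

```scala
spark.conf.set("spark.databricks.delta.history.metricsEnabled", "true")

spark.sql("DESCRIBE HISTORY delta.`/tmp/delta/events`")
  .select("version", "operation", "operationMetrics")
  .show(truncate = false)
```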

import.batchSize.schemaInference

spark.databricks.delta.import.batchSize.schemaInference (internal) is the number of files per batch for schema inference during import.

Default: 1000000

import.batchSize.statsCollection

spark.databricks.delta.import.batchSize.statsCollection (internal) is the number of files per batch for stats collection during import.

Default: 50000

io.skipping.mdc.addNoise

spark.databricks.io.skipping.mdc.addNoise (internal) controls whether or not to add a random byte as a suffix to the interleaved bits when computing the Z-order values for MDC. This can help deal with skew, but may have a negative impact on overall min/max skipping effectiveness.

Default: true

Used when:

  • SpaceFillingCurveClustering is requested to cluster

io.skipping.mdc.rangeId.max

spark.databricks.io.skipping.mdc.rangeId.max (internal) controls the domain of rangeId values to be interleaved. The bigger, the better granularity, but at the expense of performance (more data gets sampled).

Default: 1000

Must be greater than 1

Used when:

  • SpaceFillingCurveClustering is requested to cluster

io.skipping.stringPrefixLength

spark.databricks.io.skipping.stringPrefixLength (internal) The length of the prefix of string columns to store in the data skipping index

Default: 32

lastCommitVersionInSession

spark.databricks.delta.lastCommitVersionInSession is the version of the last commit made in the SparkSession for any delta table (after OptimisticTransactionImpl is done with doCommit or DeltaCommand with commitLarge)

Default: (undefined)

loadFileSystemConfigsFromDataFrameOptions

spark.databricks.delta.loadFileSystemConfigsFromDataFrameOptions (internal) controls whether to load file systems configs provided in DataFrameReader or DataFrameWriter options when calling DataFrameReader.load/DataFrameWriter.save using a Delta table path.

Not supported for DataFrameReader.table and DataFrameWriter.saveAsTable

Default: true

maxCommitAttempts

spark.databricks.delta.maxCommitAttempts (internal) is the maximum number of commit attempts to try for a single commit before failing

Default: 10000000

maxSnapshotLineageLength

spark.databricks.delta.maxSnapshotLineageLength (internal) is the maximum lineage length of a Snapshot before Delta forces to build a Snapshot from scratch

Default: 50

merge.materializeSource

spark.databricks.delta.merge.materializeSource

(internal) When to materialize the source plan during MERGE execution

  • all: the source is always materialized
  • auto: the source is not materialized unless it is non-deterministic
  • none: the source is never materialized

Default: auto

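A sketch (hypothetical target and source tables) of forcing source materialization, e.g. when the source is known to be non-deterministic:

```scala
spark.conf.set("spark.databricks.delta.merge.materializeSource", "all")

spark.sql("""
  MERGE INTO target t
  USING source s
  ON t.id = s.id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")
```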

merge.materializeSource.maxAttempts

spark.databricks.delta.merge.materializeSource.maxAttempts

How many times to retry execution of the MERGE command when the data (an RDD block) of the materialized source RDD is lost

Default: 4

merge.materializeSource.rddStorageLevel

spark.databricks.delta.merge.materializeSource.rddStorageLevel

(internal) What StorageLevel to use to persist the source RDD

Default: DISK_ONLY

merge.materializeSource.rddStorageLevelRetry

spark.databricks.delta.merge.materializeSource.rddStorageLevelRetry

(internal) What StorageLevel to use to persist the source RDD when MERGE is retried

Default: DISK_ONLY_2

merge.maxInsertCount

spark.databricks.delta.merge.maxInsertCount (internal) is the maximum row count of inserts in each MERGE execution

Default: 10000L

merge.optimizeInsertOnlyMerge.enabled

spark.databricks.delta.merge.optimizeInsertOnlyMerge.enabled

(internal) Enables extra optimization for insert-only merges by avoiding rewriting old files and just inserting new files

Default: true

merge.optimizeMatchedOnlyMerge.enabled

spark.databricks.delta.merge.optimizeMatchedOnlyMerge.enabled

(internal) Enables optimization of matched-only merges to use a RIGHT OUTER join (instead of a FULL OUTER join) while writing out all merge changes

Default: true

merge.repartitionBeforeWrite.enabled

spark.databricks.delta.merge.repartitionBeforeWrite.enabled

(internal) Enables repartitioning of the merge output (by the partition columns of the target table, if partitioned) before writing the data out

Default: true

optimize.maxFileSize

spark.databricks.delta.optimize.maxFileSize

(internal) Target file size produced by OPTIMIZE command.

Default: 1024 * 1024 * 1024 (1GB)

Used when:

  • OptimizeExecutor is requested to optimize
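
A sketch (hypothetical path and target size) of lowering the target file size for a single session before running OPTIMIZE:

```scala
// 256 MB instead of the default 1 GB (illustrative value)
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", (256L * 1024 * 1024).toString)

spark.sql("OPTIMIZE delta.`/tmp/delta/events`").show(truncate = false)
```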

optimize.maxThreads

spark.databricks.delta.optimize.maxThreads (internal) Maximum number of parallel jobs allowed in the OPTIMIZE command. Increasing the maximum number of parallel jobs allows the OPTIMIZE command to run faster, but increases the job management overhead on the Spark driver.

Default: 15

Used when:

  • OptimizeExecutor is requested to optimize

optimize.minFileSize

spark.databricks.delta.optimize.minFileSize

(internal) Files which are smaller than this threshold (in bytes) will be grouped together and rewritten as larger files by the OPTIMIZE command.

Default: 1024 * 1024 * 1024 (1GB)

Used when:

  • OptimizeExecutor is requested to optimize

optimize.repartition.enabled

spark.databricks.delta.optimize.repartition.enabled

(internal) Use Dataset.repartition(1) instead of Dataset.coalesce(1) to merge small files. Dataset.coalesce(1) executes with only one task, so when there are many tiny files within a bin (e.g. 1000 files of 50MB each), they cannot be compacted any faster by adding executors. Dataset.repartition(1), on the other hand, incurs a shuffle stage, but the job can be distributed.

Default: false

optimize.zorder.checkStatsCollection.enabled

spark.databricks.delta.optimize.zorder.checkStatsCollection.enabled

(internal) Controls whether to check that column statistics are available for the zOrderBy columns of the OPTIMIZE command

Default: true

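A sketch (hypothetical path and columns) of a Z-order OPTIMIZE run; with this check enabled, the command requires column statistics for the zOrderBy columns:

```scala
spark.sql("OPTIMIZE delta.`/tmp/delta/events` ZORDER BY (eventType, eventTime)")
  .show(truncate = false)
```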

partitionColumnValidity.enabled

spark.databricks.delta.partitionColumnValidity.enabled (internal) enables validation of the partition column names (just like the data columns)

Default: true

properties.defaults.minReaderVersion

spark.databricks.delta.properties.defaults.minReaderVersion is the default reader protocol version to create new tables with, unless a feature that requires a higher version for correctness is enabled.

Default: 1

Available values: 1

properties.defaults.minWriterVersion

spark.databricks.delta.properties.defaults.minWriterVersion is the default writer protocol version to create new tables with, unless a feature that requires a higher version for correctness is enabled.

Default: 2

Available values: 1, 2, 3

replaceWhere.constraintCheck.enabled

spark.databricks.delta.replaceWhere.constraintCheck.enabled

Controls whether replaceWhere on an arbitrary expression and arbitrary columns enforces the constraint, i.e. replaces the target table only when all the rows in the source DataFrame match that expression.

If disabled, the constraint check is skipped and the target is replaced with all the rows from the new DataFrame.

Default: true

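A sketch (hypothetical DataFrame and path) of a selective overwrite; with the constraint check enabled, the write fails if any row of the DataFrame falls outside the replaceWhere predicate:

```scala
import spark.implicits._

val january = Seq(("2024-01-02", 10L)).toDF("date", "count")

january.write
  .format("delta")
  .mode("overwrite")
  .option("replaceWhere", "date >= '2024-01-01' AND date < '2024-02-01'")
  .save("/tmp/delta/daily_counts")
```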

retentionDurationCheck.enabled

spark.databricks.delta.retentionDurationCheck.enabled adds a check preventing users from running vacuum with a very short retention period, which may end up corrupting a Delta log.

Default: true
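
A sketch (hypothetical path) of what disabling the check allows; use with care, since vacuuming with a short retention can delete files still needed by readers or concurrent writers:

```scala
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

// Would be rejected with the check enabled (retention below the default 7 days)
spark.sql("VACUUM delta.`/tmp/delta/events` RETAIN 1 HOURS")
```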

sampling.enabled

spark.databricks.delta.sampling.enabled (internal) enables sample-based estimation

Default: false

schema.autoMerge.enabled

spark.databricks.delta.schema.autoMerge.enabled

Enables schema merging on appends (UPDATEs) and overwrites (INSERTs).

Default: false

Equivalent DataFrame option: mergeSchema

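A sketch (hypothetical DataFrame with an extra column, hypothetical path) showing the session-wide setting and the equivalent per-write option:

```scala
import spark.implicits._

// Session-wide schema evolution ...
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

// ... or per write with the equivalent DataFrameWriter option
Seq((1, "a", "new")).toDF("id", "value", "extraColumn")
  .write
  .format("delta")
  .option("mergeSchema", "true")
  .mode("append")
  .save("/tmp/delta/events")
```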

schema.removeSparkInternalMetadata

spark.databricks.delta.schema.removeSparkInternalMetadata

(internal) Whether to remove leaked Spark internal metadata from the table schema before returning it to Spark. Such metadata might have been stored unintentionally in tables created by older Spark versions.

Default: true

schema.typeCheck.enabled

spark.databricks.delta.schema.typeCheck.enabled (internal) controls whether to check unsupported data types while updating a table schema

Disabling this flag may allow users to create unsupported Delta tables and should only be used when trying to read/write legacy tables.

Default: true

snapshotIsolation.enabled

spark.databricks.delta.snapshotIsolation.enabled (internal) controls whether queries on Delta tables are guaranteed to have snapshot isolation

Default: true

snapshotPartitions

spark.databricks.delta.snapshotPartitions (internal) is the number of partitions to use for state reconstruction (when building a snapshot of a Delta table).

Default: 50

stalenessLimit

spark.databricks.delta.stalenessLimit (in millis) allows you to query the last loaded state of the Delta table without blocking on a table update. You can use this configuration to reduce the latency on queries when up-to-date results are not a requirement. Table updates will be scheduled on a separate scheduler pool in a FIFO queue, and will share cluster resources fairly with your query. If a table hasn't updated past this time limit, we will block on a synchronous state update before running the query.

Default: 0 (no tables can be stale)

state.corruptionIsFatal

spark.databricks.delta.state.corruptionIsFatal (internal) throws a fatal error when the recreated Delta State doesn't match committed checksum file

Default: true

stateReconstructionValidation.enabled

spark.databricks.delta.stateReconstructionValidation.enabled (internal) controls whether to perform validation checks on the reconstructed state

Default: true

stats.collect

spark.databricks.delta.stats.collect

(internal) Enables statistics to be collected while writing files into a delta table

Default: true

stats.collect.using.tableSchema

spark.databricks.delta.stats.collect.using.tableSchema

(internal) When collecting stats while writing files into Delta table (with spark.databricks.delta.stats.collect enabled), whether to use the table schema (true) or the DataFrame schema (false) as the stats collection schema.

Default: true

stats.limitPushdown.enabled

spark.databricks.delta.stats.limitPushdown.enabled (internal) enables using the limit clause and file statistics to prune files before they are collected to the driver

Default: true

stats.localCache.maxNumFiles

spark.databricks.delta.stats.localCache.maxNumFiles (internal) is the maximum number of files for a table to be considered a delta small table. Some metadata operations (such as using data skipping) are optimized for small tables using driver local caching and local execution.

Default: 2000

stats.skipping

spark.databricks.delta.stats.skipping (internal) enables Data Skipping

Default: true

Used when:

  • DataSkippingReaderBase is requested for the files to scan
  • PrepareDeltaScanBase logical optimization is executed

streaming.schemaTracking.enabled

spark.databricks.delta.streaming.schemaTracking.enabled

(internal) Controls whether the delta streaming source can support non-additive schema evolution for operations such as rename or drop column on column mapping enabled tables

Default: true

streaming.unsafeReadOnIncompatibleColumnMappingSchemaChanges.enabled

spark.databricks.delta.streaming.unsafeReadOnIncompatibleColumnMappingSchemaChanges.enabled

(internal) Controls (possibly unsafe) streaming read on delta tables with column mapping schema operations (e.g. rename or drop column) that could lead to data loss and schema confusion.

Default: false

timeTravel.resolveOnIdentifier.enabled

spark.databricks.delta.timeTravel.resolveOnIdentifier.enabled (internal) controls whether to resolve patterns such as @v123 and @yyyyMMddHHmmssSSS in path identifiers as time travel nodes.

Default: true

vacuum.parallelDelete.enabled

spark.databricks.delta.vacuum.parallelDelete.enabled enables parallelizing the deletion of files during vacuum command.

Default: false

Enabling this may result in hitting rate limits on some storage backends. When enabled, parallelization is controlled by the default number of shuffle partitions.
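
A sketch (hypothetical path):

```scala
spark.conf.set("spark.databricks.delta.vacuum.parallelDelete.enabled", "true")

// File deletions are distributed across spark.sql.shuffle.partitions tasks
spark.sql("VACUUM delta.`/tmp/delta/events`")
```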

write.txnAppId

spark.databricks.delta.write.txnAppId

(internal) The user-defined application ID a write will be committed with. If specified, spark.databricks.delta.write.txnVersion has to be set, too.

Default: (undefined)

write.txnVersion

spark.databricks.delta.write.txnVersion

(internal) The user-defined transaction version a write will be committed with. If specified, spark.databricks.delta.write.txnAppId has to be set, too. To ensure idempotency, txnVersions across different writes need to be monotonically increasing.

Default: (undefined)

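A sketch (hypothetical application ID, version, and path) of an idempotent write; if a commit with the same (txnAppId, txnVersion) pair has already been recorded, the write is skipped:

```scala
import spark.implicits._

spark.conf.set("spark.databricks.delta.write.txnAppId", "nightly-ingest")
spark.conf.set("spark.databricks.delta.write.txnVersion", "42")  // must increase monotonically across writes

Seq((1, "a")).toDF("id", "value")
  .write.format("delta").mode("append").save("/tmp/delta/events")
```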

write.txnVersion.autoReset.enabled

spark.databricks.delta.write.txnVersion.autoReset.enabled

(internal) When enabled, automatically resets spark.databricks.delta.write.txnVersion after every write

If both txnAppId and txnVersion come from the session configuration (based on write.txnAppId and write.txnVersion, respectively), spark.databricks.delta.write.txnVersion is reset (unset) after the current transaction is skipped, as a safety measure to prevent data loss should the user forget to reset txnVersion manually

Default: false

spark.delta.logStore.class

The fully-qualified class name of a LogStore

Default: HDFSLogStore

Used when:

  • LogStoreProvider is requested for a LogStore
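
A sketch of overriding the LogStore at session startup; the class name below is the S3 single-driver log store that ships with Delta (assumed here for illustration; check the documentation of your Delta version for the implementation matching your storage):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.delta.logStore.class",
    "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore")
  .getOrCreate()
```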