Configuration Properties

Most of the configuration properties described below share the spark.databricks.delta prefix.
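
Unless noted otherwise, these are Spark SQL configuration properties and can be set per session. A minimal sketch (the property shown is just one example from this page):

    // In spark-shell with Delta Lake on the classpath, `spark` is the active SparkSession.
    // Programmatically, for the current session:
    spark.conf.set("spark.databricks.delta.optimize.maxThreads", "20")

    // The SQL equivalent:
    spark.sql("SET spark.databricks.delta.optimize.maxThreads = 20")

    // Read the current value back:
    println(spark.conf.get("spark.databricks.delta.optimize.maxThreads"))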

alterLocation.bypassSchemaCheck

spark.databricks.delta.alterLocation.bypassSchemaCheck

Enables ALTER TABLE SET LOCATION on a Delta table to go through even if the Delta table in the new location has a schema different from that of the original Delta table.

Default: false
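
For illustration, a hedged sketch of the command this flag affects; the table name and location are hypothetical:

    // Hypothetical table and location. With the flag left at its default (false),
    // the statement fails if the Delta table at the new location has a different schema.
    spark.conf.set("spark.databricks.delta.alterLocation.bypassSchemaCheck", "true")
    spark.sql("ALTER TABLE events SET LOCATION '/delta/events_v2'")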

alterTable.changeColumn.checkExpressions

spark.databricks.delta.alterTable.changeColumn.checkExpressions (internal)

Given an ALTER TABLE CHANGE COLUMN command, checks whether constraints or generated columns use expressions that reference this column (and so would be affected by the change and should be changed along with it).

Turn this off when an issue in the expression-checking logic prevents a valid column change from going through.

Default: true

checkLatestSchemaOnRead

spark.databricks.delta.checkLatestSchemaOnRead enables a check that ensures that users won't read corrupt data if the source schema changes in an incompatible way.

Default: true

Delta always tries to give users the latest version of the table data without requiring a call to REFRESH TABLE or a redefinition of their DataFrames when used in the context of streaming. There is a possibility that the schema of the latest version of the table is incompatible with the schema at the time of DataFrame creation.

checkpoint.partSize

spark.databricks.delta.checkpoint.partSize

(internal) The limit at which checkpoint parallelization starts. Delta attempts to write a maximum of this many actions per checkpoint.

Default: (undefined) (and assumed 1)

Must be a positive long number

commitInfo.enabled

spark.databricks.delta.commitInfo.enabled controls whether to log commit information into a Delta log.

Default: true

commitInfo.userMetadata

spark.databricks.delta.commitInfo.userMetadata is arbitrary user-defined metadata to include in CommitInfo (requires spark.databricks.delta.commitInfo.enabled).

Default: (empty)
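
A minimal sketch of attaching user metadata to commits and reading it back with DESCRIBE HISTORY; the table path and metadata string are hypothetical:

    // Every commit made in this session records the string in its CommitInfo
    // (requires spark.databricks.delta.commitInfo.enabled, true by default).
    spark.conf.set("spark.databricks.delta.commitInfo.userMetadata", "nightly-backfill run=1234")

    spark.range(10).write.format("delta").mode("append").save("/tmp/delta/events")

    // The userMetadata column of DESCRIBE HISTORY shows the value.
    spark.sql("DESCRIBE HISTORY delta.`/tmp/delta/events`")
      .select("version", "operation", "userMetadata")
      .show(truncate = false)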

commitLock.enabled

spark.databricks.delta.commitLock.enabled (internal) controls whether to use a lock on a Delta table at transaction commit.

Default: (undefined)

commitValidation.enabled

spark.databricks.delta.commitValidation.enabled (internal) controls whether to perform validation checks before a commit.

Default: true

convert.metadataCheck.enabled

spark.databricks.delta.convert.metadataCheck.enabled enables validation during CONVERT TO DELTA: if there is a difference between the catalog table's properties and the Delta table's configuration, the conversion fails with an error.

If disabled, the two configurations are merged with the same semantics as update and merge.

Default: true

dummyFileManager.numOfFiles

spark.databricks.delta.dummyFileManager.numOfFiles (internal) controls how many dummy files to write in DummyFileManager

Default: 3

dummyFileManager.prefix

spark.databricks.delta.dummyFileManager.prefix (internal) is the file prefix to use in DummyFileManager

Default: .s3-optimization-

history.maxKeysPerList

spark.databricks.delta.history.maxKeysPerList (internal) controls how many commits to list when performing a parallel search.

The default is the maximum number of keys returned by S3 per list call. Azure can return up to 5000, so 1000 is used as the common limit.

Default: 1000

history.metricsEnabled

spark.databricks.delta.history.metricsEnabled

Enables metrics reporting in DESCRIBE HISTORY (CommitInfo records the operation metrics when an OptimisticTransactionImpl is committed).

Requires the spark.databricks.delta.commitInfo.enabled configuration property to be enabled.

Default: true

GitHub commit: the feature was added as part of the [SC-24567][DELTA] Add additional metrics to Describe Delta History commit.

import.batchSize.schemaInference

spark.databricks.delta.import.batchSize.schemaInference (internal) is the number of files per batch for schema inference during import.

Default: 1000000

import.batchSize.statsCollection

spark.databricks.delta.import.batchSize.statsCollection (internal) is the number of files per batch for stats collection during import.

Default: 50000

io.skipping.mdc.addNoise

spark.databricks.io.skipping.mdc.addNoise (internal) controls whether or not to add a random byte as a suffix to the interleaved bits when computing the Z-order values for MDC. This can help deal with skew, but may have a negative impact on overall min/max skipping effectiveness.

Default: true

Used when:

  • SpaceFillingCurveClustering is requested to cluster

io.skipping.mdc.rangeId.max

spark.databricks.io.skipping.mdc.rangeId.max (internal) controls the domain of rangeId values to be interleaved. The bigger the value, the better the granularity, but at the expense of performance (more data gets sampled).

Default: 1000

Must be greater than 1

Used when:

  • SpaceFillingCurveClustering is requested to cluster
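
These multi-dimensional clustering properties take effect when Z-ordering a table with OPTIMIZE. A hedged sketch (the table path and column names are hypothetical):

    // OPTIMIZE ... ZORDER BY triggers SpaceFillingCurveClustering, which reads
    // the io.skipping.mdc.* properties above.
    spark.sql("OPTIMIZE delta.`/tmp/delta/events` ZORDER BY (eventType, eventDate)")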

io.skipping.stringPrefixLength

spark.databricks.io.skipping.stringPrefixLength (internal) is the length of the prefix of string columns to store in the data skipping index.

Default: 32

lastCommitVersionInSession

spark.databricks.delta.lastCommitVersionInSession is the version of the last commit made in the SparkSession for any Delta table (after OptimisticTransactionImpl is done with doCommit or DeltaCommand with commitLarge)

Default: (undefined)
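
For illustration, the property can be read back after a write; the table path is hypothetical:

    spark.range(5).write.format("delta").mode("append").save("/tmp/delta/events")

    // Holds the version of the last commit made in this SparkSession, if any.
    println(spark.conf.getOption("spark.databricks.delta.lastCommitVersionInSession"))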

loadFileSystemConfigsFromDataFrameOptions

spark.databricks.delta.loadFileSystemConfigsFromDataFrameOptions (internal) controls whether to load file system configs provided in DataFrameReader or DataFrameWriter options when calling DataFrameReader.load/DataFrameWriter.save using a Delta table path.

Not supported for DataFrameReader.table and DataFrameWriter.saveAsTable

Default: true

maxCommitAttempts

spark.databricks.delta.maxCommitAttempts (internal) is the maximum number of commit attempts to try for a single commit before failing

Default: 10000000

maxSnapshotLineageLength

spark.databricks.delta.maxSnapshotLineageLength (internal) is the maximum lineage length of a Snapshot before Delta forces a Snapshot to be built from scratch

Default: 50

merge.materializeSource

spark.databricks.delta.merge.materializeSource

(internal) When to materialize the source plan during MERGE execution

  • all: the source is always materialized
  • auto: the source is not materialized unless non-deterministic
  • none: the source is never materialized

Default: auto
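
For example, forcing source materialization before a MERGE; the table names are hypothetical:

    // Always materialize the MERGE source, regardless of determinism.
    spark.conf.set("spark.databricks.delta.merge.materializeSource", "all")

    spark.sql("""
      MERGE INTO events t
      USING updates s
      ON t.id = s.id
      WHEN MATCHED THEN UPDATE SET *
      WHEN NOT MATCHED THEN INSERT *
    """)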

merge.materializeSource.maxAttempts

spark.databricks.delta.merge.materializeSource.maxAttempts

How many times to retry execution of the MERGE command when the data (an RDD block) of the materialized source RDD is lost

Default: 4

merge.materializeSource.rddStorageLevel

spark.databricks.delta.merge.materializeSource.rddStorageLevel

(internal) What StorageLevel to use to persist the source RDD

Default: DISK_ONLY

merge.materializeSource.rddStorageLevelRetry

spark.databricks.delta.merge.materializeSource.rddStorageLevelRetry

(internal) What StorageLevel to use to persist the source RDD when MERGE is retried

Default: DISK_ONLY_2

merge.maxInsertCount

spark.databricks.delta.merge.maxInsertCount (internal) is the maximum row count of inserts in each MERGE execution

Default: 10000L

merge.optimizeInsertOnlyMerge.enabled

spark.databricks.delta.merge.optimizeInsertOnlyMerge.enabled

(internal) Enables extra optimization for insert-only merges by avoiding rewriting old files and just inserting new files

Default: true

merge.optimizeMatchedOnlyMerge.enabled

spark.databricks.delta.merge.optimizeMatchedOnlyMerge.enabled

(internal) Enables optimization of matched-only merges to use a RIGHT OUTER join (instead of a FULL OUTER join) while writing out all merge changes

Default: true

merge.repartitionBeforeWrite.enabled

spark.databricks.delta.merge.repartitionBeforeWrite.enabled

(internal) Enables repartitioning of the merge output (by the partition columns of the target table, if partitioned) before writing the data (DataFrame) out

Default: true

optimize.maxFileSize

spark.databricks.delta.optimize.maxFileSize (internal) Target file size produced by the OPTIMIZE command.

Default: 1024 * 1024 * 1024 (1 GB)

Used when:

  • OptimizeExecutor is requested to optimize

optimize.maxThreads

spark.databricks.delta.optimize.maxThreads (internal) Maximum number of parallel jobs allowed in the OPTIMIZE command. Increasing the maximum number of parallel jobs allows the OPTIMIZE command to run faster, but increases the job-management overhead on the Spark driver.

Default: 15

Used when:

  • OptimizeExecutor is requested to optimize

optimize.minFileSize

spark.databricks.delta.optimize.minFileSize (internal) Files which are smaller than this threshold (in bytes) will be grouped together and rewritten as larger files by the OPTIMIZE command.

Default: 1024 * 1024 * 1024 (1GB)

Used when:

  • OptimizeExecutor is requested to optimize
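
A sketch of tuning compaction thresholds before OPTIMIZE; the sizes and table path are hypothetical:

    // Group files smaller than 256 MB and rewrite them as files of up to roughly 512 MB.
    spark.conf.set("spark.databricks.delta.optimize.minFileSize", 256L * 1024 * 1024)
    spark.conf.set("spark.databricks.delta.optimize.maxFileSize", 512L * 1024 * 1024)

    spark.sql("OPTIMIZE delta.`/tmp/delta/events`")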

optimize.zorder.checkStatsCollection.enabled

spark.databricks.delta.optimize.zorder.checkStatsCollection.enabled (internal) Controls whether to check that column statistics are available for the zOrderBy columns of the OPTIMIZE command.

Default: true

partitionColumnValidity.enabled

spark.databricks.delta.partitionColumnValidity.enabled (internal) enables validation of the partition column names (just like the data columns)

Default: true

properties.defaults.minReaderVersion

spark.databricks.delta.properties.defaults.minReaderVersion is the default reader protocol version to create new tables with, unless a feature that requires a higher version for correctness is enabled.

Default: 1

Available values: 1

properties.defaults.minWriterVersion

spark.databricks.delta.properties.defaults.minWriterVersion is the default writer protocol version to create new tables with, unless a feature that requires a higher version for correctness is enabled.

Default: 2

Available values: 1, 2, 3

replaceWhere.constraintCheck.enabled

spark.databricks.delta.replaceWhere.constraintCheck.enabled controls whether replaceWhere on an arbitrary expression and arbitrary columns enforces the constraint, i.e. the target table is replaced only when all the rows in the source DataFrame match that constraint.

If disabled, the constraint check is skipped and the target is replaced with all the rows from the new DataFrame.

Default: true
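
For illustration, a replaceWhere overwrite whose source rows are validated against the predicate while this property is enabled; the path, column, and predicate are hypothetical:

    import org.apache.spark.sql.functions.col

    val updates = spark.read.format("delta").load("/tmp/delta/events")
      .filter(col("eventDate") >= "2024-01-01")

    // With the constraint check enabled (default), every row in `updates` must satisfy
    // the replaceWhere predicate or the write fails.
    updates.write
      .format("delta")
      .mode("overwrite")
      .option("replaceWhere", "eventDate >= '2024-01-01'")
      .save("/tmp/delta/events")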

retentionDurationCheck.enabled

spark.databricks.delta.retentionDurationCheck.enabled adds a check preventing users from running VACUUM with a very short retention period, which may end up corrupting the Delta log.

Default: true

sampling.enabled

spark.databricks.delta.sampling.enabled (internal) enables sample-based estimation

Default: false

schema.autoMerge.enabled

spark.databricks.delta.schema.autoMerge.enabled

Enables schema merging on appends (UPDATEs) and overwrites (INSERTs).

Default: false

Equivalent DataFrame option: mergeSchema
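
A minimal sketch of the two equivalent ways to allow schema evolution on write; the table path and new column are hypothetical:

    import org.apache.spark.sql.functions.lit

    // A DataFrame with a column that does not exist in the target table yet.
    val df = spark.range(3).withColumn("note", lit("new column"))

    // Per write, via the equivalent DataFrame option:
    df.write.format("delta").mode("append").option("mergeSchema", "true").save("/tmp/delta/events")

    // Or for the whole session:
    spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")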

schema.typeCheck.enabled

spark.databricks.delta.schema.typeCheck.enabled (internal) controls whether to check unsupported data types while updating a table schema

Disabling this flag may allow users to create unsupported Delta tables and should only be used when trying to read/write legacy tables.

Default: true

snapshotIsolation.enabled

spark.databricks.delta.snapshotIsolation.enabled (internal) controls whether queries on Delta tables are guaranteed to have snapshot isolation

Default: true

snapshotPartitions

spark.databricks.delta.snapshotPartitions (internal) is the number of partitions to use for state reconstruction (when building a snapshot of a Delta table).

Default: 50

stalenessLimit

spark.databricks.delta.stalenessLimit (in millis) allows you to query the last loaded state of the Delta table without blocking on a table update. You can use this configuration to reduce the latency on queries when up-to-date results are not a requirement. Table updates will be scheduled on a separate scheduler pool in a FIFO queue, and will share cluster resources fairly with your query. If a table hasn't updated past this time limit, we will block on a synchronous state update before running the query.

Default: 0 (no tables can be stale)

state.corruptionIsFatal

spark.databricks.delta.state.corruptionIsFatal (internal) throws a fatal error when the recreated Delta state doesn't match the committed checksum file

Default: true

stateReconstructionValidation.enabled

spark.databricks.delta.stateReconstructionValidation.enabled (internal) controls whether to perform validation checks on the reconstructed state

Default: true

stats.collect

spark.databricks.delta.stats.collect

(internal) Enables statistics to be collected while writing files into a Delta table

Default: true

stats.collect.using.tableSchema

spark.databricks.delta.stats.collect.using.tableSchema

(internal) When collecting stats while writing files into Delta table (with spark.databricks.delta.stats.collect enabled), whether to use the table schema (true) or the DataFrame schema (false) as the stats collection schema.

Default: true

stats.limitPushdown.enabled

spark.databricks.delta.stats.limitPushdown.enabled (internal) enables using the limit clause and file statistics to prune files before they are collected to the driver

Default: true

stats.localCache.maxNumFiles

spark.databricks.delta.stats.localCache.maxNumFiles (internal) is the maximum number of files for a table to be considered a delta small table. Some metadata operations (such as using data skipping) are optimized for small tables using driver local caching and local execution.

Default: 2000

stats.skipping

spark.databricks.delta.stats.skipping (internal) enables Data Skipping

Default: true

Used when:

  • DataSkippingReaderBase is requested for the files to scan
  • PrepareDeltaScanBase logical optimization is executed

timeTravel.resolveOnIdentifier.enabled

spark.databricks.delta.timeTravel.resolveOnIdentifier.enabled (internal) controls whether to resolve patterns such as @v123 and @yyyyMMddHHmmssSSS in path identifiers as time travel nodes.

Default: true
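
When enabled, a version suffix in a path identifier is resolved as time travel. A hedged sketch (the table path is hypothetical and the supported identifier forms may vary by Delta version):

    // Version 123 of the table, resolved from the @v123 suffix of the path identifier.
    spark.sql("SELECT * FROM delta.`/tmp/delta/events@v123`").show()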

vacuum.parallelDelete.enabled

spark.databricks.delta.vacuum.parallelDelete.enabled enables parallelizing the deletion of files during the VACUUM command.

Default: false

Enabling this may result in hitting rate limits on some storage backends. When enabled, parallelization is controlled by the default number of shuffle partitions.
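
For illustration, enabling parallel deletes before VACUUM; the table path is hypothetical and the retention shown is the default 7 days:

    // Delete eligible files in parallel; parallelism follows the default number
    // of shuffle partitions (spark.sql.shuffle.partitions).
    spark.conf.set("spark.databricks.delta.vacuum.parallelDelete.enabled", "true")

    spark.sql("VACUUM delta.`/tmp/delta/events` RETAIN 168 HOURS")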

spark.delta.logStore.class

The fully-qualified class name of a LogStore

Default: HDFSLogStore

Used when:

  • LogStoreProvider is requested for a LogStore
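
A sketch of configuring a non-default LogStore at session creation; the S3SingleDriverLogStore class name is the one bundled with Delta's storage layer in some releases and may differ in yours:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("delta-logstore-demo")
      // Use an S3-oriented LogStore instead of the default HDFSLogStore.
      .config("spark.delta.logStore.class",
        "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore")
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()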