DeltaSQLConf — spark.databricks.delta Configuration Properties

DeltaSQLConf contains the spark.databricks.delta-prefixed configuration properties that configure the behaviour of Delta Lake.
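
These are regular Spark SQL configuration properties, so they can be set when building the SparkSession, changed on a running session with spark.conf.set, or set with the SET SQL command. A minimal Scala sketch follows; the extension and catalog settings are the usual open-source Delta Lake setup, and the property values and application name are illustrative only:

  import org.apache.spark.sql.SparkSession

  // A session with Delta Lake support and one Delta-specific property set at startup.
  val spark = SparkSession.builder()
    .appName("delta-sqlconf-demo")
    .master("local[*]")  // local master just for this sketch
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.databricks.delta.schema.autoMerge.enabled", "true")
    .getOrCreate()

  // Properties can also be changed on a running session...
  spark.conf.set("spark.databricks.delta.history.metricsEnabled", "true")

  // ...or with SQL.
  spark.sql("SET spark.databricks.delta.stats.collect=true")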

alterLocation.bypassSchemaCheck

spark.databricks.delta.alterLocation.bypassSchemaCheck enables ALTER TABLE SET LOCATION on a Delta table to go through even if the Delta table in the new location has a different schema from the original Delta table.

Default: false
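
A hedged sketch of using the property; the table name and location are hypothetical:

  // Allow the new location to have a different schema than the current table (use with care).
  spark.conf.set("spark.databricks.delta.alterLocation.bypassSchemaCheck", "true")

  spark.sql("ALTER TABLE my_delta_table SET LOCATION '/tmp/delta/new_location'")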

checkLatestSchemaOnRead

spark.databricks.delta.checkLatestSchemaOnRead (internal) enables a check that ensures that users won’t read corrupt data if the source schema changes in an incompatible way.

Default: true

In Delta, we always try to give users the latest version of their data without their having to call REFRESH TABLE or redefine their DataFrames when used in the context of streaming. The schema of the latest version of the table may, however, be incompatible with the schema at the time of DataFrame creation.

checkpoint.partSize

spark.databricks.delta.checkpoint.partSize (internal) is the limit at which we will start parallelizing the checkpoint. We will attempt to write a maximum of this many actions per checkpoint.

Default: 5000000

commitInfo.enabled

spark.databricks.delta.commitInfo.enabled controls whether to log commit information into a Delta log.

Default: true

commitInfo.userMetadata

spark.databricks.delta.commitInfo.userMetadata is arbitrary user-defined metadata to include in CommitInfo (requires spark.databricks.delta.commitInfo.enabled).

Default: (empty)
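
A sketch of tagging commits with user metadata and reading it back; the table path and the metadata value are made up:

  // Record custom metadata in CommitInfo for subsequent commits.
  spark.conf.set("spark.databricks.delta.commitInfo.userMetadata", "nightly-load run 42")

  spark.range(5).write.format("delta").mode("append").save("/tmp/delta/events")

  // The userMetadata column of DESCRIBE HISTORY shows the recorded value.
  spark.sql("DESCRIBE HISTORY delta.`/tmp/delta/events`")
    .select("version", "userMetadata")
    .show(truncate = false)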

commitValidation.enabled

spark.databricks.delta.commitValidation.enabled (internal) controls whether to perform validation checks before a commit.

Default: true

convert.metadataCheck.enabled

spark.databricks.delta.convert.metadataCheck.enabled enables validation during CONVERT TO DELTA: if there is a difference between the catalog table's properties and the Delta table's configuration, Delta errors out.

If disabled, the two configurations are merged with the same semantics as update and merge.

Default: true

dummyFileManager.numOfFiles

spark.databricks.delta.dummyFileManager.numOfFiles (internal) controls how many dummy files to write in DummyFileManager.

Default: 3

dummyFileManager.prefix

spark.databricks.delta.dummyFileManager.prefix (internal) is the file prefix to use in DummyFileManager.

Default: .s3-optimization-

history.maxKeysPerList

spark.databricks.delta.history.maxKeysPerList (internal) controls how many commits to list when performing a parallel search.

The default is the maximum number of keys returned by S3 per list call. Azure can return 5000, therefore we choose 1000.

Default: 1000

history.metricsEnabled

spark.databricks.delta.history.metricsEnabled enables metrics reporting in DESCRIBE HISTORY. When enabled, CommitInfo records the operation metrics.

Default: true
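
With the property enabled, the operationMetrics column of DESCRIBE HISTORY is populated from CommitInfo. A sketch against a hypothetical table path:

  spark.conf.set("spark.databricks.delta.history.metricsEnabled", "true")

  spark.sql("DESCRIBE HISTORY delta.`/tmp/delta/events`")
    .select("version", "operation", "operationMetrics")
    .show(truncate = false)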

import.batchSize.schemaInference

spark.databricks.delta.import.batchSize.schemaInference (internal) is the number of files per batch for schema inference during import.

Default: 1000000

import.batchSize.statsCollection

spark.databricks.delta.import.batchSize.statsCollection (internal) is the number of files per batch for stats collection during import.

Default: 50000

maxSnapshotLineageLength

spark.databricks.delta.maxSnapshotLineageLength (internal) is the maximum lineage length of a Snapshot before Delta forces a Snapshot to be built from scratch.

Default: 50

merge.maxInsertCount

spark.databricks.delta.merge.maxInsertCount (internal) is the maximum row count of inserts in each MERGE execution.

Default: 10000

merge.optimizeInsertOnlyMerge.enabled

spark.databricks.delta.merge.optimizeInsertOnlyMerge.enabled (internal) controls whether a merge without any matched clause (i.e. an insert-only merge) will be optimized by avoiding rewriting old files and just inserting new files.

Default: true
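
An insert-only merge has no WHEN MATCHED clause, as in the sketch below (the events and updates tables are hypothetical). With the optimization enabled, Delta only adds new files instead of rewriting existing ones:

  spark.sql("""
    MERGE INTO events AS t
    USING updates AS s
    ON t.id = s.id
    WHEN NOT MATCHED THEN INSERT *
  """)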

merge.optimizeMatchedOnlyMerge.enabled

spark.databricks.delta.merge.optimizeMatchedOnlyMerge.enabled (internal) controls whether a merge without a 'when not matched' clause will be optimized to use a right outer join instead of a full outer join.

Default: true

merge.repartitionBeforeWrite.enabled

spark.databricks.delta.merge.repartitionBeforeWrite.enabled (internal) controls whether merge repartitions the output by the table's partition columns before writing the files.

Default: false

partitionColumnValidity.enabled

spark.databricks.delta.partitionColumnValidity.enabled (internal) enables validation of the partition column names (just like the data columns).

Default: true

retentionDurationCheck.enabled

spark.databricks.delta.retentionDurationCheck.enabled adds a check that prevents users from running VACUUM with a very short retention period, which may end up corrupting the Delta log.

Default: true
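
Disabling the check is mostly useful in tests, e.g. to VACUUM a freshly written table immediately. A sketch with a hypothetical path (doing this on a production table can break readers of older versions):

  spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

  spark.sql("VACUUM delta.`/tmp/delta/events` RETAIN 0 HOURS")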

sampling.enabled

spark.databricks.delta.sampling.enabled (internal) enables sample-based estimation.

Default: false

schema.autoMerge.enabled

spark.databricks.delta.schema.autoMerge.enabled enables schema merging on appends and overwrites.

Equivalent DataFrame option: mergeSchema

Default: false
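
A sketch of both spellings (the DataFrame and path are made up): the session property applies to all writes, while the mergeSchema option applies to a single write.

  import org.apache.spark.sql.functions.lit

  // Session-wide: merge new columns into the table schema on append and overwrite.
  spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

  // Per-write equivalent with the DataFrame option.
  val dfWithExtraColumn = spark.range(5).withColumn("extra", lit("x"))
  dfWithExtraColumn.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/tmp/delta/events")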

snapshotIsolation.enabled

spark.databricks.delta.snapshotIsolation.enabled (internal) controls whether queries on Delta tables are guaranteed to have snapshot isolation.

Default: true

snapshotPartitions

spark.databricks.delta.snapshotPartitions (internal) is the number of partitions to use for state reconstruction (when building a snapshot of a Delta table).

Default: 50

stalenessLimit

spark.databricks.delta.stalenessLimit (in millis) allows you to query the last loaded state of the Delta table without blocking on a table update. You can use this configuration to reduce the latency on queries when up-to-date results are not a requirement. Table updates will be scheduled on a separate scheduler pool in a FIFO queue, and will share cluster resources fairly with your query. If a table hasn’t updated past this time limit, we will block on a synchronous state update before running the query.

Default: 0 (no tables can be stale)
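
For example, to accept table state that is up to 5 minutes old instead of blocking on an update:

  spark.conf.set("spark.databricks.delta.stalenessLimit", (5 * 60 * 1000).toString)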

state.corruptionIsFatal

spark.databricks.delta.state.corruptionIsFatal (internal) throws a fatal error when the recreated Delta state does not match the committed checksum file.

Default: true

stateReconstructionValidation.enabled

spark.databricks.delta.stateReconstructionValidation.enabled (internal) controls whether to perform validation checks on the reconstructed state.

Default: true

stats.collect

spark.databricks.delta.stats.collect (internal) enables statistics to be collected while writing files into a Delta table.

Default: true

stats.limitPushdown.enabled

spark.databricks.delta.stats.limitPushdown.enabled (internal) enables using the LIMIT clause and file statistics to prune files before they are collected to the driver.

Default: true

stats.localCache.maxNumFiles

spark.databricks.delta.stats.localCache.maxNumFiles (internal) is the maximum number of files for a table to be considered a delta small table. Some metadata operations (such as using data skipping) are optimized for small tables using driver local caching and local execution.

Default: 2000

stats.skipping

spark.databricks.delta.stats.skipping (internal) enables using statistics for data skipping.

Default: true

timeTravel.resolveOnIdentifier.enabled

spark.databricks.delta.timeTravel.resolveOnIdentifier.enabled (internal) controls whether to resolve patterns such as @v123 in identifiers as time travel nodes.

Default: true
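
With the property enabled (the default), an @v<version> suffix in an identifier is resolved as a time travel specification. A sketch against a hypothetical table path:

  // Version 0 of the table, addressed through the identifier...
  spark.sql("SELECT * FROM delta.`/tmp/delta/events@v0`").show()

  // ...or, equivalently, through the versionAsOf read option.
  spark.read.format("delta").option("versionAsOf", "0").load("/tmp/delta/events").show()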