Configuration Properties¶
Configuration properties are a way to control features of Delta Lake for the whole cluster (the SparkSession, to be precise) and hence for all tables created by the cluster.
Table Properties
Unlike configuration properties, Table Properties are per table only.
spark.databricks.delta¶
Configuration properties use the spark.databricks.delta prefix.
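A minimal sketch of setting such a property, either when building a SparkSession or at runtime on an existing session (the application name and property values are illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Set a Delta configuration property for the whole session at build time.
val spark = SparkSession.builder()
  .appName("delta-config-demo") // illustrative application name
  .config("spark.databricks.delta.autoCompact.enabled", "true")
  .getOrCreate()

// ...or change a property at runtime on an existing session.
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
```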
alterLocation.bypassSchemaCheck¶
spark.databricks.delta.alterLocation.bypassSchemaCheck
Enables Alter Table Set Location on Delta to go through even if the Delta table in the new location has a different schema from the original Delta table
Default: false
alterTable.changeColumn.checkExpressions¶
spark.databricks.delta.alterTable.changeColumn.checkExpressions (internal)
Given an ALTER TABLE CHANGE COLUMN command, check whether Constraints or Generated Columns use expressions that reference this column (and so will be affected by the change and should be updated accordingly).
Turn this off when there is an issue with expression checking logic that prevents a valid column change from going through.
Default: true
Used when:
AlterDeltaTableCommand
is requested to checkDependentExpressions
autoCompact.enabled¶
spark.databricks.delta.autoCompact.enabled
Enables Auto Compaction on all writes across all delta tables in this session.
Default: false
Used when:
AutoCompactBase
is requested for the type of Auto Compaction
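A minimal sketch of enabling Auto Compaction for the current session (the path is illustrative; the file-size thresholds it uses are described next):

```scala
// Turn on Auto Compaction for all Delta writes in this session.
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

// Subsequent writes may trigger a post-commit compaction of small files,
// governed by autoCompact.maxFileSize and autoCompact.minFileSize below.
spark.range(100).write.format("delta").mode("append").save("/tmp/delta/demo")
```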
autoCompact.maxFileSize¶
spark.databricks.delta.autoCompact.maxFileSize
(internal) Target file size produced by Auto Compaction
Default: 128 * 1024 * 1024 (128MB)
Used when:
AutoCompactBase
is requested to compact
autoCompact.minFileSize¶
spark.databricks.delta.autoCompact.minFileSize
(internal) Files which are smaller than this threshold (in bytes) will be grouped together and rewritten as larger files in Auto Compaction
Default: Half of spark.databricks.delta.autoCompact.maxFileSize
Used when:
- OptimisticTransactionImpl is requested to createAutoCompactStatsCollector
- AutoCompactBase is requested to compact
autoCompact.modifiedPartitionsOnly.enabled¶
(internal) spark.databricks.delta.autoCompact.modifiedPartitionsOnly.enabled
When enabled, Auto Compaction works only on the modified partitions of the delta transaction that triggers compaction.
Default: true
Used when:
AutoCompactBase
is requested to isModifiedPartitionsOnlyAutoCompactEnabled
autoCompact.nonBlindAppend.enabled¶
(internal) spark.databricks.delta.autoCompact.nonBlindAppend.enabled
When enabled, Auto Compaction is only triggered by non-blind-append write transactions.
Default: false
Used when:
AutoCompactBase
is requested to isNonBlindAppendAutoCompactEnabled
changeDataFeed.unsafeBatchReadOnIncompatibleSchemaChanges.enabled¶
(internal) spark.databricks.delta.changeDataFeed.unsafeBatchReadOnIncompatibleSchemaChanges.enabled
Enables (unblocks) reading change data in batch (e.g. using table_changes()) on a delta table with column mapping schema operations.
It is currently blocked due to potential data loss and schema confusion, and hence considered risky.
Default: false
Used when:
CDCReaderImpl
is requested for a DataFrame of changes
checkLatestSchemaOnRead¶
spark.databricks.delta.checkLatestSchemaOnRead enables a check that ensures that users won't read corrupt data if the source schema changes in an incompatible way.
Default: true
Delta always tries to give users the latest version of table data without having to call REFRESH TABLE
or redefine their DataFrames when used in the context of streaming. There is a possibility that the schema of the latest version of the table may be incompatible with the schema at the time of DataFrame creation.
checkpoint.partSize¶
spark.databricks.delta.checkpoint.partSize
(internal) The limit at which checkpoint parallelization starts. It attempts to write a maximum of this many actions per checkpoint.
Default: (undefined) (and assumed 1)
Must be a positive long number
Used when:
Checkpoints
is requested to write out a state checkpoint
clusteredTable.enableClusteringTablePreview¶
spark.databricks.delta.clusteredTable.enableClusteringTablePreview
(internal) Controls Liquid Clustering
Default: false
Used when:
- ClusteredTableUtilsBase is requested to validatePreviewEnabled
- DeltaErrorsBase is requested to clusteringTablePreviewDisabledException
commitInfo.enabled¶
spark.databricks.delta.commitInfo.enabled controls whether to log commit information into a Delta log.
Default: true
commitInfo.userMetadata¶
spark.databricks.delta.commitInfo.userMetadata is an arbitrary user-defined metadata to include in CommitInfo (requires spark.databricks.delta.commitInfo.enabled).
Default: (empty)
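A sketch of attaching user metadata to the next commit (the metadata string, path and data are illustrative):

```scala
// Requires spark.databricks.delta.commitInfo.enabled (true by default).
spark.conf.set("spark.databricks.delta.commitInfo.userMetadata", "nightly-backfill run=42")

// The commit produced by this write records the metadata in its CommitInfo,
// visible e.g. in DESCRIBE HISTORY.
spark.range(10).write.format("delta").mode("append").save("/tmp/delta/demo")
```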
commitLock.enabled¶
spark.databricks.delta.commitLock.enabled (internal) controls whether or not to use a lock on a delta table at transaction commit.
Default: (undefined)
Used when:
OptimisticTransactionImpl
is requested to isCommitLockEnabled
commitValidation.enabled¶
spark.databricks.delta.commitValidation.enabled (internal) controls whether to perform validation checks before commit or not
Default: true
convert.metadataCheck.enabled¶
spark.databricks.delta.convert.metadataCheck.enabled enables validation during convert to delta: if there is a difference between the catalog table's properties and the Delta table's configuration, an error is raised.
If disabled, the two configurations are merged with the same semantics as update and merge.
Default: true
delete.deletionVectors.persistent¶
spark.databricks.delta.delete.deletionVectors.persistent
Enables persistent Deletion Vectors in DELETE command
Default: true
deletionVectors.useMetadataRowIndex¶
spark.databricks.delta.deletionVectors.useMetadataRowIndex
(internal) Enables using the Parquet reader-generated row_index column for filtering deleted rows with Deletion Vectors (which gives predicate pushdown and file splitting in scans).
Default: true
Used when:
- DeltaParquetFileFormat is requested to buildReaderWithPartitionValues
- ScanWithDeletionVectors is requested to createScanWithSkipRowColumn
- DeletionVectorBitmapGenerator is requested to buildRowIndexSetsForFilesMatchingCondition
- DMLWithDeletionVectorsHelper is requested to replaceFileIndex
dummyFileManager.numOfFiles¶
spark.databricks.delta.dummyFileManager.numOfFiles (internal) controls how many dummy files to write in DummyFileManager
Default: 3
dummyFileManager.prefix¶
spark.databricks.delta.dummyFileManager.prefix (internal) is the file prefix to use in DummyFileManager
Default: .s3-optimization-
dynamicPartitionOverwrite.enabled¶
spark.databricks.delta.dynamicPartitionOverwrite.enabled
(internal) Enables Dynamic Partition Overwrite (whether to overwrite partitions dynamically when partitionOverwriteMode
is dynamic
in either the SQL configuration or a DataFrameWriter
option). When disabled, partitionOverwriteMode
will be ignored.
Default: true
Used when:
DeltaWriteOptionsImpl
is requested to isDynamicPartitionOverwriteMode
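A sketch of both ways to request dynamic partition overwrite, assuming /tmp/delta/events is an existing Delta table partitioned by event_date (path, columns and data are illustrative):

```scala
// An illustrative DataFrame carrying the table's partition column.
val df = spark.range(100).selectExpr("id", "date'2023-01-15' as event_date")

// Session-wide: only the partitions present in df are overwritten.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
df.write.format("delta").mode("overwrite").save("/tmp/delta/events")

// Per write: pass the mode as a DataFrameWriter option instead.
df.write
  .format("delta")
  .option("partitionOverwriteMode", "dynamic")
  .mode("overwrite")
  .save("/tmp/delta/events")
```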
history.maxKeysPerList¶
spark.databricks.delta.history.maxKeysPerList
(internal) How many commits to list when performing a parallel search
Default: 1000
The default matches the maximum number of keys returned by S3 per ListObjectsV2 call. Microsoft Azure can return up to 5000 blobs (including all BlobPrefix elements) in a single List Blobs API call, so the lower value of 1000 is used as the default.
Used when:
DeltaLog
is requested for the DeltaHistoryManager
history.metricsEnabled¶
spark.databricks.delta.history.metricsEnabled
Enables metrics reporting in DESCRIBE HISTORY (CommitInfo will record the operation metrics when an OptimisticTransactionImpl is committed).
Requires spark.databricks.delta.commitInfo.enabled configuration property to be enabled
Default: true
Used when:
- OptimisticTransactionImpl is requested to getOperationMetrics
- ConvertToDeltaCommand is requested to streamWrite
- SQLMetricsReporting is requested to registerSQLMetrics
- TransactionalWrite is requested to writeFiles
Github Commit
The feature was added as part of [SC-24567][DELTA] Add additional metrics to Describe Delta History commit.
import.batchSize.schemaInference¶
spark.databricks.delta.import.batchSize.schemaInference (internal) is the number of files per batch for schema inference during import.
Default: 1000000
import.batchSize.statsCollection¶
spark.databricks.delta.import.batchSize.statsCollection (internal) is the number of files per batch for stats collection during import.
Default: 50000
io.skipping.mdc.addNoise¶
spark.databricks.io.skipping.mdc.addNoise (internal) controls whether or not to add a random byte as a suffix to the interleaved bits when computing the Z-order values for MDC. This can help deal with skew, but may have a negative impact on overall min/max skipping effectiveness.
Default: true
Used when:
SpaceFillingCurveClustering
is requested to cluster
io.skipping.mdc.rangeId.max¶
spark.databricks.io.skipping.mdc.rangeId.max (internal) controls the domain of rangeId values to be interleaved. The bigger the value, the better the granularity, but at the expense of performance (more data gets sampled).
Default: 1000
Must be greater than 1
Used when:
SpaceFillingCurveClustering
is requested to cluster
io.skipping.stringPrefixLength¶
spark.databricks.io.skipping.stringPrefixLength (internal) The length of the prefix of string columns to store in the data skipping index
Default: 32
Used when:
StatisticsCollection
is requested for the statsCollector Column
lastCommitVersionInSession¶
spark.databricks.delta.lastCommitVersionInSession is the version of the last commit made in the SparkSession
for any delta table (after OptimisticTransactionImpl
is done with doCommit or DeltaCommand
with commitLarge)
Default: (undefined)
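A sketch of reading the property back; the value is only defined once this session has committed at least once:

```scala
// None until this SparkSession has made a commit to some Delta table.
val lastCommitVersion: Option[String] =
  spark.conf.getOption("spark.databricks.delta.lastCommitVersionInSession")
```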
loadFileSystemConfigsFromDataFrameOptions¶
spark.databricks.delta.loadFileSystemConfigsFromDataFrameOptions (internal) controls whether to load file systems configs provided in DataFrameReader
or DataFrameWriter
options when calling DataFrameReader.load/DataFrameWriter.save
using a Delta table path.
Not supported for DataFrameReader.table
and DataFrameWriter.saveAsTable
Default: true
maxCommitAttempts¶
spark.databricks.delta.maxCommitAttempts (internal) is the maximum number of commit attempts to try for a single commit before failing
Default: 10000000
Used when:
OptimisticTransactionImpl
is requested to doCommitRetryIteratively
maxSnapshotLineageLength¶
spark.databricks.delta.maxSnapshotLineageLength (internal) is the maximum lineage length of a Snapshot before Delta forces to build a Snapshot from scratch
Default: 50
merge.deletionVectors.persistent¶
spark.databricks.delta.merge.deletionVectors.persistent
(internal) Enables persistent Deletion Vectors in MERGE command
Default: true
Used when:
MergeIntoCommandBase
is requested to shouldWritePersistentDeletionVectors
merge.materializeSource¶
spark.databricks.delta.merge.materializeSource
(internal) When to materialize the source plan during MERGE execution
Value | Meaning |
---|---|
all | source always materialized |
auto | sources not materialized unless non-deterministic |
none | source never materialized |
Default: auto
Used when:
MergeIntoMaterializeSource
is requested to shouldMaterializeSource
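A sketch of forcing source materialization for MERGE commands whose source is known to be non-deterministic (accepted values are all, auto and none, as in the table above):

```scala
// Always materialize (persist) the MERGE source plan in this session.
spark.conf.set("spark.databricks.delta.merge.materializeSource", "all")
```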
merge.materializeSource.maxAttempts¶
spark.databricks.delta.merge.materializeSource.maxAttempts
How many times to retry execution of the MERGE command in case the data (an RDD block) of the materialized source RDD is lost
Default: 4
Used when:
MergeIntoMaterializeSource
is requested to runWithMaterializedSourceLostRetries
merge.materializeSource.rddStorageLevel¶
spark.databricks.delta.merge.materializeSource.rddStorageLevel
(internal) What StorageLevel
to use to persist the source RDD
Default: DISK_ONLY
Used when:
MergeIntoMaterializeSource
is requested to prepare the source table
merge.materializeSource.rddStorageLevelRetry¶
spark.databricks.delta.merge.materializeSource.rddStorageLevelRetry
(internal) What StorageLevel
to use to persist the source RDD when MERGE is retried
Default: DISK_ONLY_2
Used when:
MergeIntoMaterializeSource
is requested to prepare the source table
merge.maxInsertCount¶
spark.databricks.delta.merge.maxInsertCount (internal) is the maximum row count of inserts in each MERGE execution
Default: 10000L
merge.optimizeInsertOnlyMerge.enabled¶
spark.databricks.delta.merge.optimizeInsertOnlyMerge.enabled
(internal) Enables extra optimization for insert-only merges by avoiding rewriting old files and just inserting new files
Default: true
Used when:
- MergeIntoCommand is requested to run a merge (for insert-only merge write)
- MergeIntoMaterializeSource is requested to shouldMaterializeSource (for an insert-only merge)
merge.optimizeMatchedOnlyMerge.enabled¶
spark.databricks.delta.merge.optimizeMatchedOnlyMerge.enabled
(internal) Enables optimization of matched-only merges to use a RIGHT OUTER join (instead of a FULL OUTER join) while writing out all merge changes
Default: true
merge.repartitionBeforeWrite.enabled¶
spark.databricks.delta.merge.repartitionBeforeWrite.enabled
(internal) Enables repartitioning of the merge output (by the partition columns of the target table, if partitioned) before writing the data(frame) out
Default: true
Used when:
MergeIntoCommandBase
is requested to write data(frame) out
optimize.maxFileSize¶
spark.databricks.delta.optimize.maxFileSize
(internal) Target file size produced by OPTIMIZE command.
Default: 1024 * 1024 * 1024 (1GB)
Used when:
OptimizeExecutor
is requested to optimize
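A sketch of lowering the target file size for one session before compacting a table (the table name and the 256MB value are illustrative):

```scala
// Produce files of at most ~256MB instead of the default 1GB.
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", (256L * 1024 * 1024).toString)
spark.sql("OPTIMIZE demo_table")
```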
optimize.maxThreads¶
spark.databricks.delta.optimize.maxThreads (internal) Maximum number of parallel jobs allowed in the OPTIMIZE command. Increasing the maximum number of parallel jobs allows the OPTIMIZE command to run faster, but increases the job management overhead on the Spark driver.
Default: 15
Used when:
OptimizeExecutor
is requested to optimize
optimize.minFileSize¶
spark.databricks.delta.optimize.minFileSize
(internal) Files which are smaller than this threshold (in bytes) will be grouped together and rewritten as larger files by the OPTIMIZE command.
Default: 1024 * 1024 * 1024 (1GB)
Used when:
OptimizeExecutor
is requested to optimize
optimize.repartition.enabled¶
spark.databricks.delta.optimize.repartition.enabled
(internal) Use Dataset.repartition(1) instead of Dataset.coalesce(1) to merge small files. Dataset.coalesce(1) is executed with only one task, so if there are many tiny files within a bin (e.g. 1000 files of 50MB), it cannot be sped up with more executors. Dataset.repartition(1), on the other hand, incurs a shuffle stage, but the job can be distributed.
Default: false
Used when:
OptimizeExecutor
is requested to optimize (and runOptimizeBinJob)
optimize.zorder.checkStatsCollection.enabled¶
spark.databricks.delta.optimize.zorder.checkStatsCollection.enabled
(internal) Controls whether to check that column statistics are available for the zOrderBy columns of the OPTIMIZE command
Default: true
Used when:
OptimizeTableCommandBase
is requested to validate zOrderBy columns (and zOrderingOnColumnWithNoStatsException)
partitionColumnValidity.enabled¶
spark.databricks.delta.partitionColumnValidity.enabled (internal) enables validation of the partition column names (just like the data columns)
Default: true
Used when:
OptimisticTransactionImpl
is requested to verify a new metadata (with NoMapping column mapping mode)
properties.defaults.minReaderVersion¶
spark.databricks.delta.properties.defaults.minReaderVersion is the default reader protocol version to create new tables with, unless a feature that requires a higher version for correctness is enabled.
Default: 1
Available values: 1
Used when:
Protocol
utility is used to create a Protocol
properties.defaults.minWriterVersion¶
spark.databricks.delta.properties.defaults.minWriterVersion is the default writer protocol version to create new tables with, unless a feature that requires a higher version for correctness is enabled.
Default: 2
Available values: 1
, 2
, 3
Used when:
Protocol
utility is used to create a Protocol
replaceWhere.constraintCheck.enabled¶
spark.databricks.delta.replaceWhere.constraintCheck.enabled
Controls whether replaceWhere on an arbitrary expression and arbitrary columns enforces the constraint that the target table is replaced only when all the rows in the source dataframe match that constraint.
If disabled, the constraint check is skipped and the table is replaced with all the rows from the new dataframe.
Default: true
Used when:
WriteIntoDelta
is requested to extract constraints
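A sketch of a replaceWhere write with the constraint check enabled (the default): the write fails if any row in the DataFrame falls outside the predicate. Path, columns, predicate and data are illustrative.

```scala
// Illustrative replacement data, all within the predicate below.
val df = spark.range(100).selectExpr("id", "date'2023-01-15' as event_date")

df.write
  .format("delta")
  .mode("overwrite")
  // Replace only January 2023; every row in df must satisfy this predicate.
  .option("replaceWhere", "event_date >= '2023-01-01' AND event_date < '2023-02-01'")
  .save("/tmp/delta/events")
```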
retentionDurationCheck.enabled¶
spark.databricks.delta.retentionDurationCheck.enabled adds a check preventing users from running vacuum with a very short retention period, which may end up corrupting a Delta log.
Default: true
sampling.enabled¶
spark.databricks.delta.sampling.enabled (internal) enables sample-based estimation
Default: false
schema.autoMerge.enabled¶
spark.databricks.delta.schema.autoMerge.enabled
Enables schema merging on appends (UPDATEs) and overwrites (INSERTs).
Default: false
Equivalent DataFrame option: mergeSchema
Used when:
- DeltaMergeInto utility is used to resolveReferencesAndSchema
- MetadataMismatchErrorBuilder is requested to addSchemaMismatch
- DeltaWriteOptionsImpl is requested for canMergeSchema
- MergeIntoCommand is requested for canMergeSchema
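A sketch of the two equivalent ways to allow schema merging on a write (path, column and data are illustrative):

```scala
// 1. Session-wide, via the configuration property.
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

// 2. Per write, via the equivalent DataFrame option.
//    An illustrative DataFrame with an extra column not yet in the table schema.
val dfWithNewColumn = spark.range(10).selectExpr("id", "'web' as channel")
dfWithNewColumn.write
  .format("delta")
  .mode("append")
  .option("mergeSchema", "true")
  .save("/tmp/delta/events")
```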
schema.removeSparkInternalMetadata¶
(internal) spark.databricks.delta.schema.removeSparkInternalMetadata
Whether to remove Spark's leaked internal metadata from the table schema before returning it to Spark. This internal metadata might be stored unintentionally in tables created by old Spark versions.
Default: true
Used when:
DeltaTableUtils
utility is used to removeInternalMetadata
schema.typeCheck.enabled¶
spark.databricks.delta.schema.typeCheck.enabled (internal) controls whether to check unsupported data types while updating a table schema
Disabling this flag may allow users to create unsupported Delta tables and should only be used when trying to read/write legacy tables.
Default: true
Used when:
- OptimisticTransactionImpl is requested for the checkUnsupportedDataType flag
- DeltaErrorsBase is requested to unsupportedDataTypes
snapshotIsolation.enabled¶
spark.databricks.delta.snapshotIsolation.enabled (internal) controls whether queries on Delta tables are guaranteed to have snapshot isolation
Default: true
snapshotPartitions¶
spark.databricks.delta.snapshotPartitions (internal) is the number of partitions to use for state reconstruction (when building a snapshot of a Delta table).
Default: 50
stalenessLimit¶
spark.databricks.delta.stalenessLimit (in millis) allows you to query the last loaded state of the Delta table without blocking on a table update. You can use this configuration to reduce the latency on queries when up-to-date results are not a requirement. Table updates will be scheduled on a separate scheduler pool in a FIFO queue, and will share cluster resources fairly with your query. If a table hasn't updated past this time limit, we will block on a synchronous state update before running the query.
Default: 0
(no tables can be stale)
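A sketch of allowing queries to run against a snapshot that is up to one hour stale while the state refresh happens asynchronously (the one-hour limit is illustrative):

```scala
// 1 hour expressed in milliseconds.
spark.conf.set("spark.databricks.delta.stalenessLimit", (60L * 60 * 1000).toString)
```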
state.corruptionIsFatal¶
spark.databricks.delta.state.corruptionIsFatal (internal) throws a fatal error when the recreated Delta State doesn't match the committed checksum file
Default: true
stateReconstructionValidation.enabled¶
spark.databricks.delta.stateReconstructionValidation.enabled (internal) controls whether to perform validation checks on the reconstructed state
Default: true
stats.collect¶
spark.databricks.delta.stats.collect
(internal) Enables statistics to be collected while writing files into a delta table
Default: true
Used when:
- ConvertToDeltaCommand is executed
- TransactionalWrite is requested to write data out (and getOptionalStatsTrackerAndStatsCollection)
stats.collect.using.tableSchema¶
spark.databricks.delta.stats.collect.using.tableSchema
(internal) When collecting stats while writing files into Delta table (with spark.databricks.delta.stats.collect enabled), whether to use the table schema (true
) or the DataFrame schema (false
) as the stats collection schema.
Default: true
Used when:
TransactionalWrite
is requested to getOptionalStatsTrackerAndStatsCollection
stats.limitPushdown.enabled¶
spark.databricks.delta.stats.limitPushdown.enabled (internal) enables using the limit clause and file statistics to prune files before they are collected to the driver
Default: true
stats.localCache.maxNumFiles¶
spark.databricks.delta.stats.localCache.maxNumFiles (internal) is the maximum number of files for a table to be considered a delta small table. Some metadata operations (such as using data skipping) are optimized for small tables using driver local caching and local execution.
Default: 2000
stats.skipping¶
spark.databricks.delta.stats.skipping (internal) enables Data Skipping
Default: true
Used when:
- DataSkippingReaderBase is requested for the files to scan
- PrepareDeltaScanBase logical optimization is executed
streaming.schemaTracking.enabled¶
spark.databricks.delta.streaming.schemaTracking.enabled
(internal) Controls whether the delta streaming source can support non-additive schema evolution for operations such as rename or drop column on column mapping enabled tables
Default: true
Used when:
- DeltaErrorsBase is requested to blockStreamingReadsWithIncompatibleColumnMappingSchemaChanges
- DeltaDataSource is requested for a DeltaSourceMetadataTrackingLog
streaming.unsafeReadOnIncompatibleColumnMappingSchemaChanges.enabled¶
spark.databricks.delta.streaming.unsafeReadOnIncompatibleColumnMappingSchemaChanges.enabled
(internal) Controls (possibly unsafe) streaming read on delta tables with column mapping schema operations (e.g. rename or drop column) that could lead to data loss and schema confusion.
Default: false
Used when:
DeltaSourceBase
is requested for the allowUnsafeStreamingReadOnColumnMappingSchemaChanges
timeTravel.resolveOnIdentifier.enabled¶
spark.databricks.delta.timeTravel.resolveOnIdentifier.enabled
(internal) Enables time travel patterns (such as @v123 and @yyyyMMddHHmmssSSS) in the path identifiers of delta tables
Default: true
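A sketch of the path-based time travel identifiers this property enables (the path, version and timestamp are illustrative):

```scala
// Read the table as of version 123.
val atVersion = spark.read.format("delta").load("/tmp/delta/events@v123")

// Read the table as of a timestamp in yyyyMMddHHmmssSSS format.
val atTimestamp = spark.read.format("delta").load("/tmp/delta/events@20230101000000000")
```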
update.deletionVectors.persistent¶
spark.databricks.delta.update.deletionVectors.persistent
(internal) Enables persistent Deletion Vectors in UPDATE command
Default: true
Used when:
- UpdateCommand is executed (and shouldWritePersistentDeletionVectors)
vacuum.parallelDelete.enabled¶
spark.databricks.delta.vacuum.parallelDelete.enabled enables parallelizing the deletion of files during the vacuum command.
Default: false
Enabling may result in hitting rate limits on some storage backends. When enabled, parallelization is controlled by the default number of shuffle partitions.
write.txnAppId¶
spark.databricks.delta.write.txnAppId
(internal) The user-defined application ID a write will be committed with. If specified, spark.databricks.delta.write.txnVersion has to be set, too.
Default: (undefined)
Used when:
DeltaCommand
is requested for the txnVersion and txnAppId
write.txnVersion¶
spark.databricks.delta.write.txnVersion
(internal) The user-defined transaction version a write will be committed with. If specified, spark.databricks.delta.write.txnAppId has to be set, too. To ensure idempotency, txnVersions across different writes need to be monotonically increasing.
Default: (undefined)
Used when:
DeltaCommand
is requested to create a SetTransaction action, for the txnVersion and txnAppId and hasBeenExecuted
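A sketch of an idempotent write using the two properties together (the application ID, version, path and data are illustrative):

```scala
// Tag the write with an application ID and a monotonically increasing version.
spark.conf.set("spark.databricks.delta.write.txnAppId", "nightly-etl")
spark.conf.set("spark.databricks.delta.write.txnVersion", "42")

// A retried write with the same (txnAppId, txnVersion) pair is skipped.
spark.range(10).write.format("delta").mode("append").save("/tmp/delta/events")
```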
write.txnVersion.autoReset.enabled¶
spark.databricks.delta.write.txnVersion.autoReset.enabled
(internal) When enabled, automatically resets spark.databricks.delta.write.txnVersion after every write
If the txnAppId
and txnVersion
both come from the session config (based on write.txnAppId and write.txnVersion, respectively), spark.databricks.delta.write.txnVersion is reset (unset), after skipping the current transaction, as a safety measure to prevent data loss if the user forgets to manually reset txnVersion
Default: false
Used when:
DeltaCommand
is requested to create a SetTransaction action and hasBeenExecuted
spark.delta.logStore.class¶
The fully-qualified class name of a LogStore
Default: HDFSLogStore
Used when:
LogStoreProvider
is requested for a LogStore
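A sketch of configuring the LogStore when building a session, assuming the S3-specific implementation bundled with Delta Lake (the class name may differ across Delta versions):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.delta.logStore.class",
    "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore") // assumed implementation
  .getOrCreate()
```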