DeltaSQLConf — spark.databricks.delta Configuration Properties¶
DeltaSQLConf contains spark.databricks.delta-prefixed configuration properties to configure the behaviour of Delta Lake.
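These are regular Spark SQL configuration properties and can be set on the active SparkSession or with a SQL SET statement. A minimal sketch (the property and value shown are just an example):

```scala
// Set a Delta configuration property on the active SparkSession...
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

// ...or with a SQL SET statement
spark.sql("SET spark.databricks.delta.schema.autoMerge.enabled=true")

// Read the effective value back
spark.conf.get("spark.databricks.delta.schema.autoMerge.enabled")
```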
alterLocation.bypassSchemaCheck¶
spark.databricks.delta.alterLocation.bypassSchemaCheck enables ALTER TABLE SET LOCATION on a Delta table to go through even if the Delta table at the new location has a different schema from the original Delta table.
Default: false
checkLatestSchemaOnRead¶
spark.databricks.delta.checkLatestSchemaOnRead enables a check that ensures that users won't read corrupt data if the source schema changes in an incompatible way.
Default: true
Delta always tries to give users the latest version of table data without requiring them to call REFRESH TABLE or redefine their DataFrames when used in the context of streaming. There is a possibility that the schema of the latest version of the table is incompatible with the schema at the time of DataFrame creation.
checkpoint.partSize¶
spark.databricks.delta.checkpoint.partSize (internal) is the limit at which Delta starts parallelizing the checkpoint. Delta will attempt to write a maximum of this many actions per checkpoint.
Default: 5000000
commitInfo.enabled¶
spark.databricks.delta.commitInfo.enabled controls whether to log commit information into a Delta log.
Default: true
commitInfo.userMetadata¶
spark.databricks.delta.commitInfo.userMetadata is arbitrary user-defined metadata to include in CommitInfo (requires spark.databricks.delta.commitInfo.enabled).
Default: (empty)
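A minimal sketch of attaching custom commit metadata (the table path is hypothetical and df is an assumed DataFrame; Delta also accepts a per-write userMetadata DataFrameWriter option):

```scala
// Session-wide: every commit records this string in CommitInfo
// (spark.databricks.delta.commitInfo.enabled is true by default)
spark.conf.set("spark.databricks.delta.commitInfo.userMetadata", "nightly-backfill")

// Per-write alternative: the userMetadata DataFrameWriter option
df.write
  .format("delta")
  .option("userMetadata", "nightly-backfill")
  .mode("append")
  .save("/tmp/delta/events") // hypothetical table path
```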
commitValidation.enabled¶
spark.databricks.delta.commitValidation.enabled (internal) controls whether to perform validation checks before a commit.
Default: true
convert.metadataCheck.enabled¶
spark.databricks.delta.convert.metadataCheck.enabled enables validation during convert to delta: if there is a difference between the catalog table's properties and the Delta table's configuration, the conversion errors out.
If disabled, the two configurations are merged with the same semantics as update and merge.
Default: true
dummyFileManager.numOfFiles¶
spark.databricks.delta.dummyFileManager.numOfFiles (internal) controls how many dummy files to write in DummyFileManager
Default: 3
dummyFileManager.prefix¶
spark.databricks.delta.dummyFileManager.prefix (internal) is the file prefix to use in DummyFileManager
Default: .s3-optimization-
history.maxKeysPerList¶
spark.databricks.delta.history.maxKeysPerList (internal) controls how many commits to list when performing a parallel search.
The default (1000) is the maximum number of keys returned by S3 per list call; Azure can return 5000, therefore 1000 was chosen.
Default: 1000
history.metricsEnabled¶
spark.databricks.delta.history.metricsEnabled enables metrics reporting in DESCRIBE HISTORY (CommitInfo will record the operation metrics when an OptimisticTransactionImpl is committed and the spark.databricks.delta.commitInfo.enabled configuration property is enabled).
Requires the spark.databricks.delta.commitInfo.enabled configuration property to be enabled.
Default: true
Used when:
OptimisticTransactionImpl is requested to getOperationMetrics
ConvertToDeltaCommand is requested to streamWrite
SQLMetricsReporting is requested to registerSQLMetrics
TransactionalWrite is requested to writeFiles
GitHub Commit
The feature was added as part of [SC-24567][DELTA] Add additional metrics to Describe Delta History commit.
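A short sketch of inspecting the recorded metrics with DESCRIBE HISTORY (the table path is hypothetical):

```scala
// With metrics reporting enabled, DESCRIBE HISTORY exposes an operationMetrics
// map column with per-operation metrics (e.g. numFiles, numOutputRows)
spark.sql("DESCRIBE HISTORY delta.`/tmp/delta/events`")
  .select("version", "operation", "operationMetrics")
  .show(truncate = false)
```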
import.batchSize.schemaInference¶
spark.databricks.delta.import.batchSize.schemaInference (internal) is the number of files per batch for schema inference during import.
Default: 1000000
import.batchSize.statsCollection¶
spark.databricks.delta.import.batchSize.statsCollection (internal) is the number of files per batch for stats collection during import.
Default: 50000
loadFileSystemConfigsFromDataFrameOptions¶
spark.databricks.delta.loadFileSystemConfigsFromDataFrameOptions (internal) controls whether to load file system configs provided in DataFrameReader or DataFrameWriter options when calling DataFrameReader.load/DataFrameWriter.save using a Delta table path.
Not supported for DataFrameReader.table and DataFrameWriter.saveAsTable.
Default: true
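A hedged sketch of what this enables, assuming S3A credential keys are passed as reader options (the bucket and credentials are hypothetical):

```scala
// With the property enabled (default), file-system options provided on the
// reader are applied as Hadoop configs when resolving the Delta table path
spark.read
  .format("delta")
  .option("fs.s3a.access.key", "<access-key>") // hypothetical credentials
  .option("fs.s3a.secret.key", "<secret-key>")
  .load("s3a://my-bucket/delta/events")        // hypothetical bucket
```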
maxSnapshotLineageLength¶
spark.databricks.delta.maxSnapshotLineageLength (internal) is the maximum lineage length of a Snapshot before Delta forces building a Snapshot from scratch.
Default: 50
merge.maxInsertCount¶
spark.databricks.delta.merge.maxInsertCount (internal) is the maximum row count of inserts in each MERGE execution
Default: 10000L
merge.optimizeInsertOnlyMerge.enabled¶
spark.databricks.delta.merge.optimizeInsertOnlyMerge.enabled (internal) controls whether a merge without any matched clause (i.e., an insert-only merge) will be optimized by avoiding rewriting old files and just inserting new files.
Default: true
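For context, a minimal sketch of an insert-only merge that this optimization targets (the table path and the updates source view are hypothetical):

```scala
// An insert-only merge: only WHEN NOT MATCHED clauses, so Delta can skip
// rewriting existing files and just write files for the new rows
spark.sql("""
  MERGE INTO delta.`/tmp/delta/events` AS t
  USING updates AS s
  ON t.eventId = s.eventId
  WHEN NOT MATCHED THEN INSERT *
""")
```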
merge.optimizeMatchedOnlyMerge.enabled¶
spark.databricks.delta.merge.optimizeMatchedOnlyMerge.enabled (internal) controls whether a merge without a 'when not matched' clause will be optimized to use a right outer join instead of a full outer join.
Default: true
merge.repartitionBeforeWrite.enabled¶
spark.databricks.delta.merge.repartitionBeforeWrite.enabled (internal) controls whether the MERGE command repartitions the output (by the table's partition columns) before writing the files.
Default: true
Used when:
MergeIntoCommand is requested to repartitionIfNeeded
optimize.maxFileSize¶
spark.databricks.delta.optimize.maxFileSize (internal) is the target file size produced by the OPTIMIZE command.
Default: 1024 * 1024 * 1024
Used when:
OptimizeExecutor is requested to optimize
optimize.maxThreads¶
spark.databricks.delta.optimize.maxThreads (internal) is the maximum number of parallel jobs allowed in the OPTIMIZE command. Increasing the number of parallel jobs allows the OPTIMIZE command to run faster, but increases the job management overhead on the Spark driver.
Default: 15
Used when:
OptimizeExecutor is requested to optimize
optimize.minFileSize¶
spark.databricks.delta.optimize.minFileSize (internal) Files which are smaller than this threshold (in bytes) will be grouped together and rewritten as larger files by the OPTIMIZE command.
Default: 1024 * 1024 * 1024
Used when:
OptimizeExecutor is requested to optimize
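A hedged sketch of tuning these thresholds before running OPTIMIZE (the values and the table path are examples only, assuming a Delta version with the OPTIMIZE SQL command):

```scala
// Compact files smaller than 128 MB, targeting roughly 256 MB output files
spark.conf.set("spark.databricks.delta.optimize.minFileSize", (128L * 1024 * 1024).toString)
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", (256L * 1024 * 1024).toString)

spark.sql("OPTIMIZE delta.`/tmp/delta/events`") // hypothetical table path
```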
partitionColumnValidity.enabled¶
spark.databricks.delta.partitionColumnValidity.enabled (internal) enables validation of the partition column names (just like the data columns)
Default: true
properties.defaults.minReaderVersion¶
spark.databricks.delta.properties.defaults.minReaderVersion is the default reader protocol version to create new tables with, unless a feature that requires a higher version for correctness is enabled.
Default: 1
Available values: 1
Used when:
Protocol utility is used to create a Protocol
properties.defaults.minWriterVersion¶
spark.databricks.delta.properties.defaults.minWriterVersion is the default writer protocol version to create new tables with, unless a feature that requires a higher version for correctness is enabled.
Default: 2
Available values: 1, 2, 3
Used when:
Protocol utility is used to create a Protocol
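A minimal sketch of creating a table with non-default protocol versions (the table name is hypothetical):

```scala
// New tables created in this session default to reader version 1 / writer version 3
spark.conf.set("spark.databricks.delta.properties.defaults.minReaderVersion", "1")
spark.conf.set("spark.databricks.delta.properties.defaults.minWriterVersion", "3")

spark.sql("CREATE TABLE demo_events (id BIGINT, ts TIMESTAMP) USING delta") // hypothetical table
```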
retentionDurationCheck.enabled¶
spark.databricks.delta.retentionDurationCheck.enabled adds a check preventing users from running vacuum with a very short retention period, which may end up corrupting a Delta log.
Default: true
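A hedged sketch of bypassing the check for a short retention period (the table path is hypothetical; disable the check only when you are certain no readers or writers still need the older files):

```scala
// With the check enabled (default), a retention below the default 7 days makes VACUUM fail fast
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
spark.sql("VACUUM delta.`/tmp/delta/events` RETAIN 0 HOURS") // hypothetical table path
```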
sampling.enabled¶
spark.databricks.delta.sampling.enabled (internal) enables sample-based estimation
Default: false
schema.autoMerge.enabled¶
spark.databricks.delta.schema.autoMerge.enabled enables schema merging on appends and overwrites.
Default: false
Equivalent DataFrame option: mergeSchema
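A short sketch of the two equivalent ways to merge new columns into the table schema on write (dfWithNewColumn and the table path are hypothetical):

```scala
// Session-wide schema merging for appends and overwrites...
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

// ...or per write, with the equivalent mergeSchema DataFrame option
dfWithNewColumn.write
  .format("delta")
  .option("mergeSchema", "true")
  .mode("append")
  .save("/tmp/delta/events") // hypothetical table path
```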
snapshotIsolation.enabled¶
spark.databricks.delta.snapshotIsolation.enabled (internal) controls whether queries on Delta tables are guaranteed to have snapshot isolation
Default: true
snapshotPartitions¶
spark.databricks.delta.snapshotPartitions (internal) is the number of partitions to use for state reconstruction (when building a snapshot of a Delta table).
Default: 50
stalenessLimit¶
spark.databricks.delta.stalenessLimit (in millis) allows you to query the last loaded state of the Delta table without blocking on a table update. You can use this configuration to reduce the latency on queries when up-to-date results are not a requirement. Table updates will be scheduled on a separate scheduler pool in a FIFO queue, and will share cluster resources fairly with your query. If a table hasn't updated past this time limit, we will block on a synchronous state update before running the query.
Default: 0 (no tables can be stale)
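A minimal sketch of allowing queries to use a table state that is up to one hour old (the value is an example only):

```scala
// Let queries run against a cached table state that is at most 1 hour old;
// beyond that limit, the query blocks on a synchronous table update
spark.conf.set("spark.databricks.delta.stalenessLimit", (60 * 60 * 1000).toString)
```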
state.corruptionIsFatal¶
spark.databricks.delta.state.corruptionIsFatal (internal) throws a fatal error when the recreated Delta State doesn't match the committed checksum file.
Default: true
stateReconstructionValidation.enabled¶
spark.databricks.delta.stateReconstructionValidation.enabled (internal) controls whether to perform validation checks on the reconstructed state
Default: true
stats.collect¶
spark.databricks.delta.stats.collect (internal) Enables statistics to be collected while writing files into a Delta table
Default: true
Used when:
TransactionalWrite is requested to writeFiles
stats.limitPushdown.enabled¶
spark.databricks.delta.stats.limitPushdown.enabled (internal) enables using the limit clause and file statistics to prune files before they are collected to the driver
Default: true
stats.localCache.maxNumFiles¶
spark.databricks.delta.stats.localCache.maxNumFiles (internal) is the maximum number of files for a table to be considered a delta small table. Some metadata operations (such as using data skipping) are optimized for small tables using driver local caching and local execution.
Default: 2000
stats.skipping¶
spark.databricks.delta.stats.skipping (internal) enables using statistics for data skipping
Default: true
timeTravel.resolveOnIdentifier.enabled¶
spark.databricks.delta.timeTravel.resolveOnIdentifier.enabled (internal) controls whether to resolve patterns such as @v123 and @yyyyMMddHHmmssSSS in path identifiers as time travel nodes.
Default: true
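A hedged sketch of the @-suffixed path patterns this property resolves (the path, version, and timestamp are hypothetical):

```scala
// Version-based and timestamp-based patterns in a path identifier
spark.sql("SELECT * FROM delta.`/tmp/delta/events@v3`")                 // table version 3
spark.sql("SELECT * FROM delta.`/tmp/delta/events@20210601000000000`")  // yyyyMMddHHmmssSSS
```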
vacuum.parallelDelete.enabled¶
spark.databricks.delta.vacuum.parallelDelete.enabled enables parallelizing the deletion of files during the vacuum command.
Default: false
Enabling it may result in hitting rate limits on some storage backends. When enabled, parallelization is controlled by the default number of shuffle partitions.
io.skipping.stringPrefixLength¶
spark.databricks.io.skipping.stringPrefixLength (internal) The length of the prefix of string columns to store in the data skipping index
Default: 32
Used when:
StatisticsCollection is requested for the statsCollector Column