DeltaSQLConf — spark.databricks.delta Configuration Properties¶
DeltaSQLConf contains spark.databricks.delta-prefixed configuration properties that configure the behaviour of Delta Lake.
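All of these properties are regular Spark SQL configuration properties and can be set on a SparkSession. A minimal sketch, assuming a Delta-enabled session (the chosen property values below are just examples):

```scala
import org.apache.spark.sql.SparkSession

// A SparkSession with Delta Lake on the classpath
val spark = SparkSession.builder()
  .appName("delta-sql-conf-demo")
  .master("local[*]")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  // DeltaSQLConf properties can be given at build time...
  .config("spark.databricks.delta.history.metricsEnabled", "true")
  .getOrCreate()

// ...or set and inspected at runtime
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
spark.sql("SET spark.databricks.delta.schema.autoMerge.enabled").show(truncate = false)
```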
alterLocation.bypassSchemaCheck¶
spark.databricks.delta.alterLocation.bypassSchemaCheck enables ALTER TABLE SET LOCATION on a Delta table to go through even if the Delta table in the new location has a different schema from the original Delta table.
Default: false
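A sketch of when this matters (the table name and location are made up): with the check bypassed, the ALTER TABLE statement succeeds even though the Delta table at the new location has a different schema.

```scala
// Without bypassing the check, this fails if the schemas of the two Delta tables differ
spark.conf.set("spark.databricks.delta.alterLocation.bypassSchemaCheck", "true")
spark.sql("ALTER TABLE delta_demo SET LOCATION '/tmp/delta/other_table'")
```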
checkLatestSchemaOnRead¶
spark.databricks.delta.checkLatestSchemaOnRead (internal) enables a check that ensures that users won't read corrupt data if the source schema changes in an incompatible way.
Default: true
Delta always tries to give users the latest version of their data without requiring a call to REFRESH TABLE or a redefinition of their DataFrames when used in the context of streaming. However, the schema of the latest version of the table may be incompatible with the schema at the time of DataFrame creation.
checkpoint.partSize¶
spark.databricks.delta.checkpoint.partSize (internal) is the limit at which we will start parallelizing the checkpoint. We will attempt to write a maximum of this many actions per checkpoint.
Default: 5000000
commitInfo.enabled¶
spark.databricks.delta.commitInfo.enabled controls whether to log commit information into a Delta log.
Default: true
commitInfo.userMetadata¶
spark.databricks.delta.commitInfo.userMetadata is arbitrary user-defined metadata to include in CommitInfo (requires spark.databricks.delta.commitInfo.enabled).
Default: (empty)
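A sketch of tagging commits with custom metadata and reading it back via DESCRIBE HISTORY (the table name and metadata value are examples):

```scala
// Every commit made in this session carries the tag in its CommitInfo
spark.conf.set("spark.databricks.delta.commitInfo.userMetadata", "nightly-load-2021-10-01")

spark.range(5).write.format("delta").mode("append").saveAsTable("delta_demo")

// The tag shows up in the userMetadata column of the table history
spark.sql("DESCRIBE HISTORY delta_demo")
  .select("version", "operation", "userMetadata")
  .show(truncate = false)
```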
commitValidation.enabled¶
spark.databricks.delta.commitValidation.enabled (internal) controls whether to perform validation checks before a commit.
Default: true
convert.metadataCheck.enabled¶
spark.databricks.delta.convert.metadataCheck.enabled enables validation during CONVERT TO DELTA: if there is a difference between the catalog table's properties and the Delta table's configuration, the conversion fails with an error.
If disabled, the two configurations are merged with the same semantics as update and merge.
Default: true
dummyFileManager.numOfFiles¶
spark.databricks.delta.dummyFileManager.numOfFiles (internal) controls how many dummy files to write in DummyFileManager
Default: 3
dummyFileManager.prefix¶
spark.databricks.delta.dummyFileManager.prefix (internal) is the file prefix to use in DummyFileManager
Default: .s3-optimization-
history.maxKeysPerList¶
spark.databricks.delta.history.maxKeysPerList (internal) controls how many commits to list when performing a parallel search.
The default (1000) is the maximum number of keys returned by S3 per list call. Azure can return 5000, therefore 1000 was chosen.
Default: 1000
history.metricsEnabled¶
spark.databricks.delta.history.metricsEnabled enables metrics reporting in DESCRIBE HISTORY (CommitInfo will record the operation metrics). See the example after the list of uses below.
Default: true
Used when:
OptimisticTransactionImpl is requested to getOperationMetrics
ConvertToDeltaCommand is requested to streamWrite
SQLMetricsReporting is requested to registerSQLMetrics
TransactionalWrite is requested to writeFiles
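With metrics enabled, the operationMetrics column of DESCRIBE HISTORY carries the metrics recorded by CommitInfo (e.g. numFiles and numOutputRows for a write). A minimal sketch (delta_demo is a made-up table):

```scala
spark.conf.set("spark.databricks.delta.history.metricsEnabled", "true")

spark.range(10).write.format("delta").mode("overwrite").saveAsTable("delta_demo")

// operationMetrics is a map column filled in by the operations above
spark.sql("DESCRIBE HISTORY delta_demo")
  .select("version", "operation", "operationMetrics")
  .show(truncate = false)
```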
import.batchSize.schemaInference¶
spark.databricks.delta.import.batchSize.schemaInference (internal) is the number of files per batch for schema inference during import.
Default: 1000000
import.batchSize.statsCollection¶
spark.databricks.delta.import.batchSize.statsCollection (internal) is the number of files per batch for stats collection during import.
Default: 50000
maxSnapshotLineageLength¶
spark.databricks.delta.maxSnapshotLineageLength (internal) is the maximum lineage length of a Snapshot before Delta forces a Snapshot to be built from scratch.
Default: 50
merge.maxInsertCount¶
spark.databricks.delta.merge.maxInsertCount (internal) is the maximum row count of inserts in each MERGE execution
Default: 10000
merge.optimizeInsertOnlyMerge.enabled¶
spark.databricks.delta.merge.optimizeInsertOnlyMerge.enabled (internal) controls whether a merge without any matched clause (i.e., an insert-only merge) will be optimized by avoiding rewriting old files and just inserting new files.
Default: true
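An insert-only merge is a MERGE with no WHEN MATCHED clauses, for example (events and updates are made-up tables):

```scala
// No WHEN MATCHED clause: with the optimization enabled, Delta only appends
// new files instead of rewriting existing files of the target table
spark.sql("""
  MERGE INTO events AS t
  USING updates AS s
  ON t.id = s.id
  WHEN NOT MATCHED THEN INSERT *
""")
```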
merge.optimizeMatchedOnlyMerge.enabled¶
spark.databricks.delta.merge.optimizeMatchedOnlyMerge.enabled (internal) controls whether a merge without a 'when not matched' clause will be optimized to use a right outer join instead of a full outer join.
Default: true
merge.repartitionBeforeWrite.enabled¶
spark.databricks.delta.merge.repartitionBeforeWrite.enabled (internal) controls whether merge will repartition the output by the table's partition columns before writing the files.
Default: false
partitionColumnValidity.enabled¶
spark.databricks.delta.partitionColumnValidity.enabled (internal) enables validation of the partition column names (just like the data columns)
Default: true
retentionDurationCheck.enabled¶
spark.databricks.delta.retentionDurationCheck.enabled adds a check preventing users from running vacuum with a very short retention period, which may end up corrupting a Delta log.
Default: true
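For example, VACUUM with a retention period shorter than the default threshold (7 days) fails unless the check is disabled first. A sketch (delta_demo is a made-up table; do this only when no other readers or writers need the old files):

```scala
// Allow an aggressively short retention period -- risky on active tables
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
spark.sql("VACUUM delta_demo RETAIN 0 HOURS")
```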
sampling.enabled¶
spark.databricks.delta.sampling.enabled (internal) enables sample-based estimation
Default: false
schema.autoMerge.enabled¶
spark.databricks.delta.schema.autoMerge.enabled enables schema merging on appends and overwrites.
Default: false
Equivalent DataFrame option: mergeSchema
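A sketch of both ways of allowing a write to add new columns (delta_demo and the extra column are examples):

```scala
import org.apache.spark.sql.functions.lit

// Session-wide: merge schema changes on appends and overwrites
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

// Per-write equivalent: the mergeSchema DataFrame option
spark.range(5)
  .withColumn("extra", lit("new column"))
  .write
  .format("delta")
  .mode("append")
  .option("mergeSchema", "true")
  .saveAsTable("delta_demo")
```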
snapshotIsolation.enabled¶
spark.databricks.delta.snapshotIsolation.enabled (internal) controls whether queries on Delta tables are guaranteed to have snapshot isolation
Default: true
snapshotPartitions¶
spark.databricks.delta.snapshotPartitions (internal) is the number of partitions to use for state reconstruction (when building a snapshot of a Delta table).
Default: 50
stalenessLimit¶
spark.databricks.delta.stalenessLimit (in millis) allows you to query the last loaded state of the Delta table without blocking on a table update. You can use this configuration to reduce the latency of queries when up-to-date results are not a requirement. Table updates will be scheduled on a separate scheduler pool in a FIFO queue, and will share cluster resources fairly with your query. If a table hasn't been updated past this time limit, we will block on a synchronous state update before running the query.
Default: 0 (no tables can be stale)
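For example, to accept query results based on a table state that is at most 5 minutes old (the value is an example):

```scala
// 5 minutes in millis; table updates then happen asynchronously in a separate scheduler pool
spark.conf.set("spark.databricks.delta.stalenessLimit", (5 * 60 * 1000).toString)
```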
state.corruptionIsFatal¶
spark.databricks.delta.state.corruptionIsFatal (internal) throws a fatal error when the recreated Delta State doesn't match the committed checksum file.
Default: true
stateReconstructionValidation.enabled¶
spark.databricks.delta.stateReconstructionValidation.enabled (internal) controls whether to perform validation checks on the reconstructed state
Default: true
stats.collect¶
spark.databricks.delta.stats.collect (internal) enables statistics to be collected while writing files into a Delta table
Default: true
stats.limitPushdown.enabled¶
spark.databricks.delta.stats.limitPushdown.enabled (internal) enables using the limit clause and file statistics to prune files before they are collected to the driver
Default: true
stats.localCache.maxNumFiles¶
spark.databricks.delta.stats.localCache.maxNumFiles (internal) is the maximum number of files for a table to be considered a delta small table. Some metadata operations (such as using data skipping) are optimized for small tables using driver local caching and local execution.
Default: 2000
stats.skipping¶
spark.databricks.delta.stats.skipping (internal) enables using statistics for data skipping.
Default: true
timeTravel.resolveOnIdentifier.enabled¶
spark.databricks.delta.timeTravel.resolveOnIdentifier.enabled (internal) controls whether to resolve patterns (like @v123) in identifiers as time travel nodes.
Default: true
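With the resolution enabled, a Delta table can be time travelled by suffixing its identifier with a version pattern like @v123. A sketch (delta_demo, its path, and version 123 are examples):

```scala
// Read version 123 of the table via the @-suffixed identifier
spark.sql("SELECT * FROM delta_demo@v123").show()

// A path-based equivalent using the versionAsOf reader option
spark.read.format("delta").option("versionAsOf", "123").load("/tmp/delta/delta_demo").show()
```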