
Configuration Properties

Configuration properties (aka settings) allow you to fine-tune a Spark SQL application.

Configuration properties can be set on a SparkSession while creating a new instance using the config method (e.g. spark.sql.warehouse.dir).

import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder
  .master("local[*]")
  .appName("My Spark Application")
  .config("spark.sql.warehouse.dir", "c:/Temp") // (1)!
  .getOrCreate
  1. Sets spark.sql.warehouse.dir

You can also set a property using the SQL SET command.

assert(spark.conf.getOption("spark.sql.hive.metastore.version").isEmpty)

scala> spark.sql("SET spark.sql.hive.metastore.version=2.3.2").show(truncate = false)
+--------------------------------+-----+
|key                             |value|
+--------------------------------+-----+
|spark.sql.hive.metastore.version|2.3.2|
+--------------------------------+-----+

assert(spark.conf.get("spark.sql.hive.metastore.version") == "2.3.2")

spark.sql.adaptive

advisoryPartitionSizeInBytes

spark.sql.adaptive.advisoryPartitionSizeInBytes

The advisory size in bytes of the shuffle partition during adaptive optimization (when spark.sql.adaptive.enabled is enabled). It takes effect when Spark coalesces small shuffle partitions or splits skewed shuffle partitions.

Default: 64MB

Fallback Property: spark.sql.adaptive.shuffle.targetPostShuffleInputSize

Use SQLConf.ADVISORY_PARTITION_SIZE_IN_BYTES to reference the name.
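
A minimal sketch of setting these AQE properties while building a SparkSession; the local[*] master, the application name and the 32MB target are arbitrary example values.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[*]")
  .appName("AQE advisory partition size demo")
  .config("spark.sql.adaptive.enabled", true)                        // AQE must be enabled
  .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "32MB") // target size for coalescing/splitting
  .getOrCreate

assert(spark.conf.get("spark.sql.adaptive.advisoryPartitionSizeInBytes") == "32MB")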

autoBroadcastJoinThreshold

spark.sql.adaptive.autoBroadcastJoinThreshold

The maximum size (in bytes) of a table to be broadcast when performing a join. -1 turns broadcasting off. The default value is the same as spark.sql.autoBroadcastJoinThreshold.

Used only in Adaptive Query Execution

Default: (undefined)

Available as SQLConf.ADAPTIVE_AUTO_BROADCASTJOIN_THRESHOLD value.

coalescePartitions.enabled

spark.sql.adaptive.coalescePartitions.enabled

Controls coalescing shuffle partitions

When true and spark.sql.adaptive.enabled is enabled, Spark will coalesce contiguous shuffle partitions according to the target size (specified by spark.sql.adaptive.advisoryPartitionSizeInBytes), to avoid too many small tasks.

Default: true

Use SQLConf.coalesceShufflePartitionsEnabled method to access the current value.

coalescePartitions.minPartitionSize

spark.sql.adaptive.coalescePartitions.minPartitionSize

The minimum size (in bytes unless specified) of shuffle partitions after coalescing. This is useful when the adaptively calculated target size is too small during partition coalescing

Default: 1MB

Use SQLConf.COALESCE_PARTITIONS_MIN_PARTITION_SIZE to reference the property.

coalescePartitions.initialPartitionNum

spark.sql.adaptive.coalescePartitions.initialPartitionNum

The initial number of shuffle partitions before coalescing.

By default it equals spark.sql.shuffle.partitions. This configuration only has an effect when spark.sql.adaptive.enabled and spark.sql.adaptive.coalescePartitions.enabled are both enabled.

Default: (undefined)

coalescePartitions.parallelismFirst

spark.sql.adaptive.coalescePartitions.parallelismFirst

When true, Spark does not respect the target size specified by spark.sql.adaptive.advisoryPartitionSizeInBytes when coalescing contiguous shuffle partitions, but adaptively calculates the target size according to the default parallelism of the Spark cluster. The calculated size is usually smaller than the configured target size. This is to maximize the parallelism and avoid performance regression when enabling adaptive query execution. It's recommended to set this config to false and respect the configured target size.

Default: true

Use SQLConf.COALESCE_PARTITIONS_PARALLELISM_FIRST to reference the property.

customCostEvaluatorClass

spark.sql.adaptive.customCostEvaluatorClass

The fully-qualified class name of the CostEvaluator in Adaptive Query Execution

Default: SimpleCostEvaluator

Use SQLConf.ADAPTIVE_CUSTOM_COST_EVALUATOR_CLASS method to access the property (in a type-safe way).

Used when:

enabled

spark.sql.adaptive.enabled

Enables Adaptive Query Execution

Default: true

Use SQLConf.adaptiveExecutionEnabled method to access the current value.

fetchShuffleBlocksInBatch

spark.sql.adaptive.fetchShuffleBlocksInBatch

(internal) Whether to fetch the contiguous shuffle blocks in batch. Instead of fetching blocks one by one, fetching contiguous shuffle blocks for the same map task in batch can reduce IO and improve performance. Note that multiple contiguous blocks in a single fetch request only happen when spark.sql.adaptive.enabled and spark.sql.adaptive.coalescePartitions.enabled are both enabled. This feature also depends on a relocatable serializer, the concatenation support codec in use, and the new version shuffle fetch protocol.

Default: true

Use SQLConf.fetchShuffleBlocksInBatch method to access the current value.

forceApply

spark.sql.adaptive.forceApply

(internal) When true (together with spark.sql.adaptive.enabled enabled), Spark will force apply adaptive query execution for all supported queries.

Default: false

Use SQLConf.ADAPTIVE_EXECUTION_FORCE_APPLY method to access the property (in a type-safe way).

forceOptimizeSkewedJoin

spark.sql.adaptive.forceOptimizeSkewedJoin

Enables OptimizeSkewedJoin physical optimization to be executed even if it introduces extra shuffle

Default: false

Requires spark.sql.adaptive.skewJoin.enabled to be enabled

Use SQLConf.ADAPTIVE_FORCE_OPTIMIZE_SKEWED_JOIN to access the property (in a type-safe way).

Used when:

localShuffleReader.enabled

spark.sql.adaptive.localShuffleReader.enabled

When true (and spark.sql.adaptive.enabled is true), Spark SQL tries to use local shuffle reader to read the shuffle data when the shuffle partitioning is not needed, for example, after converting sort-merge join to broadcast-hash join.

Default: true

Use SQLConf.LOCAL_SHUFFLE_READER_ENABLED to access the property (in a type-safe way)

logLevel

spark.sql.adaptive.logLevel

(internal) Log level for adaptive execution logging of plan changes. The value can be TRACE, DEBUG, INFO, WARN or ERROR.

Default: DEBUG

Use SQLConf.adaptiveExecutionLogLevel for the current value

maxShuffledHashJoinLocalMapThreshold

spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold

The maximum size (in bytes) per partition that can be allowed to build a local hash map. If this value is not smaller than spark.sql.adaptive.advisoryPartitionSizeInBytes and all partition sizes are not larger than this config, join selection prefers to use shuffled hash join instead of sort merge join regardless of the value of spark.sql.join.preferSortMergeJoin.

Default: 0

Available as SQLConf.ADAPTIVE_MAX_SHUFFLE_HASH_JOIN_LOCAL_MAP_THRESHOLD

nonEmptyPartitionRatioForBroadcastJoin

spark.sql.adaptive.nonEmptyPartitionRatioForBroadcastJoin

(internal) A relation with a non-empty partition ratio (the number of non-empty partitions to all partitions) lower than this config will not be considered as the build side of a broadcast-hash join in Adaptive Query Execution regardless of the size.

Effective with spark.sql.adaptive.enabled true

Default: 0.2

Use SQLConf.nonEmptyPartitionRatioForBroadcastJoin method to access the current value.

optimizer.excludedRules

spark.sql.adaptive.optimizer.excludedRules

A comma-separated list of rules (names) to be disabled (excluded) in the AQE Logical Optimizer

Default: undefined

Use SQLConf.ADAPTIVE_OPTIMIZER_EXCLUDED_RULES to reference the property.

Used when:

  • AQEOptimizer is requested for the batches

optimizeSkewsInRebalancePartitions.enabled

spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled

When true and spark.sql.adaptive.enabled is true, Spark SQL will optimize the skewed shuffle partitions in RebalancePartitions and split them into smaller ones according to the target size (specified by spark.sql.adaptive.advisoryPartitionSizeInBytes), to avoid data skew

Default: true

Use SQLConf.ADAPTIVE_OPTIMIZE_SKEWS_IN_REBALANCE_PARTITIONS_ENABLED method to access the property (in a type-safe way)

skewJoin.enabled

spark.sql.adaptive.skewJoin.enabled

When true and spark.sql.adaptive.enabled is enabled, Spark dynamically handles skew in sort-merge join by splitting (and replicating if needed) skewed partitions.

Default: true

Use SQLConf.SKEW_JOIN_ENABLED to reference the property.

skewJoin.skewedPartitionFactor

spark.sql.adaptive.skewJoin.skewedPartitionFactor

A partition is considered skewed if its size is larger than this factor multiplying the median partition size and also larger than spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes.

Default: 5

skewJoin.skewedPartitionThresholdInBytes

spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes

A partition is considered skewed if its size in bytes is larger than this threshold and also larger than spark.sql.adaptive.skewJoin.skewedPartitionFactor multiplying the median partition size. Ideally this config should be set larger than spark.sql.adaptive.advisoryPartitionSizeInBytes.

Default: 256MB

allowNamedFunctionArguments

spark.sql.allowNamedFunctionArguments

Controls support for named parameters in function calls in SQL statements

Default: true

Used when:

autoBroadcastJoinThreshold

spark.sql.autoBroadcastJoinThreshold

Maximum size (in bytes) for a table that can be broadcast (to all worker nodes) in a join

Default: 10M

-1 (or any negative value) disables broadcasting

Use SQLConf.autoBroadcastJoinThreshold method to access the current value.
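
A sketch of how the threshold influences the join strategy, assuming an active SparkSession named spark; the DataFrames and sizes are arbitrary, and the selected plan may still vary with statistics and AQE.

val large = spark.range(1000000).withColumnRenamed("id", "k")
val small = spark.range(100).withColumnRenamed("id", "k")

// Disable automatic broadcast joins for this session
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
large.join(small, "k").explain()   // typically a sort-merge join

// Restore the default 10MB threshold; the small side becomes eligible for broadcast again
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10MB")
large.join(small, "k").explain()   // typically a broadcast hash join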

cache.serializer

spark.sql.cache.serializer

The name of CachedBatchSerializer implementation to translate SQL data into a format that can more efficiently be cached.

Default: org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer

spark.sql.cache.serializer is a StaticSQLConf

Use SQLConf.SPARK_CACHE_SERIALIZER for the name

Used when:

spark.sql.codegen

aggregate.fastHashMap.capacityBit

spark.sql.codegen.aggregate.fastHashMap.capacityBit

(internal) Capacity for the max number of rows to be held in memory by the fast hash aggregate product operator. The bit is not the actual value; the actual numBuckets is determined by loadFactor (e.g., with the default bit value of 16, the actual numBuckets is (1 << 16) / 0.5).

Default: 16

Must be in the range of [10, 30] (inclusive)

Use SQLConf.fastHashAggregateRowMaxCapacityBit for the current value

Used when:

aggregate.map.twolevel.enabled

spark.sql.codegen.aggregate.map.twolevel.enabled

(internal) Enable two-level aggregate hash map. When enabled, records will first be inserted/looked-up at a 1st-level, small, fast map, and then fallback to a 2nd-level, larger, slower map when 1st level is full or keys cannot be found. When disabled, records go directly to the 2nd level.

Default: true

Use SQLConf.enableTwoLevelAggMap for the current value

Used when:

aggregate.map.vectorized.enable

spark.sql.codegen.aggregate.map.vectorized.enable

(internal) Enables vectorized aggregate hash map. For testing/benchmarking only.

Default: false

Use SQLConf.enableVectorizedHashMap for the current value

Used when:

aggregate.map.twolevel.partialOnly

spark.sql.codegen.aggregate.map.twolevel.partialOnly

(internal) Enables two-level aggregate hash map for partial aggregates only, because a final aggregate might see more distinct keys than a partial aggregate. The overhead of looking up the 1st-level map might dominate when there are a lot of distinct keys.

Default: true

Used when:

hugeMethodLimit

spark.sql.codegen.hugeMethodLimit

(internal) The maximum bytecode size of a single compiled Java function generated by whole-stage codegen. When the compiled code has a function that exceeds this threshold, the whole-stage codegen is deactivated for this subtree of the query plan.

Default: 65535

The default value 65535 is the largest bytecode size possible for a valid Java method. When running on HotSpot, it may be preferable to set the value to 8000 (which is the value of HugeMethodLimit in the OpenJDK JVM settings)

Use SQLConf.hugeMethodLimit method to access the current value.

Used when:

  • WholeStageCodegenExec physical operator is executed

fallback

spark.sql.codegen.fallback

(internal) Whether the whole-stage codegen can be temporarily disabled for the part of a query that has failed to compile generated code (true) or not (false).

Default: true

Use SQLConf.wholeStageFallback method to access the current value.

Used when:

  • WholeStageCodegenExec physical operator is executed

join.fullOuterShuffledHashJoin.enabled

spark.sql.codegen.join.fullOuterShuffledHashJoin.enabled

(internal) Enables Whole-Stage Code Generation for FULL OUTER shuffled hash join

Default: true

Use SQLConf.ENABLE_FULL_OUTER_SHUFFLED_HASH_JOIN_CODEGEN to access the property

methodSplitThreshold

spark.sql.codegen.methodSplitThreshold

(internal) The threshold of source-code splitting in the codegen. When the number of characters in a single Java function (without comments) exceeds the threshold, the function will be automatically split into multiple smaller ones. We cannot know how much bytecode will be generated, so the code length is used as the metric. When running on HotSpot, a function's bytecode should not exceed 8KB, otherwise it will not be JIT-compiled; it also should not be too small, otherwise there will be many function calls.

Default: 1024

Use SQLConf.methodSplitThreshold for the current value

wholeStage

spark.sql.codegen.wholeStage

(internal) Whether the whole stage (of multiple physical operators) will be compiled into a single Java method (true) or not (false).

Default: true

Use SQLConf.wholeStageEnabled method to access the current value.

columnVector.offheap.enabled

spark.sql.columnVector.offheap.enabled

(internal) Enables OffHeapColumnVector (true) or OnHeapColumnVector (false) in ColumnarBatch

Default: false

Use SQLConf.offHeapColumnVectorEnabled for the current value

Used when:

defaultColumn.enabled

spark.sql.defaultColumn.enabled

(internal) When true, allows CREATE TABLE, REPLACE TABLE, and ALTER COLUMN statements to set or update default values for specific columns. Subsequent INSERT, MERGE, and UPDATE statements may then omit these values, and the default values will be injected automatically instead.

Default: true

Use SQLConf.enableDefaultColumns for the current value

Used when:

exchange.reuse

spark.sql.exchange.reuse

(internal) When enabled (true), the Spark planner will find duplicated exchanges and subqueries and re-use them.

When disabled (false), ReuseExchange and ReuseSubquery physical optimizations (that the Spark planner uses for physical query plan optimization) do nothing.

Default: true

Use SQLConf.exchangeReuseEnabled for the current value

spark.sql.execution

arrow.pyspark.enabled

spark.sql.execution.arrow.pyspark.enabled

When true, make use of Apache Arrow for columnar data transfers in PySpark. This optimization applies to:

  1. pyspark.sql.DataFrame.toPandas
  2. pyspark.sql.SparkSession.createDataFrame when its input is a Pandas DataFrame

The following data types are unsupported: BinaryType, MapType, ArrayType of TimestampType, and nested StructType.

Default: false

arrow.pyspark.fallback.enabled

spark.sql.execution.arrow.pyspark.fallback.enabled

When true, optimizations enabled by spark.sql.execution.arrow.pyspark.enabled will fallback automatically to non-optimized implementations if an error occurs.

Default: true

arrow.pyspark.selfDestruct.enabled

spark.sql.execution.arrow.pyspark.selfDestruct.enabled

(Experimental) When true, make use of Apache Arrow's self-destruct and split-blocks options for columnar data transfers in PySpark, when converting from Arrow to Pandas. This reduces memory usage at the cost of some CPU time. Applies to: pyspark.sql.DataFrame.toPandas when spark.sql.execution.arrow.pyspark.enabled is true.

Default: false

Use SQLConf.arrowPySparkSelfDestructEnabled for the current value

pandas.convertToArrowArraySafely

spark.sql.execution.pandas.convertToArrowArraySafely

(internal) When true, Arrow will perform safe type conversion when converting pandas.Series to an Arrow array during serialization. Arrow will raise errors when detecting unsafe type conversions like overflow. When false, Arrow's type checks are disabled and type conversions are done anyway. This config only works for Arrow 0.11.0+.

Default: false

pandas.udf.buffer.size

spark.sql.execution.pandas.udf.buffer.size

Same as spark.buffer.size but only applies to Pandas UDF executions. If it is not set, the fallback is spark.buffer.size. Note that Pandas execution requires more than 4 bytes. Lowering this value could make small Pandas UDF batches iterated and pipelined; however, it might degrade performance. See SPARK-27870.

Default: 65536

rangeExchange.sampleSizePerPartition

spark.sql.execution.rangeExchange.sampleSizePerPartition

(internal) Number of points to sample per partition in order to determine the range boundaries for range partitioning, typically used in global sorting (without limit).

Default: 100

Use SQLConf.rangeExchangeSampleSizePerPartition method to access the current value.

removeRedundantSorts

spark.sql.execution.removeRedundantSorts

(internal) Whether to remove redundant physical sort node

Default: true

Used as SQLConf.REMOVE_REDUNDANT_SORTS_ENABLED

replaceHashWithSortAgg

spark.sql.execution.replaceHashWithSortAgg

(internal) Enables replacing hash aggregate operators (i.e., HashAggregateExec and ObjectHashAggregateExec) with SortAggregateExec based on children's ordering

Default: false

Used when:

reuseSubquery

spark.sql.execution.reuseSubquery

(internal) When true, the planner will try to find duplicated subqueries and re-use them.

Default: true

Use SQLConf.subqueryReuseEnabled for the current value

sortBeforeRepartition

spark.sql.execution.sortBeforeRepartition

(internal) When performing a repartition following a shuffle, the output row ordering would be nondeterministic. If some downstream stages fail and some tasks of the repartition stage retry, these tasks may generate different data, and that can lead to correctness issues. Turn on this config to insert a local sort before actually doing the repartition to generate consistent repartition results. The performance of repartition() may go down since an extra local sort is inserted before it.

Default: true

Use SQLConf.sortBeforeRepartition method to access the current value.

usePartitionEvaluator

spark.sql.execution.usePartitionEvaluator

(internal) Enables PartitionEvaluator (Spark Core) to execute physical operators (using RDD.mapPartitionsWithEvaluator operator)

Default: false

Use SQLConf.usePartitionEvaluator for the current value

Used when:

  • FilterExec physical operator is executed
  • ProjectExec physical operator is executed
  • ColumnarToRowExec physical operator is executed
  • RowToColumnarExec physical operator is executed
  • WholeStageCodegenExec physical operator is executed
  • SortMergeJoinExec physical operator is executed
  • MapInBatchExec (PySpark) physical operator is executed
  • WindowGroupLimitExec physical operator is executed

spark.sql.hive

filesourcePartitionFileCacheSize

spark.sql.hive.filesourcePartitionFileCacheSize

When greater than 0, enables caching of partition file metadata in memory (using SharedInMemoryCache). All tables share a cache that can use up to specified num bytes for file metadata.

Requires spark.sql.hive.manageFilesourcePartitions to be enabled

Default: 250 * 1024 * 1024

Use SQLConf.filesourcePartitionFileCacheSize for the current value

Used when:

manageFilesourcePartitions

spark.sql.hive.manageFilesourcePartitions

Enables metastore partition management for file source tables.

This includes both datasource and Hive tables. When partition management is enabled, datasource tables store partitions in the Hive metastore, and use the metastore to prune partitions during query planning when spark.sql.hive.metastorePartitionPruning is enabled.

Default: true

Use SQLConf.manageFilesourcePartitions for the current value

Used when:

inMemoryColumnarStorage.partitionPruning

spark.sql.inMemoryColumnarStorage.partitionPruning

(internal) Enables partition pruning for in-memory columnar tables

Default: true

Use SQLConf.inMemoryPartitionPruning for the current value

Used when:

spark.sql.optimizer

canChangeCachedPlanOutputPartitioning

spark.sql.optimizer.canChangeCachedPlanOutputPartitioning

(internal) Whether to forcibly enable some optimization rules that can change the output partitioning of a cached query when executing it for caching. If it is set to true, queries may need an extra shuffle to read the cached data. This configuration is disabled by default. Currently, the optimization rules enabled by this configuration are spark.sql.adaptive.enabled and spark.sql.sources.bucketing.autoBucketedScan.enabled.

Default: false

Use SQLConf.CAN_CHANGE_CACHED_PLAN_OUTPUT_PARTITIONING to access the property

decorrelateInnerQuery.enabled

spark.sql.optimizer.decorrelateInnerQuery.enabled

(internal) Decorrelates inner queries by eliminating correlated references and building domain joins

Default: true

Use SQLConf.decorrelateInnerQueryEnabled for the current value

dynamicPartitionPruning.fallbackFilterRatio

spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio

(internal) When statistics are not available or configured not to be used, this config will be used as the fallback filter ratio for computing the data size of the partitioned table after dynamic partition pruning, in order to evaluate if it is worth adding an extra subquery as the pruning filter if broadcast reuse is not applicable.

Default: 0.5

Use SQLConf.dynamicPartitionPruningFallbackFilterRatio method to access the current value.

dynamicPartitionPruning.pruningSideExtraFilterRatio

spark.sql.optimizer.dynamicPartitionPruning.pruningSideExtraFilterRatio

(internal) When the filtering side doesn't support broadcast by join type, doing DPP means running an extra query that may have significant overhead. This config is used as the extra filter ratio for computing the data size of the pruning side after DPP, in order to evaluate whether it is worth adding an extra subquery as the pruning filter.

Must be a double between 0.0 and 1.0

Default: 0.04

Use SQLConf.dynamicPartitionPruningPruningSideExtraFilterRatio to access the current value.

dynamicPartitionPruning.useStats

spark.sql.optimizer.dynamicPartitionPruning.useStats

(internal) When true, distinct count statistics will be used for computing the data size of the partitioned table after dynamic partition pruning, in order to evaluate if it is worth adding an extra subquery as the pruning filter if broadcast reuse is not applicable.

Default: true

Use SQLConf.dynamicPartitionPruningUseStats for the current value

Used when:

dynamicPartitionPruning.enabled

spark.sql.optimizer.dynamicPartitionPruning.enabled

Enables generating predicates for partition columns used as join keys

Default: true

Use SQLConf.dynamicPartitionPruningEnabled for the current value

Used to control whether to execute the following optimizations or skip them altogether:

dynamicPartitionPruning.reuseBroadcastOnly

spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly

(internal) When true, dynamic partition pruning will only apply when the broadcast exchange of a broadcast hash join operation can be reused as the dynamic pruning filter.

Default: true

Use SQLConf.dynamicPartitionPruningReuseBroadcastOnly for the current value

Used when:

enableCsvExpressionOptimization

spark.sql.optimizer.enableCsvExpressionOptimization

Whether to optimize CSV expressions in SQL optimizer. It includes pruning unnecessary columns from from_csv.

Default: true

Use SQLConf.csvExpressionOptimization for the current value

excludedRules

spark.sql.optimizer.excludedRules

Comma-separated list of fully-qualified class names of the optimization rules that should be disabled (excluded) from logical query optimization.

Default: (empty)

Use SQLConf.optimizerExcludedRules method to access the current value.

Important

It is not guaranteed that all the rules to be excluded will eventually be excluded, as some rules are non-excludable.

expressionProjectionCandidateLimit

spark.sql.optimizer.expressionProjectionCandidateLimit

(internal) The maximum number of candidate output expressions whose aliases are replaced. This can preserve the output partitioning and ordering. A negative value disables this optimization.

Default: 100

Used when:

expression.nestedPruning.enabled

spark.sql.optimizer.expression.nestedPruning.enabled

(internal) Prune nested fields from expressions in an operator which are unnecessary in satisfying a query. Note that this optimization doesn't prune nested fields from physical data source scanning. For pruning nested fields from scanning, please use spark.sql.optimizer.nestedSchemaPruning.enabled config.

Default: true

inSetConversionThreshold

spark.sql.optimizer.inSetConversionThreshold

(internal) The threshold of set size for InSet conversion.

Default: 10

Use SQLConf.optimizerInSetConversionThreshold method to access the current value.

inSetSwitchThreshold

spark.sql.optimizer.inSetSwitchThreshold

(internal) Configures the max set size in InSet for which Spark will generate code with switch statements. This is applicable only to bytes, shorts, ints, dates.

Must be non-negative and less than or equal to 600.

Default: 400

maxIterations

spark.sql.optimizer.maxIterations

Maximum number of iterations for Analyzer and Logical Optimizer.

Default: 100

nestedSchemaPruning.enabled

spark.sql.optimizer.nestedSchemaPruning.enabled

(internal) Prune nested fields from the output of a logical relation that are not necessary in satisfying a query. This optimization allows columnar file format readers to avoid reading unnecessary nested column data.

Default: true

Use SQLConf.nestedSchemaPruningEnabled method to access the current value.

nestedPredicatePushdown.supportedFileSources

spark.sql.optimizer.nestedPredicatePushdown.supportedFileSources

(internal) A comma-separated list of data source short names or fully qualified data source implementation class names for which Spark tries to push down predicates for nested columns and/or names containing dots to data sources. This configuration is only effective with file-based data source in DSv1. Currently, Parquet implements both optimizations while ORC only supports predicates for names containing dots. The other data sources don't support this feature yet.

Default: parquet,orc

optimizeOneRowRelationSubquery

spark.sql.optimizer.optimizeOneRowRelationSubquery

(internal) When true, the optimizer will inline subqueries with OneRowRelation leaf nodes

Default: true

Use SQLConf.OPTIMIZE_ONE_ROW_RELATION_SUBQUERY method to access the property (in a type-safe way)

planChangeLog.batches

spark.sql.optimizer.planChangeLog.batches

(internal) Configures a list of batches to be logged in the optimizer, in which the batches are specified by their batch names and separated by comma.

Default: (undefined)

planChangeLog.level

spark.sql.optimizer.planChangeLog.level

(internal) Configures the log level for logging the change from the original plan to the new plan after a rule or batch is applied. The value can be TRACE, DEBUG, INFO, WARN or ERROR.

Default: TRACE

planChangeLog.rules

spark.sql.optimizer.planChangeLog.rules

(internal) Configures a list of rules to be logged in the optimizer, in which the rules are specified by their rule names and separated by comma.

Default: (undefined)

propagateDistinctKeys.enabled

spark.sql.optimizer.propagateDistinctKeys.enabled

(internal) Controls whether the Logical Query Optimizer propagates the distinct attributes of logical operators for query optimization

Default: true

Used when:

  • LogicalPlanDistinctKeys logical operator is requested for the distinct keys

replaceExceptWithFilter

spark.sql.optimizer.replaceExceptWithFilter

(internal) When true, the apply function of the rule verifies whether the right node of the except operation is of type Filter or Project followed by Filter. If so, the rule further verifies the following conditions:

  1. Excluding the filter operations from the right (as well as the left node, if any) on the top, whether both nodes evaluate to the same result
  2. The left and right nodes don't contain any SubqueryExpressions
  3. The output column names of the left node are distinct

If all the conditions are met, the rule will replace the except operation with a Filter by flipping the filter condition(s) of the right node.

Default: true

runtime.bloomFilter.creationSideThreshold

spark.sql.optimizer.runtime.bloomFilter.creationSideThreshold

Size threshold of the bloom filter creation side plan. Estimated size needs to be under this value to try to inject bloom filter.

Default: 10MB

Use SQLConf.runtimeFilterCreationSideThreshold for the current value

Used when:

runtime.bloomFilter.enabled

spark.sql.optimizer.runtime.bloomFilter.enabled

Enables a bloom filter on one side of a shuffle join if the other side has a selective predicate (to reduce the amount of shuffle data)

Default: true

Use SQLConf.runtimeFilterBloomFilterEnabled for the current value

Used when:

runtime.bloomFilter.expectedNumItems

spark.sql.optimizer.runtime.bloomFilter.expectedNumItems

The default number of expected items for the runtime bloomfilter

Default: 1000000L

SQLConf.RUNTIME_BLOOM_FILTER_EXPECTED_NUM_ITEMS

Used when:

runtime.bloomFilter.maxNumBits

spark.sql.optimizer.runtime.bloomFilter.maxNumBits

Maximum number of bits for the runtime bloom filter

Default: 67108864L (8MB)

Must be a non-zero positive number

SQLConf.RUNTIME_BLOOM_FILTER_MAX_NUM_BITS

Used when:

runtime.rowLevelOperationGroupFilter.enabled

spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled

Enables runtime group filtering for group-based row-level operations.

Data sources that replace groups of data (e.g. files, partitions) may prune entire groups using provided data source filters when planning a row-level operation scan. However, such filtering is limited as not all expressions can be converted into data source filters and some expressions can only be evaluated by Spark (e.g. subqueries). Since rewriting groups is expensive, Spark can execute a query at runtime to find what records match the condition of the row-level operation. The information about matching records will be passed back to the row-level operation scan, allowing data sources to discard groups that don't have to be rewritten.

Default: true

Current value: SQLConf.runtimeRowLevelOperationGroupFilterEnabled

Used when:

  • RowLevelOperationRuntimeGroupFiltering logical optimization is executed

runtimeFilter.number.threshold

spark.sql.optimizer.runtimeFilter.number.threshold

The total number of injected runtime filters (non-DPP) for a single query. This is to prevent driver OOMs with too many Bloom filters.

Default: 10

Must be a non-zero positive number

SQLConf.RUNTIME_FILTER_NUMBER_THRESHOLD

Used when:

runtimeFilter.semiJoinReduction.enabled

spark.sql.optimizer.runtimeFilter.semiJoinReduction.enabled

Enables inserting a semi join on one side of a shuffle join if the other side has a selective predicate (to reduce the amount of shuffle data)

Default: false

Use SQLConf.runtimeFilterSemiJoinReductionEnabled for the current value

Used when:

serializer.nestedSchemaPruning.enabled

spark.sql.optimizer.serializer.nestedSchemaPruning.enabled

(internal) Prune nested fields from object serialization operator which are unnecessary in satisfying a query. This optimization allows object serializers to avoid executing unnecessary nested expressions.

Default: true

spark.sql

retainGroupColumns

spark.sql.retainGroupColumns

(internal) Controls whether to include (retain) grouping columns or not in Aggregation Queries

Default: true

Use SQLConf.dataFrameRetainGroupColumns for the current value
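
A sketch of the effect on aggregation results, assuming an active SparkSession named spark; the column names are arbitrary.

import org.apache.spark.sql.functions.col

val df = spark.range(4).withColumn("g", col("id") % 2)

// Default (true): the grouping column g is part of the result
df.groupBy("g").count().show()

// When disabled, only the aggregate columns are returned
spark.conf.set("spark.sql.retainGroupColumns", false)
df.groupBy("g").count().show()
spark.conf.set("spark.sql.retainGroupColumns", true)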

spark.sql.files

maxPartitionBytes

spark.sql.files.maxPartitionBytes

Maximum number of bytes to pack into a single partition when reading files for file-based data sources (e.g., Parquet)

Default: 128MB (like parquet.block.size)

Use SQLConf.filesMaxPartitionBytes for the current value

Used when:

maxPartitionNum

spark.sql.files.maxPartitionNum

The suggested (not guaranteed) maximum number of split file partitions. If set, Spark will rescale each partition to make the number of partitions close to this value if the initial number of partitions exceeds this value.

Effective only with file-based sources such as Parquet, JSON and ORC.

Default: (undefined)

Use SQLConf.filesMaxPartitionNum for the current value

Used when:

maxRecordsPerFile

spark.sql.files.maxRecordsPerFile

Maximum number of records to write out to a single file. If 0 or negative, there is no limit.

Default: 0

Use SQLConf.maxRecordsPerFile method for the current value

Used when:

minPartitionNum

spark.sql.files.minPartitionNum

Hint about the minimum number of partitions for file-based data sources (e.g., Parquet)

Default: spark.sql.leafNodeDefaultParallelism

Use SQLConf.filesMinPartitionNum for the current value

Used when:

openCostInBytes

spark.sql.files.openCostInBytes

(internal) The estimated cost to open a file, measured by the number of bytes that could be scanned at the same time (used when packing multiple files into a partition). Effective only for file-based sources such as Parquet, JSON and ORC.

Default: 4MB

It's better to over-estimate it; then the partitions with small files will be faster than partitions with bigger files (which are scheduled first).

Use SQLConf.filesOpenCostInBytes for the current value

Used when:

spark.sql.parquet

aggregatePushdown

spark.sql.parquet.aggregatePushdown

Controls aggregate pushdown in parquet connector

Supports MIN, MAX and COUNT as aggregate expressions:

  • For MIN/MAX, boolean, integer, float and date types are supported.
  • For COUNT, all data types are supported.

If statistics are missing from any Parquet file footer, an exception is thrown.

Default: false

Use SQLConf.parquetAggregatePushDown for the current value

Used when:

columnarReaderBatchSize

spark.sql.parquet.columnarReaderBatchSize

The number of rows to include in a parquet vectorized reader batch (the capacity of VectorizedParquetRecordReader)

Default: 4096 (4k)

The number should be carefully chosen to minimize overhead and avoid OOMs while reading data.

Use SQLConf.parquetVectorizedReaderBatchSize for the current value

Used when:

enableNestedColumnVectorizedReader

spark.sql.parquet.enableNestedColumnVectorizedReader

Enables vectorized parquet decoding for nested columns (e.g., arrays, structs and maps). Requires spark.sql.parquet.enableVectorizedReader to be enabled

Default: true

Use SQLConf.parquetVectorizedReaderNestedColumnEnabled for the current value

Used when:

filterPushdown

spark.sql.parquet.filterPushdown

Controls filter predicate push-down optimization for parquet connector

Default: true

Use SQLConf.parquetFilterPushDown for the current value

Used when:

filterPushdown.stringPredicate

spark.sql.parquet.filterPushdown.stringPredicate

(internal) Controls Parquet filter push-down optimization for string predicate such as startsWith/endsWith/contains functions. Effective only with spark.sql.parquet.filterPushdown enabled.

Fallback Property: spark.sql.parquet.filterPushdown.string.startsWith

Use SQLConf.parquetFilterPushDownStringPredicate for the current value

Used when:

mergeSchema

spark.sql.parquet.mergeSchema

Controls whether the Parquet data source merges schemas collected from all data files or not. If false, the schema is picked from the summary file or a random data file if no summary file is available.

Default: false

Use SQLConf.isParquetSchemaMergingEnabled for the current value

Parquet option (of higher priority): mergeSchema

Used when:

  • ParquetOptions is created (and initializes mergeSchema option)
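
A sketch of schema merging and the read-time option taking precedence, assuming an active SparkSession named spark; /tmp/parquet_merge_demo is a hypothetical path.

import org.apache.spark.sql.functions.col

// Write two sets of Parquet files with different (compatible) schemas
spark.range(3).select(col("id"), (col("id") * col("id")).as("square"))
  .write.parquet("/tmp/parquet_merge_demo/key=1")
spark.range(3).select(col("id"), (col("id") * col("id") * col("id")).as("cube"))
  .write.parquet("/tmp/parquet_merge_demo/key=2")

// The read-time mergeSchema option overrides spark.sql.parquet.mergeSchema
val merged = spark.read.option("mergeSchema", "true").parquet("/tmp/parquet_merge_demo")
merged.printSchema()   // id, square, cube, key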

output.committer.class

spark.sql.parquet.output.committer.class

(internal) The output committer class used by parquet data source. The specified class needs to be a subclass of org.apache.hadoop.mapreduce.OutputCommitter. Typically, it's also a subclass of org.apache.parquet.hadoop.ParquetOutputCommitter. If it is not, then metadata summaries will never be created, irrespective of the value of parquet.summary.metadata.level.

Default: org.apache.parquet.hadoop.ParquetOutputCommitter

Use SQLConf.parquetOutputCommitterClass for the current value

Used when:

spark.sql.sources

bucketing.enabled

spark.sql.sources.bucketing.enabled

Enables Bucketing

Default: true

When disabled (i.e. false), bucketed tables are considered regular (non-bucketed) tables.

Use SQLConf.bucketingEnabled method for the current value

commitProtocolClass

spark.sql.sources.commitProtocolClass

(internal) Fully-qualified class name of a FileCommitProtocol (Spark Core) for Transactional Writes

Default: SQLHadoopMapReduceCommitProtocol

Use SQLConf.fileCommitProtocolClass method for the current value

Used when:

outputCommitterClass

spark.sql.sources.outputCommitterClass

(internal) The fully-qualified class name of the user-defined Hadoop OutputCommitter for SQLHadoopMapReduceCommitProtocol

Default: (undefined)

Use SQLConf.OUTPUT_COMMITTER_CLASS to access the property

Note

ParquetUtils uses spark.sql.parquet.output.committer.class or the default ParquetOutputCommitter instead.

spark.sql.objectHashAggregate.sortBased.fallbackThreshold

(internal) The number of entries in an in-memory hash map (to store aggregation buffers per grouping keys) before ObjectHashAggregateExec (ObjectAggregationIterator, precisely) falls back to sort-based aggregation

Default: 128 (entries)

Use SQLConf.objectAggSortBasedFallbackThreshold for the current value

Learn more in Demo: ObjectHashAggregateExec and Sort-Based Fallback Tasks

spark.sql.legacy.allowNonEmptyLocationInCTAS

(internal) When false, CTAS with LOCATION throws an analysis exception if the location is not empty.

Default: false

Use SQLConf.allowNonEmptyLocationInCTAS for the current value

spark.sql.legacy.allowAutoGeneratedAliasForView

(internal) When true, it's allowed to use an input query without explicit alias when creating a permanent view.

Default: false

Use SQLConf.allowAutoGeneratedAliasForView for the current value

spark.sql.sessionWindow.buffer.spill.threshold

(internal) The threshold for the number of rows to be spilled by the window operator. Note that the buffer is used only for queries where Spark SQL cannot apply aggregations while determining the session window.

Default: spark.shuffle.spill.numElementsForceSpillThreshold

Use SQLConf.sessionWindowBufferSpillThreshold for the current value

spark.sql.legacy.allowStarWithSingleTableIdentifierInCount

(internal) When true, the SQL function count is allowed to take a single tblName.* as parameter

Default: false

Use SQLConf.allowStarWithSingleTableIdentifierInCount for the current value

spark.sql.sessionWindow.buffer.in.memory.threshold

(internal) Threshold for the number of windows guaranteed to be held in memory by the session window operator. Note that the buffer is used only for queries where Spark SQL cannot apply aggregations while determining the session window.

Default: 4096

Use SQLConf.sessionWindowBufferInMemoryThreshold for the current value

spark.sql.orc.enableNestedColumnVectorizedReader

Enables vectorized orc decoding for nested columns

Default: false

Use SQLConf.orcVectorizedReaderNestedColumnEnabled for the current value

spark.sql.analyzer.maxIterations

(internal) The max number of iterations the analyzer runs.

Default: 100

spark.sql.analyzer.failAmbiguousSelfJoin

(internal) When true, fail the Dataset query if it contains ambiguous self-join.

Default: true

spark.sql.ansi.enabled

When true, Spark tries to conform to the ANSI SQL specification:

  1. Spark will throw a runtime exception if an overflow occurs in any operation on integral/decimal field.
  2. Spark will forbid using the reserved keywords of ANSI SQL as identifiers in the SQL parser.

Default: false
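
A sketch of the runtime behavior for integer overflow, assuming an active SparkSession named spark.

// With ANSI mode off, integer overflow wraps around silently
spark.conf.set("spark.sql.ansi.enabled", false)
spark.sql("SELECT 2147483647 + 1 AS v").show()     // -2147483648

// With ANSI mode on, the same expression throws an overflow exception at runtime
spark.conf.set("spark.sql.ansi.enabled", true)
// spark.sql("SELECT 2147483647 + 1 AS v").show()  // would fail
spark.conf.set("spark.sql.ansi.enabled", false)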

spark.sql.cli.print.header

When true, spark-sql CLI prints the names of the columns in query output

Default: false

Use SQLConf.cliPrintHeader for the current value

spark.sql.debug.maxToStringFields

Maximum number of fields of sequence-like entries that can be converted to strings in debug output. Any elements beyond the limit will be dropped and replaced by a "... N more fields" placeholder.

Default: 25

Use SQLConf.maxToStringFields method to access the current value.

spark.sql.defaultCatalog

Name of the default catalog

Default: spark_catalog

Use SQLConf.DEFAULT_CATALOG to access the current value.

spark.sql.statistics.histogram.enabled

Enables generating histograms for ANALYZE TABLE SQL statement

Default: false

Equi-Height Histogram

Histograms can provide better estimation accuracy. Currently, Spark only supports equi-height histogram. Note that collecting histograms takes extra cost. For example, collecting column statistics usually takes only one table scan, but generating equi-height histogram will cause an extra table scan.

Use SQLConf.histogramEnabled method to access the current value.
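
A sketch of collecting column statistics with histograms enabled, assuming an active SparkSession named spark with a writable warehouse; histo_demo is a hypothetical table name.

spark.range(1000).write.saveAsTable("histo_demo")

// With histograms enabled, ANALYZE TABLE ... FOR COLUMNS runs an extra table scan
spark.conf.set("spark.sql.statistics.histogram.enabled", true)
spark.sql("ANALYZE TABLE histo_demo COMPUTE STATISTICS FOR COLUMNS id")

// Column-level statistics (including the histogram) appear in the extended description
spark.sql("DESC EXTENDED histo_demo id").show(truncate = false)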

spark.sql.session.timeZone

The ID of session-local timezone (e.g. "GMT", "America/Los_Angeles")

Default: Java's TimeZone.getDefault.getID

Use SQLConf.sessionLocalTimeZone method to access the current value.
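
A sketch of changing the session time zone, assuming an active SparkSession named spark; the rendered timestamps differ between the two queries.

spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.sql("SELECT current_timestamp() AS now").show(truncate = false)

// The property can equally be set with the SQL SET command
spark.sql("SET spark.sql.session.timeZone=America/Los_Angeles")
spark.sql("SELECT current_timestamp() AS now").show(truncate = false)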

spark.sql.sources.ignoreDataLocality

(internal) When true, Spark will not fetch the block locations for each file on listing files. This speeds up file listing, but the scheduler cannot schedule tasks to take advantage of data locality. It can be particularly useful if data is read from a remote cluster so the scheduler could never take advantage of locality anyway.

Default: false

spark.sql.sources.validatePartitionColumns

(internal) When this option is set to true, partition column values will be validated against the user-specified schema. If the validation fails, a runtime exception is thrown. When this option is set to false, the partition column value will be converted to null if it cannot be cast to the corresponding user-specified schema.

Default: true

spark.sql.sources.useV1SourceList

(internal) A comma-separated list of data source short names (DataSourceRegisters) or fully-qualified canonical class names of the data sources (TableProviders) for which DataSource V2 code path is disabled (and Data Source V1 code path used).

Default: avro,csv,json,kafka,orc,parquet,text

Used when:

spark.sql.storeAssignmentPolicy

When inserting a value into a column with a different data type, Spark will perform type coercion. Three policies for the type coercion rules are supported:

  • ANSI: Spark performs the type coercion as per ANSI SQL. In practice, the behavior is mostly the same as PostgreSQL. It disallows certain unreasonable type conversions such as converting string to int or double to boolean.
  • LEGACY: Spark allows the type coercion as long as it is a valid Cast, which is very loose (e.g. converting string to int or double to boolean is allowed). It is also the only behavior in Spark 2.x and it is compatible with Hive.
  • STRICT: Spark doesn't allow any possible precision loss or data truncation in type coercion (e.g. converting double to int or decimal to double is not allowed).

Possible values: ANSI, LEGACY, STRICT

Default: ANSI

spark.sql.thriftServer.interruptOnCancel

When true, all running tasks will be interrupted if one cancels a query. When false, all running tasks will remain until finished.

Default: false

Use SQLConf.THRIFTSERVER_FORCE_CANCEL to access the property

spark.sql.hive.tablePropertyLengthThreshold

(internal) The maximum length allowed in a single cell when storing Spark-specific information in Hive's metastore as table properties. Currently it covers 2 things: the schema's JSON string, the histogram of column statistics.

Default: (undefined)

spark.sql.orc.mergeSchema

When true, the Orc data source merges schemas collected from all data files, otherwise the schema is picked from a random data file.

Default: false

spark.sql.sources.bucketing.autoBucketedScan.enabled

When true, decide whether to do bucketed scan on input tables based on query plan automatically. Do not use bucketed scan if 1. query does not have operators to utilize bucketing (e.g. join, group-by, etc), or 2. there's an exchange operator between these operators and table scan.

Note when spark.sql.sources.bucketing.enabled is set to false, this configuration does not take any effect.

Default: true

Use SQLConf.autoBucketedScanEnabled to access the property

spark.sql.datetime.java8API.enabled

When true, java.time.Instant and java.time.LocalDate classes of Java 8 API are used as external types for Catalyst's TimestampType and DateType. When false, java.sql.Timestamp and java.sql.Date are used for the same purpose.

Default: false
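
A sketch of the external types used when collecting a DateType value, assuming an active SparkSession named spark.

// With the Java 8 API enabled, DateType values are collected as java.time.LocalDate
spark.conf.set("spark.sql.datetime.java8API.enabled", true)
assert(spark.sql("SELECT DATE'2021-01-01' AS d").head.get(0).isInstanceOf[java.time.LocalDate])

// With it disabled, the same value comes back as java.sql.Date
spark.conf.set("spark.sql.datetime.java8API.enabled", false)
assert(spark.sql("SELECT DATE'2021-01-01' AS d").head.get(0).isInstanceOf[java.sql.Date])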

spark.sql.legacy.interval.enabled

(internal) When true, Spark SQL uses the mixed legacy interval type CalendarIntervalType instead of the ANSI compliant interval types YearMonthIntervalType and DayTimeIntervalType. For instance, the date subtraction expression returns CalendarIntervalType when the SQL config is set to true otherwise an ANSI interval.

Default: false

Use SQLConf.legacyIntervalEnabled to access the current value

spark.sql.sources.binaryFile.maxLength

(internal) The max length of a file that can be read by the binary file data source. Spark will fail fast and not attempt to read the file if its length exceeds this value. The theoretical max is Int.MaxValue, though VMs might implement a smaller max.

Default: Int.MaxValue

spark.sql.mapKeyDedupPolicy

The policy to deduplicate map keys in builtin function: CreateMap, MapFromArrays, MapFromEntries, StringToMap, MapConcat and TransformKeys. When EXCEPTION, the query fails if duplicated map keys are detected. When LAST_WIN, the map key that is inserted at last takes precedence.

Possible values: EXCEPTION, LAST_WIN

Default: EXCEPTION
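
A sketch of both policies with the map builtin function, assuming an active SparkSession named spark.

// Under the default EXCEPTION policy, a duplicated key fails the query at runtime
spark.conf.set("spark.sql.mapKeyDedupPolicy", "EXCEPTION")
// spark.sql("SELECT map(1, 'a', 1, 'b') AS m").show()   // would fail: duplicate map key 1

// Under LAST_WIN, the value inserted last is kept
spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN")
spark.sql("SELECT map(1, 'a', 1, 'b') AS m").show(truncate = false)   // {1 -> b}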

spark.sql.maxConcurrentOutputFileWriters

(internal) Maximum number of output file writers for FileFormatWriter to use concurrently (writing out a query result). If the number of writers needed reaches this limit, a task will sort the rest of the output and then write it out.

Default: 0

Use SQLConf.maxConcurrentOutputFileWriters for the current value

spark.sql.maxMetadataStringLength

Maximum number of characters to output for a metadata string (e.g., Location in FileScan)

Default: 100

Must be bigger than 3

Use SQLConf.maxMetadataStringLength for the current value

spark.sql.maven.additionalRemoteRepositories

A comma-delimited string config of the optional additional remote Maven mirror repositories. This is only used for downloading Hive jars in IsolatedClientLoader if the default Maven Central repo is unreachable.

Default: https://maven-central.storage-download.googleapis.com/maven2/

spark.sql.maxPlanStringLength

Maximum number of characters to output for a plan string. If the plan is longer, further output will be truncated. The default setting always generates a full plan. Set this to a lower value such as 8k if plan strings are taking up too much memory or are causing OutOfMemory errors in the driver or UI processes.

Default: Integer.MAX_VALUE - 15

spark.sql.addPartitionInBatch.size

(internal) The number of partitions to be handled in one turn when AlterTableAddPartitionCommand adds partitions to a table. The smaller the batch size, the less memory is required for the real handler (e.g., the Hive Metastore).

Default: 100

spark.sql.scriptTransformation.exitTimeoutInSeconds

(internal) Timeout for an executor to wait for the termination of a transformation script at EOF.

Default: 10 seconds

spark.sql.avro.compression.codec

The compression codec to use when writing Avro data to disk

Default: snappy

The supported codecs are:

  • uncompressed
  • deflate
  • snappy
  • bzip2
  • xz

Use SQLConf.avroCompressionCodec method to access the current value.

spark.sql.broadcastTimeout

Timeout in seconds for the broadcast wait time in broadcast joins.

Default: 5 * 60

When negative, it is assumed infinite (i.e. Duration.Inf)

Use SQLConf.broadcastTimeout method to access the current value.

spark.sql.bucketing.coalesceBucketsInJoin.enabled

When enabled (true), if two bucketed tables with different numbers of buckets are joined, the side with the bigger number of buckets will be coalesced to have the same number of buckets as the other side (this applies only when the bigger number of buckets is divisible by the smaller one). Bucket coalescing is applied to sort-merge joins and shuffled hash joins.

Note

Coalescing bucketed table can avoid unnecessary shuffling in join, but it also reduces parallelism and could possibly cause OOM for shuffled hash join.

Default: false

Use SQLConf.coalesceBucketsInJoinEnabled method to access the current value.

spark.sql.caseSensitive

(internal) Controls whether the query analyzer should be case sensitive (true) or not (false).

Default: false

It is highly discouraged to turn on case sensitive mode.

Use SQLConf.caseSensitiveAnalysis method to access the current value.
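
A sketch of column resolution under both modes, assuming an active SparkSession named spark; the column name is arbitrary.

val df = spark.range(1).withColumnRenamed("id", "Value")

// Case-insensitive analysis (default): "value" resolves to the column "Value"
df.select("value").show()

// With case-sensitive analysis, the same reference no longer resolves
spark.conf.set("spark.sql.caseSensitive", true)
// df.select("value").show()   // would fail with an AnalysisException
spark.conf.set("spark.sql.caseSensitive", false)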

spark.sql.catalog.spark_catalog

The CatalogPlugin for spark_catalog

Default: defaultSessionCatalog

spark.sql.cbo.enabled

Enables Cost-Based Optimization (CBO) for estimation of plan statistics when true.

Default: false

Use SQLConf.cboEnabled method to access the current value.

spark.sql.cbo.joinReorder.enabled

Enables join reorder for cost-based optimization (CBO).

Default: false

Use SQLConf.joinReorderEnabled method to access the current value.

spark.sql.cbo.planStats.enabled

When true, the logical plan will fetch row counts and column statistics from catalog.

Default: false

spark.sql.cbo.starSchemaDetection

Enables join reordering based on star schema detection for cost-based optimization (CBO) in ReorderJoin logical plan optimization.

Default: false

Use SQLConf.starSchemaDetection method to access the current value.

spark.sql.codegen

aggregate.map.vectorized.enable

spark.sql.codegen.aggregate.map.vectorized.enable

(internal) Enables vectorized aggregate hash map. This is for testing/benchmarking only.

Default: false

aggregate.sortAggregate.enabled

spark.sql.codegen.aggregate.sortAggregate.enabled

Enables code generation for SortAggregateExec physical operator

Default: true

aggregate.splitAggregateFunc.enabled

spark.sql.codegen.aggregate.splitAggregateFunc.enabled

(internal) When true, the code generator would split aggregate code into individual methods instead of a single big method. This can be used to avoid oversized functions that can miss the opportunity of JIT optimization.

Default: true

comments

spark.sql.codegen.comments

Controls whether CodegenContext should register comments (true) or not (false).

Default: false

factoryMode

spark.sql.codegen.factoryMode

(internal) Determines the codegen generator fallback behavior

Default: FALLBACK

Acceptable values:

  • CODEGEN_ONLY - disable fallback mode
  • FALLBACK - try codegen first and, if any compile error happens, fallback to interpreted mode
  • NO_CODEGEN - skips codegen and always uses interpreted path

Used when CodeGeneratorWithInterpretedFallback is requested to createObject (when UnsafeProjection is requested to create an UnsafeProjection for Catalyst expressions)

useIdInClassName

spark.sql.codegen.useIdInClassName

(internal) Controls whether to embed the (whole-stage) codegen stage ID into the class name of the generated class as a suffix (true) or not (false)

Default: true

Use SQLConf.wholeStageUseIdInClassName method to access the current value.

maxFields

spark.sql.codegen.maxFields

(internal) Maximum number of output fields (including nested fields) that whole-stage codegen supports. Going above the number deactivates whole-stage codegen.

Default: 100

Use SQLConf.wholeStageMaxNumFields method to access the current value.

splitConsumeFuncByOperator

spark.sql.codegen.splitConsumeFuncByOperator

(internal) Controls whether whole stage codegen puts the logic of consuming rows of each physical operator into individual methods, instead of a single big method. This can be used to avoid oversized functions that can miss the opportunity of JIT optimization.

Default: true

Use SQLConf.wholeStageSplitConsumeFuncByOperator method to access the current value.

spark.sql.columnNameOfCorruptRecord

spark.sql.constraintPropagation.enabled

(internal) When true, the query optimizer will infer and propagate data constraints in the query plan to optimize them. Constraint propagation can sometimes be computationally expensive for certain kinds of query plans (such as those with a large number of predicates and aliases) which might negatively impact overall runtime.

Default: true

Use SQLConf.constraintPropagationEnabled method to access the current value.

spark.sql.csv.filterPushdown.enabled

(internal) When true, enable filter pushdown to CSV datasource.

Default: true

spark.sql.defaultSizeInBytes

(internal) Estimated size of a table or relation used in query planning

Default: Java's Long.MaxValue

Set to Java's Long.MaxValue which is larger than spark.sql.autoBroadcastJoinThreshold to be more conservative. That is to say by default the optimizer will not choose to broadcast a table unless it knows for sure that the table size is small enough.

Used by the planner to decide when it is safe to broadcast a relation. By default, the system will assume that tables are too large to broadcast.

Use SQLConf.defaultSizeInBytes method to access the current value.

spark.sql.dialect

execution.useObjectHashAggregateExec

spark.sql.execution.useObjectHashAggregateExec

(internal) Prefers ObjectHashAggregateExec (over SortAggregateExec) for aggregation

Default: true

Use SQLConf.useObjectHashAggregation method to access the current value.

spark.sql.files.ignoreCorruptFiles

Controls whether to ignore corrupt files (true) or not (false). If true, the Spark jobs will continue to run when encountering corrupted files and the contents that have been read will still be returned.

Default: false

Use SQLConf.ignoreCorruptFiles method to access the current value.

spark.sql.files.ignoreMissingFiles

Controls whether to ignore missing files (true) or not (false). If true, the Spark jobs will continue to run when encountering missing files and the contents that have been read will still be returned.

Default: false

Use SQLConf.ignoreMissingFiles method to access the current value.

spark.sql.inMemoryColumnarStorage.compressed

When enabled, Spark SQL will automatically select a compression codec for each column based on statistics of the data.

Default: true

Use SQLConf.useCompression method to access the current value.

spark.sql.inMemoryColumnarStorage.batchSize

Controls the size of batches for columnar caching. Larger batch sizes can improve memory utilization and compression, but risk OOMs when caching data.

Default: 10000

Use SQLConf.columnBatchSize method to access the current value.

spark.sql.inMemoryTableScanStatistics.enable

(internal) When true, enable in-memory table scan accumulators.

Default: false

spark.sql.inMemoryColumnarStorage.enableVectorizedReader

Enables vectorized reader for columnar caching

Default: true

Use SQLConf.cacheVectorizedReaderEnabled method to access the current value.

spark.sql.join.preferSortMergeJoin

(internal) Controls whether JoinSelection execution planning strategy prefers sort merge join over shuffled hash join.

Default: true

Use SQLConf.preferSortMergeJoin method to access the current value.

spark.sql.jsonGenerator.ignoreNullFields

Whether to ignore null fields when generating JSON objects in JSON data source and JSON functions such as to_json. If false, it generates null for null fields in JSON objects.

Default: true
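
A sketch of the generated JSON with and without null fields, assuming an active SparkSession named spark; the struct and its field names are arbitrary.

import org.apache.spark.sql.functions.{col, lit, struct, to_json}

val df = spark.range(1)
  .select(struct(lit(1).as("a"), lit(null).cast("string").as("b")).as("s"))

// Default (true): null fields are dropped from the generated JSON
df.select(to_json(col("s"))).show(truncate = false)   // {"a":1}

// When false, null fields are kept
spark.conf.set("spark.sql.jsonGenerator.ignoreNullFields", false)
df.select(to_json(col("s"))).show(truncate = false)   // {"a":1,"b":null}
spark.conf.set("spark.sql.jsonGenerator.ignoreNullFields", true)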

spark.sql.leafNodeDefaultParallelism

The default parallelism of leaf operators that produce data

Must be positive

Default: SparkContext.defaultParallelism (Spark Core)

spark.sql.legacy.doLooseUpcast

(internal) When true, the upcast will be loose and allows string to atomic types.

Default: false

spark.sql.legacy.ctePrecedencePolicy

(internal) Controls how a name conflict between an inner and an outer CTE definition is resolved. This config will be removed in future versions and CORRECTED will be the only behavior.

Possible values:

  1. CORRECTED - inner CTE definitions take precedence over outer definitions
  2. EXCEPTION - AnalysisException is thrown when a name conflict is detected in a nested CTE
  3. LEGACY - outer CTE definitions take precedence over inner definitions

Default: EXCEPTION
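
For illustration, a minimal sketch of a nested CTE with a name conflict:

spark.sql("SET spark.sql.legacy.ctePrecedencePolicy=CORRECTED")

spark.sql("""
  WITH t AS (SELECT 1 AS id)
  SELECT * FROM (
    WITH t AS (SELECT 2 AS id)
    SELECT * FROM t
  )
""").show()
// With CORRECTED the inner definition of t wins (id = 2);
// with the default EXCEPTION the name conflict fails the query instead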

spark.sql.legacy.timeParserPolicy

(internal) When LEGACY, java.text.SimpleDateFormat is used for formatting and parsing dates/timestamps in a locale-sensitive manner, which is the approach before Spark 3.0. When CORRECTED, classes from the java.time.* packages are used for the same purpose. When EXCEPTION (the default), a RuntimeException is thrown when the two approaches would produce different results.

Possible values: EXCEPTION, LEGACY, CORRECTED

Default: EXCEPTION

spark.sql.legacy.followThreeValuedLogicInArrayExists

(internal) When true, the ArrayExists expression follows three-valued boolean logic.

Default: true

spark.sql.legacy.fromDayTimeString.enabled

(internal) When true, the from bound is not taken into account in the conversion of a day-time string to an interval, and the to bound is used to skip all interval units outside the specified range. When false, a ParseException is thrown if the input does not match the pattern defined by from and to.

Default: false

spark.sql.legacy.notReserveProperties

(internal) When true, database and table properties are not reserved and are available in CREATE/ALTER syntax. Be aware, however, that the reserved properties are silently removed.

Default: false

spark.sql.legacy.addSingleFileInAddFile

(internal) When true, only a single file can be added using ADD FILE. If false, users can add a directory by passing a directory path to ADD FILE.

Default: false

spark.sql.legacy.exponentLiteralAsDecimal.enabled

(internal) When true, a literal with an exponent (e.g. 1E-30) would be parsed as Decimal rather than Double.

Default: false

spark.sql.legacy.allowNegativeScaleOfDecimal

(internal) When true, a negative scale of the Decimal type is allowed. For example, the type of the number 1E10BD under legacy mode is DecimalType(2, -9), but is DecimalType(11, 0) in non-legacy mode.

Default: false

spark.sql.legacy.bucketedTableScan.outputOrdering

(internal) When true, the bucketed table scan will list files during planning to figure out the output ordering, which is expensive and may make the planning quite slow.

Default: false

spark.sql.legacy.json.allowEmptyString.enabled

(internal) When true, the parser of JSON data source treats empty strings as null for some data types such as IntegerType.

Default: false

spark.sql.legacy.createEmptyCollectionUsingStringType

(internal) When true, Spark returns an empty collection with StringType as element type if the array/map function is called without any parameters. Otherwise, Spark returns an empty collection with NullType as element type.

Default: false

spark.sql.legacy.allowUntypedScalaUDF

(internal) When true, users are allowed to use org.apache.spark.sql.functions.udf(f: AnyRef, dataType: DataType). Otherwise, an exception is thrown at runtime.

Default: false

spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue

(internal) When true, the key attribute resulting from running Dataset.groupByKey on a non-struct key type is named value, following the behavior of Spark 2.4 and earlier.

Default: false
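
A minimal sketch of the difference for a non-struct grouping key:

import spark.implicits._

val counts = Seq("a", "b", "a").toDS.groupByKey(identity).count()
counts.printSchema()
// Default (false): the grouping key column is named "key"
// With this legacy flag enabled it is named "value" (Spark 2.4 behavior)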

spark.sql.legacy.setCommandRejectsSparkCoreConfs

(internal) When true, the SET command fails when the key is registered as a SparkConf entry.

Default: true

spark.sql.legacy.typeCoercion.datetimeToString.enabled

(internal) When true, date/timestamp will cast to string in binary comparisons with String

Default: false

spark.sql.legacy.allowHashOnMapType

(internal) When true, hash expressions can be applied on elements of MapType. Otherwise, an analysis exception will be thrown.

Default: false

spark.sql.legacy.parquet.datetimeRebaseModeInWrite

(internal) When LEGACY, Spark rebases dates/timestamps from the Proleptic Gregorian calendar to the legacy hybrid (Julian + Gregorian) calendar when writing Parquet files. When CORRECTED, Spark does not rebase and writes the dates/timestamps as they are. When EXCEPTION (the default), Spark fails the write if it sees ancient dates/timestamps that are ambiguous between the two calendars.

Possible values: EXCEPTION, LEGACY, CORRECTED

Default: EXCEPTION

spark.sql.legacy.parquet.datetimeRebaseModeInRead

(internal) When LEGACY, Spark rebases dates/timestamps from the legacy hybrid (Julian + Gregorian) calendar to the Proleptic Gregorian calendar when reading Parquet files. When CORRECTED, Spark does not rebase and reads the dates/timestamps as they are. When EXCEPTION (the default), Spark fails the read if it sees ancient dates/timestamps that are ambiguous between the two calendars. This config is only effective if the writer info (like Spark, Hive) of the Parquet files is unknown.

Possible values: EXCEPTION, LEGACY, CORRECTED

Default: EXCEPTION

spark.sql.legacy.avro.datetimeRebaseModeInWrite

(internal) When LEGACY, Spark rebases dates/timestamps from the Proleptic Gregorian calendar to the legacy hybrid (Julian + Gregorian) calendar when writing Avro files. When CORRECTED, Spark does not rebase and writes the dates/timestamps as they are. When EXCEPTION (the default), Spark fails the write if it sees ancient dates/timestamps that are ambiguous between the two calendars.

Possible values: EXCEPTION, LEGACY, CORRECTED

Default: EXCEPTION

spark.sql.legacy.avro.datetimeRebaseModeInRead

(internal) When LEGACY, Spark rebases dates/timestamps from the legacy hybrid (Julian + Gregorian) calendar to the Proleptic Gregorian calendar when reading Avro files. When CORRECTED, Spark does not rebase and reads the dates/timestamps as they are. When EXCEPTION (the default), Spark fails the read if it sees ancient dates/timestamps that are ambiguous between the two calendars. This config is only effective if the writer info (like Spark, Hive) of the Avro files is unknown.

Possible values: EXCEPTION, LEGACY, CORRECTED

Default: EXCEPTION

spark.sql.legacy.rdd.applyConf

(internal) Enables propagation of SQL configurations when executing operations on the RDD that represents a structured query. This is the (buggy) behavior up to 2.4.4.

Default: true

This is for cases not tracked by SQL execution, when a Dataset is converted to an RDD either using the Dataset.rdd operation or QueryExecution, and then the returned RDD is used to invoke actions on it.

This config is deprecated and will be removed in 3.0.0.

spark.sql.legacy.replaceDatabricksSparkAvro.enabled

Enables resolving (mapping) the data source provider com.databricks.spark.avro to the built-in (but external) Avro data source module for backward compatibility.

Default: true

Use SQLConf.replaceDatabricksSparkAvroEnabled method to access the current value.

spark.sql.limit.scaleUpFactor

(internal) Minimal increase rate in the number of partitions between attempts when executing take operator on a structured query. Higher values lead to more partitions read. Lower values might lead to longer execution times as more jobs will be run.

Default: 4

Use SQLConf.limitScaleUpFactor method to access the current value.

spark.sql.optimizeNullAwareAntiJoin

(internal) Enables planning a single-column NULL-aware anti join as BroadcastHashJoinExec (with the isNullAwareAntiJoin flag enabled), turning the O(M*N) loop-based lookup into an O(M) hash-based lookup.

Default: true

Use SQLConf.optimizeNullAwareAntiJoin method to access the current value.

spark.sql.orc.impl

(internal) When native, use the native version of ORC support instead of the ORC library in Hive 1.2.1.

Default: native

Acceptable values:

  • hive
  • native

spark.sql.planChangeLog.level

(internal) Log level for logging the change from the original plan to the new plan after a rule or batch is applied.

Default: trace

Supported Values (case-insensitive):

  • trace
  • debug
  • info
  • warn
  • error

Use SQLConf.planChangeLogLevel method to access the current value.

spark.sql.planChangeLog.batches

(internal) Comma-separated list of batch names for plan changes logging

Default: (undefined)

Use SQLConf.planChangeBatches method to access the current value.

spark.sql.planChangeLog.rules

(internal) Comma-separated list of rule names for plan changes logging

Default: (undefined)

Use SQLConf.planChangeRules method to access the current value.

spark.sql.pyspark.jvmStacktrace.enabled

When true, the JVM stacktrace is shown in the user-facing PySpark exception together with the Python stacktrace. By default, it is disabled: the JVM stacktrace is hidden and only a Python-friendly exception is shown.

Default: false

spark.sql.parquet.binaryAsString

Some other Parquet-producing systems, in particular Impala and older versions of Spark SQL, do not differentiate between binary data and strings when writing out the Parquet schema. This flag tells Spark SQL to interpret binary data as a string to provide compatibility with these systems.

Default: false

Use SQLConf.isParquetBinaryAsString method to access the current value.

spark.sql.parquet.compression.codec

Sets the compression codec used when writing Parquet files. If either compression or parquet.compression is specified in the table-specific options/properties, the precedence would be compression, parquet.compression, spark.sql.parquet.compression.codec.

Acceptable values:

  • brotli
  • gzip
  • lz4
  • lzo
  • none
  • snappy
  • uncompressed
  • zstd

Default: snappy

Use SQLConf.parquetCompressionCodec method to access the current value.
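
A minimal sketch of the precedence described above (the output path is hypothetical): a per-write compression option overrides the session-level codec.

// Session-level default codec for Parquet writes
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")

// The per-write option takes precedence over the session-level setting
spark.range(10).write
  .option("compression", "snappy")
  .parquet("/tmp/compression-demo")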

spark.sql.parquet.enableVectorizedReader

Enables vectorized parquet decoding.

Default: true

Use SQLConf.parquetVectorizedReaderEnabled method to access the current value.

spark.sql.parquet.filterPushdown.date

(internal) Enables parquet filter push-down optimization for Date type (when spark.sql.parquet.filterPushdown is enabled)

Default: true

Use SQLConf.parquetFilterPushDownDate method to access the current value.

spark.sql.parquet.filterPushdown.decimal

(internal) Enables parquet filter push-down optimization for Decimal type (when spark.sql.parquet.filterPushdown is enabled)

Default: true

Use SQLConf.parquetFilterPushDownDecimal method to access the current value.

spark.sql.parquet.int96RebaseModeInWrite

(internal) Enables rebasing timestamps while writing Parquet files

Formerly known as spark.sql.legacy.parquet.int96RebaseModeInWrite

Acceptable values:

  • EXCEPTION - Fail writing parquet files if there are ancient timestamps that are ambiguous between the two calendars
  • LEGACY - Rebase INT96 timestamps from Proleptic Gregorian calendar to the legacy hybrid (Julian + Gregorian) calendar (gives maximum interoperability)
  • CORRECTED - Write datetime values with no change (no rebase). Use this only when you are 100% sure that the written files will only be read by Spark 3.0+ or other systems that use the Proleptic Gregorian calendar

Default: EXCEPTION

Writing dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z can be dangerous, as the files may later be read by Spark 2.x or legacy versions of Hive, which use a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar.

See more details in SPARK-31404.

spark.sql.parquet.pushdown.inFilterThreshold

(internal) For an IN predicate, the Parquet filter pushes down a set of OR clauses if the number of values does not exceed this threshold. Otherwise, the Parquet filter pushes down a range predicate: greater than or equal to the minimum value and less than or equal to the maximum value (when spark.sql.parquet.filterPushdown is enabled)

Disabled when 0

Default: 10

Use SQLConf.parquetFilterPushDownInFilterThreshold method to access the current value.

spark.sql.parquet.filterPushdown.string.startsWith

(internal) Enables parquet filter push-down optimization for startsWith function (when spark.sql.parquet.filterPushdown is enabled)

Default: true

Use SQLConf.parquetFilterPushDownStringStartWith method to access the current value.

spark.sql.parquet.filterPushdown.timestamp

(internal) Enables parquet filter push-down optimization for Timestamp type. It can only have an effect when the following hold:

  1. spark.sql.parquet.filterPushdown is enabled
  2. Timestamps are stored as TIMESTAMP_MICROS or TIMESTAMP_MILLIS

Default: true

Use SQLConf.parquetFilterPushDownTimestamp method to access the current value.

spark.sql.parquet.int96AsTimestamp

Some Parquet-producing systems, in particular Impala, store Timestamp as INT96. Spark also stores Timestamp as INT96 to avoid losing precision in the nanoseconds field. This flag tells Spark SQL to interpret INT96 data as a timestamp to provide compatibility with these systems.

Default: true

Use SQLConf.isParquetINT96AsTimestamp method to access the current value.

spark.sql.parquet.int96TimestampConversion

Controls whether timestamp adjustments should be applied to INT96 data when converting to timestamps, for data written by Impala.

Default: false

This is necessary because Impala stores INT96 data with a different timezone offset than Hive and Spark.

Use SQLConf.isParquetINT96TimestampConversion method to access the current value.

spark.sql.parquet.outputTimestampType

Sets which Parquet timestamp type to use when Spark writes data to Parquet files. INT96 is a non-standard but commonly used timestamp type in Parquet. TIMESTAMP_MICROS is a standard timestamp type in Parquet, which stores number of microseconds from the Unix epoch. TIMESTAMP_MILLIS is also standard, but with millisecond precision, which means Spark has to truncate the microsecond portion of its timestamp value.

Acceptable values:

  • INT96
  • TIMESTAMP_MICROS
  • TIMESTAMP_MILLIS

Default: INT96

Use SQLConf.parquetOutputTimestampType method to access the current value.

spark.sql.parquet.recordLevelFilter.enabled

Enables Parquet's native record-level filtering using the pushed down filters (when spark.sql.parquet.filterPushdown is enabled).

Default: false

Use SQLConf.parquetRecordFilterEnabled method to access the current value.

spark.sql.parquet.respectSummaryFiles

When true, Spark assumes that all part-files of Parquet are consistent with the summary files and ignores the part-files when merging the schema. When false (the default), all part-files are merged. This is an expert-only option and shouldn't be enabled without knowing exactly what it means.

Default: false

Use SQLConf.isParquetSchemaRespectSummaries method to access the current value.

spark.sql.parser.quotedRegexColumnNames

Controls whether quoted identifiers (using backticks) in SELECT statements should be interpreted as regular expressions.

Default: false

Use SQLConf.supportQuotedRegexColumnName method to access the current value.
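
A minimal sketch: with the property enabled, a backquoted identifier in SELECT is treated as a regular expression over column names.

spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true")

spark.range(1)
  .selectExpr("id AS ab", "id AS ac", "id AS b")
  .createOrReplaceTempView("t")

spark.sql("SELECT `a.*` FROM t").show()  // selects columns ab and ac only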

spark.sql.pivotMaxValues

Maximum number of (distinct) values that will be collected without error (when doing a pivot without specifying the values for the pivot column)

Default: 10000

Use SQLConf.dataFramePivotMaxValues method to access the current value.
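
The limit only matters when the pivot values are not given explicitly, as in this minimal sketch:

import spark.implicits._

val sales = Seq(("2023", "Q1", 10), ("2023", "Q2", 20), ("2024", "Q1", 30))
  .toDF("year", "quarter", "amount")

// No explicit pivot values, so Spark first collects the distinct quarters,
// failing if there are more than spark.sql.pivotMaxValues of them
sales.groupBy("year").pivot("quarter").sum("amount").show()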

spark.sql.redaction.options.regex

Regular expression to find options of a Spark SQL command with sensitive information

Default: (?i)secret|password

The values of the options matched will be redacted in the explain output.

This redaction is applied on top of the global redaction configuration defined by spark.redaction.regex configuration.

Used exclusively when SQLConf is requested to redactOptions.

spark.sql.redaction.string.regex

Regular expression to point at sensitive information in text output

Default: (undefined)

When this regex matches a string part, it is replaced by a dummy value (i.e. *********(redacted)). This is currently used to redact the output of SQL explain commands.

NOTE: When this conf is not set, the value of spark.redaction.string.regex is used instead.

Use SQLConf.stringRedactionPattern method to access the current value.

spark.sql.runSQLOnFiles

(internal) Enables datasource.`path` table names (e.g. parquet.`/path/to/data`) in SQL queries for FileFormat-based data sources (excluding Hive tables)

Default: true

Use SQLConf.runSQLonFile method to access the current value.
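
A minimal sketch of the syntax this property enables (the path is hypothetical):

// Query a file-based data source directly, without registering a table first
spark.sql("SELECT * FROM parquet.`/tmp/events`").show()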


spark.sql.selfJoinAutoResolveAmbiguity

Controls whether to resolve ambiguity in join conditions for self-joins automatically (true) or not (false)

Default: true

spark.sql.sort.enableRadixSort

(internal) Controls whether to use radix sort (true) or not (false) in ShuffleExchangeExec and SortExec physical operators

Default: true

Radix sort is much faster but requires additional memory to be reserved up-front. The memory overhead may be significant when sorting very small rows (up to 50% more).

Use SQLConf.enableRadixSort method to access the current value.

spark.sql.sources.default

Default data source to use for loading or saving data

Default: parquet

Use SQLConf.defaultDataSourceName method to access the current value.

spark.sql.statistics.fallBackToHdfs

Enables automatic calculation of table size statistic by falling back to HDFS if the table statistics are not available from table metadata.

Default: false

This can be useful in determining if a table is small enough for auto broadcast joins in query planning.

Use SQLConf.fallBackToHdfsForStatsEnabled method to access the current value.

spark.sql.statistics.histogram.numBins

(internal) The number of bins when generating histograms.

Default: 254

NOTE: The number of bins must be greater than 1.

Use SQLConf.histogramNumBins method to access the current value.

spark.sql.statistics.parallelFileListingInStatsComputation.enabled

(internal) Enables parallel file listing in SQL commands, e.g. ANALYZE TABLE (as opposed to single-threaded listing, which can be particularly slow with tables with hundreds of partitions)

Default: true

Use SQLConf.parallelFileListingInStatsComputation method to access the current value.

spark.sql.statistics.ndv.maxError

(internal) The maximum estimation error allowed in HyperLogLog++ algorithm when generating column level statistics.

Default: 0.05

spark.sql.statistics.percentile.accuracy

(internal) Accuracy of percentile approximation when generating equi-height histograms. Larger value means better accuracy. The relative error can be deduced by 1.0 / PERCENTILE_ACCURACY.

Default: 10000

spark.sql.statistics.size.autoUpdate.enabled

Enables automatic update of the table size statistic of a table after the table has changed.

Default: false

IMPORTANT: If the total number of files of the table is very large this can be expensive and slow down data change commands.

Use SQLConf.autoSizeUpdateEnabled method to access the current value.

spark.sql.subexpressionElimination.enabled

(internal) Enables Subexpression Elimination

Default: true

Use SQLConf.subexpressionEliminationEnabled method to access the current value.

spark.sql.subexpressionElimination.skipForShortcutExpr

(internal) When true, subexpression elimination takes the shortcut evaluation of AND and OR into account: a subexpression may not need to be evaluated even if it appears more than once, e.g. in if(or(a, and(b, b))) the expression b is skipped when a is true.

Default: false

Use SQLConf.subexpressionEliminationSkipForShotcutExpr method to access the current value.


spark.sql.shuffle.partitions

The default number of partitions to use when shuffling data for joins or aggregations.

Default: 200

Note

Corresponds to Apache Hive's mapred.reduce.tasks property that Spark SQL considers deprecated.

Spark Structured Streaming

spark.sql.shuffle.partitions cannot be changed in Spark Structured Streaming between query restarts from the same checkpoint location.

Use SQLConf.numShufflePartitions method to access the current value.
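
A minimal sketch of overriding the value for the current session:

import spark.implicits._

spark.conf.set("spark.sql.shuffle.partitions", "8")

// Subsequent shuffles (joins, aggregations) use 8 partitions,
// unless adaptive query execution coalesces them further
spark.range(1000).groupBy(($"id" % 10) as "bucket").count().show()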

spark.sql.sources.fileCompressionFactor

(internal) When estimating the output data size of a table scan, multiply the file size by this factor to get the estimated data size, since data compressed in the file could otherwise lead to a heavily underestimated result.

Default: 1.0

Use SQLConf.fileCompressionFactor method to access the current value.

spark.sql.sources.partitionOverwriteMode

Enables dynamic partition inserts (when set to dynamic)

Default: static

When INSERT OVERWRITE a partitioned data source table with dynamic partition columns, Spark SQL supports two modes (case-insensitive):

  • static - Spark deletes all the partitions that match the partition specification (e.g. PARTITION(a=1,b)) in the INSERT statement, before overwriting

  • dynamic - Spark doesn't delete partitions ahead of time, and only overwrites those partitions that have data written into them

The default static overwrite mode keeps the same behavior as Spark prior to 2.3. Note that this config doesn't affect Hive serde tables, as they are always overwritten with dynamic mode.

Use SQLConf.partitionOverwriteMode method to access the current value.
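
A minimal sketch of a dynamic-mode overwrite (the output path is hypothetical):

import spark.implicits._

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

val df = Seq((1, 2023), (2, 2024)).toDF("id", "year")
df.write
  .mode("overwrite")
  .partitionBy("year")
  .parquet("/tmp/events")
// Only the year=2023 and year=2024 partitions are replaced;
// any other partitions already under /tmp/events are left intact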

spark.sql.truncateTable.ignorePermissionAcl.enabled

(internal) When enabled, the TRUNCATE TABLE command does not set the original permissions and ACLs back when re-creating the table/partition paths.

Default: false

Use SQLConf.truncateTableIgnorePermissionAcl method to access the current value.

spark.sql.ui.retainedExecutions

Number of SQLExecutionUIData entries to keep in the failedExecutions and completedExecutions internal registries.

Default: 1000

When a query execution finishes, the execution is removed from the internal activeExecutions registry and stored in failedExecutions or completedExecutions depending on the final execution status. SQLListener then makes sure that the number of SQLExecutionUIData entries does not exceed spark.sql.ui.retainedExecutions and removes the excess entries.

spark.sql.variable.substitute

Enables Variable Substitution

Default: true

Use SQLConf.variableSubstituteEnabled method to access the current value.
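
A minimal sketch of variable substitution in SQL text (myvar is an arbitrary, hypothetical variable name):

spark.sql("SET myvar=hello")

// The reference ${myvar} is substituted in the SQL text before parsing
spark.sql("SELECT '${myvar}' AS greeting").show()  // greeting = hello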

spark.sql.windowExec.buffer.in.memory.threshold

(internal) Threshold for number of rows guaranteed to be held in memory by WindowExec physical operator.

Default: 4096

Use SQLConf.windowExecBufferInMemoryThreshold method to access the current value.

spark.sql.windowExec.buffer.spill.threshold

(internal) Threshold for number of rows buffered in a WindowExec physical operator.

Default: 4096

Use SQLConf.windowExecBufferSpillThreshold method to access the current value.