
Configuration Properties

Configuration properties (aka settings) allow you to fine-tune a Spark SQL application.

Configuration properties can be set on a SparkSession while creating a new instance using the config method (e.g. spark.sql.warehouse.dir).

import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder
  .master("local[*]")
  .appName("My Spark Application")
  .config("spark.sql.warehouse.dir", "c:/Temp") // (1)!
  .getOrCreate
  1. Sets spark.sql.warehouse.dir

You can also set a property using the SQL SET command.

assert(spark.conf.getOption("spark.sql.hive.metastore.version").isEmpty)

scala> spark.sql("SET spark.sql.hive.metastore.version=2.3.2").show(truncate = false)
+--------------------------------+-----+
|key                             |value|
+--------------------------------+-----+
|spark.sql.hive.metastore.version|2.3.2|
+--------------------------------+-----+

assert(spark.conf.get("spark.sql.hive.metastore.version") == "2.3.2")

spark.sql.adaptive

advisoryPartitionSizeInBytes

spark.sql.adaptive.advisoryPartitionSizeInBytes

The advisory size in bytes of the shuffle partition during adaptive optimization (when spark.sql.adaptive.enabled is enabled). It takes effect when Spark coalesces small shuffle partitions or splits skewed shuffle partitions.

Default: 64MB

Fallback Property: spark.sql.adaptive.shuffle.targetPostShuffleInputSize

Use SQLConf.ADVISORY_PARTITION_SIZE_IN_BYTES to reference the name.
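
A minimal sketch of setting these AQE properties while building a SparkSession; the local[*] master, the application name and the 32MB target are arbitrary example values.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[*]")
  .appName("AQE advisory partition size demo")
  .config("spark.sql.adaptive.enabled", true)                        // AQE must be enabled
  .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "32MB") // target size for coalescing/splitting
  .getOrCreate

assert(spark.conf.get("spark.sql.adaptive.advisoryPartitionSizeInBytes") == "32MB")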

autoBroadcastJoinThreshold

spark.sql.adaptive.autoBroadcastJoinThreshold

The maximum size (in bytes) of a table to be broadcast when performing a join. -1 turns broadcasting off. The default value is the same as spark.sql.autoBroadcastJoinThreshold.

Used only in Adaptive Query Execution

Default: (undefined)

Available as SQLConf.ADAPTIVE_AUTO_BROADCASTJOIN_THRESHOLD value.

coalescePartitions.enabled

spark.sql.adaptive.coalescePartitions.enabled

Controls coalescing shuffle partitions

When true and spark.sql.adaptive.enabled is enabled, Spark will coalesce contiguous shuffle partitions according to the target size (specified by spark.sql.adaptive.advisoryPartitionSizeInBytes), to avoid too many small tasks.

Default: true

Use SQLConf.coalesceShufflePartitionsEnabled method to access the current value.

coalescePartitions.minPartitionSize

spark.sql.adaptive.coalescePartitions.minPartitionSize

The minimum size (in bytes unless specified) of shuffle partitions after coalescing. This is useful when the adaptively calculated target size is too small during partition coalescing

Default: 1MB

Use SQLConf.COALESCE_PARTITIONS_MIN_PARTITION_SIZE to reference the property.

coalescePartitions.initialPartitionNum

spark.sql.adaptive.coalescePartitions.initialPartitionNum

The initial number of shuffle partitions before coalescing.

By default it equals spark.sql.shuffle.partitions. This configuration only has an effect when spark.sql.adaptive.enabled and spark.sql.adaptive.coalescePartitions.enabled are both enabled.

Default: (undefined)

coalescePartitions.parallelismFirst

spark.sql.adaptive.coalescePartitions.parallelismFirst

When true, Spark does not respect the target size specified by spark.sql.adaptive.advisoryPartitionSizeInBytes when coalescing contiguous shuffle partitions, but adaptively calculates the target size according to the default parallelism of the Spark cluster. The calculated size is usually smaller than the configured target size. This is to maximize the parallelism and avoid performance regression when enabling adaptive query execution. It's recommended to set this config to false and respect the configured target size.

Default: true

Use SQLConf.COALESCE_PARTITIONS_PARALLELISM_FIRST to reference the property.

customCostEvaluatorClass

spark.sql.adaptive.customCostEvaluatorClass

The fully-qualified class name of the CostEvaluator in Adaptive Query Execution

Default: SimpleCostEvaluator

Use SQLConf.ADAPTIVE_CUSTOM_COST_EVALUATOR_CLASS method to access the property (in a type-safe way).

Used when:

enabled

spark.sql.adaptive.enabled

Enables Adaptive Query Execution

Default: true

Use SQLConf.adaptiveExecutionEnabled method to access the current value.

fetchShuffleBlocksInBatch

spark.sql.adaptive.fetchShuffleBlocksInBatch

(internal) Whether to fetch the contiguous shuffle blocks in batch. Instead of fetching blocks one by one, fetching contiguous shuffle blocks for the same map task in batch can reduce IO and improve performance. Note that multiple contiguous blocks in a single fetch request only happen when spark.sql.adaptive.enabled and spark.sql.adaptive.coalescePartitions.enabled are both enabled. This feature also depends on a relocatable serializer, the concatenation support codec in use, and the new version shuffle fetch protocol.

Default: true

Use SQLConf.fetchShuffleBlocksInBatch method to access the current value.

forceApply

spark.sql.adaptive.forceApply

(internal) When true (together with spark.sql.adaptive.enabled enabled), Spark will force apply adaptive query execution for all supported queries.

Default: false

Use SQLConf.ADAPTIVE_EXECUTION_FORCE_APPLY method to access the property (in a type-safe way).

forceOptimizeSkewedJoin

spark.sql.adaptive.forceOptimizeSkewedJoin

Enables OptimizeSkewedJoin physical optimization to be executed even if it introduces extra shuffle

Default: false

Requires spark.sql.adaptive.skewJoin.enabled to be enabled

Use SQLConf.ADAPTIVE_FORCE_OPTIMIZE_SKEWED_JOIN to access the property (in a type-safe way).

Used when:

localShuffleReader.enabled

spark.sql.adaptive.localShuffleReader.enabled

When true (and spark.sql.adaptive.enabled is true), Spark SQL tries to use local shuffle reader to read the shuffle data when the shuffle partitioning is not needed, for example, after converting sort-merge join to broadcast-hash join.

Default: true

Use SQLConf.LOCAL_SHUFFLE_READER_ENABLED to access the property (in a type-safe way)

logLevel

spark.sql.adaptive.logLevel

(internal) Log level for adaptive execution logging of plan changes. The value can be TRACE, DEBUG, INFO, WARN or ERROR.

Default: DEBUG

Use SQLConf.adaptiveExecutionLogLevel for the current value

maxShuffledHashJoinLocalMapThreshold

spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold

The maximum size (in bytes) per partition that can be allowed to build a local hash map. If this value is not smaller than spark.sql.adaptive.advisoryPartitionSizeInBytes and all partition sizes are not larger than this config, join selection prefers to use shuffled hash join instead of sort merge join regardless of the value of spark.sql.join.preferSortMergeJoin.

Default: 0

Available as SQLConf.ADAPTIVE_MAX_SHUFFLE_HASH_JOIN_LOCAL_MAP_THRESHOLD

nonEmptyPartitionRatioForBroadcastJoin

spark.sql.adaptive.nonEmptyPartitionRatioForBroadcastJoin

(internal) A relation with a non-empty partition ratio (the number of non-empty partitions to all partitions) lower than this config will not be considered as the build side of a broadcast-hash join in Adaptive Query Execution regardless of the size.

Effective with spark.sql.adaptive.enabled true

Default: 0.2

Use SQLConf.nonEmptyPartitionRatioForBroadcastJoin method to access the current value.

optimizer.excludedRules

spark.sql.adaptive.optimizer.excludedRules

A comma-separated list of rules (names) to be disabled (excluded) in the AQE Logical Optimizer

Default: undefined

Use SQLConf.ADAPTIVE_OPTIMIZER_EXCLUDED_RULES to reference the property.

Used when:

  • AQEOptimizer is requested for the batches

optimizeSkewsInRebalancePartitions.enabled

spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled

When true and spark.sql.adaptive.enabled is true, Spark SQL will optimize the skewed shuffle partitions in RebalancePartitions and split them into smaller ones according to the target size (specified by spark.sql.adaptive.advisoryPartitionSizeInBytes), to avoid data skew

Default: true

Use SQLConf.ADAPTIVE_OPTIMIZE_SKEWS_IN_REBALANCE_PARTITIONS_ENABLED method to access the property (in a type-safe way)

skewJoin.enabled

spark.sql.adaptive.skewJoin.enabled

When true and spark.sql.adaptive.enabled is enabled, Spark dynamically handles skew in sort-merge join by splitting (and replicating if needed) skewed partitions.

Default: true

Use SQLConf.SKEW_JOIN_ENABLED to reference the property.

skewJoin.skewedPartitionFactor

spark.sql.adaptive.skewJoin.skewedPartitionFactor

A partition is considered skewed if its size is larger than this factor multiplying the median partition size and also larger than spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes.

Default: 5

skewJoin.skewedPartitionThresholdInBytes

spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes

A partition is considered skewed if its size in bytes is larger than this threshold and also larger than spark.sql.adaptive.skewJoin.skewedPartitionFactor multiplying the median partition size. Ideally this config should be set larger than spark.sql.adaptive.advisoryPartitionSizeInBytes.

Default: 256MB

allowNamedFunctionArguments

spark.sql.allowNamedFunctionArguments

Controls support for named parameters in function calls in SQL statements

Default: true

Used when:

autoBroadcastJoinThreshold

spark.sql.autoBroadcastJoinThreshold

Maximum size (in bytes) for a table that can be broadcast (to all worker nodes) in a join

Default: 10M

-1 (or any negative value) disables broadcasting

Use SQLConf.autoBroadcastJoinThreshold method to access the current value.
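
A sketch of how the threshold influences the join strategy, assuming an active SparkSession named spark; the DataFrames and sizes are arbitrary, and the selected plan may still vary with statistics and AQE.

val large = spark.range(1000000).withColumnRenamed("id", "k")
val small = spark.range(100).withColumnRenamed("id", "k")

// Disable automatic broadcast joins for this session
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
large.join(small, "k").explain()   // typically a sort-merge join

// Restore the default 10MB threshold; the small side becomes eligible for broadcast again
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10MB")
large.join(small, "k").explain()   // typically a broadcast hash join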

cache.serializer

spark.sql.cache.serializer

The name of CachedBatchSerializer implementation to translate SQL data into a format that can more efficiently be cached.

Default: org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer

spark.sql.cache.serializer is a StaticSQLConf

Use SQLConf.SPARK_CACHE_SERIALIZER for the name

Used when:

spark.sql.codegen

aggregate.fastHashMap.capacityBit

spark.sql.codegen.aggregate.fastHashMap.capacityBit

(internal) Capacity for the max number of rows to be held in memory by the fast hash aggregate product operator. The bit is not the actual value; the actual numBuckets is determined by loadFactor (e.g., with the default bit value of 16, the actual numBuckets is (1 << 16) / 0.5).

Default: 16

Must be in the range of [10, 30] (inclusive)

Use SQLConf.fastHashAggregateRowMaxCapacityBit for the current value

Used when:

aggregate.map.twolevel.enabled

spark.sql.codegen.aggregate.map.twolevel.enabled

(internal) Enable two-level aggregate hash map. When enabled, records will first be inserted/looked-up at a 1st-level, small, fast map, and then fallback to a 2nd-level, larger, slower map when 1st level is full or keys cannot be found. When disabled, records go directly to the 2nd level.

Default: true

Use SQLConf.enableTwoLevelAggMap for the current value

Used when:

aggregate.map.vectorized.enable

spark.sql.codegen.aggregate.map.vectorized.enable

(internal) Enables vectorized aggregate hash map. For testing/benchmarking only.

Default: false

Use SQLConf.enableVectorizedHashMap for the current value

Used when:

aggregate.map.twolevel.partialOnly

spark.sql.codegen.aggregate.map.twolevel.partialOnly

(internal) Enables two-level aggregate hash map for partial aggregates only, because a final aggregate might see more distinct keys than a partial aggregate. The overhead of looking up the 1st-level map might dominate when there are a lot of distinct keys.

Default: true

Used when:

hugeMethodLimit

spark.sql.codegen.hugeMethodLimit

(internal) The maximum bytecode size of a single compiled Java function generated by whole-stage codegen. When the compiled code has a function that exceeds this threshold, the whole-stage codegen is deactivated for this subtree of the query plan.

Default: 65535

The default value 65535 is the largest bytecode size possible for a valid Java method. When running on HotSpot, it may be preferable to set the value to 8000 (which is the value of HugeMethodLimit in the OpenJDK JVM settings)

Use SQLConf.hugeMethodLimit method to access the current value.

Used when:

  • WholeStageCodegenExec physical operator is executed

fallback

spark.sql.codegen.fallback

(internal) Whether the whole-stage codegen can be temporarily disabled for the part of a query that has failed to compile generated code (true) or not (false).

Default: true

Use SQLConf.wholeStageFallback method to access the current value.

Used when:

  • WholeStageCodegenExec physical operator is executed

join.fullOuterShuffledHashJoin.enabled

spark.sql.codegen.join.fullOuterShuffledHashJoin.enabled

(internal) Enables Whole-Stage Code Generation for FULL OUTER shuffled hash join

Default: true

Use SQLConf.ENABLE_FULL_OUTER_SHUFFLED_HASH_JOIN_CODEGEN to access the property

methodSplitThreshold

spark.sql.codegen.methodSplitThreshold

(internal) The threshold of source-code splitting in the codegen. When the number of characters in a single Java function (without comments) exceeds the threshold, the function will be automatically split into multiple smaller ones. We cannot know how much bytecode will be generated, so the code length is used as the metric. When running on HotSpot, a function's bytecode should not exceed 8KB, otherwise it will not be JIT-compiled; it also should not be too small, otherwise there will be many function calls.

Default: 1024

Use SQLConf.methodSplitThreshold for the current value

wholeStage

spark.sql.codegen.wholeStage

(internal) Whether the whole stage (of multiple physical operators) will be compiled into a single Java method (true) or not (false).

Default: true

Use SQLConf.wholeStageEnabled method to access the current value.

columnVector.offheap.enabled

spark.sql.columnVector.offheap.enabled

(internal) Enables OffHeapColumnVector (true) or OnHeapColumnVector (false) in ColumnarBatch

Default: false

Use SQLConf.offHeapColumnVectorEnabled for the current value

Used when:

defaultColumn.enabled

spark.sql.defaultColumn.enabled

(internal) When true, allows CREATE TABLE, REPLACE TABLE, and ALTER COLUMN statements to set or update default values for specific columns. Subsequent INSERT, MERGE, and UPDATE statements may then omit these values, and the default values will be injected automatically instead.

Default: true

Use SQLConf.enableDefaultColumns for the current value

Used when:

exchange.reuse

spark.sql.exchange.reuse

(internal) When enabled (true), the Spark planner will find duplicated exchanges and subqueries and re-use them.

When disabled (false), ReuseExchange and ReuseSubquery physical optimizations (that the Spark planner uses for physical query plan optimization) do nothing.

Default: true

Use SQLConf.exchangeReuseEnabled for the current value

spark.sql.execution

arrow.pyspark.enabled

spark.sql.execution.arrow.pyspark.enabled

When true, make use of Apache Arrow for columnar data transfers in PySpark. This optimization applies to:

  1. pyspark.sql.DataFrame.toPandas
  2. pyspark.sql.SparkSession.createDataFrame when its input is a Pandas DataFrame

The following data types are unsupported: BinaryType, MapType, ArrayType of TimestampType, and nested StructType.

Default: false

arrow.pyspark.fallback.enabled

spark.sql.execution.arrow.pyspark.fallback.enabled

When true, optimizations enabled by spark.sql.execution.arrow.pyspark.enabled will fallback automatically to non-optimized implementations if an error occurs.

Default: true

arrow.pyspark.selfDestruct.enabled

spark.sql.execution.arrow.pyspark.selfDestruct.enabled

(Experimental) When true, make use of Apache Arrow's self-destruct and split-blocks options for columnar data transfers in PySpark, when converting from Arrow to Pandas. This reduces memory usage at the cost of some CPU time. Applies to: pyspark.sql.DataFrame.toPandas when spark.sql.execution.arrow.pyspark.enabled is true.

Default: false

Use SQLConf.arrowPySparkSelfDestructEnabled for the current value

pandas.convertToArrowArraySafely

spark.sql.execution.pandas.convertToArrowArraySafely

(internal) When true, Arrow will perform safe type conversion when converting pandas.Series to an Arrow array during serialization. Arrow will raise errors when detecting unsafe type conversions like overflow. When false, Arrow's type checks are disabled and type conversions are done anyway. This config only works for Arrow 0.11.0+.

Default: false

pandas.udf.buffer.size

spark.sql.execution.pandas.udf.buffer.size

Same as spark.buffer.size but only applies to Pandas UDF executions. If it is not set, the fallback is spark.buffer.size. Note that Pandas execution requires more than 4 bytes. Lowering this value could make small Pandas UDF batches iterated and pipelined; however, it might degrade performance. See SPARK-27870.

Default: 65536

rangeExchange.sampleSizePerPartition

spark.sql.execution.rangeExchange.sampleSizePerPartition

(internal) Number of points to sample per partition in order to determine the range boundaries for range partitioning, typically used in global sorting (without limit).

Default: 100

Use SQLConf.rangeExchangeSampleSizePerPartition method to access the current value.

removeRedundantSorts

spark.sql.execution.removeRedundantSorts

(internal) Whether to remove redundant physical sort node

Default: true

Used as SQLConf.REMOVE_REDUNDANT_SORTS_ENABLED

replaceHashWithSortAgg

spark.sql.execution.replaceHashWithSortAgg

(internal) Enables replacing hash aggregate operators (i.e., HashAggregateExec and ObjectHashAggregateExec) with SortAggregateExec based on children's ordering

Default: false

Used when:

reuseSubquery

spark.sql.execution.reuseSubquery

(internal) When true, the planner will try to find duplicated subqueries and re-use them.

Default: true

Use SQLConf.subqueryReuseEnabled for the current value

sortBeforeRepartition

spark.sql.execution.sortBeforeRepartition

(internal) When performing a repartition following a shuffle, the output row ordering would be nondeterministic. If some downstream stages fail and some tasks of the repartition stage retry, these tasks may generate different data, and that can lead to correctness issues. Turn on this config to insert a local sort before actually doing the repartition to generate consistent repartition results. The performance of repartition() may go down since an extra local sort is inserted before it.

Default: true

Use SQLConf.sortBeforeRepartition method to access the current value.

usePartitionEvaluator

spark.sql.execution.usePartitionEvaluator

(internal) Enables PartitionEvaluator (Spark Core) to execute physical operators (using RDD.mapPartitionsWithEvaluator operator)

Default: false

Use SQLConf.usePartitionEvaluator for the current value

Used when:

  • FilterExec physical operator is executed
  • ProjectExec physical operator is executed
  • ColumnarToRowExec physical operator is executed
  • RowToColumnarExec physical operator is executed
  • WholeStageCodegenExec physical operator is executed
  • SortMergeJoinExec physical operator is executed
  • MapInBatchExec (PySpark) physical operator is executed
  • WindowGroupLimitExec physical operator is executed

spark.sql.hive

filesourcePartitionFileCacheSize

spark.sql.hive.filesourcePartitionFileCacheSize

When greater than 0, enables caching of partition file metadata in memory (using SharedInMemoryCache). All tables share a cache that can use up to specified num bytes for file metadata.

Requires spark.sql.hive.manageFilesourcePartitions to be enabled

Default: 250 * 1024 * 1024

Use SQLConf.filesourcePartitionFileCacheSize for the current value

Used when:

manageFilesourcePartitions

spark.sql.hive.manageFilesourcePartitions

Enables metastore partition management for file source tables.

This includes both datasource and Hive tables. When partition management is enabled, datasource tables store partitions in the Hive metastore, and use the metastore to prune partitions during query planning when spark.sql.hive.metastorePartitionPruning is enabled.

Default: true

Use SQLConf.manageFilesourcePartitions for the current value

Used when:

inMemoryColumnarStorage.partitionPruning

spark.sql.inMemoryColumnarStorage.partitionPruning

(internal) Enables partition pruning for in-memory columnar tables

Default: true

Use SQLConf.inMemoryPartitionPruning for the current value

Used when:

spark.sql.optimizer

canChangeCachedPlanOutputPartitioning

spark.sql.optimizer.canChangeCachedPlanOutputPartitioning

(internal) Whether to forcibly enable some optimization rules that can change the output partitioning of a cached query when executing it for caching. If it is set to true, queries may need an extra shuffle to read the cached data. This configuration is disabled by default. Currently, the optimization rules enabled by this configuration are spark.sql.adaptive.enabled and spark.sql.sources.bucketing.autoBucketedScan.enabled.

Default: false

Use SQLConf.CAN_CHANGE_CACHED_PLAN_OUTPUT_PARTITIONING to access the property

decorrelateInnerQuery.enabled

spark.sql.optimizer.decorrelateInnerQuery.enabled

(internal) Decorrelates inner queries by eliminating correlated references and building domain joins

Default: true

Use SQLConf.decorrelateInnerQueryEnabled for the current value

dynamicPartitionPruning.fallbackFilterRatio

spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio

(internal) When statistics are not available or configured not to be used, this config will be used as the fallback filter ratio for computing the data size of the partitioned table after dynamic partition pruning, in order to evaluate if it is worth adding an extra subquery as the pruning filter if broadcast reuse is not applicable.

Default: 0.5

Use SQLConf.dynamicPartitionPruningFallbackFilterRatio method to access the current value.

dynamicPartitionPruning.pruningSideExtraFilterRatio

spark.sql.optimizer.dynamicPartitionPruning.pruningSideExtraFilterRatio

(internal) When the filtering side doesn't support broadcast by join type, doing DPP means running an extra query that may have significant overhead. This config is used as the extra filter ratio for computing the data size of the pruning side after DPP, in order to evaluate whether it is worth adding an extra subquery as the pruning filter.

Must be a double between 0.0 and 1.0

Default: 0.04

Use SQLConf.dynamicPartitionPruningPruningSideExtraFilterRatio to access the current value.

dynamicPartitionPruning.useStats

spark.sql.optimizer.dynamicPartitionPruning.useStats

(internal) When true, distinct count statistics will be used for computing the data size of the partitioned table after dynamic partition pruning, in order to evaluate if it is worth adding an extra subquery as the pruning filter if broadcast reuse is not applicable.

Default: true

Use SQLConf.dynamicPartitionPruningUseStats for the current value

Used when:

dynamicPartitionPruning.enabled

spark.sql.optimizer.dynamicPartitionPruning.enabled

Enables generating predicates for partition columns used as join keys

Default: true

Use SQLConf.dynamicPartitionPruningEnabled for the current value

Used to control whether to execute the following optimizations or skip them altogether:

dynamicPartitionPruning.reuseBroadcastOnly

spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly

(internal) When true, dynamic partition pruning will only apply when the broadcast exchange of a broadcast hash join operation can be reused as the dynamic pruning filter.

Default: true

Use SQLConf.dynamicPartitionPruningReuseBroadcastOnly for the current value

Used when:

enableCsvExpressionOptimization

spark.sql.optimizer.enableCsvExpressionOptimization

Whether to optimize CSV expressions in SQL optimizer. It includes pruning unnecessary columns from from_csv.

Default: true

Use SQLConf.csvExpressionOptimization for the current value

excludedRules

spark.sql.optimizer.excludedRules

Comma-separated list of fully-qualified class names of the optimization rules that should be disabled (excluded) from logical query optimization.

Default: (empty)

Use SQLConf.optimizerExcludedRules method to access the current value.

Important

It is not guaranteed that all the rules to be excluded will eventually be excluded, as some rules are non-excludable.

expressionProjectionCandidateLimit

spark.sql.optimizer.expressionProjectionCandidateLimit

(internal) The maximum number of candidate output expressions whose aliases are replaced. This can preserve the output partitioning and ordering. A negative value disables this optimization.

Default: 100

Used when:

expression.nestedPruning.enabled

spark.sql.optimizer.expression.nestedPruning.enabled

(internal) Prune nested fields from expressions in an operator which are unnecessary in satisfying a query. Note that this optimization doesn't prune nested fields from physical data source scanning. For pruning nested fields from scanning, please use spark.sql.optimizer.nestedSchemaPruning.enabled config.

Default: true

inSetConversionThreshold

spark.sql.optimizer.inSetConversionThreshold

(internal) The threshold of set size for InSet conversion.

Default: 10

Use SQLConf.optimizerInSetConversionThreshold method to access the current value.

inSetSwitchThreshold

spark.sql.optimizer.inSetSwitchThreshold

(internal) Configures the max set size in InSet for which Spark will generate code with switch statements. This is applicable only to bytes, shorts, ints, dates.

Must be non-negative and less than or equal to 600.

Default: 400

maxIterations

spark.sql.optimizer.maxIterations

Maximum number of iterations for Analyzer and Logical Optimizer.

Default: 100

nestedSchemaPruning.enabled

spark.sql.optimizer.nestedSchemaPruning.enabled

(internal) Prune nested fields from the output of a logical relation that are not necessary in satisfying a query. This optimization allows columnar file format readers to avoid reading unnecessary nested column data.

Default: true

Use SQLConf.nestedSchemaPruningEnabled method to access the current value.

nestedPredicatePushdown.supportedFileSources

spark.sql.optimizer.nestedPredicatePushdown.supportedFileSources

(internal) A comma-separated list of data source short names or fully qualified data source implementation class names for which Spark tries to push down predicates for nested columns and/or names containing dots to data sources. This configuration is only effective with file-based data source in DSv1. Currently, Parquet implements both optimizations while ORC only supports predicates for names containing dots. The other data sources don't support this feature yet.

Default: parquet,orc

optimizeOneRowRelationSubquery

spark.sql.optimizer.optimizeOneRowRelationSubquery

(internal) When true, the optimizer will inline subqueries with OneRowRelation leaf nodes

Default: true

Use SQLConf.OPTIMIZE_ONE_ROW_RELATION_SUBQUERY method to access the property (in a type-safe way)

planChangeLog.batches

spark.sql.optimizer.planChangeLog.batches

(internal) Configures a list of batches to be logged in the optimizer, in which the batches are specified by their batch names and separated by comma.

Default: (undefined)

planChangeLog.level

spark.sql.optimizer.planChangeLog.level

(internal) Configures the log level for logging the change from the original plan to the new plan after a rule or batch is applied. The value can be TRACE, DEBUG, INFO, WARN or ERROR.

Default: TRACE

planChangeLog.rules

spark.sql.optimizer.planChangeLog.rules

(internal) Configures a list of rules to be logged in the optimizer, in which the rules are specified by their rule names and separated by comma.

Default: (undefined)

propagateDistinctKeys.enabled

spark.sql.optimizer.propagateDistinctKeys.enabled

(internal) Controls whether the Logical Query Optimizer propagates the distinct attributes of logical operators for query optimization

Default: true

Used when:

  • LogicalPlanDistinctKeys logical operator is requested for the distinct keys

replaceExceptWithFilter

spark.sql.optimizer.replaceExceptWithFilter

(internal) When true, the apply function of the rule verifies whether the right node of the except operation is of type Filter or Project followed by Filter. If so, the rule further verifies the following conditions:

  1. Excluding the filter operations from the right (as well as the left node, if any) on the top, whether both nodes evaluate to the same result
  2. The left and right nodes don't contain any SubqueryExpressions
  3. The output column names of the left node are distinct

If all the conditions are met, the rule will replace the except operation with a Filter by flipping the filter condition(s) of the right node.

Default: true

runtime.bloomFilter.creationSideThreshold

spark.sql.optimizer.runtime.bloomFilter.creationSideThreshold

Size threshold of the bloom filter creation side plan. Estimated size needs to be under this value to try to inject bloom filter.

Default: 10MB

Use SQLConf.runtimeFilterCreationSideThreshold for the current value

Used when:

runtime.bloomFilter.enabled

spark.sql.optimizer.runtime.bloomFilter.enabled

Enables a bloom filter on one side of a shuffle join if the other side has a selective predicate (to reduce the amount of shuffle data)

Default: true

Use SQLConf.runtimeFilterBloomFilterEnabled for the current value

Used when:

runtime.bloomFilter.expectedNumItems

spark.sql.optimizer.runtime.bloomFilter.expectedNumItems

The default number of expected items for the runtime bloomfilter

Default: 1000000L

SQLConf.RUNTIME_BLOOM_FILTER_EXPECTED_NUM_ITEMS

Used when:

runtime.bloomFilter.maxNumBits

spark.sql.optimizer.runtime.bloomFilter.maxNumBits

Maximum number of bits for the runtime bloom filter

Default: 67108864L (8MB)

Must be a non-zero positive number

SQLConf.RUNTIME_BLOOM_FILTER_MAX_NUM_BITS

Used when:

runtime.rowLevelOperationGroupFilter.enabled

spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled

Enables runtime group filtering for group-based row-level operations.

Data sources that replace groups of data (e.g. files, partitions) may prune entire groups using provided data source filters when planning a row-level operation scan. However, such filtering is limited as not all expressions can be converted into data source filters and some expressions can only be evaluated by Spark (e.g. subqueries). Since rewriting groups is expensive, Spark can execute a query at runtime to find what records match the condition of the row-level operation. The information about matching records will be passed back to the row-level operation scan, allowing data sources to discard groups that don't have to be rewritten.

Default: true

Current value: SQLConf.runtimeRowLevelOperationGroupFilterEnabled

Used when:

  • RowLevelOperationRuntimeGroupFiltering logical optimization is executed

runtimeFilter.number.threshold

spark.sql.optimizer.runtimeFilter.number.threshold

The total number of injected runtime filters (non-DPP) for a single query. This is to prevent driver OOMs with too many Bloom filters.

Default: 10

Must be a non-zero positive number

SQLConf.RUNTIME_FILTER_NUMBER_THRESHOLD

Used when:

runtimeFilter.semiJoinReduction.enabled

spark.sql.optimizer.runtimeFilter.semiJoinReduction.enabled

Enables inserting a semi join on one side of a shuffle join if the other side has a selective predicate (to reduce the amount of shuffle data)

Default: false

Use SQLConf.runtimeFilterSemiJoinReductionEnabled for the current value

Used when:

serializer.nestedSchemaPruning.enabled

spark.sql.optimizer.serializer.nestedSchemaPruning.enabled

(internal) Prune nested fields from object serialization operator which are unnecessary in satisfying a query. This optimization allows object serializers to avoid executing unnecessary nested expressions.

Default: true

spark.sql

retainGroupColumns

spark.sql.retainGroupColumns

(internal) Controls whether to include (retain) grouping columns or not in Aggregation Queries

Default: true

Use SQLConf.dataFrameRetainGroupColumns for the current value
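
A sketch of the effect on aggregation results, assuming an active SparkSession named spark; the column names are arbitrary.

import org.apache.spark.sql.functions.col

val df = spark.range(4).withColumn("g", col("id") % 2)

// Default (true): the grouping column g is part of the result
df.groupBy("g").count().show()

// When disabled, only the aggregate columns are returned
spark.conf.set("spark.sql.retainGroupColumns", false)
df.groupBy("g").count().show()
spark.conf.set("spark.sql.retainGroupColumns", true)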

spark.sql.files

maxPartitionBytes

spark.sql.files.maxPartitionBytes

Maximum number of bytes to pack into a single partition when reading files for file-based data sources (e.g., Parquet)

Default: 128MB (like parquet.block.size)

Use SQLConf.filesMaxPartitionBytes for the current value

Used when:

maxPartitionNum

spark.sql.files.maxPartitionNum

The suggested (not guaranteed) maximum number of split file partitions. If set, Spark will rescale each partition to make the number of partitions close to this value if the initial number of partitions exceeds this value.

Effective only with file-based sources such as Parquet, JSON and ORC.

Default: (undefined)

Use SQLConf.filesMaxPartitionNum for the current value

Used when:

maxRecordsPerFile

spark.sql.files.maxRecordsPerFile

Maximum number of records to write out to a single file. If 0 or negative, there is no limit.

Default: 0

Use SQLConf.maxRecordsPerFile method for the current value

Used when:

minPartitionNum

spark.sql.files.minPartitionNum

Hint about the minimum number of partitions for file-based data sources (e.g., Parquet)

Default: spark.sql.leafNodeDefaultParallelism

Use SQLConf.filesMinPartitionNum for the current value

Used when:

openCostInBytes

spark.sql.files.openCostInBytes

(internal) The estimated cost to open a file, measured by the number of bytes that could be scanned at the same time (used when packing multiple files into a partition). Effective only for file-based sources such as Parquet, JSON and ORC.

Default: 4MB

It's better to over-estimate it; then the partitions with small files will be faster than partitions with bigger files (which are scheduled first).

Use SQLConf.filesOpenCostInBytes for the current value

Used when:

spark.sql.parquet

aggregatePushdown

spark.sql.parquet.aggregatePushdown

Controls aggregate pushdown in parquet connector

Supports MIN, MAX and COUNT as aggregate expressions:

  • For MIN/MAX, boolean, integer, float and date types are supported.
  • For COUNT, all data types are supported.

If statistics are missing from any Parquet file footer, an exception is thrown.

Default: false

Use SQLConf.parquetAggregatePushDown for the current value

Used when:

columnarReaderBatchSize

spark.sql.parquet.columnarReaderBatchSize

The number of rows to include in a parquet vectorized reader batch (the capacity of VectorizedParquetRecordReader)

Default: 4096 (4k)

The number should be carefully chosen to minimize overhead and avoid OOMs while reading data.

Use SQLConf.parquetVectorizedReaderBatchSize for the current value

Used when:

enableNestedColumnVectorizedReader

spark.sql.parquet.enableNestedColumnVectorizedReader

Enables vectorized parquet decoding for nested columns (e.g., arrays, structs and maps). Requires spark.sql.parquet.enableVectorizedReader to be enabled

Default: true

Use SQLConf.parquetVectorizedReaderNestedColumnEnabled for the current value

Used when:

filterPushdown

spark.sql.parquet.filterPushdown

Controls filter predicate push-down optimization for parquet connector

Default: true

Use SQLConf.parquetFilterPushDown for the current value

Used when:

filterPushdown.stringPredicate

spark.sql.parquet.filterPushdown.stringPredicate

(internal) Controls Parquet filter push-down optimization for string predicate such as startsWith/endsWith/contains functions. Effective only with spark.sql.parquet.filterPushdown enabled.

Fallback Property: spark.sql.parquet.filterPushdown.string.startsWith

Use SQLConf.parquetFilterPushDownStringPredicate for the current value

Used when:

mergeSchema

spark.sql.parquet.mergeSchema

Controls whether the Parquet data source merges schemas collected from all data files or not. If false, the schema is picked from the summary file or a random data file if no summary file is available.

Default: false

Use SQLConf.isParquetSchemaMergingEnabled for the current value

Parquet option (of higher priority): mergeSchema

Used when:

  • ParquetOptions is created (and initializes mergeSchema option)
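
A sketch of schema merging and the read-time option taking precedence, assuming an active SparkSession named spark; /tmp/parquet_merge_demo is a hypothetical path.

import org.apache.spark.sql.functions.col

// Write two sets of Parquet files with different (compatible) schemas
spark.range(3).select(col("id"), (col("id") * col("id")).as("square"))
  .write.parquet("/tmp/parquet_merge_demo/key=1")
spark.range(3).select(col("id"), (col("id") * col("id") * col("id")).as("cube"))
  .write.parquet("/tmp/parquet_merge_demo/key=2")

// The read-time mergeSchema option overrides spark.sql.parquet.mergeSchema
val merged = spark.read.option("mergeSchema", "true").parquet("/tmp/parquet_merge_demo")
merged.printSchema()   // id, square, cube, key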

output.committer.class

spark.sql.parquet.output.committer.class

(internal) The output committer class used by parquet data source. The specified class needs to be a subclass of org.apache.hadoop.mapreduce.OutputCommitter. Typically, it's also a subclass of org.apache.parquet.hadoop.ParquetOutputCommitter. If it is not, then metadata summaries will never be created, irrespective of the value of parquet.summary.metadata.level.

Default: org.apache.parquet.hadoop.ParquetOutputCommitter

Use SQLConf.parquetOutputCommitterClass for the current value

Used when:

spark.sql.sources

bucketing.enabled

spark.sql.sources.bucketing.enabled

Enables Bucketing

Default: true

When disabled (i.e. false), bucketed tables are considered regular (non-bucketed) tables.

Use SQLConf.bucketingEnabled method for the current value

commitProtocolClass

spark.sql.sources.commitProtocolClass

(internal) Fully-qualified class name of a FileCommitProtocol (Spark Core) for Transactional Writes

Default: SQLHadoopMapReduceCommitProtocol

Use SQLConf.fileCommitProtocolClass method for the current value

Used when:

outputCommitterClass

spark.sql.sources.outputCommitterClass

(internal) The fully-qualified class name of the user-defined Hadoop OutputCommitter for SQLHadoopMapReduceCommitProtocol

Default: (undefined)

Use SQLConf.OUTPUT_COMMITTER_CLASS to access the property

Note

ParquetUtils uses spark.sql.parquet.output.committer.class or the default ParquetOutputCommitter instead.

spark.sql.objectHashAggregate.sortBased.fallbackThreshold

(internal) The number of entries in an in-memory hash map (to store aggregation buffers per grouping keys) before ObjectHashAggregateExec (ObjectAggregationIterator, precisely) falls back to sort-based aggregation

Default: 128 (entries)

Use SQLConf.objectAggSortBasedFallbackThreshold for the current value

Learn more in Demo: ObjectHashAggregateExec and Sort-Based Fallback Tasks

spark.sql.legacy.allowNonEmptyLocationInCTAS

(internal) When false, CTAS with LOCATION throws an analysis exception if the location is not empty.

Default: false

Use SQLConf.allowNonEmptyLocationInCTAS for the current value

spark.sql.legacy.allowAutoGeneratedAliasForView

(internal) When true, it's allowed to use an input query without explicit alias when creating a permanent view.

Default: false

Use SQLConf.allowAutoGeneratedAliasForView for the current value

spark.sql.sessionWindow.buffer.spill.threshold

(internal) The threshold for the number of rows to be spilled by the window operator. Note that the buffer is used only for queries where Spark SQL cannot apply aggregations while determining the session window.

Default: spark.shuffle.spill.numElementsForceSpillThreshold

Use SQLConf.sessionWindowBufferSpillThreshold for the current value

spark.sql.legacy.allowStarWithSingleTableIdentifierInCount

(internal) When true, the SQL function count is allowed to take a single tblName.* as parameter

Default: false

Use SQLConf.allowStarWithSingleTableIdentifierInCount for the current value

spark.sql.sessionWindow.buffer.in.memory.threshold

(internal) Threshold for the number of windows guaranteed to be held in memory by the session window operator. Note that the buffer is used only for queries where Spark SQL cannot apply aggregations while determining the session window.

Default: 4096

Use SQLConf.sessionWindowBufferInMemoryThreshold for the current value

spark.sql.orc.enableNestedColumnVectorizedReader

Enables vectorized orc decoding for nested columns

Default: false

Use SQLConf.orcVectorizedReaderNestedColumnEnabled for the current value

spark.sql.analyzer.maxIterations

(internal) The max number of iterations the analyzer runs.

Default: 100

spark.sql.analyzer.failAmbiguousSelfJoin

(internal) When true, fail the Dataset query if it contains ambiguous self-join.

Default: true

spark.sql.ansi.enabled

When true, Spark tries to conform to the ANSI SQL specification:

  1. Spark will throw a runtime exception if an overflow occurs in any operation on integral/decimal field.
  2. Spark will forbid using the reserved keywords of ANSI SQL as identifiers in the SQL parser.

Default: false
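
A sketch of the runtime behavior for integer overflow, assuming an active SparkSession named spark.

// With ANSI mode off, integer overflow wraps around silently
spark.conf.set("spark.sql.ansi.enabled", false)
spark.sql("SELECT 2147483647 + 1 AS v").show()     // -2147483648

// With ANSI mode on, the same expression throws an overflow exception at runtime
spark.conf.set("spark.sql.ansi.enabled", true)
// spark.sql("SELECT 2147483647 + 1 AS v").show()  // would fail
spark.conf.set("spark.sql.ansi.enabled", false)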

spark.sql.cli.print.header

When true, spark-sql CLI prints the names of the columns in query output

Default: false

Use SQLConf.cliPrintHeader for the current value

spark.sql.debug.maxToStringFields

Maximum number of fields of sequence-like entries that can be converted to strings in debug output. Any elements beyond the limit will be dropped and replaced by a "... N more fields" placeholder.

Default: 25

Use SQLConf.maxToStringFields method to access the current value.

spark.sql.defaultCatalog

Name of the default catalog

Default: spark_catalog

Use SQLConf.DEFAULT_CATALOG to access the current value.

spark.sql.statistics.histogram.enabled

Enables generating histograms for ANALYZE TABLE SQL statement

Default: false

Equi-Height Histogram

Histograms can provide better estimation accuracy. Currently, Spark only supports equi-height histogram. Note that collecting histograms takes extra cost. For example, collecting column statistics usually takes only one table scan, but generating equi-height histogram will cause an extra table scan.

Use SQLConf.histogramEnabled method to access the current value.
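
A sketch of collecting column statistics with histograms enabled, assuming an active SparkSession named spark with a writable warehouse; histo_demo is a hypothetical table name.

spark.range(1000).write.saveAsTable("histo_demo")

// With histograms enabled, ANALYZE TABLE ... FOR COLUMNS runs an extra table scan
spark.conf.set("spark.sql.statistics.histogram.enabled", true)
spark.sql("ANALYZE TABLE histo_demo COMPUTE STATISTICS FOR COLUMNS id")

// Column-level statistics (including the histogram) appear in the extended description
spark.sql("DESC EXTENDED histo_demo id").show(truncate = false)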

spark.sql.session.timeZone

The ID of session-local timezone (e.g. "GMT", "America/Los_Angeles")

Default: Java's TimeZone.getDefault.getID

Use SQLConf.sessionLocalTimeZone method to access the current value.
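
A sketch of changing the session time zone, assuming an active SparkSession named spark; the rendered timestamps differ between the two queries.

spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.sql("SELECT current_timestamp() AS now").show(truncate = false)

// The property can equally be set with the SQL SET command
spark.sql("SET spark.sql.session.timeZone=America/Los_Angeles")
spark.sql("SELECT current_timestamp() AS now").show(truncate = false)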

spark.sql.sources.ignoreDataLocality

(internal) When true, Spark will not fetch the block locations for each file on listing files. This speeds up file listing, but the scheduler cannot schedule tasks to take advantage of data locality. It can be particularly useful if data is read from a remote cluster so the scheduler could never take advantage of locality anyway.

Default: false

spark.sql.sources.validatePartitionColumns

(internal) When this option is set to true, partition column values will be validated against the user-specified schema. If the validation fails, a runtime exception is thrown. When this option is set to false, the partition column value will be converted to null if it cannot be cast to the corresponding user-specified schema.

Default: true

spark.sql.sources.useV1SourceList

(internal) A comma-separated list of data source short names (DataSourceRegisters) or fully-qualified canonical class names of the data sources (TableProviders) for which DataSource V2 code path is disabled (and Data Source V1 code path used).

Default: avro,csv,json,kafka,orc,parquet,text

Used when:

spark.sql.storeAssignmentPolicy

When inserting a value into a column with a different data type, Spark will perform type coercion. Three policies for the type coercion rules are supported:

  • ANSI: Spark performs the type coercion as per ANSI SQL. In practice, the behavior is mostly the same as PostgreSQL. It disallows certain unreasonable type conversions such as converting string to int or double to boolean.
  • LEGACY: Spark allows the type coercion as long as it is a valid Cast, which is very loose (e.g. converting string to int or double to boolean is allowed). It is also the only behavior in Spark 2.x and it is compatible with Hive.
  • STRICT: Spark doesn't allow any possible precision loss or data truncation in type coercion (e.g. converting double to int or decimal to double is not allowed).

Possible values: ANSI, LEGACY, STRICT

Default: ANSI

spark.sql.thriftServer.interruptOnCancel

When true, all running tasks will be interrupted if one cancels a query. When false, all running tasks will remain until finished.

Default: false

Use SQLConf.THRIFTSERVER_FORCE_CANCEL to access the property

spark.sql.hive.tablePropertyLengthThreshold

(internal) The maximum length allowed in a single cell when storing Spark-specific information in Hive's metastore as table properties. Currently it covers 2 things: the schema's JSON string, the histogram of column statistics.

Default: (undefined)

spark.sql.orc.mergeSchema

When true, the Orc data source merges schemas collected from all data files, otherwise the schema is picked from a random data file.

Default: false

spark.sql.sources.bucketing.autoBucketedScan.enabled

When true, decide whether to do bucketed scan on input tables based on query plan automatically. Do not use bucketed scan if 1. query does not have operators to utilize bucketing (e.g. join, group-by, etc), or 2. there's an exchange operator between these operators and table scan.

Note when spark.sql.sources.bucketing.enabled is set to false, this configuration does not take any effect.

Default: true

Use SQLConf.autoBucketedScanEnabled to access the property

spark.sql.datetime.java8API.enabled

When true, java.time.Instant and java.time.LocalDate classes of Java 8 API are used as external types for Catalyst's TimestampType and DateType. When false, java.sql.Timestamp and java.sql.Date are used for the same purpose.

Default: false
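
A sketch of the external types used when collecting a DateType value, assuming an active SparkSession named spark.

// With the Java 8 API enabled, DateType values are collected as java.time.LocalDate
spark.conf.set("spark.sql.datetime.java8API.enabled", true)
assert(spark.sql("SELECT DATE'2021-01-01' AS d").head.get(0).isInstanceOf[java.time.LocalDate])

// With it disabled, the same value comes back as java.sql.Date
spark.conf.set("spark.sql.datetime.java8API.enabled", false)
assert(spark.sql("SELECT DATE'2021-01-01' AS d").head.get(0).isInstanceOf[java.sql.Date])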

spark.sql.legacy.interval.enabled

(internal) When true, Spark SQL uses the mixed legacy interval type CalendarIntervalType instead of the ANSI compliant interval types YearMonthIntervalType and DayTimeIntervalType. For instance, the date subtraction expression returns CalendarIntervalType when the SQL config is set to true otherwise an ANSI interval.

Default: false

Use SQLConf.legacyIntervalEnabled to access the current value

spark.sql.sources.binaryFile.maxLength

(internal) The max length of a file that can be read by the binary file data source. Spark will fail fast and not attempt to read the file if its length exceeds this value. The theoretical max is Int.MaxValue, though VMs might implement a smaller max.

Default: Int.MaxValue

spark.sql.mapKeyDedupPolicy

The policy to deduplicate map keys in builtin function: CreateMap, MapFromArrays, MapFromEntries, StringToMap, MapConcat and TransformKeys. When EXCEPTION, the query fails if duplicated map keys are detected. When LAST_WIN, the map key that is inserted at last takes precedence.

Possible values: EXCEPTION, LAST_WIN

Default: EXCEPTION
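
A sketch of both policies with the map builtin function, assuming an active SparkSession named spark.

// Under the default EXCEPTION policy, a duplicated key fails the query at runtime
spark.conf.set("spark.sql.mapKeyDedupPolicy", "EXCEPTION")
// spark.sql("SELECT map(1, 'a', 1, 'b') AS m").show()   // would fail: duplicate map key 1

// Under LAST_WIN, the value inserted last is kept
spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN")
spark.sql("SELECT map(1, 'a', 1, 'b') AS m").show(truncate = false)   // {1 -> b}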

spark.sql.maxConcurrentOutputFileWriters

(internal) Maximum number of output file writers for FileFormatWriter to use concurrently (writing out a query result). If the number of writers needed reaches this limit, a task will sort the rest of the output and then write it out.

Default: 0

Use SQLConf.maxConcurrentOutputFileWriters for the current value

spark.sql.maxMetadataStringLength

Maximum number of characters to output for a metadata string (e.g., Location in FileScan)

Default: 100

Must be bigger than 3

Use SQLConf.maxMetadataStringLength for the current value

spark.sql.maven.additionalRemoteRepositories

A comma-delimited string config of the optional additional remote Maven mirror repositories. This is only used for downloading Hive jars in IsolatedClientLoader if the default Maven Central repo is unreachable.

Default: https://maven-central.storage-download.googleapis.com/maven2/

spark.sql.maxPlanStringLength

Maximum number of characters to output for a plan string. If the plan is longer, further output will be truncated. The default setting always generates a full plan. Set this to a lower value such as 8k if plan strings are taking up too much memory or are causing OutOfMemory errors in the driver or UI processes.

Default: Integer.MAX_VALUE - 15

spark.sql.addPartitionInBatch.size

(internal) The number of partitions to be handled in one turn when AlterTableAddPartitionCommand adds partitions to a table. The smaller the batch size, the less memory is required for the real handler (e.g., the Hive Metastore).

Default: 100

spark.sql.scriptTransformation.exitTimeoutInSeconds

(internal) Timeout for an executor to wait for the termination of a transformation script at EOF.

Default: 10 seconds

spark.sql.avro.compression.codec

The compression codec to use when writing Avro data to disk

Default: snappy

The supported codecs are:

  • uncompressed
  • deflate
  • snappy
  • bzip2
  • xz

Use SQLConf.avroCompressionCodec method to access the current value.

spark.sql.broadcastTimeout

Timeout in seconds for the broadcast wait time in broadcast joins.

Default: 5 * 60

When negative, it is assumed infinite (i.e. Duration.Inf)

Use SQLConf.broadcastTimeout method to access the current value.

spark.sql.bucketing.coalesceBucketsInJoin.enabled

When enabled (true), if two bucketed tables with different numbers of buckets are joined, the side with the bigger number of buckets will be coalesced to have the same number of buckets as the other side (this applies only when the bigger number of buckets is divisible by the smaller one). Bucket coalescing is applied to sort-merge joins and shuffled hash joins.

Note

Coalescing bucketed table can avoid unnecessary shuffling in join, but it also reduces parallelism and could possibly cause OOM for shuffled hash join.

Default: false

Use SQLConf.coalesceBucketsInJoinEnabled method to access the current value.

spark.sql.caseSensitive

(internal) Controls whether the query analyzer should be case sensitive (true) or not (false).

Default: false

It is highly discouraged to turn on case sensitive mode.

Use SQLConf.caseSensitiveAnalysis method to access the current value.
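
A sketch of column resolution under both modes, assuming an active SparkSession named spark; the column name is arbitrary.

val df = spark.range(1).withColumnRenamed("id", "Value")

// Case-insensitive analysis (default): "value" resolves to the column "Value"
df.select("value").show()

// With case-sensitive analysis, the same reference no longer resolves
spark.conf.set("spark.sql.caseSensitive", true)
// df.select("value").show()   // would fail with an AnalysisException
spark.conf.set("spark.sql.caseSensitive", false)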

spark.sql.catalog.spark_catalog

The CatalogPlugin for spark_catalog

Default: defaultSessionCatalog

spark.sql.cbo.enabled

Enables Cost-Based Optimization (CBO) for estimation of plan statistics when true.

Default: false

Use SQLConf.cboEnabled method to access the current value.

spark.sql.cbo.joinReorder.enabled

Enables join reorder for cost-based optimization (CBO).

Default: false

Use SQLConf.joinReorderEnabled method to access the current value.

spark.sql.cbo.planStats.enabled

When true, the logical plan will fetch row counts and column statistics from catalog.

Default: false

spark.sql.cbo.starSchemaDetection

Enables join reordering based on star schema detection for cost-based optimization (CBO) in ReorderJoin logical plan optimization.

Default: false

Use SQLConf.starSchemaDetection method to access the current value.

spark.sql.codegen

aggregate.map.vectorized.enable

spark.sql.codegen.aggregate.map.vectorized.enable

(internal) Enables vectorized aggregate hash map. This is for testing/benchmarking only.

Default: false

aggregate.sortAggregate.enabled

spark.sql.codegen.aggregate.sortAggregate.enabled

Enables code generation for SortAggregateExec physical operator

Default: true

aggregate.splitAggregateFunc.enabled

spark.sql.codegen.aggregate.splitAggregateFunc.enabled

(internal) When true, the code generator would split aggregate code into individual methods instead of a single big method. This can be used to avoid oversized functions that can miss the opportunity of JIT optimization.

Default: true

comments

spark.sql.codegen.comments

Controls whether CodegenContext should register comments (true) or not (false).

Default: false

factoryMode

spark.sql.codegen.factoryMode

(internal) Determines the codegen generator fallback behavior

Default: FALLBACK

Acceptable values:

  • CODEGEN_ONLY - disable fallback mode
  • FALLBACK - try codegen first and, if any compile error happens, fallback to interpreted mode
  • NO_CODEGEN - skips codegen and always uses interpreted path

Used when CodeGeneratorWithInterpretedFallback is requested to createObject (when UnsafeProjection is requested to create an UnsafeProjection for Catalyst expressions)

useIdInClassName

spark.sql.codegen.useIdInClassName

(internal) Controls whether to embed the (whole-stage) codegen stage ID into the class name of the generated class as a suffix (true) or not (false)

Default: true

Use SQLConf.wholeStageUseIdInClassName method to access the current value.

maxFields

spark.sql.codegen.maxFields

(internal) Maximum number of output fields (including nested fields) that whole-stage codegen supports. Going above the number deactivates whole-stage codegen.

Default: 100

Use SQLConf.wholeStageMaxNumFields method to access the current value.

splitConsumeFuncByOperator

spark.sql.codegen.splitConsumeFuncByOperator

(internal) Controls whether whole stage codegen puts the logic of consuming rows of each physical operator into individual methods, instead of a single big method. This can be used to avoid oversized functions that can miss the opportunity of JIT optimization.

Default: true

Use SQLConf.wholeStageSplitConsumeFuncByOperator method to access the current value.

spark.sql.columnNameOfCorruptRecord

spark.sql.constraintPropagation.enabled

(internal) When true, the query optimizer will infer and propagate data constraints in the query plan to optimize them. Constraint propagation can sometimes be computationally expensive for certain kinds of query plans (such as those with a large number of predicates and aliases) which might negatively impact overall runtime.

Default: true

Use SQLConf.constraintPropagationEnabled method to access the current value.

spark.sql.csv.filterPushdown.enabled

(internal) When true, enable filter pushdown to CSV datasource.

Default: true

spark.sql.defaultSizeInBytes

(internal) Estimated size of a table or relation used in query planning

Default: Java's Long.MaxValue

Set to Java's Long.MaxValue which is larger than spark.sql.autoBroadcastJoinThreshold to be more conservative. That is to say by default the optimizer will not choose to broadcast a table unless it knows for sure that the table size is small enough.

Used by the planner to decide when it is safe to broadcast a relation. By default, the system will assume that tables are too large to broadcast.

Use SQLConf.defaultSizeInBytes method to access the current value.

spark.sql.dialect

execution.useObjectHashAggregateExec

spark.sql.execution.useObjectHashAggregateExec

(internal) Prefers ObjectHashAggregateExec (over SortAggregateExec) for aggregation

Default: true

Use SQLConf.useObjectHashAggregation method to access the current value.

spark.sql.files.ignoreCorruptFiles

Controls whether to ignore corrupt files (true) or not (false). If true, the Spark jobs will continue to run when encountering corrupted files and the contents that have been read will still be returned.

Default: false

Use SQLConf.ignoreCorruptFiles method to access the current value.

spark.sql.files.ignoreMissingFiles

Controls whether to ignore missing files (true) or not (false). If true, the Spark jobs will continue to run when encountering missing files and the contents that have been read will still be returned.

Default: false

Use SQLConf.ignoreMissingFiles method to access the current value.

spark.sql.inMemoryColumnarStorage.compressed

When enabled, Spark SQL will automatically select a compression codec for each column based on statistics of the data.

Default: true

Use SQLConf.useCompression method to access the current value.

spark.sql.inMemoryColumnarStorage.batchSize

Controls the size of batches for columnar caching. Larger batch sizes can improve memory utilization and compression, but risk OOMs when caching data.

Default: 10000

Use SQLConf.columnBatchSize method to access the current value.

spark.sql.inMemoryTableScanStatistics.enable

(internal) When true, enable in-memory table scan accumulators.

Default: false

spark.sql.inMemoryColumnarStorage.enableVectorizedReader

Enables vectorized reader for columnar caching

Default: true

Use SQLConf.cacheVectorizedReaderEnabled method to access the current value.

spark.sql.join.preferSortMergeJoin

(internal) Controls whether JoinSelection execution planning strategy prefers sort merge join over shuffled hash join.

Default: true

Use SQLConf.preferSortMergeJoin method to access the current value.

spark.sql.jsonGenerator.ignoreNullFields

Whether to ignore null fields when generating JSON objects in JSON data source and JSON functions such as to_json. If false, it generates null for null fields in JSON objects.

Default: true
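
A sketch of the generated JSON with and without null fields, assuming an active SparkSession named spark; the struct and its field names are arbitrary.

import org.apache.spark.sql.functions.{col, lit, struct, to_json}

val df = spark.range(1)
  .select(struct(lit(1).as("a"), lit(null).cast("string").as("b")).as("s"))

// Default (true): null fields are dropped from the generated JSON
df.select(to_json(col("s"))).show(truncate = false)   // {"a":1}

// When false, null fields are kept
spark.conf.set("spark.sql.jsonGenerator.ignoreNullFields", false)
df.select(to_json(col("s"))).show(truncate = false)   // {"a":1,"b":null}
spark.conf.set("spark.sql.jsonGenerator.ignoreNullFields", true)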

spark.sql.leafNodeDefaultParallelism

The default parallelism of leaf operators that produce data

Must be positive

Default: SparkContext.defaultParallelism (Spark Core)

spark.sql.legacy.doLooseUpcast

(internal) When true, the upcast will be loose and allows string to atomic types.

Default: false

spark.sql.legacy.ctePrecedencePolicy

(internal) Controls how a name conflict between an inner and an outer CTE definition is resolved. This config will be removed in future versions and CORRECTED will be the only behavior.

Possible values:

  1. CORRECTED - inner CTE definitions take precedence over outer definitions
  2. EXCEPTION - AnalysisException is thrown when a name conflict is detected in a nested CTE
  3. LEGACY - outer CTE definitions take precedence over inner definitions

Default: EXCEPTION
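
For illustration, a minimal sketch of a nested CTE with a name conflict:

spark.sql("SET spark.sql.legacy.ctePrecedencePolicy=CORRECTED")

spark.sql("""
  WITH t AS (SELECT 1 AS id)
  SELECT * FROM (
    WITH t AS (SELECT 2 AS id)
    SELECT * FROM t
  )
""").show()
// With CORRECTED the inner definition of t wins (id = 2);
// with the default EXCEPTION the name conflict fails the query instead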

spark.sql.legacy.timeParserPolicy

(internal) When LEGACY, java.text.SimpleDateFormat is used for formatting and parsing dates/timestamps in a locale-sensitive manner, which is the approach before Spark 3.0. When CORRECTED, classes from the java.time.* packages are used for the same purpose. When EXCEPTION (the default), a RuntimeException is thrown when the two approaches would produce different results.

Possible values: EXCEPTION, LEGACY, CORRECTED

Default: EXCEPTION

spark.sql.legacy.followThreeValuedLogicInArrayExists

(internal) When true, the ArrayExists expression follows three-valued boolean logic.

Default: true

spark.sql.legacy.fromDayTimeString.enabled

(internal) When true, the from bound is not taken into account in the conversion of a day-time string to an interval, and the to bound is used to skip all interval units outside the specified range. When false, a ParseException is thrown if the input does not match the pattern defined by from and to.

Default: false

spark.sql.legacy.notReserveProperties

(internal) When true, database and table properties are not reserved and are available in CREATE/ALTER syntax. Be aware, however, that the reserved properties are silently removed.

Default: false

spark.sql.legacy.addSingleFileInAddFile

(internal) When true, only a single file can be added using ADD FILE. If false, users can add a directory by passing a directory path to ADD FILE.

Default: false

spark.sql.legacy.exponentLiteralAsDecimal.enabled

(internal) When true, a literal with an exponent (e.g. 1E-30) would be parsed as Decimal rather than Double.

Default: false

spark.sql.legacy.allowNegativeScaleOfDecimal

(internal) When true, a negative scale of the Decimal type is allowed. For example, the type of the number 1E10BD under legacy mode is DecimalType(2, -9), but is DecimalType(11, 0) in non-legacy mode.

Default: false

spark.sql.legacy.bucketedTableScan.outputOrdering

(internal) When true, the bucketed table scan will list files during planning to figure out the output ordering, which is expensive and may make the planning quite slow.

Default: false

spark.sql.legacy.json.allowEmptyString.enabled

(internal) When true, the parser of JSON data source treats empty strings as null for some data types such as IntegerType.

Default: false

spark.sql.legacy.createEmptyCollectionUsingStringType

(internal) When true, Spark returns an empty collection with StringType as element type if the array/map function is called without any parameters. Otherwise, Spark returns an empty collection with NullType as element type.

Default: false

spark.sql.legacy.allowUntypedScalaUDF

(internal) When true, users are allowed to use org.apache.spark.sql.functions.udf(f: AnyRef, dataType: DataType). Otherwise, an exception is thrown at runtime.

Default: false

spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue

(internal) When true, the key attribute resulting from running Dataset.groupByKey on a non-struct key type is named value, following the behavior of Spark 2.4 and earlier.

Default: false
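
A minimal sketch of the difference for a non-struct grouping key:

import spark.implicits._

val counts = Seq("a", "b", "a").toDS.groupByKey(identity).count()
counts.printSchema()
// Default (false): the grouping key column is named "key"
// With this legacy flag enabled it is named "value" (Spark 2.4 behavior)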

spark.sql.legacy.setCommandRejectsSparkCoreConfs

(internal) When true, the SET command fails when the key is registered as a SparkConf entry.

Default: true

spark.sql.legacy.typeCoercion.datetimeToString.enabled

(internal) When true, date/timestamp will cast to string in binary comparisons with String

Default: false

spark.sql.legacy.allowHashOnMapType

(internal) When true, hash expressions can be applied on elements of MapType. Otherwise, an analysis exception will be thrown.

Default: false

spark.sql.legacy.parquet.datetimeRebaseModeInWrite

(internal) When LEGACY, Spark rebases dates/timestamps from the Proleptic Gregorian calendar to the legacy hybrid (Julian + Gregorian) calendar when writing Parquet files. When CORRECTED, Spark does not rebase and writes the dates/timestamps as they are. When EXCEPTION (the default), Spark fails the write if it sees ancient dates/timestamps that are ambiguous between the two calendars.

Possible values: EXCEPTION, LEGACY, CORRECTED

Default: EXCEPTION

spark.sql.legacy.parquet.datetimeRebaseModeInRead

(internal) When LEGACY, Spark rebases dates/timestamps from the legacy hybrid (Julian + Gregorian) calendar to the Proleptic Gregorian calendar when reading Parquet files. When CORRECTED, Spark does not rebase and reads the dates/timestamps as they are. When EXCEPTION (the default), Spark fails the read if it sees ancient dates/timestamps that are ambiguous between the two calendars. This config is only effective if the writer info (like Spark, Hive) of the Parquet files is unknown.

Possible values: EXCEPTION, LEGACY, CORRECTED

Default: EXCEPTION

spark.sql.legacy.avro.datetimeRebaseModeInWrite

(internal) When LEGACY, Spark rebases dates/timestamps from the Proleptic Gregorian calendar to the legacy hybrid (Julian + Gregorian) calendar when writing Avro files. When CORRECTED, Spark does not rebase and writes the dates/timestamps as they are. When EXCEPTION (the default), Spark fails the write if it sees ancient dates/timestamps that are ambiguous between the two calendars.

Possible values: EXCEPTION, LEGACY, CORRECTED

Default: EXCEPTION

spark.sql.legacy.avro.datetimeRebaseModeInRead

(internal) When LEGACY, Spark rebases dates/timestamps from the legacy hybrid (Julian + Gregorian) calendar to the Proleptic Gregorian calendar when reading Avro files. When CORRECTED, Spark does not rebase and reads the dates/timestamps as they are. When EXCEPTION (the default), Spark fails the read if it sees ancient dates/timestamps that are ambiguous between the two calendars. This config is only effective if the writer info (like Spark, Hive) of the Avro files is unknown.

Possible values: EXCEPTION, LEGACY, CORRECTED

Default: EXCEPTION

spark.sql.legacy.rdd.applyConf

(internal) Enables propagation of SQL configurations when executing operations on the RDD that represents a structured query. This is the (buggy) behavior up to 2.4.4.

Default: true

This is for cases not tracked by SQL execution, when a Dataset is converted to an RDD either using the Dataset.rdd operation or QueryExecution, and then the returned RDD is used to invoke actions on it.

This config is deprecated and will be removed in 3.0.0.

spark.sql.legacy.replaceDatabricksSparkAvro.enabled

Enables resolving (mapping) the data source provider com.databricks.spark.avro to the built-in (but external) Avro data source module for backward compatibility.

Default: true

Use SQLConf.replaceDatabricksSparkAvroEnabled method to access the current value.

spark.sql.limit.scaleUpFactor

(internal) Minimal increase rate in the number of partitions between attempts when executing take operator on a structured query. Higher values lead to more partitions read. Lower values might lead to longer execution times as more jobs will be run.

Default: 4

Use SQLConf.limitScaleUpFactor method to access the current value.

spark.sql.optimizeNullAwareAntiJoin

(internal) Enables planning a single-column NULL-aware anti join as BroadcastHashJoinExec (with the isNullAwareAntiJoin flag enabled), turning the O(M*N) loop-based lookup into an O(M) hash-based lookup.

Default: true

Use SQLConf.optimizeNullAwareAntiJoin method to access the current value.

spark.sql.orc.impl

(internal) When native, use the native version of ORC support instead of the ORC library in Hive 1.2.1.

Default: native

Acceptable values:

  • hive
  • native

spark.sql.planChangeLog.level

(internal) Log level for logging the change from the original plan to the new plan after a rule or batch is applied.

Default: trace

Supported Values (case-insensitive):

  • trace
  • debug
  • info
  • warn
  • error

Use SQLConf.planChangeLogLevel method to access the current value.

spark.sql.planChangeLog.batches

(internal) Comma-separated list of batch names for plan changes logging

Default: (undefined)

Use SQLConf.planChangeBatches method to access the current value.

spark.sql.planChangeLog.rules

(internal) Comma-separated list of rule names for plan changes logging

Default: (undefined)

Use SQLConf.planChangeRules method to access the current value.

spark.sql.pyspark.jvmStacktrace.enabled

When true, the JVM stacktrace is shown in the user-facing PySpark exception together with the Python stacktrace. By default, it is disabled: the JVM stacktrace is hidden and only a Python-friendly exception is shown.

Default: false

spark.sql.parquet.binaryAsString

Some other Parquet-producing systems, in particular Impala and older versions of Spark SQL, do not differentiate between binary data and strings when writing out the Parquet schema. This flag tells Spark SQL to interpret binary data as a string to provide compatibility with these systems.

Default: false

Use SQLConf.isParquetBinaryAsString method to access the current value.

spark.sql.parquet.compression.codec

Sets the compression codec used when writing Parquet files. If either compression or parquet.compression is specified in the table-specific options/properties, the precedence would be compression, parquet.compression, spark.sql.parquet.compression.codec.

Acceptable values:

  • brotli
  • gzip
  • lz4
  • lzo
  • none
  • snappy
  • uncompressed
  • zstd

Default: snappy

Use SQLConf.parquetCompressionCodec method to access the current value.
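
A minimal sketch of the precedence described above (the output path is hypothetical): a per-write compression option overrides the session-level codec.

// Session-level default codec for Parquet writes
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")

// The per-write option takes precedence over the session-level setting
spark.range(10).write
  .option("compression", "snappy")
  .parquet("/tmp/compression-demo")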

spark.sql.parquet.enableVectorizedReader

Enables vectorized parquet decoding.

Default: true

Use SQLConf.parquetVectorizedReaderEnabled method to access the current value.

spark.sql.parquet.filterPushdown.date

(internal) Enables parquet filter push-down optimization for Date type (when spark.sql.parquet.filterPushdown is enabled)

Default: true

Use SQLConf.parquetFilterPushDownDate method to access the current value.

spark.sql.parquet.filterPushdown.decimal

(internal) Enables parquet filter push-down optimization for Decimal type (when spark.sql.parquet.filterPushdown is enabled)

Default: true

Use SQLConf.parquetFilterPushDownDecimal method to access the current value.

spark.sql.parquet.int96RebaseModeInWrite

(internal) Enables rebasing timestamps while writing Parquet files

Formerly known as spark.sql.legacy.parquet.int96RebaseModeInWrite

Acceptable values:

  • EXCEPTION - Fail writing parquet files if there are ancient timestamps that are ambiguous between the two calendars
  • LEGACY - Rebase INT96 timestamps from Proleptic Gregorian calendar to the legacy hybrid (Julian + Gregorian) calendar (gives maximum interoperability)
  • CORRECTED - Write datetime values with no change (no rebase). Use this only when you are 100% sure that the written files will only be read by Spark 3.0+ or other systems that use the Proleptic Gregorian calendar

Default: EXCEPTION

Writing dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z can be dangerous, as the files may later be read by Spark 2.x or legacy versions of Hive, which use a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar.

See more details in SPARK-31404.

spark.sql.parquet.pushdown.inFilterThreshold

(internal) For an IN predicate, the Parquet filter pushes down a set of OR clauses if the number of values does not exceed this threshold. Otherwise, the Parquet filter pushes down a range predicate: greater than or equal to the minimum value and less than or equal to the maximum value (when spark.sql.parquet.filterPushdown is enabled)

Disabled when 0

Default: 10

Use SQLConf.parquetFilterPushDownInFilterThreshold method to access the current value.

spark.sql.parquet.filterPushdown.string.startsWith

(internal) Enables parquet filter push-down optimization for startsWith function (when spark.sql.parquet.filterPushdown is enabled)

Default: true

Use SQLConf.parquetFilterPushDownStringStartWith method to access the current value.

spark.sql.parquet.filterPushdown.timestamp

(internal) Enables parquet filter push-down optimization for Timestamp type. It can only have an effect when the following hold:

  1. spark.sql.parquet.filterPushdown is enabled
  2. Timestamps are stored as TIMESTAMP_MICROS or TIMESTAMP_MILLIS

Default: true

Use SQLConf.parquetFilterPushDownTimestamp method to access the current value.

spark.sql.parquet.int96AsTimestamp

Some Parquet-producing systems, in particular Impala, store Timestamp as INT96. Spark also stores Timestamp as INT96 to avoid losing precision in the nanoseconds field. This flag tells Spark SQL to interpret INT96 data as a timestamp to provide compatibility with these systems.

Default: true

Use SQLConf.isParquetINT96AsTimestamp method to access the current value.

spark.sql.parquet.int96TimestampConversion

Controls whether timestamp adjustments should be applied to INT96 data when converting to timestamps, for data written by Impala.

Default: false

This is necessary because Impala stores INT96 data with a different timezone offset than Hive and Spark.

Use SQLConf.isParquetINT96TimestampConversion method to access the current value.

spark.sql.parquet.outputTimestampType

Sets which Parquet timestamp type to use when Spark writes data to Parquet files. INT96 is a non-standard but commonly used timestamp type in Parquet. TIMESTAMP_MICROS is a standard timestamp type in Parquet, which stores number of microseconds from the Unix epoch. TIMESTAMP_MILLIS is also standard, but with millisecond precision, which means Spark has to truncate the microsecond portion of its timestamp value.

Acceptable values:

  • INT96
  • TIMESTAMP_MICROS
  • TIMESTAMP_MILLIS

Default: INT96

Use SQLConf.parquetOutputTimestampType method to access the current value.

spark.sql.parquet.recordLevelFilter.enabled

Enables Parquet's native record-level filtering using the pushed down filters (when spark.sql.parquet.filterPushdown is enabled).

Default: false

Use SQLConf.parquetRecordFilterEnabled method to access the current value.

spark.sql.parquet.respectSummaryFiles

When true, Spark assumes that all part-files of Parquet are consistent with the summary files and ignores the part-files when merging the schema. When false (the default), all part-files are merged. This is an expert-only option and shouldn't be enabled without knowing exactly what it means.

Default: false

Use SQLConf.isParquetSchemaRespectSummaries method to access the current value.

spark.sql.parser.quotedRegexColumnNames

Controls whether quoted identifiers (using backticks) in SELECT statements should be interpreted as regular expressions.

Default: false

Use SQLConf.supportQuotedRegexColumnName method to access the current value.
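
A minimal sketch: with the property enabled, a backquoted identifier in SELECT is treated as a regular expression over column names.

spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true")

spark.range(1)
  .selectExpr("id AS ab", "id AS ac", "id AS b")
  .createOrReplaceTempView("t")

spark.sql("SELECT `a.*` FROM t").show()  // selects columns ab and ac only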

spark.sql.pivotMaxValues

Maximum number of (distinct) values that will be collected without error (when doing a pivot without specifying the values for the pivot column)

Default: 10000

Use SQLConf.dataFramePivotMaxValues method to access the current value.
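
The limit only matters when the pivot values are not given explicitly, as in this minimal sketch:

import spark.implicits._

val sales = Seq(("2023", "Q1", 10), ("2023", "Q2", 20), ("2024", "Q1", 30))
  .toDF("year", "quarter", "amount")

// No explicit pivot values, so Spark first collects the distinct quarters,
// failing if there are more than spark.sql.pivotMaxValues of them
sales.groupBy("year").pivot("quarter").sum("amount").show()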

spark.sql.redaction.options.regex

Regular expression to find options of a Spark SQL command with sensitive information

Default: (?i)secret|password

The values of the options matched will be redacted in the explain output.

This redaction is applied on top of the global redaction configuration defined by spark.redaction.regex configuration.

Used exclusively when SQLConf is requested to redactOptions.

spark.sql.redaction.string.regex

Regular expression to point at sensitive information in text output

Default: (undefined)

When this regex matches a string part, it is replaced by a dummy value (i.e. *********(redacted)). This is currently used to redact the output of SQL explain commands.

NOTE: When this conf is not set, the value of spark.redaction.string.regex is used instead.

Use SQLConf.stringRedactionPattern method to access the current value.

spark.sql.runSQLOnFiles

(internal) Enables datasource.`path` table names (e.g. parquet.`/path/to/data`) in SQL queries for FileFormat-based data sources (excluding Hive tables)

Default: true

Use SQLConf.runSQLonFile method to access the current value.
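
A minimal sketch of the syntax this property enables (the path is hypothetical):

// Query a file-based data source directly, without registering a table first
spark.sql("SELECT * FROM parquet.`/tmp/events`").show()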


spark.sql.selfJoinAutoResolveAmbiguity

Controls whether to resolve ambiguity in join conditions for self-joins automatically (true) or not (false)

Default: true

spark.sql.sort.enableRadixSort

(internal) Controls whether to use radix sort (true) or not (false) in ShuffleExchangeExec and SortExec physical operators

Default: true

Radix sort is much faster but requires additional memory to be reserved up-front. The memory overhead may be significant when sorting very small rows (up to 50% more).

Use SQLConf.enableRadixSort method to access the current value.

spark.sql.sources.default

Default data source to use for loading or saving data

Default: parquet

Use SQLConf.defaultDataSourceName method to access the current value.

spark.sql.statistics.fallBackToHdfs

Enables automatic calculation of table size statistic by falling back to HDFS if the table statistics are not available from table metadata.

Default: false

This can be useful in determining if a table is small enough for auto broadcast joins in query planning.

Use SQLConf.fallBackToHdfsForStatsEnabled method to access the current value.

spark.sql.statistics.histogram.numBins

(internal) The number of bins when generating histograms.

Default: 254

NOTE: The number of bins must be greater than 1.

Use SQLConf.histogramNumBins method to access the current value.

spark.sql.statistics.parallelFileListingInStatsComputation.enabled

(internal) Enables parallel file listing in SQL commands, e.g. ANALYZE TABLE (as opposed to single-threaded listing, which can be particularly slow with tables with hundreds of partitions)

Default: true

Use SQLConf.parallelFileListingInStatsComputation method to access the current value.

spark.sql.statistics.ndv.maxError

(internal) The maximum estimation error allowed in HyperLogLog++ algorithm when generating column level statistics.

Default: 0.05

spark.sql.statistics.percentile.accuracy

(internal) Accuracy of percentile approximation when generating equi-height histograms. Larger value means better accuracy. The relative error can be deduced by 1.0 / PERCENTILE_ACCURACY.

Default: 10000

spark.sql.statistics.size.autoUpdate.enabled

Enables automatic update of the table size statistic of a table after the table has changed.

Default: false

IMPORTANT: If the total number of files of the table is very large this can be expensive and slow down data change commands.

Use SQLConf.autoSizeUpdateEnabled method to access the current value.

spark.sql.subexpressionElimination.enabled

(internal) Enables Subexpression Elimination

Default: true

Use SQLConf.subexpressionEliminationEnabled method to access the current value.

spark.sql.subexpressionElimination.skipForShortcutExpr

(internal) When true, subexpression elimination takes the shortcut evaluation of AND and OR into account: a subexpression may not need to be evaluated even if it appears more than once, e.g. in if(or(a, and(b, b))) the expression b is skipped when a is true.

Default: false

Use SQLConf.subexpressionEliminationSkipForShotcutExpr method to access the current value.


spark.sql.shuffle.partitions

The default number of partitions to use when shuffling data for joins or aggregations.

Default: 200

Note

Corresponds to Apache Hive's mapred.reduce.tasks property that Spark SQL considers deprecated.

Spark Structured Streaming

spark.sql.shuffle.partitions cannot be changed in Spark Structured Streaming between query restarts from the same checkpoint location.

Use SQLConf.numShufflePartitions method to access the current value.
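
A minimal sketch of overriding the value for the current session:

import spark.implicits._

spark.conf.set("spark.sql.shuffle.partitions", "8")

// Subsequent shuffles (joins, aggregations) use 8 partitions,
// unless adaptive query execution coalesces them further
spark.range(1000).groupBy(($"id" % 10) as "bucket").count().show()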

spark.sql.sources.fileCompressionFactor

(internal) When estimating the output data size of a table scan, multiply the file size by this factor to get the estimated data size, since data compressed in the file could otherwise lead to a heavily underestimated result.

Default: 1.0

Use SQLConf.fileCompressionFactor method to access the current value.

spark.sql.sources.partitionOverwriteMode

Enables dynamic partition inserts (when set to dynamic)

Default: static

When INSERT OVERWRITE a partitioned data source table with dynamic partition columns, Spark SQL supports two modes (case-insensitive):

  • static - Spark deletes all the partitions that match the partition specification (e.g. PARTITION(a=1,b)) in the INSERT statement, before overwriting

  • dynamic - Spark doesn't delete partitions ahead of time, and only overwrites those partitions that have data written into them

The default static overwrite mode keeps the same behavior as Spark prior to 2.3. Note that this config doesn't affect Hive serde tables, as they are always overwritten with dynamic mode.

Use SQLConf.partitionOverwriteMode method to access the current value.
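
A minimal sketch of a dynamic-mode overwrite (the output path is hypothetical):

import spark.implicits._

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

val df = Seq((1, 2023), (2, 2024)).toDF("id", "year")
df.write
  .mode("overwrite")
  .partitionBy("year")
  .parquet("/tmp/events")
// Only the year=2023 and year=2024 partitions are replaced;
// any other partitions already under /tmp/events are left intact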

spark.sql.truncateTable.ignorePermissionAcl.enabled

(internal) When enabled, the TRUNCATE TABLE command does not set the original permissions and ACLs back when re-creating the table/partition paths.

Default: false

Use SQLConf.truncateTableIgnorePermissionAcl method to access the current value.

spark.sql.ui.retainedExecutions

Number of SQLExecutionUIData entries to keep in the failedExecutions and completedExecutions internal registries.

Default: 1000

When a query execution finishes, the execution is removed from the internal activeExecutions registry and stored in failedExecutions or completedExecutions depending on the final execution status. SQLListener then makes sure that the number of SQLExecutionUIData entries does not exceed spark.sql.ui.retainedExecutions and removes the excess entries.

spark.sql.variable.substitute

Enables Variable Substitution

Default: true

Use SQLConf.variableSubstituteEnabled method to access the current value.
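
A minimal sketch of variable substitution in SQL text (myvar is an arbitrary, hypothetical variable name):

spark.sql("SET myvar=hello")

// The reference ${myvar} is substituted in the SQL text before parsing
spark.sql("SELECT '${myvar}' AS greeting").show()  // greeting = hello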

spark.sql.windowExec.buffer.in.memory.threshold

(internal) Threshold for number of rows guaranteed to be held in memory by WindowExec physical operator.

Default: 4096

Use SQLConf.windowExecBufferInMemoryThreshold method to access the current value.

spark.sql.windowExec.buffer.spill.threshold

(internal) Threshold for number of rows buffered in a WindowExec physical operator.

Default: 4096

Use SQLConf.windowExecBufferSpillThreshold method to access the current value.