Configuration Properties¶
Configuration properties (aka settings) allow you to fine-tune a Spark SQL application.
Configuration properties are set on a SparkSession while creating a new instance using the config method (e.g. spark.sql.warehouse.dir).
import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder
.master("local[*]")
.appName("My Spark Application")
.config("spark.sql.warehouse.dir", "c:/Temp") // (1)!
.getOrCreate
You can also set a property using SQL SET
command.
assert(spark.conf.getOption("spark.sql.hive.metastore.version").isEmpty)
scala> spark.sql("SET spark.sql.hive.metastore.version=2.3.2").show(truncate = false)
+--------------------------------+-----+
|key |value|
+--------------------------------+-----+
|spark.sql.hive.metastore.version|2.3.2|
+--------------------------------+-----+
assert(spark.conf.get("spark.sql.hive.metastore.version") == "2.3.2")
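Runtime-settable properties can also be changed on an existing SparkSession through spark.conf (the RuntimeConfig API). A minimal sketch (the property and value are illustrative only):

```scala
// Set a runtime property and read it back
spark.conf.set("spark.sql.shuffle.partitions", "8")
assert(spark.conf.get("spark.sql.shuffle.partitions") == "8")

// Static properties (e.g. spark.sql.warehouse.dir) cannot be changed at runtime
// and have to be given while the SparkSession is created (as shown above)
```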
spark.sql.adaptive¶
advisoryPartitionSizeInBytes¶
spark.sql.adaptive.advisoryPartitionSizeInBytes
The advisory size in bytes of the shuffle partition during adaptive optimization (when spark.sql.adaptive.enabled is enabled). It takes effect when Spark coalesces small shuffle partitions or splits skewed shuffle partitions.
Default: 64MB
Fallback Property: spark.sql.adaptive.shuffle.targetPostShuffleInputSize
Use SQLConf.ADVISORY_PARTITION_SIZE_IN_BYTES to reference the name.
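A sketch of tuning the property at runtime (the 128MB value is illustrative, not a recommendation):

```scala
// Ask AQE to coalesce shuffle output towards ~128 MB partitions
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128MB")
```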
autoBroadcastJoinThreshold¶
spark.sql.adaptive.autoBroadcastJoinThreshold
The maximum size (in bytes) of a table to be broadcast when performing a join. -1 turns broadcasting off. The default value is the same as spark.sql.autoBroadcastJoinThreshold.
Used only in Adaptive Query Execution
Default: (undefined)
Available as SQLConf.ADAPTIVE_AUTO_BROADCASTJOIN_THRESHOLD value.
coalescePartitions.enabled¶
spark.sql.adaptive.coalescePartitions.enabled
Controls coalescing shuffle partitions
When true and spark.sql.adaptive.enabled is enabled, Spark will coalesce contiguous shuffle partitions according to the target size (specified by spark.sql.adaptive.advisoryPartitionSizeInBytes), to avoid too many small tasks.
Default: true
Use SQLConf.coalesceShufflePartitionsEnabled method to access the current value.
coalescePartitions.minPartitionSize¶
spark.sql.adaptive.coalescePartitions.minPartitionSize
The minimum size (in bytes unless specified) of shuffle partitions after coalescing. This is useful when the adaptively calculated target size is too small during partition coalescing
Default: 1MB
Must be positive

Used when:

- CoalesceShufflePartitions adaptive physical optimization is executed
coalescePartitions.initialPartitionNum¶
spark.sql.adaptive.coalescePartitions.initialPartitionNum
The initial number of shuffle partitions before coalescing.
If not set, it defaults to spark.sql.shuffle.partitions. This configuration only takes effect when spark.sql.adaptive.enabled and spark.sql.adaptive.coalescePartitions.enabled are both enabled.
Default: (undefined)
coalescePartitions.parallelismFirst¶
spark.sql.adaptive.coalescePartitions.parallelismFirst
When true, Spark does not respect the target size specified by spark.sql.adaptive.advisoryPartitionSizeInBytes when coalescing contiguous shuffle partitions, but adaptively calculates the target size according to the default parallelism of the Spark cluster. The calculated size is usually smaller than the configured target size. This is to maximize the parallelism and avoid performance regression when enabling adaptive query execution. It's recommended to set this config to false and respect the configured target size.
Default: true
Use SQLConf.COALESCE_PARTITIONS_PARALLELISM_FIRST to access the property.
customCostEvaluatorClass¶
spark.sql.adaptive.customCostEvaluatorClass
The fully-qualified class name of the CostEvaluator in Adaptive Query Execution
Default: SimpleCostEvaluator
Use SQLConf.ADAPTIVE_CUSTOM_COST_EVALUATOR_CLASS method to access the property (in a type-safe way).
Used when:
AdaptiveSparkPlanExec
physical operator is requested for the AQE cost evaluator
enabled¶
spark.sql.adaptive.enabled
Enables Adaptive Query Execution
Default: true
Use SQLConf.adaptiveExecutionEnabled method to access the current value.
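With AQE enabled, the final physical plan is reported under an AdaptiveSparkPlan node. A quick way to see it (the query itself is arbitrary):

```scala
spark.conf.set("spark.sql.adaptive.enabled", "true")
val q = spark.range(100).repartition(10).groupBy("id").count()
q.collect()  // execute the query so AQE can re-optimize the plan
q.explain()  // shows AdaptiveSparkPlan (isFinalPlan=true after execution)
```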
fetchShuffleBlocksInBatch¶
spark.sql.adaptive.fetchShuffleBlocksInBatch
(internal) Whether to fetch the contiguous shuffle blocks in batch. Instead of fetching blocks one by one, fetching contiguous shuffle blocks for the same map task in batch can reduce IO and improve performance. Note, multiple contiguous blocks in a single fetch request only happen when spark.sql.adaptive.enabled and spark.sql.adaptive.coalescePartitions.enabled are both enabled. This feature also depends on a relocatable serializer, the concatenation support of the codec in use and the new version shuffle fetch protocol.
Default: true
Use SQLConf.fetchShuffleBlocksInBatch method to access the current value.
forceApply¶
spark.sql.adaptive.forceApply
(internal) When true (together with spark.sql.adaptive.enabled enabled), Spark will force apply adaptive query execution for all supported queries.
Default: false
Use SQLConf.ADAPTIVE_EXECUTION_FORCE_APPLY method to access the property (in a type-safe way).
forceOptimizeSkewedJoin¶
spark.sql.adaptive.forceOptimizeSkewedJoin
Enables OptimizeSkewedJoin physical optimization to be executed even if it introduces extra shuffle
Default: false
Requires spark.sql.adaptive.skewJoin.enabled to be enabled
Use SQLConf.ADAPTIVE_FORCE_OPTIMIZE_SKEWED_JOIN to access the property (in a type-safe way).
Used when:
- AdaptiveSparkPlanExec physical operator is requested for the AQE cost evaluator (and creates a SimpleCostEvaluator)
- OptimizeSkewedJoin physical optimization is executed
localShuffleReader.enabled¶
spark.sql.adaptive.localShuffleReader.enabled
When true (and spark.sql.adaptive.enabled is true), Spark SQL tries to use local shuffle reader to read the shuffle data when the shuffle partitioning is not needed, for example, after converting sort-merge join to broadcast-hash join.
Default: true
Use SQLConf.LOCAL_SHUFFLE_READER_ENABLED to access the property (in a type-safe way)
logLevel¶
spark.sql.adaptive.logLevel
(internal) Log level for adaptive execution logging of plan changes. The value can be TRACE, DEBUG, INFO, WARN or ERROR.
Default: DEBUG
Use SQLConf.adaptiveExecutionLogLevel for the current value
maxShuffledHashJoinLocalMapThreshold¶
spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold
The maximum size (in bytes) per partition that can be allowed to build a local hash map. If this value is not smaller than spark.sql.adaptive.advisoryPartitionSizeInBytes and all the partition sizes are not larger than this config, join selection prefers to use shuffled hash join instead of sort-merge join regardless of the value of spark.sql.join.preferSortMergeJoin.
Default: 0
Available as SQLConf.ADAPTIVE_MAX_SHUFFLE_HASH_JOIN_LOCAL_MAP_THRESHOLD
nonEmptyPartitionRatioForBroadcastJoin¶
spark.sql.adaptive.nonEmptyPartitionRatioForBroadcastJoin
(internal) A relation with a non-empty partition ratio (the number of non-empty partitions to all partitions) lower than this config will not be considered as the build side of a broadcast-hash join in Adaptive Query Execution regardless of the size.
Effective with spark.sql.adaptive.enabled true
Default: 0.2
Use SQLConf.nonEmptyPartitionRatioForBroadcastJoin method to access the current value.
optimizer.excludedRules¶
spark.sql.adaptive.optimizer.excludedRules
A comma-separated list of rules (names) to be disabled (excluded) in the AQE Logical Optimizer
Default: undefined
Use SQLConf.ADAPTIVE_OPTIMIZER_EXCLUDED_RULES to reference the property.
Used when:
AQEOptimizer
is requested for the batches
optimizeSkewsInRebalancePartitions.enabled¶
spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled

When true and spark.sql.adaptive.enabled is true, Spark SQL will optimize the skewed shuffle partitions in RebalancePartitions and split them to smaller ones according to the target size (specified by spark.sql.adaptive.advisoryPartitionSizeInBytes), to avoid data skew.
Default: true
Use SQLConf.ADAPTIVE_OPTIMIZE_SKEWS_IN_REBALANCE_PARTITIONS_ENABLED method to access the property (in a type-safe way)
skewJoin.enabled¶
spark.sql.adaptive.skewJoin.enabled
When true and spark.sql.adaptive.enabled is enabled, Spark dynamically handles skew in sort-merge join by splitting (and replicating if needed) skewed partitions.
Default: true
Use SQLConf.SKEW_JOIN_ENABLED to reference the property.
skewJoin.skewedPartitionFactor¶
spark.sql.adaptive.skewJoin.skewedPartitionFactor
A partition is considered skewed if its size is larger than this factor multiplying the median partition size and also larger than spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes.
Default: 5
skewJoin.skewedPartitionThresholdInBytes¶
spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes
A partition is considered skewed if its size in bytes is larger than this threshold and also larger than spark.sql.adaptive.skewJoin.skewedPartitionFactor multiplying the median partition size. Ideally this config should be set larger than spark.sql.adaptive.advisoryPartitionSizeInBytes.
Default: 256MB
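A sketch of setting the skew-join properties together (the values below are illustrative, not recommendations):

```scala
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
// A partition is split only when it exceeds both thresholds below
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")
```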
allowNamedFunctionArguments¶
spark.sql.allowNamedFunctionArguments
Controls support for named parameters in function calls in SQL statements
Default: true
Used when:
AstBuilder
is requested to parse table-valued function (and extractFunctionTableNamedArgument and extractNamedArgument)
autoBroadcastJoinThreshold¶
spark.sql.autoBroadcastJoinThreshold
Maximum size (in bytes) for a table that can be broadcast (to all worker nodes) in a join
Default: 10M
-1 (or any negative value) disables broadcasting
Use SQLConf.autoBroadcastJoinThreshold method to access the current value.
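For example (a sketch with tiny in-memory Datasets), automatic broadcasting can be disabled while still broadcasting one side of a join explicitly with a hint:

```scala
import org.apache.spark.sql.functions.broadcast

// Disable size-based automatic broadcast joins
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

val large = spark.range(1000000).toDF("id")
val small = spark.range(100).toDF("id")

// The broadcast hint still forces a BroadcastHashJoin despite the disabled threshold
large.join(broadcast(small), "id").explain()
```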
cache.serializer¶
spark.sql.cache.serializer
The name of CachedBatchSerializer implementation to translate SQL data into a format that can more efficiently be cached.
Default: org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer
spark.sql.cache.serializer
is a StaticSQLConf
Use SQLConf.SPARK_CACHE_SERIALIZER for the name
Used when:
InMemoryRelation
is requested for the CachedBatchSerializer
spark.sql.catalog.spark_catalog¶
The fully-qualified class name of a CatalogPlugin to handle spark_catalog
built-in session catalog
Default: defaultSessionCatalog
Used when:
CatalogManager
is requested for the user-specified v2 session catalog
spark.sql.codegen¶
aggregate.fastHashMap.capacityBit¶
spark.sql.codegen.aggregate.fastHashMap.capacityBit
(internal) Capacity for the max number of rows to be held in memory by the fast hash aggregate product operator. The bit is not for the actual value, but the actual numBuckets is determined by loadFactor (e.g., with the default bit value 16, the actual numBuckets is ((1 << 16) / 0.5)).
Default: 16
Must be in the range of [10, 30] (inclusive)
Use SQLConf.fastHashAggregateRowMaxCapacityBit for the current value
Used when:
HashAggregateExec
physical operator is requested to doProduceWithKeys
aggregate.map.twolevel.enabled¶
spark.sql.codegen.aggregate.map.twolevel.enabled
(internal) Enable two-level aggregate hash map. When enabled, records will first be inserted/looked-up at a 1st-level, small, fast map, and then fallback to a 2nd-level, larger, slower map when 1st level is full or keys cannot be found. When disabled, records go directly to the 2nd level.
Default: true
Use SQLConf.enableTwoLevelAggMap for the current value
Used when:
HashAggregateExec
physical operator is requested to doProduceWithKeys
aggregate.map.vectorized.enable¶
spark.sql.codegen.aggregate.map.vectorized.enable
(internal) Enables vectorized aggregate hash map. For testing/benchmarking only.
Default: false
Use SQLConf.enableVectorizedHashMap for the current value
Used when:
HashAggregateExec
physical operator is requested to enableTwoLevelHashMap, doProduceWithKeys
aggregate.map.twolevel.partialOnly¶
spark.sql.codegen.aggregate.map.twolevel.partialOnly
(internal) Enables two-level aggregate hash map for partial aggregate only, because final aggregate might get more distinct keys compared to partial aggregate. Overhead of looking up the 1st-level map might dominate when having a lot of distinct keys.
Default: true
Used when:
- HashAggregateExec physical operator is requested to checkIfFastHashMapSupported
hugeMethodLimit¶
spark.sql.codegen.hugeMethodLimit
(internal) The maximum bytecode size of a single compiled Java function generated by whole-stage codegen. When the compiled code has a function that exceeds this threshold, the whole-stage codegen is deactivated for this subtree of the query plan.
Default: 65535
The default value 65535 is the largest bytecode size possible for a valid Java method. When running on HotSpot, it may be preferable to set the value to 8000 (which is the value of HugeMethodLimit in the OpenJDK JVM settings)
Use SQLConf.hugeMethodLimit method to access the current value.
Used when:
WholeStageCodegenExec
physical operator is executed
fallback¶
spark.sql.codegen.fallback
(internal) Whether the whole-stage codegen could be temporarily disabled for the part of a query that has failed to compile generated code (true) or not (false).
Default: true
Use SQLConf.wholeStageFallback method to access the current value.
Used when:
WholeStageCodegenExec
physical operator is executed
join.fullOuterShuffledHashJoin.enabled¶
spark.sql.codegen.join.fullOuterShuffledHashJoin.enabled
(internal) Enables Whole-Stage Code Generation for FULL OUTER shuffled hash join
Default: true
Use SQLConf.ENABLE_FULL_OUTER_SHUFFLED_HASH_JOIN_CODEGEN to access the property
methodSplitThreshold¶
spark.sql.codegen.methodSplitThreshold
(internal) The threshold of source-code splitting in the codegen. When the number of characters in a single Java function (without comment) exceeds the threshold, the function will be automatically split to multiple smaller ones. We cannot know how many bytecode will be generated, so use the code length as metric. When running on HotSpot, a function's bytecode should not go beyond 8KB, otherwise it will not be JITted; it also should not be too small, otherwise there will be many function calls.
Default: 1024
Use SQLConf.methodSplitThreshold for the current value
wholeStage¶
spark.sql.codegen.wholeStage
(internal) Whether the whole stage (of multiple physical operators) will be compiled into a single Java method (true) or not (false).
Default: true
Use SQLConf.wholeStageEnabled method to access the current value.
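Whole-stage codegen is visible in explain output: operators compiled together appear under WholeStageCodegen subtrees and are prefixed with *(codegen stage id). A quick comparison (sketch):

```scala
spark.conf.set("spark.sql.codegen.wholeStage", "true")
spark.range(10).selectExpr("id + 1 AS inc").explain()  // operators marked with *(1), *(2), ...

spark.conf.set("spark.sql.codegen.wholeStage", "false")
spark.range(10).selectExpr("id + 1 AS inc").explain()  // the same query planned without whole-stage codegen
```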
columnVector.offheap.enabled¶
spark.sql.columnVector.offheap.enabled
(internal) Enables OffHeapColumnVector (true) or OnHeapColumnVector (false) in ColumnarBatch
Default: false
Use SQLConf.offHeapColumnVectorEnabled for the current value
Used when:
- RowToColumnarExec physical operator is requested to doExecuteColumnar
- DefaultCachedBatchSerializer is requested to vectorTypes and convertCachedBatchToColumnarBatch
- ParquetFileFormat is requested to vectorTypes and buildReaderWithPartitionValues
- ParquetPartitionReaderFactory is created
defaultColumn.enabled¶
spark.sql.defaultColumn.enabled
(internal) When true, allows CREATE TABLE, REPLACE TABLE, and ALTER COLUMN statements to set or update default values for specific columns. Following INSERT, MERGE, and UPDATE statements may then omit these values and their values will be injected automatically instead.
Default: true
Use SQLConf.enableDefaultColumns for the current value
Used when:
- AstBuilder is requested to visitCreateOrReplaceTableColType, visitQualifiedColTypeWithPosition, visitAlterTableAlterColumn
- ResolveDefaultColumns logical resolution rule is executed (constantFoldCurrentDefaultsToExistDefaults, validateCatalogForDefaultValue, validateTableProviderForDefaultValue)
exchange.reuse¶
spark.sql.exchange.reuse
(internal) When enabled (true), the Spark planner will find duplicated exchanges and subqueries and re-use them.

When disabled (false), ReuseExchange and ReuseSubquery physical optimizations (that the Spark planner uses for physical query plan optimization) do nothing.
Default: true
Use SQLConf.exchangeReuseEnabled for the current value
spark.sql.execution¶
arrow.pyspark.enabled¶
spark.sql.execution.arrow.pyspark.enabled
When true, make use of Apache Arrow for columnar data transfers in PySpark. This optimization applies to:
- pyspark.sql.DataFrame.toPandas
- pyspark.sql.SparkSession.createDataFrame when its input is a Pandas DataFrame
The following data types are unsupported: BinaryType, MapType, ArrayType of TimestampType, and nested StructType.
Default: false
arrow.pyspark.fallback.enabled¶
spark.sql.execution.arrow.pyspark.fallback.enabled
When true, optimizations enabled by spark.sql.execution.arrow.pyspark.enabled will fallback automatically to non-optimized implementations if an error occurs.
Default: true
arrow.pyspark.selfDestruct.enabled¶
spark.sql.execution.arrow.pyspark.selfDestruct.enabled
(Experimental) When true, make use of Apache Arrow's self-destruct and split-blocks options for columnar data transfers in PySpark, when converting from Arrow to Pandas. This reduces memory usage at the cost of some CPU time. Applies to: pyspark.sql.DataFrame.toPandas when spark.sql.execution.arrow.pyspark.enabled is true.
Default: false
Use SQLConf.arrowPySparkSelfDestructEnabled for the current value
pandas.convertToArrowArraySafely¶
spark.sql.execution.pandas.convertToArrowArraySafely
(internal) When true, Arrow will perform safe type conversion when converting Pandas.Series to an Arrow array during serialization. Arrow will raise errors when detecting unsafe type conversions like overflow. When false, Arrow's type checks are disabled and type conversions are performed anyway. This config only works for Arrow 0.11.0+.
Default: false
pandas.udf.buffer.size¶
spark.sql.execution.pandas.udf.buffer.size
Same as ${BUFFER_SIZE.key} but only applies to Pandas UDF executions. If it is not set, the fallback is ${BUFFER_SIZE.key}. Note that Pandas execution requires more than 4 bytes. Lowering this value could make small Pandas UDF batch iterated and pipelined; however, it might degrade performance. See SPARK-27870.
Default: 65536
rangeExchange.sampleSizePerPartition¶
spark.sql.execution.rangeExchange.sampleSizePerPartition
(internal) Number of points to sample per partition in order to determine the range boundaries for range partitioning, typically used in global sorting (without limit).
Default: 100
Use SQLConf.rangeExchangeSampleSizePerPartition method to access the current value.
removeRedundantSorts¶
spark.sql.execution.removeRedundantSorts
(internal) Whether to remove redundant physical sort node
Default: true
Used as SQLConf.REMOVE_REDUNDANT_SORTS_ENABLED
replaceHashWithSortAgg¶
spark.sql.execution.replaceHashWithSortAgg
(internal) Enables replacing hash aggregate operators (i.e., HashAggregateExec and ObjectHashAggregateExec) with SortAggregateExec based on children's ordering
Default: false
Used when:
- ReplaceHashWithSortAgg physical optimization is executed
reuseSubquery¶
spark.sql.execution.reuseSubquery
(internal) When true, the planner will try to find duplicated subqueries and re-use them.
Default: true
Use SQLConf.subqueryReuseEnabled for the current value
sortBeforeRepartition¶
spark.sql.execution.sortBeforeRepartition
(internal) When performing a repartition following a shuffle, the output row ordering would be nondeterministic. If some downstream stages fail and some tasks of the repartition stage retry, these tasks may generate different data, and that can lead to correctness issues. Turn on this config to insert a local sort before actually doing repartition to generate consistent repartition results. The performance of repartition() may go down since we insert an extra local sort before it.
Default: true
Use SQLConf.sortBeforeRepartition method to access the current value.
usePartitionEvaluator¶
spark.sql.execution.usePartitionEvaluator
(internal) Enables PartitionEvaluator
(Spark Core) to execute physical operators (using RDD.mapPartitionsWithEvaluator
operator)
Default: false
Use SQLConf.usePartitionEvaluator for the current value
Used when:
- FilterExec physical operator is executed
- ProjectExec physical operator is executed
- ColumnarToRowExec physical operator is executed
- RowToColumnarExec physical operator is executed
- WholeStageCodegenExec physical operator is executed
- SortMergeJoinExec physical operator is executed
- MapInBatchExec (PySpark) physical operator is executed
- WindowGroupLimitExec physical operator is executed
spark.sql.hive¶
filesourcePartitionFileCacheSize¶
spark.sql.hive.filesourcePartitionFileCacheSize
When greater than 0, enables caching of partition file metadata in memory (using SharedInMemoryCache). All tables share a cache that can use up to the specified number of bytes for file metadata.
Requires spark.sql.hive.manageFilesourcePartitions to be enabled
Default: 250 * 1024 * 1024
Use SQLConf.filesourcePartitionFileCacheSize for the current value
Used when:
FileStatusCache
is requested to look up the system-wide FileStatusCache
manageFilesourcePartitions¶
spark.sql.hive.manageFilesourcePartitions
Enables metastore partition management for file source tables.
This includes both datasource and Hive tables. When partition management is enabled, datasource tables store partition in the Hive metastore, and use the metastore to prune partitions during query planning when spark.sql.hive.metastorePartitionPruning is enabled
Default: true
Use SQLConf.manageFilesourcePartitions for the current value
Used when:
- HiveMetastoreCatalog is requested to convert a HiveTableRelation to a LogicalRelation over a HadoopFsRelation
- CreateDataSourceTableCommand, CreateDataSourceTableAsSelectCommand and InsertIntoHadoopFsRelationCommand logical commands are executed
- DDLUtils utility is used to verifyPartitionProviderIsHive
- DataSource is requested to resolve a BaseRelation (for file-based data source tables and creates a HadoopFsRelation)
- FileStatusCache is created
- V2SessionCatalog is requested to create a table (deprecated)
inMemoryColumnarStorage.partitionPruning¶
spark.sql.inMemoryColumnarStorage.partitionPruning
(internal) Enables partition pruning for in-memory columnar tables
Default: true
Use SQLConf.inMemoryPartitionPruning for the current value
Used when:
InMemoryTableScanExec
physical operator is requested to filter cached column batches
spark.sql.optimizer¶
canChangeCachedPlanOutputPartitioning¶
spark.sql.optimizer.canChangeCachedPlanOutputPartitioning
(internal) Whether to forcibly enable some optimization rules that can change the output partitioning of a cached query when executing it for caching. If it is set to true, queries may need an extra shuffle to read the cached data. This configuration is disabled by default. Currently, the optimization rules enabled by this configuration are spark.sql.adaptive.enabled and spark.sql.sources.bucketing.autoBucketedScan.enabled.
Default: false
Use SQLConf.CAN_CHANGE_CACHED_PLAN_OUTPUT_PARTITIONING to access the property
decorrelateInnerQuery.enabled¶
spark.sql.optimizer.decorrelateInnerQuery.enabled
(internal) Decorrelates inner queries by eliminating correlated references and building domain joins
Default: true
Use SQLConf.decorrelateInnerQueryEnabled for the current value
dynamicPartitionPruning.fallbackFilterRatio¶
spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio
(internal) When statistics are not available or configured not to be used, this config will be used as the fallback filter ratio for computing the data size of the partitioned table after dynamic partition pruning, in order to evaluate if it is worth adding an extra subquery as the pruning filter if broadcast reuse is not applicable.
Default: 0.5
Use SQLConf.dynamicPartitionPruningFallbackFilterRatio method to access the current value.
dynamicPartitionPruning.pruningSideExtraFilterRatio¶
spark.sql.optimizer.dynamicPartitionPruning.pruningSideExtraFilterRatio
(internal) When the filtering side doesn't support broadcast by join type, doing DPP means running an extra query that may have significant overhead. This config will be used as the extra filter ratio for computing the data size of the pruning side after DPP, in order to evaluate if it is worth adding an extra subquery as the pruning filter.

Must be a double between 0.0 and 1.0
Default: 0.04
Use SQLConf.dynamicPartitionPruningPruningSideExtraFilterRatio to access the current value.
dynamicPartitionPruning.useStats¶
spark.sql.optimizer.dynamicPartitionPruning.useStats
(internal) When true, distinct count statistics will be used for computing the data size of the partitioned table after dynamic partition pruning, in order to evaluate if it is worth adding an extra subquery as the pruning filter if broadcast reuse is not applicable.
Default: true
Use SQLConf.dynamicPartitionPruningUseStats for the current value
Used when:
- PartitionPruning logical optimization rule is executed
dynamicPartitionPruning.enabled¶
spark.sql.optimizer.dynamicPartitionPruning.enabled
Enables generating predicates for partition columns used as join keys
Default: true
Use SQLConf.dynamicPartitionPruningEnabled for the current value
Used to control whether to execute the following optimizations or skip them altogether:
- CleanupDynamicPruningFilters logical optimization
- PartitionPruning logical optimization
- PlanAdaptiveDynamicPruningFilters physical optimization
- PlanDynamicPruningFilters physical optimization
dynamicPartitionPruning.reuseBroadcastOnly¶
spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly
(internal) When true, dynamic partition pruning will only apply when the broadcast exchange of a broadcast hash join operation can be reused as the dynamic pruning filter.
Default: true
Use SQLConf.dynamicPartitionPruningReuseBroadcastOnly for the current value
Used when:
- PartitionPruning logical optimization is executed (and requested to insertPredicate)
enableCsvExpressionOptimization¶
spark.sql.optimizer.enableCsvExpressionOptimization
Whether to optimize CSV expressions in SQL optimizer. It includes pruning unnecessary columns from from_csv.
Default: true
Use SQLConf.csvExpressionOptimization for the current value
excludedRules¶
spark.sql.optimizer.excludedRules
Comma-separated list of fully-qualified class names of the optimization rules that should be disabled (excluded) from logical query optimization.
Default: (empty)
Use SQLConf.optimizerExcludedRules method to access the current value.
Important
It is not guaranteed that all the rules to be excluded will eventually be excluded, as some rules are non-excludable.
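For example (a sketch; the rule name only illustrates the expected fully-qualified format):

```scala
spark.conf.set(
  "spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.PushDownPredicates")
```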
expressionProjectionCandidateLimit¶
spark.sql.optimizer.expressionProjectionCandidateLimit
(internal) The maximum number of candidate output expressions whose aliases are replaced. It can preserve the output partitioning and ordering. A negative value disables this optimization.
Default: 100
Used when:
AliasAwareOutputExpression
is requested for the aliasCandidateLimit
expression.nestedPruning.enabled¶
spark.sql.optimizer.expression.nestedPruning.enabled
(internal) Prune nested fields from expressions in an operator which are unnecessary in satisfying a query. Note that this optimization doesn't prune nested fields from physical data source scanning. For pruning nested fields from scanning, please use spark.sql.optimizer.nestedSchemaPruning.enabled config.
Default: true
inSetConversionThreshold¶
spark.sql.optimizer.inSetConversionThreshold
(internal) The threshold of set size for InSet conversion.
Default: 10
Use SQLConf.optimizerInSetConversionThreshold method to access the current value.
inSetSwitchThreshold¶
spark.sql.optimizer.inSetSwitchThreshold
(internal) Configures the max set size in InSet for which Spark will generate code with switch statements. This is applicable only to bytes, shorts, ints, dates.
Must be non-negative and less than or equal to 600.
Default: 400
maxIterations¶
spark.sql.optimizer.maxIterations
Maximum number of iterations for Analyzer and Logical Optimizer.
Default: 100
nestedSchemaPruning.enabled¶
spark.sql.optimizer.nestedSchemaPruning.enabled
(internal) Prune nested fields from the output of a logical relation that are not necessary in satisfying a query. This optimization allows columnar file format readers to avoid reading unnecessary nested column data.
Default: true
Use SQLConf.nestedSchemaPruningEnabled method to access the current value.
nestedPredicatePushdown.supportedFileSources¶
spark.sql.optimizer.nestedPredicatePushdown.supportedFileSources
(internal) A comma-separated list of data source short names or fully qualified data source implementation class names for which Spark tries to push down predicates for nested columns and/or names containing dots to data sources. This configuration is only effective with file-based data source in DSv1. Currently, Parquet implements both optimizations while ORC only supports predicates for names containing dots. The other data sources don't support this feature yet.
Default: parquet,orc
optimizeOneRowRelationSubquery¶
spark.sql.optimizer.optimizeOneRowRelationSubquery
(internal) When true, the optimizer will inline subqueries with OneRowRelation leaf nodes
Default: true
Use SQLConf.OPTIMIZE_ONE_ROW_RELATION_SUBQUERY method to access the property (in a type-safe way)
planChangeLog.batches¶
spark.sql.optimizer.planChangeLog.batches
(internal) Configures a list of batches to be logged in the optimizer, in which the batches are specified by their batch names and separated by comma.
Default: (undefined)
planChangeLog.level¶
spark.sql.optimizer.planChangeLog.level
(internal) Configures the log level for logging the change from the original plan to the new plan after a rule or batch is applied. The value can be TRACE, DEBUG, INFO, WARN or ERROR.
Default: TRACE
planChangeLog.rules¶
spark.sql.optimizer.planChangeLog.rules
(internal) Configures a list of rules to be logged in the optimizer, in which the rules are specified by their rule names and separated by comma.
Default: (undefined)
propagateDistinctKeys.enabled¶
spark.sql.optimizer.propagateDistinctKeys.enabled
(internal) Controls whether the Logical Query Optimizer propagates the distinct attributes of logical operators for query optimization
Default: true
Used when:
LogicalPlanDistinctKeys
logical operator is requested for the distinct keys
replaceExceptWithFilter¶
spark.sql.optimizer.replaceExceptWithFilter
(internal) When true, the apply function of the rule verifies whether the right node of the except operation is of type Filter or Project followed by Filter. If so, the rule further verifies the following conditions:

- Excluding the filter operations from the right (as well as the left node, if any) on the top, whether both the nodes evaluate to the same result
- The left and right nodes don't contain any SubqueryExpressions
- The output column names of the left node are distinct

If all the conditions are met, the rule will replace the except operation with a Filter by flipping the filter condition(s) of the right node.
Default: true
runtime.bloomFilter.creationSideThreshold¶
spark.sql.optimizer.runtime.bloomFilter.creationSideThreshold
Size threshold of the bloom filter creation side plan. Estimated size needs to be under this value to try to inject bloom filter.
Default: 10MB
Use SQLConf.runtimeFilterCreationSideThreshold for the current value
Used when:
- InjectRuntimeFilter logical optimization is executed (to injectBloomFilter)
runtime.bloomFilter.enabled¶
spark.sql.optimizer.runtime.bloomFilter.enabled
Enables a bloom filter on one side of a shuffle join if the other side has a selective predicate (to reduce the amount of shuffle data)
Default: true
Use SQLConf.runtimeFilterBloomFilterEnabled for the current value
Used when:
- InjectRuntimeFilter logical optimization is executed
runtime.bloomFilter.expectedNumItems¶
spark.sql.optimizer.runtime.bloomFilter.expectedNumItems
The default number of expected items for the runtime Bloom filter
Default: 1000000L
SQLConf.RUNTIME_BLOOM_FILTER_EXPECTED_NUM_ITEMS
Used when:
- BloomFilterAggregate expression is created
runtime.bloomFilter.maxNumBits¶
spark.sql.optimizer.runtime.bloomFilter.maxNumBits
Maximum number of bits for the runtime bloom filter
Default: 67108864L (8MB)
Must be a non-zero positive number
SQLConf.RUNTIME_BLOOM_FILTER_MAX_NUM_BITS
Used when:
BloomFilterAggregate
is requested to checkInputDataTypes and for the numBits
runtime.rowLevelOperationGroupFilter.enabled¶
spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled
Enables runtime group filtering for group-based row-level operations.
Data sources that replace groups of data (e.g. files, partitions) may prune entire groups using provided data source filters when planning a row-level operation scan. However, such filtering is limited as not all expressions can be converted into data source filters and some expressions can only be evaluated by Spark (e.g. subqueries). Since rewriting groups is expensive, Spark can execute a query at runtime to find what records match the condition of the row-level operation. The information about matching records will be passed back to the row-level operation scan, allowing data sources to discard groups that don't have to be rewritten.
Default: true
Current value: SQLConf.runtimeRowLevelOperationGroupFilterEnabled
Used when:
RowLevelOperationRuntimeGroupFiltering
logical optimization is executed
runtimeFilter.number.threshold¶
spark.sql.optimizer.runtimeFilter.number.threshold
The total number of injected runtime filters (non-DPP) for a single query. This is to prevent driver OOMs with too many Bloom filters.
Default: 10
Must be a non-zero positive number
SQLConf.RUNTIME_FILTER_NUMBER_THRESHOLD
Used when:
- InjectRuntimeFilter logical optimization is executed
runtimeFilter.semiJoinReduction.enabled¶
spark.sql.optimizer.runtimeFilter.semiJoinReduction.enabled
Enables inserting a semi join on one side of a shuffle join if the other side has a selective predicate (to reduce the amount of shuffle data)
Default: false
Use SQLConf.runtimeFilterSemiJoinReductionEnabled for the current value
Used when:
- InjectRuntimeFilter logical optimization is executed
serializer.nestedSchemaPruning.enabled¶
spark.sql.optimizer.serializer.nestedSchemaPruning.enabled
(internal) Prune nested fields from object serialization operator which are unnecessary in satisfying a query. This optimization allows object serializers to avoid executing unnecessary nested expressions.
Default: true
spark.sql¶
retainGroupColumns¶
spark.sql.retainGroupColumns
(internal) Controls whether to include (retain) grouping columns or not in Aggregation Queries
Default: true
Use SQLConf.dataFrameRetainGroupColumns for the current value
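A minimal sketch of the effect on an aggregation (using a small in-memory Dataset):

```scala
import spark.implicits._

val sales = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")

sales.groupBy("key").max("value").show()   // the key grouping column is retained (default)

spark.conf.set("spark.sql.retainGroupColumns", "false")
sales.groupBy("key").max("value").show()   // only max(value) remains in the output
```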
spark.sql.files¶
maxPartitionBytes¶
spark.sql.files.maxPartitionBytes
Maximum number of bytes to pack into a single partition when reading files for file-based data sources (e.g., Parquet)
Default: 128MB
(like parquet.block.size
)
Use SQLConf.filesMaxPartitionBytes for the current value
Used when:
FilePartition
is requested for maxSplitBytes
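A sketch of how a lower value typically yields more (smaller) input partitions (the path is hypothetical):

```scala
spark.conf.set("spark.sql.files.maxPartitionBytes", "32MB")
val df = spark.read.parquet("/path/to/a/large/parquet/table")  // hypothetical dataset
println(df.rdd.getNumPartitions)  // usually higher than with the default 128MB
```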
maxPartitionNum¶
spark.sql.files.maxPartitionNum
The suggested (not guaranteed) maximum number of split file partitions. If set, Spark will rescale each partition to make the number of partitions close to this value if the initial number of partitions exceeds this value.
Effective only with file-based sources such as Parquet, JSON and ORC.
Default: (undefined)
Use SQLConf.filesMaxPartitionNum for the current value
Used when:
FilePartition
is requested for the file partitions
maxRecordsPerFile¶
spark.sql.files.maxRecordsPerFile
Maximum number of records to write out to a single file. If 0 or negative, there is no limit.
Default: 0
Use SQLConf.maxRecordsPerFile method for the current value
Used when:
- FileFormatWriter is requested to write data out
- FileWrite is requested for a BatchWrite (and creates a WriteJobDescription)
minPartitionNum¶
spark.sql.files.minPartitionNum
Hint about the minimum number of partitions for file-based data sources (e.g., Parquet)
Default: spark.sql.leafNodeDefaultParallelism
Use SQLConf.filesMinPartitionNum for the current value
Used when:
FilePartition
is requested for maxSplitBytes
openCostInBytes¶
spark.sql.files.openCostInBytes
(internal) The estimated cost to open a file, measured by the number of bytes could be scanned at the same time (to include multiple files into a partition). Effective only for file-based sources such as Parquet, JSON and ORC.
Default: 4MB
It is better to over-estimate it; then the partitions with small files will be faster than partitions with bigger files (which are scheduled first).
Use SQLConf.filesOpenCostInBytes for the current value
Used when:
- FileSourceScanExec physical operator is requested to create an RDD for a non-bucketed read
- FilePartition is requested to getFilePartitions and maxSplitBytes
spark.sql.parquet¶
aggregatePushdown¶
spark.sql.parquet.aggregatePushdown
Controls aggregate pushdown in the parquet connector

Supports MIN, MAX and COUNT as aggregate expressions:

- For MIN/MAX, boolean, integer, float and date types are supported.
- For COUNT, all data types are supported.

If statistics are missing from any Parquet file footer, an exception is thrown.
Default: false
Use SQLConf.parquetAggregatePushDown for the current value
Used when:
ParquetScanBuilder
is requested to pushAggregation
columnarReaderBatchSize¶
spark.sql.parquet.columnarReaderBatchSize
The number of rows to include in a parquet vectorized reader batch (the capacity of VectorizedParquetRecordReader)
Default: 4096 (4k)
The number should be carefully chosen to minimize overhead and avoid OOMs while reading data.
Use SQLConf.parquetVectorizedReaderBatchSize for the current value
Used when:
- ParquetFileFormat is requested for a data reader (and creates a VectorizedParquetRecordReader for Vectorized Parquet Decoding)
- ParquetPartitionReaderFactory is created
- WritableColumnVector is requested to reserve required capacity (and fails)
enableNestedColumnVectorizedReader¶
spark.sql.parquet.enableNestedColumnVectorizedReader
Enables vectorized parquet decoding for nested columns (e.g., arrays, structs and maps). Requires spark.sql.parquet.enableVectorizedReader to be enabled
Default: true
Use SQLConf.parquetVectorizedReaderNestedColumnEnabled for the current value
Used when:
ParquetUtils
is requested to isBatchReadSupported
filterPushdown¶
spark.sql.parquet.filterPushdown
Controls filter predicate push-down optimization for parquet connector
Default: true
Use SQLConf.parquetFilterPushDown for the current value
Used when:
- ParquetFileFormat is created
- ParquetPartitionReaderFactory is created
- ParquetScanBuilder is requested to pushDataFilters
filterPushdown.stringPredicate¶
spark.sql.parquet.filterPushdown.stringPredicate
(internal) Controls Parquet filter push-down optimization for string predicate such as startsWith/endsWith/contains functions. Effective only with spark.sql.parquet.filterPushdown enabled.
Default: spark.sql.parquet.filterPushdown.string.startsWith
Use SQLConf.parquetFilterPushDownStringPredicate for the current value
Used when:
- ParquetFileFormat is requested to buildReaderWithPartitionValues
- ParquetPartitionReaderFactory is created
- ParquetScanBuilder is requested to pushDataFilters
mergeSchema¶
spark.sql.parquet.mergeSchema
Controls whether the Parquet data source merges schemas collected from all data files or not. If false, the schema is picked from the summary file or a random data file if no summary file is available.
Default: false
Use SQLConf.isParquetSchemaMergingEnabled for the current value
Parquet option (of higher priority): mergeSchema
Used when:
ParquetOptions
is created (and initializes mergeSchema option)
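Since the mergeSchema option takes precedence, schema merging can also be requested for a single read (a sketch; the path is hypothetical):

```scala
val df = spark.read
  .option("mergeSchema", "true")
  .parquet("/path/to/partitioned/parquet/table")  // hypothetical dataset
```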
output.committer.class¶
spark.sql.parquet.output.committer.class
(internal) The output committer class used by the parquet data source. The specified class needs to be a subclass of org.apache.hadoop.mapreduce.OutputCommitter. Typically, it's also a subclass of org.apache.parquet.hadoop.ParquetOutputCommitter. If it is not, then metadata summaries will never be created, irrespective of the value of parquet.summary.metadata.level.
Default: org.apache.parquet.hadoop.ParquetOutputCommitter
Use SQLConf.parquetOutputCommitterClass for the current value
Used when:
ParquetUtils
is requested to prepareWrite
spark.sql.sources¶
bucketing.enabled¶
spark.sql.sources.bucketing.enabled
Enables Bucketing
Default: true
When disabled (i.e. false), bucketed tables are considered regular (non-bucketed) tables.
Use SQLConf.bucketingEnabled method for the current value
commitProtocolClass¶
spark.sql.sources.commitProtocolClass
(internal) Fully-qualified class name of a FileCommitProtocol
(Spark Core) for Transactional Writes
Default: SQLHadoopMapReduceCommitProtocol
Use SQLConf.fileCommitProtocolClass method for the current value
Used when:
- FileWrite is requested for a BatchWrite
- InsertIntoHadoopFsRelationCommand logical command is executed
- SaveAsHiveFile is requested to saveAsHiveFile
outputCommitterClass¶
spark.sql.sources.outputCommitterClass
(internal) The fully-qualified class name of the user-defined Hadoop OutputCommitter for SQLHadoopMapReduceCommitProtocol
Default: (undefined)
Use SQLConf.OUTPUT_COMMITTER_CLASS to access the property
Note
ParquetUtils uses spark.sql.parquet.output.committer.class or the default ParquetOutputCommitter instead.
v2.bucketing.enabled¶
spark.sql.sources.v2.bucketing.enabled
Enables bucketing for connectors (V2 data sources).
When enabled, Spark will recognize the specific distribution reported by a V2 data source through SupportsReportPartitioning, and avoid shuffle if necessary.
Similar to spark.sql.sources.bucketing.enabled
Use SQLConf.v2BucketingEnabled for the current value
Used when:
DataSourceV2ScanExecBase
is requested to groupPartitions
v2.bucketing.partiallyClusteredDistribution.enabled¶
spark.sql.sources.v2.bucketing.partiallyClusteredDistribution.enabled
During a Storage-Partitioned Join, whether to allow input partitions to be partially clustered, when both sides of the join are of KeyGroupedPartitioning.
Default: false
At planning time, Spark will pick the side with less data size based on table statistics, group and replicate them to match the other side.
This is an optimization on skew join and can help to reduce data skewness when certain partitions are assigned large amount of data.
Requires both spark.sql.sources.v2.bucketing.enabled and spark.sql.sources.v2.bucketing.pushPartValues.enabled to be enabled
Use SQLConf.v2BucketingPartiallyClusteredDistributionEnabled for the current value
Used when:
- BatchScanExec physical operator is requested for the input RDD
- DataSourceV2ScanExecBase physical operator is requested for groupPartitions
- EnsureRequirements physical optimization is executed (to checkKeyGroupCompatible)
v2.bucketing.pushPartValues.enabled¶
spark.sql.sources.v2.bucketing.pushPartValues.enabled
Whether to pushdown common partition values when spark.sql.sources.v2.bucketing.enabled is enabled.
Default: false
When enabled, if both sides of a join are of KeyGroupedPartitioning and if they share compatible partition keys, even if they don't have the exact same partition values, Spark will calculate a superset of partition values and push down that info to scan nodes, which will use empty partitions for the missing partition values on either side. This could help to eliminate unnecessary shuffles.
Use SQLConf.v2BucketingPushPartValuesEnabled for the current value
Used when:
- DataSourceV2ScanExecBase physical operator is requested to groupPartitions
- BatchScanExec physical operator is requested for the inputRDD
- EnsureRequirements physical optimization is requested to checkKeyGroupCompatible
spark.sql.objectHashAggregate.sortBased.fallbackThreshold¶
(internal) The number of entries in an in-memory hash map (to store aggregation buffers per grouping keys) before ObjectHashAggregateExec (ObjectAggregationIterator, precisely) falls back to sort-based aggregation
Default: 128 (entries)
Use SQLConf.objectAggSortBasedFallbackThreshold for the current value
Learn more in Demo: ObjectHashAggregateExec and Sort-Based Fallback Tasks
spark.sql.legacy.allowNonEmptyLocationInCTAS¶
(internal) When false, CTAS with LOCATION throws an analysis exception if the location is not empty.
Default: false
Use SQLConf.allowNonEmptyLocationInCTAS for the current value
spark.sql.legacy.allowAutoGeneratedAliasForView¶
(internal) When true, it's allowed to use an input query without explicit alias when creating a permanent view.
Default: false
Use SQLConf.allowAutoGeneratedAliasForView for the current value
spark.sql.sessionWindow.buffer.spill.threshold¶
(internal) The threshold for the number of rows to be spilled by the window operator. Note that the buffer is used only for queries where Spark SQL cannot apply aggregations while determining the session window.
Default: spark.shuffle.spill.numElementsForceSpillThreshold
Use SQLConf.sessionWindowBufferSpillThreshold for the current value
spark.sql.legacy.allowStarWithSingleTableIdentifierInCount¶
(internal) When true, the SQL function count is allowed to take a single tblName.* as parameter
Default: false
Use SQLConf.allowStarWithSingleTableIdentifierInCount for the current value
spark.sql.sessionWindow.buffer.in.memory.threshold¶
(internal) Threshold for the number of windows guaranteed to be held in memory by the session window operator. Note that the buffer is used only for queries where Spark SQL cannot apply aggregations while determining the session window.
Default: 4096
Use SQLConf.sessionWindowBufferInMemoryThreshold for the current value
spark.sql.orc.enableNestedColumnVectorizedReader¶
Enables vectorized orc decoding for nested columns
Default: false
Use SQLConf.orcVectorizedReaderNestedColumnEnabled for the current value
spark.sql.analyzer.maxIterations¶
(internal) The max number of iterations the analyzer runs.
Default: 100
spark.sql.analyzer.failAmbiguousSelfJoin¶
(internal) When true, fail the Dataset query if it contains an ambiguous self-join.
Default: true
spark.sql.ansi.enabled¶
When true, Spark tries to conform to the ANSI SQL specification:
- Spark will throw a runtime exception if an overflow occurs in any operation on integral/decimal field.
- Spark will forbid using the reserved keywords of ANSI SQL as identifiers in the SQL parser.
Default: false
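A sketch of the runtime behavior under ANSI mode (integer overflow fails instead of silently wrapping around):

```scala
spark.conf.set("spark.sql.ansi.enabled", "true")
// Throws an ArithmeticException with ANSI mode on; returns -2147483648 otherwise
spark.sql("SELECT 2147483647 + 1").show()
```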
spark.sql.cli.print.header¶
When true, spark-sql CLI prints the names of the columns in query output
Default: false
Use SQLConf.cliPrintHeader for the current value
spark.sql.debug.maxToStringFields¶
Maximum number of fields of sequence-like entries that can be converted to strings in debug output. Any elements beyond the limit will be dropped and replaced by a "... N more fields" placeholder.
Default: 25
Use SQLConf.maxToStringFields method to access the current value.
spark.sql.defaultCatalog¶
Name of the default catalog
Default: spark_catalog
Use SQLConf.DEFAULT_CATALOG to access the current value.
spark.sql.statistics.histogram.enabled¶
Enables generating histograms for ANALYZE TABLE SQL statement
Default: false
Equi-Height Histogram
Histograms can provide better estimation accuracy. Currently, Spark only supports equi-height histogram. Note that collecting histograms takes extra cost. For example, collecting column statistics usually takes only one table scan, but generating equi-height histogram will cause an extra table scan.
Use SQLConf.histogramEnabled method to access the current value.
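A sketch of collecting equi-height histograms (the t1 table and its columns are hypothetical):

```scala
spark.conf.set("spark.sql.statistics.histogram.enabled", "true")
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS id, price")
spark.sql("DESCRIBE EXTENDED t1 id").show(truncate = false)  // column statistics incl. histogram
```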
spark.sql.session.timeZone¶
The ID of session-local timezone (e.g. "GMT", "America/Los_Angeles")
Default: Java's TimeZone.getDefault.getID
Use SQLConf.sessionLocalTimeZone method to access the current value.
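For example:

```scala
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
spark.sql("SELECT current_timestamp() AS now").show(truncate = false)
```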
spark.sql.sources.ignoreDataLocality¶
(internal) When true, Spark will not fetch the block locations for each file on listing files. This speeds up file listing, but the scheduler cannot schedule tasks to take advantage of data locality. It can be particularly useful if data is read from a remote cluster so the scheduler could never take advantage of locality anyway.
Default: false
spark.sql.sources.validatePartitionColumns¶
(internal) When this option is set to true, partition column values will be validated with user-specified schema. If the validation fails, a runtime exception is thrown. When this option is set to false, the partition column value will be converted to null if it can not be casted to corresponding user-specified schema.
Default: true
spark.sql.sources.useV1SourceList¶
(internal) A comma-separated list of data source short names (DataSourceRegisters) or fully-qualified canonical class names of the data sources (TableProviders) for which DataSource V2 code path is disabled (and Data Source V1 code path used).
Default: avro,csv,json,kafka,orc,parquet,text
Used when:
DataSource
utility is used to lookupDataSourceV2
spark.sql.storeAssignmentPolicy¶
When inserting a value into a column with different data type, Spark will perform type coercion. Currently, we support 3 policies for the type coercion rules: ANSI, legacy and strict. With ANSI policy, Spark performs the type coercion as per ANSI SQL. In practice, the behavior is mostly the same as PostgreSQL. It disallows certain unreasonable type conversions such as converting string to int or double to boolean. With legacy policy, Spark allows the type coercion as long as it is a valid Cast, which is very loose, e.g. converting string to int or double to boolean is allowed. It is also the only behavior in Spark 2.x and it is compatible with Hive. With strict policy, Spark doesn't allow any possible precision loss or data truncation in type coercion, e.g. converting double to int or decimal to double is not allowed.

Possible values: ANSI, LEGACY, STRICT

Default: ANSI
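A sketch of the difference between the policies on INSERT (the ints table is hypothetical):

```scala
spark.sql("CREATE TABLE ints(i INT) USING parquet")

// Rejected under the default ANSI policy (and under STRICT), per the description above
// spark.sql("INSERT INTO ints VALUES ('not a number')")

// Allowed under LEGACY: the string is cast, which yields NULL here
spark.conf.set("spark.sql.storeAssignmentPolicy", "LEGACY")
spark.sql("INSERT INTO ints VALUES ('not a number')")
spark.sql("SELECT * FROM ints").show()
```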
spark.sql.thriftServer.interruptOnCancel¶
When true, all running tasks will be interrupted if one cancels a query. When false, all running tasks will remain until finished.
Default: false
Use SQLConf.THRIFTSERVER_FORCE_CANCEL to access the property
spark.sql.hive.tablePropertyLengthThreshold¶
(internal) The maximum length allowed in a single cell when storing Spark-specific information in Hive's metastore as table properties. Currently it covers 2 things: the schema's JSON string, the histogram of column statistics.
Default: (undefined)
spark.sql.orc.mergeSchema¶
When true, the Orc data source merges schemas collected from all data files, otherwise the schema is picked from a random data file.
Default: false
spark.sql.sources.bucketing.autoBucketedScan.enabled¶
When true, decide whether to do bucketed scan on input tables based on query plan automatically. Do not use bucketed scan if 1. query does not have operators to utilize bucketing (e.g. join, group-by, etc), or 2. there's an exchange operator between these operators and table scan.

Note when spark.sql.sources.bucketing.enabled is set to false, this configuration does not take any effect.
Default: true
Use SQLConf.autoBucketedScanEnabled to access the property
spark.sql.datetime.java8API.enabled¶
When true, java.time.Instant and java.time.LocalDate classes of Java 8 API are used as external types for Catalyst's TimestampType and DateType. When false, java.sql.Timestamp and java.sql.Date are used for the same purpose.
Default: false
spark.sql.legacy.interval.enabled¶
(internal) When true, Spark SQL uses the mixed legacy interval type CalendarIntervalType instead of the ANSI-compliant interval types YearMonthIntervalType and DayTimeIntervalType. For instance, the date subtraction expression returns CalendarIntervalType when the SQL config is set to true, otherwise an ANSI interval.
Default: false
Use SQLConf.legacyIntervalEnabled to access the current value
spark.sql.sources.binaryFile.maxLength¶
(internal) The max length of a file that can be read by the binary file data source. Spark will fail fast and not attempt to read the file if its length exceeds this value. The theoretical max is Int.MaxValue, though VMs might implement a smaller max.
Default: Int.MaxValue
spark.sql.mapKeyDedupPolicy¶
The policy to deduplicate map keys in builtin function: CreateMap, MapFromArrays, MapFromEntries, StringToMap, MapConcat and TransformKeys. When EXCEPTION, the query fails if duplicated map keys are detected. When LAST_WIN, the map key that is inserted at last takes precedence.
Possible values: EXCEPTION, LAST_WIN
Default: EXCEPTION
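A sketch of the difference using map_concat with a duplicated key:

```scala
// Fails with the default EXCEPTION policy because key 1 is duplicated
// spark.sql("SELECT map_concat(map(1, 'a'), map(1, 'b'))").show()

spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN")
spark.sql("SELECT map_concat(map(1, 'a'), map(1, 'b')) AS m").show(truncate = false)  // {1 -> b}
```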
spark.sql.maxConcurrentOutputFileWriters¶
(internal) Maximum number of output file writers for FileFormatWriter to use concurrently (writing out a query result). If the number of writers needed reaches this limit, a task will sort the rest of the output and then write it out.
Default: 0
Use SQLConf.maxConcurrentOutputFileWriters for the current value
spark.sql.maxMetadataStringLength¶
Maximum number of characters to output for a metadata string (e.g., Location in FileScan)
Default: 100
Must be bigger than 3
Use SQLConf.maxMetadataStringLength for the current value
spark.sql.maven.additionalRemoteRepositories¶
A comma-delimited string config of the optional additional remote Maven mirror repositories. This is only used for downloading Hive jars in IsolatedClientLoader if the default Maven Central repo is unreachable.
Default: https://maven-central.storage-download.googleapis.com/maven2/
spark.sql.maxPlanStringLength¶
Maximum number of characters to output for a plan string. If the plan is longer, further output will be truncated. The default setting always generates a full plan. Set this to a lower value such as 8k if plan strings are taking up too much memory or are causing OutOfMemory errors in the driver or UI processes.
Default: Integer.MAX_VALUE - 15
spark.sql.addPartitionInBatch.size¶
(internal) The number of partitions to be handled in one turn when using AlterTableAddPartitionCommand to add partitions into a table. The smaller the batch size is, the less memory is required for the real handler, e.g. Hive Metastore.
Default: 100
spark.sql.scriptTransformation.exitTimeoutInSeconds¶
(internal) Timeout for an executor to wait for the termination of a transformation script at EOF.
Default: 10 seconds
spark.sql.avro.compression.codec¶
The compression codec to use when writing Avro data to disk
Default: snappy
The supported codecs are:

- uncompressed
- deflate
- snappy
- bzip2
- xz
Use SQLConf.avroCompressionCodec method to access the current value.
spark.sql.broadcastTimeout¶
Timeout in seconds for the broadcast wait time in broadcast joins.
Default: 5 * 60
When negative, it is assumed infinite (i.e. Duration.Inf)
Use SQLConf.broadcastTimeout method to access the current value.
spark.sql.bucketing.coalesceBucketsInJoin.enabled¶
When enabled (true), if two bucketed tables with a different number of buckets are joined, the side with the bigger number of buckets will be coalesced to have the same number of buckets as the other side (this happens only when the bigger number of buckets is divisible by the smaller number of buckets). Bucket coalescing is applied to sort-merge joins and shuffled hash join.
Note
Coalescing bucketed table can avoid unnecessary shuffling in join, but it also reduces parallelism and could possibly cause OOM for shuffled hash join.
Default: false
Use SQLConf.coalesceBucketsInJoinEnabled method to access the current value.
spark.sql.caseSensitive¶
(internal) Controls whether the query analyzer should be case sensitive (true) or not (false).
Default: false
It is highly discouraged to turn on case sensitive mode.
Use SQLConf.caseSensitiveAnalysis method to access the current value.
spark.sql.cbo.enabled¶
Enables Cost-Based Optimization (CBO) for estimation of plan statistics when true.
Default: false
Use SQLConf.cboEnabled method to access the current value.
spark.sql.cbo.joinReorder.enabled¶
Enables join reorder for cost-based optimization (CBO).
Default: false
Use SQLConf.joinReorderEnabled method to access the current value.
spark.sql.cbo.planStats.enabled¶
When true, the logical plan will fetch row counts and column statistics from catalog.
Default: false
spark.sql.cbo.starSchemaDetection¶
Enables join reordering based on star schema detection for cost-based optimization (CBO) in ReorderJoin logical plan optimization.
Default: false
Use SQLConf.starSchemaDetection method to access the current value.
spark.sql.codegen¶
aggregate.sortAggregate.enabled¶
spark.sql.codegen.aggregate.sortAggregate.enabled
Enables code generation for SortAggregateExec physical operator
Default: true
aggregate.splitAggregateFunc.enabled¶
spark.sql.codegen.aggregate.splitAggregateFunc.enabled
(internal) When true, the code generator would split aggregate code into individual methods instead of a single big method. This can be used to avoid an oversized function that could miss the opportunity of JIT optimization.
Default: true
comments¶
spark.sql.codegen.comments
Controls whether CodegenContext should register comments (true) or not (false).
Default: false
factoryMode¶
spark.sql.codegen.factoryMode
(internal) Determines the codegen generator fallback behavior
Default: FALLBACK
Acceptable values:
- CODEGEN_ONLY - disable fallback mode
- FALLBACK - try codegen first and, if any compile error happens, fall back to interpreted mode
- NO_CODEGEN - skip codegen and always use the interpreted path
Used when CodeGeneratorWithInterpretedFallback is requested to createObject (when UnsafeProjection is requested to create an UnsafeProjection for Catalyst expressions)
useIdInClassName¶
spark.sql.codegen.useIdInClassName
(internal) Controls whether to embed the (whole-stage) codegen stage ID into the class name of the generated class as a suffix (true) or not (false)
Default: true
Use SQLConf.wholeStageUseIdInClassName method to access the current value.
maxFields¶
spark.sql.codegen.maxFields
(internal) Maximum number of output fields (including nested fields) that whole-stage codegen supports. Going above the number deactivates whole-stage codegen.
Default: 100
Use SQLConf.wholeStageMaxNumFields method to access the current value.
splitConsumeFuncByOperator¶
spark.sql.codegen.splitConsumeFuncByOperator
(internal) Controls whether whole-stage codegen puts the logic of consuming rows of each physical operator into individual methods, instead of a single big method. This can be used to avoid an oversized function that could miss the opportunity of JIT optimization.
Default: true
Use SQLConf.wholeStageSplitConsumeFuncByOperator method to access the current value.
spark.sql.columnNameOfCorruptRecord¶
spark.sql.constraintPropagation.enabled¶
(internal) When true, the query optimizer will infer and propagate data constraints in the query plan to optimize them. Constraint propagation can sometimes be computationally expensive for certain kinds of query plans (such as those with a large number of predicates and aliases) which might negatively impact overall runtime.
Default: true
Use SQLConf.constraintPropagationEnabled method to access the current value.
spark.sql.csv.filterPushdown.enabled¶
(internal) When true, enables filter pushdown to the CSV data source.
Default: true
spark.sql.defaultSizeInBytes¶
(internal) Estimated size of a table or relation used in query planning
Default: Java's Long.MaxValue
Set to Java's Long.MaxValue, which is larger than spark.sql.autoBroadcastJoinThreshold, to be more conservative. That is to say, by default the optimizer will not choose to broadcast a table unless it knows for sure that the table size is small enough.
Used by the planner to decide when it is safe to broadcast a relation. By default, the system will assume that tables are too large to broadcast.
Use SQLConf.defaultSizeInBytes method to access the current value.
spark.sql.dialect¶
execution.useObjectHashAggregateExec¶
spark.sql.execution.useObjectHashAggregateExec
(internal) Prefers ObjectHashAggregateExec (over SortAggregateExec) for aggregation
Default: true
Use SQLConf.useObjectHashAggregation method to access the current value.
spark.sql.files.ignoreCorruptFiles¶
Controls whether to ignore corrupt files (true) or not (false). If true, the Spark jobs will continue to run when encountering corrupted files and the contents that have been read will still be returned.
Default: false
Use SQLConf.ignoreCorruptFiles method to access the current value.
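A hedged sketch (the directory is hypothetical):
spark.conf.set("spark.sql.files.ignoreCorruptFiles", true)
// corrupt files under the directory are skipped instead of failing the job
val df = spark.read.parquet("/tmp/mixed-quality-parquet")
df.count()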
spark.sql.files.ignoreMissingFiles¶
Controls whether to ignore missing files (true) or not (false). If true, the Spark jobs will continue to run when encountering missing files and the contents that have been read will still be returned.
Default: false
Use SQLConf.ignoreMissingFiles method to access the current value.
spark.sql.inMemoryColumnarStorage.compressed¶
When enabled, Spark SQL will automatically select a compression codec for each column based on statistics of the data.
Default: true
Use SQLConf.useCompression method to access the current value.
spark.sql.inMemoryColumnarStorage.batchSize¶
Controls the size of batches for columnar caching. Larger batch sizes can improve memory utilization and compression, but risk OOMs when caching data.
Default: 10000
Use SQLConf.columnBatchSize method to access the current value.
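A minimal tuning sketch (the batch size is an illustrative value, not a recommendation):
spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", true)
spark.conf.set("spark.sql.inMemoryColumnarStorage.batchSize", 20000)
val cached = spark.range(1000000).cache()
cached.count()  // materializes the in-memory columnar cache with the settings above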
spark.sql.inMemoryTableScanStatistics.enable¶
(internal) When true, enable in-memory table scan accumulators.
Default: false
spark.sql.inMemoryColumnarStorage.enableVectorizedReader¶
Enables vectorized reader for columnar caching
Default: true
Use SQLConf.cacheVectorizedReaderEnabled method to access the current value.
spark.sql.join.preferSortMergeJoin¶
(internal) Controls whether JoinSelection execution planning strategy prefers sort merge join over shuffled hash join.
Default: true
Use SQLConf.preferSortMergeJoin method to access the current value.
spark.sql.jsonGenerator.ignoreNullFields¶
Whether to ignore null fields when generating JSON objects in JSON data source and JSON functions such as to_json. If false, it generates null for null fields in JSON objects.
Default: true
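A short sketch of the effect (data is illustrative):
import org.apache.spark.sql.functions.{struct, to_json}
import spark.implicits._
spark.conf.set("spark.sql.jsonGenerator.ignoreNullFields", false)
val df = Seq(("a", null.asInstanceOf[String])).toDF("x", "y")
df.select(to_json(struct($"x", $"y")).as("json")).show(false)
// prints {"x":"a","y":null}; with the default (true) the y field would be omitted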
spark.sql.leafNodeDefaultParallelism¶
The default parallelism of leaf operators that produce data
Must be positive
Default: SparkContext.defaultParallelism (Spark Core)
spark.sql.legacy.doLooseUpcast¶
(internal) When true, the upcast will be loose and allows casting String to atomic types.
Default: false
spark.sql.legacy.ctePrecedencePolicy¶
(internal) This config will be removed in future versions and CORRECTED will be the only behavior.
Possible values:
- CORRECTED - inner CTE definitions take precedence
- EXCEPTION - AnalysisException is thrown when a name conflict is detected in nested CTEs
- LEGACY - outer CTE definitions take precedence over inner definitions
Default: EXCEPTION
spark.sql.legacy.timeParserPolicy¶
(internal) When LEGACY, java.text.SimpleDateFormat is used for formatting and parsing dates/timestamps in a locale-sensitive manner, which is the approach before Spark 3.0. When CORRECTED, classes from the java.time.* packages are used for the same purpose. When EXCEPTION (the default), a RuntimeException is thrown when the two approaches would produce different results.
Possible values: EXCEPTION, LEGACY, CORRECTED
Default: EXCEPTION
spark.sql.legacy.followThreeValuedLogicInArrayExists¶
(internal) When true, the ArrayExists expression follows three-valued boolean logic.
Default: true
spark.sql.legacy.fromDayTimeString.enabled¶
(internal) When true, the from bound is not taken into account in the conversion of a day-time string to an interval, and the to bound is used to skip all interval units outside the specified range. When false, ParseException is thrown if the input does not match the pattern defined by from and to.
Default: false
spark.sql.legacy.notReserveProperties¶
(internal) When true, all database and table properties are not reserved and available for create/alter syntaxes. But please be aware that the reserved properties will be silently removed.
Default: false
spark.sql.legacy.addSingleFileInAddFile¶
(internal) When true, only a single file can be added using ADD FILE. If false, users can add a directory by passing a directory path to ADD FILE.
Default: false
spark.sql.legacy.exponentLiteralAsDecimal.enabled¶
(internal) When true, a literal with an exponent (e.g. 1E-30) would be parsed as Decimal rather than Double.
Default: false
spark.sql.legacy.allowNegativeScaleOfDecimal¶
(internal) When true, negative scale of Decimal type is allowed. For example, the type of the number 1E10BD under legacy mode is DecimalType(2, -9), but is Decimal(11, 0) in non-legacy mode.
Default: false
spark.sql.legacy.bucketedTableScan.outputOrdering¶
(internal) When true, the bucketed table scan will list files during planning to figure out the output ordering, which is expensive and may make the planning quite slow.
Default: false
spark.sql.legacy.json.allowEmptyString.enabled¶
(internal) When true, the parser of the JSON data source treats empty strings as null for some data types such as IntegerType.
Default: false
spark.sql.legacy.createEmptyCollectionUsingStringType¶
(internal) When true, Spark returns an empty collection with StringType as element type if the array/map function is called without any parameters. Otherwise, Spark returns an empty collection with NullType as element type.
Default: false
spark.sql.legacy.allowUntypedScalaUDF¶
(internal) When true, the user is allowed to use org.apache.spark.sql.functions.udf(f: AnyRef, dataType: DataType). Otherwise, an exception will be thrown at runtime.
Default: false
spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue¶
(internal) When true, the key attribute resulting from running Dataset.groupByKey for a non-struct key type will be named value, following the behavior of Spark version 2.4 and earlier.
Default: false
spark.sql.legacy.setCommandRejectsSparkCoreConfs¶
(internal) If it is set to true, SET command will fail when the key is registered as a SparkConf entry.
Default: true
spark.sql.legacy.typeCoercion.datetimeToString.enabled¶
(internal) When true, date/timestamp will be cast to string in binary comparisons with String.
Default: false
spark.sql.legacy.allowHashOnMapType¶
(internal) When true, hash expressions can be applied on elements of MapType. Otherwise, an analysis exception will be thrown.
Default: false
spark.sql.legacy.parquet.datetimeRebaseModeInWrite¶
(internal) When LEGACY, Spark will rebase dates/timestamps from the Proleptic Gregorian calendar to the legacy hybrid (Julian + Gregorian) calendar when writing Parquet files. When CORRECTED, Spark will not rebase and will write the dates/timestamps as they are. When EXCEPTION (the default), Spark will fail the writing if it sees ancient dates/timestamps that are ambiguous between the two calendars.
Possible values: EXCEPTION, LEGACY, CORRECTED
Default: EXCEPTION
spark.sql.legacy.parquet.datetimeRebaseModeInRead¶
(internal) When LEGACY, Spark will rebase dates/timestamps from the legacy hybrid (Julian + Gregorian) calendar to the Proleptic Gregorian calendar when reading Parquet files. When CORRECTED, Spark will not rebase and will read the dates/timestamps as they are. When EXCEPTION (the default), Spark will fail the reading if it sees ancient dates/timestamps that are ambiguous between the two calendars. This config is only effective if the writer info (like Spark, Hive) of the Parquet files is unknown.
Possible values: EXCEPTION, LEGACY, CORRECTED
Default: EXCEPTION
spark.sql.legacy.avro.datetimeRebaseModeInWrite¶
(internal) When LEGACY, Spark will rebase dates/timestamps from the Proleptic Gregorian calendar to the legacy hybrid (Julian + Gregorian) calendar when writing Avro files. When CORRECTED, Spark will not rebase and will write the dates/timestamps as they are. When EXCEPTION (the default), Spark will fail the writing if it sees ancient dates/timestamps that are ambiguous between the two calendars.
Possible values: EXCEPTION, LEGACY, CORRECTED
Default: EXCEPTION
spark.sql.legacy.avro.datetimeRebaseModeInRead¶
(internal) When LEGACY, Spark will rebase dates/timestamps from the legacy hybrid (Julian + Gregorian) calendar to the Proleptic Gregorian calendar when reading Avro files. When CORRECTED, Spark will not rebase and will read the dates/timestamps as they are. When EXCEPTION (the default), Spark will fail the reading if it sees ancient dates/timestamps that are ambiguous between the two calendars. This config is only effective if the writer info (like Spark, Hive) of the Avro files is unknown.
Possible values: EXCEPTION, LEGACY, CORRECTED
Default: EXCEPTION
spark.sql.legacy.rdd.applyConf¶
(internal) Enables propagation of SQL configurations when executing operations on the RDD that represents a structured query. This is the (buggy) behavior up to 2.4.4.
Default: true
This is for cases not tracked by SQL execution, when a Dataset is converted to an RDD either using the Dataset.rdd operation or QueryExecution, and then the returned RDD is used to invoke actions on it.
This config is deprecated and will be removed in 3.0.0.
spark.sql.legacy.replaceDatabricksSparkAvro.enabled¶
Enables resolving (mapping) the data source provider com.databricks.spark.avro to the built-in (but external) Avro data source module for backward compatibility.
Default: true
Use SQLConf.replaceDatabricksSparkAvroEnabled method to access the current value.
spark.sql.limit.scaleUpFactor¶
(internal) Minimal increase rate in the number of partitions between attempts when executing the take operator on a structured query. Higher values lead to more partitions read. Lower values might lead to longer execution times as more jobs will be run.
Default: 4
Use SQLConf.limitScaleUpFactor method to access the current value.
spark.sql.optimizeNullAwareAntiJoin¶
(internal) Enables single-column NULL-aware anti join execution planning into BroadcastHashJoinExec (with flag isNullAwareAntiJoin enabled), optimized from O(M*N) calculation into O(M) calculation using hash lookup instead of looping lookup.
Default: true
Use SQLConf.optimizeNullAwareAntiJoin method to access the current value.
spark.sql.orc.impl¶
(internal) When native, use the native version of ORC support instead of the ORC library in Hive 1.2.1.
Default: native
Acceptable values:
hive
native
spark.sql.planChangeLog.level¶
(internal) Log level for logging the change from the original plan to the new plan after a rule or batch is applied.
Default: trace
Supported Values (case-insensitive):
- trace
- debug
- info
- warn
- error
Use SQLConf.planChangeLogLevel method to access the current value.
spark.sql.planChangeLog.batches¶
(internal) Comma-separated list of batch names for plan changes logging
Default: (undefined)
Use SQLConf.planChangeBatches method to access the current value.
spark.sql.planChangeLog.rules¶
(internal) Comma-separated list of rule names for plan changes logging
Default: (undefined)
Use SQLConf.planChangeRules method to access the current value.
spark.sql.pyspark.jvmStacktrace.enabled¶
When true, it shows the JVM stacktrace in the user-facing PySpark exception together with Python stacktrace. By default, it is disabled and hides JVM stacktrace and shows a Python-friendly exception only.
Default: false
spark.sql.parquet.binaryAsString¶
Some other Parquet-producing systems, in particular Impala and older versions of Spark SQL, do not differentiate between binary data and strings when writing out the Parquet schema. This flag tells Spark SQL to interpret binary data as a string to provide compatibility with these systems.
Default: false
Use SQLConf.isParquetBinaryAsString method to access the current value.
spark.sql.parquet.compression.codec¶
Sets the compression codec used when writing Parquet files. If either compression or parquet.compression is specified in the table-specific options/properties, the precedence would be compression, parquet.compression, spark.sql.parquet.compression.codec.
Acceptable values:
brotli
gzip
lz4
lzo
none
snappy
uncompressed
zstd
Default: snappy
Use SQLConf.parquetCompressionCodec method to access the current value.
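A hedged sketch of the precedence (the output path is hypothetical):
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")
spark.range(10).write
  .option("compression", "snappy")           // per-write option wins over the session setting
  .parquet("/tmp/parquet-compression-demo")  // hypothetical output path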
spark.sql.parquet.enableVectorizedReader¶
Enables vectorized parquet decoding.
Default: true
Use SQLConf.parquetVectorizedReaderEnabled method to access the current value.
spark.sql.parquet.filterPushdown.date¶
(internal) Enables parquet filter push-down optimization for Date type (when spark.sql.parquet.filterPushdown is enabled)
Default: true
Use SQLConf.parquetFilterPushDownDate method to access the current value.
spark.sql.parquet.filterPushdown.decimal¶
(internal) Enables parquet filter push-down optimization for Decimal type (when spark.sql.parquet.filterPushdown is enabled)
Default: true
Use SQLConf.parquetFilterPushDownDecimal method to access the current value.
spark.sql.parquet.int96RebaseModeInWrite¶
(internal) Enables rebasing timestamps while writing Parquet files
Formerly known as spark.sql.legacy.parquet.int96RebaseModeInWrite
Acceptable values:
- EXCEPTION - fail writing parquet files if there are ancient timestamps that are ambiguous between the two calendars
- LEGACY - rebase INT96 timestamps from the Proleptic Gregorian calendar to the legacy hybrid (Julian + Gregorian) calendar (gives maximum interoperability)
- CORRECTED - write datetime values with no change (no rebase); only when you are 100% sure that the written files will only be read by Spark 3.0+ or other systems that use the Proleptic Gregorian calendar
Default: EXCEPTION
Writing dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z can be dangerous, as the files may later be read by Spark 2.x or legacy versions of Hive, which use a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar.
See more details in SPARK-31404.
spark.sql.parquet.pushdown.inFilterThreshold¶
(internal) For IN predicates, the Parquet filter will push down a set of OR clauses if the number of values does not exceed this threshold. Otherwise, the Parquet filter will push down a value greater than or equal to its minimum value and less than or equal to its maximum value (when spark.sql.parquet.filterPushdown is enabled)
Disabled when 0
Default: 10
Use SQLConf.parquetFilterPushDownInFilterThreshold method to access the current value.
spark.sql.parquet.filterPushdown.string.startsWith¶
(internal) Enables parquet filter push-down optimization for the startsWith function (when spark.sql.parquet.filterPushdown is enabled)
Default: true
Use SQLConf.parquetFilterPushDownStringStartWith method to access the current value.
spark.sql.parquet.filterPushdown.timestamp¶
(internal) Enables parquet filter push-down optimization for Timestamp type. It can only have an effect when the following hold:
- spark.sql.parquet.filterPushdown is enabled
- Timestamp is stored as TIMESTAMP_MICROS or TIMESTAMP_MILLIS type
Default: true
Use SQLConf.parquetFilterPushDownTimestamp method to access the current value.
spark.sql.parquet.int96AsTimestamp¶
Some Parquet-producing systems, in particular Impala, store Timestamp into INT96. Spark would also store Timestamp as INT96 because we need to avoid precision loss of the nanoseconds field. This flag tells Spark SQL to interpret INT96 data as a timestamp to provide compatibility with these systems.
Default: true
Use SQLConf.isParquetINT96AsTimestamp method to access the current value.
spark.sql.parquet.int96TimestampConversion¶
Controls whether timestamp adjustments should be applied to INT96 data when converting to timestamps, for data written by Impala.
Default: false
This is necessary because Impala stores INT96 data with a different timezone offset than Hive and Spark.
Use SQLConf.isParquetINT96TimestampConversion method to access the current value.
spark.sql.parquet.outputTimestampType¶
Sets which Parquet timestamp type to use when Spark writes data to Parquet files. INT96 is a non-standard but commonly used timestamp type in Parquet. TIMESTAMP_MICROS is a standard timestamp type in Parquet, which stores number of microseconds from the Unix epoch. TIMESTAMP_MILLIS is also standard, but with millisecond precision, which means Spark has to truncate the microsecond portion of its timestamp value.
Acceptable values:
INT96
TIMESTAMP_MICROS
TIMESTAMP_MILLIS
Default: INT96
Use SQLConf.parquetOutputTimestampType method to access the current value.
spark.sql.parquet.recordLevelFilter.enabled¶
Enables Parquet's native record-level filtering using the pushed down filters (when spark.sql.parquet.filterPushdown is enabled).
Default: false
Use SQLConf.parquetRecordFilterEnabled method to access the current value.
spark.sql.parquet.respectSummaryFiles¶
When true, we make the assumption that all part-files of Parquet are consistent with summary files and we will ignore them when merging schema. Otherwise, if this is false (the default), we will merge all part-files. This should be considered an expert-only option, and shouldn't be enabled before knowing what it means exactly.
Default: false
Use SQLConf.isParquetSchemaRespectSummaries method to access the current value.
spark.sql.parser.quotedRegexColumnNames¶
Controls whether quoted identifiers (using backticks) in SELECT statements should be interpreted as regular expressions.
Default: false
Use SQLConf.supportQuotedRegexColumnName method to access the current value.
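A minimal sketch (assuming the Hive-style regex column specification; names are illustrative):
import spark.implicits._
spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true")
Seq((1, 2, 3)).toDF("a1", "a2", "b1").createOrReplaceTempView("t")
spark.sql("SELECT `a.*` FROM t").show()  // the backquoted identifier is treated as a regex and selects a1 and a2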
spark.sql.pivotMaxValues¶
Maximum number of (distinct) values that will be collected without error (when doing a pivot without specifying the values for the pivot column)
Default: 10000
Use SQLConf.dataFramePivotMaxValues method to access the current value.
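A short sketch (data is illustrative):
import spark.implicits._
val sales = Seq(("2023", "Q1", 10), ("2023", "Q2", 20)).toDF("year", "quarter", "amount")
// with no explicit pivot values, Spark first collects the distinct quarters
// and fails if their number exceeds spark.sql.pivotMaxValues
sales.groupBy("year").pivot("quarter").sum("amount").show()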
spark.sql.redaction.options.regex¶
Regular expression to find options of a Spark SQL command with sensitive information
Default: (?i)secret|password
The values of the options matched will be redacted in the explain output.
This redaction is applied on top of the global redaction configuration defined by the spark.redaction.regex configuration.
Used exclusively when SQLConf is requested to redactOptions.
spark.sql.redaction.string.regex¶
Regular expression to point at sensitive information in text output
Default: (undefined)
When this regex matches a string part, it is replaced by a dummy value (i.e. *********(redacted)). This is currently used to redact the output of SQL explain commands.
NOTE: When this conf is not set, the value of spark.redaction.string.regex is used instead.
Use SQLConf.stringRedactionPattern method to access the current value.
spark.sql.runSQLOnFiles¶
(internal) Enables datasource`.`path table names in SQL queries for FileFormat-based data sources (excluding hive tables)
Default: true
Use SQLConf.runSQLonFile method to access the current value.
Used when:
- ResolveSQLOnFile logical analysis rule is executed
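A minimal sketch (the path is hypothetical):
spark.range(5).write.parquet("/tmp/run-sql-on-files-demo")
spark.sql("SELECT * FROM parquet.`/tmp/run-sql-on-files-demo`").show()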
spark.sql.selfJoinAutoResolveAmbiguity¶
Controls whether to resolve ambiguity in join conditions for self-joins automatically (true) or not (false)
Default: true
spark.sql.sort.enableRadixSort¶
(internal) Controls whether to use radix sort (true) or not (false) in ShuffleExchangeExec and SortExec physical operators
Default: true
Radix sort is much faster but requires additional memory to be reserved up-front. The memory overhead may be significant when sorting very small rows (up to 50% more).
Use SQLConf.enableRadixSort method to access the current value.
spark.sql.sources.default¶
Default data source to use for loading or saving data
Default: parquet
Use SQLConf.defaultDataSourceName method to access the current value.
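A hedged sketch (paths are hypothetical):
spark.conf.set("spark.sql.sources.default", "json")
spark.range(3).write.save("/tmp/default-source-demo")  // written as JSON
spark.read.load("/tmp/default-source-demo").show()     // read back without naming the format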
spark.sql.statistics.fallBackToHdfs¶
Enables automatic calculation of table size statistic by falling back to HDFS if the table statistics are not available from table metadata.
Default: false
This can be useful in determining if a table is small enough for auto broadcast joins in query planning.
Use SQLConf.fallBackToHdfsForStatsEnabled method to access the current value.
spark.sql.statistics.histogram.numBins¶
(internal) The number of bins when generating histograms.
Default: 254
NOTE: The number of bins must be greater than 1.
Use SQLConf.histogramNumBins method to access the current value.
spark.sql.statistics.parallelFileListingInStatsComputation.enabled¶
(internal) Enables parallel file listing in SQL commands, e.g. ANALYZE TABLE (as opposed to single-thread listing that can be particularly slow with tables with hundreds of partitions)
Default: true
Use SQLConf.parallelFileListingInStatsComputation method to access the current value.
spark.sql.statistics.ndv.maxError¶
(internal) The maximum estimation error allowed in HyperLogLog++ algorithm when generating column level statistics.
Default: 0.05
spark.sql.statistics.percentile.accuracy¶
(internal) Accuracy of percentile approximation when generating equi-height histograms. Larger value means better accuracy. The relative error can be deduced by 1.0 / PERCENTILE_ACCURACY.
Default: 10000
spark.sql.statistics.size.autoUpdate.enabled¶
Enables automatic update of the table size statistic of a table after the table has changed.
Default: false
IMPORTANT: If the total number of files of the table is very large this can be expensive and slow down data change commands.
Use SQLConf.autoSizeUpdateEnabled method to access the current value.
spark.sql.subexpressionElimination.enabled¶
(internal) Enables Subexpression Elimination
Default: true
Use SQLConf.subexpressionEliminationEnabled method to access the current value.
spark.sql.subexpressionElimination.skipForShortcutExpr¶
(internal) Enables skipping subexpression elimination for shortcut expressions (AND and OR).
The subexpression may not need to be evaluated even if it appears more than once, e.g. in if(or(a, and(b, b))), the expression b is skipped if a is true.
Default: false
Use SQLConf.subexpressionEliminationSkipForShotcutExpr method to access the current value.
Used when:
- EquivalentExpressions is created
spark.sql.shuffle.partitions¶
The default number of partitions to use when shuffling data for joins or aggregations.
Default: 200
Note
Corresponds to Apache Hive's mapred.reduce.tasks property that Spark SQL considers deprecated.
Spark Structured Streaming
spark.sql.shuffle.partitions cannot be changed in Spark Structured Streaming between query restarts from the same checkpoint location.
Use SQLConf.numShufflePartitions method to access the current value.
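A short sketch (the value 8 is illustrative):
import spark.implicits._
spark.conf.set("spark.sql.shuffle.partitions", 8)
val counts = spark.range(1000000).groupBy(($"id" % 10).as("bucket")).count()
counts.explain()  // the Exchange plans 8 shuffle partitions (AQE may later coalesce them)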
spark.sql.sources.fileCompressionFactor¶
(internal) When estimating the output data size of a table scan, multiply the file size by this factor as the estimated data size, in case the data is compressed in the file and would otherwise lead to a heavily underestimated result.
Default: 1.0
Use SQLConf.fileCompressionFactor method to access the current value.
spark.sql.sources.partitionOverwriteMode¶
Enables dynamic partition inserts when dynamic
Default: static
When INSERT OVERWRITE a partitioned data source table with dynamic partition columns, Spark SQL supports two modes (case-insensitive):
- static - Spark deletes all the partitions that match the partition specification (e.g. PARTITION(a=1,b)) in the INSERT statement, before overwriting
- dynamic - Spark doesn't delete partitions ahead of time, and only overwrites those partitions that have data written into them
The default STATIC overwrite mode keeps the same behavior as Spark prior to 2.3. Note that this config doesn't affect Hive serde tables, as they are always overwritten with dynamic mode.
Use SQLConf.partitionOverwriteMode method to access the current value.
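A hedged sketch of dynamic overwrite (the table name is hypothetical):
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
spark.range(10).selectExpr("id", "id % 2 AS p")
  .write.partitionBy("p").mode("overwrite").saveAsTable("pom_demo")  // hypothetical table
// only partition p=0 is overwritten; p=1 is left intact
spark.sql("INSERT OVERWRITE TABLE pom_demo PARTITION (p) SELECT 42 AS id, 0 AS p")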
spark.sql.truncateTable.ignorePermissionAcl.enabled¶
(internal) Disables setting back the original permissions and ACLs when re-creating the table/partition paths for the TRUNCATE TABLE command.
Default: false
Use SQLConf.truncateTableIgnorePermissionAcl method to access the current value.
spark.sql.ui.retainedExecutions¶
Number of SQLExecutionUIData entries to keep in the failedExecutions and completedExecutions internal registries.
Default: 1000
When a query execution finishes, the execution is removed from the internal activeExecutions registry and stored in failedExecutions or completedExecutions given the end execution status. It is when SQLListener makes sure that the number of SQLExecutionUIData entries does not exceed the spark.sql.ui.retainedExecutions Spark property and removes the excess of entries.
spark.sql.variable.substitute¶
Enables Variable Substitution
Default: true
Use SQLConf.variableSubstituteEnabled method to access the current value.
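A minimal sketch of variable substitution:
spark.sql("SET spark.sql.variable.substitute=true")
// ${...} references are replaced with configuration values before the query is parsed
spark.sql("SELECT '${spark.sql.shuffle.partitions}' AS num_shuffle_partitions").show()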
spark.sql.windowExec.buffer.in.memory.threshold¶
(internal) Threshold for the number of rows guaranteed to be held in memory by the WindowExec physical operator.
Default: 4096
Use SQLConf.windowExecBufferInMemoryThreshold method to access the current value.
spark.sql.windowExec.buffer.spill.threshold¶
(internal) Threshold for the number of rows buffered in a WindowExec physical operator (before spilling).
Default: 4096
Use SQLConf.windowExecBufferSpillThreshold method to access the current value.