# SharedState — State Shared Across SparkSessions

`SharedState` holds the state that can be shared across `SparkSession`s:

- CacheManager
- ExternalCatalogWithListener
- GlobalTempViewManager
- Hadoop Configuration
- NonClosableMutableURLClassLoader
- SparkConf
- SparkContext
- SQLAppStatusStore
- StreamingQueryStatusListener

`SharedState` is shared when a `SparkSession` is created using `SparkSession.newSession`:

```scala
assert(spark.sharedState == spark.newSession.sharedState)
```
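Since the GlobalTempViewManager is part of the shared state, a global temporary view created in one session is visible from a forked one. A quick spark-shell check, assuming the default `global_temp` database name:

```scala
// Assuming a Spark shell: sharedState (and hence global temporary views)
// survives SparkSession.newSession, while session-local state does not.
spark.range(3).createOrReplaceGlobalTempView("shared_demo")

val forked = spark.newSession()
assert(forked.sharedState == spark.sharedState)
forked.sql("SELECT * FROM global_temp.shared_demo").show()
```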
## Creating Instance

`SharedState` takes the following to be created:

- SparkContext
- Initial configuration properties

`SharedState` is created for a `SparkSession` (and cached for later reuse).
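The "cached for later reuse" part can be pictured with a small, self-contained sketch (hypothetical types, not the Spark source): a session either reuses a `SharedState` handed over at construction (as `newSession` does) or lazily creates one on first access.

```scala
// Hypothetical sketch of the caching, with stand-in types.
class SharedState(ctx: String, initialConfigs: Map[String, String])

class Session(
    ctx: String,
    existingSharedState: Option[SharedState],
    initialOptions: Map[String, String]) {

  // Created on first access and then cached for the session's lifetime.
  lazy val sharedState: SharedState =
    existingSharedState.getOrElse(new SharedState(ctx, initialOptions))

  // A forked session receives (and therefore shares) the same SharedState.
  def newSession(): Session = new Session(ctx, Some(sharedState), Map.empty)
}
```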
## Accessing SharedState

`SharedState` is available using `SparkSession.sharedState`.

```text
scala> :type spark
org.apache.spark.sql.SparkSession

scala> :type spark.sharedState
org.apache.spark.sql.internal.SharedState
```
## Shared SQL Services

### ExternalCatalog

```scala
externalCatalog: ExternalCatalog
```

`externalCatalog` is an ExternalCatalog that is created reflectively based on the spark.sql.catalogImplementation internal configuration property:

- HiveExternalCatalog for `hive`
- InMemoryCatalog for `in-memory`

While initialized, `externalCatalog` does the following:

- Creates the default database (with the `default database` description and the warehousePath location) unless it is available already
- Registers an ExternalCatalogEventListener that propagates external catalog events to the Spark listener bus
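A quick way to peek at the initialized catalog from a Spark shell (`databaseExists` and `listDatabases` are part of the `ExternalCatalog` contract):

```scala
// Assuming a Spark shell: after initialization the shared external catalog
// always contains the default database.
val catalog = spark.sharedState.externalCatalog
assert(catalog.databaseExists("default"))
catalog.listDatabases().foreach(println)
```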
### GlobalTempViewManager

```scala
globalTempViewManager: GlobalTempViewManager
```

When accessed for the very first time, `globalTempViewManager` gets the name of the global temporary view database based on the spark.sql.globalTempDatabase internal static configuration property.

In the end, `globalTempViewManager` creates a new GlobalTempViewManager (with the configured database name).

`globalTempViewManager` throws a `SparkException` when the global temporary view database already exists in the ExternalCatalog:

```text
[globalTempDB] is a system preserved database, please rename your existing database to resolve the name conflict, or set a different value for spark.sql.globalTempDatabase, and launch your Spark application again.
```

`globalTempViewManager` is used when `BaseSessionStateBuilder` and `HiveSessionStateBuilder` are requested for a SessionCatalog.
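A quick spark-shell check, assuming the default `global_temp` database name:

```scala
// Assuming a Spark shell: global temporary views are registered in the shared
// GlobalTempViewManager under the configured database name.
val gtvm = spark.sharedState.globalTempViewManager
println(gtvm.database) // global_temp (by default)

spark.range(1).createOrReplaceGlobalTempView("gtv_demo")
assert(gtvm.listViewNames("*").contains("gtv_demo"))
```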
### SQLAppStatusStore

```scala
statusStore: SQLAppStatusStore
```

`SharedState` creates a SQLAppStatusStore when created.

When initialized, `statusStore` requests the SparkContext for the AppStatusStore, which is in turn requested for the KVStore (assumed to be an ElementTrackingStore).

`statusStore` creates a SQLAppStatusListener (with the `live` flag on) and registers it with the LiveListenerBus (to the application status queue).

`statusStore` creates a SQLAppStatusStore (with the KVStore and the SQLAppStatusListener).

In the end, `statusStore` creates a SQLTab (with the SQLAppStatusStore and the SparkUI, if available).
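A quick spark-shell probe of the store backing the SQL tab (note that the listener processes events asynchronously, so counts may lag briefly behind just-finished queries):

```scala
// Assuming a Spark shell: run a query so the listener has something to record,
// then inspect the recorded SQL executions.
spark.range(10).count()

val store = spark.sharedState.statusStore
println(s"SQL executions recorded: ${store.executionsCount()}")
store.executionsList().foreach(e => println(s"${e.executionId}: ${e.description}"))
```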
## externalCatalogClassName Internal Method

```scala
externalCatalogClassName(
  conf: SparkConf): String
```

`externalCatalogClassName` gives the name of the class of the ExternalCatalog implementation based on the spark.sql.catalogImplementation configuration property:

- org.apache.spark.sql.hive.HiveExternalCatalog for `hive`
- org.apache.spark.sql.catalyst.catalog.InMemoryCatalog for `in-memory`

`externalCatalogClassName` is used when `SharedState` is requested for the ExternalCatalog.
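The mapping itself is a simple two-way lookup; a minimal sketch (not the exact Spark source, which reads the setting through an internal `ConfigEntry`):

```scala
import org.apache.spark.SparkConf

// Sketch: resolve the ExternalCatalog class name to instantiate reflectively
// from the spark.sql.catalogImplementation setting (in-memory by default).
def externalCatalogClassName(conf: SparkConf): String =
  conf.get("spark.sql.catalogImplementation", "in-memory") match {
    case "hive"      => "org.apache.spark.sql.hive.HiveExternalCatalog"
    case "in-memory" => "org.apache.spark.sql.catalyst.catalog.InMemoryCatalog"
  }
```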
## Warehouse Location

```scala
warehousePath: String
```

!!! warning
    This is no longer part of SharedState and will go away once I find out where. Your help is appreciated.

`warehousePath` is the location of the warehouse.

`warehousePath` is the value of hive.metastore.warehouse.dir (if defined) or spark.sql.warehouse.dir.

`warehousePath` prints out the following INFO message to the logs when `SharedState` is created:

```text
Warehouse path is '[warehousePath]'.
```

`warehousePath` is used when `SharedState` initializes the ExternalCatalog (and creates the default database in the metastore).
While initialized, `warehousePath` does the following:

1. Loads `hive-site.xml` when found on the CLASSPATH, i.e. adds it as a configuration resource to Hadoop's [Configuration](http://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/conf/Configuration.html) (of SparkContext)

1. Removes `hive.metastore.warehouse.dir` from SparkConf (of SparkContext) and leaves it off if defined using any of the Hadoop configuration resources

1. Sets spark.sql.warehouse.dir or hive.metastore.warehouse.dir in the Hadoop configuration (of SparkContext):

    - If `hive.metastore.warehouse.dir` has been defined in any of the Hadoop configuration resources but spark.sql.warehouse.dir has not, spark.sql.warehouse.dir becomes the value of hive.metastore.warehouse.dir, and `warehousePath` prints out the following INFO message to the logs:

        ```text
        spark.sql.warehouse.dir is not set, but hive.metastore.warehouse.dir is set. Setting spark.sql.warehouse.dir to the value of hive.metastore.warehouse.dir ('[hiveWarehouseDir]').
        ```

    - Otherwise, the Hadoop configuration's `hive.metastore.warehouse.dir` is set to spark.sql.warehouse.dir, and `warehousePath` prints out the following INFO message to the logs:

        ```text
        Setting hive.metastore.warehouse.dir ('[hiveWarehouseDir]') to the value of spark.sql.warehouse.dir ('[sparkWarehouseDir]').
        ```
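A quick consistency check from a Spark shell: after the reconciliation above, both configurations should report the same location.

```scala
// Assuming a Spark shell: the SQL config and the Hadoop configuration should
// agree on the warehouse location once SharedState has been initialized.
println(spark.conf.get("spark.sql.warehouse.dir"))
println(spark.sparkContext.hadoopConfiguration.get("hive.metastore.warehouse.dir"))
```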
## Logging

Enable `ALL` logging level for the `org.apache.spark.sql.internal.SharedState` logger to see what happens inside.

Add the following lines to `conf/log4j2.properties`:

```text
logger.SharedState.name = org.apache.spark.sql.internal.SharedState
logger.SharedState.level = all
```

Refer to Logging.