SharedState — State Shared Across SparkSessions¶
SharedState holds the state that can be shared across SparkSessions:
- CacheManager
- ExternalCatalogWithListener
- GlobalTempViewManager
- Hadoop Configuration
- NonClosableMutableURLClassLoader
- SparkConf
- SparkContext
- SQLAppStatusStore
- StreamingQueryStatusListener
SharedState is shared when SparkSession is created using SparkSession.newSession:
assert(spark.sharedState == spark.newSession.sharedState)
Creating Instance¶
SharedState takes the following to be created:
- SparkContext
- Initial configuration properties
SharedState is created for SparkSession (and cached for later reuse).
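In Scala terms, the constructor looks roughly like this (a sketch; access modifiers are elided and the exact shape varies across Spark versions):

```scala
import org.apache.spark.SparkContext

class SharedState(
    val sparkContext: SparkContext,
    initialConfigs: scala.collection.Map[String, String])
```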
Accessing SharedState¶
SharedState is available using SparkSession.sharedState.
scala> :type spark
org.apache.spark.sql.SparkSession
scala> :type spark.sharedState
org.apache.spark.sql.internal.SharedState
Shared SQL Services¶
ExternalCatalog¶
externalCatalog: ExternalCatalog
ExternalCatalog that is created reflectively based on spark.sql.catalogImplementation internal configuration property:
- HiveExternalCatalog for hive
- InMemoryCatalog for in-memory
While initialized, externalCatalog does the following:

- Creates the default database (with the "default database" description and warehousePath location) unless it is available already
- Registers an ExternalCatalogEventListener that propagates external catalog events to the Spark listener bus
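For illustration, enableHiveSupport is the usual way to switch spark.sql.catalogImplementation to hive so that SharedState creates a HiveExternalCatalog (a sketch; it requires Hive classes on the CLASSPATH):

```scala
import org.apache.spark.sql.SparkSession

// enableHiveSupport sets spark.sql.catalogImplementation to hive;
// without it, the default is in-memory.
val spark = SparkSession.builder()
  .master("local[*]")
  .enableHiveSupport()
  .getOrCreate()

println(spark.conf.get("spark.sql.catalogImplementation")) // hive
```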
GlobalTempViewManager¶
globalTempViewManager: GlobalTempViewManager
When accessed for the very first time, globalTempViewManager gets the name of the global temporary view database based on spark.sql.globalTempDatabase internal static configuration property.
In the end, globalTempViewManager creates a new GlobalTempViewManager (with the configured database name).
globalTempViewManager throws a SparkException when the global temporary view database already exists in the ExternalCatalog:
[globalTempDB] is a system preserved database, please rename your existing database to resolve the name conflict, or set a different value for spark.sql.globalTempDatabase, and launch your Spark application again.
globalTempViewManager is used when BaseSessionStateBuilder and HiveSessionStateBuilder are requested for a SessionCatalog.
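For illustration, global temporary views are registered with the GlobalTempViewManager and resolved through the configured database (global_temp by default):

```scala
// Assumes a spark-shell session (spark: SparkSession).
spark.range(1).createOrReplaceGlobalTempView("demo")
spark.table("global_temp.demo").show()
```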
SQLAppStatusStore¶
statusStore: SQLAppStatusStore
SharedState creates a SQLAppStatusStore when created.
When initialized, statusStore requests the SparkContext for the AppStatusStore, which is then requested for the KVStore (that is assumed to be an ElementTrackingStore).
statusStore creates a SQLAppStatusListener (with the live flag on) and registers it with the LiveListenerBus in the application status queue.
statusStore creates a SQLAppStatusStore (with the KVStore and the SQLAppStatusListener).
In the end, statusStore creates a SQLTab (with the SQLAppStatusStore and the SparkUI if available).
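A quick probe in spark-shell (a sketch; assumes statusStore is accessible in your Spark version):

```scala
val store = spark.sharedState.statusStore
// Number of SQL executions tracked so far in this application
println(store.executionsCount())
```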
externalCatalogClassName Internal Method¶
externalCatalogClassName(
conf: SparkConf): String
externalCatalogClassName gives the name of the class of the ExternalCatalog implementation based on spark.sql.catalogImplementation configuration property:
- org.apache.spark.sql.hive.HiveExternalCatalog for hive
- org.apache.spark.sql.catalyst.catalog.InMemoryCatalog for in-memory
externalCatalogClassName is used when SharedState is requested for the ExternalCatalog.
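A minimal sketch of that mapping (not Spark's exact code; the Hive catalog class is referenced by name since it lives in the separate spark-hive module):

```scala
import org.apache.spark.SparkConf

def externalCatalogClassName(conf: SparkConf): String =
  conf.get("spark.sql.catalogImplementation", "in-memory") match {
    case "hive"      => "org.apache.spark.sql.hive.HiveExternalCatalog"
    case "in-memory" => "org.apache.spark.sql.catalyst.catalog.InMemoryCatalog"
  }
```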
Warehouse Location¶
warehousePath: String
Warning
This is no longer part of SharedState and will go away once I find out where. Your help is appreciated.
warehousePath is the location of the warehouse.
warehousePath is hive.metastore.warehouse.dir (if defined) or spark.sql.warehouse.dir.
warehousePath prints out the following INFO message to the logs when SharedState is created:
Warehouse path is '[warehousePath]'.
warehousePath is used when SharedState initializes ExternalCatalog (and creates the default database in the metastore).
While initialized, warehousePath does the following (a sketch of the precedence rules follows the list):

- Loads hive-site.xml when found on CLASSPATH, i.e. adds it as a configuration resource to Hadoop's [Configuration](http://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/conf/Configuration.html) (of SparkContext)
- Removes hive.metastore.warehouse.dir from SparkConf (of SparkContext) and leaves it off if defined using any of the Hadoop configuration resources
- Sets spark.sql.warehouse.dir or hive.metastore.warehouse.dir in the Hadoop configuration (of SparkContext):
    - If hive.metastore.warehouse.dir has been defined in any of the Hadoop configuration resources but spark.sql.warehouse.dir has not, spark.sql.warehouse.dir becomes the value of hive.metastore.warehouse.dir, and warehousePath prints out the following INFO message to the logs: spark.sql.warehouse.dir is not set, but hive.metastore.warehouse.dir is set. Setting spark.sql.warehouse.dir to the value of hive.metastore.warehouse.dir ('[hiveWarehouseDir]').
    - Otherwise, the Hadoop configuration's hive.metastore.warehouse.dir is set to spark.sql.warehouse.dir, and warehousePath prints out the following INFO message to the logs: Setting hive.metastore.warehouse.dir ('[hiveWarehouseDir]') to the value of spark.sql.warehouse.dir ('[sparkWarehouseDir]').
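The precedence rules, sketched in Scala (not Spark's exact code; the helper name and defaults are illustrative):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf

def resolveWarehousePath(sparkConf: SparkConf, hadoopConf: Configuration): String = {
  val hiveWarehouseDir = hadoopConf.get("hive.metastore.warehouse.dir")
  val warehousePath =
    if (hiveWarehouseDir != null && !sparkConf.contains("spark.sql.warehouse.dir")) {
      hiveWarehouseDir // spark.sql.warehouse.dir not set: inherit Hive's setting
    } else {
      sparkConf.get("spark.sql.warehouse.dir", "spark-warehouse")
    }
  // Keep Hadoop's view consistent with the resolved location
  hadoopConf.set("hive.metastore.warehouse.dir", warehousePath)
  warehousePath
}
```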
Logging¶
Enable ALL logging level for org.apache.spark.sql.internal.SharedState logger to see what happens inside.
Add the following lines to conf/log4j2.properties:

logger.SharedState.name = org.apache.spark.sql.internal.SharedState
logger.SharedState.level = all
Refer to Logging.