CacheManager is a registry of structured queries that are cached and supposed to be replaced with corresponding InMemoryRelation logical operators as their cached representation (when
QueryExecution is requested for a logical query plan with cached data).
CacheManager is shared across SparkSessions through SharedState.
val spark: SparkSession = ... spark.sharedState.cacheManager
Dataset.cache and persist Operators¶
A structured query (as Dataset) can be cached and registered with
CacheManager using Dataset.cache or Dataset.persist high-level operators.
CacheManager uses the
cachedData internal registry to manage cached structured queries as
CachedData with InMemoryRelation leaf logical operators.
CachedData is added when
CacheManager is requested to:
CachedData is removed when
CacheManager is requested to:
CachedData are removed (cleared) when
CacheManager is requested to clearCache
Re-Caching By Path¶
recacheByPath( spark: SparkSession, resourcePath: String): Unit recacheByPath( spark: SparkSession, resourcePath: Path, fs: FileSystem): Unit
recacheByPath is used when:
CatalogImplis requested to refreshByPath
- InsertIntoHadoopFsRelationCommand command is executed
lookupAndRefresh( plan: LogicalPlan, fs: FileSystem, qualifiedPath: Path): Boolean
refreshFileIndexIfNecessary( fileIndex: FileIndex, fs: FileSystem, qualifiedPath: Path): Boolean
refreshFileIndexIfNecessary is used when
CacheManager is requested to lookupAndRefresh.
Looking Up CachedData¶
lookupCachedData( query: Dataset[_]): Option[CachedData] lookupCachedData( plan: LogicalPlan): Option[CachedData]
lookupCachedData is used when:
- Dataset.storageLevel basic action is used
CatalogImplis requested to isCached
CacheManageris requested to cacheQuery and useCachedData
uncacheQuery( query: Dataset[_], cascade: Boolean, blocking: Boolean = true): Unit uncacheQuery( spark: SparkSession, plan: LogicalPlan, cascade: Boolean, blocking: Boolean): Unit
uncacheQuery is used when:
- Dataset.unpersist basic action is used
DropTableCommandand TruncateTableCommand logical commands are executed
CatalogImplis requested to uncache and refresh a table or view, dropTempView and dropGlobalTempView
cacheQuery( query: Dataset[_], tableName: Option[String] = None, storageLevel: StorageLevel = MEMORY_AND_DISK): Unit
cacheQuery adds the analyzed logical plan of the input Dataset to the cachedData internal registry of cached queries.
cacheQuery requests the
Dataset for the analyzed logical plan and creates a InMemoryRelation with the following:
- spark.sql.inMemoryColumnarStorage.compressed configuration property
- spark.sql.inMemoryColumnarStorage.batchSize configuration property
- Optimized physical query plan (after requesting
SessionStateto execute the analyzed logical plan)
- Statistics of the analyzed query plan
cacheQuery then creates a
CachedData (for the analyzed query plan and the
InMemoryRelation) and adds it to the cachedData internal registry.
If the input
query has already been cached,
cacheQuery simply prints out the following WARN message to the logs and exits (i.e. does nothing but prints out the WARN message):
Asked to cache already cached data.
cacheQuery is used when:
- Dataset.persist basic action is used
CatalogImplis requested to cache and refresh a table or view in-memory
clearCache takes every
CachedData from the cachedData internal registry and requests it for the InMemoryRelation to access the CachedRDDBuilder.
clearCache requests the
CachedRDDBuilder to clearCache.
In the end,
clearCache removes all
CachedData entries from the cachedData internal registry.
clearCache is used when
CatalogImpl is requested to clear the cache.
recacheByCondition( spark: SparkSession, condition: LogicalPlan => Boolean): Unit
recacheByCondition is used when
CacheManager is requested to uncache a structured query, recacheByPlan, and recacheByPath.
Re-Caching By Logical Plan¶
recacheByPlan( spark: SparkSession, plan: LogicalPlan): Unit
recacheByPlan is used when InsertIntoDataSourceCommand logical command is executed.
Replacing Segments of Logical Query Plan With Cached Data¶
useCachedData( plan: LogicalPlan): LogicalPlan
useCachedData traverses the given logical query plan down (parent operators first, children later) and replaces them with cached representation (i.e. InMemoryRelation) if found.
useCachedData does this operator substitution for SubqueryExpression expressions, too.
useCachedData skips IgnoreCachedData commands (and leaves them unchanged).
useCachedData is used (recursively) when
QueryExecution is requested for a logical query plan with cached data.
ALL logging level for
org.apache.spark.sql.execution.CacheManager logger to see what happens inside.
Add the following line to
Refer to Logging.