
CacheManager is a registry of cached structured queries. When QueryExecution is requested for a logical query plan with cached data, the matching parts of the plan are replaced with the corresponding InMemoryRelation logical operators (their cached representation).

Accessing CacheManager

CacheManager is shared across SparkSessions through SharedState.

val spark: SparkSession = ...
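A minimal sketch of reaching the shared CacheManager, assuming a Spark version in which SparkSession.sharedState is publicly accessible (2.2 or later). The object name, app name and local master are illustrative only and not part of the original page.

```scala
import org.apache.spark.sql.SparkSession

object AccessCacheManager {
  // Illustrative local session; in spark-shell the `spark` value already exists.
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("cache-manager-demo")
    .getOrCreate()

  // CacheManager lives in SharedState, so a session created with
  // newSession() is expected to see the very same instance.
  def sharedAcrossSessions: Boolean = {
    val first = spark.sharedState.cacheManager
    val second = spark.newSession().sharedState.cacheManager
    first eq second
  }
}
```

Calling AccessCacheManager.sharedAcrossSessions compares the two references by identity, which is the property "shared across SparkSessions" describes.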

Dataset.cache and persist Operators

A structured query (as Dataset) can be cached and registered with CacheManager using Dataset.cache or Dataset.persist high-level operators.
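As a hedged illustration, the following sketch caches two queries and then asks CacheManager whether it knows about them. It assumes Spark on the classpath and a local master; the object name CacheOperators is hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheOperators {
  // Illustrative local session; in spark-shell the `spark` value already exists.
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("cache-operators-demo")
    .getOrCreate()

  // Returns whether each of the two queries is registered with CacheManager.
  def demo(): (Boolean, Boolean) = {
    // cache() uses the default storage level
    val cached = spark.range(5).cache()
    // persist() lets the caller pick the storage level explicitly
    val persisted = spark.range(10).persist(StorageLevel.MEMORY_ONLY)

    val cacheManager = spark.sharedState.cacheManager
    (cacheManager.lookupCachedData(cached).isDefined,
     cacheManager.lookupCachedData(persisted).isDefined)
  }
}
```

Note that registration happens as soon as cache or persist is called; the data itself is materialized only when an action runs.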

Cached Queries

cachedData: LinkedList[CachedData]

CacheManager uses the cachedData internal registry to manage cached structured queries as CachedData with InMemoryRelation leaf logical operators.

A new CachedData is added when CacheManager is requested to cache a structured query (cacheQuery).

A CachedData is removed when CacheManager is requested to uncache a structured query (uncacheQuery).

All CachedData are removed (cleared) when CacheManager is requested to clearCache.

Re-Caching By Path

recacheByPath(
  spark: SparkSession,
  resourcePath: String): Unit
recacheByPath(
  spark: SparkSession,
  resourcePath: Path,
  fs: FileSystem): Unit


recacheByPath is used when:


lookupAndRefresh(
  plan: LogicalPlan,
  fs: FileSystem,
  qualifiedPath: Path): Boolean



refreshFileIndexIfNecessary(
  fileIndex: FileIndex,
  fs: FileSystem,
  qualifiedPath: Path): Boolean


refreshFileIndexIfNecessary is used when CacheManager is requested to lookupAndRefresh.

Looking Up CachedData

lookupCachedData(
  query: Dataset[_]): Option[CachedData]
lookupCachedData(
  plan: LogicalPlan): Option[CachedData]


lookupCachedData is used when:

Un-caching Dataset

uncacheQuery(
  query: Dataset[_],
  cascade: Boolean,
  blocking: Boolean = true): Unit
uncacheQuery(
  spark: SparkSession,
  plan: LogicalPlan,
  cascade: Boolean,
  blocking: Boolean): Unit


uncacheQuery is used when:

Caching Query

cacheQuery(
  query: Dataset[_],
  tableName: Option[String] = None,
  storageLevel: StorageLevel = MEMORY_AND_DISK): Unit

cacheQuery adds the analyzed logical plan of the input Dataset to the cachedData internal registry of cached queries.

Internally, cacheQuery requests the Dataset for the analyzed logical plan and creates an InMemoryRelation for it.

cacheQuery then creates a CachedData (for the analyzed query plan and the InMemoryRelation) and adds it to the cachedData internal registry.

If the input query has already been cached, cacheQuery prints out the following WARN message to the logs and does nothing else:

Asked to cache already cached data.
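The add-or-warn behaviour described above can be sketched with a toy model in plain Scala. This is not Spark's implementation: ToyCachedData and ToyCacheManager are hypothetical names, a "plan" is just a string, plan equivalence is plain string equality, and the WARN message is collected in a buffer instead of being logged.

```scala
import scala.collection.mutable

// Toy stand-ins: a plan is a string, the cached representation is a label.
final case class ToyCachedData(plan: String, cachedRepresentation: String)

final class ToyCacheManager {
  // Plays the role of the cachedData internal registry
  private val cachedData = mutable.ArrayBuffer.empty[ToyCachedData]

  // Collected instead of logged so the behaviour is easy to inspect
  val warnings: mutable.ArrayBuffer[String] = mutable.ArrayBuffer.empty[String]

  def lookupCachedData(plan: String): Option[ToyCachedData] =
    cachedData.find(_.plan == plan)

  // Registers the plan unless an equal plan is already registered
  def cacheQuery(plan: String): Unit =
    if (lookupCachedData(plan).nonEmpty) {
      warnings += "Asked to cache already cached data."
    } else {
      cachedData += ToyCachedData(plan, s"InMemoryRelation($plan)")
    }

  def size: Int = cachedData.size
}
```

Caching the same toy plan twice leaves a single entry in the registry and records one warning, which mirrors the "does nothing else" behaviour for an already cached query.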

cacheQuery is used when:

Clearing Cache

clearCache(): Unit

clearCache takes every CachedData from the cachedData internal registry and requests it for the InMemoryRelation to access the CachedRDDBuilder. clearCache requests the CachedRDDBuilder to clearCache.

In the end, clearCache removes all CachedData entries from the cachedData internal registry.

clearCache is used when CatalogImpl is requested to clear the cache.
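A small sketch of clearing the cache through the public catalog API, assuming Spark on the classpath and a local master. The object name ClearCacheDemo is hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object ClearCacheDemo {
  // Illustrative local session; in spark-shell the `spark` value already exists.
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("clear-cache-demo")
    .getOrCreate()

  // Returns (registered before clearing, registered after clearing).
  def demo(): (Boolean, Boolean) = {
    val query = spark.range(3).cache()
    val cacheManager = spark.sharedState.cacheManager

    val before = cacheManager.lookupCachedData(query).isDefined
    // Catalog.clearCache is the user-facing entry point for clearing all cached data
    spark.catalog.clearCache()
    val after = cacheManager.lookupCachedData(query).isDefined

    (before, after)
  }
}
```

Because clearing removes every entry, it also affects queries cached by other sessions that share the same CacheManager.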

Re-Caching Query

recacheByCondition(
  spark: SparkSession,
  condition: LogicalPlan => Boolean): Unit


recacheByCondition is used when CacheManager is requested to uncache a structured query, recacheByPlan, and recacheByPath.

Re-Caching By Logical Plan

recacheByPlan(
  spark: SparkSession,
  plan: LogicalPlan): Unit


recacheByPlan is used when InsertIntoDataSourceCommand logical command is executed.

Replacing Segments of Logical Query Plan With Cached Data

useCachedData(
  plan: LogicalPlan): LogicalPlan

useCachedData traverses the given logical query plan top-down (parent operators first, children later) and replaces every operator that has a cached representation with that representation (i.e. an InMemoryRelation). useCachedData does the same substitution in the query plans of SubqueryExpression expressions.

useCachedData skips IgnoreCachedData commands (and leaves them unchanged).

useCachedData is used (recursively) when QueryExecution is requested for a logical query plan with cached data.
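The substitution can be observed from the outside by inspecting a query's plan with cached data. The following sketch assumes Spark on the classpath and a local master; the object name UseCachedDataDemo is hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.columnar.InMemoryRelation

object UseCachedDataDemo {
  // Illustrative local session; in spark-shell the `spark` value already exists.
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("use-cached-data-demo")
    .getOrCreate()

  // Returns true when the derived query's plan contains an InMemoryRelation.
  def demo(): Boolean = {
    val base = spark.range(5).cache()
    // The derived query contains the cached query as a sub-plan
    val derived = base.where("id > 2")

    // withCachedData is the logical plan after cached segments were substituted
    derived.queryExecution.withCachedData
      .collectFirst { case relation: InMemoryRelation => relation }
      .isDefined
  }
}
```

Only the cached segment is expected to be replaced; the filter on top of it stays a regular operator in the plan.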


Enable ALL logging level for org.apache.spark.sql.execution.CacheManager logger to see what happens inside.

Add the following line to conf/

Refer to Logging.