Skip to content

HDFSBackedStateStoreProvider

HDFSBackedStateStoreProvider is a StateStoreProvider that uses a Hadoop DFS-compatible file system for versioned state checkpointing.

HDFSBackedStateStoreProvider is the default StateStoreProvider per the spark.sql.streaming.stateStore.providerClass internal configuration property.

Creating Instance

HDFSBackedStateStoreProvider takes no arguments to be created.

HDFSBackedStateStoreProvider is created when:

getMetricsForProvider

getMetricsForProvider(): Map[String, Long]

getMetricsForProvider returns the following performance metrics:


getMetricsForProvider is used when:

Loading Specified Version of State (Store) For Update

getStore(
  version: Long): StateStore

getStore is part of the StateStoreProvider abstraction.


getStore getLoadedMapForStore for the given version.

getStore prints out the following INFO message to the logs:

Retrieved version [version] of [this] for update

In the end, getStore creates a new HDFSBackedStateStore for the given version and the new state.

Supported Custom Metrics

supportedCustomMetrics: Seq[StateStoreCustomMetric]

supportedCustomMetrics is part of the StateStoreProvider abstraction.

loadedMapCacheHitCount

count of cache hit on states cache in provider

loadedMapCacheMissCount

count of cache miss on states cache in provider

stateOnCurrentVersionSizeBytes

estimated size of state only on current version

Logging

Enable ALL logging level for org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider logger to see what happens inside.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider=ALL

Refer to Logging.