CommandUtils — Utilities for Table Statistics

CommandUtils is a helper class that logical commands use to manage table statistics.

analyzeTable

analyzeTable(
  sparkSession: SparkSession,
  tableIdent: TableIdentifier,
  noScan: Boolean): Unit

analyzeTable requests the SessionCatalog for the table metadata.

analyzeTable branches off based on the type of the table: views and all other table types.

For CatalogTableType.VIEWs, analyzeTable requests the CacheManager to lookupCachedData. If the view is cached and the given noScan flag is disabled, analyzeTable counts the number of rows of the cached relation (which materializes the underlying columnar RDD).

For other types, analyzeTable calculateTotalSize for the table. With the given noScan flag disabled, analyzeTable creates a DataFrame for the table and counts the number of rows (which triggers a Spark job). If the table statistics have changed, analyzeTable requests the SessionCatalog to alterTableStats.
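For illustration, this code path is what an ANALYZE TABLE SQL statement ultimately exercises. A minimal spark-shell sketch (the table name t1 is hypothetical; assumes a live SparkSession available as spark):

```scala
// Table name `t1` is hypothetical; assumes a live SparkSession (`spark`).
spark.range(5).write.saveAsTable("t1")

// NOSCAN computes size-only statistics (the noScan flag is enabled)...
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS NOSCAN")

// ...while omitting NOSCAN also counts rows (the noScan flag is disabled),
// which triggers a Spark job.
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS")

// The recorded statistics appear in the Statistics row of DESCRIBE EXTENDED.
spark.sql("DESCRIBE EXTENDED t1").show(truncate = false)
```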

analyzeTable is used when:

Logging

Enable ALL logging level for org.apache.spark.sql.execution.command.CommandUtils logger to see what happens inside.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.sql.execution.command.CommandUtils=ALL

Refer to Logging.


Updating Existing Table Statistics

updateTableStats(
  sparkSession: SparkSession,
  table: CatalogTable): Unit

updateTableStats updates the table statistics of the input CatalogTable (only if the statistics are available in the metastore already).

updateTableStats requests the SessionCatalog to alterTableStats with the recalculated total size of the table (when the spark.sql.statistics.size.autoUpdate.enabled property is turned on) or with empty statistics (which effectively removes the recorded statistics completely).

Important

updateTableStats uses the spark.sql.statistics.size.autoUpdate.enabled property to auto-update table statistics, which can be expensive (and slow down data-changing commands) if the total number of files of a table is very large.
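The property can be flipped at runtime through the regular configuration interface. A sketch (the table name t1 is hypothetical; assumes a live SparkSession available as spark):

```scala
// Table name `t1` is hypothetical; assumes a live SparkSession (`spark`).
// With the property on, data-changing commands keep sizeInBytes up to date...
spark.conf.set("spark.sql.statistics.size.autoUpdate.enabled", true)

// ...at the cost of re-scanning the table's files after every such command,
// which can be slow for tables with very many files.
spark.sql("INSERT INTO t1 VALUES (42)")
```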

Note

updateTableStats uses the SparkSession to access the current SessionState, which it then uses to access the session-scoped SessionCatalog.

updateTableStats is used when:

Calculating Total Size of Table (with Partitions)

calculateTotalSize(
  sessionState: SessionState,
  catalogTable: CatalogTable): BigInt

calculateTotalSize calculateLocationSize for the entire input CatalogTable (when it has no partitions defined) or for all its partitions (listed through the session-scoped SessionCatalog).

Note

calculateTotalSize uses the input SessionState to access the SessionCatalog.
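The branching above can be sketched in plain Scala, with the per-location sizes stubbed out as plain numbers (in the real code each comes from calculateLocationSize; all names here are hypothetical):

```scala
// Sketch of calculateTotalSize's branching. `partitionSizes` holds the
// precomputed size of each partition location; `tableLocationSize` is the
// size of the table's own location, computed lazily only when needed.
def totalSize(partitionSizes: Seq[BigInt], tableLocationSize: => BigInt): BigInt =
  if (partitionSizes.isEmpty) tableLocationSize // non-partitioned table
  else partitionSizes.sum                       // sum over all partitions
```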

calculateTotalSize is used when:

Calculating Total File Size Under Path

calculateLocationSize(
  sessionState: SessionState,
  identifier: TableIdentifier,
  locationUri: Option[URI]): Long

calculateLocationSize reads the hive.exec.stagingdir configuration property for the staging directory (.hive-staging by default).

You should see the following INFO message in the logs:

Starting to calculate the total file size under path [locationUri].

calculateLocationSize calculates the sum of the length of all the files under the input locationUri.

Note

calculateLocationSize uses Hadoop's FileSystem.getFileStatus and FileStatus.getLen to access the file and the length of the file (in bytes), respectively.

In the end, you should see the following INFO message in the logs:

It took [durationInMs] ms to calculate the total file size under path [locationUri].
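As a local-filesystem analogue of this logic, the following sketch sums the lengths of all files under a path while skipping staging directories (the real implementation uses Hadoop's FileSystem API instead of java.io.File; the function name is hypothetical):

```scala
import java.io.File

// Sum of the lengths of all files under `dir`, skipping directories whose
// name starts with the staging-directory prefix (".hive-staging" by default).
// A local-filesystem sketch only; CommandUtils uses Hadoop's FileSystem API.
def totalFileSize(dir: File, stagingDirPrefix: String = ".hive-staging"): Long =
  if (dir.isFile) dir.length
  else if (dir.getName.startsWith(stagingDirPrefix)) 0L
  else Option(dir.listFiles)                       // null when not listable
    .map(_.map(totalFileSize(_, stagingDirPrefix)).sum)
    .getOrElse(0L)
```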

calculateLocationSize is used when: