CommandUtils

CommandUtils is a helper class for logical commands to manage table statistics.

Analyzing Table

analyzeTable(
  sparkSession: SparkSession,
  tableIdent: TableIdentifier,
  noScan: Boolean): Unit

analyzeTable requests the SessionCatalog for the table metadata.

analyzeTable branches off based on the table type: views versus all other table types.

Views

For CatalogTableType.VIEWs, analyzeTable requests the CacheManager to lookupCachedData. If cached data is available and the given noScan flag is disabled, analyzeTable requests the table to count the number of rows (which materializes the underlying columnar RDD).

Other Table Types

For all other table types, analyzeTable calculates the total size of the table (calculateTotalSize). With the given noScan flag disabled, analyzeTable creates a DataFrame for the table and counts the number of rows (which triggers a Spark job). If the table statistics have changed, analyzeTable requests the SessionCatalog to alterTableStats.
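The branching above can be sketched as follows. This is a simplified model with hypothetical stand-in types (View, OtherTable, the boolean flags), not Spark's actual CatalogTableType or catalog APIs; it only illustrates when a row count is computed:

```scala
// Simplified sketch of analyzeTable's control flow (hypothetical types,
// not Spark's CatalogTableType or SessionCatalog APIs).
sealed trait TableType
case object View extends TableType
case object OtherTable extends TableType

// Returns the new row count to record, if any.
def analyzeSketch(
    tableType: TableType,
    noScan: Boolean,
    isCached: Boolean,
    countRows: () => Long): Option[Long] = tableType match {
  case View =>
    // Views are only analyzed when their data is cached and a scan is allowed
    if (isCached && !noScan) Some(countRows()) else None
  case OtherTable =>
    // Total size is always calculated; row count only without NOSCAN
    if (!noScan) Some(countRows()) else None
}
```

With NOSCAN (noScan enabled), only the cheap size calculation happens; counting rows requires a scan and therefore a Spark job.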

Usage

analyzeTable is used when the following commands are executed:

Updating Existing Table Statistics

updateTableStats(
  sparkSession: SparkSession,
  table: CatalogTable): Unit

updateTableStats updates the table statistics of the input CatalogTable (only if the statistics are available in the metastore already).


updateTableStats requests the SessionCatalog to alterTableStats with the newly-calculated total size of the table (when the spark.sql.statistics.size.autoUpdate.enabled property is enabled) or with empty statistics (which effectively removes the recorded statistics completely).
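The decision can be sketched as follows. This is a minimal model with assumed names (newStatsAction is hypothetical), not Spark's actual code; it only captures which of the two outcomes is requested:

```scala
// Hypothetical sketch of updateTableStats' decision:
// - Some(Some(size)) : record freshly-calculated size statistics (auto-update on)
// - Some(None)       : clear the previously-recorded statistics
// - None             : nothing to do (no stats recorded, auto-update off)
def newStatsAction(
    hasRecordedStats: Boolean,
    autoUpdateEnabled: Boolean,
    calculateTotalSize: () => BigInt): Option[Option[BigInt]] =
  if (autoUpdateEnabled) Some(Some(calculateTotalSize()))
  else if (hasRecordedStats) Some(None)
  else None
```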

Important

updateTableStats uses the spark.sql.statistics.size.autoUpdate.enabled property to auto-update table statistics. Auto-updating can be expensive (and can slow down data-changing commands) when the total number of files of a table is very large.
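To enable the automatic size update (it is disabled by default), set the property in your Spark configuration, e.g. in conf/spark-defaults.conf:

```
spark.sql.statistics.size.autoUpdate.enabled  true
```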


updateTableStats is used when the following logical commands are executed:

compareAndGetNewStats

compareAndGetNewStats(
  oldStats: Option[CatalogStatistics],
  newTotalSize: BigInt,
  newRowCount: Option[BigInt]): Option[CatalogStatistics]

compareAndGetNewStats...FIXME
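The general idea can be sketched as follows, using a stand-in Stats type instead of Spark's CatalogStatistics (a simplified assumption, not the exact implementation): new statistics are returned only when the new total size or row count differs from the recorded values, so that unchanged statistics are not needlessly written back.

```scala
// Hypothetical, simplified sketch of compareAndGetNewStats.
// Stats is a stand-in for Spark's CatalogStatistics.
case class Stats(sizeInBytes: BigInt, rowCount: Option[BigInt] = None)

def compareAndGetNewStats(
    oldStats: Option[Stats],
    newTotalSize: BigInt,
    newRowCount: Option[BigInt]): Option[Stats] = {
  // -1 marks "not recorded" so any valid new value counts as a change
  val oldTotalSize = oldStats.map(_.sizeInBytes).getOrElse(BigInt(-1))
  val oldRowCount = oldStats.flatMap(_.rowCount).getOrElse(BigInt(-1))
  var newStats: Option[Stats] = None
  if (newTotalSize >= 0 && newTotalSize != oldTotalSize) {
    newStats = Some(Stats(sizeInBytes = newTotalSize))
  }
  // A changed row count is attached to the size update, or recorded on its own
  newRowCount.filter(rc => rc >= 0 && rc != oldRowCount).foreach { rc =>
    newStats = newStats
      .map(_.copy(rowCount = Some(rc)))
      .orElse(Some(Stats(sizeInBytes = oldTotalSize, rowCount = Some(rc))))
  }
  newStats
}
```

Returning None when nothing changed lets the caller skip the alterTableStats round-trip to the metastore.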


compareAndGetNewStats is used when:

computeColumnStats

computeColumnStats(
  sparkSession: SparkSession,
  relation: LogicalPlan,
  columns: Seq[Attribute]): (Long, Map[Attribute, ColumnStat])

computeColumnStats...FIXME


computeColumnStats is used when:

computePercentiles

computePercentiles(
  attributesToAnalyze: Seq[Attribute],
  sparkSession: SparkSession,
  relation: LogicalPlan): AttributeMap[ArrayData]

computePercentiles...FIXME

Logging

Enable ALL logging level for org.apache.spark.sql.execution.command.CommandUtils logger to see what happens inside.

Add the following line to conf/log4j2.properties:

logger.CommandUtils.name = org.apache.spark.sql.execution.command.CommandUtils
logger.CommandUtils.level = all

Refer to Logging.