CommandUtils — Utilities for Table Statistics

CommandUtils is a helper class that logical commands use to manage table statistics.

analyzeTable

analyzeTable(
  sparkSession: SparkSession,
  tableIdent: TableIdentifier,
  noScan: Boolean): Unit

analyzeTable requests the SessionCatalog for the table metadata.

analyzeTable branches off based on the type of the table: views and all other table types.

For CatalogTableType.VIEWs, analyzeTable requests the CacheManager to lookupCachedData. If the view is cached and the given noScan flag is disabled, analyzeTable counts the number of rows of the cached relation (which materializes the underlying columnar RDD).

For other types, analyzeTable calculateTotalSize for the table. With the given noScan flag disabled, analyzeTable creates a DataFrame for the table and counts the number of rows (which triggers a Spark job). If the table statistics have changed, analyzeTable requests the SessionCatalog to alterTableStats.
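For illustration, this code path is what an ANALYZE TABLE SQL statement ultimately exercises. A minimal spark-shell sketch (the table name t1 is hypothetical; assumes a live SparkSession available as spark):

```scala
// Table name `t1` is hypothetical; assumes a live SparkSession (`spark`).
spark.range(5).write.saveAsTable("t1")

// NOSCAN computes size-only statistics (the noScan flag is enabled)...
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS NOSCAN")

// ...while omitting NOSCAN also counts rows (the noScan flag is disabled),
// which triggers a Spark job.
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS")

// The recorded statistics appear in the Statistics row of DESCRIBE EXTENDED.
spark.sql("DESCRIBE EXTENDED t1").show(truncate = false)
```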

analyzeTable is used when:

Logging

Enable ALL logging level for org.apache.spark.sql.execution.command.CommandUtils logger to see what happens inside.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.sql.execution.command.CommandUtils=ALL

Refer to Logging.


Updating Existing Table Statistics

updateTableStats(
  sparkSession: SparkSession,
  table: CatalogTable): Unit

updateTableStats updates the table statistics of the input CatalogTable (only if the statistics are available in the metastore already).

updateTableStats requests the SessionCatalog to alterTableStats with the recalculated total size of the table (when the spark.sql.statistics.size.autoUpdate.enabled property is turned on) or with empty statistics (which effectively removes the recorded statistics completely).

Important

updateTableStats uses the spark.sql.statistics.size.autoUpdate.enabled property to auto-update table statistics, which can be expensive (and slow down data-changing commands) if the total number of files of a table is very large.
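The property can be flipped at runtime through the regular configuration interface. A sketch (the table name t1 is hypothetical; assumes a live SparkSession available as spark):

```scala
// Table name `t1` is hypothetical; assumes a live SparkSession (`spark`).
// With the property on, data-changing commands keep sizeInBytes up to date...
spark.conf.set("spark.sql.statistics.size.autoUpdate.enabled", true)

// ...at the cost of re-scanning the table's files after every such command,
// which can be slow for tables with very many files.
spark.sql("INSERT INTO t1 VALUES (42)")
```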

Note

updateTableStats uses the SparkSession to access the current SessionState, which it then uses to access the session-scoped SessionCatalog.

updateTableStats is used when:

Calculating Total Size of Table (with Partitions)

calculateTotalSize(
  sessionState: SessionState,
  catalogTable: CatalogTable): BigInt

calculateTotalSize calculateLocationSize for the entire input CatalogTable (when it has no partitions defined) or for all its partitions (listed through the session-scoped SessionCatalog).

Note

calculateTotalSize uses the input SessionState to access the SessionCatalog.
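The branching above can be sketched in plain Scala, with the per-location sizes stubbed out as plain numbers (in the real code each comes from calculateLocationSize; all names here are hypothetical):

```scala
// Sketch of calculateTotalSize's branching. `partitionSizes` holds the
// precomputed size of each partition location; `tableLocationSize` is the
// size of the table's own location, computed lazily only when needed.
def totalSize(partitionSizes: Seq[BigInt], tableLocationSize: => BigInt): BigInt =
  if (partitionSizes.isEmpty) tableLocationSize // non-partitioned table
  else partitionSizes.sum                       // sum over all partitions
```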

calculateTotalSize is used when:

Calculating Total File Size Under Path

calculateLocationSize(
  sessionState: SessionState,
  identifier: TableIdentifier,
  locationUri: Option[URI]): Long

calculateLocationSize reads the hive.exec.stagingdir configuration property for the staging directory (.hive-staging by default).

You should see the following INFO message in the logs:

Starting to calculate the total file size under path [locationUri].

calculateLocationSize calculates the sum of the length of all the files under the input locationUri.

Note

calculateLocationSize uses Hadoop's FileSystem.getFileStatus and FileStatus.getLen to access the file and the length of the file (in bytes), respectively.

In the end, you should see the following INFO message in the logs:

It took [durationInMs] ms to calculate the total file size under path [locationUri].
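As a local-filesystem analogue of this logic, the following sketch sums the lengths of all files under a path while skipping staging directories (the real implementation uses Hadoop's FileSystem API instead of java.io.File; the function name is hypothetical):

```scala
import java.io.File

// Sum of the lengths of all files under `dir`, skipping directories whose
// name starts with the staging-directory prefix (".hive-staging" by default).
// A local-filesystem sketch only; CommandUtils uses Hadoop's FileSystem API.
def totalFileSize(dir: File, stagingDirPrefix: String = ".hive-staging"): Long =
  if (dir.isFile) dir.length
  else if (dir.getName.startsWith(stagingDirPrefix)) 0L
  else Option(dir.listFiles)                       // null when not listable
    .map(_.map(totalFileSize(_, stagingDirPrefix)).sum)
    .getOrElse(0L)
```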

calculateLocationSize is used when: