CommandUtils¶
CommandUtils is a helper class for logical commands to manage table statistics.
Analyzing Table¶
analyzeTable(
  sparkSession: SparkSession,
  tableIdent: TableIdentifier,
  noScan: Boolean): Unit
analyzeTable requests the SessionCatalog for the table metadata.
analyzeTable branches off based on the type of the table: a view and the other types.
Views¶
For CatalogTableType.VIEWs, analyzeTable requests the CacheManager to lookupCachedData. If available and the given noScan flag is disabled, analyzeTable requests the table to count the number of rows (that materializes the underlying columnar RDD).
Other Table Types¶
For other types, analyzeTable calculateTotalSize for the table. With the given noScan flag disabled, analyzeTable creates a DataFrame for the table and counts the number of rows (that triggers a Spark job). In case the table stats have changed, analyzeTable requests the SessionCatalog to alterTableStats.
Usage¶
analyzeTable is used when the following commands are executed:
Updating Existing Table Statistics¶
updateTableStats(
  sparkSession: SparkSession,
  table: CatalogTable): Unit
updateTableStats updates the table statistics of the input CatalogTable (only if the statistics are available in the metastore already).
updateTableStats requests SessionCatalog to alterTableStats with the <
Important
updateTableStats uses spark.sql.statistics.size.autoUpdate.enabled property to auto-update table statistics and can be expensive (and slow down data change commands) if the total number of files of a table is very large.
updateTableStats is used when the following logical commands are executed:
- AlterTableAddPartitionCommand
- AlterTableDropPartitionCommand
- AlterTableSetLocationCommand
- CreateDataSourceTableAsSelectCommand
- InsertIntoHadoopFsRelationCommand
- InsertIntoHiveTable
- LoadDataCommand
- TruncateTableCommand
compareAndGetNewStats¶
compareAndGetNewStats(
  oldStats: Option[CatalogStatistics],
  newTotalSize: BigInt,
  newRowCount: Option[BigInt]): Option[CatalogStatistics]
compareAndGetNewStats...FIXME
compareAndGetNewStats is used when:
- AnalyzePartitionCommand is executed
- CommandUtilsis requested to analyze a table
computeColumnStats¶
computeColumnStats(
  sparkSession: SparkSession,
  relation: LogicalPlan,
  columns: Seq[Attribute]): (Long, Map[Attribute, ColumnStat])
computeColumnStats...FIXME
computeColumnStats is used when:
- CacheManageris requested to analyzeColumnCacheQuery
- AnalyzeColumnCommandlogical command is requested to analyzeColumnInCatalog
computePercentiles¶
computePercentiles(
  attributesToAnalyze: Seq[Attribute],
  sparkSession: SparkSession,
  relation: LogicalPlan): AttributeMap[ArrayData]
computePercentiles...FIXME
Logging¶
Enable ALL logging level for org.apache.spark.sql.execution.command.CommandUtils logger to see what happens inside.
Add the following line to conf/log4j2.properties:
logger.CommandUtils.name = org.apache.spark.sql.execution.command.CommandUtils
logger.CommandUtils.level = all
Refer to Logging.