CommandUtils¶
CommandUtils is a helper class for logical commands to manage table statistics.
Analyzing Table¶
analyzeTable(
sparkSession: SparkSession,
tableIdent: TableIdentifier,
noScan: Boolean): Unit
analyzeTable requests the SessionCatalog for the table metadata.
analyzeTable then branches on the type of the table: views and all other table types.
Views¶
For CatalogTableType.VIEWs, analyzeTable requests the CacheManager to lookupCachedData. If cached data is available and the given noScan flag is disabled, analyzeTable requests the table to count the number of rows (which materializes the underlying columnar RDD).
Other Table Types¶
For other types, analyzeTable calculates the total size of the table (calculateTotalSize). With the given noScan flag disabled, analyzeTable creates a DataFrame for the table and counts the number of rows (which triggers a Spark job). If the table statistics have changed, analyzeTable requests the SessionCatalog to alterTableStats.
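The two code paths can be observed from a spark-shell session. The following is a minimal sketch, assuming that the ANALYZE TABLE SQL statement is what eventually calls analyzeTable (through AnalyzeTableCommand in recent Spark versions). demo_t is a hypothetical table name and currentStats is a hypothetical helper, both introduced only for this illustration.

```scala
// Paste into spark-shell, where `spark` (a SparkSession) is already defined.
import org.apache.spark.sql.catalyst.TableIdentifier

// demo_t is a hypothetical table name used only for this illustration
spark.range(100).write.mode("overwrite").saveAsTable("demo_t")

// Hypothetical helper: reads the statistics currently stored in the session catalog
def currentStats =
  spark.sessionState.catalog.getTableMetadata(TableIdentifier("demo_t")).stats

// NOSCAN: only the total size is calculated, the rows are not counted
spark.sql("ANALYZE TABLE demo_t COMPUTE STATISTICS NOSCAN")
val afterNoScan = currentStats

// Without NOSCAN: the rows are counted as well
spark.sql("ANALYZE TABLE demo_t COMPUTE STATISTICS")
val afterScan = currentStats

println(s"NOSCAN: $afterNoScan")
println(s"full:   $afterScan")
```

Comparing the two printed values should show a row count only after the second statement, since counting rows is skipped while the noScan flag is enabled.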
Usage¶
analyzeTable is used when the following commands are executed:
Updating Existing Table Statistics¶
updateTableStats(
sparkSession: SparkSession,
table: CatalogTable): Unit
updateTableStats updates the table statistics of the input CatalogTable (only if the statistics are already available in the metastore).
updateTableStats requests SessionCatalog to alterTableStats with the <
Important
updateTableStats uses the spark.sql.statistics.size.autoUpdate.enabled property to auto-update table statistics. This can be expensive (and slow down data change commands) if the total number of files of a table is very large.
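As a small illustration, the property can be set for the current session. The snippet assumes a spark-shell session where `spark` is already defined.

```scala
// Assumes spark-shell, where `spark` (a SparkSession) is already defined.
// Turns on automatic size updates for data change commands in this session.
spark.conf.set("spark.sql.statistics.size.autoUpdate.enabled", "true")
```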
updateTableStats is used when the following logical commands are executed:
- AlterTableAddPartitionCommand
- AlterTableDropPartitionCommand
- AlterTableSetLocationCommand
- CreateDataSourceTableAsSelectCommand
- InsertIntoHadoopFsRelationCommand
- InsertIntoHiveTable
- LoadDataCommand
- TruncateTableCommand
compareAndGetNewStats¶
compareAndGetNewStats(
oldStats: Option[CatalogStatistics],
newTotalSize: BigInt,
newRowCount: Option[BigInt]): Option[CatalogStatistics]
compareAndGetNewStats ...FIXME
compareAndGetNewStats is used when:
- AnalyzePartitionCommand is executed
- CommandUtils is requested to analyze a table
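Since analyzeTable only alters the table statistics when they have changed, a helper with the signature above could plausibly work as in the following sketch. This is an illustration under stated assumptions, not a copy of the Spark source: Stats is a hypothetical stand-in for CatalogStatistics with only two fields, and the rule that "no previous value" is encoded as -1 is an assumption of this sketch.

```scala
// Hypothetical stand-in for Spark's CatalogStatistics (only the two fields used here).
final case class Stats(sizeInBytes: BigInt, rowCount: Option[BigInt] = None)

object CompareStatsSketch {
  // Returns Some(new statistics) only when the size or the row count differs
  // from the old statistics; None means "nothing to update".
  def compareAndGetNewStats(
      oldStats: Option[Stats],
      newTotalSize: BigInt,
      newRowCount: Option[BigInt]): Option[Stats] = {
    // Assumption: -1 marks "no previous value", so any real value counts as a change
    val oldTotalSize = oldStats.map(_.sizeInBytes).getOrElse(BigInt(-1))
    val oldRowCount = oldStats.flatMap(_.rowCount).getOrElse(BigInt(-1))

    // A changed, non-negative size yields fresh statistics (the old row count is dropped)
    val sizeChanged: Option[Stats] =
      if (newTotalSize >= 0 && newTotalSize != oldTotalSize) Some(Stats(newTotalSize))
      else None

    newRowCount match {
      case Some(rows) if rows >= 0 && rows != oldRowCount =>
        // Attach the new row count to the new size, or to the old size if it is unchanged
        Some(sizeChanged.getOrElse(Stats(oldTotalSize)).copy(rowCount = Some(rows)))
      case _ =>
        sizeChanged
    }
  }
}
```

With this shape, a caller such as analyzeTable would pass newRowCount = None when the noScan flag is enabled and would request alterTableStats only when the result is defined.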
computeColumnStats¶
computeColumnStats(
sparkSession: SparkSession,
relation: LogicalPlan,
columns: Seq[Attribute]): (Long, Map[Attribute, ColumnStat])
computeColumnStats ...FIXME
computeColumnStats is used when:
- CacheManager is requested to analyzeColumnCacheQuery
- AnalyzeColumnCommand logical command is requested to analyzeColumnInCatalog
computePercentiles¶
computePercentiles(
attributesToAnalyze: Seq[Attribute],
sparkSession: SparkSession,
relation: LogicalPlan): AttributeMap[ArrayData]
computePercentiles ...FIXME
Logging¶
Enable ALL logging level for the org.apache.spark.sql.execution.command.CommandUtils logger to see what happens inside.
Add the following lines to conf/log4j2.properties:
logger.CommandUtils.name = org.apache.spark.sql.execution.command.CommandUtils
logger.CommandUtils.level = all
Refer to Logging.