CatalogStatistics¶

CatalogStatistics represents the table- and column-level statistics (that are stored in SessionCatalog).

Creating Instance¶

CatalogStatistics takes the following to be created:

Physical total size (in bytes)
Estimated number of rows (row count)
Column statistics (column names and their CatalogColumnStats)

CatalogStatistics is created when:

AnalyzeColumnCommand and AlterTableAddPartitionCommand logical commands are executed (and alter table statistics in SessionCatalog)
CommandUtils is requested to update existing table statistics and current statistics (if changed)
HiveExternalCatalog is requested to convert Hive properties to Spark statistics
HiveClientImpl is requested to readHiveStats
PruneHiveTablePartitions logical optimization is executed (and updates the table metadata)
PruneFileSourcePartitions logical optimization is executed

Table Statistics¶

CatalogStatistics is created with table statistics when:

HiveExternalCatalog is requested for table statistics (from a Hive metastore)
HiveClientImpl is requested to readHiveStats

CatalogStatistics is created to update table statistics for the following logical optimizations:

CatalogStatistics is created to alter table statistics (directly or indirectly using CommandUtils) for the following logical commands:

Logical Command	SQL Statement
AnalyzeColumnCommand	ANALYZE TABLE FOR COLUMNS
AnalyzePartitionCommand	ANALYZE TABLE PARTITION
AnalyzeTableCommand	ANALYZE TABLE
AnalyzeTablesCommand	`ANALYZE TABLES`
`AlterTableAddPartitionCommand`	`ALTER TABLE ADD PARTITION`
`AlterTableDropPartitionCommand`	`ALTER TABLE DROP PARTITION`
`AlterTableSetLocationCommand`	`ALTER TABLE SET LOCATION`
CreateDataSourceTableAsSelectCommand
InsertIntoHadoopFsRelationCommand
InsertIntoHiveTable
LoadDataCommand
TruncateTableCommand

CatalogStatistics and Statistics¶

CatalogStatistics are a "subset" of the statistics in Statistics (as there are no concepts of attributes and broadcast hints in Hive metastore).

CatalogStatistics are stored in a Hive metastore and are referred as Hive statistics while Statistics are Spark statistics.

Readable Textual Representation¶

simpleString: String

simpleString is the following text (with the sizeInBytes and the optional rowCount if defined):

[sizeInBytes] bytes, [rowCount] rows

simpleString is used when:

CatalogTablePartition is requested to toLinkedHashMap
CatalogTable is requested to toLinkedHashMap

Converting Metastore Statistics to Spark Statistics¶

toPlanStats(
  planOutput: Seq[Attribute],
  cboEnabled: Boolean): Statistics

toPlanStats converts the table statistics (from an external metastore) to Spark statistics.

With cost-based optimization enabled and row count statistics available, toPlanStats creates a Statistics with the estimated total (output) size, row count and column statistics.

Otherwise (when cost-based optimization is disabled), toPlanStats creates a Statistics with just the mandatory sizeInBytes.

Note

toPlanStats does the reverse of HiveExternalCatalog.statsToProperties.

toPlanStats is used when:

HiveTableRelation and LogicalRelation are requested for statistics

Demo¶

// Using higher-level interface to access CatalogStatistics
// Make sure that you ran ANALYZE TABLE (as described above)
val db = spark.catalog.currentDatabase
val tableName = "t1"
val metadata = spark.sharedState.externalCatalog.getTable(db, tableName)
val stats = metadata.stats

scala> :type stats
Option[org.apache.spark.sql.catalyst.catalog.CatalogStatistics]

val tid = spark.sessionState.sqlParser.parseTableIdentifier(tableName)
val metadata = spark.sessionState.catalog.getTempViewOrPermanentTableMetadata(tid)
val stats = metadata.stats
assert(stats.isInstanceOf[Option[org.apache.spark.sql.catalyst.catalog.CatalogStatistics]])

stats.map(_.simpleString).foreach(println)
// 714 bytes, 2 rows