Skip to content

CatalogStatistics

CatalogStatistics represents the table- and column-level statistics (that are stored in SessionCatalog).

Creating Instance

CatalogStatistics takes the following to be created:

  • Physical total size (in bytes)
  • Estimated number of rows (row count)
  • Column statistics (column names and their CatalogColumnStats)

CatalogStatistics is created when:

Table Statistics

CatalogStatistics is created with table statistics when:

CatalogStatistics is created to update table statistics for the following logical optimizations:

CatalogStatistics is created to alter table statistics (directly or indirectly using CommandUtils) for the following logical commands:

Logical Command SQL Statement
AnalyzeColumnCommand ANALYZE TABLE FOR COLUMNS
AnalyzePartitionCommand ANALYZE TABLE PARTITION
AnalyzeTableCommand ANALYZE TABLE
AnalyzeTablesCommand ANALYZE TABLES
AlterTableAddPartitionCommand ALTER TABLE ADD PARTITION
AlterTableDropPartitionCommand ALTER TABLE DROP PARTITION
AlterTableSetLocationCommand ALTER TABLE SET LOCATION
CreateDataSourceTableAsSelectCommand
InsertIntoHadoopFsRelationCommand
InsertIntoHiveTable
LoadDataCommand
TruncateTableCommand

CatalogStatistics and Statistics

CatalogStatistics are a "subset" of the statistics in Statistics (as there are no concepts of attributes and broadcast hints in Hive metastore).

CatalogStatistics are stored in a Hive metastore and are referred as Hive statistics while Statistics are Spark statistics.

Readable Textual Representation

simpleString: String

simpleString is the following text (with the sizeInBytes and the optional rowCount if defined):

[sizeInBytes] bytes, [rowCount] rows

simpleString is used when:

Converting Metastore Statistics to Spark Statistics

toPlanStats(
  planOutput: Seq[Attribute],
  cboEnabled: Boolean): Statistics

toPlanStats converts the table statistics (from an external metastore) to Spark statistics.

With cost-based optimization enabled and row count statistics available, toPlanStats creates a Statistics with the estimated total (output) size, row count and column statistics.

Otherwise (when cost-based optimization is disabled), toPlanStats creates a Statistics with just the mandatory sizeInBytes.

Note

toPlanStats does the reverse of HiveExternalCatalog.statsToProperties.


toPlanStats is used when:

Demo

// Using higher-level interface to access CatalogStatistics
// Make sure that you ran ANALYZE TABLE (as described above)
val db = spark.catalog.currentDatabase
val tableName = "t1"
val metadata = spark.sharedState.externalCatalog.getTable(db, tableName)
val stats = metadata.stats
scala> :type stats
Option[org.apache.spark.sql.catalyst.catalog.CatalogStatistics]
val tid = spark.sessionState.sqlParser.parseTableIdentifier(tableName)
val metadata = spark.sessionState.catalog.getTempViewOrPermanentTableMetadata(tid)
val stats = metadata.stats
assert(stats.isInstanceOf[Option[org.apache.spark.sql.catalyst.catalog.CatalogStatistics]])
stats.map(_.simpleString).foreach(println)
// 714 bytes, 2 rows