
CatalogStatistics

CatalogStatistics holds table and partition statistics that are stored in an external catalog.

Creating Instance

CatalogStatistics takes the following to be created:

  • Physical total size (in bytes)
  • Estimated number of rows (row count)
  • Column statistics (column names and their CatalogColumnStats)
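The three properties above map directly onto the constructor parameters. As a minimal self-contained sketch (the stand-in classes below only mirror the shape of the real classes in `org.apache.spark.sql.catalyst.catalog`; `CatalogColumnStat` is reduced to a placeholder):

```scala
// Placeholder for the real CatalogColumnStat (per-column statistics).
case class CatalogColumnStat(distinctCount: Option[BigInt] = None)

// Stand-in mirroring the shape of the real CatalogStatistics case class.
case class CatalogStatistics(
    sizeInBytes: BigInt,                                    // physical total size, in bytes
    rowCount: Option[BigInt] = None,                        // estimated number of rows
    colStats: Map[String, CatalogColumnStat] = Map.empty) { // column name -> column stats

  // Readable textual representation: sizeInBytes always, rowCount only when defined.
  def simpleString: String =
    s"$sizeInBytes bytes" + rowCount.map(n => s", $n rows").getOrElse("")
}
```

Note that only `sizeInBytes` is mandatory; row count and column statistics are optional and may be absent until, e.g., ANALYZE TABLE has been run.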

CatalogStatistics is created when:

CatalogStatistics and Statistics

CatalogStatistics is a "subset" of the statistics in Statistics (as there are no concepts of attributes and broadcast hint in a metastore).

CatalogStatistics is often stored in a Hive metastore and is then referred to as Hive statistics, while Statistics represents the Spark statistics.

Readable Textual Representation

simpleString: String

simpleString is the following text, with sizeInBytes always included and rowCount only when defined:

[sizeInBytes] bytes, [rowCount] rows

scala> :type stats
Option[org.apache.spark.sql.catalyst.catalog.CatalogStatistics]

scala> stats.map(_.simpleString).foreach(println)
714 bytes, 2 rows

simpleString is used when:

Converting Metastore Statistics to Spark Statistics

toPlanStats(
  planOutput: Seq[Attribute],
  cboEnabled: Boolean): Statistics

toPlanStats converts the table statistics (from an external metastore) to Spark statistics.

With cost-based optimization enabled and row count statistics available, toPlanStats creates a Statistics with the estimated total (output) size, row count and column statistics.

Otherwise, when cost-based optimization is disabled or no row count is available, toPlanStats creates a Statistics with just the mandatory sizeInBytes.
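The two branches above can be sketched as follows. This is a simplified, self-contained sketch: the `Statistics` stand-in omits per-attribute statistics, and the real toPlanStats additionally estimates the output size from the plan's attributes and converts colStats into per-attribute statistics.

```scala
object ToPlanStatsSketch {
  // Simplified stand-in for Spark's Statistics
  // (the real class lives in org.apache.spark.sql.catalyst.plans.logical).
  case class Statistics(sizeInBytes: BigInt, rowCount: Option[BigInt] = None)

  // Sketch of the decision logic: CBO enabled + row count available vs. size-only.
  def toPlanStats(
      sizeInBytes: BigInt,
      rowCount: Option[BigInt],
      cboEnabled: Boolean): Statistics =
    if (cboEnabled && rowCount.isDefined)
      Statistics(sizeInBytes, rowCount) // CBO path: estimated size and row count
    else
      Statistics(sizeInBytes)           // fallback: just the mandatory sizeInBytes
}
```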

Note

toPlanStats does the reverse of HiveExternalCatalog.statsToProperties.

toPlanStats is used when:

Demo

scala> :type spark.sessionState.catalog
org.apache.spark.sql.catalyst.catalog.SessionCatalog
// Using higher-level interface to access CatalogStatistics
// Make sure that you ran ANALYZE TABLE (as described above)
val db = spark.catalog.currentDatabase
val tableName = "t1"
val metadata = spark.sharedState.externalCatalog.getTable(db, tableName)
val stats = metadata.stats
scala> :type stats
Option[org.apache.spark.sql.catalyst.catalog.CatalogStatistics]
val tid = spark.sessionState.sqlParser.parseTableIdentifier(tableName)
val metadata = spark.sessionState.catalog.getTempViewOrPermanentTableMetadata(tid)
val stats = metadata.stats
scala> :type stats
Option[org.apache.spark.sql.catalyst.catalog.CatalogStatistics]