CatalogStatistics¶
CatalogStatistics are table and partition statistics that are stored in an external catalog.
Creating Instance¶
CatalogStatistics takes the following to be created:
- Physical total size (in bytes)
- Estimated number of rows (row count)
- Column statistics (column names and their CatalogColumnStats)
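The three inputs above can be pictured as a case class along the following lines. This is a simplified, self-contained sketch rather than Spark's own definition: CatalogStatisticsSketch and ColumnStatSketch are hypothetical stand-ins, the fields of ColumnStatSketch are illustrative only, and treating the row count and column statistics as optional is an assumption (consistent with the row count being described as optional elsewhere on this page).

```scala
// Hypothetical stand-in for CatalogColumnStat (only two illustrative fields).
final case class ColumnStatSketch(
    distinctCount: Option[BigInt] = None,
    nullCount: Option[BigInt] = None)

// Hypothetical stand-in that mirrors the three inputs listed above.
final case class CatalogStatisticsSketch(
    sizeInBytes: BigInt,                      // physical total size (in bytes)
    rowCount: Option[BigInt] = None,          // estimated number of rows
    colStats: Map[String, ColumnStatSketch] = Map.empty) // column name -> column statistics

// Size-only statistics (no row count, no column statistics).
val sizeOnly = CatalogStatisticsSketch(sizeInBytes = BigInt(714))

// Statistics with a row count and statistics for a single column.
val full = CatalogStatisticsSketch(
  sizeInBytes = BigInt(714),
  rowCount = Some(BigInt(2)),
  colStats = Map("id" -> ColumnStatSketch(distinctCount = Some(BigInt(2)))))
```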
CatalogStatistics is created when:
- AnalyzeColumnCommand and AlterTableAddPartitionCommand logical commands are executed (and store statistics in ExternalCatalog)
- CommandUtils utility is used to update existing table statistics and current statistics (if changed)
- HiveExternalCatalog is requested to convert properties to Spark statistics
- HiveClientImpl utility is used to readHiveStats
- PruneHiveTablePartitions is requested to updateTableMeta
- PruneFileSourcePartitions logical optimization is executed
CatalogStatistics and Statistics¶
CatalogStatistics are a "subset" of the statistics in Statistics (as there are no concepts of attributes and broadcast hints in a metastore).
CatalogStatistics are often stored in a Hive metastore and are referred to as Hive statistics, while Statistics are the Spark statistics.
Readable Textual Representation¶
simpleString: String
simpleString is the following text (with the sizeInBytes and the optional rowCount if defined):
[sizeInBytes] bytes, [rowCount] rows
scala> :type stats
Option[org.apache.spark.sql.catalyst.catalog.CatalogStatistics]
scala> stats.map(_.simpleString).foreach(println)
714 bytes, 2 rows
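The text format above can be reproduced with a few lines of plain Scala. The following is a minimal sketch, not Spark's code: simpleStringSketch is a hypothetical helper that takes the two values as parameters instead of reading them from a CatalogStatistics.

```scala
// Hypothetical helper: renders the size and, only when defined, the row count.
def simpleStringSketch(sizeInBytes: BigInt, rowCount: Option[BigInt]): String = {
  val rows = rowCount.map(n => s", $n rows").getOrElse("")
  s"$sizeInBytes bytes$rows"
}

println(simpleStringSketch(BigInt(714), Some(BigInt(2)))) // 714 bytes, 2 rows
println(simpleStringSketch(BigInt(714), None))            // 714 bytes
```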
simpleString is used when:
- CatalogTablePartition is requested to toLinkedHashMap
- CatalogTable is requested to toLinkedHashMap
Converting Metastore Statistics to Spark Statistics¶
toPlanStats(
planOutput: Seq[Attribute],
cboEnabled: Boolean): Statistics
toPlanStats converts the table statistics (from an external metastore) to Spark statistics.
With cost-based optimization enabled and row count statistics available, toPlanStats creates a Statistics with the estimated total (output) size, row count and column statistics.
Otherwise (when cost-based optimization is disabled or the row count is not available), toPlanStats creates a Statistics with just the mandatory sizeInBytes.
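The two branches described above can be sketched in plain Scala as follows. This is an illustration under stated assumptions, not Spark's implementation: CatalogStatsSketch, PlanStatsSketch, toPlanStatsSketch and DefaultColumnSize are hypothetical names, columns are identified by name only (instead of Attribute), column statistics are reduced to an average value length, and the output size is estimated as row count times the sum of column widths. Spark's actual size estimation is more involved.

```scala
// Hypothetical, simplified stand-ins (not Spark's classes).
final case class CatalogStatsSketch(
    sizeInBytes: BigInt,
    rowCount: Option[BigInt] = None,
    colAvgLen: Map[String, Long] = Map.empty) // column name -> average value length (bytes)

final case class PlanStatsSketch(
    sizeInBytes: BigInt,
    rowCount: Option[BigInt] = None,
    attributeStats: Map[String, Long] = Map.empty)

// Assumed fallback width (in bytes) for a column without statistics.
val DefaultColumnSize = 8L

def toPlanStatsSketch(
    stats: CatalogStatsSketch,
    planOutput: Seq[String],
    cboEnabled: Boolean): PlanStatsSketch = stats.rowCount match {
  case Some(rows) if cboEnabled =>
    // Keep statistics only for the columns the plan actually outputs.
    val attrStats = planOutput.flatMap(c => stats.colAvgLen.get(c).map(c -> _)).toMap
    // Estimate the output size as row count * estimated row width.
    val rowWidth = planOutput.map(c => attrStats.getOrElse(c, DefaultColumnSize)).sum
    PlanStatsSketch(rows * BigInt(rowWidth), Some(rows), attrStats)
  case _ =>
    // CBO disabled or no row count: propagate the size only.
    PlanStatsSketch(stats.sizeInBytes)
}

val analyzed = CatalogStatsSketch(BigInt(714), Some(BigInt(2)), Map("id" -> 4L))
// CBO on: 2 rows * (4 bytes for id + 8 bytes assumed for name) = 24 bytes
val withCbo = toPlanStatsSketch(analyzed, Seq("id", "name"), cboEnabled = true)
// CBO off: only the catalog's sizeInBytes (714) is carried over
val withoutCbo = toPlanStatsSketch(analyzed, Seq("id", "name"), cboEnabled = false)
```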
Note
toPlanStats does the reverse of HiveExternalCatalog.statsToProperties.
toPlanStats is used when:
- HiveTableRelation and LogicalRelation are requested for statistics
Demo¶
scala> :type spark.sessionState.catalog
org.apache.spark.sql.catalyst.catalog.SessionCatalog
// Using higher-level interface to access CatalogStatistics
// Make sure that you ran ANALYZE TABLE on the table first
val db = spark.catalog.currentDatabase
val tableName = "t1"
val metadata = spark.sharedState.externalCatalog.getTable(db, tableName)
val stats = metadata.stats
scala> :type stats
Option[org.apache.spark.sql.catalyst.catalog.CatalogStatistics]
val tid = spark.sessionState.sqlParser.parseTableIdentifier(tableName)
val metadata = spark.sessionState.catalog.getTempViewOrPermanentTableMetadata(tid)
val stats = metadata.stats
scala> :type stats
Option[org.apache.spark.sql.catalyst.catalog.CatalogStatistics]