CatalogStatistics¶
CatalogStatistics represents the table- and column-level statistics (that are stored in SessionCatalog).
Creating Instance¶
CatalogStatistics takes the following to be created:
- Physical total size (in bytes)
- Estimated number of rows (row count)
- Column statistics (column names and their CatalogColumnStats)
CatalogStatistics is created when:
- AnalyzeColumnCommand and
AlterTableAddPartitionCommandlogical commands are executed (and alter table statistics in SessionCatalog) CommandUtilsis requested to update existing table statistics and current statistics (if changed)HiveExternalCatalogis requested to convert Hive properties to Spark statisticsHiveClientImplis requested to readHiveStats- PruneHiveTablePartitions logical optimization is executed (and updates the table metadata)
- PruneFileSourcePartitions logical optimization is executed
Table Statistics¶
CatalogStatistics is created with table statistics when:
HiveExternalCatalogis requested for table statistics (from a Hive metastore)HiveClientImplis requested to readHiveStats
CatalogStatistics is created to update table statistics for the following logical optimizations:
CatalogStatistics is created to alter table statistics (directly or indirectly using CommandUtils) for the following logical commands:
| Logical Command | SQL Statement |
|---|---|
| AnalyzeColumnCommand | ANALYZE TABLE FOR COLUMNS |
| AnalyzePartitionCommand | ANALYZE TABLE PARTITION |
| AnalyzeTableCommand | ANALYZE TABLE |
| AnalyzeTablesCommand | ANALYZE TABLES |
AlterTableAddPartitionCommand | ALTER TABLE ADD PARTITION |
AlterTableDropPartitionCommand | ALTER TABLE DROP PARTITION |
AlterTableSetLocationCommand | ALTER TABLE SET LOCATION |
| CreateDataSourceTableAsSelectCommand | |
| InsertIntoHadoopFsRelationCommand | |
| InsertIntoHiveTable | |
| LoadDataCommand | |
| TruncateTableCommand |
CatalogStatistics and Statistics¶
CatalogStatistics are a "subset" of the statistics in Statistics (as there are no concepts of attributes and broadcast hints in Hive metastore).
CatalogStatistics are stored in a Hive metastore and are referred as Hive statistics while Statistics are Spark statistics.
Readable Textual Representation¶
simpleString: String
simpleString is the following text (with the sizeInBytes and the optional rowCount if defined):
[sizeInBytes] bytes, [rowCount] rows
simpleString is used when:
CatalogTablePartitionis requested to toLinkedHashMapCatalogTableis requested to toLinkedHashMap
Converting Metastore Statistics to Spark Statistics¶
toPlanStats(
planOutput: Seq[Attribute],
cboEnabled: Boolean): Statistics
toPlanStats converts the table statistics (from an external metastore) to Spark statistics.
With cost-based optimization enabled and row count statistics available, toPlanStats creates a Statistics with the estimated total (output) size, row count and column statistics.
Otherwise (when cost-based optimization is disabled), toPlanStats creates a Statistics with just the mandatory sizeInBytes.
Note
toPlanStats does the reverse of HiveExternalCatalog.statsToProperties.
toPlanStats is used when:
- HiveTableRelation and LogicalRelation are requested for statistics
Demo¶
// Using higher-level interface to access CatalogStatistics
// Make sure that you ran ANALYZE TABLE (as described above)
val db = spark.catalog.currentDatabase
val tableName = "t1"
val metadata = spark.sharedState.externalCatalog.getTable(db, tableName)
val stats = metadata.stats
scala> :type stats
Option[org.apache.spark.sql.catalyst.catalog.CatalogStatistics]
val tid = spark.sessionState.sqlParser.parseTableIdentifier(tableName)
val metadata = spark.sessionState.catalog.getTempViewOrPermanentTableMetadata(tid)
val stats = metadata.stats
assert(stats.isInstanceOf[Option[org.apache.spark.sql.catalyst.catalog.CatalogStatistics]])
stats.map(_.simpleString).foreach(println)
// 714 bytes, 2 rows