CatalogStatistics¶
CatalogStatistics
represents the table- and column-level statistics (that are stored in SessionCatalog).
Creating Instance¶
CatalogStatistics
takes the following to be created:
- Physical total size (in bytes)
- Estimated number of rows (row count)
- Column statistics (column names and their CatalogColumnStats)
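As a minimal sketch of the three inputs listed above, the following creates two instances directly. It assumes a spark-shell session (or any Scala REPL with spark-catalyst on the classpath) and a Spark version where the case class fields are named sizeInBytes, rowCount and colStats; the values 714 and 2 are made up for illustration.

```scala
import org.apache.spark.sql.catalyst.catalog.CatalogStatistics

// Only the physical size is mandatory; row count and column statistics are optional
val sizeOnly = CatalogStatistics(sizeInBytes = BigInt(714))

// Size and row count, still without column statistics
val withRowCount = CatalogStatistics(
  sizeInBytes = BigInt(714),
  rowCount = Some(BigInt(2)))

println(sizeOnly.rowCount)      // no row count was given
println(withRowCount.rowCount)  // the row count wrapped in an Option
println(withRowCount.colStats)  // empty map of column statistics
```

In practice Spark creates these instances itself (see the list that follows), so building one by hand is mostly useful in tests and experiments.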
CatalogStatistics
is created when:
- AnalyzeColumnCommand and AlterTableAddPartitionCommand logical commands are executed (and alter table statistics in SessionCatalog)
- CommandUtils is requested to update existing table statistics and current statistics (if changed)
- HiveExternalCatalog is requested to convert Hive properties to Spark statistics
- HiveClientImpl is requested to readHiveStats
- PruneHiveTablePartitions logical optimization is executed (and updates the table metadata)
- PruneFileSourcePartitions logical optimization is executed
Table Statistics¶
CatalogStatistics
is created with table statistics when:
- HiveExternalCatalog is requested for table statistics (from a Hive metastore)
- HiveClientImpl is requested to readHiveStats
CatalogStatistics
is created to update table statistics for the following logical optimizations:
- PruneHiveTablePartitions
- PruneFileSourcePartitions
CatalogStatistics
is created to alter table statistics (directly or indirectly using CommandUtils) for the following logical commands:
Logical Command | SQL Statement |
---|---|
AnalyzeColumnCommand | ANALYZE TABLE FOR COLUMNS |
AnalyzePartitionCommand | ANALYZE TABLE PARTITION |
AnalyzeTableCommand | ANALYZE TABLE |
AnalyzeTablesCommand | ANALYZE TABLES |
AlterTableAddPartitionCommand | ALTER TABLE ADD PARTITION |
AlterTableDropPartitionCommand | ALTER TABLE DROP PARTITION |
AlterTableSetLocationCommand | ALTER TABLE SET LOCATION |
CreateDataSourceTableAsSelectCommand | |
InsertIntoHadoopFsRelationCommand | |
InsertIntoHiveTable | |
LoadDataCommand | |
TruncateTableCommand | |
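The first rows of the table can be exercised from a spark-shell session, as sketched below. The snippet assumes that spark (a SparkSession) is predefined, as it is in spark-shell, and that no table named t1 exists yet; the table name, its columns and the inserted rows are hypothetical.

```scala
// Hypothetical two-row table to collect statistics for
spark.sql("CREATE TABLE t1 (id BIGINT, name STRING) USING parquet")
spark.sql("INSERT INTO t1 VALUES (0, 'zero'), (1, 'one')")

// AnalyzeTableCommand: table-level statistics (size in bytes and row count)
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS")

// AnalyzeColumnCommand: column-level statistics for the listed columns
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS id, name")

// The Statistics row of the output reports the table-level statistics
spark.sql("DESCRIBE EXTENDED t1").show(numRows = 100, truncate = false)
```

After these statements the table metadata in the catalog should carry a CatalogStatistics with a row count, which is what the Demo section at the end of this page reads back.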
CatalogStatistics and Statistics¶
CatalogStatistics
are a "subset" of the statistics in Statistics (as there are no concepts of attributes and broadcast hints in a Hive metastore).
CatalogStatistics
are stored in a Hive metastore and are referred to as Hive statistics while Statistics
are Spark statistics.
Readable Textual Representation¶
simpleString: String
simpleString
is the following text (with the sizeInBytes and the optional rowCount if defined):
[sizeInBytes] bytes, [rowCount] rows
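The text format can be sketched with a small standalone function. SimpleStringSketch and formatStats are hypothetical names that are not part of Spark; the function only mirrors the documented format, in which the row count part is left out when no row count is defined.

```scala
object SimpleStringSketch {
  // Hypothetical helper (not Spark API) that mirrors the documented text format
  def formatStats(sizeInBytes: BigInt, rowCount: Option[BigInt]): String = {
    // The ", N rows" suffix is added only when a row count is defined
    val rows = rowCount.map(n => s", $n rows").getOrElse("")
    s"$sizeInBytes bytes$rows"
  }
}
```

For example, SimpleStringSketch.formatStats(BigInt(714), Some(BigInt(2))) gives "714 bytes, 2 rows", while passing None for the row count gives just "714 bytes".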
simpleString
is used when:
- CatalogTablePartition is requested to toLinkedHashMap
- CatalogTable is requested to toLinkedHashMap
Converting Metastore Statistics to Spark Statistics¶
toPlanStats(
planOutput: Seq[Attribute],
cboEnabled: Boolean): Statistics
toPlanStats
converts the table statistics (from an external metastore) to Spark statistics.
With cost-based optimization enabled and row count statistics available, toPlanStats
creates a Statistics with the estimated total (output) size, row count and column statistics.
Otherwise (when cost-based optimization is disabled or the row count is not available), toPlanStats
creates a Statistics with just the mandatory sizeInBytes.
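The two branches can be illustrated with a simplified, standalone model. MetastoreStats, PlanStats and the estimatedRowSize parameter are hypothetical stand-ins rather than Spark classes: the sketch assumes the output size is estimated as the row count times a per-row size, takes that row size as an input instead of deriving it from the plan output attributes, and leaves column statistics out entirely.

```scala
object ToPlanStatsSketch {
  // Hypothetical, simplified stand-ins for CatalogStatistics and Statistics
  final case class MetastoreStats(sizeInBytes: BigInt, rowCount: Option[BigInt] = None)
  final case class PlanStats(sizeInBytes: BigInt, rowCount: Option[BigInt] = None)

  // estimatedRowSize is an assumed input of this sketch; Spark computes
  // a row size from the plan output and the column statistics instead
  def toPlanStats(
      stats: MetastoreStats,
      estimatedRowSize: BigInt,
      cboEnabled: Boolean): PlanStats =
    stats.rowCount match {
      // CBO enabled and a row count available: estimated size plus row count
      case Some(rows) if cboEnabled =>
        PlanStats(sizeInBytes = rows * estimatedRowSize, rowCount = Some(rows))
      // Otherwise: size-only statistics taken straight from the metastore
      case _ =>
        PlanStats(sizeInBytes = stats.sizeInBytes)
    }
}
```

Note that in this model a missing row count falls back to the size-only branch even when cboEnabled is true, which matches the condition stated above.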
Note
toPlanStats
does the reverse of HiveExternalCatalog.statsToProperties.
toPlanStats
is used when:
- HiveTableRelation and LogicalRelation are requested for statistics
Demo¶
// Using higher-level interface to access CatalogStatistics
// Make sure that you ran ANALYZE TABLE on the table first (see the logical commands in Table Statistics)
val db = spark.catalog.currentDatabase
val tableName = "t1"
val metadata = spark.sharedState.externalCatalog.getTable(db, tableName)
val stats = metadata.stats
// scala> :type stats
// Option[org.apache.spark.sql.catalyst.catalog.CatalogStatistics]
val tid = spark.sessionState.sqlParser.parseTableIdentifier(tableName)
val metadata = spark.sessionState.catalog.getTempViewOrPermanentTableMetadata(tid)
val stats = metadata.stats
assert(stats.isInstanceOf[Option[org.apache.spark.sql.catalyst.catalog.CatalogStatistics]])
stats.map(_.simpleString).foreach(println)
// 714 bytes, 2 rows