Statistics¶
Statistics holds the following estimates of a logical operator:
- Total (output) size (in bytes)
- Estimated number of rows
- Column Statistics (column (equi-height) histograms)
Note
Cost statistics, plan statistics or query statistics are synonyms and used interchangeably.
Statistics is created when:
CatalogStatisticsis requested to convert metastore statistics- DataSourceV2Relation, DataSourceV2ScanRelation, ExternalRDD, LocalRelation, LogicalRDD, LogicalRelation,
Range,OneRowRelationlogical operators are requested tocomputeStats AggregateEstimationand JoinEstimation are requested toestimate- SizeInBytesOnlyStatsPlanVisitor is executed
- QueryStageExec physical operator is requested to
computeStats - DetermineTableStats logical resolution rule is executed
Row Count¶
Row Count estimate is used in CostBasedJoinReorder logical optimization for Cost-Based Optimization.
Statistics and CatalogStatistics¶
CatalogStatistics is a "subset" of all possible Statistics (as there are no concepts of attributes in metastore).
CatalogStatistics are statistics stored in an external catalog (usually a Hive metastore) and are often referred as Hive statistics while Statistics represents the Spark statistics.
Accessing Statistics of Logical Operator¶
Statistics of a logical plan are available using stats property.
val q = spark.range(5).hint("broadcast").join(spark.range(1), "id")
val plan = q.queryExecution.optimizedPlan
val stats = plan.stats
scala> :type stats
org.apache.spark.sql.catalyst.plans.logical.Statistics
scala> println(stats.simpleString)
sizeInBytes=213.0 B, hints=none
Note
Use ANALYZE TABLE COMPUTE STATISTICS SQL command to compute total size and row count statistics of a table.
Note
Use ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS SQL Command to generate column (equi-height) histograms of a table.
Textual Representation¶
toString: String
toString gives textual representation of the Statistics.
import org.apache.spark.sql.catalyst.plans.logical.Statistics
import org.apache.spark.sql.catalyst.plans.logical.HintInfo
val stats = Statistics(sizeInBytes = 10, rowCount = Some(20))
scala> println(stats)
Statistics(sizeInBytes=10.0 B, rowCount=20)