Statistics¶
Statistics holds the following estimates of a logical operator:
- Total (output) size (in bytes)
- Estimated number of rows
- Column Statistics (column (equi-height) histograms)
Note
Cost statistics, plan statistics and query statistics are synonyms and are used interchangeably.
Statistics is created when:
- CatalogStatistics is requested to convert metastore statistics
- DataSourceV2Relation, DataSourceV2ScanRelation, ExternalRDD, LocalRelation, LogicalRDD, LogicalRelation, Range, OneRowRelation logical operators are requested to computeStats
- AggregateEstimation and JoinEstimation are requested to estimate
- SizeInBytesOnlyStatsPlanVisitor is executed
- QueryStageExec physical operator is requested to computeStats
- DetermineTableStats logical resolution rule is executed
Row Count¶
The row count estimate is used by the CostBasedJoinReorder logical optimization as part of Cost-Based Optimization.
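A minimal sketch of switching on the optimizations that consume row counts. It assumes a spark-shell session or a local SparkSession; both configuration properties are disabled by default in the Spark releases this was checked against (Spark 3.x), so without them CostBasedJoinReorder has no effect.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell, getOrCreate returns the existing `spark`
val spark = SparkSession.builder().master("local[*]").appName("cbo-demo").getOrCreate()

// Enable Cost-Based Optimization so that row counts and column statistics are used
spark.conf.set("spark.sql.cbo.enabled", "true")

// Enable cost-based join reordering (the CostBasedJoinReorder optimization)
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")

println(spark.conf.get("spark.sql.cbo.enabled"))
println(spark.conf.get("spark.sql.cbo.joinReorder.enabled"))
```

Row counts are only available for tables that have been analyzed, so enabling these properties alone does not change plans over tables without statistics.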
Statistics and CatalogStatistics¶
CatalogStatistics is a "subset" of all possible Statistics (as there is no concept of attributes in a metastore).
CatalogStatistics are statistics stored in an external catalog (usually a Hive metastore) and are often referred to as Hive statistics, while Statistics represents the Spark statistics.
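The difference can be seen side by side in the following sketch. It assumes a local SparkSession and a throwaway table named stats_demo (a hypothetical name used only for illustration), and it reads the catalog entry through spark.sessionState, which is an unstable developer API.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.TableIdentifier

// Assumption: a local session; in spark-shell, getOrCreate returns the existing `spark`
val spark = SparkSession.builder().master("local[*]").appName("catalog-stats-demo").getOrCreate()

// Hypothetical demo table (written to the local warehouse directory)
spark.range(100).write.mode("overwrite").saveAsTable("stats_demo")
spark.sql("ANALYZE TABLE stats_demo COMPUTE STATISTICS")

// CatalogStatistics: what the catalog stores for the table (an Option, empty if never computed)
val catalogStats = spark.sessionState.catalog
  .getTableMetadata(TableIdentifier("stats_demo"))
  .stats
println(catalogStats)

// Statistics: what the optimizer sees for a query over that table
val planStats = spark.table("stats_demo").queryExecution.optimizedPlan.stats
println(planStats)
```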
Accessing Statistics of Logical Operator¶
Statistics of a logical plan are available using the stats property.
val q = spark.range(5).hint("broadcast").join(spark.range(1), "id")
val plan = q.queryExecution.optimizedPlan
val stats = plan.stats
scala> :type stats
org.apache.spark.sql.catalyst.plans.logical.Statistics
scala> println(stats.simpleString)
sizeInBytes=213.0 B, hints=none
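Beyond simpleString, the individual estimates can be read directly from the Statistics instance. The following is a sketch that assumes a local SparkSession; the values printed depend on the Spark version and configuration, so no output is shown.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell, getOrCreate returns the existing `spark`
val spark = SparkSession.builder().master("local[*]").appName("stats-fields-demo").getOrCreate()

val q = spark.range(5).hint("broadcast").join(spark.range(1), "id")
val stats = q.queryExecution.optimizedPlan.stats

// Total (output) size in bytes; always defined
val sizeInBytes: BigInt = stats.sizeInBytes

// Estimated number of rows; None when no estimate is available
val rowCount: Option[BigInt] = stats.rowCount

println(s"sizeInBytes=$sizeInBytes rowCount=${rowCount.getOrElse("unknown")}")

// Column statistics keyed by output attribute; may be empty
stats.attributeStats.foreach { case (attr, colStat) =>
  println(s"${attr.name}: $colStat")
}
```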
Note
Use the ANALYZE TABLE COMPUTE STATISTICS SQL command to compute total size and row count statistics of a table.
Note
Use the ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS SQL command to generate column (equi-height) histograms of a table.
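As a sketch, both commands can be run against a throwaway table. The table name analyze_demo is hypothetical, and a local SparkSession is assumed. Histograms are only collected when spark.sql.statistics.histogram.enabled is true (it is false by default), so the example sets it first.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell, getOrCreate returns the existing `spark`
val spark = SparkSession.builder().master("local[*]").appName("analyze-demo").getOrCreate()

// Without this property FOR COLUMNS still computes column statistics, but no histograms
spark.conf.set("spark.sql.statistics.histogram.enabled", "true")

// Hypothetical demo table (written to the local warehouse directory)
spark.range(1000).write.mode("overwrite").saveAsTable("analyze_demo")

// Table-level statistics: total size and row count
spark.sql("ANALYZE TABLE analyze_demo COMPUTE STATISTICS")

// Column-level statistics (including a histogram) for the id column
spark.sql("ANALYZE TABLE analyze_demo COMPUTE STATISTICS FOR COLUMNS id")

// Show the column statistics stored in the catalog
spark.sql("DESCRIBE EXTENDED analyze_demo id").show(false)
```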
Textual Representation¶
toString: String
toString gives a textual representation of the Statistics.
import org.apache.spark.sql.catalyst.plans.logical.Statistics
val stats = Statistics(sizeInBytes = 10, rowCount = Some(20))
scala> println(stats)
Statistics(sizeInBytes=10.0 B, rowCount=20)