Skip to content

Statistics

Statistics holds the following estimates of a logical operator:

Note

Cost statistics, plan statistics or query statistics are synonyms and used interchangeably.

Statistics is created when:

Row Count

Row Count estimate is used in CostBasedJoinReorder logical optimization for Cost-Based Optimization.

Statistics and CatalogStatistics

CatalogStatistics is a "subset" of all possible Statistics (as there are no concepts of attributes in metastore).

CatalogStatistics are statistics stored in an external catalog (usually a Hive metastore) and are often referred as Hive statistics while Statistics represents the Spark statistics.

Accessing Statistics of Logical Operator

Statistics of a logical plan are available using stats property.

val q = spark.range(5).hint("broadcast").join(spark.range(1), "id")
val plan = q.queryExecution.optimizedPlan
val stats = plan.stats

scala> :type stats
org.apache.spark.sql.catalyst.plans.logical.Statistics

scala> println(stats.simpleString)
sizeInBytes=213.0 B, hints=none

Note

Use ANALYZE TABLE COMPUTE STATISTICS SQL command to compute total size and row count statistics of a table.

Textual Representation

toString: String

toString gives textual representation of the Statistics.

import org.apache.spark.sql.catalyst.plans.logical.Statistics
import org.apache.spark.sql.catalyst.plans.logical.HintInfo
val stats = Statistics(sizeInBytes = 10, rowCount = Some(20))

scala> println(stats)
Statistics(sizeInBytes=10.0 B, rowCount=20)