StatisticsCollection¶
StatisticsCollection is an abstraction of statistics collectors.
Contract¶
Data Schema¶
dataSchema: StructType
Schema (StructType) of the data files
Used when:
StatisticsCollection is requested for the statCollectionSchema and the statsSchema
Maximum Number of Indexed Columns¶
numIndexedCols: Int
Maximum number of leaf columns to collect stats on
Used when:
StatisticsCollection is requested for the statCollectionSchema and to collectStats
SparkSession¶
spark: SparkSession
Used when:
StatisticsCollection is requested for the statsCollector and the statsSchema
Implementations¶
statsSchema¶
statsSchema: StructType
statsSchema...FIXME
statsSchema is used when:
DataSkippingReaderBase is requested for getStatsColumnOpt and withStatsInternal0
statsCollector Column¶
statsCollector: Column
statsCollector takes the value of the spark.databricks.io.skipping.stringPrefixLength configuration property.
statsCollector creates a Column with stats name to be a struct of the following:
count(*) as numRecords
collectStats as minValues
collectStats as maxValues
collectStats as nullCount
Lazy Value
statsCollector is a Scala lazy value to guarantee that the code to initialize it is executed once only (when accessed for the first time) and the computed value never changes afterwards.
Learn more in the Scala Language Specification.
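A minimal, self-contained illustration of the lazy-value behavior described above (the names are only for the demo):

```scala
// A Scala lazy val runs its initializer at most once, on first access,
// and caches the computed value afterwards.
object LazyDemo {
  var initCount = 0

  lazy val statsCollector: String = {
    initCount += 1 // executed only on the very first access
    "stats"
  }

  def main(args: Array[String]): Unit = {
    statsCollector
    statsCollector
    println(initCount) // 1
  }
}
```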
statsCollector is used when:
OptimisticTransactionImpl is requested for the stats collector column (of the table at the snapshot within this transaction)
TransactionalWrite is requested to writeFiles
StatisticsCollection is requested for statsSchema
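A hedged sketch (plain Scala collections, not Spark) of the shape of the statistics struct that statsCollector builds per file: numRecords, minValues, maxValues and nullCount. The field names mirror the list above; the aggregation logic here is only illustrative.

```scala
// Per-file statistics, mirroring the struct fields named in the doc.
case class Stats(
    numRecords: Long,
    minValues: Map[String, Int],
    maxValues: Map[String, Int],
    nullCount: Map[String, Long])

object StatsDemo {
  // Each row is column name -> optional value (None models a SQL NULL).
  def collect(rows: Seq[Map[String, Option[Int]]]): Stats = {
    val cols = rows.flatMap(_.keys).distinct
    def values(c: String): Seq[Int] = rows.flatMap(_.getOrElse(c, None))
    Stats(
      numRecords = rows.size.toLong, // count(*) as numRecords
      minValues = cols.flatMap(c => values(c).minOption.map(c -> _)).toMap,
      maxValues = cols.flatMap(c => values(c).maxOption.map(c -> _)).toMap,
      nullCount =
        cols.map(c => c -> rows.count(_.getOrElse(c, None).isEmpty).toLong).toMap)
  }

  def main(args: Array[String]): Unit = {
    val stats = collect(Seq(
      Map("a" -> Some(3), "b" -> Some(1)),
      Map("a" -> Some(5), "b" -> None)))
    println(stats.numRecords)     // 2
    println(stats.minValues("a")) // 3
    println(stats.nullCount("b")) // 1
  }
}
```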
collectStats¶
collectStats(
name: String,
schema: StructType)(
function: PartialFunction[(Column, StructField), Column]): Column
collectStats...FIXME
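A hedged sketch of the call pattern that the signature above suggests: apply a partial function over (column, field) pairs and keep results only where the function is defined. StructType and StructField here are simplified stand-ins, not Spark's classes, and "columns" are plain string expressions.

```scala
// Simplified stand-ins for Spark's schema types.
sealed trait DataType
case object IntegerType extends DataType
case object StringType extends DataType
case class StructField(name: String, dataType: DataType)
case class StructType(fields: Seq[StructField])

object CollectStatsDemo {
  // Apply `function` to every (column, field) pair it is defined at,
  // gathering the resulting expressions under `name`.
  def collectStats(
      name: String,
      schema: StructType)(
      function: PartialFunction[(String, StructField), String]): (String, Seq[String]) =
    (name, schema.fields.map(f => (f.name, f)).collect(function))

  def main(args: Array[String]): Unit = {
    val schema = StructType(Seq(
      StructField("id", IntegerType),
      StructField("name", StringType)))
    // Build min() expressions for numeric leaves only; string columns are skipped
    // because the partial function is not defined for them.
    val (_, exprs) = collectStats("minValues", schema) {
      case (col, StructField(_, IntegerType)) => s"min($col)"
    }
    println(exprs) // List(min(id))
  }
}
```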
statCollectionSchema¶
statCollectionSchema: StructType
When the maximum number of leaf columns to collect stats on (numIndexedCols) is greater than or equal to 0, statCollectionSchema truncates the dataSchema. Otherwise, statCollectionSchema returns the dataSchema intact.
Lazy Value
statCollectionSchema is a Scala lazy value to guarantee that the code to initialize it is executed once only (when accessed for the first time) and the computed value never changes afterwards.
Learn more in the Scala Language Specification.
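A hedged sketch of the decision above, with the schema reduced to a flat list of leaf column names: truncate only when numIndexedCols is non-negative; a negative value means "collect stats on all columns".

```scala
object StatCollectionSchemaDemo {
  // Keep only the first numIndexedCols leaf columns when the limit is
  // non-negative; otherwise return the schema intact.
  def statCollectionSchema(dataSchema: Seq[String], numIndexedCols: Int): Seq[String] =
    if (numIndexedCols >= 0) dataSchema.take(numIndexedCols) // truncation, simplified
    else dataSchema

  def main(args: Array[String]): Unit = {
    val schema = Seq("a", "b", "c")
    println(statCollectionSchema(schema, 2))  // List(a, b)
    println(statCollectionSchema(schema, -1)) // List(a, b, c)
  }
}
```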
truncateSchema¶
truncateSchema(
schema: StructType,
indexedCols: Int): (StructType, Int)
truncateSchema...FIXME
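A hedged sketch of one way the signature above could be realized: keep the first indexedCols leaf columns, descending into nested structs, and return the truncated schema together with the number of leaves kept. The types are simplified stand-ins for Spark's StructType and StructField, and the recursion is illustrative rather than Delta Lake's actual implementation.

```scala
// Simplified stand-ins for Spark's schema types; a StructType may nest.
sealed trait DataType
case object IntegerType extends DataType
case class StructField(name: String, dataType: DataType)
case class StructType(fields: Seq[StructField]) extends DataType

object TruncateSchemaDemo {
  // Returns the schema truncated to the first `indexedCols` leaf columns
  // and the number of leaves actually kept.
  def truncateSchema(schema: StructType, indexedCols: Int): (StructType, Int) = {
    var remaining = indexedCols
    val kept = schema.fields.flatMap {
      case _ if remaining <= 0 => None
      case f @ StructField(_, nested: StructType) =>
        // Recurse into nested structs, counting their leaves against the budget.
        val (truncated, used) = truncateSchema(nested, remaining)
        remaining -= used
        if (truncated.fields.nonEmpty) Some(f.copy(dataType = truncated)) else None
      case f =>
        remaining -= 1
        Some(f)
    }
    (StructType(kept), indexedCols - remaining)
  }

  def main(args: Array[String]): Unit = {
    val schema = StructType(Seq(
      StructField("a", IntegerType),
      StructField("s", StructType(Seq(
        StructField("x", IntegerType),
        StructField("y", IntegerType)))),
      StructField("b", IntegerType)))
    val (truncated, leaves) = truncateSchema(schema, 2)
    println(leaves)                       // 2
    println(truncated.fields.map(_.name)) // fields a and s (s keeps only x)
  }
}
```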
Recomputing Statistics¶
recompute(
spark: SparkSession,
deltaLog: DeltaLog,
predicates: Seq[Expression] = Seq(Literal(true)),
fileFilter: AddFile => Boolean = af => true): Unit
recompute...FIXME
recompute seems unused.