StatisticsCollection¶
StatisticsCollection is an abstraction of statistics collectors.
Contract¶
Data Schema¶
dataSchema: StructType
Schema (StructType) of the data files
Used when:

- StatisticsCollection is requested for the statCollectionSchema and the statsSchema
Maximum Number of Indexed Columns¶
numIndexedCols: Int
Maximum number of leaf columns to collect stats on
Used when:

- StatisticsCollection is requested for statCollectionSchema and to collectStats
SparkSession¶
spark: SparkSession
Used when:

- StatisticsCollection is requested for statsCollector and statsSchema
Implementations¶
statsSchema¶
statsSchema: StructType
statsSchema...FIXME

statsSchema is used when:

- DataSkippingReaderBase is requested for getStatsColumnOpt and withStatsInternal0
statsCollector Column¶
statsCollector: Column
statsCollector takes the value of the spark.databricks.io.skipping.stringPrefixLength configuration property.
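The property caps how many characters of a string value end up in min/max statistics. A minimal plain-Scala sketch of why truncation has to treat minimums and maximums differently (the helper names and the padding character are assumptions for illustration, not Delta Lake's actual implementation):

```scala
object StringPrefixStats {
  // Assumed tie-breaker character: compares greater than ordinary text,
  // so a padded prefix stays a valid upper bound.
  private val MaxChar = '\uFFFD'

  // A plain prefix is still a correct lower bound for min stats.
  def truncateMin(s: String, prefixLength: Int): String =
    s.take(prefixLength)

  // For max stats the prefix alone could compare LESS than the original
  // value, so pad it to keep it a valid upper bound.
  def truncateMax(s: String, prefixLength: Int): String =
    if (s.length <= prefixLength) s
    else s.take(prefixLength) + MaxChar
}
```

Without the padding, a file containing `"delta lake"` with a 5-character max stat of `"delta"` could be wrongly skipped for a filter like `col >= "delta l"`.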
statsCollector creates a Column with the stats name to be a struct of the following:

- count(*) as numRecords
- collectStats as minValues
- collectStats as maxValues
- collectStats as nullCount
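To make the shape of that struct concrete, here is a plain-Scala illustration (deliberately not Spark code) of the four fields computed over a single nullable `id` column; the case class and helper are made up for this sketch:

```scala
// Illustrative stand-in for the per-file stats struct (numRecords,
// minValues, maxValues, nullCount) that statsCollector produces.
case class FileStats(
    numRecords: Long,
    minValues: Map[String, Any],
    maxValues: Map[String, Any],
    nullCount: Map[String, Long])

def statsFor(ids: Seq[Option[Int]]): FileStats = {
  val present = ids.flatten
  FileStats(
    numRecords = ids.size,                                 // count(*)
    minValues  = Map("id" -> present.min),                 // per-column minimum
    maxValues  = Map("id" -> present.max),                 // per-column maximum
    nullCount  = Map("id" -> ids.count(_.isEmpty).toLong)) // nulls per column
}
```

Data skipping can then compare a query predicate against `minValues`/`maxValues` to decide whether a file can possibly contain matching rows.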
Lazy Value
statsCollector
is a Scala lazy value to guarantee that the code to initialize it is executed once only (when accessed for the first time) and the computed value never changes afterwards.
Learn more in the Scala Language Specification.
statsCollector is used when:

- OptimisticTransactionImpl is requested for the stats collector column (of the table at the snapshot within this transaction)
- TransactionalWrite is requested to writeFiles
- StatisticsCollection is requested for statsSchema
collectStats¶
collectStats(
name: String,
schema: StructType)(
function: PartialFunction[(Column, StructField), Column]): Column
collectStats
...FIXME
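The signature suggests a depth-first walk over the schema that applies the partial function at every leaf column it is defined at, collecting the results under a named struct. A simplified sketch under that assumption (MiniType/MiniStruct are stand-ins, not Spark's StructType/Column):

```scala
// Stand-in schema types for illustration only.
sealed trait MiniType
case object IntType extends MiniType
case class MiniStruct(fields: Seq[(String, MiniType)]) extends MiniType

// collectStats-style walk: visit every leaf column (by dotted path),
// apply `function` where it is defined, and gather results under `name`.
def collectStats(
    name: String,
    schema: MiniStruct)(
    function: PartialFunction[(String, MiniType), String]): (String, Seq[String]) = {
  def walk(prefix: String, tpe: MiniType): Seq[String] = tpe match {
    case MiniStruct(fields) =>
      fields.flatMap { case (n, t) =>
        walk(if (prefix.isEmpty) n else s"$prefix.$n", t)
      }
    case leaf =>
      // Partial function: leaves it is not defined at are simply skipped.
      if (function.isDefinedAt((prefix, leaf))) Seq(function((prefix, leaf)))
      else Nil
  }
  (name, walk("", schema))
}
```

The same traversal can then be reused with different partial functions to build the minValues, maxValues, and nullCount branches of the stats struct.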
statCollectionSchema¶
statCollectionSchema: StructType
When the maximum number of leaf columns to collect stats on (numIndexedCols) is greater than or equal to 0, statCollectionSchema truncates the dataSchema (truncateSchema). Otherwise, statCollectionSchema returns the dataSchema intact.
Lazy Value
statCollectionSchema
is a Scala lazy value to guarantee that the code to initialize it is executed once only (when accessed for the first time) and the computed value never changes afterwards.
Learn more in the Scala Language Specification.
truncateSchema¶
truncateSchema(
schema: StructType,
indexedCols: Int): (StructType, Int)
truncateSchema
...FIXME
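Given the `(StructType, Int)` return type, a plausible reading is that truncateSchema keeps at most `indexedCols` leaf columns in depth-first order and reports how many it kept. A sketch under that assumption, again with stand-in types rather than Spark's StructType:

```scala
// Stand-in schema types for illustration only.
sealed trait MiniType
case object LeafType extends MiniType
case class MiniStruct(fields: Seq[(String, MiniType)]) extends MiniType

// truncateSchema-style: keep at most `indexedCols` leaf columns
// (depth-first), returning the truncated schema and the count kept.
def truncateSchema(schema: MiniStruct, indexedCols: Int): (MiniStruct, Int) = {
  var kept = 0
  def walk(s: MiniStruct): MiniStruct = MiniStruct(
    s.fields.flatMap {
      case (n, nested: MiniStruct) =>
        val truncated = walk(nested)
        // Drop struct fields whose leaves were all cut off.
        if (truncated.fields.nonEmpty) Some(n -> truncated) else None
      case (n, leaf) if kept < indexedCols =>
        kept += 1
        Some(n -> leaf)
      case _ => None
    })
  (walk(schema), kept)
}
```

With this shape, statCollectionSchema can simply discard the count and use the truncated schema when numIndexedCols is non-negative.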
Recomputing Statistics¶
recompute(
spark: SparkSession,
deltaLog: DeltaLog,
predicates: Seq[Expression] = Seq(Literal(true)),
fileFilter: AddFile => Boolean = af => true): Unit
recompute
...FIXME
recompute seems unused.