TahoeBatchFileIndex¶
TahoeBatchFileIndex is a TahoeFileIndexWithSnapshotDescriptor of a delta table at a given version.
Creating Instance¶
TahoeBatchFileIndex takes the following to be created (among others):
- Action Type
- AddFiles
- Snapshot
TahoeBatchFileIndex is created when:
- DeltaLog is requested for a DataFrame for given AddFiles
- DeleteCommand and UpdateCommand are executed (and DeltaCommand is requested for a HadoopFsRelation)
Action Type¶
TahoeBatchFileIndex is given an Action Type identifier when created:
- batch or streaming when DeltaLog is requested for a batch or streaming DataFrame for given AddFiles, respectively
- delete for DeleteCommand
- update for UpdateCommand
Important
Action Type does not appear to be used anywhere.
tableVersion¶
tableVersion is part of the TahoeFileIndex abstraction.
tableVersion is always the version of the Snapshot.
matchingFiles¶
matchingFiles(
  partitionFilters: Seq[Expression],
  dataFilters: Seq[Expression],
  keepStats: Boolean = false): Seq[AddFile]
matchingFiles is part of the TahoeFileIndex abstraction.
matchingFiles uses filterFileList (which gives a DataFrame) and collects the AddFiles (using Dataset.collect).
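The filter-then-collect idea can be sketched with plain Scala collections. This is a simplified illustration only, not Delta Lake's code: the AddFile case class below is a reduced stand-in for the real action, PartitionFilter is a hypothetical replacement for Catalyst Expressions, and data filters and keepStats are not modeled.

```scala
object MatchingFilesSketch {
  // Simplified stand-in for Delta Lake's AddFile action (the real one has more fields).
  final case class AddFile(
    path: String,
    partitionValues: Map[String, String],
    size: Long)

  // Hypothetical stand-in for a partition filter: a predicate over partition values.
  type PartitionFilter = Map[String, String] => Boolean

  // Keep only the files whose partition values satisfy every partition filter.
  def matchingFiles(
      addFiles: Seq[AddFile],
      partitionFilters: Seq[PartitionFilter]): Seq[AddFile] =
    addFiles.filter(f => partitionFilters.forall(p => p(f.partitionValues)))
}
```

With no partition filters, every given AddFile is returned unchanged, since forall over an empty sequence is true.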
Input Files¶
inputFiles is part of the FileIndex (Spark SQL) abstraction.
inputFiles returns the paths of all the given AddFiles.
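As a minimal sketch of the idea, assuming every AddFile path is relative to the table's root directory (the tableRoot parameter and the reduced AddFile case class are hypothetical and for illustration only):

```scala
object InputFilesSketch {
  // Simplified stand-in for Delta Lake's AddFile action.
  final case class AddFile(path: String, size: Long)

  // Assumption: every AddFile path is relative to the table root directory.
  def inputFiles(tableRoot: String, addFiles: Seq[AddFile]): Array[String] =
    addFiles.map(f => s"${tableRoot.stripSuffix("/")}/${f.path}").toArray
}
```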
Partitions¶
partitionSchema is part of the FileIndex (Spark SQL) abstraction.
partitionSchema requests the Snapshot for its metadata, which is in turn requested for the partitionSchema.
Estimated Size of Relation¶
sizeInBytes is part of the FileIndex (Spark SQL) abstraction.
sizeInBytes is the sum of the sizes of all the given AddFiles.
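The computation amounts to a single fold over the files. The sketch below uses a reduced, hypothetical AddFile case class rather than Delta Lake's real action:

```scala
object SizeInBytesSketch {
  // Simplified stand-in for Delta Lake's AddFile action.
  final case class AddFile(path: String, size: Long)

  // Total size (in bytes) of all the given files; 0 for an empty list.
  def sizeInBytes(addFiles: Seq[AddFile]): Long =
    addFiles.map(_.size).sum
}
```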