TahoeBatchFileIndex

Creating Instance

TahoeBatchFileIndex takes the following to be created:

TahoeBatchFileIndex is created when:

DeltaLog is requested for a DataFrame for given AddFiles
DeleteCommand and UpdateCommand are executed (and DeltaCommand is requested for a HadoopFsRelation)

TahoeBatchFileIndex is given an Action Type identifier when created:

batch or streaming when DeltaLog is requested for a batch or streaming DataFrame for given AddFiles, respectively
delete for DeleteCommand
update for UpdateCommand

Important

Action Type does not seem to be used anywhere.

tableVersion: Long

tableVersion is part of the TahoeFileIndex abstraction.

tableVersion is always the version of the Snapshot.

matchingFiles(
  partitionFilters: Seq[Expression],
  dataFilters: Seq[Expression],
  keepStats: Boolean = false): Seq[AddFile]

matchingFiles is part of the TahoeFileIndex abstraction.

matchingFiles uses filterFileList (that gives a DataFrame) and collects the AddFiles (using Dataset.collect).
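
The filter-then-collect idea can be sketched in plain Scala without Spark. This is only an illustration under simplifying assumptions: AddFileModel and PartitionFilter are hypothetical stand-ins (not Delta Lake types), partition filters are modeled as predicates over partition values rather than Catalyst Expressions, and dataFilters and keepStats are left out of the sketch.

```scala
// Hypothetical, simplified stand-in for Delta Lake's AddFile action (illustration only)
final case class AddFileModel(
    path: String,
    partitionValues: Map[String, String],
    size: Long)

object MatchingFilesSketch {
  // Assumption: a partition filter is modeled as a predicate over a file's
  // partition values (the actual method takes Seq[Expression] instead)
  type PartitionFilter = Map[String, String] => Boolean

  // Keeps only the files whose partition values satisfy every filter
  def matchingFiles(
      addFiles: Seq[AddFileModel],
      partitionFilters: Seq[PartitionFilter]): Seq[AddFileModel] =
    addFiles.filter(f => partitionFilters.forall(p => p(f.partitionValues)))
}
```

With two files partitioned by a hypothetical date column, a filter such as `_.get("date").contains("2024-01-01")` would keep only the file whose partition value matches. An empty filter list keeps every file.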

inputFiles: Array[String]

inputFiles is part of the FileIndex (Spark SQL) abstraction.

inputFiles returns the paths of all the given AddFiles.
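
As a rough sketch of the mapping described above, assuming a hypothetical PathOnlyFile stand-in that carries just the path of an AddFile (the actual implementation may also resolve each path against the table's data directory, which is not modeled here):

```scala
// Hypothetical stand-in that keeps only the path of an AddFile (illustration only)
final case class PathOnlyFile(path: String)

object InputFilesSketch {
  // Returns the paths of all the given files, in the order they were given
  def inputFiles(addFiles: Seq[PathOnlyFile]): Array[String] =
    addFiles.map(_.path).toArray
}
```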

partitionSchema: StructType

partitionSchema is part of the FileIndex (Spark SQL) abstraction.

partitionSchema requests the Snapshot for the metadata that is in turn requested for the partitionSchema.

sizeInBytes: Long

sizeInBytes is part of the FileIndex (Spark SQL) abstraction.

sizeInBytes is the sum of the sizes of all the given AddFiles.
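
The summation can be illustrated with a minimal sketch, assuming a hypothetical SizedFile stand-in that carries the path and size (in bytes) of an AddFile:

```scala
// Hypothetical stand-in with the path and size (in bytes) of an AddFile (illustration only)
final case class SizedFile(path: String, size: Long)

object SizeInBytesSketch {
  // Adds up the sizes of all the given files; an empty list gives 0
  def sizeInBytes(addFiles: Seq[SizedFile]): Long =
    addFiles.map(_.size).sum
}
```

In this sketch, two files of 100 and 250 bytes add up to 350.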