TahoeRemoveFileIndex¶

TahoeRemoveFileIndex is a TahoeFileIndexWithSnapshotDescriptor of RemoveFiles for changesToDF in Change Data Feed.

Creating Instance¶

TahoeRemoveFileIndex takes the following to be created:

SparkSession (Spark SQL)
Versioned RemoveFiles
DeltaLog
Path
SnapshotDescriptor
Row Index Filters (Option[Map[String, RowIndexFilterType]])

TahoeRemoveFileIndex is created when:

CDCReaderImpl is requested to changesToDF (getDeletedAndAddedRows and processDeletionVectorActions)

Versioned RemoveFiles¶

filesByVersion: Seq[CDCDataSpec[RemoveFile]]

TahoeRemoveFileIndex is given a collection of CDCDataSpecs of RemoveFiles when created.

The CDCDataSpecs come from the DeltaLog of a delta table (converted along the way to match the API).

Matching Files¶

TahoeFileIndex

matchingFiles(
  partitionFilters: Seq[Expression],
  dataFilters: Seq[Expression]): Seq[AddFile]

matchingFiles is part of the TahoeFileIndex abstraction.

matchingFiles creates an AddFile for every RemoveFile (in the given CDCDataSpecs of RemoveFiles by version).

Fake AddFiles

matchingFiles has to return a Seq[AddFile], yet TahoeRemoveFileIndex deals with RemoveFiles. The AddFiles it creates are therefore "fake": they only stand in for the removed files.
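The conversion can be pictured with a minimal sketch. The case classes below are simplified, hypothetical stand-ins and not Delta Lake's real RemoveFile, AddFile and CDCDataSpec. Carrying the commit version as an extra partition value named `_commit_version` is an assumption made for illustration only.

```scala
object FakeAddFilesSketch {
  // Simplified stand-ins for Delta Lake's actions (hypothetical, not the real classes).
  final case class RemoveFile(
      path: String,
      partitionValues: Map[String, String],
      size: Option[Long])
  final case class AddFile(
      path: String,
      partitionValues: Map[String, String],
      size: Long)
  final case class CDCDataSpec[T](version: Long, actions: Seq[T])

  // Creates one "fake" AddFile per RemoveFile, across all versions.
  // Assumption: the commit version travels with the file as an extra partition value.
  def toAddFiles(filesByVersion: Seq[CDCDataSpec[RemoveFile]]): Seq[AddFile] =
    filesByVersion.flatMap { spec =>
      spec.actions.map { r =>
        AddFile(
          path = r.path,
          partitionValues = r.partitionValues + ("_commit_version" -> spec.version.toString),
          size = r.size.getOrElse(0L)) // a missing size falls back to 0
      }
    }
}
```

Calling `FakeAddFilesSketch.toAddFiles` with two specs that hold one RemoveFile each gives two AddFiles with the same paths, each tagged with the version of its spec.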

matchingFiles then calls filterFileList (with the partitionSchema, a DataFrame of the "fake" AddFiles and the given partitionFilters), which gives a DataFrame.

In the end, matchingFiles converts the DataFrame to a Dataset[AddFile] (using Dataset.as operator) and collects the AddFiles (using Dataset.collect operator).
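The filter-then-collect step can be illustrated without Spark. In this sketch, which is not Delta Lake's implementation, plain Scala collections stand in for the DataFrame and Dataset, and a hypothetical `PartitionFilter` predicate over partition values stands in for the Catalyst Expressions of partitionFilters.

```scala
object PartitionPruningSketch {
  // Simplified stand-in for Delta Lake's AddFile (hypothetical).
  final case class AddFile(path: String, partitionValues: Map[String, String])

  // Hypothetical replacement for a Catalyst partition filter expression.
  type PartitionFilter = Map[String, String] => Boolean

  // Keeps only the files whose partition values satisfy every filter.
  // With no filters, all files match.
  def matching(files: Seq[AddFile], filters: Seq[PartitionFilter]): Seq[AddFile] =
    files.filter(file => filters.forall(accepts => accepts(file.partitionValues)))
}
```

For example, a filter such as `(pv: Map[String, String]) => pv.get("date").contains("2024-01-01")` keeps only the files of that one partition.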

Input Files¶

TahoeFileIndex

inputFiles: Array[String]

inputFiles is part of the FileIndex (Spark SQL) abstraction.

inputFiles returns the absolute paths of all the RemoveFiles in the CDCDataSpecs.
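As a rough sketch of that behaviour, the code below flattens the RemoveFiles of all versions and resolves each path against a table root. The case classes, the `absolutePath` helper and its rule for what counts as an absolute path are all assumptions for illustration and not the actual Delta Lake code.

```scala
object InputFilesSketch {
  // Simplified stand-ins (hypothetical, not the real Delta Lake classes).
  final case class RemoveFile(path: String)
  final case class CDCDataSpec[T](version: Long, actions: Seq[T])

  // Assumption: a path is absolute if it starts with "/" or carries a scheme ("://").
  // Anything else is treated as relative to the table root.
  def absolutePath(tableRoot: String, filePath: String): String =
    if (filePath.startsWith("/") || filePath.contains("://")) filePath
    else tableRoot.stripSuffix("/") + "/" + filePath

  // Flattens the RemoveFiles of all versions and resolves every path.
  def inputFiles(
      tableRoot: String,
      filesByVersion: Seq[CDCDataSpec[RemoveFile]]): Array[String] =
    filesByVersion
      .flatMap(_.actions)
      .map(r => absolutePath(tableRoot, r.path))
      .toArray
}
```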

Partition Schema¶

TahoeFileIndex

partitionSchema: StructType

partitionSchema is part of the FileIndex (Spark SQL) abstraction.

partitionSchema returns the CDF-Aware Read Schema.