TahoeRemoveFileIndex is a TahoeFileIndexWithSnapshotDescriptor of RemoveFiles for changesToDF in Change Data Feed.

Creating Instance

TahoeRemoveFileIndex takes the following to be created:

TahoeRemoveFileIndex is created when:

Versioned RemoveFiles

filesByVersion: Seq[CDCDataSpec[RemoveFile]]

TahoeRemoveFileIndex is given CDCDataSpecs of RemoveFiles (grouped by version) when created.

The CDCDataSpecs come from the DeltaLog of a delta table (converted along the way to match the API).

Matching Files

matchingFiles(
  partitionFilters: Seq[Expression],
  dataFilters: Seq[Expression]): Seq[AddFile]

matchingFiles is part of the TahoeFileIndex abstraction.

matchingFiles creates an AddFile for every RemoveFile (in the given CDCDataSpecs of RemoveFiles by version).

Fake AddFiles

matchingFiles has to return a Seq[AddFile], but TahoeRemoveFileIndex deals with RemoveFiles. The AddFiles it returns are therefore "fake": they only wrap the RemoveFiles so that they fit the return type.

matchingFiles then calls filterFileList (with the partitionSchema, a DataFrame of the "fake" AddFiles and the given partitionFilters). That gives a DataFrame.

In the end, matchingFiles converts the DataFrame to a Dataset[AddFile] (using Dataset.as operator) and collects the AddFiles (using Dataset.collect operator).
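The steps above can be sketched with plain Scala collections, without Spark. This is a minimal illustration under stated assumptions, not the Delta Lake implementation: RemoveFile, AddFile and CDCDataSpec are simplified stand-ins for the real classes, the `_commit_version` key is a hypothetical way to carry the version, and a partition filter is modeled as a predicate over partition values instead of a Catalyst Expression passed to filterFileList.

```scala
// Simplified stand-ins for the Delta Lake types (assumptions, not the real classes).
final case class RemoveFile(path: String, partitionValues: Map[String, String], size: Option[Long])
final case class AddFile(path: String, partitionValues: Map[String, String], size: Long)
final case class CDCDataSpec[T](version: Long, actions: Seq[T])

object MatchingFilesSketch {
  // Hypothetical partition key used to carry the commit version of a removed file.
  val CommitVersionKey = "_commit_version"

  // Turn every RemoveFile into a "fake" AddFile, tagged with the version it belongs to.
  def toFakeAddFiles(filesByVersion: Seq[CDCDataSpec[RemoveFile]]): Seq[AddFile] =
    for {
      spec <- filesByVersion
      r <- spec.actions
    } yield AddFile(
      path = r.path,
      partitionValues = r.partitionValues + (CommitVersionKey -> spec.version.toString),
      size = r.size.getOrElse(0L))

  // Stand-in for the filterFileList step: keep the files that satisfy all partition filters.
  def matchingFiles(
      filesByVersion: Seq[CDCDataSpec[RemoveFile]],
      partitionFilters: Seq[Map[String, String] => Boolean]): Seq[AddFile] =
    toFakeAddFiles(filesByVersion)
      .filter(f => partitionFilters.forall(p => p(f.partitionValues)))
}
```

With no partition filters every RemoveFile comes back as a fake AddFile. With a filter on a partition column, only the files from matching partitions remain.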

Input Files

inputFiles: Array[String]

inputFiles is part of the FileIndex (Spark SQL) abstraction.

inputFiles returns the absolute paths of all the RemoveFiles of the CDCDataSpecs.
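A rough sketch of that flattening, assuming simplified stand-in types and java.net.URI for path resolution (the real code works with Hadoop paths, and the `absolutePath` helper and `tableRoot` parameter here are hypothetical):

```scala
import java.net.URI

// Simplified stand-ins for the Delta Lake types (assumptions, not the real classes).
final case class RemoveFile(path: String, partitionValues: Map[String, String], size: Option[Long])
final case class CDCDataSpec[T](version: Long, actions: Seq[T])

object InputFilesSketch {
  // Resolve a file path against the table root unless it already has a scheme.
  // The table root is expected to end with "/" so that resolve appends to it.
  def absolutePath(tableRoot: URI, child: String): URI = {
    val uri = new URI(child)
    if (uri.isAbsolute) uri else tableRoot.resolve(uri)
  }

  // Flatten the RemoveFiles of all versions and return their absolute paths.
  def inputFiles(
      tableRoot: URI,
      filesByVersion: Seq[CDCDataSpec[RemoveFile]]): Array[String] =
    filesByVersion
      .flatMap(_.actions)
      .map(r => absolutePath(tableRoot, r.path).toString)
      .toArray
}
```

The version of each CDCDataSpec plays no role here: only the paths of the RemoveFiles matter.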

Partition Schema

partitionSchema: StructType

partitionSchema is part of the FileIndex (Spark SQL) abstraction.

partitionSchema returns the CDF-Aware Read Schema.