TahoeRemoveFileIndex¶
TahoeRemoveFileIndex is a TahoeFileIndexWithSnapshotDescriptor of RemoveFiles for changesToDF in Change Data Feed.
Creating Instance¶
TahoeRemoveFileIndex takes the following to be created:
- SparkSession (Spark SQL)
- Versioned RemoveFiles
- DeltaLog
- Path
- SnapshotDescriptor
- Row Index Filters (Option[Map[String, RowIndexFilterType]])
TahoeRemoveFileIndex is created when:
- CDCReaderImpl is requested to changesToDF (getDeletedAndAddedRows and processDeletionVectorActions)
Versioned RemoveFiles¶
TahoeRemoveFileIndex is given CDCDataSpecs of RemoveFiles when created.
The CDCDataSpecs come from the DeltaLog of a delta table (converted along the way to match the API).
Matching Files¶
TahoeFileIndex
matchingFiles is part of the TahoeFileIndex abstraction.
matchingFiles creates AddFiles for every RemoveFile (in the given CDCDataSpecs of RemoveFiles by version).
Fake AddFiles
matchingFiles returns a Seq[AddFile], so the AddFiles are "fake" in TahoeRemoveFileIndex: each one stands in for a RemoveFile.
matchingFiles uses filterFileList (with the partitionSchema, a DataFrame of the "fake" AddFiles and the given partitionFilters). That gives a DataFrame.
In the end, matchingFiles converts the DataFrame to a Dataset[AddFile] (using Dataset.as operator) and collects the AddFiles (using Dataset.collect operator).
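The flow above can be illustrated with a minimal, Spark-free sketch. The case classes below are simplified, hypothetical stand-ins (not Delta Lake's real RemoveFile, AddFile or CDCDataSpec), and a plain collection filter takes the place of filterFileList and the DataFrame round trip. It only shows the shape of the idea: every RemoveFile becomes a "fake" AddFile, and the partition filters are then applied.

```scala
// Simplified, hypothetical stand-ins for Delta Lake's actions (NOT the real classes)
final case class RemoveFile(path: String, partitionValues: Map[String, String], size: Long)
final case class AddFile(path: String, partitionValues: Map[String, String], size: Long)
final case class CDCDataSpec(version: Long, actions: Seq[RemoveFile])

object MatchingFilesSketch {
  // Assumption: a partition filter is modelled as a predicate over partition values
  type PartitionFilter = Map[String, String] => Boolean

  // Turn every RemoveFile (across all versions) into a "fake" AddFile
  def toFakeAddFiles(filesByVersion: Seq[CDCDataSpec]): Seq[AddFile] =
    filesByVersion.flatMap { spec =>
      spec.actions.map(r => AddFile(r.path, r.partitionValues, r.size))
    }

  // Plain-Scala substitute for filterFileList followed by as[AddFile] and collect:
  // keep only the fake AddFiles whose partition values satisfy every filter
  def matchingFiles(
      filesByVersion: Seq[CDCDataSpec],
      partitionFilters: Seq[PartitionFilter]): Seq[AddFile] =
    toFakeAddFiles(filesByVersion)
      .filter(f => partitionFilters.forall(p => p(f.partitionValues)))
}
```

With an empty list of partition filters, the sketch returns one fake AddFile per RemoveFile, in version order.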
Input Files¶
TahoeFileIndex
inputFiles is part of the FileIndex (Spark SQL) abstraction.
inputFiles returns the absolute paths of all the RemoveFiles in the CDCDataSpecs.
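As a rough illustration, the sketch below flattens the RemoveFiles of all versions and resolves each path against the table root. The case classes and the absolutePath helper are hypothetical simplifications: the helper uses plain string handling and assumes a path is already absolute when it starts with `/` or contains a URI scheme, which may differ from how Delta Lake resolves paths.

```scala
// Simplified, hypothetical stand-ins (NOT Delta Lake's real classes)
final case class RemoveFile(path: String)
final case class CDCDataSpec(version: Long, actions: Seq[RemoveFile])

object InputFilesSketch {
  // Assumption: paths with a scheme ("s3://...") or a leading "/" are already absolute;
  // anything else is treated as relative to the table root
  def absolutePath(tableRoot: String, child: String): String =
    if (child.contains("://") || child.startsWith("/")) child
    else s"${tableRoot.stripSuffix("/")}/$child"

  // Absolute paths of all the RemoveFiles across all versions
  def inputFiles(tableRoot: String, filesByVersion: Seq[CDCDataSpec]): Array[String] =
    filesByVersion
      .flatMap(_.actions)
      .map(r => absolutePath(tableRoot, r.path))
      .toArray
}
```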
Partition Schema¶
TahoeFileIndex
partitionSchema is part of the FileIndex (Spark SQL) abstraction.
partitionSchema returns the CDF-Aware Read Schema.