TahoeRemoveFileIndex¶
TahoeRemoveFileIndex is a TahoeFileIndexWithSnapshotDescriptor of RemoveFiles for changesToDF in Change Data Feed.
Creating Instance¶
TahoeRemoveFileIndex takes the following to be created:
- SparkSession (Spark SQL)
- Versioned RemoveFiles
- DeltaLog
- Path
- SnapshotDescriptor
- Row Index Filters (Option[Map[String, RowIndexFilterType]])
TahoeRemoveFileIndex is created when:
- CDCReaderImpl is requested to changesToDF (getDeletedAndAddedRows and processDeletionVectorActions)
Versioned RemoveFiles¶
filesByVersion: Seq[CDCDataSpec[RemoveFile]]
TahoeRemoveFileIndex is given CDCDataSpecs of RemoveFiles (one per version) when created.
The CDCDataSpecs come from the DeltaLog of a delta table (converted along the way to match the API).
Matching Files¶
TahoeFileIndex
matchingFiles(
partitionFilters: Seq[Expression],
dataFilters: Seq[Expression]): Seq[AddFile]
matchingFiles is part of the TahoeFileIndex abstraction.
matchingFiles creates an AddFile for every RemoveFile (in the given CDCDataSpecs of RemoveFiles by version).
Fake AddFiles
matchingFiles has to return a Seq[AddFile], so the AddFiles are fake in TahoeRemoveFileIndex, which actually deals with RemoveFiles.
matchingFiles uses filterFileList (with the partitionSchema, a DataFrame of the "fake" AddFiles and the given partitionFilters). That gives a DataFrame.
In the end, matchingFiles converts the DataFrame to a Dataset[AddFile] (using Dataset.as operator) and collects the AddFiles (using Dataset.collect operator).
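The flow above can be approximated without Spark. The following is a minimal sketch, not the Delta Lake implementation: the case classes are simplified, hypothetical stand-ins for the real RemoveFile, AddFile and CDCDataSpec, and a plain predicate over partition values takes the place of filterFileList and the Catalyst partitionFilters.

```scala
object FakeAddFilesSketch {
  // Simplified, hypothetical stand-ins (NOT the real Delta Lake classes)
  final case class RemoveFile(
      path: String,
      partitionValues: Map[String, String],
      size: Option[Long])
  final case class AddFile(
      path: String,
      partitionValues: Map[String, String],
      size: Long)
  final case class CDCDataSpec[T](version: Long, actions: Seq[T])

  // Creates one "fake" AddFile for every RemoveFile across all versions.
  // Defaulting a missing size to 0 is an assumption of this sketch.
  def toFakeAddFiles(filesByVersion: Seq[CDCDataSpec[RemoveFile]]): Seq[AddFile] =
    filesByVersion.flatMap { spec =>
      spec.actions.map { r =>
        AddFile(r.path, r.partitionValues, r.size.getOrElse(0L))
      }
    }

  // Stand-in for partition filtering: keeps the fake AddFiles whose
  // partition values satisfy the given predicate.
  def matchingFiles(
      filesByVersion: Seq[CDCDataSpec[RemoveFile]],
      partitionFilter: Map[String, String] => Boolean): Seq[AddFile] =
    toFakeAddFiles(filesByVersion).filter(f => partitionFilter(f.partitionValues))
}
```

With two versions that each remove one file, `FakeAddFilesSketch.matchingFiles(specs, _.get("date").contains("2024-01-02"))` would keep only the fake AddFile whose `date` partition value matches.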
Input Files¶
TahoeFileIndex
inputFiles: Array[String]
inputFiles is part of the FileIndex (Spark SQL) abstraction.
inputFiles returns the absolute paths of all the RemoveFiles in the CDCDataSpecs.
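As an illustration only, the path collection could look like the sketch below. It assumes that relative file paths are resolved against the table root and that already-absolute paths are kept as-is; the `absolutePath` helper, the `tableRoot` parameter and the case classes are hypothetical simplifications, not Delta Lake's API.

```scala
import java.net.URI

object InputFilesSketch {
  // Simplified, hypothetical stand-ins (NOT the real Delta Lake classes)
  final case class RemoveFile(path: String)
  final case class CDCDataSpec[T](version: Long, actions: Seq[T])

  // Hypothetical helper: keeps paths that have a URI scheme or start with "/",
  // and resolves anything else against the table root.
  def absolutePath(tableRoot: String, child: String): String = {
    val uri = new URI(child)
    if (uri.isAbsolute || child.startsWith("/")) child
    else tableRoot.stripSuffix("/") + "/" + child
  }

  // Flattens the RemoveFiles of all versions and makes their paths absolute.
  def inputFiles(
      tableRoot: String,
      filesByVersion: Seq[CDCDataSpec[RemoveFile]]): Array[String] =
    filesByVersion
      .flatMap(_.actions)
      .map(r => absolutePath(tableRoot, r.path))
      .toArray
}
```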
Partition Schema¶
TahoeFileIndex
partitionSchema: StructType
partitionSchema is part of the FileIndex (Spark SQL) abstraction.
partitionSchema returns the CDF-Aware Read Schema.