TahoeRemoveFileIndex¶
TahoeRemoveFileIndex is a TahoeFileIndexWithSnapshotDescriptor of RemoveFiles for changesToDF in Change Data Feed.
Creating Instance¶
TahoeRemoveFileIndex takes the following to be created:
- SparkSession (Spark SQL)
- Versioned RemoveFiles
- DeltaLog
- Path
- SnapshotDescriptor
- Row Index Filters (Option[Map[String, RowIndexFilterType]])
TahoeRemoveFileIndex is created when:
- CDCReaderImpl is requested to changesToDF (getDeletedAndAddedRows and processDeletionVectorActions)
Versioned RemoveFiles¶
filesByVersion: Seq[CDCDataSpec[RemoveFile]]
TahoeRemoveFileIndex is given CDCDataSpecs of RemoveFiles (one per version) when created.
The CDCDataSpecs come from the DeltaLog of a delta table (converted along the way to match the API).
Matching Files¶
TahoeFileIndex
matchingFiles(
partitionFilters: Seq[Expression],
dataFilters: Seq[Expression]): Seq[AddFile]
matchingFiles is part of the TahoeFileIndex abstraction.
matchingFiles creates an AddFile for every RemoveFile (in the given CDCDataSpecs of RemoveFiles by version).
Fake AddFiles
matchingFiles has to return a Seq[AddFile], while TahoeRemoveFileIndex deals with RemoveFiles, and so the AddFiles are fake (created only to satisfy the interface).
matchingFiles uses filterFileList (with the partitionSchema, a DataFrame of the "fake" AddFiles and the given partitionFilters). That gives a DataFrame.
In the end, matchingFiles converts the DataFrame to a Dataset[AddFile] (using Dataset.as operator) and collects the AddFiles (using Dataset.collect operator).
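The steps above can be sketched without Spark. This is a minimal illustration, not Delta Lake's implementation: the case classes are simplified, hypothetical stand-ins for RemoveFile, AddFile and CDCDataSpec (the real classes carry more fields), a plain predicate on partition values stands in for filterFileList with partition filter expressions, and the zero defaults for size and modificationTime are assumptions.

```scala
// Simplified, hypothetical stand-ins for Delta Lake's classes (the real ones have more fields).
final case class RemoveFile(path: String, partitionValues: Map[String, String], size: Option[Long])
final case class AddFile(
    path: String,
    partitionValues: Map[String, String],
    size: Long,
    modificationTime: Long,
    dataChange: Boolean)
final case class CDCDataSpec[T](version: Long, timestampMs: Long, actions: Seq[T])

object FakeAddFilesSketch {
  // Turn every RemoveFile (across all versions) into a "fake" AddFile
  // so that the result fits an API that returns Seq[AddFile].
  def toFakeAddFiles(filesByVersion: Seq[CDCDataSpec[RemoveFile]]): Seq[AddFile] =
    filesByVersion.flatMap { spec =>
      spec.actions.map { r =>
        // Assumption: missing size and modification time default to 0.
        AddFile(r.path, r.partitionValues, r.size.getOrElse(0L), modificationTime = 0L, dataChange = false)
      }
    }

  // Stand-in for partition filtering: keep the fake AddFiles
  // whose partition values satisfy the given predicate.
  def matchingFiles(
      filesByVersion: Seq[CDCDataSpec[RemoveFile]],
      partitionFilter: Map[String, String] => Boolean): Seq[AddFile] =
    toFakeAddFiles(filesByVersion).filter(f => partitionFilter(f.partitionValues))

  def main(args: Array[String]): Unit = {
    val specs = Seq(
      CDCDataSpec(1L, 0L, Seq(RemoveFile("p=1/a.parquet", Map("p" -> "1"), Some(10L)))),
      CDCDataSpec(2L, 0L, Seq(RemoveFile("p=2/b.parquet", Map("p" -> "2"), None))))
    // Keep only the files removed from partition p=1.
    matchingFiles(specs, _.get("p").contains("1")).foreach(f => println(f.path))
  }
}
```

In the real index the filtering is expressed with Catalyst Expressions over a DataFrame, which is why the result has to be converted back with Dataset.as and Dataset.collect.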
Input Files¶
TahoeFileIndex
inputFiles: Array[String]
inputFiles is part of the FileIndex (Spark SQL) abstraction.
inputFiles returns the absolute paths of all the RemoveFiles of the CDCDataSpecs.
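A rough sketch of what "absolute paths of all the RemoveFiles" could look like follows. It uses plain strings instead of Hadoop Paths, and the absolutePath helper and its resolution rule (keep paths that start with / or carry a URI scheme, resolve the rest against the table path) are assumptions made for illustration, not Delta Lake's code.

```scala
object InputFilesSketch {
  // Hypothetical helper: resolve a file path against the table root
  // unless it is already absolute (leading slash or a URI scheme such as s3:).
  def absolutePath(tablePath: String, child: String): String = {
    val hasScheme = child.matches("^[a-zA-Z][a-zA-Z0-9+.-]*:.*")
    if (child.startsWith("/") || hasScheme) child
    else tablePath.stripSuffix("/") + "/" + child
  }

  // Flatten the removed file paths of all versions and make each one absolute.
  def inputFiles(tablePath: String, removedPathsByVersion: Seq[Seq[String]]): Array[String] =
    removedPathsByVersion.flatten.map(p => absolutePath(tablePath, p)).toArray
}
```

For example, with a table at s3://bucket/t, a relative path such as a.parquet would resolve to s3://bucket/t/a.parquet, while an already absolute path would be returned unchanged.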
Partition Schema¶
TahoeFileIndex
partitionSchema: StructType
partitionSchema is part of the FileIndex (Spark SQL) abstraction.
partitionSchema returns the CDF-Aware Read Schema.