TahoeFileIndex¶
TahoeFileIndex is an extension of the FileIndex (Spark SQL) abstraction for file indices of delta tables that can list data files to scan (based on partition and data filters).
The aim of TahoeFileIndex (and FileIndex in general) is to reduce the very expensive disk access that looking up file-related information through the Hadoop FileSystem API requires.
TahoeFileIndex is a SupportsRowIndexFilters.
Contract¶
Matching Files¶
matchingFiles(
  partitionFilters: Seq[Expression],
  dataFilters: Seq[Expression]): Seq[AddFile]
AddFiles matching the given partition and data filters (predicates)
See:
Used when:
TahoeFileIndex is requested for data files
ScanWithDeletionVectors is requested for createBroadcastDVMap
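The contract can be pictured with a small, dependency-free sketch. Everything in it is a hypothetical stand-in rather than Delta Lake's own code: SimpleAddFile models only a few fields of AddFile, SimpleFileIndex and FixedFileIndex are invented names, and filters are plain Scala predicates instead of Catalyst Expressions. Treating data filters as predicates over a file entry is a simplification made for the example.

```scala
object MatchingFilesSketch {
  // Hypothetical stand-in for Delta Lake's AddFile (only the fields the sketch needs).
  final case class SimpleAddFile(
      path: String,
      partitionValues: Map[String, String],
      size: Long)

  // Hypothetical stand-in for the contract: one abstract method that
  // returns the files matching all partition and data predicates.
  abstract class SimpleFileIndex {
    def matchingFiles(
        partitionFilters: Seq[Map[String, String] => Boolean],
        dataFilters: Seq[SimpleAddFile => Boolean]): Seq[SimpleAddFile]
  }

  // A concrete index over a fixed, in-memory list of files.
  final class FixedFileIndex(files: Seq[SimpleAddFile]) extends SimpleFileIndex {
    override def matchingFiles(
        partitionFilters: Seq[Map[String, String] => Boolean],
        dataFilters: Seq[SimpleAddFile => Boolean]): Seq[SimpleAddFile] =
      files.filter { f =>
        // A file matches only if every predicate accepts it.
        partitionFilters.forall(p => p(f.partitionValues)) &&
        dataFilters.forall(p => p(f))
      }
  }

  val files: Seq[SimpleAddFile] = Seq(
    SimpleAddFile("date=2024-01-01/part-0.parquet", Map("date" -> "2024-01-01"), 100L),
    SimpleAddFile("date=2024-01-02/part-0.parquet", Map("date" -> "2024-01-02"), 200L))

  val index = new FixedFileIndex(files)

  // Keep only the files of a single partition; no data filters.
  val matched: Seq[SimpleAddFile] = index.matchingFiles(
    partitionFilters = Seq((pv: Map[String, String]) => pv.get("date").contains("2024-01-02")),
    dataFilters = Nil)
}
```

In the sketch, an empty filter list accepts every file, since forall over an empty sequence is true.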
Implementations¶
Creating Instance¶
TahoeFileIndex takes the following to be created:
Abstract Class
TahoeFileIndex is an abstract class and cannot be created directly. It is created indirectly for the concrete TahoeFileIndexes.
Root Paths¶
rootPaths: Seq[Path]
rootPaths is a single-element collection with the path (of the delta table) only.
rootPaths is part of the FileIndex (Spark SQL) abstraction.
Listing Files¶
listFiles(
  partitionFilters: Seq[Expression],
  dataFilters: Seq[Expression]): Seq[PartitionDirectory]
listFiles lists the data files that match the given partition and data filters (as PartitionDirectorys).
listFiles is part of the FileIndex (Spark SQL) abstraction.
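A minimal sketch of how matching files could be turned into partition directories, assuming the files are grouped by their partition values (an assumption suggested by the Seq[PartitionDirectory] return type, not something this page states). SimpleAddFile and SimplePartitionDirectory are hypothetical stand-ins for AddFile and Spark SQL's PartitionDirectory, and the final sort exists only to make the example's output deterministic.

```scala
object ListFilesSketch {
  // Hypothetical stand-in for Delta Lake's AddFile.
  final case class SimpleAddFile(path: String, partitionValues: Map[String, String])

  // Hypothetical stand-in for Spark SQL's PartitionDirectory:
  // one set of partition values plus the files that share it.
  final case class SimplePartitionDirectory(
      partitionValues: Map[String, String],
      files: Seq[String])

  // Group already-matched files by partition values, one directory per group.
  def listFiles(matching: Seq[SimpleAddFile]): Seq[SimplePartitionDirectory] =
    matching
      .groupBy(_.partitionValues)
      .map { case (values, fs) => SimplePartitionDirectory(values, fs.map(_.path)) }
      .toSeq
      // Sorting is for a stable example only.
      .sortBy(_.partitionValues.toSeq.sorted.mkString(","))

  val dirs: Seq[SimplePartitionDirectory] = listFiles(Seq(
    SimpleAddFile("date=2024-01-01/part-0.parquet", Map("date" -> "2024-01-01")),
    SimpleAddFile("date=2024-01-01/part-1.parquet", Map("date" -> "2024-01-01")),
    SimpleAddFile("date=2024-01-02/part-0.parquet", Map("date" -> "2024-01-02"))))
}
```

Here the three files collapse into two directories, one per distinct date value.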
Partitions¶
partitionSchema: StructType
partitionSchema is the partition schema of (the Metadata of the Snapshot of) the DeltaLog.
partitionSchema is part of the FileIndex (Spark SQL) abstraction.
Version of Delta Table¶
tableVersion: Long
tableVersion is the version of (the snapshot of) the DeltaLog.
tableVersion is used when TahoeFileIndex is requested for the human-friendly textual representation.
Textual Representation¶
toString: String
toString returns the following text (using the version and the path of the Delta table):
Delta[version=[tableVersion], [truncatedPath]]
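The pattern above can be reproduced with plain string interpolation. This is only a sketch: describe is a hypothetical helper, the page does not say how truncatedPath is computed (so it is passed in as a ready-made string), and the version and path values are made up for the example.

```scala
object ToStringSketch {
  // Hypothetical helper that fills in the pattern.
  // truncatedPath is taken as given; how it is derived is not covered here.
  def describe(tableVersion: Long, truncatedPath: String): String =
    s"Delta[version=$tableVersion, $truncatedPath]"

  // Example values (assumptions for illustration only).
  val text: String = describe(5L, "/tmp/delta/events")
  // text is "Delta[version=5, /tmp/delta/events]"
}
```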