TahoeFileIndex¶

TahoeFileIndex is an extension of the FileIndex (Spark SQL) abstraction for file indices of delta tables that can list data files to scan (based on partition and data filters).

The aim of TahoeFileIndex (and FileIndex in general) is to reduce the number of very expensive disk accesses (through the Hadoop FileSystem API) made to look up file-related information.

TahoeFileIndex is a SupportsRowIndexFilters.

Contract¶

Matching Files¶

matchingFiles(
  partitionFilters: Seq[Expression],
  dataFilters: Seq[Expression]): Seq[AddFile]

AddFiles that match the given partition and data filters (predicates)

See:

TahoeRemoveFileIndex

Used when:

TahoeFileIndex is requested for data files
ScanWithDeletionVectors is requested for createBroadcastDVMap
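
The idea behind matchingFiles can be pictured with a minimal, self-contained sketch. It does not use the real Delta Lake or Spark SQL classes: `AddFile` here is a simplified stand-in, `PartitionFilter` is a hypothetical function type that replaces Catalyst `Expression`s, and data filters are left out entirely. It only illustrates keeping the files whose partition values satisfy every partition predicate.

```scala
object MatchingFilesSketch {
  // Simplified stand-in for Delta Lake's AddFile (hypothetical, not the real class)
  final case class AddFile(path: String, partitionValues: Map[String, String])

  // Hypothetical replacement for a partition filter expression
  type PartitionFilter = Map[String, String] => Boolean

  // Keep only the files whose partition values satisfy all the filters
  def matchingFiles(
      allFiles: Seq[AddFile],
      partitionFilters: Seq[PartitionFilter]): Seq[AddFile] =
    allFiles.filter(file => partitionFilters.forall(accepts => accepts(file.partitionValues)))

  // Made-up sample data
  val files: Seq[AddFile] = Seq(
    AddFile("country=PL/part-0.parquet", Map("country" -> "PL")),
    AddFile("country=US/part-1.parquet", Map("country" -> "US")))

  val inPoland: PartitionFilter = values => values.get("country").contains("PL")

  val matched: Seq[AddFile] = matchingFiles(files, Seq(inPoland))
}
```

With the sample data above, `MatchingFilesSketch.matched` holds only the file under `country=PL`. An empty list of filters keeps every file.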

Implementations¶

Creating Instance¶

TahoeFileIndex takes the following to be created:

SparkSession
DeltaLog
Hadoop Path

Abstract Class

TahoeFileIndex is an abstract class and cannot be created directly. It is created indirectly through the concrete TahoeFileIndexes.
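
How an abstract index with these three creation arguments relates to a concrete one can be sketched as follows. Every type in the block is a hypothetical stand-in (the real `SparkSession`, `DeltaLog`, `Path`, `Expression` and `AddFile` come from Spark SQL, Delta Lake and Hadoop), and `FixedFileIndex` is an invented subclass used only for illustration.

```scala
object AbstractIndexSketch {
  // Hypothetical stand-ins for the real Spark SQL, Delta Lake and Hadoop types
  final case class SparkSession(appName: String)
  final case class DeltaLog(dataPath: String)
  final case class Path(value: String)
  final case class AddFile(path: String)
  type Expression = String

  // Abstract: it takes the three creation arguments but leaves matchingFiles open
  abstract class FileIndexSketch(
      val spark: SparkSession,
      val deltaLog: DeltaLog,
      val path: Path) {

    // The contract that every concrete index has to fulfil
    def matchingFiles(
        partitionFilters: Seq[Expression],
        dataFilters: Seq[Expression]): Seq[AddFile]

    // The root paths are the path only
    def rootPaths: Seq[Path] = Seq(path)
  }

  // Invented concrete index over a fixed set of files (ignores the filters)
  final class FixedFileIndex(
      session: SparkSession,
      log: DeltaLog,
      tablePath: Path,
      files: Seq[AddFile]) extends FileIndexSketch(session, log, tablePath) {

    def matchingFiles(
        partitionFilters: Seq[Expression],
        dataFilters: Seq[Expression]): Seq[AddFile] = files
  }

  // `new FileIndexSketch(...)` would not compile; only a concrete subclass can be created
  val index: FileIndexSketch = new FixedFileIndex(
    SparkSession("demo"),
    DeltaLog("/tmp/delta/users"),
    Path("/tmp/delta/users"),
    Seq(AddFile("part-0.parquet")))
}
```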

Root Paths¶

rootPaths: Seq[Path]

rootPaths is a single-element collection with the path only.

rootPaths is part of the FileIndex (Spark SQL) abstraction.

Listing Files¶

listFiles(
  partitionFilters: Seq[Expression],
  dataFilters: Seq[Expression]): Seq[PartitionDirectory]

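listFiles uses matchingFiles to find the AddFiles that match the given partition and data filters and returns them as PartitionDirectory entries (grouped by partition values).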

listFiles is part of the FileIndex (Spark SQL) abstraction.
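
The sketch below assumes that listFiles turns the matching AddFiles into one PartitionDirectory per distinct set of partition values; that grouping is an assumption made for illustration, and the block does not reproduce Delta Lake's code. `AddFile` and `PartitionDirectory` are simplified, hypothetical stand-ins for the real classes.

```scala
object ListFilesSketch {
  // Simplified stand-ins (hypothetical) for Delta Lake's AddFile
  // and Spark SQL's PartitionDirectory
  final case class AddFile(path: String, partitionValues: Map[String, String], size: Long)
  final case class PartitionDirectory(partitionValues: Map[String, String], files: Seq[String])

  // Group the already-matched files by their partition values,
  // producing one directory entry per distinct partition
  def listFiles(matching: Seq[AddFile]): Seq[PartitionDirectory] =
    matching
      .groupBy(_.partitionValues)
      .map { case (values, files) => PartitionDirectory(values, files.map(_.path)) }
      .toSeq

  // Made-up sample data: two files in year=2023, one in year=2024
  val dirs: Seq[PartitionDirectory] = listFiles(Seq(
    AddFile("year=2023/a.parquet", Map("year" -> "2023"), 10L),
    AddFile("year=2023/b.parquet", Map("year" -> "2023"), 20L),
    AddFile("year=2024/c.parquet", Map("year" -> "2024"), 30L)))
}
```

The order of the resulting directories is not guaranteed, since it follows the iteration order of the intermediate map.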

Partitions¶

partitionSchema: StructType

partitionSchema is the partition schema of the Metadata of the Snapshot of the DeltaLog.

partitionSchema is part of the FileIndex (Spark SQL) abstraction.

Version of Delta Table¶

tableVersion: Long

tableVersion is the version of (the snapshot of) the DeltaLog.

tableVersion is used when TahoeFileIndex is requested for the human-friendly textual representation.

Textual Representation¶

toString: String

toString returns the following text (using the version and the path of the Delta table):

Delta[version=[tableVersion], [truncatedPath]]
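
The format above can be reproduced with plain string interpolation. This is a minimal sketch: `describe` is a hypothetical helper, the version and the path are made-up sample values, and how the real truncatedPath is computed is not covered here.

```scala
object ToStringSketch {
  // Hypothetical helper that builds the textual representation
  // from a table version and an already-truncated path
  def describe(tableVersion: Long, truncatedPath: String): String =
    s"Delta[version=$tableVersion, $truncatedPath]"

  // Made-up sample values
  val text: String = describe(3L, "/tmp/delta/users")
  // text == "Delta[version=3, /tmp/delta/users]"
}
```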