TahoeFileIndex — Indices Of Files Of Delta Table

TahoeFileIndex is an extension of the Spark SQL FileIndex contract for file indices of delta tables that can list data files to scan (based on partition and data filters).

Read up on FileIndex in The Internals of Spark SQL online book.

The aim of TahoeFileIndex is to reduce the very expensive disk access that looking up file-related information through the Hadoop FileSystem API would otherwise require.

Table 1. TahoeFileIndex Contract (Abstract Methods Only)
Method Description

matchingFiles

matchingFiles(
  partitionFilters: Seq[Expression],
  dataFilters: Seq[Expression],
  keepStats: Boolean = false): Seq[AddFile]

Files (AddFiles) matching given partition and data predicates
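
To illustrate what "matching given partition predicates" could mean, here is a minimal, self-contained sketch. It is not the Delta Lake implementation: AddFile is a simplified stand-in case class, PartitionPredicate is a hypothetical function type used in place of Catalyst Expressions, and data filters and keepStats are left out entirely.

```scala
// Simplified stand-ins: the real AddFile lives in Delta Lake and the real
// filters are Catalyst Expressions, not plain Scala functions.
object MatchingFilesSketch {
  final case class AddFile(
      path: String,
      partitionValues: Map[String, String],
      size: Long)

  // Hypothetical predicate type standing in for a partition filter.
  type PartitionPredicate = Map[String, String] => Boolean

  // Keep only the files whose partition values satisfy every predicate.
  def matchingFiles(
      allFiles: Seq[AddFile],
      partitionFilters: Seq[PartitionPredicate]): Seq[AddFile] =
    allFiles.filter(f => partitionFilters.forall(p => p(f.partitionValues)))

  // Sample data for experimenting with the sketch.
  val files: Seq[AddFile] = Seq(
    AddFile("date=2020-01-01/part-0.parquet", Map("date" -> "2020-01-01"), 10L),
    AddFile("date=2020-01-02/part-0.parquet", Map("date" -> "2020-01-02"), 20L))

  val onlyJan2: PartitionPredicate = _.get("date").contains("2020-01-02")
}
```

With no predicates every file matches, which mirrors the idea that filters only ever narrow the set of files to scan.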

When requested for the root input paths (rootPaths), TahoeFileIndex simply gives the path.

Table 2. TahoeFileIndices
TahoeFileIndex Description

TahoeBatchFileIndex

TahoeLogFileIndex

Creating TahoeFileIndex Instance

TahoeFileIndex takes the following to be created:

TahoeFileIndex is a Scala abstract class and cannot be created directly. It is created indirectly for the concrete file indices.

Version of Delta Table — tableVersion Method

tableVersion: Long

tableVersion is simply the version of (the snapshot of) the DeltaLog.

tableVersion is used when TahoeFileIndex is requested for the human-friendly textual representation.

Listing Data Files — listFiles Method

listFiles(
  partitionFilters: Seq[Expression],
  dataFilters: Seq[Expression]): Seq[PartitionDirectory]

listFiles is part of the FileIndex contract for the file names (grouped into partitions when the data is partitioned).

listFiles…​FIXME
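
The internals of listFiles are not described here, but the grouping that the FileIndex contract asks for (files grouped into partitions) can be sketched in plain Scala. Everything below is an assumption for illustration: AddFile is a simplified stand-in, PartitionDir is a hypothetical replacement for Spark SQL's PartitionDirectory, and groupByPartition is not a Delta Lake method.

```scala
object ListFilesSketch {
  // Simplified stand-in for Delta Lake's AddFile.
  final case class AddFile(path: String, partitionValues: Map[String, String])

  // Hypothetical stand-in for Spark SQL's PartitionDirectory.
  final case class PartitionDir(values: Map[String, String], files: Seq[String])

  // Group files that share the same partition values into one directory entry.
  def groupByPartition(files: Seq[AddFile]): Seq[PartitionDir] =
    files
      .groupBy(_.partitionValues)
      .map { case (values, group) => PartitionDir(values, group.map(_.path)) }
      .toSeq

  // Sample data: two files in one partition, one file in another.
  val sample: Seq[AddFile] = Seq(
    AddFile("a.parquet", Map("date" -> "2020-01-01")),
    AddFile("b.parquet", Map("date" -> "2020-01-01")),
    AddFile("c.parquet", Map("date" -> "2020-01-02")))
}
```

For an unpartitioned table every file carries an empty map of partition values, so the sketch yields a single group.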

Partition Schema — partitionSchema Method

partitionSchema: StructType

partitionSchema is part of the FileIndex contract for the partition schema.

partitionSchema simply requests the DeltaLog for the Snapshot, requests the Snapshot for the Metadata, and finally requests the Metadata for the partitionSchema.
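
The delegation chain described above (and the similar one behind tableVersion) can be sketched as follows. The types are deliberately tiny stand-ins that only share names with the Delta Lake classes, the partition schema is modelled as a list of column names rather than a StructType, and FileIndexSketch is a hypothetical class.

```scala
object PartitionSchemaSketch {
  // Stand-ins only: the real partition schema is a Spark SQL StructType,
  // modelled here as a plain list of column names.
  final case class Metadata(partitionColumns: Seq[String]) {
    def partitionSchema: Seq[String] = partitionColumns
  }
  final case class Snapshot(version: Long, metadata: Metadata)
  final case class DeltaLog(snapshot: Snapshot)

  // Hypothetical file index that answers both questions by delegation.
  final class FileIndexSketch(deltaLog: DeltaLog) {
    // Version of the current snapshot of the log.
    def tableVersion: Long = deltaLog.snapshot.version

    // DeltaLog -> Snapshot -> Metadata -> partitionSchema
    def partitionSchema: Seq[String] =
      deltaLog.snapshot.metadata.partitionSchema
  }

  // Sample wiring for experimenting with the sketch.
  val index = new FileIndexSketch(
    DeltaLog(Snapshot(version = 5L, Metadata(Seq("date", "country")))))
}
```

The file index holds no schema or version of its own in this sketch; both values come from whatever snapshot the log currently exposes.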

Human-Friendly Textual Representation — toString Method

toString: String

toString is part of the java.lang.Object contract for a string representation of the object.

toString returns the following text (based on the table version and the path truncated to 100 characters):

Delta[version=[tableVersion], [truncatedPath]]
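
A minimal sketch of how such a string could be assembled is shown below. The text above only says that the path is truncated to 100 characters, so which end is kept and how the cut is marked are assumptions here: the helper keeps the last 100 characters and prefixes "..." when it shortens the input. The names truncateRight and render are illustrative.

```scala
object ToStringSketch {
  // Assumption: keep the last `len` characters and mark the cut with "...".
  def truncateRight(input: String, len: Int): String =
    if (input.length > len) "..." + input.takeRight(len) else input

  // Build the textual representation from the version and the table path.
  def render(tableVersion: Long, path: String): String =
    s"Delta[version=$tableVersion, ${truncateRight(path, 100)}]"
}
```

Keeping the tail of the path tends to preserve the table directory name, which is usually the most recognizable part in query plans and logs.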