TahoeLogFileIndex

TahoeLogFileIndex is a concrete file index.

TahoeLogFileIndex is created when DeltaLog is requested to create a relation, i.e. when DeltaDataSource is requested for one as a CreatableRelationProvider or a RelationProvider.

import org.apache.spark.sql.execution.FileSourceScanExec
import org.apache.spark.sql.delta.files.TahoeLogFileIndex

// Load a delta table and take the physical plan of the query
val q = spark.read.format("delta").load("/tmp/delta/users")
val plan = q.queryExecution.executedPlan

// Find the file scan operator in the physical plan
val scan = plan.collect { case e: FileSourceScanExec => e }.head

// The file index (location) of the scanned relation is a TahoeLogFileIndex
val index = scan.relation.location.asInstanceOf[TahoeLogFileIndex]

scala> println(index)
Delta[version=1, file:/tmp/delta/users]

Creating TahoeLogFileIndex Instance

TahoeLogFileIndex takes the following to be created:

  • SparkSession

  • DeltaLog

  • Data directory of the delta table (as Hadoop Path)

  • Partition filters (default: empty) (as Catalyst expressions, i.e. Seq[Expression])

  • Snapshot version, i.e. versionToUse (default: undefined) (as Option[Long])

TahoeLogFileIndex initializes the internal properties.

Schema of Partition Columns — partitionSchema Method

partitionSchema: StructType
partitionSchema is part of the FileIndex contract (Spark SQL) to get the schema of the partition columns (if used).

partitionSchema…​FIXME

matchingFiles Method

matchingFiles(
  partitionFilters: Seq[Expression],
  dataFilters: Seq[Expression],
  keepStats: Boolean = false): Seq[AddFile]
matchingFiles is part of the TahoeFileIndex Contract for the files matching given predicates.

matchingFiles gets the snapshot (with stalenessAcceptable flag off) and requests it for the files to scan (for the index’s partition filters, the given partitionFilters and dataFilters).

inputFiles and matchingFiles are similar. Both get the snapshot (of the delta table), but they use different filtering expressions and return value types.

inputFiles Method

inputFiles: Array[String]
inputFiles is part of the FileIndex contract to…​FIXME

inputFiles gets the snapshot (with stalenessAcceptable flag off) and requests it for the files to scan (for the index’s partition filters).

inputFiles and matchingFiles are similar. Both get the snapshot (of the delta table), but they use different filtering expressions and return value types.
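
The relationship between the two methods can be sketched with a small, dependency-free model. Everything in it (ModelAddFile, ModelSnapshot, ModelLogFileIndex, indexFilters, and plain Scala predicates standing in for Catalyst expressions) is a hypothetical stand-in and not the Delta Lake API. It only illustrates that both methods start from the same snapshot but apply different filters and return different types.

```scala
// Stand-in types for illustration only; these are NOT the Delta Lake classes.
final case class ModelAddFile(path: String, partitionValues: Map[String, String])

final case class ModelSnapshot(allFiles: Seq[ModelAddFile]) {
  // Models "files to scan": keep the files that satisfy every filter.
  def filesForScan(filters: Seq[ModelAddFile => Boolean]): Seq[ModelAddFile] =
    allFiles.filter(file => filters.forall(_(file)))
}

final class ModelLogFileIndex(
    snapshot: ModelSnapshot,
    indexFilters: Seq[ModelAddFile => Boolean]) {

  // Simplification: always hands back the one snapshot the model was given.
  private def getSnapshot(stalenessAcceptable: Boolean): ModelSnapshot = snapshot

  // The index's own filters plus the given ones; returns file entries.
  def matchingFiles(
      partitionFilters: Seq[ModelAddFile => Boolean],
      dataFilters: Seq[ModelAddFile => Boolean]): Seq[ModelAddFile] =
    getSnapshot(stalenessAcceptable = false)
      .filesForScan(indexFilters ++ partitionFilters ++ dataFilters)

  // The index's own filters only; returns file paths.
  def inputFiles: Array[String] =
    getSnapshot(stalenessAcceptable = false)
      .filesForScan(indexFilters)
      .map(_.path)
      .toArray
}

val files = Seq(
  ModelAddFile("country=CA/city=Toronto/part-0.parquet", Map("country" -> "CA", "city" -> "Toronto")),
  ModelAddFile("country=CA/city=Montreal/part-1.parquet", Map("country" -> "CA", "city" -> "Montreal")),
  ModelAddFile("country=PL/city=Warsaw/part-2.parquet", Map("country" -> "PL", "city" -> "Warsaw")))

val inCanada: ModelAddFile => Boolean = _.partitionValues.get("country").contains("CA")
val inToronto: ModelAddFile => Boolean = _.partitionValues.get("city").contains("Toronto")

val modelIndex = new ModelLogFileIndex(ModelSnapshot(files), indexFilters = Seq(inCanada))

// Both Canadian files, as paths
println(modelIndex.inputFiles.mkString(", "))
// Only the Toronto file, as a file entry
println(modelIndex.matchingFiles(partitionFilters = Seq(inToronto), dataFilters = Nil))
```

In the model, inputFiles sees only the filters the index was created with, whereas matchingFiles narrows the result further with the filters given at call time.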

Historical Or Latest Snapshot — getSnapshot Method

getSnapshot(
  stalenessAcceptable: Boolean): Snapshot

getSnapshot returns the historical snapshot (for the snapshot version) if defined. Otherwise, getSnapshot requests the DeltaLog to update and returns the resulting Snapshot.

getSnapshot is used when TahoeLogFileIndex is requested for the matching files and the input files.
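
The choice between the historical and the latest snapshot can be sketched as follows. This is a minimal model under stated assumptions, not the Delta Lake implementation: VersionedSnapshot, ModelDeltaLog (with its commit and getSnapshotAt helpers) and PinnableLogFileIndex are hypothetical stand-ins, and the model ignores what the stalenessAcceptable flag does during an update.

```scala
// Stand-in types for illustration only; these are NOT the Delta Lake classes.
final case class VersionedSnapshot(version: Long)

final class ModelDeltaLog(private var latest: Long) {
  // Hypothetical helper: models a new commit landing in the transaction log.
  def commit(): Unit = latest += 1

  // Models an update: returns the snapshot of the latest version.
  // The flag is accepted but ignored in this simplified model.
  def update(stalenessAcceptable: Boolean): VersionedSnapshot = VersionedSnapshot(latest)

  // Hypothetical helper: looks up the snapshot of a given version.
  def getSnapshotAt(version: Long): VersionedSnapshot = VersionedSnapshot(version)
}

final class PinnableLogFileIndex(deltaLog: ModelDeltaLog, versionToUse: Option[Long]) {
  // Defined only when a snapshot version was given.
  private lazy val historicalSnapshotOpt: Option[VersionedSnapshot] =
    versionToUse.map(version => deltaLog.getSnapshotAt(version))

  // Historical snapshot if defined, otherwise ask the log to update.
  def getSnapshot(stalenessAcceptable: Boolean): VersionedSnapshot =
    historicalSnapshotOpt.getOrElse(deltaLog.update(stalenessAcceptable))
}

val log = new ModelDeltaLog(latest = 1L)
val latestIndex = new PinnableLogFileIndex(log, versionToUse = None)
val pinnedIndex = new PinnableLogFileIndex(log, versionToUse = Some(0L))

log.commit() // the modelled table moves from version 1 to version 2

println(latestIndex.getSnapshot(stalenessAcceptable = false)) // VersionedSnapshot(2)
println(pinnedIndex.getSnapshot(stalenessAcceptable = false)) // VersionedSnapshot(0)
```

An index created without a snapshot version follows the log as it advances, while an index pinned to a version keeps returning the same snapshot.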

sizeInBytes Property

sizeInBytes: Long
sizeInBytes is part of the FileIndex contract for the table size (in bytes).

sizeInBytes…​FIXME

Internal Properties

  • historicalSnapshotOpt: Historical snapshot, i.e. the Snapshot for the versionToUse (if defined). Used when TahoeLogFileIndex is requested for the (historical or latest) snapshot and the schema of the partition columns.