TahoeLogFileIndex

TahoeLogFileIndex is a file index.

Creating Instance

TahoeLogFileIndex takes the following to be created:

TahoeLogFileIndex is created when:

spark.databricks.delta.checkLatestSchemaOnRead

TahoeLogFileIndex uses the spark.databricks.delta.checkLatestSchemaOnRead configuration property when requested for a Snapshot.
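
As a rough illustration, the property can be set on a running session like any other Spark SQL configuration property. The snippet assumes a `spark` session is in scope (e.g. in spark-shell) with Delta Lake on the classpath; the value used is only an example.

```scala
// Assumes `spark` is an active SparkSession (e.g. in spark-shell).
// The value "true" is an example, not a recommendation.
spark.conf.set("spark.databricks.delta.checkLatestSchemaOnRead", "true")

// Read the property back to confirm what the session will use.
println(spark.conf.get("spark.databricks.delta.checkLatestSchemaOnRead"))
```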

isTimeTravelQuery flag

TahoeLogFileIndex is given an isTimeTravelQuery flag when created.

The isTimeTravelQuery flag is false by default. It can be set to a different value when DeltaLog is requested to create a BaseRelation (which happens when DeltaTableV2 is requested for a BaseRelation based on a DeltaTimeTravelSpec).

Demo

val q = spark.read.format("delta").load("/tmp/delta/users")
val plan = q.queryExecution.executedPlan

import org.apache.spark.sql.execution.FileSourceScanExec
val scan = plan.collect { case e: FileSourceScanExec => e }.head

import org.apache.spark.sql.delta.files.TahoeLogFileIndex
val index = scan.relation.location.asInstanceOf[TahoeLogFileIndex]
println(index)
// Delta[version=1, file:/tmp/delta/users]

matchingFiles Method

matchingFiles(
  partitionFilters: Seq[Expression],
  dataFilters: Seq[Expression],
  keepStats: Boolean = false): Seq[AddFile]

matchingFiles gets the snapshot (with stalenessAcceptable flag off) and requests it for the files to scan (for the index's partition filters, the given partitionFilters and dataFilters).

Note

inputFiles and matchingFiles are similar. Both get the snapshot (of the delta table), but they use different filtering expressions and return value types.

matchingFiles is part of the TahoeFileIndex abstraction.

inputFiles Method

inputFiles: Array[String]

inputFiles gets the snapshot (with stalenessAcceptable flag off) and requests it for the files to scan (for the index's partition filters only).

Note

inputFiles and matchingFiles are similar. Both get the snapshot, but they use different filtering expressions and return value types.

inputFiles is part of the FileIndex contract (Spark SQL).
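
The difference between the two methods can be sketched with a small, self-contained model. This is not Delta Lake's code: `AddFile`, `Snapshot`, `Filter`, `filesForScan` and `LogFileIndex` below are simplified stand-ins, and filters are modelled as plain predicates rather than Catalyst expressions.

```scala
// A simplified model of inputFiles vs matchingFiles (hypothetical stand-in types).
object FileIndexSketch {
  // Stand-in for AddFile: a path, partition values and one statistic.
  final case class AddFile(path: String, partitionValues: Map[String, String], maxId: Long)

  // Stand-in for a filter Expression: a predicate over a file.
  type Filter = AddFile => Boolean

  // Stand-in for Snapshot: the files to scan are those that pass every filter.
  final case class Snapshot(version: Long, allFiles: Seq[AddFile]) {
    def filesForScan(filters: Seq[Filter]): Seq[AddFile] =
      allFiles.filter(file => filters.forall(keep => keep(file)))
  }

  final class LogFileIndex(snapshot: Snapshot, indexPartitionFilters: Seq[Filter]) {
    // The index's own partition filters plus the filters given by the caller.
    def matchingFiles(partitionFilters: Seq[Filter], dataFilters: Seq[Filter]): Seq[AddFile] =
      snapshot.filesForScan(indexPartitionFilters ++ partitionFilters ++ dataFilters)

    // The index's own partition filters only, returned as paths.
    def inputFiles: Array[String] =
      snapshot.filesForScan(indexPartitionFilters).map(_.path).toArray
  }

  val files: Seq[AddFile] = Seq(
    AddFile("country=PL/part-0.parquet", Map("country" -> "PL"), maxId = 50L),
    AddFile("country=PL/part-1.parquet", Map("country" -> "PL"), maxId = 150L),
    AddFile("country=US/part-2.parquet", Map("country" -> "US"), maxId = 300L))

  val inPoland: Filter = _.partitionValues.get("country").contains("PL")
  val idAtLeast100: Filter = _.maxId >= 100L

  val index = new LogFileIndex(Snapshot(1L, files), Seq(inPoland))
}
```

In this model, `FileIndexSketch.index.inputFiles` yields the two `country=PL` paths, while `FileIndexSketch.index.matchingFiles(Nil, Seq(FileIndexSketch.idAtLeast100))` narrows the result further to the single file that also passes the data filter.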

Snapshot

getSnapshot: Snapshot

getSnapshot returns the Snapshot to scan.


With checkSchemaOnRead enabled or the DeltaColumnMappingMode (of the Metadata of the Snapshot) different from NoMapping, getSnapshot makes sure that the schemas are read-compatible (i.e. that the schema has not changed in an incompatible manner since analysis time).
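
The condition and the check can be sketched as follows. This is a toy model under stated assumptions, not Delta Lake's implementation: schemas are reduced to a map of column name to type name, the column mapping mode is a plain string with `"none"` standing in for NoMapping, and "read-compatible" is assumed to mean that every column known at analysis time still exists with the same type.

```scala
// Toy model of the schema check (hypothetical names, simplified rules).
object ReadCompatibilitySketch {
  // Stand-in for a schema: column name -> type name.
  type Schema = Map[String, String]

  // Assumption: columns known at analysis time must still exist with the
  // same type; newly added columns are tolerated.
  def isReadCompatible(atAnalysis: Schema, current: Schema): Boolean =
    atAnalysis.forall { case (name, tpe) => current.get(name).contains(tpe) }

  // The check runs when the flag is on or a column mapping mode is set.
  def shouldCheck(checkSchemaOnRead: Boolean, columnMappingMode: String): Boolean =
    checkSchemaOnRead || columnMappingMode != "none"

  val analyzed: Schema = Map("id" -> "long", "name" -> "string")
  val withNewColumn: Schema = analyzed + ("email" -> "string")
  val withChangedType: Schema = analyzed + ("id" -> "string")
}
```

Under these assumptions, adding a column keeps the schema read-compatible, while changing the type of `id` does not.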


getSnapshot is used when:

getSnapshotToScan

getSnapshotToScan: Snapshot

getSnapshotToScan returns the Snapshot when isTimeTravelQuery is enabled. Otherwise, getSnapshotToScan requests the DeltaLog to update and give one.
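
The branching can be sketched with a minimal model. The classes below are simplified stand-ins, not Delta Lake's own. The sketch assumes that the snapshot returned for a time travel query is the one the index was created with; the `snapshotAtAnalysis` parameter name is introduced here for illustration only.

```scala
// Toy model of getSnapshotToScan (hypothetical stand-in types).
object SnapshotToScanSketch {
  final case class Snapshot(version: Long)

  // Stand-in for DeltaLog: update() returns the latest snapshot.
  final class DeltaLog(private var latestVersion: Long) {
    def commit(): Unit = latestVersion += 1
    def update(): Snapshot = Snapshot(latestVersion)
  }

  final class LogFileIndex(
      deltaLog: DeltaLog,
      snapshotAtAnalysis: Snapshot, // assumed: the snapshot given at creation
      isTimeTravelQuery: Boolean = false) {
    // Time travel pins the snapshot; otherwise the log is asked for the latest one.
    def getSnapshotToScan: Snapshot =
      if (isTimeTravelQuery) snapshotAtAnalysis else deltaLog.update()
  }

  val log = new DeltaLog(1L)
  val pinned = new LogFileIndex(log, Snapshot(0L), isTimeTravelQuery = true)
  val live = new LogFileIndex(log, Snapshot(1L))

  // A new commit lands after both indices were created.
  log.commit()

  val pinnedResult: Snapshot = pinned.getSnapshotToScan
  val liveResult: Snapshot = live.getSnapshotToScan
}
```

In this model the time travel index keeps returning version 0 regardless of later commits, while the regular index picks up version 2 after the commit.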

Internal Properties

historicalSnapshotOpt

Historical snapshot that is the Snapshot for the versionToUse if defined.

Used when TahoeLogFileIndex is requested for the (historical or latest) snapshot and for the schema of the partition columns.