TahoeLogFileIndex¶

TahoeLogFileIndex is a TahoeFileIndex (and so a FileIndex in Spark SQL) of a Delta table.

Creating Instance¶

TahoeLogFileIndex takes the following to be created:

SparkSession (Spark SQL)
DeltaLog
Data directory of the Delta table (as a Hadoop Path)
Snapshot at analysis
Partition Filters (as Catalyst expressions; default: empty)
isTimeTravelQuery flag (default: false)

TahoeLogFileIndex is created when:

DeltaLog is requested for an Insertable HadoopFsRelation

spark.databricks.delta.checkLatestSchemaOnRead¶

TahoeLogFileIndex uses the spark.databricks.delta.checkLatestSchemaOnRead configuration property when requested for a Snapshot.
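As a minimal sketch, the property can be set for the current session as follows. This assumes a spark-shell session (where `spark` is already defined) with Delta Lake on the classpath. The value is only an example and says nothing about the default.

```scala
// Assumes `spark` is the SparkSession that spark-shell provides.
// Requests the schema check whenever the index is asked for a Snapshot.
spark.conf.set("spark.databricks.delta.checkLatestSchemaOnRead", "true")
```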

isTimeTravelQuery flag¶

TahoeLogFileIndex is given an isTimeTravelQuery flag when created.

isTimeTravelQuery is false by default. It can be enabled when DeltaLog is requested to create a BaseRelation (when DeltaTableV2 is requested for a BaseRelation based on a DeltaTimeTravelSpec).

Demo¶

val q = spark.read.format("delta").load("/tmp/delta/users")
val plan = q.queryExecution.executedPlan

import org.apache.spark.sql.execution.FileSourceScanExec
val scan = plan.collect { case e: FileSourceScanExec => e }.head

import org.apache.spark.sql.delta.files.TahoeLogFileIndex
val index = scan.relation.location.asInstanceOf[TahoeLogFileIndex]
println(index)
// Delta[version=1, file:/tmp/delta/users]

matchingFiles Method¶

matchingFiles(
  partitionFilters: Seq[Expression],
  dataFilters: Seq[Expression],
  keepStats: Boolean = false): Seq[AddFile]

matchingFiles gets the snapshot (with stalenessAcceptable flag off) and requests it for the files to scan (for the index's partition filters, the given partitionFilters and dataFilters).

Note

inputFiles and matchingFiles are similar. Both get the snapshot (of the delta table), but they use different filtering expressions and return value types.

matchingFiles is part of the TahoeFileIndex abstraction.

inputFiles Method¶

inputFiles: Array[String]

inputFiles gets the snapshot (with stalenessAcceptable flag off) and requests it for the files to scan (for the index's partition filters only).

Note

inputFiles and matchingFiles are similar. Both get the snapshot, but they use different filtering expressions and return value types.

inputFiles is part of the FileIndex contract (Spark SQL).
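The difference between matchingFiles and inputFiles can be sketched with a dependency-free model. All the types below (`AddFile`, `Snapshot`, `Filter`, `FileIndexSketch`) are simplified, hypothetical stand-ins and not the Delta Lake classes. Filter expressions are modelled as plain predicates, and a file-size predicate plays the role of a data filter.

```scala
// Simplified, hypothetical stand-ins (not the Delta Lake classes of the same names).
final case class AddFile(path: String, partitionValues: Map[String, String], size: Long)

// A "filter expression" is modelled as a plain predicate over a file.
type Filter = AddFile => Boolean

final case class Snapshot(version: Long, allFiles: Seq[AddFile]) {
  // Keeps only the files that satisfy every filter.
  def filesForScan(filters: Seq[Filter]): Seq[AddFile] =
    allFiles.filter(file => filters.forall(accepts => accepts(file)))
}

final class FileIndexSketch(snapshot: Snapshot, partitionFilters: Seq[Filter]) {
  // The index's own partition filters plus the filters given by the caller.
  def matchingFiles(extraPartitionFilters: Seq[Filter], dataFilters: Seq[Filter]): Seq[AddFile] =
    snapshot.filesForScan(partitionFilters ++ extraPartitionFilters ++ dataFilters)

  // The index's own partition filters only; returns paths, not AddFiles.
  def inputFiles: Array[String] =
    snapshot.filesForScan(partitionFilters).map(_.path).toArray
}

val snapshot = Snapshot(1, Seq(
  AddFile("country=PL/part-0.parquet", Map("country" -> "PL"), size = 10),
  AddFile("country=PL/part-1.parquet", Map("country" -> "PL"), size = 500),
  AddFile("country=US/part-2.parquet", Map("country" -> "US"), size = 500)))

val onlyPL: Filter = _.partitionValues.get("country").contains("PL")
val largeFiles: Filter = _.size > 100

val index = new FileIndexSketch(snapshot, partitionFilters = Seq(onlyPL))
val matched = index.matchingFiles(Nil, dataFilters = Seq(largeFiles))
val inputs = index.inputFiles
```

In this model, `matched` holds only the large file in the `PL` partition, while `inputs` lists the paths of both `PL` files, since inputFiles ignores any per-query filters.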

Snapshot¶

getSnapshot: Snapshot

getSnapshot returns the Snapshot to scan.

With checkSchemaOnRead (spark.databricks.delta.checkLatestSchemaOnRead) enabled or the DeltaColumnMappingMode (of the Metadata of the Snapshot) different from NoMapping, getSnapshot makes sure that the schemas are read-compatible (i.e. the schema has not changed in an incompatible manner since analysis time).

getSnapshot is used when:

TahoeLogFileIndex is requested for the matching files and the input files
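The condition under which getSnapshot checks the schema can be sketched as follows. This is a simplified model with hypothetical stand-in types. It makes three assumptions: the compatibility rule (every column known at analysis time must still exist with the same type), the use of the column mapping mode of the snapshot to scan, and reporting an incompatible change with an exception. None of the three is spelled out by the description above.

```scala
// Hypothetical, simplified stand-ins; none of these are Delta Lake classes.
sealed trait ColumnMappingMode
case object NoMapping extends ColumnMappingMode
case object NameMapping extends ColumnMappingMode

final case class Snapshot(
    version: Long,
    schema: Map[String, String], // column name -> type name
    columnMappingMode: ColumnMappingMode = NoMapping)

// Assumed rule: every column seen at analysis time still exists with the same type.
def isReadCompatible(atAnalysis: Map[String, String], current: Map[String, String]): Boolean =
  atAnalysis.forall { case (name, dataType) => current.get(name).contains(dataType) }

def getSnapshot(
    snapshotAtAnalysis: Snapshot,
    snapshotToScan: Snapshot,
    checkSchemaOnRead: Boolean): Snapshot = {
  val mustCheck = checkSchemaOnRead || snapshotToScan.columnMappingMode != NoMapping
  if (mustCheck && !isReadCompatible(snapshotAtAnalysis.schema, snapshotToScan.schema)) {
    // Assumption: an incompatible change is reported as an error.
    throw new IllegalStateException(
      s"Schema changed incompatibly since version ${snapshotAtAnalysis.version}")
  }
  snapshotToScan
}

val analyzed = Snapshot(1, Map("id" -> "long", "name" -> "string"))
val widened = Snapshot(2, Map("id" -> "long", "name" -> "string", "age" -> "int"))
val dropped = Snapshot(3, Map("id" -> "long"))

val ok = getSnapshot(analyzed, widened, checkSchemaOnRead = true)
val failed = scala.util.Try(getSnapshot(analyzed, dropped, checkSchemaOnRead = true))
val unchecked = getSnapshot(analyzed, dropped, checkSchemaOnRead = false)
```

Adding a column passes the check, dropping one fails it, and with the check disabled (and no column mapping) the snapshot is returned without any comparison.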

getSnapshotToScan¶

getSnapshotToScan: Snapshot

getSnapshotToScan returns the Snapshot at analysis when isTimeTravelQuery is enabled. Otherwise, getSnapshotToScan requests the DeltaLog to update and uses the resulting Snapshot.
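The choice between the two snapshots can be sketched with a small model. `Snapshot` and `DeltaLogSketch` are hypothetical stand-ins, and the sketch assumes that the Snapshot returned for a time travel query is the one given at analysis.

```scala
// Hypothetical stand-ins for Snapshot and DeltaLog.
final case class Snapshot(version: Long)

final class DeltaLogSketch(private var latestVersion: Long) {
  // Simulates a new commit landing in the transaction log.
  def commit(): Unit = latestVersion += 1
  // Stands in for requesting the DeltaLog to update and return a Snapshot.
  def update(): Snapshot = Snapshot(latestVersion)
}

def getSnapshotToScan(
    deltaLog: DeltaLogSketch,
    snapshotAtAnalysis: Snapshot,
    isTimeTravelQuery: Boolean): Snapshot =
  if (isTimeTravelQuery) snapshotAtAnalysis else deltaLog.update()

val log = new DeltaLogSketch(latestVersion = 1)
val atAnalysis = log.update()
log.commit() // the table moves on after analysis

val pinned = getSnapshotToScan(log, atAnalysis, isTimeTravelQuery = true)
val latest = getSnapshotToScan(log, atAnalysis, isTimeTravelQuery = false)
```

A time travel query stays pinned to the analysis-time version, while a regular query picks up the commit made after analysis.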

Internal Properties¶

historicalSnapshotOpt¶

Historical snapshot that is the Snapshot for the versionToUse if defined.

Used when TahoeLogFileIndex is requested for the (historical or latest) snapshot and the schema of the partition columns.
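An optional historical snapshot with a fallback to the latest one can be sketched as below. `HistoricalSketch`, `loadSnapshotAt` and `latest` are hypothetical names introduced for illustration, and computing the value lazily is an assumption.

```scala
// Hypothetical stand-in for Snapshot.
final case class Snapshot(version: Long)

final class HistoricalSketch(
    versionToUse: Option[Long],
    loadSnapshotAt: Long => Snapshot, // hypothetical loader of a given version
    latest: () => Snapshot) {

  // Defined only when there is a version to use; computed at most once.
  lazy val historicalSnapshotOpt: Option[Snapshot] = versionToUse.map(loadSnapshotAt)

  // The historical snapshot if there is one, the latest otherwise.
  def snapshot: Snapshot = historicalSnapshotOpt.getOrElse(latest())
}

val timeTravel = new HistoricalSketch(Some(3L), v => Snapshot(v), () => Snapshot(7))
val regular = new HistoricalSketch(None, v => Snapshot(v), () => Snapshot(7))
```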