TahoeLogFileIndex

TahoeLogFileIndex is a file index of a Delta table: a TahoeFileIndex and, through it, a FileIndex (Spark SQL).

Creating Instance

TahoeLogFileIndex takes the following to be created:

  • SparkSession
  • DeltaLog
  • Data directory of the Delta table (as a Hadoop Path)
  • Schema at analysis (StructType)
  • Catalyst Expressions for the partition filters (default: empty)
  • Snapshot version (default: undefined) (Option[Long])

TahoeLogFileIndex is created when DeltaLog is requested for an Insertable HadoopFsRelation.

Demo

val q = spark.read.format("delta").load("/tmp/delta/users")
val plan = q.queryExecution.executedPlan

import org.apache.spark.sql.execution.FileSourceScanExec
val scan = plan.collect { case e: FileSourceScanExec => e }.head

import org.apache.spark.sql.delta.files.TahoeLogFileIndex
val index = scan.relation.location.asInstanceOf[TahoeLogFileIndex]
println(index)
// Delta[version=1, file:/tmp/delta/users]

matchingFiles Method

matchingFiles(
  partitionFilters: Seq[Expression],
  dataFilters: Seq[Expression],
  keepStats: Boolean = false): Seq[AddFile]

matchingFiles gets the snapshot (with stalenessAcceptable flag off) and requests it for the files to scan (for the index's partition filters, the given partitionFilters and dataFilters).

Note

inputFiles and matchingFiles are similar. Both get the snapshot (of the delta table), but they use different filtering expressions and return value types.

matchingFiles is part of the TahoeFileIndex abstraction.
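As a rough illustration, matchingFiles can be called directly on the index obtained as in the Demo. This is a sketch for spark-shell with Delta Lake on the classpath, not a definitive recipe: it assumes the signature shown above, the /tmp/delta/users table from the Demo, and that the scan's location is a TahoeLogFileIndex in the Delta Lake version in use. The path and size fields of AddFile are used only for printing.

```scala
// Run in spark-shell (a `spark` session is assumed to be in scope).
import org.apache.spark.sql.execution.FileSourceScanExec
import org.apache.spark.sql.delta.files.TahoeLogFileIndex

val q = spark.read.format("delta").load("/tmp/delta/users")

// Same lookup as in the Demo: take the file index of the first file scan.
val index = q.queryExecution.executedPlan
  .collect { case e: FileSourceScanExec => e }
  .head
  .relation
  .location
  .asInstanceOf[TahoeLogFileIndex]

// No extra filters are given, so only the index's own partition filters apply.
val files = index.matchingFiles(partitionFilters = Nil, dataFilters = Nil)

// Each element is an AddFile action of the snapshot.
files.foreach(f => println(s"${f.path} (${f.size} bytes)"))
```

Passing non-empty partitionFilters or dataFilters would narrow the result further; building such Catalyst expressions is left out here to keep the sketch short.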

inputFiles Method

inputFiles: Array[String]

inputFiles gets the snapshot (with stalenessAcceptable flag off) and requests it for the files to scan (for the index's partition filters only).

Note

inputFiles and matchingFiles are similar. Both get the snapshot, but they use different filtering expressions and return value types.

inputFiles is part of the FileIndex contract (Spark SQL).
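One way to observe inputFiles from user code is Dataset.inputFiles, which for file-based relations asks the relation's file index for its input files. The snippet below is a minimal sketch for spark-shell; it assumes the /tmp/delta/users table from the Demo and that, in the Delta Lake version in use, the relation's file index is a TahoeLogFileIndex.

```scala
// Run in spark-shell (a `spark` session is assumed to be in scope).
val users = spark.read.format("delta").load("/tmp/delta/users")

// Paths of the data files that make up the current snapshot of the table.
val paths: Array[String] = users.inputFiles
paths.foreach(println)
```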

Historical Or Latest Snapshot

getSnapshot(
  stalenessAcceptable: Boolean): Snapshot

getSnapshot returns the historical snapshot when a snapshot version is specified. Otherwise, getSnapshot requests the DeltaLog to update and returns the resulting (latest) Snapshot.

getSnapshot is used when TahoeLogFileIndex is requested for the matching files and the input files.
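The choice between the historical and the latest snapshot can be modeled in a few lines of plain Scala. This is a simplified sketch, not the Delta Lake source: Snapshot, FakeDeltaLog and FakeLogFileIndex are hypothetical stand-ins, and getSnapshotAt is an assumed helper for loading a snapshot at a given version.

```scala
// Hypothetical stand-ins for illustration only (not the Delta Lake classes).
final case class Snapshot(version: Long)

final class FakeDeltaLog(latestVersion: Long) {
  // Stands in for requesting the log to update: returns the latest snapshot.
  def update(stalenessAcceptable: Boolean): Snapshot = Snapshot(latestVersion)
  // Assumed helper that loads the snapshot at a specific version.
  def getSnapshotAt(version: Long): Snapshot = Snapshot(version)
}

final class FakeLogFileIndex(deltaLog: FakeDeltaLog, versionToUse: Option[Long]) {
  // Defined only when a snapshot version was given; computed at most once.
  lazy val historicalSnapshotOpt: Option[Snapshot] =
    versionToUse.map(v => deltaLog.getSnapshotAt(v))

  // Historical snapshot if available, otherwise the latest one from the log.
  def getSnapshot(stalenessAcceptable: Boolean): Snapshot =
    historicalSnapshotOpt.getOrElse(deltaLog.update(stalenessAcceptable))
}

object GetSnapshotSketch {
  val log = new FakeDeltaLog(latestVersion = 5L)
  val pinned = new FakeLogFileIndex(log, versionToUse = Some(2L))
  val latest = new FakeLogFileIndex(log, versionToUse = None)

  def main(args: Array[String]): Unit = {
    println(pinned.getSnapshot(stalenessAcceptable = false)) // Snapshot(2)
    println(latest.getSnapshot(stalenessAcceptable = false)) // Snapshot(5)
  }
}
```

With a version pinned, the log is never asked to update, so repeated calls keep returning the same snapshot even if the table moves on.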

Internal Properties

historicalSnapshotOpt

Historical Snapshot for the snapshot version (versionToUse), if defined.

Used when TahoeLogFileIndex is requested for the (historical or latest) snapshot and the schema of the partition columns.
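A snapshot version is typically defined by a time-travel read. The following spark-shell sketch uses the versionAsOf option of the delta data source; it assumes the /tmp/delta/users table from the Demo and that version 0 of that table is still available. That this option ends up as the snapshot version of the file index is an assumption about the Delta Lake version in use.

```scala
// Run in spark-shell (a `spark` session is assumed to be in scope).
import org.apache.spark.sql.execution.FileSourceScanExec

// Time-travel read pinned to version 0 of the table.
val v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/users")

// Same lookup as in the Demo: the file index behind the first file scan.
val location = v0.queryExecution.executedPlan
  .collect { case e: FileSourceScanExec => e }
  .head
  .relation
  .location

// Printing the index shows which table version it refers to.
println(location)
```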


Last update: 2020-10-05