TahoeLogFileIndex¶
TahoeLogFileIndex is a file index (a TahoeFileIndex) of a Delta table.
Creating Instance¶
TahoeLogFileIndex takes the following to be created:
- SparkSession (Spark SQL)
- DeltaLog
- Data directory of the Delta table (as a Hadoop Path)
- Snapshot at analysis
- Partition Filters (as Catalyst expressions; default: empty)
- isTimeTravelQuery flag (default: false)
TahoeLogFileIndex is created when:
- DeltaLog is requested for an Insertable HadoopFsRelation
spark.databricks.delta.checkLatestSchemaOnRead¶
TahoeLogFileIndex uses the spark.databricks.delta.checkLatestSchemaOnRead configuration property when requested for a Snapshot.
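As a quick illustration, the property can be set on a session like any other Spark SQL configuration. This is a minimal sketch that assumes an active SparkSession named spark (as in spark-shell) with Delta Lake on the classpath; the value shown is only an example.

```scala
// Assumes `spark` is an active SparkSession (e.g. in spark-shell) with Delta Lake available.
// Enable the schema check on read for this session
spark.conf.set("spark.databricks.delta.checkLatestSchemaOnRead", "true")

// Read the current value back
println(spark.conf.get("spark.databricks.delta.checkLatestSchemaOnRead"))
```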
isTimeTravelQuery flag¶
TahoeLogFileIndex is given an isTimeTravelQuery flag when created.
The isTimeTravelQuery flag is false by default. It can be true when DeltaLog is requested to create a BaseRelation (when DeltaTableV2 is requested for a BaseRelation based on a DeltaTimeTravelSpec).
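For context, a time-travel read is the kind of query that carries a time-travel specification. The sketch below uses the versionAsOf read option of Delta Lake and assumes a spark-shell session and that the table at /tmp/delta/users has a version 0 (both are assumptions, not facts from this page). It does not inspect the flag itself.

```scala
// Assumes `spark` is an active SparkSession with Delta Lake available
// and that /tmp/delta/users is a Delta table with a version 0.
val v0 = spark.read
  .format("delta")
  .option("versionAsOf", "0") // time travel to version 0
  .load("/tmp/delta/users")

// Print the physical plan of the time-travel read
v0.explain()
```

The same steps as in the demo on this page can then be applied to the plan of v0 to reach the file index behind the scan.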
Demo¶
// Load the Delta table and take the physical plan of the query
val q = spark.read.format("delta").load("/tmp/delta/users")
val plan = q.queryExecution.executedPlan
// Find the file scan operator in the plan
import org.apache.spark.sql.execution.FileSourceScanExec
val scan = plan.collect { case e: FileSourceScanExec => e }.head
// The location of the scanned relation is the TahoeLogFileIndex
import org.apache.spark.sql.delta.files.TahoeLogFileIndex
val index = scan.relation.location.asInstanceOf[TahoeLogFileIndex]
println(index)
// Delta[version=1, file:/tmp/delta/users]
matchingFiles Method¶
matchingFiles(
partitionFilters: Seq[Expression],
dataFilters: Seq[Expression],
keepStats: Boolean = false): Seq[AddFile]
matchingFiles gets the snapshot (with stalenessAcceptable flag off) and requests it for the files to scan (for the index's partition filters, the given partitionFilters and dataFilters).
Note
inputFiles and matchingFiles are similar. Both get the snapshot (of the delta table), but they use different filtering expressions and return value types.
matchingFiles is part of the TahoeFileIndex abstraction.
inputFiles Method¶
inputFiles: Array[String]
inputFiles gets the snapshot (with stalenessAcceptable flag off) and requests it for the files to scan (for the index's partition filters only).
Note
inputFiles and matchingFiles are similar. Both get the snapshot, but they use different filtering expressions and return value types.
inputFiles is part of the FileIndex contract (Spark SQL).
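The difference between the two methods can be sketched with a toy model. Everything below is hypothetical: the AddFile case class, the ToyFileIndex class and the predicate-based filters are simplified stand-ins for the Delta Lake classes and Catalyst expressions, and building each returned string by joining the table path with the file path is an assumption of the sketch.

```scala
object MatchingFilesSketch {
  // Hypothetical stand-in for a file entry of a Delta table
  final case class AddFile(path: String, partitionValues: Map[String, String])

  final class ToyFileIndex(
      tablePath: String,
      files: Seq[AddFile],
      indexPartitionFilters: Seq[AddFile => Boolean]) {

    // Keep only the files that satisfy every filter
    private def scan(filters: Seq[AddFile => Boolean]): Seq[AddFile] =
      files.filter(f => filters.forall(p => p(f)))

    // Index filters plus the caller's filters; returns file entries
    def matchingFiles(
        partitionFilters: Seq[AddFile => Boolean],
        dataFilters: Seq[AddFile => Boolean]): Seq[AddFile] =
      scan(indexPartitionFilters ++ partitionFilters ++ dataFilters)

    // Index filters only; returns paths as strings
    def inputFiles: Array[String] =
      scan(indexPartitionFilters).map(f => s"$tablePath/${f.path}").toArray
  }

  val index = new ToyFileIndex(
    "/tmp/delta/users",
    Seq(
      AddFile("city=Warsaw/part-0.parquet", Map("city" -> "Warsaw")),
      AddFile("city=Toronto/part-1.parquet", Map("city" -> "Toronto"))),
    indexPartitionFilters = Seq.empty)

  val warsawOnly: AddFile => Boolean = _.partitionValues.get("city").contains("Warsaw")

  // One file entry survives the extra partition filter
  val matched: Seq[AddFile] = index.matchingFiles(Seq(warsawOnly), Seq.empty)
  // Both files are listed, as strings
  val inputs: Array[String] = index.inputFiles
}
```

In this model both methods share the same scan step and differ only in which filters they pass in and how they shape the result, which mirrors the note above.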
Snapshot¶
getSnapshot: Snapshot
getSnapshot returns the Snapshot to scan.
With spark.databricks.delta.checkLatestSchemaOnRead enabled or the DeltaColumnMappingMode (of the Metadata of the Snapshot) set (different from NoMapping), getSnapshot makes sure that the schemas are read-compatible (and have not changed in an incompatible manner since analysis time).
getSnapshot is used when:
- TahoeLogFileIndex is requested for the matching files and the input files
getSnapshotToScan¶
getSnapshotToScan: Snapshot
getSnapshotToScan returns the given Snapshot (at analysis) with isTimeTravelQuery enabled. Otherwise, it requests the DeltaLog to update and give one.
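The choice described above can be modelled in a few lines. This is a sketch only: Snapshot, ToyDeltaLog and ToyLogFileIndex are hypothetical stand-ins, not the Delta Lake classes, and it assumes that the snapshot returned for a time-travel query is the one the index was created with (the snapshot at analysis).

```scala
object SnapshotToScanSketch {
  // Hypothetical stand-in for a snapshot of a Delta table
  final case class Snapshot(version: Long)

  final class ToyDeltaLog(latestVersion: Long) {
    // Stands in for requesting the DeltaLog to update and give a snapshot
    def update(): Snapshot = Snapshot(latestVersion)
  }

  final class ToyLogFileIndex(
      deltaLog: ToyDeltaLog,
      snapshotAtAnalysis: Snapshot,
      isTimeTravelQuery: Boolean = false) {

    // Time travel pins the snapshot given at creation;
    // otherwise the log is asked for an up-to-date one
    def getSnapshotToScan: Snapshot =
      if (isTimeTravelQuery) snapshotAtAnalysis else deltaLog.update()
  }

  val log = new ToyDeltaLog(latestVersion = 5L)
  val pinned: Snapshot =
    new ToyLogFileIndex(log, Snapshot(1L), isTimeTravelQuery = true).getSnapshotToScan
  val latest: Snapshot =
    new ToyLogFileIndex(log, Snapshot(1L)).getSnapshotToScan
}
```

With the flag on, the model keeps scanning version 1 even though the log has moved on to version 5; with the flag off, it picks up the newer version.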
Internal Properties¶
historicalSnapshotOpt¶
Historical snapshot that is the Snapshot for the versionToUse if defined.
Used when TahoeLogFileIndex is requested for the (historical or latest) snapshot and the schema of the partition columns.