TahoeLogFileIndex¶
TahoeLogFileIndex is a file index.
Creating Instance¶
TahoeLogFileIndex takes the following to be created:
- SparkSession (Spark SQL)
- DeltaLog
- Data directory of the Delta table (as a Hadoop Path)
- Snapshot at analysis
- Partition Filters (as Catalyst expressions; default: empty)
- isTimeTravelQuery flag (default: false)
TahoeLogFileIndex is created when:
- DeltaLog is requested for an Insertable HadoopFsRelation
spark.databricks.delta.checkLatestSchemaOnRead¶
TahoeLogFileIndex uses the spark.databricks.delta.checkLatestSchemaOnRead configuration property when requested for a Snapshot.
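As a minimal sketch, the property can be set and read back through the session's runtime configuration. This assumes a spark-shell session with Delta Lake on the classpath; the value "false" is only an example, not a recommendation.

```scala
// Assumption: `spark` is the active SparkSession (as in spark-shell).
// Set the property for the current session only; "false" is an example value.
spark.conf.set("spark.databricks.delta.checkLatestSchemaOnRead", "false")

// Read the value back to confirm what the session currently uses.
println(spark.conf.get("spark.databricks.delta.checkLatestSchemaOnRead"))
```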
isTimeTravelQuery flag¶
TahoeLogFileIndex is given an isTimeTravelQuery flag when created.
The isTimeTravelQuery flag is false by default and can be true when DeltaLog is requested to create a BaseRelation (when DeltaTableV2 is requested for a BaseRelation based on DeltaTimeTravelSpec).
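A time-travel relation can come from a read that pins a table version. The sketch below uses the versionAsOf reader option of Delta Lake. It assumes the table from the Demo exists and that version 0 is still available. Whether this exact read path sets the flag in a given Delta Lake version is an assumption.

```scala
// Assumption: `spark` is the active SparkSession and a Delta table
// exists at /tmp/delta/users with version 0 still retained.
val v0 = spark.read
  .format("delta")
  .option("versionAsOf", 0) // pin the read to a specific table version
  .load("/tmp/delta/users")

// Scans the files of the pinned version rather than the latest one.
v0.show()
```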
Demo¶
import org.apache.spark.sql.execution.FileSourceScanExec
import org.apache.spark.sql.delta.files.TahoeLogFileIndex

// Load the Delta table and take the physical plan of the query
val q = spark.read.format("delta").load("/tmp/delta/users")
val plan = q.queryExecution.executedPlan

// Find the file scan and access its file index
val scan = plan.collect { case e: FileSourceScanExec => e }.head
val index = scan.relation.location.asInstanceOf[TahoeLogFileIndex]

scala> println(index)
Delta[version=1, file:/tmp/delta/users]
matchingFiles Method¶
matchingFiles(
  partitionFilters: Seq[Expression],
  dataFilters: Seq[Expression],
  keepStats: Boolean = false): Seq[AddFile]
matchingFiles gets the snapshot (with stalenessAcceptable flag off) and requests it for the files to scan (for the index's partition filters, the given partitionFilters and dataFilters).
Note
inputFiles and matchingFiles are similar. Both get the snapshot (of the delta table), but they use different filtering expressions and return value types.
matchingFiles is part of the TahoeFileIndex abstraction.
inputFiles Method¶
inputFiles: Array[String]
inputFiles gets the snapshot (with stalenessAcceptable flag off) and requests it for the files to scan (for the index's partition filters only).
Note
inputFiles and matchingFiles are similar. Both get the snapshot, but they use different filtering expressions and return value types.
inputFiles is part of the FileIndex contract (Spark SQL).
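The difference between the two methods can be seen by calling both on the same index. This is a sketch, assuming the table from the Demo exists and that the scan's file index is a TahoeLogFileIndex, which may not hold in every Delta Lake version. The names noFilters, addFiles and paths are illustrative only.

```scala
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.delta.actions.AddFile
import org.apache.spark.sql.delta.files.TahoeLogFileIndex
import org.apache.spark.sql.execution.FileSourceScanExec

// Assumption: a Delta table exists at this path (same as in the Demo).
val q = spark.read.format("delta").load("/tmp/delta/users")
val scan = q.queryExecution.executedPlan.collect { case e: FileSourceScanExec => e }.head
// Assumption: the scan's file index is a TahoeLogFileIndex, as in the Demo.
val index = scan.relation.location.asInstanceOf[TahoeLogFileIndex]

// matchingFiles: no extra filters given, so only the index's own
// partition filters apply; the result is a sequence of AddFile actions.
val noFilters = Seq.empty[Expression]
val addFiles: Seq[AddFile] = index.matchingFiles(noFilters, noFilters)
addFiles.foreach(f => println(f.path))

// inputFiles: takes no filters and returns file paths as strings.
val paths: Array[String] = index.inputFiles
paths.foreach(println)
```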
Snapshot¶
getSnapshot: Snapshot
getSnapshot returns the Snapshot to scan.
With checkSchemaOnRead enabled or the DeltaColumnMappingMode (of the Metadata of the Snapshot) set (different from NoMapping), getSnapshot makes sure that the schemas are read-compatible (and haven't changed in an incompatible manner since analysis time).
getSnapshot is used when:
- TahoeLogFileIndex is requested for the matching files and the input files
getSnapshotToScan¶
getSnapshotToScan: Snapshot
getSnapshotToScan returns the Snapshot (at analysis) when isTimeTravelQuery is enabled, or otherwise requests the DeltaLog to update and give one.
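The choice described above can be modeled in plain Scala. SnapshotModel and IndexModel are hypothetical stand-ins, not Delta Lake classes, and updateDeltaLog is an assumed placeholder for requesting the DeltaLog to update. The sketch only illustrates one reading of the decision, not the actual implementation.

```scala
// Hypothetical stand-in for a Delta Snapshot (not a Delta Lake class).
final case class SnapshotModel(version: Long)

// Hypothetical model of the index; `updateDeltaLog` stands in for
// requesting the DeltaLog to update and return a snapshot.
final class IndexModel(
    snapshotAtAnalysis: SnapshotModel,
    isTimeTravelQuery: Boolean,
    updateDeltaLog: () => SnapshotModel) {

  // Time-travel queries keep the snapshot chosen at analysis;
  // all other queries ask the log for an updated snapshot.
  def getSnapshotToScan: SnapshotModel =
    if (isTimeTravelQuery) snapshotAtAnalysis else updateDeltaLog()
}

val pinned = new IndexModel(SnapshotModel(1), isTimeTravelQuery = true, () => SnapshotModel(5))
val latest = new IndexModel(SnapshotModel(1), isTimeTravelQuery = false, () => SnapshotModel(5))

println(pinned.getSnapshotToScan.version) // 1
println(latest.getSnapshotToScan.version) // 5
```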
Internal Properties¶
historicalSnapshotOpt¶
Historical snapshot that is the Snapshot for the versionToUse if defined.
Used when TahoeLogFileIndex is requested for the (historical or latest) snapshot and the schema of the partition columns.