TahoeLogFileIndex¶
TahoeLogFileIndex is a file index (a TahoeFileIndex) of a Delta table.
Creating Instance¶
TahoeLogFileIndex takes the following to be created:
- SparkSession
- DeltaLog
- Data directory of the Delta table (as a Hadoop Path)
- Schema at analysis (StructType)
- Catalyst Expressions for the partition filters (default: empty)
- Snapshot version (Option[Long], default: undefined)
TahoeLogFileIndex is created when DeltaLog is requested for an Insertable HadoopFsRelation.
Demo¶
// Load a Delta table (assumes one already exists at /tmp/delta/users)
val q = spark.read.format("delta").load("/tmp/delta/users")
val plan = q.queryExecution.executedPlan

// Find the file scan operator in the physical plan
import org.apache.spark.sql.execution.FileSourceScanExec
val scan = plan.collect { case e: FileSourceScanExec => e }.head

// The file index (location) of the scanned relation is a TahoeLogFileIndex
import org.apache.spark.sql.delta.files.TahoeLogFileIndex
val index = scan.relation.location.asInstanceOf[TahoeLogFileIndex]

scala> println(index)
Delta[version=1, file:/tmp/delta/users]
matchingFiles Method¶
matchingFiles(
partitionFilters: Seq[Expression],
dataFilters: Seq[Expression],
keepStats: Boolean = false): Seq[AddFile]
matchingFiles gets the snapshot (with stalenessAcceptable flag off) and requests it for the files to scan (for the index's partition filters, the given partitionFilters and dataFilters).
Note
inputFiles and matchingFiles are similar. Both get the snapshot (of the delta table), but they use different filtering expressions and return value types.
matchingFiles is part of the TahoeFileIndex abstraction.
inputFiles Method¶
inputFiles: Array[String]
inputFiles gets the snapshot (with stalenessAcceptable flag off) and requests it for the files to scan (for the index's partition filters only).
Note
inputFiles and matchingFiles are similar. Both get the snapshot, but they use different filtering expressions and return value types.
inputFiles is part of the FileIndex contract (Spark SQL).
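The difference between inputFiles and matchingFiles can be illustrated with a small, self-contained model. This is only a sketch: AddFile, SnapshotModel, LogFileIndexModel and filesForScan below are simplified, hypothetical stand-ins (not the Delta Lake classes), and filters are modeled as plain predicates over files rather than Catalyst Expressions.

```scala
object FileIndexSketch {
  // Hypothetical, simplified stand-in for a file entry of a Delta table.
  final case class AddFile(path: String, partitionValues: Map[String, String])

  // Hypothetical snapshot: keeps the files that satisfy every given filter.
  // Filters are plain predicates here (an assumption of this sketch).
  final class SnapshotModel(allFiles: Seq[AddFile]) {
    def filesForScan(filters: Seq[AddFile => Boolean]): Seq[AddFile] =
      allFiles.filter(file => filters.forall(accepts => accepts(file)))
  }

  final class LogFileIndexModel(
      snapshot: SnapshotModel,
      indexPartitionFilters: Seq[AddFile => Boolean]) {

    // The index's own partition filters plus the given partition and data filters.
    def matchingFiles(
        partitionFilters: Seq[AddFile => Boolean],
        dataFilters: Seq[AddFile => Boolean]): Seq[AddFile] =
      snapshot.filesForScan(indexPartitionFilters ++ partitionFilters ++ dataFilters)

    // The index's own partition filters only; returns paths, not AddFiles.
    def inputFiles: Array[String] =
      snapshot.filesForScan(indexPartitionFilters).map(_.path).toArray
  }

  val files = Seq(
    AddFile("country=PL/part-0.parquet", Map("country" -> "PL")),
    AddFile("country=PL/part-1.parquet", Map("country" -> "PL")),
    AddFile("country=US/part-0.parquet", Map("country" -> "US")))

  // The index was created with a partition filter on country.
  val onlyPoland: AddFile => Boolean = _.partitionValues.get("country").contains("PL")
  val index = new LogFileIndexModel(new SnapshotModel(files), Seq(onlyPoland))

  // Both PL files, as path strings.
  val paths: Array[String] = index.inputFiles

  // An extra (made-up) data filter narrows the result to a single AddFile.
  val matched: Seq[AddFile] =
    index.matchingFiles(Nil, Seq((f: AddFile) => f.path.endsWith("part-1.parquet")))
}
```

In this model both methods go through the same snapshot lookup; they differ only in which filters are combined and in whether file entries or their paths are returned.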
Historical Or Latest Snapshot¶
getSnapshot(
stalenessAcceptable: Boolean): Snapshot
getSnapshot returns the historical snapshot (when the snapshot version is specified). Otherwise, getSnapshot requests the DeltaLog to update and returns the resulting (latest) Snapshot.
getSnapshot is used when TahoeLogFileIndex is requested for the matching files and the input files.
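The "historical or latest" choice can be sketched as an Option fallback. The following is a minimal model under stated assumptions, not the Delta Lake implementation: Snapshot, DeltaLogModel, LogFileIndexModel and the commit helper are hypothetical stand-ins, and the stalenessAcceptable flag is accepted but ignored.

```scala
object SnapshotSketch {
  // Hypothetical stand-in: a snapshot is identified by its version only.
  final case class Snapshot(version: Long)

  // Hypothetical stand-in for a transaction log that tracks the latest version.
  final class DeltaLogModel(private var latestVersion: Long) {
    // Simulates a new commit to the table.
    def commit(): Unit = latestVersion = latestVersion + 1
    // Returns the latest snapshot (the flag is ignored in this model).
    def update(stalenessAcceptable: Boolean): Snapshot = Snapshot(latestVersion)
    // Returns the snapshot at the given version.
    def getSnapshotAt(version: Long): Snapshot = Snapshot(version)
  }

  final class LogFileIndexModel(log: DeltaLogModel, versionToUse: Option[Long]) {
    // Defined only when a snapshot version was given; resolved once, on first use.
    lazy val historicalSnapshotOpt: Option[Snapshot] =
      versionToUse.map(version => log.getSnapshotAt(version))

    // The historical snapshot if defined, else the log is asked to update.
    def getSnapshot(stalenessAcceptable: Boolean): Snapshot =
      historicalSnapshotOpt.getOrElse(log.update(stalenessAcceptable))
  }

  val log = new DeltaLogModel(1L)
  val pinned = new LogFileIndexModel(log, versionToUse = Some(0L))
  val latest = new LogFileIndexModel(log, versionToUse = None)
}
```

In this model an index created with a snapshot version keeps returning that version even after SnapshotSketch.log.commit(), while an index without one follows the log to the newest version.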
Internal Properties¶
historicalSnapshotOpt¶
Historical snapshot, i.e. the Snapshot for the versionToUse (if defined).
Used when TahoeLogFileIndex is requested for the (historical or latest) snapshot and the schema of the partition columns.