

= TahoeFileIndex -- Indices Of Files Of Delta Table
:navtitle: TahoeFileIndex

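TahoeFileIndex is an <<contract, extension>> of the Spark SQL FileIndex contract for <<implementations, file indices>> of delta tables that can <<listFiles, list data files>> to scan (based on <<matchingFiles, partition and data predicates>>).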

TIP: Read up on https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-FileIndex.html[FileIndex] in https://bit.ly/spark-sql-internals[The Internals of Spark SQL] online book.

NOTE: TahoeFileIndex aims to reduce the very expensive disk access that looking up file-related information through the Hadoop FileSystem API requires.

[[contract]]
.TahoeFileIndex Contract (Abstract Methods Only)
[cols="30m,70",options="header",width="100%"]
|===
| Method
| Description

| matchingFiles a| [[matchingFiles]]

[source, scala]
----
matchingFiles(
  partitionFilters: Seq[Expression],
  dataFilters: Seq[Expression],
  keepStats: Boolean = false): Seq[AddFile]
----

Files (link:AddFile.md[AddFiles]) that match the given partition and data predicates

Used for <<listFiles, listing data files>>

|===
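The contract can be illustrated with a minimal, self-contained sketch. It is not Delta Lake code: `FileIndexModel`, `InMemoryFileIndex`, `PartitionPredicate` and the simplified `AddFile` are hypothetical stand-ins, filters are plain Scala predicates rather than Catalyst `Expression`s, and the `dataFilters` and `keepStats` parameters are left out.

```scala
object ContractSketch {
  // Hypothetical stand-in for Delta's AddFile action (only a few fields).
  final case class AddFile(path: String, partitionValues: Map[String, String], size: Long)

  // Assumption: a partition filter is modelled as a predicate on partition values.
  type PartitionPredicate = Map[String, String] => Boolean

  // One abstract method, mirroring the shape of the contract above.
  abstract class FileIndexModel {
    def matchingFiles(partitionFilters: Seq[PartitionPredicate]): Seq[AddFile]
  }

  // Toy implementation over a fixed list of files.
  class InMemoryFileIndex(files: Seq[AddFile]) extends FileIndexModel {
    override def matchingFiles(partitionFilters: Seq[PartitionPredicate]): Seq[AddFile] =
      // A file matches when every partition predicate accepts its partition values.
      files.filter(file => partitionFilters.forall(p => p(file.partitionValues)))
  }

  val index = new InMemoryFileIndex(Seq(
    AddFile("date=2020-10-04/part-0.parquet", Map("date" -> "2020-10-04"), 100L),
    AddFile("date=2020-10-05/part-0.parquet", Map("date" -> "2020-10-05"), 200L)))

  val onDate: PartitionPredicate = _.get("date").contains("2020-10-05")
  val matched: Seq[AddFile] = index.matchingFiles(Seq(onDate))
}
```

An empty list of predicates matches every file, which is the behaviour one would expect from a scan without partition filters.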

[[rootPaths]] When requested for the root input paths (rootPaths), TahoeFileIndex simply gives the <<path, path>>.

[[implementations]]
.TahoeFileIndices
[cols="30,70",options="header",width="100%"]
|===
| TahoeFileIndex
| Description

| TahoeBatchFileIndex
| [[TahoeBatchFileIndex]]

| TahoeLogFileIndex
| [[TahoeLogFileIndex]]

|===

== [[creating-instance]] Creating TahoeFileIndex Instance

TahoeFileIndex takes the following to be created:

* [[spark]] SparkSession
* [[deltaLog]] DeltaLog
* [[path]] Hadoop Path

NOTE: TahoeFileIndex is a Scala abstract class and cannot be <<creating-instance, created>> directly. It is created indirectly for the <<implementations, concrete TahoeFileIndices>>.

== [[tableVersion]] Version of Delta Table -- tableVersion Method

[source, scala]
----
tableVersion: Long
----

tableVersion is simply the version of (the snapshot of) the <<deltaLog, DeltaLog>>.

NOTE: tableVersion is used when TahoeFileIndex is requested for the <<toString, human-friendly textual representation>>.

== [[listFiles]] Listing Data Files -- listFiles Method

[source, scala]
----
listFiles(
  partitionFilters: Seq[Expression],
  dataFilters: Seq[Expression]): Seq[PartitionDirectory]
----

NOTE: listFiles is part of the FileIndex contract for the file names (grouped into partitions when the data is partitioned).

listFiles...FIXME
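The source leaves the details of listFiles open. The sketch below only illustrates the idea stated in the contract note (file names grouped into partitions). It assumes that the files were already selected by <<matchingFiles, matchingFiles>> and that they are grouped by their partition values. `AddFile` and `PartitionDirectory` are simplified, hypothetical stand-ins, not the Delta Lake and Spark SQL classes.

```scala
object ListFilesSketch {
  // Hypothetical stand-ins for Delta's AddFile and Spark SQL's PartitionDirectory.
  final case class AddFile(path: String, partitionValues: Map[String, String], size: Long)
  final case class PartitionDirectory(values: Map[String, String], files: Seq[String])

  // Assumption: one PartitionDirectory per distinct set of partition values.
  def listFiles(matching: Seq[AddFile]): Seq[PartitionDirectory] =
    matching
      .groupBy(_.partitionValues)
      .map { case (values, files) => PartitionDirectory(values, files.map(_.path)) }
      .toSeq

  val dirs: Seq[PartitionDirectory] = listFiles(Seq(
    AddFile("date=2020-10-04/part-0.parquet", Map("date" -> "2020-10-04"), 100L),
    AddFile("date=2020-10-05/part-0.parquet", Map("date" -> "2020-10-05"), 200L),
    AddFile("date=2020-10-05/part-1.parquet", Map("date" -> "2020-10-05"), 300L)))
}
```

With an unpartitioned table every file carries empty partition values, so the grouping yields a single directory.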

== [[partitionSchema]] Partition Schema -- partitionSchema Method

[source, scala]
----
partitionSchema: StructType
----

NOTE: partitionSchema is part of the FileIndex contract for the partition schema.

partitionSchema simply requests the <<deltaLog, DeltaLog>> for the Snapshot and then requests the Snapshot for the metadata that in turn is requested for the partition schema.
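The delegation can be pictured with a small model. It assumes the chain runs from the DeltaLog to its Snapshot, then to the table metadata, and finally to the partition schema. All classes are hypothetical simplifications, and a schema is reduced to a list of column names instead of a Spark SQL `StructType`.

```scala
object PartitionSchemaSketch {
  // Hypothetical, heavily simplified stand-ins for the Delta Lake classes.
  final case class Metadata(partitionSchema: Seq[String])
  final case class Snapshot(version: Long, metadata: Metadata)
  final case class DeltaLog(snapshot: Snapshot)

  class FileIndexModel(deltaLog: DeltaLog) {
    // DeltaLog -> Snapshot -> Metadata -> partition schema
    def partitionSchema: Seq[String] = deltaLog.snapshot.metadata.partitionSchema
  }

  val index = new FileIndexModel(DeltaLog(Snapshot(3L, Metadata(Seq("date")))))
  val schema: Seq[String] = index.partitionSchema
}
```

Nothing is cached in this model, so the schema always reflects whatever snapshot the log currently exposes.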

== [[toString]] Human-Friendly Textual Representation -- toString Method

[source, scala]
----
toString: String
----

NOTE: toString is part of the java.lang.Object contract for a string representation of the object.

toString returns the following text (based on the <<tableVersion, table version>> and the <<path, path>> truncated to 100 characters):

Delta[version=[tableVersion], [truncatedPath]]
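The format can be reproduced with a few lines of plain Scala. This is a sketch under assumptions: `describe` and `truncate` are hypothetical helpers, the path `/tmp/delta/users` is made up, and the source does not say which end of a long path is kept, so the sketch simply keeps the first 100 characters.

```scala
object ToStringSketch {
  // Assumption: keep the first `len` characters and add no marker.
  def truncate(s: String, len: Int): String =
    if (s.length <= len) s else s.take(len)

  // Builds the text in the format shown above.
  def describe(tableVersion: Long, path: String): String =
    s"Delta[version=$tableVersion, ${truncate(path, 100)}]"

  val text: String = describe(3L, "/tmp/delta/users")
}
```

A path of 100 characters or fewer appears unchanged in the result.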

Last update: 2020-10-05