FileIndex

FileIndex is an abstraction of file indices that track the root paths and the partition schema of the files that make up a relation.

FileIndex is an optimization used with HadoopFsRelation to avoid expensive file listings (especially on object stores like Amazon S3 or Google Cloud Storage).
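As a rough illustration of where a FileIndex sits, the following sketch writes a small partitioned Parquet table and pulls the FileIndex out of the analyzed plan of a query over it. The scratch directory, the `part` column and the `local[*]` session are assumptions made for the example, and it also assumes Parquet is read through the V1 file source (so that the plan contains a LogicalRelation over a HadoopFsRelation).

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.{FileIndex, HadoopFsRelation, LogicalRelation}

val spark = SparkSession.builder().master("local[*]").appName("file-index-demo").getOrCreate()

// Hypothetical scratch location holding a tiny table partitioned by `part`
val tablePath = Files.createTempDirectory("file-index-demo").resolve("t").toString
spark.range(10).selectExpr("id", "id % 2 AS part").write.partitionBy("part").parquet(tablePath)

// A HadoopFsRelation exposes its FileIndex as `location`
val fileIndex: FileIndex = spark.read.parquet(tablePath).queryExecution.analyzed.collectFirst {
  case l: LogicalRelation if l.relation.isInstanceOf[HadoopFsRelation] =>
    l.relation.asInstanceOf[HadoopFsRelation].location
}.get

// The members of the contract described on this page
println(s"rootPaths:       ${fileIndex.rootPaths.mkString(", ")}")
println(s"partitionSchema: ${fileIndex.partitionSchema.simpleString}")
println(s"inputFiles:      ${fileIndex.inputFiles.length} file(s)")
println(s"sizeInBytes:     ${fileIndex.sizeInBytes}")
```

The snippet can be pasted into `spark-shell` (where `getOrCreate()` reuses the existing session). The exact values printed depend on the environment.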

Contract

Input Files

inputFiles: Array[String]

File names to read when scanning this relation

Used when:

Listing Files

listFiles(
  partitionFilters: Seq[Expression],
  dataFilters: Seq[Expression]): Seq[PartitionDirectory]

File names (grouped into partitions when the data is partitioned)
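A minimal sketch of calling `listFiles` directly, assuming an InMemoryFileIndex built by hand over a hypothetical scratch table (the directory, the `part` column and the four-argument constructor call are assumptions that match Spark 2.4 and 3.x as far as I know). Empty filter lists are passed, so no partition pruning happens and every partition is returned.

```scala
import java.nio.file.Files
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.InMemoryFileIndex

val spark = SparkSession.builder().master("local[*]").appName("list-files-demo").getOrCreate()

// Hypothetical scratch table with two partition directories: part=0 and part=1
val tablePath = Files.createTempDirectory("list-files-demo").resolve("t").toString
spark.range(10).selectExpr("id", "id % 2 AS part").write.partitionBy("part").parquet(tablePath)

// Qualify the root path (scheme and authority) before handing it to the index
val root = new Path(tablePath)
val qualifiedRoot = root.getFileSystem(spark.sparkContext.hadoopConfiguration).makeQualified(root)

val index = new InMemoryFileIndex(spark, Seq(qualifiedRoot), Map.empty[String, String], None)

// No partition or data filters: all partitions and all their files are listed
val partitions = index.listFiles(Nil, Nil)
partitions.foreach { p =>
  val values = p.values.toSeq(index.partitionSchema).mkString(", ")
  println(s"partition values [$values]: ${p.files.size} file(s)")
}
```

Each PartitionDirectory pairs the partition values (as an internal row) with the files found under that partition directory.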

Used when:

Metadata Duration

metadataOpsTimeNs: Option[Long] = None

Metadata operation time for listing files (in nanoseconds)

Used when FileSourceScanExec physical operator is requested for partitions

Partitions

partitionSchema: StructType

Schema of the partition columns (as a StructType)

Used when:

Refreshing Cached File Listings

refresh(): Unit

Refreshes the file listings that may have been cached
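The effect of `refresh` can be sketched with an InMemoryFileIndex, which (as far as I know) lists files once when it is created and keeps that listing in memory. The scratch directory and the hand-built index are assumptions for illustration only.

```scala
import java.nio.file.Files
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.InMemoryFileIndex

val spark = SparkSession.builder().master("local[*]").appName("refresh-demo").getOrCreate()

// Hypothetical scratch table (not partitioned)
val tablePath = Files.createTempDirectory("refresh-demo").resolve("t").toString
spark.range(10).write.parquet(tablePath)

val root = new Path(tablePath)
val qualifiedRoot = root.getFileSystem(spark.sparkContext.hadoopConfiguration).makeQualified(root)
val index = new InMemoryFileIndex(spark, Seq(qualifiedRoot), Map.empty[String, String], None)

val before = index.inputFiles.length

// New files arrive after the index was built
spark.range(10).write.mode("append").parquet(tablePath)

// The cached listing does not see the new files yet
val stale = index.inputFiles.length

// refresh() re-lists the root paths
index.refresh()
val after = index.inputFiles.length

println(s"before=$before, stale=$stale, after=$after")
```

The file counts themselves depend on how many write tasks produce output, so only their relative order is meaningful here.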

Used when:

Root Paths

rootPaths: Seq[Path]

Root paths from which the catalog gets the files (as Hadoop Paths). There can be a single root path for the entire table (with partition directories underneath) or one root path per individual partition.

Used when:

Estimated Size

sizeInBytes: Long

Estimated size of the data of the relation (in bytes)

Used when:

Implementations