FileIndex

FileIndex is an abstraction of file indices that track the root paths and the partition schema of the files that make up a relation.

FileIndex is an optimization technique used with a HadoopFsRelation to avoid expensive file listings (especially on object stores such as Amazon S3 or Google Cloud Storage).

Contract

Input Files

inputFiles: Array[String]

File names to read when scanning this relation

Used when:

Dataset is requested for inputFiles
HadoopFsRelation is requested for input files

Listing Files

listFiles(
  partitionFilters: Seq[Expression],
  dataFilters: Seq[Expression]): Seq[PartitionDirectory]

File names (grouped into partitions when the data is partitioned)

Used when:

HiveMetastoreCatalog is requested to convert a HiveTableRelation to a LogicalRelation
FileSourceScanExec physical operator is requested for selectedPartitions
OptimizeMetadataOnlyQuery logical optimization is executed
FileScan is requested for partitions
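
To make the idea of listing with partition filters more concrete, here is a minimal in-memory sketch in plain Scala. It does not use Spark's own types: `PartitionDir`, `ListingSketch`, and the predicate-based filter are hypothetical stand-ins for `PartitionDirectory` and Catalyst `Expression`s, and the table layout is made up for illustration.

```scala
// Hypothetical stand-in for Spark's PartitionDirectory:
// the partition values plus the files found in that partition.
final case class PartitionDir(values: Map[String, String], files: Seq[String])

object ListingSketch {
  // All known partitions of an imaginary table partitioned by `year`.
  val partitions: Seq[PartitionDir] = Seq(
    PartitionDir(Map("year" -> "2023"), Seq("/t/year=2023/part-0.parquet")),
    PartitionDir(
      Map("year" -> "2024"),
      Seq("/t/year=2024/part-0.parquet", "/t/year=2024/part-1.parquet"))
  )

  // Assumption: a partition filter is modelled as a plain predicate over
  // partition values (the real contract takes Catalyst Expressions).
  def listFiles(partitionFilter: Map[String, String] => Boolean): Seq[PartitionDir] =
    partitions.filter(p => partitionFilter(p.values))

  // Flattening every partition gives the file names that an
  // inputFiles-like method would report.
  def inputFiles: Array[String] = partitions.flatMap(_.files).toArray
}
```

With this sketch, `ListingSketch.listFiles(v => v("year") == "2024")` keeps only the 2024 partition, so the files of the 2023 partition never have to be touched.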

Metadata Duration

metadataOpsTimeNs: Option[Long] = None

Metadata operation time for listing files (in nanoseconds)

Used when FileSourceScanExec physical operator is requested for partitions
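
The duration is expressed in nanoseconds, which is the unit `System.nanoTime` works in. The following is a minimal sketch of how an implementation could capture such a value; `TimingSketch`, `timed`, and `listing` are hypothetical names and not part of Spark.

```scala
object TimingSketch {
  // Runs `body` and returns its result together with the elapsed time in nanoseconds.
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, System.nanoTime() - start)
  }

  // Hypothetical listing function standing in for an expensive file-system call.
  def listing(): Seq[String] = Seq("part-0.parquet", "part-1.parquet")

  // An index that measures its listing could expose Some(elapsed);
  // one that does not measure anything keeps the default None.
  val (files, elapsedNs) = timed(listing())
  val metadataOpsTimeNs: Option[Long] = Some(elapsedNs)
}
```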

Partitions

partitionSchema: StructType

Partition schema (StructType)

Used when:

DataSource is requested to getOrInferFileFormatSchema and resolve a FileFormat-based relation
FallBackFileSourceV2 logical resolution rule is executed
FileScanBuilder is created
FileTable is requested for dataSchema and partitioning
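
Partition columns of file-based tables are commonly encoded in directory names of the form `column=value` (the Hive-style layout). As an illustration only, the sketch below recovers such pairs from a path in plain Scala; `PartitionPathSketch` and `partitionValues` are hypothetical names, and the sketch ignores value escaping and type inference.

```scala
object PartitionPathSketch {
  // Extracts (column, value) pairs from path segments of the form column=value.
  // Segments without '=' (table directory, file name) are skipped.
  def partitionValues(path: String): Seq[(String, String)] =
    path.split('/').toSeq.flatMap { segment =>
      segment.split("=", 2) match {
        case Array(name, value) if name.nonEmpty => Some(name -> value)
        case _                                   => None
      }
    }
}
```

The column names found this way (`year` and `month` for a path such as `/t/year=2024/month=01/part-0.parquet`) are the kind of fields a partition schema describes.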

Refreshing Cached File Listings

refresh(): Unit

Refreshes the file listings that may have been cached

Used when:

CacheManager is requested to recacheByPath
InsertIntoHadoopFsRelationCommand is executed
LogicalRelation logical operator is requested to refresh (for a HadoopFsRelation)
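
The interplay between a cached listing and `refresh` can be sketched in a few lines of plain Scala. `CachingListing` is a hypothetical class (not a Spark type) that wraps an arbitrary listing function; it is not thread-safe and only illustrates the caching idea.

```scala
// Hypothetical index that caches the result of an expensive listing function.
final class CachingListing(list: () => Seq[String]) {
  private var cached: Option[Seq[String]] = None

  // Lists on first use and reuses the cached result on later calls.
  def files: Seq[String] = cached.getOrElse {
    val listed = list()
    cached = Some(listed)
    listed
  }

  // Drops the cached listing so that the next call to `files` lists again.
  def refresh(): Unit = cached = None
}
```

Without a call to `refresh()`, files written after the first listing would stay invisible to readers of `files`.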

Root Paths

rootPaths: Seq[Path]

Root paths from which the catalog gets the files (as Hadoop Paths). There can be a single root path for the entire table (with partition directories underneath) or one root path per individual partition.

Used when:

HiveMetastoreCatalog is requested for a cached LogicalRelation (when requested to convert a HiveTableRelation)
OptimizedCreateHiveTableAsSelectCommand is executed
CacheManager is requested to recache by path
FileSourceScanExec physical operator is requested for the metadata and verboseStringWithOperatorId
DDLUtils utility is used to verifyNotReadPath
DataSourceAnalysis logical resolution rule is executed (for an InsertIntoStatement over a HadoopFsRelation)
FileScan is requested for a description

Estimated Size

sizeInBytes: Long

Estimated size of the data of the relation (in bytes)

Used when:

HadoopFsRelation is requested for the estimated size
FileScan is requested for statistics
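
One plausible way to arrive at such an estimate is to add up the lengths of all listed files. The sketch below assumes exactly that; `FileEntry` and `SizeSketch` are hypothetical names, and a concrete FileIndex may compute the estimate differently.

```scala
// Hypothetical file entry: a path and its length in bytes.
final case class FileEntry(path: String, length: Long)

object SizeSketch {
  // Estimates the relation size as the sum of the lengths of all listed files.
  def sizeInBytes(files: Seq[FileEntry]): Long = files.map(_.length).sum
}
```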
