FileIndex¶
FileIndex is an abstraction of file indices for root paths and partition schema that make up a relation.
FileIndex is an optimization technique that is used with a HadoopFsRelation to avoid expensive file listings (especially on object stores like Amazon S3 or Google Cloud Storage).
Contract¶
Input Files¶
inputFiles: Array[String]
File names to read when scanning this relation
Used when:
- Dataset is requested for inputFiles
- HadoopFsRelation is requested for input files
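The Dataset entry point above is the public `Dataset.inputFiles` method. The following is a minimal sketch of calling it, assuming a local Spark session is available; the object name, the helper `writeAndListInputFiles` and the temporary directory are made up for illustration.

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession

object InputFilesDemo {

  // Hypothetical helper: writes a small Parquet dataset to a fresh temporary
  // directory and returns the input files Spark reports when reading it back.
  def writeAndListInputFiles(spark: SparkSession): Array[String] = {
    val path = Files.createTempDirectory("input-files-demo").resolve("table").toString
    spark.range(10).write.parquet(path)
    spark.read.parquet(path).inputFiles
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("input-files-demo")
      .getOrCreate()
    try writeAndListInputFiles(spark).foreach(println)
    finally spark.stop()
  }
}
```

The exact file names depend on the Spark version and the number of write tasks, so the sketch prints whatever is reported rather than assuming a layout.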
Listing Files¶
listFiles(
partitionFilters: Seq[Expression],
dataFilters: Seq[Expression]): Seq[PartitionDirectory]
File names (grouped into partitions when the data is partitioned)
Used when:
- HiveMetastoreCatalog is requested to convert a HiveTableRelation to a LogicalRelation
- FileSourceScanExec physical operator is requested for selectedPartitions
- OptimizeMetadataOnlyQuery logical optimization is executed
- FileScan is requested for partitions
Metadata Duration¶
metadataOpsTimeNs: Option[Long] = None
Metadata operation time for listing files (in nanoseconds)
Used when FileSourceScanExec physical operator is requested for partitions
Partitions¶
partitionSchema: StructType
Partition schema (StructType)
Used when:
- DataSource is requested to getOrInferFileFormatSchema and resolve a FileFormat-based relation
- FallBackFileSourceV2 logical resolution rule is executed
- FileScanBuilder is created
- FileTable is requested for dataSchema and partitioning
Refreshing Cached File Listings¶
refresh(): Unit
Refreshes the file listings that may have been cached
Used when:
- CacheManager is requested to recacheByPath
- InsertIntoHadoopFsRelationCommand is executed
- LogicalRelation logical operator is requested to refresh (for a HadoopFsRelation)
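To illustrate why a cached listing needs an explicit refresh, here is a toy model in plain Scala (2.13 or later, because of `scala.jdk.CollectionConverters`). It is not Spark code and does not mirror how any Spark FileIndex is implemented: the class `CachedListing` and its `listings` counter are hypothetical names used only to make the caching visible.

```scala
import java.nio.file.{Files, Path}

import scala.jdk.CollectionConverters._

// Hypothetical toy class (not part of Spark): caches a recursive listing of one root path.
final class CachedListing(root: Path) {
  // Counts how many times the file system was actually walked.
  var listings: Int = 0
  private var cached: Option[Seq[Path]] = None

  // The "expensive" operation: walks the whole directory tree under root.
  private def listLeafFiles(): Seq[Path] = {
    listings += 1
    val stream = Files.walk(root)
    try stream.iterator().asScala.filter(p => Files.isRegularFile(p)).toList.sortBy(_.toString)
    finally stream.close()
  }

  // Serves the cached listing, walking the file system only when nothing is cached.
  def files: Seq[Path] = {
    if (cached.isEmpty) cached = Some(listLeafFiles())
    cached.get
  }

  // Drops the cached listing so that the next call to files walks the file system again.
  def refresh(): Unit = {
    cached = None
  }
}

object CachedListingDemo {
  def main(args: Array[String]): Unit = {
    val dir = Files.createTempDirectory("cached-listing-demo")
    Files.createFile(dir.resolve("part-0.txt"))

    val listing = new CachedListing(dir)
    println(listing.files.size)   // walks the directory
    println(listing.files.size)   // served from the cache, no second walk

    Files.createFile(dir.resolve("part-1.txt"))
    println(listing.files.size)   // still the stale cached listing

    listing.refresh()
    println(listing.files.size)   // walks again and sees the new file
  }
}
```

The trade-off shown here is the one the contract implies: repeated reads are cheap, but changes made outside the index stay invisible until `refresh()` is called.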
Root Paths¶
rootPaths: Seq[Path]
Root paths from which the catalog gets the files (as Hadoop Paths). There could be a single root path of the entire table (with partition directories) or individual partitions.
Used when:
- HiveMetastoreCatalog is requested for a cached LogicalRelation (when requested to convert a HiveTableRelation)
- OptimizedCreateHiveTableAsSelectCommand is executed
- CacheManager is requested to recache by path
- FileSourceScanExec physical operator is requested for the metadata and verboseStringWithOperatorId
- DDLUtils utility is used to verifyNotReadPath
- DataSourceAnalysis logical resolution rule is executed (for an InsertIntoStatement over a HadoopFsRelation)
- FileScan is requested for a description
Estimated Size¶
sizeInBytes: Long
Estimated size of the data of the relation (in bytes)
Used when:
- HadoopFsRelation is requested for the estimated size
- FileScan is requested for statistics
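The contract members above can be inspected on a real relation. The sketch below assumes a local Spark session and that Parquet is read through the V1 file source (so the analyzed plan contains a LogicalRelation over a HadoopFsRelation). It relies on internal classes from `org.apache.spark.sql.execution.datasources`, which are not a stable public API and may differ between Spark versions. The object name, the helper `fileIndexOf` and the `bucket` column are hypothetical.

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.{FileIndex, HadoopFsRelation, LogicalRelation}
import org.apache.spark.sql.functions.col

object FileIndexContractDemo {

  // Hypothetical helper: finds the FileIndex of the first file-based relation
  // in the analyzed plan of a Parquet read, if there is one.
  def fileIndexOf(spark: SparkSession, path: String): Option[FileIndex] = {
    val plan = spark.read.parquet(path).queryExecution.analyzed
    plan.collectFirst {
      case l: LogicalRelation if l.relation.isInstanceOf[HadoopFsRelation] =>
        l.relation.asInstanceOf[HadoopFsRelation].location
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("file-index-contract-demo")
      .getOrCreate()
    try {
      val path = Files.createTempDirectory("file-index-contract-demo").resolve("table").toString
      // Write a small dataset partitioned by an illustrative "bucket" column.
      spark.range(10)
        .withColumn("bucket", col("id") % 2)
        .write
        .partitionBy("bucket")
        .parquet(path)

      fileIndexOf(spark, path).foreach { index =>
        println(s"rootPaths:       ${index.rootPaths.mkString(", ")}")
        println(s"partitionSchema: ${index.partitionSchema.simpleString}")
        println(s"sizeInBytes:     ${index.sizeInBytes}")
        println(s"inputFiles:      ${index.inputFiles.length}")
        // Empty filter lists ask for every partition directory.
        println(s"partition dirs:  ${index.listFiles(Nil, Nil).size}")
      }
    } finally spark.stop()
  }
}
```

The helper returns an `Option` because a plan does not have to contain a file-based relation at all, for example when the source is resolved through the V2 code path.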