FileIndex¶
FileIndex is an abstraction of file indices for root paths and partition schema that make up a relation.
FileIndex is an optimization technique that is used with a HadoopFsRelation to avoid expensive file listings (especially on object stores like Amazon S3 or Google Cloud Storage).
Contract¶
Input Files¶
inputFiles: Array[String]
File names to read when scanning this relation
Used when:
- Dataset is requested for inputFiles
- HadoopFsRelation is requested for input files
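A minimal sketch of the first caller, `Dataset.inputFiles`, meant to be pasted into `spark-shell` or run with Spark SQL on the classpath. The scratch directory `demoPath` and the application name are hypothetical and exist only for this demo.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Local session for the demo; in spark-shell this returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("input-files-demo").getOrCreate()

// Hypothetical scratch location, used only for this example.
val demoPath = Files.createTempDirectory("fileindex-demo").resolve("data").toString

// Write a tiny parquet dataset so that there is something to list.
spark.range(10).write.parquet(demoPath)

// Dataset.inputFiles returns the files that make up the Dataset.
val files: Array[String] = spark.read.parquet(demoPath).inputFiles
files.foreach(println)
```

The result is a best-effort snapshot: files added to the directory later are not reflected in an array that has already been returned.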
Listing Files¶
listFiles(
partitionFilters: Seq[Expression],
dataFilters: Seq[Expression]): Seq[PartitionDirectory]
File names (grouped into partitions when the data is partitioned)
Used when:
- HiveMetastoreCatalog is requested to convert a HiveTableRelation to a LogicalRelation
- FileSourceScanExec physical operator is requested for selectedPartitions
- OptimizeMetadataOnlyQuery logical optimization is executed
- FileScan is requested for partitions
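The grouping into partitions can be observed on a small partitioned dataset. The following is a sketch under a few assumptions: the parquet source is read through the V1 code path (the default), so the analyzed plan contains a LogicalRelation over a HadoopFsRelation; `demoPath` and the `bucket` column are hypothetical; and these classes are internal to Spark SQL, so details may differ between versions.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.{FileIndex, HadoopFsRelation, LogicalRelation}

val spark = SparkSession.builder().master("local[*]").appName("list-files-demo").getOrCreate()

// Hypothetical scratch location, used only for this example.
val demoPath = Files.createTempDirectory("fileindex-demo").resolve("by_bucket").toString

// Creates two partition directories: bucket=0 and bucket=1.
spark.range(10)
  .selectExpr("id", "id % 2 AS bucket")
  .write.partitionBy("bucket").parquet(demoPath)

// Pull the FileIndex out of the analyzed logical plan.
val fileIndex: FileIndex = spark.read.parquet(demoPath)
  .queryExecution.analyzed
  .collectFirst { case lr: LogicalRelation => lr.relation }
  .collect { case fs: HadoopFsRelation => fs.location }
  .getOrElse(sys.error("Not a file-based relation"))

// No partition or data filters: every partition directory is returned.
val partitions = fileIndex.listFiles(Nil, Nil)
partitions.foreach { p => println(s"${p.values} -> ${p.files.size} file(s)") }
```

Passing non-empty `partitionFilters` is what lets a FileIndex skip whole partition directories instead of listing them.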
Metadata Duration¶
metadataOpsTimeNs: Option[Long] = None
Metadata operation time for listing files (in nanoseconds)
Used when FileSourceScanExec physical operator is requested for partitions
Partitions¶
partitionSchema: StructType
Partition schema (StructType)
Used when:
- DataSource is requested to getOrInferFileFormatSchema and resolve a FileFormat-based relation
- FallBackFileSourceV2 logical resolution rule is executed
- FileScanBuilder is created
- FileTable is requested for dataSchema and partitioning
Refreshing Cached File Listings¶
refresh(): Unit
Refreshes the file listings that may have been cached
Used when:
- CacheManager is requested to recacheByPath
- InsertIntoHadoopFsRelationCommand is executed
- LogicalRelation logical operator is requested to refresh (for a HadoopFsRelation)
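The effect of `refresh` can be illustrated with InMemoryFileIndex, one FileIndex implementation that keeps its listing in memory. This is a sketch only: the four-argument constructor call below matches the Spark versions I am aware of but belongs to an internal API and may change, and `demoPath` is a hypothetical scratch directory.

```scala
import java.nio.file.Files
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.InMemoryFileIndex

val spark = SparkSession.builder().master("local[*]").appName("refresh-demo").getOrCreate()

// Hypothetical scratch location, used only for this example.
val demoPath = Files.createTempDirectory("fileindex-demo").resolve("data").toString
spark.range(10).write.parquet(demoPath)

// The files are listed when the index is created and the result is cached.
val index = new InMemoryFileIndex(spark, Seq(new Path(demoPath)), Map.empty[String, String], None)
val before = index.inputFiles.length

// Add more files without telling the index.
spark.range(10).write.mode("append").parquet(demoPath)
val stale = index.inputFiles.length // still the cached listing

// refresh() drops the cached listing and lists the root paths again.
index.refresh()
val after = index.inputFiles.length

println(s"before=$before stale=$stale after=$after")
```

From user code, `spark.catalog.refreshByPath(path)` is the public entry point that ends up in CacheManager's recacheByPath.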
Root Paths¶
rootPaths: Seq[Path]
Root paths from which the catalog gets the files (as Hadoop Paths). There can be a single root path for the entire table (with partition directories underneath) or one root path per individual partition.
Used when:
- HiveMetastoreCatalog is requested for a cached LogicalRelation (when requested to convert a HiveTableRelation)
- OptimizedCreateHiveTableAsSelectCommand is executed
- CacheManager is requested to recache by path
- FileSourceScanExec physical operator is requested for the metadata and verboseStringWithOperatorId
- DDLUtils utility is used to verifyNotReadPath
- DataSourceAnalysis logical resolution rule is executed (for an InsertIntoStatement over a HadoopFsRelation)
- FileScan is requested for a description
Estimated Size¶
sizeInBytes: Long
Estimated size of the data of the relation (in bytes)
Used when:
- HadoopFsRelation is requested for the estimated size
- FileScan is requested for statistics
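The remaining members of the contract (`rootPaths`, `partitionSchema` and `sizeInBytes`) can be inspected the same way. The sketch below again uses InMemoryFileIndex as the concrete implementation; its constructor is an internal API that may differ between Spark versions, and `demoPath` and the `bucket` column are hypothetical.

```scala
import java.nio.file.Files
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.InMemoryFileIndex

val spark = SparkSession.builder().master("local[*]").appName("contract-demo").getOrCreate()

// Hypothetical scratch location, used only for this example.
val demoPath = Files.createTempDirectory("fileindex-demo").resolve("by_bucket").toString
spark.range(10)
  .selectExpr("id", "id % 2 AS bucket")
  .write.partitionBy("bucket").parquet(demoPath)

// No user-specified schema, so the partition schema is inferred from directory names.
val index = new InMemoryFileIndex(spark, Seq(new Path(demoPath)), Map.empty[String, String], None)

val roots = index.rootPaths                             // where the listing starts
val partitionColumns = index.partitionSchema.fieldNames // taken from bucket=... directories
val estimatedSize = index.sizeInBytes                   // estimated size of the data, in bytes

println(s"root paths:        ${roots.mkString(", ")}")
println(s"partition columns: ${partitionColumns.mkString(", ")}")
println(s"size in bytes:     $estimatedSize")
```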