CatalogFileIndex¶
CatalogFileIndex is a FileIndex.
Creating Instance¶
CatalogFileIndex takes the following to be created:
- SparkSession
- CatalogTable
- Estimated Size
CatalogFileIndex is created when:
- HiveMetastoreCatalog is requested to convert a HiveTableRelation to a LogicalRelation
- DataSource is requested to create a BaseRelation for a FileFormat
FileStatusCache¶
CatalogFileIndex creates a FileStatusCache when created.
The FileStatusCache is used in:
- filterPartitions (to create an InMemoryFileIndex)
- refresh (to invalidateAll the cached entries)
Listing Files¶
listFiles(
partitionFilters: Seq[Expression],
dataFilters: Seq[Expression]): Seq[PartitionDirectory]
listFiles filters the partitions by the input partition filters (using filterPartitions) and then requests the resulting InMemoryFileIndex for the underlying partition files.
listFiles is part of the FileIndex abstraction.
Input Files¶
inputFiles: Array[String]
inputFiles lists all the partitions (using filterPartitions with no filters) and then requests the resulting InMemoryFileIndex for the input files.
inputFiles is part of the FileIndex abstraction.
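The way both listing methods go through filterPartitions can be sketched with a small, Spark-free model. This is only an illustration under stated assumptions: CatalogIndexModel, SelectedPartitions and the simplified PartitionDirectory are hypothetical stand-ins (not Spark's classes), filters are modelled as plain Scala predicates rather than Catalyst expressions, and data filters are left out.

```scala
object ListingSketch {
  // Hypothetical, simplified stand-in for a partition and its files.
  final case class PartitionDirectory(values: Map[String, String], files: Seq[String])

  // Hypothetical stand-in for the file index that filterPartitions returns.
  final case class SelectedPartitions(dirs: Seq[PartitionDirectory]) {
    def listFiles: Seq[PartitionDirectory] = dirs
    def inputFiles: Array[String] = dirs.flatMap(_.files).toArray
  }

  type PartitionFilter = PartitionDirectory => Boolean

  // Toy model of the delegation only; it does not talk to a metastore.
  final class CatalogIndexModel(all: Seq[PartitionDirectory]) {
    // Keep the partitions that satisfy every filter (no filters keeps all).
    def filterPartitions(filters: Seq[PartitionFilter]): SelectedPartitions =
      SelectedPartitions(all.filter(d => filters.forall(f => f(d))))

    // Listing files: prune partitions first, then ask the result for its files.
    def listFiles(partitionFilters: Seq[PartitionFilter]): Seq[PartitionDirectory] =
      filterPartitions(partitionFilters).listFiles

    // Input files: no filters, so every partition contributes its files.
    def inputFiles: Array[String] =
      filterPartitions(Nil).inputFiles
  }
}
```

In this model, listing with a filter on a partition value returns only the matching directories, while inputFiles always covers every partition.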
Root Paths¶
rootPaths: Seq[Path]
rootPaths returns the base location converted to a Hadoop Path.
rootPaths is part of the FileIndex abstraction.
Listing Partitions By Given Predicate Expressions¶
filterPartitions(
filters: Seq[Expression]): InMemoryFileIndex
filterPartitions requests the CatalogTable for the partition columns.
For a partitioned table, filterPartitions starts tracking time, requests the SessionCatalog for the partitions by filter and creates a PrunedInMemoryFileIndex (with the partition listing time).
For an unpartitioned table (no partition columns defined), filterPartitions simply returns an InMemoryFileIndex (with the base location and no user-specified schema).
filterPartitions is used when:
- HiveMetastoreCatalog is requested to convert a HiveTableRelation to a LogicalRelation
- CatalogFileIndex is requested to listFiles and inputFiles
- PruneFileSourcePartitions logical optimization is executed
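The two branches of filterPartitions described above can be sketched as a self-contained model. The sketch assumes hypothetical stand-in types (Table, Partition, PrunedIndex, PlainIndex) in place of CatalogTable, catalog partitions, PrunedInMemoryFileIndex and InMemoryFileIndex, and it filters an in-memory list where the real code would ask the SessionCatalog.

```scala
object FilterPartitionsSketch {
  // Hypothetical stand-ins; none of these are Spark's own classes.
  final case class Partition(values: Map[String, String], location: String)
  final case class Table(
      baseLocation: String,
      partitionColumns: Seq[String],
      partitions: Seq[Partition])

  sealed trait Index
  // Models the pruned index, which also carries the partition listing time.
  final case class PrunedIndex(partitions: Seq[Partition], listingTimeNs: Long) extends Index
  // Models the plain index built from the base location only.
  final case class PlainIndex(rootPaths: Seq[String]) extends Index

  type Filter = Partition => Boolean

  def filterPartitions(table: Table, filters: Seq[Filter]): Index =
    if (table.partitionColumns.nonEmpty) {
      // Partitioned table: time the listing and keep the matching partitions.
      val start = System.nanoTime()
      val selected = table.partitions.filter(p => filters.forall(f => f(p)))
      PrunedIndex(selected, System.nanoTime() - start)
    } else {
      // Unpartitioned table: fall back to the base location.
      PlainIndex(Seq(table.baseLocation))
    }
}
```

The branch is chosen purely by whether the table defines partition columns, which mirrors the description above; how the real lookup evaluates the filters is outside the scope of this sketch.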
Internal Properties¶
Base Location¶
Base location (as a Java URI) as defined in the CatalogTable metadata (under the locationUri of the storage)
Used when CatalogFileIndex is requested to filter the partitions and for the root paths
Hadoop Configuration¶
Hadoop Configuration
Used when CatalogFileIndex is requested to filter the partitions