CatalogFileIndex¶
CatalogFileIndex is a FileIndex.
Creating Instance¶
CatalogFileIndex takes the following to be created:
- SparkSession
- CatalogTable
- Estimated Size
CatalogFileIndex is created when:
- HiveMetastoreCatalog is requested to convert a HiveTableRelation to a LogicalRelation
- DataSource is requested to create a BaseRelation for a FileFormat
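As a rough illustration of the three inputs listed above, the following sketch builds a CatalogFileIndex by hand. It assumes Spark 3.x with the default in-memory catalog; the table name demo_cfi, the temporary warehouse directory, and the estimated size of 0 bytes are made up for the example and are not part of Spark.

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.execution.datasources.CatalogFileIndex

// Assumption: a throwaway local session with a temporary warehouse directory
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("CatalogFileIndexDemo")
  .config("spark.sql.warehouse.dir", Files.createTempDirectory("warehouse").toString)
  .getOrCreate()

// Hypothetical partitioned table used only for this demo
spark.range(10)
  .selectExpr("id", "id % 2 AS part")
  .write
  .partitionBy("part")
  .mode("overwrite")
  .saveAsTable("demo_cfi")

// CatalogTable metadata, as registered in the SessionCatalog
val table = spark.sessionState.catalog.getTableMetadata(TableIdentifier("demo_cfi"))

// SparkSession, CatalogTable and estimated size (0 is only a placeholder)
val index = new CatalogFileIndex(spark, table, 0L)

println(index.rootPaths)
println(index.partitionSchema.simpleString)
```

In regular use Spark creates the index itself, so constructing one directly like this is mostly useful for exploration in spark-shell.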
FileStatusCache¶
CatalogFileIndex creates a FileStatusCache when created.
The FileStatusCache is used when CatalogFileIndex is requested to:
- filterPartitions (to create an InMemoryFileIndex)
- refresh (to invalidateAll)
Listing Files¶
listFiles(
partitionFilters: Seq[Expression],
dataFilters: Seq[Expression]): Seq[PartitionDirectory]
listFiles filters the partitions by the given partition filters and then requests the resulting InMemoryFileIndex for the underlying partition files.
listFiles is part of the FileIndex abstraction.
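The sketch below shows one way to call listFiles with a partition filter. It assumes Spark 3.x with the default in-memory catalog; the table demo_list, its part column, and the temporary warehouse directory are hypothetical names introduced for the example. The filter is built from a resolved AttributeReference, on the assumption that the catalog binds partition predicates by attribute name.

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, EqualTo, Literal}
import org.apache.spark.sql.execution.datasources.CatalogFileIndex
import org.apache.spark.sql.types.LongType

// Assumption: a throwaway local session with a temporary warehouse directory
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("ListFilesDemo")
  .config("spark.sql.warehouse.dir", Files.createTempDirectory("warehouse").toString)
  .getOrCreate()

// Hypothetical table with two partitions: part=0 and part=1
spark.range(10)
  .selectExpr("id", "id % 2 AS part")
  .write
  .partitionBy("part")
  .mode("overwrite")
  .saveAsTable("demo_list")

val table = spark.sessionState.catalog.getTableMetadata(TableIdentifier("demo_list"))
val index = new CatalogFileIndex(spark, table, 0L)

// Partition filter: part = 1 (id % 2 on a bigint column is a bigint)
val part = AttributeReference("part", LongType)()
val onlyPartOne = EqualTo(part, Literal(1L))

// Partition filters prune the partitions; no data filters are given here
val dirs = index.listFiles(Seq(onlyPartOne), Nil)

dirs.foreach { dir =>
  println(s"partition values: ${dir.values}, files: ${dir.files.size}")
}
```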
Input Files¶
inputFiles: Array[String]
inputFiles lists all the partitions (filterPartitions with no filters) and then requests the resulting InMemoryFileIndex for the input files.
inputFiles is part of the FileIndex abstraction.
Root Paths¶
rootPaths: Seq[Path]
rootPaths returns the base location converted to a Hadoop Path.
rootPaths is part of the FileIndex abstraction.
Listing Partitions By Given Predicate Expressions¶
filterPartitions(
filters: Seq[Expression]): InMemoryFileIndex
filterPartitions requests the CatalogTable for the partition columns.
For a partitioned table, filterPartitions starts tracking time. filterPartitions requests the SessionCatalog for the partitions by filter and creates a PrunedInMemoryFileIndex (with the partition listing time).
For an unpartitioned table (no partition columns defined), filterPartitions simply returns an InMemoryFileIndex (with the base location and no user-specified schema).
filterPartitions is used when:
- HiveMetastoreCatalog is requested to convert a HiveTableRelation to a LogicalRelation
- CatalogFileIndex is requested to listFiles and inputFiles
- PruneFileSourcePartitions logical optimization is executed
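The two branches of filterPartitions described above can be compared with a sketch like the following. It assumes Spark 3.x with the default in-memory catalog; the tables demo_parts and demo_flat and the temporary warehouse directory are hypothetical and exist only for this example.

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, EqualTo, Literal}
import org.apache.spark.sql.execution.datasources.CatalogFileIndex
import org.apache.spark.sql.types.LongType

// Assumption: a throwaway local session with a temporary warehouse directory
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("FilterPartitionsDemo")
  .config("spark.sql.warehouse.dir", Files.createTempDirectory("warehouse").toString)
  .getOrCreate()
val catalog = spark.sessionState.catalog

// Hypothetical partitioned table (part=0 and part=1)
spark.range(10).selectExpr("id", "id % 2 AS part")
  .write.partitionBy("part").mode("overwrite").saveAsTable("demo_parts")

// Hypothetical unpartitioned table
spark.range(5).write.mode("overwrite").saveAsTable("demo_flat")

val partsIndex =
  new CatalogFileIndex(spark, catalog.getTableMetadata(TableIdentifier("demo_parts")), 0L)
val flatIndex =
  new CatalogFileIndex(spark, catalog.getTableMetadata(TableIdentifier("demo_flat")), 0L)

// Partitioned table: only the partitions matching the filter are kept
val partIsOne = EqualTo(AttributeReference("part", LongType)(), Literal(1L))
val pruned = partsIndex.filterPartitions(Seq(partIsOne))

// Partitioned table, no filters: all the partitions are listed
val all = partsIndex.filterPartitions(Nil)

// Unpartitioned table: an InMemoryFileIndex over the base location
val flat = flatIndex.filterPartitions(Nil)

println(s"pruned root paths: ${pruned.rootPaths}")
println(s"all root paths:    ${all.rootPaths}")
println(s"flat root paths:   ${flat.rootPaths}")
```

For the partitioned table the root paths of the returned index are the locations of the selected partitions, while for the unpartitioned table there is a single root path, the table's base location.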
Internal Properties¶
Base Location¶
Base location (as a Java URI) as defined in the CatalogTable metadata (under the locationUri of the storage)
Used when CatalogFileIndex is requested to filter the partitions and for the root paths
Hadoop Configuration¶
Hadoop Configuration
Used when CatalogFileIndex is requested to filter the partitions