# HiveExternalCatalog

`HiveExternalCatalog` is an ExternalCatalog for SparkSession with Hive support enabled.

`HiveExternalCatalog` uses a HiveClient to interact with a Hive metastore.
## Creating Instance

`HiveExternalCatalog` takes the following to be created:

- SparkConf (Spark Core)
- Configuration (Apache Hadoop)

`HiveExternalCatalog` is created when:

- `SharedState` is requested for the ExternalCatalog (and spark.sql.catalogImplementation is `hive`)
## restoreTableMetadata

```scala
restoreTableMetadata(
  inputTable: CatalogTable): CatalogTable
```

`restoreTableMetadata`...FIXME

`restoreTableMetadata` is used when:

- `HiveExternalCatalog` is requested for the metadata of a table or of multiple tables, and for listPartitionsByFilter
## restoreHiveSerdeTable

```scala
restoreHiveSerdeTable(
  table: CatalogTable): CatalogTable
```

`restoreHiveSerdeTable`...FIXME
## restoreDataSourceTable

```scala
restoreDataSourceTable(
  table: CatalogTable,
  provider: String): CatalogTable
```

`restoreDataSourceTable`...FIXME
## Looking Up BucketSpec in Table Properties

```scala
getBucketSpecFromTableProperties(
  metadata: CatalogTable): Option[BucketSpec]
```

`getBucketSpecFromTableProperties` looks up the `spark.sql.sources.schema.numBuckets` property (among the table properties) in the given CatalogTable metadata.

If found, `getBucketSpecFromTableProperties` creates a BucketSpec with the following:

| BucketSpec | Metadata Property |
|---|---|
| numBuckets | `spark.sql.sources.schema.numBuckets` |
| bucketColumnNames | `spark.sql.sources.schema.numBucketCols` and `spark.sql.sources.schema.bucketCol.N` |
| sortColumnNames | `spark.sql.sources.schema.numSortCols` and `spark.sql.sources.schema.sortCol.N` |
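The table-property lookup above can be sketched in plain Scala. This is an assumption-laden sketch over a bare `Map[String, String]`: `BucketSpecSketch` and its `BucketSpec` case class are hypothetical stand-ins, not Spark's actual `CatalogTable`/`BucketSpec` API.

```scala
// Sketch (assumption): how a bucketing spec could be rebuilt from the
// spark.sql.sources.schema.* table properties described above.
// BucketSpecSketch and BucketSpec are hypothetical, not Spark's own classes.
object BucketSpecSketch {
  case class BucketSpec(
      numBuckets: Int,
      bucketColumnNames: Seq[String],
      sortColumnNames: Seq[String])

  // numBucketCols/numSortCols give the count; bucketCol.N/sortCol.N give each name
  private def columns(
      props: Map[String, String],
      countKey: String,
      colPrefix: String): Seq[String] = {
    val count = props.get(countKey).map(_.toInt).getOrElse(0)
    (0 until count).map(i => props(s"$colPrefix$i")).toList
  }

  def fromProperties(props: Map[String, String]): Option[BucketSpec] =
    // numBuckets is the marker property: if absent, there is no bucketing spec
    props.get("spark.sql.sources.schema.numBuckets").map { n =>
      BucketSpec(
        n.toInt,
        columns(props, "spark.sql.sources.schema.numBucketCols",
          "spark.sql.sources.schema.bucketCol."),
        columns(props, "spark.sql.sources.schema.numSortCols",
          "spark.sql.sources.schema.sortCol."))
    }

  def main(args: Array[String]): Unit = {
    val props = Map(
      "spark.sql.sources.schema.numBuckets" -> "4",
      "spark.sql.sources.schema.numBucketCols" -> "1",
      "spark.sql.sources.schema.bucketCol.0" -> "id")
    println(fromProperties(props))
    // prints Some(BucketSpec(4,List(id),List()))
  }
}
```

The `Option` return mirrors the signature above: a missing `numBuckets` key means the table was simply not bucketed.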
## restorePartitionMetadata

```scala
restorePartitionMetadata(
  partition: CatalogTablePartition,
  table: CatalogTable): CatalogTablePartition
```

`restorePartitionMetadata`...FIXME

`restorePartitionMetadata` is used when:

- `HiveExternalCatalog` is requested to getPartition and getPartitionOption
## Restoring Table Statistics from Table Properties (from Hive Metastore)

```scala
statsFromProperties(
  properties: Map[String, String],
  table: String): Option[CatalogStatistics]
```

`statsFromProperties` collects the statistics-related properties, i.e. the ones with the `spark.sql.statistics` prefix.

If there are no keys with the prefix, `statsFromProperties` returns `None`.

If there are keys with the `spark.sql.statistics` prefix, `statsFromProperties` creates a CatalogColumnStat for every column in the schema.

For every column name in the schema, `statsFromProperties` collects all the keys with the `spark.sql.statistics.colStats.[name]` prefix (after having checked that the `spark.sql.statistics.colStats.[name].version` key exists, which marks that column statistics are available in the statistics properties) and converts them to a ColumnStat (for the column name).

In the end, `statsFromProperties` creates a CatalogStatistics as follows:

| Catalog Statistic | Value |
|---|---|
| sizeInBytes | `spark.sql.statistics.totalSize` property |
| rowCount | `spark.sql.statistics.numRows` property |
| colStats | Column names and their CatalogColumnStats |

`statsFromProperties` is used when:

- `HiveExternalCatalog` is requested to restore the metadata of a table or a partition
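The property-driven reconstruction can be sketched as follows, assuming a bare `Map[String, String]` and simplified stand-ins for the result types (the real `CatalogStatistics` and `CatalogColumnStat` carry more fields and typed values):

```scala
// Sketch (assumption): rebuilding table statistics from spark.sql.statistics.*
// properties, following the steps above. The case classes are simplified
// hypothetical stand-ins, not Spark's CatalogStatistics/CatalogColumnStat.
object StatsFromPropertiesSketch {
  case class ColumnStat(values: Map[String, String])
  case class CatalogStatistics(
      sizeInBytes: BigInt,
      rowCount: Option[BigInt],
      colStats: Map[String, ColumnStat])

  private val Prefix = "spark.sql.statistics."
  private val ColPrefix = Prefix + "colStats."

  def statsFromProperties(
      properties: Map[String, String],
      schema: Seq[String]): Option[CatalogStatistics] = {
    // keep only the statistics-related properties
    val statsProps = properties.filter { case (k, _) => k.startsWith(Prefix) }
    if (statsProps.isEmpty) {
      None // no keys with the prefix => no statistics
    } else {
      val colStats = schema.flatMap { name =>
        // the .version key marks that statistics exist for this column
        if (statsProps.contains(s"$ColPrefix$name.version")) {
          val stats = statsProps.collect {
            case (k, v) if k.startsWith(s"$ColPrefix$name.") =>
              k.stripPrefix(s"$ColPrefix$name.") -> v
          }
          Some(name -> ColumnStat(stats))
        } else None
      }.toMap
      Some(CatalogStatistics(
        sizeInBytes = BigInt(statsProps.getOrElse(Prefix + "totalSize", "0")),
        rowCount = statsProps.get(Prefix + "numRows").map(BigInt(_)),
        colStats = colStats))
    }
  }
}
```

Note how a column with no `.version` marker key is skipped entirely, so `colStats` only ever holds columns whose statistics were actually persisted.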
## Demo

```scala
import org.apache.spark.sql.internal.StaticSQLConf

val catalogType = spark.conf.get(StaticSQLConf.CATALOG_IMPLEMENTATION.key)
assert(catalogType == "hive")

assert(spark.sessionState.conf.getConf(StaticSQLConf.CATALOG_IMPLEMENTATION) == "hive")
assert(spark.conf.get("spark.sql.catalogImplementation") == "hive")

val metastore = spark.sharedState.externalCatalog

import org.apache.spark.sql.catalyst.catalog.ExternalCatalog
assert(metastore.isInstanceOf[ExternalCatalog])
```

```text
scala> println(metastore)
org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener@277ffb17

scala> println(metastore.unwrapped)
org.apache.spark.sql.hive.HiveExternalCatalog@4eda3af9
```
## Logging

Enable `ALL` logging level for the `org.apache.spark.sql.hive.HiveExternalCatalog` logger to see what happens inside.

Add the following lines to `conf/log4j2.properties`:

```text
logger.HiveExternalCatalog.name = org.apache.spark.sql.hive.HiveExternalCatalog
logger.HiveExternalCatalog.level = all
```

Refer to Logging.