CatalogTable

CatalogTable is the specification (metadata) of a table managed by SessionCatalog.

Creating Instance

CatalogTable takes the following to be created:

  • TableIdentifier
  • Table type
  • CatalogStorageFormat
  • Schema (StructType)
  • Name of the table provider
  • Partition Columns
  • Bucketing specification
  • Owner
  • Created Time
  • Last access time
  • Created By version
  • Table Properties
  • Table statistics
  • View Text
  • Comment
  • Unsupported Features (Seq[String])
  • tracksPartitionsInCatalog flag (default: false)
  • schemaPreservesCase flag (default: true)
  • Ignored properties
  • View Original Text

CatalogTable is created when:

Bucketing Specification

bucketSpec: Option[BucketSpec] = None

CatalogTable can be given a BucketSpec when created. It is undefined (None) by default.

BucketSpec is given (using getBucketSpecFromTableProperties from a Hive metastore) when:

BucketSpec is given when:

BucketSpec is used when:

Note

  1. Use DescribeTableCommand to review BucketSpec
  2. Use ShowCreateTableCommand to review the Spark DDL syntax
  3. Use Catalog.listColumns to list all columns (incl. bucketing columns)
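As a sketch of where the bucketing specification comes from, the following (assuming a local `SparkSession` and a throwaway table name `bucketed_demo` made up for this demo) writes a bucketed table and reads the `BucketSpec` back from the table metadata:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session, purely for illustration
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("bucketSpec-demo")
  .getOrCreate()
import spark.implicits._

// Save a table bucketed by id into 4 buckets (the table name is made up)
Seq((0, "a"), (1, "b")).toDF("id", "name")
  .write
  .bucketBy(4, "id")
  .sortBy("id")
  .saveAsTable("bucketed_demo")

// Read the BucketSpec back from the CatalogTable
val catalog = spark.sessionState.catalog
val tid = spark.sessionState.sqlParser.parseTableIdentifier("bucketed_demo")
val metadata = catalog.getTableMetadata(tid)
println(metadata.bucketSpec)
```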

Table Type

CatalogTable is given a CatalogTableType when created:

CatalogTableType is included when a TreeNode is requested for a JSON representation for...FIXME

Table Statistics

stats: Option[CatalogStatistics] = None

CatalogTable can be given a CatalogStatistics when created. It is undefined (None) by default.
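CatalogStatistics is a small holder of the table size in bytes, an optional row count, and per-column statistics. A minimal sketch of constructing one directly (the numbers are fabricated, matching the demo output further down):

```scala
import org.apache.spark.sql.catalyst.catalog.CatalogStatistics

// Fabricated numbers purely for illustration
val stats = CatalogStatistics(
  sizeInBytes = BigInt(714),
  rowCount = Some(BigInt(2)))

println(stats.simpleString)
// 714 bytes, 2 rows
```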

Review Me

You manage table metadata using the Catalog interface. Among the management tasks is to get the table statistics of a table (that are used for cost-based query optimization).

scala> t1Metadata.stats.foreach(println)
CatalogStatistics(714,Some(2),Map(p1 -> ColumnStat(2,Some(0),Some(1),0,4,4,None), id -> ColumnStat(2,Some(0),Some(1),0,4,4,None)))

scala> t1Metadata.stats.map(_.simpleString).foreach(println)
714 bytes, 2 rows

CAUTION: FIXME When are stats specified? What if they are not?

Unless table statistics are available in the table metadata (in a catalog) for a non-streaming file data source table, DataSource creates a HadoopFsRelation with the table size specified by the spark.sql.defaultSizeInBytes internal property (default: Long.MaxValue) for query planning of joins (and possibly to auto-broadcast the table).

Internally, Spark alters table statistics using ExternalCatalog.doAlterTableStats.

Unless table statistics are available in the table metadata (in a catalog) for a HiveTableRelation (with the hive provider), the DetermineTableStats logical resolution rule can compute the table size using HDFS (if the spark.sql.statistics.fallBackToHdfs property is turned on) or assume spark.sql.defaultSizeInBytes (which effectively disables table broadcasting).

When requested to [look up a table in a metastore](hive/HiveClientImpl.md#getTableOption), HiveClientImpl [reads table or partition statistics directly from a Hive metastore](hive/HiveClientImpl.md#readHiveStats).

You can use the [AnalyzeColumnCommand](AnalyzeColumnCommand.md), [AnalyzePartitionCommand](AnalyzePartitionCommand.md) and [AnalyzeTableCommand](AnalyzeTableCommand.md) commands to record statistics in a catalog.
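These commands are usually issued through SQL. A self-contained sketch (assuming a local `SparkSession`; the `t1` table with `id` and `p1` columns is created here just for the demo) that records table-level and column-level statistics in the catalog:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session and a throwaway t1 table, purely for illustration
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("analyze-demo")
  .getOrCreate()
import spark.implicits._
Seq((0, 0), (1, 1)).toDF("id", "p1").write.saveAsTable("t1")

// Record table-level and column-level statistics in the catalog
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS id, p1")

// Statistics are now available in the CatalogTable
val tid = spark.sessionState.sqlParser.parseTableIdentifier("t1")
println(spark.sessionState.catalog.getTableMetadata(tid).stats)
```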

The table statistics can be automatically updated (after executing commands like AlterTableAddPartitionCommand) when spark.sql.statistics.size.autoUpdate.enabled property is turned on.

You can use the DESCRIBE SQL command to show the histogram of a column, if stored in a catalog.
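For example, assuming the `t1` table from the demo below already exists and has had its column statistics analyzed, the column-level variant of DESCRIBE shows per-column statistics (including the histogram when one was recorded):

```scala
// Show column-level statistics for column id of table t1
// (assumes an active SparkSession named spark and an analyzed t1 table)
spark.sql("DESCRIBE EXTENDED t1 id").show(truncate = false)
```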

Demo: Accessing Table Metadata

Catalog

val q = spark.catalog.listTables.filter($"name" === "t1")
scala> q.show
+----+--------+-----------+---------+-----------+
|name|database|description|tableType|isTemporary|
+----+--------+-----------+---------+-----------+
|  t1| default|       null|  MANAGED|      false|
+----+--------+-----------+---------+-----------+

SessionCatalog

import org.apache.spark.sql.catalyst.catalog.SessionCatalog
val sessionCatalog = spark.sessionState.catalog
assert(sessionCatalog.isInstanceOf[SessionCatalog])
val t1Tid = spark.sessionState.sqlParser.parseTableIdentifier("t1")
val t1Metadata = sessionCatalog.getTempViewOrPermanentTableMetadata(t1Tid)
import org.apache.spark.sql.catalyst.catalog.CatalogTable
assert(t1Metadata.isInstanceOf[CatalogTable])