Metadata

Metadata is an action that describes the metadata (or a change to the metadata) of a delta table (indirectly via its Snapshot).

import org.apache.spark.sql.delta.DeltaLog
val deltaLog = DeltaLog.forTable(spark, "/tmp/delta/users")
scala> :type deltaLog.snapshot.metadata
org.apache.spark.sql.delta.actions.Metadata

Metadata contains all the non-data information (metadata) of a delta table, like the name, description, format, schema, partition columns, configuration and created time. These can be changed (e.g., schema evolution).

TIP: Use <> to review the metadata of a delta table.

Metadata uses the table ID to uniquely identify a delta table. The ID never changes through the history of the table (unless the entire directory, along with the transaction log, is deleted). It is known as tableId or reservoirId.

[NOTE]
====
When I asked the question https://groups.google.com/forum/#!topic/delta-users/5OKEFvVKiew[tableId and reservoirId - Why two different names for metadata ID?] on the delta-users mailing list, Tathagata Das wrote:

Any reference to "reservoir" is just legacy code. In the early days of this project, the project was called "Tahoe" and each table is called a "reservoir" (Tahoe is one of the 2nd deepest lake in US, and is a very large reservoir of water ;) ). So you may still find those two terms all around the codebase.

In some cases, like DeltaSourceOffset, the term reservoirId is in the json that is written to the streaming checkpoint directory. So we cannot change that for backward compatibility.

====

Metadata can be <> in a OptimisticTransactionImpl.md[transaction] once (and only when created for an uninitialized table, when <> is -1).

[source,scala]

txn.metadata

Metadata is <> when:

  • DeltaLog is requested for the <>

  • OptimisticTransactionImpl is requested for the <>

  • ConvertToDeltaCommand is requested to <>

  • ImplicitMetadataOperation is requested to <>

== [[creating-instance]] Creating Metadata Instance

Metadata takes the following to be created:

  • [[id]] Table ID (default: a random UUID)
  • [[name]] Name of the delta table (default: null)
  • [[description]] Description (default: null)
  • [[format]] Format
  • [[schemaString]] Schema (default: null)
  • [[partitionColumns]] Partition columns (default: Nil)
  • [[configuration]] Configuration (default: empty)
  • [[createdTime]] Created time (in millis since the epoch)
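
As a rough illustration of the properties and defaults listed above, the following is a simplified, self-contained stand-in written in plain Scala. It is not the Delta Lake source: MetadataSketch and FormatSketch are hypothetical names, and the parquet provider default for the format and the Option-typed created time are assumptions made for this example only.

```scala
import java.util.UUID

// Hypothetical stand-in for the storage format of a delta table.
// The "parquet" default is an assumption of this sketch.
final case class FormatSketch(
    provider: String = "parquet",
    options: Map[String, String] = Map.empty)

// Simplified stand-in that mirrors the properties and defaults listed above.
final case class MetadataSketch(
    id: String = UUID.randomUUID().toString,  // table ID, a random UUID by default
    name: String = null,
    description: String = null,
    format: FormatSketch = FormatSketch(),
    schemaString: String = null,              // schema in JSON format
    partitionColumns: Seq[String] = Nil,
    configuration: Map[String, String] = Map.empty,
    createdTime: Option[Long] = Some(System.currentTimeMillis()))  // assumed type

object MetadataSketchDemo {
  def main(args: Array[String]): Unit = {
    val initial = MetadataSketch()
    // copy changes selected properties and keeps the table ID as it was
    val renamed = initial.copy(name = "users", partitionColumns = Seq("country"))
    println(s"same table ID: ${initial.id == renamed.id}")
    println(s"partition columns: ${renamed.partitionColumns.mkString(", ")}")
  }
}
```

Using copy for changes keeps the ID stable while the other properties evolve, which matches the idea that the table ID does not change through the history of a table.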

== [[wrap]] wrap Method

[source, scala]

wrap: SingleAction

NOTE: wrap is part of the Action contract to wrap the action into a SingleAction.

wrap simply creates a new SingleAction with the Metadata field set to this Metadata.
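
A minimal sketch of this wrapping, using simplified stand-in types rather than the Delta Lake classes. ActionStub, SingleActionStub, MetadataStub and CommitInfoStub are hypothetical names, and the envelope keeps just two fields, which is enough to show that only the field matching the wrapped action is set.

```scala
// Simplified stand-ins (hypothetical names), not the Delta Lake classes.
sealed trait ActionStub {
  // Every action knows how to wrap itself into the single-action envelope.
  def wrap: SingleActionStub
}

// Envelope with one nullable field per action type.
final case class SingleActionStub(
    metaData: MetadataStub = null,
    commitInfo: CommitInfoStub = null)

final case class MetadataStub(id: String) extends ActionStub {
  // Only the metadata field of the envelope is set.
  override def wrap: SingleActionStub = SingleActionStub(metaData = this)
}

final case class CommitInfoStub(operation: String) extends ActionStub {
  override def wrap: SingleActionStub = SingleActionStub(commitInfo = this)
}

object WrapDemo {
  def main(args: Array[String]): Unit = {
    val metadata = MetadataStub(id = "table-1")
    val wrapped = metadata.wrap
    println(wrapped.metaData == metadata)  // the metadata field carries this action
    println(wrapped.commitInfo == null)    // the other field stays unset
  }
}
```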

== [[partitionSchema]] partitionSchema (Lazy) Property

[source, scala]

partitionSchema: StructType

partitionSchema is the partition columns as StructFields (as defined in the schema).

NOTE: partitionSchema throws an IllegalArgumentException for fields that were used as partition columns but are not defined in the schema.

NOTE: partitionSchema is used when...FIXME
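
The lookup of partition columns in a schema can be sketched with Spark SQL's StructType alone. This is a minimal sketch, assuming Spark SQL is on the classpath (e.g. in spark-shell). PartitionSchemaSketch and partitionSchemaOf are hypothetical names, and the sample schema is made up for the example.

```scala
import org.apache.spark.sql.types.{LongType, StringType, StructType}
import scala.util.Try

object PartitionSchemaSketch {
  // Hypothetical helper: looks up every partition column in the table schema.
  // StructType.apply(name) throws an IllegalArgumentException for an unknown field.
  def partitionSchemaOf(schema: StructType, partitionColumns: Seq[String]): StructType =
    StructType(partitionColumns.map(name => schema(name)))

  def main(args: Array[String]): Unit = {
    val schema = new StructType()
      .add("id", LongType)
      .add("name", StringType)
      .add("country", StringType)

    // Partition columns that exist in the schema become the partition schema
    println(partitionSchemaOf(schema, Seq("country")).fieldNames.mkString(", "))

    // A partition column that is missing from the schema is rejected
    val failed = Try(partitionSchemaOf(schema, Seq("city")))
    println(failed.isFailure)
  }
}
```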

== [[dataSchema]] dataSchema (Lazy) Property

[source, scala]

dataSchema: StructType

dataSchema...FIXME

NOTE: dataSchema is used when...FIXME

== [[schema]] schema (Lazy) Property

[source, scala]

schema: StructType

schema is the schema string (in JSON format) deserialized to a StructType.
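
The JSON round trip can be sketched with Spark SQL's DataType.fromJson. This is a minimal sketch, assuming Spark SQL is on the classpath (e.g. in spark-shell). SchemaFromJsonSketch and schemaOf are hypothetical names, and treating a missing (null) schema string as an empty schema is an assumption of this sketch.

```scala
import org.apache.spark.sql.types.{DataType, LongType, StringType, StructType}

object SchemaFromJsonSketch {
  // Hypothetical helper: turns a JSON-encoded schema back into a StructType.
  // A null schema string is mapped to an empty schema (an assumption of this sketch).
  def schemaOf(schemaString: String): StructType =
    Option(schemaString)
      .map(json => DataType.fromJson(json).asInstanceOf[StructType])
      .getOrElse(StructType(Nil))

  def main(args: Array[String]): Unit = {
    val original = new StructType().add("id", LongType).add("name", StringType)
    val schemaString = original.json  // JSON representation of the schema

    // Deserializing the JSON gives back an equal StructType
    println(schemaOf(schemaString) == original)
    // No schema string gives a schema with no fields
    println(schemaOf(null).isEmpty)
  }
}
```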

[NOTE]

schema is used when:

  • Metadata is requested for the schema of the partitions (partitionSchema) and the data (dataSchema)

  • DeltaLog is requested for an DeltaLog.md#createRelation[insertable HadoopFsRelation for batch queries] (for the data schema), to DeltaLog.md#upgradeProtocol[upgrade protocol], and for a DeltaLog.md#createDataFrame[DataFrame for given AddFiles]

  • DeltaTableUtils utility is used to DeltaTableUtils.md#combineWithCatalogMetadata[combineWithCatalogMetadata]

  • OptimisticTransactionImpl is requested to OptimisticTransactionImpl.md#verifyNewMetadata[verifyNewMetadata]

  • ...FIXME (there are other uses)


Last update: 2020-09-29