# Metadata

`Metadata` is an Action to update the metadata of a delta table (indirectly via the Snapshot).

Use `DescribeDeltaDetailCommand` to review the metadata of a delta table.
## Creating Instance

`Metadata` takes the following to be created:

- Id
- Name (default: `null`)
- Description (default: `null`)
- Format (default: empty)
- Schema (default: `null`)
- Partition Columns (default: `Nil`)
- Table Configuration (default: empty)
- Created Time (default: undefined)
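The defaults above can be sketched as a plain Scala case class. This is a hypothetical, simplified mirror of the real `org.apache.spark.sql.delta.actions.Metadata` (field names and types here are assumptions for illustration, not the actual class):

```scala
import java.util.UUID

// Hypothetical, simplified mirror of the Metadata action's constructor defaults.
case class MetadataSketch(
  id: String = UUID.randomUUID().toString,        // Table ID (random UUID by default)
  name: String = null,                            // Name (default: null)
  description: String = null,                     // Description (default: null)
  format: String = "",                            // Format (default: empty)
  schemaString: String = null,                    // Schema (default: null)
  partitionColumns: Seq[String] = Nil,            // Partition Columns (default: Nil)
  configuration: Map[String, String] = Map.empty, // Table Configuration (default: empty)
  createdTime: Option[Long] = None)               // Created Time (default: undefined)

// All-default instance: only the table ID is materialized (randomly).
val m = MetadataSketch()
```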
`Metadata` is created when:

- `DeltaLog` is requested for the metadata (but that should be rare)
- `InitialSnapshot` is created
- `ConvertToDeltaCommand` is executed
- `ImplicitMetadataOperation` is requested to updateMetadata
## Updating Metadata

`Metadata` can be updated in a transaction once only (and only when created for an uninitialized table, i.e. when `readVersion` is `-1`).

```scala
txn.metadata
```
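The once-only rule can be sketched as follows. This is a hypothetical illustration (`TxnSketch` and its members are made-up names, not Delta Lake's API): a transaction rejects a second metadata update, and any update against an already-initialized table (`readVersion >= 0`).

```scala
import scala.util.Try

// Hypothetical sketch of a transaction that enforces the once-only rule.
class TxnSketch(val readVersion: Long) {
  private var newMetadata: Option[String] = None

  def updateMetadata(metadata: String): Unit = {
    require(newMetadata.isEmpty,
      "Cannot change the metadata more than once in a transaction.")
    require(readVersion == -1,
      "Can only update the metadata of an uninitialized table.")
    newMetadata = Some(metadata)
  }
}

// Uninitialized table (readVersion is -1): the first update succeeds.
val txn = new TxnSketch(readVersion = -1)
txn.updateMetadata("new metadata")
```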
## Demo

```scala
val path = "/tmp/delta/users"

import org.apache.spark.sql.delta.DeltaLog
val deltaLog = DeltaLog.forTable(spark, path)

import org.apache.spark.sql.delta.actions.Metadata
assert(deltaLog.snapshot.metadata.isInstanceOf[Metadata])

deltaLog.snapshot.metadata.id
```
## Table ID

`Metadata` uses a Table ID (aka `reservoirId`) to uniquely identify a delta table. The table ID never changes throughout the history of the table.

`Metadata` can be given a table ID when created, and defaults to a random UUID (`java.util.UUID`).
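The default is plain `java.util.UUID` usage, generated once at creation and then fixed for the table's lifetime:

```scala
import java.util.UUID

// Default table ID: a random (version 4) UUID rendered as a string.
val tableId = UUID.randomUUID().toString
```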
Note

When I asked the question "tableId and reservoirId - Why two different names for metadata ID?" on the delta-users mailing list, Tathagata Das wrote:

> Any reference to "reservoir" is just legacy code. In the early days of this project, the project was called "Tahoe" and each table is called a "reservoir" (Tahoe is one of the 2nd deepest lake in US, and is a very large reservoir of water ;) ). So you may still find those two terms all around the codebase.
>
> In some cases, like `DeltaSourceOffset`, the term `reservoirId` is in the json that is written to the streaming checkpoint directory. So we cannot change that for backward compatibility.
## Column Mapping Mode

```scala
columnMappingMode: DeltaColumnMappingMode
```

`columnMappingMode` is the value of the `columnMapping.mode` table property (of this `Metadata`).

`columnMappingMode` is used when:

- `DeltaFileFormat` is requested for the FileFormat
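Conceptually, resolving the mode is a lookup in the table configuration map with a fallback when the property is unset. A minimal sketch, assuming the `delta.columnMapping.mode` key and the mode names `none`, `id`, and `name` (the helper function is illustrative, not Delta Lake's API):

```scala
// Hypothetical sketch: resolve the column mapping mode from the table
// configuration, defaulting to "none" when the property is not set.
val ColumnMappingModeKey = "delta.columnMapping.mode"

def columnMappingMode(configuration: Map[String, String]): String =
  configuration.getOrElse(ColumnMappingModeKey, "none")
```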
## Data Schema (of Delta Table)

```scala
dataSchema: StructType
```

Lazy Value

`dataSchema` is a Scala lazy value to guarantee that the code to initialize it is executed once only (when accessed for the first time) and the computed value never changes afterwards.

Learn more in the Scala Language Specification.
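The once-only initialization is plain Scala behavior, shown here with a counter (the names are illustrative):

```scala
// A Scala lazy val: the initializer runs once, on first access,
// and the computed value is cached for all later accesses.
var initializations = 0

lazy val dataSchemaLike: String = {
  initializations += 1
  "computed"
}

dataSchemaLike // first access: the initializer runs
dataSchemaLike // cached: the initializer does not run again
```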
`dataSchema` is the schema without the partition columns (i.e., the columns that are written out to data files).

`dataSchema` is used when:

- `OptimisticTransactionImpl` is requested to verify a new metadata
- `Snapshot` is requested for the data schema
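The "schema minus partition columns" idea can be sketched with plain column names (the column names are made up for illustration; partition values live in the directory layout, not in the data files themselves):

```scala
// Hypothetical sketch of dataSchema: drop the partition columns from
// the full table schema, leaving the columns written to data files.
val schema = Seq("id", "name", "city", "country")
val partitionColumns = Seq("city", "country")

val dataSchema = schema.filterNot(partitionColumns.contains)
```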