Snapshot — Versioned State Of Delta Table

Snapshot is an immutable view of the state of a delta table at a given version.

Snapshot is created when DeltaLog is requested for the current snapshot, for the snapshot at a given version, or to update.

Snapshot can be requested for all data files.

scala> deltaLog.snapshot.allFiles.show(false)
+-------------------------------------------------------------------+---------------+----+----------------+----------+-----+----+
|path                                                               |partitionValues|size|modificationTime|dataChange|stats|tags|
+-------------------------------------------------------------------+---------------+----+----------------+----------+-----+----+
|part-00000-4050db39-e0f5-485d-ab3b-3ca72307f621-c000.snappy.parquet|[]             |262 |1578083748000   |false     |null |null|
|part-00000-ba39f292-2970-4528-a40c-8f0aa5f796de-c000.snappy.parquet|[]             |262 |1578083570000   |false     |null |null|
|part-00003-99f9d902-24a7-4f76-a15a-6971940bc245-c000.snappy.parquet|[]             |429 |1578083748000   |false     |null |null|
|part-00007-03d987f1-5bb3-4b5b-8db9-97b6667107e2-c000.snappy.parquet|[]             |429 |1578083748000   |false     |null |null|
|part-00011-a759a8c2-507d-46dd-9da7-dc722316214b-c000.snappy.parquet|[]             |429 |1578083748000   |false     |null |null|
|part-00015-2e685d29-25ed-4262-90a7-5491847fd8d0-c000.snappy.parquet|[]             |429 |1578083748000   |false     |null |null|
|part-00015-ee0ac1af-e1e0-4422-8245-12da91ced0a2-c000.snappy.parquet|[]             |429 |1578083570000   |false     |null |null|
+-------------------------------------------------------------------+---------------+----+----------------+----------+-----+----+

Snapshot can be requested for removed data files (aka tombstones).

scala> deltaLog.snapshot.tombstones.show(false)
+----+-----------------+----------+
|path|deletionTimestamp|dataChange|
+----+-----------------+----------+
+----+-----------------+----------+

Snapshot uses the spark.databricks.delta.snapshotPartitions internal configuration property (default: 50) for the number of partitions to use for state reconstruction.
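As a rough illustration, the property could be changed in a spark-shell session as sketched below. This assumes the predefined spark session of spark-shell (as in the other examples on this page) and that only snapshots created afterwards would pick up the new value. The value 10 is arbitrary. Since the property is internal, treat this as an experiment rather than a recommended setting.

```scala
// Assumes spark-shell, where `spark` is the predefined SparkSession.
// The value "10" is an arbitrary example, not a recommendation.
spark.conf.set("spark.databricks.delta.snapshotPartitions", "10")

// Read the value back to confirm what the session currently holds.
spark.conf.get("spark.databricks.delta.snapshotPartitions")
```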

Creating Snapshot Instance

Snapshot takes the following to be created:

Snapshot initializes the internal properties.

While being created, Snapshot requests the DeltaLog to protocolRead with the protocol.

state Method

state: Dataset[SingleAction]

state is used when:

All AddFiles — allFiles Method

allFiles: Dataset[AddFile]

allFiles takes the state dataset and selects the AddFiles: it adds a where clause (add IS NOT NULL) and a select over the fields of AddFile.

allFiles only adds the where and select clauses. No computation happens at this point, since the result is merely a description of a distributed computation (a Dataset[AddFile]).

import org.apache.spark.sql.delta.DeltaLog
val deltaLog = DeltaLog.forTable(spark, "/tmp/delta/users")
val files = deltaLog.snapshot.allFiles

scala> :type files
org.apache.spark.sql.Dataset[org.apache.spark.sql.delta.actions.AddFile]

scala> files.show
+--------------------+---------------+----+----------------+----------+-----+----+
|                path|partitionValues|size|modificationTime|dataChange|stats|tags|
+--------------------+---------------+----+----------------+----------+-----+----+
|part-00000-68a7ce...|             []| 875|   1579789902000|     false| null|null|
|part-00000-73e140...|             []| 419|   1579723382000|     false| null|null|
|part-00000-7a01b0...|             []| 875|   1579877119000|     false| null|null|
|part-00001-8a2ece...|             []| 875|   1579789902000|     false| null|null|
|part-00002-0fc3da...|             []| 866|   1579789902000|     false| null|null|
|part-00003-c0fc5f...|             []| 884|   1579789902000|     false| null|null|
+--------------------+---------------+----+----------------+----------+-----+----+
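The where-then-select idea behind allFiles can be modelled without Spark. The following is only a sketch on plain Scala collections: the case classes are simplified stand-ins (not Delta Lake's own classes), the file names are made up, and, unlike a Dataset, a Seq is evaluated eagerly.

```scala
// A plain-Scala model (no Spark) of selecting AddFiles from a set of actions.
// All types below are simplified, hypothetical stand-ins for illustration.
object AllFilesSketch {
  final case class AddFile(path: String, size: Long)
  final case class RemoveFile(path: String, deletionTimestamp: Long)

  // Models the idea of SingleAction: a wrapper where only one field is set.
  final case class SingleAction(
      add: Option[AddFile] = None,
      remove: Option[RemoveFile] = None)

  // Plays the role of "where add IS NOT NULL" followed by a select of the add field.
  def allFiles(state: Seq[SingleAction]): Seq[AddFile] =
    state.flatMap(_.add)

  // Made-up sample state: two added files and one removed file.
  val state: Seq[SingleAction] = Seq(
    SingleAction(add = Some(AddFile("part-00000.snappy.parquet", 262L))),
    SingleAction(remove = Some(RemoveFile("part-00001.snappy.parquet", 1578083748000L))),
    SingleAction(add = Some(AddFile("part-00003.snappy.parquet", 429L))))
}
```

Calling AllFilesSketch.allFiles(AllFilesSketch.state) keeps only the two entries that carry an AddFile and drops the removal.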

allFiles is used when:

stateReconstruction Internal Property

stateReconstruction: Dataset[SingleAction]

stateReconstruction is the dataset of SingleActions (i.e. the dataset part) of the cachedState.

emptyActions Internal Method

emptyActions: Dataset[SingleAction]

emptyActions is an empty dataset of SingleActions for stateReconstruction and load.

load Internal Method

load(
  files: Seq[DeltaLogFileIndex]): Dataset[SingleAction]

load…​FIXME

load is used when Snapshot is created (and initializes stateReconstruction).

Transaction Version By App ID — transactions Lookup Table

transactions: Map[String, Long]

transactions takes the SetTransaction actions (from the state dataset) and makes them a lookup table of transaction version by appId.

transactions is a Scala lazy value and is not initialized until the first access.

transactions is used when OptimisticTransactionImpl is requested for the transaction version for a given (streaming query) id.
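The lookup table and its lazy initialization can be sketched in plain Scala. SnapshotModel, the builds counter, and the application ids are hypothetical names introduced only for this illustration. The sketch shows the general pattern (a lazy val holding a Map built from appId and version pairs), not Delta Lake's actual code.

```scala
// A plain-Scala sketch of a lazily built "transaction version by appId" table.
// SnapshotModel and its `builds` counter are hypothetical, for illustration only.
object TransactionsSketch {
  final case class SetTransaction(appId: String, version: Long)

  final class SnapshotModel(setTransactions: Seq[SetTransaction]) {
    // Counts how many times the lookup table is built (demonstration only).
    var builds: Int = 0

    // A lazy val: the map is built on first access and reused afterwards.
    lazy val transactions: Map[String, Long] = {
      builds += 1
      setTransactions.map(t => t.appId -> t.version).toMap
    }
  }
}
```

A caller would look up a version with snapshot.transactions.get(appId), which returns an Option[Long], so an unknown application id yields None rather than an exception.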

tombstones Method

tombstones: Dataset[RemoveFile]

tombstones…​FIXME

tombstones seems to be used for testing only.

redactedPath Method

redactedPath: String

redactedPath…​FIXME

redactedPath is used…​FIXME

dataSkippingNumIndexedCols Table Property — numIndexedCols Value

numIndexedCols: Int

numIndexedCols simply reads the dataSkippingNumIndexedCols table property from the Metadata.
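Reading a table property with a fallback could look roughly like the sketch below. The Metadata class here is a simplified stand-in. The property key delta.dataSkippingNumIndexedCols and the caller-supplied fallback are assumptions that should be checked against the Delta Lake version in use.

```scala
// A plain-Scala sketch of reading a numeric table property from table metadata.
// Metadata is a simplified, hypothetical stand-in for Delta Lake's Metadata action.
object NumIndexedColsSketch {
  final case class Metadata(configuration: Map[String, String] = Map.empty)

  // Assumed property key; verify it against your Delta Lake version.
  val PropertyKey = "delta.dataSkippingNumIndexedCols"

  // Returns the configured value, or the caller-supplied fallback when unset.
  def numIndexedCols(metadata: Metadata, fallback: Int): Int =
    metadata.configuration.get(PropertyKey).map(_.trim.toInt).getOrElse(fallback)
}
```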

numIndexedCols seems unused.

Internal Properties

Name Description

cachedState

Cached Delta State that is made up of the following:

Used when Snapshot is requested for the state (i.e. Dataset[SingleAction])

metadata

Metadata of the current state of the delta table

protocol

Protocol of the current state of the delta table

setTransactions

SetTransactions of the current state of the delta table