Skip to content

SnapshotManagement

SnapshotManagement is an extension for DeltaLog to manage Snapshots.

Demo

val name = "employees"
val dataPath = s"/tmp/delta/$name"
sql(s"DROP TABLE $name")
sql(s"""
    | CREATE TABLE $name (id bigint, name string, city string)
    | USING delta
    | OPTIONS (path='$dataPath')
    """.stripMargin)

import org.apache.spark.sql.delta.DeltaLog
val log = DeltaLog.forTable(spark, dataPath)

import org.apache.spark.sql.delta.SnapshotManagement
assert(log.isInstanceOf[SnapshotManagement], "DeltaLog is a SnapshotManagement")

val snapshot = log.update(stalenessAcceptable = false)
scala> :type snapshot
org.apache.spark.sql.delta.Snapshot

assert(snapshot.version == 0)

Current Snapshot

currentSnapshot: Snapshot

currentSnapshot is a registry with the current Snapshot of a Delta table.

When DeltaLog is created, currentSnapshot is initialized as getSnapshotAtInit and changed every update.

currentSnapshot...FIXME

currentSnapshot is used when:

  • DeltaLog is requested to isValid
  • SnapshotManagement is requested to...FIXME

Updating Current Snapshot

update(
  stalenessAcceptable: Boolean = false): Snapshot

update...FIXME

update is used when:

tryUpdate

tryUpdate(
  isAsync: Boolean = false): Snapshot

tryUpdate...FIXME

tryUpdate is used when SnapshotManagement is requested to update.

updateInternal

updateInternal(
  isAsync: Boolean): Snapshot

updateInternal...FIXME

updateInternal is used when SnapshotManagement is requested to update (and tryUpdate).

Loading Latest Snapshot

getSnapshotAtInit: Snapshot

getSnapshotAtInit getLogSegmentFrom for the last checkpoint.

getSnapshotAtInit prints out the following INFO message to the logs:

Loading version [version][startCheckpoint]

getSnapshotAtInit creates a Snapshot for the log segment.

getSnapshotAtInit records the current time in lastUpdateTimestamp registry.

getSnapshotAtInit prints out the following INFO message to the logs:

Returning initial snapshot [snapshot]

getSnapshotAtInit is used when SnapshotManagement is created (and initializes the currentSnapshot registry).

getLogSegmentFrom

getLogSegmentFrom(
  startingCheckpoint: Option[CheckpointMetaData]): LogSegment

getLogSegmentFrom getLogSegmentForVersion for the version of the given CheckpointMetaData (if specified) as a start checkpoint version or leaves it undefined.

getLogSegmentFrom is used when SnapshotManagement is requested for getSnapshotAtInit.

getLogSegmentForVersion

getLogSegmentForVersion(
  startCheckpoint: Option[Long],
  versionToLoad: Option[Long] = None): LogSegment

getLogSegmentForVersion...FIXME

getLogSegmentForVersion is used when SnapshotManagement is requested for getLogSegmentFrom, updateInternal and getSnapshotAt.

Creating Snapshot

createSnapshot(
  segment: LogSegment,
  minFileRetentionTimestamp: Long,
  timestamp: Long): Snapshot

createSnapshot readChecksum (for the version of the given LogSegment) and creates a Snapshot.

createSnapshot is used when SnapshotManagement is requested for getSnapshotAtInit, getSnapshotAt and update.

Last Successful Update Timestamp

SnapshotManagement uses lastUpdateTimestamp internal registry for the timestamp of the last successful update.


Last update: 2020-10-05