Checkpoints is an abstraction of DeltaLogs that can checkpoint the current state of a delta table.
Checkpoints requires to be used with DeltaLog (or subtypes) only.
Hadoop Path to the data directory of the delta table
Performs log cleanup (to remove stale log files)
Checkpointsis requested to checkpointAndCleanUpDeltaLog
Hadoop Path to the log directory of the delta table
Metadata of the delta table
Snapshot of the delta table
_last_checkpoint Metadata File¶
Checkpoints uses _last_checkpoint metadata file (under the log path) for the following:
checkpoint(): Unit checkpoint( snapshotToCheckpoint: Snapshot): CheckpointMetaData
checkpoint writes a checkpoint of the current state of the delta table (Snapshot). That produces a checkpoint metadata with the version, the number of actions and possibly parts (for multi-part checkpoints).
checkpoint requests the LogStore to overwrite the _last_checkpoint file with the JSON-encoded checkpoint metadata.
In the end,
checkpoint cleans up the expired logs (if enabled).
checkpoint is used when:
OptimisticTransactionImplis requested to postCommit (based on checkpoint interval table property)
DeltaCommandis requested to updateAndCheckpoint
checkpointAndCleanUpDeltaLog( snapshotToCheckpoint: Snapshot): Unit
checkpointAndCleanUpDeltaLog does the following (in the order):
writeCheckpointFiles( snapshotToCheckpoint: Snapshot): CheckpointMetaData
writeCheckpointFiles writes out a checkpoint of the given Snapshot.
Writing Out State Checkpoint¶
writeCheckpoint( spark: SparkSession, deltaLog: DeltaLog, snapshot: Snapshot): CheckpointMetaData
writeCheckpoint writes out the contents of the given Snapshot into one or more checkpoint files (based on spark.databricks.delta.checkpoint.partSize configuration property).
writeCheckpoint creates the following accumulators:
writeCheckpoint reads spark.databricks.delta.checkpoint.partSize configuration property to calculate the number and paths of this checkpoint (based on numOfFiles and numOfRemoves of the given Snapshot).
For multiple checkpoint parts (paths),
writeCheckpoint checkpointFileWithParts. Otherwise,
writeCheckpoint executes checkpointing (as a distributed computation using stateDS and as many tasks the number of checkpoint parts).
writeCheckpoint uses a new execution ID with the name Delta checkpoint.
writeLastCheckpointFile( deltaLog: DeltaLog, checkpointMetaData: CheckpointMetaData, addChecksum: Boolean): Unit
Loading Latest Checkpoint Metadata¶
lastCheckpoint loadMetadataFromFile (allowing for 3 retries).
lastCheckpoint is used when:
SnapshotManagementis requested to load the latest snapshot
MetadataCleanupis requested to listExpiredDeltaLogs
loadMetadataFromFile( tries: Int): Option[CheckpointMetaData]
loadMetadataFromFile loads the _last_checkpoint file (in JSON format) and converts it to
CheckpointMetaData (with a version, size and parts).
loadMetadataFromFile uses the LogStore to read the _last_checkpoint file.
In case the _last_checkpoint file is corrupted,