MetadataCleanup

MetadataCleanup is an abstraction of metadata cleaners that can clean up the DeltaLog.

NOTE: DeltaLog is the default and only known MetadataCleanup in Delta Lake.

Enable the ALL logging level for the org.apache.spark.sql.delta.MetadataCleanup logger to see what happens inside.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.sql.delta.MetadataCleanup=ALL

Refer to Logging.

doLogCleanup Method

doLogCleanup(): Unit

doLogCleanup is part of the Checkpoints Contract to…​FIXME.

Interestingly, both the MetadataCleanup and Checkpoints abstractions can only be used with DeltaLog.

doLogCleanup runs cleanUpExpiredLogs when the enableExpiredLogCleanup table property is enabled.
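
The gating between the table property and the cleanup can be sketched as follows. This is a minimal illustration and not the Delta Lake source: the CleanupSketch trait and the CountingCleanup class are hypothetical, and only the three method names come from the text.

```scala
// Hypothetical stand-in for MetadataCleanup; only the method names come from the text.
trait CleanupSketch {
  def enableExpiredLogCleanup: Boolean
  def cleanUpExpiredLogs(): Unit

  // Runs the cleanup only when the table property is enabled.
  def doLogCleanup(): Unit =
    if (enableExpiredLogCleanup) cleanUpExpiredLogs()
}

// Hypothetical test double that counts how many times the cleanup ran.
class CountingCleanup(enabled: Boolean) extends CleanupSketch {
  var cleaned = 0
  def enableExpiredLogCleanup: Boolean = enabled
  def cleanUpExpiredLogs(): Unit = cleaned += 1
}
```

With the property disabled, doLogCleanup is a no-op, so no log file is ever touched.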

enableExpiredLogCleanup Table Property — enableExpiredLogCleanup Method

enableExpiredLogCleanup: Boolean

enableExpiredLogCleanup gives the value of the enableExpiredLogCleanup table property (from the Metadata).

enableExpiredLogCleanup is used exclusively when MetadataCleanup is requested to doLogCleanup.

logRetentionDuration Table Property — deltaRetentionMillis Method

deltaRetentionMillis: Long

deltaRetentionMillis gives the value of the logRetentionDuration table property (from the Metadata).

deltaRetentionMillis is used when…​FIXME

cleanUpExpiredLogs Internal Method

cleanUpExpiredLogs(): Unit

cleanUpExpiredLogs calculates a so-called fileCutOffTime based on the current time and the logRetentionDuration table property.
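
One plausible way to derive such a cutoff is to subtract the retention period from the current time. The sketch below assumes exactly that, and additionally assumes the result is truncated to the start of the UTC day. The CutOffSketch object and its parameter names are hypothetical and are not taken from the Delta Lake source.

```scala
import java.util.concurrent.TimeUnit

object CutOffSketch {
  // Hypothetical helper: cutoff = current time minus the retention period.
  // Truncating to the start of the UTC day is an assumption of this sketch.
  def fileCutOffTime(nowMillis: Long, retentionMillis: Long): Long = {
    val dayMillis = TimeUnit.DAYS.toMillis(1)
    val raw = nowMillis - retentionMillis
    raw - Math.floorMod(raw, dayMillis)
  }
}
```

Under these assumptions, any log file whose modification time is below the returned value would be considered expired.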

cleanUpExpiredLogs prints out the following INFO message to the logs:

Starting the deletion of log files older than [date]

cleanUpExpiredLogs finds the expired delta logs (based on the fileCutOffTime) and deletes the files (using Hadoop’s FileSystem.delete non-recursively).

In the end, cleanUpExpiredLogs prints out the following INFO message to the logs:

Deleted numDeleted log files older than [date]

cleanUpExpiredLogs is used exclusively when MetadataCleanup is requested to doLogCleanup.
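
The delete-and-count loop can be sketched as follows. To stay self-contained, this sketch swaps Hadoop's FileSystem.delete for java.nio.file.Files.deleteIfExists on local paths. The ExpiredLogDeletion object and the deleteExpired method are hypothetical names.

```scala
import java.nio.file.{Files, Path}

object ExpiredLogDeletion {
  // Deletes each expired file individually (never a recursive directory delete)
  // and returns how many files were actually removed.
  def deleteExpired(expired: Iterator[Path]): Int = {
    var numDeleted = 0
    expired.foreach { path =>
      if (Files.deleteIfExists(path)) numDeleted += 1
    }
    numDeleted
  }
}
```

The returned count plays the role of numDeleted in the final INFO message.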

Finding Expired Delta Logs — listExpiredDeltaLogs Internal Method

listExpiredDeltaLogs(
  fileCutOffTime: Long): Iterator[FileStatus]

listExpiredDeltaLogs…​FIXME

listExpiredDeltaLogs requests the LogStore for the paths (in the same directory) that are lexicographically greater than or equal to the 0th checkpoint file (per the checkpointPrefix format), considering only the checkpoint and delta files in the log directory (of the DeltaLog).
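
The listing step can be illustrated on plain file names. The sketch below assumes that log file names start with a 20-digit zero-padded version followed by ".json" (delta files) or ".checkpoint" plus a ".parquet" suffix (checkpoint files). This naming scheme, the ExpiredLogListing object, and its helpers are assumptions made for illustration and do not reproduce the LogStore API.

```scala
object ExpiredLogListing {
  // Assumed name prefix of the 0th checkpoint file: 20-digit zero-padded version + ".checkpoint".
  val zerothCheckpointPrefix: String = "%020d.checkpoint".format(0L)

  // Assumed naming patterns for delta and checkpoint files.
  def isDeltaFile(name: String): Boolean = name.matches("\\d{20}\\.json")
  def isCheckpointFile(name: String): Boolean = name.matches("\\d{20}\\.checkpoint.*\\.parquet")

  // Keeps names that sort at or after the 0th checkpoint prefix,
  // then narrows them down to delta and checkpoint files only.
  def candidates(names: Seq[String]): Seq[String] =
    names.sorted
      .filter(n => n.compareTo(zerothCheckpointPrefix) >= 0)
      .filter(n => isDeltaFile(n) || isCheckpointFile(n))
}
```

Starting from the lexicographic lower bound alone is not enough: other files in the directory can also sort after the prefix, which is why the second filter on file type is needed.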

In the end, listExpiredDeltaLogs creates a BufferingLogDeletionIterator that…​FIXME

listExpiredDeltaLogs is used exclusively when MetadataCleanup is requested to cleanUpExpiredLogs.