VacuumCommand Utility — Garbage Collecting Delta Table

VacuumCommand is a concrete VacuumCommandImpl that provides the gc utility for garbage collecting a delta table.

Enable the ALL logging level for the org.apache.spark.sql.delta.commands.VacuumCommand logger to see what happens inside.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.sql.delta.commands.VacuumCommand=ALL

Refer to Logging.

Garbage Collecting Of Delta Table — gc Utility

gc(
  spark: SparkSession,
  deltaLog: DeltaLog,
  dryRun: Boolean = true,
  retentionHours: Option[Double] = None,
  clock: Clock = new SystemClock): DataFrame

gc requests the given DeltaLog to update (which gives the latest Snapshot of the delta table).

gc…​FIXME (deleteBeforeTimestamp)

gc prints out the following INFO message to the logs:

Starting garbage collection (dryRun = [dryRun]) of untracked files older than [deleteBeforeTimestamp] in [path]
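
To illustrate the [deleteBeforeTimestamp] placeholder in the message above: a cutoff of this kind could be derived by subtracting a retention period from the current clock time. The following is a minimal sketch under that assumption, not the actual gc code. SimpleClock, FixedClock, cutoffTimestamp and defaultRetentionMs are hypothetical names introduced only for this example.

```scala
import java.util.concurrent.TimeUnit

// Hypothetical stand-in for the Clock parameter of gc
trait SimpleClock { def getTimeMillis(): Long }

// A clock frozen at a given instant (keeps the example deterministic)
final class FixedClock(now: Long) extends SimpleClock {
  def getTimeMillis(): Long = now
}

object CutoffSketch {
  // Assumption: files last modified before the returned timestamp
  // are candidates for deletion.
  // retentionHours mirrors the Option[Double] parameter of gc;
  // defaultRetentionMs is a hypothetical fallback when no retention is given.
  def cutoffTimestamp(
      clock: SimpleClock,
      retentionHours: Option[Double],
      defaultRetentionMs: Long): Long = {
    val retentionMs = retentionHours
      .map(hours => (hours * TimeUnit.HOURS.toMillis(1)).toLong)
      .getOrElse(defaultRetentionMs)
    clock.getTimeMillis() - retentionMs
  }
}
```

Passing the clock in, rather than reading the system time directly, is what makes a fixed cutoff reproducible in tests.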

gc requests the Snapshot for the state dataset and defines a function for every action (in a partition) that does the following:

  1. FIXME

gc converts the mapped state dataset (of actions) into a DataFrame with a single path column.

gc…​FIXME

gc caches the allFilesAndDirs dataset.

gc prints out the following INFO message to the logs:

Deleting untracked files and empty directories in [path]

gc…​FIXME

gc prints out the following message to standard output:

Deleted [filesDeleted] files and directories in a total of [dirCounts] directories.

gc…​FIXME

In the end, gc unpersists the allFilesAndDirs dataset.
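
The log messages above suggest the overall shape of the flow: find files that the table state does not track, keep only those older than the cutoff, and either report or delete them depending on dryRun. The sketch below models that idea with plain Scala collections. It is an assumption-laden illustration only: the real gc works on Spark datasets and also handles empty directories, and FileEntry, GcSketch, untrackedFiles and run are hypothetical names.

```scala
// Hypothetical model of a file found under the table directory
final case class FileEntry(path: String, modificationTime: Long)

object GcSketch {
  // Files that are not referenced by the table state (trackedPaths)
  // and were modified before the cutoff are deletion candidates.
  def untrackedFiles(
      allFiles: Seq[FileEntry],
      trackedPaths: Set[String],
      deleteBeforeTimestamp: Long): Seq[String] =
    allFiles
      .filter(f => !trackedPaths.contains(f.path))
      .filter(_.modificationTime < deleteBeforeTimestamp)
      .map(_.path)

  // Returns the candidates and the number of successful deletions.
  // With dryRun = true the delete callback is never invoked.
  def run(
      allFiles: Seq[FileEntry],
      trackedPaths: Set[String],
      deleteBeforeTimestamp: Long,
      dryRun: Boolean)(delete: String => Boolean): (Seq[String], Int) = {
    val candidates = untrackedFiles(allFiles, trackedPaths, deleteBeforeTimestamp)
    val deleted = if (dryRun) 0 else candidates.count(delete)
    (candidates, deleted)
  }
}
```

Keeping the deletion behind a callback means a dry run and a real run share the same selection logic, so the dry-run report matches what a real run would remove.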

gc is used when:

checkRetentionPeriodSafety Method

checkRetentionPeriodSafety(
  spark: SparkSession,
  retentionMs: Option[Long],
  configuredRetention: Long): Unit

checkRetentionPeriodSafety…​FIXME

checkRetentionPeriodSafety is used exclusively when the VacuumCommand utility is requested to gc.
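
The name and signature suggest a guard that rejects a requested retention period shorter than the configured one. The following is a minimal sketch under that assumption, not the actual implementation. The checkEnabled flag is a hypothetical stand-in for whatever setting would be read from the SparkSession, and the error messages are invented for the example.

```scala
object RetentionCheckSketch {
  // Throws IllegalArgumentException (via require) when the requested
  // retention is negative, or when it is shorter than the configured
  // retention while the check is enabled.
  def checkRetentionPeriodSafety(
      checkEnabled: Boolean, // hypothetical: replaces the SparkSession parameter
      retentionMs: Option[Long],
      configuredRetention: Long): Unit = {
    require(
      retentionMs.forall(_ >= 0),
      "Retention period cannot be negative")
    // No explicit retention (None) is always considered safe
    val retentionSafe = retentionMs.forall(_ >= configuredRetention)
    require(
      !checkEnabled || retentionSafe,
      s"Requested retention of ${retentionMs.getOrElse(configuredRetention)} ms " +
        s"is shorter than the configured $configuredRetention ms")
  }
}
```

In this sketch, disabling the check only skips the comparison against the configured retention; a negative value is still rejected.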