VacuumCommand Utility — Garbage Collecting Delta Table

VacuumCommand is a concrete VacuumCommandImpl for garbage collection (gc) of a delta table.

Garbage Collecting Of Delta Table — gc Utility

gc(
  spark: SparkSession,
  deltaLog: DeltaLog,
  dryRun: Boolean = true,
  retentionHours: Option[Double] = None,
  clock: Clock = new SystemClock): DataFrame

gc requests the given DeltaLog to update (which gives the latest Snapshot of the delta table).

gc…​FIXME (deleteBeforeTimestamp)
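The gc signature takes an optional retentionHours and a clock, and the INFO message reports a deleteBeforeTimestamp. The following is a minimal sketch of how such a cutoff could be derived from those two inputs. It is not the VacuumCommand code: the deleteBeforeTimestamp helper, the nowMillis parameter (standing in for the time reported by the clock) and defaultRetentionMillis (standing in for whatever retention the table is configured with) are all assumptions made for illustration.

```scala
import java.util.concurrent.TimeUnit

object VacuumCutoffSketch {
  // Hypothetical helper, not part of VacuumCommand.
  // nowMillis: current time in milliseconds (assumed to come from the clock)
  // defaultRetentionMillis: assumed fallback when no retentionHours is given
  def deleteBeforeTimestamp(
      nowMillis: Long,
      retentionHours: Option[Double],
      defaultRetentionMillis: Long): Long = {
    // Convert (possibly fractional) hours to milliseconds,
    // or fall back to the default retention
    val retentionMillis = retentionHours
      .map(hours => math.round(hours * TimeUnit.HOURS.toMillis(1)))
      .getOrElse(defaultRetentionMillis)
    // Files last modified before this instant are candidates for removal
    nowMillis - retentionMillis
  }
}
```

With this sketch, a larger retentionHours moves the cutoff further into the past, so fewer files qualify for deletion.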

gc prints out the following INFO message to the logs:

Starting garbage collection (dryRun = [dryRun]) of untracked files older than [deleteBeforeTimestamp] in [path]

gc requests the Snapshot for the state dataset and defines a function for every action (in a partition) that does the following:

  1. FIXME

gc converts the mapped state dataset (of actions) into a DataFrame with a single path column.

gc…​FIXME

gc caches the allFilesAndDirs dataset.

gc prints out the following INFO message to the logs:

Deleting untracked files and empty directories in [path]
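The two INFO messages speak of untracked files older than deleteBeforeTimestamp. The idea can be illustrated with plain Scala collections instead of Spark datasets. This is only a sketch under stated assumptions: ListedFile and untrackedOlderThan are hypothetical names that do not exist in Delta Lake, "tracked" is taken to mean "a path still referenced by the table state", and the modification time is assumed to be the age that is compared against the cutoff.

```scala
object UntrackedFilesSketch {
  // Hypothetical model of a file found under the table directory
  // (not a Delta Lake type)
  final case class ListedFile(path: String, modificationTime: Long)

  // Returns the paths that are not tracked and were last modified
  // before the cutoff
  def untrackedOlderThan(
      allFiles: Seq[ListedFile],
      trackedPaths: Set[String],
      deleteBeforeTimestamp: Long): Seq[String] =
    allFiles
      .filterNot(file => trackedPaths.contains(file.path)) // keep untracked only
      .filter(_.modificationTime < deleteBeforeTimestamp)  // keep old enough only
      .map(_.path)
}
```

A tracked file is never selected regardless of its age, and a recent untracked file is kept until it falls behind the cutoff.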

gc…​FIXME

gc prints out the following message to standard output:

Deleted [filesDeleted] files and directories in a total of [dirCounts] directories.

gc…​FIXME

In the end, gc unpersists the allFilesAndDirs dataset.

gc is used when:

checkRetentionPeriodSafety Method

checkRetentionPeriodSafety(
  spark: SparkSession,
  retentionMs: Option[Long],
  configuredRetention: Long): Unit

checkRetentionPeriodSafety…​FIXME

checkRetentionPeriodSafety is used exclusively when the VacuumCommand utility is requested to garbage collect a delta table (gc).
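The body of checkRetentionPeriodSafety is not described here, so the following is one plausible reading of the parameter names only, not the actual implementation. The sketch assumes the check rejects a negative retention and, when enabled, a requested retention shorter than the configured one. The checkEnabled flag is a hypothetical stand-in for whatever the SparkSession parameter contributes, and the error messages are made up.

```scala
object RetentionCheckSketch {
  // Hypothetical variant of the method: checkEnabled replaces the
  // SparkSession parameter so the sketch has no Spark dependency
  def checkRetentionPeriodSafety(
      checkEnabled: Boolean,
      retentionMs: Option[Long],
      configuredRetention: Long): Unit = {
    // A negative retention period makes no sense (assumed rule)
    require(retentionMs.forall(_ >= 0), "Retention period cannot be negative")

    // No explicit retention is treated as safe (assumed rule)
    val retentionSafe = retentionMs.forall(_ >= configuredRetention)

    // Fail only when the check is enabled and the request is too short
    require(
      !checkEnabled || retentionSafe,
      s"Requested retention ($retentionMs ms) is shorter than " +
        s"the configured retention ($configuredRetention ms)")
  }
}
```

Under these assumptions the method returns normally for a safe request and throws an IllegalArgumentException (from require) otherwise.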

Logging

Enable the ALL logging level for the org.apache.spark.sql.delta.commands.VacuumCommand logger to see what happens inside.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.sql.delta.commands.VacuumCommand=ALL

Refer to Logging.