Skip to content

DMLWithDeletionVectorsHelper

DMLWithDeletionVectorsHelper is a DeltaCommand with utilities for DML operations to work with Deletion Vectors.

createTargetDfForScanningForMatches

createTargetDfForScanningForMatches(
  spark: SparkSession,
  target: LogicalPlan,
  fileIndex: TahoeFileIndex): DataFrame

createTargetDfForScanningForMatches creates a DataFrame with replaceFileIndex logical operator (based on the given target and TahoeFileIndex)


createTargetDfForScanningForMatches is used when:

Replacing FileIndex

replaceFileIndex(
  target: LogicalPlan,
  fileIndex: TahoeFileIndex): LogicalPlan

replaceFileIndex replaces a FileIndex in all the delta tables in the given target logical plan (with some other changes).

replaceFileIndex transforms (recognizes) the following logical operators in given target logical plan:

  1. LogicalRelations with HadoopFsRelations (Spark SQL) with DeltaParquetFileFormat
  2. Projects

replaceFileIndex adds the following metadata columns to the output schema (of the logical operators):

  1. Delta-specific __delta_internal_row_index
  2. FileFormat-specific _metadata (Spark SQL)

In addition, for LogicalRelations, replaceFileIndex changes the HadoopFsRelation to use the following:

findTouchedFiles

findTouchedFiles(
  sparkSession: SparkSession,
  txn: OptimisticTransaction,
  hasDVsEnabled: Boolean,
  deltaLog: DeltaLog,
  targetDf: DataFrame,
  fileIndex: TahoeFileIndex,
  condition: Expression,
  opName: String): Seq[TouchedFileWithDV]
Supported Operators

findTouchedFiles supports DELETE and UPDATE commands only (as indicated by opName argument).

findTouchedFiles requests the given TahoeFileIndex (that is assumed to be a TahoeBatchFileIndex) for the data files.

In the end, findTouchedFiles findFilesWithMatchingRows with candidate file map and matched row index sets.


findTouchedFiles is used when:

findFilesWithMatchingRows

findFilesWithMatchingRows(
  txn: OptimisticTransaction,
  nameToAddFileMap: Map[String, AddFile],
  matchedFileRowIndexSets: Seq[DeletionVectorResult]): Seq[TouchedFileWithDV]

findFilesWithMatchingRows...FIXME

processUnmodifiedData

processUnmodifiedData(
  spark: SparkSession,
  touchedFiles: Seq[TouchedFileWithDV],
  snapshot: Snapshot): (Seq[FileAction], Map[String, Long])

Review Me

processUnmodifiedData calculates the following metrics (using the given touchedFiles):

processUnmodifiedData splits (partitions) the given TouchedFileWithDVs into fully removed ones and the others (based on isFullyReplaced flag).

processUnmodifiedData...FIXME

In the end, processUnmodifiedData returns a collection of the RemoveFile and AddFile actions along with the following metrics:

  • numModifiedRows
  • numRemovedFiles
  • numDeletionVectorsAdded
  • numDeletionVectorsRemoved
  • numDeletionVectorsUpdated

processUnmodifiedData is used when: