DMLWithDeletionVectorsHelper¶
DMLWithDeletionVectorsHelper is a DeltaCommand with utilities for DML operations to work with Deletion Vectors.
createTargetDfForScanningForMatches¶
createTargetDfForScanningForMatches(
spark: SparkSession,
target: LogicalPlan,
fileIndex: TahoeFileIndex): DataFrame
createTargetDfForScanningForMatches creates a DataFrame for the logical plan that replaceFileIndex produces (from the given target and TahoeFileIndex).
createTargetDfForScanningForMatches is used when:
- DeleteCommand is requested to performDelete (with deletion vectors enabled)
- UpdateCommand is requested to performUpdate (with deletion vectors enabled)
Replacing FileIndex¶
replaceFileIndex(
target: LogicalPlan,
fileIndex: TahoeFileIndex): LogicalPlan
replaceFileIndex replaces a FileIndex in all the delta tables in the given target logical plan (with some other changes).
replaceFileIndex transforms (recognizes) the following logical operators in the given target logical plan:
- LogicalRelations with HadoopFsRelations (Spark SQL) with DeltaParquetFileFormat
- Projects
replaceFileIndex adds the following metadata columns to the output schema (of the logical operators):
- Delta-specific __delta_internal_row_index
- FileFormat-specific _metadata (Spark SQL)
In addition, for LogicalRelations, replaceFileIndex changes the HadoopFsRelation to use the following:
- The given TahoeFileIndex as the FileIndex (Spark SQL)
- The DeltaParquetFileFormat with splitting and pushdowns disabled
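The rewrite described above can be pictured with a toy plan tree. This is a minimal sketch, not the Delta Lake implementation: `Plan`, `Relation`, `Project`, `Filter` and `ReplaceFileIndexSketch` are hypothetical stand-ins for Spark SQL's logical operators, file indices are modelled as plain strings, and a single `splittable` flag stands in for the file format options.

```scala
// Toy plan model. All names here are hypothetical, not Spark SQL or Delta Lake classes.
sealed trait Plan
final case class Relation(fileIndex: String, splittable: Boolean, output: Seq[String]) extends Plan
final case class Project(output: Seq[String], child: Plan) extends Plan
final case class Filter(condition: String, child: Plan) extends Plan

object ReplaceFileIndexSketch {
  // The two metadata columns named in the text above
  val extraColumns: Seq[String] = Seq("__delta_internal_row_index", "_metadata")

  def replaceFileIndex(target: Plan, newIndex: String): Plan = target match {
    // Relations get the new file index, splitting disabled, and the extra columns
    case r: Relation =>
      r.copy(fileIndex = newIndex, splittable = false, output = r.output ++ extraColumns)
    // Projects only have their output extended; their child is rewritten recursively
    case p: Project =>
      p.copy(output = p.output ++ extraColumns, child = replaceFileIndex(p.child, newIndex))
    // Any other operator is left as-is, apart from rewriting its child
    case f: Filter =>
      f.copy(child = replaceFileIndex(f.child, newIndex))
  }
}
```

The recursion mirrors the idea of a tree transform that only touches the operators it recognizes and passes everything else through.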
findTouchedFiles¶
findTouchedFiles(
sparkSession: SparkSession,
txn: OptimisticTransaction,
hasDVsEnabled: Boolean,
deltaLog: DeltaLog,
targetDf: DataFrame,
fileIndex: TahoeFileIndex,
condition: Expression,
opName: String): Seq[TouchedFileWithDV]
Supported Operators
findTouchedFiles supports DELETE and UPDATE commands only (as indicated by the opName argument).
findTouchedFiles requests the given TahoeFileIndex (that is assumed to be a TahoeBatchFileIndex) for the data files.
In the end, findTouchedFiles calls findFilesWithMatchingRows with the candidate file map and the matched row index sets.
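As an illustration of the "candidate file map" idea only, the sketch below keys data files by path after checking the operation name. It is an assumption-laden sketch: `DataFile` is a hypothetical stand-in for AddFile, and `FindTouchedFilesSketch`, `candidateFileMap` and the `require` guard are invented for this example and are not taken from Delta Lake.

```scala
// Hypothetical stand-in for AddFile; only the path matters for this sketch
final case class DataFile(path: String, numRecords: Long)

object FindTouchedFilesSketch {
  // The text above names DELETE and UPDATE as the only supported operations
  private val supportedOps = Set("DELETE", "UPDATE")

  // Checks opName (an assumed guard), then builds a path-keyed lookup of candidate files
  def candidateFileMap(opName: String, candidates: Seq[DataFile]): Map[String, DataFile] = {
    require(supportedOps.contains(opName), s"Unsupported operation: $opName")
    candidates.map(f => f.path -> f).toMap
  }
}
```

A path-keyed map makes it cheap to look up the file entry for each matched row index set by file path.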
findTouchedFiles is used when:
- DeleteCommand is requested to performDelete
- UpdateCommand is requested to performUpdate
findFilesWithMatchingRows¶
findFilesWithMatchingRows(
txn: OptimisticTransaction,
nameToAddFileMap: Map[String, AddFile],
matchedFileRowIndexSets: Seq[DeletionVectorResult]): Seq[TouchedFileWithDV]
findFilesWithMatchingRows ...FIXME
processUnmodifiedData¶
processUnmodifiedData(
spark: SparkSession,
touchedFiles: Seq[TouchedFileWithDV],
snapshot: Snapshot): (Seq[FileAction], Map[String, Long])
Review Me
processUnmodifiedData calculates the following metrics (using the given touchedFiles):
- The total number of modified rows
- The number of isFullyReplaced data files
processUnmodifiedData splits (partitions) the given TouchedFileWithDVs into fully removed ones and the others (based on isFullyReplaced flag).
processUnmodifiedData ...FIXME
In the end, processUnmodifiedData returns a collection of the RemoveFile and AddFile actions along with the following metrics:
- numModifiedRows
- numRemovedFiles
- numDeletionVectorsAdded
- numDeletionVectorsRemoved
- numDeletionVectorsUpdated
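The first two metrics and the split on the isFullyReplaced flag can be sketched with plain Scala collections. This is a simplified sketch under stated assumptions: `TouchedFile` is a hypothetical stand-in for TouchedFileWithDV, `UnmodifiedDataSketch.summarize` is an invented name, and the deletion-vector metrics and the file actions are left out because the text above does not describe how they are derived.

```scala
// Hypothetical, simplified stand-in for TouchedFileWithDV (not the Delta Lake class)
final case class TouchedFile(path: String, numberOfModifiedRows: Long, isFullyReplaced: Boolean)

object UnmodifiedDataSketch {
  // Returns (fully removed files, all other files, metrics)
  def summarize(
      touchedFiles: Seq[TouchedFile]): (Seq[TouchedFile], Seq[TouchedFile], Map[String, Long]) = {
    // Total number of modified rows across all touched files
    val numModifiedRows: Long = touchedFiles.map(_.numberOfModifiedRows).sum
    // Number of files whose rows were all replaced
    val numRemovedFiles: Long = touchedFiles.count(_.isFullyReplaced).toLong
    // Split into fully removed files and the rest, based on the flag
    val (fullyRemoved, others) = touchedFiles.partition(_.isFullyReplaced)
    val metrics = Map(
      "numModifiedRows" -> numModifiedRows,
      "numRemovedFiles" -> numRemovedFiles)
    (fullyRemoved, others, metrics)
  }
}
```

For example, calling `UnmodifiedDataSketch.summarize` on one partially modified file and one fully replaced file puts the latter in the first collection and reports `numRemovedFiles` as 1.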
processUnmodifiedData is used when:
- DeleteCommand is requested to performDelete (with shouldWritePersistentDeletionVectors enabled and supported)