DMLWithDeletionVectorsHelper¶
DMLWithDeletionVectorsHelper is a DeltaCommand with utilities for DML operations to work with Deletion Vectors.
createTargetDfForScanningForMatches¶
createTargetDfForScanningForMatches(
  spark: SparkSession,
  target: LogicalPlan,
  fileIndex: TahoeFileIndex): DataFrame
createTargetDfForScanningForMatches creates a DataFrame for the logical plan that replaceFileIndex produces from the given target and TahoeFileIndex.
createTargetDfForScanningForMatches is used when:
- DeleteCommand is requested to performDelete (with deletion vectors enabled)
- UpdateCommand is requested to performUpdate (with deletion vectors enabled)
Replacing FileIndex¶
replaceFileIndex(
  target: LogicalPlan,
  fileIndex: TahoeFileIndex): LogicalPlan
replaceFileIndex replaces a FileIndex in all the delta tables in the given target logical plan (with some other changes).
replaceFileIndex transforms (recognizes) the following logical operators in the given target logical plan:
- LogicalRelations with HadoopFsRelations (Spark SQL) with DeltaParquetFileFormat
- Projects
replaceFileIndex adds the following metadata columns to the output schema (of the logical operators):
- Delta-specific __delta_internal_row_index
- FileFormat-specific _metadata (Spark SQL)
In addition, for LogicalRelations, replaceFileIndex changes the HadoopFsRelation to use the following:
- The given TahoeFileIndex as the FileIndex (Spark SQL)
- The DeltaParquetFileFormat with splitting and pushdowns disabled
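The rewrite described above can be pictured as a recursive pattern match over a plan tree. The following is a minimal sketch on a toy model, not the Delta Lake implementation: Plan, Relation, Project and Filter are hypothetical stand-ins for the Catalyst operators, strings stand in for FileIndex and the file format, and the "delta-parquet" marker is an assumption used only to tell Delta relations apart.

```scala
object ReplaceFileIndexSketch {
  // Toy stand-ins for Spark SQL logical operators (NOT the real Catalyst classes).
  sealed trait Plan
  final case class Relation(
      fileIndex: String,            // stands in for a FileIndex
      format: String,               // "delta-parquet" marks a Delta table in this toy model
      output: Seq[String],          // output column names
      splittable: Boolean = true,
      pushdowns: Boolean = true) extends Plan
  final case class Project(columns: Seq[String], child: Plan) extends Plan
  final case class Filter(condition: String, child: Plan) extends Plan

  // The two metadata columns named in the text.
  val RowIndexCol = "__delta_internal_row_index"
  val FileMetadataCol = "_metadata"

  // Appends the metadata columns unless they are already present.
  private def withMetadataCols(cols: Seq[String]): Seq[String] =
    (cols ++ Seq(RowIndexCol, FileMetadataCol)).distinct

  def replaceFileIndex(target: Plan, newIndex: String): Plan = target match {
    // Delta relation: swap the file index, extend the output, disable splitting and pushdowns.
    case r: Relation if r.format == "delta-parquet" =>
      r.copy(
        fileIndex = newIndex,
        output = withMetadataCols(r.output),
        splittable = false,
        pushdowns = false)
    // Non-Delta relations are left untouched.
    case r: Relation => r
    // Projections must pass the metadata columns through.
    case Project(cols, child) =>
      Project(withMetadataCols(cols), replaceFileIndex(child, newIndex))
    // Any other operator is kept as-is, with its child rewritten.
    case Filter(cond, child) =>
      Filter(cond, replaceFileIndex(child, newIndex))
  }
}
```

In this sketch a Filter sitting between a Project and a Relation is preserved unchanged, while both of the recognized operators gain the two metadata columns.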
findTouchedFiles¶
findTouchedFiles(
  sparkSession: SparkSession,
  txn: OptimisticTransaction,
  hasDVsEnabled: Boolean,
  deltaLog: DeltaLog,
  targetDf: DataFrame,
  fileIndex: TahoeFileIndex,
  condition: Expression,
  opName: String): Seq[TouchedFileWithDV]
Supported Operators
findTouchedFiles supports DELETE and UPDATE commands only (as indicated by the opName argument).
findTouchedFiles requests the given TahoeFileIndex (that is assumed to be a TahoeBatchFileIndex) for the data files.
In the end, findTouchedFiles calls findFilesWithMatchingRows with the candidate file map and the matched row index sets.
findTouchedFiles is used when:
- DeleteCommand is requested to performDelete
- UpdateCommand is requested to performUpdate
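As a rough illustration of the first steps of findTouchedFiles, the sketch below checks the operation name and builds a candidate file map. It is a simplified model under stated assumptions: DataFile is a hypothetical stand-in for AddFile, and keying the map by file path is inferred from the Map[String, AddFile] parameter of findFilesWithMatchingRows rather than confirmed by this page. How the matched row index sets are computed is not covered here.

```scala
object FindTouchedFilesSketch {
  // Hypothetical, simplified stand-in for Delta's AddFile action.
  final case class DataFile(path: String, sizeInBytes: Long)

  // Only these two commands are supported, per the text.
  val SupportedOps: Set[String] = Set("DELETE", "UPDATE")

  // Validates the operation name, then indexes the candidate files by path (an assumption).
  def candidateFileMap(opName: String, files: Seq[DataFile]): Map[String, DataFile] = {
    require(SupportedOps.contains(opName), s"Unsupported operation: $opName")
    files.map(f => f.path -> f).toMap
  }
}
```

With this model, passing "MERGE" as the operation name fails fast with an IllegalArgumentException (thrown by require) before any file is inspected.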
findFilesWithMatchingRows¶
findFilesWithMatchingRows(
  txn: OptimisticTransaction,
  nameToAddFileMap: Map[String, AddFile],
  matchedFileRowIndexSets: Seq[DeletionVectorResult]): Seq[TouchedFileWithDV]
findFilesWithMatchingRows...FIXME
processUnmodifiedData¶
processUnmodifiedData(
  spark: SparkSession,
  touchedFiles: Seq[TouchedFileWithDV],
  snapshot: Snapshot): (Seq[FileAction], Map[String, Long])
Review Me
processUnmodifiedData calculates the following metrics (using the given touchedFiles):
- The total number of modified rows
- The number of isFullyReplaced data files
processUnmodifiedData splits (partitions) the given TouchedFileWithDVs into fully removed ones and the others (based on isFullyReplaced flag).
processUnmodifiedData...FIXME
In the end, processUnmodifiedData returns a collection of the RemoveFile and AddFile actions along with the following metrics:
- numModifiedRows
- numRemovedFiles
- numDeletionVectorsAdded
- numDeletionVectorsRemoved
- numDeletionVectorsUpdated
processUnmodifiedData is used when:
- DeleteCommand is requested to performDelete (with shouldWritePersistentDeletionVectors enabled and supported)
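The two metrics computed up front and the split on the isFullyReplaced flag, as described for processUnmodifiedData, can be sketched with plain Scala collections. This is a minimal model under assumptions, not the actual implementation: TouchedFile is a hypothetical stand-in for TouchedFileWithDV with made-up field names, and the deletion-vector metrics are left out because this page does not say how they are derived.

```scala
object ProcessUnmodifiedDataSketch {
  // Hypothetical, simplified stand-in for TouchedFileWithDV.
  final case class TouchedFile(
      path: String,
      numModifiedRows: Long,
      isFullyReplaced: Boolean)

  final case class Result(
      fullyRemoved: Seq[TouchedFile],   // files whose rows are all replaced
      others: Seq[TouchedFile],         // files that keep some unmodified rows
      metrics: Map[String, Long])

  def summarize(touchedFiles: Seq[TouchedFile]): Result = {
    // Total number of modified rows across all touched files.
    val numModifiedRows = touchedFiles.map(_.numModifiedRows).sum
    // Number of data files flagged as fully replaced.
    val numRemovedFiles = touchedFiles.count(_.isFullyReplaced).toLong
    // Split into fully removed files and the rest.
    val (fullyRemoved, others) = touchedFiles.partition(_.isFullyReplaced)
    Result(
      fullyRemoved,
      others,
      Map("numModifiedRows" -> numModifiedRows, "numRemovedFiles" -> numRemovedFiles))
  }
}
```

partition keeps the original order within each group, so a caller can process fully removed files and the remaining files separately without re-sorting.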