MERGE Command¶
Delta Lake supports merging records into a delta table using the following high-level operators:
- MERGE INTO SQL command (Spark SQL)
- DeltaTable.merge
Merge command is executed as a transactional MergeIntoCommand.
While running a merge, MergeIntoCommand
can choose between the available MergeOutputGenerations (ClassicMergeExecutor or InsertOnlyMergeExecutor) for an optimized output of the merge command to write out.
SQL MERGE (DML Statement)
Quoting Wikipedia:
MERGE
(also called upsert) statements are used to simultaneouslyINSERT
new records orUPDATE
existing records depending on whether condition matches.
Merge command lets you transactionally execute multiple INSERT
, UPDATE
, and DELETE
DML statements.
Insert-Only Merges¶
Insert-Only Merges are MERGE queries with WHEN NOT MATCHED clauses only (DeltaMergeIntoNotMatchedClauses precisely that can only be WHEN NOT MATCHED THEN INSERT clauses).
notMatchedClause
: WHEN NOT MATCHED (BY TARGET)? (AND notMatchedCond)? THEN notMatchedAction
;
notMatchedAction
: INSERT *
| INSERT (columns) VALUES (expressions)
;
Delta Lake applies extra optimizations to insert-only merges only with spark.databricks.delta.merge.optimizeInsertOnlyMerge.enabled enabled.
For insert-only merges, merge becomes a writeOnlyInserts (using InsertOnlyMergeExecutor) (instead of writeAllChanges using ClassicMergeExecutor).
InsertOnlyMergeExecutor uses LEFT ANTI join to find the rows to insert (relying on Data Skipping along the way).
Single Insert-Only Merges¶
There is a special handling of single INSERT-only MERGEs.
MERGE INTO merge_demo to
USING merge_demo_source from
ON to.id = from.id
WHEN NOT MATCHED THEN INSERT *;
Schema Evolution (Auto Schema Merging)¶
Merge command supports schema evolution for star and non-star UPDATEs and INSERTs (with columns not in a target delta table) when schema.autoMerge.enabled is enabled.
Explore Star UPDATEs with Nested Columns
Star WHEN MATCHED UPDATEs with nested columns look very interesting.
Explore Schema Evolution and Types
Schema merging allows for implicit conversions so the type of the source may be different in the target (cf. the note at the end of resolveReferencesAndSchema).
Can this lead to any troubles? 🤔
Learn More
Until this note is here, learn more in resolveClause.
With Auto Schema Merging enabled, Delta Lake adds the new columns and nested fields to the target table (assigned to in merge actions of the WHEN MATCHED and the WHEN NOT MATCHED clauses that are not already part of the target schema).
NOT MATCHED BY SOURCE Excluded
NOT MATCHED BY SOURCEs are excluded since they can't reference source columns by definition and thus can't introduce new columns in the target schema.
Matched-Only Merges¶
A matched-only merge contains one or more WHEN MATCHED clauses only (with neither WHEN NOT MATCHED nor WHEN NOT MATCHED BY SOURCE clauses).
In other words, a matched-only merge is a merge with UPDATEs and DELETEs only, and no WHEN NOT MATCHED
clauses.
With merge.optimizeMatchedOnlyMerge.enabled enabled, Delta Lake optimizes matched-only merges to use a RIGHT OUTER join (instead of a FULL OUTER join) while writing out all merge changes.
Change Data Feed¶
With Change Data Feed enabled on a delta table that is the target table of a merge command, MergeIntoCommand
...FIXME
Merge and Joins¶
Merge command uses different joins when executed.
Repartition Before Write¶
writeFiles can repartition the output dataframe based on merge.repartitionBeforeWrite.enabled.
Materialized Source¶
The source of MERGE
command can be materialized based on spark.databricks.delta.merge.materializeSource configuration property.
Demo¶
Examples¶
Conditional Update with Delete¶
Tip
Use this notebook.
CREATE TABLE source
USING delta
AS VALUES
(0, 0),
(1, 10),
(2, 20) AS data(key, value);
CREATE TABLE target
USING delta
AS VALUES
(1, 1),
(2, 2),
(3, 3) AS data(key, value);
MERGE INTO target t
USING source s
ON s.key = t.key
WHEN MATCHED AND s.key <> 1 THEN UPDATE SET key = s.key, value = s.value
WHEN MATCHED THEN DELETE
Logging¶
Logging is configured using the logger of the MergeIntoCommand.
Learn More¶
- Optimizing Merge on Delta Lake (video)