Change Data Feed¶
Change Data Feed (CDF), also known as Change Data Capture (CDC), is a feature of Delta Lake that tracks row-level changes between versions of a delta table.
With a so-called CDC-Aware Table Scan (CDC Read), loading a delta table gives the data changes between versions (not the data of a particular version of the delta table).
As noted in this comment, CDCReader is the key class used for Change Data Feed (with DelayedCommitProtocol to handle it properly).
Non-CDC data is written out to the base directory of a delta table, while CDC data is written out to the _change_data special folder.
Change Data Feed is a new feature in Delta Lake 2.0.0 (that was tracked under Support for Change Data Feed in Delta Lake #1105).
Enabling CDF for a Delta table¶
Enable CDF for a table using delta.enableChangeDataFeed table property.
```sql
ALTER TABLE delta_demo
SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
```

```sql
CREATE TABLE delta_demo (id INT, name STRING, age INT)
USING delta
TBLPROPERTIES (delta.enableChangeDataFeed = true)
```
Additionally, this property can be set for all new tables by default.
```sql
SET spark.databricks.delta.properties.defaults.enableChangeDataFeed = true;
```
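To confirm that CDF is enabled for a given table, you can inspect its table properties (a quick check, assuming the `delta_demo` table from the examples above):

```sql
SHOW TBLPROPERTIES delta_demo ('delta.enableChangeDataFeed')
```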
Options¶
Change Data Feed is enabled in batch and streaming queries using readChangeFeed option.
```scala
spark
  .read
  .format("delta")
  .option("readChangeFeed", "true")
  .option("startingVersion", startingVersion)
  .option("endingVersion", endingVersion)
  .table("source")
```

```scala
spark
  .readStream
  .format("delta")
  .option("readChangeFeed", "true")
  .option("startingVersion", startingVersion)
  .table("source")
```
The readChangeFeed option is used alongside the other CDC options: startingVersion, startingTimestamp, endingVersion and endingTimestamp.
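A CDF read extends the table schema with change-tracking metadata columns: `_change_type`, `_commit_version` and `_commit_timestamp`. A minimal sketch (assuming a CDF-enabled `source` table, a running SparkSession with `spark.implicits._` imported, and illustrative version numbers):

```scala
// Load only the deletes recorded between two versions of the table
// (startingVersion and endingVersion are inclusive version numbers)
val deletes = spark
  .read
  .format("delta")
  .option("readChangeFeed", "true")
  .option("startingVersion", 0)
  .option("endingVersion", 5)
  .table("source")
  .where($"_change_type" === "delete")
  .select($"_change_type", $"_commit_version", $"_commit_timestamp")
```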
_change_type Column¶
The _change_type column represents the type of a row-level change.
| _change_type | Command |
|--------------|---------|
| delete | DeleteCommand |
| FIXME | |
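For illustration, an UPDATE on a CDF-enabled table records both the pre- and post-update state of each changed row. A hedged sketch using the `delta_demo` table from above (the `table_changes` table-valued function and the starting version are assumptions; the exact versions depend on the table's history):

```sql
UPDATE delta_demo SET age = age + 1 WHERE id = 1;

-- Read the changes back; the updated row appears twice:
-- once as update_preimage (the row before the change)
-- and once as update_postimage (the row after the change)
SELECT id, age, _change_type, _commit_version
FROM table_changes('delta_demo', 0)
WHERE _change_type IN ('update_preimage', 'update_postimage');
```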
Protocol¶
Change Data Feed requires the minimum protocol version to be 1 for readers and 4 for writers.
Column Mapping Not Supported¶
Change data feed reads are currently not supported on tables with column mapping enabled (and a DeltaUnsupportedOperationException is thrown).