
CDCReader

CDCReader is a CDCReaderImpl.

The CDCReader utility plays a key role in Change Data Capture in Delta Lake (per this comment).

_change_data Directory

CDCReader uses _change_data as the name of the directory (under the data directory) where data changes of a delta table are written out (using DelayedCommitProtocol).

This directory may contain partition directories.

Used when:

CDF Virtual Columns

CDC_COLUMNS_IN_DATA: Seq[String]

CDCReader defines CDC_COLUMNS_IN_DATA as a collection of the following CDF-specific column names:

CDC_COLUMNS_IN_DATA is used when:

__is_cdc Virtual Partition Column

CDCReader defines __is_cdc as the name of the column to partition on when Change Data Feed is enabled.

The __is_cdc column is added when TransactionalWrite is requested to performCDCPartition with CDF enabled on a delta table (and _change_type is among the columns).

If added, the __is_cdc column becomes the first partitioning column. It is then "consumed" by DelayedCommitProtocol (to write changes to files with the cdc- prefix rather than part-).

__is_cdc is a virtual column.
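
As a rough illustration of the flow above, the following plain-Scala sketch models how a row's _change_type could drive the __is_cdc partition value and, in turn, the file-name prefix. Row, IsCdcSketch, isCdc and filePrefix are hypothetical names used only here. The rule that __is_cdc is true exactly when _change_type is set is an assumption of this sketch, and none of it is Delta Lake's actual code.

```scala
// Simplified model only; not the DelayedCommitProtocol implementation.
// A row with an optional change type (None models a null _change_type).
final case class Row(id: Int, changeType: Option[String])

object IsCdcSketch {
  // Assumption: __is_cdc is true exactly when _change_type is set.
  def isCdc(row: Row): Boolean = row.changeType.isDefined

  // Change rows go to cdc- files, main-table rows to part- files.
  def filePrefix(row: Row): String =
    if (isCdc(row)) "cdc-" else "part-"
}
```

For example, IsCdcSketch.filePrefix(Row(1, Some("insert"))) selects the cdc- prefix, while a row with no change type selects part-.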

Used when:

Change Type Column

CDCReader defines _change_type column name that represents the type of a data change.

Change Type        Command
delete             Delete
insert             WriteIntoDelta
update_postimage   Update
update_preimage    Update

_change_type is a CDF virtual column and among the columns in the CDF-aware read schema.

_change_type is among the cdcAttributes.
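
The change type listing above can be captured as a small lookup, which may help when reasoning about which command produced a given change row. This is a minimal sketch in plain Scala; ChangeTypes, commandByChangeType and changeTypesOf are hypothetical names and are not part of CDCReader.

```scala
object ChangeTypes {
  // Change type -> command, copied from the listing above.
  val commandByChangeType: Map[String, String] = Map(
    "delete"           -> "Delete",
    "insert"           -> "WriteIntoDelta",
    "update_postimage" -> "Update",
    "update_preimage"  -> "Update"
  )

  // Hypothetical helper: all change types a given command can emit.
  def changeTypesOf(command: String): Set[String] =
    commandByChangeType.collect { case (ct, cmd) if cmd == command => ct }.toSet
}
```

Note that Update is the only command in the listing that maps to two change types (the row before and after the change).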

Commit Version Column

CDCReader defines _commit_version column name that represents...FIXME

_commit_version is among the DELTA_INTERNAL_COLUMNS.

_commit_version is among the cdcAttributes and the CDF-aware read schema.

Used when:

CDC_TYPE_NOT_CDC Literal

CDC_TYPE_NOT_CDC: Literal

CDCReader defines CDC_TYPE_NOT_CDC as a Literal expression with a null value (of StringType).

CDC_TYPE_NOT_CDC is used as a special sentinel value for rows that are part of the main table rather than change data.

CDC_TYPE_NOT_CDC is used by DML commands when executed with Change Data Feed enabled:

All commands except DeleteCommand use CDC_TYPE_NOT_CDC with _change_type as follows:

Column(CDC_TYPE_NOT_CDC).as("_change_type")

DeleteCommand uses CDC_TYPE_NOT_CDC as follows:

.withColumn(
  "_change_type",
  Column(If(filterCondition, CDC_TYPE_NOT_CDC, Literal("delete")))
)
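
To make the effect of the If expression concrete, here is a minimal plain-Scala sketch without Spark. Rows for which filterCondition holds get a null _change_type (modeled as None, standing in for CDC_TYPE_NOT_CDC), and all other rows are tagged "delete". TaggedRow, DeleteTagging and tag are hypothetical names, and rows are reduced to integer ids for brevity.

```scala
// Simplified model only; the real code builds a Catalyst If expression.
final case class TaggedRow(id: Int, changeType: Option[String])

object DeleteTagging {
  // Mirrors If(filterCondition, CDC_TYPE_NOT_CDC, Literal("delete")):
  // None stands in for the null CDC_TYPE_NOT_CDC literal.
  def tag(ids: Seq[Int], filterCondition: Int => Boolean): Seq[TaggedRow] =
    ids.map { id =>
      TaggedRow(id, if (filterCondition(id)) None else Some("delete"))
    }
}
```

Calling DeleteTagging.tag(Seq(1, 2, 3), _ != 2) leaves rows 1 and 3 untagged and tags row 2 as "delete".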

CDC_TYPE_NOT_CDC is used when (with Change Data Feed enabled):

insert Change Type

CDCReader uses insert value as the value of the _change_type column for the following: