CDCReader¶

CDCReader is a CDCReaderImpl.

CDCReader utility plays the key role in Change Data Capture in Delta Lake (per this comment).

_change_data Directory¶

CDCReader uses _change_data as the name of the directory (under the data directory) where data changes of a delta table are written out (using DelayedCommitProtocol).

This directory may contain partition directories.

Used when:

DelayedCommitProtocol is requested for the newTaskTempFile
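
As a rough illustration of the directory layout described above, the following sketch builds the location of a change data directory for a given table directory and partition values. It is not the Delta Lake implementation: `ChangeDataPathSketch`, `CdcLocation` and `changeDataDir` are hypothetical names, and the Hive-style `name=value` partition directories are an assumption.

```scala
import java.nio.file.{Path, Paths}

object ChangeDataPathSketch {
  // Directory name taken from the text above
  val CdcLocation = "_change_data"

  // Hypothetical helper: resolves _change_data under the table directory,
  // then appends one directory per partition column
  // (assumed to follow the Hive-style name=value convention).
  def changeDataDir(tableDir: Path, partitionValues: Seq[(String, String)]): Path =
    partitionValues.foldLeft(tableDir.resolve(CdcLocation)) {
      case (dir, (name, value)) => dir.resolve(s"$name=$value")
    }
}
```

With no partition values, the helper returns the _change_data directory directly under the table directory.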

CDF Virtual Columns¶

CDC_COLUMNS_IN_DATA: Seq[String]

CDCReader defines CDC_COLUMNS_IN_DATA as a collection of the following CDF-specific column names:

__is_cdc
_change_type

CDC_COLUMNS_IN_DATA is used when:

ColumnWithDefaultExprUtils is requested to addDefaultExprsOrReturnConstraints
DeltaColumnMappingBase is requested for the DELTA_INTERNAL_COLUMNS

__is_cdc Virtual Partition Column¶

CDCReader defines __is_cdc as the name of the column to partition on when Change Data Feed is enabled.

__is_cdc column is added when TransactionalWrite is requested to performCDCPartition with CDF enabled on a delta table (and _change_type among the columns).

If added, the __is_cdc column becomes the first partitioning column. It is then "consumed" by DelayedCommitProtocol (to write changes to files prefixed with cdc- rather than part-).

__is_cdc is a virtual column.

Used when:

DelayedCommitProtocol is requested to getFileName and buildActionFromAddedFile
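
The way the leading __is_cdc partition value could be "consumed" to choose a file name prefix can be sketched as follows. This is a simplified illustration, not the DelayedCommitProtocol code: `CdcFileNameSketch` and `prefixAndPartitions` are hypothetical names, and the string value `"true"` marking change data rows is an assumption.

```scala
object CdcFileNameSketch {
  // Column name taken from the text above
  val CdcPartitionCol = "__is_cdc"

  // Hypothetical helper: strips a leading __is_cdc partition value (if any)
  // and returns the file name prefix together with the remaining partition values.
  def prefixAndPartitions(
      partitionValues: List[(String, String)]): (String, List[(String, String)]) =
    partitionValues match {
      // Assumption: change data rows carry __is_cdc = "true"
      case (CdcPartitionCol, "true") :: rest => ("cdc-", rest)
      // Any other __is_cdc value means regular table data
      case (CdcPartitionCol, _) :: rest => ("part-", rest)
      // No __is_cdc column at all: regular table data, partitions untouched
      case other => ("part-", other)
    }
}
```

Because __is_cdc is virtual, the sketch drops it from the partition values in every case, so it never shows up among the partition columns of the written files.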

Change Type Column¶

CDCReader defines _change_type column name that represents the type of a data change.

Change Type	Command
delete	Delete
insert	WriteIntoDelta
update_postimage	Update
update_preimage	Update

_change_type is a CDF virtual column and among the columns in the CDF-aware read schema.

_change_type is among the cdcAttributes.

Commit Version Column¶

CDCReader defines _commit_version column name that represents...FIXME

_commit_version is among the DELTA_INTERNAL_COLUMNS.

_commit_version is among the cdcAttributes and the CDF-aware read schema.

Used when:

CdcAddFileIndex is requested for the matching files
TahoeChangeFileIndex is requested for the matching files and the partitionSchema
TahoeRemoveFileIndex is requested for the matching files

CDC_TYPE_NOT_CDC Literal¶

CDC_TYPE_NOT_CDC: Literal

CDCReader defines CDC_TYPE_NOT_CDC value as a Literal expression with null value (of StringType type).

CDC_TYPE_NOT_CDC is used as a special sentinel value for rows that are part of the main table rather than change data.

CDC_TYPE_NOT_CDC is used by DML commands when executed with Change Data Feed enabled:

All commands except DeleteCommand use CDC_TYPE_NOT_CDC with _change_type as follows:

Column(CDC_TYPE_NOT_CDC).as("_change_type")

DeleteCommand uses CDC_TYPE_NOT_CDC as follows:

.withColumn(
  "_change_type",
  Column(If(filterCondition, CDC_TYPE_NOT_CDC, Literal("delete")))
)
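
The effect of the `If` expression above can be modelled in plain Scala, without Spark. This is only a sketch: `DeleteChangeTypeSketch`, `Row` and `tag` are hypothetical names, the null literal is modelled as `None`, and `filterCondition` is assumed to be true for rows that stay in the table.

```scala
object DeleteChangeTypeSketch {
  // Stand-in for CDC_TYPE_NOT_CDC: a null string, modelled here as None
  val CdcTypeNotCdc: Option[String] = None

  // Hypothetical row: an id plus its _change_type value
  final case class Row(id: Int, changeType: Option[String])

  // Mirrors If(filterCondition, CDC_TYPE_NOT_CDC, Literal("delete")):
  // rows matching filterCondition are regular table data (no change type),
  // all other rows are tagged as deletes.
  def tag(ids: Seq[Int], filterCondition: Int => Boolean): Seq[Row] =
    ids.map { id =>
      Row(id, if (filterCondition(id)) CdcTypeNotCdc else Some("delete"))
    }
}
```

For example, `DeleteChangeTypeSketch.tag(Seq(1, 2, 3), _ != 2)` leaves rows 1 and 3 without a change type and tags row 2 with `delete`.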

CDC_TYPE_NOT_CDC is used when (with Change Data Feed enabled):

DeleteCommand is requested to rewriteFiles
MergeIntoCommand is requested to run a merge (a non-insert-only merge, or one with merge.optimizeInsertOnlyMerge.enabled disabled; such a merge uses ClassicMergeExecutor to write out merge changes with generateWriteAllChangesOutputCols and generateCdcAndOutputRows)
UpdateCommand is requested to withUpdatedColumns
WriteIntoDelta is requested to write

insert Change Type¶

CDCReader uses insert as the value of the _change_type column when:

WriteIntoDelta is requested to write data out (with isCDCEnabledOnTable)
MergeOutputGeneration is requested to generateAllActionExprs and generateCdcAndOutputRows
CdcAddFileIndex is requested for the matching files