CDCReader¶
CDCReader is a CDCReaderImpl.
The CDCReader utility plays a key role in Change Data Capture in Delta Lake (per this comment).
_change_data Directory¶
CDCReader uses _change_data as the name of the directory (under the data directory) where data changes of a delta table are written out (using DelayedCommitProtocol).
This directory may contain partition directories.
Used when:
DelayedCommitProtocol is requested for the newTaskTempFile
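The directory layout described above can be sketched in plain Scala. This is only an illustration under assumptions: `ChangeDataPaths`, `changeFilePath` and the example file name are hypothetical and not part of Delta Lake, and the `name=value` form of partition directories is assumed to follow the usual Hive-style convention.

```scala
// Hypothetical helper (not part of Delta Lake) that models where a change
// data file lands relative to the data directory of a delta table.
object ChangeDataPaths {
  // Directory name as given in the text above.
  val CdcLocation: String = "_change_data"

  /** Joins the table path, the _change_data directory, any partition
    * directories (assumed Hive-style name=value) and the file name. */
  def changeFilePath(
      tablePath: String,
      partitionValues: Seq[(String, String)],
      fileName: String): String = {
    val partitionDirs = partitionValues.map { case (name, value) => s"$name=$value" }
    val segments = (Seq(tablePath.stripSuffix("/"), CdcLocation) ++ partitionDirs) :+ fileName
    segments.mkString("/")
  }
}
```

For example, `ChangeDataPaths.changeFilePath("/tmp/events", Seq("date" -> "2024-01-01"), "cdc-0001.parquet")` gives a path under `/tmp/events/_change_data/date=2024-01-01/`.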
CDF Virtual Columns¶
CDC_COLUMNS_IN_DATA: Seq[String]
CDCReader defines CDC_COLUMNS_IN_DATA as a collection of the CDF-specific column names. Going by the sections below, these include the __is_cdc and _change_type virtual columns.
CDC_COLUMNS_IN_DATA is used when:
ColumnWithDefaultExprUtils is requested to addDefaultExprsOrReturnConstraints
DeltaColumnMappingBase is requested for the DELTA_INTERNAL_COLUMNS
__is_cdc Virtual Partition Column¶
CDCReader defines __is_cdc as the name of the column to partition on when Change Data Feed is enabled.
__is_cdc column is added when TransactionalWrite is requested to performCDCPartition with CDF enabled on a delta table (and _change_type among the columns).
If added, the __is_cdc column becomes the first partitioning column. It is then "consumed" by DelayedCommitProtocol, which writes changes to files prefixed with cdc- rather than part-.
__is_cdc is a virtual column.
Used when:
DelayedCommitProtocol is requested to getFileName and buildActionFromAddedFile
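How a leading __is_cdc partition value could be "consumed" to pick the file prefix can be sketched as follows. This is a simplified model, not the DelayedCommitProtocol code: `CdcFileNaming` and `prefixAndPartitions` are hypothetical names, and the `"true"` string as the value that marks change data is an assumption.

```scala
// Hypothetical model (not Delta Lake code) of consuming the __is_cdc
// virtual partition column before a file name is chosen.
object CdcFileNaming {
  val CdcPartitionCol: String = "__is_cdc"

  /** Drops a leading __is_cdc partition value and returns the file prefix
    * together with the remaining (real) partition values.
    * Assumption: the value "true" marks rows that belong to change data. */
  def prefixAndPartitions(
      partitionValues: Seq[(String, String)]): (String, Seq[(String, String)]) =
    partitionValues match {
      case (CdcPartitionCol, "true") +: rest => ("cdc-", rest)
      case (CdcPartitionCol, _) +: rest      => ("part-", rest)
      case other                             => ("part-", other)
    }
}
```

Because __is_cdc is virtual, the sketch removes it from the partition values in every branch, so it never shows up as a partition directory of its own.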
Change Type Column¶
CDCReader defines _change_type column name that represents the type of a data change.
| Change Type | Command |
|---|---|
| delete | Delete |
| insert | WriteIntoDelta |
| update_postimage | Update |
| update_preimage | Update |
_change_type is a CDF virtual column and among the columns in the CDF-aware read schema.
_change_type is among the cdcAttributes.
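The change types in the table above can be illustrated with a small plain-Scala model. The names `ChangeTypes`, `Change`, `forInsert`, `forDelete` and `forUpdate` are hypothetical, and reading update_preimage as the row before an update and update_postimage as the row after it is an assumption based on the names.

```scala
// Hypothetical model (not Delta Lake code) of the change rows that each
// kind of data change could contribute to the change data.
object ChangeTypes {
  final case class Change[A](row: A, changeType: String)

  def forInsert[A](row: A): Seq[Change[A]] =
    Seq(Change(row, "insert"))

  def forDelete[A](row: A): Seq[Change[A]] =
    Seq(Change(row, "delete"))

  // Assumption: an update yields two change rows, the row as it was
  // before the update and the row as it is after the update.
  def forUpdate[A](before: A, after: A): Seq[Change[A]] =
    Seq(Change(before, "update_preimage"), Change(after, "update_postimage"))
}
```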
Commit Version Column¶
CDCReader defines _commit_version column name that represents...FIXME
_commit_version is among the DELTA_INTERNAL_COLUMNS.
_commit_version is among the cdcAttributes and the CDF-aware read schema.
Used when:
CdcAddFileIndex is requested for the matching files
TahoeChangeFileIndex is requested for the matching files and the partitionSchema
TahoeRemoveFileIndex is requested for the matching files
CDC_TYPE_NOT_CDC Literal¶
CDC_TYPE_NOT_CDC: Literal
CDCReader defines CDC_TYPE_NOT_CDC value as a Literal expression with null value (of StringType type).
CDC_TYPE_NOT_CDC is used as a special sentinel value for rows that are part of the main table rather than change data.
CDC_TYPE_NOT_CDC is used by DML commands when executed with Change Data Feed enabled:
All commands except DeleteCommand use CDC_TYPE_NOT_CDC with _change_type as follows:
Column(CDC_TYPE_NOT_CDC).as("_change_type")
DeleteCommand uses CDC_TYPE_NOT_CDC as follows:
.withColumn(
"_change_type",
Column(If(filterCondition, CDC_TYPE_NOT_CDC, Literal("delete")))
)
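The effect of the If expression above can be modelled without Spark. In this sketch, which is not the DeleteCommand implementation, the null StringType literal is represented as `None`, and `filterCondition` is assumed to hold for rows that stay in the main table; `DeleteChangeType`, `changeType` and `tag` are hypothetical names.

```scala
// Hypothetical plain-Scala model (not Delta Lake code) of
// If(filterCondition, CDC_TYPE_NOT_CDC, Literal("delete")).
object DeleteChangeType {
  // Stands in for CDC_TYPE_NOT_CDC: a null string, modelled here as None.
  val NotCdc: Option[String] = None

  /** Rows for which the filter condition holds are main-table rows (no
    * change type); all other rows are recorded as "delete" changes. */
  def changeType(filterConditionHolds: Boolean): Option[String] =
    if (filterConditionHolds) NotCdc else Some("delete")

  /** Pairs every row with its _change_type value. */
  def tag[A](rows: Seq[A])(filterCondition: A => Boolean): Seq[(A, Option[String])] =
    rows.map(row => row -> changeType(filterCondition(row)))
}
```

With `DeleteChangeType.tag(Seq(1, 2, 3))(_ != 2)`, rows 1 and 3 carry no change type while row 2 is tagged as `Some("delete")`.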
CDC_TYPE_NOT_CDC is used when (with Change Data Feed enabled):
DeleteCommand is requested to rewriteFiles
MergeIntoCommand is requested to run a merge (for a non-insert-only merge or with merge.optimizeInsertOnlyMerge.enabled disabled that uses ClassicMergeExecutor to write out merge changes with generateWriteAllChangesOutputCols and generateCdcAndOutputRows)
UpdateCommand is requested to withUpdatedColumns
WriteIntoDelta is requested to write
insert Change Type¶
CDCReader uses insert value as the value of the _change_type column for the following:
WriteIntoDelta is requested to write data out (with isCDCEnabledOnTable)
MergeOutputGeneration is requested to generateAllActionExprs and generateCdcAndOutputRows
CdcAddFileIndex is requested for the matching files