CDCReader¶
`CDCReader` is a `CDCReaderImpl`.
The `CDCReader` utility plays the key role in Change Data Capture in Delta Lake (per this comment).
_change_data Directory¶
`CDCReader` uses `_change_data` as the name of the directory (under the data directory) where data changes of a delta table are written out (using DelayedCommitProtocol).
This directory may contain partition directories.
Used when:
* `DelayedCommitProtocol` is requested for the newTaskTempFile
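The layout described above can be illustrated with a small, dependency-free sketch. It is not Delta Lake's code: `ChangeDataPathSketch`, `changeFilePath`, and the sample file name are hypothetical, and the escaping of partition values that a real writer performs is left out.

```scala
// A minimal sketch (not Delta Lake's implementation) of where a change data
// file would land relative to the table's data directory.
object ChangeDataPathSketch {
  // Directory name used for change data, as described in this section.
  val ChangeDataDir = "_change_data"

  // Builds a relative path: _change_data/<partition dirs>/<file name>.
  // partitionValues and fileName are supplied by the caller (hypothetical inputs).
  def changeFilePath(partitionValues: Seq[(String, String)], fileName: String): String = {
    val partitionDirs = partitionValues.map { case (name, value) => s"$name=$value" }
    val segments = (ChangeDataDir +: partitionDirs) :+ fileName
    segments.mkString("/")
  }
}
```

With no partition values, the file sits directly under `_change_data`; with partition values, one directory per partition column is nested in between.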
CDF Virtual Columns¶
CDC_COLUMNS_IN_DATA: Seq[String]
`CDCReader` defines `CDC_COLUMNS_IN_DATA` as a collection of CDF-specific column names.
`CDC_COLUMNS_IN_DATA` is used when:
* `ColumnWithDefaultExprUtils` is requested to addDefaultExprsOrReturnConstraints
* `DeltaColumnMappingBase` is requested for the DELTA_INTERNAL_COLUMNS
__is_cdc Virtual Partition Column¶
`CDCReader` defines the `__is_cdc` column name to partition on with Change Data Feed enabled.
The `__is_cdc` column is added when `TransactionalWrite` is requested to performCDCPartition with CDF enabled on a delta table (and `_change_type` among the columns).
If added, the `__is_cdc` column becomes the first partitioning column. It is then "consumed" by DelayedCommitProtocol (to write changes to `cdc-`-prefixed files, not `part-`-prefixed ones).
`__is_cdc` is a virtual column.
Used when:
* `DelayedCommitProtocol` is requested to getFileName and buildActionFromAddedFile
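How the virtual partition column is "consumed" can be sketched in plain Scala. This is an illustration under assumptions, not the `DelayedCommitProtocol` code: `CdcFileNameSketch` and `prefixAndPartitions` are hypothetical names, and partition values are assumed to arrive as a string-to-string map with `"true"` marking change data.

```scala
// A minimal sketch (not Delta Lake's implementation) of consuming the
// __is_cdc virtual partition column when naming an output file.
object CdcFileNameSketch {
  val CdcPartitionCol = "__is_cdc"

  // Returns the file-name prefix and the partition values that remain
  // after the virtual column has been dropped.
  def prefixAndPartitions(
      partitionValues: Map[String, String]): (String, Map[String, String]) = {
    // Assumption: the column value "true" marks rows that are change data.
    val isCdc = partitionValues.get(CdcPartitionCol).contains("true")
    val prefix = if (isCdc) "cdc-" else "part-"
    (prefix, partitionValues - CdcPartitionCol)
  }
}
```

Because the column is dropped from the remaining partition values, it never shows up as a partition directory; it only steers the choice of file-name prefix.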
Change Type Column¶
`CDCReader` defines the `_change_type` column name that represents the type of a data change.
Change Type | Command |
---|---|
delete | Delete |
insert | WriteIntoDelta |
update_postimage | Update |
update_preimage | Update |
`_change_type` is a CDF virtual column and among the columns in the CDF-aware read schema.
`_change_type` is among the cdcAttributes.
Commit Version Column¶
`CDCReader` defines the `_commit_version` column name that represents...FIXME
`_commit_version` is among the DELTA_INTERNAL_COLUMNS.
`_commit_version` is among the cdcAttributes and the CDF-aware read schema.
Used when:
* `CdcAddFileIndex` is requested for the matching files
* `TahoeChangeFileIndex` is requested for the matching files and the partitionSchema
* `TahoeRemoveFileIndex` is requested for the matching files
CDC_TYPE_NOT_CDC Literal¶
CDC_TYPE_NOT_CDC: Literal
`CDCReader` defines the `CDC_TYPE_NOT_CDC` value as a `Literal` expression with a `null` value (of `StringType` type).
`CDC_TYPE_NOT_CDC` is used as a special sentinel value for rows that are part of the main table rather than change data.
`CDC_TYPE_NOT_CDC` is used by DML commands when executed with Change Data Feed enabled:
All commands but `DeleteCommand` use `CDC_TYPE_NOT_CDC` with `_change_type` as follows:
Column(CDC_TYPE_NOT_CDC).as("_change_type")
`DeleteCommand` uses `CDC_TYPE_NOT_CDC` as follows:
.withColumn(
"_change_type",
Column(If(filterCondition, CDC_TYPE_NOT_CDC, Literal("delete")))
)
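The `If` expression above can be modelled without Spark to make the row-level outcome explicit. This is a sketch under an assumption: `filterCondition` is taken to be true for rows that stay in the table, which the snippet itself does not state. `DeleteChangeTypeSketch` and `changeType` are hypothetical names, and `None` stands in for the `null` literal.

```scala
// A minimal, Spark-free model (not Delta Lake's implementation) of the
// _change_type value DeleteCommand assigns to each rewritten row.
object DeleteChangeTypeSketch {
  // Stand-in for CDC_TYPE_NOT_CDC: a null change type means "main table row".
  val NotCdc: Option[String] = None

  // Assumption: keepRow mirrors filterCondition, i.e. true when the row
  // survives the delete and is written back to the main table.
  def changeType(keepRow: Boolean): Option[String] =
    if (keepRow) NotCdc else Some("delete")
}
```

Rows that survive carry no change type and end up in regular data files, while removed rows are tagged `delete` and become change data.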
`CDC_TYPE_NOT_CDC` is used when (with Change Data Feed enabled):
* `DeleteCommand` is requested to rewriteFiles
* `MergeIntoCommand` is requested to run a merge (for a non-insert-only merge or with merge.optimizeInsertOnlyMerge.enabled disabled) that uses `ClassicMergeExecutor` to write out merge changes with generateWriteAllChangesOutputCols and generateCdcAndOutputRows
* `UpdateCommand` is requested to withUpdatedColumns
* `WriteIntoDelta` is requested to write
insert Change Type¶
`CDCReader` uses `insert` as the value of the `_change_type` column for the following:
* `WriteIntoDelta` is requested to write data out (with isCDCEnabledOnTable)
* `MergeOutputGeneration` is requested to generateAllActionExprs and generateCdcAndOutputRows
* `CdcAddFileIndex` is requested for the matching files