DeltaSink¶
DeltaSink is the Sink (Spark Structured Streaming) of the delta data source for streaming queries.
Creating Instance¶
DeltaSink takes the following to be created:
- SQLContext
- Hadoop Path of the delta table (to write data to as configured by the path option)
- Partition columns
- OutputMode (Spark Structured Streaming)
- DeltaOptions
DeltaSink is created when:
- DeltaDataSource is requested for a streaming sink
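
For orientation, a streaming query that writes in the delta format is what leads to DeltaDataSource being asked for a streaming sink. The following is a minimal sketch, assuming Spark Structured Streaming and Delta Lake are on the classpath. The object name, the paths, the rate source, and the bucket column are placeholders chosen for illustration only; the comments relate the writer settings to the constructor arguments listed above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.{OutputMode, StreamingQuery}

object DeltaSinkExample {
  // Placeholder locations (assumptions for this sketch only)
  val tablePath = "/tmp/delta/events"
  val checkpointPath = "/tmp/delta/events/_checkpoint"

  def start(spark: SparkSession): StreamingQuery = {
    // The built-in rate source emits (timestamp, value) rows
    val events = spark.readStream
      .format("rate")
      .load()
      .withColumn("bucket", col("value") % 10)

    events.writeStream
      .format("delta")                               // the delta data source
      .outputMode(OutputMode.Append())               // OutputMode
      .partitionBy("bucket")                         // partition columns
      .option("checkpointLocation", checkpointPath)  // required by Spark for file-based sinks
      .start(tablePath)                              // sets the path option
  }
}
```

Calling DeltaSinkExample.start(spark) with an active SparkSession returns the running StreamingQuery, which can be stopped with stop().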
DeltaLog¶
deltaLog: DeltaLog
deltaLog is a DeltaLog that is created for the delta table when DeltaSink is created.
deltaLog is used when:
- DeltaSink is requested to add a streaming micro-batch
Adding Streaming Micro-Batch¶
addBatch(
  batchId: Long,
  data: DataFrame): Unit
addBatch is part of the Sink (Spark Structured Streaming) abstraction.
addBatch requests the DeltaLog to start a new transaction.
addBatch registers the following performance metrics.
Name | web UI |
---|---|
numAddedFiles | number of files added. |
numRemovedFiles | number of files removed. |
addBatch makes sure that the sql.streaming.queryId local property is defined (attached to the query's current thread).
If the batch reads the same delta table as this sink is going to write to, addBatch requests the OptimisticTransaction to readWholeTable.
addBatch updates the metadata.
addBatch determines the deleted files based on the OutputMode. For Complete output mode, addBatch ...FIXME
addBatch requests the OptimisticTransaction to write data out.
addBatch updates the numRemovedFiles and numAddedFiles performance metrics, and requests the OptimisticTransaction to register the SQLMetrics.
In the end, addBatch requests the OptimisticTransaction to commit (with a new SetTransaction, the AddFiles and RemoveFiles, and a StreamingUpdate operation).
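
The order of the steps above can be summarized with a toy model. This is only a sketch: ToyTxn, the step labels, and the readsSinkTable flag are hypothetical names invented for illustration and are not Delta Lake API. The model merely records the sequence of requests and leaves the handling of deleted files unspecified, as in the description above.

```scala
import scala.collection.mutable.ArrayBuffer

object AddBatchSketch {
  // Hypothetical transaction that only records what it is asked to do
  final class ToyTxn {
    val steps: ArrayBuffer[String] = ArrayBuffer.empty[String]
    def record(step: String): Unit = { steps += step; () }
  }

  // readsSinkTable: whether the batch reads the table this sink writes to
  def addBatch(batchId: Long, readsSinkTable: Boolean, txn: ToyTxn): Unit = {
    txn.record("startTransaction")
    txn.record("registerMetrics")
    txn.record("checkQueryIdLocalProperty")
    if (readsSinkTable) txn.record("readWholeTable")
    txn.record("updateMetadata")
    txn.record("determineDeletedFiles")
    txn.record("writeFiles")
    txn.record("updateAndRegisterSQLMetrics")
    txn.record(s"commit(batchId=$batchId)")
  }
}
```

The only branch in the model is the readWholeTable request, which is recorded only when the batch reads from the sink's own table.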
Text Representation¶
toString(): String
DeltaSink uses the following text representation (with the path):
DeltaSink[path]
ImplicitMetadataOperation¶
DeltaSink is an ImplicitMetadataOperation.
canMergeSchema¶
canMergeSchema: Boolean
canMergeSchema is part of the ImplicitMetadataOperation abstraction.
canMergeSchema is the value of the canMergeSchema option (in the DeltaOptions).
canOverwriteSchema¶
canOverwriteSchema: Boolean
canOverwriteSchema is part of the ImplicitMetadataOperation abstraction.
canOverwriteSchema is true when all the following hold:
- OutputMode is OutputMode.Complete
- canOverwriteSchema is enabled (true) (in the DeltaOptions)
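
The two conditions above amount to a logical AND. The following is a minimal sketch in plain Scala; Mode and Options are hypothetical stand-ins for Spark's OutputMode and for DeltaOptions, introduced here only to keep the example self-contained.

```scala
object CanOverwriteSchemaSketch {
  // Hypothetical stand-ins for Spark's OutputMode values
  sealed trait Mode
  case object Append extends Mode
  case object Complete extends Mode
  case object Update extends Mode

  // Hypothetical stand-in for the relevant part of DeltaOptions
  final case class Options(canOverwriteSchema: Boolean)

  // true only when both conditions from the list above hold
  def canOverwriteSchema(mode: Mode, options: Options): Boolean =
    mode == Complete && options.canOverwriteSchema
}
```

Under this model, enabling the option has no effect on the result unless the output mode is Complete.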