Source — Streaming Source in Micro-Batch Stream Processing¶
Source is an extension of the SparkDataStream abstraction for streaming sources that support "streamed reading" of continually arriving data (identified by offset) in a streaming query.
Source is used in Micro-Batch Stream Processing.
Source is created using StreamSourceProvider.createSource (and DataSource.createSource).
For fault tolerance, Source must be able to replay an arbitrary sequence of past data in a stream using a range of offsets. Structured Streaming relies on this assumption to achieve end-to-end exactly-once guarantees.
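The replay requirement can be illustrated with a minimal sketch in plain Scala. It does not use Spark: ReplayableLog, its read method, and the Long-based offsets are hypothetical stand-ins, assuming that offset n covers the first n records of the stream.

```scala
// Hypothetical stand-in for a replayable source; not a Spark class.
final class ReplayableLog(records: Vector[String]) {
  // Offsets are 1-based positions: offset n covers the first n records.
  // Returns the records after `start` up to and including `end`.
  def read(start: Long, end: Long): Vector[String] =
    records.slice(start.toInt, end.toInt)
}

object ReplayDemo {
  val log = new ReplayableLog(Vector("a", "b", "c", "d"))

  // The first attempt at a batch, and a second attempt after a simulated failure.
  val firstAttempt: Vector[String] = log.read(1L, 3L)
  val replayed: Vector[String] = log.read(1L, 3L)
}
```

Because the same offset range always yields the same records, a batch that failed halfway can be recomputed without losing or duplicating data.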
Contract¶
commit¶
commit(
end: Offset): Unit
Commits data up to the given end offset (informs the source that Spark has completed processing all data for offsets less than or equal to the end offset and will only request offsets greater than the end offset in the future).
Used when:
- MicroBatchExecution stream execution engine is requested to write offsets to a commit log (walCommit phase) while running an activated streaming query
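What a source may do on commit can be sketched as follows. This is a plain-Scala model, not Spark code: BufferedSource, add, and retained are hypothetical names, and the sketch assumes the source keeps records in memory only until they are committed.

```scala
// Hypothetical in-memory source that retains records only until they are committed.
final class BufferedSource {
  // (offset, record) pairs still available for replay
  private var buffer = Vector.empty[(Long, String)]
  private var lastOffset = 0L
  private var committed = 0L

  def add(record: String): Unit = {
    lastOffset += 1
    buffer = buffer :+ ((lastOffset, record))
  }

  // Everything at or below `end` has been processed and will not be requested again,
  // so it is safe to drop it.
  def commit(end: Long): Unit = {
    require(end >= committed, s"cannot move the commit point back from $committed to $end")
    buffer = buffer.filter { case (offset, _) => offset > end }
    committed = end
  }

  def retained: Seq[String] = buffer.map(_._2)
}

object CommitDemo {
  val source = new BufferedSource
  Seq("a", "b", "c").foreach(source.add)
  source.commit(2L)
  // Only the record at offset 3 is still retained.
  val afterCommit: Seq[String] = source.retained
}
```

Dropping committed data is one possible reaction to commit; a source backed by durable storage could just as well treat it as a no-op.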
getBatch¶
getBatch(
start: Option[Offset],
end: Offset): DataFrame
Generates a streaming DataFrame with data between the start and end offsets.
The start offset can be undefined (None) to indicate that the batch should begin with the first record.
Used when MicroBatchExecution stream execution engine is requested to run an activated streaming query.
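The offset range can be sketched without Spark. The example below assumes the range is exclusive of start and inclusive of end, written (start, end], which is how the Scaladoc of Source.getBatch describes it. It returns a Vector instead of a DataFrame, and the records and Long offsets are hypothetical.

```scala
object BatchRangeSketch {
  // Hypothetical data: the record at index i has offset i + 1.
  val records: Vector[String] = Vector("a", "b", "c", "d", "e")

  // `start` itself is excluded, `end` is included; None means "from the first record".
  def getBatch(start: Option[Long], end: Long): Vector[String] =
    records.slice(start.getOrElse(0L).toInt, end.toInt)

  // The very first batch has no start offset.
  val firstBatch: Vector[String] = getBatch(None, 2L)
  // The next batch starts where the previous one ended.
  val secondBatch: Vector[String] = getBatch(Some(2L), 5L)
}
```

Passing the end offset of one batch as the start offset of the next one covers every record exactly once.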
getOffset¶
getOffset: Option[Offset]
Latest (maximum) offset of the source (or None to denote no data)
Used when:
- MicroBatchExecution stream execution engine (Micro-Batch Stream Processing) is requested for latest offsets of all sources (getOffset phase) while running an activated streaming query
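How a polling loop might use the latest offset can be sketched as follows. This is a simplified model rather than the MicroBatchExecution logic: PollingSource, runOnce, and lastProcessed are hypothetical, and offsets are plain Long values.

```scala
// Hypothetical source with a getOffset/getBatch pair modelled on the contract above.
final class PollingSource {
  private var records = Vector.empty[String]

  def add(record: String): Unit = records = records :+ record

  // None until the first record arrives, then the offset of the newest record.
  def getOffset: Option[Long] =
    if (records.isEmpty) None else Some(records.size.toLong)

  def getBatch(start: Option[Long], end: Long): Vector[String] =
    records.slice(start.getOrElse(0L).toInt, end.toInt)
}

object PollingDemo {
  val source = new PollingSource
  private var lastProcessed: Option[Long] = None

  // One trigger: ask for the latest offset and read a batch only if it has moved.
  def runOnce(): Vector[String] = source.getOffset match {
    case Some(latest) if !lastProcessed.contains(latest) =>
      val batch = source.getBatch(lastProcessed, latest)
      lastProcessed = Some(latest)
      batch
    case _ => Vector.empty // no data yet, or nothing new since the last trigger
  }
}
```

A None result and an unchanged offset are handled the same way here: there is nothing new to read, so the trigger produces an empty batch.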
schema¶
schema: StructType
Schema of the data from this source
Implementations¶
initialOffset Method¶
initialOffset(): OffsetV2
initialOffset throws an IllegalStateException.
initialOffset is part of the SparkDataStream abstraction.
deserializeOffset Method¶
deserializeOffset(
json: String): OffsetV2
deserializeOffset throws an IllegalStateException.
deserializeOffset is part of the SparkDataStream abstraction.
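The pattern of a sub-abstraction overriding inherited methods so that they always throw can be sketched in miniature. MiniOffset, MiniDataStream, and MiniSource are hypothetical stand-ins for OffsetV2, SparkDataStream, and Source, and the exception message is an assumption.

```scala
// Hypothetical miniatures of the abstractions; not Spark's own traits.
trait MiniOffset { def json: String }

trait MiniDataStream {
  def initialOffset(): MiniOffset
  def deserializeOffset(json: String): MiniOffset
}

// The sub-abstraction satisfies the parent contract but rejects every call.
trait MiniSource extends MiniDataStream {
  override def initialOffset(): MiniOffset =
    throw new IllegalStateException("should not be called.") // message is an assumption

  override def deserializeOffset(json: String): MiniOffset =
    throw new IllegalStateException("should not be called.") // message is an assumption
}

object DefaultsDemo {
  val source: MiniSource = new MiniSource {}

  // Capture the failures instead of letting them propagate.
  val initial = scala.util.Try(source.initialOffset())
  val deserialized = scala.util.Try(source.deserializeOffset("{}"))
}
```

Throwing from both methods signals that callers are expected to go through getOffset and getBatch instead.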