Source — Streaming Source in Micro-Batch Stream Processing

Source is an extension of the SparkDataStream abstraction for streaming sources that read continually arriving data in a streaming query, with positions in the stream identified by offsets.

Source is used in Micro-Batch Stream Processing.

Source is created using StreamSourceProvider.createSource (and DataSource.createSource).

For fault tolerance, a Source must be able to replay an arbitrary sequence of past data in a stream, given a range of offsets. This replayability is the assumption that lets Structured Streaming achieve end-to-end exactly-once guarantees.
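The contract is easiest to see end to end in code. The following is a minimal, hypothetical CountingSource that pretends one new record arrives per second. The class name and the counting logic are illustrative only, and the class is declared in the org.apache.spark.sql.execution.streaming package as one way to reach the private[sql] SQLContext.internalCreateDataFrame that built-in sources use to create streaming DataFrames.

package org.apache.spark.sql.execution.streaming

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// A toy Source that "receives" one record per elapsed second since startup
class CountingSource(sqlContext: SQLContext) extends Source {

  private val startMs = System.currentTimeMillis()

  override def schema: StructType =
    StructType(StructField("value", LongType) :: Nil)

  // Latest available offset: the number of elapsed seconds (None before any data)
  override def getOffset: Option[Offset] = {
    val elapsed = (System.currentTimeMillis() - startMs) / 1000
    if (elapsed == 0) None else Some(LongOffset(elapsed))
  }

  // Replays the (start, end] range; start = None means "from the very beginning"
  override def getBatch(start: Option[Offset], end: Offset): DataFrame = {
    val from = start.map(_.json.toLong).getOrElse(0L)
    val to = end.json.toLong
    val rows = sqlContext.sparkContext
      .parallelize((from + 1) to to)
      .map(value => InternalRow(value))
    // isStreaming = true marks the result as a streaming DataFrame
    sqlContext.internalCreateDataFrame(rows, schema, isStreaming = true)
  }

  // Nothing is buffered in this toy source, so there is nothing to clean up
  override def commit(end: Offset): Unit = ()

  override def stop(): Unit = ()
}

A matching provider (another sketch with illustrative names) is what the createSource note above refers to:

class CountingSourceProvider extends org.apache.spark.sql.sources.StreamSourceProvider {

  override def sourceSchema(
      sqlContext: SQLContext,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): (String, StructType) =
    ("counting", new CountingSource(sqlContext).schema)

  override def createSource(
      sqlContext: SQLContext,
      metadataPath: String,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): Source =
    new CountingSource(sqlContext)
}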

Contract

commit

commit(
  end: Offset): Unit

Commits data up to the given end offset (informs the source that Spark has completed processing all data for offsets less than or equal to the end offset and will only request offsets greater than the end offset in the future).

Used when the MicroBatchExecution stream execution engine is requested to run an activated streaming query (to commit the offsets of a completed micro-batch before constructing the next one).
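The CountingSource sketch above has nothing to forget, but a source that buffers incoming records would typically use commit to drop everything at or below the committed offset. A hedged sketch of that bookkeeping (the buffered map and the offset-as-record-count encoding are assumptions of the sketch, not part of the contract):

import scala.collection.mutable
import org.apache.spark.sql.execution.streaming.{LongOffset, Offset}

// Hypothetical internal state of a buffering source: offset -> record
val buffered = mutable.TreeMap(1L -> "a", 2L -> "b", 3L -> "c")

// What such a source's commit would do: offsets <= end are never requested again
def commit(end: Offset): Unit = {
  val committed = end.json.toLong
  buffered.keys.filter(_ <= committed).toList.foreach(buffered.remove)
}

commit(LongOffset(2))  // buffered now holds only 3 -> "c"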

getBatch

getBatch(
  start: Option[Offset],
  end: Offset): DataFrame

Generates a streaming DataFrame with the data between the start (exclusive) and end (inclusive) offsets

Start offset can be undefined (None) to indicate that the batch should begin with the first record

Used when the MicroBatchExecution stream execution engine is requested to run an activated streaming query, namely to populate start offsets upon restart and to request unprocessed data while running a micro-batch.
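Assuming the CountingSource sketch above and a SparkSession named spark, consecutive micro-batches stitch together disjoint (start, end] ranges, with an undefined start only for the very first batch of a fresh query:

import org.apache.spark.sql.execution.streaming.{CountingSource, LongOffset}

val source = new CountingSource(spark.sqlContext)

// First micro-batch of a fresh query: no start offset yet
val batch0 = source.getBatch(None, LongOffset(5))                 // records 1..5

// The next micro-batch resumes exactly where the previous one ended
val batch1 = source.getBatch(Some(LongOffset(5)), LongOffset(9))  // records 6..9

// Both DataFrames are streaming ones (executed by the engine, not directly)
assert(batch0.isStreaming && batch1.isStreaming)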

getOffset

getOffset: Option[Offset]

Latest (maximum) offset of the source (or None to denote no data)

Used when the MicroBatchExecution stream execution engine is requested to construct the next streaming micro-batch (to find out whether new data is available).
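Returning None is what lets the engine skip a micro-batch when nothing has arrived. A sketch for a source backed by a growing in-memory log (the entries buffer is hypothetical):

import scala.collection.mutable
import org.apache.spark.sql.execution.streaming.{LongOffset, Offset}

// Hypothetical backing store: records "arrive" by being appended here
val entries = mutable.ArrayBuffer.empty[String]

// getOffset for such a source: the offset is simply "records seen so far"
def getOffset: Option[Offset] =
  if (entries.isEmpty) None  // no data yet
  else Some(LongOffset(entries.size.toLong))

assert(getOffset.isEmpty)
entries += "first record"
assert(getOffset.contains(LongOffset(1)))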

schema

schema: StructType

Schema of the data from this source
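The schema is fixed for the lifetime of the source and is what the streaming query resolves columns against. For example, a source emitting (timestamp, value) records might declare (field names illustrative):

import org.apache.spark.sql.types.{LongType, StructField, StructType, TimestampType}

def schema: StructType = StructType(Seq(
  StructField("timestamp", TimestampType),
  StructField("value", LongType)))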

Implementations

FileStreamSource and KafkaSource are the built-in Source implementations.

initialOffset Method

initialOffset(): OffsetV2

initialOffset throws an IllegalStateException as it should never be called: MicroBatchExecution determines the starting offsets of this legacy source API itself (using getOffset and the offset log).

initialOffset is part of the SparkDataStream abstraction.

deserializeOffset Method

deserializeOffset(
  json: String): OffsetV2

deserializeOffset throws an IllegalStateException as it should never be called: offsets recovered from the offset log are passed back to this legacy source API in their serialized (JSON-based) form, so the engine never asks a Source to deserialize them.

deserializeOffset is part of the SparkDataStream abstraction.
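In spirit, both overrides are one-line guards. A sketch of how the Source trait itself satisfies these SparkDataStream methods (the exact exception message in Spark's source may differ):

import org.apache.spark.sql.connector.read.streaming.{Offset => OffsetV2}

// Never called for this legacy Source API, so fail fast if something does
override def initialOffset(): OffsetV2 =
  throw new IllegalStateException("should not be called")

override def deserializeOffset(json: String): OffsetV2 =
  throw new IllegalStateException("should not be called")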