DataSource

The Internals of Spark SQL

Learn more about DataSource in The Internals of Spark SQL online book.

Creating Streaming Source (Legacy Data Sources)

createSource(
  metadataPath: String): Source

Legacy Code Path

createSource is used for streaming queries with legacy data sources (that have not fully migrated to the modern Connector API in Spark SQL yet).

createSource creates a new instance of the data source class and branches off per the implementation type:
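The branching described above and in the sections that follow can be sketched as a Scala pattern match. This is an illustrative simplification, not the exact Spark source; the names `providingClass`, `caseInsensitiveOptions` and the constructor arguments are assumptions based on the surrounding text.

```scala
// Hedged sketch of createSource's dispatch (simplified, illustrative only)
providingClass.getConstructor().newInstance() match {
  case provider: StreamSourceProvider =>
    // Delegate source creation to the StreamSourceProvider
    provider.createSource(
      sparkSession.sqlContext, metadataPath,
      userSpecifiedSchema, className, caseInsensitiveOptions)

  case _: FileFormat =>
    // File-based sources require the 'path' option
    val path = caseInsensitiveOptions.getOrElse("path",
      throw new IllegalArgumentException("'path' is not specified"))
    new FileStreamSource(
      sparkSession, path, className, sourceInfo.schema,
      sourceInfo.partitionColumns, metadataPath, caseInsensitiveOptions)

  case _ =>
    throw new UnsupportedOperationException(
      s"Data source $className does not support streamed reading")
}
```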


StreamSourceProvider

For a StreamSourceProvider, createSource requests the StreamSourceProvider to create a source.

FileFormat

For a FileFormat, createSource creates a new FileStreamSource.

createSource throws an IllegalArgumentException when the path option is not specified for a FileFormat data source:

'path' is not specified
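For file-based streaming sources, the path is supplied as the path option (or equivalently as the argument to load). For example (the format and directory below are assumptions for illustration):

```scala
// 'path' option for a file-based streaming source;
// omitting it triggers the IllegalArgumentException above.
val lines = spark.readStream
  .format("text")
  .option("path", "/tmp/streaming-input")
  .load()
```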

Other Types

For any other data source type, createSource simply throws an UnsupportedOperationException:

Data source [className] does not support streamed reading

SourceInfo

sourceInfo: SourceInfo

Metadata of a Source with the following:

  • Name (alias)
  • Schema
  • Partitioning columns
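Conceptually, SourceInfo is a simple container for the three pieces of metadata above. The following shape is an illustrative sketch only (field names are assumptions); the actual definition lives alongside DataSource in Spark SQL.

```scala
import org.apache.spark.sql.types.StructType

// Illustrative shape of SourceInfo (not the exact Spark definition)
case class SourceInfo(
  name: String,                  // name (alias) of the source
  schema: StructType,            // schema of the streaming data
  partitionColumns: Seq[String]) // partitioning columns (may be empty)
```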

sourceInfo is initialized (lazily) using sourceSchema.

Lazy Value

sourceInfo is a Scala lazy value to guarantee that the code to initialize it is executed once only (when accessed for the first time) and cached afterwards.
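The once-only, cached semantics of a Scala lazy val can be demonstrated in isolation (plain Scala, unrelated to Spark itself):

```scala
object LazyValDemo {
  var initCount = 0

  // The right-hand side runs on first access only; the result is cached.
  lazy val sourceInfo: String = {
    initCount += 1
    "metadata"
  }

  def main(args: Array[String]): Unit = {
    assert(initCount == 0)            // not initialized yet
    assert(sourceInfo == "metadata")  // first access triggers initialization
    assert(sourceInfo == "metadata")  // subsequent accesses reuse the cached value
    assert(initCount == 1)            // initializer executed exactly once
    println(initCount)
  }
}
```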


Generating Metadata of Streaming Source

sourceSchema(): SourceInfo

sourceSchema creates a new instance of the data source class and branches off per the implementation type:


sourceSchema is used when sourceInfo is initialized (lazily).

StreamSourceProvider

For a StreamSourceProvider, sourceSchema requests the StreamSourceProvider for the name and schema (of the streaming source).

In the end, sourceSchema returns the name and the schema as part of SourceInfo (with partition columns unspecified).

FileFormat

For a FileFormat, sourceSchema...FIXME

Other Types

For any other data source type, sourceSchema simply throws an UnsupportedOperationException:

Data source [className] does not support streamed reading

Creating Streaming Sink

createSink(
  outputMode: OutputMode): Sink

createSink creates a streaming sink for StreamSinkProvider or FileFormat data sources.

The Internals of Spark SQL

Learn more about FileFormat in The Internals of Spark SQL online book.

createSink creates a new instance of the data source class and branches off per the implementation type: a StreamSinkProvider is requested to create the sink directly, while a FileFormat data source gets a new FileStreamSink.

createSink throws an IllegalArgumentException when the path option is not specified for a FileFormat data source:

'path' is not specified

createSink throws an AnalysisException when the given OutputMode is different from Append for a FileFormat data source:

Data source [className] does not support [outputMode] output mode
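The append-only guard can be sketched as follows. This is a simplified illustration of the check described above, not the exact Spark source:

```scala
// Hedged sketch: file-based streaming sinks support Append mode only
if (outputMode != OutputMode.Append) {
  throw new AnalysisException(
    s"Data source $className does not support $outputMode output mode")
}
```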

createSink throws an UnsupportedOperationException for unsupported data source formats:

Data source [className] does not support streamed writing

createSink is used when DataStreamWriter is requested to start a streaming query.
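createSink is reached, for example, when a streaming query is started with a file-based format. The paths below are assumptions for illustration:

```scala
// Starting a streaming query with a FileFormat sink;
// DataStreamWriter.start triggers createSink under the covers.
val query = lines.writeStream
  .format("parquet")
  .option("path", "/tmp/streaming-output")          // required for FileFormat sinks
  .option("checkpointLocation", "/tmp/checkpoint")  // required for streaming queries
  .outputMode("append")                             // FileFormat sinks support Append only
  .start()
```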