# DataSource

!!! note "The Internals of Spark SQL"
    Learn more about DataSource in The Internals of Spark SQL online book.
## Creating Streaming Source (Legacy Data Sources)

```scala
createSource(
  metadataPath: String): Source
```
!!! note "Legacy Code Path"
    `createSource` is used for streaming queries with legacy data sources (that have not fully migrated to the modern Connector API in Spark SQL yet):

    - StreamingRelation operators for legacy data sources (that use the DataSource V1 API)
    - StreamingRelationV2 operators for modern data sources with their `TableProvider`s (Spark SQL) disabled
`createSource` creates a new instance of the data source class and branches off per the implementation type (as described in the sections below).

`createSource` is used when:

- `MicroBatchExecution` is requested for the analyzed logical plan
### StreamSourceProvider

For a StreamSourceProvider, `createSource` requests the `StreamSourceProvider` to create a source.
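The delegation target is the DataSource V1 `StreamSourceProvider` contract. A minimal sketch of a custom provider follows; `DemoSourceProvider` and the method bodies are illustrative assumptions, only the trait and its signatures come from Spark:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.execution.streaming.Source
import org.apache.spark.sql.sources.StreamSourceProvider
import org.apache.spark.sql.types.{StringType, StructType}

// Sketch of the DataSource V1 streaming-source contract that
// createSource delegates to. DemoSourceProvider is a hypothetical name.
class DemoSourceProvider extends StreamSourceProvider {

  // Queried first, for the name and schema of the streaming source
  override def sourceSchema(
      sqlContext: SQLContext,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): (String, StructType) =
    ("demo", schema.getOrElse(new StructType().add("value", StringType)))

  // Then asked to create the actual Source for the streaming query
  override def createSource(
      sqlContext: SQLContext,
      metadataPath: String,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): Source =
    ??? // construct and return a custom Source implementation here
}
```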
### FileFormat

For a `FileFormat`, `createSource` creates a new FileStreamSource.

`createSource` throws an `IllegalArgumentException` when the `path` option is not specified for a `FileFormat` data source:

```text
'path' is not specified
```
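On the user side, the `path` requirement surfaces through `DataStreamReader`. A hedged sketch (the input directory is a made-up example):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// load() with no path leaves the `path` option unset, so createSource
// fails with: java.lang.IllegalArgumentException: 'path' is not specified
// spark.readStream.format("text").load()

// With a path, createSource backs the query with a FileStreamSource
val lines = spark.readStream
  .format("text")
  .load("/tmp/streaming-demo") // hypothetical input directory
```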
### Other Types

For any other data source type, `createSource` simply throws an `UnsupportedOperationException`:

```text
Data source [className] does not support streamed reading
```
## SourceInfo

```scala
sourceInfo: SourceInfo
```

`sourceInfo` is the metadata of a Source with the following:

- Name (alias)
- Schema
- Partitioning columns

`sourceInfo` is initialized (lazily) using sourceSchema.

!!! note "Lazy Value"
    `sourceInfo` is a Scala lazy value, which guarantees that the initialization code is executed only once (at the first access) and the value is cached afterwards.
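The once-only semantics can be demonstrated with plain Scala (no Spark involved):

```scala
// Plain-Scala illustration of lazy-val semantics: the initializer
// runs exactly once, on first access, and the result is cached.
var initCount = 0

lazy val sourceInfo: String = {
  initCount += 1
  "name, schema, partitioning columns"
}

assert(initCount == 0) // not accessed yet, so not initialized
sourceInfo             // first access runs the initializer
sourceInfo             // second access returns the cached value
assert(initCount == 1)
```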
Used when:

- `DataSource` is requested to create a streaming source for a file-based data source (when `MicroBatchExecution` is requested to initialize the analyzed logical plan)
- `StreamingRelation` utility is used to create a StreamingRelation (when `DataStreamReader` is requested for a streaming query)
## Generating Metadata of Streaming Source

```scala
sourceSchema(): SourceInfo
```

`sourceSchema` creates a new instance of the data source class and branches off per the type (as described in the sections below).

`sourceSchema` is used when:

- `DataSource` is requested for the SourceInfo
### StreamSourceProvider

For a StreamSourceProvider, `sourceSchema` requests the `StreamSourceProvider` for the name and schema (of the streaming source).

In the end, `sourceSchema` returns the name and the schema as part of SourceInfo (with partition columns unspecified).
### FileFormat

For a `FileFormat`, `sourceSchema`...FIXME
### Other Types

For any other data source type, `sourceSchema` simply throws an `UnsupportedOperationException`:

```text
Data source [className] does not support streamed reading
```
## Creating Streaming Sink

```scala
createSink(
  outputMode: OutputMode): Sink
```

`createSink` creates a streaming sink for StreamSinkProvider or `FileFormat` data sources.

!!! note "The Internals of Spark SQL"
    Learn more about FileFormat in The Internals of Spark SQL online book.
`createSink` creates a new instance of the data source class and branches off per the type:

- For a StreamSinkProvider, `createSink` simply delegates the call and requests it to create a streaming sink
- For a `FileFormat`, `createSink` creates a FileStreamSink when the `path` option is specified and the output mode is Append
`createSink` throws an `IllegalArgumentException` when the `path` option is not specified for a `FileFormat` data source:

```text
'path' is not specified
```
`createSink` throws an `AnalysisException` when the given OutputMode is different from Append for a `FileFormat` data source:

```text
Data source [className] does not support [outputMode] output mode
```
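Both requirements surface through `DataStreamWriter`. A hedged sketch, assuming `df` is a streaming DataFrame and with made-up paths:

```scala
// Assumption: `df` is a streaming DataFrame (e.g. built via spark.readStream).
// A FileFormat sink needs both a path and the Append output mode.
val query = df.writeStream
  .format("parquet")
  .option("path", "/tmp/streaming-out")              // omitting it: IllegalArgumentException
  .option("checkpointLocation", "/tmp/streaming-ck") // required for file sinks
  .outputMode("append")                              // Complete or Update: AnalysisException
  .start()
```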
`createSink` throws an `UnsupportedOperationException` for unsupported data source formats:

```text
Data source [className] does not support streamed writing
```
`createSink` is used when `DataStreamWriter` is requested to start a streaming query.