DataSource¶
The Internals of Spark SQL
Learn more about DataSource in The Internals of Spark SQL online book.
Creating Streaming Source (Legacy Data Sources)¶
```scala
createSource(
  metadataPath: String): Source
```
Legacy Code Path
createSource is used for streaming queries with legacy data sources (that have not fully migrated to the modern Connector API in Spark SQL yet):

- StreamingRelation operators for legacy data sources (that use the DataSource V1 API)
- StreamingRelationV2 operators for modern data sources (TableProviders in Spark SQL) that have been disabled and fall back to the legacy code path
createSource creates a new instance of the data source class and branches off per the implementation type:
createSource is used when:

- MicroBatchExecution is requested for the analyzed logical plan
StreamSourceProvider¶
For a StreamSourceProvider, createSource requests the StreamSourceProvider to create a source.
FileFormat¶
For a FileFormat, createSource creates a new FileStreamSource.
createSource throws an IllegalArgumentException when the path option is not specified for a FileFormat data source:
'path' is not specified
Other Types¶
For any other data source type, createSource simply throws an UnsupportedOperationException:
Data source [className] does not support streamed reading
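The branching above can be sketched as follows. This is a simplified model with stand-in types defined locally (assumptions, not Spark's actual classes or signatures), showing only the dispatch and the two error cases:

```scala
// Stand-ins for the Spark SQL types involved (assumptions, not Spark's API).
trait Source
trait StreamSourceProvider { def createSource(metadataPath: String): Source }
trait FileFormat
class FileStreamSource(path: String) extends Source

// Sketch of the per-type branching in createSource.
def createSource(
    providerInstance: AnyRef,
    className: String,
    options: Map[String, String],
    metadataPath: String): Source =
  providerInstance match {
    case s: StreamSourceProvider =>
      // Delegate to the provider to create the streaming source.
      s.createSource(metadataPath)
    case _: FileFormat =>
      // File-based sources require the 'path' option.
      val path = options.getOrElse("path",
        throw new IllegalArgumentException("'path' is not specified"))
      new FileStreamSource(path)
    case _ =>
      throw new UnsupportedOperationException(
        s"Data source $className does not support streamed reading")
  }
```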
SourceInfo¶
```scala
sourceInfo: SourceInfo
```
Metadata of a Source with the following:
- Name (alias)
- Schema
- Partitioning columns
sourceInfo is initialized (lazily) using sourceSchema.
Lazy Value
sourceInfo is a Scala lazy value to guarantee that the code to initialize it is executed once only (when accessed for the first time) and cached afterwards.
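As a generic illustration of Scala's lazy-val semantics (plain Scala, not Spark code):

```scala
// Counts how many times the lazy initializer actually runs.
var initCount = 0

class Holder {
  // The initializer body runs once, on first access; later accesses return the cached value.
  lazy val info: String = {
    initCount += 1
    "computed"
  }
}

val h = new Holder
assert(initCount == 0)        // not initialized yet
assert(h.info == "computed")  // first access triggers initialization
assert(h.info == "computed")  // second access hits the cache
assert(initCount == 1)        // the initializer ran exactly once
```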
Used when:

- DataSource is requested to create a streaming source for a File-Based Data Source (when MicroBatchExecution is requested to initialize the analyzed logical plan)
- StreamingRelation utility is used to create a StreamingRelation (when DataStreamReader is requested for a streaming query)
Generating Metadata of Streaming Source¶
```scala
sourceSchema(): SourceInfo
```
sourceSchema creates a new instance of the data source class and branches off per the type:
sourceSchema is used when:

- DataSource is requested for the SourceInfo
StreamSourceProvider¶
For a StreamSourceProvider, sourceSchema requests the StreamSourceProvider for the name and schema (of the streaming source).
In the end, sourceSchema returns the name and the schema as part of SourceInfo (with partition columns unspecified).
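The StreamSourceProvider branch above can be sketched with simplified stand-in types (assumptions defined locally, not Spark's actual signatures):

```scala
// Stand-ins (assumptions): the real types live in org.apache.spark.sql.*
case class StructType(fieldNames: Seq[String])
case class SourceInfo(name: String, schema: StructType, partitionColumns: Seq[String])

trait StreamSourceProvider {
  // Simplified: the provider reports the source's name and schema.
  def sourceSchema(): (String, StructType)
}

// sourceSchema asks the provider for (name, schema) and wraps them in a
// SourceInfo, leaving the partitioning columns unspecified.
def sourceSchema(provider: StreamSourceProvider): SourceInfo = {
  val (name, schema) = provider.sourceSchema()
  SourceInfo(name, schema, partitionColumns = Nil)
}
```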
FileFormat¶
For a FileFormat, sourceSchema...FIXME
Other Types¶
For any other data source type, sourceSchema simply throws an UnsupportedOperationException:
Data source [className] does not support streamed reading
Creating Streaming Sink¶
```scala
createSink(
  outputMode: OutputMode): Sink
```
createSink creates a streaming sink for StreamSinkProvider or FileFormat data sources.
The Internals of Spark SQL
Learn more about FileFormat in The Internals of Spark SQL online book.
createSink creates a new instance of the data source class and branches off per the type:
- For a StreamSinkProvider, createSink simply delegates the call and requests it to create a streaming sink
- For a FileFormat, createSink creates a FileStreamSink when the path option is specified and the output mode is Append
createSink throws an IllegalArgumentException when the path option is not specified for a FileFormat data source:
'path' is not specified
createSink throws an AnalysisException when the given OutputMode is different from Append for a FileFormat data source:
Data source [className] does not support [outputMode] output mode
createSink throws an UnsupportedOperationException for unsupported data source formats:
Data source [className] does not support streamed writing
createSink is used when DataStreamWriter is requested to start a streaming query.
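createSink's branching and its three error cases can be sketched the same way, again with locally defined stand-in types (assumptions, not Spark's actual classes):

```scala
// Stand-ins (assumptions, not Spark's API).
sealed trait OutputMode
case object Append extends OutputMode
case object Complete extends OutputMode

trait Sink
trait StreamSinkProvider { def createSink(outputMode: OutputMode): Sink }
trait FileFormat
class FileStreamSink(path: String) extends Sink
class AnalysisException(msg: String) extends Exception(msg)

// Sketch of the per-type branching in createSink.
def createSink(
    providerInstance: AnyRef,
    className: String,
    options: Map[String, String],
    outputMode: OutputMode): Sink =
  providerInstance match {
    case s: StreamSinkProvider =>
      // Delegate to the provider to create the streaming sink.
      s.createSink(outputMode)
    case _: FileFormat =>
      val path = options.getOrElse("path",
        throw new IllegalArgumentException("'path' is not specified"))
      // File sinks only support the Append output mode.
      if (outputMode != Append)
        throw new AnalysisException(
          s"Data source $className does not support $outputMode output mode")
      new FileStreamSink(path)
    case _ =>
      throw new UnsupportedOperationException(
        s"Data source $className does not support streamed writing")
  }
```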