# DataSource

!!! note "The Internals of Spark SQL"
    Learn more about DataSource in The Internals of Spark SQL online book.
## Creating Streaming Source (Legacy Data Sources)

```scala
createSource(
  metadataPath: String): Source
```
!!! note "Legacy Code Path"
    `createSource` is used for streaming queries with legacy data sources (that have not fully migrated to the modern Connector API in Spark SQL yet):

    - StreamingRelation operators for legacy data sources (that use the DataSource V1 API)
    - StreamingRelationV2 operators for modern data sources with their `TableProvider`s (Spark SQL) disabled
`createSource` creates a new instance of the data source class and branches off per the implementation type (as described in the sections below).

`createSource` is used when:

- `MicroBatchExecution` is requested for the analyzed logical plan
### StreamSourceProvider

For a StreamSourceProvider, `createSource` requests the `StreamSourceProvider` to create a source.
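The delegation target is the DataSource V1 `StreamSourceProvider` contract. A minimal sketch of a custom provider follows; `DemoSourceProvider` and the method bodies are illustrative assumptions, only the trait and its signatures come from Spark:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.execution.streaming.Source
import org.apache.spark.sql.sources.StreamSourceProvider
import org.apache.spark.sql.types.{StringType, StructType}

// Sketch of the DataSource V1 streaming-source contract that
// createSource delegates to. DemoSourceProvider is a hypothetical name.
class DemoSourceProvider extends StreamSourceProvider {

  // Queried first, for the name and schema of the streaming source
  override def sourceSchema(
      sqlContext: SQLContext,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): (String, StructType) =
    ("demo", schema.getOrElse(new StructType().add("value", StringType)))

  // Then asked to create the actual Source for the streaming query
  override def createSource(
      sqlContext: SQLContext,
      metadataPath: String,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): Source =
    ??? // construct and return a custom Source implementation here
}
```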
### FileFormat

For a `FileFormat`, `createSource` creates a new FileStreamSource.

`createSource` throws an `IllegalArgumentException` when the `path` option is not specified for a `FileFormat` data source:

```text
'path' is not specified
```
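On the user side, the `path` requirement surfaces through `DataStreamReader`. A hedged sketch (the input directory is a made-up example):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// load() with no path leaves the `path` option unset, so createSource
// fails with: java.lang.IllegalArgumentException: 'path' is not specified
// spark.readStream.format("text").load()

// With a path, createSource backs the query with a FileStreamSource
val lines = spark.readStream
  .format("text")
  .load("/tmp/streaming-demo") // hypothetical input directory
```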
### Other Types

For any other data source type, `createSource` simply throws an `UnsupportedOperationException`:

```text
Data source [className] does not support streamed reading
```
## SourceInfo

```scala
sourceInfo: SourceInfo
```

`sourceInfo` is the metadata of a Source with the following:

- Name (alias)
- Schema
- Partitioning columns

`sourceInfo` is initialized (lazily) using sourceSchema.

!!! note "Lazy Value"
    `sourceInfo` is a Scala lazy value, which guarantees that the initialization code is executed only once (at the first access) and the value is cached afterwards.
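The once-only semantics can be demonstrated with plain Scala (no Spark involved):

```scala
// Plain-Scala illustration of lazy-val semantics: the initializer
// runs exactly once, on first access, and the result is cached.
var initCount = 0

lazy val sourceInfo: String = {
  initCount += 1
  "name, schema, partitioning columns"
}

assert(initCount == 0) // not accessed yet, so not initialized
sourceInfo             // first access runs the initializer
sourceInfo             // second access returns the cached value
assert(initCount == 1)
```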
Used when:

- `DataSource` is requested to create a streaming source for a file-based data source (when `MicroBatchExecution` is requested to initialize the analyzed logical plan)
- `StreamingRelation` utility is used to create a StreamingRelation (when `DataStreamReader` is requested for a streaming query)
## Generating Metadata of Streaming Source

```scala
sourceSchema(): SourceInfo
```

`sourceSchema` creates a new instance of the data source class and branches off per the type (as described in the sections below).

`sourceSchema` is used when:

- `DataSource` is requested for the SourceInfo
### StreamSourceProvider

For a StreamSourceProvider, `sourceSchema` requests the `StreamSourceProvider` for the name and schema (of the streaming source).

In the end, `sourceSchema` returns the name and the schema as part of SourceInfo (with partition columns unspecified).
### FileFormat

For a `FileFormat`, `sourceSchema`...FIXME
### Other Types

For any other data source type, `sourceSchema` simply throws an `UnsupportedOperationException`:

```text
Data source [className] does not support streamed reading
```
## Creating Streaming Sink

```scala
createSink(
  outputMode: OutputMode): Sink
```

`createSink` creates a streaming sink for StreamSinkProvider or `FileFormat` data sources.

!!! note "The Internals of Spark SQL"
    Learn more about FileFormat in The Internals of Spark SQL online book.
`createSink` creates a new instance of the data source class and branches off per the type:

- For a StreamSinkProvider, `createSink` simply delegates the call and requests it to create a streaming sink
- For a `FileFormat`, `createSink` creates a FileStreamSink when the `path` option is specified and the output mode is Append
`createSink` throws an `IllegalArgumentException` when the `path` option is not specified for a `FileFormat` data source:

```text
'path' is not specified
```
`createSink` throws an `AnalysisException` when the given OutputMode is different from Append for a `FileFormat` data source:

```text
Data source [className] does not support [outputMode] output mode
```
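Both requirements surface through `DataStreamWriter`. A hedged sketch, assuming `df` is a streaming DataFrame and with made-up paths:

```scala
// Assumption: `df` is a streaming DataFrame (e.g. built via spark.readStream).
// A FileFormat sink needs both a path and the Append output mode.
val query = df.writeStream
  .format("parquet")
  .option("path", "/tmp/streaming-out")              // omitting it: IllegalArgumentException
  .option("checkpointLocation", "/tmp/streaming-ck") // required for file sinks
  .outputMode("append")                              // Complete or Update: AnalysisException
  .start()
```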
`createSink` throws an `UnsupportedOperationException` for unsupported data source formats:

```text
Data source [className] does not support streamed writing
```
`createSink` is used when `DataStreamWriter` is requested to start a streaming query.