Skip to content

DataStreamReader

DataStreamReader is an interface that Spark developers use to describe how to load data from a streaming data source.

DataStreamReader and The Others

Accessing DataStreamReader

DataStreamReader is available using SparkSession.readStream method.

import org.apache.spark.sql.SparkSession
val spark: SparkSession = ...

val streamReader = spark.readStream

Built-in Formats

DataStreamReader supports loading streaming data from the following built-in data sources:

  • csv
  • json
  • orc
  • parquet (default)
  • text
  • textFile

Tip

Use spark.sql.sources.default configuration property to change the default data source.

Note

Hive data source can only be used with tables, and it is an AnalysisException when specified explicitly.

Loading Data

load(): DataFrame
load(
  path: String): DataFrame

DataStreamReader gives specialized methods for built-in data systems and formats.

In order to plug in a custom data source, DataStreamReader gives format and load methods.

load creates a streaming DataFrame that represents a "loading" streaming data node (and is internally a logical plan with a StreamingRelationV2 or StreamingRelation leaf logical operators).

load uses DataSource.lookupDataSource utility to look up the data source by source alias.

Tip

Learn more about DataSource.lookupDataSource utility in The Internals of Spark SQL online book.

SupportsRead Tables with MICRO_BATCH_READ or CONTINUOUS_READ

For a TableProvider (that is not a FileDataSourceV2), load requests it for a Table.

Tip

Learn more about TableProvider in The Internals of Spark SQL online book.

For a Table with SupportsRead with MICRO_BATCH_READ or CONTINUOUS_READ capabilities, load creates a DataFrame with StreamingRelationV2 leaf logical operator.

Tip

Learn more about Table, SupportsRead and capabilities in The Internals of Spark SQL online book.

If the DataSource is a StreamSourceProvider, load creates the StreamingRelationV2 with a StreamingRelation leaf logical operator.

For other Tables, load creates a DataFrame with a StreamingRelation leaf logical operator.

Other Data Sources

load creates a DataFrame with a StreamingRelation leaf logical operator.