DataSource API

Reading Datasets
Spark SQL can read data from external storage systems, e.g. files, Hive tables and JDBC databases, through the DataFrameReader interface.
You use SparkSession.md[SparkSession] to access a DataFrameReader using the SparkSession.md#read[read] operation.
[source, scala]
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate
val reader = spark.read
DataFrameReader is an interface to create DataFrames (aka Dataset[Row]) from files, Hive tables, or tables over JDBC.
[source, scala]
val people = reader.csv("people.csv")
val cities = reader.format("json").load("cities.json")
As of Spark 2.0, DataFrameReader can read text files using the textFile methods that return a Dataset[String] (not DataFrames).
[source, scala]
spark.read.textFile("README.md")
There are two operation modes in Spark SQL: batch and spark-structured-streaming.md[streaming] (part of Spark Structured Streaming).
You can access spark-sql-streaming-DataStreamReader.md[DataStreamReader] for reading streaming datasets through the SparkSession.md#readStream[SparkSession.readStream] method.
[source, scala]
import org.apache.spark.sql.streaming.DataStreamReader

val stream: DataStreamReader = spark.readStream
The methods available in DataStreamReader are similar to those in DataFrameReader.
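Since the builder methods mirror each other, reading a stream follows the same format/option/load pattern as a batch read. A minimal sketch (the directory name and schema are assumptions; streaming file sources require an explicit schema, as there is no schema inference by default):

[source, scala]
import org.apache.spark.sql.types._

// Streaming file sources need an explicit schema up front
val citySchema = new StructType()
  .add("name", StringType)
  .add("population", LongType)

val citiesStream = spark.readStream
  .format("json")
  .schema(citySchema)
  .load("cities")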
Saving Datasets
Spark SQL can save data to external storage systems, e.g. files, Hive tables and JDBC databases, through the DataFrameWriter interface.
You use the Dataset.md#write[write] method on a Dataset to access DataFrameWriter.
[source, scala]
import org.apache.spark.sql.{DataFrameWriter, Dataset}

val ints: Dataset[Int] = (0 to 5).toDS
val writer: DataFrameWriter[Int] = ints.write
DataFrameWriter is an interface to persist a Dataset.md[Dataset] to an external storage system in a batch fashion.
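As a sketch of the batch write path, the writer from the snippet above can persist the Dataset through the usual format/mode/save chain (the output path and the choice of Parquet are illustrative assumptions):

[source, scala]
// Persist the Dataset of ints as Parquet files (output path is illustrative)
ints.write
  .format("parquet")
  .mode("overwrite")
  .save("ints.parquet")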
You can access spark-sql-streaming-DataStreamWriter.md[DataStreamWriter] for writing streaming datasets through Dataset.md#writeStream[Dataset.writeStream] method.
[source, scala]
val papers = spark.readStream.text("papers").as[String]

import org.apache.spark.sql.streaming.DataStreamWriter

val writer: DataStreamWriter[String] = papers.writeStream
The methods available in DataStreamWriter are similar to those in DataFrameWriter.
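To make the parallel concrete, here is a minimal sketch of starting a streaming query with DataStreamWriter (the console sink and output mode are illustrative choices):

[source, scala]
// Start a streaming query that prints incoming rows to the console
val query = papers.writeStream
  .format("console")
  .outputMode("append")
  .start

// Block until the query terminates
query.awaitTermination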