
DataFrameReader

DataFrameReader is a high-level API for Spark SQL developers to describe the "read path" of a structured query (over rows of Row type, i.e. an untyped DataFrame).

DataFrameReader describes an input node in a data processing graph.

DataFrameReader describes the input data source format to be used to "load" data from a data source (e.g. files, Hive tables, a JDBC table or a Dataset[String]).

DataFrameReader merely describes the process of loading data (a load specification) and does not trigger a Spark job until an action is executed (unlike DataFrameWriter).

DataFrameReader is available using the SparkSession.read operator.

Creating Instance

DataFrameReader takes the following to be created:

  • SparkSession

Demo

import org.apache.spark.sql.SparkSession
// In spark-shell, spark is an already-created SparkSession
assert(spark.isInstanceOf[SparkSession])

val reader = spark.read

import org.apache.spark.sql.DataFrameReader
assert(reader.isInstanceOf[DataFrameReader])
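
Building on the demo, the following only builds a load specification; nothing touches any data yet (people.json is a hypothetical file):

// Merely a load specification; no Spark job is triggered here
val spec = spark.read
  .format("json")
  .option("multiLine", "true")

// Only an action on the loaded DataFrame (e.g. count) runs Spark jobs
// spec.load("people.json").count()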

format

format(
  source: String): DataFrameReader

format specifies the input data source format.

Built-in data source formats:

  • json
  • csv
  • parquet
  • orc
  • text
  • jdbc
  • libsvm

Use the spark.sql.sources.default configuration property to specify the default format.
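
A minimal example of both the explicit and the default format (people.csv is a hypothetical file):

// The default format is parquet unless spark.sql.sources.default says otherwise
assert(spark.conf.get("spark.sql.sources.default") == "parquet")

// Explicitly request the csv format
val people = spark.read
  .format("csv")
  .option("header", "true")
  .load("people.csv")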

Loading Data

load(): DataFrame
load(
  path: String): DataFrame
load(
  paths: String*): DataFrame

load loads a dataset from a data source (with optional support for multiple paths) as an untyped DataFrame.

Internally, load looks up the data source class (DataSource.lookupDataSource) for the given data source format. load then branches off based on its type (i.e. whether it is of the DataSourceV2 marker type or not).

For a "DataSource V2" data source, load...FIXME

Otherwise (for a non-"DataSource V2" data source), load uses loadV1Source.

load throws an AnalysisException when the source is hive:

Hive data source can only be used with tables, you can not read files of Hive data source directly.
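
A minimal example of the three load variants (all paths are hypothetical):

// No path given directly; the path option works for path-based sources
val q1 = spark.read.format("json").option("path", "people.json").load()

// A single path
val q2 = spark.read.format("json").load("people.json")

// Multiple paths
val q3 = spark.read.format("json").load("people.json", "more-people.json")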

loadV1Source

loadV1Source(
  paths: String*): DataFrame

loadV1Source creates a DataSource and requests it to resolve the underlying relation (as a BaseRelation).

In the end, loadV1Source requests the SparkSession to create a DataFrame from the BaseRelation.
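
A sketch of what loadV1Source boils down to, using Spark's internal DataSource API (internal and subject to change between Spark versions; people.json is a hypothetical file):

import org.apache.spark.sql.execution.datasources.DataSource

// Create a DataSource for the format and path(s) and resolve the BaseRelation
val relation = DataSource(
  sparkSession = spark,
  className = "json",
  paths = Seq("people.json")).resolveRelation()

// Request the SparkSession to create a DataFrame from the BaseRelation
val df = spark.baseRelationToDataFrame(relation)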