

DataFrameReader is a high-level API for Spark SQL developers to describe the "read path" of a structured query.

DataFrameReader is used to describe an input node in a data processing graph.

DataFrameReader is used to describe the input data source format to be used to "load" data from a data source (e.g. files, Hive tables, JDBC or Dataset[String]).

DataFrameReader merely describes the process of loading data (a load specification) and does not trigger a Spark job until an action is called (unlike DataFrameWriter).

DataFrameReader is available using the SparkSession.read operator.

Creating Instance

DataFrameReader takes the following to be created:

  • SparkSession

import org.apache.spark.sql.SparkSession
// spark is an active SparkSession
val reader = spark.read

import org.apache.spark.sql.DataFrameReader
assert(reader.isInstanceOf[DataFrameReader])

format(
  source: String): DataFrameReader

format specifies the input data source format.

Built-in data source formats:

  • json
  • csv
  • parquet
  • orc
  • text
  • jdbc
  • libsvm

Use the spark.sql.sources.default configuration property to specify the default format.
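For example, assuming an active SparkSession named spark (and hypothetical input paths), the format can be set explicitly or left to the default:

// spark: SparkSession (assumed to be in scope)
// Explicit format
val csvDF = spark.read
  .format("csv")
  .option("header", "true")
  .load("people.csv") // hypothetical path

// No format given: spark.sql.sources.default applies (parquet by default)
val defaultDF = spark.read.load("data.parquet") // hypothetical path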

Loading Data

load(): DataFrame
load(
  path: String): DataFrame
load(
  paths: String*): DataFrame

load loads a dataset from a data source (with optional support for multiple paths) as an untyped DataFrame.
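As a sketch of the three variants (assuming spark: SparkSession and hypothetical paths):

val r = spark.read.format("json")
val df1 = r.load()                                   // location given via options
val df2 = r.load("logs/2024.json")                   // single path
val df3 = r.load("logs/2024.json", "logs/2025.json") // multiple paths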

Internally, load looks up the class of the input data source format (lookupDataSource). load then branches off based on its type (i.e. whether it is of the DataSourceV2 marker type or not).

For a "DataSource V2" data source, load...FIXME

Otherwise, if the source is not a "DataSource V2" data source, load falls back to loadV1Source.

load throws an AnalysisException when the source is hive:

Hive data source can only be used with tables, you can not read files of Hive data source directly.
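A sketch of triggering the error (assuming spark: SparkSession):

import org.apache.spark.sql.AnalysisException

try {
  spark.read.format("hive").load()
} catch {
  case e: AnalysisException =>
    println(e.getMessage) // Hive data source can only be used with tables...
}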


loadV1Source(
  paths: String*): DataFrame

loadV1Source creates a DataSource and requests it to resolve the underlying relation (as a BaseRelation).

In the end, loadV1Source requests the SparkSession to create a DataFrame from the BaseRelation.
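Conceptually, loadV1Source is equivalent to the following simplified sketch (not the verbatim Spark source; sparkSession, userSpecifiedSchema, source and extraOptions stand for DataFrameReader's internal state):

private def loadV1Source(paths: String*): DataFrame = {
  // Create a DataSource for the configured format and resolve it to a BaseRelation
  val relation = DataSource(
    sparkSession,
    paths = paths,
    userSpecifiedSchema = userSpecifiedSchema,
    className = source,
    options = extraOptions.toMap).resolveRelation()
  // Wrap the BaseRelation in a DataFrame
  sparkSession.baseRelationToDataFrame(relation)
}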