DataFrameWriter¶
DataFrameWriter[T] is a high-level API for Spark SQL developers to describe the "write path" of a structured query (over rows of type T).
DataFrameWriter is used to describe an output node in a data processing graph.
DataFrameWriter is used to describe the output data source format used to "save" data to a data source (e.g. files, Hive tables or JDBC).
DataFrameWriter ends the description of a write specification and triggers a Spark job once a terminal operator (e.g. save, insertInto or saveAsTable) is called.
DataFrameWriter is available using Dataset.write operator.
Creating Instance¶
DataFrameWriter takes a Dataset[T] to be created.
Demo¶
import org.apache.spark.sql.{DataFrameWriter, Dataset}
assert(df.isInstanceOf[Dataset[_]])
val writer = df.write
// DataFrameWriter is generic, so the type test needs a wildcard type argument
assert(writer.isInstanceOf[DataFrameWriter[_]])
DataFrame¶
When created, DataFrameWriter converts the Dataset to a DataFrame.
Name of Data Source¶
source: String
source is a short name (alias) or a fully-qualified class name that identifies the data source to write data to.
source can be specified using the format method:
format(
source: String): DataFrameWriter[T]
Default: spark.sql.sources.default configuration property
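The fallback to the default data source can be seen in a small sketch. It assumes a local SparkSession and a throwaway temporary directory; the application name, directory names and value names are illustrative only.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: a local session (an already running session is reused)
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("format-demo")
  .getOrCreate()

// Hypothetical scratch directory for the demo output
val base = Files.createTempDirectory("format-demo").toString

// No format given: the writer uses spark.sql.sources.default
val defaultSource = spark.conf.get("spark.sql.sources.default")
spark.range(3).write.save(s"$base/default")

// Explicit short name (alias) of the data source
spark.range(3).write.format("json").save(s"$base/json")

// Read both outputs back with the matching format
val fromDefault = spark.read.format(defaultSource).load(s"$base/default")
val fromJson = spark.read.format("json").load(s"$base/json")
```

A fully-qualified class name of a data source could be passed to format in the same way as the json alias.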
insertInto¶
insertInto(
tableName: String): Unit
insertInto asserts that the write is not bucketed (with insertInto as the operation name).
insertInto requests the DataFrame for the SparkSession.
insertInto tries to look up the TableProvider for the data source.
insertInto requests the ParserInterface to parse the tableName identifier (possibly multi-part).
In the end, insertInto uses the modern or the legacy insert paths based on...FIXME
Note
saveAsTable and insertInto are structurally alike.
Modern Insert Path (CatalogPlugin)¶
insertInto(
catalog: CatalogPlugin,
ident: Identifier): Unit
insertInto
...FIXME
Legacy Insert Path (TableIdentifier)¶
insertInto(
tableIdent: TableIdentifier): Unit
insertInto creates an InsertIntoStatement logical operator (with the overwrite flag enabled when the SaveMode is Overwrite).
In the end, insertInto executes the InsertIntoStatement logical command.
AnalysisException¶
insertInto throws an AnalysisException when the partitioningColumns are defined:
insertInto() can't be used together with partitionBy(). Partition columns have already been defined for the table. It is not necessary to use partitionBy().
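The following sketch exercises insertInto against an existing table and then shows the partitionBy restriction described above. It assumes a local SparkSession with a temporary warehouse directory; the table name insert_into_demo is hypothetical.

```scala
import java.nio.file.Files
import scala.util.Try
import org.apache.spark.sql.{AnalysisException, SparkSession}

// Assumption: a local session with a throwaway warehouse directory
val warehouse = Files.createTempDirectory("insert-into-demo").toString
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("insertInto-demo")
  .config("spark.sql.warehouse.dir", warehouse)
  .getOrCreate()

// Hypothetical table name used only for this demo
val table = "insert_into_demo"
spark.sql(s"DROP TABLE IF EXISTS $table")
spark.range(3).write.saveAsTable(table)

// Insert two more rows into the table created above
spark.range(3, 5).write.insertInto(table)
val rowCount = spark.table(table).count()

// Combining partitionBy with insertInto is expected to be rejected
val attempt = Try(spark.range(1).write.partitionBy("id").insertInto(table))
val rejected = attempt.failed.toOption.exists(_.isInstanceOf[AnalysisException])
```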
saveAsTable¶
saveAsTable(
tableName: String): Unit
saveAsTable requests the DataFrame for the SparkSession.
saveAsTable tries to look up the TableProvider for the data source.
saveAsTable requests the ParserInterface to parse the tableName identifier (possibly multi-part).
In the end, saveAsTable uses the modern or the legacy save paths based on...FIXME
Note
saveAsTable and insertInto are structurally alike.
Modern saveAsTable with TableCatalog¶
saveAsTable(
catalog: TableCatalog,
ident: Identifier,
nameParts: Seq[String]): Unit
Legacy saveAsTable with TableIdentifier¶
saveAsTable(
tableIdent: TableIdentifier): Unit
saveAsTable saves the content of a DataFrame to the tableName table.
AnalysisException¶
saveAsTable throws an AnalysisException when no catalog could handle the table identifier:
Couldn't find a catalog to handle the identifier [tableName].
Demo¶
val ids = spark.range(5)
ids.write.
  option("path", "/tmp/five_ids").
  saveAsTable("five_ids")
// Check whether saveAsTable as five_ids was successful
val q = spark.catalog.listTables.filter($"name" === "five_ids")
scala> q.show
+--------+--------+-----------+---------+-----------+
|    name|database|description|tableType|isTemporary|
+--------+--------+-----------+---------+-----------+
|five_ids| default|       null| EXTERNAL|      false|
+--------+--------+-----------+---------+-----------+
Writing Out Data (save)¶
save(): Unit
save(
path: String): Unit
save saves a DataFrame (the result of executing a structured query) to a data source.
Internally, save uses DataSource to look up the class of the requested data source (for the source option and the SQLConf).
Note
save uses SparkSession to access the SessionState and in turn the SQLConf.
val df: DataFrame = ???
df.sparkSession.sessionState.conf
save
...FIXME
save throws an AnalysisException when requested to save to the Hive data source (the source is hive):
Hive data source can only be used with tables, you can not write files of Hive data source directly.
save throws an AnalysisException when bucketing is used (the numBuckets or sortColumnNames options are defined):
'[operation]' does not support bucketing right now
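Both error conditions can be observed without writing any data, as in the sketch below. It assumes a local SparkSession and a temporary target path; the helper isAnalysisException is hypothetical and exists only for this example.

```scala
import java.nio.file.Files
import scala.util.Try
import org.apache.spark.sql.{AnalysisException, SparkSession}

// Assumption: a local session (an already running session is reused)
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("save-demo")
  .getOrCreate()

// Hypothetical target path that does not exist yet
val target = Files.createTempDirectory("save-demo").resolve("out").toString

// Hypothetical helper: did the write fail with an AnalysisException?
def isAnalysisException(result: Try[Unit]): Boolean =
  result.failed.toOption.exists(_.isInstanceOf[AnalysisException])

// Writing files with the hive source is expected to be rejected
val hiveRejected =
  isAnalysisException(Try(spark.range(1).write.format("hive").save(target)))

// Bucketing together with save is expected to be rejected
val bucketedRejected =
  isAnalysisException(Try(spark.range(1).write.bucketBy(4, "id").save(target)))

// A plain save to the same (still missing) path goes through
spark.range(3).write.save(target)
val saved = spark.read.load(target).count()
```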
saveInternal¶
saveInternal(
path: Option[String]): Unit
saveInternal
...FIXME
Looking up TableProvider¶
lookupV2Provider(): Option[TableProvider]
lookupV2Provider tries to look up a TableProvider for the source.
lookupV2Provider explicitly excludes FileDataSourceV2-based data sources (due to SPARK-28396).
lookupV2Provider is used when:
DataFrameWriter is requested to save, insertInto and saveAsTable
Save Mode¶
mode(
saveMode: SaveMode): DataFrameWriter[T]
mode(
saveMode: String): DataFrameWriter[T]
mode defines the behaviour of save when the external file or table that Spark writes to already exists.
Name | Behaviour |
---|---|
Append | Records are appended to the existing data |
ErrorIfExists | An exception is thrown if the target exists |
Ignore | Records are not saved and the existing data is left unchanged |
Overwrite | Existing data is overwritten by the new records |
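The four behaviours in the table can be walked through against a single path, as sketched below. The sketch assumes a local SparkSession and a temporary directory; value names such as afterAppend are illustrative only.

```scala
import java.nio.file.Files
import scala.util.Try
import org.apache.spark.sql.{SaveMode, SparkSession}

// Assumption: a local session (an already running session is reused)
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("mode-demo")
  .getOrCreate()

// Hypothetical target path that does not exist yet
val path = Files.createTempDirectory("mode-demo").resolve("ids").toString

// Re-reads the target on every call
def rows: Long = spark.read.load(path).count()

// First write: the target does not exist, so any mode succeeds
spark.range(3).write.save(path)
val afterCreate = rows

// ErrorIfExists: the target exists now, so the write fails
val secondSave = Try(spark.range(3).write.mode(SaveMode.ErrorIfExists).save(path))

// Append: three more records are added
spark.range(3).write.mode(SaveMode.Append).save(path)
val afterAppend = rows

// Ignore (given by name): nothing is written
spark.range(10).write.mode("ignore").save(path)
val afterIgnore = rows

// Overwrite (given by name): only the new records remain
spark.range(2).write.mode("overwrite").save(path)
val afterOverwrite = rows
```

The String variant of mode accepts the mode name, which is why "ignore" and "overwrite" work alongside the SaveMode constants.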
Creating BucketSpec¶
getBucketSpec: Option[BucketSpec]
getBucketSpec creates a new BucketSpec for numBuckets if defined (with bucketColumnNames and sortColumnNames).
IllegalArgumentException
getBucketSpec throws an IllegalArgumentException when numBuckets is not defined but sortColumnNames is:
sortBy must be used together with bucketBy
getBucketSpec is used when:
DataFrameWriter is requested to assertNotBucketed, createTable, partitioningAsV2
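The rule described above can be modelled in plain Scala without Spark on the classpath. This is a simplified sketch, not Spark's code: BucketSpecSketch and getBucketSpecSketch are hypothetical stand-ins for BucketSpec and getBucketSpec.

```scala
import scala.util.Try

// Hypothetical stand-in for Spark's BucketSpec
final case class BucketSpecSketch(
    numBuckets: Int,
    bucketColumnNames: Seq[String],
    sortColumnNames: Seq[String])

// Hypothetical model of the rule: sort columns require a bucket count
def getBucketSpecSketch(
    numBuckets: Option[Int],
    bucketColumnNames: Option[Seq[String]],
    sortColumnNames: Option[Seq[String]]): Option[BucketSpecSketch] = {
  if (sortColumnNames.isDefined && numBuckets.isEmpty) {
    throw new IllegalArgumentException("sortBy must be used together with bucketBy")
  }
  numBuckets.map { n =>
    BucketSpecSketch(n, bucketColumnNames.getOrElse(Nil), sortColumnNames.getOrElse(Nil))
  }
}

// Nothing defined: no bucket specification
val unbucketed = getBucketSpecSketch(None, None, None)

// Bucket count defined: a specification with bucket and sort columns
val bucketed = getBucketSpecSketch(Some(4), Some(Seq("id")), Some(Seq("name")))

// Sort columns without a bucket count: rejected
val sortOnly = Try(getBucketSpecSketch(None, None, Some(Seq("name"))))
```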
partitioningAsV2¶
partitioningAsV2: Seq[Transform]
partitioningAsV2 creates Transforms based on the partitioningColumns (IdentityTransforms) and getBucketSpec (a BucketTransform), if defined.
partitioningAsV2 is used when:
DataFrameWriter is requested to saveInternal, saveAsTable, checkPartitioningMatchesV2Table
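To get a feel for the two kinds of Transforms, the sketch below builds comparable ones with the public Expressions factory of the connector API. This is an approximation under stated assumptions: DataFrameWriter builds its transforms internally, and the partitioningColumns and bucketing values here are hypothetical inputs standing in for partitionBy("country") and bucketBy(4, "id").

```scala
import org.apache.spark.sql.connector.expressions.{Expressions, Transform}

// Hypothetical inputs standing in for partitionBy("country") and bucketBy(4, "id")
val partitioningColumns = Seq("country")
val bucketing: Option[(Int, Seq[String])] = Some((4, Seq("id")))

// One identity transform per partitioning column
val identityTransforms: Seq[Transform] =
  partitioningColumns.map(c => Expressions.identity(c))

// At most one bucket transform, only when bucketing is defined
val bucketTransform: Option[Transform] =
  bucketing.map { case (n, cols) => Expressions.bucket(n, cols: _*) }

// Identity transforms first, followed by the optional bucket transform
val transforms: Seq[Transform] = identityTransforms ++ bucketTransform
val transformNames: Seq[String] = transforms.map(_.name())
```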
Executing Logical Command for Writing to Data Source V1¶
saveToV1Source(): Unit
saveToV1Source creates a DataSource (for the source class name, the partitioningColumns and the extraOptions) and requests it for the logical command for writing (with the mode and the analyzed logical plan of the structured query).
Note
While requesting the analyzed logical plan of the structured query, saveToV1Source triggers execution of logical commands.
In the end, saveToV1Source runs the logical command for writing.
Note
The logical command for writing can be one of the following:
saveToV1Source is used when:
DataFrameWriter is requested to save the rows of a structured query (a DataFrame) to a data source
Executing Logical Command(s)¶
runCommand(
session: SparkSession,
name: String)(
command: LogicalPlan): Unit
runCommand uses the given SparkSession to access the SessionState that is in turn requested to execute the logical command (that creates a QueryExecution).
runCommand records the current time (start time) and uses the SQLExecution helper object to execute the action (under a new execution id) that simply requests the QueryExecution for the RDD[InternalRow] (and triggers execution of logical commands).
Tip
Use web UI's SQL tab to see the execution or a SparkListener to be notified when the execution is started and finished. The SparkListener should intercept SparkListenerSQLExecutionStart and SparkListenerSQLExecutionEnd events.
runCommand records the current time (end time).
In the end, runCommand uses the input SparkSession to access the ExecutionListenerManager and requests it to onSuccess (with the input name, the QueryExecution and the duration).
In case of any exceptions, runCommand requests the ExecutionListenerManager to onFailure (with the exception) and (re)throws it.
runCommand is used when:
DataFrameWriter is requested to save the rows of a structured query (a DataFrame) to a data source (and indirectly executing a logical command for writing to a data source V1), insert the rows of a structured query (a DataFrame) into a table and create a table (that is used exclusively for saveAsTable)
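The time-then-notify flow described above can be sketched in plain Scala. This is a simplified model rather than Spark's code: ListenerSketch stands in for the ExecutionListenerManager, runCommandSketch for runCommand, and the by-name action for executing the QueryExecution. All of these names are hypothetical.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Try

// Hypothetical stand-in for the ExecutionListenerManager
trait ListenerSketch {
  def onSuccess(name: String, durationNs: Long): Unit
  def onFailure(name: String, error: Exception): Unit
}

// Hypothetical model of runCommand: time the action, then notify the listener
def runCommandSketch(listener: ListenerSketch, name: String)(action: => Unit): Unit = {
  try {
    val start = System.nanoTime()
    action
    val end = System.nanoTime()
    listener.onSuccess(name, end - start)
  } catch {
    case e: Exception =>
      // Report the failure and rethrow it to the caller
      listener.onFailure(name, e)
      throw e
  }
}

// A listener that records which callbacks were invoked
val events = ArrayBuffer.empty[String]
val recorder = new ListenerSketch {
  def onSuccess(name: String, durationNs: Long): Unit = { events += s"success:$name" }
  def onFailure(name: String, error: Exception): Unit = { events += s"failure:$name" }
}

// A successful action followed by a failing one
runCommandSketch(recorder, "save") { () }
val rethrown = Try(runCommandSketch(recorder, "insertInto") {
  throw new IllegalStateException("boom")
})
```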
Creating Table¶
createTable(
tableIdent: TableIdentifier): Unit
createTable builds a CatalogStorageFormat per extraOptions.
createTable assumes the table to be external when the location URI of the CatalogStorageFormat is defined, and managed otherwise.
createTable creates a CatalogTable (with the bucketSpec per getBucketSpec).
In the end, createTable creates a CreateTable logical command (with the CatalogTable, mode and the logical query plan of the dataset) and runs it.
createTable is used when:
DataFrameWriter is requested to saveAsTable
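The external-versus-managed decision can be observed through the catalog, as in the sketch below. It assumes a local SparkSession with a temporary warehouse directory; the table names managed_ids and external_ids and the location path are hypothetical.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: a local session with a throwaway warehouse directory
val warehouse = Files.createTempDirectory("create-table-demo").toString
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("createTable-demo")
  .config("spark.sql.warehouse.dir", warehouse)
  .getOrCreate()

// Hypothetical table names used only for this demo
spark.sql("DROP TABLE IF EXISTS managed_ids")
spark.sql("DROP TABLE IF EXISTS external_ids")

// No path option: no location URI is defined
spark.range(3).write.saveAsTable("managed_ids")

// A path option defines the location URI
val location = Files.createTempDirectory("external-ids").resolve("data").toString
spark.range(3).write.option("path", location).saveAsTable("external_ids")

// Table types as reported by the catalog
val managedType = spark.catalog.getTable("managed_ids").tableType
val externalType = spark.catalog.getTable("external_ids").tableType
```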