# DeltaDataSource

`DeltaDataSource` is a `DataSourceRegister` and the entry point to all the features provided by the delta data source, which supports batch and streaming queries.
## DataSourceRegister and delta Alias

`DeltaDataSource` is a `DataSourceRegister` (Spark SQL) and registers the **delta** alias.

`DeltaDataSource` is registered using `META-INF/services/org.apache.spark.sql.sources.DataSourceRegister`:

```text
org.apache.spark.sql.delta.sources.DeltaDataSource
```
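With the alias registered, queries can refer to the data source by name rather than by the fully-qualified class name. A minimal sketch, assuming a live `spark` session and a hypothetical table directory:

```scala
// The "delta" alias resolves to DeltaDataSource via Java's ServiceLoader,
// so the fully-qualified class name is never needed in user code.
val events = spark.read
  .format("delta")            // alias registered by DeltaDataSource
  .load("/tmp/delta/events")  // hypothetical table directory
```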
## RelationProvider

`DeltaDataSource` is a `RelationProvider` (Spark SQL).

### Creating Relation

```scala
createRelation(
  sqlContext: SQLContext,
  parameters: Map[String, String]): BaseRelation
```
`createRelation` verifies the given `parameters`.

`createRelation` extracts the time travel specification from the given `parameters`.

With `spark.databricks.delta.loadFileSystemConfigsFromDataFrameOptions` enabled, `createRelation` uses the given `parameters` as options.

In the end, `createRelation` creates a DeltaTableV2 (with the required `path` option and the optional time travel specification) and requests it for an insertable HadoopFsRelation.
`createRelation` makes sure that the `path` parameter is defined (in the given `parameters`) or throws an `IllegalArgumentException`:

```text
'path' is not specified
```
`createRelation` is part of the `RelationProvider` (Spark SQL) abstraction.
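A batch read over the delta source is what triggers this `createRelation`. A sketch, assuming a live `spark` session and a made-up table path:

```scala
// createRelation receives the read options as `parameters`;
// versionAsOf is one way to request time travel (extracted from the options).
val snapshot = spark.read
  .format("delta")
  .option("versionAsOf", "1")   // optional time travel specification
  .load("/tmp/delta/events")    // becomes the required 'path' parameter
```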
## CreatableRelationProvider

`DeltaDataSource` is a `CreatableRelationProvider` (Spark SQL).

### Creating Relation

```scala
createRelation(
  sqlContext: SQLContext,
  mode: SaveMode,
  parameters: Map[String, String],
  data: DataFrame): BaseRelation
```
`createRelation` creates a DeltaLog for the required `path` parameter (from the given `parameters`) and the given `parameters` themselves.

`createRelation` creates a DeltaOptions (with the given `parameters` and the current `SQLConf`).
`createRelation` creates and executes a WriteIntoDelta command for the given `data`.

In the end, `createRelation` requests the DeltaLog for a HadoopFsRelation.
`createRelation` makes sure that the `path` parameter is defined (in the given `parameters`) or throws an `IllegalArgumentException`:

```text
'path' is not specified
```
`createRelation` is part of the `CreatableRelationProvider` (Spark SQL) abstraction.
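A batch write is what triggers this `createRelation`. A sketch, assuming an existing DataFrame `df` and a made-up path:

```scala
// A batch write goes through CreatableRelationProvider.createRelation,
// which runs a WriteIntoDelta command under the covers.
df.write
  .format("delta")
  .mode("overwrite")            // the SaveMode passed to createRelation
  .save("/tmp/delta/events")    // the required 'path' parameter
```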
## StreamSourceProvider

`DeltaDataSource` is a `StreamSourceProvider` (Spark Structured Streaming).

### Creating DeltaSource

```scala
createSource(
  sqlContext: SQLContext,
  metadataPath: String,
  schema: Option[StructType],
  providerName: String,
  parameters: Map[String, String]): Source
```
`createSource` creates a DeltaLog for the required `path` parameter (from the given `parameters`).

`createSource` creates a DeltaOptions (with the given `parameters` and the current `SQLConf`).

In the end, `createSource` creates a DeltaSource (with the DeltaLog and the DeltaOptions).
`createSource` makes sure that the `path` parameter is defined (in the given `parameters`) or throws an `IllegalArgumentException`:

```text
'path' is not specified
```

`createSource` makes sure that no `schema` is specified or throws an `AnalysisException`:

```text
Delta does not support specifying the schema at read time.
```

`createSource` makes sure that there is a schema available (in the Snapshot of the DeltaLog) or throws an `AnalysisException`:

```text
Table schema is not set. Write data into it or use CREATE TABLE to set the schema.
```
`createSource` is part of the `StreamSourceProvider` (Spark Structured Streaming) abstraction.
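A streaming read is what triggers `createSource`. A sketch, assuming a live `spark` session and a made-up path:

```scala
// A streaming read goes through StreamSourceProvider.createSource.
// Note: no .schema(...) call — Delta rejects a user-specified read schema
// and derives the schema from the table's Snapshot instead.
val stream = spark.readStream
  .format("delta")
  .load("/tmp/delta/events")
```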
### Streaming Schema

```scala
sourceSchema(
  sqlContext: SQLContext,
  schema: Option[StructType],
  providerName: String,
  parameters: Map[String, String]): (String, StructType)
```
`sourceSchema` creates a DeltaLog for the required `path` parameter (from the given `parameters`).

`sourceSchema` takes the schema (of the Snapshot) of the DeltaLog and removes generation expressions (if defined).

In the end, `sourceSchema` returns the delta name with the schema (of the Delta table without the generation expressions).
`sourceSchema` makes sure that no `schema` is specified or throws an `AnalysisException`:

```text
Delta does not support specifying the schema at read time.
```

`sourceSchema` makes sure that the `path` parameter is defined (in the given `parameters`) or throws an `IllegalArgumentException`:

```text
'path' is not specified
```

`sourceSchema` makes sure that no time travel is specified (neither as part of the path nor using the time-travel options). If either is set, `sourceSchema` throws an `AnalysisException`:

```text
Cannot time travel views, subqueries or streams.
```
`sourceSchema` is part of the `StreamSourceProvider` (Spark Structured Streaming) abstraction.
## StreamSinkProvider

`DeltaDataSource` is a `StreamSinkProvider` (Spark Structured Streaming).

`DeltaDataSource` supports `Append` and `Complete` output modes only.
Tip
Consult the demo Using Delta Lake (as Streaming Sink) in Streaming Queries.
### Creating Streaming Sink

```scala
createSink(
  sqlContext: SQLContext,
  parameters: Map[String, String],
  partitionColumns: Seq[String],
  outputMode: OutputMode): Sink
```
`createSink` creates a DeltaOptions (with the given `parameters` and the current `SQLConf`).

In the end, `createSink` creates a DeltaSink (with the required `path` parameter, the given `partitionColumns` and the DeltaOptions).
`createSink` makes sure that the `path` parameter is defined (in the given `parameters`) or throws an `IllegalArgumentException`:

```text
'path' is not specified
```

`createSink` makes sure that the given `outputMode` is either `Append` or `Complete`, or throws an `IllegalArgumentException`:

```text
Data source [dataSource] does not support [outputMode] output mode
```
`createSink` is part of the `StreamSinkProvider` (Spark Structured Streaming) abstraction.
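A streaming write is what triggers `createSink`. A sketch, assuming a streaming DataFrame `df` and made-up paths:

```scala
// A streaming write goes through StreamSinkProvider.createSink.
// Only Append and Complete output modes are accepted.
val query = df.writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", "/tmp/delta/events/_checkpoints")
  .start("/tmp/delta/events")   // the required 'path' parameter
```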
## TableProvider

`DeltaDataSource` is a `TableProvider` (Spark SQL).

`DeltaDataSource` allows registering Delta tables in a Hive metastore. Delta creates a transaction log at the table root directory, and the Hive metastore contains no information but the table format and the location of the table. All table properties, schema and partitioning information live in the transaction log to avoid a split-brain situation.

The feature was added in SC-34233.
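A sketch of such a registration, assuming a live `spark` session and a made-up table name and location:

```scala
// Registering a Delta table in the metastore stores only the table format
// and the location; schema and partitioning stay in the transaction log.
spark.sql("""
  CREATE TABLE events
  USING delta
  LOCATION '/tmp/delta/events'
""")
```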
### Loading Delta Table

```scala
getTable(
  schema: StructType,
  partitioning: Array[Transform],
  properties: Map[String, String]): Table
```

`getTable` creates a DeltaTableV2 (with the path from the given `properties`).

`getTable` throws an `IllegalArgumentException` when the `path` option is not specified:

```text
'path' is not specified
```

`getTable` is part of the `TableProvider` (Spark SQL) abstraction.
## Utilities

### getTimeTravelVersion

```scala
getTimeTravelVersion(
  parameters: Map[String, String]): Option[DeltaTimeTravelSpec]
```

`getTimeTravelVersion`...FIXME

`getTimeTravelVersion` is used when `DeltaDataSource` is requested to create a relation (as a RelationProvider).
### parsePathIdentifier

```scala
parsePathIdentifier(
  spark: SparkSession,
  userPath: String): (Path, Seq[(String, String)], Option[DeltaTimeTravelSpec])
```

`parsePathIdentifier`...FIXME

`parsePathIdentifier` is used when `DeltaTableV2` is requested for metadata (for a non-catalog table).