Skip to content

DeltaDataSource

DeltaDataSource is a DataSourceRegister and is the entry point to all the features provided by delta data source that supports batch and streaming queries.

DataSourceRegister and delta Alias

DeltaDataSource is a DataSourceRegister (Spark SQL) and registers delta alias.

DeltaDataSource is registered using META-INF/services/org.apache.spark.sql.sources.DataSourceRegister:

org.apache.spark.sql.delta.sources.DeltaDataSource

RelationProvider

DeltaDataSource is a RelationProvider (Spark SQL).

Creating Relation

createRelation(
  sqlContext: SQLContext,
  parameters: Map[String, String]): BaseRelation

createRelation verifies the given parameters.

createRelation extracts time travel specification from the given parameters.

With spark.databricks.delta.loadFileSystemConfigsFromDataFrameOptions enabled, createRelation uses the given parameters as options.

In the end, createRelation creates a DeltaTableV2 (with the required path option and the optional time travel specification) and requests it for an insertable HadoopFsRelation.


createRelation makes sure that there is path parameter defined (in the given parameters) or throws an IllegalArgumentException:

'path' is not specified

createRelation is part of the RelationProvider (Spark SQL) abstraction.

CreatableRelationProvider

DeltaDataSource is a CreatableRelationProvider (Spark SQL).

Creating Relation

createRelation(
  sqlContext: SQLContext,
  mode: SaveMode,
  parameters: Map[String, String],
  data: DataFrame): BaseRelation

createRelation creates a DeltaLog for the required path parameter (from the given parameters) and the given parameters itself.

createSource creates a DeltaOptions (with the given parameters and the current SQLConf).

createRelation creates and executes a WriteIntoDelta command for the given data.

In the end, createRelation requests the DeltaLog for a HadoopFsRelation.


createRelation makes sure that there is path parameter defined (in the given parameters) or throws an IllegalArgumentException:

'path' is not specified

createRelation is part of the CreatableRelationProvider (Spark SQL) abstraction.

StreamSourceProvider

DeltaDataSource is a StreamSourceProvider (Spark Structured Streaming).

Creating DeltaSource

createSource(
  sqlContext: SQLContext,
  metadataPath: String,
  schema: Option[StructType],
  providerName: String,
  parameters: Map[String, String]): Source

createSource creates a DeltaLog for the required path parameter (from the given parameters).

createSource creates a DeltaOptions (with the given parameters and the current SQLConf).

In the end, createSource creates a DeltaSource (with the DeltaLog and the DeltaOptions).


createSource makes sure that there is path parameter defined (in the given parameters) or throws an IllegalArgumentException:

'path' is not specified

createSource makes sure that there is no schema specified or throws an AnalysisException:

Delta does not support specifying the schema at read time.

createSource makes sure that there is schema available (in the Snapshot) of the DeltaLog or throws an AnalysisException:

Table schema is not set.  Write data into it or use CREATE TABLE to set the schema.

createSource is part of the StreamSourceProvider (Spark Structured Streaming) abstraction.

Streaming Schema

sourceSchema(
  sqlContext: SQLContext,
  schema: Option[StructType],
  providerName: String,
  parameters: Map[String, String]): (String, StructType)

sourceSchema creates a DeltaLog for the required path parameter (from the given parameters).

sourceSchema takes the schema (of the Snapshot) of the DeltaLog and removes generation expressions (if defined).

In the end, sourceSchema returns the delta name with the schema (of the Delta table without the generation expressions).


createSource makes sure that there is no schema specified or throws an AnalysisException:

Delta does not support specifying the schema at read time.

createSource makes sure that there is path parameter defined (in the given parameters) or throws an IllegalArgumentException:

'path' is not specified

createSource makes sure that there is no time travel specified using the following:

If either is set, createSource throws an AnalysisException:

Cannot time travel views, subqueries or streams.

sourceSchema is part of the StreamSourceProvider (Spark Structured Streaming) abstraction.

StreamSinkProvider

DeltaDataSource is a StreamSinkProvider (Spark Structured Streaming).

DeltaDataSource supports Append and Complete output modes only.

Creating Streaming Sink

createSink(
  sqlContext: SQLContext,
  parameters: Map[String, String],
  partitionColumns: Seq[String],
  outputMode: OutputMode): Sink

createSink creates a DeltaOptions (with the given parameters and the current SQLConf).

In the end, createSink creates a DeltaSink (with the required path parameter, the given partitionColumns and the DeltaOptions).


createSink makes sure that there is path parameter defined (in the given parameters) or throws an IllegalArgumentException:

'path' is not specified

createSink makes sure that the given outputMode is either Append or Complete, or throws an IllegalArgumentException:

Data source [dataSource] does not support [outputMode] output mode

createSink is part of the StreamSinkProvider (Spark Structured Streaming) abstraction.

TableProvider

DeltaDataSource is aTableProvider (Spark SQL).

DeltaDataSource allows registering Delta tables in a HiveMetaStore. Delta creates a transaction log at the table root directory, and the Hive MetaStore contains no information but the table format and the location of the table. All table properties, schema and partitioning information live in the transaction log to avoid a split brain situation.

The feature was added in SC-34233.

Loading Delta Table

getTable(
  schema: StructType,
  partitioning: Array[Transform],
  properties: Map[String, String]): Table

getTable is part of the TableProvider (Spark SQL) abstraction.


getTable creates a DeltaTableV2 (with the path from the given properties).


getTable throws an IllegalArgumentException when path option is not specified:

'path' is not specified

Utilities

getTimeTravelVersion

getTimeTravelVersion(
  parameters: Map[String, String]): Option[DeltaTimeTravelSpec]

getTimeTravelVersion...FIXME

getTimeTravelVersion is used when DeltaDataSource is requested to create a relation (as a RelationProvider).

parsePathIdentifier

parsePathIdentifier(
  spark: SparkSession,
  userPath: String): (Path, Seq[(String, String)], Option[DeltaTimeTravelSpec])

parsePathIdentifier...FIXME

parsePathIdentifier is used when DeltaTableV2 is requested for metadata (for a non-catalog table).

Back to top