
DeltaDataSource

DeltaDataSource is a DataSourceRegister and the entry point to all the features of the delta data source.

DeltaDataSource is a RelationProvider.

DeltaDataSource is a StreamSinkProvider that creates a streaming sink for streaming queries (Spark Structured Streaming).

DataSourceRegister and delta Alias

DeltaDataSource is a DataSourceRegister and registers the delta alias.

Tip

Read up on DataSourceRegister in The Internals of Spark SQL online book.

DeltaDataSource is registered using META-INF/services/org.apache.spark.sql.sources.DataSourceRegister:

org.apache.spark.sql.delta.sources.DeltaDataSource
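The alias lookup can be pictured with a minimal sketch in plain Scala. The Register trait below is a local stand-in that mirrors the shortName method of Spark's DataSourceRegister. DeltaLike, ParquetLike and resolve are hypothetical names used only for illustration, and the real lookup in Spark (which discovers implementations through the Java service loader) is more involved.

```scala
// Local stand-in for Spark's DataSourceRegister (illustration only).
trait Register {
  def shortName(): String
}

// Hypothetical providers; only the alias matters for this sketch.
class DeltaLike extends Register {
  override def shortName(): String = "delta"
}

class ParquetLike extends Register {
  override def shortName(): String = "parquet"
}

object AliasLookupSketch {
  // In Spark the candidates come from META-INF/services entries;
  // here they are simply listed by hand.
  val registered: Seq[Register] = Seq(new DeltaLike, new ParquetLike)

  // Resolve a format alias to the provider that registered it,
  // ignoring the case of the alias.
  def resolve(alias: String): Option[Register] =
    registered.find(_.shortName().equalsIgnoreCase(alias))
}
```

With this sketch, AliasLookupSketch.resolve("delta") yields the DeltaLike provider, while an unregistered alias yields None.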

RelationProvider for Batch Queries

DeltaDataSource is a RelationProvider for reading (loading) data from a delta table in a structured query.

Tip

Read up on RelationProvider in The Internals of Spark SQL online book.

createRelation(
  sqlContext: SQLContext,
  parameters: Map[String, String]): BaseRelation

createRelation reads the path option from the given parameters.

createRelation verifies the given parameters.

createRelation extracts time travel specification from the given parameters.

In the end, createRelation creates a DeltaTableV2 (for the path option and the time travel specification) and requests it for an insertable HadoopFsRelation.

createRelation throws an IllegalArgumentException when the path option is not specified:

'path' is not specified
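The path check can be sketched in plain Scala as follows. requirePath is a hypothetical helper name (it is not part of DeltaDataSource). Only the exception type and the message are taken from the description above.

```scala
object PathOptionSketch {
  // Hypothetical helper: returns the value of the path option or fails
  // with the exception type and message described for createRelation.
  def requirePath(parameters: Map[String, String]): String =
    parameters.getOrElse(
      "path",
      throw new IllegalArgumentException("'path' is not specified"))
}
```

Since getOrElse takes its default by name, the exception is thrown only when the path key is missing.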

Source Schema

sourceSchema(
  sqlContext: SQLContext,
  schema: Option[StructType],
  providerName: String,
  parameters: Map[String, String]): (String, StructType)

sourceSchema creates a DeltaLog for the Delta table in the directory specified by the required path option (in the parameters). It returns a pair of the delta name and the schema of the Delta table.

sourceSchema throws an IllegalArgumentException when the path option has not been specified:

'path' is not specified

sourceSchema throws an AnalysisException when the path option uses time travel:

Cannot time travel views, subqueries or streams.
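One way to picture the time-travel check is a pattern match on the path suffix. The sketch below assumes that a path uses time travel when it ends with @v followed by a version number or with @ followed by a 17-digit timestamp (yyyyMMddHHmmssSSS). The two patterns and the PathTimeTravelSketch name are illustrative assumptions, not code copied from DeltaDataSource.

```scala
object PathTimeTravelSketch {
  // Assumed suffix forms: "<path>@v<version>" and "<path>@<yyyyMMddHHmmssSSS>".
  private val VersionSuffix = """(.*)@[vV](\d+)$""".r
  private val TimestampSuffix = """(.*)@(\d{17})$""".r

  // True when the path carries one of the assumed time-travel suffixes.
  def usesTimeTravel(path: String): Boolean = path match {
    case VersionSuffix(_, _) | TimestampSuffix(_, _) => true
    case _ => false
  }
}
```

A streaming source would reject any path for which usesTimeTravel returns true, which corresponds to the AnalysisException above.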

sourceSchema is part of the StreamSourceProvider abstraction (Spark Structured Streaming).

CreatableRelationProvider

DeltaDataSource is a CreatableRelationProvider for writing out the result of a structured query.

Tip

Read up on CreatableRelationProvider in The Internals of Spark SQL online book.

Creating Streaming Source

DeltaDataSource is a StreamSourceProvider.

Creating Streaming Sink

DeltaDataSource is a StreamSinkProvider that creates a streaming sink for Spark Structured Streaming.

DeltaDataSource supports Append and Complete output modes only.

In the end, DeltaDataSource creates a DeltaSink.
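The output mode restriction can be sketched in plain Scala. Mode and its three case objects are local stand-ins for the output modes of Structured Streaming, and the exception type and message are hypothetical (the actual error raised by DeltaDataSource is not shown here).

```scala
// Local stand-ins for Structured Streaming's output modes (illustration only).
sealed trait Mode
case object Append extends Mode
case object Complete extends Mode
case object Update extends Mode

object SinkModeSketch {
  // Accept only the two supported modes; reject anything else.
  def validate(mode: Mode): Mode = mode match {
    case Append | Complete => mode
    case other =>
      // Hypothetical exception type and message.
      throw new IllegalArgumentException(s"Output mode $other is not supported")
  }
}
```

In this sketch the check happens before any sink is created, so an unsupported mode fails fast.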

Loading Table

getTable(
  schema: StructType,
  partitioning: Array[Transform],
  properties: java.util.Map[String, String]): Table

getTable...FIXME

getTable is part of the TableProvider (Spark SQL 3.0.0) abstraction.

Utilities

getTimeTravelVersion

getTimeTravelVersion(
  parameters: Map[String, String]): Option[DeltaTimeTravelSpec]

getTimeTravelVersion...FIXME

getTimeTravelVersion is used when DeltaDataSource is requested to create a relation (as a RelationProvider).
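As a rough illustration only, extracting a time travel specification from reader options might look like the sketch below. It assumes that the public versionAsOf and timestampAsOf reader options carry the specification. TimeTravelSketch is a simplified, hypothetical stand-in for DeltaTimeTravelSpec, and the error messages are made up for the example.

```scala
import java.util.Locale
import scala.util.Try

// Hypothetical, simplified stand-in for DeltaTimeTravelSpec.
final case class TimeTravelSketch(version: Option[Long], timestamp: Option[String])

object TimeTravelOptionsSketch {
  // Assumption: these reader options carry the time travel specification.
  val VersionKey = "versionAsOf"
  val TimestampKey = "timestampAsOf"

  def fromOptions(parameters: Map[String, String]): Option[TimeTravelSketch] = {
    // Normalize keys so that the lookup is case-insensitive.
    val lower = parameters.map { case (k, v) => k.toLowerCase(Locale.ROOT) -> v }
    val version = lower.get(VersionKey.toLowerCase(Locale.ROOT))
    val timestamp = lower.get(TimestampKey.toLowerCase(Locale.ROOT))
    (version, timestamp) match {
      case (Some(_), Some(_)) =>
        // Hypothetical message: the two options are mutually exclusive here.
        throw new IllegalArgumentException(
          s"Provide either '$VersionKey' or '$TimestampKey', not both")
      case (Some(v), None) =>
        val parsed = Try(v.toLong).getOrElse(
          throw new IllegalArgumentException(s"'$VersionKey' must be a number, got: $v"))
        Some(TimeTravelSketch(Some(parsed), None))
      case (None, Some(ts)) =>
        Some(TimeTravelSketch(None, Some(ts)))
      case (None, None) =>
        None
    }
  }
}
```

When neither option is present the sketch returns None, which corresponds to reading the latest version of the table.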

parsePathIdentifier

parsePathIdentifier(
  spark: SparkSession,
  userPath: String): (Path, Seq[(String, String)], Option[DeltaTimeTravelSpec])

parsePathIdentifier...FIXME

parsePathIdentifier is used when DeltaTableV2 is requested for the rootPath, partitionFilters, and timeTravelByPath (for a non-catalog table).


Last update: 2020-10-13