Declarative Pipelines¶

Spark Declarative Pipelines (SDP) is a declarative framework for building ETL pipelines on Apache Spark.

Danger

Declarative Pipelines framework is only available in the development branch of Apache Spark 4.1.0-SNAPSHOT.

Declarative Pipelines has not been released in any Spark version yet.

Streaming flows are backed by streaming sources, and batch flows are backed by batch sources.

Declarative Pipelines uses the following Python decorators to describe tables and views:

@sdp.materialized_view for materialized views
@sdp.table for streaming and batch tables

DataflowGraph is the core graph structure in Declarative Pipelines.

Once described, a pipeline can be started (on a PipelineExecution).

Demo¶

Step 1. Register Dataflow Graph¶

DataflowGraphRegistry

import org.apache.spark.sql.connect.pipelines.DataflowGraphRegistry

val graphId = DataflowGraphRegistry.createDataflowGraph(
  defaultCatalog=spark.catalog.currentCatalog(),
  defaultDatabase=spark.catalog.currentDatabase,
  defaultSqlConf=Map.empty)

Step 2. Look Up Dataflow Graph¶

DataflowGraphRegistry

import org.apache.spark.sql.pipelines.graph.GraphRegistrationContext

val graphCtx: GraphRegistrationContext =
  DataflowGraphRegistry.getDataflowGraphOrThrow(dataflowGraphId=graphId)

Step 3. Create DataflowGraph¶

GraphRegistrationContext

import org.apache.spark.sql.pipelines.graph.DataflowGraph

val sdp: DataflowGraph = graphCtx.toDataflowGraph

Step 4. Create Update Context¶

PipelineUpdateContextImpl

import org.apache.spark.sql.pipelines.graph.{ PipelineUpdateContext, PipelineUpdateContextImpl }
import org.apache.spark.sql.pipelines.logging.PipelineEvent

val swallowEventsCallback: PipelineEvent => Unit = _ => ()

val updateCtx: PipelineUpdateContext =
  new PipelineUpdateContextImpl(unresolvedGraph=sdp, eventCallback=swallowEventsCallback)

Step 5. Start Pipeline¶

PipelineExecution

updateCtx.pipelineExecution.runPipeline()

Dataset Types¶

Declarative Pipelines supports the following dataset types:

Materialized Views (datasets) that are published to a catalog
Table that are published to a catalog
Views that are not published to a catalog

Learning Resources¶

Spark Declarative Pipelines Programming Guide