
Declarative Pipelines

Spark Declarative Pipelines (SDP) is a declarative framework for building ETL pipelines on Apache Spark.

Danger

The Declarative Pipelines framework is only available in the development branch of Apache Spark (4.1.0-SNAPSHOT).

Declarative Pipelines has not been released in any Spark version yet.

A pipeline's tables and views are populated by flows: streaming flows are backed by streaming sources, while batch flows are backed by batch sources.

Declarative Pipelines uses the following Python decorators to describe tables and views:

  • @sdp.materialized_view for materialized views
  • @sdp.table for streaming and batch tables
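
For illustration, here is a minimal sketch of a pipeline definition in Python. The import path and the availability of spark (the active SparkSession in pipeline source files) are assumptions, and the source tables orders and events are made up:

from pyspark import pipelines as sdp  # import path is an assumption

# Batch flow: fully recomputed from a batch source on every update
@sdp.materialized_view
def daily_orders():
    return spark.read.table("orders").groupBy("order_date").count()

# Streaming flow: processed incrementally from a streaming source
@sdp.table
def raw_events():
    return spark.readStream.table("events")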

DataflowGraph is the core graph structure in Declarative Pipelines.

Once described, a pipeline can be started (via a PipelineExecution).

Demo

Step 1. Register Dataflow Graph

DataflowGraphRegistry

import org.apache.spark.sql.connect.pipelines.DataflowGraphRegistry

val graphId = DataflowGraphRegistry.createDataflowGraph(
  defaultCatalog = spark.catalog.currentCatalog(),
  defaultDatabase = spark.catalog.currentDatabase,
  defaultSqlConf = Map.empty)
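
createDataflowGraph registers a new, empty dataflow graph and returns its identifier, which the following steps use to look the graph up.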

Step 2. Look Up Dataflow Graph

DataflowGraphRegistry

import org.apache.spark.sql.pipelines.graph.GraphRegistrationContext

val graphCtx: GraphRegistrationContext =
  DataflowGraphRegistry.getDataflowGraphOrThrow(dataflowGraphId = graphId)
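
The GraphRegistrationContext collects the tables, views, and flows of a pipeline (along with the default catalog and database) before the graph is finalized.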

Step 3. Create DataflowGraph

GraphRegistrationContext

import org.apache.spark.sql.pipelines.graph.DataflowGraph

val sdp: DataflowGraph = graphCtx.toDataflowGraph
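
toDataflowGraph converts the registration context into a DataflowGraph. The graph is still unresolved at this point (note the unresolvedGraph parameter in the next step); it is resolved when the pipeline update runs.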

Step 4. Create Update Context

PipelineUpdateContextImpl

import org.apache.spark.sql.pipelines.graph.{ PipelineUpdateContext, PipelineUpdateContextImpl }
import org.apache.spark.sql.pipelines.logging.PipelineEvent

val swallowEventsCallback: PipelineEvent => Unit = _ => ()

val updateCtx: PipelineUpdateContext =
  new PipelineUpdateContextImpl(unresolvedGraph = sdp, eventCallback = swallowEventsCallback)
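
PipelineUpdateContextImpl pairs the unresolved graph with an event callback that receives the PipelineEvents emitted during execution. This demo simply discards them.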

Step 5. Start Pipeline

PipelineExecution

updateCtx.pipelineExecution.runPipeline()
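
runPipeline executes an update of the whole pipeline, running the flows that populate its datasets.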

Dataset Types

Declarative Pipelines supports the following dataset types:

  • Materialized views that are published to a catalog
  • Tables that are published to a catalog
  • Views that are not published to a catalog
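
To illustrate the distinction, a sketch under the assumption that a @sdp.temporary_view decorator exists by analogy with the decorators above:

# Not published to a catalog; only referenceable from within the pipeline
@sdp.temporary_view  # decorator name is an assumption
def cleaned_events():
    return spark.read.table("events").where("event_id IS NOT NULL")

# Published to a catalog (under the pipeline's default catalog and database)
@sdp.materialized_view
def event_counts():
    return spark.read.table("cleaned_events").groupBy("event_type").count()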

Learning Resources

  1. Spark Declarative Pipelines Programming Guide