Spark Pipelines CLI¶
Spark Declarative Pipelines comes with the spark-pipelines shell script to launch a Spark Declarative Pipelines project.
$SPARK_HOME/bin/spark-pipelines
spark-pipelines prepares the runtime environment to run SparkPipelines (with the path to the cli.py Python script).
cli.py performs two critical steps in a SDP project's execution:

- As a Python script, cli.py imports all the Python transformation scripts written by a SDP developer (which are executed immediately, per the rules of the Python import system).
- SQL libraries remain untouched and are sent over the wire to a Spark Connect server (PipelinesHandler) for execution.
spark-pipelines Shell Script¶
spark-pipelines can also be launched with uvx:
uvx --with "pyspark[pipelines]" spark-pipelines
The Pipelines CLI supports the following commands:
dry-run¶
Launch a run that just validates the graph and checks for errors
| Option | Description | Default |
|---|---|---|
| --spec | Path to the pipeline spec | (undefined) |
init¶
Generate a sample pipeline project, including a spec file and example definitions
| Option | Description | Default | Required |
|---|---|---|---|
| --name | Name of the project. A directory with this name will be created underneath the current directory | (undefined) | ✅ |
$ ./bin/spark-pipelines init --name hello-pipelines
Pipeline project 'hello-pipelines' created successfully. To run your pipeline:
cd 'hello-pipelines'
spark-pipelines run
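The generated project includes a pipeline spec file alongside example definitions. What such a spec may look like is sketched below; the file name and field names are assumptions and should be verified against the files that init actually generates for your Spark version:

```yaml
# pipeline.yml (name and fields assumed for illustration)
name: hello-pipelines
libraries:
  - glob:
      include: transformations/**
```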
run¶
Run a pipeline. If no --refresh option is specified, a default incremental update is performed.
| Option | Description | Default |
|---|---|---|
| --spec | Path to the pipeline spec | (undefined) |
| --full-refresh | List of datasets to reset and recompute (comma-separated) | (empty) |
| --full-refresh-all | Perform a full graph reset and recompute | (undefined) |
| --refresh | List of datasets to update (comma-separated) | (empty) |
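Combining the options above, typical invocations could look like the following (dataset names and the pipeline.yml path are placeholders):

```
# Default incremental update, using an explicit spec path
$SPARK_HOME/bin/spark-pipelines run --spec pipeline.yml

# Reset and recompute only the listed datasets
$SPARK_HOME/bin/spark-pipelines run --full-refresh orders,customers

# Reset and recompute the entire graph
$SPARK_HOME/bin/spark-pipelines run --full-refresh-all
```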
When executed, run prints out the following log message:
Loading pipeline spec from [spec_path]...
run loads a pipeline spec.
run prints out the following log message:
Creating Spark session...
run creates a Spark session with the configurations from the pipeline spec.
run prints out the following log message:
Creating dataflow graph...
run sends a CreateDataflowGraph command for execution in the Spark Connect server.
Spark Connect Server and Command Execution
CreateDataflowGraph is handled by PipelinesHandler on the Spark Connect Server.
run prints out the following log message:
Dataflow graph created (ID: [dataflow_graph_id]).
run prints out the following log message:
Registering graph elements...
run creates a SparkConnectGraphElementRegistry and calls register_definitions.
run prints out the following log message:
Starting run (dry=[dry], full_refresh=[full_refresh], full_refresh_all=[full_refresh_all], refresh=[refresh])...
run sends a StartRun command for execution in the Spark Connect Server.
StartRun Command and PipelinesHandler
StartRun command is handled by PipelinesHandler on the Spark Connect Server.
In the end, run keeps printing out pipeline events received from the Spark Connect server until the run finishes.
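This event-tailing behavior can be pictured with a stubbed event stream. The generator below is purely illustrative; the real events arrive over Spark Connect and their messages are not shown here:

```python
from typing import Iterator, List

def pipeline_events() -> Iterator[str]:
    """Stand-in for the stream of pipeline events the server emits
    (placeholder strings, not actual SDP event messages)."""
    for i in range(3):
        yield f"event-{i}"

def tail_events(events: Iterator[str]) -> List[str]:
    """Print each event as it arrives, the way run keeps doing
    until the pipeline run finishes."""
    seen: List[str] = []
    for event in events:
        print(event)
        seen.append(event)
    return seen

log = tail_events(pipeline_events())
```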