SparkPipelines — Spark Pipelines CLI

SparkPipelines is a standalone application that is executed using the spark-pipelines shell script.

SparkPipelines is a Scala "launchpad" that executes the pyspark/pipelines/cli.py Python script (through SparkSubmit).
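In effect, spark-pipelines boils down to a spark-submit of the Python CLI script, with the command-line arguments passed through. Conceptually (the exact arguments are assembled by SparkPipelines at runtime and may differ):

$ ./bin/spark-submit python/pyspark/pipelines/cli.py run --spec pipeline.yml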

PySpark Pipelines CLI

pyspark/pipelines/cli.py is the Pipelines CLI that is launched using the spark-pipelines shell script.

The Pipelines CLI supports three commands (run, dry-run, init), as the usage output shows:

$ pwd
/Users/jacek/oss/spark/python

$ PYTHONPATH=. uv run \
    --with grpcio-status \
    --with grpcio \
    --with pyarrow \
    --with pandas \
    --with pyspark \
    python pyspark/pipelines/cli.py
...
usage: cli.py [-h] {run,dry-run,init} ...
cli.py: error: the following arguments are required: command

dry-run

Launch a run that just validates the graph and checks for errors

| Option | Description | Default |
|--------|-------------|---------|
| --spec | Path to the pipeline spec | (undefined) |
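For example (assuming a pipeline.yml spec file, e.g. one generated by init below):

$ ./bin/spark-pipelines dry-run --spec pipeline.yml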

init

Generate a sample pipeline project, including a spec file and example definitions

| Option | Description | Default | Required |
|--------|-------------|---------|----------|
| --name | Name of the project. A directory with this name will be created underneath the current directory. | (undefined) | yes |
$ ./bin/spark-pipelines init --name hello-pipelines
Pipeline project 'hello-pipelines' created successfully. To run your pipeline:
cd 'hello-pipelines'
spark-pipelines run
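init generates the pipeline spec (pipeline.yml) next to a transformations directory with example definitions. A sketch of what the generated spec can look like (the field names follow the Spark Declarative Pipelines documentation; the exact template may differ across Spark versions):

name: hello-pipelines
libraries:
  - glob:
      include: transformations/**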

run

Run a pipeline. If no --refresh option is specified, a default incremental update is performed.

| Option | Description | Default |
|--------|-------------|---------|
| --spec | Path to the pipeline spec | (undefined) |
| --full-refresh | List of datasets to reset and recompute (comma-separated) | (empty) |
| --full-refresh-all | Perform a full graph reset and recompute | (undefined) |
| --refresh | List of datasets to update (comma-separated) | (empty) |
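For example, to reset and recompute just two datasets (the dataset names are illustrative):

$ ./bin/spark-pipelines run --spec pipeline.yml --full-refresh orders,customers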

When executed, run prints out the following log message:

Loading pipeline spec from [spec_path]...

run loads a pipeline spec.
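A pipeline spec is a YAML file, so loading one boils down to parsing YAML into the CLI's spec model. A minimal sketch in Python, assuming a PyYAML-based parser and the spec layout shown earlier (illustrative, not the exact cli.py code):

from pathlib import Path

import yaml

def load_pipeline_spec(spec_path: Path) -> dict:
    # Parse the YAML pipeline spec into a plain dict
    # (the real CLI maps it onto a dedicated spec type).
    with spec_path.open() as f:
        return yaml.safe_load(f)

spec = load_pipeline_spec(Path("pipeline.yml"))
print(spec["name"])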

run prints out the following log message:

Creating Spark session...

run creates a Spark session with the configurations from the pipeline spec.
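A minimal sketch of what creating such a session can look like with the Spark Connect Python client (the remote address and the spec dict are illustrative assumptions):

from pyspark.sql import SparkSession

# Stand-in for the configuration map loaded from the pipeline spec.
spec = {"configuration": {"spark.sql.shuffle.partitions": "4"}}

builder = SparkSession.builder.remote("sc://localhost")
# Apply the spec's configuration entries to the session builder.
for key, value in spec.get("configuration", {}).items():
    builder = builder.config(key, value)
spark = builder.getOrCreate()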

run prints out the following log message:

Creating dataflow graph...

run sends a CreateDataflowGraph command for execution in the Spark Connect server.

Spark Connect Server and Command Execution

CreateDataflowGraph is handled by PipelinesHandler on the Spark Connect server.

run prints out the following log message:

Dataflow graph created (ID: [dataflow_graph_id]).

run prints out the following log message:

Registering graph elements...

run creates a SparkConnectGraphElementRegistry and calls register_definitions.
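Graph elements are the datasets and flows defined in the project's definition files. A hedged example of a Python definition file that register_definitions would pick up (the decorator-based API is assumed from the pyspark.pipelines package; verify the names against your Spark version):

from pyspark import pipelines as dp
from pyspark.sql import DataFrame, SparkSession

@dp.materialized_view
def hello_numbers() -> DataFrame:
    # A trivial materialized view with five rows.
    # Assumes an active session, as during a pipeline run.
    spark = SparkSession.getActiveSession()
    return spark.range(5)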

run prints out the following log message:

Starting run (dry=[dry], full_refresh=[full_refresh], full_refresh_all=[full_refresh_all], refresh=[refresh])...

run sends a StartRun command for execution in the Spark Connect server.

StartRun Command and PipelinesHandler

The StartRun command is handled by PipelinesHandler on the Spark Connect server.

In the end, run keeps printing out pipeline events from the Spark Connect server.