Spark Pipelines CLI¶
Spark Declarative Pipelines comes with the spark-pipelines shell script to launch a Spark Declarative Pipelines project.
$SPARK_HOME/bin/spark-pipelines
spark-pipelines prepares the runtime environment to run SparkPipelines (with the path to the cli.py Python script).
cli.py performs two critical steps in a SDP project's execution:

- As a Python script, cli.py imports all the Python transformation scripts written by a SDP developer (which are executed immediately, per the rules of the Python import system).
- SQL libraries remain untouched and are sent over the wire to a Spark Connect server (PipelinesHandler) for execution.
spark-pipelines Shell Script¶
spark-pipelines can also be launched with uvx:
uvx --with "pyspark[pipelines]" spark-pipelines
The Pipelines CLI supports the following commands:
dry-run¶
Launch a run that just validates the graph and checks for errors
| Option | Description | Default |
|---|---|---|
| --spec | Path to the pipeline spec | (undefined) |
init¶
Generate a sample pipeline project, including a spec file and example definitions
| Option | Description | Default | Required |
|---|---|---|---|
| --name | Name of the project. A directory with this name will be created underneath the current directory | (undefined) | ✅ |
$ ./bin/spark-pipelines init --name hello-pipelines
Pipeline project 'hello-pipelines' created successfully. To run your pipeline:
cd 'hello-pipelines'
spark-pipelines run
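The generated project includes a pipeline spec file alongside example definitions. What such a spec may look like is sketched below; the file name and field names are assumptions and should be verified against the files that init actually generates for your Spark version:

```yaml
# pipeline.yml (name and fields assumed for illustration)
name: hello-pipelines
libraries:
  - glob:
      include: transformations/**
```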
run¶
Run a pipeline. If no --refresh option is specified, a default incremental update is performed.
| Option | Description | Default |
|---|---|---|
| --spec | Path to the pipeline spec | (undefined) |
| --full-refresh | List of datasets to reset and recompute (comma-separated) | (empty) |
| --full-refresh-all | Perform a full graph reset and recompute | (undefined) |
| --refresh | List of datasets to update (comma-separated) | (empty) |
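Combining the options above, typical invocations could look like the following (dataset names and the pipeline.yml path are placeholders):

```
# Default incremental update, using an explicit spec path
$SPARK_HOME/bin/spark-pipelines run --spec pipeline.yml

# Reset and recompute only the listed datasets
$SPARK_HOME/bin/spark-pipelines run --full-refresh orders,customers

# Reset and recompute the entire graph
$SPARK_HOME/bin/spark-pipelines run --full-refresh-all
```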
When executed, run prints out the following log message:
Loading pipeline spec from [spec_path]...
run loads a pipeline spec.
run prints out the following log message:
Creating Spark session...
run creates a Spark session with the configurations from the pipeline spec.
run prints out the following log message:
Creating dataflow graph...
run sends a CreateDataflowGraph command for execution in the Spark Connect server.
Spark Connect Server and Command Execution
CreateDataflowGraph is handled by PipelinesHandler on the Spark Connect Server.
run prints out the following log message:
Dataflow graph created (ID: [dataflow_graph_id]).
run prints out the following log message:
Registering graph elements...
run creates a SparkConnectGraphElementRegistry and calls register_definitions.
run prints out the following log message:
Starting run (dry=[dry], full_refresh=[full_refresh], full_refresh_all=[full_refresh_all], refresh=[refresh])...
run sends a StartRun command for execution in the Spark Connect Server.
StartRun Command and PipelinesHandler
StartRun command is handled by PipelinesHandler on the Spark Connect Server.
In the end, run keeps printing out pipeline events received from the Spark Connect server until the run finishes.
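This event-tailing behavior can be pictured with a stubbed event stream. The generator below is purely illustrative; the real events arrive over Spark Connect and their messages are not shown here:

```python
from typing import Iterator, List

def pipeline_events() -> Iterator[str]:
    """Stand-in for the stream of pipeline events the server emits
    (placeholder strings, not actual SDP event messages)."""
    for i in range(3):
        yield f"event-{i}"

def tail_events(events: Iterator[str]) -> List[str]:
    """Print each event as it arrives, the way run keeps doing
    until the pipeline run finishes."""
    seen: List[str] = []
    for event in events:
        print(event)
        seen.append(event)
    return seen

log = tail_events(pipeline_events())
```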