# SparkPipelines — Spark Pipelines CLI

`SparkPipelines` is a standalone application that is executed using the `spark-pipelines` shell script. It is a Scala "launchpad" that runs the `pyspark/pipelines/cli.py` Python script (through `SparkSubmit`).
## PySpark Pipelines CLI

`pyspark/pipelines/cli.py` is the Pipelines CLI that is launched using the `spark-pipelines` shell script.

The Pipelines CLI supports the following commands:
```console
$ pwd
/Users/jacek/oss/spark/python

$ PYTHONPATH=. uv run \
    --with grpcio-status \
    --with grpcio \
    --with pyarrow \
    --with pandas \
    --with pyspark \
    python pyspark/pipelines/cli.py
...
usage: cli.py [-h] {run,dry-run,init} ...
cli.py: error: the following arguments are required: command
```
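The command structure above can be sketched with `argparse`. This is a minimal illustration inferred from the usage text and the option tables below, not the actual `cli.py` implementation:

```python
import argparse

# Sketch of the CLI's subcommand structure (run, dry-run, init).
# Option names mirror the tables in this page; defaults and help
# strings are assumptions, not the real cli.py code.
parser = argparse.ArgumentParser(prog="cli.py")
subparsers = parser.add_subparsers(dest="command", required=True)

run = subparsers.add_parser("run", help="Run a pipeline")
run.add_argument("--spec", help="Path to the pipeline spec")
run.add_argument("--full-refresh", help="Comma-separated datasets to reset and recompute")
run.add_argument("--full-refresh-all", action="store_true", help="Full graph reset and recompute")
run.add_argument("--refresh", help="Comma-separated datasets to update")

dry_run = subparsers.add_parser("dry-run", help="Validate the graph and check for errors")
dry_run.add_argument("--spec", help="Path to the pipeline spec")

init = subparsers.add_parser("init", help="Generate a sample pipeline project")
init.add_argument("--name", required=True, help="Name of the project")

args = parser.parse_args(["run", "--refresh", "a,b"])
print(args.command, args.refresh)
```

Note that running the parser with no subcommand reproduces the `the following arguments are required: command` error shown above.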
### dry-run

Launches a run that only validates the dataflow graph and checks for errors.

| Option | Description | Default |
|---|---|---|
| `--spec` | Path to the pipeline spec | (undefined) |
### init

Generates a sample pipeline project, including a spec file and example definitions.

| Option | Description | Default | Required |
|---|---|---|---|
| `--name` | Name of the project. A directory with this name is created underneath the current directory | (undefined) | ✅ |
```console
$ ./bin/spark-pipelines init --name hello-pipelines
Pipeline project 'hello-pipelines' created successfully. To run your pipeline:
  cd 'hello-pipelines'
  spark-pipelines run
```
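The generated project is driven by a pipeline spec file. As a rough sketch (the exact file name and field names are assumptions, not taken from the generated output), the spec names the pipeline and points at the definition files via globs:

```yaml
# pipeline.yml — hypothetical sketch of a pipeline spec
name: hello-pipelines
definitions:
  - glob:
      include: transformations/**
```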
### run

Runs a pipeline. If no `--refresh` option is specified, a default incremental update is performed.

| Option | Description | Default |
|---|---|---|
| `--spec` | Path to the pipeline spec | (undefined) |
| `--full-refresh` | Comma-separated list of datasets to reset and recompute | (empty) |
| `--full-refresh-all` | Perform a full graph reset and recompute | (undefined) |
| `--refresh` | Comma-separated list of datasets to update | (empty) |
When executed, `run` prints out the following log message:

```text
Loading pipeline spec from [spec_path]...
```

`run` loads the pipeline spec.
`run` prints out the following log message:

```text
Creating Spark session...
```

`run` creates a Spark session with the configurations from the pipeline spec.
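Conceptually, the spec-level configurations are layered over any base session settings before the session is built. A small sketch of that merge (the helper and the spec dict are hypothetical, not the actual `cli.py` code):

```python
def build_session_conf(spec, base=None):
    """Merge pipeline-spec configuration over base session settings
    (hypothetical helper; spec-level values win on conflicts)."""
    conf = dict(base or {})
    conf.update(spec.get("configuration", {}))
    return conf

# Hypothetical spec fragment with a "configuration" section.
spec = {"configuration": {"spark.sql.shuffle.partitions": "4"}}
conf = build_session_conf(spec, base={"spark.app.name": "pipeline"})
print(conf)
```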
`run` prints out the following log message:

```text
Creating dataflow graph...
```

`run` sends a `CreateDataflowGraph` command for execution on the Spark Connect server.

**Spark Connect Server and Command Execution**: The `CreateDataflowGraph` command is handled by `PipelinesHandler` on the Spark Connect server.

Once the graph is created, `run` prints out the following log message:

```text
Dataflow graph created (ID: [dataflow_graph_id]).
```
`run` prints out the following log message:

```text
Registering graph elements...
```

`run` creates a `SparkConnectGraphElementRegistry` and registers the pipeline definitions (`register_definitions`).
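Registering graph elements amounts to evaluating each definition file matched by the spec so that the datasets it defines are recorded in a registry. A conceptual sketch (the `Registry` class, the glob patterns, and the in-scope `registry` variable are all stand-ins, not the real `SparkConnectGraphElementRegistry` mechanics):

```python
import tempfile
from pathlib import Path

class Registry:
    """Hypothetical stand-in for SparkConnectGraphElementRegistry."""
    def __init__(self):
        self.datasets = []

def register_definitions(root, patterns, registry):
    """Evaluate every definition file matched by the spec's globs
    (sketch; assumes definition files see a `registry` in scope)."""
    for pattern in patterns:
        for path in sorted(Path(root).glob(pattern)):
            exec(path.read_text(), {"registry": registry})

# Demo: one definition file that registers a single dataset.
with tempfile.TemporaryDirectory() as root:
    defs = Path(root, "transformations")
    defs.mkdir()
    (defs / "orders.py").write_text("registry.datasets.append('orders')\n")
    reg = Registry()
    register_definitions(root, ["transformations/**/*.py"], reg)
    print(reg.datasets)
```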
`run` prints out the following log message:

```text
Starting run (dry=[dry], full_refresh=[full_refresh], full_refresh_all=[full_refresh_all], refresh=[refresh])...
```

`run` sends a `StartRun` command for execution on the Spark Connect server.

**StartRun Command and PipelinesHandler**: The `StartRun` command is handled by `PipelinesHandler` on the Spark Connect server.
In the end, `run` keeps printing out pipeline events from the Spark Connect server.
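That final phase is essentially a loop over a stream of events, printing each as it arrives. A minimal sketch with made-up event tuples (the real CLI receives its events from the Spark Connect server):

```python
def stream_events(events):
    """Print pipeline events as they arrive (sketch; the event shape
    and messages below are invented for illustration)."""
    printed = []
    for timestamp, message in events:
        line = f"{timestamp} {message}"
        print(line)
        printed.append(line)
    return printed

events = [
    ("2025-01-01T00:00:00", "Flow 'orders' is QUEUED."),
    ("2025-01-01T00:00:01", "Flow 'orders' has COMPLETED."),
]
printed = stream_events(events)
```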