# SparkPipelines — Spark Pipelines CLI
`SparkPipelines` is a standalone application that can be executed using the `spark-pipelines` shell script.

`SparkPipelines` is a Scala "launchpad" that executes the `python/pyspark/pipelines/cli.py` Python script (through `SparkSubmit`).
## PySpark Pipelines CLI
```console
$ pwd
/Users/jacek/oss/spark/python

$ PYTHONPATH=. uv run \
    --with grpcio-status \
    --with grpcio \
    --with pyarrow \
    --with pandas \
    --with pyspark \
    python pyspark/pipelines/cli.py
...
usage: cli.py [-h] {run,dry-run,init} ...
cli.py: error: the following arguments are required: command
```
## dry-run

Launch a run that just validates the graph and checks for errors.
Option | Description | Default |
---|---|---|
`--spec` | Path to the pipeline spec | (undefined) |
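
A dry run builds and validates the dataflow graph without updating any datasets, so it is a quick way to catch errors before a real run. For example (the `pipeline.yml` path is illustrative):

```console
$ ./bin/spark-pipelines dry-run --spec pipeline.yml
```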
## init

Generate a sample pipeline project, including a spec file and example definitions.
Option | Description | Default | Required |
---|---|---|---|
`--name` | Name of the project. A directory with this name will be created underneath the current directory | (undefined) | ✅ |
```console
$ ./bin/spark-pipelines init --name hello-pipelines
Pipeline project 'hello-pipelines' created successfully. To run your pipeline:
cd 'hello-pipelines'
spark-pipelines run
```
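
The generated project layout can vary across Spark versions. As a minimal sketch, the spec is a `pipeline.yml` file that points the CLI at the transformation definitions; the exact keys below are an assumption and may differ:

```yaml
# pipeline.yml (illustrative sketch; the generated file may differ)
name: hello-pipelines
definitions:
  - glob:
      include: transformations/**
```

The example definitions live under `transformations/` and register datasets with the `pyspark.pipelines` API. A minimal Python materialized view could look like this (a sketch; the file and function names are illustrative):

```python
# transformations/example_python_materialized_view.py (illustrative sketch)
from pyspark import pipelines as dp
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.active()

# Registers a materialized view in the dataflow graph; the query below
# is evaluated by the pipeline run, not at import time.
@dp.materialized_view
def example_python_materialized_view() -> DataFrame:
    return spark.range(10)
```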
## run

Run a pipeline. If no `--refresh` option is specified, a default incremental update is performed.
Option | Description | Default |
---|---|---|
`--spec` | Path to the pipeline spec | (undefined) |
`--full-refresh` | List of datasets to reset and recompute (comma-separated) | (empty) |
`--full-refresh-all` | Perform a full graph reset and recompute | (undefined) |
`--refresh` | List of datasets to update (comma-separated) | (empty) |
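
For example, to incrementally update selected datasets or to recompute the whole graph from scratch (the dataset names are illustrative):

```console
$ ./bin/spark-pipelines run --refresh orders,customers
$ ./bin/spark-pipelines run --full-refresh-all
```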