# Demo: spark-pipelines CLI
**Activate Virtual Environment**

Follow Demo: Create Virtual Environment for Python Client before getting started with this demo.
## Display Pipelines Help
Run `spark-pipelines --help` to learn the available options.
```shell
$SPARK_HOME/bin/spark-pipelines --help
```

```text
usage: cli.py [-h] {run,dry-run,init} ...

Pipelines CLI

positional arguments:
  {run,dry-run,init}
    run       Run a pipeline. If no refresh options specified, a
              default incremental update is performed.
    dry-run   Launch a run that just validates the graph and checks
              for errors.
    init      Generate a sample pipeline project, including a spec
              file and example transformations.

options:
  -h, --help  show this help message and exit
```
## Create Pipelines Demo Project
So far you have only created an empty Python project (using `uv`).

Create a demo hello-spark-pipelines pipeline project with a sample `spark-pipeline.yml` and sample transformations (in Python and in SQL).
```shell
$SPARK_HOME/bin/spark-pipelines init --name hello-spark-pipelines && \
mv hello-spark-pipelines/* . && \
rm -rf hello-spark-pipelines
```
```shell
cat spark-pipeline.yml
```

```yaml
name: hello-spark-pipelines
storage: file:///Users/jacek/sandbox/hello-spark-pipelines/hello-spark-pipelines/pipeline-storage
libraries:
  - glob:
      include: transformations/**
```
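The `glob` entry under `libraries` is what later drives the "Found 2 files matching glob" message in the run logs. The matching itself can be sketched with Python's stdlib `pathlib` (this mimics, not reuses, the CLI's matcher; the directory layout is recreated in a temporary directory purely for illustration):

```python
# Sketch of how a glob include like `transformations/**` resolves to files,
# using Python's stdlib pathlib.
import tempfile
from pathlib import Path

# Recreate the demo project layout in a temporary directory.
root = Path(tempfile.mkdtemp())
(root / "transformations").mkdir()
(root / "transformations" / "example_python_materialized_view.py").touch()
(root / "transformations" / "example_sql_materialized_view.sql").touch()

# The CLI log reports matching against 'transformations/**/*'.
matches = sorted(p.name for p in root.glob("transformations/**/*") if p.is_file())
print(len(matches), matches)  # 2 files, as in the demo project
```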
```shell
tree transformations
```

```text
transformations
├── example_python_materialized_view.py
└── example_sql_materialized_view.sql

1 directory, 2 files
```
**Spark Connect Server should be down**

`spark-pipelines dry-run` starts its own Spark Connect Server on port 15002 (unless started with the `--remote` option). Shut down the Spark Connect Server if you have already started one.
```shell
$SPARK_HOME/sbin/stop-connect-server.sh
```
**`--remote` option**

Use the `--remote` option to connect to a standalone Spark Connect Server.

```shell
$SPARK_HOME/bin/spark-pipelines --remote sc://localhost dry-run
```
## Dry Run Pipelines Project

```shell
$SPARK_HOME/bin/spark-pipelines dry-run
```
```text
Loading pipeline spec from /Users/jacek/sandbox/hello-spark-pipelines/spark-pipeline.yml...
Creating Spark session...
Creating dataflow graph...
Registering graph elements...
Loading definitions. Root directory: '/Users/jacek/sandbox/hello-spark-pipelines'.
Found 2 files matching glob 'transformations/**/*'
Importing /Users/jacek/sandbox/hello-spark-pipelines/transformations/example_python_materialized_view.py...
Registering SQL file /Users/jacek/sandbox/hello-spark-pipelines/transformations/example_sql_materialized_view.sql...
Starting run...
Run is COMPLETED.
```
## Run Pipelines Project

Run the pipeline.

```shell
$SPARK_HOME/bin/spark-pipelines run
```
```text
Loading pipeline spec from /Users/jacek/sandbox/hello-spark-pipelines/spark-pipeline.yml...
Creating Spark session...
Creating dataflow graph...
Registering graph elements...
Loading definitions. Root directory: '/Users/jacek/sandbox/hello-spark-pipelines'.
Found 2 files matching glob 'transformations/**/*'
Importing /Users/jacek/sandbox/hello-spark-pipelines/transformations/example_python_materialized_view.py...
Registering SQL file /Users/jacek/sandbox/hello-spark-pipelines/transformations/example_sql_materialized_view.sql...
Starting run...
Flow spark_catalog.default.example_python_materialized_view is QUEUED.
Flow spark_catalog.default.example_sql_materialized_view is QUEUED.
Flow spark_catalog.default.example_python_materialized_view is PLANNING.
Flow spark_catalog.default.example_python_materialized_view is STARTING.
Flow spark_catalog.default.example_python_materialized_view is RUNNING.
Flow spark_catalog.default.example_python_materialized_view has COMPLETED.
Flow spark_catalog.default.example_sql_materialized_view is PLANNING.
Flow spark_catalog.default.example_sql_materialized_view is STARTING.
Flow spark_catalog.default.example_sql_materialized_view is RUNNING.
Flow spark_catalog.default.example_sql_materialized_view has COMPLETED.
Run is COMPLETED.
```
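The run log shows each flow advancing through the same lifecycle: QUEUED, PLANNING, STARTING, RUNNING, COMPLETED. A minimal sketch of that ordering (state names are taken verbatim from the log; the `is_forward` helper is purely illustrative, not part of the CLI):

```python
# Flow lifecycle states as reported by `spark-pipelines run`.
from enum import IntEnum

class FlowState(IntEnum):
    QUEUED = 0
    PLANNING = 1
    STARTING = 2
    RUNNING = 3
    COMPLETED = 4

def is_forward(transition):
    """Return True if a transition moves a flow forward in its lifecycle."""
    before, after = transition
    return FlowState[before] < FlowState[after]

# Every transition printed for example_python_materialized_view moves forward.
observed = [("QUEUED", "PLANNING"), ("PLANNING", "STARTING"),
            ("STARTING", "RUNNING"), ("RUNNING", "COMPLETED")]
print(all(is_forward(t) for t in observed))
```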
```shell
tree spark-warehouse
```

```text
spark-warehouse
├── example_python_materialized_view
│   ├── _SUCCESS
│   └── part-00000-284bc03a-3405-4e8e-bbd7-f6f17d79c282-c000.snappy.parquet
└── example_sql_materialized_view
    ├── _SUCCESS
    └── part-00000-8316b6c6-7532-4f7a-92f6-2ec024e069f4-c000.snappy.parquet

3 directories, 4 files
```
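Each materialized view directory contains a Hadoop-style `_SUCCESS` marker next to its Parquet part file, which indicates the write committed. A quick way to check that programmatically, sketched with stdlib `pathlib` (the warehouse layout is recreated in a temporary directory with empty files standing in for the real Parquet data):

```python
# Sketch: confirm each table directory in a spark-warehouse layout committed,
# by checking for the Hadoop _SUCCESS marker.
import tempfile
from pathlib import Path

# Recreate the spark-warehouse layout shown above.
warehouse = Path(tempfile.mkdtemp()) / "spark-warehouse"
for table in ("example_python_materialized_view", "example_sql_materialized_view"):
    d = warehouse / table
    d.mkdir(parents=True)
    (d / "_SUCCESS").touch()
    (d / "part-00000-c000.snappy.parquet").touch()

# A table directory is "committed" when the _SUCCESS marker is present.
committed = sorted(d.name for d in warehouse.iterdir() if (d / "_SUCCESS").exists())
print(committed)
```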