SparkPipelines

SparkPipelines is a standalone application that the spark-pipelines shell script uses to run the pyspark/pipelines/cli.py Python script.

This somewhat roundabout way of running the pyspark/pipelines/cli.py Python script lets Spark Declarative Pipelines use the full execution power of spark-submit (Apache Spark), including its built-in support for Spark Connect, while adding pipelines-specific command-line arguments and options.

SparkPipelines behaves similarly to executing spark-submit explicitly as follows:

spark-submit \
    [sparkSubmitArgs] \
    /absolute/path/to/pyspark/pipelines/cli.py \
    [pipelinesArgs]

Or, with uvx:

uvx --from "pyspark[pipelines]==4.1.1" \
    spark-submit \
        [sparkSubmitArgs] \
        /absolute/path/to/pyspark/pipelines/cli.py \
        [pipelinesArgs]

Launch SparkPipelines

main(
  args: Array[String]): Unit

main expects the first command-line argument to be the absolute path of the pyspark/pipelines/cli.py Python script.

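main runs SparkSubmit (Apache Spark) with the arguments ordered by constructSparkSubmitArgs: the spark-submit-specific arguments first, then the script, then the pipelines-specific arguments.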

constructSparkSubmitArgs

constructSparkSubmitArgs(
  pipelinesCliFile: String,
  args: Array[String]): Seq[String]

constructSparkSubmitArgs splits the given args into spark-submit- and pipelines-specific ones.

constructSparkSubmitArgs returns a sequence of the spark-submit-specific arguments followed by the given pipelinesCliFile and the pipelines-specific arguments.
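The ordering described above can be sketched in Python. This is only an illustration, not the Scala implementation: order_args is a hypothetical name, it receives the two argument groups already split, and the sample arguments (including the run subcommand) are placeholders.

```python
from typing import List, Sequence


def order_args(
    spark_submit_args: Sequence[str],
    pipelines_cli_file: str,
    pipelines_args: Sequence[str],
) -> List[str]:
    """Place spark-submit arguments first, then the script, then the script's own arguments."""
    return [*spark_submit_args, pipelines_cli_file, *pipelines_args]


# Hypothetical inputs, as if the command line had already been split.
ordered = order_args(
    ["--conf", "spark.api.mode=connect", "--remote", "local"],
    "/absolute/path/to/pyspark/pipelines/cli.py",
    ["run"],
)
print(ordered)
```

The order matters because spark-submit treats everything after the primary resource (here, the Python script) as arguments for the application itself.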

splitArgs

splitArgs(
  args: Array[String]): (Seq[String], Seq[String])

splitArgs parses the given args (using a custom SparkSubmitArgumentsParser (Apache Spark)) and returns a pair of spark-submit- and pipelines-specific arguments.

splitArgs forces the spark.api.mode configuration property to be connect.

SparkUserAppException

splitArgs reports a SparkUserAppException when the spark.api.mode configuration property is specified explicitly on the command line with a value other than connect.

Declarative Pipelines currently only supports Spark Connect.

splitArgs uses local as the default value of the --remote command-line option.
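A minimal Python sketch of these two rules follows. It is a model under stated assumptions, not the Scala code: UserAppError is a hypothetical stand-in for SparkUserAppException, check_api_mode and forced_options are made-up names, and expressing the forced setting as an extra --conf option is an assumption about how the property might be enforced.

```python
from typing import List, Optional, Sequence

API_MODE_KEY = "spark.api.mode"


class UserAppError(Exception):
    """Hypothetical stand-in for SparkUserAppException."""


def check_api_mode(conf_values: Sequence[str]) -> None:
    """Reject an explicit spark.api.mode setting unless its value is connect."""
    prefix = API_MODE_KEY + "="
    for value in conf_values:
        if value.startswith(prefix):
            mode = value[len(prefix):].strip()
            if mode != "connect":
                raise UserAppError(f"{API_MODE_KEY} must be connect, got {mode!r}")


def forced_options(remote: Optional[str] = None) -> List[str]:
    """Options set regardless of user input; --remote falls back to local."""
    effective_remote = remote if remote is not None else "local"
    # Assumption: the forced property is passed on as a regular --conf option.
    return ["--conf", f"{API_MODE_KEY}=connect", "--remote", effective_remote]


check_api_mode(["spark.api.mode=connect"])  # accepted, no exception
print(forced_options())
```

Passing a value such as spark.api.mode=classic to check_api_mode raises the error, which mirrors the restriction that only Spark Connect is supported.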


splitArgs creates a custom SparkSubmitArgumentsParser to parse the given args.

All arguments known to the parser are considered spark-submit-specific, except the following, which are pipelines-specific:

  • --name
  • -h
  • --help

Unknown and extra arguments are pipelines-specific.
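The classification rules above can be illustrated with a small Python model. It is a sketch under assumptions, not the actual parser: KNOWN_OPTIONS is a deliberately tiny, hypothetical table (the real parser recognizes many more spark-submit options and other option forms), split_args is a made-up name, and the handling of spark.api.mode and --remote is left out.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical subset of "known" options, mapped to whether each takes a value.
KNOWN_OPTIONS: Dict[str, bool] = {
    "--master": True,
    "--conf": True,
    "--name": True,
    "--verbose": False,
    "--help": False,
    "-h": False,
}
# Known options that are nevertheless handed to the pipelines CLI.
PIPELINES_OPTIONS = {"--name", "-h", "--help"}


def split_args(args: Sequence[str]) -> Tuple[List[str], List[str]]:
    """Return (spark-submit-specific, pipelines-specific) arguments."""
    spark_submit_args: List[str] = []
    pipelines_args: List[str] = []
    i = 0
    while i < len(args):
        opt = args[i]
        if opt in KNOWN_OPTIONS:
            target = pipelines_args if opt in PIPELINES_OPTIONS else spark_submit_args
            target.append(opt)
            # Keep an option's value next to the option itself.
            if KNOWN_OPTIONS[opt] and i + 1 < len(args):
                target.append(args[i + 1])
                i += 1
        else:
            # Unknown arguments go to the pipelines CLI.
            pipelines_args.append(opt)
        i += 1
    return spark_submit_args, pipelines_args


# Placeholder command line; the argument values are illustrative only.
spark_args, cli_args = split_args(
    ["--master", "local[2]", "run", "--name", "demo", "--verbose"]
)
print(spark_args)  # ['--master', 'local[2]', '--verbose']
print(cli_args)    # ['run', '--name', 'demo']
```

Each option's value stays with its option, so --name demo ends up intact on the pipelines side even though it appears between two spark-submit options.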