SparkPipelines¶
SparkPipelines is a standalone application that the spark-pipelines shell script uses to run the pyspark/pipelines/cli.py Python script.
This somewhat convoluted way of executing the pyspark/pipelines/cli.py Python script lets Spark Declarative Pipelines use the full execution power of spark-submit (Apache Spark), including the built-in support for Spark Connect among other features, together with extra pipelines-specific command-line arguments and options.
SparkPipelines behaves similarly to executing spark-submit explicitly.
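As an illustration, the effective spark-submit command line can be sketched as a plain argument list in Python. The cli.py path and the trailing pipeline arguments below are hypothetical placeholders, not values taken from Spark:

```python
# A minimal sketch (not Spark code) of the spark-submit command line that
# spark-pipelines effectively assembles. The cli.py path and the trailing
# "run" argument are made-up placeholders for illustration only.
spark_submit_argv = [
    "spark-submit",
    "--conf", "spark.api.mode=connect",   # forced by SparkPipelines
    "--remote", "local",                  # default value of --remote
    "/path/to/pyspark/pipelines/cli.py",  # hypothetical location of cli.py
    "run",                                # pipelines-specific arguments follow the script
]
print(" ".join(spark_submit_argv))
```

Note that the spark-submit-specific arguments come before the Python script, while the pipelines-specific arguments come after it.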
Launch SparkPipelines¶
main expects the first command-line argument to be the absolute path of the pyspark/pipelines/cli.py Python script.
main runs SparkSubmit (Apache Spark) with the arguments properly ordered.
constructSparkSubmitArgs¶
constructSparkSubmitArgs splits the given args into spark-submit- and pipelines-specific ones.
constructSparkSubmitArgs returns a sequence of the spark-submit-specific arguments, followed by the given pipelinesCliFile and the pipelines-specific arguments.
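The argument ordering can be sketched as follows (a simplified Python model of the behavior described above, not the actual Scala implementation):

```python
def construct_spark_submit_args(spark_submit_args, pipelines_args, pipelines_cli_file):
    """Sketch of the ordering: spark-submit-specific arguments first,
    then the pipelines CLI script, then the pipelines-specific arguments."""
    return [*spark_submit_args, pipelines_cli_file, *pipelines_args]

args = construct_spark_submit_args(
    spark_submit_args=["--remote", "local"],
    pipelines_args=["run"],
    pipelines_cli_file="cli.py",
)
print(args)  # ['--remote', 'local', 'cli.py', 'run']
```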
splitArgs¶
splitArgs parses the given args (using a custom SparkSubmitArgumentsParser (Apache Spark)) and returns a pair of spark-submit- and pipelines-specific arguments.
splitArgs forces the spark.api.mode configuration property to be connect.
splitArgs reports a SparkUserAppException when the spark.api.mode configuration property is specified explicitly on the command line and is not connect. Declarative Pipelines currently only supports Spark Connect.
splitArgs uses local as the default value of the --remote command-line option.
splitArgs creates a custom SparkSubmitArgumentsParser to parse the given args.
All known arguments are considered spark-submit-specific except the following, which are pipelines-specific:

* --name
* -h
* --help

Unknown and extra arguments are also pipelines-specific.
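Putting the rules above together, splitArgs can be modeled in Python as follows. This is a simplified sketch of the described behavior, not the actual Scala code: the real implementation uses a custom SparkSubmitArgumentsParser, while the sketch crudely treats any other dashed option as spark-submit-specific:

```python
def split_args(args):
    """Sketch: split args into (spark-submit-specific, pipelines-specific) lists,
    forcing spark.api.mode to connect and defaulting --remote to local."""
    spark_submit_args = ["--conf", "spark.api.mode=connect"]  # always forced
    pipelines_args = []
    remote = "local"  # default of --remote
    it = iter(args)
    for arg in it:
        if arg in ("--name", "-h", "--help"):
            # these known options are pipelines-specific
            pipelines_args.append(arg)
            if arg == "--name":
                pipelines_args.append(next(it))
        elif arg == "--remote":
            remote = next(it)
        elif arg == "--conf":
            conf = next(it)
            if conf.startswith("spark.api.mode=") and conf != "spark.api.mode=connect":
                # the real code reports a SparkUserAppException here
                raise ValueError("Declarative Pipelines currently only supports Spark Connect")
            if not conf.startswith("spark.api.mode="):
                spark_submit_args += ["--conf", conf]
        elif arg.startswith("-"):
            # crude stand-in for "any other known argument is spark-submit-specific"
            spark_submit_args.append(arg)
        else:
            # unknown and extra arguments are pipelines-specific
            pipelines_args.append(arg)
    spark_submit_args += ["--remote", remote]
    return spark_submit_args, pipelines_args

print(split_args(["run", "--name", "p1"]))
```

For example, `split_args(["run", "--name", "p1"])` routes both `run` and `--name p1` to the pipelines side, while the forced `spark.api.mode=connect` and the defaulted `--remote local` end up on the spark-submit side.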