PythonRunner

PythonRunner is a command-line application that launches a Python application in a separate process, alongside PythonRunner's own JVM process, which hosts the Apache Spark services.

PythonRunner and Python Process

PythonRunner can be launched using the spark-submit shell script (Spark Core).

PythonRunner executes the Python executable (with the PySpark application and arguments) as a subprocess that is expected to connect back to the JVM to access Spark services.

Uh-oh, there are two PythonRunners 🙄

This page is about org.apache.spark.deploy.PythonRunner. It is not to be confused with the other PythonRunner, org.apache.spark.api.python.PythonRunner.

Arguments

PythonRunner accepts the following command-line arguments (in this order):

  1. Main Python file (pythonFile)
  2. Extra Python files (pyFiles)
  3. PySpark application arguments, if any
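
The argument layout above can be sketched as follows. This is a minimal illustration rather than Spark's own code: RunnerArgs and RunnerArgsParser are hypothetical names, and the sketch assumes that pyFiles arrives as a single, possibly empty, comma-separated string.

```scala
// Hypothetical holder for the three positional arguments.
final case class RunnerArgs(
    pythonFile: String,
    pyFiles: Seq[String],
    appArgs: Seq[String])

object RunnerArgsParser {
  // Assumption: args(1) is one comma-separated string of extra Python files.
  def parse(args: Array[String]): RunnerArgs = {
    require(args.length >= 2, "expected: <pythonFile> <pyFiles> [app args...]")
    val pyFiles = args(1).split(",").map(_.trim).filter(_.nonEmpty).toSeq
    // Everything after the first two arguments belongs to the PySpark application.
    RunnerArgs(args(0), pyFiles, args.drop(2).toSeq)
  }
}
```

Anything after the second argument is handed over to the PySpark application untouched, which is why the sketch does not try to interpret it.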

Python Executable

PythonRunner determines the Python executable used to launch a PySpark application by taking the first of the following that is defined (in order of precedence):

  1. spark.pyspark.driver.python configuration property
  2. spark.pyspark.python configuration property
  3. PYSPARK_DRIVER_PYTHON environment variable
  4. PYSPARK_PYTHON environment variable
  5. python3
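
The precedence above amounts to a chain of fallbacks. The following sketch models it with plain maps standing in for the Spark configuration and the process environment; PythonExecSketch and resolve are hypothetical names used for illustration only.

```scala
object PythonExecSketch {
  // conf stands in for the Spark configuration, env for the process environment.
  def resolve(conf: Map[String, String], env: Map[String, String]): String =
    conf.get("spark.pyspark.driver.python")
      .orElse(conf.get("spark.pyspark.python"))
      .orElse(env.get("PYSPARK_DRIVER_PYTHON"))
      .orElse(env.get("PYSPARK_PYTHON"))
      .getOrElse("python3") // the default when nothing is defined
}
```

Note that, in this ordering, both configuration properties win over both environment variables, so setting PYSPARK_DRIVER_PYTHON has no effect when spark.pyspark.python is defined.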

Environment Variables

PythonRunner defines the following environment variables to configure the PySpark application's execution environment.

Environment Variable      Value
PYTHONPATH                Local paths of the formatted pyFiles and sparkPythonPath, followed by the existing PYTHONPATH (joined with the platform path separator)
PYTHONUNBUFFERED          YES
PYSPARK_GATEWAY_PORT      The listening port of the started Py4JServer
PYSPARK_GATEWAY_SECRET    The secret of the started Py4JServer
PYSPARK_PYTHON            spark.pyspark.python if defined
PYTHONHASHSEED            PYTHONHASHSEED env var if defined
OMP_NUM_THREADS           spark.driver.cores if defined (unless OMP_NUM_THREADS is already set for the driver in Spark on k8s, YARN or Mesos)
SPARK_REMOTE              spark.remote if defined
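
As a rough sketch of how such an environment could be assembled, the following builds the variables as an immutable map. PythonEnvSketch, build and all parameter names are hypothetical. The ompSetByClusterManager flag is an assumption that stands in for the check of the cluster-manager-specific settings, which this sketch does not model.

```scala
import java.io.File

object PythonEnvSketch {
  def build(
      pathElements: Seq[String],       // formatted pyFiles followed by sparkPythonPath
      gatewayPort: Int,
      gatewaySecret: String,
      conf: Map[String, String],       // stand-in for the Spark configuration
      parentEnv: Map[String, String],  // stand-in for the JVM's own environment
      ompSetByClusterManager: Boolean): Map[String, String] = {
    // The existing PYTHONPATH (if any) goes last; empty entries are dropped.
    val pythonPath = (pathElements ++ parentEnv.get("PYTHONPATH").toSeq)
      .filter(_.nonEmpty)
      .mkString(File.pathSeparator)

    val always = Map(
      "PYTHONPATH" -> pythonPath,
      "PYTHONUNBUFFERED" -> "YES",
      "PYSPARK_GATEWAY_PORT" -> gatewayPort.toString,
      "PYSPARK_GATEWAY_SECRET" -> gatewaySecret)

    // Variables that are only set when their source is defined.
    val optional = Seq(
      conf.get("spark.pyspark.python").map("PYSPARK_PYTHON" -> _),
      parentEnv.get("PYTHONHASHSEED").map("PYTHONHASHSEED" -> _),
      conf.get("spark.driver.cores")
        .filterNot(_ => ompSetByClusterManager)
        .map("OMP_NUM_THREADS" -> _),
      conf.get("spark.remote").map("SPARK_REMOTE" -> _)
    ).flatten

    always ++ optional
  }
}
```

On the Python side, the launched application is expected to read PYSPARK_GATEWAY_PORT and PYSPARK_GATEWAY_SECRET to connect back to the JVM.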

Launching Application

main(
  args: Array[String]): Unit

main reads the main Python file, the extra Python files and the PySpark application arguments from the given args.

main determines the Python executable to launch the PySpark application (based on configuration properties and environment variables).

main creates a Py4JServer that is immediately started (on a daemon py4j-gateway-init thread). main waits until the Py4JServer has started.
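
The start-and-wait step can be sketched with a plain Java thread. This is an illustration under assumptions, not Spark's code: GatewayInitSketch is a hypothetical name, the start function stands in for starting the Py4JServer, and joining the thread is just one way of waiting for the start to complete.

```scala
object GatewayInitSketch {
  // Runs `start` on a daemon thread named py4j-gateway-init
  // and blocks the caller until it has returned.
  def startAndWait(start: () => Unit): Unit = {
    val thread = new Thread(new Runnable {
      def run(): Unit = start()
    })
    thread.setName("py4j-gateway-init")
    thread.setDaemon(true) // a daemon thread does not keep the JVM alive
    thread.start()
    thread.join()
  }
}
```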

main starts a Python process using the Python executable and the environment variables.

main blocks until the Python process finishes and then requests the Py4JServer to shut down.
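
The launch-and-wait part of this flow can be approximated with java.lang.ProcessBuilder. In the sketch below, LaunchSketch and runAndWait are hypothetical names, shutdownGateway stands in for the request to stop the Py4JServer, and inheritIO is a simplification chosen for brevity (how the child's output is actually forwarded is not covered here).

```scala
object LaunchSketch {
  // Starts `command` with `extraEnv` added to the inherited environment,
  // waits for it to exit, and always runs `shutdownGateway` afterwards.
  def runAndWait(command: Seq[String], extraEnv: Map[String, String])(
      shutdownGateway: () => Unit): Int = {
    val builder = new ProcessBuilder(command: _*)
    extraEnv.foreach { case (key, value) => builder.environment().put(key, value) }
    builder.inheritIO() // simplification: the child shares the parent's stdin/stdout/stderr
    try {
      val process = builder.start()
      process.waitFor() // blocks until the child exits and returns its exit code
    } finally {
      shutdownGateway()
    }
  }
}
```

A caller would pass the Python executable, the main Python file and the application arguments as the command, and could use the returned exit code to report a failed PySpark application.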

Demo

Demo: Executing PySpark Applications Using spark-submit