PythonRunner¶
PythonRunner is a command-line application that launches a Python application in a separate process (alongside the JVM process of PythonRunner, which hosts the Apache Spark services).

PythonRunner can be launched using the spark-submit shell script (Spark Core).
PythonRunner executes the Python executable (with the PySpark application and arguments) as a subprocess that is expected to connect back to the JVM to access Spark services.
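To make "connect back to the JVM" concrete, here is a minimal sketch of what the Python side might do first: read the gateway coordinates from the PYSPARK_GATEWAY_PORT and PYSPARK_GATEWAY_SECRET environment variables (described under Environment Variables on this page). The function name read_gateway_settings is hypothetical, and the actual connection step (done through Py4J in PySpark) is deliberately left out.

```python
import os


def read_gateway_settings(env=os.environ):
    """Return (port, secret) for the JVM gateway, or None when not launched by a runner.

    Hypothetical helper: it only reads the two variables named on this page
    and does not open any connection.
    """
    port = env.get("PYSPARK_GATEWAY_PORT")
    secret = env.get("PYSPARK_GATEWAY_SECRET")
    if port is None or secret is None:
        return None
    return int(port), secret
```

Returning None when either variable is missing lets the caller distinguish "started by PythonRunner" from "started standalone".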
Uh-oh, there are two PythonRunners 🙄
This page is about org.apache.spark.deploy.PythonRunner. There is another PythonRunner (org.apache.spark.api.python.PythonRunner) that is not covered here.
Arguments¶
PythonRunner accepts the following command-line arguments (in that order):
- Main Python file (pythonFile)
- Extra Python files (pyFiles)
- PySpark application arguments, if any
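The positional layout above can be sketched as a small splitter. This is an illustration only: split_runner_args is a hypothetical name, and treating pyFiles as a single comma-separated string is an assumption of this sketch, not something stated on this page.

```python
def split_runner_args(args):
    """Split runner-style arguments into (python_file, py_files, app_args)."""
    if len(args) < 2:
        raise ValueError("expected at least <pythonFile> and <pyFiles>")
    # The first two positions are fixed; everything after belongs to the application.
    python_file, py_files, *app_args = args
    # Assumption: pyFiles is one comma-separated string; empty entries are dropped.
    extra_files = [path for path in py_files.split(",") if path]
    return python_file, extra_files, app_args
```

For example, `split_runner_args(["app.py", "deps.zip,util.py", "--date", "2024-01-01"])` yields the main file, two extra files, and two application arguments.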
Python Executable¶
PythonRunner determines the Python executable used to launch a PySpark application from the following (in order of precedence):
- spark.pyspark.driver.python configuration property
- spark.pyspark.python configuration property
- PYSPARK_DRIVER_PYTHON environment variable
- PYSPARK_PYTHON environment variable
- python3 (the fallback when none of the above is defined)
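The precedence above amounts to "first defined value wins, else python3". A minimal sketch, assuming the configuration and the environment are available as plain dictionaries (resolve_python_exec is a hypothetical name):

```python
def resolve_python_exec(conf, env):
    """Pick the Python executable following the precedence list above.

    conf and env are plain dicts standing in for the Spark configuration
    and the process environment (an assumption of this sketch).
    """
    candidates = (
        conf.get("spark.pyspark.driver.python"),
        conf.get("spark.pyspark.python"),
        env.get("PYSPARK_DRIVER_PYTHON"),
        env.get("PYSPARK_PYTHON"),
    )
    # The first defined candidate wins; python3 is the fallback.
    return next((c for c in candidates if c is not None), "python3")
```

Note that both configuration properties outrank both environment variables, so a value in spark.pyspark.python overrides PYSPARK_DRIVER_PYTHON.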
Environment Variables¶
PythonRunner defines the following environment variables to configure the PySpark application's execution environment.
| Environment Variable | Value |
|---|---|
| PYTHONPATH | List of local paths with formatted pyFiles and sparkPythonPath, followed by the existing PYTHONPATH, joined with the platform path separator (File.pathSeparator) |
| PYTHONUNBUFFERED | YES |
| PYSPARK_GATEWAY_PORT | The listening port of the started Py4JServer |
| PYSPARK_GATEWAY_SECRET | The secret of the started Py4JServer |
| PYSPARK_PYTHON | spark.pyspark.python if defined |
| PYTHONHASHSEED | PYTHONHASHSEED env var if defined |
| OMP_NUM_THREADS | spark.driver.cores (unless OMP_NUM_THREADS is already defined for Spark on k8s, YARN or Mesos) |
| SPARK_REMOTE | spark.remote if defined |
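The conditional rows of the table can be sketched as follows. This is a simplified illustration, not the actual implementation: build_child_env and the omp_set_by_cluster_manager flag are hypothetical, the configuration is modeled as a plain dictionary, OMP_NUM_THREADS is set only when spark.driver.cores is present (an assumption), and PYTHONPATH is left out for brevity.

```python
def build_child_env(conf, parent_env, gateway_port, gateway_secret,
                    omp_set_by_cluster_manager=False):
    """Assemble the extra environment variables for the Python process."""
    env = {
        "PYTHONUNBUFFERED": "YES",
        "PYSPARK_GATEWAY_PORT": str(gateway_port),
        "PYSPARK_GATEWAY_SECRET": gateway_secret,
    }
    # These entries are copied only when their source is defined.
    if "spark.pyspark.python" in conf:
        env["PYSPARK_PYTHON"] = conf["spark.pyspark.python"]
    if "PYTHONHASHSEED" in parent_env:
        env["PYTHONHASHSEED"] = parent_env["PYTHONHASHSEED"]
    if "spark.remote" in conf:
        env["SPARK_REMOTE"] = conf["spark.remote"]
    # Hypothetical flag: True when the cluster manager already defines OMP_NUM_THREADS.
    if not omp_set_by_cluster_manager and "spark.driver.cores" in conf:
        env["OMP_NUM_THREADS"] = conf["spark.driver.cores"]
    return env
```

Keeping the function pure (inputs in, dictionary out) makes each row of the table easy to check in isolation.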
Launching Application¶
main(args: Array[String]): Unit
main takes the arguments (from the given args).
main determines the Python executable to launch the PySpark application (based on configuration properties and environment variables).
main creates a Py4JServer that is immediately started (on a daemon py4j-gateway-init thread). main waits until the Py4JServer has started.
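The start-then-wait pattern can be illustrated with Python's threading module. This is an analogy rather than the JVM code: start_and_wait is a hypothetical name, and start_server is a stand-in callable for whatever starts the gateway server.

```python
import threading


def start_and_wait(start_server):
    """Run start_server on a daemon thread and block until it returns."""
    thread = threading.Thread(
        target=start_server,
        name="py4j-gateway-init",  # thread name taken from the description above
        daemon=True,
    )
    thread.start()
    # Joining right away makes the caller wait until startup has completed.
    thread.join()
    return thread
```

Because the thread is joined immediately, the caller only proceeds once the startup callable has finished, which mirrors "main waits until the Py4JServer has started".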
main starts a Python process using the Python executable and the environment variables.
main blocks until the Python process finishes. Once the process exits, main requests the Py4JServer to shut down.
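The last two steps (start the process, wait, then shut the gateway down) can be sketched with Python's subprocess module. This is an analogy, not the Scala implementation: launch_and_wait is a hypothetical name, shutdown_gateway is a stand-in callback for the server shutdown, and running the shutdown in a finally clause is a choice of this sketch.

```python
import os
import subprocess


def launch_and_wait(command, extra_env, shutdown_gateway):
    """Run command with extra environment variables and return its exit code.

    command is a list such as [python_exec, python_file, *app_args].
    """
    env = dict(os.environ)
    env.update(extra_env)  # runner-defined variables are layered on top
    try:
        process = subprocess.Popen(command, env=env)
        return process.wait()  # block until the child exits
    finally:
        # Sketch choice: shut the gateway down even if launching fails.
        shutdown_gateway()
```

The child inherits the parent's environment plus the runner-defined variables, so it can locate the gateway without any extra command-line arguments.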