PythonRunner¶
PythonRunner is a command-line application that launches a separate process to run a Python application (alongside the JVM process of PythonRunner with Apache Spark services).
PythonRunner can be launched using the spark-submit shell script (Spark Core).
PythonRunner executes the Python executable (with the PySpark application and arguments) as a subprocess that is expected to connect back to the JVM to access Spark services.
Uh-oh, there are two PythonRunners 🙄
This page is about org.apache.spark.deploy.PythonRunner. There is another, unrelated PythonRunner (org.apache.spark.api.python.PythonRunner).
Arguments¶
PythonRunner accepts the following command-line arguments (in that order):
- Main Python file (pythonFile)
- Extra Python files (pyFiles)
- PySpark application arguments, if any
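
The positional layout above can be illustrated with a minimal sketch. The helper name `parse_runner_args` is hypothetical, and the sketch assumes that pyFiles arrives as a single comma-separated string with empty entries ignored; it is not the actual Scala implementation.

```python
from typing import List, Tuple


def parse_runner_args(args: List[str]) -> Tuple[str, List[str], List[str]]:
    """Split positional arguments into (python_file, py_files, app_args).

    Hypothetical helper for illustration only.
    """
    if len(args) < 2:
        raise ValueError("expected at least: <pythonFile> <pyFiles> [app args...]")
    python_file = args[0]
    # Assumption: pyFiles is one comma-separated string; drop empty entries
    py_files = [p for p in args[1].split(",") if p]
    # Everything after the first two arguments belongs to the PySpark application
    app_args = list(args[2:])
    return python_file, py_files, app_args
```

For example, `parse_runner_args(["app.py", "a.py,b.zip", "--n", "3"])` yields the main file, two extra files and two application arguments.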
Python Executable¶
PythonRunner determines the Python executable to launch a PySpark application with, based on the following (in order of precedence):
- spark.pyspark.driver.python configuration property
- spark.pyspark.python configuration property
- PYSPARK_DRIVER_PYTHON environment variable
- PYSPARK_PYTHON environment variable
- python3 (the fallback when none of the above is defined)
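
The precedence order can be sketched as a simple first-match lookup. The function name `resolve_python_exec` and the plain-dictionary inputs are assumptions for illustration; the real code reads a SparkConf and the process environment.

```python
from typing import Mapping


def resolve_python_exec(conf: Mapping[str, str], env: Mapping[str, str]) -> str:
    """Return the first defined candidate, falling back to python3.

    Hypothetical helper; `conf` stands in for Spark configuration properties
    and `env` for environment variables.
    """
    candidates = (
        conf.get("spark.pyspark.driver.python"),
        conf.get("spark.pyspark.python"),
        env.get("PYSPARK_DRIVER_PYTHON"),
        env.get("PYSPARK_PYTHON"),
    )
    for candidate in candidates:
        if candidate is not None:
            return candidate
    return "python3"
```

Note that in this sketch a configuration property always wins over an environment variable, even a driver-specific one.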
Environment Variables¶
PythonRunner defines the following environment variables to configure the PySpark application's execution environment.
Environment Variable | Value |
---|---|
PYTHONPATH | List of local paths (joined with the platform path separator) with formatted pyFiles and sparkPythonPath, followed by the existing PYTHONPATH |
PYTHONUNBUFFERED | YES |
PYSPARK_GATEWAY_PORT | The listening port of the started Py4JServer |
PYSPARK_GATEWAY_SECRET | The secret of the started Py4JServer |
PYSPARK_PYTHON | spark.pyspark.python if defined |
PYTHONHASHSEED | PYTHONHASHSEED env var if defined |
OMP_NUM_THREADS | spark.driver.cores if defined (unless OMP_NUM_THREADS is already configured for the driver on Kubernetes, YARN or Mesos) |
SPARK_REMOTE | spark.remote if defined |
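
As a rough illustration of how such an environment could be assembled, here is a minimal sketch. The function `build_python_env` and all of its parameters are hypothetical; joining PYTHONPATH entries with `os.pathsep` and passing the cluster-manager check in as a boolean are assumptions of this sketch, not a transcription of the Scala code.

```python
import os
from typing import Dict, List, Mapping


def build_python_env(
    conf: Mapping[str, str],
    parent_env: Mapping[str, str],
    py_files: List[str],
    spark_python_path: str,
    gateway_port: int,
    gateway_secret: str,
    omp_already_configured: bool = False,
) -> Dict[str, str]:
    """Assemble the variables from the table above (illustrative only)."""
    env: Dict[str, str] = {}
    # pyFiles first, then Spark's own Python path, then any existing PYTHONPATH
    path_elements = [*py_files, spark_python_path, parent_env.get("PYTHONPATH", "")]
    env["PYTHONPATH"] = os.pathsep.join(p for p in path_elements if p)
    env["PYTHONUNBUFFERED"] = "YES"
    env["PYSPARK_GATEWAY_PORT"] = str(gateway_port)
    env["PYSPARK_GATEWAY_SECRET"] = gateway_secret
    # Optional variables are only set when their source is defined
    if "spark.pyspark.python" in conf:
        env["PYSPARK_PYTHON"] = conf["spark.pyspark.python"]
    if "PYTHONHASHSEED" in parent_env:
        env["PYTHONHASHSEED"] = parent_env["PYTHONHASHSEED"]
    if not omp_already_configured and "spark.driver.cores" in conf:
        env["OMP_NUM_THREADS"] = conf["spark.driver.cores"]
    if "spark.remote" in conf:
        env["SPARK_REMOTE"] = conf["spark.remote"]
    return env
```

The resulting dictionary would then be merged into the environment of the Python subprocess.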
Launching Application¶
main(args: Array[String]): Unit
main takes the arguments (from the given args).
main determines the Python executable to launch the PySpark application (based on configuration properties and environment variables).
main creates a Py4JServer that is immediately started (on a daemon py4j-gateway-init thread). main waits until the Py4JServer has started.
main starts a Python process using the Python executable and the environment variables.
main blocks until the Python process finishes. Once it does, main requests the Py4JServer to shut down.
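
The overall flow (start a gateway on a daemon thread, wait for it, run the Python process, shut the gateway down) can be sketched as follows. `FakeGateway` and `launch` are hypothetical names: the gateway here is a stand-in that opens no sockets, so this only illustrates the ordering of the steps, not the real Py4JServer.

```python
import os
import subprocess
import sys
import threading
from typing import List, Mapping


class FakeGateway:
    """Hypothetical stand-in for Py4JServer; it opens no sockets."""

    def __init__(self) -> None:
        self.running = False

    def start(self) -> None:
        self.running = True

    def shutdown(self) -> None:
        self.running = False


def launch(python_exec: str, python_file: str, app_args: List[str],
           env: Mapping[str, str]) -> int:
    """Run the Python application as a subprocess and return its exit code."""
    gateway = FakeGateway()
    # Start the gateway on a daemon thread and wait until it has started
    init_thread = threading.Thread(
        target=gateway.start, name="py4j-gateway-init", daemon=True)
    init_thread.start()
    init_thread.join()
    try:
        # Overlay the prepared variables on top of the current environment
        child_env = {**os.environ, **env}
        process = subprocess.Popen([python_exec, python_file, *app_args], env=child_env)
        # Block until the Python process finishes
        return process.wait()
    finally:
        # Always shut the gateway down, even if launching the process fails
        gateway.shutdown()
```

A caller could use the returned exit code to decide whether the application failed, for example `sys.exit(launch(sys.executable, "app.py", [], {}))`.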