PythonRunner

PythonRunner is a command-line application to launch Python applications.

PythonRunner is used by spark-submit.

PythonRunner runs the configured Python executable as a subprocess. The Python process then connects back to the JVM, for example to access system properties.

Arguments

PythonRunner requires the following command-line arguments:

  1. Main python file (pythonFile)
  2. Extra python files (pyFiles)
  3. Application arguments
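The argument layout above can be sketched as follows. This is an illustration only, not PythonRunner's actual code: the helper name parse_runner_args and the RunnerArgs type are hypothetical, and the sketch assumes that pyFiles arrives as a single comma-separated string.

```python
from typing import List, NamedTuple


class RunnerArgs(NamedTuple):
    """Hypothetical container for the three argument groups."""
    python_file: str
    py_files: List[str]
    app_args: List[str]


def parse_runner_args(argv: List[str]) -> RunnerArgs:
    """Split positional arguments into pythonFile, pyFiles and application arguments."""
    if len(argv) < 2:
        raise ValueError("expected: <pythonFile> <pyFiles> [application arguments...]")
    python_file, py_files_raw, *app_args = argv
    # Assumption: pyFiles is one comma-separated string; empty entries are dropped.
    py_files = [path for path in py_files_raw.split(",") if path]
    return RunnerArgs(python_file, py_files, app_args)
```

Everything after the first two positions is passed through untouched as application arguments.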

main

main takes the arguments from the command line.

main determines which Python executable to use by checking the following, in order:

  1. spark.pyspark.driver.python configuration property
  2. spark.pyspark.python configuration property
  3. PYSPARK_DRIVER_PYTHON environment variable
  4. PYSPARK_PYTHON environment variable
  5. python3
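The lookup order above amounts to a first-match search with a fixed fallback. A minimal sketch, assuming the configuration properties and environment variables are available as plain mappings (the function name resolve_python_executable is hypothetical):

```python
from typing import Mapping


def resolve_python_executable(conf: Mapping[str, str], env: Mapping[str, str]) -> str:
    """Return the first defined candidate, falling back to python3."""
    candidates = (
        conf.get("spark.pyspark.driver.python"),
        conf.get("spark.pyspark.python"),
        env.get("PYSPARK_DRIVER_PYTHON"),
        env.get("PYSPARK_PYTHON"),
    )
    for candidate in candidates:
        if candidate is not None:
            return candidate
    return "python3"
```

Note that both configuration properties take precedence over both environment variables, so a stale PYSPARK_DRIVER_PYTHON in the shell does not override an explicit spark.pyspark.python setting.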

main creates a Py4JServer and starts it on a daemon thread named py4j-gateway-init.

main waits until the gateway server has started.
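The start-then-wait pattern can be sketched with a named daemon thread that the caller joins. StandInGateway below is a hypothetical stand-in that only binds a local TCP port; it is not Py4JServer and does not speak the Py4J protocol.

```python
import socket
import threading
from typing import Optional


class StandInGateway:
    """Hypothetical stand-in for Py4JServer: it only listens on a local port."""

    def __init__(self) -> None:
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.port: Optional[int] = None

    def start(self) -> None:
        self._sock.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
        self._sock.listen()
        self.port = self._sock.getsockname()[1]

    def shutdown(self) -> None:
        self._sock.close()


def start_gateway() -> StandInGateway:
    """Start the gateway on a daemon thread and block until it is ready."""
    gateway = StandInGateway()
    init_thread = threading.Thread(
        target=gateway.start, name="py4j-gateway-init", daemon=True
    )
    init_thread.start()
    init_thread.join()  # wait until the gateway has finished starting
    return gateway
```

Joining the initialization thread guarantees that the listening port is known before the Python process is launched, which matters because the port is handed to that process through its environment.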

main launches a Python process using the python executable and the following environment variables.

Environment Variable      Value
PYTHONPATH
PYTHONUNBUFFERED          YES
PYSPARK_GATEWAY_PORT      getListeningPort
PYSPARK_GATEWAY_SECRET    secret
PYSPARK_PYTHON            spark.pyspark.python if defined
PYTHONHASHSEED            PYTHONHASHSEED env var if defined
OMP_NUM_THREADS           spark.driver.cores unless defined
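One way to read the table above is as a function from the parent environment and the configuration to the child environment. The sketch below rests on several assumptions that the table does not confirm: the child environment starts as a copy of the parent's, python_path is supplied by the caller (the table gives no value for PYTHONPATH), and "unless defined" means "unless OMP_NUM_THREADS is already set". The name build_python_env is hypothetical.

```python
from typing import Dict, Mapping


def build_python_env(
    base_env: Mapping[str, str],
    conf: Mapping[str, str],
    python_path: str,
    gateway_port: int,
    gateway_secret: str,
) -> Dict[str, str]:
    """Assemble the environment for the Python subprocess."""
    # Assumption: start from a copy of the parent environment, so an existing
    # PYTHONHASHSEED is carried over without extra handling.
    env = dict(base_env)
    env["PYTHONPATH"] = python_path  # assumption: computed elsewhere by the caller
    env["PYTHONUNBUFFERED"] = "YES"
    env["PYSPARK_GATEWAY_PORT"] = str(gateway_port)
    env["PYSPARK_GATEWAY_SECRET"] = gateway_secret
    if "spark.pyspark.python" in conf:
        env["PYSPARK_PYTHON"] = conf["spark.pyspark.python"]
    # Assumption: only set OMP_NUM_THREADS when it is not already defined.
    if "OMP_NUM_THREADS" not in env and "spark.driver.cores" in conf:
        env["OMP_NUM_THREADS"] = conf["spark.driver.cores"]
    return env
```

Environment values are always strings, which is why the port number is converted before it is stored.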

main waits for the Python process to finish and then requests the Py4JServer to shut down.
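The wait-then-shut-down step can be sketched with the standard subprocess module. The function name run_and_shutdown and its shutdown callback are hypothetical; the callback stands in for whatever stops the gateway. Running the callback in a finally block is a choice made for this sketch so that the gateway is stopped even if waiting is interrupted.

```python
import os
import subprocess
from typing import Callable, Mapping, Sequence


def run_and_shutdown(
    command: Sequence[str],
    extra_env: Mapping[str, str],
    shutdown: Callable[[], None],
) -> int:
    """Run the command, wait for it to exit, then invoke the shutdown callback."""
    # Assumption: the child inherits the parent environment plus the extra variables.
    child_env = {**os.environ, **extra_env}
    process = subprocess.Popen(list(command), env=child_env)
    try:
        return process.wait()  # blocks until the Python process finishes
    finally:
        shutdown()
```

The return value is the exit code of the child process, which the caller can use to decide how to report failure.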

Demo

./bin/spark-class org.apache.spark.deploy.PythonRunner

Last update: 2021-03-05