
worker.py

worker.py is a Python module in the pyspark package.

from pyspark import worker

Entry Point

Top-Level Code Environment

If the module is executed in the top-level code environment (and not initialized from an import statement), its __name__ is set to the string __main__.

Sometimes "top-level code" is called an entry point to the application.

Learn more in __main__ — Top-level code environment in the Python documentation.
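
A minimal, standalone sketch (not part of worker.py; my_module is a hypothetical name) of the idiom that the top-level code environment enables:

# my_module.py
def run():
    print("doing the actual work")

if __name__ == "__main__":
    # Executed only when the module is run directly (e.g., python3 -m my_module),
    # not when it is imported.
    run()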

When executed in the top-level code environment (e.g., python3 -m pyspark.worker), worker.py reads the following environment variables:

Environment Variable            Description
PYTHON_WORKER_FACTORY_PORT      Port the JVM listens to
PYTHON_WORKER_FACTORY_SECRET    Authorization secret

worker.py uses local_connect_and_auth to connect back to the JVM (that gives a sock_file).

worker.py uses write_int to send the PID of the Python process over the sock_file.

In the end, worker.py calls main (with the sock_file used as both the input and output file).
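
Putting the above together, the top-level block of worker.py looks roughly as follows. This is a sketch based on the description above, assuming local_connect_and_auth comes from pyspark.java_gateway and write_int from pyspark.serializers; exact details vary across Spark versions.

import os

from pyspark.java_gateway import local_connect_and_auth
from pyspark.serializers import write_int

if __name__ == "__main__":
    # Where and how to connect back to the JVM.
    java_port = int(os.environ["PYTHON_WORKER_FACTORY_PORT"])
    auth_secret = os.environ["PYTHON_WORKER_FACTORY_SECRET"]
    # Connect and authenticate; gives a socket file object (sock_file).
    (sock_file, _) = local_connect_and_auth(java_port, auth_secret)
    # Report this Python worker's PID to the JVM.
    write_int(os.getpid(), sock_file)
    sock_file.flush()
    # Hand the socket over to main as both the input and output file.
    main(sock_file, sock_file)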

main

main(
    infile,
    outfile)

main reads the PYTHON_FAULTHANDLER_DIR environment variable.
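
A hedged sketch of what PYTHON_FAULTHANDLER_DIR can be used for: when set, faulthandler is enabled with a per-PID log file in that directory (exact behaviour may differ across Spark versions).

import faulthandler
import os

faulthandler_log_path = os.environ.get("PYTHON_FAULTHANDLER_DIR", None)
if faulthandler_log_path:
    # One crash-log file per worker process.
    faulthandler_log_path = os.path.join(faulthandler_log_path, str(os.getpid()))
    faulthandler_log_file = open(faulthandler_log_path, "w")
    # Dump the Python traceback on segfaults and similar fatal errors.
    faulthandler.enable(file=faulthandler_log_file)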

main then performs a number of initializations.

FIXME Review the initializations

main calls read_udfs, which gives the following:

  • func
  • profiler
  • deserializer
  • serializer

main then requests the deserializer to load_stream from the given infile and executes func (with the split_index and the deserialized stream), as sketched below.
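
A simplified sketch of this step, using the names from this section; the real main also wires in profiling, metrics and error handling before writing the results back through the serializer.

# What main does with read_udfs' output, in essence (a sketch).
func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)

# Deserialize the incoming data from the JVM...
iterator = deserializer.load_stream(infile)
# ...apply the (composed) UDF to the partition...
out_iterator = func(split_index, iterator)
# ...and write the results back to the JVM through outfile.
serializer.dump_stream(out_iterator, outfile)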

main does a lot of post-processing.

FIXME Review the post-processing

read_udfs

read_udfs(
    pickleSer,
    infile,
    eval_type)

read_udfs...FIXME

read_single_udf

read_single_udf(
    pickleSer,
    infile,
    eval_type,
    runner_conf,
    udf_index)

read_single_udf...FIXME