PySpark — Python on Apache Spark
PySpark is the Python API (frontend) of Apache Spark.
How It Works
When a Python script is executed using the spark-submit shell script (Spark Core), PythonRunner is started (and the --verbose option shows it as the Main class).
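For the demo below, assume hello_pyspark.py is a minimal PySpark application such as the following (hypothetical content; spark-submit derives the app name from the file name, as the parsed arguments show):
# hello_pyspark.py -- a hypothetical, minimal PySpark application
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    spark.range(5).show()  # a trivial job to exercise the session
    spark.stop()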
$ ./bin/spark-submit --verbose hello_pyspark.py
Using properties file: null
Parsed arguments:
master local[*]
...
primaryResource file:/Users/jacek/dev/oss/spark/hello_pyspark.py
name hello_pyspark.py
...
Main class:
org.apache.spark.deploy.PythonRunner
Arguments:
file:/Users/jacek/dev/oss/spark/hello_pyspark.py
null
Spark config:
(spark.app.name,hello_pyspark.py)
(spark.app.submitTime,1684188276759)
(spark.master,local[*])
(spark.submit.deployMode,client)
(spark.submit.pyFiles,)
...
The spark-submit execution above could be translated to the following:
./bin/spark-class org.apache.spark.deploy.PythonRunner hello_pyspark.py ""
PythonRunner then launches a Py4JServer (on a py4j-gateway-init daemon thread) and waits until it is started.
Finally, PythonRunner launches a Python process (to run the Python script) and waits until the process finishes (successfully or not).
$ ps -o pid,command | grep python3 | grep -v grep
12607 python3 /Users/jacek/dev/oss/spark/hello_pyspark.py
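Conceptually (this is a Python sketch; PythonRunner itself is Scala code), launching the script and waiting for it boils down to:
import subprocess

# Start the Python interpreter on the script and block until it exits
proc = subprocess.Popen(["python3", "hello_pyspark.py"])
exit_code = proc.wait()  # 0 on success, non-zero otherwise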
lsof for Open Files and TCP Inter-Process Connections
Use the lsof command to have a look at the open files and TCP connections of the Python process:
sudo lsof -p [pid of the python process]
Python 3.8 and Later
The minimum version of Python is 3.8.
Python 3.7 Deprecated
Python 3.7 support is deprecated in Spark 3.4.
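A sketch of what such a version floor means at runtime (illustrative only, not PySpark's actual check):
import sys

# Fail fast on an unsupported interpreter
if sys.version_info < (3, 8):
    raise RuntimeError("PySpark requires Python 3.8 or later")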
shell.py
The pyspark shell defines the PYTHONSTARTUP environment variable to execute shell.py before the first prompt is displayed in Python interactive mode.
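PYTHONSTARTUP is a standard feature of CPython's interactive mode; a quick way to see the mechanism outside PySpark (hypothetical startup.py):
# startup.py -- executed before the first prompt when started as:
#   PYTHONSTARTUP=startup.py python3
print("startup.py executed")
banner = "ready"  # names defined here are visible at the prompt
shell.py plays exactly this role for the pyspark shell.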
Py4J
java_gateway uses Py4J - A Bridge between Python and Java:
Py4J enables Python programs running in a Python interpreter to dynamically access Java objects in a Java Virtual Machine. Methods are called as if the Java objects resided in the Python interpreter and Java collections can be accessed through standard Python collection methods. Py4J also enables Java programs to call back Python objects.
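You can see the bridge at work from a live PySpark session; _jvm is an internal attribute, used here purely for illustration:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
jvm = spark.sparkContext._jvm  # Py4J's view of the JVM (internal API)

# Call a JVM method as if it were a local Python object
millis = jvm.java.lang.System.currentTimeMillis()
print(millis)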
pyspark.sql Package
pyspark.sql is a Python package for Spark SQL.
from pyspark.sql import *
Tip
Learn more about Modules and Packages in Python in The Python Tutorial.
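For example, the wildcard import brings in names such as SparkSession and Row (exactly which names is governed by __all__, as described below):
from pyspark.sql import *

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([Row(id=1, name="one")])
df.show()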
__init__.py
The __init__.py files are required to make Python treat directories containing the file as packages.
Per 6.4.1. Importing * From a Package:
The import statement uses the following convention: if a package's __init__.py code defines a list named __all__, it is taken to be the list of module names that should be imported when from package import * is encountered.
Per Public and Internal Interfaces in PEP 8 -- Style Guide for Python Code:
To better support introspection, modules should explicitly declare the names in their public API using the __all__ attribute.
From python/pyspark/sql/__init__.py:
__all__ = [
'SparkSession', 'SQLContext', 'HiveContext', 'UDFRegistration',
'DataFrame', 'GroupedData', 'Column', 'Catalog', 'Row',
'DataFrameNaFunctions', 'DataFrameStatFunctions', 'Window', 'WindowSpec',
'DataFrameReader', 'DataFrameWriter', 'PandasCogroupedOps'
]
pandas
The minimum version of pandas is 0.23.2 (and PandasConversionMixin asserts that).
import pandas as pd
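The assertion fires when a DataFrame is actually converted to pandas, for example:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# toPandas (from PandasConversionMixin) is where the minimum-version
# check happens
pdf = spark.range(3).toPandas()
print(type(pdf))  # <class 'pandas.core.frame.DataFrame'>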
pyarrow
The minimum version of PyArrow is 1.0.0 (and PandasConversionMixin asserts that).
import pyarrow
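PyArrow only comes into play when Arrow-based conversion is enabled; a sketch of turning it on for the pandas conversion above:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With this flag on, toPandas transfers data through Arrow
# (and the PyArrow minimum-version check applies)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
pdf = spark.range(3).toPandas()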
Python Mixins
From 8.7. Class definitions:
classdef ::= [decorators] "class" classname [inheritance] ":" suite
The inheritance list usually gives a list of base classes.
PySpark uses mixins (e.g., PandasConversionMixin mentioned above).
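A generic sketch of the pattern (not PySpark code): a mixin contributes behavior through the inheritance list without being a standalone base class.
class CountingMixin:
    """Adds simple call counting to any class that lists it as a base."""
    def record(self):
        self._count = getattr(self, "_count", 0) + 1
        return self._count

class Job(CountingMixin):  # the inheritance list pulls in the mixin
    pass

job = Job()
assert job.record() == 1
assert job.record() == 2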