PySpark

PySpark is the Python frontend for Apache Spark.

shell.py

The pyspark shell script sets the PYTHONSTARTUP environment variable so that shell.py is executed before the first prompt is displayed in Python interactive mode.
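The PYTHONSTARTUP mechanism can be tried without Spark. The following is a minimal sketch, not the pyspark launcher itself: the helper run_with_startup and the startup file contents are hypothetical, and the interpreter is started with -i so that it stays in interactive mode even though its input comes from a pipe.

```python
import os
import subprocess
import sys
import tempfile


def run_with_startup(startup_code: str, repl_input: str) -> str:
    """Run an interactive Python session with PYTHONSTARTUP set to a temp file."""
    with tempfile.TemporaryDirectory() as tmp:
        startup_path = os.path.join(tmp, "startup.py")
        with open(startup_path, "w") as f:
            f.write(startup_code)

        # Copy the current environment and point PYTHONSTARTUP at our file.
        env = dict(os.environ, PYTHONSTARTUP=startup_path)

        # -i forces interactive mode although stdin is a pipe, so the
        # startup file is executed before the piped input is evaluated.
        result = subprocess.run(
            [sys.executable, "-i"],
            input=repl_input,
            env=env,
            capture_output=True,
            text=True,
        )
        return result.stdout


# The startup file defines a name; the "interactive" input then uses it.
output = run_with_startup(
    'greeting = "defined by the startup file"\n',
    "print(greeting)\n",
)
print(output)
```

The name greeting is available at the prompt without any import, which is the same effect that lets shell.py prepare objects for the pyspark shell before the first prompt appears.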

Py4J

The java_gateway module uses Py4J (A Bridge between Python and Java):

Py4J enables Python programs running in a Python interpreter to dynamically access Java objects in a Java Virtual Machine. Methods are called as if the Java objects resided in the Python interpreter and Java collections can be accessed through standard Python collection methods. Py4J also enables Java programs to call back Python objects.

pyspark.sql Package

pyspark.sql is a Python package for Spark SQL.

from pyspark.sql import *

Tip

Learn more about Modules and Packages in Python in The Python Tutorial.

__init__.py

The __init__.py files are required to make Python treat directories containing the file as packages.

Per 6.4.1. Importing * From a Package:

The import statement uses the following convention: if a package's __init__.py code defines a list named __all__, it is taken to be the list of module names that should be imported when from package import * is encountered.
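To see the convention in action, the sketch below builds a throwaway package on disk and runs a star-import against it. The package name demo_pkg and the classes Session and Internal are hypothetical and only serve the illustration.

```python
import importlib
import os
import sys
import tempfile

# Create a temporary package: demo_pkg/__init__.py and demo_pkg/session.py
tmp = tempfile.mkdtemp()
pkg_dir = os.path.join(tmp, "demo_pkg")
os.makedirs(pkg_dir)

with open(os.path.join(pkg_dir, "session.py"), "w") as f:
    f.write("class Session:\n    pass\n\nclass Internal:\n    pass\n")

with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write(
        "from demo_pkg.session import Session, Internal\n"
        "__all__ = ['Session']\n"  # only Session is exported by import *
    )

sys.path.insert(0, tmp)
importlib.invalidate_caches()

# Run the star-import in a fresh namespace and collect what it brought in.
namespace = {}
exec("from demo_pkg import *", namespace)
imported = sorted(name for name in namespace if not name.startswith("__"))
print(imported)
```

Internal is importable explicitly (from demo_pkg import Internal), but the star-import skips it because it is not listed in __all__.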

Per Public and Internal Interfaces in PEP 8 -- Style Guide for Python Code:

To better support introspection, modules should explicitly declare the names in their public API using the __all__ attribute.

From python/pyspark/sql/__init__.py:

__all__ = [
    'SparkSession', 'SQLContext', 'HiveContext', 'UDFRegistration',
    'DataFrame', 'GroupedData', 'Column', 'Catalog', 'Row',
    'DataFrameNaFunctions', 'DataFrameStatFunctions', 'Window', 'WindowSpec',
    'DataFrameReader', 'DataFrameWriter', 'PandasCogroupedOps'
]

pandas

The minimum supported version of pandas is 0.23.2 (PandasConversionMixin asserts that).

import pandas as pd

pyarrow

The minimum supported version of PyArrow is 1.0.0 (PandasConversionMixin asserts that).

import pyarrow
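A minimum-version assertion such as the ones described for pandas and PyArrow could look roughly like the sketch below. This is not PySpark's code: parse_version and require_minimum_version are hypothetical helpers, and the installed versions are passed in as strings so the example runs without either library.

```python
def parse_version(version: str) -> tuple:
    """Turn '1.0.0' into (1, 0, 0); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def require_minimum_version(package: str, installed: str, minimum: str) -> None:
    """Raise ImportError when the installed version is below the minimum."""
    if parse_version(installed) < parse_version(minimum):
        raise ImportError(
            f"{package} >= {minimum} must be installed; "
            f"however, your version was {installed}."
        )


# A recent pandas passes silently.
require_minimum_version("pandas", "1.2.3", "0.23.2")

# An old PyArrow is rejected.
try:
    require_minimum_version("pyarrow", "0.17.1", "1.0.0")
except ImportError as error:
    message = str(error)
    print(message)
```

In real code the installed value would come from the library itself, for example pd.__version__ or pyarrow.__version__.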

Python Mixins

From 8.7. Class definitions:

classdef ::= [decorators] "class" classname [inheritance] ":" suite

The inheritance list usually gives a list of base classes
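A mixin is a small class that contributes methods through the inheritance list and is not meant to be instantiated on its own. The following sketch uses hypothetical class names (ToDictMixin, DescribeMixin, Record), not PySpark's own classes.

```python
class ToDictMixin:
    """Adds to_dict() to any class that stores state in instance attributes."""

    def to_dict(self) -> dict:
        return dict(vars(self))


class DescribeMixin:
    """Adds describe(), which reports the class name and its field names."""

    def describe(self) -> str:
        return f"{type(self).__name__} with fields {sorted(vars(self))}"


class Record(ToDictMixin, DescribeMixin):
    """The inheritance list pulls in behavior from both mixins."""

    def __init__(self, name: str, age: int):
        self.name = name
        self.age = age


record = Record("Alice", 30)
print(record.to_dict())
print(record.describe())

# The method resolution order shows where each method is looked up.
mro_names = [cls.__name__ for cls in Record.__mro__]
print(mro_names)
```

Record defines no methods besides __init__, yet it offers both to_dict() and describe() because Python searches the base classes in the order given by the method resolution order.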

PySpark uses mixins:


Last update: 2021-02-25