pandas API on Spark¶
pandas API on Spark (the pyspark.pandas package) has been added to PySpark so that existing pandas code can run on Spark clusters with no changes other than the import.
There are two related PySpark packages with pandas support:

- pyspark.pandas
- pyspark.sql.pandas
Spark Structured Streaming¶
pandas API on Spark does not support Spark Structured Streaming (streaming queries).
Modules¶
pandas API on Spark requires the following modules to be installed:
| Module | Minimum Version |
|---|---|
| pandas | 1.0.5 |
| PyArrow | 1.0.0 |
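A quick way to confirm that the installed modules meet these requirements is to check their version strings (a minimal sketch; assumes both modules are importable):

import pandas
import pyarrow

# Both modules expose their version as a string;
# compare against the minimums in the table above.
print(pandas.__version__)   # expect 1.0.5 or later
print(pyarrow.__version__)  # expect 1.0.0 or later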
PYARROW_IGNORE_TIMEZONE¶
For PyArrow 2.0.0 and above, pandas API on Spark requires the PYARROW_IGNORE_TIMEZONE environment variable to be set to 1 (on both the driver and the executors).
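A minimal sketch of one way to satisfy this, assuming a standard SparkSession setup: export the variable on the driver before pyspark.pandas is imported, and propagate it to the executors with Spark's generic spark.executorEnv.[EnvironmentVariableName] configuration (not a setting specific to pandas API on Spark):

import os
from pyspark.sql import SparkSession

# Driver side: set before importing pyspark.pandas
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"

# Executor side: propagated through Spark's generic
# executor-environment configuration
spark = (
    SparkSession.builder
    .config("spark.executorEnv.PYARROW_IGNORE_TIMEZONE", "1")
    .getOrCreate()
)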
PYSPARK_PANDAS_USAGE_LOGGER¶
pandas API on Spark uses the PYSPARK_PANDAS_USAGE_LOGGER (formerly KOALAS_USAGE_LOGGER) environment variable to specify a usage logger.
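As a minimal sketch, the variable names a module that provides the usage logger; pyspark.pandas.usage_logging.usage_logger is the example logger bundled with PySpark, which reports usage through Python's standard logging framework:

import os

# Must be set before pyspark.pandas is imported
os.environ["PYSPARK_PANDAS_USAGE_LOGGER"] = "pyspark.pandas.usage_logging.usage_logger"

import pyspark.pandas as ps  # usage logging is attached at import time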
Demo¶
# The following would be required if we used pandas
# import pandas as pd
# but we don't need it anymore 😊
# The only change is supposed to be this extra `pyspark` prefix
# in the name of the package
import pyspark.pandas as pd
pd.read_csv("people.csv")
id name
0 0 zero
1 1 one
2 2 two