GroupedData¶
GroupedData is created for the following high-level operators:
- DataFrame.groupBy
- DataFrame.rollup
- DataFrame.cube
- GroupedData.pivot
GroupedData is then used to execute aggregate functions (over groups of rows) using the agg operator:
- Built-In Aggregation Functions
- pandas UDAFs
GroupedData is a Python class with the PandasGroupedOpsMixin mixin.
GroupedData is defined in the pyspark.sql.group module.
from pyspark.sql.group import GroupedData
Creating Instance¶
GroupedData takes the following to be created:
- RelationalGroupedDataset (Spark SQL)
- DataFrame
agg¶
agg(
    self,
    *exprs: Union[Column, Dict[str, str]]) -> DataFrame
Note
Built-in aggregation functions and pandas UDAFs cannot be used together in a single agg.
agg accepts a collection of Column expressions or a single Dict[str, str] object.
agg requests the RelationalGroupedDataset to agg (Spark SQL).
In the end, agg creates a DataFrame with the agg result.