GroupedData
GroupedData is created for the following high-level operators:
GroupedData is then used to execute aggregate functions (over groups of rows) using the agg operator:
- Built-In Aggregation Functions
- pandas UDAFs
GroupedData is a Python class with the PandasGroupedOpsMixin mixin.
GroupedData is defined in the pyspark.sql.group module.
from pyspark.sql.group import GroupedData
Creating Instance
GroupedData takes the following to be created:
agg
agg(
self,
*exprs: Union[Column, Dict[str, str]]) -> DataFrame
Note
Built-in aggregation functions and pandas UDAFs cannot be used together in a single agg.
agg accepts a collection of Column expressions or a single Dict[str, str] object.
agg requests the RelationalGroupedDataset to agg (Spark SQL).
In the end, agg creates a DataFrame with the agg result.
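The argument handling described above can be modelled without Spark. This is only a simplified sketch of the idea, not PySpark's actual code: `FakeColumn` and `plan_agg` are hypothetical names invented for the illustration, and the string output merely stands in for what would be passed on to RelationalGroupedDataset.

```python
from typing import Dict, List, Union


class FakeColumn:
    """Hypothetical stand-in for pyspark.sql.Column (not part of PySpark)."""

    def __init__(self, expr: str) -> None:
        self.expr = expr


def plan_agg(*exprs: Union[FakeColumn, Dict[str, str]]) -> List[str]:
    """Return the aggregate expressions a call would hand over (assumed model)."""
    if not exprs:
        raise ValueError("exprs should not be empty")
    # A single dict maps a column name to an aggregate function name
    if len(exprs) == 1 and isinstance(exprs[0], dict):
        return [f"{func}({col})" for col, func in exprs[0].items()]
    # Otherwise every argument must be a Column-like expression
    if not all(isinstance(e, FakeColumn) for e in exprs):
        raise TypeError("all exprs should be Column")
    return [e.expr for e in exprs]


print(plan_agg({"value": "sum"}))
print(plan_agg(FakeColumn("max(value)"), FakeColumn("min(value)")))
```

In this model, mixing a dict with Column-like arguments is rejected rather than silently merged.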