RelationalGroupedDataset
RelationalGroupedDataset is a high-level API for Untyped (Row-based) Grouping in Aggregation Queries to calculate aggregates over groups of rows in a DataFrame.
Note: KeyValueGroupedDataset is used for typed aggregates over groups of custom Scala objects (not Rows).
RelationalGroupedDataset is the result of executing Dataset high-level grouping operators such as groupBy, rollup, and cube.
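As a minimal sketch of how a RelationalGroupedDataset is obtained, the following snippet can be pasted into spark-shell or run as a Scala script. The local session settings, the `sales` DataFrame, and its column names are assumptions made for illustration only.

```scala
import org.apache.spark.sql.{RelationalGroupedDataset, SparkSession}

// Local session for illustration; in spark-shell, getOrCreate returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("grouping-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data
val sales = Seq(("a", 2020, 10), ("a", 2021, 20), ("b", 2020, 5))
  .toDF("shop", "year", "amount")

// Each grouping operator gives a RelationalGroupedDataset, not a DataFrame
val grouped: RelationalGroupedDataset = sales.groupBy("shop")
val rolledUp: RelationalGroupedDataset = sales.rollup("shop", "year")
val cubed: RelationalGroupedDataset = sales.cube("shop", "year")

// Nothing is aggregated until an aggregation operator (here count) is requested
val perShop = grouped.count()
perShop.show()
```

The grouping operators only record the grouping expressions and the group type; a DataFrame appears once an aggregation operator is called.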
Creating Instance
RelationalGroupedDataset takes the following to be created:
- DataFrame
- Grouping Expressions
- GroupType
RelationalGroupedDataset is created (possibly using the apply factory) for grouping operators such as Dataset.groupBy, Dataset.rollup, Dataset.cube, and RelationalGroupedDataset.pivot.
Creating RelationalGroupedDataset Instance
apply(
  df: DataFrame,
  groupingExprs: Seq[Expression],
  groupType: GroupType): RelationalGroupedDataset
apply creates a RelationalGroupedDataset.
High-Level Operators
agg
agg(
  aggExpr: (String, String),
  aggExprs: (String, String)*): DataFrame
agg(
  expr: Column,
  exprs: Column*): DataFrame
agg(
  exprs: Map[String, String]): DataFrame
agg creates a DataFrame of aggregates for all the given column expressions.
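The three agg variants can be compared side by side. This is a sketch only: the local session, the `sales` data, and the `total` and `average` aliases are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, sum}

val spark = SparkSession.builder().master("local[*]").appName("agg-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data
val sales = Seq(("a", 2020, 10), ("a", 2021, 20), ("b", 2020, 5))
  .toDF("shop", "year", "amount")
val byShop = sales.groupBy("shop")

// (column name, aggregate function name) pairs
val viaPairs = byShop.agg("amount" -> "sum", "year" -> "max")

// Column expressions, which also allow custom aliases
val viaColumns = byShop.agg(sum("amount").as("total"), avg("amount").as("average"))

// Map of column name to aggregate function name
val viaMap = byShop.agg(Map("amount" -> "sum", "year" -> "max"))

viaPairs.show()
viaColumns.show()
viaMap.show()
```

The Column-based variant is the most flexible of the three, since the string-based variants cannot name their output columns or aggregate the same column twice through a Map.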
count
count(): DataFrame
count creates a Count aggregate function with a 1 literal and converts it to an AggregateExpression.
count then wraps the AggregateExpression in an Alias unary expression named count.
In the end, count passes the Alias expression to toDF to create the result DataFrame.
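The steps above can be mirrored with the public API. The sketch below assumes a local session and hypothetical `sales` data, and compares count with an explicit count over a 1 literal aliased as count.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, lit}

val spark = SparkSession.builder().master("local[*]").appName("count-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data
val sales = Seq(("a", 2020, 10), ("a", 2021, 20), ("b", 2020, 5))
  .toDF("shop", "year", "amount")

// The count operator of RelationalGroupedDataset
val counts = sales.groupBy("shop").count()

// A hand-written equivalent: count over a 1 literal, aliased as "count"
val explicitCounts = sales.groupBy("shop").agg(count(lit(1)).as("count"))

// Collect both results as shop -> count maps for comparison
val countsByShop = counts.collect().map(r => r.getString(0) -> r.getLong(1)).toMap
val explicitByShop = explicitCounts.collect().map(r => r.getString(0) -> r.getLong(1)).toMap
println(countsByShop == explicitByShop)
```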
Aggregating Numeric Columns
aggregateNumericColumns(
  colNames: String*)(
  f: Expression => AggregateFunction): DataFrame
aggregateNumericColumns asserts that the given colNames are all numeric (with NumericType) or, when no column names are given, takes all the numeric columns of this DataFrame.
For every numeric column, aggregateNumericColumns applies the given f function and converts the result to an AggregateExpression.
In the end, aggregateNumericColumns passes the AggregateExpressions to toDF to create the result DataFrame.
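aggregateNumericColumns is used when RelationalGroupedDataset is requested for the numeric aggregation operators (mean, max, avg, min, and sum).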
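aggregateNumericColumns itself is internal, but its behavior can be observed through public numeric operators such as sum and avg. The sketch below assumes a local, classic (non-Connect) session, where the numeric check happens as soon as the operator is called; the `sales` data is hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

val spark = SparkSession.builder().master("local[*]").appName("numeric-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data: shop is a string column, year and amount are numeric
val sales = Seq(("a", 2020, 10), ("a", 2021, 20), ("b", 2020, 5))
  .toDF("shop", "year", "amount")
val byShop = sales.groupBy("shop")

// Explicit column name: only that column is aggregated
val totals = byShop.sum("amount")

// No column names: every numeric column is aggregated
val averages = byShop.avg()

// A non-numeric column is rejected (assumption: classic session, so the check is eager)
val rejected = Try(byShop.sum("shop"))

totals.show()
averages.show()
println(rejected.isFailure)
```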
Creating DataFrame of Aggregates
toDF(
  aggExprs: Seq[Expression]): DataFrame
toDF determines whether to include groupingExprs in the result DataFrame or not based on spark.sql.retainGroupColumns configuration property.
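The effect of spark.sql.retainGroupColumns can be seen by toggling it at runtime. This is a sketch under the assumption of a local session and hypothetical `sales` data; the property is restored to true at the end.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("retain-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data
val sales = Seq(("a", 2020, 10), ("a", 2021, 20), ("b", 2020, 5))
  .toDF("shop", "year", "amount")

// With the property enabled, grouping columns are part of the result
spark.conf.set("spark.sql.retainGroupColumns", "true")
val withKeys = sales.groupBy("shop").count()

// With the property disabled, only the aggregates remain
spark.conf.set("spark.sql.retainGroupColumns", "false")
val withoutKeys = sales.groupBy("shop").count()

// Restore the setting so later queries are unaffected
spark.conf.set("spark.sql.retainGroupColumns", "true")

println(withKeys.columns.mkString(", "))
println(withoutKeys.columns.mkString(", "))
```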
toDF converts the aggregate expressions to named expressions, aliasing any expression that does not already have a name.
toDF creates a new DataFrame with different LogicalPlans based on the GroupType.
| GroupType | Logical Operator |
|---|---|
| GroupByType | Aggregate with the Grouping Expressions |
| RollupType | Aggregate with Rollup expression with the Grouping Expressions |
| CubeType | Aggregate with Cube expression with the Grouping Expressions |
| PivotType | Pivot |
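The mapping in the table can be checked by inspecting the logical plan of the resulting DataFrame. The sketch below assumes Spark 3.x with a classic (non-Connect) session, where queryExecution is available on a DataFrame; the `sales` data and the `topOperator` and `groupingNodes` helpers are hypothetical names introduced for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.catalyst.plans.logical.Aggregate

val spark = SparkSession.builder().master("local[*]").appName("plan-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data
val sales = Seq(("a", 2020, 10), ("a", 2021, 20), ("b", 2020, 5))
  .toDF("shop", "year", "amount")

// Name of the top logical operator of a DataFrame (before analysis)
def topOperator(df: DataFrame): String = df.queryExecution.logical.nodeName

// Names of the grouping expressions when the top operator is an Aggregate
def groupingNodes(df: DataFrame): Seq[String] = df.queryExecution.logical match {
  case a: Aggregate => a.groupingExpressions.map(_.nodeName)
  case _ => Seq.empty
}

val viaGroupBy = sales.groupBy("shop").count()
val viaRollup = sales.rollup("shop", "year").count()
val viaCube = sales.cube("shop", "year").count()
// Pivot values are given explicitly so that no job runs to discover them
val viaPivot = sales.groupBy("shop").pivot("year", Seq(2020, 2021)).sum("amount")

println(topOperator(viaGroupBy) + " " + groupingNodes(viaGroupBy))
println(topOperator(viaRollup) + " " + groupingNodes(viaRollup))
println(topOperator(viaCube) + " " + groupingNodes(viaCube))
println(topOperator(viaPivot))
```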
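toDF is used when RelationalGroupedDataset is requested to create the result DataFrame of an aggregation operator (e.g. count and aggregateNumericColumns).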