RelationalGroupedDataset¶
RelationalGroupedDataset is a high-level API for untyped (Row-based) grouping in aggregation queries, used to calculate aggregates over groups of rows in a DataFrame.
KeyValueGroupedDataset is for Typed Aggregates
KeyValueGroupedDataset is used for typed aggregates over groups of custom Scala objects (not Rows).
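To make the distinction concrete, here is a minimal sketch that contrasts untyped grouping with Dataset.groupBy against typed grouping with Dataset.groupByKey. The local SparkSession setup, the sample data and the value names are assumptions made for illustration only.

```scala
import org.apache.spark.sql.{KeyValueGroupedDataset, RelationalGroupedDataset, SparkSession}

// Assumption: a local session for experimentation (spark-shell already provides `spark`)
val spark = SparkSession.builder().master("local[*]").appName("grouping-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data: (department, amount) pairs
val sales = Seq(("a", 10), ("a", 20), ("b", 5)).toDS()

// Untyped (Row-based) grouping: columns are referenced by name
val untyped: RelationalGroupedDataset = sales.groupBy("_1")

// Typed grouping: the key is computed by a Scala function over the objects
val typed: KeyValueGroupedDataset[String, (String, Int)] = sales.groupByKey(_._1)

val untypedCounts = untyped.count() // a DataFrame (rows)
val typedCounts = typed.count()     // a Dataset[(String, Long)]
```

The untyped variant returns Rows and is checked at analysis time, while the typed variant keeps the Scala types of the key and the values.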
RelationalGroupedDataset is the result of executing the following Dataset high-level operators:
Creating Instance¶
RelationalGroupedDataset takes the following to be created:
- DataFrame
- Grouping Expressions
- GroupType
RelationalGroupedDataset is created (possibly using the apply factory) for the following operators:
Creating RelationalGroupedDataset Instance¶
apply(
  df: DataFrame,
  groupingExprs: Seq[Expression],
  groupType: GroupType): RelationalGroupedDataset
apply creates a RelationalGroupedDataset.
High-Level Operators¶
agg¶
agg(
  aggExpr: (String, String),
  aggExprs: (String, String)*): DataFrame
agg(
  expr: Column,
  exprs: Column*): DataFrame
agg(
  exprs: Map[String, String]): DataFrame
agg creates a DataFrame of aggregates for all the given column expressions.
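The three agg variants can be sketched as follows. The sample data, column names and value names are hypothetical, and the local SparkSession is an assumption for experimentation.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, sum}

// Assumption: a local session for experimentation (spark-shell already provides `spark`)
val spark = SparkSession.builder().master("local[*]").appName("agg-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data
val sales = Seq(("a", 10), ("a", 20), ("b", 5)).toDF("dept", "amount")
val grouped = sales.groupBy("dept")

// (String, String) pairs: column name -> name of the aggregate function
val byPairs = grouped.agg("amount" -> "sum", "amount" -> "max")

// Column expressions, here with explicit aliases
val byColumns = grouped.agg(sum($"amount").as("total"), avg($"amount").as("mean"))

// Map of column name -> name of the aggregate function
val byMap = grouped.agg(Map("amount" -> "avg"))
```

The Column-based variant is the most flexible of the three, since it accepts arbitrary expressions and aliases, whereas a Map can hold only one aggregate function per column name.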
count¶
count(): DataFrame
count creates a Count aggregate function with the literal 1 and converts it to an AggregateExpression.
count wraps the AggregateExpression in an Alias unary expression named count.
In the end, count uses toDF to create a DataFrame from the Alias expression.
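As a rough illustration of the steps above, count can be compared with a hand-written aggregate that counts the literal 1 under the alias count. The sample data and value names are hypothetical; the comparison is a sketch of the described behavior, not the internal code path.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, lit}

// Assumption: a local session for experimentation (spark-shell already provides `spark`)
val spark = SparkSession.builder().master("local[*]").appName("count-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data
val sales = Seq(("a", 10), ("a", 20), ("b", 5)).toDF("dept", "amount")

// The high-level operator
val counts = sales.groupBy("dept").count()

// A manual equivalent: count the literal 1 and alias the result as "count"
val manualCounts = sales.groupBy("dept").agg(count(lit(1)).as("count"))

val countsByDept = counts.as[(String, Long)].collect().toMap
val manualByDept = manualCounts.as[(String, Long)].collect().toMap
```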
Aggregating Numeric Columns¶
aggregateNumericColumns(
  colNames: String*)(
  f: Expression => AggregateFunction): DataFrame
aggregateNumericColumns asserts that the given colNames are all numeric (of NumericType) or, when no column names are given, takes all the numeric columns of this DataFrame.
For every numeric column, aggregateNumericColumns applies the given f function and converts the result to an AggregateExpression.
In the end, aggregateNumericColumns uses toDF to create a DataFrame from the AggregateExpressions.
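aggregateNumericColumns is not meant to be called directly. The sketch below assumes that the public sum operator is one of its callers (an assumption, since the callers are not listed here) and shows the three cases described above: no column names, an explicit numeric column, and a non-numeric column. The sample data and value names are hypothetical.

```scala
import scala.util.Try
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation (spark-shell already provides `spark`)
val spark = SparkSession.builder().master("local[*]").appName("numeric-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data with one non-numeric, non-grouping column (label)
val sales = Seq(("a", "x", 10, 1.5), ("a", "y", 20, 2.5), ("b", "z", 5, 0.5))
  .toDF("dept", "label", "amount", "price")
val grouped = sales.groupBy("dept")

// No column names: all numeric columns (amount and price) are aggregated
val allNumeric = grouped.sum()

// Explicit numeric column
val onlyAmount = grouped.sum("amount")

// Non-numeric column: the numeric-type assertion is expected to reject it
val nonNumeric = Try(grouped.sum("label"))
```

Wrapping the last call in Try keeps the sketch runnable while still showing that the assertion happens when the operator is called, before any job runs.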
aggregateNumericColumns is used when:
Creating DataFrame of Aggregates¶
toDF(
  aggExprs: Seq[Expression]): DataFrame
toDF determines whether to include the groupingExprs in the result DataFrame based on the spark.sql.retainGroupColumns configuration property.
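The effect of spark.sql.retainGroupColumns can be observed with a small experiment. This is a sketch under the assumption that the property can be changed at runtime through spark.conf; the sample data and value names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation (spark-shell already provides `spark`)
val spark = SparkSession.builder().master("local[*]").appName("retain-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data
val sales = Seq(("a", 10), ("a", 20), ("b", 5)).toDF("dept", "amount")
val key = "spark.sql.retainGroupColumns"

// Grouping columns are part of the result
spark.conf.set(key, "true")
val withGroupCols = sales.groupBy("dept").count().columns.toSeq

// Grouping columns are left out of the result
spark.conf.set(key, "false")
val withoutGroupCols = sales.groupBy("dept").count().columns.toSeq

// Restore the setting so later queries are not affected
spark.conf.set(key, "true")
```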
toDF converts the aggregate expressions to use proper names.
toDF creates a new DataFrame with a logical plan that depends on the GroupType.
| GroupType | Logical Operator |
|---|---|
| GroupByType | Aggregate with the Grouping Expressions |
| RollupType | Aggregate with Rollup expression with the Grouping Expressions |
| CubeType | Aggregate with Cube expression with the Grouping Expressions |
| PivotType | Pivot |
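The four group types in the table correspond to the groupBy, rollup, cube and pivot operators. The following sketch runs one query per group type over the same hypothetical data, so the differences in the results are easy to compare. The sample data and value names are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation (spark-shell already provides `spark`)
val spark = SparkSession.builder().master("local[*]").appName("grouptype-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data
val sales = Seq(("a", 2020, 10), ("a", 2021, 20), ("b", 2020, 5))
  .toDF("dept", "year", "amount")

// GroupByType: one row per distinct (dept, year) pair
val byGroup = sales.groupBy("dept", "year").sum("amount")

// RollupType: adds per-dept subtotals and a grand total (nulls in the rolled-up columns)
val byRollup = sales.rollup("dept", "year").sum("amount")

// CubeType: subtotals for every combination of the grouping columns
val byCube = sales.cube("dept", "year").sum("amount")

// PivotType: one row per dept, one column per listed year
val byPivot = sales.groupBy("dept").pivot("year", Seq(2020, 2021)).sum("amount")
```

Calling explain(true) on any of these DataFrames prints the logical plans, which is one way to compare the operators listed in the table.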
toDF is used when: