RelationalGroupedDataset¶

RelationalGroupedDataset is a high-level API for Untyped (Row-based) Grouping in Aggregation Queries to calculate aggregates over groups of rows in a DataFrame.

KeyValueGroupedDataset is for Typed Aggregates

KeyValueGroupedDataset is used for typed aggregates over groups of custom Scala objects (not Rows).
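The difference between the two grouping APIs can be sketched as follows. This is a minimal example that assumes a local SparkSession; the `Sale` case class, the `TypedVsUntypedSketch` object and the sample data are hypothetical and only serve the illustration.

```scala
import org.apache.spark.sql.{Dataset, KeyValueGroupedDataset, RelationalGroupedDataset, SparkSession}

// Hypothetical domain class (top-level so that an Encoder can be derived)
final case class Sale(city: String, amount: Int)

object TypedVsUntypedSketch {
  // Local session for experimentation only (an assumption of this sketch)
  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("typed-vs-untyped-sketch")
    .getOrCreate()
  import spark.implicits._

  val sales: Dataset[Sale] =
    Seq(Sale("Warsaw", 10), Sale("Warsaw", 20), Sale("Toronto", 30)).toDS()

  // Untyped (Row-based) grouping by column name
  val untyped: RelationalGroupedDataset = sales.groupBy("city")

  // Typed grouping by a key function over Sale objects
  val typed: KeyValueGroupedDataset[String, Sale] = sales.groupByKey(_.city)

  // Typed aggregation: every group is an iterator of Sale objects
  val totals: Map[String, Int] = typed
    .mapGroups((city, rows) => (city, rows.map(_.amount).sum))
    .collect()
    .toMap
}
```

The untyped variant is addressed by column names and yields a DataFrame, while the typed variant keeps working with `Sale` objects.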

RelationalGroupedDataset is the result of executing the Dataset high-level grouping operators, such as groupBy, rollup and cube.
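As a rough illustration, the sketch below obtains a RelationalGroupedDataset from the groupBy, rollup and cube operators of Dataset (and from pivot, which is an operator of RelationalGroupedDataset itself). It assumes a local SparkSession; the `GroupingOperatorsSketch` object and the `sales` data are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, RelationalGroupedDataset, SparkSession}

object GroupingOperatorsSketch {
  // Local session for experimentation only (an assumption of this sketch)
  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("grouping-operators-sketch")
    .getOrCreate()
  import spark.implicits._

  // Hypothetical sample data
  val sales: DataFrame = Seq(
    ("Warsaw", "books", 10),
    ("Warsaw", "games", 20),
    ("Toronto", "books", 30)
  ).toDF("city", "category", "amount")

  // Grouping operators of Dataset return a RelationalGroupedDataset;
  // an aggregation operator (e.g. count) is needed to get a DataFrame back
  val byCity: RelationalGroupedDataset = sales.groupBy("city")
  val rolledUp: RelationalGroupedDataset = sales.rollup("city", "category")
  val cubed: RelationalGroupedDataset = sales.cube("city", "category")

  // pivot is defined on RelationalGroupedDataset and returns another one
  val pivoted: RelationalGroupedDataset =
    sales.groupBy("city").pivot("category", Seq("books", "games"))
}
```

Applying `count()` to `rolledUp` or `cubed` adds subtotal rows (with nulls in the grouping columns) to the rows that a plain `groupBy` would produce.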

Creating Instance¶

RelationalGroupedDataset takes a DataFrame, grouping expressions (Seq[Expression]) and a GroupType to be created.

RelationalGroupedDataset is created (possibly using the apply factory) when the grouping operators (e.g. groupBy, rollup, cube) are executed.

Creating RelationalGroupedDataset Instance¶

apply(
  df: DataFrame,
  groupingExprs: Seq[Expression],
  groupType: GroupType): RelationalGroupedDataset

apply creates a RelationalGroupedDataset.

High-Level Operators¶

agg¶

agg(
  aggExpr: (String, String),
  aggExprs: (String, String)*): DataFrame
agg(
  expr: Column,
  exprs: Column*): DataFrame
agg(
  exprs: Map[String, String]): DataFrame

agg creates a DataFrame of aggregates for all the given column expressions.
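The three agg variants can be compared with a short sketch like the one below. It assumes a local SparkSession; the `AggSketch` object, the `sales` data and the `total` and `largest` aliases are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{max, sum}

object AggSketch {
  // Local session for experimentation only (an assumption of this sketch)
  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("agg-sketch")
    .getOrCreate()
  import spark.implicits._

  // Hypothetical sample data
  val sales: DataFrame = Seq(
    ("Warsaw", "books", 10),
    ("Warsaw", "games", 20),
    ("Toronto", "books", 30)
  ).toDF("city", "category", "amount")

  private val grouped = sales.groupBy("city")

  // (column name, aggregate function name) pairs
  val byPairs: DataFrame = grouped.agg("amount" -> "sum", "amount" -> "max")

  // Column expressions (with explicit aliases)
  val byColumns: DataFrame =
    grouped.agg(sum("amount").as("total"), max("amount").as("largest"))

  // Map of column names to aggregate function names
  val byMap: DataFrame = grouped.agg(Map("amount" -> "avg"))
}
```

The Column-based variant is the most flexible one, since it accepts aliases and arbitrary aggregate expressions. Note that a Map cannot hold two aggregates for the same column name.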

count¶

count(): DataFrame

count creates a Count with a 1 literal and converts it to an AggregateExpression.

count wraps the AggregateExpression in an Alias unary expression named count.

In the end, count uses toDF to create a DataFrame for the Alias expression.
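From the user's point of view, the steps above should make count behave like an agg with a `count(lit(1))` expression aliased as `count`. The sketch below checks that, assuming a local SparkSession; the `CountSketch` object and the `sales` data are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{count, lit}

object CountSketch {
  // Local session for experimentation only (an assumption of this sketch)
  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("count-sketch")
    .getOrCreate()
  import spark.implicits._

  // Hypothetical sample data
  val sales: DataFrame = Seq(
    ("Warsaw", "books", 10),
    ("Warsaw", "games", 20),
    ("Toronto", "books", 30)
  ).toDF("city", "category", "amount")

  // The count operator of RelationalGroupedDataset
  val viaCount: DataFrame = sales.groupBy("city").count()

  // A hand-written equivalent: Count over a 1 literal, aliased as "count"
  val viaAgg: DataFrame = sales.groupBy("city").agg(count(lit(1)).as("count"))
}
```

Both DataFrames have the grouping column followed by a `count` column.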

Aggregating Numeric Columns¶

aggregateNumericColumns(
  colNames: String*)(
  f: Expression => AggregateFunction): DataFrame

aggregateNumericColumns asserts that the given colNames are all numeric (of NumericType) or, when no colNames are given, takes all the numeric columns of this DataFrame.

For every numeric column, aggregateNumericColumns applies the given f function and converts the result to an AggregateExpression.

In the end, aggregateNumericColumns uses toDF to create a DataFrame for the AggregateExpressions.

aggregateNumericColumns is used when:

RelationalGroupedDataset is requested for the mean, max, avg, min and sum operators
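The behaviour of the numeric-column operators can be observed with a sketch like the following. It assumes a local SparkSession; the `NumericAggSketch` object and the `sales` data (including the `discount` column) are hypothetical.

```scala
import scala.util.Try
import org.apache.spark.sql.{DataFrame, SparkSession}

object NumericAggSketch {
  // Local session for experimentation only (an assumption of this sketch)
  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("numeric-agg-sketch")
    .getOrCreate()
  import spark.implicits._

  // Hypothetical sample data with two numeric columns
  val sales: DataFrame = Seq(
    ("Warsaw", "books", 10, 1.0),
    ("Warsaw", "games", 20, 2.0),
    ("Toronto", "books", 30, 3.0)
  ).toDF("city", "category", "amount", "discount")

  // No column names given: all numeric columns are aggregated
  val allNumeric: DataFrame = sales.groupBy("city").sum()

  // Explicit column names: only the given (numeric) columns are aggregated
  val selected: DataFrame = sales.groupBy("city").max("amount")

  // A non-numeric column name is expected to be rejected with an exception
  val rejected: Try[DataFrame] = Try(sales.groupBy("city").avg("category"))
}
```

With no column names, `sum()` covers both `amount` and `discount`; asking for the average of the string `category` column fails.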

Creating DataFrame of Aggregates¶

toDF(
  aggExprs: Seq[Expression]): DataFrame

toDF determines whether or not to include groupingExprs in the result DataFrame based on the spark.sql.retainGroupColumns configuration property.
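The effect of spark.sql.retainGroupColumns can be seen in the columns of the result DataFrame. The sketch below assumes a local SparkSession and that the property can be changed at runtime; the `RetainGroupColumnsSketch` object, the `countColumns` helper and the `sales` data are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RetainGroupColumnsSketch {
  // Local session for experimentation only (an assumption of this sketch)
  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("retain-group-columns-sketch")
    .getOrCreate()
  import spark.implicits._

  // Hypothetical sample data
  val sales: DataFrame = Seq(
    ("Warsaw", "books", 10),
    ("Warsaw", "games", 20),
    ("Toronto", "books", 30)
  ).toDF("city", "category", "amount")

  // Hypothetical helper: column names of a grouped count
  // with the property set to the given value
  def countColumns(retain: Boolean): Seq[String] = {
    spark.conf.set("spark.sql.retainGroupColumns", retain.toString)
    try sales.groupBy("city").count().columns.toSeq
    finally spark.conf.unset("spark.sql.retainGroupColumns") // back to the default
  }
}
```

With the property enabled, the grouping column `city` precedes the aggregate; with it disabled, only the `count` column remains.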

toDF converts the aggregate expressions to named expressions (aliasing any expression that does not have a name yet).

toDF creates a new DataFrame with a logical plan that depends on the GroupType.

| GroupType | Logical Operator |
|---|---|
| GroupByType | Aggregate with the Grouping Expressions |
| RollupType | Aggregate with `Rollup` expression with the Grouping Expressions |
| CubeType | Aggregate with `Cube` expression with the Grouping Expressions |
| PivotType | Pivot |
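The mapping in the table can be checked by looking at the (unanalyzed) logical plan of the result DataFrame. This sketch assumes a local SparkSession and relies on Catalyst internals (`queryExecution.logical`, `Aggregate`, `Pivot`) that may differ between Spark versions; the `GroupTypePlansSketch` object, the `logicalOf` helper and the `sales` data are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan, Pivot}

object GroupTypePlansSketch {
  // Local session for experimentation only (an assumption of this sketch)
  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("group-type-plans-sketch")
    .getOrCreate()
  import spark.implicits._

  // Hypothetical sample data
  val sales: DataFrame = Seq(
    ("Warsaw", "books", 10),
    ("Warsaw", "games", 20),
    ("Toronto", "books", 30)
  ).toDF("city", "category", "amount")

  // Hypothetical helper: the logical plan a DataFrame was created with
  def logicalOf(df: DataFrame): LogicalPlan = df.queryExecution.logical

  // GroupByType, RollupType and CubeType are expected to give an Aggregate
  val groupByPlan: LogicalPlan = logicalOf(sales.groupBy("city").count())
  val rollupPlan: LogicalPlan = logicalOf(sales.rollup("city", "category").count())
  val cubePlan: LogicalPlan = logicalOf(sales.cube("city", "category").count())

  // PivotType is expected to give a Pivot
  val pivotPlan: LogicalPlan = logicalOf(
    sales.groupBy("city").pivot("category", Seq("books", "games")).sum("amount"))
}
```

Printing any of the plans (e.g. `println(GroupTypePlansSketch.rollupPlan.numberedTreeString)`) shows the top-level operator together with its grouping expressions.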

toDF is used when:

RelationalGroupedDataset is requested to aggregate numeric columns (for the mean, max, avg, min and sum operators), agg and count