RelationalGroupedDataset

RelationalGroupedDataset is a high-level API for untyped (Row-based) grouping in aggregation queries. It calculates aggregates over groups of rows in a DataFrame.
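As a minimal sketch of how a RelationalGroupedDataset is typically obtained and used, the following example groups a small DataFrame and aggregates each group. The object name, the sales data, and the local SparkSession settings are assumptions made for illustration only.

```scala
import org.apache.spark.sql.{DataFrame, RelationalGroupedDataset, SparkSession}
import org.apache.spark.sql.functions.sum

// Hypothetical demo object; names and data are illustrative only.
object GroupedDatasetDemo {
  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("GroupedDatasetDemo")
    .getOrCreate()
  import spark.implicits._

  val sales: DataFrame = Seq(
    ("north", "apples", 10),
    ("north", "pears", 5),
    ("south", "apples", 7)
  ).toDF("region", "product", "amount")

  // Grouping operators do not run a query; they only return a RelationalGroupedDataset.
  val byRegion: RelationalGroupedDataset = sales.groupBy("region")
  val byRollup: RelationalGroupedDataset = sales.rollup("region", "product")
  val byCube: RelationalGroupedDataset = sales.cube("region", "product")

  // An aggregation operator turns the grouped dataset back into a DataFrame.
  val totals: DataFrame = byRegion.agg(sum("amount").as("total"))
}
```

Calling GroupedDatasetDemo.totals.show() should print one row per region with the summed amount.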

KeyValueGroupedDataset is for Typed Aggregates

KeyValueGroupedDataset is used for typed aggregates over groups of custom Scala objects (not Rows).

RelationalGroupedDataset is the result of executing the groupBy, rollup, and cube high-level operators of Dataset.

Creating Instance

RelationalGroupedDataset takes the following to be created: a DataFrame, grouping expressions (Seq[Expression]), and a GroupType.

RelationalGroupedDataset is created (possibly using apply factory) for the following operators:

Creating RelationalGroupedDataset Instance

apply(
  df: DataFrame,
  groupingExprs: Seq[Expression],
  groupType: GroupType): RelationalGroupedDataset

apply creates a RelationalGroupedDataset.

High-Level Operators

agg

agg(
  aggExpr: (String, String),
  aggExprs: (String, String)*): DataFrame
agg(
  expr: Column,
  exprs: Column*): DataFrame
agg(
  exprs: Map[String, String]): DataFrame

agg creates a DataFrame of aggregates for all the given column expressions.
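The three agg variants can be compared side by side. This is a sketch with made-up data and a hypothetical AggVariantsDemo object; it uses only the public agg overloads listed above.

```scala
import org.apache.spark.sql.{DataFrame, RelationalGroupedDataset, SparkSession}
import org.apache.spark.sql.functions.{avg, max}

// Hypothetical demo object; names and data are illustrative only.
object AggVariantsDemo {
  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("AggVariantsDemo")
    .getOrCreate()
  import spark.implicits._

  val sales: DataFrame = Seq(("north", 10), ("north", 5), ("south", 7))
    .toDF("region", "amount")
  val grouped: RelationalGroupedDataset = sales.groupBy("region")

  // (column name -> aggregate function name) pairs
  val viaPairs: DataFrame = grouped.agg("amount" -> "max", "amount" -> "avg")

  // Column expressions built with the standard functions
  val viaColumns: DataFrame = grouped.agg(max("amount"), avg("amount"))

  // A Map allows a single aggregate per column name
  val viaMap: DataFrame = grouped.agg(Map("amount" -> "max"))
}
```

The Column-based variant is the most flexible one, since it accepts aliases and arbitrary expressions, while the string-based variants are limited to aggregate functions referenced by name.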

count

count(): DataFrame

count creates a Count aggregate function with a 1 literal and converts it to an AggregateExpression.

count creates an Alias unary expression for the aggregate expression with the name count.

In the end, count passes the Alias to toDF to create the result DataFrame.
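The effect of count can be sketched with the public API. The example also shows an agg call that should be roughly equivalent (a Count over a 1 literal, aliased as count); the equivalence is an illustration built from public functions, not the internal code path, and the object name and data are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{count, lit}

// Hypothetical demo object; names and data are illustrative only.
object CountDemo {
  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("CountDemo")
    .getOrCreate()
  import spark.implicits._

  val events: DataFrame = Seq(("ann", "login"), ("ann", "click"), ("bob", "login"))
    .toDF("user", "action")

  // One row per group with a LongType column named "count"
  val counts: DataFrame = events.groupBy("user").count()

  // Public-API approximation of what count builds internally
  val countsViaAgg: DataFrame = events.groupBy("user").agg(count(lit(1)).as("count"))

  // Collected to a Map for easy inspection
  val countsByUser: Map[String, Long] =
    counts.collect().map(row => row.getString(0) -> row.getLong(1)).toMap
}
```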

Aggregating Numeric Columns

aggregateNumericColumns(
  colNames: String*)(
  f: Expression => AggregateFunction): DataFrame

When colNames are given, aggregateNumericColumns asserts that they are all numeric (of NumericType). Otherwise, aggregateNumericColumns takes all the numeric columns of this DataFrame.

For every numeric column, aggregateNumericColumns applies the given f function and converts the result to an AggregateExpression.

In the end, aggregateNumericColumns passes the AggregateExpressions to toDF to create the result DataFrame.
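The behaviour above can be observed through the public sum operator, which is assumed here to go through aggregateNumericColumns. The sketch uses a hypothetical NumericAggDemo object and made-up data; the failure for a non-numeric column is wrapped in Try because the exact exception type and message may differ across Spark versions.

```scala
import scala.util.Try
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical demo object; names and data are illustrative only.
object NumericAggDemo {
  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("NumericAggDemo")
    .getOrCreate()
  import spark.implicits._

  val people: DataFrame = Seq(
    ("eng", "ann", 30, 1.5),
    ("eng", "bob", 40, 2.5),
    ("ops", "cat", 50, 3.0)
  ).toDF("dept", "name", "age", "score")

  // No column names: every numeric column (age, score) is aggregated
  val allNumeric: DataFrame = people.groupBy("dept").sum()

  // Explicit column names: only the given numeric columns are aggregated
  val onlyAge: DataFrame = people.groupBy("dept").sum("age")

  // A non-numeric column is expected to be rejected
  val nonNumericAttempt: Try[DataFrame] = Try(people.groupBy("dept").sum("name"))
}
```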


aggregateNumericColumns is used when RelationalGroupedDataset is requested to mean, max, avg, min, and sum.

Creating DataFrame of Aggregates

toDF(
  aggExprs: Seq[Expression]): DataFrame

toDF determines whether to include groupingExprs in the result DataFrame or not based on spark.sql.retainGroupColumns configuration property.
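The effect of spark.sql.retainGroupColumns can be sketched by toggling the property at runtime and comparing the output columns. The object name and data are hypothetical, and the sketch assumes the property is left at its default (true) when the first query is created.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical demo object; names and data are illustrative only.
object RetainGroupColumnsDemo {
  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("RetainGroupColumnsDemo")
    .getOrCreate()
  import spark.implicits._

  private val key = "spark.sql.retainGroupColumns"

  val events: DataFrame = Seq(("ann", 1), ("ann", 2), ("bob", 3)).toDF("user", "value")

  // Default (assumed true): grouping columns come first, followed by the aggregates
  val retained: Seq[String] = events.groupBy("user").count().columns.toSeq

  // Property disabled: only the aggregate columns are part of the result
  val dropped: Seq[String] = {
    spark.conf.set(key, "false")
    try events.groupBy("user").count().columns.toSeq
    finally spark.conf.unset(key) // restore the default for later queries
  }
}
```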

toDF gives every aggregate expression a proper name (by aliasing the expressions that are not named already).

toDF creates a new DataFrame with different LogicalPlans based on the GroupType.

| GroupType | Logical Operator |
|---|---|
| GroupByType | Aggregate with the Grouping Expressions |
| RollupType | Aggregate with Rollup expression with the Grouping Expressions |
| CubeType | Aggregate with Cube expression with the Grouping Expressions |
| PivotType | Pivot |
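The mapping from GroupType to logical operator can be inspected by looking at the root of the logical plan of the resulting DataFrame. This is only a sketch: queryExecution is a developer-facing API, the operator names are internal and may change between Spark versions, and the object name, the rootOperator helper, and the data are hypothetical. Pivot values are given explicitly so that Spark does not have to run a job to compute them.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.sum

// Hypothetical demo object; names and data are illustrative only.
object GroupTypePlansDemo {
  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("GroupTypePlansDemo")
    .getOrCreate()
  import spark.implicits._

  val sales: DataFrame = Seq(("north", "apples", 10), ("south", "pears", 5))
    .toDF("region", "product", "amount")

  // Hypothetical helper: name of the root operator of the logical plan (before analysis)
  def rootOperator(df: DataFrame): String = df.queryExecution.logical.nodeName

  val groupByOp: String = rootOperator(sales.groupBy("region").agg(sum("amount")))
  val rollupOp: String = rootOperator(sales.rollup("region", "product").agg(sum("amount")))
  val cubeOp: String = rootOperator(sales.cube("region", "product").agg(sum("amount")))
  val pivotOp: String = rootOperator(
    sales.groupBy("region").pivot("product", Seq("apples", "pears")).agg(sum("amount")))
}
```

Printing df.queryExecution.logical for the rollup and cube queries should also show the Rollup and Cube expressions among the grouping expressions of the Aggregate operator.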

toDF is used when RelationalGroupedDataset is requested to agg, count, and aggregateNumericColumns.