SortBasedAggregationIterator¶

SortBasedAggregationIterator is an AggregationIterator that is used by SortAggregateExec physical operator to process rows in a partition.

Creating Instance¶

SortBasedAggregationIterator takes the following to be created:

Partition ID
Grouping NamedExpressions
Value Attributes
Input Iterator (InternalRows)
AggregateExpressions
Aggregate Attributes
Initial input buffer offset
Result NamedExpressions
Function to create a new MutableProjection given expressions and attributes ((Seq[Expression], Seq[Attribute]) => MutableProjection)
number of output rows metric

SortBasedAggregationIterator initializes immediately.

SortBasedAggregationIterator is created when:

SortAggregateExec physical operator is requested to doExecute

Initialization¶

initialize(): Unit

Procedure

initialize is a procedure (returns Unit) so what happens inside stays inside (paraphrasing the former advertising slogan of Las Vegas, Nevada).

initialize...FIXME

Performance Metrics¶

SortBasedAggregationIterator is given the performance metrics of the owning SortAggregateExec aggregate physical operator when created.

The metrics are displayed as part of SortAggregateExec aggregate physical operator (e.g. in web UI in Details for Query).

SortAggregateExec in web UI (Details for Query)

number of output rows¶

SortBasedAggregationIterator is given number of output rows performance metric when created.

The metric is number of output rows metric of the owning SortAggregateExec physical operator.

The metric is incremented every time SortBasedAggregationIterator is requested for next element (and there's one).

Aggregation Buffer¶

sortBasedAggregationBuffer: InternalRow

SortBasedAggregationIterator creates a new buffer for aggregation (called sortBasedAggregationBuffer) when created.

sortBasedAggregationBuffer is an InternalRow used when SortBasedAggregationIterator is requested for the following:

Initializing to initialize the buffer when there are input rows in the input iterator
processCurrentSortedGroup using the Process Row Function (with the firstRowInNextGroup and then all the rows in a group per currentGroupingKey)
next (after processCurrentSortedGroup) to generate next element (a sort aggregate result) using the Generate Output Function followed by initializing the buffer
outputForEmptyGroupingKeyWithoutInput (to initialize the buffer followed by generating the only sort aggregate result using the Generate Output Function)

Creating New Buffer¶

newBuffer: InternalRow

newBuffer creates a new aggregation buffer (an InternalRow) and initializes buffer values for all imperative aggregate functions (using their aggBufferAttributes)

Checking for Next Row Available¶

Iterator

hasNext: Boolean

hasNext is part of the Iterator (Scala) abstraction.

hasNext is the sortedInputHasNewGroup.

sortedInputHasNewGroup¶

var sortedInputHasNewGroup: Boolean = false

SortBasedAggregationIterator defines sortedInputHasNewGroup flag for hasNext.

sortedInputHasNewGroup indicates that there are no input rows or the current group is the last in the inputIterator.

sortedInputHasNewGroup flag is enabled (true) when the inputIterator has rows (is not empty) when initialize.

sortedInputHasNewGroup flag is disabled (false) when SortBasedAggregationIterator is requested for the following:

initialize and there are no rows in the inputIterator
There are no more groups in the inputIterator while processCurrentSortedGroup

Next Row¶

Iterator

next(): UnsafeRow

next is part of the Iterator (Scala) abstraction.

next if there is an input row available or throws an NoSuchElementException.

If there is an input row available, next does the following:

Processes the current (sorted) group
Generates output aggregate result for the current group using the Generate Output Function for the currentGroupingKey and the aggregation buffer
Initializes the buffer for values of the next group (with the aggregation buffer)
Increments the number of output rows metric

In the end, next returns the generated output row.

Processing Current Sorted Group¶

processCurrentSortedGroup(): Unit

processCurrentSortedGroup...FIXME

Current Grouping Key¶

var currentGroupingKey: UnsafeRow

currentGroupingKey is an UnsafeRow that is the value of the grouping key of the aggregation being processed.

In other words, SortBasedAggregationIterator uses the currentGroupingKey to process the current sorted group fully (using the Process Row Function) until the groupingProjection generates a next grouping key for an input row.

currentGroupingKey is initialized as the next grouping key at the beginning of processing the current sorted group (and remains so until there are more rows for the current aggregation group).

Next Grouping Key¶

var nextGroupingKey: UnsafeRow

nextGroupingKey is an UnsafeRow that is the value of the next grouping key (based on the groupingProjection).

nextGroupingKey is the result of executing the groupingProjection on the current row, if available, that happens when SortBasedAggregationIterator is requested for the following:

Initialization (when created)
Processing Current Sorted Group

As long as nextGroupingKey is within the same group (based on the groupingProjection) while processing current sorted group, SortBasedAggregationIterator keeps processesing input rows (using the Process Row Function) that are assumed to be within the same aggregate group.

nextGroupingKey becomes the Current Grouping Key at the beginning of processing the current sorted group.