Skip to content

Aggregate Window Functions

Aggregate Window Functions (aka window functions or windowed aggregates) are functions that perform a calculation over a group of records called window that are in some relation to the current record (i.e. can be in the same partition or frame as the current row).

In other words, when executed, a window function computes a value for each and every row in a window (per window specification).

Window functions are also called over functions due to how they are applied using over operator.

Spark SQL supports three kinds of window functions:

Ranking Functions

row_number

row_number(): Column

row_number assigns a unique, sequential number to each row within a window partition according to the ordering of rows (starting from 1).

Internally, row_number creates a RowNumber aggregate window leaf expression.

val buckets = spark.range(5).withColumn("bucket", 'id % 3)
// Make duplicates
val input = buckets.union(buckets)
import org.apache.spark.sql.expressions.Window
val buckets = Window.partitionBy('bucket).orderBy('id)
val rn = row_number().over(buckets)
scala> print(rn)
row_number() OVER (PARTITION BY bucket ORDER BY id ASC NULLS FIRST unspecifiedframe$())
val q = input.withColumn("row_number", rn)
scala> q.show
+---+------+----------+
| id|bucket|row_number|
+---+------+----------+
|  0|     0|         1|
|  0|     0|         2|
|  3|     0|         3|
|  3|     0|         4|
|  1|     1|         1|
|  1|     1|         2|
|  4|     1|         3|
|  4|     1|         4|
|  2|     2|         1|
|  2|     2|         2|
+---+------+----------+
scala> println(rn.expr.treeString)
row_number() windowspecdefinition('bucket, 'id ASC NULLS FIRST, unspecifiedframe$())
:- row_number()
+- windowspecdefinition('bucket, 'id ASC NULLS FIRST, unspecifiedframe$())
   :- 'bucket
   :- 'id ASC NULLS FIRST
   :  +- 'id
   +- unspecifiedframe$()