
Window Functions

From the official documentation of PostgreSQL:

Window functions provide the ability to perform calculations across sets of rows that are related to the current query row.

Unlike standard aggregate functions in Spark SQL, which collapse a group of rows into a single value, window functions generate a value for every row based on the group of rows that are associated with that row by some relation.
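The per-row behavior can be sketched with a toy example, assuming a local SparkSession and a made-up sales dataset (the names sales, seller and amount are illustrative only):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder.master("local[*]").appName("window-demo").getOrCreate()
import spark.implicits._

val sales = Seq(("alice", 10), ("alice", 20), ("bob", 5)).toDF("seller", "amount")

// Aggregate function: one output row per group
sales.groupBy($"seller").agg(sum($"amount")).show()

// Window function: every input row is kept, each annotated with its group's sum
val bySeller = Window.partitionBy($"seller")
sales.withColumn("total", sum($"amount").over(bySeller)).show()
```

The groupBy query returns two rows (one per seller), while the windowed query returns all three input rows with the group total attached to each.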

Window functions require a window specification (WindowSpec) that defines which rows are included in a window (frame, i.e. the set of rows that are associated with the current row by some relation).

The Window utility object is used to create a WindowSpec, which can be refined further using WindowSpec operators (e.g. partitionBy, orderBy, rowsBetween).

import org.apache.spark.sql.expressions.Window
import spark.implicits._ // for the 'token and $"..." Column conversions

val byHTokens = Window.partitionBy('token startsWith "h")

import org.apache.spark.sql.expressions.WindowSpec
assert(byHTokens.isInstanceOf[WindowSpec])

val windowSpec = Window
  .partitionBy($"orderId")
  .orderBy($"time")

With a WindowSpec defined, the Column.over operator is used to associate the WindowSpec with an aggregate or window function.

import org.apache.spark.sql.functions.rank
rank().over(byHTokens)

import org.apache.spark.sql.functions.first
first($"time").over(windowSpec) // first requires a column to take the first value of
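As a minimal sketch of over in action, assuming a SparkSession in scope with its implicits imported and a made-up events dataset matching the windowSpec above (the column names orderId, time and amount are illustrative only):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank

val spark = SparkSession.builder.master("local[*]").appName("over-demo").getOrCreate()
import spark.implicits._

// hypothetical per-order events
val events = Seq((1, 10L), (1, 20L), (2, 15L)).toDF("orderId", "time")

val windowSpec = Window
  .partitionBy($"orderId")
  .orderBy($"time")

// every input row is preserved; rank is computed within each orderId partition
events.withColumn("rank", rank().over(windowSpec)).show()
```

Note that every input row survives into the output, each annotated with its rank within its partition, which is exactly the per-row behavior that distinguishes window functions from a groupBy aggregation.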

Limitations

WindowSpecDefinition expression enforces the following requirements on WindowFrames:

  1. No UnspecifiedFrame is allowed (it should have been resolved during analysis).
  2. A range window frame cannot be used in an unordered window specification.
  3. A range window frame with value boundaries cannot be used in a window specification with multiple ORDER BY expressions.
  4. The data type of the order specification must match the data type of the range frame.
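Requirement 2 can be illustrated with a small sketch, assuming a SparkSession with its implicits imported and a hypothetical events DataFrame with orderId, time and amount columns (all names are illustrative):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

// A range frame on an unordered window specification violates requirement 2
val unordered = Window
  .partitionBy($"orderId")
  .rangeBetween(Window.unboundedPreceding, Window.currentRow)
// events.select(sum($"amount").over(unordered)) // fails with AnalysisException at analysis time

// Adding an order specification makes the range frame valid
val ordered = Window
  .partitionBy($"orderId")
  .orderBy($"time")
  .rangeBetween(Window.unboundedPreceding, Window.currentRow)
events.select(sum($"amount").over(ordered))
```

Note that the error surfaces during analysis (when the query is resolved), not when the WindowSpec itself is constructed, which is why the invalid specification above can be created without complaint.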