From the official documentation of PostgreSQL:
Window functions provide the ability to perform calculations across sets of rows that are related to the current query row.
Unlike standard aggregate functions, which collapse a group of rows into a single value, window functions generate a value for every input row, computed over the group of rows that are associated with the current row by some relation.
Window functions require a window specification (WindowSpec) that defines which rows are included in a window (frame, i.e. the set of rows that are associated with the current row by some relation).
The Window utility object is used to create a WindowSpec that can be refined further using Window operators (e.g. partitionBy, orderBy).
import org.apache.spark.sql.expressions.Window
val byHTokens = Window.partitionBy('token startsWith "h")
import org.apache.spark.sql.expressions.WindowSpec
assert(byHTokens.isInstanceOf[WindowSpec])
import org.apache.spark.sql.expressions.Window
val windowSpec = Window
  .partitionBy($"orderId")
  .orderBy($"time")
With a WindowSpec defined, the Column.over operator is used to associate the WindowSpec with aggregate or window functions.
import org.apache.spark.sql.functions.rank
rank.over(byHTokens)
import org.apache.spark.sql.functions.first
first($"time").over(windowSpec)
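To see the pieces work together, here is a minimal end-to-end sketch (the orders dataset and its column values are invented for illustration) assuming a local SparkSession:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.first

// Local session for illustration only
val spark = SparkSession.builder().master("local[*]").appName("window-demo").getOrCreate()
import spark.implicits._

// Hypothetical orders data: (orderId, time, amount)
val orders = Seq(
  (1, "10:00", 100.0),
  (1, "11:00", 150.0),
  (2, "09:30", 200.0)
).toDF("orderId", "time", "amount")

val windowSpec = Window
  .partitionBy($"orderId")
  .orderBy($"time")

// Every row of an order is paired with the earliest amount of that order
orders
  .withColumn("firstAmount", first($"amount").over(windowSpec))
  .show()
```

Note the contrast with orders.groupBy("orderId").agg(first("amount")), which would return one row per order: the window version keeps all three input rows and adds firstAmount as an extra column.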
WindowSpecDefinition expression enforces the following requirements on window frames:
- An UnspecifiedFrame is allowed (and is resolved during analysis).
- A range window frame cannot be used in an unordered window specification.
- A range window frame with value boundaries cannot be used in a window specification with multiple order-by expressions.
- The data type of the order-by expression must match the data type of the range frame boundaries.
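The ordered-range requirements can be illustrated with rangeBetween. This is a sketch (the dataset and column names are invented) assuming a local SparkSession:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count

val spark = SparkSession.builder().master("local[*]").appName("range-frame").getOrCreate()
import spark.implicits._

val df = Seq((1, 10L), (1, 30L), (1, 200L)).toDF("id", "amount")

// A range frame requires a single order-by expression whose data type
// matches the frame boundaries: here a numeric column and numeric offsets.
val rangeSpec = Window
  .partitionBy($"id")
  .orderBy($"amount")
  .rangeBetween(Window.currentRow, 100L)

// peers counts the rows whose amount lies in [current amount, current amount + 100]
df.withColumn("peers", count($"amount").over(rangeSpec)).show()

// Dropping the orderBy above would fail analysis with
// "A range window frame cannot be used in an unordered window specification"
```

Adding a second orderBy expression would likewise be rejected, since a range frame with value boundaries is only meaningful relative to a single ordering expression.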