groupByKey Operator — Streaming Aggregation
groupByKey(func: T => K): KeyValueGroupedDataset[K, T]
groupByKey operator aggregates rows by a typed grouping function for Arbitrary Stateful Streaming Aggregation.

groupByKey creates a KeyValueGroupedDataset (Spark SQL) (with keys of type K and rows of type T) to apply aggregation functions over groups of rows (of type T) by key (of type K) per the given func key-generating function.
Note

The type of the input argument of func is the type of the rows in the Dataset (i.e. T of Dataset[T]).
groupByKey simply applies the func function to every row (of type T) and associates it with a logical group per key (of type K).
func: T => K
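
As an illustration (not from the original page; the Dataset and names below are made up, and the spark-shell SparkSession spark is assumed), a key-generating function for a Dataset[String] could map every row to its first letter, so T = String and K = String:

// A minimal sketch: grouping a Dataset[String] by the first letter of every row
import spark.implicits._
val words = Seq("spark", "structured", "streaming", "state").toDS // Dataset[String], i.e. T = String
val firstLetter = (word: String) => word.take(1)                  // func: T => K with K = String
val wordsByFirstLetter = words.groupByKey(firstLetter)            // KeyValueGroupedDataset[String, String]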
Internally, groupByKey creates a structured query with the AppendColumns unary logical operator (with the given func and the analyzed logical plan of the target Dataset that groupByKey was executed on) and creates a new QueryExecution.
In the end, groupByKey creates a KeyValueGroupedDataset with the following:

- Encoders for K keys and T rows
- The new QueryExecution (with the AppendColumns unary logical operator)
- The output schema of the analyzed logical plan
- The new columns of the AppendColumns logical operator (i.e. the attributes of the key)
scala> :type sq
org.apache.spark.sql.Dataset[Long]

// Key-generating function: maps every value to one of three keys ("A", "B" or "C")
val baseCode = 'A'.toInt
val byUpperChar = (n: java.lang.Long) => (n % 3 + baseCode).toChar.toString
val kvs = sq.groupByKey(byUpperChar)

scala> :type kvs
org.apache.spark.sql.KeyValueGroupedDataset[String,Long]

// Peeking under the surface of KeyValueGroupedDataset
import org.apache.spark.sql.catalyst.plans.logical.AppendColumns
val appendColumnsOp = kvs.queryExecution.analyzed.collect { case ac: AppendColumns => ac }.head

scala> println(appendColumnsOp.newColumns)
List(value#7)
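
With the KeyValueGroupedDataset at hand, aggregation functions can be applied per group. As a minimal follow-up (not in the original example), count, one of the built-in aggregations of KeyValueGroupedDataset, gives the number of rows per key:

// Count the rows per key (a built-in aggregation of KeyValueGroupedDataset)
val countsByKey = kvs.count // Dataset[(String, Long)]

For Arbitrary Stateful Streaming Aggregation proper, the same KeyValueGroupedDataset is where mapGroupsWithState and flatMapGroupsWithState come into play.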
Demo: Aggregating Orders Per Zip Code
Go to Demo: groupByKey Streaming Aggregation in Update Mode.
Demo: Aggregating Metrics Per Device

The following example code shows how to apply the groupByKey operator to a structured stream of timestamped values from different devices.
// input stream
import java.sql.Timestamp
import org.apache.spark.sql.functions.lit
val signals = spark.
  readStream.
  format("rate").
  option("rowsPerSecond", 1).
  load.
  withColumn("value", $"value" % 10).                   // <-- randomize the values (just for fun)
  withColumn("deviceId", lit(util.Random.nextInt(10))). // <-- one of 10 device ids (picked once, when the query is defined)
  as[(Timestamp, Long, Int)]                            // <-- convert to a "better" type (from "unpleasant" Row)

// stream processing using groupByKey operator
// groupByKey(func: ((Timestamp, Long, Int)) => K): KeyValueGroupedDataset[K, (Timestamp, Long, Int)]
// K becomes Int which is a device id
val deviceId: ((Timestamp, Long, Int)) => Int = { case (_, _, deviceId) => deviceId }
scala> val signalsByDevice = signals.groupByKey(deviceId)
signalsByDevice: org.apache.spark.sql.KeyValueGroupedDataset[Int,(java.sql.Timestamp, Long, Int)] = org.apache.spark.sql.KeyValueGroupedDataset@19d40bc6
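
One possible continuation (not part of the original demo; the aggregation, sink, output mode and trigger below are illustrative choices) is to count the signals per device and print the running counts to the console in update output mode:

// Count signals per device and write the running counts to the console (illustrative)
import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._
val countsByDevice = signalsByDevice.count // Dataset[(Int, Long)]
val deviceCountsQuery = countsByDevice.
  writeStream.
  format("console").
  outputMode("update").
  trigger(Trigger.ProcessingTime(10.seconds)).
  start
// ...and eventually stop the query
// deviceCountsQuery.stop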