CollectSet Expression

CollectSet is a Collect expression (with a mutable.HashSet[Any] aggregation buffer).

CollectSet and Scala's HashSet

It's fair to say that CollectSet is essentially a Scala mutable.HashSet[Any] made available to Spark SQL as an aggregate function.
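To make the comparison concrete, here is a minimal sketch in plain Scala (no Spark involved) of the set behaviour that CollectSet builds on. The object name and the `values` input are made up for this illustration.

```scala
import scala.collection.mutable

object HashSetSketch {
  // Hypothetical input with duplicates (not taken from Spark)
  val values: Seq[Any] = Seq(1, 2, 2, 3, 1)

  // A mutable.HashSet[Any] keeps a single copy of every distinct element
  val buffer: mutable.HashSet[Any] = mutable.HashSet.empty[Any]
  values.foreach(v => buffer += v)

  // Iteration order of a HashSet is unspecified, so compare as sets
  val collected: Set[Any] = buffer.toSet
}
```

Adding the five input values leaves three elements in the buffer, which is the deduplication that collect_set exposes at the SQL level.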

Creating Instance

CollectSet takes the following to be created:

  • Child Expression
  • mutableAggBufferOffset (default: 0)
  • inputAggBufferOffset (default: 0)

CollectSet is created when the collect_set function is used (in a SQL query or as the collect_set standard function).

Pretty Name

prettyName: String

prettyName is part of the Expression abstraction.

prettyName is collect_set.

Creating Aggregation Buffer

createAggregationBuffer(): mutable.HashSet[Any]

createAggregationBuffer is part of the TypedImperativeAggregate abstraction.

createAggregationBuffer creates an empty mutable.HashSet (Scala).
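The role of the buffer can be sketched with a simplified stand-in written in plain Scala. The method names mirror TypedImperativeAggregate, but `CollectSetSketch`, its simplified signatures, and the `demo` helper are hypothetical and not Spark code. Skipping null inputs is an assumption of this sketch.

```scala
import scala.collection.mutable

// Simplified, hypothetical model of a collect_set-like aggregate (not Spark code)
object CollectSetSketch {
  type Buffer = mutable.HashSet[Any]

  // Every group (and every partial aggregation) starts from a fresh, empty buffer
  def createAggregationBuffer(): Buffer = mutable.HashSet.empty[Any]

  // Add one input value to the buffer; nulls are skipped (assumption of this sketch)
  def update(buffer: Buffer, value: Any): Buffer = {
    if (value != null) buffer += value
    buffer
  }

  // Combine two partial buffers, e.g. ones built on different partitions
  def merge(buffer: Buffer, other: Buffer): Buffer = buffer ++= other

  // Build two partial buffers and merge them
  def demo(): Set[Any] = {
    val left = Seq[Any](1, 2, 2).foldLeft(createAggregationBuffer())((b, v) => update(b, v))
    val right = Seq[Any](2, 3, null).foldLeft(createAggregationBuffer())((b, v) => update(b, v))
    merge(left, right).toSet
  }
}
```

Because the buffer is a set, merging partial results needs no extra deduplication step: a value seen on both sides is stored once.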

Interpreted Execution

eval(
  buffer: mutable.HashSet[Any]): Any

eval is part of the TypedImperativeAggregate abstraction.

eval creates a GenericArrayData with an array based on the DataType of the child expression:

  • For BinaryType, eval...FIXME
  • Otherwise, eval...FIXME

EliminateDistinct Logical Optimization

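CollectSet is duplicate-agnostic (isDuplicateAgnostic) as far as the EliminateDistinct logical optimization is concerned: a set discards duplicate values anyway, so a DISTINCT on the input does not change the result.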
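Why a DISTINCT is redundant here can be shown with a small sketch in plain Scala. The `collectSet` helper is a hypothetical stand-in for the aggregate, not Spark's implementation, and the input is made up.

```scala
object DuplicateAgnosticSketch {
  // Hypothetical stand-in for collect_set over a column of values
  def collectSet[A](values: Seq[A]): Set[A] = values.toSet

  // Made-up input with duplicates
  val input: Seq[String] = Seq("a", "b", "a", "c", "b")

  // Roughly collect_set(col) versus collect_set(DISTINCT col)
  val withoutDistinct: Set[String] = collectSet(input)
  val withDistinct: Set[String] = collectSet(input.distinct)
}
```

Both values hold the same set, which is why the deduplication step can be dropped for such an aggregate. An aggregate like a count would not have this property, since removing duplicates changes its result.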