Skip to content

BloomFilterAggregate Expression

BloomFilterAggregate is a TypedImperativeAggregate expression that uses BloomFilter for an aggregation buffer.

Creating Instance

BloomFilterAggregate takes the following to be created:

BloomFilterAggregate is created when:

Estimated Number of Items Expression

BloomFilterAggregate can be given Estimated Number of Items (as an Expression) when created.

Unless given, BloomFilterAggregate uses spark.sql.optimizer.runtime.bloomFilter.expectedNumItems configuration property.

Number of Bits Expression

BloomFilterAggregate can be given Number of Bits (as an Expression) when created.

The number of bits expression must be a constant literal (i.e., foldable) that evaluates to a long value.

The maximum value for the number of bits is spark.sql.optimizer.runtime.bloomFilter.maxNumBits configuration property.

The number of bits expression is the third expression (in this TernaryLike tree node).

Number of Bits

numBits: Long
Lazy Value

numBits is a Scala lazy value to guarantee that the code to initialize it is executed once only (when accessed for the first time) and the computed value never changes afterwards.

Learn more in the Scala Language Specification.

BloomFilterAggregate defines numBits value to be either the value of the numBitsExpression (after evaluating it to a number) or spark.sql.optimizer.runtime.bloomFilter.maxNumBits, whatever smaller.

The numBits value must be a positive value.

numBits is used to create an aggregation buffer.

Creating Aggregation Buffer

createAggregationBuffer(): BloomFilter

createAggregationBuffer is part of the TypedImperativeAggregate abstraction.

createAggregationBuffer creates a BloomFilter (with the estimated number of items and the number of bits).

Interpreted Execution

  buffer: BloomFilter): Any

eval is part of the TypedImperativeAggregate abstraction.

eval serializes the given buffer (unless the cardinality of this BloomFilter is 0 and eval returns null).

FIXME Why does eval return null?

Serializing Aggregate Buffer

  obj: BloomFilter): Array[Byte]

serialize is part of the TypedImperativeAggregate abstraction.

Two serializes

There is another serialize (in BloomFilterAggregate companion object) that just makes unit testing easier.