Skip to content

OptimizeIn Logical Optimization

OptimizeIn is a base logical optimization that <> as follows:

  1. Replaces an In expression that has an empty list and the value expression not nullable to false

  2. Eliminates duplicates of Literal expressions in an In predicate expression that is inSetConvertible

  3. Replaces an In predicate expression that is inSetConvertible with InSet expressions when the number of literal expressions in the list expression is greater than spark.sql.optimizer.inSetConversionThreshold internal configuration property

OptimizeIn is part of the Operator Optimization before Inferring Filters fixed-point batch in the standard batches of the Logical Optimizer.

OptimizeIn is simply a Catalyst rule for transforming logical plans, i.e. Rule[LogicalPlan].

// Use Catalyst DSL to define a logical plan

// HACK: Disable symbolToColumn implicit conversion
// It is imported automatically in spark-shell (and makes demos impossible)
// implicit def symbolToColumn(s: Symbol): org.apache.spark.sql.ColumnName
trait ThatWasABadIdea
implicit def symbolToColumn(ack: ThatWasABadIdea) = ack

import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.dsl.plans._

import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
val rel = LocalRelation('a.int, 'b.int, 'c.int)

import org.apache.spark.sql.catalyst.expressions.{In, Literal}
val plan = rel
  .where(In('a, Seq[Literal](1, 2, 3)))
  .analyze
scala> println(plan.numberedTreeString)
00 Filter a#6 IN (1,2,3)
01 +- LocalRelation <empty>, [a#6, b#7, c#8]

// In --> InSet
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", 0)

import org.apache.spark.sql.catalyst.optimizer.OptimizeIn
val optimizedPlan = OptimizeIn(plan)
scala> println(optimizedPlan.numberedTreeString)
00 Filter a#6 INSET (1,2,3)
01 +- LocalRelation <empty>, [a#6, b#7, c#8]

Executing Rule

apply(plan: LogicalPlan): LogicalPlan

apply...FIXME

apply is part of the Rule abstraction.