CodeGenerator

CodeGenerator is an abstraction of JVM bytecode generators for expression evaluation.

The Scala definition of this abstract class is as follows:

abstract class CodeGenerator[InType <: AnyRef, OutType <: AnyRef]

Contract

bind

bind(
  in: InType,
  inputSchema: Seq[Attribute]): InType

Used when:

  • CodeGenerator is requested to generate

canonicalize

canonicalize(
  in: InType): InType

Used when:

  • CodeGenerator is requested to generate

create

create(
  in: InType): OutType

Used when:

  • CodeGenerator is requested to generate

Implementations

cache

cache: Cache[CodeAndComment, (GeneratedClass, ByteCodeStats)]

CodeGenerator creates a cache of generated classes when loaded (as a Scala object).

When requested to look up a non-existent CodeAndComment, cache runs doCompile, updates CodegenMetrics, and prints out the following INFO message to the logs:

Code generated in [timeMs] ms

cache allows for up to spark.sql.codegen.cache.maxEntries pairs.
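
The compile-on-miss behaviour with a bounded number of entries can be sketched with the standard library alone. This is a minimal illustration, not the cache implementation CodeGenerator uses: BoundedCompileCache, compileFn and the compilations counter are hypothetical names, println stands in for the INFO logger, and least-recently-used eviction is an assumption of this sketch.

```scala
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

// Hypothetical helper for illustration only; not part of Spark.
final class BoundedCompileCache[K, V](maxEntries: Int)(compileFn: K => V) {

  // Number of cache misses, i.e. how many times compileFn actually ran.
  var compilations: Int = 0

  // Access-ordered map that drops the least recently used entry
  // once it holds more than maxEntries pairs (an assumption of this sketch).
  private val entries = new JLinkedHashMap[K, V](16, 0.75f, true) {
    override protected def removeEldestEntry(eldest: JMap.Entry[K, V]): Boolean =
      size() > maxEntries
  }

  def get(key: K): V = synchronized {
    if (entries.containsKey(key)) {
      entries.get(key)
    } else {
      // Cache miss: compile, record the elapsed time, then store the result.
      val startNs = System.nanoTime()
      val compiled = compileFn(key)
      val timeMs = (System.nanoTime() - startNs).toDouble / 1000000
      compilations += 1
      println(s"Code generated in $timeMs ms") // stand-in for the INFO log message
      entries.put(key, compiled)
      compiled
    }
  }
}
```

Looking up the same key twice compiles it only once; a key that was evicted is compiled again on its next lookup.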

cache is used when:

  • CodeGenerator is requested to compile

Compiling Java Code

compile(
  code: CodeAndComment): (GeneratedClass, ByteCodeStats)

compile looks the given CodeAndComment up in the cache.


compile is used when:

generate

generate(
  expressions: InType): OutType
generate(
  expressions: InType,
  inputSchema: Seq[Attribute]): OutType // Binds the input expressions to the given input schema

generate creates a class for the input expressions (after canonicalization).
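
The way generate ties the contract together (bind, then canonicalize, then create) can be sketched as a template method. This is a simplified illustration under stated assumptions: SketchAttribute stands in for Catalyst's Attribute, and SketchCodeGenerator and ToyProjectionGenerator are hypothetical names that work on plain strings rather than Catalyst expressions.

```scala
// Simplified stand-in for Catalyst's Attribute (assumption of this sketch).
final case class SketchAttribute(name: String)

abstract class SketchCodeGenerator[InType <: AnyRef, OutType <: AnyRef] {
  // The three contract methods described in this page.
  protected def bind(in: InType, inputSchema: Seq[SketchAttribute]): InType
  protected def canonicalize(in: InType): InType
  protected def create(in: InType): OutType

  // Binds the input expressions to the schema first, then delegates.
  def generate(expressions: InType, inputSchema: Seq[SketchAttribute]): OutType =
    generate(bind(expressions, inputSchema))

  // Creates the output for the canonicalized expressions.
  def generate(expressions: InType): OutType =
    create(canonicalize(expressions))
}

// Toy implementation: "expressions" are column names and the "generated class"
// is just a descriptive string.
object ToyProjectionGenerator extends SketchCodeGenerator[Seq[String], String] {
  // Replaces each column name with its ordinal in the schema (-1 if absent).
  protected def bind(in: Seq[String], inputSchema: Seq[SketchAttribute]): Seq[String] =
    in.map(name => s"input[${inputSchema.indexWhere(_.name == name)}]")

  // Normalizes whitespace and case so equivalent inputs look the same.
  protected def canonicalize(in: Seq[String]): Seq[String] =
    in.map(_.trim.toLowerCase)

  protected def create(in: Seq[String]): String =
    in.mkString("Projection(", ", ", ")")
}
```

For example, ToyProjectionGenerator.generate(Seq("b", "a"), Seq(SketchAttribute("a"), SketchAttribute("b"))) binds both names to ordinals before the output is created.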

generate is used when:

  • Serializer (of ExpressionEncoder) is requested to apply
  • RowOrdering utility is used to createCodeGeneratedObject
  • SafeProjection utility is used to createCodeGeneratedObject
  • LazilyGeneratedOrdering is requested for generatedOrdering
  • ObjectOperator utility is used to deserializeRowToObject and serializeObjectToRow
  • ComplexTypedAggregateExpression is requested for inputRowToObj and bufferRowToObject
  • DefaultCachedBatchSerializer is requested to convertCachedBatchToInternalRow

Creating CodegenContext

newCodeGenContext(): CodegenContext

newCodeGenContext creates a new CodegenContext.

newCodeGenContext is used when:

doCompile

doCompile(
  code: CodeAndComment): (GeneratedClass, ByteCodeStats)

doCompile creates a ClassBodyEvaluator (Janino).

doCompile requests the ClassBodyEvaluator to use org.apache.spark.sql.catalyst.expressions.GeneratedClass as the name of the generated class and sets some default imports (to be included in the generated class).

doCompile requests the ClassBodyEvaluator to use GeneratedClass as a superclass of the generated class (for passing extra references objects into the generated class).

abstract class GeneratedClass {
  def generate(references: Array[Any]): Any
}
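
The role of the references array can be illustrated with a hand-written subclass that mimics what a compiled class might do. This is only a sketch: GeneratedClassSketch mirrors the shape shown above, while IntEvaluator, AddConstantClass and ReferencesDemo are hypothetical names, and a real generated class is compiled from Java source rather than written by hand.

```scala
// Same shape as the GeneratedClass contract shown above.
abstract class GeneratedClassSketch {
  def generate(references: Array[Any]): Any
}

// Hypothetical interface the caller expects to get back.
trait IntEvaluator {
  def eval(input: Int): Int
}

// Hand-written stand-in for a class that would normally come out of the compiler.
final class AddConstantClass extends GeneratedClassSketch {
  def generate(references: Array[Any]): Any = {
    // Objects that cannot be embedded in source code arrive through references.
    val constant = references(0).asInstanceOf[Int]
    new IntEvaluator {
      def eval(input: Int): Int = input + constant
    }
  }
}

object ReferencesDemo {
  // Instantiates the class, passes the references in, and evaluates once.
  def run(): Int = {
    val evaluator =
      new AddConstantClass().generate(Array[Any](40)).asInstanceOf[IntEvaluator]
    evaluator.eval(2)
  }
}
```

The caller casts the result of generate to the expected type, which is why the contract can return Any.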

doCompile prints out the following DEBUG message to the logs (with the given code):

[formatted code]

doCompile requests the ClassBodyEvaluator to cook (read, scan, parse and compile Java tokens) the source code and gets the bytecode statistics:

  • max method bytecode size
  • max constant pool size
  • number of inner classes

doCompile updates CodeGenerator code-gen metrics.

In the end, doCompile returns the GeneratedClass instance and bytecode statistics.


doCompile is used when:

  • CodeGenerator is requested to look up a code (in the cache)

Logging

CodeGenerator is an abstract class, so logging is configured using the loggers of its implementations.