HashJoin — Hash-Based Join Physical Operators

HashJoin is an extension of the JoinCodegenSupport abstraction for hash-based join physical operators with support for Java code generation.

Contract

BuildSide

buildSide: BuildSide

Preparing HashedRelation

prepareRelation(
  ctx: CodegenContext): HashedRelationInfo

Used when:

Implementations

Build Keys

buildKeys: Seq[Attribute]
Lazy Value

buildKeys is a Scala lazy value, so its initialization code is executed only once (when the value is accessed for the first time) and the computed value never changes afterwards.

Learn more in the Scala Language Specification.
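The once-only initialization described above can be illustrated with a minimal sketch. LazyValDemo, initCount and keys are hypothetical names used for illustration only and are not part of HashJoin.

```scala
object LazyValDemo {
  // Counts how many times the initializer below runs (illustration only).
  var initCount = 0

  // Hypothetical stand-in for a lazily computed list of keys.
  // The block runs on first access only; later accesses reuse the result.
  lazy val keys: Seq[String] = {
    initCount += 1
    Seq("id", "name")
  }
}
```

Reading LazyValDemo.keys any number of times leaves initCount at 1, since the initializer block is not evaluated again after the first access.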

buildKeys are the left keys when the BuildSide is BuildLeft, and the right keys when the BuildSide is BuildRight.

Important

HashJoin assumes that there are as many left keys as right keys and that the keys at the same position are of the same type.
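The assumption above can be sketched as a simple position-wise check. This is a simplified illustration, not the Spark implementation: Key, its dataType string and compatible are hypothetical stand-ins for join key expressions and their data types.

```scala
object KeyCheckDemo {
  // Hypothetical, simplified stand-in for a typed join key.
  final case class Key(name: String, dataType: String)

  // True when both sides have the same number of keys and the keys
  // at each position have the same data type.
  def compatible(leftKeys: Seq[Key], rightKeys: Seq[Key]): Boolean =
    leftKeys.length == rightKeys.length &&
      leftKeys.zip(rightKeys).forall { case (l, r) => l.dataType == r.dataType }
}
```

The length comparison comes first because zip silently drops the extra elements of the longer sequence, which would otherwise hide a mismatch in the number of keys.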

Streamed Keys

streamedKeys: Seq[Attribute]
Lazy Value

streamedKeys is a Scala lazy value, so its initialization code is executed only once (when the value is accessed for the first time) and the computed value never changes afterwards.

Learn more in the Scala Language Specification.

streamedKeys is the opposite of the build keys.

streamedKeys are the right keys when the BuildSide is BuildLeft, and the left keys when the BuildSide is BuildRight.
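How the build side decides which keys are the build keys and which are the streamed keys can be sketched as follows. The BuildSide hierarchy below is a simplified local model and splitKeys is a hypothetical helper, so treat this as an illustration of the rule rather than the actual Spark code.

```scala
object BuildSideDemo {
  // Simplified local model of the two build sides.
  sealed trait BuildSide
  case object BuildLeft extends BuildSide
  case object BuildRight extends BuildSide

  // Returns (buildKeys, streamedKeys) for the given build side.
  def splitKeys[K](
      buildSide: BuildSide,
      leftKeys: Seq[K],
      rightKeys: Seq[K]): (Seq[K], Seq[K]) =
    buildSide match {
      case BuildLeft  => (leftKeys, rightKeys)
      case BuildRight => (rightKeys, leftKeys)
    }
}
```

With BuildLeft, the left keys become the build keys and the right keys become the streamed keys; BuildRight swaps the two.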

Output Attributes

output: Seq[Attribute]

output is a collection of Attributes based on the joinType.

joinType        Output Schema
InnerLike       Output of the left operator followed by the output of the right operator
LeftOuter       Output of the left operator followed by the output of the right operator (with nullability on)
RightOuter      Output of the left operator (with nullability on) followed by the output of the right operator
ExistenceJoin   Output of the left operator followed by the exists attribute
LeftSemi        Output of the left operator
LeftAnti        Output of the left operator
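The output rules listed above can be sketched with a small model. Attr, the JoinType hierarchy (with Inner standing in for InnerLike) and the output function are simplified, hypothetical stand-ins defined only for this illustration.

```scala
object OutputDemo {
  // Hypothetical, simplified attribute: a name and a nullability flag.
  final case class Attr(name: String, nullable: Boolean) {
    def withNullability(n: Boolean): Attr = copy(nullable = n)
  }

  // Simplified local model of the join types (Inner stands in for InnerLike).
  sealed trait JoinType
  case object Inner extends JoinType
  case object LeftOuter extends JoinType
  case object RightOuter extends JoinType
  case object LeftSemi extends JoinType
  case object LeftAnti extends JoinType
  final case class ExistenceJoin(exists: Attr) extends JoinType

  // Output attributes per join type.
  def output(joinType: JoinType, left: Seq[Attr], right: Seq[Attr]): Seq[Attr] =
    joinType match {
      case Inner                 => left ++ right
      case LeftOuter             => left ++ right.map(_.withNullability(true))
      case RightOuter            => left.map(_.withNullability(true)) ++ right
      case ExistenceJoin(exists) => left :+ exists
      case LeftSemi | LeftAnti   => left
    }
}
```

In an outer join, the side that may have no match is the one marked nullable, because its columns are filled with nulls for unmatched rows.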

output is part of the QueryPlan abstraction.

CodegenSupport — Generating Java Source Code

HashJoin is a CodegenSupport (indirectly as a JoinCodegenSupport) and supports generating Java source code for consume and produce execution paths.

Consume Path

doConsume(
  ctx: CodegenContext,
  input: Seq[ExprCode],
  row: ExprCode): String

doConsume generates Java source code for the "consume" execution path based on the join type.

joinType        Code generation method
InnerLike       codegenInner
LeftOuter       codegenOuter
RightOuter      codegenOuter
LeftSemi        codegenSemi
LeftAnti        codegenAnti
ExistenceJoin   codegenExistence

doConsume is part of the CodegenSupport abstraction.

Anti Join

codegenAnti(
  ctx: CodegenContext,
  input: Seq[ExprCode]): String

codegenAnti...FIXME

codegenAnti is used when:

Inner Join

codegenInner(
  ctx: CodegenContext,
  input: Seq[ExprCode]): String

codegenInner prepares a HashedRelation (with the given CodegenContext).

codegenInner then generates the stream-side join key (genStreamSideJoinKey) and the join condition (getJoinCondition).

When isEmptyHashedRelation is enabled, codegenInner returns the following text (which is simply a comment with no executable code):

// If HashedRelation is empty, hash inner join simply returns nothing.

When keyIsUnique is enabled, codegenInner returns the following code:

// generate join key for stream side
[keyEv.code]
// find matches from HashedRelation
UnsafeRow [matched] = [anyNull] ? null: (UnsafeRow)[relationTerm].getValue([keyEv.value]);
if ([matched] != null) {
  [checkCondition] {
    [numOutput].add(1);
    [consume(ctx, resultVars)]
  }
}

For all other cases, codegenInner returns the following code:

// generate join key for stream side
[keyEv.code]
// find matches from HashRelation
Iterator[UnsafeRow] [matches] = [anyNull] ?
  null : (Iterator[UnsafeRow])[relationTerm].get([keyEv.value]);
if ([matches] != null) {
  while ([matches].hasNext()) {
    UnsafeRow [matched] = (UnsafeRow) [matches].next();
    [checkCondition] {
      [numOutput].add(1);
      [consume(ctx, resultVars)]
    }
  }
}
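What the two generated code variants do at runtime can be sketched in plain Scala. This is a behavioural illustration under simplifying assumptions: Relation is a hypothetical in-memory stand-in for a HashedRelation, rows are sequences of strings, and a join key of None models the anyNull case.

```scala
object InnerJoinProbeDemo {
  type Row = Seq[String]

  // Hypothetical in-memory stand-in for a hashed relation of the build side.
  final class Relation(rows: Map[Int, Seq[Row]]) {
    val keyIsUnique: Boolean = rows.values.forall(_.size <= 1)
    // Single-row lookup (unique keys); null when there is no match.
    def getValue(key: Int): Row = rows.get(key).flatMap(_.headOption).orNull
    // Multi-row lookup; null when there is no match.
    def get(key: Int): Iterator[Row] = rows.get(key).map(_.iterator).orNull
  }

  // Probes the relation with one streamed row and returns the joined rows.
  def probe(
      relation: Relation,
      streamed: Row,
      key: Option[Int],
      condition: (Row, Row) => Boolean): Seq[Row] =
    key match {
      case None => Seq.empty // a null join key never matches
      case Some(k) if relation.keyIsUnique =>
        val matched = relation.getValue(k)
        if (matched != null && condition(streamed, matched)) Seq(streamed ++ matched)
        else Seq.empty
      case Some(k) =>
        val matches = relation.get(k)
        if (matches == null) Seq.empty
        else matches.filter(condition(streamed, _)).map(streamed ++ _).toList
    }
}
```

The unique-key path does a single lookup per streamed row, while the general path has to iterate over all build-side rows that share the key and evaluate the join condition for each of them.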

Produce Path

doProduce(
  ctx: CodegenContext): String

doProduce assumes that the streamedPlan is a CodegenSupport and requests it to generate Java source code for the "produce" execution path.

doProduce is part of the CodegenSupport abstraction.

join

join(
  streamedIter: Iterator[InternalRow],
  hashed: HashedRelation,
  numOutputRows: SQLMetric): Iterator[InternalRow]

join branches off per JoinType to create a joined rows iterator (off the rows from the input streamedIter and hashed):

join creates a result projection.

In the end, for every row in the joined rows iterator, join increments the input numOutputRows SQL metric and applies the result projection.
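The final step (count the row, then project it) can be sketched as a lazy map over the joined rows iterator. Metric and finish are hypothetical names for this illustration; Metric is a plain counter standing in for a SQL metric and resultProj stands in for the result projection.

```scala
object JoinResultDemo {
  // Hypothetical stand-in for a SQL metric: a simple counter.
  final class Metric {
    var value: Long = 0L
    def add(n: Long): Unit = value += n
  }

  // For every joined row: increment the metric, then apply the projection.
  // Iterator.map is lazy, so nothing is counted until rows are consumed.
  def finish[A, B](joinedRows: Iterator[A], numOutputRows: Metric)(
      resultProj: A => B): Iterator[B] =
    joinedRows.map { row =>
      numOutputRows.add(1)
      resultProj(row)
    }
}
```

Because the iterator is lazy, the metric reflects only the rows that a downstream consumer has actually pulled.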

join reports an IllegalArgumentException for an unsupported JoinType:

HashJoin should not take [joinType] as the JoinType
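The fallback for an unsupported join type can be sketched as the default case of a pattern match. The join types below are a simplified local model, SomeOtherJoin is a made-up type that exists only to trigger the error, and check is a hypothetical helper.

```scala
object JoinDispatchDemo {
  // Simplified local model; SomeOtherJoin is a made-up unsupported type.
  sealed trait JoinType
  case object Inner extends JoinType
  case object LeftOuter extends JoinType
  case object SomeOtherJoin extends JoinType

  // Returns a marker for supported types and fails for anything else.
  def check(joinType: JoinType): String = joinType match {
    case Inner | LeftOuter => "supported"
    case other =>
      throw new IllegalArgumentException(
        s"HashJoin should not take $other as the JoinType")
  }
}
```

Failing fast in the default case surfaces a planner mistake as soon as the operator is asked to join, instead of silently producing wrong results.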

join is used when: