Join Queries¶
From PostgreSQL's 2.6. Joins Between Tables:
Queries can access multiple tables at once, or access the same table in such a way that multiple rows of the table are being processed at the same time. A query that accesses multiple rows of the same or different tables at one time is called a join query.
Dataset Join Operators¶
Operator | Return Type | Description |
---|---|---|
crossJoin | DataFrame | Untyped Row -based cross join |
join | DataFrame | Untyped Row -based join |
joinWith | Dataset | Type-preserving join with two output columns for records for which a join condition holds |
join
operators create a DataFrame
with a Join logical operator.
crossJoin¶
crossJoin(
right: Dataset[_]): DataFrame
crossJoin
creates a Join logical operator with the Cross join type.
join¶
join(
right: Dataset[_]): DataFrame
join(
right: Dataset[_],
joinExprs: Column): DataFrame
join(
right: Dataset[_],
joinExprs: Column,
joinType: String): DataFrame
join(
right: Dataset[_],
usingColumns: Seq[String]): DataFrame
join(
right: Dataset[_],
usingColumns: Seq[String],
joinType: String): DataFrame
join(
right: Dataset[_],
usingColumn: String): DataFrame
join
creates a Join logical operator with the given join type or the Inner.
joinWith¶
joinWith[U](
other: Dataset[U],
condition: Column): Dataset[(T, U)]
joinWith[U](
other: Dataset[U],
condition: Column,
joinType: String): Dataset[(T, U)]
joinWith
creates a Join logical operator with the given join type or the Inner.
Query Execution Planning¶
JoinSelection execution planning strategy is used to plan Join logical operators.
Join Condition¶
Join condition (join expression) can be specified using the join operators, where or filter operators.
df1.join(df2, $"df1Key" === $"df2Key")
df1.join(df2).where($"df1Key" === $"df2Key")
df1.join(df2).filter($"df1Key" === $"df2Key")
Join Types¶
Join types can be specified using the join operators (using joinType
optional parameter).
df1.join(df2, $"df1Key" === $"df2Key", "inner")
Join names are case-insensitive and can use the underscore (_
) at any position (e.g. left_anti
and L_E_F_T_A_N_T_I
are equivalent).
SQL | JoinType | Parameter Name |
---|---|---|
CROSS | Cross | cross |
INNER | Inner | inner |
FULL OUTER | FullOuter | outer, full, fullouter |
LEFT ANTI | LeftAnti | leftanti |
LEFT OUTER | LeftOuter | leftouter, left |
LEFT SEMI | LeftSemi | leftsemi |
RIGHT OUTER | RightOuter | rightouter, right |
NATURAL | NaturalJoin | Special case for Inner, LeftOuter, RightOuter, FullOuter |
USING | UsingJoin | Special case for Inner, LeftOuter, LeftSemi, RightOuter, FullOuter, LeftAnti |
ExistenceJoin¶
ExistenceJoin
is an artifical join type used to express an existential sub-query, that is often referred to as existential join.
LeftAnti and ExistenceJoin are special cases of LeftOuter.
Join Families¶
InnerLike¶
InnerLike
with Inner and Cross
LeftExistence¶
LeftExistence
with LeftSemi, LeftAnti and ExistenceJoin
Demo¶
val left = Seq((0, "zero"), (1, "one")).toDF("id", "left")
val right = Seq((0, "zero"), (2, "two"), (3, "three")).toDF("id", "right")
Inner join¶
val q = left.join(right, "id")
+---+----+-----+
| id|left|right|
+---+----+-----+
| 0|zero| zero|
+---+----+-----+
== Physical Plan ==
*(1) Project [id#7, left#8, right#19]
+- *(1) BroadcastHashJoin [id#7], [id#18], Inner, BuildLeft, false
:- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [id=#15]
: +- LocalTableScan [id#7, left#8]
+- *(1) LocalTableScan [id#18, right#19]
Full outer¶
val q = left.join(right, Seq("id"), "fullouter")
+---+----+-----+
| id|left|right|
+---+----+-----+
| 1| one| null|
| 3|null|three|
| 2|null| two|
| 0|zero| zero|
+---+----+-----+
== Physical Plan ==
*(3) Project [coalesce(id#7, id#18) AS id#25, left#8, right#19]
+- SortMergeJoin [id#7], [id#18], FullOuter
:- *(1) Sort [id#7 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(id#7, 200), ENSURE_REQUIREMENTS, [id=#38]
: +- LocalTableScan [id#7, left#8]
+- *(2) Sort [id#18 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(id#18, 200), ENSURE_REQUIREMENTS, [id=#39]
+- LocalTableScan [id#18, right#19]
Left Anti¶
val q = left.join(right, Seq("id"), "leftanti")
+---+----+
| id|left|
+---+----+
| 1| one|
+---+----+
== Physical Plan ==
*(1) BroadcastHashJoin [id#7], [id#18], LeftAnti, BuildRight, false
:- *(1) LocalTableScan [id#7, left#8]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [id=#65]
+- LocalTableScan [id#18]