= Dataset API — Untyped Transformations
Untyped transformations are part of the Dataset API for transforming a Dataset to a DataFrame, a Column, a RelationalGroupedDataset, a DataFrameNaFunctions or a DataFrameStatFunctions (and hence untyped).

NOTE: Untyped transformations are the methods in the Dataset Scala class that are grouped in the `untypedrel` group name, i.e. `@group untypedrel`.
[[methods]]
.Dataset API's Untyped Transformations
[cols="1,2",options="header",width="100%"]
|===
| Transformation | Description

| <<agg, agg>>
a|
[source, scala]
----
agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame
agg(expr: Column, exprs: Column*): DataFrame
agg(exprs: Map[String, String]): DataFrame
----

| <<apply, apply>>
a|
Selects a column based on the column name (i.e. maps a Dataset onto a Column)

[source, scala]
----
apply(colName: String): Column
----

| <<col, col>>
a|
Selects a column based on the column name (i.e. maps a Dataset onto a Column)

[source, scala]
----
col(colName: String): Column
----

| <<colRegex, colRegex>>
a|
Selects a column based on the column name specified as a regex (i.e. maps a Dataset onto a Column)

[source, scala]
----
colRegex(colName: String): Column
----

| <<crossJoin, crossJoin>>
a|
[source, scala]
----
crossJoin(right: Dataset[_]): DataFrame
----

| <<cube, cube>>
a|
[source, scala]
----
cube(cols: Column*): RelationalGroupedDataset
cube(col1: String, cols: String*): RelationalGroupedDataset
----

| <<drop, drop>>
a|
[source, scala]
----
drop(colName: String): DataFrame
drop(colNames: String*): DataFrame
drop(col: Column): DataFrame
----

| <<groupBy, groupBy>>
a|
[source, scala]
----
groupBy(cols: Column*): RelationalGroupedDataset
groupBy(col1: String, cols: String*): RelationalGroupedDataset
----

| <<join, join>>
a|
[source, scala]
----
join(right: Dataset[_]): DataFrame
join(right: Dataset[_], usingColumn: String): DataFrame
join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
join(right: Dataset[_], joinExprs: Column): DataFrame
join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
----

| <<na, na>>
a|
[source, scala]
----
na: DataFrameNaFunctions
----

| <<rollup, rollup>>
a|
[source, scala]
----
rollup(cols: Column*): RelationalGroupedDataset
rollup(col1: String, cols: String*): RelationalGroupedDataset
----

| <<select, select>>
a|
[source, scala]
----
select(cols: Column*): DataFrame
select(col: String, cols: String*): DataFrame
----

| <<selectExpr, selectExpr>>
a|
[source, scala]
----
selectExpr(exprs: String*): DataFrame
----

| <<stat, stat>>
a|
[source, scala]
----
stat: DataFrameStatFunctions
----

| <<withColumn, withColumn>>
a|
[source, scala]
----
withColumn(colName: String, col: Column): DataFrame
----

| <<withColumnRenamed, withColumnRenamed>>
a|
[source, scala]
----
withColumnRenamed(existingName: String, newName: String): DataFrame
----
|===
=== [[agg]] agg Untyped Transformation

[source, scala]
----
agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame
agg(expr: Column, exprs: Column*): DataFrame
agg(exprs: Map[String, String]): DataFrame
----

agg...FIXME
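A minimal sketch of the three agg variants, each aggregating over the entire Dataset (no grouping keys); the local SparkSession and sample data below are assumptions for illustration only:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

// Assumed local session for demonstration
val spark = SparkSession.builder().master("local[1]").appName("agg-demo").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")

// Without groupBy, agg aggregates over the whole Dataset
val total  = df.agg(sum("value")).as[Long].head          // Column-based variant
val byPair = df.agg("value" -> "max").as[Int].head       // (column, function) pair
val byMap  = df.agg(Map("value" -> "sum")).as[Long].head // Map of column -> function

spark.stop()
```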
=== [[apply]] apply Untyped Transformation

[source, scala]
----
apply(colName: String): Column
----

apply selects a column based on the column name (i.e. maps a Dataset onto a Column).
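Since apply is Scala's function-application method, `df("id")` is sugar for `df.apply("id")`. A hedged sketch (the local SparkSession and sample data are assumptions):

```scala
import org.apache.spark.sql.SparkSession

// Assumed local session for demonstration
val spark = SparkSession.builder().master("local[1]").appName("apply-demo").getOrCreate()
import spark.implicits._

val df = Seq((1, "one"), (2, "two")).toDF("id", "name")

// df("id") desugars to df.apply("id") and yields a Column
val ids = df.select(df("id")).as[Int].collect().toSeq

spark.stop()
```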
=== [[col]] col Untyped Transformation

[source, scala]
----
col(colName: String): Column
----

col selects a column based on the column name (i.e. maps a Dataset onto a Column).

Internally, col branches off per the input column name.

If the column name is * (a star), col simply creates a Column with a ResolvedStar expression (with the schema output attributes of the analyzed logical plan of the QueryExecution).

Otherwise, col uses the colRegex untyped transformation when the spark.sql.parser.quotedRegexColumnNames configuration property is enabled.

In the case when the column name is not * and the spark.sql.parser.quotedRegexColumnNames configuration property is disabled, col creates a Column with the column name resolved (as a NamedExpression).
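The two main branches can be seen side by side: a plain name resolves to a single column, while `*` expands to all output attributes. A sketch assuming a local SparkSession and made-up data:

```scala
import org.apache.spark.sql.SparkSession

// Assumed local session for demonstration
val spark = SparkSession.builder().master("local[1]").appName("col-demo").getOrCreate()
import spark.implicits._

val df = Seq((1, "one")).toDF("id", "name")

// col("name") resolves a single named column
val one = df.select(df.col("name")).columns.toSeq
// col("*") creates a star that expands to all output attributes
val all = df.select(df.col("*")).columns.toSeq

spark.stop()
```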
=== [[colRegex]] colRegex Untyped Transformation

[source, scala]
----
colRegex(colName: String): Column
----

colRegex selects a column based on the column name specified as a regex (i.e. maps a Dataset onto a Column).

NOTE: colRegex is used in col when the spark.sql.parser.quotedRegexColumnNames configuration property is enabled (and the column name is not *).

Internally, colRegex matches the input column name against different regular expressions (in the following order):

- For column names with quotes and without a qualifier, colRegex simply creates a Column with an UnresolvedRegex (with no table)
- For column names with quotes and with a qualifier, colRegex simply creates a Column with an UnresolvedRegex (with a table specified)
- For other column names, colRegex (behaves like col and) creates a Column with the column name resolved (as a NamedExpression)
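The quoted-regex branch can be sketched as follows; the backticked name is treated as a regex and expands to every matching column (local SparkSession and sample data are assumptions):

```scala
import org.apache.spark.sql.SparkSession

// Assumed local session for demonstration
val spark = SparkSession.builder().master("local[1]").appName("colregex-demo").getOrCreate()
import spark.implicits._

val df = Seq((1, 2, "x")).toDF("col1", "col2", "name")

// The backticked (quoted) name is a regex: it matches col1 and col2, not name
val matched = df.select(df.colRegex("`col[12]`")).columns.toSeq

spark.stop()
```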
=== [[crossJoin]] crossJoin Untyped Transformation

[source, scala]
----
crossJoin(right: Dataset[_]): DataFrame
----

crossJoin...FIXME
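A minimal sketch of crossJoin producing the Cartesian product of two Datasets (the local SparkSession and sample data are assumptions):

```scala
import org.apache.spark.sql.SparkSession

// Assumed local session for demonstration
val spark = SparkSession.builder().master("local[1]").appName("crossjoin-demo").getOrCreate()
import spark.implicits._

val left  = Seq(1, 2).toDF("l")
val right = Seq("a", "b").toDF("r")

// Cartesian product: every row of left paired with every row of right
val n = left.crossJoin(right).count()

spark.stop()
```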
=== [[cube]] cube Untyped Transformation

[source, scala]
----
cube(cols: Column*): RelationalGroupedDataset
cube(col1: String, cols: String*): RelationalGroupedDataset
----

cube...FIXME
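A hedged sketch of cube: it aggregates over every combination of the grouping columns, including the grand total (local SparkSession and sample data are assumptions):

```scala
import org.apache.spark.sql.SparkSession

// Assumed local session for demonstration
val spark = SparkSession.builder().master("local[1]").appName("cube-demo").getOrCreate()
import spark.implicits._

val df = Seq(("a", "x", 1), ("a", "y", 2)).toDF("k1", "k2", "v")

// cube aggregates over (k1, k2), (k1), (k2) and the grand total:
// (a,x), (a,y), (a,null), (null,x), (null,y), (null,null) = 6 rows
val rows = df.cube("k1", "k2").sum("v").count()

spark.stop()
```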
=== [[drop]] Dropping One or More Columns -- drop Untyped Transformation

[source, scala]
----
drop(colName: String): DataFrame
drop(colNames: String*): DataFrame
drop(col: Column): DataFrame
----

drop...FIXME
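A minimal sketch of the three drop variants (local SparkSession and sample data are assumptions):

```scala
import org.apache.spark.sql.SparkSession

// Assumed local session for demonstration
val spark = SparkSession.builder().master("local[1]").appName("drop-demo").getOrCreate()
import spark.implicits._

val df = Seq((1, "one", true)).toDF("id", "name", "flag")

val byName  = df.drop("flag").columns.toSeq          // single column name
val byNames = df.drop("name", "flag").columns.toSeq  // varargs of names
val byCol   = df.drop(df("id")).columns.toSeq        // Column-based

spark.stop()
```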
=== [[groupBy]] groupBy Untyped Transformation

[source, scala]
----
groupBy(cols: Column*): RelationalGroupedDataset
groupBy(col1: String, cols: String*): RelationalGroupedDataset
----

groupBy...FIXME
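A minimal sketch of groupBy, which returns a RelationalGroupedDataset that an aggregation then turns back into a DataFrame (local SparkSession and sample data are assumptions):

```scala
import org.apache.spark.sql.SparkSession

// Assumed local session for demonstration
val spark = SparkSession.builder().master("local[1]").appName("groupby-demo").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")

// groupBy yields a RelationalGroupedDataset; sum yields a DataFrame again
val sums = df.groupBy("key").sum("value").as[(String, Long)].collect().toMap

spark.stop()
```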
=== [[join]] join Untyped Transformation

[source, scala]
----
join(right: Dataset[_]): DataFrame
join(right: Dataset[_], usingColumn: String): DataFrame
join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
join(right: Dataset[_], joinExprs: Column): DataFrame
join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
----

join...FIXME
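A hedged sketch of three common join variants, i.e. a USING column, an explicit join type, and a join expression (local SparkSession and sample data are assumptions):

```scala
import org.apache.spark.sql.SparkSession

// Assumed local session for demonstration
val spark = SparkSession.builder().master("local[1]").appName("join-demo").getOrCreate()
import spark.implicits._

val people = Seq((1, "alice"), (2, "bob")).toDF("id", "name")
val orders = Seq((1, "book"), (1, "pen"), (3, "mug")).toDF("id", "item")

val inner  = people.join(orders, "id").count()                          // USING column, inner by default
val left   = people.join(orders, Seq("id"), "left_outer").count()       // explicit join type
val byExpr = people.join(orders, people("id") === orders("id")).count() // join expression

spark.stop()
```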
=== [[na]] na Untyped Transformation

[source, scala]
----
na: DataFrameNaFunctions
----

na simply creates a DataFrameNaFunctions to work with missing data.
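A minimal sketch of two DataFrameNaFunctions operations, filling and dropping nulls (local SparkSession and sample data are assumptions):

```scala
import org.apache.spark.sql.SparkSession

// Assumed local session for demonstration
val spark = SparkSession.builder().master("local[1]").appName("na-demo").getOrCreate()
import spark.implicits._

// Option encodes as a nullable column: bob's v is null
val df = Seq(("alice", Some(1)), ("bob", None)).toDF("name", "v")

val filled    = df.na.fill(0, Seq("v")).select("v").as[Int].collect().toSeq // null -> 0
val remaining = df.na.drop().count()                                        // rows without nulls

spark.stop()
```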
=== [[rollup]] rollup Untyped Transformation

[source, scala]
----
rollup(cols: Column*): RelationalGroupedDataset
rollup(col1: String, cols: String*): RelationalGroupedDataset
----

rollup...FIXME
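A hedged sketch of rollup: it aggregates hierarchically over the grouping columns, so over (k1, k2), (k1) and the grand total, but (unlike cube) not over (k2) alone (local SparkSession and sample data are assumptions):

```scala
import org.apache.spark.sql.SparkSession

// Assumed local session for demonstration
val spark = SparkSession.builder().master("local[1]").appName("rollup-demo").getOrCreate()
import spark.implicits._

val df = Seq(("a", "x", 1), ("a", "y", 2)).toDF("k1", "k2", "v")

// Hierarchical subtotals: (a,x), (a,y), (a,null), (null,null) = 4 rows
val rows = df.rollup("k1", "k2").sum("v").count()

spark.stop()
```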
=== [[select]] select Untyped Transformation

[source, scala]
----
select(cols: Column*): DataFrame
select(col: String, cols: String*): DataFrame
----

select...FIXME
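A minimal sketch of both select variants, Column expressions and plain column names (local SparkSession and sample data are assumptions):

```scala
import org.apache.spark.sql.SparkSession

// Assumed local session for demonstration
val spark = SparkSession.builder().master("local[1]").appName("select-demo").getOrCreate()
import spark.implicits._

val df = Seq((1, "one")).toDF("id", "name")

val computed  = df.select($"id" + 1).as[Int].head     // Column expression
val reordered = df.select("name", "id").columns.toSeq // by column name

spark.stop()
```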
=== [[selectExpr]] Projecting Columns using SQL Statements -- selectExpr Untyped Transformation

[source, scala]
----
selectExpr(exprs: String*): DataFrame
----

selectExpr is like select, but accepts SQL expressions.

[source, scala]
----
val ds = spark.range(5)

scala> ds.selectExpr("rand() as random").show
16/04/14 23:16:06 INFO HiveSqlParser: Parsing command: rand() as random
+-------------------+
|             random|
+-------------------+
|  0.887675894185651|
|0.36766085091074086|
| 0.2700020856675186|
| 0.1489033635529543|
| 0.5862990791950973|
+-------------------+
----

Internally, selectExpr executes select with every expression in exprs mapped to a Column (using SparkSqlParser.parseExpression).

[source, scala]
----
scala> ds.select(expr("rand() as random")).show
+------------------+
|            random|
+------------------+
|0.5514319279894851|
|0.2876221510433741|
|0.4599999092045741|
|0.5708558868374893|
|0.6223314406247136|
+------------------+
----
=== [[stat]] stat Untyped Transformation

[source, scala]
----
stat: DataFrameStatFunctions
----

stat simply creates a DataFrameStatFunctions to work with statistic functions.
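A minimal sketch of one DataFrameStatFunctions helper, Pearson correlation (local SparkSession and sample data are assumptions):

```scala
import org.apache.spark.sql.SparkSession

// Assumed local session for demonstration
val spark = SparkSession.builder().master("local[1]").appName("stat-demo").getOrCreate()
import spark.implicits._

// y is exactly 2 * x, so the Pearson correlation should be 1.0
val df = Seq((1.0, 2.0), (2.0, 4.0), (3.0, 6.0)).toDF("x", "y")

val r = df.stat.corr("x", "y")

spark.stop()
```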
=== [[withColumn]] withColumn Untyped Transformation

[source, scala]
----
withColumn(colName: String, col: Column): DataFrame
----

withColumn...FIXME
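A minimal sketch of withColumn, which adds a new column or replaces an existing column of the same name (local SparkSession and sample data are assumptions):

```scala
import org.apache.spark.sql.SparkSession

// Assumed local session for demonstration
val spark = SparkSession.builder().master("local[1]").appName("withcolumn-demo").getOrCreate()
import spark.implicits._

val df = Seq((1, "one")).toDF("id", "name")

val added    = df.withColumn("doubled", $"id" * 2).columns.toSeq       // new column appended
val replaced = df.withColumn("id", $"id" + 10).select("id").as[Int].head // existing column replaced

spark.stop()
```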
=== [[withColumnRenamed]] withColumnRenamed Untyped Transformation

[source, scala]
----
withColumnRenamed(existingName: String, newName: String): DataFrame
----

withColumnRenamed...FIXME
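A minimal sketch of withColumnRenamed; note that renaming a non-existent column is a no-op rather than an error (local SparkSession and sample data are assumptions):

```scala
import org.apache.spark.sql.SparkSession

// Assumed local session for demonstration
val spark = SparkSession.builder().master("local[1]").appName("rename-demo").getOrCreate()
import spark.implicits._

val df = Seq((1, "one")).toDF("id", "name")

val renamed = df.withColumnRenamed("id", "person_id").columns.toSeq
// Renaming a column that does not exist leaves the schema unchanged
val noop    = df.withColumnRenamed("missing", "x").columns.toSeq

spark.stop()
```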