= Dataset API — Dataset Operators
The Dataset API is a set of operators (typed and untyped transformations, and actions) to work with a structured query (a Dataset) as a whole.
== Dataset Operators (Transformations and Actions)
=== explain

[source, scala]
// Uses simple mode
explain(): Unit
// Uses extended or simple mode
explain(extended: Boolean): Unit
explain(mode: String): Unit

A basic action to display the logical and physical plans of the Dataset, i.e. it prints the logical and physical plans (with optional cost and codegen summaries) to the standard output.
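As a quick illustration, here is a minimal sketch of the three variants (it assumes a running `SparkSession` named `spark`; the query itself is made up for the example):

```scala
// Hypothetical demo query; assumes a running SparkSession named `spark`
import spark.implicits._

val q = spark.range(10).where($"id" > 5)

// Physical plan only (simple mode)
q.explain()

// Parsed, analyzed and optimized logical plans plus the physical plan
q.explain(extended = true)

// Mode given by name (as of Spark 3.0), e.g. "simple", "extended",
// "codegen", "cost", "formatted"
q.explain(mode = "extended")
```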
[cols="1,3",options="header",width="100%"]
|===
| Operator
| Description
| agg a| [[agg]]
[source, scala]
agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame
agg(expr: Column, exprs: Column*): DataFrame
agg(exprs: Map[String, String]): DataFrame

An untyped transformation
| alias a| [[alias]]
[source, scala]
alias(alias: String): Dataset[T]
alias(alias: Symbol): Dataset[T]

A typed transformation that is a mere synonym of `as`
| apply a| [[apply]]
[source, scala]
apply(colName: String): Column

An untyped transformation to select a column based on the column name (i.e. maps a Dataset onto a Column)
| as a| [[as]]
[source, scala]
as(alias: String): Dataset[T]
as(alias: Symbol): Dataset[T]

A typed transformation
| as a|
[source, scala]
as[U : Encoder]: Dataset[U]

A typed transformation to enforce a type, i.e. mark the records in the Dataset as being of a given data type (data type conversion). `as` simply changes the view of the data that is passed into typed operations (e.g. `map`)
| cache a| [[cache]]
[source, scala]
cache(): this.type

A basic action that is a mere synonym of `persist`
| checkpoint a| [[checkpoint]]
[source, scala]
checkpoint(): Dataset[T]
checkpoint(eager: Boolean): Dataset[T]

A basic action to checkpoint the Dataset in a reliable way (using a reliable HDFS-compliant file system, e.g. Hadoop HDFS or Amazon S3)
| coalesce a| [[coalesce]]
[source, scala]
coalesce(numPartitions: Int): Dataset[T]

A typed transformation to reduce the number of partitions of a Dataset (without a shuffle)
| col a| [[col]]
[source, scala]
col(colName: String): Column

An untyped transformation to create a column (reference) based on the column name
| collect a| [[collect]]
[source, scala]
collect(): Array[T]

An action
| colRegex a| [[colRegex]]
[source, scala]
colRegex(colName: String): Column

An untyped transformation to create a column (reference) based on the column name specified as a regex
| columns a| [[columns]]
[source, scala]
columns: Array[String]

A basic action
| count a| [[count]]
[source, scala]
count(): Long

An action to count the number of rows
| createGlobalTempView a| [[createGlobalTempView]]
[source, scala]
createGlobalTempView(viewName: String): Unit

A basic action
| createOrReplaceGlobalTempView a| [[createOrReplaceGlobalTempView]]
[source, scala]
createOrReplaceGlobalTempView(viewName: String): Unit

A basic action
| createOrReplaceTempView a| [[createOrReplaceTempView]]
[source, scala]
createOrReplaceTempView(viewName: String): Unit

A basic action
| createTempView a| [[createTempView]]
[source, scala]
createTempView(viewName: String): Unit

A basic action
| crossJoin a| [[crossJoin]]
[source, scala]
crossJoin(right: Dataset[_]): DataFrame

An untyped transformation
| cube a| [[cube]]
[source, scala]
cube(cols: Column*): RelationalGroupedDataset
cube(col1: String, cols: String*): RelationalGroupedDataset

An untyped transformation
| describe a| [[describe]]
[source, scala]
describe(cols: String*): DataFrame

An action
| distinct a| [[distinct]]
[source, scala]
distinct(): Dataset[T]

A typed transformation that is a mere synonym of `dropDuplicates` (with all the columns of the Dataset)
| drop a| [[drop]]
[source, scala]
drop(colName: String): DataFrame
drop(colNames: String*): DataFrame
drop(col: Column): DataFrame

An untyped transformation
| dropDuplicates a| [[dropDuplicates]]
[source, scala]
dropDuplicates(): Dataset[T]
dropDuplicates(colNames: Array[String]): Dataset[T]
dropDuplicates(colNames: Seq[String]): Dataset[T]
dropDuplicates(col1: String, cols: String*): Dataset[T]

A typed transformation
| dtypes a| [[dtypes]]
[source, scala]
dtypes: Array[(String, String)]

A basic action
| except a| [[except]]
[source, scala]
except(other: Dataset[T]): Dataset[T]

A typed transformation
| exceptAll a| [[exceptAll]]
[source, scala]
exceptAll(other: Dataset[T]): Dataset[T]

(New in 2.4.0) A typed transformation
| filter a| [[filter]]
[source, scala]
filter(condition: Column): Dataset[T]
filter(conditionExpr: String): Dataset[T]
filter(func: T => Boolean): Dataset[T]

A typed transformation
| first a| [[first]]
[source, scala]
first(): T

An action that is a mere synonym of `head`
| flatMap a| [[flatMap]]
[source, scala]
flatMap[U : Encoder](func: T => TraversableOnce[U]): Dataset[U]

A typed transformation
| foreach a| [[foreach]]
[source, scala]
foreach(f: T => Unit): Unit

An action
| foreachPartition a| [[foreachPartition]]
[source, scala]
foreachPartition(f: Iterator[T] => Unit): Unit

An action
| groupBy a| [[groupBy]]
[source, scala]
groupBy(cols: Column*): RelationalGroupedDataset
groupBy(col1: String, cols: String*): RelationalGroupedDataset

An untyped transformation
| groupByKey a| [[groupByKey]]
[source, scala]
groupByKey[K: Encoder](func: T => K): KeyValueGroupedDataset[K, T]

A typed transformation
| head a| [[head]]
[source, scala]
head(): T // <1>
head(n: Int): Array[T]

<1> Uses `1` for `n`

An action
| hint a| [[hint]]
[source, scala]
hint(name: String, parameters: Any*): Dataset[T]

A basic action to specify a hint (and optional parameters)
| inputFiles a| [[inputFiles]]
[source, scala]
inputFiles: Array[String]

A basic action
| intersect a| [[intersect]]
[source, scala]
intersect(other: Dataset[T]): Dataset[T]

A typed transformation
| intersectAll a| [[intersectAll]]
[source, scala]
intersectAll(other: Dataset[T]): Dataset[T]

(New in 2.4.0) A typed transformation
| isEmpty a| [[isEmpty]]
[source, scala]
isEmpty: Boolean

(New in 2.4.0) A basic action
| isLocal a| [[isLocal]]
[source, scala]
isLocal: Boolean

A basic action
| isStreaming a| [[isStreaming]]
[source, scala]
isStreaming: Boolean

| join a| [[join]]
[source, scala]
join(right: Dataset[_]): DataFrame
join(right: Dataset[_], usingColumn: String): DataFrame
join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
join(right: Dataset[_], joinExprs: Column): DataFrame
join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame

An untyped transformation
| joinWith a| [[joinWith]]
[source, scala]
joinWith[U](other: Dataset[U], condition: Column): Dataset[(T, U)]
joinWith[U](other: Dataset[U], condition: Column, joinType: String): Dataset[(T, U)]

A typed transformation
| limit a| [[limit]]
[source, scala]
limit(n: Int): Dataset[T]

A typed transformation
| localCheckpoint a| [[localCheckpoint]]
[source, scala]
localCheckpoint(): Dataset[T]
localCheckpoint(eager: Boolean): Dataset[T]

A basic action to checkpoint the Dataset locally on executors (and therefore unreliably)
| map a| [[map]]
[source, scala]
map[U : Encoder](func: T => U): Dataset[U]

A typed transformation
| mapPartitions a| [[mapPartitions]]
[source, scala]
mapPartitions[U : Encoder](func: Iterator[T] => Iterator[U]): Dataset[U]

A typed transformation
| na a| [[na]]
[source, scala]
na: DataFrameNaFunctions

An untyped transformation
| orderBy a| [[orderBy]]
[source, scala]
orderBy(sortExprs: Column*): Dataset[T]
orderBy(sortCol: String, sortCols: String*): Dataset[T]

A typed transformation
| persist a| [[persist]]
[source, scala]
persist(): this.type
persist(newLevel: StorageLevel): this.type

A basic action to persist the Dataset

NOTE: Despite its category, persist is not an action in the common sense of executing anything in a Spark cluster (on the driver or on executors). It acts only as a marker so that Dataset persistence happens once an action is actually executed.
| printSchema a| [[printSchema]]
[source, scala]
printSchema(): Unit

A basic action
| randomSplit a| [[randomSplit]]
[source, scala]
randomSplit(weights: Array[Double]): Array[Dataset[T]]
randomSplit(weights: Array[Double], seed: Long): Array[Dataset[T]]

A typed transformation to split a Dataset randomly into as many Datasets as there are weights
| rdd a| [[rdd]]
[source, scala]
rdd: RDD[T]

A basic action
| reduce a| [[reduce]]
[source, scala]
reduce(func: (T, T) => T): T

An action to reduce the records of the Dataset using the specified binary function
| repartition a| [[repartition]]
[source, scala]
repartition(partitionExprs: Column*): Dataset[T]
repartition(numPartitions: Int): Dataset[T]
repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T]

A typed transformation to repartition a Dataset
| repartitionByRange a| [[repartitionByRange]]
[source, scala]
repartitionByRange(partitionExprs: Column*): Dataset[T]
repartitionByRange(numPartitions: Int, partitionExprs: Column*): Dataset[T]

A typed transformation
| rollup a| [[rollup]]
[source, scala]
rollup(cols: Column*): RelationalGroupedDataset
rollup(col1: String, cols: String*): RelationalGroupedDataset

An untyped transformation
| sample a| [[sample]]
[source, scala]
sample(withReplacement: Boolean, fraction: Double): Dataset[T]
sample(withReplacement: Boolean, fraction: Double, seed: Long): Dataset[T]
sample(fraction: Double): Dataset[T]
sample(fraction: Double, seed: Long): Dataset[T]

A typed transformation
| schema a| [[schema]]
[source, scala]
schema: StructType

A basic action
| select a| [[select]]
[source, scala]¶
[source, scala]
// Untyped transformations
select(cols: Column*): DataFrame
select(col: String, cols: String*): DataFrame
// Typed transformations
select[U1](c1: TypedColumn[T, U1]): Dataset[U1]
select[U1, U2](c1: TypedColumn[T, U1], c2: TypedColumn[T, U2]): Dataset[(U1, U2)]
select[U1, U2, U3](c1: TypedColumn[T, U1], c2: TypedColumn[T, U2], c3: TypedColumn[T, U3]): Dataset[(U1, U2, U3)]
select[U1, U2, U3, U4](c1: TypedColumn[T, U1], c2: TypedColumn[T, U2], c3: TypedColumn[T, U3], c4: TypedColumn[T, U4]): Dataset[(U1, U2, U3, U4)]
select[U1, U2, U3, U4, U5](c1: TypedColumn[T, U1], c2: TypedColumn[T, U2], c3: TypedColumn[T, U3], c4: TypedColumn[T, U4], c5: TypedColumn[T, U5]): Dataset[(U1, U2, U3, U4, U5)]

An (untyped and typed) transformation
| selectExpr a| [[selectExpr]]
[source, scala]
selectExpr(exprs: String*): DataFrame

An untyped transformation
| show a| [[show]]
[source, scala]
show(): Unit
show(truncate: Boolean): Unit
show(numRows: Int): Unit
show(numRows: Int, truncate: Boolean): Unit
show(numRows: Int, truncate: Int): Unit
show(numRows: Int, truncate: Int, vertical: Boolean): Unit

An action
| sort a| [[sort]]
[source, scala]
sort(sortExprs: Column*): Dataset[T]
sort(sortCol: String, sortCols: String*): Dataset[T]

A typed transformation to sort elements globally (across partitions). Use `sortWithinPartitions` for a partition-local sort
| sortWithinPartitions a| [[sortWithinPartitions]]
[source, scala]
sortWithinPartitions(sortExprs: Column*): Dataset[T]
sortWithinPartitions(sortCol: String, sortCols: String*): Dataset[T]

A typed transformation to sort elements within partitions (aka local sort). Use `sort` for a global sort (across partitions)
| stat a| [[stat]]
[source, scala]
stat: DataFrameStatFunctions

An untyped transformation
| storageLevel a| [[storageLevel]]
[source, scala]
storageLevel: StorageLevel

A basic action
| summary a| [[summary]]
[source, scala]
summary(statistics: String*): DataFrame

An action to calculate statistics (e.g. `count`, `mean`, `stddev`, `min`, `max` and the `25%`, `50%`, `75%` percentiles)
| take a| [[take]]
[source, scala]
take(n: Int): Array[T]

An action to take the first `n` records of a Dataset
| toDF a| [[toDF]]
[source, scala]
toDF(): DataFrame
toDF(colNames: String*): DataFrame

A basic action to convert a Dataset to a DataFrame
| toJSON a| [[toJSON]]
[source, scala]
toJSON: Dataset[String]

A typed transformation
| toLocalIterator a| [[toLocalIterator]]
[source, scala]
toLocalIterator(): java.util.Iterator[T]

An action that returns an iterator over all rows in the Dataset. The iterator will consume as much memory as the largest partition in the Dataset.
| transform a| [[transform]]
[source, scala]
transform[U](t: Dataset[T] => Dataset[U]): Dataset[U]

A typed transformation for chaining custom transformations
| union a| [[union]]
[source, scala]
union(other: Dataset[T]): Dataset[T]

A typed transformation
| unionByName a| [[unionByName]]
[source, scala]
unionByName(other: Dataset[T]): Dataset[T]

A typed transformation
| unpersist a| [[unpersist]]
[source, scala]
unpersist(): this.type // <1>
unpersist(blocking: Boolean): this.type

<1> Uses `unpersist` with `blocking` disabled (`false`)

A basic action to unpersist the Dataset
| where a| [[where]]
[source, scala]
where(condition: Column): Dataset[T]
where(conditionExpr: String): Dataset[T]

A typed transformation
| withColumn a| [[withColumn]]
[source, scala]
withColumn(colName: String, col: Column): DataFrame

An untyped transformation
| withColumnRenamed a| [[withColumnRenamed]]
[source, scala]
withColumnRenamed(existingName: String, newName: String): DataFrame

An untyped transformation
| write a| [[write]]
[source, scala]
write: DataFrameWriter[T]

A basic action that returns a DataFrameWriter for saving the content of the (non-streaming) Dataset out to external storage
|===
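To make the typed/untyped distinction in the table concrete, here is a minimal sketch (it assumes a running `SparkSession` named `spark`; the `Person` case class is made up for the example):

```scala
import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._

// Hypothetical record type for the example
final case class Person(id: Long, name: String)

val people: Dataset[Person] = Seq(Person(0, "Zero"), Person(1, "One")).toDS

// Typed transformation: the record type is preserved, still Dataset[Person]
val filtered: Dataset[Person] = people.where($"id" > 0)

// Untyped transformation: the result is a DataFrame (i.e. Dataset[Row])
val names: DataFrame = people.select($"name")

// Transformations are lazy; an action such as count() triggers execution
assert(filtered.count() == 1)
```

Note how `where` keeps `Dataset[Person]` while `select` drops to `DataFrame`; regaining a typed view requires `as[U]` with an encoder in scope.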