Dataset API — Dataset Operators

The Dataset API is a set of operators, with typed and untyped transformations and actions, for working with a structured query (a Dataset) as a whole.

Dataset Operators (Transformations and Actions)

explain

// Uses simple mode
explain(): Unit
// Uses extended or simple mode
explain(
  extended: Boolean): Unit
explain(
  mode: String): Unit

A basic action that displays the logical and physical plans of the Dataset (with optional cost and codegen summaries) to the standard output
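For example, a minimal sketch for spark-shell (where `spark: SparkSession` is in scope; the query itself is illustrative only):

```scala
// spark-shell sketch: compare simple and extended explain output
import spark.implicits._

val q = spark.range(4).filter($"id" % 2 === 0)

q.explain()                 // physical plan only (simple mode)
q.explain(extended = true)  // parsed, analyzed, optimized and physical plans
```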

[cols="1,3",options="header",width="100%"]
|===
| Operator | Description

| agg a| [[agg]]

[source, scala]

agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame
agg(expr: Column, exprs: Column*): DataFrame
agg(exprs: Map[String, String]): DataFrame


An untyped transformation
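As a sketch for spark-shell (the `sales` dataset and its columns are made up for illustration): `agg` aggregates over the entire Dataset, i.e. it is a shorthand for `groupBy().agg(...)`:

```scala
// spark-shell sketch: whole-Dataset aggregation with agg
import org.apache.spark.sql.functions.{max, sum}
import spark.implicits._

val sales = Seq(("a", 10), ("a", 20), ("b", 5)).toDF("key", "amount")

sales.agg(sum("amount"), max("amount")).show()  // Column-based variant
sales.agg("amount" -> "sum").show()             // (column -> function) variant
```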

| alias a| [[alias]]

[source, scala]

alias(alias: String): Dataset[T]
alias(alias: Symbol): Dataset[T]

A typed transformation that is a mere synonym of <<as-alias, as>>.

| apply a| [[apply]]

[source, scala]

apply(colName: String): Column

An untyped transformation to select a column based on the column name (i.e. maps a Dataset onto a Column)

| as a| [[as-alias]]

[source, scala]

as(alias: String): Dataset[T]
as(alias: Symbol): Dataset[T]


A typed transformation

| as[U] a| [[as-type]]

[source, scala]

as[U : Encoder]: Dataset[U]

A typed transformation to enforce a type, i.e. marking the records in the Dataset as of a given data type (data type conversion). `as` simply changes the view of the data that is passed into typed operations (e.g. <<map, map>>) and does not eagerly project away any columns that are not present in the specified class.
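A sketch for spark-shell (the `Name` case class is made up for illustration) showing that `as[U]` keeps extra columns around until a typed operation uses only the declared fields:

```scala
// spark-shell sketch: as[U] changes the view, not the schema
import spark.implicits._

case class Name(name: String)

val people = Seq(("Jacek", 42), ("Agata", 33)).toDF("name", "age")
val names = people.as[Name]

names.printSchema()        // still has both name and age columns
names.map(_.name).show()   // the typed map uses only the declared field
```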

| cache a| [[cache]]

[source, scala]

cache(): this.type

A basic action that is a mere synonym of <<persist, persist>>.

| checkpoint a| [[checkpoint]]

[source, scala]

checkpoint(): Dataset[T]
checkpoint(eager: Boolean): Dataset[T]


A basic action to checkpoint the Dataset in a reliable way (using a reliable HDFS-compliant file system, e.g. Hadoop HDFS or Amazon S3)

| coalesce a| [[coalesce]]

[source, scala]

coalesce(numPartitions: Int): Dataset[T]

A typed transformation to reduce the number of partitions of a Dataset (without a shuffle)

| col a| [[col]]

[source, scala]

col(colName: String): Column

An untyped transformation to create a column (reference) based on the column name

| collect a| [[collect]]

[source, scala]

collect(): Array[T]

An action

| colRegex a| [[colRegex]]

[source, scala]

colRegex(colName: String): Column

An untyped transformation to create a column (reference) based on the column name specified as a regex

| columns a| [[columns]]

[source, scala]

columns: Array[String]

A basic action

| count a| [[count]]

[source, scala]

count(): Long

An action to count the number of rows

| createGlobalTempView a| [[createGlobalTempView]]

[source, scala]

createGlobalTempView(viewName: String): Unit

A basic action

| createOrReplaceGlobalTempView a| [[createOrReplaceGlobalTempView]]

[source, scala]

createOrReplaceGlobalTempView(viewName: String): Unit

A basic action

| createOrReplaceTempView a| [[createOrReplaceTempView]]

[source, scala]

createOrReplaceTempView(viewName: String): Unit

A basic action

| createTempView a| [[createTempView]]

[source, scala]

createTempView(viewName: String): Unit

A basic action

| crossJoin a| [[crossJoin]]

[source, scala]

crossJoin(right: Dataset[_]): DataFrame

An untyped transformation

| cube a| [[cube]]

[source, scala]

cube(cols: Column*): RelationalGroupedDataset
cube(col1: String, cols: String*): RelationalGroupedDataset


An untyped transformation

| describe a| [[describe]]

[source, scala]

describe(cols: String*): DataFrame

An action

| distinct a| [[distinct]]

[source, scala]

distinct(): Dataset[T]

A typed transformation that is a mere synonym of <<dropDuplicates, dropDuplicates>> (with all the columns of the Dataset)

| drop a| [[drop]]

[source, scala]

drop(colName: String): DataFrame
drop(colNames: String*): DataFrame
drop(col: Column): DataFrame


An untyped transformation

| dropDuplicates a| [[dropDuplicates]]

[source, scala]

dropDuplicates(): Dataset[T]
dropDuplicates(colNames: Array[String]): Dataset[T]
dropDuplicates(colNames: Seq[String]): Dataset[T]
dropDuplicates(col1: String, cols: String*): Dataset[T]


A typed transformation
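A spark-shell sketch (made-up data) of whole-row vs per-column deduplication:

```scala
// spark-shell sketch: dropDuplicates with and without a column subset
import spark.implicits._

val rows = Seq((1, "a"), (1, "a"), (1, "b")).toDF("id", "tag")

rows.dropDuplicates().count()     // 2 (whole-row duplicates removed)
rows.dropDuplicates("id").count() // 1 (duplicates by the id column only)
```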

| dtypes a| [[dtypes]]

[source, scala]

dtypes: Array[(String, String)]

A basic action

| except a| [[except]]

[source, scala]

except(other: Dataset[T]): Dataset[T]


A typed transformation

| exceptAll a| [[exceptAll]]

[source, scala]

exceptAll(other: Dataset[T]): Dataset[T]


(New in 2.4.0) A typed transformation

| filter a| [[filter]]

[source, scala]

filter(condition: Column): Dataset[T]
filter(conditionExpr: String): Dataset[T]
filter(func: T => Boolean): Dataset[T]


A typed transformation
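A spark-shell sketch of the three filter variants side by side (made-up data):

```scala
// spark-shell sketch: the Column-, SQL- and function-based filter variants
import spark.implicits._

val ns = Seq(1, 2, 3, 4).toDS   // the single column is named "value" by default

ns.filter($"value" > 2).count()  // Column expression
ns.filter("value > 2").count()   // SQL predicate
ns.filter(_ > 2).count()         // typed (Scala function) variant
```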

| first a| [[first]]

[source, scala]

first(): T

An action that is a mere synonym of <<head, head>>

| flatMap a| [[flatMap]]

[source, scala]

flatMap[U : Encoder](func: T => TraversableOnce[U]): Dataset[U]

A typed transformation

| foreach a| [[foreach]]

[source, scala]

foreach(f: T => Unit): Unit

An action

| foreachPartition a| [[foreachPartition]]

[source, scala]

foreachPartition(f: Iterator[T] => Unit): Unit

An action

| groupBy a| [[groupBy]]

[source, scala]

groupBy(cols: Column*): RelationalGroupedDataset
groupBy(col1: String, cols: String*): RelationalGroupedDataset


An untyped transformation

| groupByKey a| [[groupByKey]]

[source, scala]

groupByKey[K: Encoder](func: T => K): KeyValueGroupedDataset[K, T]

A typed transformation
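A spark-shell sketch (made-up data) of grouping by a computed key:

```scala
// spark-shell sketch: groupByKey produces a KeyValueGroupedDataset
import spark.implicits._

val words = Seq("ab", "cd", "ef", "g").toDS

val byLength = words.groupByKey(_.length)  // KeyValueGroupedDataset[Int, String]
byLength.count().show()                    // one (key, count) row per key
```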

| head a| [[head]]

[source, scala]

head(): T // <1>
head(n: Int): Array[T]


<1> Uses 1 for n

An action

| hint a| [[hint]]

[source, scala]

hint(name: String, parameters: Any*): Dataset[T]

A basic action to specify a hint (and optional parameters)

| inputFiles a| [[inputFiles]]

[source, scala]

inputFiles: Array[String]

A basic action

| intersect a| [[intersect]]

[source, scala]

intersect(other: Dataset[T]): Dataset[T]

A typed transformation

| intersectAll a| [[intersectAll]]

[source, scala]

intersectAll(other: Dataset[T]): Dataset[T]

(New in 2.4.0) A typed transformation

| isEmpty a| [[isEmpty]]

[source, scala]

isEmpty: Boolean

(New in 2.4.0) A basic action

| isLocal a| [[isLocal]]

[source, scala]

isLocal: Boolean

A basic action

| isStreaming a| [[isStreaming]]

[source, scala]

isStreaming: Boolean

A basic action

| join a| [[join]]

[source, scala]

join(right: Dataset[_]): DataFrame
join(right: Dataset[_], usingColumn: String): DataFrame
join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
join(right: Dataset[_], joinExprs: Column): DataFrame
join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame


An untyped transformation
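A spark-shell sketch of the `usingColumn` and `joinExprs` variants (datasets are made up):

```scala
// spark-shell sketch: join on a shared column name vs an explicit join expression
import spark.implicits._

val left  = Seq((1, "one"), (2, "two")).toDF("id", "name")
val right = Seq((1, "uno")).toDF("id", "word")

left.join(right, "id").show()  // inner join on the shared id column
left.join(right, left("id") === right("id"), "left_outer").show()
```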

| joinWith a| [[joinWith]]

[source, scala]

joinWith[U](other: Dataset[U], condition: Column): Dataset[(T, U)]
joinWith[U](other: Dataset[U], condition: Column, joinType: String): Dataset[(T, U)]


A typed transformation

| limit a| [[limit]]

[source, scala]

limit(n: Int): Dataset[T]

A typed transformation

| localCheckpoint a| [[localCheckpoint]]

[source, scala]

localCheckpoint(): Dataset[T]
localCheckpoint(eager: Boolean): Dataset[T]


A basic action to checkpoint the Dataset locally on executors (and therefore unreliably)

| map a| [[map]]

[source, scala]

map[U: Encoder](func: T => U): Dataset[U]

A typed transformation

| mapPartitions a| [[mapPartitions]]

[source, scala]

mapPartitions[U : Encoder](func: Iterator[T] => Iterator[U]): Dataset[U]

A typed transformation

| na a| [[na]]

[source, scala]

na: DataFrameNaFunctions

An untyped transformation

| orderBy a| [[orderBy]]

[source, scala]

orderBy(sortExprs: Column*): Dataset[T]
orderBy(sortCol: String, sortCols: String*): Dataset[T]


A typed transformation

| persist a| [[persist]]

[source, scala]

persist(): this.type
persist(newLevel: StorageLevel): this.type


A basic action to persist the Dataset

NOTE: Despite its category, persist is not an action in the common sense, i.e. it does not execute anything on the Spark cluster (neither on the driver nor on executors). It merely acts as a marker so the Dataset is persisted once an action is actually executed.

| printSchema a| [[printSchema]]

[source, scala]

printSchema(): Unit

A basic action

| randomSplit a| [[randomSplit]]

[source, scala]

randomSplit(weights: Array[Double]): Array[Dataset[T]]
randomSplit(weights: Array[Double], seed: Long): Array[Dataset[T]]

A typed transformation to split a Dataset randomly into two or more Datasets (one per weight)
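A spark-shell sketch; the weights are normalized, and a seed makes the split reproducible:

```scala
// spark-shell sketch: an (approximate) 80/20 split of 100 rows
val Array(train, test) = spark.range(100).randomSplit(Array(0.8, 0.2), seed = 42)

// the parts together cover all original rows
assert(train.count() + test.count() == 100)
```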

| rdd a| [[rdd]]

[source, scala]

rdd: RDD[T]

A basic action

| reduce a| [[reduce]]

[source, scala]

reduce(func: (T, T) => T): T

An action to reduce the records of the Dataset using the specified binary function.

| repartition a| [[repartition]]

[source, scala]

repartition(partitionExprs: Column*): Dataset[T]
repartition(numPartitions: Int): Dataset[T]
repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T]


A typed transformation to repartition a Dataset

| repartitionByRange a| [[repartitionByRange]]

[source, scala]

repartitionByRange(partitionExprs: Column*): Dataset[T]
repartitionByRange(numPartitions: Int, partitionExprs: Column*): Dataset[T]


A typed transformation

| rollup a| [[rollup]]

[source, scala]

rollup(cols: Column*): RelationalGroupedDataset
rollup(col1: String, cols: String*): RelationalGroupedDataset


An untyped transformation

| sample a| [[sample]]

[source, scala]

sample(withReplacement: Boolean, fraction: Double): Dataset[T]
sample(withReplacement: Boolean, fraction: Double, seed: Long): Dataset[T]
sample(fraction: Double): Dataset[T]
sample(fraction: Double, seed: Long): Dataset[T]


A typed transformation

| schema a| [[schema]]

[source, scala]

schema: StructType

A basic action

| select a| [[select]]

[source, scala]

// Untyped transformations
select(cols: Column*): DataFrame
select(col: String, cols: String*): DataFrame

// Typed transformations
select[U1](c1: TypedColumn[T, U1]): Dataset[U1]
select[U1, U2](c1: TypedColumn[T, U1], c2: TypedColumn[T, U2]): Dataset[(U1, U2)]
select[U1, U2, U3](
  c1: TypedColumn[T, U1],
  c2: TypedColumn[T, U2],
  c3: TypedColumn[T, U3]): Dataset[(U1, U2, U3)]
select[U1, U2, U3, U4](
  c1: TypedColumn[T, U1],
  c2: TypedColumn[T, U2],
  c3: TypedColumn[T, U3],
  c4: TypedColumn[T, U4]): Dataset[(U1, U2, U3, U4)]
select[U1, U2, U3, U4, U5](
  c1: TypedColumn[T, U1],
  c2: TypedColumn[T, U2],
  c3: TypedColumn[T, U3],
  c4: TypedColumn[T, U4],
  c5: TypedColumn[T, U5]): Dataset[(U1, U2, U3, U4, U5)]


An (untyped and typed) transformation
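A spark-shell sketch of the typed variant, which takes TypedColumns (a Column turned typed via `Column.as[U]`); the data is made up:

```scala
// spark-shell sketch: typed select yields a Dataset of tuples
import spark.implicits._

val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")

val pairs = df.select($"key".as[String], $"value".as[Int])  // Dataset[(String, Int)]
pairs.show()
```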

| selectExpr a| [[selectExpr]]

[source, scala]

selectExpr(exprs: String*): DataFrame

An untyped transformation

| show a| [[show]]

[source, scala]

show(): Unit
show(truncate: Boolean): Unit
show(numRows: Int): Unit
show(numRows: Int, truncate: Boolean): Unit
show(numRows: Int, truncate: Int): Unit
show(numRows: Int, truncate: Int, vertical: Boolean): Unit


An action

| sort a| [[sort]]

[source, scala]

sort(sortExprs: Column*): Dataset[T]
sort(sortCol: String, sortCols: String*): Dataset[T]

A typed transformation to sort elements globally (across partitions). Use the <<sortWithinPartitions, sortWithinPartitions>> transformation for a partition-local sort

| sortWithinPartitions a| [[sortWithinPartitions]]

[source, scala]

sortWithinPartitions(sortExprs: Column*): Dataset[T]
sortWithinPartitions(sortCol: String, sortCols: String*): Dataset[T]

A typed transformation to sort elements within partitions (aka local sort). Use the <<sort, sort>> transformation for a global sort (across partitions)

| stat a| [[stat]]

[source, scala]

stat: DataFrameStatFunctions

An untyped transformation

| storageLevel a| [[storageLevel]]

[source, scala]

storageLevel: StorageLevel

A basic action

| summary a| [[summary]]

[source, scala]

summary(statistics: String*): DataFrame

An action to calculate statistics (e.g. count, mean, stddev, min, max and 25%, 50%, 75% percentiles)

| take a| [[take]]

[source, scala]

take(n: Int): Array[T]

An action to take the first records of a Dataset

| toDF a| [[toDF]]

[source, scala]

toDF(): DataFrame
toDF(colNames: String*): DataFrame


A basic action to convert a Dataset to a DataFrame

| toJSON a| [[toJSON]]

[source, scala]

toJSON: Dataset[String]

A typed transformation

| toLocalIterator a| [[toLocalIterator]]

[source, scala]

toLocalIterator(): java.util.Iterator[T]

An action that returns an iterator with all rows in the Dataset. The iterator will consume as much memory as the largest partition in the Dataset.

| transform a| [[transform]]

[source, scala]

transform[U](t: Dataset[T] => Dataset[U]): Dataset[U]

A typed transformation for chaining custom transformations
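A spark-shell sketch; the `withDoubled` and `withSource` helpers are made up for illustration of how `transform` chains reusable transformations without breaking method chaining:

```scala
// spark-shell sketch: chaining custom transformations with transform
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
import spark.implicits._

def withDoubled(df: DataFrame): DataFrame = df.withColumn("doubled", $"id" * 2)
def withSource(df: DataFrame): DataFrame = df.withColumn("source", lit("demo"))

spark.range(3).toDF.transform(withDoubled).transform(withSource).show()
```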

| union a| [[union]]

[source, scala]

union(other: Dataset[T]): Dataset[T]

A typed transformation

| unionByName a| [[unionByName]]

[source, scala]

unionByName(other: Dataset[T]): Dataset[T]

A typed transformation

| unpersist a| [[unpersist]]

[source, scala]

unpersist(): this.type // <1>
unpersist(blocking: Boolean): this.type


<1> Uses unpersist with blocking disabled (false)

A basic action to unpersist the Dataset

| where a| [[where]]

[source, scala]

where(condition: Column): Dataset[T]
where(conditionExpr: String): Dataset[T]


A typed transformation

| withColumn a| [[withColumn]]

[source, scala]

withColumn(colName: String, col: Column): DataFrame

An untyped transformation

| withColumnRenamed a| [[withColumnRenamed]]

[source, scala]

withColumnRenamed(existingName: String, newName: String): DataFrame

An untyped transformation

| write a| [[write]]

[source, scala]

write: DataFrameWriter[T]

A basic action that returns a DataFrameWriter for saving the content of the (non-streaming) Dataset out to an external storage
|===