Dataset API — Dataset Operators

The Dataset API is a set of operators, with typed and untyped transformations and actions, for working with a structured query (a Dataset) as a whole.

Dataset Operators (Transformations and Actions)

explain

// Uses simple mode
explain(): Unit
// Uses extended or simple mode
explain(
  extended: Boolean): Unit
explain(
  mode: String): Unit

A basic action that displays the logical and physical plans of the Dataset (with optional cost and codegen summaries) to the standard output
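For example, a minimal sketch for spark-shell (where `spark: SparkSession` is in scope; the query itself is illustrative only):

```scala
// spark-shell sketch: compare simple and extended explain output
import spark.implicits._

val q = spark.range(4).filter($"id" % 2 === 0)

q.explain()                 // physical plan only (simple mode)
q.explain(extended = true)  // parsed, analyzed, optimized and physical plans
```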

[cols="1,3",options="header",width="100%"]
|===
| Operator | Description

| agg a| [[agg]]

[source, scala]

agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame
agg(expr: Column, exprs: Column*): DataFrame
agg(exprs: Map[String, String]): DataFrame


An untyped transformation
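As a sketch for spark-shell (the `sales` dataset and its columns are made up for illustration): `agg` aggregates over the entire Dataset, i.e. it is a shorthand for `groupBy().agg(...)`:

```scala
// spark-shell sketch: whole-Dataset aggregation with agg
import org.apache.spark.sql.functions.{max, sum}
import spark.implicits._

val sales = Seq(("a", 10), ("a", 20), ("b", 5)).toDF("key", "amount")

sales.agg(sum("amount"), max("amount")).show()  // Column-based variant
sales.agg("amount" -> "sum").show()             // (column -> function) variant
```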

| alias a| [[alias]]

[source, scala]

alias(alias: String): Dataset[T]
alias(alias: Symbol): Dataset[T]

A typed transformation that is a mere synonym of <<as-alias, as>>.

| apply a| [[apply]]

[source, scala]

apply(colName: String): Column

An untyped transformation to select a column based on the column name (i.e. maps a Dataset onto a Column)

| as a| [[as-alias]]

[source, scala]

as(alias: String): Dataset[T]
as(alias: Symbol): Dataset[T]


A typed transformation

| as[U] a| [[as-type]]

[source, scala]

as[U : Encoder]: Dataset[U]

A typed transformation to enforce a type, i.e. marking the records in the Dataset as of a given data type (data type conversion). `as` simply changes the view of the data that is passed into typed operations (e.g. <<map, map>>) and does not eagerly project away any columns that are not present in the specified class.
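A sketch for spark-shell (the `Name` case class is made up for illustration) showing that `as[U]` keeps extra columns around until a typed operation uses only the declared fields:

```scala
// spark-shell sketch: as[U] changes the view, not the schema
import spark.implicits._

case class Name(name: String)

val people = Seq(("Jacek", 42), ("Agata", 33)).toDF("name", "age")
val names = people.as[Name]

names.printSchema()        // still has both name and age columns
names.map(_.name).show()   // the typed map uses only the declared field
```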

| cache a| [[cache]]

[source, scala]

cache(): this.type

A basic action that is a mere synonym of <<persist, persist>>.

| checkpoint a| [[checkpoint]]

[source, scala]

checkpoint(): Dataset[T]
checkpoint(eager: Boolean): Dataset[T]


A basic action to checkpoint the Dataset in a reliable way (using a reliable HDFS-compliant file system, e.g. Hadoop HDFS or Amazon S3)

| coalesce a| [[coalesce]]

[source, scala]

coalesce(numPartitions: Int): Dataset[T]

A typed transformation to reduce the number of partitions of a Dataset (without a shuffle)

| col a| [[col]]

[source, scala]

col(colName: String): Column

An untyped transformation to create a column (reference) based on the column name

| collect a| [[collect]]

[source, scala]

collect(): Array[T]

An action

| colRegex a| [[colRegex]]

[source, scala]

colRegex(colName: String): Column

An untyped transformation to create a column (reference) based on the column name specified as a regex

| columns a| [[columns]]

[source, scala]

columns: Array[String]

A basic action

| count a| [[count]]

[source, scala]

count(): Long

An action to count the number of rows

| createGlobalTempView a| [[createGlobalTempView]]

[source, scala]

createGlobalTempView(viewName: String): Unit

A basic action

| createOrReplaceGlobalTempView a| [[createOrReplaceGlobalTempView]]

[source, scala]

createOrReplaceGlobalTempView(viewName: String): Unit

A basic action

| createOrReplaceTempView a| [[createOrReplaceTempView]]

[source, scala]

createOrReplaceTempView(viewName: String): Unit

A basic action

| createTempView a| [[createTempView]]

[source, scala]

createTempView(viewName: String): Unit

A basic action

| crossJoin a| [[crossJoin]]

[source, scala]

crossJoin(right: Dataset[_]): DataFrame

An untyped transformation

| cube a| [[cube]]

[source, scala]

cube(cols: Column*): RelationalGroupedDataset
cube(col1: String, cols: String*): RelationalGroupedDataset


An untyped transformation

| describe a| [[describe]]

[source, scala]

describe(cols: String*): DataFrame

An action

| distinct a| [[distinct]]

[source, scala]

distinct(): Dataset[T]

A typed transformation that is a mere synonym of <<dropDuplicates, dropDuplicates>> (with all the columns of the Dataset)

| drop a| [[drop]]

[source, scala]

drop(colName: String): DataFrame
drop(colNames: String*): DataFrame
drop(col: Column): DataFrame


An untyped transformation

| dropDuplicates a| [[dropDuplicates]]

[source, scala]

dropDuplicates(): Dataset[T]
dropDuplicates(colNames: Array[String]): Dataset[T]
dropDuplicates(colNames: Seq[String]): Dataset[T]
dropDuplicates(col1: String, cols: String*): Dataset[T]


A typed transformation
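A spark-shell sketch (made-up data) of whole-row vs per-column deduplication:

```scala
// spark-shell sketch: dropDuplicates with and without a column subset
import spark.implicits._

val rows = Seq((1, "a"), (1, "a"), (1, "b")).toDF("id", "tag")

rows.dropDuplicates().count()     // 2 (whole-row duplicates removed)
rows.dropDuplicates("id").count() // 1 (duplicates by the id column only)
```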

| dtypes a| [[dtypes]]

[source, scala]

dtypes: Array[(String, String)]

A basic action

| except a| [[except]]

[source, scala]

except(other: Dataset[T]): Dataset[T]


A typed transformation

| exceptAll a| [[exceptAll]]

[source, scala]

exceptAll(other: Dataset[T]): Dataset[T]


(New in 2.4.0) A typed transformation

| filter a| [[filter]]

[source, scala]

filter(condition: Column): Dataset[T]
filter(conditionExpr: String): Dataset[T]
filter(func: T => Boolean): Dataset[T]


A typed transformation
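A spark-shell sketch of the three filter variants side by side (made-up data):

```scala
// spark-shell sketch: the Column-, SQL- and function-based filter variants
import spark.implicits._

val ns = Seq(1, 2, 3, 4).toDS   // the single column is named "value" by default

ns.filter($"value" > 2).count()  // Column expression
ns.filter("value > 2").count()   // SQL predicate
ns.filter(_ > 2).count()         // typed (Scala function) variant
```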

| first a| [[first]]

[source, scala]

first(): T

An action that is a mere synonym of <<head, head>>

| flatMap a| [[flatMap]]

[source, scala]

flatMap[U : Encoder](func: T => TraversableOnce[U]): Dataset[U]

A typed transformation

| foreach a| [[foreach]]

[source, scala]

foreach(f: T => Unit): Unit

An action

| foreachPartition a| [[foreachPartition]]

[source, scala]

foreachPartition(f: Iterator[T] => Unit): Unit

An action

| groupBy a| [[groupBy]]

[source, scala]

groupBy(cols: Column*): RelationalGroupedDataset
groupBy(col1: String, cols: String*): RelationalGroupedDataset


An untyped transformation

| groupByKey a| [[groupByKey]]

[source, scala]

groupByKey[K: Encoder](func: T => K): KeyValueGroupedDataset[K, T]

A typed transformation
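A spark-shell sketch (made-up data) of grouping by a computed key:

```scala
// spark-shell sketch: groupByKey produces a KeyValueGroupedDataset
import spark.implicits._

val words = Seq("ab", "cd", "ef", "g").toDS

val byLength = words.groupByKey(_.length)  // KeyValueGroupedDataset[Int, String]
byLength.count().show()                    // one (key, count) row per key
```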

| head a| [[head]]

[source, scala]

head(): T // <1>
head(n: Int): Array[T]


<1> Uses 1 for n

An action

| hint a| [[hint]]

[source, scala]

hint(name: String, parameters: Any*): Dataset[T]

A basic action to specify a hint (and optional parameters)

| inputFiles a| [[inputFiles]]

[source, scala]

inputFiles: Array[String]

A basic action

| intersect a| [[intersect]]

[source, scala]

intersect(other: Dataset[T]): Dataset[T]

A typed transformation

| intersectAll a| [[intersectAll]]

[source, scala]

intersectAll(other: Dataset[T]): Dataset[T]

(New in 2.4.0) A typed transformation

| isEmpty a| [[isEmpty]]

[source, scala]

isEmpty: Boolean

(New in 2.4.0) A basic action

| isLocal a| [[isLocal]]

[source, scala]

isLocal: Boolean

A basic action

| isStreaming a| [[isStreaming]]

[source, scala]

isStreaming: Boolean

A basic action

| join a| [[join]]

[source, scala]

join(right: Dataset[_]): DataFrame
join(right: Dataset[_], usingColumn: String): DataFrame
join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
join(right: Dataset[_], joinExprs: Column): DataFrame
join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame


An untyped transformation
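A spark-shell sketch of the `usingColumn` and `joinExprs` variants (datasets are made up):

```scala
// spark-shell sketch: join on a shared column name vs an explicit join expression
import spark.implicits._

val left  = Seq((1, "one"), (2, "two")).toDF("id", "name")
val right = Seq((1, "uno")).toDF("id", "word")

left.join(right, "id").show()  // inner join on the shared id column
left.join(right, left("id") === right("id"), "left_outer").show()
```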

| joinWith a| [[joinWith]]

[source, scala]

joinWith[U](other: Dataset[U], condition: Column): Dataset[(T, U)]
joinWith[U](other: Dataset[U], condition: Column, joinType: String): Dataset[(T, U)]


A typed transformation

| limit a| [[limit]]

[source, scala]

limit(n: Int): Dataset[T]

A typed transformation

| localCheckpoint a| [[localCheckpoint]]

[source, scala]

localCheckpoint(): Dataset[T]
localCheckpoint(eager: Boolean): Dataset[T]


A basic action to checkpoint the Dataset locally on executors (and therefore unreliably)

| map a| [[map]]

[source, scala]

map[U: Encoder](func: T => U): Dataset[U]

A typed transformation

| mapPartitions a| [[mapPartitions]]

[source, scala]

mapPartitions[U : Encoder](func: Iterator[T] => Iterator[U]): Dataset[U]

A typed transformation

| na a| [[na]]

[source, scala]

na: DataFrameNaFunctions

An untyped transformation

| orderBy a| [[orderBy]]

[source, scala]

orderBy(sortExprs: Column*): Dataset[T]
orderBy(sortCol: String, sortCols: String*): Dataset[T]


A typed transformation

| persist a| [[persist]]

[source, scala]

persist(): this.type
persist(newLevel: StorageLevel): this.type


A basic action to persist the Dataset

NOTE: Despite its category, persist is not an action in the common sense, i.e. it does not execute anything on the Spark cluster (neither on the driver nor on executors). It merely acts as a marker so the Dataset is persisted once an action is actually executed.

| printSchema a| [[printSchema]]

[source, scala]

printSchema(): Unit

A basic action

| randomSplit a| [[randomSplit]]

[source, scala]

randomSplit(weights: Array[Double]): Array[Dataset[T]]
randomSplit(weights: Array[Double], seed: Long): Array[Dataset[T]]

A typed transformation to split a Dataset randomly into two or more Datasets (one per weight)
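A spark-shell sketch; the weights are normalized, and a seed makes the split reproducible:

```scala
// spark-shell sketch: an (approximate) 80/20 split of 100 rows
val Array(train, test) = spark.range(100).randomSplit(Array(0.8, 0.2), seed = 42)

// the parts together cover all original rows
assert(train.count() + test.count() == 100)
```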

| rdd a| [[rdd]]

[source, scala]

rdd: RDD[T]

A basic action

| reduce a| [[reduce]]

[source, scala]

reduce(func: (T, T) => T): T

An action to reduce the records of the Dataset using the specified binary function.

| repartition a| [[repartition]]

[source, scala]

repartition(partitionExprs: Column*): Dataset[T]
repartition(numPartitions: Int): Dataset[T]
repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T]


A typed transformation to repartition a Dataset

| repartitionByRange a| [[repartitionByRange]]

[source, scala]

repartitionByRange(partitionExprs: Column*): Dataset[T]
repartitionByRange(numPartitions: Int, partitionExprs: Column*): Dataset[T]


A typed transformation

| rollup a| [[rollup]]

[source, scala]

rollup(cols: Column*): RelationalGroupedDataset
rollup(col1: String, cols: String*): RelationalGroupedDataset


An untyped transformation

| sample a| [[sample]]

[source, scala]

sample(withReplacement: Boolean, fraction: Double): Dataset[T]
sample(withReplacement: Boolean, fraction: Double, seed: Long): Dataset[T]
sample(fraction: Double): Dataset[T]
sample(fraction: Double, seed: Long): Dataset[T]


A typed transformation

| schema a| [[schema]]

[source, scala]

schema: StructType

A basic action

| select a| [[select]]

[source, scala]

// Untyped transformations
select(cols: Column*): DataFrame
select(col: String, cols: String*): DataFrame

// Typed transformations
select[U1](c1: TypedColumn[T, U1]): Dataset[U1]
select[U1, U2](c1: TypedColumn[T, U1], c2: TypedColumn[T, U2]): Dataset[(U1, U2)]
select[U1, U2, U3](
  c1: TypedColumn[T, U1],
  c2: TypedColumn[T, U2],
  c3: TypedColumn[T, U3]): Dataset[(U1, U2, U3)]
select[U1, U2, U3, U4](
  c1: TypedColumn[T, U1],
  c2: TypedColumn[T, U2],
  c3: TypedColumn[T, U3],
  c4: TypedColumn[T, U4]): Dataset[(U1, U2, U3, U4)]
select[U1, U2, U3, U4, U5](
  c1: TypedColumn[T, U1],
  c2: TypedColumn[T, U2],
  c3: TypedColumn[T, U3],
  c4: TypedColumn[T, U4],
  c5: TypedColumn[T, U5]): Dataset[(U1, U2, U3, U4, U5)]


An (untyped and typed) transformation
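A spark-shell sketch of the typed variant, which takes TypedColumns (a Column turned typed via `Column.as[U]`); the data is made up:

```scala
// spark-shell sketch: typed select yields a Dataset of tuples
import spark.implicits._

val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")

val pairs = df.select($"key".as[String], $"value".as[Int])  // Dataset[(String, Int)]
pairs.show()
```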

| selectExpr a| [[selectExpr]]

[source, scala]

selectExpr(exprs: String*): DataFrame

An untyped transformation

| show a| [[show]]

[source, scala]

show(): Unit
show(truncate: Boolean): Unit
show(numRows: Int): Unit
show(numRows: Int, truncate: Boolean): Unit
show(numRows: Int, truncate: Int): Unit
show(numRows: Int, truncate: Int, vertical: Boolean): Unit


An action

| sort a| [[sort]]

[source, scala]

sort(sortExprs: Column*): Dataset[T]
sort(sortCol: String, sortCols: String*): Dataset[T]

A typed transformation to sort elements globally (across partitions). Use the <<sortWithinPartitions, sortWithinPartitions>> transformation for a partition-local sort

| sortWithinPartitions a| [[sortWithinPartitions]]

[source, scala]

sortWithinPartitions(sortExprs: Column*): Dataset[T]
sortWithinPartitions(sortCol: String, sortCols: String*): Dataset[T]

A typed transformation to sort elements within partitions (aka local sort). Use the <<sort, sort>> transformation for a global sort (across partitions)

| stat a| [[stat]]

[source, scala]

stat: DataFrameStatFunctions

An untyped transformation

| storageLevel a| [[storageLevel]]

[source, scala]

storageLevel: StorageLevel

A basic action

| summary a| [[summary]]

[source, scala]

summary(statistics: String*): DataFrame

An action to calculate statistics (e.g. count, mean, stddev, min, max and 25%, 50%, 75% percentiles)

| take a| [[take]]

[source, scala]

take(n: Int): Array[T]

An action to take the first records of a Dataset

| toDF a| [[toDF]]

[source, scala]

toDF(): DataFrame
toDF(colNames: String*): DataFrame


A basic action to convert a Dataset to a DataFrame

| toJSON a| [[toJSON]]

[source, scala]

toJSON: Dataset[String]

A typed transformation

| toLocalIterator a| [[toLocalIterator]]

[source, scala]

toLocalIterator(): java.util.Iterator[T]

An action that returns an iterator with all rows in the Dataset. The iterator will consume as much memory as the largest partition in the Dataset.

| transform a| [[transform]]

[source, scala]

transform[U](t: Dataset[T] => Dataset[U]): Dataset[U]

A typed transformation for chaining custom transformations
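A spark-shell sketch; the `withDoubled` and `withSource` helpers are made up for illustration of how `transform` chains reusable transformations without breaking method chaining:

```scala
// spark-shell sketch: chaining custom transformations with transform
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
import spark.implicits._

def withDoubled(df: DataFrame): DataFrame = df.withColumn("doubled", $"id" * 2)
def withSource(df: DataFrame): DataFrame = df.withColumn("source", lit("demo"))

spark.range(3).toDF.transform(withDoubled).transform(withSource).show()
```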

| union a| [[union]]

[source, scala]

union(other: Dataset[T]): Dataset[T]

A typed transformation

| unionByName a| [[unionByName]]

[source, scala]

unionByName(other: Dataset[T]): Dataset[T]

A typed transformation

| unpersist a| [[unpersist]]

[source, scala]

unpersist(): this.type // <1>
unpersist(blocking: Boolean): this.type


<1> Uses unpersist with blocking disabled (false)

A basic action to unpersist the Dataset

| where a| [[where]]

[source, scala]

where(condition: Column): Dataset[T]
where(conditionExpr: String): Dataset[T]


A typed transformation

| withColumn a| [[withColumn]]

[source, scala]

withColumn(colName: String, col: Column): DataFrame

An untyped transformation

| withColumnRenamed a| [[withColumnRenamed]]

[source, scala]

withColumnRenamed(existingName: String, newName: String): DataFrame

An untyped transformation

| write a| [[write]]

[source, scala]

write: DataFrameWriter[T]

A basic action that returns a DataFrameWriter for saving the content of the (non-streaming) Dataset out to an external storage
|===