ML Pipelines (

ML Pipeline API (aka Spark ML or due to the package the API lives in) lets Spark users quickly and easily assemble and configure practical distributed Machine Learning pipelines (aka workflows) by standardizing the APIs for different Machine Learning concepts.

Both scikit-learn and GraphLab have the concept of pipelines built into their system.

The ML Pipeline API is a new DataFrame-based API developed under package and is the primary API for MLlib as of Spark 2.0.

The previous RDD-based API under org.apache.spark.mllib package is in maintenance-only mode which means that it is still maintained with bug fixes but no new features are expected.

The key concepts of Pipeline API (aka Components):

spark mllib pipeline
Figure 1. Pipeline with Transformers and Estimator (and corresponding Model)

The beauty of using Spark ML is that the ML dataset is simply a DataFrame (and all calculations are simply UDF applications on columns).

Use of a machine learning algorithm is only one component of a predictive analytic workflow. There can also be additional pre-processing steps for the machine learning algorithm to work.

While a RDD computation in Spark Core, a Dataset manipulation in Spark SQL, a continuous DStream computation in Spark Streaming are the main data abstractions a ML Pipeline is in Spark MLlib.

A typical standard machine learning workflow is as follows:

  1. Loading data (aka data ingestion)

  2. Extracting features (aka feature extraction)

  3. Training model (aka model training)

  4. Evaluate (or predictionize)

You may also think of two additional steps before the final model becomes production ready and hence of any use:

  1. Testing model (aka model testing)

  2. Selecting the best model (aka model selection or model tuning)

  3. Deploying model (aka model deployment and integration)

The Pipeline API lives under package.

Given the Pipeline Components, a typical machine learning pipeline is as follows:

  • You use a collection of Transformer instances to prepare input DataFrame - the dataset with proper input data (in columns) for a chosen ML algorithm.

  • You then fit (aka build) a Model.

  • With a Model you can calculate predictions (in prediction column) on features input column through DataFrame transformation.

Example: In text classification, preprocessing steps like n-gram extraction, and TF-IDF feature weighting are often necessary before training of a classification model like an SVM.

Upon deploying a model, your system must not only know the SVM weights to apply to input features, but also transform raw data into the format the model is trained on.

  • Pipeline for text categorization

  • Pipeline for image classification

Pipelines are like a query plan in a database system.

Components of ML Pipeline:

  • Pipeline Construction Framework – A DSL for the construction of pipelines that includes concepts of Nodes and Pipelines.

    • Nodes are data transformation steps (Transformers)

    • Pipelines are a DAG of Nodes.

      Pipelines become objects that can be saved out and applied in real-time to new data.

It can help creating domain-specific feature transformers, general purpose transformers, statistical utilities and nodes.

You could persist (i.e. save to a persistent storage) or unpersist (i.e. load from a persistent storage) ML components as described in Persisting Machine Learning Components.

A ML component is any object that belongs to Pipeline API, e.g. Pipeline, LinearRegressionModel, etc.

Features of Pipeline API

The features of the Pipeline API in Spark MLlib:

  • DataFrame as a dataset format

  • ML Pipelines API is similar to scikit-learn

  • Easy debugging (via inspecting columns added during execution)

  • Parameter tuning

  • Compositions (to build more complex pipelines out of existing ones)


A ML pipeline (or a ML workflow) is a sequence of Transformers and Estimators to fit a PipelineModel to an input dataset.

pipeline: DataFrame =[fit]=> DataFrame (using transformers and estimators)

A pipeline is represented by Pipeline class.


Pipeline is also an Estimator (so it is acceptable to set up a Pipeline with other Pipeline instances).

The Pipeline object can read or load pipelines (refer to Persisting Machine Learning Components page).

read: MLReader[Pipeline]
load(path: String): Pipeline

You can create a Pipeline with an optional uid identifier. It is of the format pipeline_[randomUid] when unspecified.

val pipeline = new Pipeline()

scala> println(pipeline.uid)

val pipeline = new Pipeline("my_pipeline")

scala> println(pipeline.uid)

The identifier uid is used to create an instance of PipelineModel to return from fit(dataset: DataFrame): PipelineModel method.

scala> val pipeline = new Pipeline("my_pipeline")
pipeline: = my_pipeline

scala> val df = (0 to 9).toDF("num")
df: org.apache.spark.sql.DataFrame = [num: int]

scala> val model = pipeline.setStages(Array()).fit(df)
model: = my_pipeline

The stages mandatory parameter can be set using setStages(value: Array[PipelineStage]): this.type method.

Pipeline Fitting (fit method)

fit(dataset: DataFrame): PipelineModel

The fit method returns a PipelineModel that holds a collection of Transformer objects that are results of method for every Estimator in the Pipeline (with possibly-modified dataset) or simply input Transformer objects. The input dataset DataFrame is passed to transform for every Transformer instance in the Pipeline.

It first transforms the schema of the input dataset DataFrame.

It then searches for the index of the last Estimator to calculate Transformers for Estimator and simply return Transformer back up to the index in the pipeline. For each Estimator the fit method is called with the input dataset. The result DataFrame is passed to the next Transformer in the chain.

An IllegalArgumentException exception is thrown when a stage is neither Estimator or Transformer.

transform method is called for every Transformer calculated but the last one (that is the result of executing fit on the last Estimator).

The calculated Transformers are collected.

After the last Estimator there can only be Transformer stages.

The method returns a PipelineModel with uid and transformers. The parent Estimator is the Pipeline itself.