Spark MLlib

I’m new to Machine Learning as a discipline and Spark MLlib in particular so mistakes in this document are considered a norm (not an exception).

Spark MLlib is a module (a library / an extension) of Apache Spark to provide distributed machine learning algorithms on top of Spark’s RDD abstraction. Its goal is to simplify the development and usage of large scale machine learning.

You can find the following types of machine learning algorithms in MLlib:

  • Classification

  • Regression

  • Frequent itemsets (via FP-growth Algorithm)

  • Recommendation

  • Feature extraction and selection

  • Clustering

  • Statistics

  • Linear Algebra

You can also do the following using MLlib:

There are two libraries for Machine Learning in Spark MLlib: org.apache.spark.mllib for RDD-based Machine Learning and a higher-level API under for DataFrame-based Machine Learning with Pipelines.

Machine Learning uses large datasets to identify (infer) patterns and make decisions (aka predictions). Automated decision making is what makes Machine Learning so appealing. You can teach a system from a dataset and let the system act by itself to predict future.

The amount of data (measured in TB or PB) is what makes Spark MLlib especially important since a human could not possibly extract much value from the dataset in a short time.

Spark handles data distribution and makes the huge data available by means of RDDs, DataFrames, and recently Datasets.

Use cases for Machine Learning (and hence Spark MLlib that comes with appropriate algorithms):

  • Security monitoring and fraud detection

  • Operational optimizations

  • Product recommendations or (more broadly) Marketing optimization

  • Ad serving and optimization


This section introduces the concepts of Machine Learning and how they are modeled in Spark MLlib.


An observation is used to learn about or evaluate (i.e. draw conclusions about) the observed item’s target value.

Spark models observations as rows in a DataFrame.


A feature (aka dimension or variable) is an attribute of an observation. It is an independent variable.

Spark models features as columns in a DataFrame (one per feature or a set of features).

Ultimately, it is up to an algorithm to expect one or many features per column.

There are two classes of features:

  • Categorical with discrete values, i.e. the set of possible values is limited, and can range from one to many thousands. There is no ordering implied, and so the values are incomparable.

  • Numerical with quantitative values, i.e. any numerical values that you can compare to each other. You can further classify them into discrete and continuous features.


A label is a variable that a machine learning system learns to predict that are assigned to observations.

There are categorical and numerical labels.

A label is a dependent variable that depends on other dependent or independent variables like features.

FP-growth Algorithm

Spark 1.5 have significantly improved on frequent pattern mining capabilities with new algorithms for association rule generation and sequential pattern mining.

  • Frequent Itemset Mining using the Parallel FP-growth algorithm (since Spark 1.3)

    • Frequent Pattern Mining in MLlib User Guide

    • frequent pattern mining

      • reveals the most frequently visited site in a particular period

      • finds popular routing paths that generate most traffic in a particular region

    • models its input as a set of transactions, e.g. a path of nodes.

    • A transaction is a set of items, e.g. network nodes.

    • the algorithm looks for common subsets of items that appear across transactions, e.g. sub-paths of the network that are frequently traversed.

    • A naive solution: generate all possible itemsets and count their occurrence

    • A subset is considered a pattern when it appears in some minimum proportion of all transactions - the support.

    • the items in a transaction are unordered

    • analyzing traffic patterns from network logs

    • the algorithm finds all frequent itemsets without generating and testing all candidates

  • suffix trees (FP-trees) constructed and grown from filtered transactions

  • Also available in Mahout, but slower.

  • Distributed generation of association rules (since Spark 1.5).

    • in a retailer’s transaction database, a rule {toothbrush, floss} ⇒ {toothpaste} with a confidence value 0.8 would indicate that 80% of customers who buy a toothbrush and floss also purchase a toothpaste in the same transaction. The retailer could then use this information, put both toothbrush and floss on sale, but raise the price of toothpaste to increase overall profit.

    • FPGrowth model

  • parallel sequential pattern mining (since Spark 1.5)

    • PrefixSpan algorithm with modifications to parallelize the algorithm for Spark.

    • extract frequent sequential patterns like routing updates, activation failures, and broadcasting timeouts that could potentially lead to customer complaints and proactively reach out to customers when it happens.

Power Iteration Clustering

  • since Spark 1.3

  • unsupervised learning including clustering

  • identifying similar behaviors among users or network clusters

  • Power Iteration Clustering (PIC) in MLlib, a simple and scalable graph clustering method

    • PIC in MLlib User Guide

    • org.apache.spark.mllib.clustering.PowerIterationClustering

    • a graph algorithm

    • Among the first MLlib algorithms built upon GraphX.

    • takes an undirected graph with similarities defined on edges and outputs clustering assignment on nodes

    • uses truncated power iteration to find a very low-dimensional embedding of the nodes, and this embedding leads to effective graph clustering.

    • stores the normalized similarity matrix as a graph with normalized similarities defined as edge properties

    • The edge properties are cached and remain static during the power iterations.

    • The embedding of nodes is defined as node properties on the same graph topology.

    • update the embedding through power iterations, where aggregateMessages is used to compute matrix-vector multiplications, the essential operation in a power iteration method

    • k-means is used to cluster nodes using the embedding.

    • able to distinguish clearly the degree of similarity – as represented by the Euclidean distance among the points – even though their relationship is non-linear