ML Persistence — Saving and Loading Models and Pipelines

MLWriter and MLReader belong to package.

They allow you to save and load models despite the languages — Scala, Java, Python or R — they have been saved in and loaded later on.


MLWriter abstract class comes with save(path: String) method to save a ML component to a given path.

save(path: String): Unit

It comes with another (chainable) method overwrite to overwrite the output path if it already exists.

overwrite(): this.type

The component is saved into a JSON file (see MLWriter Example section below).

Enable INFO logging level for the MLWriter implementation logger to see what happens inside.

Add the following line to conf/$.PipelineWriter=INFO

Refer to Logging.

FIXME The logging doesn’t work and overwriting does not print out INFO message to the logs :(

MLWriter Example

val pipeline = new Pipeline().setStages(Array.empty[PipelineStage])"sample-pipeline")

The result of save for "unfitted" pipeline is a JSON file for metadata (as shown below).

$ cat sample-pipeline/metadata/part-00000 | jq
  "class": "",
  "timestamp": 1472747720477,
  "sparkVersion": "2.1.0-SNAPSHOT",
  "uid": "pipeline_181c90b15d65",
  "paramMap": {
    "stageUids": []

The result of save for pipeline model is a JSON file for metadata while Parquet for model data, e.g. coefficients.

val model ="sample-model")
$ cat sample-model/metadata/part-00000 | jq
  "class": "",
  "timestamp": 1472748168005,
  "sparkVersion": "2.1.0-SNAPSHOT",
  "uid": "pipeline_3ed598da1c4b",
  "paramMap": {
    "stageUids": [

$ tree sample-model/stages/
|-- 0_regexTok_bf73e7c36e22
|   `-- metadata
|       |-- _SUCCESS
|       `-- part-00000
|-- 1_hashingTF_ebece38da130
|   `-- metadata
|       |-- _SUCCESS
|       `-- part-00000
`-- 2_logreg_819864aa7120
    |-- data
    |   |-- _SUCCESS
    |   `-- part-r-00000-56423674-0208-4768-9d83-2e356ac6a8d2.snappy.parquet
    `-- metadata
        |-- _SUCCESS
        `-- part-00000

7 directories, 8 files


MLReader abstract class comes with load(path: String) method to load a ML component from a given path.

val pipeline ="sample-pipeline")

scala> val stageCount = pipeline.getStages.size
stageCount: Int = 0

val pipelineModel ="sample-model")

scala> pipelineModel.stages
res1: Array[] = Array(regexTok_bf73e7c36e22, hashingTF_ebece38da130, logreg_819864aa7120)