Anatomy of a Spark Application

Every Spark application starts by creating a SparkContext.

Without a SparkContext, no computation (as a Spark job) can be started.
A Spark application is an instance of SparkContext. Or, to put it differently, a Spark context constitutes a Spark application.

A Spark application is uniquely identified by the pair of its application id and application attempt id.
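As a minimal sketch of how that pair can be inspected, the snippet below reads `applicationId` and `applicationAttemptId` from a running SparkContext. The object name `AppIdDemo`, the helper `appIdentity`, and the `local[*]` master are illustrative assumptions, not part of the original application. The attempt id is an `Option`, since not every cluster manager reports one.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object; the names here are illustrative only.
object AppIdDemo {

  // Starts a local context, reads its identity pair, and stops the context.
  def appIdentity(): (String, Option[String]) = {
    val conf = new SparkConf()
      .setAppName("AppIdDemo")
      .setMaster("local[*]") // assumption: run locally for the demo
    val sc = new SparkContext(conf)
    try {
      (sc.applicationId, sc.applicationAttemptId)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (appId, attemptId) = appIdentity()
    println(s"application id: $appId")
    println(s"application attempt id: ${attemptId.getOrElse("<none>")}")
  }
}
```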

package pl.japila.spark

import org.apache.spark.{SparkContext, SparkConf}

object SparkMeApp {
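  def main(args: Array[String]): Unit = {

    val masterURL = "local[*]"  // (1)

    val conf = new SparkConf()  // (2)
      .setAppName("SparkMe Application")
      .setMaster(masterURL)

    val sc = new SparkContext(conf) // (3)

    // Use the first argument as the file name, defaulting to build.sbt
    val fileName = scala.util.Try(args(0)).getOrElse("build.sbt")

    val lines = sc.textFile(fileName).cache() // (4)

    val c = lines.count() // (5)
    println(s"There are $c lines in $fileName")

    sc.stop() // stop the Spark context once processing is done
  }
}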
(1) Master URL to connect the application to
(2) Create Spark configuration
(3) Create Spark context
(4) Create lines RDD
(5) Execute count action

Spark shell creates a Spark context and SQL context for you at startup.

When a Spark application starts (using the spark-submit script or as a standalone application), it connects to the Spark master described by the master URL. This happens as part of the Spark context’s initialization.

Figure 1. Submitting a Spark application to the master using the master URL

Your Spark application can run locally or on a cluster, depending on the cluster manager and the deploy mode (--deploy-mode). Refer to Deployment Modes.
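One way to keep the same application usable both locally and on a cluster is to avoid hard-coding the master URL. The sketch below is an assumption about how you might structure this, not the only approach: it keeps whatever master spark-submit supplied and falls back to `local[*]` only when none is set, then prints the master and deploy mode the context ended up with. The names `MasterAndDeployModeDemo` and `buildConf` are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object; the names here are illustrative only.
object MasterAndDeployModeDemo {

  // new SparkConf() picks up spark.* settings passed in by spark-submit;
  // setIfMissing only applies the local fallback when no master was given.
  def buildConf(): SparkConf =
    new SparkConf()
      .setAppName("MasterAndDeployModeDemo")
      .setIfMissing("spark.master", "local[*]")

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(buildConf())
    try {
      println(s"master: ${sc.master}")
      println(s"deploy mode: ${sc.deployMode}")
    } finally {
      sc.stop()
    }
  }
}
```

Run directly, this falls back to local mode. Submitted with spark-submit and an explicit `--master` (and optionally `--deploy-mode`), the values given on the command line should take effect instead.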

You can then create RDDs, transform them into other RDDs, and ultimately execute actions. You can also cache intermediate RDDs to speed up data processing.
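The create, transform, cache, and act sequence can be sketched as follows. This is an illustrative example under stated assumptions: the object name `RddPipelineDemo`, the helper `run`, the input range, and the local master are all made up for the demonstration. The cached RDD is reused by two actions, which is the situation where caching typically pays off.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object; the names here are illustrative only.
object RddPipelineDemo {

  // Returns the number of squares and the even squares of 1 to 10.
  def run(): (Long, List[Int]) = {
    val conf = new SparkConf()
      .setAppName("RddPipelineDemo")
      .setMaster("local[*]") // assumption: run locally for the demo
    val sc = new SparkContext(conf)
    try {
      val numbers = sc.parallelize(1 to 10)          // create an RDD
      val squares = numbers.map(n => n * n).cache()  // transform and cache it
      val evens   = squares.filter(_ % 2 == 0)       // transform again

      val total = squares.count()                    // first action, fills the cache
      val evenSquares = evens.collect().toList       // second action, reuses the cache
      (total, evenSquares)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (total, evenSquares) = run()
    println(s"$total squares, even ones: ${evenSquares.mkString(", ")}")
  }
}
```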

After all the data processing is completed, the Spark application finishes by stopping the Spark context.
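A common way to make sure the context is stopped even when a job fails is to wrap the processing in `try`/`finally`. The sketch below assumes a local master and uses the hypothetical names `StopContextDemo` and `runAndStop`. It checks `isStopped` afterwards only to make the effect visible.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object; the names here are illustrative only.
object StopContextDemo {

  // Runs a small job, stops the context, and reports whether it is stopped.
  def runAndStop(): Boolean = {
    val conf = new SparkConf()
      .setAppName("StopContextDemo")
      .setMaster("local[*]") // assumption: run locally for the demo
    val sc = new SparkContext(conf)
    try {
      val n = sc.parallelize(Seq("a", "b", "c")).count()
      println(s"Processed $n elements")
    } finally {
      sc.stop() // runs whether or not the job above succeeded
    }
    sc.isStopped
  }

  def main(args: Array[String]): Unit =
    println(s"context stopped: ${runAndStop()}")
}
```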