SparkSubmit

SparkSubmit is the entry point of the spark-submit shell script.

Special Primary Resource Names

SparkSubmit uses the following special primary resource names to represent Spark shells rather than application jars:

pyspark-shell

SparkSubmit uses pyspark-shell when:

isShell

isShell(
  res: String): Boolean

isShell is true when the given res primary resource represents a Spark shell.
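The check can be sketched as a comparison against the special shell resource names; a minimal sketch, assuming the three shell names Spark uses (spark-shell for Scala, pyspark-shell for Python, sparkr-shell for R):

```scala
// Minimal sketch of isShell, assuming the three special shell resource names
// (spark-shell for Scala, pyspark-shell for Python, sparkr-shell for R).
object ShellCheck {
  private val ShellNames = Set("spark-shell", "pyspark-shell", "sparkr-shell")

  def isShell(res: String): Boolean = ShellNames.contains(res)
}
```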

isShell is used when:

Actions

SparkSubmit executes actions (based on the action argument).

Killing Submission

kill(
  args: SparkSubmitArguments): Unit

kill...FIXME

Displaying Version

printVersion(): Unit

printVersion...FIXME

Submission Status

requestStatus(
  args: SparkSubmitArguments): Unit

requestStatus...FIXME

Application Submission

submit(
  args: SparkSubmitArguments,
  uninitLog: Boolean): Unit

submit doRunMain unless both isStandaloneCluster and useRest hold.

For isStandaloneCluster with useRest requested, submit...FIXME

doRunMain

doRunMain(): Unit

doRunMain runMain unless proxyUser is specified.

With proxyUser specified, doRunMain...FIXME

Running Main Class

runMain(
  args: SparkSubmitArguments,
  uninitLog: Boolean): Unit

runMain prepares submit environment for the given SparkSubmitArguments (that gives childArgs, childClasspath, sparkConf and childMainClass).

With verbose enabled, runMain prints out the following INFO messages to the logs:

Main class:
[childMainClass]
Arguments:
[childArgs]
Spark config:
[sparkConf_redacted]
Classpath elements:
[childClasspath]

runMain creates and sets a context classloader (based on spark.driver.userClassPathFirst configuration property) and adds the jars (from childClasspath).

runMain loads the main class (childMainClass).

runMain creates a SparkApplication (if the main class is a subtype of SparkApplication) or a JavaMainApplication (wrapping the main class otherwise).

In the end, runMain requests the SparkApplication to start (with the childArgs and sparkConf).
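The selection between a SparkApplication and a JavaMainApplication wrapper can be sketched with plain reflection; the trait and classes below are stand-ins for illustration, not Spark's own types:

```scala
// Sketch of runMain's application selection (stand-in types, not Spark's).
object AppSelect {
  // Stand-in for org.apache.spark.deploy.SparkApplication
  trait SparkApplication { def start(args: Array[String]): Unit }

  class DirectApp extends SparkApplication {
    override def start(args: Array[String]): Unit = ()
  }

  // A class with a conventional main method and no SparkApplication relation
  class PlainMain

  // Mirrors the decision: subtype of SparkApplication => instantiate directly,
  // otherwise wrap it in a JavaMainApplication-like adapter.
  def select(mainClass: Class[_]): String =
    if (classOf[SparkApplication].isAssignableFrom(mainClass)) "SparkApplication"
    else "JavaMainApplication"
}
```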

Cluster Managers

SparkSubmit has built-in support for some cluster managers (that are selected based on the master argument).

Nickname      Master URL
KUBERNETES    k8s:// prefix
LOCAL         local prefix
MESOS         mesos prefix
STANDALONE    spark prefix
YARN          yarn
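The prefix rules above can be sketched as a pattern match; the string nicknames here are illustrative (Spark uses Int constants internally):

```scala
// Sketch of mapping the master argument to a cluster manager nickname,
// following the prefix rules in the table above. The string nicknames are
// illustrative; Spark uses Int constants internally.
object ClusterManagers {
  def from(master: String): String = master match {
    case "yarn"                      => "YARN"
    case m if m.startsWith("k8s://") => "KUBERNETES"
    case m if m.startsWith("spark")  => "STANDALONE"
    case m if m.startsWith("mesos")  => "MESOS"
    case m if m.startsWith("local")  => "LOCAL"
    case other =>
      throw new IllegalArgumentException(s"Could not parse Master URL: $other")
  }
}
```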

Launching Standalone Application

main(
  args: Array[String]): Unit

main creates a SparkSubmit to doSubmit (with the given args).

doSubmit

doSubmit(
  args: Array[String]): Unit

doSubmit initializeLogIfNecessary.

doSubmit parses the arguments in the given args (that gives a SparkSubmitArguments).

With the verbose option on, doSubmit prints out the appArgs to the standard output.

doSubmit branches off based on action.

Action           Handler
SUBMIT           submit
KILL             kill
REQUEST_STATUS   requestStatus
PRINT_VERSION    printVersion
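The dispatch can be sketched as a match on the action; SparkSubmitAction is modelled here as a simple sealed trait, and the handlers are reduced to their names:

```scala
// Sketch of doSubmit's dispatch on the action (a sealed trait stands in for
// SparkSubmitAction; the real code calls submit, kill, requestStatus and
// printVersion).
object Dispatch {
  sealed trait Action
  case object Submit extends Action
  case object Kill extends Action
  case object RequestStatus extends Action
  case object PrintVersion extends Action

  def handlerFor(action: Action): String = action match {
    case Submit        => "submit"
    case Kill          => "kill"
    case RequestStatus => "requestStatus"
    case PrintVersion  => "printVersion"
  }
}
```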

doSubmit is used when:

  • InProcessSparkSubmit standalone application is started
  • SparkSubmit standalone application is started

Parsing Arguments

parseArguments(
  args: Array[String]): SparkSubmitArguments

parseArguments creates a SparkSubmitArguments (with the given args).

prepareSubmitEnvironment

prepareSubmitEnvironment(
  args: SparkSubmitArguments,
  conf: Option[HadoopConfiguration] = None): (Seq[String], Seq[String], SparkConf, String)

prepareSubmitEnvironment creates a 4-element tuple made up of the following:

  1. childArgs for arguments
  2. childClasspath for Classpath elements
  3. sparkConf for Spark properties (a SparkConf)
  4. childMainClass
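Callers typically destructure the tuple; a toy sketch with a stand-in function (dummy values, and a plain Map standing in for SparkConf):

```scala
// Toy stand-in for prepareSubmitEnvironment's 4-element result tuple
// (dummy values; a Map stands in for SparkConf here).
object SubmitEnv {
  def prepareSubmitEnvironment(): (Seq[String], Seq[String], Map[String, String], String) =
    (Seq("--arg"), Seq("app.jar"), Map("spark.master" -> "local[2]"), "example.Main")

  // Destructure the tuple the way a caller would
  def mainClassOf(): String = {
    val (childArgs, childClasspath, sparkConf, childMainClass) = prepareSubmitEnvironment()
    childMainClass
  }
}
```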

Tip

Use the --verbose command-line option to have the elements of the tuple printed out to the standard output.

prepareSubmitEnvironment...FIXME

For isPython in CLIENT deploy mode, prepareSubmitEnvironment sets the following based on primaryResource:

  • For pyspark-shell the mainClass is org.apache.spark.api.python.PythonGatewayServer

  • Otherwise, the mainClass is org.apache.spark.deploy.PythonRunner, with the main Python file, the extra Python files and the childArgs as arguments

prepareSubmitEnvironment...FIXME

prepareSubmitEnvironment determines the cluster manager based on master argument.

For KUBERNETES, prepareSubmitEnvironment checkAndGetK8sMasterUrl.

prepareSubmitEnvironment...FIXME

prepareSubmitEnvironment is used when...FIXME

childMainClass

childMainClass is the 4th (last) element of the result tuple of prepareSubmitEnvironment.

// (childArgs, childClasspath, sparkConf, childMainClass)
(Seq[String], Seq[String], SparkConf, String)

childMainClass can be as follows (based on the deployMode):

Deploy Mode   Master URL   childMainClass
client        any          mainClass
cluster       KUBERNETES   KubernetesClientApplication
cluster       MESOS        RestSubmissionClientApp (for REST submission API)
cluster       STANDALONE   RestSubmissionClientApp (for REST submission API)
cluster       STANDALONE   ClientApp
cluster       YARN         YarnClusterApplication

isKubernetesClient

prepareSubmitEnvironment uses isKubernetesClient flag to indicate that:

isKubernetesClusterModeDriver

prepareSubmitEnvironment uses isKubernetesClusterModeDriver flag to indicate that:

renameResourcesToLocalFS

renameResourcesToLocalFS(
  resources: String,
  localResources: String): String

renameResourcesToLocalFS...FIXME

renameResourcesToLocalFS is used in isKubernetesClusterModeDriver mode.

downloadResource

downloadResource(
  resource: String): String

downloadResource...FIXME

Checking Whether Resource is Internal

isInternal(
  res: String): Boolean

isInternal is true when the given res is spark-internal.

isInternal is used when:

isUserJar

isUserJar(
  res: String): Boolean

isUserJar is true when the given res is none of the following:

isUserJar is used when:

  • FIXME
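A sketch of the composition, assuming isUserJar is the negation of the special-resource checks documented on this page (the real code additionally excludes R resources):

```scala
// Sketch of isUserJar as the negation of the special-resource checks
// (assumption: the real code additionally excludes R resources).
object UserJarCheck {
  def isShell(res: String): Boolean =
    Set("spark-shell", "pyspark-shell", "sparkr-shell").contains(res)

  def isPython(res: String): Boolean =
    res.endsWith(".py") || res == "pyspark-shell"

  def isInternal(res: String): Boolean = res == "spark-internal"

  // A user jar is anything that is none of the special resources above
  def isUserJar(res: String): Boolean =
    !isShell(res) && !isPython(res) && !isInternal(res)
}
```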

isPython

isPython(
  res: String): Boolean

isPython is positive (true) when the given res primary resource represents a PySpark application:
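A minimal sketch, assuming a PySpark application is identified by a .py primary resource or the special pyspark-shell name:

```scala
// Sketch of isPython: a .py file or the pyspark-shell special name
// (assumption about how PySpark primary resources are identified).
object PythonCheck {
  def isPython(res: String): Boolean =
    res.endsWith(".py") || res == "pyspark-shell"
}
```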


isPython is used when: