Spark on YARN
You can submit Spark applications to a Hadoop YARN cluster using the yarn master URL.
spark-submit --master yarn mySparkApp.jar
Since Spark 2.0.0, you use --deploy-mode to choose between client (the default) and cluster deploy modes. Without specifying the deploy mode, client is assumed.
spark-submit --master yarn --deploy-mode client mySparkApp.jar
Deploy modes are all about where the Spark driver runs.
In client mode the Spark driver (and SparkContext) runs on a client node outside the YARN cluster, whereas in cluster mode it runs inside the YARN cluster, i.e. inside a YARN container alongside the ApplicationMaster (which acts as the Spark application in YARN).
spark-submit --master yarn --deploy-mode cluster mySparkApp.jar
In that sense, a Spark application deployed to YARN is just another YARN-compatible execution framework running on the cluster (alongside other Hadoop workloads). On YARN, a Spark executor maps to a single YARN container.
In order to deploy applications to YARN clusters, you need to use Spark with YARN support.
Spark on YARN supports multiple application attempts and data locality for data in HDFS. You can also take advantage of Hadoop’s security and run Spark in a secure Hadoop environment using Kerberos authentication (aka Kerberized clusters).
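As a sketch, the number of application attempts can be capped with the YARN-specific spark.yarn.maxAppAttempts setting (the value below is only an example):

spark-submit --master yarn --deploy-mode cluster --conf spark.yarn.maxAppAttempts=2 mySparkApp.jar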
There are a few settings that are specific to YARN (see Settings). Among them, the support for YARN resource queues is particularly useful (to divide cluster resources and allocate shares to different teams and users based on advanced policies).
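For example (the queue name below is hypothetical), an application can be submitted to a specific resource queue with the --queue option (or the spark.yarn.queue setting):

spark-submit --master yarn --queue analytics mySparkApp.jar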
You can start spark-submit with the --verbose command-line option to have the settings, including the YARN-specific ones, printed out.
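A minimal sketch of such an invocation:

spark-submit --verbose --master yarn mySparkApp.jar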
The memory in the YARN resource requests is --executor-memory + what’s set for spark.yarn.executor.memoryOverhead, which defaults to 10% of --executor-memory.
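As a rough worked example (assuming the 10% default overhead; Spark also enforces a minimum overhead of 384 MB), requesting 4g executors leads to per-executor YARN containers of roughly 4.4 GB:

spark-submit --master yarn --executor-memory 4g mySparkApp.jar
# per-executor request ≈ 4096 MB + max(384 MB, 10% of 4096 MB) ≈ 4506 MB
# YARN may additionally round this up to a multiple of yarn.scheduler.minimum-allocation-mb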
If YARN has enough resources, it will deploy the executors distributed across the cluster. Each of them will then try to process the data locally (NODE_LOCAL in the Spark Web UI), with as many splits in parallel as you defined in spark.executor.cores.
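For instance (a purely hypothetical sizing), each of the executors below would process up to 4 splits in parallel:

spark-submit --master yarn --num-executors 10 --executor-cores 4 mySparkApp.jar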
The memory for the ApplicationMaster is controlled by custom settings per deploy mode.
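A sketch of the two cases (assuming spark.yarn.am.memory sizes the ApplicationMaster in client mode, while in cluster mode the driver runs inside the ApplicationMaster and --driver-memory applies):

# client mode: the ApplicationMaster is a separate, lightweight process
spark-submit --master yarn --deploy-mode client --conf spark.yarn.am.memory=1g mySparkApp.jar
# cluster mode: the driver runs inside the ApplicationMaster container
spark-submit --master yarn --deploy-mode cluster --driver-memory 2g mySparkApp.jar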
If your Spark distribution was not built with YARN support, you will see the following error in the logs and Spark will exit.
Error: Could not load YARN classes. This copy of Spark may not have been compiled with YARN support.
Since Spark 2.0.0, the only proper master URL is yarn.
./bin/spark-submit --master yarn ...
Before Spark 2.0.0, you could have used the yarn-client or yarn-cluster master URLs, but they are now deprecated. When you use the deprecated master URLs, you should see the following warning in the logs:
Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" with specified deploy mode instead.
When a principal is specified, a keytab must be specified too.
The settings spark.yarn.principal and spark.yarn.keytab will be set to the respective values, and
UserGroupInformation.loginUserFromKeytab will be called with their values as input arguments.
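As an illustration (the principal and keytab path below are hypothetical), both values can be given on the spark-submit command line:

spark-submit --master yarn --deploy-mode cluster \
  --principal spark_user@EXAMPLE.COM \
  --keytab /path/to/spark_user.keytab \
  mySparkApp.jar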
SPARK_DIST_CLASSPATH is a distribution-defined CLASSPATH to add to processes.
It is used to populate CLASSPATH for ApplicationMaster and executors.
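A common way to set it (assuming the hadoop command is available on the host, e.g. in conf/spark-env.sh):

export SPARK_DIST_CLASSPATH=$(hadoop classpath)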
(video) Spark on YARN: The Road Ahead — Marcelo Vanzin (Cloudera) from Spark Summit 2015