YarnClusterSchedulerBackend - SchedulerBackend for YARN in Cluster Deploy Mode

YarnClusterSchedulerBackend is a custom YarnSchedulerBackend for Spark on YARN in cluster deploy mode.

This is a scheduler backend that supports multiple application attempts and URLs for driver’s logs to display as links in the web UI in the Executors tab for the driver.

It uses spark.yarn.app.attemptId under the covers (that the YARN resource manager sets?).

YarnClusterSchedulerBackend is a private[spark] Scala class. You can find the sources in org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend.

Enable DEBUG logging level for org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend logger to see what happens inside.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend=DEBUG

Refer to Logging.

Creating YarnClusterSchedulerBackend

Creating a YarnClusterSchedulerBackend object requires a TaskSchedulerImpl and SparkContext objects.

Starting YarnClusterSchedulerBackend (start method)

YarnClusterSchedulerBackend comes with a custom start method.

start is part of the SchedulerBackend Contract.

It then calls the parent’s start and sets the parent’s totalExpectedExecutors to the initial number of executors.

Calculating Driver Log URLs (getDriverLogUrls method)

getDriverLogUrls in YarnClusterSchedulerBackend calculates the URLs for the driver’s logs - standard output (stdout) and standard error (stderr).

getDriverLogUrls is part of the SchedulerBackend Contract.

Internally, it retrieves the container id and through environment variables computes the base URL.

You should see the following DEBUG in the logs:

DEBUG Base URL for logs: [baseUrl]