YarnScheduler — TaskScheduler for Client Deploy Mode

YarnScheduler is the TaskScheduler for Spark on YARN in client deploy mode.

It is a custom TaskSchedulerImpl that can compute the rack of a host, i.e. it comes with a specialized getRackForHost.

It also sets the level of the org.apache.hadoop.yarn.util.RackResolver logger to WARN unless a level has already been set for that logger.
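Because the WARN level is applied only when the logger has no level of its own, giving the RackResolver logger an explicit level should keep YarnScheduler from changing it. A minimal sketch, assuming the log4j 1.x properties configuration used in conf/log4j.properties (the INFO value is only an illustration):

```properties
# Assumption: an explicit level here counts as "already set",
# so YarnScheduler leaves the RackResolver logger alone.
log4j.logger.org.apache.hadoop.yarn.util.RackResolver=INFO
```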

Enable the INFO or DEBUG logging level for the org.apache.spark.scheduler.cluster.YarnScheduler logger to see what happens inside.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.scheduler.cluster.YarnScheduler=DEBUG

Refer to Logging.

Tracking Racks per Hosts and Ports (getRackForHost method)

getRackForHost attempts to compute the rack for a host.

getRackForHost overrides the parent TaskSchedulerImpl’s getRackForHost.

It simply uses Hadoop’s org.apache.hadoop.yarn.util.RackResolver to resolve a hostname to its network location, i.e. a rack.
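The lookup described above can be sketched in plain Scala. This is an illustration of the pattern, not Spark's code: it assumes getRackForHost receives a host (optionally followed by a port) and returns an optional rack name. The RackLookup trait, the staticLookup topology, the hostOf helper and the default rack value are all hypothetical stand-ins for Hadoop's RackResolver, so that the example runs without Hadoop on the classpath.

```scala
// Hypothetical stand-in for Hadoop's RackResolver: maps a hostname to a
// network location (a rack). Not part of Spark or Hadoop.
trait RackLookup {
  def networkLocation(host: String): String
}

object RackLookupSketch {
  // Assumed fallback for hosts missing from the topology (illustrative only).
  val DefaultRack = "/default-rack"

  // Drops an optional ":port" suffix. IPv6 literals are not handled here.
  def hostOf(hostPort: String): String = {
    val idx = hostPort.lastIndexOf(':')
    if (idx < 0) hostPort else hostPort.substring(0, idx)
  }

  // Same shape as the described getRackForHost: host (and port) in,
  // optional rack out. Option(...) turns a null location into None.
  def rackForHost(lookup: RackLookup, hostPort: String): Option[String] =
    Option(lookup.networkLocation(hostOf(hostPort)))

  // Hypothetical static topology used only for demonstration.
  val staticLookup: RackLookup = new RackLookup {
    private val topology = Map("node1" -> "/rack-a", "node2" -> "/rack-b")
    def networkLocation(host: String): String =
      topology.getOrElse(host, DefaultRack)
  }

  def main(args: Array[String]): Unit = {
    println(rackForHost(staticLookup, "node1:7077"))   // Some(/rack-a)
    println(rackForHost(staticLookup, "unknown-host")) // Some(/default-rack)
  }
}
```

In a real deployment the stand-in would be replaced by a call into RackResolver with the application's Hadoop configuration, and the rack names would come from the cluster's topology settings rather than a hard-coded map.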