HeartbeatReceiver RPC Endpoint¶
HeartbeatReceiver is a ThreadSafeRpcEndpoint that is registered on the driver as HeartbeatReceiver.
HeartbeatReceiver is registered immediately after a Spark application is started (i.e. when SparkContext is created).
HeartbeatReceiver takes the following to be created:
HeartbeatReceiver is created when
SparkContext is created
HeartbeatReceiver manages a reference to TaskScheduler.
- Executor ID
HeartbeatReceiver is notified that an executor is no longer available
HeartbeatReceiver removes the executor (from executorLastSeen internal registry).
- Executor ID
HeartbeatReceiver is notified that a new executor has been registered
HeartbeatReceiver registers the executor and the current time (in executorLastSeen internal registry).
HeartbeatReceiver prints out the following TRACE message to the logs:
Checking for hosts with no recent heartbeats in HeartbeatReceiver.
For any such executor,
HeartbeatReceiver prints out the following WARN message to the logs:
Removing executor [executorId] with no recent heartbeats: [time] ms exceeds timeout [timeout] ms
HeartbeatReceiver TaskScheduler.executorLost (with
SlaveLost("Executor heartbeat timed out after [timeout] ms").
SparkContext.killAndReplaceExecutor is asynchronously called for the executor (i.e. on killExecutorThread).
The executor is removed from the executorLastSeen internal registry.
- Executor ID
- AccumulatorV2 updates (by task ID)
ExecutorMetricspeaks (by stage and stage attempt IDs)
Executor informs that it is alive and reports task metrics.
HeartbeatReceiver finds the
executorId executor (in executorLastSeen internal registry).
When the executor is found,
HeartbeatReceiver updates the time the heartbeat was received (in executorLastSeen internal registry).
HeartbeatReceiver uses the Clock to know the current time.
HeartbeatReceiver then submits an asynchronous task to notify
TaskScheduler that the heartbeat was received from the executor (using TaskScheduler internal reference).
HeartbeatReceiver posts a
HeartbeatResponse back to the executor (with the response from
TaskScheduler whether the executor has been registered already or not so it may eventually need to re-register).
If however the executor was not found (in executorLastSeen internal registry), i.e. the executor was not registered before, you should see the following DEBUG message in the logs and the response is to notify the executor to re-register.
Received heartbeat from unknown executor [executorId]
In a very rare case, when TaskScheduler is not yet assigned to
HeartbeatReceiver, you should see the following WARN message in the logs and the response is to notify the executor to re-register.
Dropping [heartbeat] because TaskScheduler is not ready yet
SparkContext informs that
TaskScheduler is available.
HeartbeatReceiver sets the internal reference to
onExecutorAdded( executorAdded: SparkListenerExecutorAdded): Unit
onExecutorAdded sends an ExecutorRegistered message to itself.
onExecutorAdded is part of the SparkListener abstraction.
addExecutor( executorId: String): Option[Future[Boolean]]
onExecutorRemoved( executorRemoved: SparkListenerExecutorRemoved): Unit
onExecutorRemoved removes the executor.
onExecutorRemoved is part of the SparkListener abstraction.
removeExecutor( executorId: String): Option[Future[Boolean]]
onStart is part of the RpcEndpoint abstraction.
onStop is part of the RpcEndpoint abstraction.
Handling Two-Way Messages¶
receiveAndReply( context: RpcCallContext): PartialFunction[Any, Unit]
receiveAndReply is part of the RpcEndpoint abstraction.
killExecutorThread is a daemon ScheduledThreadPoolExecutor with a single thread.
The name of the thread pool is kill-executor-thread.
eventLoopThread is a daemon ScheduledThreadPoolExecutor with a single thread.
The name of the thread pool is heartbeat-receiver-event-loop-thread.
Expiring Dead Hosts¶
expireDeadHosts is used when
HeartbeatReceiver is requested to receives an ExpireDeadHosts message.
ALL logging level for
org.apache.spark.HeartbeatReceiver logger to see what happens inside.
Add the following line to
Refer to Logging.