FetchFailedException exception may be thrown when a task runs (and
ShuffleBlockFetcherIterator could not fetch shuffle blocks).
FetchFailedException is reported,
TaskRunner catches it and notifies the ExecutorBackend (with
TaskState.FAILED task state).
FetchFailedException takes the following to be created:
- Shuffle ID
- Map ID
- Map Index
- Reduce ID
- Error Message
- Error Cause
While being created,
FetchFailedException requests the current TaskContext to setFetchFailed.
FetchFailedException is created when:
ShuffleBlockFetcherIteratoris requested to throw a FetchFailedException (for a
FetchFailedException can be given an error cause when created.
The root cause of the
FetchFailedException is usually because the Executor (with the BlockManager for requested shuffle blocks) is lost and no longer available due to the following:
OutOfMemoryErrorcould be thrown (aka OOMed) or some other unhandled exception
- The cluster manager that manages the workers with the executors of your Spark application (e.g. Kubernetes, Hadoop YARN) enforces the container memory limits and eventually decides to kill the executor due to excessive memory usage
A solution is usually to tune the memory of your Spark application.
TaskContext comes with setFetchFailed and fetchFailed to hold a
FetchFailedException unmodified (regardless of what happens in a user code).