Distinguish Active and Dead Jobs

This topic describes how to distinguish between active and dead jobs.

Problem

On clusters running many concurrent jobs, you may see some jobs stuck in the Spark UI without making any progress. This makes it difficult to identify which jobs/stages are active and which are dead.

[Image: stuck stages in the Spark UI (stuck-stages.png)]

Cause

Whenever too many concurrent jobs run on a cluster, there is a chance that the Spark internal event listener bus (LiveListenerBus) drops events. The Spark UI relies on these events to track job progress. When the listener bus drops events, you start seeing dead jobs/stages in the Spark UI that never appear to finish. These jobs have actually finished, but are not shown as completed in the Spark UI.

When this happens, you observe traces like the following in the driver logs:

18/01/25 06:37:32 WARN LiveListenerBus: Dropped 5044 SparkListenerEvents since Thu Jan 25 06:36:32 UTC 2018
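If dropped events are frequent, one common mitigation is to enlarge the listener bus event queue at cluster startup. A hedged sketch, assuming Spark 2.3 or later (where this property is named `spark.scheduler.listenerbus.eventqueue.capacity`; the value shown is illustrative, and the default is 10000):

```
# spark-defaults.conf (or the cluster's Spark config)
# Larger queues reduce, but do not eliminate, dropped events,
# at the cost of additional driver memory.
spark.scheduler.listenerbus.eventqueue.capacity 20000
```

This does not remove dead jobs/stages that are already displayed; it only makes future drops less likely.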

Solution

There is no way to remove dead jobs from the Spark UI without restarting the cluster. However, you can identify the active jobs and stages by running the following commands:

sc.statusTracker.getActiveJobIds()   // Returns an array containing the IDs of all active jobs.
sc.statusTracker.getActiveStageIds() // Returns an array containing the IDs of all active stages.
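The status tracker can also report details for each active job. A minimal sketch in Scala, run from a notebook or spark-shell where `sc` is the active SparkContext (the printed format is illustrative):

```scala
// For every active job, print its status and the stages it is running.
// getJobInfo returns None if the job's information is no longer retained.
sc.statusTracker.getActiveJobIds().foreach { jobId =>
  sc.statusTracker.getJobInfo(jobId).foreach { info =>
    println(s"Job $jobId: status=${info.status}, stages=${info.stageIds.mkString(",")}")
  }
}
```

Any job ID shown as stuck in the Spark UI but absent from this output is a dead entry left behind by dropped events, not a job that is still running.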