```yaml
# Opt in to the experimental Monitoring Daemon
run_monitoring:
  enabled: true
  # values below are the defaults, and don't need to be specified except to override them
  start_timeout_seconds: 180
  max_resume_run_attempts: 3 # experimental if above 0
  poll_interval_seconds: 120
```
When Dagster launches a run, the run stays in `STARTING` status until the run worker spins up and marks the run as `STARTED`. If a failure prevents the run worker from spinning up, the run can be stuck in `STARTING` status indefinitely. `start_timeout_seconds` sets a limit on how long a run can hang in this state before it is marked as failed.
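For example, assuming the block above lives in your instance's `dagster.yaml`, a minimal sketch that shortens the startup timeout (the 90-second value is purely illustrative):

```yaml
run_monitoring:
  enabled: true
  # fail runs that sit in STARTING for longer than 90 seconds (illustrative; default is 180)
  start_timeout_seconds: 90
```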
It's possible for a run worker process to crash during a run. This can happen for a variety of reasons (the host it's running on could go down, it could run out of memory, etc.). Without the monitoring daemon, there are two possible outcomes, neither desirable:
- If the run worker was able to catch the interrupt, it will mark the run as failed.
- If the run worker goes down without a grace period, the run could be left hanging in `STARTED` status.
If a run worker crashes, the run it's managing can hang. The monitoring daemon can run health checks on run workers for all active runs to detect this. If a failed run worker is detected (e.g. by the K8s Job having a non-zero exit code), the run is either marked as failed or resumed (see below).
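The cadence of these health checks is presumably governed by `poll_interval_seconds` from the configuration block above. A sketch that polls more often than the default (the 30-second value is illustrative only):

```yaml
run_monitoring:
  enabled: true
  # check run worker health every 30 seconds (illustrative; default is 120)
  poll_interval_seconds: 30
```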
## Resuming runs after run worker crashes (Experimental)
This feature is experimental and currently only supported when using:
The monitoring daemon handles run worker crashes by performing health checks on the run workers. If a failure is detected, the daemon can launch a new run worker that resumes execution of the existing run. The run worker crash will be shown in the event log, and the run will continue to completion. If the run worker continues to crash, the daemon will mark the run as failed after the configured number of attempts.
To enable run resumption, set `max_resume_run_attempts` to a value greater than 0.
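A `dagster.yaml` sketch that opts in to resuming crashed run workers, reusing the keys from the configuration block above (the attempt count of 3 matches the default shown there):

```yaml
run_monitoring:
  enabled: true
  # attempt to relaunch a crashed run worker up to 3 times before marking the run as failed
  max_resume_run_attempts: 3
```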