MTTR, MTTF, MTBF — what's the difference?

MTTR (Mean Time to Recovery) is the average time it takes to restore a service after a failure. It is calculated as total downtime divided by the number of incidents. Lower MTTR means faster recovery.

MTTF (Mean Time to Failure) is the average time a system operates before failing. It represents the average uptime between incidents: total uptime divided by the number of failures, where each failure counts as one incident.

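MTBF (Mean Time Between Failures) is MTTF + MTTR: the average length of a full cycle from the start of one failure to the start of the next. It is most useful for understanding failure frequency, because a system with a given MTBF fails, on average, once per MTBF interval.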

Formulas:
MTTR = Total downtime ÷ Number of incidents
MTTF = (Total time − Total downtime) ÷ Number of incidents
MTBF = MTTF + MTTR
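
As a worked example, the formulas above can be sketched in Python. The function name, the ReliabilityMetrics type, and the sample numbers (a 720-hour observation window with three outages) are illustrative assumptions, not data from a real system.

```python
from dataclasses import dataclass


@dataclass
class ReliabilityMetrics:
    mttr: float  # mean time to recovery, in hours
    mttf: float  # mean time to failure, in hours
    mtbf: float  # mean time between failures, in hours


def reliability_metrics(total_hours: float, downtime_hours: list[float]) -> ReliabilityMetrics:
    """Apply the three formulas to one observation window.

    total_hours: length of the window (hypothetical input).
    downtime_hours: duration of each incident in that window.
    """
    incidents = len(downtime_hours)
    if incidents == 0:
        # With no incidents, all three averages are undefined.
        raise ValueError("at least one incident is required")

    total_downtime = sum(downtime_hours)
    mttr = total_downtime / incidents                  # Total downtime / incidents
    mttf = (total_hours - total_downtime) / incidents  # Total uptime / incidents
    return ReliabilityMetrics(mttr=mttr, mttf=mttf, mtbf=mttf + mttr)


# Assumed sample: a 30-day window (720 h) with outages of 1 h, 2 h, and 3 h.
m = reliability_metrics(720.0, [1.0, 2.0, 3.0])
print(m.mttr)  # 2.0
print(m.mttf)  # 238.0
print(m.mtbf)  # 240.0
```

Note that MTBF here equals the total window divided by the number of incidents (720 ÷ 3 = 240), which follows directly from adding the MTTF and MTTR formulas.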

Improving MTTR

MTTR is driven by three factors: time to detect (alerting quality), time to diagnose (observability quality), and time to remediate (runbook quality and system design). Investments in better alerting and runbooks typically show the fastest MTTR improvements.
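
To see which of the three factors dominates, MTTR can be split per phase. The following is a minimal sketch, assuming each incident record carries three timestamps measured in minutes from the start of the failure. The Incident type, its field names, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Incident:
    # All values are minutes since the failure began (assumed record layout).
    detected_at: float   # an alert fired or a human noticed
    diagnosed_at: float  # the cause was identified
    resolved_at: float   # service was restored


def mttr_breakdown(incidents: list[Incident]) -> dict[str, float]:
    """Return the mean duration of each recovery phase, in minutes."""
    return {
        "detect": mean(i.detected_at for i in incidents),
        "diagnose": mean(i.diagnosed_at - i.detected_at for i in incidents),
        "remediate": mean(i.resolved_at - i.diagnosed_at for i in incidents),
    }


# Assumed sample data: two incidents.
incidents = [
    Incident(detected_at=5.0, diagnosed_at=25.0, resolved_at=40.0),
    Incident(detected_at=15.0, diagnosed_at=35.0, resolved_at=80.0),
]

breakdown = mttr_breakdown(incidents)
print(breakdown)                # {'detect': 10.0, 'diagnose': 20.0, 'remediate': 30.0}
print(sum(breakdown.values()))  # 60.0, equal to the overall MTTR in minutes
```

Because the three phases partition each incident, their means sum to the overall MTTR. In this made-up sample, remediation is the largest share, so runbook or design work would be the first place to look.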