Error Budgets Explained — SRE Guide

What is an error budget?

An error budget is the maximum amount of unreliability your service is allowed to have, derived directly from your SLO. If your SLO is 99.9%, your error budget is 0.1%: over the measurement window, 0.1% of requests can fail (for a request-based SLO), or the service can be unavailable for 0.1% of the time (for a time-based SLO).

The key insight is that 100% reliability is never the goal. Teams that chase 100% availability over-invest in reliability at the expense of feature development. Error budgets make this trade-off explicit and quantitative.

How to calculate an error budget

Error budget = 1 − SLO target. For a 99.9% SLO over 30 days:

Error budget = 0.1% of 30 days = 0.001 × 43,200 minutes = 43.2 minutes of downtime
Or: 0.1% of 1,000,000 requests = 1,000 allowed failures

Use the Error Budget Calculator to compute these for your specific SLO and request volume.

Formula: Error budget (time) = Window duration × (1 − SLO target)
Error budget (requests) = Total requests × (1 − SLO target)
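
As a rough illustration, the two formulas above can be written as a pair of small Python functions. The function names and the choice to pass the SLO as a fraction (0.999 rather than 99.9) are assumptions made for this sketch, not part of any particular tool.

```python
from datetime import timedelta


def time_budget(window: timedelta, slo_target: float) -> timedelta:
    """Allowed downtime: window duration x (1 - SLO target)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return window * (1.0 - slo_target)


def request_budget(total_requests: int, slo_target: float) -> int:
    """Allowed failures: total requests x (1 - SLO target).

    Rounded to the nearest whole request to absorb floating-point noise.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return round(total_requests * (1.0 - slo_target))


if __name__ == "__main__":
    downtime = time_budget(timedelta(days=30), 0.999)
    print(f"{downtime.total_seconds() / 60:.1f} minutes of downtime")
    print(f"{request_budget(1_000_000, 0.999)} allowed failures")
```

For a 99.9% SLO this reproduces the figures in the example: 43.2 minutes over 30 days, or 1,000 failures per 1,000,000 requests.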

How error budgets work operationally

An error budget is most useful when your team has agreed in advance on what happens at different consumption levels. A common policy looks like this:

Budget remaining > 50%: Normal operations. Feature deployments proceed.
Budget remaining 10–50%: Increased monitoring. Reliability review before major releases.
Budget remaining < 10%: Reliability sprint. Non-critical feature work paused.
Budget exhausted: Feature freeze until budget recovers. Post-incident review required.

The specific thresholds should be decided before a budget crunch — not during one.
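
One way to keep such a policy unambiguous is to encode it, so that the tier is computed rather than debated. The sketch below mirrors the example thresholds above. The function names and the tier labels are hypothetical, and the thresholds should be replaced with whatever your team has agreed on.

```python
def remaining_fraction(allowed_failures: int, observed_failures: int) -> float:
    """Fraction of the error budget still unspent (1.0 = untouched, <= 0 = exhausted)."""
    if allowed_failures <= 0:
        raise ValueError("allowed_failures must be positive")
    return 1.0 - observed_failures / allowed_failures


def policy_action(budget_remaining: float) -> str:
    """Map the remaining budget fraction to the example policy tiers."""
    if budget_remaining <= 0.0:
        return "feature freeze"
    if budget_remaining < 0.10:
        return "reliability sprint"
    if budget_remaining <= 0.50:
        return "increased monitoring"
    return "normal operations"


if __name__ == "__main__":
    # Assumed example: 1,000 allowed failures in the window.
    for observed in (250, 600, 950, 1200):
        left = remaining_fraction(1000, observed)
        print(f"{observed} failures -> {policy_action(left)}")
```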

Burn rate alerts

Tracking remaining budget is not enough: you also need to know whether you are consuming it faster than is sustainable. Burn rate measures how quickly you are spending the budget relative to how quickly time is passing, which is the observed error rate divided by the error rate the SLO allows. A burn rate of 1.0 spends the budget exactly at the end of the window; 2.0 exhausts it in half the window.
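
A minimal sketch of the calculation, assuming a request-based SLO and that you can count failed and total requests over some recent interval. The function names are illustrative only.

```python
def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if total == 0:
        return 0.0  # no traffic, so no budget was spent
    return (failed / total) / (1.0 - slo_target)


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until a full budget is spent if the burn rate stays constant."""
    if rate <= 0.0:
        return float("inf")
    return window_days / rate


if __name__ == "__main__":
    # Assumed sample: 20 failures out of 10,000 requests against a 99.9% SLO.
    rate = burn_rate(failed=20, total=10_000, slo_target=0.999)
    print(f"burn rate {rate:.1f}, budget gone in {days_to_exhaustion(rate):.0f} days")
```

With a 0.2% observed error rate against a 0.1% allowance, the burn rate is about 2, so a 30-day budget would last roughly 15 days.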

Google's SRE Workbook recommends alerting on more than one burn rate. Two commonly used tiers for a 30-day SLO are a fast burn alert (14.4× burn rate over a 1-hour window, which means 2% of the budget was spent in that hour) for catching severe incidents quickly, and a slow burn alert (6× burn rate over a 6-hour window, or 5% of the budget) for catching gradual degradation. Left unchecked, these rates would exhaust a 30-day budget in about 2 days and 5 days respectively.
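
The two tiers can be sketched as data plus a simple check. This is an illustration under stated assumptions, not a monitoring system's API: the BurnAlert class, the firing_alerts function, and the idea of passing measured burn rates keyed by window length in hours are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnAlert:
    name: str
    threshold: float   # burn rate at or above which the alert fires
    window_hours: int  # lookback window used to measure the burn rate


# The two tiers described in the text.
ALERTS = (
    BurnAlert("fast burn", 14.4, 1),
    BurnAlert("slow burn", 6.0, 6),
)


def budget_spent_fraction(alert: BurnAlert, slo_window_hours: float = 720.0) -> float:
    """Share of the whole budget consumed if the threshold rate holds for the alert window."""
    return alert.threshold * alert.window_hours / slo_window_hours


def firing_alerts(observed: dict[int, float]) -> list[str]:
    """Return names of alerts whose threshold is met.

    `observed` maps a window length in hours to the burn rate measured over it.
    """
    return [a.name for a in ALERTS if observed.get(a.window_hours, 0.0) >= a.threshold]


if __name__ == "__main__":
    # Assumed measurements: a sharp spike in the last hour, calmer over six hours.
    print(firing_alerts({1: 20.0, 6: 4.0}))
    for alert in ALERTS:
        print(alert.name, f"{budget_spent_fraction(alert):.0%} of budget")
```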

Common pitfalls

No policy for when budget runs out. Without a pre-agreed policy, error budget exhaustion leads to arguments between product and engineering rather than action.

Resetting the budget manually. Artificially resetting a budget after exhaustion undermines trust in the system. The budget should recover naturally as the window rolls forward.
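
To see what natural recovery looks like, here is a small sketch of a rolling-window budget built from daily request counts. The RollingBudget class and its daily granularity are assumptions chosen for brevity; a real system would likely use finer-grained data.

```python
from collections import deque


class RollingBudget:
    """Tracks the remaining error budget over a trailing window of daily counts."""

    def __init__(self, window_days: int, slo_target: float) -> None:
        # deque(maxlen=...) silently drops the oldest day once the window is full.
        self._days: deque[tuple[int, int]] = deque(maxlen=window_days)
        self._slo_target = slo_target

    def record_day(self, total: int, failed: int) -> None:
        self._days.append((total, failed))

    def remaining(self) -> float:
        """Fraction of budget left in the current window (1.0 = untouched)."""
        total = sum(t for t, _ in self._days)
        failed = sum(f for _, f in self._days)
        allowed = total * (1.0 - self._slo_target)
        if allowed == 0:
            return 1.0
        return 1.0 - failed / allowed


if __name__ == "__main__":
    budget = RollingBudget(window_days=3, slo_target=0.99)
    budget.record_day(total=1000, failed=30)  # one bad day spends the whole budget
    budget.record_day(total=1000, failed=0)
    budget.record_day(total=1000, failed=0)
    print(f"{budget.remaining():.2f}")
    budget.record_day(total=1000, failed=0)   # the bad day rolls out of the window
    print(f"{budget.remaining():.2f}")
```

No reset is needed: once the bad day falls outside the window, the remaining budget returns to full on its own.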

Single budget for a complex service. A service with many distinct user journeys may need separate SLOs and budgets for each critical path, not one aggregate number.