What is an error budget?
An error budget is the maximum amount of unreliability your service is allowed to have, derived directly from your SLO. If your SLO is 99.9%, your error budget is 0.1%: over the measurement window, 0.1% of requests can fail (for a request-based SLO), or the service can be unavailable for 0.1% of the time (for a time-based SLO).
The key insight is that 100% reliability is never the goal. Systems that chase 100% availability over-invest in reliability at the expense of feature development. Error budgets make this trade-off explicit and quantitative.
How to calculate an error budget
Error budget = 1 − SLO target. For a 99.9% SLO over 30 days:
- Error budget = 0.1% of 30 days (43,200 minutes) = 43.2 minutes of downtime
- Or: 0.1% of 1,000,000 requests = 1,000 allowed failures
Use the Error Budget Calculator to compute these for your specific SLO and request volume.
Error budget (requests) = Total requests × (1 − SLO target)
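The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's implementation; the function names (`error_budget_fraction`, `downtime_budget_minutes`, `request_budget`) are made up for this example, and the SLO target is assumed to be given as a fraction such as 0.999.

```python
def error_budget_fraction(slo_target: float) -> float:
    """Return the error budget as a fraction, e.g. 0.999 -> roughly 0.001."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def downtime_budget_minutes(slo_target: float, window_days: float) -> float:
    """Allowed downtime in minutes for a time-based SLO over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * error_budget_fraction(slo_target)


def request_budget(slo_target: float, total_requests: int) -> float:
    """Allowed failed requests for a request-based SLO."""
    return total_requests * error_budget_fraction(slo_target)


if __name__ == "__main__":
    # Values are rounded because 1 - 0.999 is not exact in floating point.
    print(round(downtime_budget_minutes(0.999, 30), 1))  # 43.2
    print(round(request_budget(0.999, 1_000_000)))       # 1000
```

Rounding at the point of display, rather than inside the functions, keeps the helpers usable for further calculations.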
How error budgets work operationally
An error budget is most useful when your team has agreed in advance on what happens at different consumption levels. A common policy looks like this:
- More than 50% of budget remaining: Normal operations. Feature deployments proceed.
- 10–50% of budget remaining: Increased monitoring. Reliability review before major releases.
- Less than 10% of budget remaining: Reliability sprint. Non-critical feature work paused.
- Budget exhausted: Feature freeze until budget recovers. Post-incident review required.
The specific thresholds should be decided before a budget crunch — not during one.
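One way to make such a policy hard to argue with during an incident is to encode it. The sketch below maps the fraction of budget remaining to the example tiers listed above. The function name `policy_action`, the returned labels, and the handling of the exact 10% and 50% boundaries are assumptions for illustration; your own thresholds would replace them.

```python
def policy_action(budget_remaining: float) -> str:
    """Map remaining error budget (as a fraction, 1.0 = untouched) to a policy tier.

    Thresholds mirror the example policy in the text; the boundary handling
    (exactly 10% and exactly 50% fall into "increased monitoring") is an
    assumption of this sketch.
    """
    if budget_remaining <= 0.0:
        return "feature freeze"
    if budget_remaining < 0.10:
        return "reliability sprint"
    if budget_remaining <= 0.50:
        return "increased monitoring"
    return "normal operations"


if __name__ == "__main__":
    for remaining in (0.75, 0.30, 0.05, 0.0):
        print(f"{remaining:.0%} remaining -> {policy_action(remaining)}")
```

A negative value, meaning the budget is overspent, is treated the same as an exhausted budget.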
Burn rate alerts
Tracking remaining budget is not enough: you also need to know whether you are consuming it faster than is sustainable. Burn rate is the ratio of the observed error rate to the error rate your SLO allows, so it measures how quickly you are spending the budget relative to how quickly the window is passing. A burn rate of 1.0 is exactly on pace to use the whole budget by the end of the window; 2.0 means you will exhaust the budget in half the window.
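As a rough sketch of the burn rate idea, the snippet below divides the observed error rate by the error rate the SLO allows, then estimates how long a full budget would last at that pace. The names `burn_rate` and `days_to_exhaustion` are hypothetical, and the exhaustion estimate assumes the budget was untouched at the start and the rate stays constant.

```python
def burn_rate(failed_requests: int, total_requests: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if total_requests == 0:
        return 0.0  # no traffic, so no budget is being spent
    observed_error_rate = failed_requests / total_requests
    allowed_error_rate = 1.0 - slo_target
    return observed_error_rate / allowed_error_rate


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until a full budget is spent if the burn rate stays constant."""
    if rate <= 0.0:
        return float("inf")
    return window_days / rate


if __name__ == "__main__":
    # 20 failures in 10,000 requests is a 0.2% error rate against a 0.1% budget.
    rate = burn_rate(20, 10_000, 0.999)
    print(round(rate, 2))                         # 2.0
    print(round(days_to_exhaustion(rate), 1))     # 15.0
```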
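Google's SRE workbook recommends a two-tiered alerting strategy: a fast burn alert (14.4× burn rate, 1-hour window) for catching severe incidents quickly, and a slow burn alert (6× burn rate, 6-hour window) for catching gradual degradation. Against a 30-day budget, the fast burn alert fires once about 2% of the budget has been spent in an hour, and the slow burn alert once about 5% has been spent in six hours. If sustained, these rates would exhaust the budget in roughly 2 days and 5 days respectively.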
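The two tiers can be expressed as a simple threshold check. The sketch below assumes you already have burn rates measured over the last 1 hour and the last 6 hours; `burn_alerts` and `budget_consumed` are hypothetical helper names, not part of any monitoring product. The second helper shows how much of a 30-day budget a given burn rate consumes over a given number of hours.

```python
FAST_BURN_THRESHOLD = 14.4  # evaluated over a 1-hour window
SLOW_BURN_THRESHOLD = 6.0   # evaluated over a 6-hour window


def burn_alerts(rate_1h: float, rate_6h: float) -> list[str]:
    """Return the alert tiers whose burn rate threshold is met or exceeded."""
    alerts = []
    if rate_1h >= FAST_BURN_THRESHOLD:
        alerts.append("fast burn")
    if rate_6h >= SLOW_BURN_THRESHOLD:
        alerts.append("slow burn")
    return alerts


def budget_consumed(rate: float, hours: float, window_days: float = 30.0) -> float:
    """Fraction of the window's budget spent at a constant burn rate for `hours`."""
    return rate * hours / (window_days * 24)


if __name__ == "__main__":
    print(burn_alerts(rate_1h=16.0, rate_6h=3.0))   # ['fast burn']
    print(round(budget_consumed(14.4, 1), 3))       # 0.02
    print(round(budget_consumed(6.0, 6), 3))        # 0.05
```

In practice these checks usually live in the monitoring system's own rule language; the Python here only illustrates the logic.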
Common pitfalls
No policy for when budget runs out. Without a pre-agreed policy, error budget exhaustion leads to arguments between product and engineering rather than action.
Resetting the budget manually. Artificially resetting a budget after exhaustion undermines trust in the system. The budget should recover naturally as the window rolls forward.
Single budget for a complex service. A service with many distinct user journeys may need separate SLOs and budgets for each critical path, not one aggregate number.