What is an error budget?

An error budget is the maximum amount of unreliability your service is allowed to have, derived directly from your SLO. If your SLO is 99.9%, your error budget is 0.1%: either 0.1% of requests can fail, or the service can be unavailable for 0.1% of the time, within your measurement window.

The key insight is that 100% reliability is never the goal. Teams that chase 100% availability over-invest in reliability at the expense of feature development. Error budgets make this trade-off explicit and quantitative.

How to calculate an error budget

Error budget = 1 − SLO target. For a 99.9% SLO over 30 days:

  • Error budget = 0.1% of 30 days = 43.2 minutes of downtime
  • Or: 0.1% of 1,000,000 requests = 1,000 allowed failures

Use the Error Budget Calculator to compute these for your specific SLO and request volume.

Formula: Error budget (time) = Window duration × (1 − SLO target)
Error budget (requests) = Total requests × (1 − SLO target)
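
The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are made up for this example, and the SLO target is assumed to be passed as a fraction (0.999 for 99.9%). For a 30-day window, 30 × 24 × 60 minutes × 0.001 works out to 43.2 minutes.

```python
from datetime import timedelta


def error_budget_time(window: timedelta, slo_target: float) -> timedelta:
    """Allowed downtime: window duration x (1 - SLO target)."""
    return window * (1 - slo_target)


def error_budget_requests(total_requests: int, slo_target: float) -> float:
    """Allowed failed requests: total requests x (1 - SLO target).

    Rounded to 6 decimal places to absorb floating-point noise
    (1 - 0.999 is not exactly 0.001 in binary floating point).
    """
    return round(total_requests * (1 - slo_target), 6)


if __name__ == "__main__":
    budget = error_budget_time(timedelta(days=30), 0.999)
    print(f"Downtime budget: {budget.total_seconds() / 60:.1f} minutes")
    print(f"Request budget: {error_budget_requests(1_000_000, 0.999):.0f} failures")
```

Using timedelta keeps the window and the budget in the same unit-safe type, so the same function works for 7-day or 90-day windows without manual conversion.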

How error budgets work operationally

An error budget is most useful when your team has agreed in advance on what happens at different consumption levels. A common policy looks like this:

  • Budget > 50%: Normal operations. Feature deployments proceed.
  • Budget 10–50%: Increased monitoring. Reliability review before major releases.
  • Budget < 10%: Reliability sprint. Non-critical feature work paused.
  • Budget exhausted: Feature freeze until budget recovers. Post-incident review required.

The specific thresholds should be decided before a budget crunch — not during one.
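
One way to keep such a policy unambiguous is to encode it as a small function that both dashboards and release tooling can call. The sketch below uses the example thresholds from the list above; the function names and tier labels are hypothetical, and the boundary handling (exactly 10% and exactly 50% fall into the middle tier) is an assumption your team would need to confirm.

```python
def budget_remaining(total_requests: int, failed_requests: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent (1.0 = untouched, <= 0 = exhausted)."""
    allowed_failures = total_requests * (1 - slo_target)
    if allowed_failures == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    return 1 - failed_requests / allowed_failures


def policy_tier(remaining: float) -> str:
    """Map the remaining budget fraction to the example policy tiers."""
    if remaining <= 0:
        return "feature freeze"       # budget exhausted
    if remaining < 0.10:
        return "reliability sprint"   # budget < 10%
    if remaining <= 0.50:
        return "increased monitoring" # budget 10-50%
    return "normal operations"        # budget > 50%


if __name__ == "__main__":
    # 600 failures against a budget of 1,000 leaves about 40% remaining.
    remaining = budget_remaining(1_000_000, 600, 0.999)
    print(f"{remaining:.0%} remaining -> {policy_tier(remaining)}")
```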

Burn rate alerts

Tracking remaining budget is not enough; you also need to know whether you are consuming budget faster than is sustainable. Burn rate measures how quickly you are spending the budget relative to how quickly time is passing: it is the observed error rate divided by the error rate the SLO allows. A burn rate of 1.0 is exactly on pace to use the whole budget by the end of the window; 2.0 means you will exhaust the budget in half the window.
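
As a rough sketch of that definition, burn rate can be computed as the observed error rate divided by the allowed error rate (1 − SLO target). The function names are illustrative, and the time-to-exhaustion helper assumes a full budget and a burn rate that stays constant, which real incidents rarely do.

```python
from datetime import timedelta
from typing import Optional


def burn_rate(failed_requests: int, total_requests: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if total_requests == 0:
        return 0.0  # no traffic means no budget is being spent
    observed_error_rate = failed_requests / total_requests
    return observed_error_rate / (1 - slo_target)


def time_to_exhaustion(window: timedelta, rate: float) -> Optional[timedelta]:
    """Time until a full budget is spent, assuming the burn rate stays constant."""
    if rate <= 0:
        return None  # budget is not being consumed
    return window / rate


if __name__ == "__main__":
    rate = burn_rate(failed_requests=2, total_requests=1_000, slo_target=0.999)
    print(f"Burn rate: {rate:.1f}")
    print(f"Budget exhausted in: {time_to_exhaustion(timedelta(days=30), 2.0)}")
```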

Google's SRE workbook recommends paging on two burn-rate tiers: a fast burn alert (14.4× burn rate, 1-hour window) for catching severe incidents quickly, and a slow burn alert (6× burn rate, 6-hour window) for catching gradual degradation. If sustained, those rates would exhaust a 30-day budget in about 2 days and 5 days respectively, so the alerts fire well before the budget is gone.
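
The two tiers can be sketched as a simple threshold check. This is an illustration under stated assumptions, not a production alerting rule: the input is a hypothetical mapping from window length in hours to the error rate observed over that window, which in practice would come from your monitoring system. The helper at the end shows why these thresholds are used: 14.4× for 1 hour spends 2% of a 30-day budget, and 6× for 6 hours spends 5%.

```python
# Thresholds quoted in the text: (alert name, window in hours, burn-rate threshold).
BURN_RATE_ALERTS = [
    ("fast burn", 1, 14.4),
    ("slow burn", 6, 6.0),
]


def firing_alerts(error_rate_by_window: dict, slo_target: float) -> list:
    """Return the names of alerts whose burn-rate threshold is met or exceeded.

    error_rate_by_window maps window length in hours to the observed
    error rate over that window (hypothetical input shape).
    """
    allowed_error_rate = 1 - slo_target
    firing = []
    for name, window_hours, threshold in BURN_RATE_ALERTS:
        error_rate = error_rate_by_window.get(window_hours)
        if error_rate is None:
            continue  # no data for this window, so skip the check
        if error_rate / allowed_error_rate >= threshold:
            firing.append(name)
    return firing


def budget_consumed(window_hours: float, rate: float, slo_window_hours: float = 30 * 24) -> float:
    """Fraction of the whole budget spent by burning at `rate` for `window_hours`."""
    return rate * window_hours / slo_window_hours


if __name__ == "__main__":
    # 2% errors over the last hour against a 99.9% SLO is a 20x burn rate.
    print(firing_alerts({1: 0.02, 6: 0.004}, slo_target=0.999))
    print(f"{budget_consumed(1, 14.4):.0%} and {budget_consumed(6, 6.0):.0%} of budget")
```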

Common pitfalls

No policy for when budget runs out. Without a pre-agreed policy, error budget exhaustion leads to arguments between product and engineering rather than action.

Resetting the budget manually. Artificially resetting a budget after exhaustion undermines trust in the system. The budget should recover naturally as the window rolls forward.

Single budget for a complex service. A service with many distinct user journeys may need separate SLOs and budgets for each critical path, not one aggregate number.