
# SLA vs SLO vs SLI: A Practical Guide for DevOps Engineers in 2026

Everyone knows the acronyms. Few teams implement them correctly. After analyzing uptime data for 500+ companies on ezmon.com, we see the same pattern repeatedly: engineers know *what* SLAs, SLOs, and SLIs are, but struggle with *how to set them* and *what to do when they breach*.

This guide is practical. No theory. Real examples, real numbers, real consequences.

---

## The 30-Second Version

| Term | What It Is | Who Cares |
|------|------------|-----------|
| **SLI** (Service Level Indicator) | A metric you actually measure. E.g., "HTTP success rate over a rolling 30 days." | Engineering |
| **SLO** (Service Level Objective) | The target you aim to hit. E.g., "99.9% HTTP success rate." | Engineering + Product |
| **SLA** (Service Level Agreement) | The contract with consequences. E.g., "99.9% uptime or we credit your bill." | Business + Legal |

The relationship: **SLIs measure → SLOs target → SLAs commit.**

---

## SLI: What You Measure

An SLI is just a number. It should be:

1. **Quantifiable** — a percentage, latency measurement, or count
2. **Directly tied to user experience** — not CPU usage, but request success rate
3. **Consistently measurable** — same methodology every time

### Common SLIs and how to measure them

**Availability SLI:**

```
availability = (successful_requests / total_requests) × 100

# Example over 30 days:
# total_requests = 8,640,000 (200 req/min × 60 × 24 × 30)
# failed_requests = 8,640 (a 0.1% error rate)
# availability = (8,631,360 / 8,640,000) × 100 = 99.90%
```

**Latency SLI:**

```
latency_sli = percentage_of_requests_under_threshold

# Example target: "95% of requests complete in under 200ms"
# If p95 latency = 185ms → at least 95% of requests meet the threshold (target met)
# If p95 latency = 210ms → fewer than 95% of requests do (target breached)
```

**Error rate SLI:**

```
error_rate = (5xx_responses / total_responses) × 100

# Target: error_rate < 0.1%
```

**The common mistake:** measuring the wrong thing.
CPU utilization at 80% doesn't tell you whether users are affected. Request success rate does.

---

## SLO: The Target You Set

An SLO is your internal reliability commitment. It should be:

- **Slightly harder to achieve than your SLA** — your SLO is the internal guardrail before you breach the external commitment
- **Based on real measurement data** — don't just pick 99.99% because it sounds good
- **Achievable, not aspirational** — if you've never hit 99.95%, don't set 99.99% as your SLO

### Setting your first SLO: the right process

1. **Measure your current baseline** — what's your actual availability over the past 90 days?
2. **Identify your worst month** — what's the floor you can reliably commit to?
3. **Set SLO = (worst month - 0.05%)** — give yourself headroom for variance
4. **Define the measurement window** — rolling 30 days is standard

### Translating percentages to downtime

| Uptime % | Annual Downtime | Monthly Downtime | Weekly Downtime |
|----------|-----------------|------------------|-----------------|
| 99.0% | 87.6 hours | 7.3 hours | 1.68 hours |
| 99.5% | 43.8 hours | 3.65 hours | 50 minutes |
| 99.9% | 8.76 hours | 43.8 minutes | 10 minutes |
| 99.95% | 4.38 hours | 21.9 minutes | 5 minutes |
| 99.99% | 52.6 minutes | 4.38 minutes | 1 minute |
| 99.999% | 5.26 minutes | 26 seconds | 6 seconds |

**The 99.99% reality check:** You have 4.38 minutes of downtime budget per month. One slow deploy, one network blip, one bad config push — you're done. Most teams should not commit to four nines without serious investment in redundancy and runbooks.

---

## Error Budgets: SLOs Made Actionable

The SRE practice that makes SLOs actually useful.

**Your error budget = (1 - SLO) × time window**

If your SLO is 99.9% availability for 30 days:

```
error_budget = (1 - 0.999) × (30 × 24 × 60 minutes) = 43.2 minutes
```

You have **43.2 minutes** of downtime per month before you breach your SLO.
**How to use it:**

| Budget Remaining | Engineering Policy |
|------------------|--------------------|
| > 50% | Deploy freely, run experiments, ship features |
| 25–50% | Normal deployments, extra monitoring |
| 10–25% | No risky deploys, investigate instability |
| < 10% | Freeze changes, focus on reliability only |
| 0% | Incident review mandatory, reliability sprint before new features |

This turns "are we reliable enough?" from a subjective argument into an objective decision.

---

## SLA: The External Commitment

An SLA is a business contract. Breaking it has financial and legal consequences.

Rules for SLAs:

1. **SLA < SLO** — your SLA should be easier to hit than your internal SLO. Example: SLO = 99.9%, SLA = 99.5%.
2. **Define the measurement period explicitly** — "per calendar month" vs "rolling 30 days" matters
3. **Specify what counts as downtime** — planned maintenance? partial outages? degraded performance?
4. **Define the remedy** — service credits are standard; refunds are rare; termination clauses exist in enterprise contracts

### Typical SLA tiers by product type

| Product Type | Standard SLA | Premium SLA | Enterprise SLA |
|--------------|--------------|-------------|----------------|
| SaaS (free) | No SLA | — | — |
| SaaS (paid) | 99.5% | 99.9% | 99.95% |
| Cloud Infrastructure | 99.9% | 99.95% | 99.99% |
| Financial/Healthcare | 99.95% | 99.99% | Custom |

---

## The Real-World SLI/SLO/SLA Stack: An Example

**Service:** Customer-facing API for a B2B SaaS product

**SLIs defined:**

- Request success rate (HTTP 2xx/3xx ÷ total requests)
- p99 latency (99th percentile response time)
- Error rate (HTTP 5xx ÷ total requests)

**SLOs (internal targets, measured over a rolling 30 days):**

- Availability: 99.95% (21.6 minutes/month budget)
- p99 latency: < 500ms
- Error rate: < 0.05%

**SLA (customer contract):**

- Uptime: 99.9% per calendar month
- Measurement: Excludes planned maintenance windows (announced 72h in advance)
- Remedy: 10% service credit for each 0.1% below SLA; capped at 30%

**Monitoring setup:**

- 60-second health checks from 8 global locations (ezmon.com)
- PagerDuty alert when error rate > 1% for 3+ consecutive minutes
- Weekly error budget report to engineering leads
- Monthly SLA compliance report to customers

---

## Common Mistakes That Get Teams Paged at 3am

**Mistake 1: Setting SLOs without measuring first**

You pick 99.99% because it sounds professional. Your actual baseline is 99.7%. You've committed to an SLO you'll breach constantly.

**Mistake 2: Measuring availability from one location**

Your health check server is in us-east-1. Your users are in APAC. You're measuring your own infra, not user experience.

**Mistake 3: Counting successful health check pings as uptime**

A health check returning 200 proves the server responds. It doesn't prove your app works. Use real user transaction monitoring.

**Mistake 4: Ignoring latency SLOs**

A page that loads in 8 seconds isn't "down" — but users experience it as broken. Include p95/p99 latency in your SLIs.

**Mistake 5: No error budget policy**

You set an SLO. You track it on a dashboard. Nothing changes based on it. Without a written error budget policy, the SLO is just a number.

---

## Getting Started Today

1. **Pick one SLI** — start with request success rate. It's the most meaningful and easiest to measure.
2. **Measure for 30 days** — don't set a target until you know your baseline.
3. **Set a conservative SLO** — current average minus 0.5 percentage points.
4. **Write an error budget policy** — one paragraph: what happens when budget is at 50%, 10%, 0%.
5. **Monitor from outside your infrastructure** — use ezmon.com or similar to get an objective measurement your own metrics can't game.

The goal isn't perfect uptime. It's calibrated commitments and fast, clear decisions when things go wrong.

---

*ezmon.com monitors 500+ companies from 12 global probe locations. [See current uptime status →](/)*

*Tags: sla vs slo vs sli, what is slo, sre reliability targets, how to set uptime sla, error budget, service level objectives*