observability

Is Gatus Down? Real-Time Status & Outage Checker

Gatus is an open-source automated health dashboard written in Go with over 7,000 GitHub stars. It monitors endpoints via HTTP, TCP, DNS, ICMP, and WebSocket checks against configurable conditions: response time, status codes, response body content, and certificate expiry. Gatus renders a clean public-facing status page, supports rich alerting integrations (Slack, PagerDuty, Microsoft Teams, Discord, email, and more), and stores check history in SQLite or PostgreSQL. Created as a self-hosted alternative to Pingdom, Freshping, and Better Uptime, Gatus is widely used by indie developers and by DevOps and SRE teams who want full control over their monitoring infrastructure. A single Go binary plus a YAML configuration file makes deployment trivially simple: Docker, Kubernetes, and a bare VPS all work out of the box.
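As a concrete illustration, a minimal config.yaml monitoring one endpoint looks like this (the URL and thresholds are placeholders; the condition syntax is Gatus' own):

```yaml
endpoints:
  - name: website                         # display name on the status page
    url: "https://example.org"            # placeholder target
    interval: 60s                         # how often to run the check
    conditions:
      - "[STATUS] == 200"                 # HTTP status must be 200
      - "[RESPONSE_TIME] < 500"           # respond within 500 ms
      - "[CERTIFICATE_EXPIRATION] > 48h"  # TLS cert valid for at least 48 hours
```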

The irony of monitoring tools is that they can go down too — and when Gatus fails, it does so silently. Endpoint checks stop executing, alerts stop firing, and your status page goes stale or returns a 502. If Gatus is your only monitoring layer, any downstream outage during a Gatus failure goes completely undetected. Because Gatus is typically the tool you use to catch outages, running a separate check on Gatus itself is a critical reliability practice.

Quick Status Check

#!/bin/bash
# Gatus health check
# Checks health endpoint, API, process, port, and config file

GATUS_HOST="${GATUS_HOST:-localhost}"
GATUS_PORT="${GATUS_PORT:-8080}"
GATUS_CONFIG="${GATUS_CONFIG:-/etc/gatus/config.yaml}"
FAIL=0

echo "=== Gatus Status Check ==="
echo "Host: ${GATUS_HOST}:${GATUS_PORT}"
echo ""

# Check /health endpoint (returns {"healthy":true})
HEALTH_RESP=$(curl -sf --max-time 5 "http://${GATUS_HOST}:${GATUS_PORT}/health" 2>&1)
if echo "${HEALTH_RESP}" | grep -q '"healthy":true'; then
  echo "[OK] Health endpoint: {\"healthy\":true}"
elif echo "${HEALTH_RESP}" | grep -q "healthy"; then
  echo "[WARN] Health endpoint responded but may be degraded: ${HEALTH_RESP}"
else
  echo "[FAIL] Health endpoint unreachable or returned unhealthy"
  FAIL=1
fi

# Check API endpoint statuses
HTTP_CODE=$(curl -so /dev/null -w "%{http_code}" --max-time 5 \
  "http://${GATUS_HOST}:${GATUS_PORT}/api/v1/endpoints/statuses" 2>/dev/null)
if [ "$HTTP_CODE" = "200" ]; then
  echo "[OK] Endpoint statuses API returned HTTP 200"
else
  echo "[FAIL] Endpoint statuses API returned HTTP ${HTTP_CODE}"
  FAIL=1
fi

# Check Gatus process (-x matches the exact process name, so this script's
# own command line cannot produce a false match)
if pgrep -x "gatus" > /dev/null 2>&1; then
  echo "[OK] Gatus process is running"
else
  echo "[WARN] Gatus process not found via pgrep"
fi

# Check port is listening
# \b prevents false matches on longer port numbers (e.g. 8080 vs 80801)
if ss -tlnp 2>/dev/null | grep -qE ":${GATUS_PORT}\b" || \
   netstat -tlnp 2>/dev/null | grep -qE ":${GATUS_PORT}\b"; then
  echo "[OK] Port ${GATUS_PORT} is listening"
else
  echo "[FAIL] Port ${GATUS_PORT} not listening"
  FAIL=1
fi

# Check config file exists and is non-empty
if [ -f "${GATUS_CONFIG}" ]; then
  LINES=$(wc -l < "${GATUS_CONFIG}" 2>/dev/null || echo 0)
  echo "[OK] Config file found: ${GATUS_CONFIG} (${LINES} lines)"
else
  # Try common alternative locations
  for path in ./config.yaml /config/config.yaml /app/config.yaml; do
    if [ -f "$path" ]; then
      echo "[OK] Config found at ${path}"
      GATUS_CONFIG="$path"
      break
    fi
  done
  if [ ! -f "${GATUS_CONFIG}" ]; then
    echo "[WARN] Config file not found at ${GATUS_CONFIG}"
  fi
fi

echo ""
if [ "$FAIL" -eq 0 ]; then
  echo "Result: Gatus appears healthy"
else
  echo "Result: Gatus has failures — review output above"
  exit 1
fi
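To run the check on a schedule, a crontab entry like the following works; the script path and log location are assumptions, adjust to your layout:

```cron
# Run the Gatus health check every 5 minutes and keep a log
*/5 * * * * /usr/local/bin/gatus-check.sh >> /var/log/gatus-check.log 2>&1
```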

Python Health Check

#!/usr/bin/env python3
"""
Gatus health check
Verifies health endpoint, endpoint status API, and monitoring coverage
"""

import json
import os
import subprocess
import sys
import time
from pathlib import Path

import urllib.error as urlerr
import urllib.request as urlreq

HOST = os.environ.get("GATUS_HOST", "localhost")
PORT = int(os.environ.get("GATUS_PORT", "8080"))
BASE_URL = f"http://{HOST}:{PORT}"
GATUS_CONFIG = os.environ.get("GATUS_CONFIG", "/etc/gatus/config.yaml")
WARN_LATENCY_MS = 2000
TIMEOUT = 8
results = []


def check(label, ok, detail=""):
    status = "OK" if ok else ("WARN" if ok is None else "FAIL")
    msg = f"[{status}] {label}"
    if detail:
        msg += f" — {detail}"
    print(msg)
    if ok is not None:
        results.append(ok)
    return ok


def fetch(path, timeout=TIMEOUT):
    try:
        url = f"{BASE_URL}{path}"
        req = urlreq.Request(url, headers={"Accept": "application/json"})
        with urlreq.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urlerr.HTTPError as e:
        return e.code, e.read().decode("utf-8", errors="replace")
    except Exception as e:
        return 0, str(e)


print("=== Gatus Python Health Check ===")
print(f"Target: {BASE_URL}")
print()

# 1. Health endpoint
t0 = time.time()
status, body = fetch("/health")
latency_ms = (time.time() - t0) * 1000

if status == 200:
    try:
        data = json.loads(body)
        healthy = data.get("healthy", False)
        check("Health endpoint", healthy,
              f'{{"healthy":{str(healthy).lower()}}} ({latency_ms:.0f}ms)')
    except json.JSONDecodeError:
        check("Health endpoint", False, f"HTTP 200 but body not valid JSON: {body[:80]}")
else:
    check("Health endpoint", False, f"HTTP {status}")

# 2. Response latency (Go service — should be very fast)
if latency_ms > 0:
    slow = latency_ms > WARN_LATENCY_MS
    if slow:
        check("Response latency", False,
              f"{latency_ms:.0f}ms exceeds {WARN_LATENCY_MS}ms threshold — Go service may be overloaded")
    else:
        check("Response latency", True, f"{latency_ms:.0f}ms (Go service expected to be fast)")

# 3. Endpoint statuses API
status, body = fetch("/api/v1/endpoints/statuses")
if status == 200:
    try:
        data = json.loads(body)
        if isinstance(data, list):
            total = len(data)
            unhealthy = []
            for ep in data:
                name = ep.get("name", ep.get("key", "unknown"))
                results_list = ep.get("results", [])
                if results_list:
                    latest = results_list[-1]
                    if not latest.get("success", True):
                        unhealthy.append(name)
            check("Total endpoints monitored", total > 0, f"{total} endpoint(s) configured")
            if unhealthy:
                check("Endpoint health", False,
                      f"{len(unhealthy)}/{total} endpoint(s) currently failing: "
                      + ", ".join(unhealthy[:5]))
            else:
                check("Endpoint health", True, f"All {total} endpoint(s) passing")
        elif isinstance(data, dict):
            # Paginated or wrapped response
            check("Endpoint statuses API", True, f"HTTP 200, response shape: dict with keys {list(data.keys())[:4]}")
        else:
            check("Endpoint statuses API", False, "Unexpected response format")
    except json.JSONDecodeError:
        check("Endpoint statuses API", False, f"HTTP 200 but body not valid JSON")
else:
    check("Endpoint statuses API", False, f"HTTP {status}")

# 4. Monitoring coverage is confirmed above via the total > 0 check
# 5. Config file exists at a common location
config_paths = [
    Path(GATUS_CONFIG),
    Path("./config.yaml"),
    Path("/config/config.yaml"),
    Path("/app/config.yaml"),
]
config_found = next((p for p in config_paths if p.exists()), None)
if config_found:
    size = config_found.stat().st_size
    check("Config file", True, f"Found at {config_found} ({size} bytes)")
else:
    check("Config file", None, "Not found at common paths (may be mounted differently in Docker)")

# 6. Gatus process
try:
    out = subprocess.run(["pgrep", "-f", "gatus"], capture_output=True, text=True)
    check("Gatus process", out.returncode == 0,
          "running" if out.returncode == 0 else "not found — may be in Docker")
except FileNotFoundError:
    check("Gatus process", None, "pgrep not available")

# 7. Database file (SQLite default)
db_paths = [
    Path("/data/data.db"),
    Path("/app/data.db"),
    Path("./data.db"),
    Path("/etc/gatus/data.db"),
]
db_found = next((p for p in db_paths if p.exists()), None)
if db_found:
    size_mb = db_found.stat().st_size / (1024 * 1024)
    check("SQLite database", True, f"Found at {db_found} ({size_mb:.1f} MB)")
else:
    check("SQLite database", None, "Not found at common paths — may use PostgreSQL or different mount")

print()
failures = [r for r in results if r is False]
if not failures:
    print("Result: Gatus appears healthy")
    sys.exit(0)
else:
    print(f"Result: {len(failures)} check(s) failed — review output above")
    sys.exit(1)
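The latest-result parsing in step 3 can be factored into a small standalone helper, which also makes it easy to unit-test. The sample payload below is a simplified sketch of the statuses response shape, not captured Gatus output:

```python
def failing_endpoints(statuses):
    """Return names of endpoints whose most recent check result failed.

    Expects the list shape served by /api/v1/endpoints/statuses, where each
    entry carries a "results" list ordered oldest to newest.
    """
    failing = []
    for ep in statuses:
        results = ep.get("results", [])
        # An empty history means the endpoint has not been checked yet
        if results and not results[-1].get("success", True):
            failing.append(ep.get("name", ep.get("key", "unknown")))
    return failing


# Illustrative payload (shape assumed from the script above)
sample = [
    {"name": "api", "results": [{"success": True}, {"success": True}]},
    {"name": "db", "results": [{"success": True}, {"success": False}]},
    {"name": "new", "results": []},
]
print(failing_endpoints(sample))  # → ['db']
```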

Common Gatus Outage Causes

| Symptom | Likely Cause | Resolution |
| --- | --- | --- |
| Gatus refuses to start; all monitoring stops | YAML configuration parse error: invalid syntax or an unknown key after an upgrade | Validate the config with gatus --config config.yaml --dry-run; check the Gatus release notes for deprecated config keys |
| Status page loads but history is missing or frozen | SQLite database locked, typically because two Gatus instances write to the same file | Run only one Gatus instance per SQLite file; for HA deployments, switch to PostgreSQL as the storage backend |
| Incidents occur but no alerts are sent | Alerting credentials expired (Slack webhook revoked, PagerDuty key rotated, SMTP password changed) | Test alert credentials via Gatus' built-in alert testing; rotate and update secrets in the config or environment variables |
| High false-positive rate on endpoint checks | Endpoint timeout configured too low for the target's typical response time | Increase client.timeout per endpoint; review response-time condition thresholds; add failure-threshold: 3 to require consecutive failures before alerting |
| DNS check endpoints always failing | DNS resolver misconfigured: the Gatus container is using an incorrect or unreachable nameserver | Set an explicit dns.query-type and verify container DNS settings; use 8.8.8.8 for testing; check Docker DNS resolution from inside the container |
| External endpoint checks always timing out | Docker network isolation preventing outbound connections to monitored services | Ensure the Gatus container is on a network with external internet access; use --network host for internal network monitoring; check egress firewall rules |
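Two of the fixes above, raising the client timeout and requiring consecutive failures, map onto per-endpoint settings. A sketch with placeholder names, assuming a Slack provider is already configured under the top-level alerting block:

```yaml
endpoints:
  - name: api
    url: "https://api.example.org/health"  # placeholder URL
    interval: 30s
    client:
      timeout: 10s            # raise above the target's typical response time
    conditions:
      - "[STATUS] == 200"
    alerts:
      - type: slack
        failure-threshold: 3  # alert only after 3 consecutive failures
        success-threshold: 2  # mark resolved after 2 consecutive successes
```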

Architecture Overview

| Component | Function | Failure Impact |
| --- | --- | --- |
| Gatus Core (Go binary) | Executes all endpoint checks on configured intervals; evaluates conditions and triggers alerts | All monitoring stops; no checks execute and no alerts fire |
| YAML Configuration | Defines endpoints, check intervals, conditions, alerting rules, and UI settings | Parse errors prevent startup; misconfigured endpoints produce false positives or missed failures |
| Storage Backend (SQLite/PostgreSQL) | Persists check history, uptime percentages, and the incident timeline for the status page | Status page shows no history; uptime calculations reset; an SQLite lock blocks all writes |
| HTTP/TCP/DNS/ICMP Check Engine | Executes protocol-specific probes against monitored endpoints | That check type fails silently; endpoints using the protocol appear always healthy |
| Alerting Integrations | Sends notifications to Slack, PagerDuty, Teams, Discord, or email on threshold breach | Incidents are detected but the team is not notified; silent outage without alert delivery |
| Status Page (port 8080) | Public-facing or internal dashboard showing endpoint health, uptime, and incident history | Stakeholders cannot view service health; the reverse proxy returns 502 if Gatus crashes |
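The storage backend described above is controlled by the top-level storage block; a sketch of both options, with a placeholder connection string:

```yaml
# SQLite: single instance only; concurrent writers cause lock errors
storage:
  type: sqlite
  path: /data/data.db

# PostgreSQL: use for HA or multi-instance deployments
# storage:
#   type: postgres
#   path: "postgres://user:password@db.internal:5432/gatus?sslmode=disable"
```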

Uptime History

| Date | Incident Type | Duration | Impact |
| --- | --- | --- | --- |
| 2026-01 | Breaking config schema change in Gatus v5: services key renamed to endpoints | Until manually migrated | Existing configs caused startup failure after upgrade; all monitoring stopped until the config was updated |
| 2025-09 | SQLite WAL corruption after host power loss | Variable (user-managed) | Status page history lost; Gatus required database deletion and a restart to recover |
| 2025-08 | Docker Hub rate limit blocking Gatus image pulls during CI/CD updates | ~3 hours | Automated Gatus deployments failed; running containers unaffected |
| 2025-07 | Slack alerting API changes breaking the Gatus webhook format | Several days until a patch release | Slack alerts silently failing; PagerDuty and email alerts continued normally |

Monitor Gatus Automatically

The fundamental problem with any monitoring tool is that it cannot alert you to its own failure. A crashed Gatus instance is indistinguishable from a perfect day with no incidents — checks simply stop executing silently. ezmon.com monitors your Gatus endpoints from multiple external probes and alerts your team via Slack, PagerDuty, or SMS the moment the /health endpoint stops returning {"healthy":true} or the status page becomes unreachable.

Set up Gatus monitoring free at ezmon.com →

gatus · monitoring · status-page · health-checks · self-hosted · status-checker