Is ComfyUI Down? Real-Time Status & Outage Checker
ComfyUI is an open-source, node-based Stable Diffusion GUI and inference server with over 60,000 GitHub stars. It exposes a modular graph/node UI where each step of the image generation pipeline — model loading, conditioning, sampling, decoding, upscaling — is a discrete node that can be wired together into reusable workflows. It supports SDXL, SD 1.5, Flux, ControlNet, LoRA, IPAdapter, AnimateDiff, and hundreds of community custom nodes via the ComfyUI Manager. Used by AI artists, researchers, game studios, and automated content pipelines, ComfyUI can run on GPU or CPU and exposes a JSON-based workflow API that makes it easy to call programmatically. It is the backbone of many production AI image generation systems.
Because ComfyUI is GPU-bound and tightly coupled to CUDA, Python dependencies, and a large ecosystem of custom nodes, failures are often silent: a single stuck job blocks the entire queue, an OOM error leaves the server responsive but unable to generate, and a broken custom node import can silently remove entire workflow capabilities after a restart.
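Because the server speaks plain JSON over HTTP, a health probe can go beyond pinging and actually submit work. A minimal sketch of queuing a workflow via the /prompt endpoint, assuming a server on the default localhost:8188; the workflow dict here is a placeholder, so export a real graph from the UI with "Save (API Format)" before using this against production:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8188"  # ComfyUI's default listen address


def build_prompt_payload(workflow: dict, client_id: str = "healthcheck") -> bytes:
    """POST body for /prompt: the node graph plus a client id used for WS updates."""
    return json.dumps({"prompt": workflow, "client_id": client_id}).encode()


def submit_workflow(workflow: dict) -> dict:
    """Queue a workflow; the response carries the prompt_id of the queued job."""
    req = urllib.request.Request(
        f"{BASE_URL}/prompt",
        data=build_prompt_payload(workflow),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode())

# Usage (against a live server):
#   graph = json.load(open("workflow_api.json"))  # exported via "Save (API Format)"
#   print(submit_workflow(graph)["prompt_id"])
```

A probe built this way catches the "server up but unable to generate" failure mode that a bare port check misses.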
Quick Status Check
#!/bin/bash
# ComfyUI health check
# Usage: bash check-comfyui.sh [host] [port]
HOST="${1:-localhost}"
PORT="${2:-8188}"
BASE_URL="http://${HOST}:${PORT}"
echo "=== ComfyUI Health Check ==="
echo "Target: ${BASE_URL}"
echo ""
# 1. Check system stats (VRAM, RAM, Python version)
echo "[1/5] Checking system stats..."
STATS=$(curl -sf --max-time 8 "${BASE_URL}/system_stats" 2>/dev/null)
if [ -n "${STATS}" ]; then
echo " OK /system_stats responded"
# Note: the JSON has a space after the colon, so the pattern must allow it
VRAM_FREE=$(echo "${STATS}" | grep -o '"vram_free": *[0-9]*' | grep -o '[0-9]*' | head -1)
if [ -n "${VRAM_FREE}" ]; then
VRAM_FREE_MB=$((VRAM_FREE / 1024 / 1024))
echo " VRAM free: ${VRAM_FREE_MB} MB"
if [ "${VRAM_FREE_MB}" -lt 1024 ]; then
echo " WARN Less than 1 GB VRAM free — generation may OOM"
fi
fi
else
echo " FAIL /system_stats unreachable — ComfyUI may be down"
fi
# 2. Check generation queue
echo "[2/5] Checking generation queue..."
QUEUE=$(curl -sf --max-time 5 "${BASE_URL}/queue" 2>/dev/null)
if [ -n "${QUEUE}" ]; then
echo " OK /queue reachable"
# Count jobs with python3; grep cannot reliably count entries in nested JSON arrays
if command -v python3 > /dev/null 2>&1; then
echo "${QUEUE}" | python3 -c "
import sys, json
q = json.load(sys.stdin)
print(' Running:', len(q.get('queue_running', [])), 'job(s)')
print(' Pending:', len(q.get('queue_pending', [])), 'job(s)')
"
else
echo " INFO python3 not found; skipping job counts"
fi
else
echo " FAIL /queue unreachable"
fi
# 3. Check process / Docker container
echo "[3/5] Checking process/container..."
if docker ps --format '{{.Names}}' 2>/dev/null | grep -qi "comfyui"; then
echo " OK ComfyUI Docker container is running"
elif pgrep -f "[Cc]omfy[Uu][Ii]|main\.py" > /dev/null 2>&1; then
echo " OK ComfyUI process is running"
else
echo " WARN No ComfyUI container or process detected"
fi
# 4. Check GPU with nvidia-smi
echo "[4/5] Checking GPU status..."
if command -v nvidia-smi > /dev/null 2>&1; then
GPU_INFO=$(nvidia-smi --query-gpu=name,memory.free,memory.total,utilization.gpu \
--format=csv,noheader,nounits 2>/dev/null | head -1)
if [ -n "${GPU_INFO}" ]; then
echo " OK GPU: ${GPU_INFO}"
else
echo " WARN nvidia-smi available but returned no GPU info"
fi
else
echo " INFO nvidia-smi not found — CPU mode or non-NVIDIA GPU"
fi
# 5. Check port is listening
echo "[5/5] Checking port ${PORT}..."
if nc -z -w3 "${HOST}" "${PORT}" 2>/dev/null; then
echo " OK Port ${PORT} is open"
else
echo " FAIL Port ${PORT} not reachable"
fi
echo ""
echo "=== Check complete ==="
Python Health Check
#!/usr/bin/env python3
"""
ComfyUI health check
Verifies server responsiveness, VRAM headroom, queue state, job history,
loaded node types, and WebSocket connectivity.
"""
import sys
import json
import time
import socket
import urllib.request
import urllib.error
BASE_URL = "http://localhost:8188"
WS_HOST = "localhost"
WS_PORT = 8188
TIMEOUT = 10
VRAM_WARN_MB = 1024 # warn if free VRAM < 1 GB
NODE_WARN_COUNT = 50 # warn if fewer than this many node types loaded
def fetch(url):
try:
req = urllib.request.Request(url, headers={"Accept": "application/json"})
with urllib.request.urlopen(req, timeout=TIMEOUT) as resp:
return json.loads(resp.read().decode())
except urllib.error.HTTPError as e:
return {"_error": f"HTTP {e.code}"}
except Exception as e:
return {"_error": str(e)}
results = []
print("=== ComfyUI Health Check ===")
print(f"Target: {BASE_URL}\n")
# 1. System stats — VRAM and RAM
print("[1/5] System stats (VRAM & RAM)...")
r = fetch(f"{BASE_URL}/system_stats")
if "_error" in r:
print(f" [FAIL] /system_stats: {r['_error']}")
results.append(False)
else:
devices = r.get("devices", [])
# /system_stats nests RAM under "system" (ram_total / ram_free, in bytes)
system = r.get("system", {})
ram_free_mb = int(system.get("ram_free", 0)) // (1024 * 1024)
print(f" [OK ] /system_stats responded | RAM free: {ram_free_mb} MB")
results.append(True)
for dev in devices:
name = dev.get("name", "unknown")
vram_free = int(dev.get("vram_free", 0)) // (1024 * 1024)
vram_total = int(dev.get("vram_total", 0)) // (1024 * 1024)
vram_pct = int(vram_free / vram_total * 100) if vram_total > 0 else 0
level = "OK " if vram_free >= VRAM_WARN_MB else "WARN"
print(f" [{level}] GPU: {name} | VRAM free: {vram_free} MB / {vram_total} MB ({vram_pct}% free)")
if vram_free < VRAM_WARN_MB:
print(f" Warning: < {VRAM_WARN_MB} MB VRAM free — large generations may OOM")
# 2. Queue — pending and running job counts
print("[2/5] Generation queue...")
r = fetch(f"{BASE_URL}/queue")
if "_error" in r:
print(f" [FAIL] /queue: {r['_error']}")
results.append(False)
else:
running = r.get("queue_running", [])
pending = r.get("queue_pending", [])
print(f" [OK ] Queue: {len(running)} running, {len(pending)} pending")
if len(running) > 0 and len(pending) > 5:
print(f" [WARN] {len(pending)} jobs pending — queue may be stuck")
results.append(True)
# 3. History — recent job success/failure rate
print("[3/5] Job history (recent success rate)...")
r = fetch(f"{BASE_URL}/history?max_items=20")
if "_error" in r:
print(f" [FAIL] /history: {r['_error']}")
results.append(False)
else:
history_items = r if isinstance(r, dict) else {}
total = len(history_items)
if total == 0:
print(" [OK ] History empty — no jobs run yet")
results.append(True)
else:
successes = sum(
1 for v in history_items.values()
if isinstance(v, dict) and v.get("status", {}).get("status_str") == "success"
)
fail_rate = int((total - successes) / total * 100) if total > 0 else 0
level = "OK " if fail_rate < 20 else "WARN"
print(f" [{level}] {successes}/{total} recent jobs succeeded ({fail_rate}% failure rate)")
results.append(fail_rate < 50)
# 4. Object info — count loaded node types
print("[4/5] Loaded node types...")
r = fetch(f"{BASE_URL}/object_info")
if "_error" in r:
print(f" [FAIL] /object_info: {r['_error']}")
results.append(False)
else:
node_count = len(r) if isinstance(r, dict) else 0
level = "OK " if node_count >= NODE_WARN_COUNT else "WARN"
print(f" [{level}] {node_count} node types loaded", end="")
if node_count < NODE_WARN_COUNT:
print(f" — expected >= {NODE_WARN_COUNT}; custom nodes may have failed to import")
else:
print()
results.append(node_count >= NODE_WARN_COUNT)
# 5. WebSocket connectivity
print("[5/5] WebSocket endpoint...")
try:
sock = socket.create_connection((WS_HOST, WS_PORT), timeout=TIMEOUT)
# Send a minimal HTTP upgrade request to confirm WS is listening
handshake = (
f"GET /ws HTTP/1.1\r\n"
f"Host: {WS_HOST}:{WS_PORT}\r\n"
f"Upgrade: websocket\r\n"
f"Connection: Upgrade\r\n"
f"Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==\r\n"
f"Sec-WebSocket-Version: 13\r\n\r\n"
)
sock.sendall(handshake.encode())
response = sock.recv(256).decode(errors="ignore")
sock.close()
if "101" in response or "websocket" in response.lower():
print(f" [OK ] WebSocket at ws://{WS_HOST}:{WS_PORT}/ws accepted upgrade")
results.append(True)
else:
print(f" [WARN] WebSocket handshake response unexpected: {response[:80]}")
results.append(False)
except Exception as e:
print(f" [FAIL] WebSocket connection failed: {e}")
results.append(False)
# Summary
passed = sum(results)
total = len(results)
print(f"\n=== Summary: {passed}/{total} checks passed ===")
if passed < total:
print("Action required: review FAIL/WARN items above.")
sys.exit(1)
else:
print("ComfyUI appears healthy.")
sys.exit(0)
Common ComfyUI Outage Causes
| Symptom | Likely Cause | Resolution |
|---|---|---|
| Generation fails with CUDA out of memory error; server still responsive | VRAM exhausted by model or batch size; previous job left tensors allocated | Restart ComfyUI to free VRAM; reduce resolution or batch size; enable --lowvram or --cpu flag |
| Workflow nodes show as "missing" or "red" after restart | Custom node failed to import due to dependency conflict or syntax error after update | Check ComfyUI console for import errors; use ComfyUI Manager to reinstall or disable the broken node |
| Generation fails with "model not found" or checkpoint error | Checkpoint file missing from models/checkpoints/ or path misconfigured | Verify model file exists at expected path; check extra_model_paths.yaml configuration |
| All generations fail with CUDA / driver error after system update | CUDA toolkit version mismatch between PyTorch and installed NVIDIA drivers | Reinstall PyTorch matching installed CUDA version; run nvidia-smi to confirm driver CUDA version |
| New custom node install breaks existing workflows | Python dependency conflict — new node requires incompatible package version | Use a virtual environment; roll back the conflicting package; check ComfyUI Manager for conflict warnings |
| Queue shows multiple pending jobs but nothing executes | Single failed job stuck at the head of the queue, blocking all subsequent jobs | Clear the stuck job from the queue management UI or via the API (POST /queue with a clear/delete body); restart the server if needed |
Architecture Overview
| Component | Function | Failure Impact |
|---|---|---|
| ComfyUI Python server | HTTP API, workflow execution engine, node graph runner | Complete loss of all generation; queue stops processing |
| Custom nodes (ComfyUI Manager) | Extend workflow capabilities with ControlNet, LoRA, upscalers, etc. | Dependent workflows break; nodes appear missing or red in UI |
| GPU / CUDA runtime | Hardware-accelerated tensor operations for diffusion sampling | OOM crashes generation; driver mismatch makes GPU unavailable |
| Model files (checkpoints, LoRAs, VAEs) | Neural network weights loaded into VRAM for inference | Missing or corrupt model causes workflow to fail at load step |
| WebSocket server | Pushes real-time progress updates and preview images to the browser | UI shows no progress; generation may still run but feedback is lost |
| Job queue (in-memory) | Serializes and orders generation requests; single-threaded execution | Stuck job blocks all subsequent requests until cleared or server restarted |
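The WebSocket feed in the table is how the UI learns about progress: the server pushes JSON frames such as {"type": "progress", "data": {"value": 3, "max": 20}}. A sketch of classifying those frames for a headless pipeline; the message shapes shown are the common stock ones, and custom nodes can emit additional types:

```python
import json


def summarize_ws_message(raw: str) -> str:
    """Turn one ComfyUI WebSocket JSON frame into a one-line status string."""
    msg = json.loads(raw)
    kind, data = msg.get("type", ""), msg.get("data", {})
    if kind == "progress":
        # Sampler step progress for the current node
        return f"step {data.get('value', 0)}/{data.get('max', 0)}"
    if kind == "executing":
        # "node": null signals the workflow has finished
        node = data.get("node")
        return f"running node {node}" if node else "workflow finished"
    if kind == "status":
        remaining = data.get("status", {}).get("exec_info", {}).get("queue_remaining", 0)
        return f"{remaining} job(s) remaining"
    return f"other message: {kind}"
```

Parsing the feed this way restores the visibility the table's "Failure Impact" column warns about losing when only the browser UI is watching.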
Uptime History
| Date | Incident Type | Duration | Impact |
|---|---|---|---|
| Jan 2026 | PyTorch update broke CUDA compatibility on RTX 40-series GPUs | 2–6 hrs (until rollback or patch) | All GPU-based generation failed; CPU fallback available but extremely slow |
| Oct 2025 | Popular custom node (ComfyUI-Impact-Pack) import error after upstream change | 1–4 hrs | Workflows depending on Impact Pack nodes failed; other workflows unaffected |
| Aug 2025 | Queue deadlock — single malformed workflow job blocked all generation | 30 min–2 hrs | All queued generation stalled; required manual queue clear or server restart |
| Jul 2025 | VRAM fragmentation after long uptime caused OOM on previously working workflows | 15–60 min | Intermittent generation failures; resolved by restarting to free VRAM |
Monitor ComfyUI Automatically
ComfyUI has no built-in alerting — a crashed process, a stuck queue, or an OOM condition can go unnoticed for hours in unattended generation pipelines. ezmon.com monitors your ComfyUI endpoints from multiple external probes and alerts your team via Slack, PagerDuty, or SMS the moment /system_stats stops responding or your generation queue stops draining.