Is ComfyUI Down? Real-Time Status & Outage Checker

ComfyUI is an open-source, node-based Stable Diffusion GUI and inference server with over 60,000 GitHub stars. It exposes a modular graph/node UI where each step of the image generation pipeline — model loading, conditioning, sampling, decoding, upscaling — is a discrete node that can be wired together into reusable workflows. It supports SDXL, SD 1.5, Flux, ControlNet, LoRA, IPAdapter, AnimateDiff, and hundreds of community custom nodes via the ComfyUI Manager. Used by AI artists, researchers, game studios, and automated content pipelines, ComfyUI can run on GPU or CPU and exposes a JSON-based workflow API that makes it easy to call programmatically. It is the backbone of many production AI image generation systems.
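The workflow API mentioned above accepts a graph in ComfyUI's "API format" JSON, posted to the server's /prompt endpoint. A minimal client sketch (the endpoint path and payload shape follow ComfyUI's server; `BASE_URL` and `client_id` are placeholders you should adapt):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8188"  # default ComfyUI address; adjust for your host


def build_prompt_payload(workflow: dict, client_id: str = "my-client") -> str:
    """Wrap an API-format workflow graph in the JSON body /prompt expects."""
    return json.dumps({"prompt": workflow, "client_id": client_id})


def queue_prompt(workflow: dict) -> dict:
    """POST a workflow to ComfyUI; the response includes the assigned prompt_id."""
    req = urllib.request.Request(
        f"{BASE_URL}/prompt",
        data=build_prompt_payload(workflow).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode())
```

To get a workflow in API format, use "Save (API Format)" in the ComfyUI interface (with dev mode options enabled), then pass the parsed dict to `queue_prompt`.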

Because ComfyUI is GPU-bound and tightly coupled to CUDA, Python dependencies, and a large ecosystem of custom nodes, failures are often silent: a single stuck job blocks the entire queue, an OOM error leaves the server responsive but unable to generate, and a broken custom node import can silently remove entire workflow capabilities after a restart.

Quick Status Check

#!/bin/bash
# ComfyUI health check
# Usage: bash check-comfyui.sh [host] [port]

HOST="${1:-localhost}"
PORT="${2:-8188}"
BASE_URL="http://${HOST}:${PORT}"

echo "=== ComfyUI Health Check ==="
echo "Target: ${BASE_URL}"
echo ""

# 1. Check system stats (VRAM, RAM, Python version)
echo "[1/5] Checking system stats..."
STATS=$(curl -sf --max-time 8 "${BASE_URL}/system_stats" 2>/dev/null)
if [ -n "${STATS}" ]; then
  echo "  OK  /system_stats responded"
  VRAM_FREE=$(echo "${STATS}" | grep -o '"vram_free":[0-9]*' | grep -o '[0-9]*' | head -1)
  if [ -n "${VRAM_FREE}" ]; then
    VRAM_FREE_MB=$((VRAM_FREE / 1024 / 1024))
    echo "       VRAM free: ${VRAM_FREE_MB} MB"
    if [ "${VRAM_FREE_MB}" -lt 1024 ]; then
      echo "  WARN  Less than 1 GB VRAM free — generation may OOM"
    fi
  fi
else
  echo "  FAIL  /system_stats unreachable — ComfyUI may be down"
fi

# 2. Check generation queue
echo "[2/5] Checking generation queue..."
QUEUE=$(curl -sf --max-time 5 "${BASE_URL}/queue" 2>/dev/null)
if [ -n "${QUEUE}" ]; then
  echo "  OK  Queue endpoint reachable"
  # Queue entries are nested JSON arrays, so count them with a real JSON
  # parser — grep/tr comma-counting overcounts nested arrays and reports
  # 1 job for an empty queue.
  if command -v python3 > /dev/null 2>&1; then
    echo "${QUEUE}" | python3 -c 'import json, sys
q = json.load(sys.stdin)
print("       Running:", len(q.get("queue_running", [])), "job(s)")
print("       Pending:", len(q.get("queue_pending", [])), "job(s)")'
  else
    echo "       (install python3 to show job counts)"
  fi
else
  echo "  FAIL  /queue unreachable"
fi

# 3. Check process / Docker container
echo "[3/5] Checking process/container..."
if docker ps --format '{{.Names}}' 2>/dev/null | grep -qi "comfyui"; then
  echo "  OK  ComfyUI Docker container is running"
elif pgrep -f "[Cc]omfy[Uu][Ii]" > /dev/null 2>&1; then
  # pgrep -f uses extended regex: "comfyui\|main.py" matched a literal pipe,
  # and bare "main.py" matches unrelated Python processes
  echo "  OK  ComfyUI process is running"
else
  echo "  WARN  No ComfyUI container or process detected"
fi

# 4. Check GPU with nvidia-smi
echo "[4/5] Checking GPU status..."
if command -v nvidia-smi > /dev/null 2>&1; then
  GPU_INFO=$(nvidia-smi --query-gpu=name,memory.free,memory.total,utilization.gpu \
    --format=csv,noheader,nounits 2>/dev/null | head -1)
  if [ -n "${GPU_INFO}" ]; then
    echo "  OK  GPU: ${GPU_INFO}"
  else
    echo "  WARN  nvidia-smi available but returned no GPU info"
  fi
else
  echo "  INFO  nvidia-smi not found — CPU mode or non-NVIDIA GPU"
fi

# 5. Check port is listening
echo "[5/5] Checking port ${PORT}..."
if nc -z -w3 "${HOST}" "${PORT}" 2>/dev/null; then
  echo "  OK  Port ${PORT} is open"
else
  echo "  FAIL  Port ${PORT} not reachable"
fi

echo ""
echo "=== Check complete ==="

Python Health Check

#!/usr/bin/env python3
"""
ComfyUI health check
Verifies server responsiveness, VRAM headroom, queue state, job history,
loaded node types, and WebSocket connectivity.
"""

import sys
import json
import socket
import urllib.request
import urllib.error

BASE_URL = "http://localhost:8188"
WS_HOST = "localhost"
WS_PORT = 8188
TIMEOUT = 10
VRAM_WARN_MB = 1024  # warn if free VRAM < 1 GB
NODE_WARN_COUNT = 50  # warn if fewer than this many node types loaded


def fetch(url):
    try:
        req = urllib.request.Request(url, headers={"Accept": "application/json"})
        with urllib.request.urlopen(req, timeout=TIMEOUT) as resp:
            return json.loads(resp.read().decode())
    except urllib.error.HTTPError as e:
        return {"_error": f"HTTP {e.code}"}
    except Exception as e:
        return {"_error": str(e)}


results = []
print("=== ComfyUI Health Check ===")
print(f"Target: {BASE_URL}\n")

# 1. System stats — VRAM and RAM
print("[1/5] System stats (VRAM & RAM)...")
r = fetch(f"{BASE_URL}/system_stats")
if "_error" in r:
    print(f"  [FAIL] /system_stats: {r['_error']}")
    results.append(False)
else:
    devices = r.get("devices", [])
    # /system_stats nests RAM figures under "system" (ram_total / ram_free)
    system = r.get("system", {})
    ram_free_mb = int(system.get("ram_free", 0)) // (1024 * 1024)
    print(f"  [OK  ] /system_stats responded | RAM free: {ram_free_mb} MB")
    results.append(True)
    for dev in devices:
        name = dev.get("name", "unknown")
        vram_free = int(dev.get("vram_free", 0)) // (1024 * 1024)
        vram_total = int(dev.get("vram_total", 0)) // (1024 * 1024)
        vram_pct = int(vram_free / vram_total * 100) if vram_total > 0 else 0
        level = "OK  " if vram_free >= VRAM_WARN_MB else "WARN"
        print(f"  [{level}] GPU: {name} | VRAM free: {vram_free} MB / {vram_total} MB ({vram_pct}% free)")
        if vram_free < VRAM_WARN_MB:
            print(f"       Warning: < {VRAM_WARN_MB} MB VRAM free — large generations may OOM")

# 2. Queue — pending and running job counts
print("[2/5] Generation queue...")
r = fetch(f"{BASE_URL}/queue")
if "_error" in r:
    print(f"  [FAIL] /queue: {r['_error']}")
    results.append(False)
else:
    running = r.get("queue_running", [])
    pending = r.get("queue_pending", [])
    print(f"  [OK  ] Queue: {len(running)} running, {len(pending)} pending")
    if len(running) > 0 and len(pending) > 5:
        print(f"  [WARN] {len(pending)} jobs pending — queue may be stuck")
    results.append(True)

# 3. History — recent job success/failure rate
print("[3/5] Job history (recent success rate)...")
r = fetch(f"{BASE_URL}/history?max_items=20")
if "_error" in r:
    print(f"  [FAIL] /history: {r['_error']}")
    results.append(False)
else:
    history_items = r if isinstance(r, dict) else {}
    total = len(history_items)
    if total == 0:
        print("  [OK  ] History empty — no jobs run yet")
        results.append(True)
    else:
        successes = sum(
            1 for v in history_items.values()
            if isinstance(v, dict) and v.get("status", {}).get("status_str") == "success"
        )
        fail_rate = int((total - successes) / total * 100) if total > 0 else 0
        level = "OK  " if fail_rate < 20 else "WARN"
        print(f"  [{level}] {successes}/{total} recent jobs succeeded ({fail_rate}% failure rate)")
        results.append(fail_rate < 50)

# 4. Object info — count loaded node types
print("[4/5] Loaded node types...")
r = fetch(f"{BASE_URL}/object_info")
if "_error" in r:
    print(f"  [FAIL] /object_info: {r['_error']}")
    results.append(False)
else:
    node_count = len(r) if isinstance(r, dict) else 0
    level = "OK  " if node_count >= NODE_WARN_COUNT else "WARN"
    print(f"  [{level}] {node_count} node types loaded", end="")
    if node_count < NODE_WARN_COUNT:
        print(f" — expected >= {NODE_WARN_COUNT}; custom nodes may have failed to import")
    else:
        print()
    results.append(node_count >= NODE_WARN_COUNT)

# 5. WebSocket connectivity
print("[5/5] WebSocket endpoint...")
try:
    sock = socket.create_connection((WS_HOST, WS_PORT), timeout=TIMEOUT)
    # Send a minimal HTTP upgrade request to confirm WS is listening
    handshake = (
        f"GET /ws HTTP/1.1\r\n"
        f"Host: {WS_HOST}:{WS_PORT}\r\n"
        f"Upgrade: websocket\r\n"
        f"Connection: Upgrade\r\n"
        f"Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==\r\n"
        f"Sec-WebSocket-Version: 13\r\n\r\n"
    )
    sock.sendall(handshake.encode())
    response = sock.recv(256).decode(errors="ignore")
    sock.close()
    if "101" in response or "websocket" in response.lower():
        print(f"  [OK  ] WebSocket at ws://{WS_HOST}:{WS_PORT}/ws accepted upgrade")
        results.append(True)
    else:
        print(f"  [WARN] WebSocket handshake response unexpected: {response[:80]}")
        results.append(False)
except Exception as e:
    print(f"  [FAIL] WebSocket connection failed: {e}")
    results.append(False)

# Summary
passed = sum(results)
total = len(results)
print(f"\n=== Summary: {passed}/{total} checks passed ===")
if passed < total:
    print("Action required: review FAIL/WARN items above.")
    sys.exit(1)
else:
    print("ComfyUI appears healthy.")
    sys.exit(0)

Common ComfyUI Outage Causes

Symptom: Generation fails with a CUDA out-of-memory error; server still responsive
Likely cause: VRAM exhausted by the model or batch size; a previous job left tensors allocated
Resolution: Restart ComfyUI to free VRAM; reduce resolution or batch size; launch with the --lowvram or --cpu flag

Symptom: Workflow nodes show as "missing" or red after a restart
Likely cause: A custom node failed to import due to a dependency conflict or a syntax error introduced by an update
Resolution: Check the ComfyUI console for import errors; use ComfyUI Manager to reinstall or disable the broken node

Symptom: Generation fails with a "model not found" or checkpoint error
Likely cause: Checkpoint file missing from models/checkpoints/ or the path is misconfigured
Resolution: Verify the model file exists at the expected path; check the extra_model_paths.yaml configuration

Symptom: All generations fail with a CUDA / driver error after a system update
Likely cause: CUDA toolkit version mismatch between PyTorch and the installed NVIDIA driver
Resolution: Reinstall PyTorch built for the installed CUDA version; run nvidia-smi to confirm the driver's CUDA version

Symptom: A new custom node install breaks existing workflows
Likely cause: Python dependency conflict — the new node requires an incompatible package version
Resolution: Use a virtual environment; roll back the conflicting package; check ComfyUI Manager for conflict warnings

Symptom: Queue shows multiple pending jobs but nothing executes
Likely cause: A single failed job stuck at the head of the queue blocks all subsequent jobs
Resolution: Clear the stuck job from the queue management UI or via the POST /queue API; restart the server if needed
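The stuck-queue case can be handled programmatically. A sketch using ComfyUI's queue-management endpoints — POST /queue accepts {"clear": true} or {"delete": [prompt_ids]}, and /interrupt aborts the currently executing job; verify both against your server version:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8188"  # adjust for your host


def queue_body(prompt_ids=None) -> str:
    """Body for POST /queue: delete specific pending jobs, or clear everything."""
    if prompt_ids:
        return json.dumps({"delete": list(prompt_ids)})
    return json.dumps({"clear": True})


def clear_queue(prompt_ids=None) -> None:
    """Remove pending jobs, then interrupt the currently running one."""
    req = urllib.request.Request(
        f"{BASE_URL}/queue",
        data=queue_body(prompt_ids).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10).close()
    # A job already executing is no longer in the pending queue — abort it separately.
    urllib.request.urlopen(
        urllib.request.Request(f"{BASE_URL}/interrupt", data=b""), timeout=10
    ).close()
```

Passing specific prompt IDs (from GET /queue) removes only the offending jobs and leaves the rest of the backlog intact.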

Architecture Overview

Component: ComfyUI Python server
Function: HTTP API, workflow execution engine, node graph runner
Failure impact: Complete loss of all generation; queue stops processing

Component: Custom nodes (ComfyUI Manager)
Function: Extend workflow capabilities with ControlNet, LoRA, upscalers, etc.
Failure impact: Dependent workflows break; nodes appear missing or red in the UI

Component: GPU / CUDA runtime
Function: Hardware-accelerated tensor operations for diffusion sampling
Failure impact: OOM crashes generation; a driver mismatch makes the GPU unavailable

Component: Model files (checkpoints, LoRAs, VAEs)
Function: Neural network weights loaded into VRAM for inference
Failure impact: A missing or corrupt model causes the workflow to fail at the load step

Component: WebSocket server
Function: Pushes real-time progress updates and preview images to the browser
Failure impact: UI shows no progress; generation may still run but feedback is lost

Component: Job queue (in-memory)
Function: Serializes and orders generation requests; single-threaded execution
Failure impact: A stuck job blocks all subsequent requests until cleared or the server is restarted

Uptime History

Date: Jan 2026
Incident: PyTorch update broke CUDA compatibility on RTX 40-series GPUs
Duration: 2–6 hrs (until rollback or patch)
Impact: All GPU-based generation failed; CPU fallback available but extremely slow

Date: Oct 2025
Incident: Popular custom node (ComfyUI-Impact-Pack) import error after an upstream change
Duration: 1–4 hrs
Impact: Workflows depending on Impact Pack nodes failed; other workflows unaffected

Date: Aug 2025
Incident: Queue deadlock — a single malformed workflow job blocked all generation
Duration: 30 min–2 hrs
Impact: All queued generation stalled; required a manual queue clear or server restart

Date: Jul 2025
Incident: VRAM fragmentation after long uptime caused OOM on previously working workflows
Duration: 15–60 min
Impact: Intermittent generation failures; resolved by restarting to free VRAM
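For the VRAM-fragmentation case, newer ComfyUI builds expose a POST /free endpoint that unloads models and releases cached memory without a full restart. A sketch — the endpoint and its field names reflect recent server versions, so confirm they exist on yours:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8188"  # adjust for your host


def free_body(unload_models: bool = True, free_memory: bool = True) -> str:
    """Body for POST /free: ask the server to drop loaded models and cached VRAM."""
    return json.dumps({"unload_models": unload_models, "free_memory": free_memory})


def free_vram() -> None:
    """Request a VRAM cleanup instead of restarting the whole server."""
    req = urllib.request.Request(
        f"{BASE_URL}/free",
        data=free_body().encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10).close()
```

This is lighter-weight than a restart because pending queue jobs survive; if OOM errors persist afterwards, a full restart is still the reliable fallback.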

Monitor ComfyUI Automatically

ComfyUI has no built-in alerting — a crashed process, a stuck queue, or an OOM condition can go unnoticed for hours in unattended generation pipelines. ezmon.com monitors your ComfyUI endpoints from multiple external probes and alerts your team via Slack, PagerDuty, or SMS the moment /system_stats stops responding or your generation queue stops draining.

Set up ComfyUI monitoring free at ezmon.com →

Tags: comfyui, stable-diffusion, image-generation, ai-art, self-hosted, status-checker