Is ComfyUI Down? Real-Time Status & Outage Checker

ComfyUI is an open-source, node-based Stable Diffusion GUI and inference server with over 60,000 GitHub stars. It exposes a modular graph/node UI where each step of the image generation pipeline — model loading, conditioning, sampling, decoding, upscaling — is a discrete node that can be wired together into reusable workflows. It supports SDXL, SD 1.5, Flux, ControlNet, LoRA, IPAdapter, AnimateDiff, and hundreds of community custom nodes via the ComfyUI Manager. Used by AI artists, researchers, game studios, and automated content pipelines, ComfyUI can run on GPU or CPU and exposes a JSON-based workflow API that makes it easy to call programmatically. It is the backbone of many production AI image generation systems.
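The workflow API mentioned above accepts a graph in ComfyUI's "API format" JSON, posted to the server's /prompt endpoint. A minimal client sketch (the endpoint path and payload shape follow ComfyUI's server; `BASE_URL` and `client_id` are placeholders you should adapt):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8188"  # default ComfyUI address; adjust for your host


def build_prompt_payload(workflow: dict, client_id: str = "my-client") -> str:
    """Wrap an API-format workflow graph in the JSON body /prompt expects."""
    return json.dumps({"prompt": workflow, "client_id": client_id})


def queue_prompt(workflow: dict) -> dict:
    """POST a workflow to ComfyUI; the response includes the assigned prompt_id."""
    req = urllib.request.Request(
        f"{BASE_URL}/prompt",
        data=build_prompt_payload(workflow).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode())
```

To get a workflow in API format, use "Save (API Format)" in the ComfyUI interface (with dev mode options enabled), then pass the parsed dict to `queue_prompt`.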

Because ComfyUI is GPU-bound and tightly coupled to CUDA, Python dependencies, and a large ecosystem of custom nodes, failures are often silent: a single stuck job blocks the entire queue, an OOM error leaves the server responsive but unable to generate, and a broken custom node import can silently remove entire workflow capabilities after a restart.

Quick Status Check

#!/bin/bash
# ComfyUI health check
# Usage: bash check-comfyui.sh [host] [port]

HOST="${1:-localhost}"
PORT="${2:-8188}"
BASE_URL="http://${HOST}:${PORT}"

echo "=== ComfyUI Health Check ==="
echo "Target: ${BASE_URL}"
echo ""

# 1. Check system stats (VRAM, RAM, Python version)
echo "[1/5] Checking system stats..."
STATS=$(curl -sf --max-time 8 "${BASE_URL}/system_stats" 2>/dev/null)
if [ -n "${STATS}" ]; then
  echo "  OK  /system_stats responded"
  VRAM_FREE=$(echo "${STATS}" | grep -o '"vram_free":[0-9]*' | grep -o '[0-9]*' | head -1)
  if [ -n "${VRAM_FREE}" ]; then
    VRAM_FREE_MB=$((VRAM_FREE / 1024 / 1024))
    echo "       VRAM free: ${VRAM_FREE_MB} MB"
    if [ "${VRAM_FREE_MB}" -lt 1024 ]; then
      echo "  WARN  Less than 1 GB VRAM free — generation may OOM"
    fi
  fi
else
  echo "  FAIL  /system_stats unreachable — ComfyUI may be down"
fi

# 2. Check generation queue
echo "[2/5] Checking generation queue..."
QUEUE=$(curl -sf --max-time 5 "${BASE_URL}/queue" 2>/dev/null)
if [ -n "${QUEUE}" ]; then
  echo "  OK  Queue endpoint reachable"
  # Queue entries are nested JSON arrays, so count them with a real JSON
  # parser — grep/tr comma-counting overcounts nested arrays and reports
  # 1 job for an empty queue.
  if command -v python3 > /dev/null 2>&1; then
    echo "${QUEUE}" | python3 -c 'import json, sys
q = json.load(sys.stdin)
print("       Running:", len(q.get("queue_running", [])), "job(s)")
print("       Pending:", len(q.get("queue_pending", [])), "job(s)")'
  else
    echo "       (install python3 to show job counts)"
  fi
else
  echo "  FAIL  /queue unreachable"
fi

# 3. Check process / Docker container
echo "[3/5] Checking process/container..."
if docker ps --format '{{.Names}}' 2>/dev/null | grep -qi "comfyui"; then
  echo "  OK  ComfyUI Docker container is running"
elif pgrep -f "[Cc]omfy[Uu][Ii]" > /dev/null 2>&1; then
  # pgrep -f uses extended regex: "comfyui\|main.py" matched a literal pipe,
  # and bare "main.py" matches unrelated Python processes
  echo "  OK  ComfyUI process is running"
else
  echo "  WARN  No ComfyUI container or process detected"
fi

# 4. Check GPU with nvidia-smi
echo "[4/5] Checking GPU status..."
if command -v nvidia-smi > /dev/null 2>&1; then
  GPU_INFO=$(nvidia-smi --query-gpu=name,memory.free,memory.total,utilization.gpu \
    --format=csv,noheader,nounits 2>/dev/null | head -1)
  if [ -n "${GPU_INFO}" ]; then
    echo "  OK  GPU: ${GPU_INFO}"
  else
    echo "  WARN  nvidia-smi available but returned no GPU info"
  fi
else
  echo "  INFO  nvidia-smi not found — CPU mode or non-NVIDIA GPU"
fi

# 5. Check port is listening
echo "[5/5] Checking port ${PORT}..."
if nc -z -w3 "${HOST}" "${PORT}" 2>/dev/null; then
  echo "  OK  Port ${PORT} is open"
else
  echo "  FAIL  Port ${PORT} not reachable"
fi

echo ""
echo "=== Check complete ==="

Python Health Check

#!/usr/bin/env python3
"""
ComfyUI health check
Verifies server responsiveness, VRAM headroom, queue state, job history,
loaded node types, and WebSocket connectivity.
"""

import sys
import json
import socket
import urllib.request
import urllib.error

BASE_URL = "http://localhost:8188"
WS_HOST = "localhost"
WS_PORT = 8188
TIMEOUT = 10
VRAM_WARN_MB = 1024  # warn if free VRAM < 1 GB
NODE_WARN_COUNT = 50  # warn if fewer than this many node types loaded


def fetch(url):
    try:
        req = urllib.request.Request(url, headers={"Accept": "application/json"})
        with urllib.request.urlopen(req, timeout=TIMEOUT) as resp:
            return json.loads(resp.read().decode())
    except urllib.error.HTTPError as e:
        return {"_error": f"HTTP {e.code}"}
    except Exception as e:
        return {"_error": str(e)}


results = []
print("=== ComfyUI Health Check ===")
print(f"Target: {BASE_URL}\n")

# 1. System stats — VRAM and RAM
print("[1/5] System stats (VRAM & RAM)...")
r = fetch(f"{BASE_URL}/system_stats")
if "_error" in r:
    print(f"  [FAIL] /system_stats: {r['_error']}")
    results.append(False)
else:
    devices = r.get("devices", [])
    # /system_stats nests RAM figures under "system" (ram_total / ram_free)
    system = r.get("system", {})
    ram_free_mb = int(system.get("ram_free", 0)) // (1024 * 1024)
    print(f"  [OK  ] /system_stats responded | RAM free: {ram_free_mb} MB")
    results.append(True)
    for dev in devices:
        name = dev.get("name", "unknown")
        vram_free = int(dev.get("vram_free", 0)) // (1024 * 1024)
        vram_total = int(dev.get("vram_total", 0)) // (1024 * 1024)
        vram_pct = int(vram_free / vram_total * 100) if vram_total > 0 else 0
        level = "OK  " if vram_free >= VRAM_WARN_MB else "WARN"
        print(f"  [{level}] GPU: {name} | VRAM free: {vram_free} MB / {vram_total} MB ({vram_pct}% free)")
        if vram_free < VRAM_WARN_MB:
            print(f"       Warning: < {VRAM_WARN_MB} MB VRAM free — large generations may OOM")

# 2. Queue — pending and running job counts
print("[2/5] Generation queue...")
r = fetch(f"{BASE_URL}/queue")
if "_error" in r:
    print(f"  [FAIL] /queue: {r['_error']}")
    results.append(False)
else:
    running = r.get("queue_running", [])
    pending = r.get("queue_pending", [])
    print(f"  [OK  ] Queue: {len(running)} running, {len(pending)} pending")
    if len(running) > 0 and len(pending) > 5:
        print(f"  [WARN] {len(pending)} jobs pending — queue may be stuck")
    results.append(True)

# 3. History — recent job success/failure rate
print("[3/5] Job history (recent success rate)...")
r = fetch(f"{BASE_URL}/history?max_items=20")
if "_error" in r:
    print(f"  [FAIL] /history: {r['_error']}")
    results.append(False)
else:
    history_items = r if isinstance(r, dict) else {}
    total = len(history_items)
    if total == 0:
        print("  [OK  ] History empty — no jobs run yet")
        results.append(True)
    else:
        successes = sum(
            1 for v in history_items.values()
            if isinstance(v, dict) and v.get("status", {}).get("status_str") == "success"
        )
        fail_rate = int((total - successes) / total * 100) if total > 0 else 0
        level = "OK  " if fail_rate < 20 else "WARN"
        print(f"  [{level}] {successes}/{total} recent jobs succeeded ({fail_rate}% failure rate)")
        results.append(fail_rate < 50)

# 4. Object info — count loaded node types
print("[4/5] Loaded node types...")
r = fetch(f"{BASE_URL}/object_info")
if "_error" in r:
    print(f"  [FAIL] /object_info: {r['_error']}")
    results.append(False)
else:
    node_count = len(r) if isinstance(r, dict) else 0
    level = "OK  " if node_count >= NODE_WARN_COUNT else "WARN"
    print(f"  [{level}] {node_count} node types loaded", end="")
    if node_count < NODE_WARN_COUNT:
        print(f" — expected >= {NODE_WARN_COUNT}; custom nodes may have failed to import")
    else:
        print()
    results.append(node_count >= NODE_WARN_COUNT)

# 5. WebSocket connectivity
print("[5/5] WebSocket endpoint...")
try:
    sock = socket.create_connection((WS_HOST, WS_PORT), timeout=TIMEOUT)
    # Send a minimal HTTP upgrade request to confirm WS is listening
    handshake = (
        f"GET /ws HTTP/1.1\r\n"
        f"Host: {WS_HOST}:{WS_PORT}\r\n"
        f"Upgrade: websocket\r\n"
        f"Connection: Upgrade\r\n"
        f"Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==\r\n"
        f"Sec-WebSocket-Version: 13\r\n\r\n"
    )
    sock.sendall(handshake.encode())
    response = sock.recv(256).decode(errors="ignore")
    sock.close()
    if "101" in response or "websocket" in response.lower():
        print(f"  [OK  ] WebSocket at ws://{WS_HOST}:{WS_PORT}/ws accepted upgrade")
        results.append(True)
    else:
        print(f"  [WARN] WebSocket handshake response unexpected: {response[:80]}")
        results.append(False)
except Exception as e:
    print(f"  [FAIL] WebSocket connection failed: {e}")
    results.append(False)

# Summary
passed = sum(results)
total = len(results)
print(f"\n=== Summary: {passed}/{total} checks passed ===")
if passed < total:
    print("Action required: review FAIL/WARN items above.")
    sys.exit(1)
else:
    print("ComfyUI appears healthy.")
    sys.exit(0)

Common ComfyUI Outage Causes

Symptom: Generation fails with a CUDA out-of-memory error; server still responsive
Likely cause: VRAM exhausted by the model or batch size; a previous job left tensors allocated
Resolution: Restart ComfyUI to free VRAM; reduce resolution or batch size; launch with the --lowvram or --cpu flag

Symptom: Workflow nodes show as "missing" or red after a restart
Likely cause: A custom node failed to import due to a dependency conflict or a syntax error introduced by an update
Resolution: Check the ComfyUI console for import errors; use ComfyUI Manager to reinstall or disable the broken node

Symptom: Generation fails with a "model not found" or checkpoint error
Likely cause: Checkpoint file missing from models/checkpoints/ or the path is misconfigured
Resolution: Verify the model file exists at the expected path; check the extra_model_paths.yaml configuration

Symptom: All generations fail with a CUDA / driver error after a system update
Likely cause: CUDA toolkit version mismatch between PyTorch and the installed NVIDIA driver
Resolution: Reinstall PyTorch built for the installed CUDA version; run nvidia-smi to confirm the driver's CUDA version

Symptom: A new custom node install breaks existing workflows
Likely cause: Python dependency conflict — the new node requires an incompatible package version
Resolution: Use a virtual environment; roll back the conflicting package; check ComfyUI Manager for conflict warnings

Symptom: Queue shows multiple pending jobs but nothing executes
Likely cause: A single failed job stuck at the head of the queue blocks all subsequent jobs
Resolution: Clear the stuck job from the queue management UI or via the POST /queue API; restart the server if needed
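The stuck-queue case can be handled programmatically. A sketch using ComfyUI's queue-management endpoints — POST /queue accepts {"clear": true} or {"delete": [prompt_ids]}, and /interrupt aborts the currently executing job; verify both against your server version:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8188"  # adjust for your host


def queue_body(prompt_ids=None) -> str:
    """Body for POST /queue: delete specific pending jobs, or clear everything."""
    if prompt_ids:
        return json.dumps({"delete": list(prompt_ids)})
    return json.dumps({"clear": True})


def clear_queue(prompt_ids=None) -> None:
    """Remove pending jobs, then interrupt the currently running one."""
    req = urllib.request.Request(
        f"{BASE_URL}/queue",
        data=queue_body(prompt_ids).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10).close()
    # A job already executing is no longer in the pending queue — abort it separately.
    urllib.request.urlopen(
        urllib.request.Request(f"{BASE_URL}/interrupt", data=b""), timeout=10
    ).close()
```

Passing specific prompt IDs (from GET /queue) removes only the offending jobs and leaves the rest of the backlog intact.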

Architecture Overview

Component: ComfyUI Python server
Function: HTTP API, workflow execution engine, node graph runner
Failure impact: Complete loss of all generation; queue stops processing

Component: Custom nodes (ComfyUI Manager)
Function: Extend workflow capabilities with ControlNet, LoRA, upscalers, etc.
Failure impact: Dependent workflows break; nodes appear missing or red in the UI

Component: GPU / CUDA runtime
Function: Hardware-accelerated tensor operations for diffusion sampling
Failure impact: OOM crashes generation; a driver mismatch makes the GPU unavailable

Component: Model files (checkpoints, LoRAs, VAEs)
Function: Neural network weights loaded into VRAM for inference
Failure impact: A missing or corrupt model causes the workflow to fail at the load step

Component: WebSocket server
Function: Pushes real-time progress updates and preview images to the browser
Failure impact: UI shows no progress; generation may still run but feedback is lost

Component: Job queue (in-memory)
Function: Serializes and orders generation requests; single-threaded execution
Failure impact: A stuck job blocks all subsequent requests until cleared or the server is restarted

Uptime History

Date: Jan 2026
Incident: PyTorch update broke CUDA compatibility on RTX 40-series GPUs
Duration: 2–6 hrs (until rollback or patch)
Impact: All GPU-based generation failed; CPU fallback available but extremely slow

Date: Oct 2025
Incident: Popular custom node (ComfyUI-Impact-Pack) import error after an upstream change
Duration: 1–4 hrs
Impact: Workflows depending on Impact Pack nodes failed; other workflows unaffected

Date: Aug 2025
Incident: Queue deadlock — a single malformed workflow job blocked all generation
Duration: 30 min–2 hrs
Impact: All queued generation stalled; required a manual queue clear or server restart

Date: Jul 2025
Incident: VRAM fragmentation after long uptime caused OOM on previously working workflows
Duration: 15–60 min
Impact: Intermittent generation failures; resolved by restarting to free VRAM
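For the VRAM-fragmentation case, newer ComfyUI builds expose a POST /free endpoint that unloads models and releases cached memory without a full restart. A sketch — the endpoint and its field names reflect recent server versions, so confirm they exist on yours:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8188"  # adjust for your host


def free_body(unload_models: bool = True, free_memory: bool = True) -> str:
    """Body for POST /free: ask the server to drop loaded models and cached VRAM."""
    return json.dumps({"unload_models": unload_models, "free_memory": free_memory})


def free_vram() -> None:
    """Request a VRAM cleanup instead of restarting the whole server."""
    req = urllib.request.Request(
        f"{BASE_URL}/free",
        data=free_body().encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10).close()
```

This is lighter-weight than a restart because pending queue jobs survive; if OOM errors persist afterwards, a full restart is still the reliable fallback.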

Monitor ComfyUI Automatically

ComfyUI has no built-in alerting — a crashed process, a stuck queue, or an OOM condition can go unnoticed for hours in unattended generation pipelines. ezmon.com monitors your ComfyUI endpoints from multiple external probes and alerts your team via Slack, PagerDuty, or SMS the moment /system_stats stops responding or your generation queue stops draining.

Set up ComfyUI monitoring free at ezmon.com →

Tags: comfyui, stable-diffusion, image-generation, ai-art, self-hosted, status-checker