Is Kubernetes Down? How to Diagnose K8s Cluster Issues in 2026
Kubernetes cluster problems come in layers: control plane, node, networking, DNS, storage, or application. This guide walks you through each layer so you can isolate the failure fast — whether it's your cluster or an upstream managed service.
Quick Triage: What's Actually Down?
Before assuming Kubernetes itself is broken, determine where in the stack the problem lives:
| Symptom | Likely Layer | First Check |
|---|---|---|
| kubectl commands time out or return connection errors | Control plane / API server | Check managed K8s status page (below) |
| Pods stuck in Pending | Scheduling / Node capacity | kubectl describe pod <pod> → Events section |
| Pods in CrashLoopBackOff | Application / config error | kubectl logs <pod> --previous |
| Pods in ImagePullBackOff | Container registry | Check registry status, verify credentials |
| Service unavailable but pods are Running | Networking / DNS / Service mesh | kubectl exec + curl from inside cluster |
| Nodes showing NotReady | Node / cloud provider | Check node events: kubectl describe node <node> |
| Persistent volumes stuck in Terminating | Storage / CSI driver | Check finalizers, CSI node plugin pods |
| Everything looks fine but no traffic reaches pods | Ingress / Load balancer | Check ingress controller logs and cloud LB status |
Managed Kubernetes Status Pages
If you're running on a managed Kubernetes service, check the provider's status page first. Control plane issues are often the provider's problem, not yours.
| Provider | Service | Status Page |
|---|---|---|
| Amazon Web Services | Amazon EKS | health.aws.amazon.com |
| Google Cloud | Google Kubernetes Engine (GKE) | status.cloud.google.com |
| Microsoft Azure | Azure Kubernetes Service (AKS) | azure.status.microsoft |
| DigitalOcean | DOKS (DigitalOcean Kubernetes) | status.digitalocean.com |
| Linode/Akamai | LKE (Linode Kubernetes Engine) | status.linode.com |
| Civo | Civo Kubernetes | status.civo.com |
What to look for on managed K8s status pages: Control plane availability, etcd health, API server latency, node provisioning failures. Even a "minor degradation" on the control plane can make kubectl appear broken while your workloads continue running.
Control Plane Diagnostics
Is the API Server Responding?
# Basic connectivity check
kubectl cluster-info
# Check API server health endpoints directly (replace with your cluster endpoint)
curl -k https://<cluster-endpoint>/healthz
# On 1.16+, /livez and /readyz give more granular per-check signals
curl -k https://<cluster-endpoint>/readyz?verbose
# Check component status (deprecated in 1.19+ but still useful)
kubectl get componentstatuses
# More reliable: check system pods
kubectl get pods -n kube-system
Control Plane Component Health (Self-Managed Clusters)
# Check control plane pods (kubeadm clusters)
kubectl get pods -n kube-system | grep -E 'etcd|apiserver|controller|scheduler'
# Check etcd health (requires etcdctl access)
etcdctl endpoint health --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# API server logs (systemd-based)
journalctl -u kube-apiserver --since "10 minutes ago"
On managed services (EKS/GKE/AKS): You don't have direct access to control plane components. Use the provider status page and check node/system pod health as a proxy.
Node Diagnostics
# List all nodes and their status
kubectl get nodes -o wide
# Detailed node status — check Conditions section
kubectl describe node <node-name>
# Key conditions to check:
# Ready = False → node is failing health checks
# MemoryPressure = True → OOM risk
# DiskPressure = True → disk full
# PIDPressure = True → too many processes
# NetworkUnavailable = True → CNI problem
# Node resource usage
kubectl top nodes # requires metrics-server
# Get node events
kubectl get events --all-namespaces --field-selector source=kubelet --sort-by=.lastTimestamp
Node NotReady — Common Causes
| Condition | Root Cause | Fix |
|---|---|---|
| NetworkUnavailable | CNI plugin not running (Calico, Flannel, Cilium) | Check CNI daemonset pods: kubectl get pods -n kube-system |
| MemoryPressure | Node OOM — too many pods or memory leak | Check pod memory requests/limits, evict or cordon node |
| DiskPressure | Container image cache full, log rotation failing | Clean up unused images: crictl rmi --prune |
| No recent heartbeat | kubelet crashed, node rebooted, cloud instance terminated | Check cloud provider console, kubelet service status |
Pod Diagnostics
# Check pod status across all namespaces
kubectl get pods --all-namespaces | grep -v Running | grep -v Completed
# Describe a failing pod (Events section is most useful)
kubectl describe pod <pod-name> -n <namespace>
# Get logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous # previous container instance
# Watch pod restarts in real time
kubectl get pods -n <namespace> -w
Pod Status Reference
| Status | Meaning | Diagnostic Approach |
|---|---|---|
| Pending | Not scheduled to any node | describe pod → Events; check resources, node selectors, taints/tolerations |
| ContainerCreating | Image pulling or volume mounting | Check image pull policy, registry credentials, PVC status |
| ImagePullBackOff | Can't pull container image | Verify image name, tag, registry credentials (imagePullSecrets) |
| CrashLoopBackOff | Container starting then crashing repeatedly | logs --previous, check readiness probe, resource limits |
| OOMKilled | Container exceeded memory limit | Increase memory limit or fix memory leak |
| Evicted | Node ran out of resources | Check node pressure, increase resources or reduce pod density |
| Terminating (stuck) | Finalizer blocking deletion | Check finalizers: kubectl get pod <pod> -o yaml \| grep finalizers |
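Several of the failure modes above trace back to a handful of pod-spec fields. A minimal illustrative fragment (the image, registry, secret name, and limits are placeholders, not values from this guide):

```yaml
# Illustrative pod spec — all names and values are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: example-app
spec:
  imagePullSecrets:
  - name: registry-credentials     # missing/wrong secret -> ImagePullBackOff
  containers:
  - name: app
    image: registry.example.com/app:1.2.3   # wrong name/tag -> ImagePullBackOff
    resources:
      requests:
        memory: "128Mi"            # too large for any node -> Pending
        cpu: "100m"
      limits:
        memory: "256Mi"            # exceeded at runtime -> OOMKilled
```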
Networking Diagnostics
Service DNS Not Resolving
# Test DNS from inside a pod
kubectl run dns-test --image=busybox:1.28 --rm -it --restart=Never -- nslookup kubernetes.default
# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns
# Test service resolution (run from inside a pod, e.g. via the dns-test pod above)
nslookup <service-name>.<namespace>.svc.cluster.local
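If you need a longer-lived debug pod rather than the one-shot busybox run above, a throwaway manifest along these lines works (the pod name is arbitrary):

```yaml
# Throwaway DNS/network debug pod — delete it when you're done.
apiVersion: v1
kind: Pod
metadata:
  name: dns-debug
spec:
  restartPolicy: Never
  containers:
  - name: debug
    image: busybox:1.28            # newer busybox builds ship a broken nslookup
    command: ["sleep", "3600"]     # keep the pod alive for repeated exec sessions
```

Then run lookups interactively: kubectl exec -it dns-debug -- nslookup kubernetes.default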
Pod-to-Pod Connectivity
# Exec into a running pod and curl another service (use wget -qO- if the image lacks curl)
kubectl exec -it <pod-name> -n <namespace> -- curl http://<service-name>:<port>/healthz
# Check if service has endpoints
kubectl get endpoints <service-name> -n <namespace>
# Empty ENDPOINTS column = no pods matching service selector
# Check network policies
kubectl get networkpolicies --all-namespaces
# A NetworkPolicy blocking ingress/egress is a common "is it down?" false alarm
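To make the false-alarm concrete: once any NetworkPolicy selects a pod, all traffic not explicitly allowed is dropped. A policy like the following (namespace and labels are hypothetical) allows only same-namespace ingress, which silently breaks cross-namespace and external callers:

```yaml
# Hypothetical policy: pods labelled app=web accept traffic only from
# pods in the same namespace; everything else is dropped.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: web-same-namespace-only
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: web
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector: {}    # empty pod selector = all pods, but same namespace only
```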
Storage Diagnostics
# Check PersistentVolumeClaims
kubectl get pvc --all-namespaces | grep -v Bound
# Describe a pending or failed PVC
kubectl describe pvc <pvc-name> -n <namespace>
# Check StorageClass provisioner
kubectl get storageclass
# Check CSI driver pods (common in cloud managed K8s)
kubectl get pods -n kube-system | grep csi
Common storage issues: PVC stuck in Pending (the provisioner can't create the volume), PVC stuck in Terminating (a finalizer is blocking deletion, or the PV hasn't been recycled yet), and a ReadWriteOnce volume still attached to a node other than the one the pod was scheduled to.
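For reference, a PVC stuck in Pending often just names a StorageClass whose provisioner isn't running. A minimal example for comparison (the class name is a placeholder; check it against kubectl get storageclass):

```yaml
# Minimal PVC sketch — storageClassName must match an existing class.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-example
spec:
  accessModes:
  - ReadWriteOnce            # RWO volumes attach to one node at a time
  storageClassName: standard # no matching provisioner -> stuck in Pending
  resources:
    requests:
      storage: 10Gi
```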
Is It Kubernetes or Your Application?
| Symptom | Kubernetes Issue | Application Issue |
|---|---|---|
| All pods in a deployment fail | Node failure, resource exhaustion, CNI problem | Bad deployment (config error, bad image), shared secret/configmap broken |
| One pod fails intermittently | Flaky node (disk, memory, CPU steal) | Memory leak, deadlock, unhandled exception under load |
| External requests time out | Ingress controller, load balancer, NodePort issue | Slow database query, upstream API timeout, thread exhaustion |
| kubectl works, app is down | Not K8s (cluster healthy) | Application bug, config change, upstream dependency |
| New pods won't start (rolling deploy stuck) | Resource quota, node pressure, taint mismatch | Readiness probe failing (bad health check endpoint) |
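A stuck rolling deploy from the last row usually means the readiness probe never passes, so new pods stay NotReady and the rollout stalls. A container fragment for comparison (the path and port are assumptions about your app, not prescribed values):

```yaml
# Container fragment: if /healthz never returns 2xx, the pod never becomes
# Ready and the Deployment rollout hangs at its maxUnavailable budget.
containers:
- name: app
  image: registry.example.com/app:1.2.3
  ports:
  - containerPort: 8080
  readinessProbe:
    httpGet:
      path: /healthz         # must be a real, fast, unauthenticated endpoint
      port: 8080
    initialDelaySeconds: 5
    periodSeconds: 10
    failureThreshold: 3
```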
Notable Managed K8s Incidents — Q1 2026
- AWS EKS — Elastic Load Balancing Delays (Feb 2026): EKS clusters in us-east-1 experienced delays in load balancer target group registration during peak provisioning. Pods were running but not receiving traffic for 10–20 minutes after becoming Ready. Workaround: reduce deregistration_delay.timeout_seconds.
- GKE — Cluster Upgrades (Ongoing): GKE's automated upgrade windows occasionally cause brief API server interruptions. Check GKE maintenance window settings to schedule around production load.
- AKS — Azure OpenAI Integration (Mar 9–10, 2026): Workloads using Azure OpenAI via pod identity/workload identity saw failures during the Azure OpenAI multi-region outage. The AKS nodes were healthy — the dependency was the issue.
Monitoring Kubernetes with Ezmon
Kubernetes health has multiple layers. Ezmon monitors the endpoints that matter to users — the HTTP/HTTPS services your pods expose — from multiple global locations simultaneously.
What this catches:
- Ingress failures (your pods are Running but traffic isn't reaching them)
- Partial outages (one region's nodes fail but others are fine)
- Slow degradation before a full outage (response time spikes before complete failure)
- Recovery confirmation (multi-probe verification before marking a service as recovered)
For internal cluster health (pod/node/etcd monitoring), you'll want Prometheus + Alertmanager inside the cluster. Ezmon handles the external availability layer — the perspective your users actually experience.
Start monitoring your Kubernetes-backed services →
Related Guides
- Is AWS Down? How to Check AWS Status
- Is Cloudflare Down? (Including Workers + CDN)
- Is GitHub Down? (Including Actions, Pages, Codespaces)
- Datadog Alternatives for 2026
- Monitoring Best Practices 2026
Kubernetes and cloud provider status sourced from official status pages: AWS Health, Google Cloud Status, Azure Status. All times UTC.