
Is Kubernetes Down? How to Diagnose K8s Cluster Issues in 2026


Kubernetes cluster problems come in layers: control plane, node, networking, DNS, storage, or application. This guide walks you through each layer so you can isolate the failure fast — whether it's your cluster or an upstream managed service.


Quick Triage: What's Actually Down?

Before assuming Kubernetes itself is broken, determine where in the stack the problem lives:

| Symptom | Likely Layer | First Check |
| --- | --- | --- |
| kubectl commands time out or return connection errors | Control plane / API server | Check managed K8s status page (below) |
| Pods stuck in Pending | Scheduling / node capacity | kubectl describe pod <pod> → Events section |
| Pods in CrashLoopBackOff | Application / config error | kubectl logs <pod> --previous |
| Pods in ImagePullBackOff | Container registry | Check registry status, verify credentials |
| Service unavailable but pods are Running | Networking / DNS / service mesh | kubectl exec + curl from inside cluster |
| Nodes showing NotReady | Node / cloud provider | Check node events: kubectl describe node <node> |
| Persistent volumes stuck in Terminating | Storage / CSI driver | Check finalizers, CSI node plugin pods |
| Everything looks fine but no traffic reaches pods | Ingress / load balancer | Check ingress controller logs and cloud LB status |
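The first three rows of this table can be bundled into a quick scripted sweep. A minimal sketch (the function names are our own; each expects `kubectl get ... --no-headers` output on stdin):

```shell
# Triage filters: pipe real cluster output through them, e.g.
#   kubectl get nodes --no-headers | unhealthy_nodes
#   kubectl get pods --all-namespaces --no-headers | unhealthy_pods

# Nodes whose STATUS column is not exactly "Ready"
unhealthy_nodes() { awk '$2 != "Ready"'; }

# Pods that are neither Running nor Completed
unhealthy_pods() { grep -vE 'Running|Completed'; }
```

If both filters come back empty but users still see errors, skip ahead to the networking and ingress sections: the cluster itself is likely healthy.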

Managed Kubernetes Status Pages

If you're running on a managed Kubernetes service, check the provider's status page first. Control plane issues are often the provider's problem, not yours.

| Provider | Service | Status Page |
| --- | --- | --- |
| Amazon Web Services | Amazon EKS | health.aws.amazon.com |
| Google Cloud | Google Kubernetes Engine (GKE) | status.cloud.google.com |
| Microsoft Azure | Azure Kubernetes Service (AKS) | azure.status.microsoft |
| DigitalOcean | DOKS (DigitalOcean Kubernetes) | status.digitalocean.com |
| Linode/Akamai | LKE (Linode Kubernetes Engine) | status.linode.com |
| Civo | Civo Kubernetes | status.civo.com |

What to look for on managed K8s status pages: Control plane availability, etcd health, API server latency, node provisioning failures. Even a "minor degradation" on the control plane can make kubectl appear broken while your workloads continue running.
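Some of these pages (DigitalOcean's and Linode's, for example) are hosted on Atlassian Statuspage, which also serves a machine-readable summary at /api/v2/status.json. A hedged sketch for scripting that check (the grep-based extraction is a convenience; jq is cleaner if you have it):

```shell
# Extract the overall "indicator" field from a Statuspage status.json body.
# Values are: none, minor, major, critical.
statuspage_indicator() {
  grep -o '"indicator":"[a-z]*"' | cut -d'"' -f4
}

# Usage (URL is illustrative; substitute your provider's status page):
#   curl -fsS https://status.digitalocean.com/api/v2/status.json | statuspage_indicator
```

Anything other than "none" means the provider is already reporting trouble, so investigate there before digging into your own cluster.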


Control Plane Diagnostics

Is the API Server Responding?

# Basic connectivity check
kubectl cluster-info

# Check API server directly (replace URL with your cluster endpoint)
curl -k https://<cluster-endpoint>/healthz

# Check component status (deprecated in 1.19+ but still useful)
kubectl get componentstatuses

# More reliable: check system pods
kubectl get pods -n kube-system

Control Plane Component Health (Self-Managed Clusters)

# Check control plane pods (kubeadm clusters)
kubectl get pods -n kube-system | grep -E 'etcd|apiserver|controller|scheduler'

# Check etcd health (requires etcdctl access)
etcdctl endpoint health --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# API server logs (systemd-based)
journalctl -u kube-apiserver --since "10 minutes ago"

On managed services (EKS/GKE/AKS): You don't have direct access to control plane components. Use the provider status page and check node/system pod health as a proxy.
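One rough proxy for that on managed clusters: count how many system pods are in a bad state, since they depend directly on the control plane. A sketch (the helper name is our own):

```shell
# Count kube-system pods that are not Running or Completed.
# A sudden spike here often correlates with control plane or node trouble.
#   kubectl get pods -n kube-system --no-headers | bad_system_pod_count
bad_system_pod_count() {
  awk '$3 != "Running" && $3 != "Completed" { n++ } END { print n + 0 }'
}
```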


Node Diagnostics

# List all nodes and their status
kubectl get nodes -o wide

# Detailed node status — check Conditions section
kubectl describe node <node-name>

# Key conditions to check:
# Ready = False → node is failing health checks
# MemoryPressure = True → OOM risk
# DiskPressure = True → disk full
# PIDPressure = True → too many processes
# NetworkUnavailable = True → CNI problem

# Node resource usage
kubectl top nodes  # requires metrics-server

# Get recent node-related events, oldest first
kubectl get events --all-namespaces --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp

Node NotReady — Common Causes

| Condition | Root Cause | Fix |
| --- | --- | --- |
| NetworkUnavailable | CNI plugin not running (Calico, Flannel, Cilium) | Check CNI daemonset pods: kubectl get pods -n kube-system |
| MemoryPressure | Node OOM: too many pods or a memory leak | Check pod memory requests/limits; evict or cordon node |
| DiskPressure | Container image cache full, log rotation failing | Clean up unused images: crictl rmi --prune |
| No recent heartbeat | kubelet crashed, node rebooted, cloud instance terminated | Check cloud provider console, kubelet service status |
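To pull just the Ready condition and the kubelet's reason for every node in one pass, a jsonpath expression plus a small filter works (the helper name is illustrative):

```shell
# Keep only rows whose second tab-separated field is not "True".
# Feed it name<TAB>status<TAB>reason lines, one per node:
#   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\t"}{.status.conditions[?(@.type=="Ready")].reason}{"\n"}{end}' \
#     | not_ready_with_reason
not_ready_with_reason() { awk -F'\t' '$2 != "True"'; }
```

Empty output means every node's Ready condition is True, so look below the node layer next.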

Pod Diagnostics

# Check pod status across all namespaces
kubectl get pods --all-namespaces | grep -v Running | grep -v Completed

# Describe a failing pod (Events section is most useful)
kubectl describe pod <pod-name> -n <namespace>

# Get logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous  # previous container instance

# Watch pod restarts in real time
kubectl get pods -n <namespace> -w

Pod Status Reference

| Status | Meaning | Diagnostic Approach |
| --- | --- | --- |
| Pending | Not scheduled to any node | describe pod → Events; check resources, node selectors, taints/tolerations |
| ContainerCreating | Image pulling or volume mounting | Check image pull policy, registry credentials, PVC status |
| ImagePullBackOff | Can't pull container image | Verify image name, tag, registry credentials (imagePullSecrets) |
| CrashLoopBackOff | Container starting then crashing repeatedly | logs --previous; check liveness probe, resource limits |
| OOMKilled | Container exceeded memory limit | Increase memory limit or fix memory leak |
| Evicted | Node ran out of resources | Check node pressure; increase resources or reduce pod density |
| Terminating (stuck) | Finalizer blocking deletion | Check finalizers in kubectl get pod -o yaml |
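Slow crash loops can hide behind a Running status; scanning the RESTARTS column catches them. A sketch (the threshold of 5 restarts is an arbitrary starting point, not a standard):

```shell
# Flag pods that have restarted more than 5 times. With --all-namespaces the
# RESTARTS column is field 5; newer kubectl prints it as "12 (3m ago)", and
# awk's numeric comparison still reads the leading number.
#   kubectl get pods --all-namespaces --no-headers | high_restart_pods
high_restart_pods() { awk '$5 > 5'; }
```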

Networking Diagnostics

Service DNS Not Resolving

# Test DNS from inside a pod
kubectl run dns-test --image=busybox:1.28 --rm -it --restart=Never -- nslookup kubernetes.default

# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns

# Test service DNS resolution (run this from inside a pod; cluster DNS
# names don't resolve from your workstation)
nslookup <service-name>.<namespace>.svc.cluster.local

Pod-to-Pod Connectivity

# Exec into a running pod and curl another service
kubectl exec -it <pod-name> -n <namespace> -- curl http://<service-name>:<port>/healthz

# Check if service has endpoints
kubectl get endpoints <service-name> -n <namespace>
# Empty ENDPOINTS column = no pods matching service selector

# Check network policies
kubectl get networkpolicies --all-namespaces
# A NetworkPolicy blocking ingress/egress is a common "is it down?" false alarm
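Since an empty ENDPOINTS column usually means a selector typo, it's worth scanning for cluster-wide rather than service by service. A sketch (helper name is our own):

```shell
# Find Services routing nowhere: with --all-namespaces the ENDPOINTS column
# is field 3, and an Endpoints object with no addresses shows as <none>.
#   kubectl get endpoints --all-namespaces --no-headers | services_without_endpoints
services_without_endpoints() { awk '$3 == "<none>"'; }
```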

Storage Diagnostics

# Check PersistentVolumeClaims
kubectl get pvc --all-namespaces | grep -v Bound

# Describe a pending or failed PVC
kubectl describe pvc <pvc-name> -n <namespace>

# Check StorageClass provisioner
kubectl get storageclass

# Check CSI driver pods (common in cloud managed K8s)
kubectl get pods -n kube-system | grep csi

Common storage issues: PVC stuck in Pending (provisioner can't create volume), PVC stuck in Terminating (finalizer blocking deletion or PV not yet recycled), ReadWriteOnce volume attached to wrong node.
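For the stuck-in-Terminating case specifically, the finalizer list usually names the blocker; the kubernetes.io/pvc-protection finalizer, for instance, stays until no pod uses the claim. A sketch (helper name is our own, resource names are placeholders):

```shell
# Show the finalizers holding a PVC in Terminating.
#   kubectl get pvc <pvc-name> -n <namespace> -o yaml | pvc_finalizers
pvc_finalizers() { grep -A3 'finalizers:'; }
```

If the blocker is pvc-protection, find and delete (or reschedule) the pod still mounting the claim rather than force-removing the finalizer.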


Is It Kubernetes or Your Application?

| Symptom | Kubernetes Issue | Application Issue |
| --- | --- | --- |
| All pods in a deployment fail | Node failure, resource exhaustion, CNI problem | Bad deployment (config error, bad image), shared secret/configmap broken |
| One pod fails intermittently | Flaky node (disk, memory, CPU steal) | Memory leak, deadlock, unhandled exception under load |
| External requests time out | Ingress controller, load balancer, NodePort issue | Slow database query, upstream API timeout, thread exhaustion |
| kubectl works, app is down | Not K8s (cluster healthy) | Application bug, config change, upstream dependency |
| New pods won't start (rolling deploy stuck) | Resource quota, node pressure, taint mismatch | Readiness probe failing (bad health check endpoint) |

Notable Managed K8s Incidents — Q1 2026

  • AWS EKS — Elastic Load Balancing Delays (Feb 2026): EKS clusters in us-east-1 experienced delays in load balancer target group registration during peak provisioning. Pods were running but not receiving traffic for 10–20 minutes after becoming Ready. Workaround: reduce deregistration_delay.timeout_seconds.
  • GKE — Cluster Upgrades (Ongoing): GKE's automated upgrade windows occasionally cause brief API server interruptions. Check GKE maintenance window settings to schedule around production load.
  • AKS — Azure OpenAI Integration (Mar 9–10, 2026): Workloads using Azure OpenAI via pod identity/workload identity saw failures during the Azure OpenAI multi-region outage. The AKS nodes were healthy — the dependency was the issue.

Monitoring Kubernetes with Ezmon

Kubernetes health has multiple layers. Ezmon monitors the endpoints that matter to users — the HTTP/HTTPS services your pods expose — from multiple global locations simultaneously.

What this catches:

  • Ingress failures (your pods are Running but traffic isn't reaching them)
  • Partial outages (one region's nodes fail but others are fine)
  • Slow degradation before a full outage (response time spikes before complete failure)
  • Recovery confirmation (multi-probe verification before marking a service as recovered)

For internal cluster health (pod/node/etcd monitoring), you'll want Prometheus + Alertmanager inside the cluster. Ezmon handles the external availability layer — the perspective your users actually experience.

Start monitoring your Kubernetes-backed services →


Kubernetes and cloud provider status sourced from official status pages: AWS Health, Google Cloud Status, Azure Status. All times UTC.

Tags: kubernetes down, k8s outage, kubernetes troubleshooting, k8s debug, kubernetes cluster issues, devops, sre