
Is Kubernetes Down? How to Diagnose K8s Cluster Issues in 2026


Kubernetes cluster problems come in layers: control plane, node, networking, DNS, storage, or application. This guide walks you through each layer so you can isolate the failure fast — whether it's your cluster or an upstream managed service.


Quick Triage: What's Actually Down?

Before assuming Kubernetes itself is broken, determine where in the stack the problem lives:

| Symptom | Likely Layer | First Check |
| --- | --- | --- |
| kubectl commands time out or return connection errors | Control plane / API server | Check managed K8s status page (below) |
| Pods stuck in Pending | Scheduling / node capacity | kubectl describe pod <pod> → Events section |
| Pods in CrashLoopBackOff | Application / config error | kubectl logs <pod> --previous |
| Pods in ImagePullBackOff | Container registry | Check registry status, verify credentials |
| Service unavailable but pods are Running | Networking / DNS / service mesh | kubectl exec + curl from inside cluster |
| Nodes showing NotReady | Node / cloud provider | Check node events: kubectl describe node <node> |
| Persistent volumes stuck in Terminating | Storage / CSI driver | Check finalizers, CSI node plugin pods |
| Everything looks fine but no traffic reaches pods | Ingress / load balancer | Check ingress controller logs and cloud LB status |
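The first three rows of this table can be bundled into a quick scripted sweep. A minimal sketch (the function names are our own; each expects `kubectl get ... --no-headers` output on stdin):

```shell
# Triage filters: pipe real cluster output through them, e.g.
#   kubectl get nodes --no-headers | unhealthy_nodes
#   kubectl get pods --all-namespaces --no-headers | unhealthy_pods

# Nodes whose STATUS column is not exactly "Ready"
unhealthy_nodes() { awk '$2 != "Ready"'; }

# Pods that are neither Running nor Completed
unhealthy_pods() { grep -vE 'Running|Completed'; }
```

If both filters come back empty but users still see errors, skip ahead to the networking and ingress sections: the cluster itself is likely healthy.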

Managed Kubernetes Status Pages

If you're running on a managed Kubernetes service, check the provider's status page first. Control plane issues are often the provider's problem, not yours.

| Provider | Service | Status Page |
| --- | --- | --- |
| Amazon Web Services | Amazon EKS | health.aws.amazon.com |
| Google Cloud | Google Kubernetes Engine (GKE) | status.cloud.google.com |
| Microsoft Azure | Azure Kubernetes Service (AKS) | azure.status.microsoft |
| DigitalOcean | DOKS (DigitalOcean Kubernetes) | status.digitalocean.com |
| Linode/Akamai | LKE (Linode Kubernetes Engine) | status.linode.com |
| Civo | Civo Kubernetes | status.civo.com |

What to look for on managed K8s status pages: Control plane availability, etcd health, API server latency, node provisioning failures. Even a "minor degradation" on the control plane can make kubectl appear broken while your workloads continue running.
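Some of these pages (DigitalOcean's and Linode's, for example) are hosted on Atlassian Statuspage, which also serves a machine-readable summary at /api/v2/status.json. A hedged sketch for scripting that check (the grep-based extraction is a convenience; jq is cleaner if you have it):

```shell
# Extract the overall "indicator" field from a Statuspage status.json body.
# Values are: none, minor, major, critical.
statuspage_indicator() {
  grep -o '"indicator":"[a-z]*"' | cut -d'"' -f4
}

# Usage (URL is illustrative; substitute your provider's status page):
#   curl -fsS https://status.digitalocean.com/api/v2/status.json | statuspage_indicator
```

Anything other than "none" means the provider is already reporting trouble, so investigate there before digging into your own cluster.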


Control Plane Diagnostics

Is the API Server Responding?

# Basic connectivity check
kubectl cluster-info

# Check API server directly (replace URL with your cluster endpoint)
curl -k https://<cluster-endpoint>/healthz

# Check component status (deprecated in 1.19+ but still useful)
kubectl get componentstatuses

# More reliable: check system pods
kubectl get pods -n kube-system

Control Plane Component Health (Self-Managed Clusters)

# Check control plane pods (kubeadm clusters)
kubectl get pods -n kube-system | grep -E 'etcd|apiserver|controller|scheduler'

# Check etcd health (requires etcdctl access)
etcdctl endpoint health --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# API server logs (systemd-based)
journalctl -u kube-apiserver --since "10 minutes ago"

On managed services (EKS/GKE/AKS): You don't have direct access to control plane components. Use the provider status page and check node/system pod health as a proxy.
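One rough proxy for that on managed clusters: count how many system pods are in a bad state, since they depend directly on the control plane. A sketch (the helper name is our own):

```shell
# Count kube-system pods that are not Running or Completed.
# A sudden spike here often correlates with control plane or node trouble.
#   kubectl get pods -n kube-system --no-headers | bad_system_pod_count
bad_system_pod_count() {
  awk '$3 != "Running" && $3 != "Completed" { n++ } END { print n + 0 }'
}
```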


Node Diagnostics

# List all nodes and their status
kubectl get nodes -o wide

# Detailed node status — check Conditions section
kubectl describe node <node-name>

# Key conditions to check:
# Ready = False → node is failing health checks
# MemoryPressure = True → OOM risk
# DiskPressure = True → disk full
# PIDPressure = True → too many processes
# NetworkUnavailable = True → CNI problem

# Node resource usage
kubectl top nodes  # requires metrics-server

# Get recent node-related events, oldest first
kubectl get events --all-namespaces --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp

Node NotReady — Common Causes

| Condition | Root Cause | Fix |
| --- | --- | --- |
| NetworkUnavailable | CNI plugin not running (Calico, Flannel, Cilium) | Check CNI daemonset pods: kubectl get pods -n kube-system |
| MemoryPressure | Node OOM: too many pods or a memory leak | Check pod memory requests/limits; evict or cordon node |
| DiskPressure | Container image cache full, log rotation failing | Clean up unused images: crictl rmi --prune |
| No recent heartbeat | kubelet crashed, node rebooted, cloud instance terminated | Check cloud provider console, kubelet service status |
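To pull just the Ready condition and the kubelet's reason for every node in one pass, a jsonpath expression plus a small filter works (the helper name is illustrative):

```shell
# Keep only rows whose second tab-separated field is not "True".
# Feed it name<TAB>status<TAB>reason lines, one per node:
#   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\t"}{.status.conditions[?(@.type=="Ready")].reason}{"\n"}{end}' \
#     | not_ready_with_reason
not_ready_with_reason() { awk -F'\t' '$2 != "True"'; }
```

Empty output means every node's Ready condition is True, so look below the node layer next.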

Pod Diagnostics

# Check pod status across all namespaces
kubectl get pods --all-namespaces | grep -v Running | grep -v Completed

# Describe a failing pod (Events section is most useful)
kubectl describe pod <pod-name> -n <namespace>

# Get logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous  # previous container instance

# Watch pod restarts in real time
kubectl get pods -n <namespace> -w

Pod Status Reference

| Status | Meaning | Diagnostic Approach |
| --- | --- | --- |
| Pending | Not scheduled to any node | describe pod → Events; check resources, node selectors, taints/tolerations |
| ContainerCreating | Image pulling or volume mounting | Check image pull policy, registry credentials, PVC status |
| ImagePullBackOff | Can't pull container image | Verify image name, tag, registry credentials (imagePullSecrets) |
| CrashLoopBackOff | Container starting then crashing repeatedly | logs --previous; check liveness probe, resource limits |
| OOMKilled | Container exceeded memory limit | Increase memory limit or fix memory leak |
| Evicted | Node ran out of resources | Check node pressure; increase resources or reduce pod density |
| Terminating (stuck) | Finalizer blocking deletion | Check finalizers in kubectl get pod -o yaml |
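Slow crash loops can hide behind a Running status; scanning the RESTARTS column catches them. A sketch (the threshold of 5 restarts is an arbitrary starting point, not a standard):

```shell
# Flag pods that have restarted more than 5 times. With --all-namespaces the
# RESTARTS column is field 5; newer kubectl prints it as "12 (3m ago)", and
# awk's numeric comparison still reads the leading number.
#   kubectl get pods --all-namespaces --no-headers | high_restart_pods
high_restart_pods() { awk '$5 > 5'; }
```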

Networking Diagnostics

Service DNS Not Resolving

# Test DNS from inside a pod
kubectl run dns-test --image=busybox:1.28 --rm -it --restart=Never -- nslookup kubernetes.default

# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns

# Test service DNS resolution (run this from inside a pod; cluster DNS
# names don't resolve from your workstation)
nslookup <service-name>.<namespace>.svc.cluster.local

Pod-to-Pod Connectivity

# Exec into a running pod and curl another service
kubectl exec -it <pod-name> -n <namespace> -- curl http://<service-name>:<port>/healthz

# Check if service has endpoints
kubectl get endpoints <service-name> -n <namespace>
# Empty ENDPOINTS column = no pods matching service selector

# Check network policies
kubectl get networkpolicies --all-namespaces
# A NetworkPolicy blocking ingress/egress is a common "is it down?" false alarm
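Since an empty ENDPOINTS column usually means a selector typo, it's worth scanning for cluster-wide rather than service by service. A sketch (helper name is our own):

```shell
# Find Services routing nowhere: with --all-namespaces the ENDPOINTS column
# is field 3, and an Endpoints object with no addresses shows as <none>.
#   kubectl get endpoints --all-namespaces --no-headers | services_without_endpoints
services_without_endpoints() { awk '$3 == "<none>"'; }
```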

Storage Diagnostics

# Check PersistentVolumeClaims
kubectl get pvc --all-namespaces | grep -v Bound

# Describe a pending or failed PVC
kubectl describe pvc <pvc-name> -n <namespace>

# Check StorageClass provisioner
kubectl get storageclass

# Check CSI driver pods (common in cloud managed K8s)
kubectl get pods -n kube-system | grep csi

Common storage issues: PVC stuck in Pending (provisioner can't create volume), PVC stuck in Terminating (finalizer blocking deletion or PV not yet recycled), ReadWriteOnce volume attached to wrong node.
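For the stuck-in-Terminating case specifically, the finalizer list usually names the blocker; the kubernetes.io/pvc-protection finalizer, for instance, stays until no pod uses the claim. A sketch (helper name is our own, resource names are placeholders):

```shell
# Show the finalizers holding a PVC in Terminating.
#   kubectl get pvc <pvc-name> -n <namespace> -o yaml | pvc_finalizers
pvc_finalizers() { grep -A3 'finalizers:'; }
```

If the blocker is pvc-protection, find and delete (or reschedule) the pod still mounting the claim rather than force-removing the finalizer.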


Is It Kubernetes or Your Application?

| Symptom | Kubernetes Issue | Application Issue |
| --- | --- | --- |
| All pods in a deployment fail | Node failure, resource exhaustion, CNI problem | Bad deployment (config error, bad image), shared secret/configmap broken |
| One pod fails intermittently | Flaky node (disk, memory, CPU steal) | Memory leak, deadlock, unhandled exception under load |
| External requests time out | Ingress controller, load balancer, NodePort issue | Slow database query, upstream API timeout, thread exhaustion |
| kubectl works, app is down | Not K8s (cluster healthy) | Application bug, config change, upstream dependency |
| New pods won't start (rolling deploy stuck) | Resource quota, node pressure, taint mismatch | Readiness probe failing (bad health check endpoint) |

Notable Managed K8s Incidents — Q1 2026

  • AWS EKS — Elastic Load Balancing Delays (Feb 2026): EKS clusters in us-east-1 experienced delays in load balancer target group registration during peak provisioning. Pods were running but not receiving traffic for 10–20 minutes after becoming Ready. Workaround: reduce deregistration_delay.timeout_seconds.
  • GKE — Cluster Upgrades (Ongoing): GKE's automated upgrade windows occasionally cause brief API server interruptions. Check GKE maintenance window settings to schedule around production load.
  • AKS — Azure OpenAI Integration (Mar 9–10, 2026): Workloads using Azure OpenAI via pod identity/workload identity saw failures during the Azure OpenAI multi-region outage. The AKS nodes were healthy — the dependency was the issue.

Monitoring Kubernetes with Ezmon

Kubernetes health has multiple layers. Ezmon monitors the endpoints that matter to users — the HTTP/HTTPS services your pods expose — from multiple global locations simultaneously.

What this catches:

  • Ingress failures (your pods are Running but traffic isn't reaching them)
  • Partial outages (one region's nodes fail but others are fine)
  • Slow degradation before a full outage (response time spikes before complete failure)
  • Recovery confirmation (multi-probe verification before marking a service as recovered)

For internal cluster health (pod/node/etcd monitoring), you'll want Prometheus + Alertmanager inside the cluster. Ezmon handles the external availability layer — the perspective your users actually experience.

Start monitoring your Kubernetes-backed services →


Kubernetes and cloud provider status sourced from official status pages: AWS Health, Google Cloud Status, Azure Status. All times UTC.

Tags: kubernetes down, k8s outage, kubernetes troubleshooting, k8s debug, kubernetes cluster issues, devops, sre