
AWS Lambda, RDS, EC2 Down? How to Diagnose Individual AWS Service Failures in 2026


The AWS Health Dashboard can show "All services operational" while specific services in your region experience intermittent failures. This guide covers granular diagnosis for the three most impactful AWS services: Lambda, RDS, and EC2.


AWS Status Pages — Which One to Check

AWS has multiple health status interfaces. Each serves a different purpose:

| Interface | URL / Location | Best For |
|---|---|---|
| AWS Service Health Dashboard | health.aws.amazon.com/health/status | Broad service-level status, publicly visible |
| AWS Personal Health Dashboard | AWS Console → Health → Your Account Health | Issues affecting YOUR specific account/resources |
| AWS Health API | API access (requires Business/Enterprise support) | Programmatic alerts for your affected services |
| AWS Service Status RSS | status.aws.amazon.com/rss | Machine-readable status feed |

Important: The public dashboard often lags behind actual incidents by 15–30 minutes. If your service is failing but the dashboard shows green, check the Personal Health Dashboard first — AWS often notifies affected accounts before updating the public page.
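If you want to watch the public feed programmatically, here is a minimal sketch that filters a fetched feed down to incident entries. The fetching step and the per-service feed URL pattern (e.g. .../rss/ec2-us-east-1.rss) are assumptions to verify against the RSS index above; the function itself just parses standard RSS:

```python
import xml.etree.ElementTree as ET

def incident_titles(rss_xml):
    """Return <item> titles from an AWS status RSS feed, skipping the
    routine "operating normally" entries so only incidents remain."""
    root = ET.fromstring(rss_xml)
    titles = [item.findtext("title", default="") for item in root.iter("item")]
    return [t for t in titles if "operating normally" not in t.lower()]
```

Run it on a schedule and alert when the returned list is non-empty; pair it with the Personal Health Dashboard, since the public feed carries the same 15–30 minute lag as the dashboard.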


AWS Lambda Diagnostics

Common Lambda Failure Patterns

| Symptom | Likely Cause | First Check |
|---|---|---|
| Task timed out (e.g., 3.00 seconds) | Cold start + initialization exceeds timeout, or slow downstream API | CloudWatch logs: Init Duration in the REPORT line |
| ERR_CONNECTION_REFUSED from Lambda | VPC config: Lambda can't reach RDS/ElastiCache; wrong subnet or missing NAT | VPC config, security groups, route tables |
| 429 TooManyRequests | Account-level concurrent execution limit hit (default: 1,000 per region) | CloudWatch → Lambda → ConcurrentExecutions metric |
| 502/503 from API Gateway | Uncaught Lambda error, API Gateway timeout (29s max), or Lambda throttle | API Gateway execution logs, Lambda error rate |
| ENI creation failure | VPC-attached Lambda exhausting ENIs (subnet /27 or smaller) | VPC → Network Interfaces; check subnet capacity |
| Runtime.ImportModuleError | Missing Lambda layer, wrong architecture (x86_64 vs arm64), wrong runtime | Layer ARNs, runtime version, architecture setting |

Lambda Diagnostic Commands (AWS CLI)

# Check function configuration
aws lambda get-function-configuration --function-name your-function-name

# Get last 5 minutes of invocation errors
# (date -d is GNU syntax; on BSD/macOS use e.g. date -v-5M +%s000)
aws logs filter-log-events \
  --log-group-name /aws/lambda/your-function-name \
  --start-time $(date -d '5 minutes ago' +%s000) \
  --filter-pattern "ERROR"

# Check concurrent executions (CloudWatch metric)
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name ConcurrentExecutions \
  --dimensions Name=FunctionName,Value=your-function-name \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --statistics Maximum

# Check throttles
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name Throttles \
  --dimensions Name=FunctionName,Value=your-function-name \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --statistics Sum

Lambda Cold Start Diagnosis

In CloudWatch Logs, filter for REPORT lines. A cold start shows Init Duration: X ms:

REPORT RequestId: abc123 Duration: 243.55 ms Billed Duration: 244 ms
Memory Size: 512 MB Max Memory Used: 89 MB Init Duration: 487.23 ms
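When reviewing exported logs at scale, the REPORT fields can be parsed mechanically. A small sketch; the field names and units are taken from the sample line above, nothing else is assumed:

```python
import re

# Field patterns matching the REPORT line format shown above.
FIELDS = {
    "duration_ms": r"(?<!Billed )(?<!Init )Duration: ([\d.]+) ms",
    "billed_ms": r"Billed Duration: ([\d.]+) ms",
    "init_ms": r"Init Duration: ([\d.]+) ms",
    "max_memory_mb": r"Max Memory Used: (\d+) MB",
}

def parse_report(line):
    """Extract numeric fields from a Lambda REPORT log line.
    init_ms is absent on warm invocations (no cold start)."""
    out = {}
    for key, pattern in FIELDS.items():
        m = re.search(pattern, line)
        if m:
            out[key] = float(m.group(1))
    return out
```

Feed it the REPORT lines from `aws logs filter-log-events` and aggregate `init_ms` to see how often cold starts occur and how long they take.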

If Init Duration consumes most of your timeout budget, you have three options: raise the timeout, enable Provisioned Concurrency, or reduce initialization work (lazy imports, connection reuse).
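The "lazy imports, connection reuse" option can be sketched as a handler that caches its expensive client at the execution-environment level. This is a minimal illustration, not AWS-blessed code; the client object here is a placeholder for whatever SDK client or DB connection your function actually builds:

```python
import functools

CALLS = {"init": 0}  # instrumentation for this example only

@functools.lru_cache(maxsize=1)
def get_client():
    """Stand-in for an expensive constructor (SDK client, DB connection).
    Behind lru_cache it runs once per execution environment, so warm
    invocations skip the initialization cost entirely."""
    CALLS["init"] += 1
    return {"connected": True}  # placeholder for the real client object

def handler(event, context):
    client = get_client()  # reused across warm invocations
    return {"ok": client["connected"]}
```

Deferring heavy imports into `get_client()` (instead of module top level) also keeps them out of the INIT phase for code paths that never need them.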


Amazon RDS Diagnostics

Common RDS Failure Patterns

| Symptom | Likely Cause | First Check |
|---|---|---|
| Connection refused / ECONNREFUSED | Security group blocking access, RDS instance stopped, max_connections hit | Security group inbound rules on port 5432/3306; RDS console → Status |
| FATAL: sorry, too many clients already | max_connections exhausted (common with Lambda at scale) | CloudWatch → DatabaseConnections; consider RDS Proxy |
| SSL SYSCALL error: EOF detected | Failover in progress (Multi-AZ) or network interruption | RDS Events for failover entries; implement retry logic |
| Read replica lag > threshold | High write load on primary, large transactions, binlog delay | CloudWatch → ReplicaLag metric |
| FreeStorageSpace = 0 | Disk full; the instance will become read-only | CloudWatch → FreeStorageSpace; enable storage autoscaling |
| High CPUUtilization (>80%) | Missing index, N+1 queries, autovacuum contention | Performance Insights → Top SQL statements |
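The "implement retry logic" advice above boils down to retrying transient errors with exponential backoff and jitter, so a Multi-AZ failover (typically under a couple of minutes) looks like a brief slowdown rather than an outage. A sketch; the exception tuple is a placeholder you should map to your driver's actual failover-time errors:

```python
import random
import time

def with_retries(fn, attempts=4, base_delay=0.2, transient=(ConnectionError,)):
    """Call fn(), retrying transient errors with exponential backoff
    plus jitter. `transient` is a placeholder exception tuple: map it to
    your driver's errors (e.g. psycopg2.OperationalError covers the
    "SSL SYSCALL error: EOF detected" case during failover)."""
    for attempt in range(attempts):
        try:
            return fn()
        except transient:
            if attempt == attempts - 1:
                raise  # out of retries: surface the real error
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

Jitter matters: without it, every client that saw the failover retries at the same instant and hammers the newly promoted instance.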

RDS Diagnostic Commands (AWS CLI)

# Check RDS instance status
aws rds describe-db-instances --db-instance-identifier your-db-id \
  --query 'DBInstances[0].{Status:DBInstanceStatus,Class:DBInstanceClass,AZ:AvailabilityZone,Endpoint:Endpoint.Address}'

# Check recent RDS events (last 1 hour)
aws rds describe-events \
  --source-identifier your-db-id \
  --source-type db-instance \
  --duration 60

# Check CloudWatch metrics (CPU, connections, free storage)
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name DatabaseConnections \
  --dimensions Name=DBInstanceIdentifier,Value=your-db-id \
  --start-time $(date -u -d '30 minutes ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --statistics Maximum

# List parameter groups (check max_connections setting)
aws rds describe-db-parameters \
  --db-parameter-group-name your-parameter-group \
  --query 'Parameters[?ParameterName==`max_connections`]'

RDS max_connections and Lambda

Lambda functions can scale to hundreds of concurrent execution environments, each opening its own database connection. On a db.t3.micro, the default max_connections works out to roughly 87 (the default is derived from instance memory). This is almost always the cause of "too many clients" errors in Lambda + RDS architectures.
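The arithmetic is worth making explicit. This tiny check is my own illustration, not an AWS formula; the `reserved_for_admin` default is an assumption (a few slots kept free for superuser and monitoring sessions):

```python
def connection_headroom(max_connections, peak_lambda_concurrency,
                        conns_per_instance=1, reserved_for_admin=3):
    """Back-of-envelope check: will peak Lambda concurrency exhaust the
    database's connection slots? Negative result means yes.
    reserved_for_admin is an illustrative default; tune it."""
    needed = peak_lambda_concurrency * conns_per_instance + reserved_for_admin
    return max_connections - needed
```

With max_connections ≈ 87 and a modest spike to 100 concurrent Lambda instances, headroom is already negative; either cap the function's reserved concurrency or pool connections before scaling further.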

Fix: Use RDS Proxy (connection pooling) or a connection pool library (PgBouncer, pgx's built-in pool). RDS Proxy is especially effective for Lambda use cases — connections are pooled at the proxy level, not per-Lambda-instance.


Amazon EC2 Diagnostics

Common EC2 Failure Patterns

| Symptom | Likely Cause | First Check |
|---|---|---|
| Instance unreachable (SSH timeout) | Missing security group rule, VPC routing issue, instance stopped/terminated | EC2 console → Instance State; security group inbound 22/443 |
| Instance status check failed (1/2) | OS-level issue: disk full, OOM, kernel panic, filesystem error | EC2 console → Monitoring tab → Status Checks |
| System status check failed (2/2) | AWS infrastructure issue: host hardware failure, power/network issue | AWS Health Dashboard for your AZ; stop/start instance (migrates to new host) |
| EBS volume unresponsive | EBS service degradation (AZ-specific), I/O credit exhaustion (gp2), volume offline | CloudWatch → VolumeQueueLength spike; EBS status in AWS console |
| Instance terminated unexpectedly | Spot instance interruption, Auto Scaling scale-in, account billing issue | CloudTrail → TerminateInstances events |
| ELB target health: unhealthy | Health check path returning non-200, security group blocking the load balancer | EC2 → Target Groups → Health status; check security group allows traffic from the load balancer |

EC2 Diagnostic Commands (AWS CLI)

# Check instance status and status checks
aws ec2 describe-instance-status \
  --instance-ids i-0123456789abcdef0 \
  --query 'InstanceStatuses[0].{State:InstanceState.Name,System:SystemStatus.Status,Instance:InstanceStatus.Status}'

# Get system log (last output before connectivity loss)
aws ec2 get-console-output --instance-id i-0123456789abcdef0 --output text

# Check recent CloudTrail events for instance
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=ResourceName,AttributeValue=i-0123456789abcdef0 \
  --start-time $(date -u -d '2 hours ago' +%Y-%m-%dT%H:%M:%SZ)

# Describe target group health
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:region:account:targetgroup/name/id

System Status Check Failed — What To Do

A System Status Check failure (2/2) means AWS infrastructure is having problems with the underlying hardware. This is AWS's problem, not yours. Options:

  1. Stop and start the instance (not reboot): a stop/start moves the instance to a healthy host, while a reboot keeps it on the same hardware. An Elastic IP stays attached across stop/start, but data on instance store (ephemeral) volumes is lost.
  2. Check AWS Personal Health Dashboard — AWS may have already flagged the AZ/host issue and scheduled a maintenance event.
  3. Scheduled Retirement — AWS occasionally schedules retirement of instances on degraded hosts. Check the console for retirement notifications.

AWS Availability Zone Failures — Pattern Recognition

AWS incidents are often AZ-specific, not regional. Signs that an AZ is degraded rather than a whole region:

  • Some instances/services in the same region are fine, others fail
  • The issue correlates with a specific AZ suffix (e.g., us-east-1c vs us-east-1a)
  • AWS Health shows "us-east-1c" in the affected scope
  • Some RDS replicas fail but primary (in different AZ) is fine
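The pattern above can be checked mechanically against your own health data. A sketch; the input shape (resource id, AZ, healthy flag) and the 50% threshold are my assumptions:

```python
from collections import defaultdict

def suspect_azs(checks, threshold=0.5):
    """Given (resource_id, az, healthy) tuples, return AZs whose failure
    rate exceeds `threshold` while at least one other AZ is fully
    healthy: the signature of an AZ-scoped incident rather than a
    regional one."""
    stats = defaultdict(lambda: [0, 0])  # az -> [failures, total]
    for _rid, az, healthy in checks:
        stats[az][1] += 1
        if not healthy:
            stats[az][0] += 1
    rates = {az: f / t for az, (f, t) in stats.items()}
    if not any(r == 0 for r in rates.values()):
        return []  # everything degraded: looks regional, not AZ-scoped
    return sorted(az for az, r in rates.items() if r > threshold)
```

Feed it the output of your health checks (or `describe-instance-status` results) and an empty list tells you the problem is not cleanly AZ-correlated.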

Mitigation: Multi-AZ deployments for RDS, Auto Scaling groups spread across 3+ AZs, ELB with health checks to automatically route around unhealthy AZ instances.


Notable AWS Individual Service Incidents — Q1 2026

  • AWS Lambda — us-east-1 (Feb 2026): Lambda cold start latency increased 3–5x for approximately 2 hours. Functions that worked within their timeout limit began timing out. AWS attributed it to internal capacity management changes. Workaround: Provisioned Concurrency on critical functions.
  • Amazon RDS (Aurora) — ap-southeast-1 (Jan 2026): Aurora Serverless v2 auto-scaling delays caused connection pool exhaustion on rapidly scaling workloads. Instances were healthy; capacity wasn't scaling fast enough to match demand spike.
  • EBS — us-west-2a (Mar 2026): Elevated error rates for EBS volumes in us-west-2a. EC2 instances with EBS root volumes in the affected AZ saw I/O stalls. Instances using instance store were unaffected.

Monitoring AWS Services with Ezmon

AWS's own status page tracks service availability, but your monitoring should answer a more specific question: is your application working for your users?

The layers:

  • AWS status page: "Is the service having problems globally/regionally?"
  • Personal Health Dashboard: "Is the service having problems for my specific account/resources?"
  • CloudWatch metrics: "What's happening inside my resources right now?"
  • External monitoring (Ezmon): "Can a user in Tokyo actually hit my endpoint and get a response in under 2 seconds?"

Ezmon monitors your actual application endpoints from 15+ global locations. This catches the gap between "AWS is healthy" and "your users can't reach you" — which is where most real outages live.

Monitor your AWS-hosted services from outside AWS →




AWS service status sourced from AWS Service Health Dashboard. All times UTC. For account-specific issues, always check the AWS Personal Health Dashboard in your AWS Console.

Tags: aws lambda down, rds down, ec2 down, aws services status, aws troubleshooting, lambda timeout, rds connection refused