Monitoring ECS and EKS at Scale with CloudWatch Container Insights and Prometheus

I am an experienced Cloud DevOps Engineer. I blog about solutions and Cloud and DevOps projects that boost your portfolio, and I provide troubleshooting guides on Cloud and DevOps.
As containerized workloads continue to dominate modern cloud infrastructure, effective monitoring becomes critical for maintaining reliability, performance, and cost efficiency. Amazon ECS and EKS power millions of production containers, yet many organizations struggle with comprehensive observability at scale.
This article explores a production-tested approach to monitoring container workloads on AWS, combining CloudWatch Container Insights with Prometheus to achieve complete visibility across your infrastructure. We'll examine why a dual-monitoring strategy outperforms single-tool approaches and demonstrate practical implementation patterns.
What you'll learn:
How CloudWatch Container Insights and Prometheus complement each other
Architecture patterns for monitoring EKS at scale
Implementation strategies using AWS native tools
Performance metrics that matter for production workloads
Cost optimization techniques for monitoring infrastructure
Prerequisites
Before diving into the implementation, ensure you have the following:
Required AWS Resources
AWS Account with appropriate IAM permissions
AWS CLI (v2.x or later) configured with credentials
Active EKS Cluster (Kubernetes 1.21+) or ability to create one
VPC with public and private subnets across multiple AZs
IAM permissions for:
Creating/managing EKS clusters and addons
Creating IAM roles and policies
Writing to CloudWatch metrics and logs
Managing EKS OIDC providers
Understanding the Container Monitoring Challenge
The Complexity of Container Observability
Traditional monitoring approaches fall short with containers due to their ephemeral nature, dynamic scheduling, and distributed architecture. Key challenges include:
Dynamic Infrastructure: Containers start and stop frequently, making static monitoring configurations obsolete. A pod running on node A at 10:00 AM might be rescheduled to node B by 10:05 AM.
Multi-Layer Visibility: Effective monitoring requires insights at multiple levels—cluster, node, pod, and container—each with different operational concerns.
Metric Volume: A moderate Kubernetes cluster generates thousands of metrics per minute. Without proper aggregation and filtering, the signal-to-noise ratio becomes problematic.
Distributed Tracing: Microservices communicate across network boundaries, requiring correlation of metrics, logs, and traces to understand system behavior.
Why Dual Monitoring?
Rather than choosing between AWS native tools and open-source solutions, the optimal strategy combines both:
CloudWatch Container Insights excels at:
Native AWS integration and managed infrastructure
Automatic metric collection without configuration
Built-in dashboards for immediate visibility
Integration with AWS services (SNS, Lambda, EventBridge)
Compliance and audit logging requirements
Prometheus + Grafana provides:
Flexible query language (PromQL) for complex analysis
Custom metric collection and application instrumentation
Community-driven dashboards and exporters
Longer retention periods (configurable)
No AWS API rate limits or costs per metric
This combination ensures both operational reliability (CloudWatch) and deep analytical capability (Prometheus).
Component Responsibilities
CloudWatch Agent: Collects cluster, node, pod, and container metrics. Deployed as DaemonSet (one pod per node) to gather host-level metrics and Kubernetes resource utilization.
Fluent Bit: Aggregates logs from all containers and ships to CloudWatch Logs. Handles log parsing, filtering, and routing based on namespace, pod labels, or log content.
Prometheus Operator: Manages Prometheus instances and monitoring configuration through Kubernetes CRDs (Custom Resource Definitions). Automatically discovers targets using ServiceMonitor resources.
Grafana: Provides visualization layer with support for both CloudWatch and Prometheus data sources, enabling unified dashboards.
Implementation: EKS Monitoring
Infrastructure Prerequisites
Before implementing monitoring, ensure your EKS cluster has:
OIDC Provider: Required for IAM Roles for Service Accounts (IRSA)
Appropriate Node Sizing: Reserve 10-15% cluster capacity for monitoring workloads
Network Connectivity: Ensure pods can reach AWS API endpoints (use VPC endpoints to avoid NAT costs)
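The OIDC prerequisite can be checked from the command line before going further. The sketch below is one way to do it, assuming the AWS CLI is configured for the cluster's account and region. The helper names oidc_issuer_id and has_oidc_provider are illustrative and not part of any AWS tool.

```shell
# Hypothetical helpers; the function names are illustrative.

# Extract the trailing ID from an OIDC issuer URL, e.g.
# https://oidc.eks.<region>.amazonaws.com/id/<ID>  ->  <ID>
oidc_issuer_id() {
  printf '%s\n' "${1##*/}"
}

# Return 0 if an IAM OIDC provider exists for the cluster's issuer.
has_oidc_provider() {
  local cluster="$1" issuer id
  issuer="$(aws eks describe-cluster --name "$cluster" \
    --query 'cluster.identity.oidc.issuer' --output text)" || return 1
  id="$(oidc_issuer_id "$issuer")"
  # An empty ID would match every line, so treat it as a failure.
  [ -n "$id" ] || return 1
  aws iam list-open-id-connect-providers --output text | grep -q "$id"
}
```

If has_oidc_provider your-cluster-name returns non-zero, the provider can be associated with: eksctl utils associate-iam-oidc-provider --cluster your-cluster-name --approve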

Step-by-Step Implementation
CloudWatch Container Insights Setup
Step 1: Create IAM Role for CloudWatch Agent
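CloudWatch Container Insights is available as an EKS add-on, which is the most reliable installation method. The add-on's agent needs an IAM role first:
# Create only the IAM role for the CloudWatch agent; the add-on installed in
# Step 2 creates the Kubernetes service account itself.
# The role name below is a placeholder; choose your own.
eksctl create iamserviceaccount \
  --name cloudwatch-agent \
  --namespace amazon-cloudwatch \
  --cluster your-cluster-name \
  --role-name eks-cloudwatch-agent-role \
  --attach-policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy \
  --role-only \
  --approve
This creates an IAM role whose trust relationship points at the cluster's OIDC provider, enabling secure API access without static credentials.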

Step 2: Install the CloudWatch Observability Addon
# Install the CloudWatch Observability addon
aws eks create-addon \
--cluster-name your-cluster-name \
--addon-name amazon-cloudwatch-observability \
--service-account-role-arn <ROLE_ARN_FROM_PREVIOUS_STEP>
The addon automatically deploys:
CloudWatch agent DaemonSet for metrics
Fluent Bit DaemonSet for logs
Required ConfigMaps and RBAC permissions
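Add-on creation is asynchronous, so it can help to wait for the add-on to become active before verifying pods. A minimal sketch, assuming the add-on name used above; the function name wait_for_observability_addon is illustrative.

```shell
# Block until the add-on reports ACTIVE, using the AWS CLI's built-in waiter,
# then print the final status for confirmation.
wait_for_observability_addon() {
  local cluster="$1"
  aws eks wait addon-active \
    --cluster-name "$cluster" \
    --addon-name amazon-cloudwatch-observability || return 1
  aws eks describe-addon \
    --cluster-name "$cluster" \
    --addon-name amazon-cloudwatch-observability \
    --query 'addon.status' --output text
}
```

Calling wait_for_observability_addon your-cluster-name should print ACTIVE once the DaemonSets have been rolled out; a non-zero exit means the waiter gave up or the add-on failed.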

Step 3: Verification
# Check pod status
kubectl get pods -n amazon-cloudwatch
# Expected output:
# NAME READY STATUS RESTARTS AGE
# cloudwatch-agent-xxxxx 1/1 Running 0 2m
# fluent-bit-xxxxx 1/1 Running 0 2m
# Verify metrics flowing to CloudWatch
aws cloudwatch list-metrics \
--namespace ContainerInsights \
--dimensions Name=ClusterName,Value=your-cluster-name
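The pod check above can be turned into a pass/fail test for scripts or CI. This is a small sketch; the function name count_not_running is illustrative, and it treats any status other than Running (including Completed) as not running.

```shell
# Count pods in a namespace whose STATUS column is not "Running".
# With --no-headers, STATUS is the third column of `kubectl get pods`.
count_not_running() {
  local namespace="$1"
  kubectl get pods -n "$namespace" --no-headers \
    | awk '$3 != "Running" { n++ } END { print n + 0 }'
}
```

For example, [ "$(count_not_running amazon-cloudwatch)" -eq 0 ] succeeds only when every agent and Fluent Bit pod is running.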


Prometheus Stack Deployment
Step 4: Deploy Prometheus using Helm
Deploy Prometheus using the community-maintained kube-prometheus-stack:
# Add Prometheus Helm repository
helm repo add prometheus-community \
https://prometheus-community.github.io/helm-charts
helm repo update
# Install Prometheus stack
# Note: admin123 is a demo password; set your own before exposing Grafana.
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=15d \
  --set prometheus.prometheusSpec.resources.requests.memory=2Gi \
  --set grafana.enabled=true \
  --set grafana.adminPassword=admin123
This installs:
Prometheus Operator for lifecycle management
Prometheus server with 15-day retention
Alertmanager for notification routing
Grafana with pre-configured dashboards
Node exporters for host metrics
Kube-state-metrics for Kubernetes object metrics
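Passing the Grafana password with --set leaves it in shell history and in the Helm release values. One alternative, sketched here, is to generate a random password and store it in a Kubernetes Secret. The Secret name grafana-admin is an assumption of this example; the keys admin-user and admin-password are the defaults the bundled Grafana chart reads when grafana.admin.existingSecret is set.

```shell
# Generate a 24-character random password (18 random bytes, base64-encoded).
generate_password() {
  head -c 18 /dev/urandom | base64
}

# Store Grafana admin credentials in a Secret.
# The Secret name "grafana-admin" is hypothetical; the monitoring namespace
# must already exist (kubectl create namespace monitoring).
create_grafana_admin_secret() {
  local password
  password="$(generate_password)"
  kubectl create secret generic grafana-admin \
    --namespace monitoring \
    --from-literal=admin-user=admin \
    --from-literal=admin-password="$password"
}
```

With the Secret in place, the install command would use --set grafana.admin.existingSecret=grafana-admin in place of the adminPassword flag.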

Step 5: Verify Prometheus Stack
# Check all monitoring pods
kubectl get pods -n monitoring
# Expected output shows all pods running:
# prometheus-kube-prometheus-operator-xxxxx 1/1 Running
# prometheus-kube-state-metrics-xxxxx 1/1 Running
# prometheus-prometheus-node-exporter-xxxxx 1/1 Running
# alertmanager-prometheus-kube-prometheus-xxxxx 2/2 Running
# prometheus-grafana-xxxxx 3/3 Running

Configuration Tips:
Resource Allocation: Prometheus memory usage grows with cardinality. For a 50-node cluster, allocate 4-8GB RAM.
Retention Period: Balance storage costs against analysis needs. 15 days handles most troubleshooting scenarios; use Thanos for longer-term storage.
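ServiceMonitor Pattern: Create ServiceMonitors to automatically discover and scrape application metrics. With the chart's default settings, Prometheus only picks up ServiceMonitors labeled with the Helm release name, and a ServiceMonitor only matches Services in its own namespace unless a namespaceSelector is set:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
  namespace: monitoring
  labels:
    release: prometheus   # matches the Helm release name used above
spec:
  selector:
    matchLabels:
      app: your-application
  endpoints:
    - port: metrics       # name of the Service port exposing /metrics
      interval: 30s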
Sample Application with Instrumentation
Step 6: Deploy Test Application
Deploy a sample application to verify end-to-end monitoring:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
    spec:
      containers:
        - name: nginx
          image: nginx:1.21
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 200m
              memory: 256Mi
          ports:
            - containerPort: 80
# Apply the manifest
kubectl apply -f sample-app.yaml
# Verify deployment
kubectl get pods -l app=sample-app
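This gives the monitoring stack real pods to observe. The stock nginx image does not expose a Prometheus /metrics endpoint, so expect container-level resource metrics (CPU, memory, network) rather than application metrics.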
Key Metrics and Dashboards
CloudWatch Container Insights Metrics
CloudWatch automatically collects metrics across four levels:
Cluster Level:
cluster_failed_node_count: Nodes in NotReady state
cluster_node_count: Total nodes
cluster_number_of_running_pods: Active pods
Node Level:
node_cpu_utilization: Percentage CPU usage
node_memory_utilization: Percentage memory usage
node_network_total_bytes: Network throughput
node_filesystem_utilization: Disk usage
Pod Level:
pod_cpu_utilization: Per-pod CPU usage
pod_memory_utilization: Per-pod memory usage
pod_network_rx_bytes: Inbound network traffic
pod_network_tx_bytes: Outbound network traffic
Container Level:
container_cpu_utilization: Individual container CPU
container_memory_utilization: Individual container memory
Accessing CloudWatch Dashboards
Navigate to CloudWatch Console → Container Insights → Performance monitoring to access built-in dashboards:

Cluster View: High-level cluster health and resource utilization
Node View: Per-node metrics with drill-down capability
Pod View: Pod-level metrics filtered by namespace
Service View: Service-level aggregated metrics


Prometheus Query Patterns
Prometheus excels at complex queries and aggregations:
CPU Utilization by Namespace:
sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)
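Memory Usage Above Threshold (the "> 0" filter skips containers with no memory limit, which would otherwise divide by zero):
container_memory_usage_bytes / (container_spec_memory_limit_bytes > 0) > 0.8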
Pod Restart Rate:
increase(kube_pod_container_status_restarts_total[1h]) > 3
Network Traffic Per Pod (cAdvisor network metrics carry namespace and pod labels, not a Kubernetes Service label):
sum(rate(container_network_transmit_bytes_total[5m])) by (namespace, pod)
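These queries can also be run from a terminal through the Prometheus HTTP API, which is handy for scripting checks. A minimal sketch, assuming Prometheus is reachable on localhost:9090 (for example through kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090). The helper name promql_query and the PROM_URL variable are illustrative.

```shell
# Base URL of the Prometheus server; defaults to a local port-forward.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Run an instant PromQL query and print the raw JSON response.
# --data-urlencode takes care of escaping brackets and spaces in the query.
promql_query() {
  local query="$1"
  curl -sS --get "${PROM_URL}/api/v1/query" --data-urlencode "query=${query}"
}
```

For example: promql_query 'sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)' returns a JSON document whose data.result array holds one entry per namespace.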
Grafana Dashboard Setup
Access Grafana via port-forward:
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
Open your browser to http://localhost:3000 and login with:
Username: admin
Password: admin123 (or the password you set during installation)
Pre-installed dashboards include:
Kubernetes / Compute Resources / Cluster: Overall cluster utilization
Kubernetes / Compute Resources / Namespace (Pods): Per-namespace pod metrics
Kubernetes / Compute Resources / Node (Pods): Per-node resource usage
Kubernetes / Networking / Cluster: Network I/O and errors


💡 Pro Tip: You can add CloudWatch as a data source in Grafana to create unified dashboards combining both CloudWatch and Prometheus metrics.
Production Considerations
High Availability
CloudWatch Agent: DaemonSet pattern ensures coverage even during node failures. If a node dies, metrics stop from that node only; cluster-level visibility remains.
Prometheus: Run multiple replicas with anti-affinity rules (in the chart's values, full affinity rules go under prometheusSpec.affinity):
prometheus:
  prometheusSpec:
    replicas: 2
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                    - prometheus
            topologyKey: kubernetes.io/hostname
Security Best Practices
1. IRSA Over Static Credentials: CloudWatch agent uses IAM roles attached to service accounts, eliminating credential management overhead.
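2. Network Policies: Restrict access to your application's metrics port to the monitoring namespace. Once a pod is selected by an ingress policy, all other ingress to it is denied unless another policy allows it, so pair this with a policy for regular application traffic:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-scrape
spec:
  podSelector:
    matchLabels:
      app: your-app
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              # label set automatically on every namespace (Kubernetes 1.21+)
              kubernetes.io/metadata.name: monitoring
      ports:
        - protocol: TCP
          port: 8080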
3. Grafana Authentication: Integrate with corporate SSO (LDAP, SAML, OAuth) rather than local authentication.
Scaling Considerations
CloudWatch Agent: Automatically scales with cluster size (one pod per node).
Prometheus: Vertical scaling limits exist (~1M samples/second). For larger deployments:
Use Prometheus federation (hierarchical Prometheus instances)
Implement sharding by namespace or service
Deploy Thanos for horizontal scalability
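To judge how close a deployment is to that limit, a rough ingestion estimate is the number of active series divided by the scrape interval. The helper below is a back-of-the-envelope sketch with an illustrative name; the measured figure is available from Prometheus itself via rate(prometheus_tsdb_head_samples_appended_total[5m]).

```shell
# Estimate ingestion rate: active series divided by scrape interval (seconds).
# Uses integer arithmetic, so the result is rounded down.
samples_per_second() {
  local active_series="$1" scrape_interval_s="$2"
  echo $(( active_series / scrape_interval_s ))
}
```

For instance, samples_per_second 3000000 30 prints 100000, which is well under the approximate 1M samples/second figure quoted above.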
Monitoring ECS (Comparison)
While this article focuses on EKS, ECS monitoring follows similar patterns with key differences:
ECS Container Insights Setup
Enable Container Insights at cluster creation:
aws ecs create-cluster \
--cluster-name production-ecs \
--settings name=containerInsights,value=enabled
Or enable on existing cluster:
aws ecs update-cluster-settings \
--cluster production-ecs \
--settings name=containerInsights,value=enabled
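After changing the setting it is worth confirming what the cluster actually reports. A small sketch, with an illustrative function name, that reads the setting back through the AWS CLI:

```shell
# Print the containerInsights setting of an ECS cluster (e.g. "enabled").
container_insights_setting() {
  local cluster="$1"
  aws ecs describe-clusters \
    --clusters "$cluster" \
    --include SETTINGS \
    --query "clusters[0].settings[?name=='containerInsights'].value | [0]" \
    --output text
}
```

For example, container_insights_setting production-ecs. To make new clusters in the account default to Container Insights, aws ecs put-account-setting-default --name containerInsights --value enabled can be used.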
Key Differences from EKS
Metric Collection:
ECS: Agent runs on EC2 instances (ECS-optimized AMI includes agent)
EKS: Agent runs as Kubernetes DaemonSet
Metric Granularity:
ECS: Task and container level metrics
EKS: Cluster, node, pod, and container level metrics
Prometheus Integration:
ECS: Requires FireLens for log routing and custom service discovery
EKS: Native ServiceMonitor discovery via Kubernetes API
Use Case Guidance:
Choose EKS for: Kubernetes-native applications, complex orchestration, multi-cloud portability
Choose ECS for: AWS-native workflows, simpler operational model, faster time-to-production
Troubleshooting Common Issues
CloudWatch Agent Not Reporting Metrics
Symptom: No metrics in CloudWatch Console after 10+ minutes.
Diagnosis:
# Check agent pod status
kubectl get pods -n amazon-cloudwatch
# View agent logs
kubectl logs -n amazon-cloudwatch -l app.kubernetes.io/name=cloudwatch-agent
Common causes:
- IRSA misconfiguration: Verify service account annotation
kubectl get sa cloudwatch-agent -n amazon-cloudwatch -o yaml | grep role-arn
- IAM permissions: Ensure role has CloudWatchAgentServerPolicy
- Network connectivity: Verify pods can reach CloudWatch endpoints
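The IRSA check can be made stricter than a grep by validating that the annotation holds a well-formed role ARN. The following is a sketch with illustrative function names; it assumes the agent authenticates through IRSA (a cluster using EKS Pod Identity would not carry this annotation).

```shell
# Return 0 if the argument looks like an IAM role ARN.
is_role_arn() {
  printf '%s' "$1" | grep -Eq '^arn:aws[a-z-]*:iam::[0-9]{12}:role/.+$'
}

# Read the IRSA annotation from the agent's service account and validate it.
check_irsa_annotation() {
  local arn
  arn="$(kubectl get sa cloudwatch-agent -n amazon-cloudwatch \
    -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}')"
  if is_role_arn "$arn"; then
    echo "ok: $arn"
  else
    echo "missing or malformed role-arn annotation: '${arn}'" >&2
    return 1
  fi
}
```

A failure here points at the service account rather than at IAM policy or networking, which narrows down which of the causes above to investigate first.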
Prometheus Targets Down
Symptom: Targets show "DOWN" status in Prometheus UI.
Diagnosis:
# Check Prometheus pod logs
kubectl logs -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0
# Verify ServiceMonitor configuration
kubectl get servicemonitor -n monitoring
Common causes:
Port mismatch: Ensure ServiceMonitor port matches pod's metrics port
Label selectors: Verify ServiceMonitor selector matches service labels
Network policies: Check if Prometheus is blocked from scraping
High Metric Cardinality
Symptom: Prometheus OOM errors or slow queries.
Diagnosis: Check active series count:
prometheus_tsdb_head_series
Solutions:
- Drop high-cardinality metrics:
prometheusSpec:
  additionalScrapeConfigs:
    - job_name: 'kubernetes-pods'
      metric_relabel_configs:
        - source_labels: [__name__]
          regex: 'expensive_metric_pattern'
          action: drop
- Aggregate metrics: Use recording rules to pre-compute aggregations
- Sample less frequently: Increase scrape interval for non-critical metrics
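As an illustration of the recording-rule option, the snippet below writes a PrometheusRule manifest that pre-computes per-namespace CPU usage. It is a sketch under assumptions: the file name, resource name, and rule name are made up for this example, and the release: prometheus label assumes the Helm release name used earlier so that the operator's default rule selector picks it up.

```shell
# Write a PrometheusRule with one recording rule that pre-computes
# per-namespace CPU usage. File, resource, and rule names are illustrative.
cat > recording-rules.yaml <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: namespace-cpu-recording
  namespace: monitoring
  labels:
    release: prometheus   # assumed to match the Helm release name
spec:
  groups:
    - name: namespace-cpu.rules
      interval: 1m
      rules:
        - record: namespace:container_cpu_usage_seconds:rate5m
          expr: sum by (namespace) (rate(container_cpu_usage_seconds_total[5m]))
EOF
```

After kubectl apply -f recording-rules.yaml, dashboards can query namespace:container_cpu_usage_seconds:rate5m directly, so the expensive aggregation is evaluated once per minute and not on every dashboard refresh.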
Best Practices and Recommendations
Metric Collection Strategy
1. Start with Defaults: CloudWatch Container Insights and Prometheus default configurations cover most day-to-day monitoring needs.
2. Add Custom Metrics Gradually: Instrument applications with Prometheus client libraries only when default metrics prove insufficient.
3. Use Labels Wisely: Every label multiplies the number of time series by its count of distinct values, so cardinality grows quickly. Limit labels to essential dimensions (service, environment, version).
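A quick calculation shows why label choice matters. The worst-case series count for a single metric is the product of the distinct values of each label. The helper below is a sketch with an illustrative name, and the value counts in the example are made up.

```shell
# Multiply the number of distinct values per label to get the worst-case
# number of time series for one metric.
series_count() {
  local total=1 n
  for n in "$@"; do
    total=$(( total * n ))
  done
  echo "$total"
}
```

With 20 services, 3 environments, and 5 versions, series_count 20 3 5 prints 300. Adding a user_id label with 10,000 distinct values, series_count 20 3 5 10000, prints 3000000 for that one metric.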
Alerting Philosophy
Critical Alerts (CloudWatch Alarms → SNS → PagerDuty):
Node NotReady state
Pod crash loops (>5 restarts in 10 minutes)
Memory utilization >85%
Disk utilization >90%
Warning Alerts (Prometheus → Alertmanager → Slack):
High CPU usage (>70% for 15 minutes)
Increased error rates
Elevated API latency
Informational (Metrics only, no alerts):
Request counts
Response time percentiles
Resource utilization trends
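As an example of the critical tier, the memory threshold above can be expressed as a CloudWatch alarm on the Container Insights metrics. This is a sketch, not a tuned production alarm: the function name, alarm name, period, and evaluation count are assumptions, and the SNS topic ARN is supplied by the caller.

```shell
# Create a CloudWatch alarm on cluster-wide average node memory utilization.
# Fires when the 5-minute average stays above 85% for two consecutive periods.
create_node_memory_alarm() {
  local cluster="$1" sns_topic_arn="$2"
  aws cloudwatch put-metric-alarm \
    --alarm-name "${cluster}-node-memory-high" \
    --namespace ContainerInsights \
    --metric-name node_memory_utilization \
    --dimensions "Name=ClusterName,Value=${cluster}" \
    --statistic Average \
    --period 300 \
    --evaluation-periods 2 \
    --threshold 85 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions "$sns_topic_arn"
}
```

Usage would look like create_node_memory_alarm your-cluster-name <SNS_TOPIC_ARN>, with the SNS topic forwarding to PagerDuty or another paging tool.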
Dashboard Organization
Executive Dashboard (CloudWatch):
Cluster health summary
Cost metrics
SLA compliance
Engineering Dashboard (Grafana):
Service-level metrics (RED: Rate, Errors, Duration)
Resource utilization by namespace
Network performance
Troubleshooting Views (Grafana):
Per-pod metrics for debugging
Log correlation with metrics
Distributed tracing integration
Conclusion
Effective monitoring at scale requires combining AWS native tools with open-source solutions. CloudWatch Container Insights provides managed infrastructure and native AWS integration, while Prometheus offers flexibility and powerful analytics.




