Monitoring ECS and EKS at Scale with CloudWatch Container Insights

As containerized workloads continue to dominate modern cloud infrastructure, effective monitoring becomes critical for maintaining reliability, performance, and cost efficiency. Amazon ECS and EKS power millions of production containers, yet many organizations struggle with comprehensive observability at scale.

This article explores a production-tested approach to monitoring container workloads on AWS, combining CloudWatch Container Insights with Prometheus to achieve complete visibility across your infrastructure. We'll examine why a dual-monitoring strategy outperforms single-tool approaches and demonstrate practical implementation patterns.

What you'll learn:

How CloudWatch Container Insights and Prometheus complement each other
Architecture patterns for monitoring EKS at scale
Implementation strategies using AWS native tools
Performance metrics that matter for production workloads
Cost optimization techniques for monitoring infrastructure

Prerequisites

Before diving into the implementation, ensure you have the following:

Required AWS Resources

AWS Account with appropriate IAM permissions
AWS CLI (v2.x or later) configured with credentials
Active EKS Cluster (Kubernetes 1.21+) or ability to create one
VPC with public and private subnets across multiple AZs
IAM permissions for:
- Creating/managing EKS clusters and addons
- Creating IAM roles and policies
- Writing to CloudWatch metrics and logs
- Managing EKS OIDC providers

Understanding the Container Monitoring Challenge

The Complexity of Container Observability

Traditional monitoring approaches fall short with containers due to their ephemeral nature, dynamic scheduling, and distributed architecture. Key challenges include:

Dynamic Infrastructure: Containers start and stop frequently, making static monitoring configurations obsolete. A pod running on node A at 10:00 AM might be rescheduled to node B by 10:05 AM.

Multi-Layer Visibility: Effective monitoring requires insights at multiple levels—cluster, node, pod, and container—each with different operational concerns.

Metric Volume: A moderate Kubernetes cluster generates thousands of metrics per minute. Without proper aggregation and filtering, the signal-to-noise ratio becomes problematic.

Distributed Tracing: Microservices communicate across network boundaries, requiring correlation of metrics, logs, and traces to understand system behavior.

Why Dual Monitoring?

Rather than choosing between AWS native tools and open-source solutions, the optimal strategy combines both:

CloudWatch Container Insights excels at:

Native AWS integration and managed infrastructure
Automatic metric collection without configuration
Built-in dashboards for immediate visibility
Integration with AWS services (SNS, Lambda, EventBridge)
Compliance and audit logging requirements

Prometheus + Grafana provides:

Flexible query language (PromQL) for complex analysis
Custom metric collection and application instrumentation
Community-driven dashboards and exporters
Longer retention periods (configurable)
No AWS API rate limits or costs per metric

This combination ensures both operational reliability (CloudWatch) and deep analytical capability (Prometheus).

Component Responsibilities

CloudWatch Agent: Collects cluster, node, pod, and container metrics. Deployed as a DaemonSet (one pod per node) to gather host-level metrics and Kubernetes resource utilization.

Fluent Bit: Aggregates logs from all containers and ships them to CloudWatch Logs. Handles log parsing, filtering, and routing based on namespace, pod labels, or log content.

Prometheus Operator: Manages Prometheus instances and monitoring configuration through Kubernetes CRDs (Custom Resource Definitions). Automatically discovers targets using ServiceMonitor resources.

Grafana: Provides the visualization layer with support for both CloudWatch and Prometheus data sources, enabling unified dashboards.

Implementation: EKS Monitoring

Infrastructure Prerequisites

Before implementing monitoring, ensure your EKS cluster has:

OIDC Provider: Required for IAM Roles for Service Accounts (IRSA)
Appropriate Node Sizing: Reserve 10-15% cluster capacity for monitoring workloads
Network Connectivity: Ensure pods can reach AWS API endpoints (use VPC endpoints to avoid NAT costs)
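
To make the 10-15% reservation concrete, the arithmetic can be sketched as follows. The node count and per-node sizes are illustrative assumptions rather than recommendations, and `monitoring_reserve` is a hypothetical helper, not part of any AWS tooling.

```python
def monitoring_reserve(nodes, cpu_millicores_per_node, memory_mib_per_node, percent):
    """Return (cpu_millicores, memory_mib) to set aside for monitoring workloads.

    Integer math avoids float rounding surprises; percent is a whole number (10 to 15).
    """
    total_cpu = nodes * cpu_millicores_per_node
    total_memory = nodes * memory_mib_per_node
    return total_cpu * percent // 100, total_memory * percent // 100


# Assumed example cluster: 10 nodes, each with 4 vCPU (4000m) and 16 GiB (16384 MiB).
low = monitoring_reserve(10, 4000, 16384, 10)
high = monitoring_reserve(10, 4000, 16384, 15)
print(f"reserve between {low} and {high} (millicores, MiB)")
```

For this assumed cluster the reservation works out to roughly one to one-and-a-half nodes' worth of capacity, which is a useful sanity check when sizing node groups.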

Step-by-Step Implementation

CloudWatch Container Insights Setup

Step 1: Create IAM Role for CloudWatch Agent

CloudWatch Container Insights is available as an EKS addon, which is the most reliable installation method. The addon's agent needs an IAM role first:

# Create the IAM role for the CloudWatch agent.
# --role-only creates just the role; the addon creates the service account itself.
eksctl create iamserviceaccount \
  --name cloudwatch-agent \
  --namespace amazon-cloudwatch \
  --cluster your-cluster-name \
  --attach-policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy \
  --role-only \
  --approve

This creates an IAM role with proper trust relationship to the cluster's OIDC provider, enabling secure API access without static credentials.

Step 2: Install the CloudWatch Observability Addon

# Install the CloudWatch Observability addon
aws eks create-addon \
  --cluster-name your-cluster-name \
  --addon-name amazon-cloudwatch-observability \
  --service-account-role-arn <ROLE_ARN_FROM_PREVIOUS_STEP>

The addon automatically deploys:

CloudWatch agent DaemonSet for metrics
Fluent Bit DaemonSet for logs
Required ConfigMaps and RBAC permissions

Step 3: Verification

# Check pod status
kubectl get pods -n amazon-cloudwatch

# Expected output:
# NAME                                 READY   STATUS    RESTARTS   AGE
# cloudwatch-agent-xxxxx              1/1     Running   0          2m
# fluent-bit-xxxxx                    1/1     Running   0          2m

# Verify metrics flowing to CloudWatch
aws cloudwatch list-metrics \
  --namespace ContainerInsights \
  --dimensions Name=ClusterName,Value=your-cluster-name
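
The same check can be scripted, which helps when verifying many clusters. The sketch below assumes a boto3-style CloudWatch client is passed in; `container_insights_metric_names` is a hypothetical helper name, and only the `list_metrics` call with its `Namespace`, `Dimensions`, and `NextToken` parameters comes from the CloudWatch API.

```python
def container_insights_metric_names(cloudwatch, cluster_name):
    """Return the sorted, de-duplicated metric names reported for one cluster.

    `cloudwatch` is any object exposing a boto3-compatible list_metrics method.
    """
    names = set()
    kwargs = {
        "Namespace": "ContainerInsights",
        "Dimensions": [{"Name": "ClusterName", "Value": cluster_name}],
    }
    while True:
        page = cloudwatch.list_metrics(**kwargs)
        names.update(metric["MetricName"] for metric in page.get("Metrics", []))
        token = page.get("NextToken")
        if not token:
            return sorted(names)
        # Results are paginated; pass the token back to fetch the next page.
        kwargs["NextToken"] = token
```

With boto3 installed and credentials configured, calling `container_insights_metric_names(boto3.client("cloudwatch"), "your-cluster-name")` should return names such as `node_cpu_utilization` once the agent has been running for a few minutes. An empty list suggests the agent is not publishing yet.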

Prometheus Stack Deployment

Step 4: Deploy Prometheus using Helm

Deploy Prometheus using the community-maintained kube-prometheus-stack:

# Add Prometheus Helm repository
helm repo add prometheus-community \
  https://prometheus-community.github.io/helm-charts

helm repo update

# Install Prometheus stack
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=15d \
  --set prometheus.prometheusSpec.resources.requests.memory=2Gi \
  --set grafana.enabled=true \
  --set grafana.adminPassword=admin123

This installs:

Prometheus Operator for lifecycle management
Prometheus server with 15-day retention
Alertmanager for notification routing
Grafana with pre-configured dashboards
Node exporters for host metrics
Kube-state-metrics for Kubernetes object metrics

Step 5: Verify Prometheus Stack

# Check all monitoring pods
kubectl get pods -n monitoring

# Expected output shows all pods running:
# prometheus-kube-prometheus-operator-xxxxx      1/1     Running
# prometheus-kube-state-metrics-xxxxx            1/1     Running
# prometheus-prometheus-node-exporter-xxxxx      1/1     Running
# alertmanager-prometheus-kube-prometheus-xxxxx  2/2     Running
# prometheus-grafana-xxxxx                       3/3     Running

Configuration Tips:

Resource Allocation: Prometheus memory usage grows with cardinality. For a 50-node cluster, allocate 4-8GB RAM.
Retention Period: Balance storage costs against analysis needs. 15 days handles most troubleshooting scenarios; use Thanos for longer-term storage.
ServiceMonitor Pattern: Create ServiceMonitors to automatically discover and scrape application metrics:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: your-application
  endpoints:
  - port: metrics
    interval: 30s

Sample Application with Instrumentation

Step 6: Deploy Test Application

Deploy a sample application to verify end-to-end monitoring:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
    spec:
      containers:
      - name: nginx
        image: nginx:1.21
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 200m
            memory: 256Mi
        ports:
        - containerPort: 80

# Apply the manifest
kubectl apply -f sample-app.yaml

# Verify deployment
kubectl get pods -l app=sample-app

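This gives both pipelines a known workload to report on: three nginx pods with explicit requests and limits. Note that the stock nginx image exposes no Prometheus metrics endpoint, so this deployment exercises infrastructure metrics (CPU, memory, network, restarts) rather than application-level instrumentation.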

Key Metrics and Dashboards

CloudWatch Container Insights Metrics

CloudWatch automatically collects metrics across four levels:

Cluster Level:

cluster_failed_node_count: Nodes in NotReady state
cluster_node_count: Total nodes
cluster_number_of_running_pods: Active pods

Node Level:

node_cpu_utilization: Percentage CPU usage
node_memory_utilization: Percentage memory usage
node_network_total_bytes: Network throughput
node_filesystem_utilization: Disk usage

Pod Level:

pod_cpu_utilization: Per-pod CPU usage
pod_memory_utilization: Per-pod memory usage
pod_network_rx_bytes: Inbound network traffic
pod_network_tx_bytes: Outbound network traffic

Container Level:

container_cpu_utilization: Individual container CPU
container_memory_utilization: Individual container memory

Accessing CloudWatch Dashboards

Navigate to CloudWatch Console → Container Insights → Performance monitoring to access built-in dashboards:

Cluster View: High-level cluster health and resource utilization
Node View: Per-node metrics with drill-down capability
Pod View: Pod-level metrics filtered by namespace
Service View: Service-level aggregated metrics

Prometheus Query Patterns

Prometheus excels at complex queries and aggregations:

CPU Utilization by Namespace:

sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)

Memory Usage Above Threshold:

container_memory_usage_bytes / (container_spec_memory_limit_bytes > 0) > 0.8

Pod Restart Rate:

increase(kube_pod_container_status_restarts_total[1h]) > 3

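Network Traffic Per Pod (cAdvisor metrics carry namespace and pod labels but no service label, so rolling pods up to services requires a join against kube-state-metrics or a recording rule):

sum(rate(container_network_transmit_bytes_total[5m])) by (namespace, pod)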
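
Several of these queries rely on rate() over a counter. As a rough illustration of what that function computes, here is a simplified sketch: it sums the increases between consecutive samples, treats any drop as a counter reset, and divides by the elapsed time. Prometheus additionally extrapolates to the window boundaries, which this sketch deliberately omits, so results will differ slightly from real query output. `simple_rate` is a hypothetical name.

```python
def simple_rate(samples):
    """Approximate per-second rate of a counter.

    samples: list of (timestamp_seconds, counter_value) tuples, oldest first.
    Returns None when there are too few samples to compute a rate.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # A drop means the counter restarted from zero (e.g. container restart).
            increase += current
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Steady counter: 60 units over 120 seconds.
print(simple_rate([(0, 0), (60, 30), (120, 60)]))
# Counter that resets between the second and third sample.
print(simple_rate([(0, 100), (60, 130), (120, 10), (180, 40)]))
```

The reset handling is why rate() stays meaningful across pod restarts, where a raw difference between first and last sample would go negative.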

Grafana Dashboard Setup

Access Grafana via port-forward:

kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

Open your browser to http://localhost:3000 and login with:

Username: admin
Password: admin123 (or the password you set during installation)

Pre-installed dashboards include:

Kubernetes / Compute Resources / Cluster: Overall cluster utilization
Kubernetes / Compute Resources / Namespace (Pods): Per-namespace pod metrics
Kubernetes / Compute Resources / Node (Pods): Per-node resource usage
Kubernetes / Networking / Cluster: Network I/O and errors

💡 Pro Tip: You can add CloudWatch as a data source in Grafana to create unified dashboards combining both CloudWatch and Prometheus metrics.

Production Considerations

High Availability

CloudWatch Agent: DaemonSet pattern ensures coverage even during node failures. If a node dies, metrics stop from that node only; cluster-level visibility remains.

Prometheus: Run multiple replicas with anti-affinity rules:

prometheus:
  prometheusSpec:
    replicas: 2
    # kube-prometheus-stack takes a keyword here ("", "soft", or "hard"),
    # not a raw affinity block; "hard" renders
    # requiredDuringSchedulingIgnoredDuringExecution for the Prometheus pods
    podAntiAffinity: "hard"
    podAntiAffinityTopologyKey: kubernetes.io/hostname

Security Best Practices

1. IRSA Over Static Credentials: CloudWatch agent uses IAM roles attached to service accounts, eliminating credential management overhead.

2. Network Policies: Restrict Prometheus scraping to authorized namespaces:

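apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-scrape
spec:
  podSelector:
    matchLabels:
      app: your-app
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          # Label set automatically on every namespace (Kubernetes 1.21+);
          # a bare "name" label exists only if you add it yourself
          kubernetes.io/metadata.name: monitoring
    ports:
    - protocol: TCP
      port: 8080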
3. Grafana Authentication: Integrate with corporate SSO (LDAP, SAML, OAuth) rather than local authentication.

Scaling Considerations

CloudWatch Agent: Automatically scales with cluster size (one pod per node).

Prometheus: Vertical scaling limits exist (~1M samples/second). For larger deployments:

Use Prometheus federation (hierarchical Prometheus instances)
Implement sharding by namespace or service
Deploy Thanos for horizontal scalability

Monitoring ECS (Comparison)

While this article focuses on EKS, ECS monitoring follows similar patterns with key differences:

ECS Container Insights Setup

Enable Container Insights at cluster creation:

aws ecs create-cluster \
  --cluster-name production-ecs \
  --settings name=containerInsights,value=enabled

Or enable on existing cluster:

aws ecs update-cluster-settings \
  --cluster production-ecs \
  --settings name=containerInsights,value=enabled
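
When an account holds many clusters, the same setting can be applied in a loop. The sketch below assumes a boto3-style ECS client is passed in; `enable_container_insights` is a hypothetical helper, while `list_clusters` and `update_cluster_settings` are the ECS API calls behind the CLI commands above.

```python
def enable_container_insights(ecs):
    """Enable Container Insights on every ECS cluster visible to the client.

    `ecs` is any object exposing boto3-compatible list_clusters and
    update_cluster_settings methods. Returns the cluster names that were updated.
    """
    updated = []
    kwargs = {}
    while True:
        page = ecs.list_clusters(**kwargs)
        for arn in page.get("clusterArns", []):
            # Cluster ARNs end in "cluster/<name>"; pass the short name.
            name = arn.rsplit("/", 1)[-1]
            ecs.update_cluster_settings(
                cluster=name,
                settings=[{"name": "containerInsights", "value": "enabled"}],
            )
            updated.append(name)
        token = page.get("nextToken")
        if not token:
            return updated
        kwargs["nextToken"] = token
```

With boto3 installed, `enable_container_insights(boto3.client("ecs"))` would apply the setting to every cluster in the client's region. Container Insights metrics are billed as custom metrics, so it may be worth filtering the cluster list before running this broadly.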

Key Differences from EKS

Metric Collection:

ECS: Agent runs on EC2 instances (ECS-optimized AMI includes agent)
EKS: Agent runs as Kubernetes DaemonSet

Metric Granularity:

ECS: Task and container level metrics
EKS: Cluster, node, pod, and container level metrics

Prometheus Integration:

ECS: No ServiceMonitor equivalent; Prometheus scraping requires custom service discovery, while FireLens covers log routing only
EKS: Native ServiceMonitor discovery via Kubernetes API

Use Case Guidance:

Choose EKS for: Kubernetes-native applications, complex orchestration, multi-cloud portability
Choose ECS for: AWS-native workflows, simpler operational model, faster time-to-production

Troubleshooting Common Issues

CloudWatch Agent Not Reporting Metrics

Symptom: No metrics in CloudWatch Console after 10+ minutes.

Diagnosis:

# Check agent pod status
kubectl get pods -n amazon-cloudwatch

# View agent logs
kubectl logs -n amazon-cloudwatch -l app.kubernetes.io/name=cloudwatch-agent

Common causes:

IRSA misconfiguration: Verify service account annotation

kubectl get sa cloudwatch-agent -n amazon-cloudwatch -o yaml | grep role-arn

IAM permissions: Ensure role has CloudWatchAgentServerPolicy
Network connectivity: Verify pods can reach CloudWatch endpoints

Prometheus Targets Down

Symptom: Targets show "DOWN" status in Prometheus UI.

Diagnosis:

# Check Prometheus pod logs
kubectl logs -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0

# Verify ServiceMonitor configuration
kubectl get servicemonitor -n monitoring

Common causes:

Port mismatch: Ensure ServiceMonitor port matches pod's metrics port
Label selectors: Verify ServiceMonitor selector matches service labels
Network policies: Check if Prometheus is blocked from scraping
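
The first two causes can be checked mechanically. The sketch below compares a ServiceMonitor against a Service, both given as dictionaries in the shape produced by `kubectl get ... -o json`. `scrape_problems` is a hypothetical helper, and it only covers `matchLabels` selectors and named ports, not `matchExpressions` or `targetPort`.

```python
def scrape_problems(service_monitor, service):
    """List reasons a ServiceMonitor would fail to select or scrape a Service."""
    problems = []

    # Every key/value pair in matchLabels must be present on the Service.
    wanted = service_monitor["spec"]["selector"].get("matchLabels", {})
    labels = service["metadata"].get("labels", {})
    for key, value in wanted.items():
        if labels.get(key) != value:
            problems.append(
                f"label {key}={value} not on Service (found {labels.get(key)!r})"
            )

    # endpoints[].port refers to the *name* of a Service port, not its number.
    port_names = {port.get("name") for port in service["spec"].get("ports", [])}
    for endpoint in service_monitor["spec"].get("endpoints", []):
        port = endpoint.get("port")
        if port is not None and port not in port_names:
            problems.append(f"endpoint port {port!r} is not a named port on the Service")

    return problems


monitor = {
    "spec": {
        "selector": {"matchLabels": {"app": "your-application"}},
        "endpoints": [{"port": "metrics", "interval": "30s"}],
    }
}
service = {
    "metadata": {"labels": {"app": "your-application"}},
    "spec": {"ports": [{"name": "http", "port": 8080}]},
}
for problem in scrape_problems(monitor, service):
    print(problem)
```

In this example the labels match but the Service port is named `http` rather than `metrics`, a common reason for a target never appearing in the Prometheus UI at all.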

High Metric Cardinality

Symptom: Prometheus OOM errors or slow queries.

Diagnosis: Check active series count:

prometheus_tsdb_head_series
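
The total alone does not say which metrics are responsible. Prometheus also exposes per-metric series counts through its TSDB status endpoint (`GET /api/v1/status/tsdb`). The sketch below ranks them from an already-parsed JSON response; `top_series` is a hypothetical helper, and the sample payload uses made-up numbers purely for illustration.

```python
def top_series(tsdb_status, limit=5):
    """Return the metrics with the most active series as (name, count) pairs.

    tsdb_status: parsed JSON body of GET /api/v1/status/tsdb.
    """
    rows = tsdb_status["data"]["seriesCountByMetricName"]
    ranked = sorted(rows, key=lambda row: row["value"], reverse=True)
    return [(row["name"], row["value"]) for row in ranked[:limit]]


# Illustrative payload only; real counts depend entirely on the cluster.
sample_status = {
    "data": {
        "seriesCountByMetricName": [
            {"name": "container_cpu_usage_seconds_total", "value": 4200},
            {"name": "apiserver_request_duration_seconds_bucket", "value": 38000},
            {"name": "kube_pod_container_status_restarts_total", "value": 900},
        ]
    }
}
for name, count in top_series(sample_status, limit=2):
    print(f"{count:>8}  {name}")
```

The metric names at the top of this ranking are the natural candidates for the drop rules and recording rules described next.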

Solutions:

Drop high-cardinality metrics:

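prometheusSpec:
  additionalScrapeConfigs:
  - job_name: 'kubernetes-pods'
    # Discover pod targets; without service discovery the job scrapes nothing
    kubernetes_sd_configs:
    - role: pod
    metric_relabel_configs:
    # Drop matching series after the scrape, before they are stored
    - source_labels: [__name__]
      regex: 'expensive_metric_pattern'
      action: drop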

Aggregate metrics: Use recording rules to pre-compute aggregations
Sample less frequently: Increase scrape interval for non-critical metrics

Best Practices and Recommendations

Metric Collection Strategy

1. Start with Defaults: CloudWatch Container Insights and Prometheus default configurations cover most day-to-day monitoring needs.

2. Add Custom Metrics Gradually: Instrument applications with Prometheus client libraries only when default metrics prove insufficient.

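3. Use Labels Wisely: Every label multiplies a metric's cardinality by its number of distinct values, so a single unbounded label (user ID, request ID) can create millions of series. Limit labels to essential, bounded dimensions (service, environment, version).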
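
The effect of labels on cardinality is easy to underestimate, so a quick worst-case estimate helps. The sketch below multiplies the number of distinct values per label; the counts are illustrative assumptions, and `estimated_series` is a hypothetical helper. Real series counts are usually lower, because not every label combination occurs.

```python
from math import prod


def estimated_series(label_values):
    """Worst-case series count for one metric.

    label_values maps each label name to its number of distinct values.
    """
    return prod(label_values.values())


# Assumed counts: 20 services, 3 environments, 4 versions in flight.
base = {"service": 20, "environment": 3, "version": 4}
with_user_id = {**base, "user_id": 10_000}

print(estimated_series(base))
print(estimated_series(with_user_id))
```

Under these assumptions the three bounded labels yield a few hundred series per metric, while adding one per-user label pushes the same metric into the millions.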

Alerting Philosophy

Critical Alerts (CloudWatch Alarms → SNS → PagerDuty):

Node NotReady state
Pod crash loops (>5 restarts in 10 minutes)
Memory utilization >85%
Disk utilization >90%
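
As one possible shape for the memory alert, the sketch below builds the arguments for CloudWatch's `put_metric_alarm` call against the Container Insights `node_memory_utilization` metric. The alarm name, the 60-second period, the five evaluation periods, and the SNS topic ARN are all assumptions to adapt; `memory_alarm_params` is a hypothetical helper.

```python
def memory_alarm_params(cluster_name, sns_topic_arn, threshold=85.0):
    """Build keyword arguments for a boto3 CloudWatch put_metric_alarm call."""
    return {
        "AlarmName": f"{cluster_name}-node-memory-high",  # assumed naming scheme
        "Namespace": "ContainerInsights",
        "MetricName": "node_memory_utilization",
        "Dimensions": [{"Name": "ClusterName", "Value": cluster_name}],
        "Statistic": "Average",
        "Period": 60,               # seconds per datapoint (assumption)
        "EvaluationPeriods": 5,     # must breach for 5 consecutive datapoints (assumption)
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # SNS topic that forwards to the pager
    }


# Placeholder ARN for illustration only.
params = memory_alarm_params(
    "your-cluster-name", "arn:aws:sns:us-east-1:123456789012:oncall-critical"
)
print(params["AlarmName"], params["Threshold"])
```

With boto3 installed, the alarm would be created with `boto3.client("cloudwatch").put_metric_alarm(**params)`. Keeping the parameters in a plain function makes the thresholds easy to review and reuse across clusters.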

Warning Alerts (Prometheus → Alertmanager → Slack):

High CPU usage (>70% for 15 minutes)
Increased error rates
Elevated API latency

Informational (Metrics only, no alerts):

Request counts
Response time percentiles
Resource utilization trends

Dashboard Organization

Executive Dashboard (CloudWatch):

Cluster health summary
Cost metrics
SLA compliance

Engineering Dashboard (Grafana):

Service-level metrics (RED: Rate, Errors, Duration)
Resource utilization by namespace
Network performance

Troubleshooting Views (Grafana):

Per-pod metrics for debugging
Log correlation with metrics
Distributed tracing integration

Conclusion

Effective monitoring at scale requires combining AWS native tools with open-source solutions. CloudWatch Container Insights provides managed infrastructure and native AWS integration, while Prometheus offers flexibility and powerful analytics.
