Monitoring ECS and EKS at Scale with CloudWatch Container Insights and Prometheus

I am an experienced Cloud and DevOps Engineer. I blog about solutions and Cloud and DevOps projects that boost your portfolio, and I provide troubleshooting guides on Cloud and DevOps topics.

As containerized workloads continue to dominate modern cloud infrastructure, effective monitoring becomes critical for maintaining reliability, performance, and cost efficiency. Amazon ECS and EKS power millions of production containers, yet many organizations struggle with comprehensive observability at scale.

This article explores a production-tested approach to monitoring container workloads on AWS, combining CloudWatch Container Insights with Prometheus to achieve complete visibility across your infrastructure. We'll examine why a dual-monitoring strategy outperforms single-tool approaches and demonstrate practical implementation patterns.

What you'll learn:

  • How CloudWatch Container Insights and Prometheus complement each other

  • Architecture patterns for monitoring EKS at scale

  • Implementation strategies using AWS native tools

  • Performance metrics that matter for production workloads

  • Cost optimization techniques for monitoring infrastructure

Prerequisites

Before diving into the implementation, ensure you have the following:

Required AWS Resources

  • AWS Account with appropriate IAM permissions

  • AWS CLI (v2.x or later) configured with credentials

  • Active EKS Cluster (Kubernetes 1.21+) or ability to create one

  • VPC with public and private subnets across multiple AZs

  • IAM permissions for:

    • Creating/managing EKS clusters and addons

    • Creating IAM roles and policies

    • Writing to CloudWatch metrics and logs

    • Managing EKS OIDC providers

Understanding the Container Monitoring Challenge

The Complexity of Container Observability

Traditional monitoring approaches fall short with containers due to their ephemeral nature, dynamic scheduling, and distributed architecture. Key challenges include:

Dynamic Infrastructure: Containers start and stop frequently, making static monitoring configurations obsolete. A pod running on node A at 10:00 AM might be rescheduled to node B by 10:05 AM.

Multi-Layer Visibility: Effective monitoring requires insights at multiple levels—cluster, node, pod, and container—each with different operational concerns.

Metric Volume: A moderate Kubernetes cluster generates thousands of metrics per minute. Without proper aggregation and filtering, the signal-to-noise ratio becomes problematic.

Distributed Tracing: Microservices communicate across network boundaries, requiring correlation of metrics, logs, and traces to understand system behavior.

Why Dual Monitoring?

Rather than choosing between AWS native tools and open-source solutions, the optimal strategy combines both:

CloudWatch Container Insights excels at:

  • Native AWS integration and managed infrastructure

  • Automatic metric collection without configuration

  • Built-in dashboards for immediate visibility

  • Integration with AWS services (SNS, Lambda, EventBridge)

  • Compliance and audit logging requirements

Prometheus + Grafana provides:

  • Flexible query language (PromQL) for complex analysis

  • Custom metric collection and application instrumentation

  • Community-driven dashboards and exporters

  • Longer retention periods (configurable)

  • No AWS API rate limits or costs per metric

This combination ensures both operational reliability (CloudWatch) and deep analytical capability (Prometheus).

Component Responsibilities

CloudWatch Agent: Collects cluster, node, pod, and container metrics. Deployed as a DaemonSet (one pod per node) to gather host-level metrics and Kubernetes resource utilization.

Fluent Bit: Aggregates logs from all containers and ships them to CloudWatch Logs. Handles log parsing, filtering, and routing based on namespace, pod labels, or log content.

Prometheus Operator: Manages Prometheus instances and monitoring configuration through Kubernetes CRDs (Custom Resource Definitions). Automatically discovers targets using ServiceMonitor resources.

Grafana: Provides the visualization layer, with support for both CloudWatch and Prometheus data sources, enabling unified dashboards.

Implementation: EKS Monitoring

Infrastructure Prerequisites

Before implementing monitoring, ensure your EKS cluster has:

  1. OIDC Provider: Required for IAM Roles for Service Accounts (IRSA)

  2. Appropriate Node Sizing: Reserve 10-15% cluster capacity for monitoring workloads

  3. Network Connectivity: Ensure pods can reach AWS API endpoints (use VPC endpoints to avoid NAT costs)
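
To make the 10-15% sizing guideline concrete, here is a minimal sketch of the arithmetic in shell. The function name monitoring_reserve and the example cluster size are hypothetical; only the 10-15% range comes from the list above.

```shell
# Print the CPU (in millicores) to set aside for monitoring workloads.
# Usage: monitoring_reserve <nodes> <millicores_per_node> <percent>
monitoring_reserve() {
  total=$(( $1 * $2 ))
  # Integer arithmetic, rounded up so the reservation is never understated.
  echo $(( (total * $3 + 99) / 100 ))
}

# Hypothetical cluster: 10 nodes with 4 vCPU (4000m) each.
low=$(monitoring_reserve 10 4000 10)
high=$(monitoring_reserve 10 4000 15)
echo "Reserve between ${low}m and ${high}m CPU for monitoring"
```

The same calculation works for memory if you pass MiB per node instead of millicores.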

Step-by-Step Implementation

CloudWatch Container Insights Setup

Step 1: Create IAM Role for CloudWatch Agent

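CloudWatch Container Insights is available as an EKS addon, which is the most reliable way to install it. The addon's agent needs an IAM role, so create that first:

# Create the IAM role for the CloudWatch agent.
# --role-only creates just the role; the addon in Step 2 creates the service account itself.
# The role name is a placeholder; choose your own.
eksctl create iamserviceaccount \
  --name cloudwatch-agent \
  --namespace amazon-cloudwatch \
  --cluster your-cluster-name \
  --role-name your-cloudwatch-agent-role \
  --attach-policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy \
  --role-only \
  --approve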

This creates an IAM role with proper trust relationship to the cluster's OIDC provider, enabling secure API access without static credentials.

Step 2: Install the CloudWatch Observability Addon

# Install the CloudWatch Observability addon
aws eks create-addon \
  --cluster-name your-cluster-name \
  --addon-name amazon-cloudwatch-observability \
  --service-account-role-arn <ROLE_ARN_FROM_PREVIOUS_STEP>

The addon automatically deploys:

  • CloudWatch agent DaemonSet for metrics

  • Fluent Bit DaemonSet for logs

  • Required ConfigMaps and RBAC permissions

Step 3: Verification

# Check pod status
kubectl get pods -n amazon-cloudwatch

# Expected output:
# NAME                                 READY   STATUS    RESTARTS   AGE
# cloudwatch-agent-xxxxx              1/1     Running   0          2m
# fluent-bit-xxxxx                    1/1     Running   0          2m

# Verify metrics flowing to CloudWatch
aws cloudwatch list-metrics \
  --namespace ContainerInsights \
  --dimensions Name=ClusterName,Value=your-cluster-name

Prometheus Stack Deployment

Step 4: Deploy Prometheus using Helm

Deploy Prometheus using the community-maintained kube-prometheus-stack:

# Add Prometheus Helm repository
helm repo add prometheus-community \
  https://prometheus-community.github.io/helm-charts

helm repo update

# Install Prometheus stack
# admin123 is a placeholder; use a strong password on anything other than a throwaway test cluster
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=15d \
  --set prometheus.prometheusSpec.resources.requests.memory=2Gi \
  --set grafana.enabled=true \
  --set grafana.adminPassword=admin123

This installs:

  • Prometheus Operator for lifecycle management

  • Prometheus server with 15-day retention

  • Alertmanager for notification routing

  • Grafana with pre-configured dashboards

  • Node exporters for host metrics

  • Kube-state-metrics for Kubernetes object metrics

Step 5: Verify Prometheus Stack

# Check all monitoring pods
kubectl get pods -n monitoring

# Expected output shows all pods running:
# prometheus-kube-prometheus-operator-xxxxx            1/1     Running
# prometheus-kube-state-metrics-xxxxx                  1/1     Running
# prometheus-prometheus-node-exporter-xxxxx            1/1     Running
# prometheus-prometheus-kube-prometheus-prometheus-0   2/2     Running
# alertmanager-prometheus-kube-prometheus-xxxxx        2/2     Running
# prometheus-grafana-xxxxx                             3/3     Running

Configuration Tips:

  1. Resource Allocation: Prometheus memory usage grows with cardinality. For a 50-node cluster, allocate 4-8GB RAM.

  2. Retention Period: Balance storage costs against analysis needs. 15 days handles most troubleshooting scenarios; use Thanos for longer-term storage.

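  3. ServiceMonitor Pattern: Create ServiceMonitors to automatically discover and scrape application metrics. With the kube-prometheus-stack defaults, Prometheus only picks up ServiceMonitors labeled with the Helm release name, and a ServiceMonitor only looks for Services in its own namespace unless namespaceSelector is set:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
  namespace: monitoring
  labels:
    release: prometheus      # matches the Helm release name used in Step 4
spec:
  namespaceSelector:
    matchNames:
    - default                # placeholder: the namespace where your Service runs
  selector:
    matchLabels:
      app: your-application
  endpoints:
  - port: metrics            # the name of the Service port, not a number
    interval: 30s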
Sample Application with Instrumentation

Step 6: Deploy Test Application

Deploy a sample application to verify end-to-end monitoring:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
    spec:
      containers:
      - name: nginx
        image: nginx:1.21
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 200m
            memory: 256Mi
        ports:
        - containerPort: 80

# Apply the manifest (saved as sample-app.yaml)
kubectl apply -f sample-app.yaml

# Verify deployment
kubectl get pods -l app=sample-app

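The pods' CPU, memory, and network usage should now show up in Container Insights and in the Prometheus container metrics. Note that plain nginx exposes no Prometheus endpoint, so this test covers infrastructure metrics only, not application-level metrics.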

Key Metrics and Dashboards

CloudWatch Container Insights Metrics

CloudWatch automatically collects metrics across four levels:

Cluster Level:

  • cluster_failed_node_count: Nodes in NotReady state

  • cluster_node_count: Total nodes

  • cluster_number_of_running_pods: Active pods

Node Level:

  • node_cpu_utilization: Percentage CPU usage

  • node_memory_utilization: Percentage memory usage

  • node_network_total_bytes: Network throughput

  • node_filesystem_utilization: Disk usage

Pod Level:

  • pod_cpu_utilization: Per-pod CPU usage

  • pod_memory_utilization: Per-pod memory usage

  • pod_network_rx_bytes: Inbound network traffic

  • pod_network_tx_bytes: Outbound network traffic

Container Level:

  • container_cpu_utilization: Individual container CPU

  • container_memory_utilization: Individual container memory

Accessing CloudWatch Dashboards

Navigate to CloudWatch Console → Container Insights → Performance monitoring to access built-in dashboards:

  1. Cluster View: High-level cluster health and resource utilization

  2. Node View: Per-node metrics with drill-down capability

  3. Pod View: Pod-level metrics filtered by namespace

  4. Service View: Service-level aggregated metrics

Prometheus Query Patterns

Prometheus excels at complex queries and aggregations:

CPU Utilization by Namespace:

sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)

Memory Usage Above Threshold (containers without a memory limit report a limit of 0 and are filtered out first):

container_memory_usage_bytes / (container_spec_memory_limit_bytes > 0) > 0.8

Pod Restart Rate:

increase(kube_pod_container_status_restarts_total[1h]) > 3

Network Traffic Per Pod (cAdvisor network metrics carry namespace and pod labels, not a service label):

sum(rate(container_network_transmit_bytes_total[5m])) by (namespace, pod)
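
The memory threshold pattern above is a simple ratio test, and it helps to see the arithmetic once. The sketch below reproduces it in shell with integer math. The function name memory_over_threshold and the usage reading are hypothetical; the 256Mi limit is the one set on the sample application. A limit of 0 is treated as "no limit set" and skipped, because dividing by it would flag every unlimited container.

```shell
# Succeed (exit 0) when usage exceeds the given percentage of the limit.
# Usage: memory_over_threshold <usage_bytes> <limit_bytes> <percent>
memory_over_threshold() {
  # No limit configured: nothing meaningful to compare against.
  [ "$2" -gt 0 ] || return 1
  # usage / limit > percent / 100, rewritten without division.
  [ $(( $1 * 100 )) -gt $(( $2 * $3 )) ]
}

limit=$(( 256 * 1024 * 1024 ))   # 256Mi, the sample-app memory limit
usage=$(( 220 * 1024 * 1024 ))   # hypothetical reading of 220Mi

if memory_over_threshold "$usage" "$limit" 80; then
  echo "above 80% of limit"
else
  echo "within limit"
fi
```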

Grafana Dashboard Setup

Access Grafana via port-forward:

kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

Open your browser to http://localhost:3000 and login with:

  • Username: admin

  • Password: admin123 (or the password you set during installation)

Pre-installed dashboards include:

  • Kubernetes / Compute Resources / Cluster: Overall cluster utilization

  • Kubernetes / Compute Resources / Namespace (Pods): Per-namespace pod metrics

  • Kubernetes / Compute Resources / Node (Pods): Per-node resource usage

  • Kubernetes / Networking / Cluster: Network I/O and errors

💡 Pro Tip: You can add CloudWatch as a data source in Grafana to create unified dashboards combining both CloudWatch and Prometheus metrics.

Production Considerations

High Availability

CloudWatch Agent: DaemonSet pattern ensures coverage even during node failures. If a node dies, metrics stop from that node only; cluster-level visibility remains.

Prometheus: Run multiple replicas with anti-affinity rules:

prometheus:
  prometheusSpec:
    replicas: 2
    # In kube-prometheus-stack values, podAntiAffinity is a mode string.
    # "hard" maps to requiredDuringSchedulingIgnoredDuringExecution.
    podAntiAffinity: "hard"
    podAntiAffinityTopologyKey: kubernetes.io/hostname

Security Best Practices

1. IRSA Over Static Credentials: CloudWatch agent uses IAM roles attached to service accounts, eliminating credential management overhead.

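2. Network Policies: Restrict access to your application's metrics port so that only the monitoring namespace can scrape it:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-scrape
spec:
  podSelector:
    matchLabels:
      app: your-app
  # Once a pod is selected by a policy, all ingress not listed here is denied,
  # so add rules for regular application traffic as well.
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          # Label set automatically on every namespace by Kubernetes
          kubernetes.io/metadata.name: monitoring
    ports:
    - protocol: TCP
      port: 8080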

3. Grafana Authentication: Integrate with corporate SSO (LDAP, SAML, OAuth) rather than local authentication.

Scaling Considerations

CloudWatch Agent: Automatically scales with cluster size (one pod per node).

Prometheus: Vertical scaling limits exist (~1M samples/second). For larger deployments:

  • Use Prometheus federation (hierarchical Prometheus instances)

  • Implement sharding by namespace or service

  • Deploy Thanos for horizontal scalability
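
Sharding by namespace means each Prometheus instance scrapes a fixed subset of namespaces. Below is a minimal sketch of a deterministic assignment. The helper shard_for and the namespace names are hypothetical, and cksum is used purely for illustration; within Prometheus the equivalent is usually done with the hashmod relabel action.

```shell
# Map a namespace name to one of N shards (0 .. N-1).
# Usage: shard_for <namespace> <shard_count>
shard_for() {
  # cksum prints "<crc> <byte count>"; keep only the CRC.
  crc=$(printf '%s' "$1" | cksum | cut -d ' ' -f 1)
  echo $(( crc % $2 ))
}

# Hypothetical namespaces spread across 3 Prometheus shards.
for ns in default monitoring payments; do
  echo "$ns -> shard $(shard_for "$ns" 3)"
done
```

Because the mapping depends only on the name, a namespace always lands on the same shard until the shard count changes.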

Monitoring ECS (Comparison)

While this article focuses on EKS, ECS monitoring follows similar patterns with key differences:

ECS Container Insights Setup

Enable Container Insights at cluster creation:

aws ecs create-cluster \
  --cluster-name production-ecs \
  --settings name=containerInsights,value=enabled

Or enable on existing cluster:

aws ecs update-cluster-settings \
  --cluster production-ecs \
  --settings name=containerInsights,value=enabled

Key Differences from EKS

Metric Collection:

  • ECS: Agent runs on EC2 instances (ECS-optimized AMI includes agent)

  • EKS: Agent runs as Kubernetes DaemonSet

Metric Granularity:

  • ECS: Cluster, service, task, and (with enhanced observability) container level metrics

  • EKS: Cluster, node, pod, and container level metrics

Prometheus Integration:

  • ECS: No native equivalent of ServiceMonitor; requires custom service discovery (FireLens handles log routing only)

  • EKS: Native ServiceMonitor discovery via Kubernetes API

Use Case Guidance:

  • Choose EKS for: Kubernetes-native applications, complex orchestration, multi-cloud portability

  • Choose ECS for: AWS-native workflows, simpler operational model, faster time-to-production

Troubleshooting Common Issues

CloudWatch Agent Not Reporting Metrics

Symptom: No metrics in CloudWatch Console after 10+ minutes.

Diagnosis:

# Check agent pod status
kubectl get pods -n amazon-cloudwatch

# View agent logs
kubectl logs -n amazon-cloudwatch -l app.kubernetes.io/name=cloudwatch-agent

Common causes:

  1. IRSA misconfiguration: Verify service account annotation

kubectl get sa cloudwatch-agent -n amazon-cloudwatch -o yaml | grep role-arn

  2. IAM permissions: Ensure role has CloudWatchAgentServerPolicy

  3. Network connectivity: Verify pods can reach CloudWatch endpoints

Prometheus Targets Down

Symptom: Targets show "DOWN" status in Prometheus UI.

Diagnosis:

# Check Prometheus pod logs
kubectl logs -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0

# Verify ServiceMonitor configuration
kubectl get servicemonitor -n monitoring

Common causes:

  1. Port mismatch: Ensure ServiceMonitor port matches pod's metrics port

  2. Label selectors: Verify ServiceMonitor selector matches service labels

  3. Network policies: Check if Prometheus is blocked from scraping

High Metric Cardinality

Symptom: Prometheus OOM errors or slow queries.

Diagnosis: Check active series count:

prometheus_tsdb_head_series

Solutions:

  1. Drop high-cardinality metrics:

prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
    - job_name: 'kubernetes-pods'
      # Discovery is required; without it the job has no targets
      kubernetes_sd_configs:
      - role: pod
      metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'expensive_metric_pattern'
        action: drop

  2. Aggregate metrics: Use recording rules to pre-compute aggregations

  3. Sample less frequently: Increase scrape interval for non-critical metrics

Best Practices and Recommendations

Metric Collection Strategy

1. Start with Defaults: The default configurations of CloudWatch Container Insights and Prometheus cover most infrastructure-level monitoring needs.

2. Add Custom Metrics Gradually: Instrument applications with Prometheus client libraries only when default metrics prove insufficient.

3. Use Labels Wisely: Each additional label multiplies the number of time series by its number of distinct values, so cardinality grows quickly. Limit labels to essential dimensions (service, environment, version).
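
A quick way to see why labels matter: the worst-case number of time series for one metric is the product of the distinct values of each label. The sketch below runs that multiplication. The function name series_count and the label value counts are made up for illustration.

```shell
# Multiply the number of distinct values per label to get the
# worst-case series count for a single metric.
# Usage: series_count <values_for_label_1> <values_for_label_2> ...
series_count() {
  total=1
  for n in "$@"; do
    total=$(( total * n ))
  done
  echo "$total"
}

# Hypothetical: 20 services x 3 environments x 5 versions
series_count 20 3 5          # prints 300

# Adding a user_id label with 10000 distinct values
series_count 20 3 5 10000    # prints 3000000
```

One unbounded label turns a few hundred series into millions, which is why identifiers such as user or request IDs belong in logs rather than metric labels.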

Alerting Philosophy

Critical Alerts (CloudWatch Alarms → SNS → PagerDuty):

  • Node NotReady state

  • Pod crash loops (>5 restarts in 10 minutes)

  • Memory utilization >85%

  • Disk utilization >90%

Warning Alerts (Prometheus → Alertmanager → Slack):

  • High CPU usage (>70% for 15 minutes)

  • Increased error rates

  • Elevated API latency

Informational (Metrics only, no alerts):

  • Request counts

  • Response time percentiles

  • Resource utilization trends
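
The utilization thresholds in the tiers above can be written down as a small lookup, which is handy for checking that alert rules in both systems agree. This is only a sketch: the function name alert_tier is hypothetical, it covers just the percentage-based thresholds listed above, and it ignores the "for 15 minutes" duration on the CPU warning.

```shell
# Print the alert tier for a utilization reading.
# Usage: alert_tier <cpu|memory|disk> <percent>
alert_tier() {
  case "$1" in
    memory) [ "$2" -gt 85 ] && { echo critical; return; } ;;
    disk)   [ "$2" -gt 90 ] && { echo critical; return; } ;;
    cpu)    [ "$2" -gt 70 ] && { echo warning; return; } ;;
  esac
  # Below threshold, or a metric with no alert defined.
  echo none
}

alert_tier memory 90   # prints critical
alert_tier cpu 75      # prints warning
alert_tier disk 90     # prints none (threshold is strictly above 90)
```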

Dashboard Organization

Executive Dashboard (CloudWatch):

  • Cluster health summary

  • Cost metrics

  • SLA compliance

Engineering Dashboard (Grafana):

  • Service-level metrics (RED: Rate, Errors, Duration)

  • Resource utilization by namespace

  • Network performance

Troubleshooting Views (Grafana):

  • Per-pod metrics for debugging

  • Log correlation with metrics

  • Distributed tracing integration

Conclusion

Effective monitoring at scale requires combining AWS native tools with open-source solutions. CloudWatch Container Insights provides managed infrastructure and native AWS integration, while Prometheus offers flexibility and powerful analytics.

Godstime Chisom
