ECR/ECS Deployment Workflow
Complete guide to container deployment using Amazon ECR and ECS.
Our first ECS deployment looked perfect in the GitHub Actions logs — green checkmarks everywhere. Then we checked the running service: it was still serving the old image. The task definition had been registered, but nobody told the ECS service to actually use it. That was the first of many “it works but not really” moments on our path to reliable container deployments.
If you’re deploying Docker containers to AWS and want to understand the full pipeline — from docker build to zero-downtime rolling updates with automatic rollback — this guide walks through every step, including the gotchas that docs don’t warn you about.
The Problem
Deploying containerized applications to AWS requires coordinating multiple services (ECR for image storage, ECS for orchestration, Fargate for compute) with specific authentication flows, image tagging strategies, and deployment configurations. Without a clear end-to-end workflow, deployments are error-prone: images get pushed to wrong repos, task definitions reference stale images, rolling updates cause downtime, and failed deployments have no automatic rollback.
Difficulties Encountered
- ECR authentication is session-based — the Docker login token expires after 12 hours, causing CI/CD pipelines to fail with cryptic "no basic auth credentials" errors if it is not refreshed before each push
- Task definition versioning confusion — ECS creates a new revision on every `register-task-definition` call, but the service does not automatically pick up the latest revision; you must explicitly update the service with the new revision ARN
- Rolling update percentage math is unintuitive — `minimum_healthy_percent` and `maximum_percent` are relative to `desired_count`, not absolute numbers, so the actual task count during deployment depends on the combination of all three values
- Health check timing gaps — if the health check grace period is too short, ECS kills tasks that are still starting up (especially JVM or NestJS apps with slow cold starts), causing an infinite deployment loop
- Circuit breaker is not enabled by default — without `deployment_circuit_breaker`, a bad image causes ECS to endlessly retry launching failing tasks, burning Fargate costs until you manually intervene
When to Use
- Deploying Docker containers to AWS with managed orchestration
- Teams wanting AWS-native CI/CD without Kubernetes complexity
- Applications needing zero-downtime rolling deployments
- Projects already using Terraform for AWS infrastructure
When NOT to Use
- Single static site or Lambda function — ECS/Fargate is overkill; use S3 + CloudFront or Lambda directly
- Multi-cloud or cloud-agnostic requirement — ECR/ECS locks you into AWS; use Kubernetes (EKS or self-managed) instead
- Very short-lived batch jobs — Fargate has a minimum 1-minute billing granularity and cold start overhead; consider Lambda or Step Functions
- Local development workflows — use Docker Compose locally, not ECS; the feedback loop with ECR push/ECS deploy is too slow for iterative development
- Budget-constrained hobby projects — Fargate costs add up quickly; a single t3.micro EC2 with Docker is cheaper for low-traffic services
With those pitfalls in mind, let’s walk through the architecture and each step of the deployment pipeline.
Architecture Overview
At a high level, the deployment pipeline moves code from your local machine through three AWS services:
ECR (Elastic Container Registry)
The first stop in the pipeline is ECR — AWS’s managed Docker container registry. This is where your built images live before ECS pulls them down to run as containers.
Creating ECR Repository
```hcl
resource "aws_ecr_repository" "app" {
  name                 = "my-app"
  image_tag_mutability = "MUTABLE"

  image_scanning_configuration {
    scan_on_push = true # Security scanning
  }
}
```

`scan_on_push` features:
- Scans for known CVEs in OS packages
- Checks dependencies for vulnerabilities
- Results viewable in AWS Console or API
Push Workflow
```bash
# 1. Authenticate Docker to ECR
aws ecr get-login-password --region ap-northeast-2 |
  docker login --username AWS --password-stdin \
  ${ACCOUNT_ID}.dkr.ecr.ap-northeast-2.amazonaws.com

# 2. Build image
docker build -t my-app .

# 3. Tag for ECR
docker tag my-app:latest \
  ${ACCOUNT_ID}.dkr.ecr.ap-northeast-2.amazonaws.com/my-app:latest

# 4. Push to ECR
docker push \
  ${ACCOUNT_ID}.dkr.ecr.ap-northeast-2.amazonaws.com/my-app:latest
```

Once your image is in ECR, ECS takes over to orchestrate the deployment. The flow involves registering a new task definition and then telling the ECS service to use it.
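The registry URI that appears throughout these commands has a fixed anatomy. A tiny helper makes the structure explicit (illustrative only — the function is ours, not an AWS API):

```javascript
// Anatomy of an ECR image reference:
// <account-id>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
function ecrImageUri({ accountId, region, repository, tag }) {
  return `${accountId}.dkr.ecr.${region}.amazonaws.com/${repository}:${tag}`;
}

console.log(ecrImageUri({
  accountId: "123456789012",
  region: "ap-northeast-2",
  repository: "my-app",
  tag: "latest",
}));
// 123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/my-app:latest
```

If the `docker tag` target ever looks wrong in CI, comparing it against this template usually pinpoints which piece (account, region, or repository) is off.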
ECS Deployment Flow
Complete Deployment Pipeline
Here’s the full sequence from code push to running containers:
Manual Deployment Steps
```bash
# 1. Build and push image (see above)

# 2. Register new task definition
aws ecs register-task-definition \
  --cli-input-json file://task-definition.json

# 3. Update service with new task definition
aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --task-definition my-task:NEW_REVISION \
  --force-new-deployment
```

The manual steps above show the mechanics, but in production you want zero-downtime deployments. That's where rolling updates come in.
Rolling Updates
How Rolling Updates Work
ECS replaces tasks one by one to ensure zero downtime. The key idea: new tasks start and pass health checks before old tasks are drained and terminated:
```
Time     | Old v1.0 | New v2.0 | Total | Status
---------|----------|----------|-------|------------------
00:00    | 2        | 0        | 2     | Deploy starts
00:30    | 2        | 1        | 3     | New task starting
01:30    | 1        | 1        | 2     | First old removed
02:00    | 1        | 2        | 3     | Second new starting
03:00    | 0        | 2        | 2     | Complete
```

Key Phases of Each Task Replacement
- Starting (60-90 seconds) — Pull new Docker image from ECR, start container, initialize application
- Health Checks (30-60 seconds) — ALB health checks must pass, app must respond on the configured port, multiple successful checks required
- Draining (30-300 seconds) — Stop sending new requests to old task, allow existing requests to complete, graceful shutdown period
- Termination — Old task fully stopped, resources released, new task fully operational
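Adding up the worst-case phase estimates above gives a feel for how long a rollout can take. A back-of-the-envelope calculation (our arithmetic, using the upper ends of the ranges listed):

```javascript
// Worst-case seconds per task replacement, from the phase estimates above
const phases = { starting: 90, healthChecks: 60, draining: 300 };
const perTaskSeconds = phases.starting + phases.healthChecks + phases.draining;

// If replacements run fully serially for a 3-task service:
const totalSeconds = perTaskSeconds * 3;
console.log(perTaskSeconds, totalSeconds); // 450 1350
```

That's roughly 22 minutes as an upper bound. In practice ECS overlaps replacements when `maximum_percent` allows headroom, so real deployments usually finish faster.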
Deployment Configuration
```hcl
resource "aws_ecs_service" "app" {
  name            = "my-service"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = 2

  # Deployment behavior
  deployment_minimum_healthy_percent = 100 # Never below desired
  deployment_maximum_percent         = 200 # Can double temporarily

  # Circuit breaker for automatic rollback
  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }
}
```

Percentage meaning (with `desired_count = 3`):

- `minimum_healthy_percent = 100`: always keep at least 3 tasks running
- `maximum_percent = 200`: up to 6 tasks may run during deployment
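The same arithmetic in executable form — a sketch (the function name and exact rounding are our assumptions; AWS defines both limits as percentages of the desired count):

```javascript
// Task count bounds during a rolling deployment.
// Both percentages are relative to desiredCount, not absolute numbers.
function deploymentBounds(desiredCount, minHealthyPercent, maxPercent) {
  return {
    // ECS keeps running tasks at or above this floor...
    minRunning: Math.ceil((desiredCount * minHealthyPercent) / 100),
    // ...and at or below this ceiling during the rollout
    maxRunning: Math.floor((desiredCount * maxPercent) / 100),
  };
}

console.log(deploymentBounds(3, 100, 200)); // { minRunning: 3, maxRunning: 6 }
console.log(deploymentBounds(4, 50, 150)); // { minRunning: 2, maxRunning: 6 }
```

The second call shows why the math is unintuitive: dropping the minimum to 50% lets ECS stop half the old tasks before new ones are healthy — fast, but risky.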
Deployment Strategies
| Strategy | Min % | Max % | Speed | Risk | Use Case |
|---|---|---|---|---|---|
| Conservative | 100 | 150 | Slow | Low | Production |
| Balanced | 100 | 200 | Medium | Low | Most apps |
| Aggressive | 50 | 200 | Fast | Medium | Staging |
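To see why the table labels the Conservative strategy slow, here is a toy batch model of a rolling update — not ECS's actual scheduler, just the min/max constraints applied greedily:

```javascript
// Replace old tasks with new ones while keeping the running total
// within the [minRunning, maxRunning] bounds. Returns one entry per batch.
function rollingUpdateSteps(desired, minHealthyPct, maxPct) {
  const minRunning = Math.ceil((desired * minHealthyPct) / 100);
  const maxRunning = Math.floor((desired * maxPct) / 100);
  let oldCount = desired;
  let newCount = 0;
  const steps = [];
  while (oldCount > 0) {
    // Start as many new tasks as the ceiling allows
    const toStart = Math.min(maxRunning - (oldCount + newCount), desired - newCount);
    newCount += Math.max(0, toStart);
    // Stop old tasks only while staying at or above the healthy floor
    const toStop = Math.min(oldCount, oldCount + newCount - minRunning);
    if (toStart <= 0 && toStop <= 0) throw new Error("no progress possible");
    oldCount -= Math.max(0, toStop);
    steps.push({ old: oldCount, new: newCount });
  }
  return steps;
}

// Balanced (100/200) swaps a 2-task service in one batch;
// Conservative (100/150) needs two batches, one task at a time.
console.log(rollingUpdateSteps(2, 100, 200).length); // 1
console.log(rollingUpdateSteps(2, 100, 150).length); // 2
```

The model also shows the degenerate case: with min 100% and max 100% there is no headroom at all, and the deployment can never make progress.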
Circuit Breaker and Automatic Rollback
When the circuit breaker is enabled, ECS detects failed deployments and automatically rolls back to the last stable version. Without `deployment_circuit_breaker`, a bad image causes ECS to endlessly retry launching failing tasks, burning Fargate costs until you manually intervene. Always enable it for production services:
```hcl
deployment_circuit_breaker {
  enable   = true
  rollback = true
}
```

One question that comes up frequently: what happens if auto-scaling kicks in during a deployment? The good news is that ECS handles this gracefully.
Deployment with Auto-Scaling
Auto-scaling continues working during deployments:
Key behaviors:
- If auto-scaling adds tasks during deployment, new tasks get latest version
- If auto-scaling removes tasks, ECS prioritizes removing old version tasks
- Scale state is preserved after deployment
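The second behavior — preferring old-version tasks for removal — can be modeled as a simple sort (a toy sketch, not ECS's real placement logic):

```javascript
// When scaling in during a deployment, pick old-version tasks to stop first.
function chooseTasksToStop(tasks, count) {
  return [...tasks]
    .sort((a, b) => (a.version === b.version ? 0 : a.version === "old" ? -1 : 1))
    .slice(0, count);
}

const tasks = [
  { id: "t1", version: "new" },
  { id: "t2", version: "old" },
  { id: "t3", version: "new" },
];
console.log(chooseTasksToStop(tasks, 1).map((t) => t.id)); // [ 't2' ]
```

This is why a scale-in event mid-deployment effectively accelerates the rollout: the tasks that disappear are the ones the deployment was going to replace anyway.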
Auto-Scaling Interaction Scenarios
| Scenario | What Happens | Result |
|---|---|---|
| CPU spike during deploy | Auto-scaling adds tasks | Deployment updates ALL tasks including new ones |
| CPU drop during deploy | Auto-scaling removes tasks | Deployment continues with fewer tasks |
| Memory issue during deploy | Auto-scaling triggers | Both new and old tasks can scale |
| Deploy fails | Tasks remain at current version | Auto-scaling continues normally |
Deployment with CPU Spike (Scales from 2 to 3)
```
Time     | Old v1.0 | New v2.0 | Total | Event
---------|----------|----------|-------|------------------
00:00    | 2        | 0        | 2     | Deployment starts
00:30    | 2        | 1        | 3     | New task starting
01:00    | 2        | 1        | 3     | CPU SPIKE -- auto-scale triggered
01:30    | 2        | 2        | 4     | Scale-out adds v2.0 task
02:00    | 1        | 2        | 3     | Remove one old task
02:30    | 1        | 3        | 4     | Add final new task
03:00    | 0        | 3        | 3     | All tasks now v2.0
```

Resolving Auto-Scaling Conflicts
If auto-scaling fights with deployment, temporarily suspend scaling:
```bash
# Suspend auto-scaling during deployment
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/my-cluster/my-service \
  --scalable-dimension ecs:service:DesiredCount \
  --suspended-state \
    '{"DynamicScalingInSuspended": true, "DynamicScalingOutSuspended": true}'

# Re-enable after deployment completes
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/my-cluster/my-service \
  --scalable-dimension ecs:service:DesiredCount \
  --suspended-state \
    '{"DynamicScalingInSuspended": false, "DynamicScalingOutSuspended": false}'
```

With the deployment mechanics understood, let's automate the entire pipeline with GitHub Actions.
GitHub Actions Workflow
The following workflow builds a Docker image, pushes it to ECR, updates the task definition, and deploys to ECS — all triggered by a push to main:
```yaml
name: Deploy to ECS

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ap-northeast-2

      - name: Login to ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v2

      - name: Build and push image
        env:
          ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
          IMAGE_TAG: ${{ github.sha }}
        run: |
          docker build -t $ECR_REGISTRY/my-app:$IMAGE_TAG .
          docker push $ECR_REGISTRY/my-app:$IMAGE_TAG
          docker tag $ECR_REGISTRY/my-app:$IMAGE_TAG $ECR_REGISTRY/my-app:latest
          docker push $ECR_REGISTRY/my-app:latest

      - name: Update ECS task definition
        id: task-def
        uses: aws-actions/amazon-ecs-render-task-definition@v1
        with:
          task-definition: task-definition.json
          container-name: my-app
          image: ${{ steps.login-ecr.outputs.registry }}/my-app:${{ github.sha }}

      - name: Deploy to ECS
        uses: aws-actions/amazon-ecs-deploy-task-definition@v2
        with:
          task-definition: ${{ steps.task-def.outputs.task-definition }}
          service: my-service
          cluster: my-cluster
          wait-for-service-stability: true
```

With the pipeline automated, here are the practices that keep deployments reliable over time.
Best Practices
Image Tagging
Consistent tagging makes it possible to trace a running container back to its source code and roll back to any previous version:
Recommended tags:

- Git SHA: `my-app:abc123def` (unique, traceable)
- Environment: `my-app:prod-latest` (current production)
- Semantic: `my-app:v1.2.3` (releases)

ECR Lifecycle Policy
Without cleanup, ECR accumulates images indefinitely — each push adds a new one. Lifecycle policies automatically expire old images to keep storage costs under control:
```json
{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Keep last 30 images",
      "selection": {
        "tagStatus": "any",
        "countType": "imageCountMoreThan",
        "countNumber": 30
      },
      "action": {
        "type": "expire"
      }
    }
  ]
}
```

Health Checks
Health checks are the mechanism that prevents bad deployments from receiving traffic. The ALB checks each task’s health endpoint before routing requests to it:
```hcl
resource "aws_lb_target_group" "app" {
  # ...

  health_check {
    path                = "/health"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 5
    interval            = 30
    matcher             = "200"
  }
}
```

Graceful Shutdown
ECS sends SIGTERM before terminating tasks during rolling updates. The
application must handle this signal to avoid dropping in-flight requests:
```javascript
// Graceful shutdown handling (Node.js)
process.on("SIGTERM", () => {
  console.log("SIGTERM received, starting graceful shutdown");

  // Stop accepting new requests; exit once in-flight requests finish
  server.close(async () => {
    console.log("HTTP server closed");
    await database.close(); // Close database connections
    process.exit(0);
  });

  // Hard deadline: force exit if requests don't finish within 30 seconds
  const timer = setTimeout(() => process.exit(1), 30000);
  timer.unref(); // don't let the timer itself keep the process alive
});

// Health check endpoint (include version for deployment tracking)
app.get("/health", (req, res) => {
  res.status(200).json({
    status: "healthy",
    version: process.env.APP_VERSION,
  });
});
```

Deployment Monitoring
Monitor these signals during rolling updates to detect anomalies early:
```bash
# Watch deployment progress in real time
watch -n 5 'aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query "services[0].deployments"'

# Check recent deployment events
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query 'services[0].events[0:5]'
```

Troubleshooting
Deployment Stuck
```bash
# Check task stopped reason
aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks $(aws ecs list-tasks --cluster my-cluster --query 'taskArns[0]' --output text) \
  --query 'tasks[0].stoppedReason'

# Check service events
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query 'services[0].events[:5]'
```

Common Issues
| Issue | Cause | Solution |
|---|---|---|
| Tasks fail health check | App not ready | Increase health check grace period |
| Out of memory | Container needs more RAM | Increase task memory |
| No IP available | Subnet full | Use larger subnet or multiple AZs |
| Image pull failed | ECR auth expired | Refresh ECR token |
| Slow deployments | Conservative min/max % | Increase max % or decrease min % |
| No automatic rollback | Circuit breaker not set | Enable deployment_circuit_breaker |
| Auto-scaling conflicts | Scaling fights deployment | Temporarily suspend auto-scaling |
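Many of the causes in the table surface as substrings of the task's `stoppedReason`. A rough classifier sketch (the exact reason strings vary across ECS platform versions — treat the patterns as assumptions):

```javascript
// Map an ECS stoppedReason string to a likely cause from the table above.
function classifyStoppedReason(reason) {
  const patterns = [
    [/CannotPullContainer/i, "Image pull failed -- refresh ECR auth, check repo/tag"],
    [/OutOfMemory/i, "Out of memory -- increase task memory"],
    [/health check/i, "Failed health checks -- increase grace period"],
  ];
  for (const [re, advice] of patterns) {
    if (re.test(reason)) return advice;
  }
  return "Unknown -- inspect service events";
}

console.log(classifyStoppedReason("OutOfMemoryError: Container killed due to memory usage"));
// Out of memory -- increase task memory
```

Wiring something like this into a deployment script turns a cryptic `stoppedReason` into an actionable next step instead of a trip to the console.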
Practical Takeaways
ECR/ECS deployment is a coordination problem more than a technical one. Each piece (image registry, task definitions, service updates, health checks) works fine individually — the challenge is making them work together reliably. Here’s what matters most:
**Always enable the circuit breaker.** Without `deployment_circuit_breaker`, a bad image causes ECS to endlessly retry launching failing tasks, burning Fargate costs until you manually intervene. This is a tiny Terraform addition that saves you from 3 AM pages.

**Tag images with git SHAs, not just `latest`.** The `:latest` tag is convenient but makes rollbacks painful because you can't tell which version is running. Git SHA tags (`my-app:abc123def`) give you instant traceability from running task to source commit.

**Set health check grace periods generously.** If your application takes 30 seconds to start (common for JVM or NestJS apps), a 10-second grace period creates an infinite deployment loop: ECS launches a task, kills it before it's ready, launches another, kills it again. Set the grace period to at least 2x your worst-case startup time.

**Handle SIGTERM in your application.** ECS sends SIGTERM before terminating tasks during rolling updates. If your app doesn't handle this signal, in-flight requests get dropped. The Node.js graceful shutdown pattern above takes a handful of lines and prevents dropped requests during deployments.
The GitHub Actions workflow in this post is a production-ready starting point. Clone it, update the cluster/service names, and you have zero-downtime deployments with automatic rollback on failure.