Terraform State Recovery
Procedures for recovering from Terraform state drift when the state file doesn't match what exists in your cloud provider.
I ran terraform plan and it wanted to destroy an actively-used RDS cluster. The state file had drifted from AWS reality — resources existed in AWS that Terraform didn’t know about, and Terraform’s view of existing resources was outdated. Instead of panicking and running apply, I needed a systematic recovery process.
Terraform state drift happens when the state file doesn’t match what actually exists in your cloud provider. This can occur from manual console changes, failed applies, or state file corruption. The recovery process is methodical: backup first, assess the damage, import missing resources, and fix configuration drift.
Recognizing State Drift
These symptoms in terraform plan output indicate drift:
- Resources marked for destroy that are actively used in production
- Resources marked for create that already exist in AWS
- Unexpected instance replacements (destroy + recreate)
- Changes that nobody made appearing in the plan
If terraform plan shows any of these, do NOT run terraform apply. Diagnose first.
Recovery: Phase 1 — Assessment
The first rule is: always backup the current state before touching anything.
```shell
cp terraform.tfstate terraform.tfstate.backup-$(date +%Y%m%d)
```

If you use a remote backend, pull a local copy instead: terraform state pull > terraform.tfstate.backup.

Next, run a refresh to sync Terraform's view with AWS reality. This updates the state file with the current state of resources Terraform already tracks:
```shell
terraform refresh
```

(On Terraform 0.15.4 and later, terraform apply -refresh-only does the same thing with a reviewable diff; the standalone refresh command is deprecated.) Then analyze what drift remains:
```shell
terraform plan -out=drift-analysis.tfplan
```

Review this plan carefully. Categorize each change: is Terraform trying to create something that exists? Destroy something that's running? Modify something that was changed manually?
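Categorizing by hand gets tedious on large plans. You can export the plan as JSON with terraform show -json drift-analysis.tfplan and group the changes by action; a minimal sketch (the triage helper name is illustrative, but resource_changes and change.actions are the real fields of Terraform's plan JSON format):

```python
import json
from collections import defaultdict

def triage(plan: dict) -> dict:
    """Group resource changes from `terraform show -json` output by action."""
    groups = defaultdict(list)
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if actions == ["no-op"]:  # unchanged resources are noise here
            continue
        groups["+".join(actions)].append(rc["address"])
    return dict(groups)

# Trimmed-down example plan document:
sample = {
    "resource_changes": [
        {"address": "aws_rds_cluster.main", "change": {"actions": ["create"]}},
        {"address": "aws_instance.main", "change": {"actions": ["delete", "create"]}},
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["no-op"]}},
    ]
}
print(triage(sample))
# {'create': ['aws_rds_cluster.main'], 'delete+create': ['aws_instance.main']}
```

Resources grouped under create are your import candidates (in config but not in state); anything under delete or delete+create needs a close look before you go near apply.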
Recovery: Phase 2 — Import Missing Resources
For resources that exist in AWS but aren’t in Terraform state (Terraform wants to create them when they already exist), use terraform import:
```shell
# RDS cluster
terraform import aws_rds_cluster.main moba-rds-prod-cluster

# RDS instance
terraform import aws_rds_cluster_instance.main moba-rds-prod

# EC2 instance
terraform import aws_instance.main i-0123456789abcdef0
```

Each import command tells Terraform "this resource in my configuration corresponds to this existing resource in AWS." After importing, Terraform tracks the resource without trying to recreate it.
Recovery: Phase 3 — Fix Configuration Drift
After importing, terraform plan may still show changes because your .tf configuration doesn’t match the actual resource attributes. Common issues and fixes:
| Issue | Fix |
|---|---|
| AMI change forces instance replacement | Pin the AMI, or ignore_changes on ami |
| Security group drift on every plan | Use vpc_security_group_ids instead of security_groups |
| Task definition diff after every CI/CD deploy | ignore_changes on task_definition |
Pin AMI to Prevent Replacement
If Terraform wants to replace an EC2 instance because the AMI changed:
```hcl
resource "aws_instance" "main" {
  ami           = "ami-03205447c85f5199b" # Pin to current
  instance_type = "t3.medium"

  lifecycle {
    ignore_changes = [ami] # Or pin and manage manually
  }
}
```

The lifecycle.ignore_changes block tells Terraform to skip this attribute during planning. Use it for attributes managed outside of Terraform (like AMIs updated by a separate patching process).
ECS Task Definition Lifecycle
When CI/CD manages task definitions separately from Terraform:
```hcl
resource "aws_ecs_service" "main" {
  name            = "api"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.main.arn

  lifecycle {
    ignore_changes = [task_definition]
  }
}
```

Without this, every terraform plan shows a diff because CI/CD has updated the task definition since Terraform last applied.
VPC Security Groups
A common source of drift is using the wrong security group attribute for VPC instances:
```hcl
# ❌ Wrong for VPC instances
resource "aws_instance" "main" {
  security_groups = [aws_security_group.main.name]
}

# ✅ Correct for VPC instances
resource "aws_instance" "main" {
  vpc_security_group_ids = [aws_security_group.main.id]
}
```

Using security_groups (by name) instead of vpc_security_group_ids (by ID) causes Terraform to detect drift on every plan because the API returns IDs, not names.
Prevention: Remote State Backend
Most state drift can be prevented by using a remote state backend. This centralizes the state file and adds locking to prevent concurrent modifications:
```hcl
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}
```

The DynamoDB table provides state locking — preventing two people from running terraform apply simultaneously:
```hcl
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

Key Lessons
- Always backup state first — Before any recovery operation, copy the state file
- Import before manage — Don’t recreate existing resources; import them into state
- Use lifecycle blocks — For resources managed by CI/CD or external processes
- Plan extensively — Run terraform plan multiple times during recovery; never apply without reviewing
- Document each step — For future reference and auditing
- Set up remote state — Prevents most drift issues by centralizing state management
Takeaway
Terraform state recovery follows a predictable pattern: backup, refresh, import missing resources, fix configuration drift, and verify with plan. The key is to never run apply before understanding every change in the plan. Set up remote state with locking from day one to prevent most drift scenarios. When drift does happen, the systematic approach (assess → import → fix → verify) gets you back to a clean state without destroying production resources.