Docker on EC2 with Terraform | diary of a lazy developer

Architecture

Why this project

I was preparing a workshop and needed to expose a url with a specific interface, sparing participants from installing docker or anything else on their machines.

I built the workshop locally with docker compose, which is one of the ways to develop and test locally: it works, it’s fast, it’s reproducible. And then?

Then you need to move everything to the cloud. And as a lazy developer, why not use that same docker compose?

The point isn’t running Docker in the cloud - it’s everything around it: HTTPS, custom domain, machine access, data backups, and the ability to rebuild or tear it all down with one command.

With IaC you can manage HTTPS, custom domain, backups, access and cleanup smoothly: everything in one place, versioned, reproducible. Without IaC, you start from scratch every time.

The usual options:

Manual EC2 setup: SSH in, install Docker, configure nginx, certbot, and pray. Slow, fragile, and hard to reproduce.
ECS/Fargate: task definition, service discovery, cluster .. for what ? Using Fargate for a single container is like hiring a moving truck to carry your groceries home.
Docker on EC2 with Terraform: one terraform apply to spin up, one bash scripts/destroy.sh to tear down. Backups included.

The third option is what I chose because it has the simplest architecture .. and the most complex part depends on your user data !

The architecture in the image above is generated directly from the Terraform code (spoiler) in the repo, where you can find the README.md and all the details to use it.

But let’s take it step by step. The third option can be implemented in 1024 different ways: which IaC tool ? How do you handle HTTPS ? How do you access the machine ? Where do you store backups ? How do you manage DNS ? Which AMI ? It depends. The point is asking the right questions.

As a lazy developer, every choice follows one criterion: less effort, in terms of time, cost, or both. And when less effort isn’t enough to decide, the cleanest path is a minimal system: you know what’s there, you know what’s missing, no surprises.

Why Terraform and not CDK

	Terraform	CDK
Language	HCL: declarative, simple	TypeScript/Python: powerful but verbose for simple infra
State	Local file, zero dependencies	Requires CloudFormation stack, S3 bucket for assets
Bootstrap	`terraform init`	`cdk bootstrap` already creates resources in your AWS account
Learning curve	Low for simple infra	Need to know both CDK and CloudFormation .. and their quirks
Destruction	`terraform destroy`: clean, predictable	`cdk destroy`, which sometimes leaves orphaned resources

For an ephemeral workshop run by one person, Terraform with local state is the minimum effort. CDK makes sense when the infra grows, you need complex logic, or there’s a team involved.

The choices and why

Choice	Why (less effort)	The discarded alternative (more effort)
ALB + ACM	Free HTTPS certificate, auto-renewal, no certbot/nginx	Let’s Encrypt on EC2: port 80 open, cron for renewal, more moving parts
SSM instead of SSH	No keys, no port 22, audit trail on CloudTrail	SSH key pair, SG rules, bastion if private subnet
S3 for backups	Costs nothing, survives the EC2, simple CLI	EBS snapshot: tied to instance lifecycle, harder to restore
Route 53 hosted zone	DNS validation for ACM, alias record for ALB, all managed by Terraform	External DNS only: manual certificate validation or HTTP challenge
Amazon Linux 2023 minimal	Clean AMI, you install only what you need	AL2023 standard: doesn’t have Docker anyway, but has hundreds of extra packages you don’t need
`docker compose up --build`	Works with both `build` and `image`	Separate logic for build vs pull: pointless complexity
Local state	The workshop is ephemeral, one operator, no team	Remote state (S3 + DynamoDB): cost and setup for zero benefit
Conditional VPC	Three modes: use an existing VPC, find the default, or create a new one	Always new VPC: waste for a workshop running in the default VPC
Conditional S3 bucket	Pass one and it uses it. Don’t, and it creates one named after the domain	Always new bucket: waste for someone running many workshops and just managing backups

What I learned (the hard way)

The right AMI and how much disk

As a lazy developer, instead of reading the documentation, one command to see what’s out there:

aws ec2 describe-images \
  --filters "Name=name,Values=al2023-ami-*-x86_64" \
  --owners amazon \
  --query 'reverse(sort_by(Images, &CreationDate))[:10].[Name, BlockDeviceMappings[0].Ebs.VolumeSize]' \
  --output table

Three variants: minimal (2 GB), standard (8 GB), ECS-optimized (30 GB). The ECS one comes with Docker but is meant to run in an ECS cluster, not on a standalone EC2. Standard and minimal don’t have Docker: you need to install it either way.

At that point, what does the standard have that minimal doesn’t ? SSM agent and a few hundred packages you don’t need. The package comparison page confirms it: no Docker, no buildx, nothing that changes the picture.

Minimal is the cleanest choice: install Docker, SSM agent and buildx in the user data, and you know exactly what’s on the machine. One thing to watch: the 2 GB disk isn’t enough, set volume_size = 20 and move on.

ssm-user is not root

When you connect with aws ssm start-session, you’re ssm-user. You don’t have access to the Docker socket. Everything needs sudo. Commands sent with aws ssm send-command run as root though, so sudo is built in.

buildx: no buildx, no build

From Docker Compose v2.17+ the --build flag requires buildx >= 0.17.0. The minimal AMI doesn’t have it. Without buildx, docker compose up --build fails even if no service uses build: install it in the user data and forget about it.

That damn cache

After a destroy + redeploy, the new Route 53 hosted zone gets different nameservers. You update the NS records on the DNS provider, everything looks fine. But the browser says no.

dig @8.8.8.8 tells you it’s all good. But your local resolver disagrees.

What happens: your ISP’s resolver has the old SERVFAIL cached, and until it expires, that domain doesn’t exist as far as it’s concerned.

The fix: temporarily switch your local DNS to Google (8.8.8.8) and wait for your provider’s cache to expire: they say 5-10 minutes, but sometimes (way) longer.

Anything else to add ?

When it’s not a workshop of a few hours but something that lasts weeks or months, it’s worth investing extra effort to make the system hold up over time. But remember, it’s always a temporary solution !

More subdomains: more applications on the same ALB, with routing rules, separate target groups, and potentially more containers on the same EC2 or, if needed, dedicated EC2s per service
Tactical scheduling: start/stop the EC2 to save money off-hours, periodic backups with EventBridge + SSM, not just at destroy
CloudWatch alarms: basic monitoring (CPU, disk, health check) with SNS notifications
Auto-recovery: ASG with min=max=1 to replace dying instances (user data restores everything from S3)
Spot instances: for workshops that tolerate interruptions, ~70% cost reduction