Last week was about paying less for compute you actually use. This week is about not running compute you don't use at all. In a multi-account environment, idle and oversized resources are the single biggest source of recurring waste — and the easiest place to find quick wins.
The sprawl problem
In any organization with more than a handful of AWS accounts, resource sprawl is inevitable:
- An engineer spins up an `m5.4xlarge` for a benchmark and never tears it down.
- A dev RDS instance runs 24/7 even though the team only touches it 8 hours a day.
- An instance is terminated, but its data volume (DeleteOnTermination=false) survives unattached and keeps billing at $0.10/GB-month forever.
- A NAT Gateway gets created in dev because someone copy-pasted the prod Terraform module.
None of these is a disaster. But across multiple accounts they compound. By the time someone notices, the bill is "just what it is".
Where to look first
Almost every multi-account environment hides its waste under the same six rocks:
| Resource Type | Common Waste Pattern |
|---|---|
| EC2 instances | Oversized; consistent <10% CPU and low memory pressure |
| RDS instances | Dev/test databases running 24/7 in non-prod accounts |
| EBS volumes | Unattached volumes left after instance termination |
| Elastic IPs | Allocated but not associated with a running resource |
| NAT Gateways | Over-engineered into every dev/test VPC |
| Load balancers | Provisioned for retired applications |
If you do nothing else, run a single multi-account sweep against this list every month.
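A monthly sweep can start small. The sketch below assumes boto3-style EC2 clients and a hypothetical `client_for(account, region)` factory that you supply, and it covers only two rows of the table: unattached EBS volumes and unassociated Elastic IPs. The other rows need CloudWatch metrics or load balancer APIs and are left out.

```python
"""Minimal monthly waste sweep across accounts (sketch).

Assumes a hypothetical client_for(account_id, region) callable that returns an
EC2 client for that account and region. Only two checks from the table are shown.
"""
from typing import Callable, Dict, Iterable, List, Tuple


def unattached_volumes(ec2) -> List[str]:
    """IDs of volumes in the 'available' state, i.e. attached to nothing."""
    ids, token = [], None
    while True:
        kwargs = {"Filters": [{"Name": "status", "Values": ["available"]}]}
        if token:
            kwargs["NextToken"] = token
        page = ec2.describe_volumes(**kwargs)
        ids += [v["VolumeId"] for v in page.get("Volumes", [])]
        token = page.get("NextToken")
        if not token:
            return ids


def unassociated_eips(ec2) -> List[str]:
    """Allocation IDs of Elastic IPs that have no association."""
    return [
        a.get("AllocationId", a.get("PublicIp"))
        for a in ec2.describe_addresses().get("Addresses", [])
        if "AssociationId" not in a
    ]


def sweep(accounts: Iterable[str], regions: Iterable[str],
          client_for: Callable[[str, str], object]) -> Dict[Tuple[str, str], Dict[str, List[str]]]:
    """Return {(account, region): {"volumes": [...], "eips": [...]}} for places with hits."""
    regions = list(regions)  # reused for every account
    report = {}
    for account in accounts:
        for region in regions:
            ec2 = client_for(account, region)
            hits = {"volumes": unattached_volumes(ec2), "eips": unassociated_eips(ec2)}
            if hits["volumes"] or hits["eips"]:
                report[(account, region)] = hits
    return report
```

In practice `client_for` would call `sts.assume_role` against a read-only role in each member account and pass the temporary credentials to `boto3.client("ec2", region_name=region, ...)`; the role name is yours to choose.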
The tools that earn their keep
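- AWS Compute Optimizer. Free at the default tier, organization-aware, and surprisingly accurate. By default it analyzes CloudWatch metrics over a 14-day lookback window and suggests rightsized configurations for EC2 instances, Auto Scaling groups, EBS volumes, Lambda functions, and ECS services on Fargate.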
- Cost Explorer Rightsizing Recommendations. Similar analysis for EC2, surfaced directly in the billing console and easier to share with non-engineers.
- AWS Trusted Advisor. The Cost Optimization checks flag idle load balancers, idle RDS instances, and unassociated Elastic IPs. Full check coverage requires a Business, Enterprise On-Ramp, or Enterprise support plan.
- Cloud Custodian. Comes into play in week 5 — but it's the long-term home for "find and act on waste" policies, because it can run identically across every account.
Scheduled shutdowns: the highest-ROI lever in dev/test
Production needs to run 24/7. Dev and test environments almost never do. If your dev EC2/RDS fleet runs only during business hours (say, 7am–7pm Monday–Friday), you've cut its hours from 168/week to 60/week — about a 64% reduction in compute spend on those resources, with no architectural changes.
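The 64% figure is simple arithmetic. Here is a small sketch for checking other schedules (the function name is illustrative):

```python
HOURS_PER_WEEK = 24 * 7  # 168 hours for an always-on resource


def scheduled_reduction(start_hour: int, end_hour: int, days_per_week: int) -> float:
    """Fraction of always-on hours removed by running only start..end on N days a week."""
    on_hours = (end_hour - start_hour) * days_per_week
    return 1 - on_hours / HOURS_PER_WEEK


# 7am-7pm, Monday-Friday: 12 h x 5 days = 60 h out of 168
print(f"{scheduled_reduction(7, 19, 5):.1%}")  # 64.3%
```

This counts instance-hours only. EBS volumes and RDS storage keep billing while the compute is stopped, so the saving applies to the compute line, not the whole resource.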
Two ways to implement it:
- AWS Instance Scheduler. AWS-published solution; supports EC2 and RDS, tag-driven, multi-account out of the box.
- Custom EventBridge + Lambda. A 50-line Python function plus two scheduled rules. More flexible, slightly more to maintain.
Either way, drive it from a tag like Schedule=office-hours rather than hardcoded resource lists. New resources get scheduled automatically; opt-out is explicit.
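For the custom route, here is a minimal sketch of the Lambda side. It is not the Instance Scheduler implementation. It assumes two EventBridge scheduled rules that pass a constant input of `{"action": "stop"}` or `{"action": "start"}`, and it handles EC2 only; the extra `ec2` parameter exists so the logic can be exercised with a fake client.

```python
"""Tag-driven office-hours scheduler for EC2 (sketch).

Assumption: EventBridge rules invoke this with {"action": "stop"} or
{"action": "start"}; instances opt in with the tag Schedule=office-hours.
"""
TAG_KEY, TAG_VALUE = "Schedule", "office-hours"
# Each action only touches instances currently in the opposite state.
SOURCE_STATE = {"stop": "running", "start": "stopped"}


def tagged_instance_ids(ec2, state: str) -> list:
    """IDs of instances carrying the schedule tag and currently in `state`."""
    filters = [
        {"Name": f"tag:{TAG_KEY}", "Values": [TAG_VALUE]},
        {"Name": "instance-state-name", "Values": [state]},
    ]
    ids, token = [], None
    while True:
        kwargs = {"Filters": filters}
        if token:
            kwargs["NextToken"] = token
        page = ec2.describe_instances(**kwargs)
        for reservation in page.get("Reservations", []):
            ids += [i["InstanceId"] for i in reservation.get("Instances", [])]
        token = page.get("NextToken")
        if not token:
            return ids


def lambda_handler(event, context, ec2=None):
    action = event.get("action")
    if action not in SOURCE_STATE:
        raise ValueError(f"unknown action: {action!r}")
    if ec2 is None:
        import boto3  # bundled with the Lambda Python runtime
        ec2 = boto3.client("ec2")
    ids = tagged_instance_ids(ec2, SOURCE_STATE[action])
    if ids:
        if action == "stop":
            ec2.stop_instances(InstanceIds=ids)
        else:
            ec2.start_instances(InstanceIds=ids)
    return {"action": action, "instance_ids": ids}
```

The two rules could use schedule expressions such as `cron(0 7 ? * MON-FRI *)` and `cron(0 19 ? * MON-FRI *)`. EventBridge rules evaluate cron in UTC, so shift the hours for your time zone (EventBridge Scheduler accepts a time zone directly). RDS follows the same pattern with `stop_db_instance` and `start_db_instance`; a stopped RDS instance is restarted automatically after seven days, which a nightly schedule never reaches.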
EBS cleanup automation
Unattached EBS volumes are quiet money. They're easy to find (`describe-volumes` filtered on `status=available`) and easy to delete, but the right answer is graceful cleanup, not `terraform destroy`:
- Tag every unattached volume with `cleanup_after` set to today + 7 days.
- If the volume is still unattached after 7 days, snapshot it (cheap insurance) and delete it.
- Keep snapshots for 30 days. The number of times someone says "wait, I needed that" is non-zero, and snapshots make the conversation cheap.
This is exactly the kind of policy Cloud Custodian was built for; we'll wire it up next week.
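Until then, the decision half of that policy fits in a few lines of Python. This sketch assumes volume dicts shaped like `describe_volumes()` output (`VolumeId`, `State`, `Tags`) and returns a plan instead of acting, so it can be reviewed or dry-run. One addition beyond the policy above: a volume that has been reattached gets its `cleanup_after` tag removed, so a later detach restarts the 7-day clock.

```python
from datetime import date, timedelta

TAG = "cleanup_after"        # tag key from the policy above
GRACE = timedelta(days=7)    # grace period before snapshot-and-delete


def plan_cleanup(volumes, today: date):
    """Return (action, volume_id, detail) tuples; nothing is changed here."""
    plan = []
    for vol in volumes:
        tags = {t["Key"]: t["Value"] for t in vol.get("Tags", [])}
        state = vol.get("State")
        if state == "in-use":
            if TAG in tags:
                # Reattached since it was marked: clear the mark so the clock restarts.
                plan.append(("untag", vol["VolumeId"], None))
        elif state == "available":
            if TAG not in tags:
                # First sighting: mark it with a deadline.
                plan.append(("tag", vol["VolumeId"], (today + GRACE).isoformat()))
            elif date.fromisoformat(tags[TAG]) <= today:
                # Still unattached on or after its deadline.
                plan.append(("snapshot_and_delete", vol["VolumeId"], None))
        # Other states (creating, deleting, error, ...) are skipped.
    return plan
```

An executor would then call `create_tags`, `delete_tags`, `create_snapshot` and `delete_volume`; waiting for the snapshot to complete (boto3's `snapshot_completed` waiter) before deleting the volume keeps the insurance real.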
Rightsizing without breaking trust
Rightsizing is where FinOps programs lose their reputation if they're not careful. The failure mode is: cloud team unilaterally resizes an EC2 instance, the workload's p99 latency tanks, the owning team finds out from a customer ticket. Now nobody on that team trusts you with their infrastructure again.
Avoid that with a simple workflow:
- Generate Compute Optimizer recommendations on a monthly cadence.
- Filter to "high confidence" recommendations only.
- Send each team a short report of their recommendations, with estimated monthly savings and a link to the source data.
- The owning team makes the change. Your job is to make it easy and visible, not to make the change for them.
The change rate is lower this way — but the changes that do happen don't blow up.
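Steps 1 to 3 can be scripted. The sketch below pages through Compute Optimizer's `get_ec2_instance_recommendations` and groups the results by owning team. The account-to-team map is hypothetical, and so is the reading of "high confidence": here it means an `Overprovisioned` finding whose top-ranked option has a `performanceRisk` at or below a threshold you pick. Field names follow the Compute Optimizer API as we understand it (including one account ID per request); verify them against the current API reference before relying on this.

```python
"""Per-team rightsizing report from Compute Optimizer (sketch).

ACCOUNT_TEAMS and MAX_RISK are hypothetical; "high confidence" is approximated
as an Overprovisioned finding with low performance risk on the top option.
"""
from collections import defaultdict

ACCOUNT_TEAMS = {"111111111111": "payments", "222222222222": "search"}  # hypothetical
MAX_RISK = 1.0  # hypothetical threshold on the 0-4 performanceRisk scale


def fetch_recommendations(co, account_id=None):
    """Page through EC2 recommendations, optionally for one member account."""
    recs, token = [], None
    while True:
        kwargs = {"accountIds": [account_id]} if account_id else {}
        if token:
            kwargs["nextToken"] = token
        page = co.get_ec2_instance_recommendations(**kwargs)
        recs += page.get("instanceRecommendations", [])
        token = page.get("nextToken")
        if not token:
            return recs


def team_reports(recs):
    """Group qualifying recommendations by team, largest monthly savings first."""
    reports = defaultdict(list)
    for rec in recs:
        options = rec.get("recommendationOptions") or []
        if rec.get("finding") != "Overprovisioned" or not options:
            continue
        best = min(options, key=lambda o: o.get("rank", 99))  # rank 1 is the top option
        if best.get("performanceRisk", 4.0) > MAX_RISK:
            continue
        savings = (best.get("savingsOpportunity", {})
                   .get("estimatedMonthlySavings", {}).get("value", 0.0))
        team = ACCOUNT_TEAMS.get(rec.get("accountId"), "unowned")
        reports[team].append({
            "instance": rec.get("instanceArn"),
            "from": rec.get("currentInstanceType"),
            "to": best.get("instanceType"),
            "monthly_savings": round(savings, 2),
        })
    for rows in reports.values():
        rows.sort(key=lambda r: r["monthly_savings"], reverse=True)
    return dict(reports)
```

Keeping the "unowned" bucket visible is deliberate: recommendations with no clear owner are usually the first sign that the account-to-team map needs attention.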
Cleaning up the truly dead
Some waste is dead enough to clean up unilaterally:
- Elastic IPs not associated with any resource.
- EBS snapshots whose source volume was deleted >90 days ago.
- Load balancers with zero target groups for >30 days.
- NAT Gateways in dev VPCs with zero data processed in the last 14 days.
Wire these into automated cleanup with a 7-day mark-and-sweep cycle. Notify the account owner when something is queued for deletion. Most of the time, no one will object — and you'll reclaim a surprising chunk of monthly spend.
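A minimal sketch of the mark phase, assuming a hypothetical inventory of plain dicts that you assemble yourself; the record types and fields are illustrative, not an AWS response shape. The EC2 APIs don't report when a snapshot's source volume was deleted, so that date has to come from your own inventory history or CloudTrail.

```python
"""Queue 'truly dead' resources for a 7-day mark-and-sweep (sketch).

Inventory records are hypothetical dicts; field names are illustrative.
"""
from datetime import date, timedelta


def dead_eip(r, today):
    return not r["associated"]


def dead_snapshot(r, today):
    # Deletion date of the source volume, taken from your own history or CloudTrail.
    deleted = r.get("source_volume_deleted")
    return deleted is not None and today - deleted > timedelta(days=90)


def dead_lb(r, today):
    empty_since = r.get("empty_since")  # first day seen with zero target groups
    return empty_since is not None and today - empty_since > timedelta(days=30)


def dead_nat(r, today):
    # bytes_14d: the gateway's CloudWatch byte metrics summed over 14 days
    return r["env"] == "dev" and r["bytes_14d"] == 0


RULES = {"eip": dead_eip, "snapshot": dead_snapshot, "lb": dead_lb, "nat": dead_nat}


def queue_dead(inventory, today: date, grace_days: int = 7):
    """Group matches by owner as (type, id, delete_after) so each owner gets one notice."""
    delete_after = (today + timedelta(days=grace_days)).isoformat()
    queue = {}
    for r in inventory:
        rule = RULES.get(r["type"])
        if rule is not None and rule(r, today):
            queue.setdefault(r["owner"], []).append((r["type"], r["id"], delete_after))
    return queue
```

The sweep phase is the mirror image: on a later run, delete only items whose `delete_after` date has passed and that still match a rule when re-checked, and skip anything an owner has objected to (for example via an exemption tag of your choosing).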
Rule of thumb
Run a monthly rightsizing review. Even catching one oversized instance per account per month adds up quickly across many accounts, because every fix keeps paying off in each month after it lands. The win isn't any single resource; it's that nothing stays "free to forget" for long.
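To see how a small monthly habit compounds: if each review fixes one instance per account and every fix keeps saving in the months after it, the monthly run-rate grows linearly and cumulative savings grow roughly quadratically. The inputs below are hypothetical, and a fix is counted as saving for the whole month it lands in.

```python
def monthly_review_savings(accounts: int, per_fix: float, months: int):
    """One fix per account per month; each fix keeps saving every month after it lands.

    Returns (monthly run-rate after `months`, cumulative savings over `months`).
    """
    run_rate = accounts * per_fix * months
    # A fix landing in month m saves for (months - m + 1) months;
    # summing over m = 1..months gives months * (months + 1) / 2.
    cumulative = accounts * per_fix * months * (months + 1) / 2
    return run_rate, cumulative


# Hypothetical inputs: 50 accounts, $150/month saved per fix, 12 monthly reviews
print(monthly_review_savings(50, 150.0, 12))  # (90000.0, 585000.0)
```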
Next week we move from one-off cleanups to policy: keeping all of this enforced as code with Cloud Custodian.