AWS FinOps for Cloud Ops Teams — Week 4: Rightsizing & Waste Reduction

Last week was about paying less for compute you actually use. This week is about not running compute you don't use at all. In a multi-account environment, idle and oversized resources are the single biggest source of recurring waste — and the easiest place to find quick wins.

The sprawl problem

In any organization with more than a handful of AWS accounts, resource sprawl is inevitable:

An engineer spins up an m5.4xlarge for a benchmark and never tears it down.
A dev RDS instance runs 24/7 even though the team only touches it 8 hours a day.
An EBS volume is left behind when its instance is terminated, and keeps billing at roughly $0.10/GB-month (the exact rate depends on volume type and region) until someone deletes it.
A NAT Gateway gets created in dev because someone copy-pasted the prod Terraform module.

None of these is a disaster. But across multiple accounts they compound. By the time someone notices, the bill is "just what it is".

Where to look first

Almost every multi-account environment has waste hiding under the same six rocks:

Resource Type	Common Waste Pattern
EC2 instances	Oversized; consistent <10% CPU and low memory pressure
RDS instances	Dev/test databases running 24/7 in non-prod accounts
EBS volumes	Unattached volumes left after instance termination
Elastic IPs	Allocated but not associated with a running resource
NAT Gateways	Over-engineered into every dev/test VPC
Load balancers	Provisioned for retired applications

If you do nothing else, run a single multi-account sweep against this list every month.
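As a rough illustration of what such a sweep might look like for two of the six resource types, the sketch below filters per-account inventories for unattached volumes and unassociated Elastic IPs. It assumes the inventories were already collected, for example with boto3's describe_volumes and describe_addresses called once per account, and it only relies on the VolumeId, Size, State, AllocationId, and AssociationId fields of those responses. The per-account dictionary layout and the function names are hypothetical.

```python
def find_unattached_volumes(volumes):
    """Return volumes whose State is 'available', i.e. attached to nothing."""
    return [v for v in volumes if v.get("State") == "available"]


def find_unassociated_eips(addresses):
    """Return Elastic IPs with no AssociationId (allocated but unused)."""
    return [a for a in addresses if not a.get("AssociationId")]


def sweep(inventory):
    """Summarize waste per account.

    inventory: {account_id: {"Volumes": [...], "Addresses": [...]}}
    This layout is an assumption made for the sketch.
    """
    findings = {}
    for account_id, resources in inventory.items():
        vols = find_unattached_volumes(resources.get("Volumes", []))
        eips = find_unassociated_eips(resources.get("Addresses", []))
        findings[account_id] = {
            "unattached_volume_ids": [v["VolumeId"] for v in vols],
            "unattached_gib": sum(v["Size"] for v in vols),
            "unassociated_eip_ids": [a["AllocationId"] for a in eips],
        }
    return findings


# Illustrative data only; account IDs and resource IDs are placeholders.
inventory = {
    "111111111111": {
        "Volumes": [
            {"VolumeId": "vol-aaa", "Size": 100, "State": "available"},
            {"VolumeId": "vol-bbb", "Size": 50, "State": "in-use"},
        ],
        "Addresses": [
            {"AllocationId": "eipalloc-1", "AssociationId": "eipassoc-1"},
            {"AllocationId": "eipalloc-2"},
        ],
    },
    "222222222222": {"Volumes": [], "Addresses": []},
}

findings = sweep(inventory)
for account_id, result in findings.items():
    print(account_id, result)
```

Keeping the filters separate from the data collection means the same logic can be reused whether the inventory comes from live API calls or from an exported report.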

The tools that earn their keep

AWS Compute Optimizer. Free at the default tier, organization-aware, and surprisingly accurate. By default it analyzes the last 14 days of CloudWatch metrics and recommends rightsized configurations for EC2 instances, EBS volumes, Lambda functions, and ECS services on Fargate.
Cost Explorer Rightsizing Recommendations. Similar analysis, surfaced directly in the billing console, easier to share with non-engineers.
AWS Trusted Advisor. The "Cost Optimization" checks find idle load balancers, low-utilization RDS, and unassociated Elastic IPs. Full check coverage requires Business or Enterprise support.
Cloud Custodian. Comes into play in week 5 — but it's the long-term home for "find and act on waste" policies, because it can run identically across every account.

Scheduled shutdowns: the highest-ROI lever in dev/test

Production needs to run 24/7. Dev and test environments almost never do. If your dev EC2/RDS fleet runs only during business hours (say, 7am–7pm Monday–Friday), you've cut its hours from 168/week to 60/week — about a 64% reduction in compute spend on those resources, with no architectural changes.
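The arithmetic behind that figure is easy to rerun for other schedules. The helper names below are made up for this sketch. The result covers running hours only: storage attached to stopped instances is still billed.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def scheduled_hours(start_hour, stop_hour, days_per_week):
    """Weekly running hours for a fixed daily window."""
    return (stop_hour - start_hour) * days_per_week


def reduction(running_hours, baseline=HOURS_PER_WEEK):
    """Fraction of running hours removed compared with 24/7 operation."""
    return 1 - running_hours / baseline


# 7am to 7pm, Monday to Friday
office = scheduled_hours(7, 19, 5)
print(office, f"{reduction(office):.1%}")  # 60 64.3%
```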

Two ways to implement it:

AWS Instance Scheduler. AWS-published solution; supports EC2 and RDS, tag-driven, multi-account out of the box.
Custom EventBridge + Lambda. A 50-line Python function plus two scheduled rules. More flexible, slightly more to maintain.

Either way, drive it from a tag like Schedule=office-hours rather than hardcoded resource lists. New resources get scheduled automatically; opt-out is explicit.
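To make the tag-driven idea concrete, here is a minimal sketch of the decision a custom Lambda might make for each resource. It is not the Instance Scheduler's implementation. The schedule table, the always-on opt-out value, and the function names are assumptions for illustration, and the timestamp is taken to be in the team's local time.

```python
from datetime import datetime

# Hypothetical schedule definitions, keyed by the value of the Schedule tag.
# weekday() numbering: Monday is 0, Sunday is 6.
SCHEDULES = {
    "office-hours": {"days": {0, 1, 2, 3, 4}, "start": 7, "stop": 19},
    "always-on": {"days": set(range(7)), "start": 0, "stop": 24},  # explicit opt-out
}


def tags_to_dict(tags):
    """Convert the AWS-style [{'Key': ..., 'Value': ...}] list to a dict."""
    return {t["Key"]: t["Value"] for t in tags or []}


def should_be_running(tags, now):
    """Return True or False for scheduled resources.

    Returns None when there is no recognized Schedule tag, in which case
    this sketch leaves the resource alone.
    """
    schedule = SCHEDULES.get(tags_to_dict(tags).get("Schedule"))
    if schedule is None:
        return None
    in_window = schedule["start"] <= now.hour < schedule["stop"]
    return now.weekday() in schedule["days"] and in_window


office_tags = [{"Key": "Schedule", "Value": "office-hours"}]
print(should_be_running(office_tags, datetime(2024, 1, 8, 9, 0)))   # Monday 09:00
print(should_be_running(office_tags, datetime(2024, 1, 6, 10, 0)))  # Saturday 10:00
```

A scheduled rule would call this for every tagged instance and start or stop the ones whose current state disagrees with the result.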

EBS cleanup automation

Unattached EBS volumes are quiet money. They're easy to find (describe-volumes filtered on status=available, which is the state a volume reports when nothing is attached) and easy to delete, but the right answer is graceful cleanup, not terraform destroy:

Tag every unattached volume with cleanup_after set to today + 7 days.
If the volume is still unattached after 7 days, snapshot it (cheap insurance) and delete it.
Keep snapshots for 30 days. The number of times someone says "wait, I needed that" is non-zero, and snapshots make the conversation cheap.

This is exactly the kind of policy Cloud Custodian was built for; we'll wire it up next week.
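Until that policy exists, the decision logic for the three steps above can be sketched in plain Python. This is one possible reading of the workflow, not a Cloud Custodian policy. The action names and the behavior for reattached volumes are assumptions. The volume dictionaries follow the shape of EC2 describe-volumes output (State and Tags).

```python
from datetime import date, timedelta

GRACE_DAYS = 7                 # wait before deleting an unattached volume
SNAPSHOT_RETENTION_DAYS = 30   # how long the safety snapshot is kept


def plan_volume_action(volume, today):
    """Return (action, date_string) for one volume.

    Actions (hypothetical names):
      skip                 attached and unmarked, nothing to do
      unmark               reattached since it was marked; drop the tag
      mark                 newly unattached; set cleanup_after to the date
      wait                 marked, grace period still running
      snapshot_and_delete  grace period over; date is the snapshot expiry
    """
    tags = {t["Key"]: t["Value"] for t in volume.get("Tags", [])}
    if volume["State"] != "available":
        return ("unmark", None) if "cleanup_after" in tags else ("skip", None)
    if "cleanup_after" not in tags:
        return ("mark", (today + timedelta(days=GRACE_DAYS)).isoformat())
    if today >= date.fromisoformat(tags["cleanup_after"]):
        expiry = today + timedelta(days=SNAPSHOT_RETENTION_DAYS)
        return ("snapshot_and_delete", expiry.isoformat())
    return ("wait", tags["cleanup_after"])


today = date(2024, 3, 1)
fresh = {"VolumeId": "vol-aaa", "State": "available"}
marked = {
    "VolumeId": "vol-bbb",
    "State": "available",
    "Tags": [{"Key": "cleanup_after", "Value": "2024-03-08"}],
}
print(plan_volume_action(fresh, today))
print(plan_volume_action(marked, today))
print(plan_volume_action(marked, date(2024, 3, 8)))
```

Separating the plan from the API calls makes it possible to run the sweep in a report-only mode before anything is actually deleted.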

Rightsizing without breaking trust

Rightsizing is where FinOps programs lose their reputation if they're not careful. The failure mode: the cloud team unilaterally resizes an EC2 instance, the workload's p99 latency tanks, and the owning team finds out from a customer ticket. After that, nobody on that team trusts you with their infrastructure again.

Avoid that with a simple workflow:

Generate Compute Optimizer recommendations on a monthly cadence.
Filter to "high confidence" recommendations only (in Compute Optimizer terms, options that carry a low performance risk).
Send each team a short report of their recommendations, with estimated monthly savings and a link to the source data.
The owning team makes the change. Your job is to make it easy and visible, not to make the change for them.

The change rate is lower this way — but the changes that do happen don't blow up.
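The filtering and reporting steps of that workflow might look something like the sketch below. The record fields (team, resource_id, current_type, recommended_type, confidence, monthly_savings_usd, source_url) are a hypothetical normalized format, not the Compute Optimizer export schema, and the sample figures are invented purely to exercise the code.

```python
from collections import defaultdict


def build_team_reports(recommendations, confidence="high"):
    """Group recommendations of the given confidence by team.

    Returns {team: report_text}. Teams with no matching
    recommendations get no report at all.
    """
    by_team = defaultdict(list)
    for rec in recommendations:
        if rec["confidence"] == confidence:
            by_team[rec["team"]].append(rec)

    reports = {}
    for team, recs in by_team.items():
        # Largest savings first, so the report leads with what matters.
        recs.sort(key=lambda r: r["monthly_savings_usd"], reverse=True)
        total = sum(r["monthly_savings_usd"] for r in recs)
        lines = [f"Rightsizing recommendations for {team} (est. ${total:,.2f}/month)"]
        for r in recs:
            lines.append(
                f"- {r['resource_id']}: {r['current_type']} -> {r['recommended_type']}, "
                f"est. ${r['monthly_savings_usd']:,.2f}/month, data: {r['source_url']}"
            )
        reports[team] = "\n".join(lines)
    return reports


# Invented sample records; the URLs are placeholders.
sample = [
    {"team": "payments", "resource_id": "i-aaa", "current_type": "m5.4xlarge",
     "recommended_type": "m5.2xlarge", "confidence": "high",
     "monthly_savings_usd": 50.25, "source_url": "https://example.com/i-aaa"},
    {"team": "payments", "resource_id": "i-bbb", "current_type": "m5.4xlarge",
     "recommended_type": "m5.xlarge", "confidence": "high",
     "monthly_savings_usd": 100.25, "source_url": "https://example.com/i-bbb"},
    {"team": "search", "resource_id": "i-ccc", "current_type": "m5.large",
     "recommended_type": "t3.large", "confidence": "low",
     "monthly_savings_usd": 20.0, "source_url": "https://example.com/i-ccc"},
]

reports = build_team_reports(sample)
print(reports["payments"])
```

The output is deliberately plain text: it can be pasted into a ticket or chat message, and the owning team still makes the change themselves.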

Cleaning up the truly dead

The category that should be cleanable unilaterally:

Elastic IPs not associated with any resource.
EBS snapshots whose source volume was deleted >90 days ago.
Load balancers with zero target groups for >30 days.
NAT Gateways in dev VPCs with zero data processed in the last 14 days.

Wire these into automated cleanup with a 7-day mark-and-sweep cycle. Notify the account owner when something is queued for deletion. Most of the time, no one will object — and you'll reclaim a surprising chunk of monthly spend.
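As a sketch of the "mark" half of that cycle, the four criteria above can be written as small rules over a normalized inventory. Everything about the record format here is hypothetical (the type names and fields such as bytes_processed_14d or days_without_target_groups). A real implementation would have to derive those values from the EC2, ELB, and CloudWatch APIs first.

```python
from datetime import date, timedelta

SWEEP_DELAY_DAYS = 7  # time between marking and deleting

# One rule per resource type, mirroring the list above.
RULES = {
    "eip": lambda r: not r["associated"],
    "snapshot": lambda r: r["source_volume_deleted_days"] > 90,
    "load_balancer": lambda r: r["days_without_target_groups"] > 30,
    "nat_gateway": lambda r: r["environment"] == "dev"
    and r["bytes_processed_14d"] == 0,
}


def queue_for_deletion(resources, today):
    """Return one queue entry per resource that matches its rule.

    Each entry carries the owner to notify and the earliest deletion date.
    Resource types without a rule are never queued.
    """
    delete_after = (today + timedelta(days=SWEEP_DELAY_DAYS)).isoformat()
    queue = []
    for r in resources:
        rule = RULES.get(r["type"])
        if rule is not None and rule(r):
            queue.append({"id": r["id"], "owner": r["owner"], "delete_after": delete_after})
    return queue


# Placeholder records for illustration.
resources = [
    {"type": "eip", "id": "eipalloc-1", "owner": "team-a", "associated": False},
    {"type": "eip", "id": "eipalloc-2", "owner": "team-a", "associated": True},
    {"type": "nat_gateway", "id": "nat-1", "owner": "team-b",
     "environment": "prod", "bytes_processed_14d": 0},
    {"type": "load_balancer", "id": "lb-1", "owner": "team-c",
     "days_without_target_groups": 45},
]

queue = queue_for_deletion(resources, date(2024, 3, 1))
for entry in queue:
    print(entry)
```

The sweep half would run daily, delete entries whose delete_after date has passed, and skip any resource that no longer matches its rule.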

Rule of thumb

Run a monthly rightsizing review. Catching even one oversized instance per account per month compounds into meaningful savings at multi-account scale. The win isn't any single resource; it's that nothing stays "free to forget" for long.

Next week we move from one-off cleanups to policy: keeping all of this enforced as code with Cloud Custodian.