AWS FinOps for Cloud Ops Teams — Week 5: Governance with Cloud Custodian

Visibility tells you what's happening. Rightsizing tells you what to fix. Governance is the layer that keeps things from drifting back. Across dozens of accounts, you can't manually police tagging, idle resources, or risky configurations every week; you need policy-as-code.

That's where Cloud Custodian (a.k.a. c7n) comes in. It's an open-source rules engine for the cloud: you write YAML policies, point them at your accounts, and Custodian queries the AWS APIs and runs your filters and actions on whatever it finds. With its companion tool c7n-org, the same policies run across every account in an Organization.

Why Cloud Custodian (and not just AWS Config)

AWS Config is great at detection. It tells you a resource is non-compliant. Where Config falls short for FinOps work is action and remediation: it doesn't natively delete unattached EBS volumes, it doesn't notify the right SQS queue when an EC2 instance is missing tags, and writing custom remediation in SSM Automation is heavyweight.

Custodian fills that gap. I'm a fan: we have realized savings of at least 20% since implementation.

The mental model is simple:

A policy targets a resource type (EC2, EBS, RDS, S3, etc.).
Filters narrow it down (missing a required tag, unused, older than 30 days).
Actions do something (notify, mark for cleanup, stop, delete).

Policy 1: detect resources missing required tags

policies:
  - name: ec2-missing-required-tags
    resource: ec2
    description: |
      Flag EC2 instances missing Environment, Team, or CostCenter tags.
    filters:
      - or:
          - "tag:Environment": absent
          - "tag:Team": absent
          - "tag:CostCenter": absent
    actions:
      - type: notify
        to:
          - finops@example.com
        transport:
          type: sqs
          queue: finops-notifications

This is detect-only: no resource is touched, but FinOps gets a notification it can route to the team that owns those instances. Pair it with similar policies for RDS, ElastiCache, and EBS to cover the costliest services.
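The tag check ports to other resource types almost unchanged. The sketch below assumes Custodian's rds and ebs resource names and reuses the notification settings from the EC2 policy above; the policy names are illustrative, not taken from a real deployment.

```yaml
policies:
  # Same tag check as the EC2 policy, pointed at RDS instances.
  - name: rds-missing-required-tags
    resource: rds
    filters:
      - or:
          - "tag:Environment": absent
          - "tag:Team": absent
          - "tag:CostCenter": absent
    actions:
      - type: notify
        to:
          - finops@example.com
        transport:
          type: sqs
          queue: finops-notifications

  # And again for EBS volumes.
  - name: ebs-missing-required-tags
    resource: ebs
    filters:
      - or:
          - "tag:Environment": absent
          - "tag:Team": absent
          - "tag:CostCenter": absent
    actions:
      - type: notify
        to:
          - finops@example.com
        transport:
          type: sqs
          queue: finops-notifications
```

An ElastiCache variant would follow the same shape with resource: cache-cluster.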

Policy 2: graceful cleanup of unattached EBS volumes

policies:
  - name: ebs-unattached-mark
    resource: ebs
    filters:
      - State: available
      - "tag:custodian_cleanup": absent
    actions:
      - type: mark-for-op
        op: delete
        days: 7
        tag: custodian_cleanup

  - name: ebs-unattached-sweep
    resource: ebs
    filters:
      - State: available
      - type: marked-for-op
        op: delete
        tag: custodian_cleanup
    actions:
      - delete

The first policy marks unattached volumes for deletion in 7 days. The second policy, run daily, deletes any volume whose mark has come due and that is still unattached. This pattern (mark, wait, sweep) gives owning teams a graceful window to object before anything disappears.
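One gap worth considering: a volume that gets reattached during the window keeps its custodian_cleanup tag. The sweep skips it while it is in use, but if it is detached again later, the stale mark is already due and the volume gets no fresh grace period. A possible third policy, sketched here with an illustrative name, clears the mark from volumes that are back in use:

```yaml
policies:
  # Remove the cleanup mark from volumes that were reattached,
  # so a later detach starts a new 7-day window.
  - name: ebs-reattached-unmark
    resource: ebs
    filters:
      - State: in-use
      - "tag:custodian_cleanup": present
    actions:
      - type: remove-tag
        tags:
          - custodian_cleanup
```

Run it on the same daily schedule as the sweep, ideally just before it.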

Running policies across 40+ accounts with c7n-org

Custodian alone runs against one account at a time. Its companion tool c7n-org runs the same set of policies across many accounts in parallel. With an accounts.yml describing each account and the role to assume, you can audit your whole Organization in one command:

c7n-org run \
  -c accounts.yml \
  -s output/ \
  -u policies/tag-compliance.yml \
  --region us-east-1
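For reference, a minimal accounts.yml might look like the sketch below. The account names, IDs, and tags are placeholders; the role name assumes the CustodianExecutionRole convention described later in this post.

```yaml
accounts:
  - name: payments-prod              # used as the output directory name
    account_id: "111111111111"       # quoted so it stays a string
    role: arn:aws:iam::111111111111:role/CustodianExecutionRole
    regions:
      - us-east-1
      - us-west-2
    tags:
      - env:prod
      - team:payments

  - name: payments-dev
    account_id: "222222222222"
    role: arn:aws:iam::222222222222:role/CustodianExecutionRole
    regions:
      - us-east-1
    tags:
      - env:dev
      - team:payments
```

The tags are what c7n-org's --tags (-t) option matches against, which makes it easy to restrict a run to, say, dev accounts first.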

Output is organized per account, per region, and per policy: output/<account>/<region>/<policy>/resources.json. That's gold: it's JSON you can pipe into a database, a Slack message, or a Grafana panel.

Generating CSV governance reports

For weekly executive-friendly reporting:

c7n-org report \
  -c accounts.yml \
  -s output/ \
  -u policies/tag-compliance.yml \
  --format csv > tag-compliance-report.csv

That CSV becomes the artifact you share. One row per non-compliant resource, with the account, region, and policy that flagged it plus the resource's key fields; extra columns, such as the required tag values, can be added with --field. Drop it into a shared drive every Monday and your engineering managers have a real, account-level scoreboard.

The IAM model that doesn't scare your security team

The standard pattern is:

Custodian runs in a dedicated governance account.
Each member account has a role like CustodianExecutionRole that the governance account can assume.
That role has read access to the resource APIs Custodian inspects, plus narrowly scoped write access for the actions you actually use (e.g., ec2:DeleteVolume on tagged resources).

Don't grant *:*. The whole point of policy-as-code is auditability — your security team should be able to read the role and know exactly what Custodian can and cannot do.
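As a rough illustration, the member-account role could be expressed as a CloudFormation template like the one below. This is a sketch under assumptions, not a complete permission set: the GovernanceAccountId parameter is hypothetical, the read actions cover only the EC2, EBS, and RDS checks from this post, and the delete permission is conditioned on the custodian_cleanup tag used by the sweep policy.

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: Sketch of a member-account role for Cloud Custodian (illustrative only)

Parameters:
  GovernanceAccountId:   # hypothetical parameter name
    Type: String
    Description: Account ID of the governance account that runs Custodian

Resources:
  CustodianExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: CustodianExecutionRole
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          # Trusts the governance account; narrowing the principal to the
          # specific role Custodian runs as is tighter still.
          - Effect: Allow
            Principal:
              AWS: !Sub arn:aws:iam::${GovernanceAccountId}:root
            Action: sts:AssumeRole
      Policies:
        - PolicyName: custodian-read
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - ec2:DescribeInstances
                  - ec2:DescribeVolumes
                  - ec2:DescribeTags
                  - rds:DescribeDBInstances
                  - rds:ListTagsForResource
                Resource: "*"
        - PolicyName: custodian-ebs-cleanup
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              # Needed by mark-for-op to write the cleanup tag.
              - Effect: Allow
                Action: ec2:CreateTags
                Resource: !Sub arn:aws:ec2:*:${AWS::AccountId}:volume/*
              # Delete only volumes that already carry the cleanup tag.
              - Effect: Allow
                Action: ec2:DeleteVolume
                Resource: !Sub arn:aws:ec2:*:${AWS::AccountId}:volume/*
                Condition:
                  "Null":
                    "aws:ResourceTag/custodian_cleanup": "false"
```

The write statements are the part your security team will read most closely, so keeping them in their own small inline policy makes review easier.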

Detect first, remediate later

Possibly the most important rule in this post: start with detect-only policies in every account for 2–4 weeks before turning on any remediation actions. You'll find:

Tagging exceptions you didn't know about.
Resources that look idle but actually serve quarterly batch jobs.
Naming conventions that fight your filters in subtle ways.

Surprises in production are how FinOps programs lose trust. Even a 2-week dry run is cheap insurance.
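In Custodian terms, detect-only simply means a policy with no actions block: matches are written to resources.json and nothing else happens. A minimal sketch for the unattached-volume case, with an illustrative policy name and a 30-day age threshold chosen as an example:

```yaml
policies:
  - name: ebs-unattached-detect
    resource: ebs
    description: |
      Detect-only: list unattached volumes older than 30 days.
    filters:
      - State: available
      # Age in days, computed from the volume's CreateTime.
      - type: value
        key: CreateTime
        value_type: age
        value: 30
        op: greater-than
    # No actions block: Custodian only records the matches.
```

Once the match list has looked sane for a few weeks, the mark and sweep policies from earlier can take over.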

What to govern beyond cost

Once you have Custodian deployed, the same rails carry security and compliance work:

S3 buckets without encryption.
Security groups with 0.0.0.0/0 on sensitive ports.
IAM users without MFA.
RDS instances without automated backups.

That's beyond the scope of this series, but it's worth noting: a Custodian platform built for FinOps becomes a general-purpose cloud guardrail engine almost for free.
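Just to show that the syntax carries over unchanged, here is a hedged, detect-only sketch of those four checks. The policy names are illustrative and the port list is an example; treat the filters as a starting point to verify against the Custodian documentation rather than a vetted security baseline.

```yaml
policies:
  - name: s3-no-default-encryption
    resource: s3
    filters:
      - type: bucket-encryption
        state: False

  - name: sg-open-admin-ports
    resource: security-group
    filters:
      - type: ingress
        Ports: [22, 3389]
        Cidr:
          value: "0.0.0.0/0"

  - name: iam-user-no-mfa
    resource: iam-user
    filters:
      - type: credential
        key: mfa_active
        value: false

  - name: rds-no-automated-backups
    resource: rds
    filters:
      # A retention period of 0 means automated backups are disabled.
      - BackupRetentionPeriod: 0
```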

Rule of thumb

If a tagging or waste rule has to be enforced manually more than once a quarter, it should be a Cloud Custodian policy. Manual enforcement across 40+ accounts is just unenforced enforcement.

Next week we'll wire Custodian's output, plus Cost Explorer data, into a Grafana dashboard the team can actually live with day-to-day.