By this point in the series, we have the data: tags, commitments, rightsizing recommendations, and Custodian compliance output. What we don't yet have is a place where any of that lives: visibly, every day, in front of the same engineers who own the workloads.
That's what week 6 is about: getting cost data out of the AWS billing console and into the same Grafana you already stare at for latency and error rates.
Why Grafana over the AWS console
The AWS Cost Explorer UI is fine. But it has three problems for a multi-account cloud ops team:
- It's a separate place. Engineers won't visit it daily.
- It can't easily blend cost data with operational data (e.g., spend per request, spend per GB processed).
- It can't easily pull in your own data (Custodian compliance, RI utilization, tag coverage) and show it next to spend.
If you already run Prometheus and Grafana, the marginal cost of adding cost as a metric is small, and the value is large.
The data flow
The pattern I use is:
- A scheduled job (Lambda, ECS task, Kubernetes CronJob — your choice) calls the AWS Cost Explorer API.
- It transforms the response into Prometheus metrics with sensible labels (account, service, environment, team).
- It pushes those metrics to Prometheus Pushgateway, since this is batch data, not scrape-friendly.
- Grafana queries Prometheus like any other dashboard.
The job runs once a day — Cost Explorer data isn't real-time anyway, so anything more frequent is wasted API calls.
Pulling cost data with the Cost Explorer API
import datetime as dt

import boto3

ce = boto3.client("ce")

# Previous full calendar month. Cost Explorer treats End as exclusive.
end = dt.date.today().replace(day=1)
start = (end - dt.timedelta(days=1)).replace(day=1)

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": str(start), "End": str(end)},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    GroupBy=[
        {"Type": "DIMENSION", "Key": "LINKED_ACCOUNT"},
        {"Type": "DIMENSION", "Key": "SERVICE"},
    ],
)
Two things worth knowing:
- Cost Explorer's API is billable at $0.01 per request. Cache aggressively.
- Use UnblendedCost for ops-engineering work. AmortizedCost is what finance wants for accounting.
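Since every page of results counts as a billed request, it helps to walk the pagination once and keep the flattened rows around. Here is a minimal sketch of that, assuming the same query shape as above. The helper name iter_cost_rows and its metric parameter are mine, not part of boto3; the client is passed in so you can hand it boto3.client("ce") or a stub.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_cost_rows(
    ce_client: Any, metric: str = "UnblendedCost", **query: Any
) -> Iterator[Tuple[str, Tuple[str, ...], float]]:
    """Yield (date, group_keys, amount) from every page of get_cost_and_usage.

    iter_cost_rows is a hypothetical helper. `query` holds the usual
    TimePeriod / Granularity / Metrics / GroupBy arguments.
    """
    token = None
    while True:
        kwargs: Dict[str, Any] = dict(query)
        if token:
            # Each follow-up page is a separate (billed) request.
            kwargs["NextPageToken"] = token
        resp = ce_client.get_cost_and_usage(**kwargs)
        for day in resp["ResultsByTime"]:
            date = day["TimePeriod"]["Start"]
            for group in day.get("Groups", []):
                amount = float(group["Metrics"][metric]["Amount"])
                yield date, tuple(group["Keys"]), amount
        token = resp.get("NextPageToken")
        if not token:
            break
```

Calling list(iter_cost_rows(ce, TimePeriod=..., Granularity="DAILY", Metrics=["UnblendedCost"], GroupBy=[...])) once per run and reusing the list for every metric you derive is usually enough caching for a daily job.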
Pushing metrics to Prometheus Pushgateway
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

# A dedicated registry so only the cost metrics get pushed.
registry = CollectorRegistry()
cost = Gauge(
    "aws_daily_cost_usd",
    "AWS daily cost in USD",
    ["account", "service", "date"],
    registry=registry,
)

# One gauge sample per (account, service, day).
for day in resp["ResultsByTime"]:
    date = day["TimePeriod"]["Start"]
    for group in day["Groups"]:
        account, service = group["Keys"]
        amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
        cost.labels(account=account, service=service, date=date).set(amount)

push_to_gateway(
    "pushgateway.internal:9091",
    job="aws-finops",
    registry=registry,
)
One trap: push_to_gateway replaces every metric stored under the same job and grouping key, so each push overwrites the previous one rather than adding to it. That's fine here: daily granularity, daily push.
Add tag-aware grouping
The above gives you spend by account and service. The real superpower is grouping by your tags. Cost Explorer's API supports it via GroupBy with Type=TAG:
    GroupBy=[
        {"Type": "TAG", "Key": "Team"},
        {"Type": "TAG", "Key": "Environment"},
    ],
Now Grafana can show spend per Team, per Environment, alongside SLO data. This is the panel that makes the monthly review almost redundant — leaders walk in already knowing.
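Two details matter when you switch to tag grouping. The tags have to be activated as cost allocation tags in the billing console before Cost Explorer will group by them, and the keys in each group come back as "Key$value" strings (for example "Team$payments", or "Team$" for untagged resources). A small sketch of splitting those into label values follows; split_tag_key and the "untagged" placeholder are my own naming choices.

```python
from typing import Tuple


def split_tag_key(raw: str, missing: str = "untagged") -> Tuple[str, str]:
    """Split a Cost Explorer tag group key such as 'Team$payments'.

    Returns (tag_key, tag_value). An empty value, which Cost Explorer
    uses for resources without the tag, becomes `missing`.
    """
    key, _, value = raw.partition("$")
    return key, value or missing


# Example: label values for a group keyed by Team and Environment.
keys = ["Team$payments", "Environment$"]
labels = dict(split_tag_key(k) for k in keys)
# labels is {"Team": "payments", "Environment": "untagged"}
```

Keeping the untagged bucket as an explicit label value has a side benefit: the size of that bucket is a spend-weighted view of your tag coverage.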
Recommended Grafana panels
- Total monthly spend with month-over-month delta and a sparkline.
- Spend by account (top 10) as a stacked bar chart.
- Spend by service (top 10) as a stacked bar chart.
- Daily spend trend with a 7-day moving average to smooth out weekend dips.
- Spend by Team from your tags — the panel engineering managers will care about.
- RI / Savings Plan utilization and coverage as gauges, with a red threshold below 90%.
- Tag compliance % per account, sourced from Cloud Custodian output.
- Waste tracker — unattached EBS GB, count of idle instances, count of unassociated EIPs.
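For the Savings Plan utilization gauge, the same Cost Explorer client exposes get_savings_plans_utilization. A minimal sketch of turning its response into a single number you could set on a gauge, assuming you only want the total for the period; the function name is hypothetical and the client is injected so it can be stubbed.

```python
from typing import Any


def savings_plan_utilization_pct(ce_client: Any, start: str, end: str) -> float:
    """Return overall Savings Plan utilization (0-100) for [start, end).

    Hypothetical helper. `start` and `end` are YYYY-MM-DD strings.
    """
    resp = ce_client.get_savings_plans_utilization(
        TimePeriod={"Start": start, "End": end},
    )
    # The API returns the percentage as a string.
    return float(resp["Total"]["Utilization"]["UtilizationPercentage"])
```

The result can be pushed through the same registry as the cost gauge, which keeps the 90% threshold a plain Grafana setting rather than logic in the job.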
Custodian output as Prometheus metrics
The Custodian resources.json output is structured. A small post-run script can convert it into gauges:
noncompliant = Gauge(
    "finops_noncompliant_resources",
    "Resources failing a Custodian policy",
    ["account", "policy", "resource_type"],
    registry=registry,
)
# count rows in resources.json per account/policy and .set() the gauge
Now your tag compliance score is just another Grafana panel, with thresholds and alerts.
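The counting step might look like the sketch below. It assumes an output layout of <root>/<account>/<policy>/resources.json, which is my assumption for illustration; adjust the glob to however your Custodian runs are organized. Each resources.json is a JSON array of matched resources, so the count is just its length. The resource_type label is left out here for brevity.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Tuple, Union


def count_noncompliant(output_root: Union[str, Path]) -> "Counter[Tuple[str, str]]":
    """Count matched resources per (account, policy).

    Assumes <output_root>/<account>/<policy>/resources.json (hypothetical layout).
    """
    counts: "Counter[Tuple[str, str]]" = Counter()
    for path in Path(output_root).glob("*/*/resources.json"):
        account, policy = path.parts[-3], path.parts[-2]
        with path.open() as fh:
            resources = json.load(fh)  # a list of resource dicts
        counts[(account, policy)] += len(resources)
    return counts
```

From there, each (account, policy) pair maps to one .labels(...).set(count) call on the gauge.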
Alerts that turn the dashboard into operations
Dashboards without alerts are wallpaper. Three alerts I'd start with:
- Account on track to exceed 110% of monthly budget. Calculated as spend so far × (days in month / days elapsed) against a per-account budget. Notify the owning team in Slack.
- Savings Plan utilization < 90% for 24 hours. Notify the FinOps channel — usually means a workload was decommissioned and the commitment needs to be re-balanced.
- Tag compliance < 95% for any production account. Notify the account owner. Production accounts have no excuse for missing tags.
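The budget projection in the first alert is simple enough to sketch directly. This assumes a linear run rate and that spend_so_far covers every day up to and including today; since Cost Explorer data lags, you may want to pass the last fully billed day instead. Function names and the 1.10 default are illustrative.

```python
import calendar
import datetime as dt


def projected_month_end_spend(spend_so_far: float, today: dt.date) -> float:
    """Linear projection: spend so far x (days in month / days elapsed)."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    days_elapsed = today.day  # assumes spend_so_far includes `today`
    return spend_so_far * days_in_month / days_elapsed


def on_track_to_exceed(
    spend_so_far: float, budget: float, today: dt.date, threshold: float = 1.10
) -> bool:
    """True if the projection is above threshold x budget (110% by default)."""
    return projected_month_end_spend(spend_so_far, today) > budget * threshold


# Example: $1,500 spent by June 15 projects to $3,000 for the 30-day month.
projection = projected_month_end_spend(1500.0, dt.date(2024, 6, 15))
```

Against a $2,500 budget that projection crosses the 110% line ($2,750) and should alert; against a $3,000 budget it does not.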
Make it scrollable, not impressive
The dashboard's job is to be the first thing engineering managers open on Monday morning, not to be a demo for executives. Optimize for that. Boring, scannable, link-rich. Each panel should tell you whether to act, and link to where you'd act.
Rule of thumb
If your monthly review surfaces information your team didn't already know, your dashboard isn't doing its job. The point of the dashboard is to make the meeting boring.
Next week, in the final post, we'll close the loop on the human side: how to actually run those reviews with engineering teams without it feeling like a blame session.