AWS FinOps for Cloud Ops Teams — Week 6: Dashboards & Cost Tracking with Grafana

By this point in the series, we have the data: tags, commitments, rightsizing recommendations, and Custodian compliance output. What we don't yet have is a place where all of that lives visibly, every day, in front of the same engineers who own the workloads.

That's what week 6 is about: getting cost data out of the AWS billing console and into the same Grafana you already stare at for latency and error rates.

Why Grafana over the AWS console

The AWS Cost Explorer UI is fine. But it has three problems for a multi-account cloud ops team:

It's a separate place. Engineers won't visit it daily.
It can't easily blend cost data with operational data (e.g., spend per request, spend per GB processed).
It can't easily pull in your own data (Custodian compliance, RI utilization, tag coverage) and show it next to spend.

If you already run Prometheus and Grafana, the marginal cost of adding cost as a metric is small, and the value is large.

The data flow

The pattern I use is:

A scheduled job (Lambda, ECS task, Kubernetes CronJob — your choice) calls the AWS Cost Explorer API.
It transforms the response into Prometheus metrics with sensible labels (account, service, environment, team).
It pushes those metrics to Prometheus Pushgateway, since this is batch data, not scrape-friendly.
Grafana queries Prometheus like any other dashboard.

The job runs once a day — Cost Explorer data isn't real-time anyway, so anything more frequent is wasted API calls.

Pulling cost data with the Cost Explorer API

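import datetime as dt

import boto3

ce = boto3.client("ce")

# Query the previous calendar month. Cost Explorer treats End as exclusive,
# so the first day of the current month closes the range.
end = dt.date.today().replace(day=1)
start = (end - dt.timedelta(days=1)).replace(day=1)

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": str(start), "End": str(end)},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    # One group per (account, service) pair for each day.
    GroupBy=[
        {"Type": "DIMENSION", "Key": "LINKED_ACCOUNT"},
        {"Type": "DIMENSION", "Key": "SERVICE"},
    ],
)
# Large results are paginated: when resp contains "NextPageToken", pass it
# back as NextPageToken to fetch the next page.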

Two things worth knowing:

Cost Explorer's API is billable at $0.01 per request. Cache aggressively.
Use UnblendedCost for ops-engineering work. AmortizedCost is what finance wants for accounting.
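
Since every request is billed, one way to cache aggressively is to store each result on disk, keyed by the query parameters, and to call the API (following pagination) only on a cache miss. The sketch below is one possible shape, not a definitive implementation: `fetch_cost_pages`, `cache_dir`, and the one-file-per-query layout are illustrative assumptions, and `client` is anything that behaves like `boto3.client("ce")`.

```python
import hashlib
import json
from pathlib import Path


def fetch_cost_pages(client, params, cache_dir="ce_cache"):
    """Return all ResultsByTime entries for a query, using a disk cache.

    `client` is expected to behave like boto3.client("ce").
    `params` are the keyword arguments for get_cost_and_usage.
    The cache layout (one JSON file per query hash) is an assumption.
    """
    # Hash the query so identical requests map to the same cache file.
    key = hashlib.sha256(json.dumps(params, sort_keys=True).encode()).hexdigest()
    cache_file = Path(cache_dir) / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())

    results = []
    token = None
    while True:
        kwargs = dict(params)
        if token:
            # Continue from where the previous page stopped.
            kwargs["NextPageToken"] = token
        page = client.get_cost_and_usage(**kwargs)
        results.extend(page["ResultsByTime"])
        token = page.get("NextPageToken")
        if not token:
            break

    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(results))
    return results
```

A cache like this is safest for closed months. Month-to-date figures keep changing as AWS finalizes usage, so those entries would need a short expiry or no caching at all.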

Pushing metrics to Prometheus Pushgateway

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

# A dedicated registry so only the cost metrics are pushed.
registry = CollectorRegistry()
cost = Gauge(
    "aws_daily_cost_usd",
    "AWS daily cost in USD",
    ["account", "service", "date"],
    registry=registry,
)

# One gauge sample per (account, service, day) from the Cost Explorer response.
for day in resp["ResultsByTime"]:
    date = day["TimePeriod"]["Start"]
    for group in day["Groups"]:
        # Keys follow the GroupBy order: linked account ID, then service name.
        account, service = group["Keys"]
        amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
        cost.labels(account=account, service=service, date=date).set(amount)

# Replaces everything previously pushed under the "aws-finops" job.
push_to_gateway(
    "pushgateway.internal:9091",
    job="aws-finops",
    registry=registry,
)

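One trap: push_to_gateway replaces every metric previously pushed under the same job, so each run has to push the full set of series you want to keep. That's fine here, because the job rebuilds the whole registry and pushes once a day. (pushadd_to_gateway is the variant that only replaces metrics with matching names.) Pushgateway also never expires what it holds, so the last values linger if the job stops running.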

Add tag-aware grouping

The above gives you spend by account and service. The real superpower is grouping by your tags. Cost Explorer's API supports it via GroupBy with Type=TAG:

GroupBy=[
  {"Type": "TAG", "Key": "Team"},
  {"Type": "TAG", "Key": "Environment"},
],

Now Grafana can show spend per Team, per Environment, alongside SLO data. This is the panel that makes the monthly review almost redundant — leaders walk in already knowing.
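
When grouping by tag, Cost Explorer returns each group key as `TagKey$TagValue`, with an empty value for resources that lack the tag, so the keys need splitting before they become Prometheus labels. A minimal sketch, assuming lowercase label names and an "untagged" placeholder (both are choices made here for illustration, as is the helper name `tag_labels`):

```python
def tag_labels(keys, untagged="untagged"):
    """Turn Cost Explorer tag group keys into a label dict.

    Keys arrive as "<TagKey>$<TagValue>"; an empty value means the
    resource had no such tag. The "untagged" placeholder is an assumption.
    """
    labels = {}
    for key in keys:
        # Split on the first "$" only, since tag values may contain "$".
        name, _, value = key.partition("$")
        labels[name.lower()] = value or untagged
    return labels
```

With a gauge declared with the labels team, environment, and date, each group could then be recorded with `cost.labels(**tag_labels(group["Keys"]), date=date).set(amount)`. Note that tags only appear in Cost Explorer after they have been activated as cost allocation tags in the billing console.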

Recommended Grafana panels

Total monthly spend with month-over-month delta and a sparkline.
Spend by account (top 10) as a stacked bar chart.
Spend by service (top 10) as a stacked bar chart.
Daily spend trend with a 7-day moving average to smooth out weekend dips.
Spend by Team from your tags — the panel engineering managers will care about.
RI / Savings Plan utilization and coverage as gauges, with a red threshold below 90%.
Tag compliance % per account, sourced from Cloud Custodian output.
Waste tracker — unattached EBS GB, count of idle instances, count of unassociated EIPs.

Custodian output as Prometheus metrics

The Custodian resources.json output is structured. A small post-run script can convert it into gauges:

noncompliant = Gauge(
    "finops_noncompliant_resources",
    "Resources failing a Custodian policy",
    ["account", "policy", "resource_type"],
    registry=registry,
)
# count rows in resources.json per account/policy and .set() the gauge

Now your tag compliance score is just another Grafana panel, with thresholds and alerts.
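
The counting step might look like the sketch below. It assumes the output directory is laid out as `<output_dir>/<account>/<region>/<policy>/resources.json` (the shape a multi-account run typically produces; adjust the glob if yours differs) and that each resources.json holds a JSON array of matched resources. `count_noncompliant` is an illustrative name.

```python
import json
from collections import Counter
from pathlib import Path


def count_noncompliant(output_dir):
    """Count matched resources per (account, policy) from Custodian output.

    Assumes <output_dir>/<account>/<region>/<policy>/resources.json,
    where each file is a JSON array of resources that matched the policy.
    """
    counts = Counter()
    for path in Path(output_dir).glob("*/*/*/resources.json"):
        # The three directories above the file: account, region, policy.
        account, _region, policy = path.parts[-4:-1]
        resources = json.loads(path.read_text())
        # Sum across regions so each account/policy pair gets one total.
        counts[(account, policy)] += len(resources)
    return counts
```

Each entry can then feed the gauge with `.labels(...).set(n)`. The resource_type label is not derived here; it would have to come from the policy definition, which this sketch does not read.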

Alerts that turn the dashboard into operations

Dashboards without alerts are wallpaper. Three alerts I'd start with:

Account on track to exceed 110% of monthly budget. Calculated as spend so far × (days in month / days elapsed) against a per-account budget. Notify the owning team in Slack.
Savings Plan utilization < 90% for 24 hours. Notify the FinOps channel — usually means a workload was decommissioned and the commitment needs to be re-balanced.
Tag compliance < 95% for any production account. Notify the account owner. Production accounts have no excuse for missing tags.
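
The budget projection in the first alert is a straight-line run rate. A minimal sketch of that arithmetic follows; the function names are illustrative, and counting only fully elapsed days is an assumption made here because Cost Explorer data lags behind the current day.

```python
import calendar
import datetime as dt


def projected_month_end(spend_so_far, today):
    """Run-rate projection: spend so far x (days in month / days elapsed)."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    # Assumption: today's spend is incomplete, so count only finished days.
    days_elapsed = today.day - 1
    if days_elapsed == 0:
        return None  # nothing to extrapolate from on the 1st
    return spend_so_far * days_in_month / days_elapsed


def over_budget(spend_so_far, budget, today, threshold=1.10):
    """True when the projection exceeds threshold x budget (110% by default)."""
    projected = projected_month_end(spend_so_far, today)
    return projected is not None and projected > budget * threshold
```

For example, an account with $4,000 spent by the morning of June 11 projects to 4,000 × 30 / 10 = $12,000, which trips the alert against a $10,000 budget because it exceeds $11,000.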

Make it scrollable, not impressive

The dashboard's job is to be the first thing engineering managers open on Monday morning, not to be a demo for executives. Optimize for that. Boring, scannable, link-rich. Each panel should tell you whether to act, and link to where you'd act.

Rule of thumb

If your monthly review surfaces information your team didn't already know, your dashboard isn't doing its job. The point of the dashboard is to make the meeting boring.

Next week, in the final post, we'll close the loop on the human side: how to actually run those reviews with engineering teams without it feeling like a blame session.