The Observability Tax

When you pay more to watch your systems than to run them

$7.5M/yr: Real Case | 82%: Report Cost Shock | 0: Skills Solving This

You're an SRE Lead. It's the first Tuesday of the month.
The cloud bill just arrived. Last month's total: $847,000.
You open the breakdown and your stomach drops...

The Monthly Cloud Bill

MONTHLY INFRASTRUCTURE COST
June 2025
Compute (EC2/EKS): $186,000
Storage (S3/EBS/RDS): $94,000
Networking (ALB/NAT): $67,000
Other AWS services: $52,000
Subtotal, Infrastructure: $399,000
Datadog (APM + Logs + Infra): $228,000
PagerDuty: $18,000
Splunk (security logs): $142,000
New Relic (synthetics): $34,000
Sentry (error tracking): $26,000
Subtotal, Observability: $448,000
TOTAL: $847,000
Observability: $448K  >  Infrastructure: $399K

You are paying 12% more to watch your systems than to run them.
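The 12% figure can be checked directly from the line items on the bill. The snippet below is a minimal sketch that only re-adds the numbers shown above; the variable names are illustrative.

```python
# Line items copied from the June 2025 bill above (USD per month).
infrastructure = {
    "Compute (EC2/EKS)": 186_000,
    "Storage (S3/EBS/RDS)": 94_000,
    "Networking (ALB/NAT)": 67_000,
    "Other AWS services": 52_000,
}
observability = {
    "Datadog (APM + Logs + Infra)": 228_000,
    "PagerDuty": 18_000,
    "Splunk (security logs)": 142_000,
    "New Relic (synthetics)": 34_000,
    "Sentry (error tracking)": 26_000,
}

infra_total = sum(infrastructure.values())
obs_total = sum(observability.values())

# How much more the observability subtotal is than the infrastructure subtotal.
premium_pct = (obs_total / infra_total - 1) * 100

print(f"Infrastructure: ${infra_total:,}")
print(f"Observability:  ${obs_total:,}")
print(f"Observability premium: {premium_pct:.0f}%")
```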

The Cost Anatomy

You pull the Datadog usage report. The results are shocking.

Datadog: $228K/month breakdown

The discovery

85% of your Datadog bill is custom metrics and log ingestion. When you trace it back:

What do you do?

Cut the logs

Remove the DEBUG logs and save $67K/month immediately

Keep everything

You can't risk missing the next incident — those logs might be important

The Observability Dilemma

If you cut

You save $67K/month. Then 3 weeks later, a payment bug happens and there are no logs to debug it. The post-mortem says "insufficient observability." You get blamed.

Monthly savings: $67K
Career risk: HIGH
Decision basis: Gut feeling

If you keep

The bill stays at $448K. Finance asks "why are we spending more on monitoring than infrastructure?" You say "we need it." They ask "prove it." You can't.

Monthly waste: $67K
Career risk: MEDIUM
Decision basis: Fear

The real problem

You have no way to know which logs, metrics, and traces actually contribute to incident resolution and which are pure waste. Nobody does. So everyone keeps everything and pays the tax.

r/devops — 847 upvotes
"Our observability costs are now higher than our AWS bill. We migrated from Datadog to self-hosted Prometheus/Grafana for metrics, kept DD for APM only. Bill dropped from $180K to $45K. But it took 3 engineers 4 months."
r/sre
"We are currently spending around $7.5 million in observability tools. We ended up building our own internal observability platform, but even that costs $2M/yr to maintain."

The $47 Billion Observability Market

$47B: observability market, 2026
82%: cite cloud cost as top concern
15-20%: of compute spend goes to monitoring
0: skills that reduce this cost

The irony of the skill marketplace

We scanned 1,995 agent skills and found 74 monitoring-related skills. Here's what they do:

The skills teach you how to set up the tools that are bankrupting you.
Not a single one helps you figure out what to cut.

What exists today

Datadog Cost Estimation: Shows total, not per-service
Vantage / CloudZero: Infra costs only, no observability
Prometheus + Grafana: Cheaper, but DIY migration
Cost-per-log-line attribution: Does not exist
Log/metric ROI analysis: Does not exist
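To make the attribution gap concrete, here is one possible sketch of per-source cost attribution: split a shared log bill across sources in proportion to the volume each one ingests. This is an assumption about how such a tool could work, not a description of any vendor feature. The function name `attribute_cost`, the $100,000 bill, and the per-source volumes are all hypothetical and chosen only for illustration.

```python
from typing import Dict


def attribute_cost(total_cost: float, ingested_gb: Dict[str, float]) -> Dict[str, float]:
    """Split a shared bill across sources in proportion to ingested volume."""
    total_gb = sum(ingested_gb.values())
    if total_gb <= 0:
        raise ValueError("no ingested volume to attribute against")
    return {name: total_cost * gb / total_gb for name, gb in ingested_gb.items()}


# Hypothetical monthly volumes in GB, for illustration only.
# These are NOT taken from the bill in this article.
volumes = {
    "payment-processor DEBUG": 500.0,
    "checkout INFO": 300.0,
    "auth WARN": 200.0,
}

# Hypothetical shared log bill of $100,000.
shares = attribute_cost(100_000, volumes)

# Print the most expensive sources first.
for name, cost in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<26} ${cost:>9,.0f}")
```

Proportional allocation is the simplest choice. Real pricing often mixes ingestion, indexing, and retention charges, so an actual tool would likely need one allocation per billing dimension.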

What the Monthly Review Should Look Like

Observability ROI Report — June 2025
Total observability spend: $448,000
USED IN INCIDENT RESOLUTION (worth keeping):
  APM traces (payment-processor): $34,000 (used in 3 incidents)
  Error tracking (Sentry): $26,000 (47 alerts acted on)
  Infra metrics (CPU/mem/disk): $42,000 (12 scaling events)
  Security logs (auth failures): $38,000 (2 incident investigations)
Justified spend: $140,000
NEVER USED (safe to cut or downsample):
  DEBUG logs (payment-processor): $67,000 (0 queries in 90 days)
  High-cardinality custom metrics: $89,000 (3 dashboards, 0 alerts)
  Duplicate traces (3 overlapping tools): $52,000 (redundant coverage)
  Verbose K8s event logs: $41,000 (never investigated)
  Synthetic monitoring (unused routes): $18,000 (testing dead endpoints)
Cuttable spend: $267,000
Remaining unclassified: $41,000 (needs review)
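A report like this boils down to a classification rule applied to each telemetry stream. The sketch below shows one simple version: a stream with at least one recorded use is justified, a stream with zero uses is cuttable, and a stream with no usage data is unclassified. The costs and counts come from the report above. Collapsing each line's usage note into a single integer is an assumption of this sketch (for example, the custom metrics are scored by their 0 alerts, and redundant traces are scored as 0).

```python
from collections import defaultdict

# (stream, monthly cost in USD, usage count; None means no usage data).
STREAMS = [
    ("APM traces (payment-processor)", 34_000, 3),         # incidents
    ("Error tracking (Sentry)", 26_000, 47),               # alerts acted on
    ("Infra metrics (CPU/mem/disk)", 42_000, 12),          # scaling events
    ("Security logs (auth failures)", 38_000, 2),          # investigations
    ("DEBUG logs (payment-processor)", 67_000, 0),         # queries in 90 days
    ("High-cardinality custom metrics", 89_000, 0),        # alerts (assumption)
    ("Duplicate traces (3 overlapping tools)", 52_000, 0), # redundant (assumption)
    ("Verbose K8s event logs", 41_000, 0),                 # never investigated
    ("Synthetic monitoring (unused routes)", 18_000, 0),   # dead endpoints
    ("Remaining unclassified", 41_000, None),              # needs review
]


def classify(uses):
    """Bucket a stream by whether anyone demonstrably used it."""
    if uses is None:
        return "unclassified"
    return "justified" if uses > 0 else "cuttable"


totals = defaultdict(int)
for _name, cost, uses in STREAMS:
    totals[classify(uses)] += cost

for bucket in ("justified", "cuttable", "unclassified"):
    print(f"{bucket:<13} ${totals[bucket]:,}")
```

The hard part in practice is not the rule but the usage counts: they would have to come from query audit logs, alert history, and incident records, which is exactly the data most teams do not join against their bill.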

Today: Fear-based decisions

Monthly spend: $448K
Known waste: Unknown
Decision method: "We might need it"
Confidence to cut: Zero

With cost attribution

Justified spend: $140K
Safe to cut: $267K (60%)
Decision method: Usage data
Annual savings: $3.2M
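The 60% and $3.2M figures follow from the monthly numbers. A minimal check, using only values stated in the report:

```python
total_spend = 448_000   # monthly observability spend (USD)
safe_to_cut = 267_000   # monthly spend classified as never used (USD)

cut_share = safe_to_cut / total_spend   # fraction of spend that is cuttable
annual_savings = safe_to_cut * 12       # annualized, assuming flat monthly spend

print(f"Safe to cut: {cut_share:.0%} of spend")
print(f"Annual savings: ${annual_savings / 1e6:.1f}M")
```

The annual figure assumes the monthly bill stays flat for a year, which is a simplification.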

Key Findings

The observability cost crisis by the numbers

60%: of observability spend is waste (based on simulation)
0: tools that attribute cost to incident value
$3.2M: annual savings potential (mid-size company)
74: monitoring skills (all setup guides, 0 cost optimizers)

The missing tool

Every observability vendor tells you what's happening in your systems. None tell you what that knowledge costs, or whether it was worth the price.

The tool that's missing would:

The pattern (again)

This is the same pattern we found across all 8 pain categories:

What skills do

Help you set up Datadog, configure dashboards, write PromQL queries, deploy Grafana

What users need

Know which dashboards nobody looks at, which logs cost $67K but save $0, which alerts just create noise

"The skill marketplace is optimized for setup, not for survival."

Methodology