Cristian Magherusan · ex-AWS engineer · cristi@leanercloud.com

The $8,000 Needle in a Haystack of CloudWatch Logs

Most cost optimization stories are about big, visible resources. Databases, compute instances, storage volumes. Things with their own line items in the bill. Things you can point at.

CloudWatch Logs are not that.

This client had a few hundred log groups. Most of them cost almost nothing. A couple of dollars a month, maybe ten. The kind of line items that don't show up in any top-N cost report. The kind of thing nobody would think to optimize.

But somewhere in those hundreds of log groups, one was different. A VPC flow log - a single log group - had accumulated 2 TB of data and was costing $8,000 a year.

Finding it meant scanning hundreds of log groups one by one.

This was part of a longer project with this client. The prior week, I'd found $35,000 to $50,000 in yearly savings across RDS rightsizing, Graviton migration, Reserved Instances for about 15 databases, and converting roughly 70 EBS volumes to gp3. Two weeks before that, I'd found $50,000 in yearly savings from moving approximately 800 S3 buckets to the right storage classes.

The CloudWatch discovery was different. Not because the dollar amount was smaller - though it was - but because the fix was absurdly simple relative to the waste. A couple of configuration fields on a single cloud resource. Set the right log class (Standard or Infrequent Access). Set a sensible retention policy. Maybe determine whether you need the log at all.
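A minimal sketch of the retention half of that fix, assuming boto3 and a CloudWatch Logs client created with `boto3.client("logs")`. The helper names `groups_without_retention` and `apply_retention` are hypothetical; the only AWS call used is `put_retention_policy`, and the value passed as `retentionInDays` has to be one the API accepts (30 and 90 are among them). The log class is left out because, per the AWS documentation, it is chosen when a log group is created.

```python
def groups_without_retention(log_groups):
    """Names of log groups that never expire.

    Expects entries shaped like those returned by describe_log_groups,
    which omits "retentionInDays" when no retention policy is set.
    """
    return [g["logGroupName"] for g in log_groups if "retentionInDays" not in g]


def apply_retention(logs_client, names, retention_days, dry_run=True):
    """Set a retention policy on each named log group.

    Returns the names that were (or, in a dry run, would be) changed.
    dry_run defaults to True so nothing is modified by accident.
    """
    changed = []
    for name in names:
        if not dry_run:
            logs_client.put_retention_policy(
                logGroupName=name, retentionInDays=retention_days
            )
        changed.append(name)
    return changed


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   logs = boto3.client("logs")
#   apply_retention(logs, ["/example/vpc-flow-logs"], 30, dry_run=False)
```

Defaulting to a dry run keeps the human decision in the loop: shortening retention deletes older log events, so the list is worth reviewing before anything is applied.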

At a second client, the biggest offender was a Kubernetes Container Insights log group costing about $1,000 a year. Same story: no retention limit set, and the Standard log class, the most expensive one for ingestion.

Both clients had a few hundred log groups. Both had one or two that were dramatically more expensive than the rest. And in both cases, the expensive ones were configured with defaults that nobody had ever revisited.

This is a cost leak that hides in plain sight. CloudWatch Logs aren't expensive in general. Most log groups cost almost nothing. But the pricing model has sharp edges - the Standard log class versus Infrequent Access, unlimited retention versus a sensible cutoff - and if you hit the wrong combination on a high-volume log source, the costs compound quietly over months.
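The compounding can be made concrete with a little arithmetic. The sketch below is illustrative only: the volume (100 GB a month) is made up, and the prices ($0.50/GB Standard ingestion, $0.25/GB Infrequent Access ingestion, $0.03 per GB-month of storage) are assumed list prices that should be checked against the current pricing page for your region. It also ignores compression and the separate tiered pricing that may apply to vended logs such as VPC flow logs.

```python
def monthly_costs(ingest_gb_per_month, months, ingest_price_per_gb,
                  storage_price_per_gb_month, retention_months=None):
    """Return a list of (ingestion_cost, storage_cost) tuples, one per month.

    With retention_months=None nothing expires, so stored data (and the
    storage charge) grows every month.
    """
    costs = []
    for month in range(1, months + 1):
        kept = month if retention_months is None else min(month, retention_months)
        stored_gb = ingest_gb_per_month * kept
        costs.append((ingest_gb_per_month * ingest_price_per_gb,
                      stored_gb * storage_price_per_gb_month))
    return costs


def total_cost(costs):
    """Sum ingestion and storage charges over all months."""
    return sum(ingest + storage for ingest, storage in costs)


# Assumed prices and volume, for illustration only.
defaults = total_cost(monthly_costs(100, 12, 0.50, 0.03))  # Standard, never expires
tuned = total_cost(monthly_costs(100, 12, 0.25, 0.03, retention_months=1))  # IA, ~30 days
print(round(defaults, 2), round(tuned, 2))
```

Under these assumptions the first year comes to roughly $834 with the defaults and $336 with the tuned settings, and the gap widens every month the unexpired data keeps accumulating.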

VPC flow logs are a particularly common culprit. They generate enormous volumes of data. Most teams enable them for compliance or debugging and then forget about them. The logs accumulate. The bill grows. Nobody checks because it's just "logging" and logging is supposed to be cheap.

I now have automation that finds these expensive outliers across hundreds of log groups in seconds. The scanning is automated. The fix, for now, still requires a human decision: do you need this log? What retention makes sense? What log class is appropriate? These are judgment calls that depend on the team's actual use of the data.
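A scan of this kind can be sketched in a few lines. This is not the automation described above, only a minimal illustration that assumes boto3 and ranks log groups by the `storedBytes` field returned by `describe_log_groups`. The function names `fetch_log_groups` and `find_outliers` are hypothetical. Stored bytes only reflect what has accumulated; estimating ingestion cost would need additional data, such as the `IncomingBytes` metric in the `AWS/Logs` namespace.

```python
def fetch_log_groups(logs_client):
    """Collect every log group using the describe_log_groups paginator."""
    groups = []
    paginator = logs_client.get_paginator("describe_log_groups")
    for page in paginator.paginate():
        groups.extend(page.get("logGroups", []))
    return groups


def find_outliers(log_groups, top_n=5):
    """Rank log groups by stored bytes and flag settings worth a second look."""
    ranked = sorted(log_groups, key=lambda g: g.get("storedBytes", 0), reverse=True)
    report = []
    for g in ranked[:top_n]:
        report.append({
            "name": g["logGroupName"],
            "stored_gb": g.get("storedBytes", 0) / 1024 ** 3,
            # describe_log_groups omits retentionInDays when logs never expire
            "never_expires": "retentionInDays" not in g,
            # assumption: treat a missing class as STANDARD
            "log_class": g.get("logGroupClass", "STANDARD"),
        })
    return report


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   for row in find_outliers(fetch_log_groups(boto3.client("logs"))):
#       print(row)
```

Sorting by size first and flagging defaults second mirrors the pattern in both accounts: a handful of large groups at the top of the list, usually with no retention set.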

The total confirmed CloudWatch Logs savings for the first client - VPC flow logs and WAF logs combined - came to $14,000 a year.

It's not the biggest number. But a configuration change on a single resource, two fields, $8,000 a year - that's a pretty good ratio.