Reducing LLM Costs Is Easy — Until Production Starts

Founder of VCAL Project

Originally published on Dev.to on April 13, 2026.

A month ago, I wrote about reducing LLM costs using caching.

The idea is simple: don’t send the same or similar request to the model twice.

It works well in demos. It even works well in early testing.

And then production starts.


Production Reality: Where LLM Systems Start Breaking

At first, everything looks under control. Requests are small, traffic is predictable, and caching delivers immediate savings. You see fewer calls to the model and faster responses. It feels like the problem is solved.

But real systems don’t stay simple for long.

Prompts begin to grow. What used to be a short question turns into a long conversation with accumulated context, system instructions, and sometimes entire documents pasted by users. Requests become heavier, slower, and more expensive in ways that caching alone cannot fix.

At the same time, failures start to blur together. A timeout, a malformed request, and an upstream provider error all look the same from the outside. Without clear separation, debugging becomes guesswork, and cost anomalies become difficult to explain.
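
One way to keep these failures apart is to label every failed request with a coarse category before it is logged or counted. The sketch below is only an illustration: the `FailureKind` names and the `classify_failure` helper are hypothetical, and the mapping from HTTP status ranges to categories is an assumption, not a description of any particular gateway.

```python
from enum import Enum


class FailureKind(Enum):
    """Hypothetical failure categories, chosen for illustration."""
    CLIENT_ERROR = "client_error"      # malformed request; retrying will not help
    TIMEOUT = "timeout"                # no complete answer before the deadline
    UPSTREAM_ERROR = "upstream_error"  # the provider answered with a failure
    UNKNOWN = "unknown"                # anything we cannot place


def classify_failure(status_code=None, timed_out=False):
    """Map a raw outcome to a coarse failure kind.

    Assumes the caller knows whether its own deadline fired and, if a
    response arrived, which HTTP status code it carried.
    """
    if timed_out:
        return FailureKind.TIMEOUT
    if status_code is None:
        return FailureKind.UNKNOWN
    if 400 <= status_code < 500:
        # A real system would probably give 429 (rate limiting) its own kind.
        return FailureKind.CLIENT_ERROR
    if 500 <= status_code < 600:
        return FailureKind.UPSTREAM_ERROR
    return FailureKind.UNKNOWN
```

With a label like this attached to each failure, error rates and cost anomalies can be broken down per category instead of being read from a single "request failed" counter.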

Then there’s latency. A request times out — but what actually happened? Was the provider slow? Did the request even reach it? Should you retry it or not? Without visibility into upstream behavior, you’re operating blind.
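
Those questions become answerable once the service records how far a request got before the deadline fired. The following is a minimal sketch under stated assumptions: `UpstreamTrace`, its two flags, and the returned labels are hypothetical names, and the retry rule is a deliberately conservative choice rather than a recommendation for every workload.

```python
from dataclasses import dataclass


@dataclass
class UpstreamTrace:
    """What was observed before the timeout (hypothetical fields)."""
    request_sent: bool = False         # request bytes were written to the provider
    first_byte_received: bool = False  # the provider had started answering


def describe_timeout(trace: UpstreamTrace) -> str:
    """Say what happened to a timed-out request, based on the trace."""
    if not trace.request_sent:
        return "never_sent"        # the request did not reach the provider
    if not trace.first_byte_received:
        return "no_response"       # the provider was slow or silent
    return "partial_response"      # the provider was answering when time ran out


def safe_to_retry(trace: UpstreamTrace) -> bool:
    """Conservative rule: retry freely only if nothing was sent upstream.

    In the other cases the provider may already have done (and billed) the
    work, so a retry is a cost decision rather than a free one.
    """
    return describe_timeout(trace) == "never_sent"
```

For example, `describe_timeout(UpstreamTrace(request_sent=True))` returns `"no_response"`, which points at the provider rather than at the local network or the service itself.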

Even semantic caching, which looks almost magical at first, becomes a tuning problem. Similarity thresholds that worked in testing suddenly feel off: some responses are reused too aggressively, others not at all. Without insight into what the system is actually doing, you’re left adjusting numbers and hoping for the best. It is much like tuning prompts, except that here the feedback loop is missing.
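
A simple way to restore that feedback is to have every cache lookup report the best similarity score it found alongside the hit-or-miss decision. The sketch below assumes cosine similarity over plain Python lists and a linear scan; the `lookup` function, the record fields, and the default threshold of 0.85 are illustrative assumptions, not values taken from any real deployment.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def lookup(query_vec, entries, threshold=0.85):
    """Return (cached_answer_or_None, decision_record).

    `entries` is a list of (embedding, answer) pairs. The record exposes the
    best score even on a miss, so near-misses and borderline hits are visible.
    """
    best_score, best_answer = -1.0, None
    for vec, answer in entries:
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_score, best_answer = score, answer
    hit = best_answer is not None and best_score >= threshold
    record = {"best_score": best_score, "threshold": threshold, "hit": hit}
    return (best_answer if hit else None), record
```

Logging the record for each request shows how scores are distributed around the threshold, so a change to the threshold can be judged against observed traffic instead of guesswork.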

Finally, the moment that exposes everything: deployment.

You restart the service under live traffic, and suddenly requests are dropped, responses are inconsistent, and behavior is unpredictable. What worked perfectly in isolation now reveals gaps in lifecycle handling.
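
One common way to close such a gap is to drain in-flight requests before the process exits: stop accepting new work, then wait a bounded time for active requests to finish. The sketch below is a minimal, thread-based illustration; `InFlightTracker` and its method names are hypothetical, and a real service would wire `drain` into its own shutdown signal handling.

```python
import threading


class InFlightTracker:
    """Counts active requests and supports a bounded drain on shutdown."""

    def __init__(self):
        self._cond = threading.Condition()
        self._in_flight = 0
        self._draining = False

    def try_begin(self) -> bool:
        """Register a new request; refuse once draining has started."""
        with self._cond:
            if self._draining:
                return False
            self._in_flight += 1
            return True

    def end(self) -> None:
        """Mark one request as finished and wake any waiting drain."""
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def drain(self, timeout_s: float) -> bool:
        """Stop accepting work and wait for active requests to finish.

        Returns True if everything finished within `timeout_s` seconds,
        False if requests were still running when the timeout expired.
        """
        with self._cond:
            self._draining = True
            return self._cond.wait_for(
                lambda: self._in_flight == 0, timeout=timeout_s
            )
```

A request handler would call `try_begin()` on entry and `end()` in a `finally` block, while the shutdown path calls `drain()` with a timeout chosen to fit the longest acceptable request.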