
4 posts tagged with "machine-learning"


What Actually Matters When You Try to Reduce LLM Costs

· 7 min read
Founder of VCAL Project

Originally published on Medium.com on April 5, 2026.
Read the Medium.com version

After publishing the first release of AI Cost Firewall, I thought the hard part was done.

The idea was simple and it worked immediately: avoid sending duplicate or semantically similar requests to the LLM, and you reduce cost.

I described that initial approach in more detail here: How to Reduce OpenAI API Costs with Semantic Caching

And it did work.

But once I started pushing it further — adding more metrics, handling edge cases, running real traffic through it — it became clear that the initial idea was only a small part of the problem.

Reducing LLM cost is not just about caching. It’s about understanding where the cost actually comes from, what “savings” really mean, and what begins to break when a system moves from a controlled demo into something closer to production.


The First Insight Still Holds

The original observation hasn’t changed, and neither has the core architecture. The system still solves the same underlying problem.

  • Users repeat themselves.
  • Applications repeat themselves.
  • Agents repeat themselves.

Often the wording changes slightly, but the intent remains the same. From the model’s perspective, however, every variation is a brand new request. And every request has a cost.

So yes — caching works. It reduces cost immediately, often without any changes to the application itself.

But that’s only the surface. The deeper questions only appear once you try to rely on it.


The First Misconception: “Caching Is Free”

In the beginning, the results looked almost too good.

In a demo environment, exact cache hits dominated. When a request hit the cache, it meant no API call, no tokens, and almost zero latency. It felt like pure gain, as if cost reduction came with no trade-offs.

That illusion disappears the moment you introduce semantic caching properly. Because semantic caching requires embeddings.

To determine whether two requests are similar, you first need to convert them into vectors. That means calling an embedding model, storing the result, and comparing it against existing data. Only then can you decide whether to reuse a response or forward the request to the LLM.
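To make that flow concrete, here is a minimal sketch of the embed, compare, decide loop. It is not the AI Cost Firewall implementation: `embed` is a toy bag-of-words stand-in for a real (paid) embedding model, and the `SemanticCache` class, its linear scan, and the 0.8 threshold are all assumptions chosen for illustration.

```python
import math
from collections import Counter
from typing import Optional


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().replace("?", "").split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Hypothetical cache: a linear scan over stored (vector, answer) pairs."""

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # list of (vector, answer) tuples

    def store(self, prompt: str, answer: str) -> None:
        self.entries.append((embed(prompt), answer))

    def lookup(self, prompt: str) -> Optional[str]:
        query = embed(prompt)  # in a real system, this is the step that costs money
        best_score, best_answer = 0.0, None
        for vector, answer in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_answer = score, answer
        # Reuse the stored answer only if the closest match is similar enough;
        # otherwise the caller forwards the request to the LLM.
        return best_answer if best_score >= self.threshold else None
```

Note that `lookup` computes an embedding whether or not it ends in a hit, which is exactly why the cost side of the equation matters.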

And embeddings are not free.

At that point, the equation changes:

Net savings = avoided LLM cost − embedding cost

This is where things become more delicate.

If your similarity threshold is too strict, most lookups miss and the embeddings you paid for are wasted. If your traffic is highly unique, most of those embeddings never lead to a cache hit. If your embedding model is expensive, the optimization starts working against you.
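The net savings equation can be put into numbers with a small sketch. The prices and hit rates in the usage note are made up for illustration, and the function assumes the worst case where every request is embedded (exact-match hits that skip the embedding step would only improve the result).

```python
def net_savings(requests, hit_rate, llm_cost_per_call, embedding_cost_per_call):
    """Net savings = avoided LLM cost - embedding cost."""
    avoided_llm_cost = requests * hit_rate * llm_cost_per_call
    # Assumption: every request is embedded, whether or not it ends in a hit.
    embedding_cost = requests * embedding_cost_per_call
    return avoided_llm_cost - embedding_cost


def break_even_hit_rate(llm_cost_per_call, embedding_cost_per_call):
    """Hit rate at which the avoided LLM cost exactly pays for the embeddings."""
    return embedding_cost_per_call / llm_cost_per_call
```

With hypothetical prices of 0.002 per LLM call and 0.0001 per embedding, `net_savings(10_000, 0.30, 0.002, 0.0001)` is positive, while `net_savings(10_000, 0.02, 0.002, 0.0001)` is negative: at a 2% hit rate the cache costs more than it saves. Under these assumed prices, the break-even hit rate is 5%.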

What initially looked like a simple cost reduction mechanism becomes something that requires careful balance.

That was the moment when the project stopped being just a clever shortcut and started behaving like a system that needs tuning.

From Words to Vectors: How Semantics Traveled from Linguistics to Large Language Models

· 8 min read
Founder of VCAL Project

Originally published on Dev.to on January 17, 2026.
Read the Dev.to version

Why meaning moved from definitions to structure — and what that changed for modern AI


When engineers talk about semantic search, embeddings, or LLMs that "understand" language, it often sounds like something fundamentally new. Yet the problems modern AI systems face — meaning, reference, ambiguity, and context — were already central questions in linguistics and philosophy more than a century ago.

This article traces how the concept of semantics evolved across disciplines: from linguistics and philosophy, through symbolic AI and statistical NLP, and finally into the neural architectures that power modern large language models. It also explains why this history matters for how we design retrieval, memory, and language systems today. The journey reveals that today's AI systems are not a break from the past, but the convergence of long-standing ideas finally made computationally feasible.


Linguistic Origins: Meaning as a System, Not a Label

Modern semantics begins not with computers, but with language itself. In the late 19th and early 20th centuries, linguists began to reject the naive idea that words simply "point" to things in the world. One of the most influential figures in this shift was Ferdinand de Saussure, who argued that language is a structured system of signs rather than a naming scheme.

Saussure proposed that each linguistic sign consists of two inseparable parts: the signifier (the sound or written form) and the signified (the concept evoked). Crucially, the relationship between the two is arbitrary. There is nothing inherently "dog-like" about the word dog. Its meaning arises because it occupies a position within a broader system of contrasts: dog is meaningful because it is not cat, not wolf, not table.

This was a radical idea at the time. Meaning, Saussure claimed, is relational. Words derive significance from how they differ from other words, not from direct correspondence with reality. This insight quietly laid the conceptual groundwork for everything from structural linguistics to modern vector-based representations.

Beyond Vector Databases: The Case for Local Semantic Caching

· 6 min read
Founder of VCAL Project

Originally published on Medium.com on November 6, 2025.
Read the Medium.com version


When “intelligence” wastes cycles

Most teams building LLM-powered products eventually realize that a large portion of their API costs comes not from new insights, but from repeated questions.

A support bot, an internal assistant, or an analytics copilot will each encounter thousands of near-identical queries:

“How do I pass the API key to the local model gateway?”
“Why is the dev database connection timing out?”
“How can I refresh the cache without restarting the service?”

Each of those prompts gets re-tokenized, re-embedded, and re-sent to an LLM even when the model has already answered an equivalent question a minute earlier.

The result: burned tokens, wasted latency, and duplicated reasoning.

Vector databases solved storage, not reuse

The industry's first instinct was to throw vector databases at the problem. They excel at persistent embeddings and semantic retrieval, but they were never built for reuse. What they lack are TTL policies, eviction strategies, and atomic snapshotting of in-flight state. In other words, they store knowledge, not memory.

Traditional vector databases follow a key-value paradigm: they persist embeddings indefinitely so they can be queried later, much like records in a datastore. A semantic cache, by contrast, treats embeddings as dynamic memory governed by similarity, expiration, and adaptive retention. Its goal is not to archive information, but to avoid redundant reasoning across millions of semantically similar requests.

With a semantic cache such as VCAL, cached answers can stay valid for days or weeks, depending on data volatility and TTL settings. This moves caching from short-term repetition avoidance to long-horizon semantic reuse where reasoning itself becomes a reusable resource rather than a recurring cost.
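To illustrate what TTL policies and eviction strategies add on top of plain storage, here is a minimal sketch of a cache that expires entries after a fixed time and evicts the least recently used entry when full. This is not VCAL's API: the `TtlLruCache` name and its parameters are hypothetical, and it uses exact keys for brevity where a semantic cache would match on the nearest embedding instead.

```python
import time
from collections import OrderedDict


class TtlLruCache:
    """Hypothetical cache combining a per-entry TTL with LRU eviction."""

    def __init__(self, max_entries, ttl_seconds, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._items = OrderedDict()  # key -> (expires_at, value), least recent first

    def get(self, key):
        item = self._items.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self.clock() >= expires_at:
            del self._items[key]  # expired: treat as a miss and drop the entry
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self._items[key] = (self.clock() + self.ttl_seconds, value)
        self._items.move_to_end(key)
        while len(self._items) > self.max_entries:
            self._items.popitem(last=False)  # evict the least recently used entry
```

A longer `ttl_seconds` corresponds to the "days or weeks" end of the range described above, and is only safe when the underlying data changes slowly.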

In essence, VCAL bridges the gap between data retrieval and cognitive efficiency, turning past computation into future acceleration.

How I Created a Semantic Cache Library for AI

· 4 min read
Founder of VCAL Project

Originally published on Dev.to on October 27, 2025.
Read the Dev.to version


Have you ever wondered why LLM apps get slower and more expensive as they scale, even though 80% of user questions sound pretty similar? That’s exactly what got me thinking recently: why are we constantly asking the model the same thing?

That question led me down the rabbit hole of semantic caching, and eventually to building VCAL (Vector Cache-as-a-Library), an open-source project that helps AI apps remember what they’ve already answered.


The “Eureka!” Moment

It started while optimizing an internal support chatbot that ran on top of a local LLM. Logs showed hundreds of near-identical queries:

“How do I request access to the analytics dashboard?”
“Who approves dashboard access for my team?”
“My access to analytics was revoked — how do I get it back?”

Each one triggered a full LLM inference: embedding the query, generating a new answer, and consuming hundreds of tokens even though all three questions meant the same thing.

So I decided to create a simple library that would embed each question, compare it with the questions submitted earlier, and, if it was similar enough, return the stored answer, all before the model is ever called.

I wrote a prototype in Rust — for performance and reliability — and designed it as a small vcal-core open-source library that any app could embed.

The first version of VCAL could:

  • Store and search vector embeddings in RAM using HNSW graph indexing
  • Handle TTL and LRU evictions automatically
  • Save snapshots to disk so it could restart fast
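The snapshot idea from the list above can be sketched in a few lines. This is not vcal-core's on-disk format or API (the library itself is written in Rust): the function names and the JSON layout are assumptions, and a real snapshot would also need to persist or rebuild the HNSW index.

```python
import json
import os
import tempfile


def save_snapshot(entries, path):
    """Write cache entries to disk atomically: temp file first, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(entries, handle)
    # os.replace swaps the file in one step, so a crash mid-write
    # never leaves a half-written snapshot at `path`.
    os.replace(tmp_path, path)


def load_snapshot(path):
    """Load entries saved earlier; start with an empty cache if none exist."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Here `entries` is assumed to be a list of JSON-serializable records such as `{"vector": [0.1, 0.2], "answer": "..."}`. On restart, loading the snapshot restores the cached answers without re-embedding anything.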

Later came VCAL Server, a drop-in HTTP API version for teams that wanted to cache answers across multiple services, deployed either on-prem or in the cloud.

Screenshot: Grafana dashboard showing cache hits and cost savings