How I Created a Semantic Cache Library for AI
Originally published on Dev.to on October 27, 2025.

Have you ever wondered why LLM apps get slower and more expensive as they scale, even though 80% of user questions sound pretty similar? That’s exactly what got me thinking recently: why are we constantly asking the model the same thing?
That question led me down the rabbit hole of semantic caching, and eventually to building VCAL (Vector Cache-as-a-Library), an open-source project that helps AI apps remember what they’ve already answered.
The “Eureka!” Moment
It started while optimizing an internal support chatbot that ran on top of a local LLM. Logs showed hundreds of near-identical queries:
“How do I request access to the analytics dashboard?”
“Who approves dashboard access for my team?”
“My access to analytics was revoked — how do I get it back?”
Each one triggered a full LLM inference: embedding the query, generating a new answer, and consuming hundreds of tokens, even though all three questions were asking for essentially the same thing.
So I decided to create a simple library that embeds each incoming question, compares it to the questions submitted earlier, and, if one of them is similar enough, returns the stored answer instead of calling the model at all.
I wrote a prototype in Rust, for performance and reliability, and designed it as a small open-source library, vcal-core, that any app could embed.
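To make the idea concrete, here is a minimal sketch of that "check before asking the model" step. It is not vcal-core's actual API: the `SemanticCache` type, its methods, and the similarity threshold are all hypothetical names chosen for illustration, the embeddings are assumed to come from whatever embedding model the app already uses, and the search is a plain linear scan rather than an index.

```rust
/// Cosine similarity between two vectors of equal length.
/// Returns 0.0 for mismatched lengths or zero vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() || a.is_empty() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

/// Hypothetical in-memory cache: a flat list of (embedding, answer) pairs.
struct SemanticCache {
    entries: Vec<(Vec<f32>, String)>,
    /// Minimum cosine similarity for a stored answer to count as a hit.
    threshold: f32,
}

impl SemanticCache {
    fn new(threshold: f32) -> Self {
        Self { entries: Vec::new(), threshold }
    }

    /// Store the answer the model produced for a question's embedding.
    fn insert(&mut self, embedding: Vec<f32>, answer: String) {
        self.entries.push((embedding, answer));
    }

    /// Return the most similar stored answer if it meets the threshold,
    /// or None, in which case the caller falls back to the model.
    fn lookup(&self, query: &[f32]) -> Option<&str> {
        self.entries
            .iter()
            .map(|(emb, ans)| (cosine_similarity(emb, query), ans))
            .filter(|(score, _)| *score >= self.threshold)
            .max_by(|a, b| a.0.total_cmp(&b.0))
            .map(|(_, ans)| ans.as_str())
    }
}
```

On a miss, the app would call the model as usual and then `insert` the new embedding and answer, so the next similar question becomes a hit. The threshold is the main tuning knob in a design like this: set it too low and users get answers to questions they did not ask, too high and the cache rarely hits.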
The first version of VCAL could:
- Store and search vector embeddings in RAM using HNSW graph indexing
- Handle TTL and LRU evictions automatically
- Save snapshots to disk so it could restart fast
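As a rough illustration of how TTL and LRU eviction can work together, here is a small sketch. It is an assumption-laden toy, not how VCAL implements eviction: the `EvictingCache` and `Entry` types are hypothetical, entries are keyed by a plain integer ID instead of a vector, and the current time is passed in explicitly so the behavior is easy to follow.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// One cached answer plus the timestamps that eviction depends on.
struct Entry {
    answer: String,
    inserted_at: Instant, // drives TTL expiry
    last_used: Instant,   // drives LRU eviction
}

/// Hypothetical cache with a maximum size and a per-entry time to live.
struct EvictingCache {
    entries: HashMap<u64, Entry>,
    capacity: usize,
    ttl: Duration,
}

impl EvictingCache {
    fn new(capacity: usize, ttl: Duration) -> Self {
        Self { entries: HashMap::new(), capacity, ttl }
    }

    /// Drop every entry whose age has reached the TTL.
    fn evict_expired(&mut self, now: Instant) {
        let ttl = self.ttl;
        self.entries
            .retain(|_, e| now.duration_since(e.inserted_at) < ttl);
    }

    /// Insert an entry; if the cache is full, first evict the entry
    /// that was used least recently.
    fn put(&mut self, id: u64, answer: String, now: Instant) {
        self.evict_expired(now);
        if !self.entries.contains_key(&id) && self.entries.len() >= self.capacity {
            let victim = self
                .entries
                .iter()
                .min_by_key(|(_, e)| e.last_used)
                .map(|(k, _)| *k);
            if let Some(victim) = victim {
                self.entries.remove(&victim);
            }
        }
        self.entries
            .insert(id, Entry { answer, inserted_at: now, last_used: now });
    }

    /// Fetch an entry and mark it as recently used.
    fn get(&mut self, id: u64, now: Instant) -> Option<&str> {
        self.evict_expired(now);
        let entry = self.entries.get_mut(&id)?;
        entry.last_used = now;
        Some(entry.answer.as_str())
    }
}
```

The linear scan for the LRU victim keeps the sketch short; a production cache would more likely track recency in an ordered structure so eviction does not have to visit every entry.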
Later came VCAL Server, a drop-in HTTP API version for teams that wanted to share cached answers across multiple services, deployed either on-prem or in the cloud.
