2 posts tagged with "startups"

How to Reduce OpenAI API Costs with Semantic Caching

March 21, 2026 · 6 min read

Founder of VCAL Project

Originally published on Medium.com on March 21, 2026.

A simple OpenAI-compatible gateway that eliminates duplicate requests and cuts token usage

While working on LLM-powered tools for a customer, I kept seeing something that didn’t feel right.

Users were asking the same or similar questions again and again. Support queries repeated. Internal assistants received nearly identical prompts. Even AI agents were looping through similar requests.

At first, it didn’t look like a problem. That’s just how users behave.

But then I looked at the cost.

Every repeated question meant another API call. Another batch of tokens. Another charge. Over time, it added up to more than I expected.

I realized something simple:

We are paying multiple times for the same answer.

Why Existing Solutions Didn’t Quite Work

Initially, I looked at the available tools.

Redis helped with exact caching, but only when the prompt was identical. The moment a user rephrased the question slightly, the cache missed. “How do I get access to Jira?” and “Cannot get access to Jira” were treated as completely different requests.
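
To make the miss concrete, here is a minimal sketch of exact-match caching. A plain dictionary stands in for Redis, and the names `exact_key`, `store`, and `lookup`, as well as the stored answer text, are placeholders invented for this illustration.

```python
import hashlib
from typing import Optional

# Stand-in for a Redis key/value store (assumption: a dict is enough to show the idea).
cache = {}

def exact_key(prompt: str) -> str:
    """Key the cache on a hash of the raw prompt text."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

def store(prompt: str, answer: str) -> None:
    cache[exact_key(prompt)] = answer

def lookup(prompt: str) -> Optional[str]:
    return cache.get(exact_key(prompt))

# Placeholder answer text, not taken from any real system.
store("How do I get access to Jira?", "Request access through the IT portal.")

same_wording = lookup("How do I get access to Jira?")  # identical string
rephrased = lookup("Cannot get access to Jira")        # different string, different hash
```

Because the key is derived from the exact characters of the prompt, any rewording produces a different key, so the second lookup finds nothing even though the user wants the same answer.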

I also explored RedisVL, which brings vector search capabilities into Redis. It moves in the right direction by combining caching and similarity in one place. But in practice, it still requires setting up embedding flows, defining schemas, tuning similarity thresholds, and integrating it manually into the LLM request pipeline.

Vector databases like Milvus, Weaviate, or Qdrant seemed promising as well. They can detect semantic similarity effectively, but integrating them into the request flow means building additional pipelines, managing embeddings, and writing glue code.

All of these tools are powerful, but they aren’t simple.

More importantly, none of them are designed as a drop-in layer in front of an LLM API. There was no unified solution that combined caching, semantic matching, and cost awareness in one place.
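
As a rough illustration of what a drop-in layer with cost awareness could look like, the sketch below wraps a model call, serves repeats from a cache, and counts the tokens it avoided spending. Everything here is an assumption made for the example: the `CachingGateway` class, the `call_model` callable returning `(text, tokens_used)`, and the simple normalized-text lookup, which a semantic matcher would replace in a real system.

```python
class CachingGateway:
    """Sits in front of a model call and tracks what the cache saved."""

    def __init__(self, call_model):
        # Hypothetical callable: takes a prompt, returns (text, tokens_used).
        self.call_model = call_model
        self.store = {}        # normalized prompt -> (text, tokens_used)
        self.hits = 0
        self.misses = 0
        self.tokens_saved = 0

    @staticmethod
    def _normalize(prompt):
        # Lowercase and collapse whitespace; a stand-in for semantic matching.
        return " ".join(prompt.lower().split())

    def complete(self, prompt):
        key = self._normalize(prompt)
        if key in self.store:
            text, tokens = self.store[key]
            self.hits += 1
            self.tokens_saved += tokens  # tokens the original call cost
            return text
        text, tokens = self.call_model(prompt)
        self.misses += 1
        self.store[key] = (text, tokens)
        return text
```

The caller only ever sees `complete(prompt)`, so the application code does not change whether the answer came from the cache or from the model, and the counters give a running view of the savings.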

How I Created a Semantic Cache Library for AI

October 27, 2025 · 4 min read

Sergey Lunev

Founder of VCAL Project

Originally published on Dev.to on October 27, 2025.


Have you ever wondered why LLM apps get slower and more expensive as they scale, even though 80% of user questions sound pretty similar? That’s exactly what got me thinking recently: why are we constantly asking the model the same thing?

That question led me down the rabbit hole of semantic caching, and eventually to building VCAL (Vector Cache-as-a-Library), an open-source project that helps AI apps remember what they’ve already answered.

The “Eureka!” Moment

It started while optimizing an internal support chatbot that ran on top of a local LLM. Logs showed hundreds of near-identical queries:

“How do I request access to the analytics dashboard?”
“Who approves dashboard access for my team?”
“My access to analytics was revoked — how do I get it back?”

Each one triggered a full LLM inference: embedding the query, generating a new answer, and consuming hundreds of tokens even though all three questions meant the same thing.

So I decided to create a simple library that would embed each question, compare it to the questions submitted earlier, and, if a match is similar enough, return the stored answer before the model is ever called.
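
The flow described above can be sketched in a few lines of Python. This is not VCAL's code: `embed` is a toy word-count stand-in for a real embedding model, the lookup is a linear scan rather than an index, and the `SemanticCache` class, `answer_question` helper, `call_model` callable, and the 0.5 threshold are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class SemanticCache:
    def __init__(self, threshold=0.5):
        self.threshold = threshold  # hypothetical value; tune per workload
        self.entries = []           # list of (vector, stored answer)

    def get(self, question):
        query = embed(question)
        best_score, best_stored = 0.0, None
        for vector, stored in self.entries:  # linear scan, fine for a sketch
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_stored = score, stored
        return best_stored if best_score >= self.threshold else None

    def put(self, question, stored):
        self.entries.append((embed(question), stored))

def answer_question(question, cache, call_model):
    """Check the cache first; call the model only on a miss."""
    cached = cache.get(question)
    if cached is not None:
        return cached
    fresh = call_model(question)
    cache.put(question, fresh)
    return fresh
```

The word-count embedding only catches overlapping vocabulary, so it will miss paraphrases that share few words. A real embedding model is what lets differently worded questions land close together, and the threshold then controls how aggressive the cache is.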

I wrote a prototype in Rust — for performance and reliability — and designed it as a small vcal-core open-source library that any app could embed.

The first version of VCAL could:

Store and search vector embeddings in RAM using HNSW graph indexing
Handle TTL and LRU evictions automatically
Save snapshots to disk so it could restart fast
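
For readers unfamiliar with the eviction terms in that list, here is a minimal sketch of how TTL and LRU can work together. It is written in Python for brevity and is not taken from vcal-core; the `TtlLruStore` class and its parameters are hypothetical, and the injectable `clock` exists only to make the behavior easy to test.

```python
import time
from collections import OrderedDict

class TtlLruStore:
    """Bounded store: entries expire after a TTL, and the least recently
    used entry is dropped when the store is over capacity."""

    def __init__(self, max_entries, ttl_seconds, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._items = OrderedDict()  # key -> (expires_at, value), oldest use first

    def put(self, key, value):
        self._items[key] = (self.clock() + self.ttl_seconds, value)
        self._items.move_to_end(key)          # newest use goes to the end
        while len(self._items) > self.max_entries:
            self._items.popitem(last=False)   # LRU eviction

    def get(self, key):
        item = self._items.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self.clock() >= expires_at:
            del self._items[key]              # TTL eviction
            return None
        self._items.move_to_end(key)          # reading counts as a use
        return value
```

TTL keeps stale answers from living forever, while LRU caps memory by discarding whatever has gone longest without being read or written.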

Later came VCAL Server, a drop-in HTTP API version for teams that wanted to cache answers across multiple services, deployed on-prem or in the cloud.

Screenshot: Grafana dashboard showing cache hits and cost savings
