Skip to main content

7 posts tagged with "devops"

View All Tags

AI Cost Firewall vs. VCAL Server: What’s the Difference and How to Choose

· 10 min read
Founder of VCAL Project

Cover

Many customers ask us about the difference between AI Cost Firewall and VCAL Server, and how to choose between them.

The two products are not the same. They differ in purpose, integration style, and the level of control they give to application teams. But they are built around the same core idea: helping AI applications avoid repeated LLM work, improve visibility into AI usage, and reduce unnecessary token spend.

This article is a simplified guide to the purpose of both products, the differences between them, and the situations where each product fits best.

Intro

In many AI applications, users repeatedly ask questions that are not completely identical, but are close enough for the system to reuse a previous answer. A support chatbot may receive the same question in different wording. An internal assistant may repeatedly summarize similar documents. A RAG system may answer the same operational question many times for different users.

This is where caching and control layers become useful.

The VCAL project includes two related but different products for this problem: AI Cost Firewall and VCAL Server.

They both help reduce repeated LLM work and unnecessary token spend, but they are designed for different integration styles and different use cases.


The simple difference

The simplest way to explain the difference is this:

AI Cost Firewall is a gateway.

AI Cost Firewall sits between an application and an LLM provider. It is designed for OpenAI-compatible APIs, so an existing application can often start using it with minimal changes.

VCAL Server is a semantic cache service.

VCAL Server is different. It does not try to behave like OpenAI, Anthropic, Gemini, or any other LLM provider. Instead, it exposes its own REST API for semantic cache operations. Applications integrate with it directly when teams want more control over how cache lookup, reuse, storage, and decision logic work.

Both approaches are useful, but they solve the problem at different levels.


What AI Cost Firewall does

AI Cost Firewall is an OpenAI-compatible LLM gateway.

In a typical setup, the application sends requests to AI Cost Firewall instead of sending them directly to the LLM provider.

The flow looks like this:

Application

AI Cost Firewall

LLM provider

AI Cost Firewall can check whether a request is already known, whether an exact or semantically similar answer exists, and whether the upstream LLM call can be avoided.

For applications already using OpenAI-compatible APIs, the integration can be relatively simple. In many cases, the customer mainly changes the API base URL and API key configuration.

For example, instead of sending requests directly to an OpenAI-compatible provider, the application sends them to AI Cost Firewall first.

This makes AI Cost Firewall a good fit when the goal is to add a cost and cache control layer without deeply changing the application logic.


How VCAL Server is different

VCAL Server is a standalone semantic cache server. It is provider-agnostic. It does not depend on a specific LLM provider, chat API, or model vendor. Instead, VCAL Server works with vectors, cache entries, answer payloads, similarity search, and cache-management logic.

A typical VCAL Server flow looks like this:

User question

Application creates an embedding

Application searches VCAL Server

If a good cached answer exists, VCAL Server returns it

If not, application calls the LLM provider

VCAL Server stores the new question and answer

This means VCAL Server can be used with OpenAI, Anthropic, Gemini, Mistral, Ollama, Azure OpenAI, OpenRouter, local models, or custom internal AI systems.

The important requirement is that the application must generate or provide embeddings that match the vector dimension configured in VCAL Server.

For example:

Embedding model: 768 dimensions
VCAL Server configuration: 768 dimensions

VCAL Server does not care which LLM provider generates the final answer. It only needs the vector representation and the cache payload that the application wants to store.


Integration difference

The biggest practical difference is how customers integrate the products.

AI Cost Firewall is designed for low-friction adoption in OpenAI-compatible applications. With AI Cost Firewall, the application can often continue using a familiar OpenAI-compatible request pattern. The firewall handles caching and cost-control logic in the gateway layer.

VCAL Server requires explicit application-level integration. With VCAL Server, the customer’s application needs to add cache lookup and storage logic. The application decides when to call VCAL Server, when to reuse a cached answer, when to call the LLM, and when to store a new answer.

Below is a simplified comparison:

AreaAI Cost FirewallVCAL Server
Main roleOpenAI-compatible gatewayStandalone semantic cache service
Integration styleGateway / proxy-styleDirect application integration
Application changesUsually minimal for OpenAI-compatible appsUsually required
Provider dependencyBest suited for OpenAI-compatible APIsProvider-agnostic
Cache controlManaged by the gatewayControlled by the application
Best fitFast adoption and pilot deploymentsCustom systems and deeper control
PurposeFast integration into existing OpenAI-compatible chatbots and AI applicationsHigh-performance semantic cache for custom AI systems

This distinction is important.

AI Cost Firewall is easier when the customer wants to add caching and cost visibility quickly.

VCAL Server is better when the customer wants more direct control over semantic search, cache behavior, thresholds, storage, and integration logic.


Is VCAL Server provider-agnostic?

Yes, VCAL Server is provider-agnostic because it works below the provider API level. It does not need to know whether the answer comes from OpenAI, Anthropic, Gemini, Mistral, Ollama, or a private model.

The customer's application is responsible for three things:

  1. Creating embeddings.
  2. Calling VCAL Server for cache lookup and storage.
  3. Calling the selected LLM provider when there is no suitable cached answer.

This makes VCAL Server useful in environments where teams use different LLM providers or want to avoid being tied to one API style.


Where AI Cost Firewall can be deployed

AI Cost Firewall is useful when the main goal is to reduce repeated LLM calls with minimal integration effort.

It is best suited for applications that already use OpenAI-compatible APIs and where teams want to add caching, cost visibility, and control without deeply changing the application code.

Common deployment areas include:

AI chatbots
Customer support bots, internal assistants, and FAQ systems often receive repeated or similar questions. AI Cost Firewall can reduce unnecessary upstream LLM calls and improve response speed.

OpenAI-compatible AI applications
Many applications already use OpenAI-compatible clients or gateways. AI Cost Firewall fits naturally into this pattern because it is designed around OpenAI-compatible API behavior.

Pilot AI deployments
Teams can evaluate LLM cost savings, caching behavior, and observability without rewriting the application. This makes AI Cost Firewall a practical starting point for controlled pilots.

In simple terms, AI Cost Firewall is best suited for situations where the customer wants a practical control layer in front of LLM traffic with minimal application changes.


Where VCAL Server can be deployed

VCAL Server is useful when the customer wants a dedicated semantic cache service with direct control over integration logic.

It works especially well in systems where the application team wants to decide how cache lookup, similarity thresholds, answer reuse, metadata, and storage should work.

Common deployment areas include:

RAG systems
Retrieval-augmented generation systems often receive repeated questions over the same knowledge base. VCAL Server can help reuse previous answers or decisions when the semantic meaning is close enough.

Enterprise knowledge assistants
Internal assistants for documentation, policies, procedures, and technical support can benefit from semantic caching because many users ask similar questions in different ways.

Agent workflows
AI agents often repeat planning, classification, summarization, or decision steps. VCAL Server can cache these intermediate or final results.

Custom chatbot backends
Teams building their own chatbot logic can use VCAL Server when they need more control than a gateway-only approach. The application decides how to use similarity scores, thresholds, metadata, and stored answers.

On-prem and private infrastructure
VCAL Server can run inside the customer’s own infrastructure. This is useful when data locality, internal control, and private deployment are important.

Provider-independent AI systems
Because VCAL Server is provider-agnostic, it can support applications using different LLM providers, OpenAI-compatible APIs, Anthropic, Gemini, Ollama, local models, or custom internal systems.

In simple terms, VCAL Server is best suited for situations where the customer wants a flexible and powerful semantic cache that the application controls directly.


When to choose AI Cost Firewall (a simple guide)

AI Cost Firewall is usually the better starting point when the customer wants a faster and simpler integration path.

It is especially useful when:

  • the application already uses OpenAI-compatible APIs
  • you want minimal code changes
  • the main goal is to reduce repeated LLM calls
  • the team wants cost visibility and cache metrics quickly
  • the application architecture is suitable for a gateway layer

In simple terms, choose AI Cost Firewall when you want a practical gateway for reducing repeated LLM traffic with minimal or no application changes.


When you need VCAL Server

VCAL Server is usually the better choice when the customer wants more control and is ready to integrate caching directly into the application workflow.

It is especially useful when:

  • the application uses multiple LLM providers
  • you need provider-agnostic caching
  • the team wants direct control over similarity thresholds and cache behavior
  • the system is a custom RAG, agent, or chatbot backend
  • you want a standalone semantic cache inside your own infrastructure

In simple terms, choose VCAL Server when you want a flexible and powerful semantic cache service that your application controls directly.


Compliance-conscious deployment

For many organizations, reducing LLM cost is only part of the problem. AI systems also need to fit internal security, privacy, governance, and regulatory requirements.

AI Cost Firewall and VCAL Server are designed to support compliance-conscious deployments.

They do not force teams to send their data through an uncontrolled external layer. Both products can be deployed inside the customer’s own infrastructure, where the organization can control network access, API keys, logs, metrics, retention policies, and operational monitoring.

This is especially important for teams working with customer support data, internal documentation, or sensitive operational information.


Can both products be used together?

Yes, in some architectures they can complement each other.

AI Cost Firewall can provide a gateway layer for OpenAI-compatible traffic, while VCAL Server can provide a deeper semantic cache service for custom workflows, RAG systems, or application-controlled cache operations.

However, they do not have to be used together. Each product has its own purpose:

  • AI Cost Firewall is focused on easy gateway-style adoption
  • VCAL Server is focused on flexible, provider-agnostic semantic cache infrastructure

Final thoughts

AI Cost Firewall and VCAL Server solve related problems, but they are not the same product.

AI Cost Firewall is an OpenAI-compatible gateway that helps applications reduce repeated LLM calls with minimal changes.

VCAL Server is a provider-agnostic semantic cache service that gives applications direct control over cache lookup, reuse, storage, and integration logic.

Both products are designed for real AI workloads where repeated or similar requests create unnecessary cost, latency, and infrastructure load. The right choice depends on the customer’s architecture.

If the goal is fast adoption with an OpenAI-compatible application, AI Cost Firewall is usually the better starting point.

If the goal is deeper control, provider independence, and custom integration, VCAL Server is the better fit.

I hope this guide helps you choose the right product for your AI application.

If you need more details, you can read the VCAL documentation or contact us directly.

Practical Pilot Deployments with AI Cost Firewall v0.1.9

· 7 min read
Founder of VCAL Project

Cover

AI Cost Firewall began as a lightweight OpenAI-compatible gateway designed to reduce LLM cost and latency through exact and semantic caching. Over time, the project evolved beyond simple request reuse and gradually became a broader operational layer for AI infrastructure.

Recent releases introduced semantic cache lifecycle management, provider flexibility, improved Prometheus and Grafana observability, configuration diagnostics, and detailed cost accounting. Those features improved the technical capabilities of the system significantly, but another important question remained:

How quickly can somebody actually deploy and evaluate the system in a real environment?

That question became the main focus of v0.1.9.

Unlike earlier releases that concentrated on internal infrastructure features, v0.1.9 focuses primarily on operational polish. The goal of this release is to reduce the friction between discovering the project and successfully running it with dashboards, cache reuse, and observable semantic behavior.

In practice, this means clearer deployment patterns, better onboarding, improved startup diagnostics, more actionable provider error messages, and significantly expanded operational documentation.

Not All “AI Security” Is the Same: Application Layer vs AI Cost Firewall

· 6 min read
Founder of VCAL Project

Originally published on Medium.com on April 22, 2026.
Read the Medium.com version

Cover

As LLM applications move from demos into production, many teams double down on one thing: prompt security. They refine system prompts, add guardrails, introduce moderation, and carefully control how users interact with the model. And yet, once real traffic arrives, something unexpected happens.

At first, everything works. The demo is smooth, responses are fast, costs are negligible.

But soon real usage begins. Costs spike, latency becomes inconsistent, errors become harder to understand, deployments start affecting live requests in subtle ways.

Nothing is obviously broken, but the system no longer feels predictable.


The Application Layer: Controlling Meaning and Behavior

The application layer is where the logic of an AI product lives. It defines how prompts are constructed, how users interact with the system, and what the model is allowed to do.

This is where most teams focus first — and for good reason. Here, you are dealing with meaning, intent, and safety.

At this layer, the focus is on controlling what the model is allowed to do. In practice, that translates into questions like:

  • Can a user manipulate the model through prompt injection?
  • Can sensitive data leak through responses?
  • Are outputs aligned with policy and expectations?

To solve this, teams build a combination of structural and defensive controls:

  • Structured prompts and system messages
  • Input validation and sanitization
  • Output filtering and moderation
  • Access control and business logic

These mechanisms are essential. Without them, the system is exposed at the semantic level.

In short, the application layer protects what the model means and does.

Reducing LLM Costs Is Easy — Until Production Starts

· 5 min read
Founder of VCAL Project

Originally published on Dev.to on April 13, 2026.
Read the Dev.to version

A month ago, I wrote about reducing LLM costs using caching.

The idea is simple: don’t send the same or similar request to the model twice.

It works well in demos. It even works well in early testing.

And then production starts.


Production Reality: Where LLM Systems Start Breaking

At first, everything looks under control. Requests are small, traffic is predictable, and caching delivers immediate savings. You see fewer calls to the model and faster responses. It feels like the problem is solved.

But real systems don’t stay simple for long.

Prompts begin to grow. What used to be a short question turns into a long conversation with accumulated context, system instructions, and sometimes entire documents pasted by users. Requests become heavier, slower, and more expensive in ways that caching alone cannot fix.

At the same time, failures start to blur together. A timeout, a malformed request, and an upstream provider error all look the same from the outside. Without clear separation, debugging becomes guesswork, and cost anomalies become difficult to explain.

Then there’s latency. A request times out — but what actually happened? Was the provider slow? Did the request even reach it? Should you retry it or not? Without visibility into upstream behavior, you’re operating blind.

Even semantic caching, which looks almost magical at first, becomes a tuning problem. Similarity thresholds that worked in testing suddenly feel off. Some responses are reused too aggressively, others not at all. Without insight into what the system is actually doing, you’re left adjusting numbers and hoping for the best. This is all similar to how prompts are tuned — but here, the feedback loop is missing.

Finally, the moment that exposes everything: deployment.

You restart the service during traffic, and suddenly there are dropped requests, inconsistent responses, and unpredictable behavior. What worked perfectly in isolation now reveals gaps in lifecycle handling.

How to Reduce OpenAI API Costs with Semantic Caching

· 6 min read
Founder of VCAL Project

Originally published on Medium.com on March 21, 2026.
Read the Medium.com version

A simple OpenAI-compatible gateway that eliminates duplicate requests and cuts token usage

While working on LLM-powered tools for my customer, I kept seeing something that didn’t feel right.

Users were asking the same or similar questions again and again. Support queries repeated. Internal assistants received nearly identical prompts. Even AI agents were looping through similar requests.

At first, it didn’t look like a problem. That’s just how users behave.

But then I looked at the cost.

Every repeated question meant another API call. Another batch of tokens. Another charge. Over time, it added up more than was expected.

I realized something simple:

We are paying multiple times for the same answer.


Why Existing Solutions Didn’t Quite Work

Initially I looked at the available tools.

Redis helped with exact caching, but only when the prompt was identical. The moment a user rephrased the question slightly, the cache missed. “How do I get access to Jira?” and “Cannot get access to Jira” were treated as completely different requests.

I also explored RedisVL, which brings vector search capabilities into Redis. It moves in the right direction by combining caching and similarity in one place. But in practice, it still requires setting up embedding flows, defining schemas, tuning similarity thresholds, and integrating it manually into the LLM request pipeline.

Vector databases like Milvus, Weaviate, or Qdrant seemed promising as well. They can detect semantic similarity effectively, but integrating them into the request flow means building additional pipelines, managing embeddings, and writing glue code.

All of these tools are powerful, but they aren’t simple.

More importantly, none of them are designed as a drop-in layer in front of an LLM API. There was no unified solution that combined caching, semantic matching, and cost awareness in one place.

AI Cost Firewall: An OpenAI-Compatible Gateway That Cuts LLM Costs by 75%

· 9 min read
Founder of VCAL Project

Originally published on Dev.to on March 16, 2026.
Read the Dev.to version

Exact + semantic caching for AI applications


In today’s era of AI adoption, there is a distinct shift from integrating AI solutions into business processes to controlling the costs, be it the costs of a cloud solution, a local LLM deployment, or the cost of tokens spent in chatbots. If your solution includes repeated questions and uses an OpenAI-compatible model, and if you are looking for a simple, free and effective way to immediately cut your company’s daily token costs, there is one infrastructural solution that does it right out of the box.

AI Cost Firewall is a free open-source API gateway that decides which requests actually need to reach the LLM and which can be answered from previous results without additional token costs.

The gateway consists of a Rust-based firewall “decider”, a Redis database, a Qdrant vector store, Prometheus for metrics scraping, and Grafana for monitoring. All the tools are deployed with a single docker compose command and are available for use in less than a minute.

Once deployed, AI Cost Firewall sits transparently between your application and the LLM provider. Your chatbot, AI assistant, or internal automation continues to send requests exactly the same way as before with the only difference that the API endpoint now points to the firewall instead of directly to the model provider. The firewall then performs an instant check before deciding whether the request should actually reach the LLM and raise your monthly bill.

How I Created a Semantic Cache Library for AI

· 4 min read
Founder of VCAL Project

Originally published on Dev.to on October 27, 2025.
Read the Dev.to version

Cover

Have you ever wondered why LLM apps get slower and more expensive as they scale, even though 80% of user questions sound pretty similar? That’s exactly what got me thinking recently: why are we constantly asking the model the same thing?

That question led me down the rabbit hole of semantic caching, and eventually to building VCAL (Vector Cache-as-a-Library), an open-source project that helps AI apps remember what they’ve already answered.


The “Eureka!” Moment

It started while optimizing an internal support chatbot that ran on top of a local LLM. Logs showed hundreds of near-identical queries:

“How do I request access to the analytics dashboard?”
“Who approves dashboard access for my team?”
“My access to analytics was revoked — how do I get it back?”

Each one triggered a full LLM inference: embedding the query, generating a new answer, and consuming hundreds of tokens even though all three questions meant the same thing.

So I decided to create a simple library that would embed each question, compare it to what was submitted earlier, and if it’s similar enough, return the stored answer instead of generating an LLM response, all this before asking the model.

I wrote a prototype in Rust — for performance and reliability — and designed it as a small vcal-core open-source library that any app could embed.

The first version of VCAL could:

  • Store and search vector embeddings in RAM using HNSW graph indexing
  • Handle TTL and LRU evictions automatically
  • Save snapshots to disk so it could restart fast

Later came VCAL Server, a drop-in HTTP API version for teams that wanted to cache answers across multiple services while deploying it on-prem or in a cloud.

Screenshot: Grafana dashboard showing cache hits and cost saving