
5 posts tagged with "artificial-intelligence"


Practical Pilot Deployments with AI Cost Firewall v0.1.9

· 7 min read
Founder of VCAL Project


AI Cost Firewall began as a lightweight OpenAI-compatible gateway designed to reduce LLM cost and latency through exact and semantic caching. Over time, the project evolved beyond simple request reuse and gradually became a broader operational layer for AI infrastructure.

Recent releases introduced semantic cache lifecycle management, provider flexibility, improved Prometheus and Grafana observability, configuration diagnostics, and detailed cost accounting. Those features improved the technical capabilities of the system significantly, but another important question remained:

How quickly can somebody actually deploy and evaluate the system in a real environment?

That question became the main focus of v0.1.9.

Unlike earlier releases that concentrated on internal infrastructure features, v0.1.9 focuses primarily on operational polish. The goal of this release is to reduce the friction between discovering the project and successfully running it with dashboards, cache reuse, and observable semantic behavior.

In practice, this means clearer deployment patterns, better onboarding, improved startup diagnostics, more actionable provider error messages, and significantly expanded operational documentation.

Not All “AI Security” Is the Same: Application Layer vs AI Cost Firewall

· 6 min read
Founder of VCAL Project

Originally published on Medium.com on April 22, 2026.


As LLM applications move from demos into production, many teams double down on one thing: prompt security. They refine system prompts, add guardrails, introduce moderation, and carefully control how users interact with the model. And yet, once real traffic arrives, something unexpected happens.

At first, everything works. The demo is smooth, responses are fast, costs are negligible.

But soon real usage begins. Costs spike, latency becomes inconsistent, errors become harder to understand, and deployments start affecting live requests in subtle ways.

Nothing is obviously broken, but the system no longer feels predictable.


The Application Layer: Controlling Meaning and Behavior

The application layer is where the logic of an AI product lives. It defines how prompts are constructed, how users interact with the system, and what the model is allowed to do.

This is where most teams focus first — and for good reason. Here, you are dealing with meaning, intent, and safety.

In practice, controlling behavior at this layer translates into questions like:

  • Can a user manipulate the model through prompt injection?
  • Can sensitive data leak through responses?
  • Are outputs aligned with policy and expectations?

To solve this, teams build a combination of structural and defensive controls:

  • Structured prompts and system messages
  • Input validation and sanitization
  • Output filtering and moderation
  • Access control and business logic

These mechanisms are essential. Without them, the system is exposed at the semantic level.

In short, the application layer protects what the model means and does.

What Actually Matters When You Try to Reduce LLM Costs

· 7 min read
Founder of VCAL Project

Originally published on Medium.com on April 5, 2026.

After publishing the first release of AI Cost Firewall, I thought the hard part was done.

The idea was simple and it worked immediately: avoid sending duplicate or semantically similar requests to the LLM, and you reduce cost.

I described that initial approach in more detail in “How to Reduce OpenAI API Costs with Semantic Caching”.

And it did work.

But once I started pushing it further — adding more metrics, handling edge cases, running real traffic through it — it became clear that the initial idea was only a small part of the problem.

Reducing LLM cost is not just about caching. It’s about understanding where the cost actually comes from, what “savings” really mean, and what begins to break when a system moves from a controlled demo into something closer to production.


The First Insight Still Holds

The original observation hasn’t changed, and neither has the core architecture. The system still solves the same underlying problem.

  • Users repeat themselves.
  • Applications repeat themselves.
  • Agents repeat themselves.

Often the wording changes slightly, but the intent remains the same. From the model’s perspective, however, every variation is a brand new request. And every request has a cost.

So yes — caching works. It reduces cost immediately, often without any changes to the application itself.

But that’s only the surface. The deeper questions only appear once you try to rely on it.


The First Misconception: “Caching Is Free”

In the beginning, the results looked almost too good.

In a demo environment, exact cache hits dominated. When a request hit the cache, it meant no API call, no tokens, and almost zero latency. It felt like pure gain, as if cost reduction came with no trade-offs.

That illusion disappears the moment you introduce semantic caching properly, because semantic caching requires embeddings.

To determine whether two requests are similar, you first need to convert them into vectors. That means calling an embedding model, storing the result, and comparing it against existing data. Only then can you decide whether to reuse a response or forward the request to the LLM.
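The lookup flow described here can be sketched in a few lines. This is a minimal illustration, not the project's implementation: `toy_embed` is a hypothetical stand-in for a real embedding model (it just counts words), and `SemanticCache`, the linear scan, and the threshold value are all assumptions chosen for readability.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Stores (vector, response) pairs and reuses a response on a close match."""

    def __init__(self, threshold: float):
        self.threshold = threshold
        self.entries = []  # list of (vector, response) tuples

    def store(self, prompt: str, response: str) -> None:
        self.entries.append((toy_embed(prompt), response))

    def lookup(self, prompt: str):
        """Return the best cached response, or None to signal a forward to the LLM."""
        vector = toy_embed(prompt)
        best_score, best_response = 0.0, None
        for cached_vector, response in self.entries:
            score = cosine(vector, cached_vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None
```

In a real deployment the embedding step would be a paid model call and the scan would be a vector index, which is exactly where the extra cost comes from.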

And embeddings are not free.

At that point, the equation changes:

Net savings = avoided LLM cost − embedding cost

This is where things become more delicate.

If your similarity threshold is too low, you generate embeddings too often. If your traffic is highly unique, most of those embeddings never lead to a cache hit. If your embedding model is expensive, the optimization starts working against you.
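To make the trade-off concrete, here is a small worked sketch of the net savings formula. The prices and request counts are made-up numbers, not real provider pricing, and the model assumes that every request is embedded once and that each cache hit avoids exactly one LLM call.

```python
def net_savings(requests: int, hit_rate: float,
                llm_cost: float, embedding_cost: float) -> float:
    """Net savings = avoided LLM cost - embedding cost.

    Assumes every request is embedded once and that a cache hit
    avoids exactly one LLM call (simplifications for illustration).
    """
    avoided_llm_cost = requests * hit_rate * llm_cost
    total_embedding_cost = requests * embedding_cost
    return avoided_llm_cost - total_embedding_cost


def break_even_hit_rate(llm_cost: float, embedding_cost: float) -> float:
    """Hit rate at which avoided LLM cost equals embedding cost."""
    return embedding_cost / llm_cost


# Hypothetical per-request prices, not real provider pricing.
LLM_COST = 0.002
EMBEDDING_COST = 0.0001

for rate in (0.30, 0.02):
    print(rate, round(net_savings(10_000, rate, LLM_COST, EMBEDDING_COST), 4))
print("break-even hit rate:", break_even_hit_rate(LLM_COST, EMBEDDING_COST))
```

With these illustrative numbers, a 30% hit rate leaves a clear positive margin, a 2% hit rate turns negative, and the break-even point sits at roughly a 5% hit rate.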

What initially looked like a simple cost reduction mechanism becomes something that requires careful balance.

That was the moment when the project stopped being just a clever shortcut and started behaving like a system that needs tuning.

How to Reduce OpenAI API Costs with Semantic Caching

· 6 min read
Founder of VCAL Project

Originally published on Medium.com on March 21, 2026.

A simple OpenAI-compatible gateway that eliminates duplicate requests and cuts token usage

While working on LLM-powered tools for my customer, I kept seeing something that didn’t feel right.

Users were asking the same or similar questions again and again. Support queries repeated. Internal assistants received nearly identical prompts. Even AI agents were looping through similar requests.

At first, it didn’t look like a problem. That’s just how users behave.

But then I looked at the cost.

Every repeated question meant another API call. Another batch of tokens. Another charge. Over time, it added up to more than I expected.

I realized something simple:

We are paying multiple times for the same answer.


Why Existing Solutions Didn’t Quite Work

Initially, I looked at the available tools.

Redis helped with exact caching, but only when the prompt was identical. The moment a user rephrased the question slightly, the cache missed. “How do I get access to Jira?” and “Cannot get access to Jira” were treated as completely different requests.
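The limitation is easy to see in a minimal sketch of exact caching. Here a plain dictionary stands in for Redis, and `exact_cache_key`, `get_or_call`, and the `call_llm` callback are hypothetical names introduced only for illustration.

```python
import hashlib

# A plain dict stands in for Redis in this sketch.
cache = {}


def exact_cache_key(model: str, prompt: str) -> str:
    """Hash the model name and the exact prompt text into a cache key."""
    payload = f"{model}\n{prompt}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def get_or_call(model: str, prompt: str, call_llm):
    """Return (response, was_cache_hit); call_llm runs only on a miss."""
    key = exact_cache_key(model, prompt)
    if key in cache:
        return cache[key], True
    response = call_llm(prompt)
    cache[key] = response
    return response, False
```

Because the key is a hash of the literal text, “How do I get access to Jira?” and “Cannot get access to Jira” produce different keys, so the second prompt always goes to the model even though the intent is the same.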

I also explored RedisVL, which brings vector search capabilities into Redis. It moves in the right direction by combining caching and similarity in one place. But in practice, it still requires setting up embedding flows, defining schemas, tuning similarity thresholds, and integrating it manually into the LLM request pipeline.

Vector databases like Milvus, Weaviate, or Qdrant seemed promising as well. They can detect semantic similarity effectively, but integrating them into the request flow means building additional pipelines, managing embeddings, and writing glue code.

All of these tools are powerful, but they aren’t simple.

More importantly, none of them are designed as a drop-in layer in front of an LLM API. There was no unified solution that combined caching, semantic matching, and cost awareness in one place.

Why Edge AI Needs Lightweight Semantic Caches — and What Makes Them Hard to Build

· 6 min read
Founder of VCAL Project

Originally published on Medium.com on November 27, 2025.


Today, edge computing is reshaping the way AI systems are deployed. Instead of sending every request to centralized cloud infrastructure, more computation is happening on devices closer to end users. These “edge environments” include IoT gateways, on-premise servers, mobile devices, micro-VMs, serverless functions, and browser-based applications. The appeal is clear: moving computation closer to where data is generated reduces latency, minimizes bandwidth requirements, and allows organizations to satisfy strict data-privacy rules.

At the same time, WebAssembly (WASM) has emerged as a portable, sandboxed runtime for executing code in highly constrained or security-sensitive environments. Originally designed for browsers, WASM now runs in cloud edge workers, serverless platforms, and isolated environments where traditional binaries cannot be executed. These runtimes often restrict access to system capabilities such as networking, threading, or the local filesystem. They operate under strict memory limits, sometimes as low as tens of megabytes, and they prioritize deterministic, predictable execution.

Running AI components at the edge offers clear advantages, but it also introduces its own challenges, especially when applications rely on semantic search, embeddings, or large language models (LLMs).