<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>VCAL Blog</title>
        <link>https://blog.vcal-project.com</link>
        <description>Updates on semantic caching for LLMs, Rust, performance, AI infrastructure, and VCAL products.</description>
        <lastBuildDate>Wed, 03 Jun 2026 00:00:00 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <language>en</language>
        <item>
            <title><![CDATA[AI Cost Firewall vs. VCAL Server: What’s the Difference and How to Choose]]></title>
            <link>https://blog.vcal-project.com/ai-cost-firewall-and-vcal-server-whats-the-difference</link>
            <guid>https://blog.vcal-project.com/ai-cost-firewall-and-vcal-server-whats-the-difference</guid>
            <pubDate>Wed, 03 Jun 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[A practical comparison of AI Cost Firewall and VCAL Server, explaining when to use an OpenAI-compatible LLM gateway and when to use a lower-level semantic cache service.]]></description>
            <content:encoded><![CDATA[<p><img decoding="async" loading="lazy" alt="Cover" src="https://blog.vcal-project.com/assets/images/ai-cost-firewall-and-vcal-server-whats-the-difference-cover-0ff8268f6c5a98320d2a89be7961de54.png" width="1536" height="1024" class="img_ev3q"></p>
<p>Many customers ask us about the difference between <strong>AI Cost Firewall</strong> and <strong>VCAL Server</strong>, and how to choose between them.</p>
<p>The two products are not the same. They differ in purpose, integration style, and the level of control they give to application teams. But they are built around the same core idea: helping AI applications avoid repeated LLM work, improve visibility into AI usage, and reduce unnecessary token spend.</p>
<p>This article is a simplified guide to the purpose of both products, the differences between them, and the situations where each product fits best.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="intro">Intro<a href="https://blog.vcal-project.com/ai-cost-firewall-and-vcal-server-whats-the-difference#intro" class="hash-link" aria-label="Direct link to Intro" title="Direct link to Intro" translate="no">​</a></h2>
<p>In many AI applications, users repeatedly ask questions that are not completely identical, but are close enough for the system to reuse a previous answer. A support chatbot may receive the same question in different wording. An internal assistant may repeatedly summarize similar documents. A RAG system may answer the same operational question many times for different users.</p>
<p>This is where caching and control layers become useful.</p>
<p>The VCAL project includes two related but different products for this problem: <strong>AI Cost Firewall</strong> and <strong>VCAL Server</strong>.</p>
<p>They both help reduce repeated LLM work and unnecessary token spend, but they are designed for different integration styles and different use cases.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-simple-difference">The simple difference<a href="https://blog.vcal-project.com/ai-cost-firewall-and-vcal-server-whats-the-difference#the-simple-difference" class="hash-link" aria-label="Direct link to The simple difference" title="Direct link to The simple difference" translate="no">​</a></h2>
<p>The simplest way to explain the difference is this:</p>
<p><strong>AI Cost Firewall is a gateway.</strong></p>
<p>AI Cost Firewall sits between an application and an LLM provider. It is designed for OpenAI-compatible APIs, so an existing application can often start using it with minimal changes.</p>
<p><strong>VCAL Server is a semantic cache service.</strong></p>
<p>VCAL Server is different. It does not try to behave like OpenAI, Anthropic, Gemini, or any other LLM provider. Instead, it exposes its own REST API for semantic cache operations. Applications integrate with it directly when teams want more control over how cache lookup, reuse, storage, and decision logic work.</p>
<p>Both approaches are useful, but they solve the problem at different levels.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-ai-cost-firewall-does">What AI Cost Firewall does<a href="https://blog.vcal-project.com/ai-cost-firewall-and-vcal-server-whats-the-difference#what-ai-cost-firewall-does" class="hash-link" aria-label="Direct link to What AI Cost Firewall does" title="Direct link to What AI Cost Firewall does" translate="no">​</a></h2>
<p>AI Cost Firewall is an OpenAI-compatible LLM gateway.</p>
<p>In a typical setup, the application sends requests to AI Cost Firewall instead of sending them directly to the LLM provider.</p>
<p>The flow looks like this:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">Application</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">   ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">AI Cost Firewall</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">   ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">LLM provider</span><br></div></code></pre></div></div>
<p>AI Cost Firewall can check whether a request is already known, whether an exact or semantically similar answer exists, and whether the upstream LLM call can be avoided.</p>
<p>For applications already using OpenAI-compatible APIs, the integration can be relatively simple. In many cases, the customer mainly changes the API base URL and API key configuration.</p>
<p>For example, instead of sending requests directly to an OpenAI-compatible provider, the application sends them to AI Cost Firewall first.</p>
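<p>A minimal sketch of that change, using only the Python standard library. The firewall address, the API keys, and the model name below are placeholders assumed for illustration, and the <code>/chat/completions</code> path is assumed from the OpenAI-compatible convention rather than taken from the AI Cost Firewall documentation. The request is built but not sent, so the only thing that differs between the two calls is the base URL and the key.</p>

```python
import json
import urllib.request

# Illustrative values only; replace them with your real configuration.
PROVIDER_BASE_URL = "https://api.openai.com/v1"
FIREWALL_BASE_URL = "http://localhost:8080/v1"  # assumed gateway address


def build_chat_request(base_url: str, api_key: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = json.dumps({
        "model": "example-model",  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Same payload, different destination: only base URL and key change.
direct = build_chat_request(PROVIDER_BASE_URL, "provider-key", "Hello")
via_firewall = build_chat_request(FIREWALL_BASE_URL, "firewall-key", "Hello")
```

<p>In a real application the same effect is usually achieved through the client library's base URL and API key settings, so the request-building code itself stays untouched.</p>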
<p>This makes AI Cost Firewall a good fit when the goal is to add a cost and cache control layer without deeply changing the application logic.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="how-vcal-server-is-different">How VCAL Server is different<a href="https://blog.vcal-project.com/ai-cost-firewall-and-vcal-server-whats-the-difference#how-vcal-server-is-different" class="hash-link" aria-label="Direct link to How VCAL Server is different" title="Direct link to How VCAL Server is different" translate="no">​</a></h2>
<p>VCAL Server is a standalone semantic cache server. It is provider-agnostic. It does not depend on a specific LLM provider, chat API, or model vendor. Instead, VCAL Server works with vectors, cache entries, answer payloads, similarity search, and cache-management logic.</p>
<p>A typical VCAL Server flow looks like this:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">User question</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">   ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Application creates an embedding</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">   ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Application searches VCAL Server</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">   ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">If a good cached answer exists, VCAL Server returns it</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">   ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">If not, application calls the LLM provider</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">   ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">VCAL Server stores the new question and answer</span><br></div></code></pre></div></div>
<p>This means VCAL Server can be used with OpenAI, Anthropic, Gemini, Mistral, Ollama, Azure OpenAI, OpenRouter, local models, or custom internal AI systems.</p>
<p>The important requirement is that the application must generate or provide embeddings that match the vector dimension configured in VCAL Server.</p>
<p>For example:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">Embedding model: 768 dimensions</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">VCAL Server configuration: 768 dimensions</span><br></div></code></pre></div></div>
<p>VCAL Server does not care which LLM provider generates the final answer. It only needs the vector representation and the cache payload that the application wants to store.</p>
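<p>To make the dimension requirement and the similarity lookup concrete, here is a toy in-memory stand-in written in Python. It is not VCAL Server's API: the class name, the method names, and the 0.9 threshold are all hypothetical, and a real deployment would call the server's REST API instead. The sketch only shows the two ideas described above: vectors must match the configured dimension, and a lookup returns a stored payload when the similarity is high enough.</p>

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemorySemanticCache:
    """Toy stand-in for a semantic cache; hypothetical, not VCAL Server's API."""

    def __init__(self, dim):
        self.dim = dim
        self.entries = []  # list of (vector, payload) pairs

    def _check(self, vector):
        # Reject vectors whose size differs from the configured dimension.
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")

    def store(self, vector, payload):
        self._check(vector)
        self.entries.append((list(vector), payload))

    def search(self, vector, threshold=0.9):
        """Return (payload, score) for the best match at or above threshold, else None."""
        self._check(vector)
        best = None
        for stored, payload in self.entries:
            score = cosine(vector, stored)
            if score >= threshold and (best is None or score > best[1]):
                best = (payload, score)
        return best


# Three dimensions keep the example readable; a real model might use 768.
cache = InMemorySemanticCache(dim=3)
cache.store([1.0, 0.0, 0.0], "cached answer")
hit = cache.search([0.99, 0.1, 0.0])   # close in meaning: returns the payload
miss = cache.search([0.0, 1.0, 0.0])   # unrelated vector: returns None
```

<p>A linear scan is enough for a sketch. A production cache would rely on an approximate nearest-neighbor index rather than comparing against every stored vector.</p>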
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="integration-difference">Integration difference<a href="https://blog.vcal-project.com/ai-cost-firewall-and-vcal-server-whats-the-difference#integration-difference" class="hash-link" aria-label="Direct link to Integration difference" title="Direct link to Integration difference" translate="no">​</a></h2>
<p>The biggest practical difference is how customers integrate the products.</p>
<p><em>AI Cost Firewall</em> is designed for low-friction adoption in OpenAI-compatible applications. With AI Cost Firewall, the application can often continue using a familiar OpenAI-compatible request pattern. The firewall handles caching and cost-control logic in the gateway layer.</p>
<p><em>VCAL Server</em> requires explicit application-level integration. With VCAL Server, the customer’s application needs to add cache lookup and storage logic. The application decides when to call VCAL Server, when to reuse a cached answer, when to call the LLM, and when to store a new answer.</p>
<p>Below is a simplified comparison:</p>
<table><thead><tr><th>Area</th><th>AI Cost Firewall</th><th>VCAL Server</th></tr></thead><tbody><tr><td>Main role</td><td>OpenAI-compatible gateway</td><td>Standalone semantic cache service</td></tr><tr><td>Integration style</td><td>Gateway / proxy-style</td><td>Direct application integration</td></tr><tr><td>Application changes</td><td>Usually minimal for OpenAI-compatible apps</td><td>Usually required</td></tr><tr><td>Provider dependency</td><td>Best suited for OpenAI-compatible APIs</td><td>Provider-agnostic</td></tr><tr><td>Cache control</td><td>Managed by the gateway</td><td>Controlled by the application</td></tr><tr><td>Best fit</td><td>Fast adoption and pilot deployments</td><td>Custom systems and deeper control</td></tr><tr><td>Purpose</td><td>Fast integration into existing OpenAI-compatible chatbots and AI applications</td><td>High-performance semantic cache for custom AI systems</td></tr></tbody></table>
<p>This distinction is important.</p>
<p>AI Cost Firewall is easier when the customer wants to add caching and cost visibility quickly.</p>
<p>VCAL Server is better when the customer wants more direct control over semantic search, cache behavior, thresholds, storage, and integration logic.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="is-vcal-server-provider-agnostic">Is VCAL Server provider-agnostic?<a href="https://blog.vcal-project.com/ai-cost-firewall-and-vcal-server-whats-the-difference#is-vcal-server-provider-agnostic" class="hash-link" aria-label="Direct link to Is VCAL Server provider-agnostic?" title="Direct link to Is VCAL Server provider-agnostic?" translate="no">​</a></h2>
<p>Yes, VCAL Server is provider-agnostic because it works below the provider API level. It does not need to know whether the answer comes from OpenAI, Anthropic, Gemini, Mistral, Ollama, or a private model.</p>
<p>The customer's application is responsible for three things:</p>
<ol>
<li class="">Creating embeddings.</li>
<li class="">Calling VCAL Server for cache lookup and storage.</li>
<li class="">Calling the selected LLM provider when there is no suitable cached answer.</li>
</ol>
<p>This makes VCAL Server useful in environments where teams use different LLM providers or want to avoid being tied to one API style.</p>
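<p>The three responsibilities can be sketched as one small Python function. Everything here is hypothetical: <code>embed</code>, <code>lookup</code>, <code>store</code>, and <code>call_llm</code> are placeholders the application would supply, standing in for an embedding model, the cache lookup and storage calls, and whichever LLM provider is in use. Because the provider is just a callable, swapping it does not change the caching logic.</p>

```python
from typing import Callable, Optional, Sequence, Tuple

Vector = Sequence[float]


def answer_with_cache(
    question: str,
    embed: Callable[[str], Vector],
    lookup: Callable[[Vector], Optional[str]],
    store: Callable[[Vector, str], None],
    call_llm: Callable[[str], str],
) -> Tuple[str, bool]:
    """Return (answer, from_cache) using a lookup-then-call flow."""
    vector = embed(question)        # 1. create the embedding
    cached = lookup(vector)         # 2. ask the cache for a suitable answer
    if cached is not None:
        return cached, True
    answer = call_llm(question)     # 3. no suitable entry: call the provider
    store(vector, answer)           # 2. store the new answer for next time
    return answer, False
```

<p>Passing the four steps in as functions keeps the flow testable with simple fakes and leaves the reuse decision, such as the similarity threshold inside <code>lookup</code>, under the application's control.</p>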
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="where-ai-cost-firewall-can-be-deployed">Where AI Cost Firewall can be deployed<a href="https://blog.vcal-project.com/ai-cost-firewall-and-vcal-server-whats-the-difference#where-ai-cost-firewall-can-be-deployed" class="hash-link" aria-label="Direct link to Where AI Cost Firewall can be deployed" title="Direct link to Where AI Cost Firewall can be deployed" translate="no">​</a></h2>
<p>AI Cost Firewall is useful when the main goal is to reduce repeated LLM calls with minimal integration effort.</p>
<p>It is best suited for applications that already use OpenAI-compatible APIs and where teams want to add caching, cost visibility, and control without deeply changing the application code.</p>
<p>Common deployment areas include:</p>
<p><strong>AI chatbots</strong><br>
Customer support bots, internal assistants, and FAQ systems often receive repeated or similar questions. AI Cost Firewall can reduce unnecessary upstream LLM calls and improve response speed.</p>
<p><strong>OpenAI-compatible AI applications</strong><br>
Many applications already use OpenAI-compatible clients or gateways. AI Cost Firewall fits naturally into this pattern because it is designed around OpenAI-compatible API behavior.</p>
<p><strong>Pilot AI deployments</strong><br>
Teams can evaluate LLM cost savings, caching behavior, and observability without rewriting the application. This makes AI Cost Firewall a practical starting point for controlled pilots.</p>
<p>In simple terms, AI Cost Firewall is best suited for situations where the customer wants a practical control layer in front of LLM traffic with minimal application changes.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="where-vcal-server-can-be-deployed">Where VCAL Server can be deployed<a href="https://blog.vcal-project.com/ai-cost-firewall-and-vcal-server-whats-the-difference#where-vcal-server-can-be-deployed" class="hash-link" aria-label="Direct link to Where VCAL Server can be deployed" title="Direct link to Where VCAL Server can be deployed" translate="no">​</a></h2>
<p>VCAL Server is useful when the customer wants a dedicated semantic cache service with direct control over integration logic.</p>
<p>It works especially well in systems where the application team wants to decide how cache lookup, similarity thresholds, answer reuse, metadata, and storage should work.</p>
<p>Common deployment areas include:</p>
<p><strong>RAG systems</strong><br>
Retrieval-augmented generation systems often receive repeated questions over the same knowledge base. VCAL Server can help reuse previous answers or decisions when the semantic meaning is close enough.</p>
<p><strong>Enterprise knowledge assistants</strong><br>
Internal assistants for documentation, policies, procedures, and technical support can benefit from semantic caching because many users ask similar questions in different ways.</p>
<p><strong>Agent workflows</strong><br>
AI agents often repeat planning, classification, summarization, or decision steps. VCAL Server can cache these intermediate or final results.</p>
<p><strong>Custom chatbot backends</strong><br>
Teams building their own chatbot logic can use VCAL Server when they need more control than a gateway-only approach. The application decides how to use similarity scores, thresholds, metadata, and stored answers.</p>
<p><strong>On-prem and private infrastructure</strong><br>
VCAL Server can run inside the customer’s own infrastructure. This is useful when data locality, internal control, and private deployment are important.</p>
<p><strong>Provider-independent AI systems</strong><br>
Because VCAL Server is provider-agnostic, it can support applications that use OpenAI-compatible APIs, Anthropic, Gemini, Ollama, local models, or custom internal systems.</p>
<p>In simple terms, VCAL Server is best suited for situations where the customer wants a flexible and powerful semantic cache that the application controls directly.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="when-to-choose-ai-cost-firewall-a-simple-guide">When to choose AI Cost Firewall (a simple guide)<a href="https://blog.vcal-project.com/ai-cost-firewall-and-vcal-server-whats-the-difference#when-to-choose-ai-cost-firewall-a-simple-guide" class="hash-link" aria-label="Direct link to When to choose AI Cost Firewall (a simple guide)" title="Direct link to When to choose AI Cost Firewall (a simple guide)" translate="no">​</a></h2>
<p>AI Cost Firewall is usually the better starting point when the customer wants a faster and simpler integration path.</p>
<p>It is especially useful when:</p>
<ul>
<li class="">the application already uses OpenAI-compatible APIs</li>
<li class="">you want minimal code changes</li>
<li class="">the main goal is to reduce repeated LLM calls</li>
<li class="">the team wants cost visibility and cache metrics quickly</li>
<li class="">the application architecture is suitable for a gateway layer</li>
</ul>
<p>In simple terms, choose AI Cost Firewall when you want a practical gateway for reducing repeated LLM traffic with minimal or no application changes.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="when-you-need-vcal-server">When you need VCAL Server<a href="https://blog.vcal-project.com/ai-cost-firewall-and-vcal-server-whats-the-difference#when-you-need-vcal-server" class="hash-link" aria-label="Direct link to When you need VCAL Server" title="Direct link to When you need VCAL Server" translate="no">​</a></h2>
<p>VCAL Server is usually the better choice when the customer wants more control and is ready to integrate caching directly into the application workflow.</p>
<p>It is especially useful when:</p>
<ul>
<li class="">the application uses multiple LLM providers</li>
<li class="">you need provider-agnostic caching</li>
<li class="">the team wants direct control over similarity thresholds and cache behavior</li>
<li class="">the system is a custom RAG, agent, or chatbot backend</li>
<li class="">you want a standalone semantic cache inside your own infrastructure</li>
</ul>
<p>In simple terms, choose VCAL Server when you want a flexible and powerful semantic cache service that your application controls directly.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="compliance-conscious-deployment">Compliance-conscious deployment<a href="https://blog.vcal-project.com/ai-cost-firewall-and-vcal-server-whats-the-difference#compliance-conscious-deployment" class="hash-link" aria-label="Direct link to Compliance-conscious deployment" title="Direct link to Compliance-conscious deployment" translate="no">​</a></h2>
<p>For many organizations, reducing LLM cost is only part of the problem. AI systems also need to fit internal security, privacy, governance, and regulatory requirements.</p>
<p>AI Cost Firewall and VCAL Server are designed to support compliance-conscious deployments.</p>
<p>They do not force teams to send their data through an uncontrolled external layer. Both products can be deployed inside the customer’s own infrastructure, where the organization can control network access, API keys, logs, metrics, retention policies, and operational monitoring.</p>
<p>This is especially important for teams working with customer support data, internal documentation, or sensitive operational information.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="can-both-products-be-used-together">Can both products be used together?<a href="https://blog.vcal-project.com/ai-cost-firewall-and-vcal-server-whats-the-difference#can-both-products-be-used-together" class="hash-link" aria-label="Direct link to Can both products be used together?" title="Direct link to Can both products be used together?" translate="no">​</a></h2>
<p>Yes, in some architectures they can complement each other.</p>
<p>AI Cost Firewall can provide a gateway layer for OpenAI-compatible traffic, while VCAL Server can provide a deeper semantic cache service for custom workflows, RAG systems, or application-controlled cache operations.</p>
<p>However, they do not have to be used together. Each product has its own purpose:</p>
<ul>
<li class="">AI Cost Firewall is focused on easy gateway-style adoption</li>
<li class="">VCAL Server is focused on flexible, provider-agnostic semantic cache infrastructure</li>
</ul>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="final-thoughts">Final thoughts<a href="https://blog.vcal-project.com/ai-cost-firewall-and-vcal-server-whats-the-difference#final-thoughts" class="hash-link" aria-label="Direct link to Final thoughts" title="Direct link to Final thoughts" translate="no">​</a></h2>
<p>AI Cost Firewall and VCAL Server solve related problems, but they are not the same product.</p>
<p>AI Cost Firewall is an OpenAI-compatible gateway that helps applications reduce repeated LLM calls with minimal changes.</p>
<p>VCAL Server is a provider-agnostic semantic cache service that gives applications direct control over cache lookup, reuse, storage, and integration logic.</p>
<p>Both products are designed for real AI workloads where repeated or similar requests create unnecessary cost, latency, and infrastructure load. The right choice depends on the customer’s architecture.</p>
<p>If the goal is fast adoption with an OpenAI-compatible application, AI Cost Firewall is usually the better starting point.</p>
<p>If the goal is deeper control, provider independence, and custom integration, VCAL Server is the better fit.</p>
<p>We hope this guide helps you choose the right product for your AI application.</p>
<p>If you need more details, you can read the <a href="https://docs.vcal-project.com/" target="_blank" rel="noopener noreferrer" class="">VCAL documentation</a> or <a href="https://vcal-project.com/vcal-server/#contact" target="_blank" rel="noopener noreferrer" class="">contact us</a> directly.</p>]]></content:encoded>
            <category>ai-infrastructure</category>
            <category>llm</category>
            <category>devops</category>
            <category>systems-programming</category>
        </item>
        <item>
            <title><![CDATA[Why Edge AI Benefits from Small Rust Binaries]]></title>
            <link>https://blog.vcal-project.com/why-edge-ai-benefits-from-small-rust-binaries</link>
            <guid>https://blog.vcal-project.com/why-edge-ai-benefits-from-small-rust-binaries</guid>
            <pubDate>Fri, 29 May 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Why lightweight Rust binaries simplify deployment, reduce operational overhead, and improve reliability for Edge AI infrastructure]]></description>
            <content:encoded><![CDATA[<p><img decoding="async" loading="lazy" alt="Cover" src="https://blog.vcal-project.com/assets/images/why-edge-computing-benefits-from-rust-binaries-cover-21564f753b2ebaf4351a3dc62365f741.png" width="1536" height="1024" class="img_ev3q"></p>
<p>When people talk about Edge AI, the conversation usually revolves around models. Larger context windows, smaller quantized variants, GPU acceleration, inference speed, and hardware optimization tend to dominate the discussion. But in practice, many real-world Edge AI deployments are constrained not by the model itself, but by the operational realities surrounding it.</p>
<p>Running AI at the edge means running software in environments that are fundamentally different from modern cloud infrastructure. These systems may operate with limited memory, modest CPUs, unreliable connectivity, restricted storage, or strict uptime requirements. They may be installed in factories, telecom cabinets, or remote locations where updates are difficult and maintenance windows are limited.</p>
<p>In these environments, the infrastructure surrounding the AI model becomes critically important.</p>
<p>Inference alone is rarely enough. Real systems require routing, telemetry, caching, authentication, observability, synchronization, and APIs, to name a few. As Edge AI deployments mature, the supporting software stack increasingly determines whether the system remains practical to operate over time.</p>
<p>This is where small Rust binaries become unexpectedly valuable.</p>
<p><em>A simplified comparison illustrates how deployment characteristics differ between traditional runtime-heavy stacks and lightweight Rust-based infrastructure.</em></p>
<p><img decoding="async" loading="lazy" alt="Screenshot: Cloud vs Rust-based infrastructure." src="https://blog.vcal-project.com/assets/images/why-edge-computing-benefits-from-rust-binaries-intext-b5b0b3e08b9617b23d2b0f09585a512d.png" width="1536" height="1024" class="img_ev3q"></p>
<p>A great deal of modern AI infrastructure is assembled from heavyweight runtime environments. Python services, Node.js middleware, layered containers, package managers, dynamically linked dependencies, and multiple supporting sidecars work reasonably well in large cloud environments where storage, memory, and compute resources are abundant. Cloud-native ecosystems are optimized around this assumption.</p>
<p>Edge systems operate under very different constraints.</p>
<p>A deployment that feels trivial in Kubernetes can become fragile when moved into a constrained ARM device or an industrial Linux gateway. Every additional runtime layer increases image size, startup complexity, memory overhead, update time, and operational surface area. In disconnected or bandwidth-constrained environments, even transferring updates can become an engineering problem.</p>
<p>The difference between a small statically linked binary and a multi-hundred-megabyte container image is not merely aesthetic. It changes how systems behave operationally.</p>
<p>Rust fits this environment unusually well because it combines low-level efficiency with modern software engineering practices. Rust applications can achieve native performance while maintaining memory safety without relying on garbage collection. They can often be compiled into small standalone binaries with minimal runtime dependencies, especially when using musl-based static builds.</p>
<p>Operationally, this has several consequences.</p>
<p>Deployment becomes simpler because the application can often be distributed as a single executable without requiring external runtimes or complex dependency trees. Updating edge systems becomes faster because smaller artifacts transfer more efficiently across constrained networks. Startup times improve because the system avoids initializing large runtime environments. Memory consumption remains relatively predictable, which matters when multiple services must coexist on limited hardware.</p>
<p>Perhaps more importantly, the operational model itself becomes easier to reason about.</p>
<p>In edge environments, reliability frequently depends on reducing complexity rather than adding abstraction. A smaller deployment footprint generally means fewer moving parts, fewer compatibility problems, fewer patching requirements, and fewer failure scenarios. This becomes increasingly important in industrial or regulated environments where operational stability matters more than developer convenience.</p>
<p>Static linking is one of the more underrated aspects of this approach. A fully self-contained binary changes deployment from “install and configure an environment” into something much closer to “copy and run.” That may sound simplistic, but at scale, simplicity becomes an operational advantage. Especially in disconnected or semi-connected infrastructure, reducing environmental assumptions can significantly improve reliability.</p>
<p>Edge AI also introduces another practical challenge: repeated computation.</p>
<p>Many edge systems process highly repetitive workloads. Similar prompts, recurring semantic queries, repeated retrieval patterns, and predictable operational workflows appear constantly in production environments. Without an intermediate control layer, these systems repeatedly recompute results they have effectively already seen before.</p>
<p>This creates unnecessary pressure on hardware resources that are already constrained.</p>
<p>Semantic caching becomes particularly useful at the edge because compute resources are finite and expensive. Reducing repeated inference can improve responsiveness, lower latency, decrease hardware utilization, and reduce energy consumption. In disconnected environments, lightweight caching layers can also help systems remain responsive even when upstream connectivity is unstable or unavailable.</p>
<p>This is one reason why we believe lightweight infrastructure matters just as much as lightweight models.</p>
<p>At VCAL Labs, we have been building semantic infrastructure components in Rust not only because of performance characteristics, but because operational portability increasingly matters in AI systems. Small binaries, predictable resource usage, and minimal deployment friction align naturally with the realities of edge environments.</p>
<p>The future of Edge AI will likely depend on more than just advances in inference itself. It will also depend on whether the surrounding infrastructure can operate reliably in constrained, distributed, and operationally complex environments.</p>
<p>In many cases, the hardest problem is not running the model once.</p>
<p>The harder problem is continuously operating the systems around it.</p>
<p>Observability, routing, semantic caching, synchronization, APIs, telemetry, and traffic control all become part of the deployment surface. And as AI infrastructure moves closer to the edge, operational efficiency starts to matter just as much as raw model capability.</p>
<p>Small Rust binaries are not a universal solution to these problems. But they are remarkably well aligned with the operational realities of Edge AI infrastructure, and that alignment is becoming increasingly difficult to ignore.</p>
<p>Learn more about VCAL semantic infrastructure for AI systems:
<a href="https://vcal-project.com/" target="_blank" rel="noopener noreferrer" class="">https://vcal-project.com/</a></p>]]></content:encoded>
            <category>rust</category>
            <category>edge-ai</category>
            <category>edge-computing</category>
            <category>llm</category>
            <category>ai-infrastructure</category>
            <category>systems-programming</category>
        </item>
        <item>
            <title><![CDATA[Practical Pilot Deployments with AI Cost Firewall v0.1.9]]></title>
            <link>https://blog.vcal-project.com/practical-pilot-deployments-ai-cost-firewall-v0-1-9</link>
            <guid>https://blog.vcal-project.com/practical-pilot-deployments-ai-cost-firewall-v0-1-9</guid>
            <pubDate>Sun, 24 May 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[How AI Cost Firewall v0.1.9 makes deployment, observability, and troubleshooting easier for pilot users]]></description>
            <content:encoded><![CDATA[<p><img decoding="async" loading="lazy" alt="Cover" src="https://blog.vcal-project.com/assets/images/ai-firewall-overview-019-74a1e610e367b83763bba6e35805adb5.png" width="1700" height="701" class="img_ev3q"></p>
<p>AI Cost Firewall began as a lightweight OpenAI-compatible gateway designed to reduce LLM cost and latency through exact and semantic caching. Over time, the project evolved beyond simple request reuse and gradually became a broader operational layer for AI infrastructure.</p>
<p>Recent releases introduced semantic cache lifecycle management, provider flexibility, improved Prometheus and Grafana observability, configuration diagnostics, and detailed cost accounting. Those features improved the technical capabilities of the system significantly, but another important question remained:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">How quickly can somebody actually deploy and evaluate the system in a real environment?</span><br></div></code></pre></div></div>
<p>That question became the main focus of v0.1.9.</p>
<p>Unlike earlier releases that concentrated on internal infrastructure features, v0.1.9 focuses primarily on operational polish. The goal of this release is to reduce the friction between discovering the project and successfully running it with dashboards, cache reuse, and observable semantic behavior.</p>
<p>In practice, this means clearer deployment patterns, better onboarding, improved startup diagnostics, more actionable provider error messages, and significantly expanded operational documentation.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="a-shift-toward-real-deployments">A Shift Toward Real Deployments<a href="https://blog.vcal-project.com/practical-pilot-deployments-ai-cost-firewall-v0-1-9#a-shift-toward-real-deployments" class="hash-link" aria-label="Direct link to A Shift Toward Real Deployments" title="Direct link to A Shift Toward Real Deployments" translate="no">​</a></h2>
<p>One recurring observation during development was that many evaluation problems were not caused by semantic cache logic itself. Instead, the most common issues were operational:</p>
<ul>
<li class="">wrong provider base URLs</li>
<li class="">Docker networking confusion</li>
<li class="">embedding dimension mismatches</li>
<li class="">empty dashboards</li>
<li class="">TLS and certificate problems</li>
<li class="">misunderstanding how semantic cache behaves</li>
</ul>
<p>These are normal operational problems for infrastructure software, but they become barriers when somebody is evaluating a project for the first time.</p>
<p>v0.1.9 therefore introduces a more deployment-oriented structure for the repository and documentation. The project now includes runnable deployment examples under:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">deploy/examples/</span><br></div></code></pre></div></div>
<p>The new examples are designed to demonstrate practical deployment patterns instead of only isolated configuration snippets.</p>
<p>Included examples:</p>
<table><thead><tr><th>Deployment Pattern</th><th>Purpose</th></tr></thead><tbody><tr><td><code>openai-cloud/</code></td><td>Fastest cloud evaluation</td></tr><tr><td><code>local-ollama/</code></td><td>Fully local OpenAI-compatible stack</td></tr><tr><td><code>hybrid-openai-local-embeddings/</code></td><td>OpenAI chat + local embeddings</td></tr><tr><td><code>openrouter/</code></td><td>OpenRouter upstream example</td></tr><tr><td><code>local-full-stack/</code></td><td>Full local stack with dashboards</td></tr></tbody></table>
<p>Each example includes a runnable Docker Compose stack, minimal configuration, example requests, expected behavior, and optional observability overlays where appropriate.</p>
<p>The intent is not only to make deployments easier, but also to make them more understandable.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="a-practical-hybrid-deployment">A Practical Hybrid Deployment<a href="https://blog.vcal-project.com/practical-pilot-deployments-ai-cost-firewall-v0-1-9#a-practical-hybrid-deployment" class="hash-link" aria-label="Direct link to A Practical Hybrid Deployment" title="Direct link to A Practical Hybrid Deployment" translate="no">​</a></h2>
<p>One particularly useful deployment pattern introduced in the examples is:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">hybrid-openai-local-embeddings/</span><br></div></code></pre></div></div>
<p>This deployment combines cloud chat inference with local embeddings.</p>
<p>In this setup:</p>
<ul>
<li class="">OpenAI handles chat completions</li>
<li class="">Ollama generates embeddings locally</li>
<li class="">Redis stores exact cache entries</li>
<li class="">Qdrant stores semantic cache vectors</li>
</ul>
<p>This pattern is interesting because it demonstrates a practical middle ground between fully local infrastructure and fully cloud-hosted inference.</p>
<p>Many organizations still want the quality and convenience of cloud-hosted chat models, but embedding overhead can become expensive once semantic cache traffic grows. Running embeddings locally can reduce or eliminate those costs while still preserving semantic reuse behavior.</p>
<p>The request flow becomes:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">Application</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">AI Cost Firewall</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Exact Cache (Redis)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Semantic Cache (Qdrant + local embeddings)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">OpenAI upstream</span><br></div></code></pre></div></div>
<p>This arrangement also keeps GPU requirements relatively modest compared to fully local chat inference stacks.</p>
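<p>The lookup order in the flow above can be sketched in a few lines of Python. This is only an illustration of the tiering idea, not the AI Cost Firewall implementation: the function names and the in-memory stand-ins for Redis, Qdrant, and the upstream provider are all hypothetical.</p>

```python
from typing import Callable, Optional, Tuple


def lookup(
    prompt: str,
    exact_get: Callable[[str], Optional[str]],
    semantic_get: Callable[[str], Optional[str]],
    upstream_call: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (source, answer), trying the cheapest tier first."""
    # Tier 1: exact match (Redis in the hybrid example).
    answer = exact_get(prompt)
    if answer is not None:
        return "exact", answer
    # Tier 2: semantic match (Qdrant plus local embeddings).
    answer = semantic_get(prompt)
    if answer is not None:
        return "semantic", answer
    # Tier 3: nothing reusable, so the request goes upstream.
    return "upstream", upstream_call(prompt)


# Hypothetical in-memory stand-ins so the sketch runs without any services.
exact_store = {"Explain Redis briefly.": "Redis is an in-memory data store."}
semantic_store = {"what is redis?": "Redis is an in-memory data store."}


def exact_get(prompt: str) -> Optional[str]:
    return exact_store.get(prompt)


def semantic_get(prompt: str) -> Optional[str]:
    # Toy substitute for a vector search: case-insensitive match only.
    return semantic_store.get(prompt.strip().lower())


def upstream_call(prompt: str) -> str:
    return f"(upstream answer for: {prompt})"


for text in ("Explain Redis briefly.", "What is Redis?", "Explain Qdrant."):
    source, _ = lookup(text, exact_get, semantic_get, upstream_call)
    print(f"{text!r} served from {source}")
```

<p>A real gateway would also write upstream answers back into both caches. That step is left out here to keep the ordering itself easy to see.</p>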
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="faster-evaluation-flow">Faster Evaluation Flow<a href="https://blog.vcal-project.com/practical-pilot-deployments-ai-cost-firewall-v0-1-9#faster-evaluation-flow" class="hash-link" aria-label="Direct link to Faster Evaluation Flow" title="Direct link to Faster Evaluation Flow" translate="no">​</a></h2>
<p>The deployment examples intentionally avoid unnecessary complexity. The objective is to help operators get from zero to a working observable deployment as quickly as possible.</p>
<p>A typical startup sequence now looks like:</p>
<div class="language-conf codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-conf codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">docker compose up -d</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">docker compose exec ollama ollama pull nomic-embed-text</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">docker compose restart ai-firewall</span><br></div></code></pre></div></div>
<p>After startup, the deployment can immediately be validated using the health and readiness endpoints:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">curl http://localhost:8080/healthz</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">curl http://localhost:8080/readyz</span><br></div></code></pre></div></div>
<p>A simple request can then be sent through the firewall:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">curl http://localhost:8080/v1/chat/completions \</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  -H "Content-Type: application/json" \</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  -d '{</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    "model": "gpt-4o-mini-2024-07-18",</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    "messages": [</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      {"role": "user", "content": "Explain Redis briefly."}</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    ]</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  }'</span><br></div></code></pre></div></div>
<p>Repeated requests should begin producing exact cache hits, while semantically similar prompts may eventually reuse semantic cache entries.</p>
<p>The important part here is not merely that caching exists, but that the behavior becomes visible and understandable during evaluation.</p>
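<p>One rough way to watch this during evaluation is to send the same prompt twice and compare wall-clock times. The sketch below assumes the firewall is listening on the same local address as the curl example. The helper names are hypothetical, and a faster repeat only suggests a cache hit; the dashboards and metrics remain the authoritative source.</p>

```python
import json
import time
import urllib.request

# Same local address as the curl example above.
FIREWALL_URL = "http://localhost:8080/v1/chat/completions"


def build_chat_request(prompt, model="gpt-4o-mini-2024-07-18"):
    """Build the same POST request as the curl example."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        FIREWALL_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def timed_call(prompt):
    """Send one request and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    with urllib.request.urlopen(build_chat_request(prompt), timeout=60) as resp:
        resp.read()
    return time.perf_counter() - start


def compare_repeat(prompt="Explain Redis briefly."):
    """Send the same prompt twice; this needs a running deployment."""
    first = timed_call(prompt)
    repeat = timed_call(prompt)
    print(f"first: {first:.3f}s, repeat: {repeat:.3f}s")
```

<p>Calling <code>compare_repeat()</code> against a running stack prints both timings. Nothing is sent until that function is called.</p>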
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="observability-as-a-first-class-feature">Observability as a First-Class Feature<a href="https://blog.vcal-project.com/practical-pilot-deployments-ai-cost-firewall-v0-1-9#observability-as-a-first-class-feature" class="hash-link" aria-label="Direct link to Observability as a First-Class Feature" title="Direct link to Observability as a First-Class Feature" translate="no">​</a></h2>
<p>Semantic caching can easily become opaque if operators cannot see what the system is doing internally. One of the long-term goals of AI Cost Firewall has therefore been making semantic reuse behavior observable instead of hidden.</p>
<p>v0.1.8 introduced more advanced financial and cache metrics, while v0.1.9 makes the dashboards built on those metrics easier to deploy and interpret.</p>
<p>The project includes two Grafana dashboards with different purposes.</p>
<p>The Overview dashboard (see cover image) focuses on higher-level operational and business metrics such as:</p>
<ul>
<li class="">request traffic</li>
<li class="">exact cache hits</li>
<li class="">semantic cache hits</li>
<li class="">gross savings</li>
<li class="">embedding overhead</li>
<li class="">net savings</li>
</ul>
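<p>The last three items are related. A minimal sketch of that relationship follows, assuming net savings is simply gross savings minus embedding overhead. The function names and all prices are hypothetical and are not taken from the dashboards.</p>

```python
def gross_savings(cache_hits: int, avg_upstream_cost_usd: float) -> float:
    """Upstream spend avoided because cache hits never reached the provider."""
    return cache_hits * avg_upstream_cost_usd


def embedding_overhead(lookups: int, cost_per_embedding_usd: float) -> float:
    """Cost of embedding prompts for semantic lookups (zero if fully local)."""
    return lookups * cost_per_embedding_usd


def net_savings(gross_usd: float, overhead_usd: float) -> float:
    """Assumed relationship: net = gross - embedding overhead."""
    return gross_usd - overhead_usd


# Hypothetical numbers: 1,000 cache hits and 5,000 semantic lookups.
gross = gross_savings(cache_hits=1000, avg_upstream_cost_usd=0.002)
overhead = embedding_overhead(lookups=5000, cost_per_embedding_usd=0.00002)
print(f"net savings: ${net_savings(gross, overhead):.2f}")
```

<p>With local embeddings the overhead term shrinks toward zero, which is why the hybrid deployment can be attractive once semantic traffic grows.</p>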
<p>Meanwhile, the Diagnostics dashboard focuses more heavily on semantic runtime behavior:</p>
<ul>
<li class="">semantic lookup latency</li>
<li class="">threshold pass/fail behavior</li>
<li class="">semantic candidate activity</li>
<li class="">runtime cache diagnostics</li>
</ul>
<p>This separation is intentional.</p>
<p>The Overview dashboard helps answer:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">“Is this deployment actually reducing cost and upstream traffic?”</span><br></div></code></pre></div></div>
<p>The Diagnostics dashboard helps answer:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">“Why is semantic cache behaving the way it is?”</span><br></div></code></pre></div></div>
<p>That distinction becomes increasingly important as semantic cache deployments grow larger and more complex.</p>
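<p>To make "threshold pass/fail" concrete, here is a small sketch of a similarity check. It assumes cosine similarity and a threshold of 0.9. Both are illustrative choices and not a statement about how AI Cost Firewall scores candidates.</p>

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as "not similar"
    return dot / (norm_a * norm_b)


def passes_threshold(
    query: Sequence[float], candidate: Sequence[float], threshold: float = 0.9
) -> bool:
    """True when the cached candidate is close enough to be reused."""
    return cosine_similarity(query, candidate) >= threshold


print(passes_threshold([1.0, 0.0], [1.0, 0.0]))  # identical direction
print(passes_threshold([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

<p>A candidate that fails the threshold is a semantic miss, even though the vector store returned it. Counting passes and fails separately is what makes that behavior visible.</p>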
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="operational-problems-are-real-problems">Operational Problems Are Real Problems<a href="https://blog.vcal-project.com/practical-pilot-deployments-ai-cost-firewall-v0-1-9#operational-problems-are-real-problems" class="hash-link" aria-label="Direct link to Operational Problems Are Real Problems" title="Direct link to Operational Problems Are Real Problems" translate="no">​</a></h2>
<p>One of the themes of v0.1.9 is that operational clarity matters just as much as architecture.</p>
<p>Even a technically strong caching system becomes difficult to evaluate if deployment failures are confusing or poorly explained. For that reason, this release also improves startup diagnostics, runtime validation, and provider error handling.</p>
<p>Several deployment mistakes appeared repeatedly during testing and evaluation.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="wrong-base-urls">Wrong Base URLs<a href="https://blog.vcal-project.com/practical-pilot-deployments-ai-cost-firewall-v0-1-9#wrong-base-urls" class="hash-link" aria-label="Direct link to Wrong Base URLs" title="Direct link to Wrong Base URLs" translate="no">​</a></h3>
<p>A very common issue is configuring full endpoint paths instead of provider base URLs.</p>
<p>Incorrect:</p>
<div class="language-conf codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-conf codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">https://api.openai.com/v1/chat/completions</span><br></div></code></pre></div></div>
<p>Correct:</p>
<div class="language-conf codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-conf codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">https://api.openai.com</span><br></div></code></pre></div></div>
<p>AI Cost Firewall appends OpenAI-compatible routes internally. The same rule applies to embedding endpoints.</p>
<p>v0.1.9 improves diagnostics around this issue and makes related provider failures easier to interpret.</p>
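<p>A check of this kind can be sketched as follows. This is a hypothetical helper, not the firewall's own validation, and the list of route suffixes is an assumption based on common OpenAI-compatible routes.</p>

```python
from typing import List
from urllib.parse import urlsplit

# Route suffixes that belong to individual endpoints, not to a base URL.
ENDPOINT_SUFFIXES = ("/chat/completions", "/completions", "/embeddings")


def check_base_url(url: str) -> List[str]:
    """Return a list of human-readable problems; empty means it looks fine."""
    problems = []
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        problems.append("scheme must be http or https")
    if not parts.netloc:
        problems.append("missing host")
    if parts.path.rstrip("/").endswith(ENDPOINT_SUFFIXES):
        problems.append(
            "looks like a full endpoint path; configure the provider base URL instead"
        )
    return problems


print(check_base_url("https://api.openai.com/v1/chat/completions"))
print(check_base_url("https://api.openai.com"))
```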
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="qdrant-vector-size-mismatches">Qdrant Vector-Size Mismatches<a href="https://blog.vcal-project.com/practical-pilot-deployments-ai-cost-firewall-v0-1-9#qdrant-vector-size-mismatches" class="hash-link" aria-label="Direct link to Qdrant Vector-Size Mismatches" title="Direct link to Qdrant Vector-Size Mismatches" translate="no">​</a></h3>
<p>Another common operational problem involves embedding dimensions.</p>
<p>Different embedding models produce vectors of different sizes:</p>
<div class="language-conf codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-conf codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">nomic-embed-text → 768</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">text-embedding-3-small → 1536</span><br></div></code></pre></div></div>
<p>If the vector size of the Qdrant collection does not match the output dimension of the configured embedding model, semantic caching will fail.</p>
<p>Earlier releases already validated vector sizes, but v0.1.9 improves the clarity of those startup diagnostics and explains the likely cause more explicitly.</p>
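<p>The underlying check is simple. The sketch below uses only the two dimensions listed above. The function name and message wording are hypothetical, and in a real deployment the collection size would be read from Qdrant, not passed in by hand.</p>

```python
from typing import Optional

# Output dimensions for the two models mentioned above.
KNOWN_DIMENSIONS = {
    "nomic-embed-text": 768,
    "text-embedding-3-small": 1536,
}


def check_vector_size(model: str, collection_size: int) -> Optional[str]:
    """Return an explanatory message on mismatch, or None if nothing is wrong."""
    expected = KNOWN_DIMENSIONS.get(model)
    if expected is None:
        return None  # unknown model: nothing to compare against
    if expected != collection_size:
        return (
            f"collection vector size is {collection_size}, but {model} "
            f"produces {expected}-dimensional vectors; the collection was "
            f"probably created for a different embedding model"
        )
    return None


print(check_vector_size("nomic-embed-text", 1536))
```

<p>This situation typically appears after switching embedding models while keeping an existing collection.</p>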
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="docker-networking-confusion">Docker Networking Confusion<a href="https://blog.vcal-project.com/practical-pilot-deployments-ai-cost-firewall-v0-1-9#docker-networking-confusion" class="hash-link" aria-label="Direct link to Docker Networking Confusion" title="Direct link to Docker Networking Confusion" translate="no">​</a></h3>
<p>Docker networking also causes frequent evaluation problems.</p>
<p>Inside containers:</p>
<div class="language-conf codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-conf codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">localhost != host machine</span><br></div></code></pre></div></div>
<p>This especially affects Ollama deployments.</p>
<p>Incorrect inside Compose networking:</p>
<div class="language-conf codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-conf codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">http://localhost:11434</span><br></div></code></pre></div></div>
<p>Correct:</p>
<div class="language-conf codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-conf codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">http://ollama:11434</span><br></div></code></pre></div></div>
<p>v0.1.9 expands troubleshooting documentation around these operational patterns.</p>
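<p>A configuration linter could flag this mistake before startup. The following sketch is a hypothetical helper. It only inspects the URL and does not detect whether the process is actually running inside a container.</p>

```python
from urllib.parse import urlsplit

# Hosts that resolve to the container itself when used inside a container.
LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def points_at_loopback(url: str) -> bool:
    """True when the URL targets a loopback address."""
    host = urlsplit(url).hostname
    return host is not None and host.lower() in LOOPBACK_HOSTS


for candidate in ("http://localhost:11434", "http://ollama:11434"):
    if points_at_loopback(candidate):
        print(f"{candidate}: loopback address; inside Compose use the service name")
    else:
        print(f"{candidate}: ok")
```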
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="empty-dashboards">Empty Dashboards<a href="https://blog.vcal-project.com/practical-pilot-deployments-ai-cost-firewall-v0-1-9#empty-dashboards" class="hash-link" aria-label="Direct link to Empty Dashboards" title="Direct link to Empty Dashboards" translate="no">​</a></h3>
<p>Sometimes the infrastructure is healthy but Grafana dashboards remain empty.</p>
<p>This is usually caused by:</p>
<ul>
<li class="">no traffic being generated yet</li>
<li class="">Prometheus scrape failures</li>
<li class="">dashboard provisioning path problems</li>
<li class="">observability overlays not running</li>
</ul>
<p>Useful checks include:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">curl http://localhost:8080/metrics</span><br></div></code></pre></div></div>
<p>and opening the Prometheus targets page in a browser:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">http://localhost:9090/targets</span><br></div></code></pre></div></div>
<p>The release documentation now explains these scenarios more clearly.</p>
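<p>To tell "no traffic yet" apart from a scrape problem, it can help to look at the raw <code>/metrics</code> output directly. The sketch below sums samples per metric name from Prometheus text exposition output. The metric names in the sample are invented for illustration and are not the firewall's real metric names.</p>

```python
from typing import Dict


def sum_samples(metrics_text: str) -> Dict[str, float]:
    """Sum sample values per metric name, ignoring labels and comments."""
    totals: Dict[str, float] = {}
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # Form: name{label="value",...} value [timestamp]
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1 :].split()
        else:
            # Form: name value [timestamp]
            fields = line.split()
            name, rest = fields[0], fields[1:]
        totals[name] = totals.get(name, 0.0) + float(rest[0])
    return totals


# Hypothetical sample; real metric names will differ.
SAMPLE = """\
# HELP example_requests_total Hypothetical request counter.
# TYPE example_requests_total counter
example_requests_total{route="chat"} 7
example_requests_total{route="embeddings"} 5
example_cache_hits_total 0
"""

totals = sum_samples(SAMPLE)
has_traffic = any(value > 0 for value in totals.values())
print(totals, has_traffic)
```

<p>If every counter is still zero, the dashboards are empty simply because no requests have been sent. If the counters are non-zero but Grafana is empty, the problem is more likely in scraping or provisioning.</p>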
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="tls-and-self-signed-certificates">TLS and Self-Signed Certificates<a href="https://blog.vcal-project.com/practical-pilot-deployments-ai-cost-firewall-v0-1-9#tls-and-self-signed-certificates" class="hash-link" aria-label="Direct link to TLS and Self-Signed Certificates" title="Direct link to TLS and Self-Signed Certificates" translate="no">​</a></h3>
<p>OpenAI-compatible providers are often deployed internally with self-signed certificates or non-public trust chains.</p>
<p>As a result, TLS problems became another recurring evaluation issue.</p>
<p>v0.1.9 improves diagnostics for:</p>
<ul>
<li class="">hostname mismatch</li>
<li class="">SAN mismatch</li>
<li class="">self-signed certificates</li>
<li class="">TLS handshake failures</li>
<li class="">provider connectivity failures</li>
</ul>
<p>The objective is to make startup and provider failures more actionable for operators instead of surfacing only generic upstream errors.</p>
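<p>As an illustration of what "more actionable" can mean, the sketch below maps OpenSSL certificate verification codes to operator-facing hints. The codes are standard OpenSSL X509 verification values. The function name and the wording of the hints are hypothetical and do not reproduce the firewall's actual messages.</p>

```python
from typing import Optional

# OpenSSL X509 verification codes. Python exposes the code as
# ssl.SSLCertVerificationError.verify_code.
SELF_SIGNED_CODES = {18, 19}     # self-signed leaf / self-signed cert in chain
UNKNOWN_ISSUER_CODES = {20, 21}  # issuer not found in the local trust store
HOSTNAME_MISMATCH_CODE = 62


def classify_tls_failure(verify_code: Optional[int]) -> str:
    """Turn a verification code into a hint an operator can act on."""
    if verify_code is None:
        return "no verification code: likely a handshake or connectivity failure"
    if verify_code == HOSTNAME_MISMATCH_CODE:
        return "hostname or SAN mismatch: the certificate does not cover this host"
    if verify_code in SELF_SIGNED_CODES:
        return "self-signed certificate: add it or its CA to the trust store"
    if verify_code in UNKNOWN_ISSUER_CODES:
        return "unknown issuer: the CA chain is not trusted on this machine"
    return f"certificate verification failed (OpenSSL code {verify_code})"


print(classify_tls_failure(62))
print(classify_tls_failure(18))
```

<p>In a Python client, the code would typically come from catching <code>ssl.SSLCertVerificationError</code> and reading its <code>verify_code</code> attribute.</p>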
<hr>
<h2>Why This Release Matters</h2>
<p>v0.1.9 is intentionally less focused on introducing major new algorithms or architectural subsystems. Instead, it focuses on operational maturity.</p>
<p>In practice, infrastructure software becomes useful only when people can deploy, observe, troubleshoot, and understand it quickly. That is especially true for semantic caching systems, where invisible behavior can otherwise become difficult to reason about.</p>
<p>This release is therefore an important transition point for AI Cost Firewall. The project is evolving from an experimental semantic cache layer into a more practical operational gateway for OpenAI-compatible AI infrastructure.</p>
<hr>
<h2>Resources</h2>
<p>GitHub:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">https://github.com/vcal-project/ai-firewall</span><br></div></code></pre></div></div>
<p>Documentation:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">https://ai-firewall.docs.vcal-project.com/</span><br></div></code></pre></div></div>]]></content:encoded>
            <category>artificial-intelligence</category>
            <category>llm</category>
            <category>ai-infrastructure</category>
            <category>devops</category>
            <category>observability</category>
        </item>
        <item>
            <title><![CDATA[Not All “AI Security” Is the Same: Application Layer vs AI Cost Firewall]]></title>
            <link>https://blog.vcal-project.com/not-all-ai-security-is-the-same</link>
            <guid>https://blog.vcal-project.com/not-all-ai-security-is-the-same</guid>
            <pubDate>Wed, 22 Apr 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[From prompt safety to system control: the missing layer in production AI]]></description>
            <content:encoded><![CDATA[<blockquote>
<p>Originally published on Medium.com on <strong>April 22, 2026</strong>.<br>
Read the <a href="https://medium.com/@sergey.lunev_27518/not-all-ai-security-is-the-same-application-layer-vs-ai-cost-firewall-5f4c94f173d6" target="_blank" rel="noopener noreferrer" class="">Medium.com</a> version</p>
</blockquote>
<p><img decoding="async" loading="lazy" alt="Cover" src="https://blog.vcal-project.com/assets/images/two-layers-of-llm-security-4490ebc4e595079ba3a8f64e45ec20a6.png" width="1536" height="1024" class="img_ev3q"></p>
<p><em>As LLM applications move from demos into production, many teams double down on one thing: prompt security. They refine system prompts, add guardrails, introduce moderation, and carefully control how users interact with the model. And yet, once real traffic arrives, something unexpected happens.</em></p>
<p>At first, everything works. The demo is smooth, responses are fast, and costs are negligible.</p>
<p>But soon real usage begins. Costs spike, latency becomes inconsistent, errors become harder to understand, and deployments start affecting live requests in subtle ways.</p>
<p>Nothing is obviously broken, but the system no longer feels predictable.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-application-layer-controlling-meaning-and-behavior">The Application Layer: Controlling Meaning and Behavior<a href="https://blog.vcal-project.com/not-all-ai-security-is-the-same#the-application-layer-controlling-meaning-and-behavior" class="hash-link" aria-label="Direct link to The Application Layer: Controlling Meaning and Behavior" title="Direct link to The Application Layer: Controlling Meaning and Behavior" translate="no">​</a></h2>
<p>The application layer is where the logic of an AI product lives. It defines how prompts are constructed, how users interact with the system, and what the model is allowed to do.</p>
<p>This is where most teams focus first — and for good reason. Here, you are dealing with <em>meaning</em>, <em>intent</em>, and <em>safety</em>.</p>
<p>In practice, controlling what the model is allowed to do translates into questions like:</p>
<ul>
<li class="">Can a user manipulate the model through prompt injection?</li>
<li class="">Can sensitive data leak through responses?</li>
<li class="">Are outputs aligned with policy and expectations?</li>
</ul>
<p>To solve this, teams build a combination of structural and defensive controls:</p>
<ul>
<li class="">Structured prompts and system messages</li>
<li class="">Input validation and sanitization</li>
<li class="">Output filtering and moderation</li>
<li class="">Access control and business logic</li>
</ul>
<p>These mechanisms are essential. Without them, the system is exposed at the semantic level.</p>
<p>In short, <em>the application layer protects what the model means and does.</em></p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-missing-layer-controlling-execution-and-behavior-under-load">The Missing Layer: Controlling Execution and Behavior Under Load<a href="https://blog.vcal-project.com/not-all-ai-security-is-the-same#the-missing-layer-controlling-execution-and-behavior-under-load" class="hash-link" aria-label="Direct link to The Missing Layer: Controlling Execution and Behavior Under Load" title="Direct link to The Missing Layer: Controlling Execution and Behavior Under Load" translate="no">​</a></h2>
<p>However, there is another category of problems that has nothing to do with meaning. They emerge only when the system is under real conditions.</p>
<p>The most common ones are:</p>
<ul>
<li class="">Prompts gradually become larger and heavier</li>
<li class="">Similar requests are repeated again and again</li>
<li class="">Upstream providers introduce latency or intermittent failures</li>
<li class="">Errors return in inconsistent formats</li>
<li class="">Deployments interrupt in-flight requests</li>
</ul>
<p>These are not <em>prompt problems</em>. They are <em>system behavior problems</em>. The application layer answers: “Is this prompt safe?” The next layer answers: “Should this request exist at all?”</p>
<p>This is where the <strong>AI Cost Firewall</strong> layer comes in.</p>
<p><img decoding="async" loading="lazy" alt="Screenshot: LLM Security: Two Layers" src="https://blog.vcal-project.com/assets/images/two-layers-of-llm-security-4490ebc4e595079ba3a8f64e45ec20a6.png" width="1536" height="1024" class="img_ev3q"></p>
<p><em>Two distinct layers in LLM systems: the application layer controls meaning and safety, while the AI Cost Firewall controls execution, cost, and reliability.</em></p>
<p>Sitting between the application and the LLM provider, it acts as a <em>control plane for LLM traffic</em>. Its role is not to understand the prompt, but to ensure that every request is handled in a controlled, predictable, and observable way.</p>
<p>At this layer, the focus shifts to questions such as:</p>
<ul>
<li class="">How large is the request?</li>
<li class="">Should this request even reach the provider?</li>
<li class="">Is this a duplicate or semantically similar request?</li>
<li class="">Did the upstream fail, time out, or respond incorrectly?</li>
<li class="">What happens if the system is shutting down?</li>
</ul>
<p>To answer these, the AI Cost Firewall introduces the following guardrails:</p>
<ul>
<li class="">Prompt size supervision and request validation</li>
<li class="">Error classification and normalized responses</li>
<li class="">Timeout handling and upstream protection</li>
<li class="">Exact and semantic caching</li>
<li class="">Readiness checks and graceful shutdown behavior</li>
</ul>
<p><em>This layer protects how the system executes and consumes resources.</em></p>
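<p>Two of these guardrails, request size supervision and normalized errors, can be sketched together. The size limit, error types, and response shape below are hypothetical and chosen only for illustration. They are not AI Cost Firewall's actual defaults or error schema.</p>

```python
import json
from typing import Optional

MAX_BODY_BYTES = 256 * 1024  # hypothetical limit, not a product default


def normalized_error(status: int, error_type: str, message: str) -> dict:
    """One consistent error shape, whatever went wrong."""
    return {"status": status, "error": {"type": error_type, "message": message}}


def validate_request(raw_body: bytes, max_bytes: int = MAX_BODY_BYTES) -> Optional[dict]:
    """Return None if the request may proceed, else a normalized error."""
    # Check size first so oversized bodies are rejected without being parsed.
    if len(raw_body) > max_bytes:
        return normalized_error(
            413, "request_too_large",
            f"body is {len(raw_body)} bytes; limit is {max_bytes}",
        )
    try:
        payload = json.loads(raw_body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return normalized_error(400, "invalid_json", "body is not valid JSON")
    if not isinstance(payload, dict) or not isinstance(payload.get("messages"), list):
        return normalized_error(400, "invalid_request", "'messages' must be a list")
    return None


ok_body = json.dumps({"model": "m", "messages": [{"role": "user", "content": "hi"}]})
print(validate_request(ok_body.encode("utf-8")))
print(validate_request(b"x" * (2 * 1024 * 1024)))
```

<p>Rejecting the request before it reaches the provider is the point: the oversized body costs nothing upstream, and the caller gets the same error format it would get for any other failure.</p>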
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-these-layers-are-not-interchangeable">Why These Layers Are Not Interchangeable<a href="https://blog.vcal-project.com/not-all-ai-security-is-the-same#why-these-layers-are-not-interchangeable" class="hash-link" aria-label="Direct link to Why These Layers Are Not Interchangeable" title="Direct link to Why These Layers Are Not Interchangeable" translate="no">​</a></h2>
<p>It’s tempting to think that strong application-layer security is enough.</p>
<p>The difference becomes obvious when you look at what each layer is actually responsible for.</p>
<p><img decoding="async" loading="lazy" alt="Screenshot: Two Layers of LLM Systems" src="https://blog.vcal-project.com/assets/images/two-layers-comparison-table-03c382dfd6d6e747627a520dbfaf1885.png" width="541" height="274" class="img_ev3q"></p>
<p><em>Two layers of LLM systems: controlling meaning vs controlling behavior</em></p>
<p>A perfectly secured prompt can still result in:</p>
<ul>
<li class="">A 2 MB payload that strains your system</li>
<li class="">Thousands of repeated requests driving unnecessary cost</li>
<li class="">Silent upstream timeouts with no clear diagnostics</li>
</ul>
<p>In other words, semantic safety does not guarantee operational stability.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="a-familiar-pattern-just-not-yet-in-ai">A Familiar Pattern, Just Not Yet in AI<a href="https://blog.vcal-project.com/not-all-ai-security-is-the-same#a-familiar-pattern-just-not-yet-in-ai" class="hash-link" aria-label="Direct link to A Familiar Pattern, Just Not Yet in AI" title="Direct link to A Familiar Pattern, Just Not Yet in AI" translate="no">​</a></h2>
<p>If this separation feels unfamiliar, it’s only because LLM systems are still new.</p>
<p>In traditional web architecture, this distinction is well understood:</p>
<ul>
<li class="">the application handles authentication, authorization, and business logic</li>
<li class="">the infrastructure layer (reverse proxy, API gateway) handles request validation, rate limiting, retries, and observability</li>
</ul>
<p>In that world, tools like Nginx became essential — not because they understand your business logic, but because they control how requests flow through the system.</p>
<p>The same pattern is now emerging in AI systems.</p>
<p><em>AI Cost Firewall plays a role similar to Nginx, but for LLM traffic.</em></p>
<p>It does not interpret prompts or enforce business rules. Instead, it ensures that every request is well-formed, controlled, observable, and efficient before it reaches the model.</p>
<p>And just like in web systems, skipping this layer might work in a demo, but it rarely holds up in production.</p>
<p>No one would deploy a production system that sends raw traffic directly to back-end services without passing through a control layer. And yet, this is exactly how many LLM applications operate today.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-changes-in-production">What Changes in Production<a href="https://blog.vcal-project.com/not-all-ai-security-is-the-same#what-changes-in-production" class="hash-link" aria-label="Direct link to What Changes in Production" title="Direct link to What Changes in Production" translate="no">​</a></h2>
<p>In early prototypes, everything seems fine. Traffic is low, prompts are short, and errors are rare.</p>
<p>However, production changes the situation completely:</p>
<ul>
<li class="">Prompts accumulate context and silently grow</li>
<li class="">Users repeat similar queries in slightly different forms</li>
<li class="">Costs scale faster than usage</li>
<li class="">Failures become harder to classify and debug</li>
</ul>
<p>These issues don’t break the system immediately. They degrade it gradually, until behavior becomes unpredictable.</p>
<p>This is precisely the gap the AI Cost Firewall layer is designed to address.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-key-insight">The Key Insight<a href="https://blog.vcal-project.com/not-all-ai-security-is-the-same#the-key-insight" class="hash-link" aria-label="Direct link to The Key Insight" title="Direct link to The Key Insight" translate="no">​</a></h2>
<p>LLM security is not just about safe prompts. It’s about safe system behavior under real conditions.</p>
<p>The application layer ensures the model behaves correctly. The AI Cost Firewall ensures the system behaves reliably.</p>
<p>Both are required to move from a working demo to a production-grade system.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="final-thought">Final Thought<a href="https://blog.vcal-project.com/not-all-ai-security-is-the-same#final-thought" class="hash-link" aria-label="Direct link to Final Thought" title="Direct link to Final Thought" translate="no">​</a></h2>
<p>Most teams start by asking: “How do we control what the model says?”</p>
<p>But in production, a more important question emerges: “How do we control what the system does under real conditions?”</p>
<p>That’s where the second layer becomes essential.</p>]]></content:encoded>
            <category>artificial-intelligence</category>
            <category>llm</category>
            <category>ai-infrastructure</category>
            <category>cybersecurity</category>
            <category>devops</category>
        </item>
        <item>
            <title><![CDATA[Reducing LLM Costs Is Easy — Until Production Starts]]></title>
            <link>https://blog.vcal-project.com/reducing-llm-costs-is-easy-until-production-starts</link>
            <guid>https://blog.vcal-project.com/reducing-llm-costs-is-easy-until-production-starts</guid>
            <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Originally published on Dev.to on April 13, 2026.]]></description>
            <content:encoded><![CDATA[<blockquote>
<p>Originally published on Dev.to on <strong>April 13, 2026</strong>.<br>
Read the <a href="https://dev.to/vcalproject/reducing-llm-costs-is-easy-until-production-starts-1da9" target="_blank" rel="noopener noreferrer" class="">Dev.to</a> version</p>
</blockquote>
<p>A month ago, I wrote about reducing LLM costs using caching.</p>
<p>The idea is simple: don’t send the same or similar request to the model twice.</p>
<p>It works well in demos. It even works well in early testing.</p>
<p>And then production starts.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="production-reality-where-llm-systems-start-breaking">Production Reality: Where LLM Systems Start Breaking<a href="https://blog.vcal-project.com/reducing-llm-costs-is-easy-until-production-starts#production-reality-where-llm-systems-start-breaking" class="hash-link" aria-label="Direct link to Production Reality: Where LLM Systems Start Breaking" title="Direct link to Production Reality: Where LLM Systems Start Breaking" translate="no">​</a></h2>
<p>At first, everything looks under control. Requests are small, traffic is predictable, and caching delivers immediate savings. You see fewer calls to the model and faster responses. It feels like the problem is solved.</p>
<p>But real systems don’t stay simple for long.</p>
<p>Prompts begin to grow. What used to be a short question turns into a long conversation with accumulated context, system instructions, and sometimes entire documents pasted by users. Requests become heavier, slower, and more expensive in ways that caching alone cannot fix.</p>
<p>At the same time, failures start to blur together. A timeout, a malformed request, and an upstream provider error all look the same from the outside. Without clear separation, debugging becomes guesswork, and cost anomalies become difficult to explain.</p>
<p>Then there’s latency. A request times out — but what actually happened? Was the provider slow? Did the request even reach it? Should you retry it or not? Without visibility into upstream behavior, you’re operating blind.</p>
<p>Even semantic caching, which looks almost magical at first, becomes a tuning problem. Similarity thresholds that worked in testing suddenly feel off. Some responses are reused too aggressively, others not at all. Without insight into what the system is actually doing, you’re left adjusting numbers and hoping for the best. It resembles prompt tuning, except that here the feedback loop is missing.</p>
<p>Finally, the moment that exposes everything: deployment.</p>
<p>You restart the service during traffic, and suddenly there are dropped requests, inconsistent responses, and unpredictable behavior. What worked perfectly in isolation now reveals gaps in lifecycle handling.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-missing-layer-llm-systems-have-no-traffic-control">The Missing Layer: LLM Systems Have No Traffic Control<a href="https://blog.vcal-project.com/reducing-llm-costs-is-easy-until-production-starts#the-missing-layer-llm-systems-have-no-traffic-control" class="hash-link" aria-label="Direct link to The Missing Layer: LLM Systems Have No Traffic Control" title="Direct link to The Missing Layer: LLM Systems Have No Traffic Control" translate="no">​</a></h2>
<p>What all of this points to is a deeper issue.</p>
<p>LLM applications don’t just need optimization. They need a control layer.</p>
<p>In traditional systems, we never send traffic directly to application logic. There is always a layer in front — something that validates, routes, filters, and observes. Tools like <em>Nginx</em> became essential not because they were convenient, but because they made systems predictable.</p>
<p>LLM systems are now facing the same reality.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="from-calling-models-to-controlling-requests">From Calling Models to Controlling Requests<a href="https://blog.vcal-project.com/reducing-llm-costs-is-easy-until-production-starts#from-calling-models-to-controlling-requests" class="hash-link" aria-label="Direct link to From Calling Models to Controlling Requests" title="Direct link to From Calling Models to Controlling Requests" translate="no">​</a></h2>
<p>When you introduce a control layer in front of LLMs, the perspective changes.</p>
<p>The question is no longer just “how do I call the model?” but “should this request reach the model at all?”</p>
<p>Is it valid?
Has it already been answered?
What happens if it fails?</p>
<p>Cost optimization becomes a side effect of something bigger: managing traffic properly.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="from-caching-to-control-what-changed-in-real-deployments">From Caching to Control: What Changed in Real Deployments<a href="https://blog.vcal-project.com/reducing-llm-costs-is-easy-until-production-starts#from-caching-to-control-what-changed-in-real-deployments" class="hash-link" aria-label="Direct link to From Caching to Control: What Changed in Real Deployments" title="Direct link to From Caching to Control: What Changed in Real Deployments" translate="no">​</a></h2>
<p>This is where the AI Cost Firewall evolved.</p>
<p>It started as a caching layer, combining exact matches in Redis with semantic search in Qdrant. That alone eliminated a significant portion of redundant requests.</p>
<p>But real deployments made it clear that caching is only the beginning. The system needed to behave predictably under load, during failures, and across deployments. So the focus shifted.</p>
<p>Readiness and liveness became explicit, separating a healthy process from one that is actually ready to handle traffic. Shutdown behavior was redesigned to drain in-flight requests instead of dropping them. Restarts became controlled events rather than risky moments.</p>
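<p>A minimal Python sketch of that separation, not the project's actual code: the <code>Lifecycle</code> class and all of its method names are hypothetical. It shows a process that stays live while reporting itself not ready, and a shutdown that waits for in-flight requests to finish.</p>

```python
import threading


class Lifecycle:
    """Tracks liveness, readiness, and in-flight requests (illustrative only)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._in_flight = 0
        self._shutting_down = False
        self._dependencies_ok = False

    def mark_dependencies_ok(self) -> None:
        # Called once the caches and upstream are known to be reachable.
        with self._lock:
            self._dependencies_ok = True

    def is_live(self) -> bool:
        # Liveness: the process is running and able to answer at all.
        return True

    def is_ready(self) -> bool:
        # Readiness: dependencies are reachable and we are not draining.
        with self._lock:
            return self._dependencies_ok and not self._shutting_down

    def try_begin_request(self) -> bool:
        # Refuse new work once shutdown has started.
        with self._lock:
            if self._shutting_down:
                return False
            self._in_flight += 1
            return True

    def end_request(self) -> None:
        with self._lock:
            self._in_flight -= 1

    def begin_shutdown(self) -> None:
        with self._lock:
            self._shutting_down = True

    def drained(self) -> bool:
        # Safe to exit only when no request is still being served.
        with self._lock:
            return self._shutting_down and self._in_flight == 0
```

<p>In this sketch a load balancer would poll <code>is_ready()</code>, while the shutdown path would call <code>begin_shutdown()</code> and exit once <code>drained()</code> returns true.</p>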
<p>Errors were no longer just errors. They were classified: validation issues, upstream timeouts, provider failures, internal faults — each telling a different story about what went wrong.</p>
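<p>As an illustration only, the four categories could be modeled like this in Python. The <code>ErrorKind</code> names and the mapping rules in <code>classify</code> are assumptions made for the example, not the gateway's real logic.</p>

```python
from enum import Enum
from typing import Optional


class ErrorKind(Enum):
    VALIDATION = "validation"
    UPSTREAM_TIMEOUT = "upstream_timeout"
    PROVIDER_FAILURE = "provider_failure"
    INTERNAL = "internal"


def classify(exc: Optional[BaseException], upstream_status: Optional[int]) -> ErrorKind:
    """Map a failure to one of four buckets (hypothetical rules)."""
    if isinstance(exc, ValueError):
        # The request itself was rejected before going upstream.
        return ErrorKind.VALIDATION
    if isinstance(exc, TimeoutError):
        # The provider did not answer in time.
        return ErrorKind.UPSTREAM_TIMEOUT
    if upstream_status is not None and upstream_status >= 400:
        # The provider answered, but with an error status.
        return ErrorKind.PROVIDER_FAILURE
    # Anything else is treated as a fault in the gateway itself.
    return ErrorKind.INTERNAL
```

<p>Keeping the buckets in one place means each can get its own metric label and its own normalized response, rather than everything surfacing as a generic failure.</p>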
<p>Upstream behavior stopped being a black box. Latency became measurable, timeouts became visible, and slow responses could finally be distinguished from real failures.</p>
<p>Semantic caching also became observable. Instead of guessing whether it works, you can now see how many candidates are evaluated, how often thresholds pass or fail, and how long lookups take. What used to feel like a heuristic now becomes something you can tune with confidence.</p>
<p>And perhaps most importantly, the system itself became visible while it runs. You can tell whether it is ready, whether it is shutting down, and how it behaves under real traffic — not just in theory.</p>
<p>At this point, semantic caching stops being a black box.</p>
<p>This is what diagnostics visibility looks like in practice:</p>
<p><img decoding="async" loading="lazy" alt="Screenshot: AI Cost Firewall Diagnostics Dashboard" src="https://blog.vcal-project.com/assets/images/ai-firewall-diagnostics-d8436b34e2c880be3c69cf9039811975.png" width="1709" height="730" class="img_ev3q"></p>
<p>Instead of guessing thresholds, you now have feedback. Instead of assumptions, you have data.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-predictability-matters-more-than-features">Why Predictability Matters More Than Features<a href="https://blog.vcal-project.com/reducing-llm-costs-is-easy-until-production-starts#why-predictability-matters-more-than-features" class="hash-link" aria-label="Direct link to Why Predictability Matters More Than Features" title="Direct link to Why Predictability Matters More Than Features" translate="no">​</a></h2>
<p>None of these changes are flashy.</p>
<p>They don’t improve model quality or add new capabilities.</p>
<p>But they solve something more fundamental: they make the system predictable.</p>
<p>And without predictability, cost optimization doesn’t hold.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="caching-starts-it-control-makes-it-work">Caching Starts It, Control Makes It Work<a href="https://blog.vcal-project.com/reducing-llm-costs-is-easy-until-production-starts#caching-starts-it-control-makes-it-work" class="hash-link" aria-label="Direct link to Caching Starts It, Control Makes It Work" title="Direct link to Caching Starts It, Control Makes It Work" translate="no">​</a></h2>
<p>Reducing LLM costs is easy when everything is controlled and small.</p>
<p>It becomes difficult when requests grow, failures mix together, and systems need to operate continuously under real conditions.</p>
<p>At that point, the problem is no longer about saving tokens. It’s about understanding and controlling the flow of requests before they ever reach the model.</p>
<p>Caching is where it starts. Control is what makes it work in production.</p>
<hr>
<p>If you want to experiment with the tool, the AI Cost Firewall project is open-source and designed to run as a drop-in OpenAI-compatible gateway in front of existing applications:</p>
<p><a href="https://github.com/vcal-project/ai-firewall" target="_blank" rel="noopener noreferrer" class="">https://github.com/vcal-project/ai-firewall</a></p>
<blockquote>
<p><em>Built and maintained by the VCAL Project team — feedback and real-world use cases are very welcome.</em></p>
</blockquote>]]></content:encoded>
            <category>ai</category>
            <category>llm</category>
            <category>devops</category>
            <category>open-source</category>
        </item>
        <item>
            <title><![CDATA[What Actually Matters When You Try to Reduce LLM Costs]]></title>
            <link>https://blog.vcal-project.com/what-actually-matters-when-you-try-to-reduce-llm-costs</link>
            <guid>https://blog.vcal-project.com/what-actually-matters-when-you-try-to-reduce-llm-costs</guid>
            <pubDate>Sun, 05 Apr 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Lessons from turning a simple idea into a production-ready OpenAI-compatible gateway]]></description>
            <content:encoded><![CDATA[<blockquote>
<p>Originally published on Medium.com on <strong>April 5, 2026</strong>.<br>
Read the <a href="https://medium.com/@sergey.lunev_27518/what-actually-matters-when-you-try-to-reduce-llm-costs-461ce55facd5" target="_blank" rel="noopener noreferrer" class="">Medium.com</a> version</p>
</blockquote>
<p>After publishing the first release of AI Cost Firewall, I thought the hard part was done.</p>
<p>The idea was simple and it worked immediately: avoid sending duplicate or semantically similar requests to the LLM, and you reduce cost.</p>
<blockquote>
<p>I described that initial approach in more detail here: <a href="https://medium.com/@sergey.lunev_27518/how-to-reduce-openai-api-costs-with-semantic-caching-27a1b934c234?postPublishedType=repub" target="_blank" rel="noopener noreferrer" class="">How to Reduce OpenAI API Costs with Semantic Caching</a></p>
</blockquote>
<p>And it did work.</p>
<p>But once I started pushing it further — adding more metrics, handling edge cases, running real traffic through it — it became clear that the initial idea was only a small part of the problem.</p>
<p>Reducing LLM cost is not just about caching. It’s about understanding where the cost actually comes from, what “savings” really mean, and what begins to break when a system moves from a controlled demo into something closer to production.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-first-insight-still-holds">The First Insight Still Holds<a href="https://blog.vcal-project.com/what-actually-matters-when-you-try-to-reduce-llm-costs#the-first-insight-still-holds" class="hash-link" aria-label="Direct link to The First Insight Still Holds" title="Direct link to The First Insight Still Holds" translate="no">​</a></h2>
<p>The original observation hasn’t changed, and neither has the core architecture. The system still solves the same underlying problem.</p>
<ul>
<li class="">Users repeat themselves.</li>
<li class="">Applications repeat themselves.</li>
<li class="">Agents repeat themselves.</li>
</ul>
<p>Often the wording changes slightly, but the intent remains the same. From the model’s perspective, however, every variation is a brand new request. And every request has a cost.</p>
<p>So yes — caching works. It reduces cost immediately, often without any changes to the application itself.</p>
<p>But that’s only the surface. The deeper questions only appear once you try to rely on it.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-first-misconception-caching-is-free">The First Misconception: “Caching Is Free”<a href="https://blog.vcal-project.com/what-actually-matters-when-you-try-to-reduce-llm-costs#the-first-misconception-caching-is-free" class="hash-link" aria-label="Direct link to The First Misconception: “Caching Is Free”" title="Direct link to The First Misconception: “Caching Is Free”" translate="no">​</a></h2>
<p>In the beginning, the results looked almost too good.</p>
<p>In a demo environment, exact cache hits dominated. When a request hit the cache, it meant no API call, no tokens, and almost zero latency. It felt like pure gain, as if cost reduction came with no trade-offs.</p>
<p>That illusion disappears the moment you introduce semantic caching properly. Because semantic caching requires embeddings.</p>
<p>To determine whether two requests are similar, you first need to convert them into vectors. That means calling an embedding model, storing the result, and comparing it against existing data. Only then can you decide whether to reuse a response or forward the request to the LLM.</p>
<p>And embeddings are not free.</p>
<p>At that point, the equation changes:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">Net savings = avoided LLM cost − embedding cost</span><br></div></code></pre></div></div>
<p>This is where things become more delicate.</p>
<p>If your similarity threshold is too strict, even near-duplicate requests miss, and the embedding you paid for is wasted.
If your traffic is highly unique, most of those embeddings never lead to a cache hit.
If your embedding model is expensive, the optimization starts working against you.</p>
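<p>To make the trade-off concrete, here is a small Python sketch of the equation above. The function names and all prices are invented for illustration, and it assumes every request that reaches the semantic tier is embedded exactly once.</p>

```python
def net_savings(requests: int, semantic_hit_rate: float,
                llm_cost_per_call: float, embedding_cost_per_call: float) -> float:
    """Net savings = avoided LLM cost - embedding cost.

    Assumes each request is embedded once, whether or not it ends in a hit.
    """
    avoided_llm_cost = requests * semantic_hit_rate * llm_cost_per_call
    embedding_cost = requests * embedding_cost_per_call
    return avoided_llm_cost - embedding_cost


def break_even_hit_rate(llm_cost_per_call: float,
                        embedding_cost_per_call: float) -> float:
    """Hit rate at which the embeddings exactly pay for themselves."""
    return embedding_cost_per_call / llm_cost_per_call
```

<p>With illustrative values of 10,000 requests, a 20% semantic hit rate, $0.01 per LLM call, and $0.0001 per embedding, the result is positive (about $19). Drop the hit rate to 0.5% and the same traffic loses money, because the break-even hit rate under these prices is about 1%.</p>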
<p>What initially looked like a simple cost reduction mechanism becomes something that requires careful balance.</p>
<p>That was the moment when the project stopped being just a clever shortcut and started behaving like a system that needs tuning.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="where-the-savings-actually-come-from">Where the Savings Actually Come From<a href="https://blog.vcal-project.com/what-actually-matters-when-you-try-to-reduce-llm-costs#where-the-savings-actually-come-from" class="hash-link" aria-label="Direct link to Where the Savings Actually Come From" title="Direct link to Where the Savings Actually Come From" translate="no">​</a></h2>
<p>Looking at real metrics changed another assumption.</p>
<p>Intuitively, semantic caching feels like the main feature. It’s the “intelligent” part of the system. But in practice, most of the savings come from something much simpler — exact matches.</p>
<p>A surprisingly large portion of traffic is not just similar — it is identical. The same prompt appears again and again, sometimes minutes apart, sometimes hours later. Once you see it in real data, it’s hard to ignore.</p>
<p>Semantic caching still matters, but its role is different. It extends the coverage rather than forming the base.</p>
<p>Without exact caching, the system loses most of its immediate impact. Without semantic caching, you miss additional opportunities. But they are not equal contributors, and treating them as such leads to wrong expectations.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-breaks-in-real-usage">What Breaks in Real Usage<a href="https://blog.vcal-project.com/what-actually-matters-when-you-try-to-reduce-llm-costs#what-breaks-in-real-usage" class="hash-link" aria-label="Direct link to What Breaks in Real Usage" title="Direct link to What Breaks in Real Usage" translate="no">​</a></h2>
<p>As soon as real traffic enters the system, subtle issues begin to surface.</p>
<p>One of the first is <strong>payload size</strong>.</p>
<p>LLM requests tend to grow over time. Prompts accumulate context, system messages expand, and conversation history becomes longer. In some cases, payloads become unexpectedly large, whether naturally or intentionally.</p>
<p>Without limits, a single request can consume disproportionate resources. What seemed like a minor edge case quickly turns into something that needs explicit control.</p>
<p>Another issue is <strong>validation</strong>.</p>
<p>If you accept any model name, any payload structure, and any input format, your system remains flexible, but your metrics lose meaning. Cost calculations become inconsistent, comparisons stop being reliable, and “savings” become difficult to interpret.</p>
<p>Adding strict validation changes that. It makes the system more predictable, but also more opinionated.</p>
<p>At that point, it is no longer just a transparent proxy. It becomes a controlled gateway. And that shift is intentional.</p>
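<p>A rough Python sketch of what such checks might look like. The size limit, the model allow-list, and the <code>validate_request</code> function are hypothetical, and the request shape assumes an OpenAI-style chat body with <code>model</code> and <code>messages</code> fields.</p>

```python
import json

MAX_BODY_BYTES = 256 * 1024                      # hypothetical size limit
ALLOWED_MODELS = {"model-small", "model-large"}  # hypothetical allow-list


def validate_request(raw_body: bytes) -> dict:
    """Reject oversized or malformed requests before any cache or upstream work."""
    if len(raw_body) > MAX_BODY_BYTES:
        raise ValueError("payload too large")
    try:
        body = json.loads(raw_body)
    except ValueError as exc:  # covers invalid JSON and invalid UTF-8
        raise ValueError("body is not valid JSON") from exc
    if not isinstance(body, dict):
        raise ValueError("body must be a JSON object")
    model = body.get("model")
    if not isinstance(model, str) or model not in ALLOWED_MODELS:
        raise ValueError("unknown model")
    messages = body.get("messages")
    if not isinstance(messages, list) or not messages:
        raise ValueError("messages must be a non-empty list")
    return body
```

<p>Checking the byte length before parsing keeps an oversized payload from costing anything beyond reading it, and a known model name is what makes per-request cost figures comparable.</p>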
<blockquote>
<p>It’s worth noting that this layer is not a security firewall in the traditional sense. It does not attempt to detect malicious prompts, prevent prompt injection, or enforce content policies. Those concerns belong to the application layer, where context, user intent, and business logic are better understood. The goal here is different: to control cost, reduce unnecessary requests, and make LLM usage more predictable.</p>
</blockquote>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="architecture-matters-more-than-it-seems">Architecture Matters More Than It Seems<a href="https://blog.vcal-project.com/what-actually-matters-when-you-try-to-reduce-llm-costs#architecture-matters-more-than-it-seems" class="hash-link" aria-label="Direct link to Architecture Matters More Than It Seems" title="Direct link to Architecture Matters More Than It Seems" translate="no">​</a></h2>
<p>From the outside, the architecture still looks simple. One layer in front of the LLM, minimal components, no intrusive changes to the application.</p>
<p>But over time, the reasoning behind each choice becomes more important.</p>
<ul>
<li class="">Redis handles exact matches because it is fast and predictable.</li>
<li class="">Qdrant supports semantic search efficiently without adding unnecessary complexity.</li>
<li class="">Rust allows this layer to sit in the request path without introducing noticeable latency or instability.</li>
</ul>
<p>Individually, these are straightforward decisions.</p>
<p>Together, they define whether the system can operate reliably under load. Because once this layer becomes part of the critical path, it is no longer optional. If it slows down, everything slows down. If it fails, everything fails.</p>
<p>At that point, optimization is no longer the only goal. Stability becomes equally important.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-difference-between-demo-traffic-and-reality">The Difference Between Demo Traffic and Reality<a href="https://blog.vcal-project.com/what-actually-matters-when-you-try-to-reduce-llm-costs#the-difference-between-demo-traffic-and-reality" class="hash-link" aria-label="Direct link to The Difference Between Demo Traffic and Reality" title="Direct link to The Difference Between Demo Traffic and Reality" translate="no">​</a></h2>
<p>It is easy to produce impressive results in a controlled environment.</p>
<ul>
<li class="">high cache hit rates</li>
<li class="">clear cost savings</li>
<li class="">clean, predictable behavior</li>
</ul>
<p>But those results are shaped by the input.</p>
<p>Real systems behave differently. There are always new queries, unexpected variations, and edge cases that were not part of the initial design. You never reach 100% cache hits — and that’s not a failure.</p>
<p>A healthy system still generates misses. It still calls the model. It still adapts to new inputs.</p>
<p>The goal is not to eliminate LLM usage entirely. The goal is to eliminate unnecessary usage. That distinction becomes much more important once you move beyond a demo.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="where-this-is-going">Where This Is Going<a href="https://blog.vcal-project.com/what-actually-matters-when-you-try-to-reduce-llm-costs#where-this-is-going" class="hash-link" aria-label="Direct link to Where This Is Going" title="Direct link to Where This Is Going" translate="no">​</a></h2>
<p>What started as a simple way to avoid duplicate requests is gradually evolving into something broader.</p>
<p>It is no longer just about caching. It becomes a control layer for how LLMs are used:</p>
<ul>
<li class="">cost visibility</li>
<li class="">request validation</li>
<li class="">provider abstraction</li>
<li class="">traffic control</li>
</ul>
<p>Caching is still at the center, but it is no longer the whole story. It is the entry point into a larger set of concerns that appear once LLM usage grows.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="final-thoughts">Final Thoughts<a href="https://blog.vcal-project.com/what-actually-matters-when-you-try-to-reduce-llm-costs#final-thoughts" class="hash-link" aria-label="Direct link to Final Thoughts" title="Direct link to Final Thoughts" translate="no">​</a></h2>
<p>The original idea was simple: Don’t pay twice for the same answer.</p>
<p>That still holds. But applying that idea in a real system reveals a different challenge.</p>
<p>The difficult part is not avoiding the call. It is understanding when avoiding it actually makes sense. Because cost optimization, in practice, is not just about reducing usage. It is about understanding your system well enough to reduce costs in the right way.</p>
<p>You can explore the project here:</p>
<p><a href="https://github.com/vcal-project/ai-firewall" target="_blank" rel="noopener noreferrer" class="">https://github.com/vcal-project/ai-firewall</a></p>]]></content:encoded>
            <category>artificial-intelligence</category>
            <category>machine-learning</category>
            <category>openai</category>
            <category>software-architecture</category>
            <category>cost-optimization</category>
        </item>
        <item>
            <title><![CDATA[How to Reduce OpenAI API Costs with Semantic Caching]]></title>
            <link>https://blog.vcal-project.com/how-to-reduce-openai-api-costs-with-semantic-caching</link>
            <guid>https://blog.vcal-project.com/how-to-reduce-openai-api-costs-with-semantic-caching</guid>
            <pubDate>Sat, 21 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[A simple OpenAI-compatible gateway that eliminates duplicate requests and cuts token usage]]></description>
            <content:encoded><![CDATA[<blockquote>
<p>Originally published on Medium.com on <strong>March 21, 2026</strong>.<br>
Read the <a href="https://medium.com/@sergey.lunev_27518/how-to-reduce-openai-api-costs-with-semantic-caching-27a1b934c234" target="_blank" rel="noopener noreferrer" class="">Medium.com</a> version</p>
</blockquote>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="a-simple-openai-compatible-gateway-that-eliminates-duplicate-requests-and-cuts-token-usage">A simple OpenAI-compatible gateway that eliminates duplicate requests and cuts token usage<a href="https://blog.vcal-project.com/how-to-reduce-openai-api-costs-with-semantic-caching#a-simple-openai-compatible-gateway-that-eliminates-duplicate-requests-and-cuts-token-usage" class="hash-link" aria-label="Direct link to A simple OpenAI-compatible gateway that eliminates duplicate requests and cuts token usage" title="Direct link to A simple OpenAI-compatible gateway that eliminates duplicate requests and cuts token usage" translate="no">​</a></h3>
<p>While working on LLM-powered tools for my customer, I kept seeing something that didn’t feel right.</p>
<p>Users were asking the same or similar questions again and again. Support queries repeated. Internal assistants received nearly identical prompts. Even AI agents were looping through similar requests.</p>
<p>At first, it didn’t look like a problem. That’s just how users behave.</p>
<p>But then I looked at the cost.</p>
<p>Every repeated question meant another API call. Another batch of tokens. Another charge. Over time, it added up to more than expected.</p>
<p>I realized something simple:</p>
<p><strong>We are paying multiple times for the same answer.</strong></p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-existing-solutions-didnt-quite-work">Why Existing Solutions Didn’t Quite Work<a href="https://blog.vcal-project.com/how-to-reduce-openai-api-costs-with-semantic-caching#why-existing-solutions-didnt-quite-work" class="hash-link" aria-label="Direct link to Why Existing Solutions Didn’t Quite Work" title="Direct link to Why Existing Solutions Didn’t Quite Work" translate="no">​</a></h2>
<p>Initially, I looked at the available tools.</p>
<p>Redis helped with exact caching, but only when the prompt was identical. The moment a user rephrased the question slightly, the cache missed. “How do I get access to Jira?” and “Cannot get access to Jira” were treated as completely different requests.</p>
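<p>The limitation is easy to see in a short Python sketch. The key scheme below (a SHA-256 hash of model plus prompt) is an assumption for illustration, not necessarily how any particular cache derives its keys.</p>

```python
import hashlib


def exact_cache_key(model: str, prompt: str) -> str:
    """Derive a cache key from the exact request text (hypothetical key scheme)."""
    digest = hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()
    return f"exact:{digest}"


a = exact_cache_key("model-small", "How do I get access to Jira?")
b = exact_cache_key("model-small", "Cannot get access to Jira")
c = exact_cache_key("model-small", "How do I get access to Jira?")

print(a == c)  # identical text maps to the same key
print(a == b)  # a rephrased question maps to a different key
```

<p>Any change in wording, even a single character, produces a different key, which is why an exact cache alone cannot catch rephrased questions.</p>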
<p>I also explored RedisVL, which brings vector search capabilities into Redis. It moves in the right direction by combining caching and similarity in one place. But in practice, it still requires setting up embedding flows, defining schemas, tuning similarity thresholds, and integrating it manually into the LLM request pipeline.</p>
<p>Vector databases like Milvus, Weaviate, or Qdrant seemed promising as well. They can detect semantic similarity effectively, but integrating them into the request flow means building additional pipelines, managing embeddings, and writing glue code.</p>
<p>All of these tools are powerful, but they aren’t simple.</p>
<p>More importantly, none of them are designed as a drop-in layer in front of an LLM API. There was no unified solution that combined caching, semantic matching, and cost awareness in one place.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-i-built-instead">What I Built Instead<a href="https://blog.vcal-project.com/how-to-reduce-openai-api-costs-with-semantic-caching#what-i-built-instead" class="hash-link" aria-label="Direct link to What I Built Instead" title="Direct link to What I Built Instead" translate="no">​</a></h2>
<p>At some point, I decided to take a step back and ask a simple question:</p>
<p>What if we just put one smart layer in front of the LLM?</p>
<p>That’s how the AI Cost Firewall started.</p>
<p>Instead of modifying applications or adding complex pipelines, AI Cost Firewall intercepts requests before they reach the model. If it has already seen an identical or similar request, it returns the cached response. If not, it forwards the request, stores the result, and moves on.</p>
<p>From the application’s perspective, nothing changes. It still talks to an OpenAI-compatible API.</p>
<p>But behind the scenes, unnecessary calls disappear.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="how-it-works-without-the-complexity">How It Works (Without the Complexity)<a href="https://blog.vcal-project.com/how-to-reduce-openai-api-costs-with-semantic-caching#how-it-works-without-the-complexity" class="hash-link" aria-label="Direct link to How It Works (Without the Complexity)" title="Direct link to How It Works (Without the Complexity)" translate="no">​</a></h2>
<p><img decoding="async" loading="lazy" alt="Screenshot: AI Cost Firewall Architecture" src="https://blog.vcal-project.com/assets/images/ai-cost-firewall-diagram-917e92c7a9df51a5f4da379b4cfcaaa7.png" width="602" height="502" class="img_ev3q"></p>
<p>I intentionally kept the architecture minimal.</p>
<p>At the core, there’s a Rust-based API gateway that speaks the same language as the OpenAI API. For caching, I use Redis for exact matches and Qdrant for semantic similarity. Prometheus and Grafana provide visibility into what’s happening.</p>
<p>A request comes in, we check the cache, and only if needed do we call the LLM.</p>
<p>That’s it.</p>
<p>No SDK rewrites. No major architectural changes. Just one additional layer.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-i-chose-rust">Why I Chose Rust<a href="https://blog.vcal-project.com/how-to-reduce-openai-api-costs-with-semantic-caching#why-i-chose-rust" class="hash-link" aria-label="Direct link to Why I Chose Rust" title="Direct link to Why I Chose Rust" translate="no">​</a></h2>
<p>Since this component sits directly in the request path, performance matters.</p>
<p>I chose Rust because it provides low latency and predictable performance without garbage collection pauses. It handles concurrency well and keeps the memory footprint small, which makes it ideal for containerized deployments.</p>
<p>Most importantly, we can trust it not to become the bottleneck.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-i-open-sourced-it">Why I Open-Sourced It<a href="https://blog.vcal-project.com/how-to-reduce-openai-api-costs-with-semantic-caching#why-i-open-sourced-it" class="hash-link" aria-label="Direct link to Why I Open-Sourced It" title="Direct link to Why I Open-Sourced It" translate="no">​</a></h2>
<p>This layer sits between the application and the AI provider. That’s a sensitive place.</p>
<p>I felt it had to be transparent and auditable. Open source makes it easier to trust, easier to adopt, and easier to extend.</p>
<p>It also keeps the core idea simple: reducing costs shouldn’t introduce new risks or lock you into a vendor.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="getting-started-in-minutes">Getting Started in Minutes<a href="https://blog.vcal-project.com/how-to-reduce-openai-api-costs-with-semantic-caching#getting-started-in-minutes" class="hash-link" aria-label="Direct link to Getting Started in Minutes" title="Direct link to Getting Started in Minutes" translate="no">​</a></h2>
<p>I wanted the setup to be as simple as possible.</p>
<p>Clone the repository, start Docker, and point your application to a new endpoint.</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">git clone https://github.com/vcal-project/ai-firewall</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">cd ai-firewall</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">cp configs/ai-firewall.conf.example configs/ai-firewall.conf</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">nano configs/ai-firewall.conf # Replace the placeholders with your API keys</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">docker compose up -d</span><br></div></code></pre></div></div>
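<p>After that, send your requests to the firewall’s chat completions endpoint instead of the provider’s (if you use an OpenAI SDK, set its base URL to <code>http://localhost:8080/v1</code>):</p>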
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">http://localhost:8080/v1/chat/completions</span><br></div></code></pre></div></div>
<p>That’s the entire integration.</p>
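<p>For a client that does not use an SDK, the switch could look roughly like the sketch below. It only uses the Python standard library. The model name, the placeholder API key, and the assumption that the gateway accepts a standard <code>Authorization</code> header are mine, not taken from the project documentation.</p>

```python
import json
import urllib.request

# Local gateway endpoint from the setup steps above.
FIREWALL_URL = "http://localhost:8080/v1/chat/completions"


def build_chat_request(question, model="gpt-4o-mini", api_key="YOUR_API_KEY"):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,  # hypothetical model name; use whatever your upstream serves
        "messages": [{"role": "user", "content": question}],
    }
    return urllib.request.Request(
        FIREWALL_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: the gateway accepts the usual bearer token header.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request):
    """Send the request; requires the Docker stack to be running."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("How do I get access to Jira?")
print(request.get_method(), request.full_url)
```

<p>Calling <code>send(request)</code> twice with the same question should, if the cache behaves as described, reach the LLM only once.</p>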
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-changed-for-me">What Changed for Me<a href="https://blog.vcal-project.com/how-to-reduce-openai-api-costs-with-semantic-caching#what-changed-for-me" class="hash-link" aria-label="Direct link to What Changed for Me" title="Direct link to What Changed for Me" translate="no">​</a></h2>
<p>Once I started using this approach, two things became obvious.</p>
<p>First, a surprisingly large portion of requests was served directly from cache, because the same questions had already been answered before.</p>
<p>Second, response times improved whenever the cache was hit.</p>
<p>I didn’t need to optimize prompts or switch models to see an effect. Just avoiding redundant calls made a noticeable difference.</p>
<p>To make this visible, I added a simple Grafana dashboard.</p>
<p>It shows how many requests are served from cache vs forwarded to the LLM, along with the estimated cost savings in real time.</p>
<p><img decoding="async" loading="lazy" alt="Screenshot: Grafana dashboard showing cache hits and cost saving" src="https://blog.vcal-project.com/assets/images/ai-firewall-dashboard-60f307e0961674d36dd9bcd9aa243ee1.png" width="1097" height="630" class="img_ev3q"></p>
<p>The key metrics are:</p>
<ul>
<li class="">cache hit ratio (how many requests never reach the LLM)</li>
<li class="">total tokens saved</li>
<li class="">estimated cost savings</li>
</ul>
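<p>To show how these three numbers relate, here is a small sketch of the arithmetic. The counters and the price per 1,000 tokens are made-up illustration values, and the class is hypothetical rather than part of the firewall.</p>

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    cache_hits: int    # requests answered from cache
    forwarded: int     # requests sent on to the LLM
    tokens_saved: int  # tokens the cached answers would otherwise have cost

    def hit_ratio(self) -> float:
        total = self.cache_hits + self.forwarded
        return self.cache_hits / total if total else 0.0

    def estimated_savings(self, price_per_1k_tokens: float) -> float:
        return self.tokens_saved / 1000 * price_per_1k_tokens


# Illustration values only, not measurements.
stats = CacheStats(cache_hits=300, forwarded=700, tokens_saved=150_000)
print(f"hit ratio: {stats.hit_ratio():.0%}")
print(f"estimated savings: ${stats.estimated_savings(0.002):.2f}")
```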
<p>What surprised me most was how quickly the savings accumulated even with relatively small traffic.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-comes-next">What Comes Next<a href="https://blog.vcal-project.com/how-to-reduce-openai-api-costs-with-semantic-caching#what-comes-next" class="hash-link" aria-label="Direct link to What Comes Next" title="Direct link to What Comes Next" translate="no">​</a></h2>
<p>I see this as a starting point rather than a finished product.</p>
<p>Next, I’m focusing on adding support for other LLM providers beyond OpenAI. Expanding analytics is another priority, along with exploring multi-model setups and smarter routing.</p>
<p>There’s still a lot to build — and that’s exactly the point.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="final-thoughts">Final Thoughts<a href="https://blog.vcal-project.com/how-to-reduce-openai-api-costs-with-semantic-caching#final-thoughts" class="hash-link" aria-label="Direct link to Final Thoughts" title="Direct link to Final Thoughts" translate="no">​</a></h2>
<p>AI costs don’t spike all at once. They grow quietly, request by request.</p>
<p>And in many cases, a large part of that cost is unnecessary.</p>
<p>We didn’t need a more complex system to reduce it. We just needed to stop sending the same request twice.</p>
<p>Sometimes the most effective optimization is the simplest one:</p>
<p>Not calling the model at all.</p>
<hr>
<p>If you’re running LLM-powered tools and want to reduce costs without changing your application architecture, you can try it here:</p>
<p><a href="https://github.com/vcal-project/ai-firewall" target="_blank" rel="noopener noreferrer" class="">https://github.com/vcal-project/ai-firewall</a></p>]]></content:encoded>
            <category>artificial-intelligence</category>
            <category>openai</category>
            <category>devops</category>
            <category>software-engineering</category>
            <category>startups</category>
        </item>
        <item>
            <title><![CDATA[AI Cost Firewall: An OpenAI-Compatible Gateway That Cuts LLM Costs by 75% ]]></title>
            <link>https://blog.vcal-project.com/ai-cost-firewall-an-openai-compatible-gateway</link>
            <guid>https://blog.vcal-project.com/ai-cost-firewall-an-openai-compatible-gateway</guid>
            <pubDate>Mon, 16 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Exact + semantic caching for AI applications ]]></description>
            <content:encoded><![CDATA[<blockquote>
<p>Originally published on Dev.to on <strong>March 16, 2026</strong>.<br>
<!-- -->Read the <a href="https://dev.to/vcalproject/ai-cost-firewall-an-openai-compatible-gateway-that-cuts-llm-costs-by-75-3lh1" target="_blank" rel="noopener noreferrer" class="">Dev.to</a> version</p>
</blockquote>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="exact--semantic-caching-for-ai-applications">Exact + semantic caching for AI applications<a href="https://blog.vcal-project.com/ai-cost-firewall-an-openai-compatible-gateway#exact--semantic-caching-for-ai-applications" class="hash-link" aria-label="Direct link to Exact + semantic caching for AI applications" title="Direct link to Exact + semantic caching for AI applications" translate="no">​</a></h3>
<hr>
<p>AI adoption is shifting from integrating AI into business processes to controlling what it costs, whether that is a cloud service, a local LLM deployment, or the tokens spent by chatbots. If your workload includes repeated questions and talks to an OpenAI-compatible model, and you want a simple, free way to cut daily token costs, there is an infrastructure-level solution that works out of the box.</p>
<p><em>AI Cost Firewall</em> is a free open-source API gateway that decides which requests actually need to reach the LLM and which can be answered from previous results without additional token costs.</p>
<p>The gateway consists of a Rust-based firewall “decider”, a Redis database, a Qdrant vector store, Prometheus for metrics scraping, and Grafana for monitoring. All the tools are deployed with a single <code>docker compose</code> command and are available for use in less than a minute.</p>
<p>Once deployed, AI Cost Firewall sits transparently between your application and the LLM provider. Your chatbot, AI assistant, or internal automation keeps sending requests exactly as before; the only difference is that the API endpoint now points to the firewall instead of directly to the model provider. The firewall then runs a quick check to decide whether the request actually needs to reach the LLM and add to your monthly bill.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="how-the-firewall-reduces-token-costs">How the firewall reduces token costs<a href="https://blog.vcal-project.com/ai-cost-firewall-an-openai-compatible-gateway#how-the-firewall-reduces-token-costs" class="hash-link" aria-label="Direct link to How the firewall reduces token costs" title="Direct link to How the firewall reduces token costs" translate="no">​</a></h2>
<p>AI Cost Firewall eliminates unnecessary token spend using two layers of caching.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="exact-match-cache-redis--valkey">Exact match cache (Redis / Valkey)<a href="https://blog.vcal-project.com/ai-cost-firewall-an-openai-compatible-gateway#exact-match-cache-redis--valkey" class="hash-link" aria-label="Direct link to Exact match cache (Redis / Valkey)" title="Direct link to Exact match cache (Redis / Valkey)" translate="no">​</a></h3>
<p>The first step is a fast exact-match check. Each incoming request is normalized and hashed. If an identical request was processed before, the firewall immediately returns the stored response from Redis. The lookup is an in-memory key read that typically completes in well under a millisecond and costs zero tokens. For workloads with frequent identical prompts, such as customer support or internal documentation assistants, this alone can already remove a significant portion of LLM traffic.</p>
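<p>The normalize-and-hash step can be sketched as follows. This is an illustration under my own assumptions: the normalization rules (trimming whitespace, ignoring case), the <code>exact:</code> key prefix, and the model name are hypothetical and may differ from what the firewall actually does.</p>

```python
import hashlib
import json


def cache_key(model, messages):
    """Return a deterministic key for an OpenAI-style chat request."""
    normalized = {
        "model": model.strip().lower(),
        "messages": [
            {
                "role": m["role"],
                # Collapse whitespace and ignore case so trivial differences still match.
                "content": " ".join(m["content"].split()).lower(),
            }
            for m in messages
        ],
    }
    # Sorted keys and fixed separators make the JSON text canonical.
    canonical = json.dumps(normalized, sort_keys=True, separators=(",", ":"))
    return "exact:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = cache_key("gpt-4o-mini", [{"role": "user", "content": "What is  Kubernetes?"}])
b = cache_key("gpt-4o-mini", [{"role": "user", "content": "what is kubernetes?"}])
c = cache_key("gpt-4o-mini", [{"role": "user", "content": "What is Docker?"}])
print(a == b)  # True: same request after normalization
print(a == c)  # False: different question
```

<p>The resulting string can be used directly as a Redis key, with the serialized response as its value.</p>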
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="semantic-cache-qdrant">Semantic cache (Qdrant)<a href="https://blog.vcal-project.com/ai-cost-firewall-an-openai-compatible-gateway#semantic-cache-qdrant" class="hash-link" aria-label="Direct link to Semantic cache (Qdrant)" title="Direct link to Semantic cache (Qdrant)" translate="no">​</a></h3>
<p>The second layer handles questions that are semantically similar but not identical.</p>
<p>For example:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">User A: Provide a one-sentence explanation of what Kubernetes is.</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">User B: What is Kubernetes? Give me a one-sentence explanation.</span><br></div></code></pre></div></div>
<p>Even though the wording differs, the meaning of these two questions is essentially the same (if you are interested in what semantics means in AI, have a look at my article <a href="https://blog.vcal-project.com/from-words-to-vectors-how-semantics-traveled-from-linguistics-to-large-language-models" target="_blank" rel="noopener noreferrer" class="">From words to vectors: how semantics traveled from linguistics to Large Language Models</a>).</p>
<p>To detect these situations, AI Cost Firewall uses semantic vector search. Each request is embedded using a lightweight embedding model, and the resulting vector is compared against previously stored queries in Qdrant, a high-performance vector database built for similarity search. If the similarity score exceeds a set threshold, the firewall returns the previously generated answer instead of sending the request to the LLM again. In this way, a single LLM response can be reused dozens or even hundreds of times without paying for a new completion.</p>
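<p>The threshold check itself is simple. The sketch below uses plain cosine similarity over toy three-dimensional vectors in place of a real embedding model and Qdrant; the <code>0.9</code> threshold, the vectors, and the cached answers are illustrative assumptions.</p>

```python
import math


def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def semantic_lookup(query_vec, stored, threshold=0.9):
    """Return the cached answer of the most similar stored query, or None."""
    best_answer, best_score = None, threshold
    for vec, answer in stored:
        score = cosine_similarity(query_vec, vec)
        if score >= best_score:
            best_answer, best_score = answer, score
    return best_answer


# Toy 3-dimensional "embeddings"; real ones have hundreds of dimensions.
stored = [
    ([0.9, 0.1, 0.0], "Kubernetes is a container orchestration platform."),
    ([0.0, 0.2, 0.9], "Redis is an in-memory data store."),
]
print(semantic_lookup([0.88, 0.12, 0.01], stored))  # close to the first vector
print(semantic_lookup([0.5, 0.5, 0.5], stored))     # None: nothing above the threshold
```

<p>A lower threshold produces more cache hits but raises the risk of returning an answer to a question that only looks similar, so the value is worth tuning per workload.</p>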
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="forwarding-only-when-necessary">Forwarding only when necessary<a href="https://blog.vcal-project.com/ai-cost-firewall-an-openai-compatible-gateway#forwarding-only-when-necessary" class="hash-link" aria-label="Direct link to Forwarding only when necessary" title="Direct link to Forwarding only when necessary" translate="no">​</a></h3>
<p>If neither the exact cache nor the semantic cache contains a suitable answer, the firewall forwards the request to your upstream model provider. The response is returned to the user and also stored in both Redis and Qdrant for future reuse.
The simplified workflow therefore becomes:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">Client → AI Cost Firewall</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">           ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      Redis check</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">           ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">     Qdrant semantic check</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">           ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">   (only if needed)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">           ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        LLM API</span><br></div></code></pre></div></div>
<p>The LLM is only called when a genuinely new question appears.</p>
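<p>Put together, the decision logic can be sketched in a few lines. This is an in-memory illustration, not the firewall’s Rust code: dictionaries stand in for Redis and Qdrant, and <code>same_words</code> is a deliberately crude, hypothetical substitute for embedding similarity.</p>

```python
class CachingGateway:
    """Exact cache first, then semantic cache, then the upstream model."""

    def __init__(self, call_llm, find_similar):
        self.call_llm = call_llm          # prompt -> answer (costs tokens)
        self.find_similar = find_similar  # (prompt, known prompts) -> prompt or None
        self.exact = {}                   # stands in for Redis
        self.semantic = {}                # stands in for Qdrant

    def handle(self, prompt):
        if prompt in self.exact:
            return self.exact[prompt], "exact"
        match = self.find_similar(prompt, list(self.semantic))
        if match is not None:
            return self.semantic[match], "semantic"
        answer = self.call_llm(prompt)
        # Store in both layers for future reuse.
        self.exact[prompt] = answer
        self.semantic[prompt] = answer
        return answer, "llm"


def same_words(prompt, known_prompts):
    """Toy similarity: same set of words, ignoring order, case and punctuation."""
    def words(text):
        return frozenset(text.lower().replace("?", "").replace(".", "").split())
    for known in known_prompts:
        if words(known) == words(prompt):
            return known
    return None


calls = []


def fake_llm(prompt):
    calls.append(prompt)
    return f"answer to: {prompt}"


gateway = CachingGateway(fake_llm, same_words)
print(gateway.handle("What is Kubernetes?")[1])  # llm
print(gateway.handle("What is Kubernetes?")[1])  # exact
print(gateway.handle("Kubernetes is what?")[1])  # semantic
print(len(calls))                                # 1
```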
<p>With this approach, AI Cost Firewall not only cuts costs but also shortens response times on cache hits, which improves user satisfaction (Customer Satisfaction Score, CSAT).</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="openai-compatible-by-design">OpenAI compatible by design<a href="https://blog.vcal-project.com/ai-cost-firewall-an-openai-compatible-gateway#openai-compatible-by-design" class="hash-link" aria-label="Direct link to OpenAI compatible by design" title="Direct link to OpenAI compatible by design" translate="no">​</a></h2>
<p>One of the most practical aspects of AI Cost Firewall is that integrating it requires no changes to your application logic. You simply switch the base URL to the firewall’s endpoint:</p>
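<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">from openai import OpenAI</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"># Point the SDK at the firewall instead of the provider</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">client = OpenAI(</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    base_url="http://localhost:8080/v1"</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">)</span><br></div></code></pre></div></div>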
<p>From the application’s perspective, nothing changes. The same requests and responses flow through the system. The difference is that the firewall now decides whether the model actually needs to be called and money spent.</p>
<p>This tool is compatible with:</p>
<ul>
<li class="">OpenAI models</li>
<li class="">Azure OpenAI</li>
<li class="">local OpenAI-compatible servers</li>
<li class="">many hosted inference platforms</li>
</ul>
<p>In other words, any system that already works with the OpenAI API can benefit from cost reduction right away. The project developers also plan to add support for other model APIs soon.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="observability-built-in">Observability built in<a href="https://blog.vcal-project.com/ai-cost-firewall-an-openai-compatible-gateway#observability-built-in" class="hash-link" aria-label="Direct link to Observability built in" title="Direct link to Observability built in" translate="no">​</a></h2>
<p>AI Cost Firewall ships with built-in monitoring: Prometheus scrapes the metrics, and a Grafana dashboard visualizes them. Both services are launched automatically by <code>docker compose</code> using a preconfigured Prometheus YAML file and a prebuilt Grafana dashboard JSON, so the monitoring stack is ready immediately without any manual configuration.</p>
<p>Prometheus metrics allow you to track:</p>
<ul>
<li class="">number of cache hits</li>
<li class="">semantic matches</li>
<li class="">forwarded requests</li>
<li class="">estimated cost savings</li>
<li class="">active requests</li>
</ul>
<p>You can immediately visualize these metrics with the Grafana dashboard to see how much the firewall is saving in near real time (to be precise, with a 5-second delay).</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-it-works-well">Why it works well<a href="https://blog.vcal-project.com/ai-cost-firewall-an-openai-compatible-gateway#why-it-works-well" class="hash-link" aria-label="Direct link to Why it works well" title="Direct link to Why it works well" translate="no">​</a></h2>
<p>AI Cost Firewall works because it targets a structural characteristic present in almost every LLM application:</p>
<ul>
<li class="">repeated user questions</li>
<li class="">overlapping knowledge queries</li>
<li class="">duplicated agent prompts</li>
</ul>
<p>By caching responses and using semantic similarity search, the system converts repeated LLM calls into near-zero-cost lookups.</p>
<p>Why near-zero and not fully zero? Because semantic matching still requires generating embeddings for incoming queries. However, embedding costs are typically orders of magnitude lower than generating full LLM responses.</p>
<p>Another advantage of the firewall is its intentionally minimal architecture:</p>
<ul>
<li class="">Rust firewall gateway</li>
<li class="">Redis for exact caching</li>
<li class="">Qdrant for semantic caching</li>
<li class="">Prometheus + Grafana for monitoring</li>
</ul>
<p>This simplicity makes it easy to deploy, maintain, and scale.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="when-ai-cost-firewall-is-most-effective">When AI Cost Firewall is most effective<a href="https://blog.vcal-project.com/ai-cost-firewall-an-openai-compatible-gateway#when-ai-cost-firewall-is-most-effective" class="hash-link" aria-label="Direct link to When AI Cost Firewall is most effective" title="Direct link to When AI Cost Firewall is most effective" translate="no">​</a></h2>
<p>The biggest savings with AI Cost Firewall occur in systems where similar questions appear frequently. You will benefit from the integration right away if your system includes one or more of the following components:</p>
<ul>
<li class="">customer support chatbots</li>
<li class="">internal company knowledge assistants</li>
<li class="">documentation Q&amp;A systems</li>
<li class="">developer copilots</li>
<li class="">AI help desks</li>
<li class="">AI agents performing any of the above tasks</li>
</ul>
<p>In these environments, the same core questions appear repeatedly across many users. Even when questions are phrased differently, the semantic cache can reuse the same answer multiple times.</p>
<p>Advanced users may also appreciate the integrated TTL (time-to-live) setting, which controls how long a response is kept in Redis before it expires and is replaced with a newly generated one. The same feature for Qdrant is currently under development and will be introduced soon.</p>
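<p>To illustrate what a TTL does to cached answers, here is a minimal in-memory sketch. Redis provides key expiration natively, so the firewall does not need code like this; the class, the one-hour value, and the injectable clock are hypothetical and exist only to make the behavior visible.</p>

```python
import time


class TTLCache:
    """Minimal in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.entries = {}  # key -> (expires_at, value)

    def set(self, key, value):
        self.entries[key] = (self.clock() + self.ttl_seconds, value)

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self.entries[key]  # expired: the next request goes to the LLM
            return None
        return value


now = [0.0]  # fake clock so the example runs instantly
cache = TTLCache(ttl_seconds=3600, clock=lambda: now[0])
cache.set("what is kubernetes?", "cached answer")
print(cache.get("what is kubernetes?"))  # cached answer
now[0] = 3601.0
print(cache.get("what is kubernetes?"))  # None
```

<p>A short TTL keeps answers fresh for fast-changing content, while a long TTL maximizes reuse for stable knowledge such as product documentation.</p>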
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="try-it-in-60-seconds">Try it in 60 seconds<a href="https://blog.vcal-project.com/ai-cost-firewall-an-openai-compatible-gateway#try-it-in-60-seconds" class="hash-link" aria-label="Direct link to Try it in 60 seconds" title="Direct link to Try it in 60 seconds" translate="no">​</a></h2>
<p>If you want to see how AI Cost Firewall works, you can deploy the whole stack locally or on a small server in less than a minute.</p>
<p>Example:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">git clone https://github.com/vcal-project/ai-firewall</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">cd ai-firewall</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">cp configs/ai-firewall.conf.example configs/ai-firewall.conf</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">nano configs/ai-firewall.conf # Replace the placeholders with your API keys</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">docker compose up -d</span><br></div></code></pre></div></div>
<p>This launches:</p>
<ul>
<li class="">AI Cost Firewall</li>
<li class="">Redis (exact cache)</li>
<li class="">Qdrant (semantic cache)</li>
<li class="">Prometheus (metrics scraping)</li>
<li class="">Grafana (monitoring dashboard)</li>
</ul>
<p>Once the containers start, simply point your OpenAI client to:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">http://localhost:8080/v1</span><br></div></code></pre></div></div>
<p>Within seconds, the gateway is ready to accept OpenAI-compatible requests. As with Nginx and other infrastructure gateways, setup happens in a configuration file: you only need to add your API keys to it. From that point on, every request automatically passes through the cost-saving pipeline while previous responses are stored in Redis and Qdrant for future reuse.</p>
<p>Because the gateway itself is stateless, multiple firewall instances can be deployed behind a load balancer, allowing the system to scale horizontally with growing traffic.</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">                         ┌───────────────────────┐</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                         │        Clients        │</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                         │  Chatbots / Agents /  │</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                         │  Internal AI Apps     │</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                         └───────────┬───────────┘</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                                     │</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                                     ▼</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                         ┌───────────────────────┐</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                         │     Load Balancer     │</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                         │   Nginx / HAProxy     │</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                         └───────────┬───────────┘</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                                     │</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                ┌────────────────────┼────────────────────┐</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                │                    │                    │</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                ▼                    ▼                    ▼</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">     ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">     │ AI Cost Firewall │  │ AI Cost Firewall │  │ AI Cost Firewall │</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">     │    instance 1    │  │    instance 2    │  │    instance 3    │</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">     └─────────┬────────┘  └─────────┬────────┘  └─────────┬────────┘</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">               │                     │                     │</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">               └─────────────┬───────┴───────────┬─────────┘</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                             │                   │</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                             ▼                   ▼</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                 ┌──────────────────┐   ┌──────────────────┐</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                 │   Redis / Valkey │   │      Qdrant      │</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                 │  Exact cache     │   │  Semantic cache  │</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                 └─────────┬────────┘   └─────────┬────────┘</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                           │                      │</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                           └──────────┬───────────┘</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                                      │</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                                      ▼</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                         ┌───────────────────────┐</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                         │  Upstream LLM API     │</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                         │ OpenAI / Azure / vLLM │</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                         └───────────────────────┘</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">     ┌─────────────────────── Observability Stack ─────────────────┐</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">     │                                                             │</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">     │  AI Cost Firewall metrics ─────► Prometheus ─────► Grafana  │</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">     │                                                             │</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">     └─────────────────────────────────────────────────────────────┘</span><br></div></code></pre></div></div>
<p>Architecture of a horizontally scalable AI Cost Firewall deployment.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="conclusion">Conclusion<a href="https://blog.vcal-project.com/ai-cost-firewall-an-openai-compatible-gateway#conclusion" class="hash-link" aria-label="Direct link to Conclusion" title="Direct link to Conclusion" translate="no">​</a></h2>
<p>With the growing adoption of AI, it can be painful to watch company spending on LLM tokens climb steadily. Large Language Models are extremely powerful, but they can also become quite expensive when used at scale. It feels even more unfair when you realize that a significant portion of LLM traffic consists of repeated or semantically similar questions.</p>
<p>AI Cost Firewall addresses this inefficiency with a simple idea: do not send the same or similar question to the model again and again. Instead, reuse answers that were already generated for identical or semantically similar queries.</p>
<p>By combining exact caching with semantic similarity search, the firewall allows previously generated answers to be reused safely and efficiently. The result is lower token consumption, faster responses, and reduced infrastructure costs.</p>
<p>Because the gateway is OpenAI-compatible, integration requires only a small configuration change. No application refactoring is needed.
If your system includes chatbots, knowledge assistants, developer copilots, or AI agents that answer recurring questions, AI Cost Firewall can reduce token usage by 30–75% once the cache has warmed up, depending on how repetitive the traffic is.</p>
<p>And since the entire stack runs with a single docker compose command, you can try it in minutes.</p>
<p>Sometimes the most effective optimization is not a new model, a larger GPU cluster, or a complex architecture.</p>
<p>Sometimes it is simply not paying twice for the same answer.</p>
<hr>
<p>If you find this open-source project useful, a GitHub star will help the project grow.</p>
<p>⭐ <a href="https://github.com/vcal-project/ai-firewall" target="_blank" rel="noopener noreferrer" class="">https://github.com/vcal-project/ai-firewall</a></p>]]></content:encoded>
            <category>ai-infrastructure</category>
            <category>llm</category>
            <category>opensource</category>
            <category>devops</category>
            <category>rust</category>
            <category>machinelearning</category>
        </item>
        <item>
            <title><![CDATA[From Words to Vectors: How Semantics Traveled from Linguistics to Large Language Models]]></title>
            <link>https://blog.vcal-project.com/from-words-to-vectors-how-semantics-traveled-from-linguistics-to-large-language-models</link>
            <guid>https://blog.vcal-project.com/from-words-to-vectors-how-semantics-traveled-from-linguistics-to-large-language-models</guid>
            <pubDate>Sat, 17 Jan 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Why meaning moved from definitions to structure — and what that changed for modern AI]]></description>
            <content:encoded><![CDATA[<blockquote>
<p>Originally published on Dev.to on <strong>January 17, 2026</strong>.<br>
<!-- -->Read the <a href="https://dev.to/vcalproject/from-words-to-vectors-how-semantics-traveled-from-linguistics-to-large-language-models-31f1" target="_blank" rel="noopener noreferrer" class="">Dev.to</a> version</p>
</blockquote>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-meaning-moved-from-definitions-to-structure--and-what-that-changed-for-modern-ai">Why meaning moved from definitions to structure — and what that changed for modern AI<a href="https://blog.vcal-project.com/from-words-to-vectors-how-semantics-traveled-from-linguistics-to-large-language-models#why-meaning-moved-from-definitions-to-structure--and-what-that-changed-for-modern-ai" class="hash-link" aria-label="Direct link to Why meaning moved from definitions to structure — and what that changed for modern AI" title="Direct link to Why meaning moved from definitions to structure — and what that changed for modern AI" translate="no">​</a></h3>
<hr>
<p>When engineers talk about semantic search, embeddings, or LLMs that "understand" language, it often sounds like something fundamentally new. Yet the problems modern AI systems face — meaning, reference, ambiguity, and context — were already central questions in linguistics and philosophy more than a century ago.</p>
<p>This article traces how the concept of semantics evolved across disciplines: from linguistics and philosophy, through symbolic AI and statistical NLP, and finally into the neural architectures that power modern large language models, and why this history matters for how we design retrieval, memory, and language systems today. The journey reveals that today's AI systems are not a break from the past, but the convergence of long-standing ideas finally made computationally feasible.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="linguistic-origins-meaning-as-a-system-not-a-label">Linguistic Origins: Meaning as a System, Not a Label<a href="https://blog.vcal-project.com/from-words-to-vectors-how-semantics-traveled-from-linguistics-to-large-language-models#linguistic-origins-meaning-as-a-system-not-a-label" class="hash-link" aria-label="Direct link to Linguistic Origins: Meaning as a System, Not a Label" title="Direct link to Linguistic Origins: Meaning as a System, Not a Label" translate="no">​</a></h2>
<p>Modern semantics begins not with computers, but with language itself. In the late 19th and early 20th centuries, linguists began to reject the naive idea that words simply "point" to things in the world. One of the most influential figures in this shift was <strong>Ferdinand de Saussure</strong>, who argued that language is a structured system of signs rather than a naming scheme.</p>
<p>Saussure proposed that each linguistic sign consists of two inseparable parts: the signifier (the sound or written form) and the signified (the concept evoked). Crucially, the relationship between the two is arbitrary. There is nothing inherently "dog-like" about the word dog. Its meaning arises because it occupies a position within a broader system of contrasts: dog is meaningful because it is not cat, not wolf, not table.</p>
<p>This was a radical idea at the time. Meaning, Saussure claimed, is relational. Words derive significance from how they differ from other words, not from direct correspondence with reality. This insight quietly laid the conceptual groundwork for everything from structural linguistics to modern vector-based representations.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="philosophy-of-language-meaning-logic-and-composition">Philosophy of Language: Meaning, Logic, and Composition<a href="https://blog.vcal-project.com/from-words-to-vectors-how-semantics-traveled-from-linguistics-to-large-language-models#philosophy-of-language-meaning-logic-and-composition" class="hash-link" aria-label="Direct link to Philosophy of Language: Meaning, Logic, and Composition" title="Direct link to Philosophy of Language: Meaning, Logic, and Composition" translate="no">​</a></h2>
<p>While linguists focused on structure, philosophers sought precision. In particular, <strong>Gottlob Frege</strong> transformed semantics by embedding it into formal logic. Frege introduced a critical distinction between <em>sense</em> (the way an expression presents its referent) and <em>reference</em> (the actual object being referred to).</p>
<p>This distinction explained how two expressions could refer to the same thing while conveying different information. "The morning star" and "the evening star" both refer to Venus, yet they are not interchangeable in all contexts. Meaning, therefore, could not be reduced to reference alone.</p>
<p>More importantly, Frege formalized the idea of compositionality: the meaning of a sentence is determined by the meanings of its parts and the rules used to combine them. This principle became foundational not only in philosophy, but later in programming languages, logic systems, and early AI models.</p>
<p>In retrospect, compositionality is what allowed meaning to be treated as something computable, at least in theory.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="early-artificial-intelligence-when-meaning-was-symbolic">Early Artificial Intelligence: When Meaning Was Symbolic<a href="https://blog.vcal-project.com/from-words-to-vectors-how-semantics-traveled-from-linguistics-to-large-language-models#early-artificial-intelligence-when-meaning-was-symbolic" class="hash-link" aria-label="Direct link to Early Artificial Intelligence: When Meaning Was Symbolic" title="Direct link to Early Artificial Intelligence: When Meaning Was Symbolic" translate="no">​</a></h2>
<p>When I studied linguistics at university many years ago, everything up to this point was already part of the curriculum. Structural linguistics, philosophy of language, and formal semantics provided a solid theoretical foundation. What none of us could have anticipated at the time was how directly these ideas would later intersect with computer science in what would come to be called <em>artificial intelligence</em>.</p>
<p>When AI emerged as a field in the mid-20th century, it inherited philosophy's confidence in symbols and logic. Early systems assumed that meaning could be explicitly represented through formal structures: symbols, predicates, rules, and ontologies. To "understand" language was to transform symbols according to carefully designed rules.</p>
<p>For a while, this worked. Expert systems, knowledge graphs, and first-order logic engines achieved impressive results in narrowly defined domains such as medical diagnosis, chemical analysis, and configuration problems. Within carefully bounded worlds, symbolic semantics appeared tractable.</p>
<p>Natural language, however, quickly exposed the limits of this approach. Human language is ambiguous, context-dependent, and constantly evolving. Encoding all possible meanings and interpretations proved not merely difficult, but fundamentally unscalable. Symbolic systems were brittle: they failed not gradually, but catastrophically, when faced with inputs that deviated even slightly from their assumptions.</p>
<p>Semantics, it turned out, was far messier than logic had allowed, and far more resistant to being fully written down.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-statistical-shift-meaning-emerges-from-usage">The Statistical Shift: Meaning Emerges from Usage<a href="https://blog.vcal-project.com/from-words-to-vectors-how-semantics-traveled-from-linguistics-to-large-language-models#the-statistical-shift-meaning-emerges-from-usage" class="hash-link" aria-label="Direct link to The Statistical Shift: Meaning Emerges from Usage" title="Direct link to The Statistical Shift: Meaning Emerges from Usage" translate="no">​</a></h2>
<p>A quiet revolution began when linguists and computer scientists started to look not at rules, but at usage patterns. The idea that meaning could be inferred from how words are used rather than how they are defined gained traction in the mid-20th century.</p>
<p>The core insight was simple but profound: words that appear in similar contexts tend to have similar meanings. Instead of encoding semantics explicitly, one could measure it statistically by analyzing large corpora of text.</p>
<p>This approach, known as distributional semantics, reframed meaning as something empirical rather than prescriptive. Words became vectors of co-occurrence statistics. Similarity was no longer binary or rule-based, but graded and approximate.</p>
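<p>To make the idea concrete, here is a minimal sketch of co-occurrence vectors and graded similarity. The toy corpus, the window size, and the function names are assumptions chosen for illustration; real distributional models work on far larger corpora and usually reweight raw counts (for example with pointwise mutual information).</p>

```python
import math
from collections import Counter

# Tiny illustrative corpus; real distributional models use millions of sentences.
CORPUS = [
    "the dog chased the ball",
    "the cat chased the mouse",
    "the dog ate the food",
    "the cat ate the food",
    "a dog barked loudly",
    "the table stood in the room",
]

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            context = tokens[start:i] + tokens[i + 1:i + 1 + window]
            vectors.setdefault(word, Counter()).update(context)
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

vectors = cooccurrence_vectors(CORPUS)
dog_cat = cosine(vectors["dog"], vectors["cat"])
dog_table = cosine(vectors["dog"], vectors["table"])
print(f"dog~cat={dog_cat:.2f}  dog~table={dog_table:.2f}")
```

<p>In this toy corpus, "dog" and "cat" share most of their contexts and so score higher than "dog" and "table". Nothing in the code defines what a dog is; the similarity comes entirely from usage.</p>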
<p>This was a decisive break from symbolic AI and a return, in spirit, to Saussure's relational view of meaning.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="word-embeddings-geometry-becomes-semantics">Word Embeddings: Geometry Becomes Semantics<a href="https://blog.vcal-project.com/from-words-to-vectors-how-semantics-traveled-from-linguistics-to-large-language-models#word-embeddings-geometry-becomes-semantics" class="hash-link" aria-label="Direct link to Word Embeddings: Geometry Becomes Semantics" title="Direct link to Word Embeddings: Geometry Becomes Semantics" translate="no">​</a></h2>
<p>Distributional ideas matured dramatically with the introduction of neural word embeddings, particularly models like <strong>Word2Vec</strong>. Instead of relying on sparse frequency counts, these models learned dense, low-dimensional vector representations optimized to predict linguistic context.</p>
<p>What emerged surprised even their creators. Semantic relationships appeared as geometric regularities in vector space. Differences between vectors encoded analogies, hierarchies, and semantic proximity. Meaning became something you could measure with cosine similarity.</p>
<p>This was not symbolic understanding, but it was not random either. It was structure: learned rather than designed.</p>
<p>For the first time, machines exhibited behavior that looked like semantic intuition, despite having no explicit definitions or rules.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="contextual-semantics-meaning-is-not-fixed">Contextual Semantics: Meaning Is Not Fixed<a href="https://blog.vcal-project.com/from-words-to-vectors-how-semantics-traveled-from-linguistics-to-large-language-models#contextual-semantics-meaning-is-not-fixed" class="hash-link" aria-label="Direct link to Contextual Semantics: Meaning Is Not Fixed" title="Direct link to Contextual Semantics: Meaning Is Not Fixed" translate="no">​</a></h2>
<p>Static embeddings had a fundamental limitation: each word had exactly one vector, regardless of context. But human language does not work that way. The meaning of a word shifts depending on surrounding words, speaker intent, situation, and even emotion.</p>
<p>Transformer-based models, particularly <strong>BERT</strong>, addressed this by making representations contextual. Instead of asking "What does this word mean?", the model learned to ask "What does this word mean here?"</p>
<p>Through attention mechanisms, transformers model relationships between tokens dynamically. Meaning is no longer stored in a single vector per word, but distributed across layers and activations that respond to context.</p>
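<p>A rough sketch of the core computation, scaled dot-product attention for a single query, may help here. The two-dimensional vectors and the "river"/"money" labels are hypothetical, and the learned projection matrices, multiple heads, and stacked layers of a real transformer are deliberately left out.</p>

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query over a list of key/value vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # The output is a weighted mix of the value vectors.
    mixed = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return mixed, weights

# Hypothetical vectors for two context tokens around the ambiguous word "bank".
keys = [[1.0, 0.0], [0.0, 1.0]]    # "river", "money"
values = [[1.0, 0.0], [0.0, 1.0]]
query_in_river_context = [2.0, 0.0]

mixed, weights = attend(query_in_river_context, keys, values)
print(weights, mixed)
```

<p>The same word receives a different output vector whenever the query or its neighbors change, which is the sense in which the representation is contextual.</p>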
<p>This marked a crucial step toward pragmatic semantics: language as it is actually used, not as it is abstractly defined.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="large-language-models-semantics-as-emergent-structure">Large Language Models: Semantics as Emergent Structure<a href="https://blog.vcal-project.com/from-words-to-vectors-how-semantics-traveled-from-linguistics-to-large-language-models#large-language-models-semantics-as-emergent-structure" class="hash-link" aria-label="Direct link to Large Language Models: Semantics as Emergent Structure" title="Direct link to Large Language Models: Semantics as Emergent Structure" translate="no">​</a></h2>
<p>Large language models such as <strong>GPT</strong> do not contain explicit semantic representations in the traditional sense. They are trained to predict the next token in a sequence. And yet, at scale, they display behaviors that look strikingly semantic: summarization, reasoning, translation, abstraction.</p>
<p>The key idea is emergence. As models compress vast amounts of linguistic data, they internalize regularities about the world, language, and human communication. Semantics arises not as a module, but as a side effect of learning efficient representations.</p>
<p>These models do not "know" meaning in a philosophical sense. But they operate in a space where syntax, semantics, and pragmatics are inseparable, and where relational structure dominates.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="when-meaning-becomes-operational">When Meaning Becomes Operational<a href="https://blog.vcal-project.com/from-words-to-vectors-how-semantics-traveled-from-linguistics-to-large-language-models#when-meaning-becomes-operational" class="hash-link" aria-label="Direct link to When Meaning Becomes Operational" title="Direct link to When Meaning Becomes Operational" translate="no">​</a></h2>
<p>For practitioners building semantic search systems, RAG pipelines, or LLM-adjacent infrastructure, this history is not academic background — it is an explanation of why certain designs consistently work while others fail. Exact matching breaks down because natural language rarely repeats itself verbatim. Embeddings succeed not because they are clever, but because they mirror how meaning behaves in practice: approximately, relationally, and with tolerance for variation.</p>
<p>Once this is understood, several architectural consequences follow naturally. Retrieval quality depends less on perfect recall and more on selecting representations that preserve semantic neighborhoods. Caching strategies become viable only when equivalence is defined by similarity rather than identity. Evaluation metrics must account for graded relevance instead of binary correctness. Even system boundaries shift: components no longer exchange "facts", but approximations of meaning that remain useful within context.</p>
<p>Semantic systems are effective precisely because they do not attempt to eliminate ambiguity. They absorb it. Whether you are designing a vector store, placing a semantic cache in front of an LLM, or building a long-term memory layer for conversational systems, you are implicitly making choices about how much approximation your system tolerates and where that tolerance is enforced.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="closing-thought-semantics-as-shared-infrastructure">Closing Thought: Semantics as Shared Infrastructure<a href="https://blog.vcal-project.com/from-words-to-vectors-how-semantics-traveled-from-linguistics-to-large-language-models#closing-thought-semantics-as-shared-infrastructure" class="hash-link" aria-label="Direct link to Closing Thought: Semantics as Shared Infrastructure" title="Direct link to Closing Thought: Semantics as Shared Infrastructure" translate="no">​</a></h2>
<p>What began as a linguistic insight, that words gain meaning through their relations to other words, has quietly become an organizing principle for entire computational systems. Meaning no longer lives in dictionaries, rules, or symbols, but in patterns: in how expressions cluster, diverge, and reappear across vast landscapes of language. Semantics is no longer something a system contains; it is something a system moves through.</p>
<p>This shift took more than a century to unfold. It required philosophers to separate sense from reference, linguists to abandon naming theories, and engineers to accept approximation over certainty. Only when data became abundant and computation relatively cheap did this long trajectory converge into something operational. Semantics, once debated in lecture halls and footnotes, has become infrastructure — implicit, distributed, and shared.</p>
<p>That idea, radical when first proposed, has been waiting over a hundred years for enough data and compute to become practical.</p>
<p>And now, finally, it has.</p>]]></content:encoded>
            <category>semantics</category>
            <category>ai-infrastructure</category>
            <category>machine-learning</category>
            <category>llm</category>
            <category>nlp</category>
        </item>
        <item>
            <title><![CDATA[Why Edge AI Needs Lightweight Semantic Caches — and What Makes Them Hard to Build]]></title>
            <link>https://blog.vcal-project.com/why-edge-ai-needs-lightweight-semantic-caches—and-what-makes-them-hard-to-build</link>
            <guid>https://blog.vcal-project.com/why-edge-ai-needs-lightweight-semantic-caches—and-what-makes-them-hard-to-build</guid>
            <pubDate>Thu, 27 Nov 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[How on-prem semantic caching bridges the gap between LLM reasoning and data efficiency]]></description>
            <content:encoded><![CDATA[<blockquote>
<p>Originally published on Medium.com on <strong>November 27, 2025</strong>.<br>
<!-- -->Read the <a href="https://medium.com/@sergey.lunev_27518/why-edge-ai-needs-lightweight-semantic-caches-and-what-makes-them-hard-to-build-75124519a746" target="_blank" rel="noopener noreferrer" class="">Medium.com</a> version</p>
</blockquote>
<p><img decoding="async" loading="lazy" alt="Cover" src="https://blog.vcal-project.com/assets/images/iot-vcal-2b2075c6bdc5bda7ec02970c0ef13d70.png" width="1020" height="790" class="img_ev3q"></p>
<p>Today, edge computing is reshaping the way AI systems are deployed. Instead of sending every request to centralized cloud infrastructure, more computation is happening on devices closer to end users. These “edge environments” include IoT gateways, on-premise servers, mobile devices, micro-VMs, serverless functions, and browser-based applications. The appeal is clear: moving computation closer to where data is generated reduces latency, minimizes bandwidth requirements, and allows organizations to satisfy strict data-privacy rules.</p>
<p>At the same time, WebAssembly (WASM) has emerged as a portable, sandboxed runtime for executing code in highly constrained or security-sensitive environments. Originally designed for browsers, WASM now runs in cloud edge workers, serverless platforms, and isolated environments where traditional binaries cannot be executed. These runtimes often restrict access to system calls such as networking, threading, or the local filesystem. They operate under strict memory limits, sometimes as low as tens of megabytes, and they prioritize deterministic, predictable execution.</p>
<p>For all these advantages, running AI components at the edge introduces its own challenges, especially when applications rely on semantic search, embeddings, or large language models (LLMs).</p>
<hr>
<p>A major issue arises when AI applications repeatedly generate similar responses to similar prompts. In a cloud setting this inefficiency is tolerable, but at the edge it becomes costly. Edge nodes often have hard limits on CPU time and memory allocation, meaning that even small local language models may struggle to meet real-time latency budgets. A semantic cache — a system that stores answers together with an embedding vector and returns a cached answer when the incoming request is semantically similar — is a natural solution. However, building such a system for constrained environments is significantly more difficult than building one for the cloud.</p>
<p>The first challenge is memory. Classical vector databases and similarity search engines rely on complex indexing structures such as HNSW graphs, which are fast but memory-intensive. Standard configurations easily grow to hundreds of megabytes and often assume the availability of multi-threading, background maintenance processes, and dynamic memory growth. Edge workers and WASM isolates cannot accommodate this. In many cases, the runtime enforces strict caps on linear memory and disallows growing beyond a fixed boundary. This immediately rules out most existing semantic search libraries, even before considering cold-start overhead or storage.</p>
<p>The second constraint is the execution environment itself. WASM runtimes typically do not expose POSIX-like APIs (POSIX, the Portable Operating System Interface, is a family of IEEE standards that define consistent operating system interfaces). Features such as <code>mmap</code>, file descriptors, or native sockets are unavailable unless the host explicitly provides them through WASI (WebAssembly System Interface), and even then, support varies. This makes it almost impossible to run vector databases “as-is,” because they depend heavily on operating system functionality and persistent background services. In edge environments, a module often has only a few milliseconds to initialize, produce a response, and return control to the runtime. A semantic cache that takes hundreds of milliseconds to load an index simply cannot be deployed in these contexts.</p>
<p>Cold-start behavior is another architectural concern. Unlike long-running cloud servers, edge workers may be rapidly created and destroyed. A new isolate might handle only one or two requests before being recycled. For AI applications, this means that any semantic cache must load extremely quickly — ideally in a few milliseconds — and must not rely on heavy initialization or dynamic graph reconstruction. Snapshotting becomes essential: developers need the ability to store the cache state in a compact format that loads deterministically and quickly into memory.</p>
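<p>As an illustration of what such a snapshot could look like, the sketch below writes cache entries into one flat byte string and reads them back in a single linear pass. The layout (a four-byte tag, a count, a dimension, then fixed-width float32 embeddings followed by length-prefixed answers) is a hypothetical format invented for this example, not the format of any particular product.</p>

```python
import struct

MAGIC = b"SCv1"  # hypothetical format tag used only in this sketch

def dump_snapshot(entries, dim):
    """Serialize (embedding, answer) pairs into one flat byte string."""
    out = [MAGIC, struct.pack("!II", len(entries), dim)]
    for embedding, answer in entries:
        text = answer.encode("utf-8")
        out.append(struct.pack(f"!{dim}f", *embedding))  # float32 values
        out.append(struct.pack("!I", len(text)))          # answer length prefix
        out.append(text)
    return b"".join(out)

def load_snapshot(blob):
    """Rebuild the entries with one linear pass and no index construction."""
    if blob[:4] != MAGIC:
        raise ValueError("not a snapshot")
    count, dim = struct.unpack_from("!II", blob, 4)
    offset = 12  # 4 bytes of tag plus two 4-byte integers
    entries = []
    for _ in range(count):
        embedding = list(struct.unpack_from(f"!{dim}f", blob, offset))
        offset += 4 * dim
        (size,) = struct.unpack_from("!I", blob, offset)
        offset += 4
        entries.append((embedding, blob[offset:offset + size].decode("utf-8")))
        offset += size
    return entries

entries = [([0.5, 0.25], "restart the gateway"), ([1.0, -2.0], "check the TTL")]
blob = dump_snapshot(entries, dim=2)
restored = load_snapshot(blob)
```

<p>Because the byte order and field widths are fixed, the same entries always produce the same bytes, and loading cost grows linearly with the number of entries rather than with the cost of rebuilding a search graph.</p>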
<p>There is also the question of energy and cost efficiency. Edge nodes operate on limited power budgets, especially in IoT scenarios. Recomputing the same embedding or calling an external LLM repeatedly wastes both energy and bandwidth. Reducing redundant inference calls requires a semantic memory layer that can match incoming queries to existing knowledge without exceeding stringent resource constraints.</p>
<p>Privacy regulations add another layer of complexity. One of the motivations for moving AI workloads to the edge is to keep sensitive data local. But to do that effectively, the system must avoid unnecessarily sending repeated questions or logs to a central model. A semantic cache therefore becomes not just a performance optimization but a privacy mechanism: if the system can answer from its local memory, no data transmission to external LLMs is required. Unfortunately, building such a cache in environments with restricted storage, no access to background processes, and strict runtime quotas is a non-trivial task.</p>
<p>These are the conditions under which traditional semantic search infrastructure begins to struggle. Large vector databases simply assume too much: too much RAM, too much access to the operating system, too much startup time, and too much persistence. Even lightweight semantic caches designed for server applications often rely on threading, shared memory, file-based checkpointing, or dynamically growing allocations. Most embedding-based caches were never designed with WASM runtimes, edge workers, or IoT gateways in mind.</p>
<hr>
<p>This is precisely the gap that newer designs aim to address. Solutions like <a href="https://vcal-project.com/" target="_blank" rel="noopener noreferrer" class=""><strong>VCAL</strong></a> approach semantic caching not as a distributed system or standalone service but as a small in-process library that can run with minimal memory and without heavy OS dependencies. Instead of behaving like a database, it behaves more like a CPU-level cache for AI reasoning, storing question-answer pairs and their embeddings in an optimized structure that can fit within the constraints of edge and WASM environments. By avoiding reliance on network calls, background threads, or large indices, such systems become suitable for serverless workers, browser WASM modules, or embedded devices with limited RAM.</p>
<p>In this sense, the semantic cache becomes a missing piece of infrastructure for edge AI. As more organizations push inference closer to the user, the need for a lightweight, deterministic, low-memory semantic lookup system grows. The limitations of edge platforms — from strict memory caps to rapid cold starts — make this a difficult problem, and the lack of suitable solutions has slowed the adoption of AI features outside centralized cloud environments. As WASM matures and edge utilities evolve, semantic caching may become a standard part of the AI pipeline, enabling faster, cheaper, and more privacy-preserving deployments across a wide range of devices.</p>]]></content:encoded>
            <category>edge-computing</category>
            <category>webassembly</category>
            <category>artificial-intelligence</category>
            <category>vector-database</category>
            <category>internet-of-things</category>
        </item>
        <item>
            <title><![CDATA[Beyond Vector Databases: The Case for Local Semantic Caching]]></title>
            <link>https://blog.vcal-project.com/beyond-vector-databases-the-case-for-local-semantic-caching</link>
            <guid>https://blog.vcal-project.com/beyond-vector-databases-the-case-for-local-semantic-caching</guid>
            <pubDate>Thu, 06 Nov 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[How on-prem semantic caching bridges the gap between LLM reasoning and data efficiency]]></description>
            <content:encoded><![CDATA[<blockquote>
<p>Originally published on Medium.com on <strong>November 6, 2025</strong>.<br>
<!-- -->Read the <a href="https://medium.com/@sergey.lunev_27518/beyond-vector-databases-the-case-for-local-semantic-caching-a7224b75a6f2" target="_blank" rel="noopener noreferrer" class="">Medium.com</a> version</p>
</blockquote>
<p><img decoding="async" loading="lazy" alt="Cover" src="https://blog.vcal-project.com/assets/images/articles-image-dc371944641b3414e30457ccbd5cd154.png" width="1077" height="453" class="img_ev3q"></p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="when-intelligence-wastes-cycles">When “intelligence” wastes cycles<a href="https://blog.vcal-project.com/beyond-vector-databases-the-case-for-local-semantic-caching#when-intelligence-wastes-cycles" class="hash-link" aria-label="Direct link to When “intelligence” wastes cycles" title="Direct link to When “intelligence” wastes cycles" translate="no">​</a></h3>
<p>Most teams building LLM-powered products eventually realize that a large portion of their API costs comes not from new insights, but from repeated questions.</p>
<p>Support bots, internal assistants, and analytics copilots all encounter thousands of near-identical queries:</p>
<blockquote>
<p>“How do I pass the API key to the local model gateway?”<br>
<!-- -->“Why is the dev database connection timing out?”<br>
<!-- -->“How can I refresh the cache without restarting the service?”</p>
</blockquote>
<p>Each of those prompts gets re-tokenized, re-embedded, and re-sent to an LLM even when the model has already answered an equivalent question a minute earlier.</p>
<p>The result: burned tokens, added latency, and duplicated reasoning.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="vector-databases-solved-storage-not-reuse">Vector databases solved <em>storage</em>, not <em>reuse</em><a href="https://blog.vcal-project.com/beyond-vector-databases-the-case-for-local-semantic-caching#vector-databases-solved-storage-not-reuse" class="hash-link" aria-label="Direct link to vector-databases-solved-storage-not-reuse" title="Direct link to vector-databases-solved-storage-not-reuse" translate="no">​</a></h3>
<p>The industry's first instinct was to throw vector databases at the problem.
They excel at persistent embeddings and semantic retrieval, but they were never built for reuse.
What they lack are TTL policies, eviction strategies, and atomic snapshotting of in-flight state.
In other words, they store knowledge, not memory.</p>
<p>Traditional vector databases follow a <code>key:value</code> paradigm: they persist embeddings indefinitely so they can be queried later, much like records in a datastore.
A semantic cache, by contrast, treats embeddings as dynamic memory — governed by similarity, expiration, and adaptive retention.
Its goal is not to archive information, but to avoid redundant reasoning across millions of semantically similar requests.</p>
<p>With a semantic cache such as VCAL, cached answers can stay valid for days or weeks, depending on data volatility and TTL settings.
This moves caching from short-term repetition avoidance to long-horizon semantic reuse where reasoning itself becomes a reusable resource rather than a recurring cost.</p>
<p>In essence, VCAL bridges the gap between data retrieval and cognitive efficiency, turning past computation into future acceleration.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="from-data-stores-to-memory-layers">From data stores to memory layers<a href="https://blog.vcal-project.com/beyond-vector-databases-the-case-for-local-semantic-caching#from-data-stores-to-memory-layers" class="hash-link" aria-label="Direct link to From data stores to memory layers" title="Direct link to From data stores to memory layers" translate="no">​</a></h3>
<p>In <a href="https://dev.to/vcalproject/how-i-created-a-semantic-cache-library-for-ai-863" target="_blank" rel="noopener noreferrer" class="">my previous Dev.to article</a>, I explained how we built <em>VCAL</em>, a Rust-based semantic cache that sits between your app and the LLM. Instead of persisting every vector indefinitely, it keeps embeddings in memory for a configurable time, indexed by <em>semantic similarity</em> and metadata.</p>
<p>When a new query arrives, VCAL compares it to cached vectors. If it is close enough — a <em>cache hit</em> — the LLM call is skipped, and the stored answer is returned in milliseconds. Otherwise, the request proceeds normally, and the response is stored for future matches.</p>
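<p>The hit-or-miss flow described above can be sketched in a few lines. This is a minimal illustration, not VCAL's actual API: the names <code>SemanticCache</code>, <code>answer_query</code>, and <code>call_llm</code> are hypothetical, the 0.9 threshold is an arbitrary starting point, embeddings are assumed to be computed elsewhere, and a linear scan stands in for the HNSW index.</p>

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

class SemanticCache:
    """Toy cache (hypothetical): linear scan instead of an HNSW index."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold   # assumed default; tune per workload
        self.entries = []            # list of (embedding, answer) pairs

    def get(self, embedding):
        """Return the closest cached answer at or above the threshold, else None."""
        best_score, best_answer = self.threshold, None
        for cached_embedding, answer in self.entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score >= best_score:
                best_score, best_answer = score, answer
        return best_answer

    def put(self, embedding, answer):
        self.entries.append((embedding, answer))

def answer_query(cache, embedding, call_llm):
    """Serve from cache on a hit; otherwise call the model and store the result."""
    cached = cache.get(embedding)
    if cached is not None:
        return cached, True          # cache hit: the LLM call is skipped
    answer = call_llm()
    cache.put(embedding, answer)
    return answer, False             # cache miss: answer stored for next time
```

<p>The threshold is the main tuning knob in this sketch: set it too low and unrelated questions share an answer, set it too high and paraphrases miss the cache.</p>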
<p>The design combines concepts from vector search and traditional caching systems, enhanced with features for resilience and monitoring:</p>
<ul>
<li class=""><a href="https://arxiv.org/abs/1603.09320" target="_blank" rel="noopener noreferrer" class="">HNSW</a> index for ultra-fast approximate similarity search.</li>
<li class="">TTL and LRU eviction for automatic cache turnover.</li>
<li class="">Snapshotting for persistence between restarts.</li>
<li class="">Prometheus metrics for observability.</li>
</ul>
<p>All of it runs on-prem, next to your model or gateway with no remote dependencies.</p>
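<p>To make the TTL and LRU eviction item concrete, here is a small sketch of how the two policies can combine in one bounded store. It is an illustration under assumptions, not VCAL's implementation: the class name <code>TtlLruStore</code> and its parameters are hypothetical, and keys stand in for whatever identifier the index assigns to a cached entry.</p>

```python
import time
from collections import OrderedDict

class TtlLruStore:
    """Bounded store (hypothetical): entries expire after ttl_seconds,
    and the least recently used entry is evicted when capacity is exceeded."""

    def __init__(self, capacity, ttl_seconds, clock=time.monotonic):
        self.capacity = capacity
        self.ttl_seconds = ttl_seconds
        self.clock = clock                 # injectable so expiry can be tested
        self.items = OrderedDict()         # key -> (expires_at, value), oldest use first

    def get(self, key):
        entry = self.items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:     # TTL expired: drop it and report a miss
            del self.items[key]
            return None
        self.items.move_to_end(key)        # mark as most recently used
        return value

    def put(self, key, value):
        self.items[key] = (self.clock() + self.ttl_seconds, value)
        self.items.move_to_end(key)
        while len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used entry
```

<p>TTL bounds how stale an answer can get, while LRU bounds memory; keeping both means volatile data ages out even when the cache is far from full.</p>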
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-local-caching-changes-the-economics">Why local caching changes the economics<a href="https://blog.vcal-project.com/beyond-vector-databases-the-case-for-local-semantic-caching#why-local-caching-changes-the-economics" class="hash-link" aria-label="Direct link to Why local caching changes the economics" title="Direct link to Why local caching changes the economics" translate="no">​</a></h3>
<p>Unlike vector databases, a local semantic cache has one simple purpose: avoid redundant reasoning.
Each avoided LLM call translates directly into saved tokens, lower API bills, and shorter response times.</p>
<p>In real deployments we’ve seen:</p>
<ul>
<li class="">30–60% reduction in LLM calls</li>
<li class="">Millisecond-level latency on repeated queries, enabling near-real-time responsiveness</li>
<li class="">Predictable resource usage: no external round-trips, no cloud egress costs, and no multi-tenant contention</li>
</ul>
<p>At scale, the more your users interact, the greater the savings become. Instead of paying per token for every repetition, you amortize prior reasoning across sessions and teams.</p>
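<p>A quick back-of-the-envelope calculation shows how a hit rate translates into spend. The workload figures below (one million requests per month, 800 tokens per request, $0.002 per 1K tokens) are hypothetical placeholders, not measurements; substitute your own numbers.</p>

```python
def estimated_monthly_savings(requests_per_month, avg_tokens_per_request,
                              price_per_1k_tokens, hit_rate):
    """Calls, tokens, and spend avoided when hit_rate of requests is served from cache."""
    avoided_requests = requests_per_month * hit_rate
    avoided_tokens = avoided_requests * avg_tokens_per_request
    avoided_cost = avoided_tokens / 1000 * price_per_1k_tokens
    return avoided_requests, avoided_tokens, avoided_cost

# Hypothetical workload: 1M requests/month, 800 tokens each, $0.002 per 1K tokens.
for hit_rate in (0.30, 0.60):
    requests, tokens, cost = estimated_monthly_savings(1_000_000, 800, 0.002, hit_rate)
    print(f"hit rate {hit_rate:.0%}: {requests:,.0f} calls avoided, ${cost:,.2f} saved")
```

<p>The estimate is deliberately simple: it ignores the cost of embedding each incoming query for the lookup, which is usually small next to a full completion but still worth including in a real model.</p>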
<p>And because VCAL runs inside your private environment, all caching and embeddings stay under your control, ensuring data privacy, compliance, and predictable performance even in regulated industries.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="a-new-layer-in-the-ai-stack">A new layer in the AI stack<a href="https://blog.vcal-project.com/beyond-vector-databases-the-case-for-local-semantic-caching#a-new-layer-in-the-ai-stack" class="hash-link" aria-label="Direct link to A new layer in the AI stack" title="Direct link to A new layer in the AI stack" translate="no">​</a></h3>
<p>If you visualize the modern LLM stack, the simplified design looks like this:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">User → Application → LLM Gateway → Model</span><br></div></code></pre></div></div>
<p>or, if a RAG (Retrieval-Augmented Generation) framework is involved:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">User → Application → Retriever → Vector DB → (context) → LLM Gateway → Model</span><br></div></code></pre></div></div>
<p>Adding a semantic cache such as VCAL introduces this new dimension:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">User → Application → Retriever → Vector DB → (context) → Semantic Cache → LLM Gateway → Model</span><br></div></code></pre></div></div>
<p>Here, the cache checks whether a semantically equivalent query was already answered.
If found, the response is returned immediately, skipping LLM inference altogether. Only the query embedding needed for the lookup is still computed.
If not, the request continues as usual, and the new answer is stored for future reuse.</p>
<p>Vector databases still matter, but they belong to the <em>knowledge layer</em>, not the <em>inference path</em>. What has been missing so far is a memory layer that prevents repeated reasoning, and semantic caching fills that slot. It is not a replacement for RAG; it is a complement. While RAG injects <em>context</em>, caching avoids <em>duplication</em>.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="engineering-for-low-latency">Engineering for low latency<a href="https://blog.vcal-project.com/beyond-vector-databases-the-case-for-local-semantic-caching#engineering-for-low-latency" class="hash-link" aria-label="Direct link to Engineering for low latency" title="Direct link to Engineering for low latency" translate="no">​</a></h3>
<p>Achieving millisecond response times in semantic caching requires more than just a fast similarity search algorithm. It’s the result of careful coordination between data structures, memory layout, and concurrency control.</p>
<p>The cache can be implemented efficiently in systems programming languages such as Rust, using an HNSW-based index for approximate nearest-neighbor search. HNSW offers roughly logarithmic query complexity while maintaining high recall on large collections of embeddings, making it suitable for workloads that reach millions of cached entries.</p>
<p>Low latency also depends on predictable memory management and lock-free or fine-grained synchronization between threads. Instead of allocating and freeing vectors dynamically, embeddings are often stored in preallocated arenas or memory-mapped regions to minimize fragmentation and system calls. Parallel workers can update the index or evaluate similarity thresholds concurrently, so that retrieval scales with the number of available cores.</p>
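<p>The preallocated-arena idea can be illustrated with a fixed-size flat buffer and a free list of slots. This is only a sketch of the layout, written in Python for brevity; the class <code>EmbeddingArena</code> and its methods are hypothetical, and a production cache would apply the same pattern to a contiguous buffer in a systems language.</p>

```python
from array import array

class EmbeddingArena:
    """Fixed-capacity flat float32 buffer with slot reuse (hypothetical sketch)."""

    def __init__(self, capacity, dim):
        self.dim = dim
        self.data = array("f", [0.0]) * (capacity * dim)     # allocated once, zero-filled
        self.free_slots = list(range(capacity - 1, -1, -1))  # stack of unused slot ids

    def insert(self, vector):
        """Copy a vector into a free slot and return the slot id."""
        if len(vector) != self.dim:
            raise ValueError("dimension mismatch")
        if not self.free_slots:
            raise MemoryError("arena full; evict an entry first")
        slot = self.free_slots.pop()
        start = slot * self.dim
        self.data[start:start + self.dim] = array("f", vector)  # same length, no resize
        return slot

    def view(self, slot):
        """Zero-copy view of the vector stored in a slot."""
        start = slot * self.dim
        return memoryview(self.data)[start:start + self.dim]

    def release(self, slot):
        """Return a slot to the free list so a later insert can overwrite it."""
        self.free_slots.append(slot)
```

<p>Because the buffer never grows or shrinks, inserts and evictions touch only the free list, which keeps allocation out of the hot path.</p>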
<p>In practice, a semantic cache can be deployed as a lightweight service beside an inference gateway, communicating over local HTTP or gRPC.
It can also be embedded directly into an application process when minimal overhead is required, for example, within an agent runtime or API handler.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-bigger-picture">The bigger picture<a href="https://blog.vcal-project.com/beyond-vector-databases-the-case-for-local-semantic-caching#the-bigger-picture" class="hash-link" aria-label="Direct link to The bigger picture" title="Direct link to The bigger picture" translate="no">​</a></h3>
<p>Caching has always been an invisible driver of performance, from CPU caches that reuse recently fetched instructions and data, to CDNs that reuse content, to databases that reuse query results.</p>
<p>Each generation of systems extends the notion of what can be reused. As language models enter production, we are witnessing a shift toward semantic reuse: reusing meaning rather than data. This enables systems to recall previous reasoning instead of repeating it — a step toward more efficient and sustainable AI infrastructure.</p>
<p>In this new layer of the AI stack, semantic caching becomes a form of reasoning memory: it stores the results of understanding, not just raw data. Instead of recomputing the same insight across thousands of near-identical prompts, we can recall it instantly, with full control over latency, privacy, and cost.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="further-reading">Further reading<a href="https://blog.vcal-project.com/beyond-vector-databases-the-case-for-local-semantic-caching#further-reading" class="hash-link" aria-label="Direct link to Further reading" title="Direct link to Further reading" translate="no">​</a></h3>
<p>For readers interested in implementation details and open-source examples:</p>
<ul>
<li class=""><a href="https://dev.to/vcalproject/how-i-created-a-semantic-cache-library-for-ai-863" target="_blank" rel="noopener noreferrer" class="">How I Created a Semantic Cache Library for AI</a> (Dev.to)</li>
<li class=""><a href="https://vcal-project.com/" target="_blank" rel="noopener noreferrer" class="">vcal-project.com</a> for general information, documentation, and architectural notes</li>
</ul>
<hr>
<p><strong>Thank you for reading!</strong> Semantic caching is still an emerging concept — every real-world use case helps shape how we think about efficient reasoning. Share yours in the comments if you’d like to join the conversation.</p>]]></content:encoded>
            <category>ai-infrastructure</category>
            <category>llm</category>
            <category>caching</category>
            <category>rust</category>
            <category>machine-learning</category>
        </item>
        <item>
            <title><![CDATA[How I Created a Semantic Cache Library for AI]]></title>
            <link>https://blog.vcal-project.com/how-i-created-a-semantic-cache-library</link>
            <guid>https://blog.vcal-project.com/how-i-created-a-semantic-cache-library</guid>
            <pubDate>Mon, 27 Oct 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[The story behind building VCAL — an on-prem semantic cache that helps AI apps remember what they’ve already answered.]]></description>
            <content:encoded><![CDATA[<blockquote>
<p>Originally published on Dev.to on <strong>October 27, 2025</strong>.<br>
<!-- -->Read the <a href="https://dev.to/vcalproject/how-i-created-a-semantic-cache-library-for-ai-863" target="_blank" rel="noopener noreferrer" class="">Dev.to</a> version</p>
</blockquote>
<p><img decoding="async" loading="lazy" alt="Cover" src="https://blog.vcal-project.com/assets/images/articles-image-dc371944641b3414e30457ccbd5cd154.png" width="1077" height="453" class="img_ev3q"></p>
<p>Have you ever wondered why LLM apps get slower and more expensive as they scale, even though 80% of user questions sound pretty similar? That’s exactly what got me thinking recently: <em>why are we constantly asking the model the same thing?</em></p>
<p>That question led me down the rabbit hole of <strong>semantic caching</strong>, and eventually to building <strong>VCAL (Vector Cache-as-a-Library)</strong>, an open-source project that helps AI apps remember what they’ve already answered.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-eureka-moment">The “Eureka!” Moment<a href="https://blog.vcal-project.com/how-i-created-a-semantic-cache-library#the-eureka-moment" class="hash-link" aria-label="Direct link to The “Eureka!” Moment" title="Direct link to The “Eureka!” Moment" translate="no">​</a></h2>
<p>It started while optimizing an internal support chatbot that ran on top of a local LLM.
Logs showed hundreds of near-identical queries:</p>
<p><em>“How do I request access to the analytics dashboard?”</em><br>
<em>“Who approves dashboard access for my team?”</em><br>
<em>“My access to analytics was revoked — how do I get it back?”</em></p>
<p>Each one triggered a full LLM inference: embedding the query, generating a new answer, and consuming hundreds of tokens even though all three questions meant the same thing.</p>
<p>So I decided to create a simple library that would embed each question, compare it to earlier questions, and, if it is similar enough, return the stored answer, all <em>before</em> asking the model.</p>
<p>I wrote a prototype in Rust — for performance and reliability — and designed it as a small <a href="https://github.com/vcal-project/vcal-core" target="_blank" rel="noopener noreferrer" class="">vcal-core</a> open-source library that any app could embed.</p>
<p>The first version of <strong>VCAL</strong> could:</p>
<ul>
<li class="">Store and search vector embeddings in RAM using <a href="https://arxiv.org/abs/1603.09320" target="_blank" rel="noopener noreferrer" class="">HNSW</a> graph indexing</li>
<li class="">Handle TTL and LRU evictions automatically</li>
<li class="">Save snapshots to disk so it could restart fast</li>
</ul>
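<p>The snapshot feature in the list above can be sketched as a write-then-rename routine, which avoids leaving a half-written file if the process dies mid-save. JSON and the function names <code>save_snapshot</code> and <code>load_snapshot</code> are assumptions chosen for readability here; they do not describe vcal-core's actual snapshot format.</p>

```python
import json
import os
import tempfile

def save_snapshot(entries, path):
    """Write entries to a temp file, then atomically swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(entries, f)
            f.flush()
            os.fsync(f.fileno())       # make sure the bytes reach the disk
        os.replace(tmp_path, path)     # atomic rename on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)        # clean up the partial temp file
        raise

def load_snapshot(path):
    """Return the saved entries, or an empty list when no snapshot exists yet."""
    if not os.path.exists(path):
        return []
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
```

<p>On restart, a cache built this way would call <code>load_snapshot</code> and rebuild its index from the returned entries instead of starting cold.</p>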
<p>Later came <strong>VCAL Server</strong>, a drop-in HTTP API version for teams that wanted to cache answers across multiple services while deploying it on-prem or in the cloud.</p>
<p><img decoding="async" loading="lazy" alt="Screenshot: Grafana dashboard showing cache hits and cost saving" src="https://blog.vcal-project.com/assets/images/grafana-6-hours-dark-36c232e131d1142f93065445353a2ad8.png" width="1614" height="783" class="img_ev3q"></p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-it-feels-like-to-use">What It Feels Like to Use<a href="https://blog.vcal-project.com/how-i-created-a-semantic-cache-library#what-it-feels-like-to-use" class="hash-link" aria-label="Direct link to What It Feels Like to Use" title="Direct link to What It Feels Like to Use" translate="no">​</a></h2>
<p>Unlike a full vector database, VCAL isn’t designed for long-term storage or analytics. I didn’t want to build another vector database.<br>
<!-- -->VCAL is intentionally lightweight. It is a fast, in-memory semantic cache optimized for repeated LLM queries.</p>
<p>Integrating VCAL takes minutes.<br>
<!-- -->Instead of calling the model directly, you send your query to VCAL first.<br>
<!-- -->If a similar question has been asked before — and the similarity threshold can be tuned — VCAL returns the answer from its cache in milliseconds.
If it’s a new question, VCAL asks the LLM, stores the result, and returns it.<br>
<!-- -->Next time, if a semantically similar question comes in, VCAL answers instantly.</p>
<p>It’s like adding a memory layer between your app and the model — lightweight, explainable, and under your full control.</p>
<p><img decoding="async" loading="lazy" alt="Flow diagram: user → VCAL → LLM" src="https://blog.vcal-project.com/assets/images/vcal-server-integration-0ca5014cc1b40c4bdecb8e4393368dc9.png" width="667" height="330" class="img_ev3q"></p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="lessons-learned">Lessons Learned<a href="https://blog.vcal-project.com/how-i-created-a-semantic-cache-library#lessons-learned" class="hash-link" aria-label="Direct link to Lessons Learned" title="Direct link to Lessons Learned" translate="no">​</a></h2>
<ul>
<li class=""><strong>LLMs love redundancy.</strong> Once you start caching semantically, you realize how often people repeat the same question with different words.</li>
<li class=""><strong>Caching semantics ≠ caching text.</strong> Cosine similarity and vector distances matter more than exact matches.</li>
<li class=""><strong>Performance scales beautifully.</strong> A well-tuned cache can handle thousands of lookups per second, even on modest hardware.</li>
<li class=""><strong>It scales big.</strong> A single VCAL Server instance can comfortably store and serve up to 10 million cached answers in memory, depending on embedding dimensions and hardware.</li>
</ul>
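<p>The dependence on embedding dimensions in the last point is easy to quantify. The helper below is a rough estimate under stated assumptions: float32 vectors, optional per-entry answer size and index links, and no allocator or metadata overhead. The dimensions in the loop are common embedding sizes picked for illustration, not VCAL requirements.</p>

```python
def estimate_memory_gib(num_entries, dim, bytes_per_float=4,
                        avg_answer_bytes=0, graph_links_per_entry=0, bytes_per_link=4):
    """Rough in-memory footprint in GiB: vectors + answers + index links."""
    vector_bytes = num_entries * dim * bytes_per_float
    answer_bytes = num_entries * avg_answer_bytes
    link_bytes = num_entries * graph_links_per_entry * bytes_per_link
    return (vector_bytes + answer_bytes + link_bytes) / 2**30

# Hypothetical dimensions; vectors only, float32, 10 million entries.
for dim in (384, 768, 1536):
    print(f"dim={dim}: {estimate_memory_gib(10_000_000, dim):.1f} GiB")
```

<p>For 10 million entries the vectors alone come to roughly 14, 29, and 57 GiB at these three dimensions, which is why the practical ceiling depends so heavily on the embedding model and the available RAM.</p>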
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="whats-next">What’s Next<a href="https://blog.vcal-project.com/how-i-created-a-semantic-cache-library#whats-next" class="hash-link" aria-label="Direct link to What’s Next" title="Direct link to What’s Next" translate="no">​</a></h2>
<p>We’re now working on a licensing server, enterprise snapshot formats, and RAG-style extensions, so teams can use VCAL not just for Q&amp;A caching, but as the foundation for <em>private semantic memory</em>.</p>
<p>If you’re building AI agents, support desks, or knowledge assistants, you’ll likely benefit from giving your system a brain that <em>remembers</em>.</p>
<p>You can explore more at <a href="https://vcal-project.com/" target="_blank" rel="noopener noreferrer" class=""><strong>vcal-project.com</strong></a>. Try the free 30-day Trial Evaluation of VCAL Server, or jump into the open-source <a href="https://github.com/vcal-project/vcal-core" target="_blank" rel="noopener noreferrer" class=""><strong>vcal-core</strong></a> version on GitHub.</p>
<hr>
<p><em>Thanks for reading!</em><br>
<!-- -->If this resonates with you, please drop a comment. I’d love to hear how you’re approaching caching and optimization for AI apps.</p>
<hr>]]></content:encoded>
            <category>vcal</category>
            <category>ai-infrastructure</category>
            <category>caching</category>
            <category>rust</category>
            <category>devops</category>
            <category>machine-learning</category>
            <category>startups</category>
        </item>
    </channel>
</rss>