LMCache Lab

Software Development

Chicago, IL · 2,911 followers

Open-Source for 10X better LLM inference w. vLLM Production Stack + LMCache

About us

Open-source large-scale LLM serving solutions to democratize LLM Inference.

Website: https://github.com/LMCache/LMCache
Industry: Software Development
Company size: 201-500 employees
Headquarters: Chicago, IL
Type: Nonprofit

Updates

  • LMCache Lab reposted this

    Tensormesh: 𝗙𝗿𝗼𝗺 𝗔𝗰𝗮𝗱𝗲𝗺𝗶𝗮 𝘁𝗼 𝗣𝗿𝗼𝗱𝘂𝗰𝘁𝗶𝗼𝗻

    In this clip, our 𝗖𝗘𝗢 𝗮𝗻𝗱 𝗰𝗼-𝗳𝗼𝘂𝗻𝗱𝗲𝗿, Junchen Jiang, explains what it really takes to 𝗯𝘂𝗶𝗹𝗱 𝗮 𝗰𝗼𝗺𝗽𝗮𝗻𝘆 at the intersection of 𝗮𝗰𝗮𝗱𝗲𝗺𝗶𝗮, 𝗼𝗽𝗲𝗻 𝘀𝗼𝘂𝗿𝗰𝗲, 𝗮𝗻𝗱 𝗶𝗻𝗱𝘂𝘀𝘁𝗿𝘆. Building an open-source company isn’t just about code prowess. It requires a deep, long-term technical vision, and a team willing to commit to that vision long enough to turn research into real systems. At Tensormesh, that foundation was already in place.

    🎥 Watch the full interview: 👉 http://y2u.be/zHW4Zzd7pjI

    #OpenSource #AIInfrastructure #LLMInference #KVCache #TensorMesh

  • LMCache Lab reposted this

    LMCache: reuse KV cache to speed up AI inference

    Long-context and multi-turn LLM apps often recompute the same tokens again and again. LMCache reduces that wasted work.

    Why this matters:
    - Lower time-to-first-token for long prompts and RAG
    - Better GPU utilization under repeated or similar requests

    How LMCache works:
    - Reuses KV cache for repeated text, not just prefixes
    - Stores KV cache on GPU, CPU, or disk
    - Allows cache reuse across different serving instances
    - Integrates with vLLM and SGLang for KV cache offloading
    - Works well for multi-turn chat and RAG workloads

    It’s a practical way to treat KV cache as shared infrastructure instead of a per-request artifact.

    ♻️ Share it with anyone who’s running LLMs in production :)

    I share tutorials on how to build + improve AI apps and agents on my newsletter 𝑨𝑰 𝑨𝒈𝒆𝒏𝒕 𝑬𝒏𝒈𝒊𝒏𝒆𝒆𝒓𝒊𝒏𝒈: https://lnkd.in/gaJTcZBR

    Link to repo: https://lnkd.in/ekp-uBjy

    #LLMs #RAG #GenAI

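For anyone who wants to try the vLLM integration mentioned above, here is a minimal sketch based on LMCache's published vLLM examples. The connector name, environment variables, sizes, and model are assumptions that may differ across LMCache/vLLM versions, so treat it as illustrative rather than canonical.

```python
# Hedged sketch: enabling LMCache KV offloading in vLLM (v1 connector API).
# Config names follow LMCache's published vLLM examples; verify against the
# version you install.
import os

# Offload KV cache chunks to CPU DRAM (sizes are illustrative).
os.environ["LMCACHE_CHUNK_SIZE"] = "256"          # tokens per KV chunk
os.environ["LMCACHE_LOCAL_CPU"] = "True"          # enable the CPU offload backend
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "5.0"  # GB of CPU DRAM to use

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",     # any vLLM-supported model
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",        # LMCache's vLLM v1 connector
        kv_role="kv_both",                        # this instance both saves and loads KV
    ),
)

# A long shared prefix (e.g. a RAG context) is saved into LMCache on the
# first request; later requests that repeat it can skip most of the prefill.
long_context = "<your long document or chat history here>"
out = llm.generate(long_context + "\n\nQuestion: ...",
                   SamplingParams(temperature=0, max_tokens=64))
print(out[0].outputs[0].text)
```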
  • 🔥 Interesting research built on top of LMCache to support agentic workloads!

    🤖 Agents are amazing. But they are also 𝗯𝘂𝗿𝗻𝗶𝗻𝗴 𝘆𝗼𝘂𝗿 𝗯𝘂𝗱𝗴𝗲𝘁 🔥

    Agents like Claude Code can solve SWE-Bench tasks at >70% accuracy… but it costs ~$1 per issue, with 90% of tokens spent rereading the same repo.

    Why are agents so 𝗲𝘅𝗽𝗲𝗻𝘀𝗶𝘃𝗲? They:
    1️⃣ Repeatedly re-send massive contexts
    2️⃣ Spawn parallel sub-agents
    3️⃣ Wait on tools while their KV cache gets evicted
    4️⃣ May need to redo the lengthy prefill

    Even with prefix caching, LRU eviction is terrible for agent traces: it still introduces queueing delay for every agentic program.

    To mitigate this, we built Continuum, a TTL-based KV cache scheduler for agents. It introduces a time-to-live mechanism that sets a timer on a KV cache entry before it becomes evictable from the inference engine.

    This is far from a final agent-serving solution, but we present our initial findings and propose directions for future improvement in the full blog: https://lnkd.in/gimYHWHc

    #LLM #Agent #Inference #KVCache #LMCache #vLLM
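Continuum's implementation isn't shown in the post; purely as a toy illustration of the TTL idea (all names are hypothetical and this is not Continuum's API), a "pinned until expiry" cache on top of LRU might look like this:

```python
# Toy illustration of TTL-pinned KV cache eviction (hypothetical names, not
# Continuum's actual code). An entry gets a time-to-live while its agent is
# expected to come back (e.g. waiting on a tool call); until the TTL expires
# it is pinned and skipped by LRU eviction.
import time
from collections import OrderedDict

class TTLPinnedKVCache:
    def __init__(self, capacity_entries: int):
        self.capacity = capacity_entries
        self.entries = OrderedDict()        # key -> (kv_blob, pin_until_ts)

    def put(self, key, kv_blob, ttl_s: float = 0.0):
        """Insert KV and pin it for ttl_s seconds (e.g. expected tool latency)."""
        self.entries[key] = (kv_blob, time.monotonic() + ttl_s)
        self.entries.move_to_end(key)
        self._evict_if_needed()

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)   # refresh LRU position on a hit
            return self.entries[key][0]
        return None

    def _evict_if_needed(self):
        now = time.monotonic()
        # Evict in LRU order, but skip entries whose TTL has not expired yet.
        # If everything is pinned, the cache temporarily runs over capacity.
        for key in list(self.entries):
            if len(self.entries) <= self.capacity:
                break
            _, pin_until = self.entries[key]
            if pin_until <= now:            # only unpinned entries are evictable
                del self.entries[key]
```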

  • LMCache Lab reposted this

    LMCache: smart caching for LLM inference.

    LMCache is a project that offers a solution for storing KV caches on the CPU, on disk, or even in specialized NIXL-backed memory. Essentially, it's a tool that turns one-time computations into reusable blocks, saving time and resources.

    Imagine that in a chatbot, users often refer to the same system prompt or dialogue history. Usually the model processes this data anew, but LMCache simply loads the ready-made cache. Offloading KV caches frees up the GPU for new tasks, reducing TTFT (time to first token) by up to 10 times.

    🟡 LMCache is flexible. Caches can not only be offloaded but also shared between different LLM instances. Simply put, if two users simultaneously hit different copies of the model with the same request, the system won't duplicate the work: the results of one prefill will be available to both. This even works for partial prefix matches in the input.

    🟡 LMCache supports disaggregated prefill. Prefill and decode, which are usually performed on a single GPU, can now be separated: the first stage runs on compute-heavy nodes and the second on nodes optimized for generation. For distributed systems, this raises throughput.

    The developers' tests show that on real workloads latency drops by 3–10x, and GPU cycles are saved on repeated calculations. The project is closely integrated with vLLM, and the repository includes a large set of examples, documentation, and installation and configuration tips.

    ⚠️ There is also a KV cache calculator where you choose a model, its data type, and the number of tokens to estimate how much VRAM can be saved.

    📌 Licensing: Apache 2.0 License.

    🖥 Github: https://lnkd.in/g6Kvxu-M

    #AI #ML #LLM #LMCache #KVCache #Github

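In the same spirit as the KV cache calculator mentioned in that post, here is a generic back-of-the-envelope formula (not the official tool; the model shape below is only an illustrative guess at a typical 8B GQA model, so check the model card before relying on the numbers):

```python
# Back-of-the-envelope KV cache sizing: K and V each store
# num_layers * num_kv_heads * head_dim values per token.
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * num_tokens

# Illustrative config roughly matching an 8B GQA model at fp16.
gib = kv_cache_bytes(num_tokens=32_000, num_layers=32,
                     num_kv_heads=8, head_dim=128) / 2**30
print(f"~{gib:.1f} GiB of KV cache for a 32k-token context")  # ~3.9 GiB
```

Numbers like this are why offloading to CPU DRAM or disk matters: a handful of long conversations can otherwise consume most of a GPU's memory.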
  • Live from LMCache Lab office hours 🚀 We’re diving into the science of KV caching and how LMCache enables large-scale offloading to external storage — straight from the maintainers.

  • LMCache Lab reposted this

    You don't need expensive GPU clusters to handle long-context LLMs anymore.

    The LMCache team just open-sourced a KV cache management system that stores and reuses caches across GPU, CPU, and disk:
    👉 3-10× faster response times in multi-round QA and RAG
    👉 Reduces GPU cycles through intelligent cache reuse
    👉 Supports vLLM v1 and SGLang, with CPU/Disk/P2P storage

    How it works:
    • Distributes cache across GPU, CPU DRAM, and local disk
    • Reuses KV caches across serving instances
    • Handles non-prefix text (not just prompt prefixes)
    • Install: pip install lmcache

    Relevant if you're running multi-round QA, RAG pipelines, or anything with reusable context.

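To see the reuse effect for yourself, a rough timing sketch like the one below can compare cold vs. warm time-to-first-token against an OpenAI-compatible vLLM + LMCache endpoint. The endpoint URL, model name, and prompt are placeholders, and "first streamed event" is only an approximation of first-token time.

```python
# Rough TTFT comparison: send the same long prompt twice and time the first
# streamed event. The second request should benefit from KV cache reuse.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
long_prompt = "<a few thousand tokens of shared context>\n\nSummarize the above."

def ttft_seconds() -> float:
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": long_prompt}],
        stream=True,
        max_tokens=32,
    )
    for _ in stream:                 # first event ~= first token arrival
        return time.perf_counter() - start
    return float("nan")

cold = ttft_seconds()   # prefill computed from scratch
warm = ttft_seconds()   # KV for the shared prefix loaded from the cache
print(f"cold TTFT: {cold:.2f}s, warm TTFT: {warm:.2f}s")
```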
  • We ran a tiny experiment on a single SWE-bench task with Claude Code to study the Context Engineering & Reuse Pattern:
    • 92 LLM calls invoked
    • ~2M input tokens
    • 13 minutes runtime
    • 92% prefix reuse rate

    With prefix caching, this single task drops from $6.00 → $1.15 in input cost (≈81% savings) and dramatically reduces TTFT. The trace shows that Claude Code is essentially a prefix-reuse machine: warm-up calls prime the cache, a parallel multi-agent system fans out work, and a ReAct-style execution loop keeps everything optimized for KV cache reuse.

    Blog post: https://lnkd.in/g3bzMqcj
    Raw trace: https://lnkd.in/gBR4kjUH
    Trace visualizer: https://lnkd.in/g2RPSXpk

    If you care about context engineering, agent architecture, and KV-cache economics, this is a concrete, end-to-end look under the hood. You can paste the raw trace into the trace visualizer to dig deeper.

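As a rough sanity check on those savings, the arithmetic looks like this. The cache-write and cache-read multipliers are assumptions based on commonly published prompt-caching prices (roughly 1.25× to write, 0.1× to read); verify them against your provider before reusing the numbers.

```python
# Illustrative arithmetic for prefix-cache savings on input tokens. The
# multipliers are placeholder assumptions, not official pricing.
def cached_input_cost(base_cost: float, reuse_rate: float,
                      cache_read_mult: float = 0.1,
                      cache_write_mult: float = 1.25) -> float:
    """Input cost when reuse_rate of the tokens are served as cache reads."""
    fresh = (1 - reuse_rate) * base_cost * cache_write_mult   # written to cache
    reused = reuse_rate * base_cost * cache_read_mult         # read from cache
    return fresh + reused

base = 6.00   # uncached input cost for the whole trace (figure from the post)
cost = cached_input_cost(base, reuse_rate=0.92)
print(f"${cost:.2f} with caching ({1 - cost / base:.0%} saved)")
```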
  • LMCache Lab reposted this

    ⚡ In cybersecurity, slow AI is as bad as no AI.

    When you’re building AI-first SOC copilots, agentic pentesting, identity risk engines, or AI triage bots, latency and cost quickly become the real attackers. That’s why ideas like LMCache – a KV cache layer / “Knowledge Delivery Network” for LLMs – are super interesting.

    🔍 What LMCache does (in simple words)

    LMCache sits between your model and your infra and:
    • Reuses KV caches (the internal attention states) across requests
    • Avoids recomputing long, shared prefixes
    • Speeds up generation and reduces GPU/CPU cost
    • Plays nicely with inference engines like vLLM

    Think of it as Cloudflare, but for LLM context.

    🛡️ Why this matters for Cybersecurity AI

    💡 1. Faster SOC copilots & IR assistants
    Security workflows are long and multi-turn: “Show me alerts on this asset → correlate with previous incidents → explain root cause → suggest actions.” LMCache lets you reuse all that shared context instead of recomputing it every time. Result: faster investigations, lower infra burn.

    💡 2. Agentic pentesting that doesn’t feel laggy
    Your red-team / pentest agents often reuse the same recon info, iterate on similar payloads, and call the model multiple times per target. Caching the shared parts of prompts means your agent feels real-time, not “thinking…” for ages.

    💡 3. More context for less cost
    Good cyber reasoning needs log snippets, TI reports, past incident notes, and asset relationships. With LMCache, you can keep rich context in prompts and still make it affordable to run at scale across many analysts / agents.

    💡 4. Scale to thousands of alerts & users
    As soon as you move from “demo” to “product”, the questions hit: Can we handle 10x more alerts? Can every analyst get an AI copilot? Can we run this on-prem with fixed GPUs? A caching layer gives you the throughput and cost profile needed for real enterprise deployments.

    💡 5. Perfect fit for SLM + agentic architectures
    For security, we increasingly prefer smaller, specialized models (SLMs) + agents over giant LLMs. LMCache makes those SLMs snappier, cheaper to run, and better at handling repeated, structured security workflows.

    🧠 Bottom Line

    If you’re building autonomous SOC, agentic pentesting, AI-IR, or identity risk engines, you don’t just need good models – you need fast, cached, cost-efficient inference. A layer like LMCache helps you:
    ✅ Cut latency in complex security workflows
    ✅ Reuse shared security context across calls
    ✅ Serve more analysts/agents on the same hardware
    ✅ Move from “cool PoC” to production-grade AI security platform

    Bigger models are optional. Fast, efficient, cached models are not.

  • LMCache Lab reposted this

    😦 What if 99% of your requests could subsidize themselves with just 1% cache reuse?

    ✅ That's not a hypothetical. It's the actual break-even threshold for KV caching in LLM inference.

    Here's what most teams are missing:

    📉 You're recomputing the same tokens over and over. Processing one NVIDIA annual report costs $3 in LLM inference. Now multiply that by thousands of users per second.

    We built the LMCache ROI Calculator to answer one critical question: "When does caching actually save money?"

    🌟 The answer: at just a 1.06% cache hit rate, you break even. That means if just 1 out of every 94 requests reuses cached data, your storage investment pays for itself.

    🔗 Read the full ROI breakdown: https://lnkd.in/gGKU88XZ
    🧮 Calculate your savings: https://lnkd.in/gkCwAmJy
    🚀 Get $100 free GPU time on Tensormesh (no credit card required): https://lnkd.in/gmKa8chj

    #AI #LLM #MachineLearning #InfrastructureOptimization #CostOptimization
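For intuition on where a break-even threshold like that comes from, here is a deliberately simplified model. It is not the ROI Calculator's actual formula, and both cost inputs are placeholders; the point is only that caching pays off once the prefill compute you skip outweighs what you pay to store the KV bytes.

```python
# Simplified break-even model for KV caching (placeholder costs, not the
# calculator's exact inputs): caching wins once
#   hit_rate * prefill_cost_per_mtok >= store_cost_per_mtok.
def break_even_hit_rate(store_cost_per_mtok: float,
                        prefill_cost_per_mtok: float) -> float:
    """Fraction of requests that must hit the cache for savings to cover storage."""
    return store_cost_per_mtok / prefill_cost_per_mtok

# Placeholder numbers: storing KV for 1M tokens costs ~1% of recomputing them.
rate = break_even_hit_rate(store_cost_per_mtok=0.03, prefill_cost_per_mtok=3.00)
print(f"break-even at a {rate:.2%} hit rate")  # 1.00% with these placeholders
```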
