KV Cache Offloading for Large-Language-Model Inference

Same GPUs.
Four times the AI capacity.

Awide ADP adds a shared NVMe-backed KV-cache layer to LLM inference servers. Conversation context lives outside GPU memory, so expensive GPUs spend their time computing instead of storing. The result is more concurrent users and fewer servers.

4×
more concurrent users under the same latency SLA
$500K
server-equivalent CapEx avoided per workload
≈50%
lower power and cooling for the same work
The Bottleneck

Modern LLM serving is limited by memory, not compute

During inference a transformer keeps a KV cache - the attention keys and values for every token already seen - so it never has to recompute the conversation from scratch on each new token. This cache grows linearly with context length and with the number of concurrent users, and it lives in scarce, expensive GPU memory (VRAM) alongside the model weights.
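To make the linear growth concrete, here is a rough sizing sketch. The model shape used below (64 layers, 8 KV heads, head dimension 128, 2-byte elements) is an illustrative assumption for a grouped-query-attention model, not a figure taken from the tests in this document; the 5,500-token context matches the dialog profile described later.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one token occupies: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_cache_gib(context_tokens: int, concurrent_users: int, per_token: int) -> float:
    """Total KV cache in GiB; linear in both context length and user count."""
    return context_tokens * concurrent_users * per_token / 2**30


# Assumed, illustrative model shape (not taken from the benchmark).
PER_TOKEN = kv_bytes_per_token(num_layers=64, num_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for users in (1, 16, 64, 128):
        size = kv_cache_gib(5500, users, PER_TOKEN)
        print(f"{users:>3} users x 5,500 tokens: {size:7.1f} GiB")
```

Under these assumptions each token costs 256 KiB of cache, so doubling either the context length or the user count doubles the footprint that competes with the model weights for VRAM.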

In real assistant and chatbot traffic, where dialogs are long and history accumulates, the KV cache for many simultaneous users no longer fits in VRAM. This pressure is even more acute in agentic coding harnesses - opencode, Claude Code, Codex and the like - where each session continuously accumulates repository code, file diffs, tool outputs, and multi-step reasoning, with context that only grows turn by turn. The serving engine is then forced to evict cached context. When that context is needed again on the next turn, the GPU must recompute the entire history - a "recompute storm" that spikes latency and collapses the service level. The only conventional remedy is to buy more GPU servers purely to hold memory, not to do more useful computation.

The memory wall

Long-context KV cache exceeds VRAM. The GPU is saturated on memory while still having compute headroom - capacity is wasted.

Recompute storms

Evicted context must be re-processed from token zero. Prefill cost explodes, time-to-first-token spikes, and the SLA breaks under load.

Expensive scaling

Adding memory means adding a whole GPU server: +rack units, +kilowatts, +capital - all to store context, not to serve more tokens.

How It Works

A shared KV-cache tier on NVMe - and why it is faster

ADP places a high-throughput NVMe pool beneath the GPU and uses it as a second tier for the KV cache. Cache blocks that do not fit in VRAM are offloaded to the pool and fetched back on demand, working together with paged/block-level attention and prefix reuse. The scaling limit moves from GPU memory capacity to GPU compute - exactly where you want it, because compute is the thing you are paying the GPU for.
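Block-level caching with prefix reuse can be pictured as follows. This is a generic sketch of chained block hashing, not ADP's or the serving engine's actual implementation; the function names, the 16-token block size and the use of SHA-256 are assumptions made for illustration.

```python
import hashlib
from typing import List, Sequence


def block_keys(token_ids: Sequence[int], block_size: int = 16) -> List[str]:
    """Return one chained hash per full block of tokens.

    Each key depends on the block's tokens and on the previous key, so two
    prompts produce identical keys exactly for the blocks of their common prefix.
    """
    keys: List[str] = []
    parent = ""
    full = len(token_ids) - len(token_ids) % block_size  # skip the partial tail block
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        keys.append(parent)
    return keys


def shared_prefix_blocks(a: Sequence[str], b: Sequence[str]) -> int:
    """Count the leading keys two requests have in common (the reusable blocks)."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


if __name__ == "__main__":
    history = list(range(64))                 # 4 full blocks of shared history
    turn_a = history + [1000] * 16            # same history, different new turn
    turn_b = history + [2000] * 16
    print(shared_prefix_blocks(block_keys(turn_a), block_keys(turn_b)))
```

Keys of this kind let any tier, VRAM or NVMe, answer the question "do I already hold the KV blocks for this prefix?" without inspecting the tensors themselves.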

Step 1
Long dialogs arrive
context history grows past what VRAM can hold
Step 2
Awide ADP offloads the KV cache
overflow blocks are held on NVMe, VRAM stays free for compute
Step 3
Responses stay fast at scale
context is fetched from NVMe, never recomputed

The mechanism is a deliberate trade: instead of spending GPU floating-point operations to rebuild history, the engine spends cheap NVMe bandwidth to read it back. For a context of thousands of tokens, fetching the stored KV blocks is far cheaper than re-running the prefill pass over the whole prompt.
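A back-of-envelope comparison shows the shape of this trade. Every number below is an assumption chosen for illustration (a 32-billion-parameter dense model, 400 TFLOP/s of effective prefill compute, 20 GB/s of sustained read bandwidth, 256 KiB of KV cache per token); none of them is a measurement from this document. The recompute estimate uses the common approximation of about 2 FLOPs per parameter per token for a forward pass and ignores the attention term that grows with context length.

```python
def recompute_seconds(params: float, tokens: int, flops_per_sec: float) -> float:
    """Approximate prefill time: ~2 FLOPs per parameter per token (dense forward pass)."""
    return 2.0 * params * tokens / flops_per_sec


def fetch_seconds(tokens: int, kv_bytes_per_token: int, bytes_per_sec: float) -> float:
    """Time to read the stored KV blocks back at a given sustained bandwidth."""
    return tokens * kv_bytes_per_token / bytes_per_sec


# All values are illustrative assumptions, not measurements.
PARAMS = 32e9                 # dense model parameters
TOKENS = 5500                 # accumulated context per dialog
GPU_FLOPS = 400e12            # effective prefill FLOP/s
KV_BYTES_PER_TOKEN = 262144   # 256 KiB per token
READ_BANDWIDTH = 20e9         # bytes/s from the cache tier

if __name__ == "__main__":
    t_recompute = recompute_seconds(PARAMS, TOKENS, GPU_FLOPS)
    t_fetch = fetch_seconds(TOKENS, KV_BYTES_PER_TOKEN, READ_BANDWIDTH)
    print(f"recompute ~{t_recompute * 1000:.0f} ms, fetch ~{t_fetch * 1000:.0f} ms")
```

The absolute figures depend entirely on the assumed hardware and model; the point is only that the fetch cost scales with bytes over bandwidth while the recompute cost scales with parameters times tokens.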

GPU-only, on cache eviction

1. VRAM fills with weights + KV cache of active users
2. Engine evicts older KV blocks to make room
3. Next turn needs that history → full prefill recompute over the entire context
4. GPU FLOPs burned on recomputation; first-token latency spikes
Cost paid in the scarcest resource: GPU compute.

GPU + ADP, on the same turn

1. Overflowing KV blocks are offloaded to the NVMe pool
2. VRAM stays free for active sequences and computation
3. Next turn needs that history → relevant KV blocks fetched from NVMe
4. No recompute; first-token latency stays bounded under load
Cost paid in the cheapest resource: NVMe bandwidth.
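
The two paths above can be contrasted with a toy simulation. This is a deliberately simplified model, assuming one cache entry per session, LRU eviction and a fixed number of VRAM slots; the class and counters are hypothetical and do not describe how ADP or any serving engine is implemented.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy model: a small LRU 'VRAM' tier with an optional spill tier."""

    def __init__(self, vram_slots: int, offload: bool) -> None:
        self.vram = OrderedDict()   # session id -> None, ordered by recency
        self.spilled = set()        # sessions whose KV blocks sit in the spill tier
        self.history = {}           # session id -> tokens accumulated so far
        self.vram_slots = vram_slots
        self.offload = offload
        self.recomputed_tokens = 0
        self.fetched_tokens = 0

    def turn(self, session: str, new_tokens: int) -> None:
        """Serve one turn: restore the session's history, then append new tokens."""
        past = self.history.get(session, 0)
        if session in self.vram:
            self.vram.move_to_end(session)      # hit in VRAM, nothing to restore
        elif session in self.spilled:
            self.spilled.discard(session)       # read history back from the spill tier
            self.fetched_tokens += past
        else:
            self.recomputed_tokens += past      # evicted (or new, where past == 0)
        self.history[session] = past + new_tokens
        self.vram[session] = None
        while len(self.vram) > self.vram_slots:
            victim, _ = self.vram.popitem(last=False)   # evict least recently used
            if self.offload:
                self.spilled.add(victim)


def simulate(offload: bool, sessions: int = 4, rounds: int = 3) -> TieredKVCache:
    """Round-robin dialog turns of 100 tokens each over a 2-slot VRAM tier."""
    cache = TieredKVCache(vram_slots=2, offload=offload)
    for _ in range(rounds):
        for s in range(sessions):
            cache.turn(f"s{s}", new_tokens=100)
    return cache


if __name__ == "__main__":
    for mode in (False, True):
        c = simulate(offload=mode)
        print(f"offload={mode}: recomputed={c.recomputed_tokens}, fetched={c.fetched_tokens}")
```

With more sessions than VRAM slots, the GPU-only run pays for every returning session in recomputed tokens, while the offload run pays the same token count in reads from the spill tier.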
Measured Results

One 8×H200 server, with ADP and without

A single 8×GPU inference server was driven with a realistic conversational workload while concurrency was doubled step by step from 1 to 128 clients. The defining metric is TPOT (time per output token); the operating limit is the last step before mean TPOT crosses the 50 ms service-level target. Identical hardware, identical model, identical load - the only difference is whether ADP is enabled.

64 vs 16
sustainable clients under the 50 ms SLA - 4× more
12.27 vs 3.47
sustainable throughput, requests / second
~82 vs ~246
ms to first token at 16 clients, the GPU-only operating point
+206%
throughput at peak (128 clients)
[Figure: throughput (req/s) vs concurrent clients, 1 to 128. Series: GPU + ADP, GPU-only.]

Higher is better. GPU-only flattens as memory pressure rises; ADP keeps scaling.

[Figure: TPOT latency (ms) vs concurrent clients, 1 to 128, with the 50 ms SLA threshold marked. Series: GPU + ADP, GPU-only.]

Lower is better. GPU-only crosses the SLA at 32 clients; ADP stays under it through 64 clients and crosses at 128.

Raw measurements

Clients  GPU req/s  ADP req/s  GPU TTFT  ADP TTFT  GPU TPOT  ADP TPOT  ADP gain
1        0.74       0.80       189       66        10.3      10.5      +8%
8        2.74       4.22       221       72        23.9      16.1      +54%
16       3.47       6.14       246       82        38.8      22.3      +77%
32       4.08       8.82       303       94        66.0      31.0      +116%
64       4.74       12.27      433       129       114.5     44.5      +159%
128      5.33       16.31      752       189       204.1     66.0      +206%

TTFT and TPOT in milliseconds. At 128 clients GPU-only first-token latency degrades to 752 ms; ADP holds 189 ms.
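
The "ADP gain" column can be re-derived from the two throughput columns. The short check below uses the req/s values copied from the raw measurements above; the function and variable names are illustrative.

```python
# (clients, GPU-only req/s, ADP req/s) copied from the raw measurements.
THROUGHPUT = [
    (1, 0.74, 0.80),
    (8, 2.74, 4.22),
    (16, 3.47, 6.14),
    (32, 4.08, 8.82),
    (64, 4.74, 12.27),
    (128, 5.33, 16.31),
]


def gain_percent(baseline: float, improved: float) -> float:
    """Relative throughput gain of `improved` over `baseline`, in percent."""
    return (improved / baseline - 1.0) * 100.0


GAINS = {clients: gain_percent(gpu, adp) for clients, gpu, adp in THROUGHPUT}

if __name__ == "__main__":
    for clients, gain in GAINS.items():
        print(f"{clients:>3} clients: {gain:+.0f}%")
```

Rounded to whole percentages, the computed gains match the column as published.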

Why The Numbers Look This Way

Each result maps directly to the mechanism

Nothing here is a black box. The curves above are the predictable signature of moving KV cache off the GPU.

Time-to-first-token (TTFT) → prefill cost

TTFT is dominated by prefill - processing the prompt and history before the first token. GPU-only TTFT climbs from 189 ms to 752 ms as load grows, because evicted context is recomputed. ADP fetches that context from NVMe instead, so TTFT stays in the 66-189 ms band. This gap is the recompute being avoided.

Time-per-output-token (TPOT) → memory pressure

TPOT reflects decode efficiency, which degrades as VRAM contention forces smaller effective batches. Freeing VRAM lets the scheduler keep more sequences resident, so ADP's TPOT rises slowly and first exceeds the 50 ms SLA at 128 clients, versus 32 for GPU-only - a 4× shift in the sustainable operating point (64 clients versus 16).
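
The operating-limit rule (the last step before mean TPOT crosses the SLA) can be applied mechanically to the measured TPOT values. The sketch below uses the figures from the raw measurements; the helper name is an assumption for illustration.

```python
from typing import Optional, Sequence

CLIENTS = [1, 8, 16, 32, 64, 128]
# Mean TPOT in ms per load step, copied from the raw measurements.
TPOT_GPU_ONLY = [10.3, 23.9, 38.8, 66.0, 114.5, 204.1]
TPOT_ADP = [10.5, 16.1, 22.3, 31.0, 44.5, 66.0]
SLA_MS = 50.0


def sustainable_clients(clients: Sequence[int], tpot_ms: Sequence[float],
                        sla_ms: float) -> Optional[int]:
    """Last concurrency step before mean TPOT first exceeds the SLA."""
    last_ok = None
    for n, tpot in zip(clients, tpot_ms):
        if tpot > sla_ms:
            break
        last_ok = n
    return last_ok


if __name__ == "__main__":
    gpu = sustainable_clients(CLIENTS, TPOT_GPU_ONLY, SLA_MS)
    adp = sustainable_clients(CLIENTS, TPOT_ADP, SLA_MS)
    print(f"GPU-only: {gpu} clients, GPU + ADP: {adp} clients, ratio {adp / gpu:.0f}x")
```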

Throughput → freed capacity

With context held outside VRAM, more concurrent sequences fit and the GPU stays compute-utilized. Throughput therefore keeps climbing where GPU-only plateaus, reaching 16.31 vs 5.33 req/s at peak. The curve doesn't bend up by magic - it bends because the memory ceiling was removed.

The offload actually happened

During the run the cache tier was genuinely exercised: the pool grew by +92.2 GB and 703,576 context objects were written and served back. These are real I/O counters, not a modeling assumption - direct evidence that gains come from offloaded-and-reused context.

Honest boundary - where ADP does not help

For short, stateless requests with no accumulating history, GPU-only can show higher peak throughput, because there is nothing to cache and the offload path adds slight overhead. This is exactly what the mechanism predicts: no reusable context means no benefit. ADP's advantage is specific to long, multi-turn, context-heavy workloads - the real shape of assistant and chatbot traffic. We state this plainly so reviewers can see the results are bounded and not cherry-picked.

The same signature appears in a long-dialog test on a separate model: with six turns per conversation and ~5,500 tokens of context, the advantage grows with dialog length. By the sixth turn at 72 concurrent dialogs, end-to-end response time is 40.0 s with ADP versus 68.4 s GPU-only - divergence that widens exactly as accumulated context grows, which is what KV reuse should produce.

Time to first response, long dialogs (seconds)

Concurrent dialogs  GPU-only  GPU + ADP
24                  10.1      6.9
48                  27.5      16.9
72                  46.0      27.0

Lower is better - ADP responds ~41% faster at 72 dialogs.

Throughput, long dialogs (req/s)

Concurrent dialogs  GPU-only  GPU + ADP
24                  1.23      1.76
48                  1.22      1.94
72                  1.20      2.01

Higher is better - +68% requests/second at 72 dialogs, zero errors.
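
The two headline percentages for the long-dialog test follow from the 72-dialog data points. The values below are read from the chart data labels (46.0 s vs 27.0 s to first response, 1.20 vs 2.01 req/s); the helper names are illustrative.

```python
def percent_faster(baseline_s: float, improved_s: float) -> float:
    """Reduction in response time relative to the baseline, in percent."""
    return (baseline_s - improved_s) / baseline_s * 100.0


def percent_more(baseline: float, improved: float) -> float:
    """Increase in throughput relative to the baseline, in percent."""
    return (improved / baseline - 1.0) * 100.0


# 72 concurrent dialogs, values read from the chart data labels.
FIRST_RESPONSE_S = {"gpu_only": 46.0, "adp": 27.0}
THROUGHPUT_RPS = {"gpu_only": 1.20, "adp": 2.01}

if __name__ == "__main__":
    faster = percent_faster(FIRST_RESPONSE_S["gpu_only"], FIRST_RESPONSE_S["adp"])
    more = percent_more(THROUGHPUT_RPS["gpu_only"], THROUGHPUT_RPS["adp"])
    print(f"time to first response: {faster:.1f}% lower")
    print(f"throughput: {more:.1f}% higher")
```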

A Genuine Win-Win

The same workload, on half the hardware

To hold 64 clients under the latency SLA, bare GPUs need twice the servers. ADP serves the same workload from one server - and both sides win: engineering gets stable latency, finance gets lower cost.

GPU-only

Scale by brute force

AI servers: 2
GPUs: 16 × H200
Rack space: 14U
Power under load: 20-24 kW
Infrastructure cost: $1,000,000
VS
GPU + Awide ADP

Consolidate the infrastructure

AI servers: 1
GPUs: 8 × H200
Rack space: 7U
Power under load: 10-12 kW
Infrastructure cost: $500,000

The cost of bare GPU scaling

  • Adding capacity means buying another GPU server
  • GPU memory is the primary bottleneck
  • Rack footprint, power and cooling all double
  • Latency degrades sharply under heavy load

What ADP delivers to the buyer

  • 4× more concurrent users on the same GPU cluster
  • Predictable SLA under heavy inference load
  • Deferred GPU purchases and less exposure to GPU supply constraints
  • Better performance-per-dollar across the fleet

Datacenter savings per AI cluster

Resource             GPU-only   GPU + ADP   Saving
AI servers           2          1           50%
GPUs                 16         8           8 GPUs
Rack space           14U        7U          7U
Power                20-24 kW   10-12 kW    ≈50%
Cooling              100%       50%         ≈50%
Infrastructure cost  $1.0M      $0.5M       $500K
AI platform team
Pain: long context destabilizes latency. ADP: stable TPOT and TTFT under load.
Infrastructure
Pain: GPU memory drives fleet growth. ADP: KV-cache capacity grows outside VRAM.
Datacenter
Pain: rack, power and cooling limits. ADP: 2× density, up to 50% less energy.
Finance / procurement
Pain: GPU CapEx and scarcity. ADP: one 8-GPU server avoided per workload.
Methodology

Methodology, definitions and the test stack

This section gives a technical specialist the exact conditions, metric definitions and instrumentation needed to validate the claims independently.

Hardware & software stack

GPU
8 × NVIDIA H200 141GB
Total GPU memory
1,128 GB
Cache tier
8 × 15 TB NVMe (ADP RAID)
KV-cache pool required
966.86 GiB
OS / CUDA
Ubuntu 22.04 · CUDA 12.8
Serving engine
vLLM 0.12.0
Models exercised
DeepSeek-V3 · Qwen2.5-32B

Test method

Apples-to-apples: identical hardware, model and request load in both modes; the only variable is ADP on/off.
Load ramp: concurrency doubled 1 → 128. The sustainable point is the last step before mean TPOT exceeds the 50 ms SLA.
Conversational profile: multi-turn dialogs (6 turns, ~5,500 tokens of context) to reflect real assistant traffic, not single-shot prompts.
Offload instrumentation: cache growth (+92.2 GB) and object count (703,576) captured live, confirming real reuse.

Metric definitions

TTFT - time to first token. Latency until the first output token; dominated by prefill over the prompt and history. Sensitive to recompute on cache eviction.
TPOT - time per output token. Mean inter-token latency during decode; sensitive to VRAM contention and effective batch size. The SLA metric, threshold 50 ms.
req/s - completed requests per second; the productive throughput before SLA violation.
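
As a minimal sketch of how these metrics can be derived from per-token timestamps: the function below follows the definitions above (TTFT as the delay to the first output token, TPOT as the mean gap between consecutive output tokens). The exact formulas used by the benchmark tooling are not stated in this document, so treat the function name and the handling of single-token responses as assumptions.

```python
from typing import Sequence, Tuple


def ttft_and_tpot_ms(request_sent: float,
                     token_times: Sequence[float]) -> Tuple[float, float]:
    """Return (TTFT, mean TPOT) in ms from wall-clock timestamps in seconds.

    request_sent: time the request was issued.
    token_times: arrival time of each output token, in order.
    """
    if not token_times:
        raise ValueError("need at least one output token")
    ttft = (token_times[0] - request_sent) * 1000.0
    if len(token_times) < 2:
        return ttft, 0.0   # no inter-token gap exists for a single-token response
    decode_span = token_times[-1] - token_times[0]
    tpot = decode_span / (len(token_times) - 1) * 1000.0
    return ttft, tpot


if __name__ == "__main__":
    # Hypothetical timestamps: first token after 82 ms, then one token every 22 ms.
    ttft, tpot = ttft_and_tpot_ms(0.0, [0.082, 0.104, 0.126, 0.148])
    print(f"TTFT {ttft:.0f} ms, TPOT {tpot:.1f} ms")
```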

How to read the trade-off

What is spent: NVMe bandwidth and PCIe transfers, in place of GPU FLOPs spent on recomputation.
When it wins: long, context-heavy, multi-turn workloads where KV blocks are reused across turns and requests.
When it is neutral: short, stateless requests with no reusable context - a small overhead, stated openly above.
What it does not change: model weights, outputs or accuracy - ADP is a caching tier, not a model modification.

ADP is not an accelerator.
It is a capacity multiplier.

The same workload runs on fewer GPU servers, at higher rack density, lower power draw and predictable latency at scale - a result that holds up to technical scrutiny because it follows directly from where the KV cache lives.