KV Cache Offloading for Large-Language-Model Inference

Same GPUs. Four times the AI capacity.

Awide ADP adds a shared NVMe-backed KV-cache layer to LLM inference servers. Conversation context lives outside GPU memory, so expensive GPUs spend their time computing instead of storing. The result is more concurrent users and fewer servers.

4× more concurrent users under the same latency SLA

$500K server-equivalent CapEx avoided per workload

50% lower power and cooling for the same work

The Bottleneck

Modern LLM serving is limited by memory, not compute

During inference a transformer keeps a KV cache - the attention keys and values for every token already seen - so it never has to recompute the conversation from scratch on each new token. This cache grows linearly with context length and with the number of concurrent users, and it lives in scarce, expensive GPU memory (VRAM) alongside the model weights.
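To make the linear growth concrete, here is a back-of-envelope sizing sketch. The formula (one key and one value vector per layer and per KV head) is the usual one for a transformer with grouped-query attention. The model dimensions and the `ModelShape` helper are illustrative assumptions, not figures from the tests reported in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Illustrative transformer dimensions (assumed, not taken from this document)."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_value: int  # 2 for fp16/bf16


def kv_bytes_per_token(m: ModelShape) -> int:
    # One key vector and one value vector per layer and per KV head.
    return 2 * m.num_layers * m.num_kv_heads * m.head_dim * m.bytes_per_value


def kv_cache_bytes(m: ModelShape, context_tokens: int, concurrent_users: int) -> int:
    # Linear in both context length and number of concurrent users.
    return kv_bytes_per_token(m) * context_tokens * concurrent_users


GIB = 1024 ** 3
# Assumed shape of a mid-size dense model with grouped-query attention.
shape = ModelShape(num_layers=64, num_kv_heads=8, head_dim=128, bytes_per_value=2)

for users in (1, 16, 64, 128):
    total = kv_cache_bytes(shape, context_tokens=5_500, concurrent_users=users)
    print(f"{users:>3} users: {total / GIB:7.1f} GiB of KV cache")
```

Under these assumptions each token costs 256 KiB of cache, so doubling either the context length or the user count doubles the footprint that has to sit next to the model weights.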

In real assistant and chatbot traffic, where dialogs are long and history accumulates, the KV cache for many simultaneous users no longer fits in VRAM. This pressure is even more acute in agentic coding harnesses - opencode, Claude Code, Codex and the like - where each session continuously accumulates repository code, file diffs, tool outputs, and multi-step reasoning, with context that only grows turn by turn. The serving engine is then forced to evict cached context. When that context is needed again on the next turn, the GPU must recompute the entire history - a "recompute storm" that spikes latency and collapses the service level. The only conventional remedy is to buy more GPU servers purely to hold memory, not to do more useful computation.

The memory wall

Long-context KV cache exceeds VRAM. The GPU is saturated on memory while still having compute headroom - capacity is wasted.

Recompute storms

Evicted context must be re-processed from token zero. Prefill cost explodes, time-to-first-token spikes, and the SLA breaks under load.

Expensive scaling

Adding memory means adding a whole GPU server: +rack units, +kilowatts, +capital - all to store context, not to serve more tokens.

How It Works

A shared KV-cache tier on NVMe - and why it is faster

ADP places a high-throughput NVMe pool beneath the GPU and uses it as a second tier for the KV cache. Cache blocks that do not fit in VRAM are offloaded to the pool and fetched back on demand, working together with paged/block-level attention and prefix reuse. The scaling limit moves from GPU memory capacity to GPU compute - exactly where you want it, because compute is the thing you are paying the GPU for.
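Block-level prefix reuse is commonly built on a chained hash over fixed-size token blocks, so that two requests sharing a prefix produce identical block keys. The sketch below illustrates that general idea only. The block size, the `block_keys` and `shared_prefix_blocks` helpers, and the choice of SHA-256 are assumptions for illustration, not a description of ADP's or the serving engine's internal format.

```python
import hashlib
from typing import List, Sequence

BLOCK_SIZE = 16  # tokens per KV block (assumed for illustration)


def block_keys(token_ids: Sequence[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Return one key per full block; each key depends on every token before it."""
    keys: List[str] = []
    parent = b""
    for i in range(len(token_ids) // block_size):
        block = token_ids[i * block_size:(i + 1) * block_size]
        h = hashlib.sha256()
        h.update(parent)  # chain in the previous block's key
        h.update(b",".join(str(t).encode() for t in block))
        parent = h.digest()
        keys.append(parent.hex())
    return keys


def shared_prefix_blocks(a: Sequence[int], b: Sequence[int]) -> int:
    """Count leading blocks whose stored KV data could be reused between two requests."""
    count = 0
    for key_a, key_b in zip(block_keys(a), block_keys(b)):
        if key_a != key_b:
            break
        count += 1
    return count


turn_1 = list(range(40))                # 40 tokens of history
turn_2 = turn_1 + [900, 901, 902, 903]  # same history plus 4 new tokens
print(shared_prefix_blocks(turn_1, turn_2), "blocks reusable on the next turn")
```

Because each key folds in the key of the block before it, a match on block N implies the whole prefix up to block N matches, which is what makes a lookup by key safe for reuse.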

Step 1: Long dialogs arrive. Context history grows past what VRAM can hold.

Step 2: Awide ADP offloads the KV cache. Overflow blocks are held on NVMe; VRAM stays free for compute.

Step 3: Responses stay fast at scale. Context is fetched from NVMe, never recomputed.

The mechanism is a deliberate trade: instead of spending GPU floating-point operations to rebuild history, the engine spends cheap NVMe bandwidth to read it back. For a context of thousands of tokens, fetching the stored KV blocks is far cheaper than re-running prefill, a full forward pass through every layer, over the whole prompt.
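The trade can be put into rough numbers. The sketch below compares an estimated recompute time with an estimated fetch time for one evicted context. It uses the common approximation of about 2 FLOPs per parameter per token for a forward pass. Every constant is an assumption chosen for illustration, not a measurement from this document, so substitute values for your own model and hardware.

```python
def recompute_seconds(active_params: float, context_tokens: int,
                      effective_flops_per_sec: float) -> float:
    # Approximation: a forward pass costs about 2 FLOPs per parameter per token.
    # The attention term that grows quadratically with context is ignored.
    return 2.0 * active_params * context_tokens / effective_flops_per_sec


def fetch_seconds(kv_bytes_per_token: int, context_tokens: int,
                  read_bytes_per_sec: float) -> float:
    # Reading stored KV blocks back is a pure bandwidth cost.
    return kv_bytes_per_token * context_tokens / read_bytes_per_sec


# All values below are assumptions for illustration only.
ACTIVE_PARAMS = 32e9          # parameters touched per token
CONTEXT_TOKENS = 5_500        # history to restore
EFFECTIVE_FLOPS = 400e12      # sustained FLOP/s available for prefill
KV_BYTES_PER_TOKEN = 262_144  # 256 KiB per token
NVME_READ_BPS = 20e9          # aggregate read bandwidth of the pool, bytes/s

recompute = recompute_seconds(ACTIVE_PARAMS, CONTEXT_TOKENS, EFFECTIVE_FLOPS)
fetch = fetch_seconds(KV_BYTES_PER_TOKEN, CONTEXT_TOKENS, NVME_READ_BPS)
print(f"recompute ~{recompute * 1000:.0f} ms, fetch ~{fetch * 1000:.0f} ms")
```

The point of the model is the shape of the comparison, not the absolute values: recompute cost scales with model size and context length, while fetch cost scales with cache bytes and storage bandwidth.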

GPU-only, on cache eviction

1. VRAM fills with weights + KV cache of active users
2. Engine evicts older KV blocks to make room
3. Next turn needs that history → full prefill recompute over the entire context
4. GPU FLOPs burned on recomputation; first-token latency spikes

Cost paid in the scarcest resource: GPU compute.

GPU + ADP, on the same turn

1. Overflowing KV blocks are offloaded to the NVMe pool
2. VRAM stays free for active sequences and computation
3. Next turn needs that history → relevant KV blocks fetched from NVMe
4. No recompute; first-token latency stays bounded under load

Cost paid in the cheapest resource: NVMe bandwidth.
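The two paths above can be sketched as a toy cache with an optional second tier. This is a minimal model, assuming LRU eviction and using an in-memory dictionary as a stand-in for the NVMe pool. The `TieredKVCache` class and its fields are hypothetical names for illustration, not ADP's implementation.

```python
from collections import OrderedDict
from typing import Dict


class TieredKVCache:
    """Toy LRU cache for KV blocks; `fast` stands in for VRAM, `slow` for NVMe."""

    def __init__(self, fast_capacity: int, offload: bool) -> None:
        self.fast: "OrderedDict[str, bytes]" = OrderedDict()
        self.slow: Dict[str, bytes] = {}
        self.fast_capacity = fast_capacity
        self.offload = offload
        self.recomputes = 0  # misses that required rebuilding the block
        self.fetches = 0     # misses served from the second tier

    def _admit(self, key: str, block: bytes) -> None:
        self.fast[key] = block
        self.fast.move_to_end(key)
        while len(self.fast) > self.fast_capacity:
            old_key, old_block = self.fast.popitem(last=False)  # evict LRU block
            if self.offload:
                self.slow[old_key] = old_block  # keep it on the second tier

    def get(self, key: str) -> bytes:
        if key in self.fast:
            self.fast.move_to_end(key)
            return self.fast[key]
        if key in self.slow:
            self.fetches += 1
            block = self.slow.pop(key)
        else:
            self.recomputes += 1
            block = f"kv:{key}".encode()  # placeholder for an expensive prefill
        self._admit(key, block)
        return block


def run(offload: bool) -> TieredKVCache:
    cache = TieredKVCache(fast_capacity=2, offload=offload)
    # Three sessions take turns; only two fit in the fast tier at once.
    for session in ["a", "b", "c", "a", "b", "c"]:
        cache.get(session)
    return cache


for mode in (False, True):
    c = run(mode)
    print(f"offload={mode}: recomputes={c.recomputes}, fetches={c.fetches}")
```

With round-robin access and a fast tier one slot too small, the cache without offload misses on every turn. With the second tier, only the first touch of each session pays for a rebuild and later turns become fetches.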

Measured Results

One 8×H200 server, with ADP and without

A single 8×GPU inference server was driven with a realistic conversational workload while concurrency was doubled step by step from 1 to 128 clients. The defining metric is TPOT (time per output token); the operating limit is the last step before mean TPOT crosses the 50 ms service-level target. Identical hardware, identical model, identical load - the only difference is whether ADP is enabled.

64 vs 16 sustainable clients under the 50 ms SLA - 4× more

12.27 vs 3.47 requests / second at the sustainable point (64 clients with ADP, 16 GPU-only)

~82 vs ~246 ms to first token at 16 clients, the GPU-only operating point

+206% throughput at peak (128 clients)

Figure: Throughput vs concurrent clients (req/s), GPU + ADP vs GPU-only. Higher is better. GPU-only flattens as memory pressure rises; ADP keeps scaling.

Figure: TPOT latency vs SLA threshold, GPU + ADP vs GPU-only. Lower is better. GPU-only crosses the 50 ms SLA at 32 clients; ADP stays under it through 64 clients and crosses at 128.

Raw measurements

Clients	GPU req/s	ADP req/s	GPU TTFT	ADP TTFT	GPU TPOT	ADP TPOT	ADP gain
1	0.74	0.80	189	66	10.3	10.5	+8%
8	2.74	4.22	221	72	23.9	16.1	+54%
16	3.47	6.14	246	82	38.8	22.3	+77%
32	4.08	8.82	303	94	66.0	31.0	+116%
64	4.74	12.27	433	129	114.5	44.5	+159%
128	5.33	16.31	752	189	204.1	66.0	+206%

TTFT and TPOT in milliseconds; ADP gain is the increase in req/s over GPU-only. At 128 clients GPU-only first-token latency degrades to 752 ms; ADP holds 189 ms.
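The operating-limit rule (the last step before mean TPOT crosses 50 ms) can be checked directly against the table. The sketch below copies the TPOT columns from the raw measurements; the `sustainable_clients` helper is a hypothetical name introduced for illustration.

```python
from typing import List, Optional, Tuple

SLA_MS = 50.0

# (clients, GPU-only TPOT ms, ADP TPOT ms), copied from the raw measurements table.
ROWS: List[Tuple[int, float, float]] = [
    (1, 10.3, 10.5),
    (8, 23.9, 16.1),
    (16, 38.8, 22.3),
    (32, 66.0, 31.0),
    (64, 114.5, 44.5),
    (128, 204.1, 66.0),
]


def sustainable_clients(rows: List[Tuple[int, float, float]], column: int,
                        sla_ms: float = SLA_MS) -> Optional[int]:
    """Largest concurrency step reached before mean TPOT first exceeds the SLA."""
    best: Optional[int] = None
    for row in rows:
        if row[column] > sla_ms:
            break
        best = row[0]
    return best


gpu_only = sustainable_clients(ROWS, column=1)
with_adp = sustainable_clients(ROWS, column=2)
print(gpu_only, with_adp, with_adp / gpu_only)  # 16 64 4.0
```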

Why The Numbers Look This Way

Each result maps directly to the mechanism

Nothing here is a black box. The curves above are the predictable signature of moving KV cache off the GPU.

Time-to-first-token (TTFT) → prefill cost

TTFT is dominated by prefill - processing the prompt and history before the first token. GPU-only TTFT climbs from 189 ms to 752 ms as load grows, because evicted context is recomputed. ADP fetches that context from NVMe instead, so TTFT stays in the 66-189 ms band. This gap is the recompute being avoided.

Time-per-output-token (TPOT) → memory pressure

TPOT reflects decode efficiency, which degrades as VRAM contention forces smaller effective batches. Freeing VRAM lets the scheduler keep more sequences resident, so ADP's TPOT rises slowly and does not cross the 50 ms SLA until 128 clients, versus 32 for GPU-only - a 4× shift in the sustainable operating point (64 vs 16 clients).

Throughput → freed capacity

With context held outside VRAM, more concurrent sequences fit and the GPU stays compute-utilized. Throughput therefore keeps climbing where GPU-only plateaus, reaching 16.31 vs 5.33 req/s at peak. The curve doesn't bend up by magic - it bends because the memory ceiling was removed.

The offload actually happened

During the run the cache tier was genuinely exercised: the pool grew by +92.2 GB and 703,576 context objects were written and served back. These are real I/O counters, not a modeling assumption - direct evidence that gains come from offloaded-and-reused context.
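As a quick sanity check on those counters, dividing pool growth by object count gives a rough mean object size. This is only an estimate: it assumes "+92.2 GB" means decimal gigabytes and that pool growth approximates the total bytes written, neither of which is stated in this document.

```python
pool_growth_bytes = 92.2e9  # "+92.2 GB", read here as decimal gigabytes (assumption)
objects_written = 703_576   # context objects reported for the run

mean_object_bytes = pool_growth_bytes / objects_written
print(f"mean object size: {mean_object_bytes / 1e3:.1f} kB "
      f"({mean_object_bytes / 1024:.1f} KiB)")
```

A mean on the order of a hundred-odd kilobytes is consistent with block-granular KV objects rather than whole conversations stored as single blobs, though the actual object layout is not described here.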

Honest boundary - where ADP does not help

For short, stateless requests with no accumulating history, GPU-only can show higher peak throughput, because there is nothing to cache and the offload path adds slight overhead. This is exactly what the mechanism predicts: no reusable context means no benefit. ADP's advantage is specific to long, multi-turn, context-heavy workloads - the real shape of assistant and chatbot traffic. We state this plainly so reviewers can see the results are bounded and not cherry-picked.

The same signature appears in a long-dialog test on a separate model: with six turns per conversation and ~5,500 tokens of context, the advantage grows with dialog length. By the sixth turn at 72 concurrent dialogs, end-to-end response time is 40.0 s with ADP versus 68.4 s GPU-only - divergence that widens exactly as accumulated context grows, which is what KV reuse should produce.

Figure: Time to first response, long dialogs (seconds, by number of concurrent dialogs). Data labels (GPU-only / GPU + ADP): 10.1 / 6.9 and 27.5 / 16.9. Lower is better - ADP responds ~41% faster at 72 dialogs.

Figure: Throughput, long dialogs (req/s, by number of concurrent dialogs). Data labels (GPU-only / GPU + ADP): 1.23 / 1.76, 1.22 / 1.94 and 1.2 / 2.01. Higher is better - +68% requests/second at 72 dialogs, zero errors.

A Genuine Win-Win

The same workload, on half the hardware

To hold 64 clients under the latency SLA, bare GPUs need at least twice the servers. ADP serves the same workload from one server - and both sides win: engineering gets stable latency, finance gets lower cost.

GPU-only

Scale by brute force

AI servers: 2
GPUs: 16 × H200
Rack space: 14U
Power under load: 20-24 kW
Infrastructure cost: $1,000,000

GPU + Awide ADP

Consolidate the infrastructure

AI servers: 1
GPUs: 8 × H200
Rack space: 7U
Power under load: 10-12 kW
Infrastructure cost: $500,000

The cost of bare GPU scaling

! Adding capacity means buying another GPU server
! GPU memory is the primary bottleneck
! Rack footprint, power and cooling all double
! Latency degrades sharply under heavy load

What ADP delivers to the buyer

✓ 4× more concurrent users on the same GPU cluster
✓ Predictable SLA under heavy inference load
✓ Deferred GPU purchases and less exposure to GPU supply constraints
✓ Better performance-per-dollar across the fleet

Datacenter savings per AI cluster

Resource	GPU-only	GPU + ADP	Saving
AI servers	2	1	50%
GPUs	16	8	8 GPUs
Rack space	14U	7U	7U
Power	20-24 kW	10-12 kW	≈50%
Cooling	100%	50%	≈50%
Infrastructure cost	$1.0M	$0.5M	$500K

AI platform team

Pain: long context destabilizes latency. ADP: stable TPOT and TTFT under load.

Infrastructure

Pain: GPU memory drives fleet growth. ADP: KV-cache capacity grows outside VRAM.

Datacenter

Pain: rack, power and cooling limits. ADP: 2× density, up to 50% less energy.

Finance / procurement

Pain: GPU CapEx and scarcity. ADP: one 8-GPU server avoided per workload.

Methodology

Methodology, definitions and the test stack

This section gives a technical specialist the exact conditions, metric definitions and instrumentation needed to validate the claims independently.

Hardware & software stack

GPU: 8 × NVIDIA H200 141 GB
Total GPU memory: 1,128 GB
Cache tier: 8 × 15 TB NVMe (ADP RAID)
KV-cache pool required: 966.86 GiB
OS / CUDA: Ubuntu 22.04 · CUDA 12.8
Serving engine: vLLM 0.12.0
Models exercised: DeepSeek-V3 · Qwen2.5-32B

Test method

Apples-to-apples: identical hardware, model and request load in both modes; the only variable is ADP on/off.

Load ramp: concurrency doubled 1 → 128. The sustainable point is the last step before mean TPOT exceeds the 50 ms SLA.

Conversational profile: multi-turn dialogs (6 turns, ~5,500 tokens of context) to reflect real assistant traffic, not single-shot prompts.

Offload instrumentation: cache growth (+92.2 GB) and object count (703,576) captured live, confirming real reuse.

Metric definitions

TTFT - time to first token. Latency until the first output token; dominated by prefill over the prompt and history. Sensitive to recompute on cache eviction.

TPOT - time per output token. Mean inter-token latency during decode; sensitive to VRAM contention and effective batch size. The SLA metric, threshold 50 ms.

req/s - completed requests per second; the productive throughput before SLA violation.
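The definitions above can be expressed as a small calculation over per-request timestamps. This is a sketch of one common way to compute the metrics, assuming the client records the send time and the arrival time of each output token. The helper names are hypothetical, and the benchmark harness used for the measurements may compute them differently.

```python
from typing import Sequence


def ttft_ms(request_sent: float, token_times: Sequence[float]) -> float:
    """Time to first token: send time to arrival of the first output token."""
    return (token_times[0] - request_sent) * 1000.0


def tpot_ms(token_times: Sequence[float]) -> float:
    """Time per output token: mean gap between consecutive output tokens."""
    if len(token_times) < 2:
        raise ValueError("TPOT needs at least two output tokens")
    span = token_times[-1] - token_times[0]
    return span / (len(token_times) - 1) * 1000.0


def meets_sla(per_request_tpot_ms: Sequence[float], sla_ms: float = 50.0) -> bool:
    """SLA check as defined here: mean TPOT across requests at or below the threshold."""
    return sum(per_request_tpot_ms) / len(per_request_tpot_ms) <= sla_ms


# Timestamps in seconds: request sent at t=0.00, four tokens arrive afterwards.
sent = 0.00
arrivals = [0.10, 0.12, 0.14, 0.16]
print(f"TTFT {ttft_ms(sent, arrivals):.0f} ms, TPOT {tpot_ms(arrivals):.0f} ms")
print(meets_sla([tpot_ms(arrivals), 44.5]))
```

Note that TPOT excludes the first token by construction, which is why prefill cost shows up in TTFT and decode cost in TPOT.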

How to read the trade-off

What is spent: NVMe bandwidth and PCIe transfers, in place of GPU FLOPs spent on recomputation.

When it wins: long, context-heavy, multi-turn workloads where KV blocks are reused across turns and requests.

When it does not help: short, stateless requests with no reusable context, where the offload path adds a small overhead, as stated openly above.

What it does not change: model weights, outputs or accuracy - ADP is a caching tier, not a model modification.

ADP is not an accelerator.
It is a capacity multiplier.

The same workload runs on fewer GPU servers, at higher rack density, lower power draw and predictable latency at scale - a result that holds up to technical scrutiny because it follows directly from where the KV cache lives.
