Awide ADP adds a shared NVMe-backed KV-cache layer to LLM inference servers. Conversation context lives outside GPU memory, so expensive GPUs spend their time computing instead of storing. The result is more concurrent users and fewer servers.
During inference a transformer keeps a KV cache - the attention keys and values for every token already seen - so it never has to recompute the conversation from scratch on each new token. This cache grows linearly with context length and with the number of concurrent users, and it lives in scarce, expensive GPU memory (VRAM) alongside the model weights.
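To make the linear growth concrete, the following is a minimal sizing sketch. The formula (one key tensor and one value tensor per layer per token) is the usual back-of-the-envelope estimate for decoder-only transformers. The model shape, context length, and user count are hypothetical placeholders chosen for illustration, not the model or workload measured in this document.

```python
def kv_cache_bytes(tokens, users, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV-cache footprint in bytes.

    Keys and values (factor of 2) are stored for every layer and every token;
    bytes_per_elem=2 assumes 16-bit storage.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * tokens * users


# Hypothetical model shape (assumption, for illustration only).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

per_token = kv_cache_bytes(1, 1, LAYERS, KV_HEADS, HEAD_DIM)
one_user = kv_cache_bytes(8_000, 1, LAYERS, KV_HEADS, HEAD_DIM)
many_users = kv_cache_bytes(8_000, 64, LAYERS, KV_HEADS, HEAD_DIM)

print(f"per token:        {per_token / 2**10:.0f} KiB")
print(f"1 user,  8k ctx:  {one_user / 2**30:.2f} GiB")
print(f"64 users, 8k ctx: {many_users / 2**30:.1f} GiB")
```

Under these assumed dimensions the footprint scales exactly with tokens × users, which is why a fleet of long conversations outgrows VRAM long before the GPU runs out of compute.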
In real assistant and chatbot traffic, where dialogs are long and history accumulates, the KV cache for many simultaneous users no longer fits in VRAM. The pressure is even more acute in agentic coding harnesses - opencode, Claude Code, Codex and the like - where each session continuously accumulates repository code, file diffs, tool outputs, and multi-step reasoning, so context only grows turn by turn. The serving engine is then forced to evict cached context. When that context is needed again on the next turn, the GPU must recompute the entire history - a "recompute storm" that spikes latency and pushes the service past its latency target. The only conventional remedy is to buy more GPU servers purely to hold memory, not to do more useful computation.
Long-context KV cache exceeds VRAM. The GPU is saturated on memory while still having compute headroom - capacity is wasted.
Evicted context must be re-processed from token zero. Prefill cost explodes, time-to-first-token spikes, and the SLA breaks under load.
Adding memory means adding a whole GPU server: +rack units, +kilowatts, +capital - all to store context, not to serve more tokens.
ADP places a high-throughput NVMe pool beneath the GPU and uses it as a second tier for the KV cache. Cache blocks that do not fit in VRAM are offloaded to the pool and fetched back on demand, working together with paged/block-level attention and prefix reuse. The scaling limit moves from GPU memory capacity to GPU compute - exactly where you want it, because compute is the thing you are paying the GPU for.
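The two-tier idea can be sketched as a toy block cache. This is not ADP's implementation or API: the class name, the LRU eviction policy, and the in-memory dictionaries standing in for VRAM and the NVMe pool are all assumptions made for illustration.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy two-tier block cache (hypothetical, not ADP's API).

    `fast` stands in for VRAM and holds a bounded number of blocks;
    `slow` stands in for the NVMe pool and holds everything offloaded.
    """

    def __init__(self, fast_capacity_blocks):
        self.fast_capacity = fast_capacity_blocks
        self.fast = OrderedDict()  # block_id -> data, least recently used first
        self.slow = {}             # block_id -> data, offloaded blocks
        self.fetches = 0           # blocks read back from the slow tier

    def put(self, block_id, data):
        self.fast[block_id] = data
        self.fast.move_to_end(block_id)
        self._offload_overflow()

    def get(self, block_id):
        if block_id in self.fast:
            self.fast.move_to_end(block_id)
            return self.fast[block_id]
        if block_id in self.slow:
            # Fetch on demand instead of recomputing the block.
            data = self.slow.pop(block_id)
            self.fetches += 1
            self.put(block_id, data)
            return data
        return None  # miss in both tiers: the caller would have to recompute

    def _offload_overflow(self):
        # Move least recently used blocks to the slow tier rather than dropping them.
        while len(self.fast) > self.fast_capacity:
            victim_id, victim = self.fast.popitem(last=False)
            self.slow[victim_id] = victim


cache = TieredKVCache(fast_capacity_blocks=2)
for i in range(4):
    cache.put(("session-a", i), f"kv-block-{i}")

restored = cache.get(("session-a", 0))   # offloaded earlier, fetched back
missing = cache.get(("session-b", 0))    # never stored
```

The key difference from a GPU-only cache is in `_offload_overflow`: an evicted block is demoted rather than discarded, so a later `get` costs a read instead of a recompute.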
The mechanism is a deliberate trade: instead of spending GPU floating-point operations to rebuild history, the engine spends cheap NVMe bandwidth to read it back. For a context of thousands of tokens, fetching the stored KV blocks is far cheaper than re-running prefill over the whole prompt.
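A rough estimate shows the shape of this trade. Every number below is an assumption for illustration (model size, KV bytes per token, effective GPU throughput, NVMe read bandwidth); none is a measurement from this document. The recompute estimate uses the common rule of thumb of about 2 FLOPs per parameter per token for a forward pass.

```python
def recompute_seconds(params, tokens, effective_flops_per_s):
    # Rule of thumb: a forward pass costs ~2 FLOPs per parameter per token.
    # The attention term that grows with context length is ignored, so this
    # underestimates the cost of recomputing long contexts.
    return 2 * params * tokens / effective_flops_per_s


def fetch_seconds(kv_bytes_per_token, tokens, read_bytes_per_s):
    # Time to stream the stored KV blocks back from the pool.
    return kv_bytes_per_token * tokens / read_bytes_per_s


# All values are hypothetical assumptions.
PARAMS = 8_000_000_000          # 8B-parameter model
TOKENS = 8_000                  # context to restore
KV_BYTES_PER_TOKEN = 131_072    # 128 KiB per token
EFFECTIVE_FLOPS = 200e12        # sustained FLOP/s during prefill
READ_BANDWIDTH = 10e9           # bytes/s from the NVMe pool

recompute = recompute_seconds(PARAMS, TOKENS, EFFECTIVE_FLOPS)
fetch = fetch_seconds(KV_BYTES_PER_TOKEN, TOKENS, READ_BANDWIDTH)

print(f"recompute: {recompute:.2f} s, fetch: {fetch:.2f} s")
```

With these assumed figures the fetch is several times faster than the recompute, and it leaves the GPU free for other sequences in the meantime. The ratio depends heavily on the real model, hardware, and pool bandwidth, so it should be re-derived for any specific deployment.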
A single 8×GPU inference server was driven with a realistic conversational workload while concurrency was doubled step by step from 1 to 128 clients. The defining metric is TPOT (time per output token); the operating limit is the last step before mean TPOT crosses the 50 ms service-level target. Identical hardware, identical model, identical load - the only difference is whether ADP is enabled.
Higher is better. GPU-only flattens as memory pressure rises; ADP keeps scaling.
Lower is better. GPU-only crosses the 50 ms SLA at 32 clients; ADP stays under it through 64 clients and crosses it only at 128.
| Clients | GPU req/s | ADP req/s | GPU TTFT (ms) | ADP TTFT (ms) | GPU TPOT (ms) | ADP TPOT (ms) | ADP throughput gain |
|---|---|---|---|---|---|---|---|
| 1 | 0.74 | 0.80 | 189 | 66 | 10.3 | 10.5 | +8% |
| 8 | 2.74 | 4.22 | 221 | 72 | 23.9 | 16.1 | +54% |
| 16 | 3.47 | 6.14 | 246 | 82 | 38.8 | 22.3 | +77% |
| 32 | 4.08 | 8.82 | 303 | 94 | 66.0 | 31.0 | +116% |
| 64 | 4.74 | 12.27 | 433 | 129 | 114.5 | 44.5 | +159% |
| 128 | 5.33 | 16.31 | 752 | 189 | 204.1 | 66.0 | +206% |
TTFT and TPOT in milliseconds. At 128 clients GPU-only first-token latency degrades to 752 ms; ADP holds 189 ms.
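The operating-limit rule (the last step before mean TPOT crosses 50 ms) and the gain column can be checked directly against the table. The sketch below copies the table values verbatim; the function name is illustrative, and the steps not listed in the table (2 and 4 clients) are simply absent from the data.

```python
SLA_TPOT_MS = 50.0

# (clients, gpu_req_s, adp_req_s, gpu_tpot_ms, adp_tpot_ms), copied from the table.
ROWS = [
    (1, 0.74, 0.80, 10.3, 10.5),
    (8, 2.74, 4.22, 23.9, 16.1),
    (16, 3.47, 6.14, 38.8, 22.3),
    (32, 4.08, 8.82, 66.0, 31.0),
    (64, 4.74, 12.27, 114.5, 44.5),
    (128, 5.33, 16.31, 204.1, 66.0),
]


def operating_limit(steps, sla_ms):
    """Largest client count reached before mean TPOT first exceeds the SLA."""
    limit = None
    for clients, tpot_ms in steps:
        if tpot_ms > sla_ms:
            break
        limit = clients
    return limit


gpu_limit = operating_limit([(r[0], r[3]) for r in ROWS], SLA_TPOT_MS)
adp_limit = operating_limit([(r[0], r[4]) for r in ROWS], SLA_TPOT_MS)

# Throughput gain of ADP over GPU-only, in percent, per row.
gains = [round((adp / gpu - 1) * 100) for _, gpu, adp, _, _ in ROWS]

print(f"GPU-only limit: {gpu_limit} clients, ADP limit: {adp_limit} clients")
print("throughput gain %:", gains)
```

Applied to these rows, the rule gives 16 clients for GPU-only and 64 for ADP, and the computed gains match the last column of the table.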
Nothing here is a black box. The curves above are the predictable signature of moving KV cache off the GPU.
TTFT is dominated by prefill - processing the prompt and history before the first token. GPU-only TTFT climbs from 189 ms to 752 ms as load grows, because evicted context is recomputed. ADP fetches that context from NVMe instead, so TTFT stays in the 66-189 ms band. The gap between the two is the recompute that ADP avoids.
TPOT reflects decode efficiency, which degrades as VRAM contention forces smaller effective batches. Freeing VRAM lets the scheduler keep more sequences resident, so ADP's TPOT rises slowly and crosses the 50 ms SLA only at 128 clients, versus 32 for GPU-only. The last step under the SLA moves from 16 to 64 clients - a 4× shift in the sustainable operating point.
With context held outside VRAM, more concurrent sequences fit and the GPU stays compute-utilized. Throughput therefore keeps climbing where GPU-only plateaus, reaching 16.31 vs 5.33 req/s at peak. The curve doesn't bend up by magic - it bends because the memory ceiling was removed.
During the run the cache tier was genuinely exercised: the pool grew by +92.2 GB and 703,576 context objects were written and served back. These are real I/O counters, not a modeling assumption - direct evidence that gains come from offloaded-and-reused context.
For short, stateless requests with no accumulating history, GPU-only can show higher peak throughput, because there is nothing to cache and the offload path adds slight overhead. This is exactly what the mechanism predicts: no reusable context means no benefit. ADP's advantage is specific to long, multi-turn, context-heavy workloads - the real shape of assistant and chatbot traffic. We state this plainly so reviewers can see the results are bounded and not cherry-picked.
The same signature appears in a long-dialog test on a separate model: with six turns per conversation and ~5,500 tokens of context, the advantage grows with dialog length. By the sixth turn at 72 concurrent dialogs, end-to-end response time is 40.0 s with ADP versus 68.4 s GPU-only - divergence that widens exactly as accumulated context grows, which is what KV reuse should produce.
X-axis: concurrent dialogs. Lower is better - ADP's response time is ~41% lower at 72 dialogs.
Higher is better - +68% requests/second at 72 dialogs, zero errors.
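The response-time figure for the long-dialog test follows from the two sixth-turn times quoted above. A short check, using only those two numbers:

```python
gpu_only_s = 68.4  # sixth-turn response time, GPU-only, 72 concurrent dialogs
adp_s = 40.0       # sixth-turn response time, with ADP

reduction_pct = (gpu_only_s - adp_s) / gpu_only_s * 100  # time saved per response
speedup = gpu_only_s / adp_s                             # ratio of response times

print(f"response time reduction: {reduction_pct:.1f}%")
print(f"speedup: {speedup:.2f}x")
```

The ~41% figure is a reduction in response time, equivalent to a speedup of roughly 1.7×. The +68% requests/second value is a separately measured throughput figure, so the two need not match exactly.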
To hold 64 clients under the latency SLA, bare GPUs need twice the servers. ADP serves the same workload from one server - and both sides win: engineering gets stable latency, finance gets lower cost.
| Resource | GPU-only | GPU + ADP | Saving |
|---|---|---|---|
| AI servers | 2 | 1 | 50% |
| GPUs | 16 | 8 | 8 GPUs |
| Rack space | 14U | 7U | 7U |
| Power | 20-24 kW | 10-12 kW | ≈50% |
| Cooling | 100% | 50% | ≈50% |
| Infrastructure cost | $1.0M | $0.5M | $500K |
This section gives a technical specialist the exact conditions, metric definitions and instrumentation needed to validate the claims independently.
The same workload runs on fewer GPU servers, at higher rack density, lower power draw and predictable latency at scale - a result that holds up to technical scrutiny because it follows directly from where the KV cache lives.