A single H100 benchmark says SGLang serves Llama 3.1 8B at about 16,200 tokens/sec versus vLLM at about 12,500 tokens/sec, yet that number can still lead you to the wrong engine.
The short answer: vLLM vs SGLang should be decided by workload shape because prefix reuse, structured generation, hardware coverage, and operational maturity matter more than headline throughput, as of June 20, 2026.
TL;DR: vLLM is the default open-source inference server for teams that need the broadest production path across models, hardware, observability, and Kubernetes. SGLang is the stronger first pick when the workload repeats long prefixes, especially coding agents, RAG, long tool schemas, and multi-step structured generation. TensorRT-LLM belongs in the shortlist for NVIDIA-only fleets with a platform team that can absorb a heavier build and serving stack.
vLLM vs SGLang starts with traffic shape
Most teams ask the wrong first question. They ask which LLM inference engine is fastest.
The better question is: which parts of your workload are expensive?
If you serve short chat turns with moderate context, the bottleneck is usually batching, streaming, GPU memory management, autoscaling, and operational reliability. That points toward vLLM, especially the current stable v0.22.1 line released on June 4, 2026, according to the vLLM release mirror and v0.22.1 release notes.
If you serve coding agents, RAG, or structured workflows with repeated system prompts and shared templates, prefill dominates. That points toward SGLang, whose RadixAttention and HiCache machinery are built around prefix reuse. SGLang v0.5.13.post1 shipped on June 15, 2026, according to the SGLang release mirror.
Key takeaways
- vLLM is the conservative default for a general-purpose open-source inference server.
- SGLang is the specialist for prefix-heavy workloads: RAG, coding agents, multi-turn programs, and structured generation.
- A single LLM serving benchmark is weak evidence unless it matches your prompt length, cache hit rate, concurrency, model, GPU, and latency target.
- TensorRT-LLM can win on NVIDIA throughput, but the operational tax is higher.
- llama.cpp is excellent for local and edge inference, but it is the wrong primitive for high-concurrency hosted APIs.
- The first metric to instrument is usually TTFT under realistic prefill pressure, then inter-token latency, cache hit rate, and tokens/GPU/hour.
What changed in 2026
VLLM and SGLang both matured fast enough that stale comparisons from early 2025 are actively misleading.
VLLM v0.22.0 shipped on May 29, 2026 with DeepSeek V4 support, an experimental Rust frontend, multi-tier KV cache work, and a Cutlass FP8 path reported to improve end-to-end latency by 28.9% on batch-invariant inference, according to a vLLM v0.22 summary. The v0.22.1 patch followed on June 4 with fixes for multi-node Ray data-parallel serving and DeepSeek-V4 initialization.
SGLang v0.5.13, released June 11, 2026, made Speculative Decoding V2 the default and added new model support including Nemotron 3 Ultra, Step-3.7-Flash, and Command A+, according to the v0.5.13 release notes. The post1 patch followed four days later.
The release cadence matters. VLLM averaged about 6 days and 23 hours between releases, while SGLang averaged about 5 days and 16 hours, per their release mirrors. Pin versions, test upgrade paths, and assume inference-serving behavior will change monthly.
The numbers
The cleanest public comparison in the research set puts SGLang ahead on raw throughput for one H100 workload: about 16,200 tokens/sec for SGLang versus about 12,500 tokens/sec for vLLM on Llama 3.1 8B, according to TECHSY's vLLM vs SGLang benchmark.
That result is useful, but only inside its envelope. It doesn't settle your production choice unless your prompts, cache reuse, concurrency, GPU type, quantization, model architecture, and latency SLO look similar.
MLPerf makes the same point at a bigger scale. Lambda reported 130.0k server tokens/sec and 160.4k offline tokens/sec for Llama 3.1 8B on an 8-GPU HGX B200 system in MLPerf Inference v6.0, with up to a 9% improvement over the same hardware in v5.1, according to Lambda's MLPerf v6.0 results.
That is a hardware and stack benchmark as much as an engine lesson.
The actionable benchmark is your own replay trace. Capture real prompts, output lengths, tenant mix, cacheable prefixes, and burst shape. Then compare p50 and p99 TTFT, inter-token latency, tokens/GPU/hour, cache hit rate, and error rate under rolling deploys.
Decision table: choose the engine by workload
| Workload shape | Best first choice | Why it fits | Watch out for |
|---|---|---|---|
| General chat API | vLLM | Mature OpenAI-compatible server, broad hardware support, production stack | Tune batching and KV memory before chasing microbenchmarks |
| Coding agents | SGLang | Long shared prompts and tool schemas benefit from RadixAttention | Validate operational maturity and security posture |
| RAG with repeated templates | SGLang | Prefix reuse is the core performance lever | Namespace cache by tenant and avoid private-context leakage |
| Batch extraction | vLLM | Continuous batching plus guided decoding fits offline throughput | Disable speculative decoding if completions are short and overhead wins |
| NVIDIA-only max throughput | TensorRT-LLM | Hardware-aware NVIDIA stack with Triton path | Budget for engine builds and specialist operations |
| Local, CPU, edge, Apple Silicon | llama.cpp | Simple local server and broad quantized model support | Avoid it for multi-tenant hosted APIs |
Best choice if...
Choose vLLM if your team needs the fewest surprises in a production LLM inference engine. It has an OpenAI-compatible API, Prometheus and Grafana observability, broad hardware plugins, and the vLLM Production Stack for Kubernetes-oriented deployments.
Choose SGLang if prefill cost is the center of your bill. RAG systems, coding agents, and structured generation workflows often reuse long prefixes across many sessions, and SGLang's RadixAttention documentation describes the radix-tree approach that makes this efficient.
Choose TensorRT-LLM if you run a serious NVIDIA-only fleet and can support a heavier serving stack. NVIDIA documents tensor, pipeline, expert, and context parallelism in its TensorRT-LLM parallel strategy guide, and its precision documentation covers BF16, FP16, FP8, FP4, INT8, and INT4 paths.
Why prefix caching changes the answer
Prefix caching is the difference between serving one prompt and serving a program.
A plain chatbot request might contain a short system prompt and a user message. A coding agent request can carry tool specs, repository context, policy text, examples, file snippets, and previous actions. A RAG request often repeats the same instruction template while swapping a smaller retrieved block.
SGLang's advantage appears when those prefixes overlap. RadixAttention organizes shared prompt prefixes so overlapping KV blocks can be reused across requests instead of recomputed.
VLLM has its own prefix-caching story, including automatic prefix caching and LMCache-style external KV sharing. For Kubernetes-heavy teams, that can be the better trade because the surrounding production tooling is stronger.
The practical test is simple: if cache hit rate is low, SGLang's theoretical edge may disappear. If cache hit rate is high and TTFT is painful, SGLang deserves the first serious benchmark slot.
Structured generation is a tie until workflow matters
Both engines support structured outputs.
VLLM's structured output docs list guided decoding support through backends such as xgrammar, outlines, and guidance, with JSON schema, regex, grammar, choice, and structural tag modes in the vLLM structured outputs documentation. That makes vLLM a clean fit for API-first extraction and tool-call services.
SGLang also supports structured generation patterns, and its programming model is attractive when generation is one step inside a larger language-model program. The stronger reason to choose it is usually the combination of structured generation and prefix reuse, rather than structured outputs alone.
For extraction jobs, still validate at the application layer. Guided decoding reduces malformed output. It doesn't remove the need for schema checks, retries, and poison-message handling.
Hardware and operations may decide before benchmarks do
VLLM has the broadest hardware story. The project describes support for NVIDIA GPUs, AMD GPUs, x86, ARM, PowerPC CPUs, Google TPUs, Intel Gaudi, IBM Spyre, Huawei Ascend, Rebellions NPU, Apple Silicon, MetaX GPU, and more on the vLLM GitHub repository.
SGLang also supports serious hardware breadth, including NVIDIA, AMD ROCm, TPU, Intel CPU, and Huawei Ascend paths, but teams should validate the exact model and kernel path. That advice is especially important for MoE models, FP4 or FP8 quantization, and long-context attention.
TensorRT-LLM is the opposite profile. It is NVIDIA-first by design, and that focus can be a strength if your fleet is homogeneous.
Security also belongs in the engine choice. Cloud Security Alliance reported critical SGLang CVEs in 2026, including CVE-2026-5760 and CVE-2026-7304 with CVSS 9.8 severity, in its SGLang GGUF RCE research note. That doesn't disqualify SGLang, but it does make patch discipline and model-ingest controls mandatory.
How to run your own LLM serving benchmark
Use production traces. Synthetic prompts are fine for smoke tests, then they start lying.
A useful benchmark matrix has four axes: prompt length, output length, concurrency, and prefix overlap. Run at least one short-chat case, one long-prefill case, one structured-output case, and one burst case.
Track these metrics:
| Metric | Why it matters |
|---|---|
| TTFT p50 and p99 | Captures prefill pain and user-perceived responsiveness |
| Inter-token latency | Captures decode speed after streaming begins |
| Tokens/GPU/hour | Captures cost efficiency |
| Cache hit rate | Explains whether prefix caching is actually helping |
| GPU memory pressure | Predicts OOMs and batching collapse |
| Error and retry rate | Exposes guided decoding, timeout, and scheduler problems |
Keep the first pass boring. Use the same model weights, quantization, GPU type, context window, request trace, and SLO for each engine.
Then tune. For vLLM, explore batching, KV memory, prefix caching, and production-stack routing. For SGLang, test RadixAttention hit rates, HiCache behavior, speculative decoding, and disaggregated prefill/decode for long prompts.
What this means for you
If you're building a general hosted API, start with vLLM v0.22.1 as of June 20, 2026. It is the most defensible default because the surrounding ecosystem reduces execution risk.
If you're building agent infrastructure or RAG with long repeated prompts, benchmark SGLang v0.5.13.post1 first. The more your workload looks like repeated language-model programs, the more SGLang's architecture matters.
If your CFO only sees tokens/sec charts, add cache hit rate and p99 TTFT to the dashboard. Those two numbers usually explain why the fastest engine in a blog post loses in production.
JSON-LD schema
{
"@context": "https://schema.org",
"@type": "Article",
"headline": "vLLM vs SGLang: Pick by Workload Shape",
"description": "vLLM vs SGLang comes down to workload shape: broad production serving favors vLLM, while prefix-heavy agents and RAG favor SGLang.",
"datePublished": "2026-06-20",
"dateModified": "2026-06-20",
"author": {
"@type": "Organization",
"name": "GenAlphAI"
},
"publisher": {
"@type": "Organization",
"name": "GenAlphAI"
}
}
{
"@context": "https://schema.org",
"@type": "BreadcrumbList",
"itemListElement": [
{"@type": "ListItem", "position": 1, "name": "AI Engineering"},
{"@type": "ListItem", "position": 2, "name": "Inference"},
{"@type": "ListItem", "position": 3, "name": "vLLM vs SGLang"}
]
}
FAQ: Is vLLM or SGLang faster?
SGLang can be faster when prefixes repeat heavily, and one H100 benchmark reported about 29.6% higher throughput than vLLM on Llama 3.1 8B. VLLM can still be the better production choice when operational maturity, hardware breadth, and Kubernetes tooling matter more than that benchmark's exact workload.
FAQ: Which engine should I use for RAG?
Use SGLang when many RAG requests share long templates, policy text, or repeated retrieved context. Use vLLM with LMCache when the platform requirement is production Kubernetes, cross-GPU KV sharing, and broader operational familiarity.
FAQ: Which engine is safer for production?
VLLM is the safer default for most teams because of its ecosystem, production-stack work, observability, and broad community. SGLang is production-grade for the right team, but 2026 CVE history means security updates and model-ingest controls need real ownership.
Bottom line
The vLLM vs SGLang decision is a workload-shape decision. Pick vLLM for the broad production path, pick SGLang when prefix reuse drives latency and cost, and keep TensorRT-LLM on the table for NVIDIA-only GPU inference optimization.
The number to monitor next is cache-adjusted p99 TTFT, because that is where architecture shows up in the user experience.
Sources
- vLLM releases
- vLLM v0.22.1 release notes
- vLLM GitHub repository
- vLLM structured outputs documentation
- vLLM Production Stack
- vLLM v0.22 DeepSeek, Rust frontend, and KV offload summary
- SGLang releases
- SGLang v0.5.13 release notes
- SGLang RadixAttention documentation
- TECHSY vLLM vs SGLang benchmark
- Lambda MLPerf Inference v6.0 results
- TensorRT-LLM parallel strategy documentation
- TensorRT-LLM precision documentation
- Cloud Security Alliance SGLang CVE research note
