Pick an LLM serving engine by reading vendor blogs and you'll get a different answer from each one. Run them on the same GPU, the same weights, the same load shape, and the picture narrows fast: vLLM wins on throughput-per-dollar, TensorRT-LLM wins on per-token latency, and SGLang wins on anything with a shared prefix.
The 2026 LLM serving engine benchmark that matters is tokens-per-second-per-dollar on identical hardware, with tail latency and cold start as the tiebreakers that decide most real deployments.
TL;DR. Default to vLLM for high-throughput batch workloads. Pick TensorRT-LLM only when latency is a hard business constraint and you have CUDA expertise on NVIDIA hardware. Pick SGLang for RAG, agents, and structured output where shared prefixes dominate. TGI is done, Ollama is for your laptop, and any number not measured on your hardware and your traffic is an upper bound.
Key takeaways
- vLLM v0.18+ hits 2,200+ tokens/sec on H200 running DeepSeek-V3 and has the broadest quantization support in the open ecosystem.
- TensorRT-LLM 0.12+ delivers 20-40% lower per-token latency than vLLM at batch size 1, but locks you to NVIDIA and demands weeks of setup.
- SGLang 0.4+ posts a 6.4x throughput gain on prefix-heavy workloads via RadixAttention and has crossed 400,000 deployed GPUs.
- TGI entered maintenance mode in December 2025; Hugging Face now points new deployments to vLLM or SGLang.
- Plan for 30-50% performance degradation under real production traffic versus synthetic benchmarks.
How we compared the engines (and why methodology matters)
Most public LLM serving benchmarks are not comparable. Different GPUs, different quantization, different batch sizes, different load generators. A vendor blog showing engine A at 3,000 tok/s and engine B at 2,000 tok/s usually tells you nothing useful if the two runs used different model formats or concurrency.
The numbers in this benchmark follow a same-hardware, same-model, same-load-shape discipline where the underlying sources allow it: identical NVIDIA H100 80GB or H200 141GB configurations, identical weights (typically Llama 3.1 70B, Mixtral 8x7B, or DeepSeek-V3), and consistent concurrency and input/output length distributions. Sources are labeled as Measured (independent, disclosed methodology), Reported (vendor or third-party without full disclosure), or Calculated (derived from component benchmarks).
Be honest about the ceiling. The Spheron April 2026 H100 comparison and LeetLLM's 2026 analysis provide the most controlled cross-engine data available, but several high-precision claims in the public record could not be fully reconciled with the sources they cite.
Treat every number here as an upper bound and re-run on your own hardware before committing production budget.
vLLM vs TensorRT-LLM vs SGLang: 2026 head-to-head
The consolidated picture on H100-class hardware, drawn from the Spheron H100 benchmark and the LeetLLM 2026 engine comparison:
| Engine | Throughput (H100) | Latency vs vLLM (batch=1) | 1M context | MoE | Setup complexity (1-10) | HW lock-in |
|---|---|---|---|---|---|---|
| vLLM v0.18+ | 1,800-2,400 tok/s | Baseline | Native | Wide EP | 4/10 | Low (ROCm) |
| TensorRT-LLM 0.12+ | 1,600-2,200 tok/s | 20-40% lower | Multi-block | Native | 8/10 | NVIDIA-only |
| SGLang 0.4+ | 1,400-2,000 tok/s | Competitive | Native | Expert routing | 7/10 | Low (ROCm) |
| Ollama 0.5+ | 200-800 tok/s | Higher | Limited | Basic | 1/10 | None |
Throughput ranges are for Llama 3.1 70B or equivalent. Latency at batch size 1 favors TensorRT-LLM's fused kernels. Setup complexity runs 1 (trivial) to 10 (expert-only).
The throughput-per-dollar shape is the part vendors bury. vLLM's 2,200+ tok/s on H200 with DeepSeek-V3, reported by the Berkeley Sky Computing Lab, comes with the broadest quantization menu in the open ecosystem: FP8, FP4, MXFP8, MXFP4, NVFP4, INT8, INT4, GPTQ, AWQ, and GGUF for CPU offloading.
That flexibility is what lets you actually hit the tokens-per-dollar number on your hardware, not theirs.
Which engine has the best tokens per second per dollar?
For raw throughput-per-dollar on open weights, vLLM is the answer in 2026. The v1 engine's PagedAttention 2.0 and continuous batching let batch sizes scale higher before GPU memory is exhausted, and the DeepSeek-V3 result on H200 is the cleanest high-water mark in the public record.
Pair it with FP8 quantization and target 70%+ GPU utilization and you land in the 500K-2M tokens-per-dollar range, versus roughly 70K-400K tokens-per-dollar on managed APIs at current pricing.
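The tokens-per-dollar figure is simple arithmetic once you have a throughput number and a rental price, so it is worth computing yourself rather than trusting a quoted range. The sketch below is a minimal version of that calculation. The function name and the `utilization` parameter are illustrative assumptions, and the inputs are picked from the H100 throughput and pricing ranges quoted in this article, not from a fresh measurement.

```python
def tokens_per_dollar(tokens_per_second: float,
                      gpu_dollars_per_hour: float,
                      utilization: float = 0.70) -> float:
    """Tokens generated per dollar of GPU rental.

    utilization is the fraction of each billed hour spent producing
    tokens at the quoted throughput (an assumption, not a measurement).
    """
    if gpu_dollars_per_hour <= 0:
        raise ValueError("gpu_dollars_per_hour must be positive")
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return tokens_per_second * 3600 * utilization / gpu_dollars_per_hour


if __name__ == "__main__":
    # Illustrative inputs from the ranges quoted in this article.
    for tps, price in [(1800, 3.50), (2400, 2.00)]:
        value = tokens_per_dollar(tps, price)
        print(f"{tps} tok/s at ${price:.2f}/h -> {value:,.0f} tokens per dollar")
```

Feeding in benchmark throughput gives a best case. Scale `tokens_per_second` down by whatever degradation you observe under your own traffic before comparing against managed API pricing.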
TensorRT-LLM's throughput numbers look slightly lower in cross-engine tests, but that misses the point. Its edge is latency, not throughput. NVIDIA's own benchmarks show 5x faster time-to-first-token with KV cache early reuse, and independent tests confirm 20-40% lower per-token latency than vLLM at batch size 1, per the Lyceum production benchmark.
If your product is a real-time voice agent or a trading copilot, that gap is the whole decision. If your product is a batch enrichment job, it is invisible.
SGLang's throughput-per-dollar story is workload-dependent in a way the other two are not. On generic traffic it is competitive with vLLM. On prefix-heavy traffic it pulls ahead hard: LMSYS reports a 6.4x throughput improvement on shared-prefix workloads via RadixAttention, which stores the KV cache for previously seen prefixes in a radix tree so later requests can reuse it.
RAG pipelines with a fixed system prompt, agent loops that replay tool schemas, and structured output with a repeating JSON schema are exactly where that multiplier compounds.
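To see why shared prefixes matter so much, it helps to count how many tokens actually need prefill work once a prefix is cached. The toy below is not SGLang's implementation: it is an uncompressed token-level trie (a real radix tree merges single-child chains and holds KV tensors with eviction), and the class and method names are hypothetical. It only illustrates the lookup-then-extend pattern.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class _Node:
    children: dict = field(default_factory=dict)


class ToyPrefixCache:
    """Token-level trie standing in for a radix tree of cached KV entries."""

    def __init__(self) -> None:
        self._root = _Node()

    def process(self, tokens: list[int]) -> tuple[int, int]:
        """Return (reused, computed) token counts for one request,
        then record the request so later ones can reuse its prefix."""
        node = self._root
        reused = 0
        # Walk the longest prefix that has already been seen.
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node = child
            reused += 1
        # Everything past the matched prefix needs fresh prefill work.
        for tok in tokens[reused:]:
            node.children[tok] = _Node()
            node = node.children[tok]
        return reused, len(tokens) - reused


if __name__ == "__main__":
    cache = ToyPrefixCache()
    system_prompt = list(range(100))  # stand-in for 100 shared tokens
    print(cache.process(system_prompt + [1001, 1002]))
    print(cache.process(system_prompt + [2001, 2002, 2003]))
```

In this toy, the first request computes all 102 tokens and the second reuses the 100-token system prompt and computes only its 3 new tokens. The longer the shared prefix relative to the unique suffix, the larger the saving.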
How bad is tail latency and cold start in production?
This is the metric that kills deployments that looked great in a benchmark. Peak throughput numbers are steady-state. Production is arrival-rate variance, queue depth spikes, and cold loads.
TensorRT-LLM's latency advantage is most pronounced at low batch sizes. Under high concurrent load, throughput-focused engines can deliver better p95 and p99 because they batch more aggressively and amortize kernel launch overhead. If you benchmarked only at batch=1, you would over-index on TensorRT-LLM for a service that actually runs hot all day.
Cold start is the other quiet killer. vLLM's PagedAttention is memory-efficient once warm, but initial weight loading and KV cache allocation add overhead that TensorRT-LLM's precompiled engine path partly avoids.
SGLang inherits vLLM's backend characteristics and adds its own frontend layer. Onboarding time is a separate cost from cold start: expect roughly 1-2 days of tuning before a team new to SGLang serves its first request, versus 2-4 hours for vLLM and days-to-weeks for TensorRT-LLM.
The honest mitigation is to measure on your traffic. Replay production traces, stress at 2-5x expected peak, and report p95 and p99, not averages. Averages hide the tail that users actually feel.
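A quick way to convince yourself that averages hide the tail is to summarize a latency trace both ways. The helper below is a minimal sketch using only Python's standard library. The function name is an assumption, and the sample data is synthetic, chosen to make the gap obvious rather than to represent any engine.

```python
from __future__ import annotations

import statistics


def latency_report(latencies_ms: list[float]) -> dict[str, float]:
    """Summarize request latencies with mean, p50, p95, and p99."""
    if len(latencies_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(latencies_ms),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }


if __name__ == "__main__":
    # Synthetic trace: 90 fast requests and 10 that hit a queue spike.
    trace = [100.0] * 90 + [2000.0] * 10
    for name, value in latency_report(trace).items():
        print(f"{name}: {value:.0f} ms")
```

On this synthetic trace the mean is 290 ms and the median is 100 ms, while p95 and p99 are both 2,000 ms. One user in ten waits two seconds, and neither the mean nor the median shows it.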
What about MoE serving and long context?
Mixture-of-experts serving has gone from a specialist problem to table stakes. All three primary engines now support expert parallelism with production-grade routing.
vLLM added wide expert parallelism for DeepSeek-style MoEs and has demonstrated it at NVL72 scale, which matters because DeepSeek-V3 and the April 2026 DeepSeek-V4 long-context optimization are the architectures pushing the frontier on open weights. TensorRT-LLM documents Mixtral 8x7B and DBRX optimization on H100 in its MoE DBRX post.
SGLang has full expert routing with competitive throughput per LMSYS's Llama3 serving work.
For 1M+ context, all three support it, but the fit differs. TensorRT-LLM's multi-block attention reportedly delivers 3x+ throughput on long sequences on HGX H200. SGLang's RadixAttention wins when long-context requests share prefixes, which is common in document RAG. vLLM is the flexible default when GPU memory is the binding constraint.
Quantization support comparison
Quantization is where vLLM's flexibility becomes a deployment advantage. The matrix:
| Format | vLLM | TensorRT-LLM | SGLang | Ollama |
|---|---|---|---|---|
| FP8 | Native | Native | Native | Limited |
| INT8 | Native | Via ModelOpt | Via vLLM | Yes |
| INT4 | Yes | Via ModelOpt | Via vLLM | Q4_0 |
| AWQ/GPTQ | Yes | Yes | Yes | Yes |
| GGUF | Yes | No | No | Native |
| MXFP8/MXFP4 | Yes | No | No | No |
| NVFP4 | Yes | No | No | No |
TensorRT-LLM routes quantization through NVIDIA's ModelOpt toolkit, which requires a model compilation step rather than runtime quantization. That is fine for a fixed production model and painful for a team iterating on fine-tunes.
vLLM's runtime quantization path is the reason it dominates R&D-heavy shops. The AWQ deployment guide is a good reference for the weight-only path on A100.
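Part of why the quantization menu matters is plain memory arithmetic: bits per weight decide whether a model's weights fit on the GPU at all. The sketch below estimates weight footprint only. The function name and the format table are illustrative assumptions, and the estimate deliberately ignores KV cache, activations, and the per-group scale overhead that formats such as AWQ and GPTQ add.

```python
# Nominal bits per weight for a few formats from the matrix above.
BITS_PER_WEIGHT = {"FP16": 16, "FP8": 8, "INT8": 8, "INT4": 4}


def weight_memory_gb(params_billions: float, fmt: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes).

    Weights only: KV cache, activations, and quantization metadata
    are not included, so real usage is higher.
    """
    if fmt not in BITS_PER_WEIGHT:
        raise ValueError(f"unknown format: {fmt}")
    return params_billions * BITS_PER_WEIGHT[fmt] / 8


if __name__ == "__main__":
    for fmt in ("FP16", "FP8", "INT4"):
        print(f"70B weights in {fmt}: ~{weight_memory_gb(70, fmt):.0f} GB")
```

For a 70B-parameter model this gives roughly 140 GB at FP16, 70 GB at FP8, and 35 GB at INT4. Only the quantized variants leave room on a single 80 GB card, and even FP8 leaves little headroom for KV cache.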
Decision matrix: which engine for which workload
| If your priority is | Choose | Why |
|---|---|---|
| Max throughput at lowest cost | vLLM | Best tok/sec per GPU dollar, broadest quantization |
| Lowest latency, interactive | TensorRT-LLM | 20-40% lower latency at batch=1, fused kernels |
| Prefix-heavy (RAG, agents) | SGLang | 6.4x on shared prefixes via RadixAttention |
| Structured output, tool calling | SGLang | First-class schema and tool-use optimization |
| Local dev and prototyping | Ollama | Single-command install, best local UX |
| 1M+ context | Any of the three | TRT-LLM best latency, SGLang best efficiency |
| MoE (Mixtral, DBRX, DeepSeek) | vLLM or TRT-LLM | vLLM for batch, TRT-LLM for latency |
| AMD GPU deployment | vLLM or SGLang | ROCm support; TRT-LLM unavailable |
| Small team, limited ops | vLLM | Best performance-to-simplicity ratio |
| Migrating off TGI | vLLM or SGLang | Both recommended by Hugging Face |
Self-hosting vs managed API: when does the math flip?
At current June 2026 pricing, H100 80GB rental runs roughly $2.00-3.50/hour on-demand and $1.50-2.50/hour reserved. A100 80GB is $1.00-2.00/hour. For a Llama 3.1 70B-equivalent serving 2,000 tokens per request:
| Monthly requests | Managed API cost (per month) | Self-host cost (H100, per month) | Recommendation |
|---|---|---|---|
| 100K | $250-$1,500 | $1,440-$2,520 | Managed API |
| 1M | $2,500-$15,000 | $1,440-$2,520 | Self-host |
| 10M | $25,000-$150,000 | $1,440-$2,520 | Self-host |
The self-host column is one H100 billed for a 720-hour month at the $2.00-3.50 on-demand rate. The crossover sits around 1M requests/month for open-weight models, before you account for engineering time. Add 2-4 weeks of initial deployment and 2-8 hours/month of maintenance, and the break-even moves up for small teams.
Managed APIs win when traffic is uncertain, when you need frontier models that exceed self-hosting capability, or when your engineers cost more than your GPUs. Self-hosting wins on volume, privacy, and customization.
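The crossover is easy to recompute for your own numbers. The sketch below is a rough model, not a pricing tool: the function names, the 2,000 tok/s per-GPU throughput, and the 50% production derate are assumptions drawn from ranges in this article, and the per-request managed price is simply the table's 100K row divided out ($250-$1,500 per 100K requests). It also sizes the fleet, which a single flat GPU price does not.

```python
from __future__ import annotations

import math

HOURS_PER_MONTH = 720  # 30 days; $2.00/h x 720 h = $1,440


def self_host_monthly_cost(requests_per_month: int,
                           tokens_per_request: int = 2_000,
                           tokens_per_second_per_gpu: float = 2_000.0,
                           gpu_dollars_per_hour: float = 2.00,
                           production_derate: float = 0.5) -> tuple[int, float]:
    """Return (gpus_needed, monthly_cost) for an always-on fleet.

    production_derate scales benchmark throughput down to what is
    assumed to survive real traffic (0.5 means 50% degradation).
    """
    tokens_needed = requests_per_month * tokens_per_request
    tokens_per_gpu = (tokens_per_second_per_gpu * production_derate
                      * 3600 * HOURS_PER_MONTH)
    gpus = max(1, math.ceil(tokens_needed / tokens_per_gpu))
    return gpus, gpus * gpu_dollars_per_hour * HOURS_PER_MONTH


def managed_monthly_cost(requests_per_month: int,
                         dollars_per_request: float) -> float:
    """Managed API spend at a flat per-request price."""
    return requests_per_month * dollars_per_request


if __name__ == "__main__":
    for requests in (100_000, 1_000_000, 10_000_000):
        gpus, cost = self_host_monthly_cost(requests)
        low = managed_monthly_cost(requests, 0.0025)
        high = managed_monthly_cost(requests, 0.015)
        print(f"{requests:>10,} req: self-host {gpus} GPU(s) ${cost:,.0f}, "
              f"managed ${low:,.0f}-${high:,.0f}")
```

Under these assumptions one H100 covers the 100K and 1M rows, but the 10M row needs about eight GPUs, so its self-host cost is several times the single-GPU figure. Self-hosting still comes out well below the managed range at that volume. Swap in your measured throughput and your negotiated prices before drawing conclusions.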
What this means for you
- Profile your workload before you pick an engine. Shared prefixes, output structure, batch vs interactive, and p99 latency targets each point to a different winner.
- Default to vLLM unless you have a specific reason not to. It is the best balance of throughput, quantization flexibility, and operational simplicity, and it has the largest community at roughly 44,000 GitHub stars.
- Reach for TensorRT-LLM only when latency is a hard business metric. Real-time voice, trading, and similar workloads justify the 8/10 setup complexity and NVIDIA lock-in. Most other products will not feel the 20-40% gap.
- Use SGLang for RAG and agents. If your traffic is dominated by shared system prompts and tool schemas, RadixAttention's 6.4x is not a marketing number; it is a structural advantage.
- Migrate off TGI now. Maintenance mode since December 2025 means no new optimizations. Hugging Face's own recommendation is vLLM or SGLang.
- Re-benchmark on your hardware with your traffic. Treat every public number as an upper bound and plan for 30-50% degradation in production. Open frameworks like the llm-serving-benchmark suite make same-hardware replication tractable.
Sources
- vLLM vs TensorRT-LLM vs SGLang: H100 Benchmarks (Spheron, 2026)
- Achieving Top Inference Performance with H100 and TensorRT-LLM (NVIDIA)
- vLLM project page (UC Berkeley Sky Computing Lab)
- Comparing SGLang, vLLM, and TensorRT-LLM with GPT-OSS-120B (Clarifai)
- SGLang structured generation review (Chatforest)
- LLM serving framework comparison (DeployBase)
- vLLM vs SGLang vs TensorRT-LLM vs Ollama (LeetLLM, 2026)
- vLLM vs TensorRT-LLM: 2026 production benchmarks (Lyceum)
- 5x Faster Time to First Token with TensorRT-LLM KV Cache Early Reuse (NVIDIA)
- DeepSeek-V4 in vLLM: efficient long-context attention (vLLM blog)
- Leverage MoE-based DBRX on H100 (NVIDIA)
- Faster open-source Llama3 serving with SGLang (LMSYS)
- Accelerated LLM inference on AMD Instinct GPUs with vLLM 0.9.x (AMD ROCm blog)
- AWQ quantization guide (Spheron)
- llm-serving-benchmark framework (GitHub)
