
vLLM vs TensorRT-LLM vs SGLang: The 2026 Same-Hardware Serving Benchmark

Tokens-per-second-per-dollar on identical GPUs decides more deployments than peak throughput, and tail latency plus cold start decide the rest.

June 26, 2026 · 11 min read

Pick an LLM serving engine by reading vendor blogs and you'll get a different answer from each one. Run them on the same GPU, the same weights, the same load shape, and the picture narrows fast: vLLM wins on throughput-per-dollar, TensorRT-LLM wins on per-token latency, and SGLang wins on anything with a shared prefix.

The 2026 LLM serving engine benchmark that matters is tokens-per-second-per-dollar on identical hardware, with tail latency and cold start as the tiebreakers that decide most real deployments.

TL;DR. Default to vLLM for high-throughput batch workloads. Pick TensorRT-LLM only when latency is a hard business constraint and you have CUDA expertise on NVIDIA hardware. Pick SGLang for RAG, agents, and structured output where shared prefixes dominate. TGI is done, Ollama is for your laptop, and any number not measured on your hardware and your traffic is an upper bound.

Key takeaways

  • vLLM v0.18+ hits 2,200+ tokens/sec on H200 running DeepSeek-V3 and has the broadest quantization support in the open ecosystem.
  • TensorRT-LLM 0.12+ delivers 20-40% lower per-token latency than vLLM at batch size 1, but locks you to NVIDIA and demands weeks of setup.
  • SGLang 0.4+ posts a 6.4x throughput gain on prefix-heavy workloads via RadixAttention and has crossed 400,000 deployed GPUs.
  • TGI entered maintenance mode in December 2025; Hugging Face now points new deployments to vLLM or SGLang.
  • Plan for 30-50% performance degradation under real production traffic versus synthetic benchmarks.

How we compared the engines (and why methodology matters)

Most public LLM serving benchmarks are not comparable. Different GPUs, different quantization, different batch sizes, different load generators. A vendor blog showing engine A at 3,000 tok/s and engine B at 2,000 tok/s usually tells you nothing useful if the two runs used different model formats or concurrency.

The numbers in this benchmark follow a same-hardware, same-model, same-load-shape discipline where the underlying sources allow it: identical NVIDIA H100 80GB or H200 141GB configurations, identical weights (typically Llama 3.1 70B, Mixtral 8x7B, or DeepSeek-V3), and consistent concurrency and input/output length distributions. Sources are labeled as Measured (independent, disclosed methodology), Reported (vendor or third-party without full disclosure), or Calculated (derived from component benchmarks).

Be honest about the ceiling. The Spheron April 2026 H100 comparison and LeetLLM's 2026 analysis provide the most controlled cross-engine data available, but several high-precision claims in the public record could not be fully reconciled against their cited source extracts.

Treat every number here as an upper bound and re-run on your own hardware before committing production budget.

vLLM vs TensorRT-LLM vs SGLang: 2026 head-to-head

The consolidated picture on H100-class hardware, drawn from the Spheron H100 benchmark and the LeetLLM 2026 engine comparison:

| Engine | Throughput (H100) | Latency (batch=1) | 1M context | MoE | Setup complexity | HW lock-in |
|---|---|---|---|---|---|---|
| vLLM v0.18+ | 1,800-2,400 tok/s | Baseline | Native | Wide EP | 4/10 | Low (ROCm) |
| TensorRT-LLM 0.12+ | 1,600-2,200 tok/s | 20-40% lower | Multi-block | Native | 8/10 | NVIDIA-only |
| SGLang 0.4+ | 1,400-2,000 tok/s | Competitive | Native | Expert routing | 7/10 | Low (ROCm) |
| Ollama 0.5+ | 200-800 tok/s | Higher | Limited | Basic | 1/10 | None |

Throughput ranges are for Llama 3.1 70B or equivalent. Latency at batch size 1 favors TensorRT-LLM's fused kernels. Setup complexity runs 1 (trivial) to 10 (expert-only).

The throughput-per-dollar shape is the part vendors bury. vLLM's 2,200+ tok/s on H200 with DeepSeek-V3, reported by the Berkeley Sky Computing Lab, comes with the broadest quantization menu in the open ecosystem: FP8, FP4, MXFP8, MXFP4, NVFP4, INT8, INT4, GPTQ, AWQ, and GGUF for CPU offloading.

That flexibility is what lets you actually hit the tokens-per-dollar number on your hardware, not theirs.

Figure: Reported throughput on H100/H200 (tok/s, best available). vLLM (H200, DeepSeek-V3): 2,200; vLLM (H100, Llama 70B): 2,100; TensorRT-LLM (H100): 1,900; SGLang (H100): 1,700; Ollama (H100): 500.

Which engine has the best tokens per second per dollar?

For raw throughput-per-dollar on open weights, vLLM is the answer in 2026. The v1 engine's PagedAttention 2.0 and continuous batching scale batch sizes higher before GPU memory exhausts, and the DeepSeek-V3 result on H200 is the cleanest high-water mark in the public record.

Pair it with FP8 quantization and target 70%+ GPU utilization and you land in the 500K-2M tokens-per-dollar range, versus roughly 70K-400K tokens-per-dollar on managed APIs at current pricing.
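The arithmetic behind a tokens-per-dollar figure can be sketched as below. This is a minimal illustration, not the method used by any of the cited benchmarks: the function name and parameters are hypothetical, and the inputs are figures quoted elsewhere in this article (2,100 tok/s on H100, $2.00-3.50/hour on-demand, 70% utilization, 30-50% production degradation).

```python
def tokens_per_dollar(
    tokens_per_sec: float,
    gpu_price_per_hour: float,
    utilization: float = 1.0,
    degradation: float = 0.0,
) -> float:
    """Sustained tokens generated per dollar of GPU rental.

    utilization: fraction of each hour the GPU does useful work (0-1].
    degradation: fractional throughput loss versus the benchmark [0-1).
    """
    if gpu_price_per_hour <= 0:
        raise ValueError("gpu_price_per_hour must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    if not 0 <= degradation < 1:
        raise ValueError("degradation must be in [0, 1)")
    effective_tok_per_sec = tokens_per_sec * (1 - degradation)
    tokens_per_hour = effective_tok_per_sec * 3600 * utilization
    return tokens_per_hour / gpu_price_per_hour


# Pessimistic case: expensive GPU hour, 50% production degradation.
low = tokens_per_dollar(2100, 3.50, utilization=0.70, degradation=0.50)
# Optimistic case: cheap GPU hour, 30% production degradation.
high = tokens_per_dollar(2100, 2.00, utilization=0.70, degradation=0.30)
print(f"{low:,.0f} to {high:,.0f} tokens per dollar")
```

With these inputs the sketch lands at roughly 756K to 1.85M tokens per dollar, inside the 500K-2M range quoted above. Swap in your own measured throughput and your own GPU price before drawing conclusions.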

TensorRT-LLM's throughput numbers look slightly lower in cross-engine tests, but that misses the point. Its edge is latency, not throughput. NVIDIA's own benchmarks show 5x faster time-to-first-token with KV cache early reuse, and independent tests confirm 20-40% lower per-token latency than vLLM at batch size 1, per the Lyceum production benchmark.

If your product is a real-time voice agent or a trading copilot, that gap is the whole decision. If your product is a batch enrichment job, it is invisible.

SGLang's throughput-per-dollar story is workload-dependent in a way the other two are not. On generic traffic it lands competitive with vLLM. On prefix-heavy traffic it pulls ahead hard: LMSYS reports a 6.4x throughput improvement on shared-prefix workloads via RadixAttention, which caches prefix KV cache in a radix tree.

RAG pipelines with a fixed system prompt, agent loops that replay tool schemas, and structured output with a repeating JSON schema are exactly where that multiplier compounds.
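To see why a shared prefix compounds, here is a toy sketch of prefix reuse. It is not SGLang's RadixAttention implementation: it uses a plain token-level trie (a radix tree would compress runs of tokens into single edges), it stores no actual KV tensors, and the class and function names are hypothetical. It only counts how many tokens would need prefill with and without a prefix cache.

```python
class PrefixCache:
    """Toy token-level trie standing in for a radix tree of cached KV blocks."""

    def __init__(self) -> None:
        self.root: dict = {}

    def match_len(self, tokens: list[int]) -> int:
        """Length of the longest cached prefix of `tokens`."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched

    def insert(self, tokens: list[int]) -> None:
        """Record `tokens` as cached so later requests can reuse the prefix."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})


def prefill_tokens_needed(cache: PrefixCache, requests: list[list[int]]) -> int:
    """Tokens that must be prefilled when cached prefixes are skipped."""
    total = 0
    for req in requests:
        total += len(req) - cache.match_len(req)
        cache.insert(req)
    return total


# Ten requests sharing a 100-token system prompt, each with a 2-token suffix.
system_prompt = list(range(100))
requests = [system_prompt + [1000 + i, 2000 + i] for i in range(10)]

without_cache = sum(len(r) for r in requests)
with_cache = prefill_tokens_needed(PrefixCache(), requests)
print(without_cache, with_cache)
```

In this toy case the cached run prefills 120 tokens instead of 1,020. Real gains depend on eviction policy, memory pressure, and how much of the request is decode rather than prefill, so treat the ratio as illustrative only.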

How bad is tail latency and cold start in production?

This is the metric that kills deployments that looked great in a benchmark. Peak throughput numbers are steady-state. Production is arrival-rate variance, queue depth spikes, and cold loads.

TensorRT-LLM's latency advantage is most pronounced at low batch sizes. Under high concurrent load, throughput-focused engines can deliver better p95 and p99 because they batch more aggressively and amortize kernel launch overhead. If you benchmarked only at batch=1, you would over-index on TensorRT-LLM for a service that actually runs hot all day.

Cold start is the other quiet killer. vLLM's PagedAttention is memory-efficient once warm, but initial weight loading and KV cache allocation add overhead that TensorRT-LLM's precompiled engine path partly avoids.

SGLang inherits vLLM's backend characteristics and adds its own frontend layer. Setup time is a separate cost from process cold start: expect roughly 1-2 days of tuning before a team new to SGLang serves its first request, versus 2-4 hours for vLLM and days to weeks for TensorRT-LLM.

The honest mitigation is to measure on your traffic. Replay production traces, stress at 2-5x expected peak, and report p95 and p99, not averages. Averages hide the tail that users actually feel.
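A small sketch of why averages mislead, using a nearest-rank percentile. The latency samples are synthetic and chosen only for illustration; in practice they would come from a replayed production trace.

```python
import math


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= p <= 100:
        raise ValueError("p must be in 1..100")
    ordered = sorted(samples)
    # Multiply before dividing so the rank stays exact for integer p.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Synthetic end-to-end latencies in milliseconds:
# 90 fast requests, 7 queued behind a burst, 3 hitting a cold load.
latencies_ms = [100.0] * 90 + [400.0] * 7 + [2000.0] * 3

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
print(f"mean={mean_ms} p50={p50} p95={p95} p99={p99}")
```

Here the mean is 178 ms, which looks healthy, while p95 is 400 ms and p99 is 2,000 ms. The three slowest requests barely move the average but define what the unluckiest users experience.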

What about MoE serving and long context?

Mixture-of-experts serving has gone from a specialist problem to table stakes. All three primary engines now support expert parallelism with production-grade routing.

vLLM added wide expert parallelism for DeepSeek-style MoEs and has demonstrated it at NVL72 scale, which matters because DeepSeek-V3 and the April 2026 DeepSeek-V4 long-context optimization are the architectures pushing the frontier on open weights. TensorRT-LLM documents Mixtral 8x7B and DBRX optimization on H100 in its MoE DBRX post.

SGLang has full expert routing with competitive throughput per LMSYS's Llama3 serving work.

For 1M+ context, all three support it, but the fit differs. TensorRT-LLM's multi-block attention reportedly delivers 3x+ throughput on long sequences on HGX H200. SGLang's RadixAttention wins when long-context requests share prefixes, which is common in document RAG. vLLM is the flexible default when memory is the binding constraint.

Quantization support comparison

Quantization is where vLLM's flexibility becomes a deployment advantage. The matrix:

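| Format | vLLM | TensorRT-LLM | SGLang | Ollama |
|---|---|---|---|---|
| FP8 | Native | Native | Native | Limited |
| INT8 | Native | Via ModelOpt | Via vLLM | Yes |
| INT4 | Yes | Via ModelOpt | Via vLLM | Q4_0 |
| AWQ/GPTQ | Yes | Yes | Yes | Yes |
| GGUF | Yes | No | No | Native |
| MXFP8/MXFP4 | Yes | No | No | No |
| NVFP4 | Yes | No | No | No |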

TensorRT-LLM routes quantization through NVIDIA's ModelOpt toolkit, which requires a model compilation step rather than runtime quantization. That is fine for a fixed production model and painful for a team iterating on fine-tunes.

vLLM's runtime quantization path is the reason it dominates R&D-heavy shops. The AWQ deployment guide is a good reference for the weight-only path on A100.

Decision matrix: which engine for which workload

| If your priority is | Choose | Why |
|---|---|---|
| Max throughput at lowest cost | vLLM | Best tok/sec per GPU dollar, broadest quantization |
| Lowest latency, interactive | TensorRT-LLM | 20-40% lower latency at batch=1, fused kernels |
| Prefix-heavy (RAG, agents) | SGLang | 6.4x on shared prefixes via RadixAttention |
| Structured output, tool calling | SGLang | First-class schema and tool-use optimization |
| Local dev and prototyping | Ollama | Single-command install, best local UX |
| 1M+ context | Any of the three | TRT-LLM best latency, SGLang best efficiency |
| MoE (Mixtral, DBRX, DeepSeek) | vLLM or TRT-LLM | vLLM for batch, TRT-LLM for latency |
| AMD GPU deployment | vLLM or SGLang | ROCm support; TRT-LLM unavailable |
| Small team, limited ops | vLLM | Best performance-to-simplicity ratio |
| Migrating off TGI | vLLM or SGLang | Both recommended by Hugging Face |

Self-hosting vs managed API: when does the math flip?

At current June 2026 pricing, H100 80GB rental runs roughly $2.00-3.50/hour on-demand and $1.50-2.50/hour reserved. A100 80GB is $1.00-2.00/hour. For a Llama 3.1 70B-equivalent serving 2,000 tokens per request:

| Monthly requests | Managed API cost | Self-host (H100) | Recommendation |
|---|---|---|---|
| 100K | $250-$1,500 | $1,440-$2,520 | Managed API |
| 1M | $2,500-$15,000 | $1,440-$2,520 | Self-host |
| 10M | $25,000-$150,000 | $1,440-$2,520 | Self-host |

The crossover sits around 1M requests/month for open-weight models, before you account for engineering time. Add 2-4 weeks of initial deployment and 2-8 hours/month of maintenance, and the break-even moves up for small teams.
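The comparison can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions, not the article's original calculation: 720 hours per month is inferred from $2.00/hour matching the $1,440 figure, the $1.25 and $7.50 per million tokens are back-calculated from the managed API column at 2,000 tokens per request, and 2,100 tok/s per GPU is the H100 figure quoted earlier. It also assumes throughput scales linearly when GPUs are added, which real deployments may not achieve.

```python
import math

HOURS_PER_MONTH = 720  # assumption: 30 days, inferred from $2.00/h -> $1,440


def managed_monthly_cost(
    requests: int, tokens_per_request: int, price_per_million_tokens: float
) -> float:
    """Monthly managed API bill for a given request volume."""
    return requests * tokens_per_request / 1e6 * price_per_million_tokens


def self_host_monthly_cost(
    requests: int,
    tokens_per_request: int,
    tokens_per_sec_per_gpu: float,
    gpu_price_per_hour: float,
) -> tuple[int, float]:
    """Return (GPUs needed, monthly rental cost) for always-on GPUs."""
    monthly_tokens = requests * tokens_per_request
    gpu_capacity = tokens_per_sec_per_gpu * 3600 * HOURS_PER_MONTH
    gpus = max(1, math.ceil(monthly_tokens / gpu_capacity))
    return gpus, gpus * gpu_price_per_hour * HOURS_PER_MONTH


for monthly_requests in (100_000, 1_000_000, 10_000_000):
    api_low = managed_monthly_cost(monthly_requests, 2000, 1.25)
    api_high = managed_monthly_cost(monthly_requests, 2000, 7.50)
    gpus, host_low = self_host_monthly_cost(monthly_requests, 2000, 2100, 2.00)
    _, host_high = self_host_monthly_cost(monthly_requests, 2000, 2100, 3.50)
    print(
        f"{monthly_requests:>10,} req: API ${api_low:,.0f}-${api_high:,.0f}, "
        f"self-host {gpus} GPU(s) ${host_low:,.0f}-${host_high:,.0f}"
    )
```

Under these assumptions a single GPU tops out near 5.4 billion tokens per month, so the 10M-request row (20 billion tokens) would need about four GPUs, or roughly $5,760-$10,080 rather than the single-GPU figure. The recommendation does not change at that volume, but the self-host cost is not flat once traffic exceeds one GPU's capacity.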

Managed APIs win when traffic is uncertain, when you need frontier models that exceed self-hosting capability, or when your engineers cost more than your GPUs. Self-hosting wins on volume, privacy, and customization.

What this means for you

  1. Profile your workload before you pick an engine. Shared prefixes, output structure, batch vs interactive, and p99 latency targets each point to a different winner.
  2. Default to vLLM unless you have a specific reason not to. It is the best balance of throughput, quantization flexibility, and operational simplicity, and it has the largest community at roughly 44,000 GitHub stars.
  3. Reach for TensorRT-LLM only when latency is a hard business metric. Real-time voice, trading, and similar workloads justify the 8/10 setup complexity and NVIDIA lock-in. Most other products will not feel the 20-40% gap.
  4. Use SGLang for RAG and agents. If your traffic is dominated by shared system prompts and tool schemas, RadixAttention's 6.4x is not a marketing number; it is a structural advantage.
  5. Migrate off TGI now. Maintenance mode since December 2025 means no new optimizations. Hugging Face's own recommendation is vLLM or SGLang.
  6. Re-benchmark on your hardware with your traffic. Treat every public number as an upper bound and plan for 30-50% degradation in production. Open frameworks like the llm-serving-benchmark suite make same-hardware replication tractable.

Frequently asked questions

Which LLM serving engine has the best tokens per second per dollar in 2026?

vLLM offers the best throughput-per-dollar for high-throughput batch workloads, with the v1 engine reaching 2,200+ tokens/sec on H200-class hardware and the broadest quantization format support. TensorRT-LLM wins on per-token latency, not throughput-per-dollar, and locks you to NVIDIA hardware.

Is TensorRT-LLM faster than vLLM?

Yes, for latency. Independent tests put TensorRT-LLM at 20-40% lower per-token latency than vLLM at batch size 1, and NVIDIA reports 5x faster time-to-first-token with KV cache early reuse. The trade-off is significantly higher setup complexity and NVIDIA-only hardware lock-in.

When should you choose SGLang over vLLM?

Choose SGLang for prefix-heavy workloads like RAG and agents, and for structured output (JSON schema, tool calling). Its RadixAttention caches shared prefixes and delivers up to 6.4x throughput improvement on those workloads. For generic high-throughput batch serving, vLLM remains the default.

Is Hugging Face TGI still supported for production LLM serving?

No. TGI entered maintenance mode in December 2025, and Hugging Face now recommends vLLM or SGLang for new production deployments. Existing TGI deployments should plan a migration.

How do you benchmark LLM serving engines fairly?

Use same-hardware, same-model, same-load-shape methodology: identical GPUs, identical weights, and consistent concurrency and input/output length distributions. Measure TTFT, inter-token latency, and end-to-end latency at p50/p95/p99, then divide tokens/sec by on-demand GPU price to get tokens-per-second-per-dollar.
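The per-request metrics named above can be derived from token arrival timestamps. The sketch below assumes a load generator that records the send time and the arrival time of each streamed token in integer milliseconds; the function name and the dictionary keys are hypothetical, not taken from any specific benchmarking tool.

```python
def request_metrics(sent_at_ms: int, token_times_ms: list[int]) -> dict[str, float]:
    """Compute TTFT, mean inter-token latency, and end-to-end latency.

    sent_at_ms: when the request was sent.
    token_times_ms: arrival time of each streamed token, in order.
    """
    if not token_times_ms:
        raise ValueError("request produced no tokens")
    ttft = token_times_ms[0] - sent_at_ms
    e2e = token_times_ms[-1] - sent_at_ms
    # Gaps between consecutive tokens; empty for single-token responses.
    gaps = [b - a for a, b in zip(token_times_ms, token_times_ms[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return {"ttft_ms": float(ttft), "itl_ms": itl, "e2e_ms": float(e2e)}


# One request: first token after 500 ms, then a token every 100 ms.
metrics = request_metrics(0, [500, 600, 700, 800])
print(metrics)
```

Collect one such record per request under each engine, then report each field at p50, p95, and p99 across the run so the three engines are compared on the same distribution rather than on a single average.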