What is the binding constraint for on-device LLM inference in 2026?

Memory bandwidth, not raw NPU TOPS. Even 80-TOPS chips stall when streaming model weights from RAM, and real throughput often falls 30-50% below theoretical peak once bandwidth saturates.

How much accuracy do you lose with INT4 quantization on local AI models?

Q4_K_M quantization typically adds 1.5-1.9% perplexity over FP16, imperceptible for summarization and chat but risky for code generation and precise numerical reasoning, where INT8 is safer.

Which on-device AI runtime should you use for cross-platform deployment?

ONNX Runtime offers the broadest hardware coverage, while llama.cpp and Ollama suit developer workstations. Apple MLX beats llama.cpp by 20-30% on Apple Silicon but locks you to that platform.

When does on-device AI beat cloud inference on cost?

For low-volume apps under roughly 1 million tokens per user per month, on-device inference usually wins on TCO after 6-18 months, assuming the AI-capable hardware is already in users' hands.

How do hybrid AI architectures route requests between device and cloud?

They classify requests by sensitivity, complexity, latency budget, and connectivity. PII and PHI stay local; complex reasoning and fresh-knowledge queries go to cloud, with transparent fallback when the network drops.

On-Device AI's Real Bottleneck Isn't the Chip. It's the Memory

On-Device AI Infrastructure: Why Memory Bandwidth, Not TOPS, Decides What Ships

Qualcomm's Snapdragon X2 Elite hit 80 TOPS in September 2025, nearly double the prior generation. Apple's A18 Pro Neural Engine delivers 35 TOPS, AMD's Ryzen AI 300 reaches 50, and Intel's Panther Lake is targeting 50 for H2 2026. By the headline metric, on-device AI infrastructure has arrived.

It hasn't, at least not the way the spec sheets imply. The binding constraint on local LLM inference in 2026 is memory bandwidth, and real-world throughput routinely lands 30 to 50 percent below theoretical peak TOPS once you saturate the path between RAM and the NPU.

Battery, thermals, privacy routing, and developer ergonomics compound the gap. If you're planning an on-device AI deployment, the chip is the easy part.

On-device AI infrastructure is the full stack of hardware (NPU, memory subsystem, power envelope), runtimes (Core ML, LiteRT, Windows ML, llama.cpp), quantized model formats, and routing logic that lets a model run locally on a phone, PC, or edge device with acceptable latency, accuracy, and energy cost. Picking a chip by TOPS alone is the most common and most expensive mistake teams make here.

TL;DR

Memory bandwidth, not peak NPU throughput, governs real on-device LLM performance in 2026. INT4 quantization is now production-viable at roughly 1.5 to 1.9 percent perplexity cost. Hybrid routing, sensitive data local, complex reasoning to cloud, is the dominant practical architecture.

Open-source runtimes (llama.cpp, MLX, Ollama) have closed most of the gap to vendor stacks, but platform lock-in still shapes your model choices. Plan against bandwidth, battery, and thermals, not spec-sheet TOPS.

Key takeaways

Memory bandwidth is the ceiling: Apple's M5 Max tops out at 153 GB/s, Qualcomm's X2 Elite at 228 GB/s, and LPDDR6 won't be widespread in flagships until late 2026.
INT4 (Q4_K_M) is the production sweet spot, with 2 to 3x throughput over FP16 and acceptable quality loss for most tasks.
Thermal throttling is brutal on phones: the iPhone 16 Pro loses about 50 percent of peak inference speed under sustained load.
Hybrid routing beats pure local or pure cloud for most apps, and it's where the enterprise money is going.
ONNX is becoming the neutral interchange format, and OpenAI-compatible APIs are the de facto interface across Ollama, Foundry, and AI Edge.

Why Raw TOPS Mislead You About Edge AI Hardware

Chip vendors quote peak TOPS under ideal conditions: small batch, simple ops, sustained load, no thermal pressure. Real inference of a 7B parameter model looks nothing like that.

A 7B model at FP16 needs roughly 14 GB just to hold weights, and generation requires streaming those weights through memory on every token. If your memory subsystem can't feed the NPU fast enough, the NPU starves and your throughput collapses regardless of its peak rating.

The data bears this out. Measured throughput for 7B models on a Snapdragon X1E lands at 10 to 15 tokens per second at FP16, climbing to 35 to 45 at INT4.

An NVIDIA RTX 4090, with over 1,000 GB/s of memory bandwidth, hits 120 to 180 tokens per second at INT4. The TOPS gap between those two parts is large, but the bandwidth gap is larger, and the bandwidth gap is what shows up in your app.

Measured 7B INT4 inference throughput (tokens/sec)

Note the Apple M5 Max figure: 18 to 25 tokens per second for a 70B Q4 model. That's a much bigger model running slower, and it's only possible because of 128 GB of unified memory. Unified memory is Apple's structural advantage here, but even 153 GB/s is a hard ceiling for larger work.

What's the Real Memory Bandwidth Constraint for Local AI Models?

The math is unforgiving. To generate one token from a 7B INT4 model, you stream roughly 3.5 GB of weights through memory. At 50 GB/s of usable bandwidth, that's a theoretical floor of about 14 tokens per second before any compute overhead.

Add activations, KV cache, and OS contention, and you're below the NPU's peak before you've warmed up.

LPDDR6 (JEDEC JESD209-6, finalized July 2025) pushes effective bandwidth to 28.5 to 38.4 GB/s per subchannel at 10,667 to 14,400 MT/s, and early chips draw 20 to 21 percent less power than LPDDR5X. But as of June 2026, LPDDR6 is barely shipping in consumer devices.

LPDDR5X at 136 to 228 GB/s remains the dominant standard, which means most of the bandwidth gains you'd want for on-device LLMs are still a product cycle away.

The practical takeaway: when you evaluate a platform, look at memory bandwidth first, NPU TOPS second. A 50-TOPS NPU fed by 136 GB/s will underperform a 45-TOPS NPU fed by 228 GB/s on any model that doesn't fit in on-chip cache.

Model size limits by device class

Device class	Typical RAM	FP16 max	INT8 max	INT4 max
Smartphone (8 GB)	8 GB	~4B	~7B	~13B
Smartphone (12 GB)	12 GB	~7B	~13B	~30B
Copilot+ PC (16 GB)	16 GB	~9B	~16B	~35B
Copilot+ PC (32 GB)	32 GB	~18B	~35B	~70B
Apple M5 Mac (36 GB)	36 GB	~20B	~40B	~80B
Apple M5 Max (128 GB)	128 GB	~70B	~140B	~300B

These assume 30 to 50 percent overhead for activations and runtime buffers. Real achievable sizes depend on OS memory pressure and what else the device is running.

How Do You Quantize Local AI Models Without Killing Accuracy?

Quantization has matured fast. Q4_K_M, the production standard from the llama.cpp K-quant family, typically costs 1.5 to 1.9 percent perplexity over FP16. That's imperceptible for summarization, chat, and most general language tasks. It's risky for code generation and precise numerical reasoning, where a single bad token breaks the output.

Quantization	Perplexity degradation	Acceptable for
FP16	0% (baseline)	Maximum accuracy
INT8 (per-channel)	+0.5 to 1.0%	Most production apps
INT8 (per-tensor)	+1.0 to 2.0%	Many apps
INT4 (Q4_K_M)	+1.5 to 1.9%	Production standard
INT4 (Q4_K_S)	+2.0 to 3.0%	Non-critical apps
INT4 (I-quant)	+1.8 to 2.5%	Variable by model

Method selection should follow your target hardware. AWQ (activation-aware weight quantization) beats GPTQ by 0.5 to 1.0 percent at equivalent bit-widths and is the preferred choice for GPU edge deployment.

GGUF with K-quant is the right call for CPU inference on x86 or ARM. Apple Silicon should use MLX's native quantization or GGUF via llama.cpp. Qualcomm Hexagon wants INT4 compiled through AI Hub, and Microsoft's DirectML path expects INT4 ONNX, which is how Phi-4-multimodal ships.

The durable rule: pick the quantization method that matches your runtime and hardware, then benchmark on your actual task before you commit. Don't trust vendor perplexity averages for a code-generation workload.

Battery, Thermals, and the Mobile AI Inference Wall

Phones throttle hard. The iPhone 16 Pro loses about 50 percent of its peak inference speed under sustained thermal load because it has to keep skin temperature tolerable.

A flagship Android part like the Snapdragon 8 Elite behaves similarly. Sustained NPU power on a phone sits below 2 watts, and that's the budget you actually plan against, not the peak.

Energy per token is the metric that matters for mobile. Measured estimates for 7B INT4 generation:

Device	Wh per 1,000 tokens
iPhone 16 Pro	0.05 to 0.15
Android flagship (Snapdragon 8 Elite)	0.08 to 0.20
Copilot+ PC (battery)	0.3 to 0.8
Apple MacBook Air (battery)	0.15 to 0.40

A 4,000 mAh phone battery holds roughly 15 Wh. Generating 1,000 tokens costs 0.1 to 0.2 Wh, about 1 percent of capacity. That's tolerable for occasional use and brutal for a always-on assistant that streams tokens continuously.

If your feature runs inference in the background on every notification, you will drain batteries and you will get uninstalled.

Thermals and battery together push most serious mobile AI toward short, bursty inference: classify, extract, summarize, then stop. Long-form generation belongs in the cloud unless the user explicitly asked for offline.

Privacy Routing and the Case for Hybrid AI Architectures

Pure local and pure cloud are the easy extremes. The architecture most production teams actually ship is hybrid: route by sensitivity, complexity, and connectivity.

Apple's Private Cloud Compute is the canonical example. When the on-device Foundation Model (roughly 3B parameters, ANE-only) can't handle a request, it routes to PCC with cryptographic verification that no logs are retained. Sensitive data stays local; hard reasoning goes to cloud with auditable guarantees.

Google's Android AICore takes a different shape. Gemini Nano (1B and 3B variants, 4-bit by default) runs in the Android security sandbox, encrypted at rest, decrypted only inside the secure runtime. The Google AI Edge SDK exposes capability checks and scoped permissions that enterprise MDM can audit.

Microsoft's Foundry Local targets the air-gapped enterprise case: workloads can be constrained to never reach a network endpoint, enabling fully offline RAG on Windows hardware. Phi-4-mini-flash-reasoning (3.8B, March 2025) uses a SambaY/GMU architecture with speculative decoding to deliver roughly 10x throughput and 2 to 3x lower latency than the base model, which is what makes offline RAG feel responsive.

Routing criteria for hybrid architectures

Sensitivity: PII, PHI, attorney-client data, biometrics stay local.
Complexity: Simple classification and extraction local; multi-step reasoning to cloud.
Connectivity: Local when offline, cloud when available, transparent fallback between.
Latency budget: Sub-100ms responses need local; cloud when the user can wait.

Hybrid is the most complex to build because you need both stacks, routing logic, and a consistent API surface regardless of where inference ran. It's also the architecture that survives the most real-world conditions.

Which On-Device AI Runtime Should You Pick?

The open-source ecosystem has closed the gap to vendor stacks, and for cross-platform work it's often the better choice.

llama.cpp (about 118,000 GitHub stars as of June 2026, latest stable b9721) remains the most widely used inference engine for GGUF-quantized models, with backends for Metal, CUDA, Vulkan, OpenCL, and AVX512. On an M1 Max it hits 25 to 35 tokens per second for a 7B Q4 model.

It's the safe default for heterogeneous hardware.

Apple MLX (about 27,200 stars) is Apple's official framework and beats llama.cpp by 20 to 30 percent on equivalent Apple Silicon through tighter Neural Engine integration. The trade-off is platform exclusivity. If you're Mac-only, MLX is the right call; if you need to ship to Windows and Android too, it isn't.

Ollama (about 175,000 stars, v0.30.10 in June 2026) sits above the inference engine and exposes an OpenAI-compatible REST API across macOS, Windows, and Linux. It's ideal for developer workstations and local servers, less so for mobile where startup latency and footprint matter.

For mobile, Meta's ExecuTorch is the PyTorch-native path to iOS and Android with Qualcomm and MediaTek NPU backends. For Windows, ONNX Runtime with the QNN execution provider is the cross-vendor route, compiling ONNX to Hexagon-optimized executables via Qualcomm's AI Hub.

Runtime	Best for	Limitation
llama.cpp	Cross-platform CPU/GPU	Manual optimization
MLX	Apple Silicon throughput	Apple-only
Ollama	Dev workstations, prototyping	Not mobile-friendly
ExecuTorch	Mobile, PyTorch models	Smaller ecosystem
ONNX Runtime	Windows, cross-vendor	NPU path needs QNN
Core ML	iOS/macOS production	Apple-only

The convergence to watch: ONNX as the neutral interchange format and OpenAI-compatible chat completion APIs as the de facto interface. Both reduce lock-in, and both are accelerating.

Cloud Still Wins on Quality and Iteration Speed

On-device models in the 3 to 7B range are good enough for classification, summarization, and short generation. They are not competitive with frontier cloud models on complex reasoning, multi-step problem solving, or tasks needing current knowledge. That gap is real and won't close this product cycle.

Cloud also wins on operability. Request logging, A/B testing, prompt iteration, and quality monitoring all happen server-side in minutes. On-device updates require an app or OS update, with adoption curves measured in weeks to months. If your product depends on fast iteration or detailed observability, cloud-first is still the right default.

Cost flips the script at scale. Cloud inference at 2026 pricing runs $0.075 to $0.60 per million tokens for cheap tiers (Gemini 2.0 Flash, GPT-4o-mini) up to several dollars for premium output.

On-device inference has a one-time hardware premium of $100 to $500 per user plus engineering cost, then near-zero marginal cost. For apps under roughly 1 million tokens per user per month, on-device usually wins on TCO after 6 to 18 months, assuming the AI-capable hardware is already in users' hands.

What This Means for You

Plan against bandwidth, not TOPS. When evaluating a platform, pull memory bandwidth first. A 50-TOPS NPU on 136 GB/s will lose to a 45-TOPS NPU on 228 GB/s for any model that spills out of on-chip cache.
Default to INT4 with a benchmark gate. Ship Q4_K_M for general tasks, but run your actual evaluation set before committing. Switch to INT8 for code generation, math, or anything where one bad token fails the output.
Design hybrid routing from day one. Classify requests by sensitivity, complexity, and connectivity. Sensitive and simple stay local; complex and fresh-knowledge go to cloud. Build the routing layer before you build the model integrations.
Budget for thermals on mobile. Sustained phone inference throttles to roughly half of peak. Keep mobile workloads short and bursty, and push long generation to cloud or PC.
Pick the runtime that matches your platform matrix. MLX for Apple-only, llama.cpp or Ollama for cross-platform dev, ExecuTorch for mobile, ONNX Runtime with QNN for Windows NPU. Don't fight the platform; use its native path.
Track LPDDR6 and the X2 Elite wave. Bandwidth gains from LPDDR6 and 80-TOPS-class NPUs arriving through 2026 will roughly double what's practical on-device. Design your model registry so you can swap in larger models as the hardware catches up.

The teams that win on on-device AI in 2026 are the ones who stopped reading spec sheets and started measuring memory bandwidth, battery drain, and thermal curves. The chip is table stakes. The system around it is the moat.

On-Device AI Infrastructure: Why Memory Bandwidth, Not TOPS, Decides What Ships

On-Device AI Infrastructure: Why Memory Bandwidth, Not TOPS, Decides What Ships

TL;DR

Key takeaways

Why Raw TOPS Mislead You About Edge AI Hardware

What's the Real Memory Bandwidth Constraint for Local AI Models?

Model size limits by device class

How Do You Quantize Local AI Models Without Killing Accuracy?

Battery, Thermals, and the Mobile AI Inference Wall

Privacy Routing and the Case for Hybrid AI Architectures

Routing criteria for hybrid architectures

Which On-Device AI Runtime Should You Pick?

Cloud Still Wins on Quality and Iteration Speed

What This Means for You

Sources

Frequently asked questions