How much VRAM do I need to fine-tune a 7B model with QLoRA?

Roughly 6, 9 GB for standard QLoRA and about 5, 6 GB with Unsloth QLoRA, per the QLoRA paper and independent benchmarks. Standard LoRA in FP16 needs about 14 GB for the same 7B model. A 24 GB RTX 4090 handles all three comfortably.

Is Unsloth actually faster than standard LoRA?

Yes. Independent benchmarks measured 1.87, 2.74x faster training versus a HuggingFace TRL baseline, with 22, 66% less VRAM, and Unsloth's own figure is 2x. The gains are largest on 7B, 33B models on a single consumer GPU.

What is the difference between LoRA and QLoRA?

LoRA keeps base weights in full precision (FP16/BF16); QLoRA quantizes them to 4-bit NormalFloat so the frozen base fits in far less memory. QLoRA enables bigger models on smaller cards but runs about 39% slower than LoRA and can cost 1, 3% on benchmarks.

Can I fine-tune a 70B model on a single RTX 4090?

Yes, with QLoRA or Unsloth QLoRA on a 24 GB RTX 4090, using batch_size=1 and heavy gradient accumulation. Expect 10, 15 hours per epoch on 10B tokens. Unsloth QLoRA needs roughly 40 GB for 70B, so on a 24 GB card you rely on aggressive offloading and paged optimizers.

Should I fine-tune or use RAG?

Fine-tuning is for behavior, format, and style; RAG is for knowledge that changes or needs citations. RAG with a vector DB runs about $5, 20/month versus $50, 200 to fine-tune a 7B model on 10B tokens, so for factual, fast-changing data RAG is usually 5, 10x cheaper.

Unsloth Trains 2.7x Faster: QLoRA vs LoRA on One GPU

Fine-tuning a 65-billion-parameter model used to mean renting a cluster. The 2023 QLoRA paper by Dettmers et al. collapsed that requirement to a single 48 GB GPU by freezing the base model in 4-bit and training small adapters on top.

Three years later the question isn't whether you can fine-tune open models on one GPU. It's which path costs you the least time, VRAM, and money.

So we lined up the three that matter for single-GPU work: LoRA, QLoRA, and Unsloth. On identical hardware, the fastest path trains 1.87, 2.74x quicker than a stock HuggingFace baseline while using up to 66% less VRAM, according to independent benchmarks. That gap decides your cloud bill.

TL;DR

For single-GPU fine-tuning as of July 2026, Unsloth wins on speed and VRAM (1.87, 2.74x faster, up to 66% less memory than stock HuggingFace TRL, per AIFoss benchmarks), and it runs on top of either LoRA or QLoRA. Use plain LoRA when VRAM is plentiful and you want maximum quality; use QLoRA when you need to fit a 33B, 70B model on a 24, 48 GB card.

Unsloth accelerates both, so the real choice is LoRA vs QLoRA for your model size, then Unsloth to make it fast.

What's the difference between LoRA, QLoRA, and Unsloth?

Short version: LoRA and QLoRA are methods; Unsloth is an implementation that speeds both up.

LoRA (Hu et al., 2021) freezes the pretrained weights and trains two small low-rank matrices per target layer. For GPT-3 175B it cut trainable parameters by 10,000x and GPU memory by 3x, while keeping the base weights in FP16/BF16.

QLoRA goes further. It quantizes the frozen base model to 4-bit NormalFloat (NF4), adds double quantization to shrink the quantization constants, and uses paged optimizers to survive memory spikes. The adapters still train in full precision; only the frozen base drops to 4-bit. That's how you fit a 65B model in 48 GB instead of the ~780 GB full 16-bit training would need.

Unsloth (version 2026.6.9, released June 22, 2026, per its GitHub) is a drop-in replacement for HuggingFace's Trainer. It fuses CUDA kernels, uses a leaner gradient-checkpointing path, and integrates Flash Attention. It doesn't change the math of LoRA or QLoRA. It changes how fast that math runs.

Which uses the least VRAM per model size?

This is the table most single-GPU practitioners actually need. Standard LoRA in FP16 is the memory hog; QLoRA and Unsloth QLoRA are what put big models on small cards.

Model size	Standard LoRA (FP16)	QLoRA (NF4)	Unsloth LoRA	Unsloth QLoRA
7B	~14 GB	~6, 9 GB	~8 GB	~5, 6 GB
13B	~26 GB	~13, 15 GB	~16 GB	~10 GB
33B	~66 GB	~24 GB	~32 GB	~18 GB
65B	not feasible	~48 GB	~50 GB	~36 GB
70B	not feasible	~52 GB	~55 GB	~40 GB

Sources: QLoRA paper; InsiderLLM and Markaicode benchmarks.

The pattern is clean. A 24 GB RTX 4090 runs any 7B model on all four paths, handles 13B, 33B comfortably on QLoRA, and can reach 70B only through Unsloth QLoRA with careful batch settings.

VRAM to fine-tune a 7B model, by method

Which trains fastest on one GPU?

Speed is where Unsloth earns its place. Against a HuggingFace TRL baseline of 1.0x, the measured multipliers hold up across models.

Training speed vs HuggingFace TRL baseline (1.0x)

Two facts to keep straight. First, QLoRA is slower than LoRA, about 39% slower in the original paper, because 4-bit dequantization runs on every forward pass. You trade speed for memory.

Second, Unsloth's 30x claim is a Pro/multi-GPU number. On a free single GPU the honest figure is roughly 2x, which the independent 1.87, 2.74x range confirms. For Mixture-of-Experts models Unsloth reports up to 12x, per its MoE documentation, but that's an architecture-specific gain, not the dense-model default.

Does QLoRA cost quality?

Yes, a little, and honesty here matters more than the headline. QLoRA-fine-tuned models typically score 1, 3% lower on standard benchmarks than full-precision counterparts, with 1, 5% relative perplexity increases in community tests. Tasks needing precise numerical reasoning degrade more.

The QLoRA paper argues it reaches "full 16-bit finetuning task performance," and its Guanaco 33B hit 99.3% of ChatGPT on the Vicuna benchmark with an MMLU of 55.47. But that near-parity holds most cleanly for larger models (33B+), where the adapter carries proportionally more of the effective signal.

On a 7B model the quantization tax is more visible. If you're chasing the last point of accuracy and have the VRAM, plain LoRA is still the quality ceiling.

What does a fine-tuning run actually cost?

Cloud pricing decides most single-GPU decisions. Here's the July 2026 hourly landscape.

GPU	VRAM	AWS $/hr	Lambda $/hr	Vast.ai $/hr
RTX 4090	24 GB	consumer only	$0.50, 0.69	$0.35, 0.50
A100 40GB	40 GB	$3.67	$1.09	$0.79, 0.99
H100 80GB	80 GB	$35.69	$3.79	$2.89, 3.50

Sources: AWS pricing; Lambda Labs; Vast.ai.

Now fold in speed. On an A100 40GB training 10B tokens, standard LoRA runs ~~8 hours (~~$8.72 on AWS), QLoRA ~~11 hours (~~$12.01), and Unsloth LoRA ~~3, 4 hours (~~$4.00).

For a team training ten models a week, switching from stock HuggingFace to Unsloth saves roughly $200, 400/week in compute. That's the entire argument for Unsloth in one line.

LoRA vs full fine-tuning: is PEFT ever the wrong call?

Full fine-tuning updates every weight and needs an order of magnitude more memory and money. Parameter-efficient methods win on cost, and there's a quality nuance in their favor: research from Biderman et al. Found PEFT methods "learn less but also forget less," which reduces catastrophic forgetting on narrow datasets.

But fine-tuning of any kind can be the wrong tool. It teaches behavior, format, and style, not fresh facts. When knowledge changes daily or you need cited, verifiable answers, RAG is the better path.

A vector-DB RAG setup runs about $5, 20/month versus $50, 200 to fine-tune a 7B model on 10B tokens. For fast-moving factual data, RAG is 5, 10x cheaper.

Decide behavior-vs-knowledge before you decide LoRA-vs-QLoRA.

Key takeaways

Unsloth is the default accelerator for single-GPU work: 1.87, 2.74x faster, up to 66% less VRAM, on top of LoRA or QLoRA.
LoRA for quality, QLoRA for size. QLoRA runs ~39% slower and costs 1, 3% on benchmarks, but fits 65B on a 48 GB card.
VRAM math is the real constraint. A 24 GB RTX 4090 reaches 70B only via Unsloth QLoRA with batch_size=1.
Cost follows speed. Unsloth LoRA can cut an A100 run from ~$12 to ~$4 per 10B tokens.
Fine-tune for behavior, RAG for knowledge. RAG is 5, 10x cheaper for fast-changing facts.

Version and compatibility notes for July 2026

Reproducibility breaks on version drift, and this stack ships roughly every 60 days. Pin these.

Library	Recommended	Why
PEFT	≥0.19.1 (Apr 16, 2026)	Current stable; supports bitsandbytes 0.45.0+
bitsandbytes	0.49.2 (Feb 16, 2026)	0.50+ has a planned breaking change to blockwise quantization
Unsloth	≥2026.6.9 (Jun 22, 2026)	Fixes CUDA 13.x breakage from the May release
Transformers	≥4.46	Needed for Gemma 4 and Llama 4
PyTorch	2.4+	Flash Attention 3 support

Two traps guides miss. Unsloth's May 2026 release broke CUDA 13.x; the June 22 build fixed it, so upgrade if you're on CUDA 13.x. And Unsloth self-pins torch, transformers, TRL, and PEFT, which fights org-wide dependency management. Teams report needing --no-deps on install to avoid version conflicts.

What this means for you

Start with your card. On a 24 GB RTX 4090, run 7B models on Unsloth LoRA for speed, and 13B, 70B on Unsloth QLoRA with batch_size=1 and gradient accumulation around 128. On an A100 40GB, use LoRA (or Unsloth LoRA) up to 13B for best quality, and QLoRA for 33B, 70B.

Validate the pipeline on a 7B model before scaling, keep the original base weights (adapters are reversible, merged models are not), and reserve 1, 5% of data for holdout eval. One more caution worth pricing in: LoRA can be used to strip safety guardrails, demonstrated on a $200 budget, so vet any community adapter you didn't train yourself.

The workflow above outlives the version numbers. Pick the method for your VRAM, add Unsloth for throughput, and re-check the library table each quarter.

QLoRA vs LoRA vs Unsloth: The Single-GPU Benchmark