Fine-tuning a 65-billion-parameter model used to mean renting a cluster. The 2023 QLoRA paper by Dettmers et al. collapsed that requirement to a single 48 GB GPU by freezing the base model in 4-bit and training small adapters on top.
Three years later the question isn't whether you can fine-tune open models on one GPU. It's which path costs you the least time, VRAM, and money.
So we lined up the three that matter for single-GPU work: LoRA, QLoRA, and Unsloth. On identical hardware, the fastest path trains 1.87x to 2.74x quicker than a stock HuggingFace baseline while using up to 66% less VRAM, according to independent benchmarks. That gap decides your cloud bill.
TL;DR
For single-GPU fine-tuning as of July 2026, Unsloth wins on speed and VRAM (1.87x to 2.74x faster, up to 66% less memory than stock HuggingFace TRL, per AIFoss benchmarks), and it runs on top of either LoRA or QLoRA. Use plain LoRA when VRAM is plentiful and you want maximum quality; use QLoRA when you need to fit a 33B to 70B model on a 24 to 48 GB card.
Unsloth accelerates both, so the real choice is LoRA vs QLoRA for your model size, then Unsloth to make it fast.
What's the difference between LoRA, QLoRA, and Unsloth?
Short version: LoRA and QLoRA are methods; Unsloth is an implementation that speeds both up.
LoRA (Hu et al., 2021) freezes the pretrained weights and trains two small low-rank matrices per target layer. For GPT-3 175B it cut trainable parameters by 10,000x and GPU memory by 3x, while keeping the base weights in FP16/BF16.
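To make the "two small low-rank matrices" concrete, here is a minimal sketch of the parameter arithmetic for a single layer. The 4096 x 4096 shape and rank 8 are hypothetical values chosen for illustration, not figures from the LoRA paper.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in            # every entry of the frozen matrix W
    lora = rank * (d_in + d_out)   # A is rank x d_in, B is d_out x rank
    return full, lora


# Hypothetical layer: a square 4096 x 4096 projection adapted at rank 8.
full, lora = lora_param_counts(4096, 4096, 8)
print(f"full: {full:,}  lora: {lora:,}  reduction: {full / lora:.0f}x")
```

The per-layer ratio here is much smaller than the 10,000x whole-model figure, which also depends on the rank chosen and on how many of the model's matrices get adapters at all.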
QLoRA goes further. It quantizes the frozen base model to 4-bit NormalFloat (NF4), adds double quantization to shrink the quantization constants, and uses paged optimizers to survive memory spikes. The adapters still train in full precision; only the frozen base drops to 4-bit. That's how you fit a 65B model in 48 GB instead of the ~780 GB full 16-bit training would need.
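The memory numbers follow from simple per-parameter arithmetic. The sketch below assumes roughly 12 bytes per parameter for full 16-bit training with Adam (weights, gradients, and two FP32 optimizer states) and the block sizes described in the QLoRA paper (64 weights per first-level block, 256 blocks per second-level constant). It covers only the frozen base weights and their constants, not activations or adapters.

```python
PARAMS = 65e9  # 65B parameters


def gb(n_bytes: float) -> float:
    """Convert bytes to decimal gigabytes."""
    return n_bytes / 1e9


# Assumption: 2 bytes weights + 2 bytes gradients + 8 bytes Adam states.
full_16bit_gb = gb(PARAMS * 12)

# Frozen base at 4 bits per parameter is 0.5 bytes per parameter.
nf4_weights_gb = gb(PARAMS * 0.5)

# Overhead of quantization constants, in bits per parameter:
# one 32-bit scale per 64 weights, versus 8-bit scales plus one
# 32-bit second-level scale per 256 blocks (double quantization).
plain_bits = 32 / 64
double_bits = 8 / 64 + 32 / (64 * 256)
saved_gb = gb(PARAMS * (plain_bits - double_bits) / 8)

print(f"full 16-bit training: ~{full_16bit_gb:.0f} GB")
print(f"NF4 base weights:     ~{nf4_weights_gb:.1f} GB")
print(f"double quant saves:   ~{saved_gb:.1f} GB")
```

Under these assumptions the base weights take about 32.5 GB, which is why the remaining budget on a 48 GB card has to cover adapters, optimizer state, and activations.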
Unsloth (version 2026.6.9, released June 22, 2026, per its GitHub) is a drop-in replacement for HuggingFace's Trainer. It fuses CUDA kernels, uses a leaner gradient-checkpointing path, and integrates Flash Attention. It doesn't change the math of LoRA or QLoRA. It changes how fast that math runs.
Which uses the least VRAM per model size?
This is the table most single-GPU practitioners actually need. Standard LoRA in FP16 is the memory hog; QLoRA and Unsloth QLoRA are what put big models on small cards.
| Model size | Standard LoRA (FP16) | QLoRA (NF4) | Unsloth LoRA | Unsloth QLoRA |
|---|---|---|---|---|
| 7B | ~14 GB | ~6-9 GB | ~8 GB | ~5-6 GB |
| 13B | ~26 GB | ~13-15 GB | ~16 GB | ~10 GB |
| 33B | ~66 GB | ~24 GB | ~32 GB | ~18 GB |
| 65B | not feasible | ~48 GB | ~50 GB | ~36 GB |
| 70B | not feasible | ~52 GB | ~55 GB | ~40 GB |
Sources: QLoRA paper; InsiderLLM and Markaicode benchmarks.
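The pattern is clean. A 24 GB RTX 4090 runs any 7B model on all four paths and handles 13B on QLoRA, Unsloth LoRA, or Unsloth QLoRA. At 33B, plain QLoRA (~24 GB) leaves no headroom, so Unsloth QLoRA (~18 GB) is the practical route. By the table's own numbers, 70B needs ~40 GB even on Unsloth QLoRA, which does not fit in 24 GB whatever the batch settings.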
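Checking a card against the table is a lookup, sketched below. The figures are the table's upper bounds, the short path names are my own labels, and `headroom_gb` is a hypothetical knob for activations and framework overhead. The check goes strictly by the table and ignores offloading or anything else the table does not cover.

```python
# Upper-bound VRAM estimates in GB, copied from the table above.
# None marks "not feasible".
VRAM_GB = {
    "7B":  {"LoRA": 14,   "QLoRA": 9,  "Unsloth LoRA": 8,  "Unsloth QLoRA": 6},
    "13B": {"LoRA": 26,   "QLoRA": 15, "Unsloth LoRA": 16, "Unsloth QLoRA": 10},
    "33B": {"LoRA": 66,   "QLoRA": 24, "Unsloth LoRA": 32, "Unsloth QLoRA": 18},
    "65B": {"LoRA": None, "QLoRA": 48, "Unsloth LoRA": 50, "Unsloth QLoRA": 36},
    "70B": {"LoRA": None, "QLoRA": 52, "Unsloth LoRA": 55, "Unsloth QLoRA": 40},
}


def paths_that_fit(model: str, card_gb: float, headroom_gb: float = 0.0) -> list[str]:
    """List the training paths whose estimate fits within the card's budget."""
    budget = card_gb - headroom_gb
    return [
        path
        for path, need in VRAM_GB[model].items()
        if need is not None and need <= budget
    ]


for model in VRAM_GB:
    print(model, paths_that_fit(model, card_gb=24))
```

Passing `headroom_gb=2` (an arbitrary illustrative value) leaves Unsloth QLoRA as the only 33B option on a 24 GB card.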
Which trains fastest on one GPU?
Speed is where Unsloth earns its place. Against a HuggingFace TRL baseline of 1.0x, the measured multipliers hold up across models.
Two facts to keep straight. First, QLoRA is slower than LoRA, about 39% slower in the original paper, because 4-bit dequantization runs on every forward pass. You trade speed for memory.
Second, Unsloth's 30x claim is a Pro/multi-GPU number. On a free single GPU the honest figure is roughly 2x, which the independent 1.87x to 2.74x range confirms. For Mixture-of-Experts models Unsloth reports up to 12x, per its MoE documentation, but that's an architecture-specific gain, not the dense-model default.
Does QLoRA cost quality?
Yes, a little, and honesty here matters more than the headline. QLoRA-fine-tuned models typically score 1 to 3% lower on standard benchmarks than full-precision counterparts, with 1 to 5% relative perplexity increases in community tests. Tasks needing precise numerical reasoning degrade more.
The QLoRA paper argues it reaches "full 16-bit finetuning task performance," and its best Guanaco model (65B) hit 99.3% of ChatGPT on the Vicuna benchmark, alongside a reported MMLU of 55.47. But that near-parity holds most cleanly for larger models (33B+), where the adapter carries proportionally more of the effective signal.
On a 7B model the quantization tax is more visible. If you're chasing the last point of accuracy and have the VRAM, plain LoRA is still the quality ceiling.
What does a fine-tuning run actually cost?
Cloud pricing decides most single-GPU decisions. Here's the July 2026 hourly landscape.
| GPU | VRAM | AWS $/hr | Lambda $/hr | Vast.ai $/hr |
|---|---|---|---|---|
| RTX 4090 | 24 GB | consumer only | $0.50-0.69 | $0.35-0.50 |
| A100 40GB | 40 GB | $3.67 | $1.09 | $0.79-0.99 |
| H100 80GB | 80 GB | $35.69 | $3.79 | $2.89-3.50 |
Sources: AWS pricing; Lambda Labs; Vast.ai.
Now fold in speed. On an A100 40GB training 10B tokens, standard LoRA runs 8 hours ($8.72 at Lambda's $1.09/hr), QLoRA 11 hours (about $12), and Unsloth LoRA 3 to 4 hours (about $4). On AWS at $3.67/hr the same runs cost roughly 3.4x as much.
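The cost of a run is hours times hourly rate, and a few lines make it easy to redo the math per provider. In this sketch the rates come from the pricing table (Vast.ai at its low end) and the 3.5 hour Unsloth figure is my midpoint of the quoted range. Note that the $8.72 figure for standard LoRA corresponds to 8 hours at the $1.09/hr Lambda rate.

```python
# A100 40GB hourly rates in USD from the pricing table (Vast.ai: low end).
RATES = {"AWS": 3.67, "Lambda": 1.09, "Vast.ai": 0.79}

# Run lengths in hours from the paragraph above.
# Assumption: 3.5 h is the midpoint of the quoted Unsloth range.
HOURS = {"LoRA": 8, "QLoRA": 11, "Unsloth LoRA": 3.5}


def run_cost(hours: float, rate_per_hour: float) -> float:
    """Cost of one run in USD, rounded to cents."""
    return round(hours * rate_per_hour, 2)


for provider, rate in RATES.items():
    costs = {method: run_cost(h, rate) for method, h in HOURS.items()}
    print(provider, costs)
```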
For a team training ten models a week, switching from stock HuggingFace to Unsloth saves roughly $200 to $400/week in compute. That's the entire argument for Unsloth in one line.
LoRA vs full fine-tuning: is PEFT ever the wrong call?
Full fine-tuning updates every weight and needs an order of magnitude more memory and money. Parameter-efficient methods win on cost, and there's a quality nuance in their favor: research from Biderman et al. found PEFT methods "learn less but also forget less," which reduces catastrophic forgetting on narrow datasets.
But fine-tuning of any kind can be the wrong tool. It teaches behavior, format, and style, not fresh facts. When knowledge changes daily or you need cited, verifiable answers, RAG is the better path.
A vector-DB RAG setup runs about $5 to $20/month versus $50 to $200 to fine-tune a 7B model on 10B tokens. For fast-moving factual data, RAG is 5 to 10x cheaper.
Decide behavior-vs-knowledge before you decide LoRA-vs-QLoRA.
Key takeaways
- Unsloth is the default accelerator for single-GPU work: 1.87x to 2.74x faster, up to 66% less VRAM, on top of LoRA or QLoRA.
- LoRA for quality, QLoRA for size. QLoRA runs ~39% slower and costs 1 to 3% on benchmarks, but fits 65B on a 48 GB card.
- VRAM math is the real constraint. By the VRAM table, a 24 GB RTX 4090 tops out around 33B on Unsloth QLoRA (~18 GB); 70B needs ~40 GB even on that path.
- Cost follows speed. Unsloth LoRA can cut an A100 run from ~$9 (stock LoRA) or ~$12 (QLoRA) to ~$4 per 10B tokens.
- Fine-tune for behavior, RAG for knowledge. RAG is 5 to 10x cheaper for fast-changing facts.
Version and compatibility notes for July 2026
Reproducibility breaks on version drift, and this stack ships roughly every 60 days. Pin these.
| Library | Recommended | Why |
|---|---|---|
| PEFT | ≥0.19.1 (Apr 16, 2026) | Current stable; supports bitsandbytes 0.45.0+ |
| bitsandbytes | 0.49.2 (Feb 16, 2026) | 0.50+ has a planned breaking change to blockwise quantization |
| Unsloth | ≥2026.6.9 (Jun 22, 2026) | Fixes CUDA 13.x breakage from the May release |
| Transformers | ≥4.46 | Needed for Gemma 4 and Llama 4 |
| PyTorch | 2.4+ | Flash Attention 3 support |
Two traps guides miss. Unsloth's May 2026 release broke CUDA 13.x; the June 22 build fixed it, so upgrade if you're on CUDA 13.x. And Unsloth self-pins torch, transformers, TRL, and PEFT, which fights org-wide dependency management. Teams report needing --no-deps on install to avoid version conflicts.
What this means for you
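Start with your card. On a 24 GB RTX 4090, run 7B models on Unsloth LoRA for speed, and 13B to 33B on Unsloth QLoRA with batch_size=1 and gradient accumulation around 128; by the VRAM table, 70B (~40 GB) does not fit in 24 GB. On an A100 40GB, use LoRA (or Unsloth LoRA) up to 13B for best quality, and QLoRA for 33B; 70B sits right at the 40 GB limit even on Unsloth QLoRA, so treat it as a stretch.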
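Pairing batch_size=1 with gradient accumulation trades wall-clock time for memory: only one example's activations are held at once, while the optimizer still sees an averaged gradient. The toy sketch below, using a one-parameter linear model and made-up data, shows that accumulating scaled micro-batch gradients reproduces the full-batch gradient of a mean loss.

```python
def grad_mse(w: float, xs: list[float], ys: list[float]) -> float:
    """Gradient with respect to w of mean((w*x - y)^2) over the samples."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Toy data (hypothetical): 8 samples from y = 3x, current weight 0.5.
xs = [float(i) for i in range(1, 9)]
ys = [3.0 * x for x in xs]
w = 0.5

# One pass over the full batch of 8.
full_batch = grad_mse(w, xs, ys)

# Micro-batches of size 1, each gradient scaled by 1/accum_steps and summed.
accum_steps = len(xs)
accumulated = 0.0
for x, y in zip(xs, ys):
    accumulated += grad_mse(w, [x], [y]) / accum_steps

print(full_batch, accumulated)

# With batch_size=1 and 128 accumulation steps, the effective batch is 128.
effective_batch = 1 * 128
```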
Validate the pipeline on a 7B model before scaling, keep the original base weights (adapters are reversible, merged models are not), and reserve 1 to 5% of data for holdout eval. One more caution worth pricing in: LoRA can be used to strip safety guardrails, demonstrated on a $200 budget, so vet any community adapter you didn't train yourself.
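The holdout reservation is easy to get wrong if the split is not reproducible. Here is a minimal sketch of a seeded split; the function name, the 5% default, and the seed are illustrative choices, not part of any library.

```python
import random


def split_holdout(examples: list, holdout_frac: float = 0.05, seed: int = 0):
    """Return (train, holdout) with a reproducible random holdout set."""
    if not 0 < holdout_frac < 1:
        raise ValueError("holdout_frac must be between 0 and 1")
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    n_holdout = max(1, round(len(examples) * holdout_frac))
    holdout_idx = set(indices[:n_holdout])
    train = [ex for i, ex in enumerate(examples) if i not in holdout_idx]
    holdout = [examples[i] for i in sorted(holdout_idx)]
    return train, holdout


# Stand-in dataset of 1,000 items.
data = list(range(1000))
train, holdout = split_holdout(data, holdout_frac=0.05, seed=42)
print(len(train), len(holdout))
```

Fixing the seed keeps the holdout set identical across runs, so eval numbers from different adapters stay comparable.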
The workflow above outlives the version numbers. Pick the method for your VRAM, add Unsloth for throughput, and re-check the library table each quarter.
Sources
- QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023)
- LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021)
- Unsloth on GitHub
- Unsloth vs Axolotl 2026 benchmarks (AIFoss)
- Fine-Tune Llama with Unsloth (Markaicode)
- Fine-tune MoE Models 12x Faster (Unsloth docs)
- Why RAG and When Not To (EngineersOfAI)
- HuggingFace PEFT releases
- bitsandbytes on PyPI
- Vast.ai GPU pricing
