Fine-Tuning LLMs with Torchtune: A Practical Guide (Qwen & Llama2)
Fine-Tuning LLMs with Torchtune: A Practical Guide (Qwen & Llama2)
This post walks through fine-tuning large language models using torchtune, covering common errors and memory constraints encountered in practice — especially when running inside a memory-limited Kubernetes container.
Environment Setup
pip install torch torchtune torchao --index-url https://download.pytorch.org/whl/cu128
Version Compatibility: torchtune + torchao
After installation, you may hit this import error when running any tune command:
ImportError: cannot import name 'int4_weight_only' from 'torchao.quantization'
This happens because torchtune and torchao have a version mismatch. Newer versions of torchao renamed int4_weight_only to Int4WeightOnlyConfig. Fix it by upgrading both packages together, or pinning a compatible torchao version:
# Option 1: Upgrade both
pip install torchtune torchao --upgrade
# Option 2: Pin compatible torchao
pip install "torchao<0.8"
Downloading Models
# List all available built-in configs
tune ls
# Download Qwen 2.5 models
tune download Qwen/Qwen2.5-3B-Instruct --output-dir /tmp/Qwen2.5-3B-Instruct --hf-token $HF_TOKEN
tune download Qwen/Qwen2.5-7B-Instruct --output-dir /tmp/Qwen2.5-7B-Instruct --hf-token $HF_TOKEN
# Download Llama 2 7B
tune download meta-llama/Llama-2-7b-hf --output-dir /tmp/Llama-2-7b-hf --hf-token $HF_TOKEN
If tune download fails (e.g., due to the torchao import issue above), you can bypass it entirely with huggingface-cli:
huggingface-cli download meta-llama/Llama-2-7b-hf --local-dir /tmp/Llama-2-7b-hf --token $HF_TOKEN
Fine-Tuning with LoRA
Basic Command
tune run lora_finetune_single_device --config qwen2_5/7B_lora_single_device epochs=1
Error: bf16 Not Supported
RuntimeError: bf16 precision was requested but not available on this hardware.
bf16 (bfloat16) requires NVIDIA Ampere or newer GPUs (A100, RTX 30xx/40xx). If your hardware doesn’t support it, switch to fp32 or fp16:
# Use fp32 (safe for all hardware, but 2x memory and slower)
tune run lora_finetune_single_device --config qwen2_5/7B_lora_single_device epochs=1 dtype=fp32
# Use fp16 (most GPUs support this, half the memory of fp32)
tune run lora_finetune_single_device --config qwen2_5/7B_lora_single_device epochs=1 dtype=fp16
Handling OOM (Out of Memory) Kills
Symptom
Training starts and immediately gets killed:
0%| | 0/3235 [00:00<?, ?it/s]Killed
No error traceback — the Linux OOM Killer terminates the process silently.
Identifying the Real Memory Limit in Containers
If you’re running inside a Kubernetes Pod / Docker container, the top or free commands show the host machine’s memory, not your container’s limit. The actual limit is controlled by cgroups:
# Container memory limit (cgroup v2)
cat /sys/fs/cgroup/memory.max
# Current usage
cat /sys/fs/cgroup/memory.current
# For cgroup v1
cat /sys/fs/cgroup/memory/memory.limit_in_bytes
In my case, the container limit was 32GB while top showed 257GB (host memory):
| Metric | Value |
|---|---|
| Container memory limit | 34,359,738,368 bytes (32 GiB) |
| Usage before kill | 34,337,239,040 bytes (~32 GiB) |
| Headroom | ~21 MB — almost zero |
Memory Estimation for Model Weights
Model weights alone (fp32, 4 bytes per parameter):
- 3B model: $(3 \times 10^9 \times 4) \div 1024^3 \approx 11.2\text{ GB}$
- 7B model: $(7 \times 10^9 \times 4) \div 1024^3 \approx 26.1\text{ GB}$
On top of weights, you also need memory for optimizer states, gradients, activations, and data loading — easily adding 50–100% overhead. A 7B fp32 model realistically needs 40+ GB.
Solutions to Reduce Memory Usage
1. Use fp16 with smaller batch size:
tune run lora_finetune_single_device --config qwen2_5/7B_lora_single_device \
epochs=1 dtype=fp16 batch_size=1 gradient_accumulation_steps=16 \
enable_activation_checkpointing=true tokenizer.max_seq_len=512
fp16 cuts model weight memory in half (~14 GB for 7B), leaving room for optimizer states and activations within 32 GB.
2. Use QLoRA (4-bit quantization) for maximum savings:
# If a built-in qlora config exists:
tune run lora_finetune_single_device --config qwen2_5/7B_qlora_single_device epochs=1
# Or manually specify the quantizer:
tune run lora_finetune_single_device --config qwen2_5/7B_lora_single_device \
epochs=1 dtype=fp32 \
quantizer._component_=torchtune.training.quantization.Int4WeightOnlyQuantizer \
quantizer.groupsize=128
4-bit quantization reduces the 7B model to ~3.5 GB, making it very comfortable within a 32 GB container.
3. Use a smaller model:
tune download Qwen/Qwen2.5-3B-Instruct --output-dir /tmp/Qwen2.5-3B-Instruct --hf-token $HF_TOKEN
tune run lora_finetune_single_device --config qwen2_5/3B_lora_single_device epochs=1 dtype=fp32
Qwen2.5 available sizes: 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B (there is no 4B variant).
Summary: Memory Strategy Cheat Sheet
| Strategy | 7B Weight Memory | Total Estimated | Fits 32GB? |
|---|---|---|---|
| fp32 | ~26 GB | ~40+ GB | No |
| fp16 | ~13 GB | ~20-24 GB | Yes |
| QLoRA (4-bit) | ~3.5 GB | ~8-12 GB | Yes |
| Use 3B fp32 | ~11 GB | ~18-22 GB | Yes |