The Core Constraint: VRAM
Ollama runs model inference on the GPU, falling back to CPU when necessary, and VRAM is the binding constraint. A model that does not fit in VRAM is offloaded partly or entirely to the CPU, which is 10-50x slower. The rule: the model's weights in your chosen quantization must fit in your available VRAM with at least 1-2 GB of headroom for the KV cache.
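That rule can be sketched as a one-line check (the function name and the 2 GB default are ours; the headroom figure is the rule above):

```python
def fits_in_vram(model_size_gb: float, vram_gb: float, headroom_gb: float = 2.0) -> bool:
    """True if the quantized weights fit in VRAM with headroom
    left over for the KV cache."""
    return model_size_gb + headroom_gb <= vram_gb

# A 7B model at Q4_K_M (~4.8 GB) on a 12 GB GPU fits comfortably:
assert fits_in_vram(4.8, 12.0)
# A 13B model at Q8_0 (~14 GB) on the same GPU does not:
assert not fits_in_vram(14.0, 12.0)
```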
Quantization: What Q4, Q5, Q8 Mean
Full-precision (FP16) models store each weight as 16 bits. Quantized models compress weights to fewer bits, reducing VRAM at some quality cost.
| Quantization | Bits per weight | VRAM vs FP16 | Quality loss | When to use |
|---|---|---|---|---|
| Q2_K | ~2.6 bits | ~84% smaller | Significant | Extreme VRAM constraints only |
| Q4_K_M | ~4.8 bits | ~70% smaller | Minor | Best default for most use cases |
| Q5_K_M | ~5.7 bits | ~64% smaller | Minimal | Better quality, still fits smaller GPUs |
| Q8_0 | ~8 bits | ~50% smaller | Negligible | When VRAM allows; near-FP16 quality |
| FP16 | 16 bits | Baseline | None | Research/fine-tuning; needs high VRAM |
The community consensus: Q4_K_M is the sweet spot for most production use cases. Quality is close to Q8 at a significant VRAM saving. Q8_0 is worth it only if your GPU can comfortably fit it.
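The VRAM savings in the table follow directly from bits per weight. A rough size estimate, weights only (KV cache and runtime overhead excluded; the bits-per-weight figures are the table's approximations):

```python
def quantized_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB."""
    total_bits = params_billions * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9  # bits -> bytes -> GB

# 7B at Q4_K_M (~4.8 bits/weight): ~4.2 GB of raw weights;
# real-world figures run slightly higher due to runtime overhead.
print(round(quantized_size_gb(7, 4.8), 1))   # 4.2
# FP16 baseline for the same model: 7B * 2 bytes = 14 GB.
print(round(quantized_size_gb(7, 16), 1))    # 14.0
```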
VRAM Requirements by Model Size
| Model size | Q4_K_M VRAM | Q8_0 VRAM | GPU that fits Q4_K_M |
|---|---|---|---|
| 3B params | ~2.2 GB | ~3.5 GB | Any dedicated GPU (even 4 GB) |
| 7B params | ~4.8 GB | ~7.5 GB | RTX 3060 (12 GB), M1/M2 Pro |
| 13B params | ~8.5 GB | ~14 GB | RTX 3080 (10 GB) tight, RTX 4070 (12 GB) |
| 30B params | ~19 GB | ~32 GB | RTX 4090 (24 GB) for Q4; A100 40 GB for Q8 |
| 70B params | ~42 GB | ~70 GB | 2x RTX 4090, A100 80 GB |
| 405B params | ~245 GB | ~405 GB | Multi-GPU server or cloud only |
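The table lends itself to a small lookup helper: given a model size and your GPU's VRAM, pick the highest-quality quantization that still fits with headroom. The dict and function below are illustrative, not part of Ollama; the numbers are the table's.

```python
# VRAM needed (GB) per model size, from the table above.
VRAM_GB = {
    "3b":  {"Q4_K_M": 2.2,  "Q8_0": 3.5},
    "7b":  {"Q4_K_M": 4.8,  "Q8_0": 7.5},
    "13b": {"Q4_K_M": 8.5,  "Q8_0": 14.0},
    "30b": {"Q4_K_M": 19.0, "Q8_0": 32.0},
    "70b": {"Q4_K_M": 42.0, "Q8_0": 70.0},
}

def best_quant(model: str, vram_gb: float, headroom_gb: float = 2.0):
    """Highest-quality quantization that fits with headroom, else None."""
    for quant in ("Q8_0", "Q4_K_M"):  # prefer higher quality first
        if VRAM_GB[model][quant] + headroom_gb <= vram_gb:
            return quant
    return None  # does not fit: expect CPU offload

print(best_quant("13b", 12.0))  # Q4_K_M on an RTX 4070 (12 GB)
print(best_quant("7b", 24.0))   # Q8_0 on an RTX 4090
```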
Apple Silicon (M1/M2/M3/M4) shares RAM between CPU and GPU. An M2 Pro with 32 GB unified memory can run 30B Q4_K_M models at acceptable speed. This is why Macs have become popular local LLM hardware.
Model Recommendations by Use Case
Coding and Agent Tool Calling
| Model | Size | Notes |
|---|---|---|
| qwen2.5-coder:14b | 14B | Best local coding model; strong function calling |
| qwen2.5-coder:7b | 7B | Good balance of quality and speed; fits most GPUs |
| deepseek-coder-v2:16b | 16B | Strong coding quality; fits RTX 4080/4090 |
| codellama:13b | 13B | Older but reliable; good for Python/JS |
General Chat and Reasoning
| Model | Size | Notes |
|---|---|---|
| llama3.1:8b | 8B | Best general-purpose 8B model; fast |
| llama3.1:70b | 70B | Near GPT-4 quality locally; requires high-end hardware |
| mistral:7b | 7B | Fast, solid general performance |
| gemma3:12b | 12B | Google's Gemma 3; strong reasoning per parameter |
Instruction Following and RAG
| Model | Size | Notes |
|---|---|---|
| llama3.2:3b | 3B | Extremely fast; good for simple RAG retrieval tasks |
| phi4:14b | 14B | Microsoft Phi-4; punches above its weight for instruction following |
| qwen2.5:7b | 7B | Strong multilingual support; good for non-English RAG |
Running Ollama: The Basics
```shell
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model
ollama pull qwen2.5-coder:7b

# Run interactively
ollama run qwen2.5-coder:7b

# List downloaded models
ollama list

# Delete a model
ollama rm codellama:13b
```
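Beyond the CLI, Ollama serves a local HTTP API on port 11434. A minimal sketch using only the standard library (assumes a running Ollama server and an already-pulled model):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port

def build_request(model: str, prompt: str) -> dict:
    """Body for /api/generate; stream=False returns one JSON object
    instead of a stream of token chunks."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """Send the prompt and return the model's full response text."""
    data = json.dumps(build_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires a running server):
#   print(generate("qwen2.5-coder:7b", "Write a one-line Python hello world."))
```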
CPU Fallback: What to Expect
If your model exceeds VRAM, Ollama offloads layers to CPU. Performance degrades significantly:
- Full GPU: 20-80 tokens/second (depending on model size and GPU)
- Partial offload (some layers on CPU): 5-15 tokens/second
- Full CPU: 2-8 tokens/second on a modern 8-core CPU
CPU inference is usable for batch offline tasks where latency does not matter. It is not usable for interactive chat or real-time agent tool calls.