The Core Constraint: VRAM

Ollama runs model inference on the GPU, with CPU fallback. VRAM is the binding constraint: a model that does not fit in VRAM either spills onto the CPU (10-50x slower) or fails to load. The rule of thumb: the model's weights, in your chosen quantization, must fit in available VRAM with at least 1-2 GB of headroom for the KV cache.
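The fit check above can be sketched as a back-of-the-envelope estimate (this is not Ollama's actual allocator; the 1.5 GB headroom default is an assumption from the 1-2 GB range above):

```python
def fits_in_vram(params_billion: float, bits_per_weight: float,
                 vram_gb: float, headroom_gb: float = 1.5) -> bool:
    """Rough check: quantized weights plus KV-cache headroom vs. available VRAM."""
    weight_gb = params_billion * bits_per_weight / 8  # 1e9 params * bits -> bytes -> GB
    return weight_gb + headroom_gb <= vram_gb

# A 7B model at Q4_K_M (~4.8 bits/weight) needs ~4.2 GB of weights:
# fits_in_vram(7, 4.8, 12)  -> True  (RTX 3060, 12 GB)
# fits_in_vram(7, 4.8, 4)   -> False (4 GB card)
```

Real usage is slightly higher than this weights-only estimate (context length grows the KV cache), which is why the tables below quote ~4.8 GB for a 7B Q4_K_M model.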

Quantization: What Q4, Q5, Q8 Mean

Full-precision (FP16) models store each weight as 16 bits. Quantized models compress weights to fewer bits, reducing VRAM at some quality cost.

| Quantization | Bits per weight | VRAM vs FP16 | Quality loss | When to use |
|---|---|---|---|---|
| Q2_K | ~2.6 bits | ~84% smaller | Significant | Extreme VRAM constraints only |
| Q4_K_M | ~4.8 bits | ~70% smaller | Minor | Best default for most use cases |
| Q5_K_M | ~5.7 bits | ~64% smaller | Minimal | Better quality, still fits smaller GPUs |
| Q8_0 | ~8 bits | ~50% smaller | Negligible | When VRAM allows; near-FP16 quality |
| FP16 | 16 bits | Baseline | None | Research/fine-tuning; needs high VRAM |

The community consensus: Q4_K_M is the sweet spot for most production use cases. Quality is close to Q8 at a significant VRAM saving. Q8_0 is worth it only if your GPU can comfortably fit it.
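The "VRAM vs FP16" column follows directly from the bit widths; a quick sanity check:

```python
def savings_vs_fp16(bits_per_weight: float) -> float:
    """Fraction of weight memory saved relative to 16-bit storage."""
    return 1 - bits_per_weight / 16

# Matches the table within rounding:
print(round(savings_vs_fp16(2.6), 2))  # 0.84 -> ~84% smaller (Q2_K)
print(round(savings_vs_fp16(4.8), 2))  # 0.7  -> ~70% smaller (Q4_K_M)
print(round(savings_vs_fp16(8.0), 2))  # 0.5  -> ~50% smaller (Q8_0)
```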

VRAM Requirements by Model Size

| Model size | Q4_K_M VRAM | Q8_0 VRAM | GPU that fits Q4_K_M |
|---|---|---|---|
| 3B params | ~2.2 GB | ~3.5 GB | Any dedicated GPU (even 4 GB) |
| 7B params | ~4.8 GB | ~7.5 GB | RTX 3060 (12 GB), M1/M2 Pro |
| 13B params | ~8.5 GB | ~14 GB | RTX 3080 (10 GB) tight, RTX 4070 (12 GB) |
| 30B params | ~19 GB | ~32 GB | RTX 4090 (24 GB) for Q4; A100 40 GB for Q8 |
| 70B params | ~42 GB | ~70 GB | 2x RTX 4090, A100 80 GB |
| 405B params | ~245 GB | ~405 GB | Multi-GPU server or cloud only |

Apple Silicon (M1/M2/M3/M4) shares RAM between CPU and GPU. An M2 Pro with 32 GB unified memory can run 30B Q4_K_M models with acceptable speed. This is why Macs have become popular local LLM hardware.
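Combining the two tables, you can sketch a helper that picks the highest-quality quantization that fits a given card. This is a weights-only estimate using the bits/weight figures from the quantization table (the 1.5 GB headroom is again an assumption), so it can be slightly more optimistic than the table above:

```python
# Ordered best quality -> smallest; bits/weight from the quantization table.
QUANT_BITS = {"Q8_0": 8.0, "Q5_K_M": 5.7, "Q4_K_M": 4.8, "Q2_K": 2.6}

def best_quant(params_billion, vram_gb, headroom_gb=1.5):
    """Return the highest-quality quantization whose weights fit, or None."""
    for name, bits in QUANT_BITS.items():
        if params_billion * bits / 8 + headroom_gb <= vram_gb:
            return name
    return None  # nothing fits; expect CPU offload

# best_quant(7, 12)  -> "Q8_0"   (7 GB weights + headroom fits 12 GB)
# best_quant(13, 12) -> "Q5_K_M" (weights-only math; the table's Q4 pick is safer)
# best_quant(70, 24) -> None     (even Q2_K needs ~23 GB + headroom)
```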

Model Recommendations by Use Case

Coding and Agent Tool Calling

| Model | Size | Notes |
|---|---|---|
| qwen2.5-coder:14b | 14B | Best local coding model; strong function calling |
| qwen2.5-coder:7b | 7B | Good balance of quality and speed; fits most GPUs |
| deepseek-coder-v2:16b | 16B | Strong coding quality; fits RTX 4080/4090 |
| codellama:13b | 13B | Older but reliable; good for Python/JS |

General Chat and Reasoning

| Model | Size | Notes |
|---|---|---|
| llama3.1:8b | 8B | Best general-purpose 8B model; fast |
| llama3.1:70b | 70B | Near GPT-4 quality locally; requires high-end hardware |
| mistral:7b | 7B | Fast, solid general performance |
| gemma3:12b | 12B | Google's Gemma 3; strong reasoning per parameter |

Instruction Following and RAG

| Model | Size | Notes |
|---|---|---|
| llama3.2:3b | 3B | Extremely fast; good for simple RAG retrieval tasks |
| phi4:14b | 14B | Microsoft Phi-4; punches above its weight for instruction following |
| qwen2.5:7b | 7B | Strong multilingual support; good for non-English RAG |

Running Ollama: The Basics

```shell
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model
ollama pull qwen2.5-coder:7b

# Run interactively
ollama run qwen2.5-coder:7b

# List downloaded models
ollama list

# Delete a model
ollama rm codellama:13b
```

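Beyond the CLI, Ollama also serves an HTTP API on localhost:11434 by default. A minimal sketch of a non-streaming call to `/api/generate` (assumes the server is running and the model is pulled; error handling omitted):

```python
import json
import urllib.request

def build_payload(model: str, prompt: str) -> bytes:
    """JSON body for Ollama's /api/generate; stream=False returns one reply."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def generate(model: str, prompt: str, host: str = "http://localhost:11434") -> str:
    """POST to the local Ollama server and return the generated text."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# generate("qwen2.5-coder:7b", "Write a one-line Python hello world.")
```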
CPU Fallback: What to Expect

If your model exceeds VRAM, Ollama offloads layers to CPU. Performance degrades significantly:

  • Full GPU: 20-80 tokens/second (depending on model size and GPU)
  • Partial offload (some layers on CPU): 5-15 tokens/second
  • Full CPU: 2-8 tokens/second on a modern 8-core CPU

CPU inference is usable for batch offline tasks where latency does not matter. It is not usable for interactive chat or real-time agent tool calls.
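The practical impact of those throughput ranges is easy to quantify; using an assumed 500-token response for illustration:

```python
def seconds_for_response(tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to generate a response at a given throughput."""
    return tokens / tokens_per_second

# 500-token response:
print(seconds_for_response(500, 50))  # 10.0  -> full GPU: feels interactive
print(seconds_for_response(500, 10))  # 50.0  -> partial offload: sluggish
print(seconds_for_response(500, 4))   # 125.0 -> full CPU: batch-only territory
```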