The Core Constraint: VRAM
Ollama runs model inference on the GPU, falling back to CPU when necessary, and VRAM is the binding constraint. A model that does not fit in VRAM is offloaded partly or entirely to the CPU, which is 10-50x slower. The rule: the model's weights in your chosen quantization must fit in your available VRAM with at least 1-2 GB of headroom for the KV cache.
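That rule can be sketched as a one-line check (the function name and the 2 GB default are ours; the headroom figure is the rule above):

```python
def fits_in_vram(model_size_gb: float, vram_gb: float, headroom_gb: float = 2.0) -> bool:
    """True if the quantized weights fit in VRAM with headroom
    left over for the KV cache."""
    return model_size_gb + headroom_gb <= vram_gb

# A 7B model at Q4_K_M (~4.8 GB) on a 12 GB GPU fits comfortably:
assert fits_in_vram(4.8, 12.0)
# A 13B model at Q8_0 (~14 GB) on the same GPU does not:
assert not fits_in_vram(14.0, 12.0)
```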
Quantization: What Q4, Q5, Q8 Mean
Full-precision (FP16) models store each weight as 16 bits. Quantized models compress weights to fewer bits, reducing VRAM at some quality cost.
| Quantization | Bits per weight | VRAM vs FP16 | Quality loss | When to use |
|---|---|---|---|---|
| Q2_K | ~2.6 bits | ~84% smaller | Significant | Extreme VRAM constraints only |
| Q4_K_M | ~4.8 bits | ~70% smaller | Minor | Best default for most use cases |
| Q5_K_M | ~5.7 bits | ~64% smaller | Minimal | Better quality, still fits smaller GPUs |
| Q8_0 | ~8 bits | ~50% smaller | Negligible | When VRAM allows; near-FP16 quality |
| FP16 | 16 bits | Baseline | None | Research/fine-tuning; needs high VRAM |
The community consensus: Q4_K_M is the sweet spot for most production use cases. Quality is close to Q8 at a significant VRAM saving. Q8_0 is worth it only if your GPU can comfortably fit it.
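The VRAM savings in the table follow directly from bits per weight. A rough size estimate, weights only (KV cache and runtime overhead excluded; the bits-per-weight figures are the table's approximations):

```python
def quantized_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB."""
    total_bits = params_billions * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9  # bits -> bytes -> GB

# 7B at Q4_K_M (~4.8 bits/weight): ~4.2 GB of raw weights;
# real-world figures run slightly higher due to runtime overhead.
print(round(quantized_size_gb(7, 4.8), 1))   # 4.2
# FP16 baseline for the same model: 7B * 2 bytes = 14 GB.
print(round(quantized_size_gb(7, 16), 1))    # 14.0
```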
VRAM Requirements by Model Size
| Model size | Q4_K_M VRAM | Q8_0 VRAM | GPU that fits Q4_K_M |
|---|---|---|---|
| 3B params | ~2.2 GB | ~3.5 GB | Any dedicated GPU (even 4 GB) |
| 7B params | ~4.8 GB | ~7.5 GB | RTX 3060 (12 GB), M1/M2 Pro |
| 13B params | ~8.5 GB | ~14 GB | RTX 3080 (10 GB) tight, RTX 4070 (12 GB) |
| 30B params | ~19 GB | ~32 GB | RTX 4090 (24 GB) for Q4; A100 40 GB for Q8 |
| 70B params | ~42 GB | ~70 GB | 2x RTX 4090, A100 80 GB |
| 405B params | ~245 GB | ~405 GB | Multi-GPU server or cloud only |
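The table lends itself to a small lookup helper: given a model size and your GPU's VRAM, pick the highest-quality quantization that still fits with headroom. The dict and function below are illustrative, not part of Ollama; the numbers are the table's.

```python
# VRAM needed (GB) per model size, from the table above.
VRAM_GB = {
    "3b":  {"Q4_K_M": 2.2,  "Q8_0": 3.5},
    "7b":  {"Q4_K_M": 4.8,  "Q8_0": 7.5},
    "13b": {"Q4_K_M": 8.5,  "Q8_0": 14.0},
    "30b": {"Q4_K_M": 19.0, "Q8_0": 32.0},
    "70b": {"Q4_K_M": 42.0, "Q8_0": 70.0},
}

def best_quant(model: str, vram_gb: float, headroom_gb: float = 2.0):
    """Highest-quality quantization that fits with headroom, else None."""
    for quant in ("Q8_0", "Q4_K_M"):  # prefer higher quality first
        if VRAM_GB[model][quant] + headroom_gb <= vram_gb:
            return quant
    return None  # does not fit: expect CPU offload

print(best_quant("13b", 12.0))  # Q4_K_M on an RTX 4070 (12 GB)
print(best_quant("7b", 24.0))   # Q8_0 on an RTX 4090
```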
Apple Silicon (M1/M2/M3/M4) shares RAM between CPU and GPU. An M2 Pro with 32 GB unified memory can run 30B Q4_K_M models at acceptable speed. This is why Macs have become popular local LLM hardware.
Model Recommendations by Use Case
Coding and Agent Tool Calling
| Model | Size | Notes |
|---|---|---|
| qwen2.5-coder:14b | 14B | Best local coding model; strong function calling |
| qwen2.5-coder:7b | 7B | Good balance of quality and speed; fits most GPUs |
| deepseek-coder-v2:16b | 16B | Strong coding quality; fits RTX 4080/4090 |
| codellama:13b | 13B | Older but reliable; good for Python/JS |
General Chat and Reasoning
| Model | Size | Notes |
|---|---|---|
| llama3.1:8b | 8B | Best general-purpose 8B model; fast |
| llama3.1:70b | 70B | Near GPT-4 quality locally; requires high-end hardware |
| mistral:7b | 7B | Fast, solid general performance |
| gemma3:12b | 12B | Google's Gemma 3; strong reasoning per parameter |
Instruction Following and RAG
| Model | Size | Notes |
|---|---|---|
| llama3.2:3b | 3B | Extremely fast; good for simple RAG retrieval tasks |
| phi4:14b | 14B | Microsoft Phi-4; punches above its weight for instruction following |
| qwen2.5:7b | 7B | Strong multilingual support; good for non-English RAG |
Running Ollama: The Basics
```shell
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model
ollama pull qwen2.5-coder:7b

# Run interactively
ollama run qwen2.5-coder:7b

# List downloaded models
ollama list

# Delete a model
ollama rm codellama:13b
```
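Beyond the CLI, Ollama serves a local HTTP API on port 11434. A minimal sketch using only the standard library (assumes a running Ollama server and an already-pulled model):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port

def build_request(model: str, prompt: str) -> dict:
    """Body for /api/generate; stream=False returns one JSON object
    instead of a stream of token chunks."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """Send the prompt and return the model's full response text."""
    data = json.dumps(build_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires a running server):
#   print(generate("qwen2.5-coder:7b", "Write a one-line Python hello world."))
```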
CPU Fallback: What to Expect
If your model exceeds VRAM, Ollama offloads layers to CPU. Performance degrades significantly:
- Full GPU: 20-80 tokens/second (depending on model size and GPU)
- Partial offload (some layers on CPU): 5-15 tokens/second
- Full CPU: 2-8 tokens/second on a modern 8-core CPU
CPU inference is usable for batch offline tasks where latency does not matter. It is not usable for interactive chat or real-time agent tool calls.