Why Local Models Matter for Smolagents
Smolagents was built by HuggingFace with open-source and local models as a first-class concern. Unlike frameworks that assume GPT-4 or Claude, Smolagents is designed to run on models you control — on your hardware, with no API costs, and no data leaving your network.
The tradeoff is quality. Smaller local models follow instructions less reliably, write worse code, and make tool calls less accurately. This article is honest about where those gaps show up.
Option 1: Ollama Integration
Ollama runs LLMs locally with an OpenAI-compatible API. Smolagents connects via the LiteLLMModel wrapper.
```python
# Install: pip install smolagents[litellm]
# Requires Ollama running locally: ollama serve
from smolagents import CodeAgent, LiteLLMModel

model = LiteLLMModel(
    model_id='ollama/qwen2.5-coder:14b',
    api_base='http://localhost:11434',
)

agent = CodeAgent(
    tools=[],     # start with no tools for testing
    model=model,
    max_steps=5,  # keep low when testing local models
)

result = agent.run('What is 17 * 38?')
print(result)  # Should return 646
```
Recommended Ollama models for Smolagents CodeAgent:
| Model | Size | Code quality | Speed (CPU) | Notes |
|---|---|---|---|---|
| qwen2.5-coder:14b | 14B | Good | Slow | Best code quality for local use |
| qwen2.5-coder:7b | 7B | Adequate | Moderate | Reasonable balance |
| llama3.1:8b | 8B | Moderate | Fast | Tool calling OK, code mediocre |
| llama3.2:3b | 3B | Poor | Very fast | Not recommended for CodeAgent |
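Before wiring an agent to Ollama, it saves debugging time to confirm the server is actually reachable. A minimal stdlib-only sketch, assuming the default host and port used above (`/api/tags` is Ollama's model-listing endpoint; the `ollama_available` helper name is illustrative, not part of smolagents):

```python
import json
import urllib.error
import urllib.request

def ollama_available(base_url: str = 'http://localhost:11434') -> bool:
    """Return True if an Ollama server responds at base_url."""
    try:
        with urllib.request.urlopen(f'{base_url}/api/tags', timeout=2) as resp:
            data = json.load(resp)
            # A healthy server returns a JSON object with a 'models' list
            return isinstance(data.get('models'), list)
    except (urllib.error.URLError, OSError, ValueError):
        return False

if __name__ == '__main__':
    print('Ollama reachable:', ollama_available())
```

Running this before `agent.run()` turns a confusing mid-run connection error into an immediate, obvious failure.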
Option 2: HuggingFace Hub API
HfApiModel calls the HuggingFace Inference API: you get access to large models without running them locally, authenticated with your HuggingFace token.
```python
import os

from smolagents import CodeAgent, HfApiModel

# Set HF_TOKEN env variable or pass directly
model = HfApiModel(
    model_id='meta-llama/Meta-Llama-3.1-70B-Instruct',
    token=os.environ['HF_TOKEN'],
)

agent = CodeAgent(tools=[], model=model)
```
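Hosted inference can throttle bursts of requests, so it is worth wrapping agent calls in a retry with exponential backoff. A stdlib-only sketch; `with_backoff` is an illustrative helper, not a smolagents API, and catching bare `Exception` is a simplification (in practice you would catch the specific rate-limit error):

```python
import random
import time

def with_backoff(fn, max_tries: int = 5, base_delay: float = 1.0):
    """Call fn(), retrying with jittered exponential backoff on failure."""
    for attempt in range(max_tries):
        try:
            return fn()
        except Exception:
            if attempt == max_tries - 1:
                raise  # out of retries: surface the original error
            # Delays grow roughly 1s, 2s, 4s, ... with random jitter
            time.sleep(base_delay * (2 ** attempt + random.random()))

# Example: result = with_backoff(lambda: agent.run('What is 17 * 38?'))
```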
The HF Inference API is rate-limited on the free tier. For production use, subscribe to HF Pro or deploy your model to a dedicated HF Endpoint. Dedicated Endpoints give you consistent latency and no rate limits.

Option 3: TransformersModel (Fully Local)
Run the model directly in your Python process with no external API. Requires a GPU with enough VRAM. Slowest to start but fastest for batch offline workloads.
```python
from smolagents import CodeAgent, TransformersModel

# Loads the model weights into GPU memory on first call
model = TransformersModel(
    model_id='Qwen/Qwen2.5-Coder-7B-Instruct',
    device_map='auto',   # auto-selects GPU/CPU
    torch_dtype='auto',
)

agent = CodeAgent(tools=[], model=model, max_steps=3)
```
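A rough rule of thumb for whether a model fits in VRAM: the weights alone take about parameter-count × bytes-per-parameter (2 bytes for fp16/bf16, 1 for 8-bit, ~0.5 for 4-bit quantization), plus headroom for the KV cache and activations. A sketch of that arithmetic; the 20% overhead factor is a loose assumption, not a measured value:

```python
BYTES_PER_PARAM = {'fp16': 2.0, 'int8': 1.0, 'int4': 0.5}

def estimate_vram_gb(params_billion: float, dtype: str = 'fp16',
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate in GB: weight size times an overhead factor."""
    weights_gb = params_billion * BYTES_PER_PARAM[dtype]
    return round(weights_gb * overhead, 1)

print(estimate_vram_gb(7))          # 7B at fp16 -> ~16.8 GB
print(estimate_vram_gb(7, 'int4'))  # 7B at 4-bit -> ~4.2 GB
```

By this estimate, the 7B model above wants a ~16 GB GPU in fp16, which is why quantized variants are popular on consumer hardware.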
Where Local Models Fail as Agent Backends
These are common failure modes when running smaller local models with Smolagents:
| Failure mode | Symptom | Mitigation |
|---|---|---|
| Code syntax errors | Agent generates Python with syntax errors, execution fails, agent loops trying to fix | Use a code-specialized model (Qwen Coder); set max_steps low |
| Tool calling format errors | ToolCallingAgent produces malformed JSON, tool never executes | Switch to CodeAgent for local models — code format is more forgiving than JSON schemas |
| Ignoring final_answer() | Agent keeps calling tools without terminating | Add explicit instructions in system prompt: 'Always end with final_answer(result)' |
| Context overflow | Long tasks fill context window, model starts repeating or hallucinating | Set max_steps=5 and use 7B+ models with 32k+ context |
Practical Recommendation
- For prototyping with local models: Ollama + qwen2.5-coder:14b + CodeAgent
- For production with local models: HF Dedicated Endpoint + Llama 3.1 70B + ToolCallingAgent
- For best agent quality at any budget: LiteLLMModel pointing to Claude Sonnet or GPT-4o
- Keep max_steps low (5-8) for local models to cap runaway loops and wasted compute