Why Local Models Matter for Smolagents

Smolagents was built by HuggingFace with open-source and local models as a first-class concern. Unlike frameworks that assume GPT-4 or Claude, Smolagents is designed to run on models you control — on your hardware, with no API costs, and no data leaving your network.

The tradeoff is quality. Smaller local models follow instructions less reliably, write worse code, and call tools less accurately. This article is honest about where those gaps show up.

Option 1: Ollama Integration

Ollama runs LLMs locally with an OpenAI-compatible API. Smolagents connects via the LiteLLMModel wrapper.

# Install: pip install 'smolagents[litellm]'
# Requires Ollama running locally: ollama serve

from smolagents import CodeAgent, LiteLLMModel

model = LiteLLMModel(
    model_id='ollama/qwen2.5-coder:14b',
    api_base='http://localhost:11434',
    num_ctx=8192,  # Ollama defaults to a small context window; agent prompts need more
)

agent = CodeAgent(
    tools=[],  # start with no tools for testing
    model=model,
    max_steps=5,  # keep low when testing local models
)

result = agent.run('What is 17 * 38?')
print(result)  # Should return 646
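Before wiring in real tools, it helps to check that a local model can return exact answers at all. A minimal smoke-test sketch — smoke_test is my own helper, not part of smolagents; pass agent.run as the callable:

```python
def smoke_test(run_fn, cases):
    """Run (prompt, expected) pairs through an agent callable and return the pass rate."""
    passed = 0
    for prompt, expected in cases:
        try:
            result = run_fn(prompt)
        except Exception:
            result = None  # count crashes as failures
        if str(result).strip() == str(expected):
            passed += 1
    return passed / len(cases)

# Example: rate = smoke_test(agent.run, [('What is 17 * 38?', 646)])
```

If a model cannot hold a high pass rate on trivial arithmetic prompts, it is unlikely to survive real tool use.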

Recommended Ollama models for Smolagents CodeAgent:

Model              Size  Code quality  Speed (CPU)  Notes
qwen2.5-coder:14b  14B   Good          Slow         Best code quality for local use
qwen2.5-coder:7b   7B    Adequate      Moderate     Reasonable balance
llama3.1:8b        8B    Moderate      Fast         Tool calling OK, code mediocre
llama3.2:3b        3B    Poor          Very fast    Not recommended for CodeAgent

Option 2: HuggingFace Hub API

HfApiModel calls the HuggingFace Inference API, giving you access to large models without running them locally; you authenticate with your HuggingFace token.

import os
from smolagents import CodeAgent, HfApiModel

# Set HF_TOKEN env variable or pass directly
model = HfApiModel(
    model_id='meta-llama/Meta-Llama-3.1-70B-Instruct',
    token=os.environ['HF_TOKEN'],
)

agent = CodeAgent(tools=[], model=model)
 
The HF Inference API is rate-limited on the free tier. For production use, subscribe to HF Pro or deploy your model to a dedicated HF Inference Endpoint; dedicated endpoints give you consistent latency and no shared rate limits.
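Until you move off the free tier, the practical workaround for rate limits is exponential backoff around the agent call. A hedged sketch — run_with_backoff is my own helper, not a smolagents API, and in production you would catch the specific HTTP 429 error rather than bare Exception:

```python
import random
import time

def run_with_backoff(run_fn, prompt, max_retries=4, base_delay=2.0):
    """Retry a rate-limited agent call with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return run_fn(prompt)
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error
            # delay doubles each attempt; jitter avoids synchronized retries
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))

# Example: result = run_with_backoff(agent.run, 'Summarize this repo')
```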

Option 3: TransformersModel (Fully Local)

Run the model directly in your Python process with no external API. Requires a GPU with enough VRAM. Slowest to start but fastest for batch offline workloads.

from smolagents import CodeAgent, TransformersModel

# Loads the model weights into GPU memory on first call
model = TransformersModel(
    model_id='Qwen/Qwen2.5-Coder-7B-Instruct',
    device_map='auto',  # auto-selects GPU/CPU
    torch_dtype='auto',
)

agent = CodeAgent(tools=[], model=model, max_steps=3)
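How much VRAM counts as "enough" can be estimated with a standard rule of thumb: weights need roughly 2 bytes per parameter at fp16/bf16, plus overhead for the KV cache and activations. A rough sketch — the 20% overhead figure is an assumption for planning, not a measured value:

```python
def estimate_vram_gb(params_billion, bytes_per_param=2, overhead=0.2):
    """Rough VRAM estimate: weights at fp16 (~2 bytes/param) plus an assumed
    ~20% overhead for KV cache and activations. A coarse planning number only."""
    return params_billion * bytes_per_param * (1 + overhead)

# A 7B model at fp16 lands around 16-17 GB by this estimate,
# which is why 24 GB cards are a comfortable fit.
```

Quantized weights (4-bit, 8-bit) shrink the weight term proportionally, at some cost in quality.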

Where Local Models Fail as Agent Backends

These are common failure modes when running smaller local models with Smolagents:

  • Code syntax errors. Symptom: the agent generates Python with syntax errors, execution fails, and the agent loops trying to fix it. Mitigation: use a code-specialized model (Qwen Coder); set max_steps low.
  • Tool calling format errors. Symptom: ToolCallingAgent produces malformed JSON and the tool never executes. Mitigation: switch to CodeAgent for local models; code format is more forgiving than JSON schemas.
  • Ignoring final_answer(). Symptom: the agent keeps calling tools without terminating. Mitigation: add explicit instructions in the system prompt: 'Always end with final_answer(result)'.
  • Context overflow. Symptom: long tasks fill the context window and the model starts repeating or hallucinating. Mitigation: set max_steps=5 and use 7B+ models with 32k+ context.
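For the final_answer() failure mode, the cheapest fix is appending the reminder to every task string rather than editing prompt templates. A minimal sketch — run_with_reminder is my own wrapper, not a smolagents API:

```python
REMINDER = "\nWhen you are done, always call final_answer(result) with your answer."

def run_with_reminder(run_fn, task):
    """Append an explicit termination reminder to the task text."""
    return run_fn(task + REMINDER)

# Example: result = run_with_reminder(agent.run, 'What is 17 * 38?')
```

Small models respond better to instructions repeated near the end of the prompt than to ones buried in a long system message.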

Practical Recommendation

  • For prototyping with local models: Ollama + qwen2.5-coder:14b + CodeAgent
  • For production with local models: HF Dedicated Endpoint + Llama 3.1 70B + ToolCallingAgent
  • For best agent quality at any budget: LiteLLMModel pointing to Claude Sonnet or GPT-4o
  • Keep max_steps low (5-8) for local models to cap runaway loops and costs
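The recommendations above can be collapsed into a small config map so that swapping backends is a one-string change. A sketch under stated assumptions — the tier names, the max_steps choices for hosted models, and the Claude model id are mine; check LiteLLM's provider docs for current ids:

```python
MODEL_CONFIGS = {
    # Prototyping: local Ollama, code-specialized model, tight step cap
    'prototype': {'wrapper': 'LiteLLMModel',
                  'model_id': 'ollama/qwen2.5-coder:14b',
                  'api_base': 'http://localhost:11434',
                  'max_steps': 5},
    # Production: dedicated HF endpoint, large instruct model
    'production': {'wrapper': 'HfApiModel',
                   'model_id': 'meta-llama/Meta-Llama-3.1-70B-Instruct',
                   'max_steps': 8},
    # Best quality: hosted frontier model via LiteLLM (model id is an assumption)
    'best': {'wrapper': 'LiteLLMModel',
             'model_id': 'anthropic/claude-3-5-sonnet-20240620',
             'max_steps': 12},
}

def model_config(tier):
    """Look up the settings for a deployment tier."""
    return MODEL_CONFIGS[tier]
```

The factory returns plain kwargs rather than instantiating models, so the choice of backend stays a config concern rather than a code change.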