Why Use Ollama as Your Model Layer
Ollama exposes an OpenAI-compatible REST API on localhost:11434. Most AI frameworks have an OpenAI client built in — pointing it at Ollama instead of api.openai.com routes all inference to your local GPU. Zero API costs, no data leaving your network, and no rate limits.
The tradeoff is quality and speed. Local models are good but not GPT-4 level. Inference is slower without high-end hardware. For many enterprise use cases — internal knowledge bases, document processing, code review — local quality is sufficient and the privacy benefit is decisive.
LangChain Integration
```python
from langchain_ollama import ChatOllama, OllamaEmbeddings

# Chat model
llm = ChatOllama(
    model='llama3.1:8b',
    base_url='http://localhost:11434',  # default; can omit
    temperature=0,
)
response = llm.invoke('Explain what a vector database is in one paragraph.')
print(response.content)

# Embeddings (for RAG)
embeddings = OllamaEmbeddings(model='nomic-embed-text')
vector = embeddings.embed_query('What is a transformer model?')
print(f'Embedding dimension: {len(vector)}')  # e.g. 768
```
For RAG pipelines, use nomic-embed-text for embeddings (768 dimensions, fast, good quality) or mxbai-embed-large for higher quality (1024 dimensions).
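Retrieval works by comparing embedding vectors with cosine similarity. A minimal stdlib sketch of that comparison, using toy 3-dimensional vectors in place of real 768-dimensional embeddings (with a running Ollama server you would get the vectors from `OllamaEmbeddings.embed_query` instead):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings of two similar queries:
v1 = [0.1, 0.8, 0.3]
v2 = [0.1, 0.7, 0.4]
print(round(cosine_similarity(v1, v2), 3))  # → 0.987
```

A vector store like Chroma performs exactly this comparison (or an approximate version of it) between the query embedding and every stored chunk embedding, returning the top-k closest chunks.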
```python
# Full RAG chain with Ollama
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load and chunk documents
loader = DirectoryLoader('./docs/', glob='**/*.md')
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

# Embed and store locally
embeddings = OllamaEmbeddings(model='nomic-embed-text')
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory='./chroma_db')

# RAG chain
llm = ChatOllama(model='llama3.1:8b', temperature=0)
chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={'k': 4}),
)
answer = chain.invoke({'query': 'What is the refund policy?'})
print(answer['result'])
```
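The `chunk_size` and `chunk_overlap` arguments control how documents are cut: consecutive chunks share `chunk_overlap` characters, so a sentence that straddles a boundary appears intact in at least one chunk. A naive character-level sketch of the idea (the real `RecursiveCharacterTextSplitter` additionally prefers splitting at paragraph and sentence separators):

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Fixed-size character chunker: each chunk starts chunk_size - chunk_overlap
    characters after the previous one, so neighbours share chunk_overlap characters."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text('x' * 1000)
print(len(chunks), [len(c) for c in chunks])  # → 3 [500, 500, 100]
```

Smaller chunks give more precise retrieval but less context per hit; 500/50 is a reasonable starting point for markdown documentation.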
LlamaIndex Integration
```python
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings

# Configure global settings
Settings.llm = Ollama(model='llama3.1:8b', request_timeout=120.0)
Settings.embed_model = OllamaEmbedding(model_name='nomic-embed-text')

# Load documents and build index
documents = SimpleDirectoryReader('./docs').load_data()
index = VectorStoreIndex.from_documents(documents)

# Query
query_engine = index.as_query_engine()
response = query_engine.query('Summarise the key product features.')
print(response)
```
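Building the index this way re-embeds every document on every run. A sketch of a build-once, load-afterwards pattern using LlamaIndex's storage APIs (the function name and `persist_dir` path are our own choices; running it requires the llama-index packages and a reachable Ollama server):

```python
def build_or_load_index(docs_dir='./docs', persist_dir='./storage'):
    """Build the vector index on the first run, reload it from disk afterwards."""
    import os
    from llama_index.core import (
        StorageContext,
        VectorStoreIndex,
        SimpleDirectoryReader,
        load_index_from_storage,
    )
    if os.path.isdir(persist_dir):
        # Reload the previously persisted index without re-embedding anything
        storage = StorageContext.from_defaults(persist_dir=persist_dir)
        return load_index_from_storage(storage)
    documents = SimpleDirectoryReader(docs_dir).load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=persist_dir)
    return index
```

On local hardware embedding is the slow step, so persisting the index matters more than it does with a fast hosted embedding API.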
Set request_timeout=120.0 or higher for Ollama with LlamaIndex. Larger local models can take 30-60 seconds to respond on the first call (cold start) or for long documents. The default timeout is too short and causes confusing errors.

n8n Integration
n8n's AI Agent node supports Ollama as a chat model. In self-hosted n8n running in Docker, Ollama runs on the host — use host.docker.internal:11434 as the base URL.
- In your n8n workflow, add an AI Agent node
- Under 'Chat Model', select 'Ollama Chat Model'
- Set Base URL to http://host.docker.internal:11434 (Docker) or http://localhost:11434 (native n8n)
- Set Model to your installed model name (e.g. llama3.1:8b)
- Connect tools and configure your agent as normal
For the Ollama embeddings node (used in n8n's vector store nodes):
```json
{
  "baseUrl": "http://host.docker.internal:11434",
  "model": "nomic-embed-text"
}
```
If n8n and Ollama are both in Docker Compose (not recommended — Ollama should run directly on the GPU host), use the service name as the host: http://ollama:11434. The host.docker.internal pattern only works when Ollama runs on the Docker host machine.

OpenAI-Compatible API: Any Framework
Ollama's OpenAI-compatible endpoint lets you use the standard openai Python client with any local model. This works with any framework that supports a custom OpenAI base_url.
```python
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',  # required by the client; value ignored by Ollama
)
response = client.chat.completions.create(
    model='qwen2.5-coder:7b',
    messages=[{'role': 'user', 'content': 'Write a Python function to reverse a string'}],
)
print(response.choices[0].message.content)
```
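The same endpoint supports streaming, which matters more locally because first-token latency is high. A sketch wrapping the standard `stream=True` call in a helper (the function name `stream_chat` is our own; running it needs the openai package installed and an Ollama server listening):

```python
def stream_chat(prompt, model='qwen2.5-coder:7b', base_url='http://localhost:11434/v1'):
    """Print tokens as Ollama generates them instead of waiting for the full reply."""
    from openai import OpenAI  # imported here so the sketch parses without the package
    client = OpenAI(base_url=base_url, api_key='ollama')
    stream = client.chat.completions.create(
        model=model,
        messages=[{'role': 'user', 'content': prompt}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end='', flush=True)
    print()
```

Tokens appear as they are generated, so a 30-second response starts displaying within a second or two.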
When Local Is Not Enough
Ollama is not always the right answer. Know when to fall back to an API provider:
| Situation | Use Ollama? | Use API instead? |
|---|---|---|
| Processing confidential internal documents | Yes | |
| Multi-step agent requiring complex reasoning | Only with 30B+ models | GPT-4o or Claude for best quality |
| Serving 100+ concurrent users | Requires dedicated GPU server | API scales automatically |
| Vision/image understanding tasks | llava or llama3.2-vision | GPT-4o vision for production quality |
| Production SLA required (99.9% uptime) | | API provider handles this |
| Cost-sensitive batch processing at night | Yes — zero API cost | |