Why Use Ollama as Your Model Layer
Ollama exposes an OpenAI-compatible REST API on localhost:11434. Most AI frameworks have an OpenAI client built in — pointing it at Ollama instead of api.openai.com routes all inference to your local GPU. Zero API costs, no data leaving your network, and no rate limits.
The tradeoff is quality and speed. Local models are good but not GPT-4 level. Inference is slower without high-end hardware. For many enterprise use cases — internal knowledge bases, document processing, code review — local quality is sufficient and the privacy benefit is decisive.
LangChain Integration
```python
from langchain_ollama import ChatOllama, OllamaEmbeddings

# Chat model
llm = ChatOllama(
    model='llama3.1:8b',
    base_url='http://localhost:11434',  # default; can omit
    temperature=0,
)
response = llm.invoke('Explain what a vector database is in one paragraph.')
print(response.content)

# Embeddings (for RAG)
embeddings = OllamaEmbeddings(model='nomic-embed-text')
vector = embeddings.embed_query('What is a transformer model?')
print(f'Embedding dimension: {len(vector)}')  # e.g. 768
```
For RAG pipelines, use nomic-embed-text for embeddings (768 dimensions, fast, good quality) or mxbai-embed-large for higher quality (1024 dimensions).
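Retrieval works by comparing embedding vectors with cosine similarity. A minimal stdlib sketch of that comparison, using toy 3-dimensional vectors in place of real 768-dimensional embeddings (with a running Ollama server you would get the vectors from `OllamaEmbeddings.embed_query` instead):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings of two similar queries:
v1 = [0.1, 0.8, 0.3]
v2 = [0.1, 0.7, 0.4]
print(round(cosine_similarity(v1, v2), 3))  # → 0.987
```

A vector store like Chroma performs exactly this comparison (or an approximate version of it) between the query embedding and every stored chunk embedding, returning the top-k closest chunks.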
```python
# Full RAG chain with Ollama
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load and chunk documents
loader = DirectoryLoader('./docs/', glob='**/*.md')
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

# Embed and store locally
embeddings = OllamaEmbeddings(model='nomic-embed-text')
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory='./chroma_db')

# RAG chain
llm = ChatOllama(model='llama3.1:8b', temperature=0)
chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={'k': 4}),
)
answer = chain.invoke({'query': 'What is the refund policy?'})
print(answer['result'])
```
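The `chunk_size` and `chunk_overlap` arguments control how documents are cut: consecutive chunks share `chunk_overlap` characters, so a sentence that straddles a boundary appears intact in at least one chunk. A naive character-level sketch of the idea (the real `RecursiveCharacterTextSplitter` additionally prefers splitting at paragraph and sentence separators):

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Fixed-size character chunker: each chunk starts chunk_size - chunk_overlap
    characters after the previous one, so neighbours share chunk_overlap characters."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text('x' * 1000)
print(len(chunks), [len(c) for c in chunks])  # → 3 [500, 500, 100]
```

Smaller chunks give more precise retrieval but less context per hit; 500/50 is a reasonable starting point for markdown documentation.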
LlamaIndex Integration
```python
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings

# Configure global settings
Settings.llm = Ollama(model='llama3.1:8b', request_timeout=120.0)
Settings.embed_model = OllamaEmbedding(model_name='nomic-embed-text')

# Load documents and build index
documents = SimpleDirectoryReader('./docs').load_data()
index = VectorStoreIndex.from_documents(documents)

# Query
query_engine = index.as_query_engine()
response = query_engine.query('Summarise the key product features.')
print(response)
```
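Building the index this way re-embeds every document on every run. A sketch of a build-once, load-afterwards pattern using LlamaIndex's storage APIs (the function name and `persist_dir` path are our own choices; running it requires the llama-index packages and a reachable Ollama server):

```python
def build_or_load_index(docs_dir='./docs', persist_dir='./storage'):
    """Build the vector index on the first run, reload it from disk afterwards."""
    import os
    from llama_index.core import (
        StorageContext,
        VectorStoreIndex,
        SimpleDirectoryReader,
        load_index_from_storage,
    )
    if os.path.isdir(persist_dir):
        # Reload the previously persisted index without re-embedding anything
        storage = StorageContext.from_defaults(persist_dir=persist_dir)
        return load_index_from_storage(storage)
    documents = SimpleDirectoryReader(docs_dir).load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=persist_dir)
    return index
```

On local hardware embedding is the slow step, so persisting the index matters more than it does with a fast hosted embedding API.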
Set request_timeout=120.0 or higher for Ollama with LlamaIndex. Larger local models can take 30-60 seconds to respond on the first call (cold start) or for long documents. The default timeout is too short and causes confusing errors.

n8n Integration
n8n's AI Agent node supports Ollama as a chat model. In self-hosted n8n running in Docker, Ollama runs on the host — use host.docker.internal:11434 as the base URL.
- In your n8n workflow, add an AI Agent node
- Under 'Chat Model', select 'Ollama Chat Model'
- Set Base URL to http://host.docker.internal:11434 (Docker) or http://localhost:11434 (native n8n)
- Set Model to your installed model name (e.g. llama3.1:8b)
- Connect tools and configure your agent as normal
For the Ollama embeddings node (used in n8n's vector store nodes):
```json
{
  "baseUrl": "http://host.docker.internal:11434",
  "model": "nomic-embed-text"
}
```
If n8n and Ollama are both in Docker Compose (not recommended — Ollama should run directly on the GPU host), use the service name as the host: http://ollama:11434. The host.docker.internal pattern only works when Ollama runs on the Docker host machine.

OpenAI-Compatible API: Any Framework
Ollama's OpenAI-compatible endpoint lets you use the standard openai Python client with any local model. This works with any framework that supports a custom OpenAI base_url.
```python
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',  # required by the client; value ignored by Ollama
)
response = client.chat.completions.create(
    model='qwen2.5-coder:7b',
    messages=[{'role': 'user', 'content': 'Write a Python function to reverse a string'}],
)
print(response.choices[0].message.content)
```
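The same endpoint supports streaming, which matters more locally because first-token latency is high. A sketch wrapping the standard `stream=True` call in a helper (the function name `stream_chat` is our own; running it needs the openai package installed and an Ollama server listening):

```python
def stream_chat(prompt, model='qwen2.5-coder:7b', base_url='http://localhost:11434/v1'):
    """Print tokens as Ollama generates them instead of waiting for the full reply."""
    from openai import OpenAI  # imported here so the sketch parses without the package
    client = OpenAI(base_url=base_url, api_key='ollama')
    stream = client.chat.completions.create(
        model=model,
        messages=[{'role': 'user', 'content': prompt}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end='', flush=True)
    print()
```

Tokens appear as they are generated, so a 30-second response starts displaying within a second or two.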
When Local Is Not Enough
Ollama is not always the right answer. Know when to fall back to an API provider:
| Situation | Use Ollama? | Use API instead? |
|---|---|---|
| Processing confidential internal documents | Yes | |
| Multi-step agent requiring complex reasoning | Only with 30B+ models | GPT-4o or Claude for best quality |
| Serving 100+ concurrent users | Requires dedicated GPU server | API scales automatically |
| Vision/image understanding tasks | llava or llama3.2-vision | GPT-4o vision for production quality |
| Production SLA required (99.9% uptime) | | API provider handles this |
| Cost-sensitive batch processing at night | Yes — zero API cost | |