How to build retrieval pipelines that understand charts, diagrams, and images alongside text
Why text-only RAG fails on real documents
Most enterprise documents — annual reports, research papers, technical manuals — contain charts, diagrams, and tables that carry information not present in any paragraph. A text-only RAG pipeline silently ignores all of it.
LlamaIndex's multi-modal support lets you index images alongside text, retrieve both based on a query, and feed them to a vision-capable LLM for reasoning. This guide shows you how.
Prerequisites
```bash
pip install llama-index llama-index-multi-modal-llms-openai
pip install llama-index-vector-stores-qdrant
# For PDF image extraction
pip install pdf2image pillow
```
How multi-modal indexing works
LlamaIndex stores images as ImageNode objects alongside TextNode objects in the same (or separate) vector store. Images are embedded using a CLIP-style model or a multi-modal embedding, while text uses standard embeddings. At query time, both types can be retrieved and passed to the LLM.
| Component | Purpose |
|---|---|
| ImageNode | Stores image bytes or path + metadata + embedding |
| TextNode | Standard text chunk with embedding |
| MultiModalVectorStoreIndex | Index that handles both node types |
| MultiModalSimpleDirectoryReader | Loads images and text from a folder |
| GPT-4o / Claude (vision) | Vision-capable LLM for final answer synthesis |
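The retrieval mechanics above can be sketched in plain Python, independent of any LlamaIndex API: text and image embeddings live in separate collections, each is scored against the query embedding, and the top-k from each are merged. The collections, IDs, and 2-dimensional vectors below are toy stand-ins for illustration only; in the real index the vectors come from a text-embedding model and a CLIP-style image model respectively.

```python
import math

def cosine(a, b):
    # Standard cosine similarity between two dense vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def retrieve(query_vec, collection, top_k):
    # Score every stored vector against the query and keep the best top_k
    scored = sorted(collection, key=lambda item: cosine(query_vec, item['vec']), reverse=True)
    return scored[:top_k]

# Toy "collections": stand-ins for the Qdrant text and image collections
text_collection = [
    {'id': 'chunk-1', 'vec': [0.9, 0.1]},
    {'id': 'chunk-2', 'vec': [0.2, 0.8]},
]
image_collection = [
    {'id': 'page_3.jpg', 'vec': [0.7, 0.3]},
    {'id': 'page_9.jpg', 'vec': [0.1, 0.9]},
]

query_vec = [1.0, 0.0]
# Best text chunk and best image, retrieved independently then merged
hits = retrieve(query_vec, text_collection, 1) + retrieve(query_vec, image_collection, 1)
print([h['id'] for h in hits])  # → ['chunk-1', 'page_3.jpg']
```

The key point the sketch makes: the two modalities never compete in a single ranking. Each collection returns its own top-k, which is why `similarity_top_k` and `image_similarity_top_k` are configured separately later in this guide.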
Step 1: Load mixed documents
```python
from pathlib import Path

import pdf2image
from llama_index.core import SimpleDirectoryReader

# Place your PDFs and images in a folder.
# For PDFs, use pdf2image to extract pages as images first.
def extract_pdf_images(pdf_path: str, output_dir: str):
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    images = pdf2image.convert_from_path(pdf_path, dpi=150)
    for i, img in enumerate(images):
        img.save(f'{output_dir}/page_{i+1}.jpg', 'JPEG')

extract_pdf_images('annual_report.pdf', 'data/report_images')

# Load all images from the folder
image_docs = SimpleDirectoryReader('data/report_images').load_data()
print(f'Loaded {len(image_docs)} image documents')
```
Step 2: Build the multi-modal index
```python
import qdrant_client
from llama_index.core.indices.multi_modal import MultiModalVectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Use separate vector stores for text and images
client = qdrant_client.QdrantClient(path='qdrant_local')
text_store = QdrantVectorStore(client=client, collection_name='text_collection')
image_store = QdrantVectorStore(client=client, collection_name='image_collection')

# Build the combined index
index = MultiModalVectorStoreIndex.from_documents(
    image_docs,
    vector_store=text_store,
    image_vector_store=image_store,
)
```
Qdrant is the most mature vector store for multi-modal LlamaIndex use. Chroma and SimpleVectorStore also work but have fewer multi-modal-specific optimisations.

Step 3: Query with a vision LLM
```python
from llama_index.core.schema import ImageNode
from llama_index.multi_modal_llms.openai import OpenAIMultiModal

# Set up the vision LLM
llm = OpenAIMultiModal(
    model='gpt-4o',
    max_new_tokens=1024,
)

# Create a retriever that returns both text and image nodes
retriever = index.as_retriever(
    similarity_top_k=3,        # top 3 text chunks
    image_similarity_top_k=3,  # top 3 images
)

query = 'What does the revenue trend chart show for Q3 and Q4?'
nodes = retriever.retrieve(query)

# Separate text and image nodes
image_nodes = [n for n in nodes if isinstance(n.node, ImageNode)]
text_nodes = [n for n in nodes if not isinstance(n.node, ImageNode)]
print(f'Retrieved {len(text_nodes)} text chunks, {len(image_nodes)} images')

# Build a prompt from the text context; the images are passed separately
context = '\n'.join(n.get_content() for n in text_nodes)
response = llm.complete(
    prompt=f'Context:\n{context}\n\nQuestion: {query}',
    image_documents=[n.node for n in image_nodes],
)
print(response)
```
Using the SimpleMultiModalQueryEngine
For a more integrated experience, use the built-in query engine which handles retrieval and synthesis automatically.
```python
from llama_index.core.schema import ImageNode

# as_query_engine returns a SimpleMultiModalQueryEngine when the index
# is multi-modal and a multi-modal LLM is supplied
query_engine = index.as_query_engine(
    multi_modal_llm=llm,
    similarity_top_k=3,
    image_similarity_top_k=3,
)

response = query_engine.query(
    'Describe the key trends in the revenue chart from Q1 to Q4.'
)
print(response)

# Access retrieved source nodes
for node in response.source_nodes:
    if isinstance(node.node, ImageNode):
        print(f'  Image: {node.node.image_url} (score: {node.score:.3f})')
    else:
        print(f'  Text: {node.get_content()[:100]}...')
```
Common pitfalls
| Problem | Cause | Fix |
|---|---|---|
| Images not retrieved | Images stored but not embedded | Ensure image_vector_store is set and populated |
| Poor image relevance | Generic CLIP embeddings on domain-specific charts | Add rich metadata (caption, page number, section) to ImageNode |
| Vision LLM ignores images | image_documents not passed to complete() | Always pass image_documents explicitly |
| Slow indexing | Converting whole PDF at high DPI | Use dpi=100-150; skip non-visual pages |
| Missing chart data | Chart is a vector SVG, not a raster image | Export PDFs as high-res JPEG before processing |
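The "skip non-visual pages" fix in the table above can be implemented with a simple heuristic using Pillow (already in the prerequisites): a page that is almost uniformly blank has near-zero grayscale variance. The function name and the variance threshold below are our own choices for illustration; tune the threshold against your documents.

```python
from PIL import Image, ImageStat

def is_visual_page(path: str, variance_threshold: float = 100.0) -> bool:
    """Heuristic filter: nearly blank pages have near-zero grayscale
    variance, so skip them before indexing."""
    img = Image.open(path).convert('L')  # convert to grayscale
    variance = ImageStat.Stat(img).var[0]
    return variance > variance_threshold
```

A typical use is filtering the extracted page images before loading them: `pages = [p for p in paths if is_visual_page(p)]`. This cuts both indexing time and the number of useless ImageNodes competing for retrieval slots.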
Adding metadata to images
The single biggest quality improvement for multi-modal RAG is adding descriptive metadata to ImageNode objects: a text query can then match on the caption or section name even when a generic image embedding misses the chart's semantics.
```python
from llama_index.core.schema import ImageDocument

# Load with rich metadata
image_doc = ImageDocument(
    image_path='data/report_images/page_12.jpg',
    metadata={
        'page': 12,
        'section': 'Financial Results',
        'caption': 'Figure 3: Quarterly Revenue by Region, 2023-2024',
        'chart_type': 'bar chart',
    },
)
```
If you have no captions, use a cheap vision LLM call to auto-generate a description for each image at index time. This one-time cost pays back in retrieval quality.
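One way to structure that auto-captioning pass is to keep the LLM call injectable, so it runs once per image at index time and can be swapped or cached. The sketch below is our own design, not a LlamaIndex API: `caption_fn` is a placeholder for any captioner, e.g. a closure around `OpenAIMultiModal.complete` with a "describe this figure in one sentence" prompt, as shown in Step 3.

```python
from pathlib import Path
from typing import Callable

def caption_image_dir(image_dir: str, caption_fn: Callable[[str], str]) -> list[dict]:
    """Build one metadata record per image, calling caption_fn once per file.
    The records can then be turned into ImageDocument objects with the
    caption stored as metadata, as in the example above."""
    records = []
    for i, path in enumerate(sorted(Path(image_dir).glob('*.jpg')), start=1):
        records.append({
            'image_path': str(path),
            'metadata': {'page': i, 'caption': caption_fn(str(path))},
        })
    return records
```

Because `caption_fn` is a parameter, the same pipeline works with a cheap vision model in production and a stub in tests, and you can persist the generated captions to avoid paying for the LLM calls on every re-index.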