How to build retrieval pipelines that understand charts, diagrams, and images alongside text

Why text-only RAG fails on real documents

Most enterprise documents — annual reports, research papers, technical manuals — contain charts, diagrams, and tables that carry information not present in any paragraph. A text-only RAG pipeline silently ignores all of it.

LlamaIndex's multi-modal support lets you index images alongside text, retrieve both based on a query, and feed them to a vision-capable LLM for reasoning. This guide shows you how.

Prerequisites

pip install llama-index llama-index-multi-modal-llms-openai
pip install llama-index-vector-stores-qdrant
# For PDF image extraction
pip install pdf2image pillow
 

How multi-modal indexing works

LlamaIndex stores images as ImageNode objects alongside TextNode objects in the same (or separate) vector store. Images are embedded using a CLIP-style model or a multi-modal embedding, while text uses standard embeddings. At query time, both types can be retrieved and passed to the LLM.

Component | Purpose
ImageNode | Stores image bytes or a file path, plus metadata and an embedding
TextNode | Standard text chunk with an embedding
MultiModalVectorStoreIndex | Index that handles both node types
MultiModalSimpleDirectoryReader | Loads images and text from a folder
GPT-4V / Claude | Vision-capable LLM for final answer synthesis

Step 1: Load mixed documents

from llama_index.core import SimpleDirectoryReader
from pathlib import Path
 
# Place your PDFs and images in a folder
# For PDFs, use pdf2image to extract pages as images first
import pdf2image
 
def extract_pdf_images(pdf_path: str, output_dir: str):
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    images = pdf2image.convert_from_path(pdf_path, dpi=150)
    for i, img in enumerate(images):
        img.save(f'{output_dir}/page_{i+1}.jpg', 'JPEG')
 
extract_pdf_images('annual_report.pdf', 'data/report_images')
 
# Load all images from the folder
image_docs = SimpleDirectoryReader('data/report_images').load_data()
print(f'Loaded {len(image_docs)} image documents')
 

Step 2: Build the multi-modal index

from llama_index.core import StorageContext
from llama_index.core.indices.multi_modal import MultiModalVectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore
import qdrant_client
 
# Use separate vector stores for text and images
client = qdrant_client.QdrantClient(path='qdrant_local')
 
text_store = QdrantVectorStore(
    client=client, collection_name='text_collection'
)
image_store = QdrantVectorStore(
    client=client, collection_name='image_collection'
)
 
# Register both stores in a StorageContext, then build the combined index
storage_context = StorageContext.from_defaults(
    vector_store=text_store, image_store=image_store
)
index = MultiModalVectorStoreIndex.from_documents(
    image_docs,
    storage_context=storage_context,
)
 
Qdrant is the most mature vector store for multi-modal LlamaIndex use. Chroma and SimpleVectorStore also work but have fewer multi-modal-specific optimisations.

Step 3: Query with a vision LLM

from llama_index.multi_modal_llms.openai import OpenAIMultiModal
 
# Set up the vision LLM
llm = OpenAIMultiModal(
    model='gpt-4o',
    max_new_tokens=1024,
)
 
# Create a retriever that returns both text and image nodes
retriever = index.as_retriever(
    similarity_top_k=3,             # top 3 text chunks
    image_similarity_top_k=3,        # top 3 images
)
 
query = 'What does the revenue trend chart show for Q3 and Q4?'
nodes = retriever.retrieve(query)
 
# Separate text and image nodes
from llama_index.core.schema import ImageNode
image_nodes = [n for n in nodes if isinstance(n.node, ImageNode)]
text_nodes  = [n for n in nodes if not isinstance(n.node, ImageNode)]
 
print(f'Retrieved {len(text_nodes)} text chunks, {len(image_nodes)} images')
 
# Build a prompt from the retrieved text; images are passed separately below
context = '\n'.join(n.get_content() for n in text_nodes)
 
response = llm.complete(
    prompt=f'Context:\n{context}\n\nQuestion: {query}',
    image_documents=[n.node for n in image_nodes],
)
print(response)
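 
LlamaIndex handles image encoding for you when you pass image_documents, but if you ever need to call a vision endpoint directly, local page images must become base64 data URLs first. A stdlib-only sketch (the helper name to_data_url is mine):

```python
import base64
import mimetypes

def to_data_url(path: str) -> str:
    """Encode a local image file as a data URL a vision API can consume."""
    mime = mimetypes.guess_type(path)[0] or 'image/jpeg'
    with open(path, 'rb') as f:
        encoded = base64.b64encode(f.read()).decode('ascii')
    return f'data:{mime};base64,{encoded}'
```

This is only needed when bypassing the library; the retriever-plus-complete() flow above does not require it.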
 

Using the SimpleMultiModalQueryEngine

For a more integrated experience, use the built-in query engine which handles retrieval and synthesis automatically.

from llama_index.core.query_engine import SimpleMultiModalQueryEngine
 
# as_query_engine on a multi-modal index returns a SimpleMultiModalQueryEngine
query_engine = index.as_query_engine(
    multi_modal_llm=llm,
    similarity_top_k=3,
    image_similarity_top_k=3,
)
 
response = query_engine.query(
    'Describe the key trends in the revenue chart from Q1 to Q4.'
)
print(response)
 
# Access retrieved source nodes
for node in response.source_nodes:
    if isinstance(node.node, ImageNode):
        print(f'  Image: {node.node.image_url} (score: {node.score:.3f})')
    else:
        print(f'  Text: {node.get_content()[:100]}...')
 

Common pitfalls

Problem | Cause | Fix
Images not retrieved | Images stored but not embedded | Ensure the image vector store is configured and populated
Poor image relevance | Generic CLIP embeddings on domain-specific charts | Add rich metadata (caption, page number, section) to each ImageNode
Vision LLM ignores images | image_documents not passed to complete() | Always pass image_documents explicitly
Slow indexing | Converting the whole PDF at high DPI | Use dpi=100-150 and skip non-visual pages
Missing chart data | Chart is a vector SVG, not a raster image | Export PDF pages as high-resolution JPEG before processing
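
The slow-indexing row is easy to act on: skip near-blank pages before embedding them. A heuristic sketch using Pillow, where the 99% near-white threshold is an assumption you should tune per document set:

```python
from PIL import Image

def is_mostly_blank(img: Image.Image, white_fraction: float = 0.99) -> bool:
    """True if almost every pixel is near-white (likely a blank or cover page)."""
    gray = img.convert('L')            # collapse to single-channel luminance
    hist = gray.histogram()            # 256 buckets of pixel counts
    near_white = sum(hist[245:])       # pixels with luminance >= 245
    return near_white / sum(hist) >= white_fraction
```

Inside the extraction loop, a `if is_mostly_blank(img): continue` before saving each page avoids embedding pages that carry no visual information.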

Adding metadata to images

The single biggest quality lever in multi-modal RAG is attaching descriptive metadata to each ImageNode: captions, page numbers, and section names give the retriever text to match against, which raw pixel embeddings alone cannot provide.

from llama_index.core.schema import ImageDocument
 
# Load with rich metadata
image_doc = ImageDocument(
    image_path='data/report_images/page_12.jpg',
    metadata={
        'page': 12,
        'section': 'Financial Results',
        'caption': 'Figure 3: Quarterly Revenue by Region, 2023-2024',
        'chart_type': 'bar chart',
    }
)
 
If you have no captions, use a cheap vision LLM call to auto-generate a description for each image at index time. This one-time cost pays for itself in retrieval quality.
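
A sketch of that captioning pass. The function and prompt text are my own names, and any multi-modal LLM exposing a complete(prompt, image_documents=...) method, such as the OpenAIMultiModal instance from step 3, slots into the mm_llm parameter:

```python
CAPTION_PROMPT = (
    'Describe this image in one or two sentences for a search index. '
    'If it is a chart, name the chart type, axes, and key values.'
)

def caption_images(image_docs, mm_llm):
    """Attach an auto-generated caption to each image document's metadata."""
    for doc in image_docs:
        response = mm_llm.complete(prompt=CAPTION_PROMPT, image_documents=[doc])
        doc.metadata['caption'] = str(response)
    return image_docs
```

Run this once over image_docs before building the index, so the captions are embedded alongside everything else.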