What Compilation Actually Does

DSPy compilation (optimization) finds the best prompts, instructions, and few-shot examples for your program by running it repeatedly against a training dataset and scoring the outputs with your metric. It is automated prompt engineering.

After compilation, your program has the learned prompts and few-shot examples baked in. The same Python code now produces better results because the internal prompt strings have been optimized for your specific task and data.
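All of the snippets below assume a language model has already been configured. A minimal setup might look like the following; the model identifier here is purely illustrative, not a recommendation:

```python
import dspy

# Configure the LM that both uncompiled and compiled programs will call.
# 'openai/gpt-4o-mini' is an example model string, not a recommendation.
lm = dspy.LM('openai/gpt-4o-mini', max_tokens=1000)
dspy.configure(lm=lm)
```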

The Two Most Important Optimizers

Optimizer: BootstrapFewShot
  Optimizes: few-shot examples selected from your training set
  LLM calls: low (runs the program on ~10-50 examples)
  Best for:  quick improvement with limited data; a good starting point

Optimizer: MIPRO (v2)
  Optimizes: instructions AND few-shot examples together
  LLM calls: high (hundreds to thousands of calls)
  Best for:  maximum quality when you have 100+ training examples and a clear metric

Optimizer: BootstrapFewShotWithRandomSearch
  Optimizes: BootstrapFewShot plus a random search over candidate demo sets
  LLM calls: medium
  Best for:  a good balance between BootstrapFewShot and MIPRO

Optimizer: Ensemble
  Optimizes: combines multiple compiled programs and picks the best output per input
  LLM calls: very high
  Best for:  the absolute highest quality when cost is secondary

Running BootstrapFewShot

Start here. It requires few training examples and uses modest compute. If it gives you a meaningful quality lift, move to MIPRO. If not, check your metric and dataset first.

import dspy
from dspy.teleprompt import BootstrapFewShot
 
# Your metric function — must return a float or bool
def answer_exact_match(example, prediction, trace=None):
    return prediction.answer.strip().lower() == example.answer.strip().lower()
 
# Your training data — list of dspy.Example objects
trainset = [
    dspy.Example(question='What is 2+2?', answer='4').with_inputs('question'),
    dspy.Example(question='Capital of France?', answer='Paris').with_inputs('question'),
    # ... at least 20 examples for BootstrapFewShot
]
 
# Your program
program = dspy.ChainOfThought('question -> answer')
 
# Compile
optimizer = BootstrapFewShot(
    metric=answer_exact_match,
    max_bootstrapped_demos=4,   # max few-shot examples to add
    max_labeled_demos=16,        # max labeled demos taken directly from trainset
)
 
compiled_program = optimizer.compile(program, trainset=trainset)
 
# Save for reuse
compiled_program.save('compiled_qa.json')
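
Exact string match is brittle in practice. A slightly more forgiving metric (a hypothetical helper, not part of DSPy) might normalize case, whitespace, and surrounding punctuation before comparing:

```python
import string

def normalize(text):
    """Lowercase, trim whitespace, and strip surrounding punctuation."""
    return text.strip().lower().strip(string.punctuation + ' ')

def answer_match_normalized(example, prediction, trace=None):
    # Same signature DSPy expects: (example, prediction, trace=None).
    return normalize(prediction.answer) == normalize(example.answer)
```

You can pass this to the optimizer in place of answer_exact_match; the looser comparison avoids penalizing answers that differ only by a trailing period or capitalization.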
 

Running MIPRO

MIPRO optimizes both the task instructions and the few-shot examples in a joint search. Use it when BootstrapFewShot has reached its ceiling and you have 100+ examples.

from dspy.teleprompt import MIPROv2
 
optimizer = MIPROv2(
    metric=answer_exact_match,
    auto='medium',  # 'light' / 'medium' / 'heavy' — controls LLM call budget
    num_threads=8,
)
 
compiled_program = optimizer.compile(
    program,
    trainset=trainset,
    requires_permission_to_run=False,  # disable interactive confirmation
)
 
MIPRO with auto='heavy' can make thousands of LLM calls. Always estimate cost before running: with claude-sonnet-4-6, expect $5-20 for a medium compilation on 100 examples. Set auto='light' first to validate your setup.
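
A back-of-the-envelope estimate can be scripted before kicking off a run. The call counts and per-token prices below are illustrative assumptions, not published figures:

```python
def estimate_compile_cost(num_calls, avg_prompt_tokens, avg_completion_tokens,
                          price_in_per_mtok, price_out_per_mtok):
    """Rough upper bound on optimizer spend, in dollars."""
    input_cost = num_calls * avg_prompt_tokens * price_in_per_mtok / 1_000_000
    output_cost = num_calls * avg_completion_tokens * price_out_per_mtok / 1_000_000
    return input_cost + output_cost

# Hypothetical numbers: 2,000 calls, ~1,500 prompt / ~300 completion tokens,
# $3 per million input tokens and $15 per million output tokens.
cost = estimate_compile_cost(2000, 1500, 300, 3.0, 15.0)
print(f'~${cost:.2f}')  # → ~$18.00
```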

Dataset Requirements

Your training data must use dspy.Example objects with .with_inputs() to tell DSPy which fields are inputs and which are expected outputs.

# Correct: mark which fields are inputs
example = dspy.Example(
    document='The French Revolution began in 1789...',
    question='When did the French Revolution begin?',
    answer='1789',
).with_inputs('document', 'question')  # 'answer' becomes the expected output
 
# Minimum dataset sizes:
# BootstrapFewShot: 20+ training, 50+ validation
# MIPRO: 100+ training, 50+ validation
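
One way to carve a labeled dataset into train and validation splits while enforcing the minimums above is a small helper like this (hypothetical; it works on any list, including lists of dspy.Example objects):

```python
import random

def split_dataset(examples, train_frac=0.7, min_train=20, min_val=50, seed=0):
    """Shuffle and split; raise if either split is below the stated minimum."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    train, val = shuffled[:cut], shuffled[cut:]
    if len(train) < min_train or len(val) < min_val:
        raise ValueError(f'too few examples: {len(train)} train / {len(val)} val')
    return train, val
```

The seeded shuffle keeps the split reproducible across compilation runs, so scores stay comparable.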
 

Loading and Using a Compiled Program

# Load a previously compiled program
program = dspy.ChainOfThought('question -> answer')
program.load('compiled_qa.json')
 
# Use it identically to the uncompiled version
result = program(question='What year did World War II end?')
print(result.answer)
 

Deciding Whether to Compile

Compilation is worthwhile when:

  • Your unoptimized baseline is below 80% on your metric (room to improve)
  • You have at least 50 labeled examples (20+ for BootstrapFewShot)
  • You plan to run the program at scale (compilation cost amortizes over many runs)
  • You have a reliable metric function (garbage metric = garbage optimization)

Skip compilation when:

  • You have fewer than 20 labeled examples
  • The task is too open-ended to define a meaningful metric
  • You are exploring or prototyping (compile once you know the task is stable)
  • Unoptimized quality is already above 90%
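
Illustratively, the two checklists collapse into a simple go/no-go rule of thumb (thresholds taken from the bullets above; the function and its arguments are hypothetical):

```python
def should_compile(baseline_score, num_labeled, metric_is_reliable,
                   task_is_stable, runs_at_scale):
    """Rough go/no-go decision mirroring the checklists above."""
    if baseline_score >= 0.90:
        return False          # already good enough
    if num_labeled < 20 or not metric_is_reliable or not task_is_stable:
        return False          # fix data, metric, or scope first
    # Clear headroom, or enough scale to amortize the compilation cost.
    return baseline_score < 0.80 or runs_at_scale

print(should_compile(0.65, 120, True, True, True))   # → True
print(should_compile(0.95, 120, True, True, True))   # → False
```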