## The Problem DSPy Solves
Hand-written prompts are brittle. Change the model version, change the task slightly, or add a new requirement, and your carefully tuned prompt produces worse results. You tweak it. You test it. You tweak it again. This is not engineering — it is guessing.
DSPy (Declarative Self-improving Python) replaces prompt strings with Python programs. You define what you want (inputs, outputs, reasoning steps) as code. An optimizer finds the best prompts, instructions, and few-shot examples automatically — by running your program against a dataset and measuring a metric you define.
## The Core Abstraction: Signatures
A Signature declares what a language model should do: its inputs and outputs. DSPy uses this declaration to generate the prompt automatically. You never write the prompt string.
```python
import dspy

# Configure the LM
lm = dspy.LM('anthropic/claude-sonnet-4-6', api_key='YOUR_KEY')
dspy.configure(lm=lm)

# A simple signature: question -> answer
class BasicQA(dspy.Signature):
    '''Answer questions with short, factual responses.'''
    question: str = dspy.InputField()
    answer: str = dspy.OutputField(desc='Short answer, 1-2 sentences')

# A more complex signature: document + question -> answer + citations
class DocumentQA(dspy.Signature):
    '''Answer a question based on a provided document. Cite your sources.'''
    document: str = dspy.InputField(desc='The source document')
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()
    citations: list[str] = dspy.OutputField(desc='List of quoted passages supporting the answer')
```
The docstring of your Signature class becomes part of the task description in the generated prompt. Write it as a clear instruction to the model.

## Modules: The Building Blocks
Modules wrap Signatures with reasoning strategies. The most important built-in modules:
| Module | What it does | When to use |
|---|---|---|
| dspy.Predict | Direct input → output, no explicit reasoning | Simple classification, extraction, short answers |
| dspy.ChainOfThought | Generates a reasoning trace before the final answer | Multi-step reasoning, math, complex questions |
| dspy.ReAct | Alternates reasoning and tool use in a loop | Tasks requiring tool calls and iterative reasoning |
| dspy.Retrieve | Retrieves passages from a vector store before answering | RAG pipelines |
| dspy.ProgramOfThought | Generates and executes Python code to solve the problem | Math, data manipulation tasks |
```python
# Predict: straightforward
predict = dspy.Predict(BasicQA)
result = predict(question='What is the capital of France?')
print(result.answer)  # 'Paris'

# ChainOfThought: adds reasoning before the answer
cot = dspy.ChainOfThought(BasicQA)
result = cot(question='If a train travels 120 km in 2 hours, what is its speed?')
print(result.reasoning)  # 'Speed = distance / time = 120 / 2 = 60...'
print(result.answer)     # '60 km/h'
```
## Composing Modules into Programs
Real programs chain multiple modules. A program is just a Python class that inherits from dspy.Module and uses other modules in its forward() method.
```python
class ResearchAndSummarise(dspy.Module):
    def __init__(self):
        super().__init__()  # required so DSPy can track sub-modules
        self.search = dspy.Retrieve(k=3)  # retrieve top 3 passages
        self.summarise = dspy.ChainOfThought('context, question -> summary')

    def forward(self, question: str) -> str:
        # Step 1: retrieve relevant passages
        passages = self.search(question).passages
        context = '\n\n'.join(passages)
        # Step 2: summarise with reasoning
        result = self.summarise(context=context, question=question)
        return result.summary

program = ResearchAndSummarise()
answer = program('What are the main causes of inflation?')
```
## Running Without Compilation First
You do not have to compile (optimise) a DSPy program to use it. An unoptimised program still works — it uses default prompting strategies. Compilation makes it better. Start without it, measure quality, then compile if needed.
```python
# Unoptimised: works immediately
program = ResearchAndSummarise()
result = program('How does compound interest work?')

# Evaluate quality
from dspy.evaluate import Evaluate

def answer_quality(example, prediction, trace=None):
    # Your metric: returns True/False or a float 0-1
    return example.expected_keyword in prediction.lower()

evaluator = Evaluate(devset=your_dataset, metric=answer_quality, num_threads=8)
score = evaluator(program)
print(f'Unoptimised score: {score:.2%}')  # e.g. 62%
```
## When DSPy Is the Right Choice
- You have a well-defined task with measurable quality (you can write a metric function)
- You plan to run the pipeline at scale and need consistent, reliable output
- You want to switch LLM providers without rewriting prompts
- You are building a pipeline that will evolve over time (new requirements = recompile, not rewrite prompts)
DSPy adds overhead: you need a dataset, a metric, and compilation time. For one-off tasks or simple single-turn queries, a hand-written prompt is still the right tool.