The Problem DSPy Solves

Hand-written prompts are brittle. Change the model version, change the task slightly, or add a new requirement, and your carefully tuned prompt produces worse results. You tweak it. You test it. You tweak it again. This is not engineering — it is guessing.

DSPy (Declarative Self-improving Python) replaces prompt strings with Python programs. You define what you want (inputs, outputs, reasoning steps) as code. An optimizer finds the best prompts, instructions, and few-shot examples automatically — by running your program against a dataset and measuring a metric you define.

The Core Abstraction: Signatures

A Signature declares what a language model should do: its inputs and outputs. DSPy uses this declaration to generate the prompt automatically. You never write the prompt string.

import dspy
 
# Configure the LM
lm = dspy.LM('anthropic/claude-sonnet-4-6', api_key='YOUR_KEY')
dspy.configure(lm=lm)
 
# A simple signature: question -> answer
class BasicQA(dspy.Signature):
    '''Answer questions with short, factual responses.'''
    question: str = dspy.InputField()
    answer: str = dspy.OutputField(desc='Short answer, 1-2 sentences')
 
# A more complex signature: document + question -> answer + citations
class DocumentQA(dspy.Signature):
    '''Answer a question based on a provided document. Cite your sources.'''
    document: str = dspy.InputField(desc='The source document')
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()
    citations: list[str] = dspy.OutputField(desc='List of quoted passages supporting the answer')
 
The docstring of your Signature class becomes part of the task description in the generated prompt. Write it as a clear instruction to the model.

Modules: The Building Blocks

Modules wrap Signatures with reasoning strategies. The most important built-in modules:

  • dspy.Predict: direct input → output with no explicit reasoning. Best for simple classification, extraction, and short answers.
  • dspy.ChainOfThought: generates a reasoning trace before the final answer. Best for multi-step reasoning, math, and complex questions.
  • dspy.ReAct: alternates reasoning and tool use in a loop. Best for tasks that need tool calls and iterative reasoning.
  • dspy.Retrieve: fetches passages from a vector store before answering. Best for RAG pipelines.
  • dspy.ProgramOfThought: generates and executes Python code to solve the problem. Best for math and data-manipulation tasks.

# Predict: straightforward
predict = dspy.Predict(BasicQA)
result = predict(question='What is the capital of France?')
print(result.answer)  # 'Paris'
 
# ChainOfThought: adds reasoning before the answer
cot = dspy.ChainOfThought(BasicQA)
result = cot(question='If a train travels 120 km in 2 hours, what is its speed?')
print(result.reasoning)  # 'Speed = distance / time = 120 / 2 = 60...'
print(result.answer)     # '60 km/h'
 

Composing Modules into Programs

Real programs chain multiple modules. A program is just a Python class that inherits from dspy.Module and uses other modules in its forward() method.

class ResearchAndSummarise(dspy.Module):
    def __init__(self):
        super().__init__()
        # Retrieve the top 3 passages; requires a retrieval model
        # configured via dspy.configure(rm=...)
        self.search = dspy.Retrieve(k=3)
        self.summarise = dspy.ChainOfThought('context, question -> summary')
 
    def forward(self, question: str) -> str:
        # Step 1: retrieve relevant passages
        passages = self.search(question).passages
        context = '\n\n'.join(passages)
 
        # Step 2: summarise with reasoning
        result = self.summarise(context=context, question=question)
        return result.summary
 
program = ResearchAndSummarise()
answer = program('What are the main causes of inflation?')
 

Running Without Compilation First

You do not have to compile (optimise) a DSPy program to use it. An unoptimised program still works — it uses default prompting strategies. Compilation makes it better. Start without it, measure quality, then compile if needed.

# Unoptimised: works immediately
program = ResearchAndSummarise()
result = program('How does compound interest work?')
 
# Evaluate quality
from dspy.evaluate import Evaluate
 
def answer_quality(example, prediction, trace=None):
    # Your metric: returns True/False or a float 0-1.
    # Here `prediction` is the string our program's forward() returns.
    return example.expected_keyword.lower() in prediction.lower()
 
evaluator = Evaluate(devset=your_dataset, metric=answer_quality, num_threads=8)
score = evaluator(program)
print(f'Unoptimised score: {score:.1f}%')  # e.g. 62.0% (Evaluate already returns a percentage)
 

When DSPy Is the Right Choice

  • You have a well-defined task with measurable quality (you can write a metric function)
  • You plan to run the pipeline at scale and need consistent, reliable output
  • You want to switch LLM providers without rewriting prompts
  • You are building a pipeline that will evolve over time (new requirements mean recompiling, not rewriting prompts)

DSPy adds overhead: you need a dataset, a metric, and compilation time. For one-off tasks or simple single-turn queries, a hand-written prompt is still the right tool.