The Structured Output Problem
Getting an LLM to return structured data — a JSON object with specific fields and types — is one of the most common tasks in AI applications. The naive approach is to ask nicely in the prompt and then parse the response. This works until it doesn't.
LLMs occasionally return malformed JSON, add narrative text before or after the JSON block, use different field names than you asked for, or include fields that fail your validation logic. The result: JSON parse errors and a pipeline that fails unpredictably.
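To make those failure modes concrete, here is a sketch of the naive approach. The helper name and the fallback regex are illustrative, not from any library, and the fallback still cannot guarantee the fields or types you asked for:

```python
import json
import re

def parse_llm_json(raw: str) -> dict:
    """Naive parsing: try the whole response, then fall back to the
    first {...} block. Neither step guarantees the expected fields."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        match = re.search(r"\{.*\}", raw, re.DOTALL)
        if match is None:
            raise ValueError("no JSON object found in response")
        # May still raise, or succeed with the wrong field names/types.
        return json.loads(match.group(0))

# A typical "helpful" response that breaks plain json.loads:
raw = 'Sure! Here is the data:\n{"name": "Alice", "age": 28}\nHope that helps.'
print(parse_llm_json(raw))  # {'name': 'Alice', 'age': 28}
```

This recovers the happy path, but validation, type coercion, and retries are still your problem, which is exactly the gap Instructor fills.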
Instructor is a Python library that wraps LLM clients and uses Pydantic models to define your expected output. If the LLM returns something that fails validation, Instructor automatically retries — sending the validation error back to the LLM and asking it to fix the output.
Installation
pip install instructor
# Instructor supports: openai, anthropic, google-generativeai, litellm, cohere, and more
First Example: Extracting a Typed Object
import instructor
from anthropic import Anthropic
from pydantic import BaseModel
# 1. Define your output schema with Pydantic
class UserInfo(BaseModel):
    name: str
    age: int
    email: str
# 2. Patch the Anthropic client with Instructor
client = instructor.from_anthropic(Anthropic())
# 3. Call with response_model instead of getting raw text
user = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=256,
    messages=[{
        "role": "user",
        "content": "Extract the user info: Alice is 28 years old and her email is alice@example.com"
    }],
    response_model=UserInfo
)
print(user.name) # Alice
print(user.age) # 28
print(user.email) # alice@example.com
print(type(user)) # <class '__main__.UserInfo'>
user is a real Python object, not a dictionary. You get type safety, IDE autocomplete, and automatic validation — age will always be an int, not '28'.
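The coercion behaviour is plain Pydantic and can be verified without an API call (Pydantic v2 assumed):

```python
from pydantic import BaseModel, ValidationError

class UserInfo(BaseModel):
    name: str
    age: int
    email: str

# Pydantic's default (lax) mode coerces the string "28" to the int 28,
# so even a model that quotes numbers still yields a typed field.
user = UserInfo(name="Alice", age="28", email="alice@example.com")
print(user.age, type(user.age))  # 28 <class 'int'>

# Values that cannot be coerced raise a ValidationError; this is the
# kind of error Instructor feeds back to the LLM on retry.
try:
    UserInfo(name="Alice", age="twenty-eight", email="alice@example.com")
except ValidationError as e:
    print(e.errors()[0]["type"])  # int_parsing
```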
OpenAI and Gemini Support
import instructor
from openai import OpenAI
import google.generativeai as genai
from pydantic import BaseModel
class ProductDetails(BaseModel):
    name: str
    price: float
    in_stock: bool
# OpenAI
openai_client = instructor.from_openai(OpenAI())
product = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "The Blue Widget costs $19.99 and is available."}],
    response_model=ProductDetails
)
# Gemini
gemini_client = instructor.from_gemini(
    client=genai.GenerativeModel(model_name="gemini-2.0-flash")
)
product = gemini_client.chat.completions.create(
    messages=[{"role": "user", "content": "The Blue Widget costs $19.99 and is available."}],
    response_model=ProductDetails
)
The calling pattern is identical across providers — only the client initialisation changes. Switching models requires changing one line.
Validation with Pydantic
Instructor's real power comes from Pydantic validators. If a field fails validation, the error message is automatically sent back to the LLM for correction:
import instructor
from anthropic import Anthropic
from pydantic import BaseModel, field_validator, EmailStr
from typing import Literal
class SupportTicket(BaseModel):
    category: Literal["billing", "technical", "account", "other"]
    priority: Literal["low", "medium", "high", "urgent"]
    summary: str
    customer_email: EmailStr  # validates email format; requires pip install "pydantic[email]"

    @field_validator("summary")
    @classmethod
    def summary_must_be_actionable(cls, v: str) -> str:
        if len(v) < 10:
            raise ValueError("Summary must be at least 10 characters")
        return v
client = instructor.from_anthropic(Anthropic())
ticket = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": "Customer alice@example.com says they can't log in. Seems urgent."
    }],
    response_model=SupportTicket,
    max_retries=3  # retry up to 3 times on validation failure
)
print(ticket.category) # technical
print(ticket.priority) # urgent
Use Literal types for categorical fields. The LLM knows exactly which values are acceptable, and Pydantic will reject anything outside the allowed set, triggering a retry with the error message.
Nested Models
Complex structures work naturally because Instructor uses standard Pydantic nesting:
from pydantic import BaseModel
from typing import List
import instructor
from anthropic import Anthropic
class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float

class Invoice(BaseModel):
    vendor: str
    invoice_number: str
    line_items: List[LineItem]
    total: float
client = instructor.from_anthropic(Anthropic())
invoice_text = """
Invoice #INV-2026-042 from Acme Supplies:
- 5x Widget A at $10.00 each
- 2x Widget B at $25.00 each
Total: $100.00
"""
invoice = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=1024,
    messages=[{"role": "user", "content": f"Extract invoice data:\n{invoice_text}"}],
    response_model=Invoice
)
for item in invoice.line_items:
    print(f"{item.quantity}x {item.description}: ${item.unit_price * item.quantity:.2f}")
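Nested models can also carry cross-field checks. The following is one possible extension, not taken from Instructor's docs: a model-level validator (Pydantic v2's model_validator) that rejects an invoice whose total does not match its line items. Under Instructor, a failure here would be sent back to the LLM like any other validation error. The CheckedInvoice name and the 0.01 tolerance are illustrative:

```python
from typing import List
from pydantic import BaseModel, ValidationError, model_validator

class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float

class CheckedInvoice(BaseModel):
    vendor: str
    invoice_number: str
    line_items: List[LineItem]
    total: float

    @model_validator(mode="after")
    def total_matches_line_items(self) -> "CheckedInvoice":
        # Cross-field check: the stated total must equal the computed sum.
        computed = sum(i.quantity * i.unit_price for i in self.line_items)
        if abs(computed - self.total) > 0.01:
            raise ValueError(f"total {self.total} != sum of line items {computed}")
        return self

items = [
    LineItem(description="Widget A", quantity=5, unit_price=10.0),
    LineItem(description="Widget B", quantity=2, unit_price=25.0),
]
ok = CheckedInvoice(vendor="Acme Supplies", invoice_number="INV-2026-042",
                    line_items=items, total=100.0)

try:
    CheckedInvoice(vendor="Acme Supplies", invoice_number="INV-2026-042",
                   line_items=items, total=90.0)
except ValidationError as e:
    print("rejected:", e.errors()[0]["msg"])
```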
Optional Fields and Partial Extraction
Not all fields will be present in every input. Use Optional to handle missing data cleanly:
from pydantic import BaseModel
from typing import Optional, List
class PersonProfile(BaseModel):
    name: str
    age: Optional[int] = None  # might not be mentioned
    email: Optional[str] = None  # might not be mentioned
    skills: List[str] = []  # defaults to empty list
    bio: Optional[str] = None
When the LLM cannot extract a field (because the information is not in the input), it returns None for Optional fields rather than hallucinating a value.
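Downstream code can then distinguish missing from present data with ordinary Pydantic methods, no LLM call needed. A minimal sketch (model_dump's exclude_none parameter drops the fields that came back as None):

```python
from typing import List, Optional
from pydantic import BaseModel

class PersonProfile(BaseModel):
    name: str
    age: Optional[int] = None
    email: Optional[str] = None
    skills: List[str] = []
    bio: Optional[str] = None

# Only name and skills were extractable from the input.
p = PersonProfile(name="Bob", skills=["python"])
print(p.age is None)                    # True
print(p.model_dump(exclude_none=True))  # {'name': 'Bob', 'skills': ['python']}
```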
When Instructor is NOT the Right Choice
- When you need streaming responses — Instructor's retry loop requires a complete response before validating. Use streaming only for the final call after validation logic is complete.
- When the output is long-form prose — Instructor is optimised for structured extraction, not generating articles or code. The overhead of wrapping prose in a Pydantic object adds no value.
- When you need maximum cost efficiency on very high volumes — each retry is an additional LLM call. Design your schemas to minimise validation failures rather than relying on retries.
- When model-native function calling is sufficient — OpenAI's structured output mode and Anthropic's tool use can produce JSON without the Instructor wrapper. If you are locked to one provider and do not need cross-provider compatibility, native function calling may be cleaner.
Summary
Instructor solves the structured output problem cleanly: define a Pydantic model, pass it as response_model, get a typed Python object back. The automatic retry loop handles the cases where the LLM produces invalid output — you write validation logic once in Pydantic, and Instructor handles the feedback loop. Start with simple models and validators, then progressively add complexity as your extraction requirements grow.