Tracing, evals, prompt management, and cost dashboards in one place

The observability gap in LLM apps

You can inspect a database query. You can trace a web request. But most LLM apps are black boxes — you send a prompt, something happens inside the model, you get a response, and if it is wrong you have no idea why.

LangSmith is LangChain's observability and evaluation platform. It captures every LLM call, tool call, retrieval step, and chain execution as a structured trace tree. You can inspect what went wrong, evaluate outputs systematically, and track costs — all without changing your application logic.

Setup

pip install langsmith
# Optional but useful:
pip install langchain langchain-openai
 
# Set environment variables (add to .env)
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=ls__...
export LANGCHAIN_PROJECT=my-project  # traces group by project
 
That is all it takes for LangChain and LangGraph apps: once those environment variables are set, every chain, model, and tool call is traced automatically, with no code changes.
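
Because tracing silently turns off when a variable is missing, it can help to fail loudly at startup. A minimal sanity check (a sketch, not a LangSmith API):

```python
import os

# Names the LangChain integration reads at import time
REQUIRED = ['LANGCHAIN_TRACING_V2', 'LANGCHAIN_API_KEY', 'LANGCHAIN_PROJECT']

def tracing_configured() -> list[str]:
    """Return the names of any required tracing variables that are missing."""
    return [name for name in REQUIRED if not os.environ.get(name)]

missing = tracing_configured()
if missing:
    print(f'Tracing disabled, missing: {", ".join(missing)}')
```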

Tracing non-LangChain code with @traceable

For apps using the OpenAI SDK, Anthropic SDK, or any other LLM library, use the @traceable decorator to add tracing manually. The run_type argument ('llm', 'chain', 'tool', 'retriever') controls how each span is rendered in the trace tree.

from langsmith import traceable
from openai import OpenAI
 
client = OpenAI()
 
@traceable(run_type='llm', name='summarise-call')
def summarise(text: str) -> str:
    response = client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[{'role': 'user', 'content': f'Summarise in one sentence: {text}'}]
    )
    return response.choices[0].message.content
 
@traceable(run_type='chain', name='process-document')
def process_document(doc: str) -> dict:
    # This creates a parent span; summarise() creates a child span
    summary = summarise(doc)
    keywords = extract_keywords(doc)  # another @traceable function
    return {'summary': summary, 'keywords': keywords}
 
@traceable(run_type='tool', name='extract-keywords')
def extract_keywords(text: str) -> list[str]:
    response = client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[{'role': 'user', 'content': f'List 5 keywords: {text}'}]
    )
    return response.choices[0].message.content.split(', ')
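
The split(', ') at the end of extract_keywords is brittle: models often return newline-separated or numbered lists instead of a flat comma list. A more defensive parser, sketched here independently of any LangSmith API:

```python
import re

def parse_keywords(raw: str, limit: int = 5) -> list[str]:
    """Parse an LLM keyword list that may be comma-, newline-, or bullet-delimited."""
    parts = re.split(r'[,\n]+', raw)
    keywords = []
    for part in parts:
        # Strip list markers like "1." or "- " plus surrounding whitespace
        cleaned = re.sub(r'^\s*(?:\d+[.)]|[-*])\s*', '', part).strip()
        if cleaned:
            keywords.append(cleaned)
    return keywords[:limit]
```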
 

Tagging traces for filtering

Add metadata to traces so you can filter by user, version, or environment in the LangSmith UI.

from langsmith import traceable
from langsmith.run_helpers import get_current_run_tree
 
@traceable(run_type='chain')
def my_agent(query: str, user_id: str) -> str:
    # Tag the current trace with metadata
    run_tree = get_current_run_tree()
    if run_tree:
        run_tree.tags = ['production', 'v2.1']
        run_tree.metadata = {
            'user_id': user_id,
            'app_version': '2.1.0',
            'env': 'production',
        }
    # ... agent logic
    return 'response'
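
As an alternative to mutating the run tree inside the function, callers of a @traceable function can supply tags and metadata at call time through the langsmith_extra keyword argument that the decorator accepts. A small helper (hypothetical, not a LangSmith API) keeps that payload consistent across call sites:

```python
def tracing_extra(user_id: str, env: str = 'production', version: str = '2.1.0') -> dict:
    """Build a langsmith_extra payload with consistent tags and metadata."""
    return {
        'tags': [env, f'v{version}'],
        'metadata': {'user_id': user_id, 'app_version': version, 'env': env},
    }

# Usage: my_agent(query, user_id, langsmith_extra=tracing_extra(user_id))
```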
 

Datasets and evaluations

Build a dataset of input/output pairs, then run evaluators to measure quality. LangSmith can use LLM-as-judge evaluators or custom Python functions.

from langsmith import Client
 
client = Client()
 
# 1. Create a dataset
dataset = client.create_dataset(
    dataset_name='qa-test-set',
    description='Questions for testing the Q&A agent'
)
 
# 2. Add examples
examples = [
    {'inputs': {'question': 'What is the capital of France?'}, 'outputs': {'answer': 'Paris'}},
    {'inputs': {'question': 'Who wrote Hamlet?'}, 'outputs': {'answer': 'Shakespeare'}},
]
client.create_examples(inputs=[e['inputs'] for e in examples],
                       outputs=[e['outputs'] for e in examples],
                       dataset_id=dataset.id)
 
# 3. Define the function to evaluate (summarise() from earlier stands in for a real Q&A agent)
def my_qa_agent(inputs: dict) -> dict:
    answer = summarise(inputs['question'])
    return {'answer': answer}
 
# 4. Run evaluation with LLM-as-judge
from langsmith.evaluation import evaluate, LangChainStringEvaluator
 
results = evaluate(
    my_qa_agent,
    data='qa-test-set',
    evaluators=[
        LangChainStringEvaluator('qa'),  # LLM-as-judge; uses the default LLM unless you pass config={'llm': ...}
    ],
    experiment_prefix='baseline-v1',
)
print(results)
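
evaluate() also accepts plain Python functions as evaluators. A minimal containment scorer, shown as a standalone sketch (the (outputs, reference_outputs) signature follows LangSmith's function-evaluator convention; check it against your SDK version):

```python
def exact_match(outputs: dict, reference_outputs: dict) -> dict:
    """Score 1 if the predicted answer contains the reference answer (case-insensitive)."""
    predicted = (outputs.get('answer') or '').lower()
    expected = (reference_outputs.get('answer') or '').lower()
    return {'key': 'exact_match', 'score': 1 if expected and expected in predicted else 0}
```

Custom evaluators like this are cheap and deterministic, so they make a good first pass before spending tokens on an LLM-as-judge.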
 

Prompt management

Store and version prompts in LangSmith Hub. Pull the latest version at runtime — no code deploys needed for prompt updates.

from langchain import hub
 
# Push a prompt to Hub
from langchain_core.prompts import ChatPromptTemplate
 
prompt = ChatPromptTemplate.from_messages([
    ('system', 'You are a helpful assistant. Be concise.'),
    ('human', '{question}'),
])
hub.push('your-org/qa-prompt', prompt)  # push to Hub
 
# Pull in production (always gets latest version)
production_prompt = hub.pull('your-org/qa-prompt')
 
# Pin to a specific commit for stable deployments
stable_prompt = hub.pull('your-org/qa-prompt:abc123def')
 

Cost tracking

LangSmith automatically tracks token usage and estimated costs for every traced call. In the UI, filter by project, date range, or model to see where your budget is going.

# Access cost data programmatically
from langsmith import Client
from datetime import datetime, timedelta
 
client = Client()
 
# Get runs from the last 7 days
runs = list(client.list_runs(
    project_name='my-project',
    start_time=datetime.now() - timedelta(days=7),
    run_type='llm',
))
 
total_tokens = sum(r.total_tokens or 0 for r in runs)
total_cost = sum(r.total_cost or 0 for r in runs)
print(f'Last 7 days: {total_tokens:,} tokens, ${total_cost:.2f}')
 
# Cost by model
by_model = {}
for r in runs:
    model = (r.extra or {}).get('invocation_params', {}).get('model_name', 'unknown')  # extra can be None
    by_model[model] = by_model.get(model, 0) + (r.total_cost or 0)
for model, cost in sorted(by_model.items(), key=lambda x: -x[1]):
    print(f'  {model}: ${cost:.4f}')
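
The same run list can be bucketed by day to spot cost spikes. A sketch over plain (start_time, total_cost) pairs rather than live Run objects:

```python
from collections import defaultdict
from datetime import datetime

def cost_by_day(runs: list[tuple[datetime, float]]) -> dict[str, float]:
    """Aggregate (start_time, total_cost) pairs into per-day totals."""
    totals: dict[str, float] = defaultdict(float)
    for start_time, cost in runs:
        totals[start_time.strftime('%Y-%m-%d')] += cost or 0
    return dict(totals)
```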
 

The LangSmith workflow

  1. Deploy your app with LANGCHAIN_TRACING_V2=true
  2. Reproduce a user-reported failure by filtering traces by user_id or session
  3. Inspect the full trace tree — find the exact node where the output went wrong
  4. Add the failing input to a dataset
  5. Fix your prompt or code, run the dataset through evaluation, compare scores
  6. Deploy the fix, monitor the dashboard for improvement