Tracing, evals, prompt management, and cost dashboards in one place
The observability gap in LLM apps
You can inspect a database query. You can trace a web request. But most LLM apps are black boxes — you send a prompt, something happens inside the model, you get a response, and if it is wrong you have no idea why.
LangSmith is LangChain's observability and evaluation platform. It captures every LLM call, tool call, retrieval step, and chain execution as a structured trace tree. You can inspect what went wrong, evaluate outputs systematically, and track costs — all without changing your application logic.
Setup
```bash
pip install langsmith
# Optional but useful:
pip install langchain langchain-openai

# Set environment variables (add to .env)
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=ls__...
export LANGCHAIN_PROJECT=my-project  # traces group by project
```
That is all it takes for LangChain/LangGraph apps. Once those env vars are set, every LangChain call is traced automatically; no code changes are needed.
Tracing non-LangChain code with @traceable
For apps using the OpenAI SDK, Anthropic SDK, or any other LLM library, use the `@traceable` decorator to add tracing manually.
```python
from langsmith import traceable
from openai import OpenAI

client = OpenAI()

@traceable(run_type='llm', name='summarise-call')
def summarise(text: str) -> str:
    response = client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[{'role': 'user', 'content': f'Summarise in one sentence: {text}'}],
    )
    return response.choices[0].message.content

@traceable(run_type='chain', name='process-document')
def process_document(doc: str) -> dict:
    # This creates a parent span; summarise() creates a child span
    summary = summarise(doc)
    keywords = extract_keywords(doc)  # another @traceable function
    return {'summary': summary, 'keywords': keywords}

@traceable(run_type='tool', name='extract-keywords')
def extract_keywords(text: str) -> list[str]:
    response = client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[{'role': 'user', 'content': f'List 5 keywords: {text}'}],
    )
    return response.choices[0].message.content.split(', ')
```
Tagging traces for filtering
Add metadata to traces so you can filter by user, version, or environment in the LangSmith UI.
```python
from langsmith import traceable
from langsmith.run_helpers import get_current_run_tree

@traceable(run_type='chain')
def my_agent(query: str, user_id: str) -> str:
    # Tag the current trace with metadata
    run_tree = get_current_run_tree()
    if run_tree:
        run_tree.tags = ['production', 'v2.1']
        run_tree.metadata = {
            'user_id': user_id,
            'app_version': '2.1.0',
            'env': 'production',
        }
    # ... agent logic
    return 'response'
Datasets and evaluations
Build a dataset of input/output pairs, then run evaluators to measure quality. LangSmith can use LLM-as-judge evaluators or custom Python functions.
```python
from langsmith import Client
from langsmith.evaluation import evaluate, LangChainStringEvaluator

client = Client()

# 1. Create a dataset
dataset = client.create_dataset(
    dataset_name='qa-test-set',
    description='Questions for testing the Q&A agent',
)

# 2. Add examples
examples = [
    {'inputs': {'question': 'What is the capital of France?'}, 'outputs': {'answer': 'Paris'}},
    {'inputs': {'question': 'Who wrote Hamlet?'}, 'outputs': {'answer': 'Shakespeare'}},
]
client.create_examples(
    inputs=[e['inputs'] for e in examples],
    outputs=[e['outputs'] for e in examples],
    dataset_id=dataset.id,
)

# 3. Define the function to evaluate
def my_qa_agent(inputs: dict) -> dict:
    answer = summarise(inputs['question'])  # your LLM function
    return {'answer': answer}

# 4. Run the evaluation with an LLM-as-judge evaluator
results = evaluate(
    my_qa_agent,
    data='qa-test-set',
    evaluators=[
        LangChainStringEvaluator('qa'),  # uses the default judge LLM
    ],
    experiment_prefix='baseline-v1',
)
print(results)
```
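The `evaluators` list also accepts plain Python functions. A custom evaluator receives the run and the reference example and returns a dict with a key and a score; here is a minimal exact-match sketch (the `answer` field name follows the dataset above):

```python
def exact_match(run, example) -> dict:
    # run.outputs is what my_qa_agent returned for this example;
    # example.outputs is the reference answer stored in the dataset.
    predicted = (run.outputs or {}).get('answer', '')
    expected = (example.outputs or {}).get('answer', '')
    score = int(predicted.strip().lower() == expected.strip().lower())
    return {'key': 'exact_match', 'score': score}

# Mix and match with LLM-as-judge evaluators:
# results = evaluate(my_qa_agent, data='qa-test-set', evaluators=[exact_match])
```

Deterministic evaluators like this are cheap to run on every commit, so they make a good first line of defence before the slower LLM-as-judge checks.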
Prompt management
Store and version prompts in LangSmith Hub. Pull the latest version at runtime — no code deploys needed for prompt updates.
```python
from langchain import hub
from langchain_core.prompts import ChatPromptTemplate

# Push a prompt to Hub
prompt = ChatPromptTemplate.from_messages([
    ('system', 'You are a helpful assistant. Be concise.'),
    ('human', '{question}'),
])
hub.push('your-org/qa-prompt', prompt)

# Pull in production (always gets the latest version)
production_prompt = hub.pull('your-org/qa-prompt')

# Pin to a specific commit for stable deployments
stable_prompt = hub.pull('your-org/qa-prompt:abc123def')
```
Cost tracking
LangSmith automatically tracks token usage and estimated costs for every traced call. In the UI, filter by project, date range, or model to see where your budget is going.
```python
# Access cost data programmatically
from datetime import datetime, timedelta

from langsmith import Client

client = Client()

# Get LLM runs from the last 7 days
runs = list(client.list_runs(
    project_name='my-project',
    start_time=datetime.now() - timedelta(days=7),
    run_type='llm',
))

total_tokens = sum(r.total_tokens or 0 for r in runs)
total_cost = sum(r.total_cost or 0 for r in runs)
print(f'Last 7 days: {total_tokens:,} tokens, ${total_cost:.2f}')

# Cost by model (r.extra can be None, so guard the lookup)
by_model = {}
for r in runs:
    model = (r.extra or {}).get('invocation_params', {}).get('model_name', 'unknown')
    by_model[model] = by_model.get(model, 0) + (r.total_cost or 0)

for model, cost in sorted(by_model.items(), key=lambda x: -x[1]):
    print(f'  {model}: ${cost:.4f}')
```
The LangSmith workflow
- Deploy your app with `LANGCHAIN_TRACING_V2=true`
- Reproduce a user-reported failure by filtering traces by `user_id` or session
- Inspect the full trace tree and find the exact node where the output went wrong
- Add the failing input to a dataset
- Fix your prompt or code, run the dataset through evaluation, and compare scores
- Deploy the fix and monitor the dashboard for improvement