Beyond Tracing: The Production LangFuse Stack

Tracing tells you what your LLM app did. Evaluations tell you how well it did it. Prompt management tells you which version of your instructions produced which results. Datasets let you run regression tests before shipping prompt changes.

Most teams add LangFuse for tracing and stop there. This article covers the features that turn LangFuse from a debugging tool into a continuous improvement system.

Scoring Traces: Manual and Automated

A score is a numeric or categorical rating attached to a trace. Manual scores come from humans (support team marking a response as 'helpful' or 'wrong'). Automated scores come from evaluators you run after each trace.

from langfuse import Langfuse
 
langfuse = Langfuse()
 
# Manually score a trace (e.g. from a user feedback button)
langfuse.score(
    trace_id="trace-abc123",
    name="user-feedback",
    value=1,  # 1 = thumbs up, 0 = thumbs down
    comment="Answer was clear and accurate"
)
 
# Score with a float (e.g. 0.0 to 1.0)
langfuse.score(
    trace_id="trace-abc123",
    name="relevance",
    value=0.85
)
 

Scores are queryable in the dashboard. You can filter traces by score range, see which prompt versions have higher scores, and export score data for analysis.
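Once exported, score data is easy to slice locally. A minimal sketch, assuming each exported row carries a prompt version, a score name, and a value (illustrative field names, not the exact export schema):

```python
from collections import defaultdict

def average_scores_by_version(rows):
    """Average score values grouped by (prompt version, score name).

    Each row is assumed to look like
    {"promptVersion": 1, "name": "relevance", "value": 0.8}.
    """
    buckets = defaultdict(list)
    for row in rows:
        buckets[(row["promptVersion"], row["name"])].append(row["value"])
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}

rows = [
    {"promptVersion": 1, "name": "relevance", "value": 0.6},
    {"promptVersion": 1, "name": "relevance", "value": 0.8},
    {"promptVersion": 2, "name": "relevance", "value": 0.9},
]
print(average_scores_by_version(rows))
```

The same grouping works for any dimension the export carries, such as model name or user ID.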

LLM-as-Judge Evaluations

For automated quality measurement, use an LLM as an evaluator. LangFuse provides built-in eval templates, or you can write your own.

from langfuse.decorators import observe, langfuse_context
from langfuse import Langfuse
from anthropic import Anthropic
 
langfuse = Langfuse()
client = Anthropic()
 
def evaluate_response(question: str, response: str, trace_id: str):
    """Run an LLM-as-judge evaluation and attach the score to the trace."""
    eval_prompt = (
        f"Question: {question}\n"
        f"Response: {response}\n\n"
        "Rate the response on factual accuracy from 0 to 10. "
        "Reply with only a number."
    )
    eval_response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=10,
        messages=[{"role": "user", "content": eval_prompt}]
    )
    score_str = eval_response.content[0].text.strip()
    try:
        score = float(score_str) / 10  # normalise to 0-1
    except ValueError:
        return  # judge replied with something non-numeric; skip scoring
 
    langfuse.score(
        trace_id=trace_id,
        name="factual-accuracy",
        value=score
    )
 
@observe()
def answer_and_eval(question: str) -> str:
    trace_id = langfuse_context.get_current_trace_id()
 
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        messages=[{"role": "user", "content": question}]
    ).content[0].text
 
    # Run eval in background so it doesn't slow the user response
    import threading
    threading.Thread(
        target=evaluate_response,
        args=(question, response, trace_id)
    ).start()
 
    return response
 
Run LLM-as-judge evaluations asynchronously so they do not add latency to the user-facing response. The trace_id is stable — you can attach scores to a trace seconds or minutes after it was created.
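One failure mode worth guarding against: the judge occasionally replies with something other than a bare number. A hedged parsing helper (hypothetical, not part of the LangFuse SDK) that extracts the first number and lets the caller skip scoring otherwise:

```python
import re

def parse_judge_score(raw: str, scale: float = 10.0):
    """Extract the first number from a judge reply and normalise to 0-1.

    Returns None when no number is found, so the caller can skip
    attaching a score instead of crashing.
    """
    match = re.search(r"\d+(?:\.\d+)?", raw)
    if match is None:
        return None
    value = float(match.group()) / scale
    return max(0.0, min(1.0, value))  # clamp in case the judge exceeds the scale

print(parse_judge_score("8"))            # 0.8
print(parse_judge_score("Score: 7/10"))  # 0.7
print(parse_judge_score("n/a"))          # None
```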

Building a Dataset from Production Traces

A dataset is a collection of input/output pairs used for regression testing. The best datasets come from production traces — real user queries with known-good responses.
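Mining those traces can be as simple as filtering on attached scores. A sketch, assuming you have already fetched traces into dicts with input, output, and score fields (an illustrative shape, not the exact API response):

```python
def select_dataset_candidates(traces, min_score=0.9, limit=50):
    """Pick traces whose score marks them as known-good dataset candidates.

    Each trace is assumed to look like
    {"input": "...", "output": "...", "score": 0.95}.
    """
    good = [t for t in traces if t.get("score", 0.0) >= min_score]
    good.sort(key=lambda t: t["score"], reverse=True)
    return [
        {"input": t["input"], "expected_output": t["output"]}
        for t in good[:limit]
    ]

traces = [
    {"input": "q1", "output": "a1", "score": 0.95},
    {"input": "q2", "output": "a2", "score": 0.5},
    {"input": "q3", "output": "a3"},  # never scored — excluded
]
print(select_dataset_candidates(traces))
```

A human should still review candidates before they become regression fixtures; a high score is a filter, not a guarantee.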

from langfuse import Langfuse
 
langfuse = Langfuse()
 
# Create a dataset
langfuse.create_dataset(name="qa-regression-v1")
 
# Add items from production traces
langfuse.create_dataset_item(
    dataset_name="qa-regression-v1",
    input={"question": "What is the capital of France?"},
    expected_output="Paris"
)
 
# Run your pipeline against the dataset and log as experiment
dataset = langfuse.get_dataset("qa-regression-v1")
 
for item in dataset.items:
    with item.observe(run_name="claude-haiku-baseline") as trace_id:
        # answer_question is your own pipeline entry point
        output = answer_question(item.input["question"])
        # Score against the expected output
        score = 1.0 if item.expected_output.lower() in output.lower() else 0.0
        langfuse.score(trace_id=trace_id, name="exact-match", value=score)
 

After running a dataset experiment, LangFuse shows an experiment comparison view: score distributions, cost, and latency across different runs. This is how you validate that a prompt change improves quality before shipping it.
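If you also want the numbers locally (for a CI gate, say), the per-run summary is straightforward to compute. A sketch over locally collected results, assuming each carries a score and a latency:

```python
def summarise_run(results):
    """Summarise one experiment run: mean score and p95 latency.

    Each result is assumed to look like {"score": 1.0, "latency_ms": 120}.
    """
    scores = [r["score"] for r in results]
    latencies = sorted(r["latency_ms"] for r in results)
    p95_index = max(0, int(len(latencies) * 0.95) - 1)
    return {
        "mean_score": sum(scores) / len(scores),
        "p95_latency_ms": latencies[p95_index],
        "n": len(results),
    }

results = [{"score": 1.0, "latency_ms": i} for i in range(1, 21)]
print(summarise_run(results))
# {'mean_score': 1.0, 'p95_latency_ms': 19, 'n': 20}
```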

Prompt A/B Testing

LangFuse prompt management supports labels — you can keep multiple versions of a prompt and label one 'production' and another 'candidate'. Route a percentage of traffic to each to run an A/B test.

from langfuse import Langfuse
import hashlib
 
langfuse = Langfuse()
 
def get_prompt_for_user(user_id: str):
    """Route 80% of users to the 'production' prompt, 20% to 'candidate'."""
    # Hash user_id for stable assignment (same user always gets the same
    # variant). hashlib is used because Python's built-in hash() is salted
    # per process and would reshuffle users on every restart.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    label = "candidate" if bucket < 20 else "production"
 
    prompt = langfuse.get_prompt("answer-template", label=label)
    return prompt, label
 
from langfuse.decorators import observe, langfuse_context
 
@observe()
def answer_with_ab_test(question: str, user_id: str) -> str:
    prompt_obj, variant = get_prompt_for_user(user_id)
 
    langfuse_context.update_current_trace(
        metadata={"prompt_variant": variant}
    )
 
    compiled = prompt_obj.compile(question=question)
    # ... call LLM with compiled prompt ...
    return "response"
 

With scores attached to traces and the variant recorded in metadata, you can compare the score distributions of the two variants in LangFuse's dashboard and make a data-driven promotion decision.
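Before promoting, it is worth checking that the difference is not noise. A minimal two-proportion z-test sketch for comparing thumbs-up rates between variants (standard statistics, not a LangFuse API; it assumes both counts are nonzero and well away from 0% or 100%):

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """z-statistic for the difference between two thumbs-up rates.

    |z| > 1.96 roughly corresponds to p < 0.05 for a two-sided test.
    """
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# 50% vs 70% thumbs-up over 100 users each: clearly significant
print(two_proportion_z(50, 100, 70, 100))  # ≈ 2.89
```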

Cost Alerts

LangFuse does not currently support native alerting (as of April 2026), but you can build cost alerts by querying the API:

import requests
from datetime import datetime, timezone
 
LANGFUSE_SECRET_KEY = "sk-lf-..."
LANGFUSE_PUBLIC_KEY = "pk-lf-..."
ALERT_THRESHOLD_USD = 10.0  # alert if daily spend exceeds $10
 
def check_daily_cost():
    today = datetime.now(timezone.utc).date().isoformat()
    response = requests.get(
        "https://cloud.langfuse.com/api/public/metrics/daily",
        auth=(LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY),
        params={"fromTimestamp": today}
    )
    data = response.json()
    total_cost = sum(day.get("totalCost", 0) for day in data.get("data", []))
 
    if total_cost > ALERT_THRESHOLD_USD:
        # send_slack_alert is your own notification hook
        send_slack_alert(f"LLM spend today: ${total_cost:.2f} — threshold exceeded!")
 
# Run this as a cron job every hour
 
The LangFuse metrics API also exposes usage by model, user, and tag — useful for generating weekly cost-by-team or cost-by-feature reports.
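Rolling those rows up into a report is a small amount of code. A sketch assuming each row carries a tag and a totalCost field (field names are illustrative, not the exact API schema):

```python
from collections import defaultdict

def cost_by_tag(rows):
    """Roll up usage rows into a per-tag cost total.

    Each row is assumed to look like {"tag": "team-search", "totalCost": 1.25}.
    """
    totals = defaultdict(float)
    for row in rows:
        totals[row["tag"]] += row.get("totalCost", 0.0)
    return dict(totals)

rows = [
    {"tag": "team-search", "totalCost": 1.25},
    {"tag": "team-search", "totalCost": 0.75},
    {"tag": "team-chat", "totalCost": 2.0},
]
print(cost_by_tag(rows))
# {'team-search': 2.0, 'team-chat': 2.0}
```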

Trace Sampling for High-Volume Applications

At high request volumes, tracing every call can itself become expensive (storage and LangFuse API limits). Use sampling to trace a representative fraction:

from langfuse.decorators import observe
import random
 
TRACE_SAMPLE_RATE = 0.1  # trace 10% of requests
 
def underlying_logic(user_input: str) -> str:
    # ... your logic ...
    return "response"
 
@observe()
def handle_request_traced(user_input: str) -> str:
    return underlying_logic(user_input)
 
def handle_request(user_input: str) -> str:
    if random.random() < TRACE_SAMPLE_RATE:
        return handle_request_traced(user_input)  # traced
    # Not traced — call the underlying logic directly
    return underlying_logic(user_input)
 
Always trace 100% of error cases regardless of your sample rate. Add explicit error tracing in your exception handlers so failures are never silently dropped from observability.
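That policy is easiest to keep honest when the sampling decision lives in one place. A hypothetical helper that always returns True for errors, regardless of the configured rate:

```python
import random

def should_trace(is_error: bool, sample_rate: float, rng=random) -> bool:
    """Decide whether a request should be traced.

    Errors bypass sampling entirely; successes are sampled at sample_rate.
    """
    if is_error:
        return True
    return rng.random() < sample_rate

# Even at a 0% sample rate, errors are still traced:
print(should_trace(True, 0.0))   # True
print(should_trace(False, 1.0))  # True
```

The `rng` parameter is only there to make the decision testable with a seeded or fake random source.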

Summary

The LangFuse feature stack builds progressively: tracing is the foundation, evaluations add quality signals, datasets enable regression testing, and prompt management connects all three into a loop where you ship changes only when data supports them. Most teams implement these features incrementally over weeks rather than all at once — start with tracing, add manual scoring from support feedback, then automate evaluations once the trace data gives you enough signal to design good eval prompts.