Sentry for AI Applications: Tracing LLM Calls and Catching Silent Failures

AI applications fail in ways that traditional error monitoring doesn't catch. Your API returns 200, but the LLM hallucinated. Your embedding pipeline processed all documents, but 10% produced zero-length vectors. Your RAG search returns results, but they're semantically irrelevant. Sentry's new AI monitoring features — combined with its distributed tracing — give you visibility into LLM call latency, token usage, and semantic quality signals that generic APM tools miss.

This guide covers installing Sentry in a Next.js AI app, tracing OpenAI calls, and configuring alerts for AI-specific failure modes.

Installation

    # Next.js: run the setup wizard
    npx @sentry/wizard@latest -i nextjs

    # Or install manually:
    npm install @sentry/nextjs

The wizard creates `sentry.client.config.ts`, `sentry.server.config.ts`, and `sentry.edge.config.ts`. Accept the defaults — they set up error capturing, session replays, and performance monitoring.

Basic Configuration

    // sentry.server.config.ts
    import * as Sentry from "@sentry/nextjs";

    Sentry.init({
      dsn: process.env.SENTRY_DSN,
      environment: process.env.NODE_ENV,

      // Performance monitoring: sample 10% of traces in production, all of them elsewhere
      tracesSampleRate: process.env.NODE_ENV === "production" ? 0.1 : 1.0,

      // Profile transactions for CPU flame graphs
      // (only takes effect when a profiling integration is enabled)
      profilesSampleRate: 0.1,

      // Scrub sensitive data before sending to Sentry
      beforeSend(event) {
        const data = event.request?.data;
        // request.data may be a string or undefined; only scrub plain objects
        if (data && typeof data === "object") {
          const body = data as Record<string, unknown>;
          delete body.apiKey;
          delete body.messages; // don't send user messages to Sentry
        }
        return event;
      },
    });
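The two `delete` statements above only cover top-level fields. If your request bodies nest prompts or keys deeper, a recursive scrubber may be a safer fit. The following is a minimal sketch, not part of the Sentry SDK: the function name `scrubSensitive`, the key list, and the depth limit are all assumptions to adapt to your own payloads.

```typescript
// Hypothetical helper (not a Sentry API). The key list is an assumption;
// extend it to match the fields your app actually sends.
const SENSITIVE_KEYS = new Set(["apikey", "authorization", "messages", "prompt"]);

function scrubSensitive(value: unknown, depth = 0): unknown {
  // Primitives carry no keys to redact, so return them unchanged.
  if (value === null || typeof value !== "object") return value;

  // Bound the work per event, and never pass deep data through unscrubbed.
  if (depth > 8) return "[truncated]";

  if (Array.isArray(value)) {
    return value.map((item) => scrubSensitive(item, depth + 1));
  }

  // Build a copy so the original payload is not mutated.
  const result: Record<string, unknown> = {};
  for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
    result[key] = SENSITIVE_KEYS.has(key.toLowerCase())
      ? "[redacted]"
      : scrubSensitive(inner, depth + 1);
  }
  return result;
}
```

Inside `beforeSend`, this could be applied as `event.request.data = scrubSensitive(event.request.data)` after checking that `event.request` exists.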

Tracing OpenAI API Calls

Wrap your OpenAI calls in Sentry spans to measure latency, token usage, and failures for every LLM call in your app:

    // lib/traced-openai.ts
    import * as Sentry from "@sentry/nextjs";
    import OpenAI from "openai";

    const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

    export async function tracedChatCompletion(
      messages: OpenAI.ChatCompletionMessageParam[],
      options: { model?: string; temperature?: number } = {}
    ) {
      const { model = "gpt-4o-mini", temperature = 0.7 } = options;

      // startSpan makes the span active for the callback and ends it
      // when the returned promise settles.
      return Sentry.startSpan(
        {
          name: "openai.chat.completions",
          op: "ai.run",
          attributes: {
            "ai.model": model,
            "ai.temperature": temperature,
            "ai.message_count": messages.length,
          },
        },
        async (span) => {
          try {
            const response = await openai.chat.completions.create({
              model,
              temperature,
              messages,
            });

            // Record token usage in the span
            const usage = response.usage;
            if (usage) {
              span.setAttributes({
                "ai.prompt_tokens": usage.prompt_tokens,
                "ai.completion_tokens": usage.completion_tokens,
                "ai.total_tokens": usage.total_tokens,
              });
            }

            // Alert if response seems truncated (common silent failure)
            const finishReason = response.choices[0]?.finish_reason;
            if (finishReason === "length") {
              Sentry.captureMessage("LLM response truncated — hit max tokens", {
                level: "warning",
                extra: { model, promptTokens: usage?.prompt_tokens },
              });
            }

            return response;
          } catch (error) {
            Sentry.captureException(error, {
              extra: { model, messageCount: messages.length },
            });
            throw error;
          }
        }
      );
    }
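The bookkeeping inside the span callback (copying token usage and checking `finish_reason`) is easier to unit-test if it lives in its own function. Below is one way that could look, written against small structural types so it runs without the Sentry or OpenAI packages. `SpanLike`, `CompletionLike`, `recordCompletion`, and the `ai.finish_reason` attribute name are assumptions made for this sketch, not SDK names.

```typescript
// Structural stand-ins that mirror only the fields this sketch touches.
interface SpanLike {
  setAttributes(attrs: Record<string, string | number | boolean>): void;
}

interface CompletionLike {
  usage?: { prompt_tokens: number; completion_tokens: number; total_tokens: number };
  choices: { finish_reason: string | null }[];
}

// Copies token usage onto the span and reports whether the output was cut off.
function recordCompletion(span: SpanLike, response: CompletionLike): { truncated: boolean } {
  const attrs: Record<string, string | number | boolean> = {};

  if (response.usage) {
    attrs["ai.prompt_tokens"] = response.usage.prompt_tokens;
    attrs["ai.completion_tokens"] = response.usage.completion_tokens;
    attrs["ai.total_tokens"] = response.usage.total_tokens;
  }

  // An empty choices array or a null finish_reason is recorded as "unknown".
  const finishReason = response.choices[0]?.finish_reason ?? "unknown";
  attrs["ai.finish_reason"] = finishReason; // assumed attribute name

  span.setAttributes(attrs);
  return { truncated: finishReason === "length" };
}

// Usage with a fake span that simply collects attributes.
const recorded: Record<string, string | number | boolean> = {};
const fakeSpan: SpanLike = {
  setAttributes: (attrs) => {
    Object.assign(recorded, attrs);
  },
};

const outcome = recordCompletion(fakeSpan, {
  usage: { prompt_tokens: 20, completion_tokens: 10, total_tokens: 30 },
  choices: [{ finish_reason: "length" }],
});
```

A real Sentry span satisfies `SpanLike` structurally, so the same function could be called from the `startSpan` callback, with the caller deciding whether to send a warning when `truncated` is true.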

Capturing AI-Specific Errors

Some AI failures aren't exceptions — they're bad outputs. Use `Sentry.captureMessage()` to track these:

    // Detect and report semantic failures
    import * as Sentry from "@sentry/nextjs";
    import { tracedChatCompletion } from "./traced-openai"; // adjust the path to your project layout

    async function generateWithQualityCheck(prompt: string) {
      const response = await tracedChatCompletion([
        { role: "user", content: prompt },
      ]);

      const content = response.choices[0].message.content ?? "";

      // Empty or near-empty responses
      if (content.trim().length < 50) {
        Sentry.captureMessage("Suspiciously short AI response", {
          level: "warning",
          extra: { prompt: prompt.slice(0, 200), contentLength: content.length },
          tags: { failure_type: "short_response" },
        });
      }

      // Detect refusal patterns
      const refusalPhrases = ["I cannot", "I'm unable to", "I don't have access"];
      if (refusalPhrases.some((phrase) => content.includes(phrase))) {
        Sentry.captureMessage("AI refused to answer", {
          level: "info",
          extra: { prompt: prompt.slice(0, 200) },
          tags: { failure_type: "refusal" },
        });
      }

      return content;
    }
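The checks above match refusal phrases case-sensitively and anywhere in the text, so a long, valid answer that happens to contain "I cannot" is also flagged. One possible refinement is to separate detection from reporting and only inspect the opening of the response. The sketch below is an illustration under assumptions: `classifyResponse`, the phrase list, the 120-character window, and the 50-character threshold are all hypothetical and should be tuned against real traffic.

```typescript
type FailureType = "empty" | "short_response" | "refusal";

// Assumed phrase list, stored lowercase with straight apostrophes.
const REFUSAL_PHRASES = ["i cannot", "i can't", "i'm unable to", "i don't have access"];

function classifyResponse(content: string, minLength = 50): FailureType[] {
  const text = content.trim();
  if (text.length === 0) return ["empty"];

  const failures: FailureType[] = [];
  if (text.length < minLength) failures.push("short_response");

  // Look only at the opening of the response, normalizing case and curly
  // apostrophes, so a phrase deep inside a long answer is not flagged.
  const head = text.slice(0, 120).toLowerCase().replace(/\u2019/g, "'");
  if (REFUSAL_PHRASES.some((phrase) => head.includes(phrase))) {
    failures.push("refusal");
  }

  return failures;
}
```

Each returned value could then be sent with `Sentry.captureMessage` using the same `failure_type` tag as above. The trade-off of the fixed window is that a refusal stated late in a long response goes undetected.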

Distributed Tracing Across the RAG Pipeline

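    // app/api/ask/route.ts
    import * as Sentry from "@sentry/nextjs";
    import { NextResponse } from "next/server";
    import OpenAI from "openai";
    import { tracedChatCompletion } from "@/lib/traced-openai";
    // Assumed app-specific helpers (not shown in this guide): your vector store
    // client and the function that builds the prompt from retrieved chunks.
    import { vectorStore, buildRAGMessages } from "@/lib/rag";

    const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

    export async function POST(req: Request) {
      return Sentry.startSpan(
        { name: "rag-pipeline", op: "ai.pipeline" },
        async () => {
          const { question } = await req.json();

          // Embedding — traced span
          const embedding = await Sentry.startSpan(
            { name: "embed-question", op: "ai.embed" },
            async () => {
              const response = await openai.embeddings.create({
                model: "text-embedding-3-small",
                input: question,
              });
              return response.data[0].embedding;
            }
          );

          // Vector search — traced span
          const results = await Sentry.startSpan(
            { name: "vector-search", op: "db.query" },
            async () => {
              return await vectorStore.query(embedding, { topK: 5 });
            }
          );

          // LLM call — traced span
          const answer = await Sentry.startSpan(
            { name: "llm-generate", op: "ai.run" },
            async () => {
              const completion = await tracedChatCompletion(
                buildRAGMessages(question, results)
              );
              // Return only the generated text, not the whole API response
              return completion.choices[0].message.content ?? "";
            }
          );

          return NextResponse.json({ answer });
        }
      );
    }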

In Sentry's trace view, you'll see a span waterfall of the full RAG pipeline (embed → search → generate) with the duration of each step. This makes it easy to tell whether the embedding call, the vector search, or the generation step is the bottleneck.
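Span durations show where time goes, but not whether the retrieval step produced anything useful. The silent failures mentioned in the introduction (zero-length vectors, irrelevant search results) can be checked between pipeline steps. The following is a rough sketch: `findRetrievalIssues`, the `SearchHit` shape (a higher `score` meaning more similar), and the `0.3` threshold are assumptions, since the vector store's result format is not specified here.

```typescript
// Assumed result shape: most vector stores return some id plus a similarity score.
interface SearchHit {
  id: string;
  score: number; // assumption: higher means more similar
}

// Returns a list of issue labels; an empty list means nothing suspicious was found.
function findRetrievalIssues(
  embedding: number[],
  hits: SearchHit[],
  minScore = 0.3 // assumed threshold; depends on your metric and data
): string[] {
  const issues: string[] = [];

  if (embedding.length === 0) {
    issues.push("empty_embedding");
  } else if (embedding.every((x) => x === 0)) {
    issues.push("zero_embedding");
  } else if (embedding.some((x) => !Number.isFinite(x))) {
    issues.push("non_finite_embedding");
  }

  if (hits.length === 0) {
    issues.push("no_results");
  } else if (Math.max(...hits.map((hit) => hit.score)) < minScore) {
    // Even the best match is below the threshold, so the context is likely irrelevant.
    issues.push("low_relevance");
  }

  return issues;
}
```

In the route handler, each issue could be reported with `Sentry.captureMessage` and a `failure_type` tag, following the pattern used for short responses and refusals.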

Setting Up Alerts

In the Sentry dashboard → Alerts → Create Alert:

Alert	Condition	Action
High LLM latency	p95 of `openai.chat.completions` > 10s	Slack #alerts
Frequent refusals	`failure_type:refusal` event rate > 5%	PagerDuty
Truncated responses	`LLM response truncated` message rate > 1%	Email
OpenAI API errors	Exception rate on `ai.run` > 0.5%	Slack #critical

Source Maps for Edge Functions

    # Add to your CI/CD pipeline to upload source maps.
    # Export the token so that both commands can read it.
    export SENTRY_AUTH_TOKEN=your-token
    npx @sentry/cli sourcemaps inject .next && \
      npx @sentry/cli sourcemaps upload .next --org your-org --project your-project