AI applications fail in ways that traditional error monitoring doesn't catch. Your API returns 200, but the LLM hallucinated. Your embedding pipeline processed all documents, but 10% produced zero-length vectors. Your RAG search returns results, but they're semantically irrelevant. Sentry's new AI monitoring features — combined with its distributed tracing — give you visibility into LLM call latency, token usage, and semantic quality signals that generic APM tools miss.
This guide covers installing Sentry in a Next.js AI app, tracing OpenAI calls, and configuring alerts for AI-specific failure modes.
Installation
```bash
# Next.js
npx @sentry/wizard@latest -i nextjs

# Or manual install:
npm install @sentry/nextjs
```
The wizard creates `sentry.client.config.ts`, `sentry.server.config.ts`, and `sentry.edge.config.ts`. Accept the defaults — they set up error capturing, session replays, and performance monitoring.
Basic Configuration
```typescript
// sentry.server.config.ts
import * as Sentry from "@sentry/nextjs";

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV,

  // Performance monitoring: sample 10% of traces in production, all of them elsewhere
  tracesSampleRate: process.env.NODE_ENV === "production" ? 0.1 : 1.0,

  // Profile transactions for CPU flame graphs
  profilesSampleRate: 0.1,

  // Scrub sensitive data before the event is sent to Sentry
  beforeSend(event) {
    const data = event.request?.data;
    // Only parsed object bodies have keys to delete; string bodies are left as-is here
    if (data && typeof data === "object") {
      delete data.apiKey;
      delete data.messages; // don't send user messages to Sentry
    }
    return event;
  },
});
```
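The `beforeSend` hook above only removes two top-level keys, and only when the request body is an already-parsed object. A more defensive variant, sketched below, also handles JSON string bodies and nested keys. The helper names `scrubValue` and `scrubRequestData` and the `SENSITIVE_KEYS` list are illustrative assumptions, not part of Sentry's API.

```typescript
// Hypothetical key list: extend it with whatever your request bodies carry.
const SENSITIVE_KEYS = new Set(["apiKey", "messages", "prompt"]);

// Recursively replace the values of sensitive keys in objects and arrays.
function scrubValue(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(scrubValue);
  if (value !== null && typeof value === "object") {
    const clean: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      clean[key] = SENSITIVE_KEYS.has(key) ? "[Filtered]" : scrubValue(inner);
    }
    return clean;
  }
  return value; // strings, numbers, booleans, null pass through unchanged
}

// Entry point: accepts either a parsed body or a raw JSON string body.
function scrubRequestData(data: unknown): unknown {
  if (typeof data !== "string") return scrubValue(data);
  try {
    return JSON.stringify(scrubValue(JSON.parse(data)));
  } catch {
    return "[Filtered]"; // not JSON, so drop it rather than risk leaking it
  }
}
```

Inside `beforeSend`, this could stand in for the two `delete` statements: `event.request.data = scrubRequestData(event.request.data);`.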
Tracing OpenAI API Calls
Wrap your OpenAI calls in Sentry spans to measure latency, token usage, and failures for every LLM call in your app:
```typescript
// lib/traced-openai.ts
import * as Sentry from "@sentry/nextjs";
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export async function tracedChatCompletion(
  messages: OpenAI.ChatCompletionMessageParam[],
  options: { model?: string; temperature?: number } = {}
) {
  const { model = "gpt-4o-mini", temperature = 0.7 } = options;

  // startSpan keeps the span active inside the callback and ends it when the promise settles
  return Sentry.startSpan(
    {
      name: "openai.chat.completions",
      op: "ai.run",
      attributes: {
        "ai.model": model,
        "ai.temperature": temperature,
        "ai.message_count": messages.length,
      },
    },
    async (span) => {
      try {
        const response = await openai.chat.completions.create({
          model,
          temperature,
          messages,
        });

        // Record token usage in the span
        const usage = response.usage;
        if (usage) {
          span.setAttributes({
            "ai.prompt_tokens": usage.prompt_tokens,
            "ai.completion_tokens": usage.completion_tokens,
            "ai.total_tokens": usage.total_tokens,
          });
        }

        // Warn if the response was truncated (a common silent failure)
        const finishReason = response.choices[0]?.finish_reason;
        if (finishReason === "length") {
          Sentry.captureMessage("LLM response truncated: hit max tokens", {
            level: "warning",
            extra: { model, promptTokens: usage?.prompt_tokens },
          });
        }

        return response;
      } catch (error) {
        Sentry.captureException(error, {
          extra: { model, messageCount: messages.length },
        });
        throw error;
      }
    }
  );
}
```
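Truncation is only one of the non-`stop` finish reasons the Chat Completions API can report. A small lookup keeps the reporting decision in one place. The function name `finishReasonLevel` and the severity chosen for each case are assumptions to adapt, not Sentry or OpenAI conventions.

```typescript
type Level = "info" | "warning" | "error";

// Hypothetical helper: maps a finish_reason to the Sentry level to report at,
// or null when the completion ended normally and nothing needs reporting.
function finishReasonLevel(reason: string | null | undefined): Level | null {
  switch (reason) {
    case "stop": // model finished on its own
    case "tool_calls": // model handed off to a tool
    case "function_call":
      return null;
    case "length": // hit the token limit, so the output is cut off
      return "warning";
    case "content_filter": // content was omitted by a content filter
      return "warning";
    case null:
    case undefined: // no choice came back at all
      return "error";
    default: // unrecognized value: surface it at low severity
      return "info";
  }
}
```

In `tracedChatCompletion`, the `finishReason === "length"` check could then become a call to `finishReasonLevel(finishReason)`, reporting via `Sentry.captureMessage` whenever the result is not `null`.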
Capturing AI-Specific Errors
Some AI failures aren't exceptions — they're bad outputs. Use `Sentry.captureMessage()` to track these:
```typescript
// Detect and report semantic failures
import * as Sentry from "@sentry/nextjs";
import { tracedChatCompletion } from "./traced-openai"; // adjust to where lib/traced-openai.ts lives

async function generateWithQualityCheck(prompt: string) {
  const response = await tracedChatCompletion([
    { role: "user", content: prompt },
  ]);

  // choices can be empty, so fall back to an empty string instead of throwing
  const content = response.choices[0]?.message.content ?? "";

  // Empty or near-empty responses
  if (content.trim().length < 50) {
    Sentry.captureMessage("Suspiciously short AI response", {
      level: "warning",
      extra: { prompt: prompt.slice(0, 200), contentLength: content.length },
      tags: { failure_type: "short_response" },
    });
  }

  // Detect refusal patterns
  const refusalPhrases = ["I cannot", "I'm unable to", "I don't have access"];
  if (refusalPhrases.some((phrase) => content.includes(phrase))) {
    Sentry.captureMessage("AI refused to answer", {
      level: "info",
      extra: { prompt: prompt.slice(0, 200) },
      tags: { failure_type: "refusal" },
    });
  }

  return content;
}
```
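Exact substring matching misses refusals that use different casing or a typographic apostrophe ("I’m unable to"), and it also fires when a refusal phrase appears in the middle of an otherwise valid answer. The sketch below normalizes the text and inspects only the opening of the response. `classifyResponse`, its thresholds, and the phrase list are illustrative assumptions.

```typescript
type FailureType = "short_response" | "refusal" | null;

// Hypothetical phrase list, all lowercase with straight apostrophes.
const REFUSAL_OPENERS = ["i cannot", "i can't", "i'm unable to", "i don't have access"];

function classifyResponse(content: string, minLength = 50): FailureType {
  // Normalize curly apostrophes and case so "I’m Unable" matches "i'm unable".
  const normalized = content.trim().replace(/[\u2018\u2019]/g, "'").toLowerCase();

  // Check refusals first: they are often short, and "refusal" is the more specific label.
  const opening = normalized.slice(0, 80);
  if (REFUSAL_OPENERS.some((phrase) => opening.includes(phrase))) return "refusal";

  if (normalized.length < minLength) return "short_response";
  return null;
}
```

Unlike the two independent checks above, this returns a single label per response, which maps directly onto the `failure_type` tag.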
Distributed Tracing Across the RAG Pipeline
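```typescript
// app/api/ask/route.ts
import * as Sentry from "@sentry/nextjs";
import { NextResponse } from "next/server";
import OpenAI from "openai";
import { tracedChatCompletion } from "@/lib/traced-openai";
// vectorStore and buildRAGMessages are app-specific helpers; these import paths are placeholders
import { vectorStore } from "@/lib/vector-store";
import { buildRAGMessages } from "@/lib/rag";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export async function POST(req: Request) {
  return Sentry.startSpan(
    { name: "rag-pipeline", op: "ai.pipeline" },
    async () => {
      const { question } = await req.json();

      // Embedding: traced span
      const embedding = await Sentry.startSpan(
        { name: "embed-question", op: "ai.embed" },
        async () => {
          const response = await openai.embeddings.create({
            model: "text-embedding-3-small",
            input: question,
          });
          return response.data[0].embedding;
        }
      );

      // Vector search: traced span
      const results = await Sentry.startSpan(
        { name: "vector-search", op: "db.query" },
        () => vectorStore.query(embedding, { topK: 5 })
      );

      // LLM call: traced span
      const completion = await Sentry.startSpan(
        { name: "llm-generate", op: "ai.run" },
        () => tracedChatCompletion(buildRAGMessages(question, results))
      );

      // Return only the generated text, not the raw completion object
      const answer = completion.choices[0]?.message.content ?? "";
      return NextResponse.json({ answer });
    }
  );
}
```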
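The embedding step is also where degenerate vectors can slip through without raising an exception: an empty array, a wrong dimension count, non-finite values, or an all-zero vector. A check like the following could run inside the `embed-question` span before the vector is used. `checkEmbedding` is a hypothetical helper, and the expected dimension of 1536 assumes the default output size of `text-embedding-3-small`.

```typescript
type EmbeddingProblem = "empty" | "wrong_dimensions" | "non_finite" | "zero_norm" | null;

function checkEmbedding(vector: number[], expectedDims = 1536): EmbeddingProblem {
  if (vector.length === 0) return "empty";
  if (vector.length !== expectedDims) return "wrong_dimensions";
  if (!vector.every((x) => Number.isFinite(x))) return "non_finite";

  // A zero-norm vector has no direction, so cosine similarity against it is meaningless.
  const norm = Math.sqrt(vector.reduce((sum, x) => sum + x * x, 0));
  return norm === 0 ? "zero_norm" : null;
}
```

A non-null result can be reported with `Sentry.captureMessage` at `warning` level, reusing the `failure_type` tag so embedding problems show up next to the other quality signals.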
In Sentry's trace view, you'll see a waterfall of the full RAG pipeline (embed → search → generate) with the exact duration of each step. This makes it immediately obvious whether the embedding call, the vector search, or the LLM call is the bottleneck.
Setting Up Alerts
In the Sentry dashboard → Alerts → Create Alert:
| Alert | Condition | Action |
|---|---|---|
| High LLM latency | p95 of `openai.chat.completions` > 10s | Slack #alerts |
| Frequent refusals | `failure_type:refusal` event rate > 5% | PagerDuty |
| Truncated responses | "LLM response truncated" message rate > 1% | |
| OpenAI API errors | Exception rate on `ai.run` > 0.5% | Slack #critical |
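To sanity-check thresholds like these before creating the alerts, it can help to compute the same statistics over a sample of your own measurements. The sketch below uses the nearest-rank definition of a percentile, which may differ slightly from how Sentry aggregates; the function names and sample data are assumptions.

```typescript
// Nearest-rank percentile: the smallest sample such that at least `pct` percent
// of all samples are less than or equal to it.
function percentile(samples: number[], pct: number): number {
  if (samples.length === 0) throw new Error("percentile of an empty sample set");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((pct * sorted.length) / 100);
  // Clamp so pct values of 0 or above 100 still index into the array.
  return sorted[Math.min(sorted.length, Math.max(1, rank)) - 1];
}

// Fraction of events matching a predicate, e.g. refusals among all responses.
function rate<T>(events: T[], matches: (event: T) => boolean): number {
  return events.length === 0 ? 0 : events.filter(matches).length / events.length;
}

// Made-up LLM call durations in seconds from a local test run.
const durations = [1.2, 0.9, 2.4, 1.1, 14.0, 1.3, 1.0, 1.8, 2.1, 1.5];
const p95 = percentile(durations, 95); // 14 for this sample
const latencyAlertWouldFire = p95 > 10; // true: a single slow call dominates p95 at this sample size
```

With only ten samples, one outlier decides the p95, which is a reason to evaluate latency alerts over a window long enough to contain a meaningful number of LLM calls.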
Source Maps for Edge Functions
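```bash
# Add to your CI/CD pipeline to upload source maps
# Export the token so both commands see it (an inline VAR=... prefix only applies to the first)
export SENTRY_AUTH_TOKEN=your-token
npx @sentry/cli sourcemaps inject .next && \
  npx @sentry/cli sourcemaps upload .next --org your-org --project your-project
```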