Knowing that your AI endpoint is slow is not the same as knowing why it's slow. Is it the OpenAI API latency? The vector search? The database query for user context? The JSON serialization of a large response? Sentry's performance monitoring — specifically distributed tracing and profiling — answers these questions by breaking down every request into individual spans with precise timing.
## Understanding Sentry's Performance Tools
| Tool | What it Shows | When to Use |
|---|---|---|
| Distributed tracing | Span-level breakdown of a single request | Diagnosing slow individual requests |
| Performance dashboard | Aggregate metrics (p50, p95, p99) across all requests | Identifying systemic bottlenecks |
| Profiling | CPU flame charts — which functions use the most CPU | Finding expensive computation in your code |
| Web Vitals | LCP, CLS, and INP (which replaced FID in 2024) for frontend pages | Frontend performance optimization |
## Instrumenting a Complete AI Pipeline
The key to useful traces is meaningful span names and attributes. Here's a fully instrumented document analysis pipeline:
```typescript
// lib/analysis-pipeline.ts
// Assumes `db`, `openai`, `supabase`, and `chunkText` are imported elsewhere.
import * as Sentry from "@sentry/nextjs";

export async function analyzeDocument(documentId: string, userId: string) {
  return Sentry.startActiveSpan(
    {
      name: `analyze-document:${documentId}`,
      op: "ai.pipeline",
      attributes: { "document.id": documentId, "user.id": userId },
    },
    async (rootSpan) => {
      // Step 1: Fetch document
      const document = await Sentry.startActiveSpan(
        { name: "fetch-document", op: "db.query" },
        async () => {
          const result = await db.query(
            "SELECT content, metadata FROM documents WHERE id = $1",
            [documentId]
          );
          return result.rows[0];
        }
      );

      // Step 2: Chunk text
      const chunks = await Sentry.startActiveSpan(
        { name: "chunk-text", op: "process" },
        async (span) => {
          const result = chunkText(document.content, { size: 512 });
          span.setAttribute("chunk.count", result.length);
          return result;
        }
      );

      // Step 3: Batch embed (often the bottleneck)
      const embeddings = await Sentry.startActiveSpan(
        { name: "batch-embed", op: "ai.embed" },
        async (span) => {
          span.setAttribute("ai.model", "text-embedding-3-small");
          span.setAttribute("ai.input_count", chunks.length);
          const response = await openai.embeddings.create({
            model: "text-embedding-3-small",
            input: chunks,
          });
          span.setAttribute("ai.total_tokens", response.usage.total_tokens);
          return response.data.map((d) => d.embedding);
        }
      );

      // Step 4: Store vectors
      await Sentry.startActiveSpan(
        { name: "store-vectors", op: "db.query" },
        async () => {
          await supabase.from("chunks").upsert(
            chunks.map((chunk, i) => ({
              document_id: documentId,
              content: chunk,
              embedding: embeddings[i],
            }))
          );
        }
      );

      // Step 5: Generate summary
      const summary = await Sentry.startActiveSpan(
        { name: "generate-summary", op: "ai.run" },
        async (span) => {
          span.setAttribute("ai.model", "gpt-4o-mini");
          const response = await openai.chat.completions.create({
            model: "gpt-4o-mini",
            messages: [
              {
                role: "system",
                content: "Summarize this document in 3 bullet points.",
              },
              { role: "user", content: document.content.slice(0, 4000) },
            ],
          });
          const usage = response.usage;
          if (usage) {
            span.setAttributes({
              "ai.prompt_tokens": usage.prompt_tokens,
              "ai.completion_tokens": usage.completion_tokens,
            });
          }
          return response.choices[0].message.content;
        }
      );

      rootSpan.setAttribute("pipeline.chunks", chunks.length);
      return { summary, chunkCount: chunks.length };
    }
  );
}
```
## Reading Trace Data in Sentry
In Sentry → Performance → Explore, filter by `op:ai.pipeline`. Click any trace to see the waterfall view — a Gantt chart showing all spans in chronological order. Look for:
- **Wide bars**: spans taking most of the total time — prime optimization targets
- **Sequential spans that could be parallel**: chunking and fetching user metadata don't depend on each other
- **Gaps between spans**: time not attributed to any span — often serialization, middleware overhead, or connection pooling
- **Outlier traces**: sort by Duration to find the slowest 1% and investigate what was different
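The second bullet is often the cheapest win. As a minimal sketch (Sentry spans elided for brevity; `fetchDocument` and `fetchUserMetadata` are hypothetical stand-ins for two independently instrumented steps), spans that don't depend on each other can run concurrently with `Promise.all`:

```typescript
// Hypothetical stand-ins for two independently instrumented steps —
// in the real pipeline each would wrap its work in Sentry.startActiveSpan.
async function fetchDocument(documentId: string) {
  return { id: documentId, content: "..." };
}

async function fetchUserMetadata(userId: string) {
  return { userId, plan: "pro" };
}

export async function loadContext(documentId: string, userId: string) {
  // Before: two sequential awaits — in the waterfall, the second span
  // starts only after the first finishes.
  // After: both start immediately; total time is max(a, b), not a + b.
  const [document, userMeta] = await Promise.all([
    fetchDocument(documentId),
    fetchUserMetadata(userId),
  ]);
  return { document, userMeta };
}
```

In the waterfall view, the two spans now render as overlapping bars instead of a staircase.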
## CPU Profiling for Expensive Computation
```typescript
// sentry.server.config.ts — enable profiling
import * as Sentry from "@sentry/nextjs";
import { nodeProfilingIntegration } from "@sentry/profiling-node";

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  tracesSampleRate: 0.1,
  profilesSampleRate: 0.1, // profile 10% of traced transactions
  integrations: [nodeProfilingIntegration()],
});
```
Profiling shows you a flame chart of CPU usage during a transaction. Common findings in AI apps:
- `JSON.parse()` of large LLM responses taking 20-50ms
- Token counting libraries (`tiktoken`) adding 10-30ms per call
- Markdown-to-HTML rendering of LLM output blocking the response
- Cosine similarity computed in JavaScript instead of SQL/pgvector
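The last bullet deserves a concrete sketch. A pure-JavaScript cosine similarity like the one below is what shows up as a wide flame-chart block when looped over thousands of chunks; pgvector's `<=>` operator computes the equivalent cosine distance inside Postgres. The SQL in the comment is a hypothetical query, not part of the pipeline above:

```typescript
// The slow path: O(n · d) similarity math running on the Node event loop
// for every query — a prime flame-chart suspect.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// The fast path moves the same math into Postgres, where `<=>` is
// pgvector's cosine-distance operator (hypothetical query):
//   SELECT content, 1 - (embedding <=> $1) AS similarity
//   FROM chunks
//   ORDER BY embedding <=> $1
//   LIMIT 5;
```

The database-side version also benefits from pgvector's indexes, which the JavaScript loop can never use.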
## Performance Alerts and Regression Detection
Sentry automatically detects performance regressions when you deploy a new version. Configure thresholds under **Project Settings → Performance → Thresholds**. Example thresholds for an AI chat endpoint:

- p95 duration: 8000ms (8 seconds)
- p50 duration: 3000ms (3 seconds)
- Error rate: 1%

Sentry alerts you when a new deployment exceeds these thresholds.
Use Sentry's "Compare to baseline" feature when investigating a slow release. It overlays the p50/p95 latency of the new version against the previous version, making regressions immediately visible.

## Connecting Frontend and Backend Traces
```typescript
// Client component — inject trace context into API calls
import * as Sentry from "@sentry/nextjs";

async function askQuestion(question: string) {
  // Sentry automatically propagates trace headers
  // when using its fetch instrumentation
  const response = await fetch("/api/ask", {
    method: "POST",
    body: JSON.stringify({ question }),
    headers: { "Content-Type": "application/json" },
  });
  return response.json();
}

// The resulting trace connects the browser click event → API route →
// vector search → LLM call in a single end-to-end trace
```
With trace propagation enabled, Sentry links the browser's frontend performance (button click → response rendered) to your backend spans. You see the full user experience — including browser rendering time — not just API latency.
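If cross-service traces aren't linking up, the client-side init is the first thing to check. A sketch of the relevant options (the sample rate and target patterns are assumptions for a local Next.js setup, not prescribed values):

```typescript
// sentry.client.config.ts — connect browser spans to backend traces
import * as Sentry from "@sentry/nextjs";

Sentry.init({
  dsn: process.env.NEXT_PUBLIC_SENTRY_DSN,
  tracesSampleRate: 0.1,
  integrations: [Sentry.browserTracingIntegration()],
  // Attach sentry-trace/baggage headers only to your own API —
  // third-party origins (e.g. api.openai.com) may reject unknown headers.
  tracePropagationTargets: ["localhost", /^\/api\//],
});
```

`tracePropagationTargets` is the switch that controls which outgoing requests carry the trace headers; if your API origin isn't matched, the backend spans appear as a separate, unlinked trace.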