Knowing that your AI endpoint is slow is not the same as knowing why it's slow. Is it the OpenAI API latency? The vector search? The database query for user context? The JSON serialization of a large response? Sentry's performance monitoring — specifically distributed tracing and profiling — answers these questions by breaking down every request into individual spans with precise timing.
## Understanding Sentry's Performance Tools
| Tool | What it Shows | When to Use |
|---|---|---|
| Distributed tracing | Span-level breakdown of a single request | Diagnosing slow individual requests |
| Performance dashboard | Aggregate metrics (p50, p95, p99) across all requests | Identifying systemic bottlenecks |
| Profiling | CPU flame charts — which functions use the most CPU | Finding expensive computation in your code |
| Web Vitals | LCP, CLS, and INP (which replaced FID in 2024) for frontend pages | Frontend performance optimization |
## Instrumenting a Complete AI Pipeline
The key to useful traces is meaningful span names and attributes. Here's a fully instrumented document analysis pipeline:
```typescript
// lib/analysis-pipeline.ts
// Assumes `db`, `openai`, `supabase`, and `chunkText` are imported elsewhere.
import * as Sentry from "@sentry/nextjs";

export async function analyzeDocument(documentId: string, userId: string) {
  return Sentry.startActiveSpan(
    {
      name: `analyze-document:${documentId}`,
      op: "ai.pipeline",
      attributes: { "document.id": documentId, "user.id": userId },
    },
    async (rootSpan) => {
      // Step 1: Fetch document
      const document = await Sentry.startActiveSpan(
        { name: "fetch-document", op: "db.query" },
        async () => {
          const result = await db.query(
            "SELECT content, metadata FROM documents WHERE id = $1",
            [documentId]
          );
          return result.rows[0];
        }
      );

      // Step 2: Chunk text
      const chunks = await Sentry.startActiveSpan(
        { name: "chunk-text", op: "process" },
        async (span) => {
          const result = chunkText(document.content, { size: 512 });
          span.setAttribute("chunk.count", result.length);
          return result;
        }
      );

      // Step 3: Batch embed (often the bottleneck)
      const embeddings = await Sentry.startActiveSpan(
        { name: "batch-embed", op: "ai.embed" },
        async (span) => {
          span.setAttribute("ai.model", "text-embedding-3-small");
          span.setAttribute("ai.input_count", chunks.length);
          const response = await openai.embeddings.create({
            model: "text-embedding-3-small",
            input: chunks,
          });
          span.setAttribute("ai.total_tokens", response.usage.total_tokens);
          return response.data.map((d) => d.embedding);
        }
      );

      // Step 4: Store vectors
      await Sentry.startActiveSpan(
        { name: "store-vectors", op: "db.query" },
        async () => {
          await supabase.from("chunks").upsert(
            chunks.map((chunk, i) => ({
              document_id: documentId,
              content: chunk,
              embedding: embeddings[i],
            }))
          );
        }
      );

      // Step 5: Generate summary
      const summary = await Sentry.startActiveSpan(
        { name: "generate-summary", op: "ai.run" },
        async (span) => {
          span.setAttribute("ai.model", "gpt-4o-mini");
          const response = await openai.chat.completions.create({
            model: "gpt-4o-mini",
            messages: [
              {
                role: "system",
                content: "Summarize this document in 3 bullet points.",
              },
              { role: "user", content: document.content.slice(0, 4000) },
            ],
          });
          const usage = response.usage;
          if (usage) {
            span.setAttributes({
              "ai.prompt_tokens": usage.prompt_tokens,
              "ai.completion_tokens": usage.completion_tokens,
            });
          }
          return response.choices[0].message.content;
        }
      );

      rootSpan.setAttribute("pipeline.chunks", chunks.length);
      return { summary, chunkCount: chunks.length };
    }
  );
}
```
## Reading Trace Data in Sentry
In Sentry → Performance → Explore, filter by `op:ai.pipeline`. Click any trace to see the waterfall view — a Gantt chart showing all spans in chronological order. Look for:
- **Wide bars**: spans taking most of the total time — prime optimization targets
- **Sequential spans that could be parallel**: chunking and fetching user metadata don't depend on each other
- **Gaps between spans**: time not attributed to any span — often serialization, middleware overhead, or connection pooling
- **Outlier traces**: sort by Duration to find the slowest 1% and investigate what was different
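The second bullet is often the cheapest win. As a minimal sketch (Sentry spans elided for brevity; `fetchDocument` and `fetchUserMetadata` are hypothetical stand-ins for two independently instrumented steps), spans that don't depend on each other can run concurrently with `Promise.all`:

```typescript
// Hypothetical stand-ins for two independently instrumented steps —
// in the real pipeline each would wrap its work in Sentry.startActiveSpan.
async function fetchDocument(documentId: string) {
  return { id: documentId, content: "..." };
}

async function fetchUserMetadata(userId: string) {
  return { userId, plan: "pro" };
}

export async function loadContext(documentId: string, userId: string) {
  // Before: two sequential awaits — in the waterfall, the second span
  // starts only after the first finishes.
  // After: both start immediately; total time is max(a, b), not a + b.
  const [document, userMeta] = await Promise.all([
    fetchDocument(documentId),
    fetchUserMetadata(userId),
  ]);
  return { document, userMeta };
}
```

In the waterfall view, the two spans now render as overlapping bars instead of a staircase.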
## CPU Profiling for Expensive Computation
```typescript
// sentry.server.config.ts — enable profiling
import * as Sentry from "@sentry/nextjs";
import { nodeProfilingIntegration } from "@sentry/profiling-node";

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  tracesSampleRate: 0.1,
  profilesSampleRate: 0.1, // profile 10% of traced transactions
  integrations: [nodeProfilingIntegration()],
});
```
Profiling shows you a flame chart of CPU usage during a transaction. Common findings in AI apps:
- `JSON.parse()` of large LLM responses taking 20-50ms
- Token counting libraries (`tiktoken`) adding 10-30ms per call
- Markdown-to-HTML rendering of LLM output blocking the response
- Cosine similarity computed in JavaScript instead of SQL/pgvector
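The last bullet deserves a concrete sketch. A pure-JavaScript cosine similarity like the one below is what shows up as a wide flame-chart block when looped over thousands of chunks; pgvector's `<=>` operator computes the equivalent cosine distance inside Postgres. The SQL in the comment is a hypothetical query, not part of the pipeline above:

```typescript
// The slow path: O(n · d) similarity math running on the Node event loop
// for every query — a prime flame-chart suspect.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// The fast path moves the same math into Postgres, where `<=>` is
// pgvector's cosine-distance operator (hypothetical query):
//   SELECT content, 1 - (embedding <=> $1) AS similarity
//   FROM chunks
//   ORDER BY embedding <=> $1
//   LIMIT 5;
```

The database-side version also benefits from pgvector's indexes, which the JavaScript loop can never use.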
## Performance Alerts and Regression Detection
Sentry automatically detects performance regressions when you deploy a new version. Configure thresholds under **Project Settings → Performance → Thresholds**. Example thresholds for an AI chat endpoint:

- p95 duration: 8000ms (8 seconds)
- p50 duration: 3000ms (3 seconds)
- Error rate: 1%

Sentry alerts you when a new deployment exceeds these thresholds.
Use Sentry's "Compare to baseline" feature when investigating a slow release. It overlays the p50/p95 latency of the new version against the previous version, making regressions immediately visible.

## Connecting Frontend and Backend Traces
```typescript
// Client component — inject trace context into API calls
import * as Sentry from "@sentry/nextjs";

async function askQuestion(question: string) {
  // Sentry automatically propagates trace headers
  // when using its fetch instrumentation
  const response = await fetch("/api/ask", {
    method: "POST",
    body: JSON.stringify({ question }),
    headers: { "Content-Type": "application/json" },
  });
  return response.json();
}

// The resulting trace connects the browser click event → API route →
// vector search → LLM call in a single end-to-end trace
```
With trace propagation enabled, Sentry links the browser's frontend performance (button click → response rendered) to your backend spans. You see the full user experience — including browser rendering time — not just API latency.
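If cross-service traces aren't linking up, the client-side init is the first thing to check. A sketch of the relevant options (the sample rate and target patterns are assumptions for a local Next.js setup, not prescribed values):

```typescript
// sentry.client.config.ts — connect browser spans to backend traces
import * as Sentry from "@sentry/nextjs";

Sentry.init({
  dsn: process.env.NEXT_PUBLIC_SENTRY_DSN,
  tracesSampleRate: 0.1,
  integrations: [Sentry.browserTracingIntegration()],
  // Attach sentry-trace/baggage headers only to your own API —
  // third-party origins (e.g. api.openai.com) may reject unknown headers.
  tracePropagationTargets: ["localhost", /^\/api\//],
});
```

`tracePropagationTargets` is the switch that controls which outgoing requests carry the trace headers; if your API origin isn't matched, the backend spans appear as a separate, unlinked trace.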