Mastra does not persist workflow execution state by default. Here is what breaks, why, and four patterns to fix it.

The Problem

Mastra's workflow system is excellent for building multi-step AI pipelines. The suspend/resume API makes human-in-the-loop workflows straightforward to model. There is one catch that the docs understate: by default, Mastra stores workflow execution state in memory. When your server restarts -- after a deployment, a crash, or a Vercel cold start -- any suspended workflow is gone. No trace, no recovery, no notification to the user waiting for their result.

This is GitHub Issue #5549 and a recurring theme across Mastra's community discussions. This article explains why it happens, what your options are, and the patterns that actually work in production.

Mastra's default storage adapter is in-memory. Any workflow suspended with workflow.suspend() will be permanently lost on server restart. This affects ALL self-hosted Mastra deployments that use suspend/resume, human-in-the-loop steps, or long-running workflows.

Why It Happens

Mastra assumes short-lived execution by default -- a request comes in, a workflow runs to completion, and the result is returned. When you introduce suspend() to pause a workflow and wait for human input, the workflow state has to live somewhere between the pause and the resume. Without a configured storage adapter, it lives in process memory. That memory is gone the moment the process restarts.

This is not a bug -- it is a configuration gap. The framework supports persistent storage; it just is not the default, and the docs do not make the production implications obvious during onboarding.
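To make the failure mode concrete, here is a toy sketch (plain TypeScript, not Mastra code) of why an in-memory store cannot outlive the process:

```typescript
// Toy illustration: a Map-backed store behaves like Mastra's default in-memory adapter.
type SuspendedRun = { runId: string; state: unknown };

class InMemoryStore {
  private runs = new Map<string, SuspendedRun>();
  suspend(run: SuspendedRun) { this.runs.set(run.runId, run); }
  getRun(runId: string) { return this.runs.get(runId); }
}

let store = new InMemoryStore();
store.suspend({ runId: 'run_abc123', state: { step: 'await-approval' } });
console.log(store.getRun('run_abc123') !== undefined); // true: found while the process lives

store = new InMemoryStore(); // a deploy or crash replaces the process (and its memory)
console.log(store.getRun('run_abc123') !== undefined); // false: the suspended run is gone
```

A persistent adapter moves the `runs` map out of process memory and into a database, which is all the patterns below really do.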

Pattern 1: LibSQL / Turso Persistent Storage

Mastra ships with a LibSQL storage adapter that uses SQLite locally and can connect to a Turso cloud database in production. This is the lowest-friction production option for most self-hosted deployments.

import { Mastra } from '@mastra/core';
import { LibSQLStore } from '@mastra/libsql';
 
const mastra = new Mastra({
  storage: new LibSQLStore({
    // Local SQLite for development:
    url: 'file:./mastra.db',
 
    // Turso for production:
    // url: process.env.TURSO_DATABASE_URL,
    // authToken: process.env.TURSO_AUTH_TOKEN,
  }),
});

With this configured, suspended workflows are written to the database. When your server restarts, you can resume them by run ID:

// Resuming a suspended workflow after server restart
const runId = 'run_abc123'; // stored in your DB when you called suspend()
 
const run = await mastra.getWorkflow('myWorkflow').getRunById(runId);
if (run && run.status === 'suspended') {
  await run.resume({ humanInput: 'approved' });
}
Always store the run ID in your own database (e.g. alongside the user record) when you call suspend(). This is your recovery key. If you only hold it in the Mastra store, a database wipe loses the reference entirely.
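A minimal sketch of that bookkeeping, using an in-memory array as a stand-in for your own database table (`PendingApproval`, `recordSuspension`, and `findRunForUser` are illustrative names, not Mastra APIs):

```typescript
// Sketch: persist the run ID alongside the user record at suspend time,
// so you can recover it even if the Mastra store is wiped.
interface PendingApproval {
  userId: string;
  runId: string;
  workflowName: string;
  suspendedAt: Date;
}

const pendingApprovals: PendingApproval[] = []; // stand-in for a real table

async function recordSuspension(userId: string, workflowName: string, runId: string) {
  pendingApprovals.push({ userId, runId, workflowName, suspendedAt: new Date() });
}

async function findRunForUser(userId: string): Promise<string | undefined> {
  return pendingApprovals.find((p) => p.userId === userId)?.runId;
}
```

Call `recordSuspension()` in the same place you call suspend(); `findRunForUser()` is then enough to rebuild the `getRunById` / `resume` flow above after any restart.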

Pattern 2: PostgreSQL for Multi-Instance Deployments

If you are running multiple Mastra instances behind a load balancer (or deploying to a platform like Railway or Render with multiple workers), you need a shared storage backend. PostgreSQL is the recommended option.

import { Mastra } from '@mastra/core';
import { PostgresStore } from '@mastra/pg';
 
const mastra = new Mastra({
  storage: new PostgresStore({
    connectionString: process.env.DATABASE_URL,
    // Optional: separate schema to avoid conflicts with your app tables
    schema: 'mastra',
  }),
});

With PostgreSQL, any instance can resume a workflow started by any other instance -- essential for autoscaling environments.
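One wrinkle in multi-instance setups: two instances can race to resume the same run. A Postgres advisory lock is one way to serialize them. Below is a sketch of a deterministic lock-key helper; the `pg` usage shown in the comment and all names are illustrative, not part of Mastra:

```typescript
import { createHash } from 'node:crypto';

// Derive a deterministic 64-bit advisory-lock key from a run ID, so that any
// two instances competing to resume the same run contend on the same lock.
function advisoryLockKey(runId: string): bigint {
  const digest = createHash('sha256').update(runId).digest();
  // Use the first 8 bytes as a signed 64-bit integer (a Postgres bigint).
  return digest.readBigInt64BE(0);
}

// Usage with any Postgres client (shown here with the `pg` package):
//
//   const { rows } = await client.query(
//     'SELECT pg_try_advisory_lock($1) AS acquired',
//     [advisoryLockKey(runId).toString()],
//   );
//   if (rows[0].acquired) {
//     await run.resume({ humanInput: 'approved' });
//     await client.query('SELECT pg_advisory_unlock($1)', [advisoryLockKey(runId).toString()]);
//   }
```

If `pg_try_advisory_lock` returns false, another instance is already resuming the run and this one can simply skip it.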

Pattern 3: Webhook-Based Resume for Human-in-the-Loop

The most robust pattern for human-in-the-loop workflows: when the workflow suspends, send a signed webhook URL to the human reviewer (via email, Slack, etc.). When they click approve or reject, the webhook fires and resumes the workflow. The server-restart problem becomes irrelevant because the resume is triggered externally.

// In your Mastra workflow step (runId is available from the workflow run context):
const approval = await step.suspend({
  payload: {
    message: 'Please review the generated report before sending.',
    approveUrl: `https://yourapp.com/api/resume?runId=${runId}&action=approve`,
    rejectUrl:  `https://yourapp.com/api/resume?runId=${runId}&action=reject`,
  },
});
 
// In your API route handler (e.g. Next.js App Router). Note that links clicked
// from an email arrive as GET requests, so in practice front these URLs with a
// small confirmation page that issues the POST:
// POST /api/resume?runId=...&action=...
export async function POST(req: Request) {
  const { runId, action } = Object.fromEntries(new URL(req.url).searchParams);
  const run = await mastra.getWorkflow('myWorkflow').getRunById(runId);
  if (!run || run.status !== 'suspended') {
    return Response.json({ error: 'run not found or not suspended' }, { status: 404 });
  }
  await run.resume({ approved: action === 'approve' });
  return Response.json({ ok: true });
}
Sign your webhook URLs with a short-lived HMAC token. Anyone with the URL can resume the workflow. Use a signed token in the query param and validate it on receipt before calling resume().
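A sketch of such a token using Node's built-in crypto; `RESUME_URL_SECRET`, the token format, and the function names are assumptions, not a Mastra feature:

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

const SECRET = process.env.RESUME_URL_SECRET ?? 'dev-only-secret'; // set a real secret in production

// Sign runId + action + expiry into a token appended to the resume URL.
function signResumeToken(runId: string, action: string, expiresAtMs: number): string {
  const payload = `${runId}.${action}.${expiresAtMs}`;
  const sig = createHmac('sha256', SECRET).update(payload).digest('hex');
  return `${expiresAtMs}.${sig}`;
}

// Validate before calling run.resume(): checks both expiry and signature.
function verifyResumeToken(runId: string, action: string, token: string, nowMs = Date.now()): boolean {
  const [expiresAtStr, sig] = token.split('.');
  const expiresAtMs = Number(expiresAtStr);
  if (!Number.isFinite(expiresAtMs) || nowMs > expiresAtMs) return false;
  const expected = createHmac('sha256', SECRET)
    .update(`${runId}.${action}.${expiresAtMs}`)
    .digest('hex');
  if (!sig || sig.length !== expected.length) return false;
  return timingSafeEqual(Buffer.from(sig), Buffer.from(expected));
}
```

The approve URL then carries `&token=${signResumeToken(runId, 'approve', Date.now() + 15 * 60_000)}`, and the route handler calls `verifyResumeToken` before touching the run.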

Pattern 4: Scheduled Recovery Job

Even with persistent storage, workflows can get stuck if a resume call fails silently (network error, downstream API down). Add a scheduled recovery job that scans for suspended workflows older than a threshold and either retries the resume or notifies an operator.

// Run this on a cron schedule (e.g. every 15 minutes). The getRuns filter below
// is illustrative -- check the query options your Mastra version and storage adapter support.
async function recoverSuspendedWorkflows() {
  const suspendedRuns = await mastra
    .getWorkflow('myWorkflow')
    .getRuns({ status: 'suspended', olderThanMinutes: 60 });
 
  for (const run of suspendedRuns) {
    console.warn(`Stale suspended run: ${run.id}, suspended at ${run.suspendedAt}`);
    // Option 1: alert the team (notifySlack is your own alerting helper)
    await notifySlack(`Workflow ${run.id} has been suspended for over 60 min`);
    // Option 2: auto-timeout and mark failed
    // await run.fail({ reason: 'Approval timeout exceeded' });
  }
}
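The two options in that loop can be folded into an explicit escalation policy. A sketch, with illustrative thresholds (alert after 60 minutes, auto-fail after a day):

```typescript
// Decide what the recovery job should do with a run, based on how long
// it has been suspended. Thresholds are illustrative -- tune to your SLA.
type RecoveryAction = 'wait' | 'alert' | 'fail';

function recoveryAction(suspendedAt: Date, nowMs = Date.now()): RecoveryAction {
  const ageMinutes = (nowMs - suspendedAt.getTime()) / 60_000;
  if (ageMinutes < 60) return 'wait';       // still within the normal approval window
  if (ageMinutes < 24 * 60) return 'alert'; // nudge the team
  return 'fail';                            // auto-timeout after a day
}
```

Keeping the policy in one pure function makes it trivial to unit-test and keeps the cron job itself a thin loop.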

Deployment Checklist

  • Configure a persistent storage adapter (LibSQLStore or PostgresStore) before deploying any workflow that uses suspend()
  • Store run IDs in your own database alongside user/request records
  • Use webhook-based resume for human-in-the-loop steps -- avoids dependency on server uptime
  • Add a scheduled recovery job to detect and handle stale suspended runs
  • Test recovery by: starting a workflow, suspending it, restarting the server, then verifying resume works
  • Sign all resume URLs with time-limited tokens