Fly.io's edge deployment model lets you run your application in 30+ regions simultaneously, routing each user to the nearest machine. For AI applications serving a global user base, this means lower response latency and the ability to keep the application tier that orchestrates LLM calls close to your users. This guide covers the practical scaling controls.

Understanding Fly Machines

Each Fly.io VM is a Firecracker microVM — a full Linux environment but lightweight enough to start in ~300ms. You have fine-grained control over how many machines run, in which regions, and on what hardware.

Horizontal Scaling: More Machines

# Scale to 3 machines in the current region
fly scale count 3
 
# Scale to 2 machines per region
fly scale count 2 --region lhr,ord,sin
 
# Check current machine count and status
fly status

Vertical Scaling: More CPU and RAM

# Scale to 4 vCPUs and 8 GB RAM
fly scale vm performance-4x
 
# Available VM sizes:
# shared-cpu-1x    (256 MB RAM) — smallest size
# shared-cpu-2x    (512 MB RAM)
# shared-cpu-4x    (1 GB RAM)
# performance-1x   (2 GB RAM)   — $0.0003124/min
# performance-2x   (4 GB RAM)
# performance-4x   (8 GB RAM)
# performance-8x   (16 GB RAM)

Multi-Region Deployment

# Add regions to your app
fly regions add ord  # Chicago
fly regions add sin  # Singapore
fly regions add gru  # São Paulo
 
# Remove a region
fly regions remove gru
 
# List active regions
fly regions list

Fly's anycast network automatically routes each request to the nearest healthy machine. If the nearest machine is busy, Fly routes to the next closest. No load balancer configuration required.
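Inside a machine you can confirm which region actually served a request: Fly injects the FLY_REGION environment variable into every machine. A minimal sketch (the 'local' fallback value is our own choice, not a Fly default):

```python
import os

def serving_region(env=os.environ):
    # Fly sets FLY_REGION (e.g. 'lhr', 'sin') inside every machine;
    # fall back to 'local' when running outside Fly.
    return env.get("FLY_REGION", "local")
```

Returning this value from a debug endpoint makes it easy to verify anycast routing by hitting the app from different locations.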

Autoscaling with fly.toml

[http_service]
  internal_port = 8080
  auto_stop_machines = 'stop'    # stop idle machines
  auto_start_machines = true     # start on incoming traffic
  min_machines_running = 1       # keep 1 running in the primary region (no cold start there)
 
  [http_service.concurrency]
    type = 'requests'
    hard_limit = 50   # max concurrent requests per machine
    soft_limit = 25   # start scaling up at 25 concurrent requests

For AI applications, set concurrency limits based on your LLM call patterns. If each request holds a connection open for 5–30 seconds while streaming an LLM response, a hard_limit of 20–50 is typically right for a 1–2 vCPU machine.
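To size soft_limit and machine count together, Little's law gives a quick estimate: average concurrency ≈ arrival rate × average request duration. A back-of-the-envelope helper (the example numbers are illustrative, not Fly defaults):

```python
import math

def machines_needed(peak_rps, avg_request_seconds, soft_limit):
    # Little's law: average concurrency = arrival rate x duration.
    concurrent = peak_rps * avg_request_seconds
    # Machines required so each stays at or below its soft_limit.
    return math.ceil(concurrent / soft_limit)

# 10 req/s, each streaming for ~15 s, soft_limit of 25:
# 150 concurrent requests / 25 per machine = 6 machines at peak.
```

This is the steady-state floor; Fly's autoscaler still needs headroom between soft_limit and hard_limit to absorb bursts while new machines start.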

Recipe: Global AI API with Regional Latency Optimisation

# fly.toml — run in 3 regions with autoscaling
app = 'global-ai-api'
primary_region = 'lhr'
 
[http_service]
  internal_port = 8080
  auto_stop_machines = 'stop'
  auto_start_machines = true
  min_machines_running = 1
 
  [http_service.concurrency]
    type = 'requests'
    hard_limit = 30
    soft_limit = 15
 
[[vm]]
  memory = '2gb'
  cpu_kind = 'performance'
  cpus = 2

# Deploy to all regions simultaneously
fly deploy
 
# Verify machines are running in all regions
fly status --all

Database Placement with Multi-Region

The biggest challenge with multi-region apps is database latency. If your database is only in London but your app also runs in Singapore, every request served from Singapore makes a round-trip halfway around the world to reach the database.

Solutions: use Fly's managed Postgres with read replicas in each region, use a globally distributed database (Neon, PlanetScale, Turso), or use a read-through cache (Upstash Redis) at the edge.
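With read replicas, the usual Fly pattern is to serve reads locally and send writes to the primary region via the fly-replay response header, which tells Fly's proxy to re-run the request on a machine there. A framework-agnostic sketch of that routing decision (the helper name and argument shape are ours):

```python
import os

# HTTP methods that mutate state and must hit the primary database.
WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}

def replay_headers(method, primary_region, env=os.environ):
    # In a replica region, ask Fly's proxy to replay write requests
    # in the primary region, where the writable database lives.
    if method in WRITE_METHODS and env.get("FLY_REGION") != primary_region:
        return {"fly-replay": f"region={primary_region}"}
    # Reads (and requests already in the primary) are handled locally.
    return {}
```

Returning these headers from middleware keeps read traffic on the nearby replica while writes transparently land next to the primary.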

Metadata               Value
Title                  Scaling on Fly.io: Multi-Region Deployment and Machine Autoscaling
Tool                   Fly.io
Primary SEO keyword    fly.io multi region scaling
Secondary keywords     fly.io autoscaling, fly.io regions, fly.io horizontal scaling, fly.io global deployment
Estimated read time    7 minutes
Research date          2026-04-14