What the LiteLLM Proxy Adds

The LiteLLM Python library is useful when you control the calling code. The LiteLLM proxy server is useful when you have multiple services, multiple developers, or multiple agents all needing LLM access — and you want to centralise key management, logging, and rate limiting.

The proxy exposes an OpenAI-compatible HTTP API. Any service that can talk to OpenAI can talk to your proxy — with no code changes other than the base URL.

  • Centralised API key management — your team never sees provider keys directly.
  • Virtual keys — issue per-user or per-service keys with their own budget and rate limits.
  • Load balancing — round-robin or least-busy routing across multiple model deployments.
  • Redis caching — avoid duplicate LLM calls for identical prompts.
  • Unified logging — all calls through one place, one log format.

Quick Start with Docker

# Create a config file first (see next section), then:
docker run -d \
  -p 4000:4000 \
  -e OPENAI_API_KEY=$OPENAI_API_KEY \
  -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
  -v $(pwd)/litellm_config.yaml:/app/config.yaml \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml --port 4000
 

The proxy is now running at http://localhost:4000. To verify:

curl http://localhost:4000/health
 

The config.yaml — Where Everything Lives

All proxy behaviour is controlled by config.yaml. Here is a production-ready starting point:

model_list:
  # Primary: Claude for quality tasks
  - model_name: claude-quality
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
 
  # Fast: GPT-4o-mini for high-volume cheap tasks
  - model_name: gpt-fast
    litellm_params:
      model: gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
 
  # Load balanced pool: multiple GPT-4o deployments
  - model_name: gpt4o-pool
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY_1
  - model_name: gpt4o-pool
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY_2
 
litellm_settings:
  # Global fallback order
  fallbacks: [{"gpt4o-pool": ["claude-quality"]}]
  num_retries: 3
  request_timeout: 60
  # Drop unsupported params silently
  drop_params: true
 
general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL  # PostgreSQL for virtual keys
 
The os.environ/VARIABLE_NAME syntax tells LiteLLM to read the value from the environment at startup, so your provider keys never appear in the config file itself.
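A missing environment variable is one of the most common startup failures. A small pre-flight check can catch it before you launch the container; a sketch (the function name and regex are illustrative, not part of LiteLLM):

```python
import os
import re

def missing_env_vars(config_path):
    """Return config-referenced environment variables that are not set."""
    with open(config_path) as f:
        text = f.read()
    # Find every os.environ/NAME reference in the config file
    names = set(re.findall(r"os\.environ/([A-Za-z0-9_]+)", text))
    return sorted(n for n in names if n not in os.environ)
```

Run it against litellm_config.yaml before docker run; an empty list means every referenced variable is set.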

Load Balancing Strategies

When you define multiple deployments under the same model_name, LiteLLM distributes traffic among them. Three strategies are available:

Strategy                Config value               Best for
Round robin (default)   "simple-shuffle"           Even distribution across identical deployments
Least busy              "least-busy"               Azure OpenAI with TPM/RPM limits; routes to the least loaded deployment
Latency-based           "latency-based-routing"    Routing to the fastest-responding deployment

Set the strategy in router_settings:

router_settings:
  routing_strategy: latency-based-routing
  num_retries: 3
  retry_after: 5  # seconds between retries
 
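To build intuition for what latency-based routing does, picture a rolling latency window per deployment, with requests sent to the current fastest. A toy sketch of the idea (illustrative only, not LiteLLM's actual implementation):

```python
from collections import defaultdict

class LatencyRouter:
    """Toy latency-based router: pick the deployment with the lowest average latency."""

    def __init__(self, window=20):
        self.window = window
        self.latencies = defaultdict(list)   # deployment name -> recent latencies

    def record(self, deployment, seconds):
        samples = self.latencies[deployment]
        samples.append(seconds)
        if len(samples) > self.window:       # keep only a sliding window of samples
            samples.pop(0)

    def pick(self, deployments):
        # Untried deployments average to 0.0, so they get traffic first
        def avg(d):
            samples = self.latencies[d]
            return sum(samples) / len(samples) if samples else 0.0
        return min(deployments, key=avg)
```

The real router also weighs in retries, cooldowns, and TPM/RPM budgets, but the core trade-off is the same: latency-based routing reacts to observed performance rather than distributing requests evenly.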

Redis Caching

Caching identical prompts saves money and reduces latency. LiteLLM supports exact-match caching via Redis:

litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: os.environ/REDIS_HOST
    port: 6379
    password: os.environ/REDIS_PASSWORD
    ttl: 3600  # cache for 1 hour
    # Semantic caching (requires embedding model)
    # similarity_threshold: 0.8
 

With exact-match caching, identical messages to the same model return the cached response instantly. Useful for FAQ chatbots, template-driven workflows, or any scenario with repeated prompts.

Caching is exact-match by default. A single character difference in the messages array produces a cache miss. For fuzzy matching, enable semantic caching — but this adds an embedding call overhead per request.
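The "exact match" behaviour is easiest to reason about if you picture the cache key as a hash of the serialized request. A conceptual sketch (not LiteLLM's internal key format):

```python
import hashlib
import json

def cache_key(model, messages):
    """Deterministic key: any change to model or messages changes the hash."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

same = cache_key("gpt-fast", [{"role": "user", "content": "Hello"}])
diff = cache_key("gpt-fast", [{"role": "user", "content": "Hello!"}])  # one char off
```

Because the key covers the whole request, a trailing space, a reordered message, or a different model name all produce a miss.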

Virtual Keys and Team Access Control

With a PostgreSQL database configured, you can issue virtual keys to individual developers, services, or agents. Each key gets its own budget and rate limits.

# Generate a virtual key via the admin API
curl -X POST http://localhost:4000/key/generate \
  -H 'Authorization: Bearer YOUR_MASTER_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "key_alias": "staging-pipeline",
    "models": ["gpt-fast", "gpt4o-pool"],
    "max_budget": 5.00,
    "budget_duration": "30d",
    "tpm_limit": 100000,
    "rpm_limit": 100
  }'
 

The response contains a key like sk-... that the service uses as its API key against the proxy. It cannot exceed its budget or rate limits, and it can only call the models you specified.

This pattern is valuable for multi-tenant AI platforms or large teams where you want to enforce per-team cost accountability without exposing your underlying provider keys.
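The enforcement model is straightforward to reason about: before forwarding a request, the proxy checks the key's model allow-list, remaining budget, and rate counters. A toy sketch of that logic (illustrative only; LiteLLM's real implementation is backed by PostgreSQL and tracks actual token costs):

```python
class VirtualKey:
    """Toy model of virtual-key enforcement: allow-list, budget, rate limit."""

    def __init__(self, models, max_budget, rpm_limit):
        self.models = set(models)
        self.max_budget = max_budget
        self.rpm_limit = rpm_limit
        self.spent = 0.0
        self.requests_this_minute = 0   # reset by a timer in a real system

    def authorize(self, model, estimated_cost):
        if model not in self.models:
            return False, "model not allowed"
        if self.spent + estimated_cost > self.max_budget:
            return False, "budget exceeded"
        if self.requests_this_minute >= self.rpm_limit:
            return False, "rate limited"
        self.requests_this_minute += 1
        self.spent += estimated_cost
        return True, "ok"
```

The useful property is that every check happens at the proxy, so a misbehaving service fails fast with a 4xx instead of silently running up a provider bill.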

Calling the Proxy from Any Service

Because the proxy is OpenAI-compatible, any service can use it by changing only the base URL:

from openai import OpenAI
 
# Point at your proxy instead of OpenAI directly
client = OpenAI(
    base_url="http://localhost:4000",
    api_key="sk-your-virtual-key"
)
 
response = client.chat.completions.create(
    model="claude-quality",  # your model_name from config
    messages=[{"role": "user", "content": "Hello"}]
)
 

This means LangChain, LlamaIndex, CrewAI, LangGraph, and any other framework that accepts an OpenAI-compatible endpoint works with your LiteLLM proxy without any framework-specific configuration.

Health Checks and Monitoring

# Check all model deployments
curl http://localhost:4000/health
 
# Check that the proxy itself is up and ready to serve traffic
curl http://localhost:4000/health/readiness
 
# View spend across all keys
curl http://localhost:4000/spend/logs \
  -H 'Authorization: Bearer YOUR_MASTER_KEY'
 

The proxy exposes a Prometheus metrics endpoint at /metrics. Useful metrics to alert on: litellm_request_total, litellm_failed_requests_total, litellm_deployment_latency_seconds.
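The /metrics output follows the standard Prometheus text exposition format, so an ad-hoc check needs nothing beyond the standard library. A minimal parser sketch (it ignores HELP/TYPE comments and assumes no spaces inside label values; the sample names in the test are illustrative):

```python
def parse_prometheus(text):
    """Parse Prometheus text exposition format into {metric_with_labels: value}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The value is the last space-separated token on the line
        name_and_labels, _, value = line.rpartition(" ")
        metrics[name_and_labels] = float(value)
    return metrics
```

For production alerting you would scrape /metrics with Prometheus itself; this is only for spot-checking from a shell or script.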

Common Production Mistakes

  • Running a single proxy container with no redundancy — the proxy itself becomes a single point of failure. Run at least two instances behind a load balancer.
  • Using in-memory storage for virtual keys — these disappear on restart. Always configure a PostgreSQL database_url in production.
  • Not setting request_timeout — without this, a slow provider can block your application indefinitely.
  • Forgetting to rotate the master key — anyone with the master key can create unlimited virtual keys and run up your provider bills.
  • Expecting streamed responses to be cached — LiteLLM does not cache streamed completions by default. If you call with stream=True, caching is bypassed.