What the LiteLLM Proxy Adds
The LiteLLM Python library is useful when you control the calling code. The LiteLLM proxy server is useful when you have multiple services, multiple developers, or multiple agents all needing LLM access — and you want to centralise key management, logging, and rate limiting.
The proxy exposes an OpenAI-compatible HTTP API. Any service that can talk to OpenAI can talk to your proxy — with no code changes other than the base URL.
- Centralised API key management — your team never sees provider keys directly.
- Virtual keys — issue per-user or per-service keys with their own budget and rate limits.
- Load balancing — round-robin or least-busy routing across multiple model deployments.
- Redis caching — avoid duplicate LLM calls for identical prompts.
- Unified logging — all calls through one place, one log format.
Quick Start with Docker
```shell
# Create a config file first (see next section), then:
docker run -d \
  -p 4000:4000 \
  -e OPENAI_API_KEY=$OPENAI_API_KEY \
  -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
  -v $(pwd)/litellm_config.yaml:/app/config.yaml \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml --port 4000
```
The proxy is now running at http://localhost:4000. To verify:
```shell
curl http://localhost:4000/health
```
The config.yaml — Where Everything Lives
All proxy behaviour is controlled by config.yaml. Here is a production-ready starting point:
```yaml
model_list:
  # Primary: Claude for quality tasks
  - model_name: claude-quality
    litellm_params:
      model: claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY

  # Fast: GPT-4o-mini for high-volume cheap tasks
  - model_name: gpt-fast
    litellm_params:
      model: gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY

  # Load-balanced pool: multiple GPT-4o deployments
  - model_name: gpt4o-pool
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY_1
  - model_name: gpt4o-pool
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY_2

litellm_settings:
  # Global fallback order
  fallbacks: [{"gpt4o-pool": ["claude-quality"]}]
  num_retries: 3
  request_timeout: 60
  # Drop unsupported params silently
  drop_params: true

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL  # PostgreSQL for virtual keys
```
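If the deployments in a pool have different capacities, LiteLLM also accepts per-deployment rpm and tpm hints under litellm_params, which the router takes into account when distributing traffic. A sketch extending the pool above (the limit values are illustrative):

```yaml
model_list:
  - model_name: gpt4o-pool
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY_1
      rpm: 500      # requests per minute this deployment can absorb
      tpm: 300000   # tokens per minute
```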
The os.environ/VARIABLE_NAME syntax tells LiteLLM to read the value from the environment at startup, so your keys never appear in the config file itself.

Load Balancing Strategies
When you define multiple deployments under the same model_name, LiteLLM distributes traffic between them. Three strategies are available:
| Strategy | Config value | Best for |
|---|---|---|
| Round robin (default) | "simple-shuffle" | Even distribution across identical deployments |
| Least busy | "least-busy" | Azure OpenAI with TPM/RPM limits — routes to least loaded |
| Latency-based | "latency-based-routing" | Route to fastest-responding deployment |
```yaml
router_settings:
  routing_strategy: latency-based-routing
  num_retries: 3
  retry_after: 5  # seconds between retries
```
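To make the difference between the strategies concrete, here is a toy sketch of the selection logic. This is not LiteLLM's internal code, just the idea behind simple-shuffle versus least-busy:

```python
import random

# Illustrative pool: two deployments with different in-flight load
deployments = [
    {"name": "gpt4o-1", "active_requests": 7},
    {"name": "gpt4o-2", "active_requests": 2},
]

def simple_shuffle(pool):
    # Spread traffic uniformly across the pool, ignoring load
    return random.choice(pool)["name"]

def least_busy(pool):
    # Route to the deployment with the fewest in-flight requests
    return min(pool, key=lambda d: d["active_requests"])["name"]

print(least_busy(deployments))  # gpt4o-2
```

Least-busy matters most when deployments have hard TPM/RPM ceilings: routing by load keeps any one deployment from hitting its limit while others sit idle.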
Redis Caching
Caching identical prompts saves money and reduces latency. LiteLLM supports exact-match caching via Redis:
```yaml
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: os.environ/REDIS_HOST
    port: 6379
    password: os.environ/REDIS_PASSWORD
    ttl: 3600  # cache for 1 hour
    # Semantic caching (requires an embedding model)
    # similarity_threshold: 0.8
```
With exact-match caching, identical messages to the same model return the cached response instantly. Useful for FAQ chatbots, template-driven workflows, or any scenario with repeated prompts.
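Conceptually, an exact-match cache key is a hash of the model name plus the serialized messages, so any byte-level difference produces a new key. A simplified sketch (not LiteLLM's actual key format):

```python
import hashlib
import json

def cache_key(model, messages):
    # Serialize deterministically, then hash: identical inputs -> identical key
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

a = cache_key("gpt-fast", [{"role": "user", "content": "Hello"}])
b = cache_key("gpt-fast", [{"role": "user", "content": "Hello"}])
c = cache_key("gpt-fast", [{"role": "user", "content": "Hello!"}])

assert a == b  # exact repeat: cache hit
assert a != c  # one character changed: cache miss
```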
Caching is exact-match by default: a single character difference in the messages array produces a cache miss. For fuzzy matching, enable semantic caching, but note that it adds an embedding call of overhead per request.

Virtual Keys and Team Access Control
With a PostgreSQL database configured, you can issue virtual keys to individual developers, services, or agents. Each key gets its own budget and rate limits.
```shell
# Generate a virtual key via the admin API
curl -X POST http://localhost:4000/key/generate \
  -H 'Authorization: Bearer YOUR_MASTER_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "key_alias": "staging-pipeline",
    "models": ["gpt-fast", "gpt4o-pool"],
    "max_budget": 5.00,
    "budget_duration": "30d",
    "tpm_limit": 100000,
    "rpm_limit": 100
  }'
```
The response contains a key like sk-... that the service uses as its API key against the proxy. It cannot exceed its budget or rate limits, and it can only call the models you specified.
This pattern is valuable for multi-tenant AI platforms or large teams where you want to enforce per-team cost accountability without exposing your underlying provider keys.
Calling the Proxy from Any Service
Because the proxy is OpenAI-compatible, any service can use it by changing only the base URL:
```python
from openai import OpenAI

# Point at your proxy instead of OpenAI directly
client = OpenAI(
    base_url="http://localhost:4000",
    api_key="sk-your-virtual-key",
)

response = client.chat.completions.create(
    model="claude-quality",  # your model_name from config
    messages=[{"role": "user", "content": "Hello"}],
)
```
This means LangChain, LlamaIndex, CrewAI, LangGraph, and any other framework that accepts an OpenAI-compatible endpoint works with your LiteLLM proxy without any framework-specific configuration.
Health Checks and Monitoring
```shell
# Check all model deployments
curl http://localhost:4000/health

# Check that the proxy itself is up and ready to serve traffic
curl http://localhost:4000/health/readiness

# View spend across all keys
curl http://localhost:4000/spend/logs \
  -H 'Authorization: Bearer YOUR_MASTER_KEY'
```
The proxy exposes a Prometheus metrics endpoint at /metrics. Useful metrics to alert on: litellm_request_total, litellm_failed_requests_total, litellm_deployment_latency_seconds.
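The failure-rate metrics above translate directly into an alert. A sketch of a Prometheus alerting rule using those metric names (the threshold and durations are illustrative, tune them to your traffic):

```yaml
groups:
  - name: litellm-proxy
    rules:
      - alert: LiteLLMHighFailureRate
        # Fire when more than 5% of requests fail over a 5-minute window
        expr: >
          rate(litellm_failed_requests_total[5m])
          / rate(litellm_request_total[5m]) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "More than 5% of LiteLLM proxy requests are failing"
```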
Common Production Mistakes
- Running a single proxy container with no redundancy — the proxy itself becomes a single point of failure. Run at least two instances behind a load balancer.
- Using in-memory storage for virtual keys — these disappear on restart. Always configure a PostgreSQL database_url in production.
- Not setting request_timeout — without this, a slow provider can block your application indefinitely.
- Forgetting to rotate the master key — anyone with the master key can create unlimited virtual keys and run up your provider bills.
- Assuming streamed responses are cached — LiteLLM does not cache streamed completions by default. If you call with stream=True, the cache is bypassed entirely.
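Putting the redundancy and persistence advice together, a minimal production topology might look like the docker-compose sketch below. Service names, credentials, and the choice of load balancer are illustrative assumptions, not a prescribed setup:

```yaml
services:
  litellm-1:
    image: ghcr.io/berriai/litellm:main-latest
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    environment:
      DATABASE_URL: postgresql://litellm:changeme@postgres:5432/litellm
      LITELLM_MASTER_KEY: ${LITELLM_MASTER_KEY}

  litellm-2:
    # Second identical instance; front both with nginx, HAProxy, or your LB
    image: ghcr.io/berriai/litellm:main-latest
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    environment:
      DATABASE_URL: postgresql://litellm:changeme@postgres:5432/litellm
      LITELLM_MASTER_KEY: ${LITELLM_MASTER_KEY}

  postgres:
    # Persistent storage so virtual keys survive proxy restarts
    image: postgres:16
    environment:
      POSTGRES_USER: litellm
      POSTGRES_PASSWORD: changeme
      POSTGRES_DB: litellm

  redis:
    # Shared cache across both proxy instances
    image: redis:7
```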