The Multi-Model Problem
Most serious AI applications end up using more than one LLM. Claude for reasoning-heavy tasks. GPT-4o for speed. Gemini Flash when cost matters. A local Llama model for sensitive data that cannot leave your infrastructure.
The problem: each provider has a different SDK, different error types, different parameter names, and different retry behaviour. Without an abstraction layer, switching models means rewriting calling code. LiteLLM solves this.
LiteLLM is an open-source Python library and optional proxy server that provides a single OpenAI-compatible interface for 100+ LLM providers. You call one function; LiteLLM routes to whichever model you specify.
Installation and First Call
pip install litellm
from litellm import completion
# OpenAI
response = completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain vector embeddings in one paragraph"}]
)
# Anthropic — same call, different model string
response = completion(
    model="claude-sonnet-4-6",
    messages=[{"role": "user", "content": "Explain vector embeddings in one paragraph"}]
)
# Gemini
response = completion(
    model="gemini/gemini-2.0-flash",
    messages=[{"role": "user", "content": "Explain vector embeddings in one paragraph"}]
)
# Ollama (local)
response = completion(
    model="ollama/llama3.2",
    messages=[{"role": "user", "content": "Explain vector embeddings in one paragraph"}],
    api_base="http://localhost:11434"
)
print(response.choices[0].message.content)
The response object follows the OpenAI format regardless of which provider was called. Your downstream code does not need to change when you swap models.
Set your API keys as environment variables: OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY. LiteLLM picks them up automatically.
Model Fallbacks
Fallbacks are the feature that earns LiteLLM its place in production systems. If your primary model fails — rate limit, outage, timeout — LiteLLM automatically retries with the next model in the list.
from litellm import completion
response = completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarise this contract"}],
    fallbacks=["claude-sonnet-4-6", "gemini/gemini-2.0-flash"],
    num_retries=2,
    timeout=30
)
This pattern is particularly useful for high-availability pipelines where a single provider outage would otherwise bring down your product.
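Conceptually, this is equivalent to a retry-then-fallback loop you would otherwise write by hand. A minimal sketch of that pattern follows; call_with_fallbacks is a hypothetical helper (not LiteLLM's actual implementation), and call_fn stands in for litellm.completion:

```python
def call_with_fallbacks(models, call_fn, num_retries=2):
    # Try each model in order; retry transient failures before moving on.
    # call_fn stands in for litellm.completion and is invoked as call_fn(model).
    last_err = None
    for model in models:
        for _attempt in range(num_retries + 1):
            try:
                return call_fn(model)
            except Exception as err:
                last_err = err
    # Every model and retry exhausted: surface the final error.
    raise last_err
```

LiteLLM runs this loop for you, with provider-aware exception handling, so your application code stays a single completion() call.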
Context Window Fallbacks
LiteLLM also supports context window fallbacks — automatically routing to a larger-context model when your input exceeds the primary model's limit:
from litellm import completion
response = completion(
    model="gpt-4o-mini",
    messages=very_long_messages,
    context_window_fallback_dict={
        "gpt-4o-mini": "gpt-4o",
        "gpt-4o": "claude-opus-4-6"
    }
)
Routing on input size this way prevents ContextWindowExceededError from crashing your pipeline when users submit unusually long inputs.
Cost Tracking
LiteLLM calculates cost automatically for every completion call using up-to-date provider pricing tables:
from litellm import completion
response = completion(
    model="claude-sonnet-4-6",
    messages=[{"role": "user", "content": "Write a product description"}]
)
# Cost in USD for this call
print(response._hidden_params["response_cost"])
# e.g. 0.00043
# Token counts
print(response.usage.prompt_tokens)
print(response.usage.completion_tokens)
For aggregated cost tracking across your application, use the success callback:
import litellm
def track_cost(kwargs, completion_response, start_time, end_time):
    cost = completion_response._hidden_params.get("response_cost", 0)
    model = kwargs.get("model")
    user = kwargs.get("user", "unknown")
    # write to your database, Datadog, whatever
    print(f"[COST] user={user} model={model} cost=${cost:.6f}")

litellm.success_callback = [track_cost]
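During development, the callback can feed a simple in-memory aggregator before you wire up a real store. CostLedger and track_cost_to_ledger below are hypothetical helpers, not part of LiteLLM:

```python
from collections import defaultdict

class CostLedger:
    # Minimal per-user cost aggregator; swap for your database or metrics backend.
    def __init__(self):
        self.by_user = defaultdict(float)

    def record(self, user, cost):
        self.by_user[user] += cost

    def total(self):
        return sum(self.by_user.values())

ledger = CostLedger()

def track_cost_to_ledger(kwargs, completion_response, start_time, end_time):
    # response_cost can be missing or None for providers without pricing data.
    cost = completion_response._hidden_params.get("response_cost") or 0.0
    ledger.record(kwargs.get("user", "unknown"), cost)

# Register it the same way: litellm.success_callback = [track_cost_to_ledger]
```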
Per-User Budget Limits
If you are building a product where each user calls LLMs through your backend, you will want to cap how much each user can spend. LiteLLM's budget manager handles this:
from litellm.budget_manager import BudgetManager
budget_manager = BudgetManager(project_name="myapp", client_type="local")
user_id = "user-123"
# Create a $0.10 daily budget for this user
budget_manager.create_budget(
    total_budget=0.10,
    user=user_id,
    duration="daily"
)
# Before each call, check if the user has budget remaining
if budget_manager.get_current_cost(user=user_id) < budget_manager.get_total_budget(user=user_id):
    response = completion(
        model="gpt-4o-mini",
        messages=messages,
        user=user_id
    )
    budget_manager.update_cost(completion_obj=response, user=user_id)
else:
    raise Exception("Daily budget exceeded for this user")
The local client_type stores budgets in a JSON file. For production multi-server setups, use client_type='hosted' to store budgets in LiteLLM's managed service, or implement your own Redis-backed storage.
Async Support
For high-throughput applications, use the async variant:
from litellm import acompletion
import asyncio
async def run_parallel_calls():
    tasks = [
        acompletion(model="gpt-4o-mini", messages=[{"role": "user", "content": f"Summarise item {i}"}])
        for i in range(10)
    ]
    responses = await asyncio.gather(*tasks)
    return [r.choices[0].message.content for r in responses]
results = asyncio.run(run_parallel_calls())
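Note that asyncio.gather fires all ten requests at once, which can trip provider rate limits at higher volumes. A semaphore caps how many are in flight; in this sketch, fake_acompletion is a stub standing in for litellm.acompletion so the pattern runs offline:

```python
import asyncio

async def fake_acompletion(model, messages):
    # Stub for litellm.acompletion so this sketch runs without API keys.
    await asyncio.sleep(0.01)
    return f"summary of {messages[0]['content']}"

async def bounded_calls(prompts, limit=3):
    sem = asyncio.Semaphore(limit)  # at most `limit` requests in flight

    async def one(prompt):
        async with sem:
            return await fake_acompletion(
                "gpt-4o-mini", [{"role": "user", "content": prompt}]
            )

    return await asyncio.gather(*(one(p) for p in prompts))

results = asyncio.run(bounded_calls([f"item {i}" for i in range(10)]))
```

Swap the stub for the real acompletion and tune the limit to your provider's rate tier.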
Model Parameter Mapping
Different providers use different parameter names. LiteLLM normalises the most common ones automatically:
| Your parameter | Maps to (OpenAI) | Maps to (Anthropic) | Maps to (Gemini) |
|---|---|---|---|
| max_tokens | max_tokens | max_tokens | max_output_tokens |
| temperature | temperature | temperature | temperature |
| stop | stop | stop_sequences | stop_sequences |
| top_p | top_p | top_p | top_p |
| stream | stream | stream | stream |
By default, passing a parameter the target provider does not support raises an error; set litellm.drop_params = True (or pass drop_params=True on the call) and unsupported parameters are dropped instead, so the same call works across providers without per-provider conditionals in your code.
When NOT to Use LiteLLM
LiteLLM is not always the right choice:
- If you only ever call one provider — the abstraction layer adds unnecessary complexity.
- If you need provider-specific features that LiteLLM does not yet expose (e.g. Anthropic extended thinking, OpenAI o-series reasoning effort). You can pass extra_body parameters, but coverage is patchy.
- If latency is critical — the library adds a small overhead per call (typically 5-15ms). Usually acceptable, but worth measuring.
- If you are using the LiteLLM proxy server in a team setting — the proxy is excellent for centralised key management and logging, but it adds operational complexity and becomes a single point of failure if not run in HA mode.
Summary
LiteLLM earns its place as infrastructure in any multi-model AI application. The value is not just convenience — it is resilience (fallbacks), cost visibility (per-call and per-user tracking), and the ability to swap models without code changes. Start with the Python library; add the proxy server if your team needs centralised key management or cross-service logging.