The Multi-Model Problem

Most serious AI applications end up using more than one LLM. Claude for reasoning-heavy tasks. GPT-4o for speed. Gemini Flash when cost matters. A local Llama model for sensitive data that cannot leave your infrastructure.

The problem: each provider has a different SDK, different error types, different parameter names, and different retry behaviour. Without an abstraction layer, switching models means rewriting calling code. LiteLLM solves this.

LiteLLM is an open-source Python library and optional proxy server that provides a single OpenAI-compatible interface for 100+ LLM providers. You call one function; LiteLLM routes to whichever model you specify.

Installation and First Call

pip install litellm
 
from litellm import completion
 
# OpenAI
response = completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain vector embeddings in one paragraph"}]
)
 
# Anthropic — same call, different model string
response = completion(
    model="claude-sonnet-4-6",
    messages=[{"role": "user", "content": "Explain vector embeddings in one paragraph"}]
)
 
# Gemini
response = completion(
    model="gemini/gemini-2.0-flash",
    messages=[{"role": "user", "content": "Explain vector embeddings in one paragraph"}]
)
 
# Ollama (local)
response = completion(
    model="ollama/llama3.2",
    messages=[{"role": "user", "content": "Explain vector embeddings in one paragraph"}],
    api_base="http://localhost:11434"
)
 
print(response.choices[0].message.content)
 

The response object follows the OpenAI format regardless of which provider was called. Your downstream code does not need to change when you swap models.

Set your API keys as environment variables: OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY. LiteLLM picks them up automatically.
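For example (placeholder values; substitute your real keys):

```shell
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
export GEMINI_API_KEY="your-gemini-key"
```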

Model Fallbacks

Fallbacks are the feature that earns LiteLLM its place in production systems. If your primary model fails — rate limit, outage, timeout — LiteLLM automatically retries with the next model in the list.

from litellm import completion
 
response = completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarise this contract"}],
    fallbacks=["claude-sonnet-4-6", "gemini/gemini-2.0-flash"],
    num_retries=2,
    timeout=30
)
 

This pattern is particularly useful for high-availability pipelines where a single provider outage would otherwise bring down your product.
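Conceptually, the fallback behaviour is a try-in-order loop. A simplified pure-Python sketch of the pattern (not LiteLLM's actual implementation; `call` stands in for any provider call):

```python
def complete_with_fallbacks(call, models, messages):
    """Try each model in order; return the first successful response.

    `call` is any function call(model, messages) that raises on failure --
    a stand-in for a provider SDK call.
    """
    last_error = None
    for model in models:
        try:
            return model, call(model, messages)
        except Exception as exc:  # rate limit, outage, timeout, ...
            last_error = exc      # remember why this model failed
    raise RuntimeError(f"All models failed; last error: {last_error!r}")


# Demo with a fake provider: the primary model always fails.
def fake_call(model, messages):
    if model == "primary":
        raise TimeoutError("primary is down")
    return f"answer from {model}"

model, answer = complete_with_fallbacks(
    fake_call, ["primary", "backup"], [{"role": "user", "content": "hi"}]
)
```

LiteLLM adds retries, per-model timeouts, and exception mapping on top, but the routing logic your code relies on is this ordered loop.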

Context Window Fallbacks

LiteLLM also supports context window fallbacks — automatically routing to a larger-context model when your input exceeds the primary model's limit:

from litellm import completion
 
response = completion(
    model="gpt-4o-mini",
    messages=very_long_messages,
    context_window_fallback_dict={
        "gpt-4o-mini": "gpt-4o",
        "gpt-4o": "claude-opus-4-6"
    }
)
 
This prevents ContextWindowExceededError from crashing your pipeline when users submit unusually long inputs.
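You can also route proactively instead of waiting for the error. A rough sketch using the common ~4 characters per token heuristic (LiteLLM ships a `token_counter` utility for exact counts; the context limits below are illustrative, not authoritative):

```python
def estimate_tokens(messages):
    """Very rough token estimate: ~4 characters per token."""
    chars = sum(len(m["content"]) for m in messages)
    return chars // 4

def pick_model(messages, limits):
    """Return the first model whose context window fits the estimate.

    `limits` maps model name -> context window size, smallest first.
    """
    needed = estimate_tokens(messages)
    for model, limit in limits.items():
        if needed < limit:
            return model
    raise ValueError(f"Input (~{needed} tokens) exceeds every model's window")

# Illustrative limits only -- check your providers' documentation.
LIMITS = {"gpt-4o-mini": 128_000, "claude-opus-4-6": 200_000}

model = pick_model([{"role": "user", "content": "x" * 600_000}], LIMITS)
```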

Cost Tracking

LiteLLM calculates cost automatically for every completion call using up-to-date provider pricing tables:

from litellm import completion
 
response = completion(
    model="claude-sonnet-4-6",
    messages=[{"role": "user", "content": "Write a product description"}]
)
 
# Cost in USD for this call
print(response._hidden_params["response_cost"])
# e.g. 0.00043
# Or use the public helper, which avoids the private field:
# from litellm import completion_cost
# print(completion_cost(completion_response=response))
 
# Token counts
print(response.usage.prompt_tokens)
print(response.usage.completion_tokens)
 

For aggregated cost tracking across your application, use the success callback:

import litellm
 
def track_cost(kwargs, completion_response, start_time, end_time):
    cost = completion_response._hidden_params.get("response_cost", 0)
    model = kwargs.get("model")
    user = kwargs.get("user", "unknown")
    # write to your database, Datadog, whatever
    print(f"[COST] user={user} model={model} cost=${cost:.6f}")
 
litellm.success_callback = [track_cost]
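The callback can feed a small in-memory aggregator. A sketch of the idea (swap the dicts for your database or metrics backend in production):

```python
from collections import defaultdict

class CostAggregator:
    """Accumulates per-user and per-model spend in memory."""

    def __init__(self):
        self.by_user = defaultdict(float)
        self.by_model = defaultdict(float)

    def record(self, user, model, cost):
        self.by_user[user] += cost
        self.by_model[model] += cost

    def total(self):
        return sum(self.by_user.values())

agg = CostAggregator()

# Wired into the callback signature LiteLLM expects:
def track_cost(kwargs, completion_response, start_time, end_time):
    agg.record(
        user=kwargs.get("user", "unknown"),
        model=kwargs.get("model"),
        cost=completion_response._hidden_params.get("response_cost", 0),
    )

# litellm.success_callback = [track_cost]
```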
 

Per-User Budget Limits

If you are building a product where each user calls LLMs through your backend, you will want to cap how much each user can spend. LiteLLM's budget manager handles this:

from litellm.budget_manager import BudgetManager
 
budget_manager = BudgetManager(project_name="myapp", client_type="local")
 
user_id = "user-123"
 
# Create a $0.10 daily budget for this user
budget_manager.create_budget(
    total_budget=0.10,
    user=user_id,
    duration="daily"
)
 
# Before each call, check if the user has budget remaining
if budget_manager.get_current_cost(user=user_id) < budget_manager.get_total_budget(user=user_id):
    response = completion(
        model="gpt-4o-mini",
        messages=messages,
        user=user_id
    )
    budget_manager.update_cost(completion_obj=response, user=user_id)
else:
    raise RuntimeError(f"Daily budget exceeded for user {user_id}")
 
The "local" client_type stores budgets in a JSON file. For production multi-server setups, use client_type="hosted" to store budgets in LiteLLM's managed service, or implement your own Redis-backed storage.
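If you do roll your own, the storage interface is small. A minimal in-memory sketch (a hypothetical class, not part of LiteLLM; a Redis variant would replace the dict with Redis calls):

```python
import time

class SimpleBudgetStore:
    """Per-user daily spend cap backed by a dict.

    For multi-server production use, replace `self._spend` with a shared
    store such as Redis (e.g. INCRBYFLOAT on a key that expires daily).
    """

    def __init__(self, daily_limit_usd):
        self.daily_limit = daily_limit_usd
        self._spend = {}  # (user, day) -> cost in USD

    def _key(self, user):
        return (user, time.strftime("%Y-%m-%d"))

    def can_spend(self, user):
        return self._spend.get(self._key(user), 0.0) < self.daily_limit

    def add_cost(self, user, cost):
        key = self._key(user)
        self._spend[key] = self._spend.get(key, 0.0) + cost

store = SimpleBudgetStore(daily_limit_usd=0.10)
```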

Async Support

For high-throughput applications, use the async variant:

from litellm import acompletion
import asyncio
 
async def run_parallel_calls():
    tasks = [
        acompletion(model="gpt-4o-mini", messages=[{"role": "user", "content": f"Summarise item {i}"}])
        for i in range(10)
    ]
    responses = await asyncio.gather(*tasks)
    return [r.choices[0].message.content for r in responses]
 
results = asyncio.run(run_parallel_calls())
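An unbounded gather can trip provider rate limits. A concurrency cap with asyncio.Semaphore keeps throughput high without firing everything at once (pure-asyncio sketch; works with acompletion or any coroutine):

```python
import asyncio

async def gather_with_limit(coro_fns, limit):
    """Run coroutine factories with at most `limit` in flight at once.

    `coro_fns` is a list of zero-argument callables returning coroutines,
    so nothing starts until the semaphore allows it.
    """
    sem = asyncio.Semaphore(limit)

    async def run(fn):
        async with sem:
            return await fn()

    return await asyncio.gather(*(run(fn) for fn in coro_fns))


# Usage with LiteLLM would look like:
# tasks = [lambda i=i: acompletion(model="gpt-4o-mini",
#          messages=[{"role": "user", "content": f"Summarise item {i}"}])
#          for i in range(50)]
# responses = asyncio.run(gather_with_limit(tasks, limit=5))
```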
 

Model Parameter Mapping

Different providers use different parameter names. LiteLLM normalises the most common ones automatically:

Your parameter   Maps to (OpenAI)   Maps to (Anthropic)   Maps to (Gemini)
max_tokens       max_tokens         max_tokens            max_output_tokens
temperature      temperature        temperature           temperature
stop             stop               stop_sequences        stop_sequences
top_p            top_p              top_p                 top_p
stream           stream             stream                stream

By default, passing a parameter that a provider does not support raises an error rather than being silently dropped. Set litellm.drop_params = True globally (or pass drop_params=True per call) to drop unsupported parameters instead, so the same call works across providers without per-provider conditionals in your code.

When NOT to Use LiteLLM

LiteLLM is not always the right choice:

  • If you only ever call one provider — the abstraction layer adds unnecessary complexity.
  • If you need provider-specific features that LiteLLM does not yet expose (e.g. Anthropic extended thinking, OpenAI o-series reasoning effort). You can pass extra_body parameters, but coverage is patchy.
  • If latency is critical — the library adds a small overhead per call (typically 5-15ms). Usually acceptable, but worth measuring.
  • If you are using the LiteLLM proxy server in a team setting — the proxy is excellent for centralised key management and logging, but it adds operational complexity and becomes a single point of failure if not run in HA mode.

Summary

LiteLLM earns its place as infrastructure in any multi-model AI application. The value is not just convenience — it is resilience (fallbacks), cost visibility (per-call and per-user tracking), and the ability to swap models without code changes. Start with the Python library; add the proxy server if your team needs centralised key management or cross-service logging.