How to deploy embedding models, LLMs, and batch jobs on serverless GPU with Modal

What Modal is

Modal is a serverless compute platform specialised for Python data and AI workloads. You write a regular Python function, decorate it with @app.function, and Modal runs it on cloud infrastructure — including GPUs — without you managing servers, Docker images, or Kubernetes clusters.

The key difference from Lambda or Cloud Run: Modal is designed for GPU workloads, supports large container images with ML dependencies, and handles model loading efficiently through its volume and container lifecycle system.

Installation and setup

pip install modal
modal setup   # authenticate with your Modal account
 

Your first GPU function

# embed.py
import modal
 
app = modal.App('text-embeddings')
 
# Define the container image with your dependencies
image = modal.Image.debian_slim().pip_install(
    'sentence-transformers', 'torch'
)
 
@app.function(
    image=image,
    gpu='T4',          # cheapest GPU option
    timeout=300,
)
def embed_texts(texts: list[str]) -> list[list[float]]:
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer('BAAI/bge-small-en-v1.5')
    return model.encode(texts).tolist()
 
# Entrypoint invoked by `modal run embed.py`
@app.local_entrypoint()
def main():
    embeddings = embed_texts.remote(['Hello world', 'How are you?'])
    print(f'Got {len(embeddings)} embeddings of dim {len(embeddings[0])}')
 
# Deploy and run
modal run embed.py
 
# Or deploy as a persistent endpoint
modal deploy embed.py
 
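The embeddings come back as plain Python lists, so downstream similarity comparisons need no extra dependencies. A minimal sketch of cosine similarity — the vectors here are made up for illustration, not real model output:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings
v1 = [0.1, 0.3, 0.5]
v2 = [0.2, 0.1, 0.4]
print(round(cosine_similarity(v1, v2), 4))  # → 0.9221
```

In practice you would run this over the lists returned by embed_texts.remote(...) to rank documents against a query.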

GPU options and cost

| GPU       | VRAM  | Good for                                  | Approx cost/hr |
|-----------|-------|-------------------------------------------|----------------|
| T4        | 16 GB | Small models, embeddings, 7B inference    | $0.59          |
| A10G      | 24 GB | 13B models, stable diffusion, batch work  | $1.10          |
| A100-40GB | 40 GB | 70B models, fast inference                | $3.04          |
| A100-80GB | 80 GB | Large models, fine-tuning                 | $4.69          |
| H100      | 80 GB | Fastest inference, training               | $8.98          |
Start with T4 for development and small models. Move to A10G for 13B+ models. You pay only when the function is running — no idle cost.
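Because billing is per-second while the function runs, a back-of-envelope cost estimate is straightforward. The throughput figure below is a made-up assumption for illustration — measure your own model's:

```python
# Rough cost estimate for a hypothetical embedding batch job on a T4.
T4_DOLLARS_PER_HOUR = 0.59   # from the table above
texts_to_embed = 1_000_000
texts_per_second = 2_000     # hypothetical throughput for a small embedding model

seconds = texts_to_embed / texts_per_second
cost = (seconds / 3600) * T4_DOLLARS_PER_HOUR
print(f'{seconds:.0f} s of GPU time ≈ ${cost:.2f}')  # → 500 s of GPU time ≈ $0.08
```

Even generous error bars on throughput leave batch embedding far cheaper than per-token hosted APIs at this volume.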

The Model class: keep models loaded between calls

Without it, every function call cold-starts the container and reloads the model, which can take 30-60 seconds. Decorating a class with @app.cls() and marking a setup method with @modal.enter() loads the model once per container and keeps it warm between calls.

import modal
 
app = modal.App('llm-inference')
 
image = (
    modal.Image.debian_slim()
    .pip_install('transformers', 'torch', 'accelerate')
)
 
@app.cls(
    image=image,
    gpu='A10G',
    container_idle_timeout=300,  # keep warm for 5 min after last call
)
class LLMModel:
 
    @modal.enter()
    def load(self):
        """Runs ONCE when the container starts."""
        from transformers import pipeline
        self.pipe = pipeline(
            'text-generation',
            model='mistralai/Mistral-7B-Instruct-v0.2',
            device_map='auto',
            torch_dtype='auto',
        )
        print('Model loaded!')
 
    @modal.method()
    def generate(self, prompt: str, max_new_tokens: int = 256) -> str:
        """Called on every inference request."""
        result = self.pipe(
            prompt,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
        )
        return result[0]['generated_text'][len(prompt):]
 
# Usage (entrypoint for `modal run`):
@app.local_entrypoint()
def main():
    response = LLMModel().generate.remote('Explain neural networks simply:')
    print(response)
 

Storing model weights with Volumes

Downloading large model weights on every cold start wastes time. Use Modal Volumes to persist weights across container restarts.

import modal
 
app = modal.App('llm-with-volume')
 
# Create a persistent volume for model weights
volume = modal.Volume.from_name('model-weights', create_if_missing=True)
 
image = modal.Image.debian_slim().pip_install('huggingface_hub', 'transformers', 'torch')
 
@app.function(image=image, volumes={'/models': volume})
def download_model():
    """Run this once to download model to the volume."""
    from huggingface_hub import snapshot_download
    snapshot_download(
        'mistralai/Mistral-7B-Instruct-v0.2',
        local_dir='/models/mistral-7b'
    )
    volume.commit()  # flush the new files so other containers see them
    print('Downloaded!')
 
@app.cls(image=image, gpu='A10G', volumes={'/models': volume})
class FastLLM:
 
    @modal.enter()
    def load(self):
        from transformers import pipeline
        # Load from volume — no download needed
        self.pipe = pipeline(
            'text-generation',
            model='/models/mistral-7b',
            torch_dtype='auto',  # half precision per the model config; fits on A10G
            device=0,            # place the model on the GPU
        )
 
    @modal.method()
    def generate(self, prompt: str) -> str:
        return self.pipe(prompt, max_new_tokens=200)[0]['generated_text']
 
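A possible driver tying the two together — a sketch, not part of the Modal API; the function name, default prompt, and re-run comment are illustrative:

```python
# Hypothetical entrypoint for `modal run`: fetch weights once, then query.
@app.local_entrypoint()
def main(prompt: str = 'Summarize the history of GPUs in one sentence.'):
    download_model.remote()   # cheap to re-run once the weights are in the volume
    print(FastLLM().generate.remote(prompt))
```

`modal run` maps local_entrypoint parameters to CLI flags, so `modal run llm.py --prompt "..."` overrides the default.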

Exposing as a web endpoint

Turn any Modal function into an HTTP endpoint with @modal.web_endpoint.

from fastapi import Request

# Web endpoints need FastAPI inside the container image
web_image = image.pip_install('fastapi[standard]')

@app.function(image=web_image, gpu='T4')
@modal.web_endpoint(method='POST')
async def embed_endpoint(request: Request):
    data = await request.json()
    texts = data['texts']
    # NB: reloading the model on every request is slow; for production,
    # use @app.cls() with @modal.enter() as shown above to keep it warm
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer('BAAI/bge-small-en-v1.5')
    embeddings = model.encode(texts).tolist()
    return {'embeddings': embeddings}
 
# After modal deploy, you get a stable URL:
# https://your-workspace--text-embeddings-embed-endpoint.modal.run
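Once deployed, any HTTP client can call it. A sketch with curl — substitute your own workspace URL for the placeholder:

```shell
curl -X POST \
  -H 'Content-Type: application/json' \
  -d '{"texts": ["Hello world", "How are you?"]}' \
  'https://your-workspace--text-embeddings-embed-endpoint.modal.run'
```

The response is the JSON body the function returns, e.g. {"embeddings": [[...], [...]]}.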
 

When to use Modal vs hosted LLM APIs

| Situation                              | Use Modal               | Use hosted API (OpenAI, Anthropic)  |
|----------------------------------------|-------------------------|-------------------------------------|
| Need a specific open-source model      | Yes                     | No                                  |
| Data privacy / no external API calls   | Yes                     | No                                  |
| High-volume batch processing           | Yes — cheaper at scale  | Expensive                           |
| Low latency single requests            | Cold starts hurt        | Yes                                 |
| Quick prototyping                      | Overkill                | Yes                                 |
| Fine-tuned private model               | Yes                     | Only if provider offers fine-tuning |
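For the high-volume batch row, Modal's .map() is the main tool: it fans a function out over an iterable of inputs, running many containers in parallel. A sketch assuming the embed_texts function from the first example — the corpus and chunk size are made up:

```python
# Hypothetical batch driver: parallel embedding of a large corpus.
@app.local_entrypoint()
def batch_embed():
    corpus = [f'document {i}' for i in range(10_000)]
    # One .map() input per chunk; Modal scales containers automatically
    chunks = [corpus[i:i + 256] for i in range(0, len(corpus), 256)]
    results = []
    for embeddings in embed_texts.map(chunks):
        results.extend(embeddings)
    print(f'Embedded {len(results)} documents')
```

Each chunk becomes one function call, so throughput scales with however many containers Modal spins up, and you still pay only for the seconds each one runs.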