How to deploy embedding models, LLMs, and batch jobs on serverless GPU with Modal

What Modal is

Modal is a serverless compute platform specialised for Python data and AI workloads. You write a regular Python function, decorate it with @app.function, and Modal runs it on cloud infrastructure — including GPUs — without you managing servers, Docker images, or Kubernetes clusters.

The key difference from Lambda or Cloud Run: Modal is designed for GPU workloads, supports large container images with ML dependencies, and handles model loading efficiently through its volume and container lifecycle system.

Installation and setup

pip install modal
modal setup   # authenticate with your Modal account
 

Your first GPU function

# embed.py
import modal
 
app = modal.App('text-embeddings')
 
# Define the container image with your dependencies
image = modal.Image.debian_slim().pip_install(
    'sentence-transformers', 'torch'
)
 
@app.function(
    image=image,
    gpu='T4',          # cheapest GPU option
    timeout=300,
)
def embed_texts(texts: list[str]) -> list[list[float]]:
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer('BAAI/bge-small-en-v1.5')
    return model.encode(texts).tolist()
 
# Entrypoint invoked by `modal run embed.py`
@app.local_entrypoint()
def main():
    embeddings = embed_texts.remote(['Hello world', 'How are you?'])
    print(f'Got {len(embeddings)} embeddings of dim {len(embeddings[0])}')
 
# Deploy and run
modal run embed.py
 
# Or deploy as a persistent endpoint
modal deploy embed.py
 
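The embeddings come back as plain Python lists, so downstream similarity comparisons need no extra dependencies. A minimal sketch of cosine similarity — the vectors here are made up for illustration, not real model output:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings
v1 = [0.1, 0.3, 0.5]
v2 = [0.2, 0.1, 0.4]
print(round(cosine_similarity(v1, v2), 4))  # → 0.9221
```

In practice you would run this over the lists returned by embed_texts.remote(...) to rank documents against a query.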

GPU options and cost

| GPU       | VRAM  | Good for                                  | Approx cost/hr |
|-----------|-------|-------------------------------------------|----------------|
| T4        | 16 GB | Small models, embeddings, 7B inference    | $0.59          |
| A10G      | 24 GB | 13B models, stable diffusion, batch work  | $1.10          |
| A100-40GB | 40 GB | 70B models, fast inference                | $3.04          |
| A100-80GB | 80 GB | Large models, fine-tuning                 | $4.69          |
| H100      | 80 GB | Fastest inference, training               | $8.98          |
Start with T4 for development and small models. Move to A10G for 13B+ models. You pay only when the function is running — no idle cost.
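Because billing is per-second while the function runs, a back-of-envelope cost estimate is straightforward. The throughput figure below is a made-up assumption for illustration — measure your own model's:

```python
# Rough cost estimate for a hypothetical embedding batch job on a T4.
T4_DOLLARS_PER_HOUR = 0.59   # from the table above
texts_to_embed = 1_000_000
texts_per_second = 2_000     # hypothetical throughput for a small embedding model

seconds = texts_to_embed / texts_per_second
cost = (seconds / 3600) * T4_DOLLARS_PER_HOUR
print(f'{seconds:.0f} s of GPU time ≈ ${cost:.2f}')  # → 500 s of GPU time ≈ $0.08
```

Even generous error bars on throughput leave batch embedding far cheaper than per-token hosted APIs at this volume.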

The Model class: keep models loaded between calls

Without it, every function call cold-starts the container and reloads the model, which can take 30-60 seconds. Decorating a class with @app.cls() and marking a setup method with @modal.enter() loads the model once per container and keeps it warm between calls.

import modal
 
app = modal.App('llm-inference')
 
image = (
    modal.Image.debian_slim()
    .pip_install('transformers', 'torch', 'accelerate')
)
 
@app.cls(
    image=image,
    gpu='A10G',
    container_idle_timeout=300,  # keep warm for 5 min after last call
)
class LLMModel:
 
    @modal.enter()
    def load(self):
        """Runs ONCE when the container starts."""
        from transformers import pipeline
        self.pipe = pipeline(
            'text-generation',
            model='mistralai/Mistral-7B-Instruct-v0.2',
            device_map='auto',
            torch_dtype='auto',
        )
        print('Model loaded!')
 
    @modal.method()
    def generate(self, prompt: str, max_new_tokens: int = 256) -> str:
        """Called on every inference request."""
        result = self.pipe(
            prompt,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
        )
        return result[0]['generated_text'][len(prompt):]
 
# Usage (entrypoint for `modal run`):
@app.local_entrypoint()
def main():
    response = LLMModel().generate.remote('Explain neural networks simply:')
    print(response)
 

Storing model weights with Volumes

Downloading large model weights on every cold start wastes time. Use Modal Volumes to persist weights across container restarts.

import modal
 
app = modal.App('llm-with-volume')
 
# Create a persistent volume for model weights
volume = modal.Volume.from_name('model-weights', create_if_missing=True)
 
image = modal.Image.debian_slim().pip_install('huggingface_hub', 'transformers', 'torch')
 
@app.function(image=image, volumes={'/models': volume})
def download_model():
    """Run this once to download model to the volume."""
    from huggingface_hub import snapshot_download
    snapshot_download(
        'mistralai/Mistral-7B-Instruct-v0.2',
        local_dir='/models/mistral-7b'
    )
    volume.commit()  # flush the new files so other containers see them
    print('Downloaded!')
 
@app.cls(image=image, gpu='A10G', volumes={'/models': volume})
class FastLLM:
 
    @modal.enter()
    def load(self):
        from transformers import pipeline
        # Load from volume — no download needed
        self.pipe = pipeline(
            'text-generation',
            model='/models/mistral-7b',
            torch_dtype='auto',  # half precision per the model config; fits on A10G
            device=0,            # place the model on the GPU
        )
 
    @modal.method()
    def generate(self, prompt: str) -> str:
        return self.pipe(prompt, max_new_tokens=200)[0]['generated_text']
 
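A possible driver tying the two together — a sketch, not part of the Modal API; the function name, default prompt, and re-run comment are illustrative:

```python
# Hypothetical entrypoint for `modal run`: fetch weights once, then query.
@app.local_entrypoint()
def main(prompt: str = 'Summarize the history of GPUs in one sentence.'):
    download_model.remote()   # cheap to re-run once the weights are in the volume
    print(FastLLM().generate.remote(prompt))
```

`modal run` maps local_entrypoint parameters to CLI flags, so `modal run llm.py --prompt "..."` overrides the default.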

Exposing as a web endpoint

Turn any Modal function into an HTTP endpoint with @modal.web_endpoint.

from fastapi import Request

# Web endpoints need FastAPI inside the container image
web_image = image.pip_install('fastapi[standard]')

@app.function(image=web_image, gpu='T4')
@modal.web_endpoint(method='POST')
async def embed_endpoint(request: Request):
    data = await request.json()
    texts = data['texts']
    # NB: reloading the model on every request is slow; for production,
    # use @app.cls() with @modal.enter() as shown above to keep it warm
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer('BAAI/bge-small-en-v1.5')
    embeddings = model.encode(texts).tolist()
    return {'embeddings': embeddings}
 
# After modal deploy, you get a stable URL:
# https://your-workspace--text-embeddings-embed-endpoint.modal.run
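Once deployed, any HTTP client can call it. A sketch with curl — substitute your own workspace URL for the placeholder:

```shell
curl -X POST \
  -H 'Content-Type: application/json' \
  -d '{"texts": ["Hello world", "How are you?"]}' \
  'https://your-workspace--text-embeddings-embed-endpoint.modal.run'
```

The response is the JSON body the function returns, e.g. {"embeddings": [[...], [...]]}.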
 

When to use Modal vs hosted LLM APIs

| Situation                              | Use Modal               | Use hosted API (OpenAI, Anthropic)  |
|----------------------------------------|-------------------------|-------------------------------------|
| Need a specific open-source model      | Yes                     | No                                  |
| Data privacy / no external API calls   | Yes                     | No                                  |
| High-volume batch processing           | Yes — cheaper at scale  | Expensive                           |
| Low latency single requests            | Cold starts hurt        | Yes                                 |
| Quick prototyping                      | Overkill                | Yes                                 |
| Fine-tuned private model               | Yes                     | Only if provider offers fine-tuning |
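For the high-volume batch row, Modal's .map() is the main tool: it fans a function out over an iterable of inputs, running many containers in parallel. A sketch assuming the embed_texts function from the first example — the corpus and chunk size are made up:

```python
# Hypothetical batch driver: parallel embedding of a large corpus.
@app.local_entrypoint()
def batch_embed():
    corpus = [f'document {i}' for i in range(10_000)]
    # One .map() input per chunk; Modal scales containers automatically
    chunks = [corpus[i:i + 256] for i in range(0, len(corpus), 256)]
    results = []
    for embeddings in embed_texts.map(chunks):
        results.extend(embeddings)
    print(f'Embedded {len(results)} documents')
```

Each chunk becomes one function call, so throughput scales with however many containers Modal spins up, and you still pay only for the seconds each one runs.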