In Plain English
You've built an AI app that calls an LLM, streams tokens back to the user, and sometimes runs for 30-60 seconds per request. Now you need somewhere to run it. The three main hosting families — serverless functions, containers, and edge functions — were each designed for a different era of workloads, and LLM apps stress-test all of them in different ways.
Think of it like choosing a restaurant model. Serverless is a ghost kitchen: someone else owns the building, the staff appear only when an order arrives, and you pay purely per meal. Containers are a leased restaurant space: you control the kitchen layout, the stove stays warm between customers, and you pay a monthly rent whether anyone eats or not. Edge functions are a chain of pop-up stalls placed in every neighborhood: each stall is tiny and cheap, but you can only sell items that don't need a full kitchen.
Why It Matters for AI Apps
Traditional web requests finish in under a second. LLM-powered requests are different in three ways that matter enormously for infrastructure:
- Duration: a single streamed GPT-4o response can take 20-60 seconds end-to-end. An agentic loop with five tool calls multiplies that further.
- Streaming: users expect tokens to appear as they are generated, not after a long blank wait. Buffering the full response and sending it in one shot destroys the UX.
- Bursty but infrequent: most AI apps have uneven traffic — quiet for hours, then a spike — which makes always-on dedicated servers wasteful.
Pick the wrong deployment target and you hit one of three failure modes: killed by a timeout (serverless function cuts off mid-stream), frozen by a cold start (a 10-second Lambda init before the first token), or bankrupted by idle cost (a beefy VM running at 2% CPU waiting for occasional requests).
How Each Deployment Model Works
Each model has a distinct lifecycle for an incoming HTTP request. Understanding that lifecycle reveals where it will struggle with LLM workloads.
Serverless function lifecycle:
- Request arrives
- Cold start: runtime boots (0-5 s)
- Handler runs (billed per ms)
- Response sent
- Instance freezes or terminates
Container lifecycle:
- Request arrives
- Process already warm
- Handler runs
- Response sent
- Process stays alive, awaiting next request
Edge function lifecycle:
- Request arrives at nearest PoP
- Isolate spins up (<1 ms)
- Handler runs (CPU budget: seconds)
- Response sent from edge node
- Isolate may be reused or recycled
Serverless functions
Platforms like AWS Lambda, Google Cloud Functions, and Vercel Functions spin up a runtime only when a request arrives. The billing model is per-invocation and per-millisecond of execution. For AI apps, the critical knobs are the maximum duration and streaming support.
Vercel Functions are a common choice for Next.js AI apps. On the Hobby plan, functions time out at 60 seconds — enough for many single-turn LLM calls but risky for agentic loops. The Pro plan's Fluid Compute model extends this to 800 seconds and is what Vercel recommends for AI streaming. Vercel Edge Functions (the V8-isolate tier) must begin sending a response within 25 seconds but can then stream for up to 300 seconds — a good fit for streaming chat if you open the stream quickly.
AWS Lambda has a 15-minute (900 s) hard cap, but the bigger trap is API Gateway's default 29-second integration timeout, which sits in front of Lambda. AWS now accepts quota-increase requests above 29 seconds for Regional and private REST APIs, but for other API types the cap stands. The common workaround is Lambda response streaming: Lambda pushes chunks directly to the client, typically through a Lambda function URL, bypassing the API Gateway timeout. It requires the InvokeWithResponseStream API (or a function URL in RESPONSE_STREAM invoke mode) and specific SDK wiring. AWS has a dedicated blog post on this pattern (see Further Reading).
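As a rough illustration of the SDK wiring, the sketch below consumes a streamed Lambda response from Python using boto3's `invoke_with_response_stream`. The helper names and the function name you pass in are hypothetical, and it assumes the target function has already been written and configured for response streaming.

```python
from typing import Iterable, Iterator


def iter_payload_chunks(events: Iterable[dict]) -> Iterator[bytes]:
    """Yield raw payload bytes from a Lambda response-stream event sequence."""
    for event in events:
        if "PayloadChunk" in event:
            # Each chunk arrives as soon as the function writes it.
            yield event["PayloadChunk"]["Payload"]
        elif "InvokeComplete" in event:
            # The final event reports an error code if the function failed mid-stream.
            error = event["InvokeComplete"].get("ErrorCode")
            if error:
                raise RuntimeError(f"Lambda stream failed: {error}")


def stream_lambda(function_name: str, payload: bytes) -> Iterator[bytes]:
    """Invoke a streaming-enabled Lambda and yield chunks as they arrive."""
    import boto3  # imported lazily so the parser above works without AWS access

    client = boto3.client("lambda")
    response = client.invoke_with_response_stream(
        FunctionName=function_name,  # hypothetical: your own function's name
        Payload=payload,
    )
    yield from iter_payload_chunks(response["EventStream"])
```

A caller would iterate over `stream_lambda(...)` and forward each chunk to the user as it arrives, rather than joining the chunks first.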
Containers (always-on or scale-to-zero)
Platforms like Fly.io, Railway, Render, and Google Cloud Run run your Docker image as a long-lived process. There is no platform-imposed request timeout — your server can hold a streaming connection open for as long as the underlying TCP connection survives. This makes containers the most natural fit for LLM streaming.
Google Cloud Run is a hybrid: it feels serverless (scale-to-zero, pay-per-request) but runs real containers. Request timeout is configurable up to 60 minutes for services. Cloud Run also supports NVIDIA L4 GPUs (generally available), giving you a path to self-hosted inference if you outgrow external API calls. Cold starts with an L4 GPU and a loaded model framework like Ollama range from roughly 11 to 35 seconds — acceptable for background tasks, painful for interactive chat unless you keep minimum instances > 0.
Edge functions
Edge functions (Cloudflare Workers, Vercel Edge Functions, Deno Deploy) run V8 JavaScript isolates distributed across hundreds of Points of Presence worldwide. Latency to the user is extremely low because the code runs in their region. The trade-off is a tight CPU budget: the Cloudflare Workers free plan gives you about 10 ms of CPU time per request, and even the paid plan's default of 30 seconds can be exhausted quickly by anything compute-heavy. CPU time counts only active execution, not time spent waiting on network I/O, which is why relaying a long upstream stream still fits within these limits.
For AI apps, edge functions are best used as a thin proxy — route the request, attach auth headers, forward to your LLM provider, and stream the response straight to the client. They cannot run PyTorch or any model weights locally. Cloudflare offers a separate product — Workers AI — that routes your inference request to GPU nodes while still dispatching from the edge, but that is distinct from a plain Worker.
Platform-by-Platform Comparison
The table below summarises the key constraints for popular platforms in mid-2026. Numbers change — always check official docs before committing.
| Platform | Type | Max request duration | Streaming support | Cold start (typical) | Best for |
|---|---|---|---|---|---|
| AWS Lambda + API GW | Serverless | 29 s (API GW default cap) | Via response streaming API only | < 1 s (warm) | Short tasks; streaming needs extra setup |
| AWS Lambda (direct invoke) | Serverless | 900 s | Via InvokeWithResponseStream | < 1 s (warm) | Agentic loops bypassing API GW |
| Vercel Functions (Hobby) | Serverless | 60 s | Yes, via AI SDK streaming | ~200 ms | Simple chat UI on Hobby plan |
| Vercel Fluid Compute (Pro) | Serverless | 800 s | Yes | ~200 ms | Production Next.js AI apps |
| Vercel Edge Functions | Edge | Stream up to 300 s | Yes (stream must start < 25 s) | < 50 ms | Streaming proxy; no heavy compute |
| Google Cloud Run | Container (scale-to-zero) | 60 min | Yes (chunked encoding) | 5-35 s (GPU), < 2 s (CPU) | FastAPI/Flask AI APIs; GPU inference |
| Fly.io | Container | No platform limit | Yes (WebSocket + HTTP/2) | < 1 s (warm machines) | Full-stack AI app, persistent state |
| Railway | Container | No platform limit | Yes | ~5-15 s (cold deploy) | Rapid iteration, simple ops |
| Render | Container | No platform limit | Yes | < 2 s (warm) | Django/FastAPI workers + background jobs |
| Cloudflare Workers | Edge | 30 s CPU (paid) | Yes (but proxy only) | < 1 ms | Auth gateway, prompt router, thin proxy |
| Modal / RunPod Serverless | Serverless GPU | No hard limit | Yes | 2-10 s (model load) | Self-hosted model inference, GPU bursts |
Cost Patterns and Scaling
Deployment cost for AI apps is dominated by two very different line items: compute (where your code runs) and inference (what you pay the LLM provider per token). Optimising the wrong one is a common mistake.
Serverless: pay-per-millisecond scales to zero but rewards fast code
If your app has uneven traffic — quiet nights, busy afternoons — serverless billing aligns naturally with usage. A function that sits idle costs nothing. But a function that streams a 45-second response to 100 concurrent users is paying for 4,500 function-seconds per batch, which adds up quickly at scale. The per-GB-second pricing of AWS Lambda means memory allocation directly drives cost, so right-sizing memory matters.
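The arithmetic above can be sketched in a few lines. The default price is an assumption (an x86 per-GB-second list price that should be checked against current AWS pricing), and the estimate ignores per-request charges and data transfer.

```python
def lambda_compute_cost(
    duration_s: float,
    concurrent_requests: int,
    memory_gb: float,
    price_per_gb_second: float = 0.0000166667,  # assumed list price; verify before relying on it
) -> float:
    """Estimate the duration charge for one batch of concurrent streamed requests."""
    # Every concurrent request occupies its own function instance for the full stream.
    function_seconds = duration_s * concurrent_requests
    # Lambda bills duration in GB-seconds, so allocated memory scales cost linearly.
    return function_seconds * memory_gb * price_per_gb_second


# 45-second streams for 100 concurrent users = 4,500 function-seconds per batch.
cost_1gb = lambda_compute_cost(45, 100, memory_gb=1.0)
cost_2gb = lambda_compute_cost(45, 100, memory_gb=2.0)
print(f"1 GB: ${cost_1gb:.4f} per batch, 2 GB: ${cost_2gb:.4f} per batch")
```

Under these assumptions, doubling the memory allocation doubles the bill for the same 4,500 function-seconds, which is why right-sizing matters for functions that mostly wait on an upstream LLM.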
Containers: fixed floor, linear scale
A Fly.io machine or Railway service has a monthly floor cost even at zero traffic. For low-traffic apps this is wasteful; for consistently busy apps it becomes cheaper than serverless. Fly.io's auto_stop_machines = true option brings back scale-to-zero behaviour while keeping long-request support — a middle ground worth considering.
Serverless GPU: the inference hosting tier
If you are hosting your own model weights (open-source LLM, custom fine-tune), serverless GPU platforms like Modal and RunPod bridge the gap between serverless convenience and GPU power. Modal's H100 runs around $4.50/hr versus RunPod's ~$2.50/hr for equivalent hardware. Modal charges a premium for its developer-friendly @app.function decorator workflow, whereas RunPod requires you to manage Docker images yourself. For bursty inference workloads under 30% GPU utilisation, serverless GPU is almost always cheaper than a reserved GPU instance.
```python
# Modal example: GPU function with scale-to-zero
import modal

app = modal.App("llm-inference")


@app.function(
    gpu="A10G",
    container_idle_timeout=60,  # scale to zero after 60s idle (newer Modal releases name this scaledown_window)
    timeout=300,  # 5-minute hard limit per call
)
def run_inference(prompt: str) -> str:
    # model is loaded once per container instance; `model` is a placeholder
    # for whatever inference object your app loads at container start
    return model.generate(prompt)
```
How to Choose
The right answer depends on three questions: how long your requests run, how variable your traffic is, and how much infrastructure you want to operate.
- Traffic is spiky and unpredictable → serverless or Cloud Run (scale-to-zero). Pay only when busy.
- Requests stream for > 30 seconds regularly → containers (Fly.io, Railway, Render) or Vercel Fluid Compute. No platform timeout to work around.
- You need global low latency but do no heavy compute → edge function as a thin proxy, with the real work offloaded to a container or LLM API.
- You host your own model → serverless GPU (Modal, RunPod) for burst, a reserved GPU pod for baseline.
- Team wants minimal ops overhead → Railway or Render; push a Dockerfile and it runs.
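The rules above can be encoded as a small helper. This is only a sketch of the checklist: the `Workload` fields and the 30-second threshold mirror the bullets, and the names are hypothetical rather than part of any platform's API.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    max_request_seconds: float
    spiky_traffic: bool = False
    needs_global_low_latency: bool = False
    heavy_compute: bool = False
    self_hosted_model: bool = False
    minimal_ops: bool = False


def recommend(w: Workload) -> list[str]:
    """Return every deployment suggestion from the checklist that applies."""
    suggestions = []
    if w.spiky_traffic:
        suggestions.append("serverless or Cloud Run (scale-to-zero)")
    if w.max_request_seconds > 30:
        suggestions.append("containers (Fly.io, Railway, Render) or Vercel Fluid Compute")
    if w.needs_global_low_latency and not w.heavy_compute:
        suggestions.append("edge function as a thin proxy")
    if w.self_hosted_model:
        suggestions.append("serverless GPU (Modal, RunPod) for burst, reserved GPU pod for baseline")
    if w.minimal_ops:
        suggestions.append("Railway or Render")
    return suggestions


# A chat app with 45-second streams and uneven traffic matches two rules.
print(recommend(Workload(max_request_seconds=45, spiky_traffic=True)))
```

Several rules can fire at once, which reflects how these apps are often deployed in practice: for example, an edge proxy in front of a container.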
Going Deeper
Once you've chosen a deployment target, several advanced patterns become important at production scale.
Streaming architectures beyond HTTP
Server-Sent Events (SSE) is the standard protocol for token streaming over HTTP. For agentic apps that need bidirectional communication, where the client can cancel a run, inject new messages, or subscribe to tool-call updates, WebSockets are more appropriate, because SSE only carries data from server to client. Fly.io and Railway both support WebSockets natively; Lambda requires the API Gateway WebSocket API, which is a separate configuration.
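To make the SSE wire format concrete, here is a minimal sketch that frames tokens as SSE events using only the standard library. The `[DONE]` end marker is an assumption borrowed from a common convention, not part of the SSE specification, and `sse_events` is a hypothetical helper name.

```python
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str], done_marker: str = "[DONE]") -> Iterator[str]:
    """Frame each token as one Server-Sent Event."""
    for token in tokens:
        # A payload containing newlines must be split into one `data:` field
        # per line; the client rejoins them with "\n".
        fields = "".join(f"data: {line}\n" for line in token.split("\n"))
        # A blank line terminates the event.
        yield fields + "\n"
    # Assumed convention: a final sentinel event tells the client to stop reading.
    yield f"data: {done_marker}\n\n"


for frame in sse_events(["Hello", " world"]):
    print(repr(frame))
```

A web framework's streaming response would send these frames with the `text/event-stream` content type, flushing after each one so tokens are not buffered.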
Keeping containers warm without paying for idle
Cold starts are the enemy of interactive AI apps. Google Cloud Run lets you configure min-instances to keep at least one container warm at all times — you pay for that instance continuously but eliminate cold starts for the first user after a quiet period. Modal offers a keep_warm parameter for the same purpose. For cost-optimised architectures, a single warm instance plus scale-to-zero overflow can cover most traffic shapes.
Background queues for long agentic tasks
If your agent loop genuinely takes 5-15 minutes, consider decoupling the trigger from the execution entirely. The user's HTTP request creates a job and returns a job ID immediately. A background worker (a long-running container or a serverless GPU function) processes the job asynchronously. The client polls or subscribes to a push channel for results. This pattern sidesteps every timeout constraint because the originating HTTP request completes in under a second.
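The pattern can be sketched with the standard library alone. This version keeps jobs in an in-memory dict and runs one worker thread, which is an assumption made for brevity; a real deployment would likely use a durable store and a separate worker process. `run_agent` is a hypothetical stand-in for the long-running agent loop.

```python
import queue
import threading
import uuid

jobs: dict[str, dict] = {}  # in-memory job store (assumption: not durable)
work_queue: "queue.Queue[str]" = queue.Queue()


def run_agent(prompt: str) -> str:
    """Hypothetical stand-in for a multi-minute agent loop."""
    return f"done: {prompt}"


def submit_job(prompt: str) -> str:
    """What the HTTP handler does: record the job, enqueue it, return at once."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "queued", "prompt": prompt, "result": None}
    work_queue.put(job_id)
    return job_id


def get_status(job_id: str) -> dict:
    """What the polling endpoint returns."""
    job = jobs[job_id]
    return {"status": job["status"], "result": job["result"]}


def worker() -> None:
    """Background loop that processes jobs outside any HTTP request."""
    while True:
        job_id = work_queue.get()
        job = jobs[job_id]
        job["status"] = "running"
        try:
            job["result"] = run_agent(job["prompt"])
            job["status"] = "succeeded"
        except Exception as exc:
            job["result"] = str(exc)
            job["status"] = "failed"
        finally:
            work_queue.task_done()


threading.Thread(target=worker, daemon=True).start()
```

The client calls the submit endpoint once, then polls the status endpoint with the returned job ID until the status is no longer `queued` or `running`.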
Observability across deployment tiers
Distributed deployments, such as an edge proxy feeding a container feeding an LLM API, require trace IDs to propagate across every hop. Pass an X-Request-ID header from the edge function through to your container logs. Most LLM provider SDKs let you attach custom metadata to requests; tie that to your trace ID so you can correlate LLM latency with the overall request timeline. Tools like Langfuse and Helicone provide LLM-aware observability that integrates with standard OpenTelemetry traces.
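A minimal sketch of the propagation step, assuming headers are available as a plain mapping: reuse an incoming X-Request-ID if one exists, otherwise mint one, and include it in every log line. The helper names are hypothetical.

```python
import uuid
from typing import Mapping


def with_request_id(headers: Mapping[str, str]) -> dict[str, str]:
    """Return a copy of the headers that is guaranteed to carry a request ID."""
    out = dict(headers)
    for key, value in out.items():
        # Header names are case-insensitive, so match on the lowercased name.
        if key.lower() == "x-request-id" and value:
            return out  # an upstream hop already assigned an ID; keep it
    out["X-Request-ID"] = uuid.uuid4().hex
    return out


def log_line(request_id: str, message: str) -> str:
    """Prefix a log message with the trace ID so hops can be correlated."""
    return f"request_id={request_id} {message}"


outbound = with_request_id({"Authorization": "Bearer example"})
print(log_line(outbound["X-Request-ID"], "forwarding to LLM provider"))
```

Each hop applies the same helper before calling the next service, so the ID minted at the edge is the one that appears in container logs and in any metadata attached to the LLM request.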
FAQ
Can I use Vercel serverless functions for a streaming ChatGPT-style app?
Yes, with caveats. On the Hobby plan the 60-second function timeout is fine for most single-turn chats but will cut off long responses. For production, Vercel's Pro plan with Fluid Compute raises the limit to 800 seconds and is explicitly designed for AI streaming use cases. Make sure you're using the Vercel AI SDK's streamText helper, which handles the SSE protocol correctly.
Why does my Lambda function time out at 29 seconds even though I set a 5-minute timeout?
The 29-second limit is on API Gateway, not Lambda. API Gateway's default maximum integration timeout is 29 seconds, and raising it requires a quota increase that is only available for some REST API types. Solutions: switch to Lambda response streaming with InvokeWithResponseStream (bypasses API Gateway), use an Application Load Balancer instead (which supports an idle timeout of up to 4,000 seconds), or move to a container platform.
Are edge functions suitable for AI apps?
Only for the thin proxy layer — authentication, rate limiting, routing, and forwarding the streaming response to the client. Edge functions cannot run model weights or any significant CPU computation. Cloudflare's Workers AI is a separate product that routes inference to GPU nodes; a standard Worker just relays the request.
What is the cheapest way to deploy a low-traffic AI app?
For apps with < 1,000 requests per day, a scale-to-zero container (Google Cloud Run free tier, Render free tier, or Railway's Hobby plan) is usually the cheapest option. At that volume LLM inference is the dominant cost, not compute, so the hosting bill stays small whichever platform you pick. Keep a single warm instance if cold starts hurt UX; the monthly cost of one 256 MB container is typically a few dollars.
How do I prevent cold starts from ruining the first-user experience?
Set a minimum instance count of 1 on your platform (Cloud Run min-instances, Fly.io min_machines_running, Modal keep_warm). This keeps one instance alive continuously at idle cost. For serverless GPU, Modal's keep_warm=1 ensures the GPU container never fully de-allocates. As a fallback, a lightweight health-check ping every 5 minutes can keep most serverless platforms from recycling your instance.
When should I use a background queue instead of a long HTTP connection?
When a single task reliably takes more than 2-3 minutes, or when you need retry logic, progress updates, or fan-out parallelism. Return a job ID immediately on the HTTP request, process the work in a background worker (a container, a serverless GPU function, or a task queue like Celery or BullMQ), and push results via WebSocket or polling. This approach works on any deployment platform because no single HTTP connection ever needs to stay open for long.