AI/TLDR

How to Add AI Features to an Existing App (Without a Rewrite)

A pragmatic playbook for shipping your first AI feature inside the codebase you already have.

INTERMEDIATE11 MIN READUPDATED 2026-06-12

In plain English

You have a working app — a SaaS product, an internal tool, a customer-facing service — and you want to ship an AI feature without blowing up everything that already works. The good news is that adding an LLM to an existing application is much closer to adding a new third-party API than it is to a rewrite. You call a URL, you get text back, you display it. The tricky part is everything around that call: keeping your app reliable when the AI is unavailable, controlling costs before they surprise you, and rolling out to real users without accidentally degrading the existing experience.

Think of it like installing a new appliance in a house. You don't tear down the walls — you find the right outlet, run a new circuit breaker for it, and make sure the rest of the house still works if that appliance trips. The appliance here is an LLM API. Your job is to wire it in cleanly: a dedicated integration layer, a circuit breaker that falls back gracefully, and a meter that tells you what it's costing to run.

Why it matters

Most existing apps were built before reliable LLM APIs existed. That doesn't mean they need to be rewritten to benefit from AI. The same app that summarises support tickets, drafts email replies, extracts structured data from uploads, or generates first-draft content is more valuable than the version that doesn't — and all of those features can be added as thin AI layers on top of data flows that already exist.

The risk isn't the AI code itself. The risk is treating an LLM call like a database query: always available, deterministic, instant, and free. LLM APIs are none of those things. They time out, return 429 rate-limit errors, produce different outputs for the same input, and can cost two orders of magnitude more per request than your current infra. Engineering for those realities from the start is what separates a production AI feature from a flaky demo.

ConcernWhat can go wrongThe mitigation
ReliabilityProvider outage or rate-limit breaks the featureFallback to cached result or non-AI path
CostViral usage spikes spend 10x overnightPer-user token budgets and request rate limits
CorrectnessModel hallucinates, user trusts itShow AI output as a suggestion, not ground truth
RolloutAll users get the new feature at onceFeature flag with percentage rollout
LatencyLLM takes 3-8 s, UI freezesAsync/streaming, optimistic UI, spinner with timeout

How it works

The integration has four layers that sit between your existing application code and the LLM provider. Each layer has a single responsibility. Together they let you add AI features without touching the rest of your app.

Layer 1 — Feature flag gate

Gate every new AI feature behind a flag from day one. Tools like LaunchDarkly and the open-source GrowthBook let you roll out to 1% of users, watch error rates and latency, and expand or kill the flag without a deployment. LaunchDarkly specifically added AI Configs (generally available in 2025) that let you swap prompt templates and model IDs at runtime — so you can A/B test whether GPT-4.1 or Claude Sonnet 4 produces better summaries without touching code.

Layer 2 — AI service module

Keep all AI logic in one module, not scattered across controllers. The module has three responsibilities: build the prompt from structured app data, call the gateway, and parse the response back into a typed structure your app can use. Isolating it here means you can swap models, change prompts, or disable the feature entirely without touching business logic.

typescripttypescript
// ai/summarize-ticket.ts — a self-contained AI service module
import { openai } from './gateway'; // your gateway client

export async function summarizeTicket(
  ticketBody: string
): Promise<{ summary: string } | null> {
  try {
    const response = await openai.chat.completions.create({
      model: 'gpt-4.1-mini',
      messages: [
        {
          role: 'system',
          content: 'Summarise the support ticket in 1-2 sentences. Be factual.',
        },
        { role: 'user', content: ticketBody },
      ],
      max_tokens: 120,
      timeout: 8000, // 8 s hard cap
    });
    const text = response.choices[0]?.message?.content?.trim();
    return text ? { summary: text } : null;
  } catch {
    return null; // graceful fallback — caller shows original text
  }
}

Layer 3 — Gateway

LiteLLM is an open-source Python/proxy layer that accepts OpenAI-format requests and routes them to 100+ providers. You point your code at the LiteLLM proxy URL instead of api.openai.com; all retries, fallbacks, and cost tracking happen at the proxy without changing your application code. For Python-heavy teams it also ships as a drop-in SDK. Portkey covers the same ground with a hosted option and semantic response caching — returning a cached reply for semantically similar prompts saves both latency and cost.

Layer 4 — Fallback strategy

Design every AI call so the app works normally if the call fails. The four standard resilience patterns for LLM integrations are: retry with exponential backoff (handles transient 429 and 5xx errors), model fallback chain (OpenAI down → route to Anthropic → route to smaller local model), circuit breaker (stop calling a provider that's been failing for 60 s, reopen after a probe succeeds), and graceful degradation (show the original content instead of the AI enhancement when all else fails).

Where AI actually fits in a typical app

Most existing apps have a handful of natural seams where AI drops in without disrupting anything. The trick is spotting them. Look for any place where a user currently does repetitive text work, where you have structured data that users need to interpret, or where free-form input needs to be normalised.

App areaAI feature to addIntegration point
Support ticketingAuto-summary + suggested reply draftAfter ticket creation webhook
Document uploadsExtract key fields into structured JSONAfter file-processing step
SearchRe-rank results with semantic relevanceAfter existing keyword search returns hits
User onboardingGenerate a personalised welcome checklistOn first login, async background job
Admin dashboardNatural-language query over your own dataNew /ask endpoint alongside existing REST API
Email composerAuto-draft based on CRM contextBefore user hits Send — shown as editable suggestion

Notice the pattern: in every case the AI feature is additive, not a replacement. The original data flow still exists. If the AI layer fails or is disabled, the app reverts to its pre-AI behaviour. This is the key mindset shift — AI as progressive enhancement, not a dependency.

Keeping costs under control

LLM API pricing is token-based, and output tokens cost roughly 3-5x more than input tokens across most providers (as of mid-2026, GPT-4.1 Mini charges about $0.40/M input and $1.60/M output; Claude Haiku is in a similar range). A feature that averages 3,000 input tokens and 400 output tokens per call costs roughly $0.002 per request at those rates — negligible at 100 calls/day, but $600/month at 10,000 calls/day. Model the math before you ship.

Five levers to manage spend

  1. Right-size the model. Use a fast, cheap model (GPT-4.1 Mini, Claude Haiku, Gemini 2.0 Flash) for features that don't need deep reasoning. Save flagship models for cases where quality visibly matters to users.
  2. Cap max_tokens on every call. An uncapped completion can return 4,000 tokens when you needed 100. Always set a hard ceiling matched to what the feature actually uses.
  3. Enable prompt caching. Anthropic and OpenAI both offer prefix caching: if your system prompt is the same across calls (it usually is), the provider charges a fraction of the input cost for the cached prefix — typically 60-80% less.
  4. Rate-limit at the user level. Give each user or tenant a daily token budget. Enforce it in your AI service module before making the API call. LiteLLM and Portkey both expose per-key spend limits; for a simpler setup, a Redis counter keyed to the user ID works fine.
  5. Batch non-urgent work. OpenAI's Batch API and Anthropic's Message Batches API process jobs asynchronously with ~50% cost discounts. Background tasks (nightly summaries, bulk tag generation) are perfect candidates.
pythonpython
# Simple per-user rate limiter using Redis
import redis

r = redis.Redis()
MAX_TOKENS_PER_DAY = 50_000  # tune to your budget

def check_and_reserve_tokens(user_id: str, estimated_tokens: int) -> bool:
    key = f"ai:tokens:{user_id}:{datetime.date.today()}"
    current = r.get(key)
    used = int(current) if current else 0
    if used + estimated_tokens > MAX_TOKENS_PER_DAY:
        return False  # caller falls back to non-AI path
    r.incrby(key, estimated_tokens)
    r.expire(key, 86400)  # reset tomorrow
    return True

Going deeper

Once your first AI feature is stable, two architectural decisions become worth revisiting: observability and model portability.

Observability: you can't optimise what you can't see

Instrument every AI call with structured logs that capture: the model used, prompt token count, completion token count, latency in milliseconds, and whether the call succeeded or fell back. This is the minimum you need to debug a regression, spot cost anomalies, and justify (or kill) the feature. Tools like Helicone, Portkey, and LiteLLM's proxy dashboard aggregate these metrics automatically if you route through them. If you're calling the provider SDK directly, emit the same fields to your existing logging pipeline.

Model portability: avoid vendor lock-in from day one

The LLM market moves fast. The model you ship with today may not be the best option in six months. Protect yourself by keeping the model name and provider in a config value (or a feature flag via LaunchDarkly AI Configs) rather than hardcoded in your service module. Use the OpenAI-compatible request format where possible — it's the de facto standard that Anthropic, Mistral, and dozens of others support via adapters or LiteLLM routing — so switching providers is a one-line config change.

When to upgrade from a thin integration to a proper AI layer

A single call returning freeform text is fine for a first feature. Signs that you've outgrown it: you need the model to call tools or look up data before answering (move to function calling / tool use), you're injecting your own documents into every prompt (move to RAG with a vector store), or you need multi-step reasoning with memory (consider an agent framework like LangGraph or the Anthropic Agent SDK). Each of these is a natural next rung — but you won't know which one your app actually needs until you've shipped the simple version and watched real users interact with it.

FAQ

Do I need to rewrite my backend to use an LLM API?

No. An LLM API is just an HTTP endpoint that accepts JSON and returns JSON. Any backend that can make an outbound HTTP request — Node, Python, Ruby, Java, Go — can call it with no architectural changes. Add a new service module, a new API route if needed, and wire them together.

What's the safest first AI feature to ship to production?

An asynchronous, optional enhancement that runs after the user's primary action — a post-save summariser, a background tag suggester, or an auto-draft that shows up the next time the user opens an item. Because it's not in the critical path, a slow or failing LLM call cannot break the user's workflow.

How do I prevent the AI feature from costing too much if usage spikes?

Set a hard spend cap in your provider's dashboard immediately. Then add per-user daily token limits enforced before each call, use max_tokens on every request, and use a cheap model for high-volume features. Batch non-urgent work through the provider's Batch API for ~50% savings.

What happens when the LLM API is down or rate-limited?

Your app should fall back to the non-AI experience — show the original content, hide the AI widget, or queue the request for later. Implement retry with exponential backoff for transient errors (429, 503), and wrap the call so it returns null on final failure rather than throwing. Never let an LLM error propagate to a user-facing 500.

Should I use LiteLLM or call the provider SDK directly?

For a single provider and a single model, calling the SDK directly is simpler and fine. Once you need fallback across providers, cost aggregation, or the ability to swap models without code changes, a gateway like LiteLLM (open-source, self-hosted) or Portkey (hosted, with semantic caching) earns its place. Both expose an OpenAI-compatible API, so the switch is minimal.

How do I roll out an AI feature safely to real users?

Gate it behind a feature flag from day one. Start at 1-5% of users, monitor error rates, latency, and cost per request for 24-48 hours, then expand in steps. If anything looks wrong, turn the flag off instantly — no rollback deploy needed. Tools like LaunchDarkly and the open-source GrowthBook both support percentage rollouts out of the box.

Further reading