A couple years ago, backend engineers had a recurring conversation. It went something like this: "We're so tightly coupled to this database that we can't swap it out without rewriting everything. Every time the ORM changes, we're stuck." The answer the industry landed on was loose coupling — define contracts between components, put interfaces between your application logic and its dependencies, make implementations swappable. Microservices was one expression of that. Dependency injection was another. The principle underneath both of them is the same: don't let your application care about what's behind the interface, only about what the interface promises.
I keep watching AI builders make the same mistake we made with databases. They build their systems directly on top of a specific model, a specific platform, a specific tool SDK. It works great until the model gets a new version that behaves differently. Or the platform changes their pricing. Or a new model comes out that's 40% better at exactly the task you're doing, and switching to it means rewriting everything.
In AI, you're going to hit at least one of those moments per year. Probably more. The landscape isn't stabilizing. It's accelerating. And if your system is tightly coupled to today's best option, you're taking on technical debt that comes due every time the calendar turns.
This is the architecture piece I wish I'd read before I built my first five AI systems. The ones now in what I call the Museum of Dead Agents.
The Coupling Tax
Every time you write model="claude-opus-4" directly into your application logic, you're making a bet. You're betting that model stays the best option for your use case. You're betting the pricing stays reasonable. You're betting Anthropic's priorities stay aligned with yours.
Those are bad bets. Not because any of those things will definitely change — some of them won't. But because the cost of being wrong is a rewrite, and the cost of being right is zero. The abstraction layer costs you maybe an afternoon. The rewrite costs you a week, and it happens at the worst possible time, which is when you're already under pressure to ship something else.
The coupling tax shows up in three places:
The prompt layer. Your prompts are written to exploit quirks of a specific model. You've got instructions in there that only make sense because Claude handles context a certain way, or because GPT-4 needs explicit formatting instructions that Claude infers. When you switch models, half your prompts break in ways that are hard to diagnose.
The model layer. Your application talks directly to one provider's SDK. The API calls, error handling, retry logic, and response parsing are all written against one interface. Swap the provider and you're rewriting all of it.
The tool layer. Anthropic calls them tools. OpenAI calls them function calls. The schema is similar but not identical. If your tool definitions are written against one format, you're doing translation work manually every time you want to try a different model.
Separate these three layers now. It's not complicated. It just requires deciding upfront that the model is a plug, not a foundation.
The Prompt Layer: Templates as Contracts
The single highest-leverage change I made was moving my prompts out of code and into structured templates.
Here's what a model-agnostic prompt template looks like in practice:
# prompts/article-writer.yaml
version: 1
meta:
name: article-writer
description: Writes long-form articles in Joe's voice
category: content
model_hints:
default:
temperature: 0.7
max_tokens: 4000
openai:
temperature: 0.8 # GPT tends to be flatter at the same temp; nudge up
local:
temperature: 0.6 # Local models benefit from slightly tighter sampling
max_tokens: 2000 # Conservative for resource-constrained environments
system: |
You are a content writer producing long-form articles for Joe Bond, a consultant
and builder working at the intersection of AI and practical business systems.
Voice guidelines:
- Direct and peer-to-peer. The reader is technical. Don't over-explain.
- First-person throughout. Specific numbers where available.
- Short paragraphs. Punchy. Earned conclusions, not asserted ones.
- No em-dashes. No semicolons. No filler phrases.
Output format: Markdown. H2 section headers. No introduction header.
user: |
Write a long-form article with the following brief:
Topic: {{topic}}
Angle: {{angle}}
Key points to cover:
{{#each key_points}}
- {{this}}
{{/each}}
Source material:
{{source_material}}
The key is what's in model_hints. Your application logic reads those and applies them based on which provider it's routing to. The system and user templates stay the same. The temperature nudge for GPT vs. Claude is a one-line change in the YAML, not a rewrite of your pipeline.
This also means your prompts are version-controlled as config, not buried in code. You can A/B test prompt versions. You can roll back. You can see the diff when something starts producing worse output. That operational visibility alone is worth the effort.
The Model Layer: One Interface, Many Providers
Once your prompts are templates, you need a router that takes a rendered prompt and sends it to whichever model is currently configured. The interface looks the same regardless of what's behind it.
A minimal model router has three responsibilities:
Route. Read the current model config and call the right provider SDK. Claude goes to Anthropic. GPT goes to OpenAI. A local model goes to your Ollama instance or llama.cpp server. The calling code doesn't know which one was chosen.
Normalize. Every provider returns responses in a slightly different shape. The router translates all of them into a single response format your application understands. Message content, token counts, stop reasons — all normalized before they leave the router.
Handle failure. Rate limits, timeouts, provider outages. The router knows about fallbacks. If Anthropic is throttling, it can retry with a different provider. Your application code sees an error only if all configured options fail.
// model-router.js
async function complete(promptTemplate, variables, options = {}) {
const provider = options.provider ?? config.defaultProvider;
const rendered = renderTemplate(promptTemplate, variables);
const hints = promptTemplate.model_hints?.[provider]
?? promptTemplate.model_hints?.default
?? {};
try {
const raw = await providers[provider].complete(rendered, hints);
return normalize(raw, provider);
} catch (err) {
if (options.fallback && config.fallbackProvider) {
return complete(promptTemplate, variables, {
provider: config.fallbackProvider,
fallback: false,
});
}
throw err;
}
}
Your application calls complete(template, vars). It does not care whether that goes to Claude or GPT-4o or a local Mistral instance running on your laptop. The model is genuinely a plug.
Swapping from Claude to GPT-4o now means changing one line in your config file. Running a test against three models in parallel is a loop over an array. This is what loose coupling buys you.
The Tool Layer: Capability Contracts
Tool calling is where tight coupling hurts the most, because the schemas are just different enough between providers to be annoying.
Anthropic's tool format uses input_schema. OpenAI's uses parameters. The JSON Schema inside is roughly the same, but the wrapper is different. If you've written your tool definitions once, against one provider's format, you're doing manual translation every time you try another.
The fix is to define your tools in a neutral format and let the router translate:
# tools/web-search.yaml
name: web_search
description: Search the web for current information on a topic
parameters:
query:
type: string
description: The search query
required: true
max_results:
type: integer
description: Maximum number of results to return
default: 5
The router translates this into Anthropic format, OpenAI format, or whatever else you need. Your tool definition doesn't change. Your tool implementation — the actual function that runs the search — definitely doesn't change. Only the schema wrapper does, and you write that translation once.
This also means adding a new tool is a YAML file, not a code change scattered across provider-specific configuration blocks.
When to Couple Tightly
Not everything needs an abstraction layer. The overhead is real, and applying it indiscriminately is its own kind of engineering failure.
Couple tightly when: the capability is genuinely provider-specific and you have no intention of swapping it. Anthropic's extended thinking, for example. If you're building something that relies on that feature, abstract it if you want, but be honest — you're not going to replace it with a local model anytime soon. The abstraction buys you nothing.
Couple tightly when: it's a prototype. You're testing whether the idea works at all. Don't engineer for flexibility before you know what you're building. Get the thing running, validate the concept, then refactor toward loose coupling once you know it's worth the investment.
The rule: abstract when you can see a plausible reason to swap the implementation within the next year. If you can't construct that scenario, the abstraction is speculative engineering. Skip it.
Local Models as the Escape Hatch
Everything above assumes you're still calling cloud providers. Local models change the calculus entirely.
No rate limits. No policy changes. No one deciding your use case isn't worth supporting this quarter. Your hardware, your rules. The gap between local model quality and cloud model quality has closed significantly, and it keeps closing. There are already tasks where a local model running on reasonable hardware matches what you'd get from a cloud API call — at zero marginal cost per request.
The value of local models isn't just cost. It's optionality. When Anthropic throttled paid subscribers during peak hours, I had an alternative running on my own machine within a day. That's only possible if your architecture already treats the model as a plug.
If you're not running local models in at least some part of your stack, you're leaving that option closed. Open it now, even if you only route 10% of your calls there. Learn how your prompts behave with local models. Start closing the gap so that when the cloud providers do something frustrating — and they will — you have somewhere to go.
The Compound Benefit
Here's what this architecture gives you beyond just flexibility.
Experimentation becomes cheap. Want to know if GPT-4o is better than Claude for your specific task? Route 10% of traffic to each. Compare outputs. You didn't rewrite anything to run that test.
Cost optimization is ongoing. Models get cheaper. New models come out at lower price points. When you can swap providers in config, you can chase the cost curve as it drops. When you're tightly coupled, you're stuck with whatever your provider charges today.
You build transferable knowledge. When you understand prompting as a discipline — context management, constraint setting, output structuring — those skills work everywhere. When you understand one model's quirks, that knowledge goes stale every six months. The people who are doing well in AI right now aren't the ones who mastered one platform. They're the ones who learned how AI actually works.
The landscape will shift again. It shifts every year, sometimes more. A model will come out that's meaningfully better than what's available today, at a price point that makes your current setup look expensive. A tool will emerge that makes something you're currently doing manually trivially easy. A platform you depend on will change its terms.
The builders who thrive through those shifts are the ones who built for the model that doesn't exist yet. Not because they predicted what it would be. Because they built systems that don't care.
— Joe