Writing · Draft

← All writing

Local Models Aren't Worse, They're Different

I had a harness that was supposed to build me an app.

The setup looked sharp. I ran OpenClaw on top of Qwen 3 32b coder, MLX, on my Mac Studio. Wired up the tool calls. Gave it the spec. Sat back to watch a working app appear in my project folder.

What appeared was slop.

Not "needs polishing" slop. Not "missed an edge case" slop. Code that didn't compile. Files that referenced functions that didn't exist. Tests that were vibes-based and tested nothing. I'd given the same kind of brief I'd give Claude Code a hundred times before, and Claude would have shipped me something usable. This shipped me garbage.

So I assumed the harness was the problem. I rewrote it. Made it tighter. Better tool plumbing. Narrower loop. Same prompt style. Same brief.

More garbage.

That's the moment most people quit. I almost did. The takeaway most people walk away with is "local models aren't ready yet." It's the headline you see in every comment thread. It's the easy excuse if you tried this once, it didn't work, and you don't want to keep paying the experimentation tax.

The takeaway is wrong.

What I was actually doing wrong

I was prompting a local model the way I prompt Claude.

Claude Code is forgiving. You can throw it a half-baked idea, a vague spec, a task that's really three tasks pretending to be one, and Claude figures out what you meant. It builds the missing context. It infers the goals you didn't articulate. It compensates for sloppy work because it's smart enough to.

Qwen 3 32b coder is also smart. It's just not Claude-smart. It needs you to do the work Claude does for you.

When I handed it "build me an app that does X," I was handing it a task that contained six implicit subtasks, four implicit decisions, and zero explicit constraints. Claude would have made all six subtasks visible, made reasonable guesses on the four decisions, and asked me about the constraints if it cared. Qwen tried to do all of it at once and produced something that looked like code but wasn't.

The fix wasn't a bigger model. The fix was structural.

The structural fix

Stop asking one agent to do everything.

The thing that finally worked, after weeks of slop, was breaking the work into agents that each do exactly one thing. Right-sized model for each. Tight prompt for each. Chained.

Here's a real one I'm running. It's a Reddit research pipeline for one of my projects.

Three agents. Each one has a prompt that fits on a screen. Each one is sized to a model that's appropriate for the work. The "find threads" agent doesn't need a coder model. The "draft a reply" agent benefits from a bigger model with better tool calls. None of them are trying to do all three jobs at once.

It works. Reliably. On local models.

The contrast with the original failure is almost embarrassing in hindsight. "Build me an app" was eight agents' worth of work crammed into one prompt. Claude would have figured out which eight, done them in some order, and shipped me something. Qwen tried to do all eight simultaneously, made a mess, and got blamed for it.

Why this is actually a craft point, not a local-model point

Here's the part that surprised me.

The discipline that makes local work also makes cloud work better. Breaking your task into purpose-built agents with tight prompts is good practice everywhere. You should be doing it whether you're running Claude or Qwen.

It's just that Claude lets you skip it. ChatGPT lets you skip it less. Local models don't let you skip it at all. When people say "local doesn't work," what they're often saying is "local doesn't cover for me the way Claude does."

That's not a weakness of local. That's a forcing function.

Every team I've seen ship a reliable AI system, on cloud or local, does the same thing. They break work into small specialized agents. They don't hand a single agent the whole problem. They prompt narrowly. They right-size models. The cloud teams could get away with not doing this. They do it anyway because it makes their systems faster, cheaper, and more debuggable.

Local makes the discipline mandatory. Cloud makes it optional. Either way, the discipline is what separates "this kind of works in a demo" from "this runs in production."

What I'm not covering here

There's a lot more to making local AI deployments work than what I just described.

Quantization choice matters. The Q4 quant of 32b coder I was running probably hurt me. Chat template format matters too. I had it wrong twice and got garbage I blamed on the model. Sampling parameters matter more on local than they do on Claude. KV cache and context window management is its own art. Tool calling reliability is wildly different across local models, and "supports tool calls" in a model card means basically nothing in practice.

Hardware is its own conversation. A Mac Studio in the corner of the office is a different beast than a closet of them is a different beast than a server rack with GPUs. Each fits a different workload. I'll write that piece separately.

Any one of those can sink a deployment on its own. The reason I'm not covering them here is that the failure I see most often in local AI is upstream of all of them. People bring a generalist prompt to a specialist tool, the specialist tool produces garbage, and they conclude the tool is bad.

The tool is fine. The prompt was lazy.

The diagnostic

If you tried local AI and it didn't work, before you blame the model, ask yourself one question.

Did you build one agent that's supposed to do the whole job?

If yes, that agent would probably fail on Claude too. You just wouldn't notice, because Claude would limp through and ship you something that mostly works. The version of your system that runs reliably on local is the version that would have run better on cloud all along.

That's the actual lesson. Local isn't worse. Local is just less forgiving. And less forgiving is, weirdly, exactly the discipline most production AI systems need.

I'm still figuring some of this out. Local is moving fast. The models are getting better, smaller, faster, cheaper. The harness layer is getting smarter. The tooling around quantization and serving is getting more accessible. Six months from now this piece probably needs an update.

If you've gone deeper than I have on any of this and you think I'm wrong about something, tell me. I'd rather learn it from you than from another harness full of slop.

— Joe

← More writing