I Hired Eleven AI Employees. Here's What Happened.

I didn't hire eleven AI employees on day one. That's not how it happens.

I started with one. A CEO agent, built on Paperclip, a framework launched earlier this year that's designed for exactly this: company structure for AI agents. Roles, budgets, reporting lines, audit trails. I gave the CEO a task. It delegated, tracked, followed up. It worked. That first week was genuinely promising.

So I added a CMO. Then a CTO. Then engineers under the CTO, a designer, a research agent. Each addition made sense at the time. Each one was solving a real gap. Over a few days, the company grew to eleven agents, each with a role, each with a workstream.

Then it broke.

What the Promise Actually Was

The org-chart framing for AI agents isn't a gimmick. It maps to something real. When you give an agent a role with a scope, a budget, and a clear deliverable, it behaves differently than when you just ask it to "help." The CEO framing worked because it created structure the agents could orient around. That part was true.

Paperclip crossed 30,000 GitHub stars in its first three weeks. It's a genuinely interesting idea. I want to be clear about that, because what comes next isn't a knock on the concept. It's a report from the early-adopter trenches on what breaks when a promising concept scales past the point you can see the whole system at once.

What Actually Broke

For a few days it was useful. The CEO was routing tasks, the team was producing output, and I was checking in rather than doing everything myself. That's the version that makes you want to keep building.

Then something shifted. I'm not sure exactly when — that's part of the problem. The failure mode wasn't a crash. It was a slow degradation. Tasks were getting delegated but not completed. Agents were running but not producing. I'd check in and find the CEO had handed something off three times without following up once.

By the time I gave it a specific task, building out a content and marketing operation for a product I was working on, the system had already quietly come apart. The plan looked clean on paper: the CMO owns content strategy, the CTO handles the technical build, the research agent does market research. Real workstreams, parallel execution. An hour later, nothing was done.

A couple of agents weren't running. A health check I'd added had broken and started cycling — trying to verify agent status, failing, retrying, burning through tokens with every loop. I shut the health check off and asked the CEO to diagnose what went wrong. It couldn't. I was doing the diagnosing. The CEO was an expensive middleman burning my API budget while I did its job.

Here's the part that matters: this wasn't a mistake I made. These are documented framework-level failures.

Paperclip has a filed GitHub issue — a cascade heartbeat loop that consumed 19.8 million input tokens in one hour because the budget cap failed to stop it. That's the same failure pattern I hit. There's another issue documenting that when an agent goes rogue, you can't stop it through the API or the command line. You have to find the OS process manually and kill it. There's a third issue noting that agents share a working directory by default, which means in a multi-agent setup they can overwrite each other's work without knowing it.
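For a sense of what a hard cap on a retry loop can look like, here is a minimal sketch. It is not Paperclip's code or API. Every name in it (TokenBudget, BudgetExceeded, run_health_check, est_tokens) is hypothetical, and it assumes you can estimate the token cost of a call before making it.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a charge would push spending past the cap."""


@dataclass
class TokenBudget:
    limit: int      # hard ceiling on tokens for this loop
    spent: int = 0

    def charge(self, tokens: int) -> None:
        # Check before recording the spend, so the cap can't be overshot.
        if self.spent + tokens > self.limit:
            raise BudgetExceeded(
                f"charge of {tokens} would exceed cap of {self.limit} "
                f"({self.spent} already spent)"
            )
        self.spent += tokens


def run_health_check(check, budget, est_tokens, max_attempts=3):
    """Retry check() until it passes or the attempts run out.

    check is a hypothetical zero-argument callable that returns True when
    the agent is healthy. est_tokens is an assumed per-call cost estimate.
    BudgetExceeded propagates to the caller instead of being retried.
    """
    for _ in range(max_attempts):
        budget.charge(est_tokens)   # reserve the cost before making the call
        if check():
            return True
    return False
```

The loop has two independent ceilings: a fixed number of attempts and a token cap that is checked before each call rather than after. Either one ends the loop on its own, so a broken check cannot cycle indefinitely.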

I was an early adopter of a framework that was three weeks old. The problems I hit were real, they were documented, and other builders hit them too.

This Isn't Just Paperclip

Before you conclude the answer is just a different framework: the problem is broader than any one tool.

Only about 11% of companies have AI agents fully operational in production right now. The people doing it successfully are mostly running narrow, tightly scoped automations — customer support routing, code review, infrastructure tasks with guardrails. Not general autonomous agents making decisions across multiple workstreams.

Boris Cherny built Claude Code. He runs five parallel sessions in tabs and monitors them with system notifications. That's supervised parallel work, not autonomous execution. He's watching.

The math is brutal and rarely discussed. If each step in an autonomous workflow succeeds 85% of the time, which is optimistic, and the steps fail independently, a 10-step workflow completes successfully about 20% of the time. String together 20 steps and you're below 4%. The demos don't show you the failure rate. They show you the run that worked.
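The arithmetic is easy to check yourself. This small sketch assumes every step has the same success rate and that steps fail independently of each other, which is a simplification. The function name is just for illustration.

```python
def workflow_success_rate(per_step: float, steps: int) -> float:
    """Chance that every step succeeds, assuming independent steps."""
    return per_step ** steps


# Print the end-to-end success rate for workflows of increasing length.
for steps in (1, 5, 10, 20):
    rate = workflow_success_rate(0.85, steps)
    print(f"{steps:>2} steps: {rate:.1%}")
```

Swap in your own per-step rate and step count. The curve falls off quickly even when the per-step number looks respectable.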

Two recent examples that made the news: in March, a developer's Claude Code session wiped a production database and 2.5 years of records after he told it to proceed despite a warning not to. In April, a Cursor agent hit a credential mismatch, found an API token in the codebase, and deleted an infrastructure volume in 9 seconds — three months of customer data, gone. In both cases the agent did what it was designed to do: keep moving toward the goal when it hit an obstacle. The problem was that nobody had defined what "stop" looked like.

What Actually Works

I have an agent that's been running quietly for months. I don't touch it. It does one job, in one domain, with clear inputs and a clear definition of what a failed output looks like.

That's the whole model. One agent. One domain. You can describe exactly what it does, what it needs, and what wrong looks like. If you can't answer all three before you build it, you're not ready to build it.

The eleven-agent org chart failed because I tried to skip that work. I assumed the structure of a company would substitute for the clarity of a well-defined task. It doesn't. A sprawling architecture doesn't compensate for fuzzy requirements. It just makes the fuzziness expensive.

The Questions Worth Asking

If you're evaluating an AI agent framework — or hiring someone to build on one — these are the questions that separate tools that are ready from tools that are interesting:

Can you stop a runaway agent without hunting OS processes? If the answer is no, you don't have a production-ready system. You have a demo.

What's the default state model? Do agents share a working environment, or are they isolated? Shared state sounds efficient until two agents overwrite each other's work and neither knows it happened.

What does the agent do when it hits an ambiguous situation? Does it stop and ask? Does it improvise? The agent that deleted three months of customer data improvised. It found a token that could solve its problem and used it. That's not a bug — it's the agent doing exactly what it was optimized to do. The question is whether the system around it had the guardrails to make improvisation safe.

What's the real maintenance burden? Ask for the failure rate, not the demo. Ask what happens on day 30, not day 1.
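On the shared-state question, isolation can be as simple as giving each agent its own directory. The sketch below is one way to do that with the Python standard library. It does not reflect how any particular framework manages workspaces, and make_agent_workdirs is a hypothetical helper that assumes agent names are safe to use as directory names.

```python
import tempfile
from pathlib import Path


def make_agent_workdirs(agent_names, root=None):
    """Create one private scratch directory per agent.

    Returns a dict mapping agent name to its directory. If no root is
    given, a fresh temporary directory is created to hold them all.
    """
    base = Path(root) if root else Path(tempfile.mkdtemp(prefix="agents-"))
    workdirs = {}
    for name in agent_names:
        path = base / name
        # exist_ok=False: fail loudly rather than silently reuse a directory.
        path.mkdir(parents=True, exist_ok=False)
        workdirs[name] = path
    return workdirs


dirs = make_agent_workdirs(["cmo", "cto"])

# Both agents write a file with the same name; neither clobbers the other.
(dirs["cmo"] / "plan.md").write_text("content strategy")
(dirs["cto"] / "plan.md").write_text("technical build")
```

Anything agents genuinely need to share can then move through an explicit handoff, where an overwrite is visible, instead of through a common folder.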

Where This Is Going

The frameworks are improving fast. Paperclip's issues are being filed and some are getting fixed. The concept — structured multi-agent workflows with defined roles and budgets — is sound. I still think it's the right direction for how AI gets deployed in real organizations.

But right now, in May 2026, the honest answer is that fully autonomous multi-agent loops require more scaffolding than any framework provides out of the box. The builders having real success are adding guardrails, checkpoints, and human review before anything irreversible happens. They're treating autonomy as something you earn with a track record, not something you assume from day one.
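A checkpoint before irreversible actions does not need to be elaborate. Here is a minimal sketch of the idea, with the caveat that the action names, the ApprovalRequired exception, and the execute function are all made up for illustration. A real system would also need to verify that the approval is genuine.

```python
# Hypothetical names for actions that cannot be undone.
IRREVERSIBLE_ACTIONS = frozenset({"delete_volume", "drop_database"})


class ApprovalRequired(PermissionError):
    """Raised when an irreversible action has no human sign-off."""


def execute(action, handler, approved_by=None):
    """Run handler() for the action, gating irreversible ones on approval.

    handler is a zero-argument callable that performs the action.
    approved_by is the name of the human who signed off, if any.
    """
    if action in IRREVERSIBLE_ACTIONS and not approved_by:
        # Stop here: the handler is never called without a sign-off.
        raise ApprovalRequired(f"{action!r} needs a named human approver")
    return handler()
```

Calling execute("delete_volume", handler) raises before the handler runs, while execute("delete_volume", handler, approved_by="joe") goes through. Reversible actions pass straight to the handler, so the gate only slows down the steps where a mistake can't be walked back.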

I hired eleven AI employees. Most of them didn't show up for work, one of them burned a hole in my API bill, and I spent an hour diagnosing a system that was supposed to diagnose itself.

The next one I hired does one job and costs $15 a month. It hasn't missed a day.

— Joe