Sarah had always wondered if her job could be done by a computer. As a project manager at a marketing firm, she spent her days juggling spreadsheets, coordinating between departments, and making decisions that seemed pretty straightforward. “How hard could it be?” she’d joke with colleagues during lunch breaks.
That question just got a very real answer. Researchers at Carnegie Mellon University decided to find out by creating an entire company staffed exclusively by AI systems – no humans allowed. They wanted to see if today’s most advanced artificial intelligence could handle the messy, complicated reality of office work.
The results were eye-opening, and not in the way tech enthusiasts might hope.
When AI Agents Tried to Run a Real Company
This wasn’t your typical AI company experiment where bots answered simple questions or performed single tasks. The researchers created a fully simulated business environment where AI systems had to work as actual employees. Each artificial worker was powered by cutting-edge models you’ve probably heard of: Claude 3.5 Sonnet, GPT-4o, Google Gemini, Amazon Nova, Meta’s Llama, and Alibaba’s Qwen.
Instead of being treated as chatbots, these AI systems were given real job titles and responsibilities. Some became “financial analysts,” others took on roles as “project managers” or “software engineers.” The goal was ambitious: see how well artificial intelligence could handle the complex, multi-step work that millions of people do every day.
“We wanted to test whether AI could really replace human workers in knowledge-based jobs,” explains one researcher involved in the study. “Not just answer questions, but actually think, plan, and execute like a real employee would.”
The experiment included simulated departments like HR that the AI agents had to interact with. They needed to send messages, request information, coordinate tasks, and navigate office politics – just like you do at your job.
What These AI Employees Actually Had to Do
The tasks weren’t designed to trick the AI systems. Instead, researchers chose typical knowledge work assignments that combine analysis, judgment, and coordination – the kind of stuff that fills most office workers’ days.
Here’s what the artificial employees were asked to handle:
- Navigate company file systems and analyze databases
- Pull information from multiple documents and create summaries
- Compare virtual office tours and recommend new premises
- Communicate with other departments for approvals and additional data
- Follow instructions that included both explicit steps and unspoken expectations
- Manage time and cost constraints while completing projects
- Make judgment calls when information was incomplete or conflicting
The researchers also threw in realistic workplace complications. Sometimes instructions were unclear. Other times, the AI agents had to take initiative when faced with unexpected problems. They needed to prioritize tasks, manage deadlines, and work within budget constraints.
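To get a feel for why multi-step assignments are so much harder than single questions, here's a minimal sketch of an agentic task loop. All the names here (`Task`, `run_agent`, `scripted_policy`) are hypothetical illustrations, not the benchmark's actual interfaces: the point is just that an agent must string together many individual steps, and any one failure sinks the whole task.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """A simulated office assignment broken into checkpoints."""
    name: str
    checkpoints: list            # steps the agent must complete
    done: set = field(default_factory=set)

def run_agent(task, act):
    """Drive a (hypothetical) agent through a task one step at a time.

    `act` stands in for an LLM-backed policy: given the task state, it
    returns the checkpoint it claims to have just completed, or None if
    it gives up. The loop enforces a step budget, loosely analogous to
    the time and cost constraints in the experiment.
    """
    for _ in range(len(task.checkpoints) * 2):   # bounded step budget
        step = act(task)
        if step is None:                         # agent gave up
            break
        if step in task.checkpoints:
            task.done.add(step)
        if len(task.done) == len(task.checkpoints):
            return True                          # full completion
    return False

# A trivial scripted policy that works through checkpoints in order.
def scripted_policy(task):
    for cp in task.checkpoints:
        if cp not in task.done:
            return cp
    return None

task = Task("choose office space",
            ["read listings", "compare costs", "email HR for approval"])
print(run_agent(task, scripted_policy))  # True
```

Even in this toy version, success requires completing every checkpoint within budget; an agent that stalls on one ambiguous step, or loops without progress, fails the whole task, which mirrors the all-or-nothing success rates reported in the study.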
| AI Model | Success Rate | Partial Credit Score |
|---|---|---|
| Claude 3.5 Sonnet | 24% | 34.4% |
| GPT-4o | 19% | 28.2% |
| Google Gemini | 16% | 25.1% |
| Amazon Nova | 14% | 22.8% |
| Meta Llama | 12% | 20.5% |
| Alibaba Qwen | 11% | 19.3% |
“The difference between answering a question and actually doing a job turned out to be massive,” notes another researcher. “These systems that seem so impressive in demonstrations really struggled with real workplace complexity.”
Why Three-Quarters of Tasks Ended in Failure
Even the best-performing AI system, Claude 3.5 Sonnet, successfully completed only 24% of its assigned tasks. When researchers gave partial credit for work that was incomplete but showed some understanding, Claude's score rose only to 34.4%. The other AI models performed worse still.
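One plausible way to produce a partial-credit figure like the 34.4% above is to score each task by the fraction of its checkpoints completed, reserving full marks for full completion. The weighting below (`full_bonus=0.5`) is an assumption for illustration; the study's exact formula may differ.

```python
def partial_credit(completed, total, full_bonus=0.5):
    """Score one task in [0, 1].

    Full completion earns 1.0; otherwise the agent gets a discounted
    fraction of the checkpoints it passed. The 0.5 discount is a
    hypothetical choice, not the study's published formula.
    """
    if total == 0:
        return 0.0
    if completed == total:
        return 1.0
    return full_bonus * (completed / total)

# (checkpoints passed, checkpoints total) for a toy batch of tasks
results = [(3, 3), (1, 4), (0, 5), (2, 2)]

success_rate = sum(c == t for c, t in results) / len(results)
partial_score = sum(partial_credit(c, t) for c, t in results) / len(results)
print(success_rate)   # 0.5
print(partial_score)  # 0.53125
```

Note how the two metrics diverge: the binary success rate ignores tasks that got most of the way there, while the partial score rewards them, which is why the gap between 24% and 34.4% tells you many of Claude's failures still contained useful partial work.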
The failures weren’t random glitches. They revealed fundamental limitations in how current AI systems handle real-world work environments. The artificial employees struggled with several key challenges:
- Context switching: Moving between different types of tasks confused the AI systems
- Ambiguous instructions: When directions weren't crystal clear, the bots often gave up or headed in the wrong direction
- Multi-step planning: Complex projects requiring several phases proved too difficult to manage
- Inter-departmental communication: Coordinating with other AI agents led to confusion and delays
- Judgment calls: Making decisions with incomplete information was nearly impossible for most systems
One particularly telling example involved a seemingly simple task: choosing new office space. The AI agents could analyze individual office listings just fine. But when asked to weigh factors like cost, location, and team preferences, then communicate their reasoning to a simulated HR department for approval, most systems fell apart.
“What surprised us most was how quickly things broke down when multiple steps were involved,” explains a Carnegie Mellon researcher. “These models are incredibly sophisticated, but they still think in very rigid, linear ways.”
What This Means for Your Job and the Future
Before you either panic about robot replacements or dismiss AI entirely, this AI company experiment reveals something more nuanced about the future of work. The technology isn’t ready to replace human workers wholesale, but it’s not useless either.
The study shows that current AI systems work best as specialized tools rather than general-purpose employees. They can handle specific, well-defined tasks but struggle with the messy, interconnected nature of most real jobs.
For workers like Sarah, the project manager we met at the beginning, this research suggests jobs aren’t disappearing overnight. Instead, AI might become a powerful assistant that handles routine analysis while humans focus on coordination, creativity, and complex decision-making.
“The gap between AI capabilities and human work is bigger than many people realize,” says a workplace automation expert. “This experiment shows we’re still years away from artificial intelligence that can truly replace human judgment and adaptability.”
Companies considering major AI implementations should pay attention to these results. Rather than expecting AI to replace entire departments, the smart move might be identifying specific tasks where artificial intelligence can genuinely help, while keeping humans in charge of overall strategy and complex problem-solving.
The financial implications are significant too. Organizations investing heavily in AI automation might want to reconsider their timelines and expectations. This experiment suggests that human workers will remain essential for complex knowledge work far longer than some predictions suggest.
FAQs
Which AI system performed best in the company experiment?
Claude 3.5 Sonnet had the highest success rate at 24%, though this was still considered a poor performance overall.
What types of tasks did the AI employees struggle with most?
Multi-step projects requiring coordination between departments and judgment calls with incomplete information proved most difficult for the AI systems.
Does this mean AI can’t help with office work at all?
Not at all – the study shows AI works well for specific, well-defined tasks but struggles as a general-purpose employee replacement.
How realistic was this simulated company environment?
Very realistic – researchers included typical workplace challenges like unclear instructions, departmental coordination, and time/budget constraints.
Should companies stop investing in workplace AI based on these results?
Companies should adjust expectations and focus on using AI as a specialized tool rather than a wholesale human replacement.
When might AI become capable of handling complex office work?
The researchers suggest current limitations indicate we’re still years away from AI that can match human adaptability in complex workplace situations.
