Imagine if you staffed an entire company with AI. Researchers did just this. And guess what happened?
- Reto Zeidler
- May 25
- 3 min read
Updated: May 28
Carnegie Mellon researchers created TheAgentCompany, a virtual firm staffed entirely by AI agents to test their workplace capabilities. Shockingly, even the best AI completed only 24% of assigned tasks. While the experiment revealed AI's current limitations, it also highlights both challenges and opportunities in workplace automation. Here is what happened.
Agents running wild...
TheAgentCompany was Carnegie Mellon's ambitious experiment to see if today's most advanced AI models could handle the complexities of a real workplace. Researchers created a complete virtual software company environment with internal websites, Slack-like chat systems, employee manuals, and even digital HR managers and CTOs for the AI agents to interact with.
"Another AI, unable to find the right colleague in the chat tool, decided to create a new user with the same name - problem solved."
The AI employees, representing models from Google, OpenAI, Anthropic, Meta, and Amazon, were assigned tasks across finance, administration, and software engineering. The results? Let's just say human jobs are safe for now. The best performer, Claude 3.5 Sonnet from Anthropic, completed a mere 24% of tasks, while Google's Gemini 2.0 Flash managed only 11.4%, and OpenAI's GPT-4o achieved a paltry 8.6%. Amazon's Nova Pro v1 was the office slacker, completing just 1.7% of assignments.
The AI workforce demonstrated a spectacular lack of common sense and social skills. In one scenario, an AI agent couldn't close a simple pop-up window blocking access to important files. Instead of clicking the obvious "X" button, it messaged HR for help, who then suggested contacting IT support - a task neither of them ever completed.
Another AI, unable to find the right colleague in the chat tool, decided to create a new user with the same name - problem solved, AI style! When asked to paste responses into a Word document, the AI treated it as a plain text file, completely baffled by the task.
Each task cost approximately $6 in computing power, making this perhaps the most expensive yet underperforming workforce in corporate history. The AI employees routinely misinterpreted conversations, ignored key instructions, and prematurely marked incomplete tasks as finished - essentially embodying every workplace nightmare rolled into one digital package.
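To put those two numbers side by side, here is a rough back-of-the-envelope sketch. It assumes the roughly $6 figure applies to every attempted task, whether or not it succeeded, and that the 24% completion rate belongs to the same model. Neither assumption is spelled out in this article, and the function name is purely illustrative.

```python
def cost_per_completed_task(cost_per_attempt: float, completion_rate: float) -> float:
    """Average spend needed to get one finished task.

    Assumption (not from the study itself): every attempt costs the same,
    and only a fraction `completion_rate` of attempts end in success.
    """
    if not 0 < completion_rate <= 1:
        raise ValueError("completion_rate must be in the range (0, 1]")
    return cost_per_attempt / completion_rate


# Figures quoted in the article: about $6 per task, 24% completed by the best model.
best_case = cost_per_completed_task(6.0, 0.24)
print(f"Roughly ${best_case:.0f} per completed task")
```

Under those assumptions, each task that actually got finished would have cost about four times the headline price.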
What does this experiment show?
From a scientific standpoint, TheAgentCompany experiment provides valuable insights into the current capabilities and limitations of AI agents in professional environments.
"AI agents performed better in software development tasks compared to administrative or financial roles."
The study reveals that while AI can handle certain simple tasks autonomously, it struggles significantly with complex, long-horizon tasks that require common sense, social skills, and technical competence.
The researchers found that AI agents performed better in software development tasks compared to administrative or financial roles. This discrepancy likely stems from the abundance of publicly available code for training versus the scarcity of data on internal business processes. This suggests that domain-specific training with proprietary workflow data could significantly improve AI agent performance in specialized tasks.
Despite the underwhelming results, the researchers remain optimistic about the future of AI in the workplace. They note that newer LLMs are becoming more capable and cost-efficient, with open-weight models closing the gap with proprietary frontier models. The study serves as a benchmark for measuring future progress in AI agent capabilities.
Graham Neubig, a professor at Carnegie Mellon and study co-author, suggests that AI's impact on the workforce may follow a pattern similar to translation services. Despite the advancement of machine translation, the number of human translators grew by 11% between 2020 and 2023, as efficiency gains expanded the overall market size. This indicates that AI may complement human workers rather than replace them entirely.
The experiment also highlights the potential for AI to assist with specific aspects of complex tasks, even if it cannot yet handle entire workflows autonomously. Companies like Johnson & Johnson and Moody's are already exploring this hybrid approach, using AI agents for targeted applications while keeping humans involved in the process.
