New Benchmark Reveals AI Agents Complete Under 3% of Freelance Tasks
TL;DR: The Remote Labor Index, a new benchmark from Scale AI and the Center for AI Safety, tested leading AI agents on simulated freelance work across multiple disciplines. The best-performing agent completed less than 3% of tasks, earning $1,810 of a possible $143,991. The research challenges predictions of imminent, widespread job displacement through AI automation and reveals significant limitations in current agents' ability to handle complex, multi-step work.
Benchmark Methodology
Researchers developed the Remote Labor Index to measure AI agents’ ability to perform economically valuable work. The benchmark comprised freelance tasks sourced through verified Upwork workers, spanning:
- Graphic design
- Video editing
- Game development
- Data scraping
- Administrative tasks
Each task included job descriptions, necessary files, and human-completed examples for reference. The benchmark aimed to simulate real-world freelance scenarios rather than artificial test environments.
Performance Results
The five leading AI agents tested demonstrated limited capability:
- Manus (Chinese startup) - Highest performer
- Grok (xAI)
- Claude (Anthropic)
- ChatGPT (OpenAI)
- Gemini (Google)
Even the best-performing agent completed less than 3% of the available work, earning $1,810 of a possible $143,991. This result contrasts sharply with industry predictions of imminent AI workforce replacement.
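To put the two dollar figures in proportion, the short sketch below works out what share of the available income the top agent earned. It uses only the figures reported above; the variable names are illustrative. Note that this income share is a different measure from the task completion rate (under 3%), presumably because tasks carry different payouts.

```python
# Figures reported in the article for the best-performing agent.
earned_usd = 1_810
available_usd = 143_991

# Share of the total possible income that was actually earned.
income_share = earned_usd / available_usd

# Income that remained on the table.
unearned_usd = available_usd - earned_usd

print(f"Share of available income earned: {income_share:.2%}")  # 1.26%
print(f"Income left unearned: ${unearned_usd:,}")               # $142,181
```

On these numbers, the best agent captured roughly 1.3% of the income on offer.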
Identified Limitations
Dan Hendrycks, director of the Center for AI Safety, identified specific capability gaps preventing effective task completion:
- Tool Integration: Agents struggle to coordinate multiple software tools effectively
- Long-term Memory: Current architectures lack persistent memory across work sessions
- Continual Learning: Agents cannot acquire new skills through on-the-job experience
- Complex Multi-step Tasks: Agents have difficulty maintaining context and progress through extended workflows
Whilst AI models have improved significantly in coding, mathematics, and logical reasoning, these advances haven’t translated to competence in practical freelance work requiring tool coordination and sustained context.
Industry Context
The benchmark results challenge several recent industry claims:
- Anthropic CEO Dario Amodei suggested 90% of coding work would be automated within months (March 2025)
- OpenAI’s GDPval benchmark indicates frontier models approaching human abilities across 220 office tasks
- Amazon recently attributed 14,000 job cuts partly to advances in generative AI
Beth Galetti, Amazon’s senior vice president of people experience and technology, described AI as “the most transformative technology we’ve seen since the Internet,” enabling companies to “innovate much faster than ever before.”
However, the Remote Labor Index suggests AI capabilities remain far from replacing human workers in complex, multi-faceted roles.
Benchmark Limitations
The researchers acknowledge that their measurement doesn’t capture AI’s complete economic impact:
- Limited Scope: Many professional tasks fall outside the freelance work categories tested
- Productivity Amplification: Real-world workers often use AI as an assistive tool rather than a replacement
- Human-AI Collaboration: The benchmark doesn’t measure performance in hybrid workflows
Bing Liu, director of research at Scale AI, noted: “We have debated AI and jobs for years, but most of it has been hypothetical or theoretical.” This benchmark provides empirical data for previously speculative discussions.
Strategic Implications for UK Businesses
The benchmark results suggest several considerations for UK organisations:
- Realistic Automation Planning: Current AI agents struggle with complex, multi-step workflows requiring tool coordination, so focus automation efforts on well-defined, single-domain tasks
- Human-AI Augmentation: AI tools are more effective at augmenting human workers’ capabilities than at replacing workers wholesale
- Skill Development Priority: Organisations should invest in upskilling workers to collaborate effectively with AI tools rather than planning for full automation
- Vendor Claims Scrutiny: Benchmark diverse AI solutions against real workflow requirements before committing to automation investments
The research indicates that whilst AI continues to improve, human-level performance across economically valuable work remains “some ways off.” Businesses should plan AI adoption strategies accordingly, focusing on augmentation and specific use cases rather than complete workforce replacement.
Source Attribution:
- Source: Wired
- Author: Will Knight
- Original: https://www.wired.com/story/ai-agents-are-terrible-freelance-workers/
- Published: 29 October 2025