A recent study evaluating artificial intelligence agents on real-world freelance tasks revealed that even the most advanced models complete only 2.5% of projects at a professional level acceptable to clients. The Remote Labor Index, developed by researchers from the Center for AI Safety and Scale AI, tested six leading AI systems on 240 actual jobs sourced from platforms like Upwork. These tasks, which included game development, 3D modeling, video production, architectural planning and data analysis, represented over 6,000 hours of human labor valued at $140,000. Human evaluators compared AI outputs to the original human deliverables, assessing whether a reasonable client would pay for the work.

The study, published in October 2025, aimed to measure AI’s potential to automate economically valuable remote work, moving beyond simulated benchmarks to practical applications. Results showed low automation rates across all models. Manus led with 2.5%, followed by Grok 4 and Claude Sonnet 4.5 at 2.1% each, GPT-5 at 1.7%, ChatGPT agent at 1.3% and Gemini 2.5 Pro at 0.8%. Failures often stemmed from incomplete outputs, such as truncated videos or missing files; poor quality below professional standards; technical errors like corrupt formats; and inconsistencies, including mismatched elements in designs. AI performed better on narrower tasks like simple coding or data visualization but struggled with complex, multi-step projects requiring sustained reasoning or tool integration.

This benchmark emerges amid widespread predictions that AI will disrupt employment. A 2025 PwC survey found 60% of CEOs reported no financial returns from AI investments, despite heavy spending. Gartner analysts projected that by 2026, half of companies that replaced workers with AI would rehire them due to underperformance. In software, Microsoft noted 30% of its code was AI-generated in 2025, coinciding with major outages attributed to quality issues. Medical applications have seen over 100 FDA-reported AI malfunctions since 2024, including surgical errors leading to patient injuries.

A ColdFusion video episode from early 2026 raised similar concerns, citing the study's initial findings, which put Claude Opus 4.5 at 3.75% success, slightly higher than later reports but still an over-96% failure rate. It emphasized AI's strengths in creative areas like audio generation and web scraping but warned of overhyped capabilities, noting that companies including Anthropic, Google and Microsoft paid content creators up to $500,000 each for promotions in 2025. Yann LeCun, a pioneer of convolutional neural networks, stated in the video that current architectures mimic rather than understand, failing at basic tasks such as applying chess rules despite vast training data, and called for foundational research beyond scaling.

Implications extend to economic risks: U.S. tech firms invested $380 billion in AI infrastructure from 2025 to 2026, yet stocks such as Oracle's fell below pre-partnership levels. The study suggests AI remains a tool for augmentation, not replacement, in most fields. A related Upwork index released in November 2025 found that human-AI collaboration boosted completion rates by up to 70% on simple tasks, underscoring the need for oversight. Researchers noted steady improvements in Elo scores, a relative performance metric, indicating progress, but absolute completion rates remain near zero. As AI evolves, stakeholders must prepare for targeted impacts on language-heavy and data-retrieval jobs while addressing bias and reliability gaps.
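For readers unfamiliar with the metric, Elo scores rank systems by pairwise comparisons rather than absolute success: a model's rating rises when evaluators prefer its output over another's, so ratings can climb even while all models fail most tasks. A minimal sketch of the standard Elo update (illustrative parameters, not the study's actual implementation):

```python
def expected_score(rating_a, rating_b):
    # Probability that A beats B under the Elo model
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a, rating_b, a_won, k=32):
    # Shift both ratings toward the observed outcome; k controls step size
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - exp_a)
    return rating_a + delta, rating_b - delta

# Two models start at the same rating; a single preferred output separates them
r1, r2 = elo_update(1000, 1000, a_won=True)  # → (1016.0, 984.0)
```

Because the update depends only on which output was preferred, steady Elo gains are consistent with near-zero absolute completion rates, exactly the pattern the researchers describe.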
