AI Agents Pass Just 2.6% of Real Work Tasks in New Evaluation

A new evaluation framework finds that mainstream AI agents succeed on complex tasks only 2.6% of the time, raising concerns about their readiness for real-world applications.

CoinSynaptic Desk · AI Agents Correspondent
Published Jun 11, 2026

A recent evaluation framework from UC Berkeley has revealed a startling statistic: mainstream AI agents manage to pass just 2.6% of real-world professional tasks in its most difficult tier. This benchmark, known as the Agents’ Last Exam, was developed with insights from over 250 experts across more than 100 institutions and aims to provide a clearer understanding of AI capabilities in practical settings.

The benchmark covers 55 non-physical sub-industries, grouped into 13 clusters, based on the O*NET/SOC 2018 taxonomy. The project team has cataloged over 1,500 tasks so far, with plans to expand this to 5,000. Each task yields a verifiable outcome, which guards against a common failure of large language models: responses that sound plausible but are incorrect.

While the pass rate in the most difficult tier sits at a mere 2.6%, the standout performer, Codex running on gpt-5-5, achieved a pass rate of approximately 26%. Other agents, including those based on Cursor and Claude, followed closely behind, but none reached the proficiency necessary for practical use.
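
For readers who want to see how a pass rate over verifiable tasks is typically tallied, here is a minimal sketch. It assumes each task is recorded as a (tier, passed) pair; the function name, the tier labels, and the sample records are hypothetical and are not taken from the benchmark.

```python
from collections import defaultdict


def pass_rate_by_tier(results):
    """Return {tier: fraction of tasks passed} for (tier, passed) records."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for tier, passed in results:
        totals[tier] += 1
        if passed:
            passes[tier] += 1
    return {tier: passes[tier] / totals[tier] for tier in totals}


# Made-up records for illustration only; these are not benchmark data.
sample = [
    ("easy", True), ("easy", False),
    ("hard", False), ("hard", False), ("hard", False), ("hard", True),
]
rates = pass_rate_by_tier(sample)
for tier, rate in rates.items():
    print(f"{tier}: {rate:.1%}")
```

Reporting the rate per tier rather than as one pooled number keeps a headline figure for the hardest tasks from being confused with an agent's overall score.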

A key differentiator of the Agents’ Last Exam is its emphasis on long-term task performance rather than simple question answering. Many AI agents may correctly answer individual questions but struggle with more complex workflows that require context retention, sequential decision-making, and the production of reliable deliverables. This shortfall raises important questions about the readiness of AI agents for deployment in professional environments.
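
One way to build intuition for the gap between answering single questions and finishing long workflows is a simple compounding model. The sketch below assumes every step succeeds independently with the same probability, which is an illustrative simplification and not something the benchmark reports; the function name and the 0.9 figure are hypothetical.

```python
def workflow_success(step_success: float, steps: int) -> float:
    """Chance that every step succeeds, assuming independent, identical steps."""
    if not 0.0 <= step_success <= 1.0:
        raise ValueError("step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return step_success ** steps


# Hypothetical agent that gets any single step right 90% of the time.
for n in (1, 10, 50):
    print(f"{n} steps: {workflow_success(0.9, n):.3f}")
```

Under this assumption, an agent that looks strong on individual steps can still complete only a small share of workflows that chain many steps together.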

Led by UC Berkeley’s RDI, the initiative has attracted collaboration from notable institutions such as MIT, Harvard, Stanford, Goldman Sachs, JPMorgan, Meta, Amazon, and Adobe. The comprehensive nature of this benchmark indicates a shared concern among industry leaders regarding the practical applications of AI technology.

As AI continues to integrate into various sectors, the implications of these findings could be significant. The low pass rate suggests that developers and companies may need to adjust their expectations and timelines for AI deployment in real-world tasks. This benchmark not only measures current capabilities but also serves as a roadmap for future enhancements in AI performance.

The Agents’ Last Exam underscores the significant challenges that AI agents face in executing complex tasks. As the benchmark evolves, it aims to push the boundaries of what AI can achieve, ultimately seeking to improve the technology's applicability in real-world scenarios. The current figures indicate a long road ahead before AI agents can effectively meet the demands of professional environments.

Quick answers

What is the Agents’ Last Exam?

It is a large-scale evaluation framework developed by UC Berkeley to assess AI agents' performance on real-world professional tasks.

How many tasks are included in the benchmark?

The benchmark currently includes over 1,500 tasks, with plans to expand to 5,000.

Which AI agent performed best in the evaluation?

Codex running on gpt-5-5 achieved the highest pass rate at approximately 26%.

What are the main focuses of the benchmark?

The benchmark evaluates long-term task performance rather than quick question answering.
