AI Agents Pass Just 2.6% of Real Work Tasks in New Evaluation

A new evaluation framework from UC Berkeley finds that mainstream AI agents pass just 2.6% of real-world professional tasks in its most difficult tier. The benchmark, known as the Agents’ Last Exam, was developed with input from more than 250 experts across more than 100 institutions and aims to give a clearer picture of AI capabilities in practical settings.

The benchmark covers 55 non-physical sub-industries, grouped into 13 clusters based on the O*NET/SOC 2018 taxonomy. The project team has cataloged over 1,500 tasks so far and plans to expand to 5,000. Each task has a verifiable outcome, which addresses a common failure of large language models: responses that sound plausible but are incorrect.

Against the 2.6% figure for the hardest tier, the standout performer, Codex running on gpt-5-5, achieved a pass rate of approximately 26%. Other agents, including those built on Cursor and Claude, followed closely behind, but none reached the proficiency needed for practical use.

A key differentiator of the Agents’ Last Exam is its emphasis on long-term task performance rather than simple question answering. Many AI agents answer individual questions correctly but struggle with more complex workflows that require context retention, sequential decision-making, and reliable deliverables. That shortfall raises questions about whether AI agents are ready for deployment in professional environments.

Led by UC Berkeley’s RDI, the initiative has attracted collaborators from institutions including MIT, Harvard, Stanford, Goldman Sachs, JPMorgan, Meta, Amazon, and Adobe. The breadth of that collaboration suggests a shared concern among academic and industry leaders about how well AI performs in practical applications.

As AI is integrated into more sectors, these findings could carry real weight. The low pass rate suggests that developers and companies may need to adjust their expectations and timelines for deploying AI on real-world tasks. The benchmark measures current capabilities and also serves as a roadmap for future improvements in AI performance.

The Agents’ Last Exam underscores how difficult complex tasks remain for AI agents. As the benchmark expands, it aims to push the technology toward greater usefulness in real-world scenarios. The current figures indicate a long road ahead before AI agents can meet the demands of professional environments.

Quick answers

What is the Agents’ Last Exam?

It is a large-scale evaluation framework developed by UC Berkeley to assess AI agents' performance on real-world professional tasks.

How many tasks are included in the benchmark?

The benchmark currently includes over 1,500 tasks, with plans to expand to 5,000.

Which AI agent performed best in the evaluation?

Codex running on gpt-5-5 achieved the highest pass rate at approximately 26%.

What are the main focuses of the benchmark?

The benchmark evaluates long-term task performance rather than quick question answering.

CoinSynaptic Desk

