OpenAI’s GPT-5.5 has emerged as the top performer on the newly established Agents’ Last Exam (ALE), recording a pass rate of 24.0%. The benchmark, created by researchers from the University of California, Berkeley, measures how well AI agents execute economically valuable workflows, a departure from traditional benchmarks that often focus on isolated coding tasks.
The launch of ALE represents a shift in how AI performance is assessed. Instead of relying on narrow question-answering frameworks, ALE evaluates models based on their ability to handle complex, real-world professional tasks. Agents must demonstrate proficiency across various functional layers, including reasoning, visual perception, and operational execution within virtual environments. The benchmark’s strict Generalist Computer-Use Agent (GCUA) framework ensures that only models capable of integrating multiple skills can succeed.
Anthropic's highly anticipated Claude Fable 5, released just a day before the ALE results, placed third with a pass rate of 22.0%. Both models show advanced capabilities, but the results indicate that OpenAI’s models currently handle complex, multi-part prompts better than their competitors. Some users have reported that Anthropic's Claude models can skip essential steps in a workflow, a critical flaw in a benchmark as demanding as ALE.
Defining the Benchmark
ALE introduces a thorough evaluation strategy designed to address the shortcomings of previous benchmarks. It launches with 1,490 task instances and is set to expand to 5,000 tasks, all based on the U.S. federal occupational taxonomy. The tasks span 55 industries and draw from authentic professional experiences, requiring agents to engage in activities like 3D model creation and neuroimaging analysis. Because the tasks mirror real professional work, the low scores expose a substantial gap between current AI capabilities and the demands of real-world applications.
The performance data shows that even the most advanced models face significant challenges. For instance, on ALE's most difficult tier, known as "Last-Exam," many configurations, including earlier versions of Anthropic's models, achieved a 0.0% pass rate. This highlights the ongoing hurdles in developing AI that can competently perform professional-grade tasks.
Tackling Benchmark Contamination
Another important aspect of ALE is its approach to mitigating benchmark contamination, a common issue where test questions inadvertently end up in the training data of AI models. ALE employs a dual-use deployment model that supports open-source research while keeping strict control over evaluation data. Only about 10% of the tasks are publicly shared, which protects the integrity of the benchmark.
This strategy promotes a "living benchmark" environment, where tasks are systematically rotated to prevent models from memorizing the evaluation criteria. Additionally, ALE distinguishes between tasks requiring proprietary software and those that utilize public resources, offering a clear comparison between models in varied contexts.
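The split and rotation described above can be illustrated with a minimal sketch. The article does not say how ALE actually selects public tasks or rotates held-out ones, so everything here is an assumption for illustration: the function names `split_public_private` and `active_tasks`, the hash-based selection, the salts, and the synthetic task IDs are all hypothetical. Only the 10% public share and the 1,490 task count come from the text.

```python
import hashlib

PUBLIC_PERCENT = 10  # assumption: mirrors the "about 10%" figure in the article


def _score(task_id: str, salt: str) -> int:
    """Map a task ID to a stable pseudo-random integer for a given salt."""
    digest = hashlib.sha256(f"{salt}:{task_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def split_public_private(task_ids):
    """Deterministically assign roughly 10% of tasks to a public set.

    Hypothetical helper: the same task ID always lands in the same set,
    so the public sample stays fixed while the rest remains held out.
    """
    public, held_out = [], []
    for task_id in task_ids:
        if _score(task_id, "public-split") % 100 < PUBLIC_PERCENT:
            public.append(task_id)
        else:
            held_out.append(task_id)
    return public, held_out


def active_tasks(held_out, round_number: int, k: int):
    """Pick k held-out tasks for one evaluation round.

    Hypothetical rotation scheme: re-ranking by a round-specific hash
    changes which held-out tasks are scored in each round.
    """
    ranked = sorted(held_out, key=lambda t: _score(t, f"round-{round_number}"))
    return ranked[:k]


# Synthetic task IDs; the count matches the 1,490 instances cited in the article.
task_ids = [f"task-{i:04d}" for i in range(1490)]
public, held_out = split_public_private(task_ids)
print(len(public), len(held_out))
print(active_tasks(held_out, round_number=1, k=5))
```

Hashing the task ID rather than sampling at random keeps the assignment reproducible without storing a list of public tasks, and a per-round salt means a model tuned against one round's tasks is not guaranteed to see them again in the next.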
Implications for the AI Ecosystem
The results from ALE serve as a sobering reality check for the AI ecosystem. With billions of dollars invested in AI agents, the industry needs accurate assessments of their capabilities. As Zengyi Qin, a researcher involved in the project, noted, the benchmark represents a necessary validation of AI performance claims. The low pass rates remind us that while advancements are being made, there is still a considerable distance to cover before AI agents can be reliably integrated into professional workflows.
As AI development continues, the outcomes of the Agents’ Last Exam underscore the need for rigorous evaluation methodologies. GPT-5.5's lead at a 24.0% pass rate, and the difficulties other leading models faced, show why high standards in AI assessment matter for moving the technology closer to real-world applicability.



