Measuring AI Ability to Complete Long Tasks: Opus 4.5 has 50% horizon of 4h49M

https://news.ycombinator.com/rss Hits: 20
Summary

Linear Scale Log Scale 50% Success 80% Success Analysis code available on GitHub Raw data available here This is our most up-to-date measurement of the task-completion time horizons for public language models. We intend to update this graph periodically whenever we have new measurements to share. For methodological details, including a definition of the task-completion time horizon, see the blog post below and the associated paper. Summary: We propose measuring AI performance in terms of the length of tasks AI agents can complete. We show that this metric has been consistently exponentially increasing over the past 6 years, with a doubling time of around 7 months. Extrapolating this trend predicts that, in under a decade, we will see AI agents that can independently complete a large fraction of software tasks that currently take humans days or weeks. The length of tasks (measured by how long they take human professionals) that generalist frontier model agents can complete autonomously with 50% reliability has been doubling approximately every 7 months for the last 6 years. The shaded region represents 95% CI calculated by hierarchical bootstrap over task families, tasks, and task attempts. Full paper | Github repo We think that forecasting the capabilities of future AI systems is important for understanding and preparing for the impact of powerful AI. But predicting capability trends is hard, and even understanding the abilities of today鈥檚 models can be confusing. Current frontier AIs are vastly better than humans at text prediction and knowledge tasks. They outperform experts on most exam-style problems for a fraction of the cost. With some task-specific adaptation, they can also serve as useful tools in many applications. And yet the best AI agents are not currently able to carry out substantive projects by themselves or directly substitute for human labor. They are unable to reliably handle even relatively low-skill, computer-based work like remote executive assi...

First seen: 2025-12-21 04:30

Last seen: 2025-12-22 00:33