What Current AI Agents Can (and Can't) Do
Examines real-world benchmarks showing how present AI agents struggle to maintain autonomy over extended tasks and adapt to unexpected challenges.
Introduction
For debates about AI doom, dangerous autonomy, and loss of control, one practical question matters more than almost any theoretical argument: how well can current AI agents actually carry out long, complex tasks without human help?
The evidence so far is mixed. Modern agents can complete substantially longer tasks than systems from only a few years ago, and some can autonomously perform software engineering work that would take human experts several hours. At the same time, benchmark results consistently show that current agents remain unreliable on genuinely long-horizon activities. They lose track of goals, fail to recover from unexpected obstacles, accumulate small errors, and often abandon tasks before completion. The current empirical picture is therefore neither “agents are already fully autonomous” nor “agents cannot act independently at all”. Instead, researchers are observing rapidly improving but still fragile systems whose capabilities appear to be extending into longer time horizons. [arXiv] [International AI Safety Report] [TechUK]
[…] multiple tools, and opportunities for mistakes to compound over time.
- RE-Bench evaluates research and engineering tasks that resemble real-world technical work. [METR]
- LORE (Long-horizon Reasoning Evaluation) and the underlying TaskWeaver framework generate tasks with controllable horizon lengths to investigate how performance changes as dependency chains become longer. [OpenReview]
- LongCLI-Bench focuses on long software-engineering workflows conducted through command-line tools. [arXiv]
- LongDS-Bench evaluates multi-stage data-analysis tasks in which agents must remember, revise, and combine evolving analytical states over many interactions. [arXiv]
These benchmarks are important for AI-risk discussions because many loss-of-control scenarios assume an AI can maintain coherent behaviour over long periods while interacting with changing environments. Short-answer tests provide limited evidence about that capability.
What the Results Actually Show
The clearest finding across long-horizon evaluations is that performance falls sharply as tasks become longer and more interconnected.
The International AI Safety Report 2026 summarises the current state of evidence bluntly: today’s agents “reliably fail on longer tasks”, frequently lose track of progress, and often cannot adapt effectively when unexpected obstacles arise. The report nevertheless notes that autonomous operating horizons have been increasing rapidly. [International AI Safety Report]
METR (Model Evaluation and Threat Research) has attempted to quantify this trend using a “task-completion time horizon” metric. Rather than asking whether a model can answer a question, the metric asks how long a task, measured by the time a human typically needs to complete it, an agent can finish with roughly 50% reliability. Their findings suggest that frontier systems have steadily improved on longer tasks over recent years, although performance remains far from robust. [OpenReview]
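As a rough illustration of how a 50% time horizon could be estimated, the sketch below fits a logistic curve of success against the logarithm of human task length and solves for the length at which predicted success equals 0.5. This is a minimal sketch of one common way to estimate such a threshold, and whether it matches METR's exact procedure is an assumption. The function names and the toy data are hypothetical and do not come from any benchmark.

```python
import math

def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit p(success) = sigmoid(a + b * x) by gradient ascent on the log-likelihood."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b

def time_horizon_50(task_minutes, successes):
    """Return the task length (in human minutes) where predicted success is 50%."""
    xs = [math.log2(m) for m in task_minutes]
    mean_x = sum(xs) / len(xs)
    # Centring the inputs keeps the simple gradient ascent well conditioned.
    a, b = fit_logistic([x - mean_x for x in xs], successes)
    # sigmoid(a + b * (x - mean_x)) equals 0.5 exactly when x = mean_x - a / b.
    # Assumes b is non-zero, i.e. success actually varies with task length.
    return 2 ** (mean_x - a / b)

# Hypothetical toy data: human completion time per task and whether the agent succeeded.
TOY_TASK_MINUTES = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
TOY_SUCCESSES = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]

horizon = time_horizon_50(TOY_TASK_MINUTES, TOY_SUCCESSES)
print(f"Estimated 50% time horizon: {horizon:.1f} human-minutes")
```

Because the toy outcomes are symmetric around the middle of the log-length range, the estimate lands at about 22.6 minutes. Real evaluations involve many tasks, repeated runs, and uncertainty estimates that this sketch omits.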
This behaviour resembles what software engineers sometimes call “error accumulation”: small mistakes that would be harmless individually become fatal when multiplied across dozens or hundreds of steps.
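The arithmetic behind error accumulation can be sketched in a few lines. The example assumes every step succeeds independently with the same probability and that a single failed step ruins the whole task. That is a deliberate simplification introduced here for illustration, not a model taken from the cited benchmarks, and the function names are hypothetical.

```python
import math

def chain_success(per_step_reliability, num_steps):
    """Probability that all steps succeed, assuming independent, unrecoverable steps."""
    return per_step_reliability ** num_steps

def steps_until_half(per_step_reliability):
    """Number of steps at which overall success falls to 50% under the same assumption."""
    return math.log(0.5) / math.log(per_step_reliability)

for steps in (10, 100, 500):
    print(f"{steps:>3} steps at 99% per step: {chain_success(0.99, steps):.3f}")

print(f"Steps until 50% overall success at 99% per step: {steps_until_half(0.99):.0f}")
```

Under these assumptions an agent that is 99% reliable per step completes a 10-step task about 90% of the time, but a 100-step task only about 37% of the time, and overall success crosses 50% at roughly 69 steps. Agents that can detect and repair their own mistakes would do better than this simple model predicts.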
The Most Common Failure Modes
The limitations revealed by long-horizon benchmarks are remarkably consistent across domains.
Losing Context and State
Many agents struggle to maintain an accurate internal picture of what has already been accomplished.
LongDS-Bench found substantial declines in performance during later stages of extended analytical workflows. Researchers reported large drops between early and late turns, with long-horizon errors accounting for a majority of failures. The primary problem was not a lack of intelligence about individual steps but difficulty maintaining a coherent evolving state over time. [arXiv]
This matters because many hypothetical autonomous-risk scenarios assume agents can accurately track plans across extended periods. Current evidence suggests this remains difficult.
Poor Adaptation to Surprises
Long tasks rarely unfold exactly as expected.
Dynamic benchmarks such as Gaia2 were developed partly because static evaluations allowed agents to follow pre-planned paths. In changing environments, agents must recognise unexpected events, revise plans, resolve ambiguities, and continue operating despite incomplete information. Current systems often struggle with these requirements. [arXiv]
The International AI Safety Report similarly highlights difficulty handling unexpected obstacles as a major limitation of contemporary agents. [International AI Safety Report]
Planning Failures
Many failures occur surprisingly early.
LongCLI-Bench found that state-of-the-art coding agents achieved low success rates on realistic long-horizon programming tasks, with many attempts stalling before substantial progress had been made. Human guidance often produced larger improvements than autonomous self-correction mechanisms. [arXiv]
This suggests that current systems frequently fail at maintaining a workable plan rather than merely making occasional execution mistakes.
Strategy Drift
When tasks become very long, agents often begin pursuing objectives that differ subtly from the original goal.
Benchmarks such as RetailBench, which evaluates long-term decision-making in changing environments, found that performance deteriorates substantially as complexity rises. Maintaining a coherent strategy over time remains difficult even for leading systems. [arXiv]
For AI-safety researchers, this phenomenon is interesting because it resembles, in miniature form, concerns about specification errors and goal drift that appear in broader alignment discussions.
Why Time-Horizon Metrics Matter to Doom Arguments
Many AI doom scenarios depend on sustained autonomous competence.
An AI that can complete a ten-minute task but fails after an hour is very different from an AI that can reliably manage multi-week projects, coordinate resources, recover from setbacks, and continue pursuing objectives despite interruptions.
Current benchmark results therefore cut both ways in existential-risk debates.
On one hand, they provide evidence against the strongest versions of claims that present-day agents are already capable of dangerous independent operation at large scales. The consistent finding across evaluations is that reliability remains poor on genuinely long-horizon tasks. Today’s systems are not displaying the kind of robust autonomous competence that many takeover scenarios would require. [International AI Safety Report]
On the other hand, the same benchmarks provide evidence for a different concern: capabilities appear to be improving in precisely the direction that doom-focused researchers watch most closely. Agents are increasingly able to complete longer workflows, use more tools, and operate with less supervision than their predecessors. The International AI Safety Report notes that measured horizons have been lengthening rapidly, while METR’s analyses point to substantial progress in task-completion capability over time. [International AI Safety Report] [METR]
For people concerned about p(doom)—the probability that advanced AI could eventually cause existential catastrophe—the key question is not whether current systems can autonomously run civilisation. They plainly cannot. The question is whether the trend toward longer and more reliable autonomous operation continues faster than methods for monitoring, controlling, and aligning those systems.
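To make the trend question concrete, the sketch below extrapolates a time horizon under the assumption that it keeps doubling at a fixed interval. The doubling interval, the starting horizon, and the target horizon are all hypothetical parameters chosen for illustration. They are not figures from the article, and a steady exponential trend is itself an assumption that may not hold.

```python
import math

def months_to_reach(current_hours, target_hours, doubling_months):
    """Months until the horizon reaches target_hours, assuming steady exponential growth."""
    doublings_needed = math.log2(target_hours / current_hours)
    return doubling_months * doublings_needed

# Hypothetical scenario: from a 1-hour horizon to a 40-hour (one working week) horizon.
for doubling_months in (6, 12, 24):
    months = months_to_reach(1, 40, doubling_months)
    print(f"Doubling every {doubling_months} months: about {months:.0f} months")
```

The output is highly sensitive to the assumed doubling interval, which is one reason the pace of the trend matters as much as the current level of capability.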
What We Can Infer—and What We Cannot
The strongest conclusion supported by current evidence is relatively modest: today’s AI agents remain unreliable on long-horizon tasks, but their capabilities are improving fast enough that researchers increasingly measure autonomy in hours rather than minutes. [International AI Safety Report] [METR]
Several important uncertainties remain.
First, benchmark performance may not perfectly predict real-world behaviour. Some benchmarks are small, specialised, or focused on software tasks. Researchers themselves caution against treating any single metric as a definitive measure of autonomy. [METR]
Second, long-horizon competence is not a single capability. Planning, memory, tool use, adaptation, self-correction, and strategic reasoning can improve at different rates. An agent may excel in one area while failing badly in another. [OpenReview]
Third, the evidence currently points to a gap between impressive demonstrations and dependable operation. Agents can sometimes complete surprisingly complex projects, yet still fail frequently enough that unsupervised deployment remains risky in many contexts. [International AI Safety Report] [DFKI]
From the perspective of AI-doom debates, current long-horizon benchmarks therefore function less as proof of imminent loss of control and more as an early-warning indicator. They show that autonomous capability is real, measurable, and increasing, while also showing that present systems remain far from the robust, persistent autonomy assumed by the most severe existential-risk scenarios. [International AI Safety Report] [METR]
Further Reading
Books and field guides related to What Current AI Agents Can (and Can't) Do. Use these as the next step if you want deeper reading beyond the article.
The Alignment Problem
Explains real-world AI capabilities, failures, and alignment challenges.
Co-Intelligence
Provides practical perspective on what current AI agents can and cannot do.
Endnotes
- Source: techuk.org
  Link: https://www.techuk.org/resource/the-release-of-the-international-ai-safety-report-2026-navigating-rapid-ai-advancement-and-emerging-risks.html
  Source snippet: The release of the international AI safety report 2026 (3 Feb 2026): The report notes that AI agents can now autonomously complete software...
- Source: arxiv.org
  Link: https://arxiv.org/abs/2311.12983
  Source snippet: [2311.12983] GAIA: a benchmark for General AI Assistants, by G Mialon · 2023 · Cited by 649: GAIA proposes real-world questions that...
- Source: arxiv.org
  Link: https://arxiv.org/pdf/2602.11964
  Source snippet: gaia2: benchmarking llm agents on dynamic, by R Froger · 2026: We introduce Gaia2, a benchmark designed to address these limitations...
- Source: metr.org
  Title: 2024 11 22 evaluating r d capabilities of llms
  Link: https://metr.org/blog/2024-11-22-evaluating-r-d-capabilities-of-llms/
  Source snippet: METR, Evaluating frontier AI R&D capabilities of language model... (22 Nov 2024): We're releasing RE-Bench, a new benchmark for measuring th...
- Source: metr.org
  Link: https://metr.org/AI_R_D_Evaluation_Report.pdf
  Source snippet: METR, RE-Bench: Evaluating frontier AI R&D capabilities of... (Nov 1, 2024): Here, “time-horizon” refers to the length of time humans spend...
- Source: openreview.net
  Title: Probing the Limits of Endurance in Long-Horizon Tasks, by W Zheng
  Link: https://openreview.net/forum?id=dAn82lpLx4
  Source snippet: OpenReview, Probing the Limits of Endurance in Long-Horizon Tasks, by W Zheng. Summary: The paper introduces TaskWeaver, a framework for pro...
- Source: openreview.net
  Link: https://openreview.net/pdf?id=dAn82lpLx4
  Source snippet: PROBING THE LIMITS OF ENDURANCE IN LONG-..., by W Zheng. Right: An overview of our proposed benchmark framework, which is designed to be...
- Source: arxiv.org
  Link: https://arxiv.org/abs/2602.14337
  Source snippet: arXiv, LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces...
  Published: February 15, 2026
- Source: arxiv.org
  Link: https://arxiv.org/abs/2605.30434
- Source: arxiv.org
  Title: arXiv Measuring AI Ability to Complete Long Software Tasks
  Link: https://arxiv.org/abs/2503.14499
  Source snippet: This is the time humans typically take to complete tasks that AI models can complete with 50%...
- Source: metr.org
  Link: https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/
  Source snippet: METR, Measuring AI Ability to Complete Long Tasks (Mar 19, 2025): We propose measuring AI performance in terms of the length of tasks AI agen...
- Source: arxiv.org
  Link: https://arxiv.org/abs/2603.16453
  Source snippet: arXiv, RetailBench: Evaluating Long-Horizon Autonomous Decision-Making and Strategy Stability of LLM Agents in Realistic Retail Environment...
- Source: dfki.de
  Title: 2026 international ai safety report published
  Link: https://www.dfki.de/en/web/news/2026-international-ai-safety-report-published
  Source snippet: 3 Feb 2026: Despite the progress made and the greater availability of agents on the market, however, the systems still fail when it come...
- Source: metr.org
  Title: 2026 01 22 time horizon limitations
  Link: https://metr.org/notes/2026-01-22-time-horizon-limitations/
  Source snippet: METR, Clarifying limitations of time horizon (Jan 22, 2026): We propose measuring AI performance in terms of the length of tasks AI agents ca...
- Source: arxiv.org
  Link: https://arxiv.org/abs/2604.11978
- Source: arxiv.org
  Link: https://arxiv.org/abs/2602.21012
  Source snippet: [2602.21012] International AI Safety Report 2026, by Y Bengio · 2026 · Cited by 56: The International AI Safety Report 2026 synthesises th...
- Source: arxiv.org
  Link: https://arxiv.org/html/2503.14499v1
  Source snippet: This is the time humans typically take to complete tasks that AI models can complete with 50%...
- Source: arxiv.org
  Link: https://arxiv.org/pdf/2602.21012
  Source snippet: The Report does not necessarily represent the...
- Source: metr.org
  Title: time horizons
  Link: https://metr.org/time-horizons/
  Source snippet: Task-Completion Time Horizons of Frontier AI Models (Mar 3, 2026): The 50%-time horizon is the length of task (measured by how long it take...
- Source: metr.org
  Link: https://metr.org/
  Source snippet: METR: We propose measuring AI performance in terms of the length of software tasks AI agents can complete. We show an exponential increase...
- Source: openreview.net
  Link: https://openreview.net/forum?id=3rB0bVU6z6&noteId=lOCHc0u2a6
  Source snippet: RE-Bench: Evaluating Frontier AI R&D Capabilities of... Summary: This paper contributes a new LLM (Agent) benchmark RE-Benc...
  Published: May 1, 2025
- Source: internationalaisafetyreport.org
  Title: international ai safety report 2026
  Link: https://internationalaisafetyreport.org/publication/international-ai-safety-report-2026
  Source snippet: Loss of control... Current agents reliably fail on longer tasks, lose track of...
- Source: hoganlovells.com
  Title: international ai safety report 2026 uk litigation lessons from imperfect ai
  Link: https://www.hoganlovells.com/en/publications/international-ai-safety-report-2026-uk-litigation-lessons-from-imperfect-ai
  Source snippet: AI Safety Report 2026 – UK litigation lessons... (16 Apr 2026): The International AI Safety Report 2026 is an assessment of general-purpos...
- Source: internationalaisafetyreport.org
  Title: international ai safety report 2026
  Link: https://internationalaisafetyreport.org/sites/default/files/2026-02/international-ai-safety-report-2026.pdf
  Source snippet: The Report does not necessarily represent...
- Source: internationalaisafetyreport.org
  Title: The duration of some software engineering tasks
  Link: https://internationalaisafetyreport.org/publication/2026-report-extended-summary-policymakers
  Source snippet: 2026 Report: Extended Summary for Policymakers (3 Feb 2026): If current trends continue, AI systems could operate autonomously on multi-day...
- Source: internationalaisafetyreport.org
  Link: https://internationalaisafetyreport.org/publication/2026-report-executive-summary
  Source snippet: 2026 Report: Executive Summary (3 Feb 2026): The Executive Summary offers a concise three-page overview of the 2026 Report's core findings...
- Source: yoshuabengio.org
  Title: international ai safety report 2026
  Link: https://yoshuabengio.org/en/publication/international-ai-safety-report-2026
  Source snippet: 6 Feb 2026: The International AI Safety Report 2026 synthesises the current scientific evidence on the capabilities, emerging risks, and...
- Source: insideprivacy.com
  Link: [https://www.insideprivacy.com/artificial
  Source snippet: International AI Safety Report 2026 Examines AI... (12 Feb 2026): Specifically, the Report finds that models are less reliable when projec...
- Source: hal.science
  Link: https://hal.science/hal-05223593v1/file/2501.17805v1.pdf
  Source snippet: International AI Safety Report, by Y Bengio · 2025 · Cited by 171: general-purpose AI agents deployed to accomplish long-horizon tasks can...
- Source: globalpolicywatch.com
  Link: https://www.globalpolicywatch.com/2026/02/international-ai-safety-report-2026-examines-ai-capabilities-risks-and-safeguards/
  Source snippet: International AI Safety Report 2026 Examines AI... (13 Feb 2026): According to the Report, current AI systems may exhibit unpredictable fa...