What Current AI Agents Can (and Can't) Do

Examines real-world benchmarks showing how present AI agents struggle to maintain autonomy over extended tasks and adapt to unexpected challenges.

Introduction

In debates about AI doom, dangerous autonomy, and loss of control, one practical question matters more than almost any theoretical argument: how well can current AI agents actually carry out long, complex tasks without human help?

The evidence so far is mixed. Modern agents can complete substantially longer tasks than systems from only a few years ago, and some can autonomously perform software engineering work that would take human experts several hours. At the same time, benchmark results consistently show that current agents remain unreliable on genuinely long-horizon activities. They lose track of goals, fail to recover from unexpected obstacles, accumulate small errors, and often abandon tasks before completion. The current empirical picture is therefore neither “agents are already fully autonomous” nor “agents cannot act independently at all”. Instead, researchers are observing rapidly improving but still fragile systems whose capabilities appear to be extending into longer time horizons. [arXiv] [International AI Safety Report] [TechUK]

Several benchmarks target this kind of long-horizon work, where tasks involve multiple tools and opportunities for mistakes to compound over time:

  • RE-Bench evaluates research and engineering tasks that resemble real-world technical work. [METR]
  • LORE (Long-horizon Reasoning Evaluation) and the underlying TaskWeaver framework generate tasks with controllable horizon lengths to investigate how performance changes as dependency chains become longer. [OpenReview]
  • LongCLI-Bench focuses on long software-engineering workflows conducted through command-line tools. [arXiv]
  • LongDS-Bench evaluates multi-stage data-analysis tasks in which agents must remember, revise, and combine evolving analytical states over many interactions. [arXiv]

These benchmarks are important for AI-risk discussions because many loss-of-control scenarios assume an AI can maintain coherent behaviour over long periods while interacting with changing environments. Short-answer tests provide limited evidence about that capability.

What the Results Actually Show

The clearest finding across long-horizon evaluations is that performance falls sharply as tasks become longer and more interconnected.

The International AI Safety Report 2026 summarises the current state of evidence bluntly: today’s agents “reliably fail on longer tasks”, frequently lose track of progress, and often cannot adapt effectively when unexpected obstacles arise. The report nevertheless notes that autonomous operating horizons have been increasing rapidly. [International AI Safety Report]

METR (Model Evaluation and Threat Research) has attempted to quantify this trend using a “task-completion time horizon” metric. Rather than asking whether a model can answer a question, the metric asks how long a task, measured by the time it takes humans to complete, an agent can finish with roughly 50% reliability. Their findings suggest that frontier systems have steadily improved on longer tasks over recent years, although performance remains far from robust. [OpenReview]
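
As a rough illustration of how a 50% time horizon can be estimated, the sketch below fits a logistic curve of success against the logarithm of human task length and reads off the length at which predicted success crosses the chosen reliability. The task data, the function names, and the plain gradient-ascent fit are all assumptions made for this example; it is not METR's code and the numbers are not real results.

```python
import math

def fit_logistic(xs, ys, lr=0.1, steps=20000):
    """Fit P(success) = sigmoid(a + b * x) by gradient ascent on the mean log-likelihood."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b

def time_horizon_minutes(task_minutes, successes, reliability=0.5):
    """Task length (in human minutes) at which fitted success equals `reliability`."""
    xs = [math.log2(m) for m in task_minutes]  # success is modelled against log task length
    a, b = fit_logistic(xs, successes)
    target_logit = math.log(reliability / (1.0 - reliability))
    return 2.0 ** ((target_logit - a) / b)

# Hypothetical agent runs: human time per task (minutes) and whether the agent succeeded.
TASK_MINUTES = [2, 4, 8, 16, 32, 32, 64, 128, 256, 512]
SUCCESSES = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

horizon_50 = time_horizon_minutes(TASK_MINUTES, SUCCESSES)
horizon_80 = time_horizon_minutes(TASK_MINUTES, SUCCESSES, reliability=0.8)
print(f"50% horizon: {horizon_50:.1f} min, 80% horizon: {horizon_80:.1f} min")
```

The made-up outcomes are symmetric around 32 minutes, so the fitted 50% horizon lands at about 32 minutes. Asking for 80% reliability gives a noticeably shorter horizon, which is one reason a single headline number can overstate dependable autonomy.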

The sharp decline on longer tasks resembles what software engineers sometimes call “error accumulation”: small mistakes that would be harmless individually become fatal when compounded across dozens or hundreds of steps.
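
A simple way to see why error accumulation bites is to assume, purely for illustration, that every step succeeds independently with the same probability. Real agent steps are neither independent nor equally hard, so the sketch below is a toy model with hypothetical function names, not a description of any benchmark.

```python
import math

def chain_success_probability(per_step_success: float, steps: int) -> float:
    """Chance that every one of `steps` independent steps succeeds."""
    return per_step_success ** steps

def steps_until_reliability(per_step_success: float, reliability: float = 0.5) -> int:
    """Longest chain whose overall success chance is still at least `reliability`."""
    return math.floor(math.log(reliability) / math.log(per_step_success))

# An agent that is right 99% of the time on each step (an assumed figure).
print(chain_success_probability(0.99, 10))   # short task
print(chain_success_probability(0.99, 100))  # long task
print(steps_until_reliability(0.99))         # chain length at the 50% mark
```

Under these assumptions a 99%-reliable step gives roughly a 90% chance of finishing a 10-step task but only about a 37% chance of finishing a 100-step task, and overall success drops below 50% after 68 steps.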

The Most Common Failure Modes

The limitations revealed by long-horizon benchmarks are remarkably consistent across domains.

Losing Context and State

Many agents struggle to maintain an accurate internal picture of what has already been accomplished.

LongDS-Bench found substantial declines in performance during later stages of extended analytical workflows. Researchers reported large drops between early and late turns, with long-horizon errors accounting for a majority of failures. The primary problem was not a lack of intelligence about individual steps but difficulty maintaining a coherent evolving state over time. [arXiv]
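
To make the early-versus-late comparison concrete, the sketch below splits per-turn results at a chosen turn and reports accuracy on each side. The record format, the boundary, and the sample data are hypothetical and are not taken from LongDS-Bench.

```python
def accuracy_by_stage(records, boundary_turn):
    """Split (turn_index, correct) records into early and late turns and score each."""
    totals = {"early": [0, 0], "late": [0, 0]}  # [correct, attempted]
    for turn, correct in records:
        stage = "early" if turn < boundary_turn else "late"
        totals[stage][0] += int(correct)
        totals[stage][1] += 1
    # None marks a stage with no attempts, to avoid dividing by zero.
    return {stage: (c / n if n else None) for stage, (c, n) in totals.items()}

# Made-up results from one long analytical session.
RECORDS = [(1, True), (2, True), (3, True), (4, False),
           (9, True), (10, False), (11, False), (12, False)]

stage_accuracy = accuracy_by_stage(RECORDS, boundary_turn=8)
print(stage_accuracy)
```

A gap between the two figures that persists when individual steps are equally hard in isolation would point to state tracking, rather than step-level skill, as the bottleneck.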

This matters because many hypothetical autonomous-risk scenarios assume agents can accurately track plans across extended periods. Current evidence suggests this remains difficult.

Poor Adaptation to Surprises

Long tasks rarely unfold exactly as expected.

Dynamic benchmarks such as Gaia2 were developed partly because static evaluations allowed agents to follow pre-planned paths. In changing environments, agents must recognise unexpected events, revise plans, resolve ambiguities, and continue operating despite incomplete information. Current systems often struggle with these requirements. [arXiv]

The International AI Safety Report similarly highlights difficulty handling unexpected obstacles as a major limitation of contemporary agents. [International AI Safety Report]

Planning Failures

Many failures occur surprisingly early.

LongCLI-Bench found that state-of-the-art coding agents achieved low success rates on realistic long-horizon programming tasks, with many attempts stalling before substantial progress had been made. Human guidance often produced larger improvements than autonomous self-correction mechanisms. [arXiv]

This suggests that current systems frequently fail to maintain a workable plan, rather than merely making occasional execution mistakes.

Strategy Drift

When tasks become very long, agents often begin pursuing objectives that differ subtly from the original goal.

Benchmarks such as RetailBench, which evaluates long-term decision-making in changing environments, found that performance deteriorates substantially as complexity rises. Maintaining a coherent strategy over time remains difficult even for leading systems. [arXiv]

For AI-safety researchers, this phenomenon is interesting because it resembles, in miniature form, concerns about specification errors and goal drift that appear in broader alignment discussions.

Why Time-Horizon Metrics Matter to Doom Arguments

Many AI doom scenarios depend on sustained autonomous competence.

An AI that can complete a ten-minute task but fails after an hour is very different from an AI that can reliably manage multi-week projects, coordinate resources, recover from setbacks, and continue pursuing objectives despite interruptions.

Current benchmark results therefore cut both ways in existential-risk debates.

On one hand, they provide evidence against the strongest versions of claims that present-day agents are already capable of dangerous independent operation at large scales. The consistent finding across evaluations is that reliability remains poor on genuinely long-horizon tasks. Today’s systems are not displaying the kind of robust autonomous competence that many takeover scenarios would require. [International AI Safety Report]

On the other hand, the same benchmarks provide evidence for a different concern: capabilities appear to be improving in precisely the direction that doom-focused researchers watch most closely. Agents are increasingly able to complete longer workflows, use more tools, and operate with less supervision than their predecessors. The International AI Safety Report notes that measured horizons have been lengthening rapidly, while METR’s analyses point to substantial progress in task-completion capability over time. [International AI Safety Report] [METR]
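
To show what a lengthening horizon implies if the trend simply continued, the sketch below extrapolates a horizon under a constant doubling time. The function names and the starting and target values are hypothetical. The seven-month default echoes the doubling time reported in METR's 2025 analysis, but it is used here only as an illustrative parameter, and nothing guarantees the trend holds.

```python
import math

def projected_horizon(current_hours, months_ahead, doubling_months=7.0):
    """Horizon after `months_ahead` months, assuming steady exponential growth."""
    return current_hours * 2.0 ** (months_ahead / doubling_months)

def months_until(target_hours, current_hours, doubling_months=7.0):
    """Months needed to grow from `current_hours` to `target_hours` at the same rate."""
    return doubling_months * math.log2(target_hours / current_hours)

# Assumed starting point: a 1-hour horizon today; target: a 40-hour working week.
print(projected_horizon(1.0, months_ahead=14))
print(months_until(40.0, current_hours=1.0))
```

With these assumed inputs the horizon quadruples in 14 months and reaches a 40-hour task after a little over three years. Changing the doubling time shifts that date considerably, which is why the extrapolation should be read as a scenario rather than a forecast.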

For people concerned about p(doom)—the probability that advanced AI could eventually cause existential catastrophe—the key question is not whether current systems can autonomously run civilisation. They plainly cannot. The question is whether the trend toward longer and more reliable autonomous operation continues faster than methods for monitoring, controlling, and aligning those systems.

What We Can Infer—and What We Cannot

The strongest conclusion supported by current evidence is relatively modest: today’s AI agents remain unreliable on long-horizon tasks, but their capabilities are improving fast enough that researchers increasingly measure autonomy in hours rather than minutes. [International AI Safety Report] [METR]

Several important uncertainties remain.

First, benchmark performance may not perfectly predict real-world behaviour. Some benchmarks are small, specialised, or focused on software tasks. Researchers themselves caution against treating any single metric as a definitive measure of autonomy. [METR]

Second, long-horizon competence is not a single capability. Planning, memory, tool use, adaptation, self-correction, and strategic reasoning can improve at different rates. An agent may excel in one area while failing badly in another. [OpenReview]

Third, the evidence currently points to a gap between impressive demonstrations and dependable operation. Agents can sometimes complete surprisingly complex projects, yet still fail frequently enough that unsupervised deployment remains risky in many contexts. [International AI Safety Report] [DFKI]

From the perspective of AI-doom debates, current long-horizon benchmarks therefore function less as proof of imminent loss of control and more as an early-warning indicator. They show that autonomous capability is real, measurable, and increasing, while also showing that present systems remain far from the robust, persistent autonomy assumed by the most severe existential-risk scenarios. [International AI Safety Report] [METR]

Endnotes

  1. Source: techuk.org
    Link: https://www.techuk.org/resource/the-release-of-the-international-ai-safety-report-2026-navigating-rapid-ai-advancement-and-emerging-risks.html
    Source snippet

    The release of the international AI safety report 20263 Feb 2026 — The report notes that AI agents can now autonomously complete software...

  2. Source: arxiv.org
    Link: https://arxiv.org/abs/2311.12983
    Source snippet

    arXiv[2311.12983] GAIA: a benchmark for General AI Assistantsby G Mialon · 2023 · Cited by 649 — GAIA proposes real-world questions that...

  3. Source: arxiv.org
    Link: https://arxiv.org/pdf/2602.11964
    Source snippet

    arXivgaia2: benchmarking llm agents on dynamicby R Froger · 2026 — We introduce Gaia2, a benchmark designed to address these limitations...

  4. Source: metr.org
    Title: 2024 11 22 evaluating r d capabilities of llms
    Link: https://metr.org/blog/2024-11-22-evaluating-r-d-capabilities-of-llms/
    Source snippet

    METREvaluating frontier AI R&D capabilities of language model...22 Nov 2024 — We're releasing RE-Bench, a new benchmark for measuring th...

  5. Source: metr.org
    Link: https://metr.org/AI_R_D_Evaluation_Report.pdf
    Source snippet

    METRRE-Bench: Evaluating frontier AI R&D capabilities of...Nov 1, 2024 — Here, “time-horizon” refers to the length of time humans spend...

  6. Source: openreview.net
    Title: Open Review Probing the Limits of Endurance in Long-Horizon Tasksby W Zheng —
    Link: https://openreview.net/forum?id=dAn82lpLx4
    Source snippet

    OpenReviewProbing the Limits of Endurance in Long-Horizon Tasksby W Zheng — Summary: The paper introduces TaskWeaver, a framework for pro...

  7. Source: openreview.net
    Link: https://openreview.net/pdf?id=dAn82lpLx4
    Source snippet

    PROBING THE LIMITS OF ENDURANCE IN LONG-...by W Zheng — Right: An overview of our proposed benchmark framework, which is designed to be...

  8. Source: arxiv.org
    Link: https://arxiv.org/abs/2602.14337
    Source snippet

    arXivLongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line InterfacesFebruary 15, 2026...

    Published: February 15, 2026

  9. Source: arxiv.org
    Link: https://arxiv.org/abs/2605.30434

  10. Source: arxiv.org
    Title: arXiv Measuring AI Ability to Complete Long Software Tasks
    Link: https://arxiv.org/abs/2503.14499
    Source snippet

    This is the time humans typically take to complete tasks that AI models can complete with 50%...Read more...

  11. Source: metr.org
    Link: https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/
    Source snippet

    METRMeasuring AI Ability to Complete Long TasksMar 19, 2025 — We propose measuring AI performance in terms of the length of tasks AI agen...

  12. Source: arxiv.org
    Link: https://arxiv.org/abs/2603.16453
    Source snippet

    arXivRetailBench: Evaluating Long-Horizon Autonomous Decision-Making and Strategy Stability of LLM Agents in Realistic Retail Environment...

  13. Source: dfki.de
    Title: 2026 international ai safety report published
    Link: https://www.dfki.de/en/web/news/2026-international-ai-safety-report-published
    Source snippet

    3 Feb 2026 — Despite the progress made and the greater availability of agents on the market, however, the systems still fail when it come...

  14. Source: metr.org
    Title: 2026 01 22 time horizon limitations
    Link: https://metr.org/notes/2026-01-22-time-horizon-limitations/
    Source snippet

    METRClarifying limitations of time horizonJan 22, 2026 — We propose measuring AI performance in terms of the length of tasks AI agents ca...

  15. Source: arxiv.org
    Link: https://arxiv.org/abs/2604.11978

  16. Source: arxiv.org
    Link: https://arxiv.org/abs/2602.21012
    Source snippet

    [2602.21012] International AI Safety Report 2026by Y Bengio · 2026 · Cited by 56 — The International AI Safety Report 2026 synthesises th...

  17. Source: arxiv.org
    Link: https://arxiv.org/html/2503.14499v1
    Source snippet

    This is the time humans typically take to complete tasks that AI models can complete with 50%...Read more...

  18. Source: arxiv.org
    Link: https://arxiv.org/pdf/2602.21012
    Source snippet

    The Report does not necessarily represent the.Read more...

  19. Source: metr.org
    Title: time horizons
    Link: https://metr.org/time-horizons/
    Source snippet

    Task-Completion Time Horizons of Frontier AI ModelsMar 3, 2026 — The 50%-time horizon is the length of task (measured by how long it take...

  20. Source: metr.org
    Link: https://metr.org/
    Source snippet

    METRWe propose measuring AI performance in terms of the length of software tasks AI agents can complete. We show an exponential increase...

  21. Source: openreview.net
    Link: https://openreview.net/forum?id=3rB0bVU6z6&noteId=lOCHc0u2a6
    Source snippet

    RE-Bench: Evaluating Frontier AI R&D Capabilities of...May 1, 2025 — Summary: This paper contributes a new LLM (Agent) benchmark RE-Benc...

    Published: May 1, 2025

  22. Source: internationalaisafetyreport.org
    Title: international ai safety report 2026
    Link: https://internationalaisafetyreport.org/publication/international-ai-safety-report-2026
    Source snippet

    Loss of control... Current agents reliably fail on longer tasks, lose track of...Read more...

  23. Source: hoganlovells.com
    Title: international ai safety report 2026 uk litigation lessons from imperfect ai
    Link: https://www.hoganlovells.com/en/publications/international-ai-safety-report-2026-uk-litigation-lessons-from-imperfect-ai
    Source snippet

    AI Safety Report 2026 – UK litigation lessons...16 Apr 2026 — The International AI Safety Report 2026 is an assessment of general-purpos...

  24. Source: internationalaisafetyreport.org
    Title: international ai safety report 2026
    Link: https://internationalaisafetyreport.org/sites/default/files/2026-02/international-ai-safety-report-2026.pdf
    Source snippet

    The Report does not necessarily represent...Read more...

  25. Source: internationalaisafetyreport.org
    Title: The duration of some software engineering tasks
    Link: https://internationalaisafetyreport.org/publication/2026-report-extended-summary-policymakers
    Source snippet

    2026 Report: Extended Summary for Policymakers3 Feb 2026 — If current trends continue, AI systems could operate autonomously on multi-day...

  26. Source: internationalaisafetyreport.org
    Link: https://internationalaisafetyreport.org/publication/2026-report-executive-summary
    Source snippet

    2026 Report: Executive Summary3 Feb 2026 — The Executive Summary offers a concise three-page overview of the 2026 Report's core findings...

  27. Source: yoshuabengio.org
    Title: international ai safety report 2026
    Link: https://yoshuabengio.org/en/publication/international-ai-safety-report-2026
    Source snippet

    6 Feb 2026 — The International AI Safety Report 2026 synthesises the current scientific evidence on the capabilities, emerging risks, and...

  28. Source: insideprivacy.com
    Link: [https://www.insideprivacy.com/artificial
    Source snippet

    International AI Safety Report 2026 Examines AI...12 Feb 2026 — Specifically, the Report finds that models are less reliable when projec...

  29. Source: hal.science
    Link: https://hal.science/hal-05223593v1/file/2501.17805v1.pdf
    Source snippet

    International AI Safety Reportby Y Bengio · 2025 · Cited by 171 — general-purpose AI agents deployed to accomplish long-horizon tasks can...

  30. Source: globalpolicywatch.com
    Link: https://www.globalpolicywatch.com/2026/02/international-ai-safety-report-2026-examines-ai-capabilities-risks-and-safeguards/
    Source snippet

    International AI Safety Report 2026 Examines AI...13 Feb 2026 — According to the Report, current AI systems may exhibit unpredictable fa...
