Why bigger AI models still resist human understanding

Introduction

One of the central claims in debates about AI doom and existential risk is that the most powerful AI systems remain “black boxes”: they can perform increasingly impressive tasks, yet neither developers nor outside researchers can fully explain how their internal computations produce those capabilities. If interpretability does not keep pace with capability growth, future systems could become more powerful while remaining only partially understood.

Importantly, the evidence for this concern is not merely theoretical. Government assessments, frontier-lab research, and technical studies all point to a similar pattern: scaling models has produced large gains in capability, but not matching gains in human understanding of how those systems work internally. The result is a growing gap between what frontier models can do and what researchers can reliably explain. That gap is one reason interpretability features prominently in arguments about loss of control, misalignment, and long-term AI risk. [GOV.UK, Frontier AI: capabilities and risks – discussion paper]

What frontier AI safety reports say about opacity

Several major assessments of advanced AI systems explicitly identify limited interpretability as an unresolved problem rather than a solved engineering challenge.

The UK government’s discussion paper on frontier AI noted that there is substantial uncertainty about how advanced systems develop capabilities and risks, and highlighted the difficulty of understanding and predicting behaviour in increasingly capable models. The report treats this lack of understanding as a significant obstacle to risk assessment and governance. [GOV.UK, report pioneered by AI Security Institute gives…, Dec 18, 2025]

The International AI Safety Report similarly describes current general-purpose AI systems as difficult to understand in mechanistic terms, despite extensive progress in evaluating their external behaviour. Researchers can often measure what a model does, but understanding why it does it remains much harder. [International AI Safety Report, International AI Safety Report 2026, 3 Feb 2026]

This distinction matters. Behavioural evaluations can reveal whether a model succeeds or fails on a task. They do not necessarily reveal the internal reasoning, representations, or strategies that produced the result. A system may appear aligned and cooperative under testing while relying on internal processes that researchers have not identified.
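The gap between behaviour and mechanism can be illustrated with a deliberately tiny sketch. Everything in it is hypothetical and invented for illustration: two toy "models" of an even-number check, one applying a general rule and one that has merely memorised the evaluation inputs. Both earn the same score on the test set, so the evaluation alone cannot tell them apart.

```python
from typing import Callable, Iterable


def general_parity(n: int) -> bool:
    """Hypothetical 'model' that decides evenness with a general rule."""
    return n % 2 == 0


# Hypothetical 'model' that has only memorised answers for the test inputs.
_MEMORISED = {0: True, 1: False, 2: True, 3: False, 4: True}


def memorised_parity(n: int) -> bool:
    """Looks up a stored answer; falls back to False for anything unseen."""
    return _MEMORISED.get(n, False)


def behavioural_eval(model: Callable[[int], bool], inputs: Iterable[int]) -> float:
    """Fraction of inputs on which the model's answer is correct."""
    inputs = list(inputs)
    correct = sum(model(n) == (n % 2 == 0) for n in inputs)
    return correct / len(inputs)


test_inputs = range(5)        # the inputs the evaluation happens to cover
unseen_inputs = range(5, 15)  # inputs outside the evaluation

# Identical scores on the evaluation set...
scores_on_test = (
    behavioural_eval(general_parity, test_inputs),
    behavioural_eval(memorised_parity, test_inputs),
)
# ...but different scores once inputs move beyond it.
scores_on_unseen = (
    behavioural_eval(general_parity, unseen_inputs),
    behavioural_eval(memorised_parity, unseen_inputs),
)

print(scores_on_test, scores_on_unseen)
```

Real frontier models are vastly more complex than this toy, but the structural point carries over: a passing score describes outputs on the tested inputs, not the computation that produced them.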

The UK AI Security Institute’s work also reflects this reality. Considerable effort has gone into developing evaluation frameworks for frontier models, yet evaluation itself exists partly because internal transparency remains limited. Researchers frequently rely on testing behaviour because direct understanding of the underlying computation is incomplete. [Inspect, inspect.aisi.org.uk]

Experiments showing scale does not automatically improve interpretability

A common assumption might be that larger models become easier to understand because researchers have more tools and more data. The evidence so far does not support that conclusion.

OpenAI’s GPT-4 technical report described a recurring challenge in frontier AI development: some capabilities emerge in ways that are difficult to predict from smaller systems. The report warned that developers should expect unexpected capabilities and complex interactions as models scale. In other words, even the organisations building these systems cannot always foresee how behaviour will change with increased scale. [cdn.openai.com, GPT-4 Technical Report, 27 Mar 2023]

This unpredictability is itself evidence of limited interpretability. If researchers fully understood the internal mechanisms that generate capabilities, sudden or surprising behavioural jumps would be less common.

Anthropic’s interpretability research provides another revealing example. In 2024 the company announced what it described as the first detailed look inside a production-scale language model, identifying millions of internal features associated with concepts and behaviours. The achievement was widely viewed as a major advance. Yet the announcement itself emphasised that modern language models are generally treated as black boxes and that understanding remains incomplete. The significance of the breakthrough came precisely because such visibility had previously been unavailable. [Anthropic, Mapping the Mind of a Large Language Model, May 21, 2024]

More recent Anthropic research tracing model reasoning reached a similar conclusion. Researchers demonstrated that some internal reasoning pathways can be reconstructed and studied, but the work was presented as an early step rather than a comprehensive solution. The need for specialised methods to uncover hidden reasoning illustrates how much of the underlying computation remains opaque. [Anthropic, Tracing the thoughts of a large language model, Mar 27, 2025]

A striking theme across these projects is that interpretability advances often reveal additional complexity rather than eliminating it. Researchers gain visibility into certain circuits or representations, only to discover many more interacting components beneath them. [Anthropic, Open-sourcing circuit-tracing tools, May 29, 2025]

When researchers look inside, they often find unexpected behaviour

One reason black-box concerns matter for AI-risk debates is that interpretability work has occasionally uncovered internal processes that were not obvious from external behaviour alone.

Anthropic’s investigations into model reasoning found cases where models appeared to use internal strategies that were more complicated than their outward explanations suggested. Researchers studying reasoning traces have explored differences between what a model says it is doing and what internal evidence suggests it is actually doing. [Anthropic, Interpretability Research]

Other interpretability work has identified situations where internal planning, hidden intermediate steps, or strategic behaviour were not fully reflected in the model’s final answer. Researchers have reported evidence that models can sometimes represent information internally without explicitly revealing it. [Anthropic, Research]

This does not prove that current systems are deceptive in any broad sense. However, it does demonstrate a key point in the interpretability debate: observing outputs is not always enough to understand internal cognition. If hidden internal representations already exist in today’s systems, some researchers worry that future, more capable models could develop increasingly sophisticated internal processes that remain difficult to inspect. [Anthropic, Mapping the Mind of a Large Language Model, May 21, 2024]

For AI-doom arguments, this observation is often more important than any individual experiment. The concern is not merely that a model can make mistakes. It is that humans may lack reliable visibility into the reasoning processes that generate important decisions.

Why computational limits slow real-world transparency

Even if interpretability techniques continue improving, there are practical reasons why transparency may lag behind capability growth.

Frontier models contain billions or even trillions of learned parameters distributed across enormous computational structures. Understanding a single behaviour may require analysing interactions across many layers and components rather than inspecting one easily identifiable module.

Anthropic’s feature-mapping work illustrates the scale of the challenge. Researchers identified millions of interpretable features within one model, representing only a partial step towards understanding the full system. The fact that meaningful progress required discovering millions of internal concepts highlights how much information must be analysed before comprehensive understanding becomes possible. [Anthropic, Tracing the thoughts of a large language model, Mar 27, 2025]
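A rough piece of arithmetic shows why feature counts in the millions make exhaustive analysis hard. The sketch below assumes, purely for illustration, that an analyst wanted to check every pair of features for a possible interaction; the feature counts are hypothetical round numbers, not figures from the cited research, and real interpretability methods do not work by brute-force pair checking.

```python
import math


def pairwise_interactions(num_features: int) -> int:
    """Number of distinct unordered feature pairs: n * (n - 1) / 2."""
    return math.comb(num_features, 2)


# Hypothetical round-number feature counts, chosen only to show the growth rate.
for n in (1_000, 1_000_000, 10_000_000):
    print(f"{n:>12,} features -> {pairwise_interactions(n):>22,} pairs")
```

The number of pairs grows with the square of the feature count, so a tenfold increase in features means roughly a hundredfold increase in candidate interactions, before even considering interactions among three or more features.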

There is also a mismatch between capability scaling and interpretability scaling. Training runs receive vast computational resources because improved performance produces commercial and strategic benefits. Interpretability research, by contrast, often proceeds through labour-intensive investigation of already-trained systems. Capability growth can therefore outpace understanding even when interpretability research is successful.

Government and safety assessments increasingly acknowledge this imbalance. Frontier models are advancing rapidly across domains such as coding, scientific reasoning, autonomy-related tasks, and specialised expertise, while understanding of internal mechanisms remains comparatively limited. [AI Security Institute, Frontier AI Trends Report by The AI Security Institute (AISI)]

How strong is the evidence?

The evidence that larger AI models remain black boxes is substantial, but it should not be overstated.

The strongest points are:

Frontier developers routinely acknowledge limited understanding of internal mechanisms.
Major government and international safety assessments treat interpretability as an unresolved research challenge.
Large-scale models continue to display behaviours that are difficult to predict in advance.
Significant interpretability breakthroughs are still reported as partial progress rather than comprehensive solutions.
Researchers frequently rely on behavioural testing because direct mechanistic understanding remains incomplete. [Anthropic, Open-sourcing circuit-tracing tools, May 29, 2025] [cdn.openai.com, GPT-4 System Card, 10 Mar 2023] [GOV.UK, AI Security Institute – Frontier AI Trends report factsheet, Dec 18, 2025]

At the same time, there is evidence of progress. Mechanistic interpretability has advanced substantially since the early generations of neural networks. Researchers can now identify some internal concepts, trace certain reasoning pathways, and intervene in specific model behaviours. The field is not standing still. [Anthropic, Interpretability Research] [Anthropic, Research]

The key dispute is therefore not whether interpretability research works at all. It is whether it can scale fast enough to keep up with frontier AI capabilities.

For those worried about AI doom or high p(doom) estimates, the concern is that capability growth may continue to outrun understanding. If future systems become far more autonomous, strategic, or influential while remaining only partially understood, humanity could find itself relying on systems whose internal objectives and reasoning processes cannot be reliably inspected. Critics argue that this outcome is far from inevitable and that interpretability methods may improve dramatically. The current evidence, however, suggests that larger models have not yet become proportionally more transparent as they have become more capable. [Anthropic] [GOV.UK] [International AI Safety Report, International AI Safety Report 2026, 3 Feb 2026]

Endnotes

Source: GOV.UK
Title: Frontier AI: capabilities and risks – discussion paper
Link: https://www.gov.uk/government/publications/frontier-ai-capabilities-and-risks-discussion-paper/frontier-ai-capabilities-and-risks-discussion-paper
Source snippet
It describes the current state and key trends relating to frontier AI capabilities, and then explores how frontier AI capabilities...
Source: aisi.gov.uk
Title: Frontier AI Trends Report by The AI Security Institute (AISI)
Link: https://www.aisi.gov.uk/frontier-ai-trends-report
Source snippet
Autonomy skills: Models can now complete hour-long soft...
Source: cdn.openai.com
Title: GPT-4 Technical Report
Link: https://cdn.openai.com/papers/gpt-4.pdf
Source snippet
(27 Mar 2023) Certain capabilities remain hard to predict... should be prepared for emergent capabilities and comp...
Source: anthropic.com
Title: Mapping the Mind of a Large Language Model
Link: https://www.anthropic.com/research/mapping-mind-language-model
Source snippet
(May 21, 2024) This interpretability discovery could, in future, help us make AI model...
Published: May 21, 2024
Source: time.com
Title: No One Truly Knows How AI Systems Work. A New Discovery Could Change That
Link: https://time.com/6980210/anthropic-interpretability-ai-safety-research/
Source snippet
AI systems, particularly neural networks, are often seen as "black boxes" due to their complexity and op...
Source: anthropic.com
Title: Tracing the thoughts of a large language model
Link: https://www.anthropic.com/research/tracing-thoughts-language-model
Source snippet
(Mar 27, 2025) We explored a way that interpretability can help tell apart "faithf...
Source: aisi.gov.uk
Title: AISI Frontier AI Trends Report (2025)
Link: https://www.aisi.gov.uk/research/aisi-frontier-ai-trends-report-2025
Source snippet
(Dec 18, 2025) This report presents our first public analysis of the trends we've observed. It seeks...
Source: aisi.gov.uk
Title: 5 key findings from our first Frontier AI Trends Report
Link: https://www.aisi.gov.uk/blog/5-key-findings-from-our-first-frontier-ai-trends-report
Source snippet
(Dec 18, 2025) The report contains a selection of aggregated testing results to illustrate high-level trends in AI progress across domain...
Source: OpenAI
Title: OpenAI | Research & Deployment
Link: https://openai.com/
Source snippet
We believe our research will eventually lead to artificial general intelligence, a system that can solve...
Source: OpenAI
Title: Introducing OpenAI o3 and o4-mini
Link: https://openai.com/index/introducing-o3-and-o4-mini/
Source snippet
(Apr 16, 2025) OpenAI o4-mini is a smaller model optimized for fast, cost-efficient reasoning—it achi...
Source: cdn.openai.com
Title: GPT-4 System Card
Link: https://cdn.openai.com/papers/gpt-4-system-card.pdf
Source snippet
(10 Mar 2023) Ensure that safety assessments cover emergent risks: As models get more capable, we should be pr...
Source: OpenAI
Title: GPT-4
Link: https://openai.com/index/gpt-4-research/
Source snippet
(Mar 14, 2023) GPT-4 is a large multimodal model (accepting image and text inputs, emitting text outputs) that, while less capabl...
Source: OpenAI
Title: Generative models
Link: https://openai.com/index/generative-models/
Source snippet
(Jun 16, 2016) This post describes four projects that share a common theme of enhancing or using generative models, a...
Source: GOV.UK
Title: report pioneered by AI Security Institute gives...
Link: https://www.gov.uk/government/news/inaugural-report-pioneered-by-ai-security-institute-gives-clearest-picture-yet-of-capabilities-of-most-advanced-ai
Source snippet
(Dec 18, 2025) The AI Security Institute's Frontier AI Trends Report, a public assessm...
Source: GOV.UK
Title: AI Security Institute – Frontier AI Trends report factsheet
Link: https://www.gov.uk/government/publications/ai-security-institute-frontier-ai-trends-report-factsheet
Source snippet
(Dec 18, 2025) The UK AI Security Institute (AISI) has conducted evaluations of...
Source: assets.publishing.service.gov.uk
Link: https://assets.publishing.service.gov.uk/media/65395abae6c968000daa9b25/frontier-ai-capabilities-risks-report.pdf
Source snippet
This report explains why. It describes the current state and key trends relating to frontier AI...
Source: anthropic.com
Title: Open-sourcing circuit-tracing tools
Link: https://www.anthropic.com/research/open-source-circuit-tracing
Source snippet
(May 29, 2025) In our recent interpretability research, we introduced a new method to trace the thoughts of a large language model.
Published: May 29, 2025
Source: anthropic.com
Title: Interpretability Research
Link: https://www.anthropic.com/research/team/interpretability
Source snippet
The mission of the Interpretability team is to discover and understand how large language models work internally...
Source: anthropic.com
Title: Research
Link: https://www.anthropic.com/research
Source snippet
The mission of the Interpretability team is to discover and understand how large language models work internally, as a foundation...
Source: far.ai
Title: 2025 Q1: AI Safety: From Research to Global Action
Link: https://far.ai/about/newsletters/2025-q1-ai-safety
Source snippet
Our position paper on AI safety evaluation reveals a critical gap in how frontier model...
Source: internationalaisafetyreport.org
Title: International AI Safety Report 2026
Link: https://internationalaisafetyreport.org/publication/international-ai-safety-report-2026
Source snippet
(3 Feb 2026) This Report assesses what general-purpose AI systems can do...
Source: inspect.aisi.org.uk
Link: https://inspect.aisi.org.uk/
Source: Wikipedia
Title: Anthropic
Link: https://en.wikipedia.org/wiki/Anthropic
Source snippet
Anthropic is an American artificial intelligence (AI) company headquartered in San Francisco. It has developed a range of lar...
Source: reddit.com
Title: Anthropic: Mapping the Mind of a Large Language Model
Link: https://www.reddit.com/r/slatestarcodex/comments/1cyicgw/anthropic_mapping_the_mind_of_a_large_language/
Source snippet
This is the first ever detailed look inside a modern, production-grade large languag...
Source: reddit.com
Title: Anthropic: Tracing the thoughts of an LLM: r/slatestarcodex
Link: https://www.reddit.com/r/slatestarcodex/comments/1jlfyhq/anthropic_tracing_the_thoughts_of_an_llm/
Source snippet
Claude sometimes thinks in a conceptual space that is shared between language...
Source: reddit.com
Link: https://www.reddit.com/r/singularity/comments/1jlb6la/anthropic_tracing_the_thoughts_of_a_large/
Source snippet
understand and build internal world models to explain the outer world.
Source: blog.biocomm.ai
Title: Anthropic Research: Tracing the thoughts of a large language model
Link: https://blog.biocomm.ai/2025/03/28/anthropic-research-tracing-the-thoughts-of-a-large-language-model/
Source snippet
(Mar 29, 2025) Anthropic's researchers have taken significant steps towards understanding th...
