Why mapping AI internals gets harder with scale

Introduction

Mechanistic interpretability is the branch of AI research that tries to reverse-engineer neural networks and identify the internal mechanisms that produce specific behaviours. Rather than asking an AI system why it generated an answer, researchers attempt to trace the actual computations inside the model: which features were detected, which internal circuits were activated, and how information flowed through the network.

For people concerned about AI doom, alignment failures, or loss of control over advanced systems, mechanistic interpretability is attractive because it promises something stronger than behavioural testing. In principle, if researchers could understand a frontier model’s internal reasoning, they might detect dangerous goals, deceptive strategies, or other warning signs before those behaviours appear openly.

The difficulty is that the approach becomes harder as models become more capable. Some recent work has shown that researchers can identify meaningful internal features and circuits in large language models, but the same research has also highlighted how enormous the scaling challenge remains. The central question is no longer whether mechanistic interpretability can work in small cases. It is whether it can keep pace with frontier models whose internal computations may be vastly more complex than anything humans can inspect directly. [Anthropic, “Mapping the Mind of a Large Language Model”, May 21, 2024]

How distributed representations resist human interpretation

One of the biggest obstacles is that modern neural networks do not usually store concepts in neat, isolated locations.

A common intuition is that a neuron might represent a single idea such as “dog”, “Paris”, or “danger”. In practice, researchers repeatedly find that many neurons respond to mixtures of unrelated concepts. This phenomenon is often called polysemanticity. A single neuron may participate in multiple computations depending on context, making it difficult to assign a simple human-readable meaning to it. [Anthropic, “Towards Monosemanticity: Decomposing Language Models With Dictionary Learning”, 5 Oct 2023]

The deeper problem is that models often use what researchers call superposition. Instead of allocating separate internal resources to separate concepts, a network can compress many features into the same representational space. Anthropic’s work on toy models and later interpretability research argues that neural networks frequently represent more features than they have dimensions available, so the directions used for different concepts overlap and interfere with one another. [transformer-circuits.pub, “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet”, 21 May 2024]
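The overlap described above can be illustrated with a very small sketch. It is not taken from any of the cited papers: the three feature directions, the function names, and the use of a ReLU readout are assumptions chosen only to show three features sharing a two-dimensional space.

```python
import math

# Hypothetical setup: three features share a 2-dimensional hidden space.
# Each feature gets a unit direction, spaced 120 degrees apart.
DIRECTIONS = [
    (math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
    for k in range(3)
]

def encode(features):
    """Sum each feature's direction, scaled by that feature's activation."""
    x = sum(a * d[0] for a, d in zip(features, DIRECTIONS))
    y = sum(a * d[1] for a, d in zip(features, DIRECTIONS))
    return (x, y)

def readout(hidden):
    """Project the hidden vector onto every feature direction, then apply ReLU."""
    return [max(0.0, hidden[0] * d[0] + hidden[1] * d[1]) for d in DIRECTIONS]

# Sparse input: only feature 0 is active, so it is recovered cleanly.
sparse = readout(encode([1.0, 0.0, 0.0]))

# Dense input: features 0 and 1 are active together and interfere,
# so each is read back at roughly half its true strength.
dense = readout(encode([1.0, 1.0, 0.0]))

print("sparse readout:", [round(v, 3) for v in sparse])
print("dense readout: ", [round(v, 3) for v in dense])
```

When only one feature is active, the shared space causes no visible error. When two are active at once, neither can be read back accurately, which is the interference the paragraph above refers to.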

This creates a scaling problem for interpretability [OpenAI, “Language models can explain neurons in language models”, 9 May 2023]:

The model may contain millions of meaningful features.
Many features are represented across combinations of neurons rather than individual neurons.
Important computations may only emerge from interactions among many components.
Human-understandable concepts may not correspond neatly to the model’s internal structure.

Researchers have therefore increasingly shifted from studying individual neurons to studying higher-level features extracted using tools such as sparse autoencoders. This has produced important progress, but it also reveals the sheer number of features involved. Anthropic’s work on Claude 3 Sonnet reported evidence for millions of internal features, illustrating both the promise and the scale of the challenge. [Anthropic, “Superposition, Memorization, and Double Descent”, 5 Jan 2023] [transformer-circuits.pub, “Toy Models of Superposition”, Sep 14, 2022]
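For readers unfamiliar with sparse autoencoders, the following is a minimal sketch of the forward pass and training objective, assuming the common formulation of a ReLU encoder, a linear decoder, and a reconstruction error plus an L1 sparsity penalty. The weights are set by hand rather than trained, and every name here (`sae_forward`, `W_ENC`, `l1_coeff`, and so on) is hypothetical rather than drawn from any published codebase.

```python
def relu(v):
    return [max(0.0, t) for t in v]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def sae_forward(activation, w_enc, b_enc, w_dec, b_dec, l1_coeff):
    """Return (features, reconstruction, loss) for one activation vector."""
    # Encoder: map the model activation into a wider, sparse feature space.
    features = relu([z + b for z, b in zip(matvec(w_enc, activation), b_enc)])
    # Decoder: rebuild the original activation from the active features.
    recon = [z + b for z, b in zip(matvec(w_dec, features), b_dec)]
    # Objective: reconstruct well while keeping few features active.
    mse = sum((a - r) ** 2 for a, r in zip(activation, recon)) / len(activation)
    sparsity = sum(abs(f) for f in features)
    return features, recon, mse + l1_coeff * sparsity

# Hand-picked illustrative weights: 2 model dimensions, 3 dictionary features.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
B_ENC = [0.0, 0.0, 0.0]
W_DEC = [[1.0, 0.0, -1.0], [0.0, 1.0, -1.0]]
B_DEC = [0.0, 0.0]

features, recon, loss = sae_forward([1.0, 0.0], W_ENC, B_ENC, W_DEC, B_DEC, 0.01)
print("active features:", features)
print("reconstruction: ", recon)
print("loss:           ", loss)
```

The dictionary has more features than the activation has dimensions, yet only one feature fires for this input. That sparsity is what makes the extracted features candidates for human interpretation.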

Why finding one circuit does not reveal the whole mechanism

Early mechanistic interpretability successes often focused on relatively narrow tasks. Researchers identified specific circuits responsible for behaviours such as indirect object identification, token prediction patterns, or simple reasoning steps. These results demonstrated that meaningful internal structure exists and can sometimes be mapped. [transformer-circuits.pub, “Transformer Circuits Thread”]

However, frontier models appear to rely on many overlapping mechanisms rather than single clean pathways.

A useful analogy is biological brains. Finding one neural pathway involved in vision does not mean vision depends entirely on that pathway. Multiple subsystems often contribute simultaneously, providing redundancy and robustness.

Large language models appear to exhibit something similar. A circuit discovered in one setting may not be the only route through which the model can achieve a task. Alternative pathways may exist, and the model may switch strategies depending on context. As model size increases, the number of potential interactions grows dramatically. [arXiv, “A Practical Review of Mechanistic Interpretability for…”, 10 Mar 2025]

This matters for AI safety because researchers are often interested in rare but dangerous behaviours. Suppose a safety team identifies one circuit associated with deceptive reasoning and modifies it. If the model can achieve the same outcome through several other circuits, the intervention may provide only limited assurance.

The challenge is therefore not merely locating a mechanism. It is determining whether that mechanism is the complete explanation for a behaviour or only one component in a much larger network of computations.
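A toy sketch can make the point about incomplete explanations concrete. The model below is entirely hypothetical: it has two pathways that each carry the same signal, and the names `toy_model` and `ablation_effect` are invented for illustration. It shows how disabling one pathway can leave the output unchanged even though that pathway is genuinely involved in the behaviour.

```python
def toy_model(x, ablate=frozenset()):
    """Hypothetical two-pathway model: either pathway alone can carry the signal."""
    path_a = 0.0 if "a" in ablate else max(0.0, x)
    path_b = 0.0 if "b" in ablate else max(0.0, x)
    # The output only needs one working pathway.
    return max(path_a, path_b)

def ablation_effect(x, component):
    """Drop in output when a single component is disabled."""
    return toy_model(x) - toy_model(x, ablate={component})

baseline = toy_model(2.0)
effect_a = ablation_effect(2.0, "a")
effect_b = ablation_effect(2.0, "b")
both_off = toy_model(2.0, ablate={"a", "b"})

print("baseline output:      ", baseline)
print("effect of ablating a: ", effect_a)
print("effect of ablating b: ", effect_b)
print("output with both off: ", both_off)
```

Each single ablation appears to show that the component does not matter, yet removing both destroys the behaviour. An analysis that tests components one at a time would miss this.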

Backup strategies and redundant circuits in large models

As models become more capable, they often become more robust.

From an engineering perspective, robustness is desirable. If a few neurons fail or some inputs change, the model can still perform well. But robustness can be the enemy of interpretability.

A model with many redundant pathways may continue producing the same behaviour even after researchers disable a circuit they believe is important. The behaviour survives because other components can compensate.

This creates several difficulties:

Causal uncertainty. Researchers may identify a component strongly correlated with a behaviour without proving that it is uniquely responsible for it.

Intervention failure. Removing an apparently important circuit may have surprisingly little effect.

Hidden alternatives. Models may possess backup strategies that only appear under unusual conditions.

Distribution shifts. A model may rely on one circuit during interpretability experiments but switch to another when deployed in a different environment.

These concerns are especially relevant in discussions of deceptive alignment and loss-of-control scenarios. If advanced models develop strategies that can be implemented through many different internal pathways, discovering one pathway may not reveal the full picture of what the system is capable of doing.

Why larger models are not automatically more transparent

A common argument is that larger models may become easier to understand because they often develop more structured internal representations.

There is some evidence supporting this idea. Anthropic’s work on monosemantic features suggests that meaningful, relatively interpretable features can be extracted from large models using sparse autoencoders. Researchers have successfully identified features corresponding to concepts ranging from geographic locations to coding patterns and linguistic structures. [transformer-circuits.pub, “Towards Monosemanticity: Decomposing Language Models With Dictionary Learning”, 4 Oct 2023] [Anthropic, “The engineering challenges of scaling interpretability”, Jun 13, 2024]

Yet these successes do not imply that frontier systems become transparent.

Several scaling pressures push in the opposite direction:

Larger models contain vastly more features. [Anthropic, “Toy Models of Superposition”, 14 Sept 2022]
More features create more potential interactions.
More capabilities create more complex circuits.
New behaviours can emerge that were absent in smaller systems.

The result is a paradox. Larger models may contain cleaner local structures while simultaneously becoming harder to understand globally.
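The claim that more features create more potential interactions can be made concrete with simple counting. The sketch below assumes, purely for illustration, that an "interaction" means any unordered group of two or three features. Real models do not use every such combination, so these figures are upper bounds on what might need checking, not measurements.

```python
from math import comb

def potential_interactions(num_features, order=2):
    """Number of distinct unordered groups of `order` features."""
    return comb(num_features, order)

for n in (1_000, 1_000_000):
    pairs = potential_interactions(n)
    triples = potential_interactions(n, order=3)
    print(f"{n:>9,} features: {pairs:,} pairs, {triples:,} triples")
```

Going from a thousand features to a million multiplies the number of features by a thousand but the number of possible pairs by roughly a million, which is why explaining individual features does not by itself explain how they combine.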

Researchers might successfully explain thousands or even millions of individual features while still lacking a comprehensive understanding of how those features combine to produce high-level behaviour. Knowing the parts is not necessarily the same as understanding the system.

This distinction is crucial in AI doom debates. The question is not whether some internal representations can be interpreted. The question is whether enough of the system can be understood to provide confidence that dangerous objectives, deceptive reasoning, or other catastrophic failure modes are absent.

Automation helps, but may not solve the scaling problem

Recognising that humans cannot manually inspect billions of parameters, researchers have increasingly explored automated interpretability.

OpenAI demonstrated one version of this approach by using GPT-4 to generate explanations for neurons in GPT-2. The broader goal is to create systems that help explain other systems, allowing interpretability research to scale beyond manual investigation. [OpenAI, “Language models can explain neurons in language models”, 9 May 2023] [openaipublic.blob.core.windows.net, “Language models can explain neurons in language models”, 9 May 2023]
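One way to check an automated explanation is to ask how well it predicts the neuron's real activations. The sketch below assumes a correlation-based score between actual activations and activations simulated from an explanation. The activation values, the two candidate explanations, and the function name `correlation_score` are all invented for illustration and are not taken from OpenAI's code or data.

```python
from statistics import mean

def correlation_score(actual, simulated):
    """Pearson correlation between real and explanation-predicted activations."""
    mean_a, mean_s = mean(actual), mean(simulated)
    cov = sum((a - mean_a) * (s - mean_s) for a, s in zip(actual, simulated))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    if var_a == 0 or var_s == 0:
        return 0.0  # A constant signal carries no information to correlate.
    return cov / (var_a * var_s) ** 0.5

# Hypothetical neuron activations over the tokens: the, dog, barked, in, Paris.
actual = [0.0, 0.9, 0.1, 0.0, 0.0]

# Activations a simulator might predict from two candidate explanations.
simulated_animals = [0.0, 1.0, 0.0, 0.0, 0.0]  # "fires on animals"
simulated_places = [0.0, 0.0, 0.0, 0.0, 1.0]   # "fires on place names"

good_score = correlation_score(actual, simulated_animals)
bad_score = correlation_score(actual, simulated_places)

print("animals explanation:", round(good_score, 3))
print("places explanation: ", round(bad_score, 3))
```

A high score shows only that the explanation predicts activations on the tokens tested. It does not show that the explanation captures what the neuron does on inputs outside that sample.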

Automated circuit-discovery methods have also improved substantially. New techniques can identify candidate circuits far faster than earlier approaches, making larger-scale investigations more practical. [arXiv, “A Practical Review of Mechanistic Interpretability for…”, 10 Mar 2025]

However, automation introduces its own questions:

How can researchers verify that automated explanations are correct?
Can an AI reliably explain mechanisms more advanced than itself?
How much of a frontier model must be understood before safety conclusions become justified?
Could automated explanations themselves become misleading?

In effect, automation may help address the labour problem without fully solving the understanding problem.

What this means for AI doom arguments

Mechanistic interpretability occupies a distinctive place in existential-risk debates because it targets a specific concern: humans may lose the ability to understand what increasingly capable systems are doing internally.

Supporters argue that interpretability could eventually provide an “AI MRI” capable of revealing hidden goals, deceptive planning, or dangerous reasoning before catastrophe occurs. Progress on sparse autoencoders, feature discovery, and circuit analysis is often cited as evidence that the field is moving in that direction. [transformer-circuits.pub, “Circuits Updates”] [Anthropic, “Interpretability Research”]

Sceptics do not necessarily deny the value of interpretability. Instead, many question whether it can scale quickly enough. Frontier models already contain enormous numbers of interacting features, and future systems may be substantially more complex still. Even optimistic researchers frequently describe interpretability as being in an early stage relative to the scale of the systems being studied. [transformer-circuits.pub, “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet”, 21 May 2024] [Anthropic, “Toy Models of Superposition”, 14 Sept 2022]

For AI doom discussions, this creates an uncomfortable possibility. If mechanistic interpretability scales more slowly than capabilities, society could face increasingly powerful systems before it possesses reliable tools for understanding their internal decision-making. Whether that gap remains manageable or becomes a serious alignment problem is one of the central unresolved questions in contemporary AI safety research.

Endnotes

Source: anthropic.com
Title: Mapping the Mind of a Large Language Model
Link: https://www.anthropic.com/research/mapping-mind-language-model
Source snippet
AnthropicMapping the Mind of a Large Language ModelMay 21, 2024 — We have identified how millions of concepts are represented inside Clau...

Published: May 21, 2024
Source: transformer-circuits.pub
Title: scaling monosemanticity
Link: https://transformer-circuits.pub/2024/scaling-monosemanticity/
Source snippet
Extracting Interpretable Features from Claude 3 Sonnet21 May 2024 — Eight months ago, we demonstrated that sparse autoencoders could reco...

Published: May 2024
Source: anthropic.com
Title: towards monosemanticity decomposing language models with dictionary learning
Link: https://www.anthropic.com/research/towards-monosemanticity-decomposing-language-models-with-dictionary-learning
Source snippet
AnthropicDecomposing Language Models With Dictionary Learning5 Oct 2023 — In our latest paper, Towards Monosemanticity: Decomposing Langu...
Source: arxiv.org
Link: https://arxiv.org/html/2407.02646v2
Source snippet
arXivA Practical Review of Mechanistic Interpretability for...10 Mar 2025 — Our survey brings a unique perspective of task-centric surve...
Source: transformer-circuits.pub
Link: https://transformer-circuits.pub/2022/toy_model/index.html
Source snippet
Toy Models of SuperpositionSep 14, 2022 — In this paper, we use toy models — small ReLU networks trained on synthetic data with sparse in...
Source: anthropic.com
Title: superposition memorization and double descent
Link: https://www.anthropic.com/research/superposition-memorization-and-double-descent
Source snippet
Superposition, Memorization, and Double Descent5 Jan 2023 — In a recent paper, we found that simple neural networks trained on toy tasks...
Source: arxiv.org
Link: https://arxiv.org/abs/2309.08600
Source snippet
arXivSparse Autoencoders Find Highly Interpretable Features in...by H Cunningham · 2023 · Cited by 1050 — These autoencoders learn sets...
Source: transformer-circuits.pub
Link: https://transformer-circuits.pub/
Source snippet
Transformer Circuits ThreadAnthropic's Interpretability Research. A surprising fact about modern large language models is that nobody rea...
Source: arxiv.org
Link: https://arxiv.org/abs/2407.00886
Source: OpenAI
Title: language models can explain neurons in language models
Link: https://openai.com/index/language-models-can-explain-neurons-in-language-models/
Source snippet
Language models can explain neurons in...9 May 2023 — We use GPT-4 to automatically write explanations for the behavior of neurons in la...

Published: May 2023
Source: openaipublic.blob.core.windows.net
Link: https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html
Source snippet
Language models can explain neurons in language models9 May 2023 — This paper applies automation to the problem of scaling an interpretab...

Published: May 2023
Source: anthropic.com
Title: engineering challenges interpretability
Link: https://www.anthropic.com/research/engineering-challenges-interpretability
Source snippet
The engineering challenges of scaling interpretabilityJun 13, 2024 — Our Sparse Autoencoders—the tools we use to investigate “features”—a...
Source: arxiv.org
Link: https://arxiv.org/abs/2602.11180
Source: transformer-circuits.pub
Title: monosemantic features
Link: https://transformer-circuits.pub/2023/monosemantic-features
Source snippet
Decomposing Language Models With Dictionary Learning4 Oct 2023 — In this paper, we use a weak dictionary learning algorithm called a spar...
Source: transformer-circuits.pub
Title: Circuits Updates
Link: https://transformer-circuits.pub/2024/july-update/index.html
Source snippet
In a linear representation, each feature f_i...
Source: anthropic.com
Link: https://www.anthropic.com/research/team/interpretability
Source snippet
Interpretability ResearchThe mission of the Interpretability team is to discover and understand how large language models work internally...
Source: anthropic.com
Title: toy models of superposition
Link: https://www.anthropic.com/research/toy-models-of-superposition
Source snippet
14 Sept 2022 — In this paper, we use toy models — small ReLU networks trained on synthetic data with sparse input features — to investiga...
Source: anthropic.com
Link: https://www.anthropic.com/research/decomposing-language-models-into-understandable-components
Source snippet
Decomposing Language Models Into Understandable...Oct 5, 2023 — This work is a result of Anthropic's investment in Mechanistic Interpret...
Source: arxiv.org
Title: A Survey on Sparse Autoencoders: Interpreting the Internal...
Link: https://arxiv.org/html/2503.05613v3
Source snippet
A Survey on Sparse Autoencoders: Interpreting the Internal...23 Sept 2025 — Towards monosemanticity: Decomposing language models with di...
Source: arxiv.org
Link: https://arxiv.org/html/2310.06200v1
Source snippet
The Importance of Prompt Tuning for Automated Neuron...In bills2023language, the team from OpenAI showcases that GPT-4 can be useful in...
Source: github.com
Link: https://github.com/openai/automated-interpretability
Source snippet
openai/automated-interpretabilityThis repository contains code and tools associated with the Language models can explain neurons in langu...
Source: galileo.ai
Title: anthropic ai interpretability breakthrough
Link: https://galileo.ai/blog/anthropic-ai-interpretability-breakthrough
Source snippet
How Anthropic Made AI 70% More Interpretable1 Aug 2025 — Discover Anthropic's breakthrough: sparse autoencoders make AI 70% interpretable...
Source: Wikipedia
Link: https://en.wikipedia.org/wiki/Anthropic
Source snippet
AnthropicAnthropic is an American artificial intelligence (AI) company headquartered in San Francisco. It has developed a range of lar...
Source: strikingloo.github.io
Link: https://strikingloo.github.io/wiki/monosemanticity
Source snippet
Towards MonosemanticityOct 5, 2023 — In this paper, we use a weak dictionary learning algorithm called a sparse autoencoder to generate l...
Source: linkedin.com
Link: https://www.linkedin.com/posts/warren-wong-code_ai-machinelearning-interpretability-activity-7341814151171723264-1VN9
Source snippet
OpenAI's Automated Interpretability: Explaining Neurons in...Jun 20, 2025 — OpenAI's 2023 paper, "Language Models Can Explain Neurons in...
Source: techcrunch.com
Link: https://techcrunch.com/2026/05/18/anthropic-has-acquired-the-dev-tools-startup-used-by-openai-google-and-cloudflare/
Source: theorempath.com
Title: mechanistic interpretability
Link: https://theorempath.com/topics/mechanistic-interpretability
Source snippet
Features, Circuits, SAEsby R Sneiderman · 2026 — Mechanistic interpretability for transformers: superposition, sparse autoencoders, the l...
Source: facebook.com
Link: https://www.facebook.com/datasciencedojo/posts/-openai-just-released-a-groundbreaking-paper-that-pushes-mechanistic-interpretab/860302616519891/
Source snippet
OpenAI just released a groundbreaking paper that pushes...Language models can explain neurons in language models use GPT-4 to automatica...
Source: simonwillison.net
Link: https://simonwillison.net/2023/May/9/explain-neurons/
Source snippet
May 9, 2023 — “We generated cluster labels by embedding each neuron explanation using the OpenAI Embeddings API, then clustering them and...

Published: May 9, 2023
Source: aarnphm.xyz
Title: mechanistic interpretability
Link: https://aarnphm.xyz/thoughts/mechanistic-interpretability
Source snippet
Aaron's notesJan 6, 2026 — This greatly simplifies resulting circuits by: Handling cross-layer superposition directly; Allowing features...

Additional References

Source: github.com
Link: https://github.com/zepingyu0512/awesome-llm-understanding-mechanism
Source snippet
Awesome Papers for Understanding LLM MechanismThis list focuses on understanding the internal mechanism of large language models (LLM). W...
Source: tryalign.ai
Link: https://tryalign.ai/resources/blog/scaling-monosemanticity-extracting-interpretable-features-from-claude-3-sonnet
Source snippet
Extracting Interpretable Features from Claude 3 SonnetThe Anthropic research team managed to extract interpretable features from the acti...
Source: reddit.com
Link: https://www.reddit.com/r/Futurology/comments/13d8m62/language_models_can_explain_neurons_in_language/
Source snippet
Language models can explain neurons in language modelsWe propose an automated process that uses GPT-4 to produce and score natural langua...
Source: krmopuri.github.io
Link: https://krmopuri.github.io/xml/static_files/presentations/MI-Pranav.pdf
Source snippet
Mechanistic InterpretabilityGPT circuits are understood as compositions of two “atomic” circuits, the “query-key” and. “output-value” cir...
Source: medium.com
Link: https://medium.com/thedeephub/understanding-the-scaling-of-monosemanticity-in-ai-models-a-comprehensive-analysis-f72818fa44ca
Source snippet
Understanding the “Scaling of Monosemanticity” in AI ModelsA particular aspect of AI is called monosemanticity, where parts of an AI syst...
Source: lsd-project.jp
Link: https://lsd-project.jp/weblsd/o/begin/mechanistic
Source snippet
ライフサイエンス辞書: mechanistic機構の, メカニズムの, 機構的な. 【類義語】machinery, mechanism, mechanistically, organization. mechanistic insight *** コーパス PubMe...
Source: reddit.com
Link: https://www.reddit.com/r/programming/comments/185gcbc/god_help_us_lets_try_to_understand_ai/
Source: youtube.com
Link: https://www.youtube.com/watch?v=XrCq3pQJS6w
Source snippet
ACM AI | Reading Group W24W5 | Mechanistic Interpretability...This week, with William Zhou, we take a deep dive into mechanistic intepre...
Source: youtube.com
Link: https://www.youtube.com/watch?v=vFdVrX503W0
Source snippet
Language Models Can Explain Neurons in Language ModelsIn this paper reading we discuss OpenAI's paper "Language Models Can Explain Neuron...
Source: youtube.com
Link: https://www.youtube.com/watch?v=qMBWbJQ3b2g
