Within Scaling Limits
Why mapping AI internals gets harder with scale
Attempts to map AI circuits and hidden representations face growing limits as models scale to billions of parameters.
On this page
- How distributed representations resist human interpretation
- Backup strategies and redundant circuits in large models
- Why larger models are not automatically more transparent
Page outline Jump by section
Introduction
Mechanistic interpretability is the branch of AI research that tries to reverse-engineer neural networks and identify the internal mechanisms that produce specific behaviours. Rather than asking an AI system why it generated an answer, researchers attempt to trace the actual computations inside the model: which features were detected, which internal circuits were activated, and how information flowed through the network.
For people concerned about AI doom, alignment failures, or loss of control over advanced systems, mechanistic interpretability is attractive because it promises something stronger than behavioural testing. In principle, if researchers could understand a frontier model’s internal reasoning, they might detect dangerous goals, deceptive strategies, or other warning signs before those behaviours appear openly.
The difficulty is that the approach becomes harder as models become more capable. Some recent work has shown that researchers can identify meaningful internal features and circuits in large language models, but the same research has also highlighted how enormous the scaling challenge remains. The central question is no longer whether mechanistic interpretability can work in small cases. It is whether it can keep pace with frontier models whose internal computations may be vastly more complex than anything humans can inspect directly. [Anthropic]anthropic.comMapping the Mind of a Large Language ModelAnthropicMapping the Mind of a Large Language ModelMay 21, 2024 — We have identified how millions of concepts are represented inside Clau…
How distributed representations resist human interpretation
One of the biggest obstacles is that modern neural networks do not usually store concepts in neat, isolated locations.
A common intuition is that a neuron might represent a single idea such as “dog”, “Paris”, or “danger”. In practice, researchers repeatedly find that many neurons respond to mixtures of unrelated concepts. This phenomenon is often called polysemanticity. A single neuron may participate in multiple computations depending on context, making it difficult to assign a simple human-readable meaning to it. [Anthropic]anthropic.comtowards monosemanticity decomposing language models with dictionary learningAnthropicDecomposing Language Models With Dictionary Learning5 Oct 2023 — In our latest paper, Towards Monosemanticity: Decomposing Langu…
The deeper problem is that models often use what researchers call superposition. Instead of allocating separate internal resources to separate concepts, a network can compress many features into the same representational space. Anthropic’s work on toy models and later interpretability research argues that neural networks frequently represent more features than they have obvious dimensions available, causing concepts to overlap and interfere with one another. [transformer-circuits.pub]transformer-circuits.pubscaling monosemanticityExtracting Interpretable Features from Claude 3 Sonnet21 May 2024 — Eight months ago, we demonstrated that sparse autoencoders could reco…
This creates a scaling problem for interpretability: [openaipublic.blob.core.windows.net]openaipublic.blob.core.windows.netLanguage models can explain neurons in language models9 May 2023 — This paper applies automation to the problem of scaling an interpretab…
- The model may contain millions of meaningful features.
- Many features are represented across combinations of neurons rather than individual neurons.
- Important computations may only emerge from interactions among many components.
- Human-understandable concepts may not correspond neatly to the model’s internal structure.
Researchers have therefore increasingly shifted from studying individual neurons to studying higher-level features extracted using tools such as sparse autoencoders. This has produced important progress, but it also reveals the sheer number of features involved. Anthropic’s work on Claude 3 Sonnet reported evidence for millions of internal features, illustrating both the promise and the scale of the challenge. [Anthropic]anthropic.comsuperposition memorization and double descentSuperposition, Memorization, and Double Descent5 Jan 2023 — In a recent paper, we found that simple neural networks trained on toy tasks… [transformer-circuits.pub]transformer-circuits.pubToy Models of SuperpositionSep 14, 2022 — In this paper, we use toy models — small ReLU networks trained on synthetic data with sparse in…
Why finding one circuit does not reveal the whole mechanism
Early mechanistic interpretability successes often focused on relatively narrow tasks. Researchers identified specific circuits responsible for behaviours such as indirect object identification, token prediction patterns, or simple reasoning steps. These results demonstrated that meaningful internal structure exists and can sometimes be mapped. [transformer-circuits.pub]transformer-circuits.pubTransformer Circuits ThreadAnthropic's Interpretability Research. A surprising fact about modern large language models is that nobody rea…
However, frontier models appear to rely on many overlapping mechanisms rather than single clean pathways.
A useful analogy is biological brains. Finding one neural pathway involved in vision does not mean vision depends entirely on that pathway. Multiple subsystems often contribute simultaneously, providing redundancy and robustness.
Large language models appear to exhibit something similar. A circuit discovered in one setting may not be the only route through which the model can achieve a task. Alternative pathways may exist, and the model may switch strategies depending on context. As model size increases, the number of potential interactions grows dramatically. [arXiv]arxiv.orgarXivA Practical Review of Mechanistic Interpretability for…10 Mar 2025 — Our survey brings a unique perspective of task-centric surve…
This matters for AI safety because researchers are often interested in rare but dangerous behaviours. Suppose a safety team identifies one circuit associated with deceptive reasoning and modifies it. If the model can achieve the same outcome through several other circuits, the intervention may provide only limited assurance.
The challenge is therefore not merely locating a mechanism. It is determining whether that mechanism is the complete explanation for a behaviour or only one component in a much larger network of computations.
Backup strategies and redundant circuits in large models
As models become more capable, they often become more robust.
From an engineering perspective, robustness is desirable. If a few neurons fail or some inputs change, the model can still perform well. But robustness can be the enemy of interpretability.
A model with many redundant pathways may continue producing the same behaviour even after researchers disable a circuit they believe is important. The behaviour survives because other components can compensate.
This creates several difficulties:
Causal uncertainty. Researchers may identify a component strongly correlated with a behaviour without proving that it is uniquely responsible for it.
Intervention failure. Removing an apparently important circuit may have surprisingly little effect.
Hidden alternatives. Models may possess backup strategies that only appear under unusual conditions.
Distribution shifts. A model may rely on one circuit during interpretability experiments but switch to another when deployed in a different environment.
These concerns are especially relevant in discussions of deceptive alignment and loss-of-control scenarios. If advanced models develop strategies that can be implemented through many different internal pathways, discovering one pathway may not reveal the full picture of what the system is capable of doing.
Why larger models are not automatically more transparent
A common argument is that larger models may become easier to understand because they often develop more structured internal representations.
There is some evidence supporting this idea. Anthropic’s work on monosemantic features suggests that meaningful, relatively interpretable features can be extracted from large models using sparse autoencoders. Researchers have successfully identified features corresponding to concepts ranging from geographic locations to coding patterns and linguistic structures. [transformer-circuits.pub]transformer-circuits.pubmonosemantic featuresDecomposing Language Models With Dictionary Learning4 Oct 2023 — In this paper, we use a weak dictionary learning algorithm called a spar… [Anthropic]anthropic.comengineering challenges interpretabilityThe engineering challenges of scaling interpretabilityJun 13, 2024 — Our Sparse Autoencoders—the tools we use to investigate “features”—a…
Yet these successes do not imply that frontier systems become transparent.
Several scaling pressures push in the opposite direction:
- Larger models contain vastly more features. [anthropic.com]anthropic.comtoy models of superposition14 Sept 2022 — In this paper, we use toy models — small ReLU networks trained on synthetic data with sparse input features — to investiga…
- More features create more potential interactions.
- More capabilities create more complex circuits.
- New behaviours can emerge that were absent in smaller systems.
The result is a paradox. Larger models may contain cleaner local structures while simultaneously becoming harder to understand globally.
Researchers might successfully explain thousands or even millions of individual features while still lacking a comprehensive understanding of how those features combine to produce high-level behaviour. Knowing the parts is not necessarily the same as understanding the system.
This distinction is crucial in AI doom debates. The question is not whether some internal representations can be interpreted. The question is whether enough of the system can be understood to provide confidence that dangerous objectives, deceptive reasoning, or other catastrophic failure modes are absent.
Automation helps, but may not solve the scaling problem
Recognising that humans cannot manually inspect billions of parameters, researchers have increasingly explored automated interpretability.
OpenAI demonstrated one version of this approach by using GPT-4 to generate explanations for neurons in GPT-2. The broader goal is to create systems that help explain other systems, allowing interpretability research to scale beyond manual investigation. [OpenAI]OpenAIlanguage models can explain neurons in language modelsLanguage models can explain neurons in…9 May 2023 — We use GPT-4 to automatically write explanations for the behavior of neurons in la… [2openaipublic.blob.core.windows.net]openaipublic.blob.core.windows.netLanguage models can explain neurons in language models9 May 2023 — This paper applies automation to the problem of scaling an interpretab…
Automated circuit-discovery methods have also improved substantially. New techniques can identify candidate circuits far faster than earlier approaches, making larger-scale investigations more practical. [arXiv]arxiv.orgarXivA Practical Review of Mechanistic Interpretability for…10 Mar 2025 — Our survey brings a unique perspective of task-centric surve…
However, automation introduces its own questions:
- How can researchers verify that automated explanations are correct?
- Can an AI reliably explain mechanisms more advanced than itself?
- How much of a frontier model must be understood before safety conclusions become justified?
- Could automated explanations themselves become misleading?
In effect, automation may help address the labour problem without fully solving the understanding problem.
What this means for AI doom arguments
Mechanistic interpretability occupies a distinctive place in existential-risk debates because it targets a specific concern: humans may lose the ability to understand what increasingly capable systems are doing internally.
Supporters argue that interpretability could eventually provide an “AI MRI” capable of revealing hidden goals, deceptive planning, or dangerous reasoning before catastrophe occurs. Progress on sparse autoencoders, feature discovery, and circuit analysis is often cited as evidence that the field is moving in that direction. [transformer-circuits.pub]transformer-circuits.pubCircuits UpdatesIn a linear representation, each feature f i f_i…Read more… [Anthropic]anthropic.comInterpretability ResearchThe mission of the Interpretability team is to discover and understand how large language models work internally…
Sceptics do not necessarily deny the value of interpretability. Instead, many question whether it can scale quickly enough. Frontier models already contain enormous numbers of interacting features, and future systems may be substantially more complex still. Even optimistic researchers frequently describe interpretability as being in an early stage relative to the scale of the systems being studied. [transformer-circuits.pub]transformer-circuits.pubscaling monosemanticityExtracting Interpretable Features from Claude 3 Sonnet21 May 2024 — Eight months ago, we demonstrated that sparse autoencoders could reco… [Anthropic]anthropic.comtoy models of superposition14 Sept 2022 — In this paper, we use toy models — small ReLU networks trained on synthetic data with sparse input features — to investiga…
For AI doom discussions, this creates an uncomfortable possibility. If mechanistic interpretability scales more slowly than capabilities, society could face increasingly powerful systems before it possesses reliable tools for understanding their internal decision-making. Whether that gap remains manageable or becomes a serious alignment problem is one of the central unresolved questions in contemporary AI safety research.
Amazon book picks
Further Reading
Books and field guides related to Why mapping AI internals gets harder with scale. Use these as the next step if you want deeper reading beyond the article.
The Alignment Problem
Explains interpretability, alignment and why understanding model internals is difficult.
Architects of Intelligence
Provides context on scaling, deep learning and future AI capabilities.
Deep Learning
Rating: 3.5/5 from 6 Google Books ratings
Helps readers understand distributed representations and neural network structure.
Endnotes
-
Source: anthropic.com
Title: Mapping the Mind of a Large Language Model
Link: https://www.anthropic.com/research/mapping-mind-language-modelSource snippet
AnthropicMapping the Mind of a Large Language ModelMay 21, 2024 — We have identified how millions of concepts are represented inside Clau...
Published: May 21, 2024
-
Source: transformer-circuits.pub
Title: scaling monosemanticity
Link: https://transformer-circuits.pub/2024/scaling-monosemanticity/Source snippet
Extracting Interpretable Features from Claude 3 Sonnet21 May 2024 — Eight months ago, we demonstrated that sparse autoencoders could reco...
Published: May 2024
-
Source: anthropic.com
Title: towards monosemanticity decomposing language models with dictionary learning
Link: https://www.anthropic.com/research/towards-monosemanticity-decomposing-language-models-with-dictionary-learningSource snippet
AnthropicDecomposing Language Models With Dictionary Learning5 Oct 2023 — In our latest paper, Towards Monosemanticity: Decomposing Langu...
-
Source: arxiv.org
Link: https://arxiv.org/html/2407.02646v2Source snippet
arXivA Practical Review of Mechanistic Interpretability for...10 Mar 2025 — Our survey brings a unique perspective of task-centric surve...
-
Source: transformer-circuits.pub
Link: https://transformer-circuits.pub/2022/toy_model/index.htmlSource snippet
Toy Models of SuperpositionSep 14, 2022 — In this paper, we use toy models — small ReLU networks trained on [synthetic data]({{ 'synthetic-data/' | relative_url }}) with sparse in...
-
Source: anthropic.com
Title: superposition memorization and double descent
Link: https://www.anthropic.com/research/superposition-memorization-and-double-descentSource snippet
Superposition, Memorization, and Double Descent5 Jan 2023 — In a recent paper, we found that simple neural networks trained on toy tasks...
-
Source: arxiv.org
Link: https://arxiv.org/abs/2309.08600Source snippet
arXivSparse Autoencoders Find Highly Interpretable Features in...by H Cunningham · 2023 · Cited by 1050 — These autoencoders learn sets...
-
Source: transformer-circuits.pub
Link: https://transformer-circuits.pub/Source snippet
Transformer Circuits ThreadAnthropic's Interpretability Research. A surprising fact about modern large language models is that nobody rea...
-
Source: arxiv.org
Link: https://arxiv.org/abs/2407.00886 -
Source: OpenAI
Title: language models can explain neurons in language models
Link: https://openai.com/index/language-models-can-explain-neurons-in-language-models/Source snippet
Language models can explain neurons in...9 May 2023 — We use GPT-4 to automatically write explanations for the behavior of neurons in la...
Published: May 2023
-
Source: openaipublic.blob.core.windows.net
Link: https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.htmlSource snippet
Language models can explain neurons in language models9 May 2023 — This paper applies automation to the problem of scaling an interpretab...
Published: May 2023
-
Source: anthropic.com
Title: engineering challenges interpretability
Link: https://www.anthropic.com/research/engineering-challenges-interpretabilitySource snippet
The engineering challenges of scaling interpretabilityJun 13, 2024 — Our Sparse Autoencoders—the tools we use to investigate “features”—a...
-
Source: arxiv.org
Link: https://arxiv.org/abs/2602.11180 -
Source: transformer-circuits.pub
Title: monosemantic features
Link: https://transformer-circuits.pub/2023/monosemantic-featuresSource snippet
Decomposing Language Models With Dictionary Learning4 Oct 2023 — In this paper, we use a weak dictionary learning algorithm called a spar...
-
Source: transformer-circuits.pub
Title: Circuits Updates
Link: https://transformer-circuits.pub/2024/july-update/index.htmlSource snippet
In a linear representation, each feature f i f_i...Read more...
-
Source: anthropic.com
Link: https://www.anthropic.com/research/team/interpretabilitySource snippet
Interpretability ResearchThe mission of the Interpretability team is to discover and understand how large language models work internally...
-
Source: anthropic.com
Title: toy models of superposition
Link: https://www.anthropic.com/research/toy-models-of-superpositionSource snippet
14 Sept 2022 — In this paper, we use toy models — small ReLU networks trained on [synthetic]({{ 'synthetic-data/' | relative_url }}) data with sparse input features — to investiga...
-
Source: anthropic.com
Link: https://www.anthropic.com/research/decomposing-language-models-into-understandable-componentsSource snippet
Decomposing Language Models Into Understandable...Oct 5, 2023 — This work is a result of Anthropic's investment in Mechanistic Interpret...
-
Source: arxiv.org
Title: features in large language models via sparse autoencoders.Read more
Link: https://arxiv.org/html/2503.05613v3Source snippet
A Survey on Sparse Autoencoders: Interpreting the Internal...23 Sept 2025 — Towards monosemanticity: Decomposing language models with di...
-
Source: arxiv.org
Link: https://arxiv.org/html/2310.06200v1Source snippet
The Importance of Prompt Tuning for Automated Neuron...In bills2023language, the team from OpenAI showcases that GPT-4 can be useful in...
-
Source: github.com
Link: https://github.com/openai/automated-interpretabilitySource snippet
openai/automated-interpretabilityThis repository contains code and tools associated with the Language models can explain neurons in langu...
-
Source: galileo.ai
Title: anthropic ai interpretability breakthrough
Link: https://galileo.ai/blog/anthropic-ai-interpretability-breakthroughSource snippet
How Anthropic Made AI 70% More Interpretable1 Aug 2025 — Discover Anthropic's breakthrough: sparse autoencoders make AI 70% interpretable...
-
Source: Wikipedia
Link: https://en.wikipedia.org/wiki/AnthropicSource snippet
AnthropicAnthropic is an American [artificial]({{ 'artificial-goals/' | relative_url }}) intelligence (AI) company headquartered in San Francisco. It has developed a range of lar...
-
Source: strikingloo.github.io
Link: https://strikingloo.github.io/wiki/monosemanticitySource snippet
Towards MonosemanticityOct 5, 2023 — In this paper, we use a weak dictionary learning algorithm called a sparse autoencoder to generate l...
-
Source: linkedin.com
Link: https://www.linkedin.com/posts/warren-wong-code_ai-machinelearning-interpretability-activity-7341814151171723264-1VN9Source snippet
OpenAI's Automated Interpretability: Explaining Neurons in...Jun 20, 2025 — OpenAI's 2023 paper, "Language Models Can Explain Neurons in...
-
Source: techcrunch.com
Link: https://techcrunch.com/2026/05/18/anthropic-has-acquired-the-dev-tools-startup-used-by-openai-google-and-cloudflare/ -
Source: theorempath.com
Title: mechanistic interpretability
Link: https://theorempath.com/topics/mechanistic-interpretabilitySource snippet
Features, Circuits, SAEsby R Sneiderman · 2026 — Mechanistic interpretability for transformers: superposition, sparse autoencoders, the l...
-
Source: facebook.com
Link: https://www.facebook.com/datasciencedojo/posts/-openai-just-released-a-groundbreaking-paper-that-pushes-mechanistic-interpretab/860302616519891/Source snippet
OpenAI just released a groundbreaking paper that pushes...Language models can explain neurons in language models use GPT-4 to automatica...
-
Source: simonwillison.net
Link: https://simonwillison.net/2023/May/9/explain-neurons/Source snippet
May 9, 2023 — “We generated cluster labels by embedding each neuron explanation using the OpenAI Embeddings API, then clustering them and...
Published: May 9, 2023
-
Source: aarnphm.xyz
Title: mechanistic interpretability
Link: https://aarnphm.xyz/thoughts/mechanistic-interpretabilitySource snippet
Aaron's notesJan 6, 2026 — This greatly simplifies resulting circuits by: Handling cross-layer superposition directly; Allowing features...
Additional References
-
Source: github.com
Link: https://github.com/zepingyu0512/awesome-llm-understanding-mechanismSource snippet
Awesome Papers for Understanding LLM MechanismThis list focuses on understanding the internal mechanism of large language models (LLM). W...
-
Source: tryalign.ai
Link: https://tryalign.ai/resources/blog/scaling-monosemanticity-extracting-interpretable-features-from-claude-3-sonnetSource snippet
Extracting Interpretable Features from Claude 3 SonnetThe Anthropic research team managed to extract interpretable features from the acti...
-
Source: reddit.com
Link: https://www.reddit.com/r/Futurology/comments/13d8m62/language_models_can_explain_neurons_in_language/Source snippet
Language models can explain neurons in language modelsWe propose an automated process that uses GPT-4 to produce and score natural langua...
-
Source: krmopuri.github.io
Link: https://krmopuri.github.io/xml/static_files/presentations/MI-Pranav.pdfSource snippet
Mechanistic InterpretabilityGPT circuits are understood as compositions of two “atomic” circuits, the “query-key” and. “output-value” cir...
-
Source: medium.com
Link: https://medium.com/thedeephub/understanding-the-scaling-of-monosemanticity-in-ai-models-a-comprehensive-analysis-f72818fa44caSource snippet
Understanding the “Scaling of Monosemanticity” in AI ModelsA particular aspect of AI is called monosemanticity, where parts of an AI syst...
-
Source: lsd-project.jp
Link: https://lsd-project.jp/weblsd/o/begin/mechanisticSource snippet
ライフサイエンス辞書: mechanistic機構 の, メカニズム の, 機構的 な. 【類義語】machinery, mechanism, mechanistically, organization. mechanistic insight *** コーパス PubMe...
-
Source: reddit.com
Link: https://www.reddit.com/r/programming/comments/185gcbc/god_help_us_lets_try_to_understand_ai/ -
Source: youtube.com
Link: https://www.youtube.com/watch?v=XrCq3pQJS6wSource snippet
ACM AI | Reading Group W24W5 | Mechanistic Interpretability...This week, with William Zhou, we take a deep dive into mechanistic intepre...
-
Source: youtube.com
Link: https://www.youtube.com/watch?v=vFdVrX503W0Source snippet
Language Models Can Explain Neurons in Language ModelsIn this paper reading we discuss OpenAI's paper "Language Models Can Explain Neuron...
-
Source: youtube.com
Link: https://www.youtube.com/watch?v=qMBWbJQ3b2g
Topic Tree





