Why mapping AI internals gets harder with scale

Attempts to map AI circuits and hidden representations face growing limits as models scale to billions of parameters.

Introduction

Mechanistic interpretability is the branch of AI research that tries to reverse-engineer neural networks and identify the internal mechanisms that produce specific behaviours. Rather than asking an AI system why it generated an answer, researchers attempt to trace the actual computations inside the model: which features were detected, which internal circuits were activated, and how information flowed through the network.

For people concerned about AI doom, alignment failures, or loss of control over advanced systems, mechanistic interpretability is attractive because it promises something stronger than behavioural testing. In principle, if researchers could understand a frontier model’s internal reasoning, they might detect dangerous goals, deceptive strategies, or other warning signs before those behaviours appear openly.

The difficulty is that the approach becomes harder as models become more capable. Some recent work has shown that researchers can identify meaningful internal features and circuits in large language models, but the same research has also highlighted how enormous the scaling challenge remains. The central question is no longer whether mechanistic interpretability can work in small cases. It is whether it can keep pace with frontier models whose internal computations may be vastly more complex than anything humans can inspect directly. [Anthropic, “Mapping the Mind of a Large Language Model”, 21 May 2024]

How distributed representations resist human interpretation

One of the biggest obstacles is that modern neural networks do not usually store concepts in neat, isolated locations.

A common intuition is that a neuron might represent a single idea such as “dog”, “Paris”, or “danger”. In practice, researchers repeatedly find that many neurons respond to mixtures of unrelated concepts. This phenomenon is often called polysemanticity. A single neuron may participate in multiple computations depending on context, making it difficult to assign a simple human-readable meaning to it. [Anthropic, “Decomposing Language Models With Dictionary Learning”, 5 Oct 2023]

The deeper problem is that models often use what researchers call superposition. Instead of allocating separate internal resources to separate concepts, a network can compress many features into the same representational space. Anthropic’s work on toy models and later interpretability research argues that neural networks frequently represent more features than they have dimensions available, so the directions used for different concepts overlap and interfere with one another. [transformer-circuits.pub, “Extracting Interpretable Features from Claude 3 Sonnet”, 21 May 2024]
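A minimal sketch of the idea, assuming a two-dimensional activation space and three hypothetical features (the names are borrowed from the intuition above, and the 120-degree layout is chosen purely for illustration). It is not taken from any of the cited papers; it only shows that three directions cannot all be independent in two dimensions, and that each coordinate, standing in for a neuron, ends up responding to more than one feature.

```python
import math


def unit_vector(angle_degrees):
    """Return a 2-D unit vector pointing at the given angle."""
    angle = math.radians(angle_degrees)
    return (math.cos(angle), math.sin(angle))


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Three hypothetical features packed into two dimensions, 120 degrees apart.
FEATURES = {
    "dog": unit_vector(0),
    "Paris": unit_vector(120),
    "danger": unit_vector(240),
}


def interference(name_a, name_b):
    """Overlap between two feature directions; 0.0 would mean none."""
    return dot(FEATURES[name_a], FEATURES[name_b])


def neuron_response(neuron_index, name):
    """Value of one coordinate ("neuron") when only the named feature is active."""
    return FEATURES[name][neuron_index]


for pair in [("dog", "Paris"), ("dog", "danger"), ("Paris", "danger")]:
    print(pair, round(interference(*pair), 3))

for name in FEATURES:
    print("neuron 0 |", name, round(neuron_response(0, name), 3))
```

Every pair of directions has a non-zero overlap, and neuron 0 takes a non-zero value for all three features, which is a small-scale picture of why single neurons resist a single label.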

This creates a scaling problem for interpretability [openaipublic.blob.core.windows.net, “Language models can explain neurons in language models”, 9 May 2023]:

  • The model may contain millions of meaningful features.
  • Many features are represented across combinations of neurons rather than individual neurons.
  • Important computations may only emerge from interactions among many components.
  • Human-understandable concepts may not correspond neatly to the model’s internal structure.

Researchers have therefore increasingly shifted from studying individual neurons to studying higher-level features extracted using tools such as sparse autoencoders. This has produced important progress, but it also reveals the sheer number of features involved. Anthropic’s work on Claude 3 Sonnet reported evidence for millions of internal features, illustrating both the promise and the scale of the challenge. [Anthropic, “Superposition, Memorization, and Double Descent”, 5 Jan 2023] [transformer-circuits.pub, “Toy Models of Superposition”, 14 Sep 2022]
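For readers unfamiliar with the tool, the sketch below shows the general shape of a sparse autoencoder: encode an activation into non-negative feature values, reconstruct the activation from them, and penalise both reconstruction error and the number of active features. It assumes the common textbook form (ReLU encoder, linear decoder, L1 penalty); the weights are hand-picked rather than trained, the names are hypothetical, and published systems differ in details such as bias handling and normalisation.

```python
import math


def relu(values):
    """Clamp negative values to zero."""
    return [max(0.0, v) for v in values]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode x into feature activations, then reconstruct x from them."""
    pre_activations = [p + b for p, b in zip(matvec(w_enc, x), b_enc)]
    features = relu(pre_activations)
    reconstruction = [r + b for r, b in zip(matvec(w_dec, features), b_dec)]
    return features, reconstruction


def sae_loss(x, reconstruction, features, l1_coeff):
    """Squared reconstruction error plus an L1 sparsity penalty."""
    squared_error = sum((a - b) ** 2 for a, b in zip(x, reconstruction))
    sparsity_penalty = sum(abs(f) for f in features)
    return squared_error + l1_coeff * sparsity_penalty


# Hand-picked (not trained) dictionary: 3 features for a 2-D activation.
DIRECTIONS = [(math.cos(math.radians(a)), math.sin(math.radians(a)))
              for a in (0, 120, 240)]
W_ENC = [list(d) for d in DIRECTIONS]                   # 3 rows x 2 columns
W_DEC = [[d[i] for d in DIRECTIONS] for i in range(2)]  # 2 rows x 3 columns
B_ENC = [0.0, 0.0, 0.0]
B_DEC = [0.0, 0.0]

activation = [1.0, 0.0]  # illustrative input aligned with the first feature
features, reconstruction = sae_forward(activation, W_ENC, B_ENC, W_DEC, B_DEC)
loss = sae_loss(activation, reconstruction, features, l1_coeff=0.1)
print(features, reconstruction, loss)
```

Here only one of the three features is active for the chosen input, which is the sparse, one-feature-at-a-time description that researchers hope to recover. In a real model the dictionary would have to contain the millions of features mentioned above.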

Why finding one circuit does not reveal the whole mechanism

Early mechanistic interpretability successes often focused on relatively narrow tasks. Researchers identified specific circuits responsible for behaviours such as indirect object identification, token prediction patterns, or simple reasoning steps. These results demonstrated that meaningful internal structure exists and can sometimes be mapped. [transformer-circuits.pub, “Transformer Circuits Thread”]

However, frontier models appear to rely on many overlapping mechanisms rather than single clean pathways.

A useful analogy is biological brains. Finding one neural pathway involved in vision does not mean vision depends entirely on that pathway. Multiple subsystems often contribute simultaneously, providing redundancy and robustness.

Large language models appear to exhibit something similar. A circuit discovered in one setting may not be the only route through which the model can achieve a task. Alternative pathways may exist, and the model may switch strategies depending on context. As model size increases, the number of potential interactions grows faster than the number of components themselves. [arXiv, “A Practical Review of Mechanistic Interpretability for…”, 10 Mar 2025]

This matters for AI safety because researchers are often interested in rare but dangerous behaviours. Suppose a safety team identifies one circuit associated with deceptive reasoning and modifies it. If the model can achieve the same outcome through several other circuits, the intervention may provide only limited assurance.

The challenge is therefore not merely locating a mechanism. It is determining whether that mechanism is the complete explanation for a behaviour or only one component in a much larger network of computations.

Backup strategies and redundant circuits in large models

As models become more capable, they often become more robust.

From an engineering perspective, robustness is desirable. If a few neurons fail or some inputs change, the model can still perform well. But robustness can be the enemy of interpretability.

A model with many redundant pathways may continue producing the same behaviour even after researchers disable a circuit they believe is important. The behaviour survives because other components can compensate.
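A toy illustration of that compensation effect, assuming a hypothetical model with two pathways that compute the same signal by different routes. The function and component names are invented for the example, non-negative inputs are assumed, and `max` is only a stand-in for whatever mechanism lets one pathway cover for another.

```python
def toy_model(x, ablated=frozenset()):
    """Output of a hypothetical model with two redundant pathways.

    Each pathway contributes 0.0 when it is ablated; the output uses
    whichever pathway still carries the signal.
    """
    path_a = 0.0 if "path_a" in ablated else 2.0 * x
    path_b = 0.0 if "path_b" in ablated else x + x  # same result, different route
    return max(path_a, path_b)


def ablation_effect(x, components):
    """Drop in output when the named components are disabled together."""
    return toy_model(x) - toy_model(x, ablated=frozenset(components))


x = 3.0
print("ablate path_a only:", ablation_effect(x, {"path_a"}))
print("ablate path_b only:", ablation_effect(x, {"path_b"}))
print("ablate both:       ", ablation_effect(x, {"path_a", "path_b"}))
```

Disabling either pathway alone leaves the output unchanged, so a one-at-a-time ablation study would rate both as unimportant, even though removing them together eliminates the behaviour.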

This creates several difficulties:

Causal uncertainty. Researchers may identify a component strongly correlated with a behaviour without proving that it is uniquely responsible for it.

Intervention failure. Removing an apparently important circuit may have surprisingly little effect.

Hidden alternatives. Models may possess backup strategies that only appear under unusual conditions.

Distribution shifts. A model may rely on one circuit during interpretability experiments but switch to another when deployed in a different environment.

These concerns are especially relevant in discussions of deceptive alignment and loss-of-control scenarios. If advanced models develop strategies that can be implemented through many different internal pathways, discovering one pathway may not reveal the full picture of what the system is capable of doing.

Why larger models are not automatically more transparent

A common argument is that larger models may become easier to understand because they often develop more structured internal representations.

There is some evidence supporting this idea. Anthropic’s work on monosemantic features suggests that meaningful, relatively interpretable features can be extracted from large models using sparse autoencoders. Researchers have successfully identified features corresponding to concepts ranging from geographic locations to coding patterns and linguistic structures. [transformer-circuits.pub, “Decomposing Language Models With Dictionary Learning”, 4 Oct 2023] [Anthropic, “The engineering challenges of scaling interpretability”, 13 Jun 2024]

Yet these successes do not imply that frontier systems become transparent.

Several scaling pressures push in the opposite direction:

  • Larger models contain vastly more features. [Anthropic, “Toy Models of Superposition”, 14 Sept 2022]
  • More features create more potential interactions.
  • More capabilities create more complex circuits.
  • New behaviours can emerge that were absent in smaller systems.
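A rough sense of how quickly potential interactions outpace features can be had by counting pairs alone. This is a back-of-the-envelope sketch: the feature counts are illustrative, it counts every possible pair rather than pairs that actually interact in a real model, and interactions among three or more features would push the numbers higher still.

```python
import math


def pairwise_interactions(n_features):
    """Number of distinct unordered pairs among n_features items."""
    return math.comb(n_features, 2)


for n in (1_000, 1_000_000):
    print(f"{n:>9,} features -> {pairwise_interactions(n):,} possible pairs")
```

A thousandfold increase in features corresponds to roughly a millionfold increase in possible pairs.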

The result is a paradox. Larger models may contain cleaner local structures while simultaneously becoming harder to understand globally.

Researchers might successfully explain thousands or even millions of individual features while still lacking a comprehensive understanding of how those features combine to produce high-level behaviour. Knowing the parts is not necessarily the same as understanding the system.

This distinction is crucial in AI doom debates. The question is not whether some internal representations can be interpreted. The question is whether enough of the system can be understood to provide confidence that dangerous objectives, deceptive reasoning, or other catastrophic failure modes are absent.

Automation helps, but may not solve the scaling problem

Recognising that humans cannot manually inspect billions of parameters, researchers have increasingly explored automated interpretability.

OpenAI demonstrated one version of this approach by using GPT-4 to generate explanations for neurons in GPT-2. The broader goal is to create systems that help explain other systems, allowing interpretability research to scale beyond manual investigation. [OpenAI, “Language models can explain neurons in language models”, 9 May 2023] [openaipublic.blob.core.windows.net, “Language models can explain neurons in language models”, 9 May 2023]

Automated circuit-discovery methods have also improved substantially. New techniques can identify candidate circuits far faster than earlier approaches, making larger-scale investigations more practical. [arXiv, “A Practical Review of Mechanistic Interpretability for…”, 10 Mar 2025]

However, automation introduces its own questions:

  • How can researchers verify that automated explanations are correct?
  • Can an AI reliably explain mechanisms more advanced than itself?
  • How much of a frontier model must be understood before safety conclusions become justified?
  • Could automated explanations themselves become misleading?

In effect, automation may help address the labour problem without fully solving the understanding problem.
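One way to make the verification question concrete is to score an explanation by how well the activations it predicts match the activations actually recorded. The sketch below assumes that setup; the tokens, activation values, and keyword-matching "simulator" are all made up for illustration and are a stand-in for an explainer model, not the pipeline used in the work cited above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def simulate_from_explanation(tokens, keywords):
    """Predict 1.0 on tokens the explanation says should activate the unit, else 0.0."""
    return [1.0 if token.lower() in keywords else 0.0 for token in tokens]


# Made-up recording of one polysemantic unit: it fires on dog words and on "Paris".
tokens = ["The", "dog", "chased", "a", "puppy", "in", "Paris"]
recorded = [0.0, 0.9, 0.1, 0.0, 0.8, 0.0, 0.7]

fuller = simulate_from_explanation(tokens, {"dog", "puppy", "paris"})
partial = simulate_from_explanation(tokens, {"dog", "puppy"})

print("fuller explanation score: ", round(pearson(recorded, fuller), 3))
print("partial explanation score:", round(pearson(recorded, partial), 3))
```

The partial explanation ("fires on dogs") still earns a respectable score while missing part of what the unit does, which is one way an automated explanation can look convincing and remain incomplete.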

What this means for AI doom arguments

Mechanistic interpretability occupies a distinctive place in existential-risk debates because it targets a specific concern: humans may lose the ability to understand what increasingly capable systems are doing internally.

Supporters argue that interpretability could eventually provide an “AI MRI” capable of revealing hidden goals, deceptive planning, or dangerous reasoning before catastrophe occurs. Progress on sparse autoencoders, feature discovery, and circuit analysis is often cited as evidence that the field is moving in that direction. [transformer-circuits.pub, “Circuits Updates”] [Anthropic, “Interpretability Research”]

Sceptics do not necessarily deny the value of interpretability. Instead, many question whether it can scale quickly enough. Frontier models already contain enormous numbers of interacting features, and future systems may be substantially more complex still. Even optimistic researchers frequently describe interpretability as being in an early stage relative to the scale of the systems being studied. [transformer-circuits.pub, “Extracting Interpretable Features from Claude 3 Sonnet”, 21 May 2024] [Anthropic, “Toy Models of Superposition”, 14 Sept 2022]

For AI doom discussions, this creates an uncomfortable possibility. If mechanistic interpretability scales more slowly than capabilities, society could face increasingly powerful systems before it possesses reliable tools for understanding their internal decision-making. Whether that gap remains manageable or becomes a serious alignment problem is one of the central unresolved questions in contemporary AI safety research.

Endnotes

  1. Source: anthropic.com
    Title: Mapping the Mind of a Large Language Model
    Link: https://www.anthropic.com/research/mapping-mind-language-model
    Source snippet

    AnthropicMapping the Mind of a Large Language ModelMay 21, 2024 — We have identified how millions of concepts are represented inside Clau...

    Published: May 21, 2024

  2. Source: transformer-circuits.pub
    Title: scaling monosemanticity
    Link: https://transformer-circuits.pub/2024/scaling-monosemanticity/
    Source snippet

    Extracting Interpretable Features from Claude 3 Sonnet21 May 2024 — Eight months ago, we demonstrated that sparse autoencoders could reco...

    Published: May 2024

  3. Source: anthropic.com
    Title: towards monosemanticity decomposing language models with dictionary learning
    Link: https://www.anthropic.com/research/towards-monosemanticity-decomposing-language-models-with-dictionary-learning
    Source snippet

    AnthropicDecomposing Language Models With Dictionary Learning5 Oct 2023 — In our latest paper, Towards Monosemanticity: Decomposing Langu...

  4. Source: arxiv.org
    Link: https://arxiv.org/html/2407.02646v2
    Source snippet

    arXivA Practical Review of Mechanistic Interpretability for...10 Mar 2025 — Our survey brings a unique perspective of task-centric surve...

  5. Source: transformer-circuits.pub
    Link: https://transformer-circuits.pub/2022/toy_model/index.html
    Source snippet

    Toy Models of SuperpositionSep 14, 2022 — In this paper, we use toy models — small ReLU networks trained on synthetic data with sparse in...

  6. Source: anthropic.com
    Title: superposition memorization and double descent
    Link: https://www.anthropic.com/research/superposition-memorization-and-double-descent
    Source snippet

    Superposition, Memorization, and Double Descent5 Jan 2023 — In a recent paper, we found that simple neural networks trained on toy tasks...

  7. Source: arxiv.org
    Link: https://arxiv.org/abs/2309.08600
    Source snippet

    arXivSparse Autoencoders Find Highly Interpretable Features in...by H Cunningham · 2023 · Cited by 1050 — These autoencoders learn sets...

  8. Source: transformer-circuits.pub
    Link: https://transformer-circuits.pub/
    Source snippet

    Transformer Circuits ThreadAnthropic's Interpretability Research. A surprising fact about modern large language models is that nobody rea...

  9. Source: arxiv.org
    Link: https://arxiv.org/abs/2407.00886

  10. Source: OpenAI
    Title: language models can explain neurons in language models
    Link: https://openai.com/index/language-models-can-explain-neurons-in-language-models/
    Source snippet

    Language models can explain neurons in...9 May 2023 — We use GPT-4 to automatically write explanations for the behavior of neurons in la...

    Published: May 2023

  11. Source: openaipublic.blob.core.windows.net
    Link: https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html
    Source snippet

    Language models can explain neurons in language models9 May 2023 — This paper applies automation to the problem of scaling an interpretab...

    Published: May 2023

  12. Source: anthropic.com
    Title: engineering challenges interpretability
    Link: https://www.anthropic.com/research/engineering-challenges-interpretability
    Source snippet

    The engineering challenges of scaling interpretabilityJun 13, 2024 — Our Sparse Autoencoders—the tools we use to investigate “features”—a...

  13. Source: arxiv.org
    Link: https://arxiv.org/abs/2602.11180

  14. Source: transformer-circuits.pub
    Title: monosemantic features
    Link: https://transformer-circuits.pub/2023/monosemantic-features
    Source snippet

    Decomposing Language Models With Dictionary Learning4 Oct 2023 — In this paper, we use a weak dictionary learning algorithm called a spar...

  15. Source: transformer-circuits.pub
    Title: Circuits Updates
    Link: https://transformer-circuits.pub/2024/july-update/index.html
    Source snippet

    In a linear representation, each feature f_i...

  16. Source: anthropic.com
    Link: https://www.anthropic.com/research/team/interpretability
    Source snippet

    Interpretability ResearchThe mission of the Interpretability team is to discover and understand how large language models work internally...

  17. Source: anthropic.com
    Title: toy models of superposition
    Link: https://www.anthropic.com/research/toy-models-of-superposition
    Source snippet

    14 Sept 2022 — In this paper, we use toy models — small ReLU networks trained on synthetic data with sparse input features — to investiga...

  18. Source: anthropic.com
    Link: https://www.anthropic.com/research/decomposing-language-models-into-understandable-components
    Source snippet

    Decomposing Language Models Into Understandable...Oct 5, 2023 — This work is a result of Anthropic's investment in Mechanistic Interpret...

  19. Source: arxiv.org
    Title: features in large language models via sparse autoencoders.
    Link: https://arxiv.org/html/2503.05613v3
    Source snippet

    A Survey on Sparse Autoencoders: Interpreting the Internal...23 Sept 2025 — Towards monosemanticity: Decomposing language models with di...

  20. Source: arxiv.org
    Link: https://arxiv.org/html/2310.06200v1
    Source snippet

    The Importance of Prompt Tuning for Automated Neuron...In bills2023language, the team from OpenAI showcases that GPT-4 can be useful in...

  21. Source: github.com
    Link: https://github.com/openai/automated-interpretability
    Source snippet

    openai/automated-interpretabilityThis repository contains code and tools associated with the Language models can explain neurons in langu...

  22. Source: galileo.ai
    Title: anthropic ai interpretability breakthrough
    Link: https://galileo.ai/blog/anthropic-ai-interpretability-breakthrough
    Source snippet

    How Anthropic Made AI 70% More Interpretable1 Aug 2025 — Discover Anthropic's breakthrough: sparse autoencoders make AI 70% interpretable...

  23. Source: Wikipedia
    Link: https://en.wikipedia.org/wiki/Anthropic
    Source snippet

    AnthropicAnthropic is an American artificial intelligence (AI) company headquartered in San Francisco. It has developed a range of lar...

  24. Source: strikingloo.github.io
    Link: https://strikingloo.github.io/wiki/monosemanticity
    Source snippet

    Towards MonosemanticityOct 5, 2023 — In this paper, we use a weak dictionary learning algorithm called a sparse autoencoder to generate l...

  25. Source: linkedin.com
    Link: https://www.linkedin.com/posts/warren-wong-code_ai-machinelearning-interpretability-activity-7341814151171723264-1VN9
    Source snippet

    OpenAI's Automated Interpretability: Explaining Neurons in...Jun 20, 2025 — OpenAI's 2023 paper, "Language Models Can Explain Neurons in...

  26. Source: techcrunch.com
    Link: https://techcrunch.com/2026/05/18/anthropic-has-acquired-the-dev-tools-startup-used-by-openai-google-and-cloudflare/

  27. Source: theorempath.com
    Title: mechanistic interpretability
    Link: https://theorempath.com/topics/mechanistic-interpretability
    Source snippet

    Features, Circuits, SAEsby R Sneiderman · 2026 — Mechanistic interpretability for transformers: superposition, sparse autoencoders, the l...

  28. Source: facebook.com
    Link: https://www.facebook.com/datasciencedojo/posts/-openai-just-released-a-groundbreaking-paper-that-pushes-mechanistic-interpretab/860302616519891/
    Source snippet

    OpenAI just released a groundbreaking paper that pushes...Language models can explain neurons in language models use GPT-4 to automatica...

  29. Source: simonwillison.net
    Link: https://simonwillison.net/2023/May/9/explain-neurons/
    Source snippet

    May 9, 2023 — “We generated cluster labels by embedding each neuron explanation using the OpenAI Embeddings API, then clustering them and...

    Published: May 9, 2023

  30. Source: aarnphm.xyz
    Title: mechanistic interpretability
    Link: https://aarnphm.xyz/thoughts/mechanistic-interpretability
    Source snippet

    Aaron's notesJan 6, 2026 — This greatly simplifies resulting circuits by: Handling cross-layer superposition directly; Allowing features...
