LLM Neuroanatomy II: Modern LLM Hacking and Hints of a Universal Language?
TL;DR Highlight
A training-free technique (RYS) that duplicates Transformer layers improves modern LLMs without retraining, and the accompanying multilingual analysis of hidden states suggests internal representations converge toward a "universal language" independent of the input language.
Who Should Read
ML engineers who want to improve LLM task performance at inference time without retraining. Researchers interested in multilingual representations and the internal structure of Transformers.
Core Mechanics
- RYS (Repeat Your Self) — simply duplicating middle layers without any weight changes yields measurable improvements on math reasoning and emotional intelligence benchmarks. Validated on Qwen3.5-27B.
- Transformer internals split into three phases: Encoding (layers 0–5, surface-form normalization) → Reasoning (up to ~layer 45, a language-agnostic abstract space) → Decoding (up to ~layer 64, language-specific token generation)
- Multilingual hidden-state analysis shows that in the reasoning phase, content identity dominates language identity — the same concept maps to the same representation regardless of language
- Searching 2 million layer-duplication configs via surrogate modeling + beam search shows that contiguous mid-stack blocks Pareto-dominate complex multi-block compositions on the score-vs-overhead trade-off
- Layer pair (33, 34) is Pareto-optimal — EQ score +0.0945, overhead +1.56%
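The RYS mechanism above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's code: `TinyBlock` is a toy residual stand-in for a real decoder layer, and `rys_duplicate` simply splices a repeated run of the chosen layers into the stack with no weight changes.

```python
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """Toy stand-in for a Transformer decoder layer (residual MLP)."""
    def __init__(self, dim):
        super().__init__()
        self.ff = nn.Linear(dim, dim)

    def forward(self, x):
        return x + torch.tanh(self.ff(x))  # residual, like a real block

def rys_duplicate(layers, start, end):
    """Return a layer stack in which layers[start..end] run twice.

    The repeated block shares weights with the originals, so nothing is
    retrained; overhead comes only from materializing/serving the
    duplicated block."""
    block = list(layers[start:end + 1])
    return nn.ModuleList(list(layers[:end + 1]) + block + list(layers[end + 1:]))

# Toy model with 8 layers; duplicate the mid-stack pair (3, 4)
layers = nn.ModuleList([TinyBlock(16) for _ in range(8)])
rys_layers = rys_duplicate(layers, 3, 4)

x = torch.randn(2, 16)
for layer in rys_layers:
    x = layer(x)
print(len(layers), len(rys_layers), x.shape)  # 8 10 torch.Size([2, 16])
```

For a real open-source model loaded via Hugging Face Transformers, the same splice would typically be applied to the decoder-layer `ModuleList` (e.g. `model.model.layers` on Llama-style architectures) before generation.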
Evidence
- Duplicating layers (33, 34) in Qwen3.5-27B yields +0.0945 on the EQ benchmark with only +1.56% parameter overhead
- The 2M-config search shows contiguous mid-stack blocks form the Pareto frontier; complex multi-block combos are strictly dominated on the score-vs-overhead trade-off
- The language-convergence pattern in the reasoning layers appears consistently across multiple LLMs in the multilingual hidden-state analysis
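The "content identity dominates language identity" claim can be made concrete with a small nearest-neighbor check. The vectors below are synthetic (a shared content direction plus a small language offset), not real model activations, and every name here is illustrative:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def content_dominance(states):
    """states maps (content_id, lang) -> hidden vector.

    Returns the fraction of vectors whose nearest neighbor (by cosine)
    shares *content* rather than language; values near 1.0 correspond to
    the reasoning-phase signature described above."""
    keys = list(states)
    hits = 0
    for k in keys:
        nearest = max((k2 for k2 in keys if k2 != k),
                      key=lambda k2: cosine(states[k], states[k2]))
        hits += nearest[0] == k[0]  # nearest neighbor has same content id
    return hits / len(keys)

# Synthetic "reasoning-phase" geometry: a shared content direction
# dominates a small language-specific offset.
rng = np.random.default_rng(0)
contents = {c: rng.normal(size=32) for c in range(4)}
langs = {l: rng.normal(size=32) for l in ("en", "ko", "fr")}
reasoning = {(c, l): contents[c] + 0.1 * langs[l]
             for c in contents for l in langs}
print(content_dominance(reasoning))  # close to 1.0 for this geometry
```

With real models, `states` would instead hold mean-pooled hidden states of parallel translations at a given layer, and the score would be computed per layer to locate the reasoning phase.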
How to Apply
- For open-source Transformer models (Qwen, Llama, etc.): identify the middle layer range and modify inference code to pass through those layers twice — no training required
- When unsure which layer pair is optimal: use surrogate modeling + beam search to explore efficiently. Far cheaper than full grid search
- RYS is orthogonal to quantization — apply RYS alongside INT4/INT8 quantization to pursue both cost savings and performance improvements simultaneously
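The surrogate-plus-beam-search recipe above can be sketched as follows, restricted to contiguous blocks since those dominate the frontier. `toy_surrogate` is a made-up stand-in for a real surrogate model fit on a handful of benchmark runs; the shape of the score function is an assumption, not the paper's:

```python
import math

def beam_search_block(n_layers, surrogate, beam_width=8, steps=6):
    """Beam search over contiguous duplication blocks (start, end).

    Each step widens beam members by one layer on either side and keeps
    the top `beam_width` under the cheap surrogate score, avoiding a
    full grid search over every configuration."""
    beam = sorted(((s, s) for s in range(n_layers)),
                  key=surrogate, reverse=True)[:beam_width]
    best = beam[0]
    for _ in range(steps):
        cands = set(beam)
        for s, e in beam:
            if s > 0:
                cands.add((s - 1, e))
            if e < n_layers - 1:
                cands.add((s, e + 1))
        beam = sorted(cands, key=surrogate, reverse=True)[:beam_width]
        if surrogate(beam[0]) > surrogate(best):
            best = beam[0]
    return best

def toy_surrogate(block):
    """Hypothetical proxy score: reward blocks centered mid-stack
    (~33.5) and penalize parameter overhead via block length."""
    s, e = block
    mid = (s + e) / 2
    return math.exp(-((mid - 33.5) ** 2) / 0.5) - 0.02 * (e - s + 1)

print(beam_search_block(64, toy_surrogate))  # (33, 34) under this toy score
```

In practice, the surrogate would be validated by running the full benchmark only on the few configurations the search surfaces, which is what makes the approach far cheaper than grid search.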