LLM Architecture Gallery
TL;DR Highlight
Dr. Sebastian Raschka put together a one-page gallery with architecture diagrams and key specs for dozens of major LLMs — Llama, DeepSeek, Qwen, Gemma and more — so you can compare design decisions at a glance.
Who Should Read
ML engineers who train or fine-tune LLMs directly, and AI developers who want to quickly understand architecture differences when choosing an open-source model.
Core Mechanics
- Llama 3 8B is a standard dense model using GQA (Grouped Query Attention, where groups of query heads share one set of key/value heads, which shrinks the KV cache) and RoPE (rotary positional embeddings). It serves as the baseline for comparing other models such as OLMo 2.
- DeepSeek V3 and R1 use a Sparse MoE (Mixture of Experts) architecture that activates only 37B out of 671B total parameters, combined with MLA (Multi-head Latent Attention). They also use a dense prefix (the first few layers keep an ordinary dense feed-forward block) and a Shared Expert (an expert every token passes through, alongside the routed ones), choices aimed at making a model this large practical to serve.
- DeepSeek R1 is not a new base architecture — it keeps the same structure as V3 but changes only the training recipe to specialize in reasoning. It's a case where training methodology, not architectural innovation, drives the performance gap.
- Gemma 3 27B uses a 5:1 hybrid attention pattern: for every 5 layers that use Sliding Window Attention (SWA), 1 layer uses global attention, so only 1 in 6 attention layers is global. That is a significantly higher share of local attention than in Gemma 2.
- Llama 4 Maverick is an MoE model activating only 17B out of 400B total parameters, following DeepSeek V3's design direction but using GQA for attention and opting for fewer but larger experts.
- The Qwen3 series ranges from 235B MoE down to 4B Dense, consistently applying QK-Norm (normalizing query and key vectors to improve training stability) across the entire lineup. The 235B-A22B MoE version is structurally very similar to DeepSeek V3 but drops the Shared Expert.
- OLMo 2 7B takes an unusual normalization approach: instead of the standard Pre-norm, it normalizes the output of each attention and feed-forward block inside the residual branch, before that output is added back to the residual stream, which improves training stability. It also sticks with classic MHA (Multi-Head Attention) rather than GQA.
- The gallery provides config.json links, technical report links, parameter counts, dates, decoder types, attention mechanisms, and key design points for each model in a fact-sheet format — no need to dig through the original papers for quick comparisons.
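Two of the mechanics above reduce to simple arithmetic, sketched here in Python. The first function estimates the KV cache per token, to show why sharing key/value heads (GQA) saves memory compared with MHA. The second lays out a 5:1 local-to-global layer schedule. The Llama 3 8B numbers (32 layers, 32 query heads, 8 key/value heads, head dimension 128) come from the public model config rather than from this summary. The function names, the 2-byte value size, and the choice to put the global layer last in each group of six are assumptions made for illustration.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Rough KV cache size for one token (assumes 2-byte values such as bf16)."""
    # Both keys and values are cached in every layer, hence the factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def attention_schedule(n_layers, local_per_global=5):
    """Label each layer 'local' (sliding window) or 'global'.

    Assumption: the global layer closes each group of local layers.
    """
    period = local_per_global + 1
    return ["global" if (i + 1) % period == 0 else "local" for i in range(n_layers)]


# Llama 3 8B shape, compared against a hypothetical MHA variant of the same
# model in which every one of the 32 query heads keeps its own K/V head.
mha_bytes = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)
gqa_bytes = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)

print(f"MHA: {mha_bytes // 1024} KiB per token")
print(f"GQA: {gqa_bytes // 1024} KiB per token")
print(f"Reduction factor: {mha_bytes // gqa_bytes}x")

# A 12-layer toy stack with a 5:1 local-to-global ratio.
print(attention_schedule(12))
```

With these numbers, going from 32 to 8 key/value heads cuts the per-token cache by a factor of 4. In the 12-layer toy schedule, only layers 6 and 12 are global.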
Evidence
- Commenters noted this gallery could serve a similar role to the 'Neural Network Zoo' (asimovinstitute.org) that visualized dozens of neural network architectures and became widely used as an educational resource — many felt an LLM version was long overdue.
- One commenter shared a link via zoomhub.net (https://zoomhub.net/LKrpB) for zooming in on the architecture diagrams, offering a practical workaround since the original image has so much detail that click-to-zoom alone is cumbersome.
- There were requests to add an 'evolution lineage' or 'family tree' visualization showing which models influenced which, along with a scale view for visually comparing parameter counts — noting that the current gallery makes it hard to track the chronological flow of architectural innovations.
- Someone asked the author whether creating this gallery revealed anything surprising or unexpected about LLM architectures — probing whether the curatorial process itself generated new insights beyond just compilation.
How to Apply
- When choosing a base model for fine-tuning, use this gallery to quickly check the attention method (GQA vs MHA vs MLA) and normalization approach (Pre-norm vs Post-norm) instead of reading the full paper.
- When comparing inference cost, use the active parameter count of an MoE model (e.g., DeepSeek V3's 37B, Maverick's 17B) rather than the total as a first estimate of per-token compute. Memory is a separate question: all experts still have to be stored, so the weight footprint follows the total parameter count.
- When choosing between MoE models, check whether they use a Shared Expert (DeepSeek V3 does; Qwen3 235B-A22B does not), since this detail can affect how the routed experts specialize and, in turn, overall performance.
- Use this gallery as a teaching aid when explaining 'what makes LLMs different from each other' to non-experts. Having the config.json and technical report links handy for deeper dives is a bonus.
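The checks above can be partly automated once you have a model's config.json. This is a minimal Python sketch, assuming Hugging Face style keys: `num_attention_heads` and `num_key_value_heads` for the head counts, plus `kv_lora_rank` and `n_shared_experts` as they appear in DeepSeek-style configs. Key names differ between model families, so treat them as assumptions to verify against the actual file. The three dictionaries are trimmed, hypothetical stand-ins, not real config files.

```python
def attention_kind(cfg):
    """Guess the attention variant from a config dict (heuristic, not exhaustive)."""
    # Assumption: a latent KV rank field signals Multi-head Latent Attention.
    if "kv_lora_rank" in cfg:
        return "MLA"
    heads = cfg["num_attention_heads"]
    kv_heads = cfg.get("num_key_value_heads", heads)
    if kv_heads == heads:
        return "MHA"  # every query head has its own K/V head
    if kv_heads == 1:
        return "MQA"  # all query heads share a single K/V head
    return "GQA"      # groups of query heads share a K/V head


def uses_shared_expert(cfg):
    """True if the config declares at least one always-active shared expert."""
    return (cfg.get("n_shared_experts") or 0) > 0


def active_fraction(active_params_b, total_params_b):
    """Share of parameters used per token, given counts in billions."""
    return active_params_b / total_params_b


# Hypothetical, trimmed configs for illustration only.
llama_like = {"num_attention_heads": 32, "num_key_value_heads": 8}
olmo_like = {"num_attention_heads": 32, "num_key_value_heads": 32}
deepseek_like = {"num_attention_heads": 128, "kv_lora_rank": 512, "n_shared_experts": 1}

for name, cfg in [("llama_like", llama_like), ("olmo_like", olmo_like),
                  ("deepseek_like", deepseek_like)]:
    print(name, attention_kind(cfg), "shared expert:", uses_shared_expert(cfg))

# Active vs total parameters quoted in the summary above.
print(f"DeepSeek V3: {active_fraction(37, 671):.1%} of parameters active per token")
print(f"Llama 4 Maverick: {active_fraction(17, 400):.1%} of parameters active per token")
```

The active fraction is only a proxy for per-token compute. It says nothing about the memory needed to hold all experts, which is why the total parameter count still matters for deployment.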