Show HN: I built a tiny LLM to demystify how language models work
TL;DR Highlight
This educational project lets you build a mini LLM with 8.7 million parameters that speaks as a guppy fish character, from scratch, in about 5 minutes in a single Colab notebook, with the goal of demystifying the black-box nature of LLMs.
Who Should Read
Developers who are curious about how LLMs work internally but find large model codebases too complex to approach. Especially suited to beginners who want to follow the entire pipeline, from tokenizer to training loop, in code, with no PhD or high-performance GPU required.
Core Mechanics
- GuppyLM is an extremely small LLM of only 8.7 million parameters that generates short, lowercase sentences about water, food, light, and life in a fish tank, in the voice of its guppy character.
- The training data consists of 60,000 synthetic conversations across 60 topics; no internet-crawled data is used.
- The architecture is a deliberately simple Transformer: 6 layers, hidden dimension 384, 6 attention heads, and a feedforward network (FFN) dimension of 768.
- The entire pipeline (data generation → tokenizer training → model architecture definition → training loop → inference) runs on a single GPU in Google Colab in about 5 minutes.
- Due to its small size, the model can run in a browser, and the code is kept as simple as possible to allow for end-to-end understanding of the entire LLM workflow without complex infrastructure.
- Because the training data is all lowercase, uppercase input (e.g., 'HELLO') produces nonsensical responses. This limitation is itself a good educational example of how directly the tokenizer and training data determine model behavior.
- It is trained with next-token prediction alone; notably, consistent in-character conversation emerges without additional techniques such as RLHF or fine-tuning.
- The repository includes separate Jupyter notebooks for training (train_guppylm.ipynb) and inference (use_guppylm.ipynb), allowing you to examine the training and execution steps separately.
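The configuration listed above is enough to sanity-check the stated parameter count. Here is a back-of-the-envelope calculation; the vocabulary size and context length are assumptions, since they are not given here, and the repo's actual values may differ:

```python
# Rough parameter count for a GuppyLM-style config
# (6 layers, d_model 384, FFN 768). Vocab and context length are assumed.
d_model, n_layers, d_ff = 384, 6, 768
vocab, seq_len = 4096, 256            # assumptions, not from the repo

emb   = vocab * d_model                        # token embedding (output head tied)
pos   = seq_len * d_model                      # learned positional embeddings
attn  = 4 * d_model * d_model + 4 * d_model    # Q, K, V, out projections + biases
ffn   = 2 * d_model * d_ff + d_ff + d_model    # two linear layers + biases
norms = 2 * 2 * d_model                        # two LayerNorms per block

total = emb + pos + n_layers * (attn + ffn + norms) + 2 * d_model  # + final LayerNorm
print(f"{total / 1e6:.2f}M parameters")
```

With a 4K vocabulary the count lands near the stated 8.7M, which suggests most of the budget sits in the 6 transformer blocks rather than the embedding table.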
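The uppercase-input failure described above can be reproduced with a toy tokenizer. This is an illustrative sketch, not GuppyLM's actual tokenizer, but the failure mode is the same: tokens never seen in training map to an unknown id, so the model has nothing meaningful to condition on.

```python
# Toy whitespace tokenizer built from lowercase-only text (illustration only;
# GuppyLM's real tokenizer is different, but the failure mode is identical).
corpus = "guppy likes food and light in the warm water"
vocab = {tok: i for i, tok in enumerate(sorted(set(corpus.split())), start=1)}
UNK = 0  # id reserved for unknown tokens

def encode(text: str) -> list[int]:
    return [vocab.get(tok, UNK) for tok in text.split()]

print(encode("guppy likes water"))  # known lowercase tokens -> real ids
print(encode("HELLO GUPPY"))        # uppercase never seen in training -> [0, 0]
```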
Evidence
- Comparisons to Andrej Karpathy's microgpt and minGPT came up; commenters felt GuppyLM has lower code complexity and a more concrete goal (learning a character), making it more accessible for beginners.
- There was feedback that developers unfamiliar with concepts like multi-head attention, ReLU FFNs, LayerNorm, and learned positional embeddings would find the code hard to follow without further documentation.
- The analogy of GuppyLM becoming an LLM education tool, as Minix was for OS education, was suggested.
- The site https://bbycroft.net/llm was recommended as a helpful resource: it visualizes LLM layers in 3D to aid understanding of internal operations.
- One commenter shared their experience of building their own mini LLM using Milton's "Paradise Lost" as training data and reported that this educational approach was effective for understanding LLMs.
- A commenter developing a system where multiple agents interact in a shared world reported that behavior changes dramatically with resource constraints, other agents, and persistent memory, suggesting a greater focus on environment design than model optimization.
- Some example responses appeared to simply reproduce the training data, raising the question of how the model responds to completely new questions not present in the training data.
- A humorous comment, that Guppy's answer "The meaning of life is food" received more support than responses from models 10,000 times larger, drew a lot of positive feedback.
How to Apply
- If you're new to LLM internal structures, running the train_guppylm.ipynb notebook directly in Google Colab will allow you to experience the entire process from tokenizer configuration → model architecture definition → training loop → inference within 5 minutes, and you can immediately see the impact of modifying the code at each step.
- If you want to experiment with the potential of small, domain-specific models, adapt the 60K synthetic-conversation generation method to produce your own domain data and train it on the same architecture, to see firsthand how data quality and diversity directly affect model output.
- If you need to educate your team about LLM concepts, you can leverage GuppyLM's Minix-like philosophy (simple but functional implementation) to use it as study material to explain key concepts like multi-head attention and positional embedding at the code level.
- If you need a demo of an LLM running in a browser, keeping the model as small as GuppyLM's 8.7 million parameters allows client-side execution via WebAssembly or ONNX conversion; the project's architecture settings (6 layers, hidden dimension 384, 6 heads) are a reasonable starting point to scale from.
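Underlying all of the suggestions above is a single training objective: next-token prediction. A count-based bigram model (a hypothetical toy, not GuppyLM's code) is the smallest possible illustration of that objective; GuppyLM optimizes the same thing with a 6-layer transformer instead of a lookup table:

```python
from collections import Counter, defaultdict

# "Training": count which token follows each token in a toy corpus.
# This is next-token prediction in its simplest, count-based form.
corpus = "the water is warm and the water is clean".split()
counts: defaultdict[str, Counter] = defaultdict(Counter)
for cur, nxt in zip(corpus, corpus[1:]):
    counts[cur][nxt] += 1

def generate(token: str, steps: int = 3) -> str:
    """Greedy decoding: always emit the most frequent next token."""
    out = [token]
    for _ in range(steps):
        if token not in counts:
            break  # token was never followed by anything in training
        token = counts[token].most_common(1)[0][0]
        out.append(token)
    return " ".join(out)

print(generate("the"))  # -> "the water is warm"
```

Like GuppyLM, this toy can only recombine what its training data contains, which also illustrates why some of Guppy's responses look like reproductions of the training set.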
Related Papers
Shai-Hulud Themed Malware Found in the PyTorch Lightning AI Training Library
PyTorch Lightning packages 2.6.2 and 2.6.3 delivered credential-stealing malware via a supply chain attack.
Alignment whack-a-mole: Finetuning activates recall of copyrighted books in LLMs
Fine-tuning even safety-aligned LLMs can bypass safeguards and reproduce copyrighted text verbatim, revealing prompt filtering alone isn't enough to prevent copyright infringement.
Show HN: MacMind – A transformer neural network in HyperCard on a 1989 Macintosh
This is an educational project that implements a single-layer Transformer with 1,216 parameters in HyperTalk (a 1987 scripting language) and trains it on a real Macintosh SE/30. It demonstrates that the core mathematics of modern LLMs works the same on hardware from over 30 years ago.
MegaTrain: Full Precision Training of 100B+ Parameter LLMs on a Single GPU
MegaTrain is a system that uses CPU memory as the primary storage and the GPU solely as a compute engine, enabling full-precision training of 120B-parameter models on a single H200 GPU.
Nanocode: The best Claude Code that $200 can buy in pure JAX on TPUs
An open-source library for training a 1.3B-parameter coding-agent model from scratch on $200 of TPU compute, following Anthropic's Constitutional AI approach. It can serve as a hands-on reference for developers who want to understand the entire AI training pipeline directly.