
Introduction

Welcome to microgpt

microgpt is a complete, working GPT (Generative Pre-trained Transformer) model written in just 250 lines of pure Python with zero dependencies. No PyTorch, no NumPy, nothing but Python's standard library.

This is an educational project by Andrej Karpathy - one of the pioneers of deep learning. The goal is to show exactly how a language model works under the hood.

What You'll Learn

By reading through this documentation, you'll understand:

Tokenization: How to convert text into numbers a computer can process
Neural Networks: How computers learn patterns from data
Autograd: How computers calculate gradients automatically
Transformers: The architecture behind GPT, BERT, and ChatGPT
Attention: How models "focus" on relevant parts of text
Training: How models learn from examples
Inference: How models generate new text
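To make the first concept concrete, here is a sketch of character-level tokenization, the kind of scheme a tiny model like this might use. The helper names (stoi, itos, encode, decode) and the sample text are illustrative, not microgpt's actual API.

```python
# A toy character-level tokenizer: every unique character in the
# training text gets its own integer id.

text = "emma olivia ava"

# Build the vocabulary from the unique characters, in sorted order.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # string -> int
itos = {i: ch for ch, i in stoi.items()}       # int -> string

def encode(s):
    """Turn text into a list of token ids."""
    return [stoi[ch] for ch in s]

def decode(ids):
    """Turn token ids back into text."""
    return "".join(itos[i] for i in ids)

ids = encode("ava")
assert decode(ids) == "ava"  # encoding round-trips losslessly
```

Real GPTs use subword tokenizers with vocabularies of tens of thousands of tokens, but the idea is the same: text in, integers out, and back again.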

How Simple Is It?

Here's the heart of the training loop, simplified to a few lines of pseudocode:

for step in range(num_steps):
    # Get a training example
    tokens = encode(document)

    # Forward pass - make predictions
    loss = forward_pass(tokens)

    # Backward pass - figure out how to improve
    loss.backward()

    # Update weights
    update_weights()

That's it! That's the heart of machine learning.
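The loss.backward() call in the loop above is what autograd provides: each value remembers how it was computed, so gradients can flow backwards through the whole computation. Here is a minimal sketch of that idea, supporting only addition and multiplication; it illustrates the technique, not microgpt's actual implementation.

```python
class Value:
    """A scalar that records how it was computed, so backward()
    can apply the chain rule through the computation graph."""

    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            # d(a+b)/da = 1 and d(a+b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # d(a*b)/da = b and d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then run the chain rule in reverse.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

# loss = w*x + w, with w = 3 and x = 2, so d(loss)/dw = x + 1 = 3
w, x = Value(3.0), Value(2.0)
loss = w * x + w
loss.backward()
assert w.grad == 3.0
```

Once you have gradients, "update weights" is just nudging each parameter a small step against its gradient.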

How to Run

python microgpt.py

You'll see it download a dataset of names, train for a while, then generate new names.

Customizing the Model

You can change the model size:

# Small model (fast)
python microgpt.py --n_embd 16 --n_layer 1

# Medium model
python microgpt.py --n_embd 32 --n_layer 3 --num_steps 5000

# Large model (slower but smarter)
python microgpt.py --n_embd 64 --n_layer 6 --num_steps 10000
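Flags like these are typically handled with Python's standard argparse module. A sketch of how the options above could be parsed; the default values here are illustrative guesses, not microgpt's actual defaults.

```python
import argparse

parser = argparse.ArgumentParser(description="Train a tiny GPT")
parser.add_argument("--n_embd", type=int, default=16,
                    help="embedding dimension (model width)")
parser.add_argument("--n_layer", type=int, default=1,
                    help="number of transformer layers (model depth)")
parser.add_argument("--num_steps", type=int, default=1000,
                    help="number of training steps")

# Simulate: python microgpt.py --n_embd 32 --n_layer 3
args = parser.parse_args(["--n_embd", "32", "--n_layer", "3"])
assert args.n_embd == 32 and args.num_steps == 1000  # unset flags keep defaults
```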

Documentation Roadmap

Start with this page, then follow the remaining pages in order.
