What is a Transformer?
Understanding the architecture behind GPT
A Transformer is the type of neural network architecture that powers GPT, BERT, ChatGPT, and most modern AI language models. It was introduced in the famous 2017 paper "Attention Is All You Need."
The Big Picture
A transformer takes in a sequence of tokens (words, sub-words, or characters) and produces a sequence of predictions:
Input: "The cat sat on the" → Output: "mat"

It processes the entire input at once (not one word at a time), which makes it fast and able to learn long-range dependencies.
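To make "sequence in, prediction out" concrete, here is a toy sketch of how a prediction is read off. The vocabulary and scores below are made up for illustration; a real model produces one score (a logit) per token in a vocabulary of tens of thousands.

```python
# Toy sketch of next-token prediction (hypothetical vocabulary and scores).
vocab = ["mat", "dog", "moon"]
logits = [2.1, 0.3, -1.0]  # pretend the model scored each candidate

# Greedy decoding: pick the index with the highest score.
best = max(range(len(vocab)), key=lambda i: logits[i])
print(vocab[best])  # "mat"
```

Real systems often sample from these scores instead of always taking the maximum, which makes the output less repetitive.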
The Two Main Parts
A transformer has two main stages:
Input → [Encoder] → [Decoder] → Output
        (understanding)  (generation)

But GPT specifically only uses the decoder part (the right side), which is why it's called a "decoder-only" transformer.
The Core Components
A transformer is made of these parts:
| Component | What it does |
|---|---|
| Embedding | Convert tokens to number vectors |
| Positional Encoding | Tell the model where each token is |
| Self-Attention | Let each token "look at" other tokens |
| Feed-Forward Network | Process each position's information |
| Layer Norm | Keep numbers in a good range |
| Residual Connections | Help gradients flow |
We'll cover each of these in detail.
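The first two rows of the table are just table lookups that get added together. Here is a minimal sketch with toy 3-dimensional vectors (the tables `wte` and `wpe` below are made-up stand-ins for the real learned embedding tables):

```python
# Toy embedding tables: 'wte' maps token ids to vectors,
# 'wpe' maps positions to vectors.
wte = {0: [1.0, 2.0, 3.0], 1: [4.0, 5.0, 6.0]}  # token embeddings
wpe = {0: [0.5, 0.0, 0.5], 1: [0.0, 0.5, 0.0]}  # positional embeddings

def embed(token_id, pos_id):
    # The model's input vector is the element-wise sum of the two lookups.
    return [t + p for t, p in zip(wte[token_id], wpe[pos_id])]

print(embed(1, 0))  # [4.5, 5.0, 6.5]
```

The sum lets a single vector carry both "which token this is" and "where it sits in the sequence".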
What Makes Transformers Special?
1. Self-Attention
This is the key innovation. Each position in the sequence can "attend to" (look at) every other position:
Position 0 ("The"): looks at positions 1, 2, 3, 4...
Position 1 ("cat"): looks at positions 0, 2, 3, 4...

This allows the model to learn relationships between words, even if they're far apart in the sentence.
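A minimal sketch of what "attending" computes, with toy numbers and no learned weights: each position scores its vector against earlier positions' vectors, turns the scores into weights with a softmax, and takes a weighted sum. (A real transformer first multiplies each vector by learned query/key/value matrices; the causal mask shown here, where position i sees only positions 0..i, is what GPT uses.)

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(xs):
    # Toy single-head attention: queries, keys, and values are the raw
    # vectors themselves (no learned projections).
    d = len(xs[0])
    out = []
    for i, q in enumerate(xs):
        # Score position i against positions 0..i (causal mask).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in xs[: i + 1]]
        weights = softmax(scores)
        # Weighted sum of the attended vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, xs[: i + 1]))
                    for j in range(d)])
    return out

seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(self_attention(seq))
```

Note that position 0 can only attend to itself, so its output equals its input; later positions blend in information from everything before them.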
2. Parallel Processing
Unlike older models (RNNs, LSTMs), transformers can process all positions at once. This makes them much faster to train.
3. Scalability
Transformers work well with:
- More layers
- More attention heads
- More data
This is why GPT-3 has 96 layers and GPT-4 is even larger.
How GPT Uses Transformers
GPT is a "decoder-only" transformer:
Input tokens → [Block 1] → [Block 2] → ... [Block N] → Output predictions
(attention + MLP)   (attention + MLP)      (next token)

Each "Block" contains:
- Multi-Head Self-Attention - Let tokens talk to each other
- Feed-Forward Network - Process each token's information
- Residual Connections - Add the input back to output
- Layer Norm - Keep numbers stable
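The wiring of those four pieces can be sketched as follows. The `attend` and `mlp` functions below are placeholders (the real sub-layers have learned weights); the point is the pre-norm residual pattern, where each sub-layer's output is added back onto its input.

```python
import math

def rmsnorm(x, eps=1e-5):
    # Scale x so its root-mean-square is ~1 (keeps numbers stable).
    ms = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(ms + eps) for v in x]

def attend(x):
    # Placeholder for multi-head self-attention.
    return x

def mlp(x):
    # Placeholder feed-forward network; the real one is two learned
    # linear layers with an activation in between.
    return [math.tanh(v) for v in x]

def block(x):
    x = [a + b for a, b in zip(x, attend(rmsnorm(x)))]  # residual 1
    x = [a + b for a, b in zip(x, mlp(rmsnorm(x)))]     # residual 2
    return x

print(block([1.0, -2.0, 3.0]))
```

Because each sub-layer only *adds* a correction to `x`, gradients can flow straight through the additions, which is what makes very deep stacks of blocks trainable.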
Why "Attention Is All You Need"
The original paper title suggests that attention (specifically self-attention) is the only special mechanism needed. No recurrence, no convolution - just attention.
This was revolutionary because:
- Attention is simpler than RNNs
- It captures long-range dependencies better
- It's more parallelizable
The Transformer in microgpt
In microgpt.py, the gpt() function implements a mini transformer:
```python
def gpt(token_id, pos_id, keys, values):
    # 1. Get embeddings
    tok_emb = state_dict['wte'][token_id]
    pos_emb = state_dict['wpe'][pos_id]
    x = [t + p for t, p in zip(tok_emb, pos_emb)]
    x = rmsnorm(x)

    # 2. Process through layers
    for li in range(n_layer):
        x = transformer_block(x, li, keys, values)

    # 3. Output
    logits = linear(x, state_dict['lm_head'])
    return logits
```

We'll break down each part in the following sections.
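For intuition, the `linear` helper used in the output step is plausibly just a matrix-vector product; a sketch (microgpt.py's actual helper may store its weights differently):

```python
# Possible shape of a 'linear' helper: multiply a vector by a weight
# matrix stored as a list of rows, one row per output dimension.
def linear(x, w):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

# Two outputs from a three-dimensional input.
w = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 1.0]]
print(linear([2.0, 3.0, 4.0], w))  # [2.0, 7.0]
```

In `gpt()`, the rows of `lm_head` play the role of `w`: one row per vocabulary token, so the output is one logit per token.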
Summary
A Transformer is a neural network architecture that:
- Processes sequences using self-attention
- Allows every position to "see" every other position
- Is highly parallelizable and scalable
- Powers most modern language models
GPT uses a "decoder-only" transformer that predicts the next token in a sequence.