MODULE 0

Trailhead

Orientation, prerequisites, and what a GPT is actually trying to do. Set up your environment and understand the landscape before the hike begins.

By the end of this module, you will understand the project scope, have a working environment, and be able to articulate what next-token prediction means at a high level.

Lessons


From Text to Tokens

Before a GPT can process text, it needs to convert human-readable characters into numbers. Tokenization is the bridge between the world of strings and the world of tensors.

Plain English

Imagine you have a book, but the computer can only read numbers. Tokenization is like creating a dictionary where every word (or piece of a word) gets a unique number. When the model sees 'hello', it looks up the number for 'hello' in its dictionary — maybe it's 42. Now the model can work with 42 instead of the letters h-e-l-l-o.

Engineering Explanation

Tokenization maps a string of characters to a sequence of integer indices from a fixed vocabulary. In microGPT, we use character-level tokenization: each unique character in the training corpus gets a unique integer ID. The vocabulary size V determines the size of the embedding table. The encode() function maps chars → ints, and decode() maps ints → chars.

Code Walkthrough

# Character-level tokenizer
chars = sorted(set(text))  # unique characters in the corpus, in a stable order
vocab_size = len(chars)
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

# Example
print(encode("hello"))  # [46, 43, 50, 50, 53]
print(decode([46, 43, 50, 50, 53]))  # "hello"

What Goes In

A string of text, e.g. 'hello world'

What Comes Out

A list of integers, e.g. [46, 43, 50, 50, 53, 1, 61, 53, 56, 50, 42]
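The round trip can be sketched end to end on a toy corpus. Note that the IDs below differ from the example above: the vocabulary, and therefore every ID, depends entirely on which characters appear in the training text.

```python
# Toy corpus; the IDs below are specific to this corpus.
text = "hello world"
chars = sorted(set(text))          # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

encode = lambda s: [stoi[c] for c in s]
decode = lambda l: "".join(itos[i] for i in l)

ids = encode("hello")
print(ids)          # [3, 2, 4, 4, 5]
print(decode(ids))  # hello
```

Encoding followed by decoding always recovers the original string, as long as every character was present in the corpus used to build the vocabulary.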

Why It Exists

The model operates on numbers, not text. Tokenization is the necessary first step that converts human language into a format the neural network can process. Without it, there's no way to feed text into the model.

What Breaks If You Remove It

If you remove tokenization, the model has no input. If you change the vocabulary (e.g., switch from character-level to word-level), the entire embedding table changes size, and all downstream dimensions must be updated.
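The coupling between vocabulary and embedding table can be made concrete with a minimal sketch. The embedding width (`n_embd = 4` here) is a hypothetical value chosen for illustration, and NumPy stands in for whatever tensor library the model uses:

```python
import numpy as np

text = "hello world"
vocab_size = len(set(text))        # 8 unique characters
n_embd = 4                         # hypothetical embedding width

# The embedding table has one row per token ID.
emb = np.random.randn(vocab_size, n_embd)
print(emb.shape)                   # (8, 4)

# Switching to word-level tokenization changes vocab_size,
# so the table (and anything sized by it) must be rebuilt.
word_vocab_size = len(set(text.split()))   # 2 words: 'hello', 'world'
word_emb = np.random.randn(word_vocab_size, n_embd)
print(word_emb.shape)              # (2, 4)
```

This is why a tokenizer change is never a local edit: the embedding table, and the output projection that maps back to vocabulary logits, both have a dimension equal to the vocabulary size.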

Common Misconceptions

  • Tokens are not always words — in character-level tokenization, each token is a single character.
  • The token IDs are arbitrary — 'a' being 0 and 'b' being 1 has no semantic meaning.
  • Tokenization is not learned — it's a fixed preprocessing step defined before training.
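The second misconception, that IDs carry meaning, can be demonstrated directly: any permutation of the vocabulary produces different IDs but an equally valid tokenizer, since encoding and decoding still round-trip perfectly.

```python
text = "abcabc"
chars = sorted(set(text))    # ['a', 'b', 'c']
shuffled = chars[::-1]       # any permutation works equally well

for mapping in (chars, shuffled):
    stoi = {ch: i for i, ch in enumerate(mapping)}
    itos = {i: ch for i, ch in enumerate(mapping)}
    ids = [stoi[c] for c in text]
    assert "".join(itos[i] for i in ids) == text  # round trip holds either way
    print(ids)               # [0, 1, 2, 0, 1, 2] then [2, 1, 0, 2, 1, 0]
```

The model learns meaning in the embedding vectors attached to each ID, not in the integer values themselves.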

Quick Check

If your training text contains 65 unique characters, what is the vocabulary size?

Articulate It

Explain to a colleague what tokenization does in microGPT, why character-level tokenization is used, and what would change if you switched to a different tokenization scheme.

Modify It

Modify the tokenizer to use word-level tokenization instead of character-level. What changes in the vocabulary size? What downstream code needs to be updated?
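As a starting point (not a full solution), here is one minimal way to sketch the word-level variant. It splits on whitespace only; a real word-level tokenizer would also need to handle punctuation and out-of-vocabulary words, which this ignores.

```python
# Minimal word-level sketch: whitespace split only.
text = "the quick brown fox jumps over the lazy dog"
words = sorted(set(text.split()))
vocab_size = len(words)            # 8 distinct words ("the" appears twice)
stoi = {w: i for i, w in enumerate(words)}
itos = {i: w for i, w in enumerate(words)}

encode = lambda s: [stoi[w] for w in s.split()]
decode = lambda l: " ".join(itos[i] for i in l)

print(vocab_size)                  # 8
print(decode(encode(text)))        # round-trips back to the original text
```

Compared with character-level tokenization, the vocabulary grows from tens of symbols to the number of distinct words in the corpus, and the embedding table grows with it.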