Documents, BOS, vocabulary, tokenization, and next-token prediction. Understand what the model sees and what it's trying to predict.
Learner can explain what the model is predicting and why, trace from raw text to token IDs, and describe the vocabulary.
Before a GPT can process text, it needs to convert human-readable characters into numbers. Tokenization is the bridge between the world of strings and the world of tensors.
Imagine you have a book, but the computer can only read numbers. Tokenization is like creating a dictionary where every word (or piece of a word) gets a unique number. When the model sees 'hello', it looks up the number for 'hello' in its dictionary — maybe it's 42. Now the model can work with 42 instead of the letters h-e-l-l-o.
Tokenization maps a string of characters to a sequence of integer indices from a fixed vocabulary. In microGPT, we use character-level tokenization: each unique character in the training corpus gets a unique integer ID. The vocabulary size V determines the size of the embedding table. The encode() function maps chars → ints, and decode() maps ints → chars.
# Character-level tokenizer (text is the raw training corpus string)
chars = sorted(list(set(text)))
vocab_size = len(chars)
stoi = {ch: i for i, ch in enumerate(chars)}  # string-to-int lookup
itos = {i: ch for i, ch in enumerate(chars)}  # int-to-string lookup
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])
# Example (the exact IDs depend on the corpus's character set)
print(encode("hello"))               # [46, 43, 50, 50, 53]
print(decode([46, 43, 50, 50, 53]))  # "hello"
What Goes In
A string of text, e.g. 'hello world'
What Comes Out
A list of integers, e.g. [46, 43, 50, 50, 53, 1, 61, 53, 56, 50, 42]
Why It Exists
The model operates on numbers, not text. Tokenization is the necessary first step that converts human language into a format the neural network can process. Without it, there's no way to feed text into the model.
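The string-in, integers-out mapping is lossless in both directions: decode(encode(s)) returns s for any string whose characters appear in the vocabulary. A minimal self-contained sketch of that round trip (the corpus string here is a made-up stand-in, not microGPT's training data):

```python
# Round-trip check for a character-level tokenizer.
# The corpus below is illustrative; any string works.
text = "hello world"
chars = sorted(set(text))  # unique characters, sorted for determinism
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: "".join(itos[i] for i in l)

ids = encode("hello world")
assert decode(ids) == "hello world"  # lossless round trip
print(len(chars))  # vocabulary size: 8 unique chars in "hello world"
```

Note that encode() raises a KeyError for any character not seen in the corpus, which is why the vocabulary is built from the full training text up front.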
What Breaks If You Remove It
If you remove tokenization, the model has no input. If you change the vocabulary (e.g., switch from character-level to word-level), the entire embedding table changes size, and all downstream dimensions must be updated.
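To make the vocabulary-size change concrete, here is a hedged sketch contrasting character-level and word-level tokenization on the same corpus. The corpus string and the naive whitespace str.split() word splitter are illustrative assumptions, not microGPT's actual code:

```python
# Compare vocabulary sizes under two tokenization schemes.
# The corpus is a made-up example; a real run would use the training text.
text = "the cat sat on the mat the cat ran"

# Character-level: one ID per unique character (including the space)
char_vocab = sorted(set(text))

# Word-level (naive whitespace split): one ID per unique word
word_vocab = sorted(set(text.split()))

print(len(char_vocab))  # 11 unique characters
print(len(word_vocab))  # 6 unique words
# The embedding table has vocab_size rows, so switching schemes changes
# its shape -- and the shape of the final projection back to logits, too.
```

On a tiny corpus the word vocabulary can be smaller, but on real text it grows far larger than the character vocabulary, which is the trade-off the exercise below asks you to reason through.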
If your training text contains 65 unique characters, what is the vocabulary size?
Articulate It
Explain to a colleague what tokenization does in microGPT, why character-level tokenization is used, and what would change if you switched to a different tokenization scheme.
Modify It
Modify the tokenizer to use word-level tokenization instead of character-level. What changes in the vocabulary size? What downstream code needs to be updated?