
Makemore w/ Andrej Karpathy


Preface #

Next semester, I plan on taking NLP: Self-Supervised Models by Dr. Daniel Khashabi. As much as I believe the instructor is great, I’m also well aware that I may lack a solid foundation to grasp the concepts delivered in class as the semester progresses.

I’m also, hopefully, going to be working under Dr. Burns next semester to build data engineering systems for NLP (specifically dealing with prefill caching and decoding). Having at least a basic understanding of what I’m building systems for is something I think is very important. It may not be, but I believe it to be.

So, I’m going through an excellent series by Andrej Karpathy called “Building makemore”. The series progressively builds upon the foundations that it sets up to create a complex NLP architecture.

I’ve finished the first video, and Andrej also has a few exercises in the video description that make you think about and experiment with the system you’ve built over the course of the video.

Below, I’ll be answering some of the exercise questions. (I owe some credit to Krayorn, whose solutions I looked at for some of the trickier parts of the exercises.)

Exercises #

The raw solutions to the questions posed can be found here.

The following code initializes the list of words, the string-to-integer map, and the integer-to-string map.

import torch
import torch.nn.functional as F  # used below for F.one_hot and F.cross_entropy

words = open('./data/names.txt', 'r').read().splitlines()
chars = sorted(list(set(''.join(words))))
stoi = {s:i+1 for i,s in enumerate(chars)}
stoi['.'] = 0
itos = {i:s for s,i in stoi.items()}
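
As a quick sanity check of the mappings (a trivial example, not from the original notebook):

print(stoi['a'], stoi['z'], stoi['.'])  # 1 26 0
print(itos[1], itos[26], itos[0])       # a z .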

E01: Trigram Language Model #

Question: train a trigram language model, i.e. take two characters as an input to predict the 3rd one. Feel free to use either counting or a neural net. Evaluate the loss; Did it improve over a bigram model?

Initially, I was a little confused about how I’d represent the trigram model. Should I pick a 3D matrix to hold the values? Or a 2D matrix, consolidating the two input characters into a single unit (input “ab”, output “c”, as opposed to input “a”, input “b”, output “c”)?

I realised that the latter is easier to work with on the neural network side, while the former is easier to write out for the deterministic/counting approach.

Deterministic Trigram Language Model #

In this approach, I place the count for every combination of three letters into one cell of a 27×27×27 tensor. I now understand that this may not be the most space-efficient way to do it, but it works.

# Count every trigram (ch1, ch2, ch3), with '.' marking the start/end of a word
N = torch.zeros((27, 27, 27), dtype=torch.int32)

for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        ix3 = stoi[ch3]
        N[ix1, ix2, ix3] += 1

# Add-one smoothing, then normalize over the third character
P = (N+1).float()
P /= P.sum(2, keepdim=True)

This provides us with our probabilities. Notice that I’m summing over the last axis (dim=2). This normalizes each row of the 3D tensor, so that for every two-character context the probabilities over the third character sum to 1.
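
For example, picking any two-character context, the resulting row should be a valid distribution over the third character (a quick check I’d add, not in the original code):

p = P[stoi['a'], stoi['n']]  # distribution over the character following "an"
print(p.shape)               # torch.Size([27])
print(p.sum())               # tensor(1.) up to floating point error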

Now that we have the probability tensor, we just need to compute the log likelihood to get its score. There’s no training per se involved, only counting and evaluation (and sampling, if we want to generate names).

log_likelihood = 0.0
n = 0
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        ix3 = stoi[ch3]
        prob = P[ix1, ix2, ix3]
        logprob = torch.log(prob)
        log_likelihood += logprob
        # print(f"{ch1}{ch2}{ch3}: {prob:.4f} {logprob:.4f}")
        n += 1

print(f"{log_likelihood=}")
nll = -log_likelihood
print(f"{nll=}")
print(f"Average NLL={nll/n}")

Trigram Language Model as a Neural Network #

The neural network approach is definitely more interesting and evolves quite nicely. Even with a single linear layer of 27 output neurons, we reach a decent NLL, nearly rivaling the count-based probability distribution.

# Create the dataset
xs, ys = [], []
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        ix3 = stoi[ch3]
        xs.append((ix1, ix2))
        ys.append(ix3)
xs = torch.tensor(xs)
ys = torch.tensor(ys)
num = xs.shape[0]  # number of (two-character context, target) examples
print("Number of examples: ", num)

# Initialize the network
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27*2, 27), generator=g, requires_grad=True)

We group the two context characters into a single tuple and store that as the input to the neural network. Pay attention to the dimensions of W, our weight matrix: it is now a (54, 27) matrix.

This is because the two one-hot vectors are concatenated: the first character selects one of the first 27 rows of W (row ix1), and the second character selects one of the last 27 rows (row 27 + ix2).

for k in range(100):
    # forward pass
    xenc = F.one_hot(xs, num_classes=27).float() # input to the network; one-hot encoding
    logits = xenc.view(-1, 27*2) @ W # predict log-counts
    counts = logits.exp() # counts, equivalent to N
    probs = counts / counts.sum(1, keepdims=True) # Normalizing for next character probabilities
    loss = -probs[torch.arange(len(xs)), ys].log().mean()

    # backward pass
    W.grad = None # set the gradient to zero
    loss.backward()

    # Update
    W.data += -50 * W.grad

print("Trigram NN Loss: ", loss.item())

The reason we need to alter the view of xenc is so that it fits the (54, 27) shape of the W matrix. We don’t know how many rows we need, so we let PyTorch infer it with the -1, but we do know that we want 54 columns so that the matrix multiplication below checks out:

(N, 54) @ (54, 27) -> (N, 27)
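
To make the shapes concrete, a quick check (assuming the xs tensor built above):

xenc = F.one_hot(xs, num_classes=27).float()
print(xs.shape)                          # (N, 2): two context characters per example
print(xenc.shape)                        # (N, 2, 27): one 27-way one-hot per character
print(xenc.view(-1, 27*2).shape)         # (N, 54): the two one-hots concatenated
print((xenc.view(-1, 27*2) @ W).shape)   # (N, 27): one logit per possible next character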

E02: Evaluate Dev and Test Sets #

Question: split up the dataset randomly into 80% train set, 10% dev set, 10% test set. Train the bigram and trigram models only on the training set. Evaluate them on dev and test splits. What can you see?

We can split the words up like so,

import math
c = len(words)
train_words = words[:math.floor(c*0.8)]
dev_words = words[math.floor(c*0.8):math.floor(c*0.9)]
test_words = words[math.floor(c*0.9):]
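
Strictly speaking, the exercise asks for a random split, while the slices above are taken in file order. A variant that shuffles first would look like this (the seed is my own arbitrary choice, and using it would change the exact losses reported below):

import math
import random

random.seed(42)  # arbitrary seed, purely for reproducibility
shuffled = list(words)
random.shuffle(shuffled)

c = len(shuffled)
train_words = shuffled[:math.floor(c*0.8)]
dev_words = shuffled[math.floor(c*0.8):math.floor(c*0.9)]
test_words = shuffled[math.floor(c*0.9):]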

Bigram Model #

Then, we train our bigram model like before, only on train_words. After that, we can run it on the dev and test splits to see how it performs.
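
For reference, a sketch of that bigram training loop on train_words (same setup and hyperparameters as in the video; assumes the imports and stoi from earlier):

# Build the bigram dataset from the train split only
xs, ys = [], []
for w in train_words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        xs.append(stoi[ch1])
        ys.append(stoi[ch2])
xs = torch.tensor(xs)
ys = torch.tensor(ys)

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g, requires_grad=True)

for k in range(100):
    xenc = F.one_hot(xs, num_classes=27).float()
    logits = xenc @ W
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdims=True)
    loss = -probs[torch.arange(len(xs)), ys].log().mean()
    W.grad = None
    loss.backward()
    W.data += -50 * W.grad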

# Dev set
xs, ys = [], []
for w in dev_words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        xs.append(ix1)
        ys.append(ix2)
xs = torch.tensor(xs)
ys = torch.tensor(ys)

xenc = F.one_hot(xs, num_classes=27).float()
logits = xenc @ W
counts = logits.exp()
probs = counts / counts.sum(1, keepdims=True)
loss = -probs[torch.arange(len(xs)), ys].log().mean()

print("Loss over dev words: ", loss.item())

Loss over dev words: 2.612

# Test set
xs, ys = [], []
for w in test_words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        xs.append(ix1)
        ys.append(ix2)
xs = torch.tensor(xs)
ys = torch.tensor(ys)

xenc = F.one_hot(xs, num_classes=27).float()
logits = xenc @ W
counts = logits.exp()
probs = counts / counts.sum(1, keepdims=True)
loss = -probs[torch.arange(len(xs)), ys].log().mean()

print("Loss over test words: ", loss.item())

Loss over test words: 2.617

Trigram Model #

The training loop is the same one from E01, just run over train_words instead of the full list. Below is the dev-set evaluation.

# Dev set
xs, ys = [], []
for w in dev_words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        ix3 = stoi[ch3]
        xs.append((ix1, ix2))
        ys.append(ix3)
xs = torch.tensor(xs)
ys = torch.tensor(ys)

xenc = F.one_hot(xs, num_classes=27).float()
logits = xenc.view(-1, 27*2) @ W
counts = logits.exp()
probs = counts / counts.sum(1, keepdims=True)
loss = -probs[torch.arange(len(xs)), ys].log().mean()

print("Loss over dev words: ", loss.item())
Loss over dev words: 2.399

Performing the same thing for the test set, we get a loss of 2.398.

E03: Regularization #

Question: use the dev set to tune the strength of smoothing (or regularization) for the trigram model - i.e. try many possibilities and see which one works best based on the dev set loss. What patterns can you see in the train and dev set loss as you tune this strength? Take the best setting of the smoothing and evaluate on the test set once and at the end. How good of a loss do you achieve?

Let’s train the trigram model while regularizing the loss.

As usual, we first create the dataset,

# Create the dataset
xs, ys = [], []
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        ix3 = stoi[ch3]
        xs.append((ix1, ix2))
        ys.append(ix3)
xs = torch.tensor(xs)
ys = torch.tensor(ys)
num = xs.shape[0]  # number of (two-character context, target) examples
print("Number of examples: ", num)

# Initialize the network
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27*2, 27), generator=g, requires_grad=True)

Then, we proceed to train it with the regularization loss.

for k in range(100):
    # forward pass
    xenc = F.one_hot(xs, num_classes=27).float() # input to the network; one-hot encoding
    logits = xenc.view(-1, 27*2) @ W # predict log-counts
    counts = logits.exp() # counts, equivalent to N
    probs = counts / counts.sum(1, keepdims=True) # Normalizing for next character probabilities
    loss = -probs[torch.arange(len(xs)), ys].log().mean() + 0.001 * (W**2).mean() # regularization loss

    # backward pass
    W.grad = None # set the gradient to zero
    loss.backward()

    # Update
    W.data += -50 * W.grad

print("Trigram NN Training Loss: ", loss.item())

Finally, we run it against the dev set to get our losses.

# Dev set
xs, ys = [], []
for w in dev_words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        ix3 = stoi[ch3]
        xs.append((ix1, ix2))
        ys.append(ix3)
xs = torch.tensor(xs)
ys = torch.tensor(ys)

xenc = F.one_hot(xs, num_classes=27).float()
logits = xenc.view(-1, 27*2) @ W
counts = logits.exp()
probs = counts / counts.sum(1, keepdims=True)
loss = -probs[torch.arange(len(xs)), ys].log().mean()

print("Loss over dev words: ", loss.item())

I varied the regularization strength across a few levels and reran the experiment; the observations are addressed in the next section.
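
For reference, the sweep can be written as a loop over regularization strengths. A sketch of what I’d use, where xs_train/ys_train and xs_dev/ys_dev are hypothetical names for the trigram tensors built from train_words and dev_words in the same way as above (note that the snippet earlier builds its dataset from the full words list; strictly, the tuning should train on train_words only):

for reg in [0, 0.001, 0.005, 0.01, 0.1, 0.3, 0.5, 1]:
    # retrain from scratch at each regularization level
    g = torch.Generator().manual_seed(2147483647)
    W = torch.randn((27*2, 27), generator=g, requires_grad=True)

    for k in range(100):
        xenc = F.one_hot(xs_train, num_classes=27).float()
        logits = xenc.view(-1, 27*2) @ W
        counts = logits.exp()
        probs = counts / counts.sum(1, keepdims=True)
        data_loss = -probs[torch.arange(len(xs_train)), ys_train].log().mean()
        loss = data_loss + reg * (W**2).mean()
        W.grad = None
        loss.backward()
        W.data += -50 * W.grad

    # evaluate on the dev split without the regularization term
    with torch.no_grad():
        dev_enc = F.one_hot(xs_dev, num_classes=27).float()
        dev_logits = dev_enc.view(-1, 27*2) @ W
        dev_probs = dev_logits.exp() / dev_logits.exp().sum(1, keepdims=True)
        dev_loss = -dev_probs[torch.arange(len(xs_dev)), ys_dev].log().mean()
    print(f"reg={reg}: train={data_loss.item():.3f} dev={dev_loss.item():.3f}")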

Comparing the level of regularization to the loss. #

Currently, the regularization term added to the loss is a multiple of the mean of the squared weights, i.e. level * (W**2).mean().

| Level | Training Loss | Dev Loss |
| ----- | ------------- | -------- |
| 0     | 2.263         | 2.398    |
| 0.001 | 2.264         | 2.399    |
| 0.005 | 2.264         | 2.399    |
| 0.01  | 2.272         | 2.398    |
| 0.1   | 2.327         | 2.402    |
| 0.3   | 2.399         | 2.42     |
| 0.5   | 2.452         | 2.454    |
| 1     | 2.545         | 2.506    |

Small amounts of regularization barely move either loss; as the strength increases, the training loss steadily worsens, while the dev loss only starts to degrade at the larger levels. Running against the test set showed the same trend.

E04: Indexing in favor of one-hot encoding. #

Question: we saw that our 1-hot vectors merely select a row of W, so producing these vectors explicitly feels wasteful. Can you delete our use of F.one_hot in favor of simply indexing into rows of W?

Tricky for sure. I can’t solve this one without throwing another loop into the mix and scaling up the time complexity.

E05: Cross entropy #

Question: look up and use F.cross_entropy instead. You should achieve the same result. Can you think of why we’d prefer to use F.cross_entropy instead?

The code is only altered in the training segment, where we can drop the manual loss calculation because F.cross_entropy does the softmax and negative log likelihood for us:

for k in range(100):
    # forward pass
    xenc = F.one_hot(xs, num_classes=27).float() # input to the network; one-hot encoding
    logits = xenc.view(-1, 27*2) @ W # predict log-counts
    loss = F.cross_entropy(logits, ys)

    # backward pass
    W.grad = None # set the gradient to zero
    loss.backward()

    # Update
    W.data += -50 * W.grad

print("Trigram NN Training Loss (cross-entropy): ", loss.item())

To my knowledge, F.cross_entropy computes the same quantity as our manual pipeline (softmax followed by negative log likelihood). We’d prefer it because PyTorch can fuse those steps, avoid materializing the intermediate counts and probs tensors, and compute the result in a more numerically stable way, since it normalizes the logits internally rather than exponentiating potentially large values.
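
A quick way to convince yourself they match (a sanity check, not part of the original exercise code):

xenc = F.one_hot(xs, num_classes=27).float()
logits = xenc.view(-1, 27*2) @ W
counts = logits.exp()
probs = counts / counts.sum(1, keepdims=True)
manual_loss = -probs[torch.arange(len(xs)), ys].log().mean()
ce_loss = F.cross_entropy(logits, ys)
print(manual_loss.item(), ce_loss.item())  # the two values should agree to several decimal places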

That’s about it. If you’ve read this far, you should definitely check out Andrej’s series. It’s pretty awesome and super informative.