Implementing GPT in NumPy
https://jaykmody.com/blog/gpt-from-scratch
Russian version:
Пишем GPT в 60 строк NumPy (часть 1 из 2)
Пишем GPT в 60 строк NumPy (окончание, 2/2)
In this post, we'll implement a GPT from scratch in just 60 lines of numpy. We'll then load the trained GPT-2 model weights released by OpenAI into our implementation and generate some text.
Note:
This post assumes familiarity with Python, NumPy, and some basic experience training neural networks.
This implementation is missing tons of features on purpose to keep it as simple as possible while remaining complete. The goal is to provide a simple yet complete technical introduction to the GPT as an educational tool.
The GPT architecture is just one small part of what makes LLMs what they are today.
All the code for this blog post can be found at github.com/jaymody/picoGPT.
EDIT (Feb 9th, 2023): Added a "What's Next" section and updated the intro with some notes.
EDIT (Feb 28th, 2023): Added some additional sections to "What's Next".
What is a GPT?
GPT stands for Generative Pre-trained Transformer. It's a type of neural network architecture based on the Transformer. Jay Alammar's How GPT3 Works is an excellent introduction to GPTs at a high level, but here's the tl;dr:
Generative: A GPT generates text.
Pre-trained: A GPT is trained on lots of text from books, the internet, etc ...
Transformer: A GPT is a decoder-only transformer neural network.
Large Language Models (LLMs) like OpenAI's GPT-3, Google's LaMDA, and Cohere's Command XLarge are just GPTs under the hood. What makes them special is they happen to be 1) very big (billions of parameters) and 2) trained on lots of data (hundreds of gigabytes of text).
Fundamentally, a GPT generates text given a prompt. Even with this very simple API (input = text, output = text), a well-trained GPT can do some pretty awesome stuff like write your emails, summarize a book, give you instagram caption ideas, explain black holes to a 5 year old, code in SQL, and even write your will.
So that's a high-level overview of GPTs and their capabilities. Let's dig into some more specifics.
Input / Output
The function signature for a GPT looks roughly like this:
Input
The input is some text represented by a sequence of integers that map to tokens in the text:
Tokens are sub-pieces of the text, which are produced using some kind of tokenizer. We can map tokens to integers using a vocabulary:
In short:
We have a string.
We use a tokenizer to break it down into smaller pieces called tokens.
We use a vocabulary to map those tokens to integers.
In practice, we use more advanced methods of tokenization than simply splitting by whitespace, such as Byte-Pair Encoding or WordPiece, but the principle is the same:
There is a vocab that maps string tokens to integer indices
There is an encode method that converts str -> list[int]
There is a decode method that converts list[int] -> str
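To make the three pieces concrete, here's a minimal sketch of a toy tokenizer. The vocabulary and the `WhitespaceTokenizer` class are made up for illustration (a real GPT uses BPE, not whitespace splitting), but the `vocab` / `encode` / `decode` shape is the same:

```python
# Toy vocabulary (hypothetical): the index of each token is its integer ID.
vocab = ["all", "not", "heroes", "the", "wear", ".", "capes"]


class WhitespaceTokenizer:
    """Illustrative tokenizer that splits on single spaces (not BPE)."""

    def __init__(self, vocab):
        self.vocab = vocab
        self.token_to_id = {tok: i for i, tok in enumerate(vocab)}

    def encode(self, text):
        # str -> list[int]
        return [self.token_to_id[tok] for tok in text.split(" ")]

    def decode(self, ids):
        # list[int] -> str
        return " ".join(self.vocab[i] for i in ids)


tokenizer = WhitespaceTokenizer(vocab)
ids = tokenizer.encode("not all heroes wear")  # [1, 0, 2, 4]
text = tokenizer.decode(ids)                   # "not all heroes wear"
```

Note that this toy version raises a `KeyError` on any word outside the vocabulary, which is one of the problems sub-word tokenizers are designed to avoid.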
Output
The output is a 2D array, where output[i][j] is the model's predicted probability that the token at vocab[j] is the next token inputs[i+1]. For example:
To get a next token prediction for the whole sequence, we simply take the token with the highest probability in output[-1]:
Taking the token with the highest probability as our prediction is known as greedy decoding or greedy sampling.
The task of predicting the next logical word in a sequence is called language modeling. As such, we can call a GPT a language model.
Generating a single word is cool and all, but what about entire sentences, paragraphs, etc ...?
Generating Text
Autoregressive
We can generate full sentences by iteratively getting the next token prediction from our model. At each iteration, we append the predicted token back into the input:
This process of predicting a future value (regression), and adding it back into the input (auto), is why you might see a GPT described as autoregressive.
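The loop described above can be sketched as follows. Since we haven't built a real model yet, `toy_gpt` is a hypothetical stand-in that always favors the token ID one higher than the current one; only the shape of its output (`[n_seq, n_vocab]` probabilities) matters here:

```python
import numpy as np


def toy_gpt(inputs, n_vocab=8):
    """Stand-in for a real GPT (hypothetical): returns [n_seq, n_vocab]
    probabilities that always favor token (current_id + 1) % n_vocab."""
    probs = np.full((len(inputs), n_vocab), 0.01)
    for i, tok in enumerate(inputs):
        probs[i, (tok + 1) % n_vocab] = 1.0
    return probs / probs.sum(axis=-1, keepdims=True)


def generate(inputs, n_tokens_to_generate):
    inputs = list(inputs)  # copy so the caller's list is not mutated
    for _ in range(n_tokens_to_generate):
        output = toy_gpt(inputs)              # model forward pass
        next_id = int(np.argmax(output[-1]))  # greedy: most likely next token
        inputs.append(next_id)                # feed the prediction back in
    return inputs[len(inputs) - n_tokens_to_generate:]  # only the new tokens


generated = generate([0, 1], 3)  # [2, 3, 4]
```

Notice the whole sequence is re-fed to the model on every iteration, which is simple but wasteful.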
Sampling
We can introduce some stochasticity (randomness) to our generations by sampling from the probability distribution instead of being greedy:
This allows us to generate different sentences given the same input. When combined with techniques like top-k, top-p, and temperature, which modify the distribution prior to sampling, the quality of our outputs is greatly increased. These techniques also introduce some hyperparameters that we can play around with to get different generation behaviors (for example, increasing temperature makes our model take more risks and thus be more "creative").
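As a rough sketch of how temperature and top-k fit together with sampling (the function name and its parameters are our own for illustration, and top-p is left out for brevity):

```python
import numpy as np


def sample_next_token(probs, temperature=1.0, top_k=None, rng=None):
    """Sample a token ID from a 1D probability vector.

    temperature and top_k are illustrative knobs, not part of the post's code.
    """
    rng = np.random.default_rng() if rng is None else rng
    logits = np.log(probs + 1e-12) / temperature  # <1 sharpens, >1 flattens
    if top_k is not None:
        kth_largest = np.sort(logits)[-top_k]
        logits = np.where(logits < kth_largest, -np.inf, logits)  # keep top k
    logits = logits - np.max(logits)  # numerical stability
    p = np.exp(logits) / np.sum(np.exp(logits))
    return int(rng.choice(len(p), p=p))


probs = np.array([0.1, 0.6, 0.25, 0.05])
rng = np.random.default_rng(0)
samples = [sample_next_token(probs, temperature=0.8, top_k=2, rng=rng) for _ in range(20)]
```

With `top_k=2`, only token IDs 1 and 2 can ever be drawn here, and with `top_k=1` this collapses back to greedy decoding.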
Training
We train a GPT like any other neural network, using gradient descent with respect to some loss function. In the case of a GPT, we take the cross entropy loss over the language modeling task:
This is a heavily simplified training setup, but it illustrates the point. Notice the addition of params to our gpt function signature (we left this out in the previous sections for simplicity). During each iteration of the training loop:
We compute the language modeling loss for the given input text example
The loss determines our gradients, which we compute via backpropagation
We use the gradients to update our model parameters such that the loss is minimized (gradient descent)
Notice, we don't use explicitly labelled data. Instead, we are able to produce the input/label pairs from just the raw text itself. This is referred to as self-supervised learning.
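Here's a minimal sketch of how the input/label pairs and the loss fall out of the raw token sequence. `predict_probs` is a hypothetical callable standing in for the model, and the gradient/update step is omitted since plain NumPy has no autodiff:

```python
import numpy as np


def lm_loss(tokens, predict_probs):
    """Mean cross entropy for next-token prediction.

    predict_probs is a hypothetical callable:
    list[int] -> [n_seq, n_vocab] array of probabilities.
    """
    x, y = tokens[:-1], tokens[1:]  # labels are just the inputs shifted by one
    output = predict_probs(x)
    return float(np.mean(-np.log(output[np.arange(len(y)), y])))


def uniform_model(inputs, n_vocab=4):
    # Stand-in "model" that assigns equal probability to every token.
    return np.full((len(inputs), n_vocab), 1.0 / n_vocab)


loss = lm_loss([0, 1, 2, 3], uniform_model)  # equals log(n_vocab) for a uniform model
```

A model that knows nothing scores `log(n_vocab)`; training pushes the loss below that.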
Self-supervision enables us to massively scale our training data: we just get our hands on as much raw text as possible and throw it at the model. For example, GPT-3 was trained on 300 billion tokens of text from the internet and books:
Table 2.2 from GPT-3 paper
Of course, you need a sufficiently large model to be able to learn from all this data, which is why GPT-3 has 175 billion parameters and probably cost between $1m-10m in compute cost to train.[3]
This self-supervised training step is called pre-training, since we can reuse the "pre-trained" model weights to further train the model on downstream tasks, such as classifying if a tweet is toxic or not. Pre-trained models are also sometimes called foundation models.
Training the model on downstream tasks is called fine-tuning, since the model weights have already been pre-trained to understand language and are just being fine-tuned to the specific task at hand.
The "pre-training on a general task + fine-tuning on a specific task" strategy is called transfer learning.
Prompting
In principle, the original GPT paper was only about the benefits of pre-training a transformer model for transfer learning. The paper showed that pre-training a 117M GPT achieved state-of-the-art performance on various NLP (natural language processing) tasks when fine-tuned on labelled datasets.
It wasn't until the GPT-2 and GPT-3 papers that we realized a GPT model pre-trained on enough data with enough parameters was capable of performing any arbitrary task by itself, no fine-tuning needed. Just prompt the model, perform autoregressive language modeling, and like voila, the model magically gives us an appropriate response. This is referred to as in-context learning, because the model is using just the context of the prompt to perform the task. In-context learning can be zero shot, one shot, or few shot:
Figure 2.1 from the GPT-3 Paper
Generating text given a prompt is also referred to as conditional generation, since our model is generating some output conditioned on some input.
GPTs are not limited to NLP tasks. You can condition the model on anything you want. For example, you can turn a GPT into a chatbot (e.g. ChatGPT) by conditioning it on the conversation history. You can also further condition the chatbot to behave a certain way by prepending the prompt with some kind of description (e.g. "You are a chatbot. Be polite, speak in full sentences, don't say harmful things, etc ..."). Conditioning the model like this can even give your chatbot a persona. However, this is not robust: you can still "jailbreak" the model and make it misbehave.
With that out of the way, let's finally get to the actual implementation.
Setup
Clone the repository for this tutorial:
Then let's install our dependencies:
Note: This code was tested with Python 3.9.10.
A quick breakdown of each of the files:
encoder.py contains the code for OpenAI's BPE Tokenizer, taken straight from their gpt-2 repo.
utils.py contains the code to download and load the GPT-2 model weights, tokenizer, and hyperparameters.
gpt2.py contains the actual GPT model and generation code, which we can run as a python script.
gpt2_pico.py is the same as gpt2.py, but in even fewer lines of code. Why? Because why not.
We'll be reimplementing gpt2.py from scratch, so let's delete it and recreate it as an empty file:
As a starting point, paste the following code into gpt2.py:
Breaking down each of the 4 sections:
The gpt2 function is the actual GPT code we'll be implementing. You'll notice that the function signature includes some extra stuff in addition to inputs:
wte, wpe, blocks, and ln_f are the parameters of our model.
n_head is a hyperparameter that is needed during the forward pass.
The generate function is the autoregressive decoding algorithm we saw earlier. We use greedy sampling for simplicity. [tqdm](https://www.google.com/search?q=tqdm) is a progress bar to help us visualize the decoding process as it generates tokens one at a time.
The main function handles:
Loading the tokenizer (encoder), model weights (params), and hyperparameters (hparams)
Encoding the input prompt into token IDs using the tokenizer
Calling the generate function
Decoding the output IDs into a string
[fire.Fire(main)](https://github.com/google/python-fire) just turns our file into a CLI application, so we can eventually run our code with: python gpt2.py "some prompt here"
Let's take a closer look at encoder, hparams, and params. In a notebook or an interactive Python session, run:
This will download the necessary model and tokenizer files to models/124M and load encoder, hparams, and params into our code.
Encoder
encoder is the BPE tokenizer used by GPT-2:
Using the vocabulary of the tokenizer (stored in encoder.decoder), we can take a peek at what the actual tokens look like:
Notice, sometimes our tokens are words (e.g. Not), sometimes they are words but with a space in front of them (e.g. Ġall, [the Ġ represents a space](https://github.com/karpathy/minGPT/blob/37baab71b9abea1b76ab957409a1cc2fbfba8a26/mingpt/bpe.py#L22-L33)), sometimes they are part of a word (e.g. capes is split into Ġcap and es), and sometimes they are punctuation (e.g. .).
One nice thing about BPE is that it can encode any arbitrary string. If it encounters something that is not present in the vocabulary, it just breaks it down into substrings it does understand:
We can also check the size of the vocabulary:
The vocabulary, as well as the byte-pair merges which determine how strings are broken down, is obtained by training the tokenizer. When we load the tokenizer, we're loading the already trained vocab and byte-pair merges from some files, which were downloaded alongside the model files when we ran load_encoder_hparams_and_params. See models/124M/encoder.json (the vocabulary) and models/124M/vocab.bpe (byte-pair merges).
Hyperparameters
hparams is a dictionary that contains the hyperparameters of our model:
We'll use these symbols in our code's comments to show the underlying shape of things. We'll also use n_seq to denote the length of our input sequence (i.e. n_seq = len(inputs)).
Parameters
params is a nested json dictionary that holds the trained weights of our model. The leaf nodes of the json are NumPy arrays. If we print params, replacing the arrays with their shapes, we get:
These are loaded from the original OpenAI tensorflow checkpoint:
The following code converts the above tensorflow variables into our params dictionary.
For reference, here are the shapes of params but with the numbers replaced by the hparams they represent:
You'll probably want to come back to reference this dictionary to check the shape of the weights as we implement our GPT. We'll match the variable names in our code with the keys of this dictionary for consistency.
Basic Layers
Last thing before we get into the actual GPT architecture itself, let's implement some of the more basic neural network layers that are non-specific to GPTs.
GELU
The non-linearity (activation function) of choice for GPT-2 is GELU (Gaussian Error Linear Units), an alternative for ReLU:
Figure 1 from the GELU paper
It is approximated by the following function:
Like ReLU, GELU operates element-wise on the input:
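For reference, a minimal NumPy sketch of the tanh approximation of GELU (the sample input is arbitrary):

```python
import numpy as np


def gelu(x):
    # tanh approximation of GELU, applied element-wise
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))


out = gelu(np.array([[1.0, 2.0], [-2.0, 0.5]]))  # same shape as the input
```

Unlike ReLU, small negative inputs produce small negative outputs rather than a hard zero.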
Softmax
Good ole softmax:
$$\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$
We use the [max(x) trick for numerical stability](https://jaykmody.com/blog/stable-softmax/).
Softmax is used to convert a set of real numbers (between $-\infty$ and $\infty$) to probabilities (between 0 and 1, with the numbers all summing to 1). We apply softmax over the last axis of the input.
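A minimal sketch of softmax over the last axis, including the max-subtraction trick (the example inputs are arbitrary, with one deliberately huge row to show the stability):

```python
import numpy as np


def softmax(x):
    # subtracting the row max leaves the result unchanged but avoids overflow
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)


x = softmax(np.array([[2.0, 100.0], [-5.0, 0.0], [1000.0, 1000.0]]))
row_sums = x.sum(axis=-1)  # every row sums to 1
```

Without the max subtraction, `np.exp(1000.0)` overflows to `inf` and the last row would come out as `nan`.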
Layer Normalization
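Layer normalization standardizes values to have a mean of 0 and a variance of 1:
$$\text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$
where $\mu$ is the mean of $x$, $\sigma^2$ is the variance of $x$, and $\gamma$ and $\beta$ are learnable parameters.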
Layer normalization ensures that the inputs for each layer are always within a consistent range, which is supposed to speed up and stabilize the training process. Like Batch Normalization, the normalized output is then scaled and offset with two learnable vectors gamma and beta. The small epsilon term in the denominator is used to avoid a division by zero error.
Layer norm is used instead of batch norm in the transformer for various reasons. The differences between various normalization techniques are outlined in this excellent blog post.
We apply layer normalization over the last axis of the input.
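As a sketch, layer normalization over the last axis looks like this in NumPy (the parameter names `g` and `b` for gamma and beta, and the default `eps`, are assumptions here):

```python
import numpy as np


def layer_norm(x, g, b, eps=1e-5):
    mean = np.mean(x, axis=-1, keepdims=True)
    variance = np.var(x, axis=-1, keepdims=True)
    x = (x - mean) / np.sqrt(variance + eps)  # mean 0, variance 1 over last axis
    return g * x + b  # scale and offset with the learnable gamma/beta


x = np.array([[2.0, 2.0, 3.0], [-5.0, 0.0, 1.0]])
out = layer_norm(x, g=np.ones(3), b=np.zeros(3))
```

With `g` set to ones and `b` to zeros, each row of `out` has a mean of roughly 0 and a variance of roughly 1.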
Linear
Your standard matrix multiplication + bias:
Linear layers are often referred to as projections (since they are projecting from one vector space to another vector space).
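A minimal sketch (the shapes are arbitrary and only there to show the projection from one dimension to another):

```python
import numpy as np


def linear(x, w, b):  # [m, in], [in, out], [out] -> [m, out]
    return x @ w + b


rng = np.random.default_rng(0)
x = rng.normal(size=(64, 784))  # 64 vectors of dimension 784
w = rng.normal(size=(784, 10))
b = rng.normal(size=(10,))
out = linear(x, w, b)           # projected down to dimension 10
```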
GPT Architecture
The GPT architecture follows that of the transformer:
Figure 1 from Attention is All You Need
But uses only the decoder stack (the right part of the diagram):
GPT Architecture
Note, the middle "cross-attention" layer is also removed since we got rid of the encoder.
At a high level, the GPT architecture has three sections:
Text + positional embeddings
A transformer decoder stack
A projection to vocab step
In code, it looks like this:
Let's break down each of these three sections into more detail.
Embeddings
Token Embeddings
Token IDs by themselves are not very good representations for a neural network. For one, the relative magnitudes of the token IDs falsely communicate information (for example, if Apple = 5 and Table = 10 in our vocab, then we are implying that 2 * Apple = Table). Secondly, a single number is not a lot of dimensionality for a neural network to work with.
To address these limitations, we'll take advantage of word vectors, specifically via a learned embedding matrix:
Recall, wte is a [n_vocab, n_embd] matrix. It acts as a lookup table, where the $i$th row in the matrix corresponds to the learned vector for the $i$th token in our vocabulary. wte[inputs] uses integer array indexing to retrieve the vectors corresponding to each token in our input.
Like any other parameter in our network, wte is learned. That is, it is randomly initialized at the start of training and then updated via gradient descent.
Positional Embeddings
One quirk of the transformer architecture is that it doesn't take into account position. That is, if we randomly shuffled our input and then accordingly unshuffled the output, the output would be the same as if we never shuffled the input in the first place (the ordering of inputs doesn't have any effect on the output).
Of course, the ordering of words is a crucial part of language (duh), so we need some way to encode positional information into our inputs. For this, we can just use another learned embedding matrix:
Recall, wpe is a [n_ctx, n_embd] matrix. The $i$th row of the matrix contains a vector that encodes information about the $i$th position in the input. Similar to wte, this matrix is learned during gradient descent.
Notice, this restricts our model to a maximum sequence length of n_ctx.[4] That is, len(inputs) <= n_ctx must hold.
Combined
We can add our token and positional embeddings to get a combined embedding that encodes both token and positional information.
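Here's a small sketch of both lookups and their sum, using tiny random matrices in place of the learned `wte` and `wpe` (the sizes and token IDs are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n_vocab, n_ctx, n_embd = 10, 8, 4  # toy sizes (hypothetical)
wte = rng.normal(size=(n_vocab, n_embd))  # one row per token in the vocab
wpe = rng.normal(size=(n_ctx, n_embd))    # one row per position

inputs = [5, 1, 5]  # token 5 appears at positions 0 and 2

tok_emb = wte[inputs]              # [n_seq] -> [n_seq, n_embd]
pos_emb = wpe[range(len(inputs))]  # [n_seq] -> [n_seq, n_embd]
x = tok_emb + pos_emb              # combined embedding
```

The same token (ID 5) gets the same token embedding at both positions, but a different combined embedding, because the positional rows differ.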
Decoder Stack
This is where all the magic happens and the "deep" in deep learning comes in. We pass our embedding through a stack of n_layer transformer decoder blocks.
Stacking more layers is what allows us to control how deep our network is. GPT-3 for example, has a whopping 96 layers. On the other hand, choosing a larger n_embd value allows us to control how wide our network is (for example, GPT-3 uses an embedding size of 12288).
Projection to Vocab
In our final step, we project the output of the final transformer block to a probability distribution over our vocab:
Couple things to note here:
We first pass x through a final layer normalization layer before doing the projection to vocab. This is specific to the GPT-2 architecture (this is not present in the original GPT and Transformer papers).
We are reusing the embedding matrix wte for the projection. Other GPT implementations may choose to use a separate learned weight matrix for the projection, however sharing the embedding matrix has a couple of advantages:
You save some parameters (although at GPT-3 scale, this is negligible).
Since the matrix is responsible for mapping both to words and from words, in theory, it may learn a richer representation compared to having two separate matrices.
We don't apply softmax at the end, so our outputs will be logits instead of probabilities between 0 and 1. This is done for several reasons:
softmax is monotonic, so for greedy sampling np.argmax(logits) is equivalent to np.argmax(softmax(logits)), making softmax redundant
softmax is irreversible, meaning we can always go from logits to probabilities by applying softmax, but we can't go back to logits from probabilities, so for maximum flexibility, we output the logits
Numerical stability (for example, to compute cross entropy loss, taking [log(softmax(logits)) is numerically unstable compared to log_softmax(logits)](https://jaykmody.com/blog/stable-softmax/#cross-entropy-and-log-softmax))
The projection to vocab step is also sometimes called the language modeling head. What does "head" mean? Once your GPT is pre-trained, you can swap out the language modeling head with some other kind of projection, like a classification head for fine-tuning the model on some classification task. So your model can have multiple heads, kind of like a hydra.
So that's the GPT architecture at a high level, let's actually dig a bit deeper into what the decoder blocks are doing.
Decoder Block
The transformer decoder block consists of two sublayers:
Multi-head causal self attention
Position-wise feed forward neural network
Each sublayer utilizes layer normalization on its input as well as a residual connection (i.e. add the input of the sublayer to the output of the sublayer).
Some things to note:
Multi-head causal self attention is what facilitates the communication between the inputs. Nowhere else in the network does the model allow inputs to "see" each other. The embeddings, position-wise feed forward network, layer norms, and projection to vocab all operate on our inputs position-wise. Modeling relationships between inputs is tasked solely to attention.
The Position-wise feed forward neural network is just a regular 2 layer fully connected neural network. This just adds a bunch of learnable parameters for our model to work with to facilitate learning.
In the original transformer paper, layer norm is placed on the output layer_norm(x + sublayer(x)) while we place layer norm on the input x + sublayer(layer_norm(x)) to match GPT-2. This is referred to as pre-norm and has been shown to be important in improving the performance of the transformer.
Residual connections (popularized by ResNet) serve a couple of different purposes:
Makes it easier to optimize neural networks that are deep (i.e. networks that have lots of layers). The idea here is that we are providing "shortcuts" for the gradients to flow back through the network, making it easier to optimize the earlier layers in the network.
Without residual connections, deeper models see a degradation in performance when adding more layers (possibly because it's hard for the gradients to flow all the way back through a deep network without losing information). Residual connections seem to give a bit of an accuracy boost for deeper networks.
Can help with the vanishing/exploding gradients problem.
Let's dig a little deeper into the 2 sublayers.
Position-wise Feed Forward Network
This is just a simple multi-layer perceptron with 2 layers:
Nothing super fancy here, we just project from n_embd up to a higher dimension 4*n_embd and then back down to n_embd[5].
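A minimal sketch of the feed forward network, with `gelu` and `linear` repeated so the block runs on its own. The `c_fc` / `c_proj` dictionaries with `w` and `b` keys are assumed to mirror the layout of the params dictionary, and the random weights are placeholders:

```python
import numpy as np


def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))


def linear(x, w, b):
    return x @ w + b


def ffn(x, c_fc, c_proj):  # [n_seq, n_embd] -> [n_seq, n_embd]
    a = gelu(linear(x, **c_fc))  # project up:        [n_seq, 4*n_embd]
    x = linear(a, **c_proj)      # project back down: [n_seq, n_embd]
    return x


rng = np.random.default_rng(0)
n_seq, n_embd = 3, 8
x = rng.normal(size=(n_seq, n_embd))
c_fc = {"w": rng.normal(size=(n_embd, 4 * n_embd)), "b": np.zeros(4 * n_embd)}
c_proj = {"w": rng.normal(size=(4 * n_embd, n_embd)), "b": np.zeros(n_embd)}
out = ffn(x, c_fc, c_proj)
```

Because every step is row-wise, running `ffn` on a single position gives the same result as that position's row in the full output, which is what "position-wise" means.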
Recall, from our params dictionary, that our mlp params look like this:
Multi-Head Causal Self Attention
This layer is probably the most difficult part of the transformer to understand. So let's work our way up to "Multi-Head Causal Self Attention" by breaking each word down into its own section:
Attention
Self
Causal
Multi-Head
Attention
I have another blog post on this topic, where we derive the scaled dot product equation proposed in the original transformer paper from the ground up:
As such, I'm going to skip an explanation for attention in this post. You can also reference Lilian Weng's Attention? Attention! and Jay Alammar's The Illustrated Transformer which are also great explanations for attention.
We'll just adapt our attention implementation from my blog post:
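For reference, a minimal sketch of scaled dot product attention in NumPy (softmax is repeated so the block runs on its own, and the shapes in the comments and the random inputs are illustrative):

```python
import numpy as np


def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)


def attention(q, k, v):  # [n_q, d_k], [n_k, d_k], [n_k, d_v] -> [n_q, d_v]
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v


rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4))  # 2 queries
k = rng.normal(size=(5, 4))  # 5 keys
v = rng.normal(size=(5, 3))  # 5 values
out = attention(q, k, v)     # one weighted average of the values per query
```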
Self
When q, k, and v all come from the same source, we are performing self-attention (i.e. letting our input sequence attend to itself):
For example, if our input is "Jay went to the store, he bought 10 apples.", we would be letting the word "he" attend to all the other words, including "Jay", meaning the model can learn to recognize that "he" is referring to "Jay".
We can enhance self attention by introducing projections for q, k, v and the attention output:
This enables our model to learn a mapping for q, k, and v that best helps attention distinguish relationships between inputs.
We can reduce the number of matrix multiplications from 4 to just 2 if we combine w_q, w_k and w_v into a single matrix w_fc, perform the projection, and then split the result:
This is a bit more efficient as modern accelerators (GPUs) can take better advantage of one large matrix multiplication rather than 3 separate small ones happening sequentially.
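To see that the combined projection really computes the same thing, here's a sketch comparing the two versions on random weights. The function names `self_attention_separate` and `self_attention_fused` are ours for illustration:

```python
import numpy as np


def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)


def attention(q, k, v):
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v


def self_attention_separate(x, w_q, w_k, w_v, w_proj):
    q, k, v = x @ w_q, x @ w_k, x @ w_v  # three input projections
    return attention(q, k, v) @ w_proj   # plus one output projection


def self_attention_fused(x, w_fc, w_proj):
    q, k, v = np.split(x @ w_fc, 3, axis=-1)  # one projection, then split
    return attention(q, k, v) @ w_proj


rng = np.random.default_rng(0)
n_seq, n_embd = 4, 6
x = rng.normal(size=(n_seq, n_embd))
w_q, w_k, w_v, w_proj = rng.normal(size=(4, n_embd, n_embd))
w_fc = np.hstack([w_q, w_k, w_v])  # [n_embd, 3*n_embd]

out_separate = self_attention_separate(x, w_q, w_k, w_v, w_proj)
out_fused = self_attention_fused(x, w_fc, w_proj)
```

The two outputs match up to floating point error, since multiplying by the horizontally stacked matrix is just the three projections laid side by side.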
Finally, we add bias vectors to match the implementation of GPT-2, use our linear function, and rename our parameters to match our params dictionary:
Recall, from our params dictionary, our attn params look like this:
Causal
There is a bit of an issue with our current self-attention setup: our inputs can see into the future! For example, if our input is ["not", "all", "heroes", "wear", "capes"], during self attention we are allowing "wear" to see "capes". This means our output probabilities for "wear" will be biased since the model already knows the correct answer is "capes". This is no good since our model will just learn that the correct answer for input $i$ can be taken from input $i+1$.
To prevent this, we need to somehow modify our attention matrix to hide or mask our inputs from being able to see into the future. For example, let's pretend our attention matrix looks like this:
Each row corresponds to a query and the columns to a key. In this case, looking at the row for "wear", you can see that it is attending to "capes" in the last column with a weight of 0.295. To prevent this, we want to set that entry to 0.0:
In general, to prevent all the queries in our input from looking into the future, we set all positions $i, j$ where $j > i$ to 0:
We call this masking. One issue with our above masking approach is our rows no longer sum to 1 (since we are setting them to 0 after the softmax has been applied). To make sure our rows still sum to 1, we need to modify our attention matrix before the softmax is applied.
This can be achieved by setting entries that are to be masked to $-\infty$ prior to the softmax:
where mask is the matrix (for n_seq=5):
We use -1e10 instead of -np.inf as -np.inf can cause nans.
Adding mask to our attention matrix instead of just explicitly setting the values to -1e10 works because practically, any number plus -inf is just -inf.
We can compute the mask matrix in NumPy with (1 - np.tri(n_seq)) * -1e10.
Putting it all together, we get:
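As a minimal sketch of the masking step on its own (projections left out, random `q`, `k`, `v` as placeholders), we can check that a masked query really can't see future positions:

```python
import numpy as np


def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)


def attention(q, k, v, mask):  # mask: [n_q, n_k], added before the softmax
    return softmax(q @ k.T / np.sqrt(q.shape[-1]) + mask) @ v


n_seq, d = 5, 4
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, n_seq, d))
causal_mask = (1 - np.tri(n_seq)) * -1e10  # 0 on/below the diagonal, -1e10 above

out = attention(q, k, v, causal_mask)

# Perturb only the last (future-most) key and value ...
k2, v2 = k.copy(), v.copy()
k2[-1] += 1.0
v2[-1] += 1.0
out2 = attention(q, k2, v2, causal_mask)
# ... and every earlier position's output stays exactly the same.
```

The first query can only attend to itself, so its output is simply the first value vector.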
Multi-Head
We can further improve our implementation by performing n_head separate attention computations, splitting our queries, keys, and values into heads:
There are three steps added here:
Split q, k, v into n_head heads:
Compute attention for each head:
Merge the outputs of each head:
Notice, this reduces the dimension from n_embd to n_embd/n_head for each attention computation. This is a tradeoff. For reduced dimensionality, our model gets additional subspaces to work with when modeling relationships via attention. For example, maybe one attention head is responsible for connecting pronouns to the person the pronoun is referencing. Maybe another might be responsible for grouping sentences by periods. Another could simply be identifying which words are entities, and which are not. Although, it's probably just another neural network black box.
The code we wrote performs the attention computations over each head sequentially in a loop (one at a time), which is not very efficient. In practice, you'd want to do these in parallel. For simplicity, we'll just leave this sequential.
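The split / attend / merge steps can be sketched like this, with heads processed one at a time in a list comprehension. The `c_attn` / `c_proj` dictionaries are assumed to follow the params layout, and the weights are random placeholders:

```python
import numpy as np


def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)


def linear(x, w, b):
    return x @ w + b


def attention(q, k, v, mask):
    return softmax(q @ k.T / np.sqrt(q.shape[-1]) + mask) @ v


def mha(x, c_attn, c_proj, n_head):  # [n_seq, n_embd] -> [n_seq, n_embd]
    x = linear(x, **c_attn)  # qkv projection: [n_seq, 3*n_embd]
    qkv = np.split(x, 3, axis=-1)  # 3 arrays of [n_seq, n_embd]
    # split each of q, k, v into n_head arrays of [n_seq, n_embd/n_head]
    qkv_heads = [np.split(t, n_head, axis=-1) for t in qkv]
    causal_mask = (1 - np.tri(x.shape[0])) * -1e10
    # attention per head, sequentially
    out_heads = [attention(q, k, v, causal_mask) for q, k, v in zip(*qkv_heads)]
    x = np.hstack(out_heads)    # merge heads: [n_seq, n_embd]
    return linear(x, **c_proj)  # output projection


rng = np.random.default_rng(0)
n_seq, n_embd, n_head = 5, 8, 2
x = rng.normal(size=(n_seq, n_embd))
c_attn = {"w": rng.normal(size=(n_embd, 3 * n_embd)), "b": np.zeros(3 * n_embd)}
c_proj = {"w": rng.normal(size=(n_embd, n_embd)), "b": np.zeros(n_embd)}
out = mha(x, c_attn, c_proj, n_head)
```

`n_embd` has to be divisible by `n_head`, otherwise `np.split` raises an error.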
With that, we're finally done with our GPT implementation! Now, all that's left to do is put it all together and run our code.
Putting it All Together
Putting everything together, we get gpt2.py, which in its entirety is a mere 120 lines of code (60 lines if you remove comments and whitespace).
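To show how the pieces slot together without downloading anything, here's a condensed sketch of the forward pass run on tiny random weights. This is not the gpt2.py file itself: `random_params` is a hypothetical helper that builds a dictionary in the same nested layout as params, and the sizes are made up:

```python
import numpy as np


def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))


def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)


def layer_norm(x, g, b, eps=1e-5):
    mean = np.mean(x, axis=-1, keepdims=True)
    variance = np.var(x, axis=-1, keepdims=True)
    return g * (x - mean) / np.sqrt(variance + eps) + b


def linear(x, w, b):
    return x @ w + b


def ffn(x, c_fc, c_proj):
    return linear(gelu(linear(x, **c_fc)), **c_proj)


def attention(q, k, v, mask):
    return softmax(q @ k.T / np.sqrt(q.shape[-1]) + mask) @ v


def mha(x, c_attn, c_proj, n_head):
    x = linear(x, **c_attn)
    qkv_heads = [np.split(t, n_head, axis=-1) for t in np.split(x, 3, axis=-1)]
    causal_mask = (1 - np.tri(x.shape[0])) * -1e10
    out_heads = [attention(q, k, v, causal_mask) for q, k, v in zip(*qkv_heads)]
    return linear(np.hstack(out_heads), **c_proj)


def transformer_block(x, mlp, attn, ln_1, ln_2, n_head):
    x = x + mha(layer_norm(x, **ln_1), **attn, n_head=n_head)  # pre-norm + residual
    x = x + ffn(layer_norm(x, **ln_2), **mlp)
    return x


def gpt2(inputs, wte, wpe, blocks, ln_f, n_head):  # [n_seq] -> [n_seq, n_vocab]
    x = wte[inputs] + wpe[range(len(inputs))]  # token + positional embeddings
    for block in blocks:
        x = transformer_block(x, **block, n_head=n_head)
    return layer_norm(x, **ln_f) @ wte.T  # projection to vocab (logits)


def random_params(n_vocab, n_ctx, n_embd, n_layer, rng):
    """Hypothetical helper: random weights in the same nested layout as params."""
    def dense(n_in, n_out):
        return {"w": rng.normal(scale=0.02, size=(n_in, n_out)), "b": np.zeros(n_out)}

    def ln():
        return {"g": np.ones(n_embd), "b": np.zeros(n_embd)}

    blocks = [
        {
            "attn": {"c_attn": dense(n_embd, 3 * n_embd), "c_proj": dense(n_embd, n_embd)},
            "mlp": {"c_fc": dense(n_embd, 4 * n_embd), "c_proj": dense(4 * n_embd, n_embd)},
            "ln_1": ln(),
            "ln_2": ln(),
        }
        for _ in range(n_layer)
    ]
    return {
        "wte": rng.normal(scale=0.02, size=(n_vocab, n_embd)),
        "wpe": rng.normal(scale=0.02, size=(n_ctx, n_embd)),
        "blocks": blocks,
        "ln_f": ln(),
    }


params = random_params(n_vocab=10, n_ctx=8, n_embd=4, n_layer=2, rng=np.random.default_rng(0))
logits = gpt2([1, 2, 3], **params, n_head=2)  # [n_seq, n_vocab]
next_id = int(np.argmax(logits[-1]))          # greedy next-token prediction
```

With random weights the prediction is meaningless, but the shapes and the causal behavior (logits for a prefix don't change when more tokens are appended) are the same as with the real checkpoint.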
We can test our implementation with:
which gives the output:
It works!!!
We can test that our implementation gives identical results to OpenAI's official GPT-2 repo using the following Dockerfile (Note: this won't work on M1 Macbooks because of tensorflow shenanigans and also warning, it downloads all 4 GPT-2 model sizes, which is a lot of GBs of stuff to download):
which should give an identical result:
What Next?
This implementation is cool and all, but it's missing a ton of bells and whistles:
GPU/TPU Support
Replace NumPy with JAX:
That's it. You can now use the code with GPUs and even TPUs! Just make sure you install JAX correctly.
Backpropagation
Again, if we replace NumPy with JAX:
Then computing the gradients is as easy as:
Batching
Once again, if we replace NumPy with JAX[7]:
Then, making our gpt2 function batched is as easy as:
Inference Optimization
Our implementation is quite inefficient. The quickest and most impactful optimization you can make (outside of GPU + batching support) would be to implement a kv cache. Also, we implemented our attention head computations sequentially, when we should really be doing it in parallel[8].
There are many, many more inference optimizations. I recommend Lilian Weng's Large Transformer Model Inference Optimization and Kipply's Transformer Inference Arithmetic as a starting point.
Training
Training a GPT is pretty standard for a neural network (gradient descent w.r.t a loss function). Of course, you also need to use the standard bag of tricks when training a GPT (i.e. use the Adam optimizer, find the optimal learning rate, regularization via dropout and/or weight decay, use a learning rate scheduler, use the correct weight initialization, batching, etc ...).
The real secret sauce to training a good GPT model is the ability to scale the data and the model, which is where the real challenge is.
For scaling data, you'll want a corpus of text that is big, high quality, and diverse.
Big means billions of tokens (terabytes of data). For example, check out The Pile, which is an open source pre-training dataset for large language models.
High quality means you want to filter out duplicate examples, unformatted text, incoherent text, garbage text, etc ...
Diverse means varying sequence lengths, about lots of different topics, from different sources, with differing perspectives, etc ... Of course, if there are any biases in the data, it will reflect in the model, so you need to be careful of that as well.
Scaling the model to billions of parameters involves a cr*p ton of engineering (and money lol). Training frameworks can get absurdly long and complex. A good place to start would be Lilian Weng's How to Train Really Large Models on Many GPUs. On the topic there's also NVIDIA's Megatron Framework, Cohere's Training Framework, Google's PaLM, the open source mesh-transformer-jax (used to train EleutherAI's open source models), and many many more.
Evaluation
Oh boy, how does one even evaluate LLMs? Honestly, it's a really hard problem. HELM is pretty comprehensive and a good place to start, but you should always be skeptical of benchmarks and evaluation metrics.
Architecture Improvements
I recommend taking a look at Phil Wang's X-Transformers. It has the latest and greatest research on the transformer architecture. This paper is also a pretty good summary (see Table 1). Facebook's recent LLaMA paper is also probably a good reference for standard architecture improvements (as of February 2023).
Stopping Generation
Our current implementation requires us to specify the exact number of tokens we'd like to generate ahead of time. This is not a very good approach as our generations end up being too long, too short, or cut off mid-sentence.
To resolve this, we can introduce a special end of sentence (EOS) token. During pre-training, we append the EOS token to the end of our input (i.e. tokens = ["not", "all", "heroes", "wear", "capes", ".", "<|EOS|>"]). During generation, we simply stop whenever we encounter the EOS token (or if we hit some maximum sequence length):
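A minimal sketch of that stopping rule. `EOS_ID` and `toy_logits` are hypothetical: the stand-in model just counts upward and emits EOS after token 3, so we can see both stopping conditions trigger:

```python
import numpy as np

EOS_ID = 0  # hypothetical ID for the end of sentence token


def toy_logits(inputs, n_vocab=6):
    """Stand-in model (hypothetical): counts upward, emits EOS after token 3."""
    logits = np.zeros((len(inputs), n_vocab))
    for i, tok in enumerate(inputs):
        next_tok = EOS_ID if tok == 3 else (tok + 1) % n_vocab
        logits[i, next_tok] = 1.0
    return logits


def generate(inputs, max_seq_len, eos_id=EOS_ID):
    prompt_len = len(inputs)
    inputs = list(inputs)
    # stop on EOS, or when the sequence (prompt included) hits max_seq_len
    while inputs[-1] != eos_id and len(inputs) < max_seq_len:
        logits = toy_logits(inputs)
        inputs.append(int(np.argmax(logits[-1])))
    return inputs[prompt_len:]


stopped_by_eos = generate([1], max_seq_len=10)   # [2, 3, 0]
stopped_by_length = generate([1], max_seq_len=2)  # [2]
```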
GPT-2 was not pre-trained with an EOS token, so we can't use this approach in our code, but most LLMs nowadays use an EOS token.
Unconditional Generation
Generating text with our model requires us to condition it with a prompt. However, we can also make our model perform unconditional generation, where the model generates text without any kind of input prompt.
This is achieved by prepending a special beginning of sentence (BOS) token to the start of the input during pre-training (i.e. tokens = ["<|BOS|>", "not", "all", "heroes", "wear", "capes", "."]). Then, to generate text unconditionally, we input a list that contains just the BOS token:
GPT-2 was pre-trained with a BOS token (which is confusingly named <|endoftext|>), so running unconditional generation with our implementation is as easy as changing the following line to:
And then running:
Which generates:
Because we are using greedy sampling, the output is not very good (repetitive) and is deterministic (i.e. same output each time we run the code). To get generations that are both higher quality and non-deterministic, we'd need to sample directly from the distribution (ideally after applying something like top-p).
Unconditional generation is not particularly useful, but it's a fun way of demonstrating the abilities of a GPT.
Fine-tuning
We briefly touched on fine-tuning in the training section. Recall, fine-tuning is when we re-use the pre-trained weights to train the model on some downstream task. We call this process transfer-learning.
In theory, we could use zero-shot or few-shot prompting to get the model to complete our task, however, if you have access to a labelled dataset, fine-tuning a GPT is going to yield better results (results that can scale given additional data and higher quality data).
There are a couple different topics related to fine-tuning, I've broken them down below:
Classification Fine-tuning
In classification fine-tuning, we give the model some text and we ask it to predict which class it belongs to. For example, consider the IMDB dataset, which contains movie reviews that rate the movie as either good, or bad:
To fine-tune our model, we replace the language modeling head with a classification head, which we apply to the last token output:
We only use the last token output x[-1] because we only need to produce a single probability distribution for the entire input instead of n_seq distributions as in the case of language modeling. We take the last token in particular (instead of say the first token or a combination of all the tokens) because the last token is the only token that is allowed to attend to the entire sequence and thus has information about the input text as a whole.
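A minimal sketch of such a head in NumPy, assuming `x` is the `[n_seq, n_embd]` output of the final layer norm. The parameter names `w_cls` and `b_cls` and the toy shapes are hypothetical, chosen only for this example:

```python
import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

def classification_head(x, w_cls, b_cls):
    # x: [n_seq, n_embd] hidden states after the final layer norm
    # w_cls: [n_embd, n_classes], b_cls: [n_classes] (new, randomly initialized)
    last = x[-1]                    # [n_embd], the last token's output only
    logits = last @ w_cls + b_cls   # [n_embd] @ [n_embd, n_classes] -> [n_classes]
    return softmax(logits)          # one distribution for the whole input

rng = np.random.default_rng(0)
n_seq, n_embd, n_classes = 6, 8, 2  # toy sizes (assumption)
x = rng.normal(size=(n_seq, n_embd))
w_cls = rng.normal(size=(n_embd, n_classes))
b_cls = np.zeros(n_classes)
probs = classification_head(x, w_cls, b_cls)
print(probs.shape)
```

Note that the output has `n_classes` entries regardless of `n_seq`, which is the point of indexing the last token.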
As per usual, we optimize w.r.t. the cross entropy loss:
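For a single example, that loss could be sketched as follows. The `cross_entropy` helper is illustrative (not the post's code) and takes raw class logits plus an integer label:

```python
import numpy as np

def cross_entropy(logits, label):
    # logits: [n_classes] raw scores, label: integer index of the true class
    shifted = logits - np.max(logits)                      # for numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))  # log-softmax
    return -log_probs[label]                               # negative log-likelihood

print(cross_entropy(np.array([2.0, -1.0]), label=0))  # small loss: class 0 is favoured
print(cross_entropy(np.array([2.0, -1.0]), label=1))  # larger loss
```

In practice this would be averaged over a batch of examples before taking a gradient step.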
We can also perform multi-label classification (i.e. an example can belong to multiple classes, not just a single class) by applying sigmoid instead of softmax and taking the binary cross entropy loss with respect to each class (see this stack-exchange question).
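A small sketch of the multi-label variant, assuming one logit per class and a 0/1 label vector in which several entries may be 1. The function names are illustrative only:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def multilabel_bce(logits, labels):
    # logits: [n_classes] raw scores, labels: [n_classes] of 0s and 1s
    probs = sigmoid(logits)  # each class gets an independent probability
    per_class = -(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))
    return per_class.mean()  # average the binary losses over classes

logits = np.array([3.0, -2.0, 0.5])
labels = np.array([1.0, 0.0, 1.0])  # the example belongs to classes 0 and 2
print(multilabel_bce(logits, labels))
```

Unlike softmax, the sigmoid outputs do not need to sum to 1, so the model is free to assign high probability to several classes at once.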
Generative Fine-tuning
Some tasks can't be neatly categorized into classes. For example, consider the task of summarization. We can fine-tune these types of tasks by simply performing language modeling on the input concatenated with the label. For example, here's what a single summarization training sample might look like:
We train the model as we do during pre-training (optimize w.r.t. the language modeling loss).
At predict time, we feed the model everything up to --- Summary --- and then perform auto-regressive language modeling to generate the summary.
The choice of the delimiters --- Article --- and --- Summary --- is arbitrary. How you choose to format the text is up to you, as long as it is consistent between training and inference.
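One way to keep the formatting consistent is to build both the training text and the inference prompt from the same template. The helper names below are hypothetical, and the exact layout (one delimiter per line) is an assumption:

```python
ARTICLE_TAG = "--- Article ---"
SUMMARY_TAG = "--- Summary ---"

def format_inference_prompt(article):
    # everything up to and including the summary delimiter
    return f"{ARTICLE_TAG}\n{article}\n{SUMMARY_TAG}\n"

def format_training_sample(article, summary):
    # the prompt followed by the label; the model is trained on the whole string
    return format_inference_prompt(article) + summary

sample = format_training_sample(
    "This is an article I would like to summarize.",
    "This is the summary.",
)
print(sample)
```

Since the training sample is literally the inference prompt plus the label, the model sees the same prefix format in both settings.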
Notice, we can also formulate classification tasks as generative tasks (for example with IMDB):
However, this will probably perform worse than doing classification fine-tuning directly: the loss includes language modeling on the entire sequence, not just the final prediction, so the part of the loss specific to the prediction gets diluted.
Instruction Fine-tuning
Most state-of-the-art large language models these days also undergo an additional instruction fine-tuning step after being pre-trained. In this step, the model is fine-tuned (generatively) on thousands of instruction prompt + completion pairs that were human labelled. Instruction fine-tuning can also be referred to as supervised fine-tuning, since the data is human labelled (i.e. supervised).
So what's the benefit of instruction fine-tuning? While predicting the next word in a wikipedia article makes the model good at continuing sentences, it doesn't make it particularly good at following instructions, or having a conversation, or summarizing a document (all the things we would like a GPT to do). Fine-tuning the model on human labelled instruction + completion pairs is a way to teach it how to be more useful and easier to interact with. We call this AI alignment, as we are aligning the model to do and behave as we want it to. Alignment is an active area of research, and includes more than just following instructions (bias, safety, intent, etc ...).
What does this instruction data look like exactly? Google's FLAN models were trained on various academic NLP datasets (which are already human labelled):
Figure 3 from FLAN paper
OpenAI's InstructGPT on the other hand was trained on prompts collected from their own API. They then paid workers to write completions for those prompts. Here's a breakdown of the data:
Table 1 and 2 from InstructGPT paper
Parameter Efficient Fine-tuning
When we talk about fine-tuning in the above sections, it is assumed that we are updating all of the model parameters. While this yields the best performance, it is costly both in terms of compute (need to back propagate over the entire model) and in terms of storage (for each fine-tuned model, you need to store a completely new copy of the parameters).
The simplest approach to this problem is to only update the head and freeze (i.e. make untrainable) the rest of the model. While this would speed up training and greatly reduce the number of new parameters, it would not perform particularly well since we are losing out on the deep in deep learning. We could instead selectively freeze specific layers (i.e. freeze all layers except the last 4, or freeze every other layer, or freeze all parameters except multi-head attention parameters), which would help restore the depth. This will perform a lot better, but we become a lot less parameter efficient and we lose out on some of those training speed gains.
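As a toy illustration of freezing, here is a gradient-descent step that only touches parameters listed as trainable. The flat name-to-array dictionary and the names `block_0`, `block_1`, and `head` are assumptions for this sketch, not the post's parameter layout:

```python
import numpy as np

def sgd_step(params, grads, trainable, lr=0.1):
    # params, grads: dicts mapping a parameter name to an array
    # trainable: set of names to update; everything else stays frozen
    return {
        name: (value - lr * grads[name]) if name in trainable else value
        for name, value in params.items()
    }

params = {"block_0": np.ones(2), "block_1": np.ones(2), "head": np.ones(2)}
grads = {name: np.ones(2) for name in params}

# update only the head; both transformer blocks keep their pre-trained values
new_params = sgd_step(params, grads, trainable={"head"})
print(new_params)
```

This only shows the bookkeeping. The speed-up mentioned above would come from not computing gradients for the frozen parameters in the first place, and the storage saving from only having to keep the updated entries per task.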
Instead, we can utilize parameter-efficient fine-tuning methods. This is still an active area of research, and there are lots of different methods to choose from.
As an example, take the Adapters paper. In this approach, we add an additional "adapter" layer after the FFN and MHA layers in the transformer block. The adapter layer is just a simple 2-layer fully connected neural network, where the input and output dimensions are n_embd, and the hidden dimension is smaller than n_embd:
Figure 2 from the Adapters paper
The size of the hidden dimension is a hyper-parameter that we can set, enabling us to trade off parameters for performance. For a BERT model, the paper showed that using this approach can reduce the number of trained parameters to 2% while only sustaining a small hit in performance (<1%) when compared to a full fine-tune.
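To get a feel for the sizes involved, here is a minimal NumPy sketch of one adapter layer. The skip connection around the bottleneck follows the Adapters paper; the GELU nonlinearity, the zero-initialized up-projection, the parameter names, and the sizes (`n_embd = 768`, `n_hidden = 64`) are assumptions made for this example.

```python
import numpy as np

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def adapter(x, w_down, b_down, w_up, b_up):
    # x: [n_seq, n_embd]
    h = gelu(x @ w_down + b_down)  # project down: [n_seq, n_hidden]
    return x + (h @ w_up + b_up)   # project up and add skip: [n_seq, n_embd]

n_seq, n_embd, n_hidden = 4, 768, 64
rng = np.random.default_rng(0)
adapter_params = {
    "w_down": rng.normal(scale=0.02, size=(n_embd, n_hidden)),
    "b_down": np.zeros(n_hidden),
    "w_up": np.zeros((n_hidden, n_embd)),  # zero init: adapter starts as identity
    "b_up": np.zeros(n_embd),
}

x = rng.normal(size=(n_seq, n_embd))
out = adapter(x, **adapter_params)
n_adapter_params = sum(p.size for p in adapter_params.values())
print(out.shape, n_adapter_params)
```

The parameter count is `2 * n_embd * n_hidden + n_hidden + n_embd`, so shrinking `n_hidden` directly shrinks the number of weights that have to be trained and stored per task.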
Training at scale, collecting terabytes of data, making the model fast, evaluating performance, and aligning the models to serve humans is the life's work of the 100s of engineers and researchers required to make LLMs what they are today, not just the architecture. The GPT architecture just happened to be the first neural network architecture that has nice scaling properties, is highly parallelizable on GPUs, and is good at modeling sequences. The real secret sauce comes from scaling the data and model (as always), GPT just enables us to do that[9]. It's possible that the transformer has hit the hardware lottery, and some other architecture is still out there waiting to dethrone the transformer. ↩︎
For certain applications, the tokenizer doesn't require a decode method. For example, if you want to classify if a movie review is saying the movie was good or bad, you only need to be able to encode the text and do a forward pass of the model, there is no need for decode. For generating text however, decode is a requirement. ↩︎
Although, with the InstructGPT and Chinchilla papers, we've realized that we don't actually need to train models that big. An optimally trained and instruction fine-tuned GPT at 1.3B parameters can outperform GPT-3 at 175B parameters. ↩︎
The original transformer paper used a calculated positional embedding which they found performed just as well as learned positional embeddings, but has the distinct advantage that you can input any arbitrarily long sequence (you are not restricted by a maximum sequence length). However, in practice, your model is only going to be as good as the sequence lengths that it was trained on. You can't just train a GPT on sequences that are 1024 long and then expect it to perform well at 16k tokens long. Recently however, there has been some success with relative positional embeddings, such as ALiBi and RoPE. ↩︎
Different GPT models may choose a different hidden width that is not 4*n_embd, however this is the common practice for GPT models. Also, we give the multi-head attention layer a lot of attention (pun intended) for driving the success of the transformer, but at the scale of GPT-3, 80% of the model parameters are contained in the feed forward layer. Just something to think about. ↩︎
If you're not convinced, stare at the softmax equation and convince yourself this is true (maybe even pull out a pen and paper):
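The standard softmax definition, written out (here $x_i$ is simply the $i$-th entry of the input vector):

$$\text{softmax}(\vec{x})_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

As any entry $x_k \to -\infty$, we get $e^{x_k} \to 0$, so that entry's output goes to 0 and the remaining entries are normalized exactly as if $x_k$ were absent.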
I love JAX ❤️. ↩︎
Using JAX, this is as simple as heads = jax.vmap(attention, in_axes=(0, 0, 0, None))(q, k, v, causal_mask). ↩︎
Actually, I might argue that there is something inherently better about the way attention models sequences vs recurrent/convolutional layers, but now we're in a footnote inside a footnote, so I digress. ↩︎