The Coding Notebook
Memorable coding moments of a software engineer
AI For Developers: How Transformer LLMs Work
These are my notes from the DeepLearning.AI course "How Transformer LLMs Work".

Words Representation

When dealing with language models it is important to understand how words are represented as numeric values. We will go over the evolution of this topic.

Here is a timeline of language models
276fcc30f3e0f156c2a5824d5746bbc9.png

Bag of Words

In this method we simply count the number of times each word in a vocabulary appears in the sentence we want to encode.
Note that not all words may appear in the vocabulary; usually the vocabulary is built only from words that appear in the training dataset. In the example below, the word "cute" is not part of the vocabulary, and that's ok.
The "bag of words" is simply the vector of these word counts:

507d4e611885f1a1b2820b7404ddaf5b.png
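
To make this concrete, here is a tiny sketch in plain Python (my own illustration, with a made-up two-sentence "training set"; not from the course):

# Build the vocabulary from the training sentences only.
train_sentences = ["that is a dog", "my cat is nice"]
vocabulary = sorted({word for s in train_sentences for word in s.split()})

# The bag-of-words vector is just the count of each vocabulary word in the sentence.
def bag_of_words(sentence):
    words = sentence.lower().split()
    return [words.count(term) for term in vocabulary]

print(vocabulary)                          # ['a', 'cat', 'dog', 'is', 'my', 'nice', 'that']
print(bag_of_words("that is a cute dog"))  # [1, 0, 1, 1, 0, 0, 1] - "cute" is not in the vocabulary, so it is ignored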

Word2Vec

Bag of words is a very simple representation of words and it does not consider the semantic nature of text.
Word2Vec was designed to capture the meaning of a word; each word is still represented as a vector of numbers, called an "embedding".
These embedding vectors are built using neural networks.

We start by initializing a random embedding vector for each word in the vocabulary, then we train a network that takes a pair of words and tries to predict whether the two words are "close" to each other. The embedding vectors are just a layer of the neural network; each vector captures the "meaning" of its word.

d47841d0c3766cf2973046149fa1a0ca.png

The final result is that words with similar meaning are clustered together in the embedding space (note that the embedding vector dimension can be very large, e.g. 512, 1024 or more).

acd94c251c5f94bb40b57b6c344da25b.png
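
As a hedged sketch (not part of the course; assumes the gensim package is installed), here is how a small Word2Vec model can be trained and its embedding vectors inspected. On a toy corpus like this the similarities are meaningless, but the mechanics are the same:

from gensim.models import Word2Vec

# Tiny toy corpus; a real corpus would contain millions of sentences.
sentences = [
    ["the", "dog", "chased", "the", "cat"],
    ["the", "puppy", "chased", "the", "kitten"],
    ["i", "drank", "coffee", "this", "morning"],
]

# vector_size is the embedding dimension (real models use 300, 512, 1024, ...).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=100)

print(model.wv["dog"].shape)                # (50,) - the embedding vector for "dog"
print(model.wv.similarity("dog", "puppy"))  # cosine similarity between two embeddings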

Types of Embeddings

Once we have created an embedding vector for each word (more specifically, for each "token", which can be a part of a word) in a sentence, we can use various techniques (like averaging these vectors) to get a "sentence" embedding, and similarly a "document" embedding.

d722ca92648577f40927b07a8eb4a5ba.png
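
For example, a simple (and naive) sentence embedding can be obtained by averaging the token embeddings; a small numpy sketch with made-up vectors:

import numpy as np

# Hypothetical token embeddings for a 4-token sentence, dimension 8 (real models use 512, 1024, ...).
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(4, 8))

# One simple way to get a sentence embedding: average the token embeddings.
sentence_embedding = token_embeddings.mean(axis=0)
print(sentence_embedding.shape)  # (8,) - same dimension as a single token embedding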

Encoding and Decoding Context

Word2Vec creates an embedding vector for each word regardless of its position (or context) in the sentence. For example, in the sentences "The money is in the bank" and "The bank of the river", Word2Vec will generate the same embedding vector for the word "bank", although the meaning is different in each sentence. The word embedding should change with the context.

RNNs

Recurrent Neural Networks can be used to model entire sequences, like time series or words in a sentence. RNNs process the entire sequence, so each word is processed in the context of the previous words.

In a translation task two RNNs are used: an Encoder and a Decoder.
The encoder tries to create a context vector that represents the sentence in the source language, then the decoder uses this context vector to generate text in the target language.

9dc5cdbd479ede699394b48dbefee95a.png

Autoregressive

The decoder generates the translated words one by one: it starts by taking the entire sentence in the source language (and the encoder context) to generate the first translated word; the translated word is then appended to the previous input and the decoder generates the second word, and so on…

9445bffc634a88043dd9a4a915448383.png

The words are generated one at a time, using the same context embedding. This doesn't work well for long sentences, as a single context vector fails to represent the entire input.

Attention

Attention is a mechanism that allows a model to focus on parts of the input that are relevant to one another. The attention mechanism was introduced in 2014 (three years before the Transformer architecture).
The mechanism builds an "attention map" that gives higher weights to pairs of words from the input and output that relate to each other (do not confuse it with self-attention).

9799814456503ad84f6c6b3ac9424142.png

The attention mechanism is added to the decoder step. Now, instead of passing a single context vector from the encoder to the decoder, we pass the "hidden state" of each word to the decoder (a hidden state is an internal layer of the RNN that contains information about the previous words). The decoder uses the attention mechanism to look at the entire sequence when generating the output, instead of relying on a single limited context embedding.

a53e4826fcc0fd3c059ac94ceeb4a698.png

Transformers

The transformer architecture was introduced in 2017 in a paper called "Attention Is All You Need"; the architecture is based solely on attention, without the use of RNNs (hence the name of the paper…).
The major advantage of the transformer is that it allows the model to be trained in parallel which speeds up calculations.

The transformer is built from a stack of encoder and decoder blocks, each using the attention mechanism. By stacking the blocks we amplify the strength of the encoders/decoders.

9241240eca64cfab1d79d6966baef1eb.png

Encoder Block

The input words are converted to embeddings, but instead of word2vec we start with random values. Then self-attention processes the embeddings and updates them, creating embeddings that are more contextualized thanks to the attention mechanism. These are passed to a feed-forward neural network to create the final contextualized word embeddings.

Self-attention processes the input sequence against itself, as opposed to the original attention mechanism from 2014, which processed the input sequence against the output sequence.
Here the attention map assigns higher weights to words in the sentence that are more related to each other.

da62202ca859afa53906273cfe6a7915.png

Decoder Block

The decoder takes the previously generated words and passes them to the "masked self-attention" for processing; the result is updated intermediate embeddings. These embeddings are passed to another attention layer, along with the encoder output, to create a single embedding. This single embedding is passed to a feed-forward network to generate the next word.

38df4ffb99f073f567b993553d8fd64b.png

Masked Self-Attention

Masked Self-Attention is similar to self-attention, but it masks out the upper triangle of the attention map, hence masking future positions: any given token can only attend to tokens that came before it.

ba97f701305cf889ca1a9390dd771881.png
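
Here is a small sketch (my own, not from the course) of how such a causal mask can be applied to the attention scores before the softmax:

import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)  # raw attention scores (query x key)

# Causal mask: True above the diagonal, i.e. the future positions a token must NOT attend to.
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))  # future positions get -inf ...
weights = torch.softmax(scores, dim=-1)           # ... so their attention weight becomes 0

print(weights)  # a lower-triangular matrix: each token attends only to itself and earlier tokens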

Representation Models

The original transformer architecture was an encoder-decoder, which is best suited for text translation, but it is hard to use for other language tasks, like classification.

BERT

Bidirectional Encoder Representations from Transformers (BERT) is an encoder-only model that generates contextualized word embeddings; these embeddings can then be used for classification.

b9a046dd7b49b776016b95408f61072f.png

The encoder blocks are the same as we saw above; [CLS] is a special token added to represent the whole input for the specific task we fine-tune the model for (classification).

Training is done by randomly masking words in a sentence and training the model to predict the masked word.

37c24c3800780027693dceac5ee04b90.png

This process is the "pre-training", after which we get a "base" model that was trained on masked data. We can then fine-tune this model for specific tasks like Classification, Named Entity Recognition, Paraphrase Identification, etc.

Generative Models

In contrast to representation models, generative models are decoder-only models: the input embeddings are initialized randomly and passed to a stack of decoders:

8915b8236d523030b0a10a4e0c198813.png

After the release of ChatGPT (GPT-3.5) at the end of 2022 there was a flood of newly released generative models, both proprietary and open source, and 2023 became the Year of Generative AI.

0af68c0c6f91e447e93c805f49d200a7.png

Tokenizers

The process of tokenizing the input text varies from model to model: some tokenize word by word, some break a word into several tokens, some tokenize character by character, etc.
After the sentence has been tokenized, each token is assigned an integer id, and the tokenizer vocabulary is built.
The important thing is that once a model has been trained with a specific tokenizer, the same tokenizer must be used during inference (text generation).

Example

Let's see how different tokenizers tokenize the sentence:

English and CAPITALIZATION 🎵.

bert-base-cased

[CLS] English and CA ##PI ##TA ##L ##I ##Z ##AT ##ION [UNK] . [SEP]

This tokenizer has 28,996 tokens in its vocabulary.
The [CLS] token is a hint for the Classification task, the hash signs (##) signal that a token belongs to the token before it, [UNK] represents an unknown word (a word that is not in the tokenizer vocabulary), and [SEP] represents a separator.
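
These tokens can be reproduced with the Hugging Face transformers library (a quick sketch; assumes the transformers package is installed and the tokenizer can be downloaded):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
text = "English and CAPITALIZATION 🎵."

token_ids = tokenizer(text).input_ids
print(tokenizer.convert_ids_to_tokens(token_ids))  # the [CLS] ... [SEP] sequence shown above
print(tokenizer.vocab_size)                        # 28996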

Xenova/gpt-4

English and CAPITAL IZATION � � � .

While GPT-4 is a closed model, we still have access to its tokenizer.
This tokenizer has 100,263 tokens in its vocabulary. See how the word CAPITALIZATION is broken into only 2 parts; a smaller vocabulary usually requires higher fragmentation of long words (the short tokens can be shared among many words).
A larger vocabulary, on the other hand, requires more computation and memory (as the generated output is actually an array the size of the vocabulary, where each token gets a probability of being the next token).
Also note that since GPT is used for text generation, we don't have the special [CLS] token, etc.

Transformer LLM

Let's dive into how a transformer LLM works. We know that it all starts with a prompt:

Write an email apologizing to Sarah for the tragic gardening mishap.
Explain how it happened.

And the LLM will generate the output token-by-token:

Dear Sarah.
I would like ...

Overview

There are three main components in a Transformer LLM: a Tokenizer, a stack of Transformer Blocks, and a Language Modeling Head.

ae26d31faf94f198240d5782336c34f5.png

Tokenizer

We already discussed what a tokenizer is; the LLM has its own tokenizer vocabulary where each token is associated with an integer.
The trained model also has an embedding vector for each token in the vocabulary; these embeddings are substituted at the very beginning (input token -> token embedding).

Language Modeling Head

(we skip the transformer blocks for now)
The output of the transformer blocks is an "embedding like" vector that encompasses the "best" next token. The language modeling head is where this "embedding like" vector is transformed into a probability map.
The output is an array with the size of the vocabulary, where each token in the vocabulary is assigned a probability of being the next token.

8eca76dc897a8661f0ea6441ff71f87b.png

Once we have the probability map we need to pick the next token out of all the probabilities.
We can always choose the highest-probability token; this is called "greedy decoding". Another option is to choose a token from a basket of tokens: we add the highest-probability tokens to the basket, one by one, until the sum of their probabilities reaches some threshold, and then we randomly choose one of them. This threshold is usually controlled by a parameter called top-p, and the randomness of the sampling is shaped by a parameter called temperature.
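
Here is a small sketch of both strategies applied to a vector of logits (my own illustration with torch; the logits, temperature and top_p values are made up):

import torch

logits = torch.tensor([2.0, 1.0, 0.5, 0.1, -1.0])  # hypothetical logits, one per vocabulary token
temperature, top_p = 0.7, 0.9

# Greedy decoding: always take the highest-probability token.
greedy_id = int(torch.argmax(logits))

# Top-p sampling: keep the smallest basket of tokens whose cumulative probability reaches top_p,
# then sample one of them (temperature sharpens or flattens the distribution first).
probs = torch.softmax(logits / temperature, dim=-1)
sorted_probs, sorted_ids = torch.sort(probs, descending=True)
keep = torch.cumsum(sorted_probs, dim=-1) <= top_p
keep[0] = True  # always keep at least the most likely token
basket_probs = sorted_probs[keep] / sorted_probs[keep].sum()
sampled_id = int(sorted_ids[keep][torch.multinomial(basket_probs, 1)])

print(greedy_id, sampled_id)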

Decoding Loop

As we know, the text is generated in an autoregressive manner: we generate the first word, then append it to the input to generate the next word.

Transformers are very efficient because they can process all the input tokens in parallel; the number of tokens a model can process in parallel is bounded by its context size.

8feccf2e849af6c4eae6526f8d7b5a59.png

Once we have generated the first token, it is appended to the prompt to generate the next one, but now we reuse cached calculations from the previous step, so there is no need to process all the tokens again.
That is why the metric "time to first token" is used for LLMs: generating the first token is when we process the most tokens.

f2e12521cb7af24dfa67e5ca16306293.png
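
A hedged sketch of this decoding loop with the transformers library (it reuses the model and tokenizer loaded in the Phi-3 example later in these notes; after the first step only the newly generated token is fed in, the rest comes from the cache):

import torch

# Assumes `model` and `tokenizer` were loaded with AutoModelForCausalLM / AutoTokenizer (see below).
input_ids = tokenizer("Write an email apologizing to Sarah", return_tensors="pt").input_ids
past_key_values = None  # the KV cache, empty before the first step

for _ in range(10):  # generate 10 tokens
    with torch.no_grad():
        out = model(input_ids=input_ids, past_key_values=past_key_values, use_cache=True)
    next_id = out.logits[0, -1].argmax()   # greedy pick of the next token
    past_key_values = out.past_key_values  # cached K/V of all the tokens processed so far
    input_ids = next_id.view(1, 1)         # only the new token is processed in the next step
    print(tokenizer.decode(next_id), end="")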

Transformer Block

Once we have an input prompt we first replace the words with tokens and then replace the tokens with their pre-computed vector embeddings.
All embeddings are processed through a stack of transformer blocks; each block has its own self-attention layer and feed-forward network.

Feed-Forward Neural Network

Let's look at the prompt "The dog chased the llama because it".
As we said, each token goes through the transformer block, and in a nutshell, the neural network learns to predict the next token (or, more precisely, an "embedding like" vector that encompasses information about the next token; this is passed to the language modeling head for processing).

ff7c4bfb3d53995c599d8ce612f2c832.png

Self-Attention Layer

Continuing with the previous example: if we had only the NN and it had to predict the next token for the word "it", this would be a difficult task, as statistically there could be many words that come after the word "it". The goal of the self-attention layer is to provide a "better" input to the NN, one that encompasses more contextual info.
In our example, the contextual info may be a hint that the word "it" refers to the "llama", so when the NN processes the word "it" it has some "understanding" that "it" refers to the "llama".

318e377fc09c881cf05e8d3d123650fa.png

At a high level, when the attention layer is processing a token, it embeds relevant information from the previous tokens into the current token. Specifically:
1) Relevance Scoring - how relevant is each previous token to the current token.
2) Combining Information - combining information from the relevant tokens into the current token.

b3e5258cc6def776570fc4115e598cf9.png

Relevance Scoring

Self-attention happens in what is called an "Attention Head"; usually there are multiple attention heads.
The input to the attention head is a sequence of embedded tokens with positional encoding (remember that transformers, unlike recurrent neural networks (RNNs), don't inherently process sequential data; therefore, positional encodings are added to the embeddings to provide the model with information about the order of the tokens in the sequence).
These tokens (with positional encoding) are then transformed, using learned weight matrices, to produce three vectors per token: Queries (Q), Keys (K) and Values (V).

0519779266135ecec87aac9a3ba1cb7b.png

The Relevance Scores are calculated for each token by multiplying the current token's Q vector with the K vectors of all other tokens; this produces a relevance score between the current token and every other token.

e0782c1d904d54ee2e1c6745680308f7.png

Combining Information

The Combining Information step is done by multiplying the relevance scores with the V vector of each token.
So if in the example below we are processing the "it" token, we first compute the relevance score between each token and the word "it", then we multiply each token's V vector by its relevance score and sum the results.

441229b0b55fd2b49afd798e1d685ef2.png
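
Putting the two steps together, here is a bare-bones sketch of a single attention head (my own illustration; the projection matrices are random here, whereas in a real model they are learned):

import torch

seq_len, d_model, d_head = 5, 16, 8
x = torch.randn(seq_len, d_model)  # token embeddings (with positional information)

# Learned projection matrices (random here for illustration).
W_q, W_k, W_v = (torch.randn(d_model, d_head) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# 1) Relevance scoring: each token's Q vector against every token's K vector (scaled, then softmax).
scores = Q @ K.T / d_head**0.5
weights = torch.softmax(scores, dim=-1)

# 2) Combining information: weighted sum of the V vectors.
output = weights @ V
print(output.shape)  # (5, 8): one updated vector per token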

Multiple Heads

The self-attention layer has multiple attention heads, each with its own Q, K, V matrices.
Multi-head attention allows the model to learn multiple, different types of relationships simultaneously; each head can also focus on a different aspect of the input.

2fb3047ae4d98009267eda8793a00e78.png

To make self-attention more efficient, several heads are grouped together and share the same K and V matrices; this is grouped-query attention, and the configuration parameters are typically referred to as n_groups and n_attention_heads.

0097c7de77a6553475b1cff017106e8c.png
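
A shape-only sketch of the idea (illustrative, not tied to a specific implementation): query heads are split into groups, and each group shares a single K/V head, so far fewer K/V vectors need to be computed and cached.

import torch

seq_len, d_head = 5, 8
n_attention_heads, n_kv_heads = 32, 8         # e.g. Llama 3: 32 query heads sharing 8 K/V heads
group_size = n_attention_heads // n_kv_heads  # 4 query heads per group

Q = torch.randn(n_attention_heads, seq_len, d_head)
K = torch.randn(n_kv_heads, seq_len, d_head)
V = torch.randn(n_kv_heads, seq_len, d_head)

# Each K/V head is repeated for every query head in its group.
K_expanded = K.repeat_interleave(group_size, dim=0)  # (32, seq_len, d_head)
V_expanded = V.repeat_interleave(group_size, dim=0)

weights = torch.softmax(Q @ K_expanded.transpose(-2, -1) / d_head**0.5, dim=-1)
out = weights @ V_expanded
print(out.shape)  # (32, 5, 8): an output per query head, but only 8 K/V heads were stored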

Sparse Attention

In our example each token can attend to all previous tokens, but in larger models this is very expensive. In sparse attention a token can attend only to a limited number of previous tokens.
In example (a) each token can attend to all previous tokens.
In example (b) it can attend to at most 4 previous tokens (and possibly a few more further back, with jumps).
In example (c) the input sequence is chunked into blocks of 4 tokens.

082adb3c81328f5ca2b68ba57bf7c464.png

In order to support a really large context window there is a concept called Ring Attention; it is beyond the scope of this class, but in general it uses multiple GPUs to scale the context window.

Example

Model Architecture

Now we should have enough knowledge to understand the description of a model architecture. Here is an example, the Llama 3 model (the same numbers can also be read from the model config, as sketched after the list below).

7ba0d9fc1c1b5cec4d0b995ce76d2f64.png

  • Layers (32) is the number of transformer blocks.
  • Model Dimension (4096) is the token embedding length.
  • FFN Dimension (14,336) is the size of the feed-forward neural network (in the image).
  • Attention Heads (32) is the total number of attention heads.
  • Key/Value Heads (8) is the number of attention groups (shared K/V).
  • Vocabulary Size (128,000) is the number of tokens the model knows.
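
These numbers can also be read from the model config with transformers (a hedged sketch: the Llama 3 repository on the Hugging Face Hub is gated, so this assumes you have been granted access; any model with a config.json works the same way):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3-8B")

print(config.num_hidden_layers)    # layers (transformer blocks)
print(config.hidden_size)          # model dimension (token embedding length)
print(config.intermediate_size)    # FFN dimension
print(config.num_attention_heads)  # total attention heads
print(config.num_key_value_heads)  # K/V heads (attention groups)
print(config.vocab_size)           # vocabulary size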

Step by Step Inference

Let's run a text generation task step by step to see what we get at each step.

Installing Phi-3

We will use the Phi-3 model; first let's load it and its tokenizer.

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("../models/microsoft/Phi-3-mini-4k-instruct")

model = AutoModelForCausalLM.from_pretrained(
    "../models/microsoft/Phi-3-mini-4k-instruct",
    device_map="cpu",
    torch_dtype="auto",
    trust_remote_code=True,
)

Check the number of tokens

print(tokenizer.vocab_size)

# output: 32000

Check the model architecture

print(model)

Output:

Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3Attention(
          (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
          (rotary_emb): Phi3RotaryEmbedding()
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
          (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
          (activation_fn): SiLU()
        )
        (input_layernorm): Phi3RMSNorm()
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
        (post_attention_layernorm): Phi3RMSNorm()
      )
    )
    (norm): Phi3RMSNorm()
  )
  (lm_head): Linear(in_features=3072, out_features=32064, bias=False)
)

We can see the main components are:

  1. The Phi3Model that produces a contextualized "embedding like" vector for each input token; it consists of:
    1. The Embedding layer, which holds up to 32,064 token embeddings and produces an embedding vector of size 3072 for each token.
    2. The 32 transformer blocks, Phi3DecoderLayer, each with:
      1. A self-attention layer, Phi3Attention
      2. A feed-forward neural network, Phi3MLP
  2. The language modeling head, lm_head, that generates the probabilities for each token in the vocabulary being the next token.
    NOTE: The lm_head output size is 32,064 while the tokenizer vocabulary size is only 32,000. Why? In order to optimize processing, vector sizes are rounded up to a multiple of 64.

First we will run the Phi3Model directly:

prompt = "The capital of France is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

print(input_ids)

# output:
tensor([[ 450, 7483,  310, 3444,  338]])

Run the model:

model_output = model.model(input_ids)
print(model_output[0].shape)

# output:
torch.Size([1, 5, 3072])

The output is 5 "embedding like" vectors, one per input token, each representing the prediction for that token's next token. So if we want to complete our prompt we need the next token after the word "is", which is the 5th token.
To get the actual next token out of an "embedding like" vector we need to run the lm_head.

lm_head_output = model.lm_head(model_output[0])
print(lm_head_output.shape)

# output:
torch.Size([1, 5, 32064])

Now we see that the output is 5 vectors, one per input token, and each is a vector with the size of the vocabulary, where each value is a logit representing how likely that vocabulary token is to be the next token.
Let's take the logits for the token that follows the 5th input token:

next_token_logits = lm_head_output[0, 4]
print(next_token_logits)

# output:
tensor([27.8750, 29.5000, 28.0000,  ..., 20.3750, 20.3750, 20.3750],
  dtype=torch.bfloat16, grad_fn=<SelectBackward0>)

Now, as we discussed, there are different strategies for picking the next token. We will pick the one with the highest probability (greedy decoding) and decode the token id back to a word:

next_token_id = next_token_logits.argmax()
print(tokenizer.decode(next_token_id))

# output:
Paris

And voilà! We got the correct completion!

Recent Improvements (2024)

We have covered the "original" transformer architecture. Over the years there have been some modifications, and as of 2024 this is where we stand:

5cfaeb175e2fec9743704dc95d018ac1.png

  1. Most noticeable is the lack of "positional encoding" at the beginning. It is still important for self-attention to know the position of the token it is processing, but now the positional encoding is done inside the self-attention layer by what are called "rotary embeddings".
  2. The self-attention layer is optimized with grouped-query attention (as we discussed).
  3. The raw input tokens are merged with the output of the self-attention layer before moving to the NN (the bypass dashed line around the self-attention layer and the + sign). This is actually not a change from the 2017 architecture, just more visible in the 2024 drawing.
  4. The normalization layers were moved to before the self-attention and the feed-forward network; experiments showed this yields better results.

Rotary Embeddings (RoPE)

The goal of rotary embeddings is to improve and optimize the training process.
Assume we are training a model with a context size of 16k, and our batch has 32 documents. Each row in the batch has a potential size of 16k tokens, but many documents are shorter than 16k, so the 16k vector gets padded with 0s.

f9017825c44b1e5626931ec2d284fefb.png

The GPU still runs computations on the entire 16k vector; it doesn't care that the numbers are 0.
What if, instead of padding the vector with 0s, we could pack several documents into a single row of the batch:

4196eac8e6f8df0aab20c6e632305422.png

Now let's think about positional encoding when our data looks like this. If we use the simple regular positional encoding, which is the word index in the context vector, it will work for the first doc, but for the second doc it won't work, as the first word of the 2nd doc should have a positional encoding of 0.
Rotary embeddings are a way to solve this problem: they add positional information at the self-attention layer, just before calculating the relevance scores.

Mixture of Experts (MoE)

We saw that the transformer block has a single feed-forward neural network. What if we had several feed-forward networks, each specializing in a specific kind of token, and a router that routes each token to the appropriate "expert"?
Note that these are not domain experts (like psychology or biology) but rather token-related experts, e.g. punctuation (, . ? etc.), verbs (said, read, etc.), conjunctions (the, and, if, etc.), visual descriptions (dark, outer, yellow, etc.).

f11ac3197dccbc2cdce95f94b40494b3.png

During training the router itself is trained as well, learning how to best route each token (a classification task); it is also possible to choose 2 experts and combine their results.
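
A toy sketch of the routing idea (my own illustration; real MoE layers such as Mixtral's are more involved): the router scores the experts for each token, the top 2 are run, and their outputs are combined using the router weights.

import torch
import torch.nn as nn

d_model, n_experts, top_k = 16, 8, 2
x = torch.randn(5, d_model)  # 5 token embeddings

router = nn.Linear(d_model, n_experts)  # the router is itself a small trained layer
experts = nn.ModuleList(
    nn.Sequential(nn.Linear(d_model, 32), nn.SiLU(), nn.Linear(32, d_model))
    for _ in range(n_experts)
)

scores = torch.softmax(router(x), dim=-1)         # per-token probability for each expert
top_scores, top_ids = scores.topk(top_k, dim=-1)  # keep only the 2 best experts per token

out = torch.zeros_like(x)
for token in range(x.shape[0]):
    for score, expert_id in zip(top_scores[token], top_ids[token]):
        out[token] += score * experts[int(expert_id)](x[token])  # weighted mix of the chosen experts

print(out.shape)  # (5, 16): same shape as the input, but only 2 of the 8 experts ran per token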

Computational Requirements

Each expert is a full neural network, and these neural networks are where most of the LLM parameters are.
When loading an MoE model we need to load all the experts into memory, which requires more memory than a regular LLM.
But during inference we do not activate all the experts; usually at most 2 experts are activated, hence requiring less compute (in general, a single expert network has fewer parameters than the feed-forward network of a regular LLM).

Here is an example of the Mixtral 8x7B model (that has 8 experts).

292a2f40791e22a7bed8896af3043b9f.png

While the model has a total of 46.7B parameters, during inference only 12.8B will be activated.

The END