Understanding the Transformer architecture: “Attention Is All You Need”, a paper reading
Looking back at the ideas in the field of AI that I’ve come across and found the most fascinating to grasp and try to understand deeply, I think the one that stands out is the Transformer architecture. The field has always had achievements driven by new methodologies that help neural networks process whatever input you feed them, be it video, images, speech, or text, and use it for different tasks like next-word prediction or detecting a cat in an image. Recently, we have noticed a convergence toward the Transformer architecture as the new promise for AI.
The paper “Attention is all you need” came out in 2017. In this post we will discuss the main points of the Transformer architecture:
Outline:
- Background on Sequence Modeling and the Need for Parallelization
- Attention Mechanism
- Encoder-Decoder Architecture
- Key Takeaways
- Conclusion
1. Background on Sequence Modeling and the Need for Parallelization:
The authors of “Attention is all you need” propose a revolutionary model architecture, the Transformer, which replaces recurrent and convolutional layers with attention mechanisms. The paper presents the Transformer’s design, its advantages, and its exceptional performance in machine translation tasks.
Before getting into the content of the paper, we have to look back at the problem it is trying to solve. We had already addressed sequence modeling with Recurrent Neural Networks (RNNs) and gated RNNs (LSTMs), but those models tend to face another problem: forgetting. The quality of such a model is determined by its ability to look back and remember what was said earlier. Take, for example, a student studying for an exam using traditional methods like reading and taking notes (passive learning). As he progresses, he realizes he is forgetting important information and finds it challenging to retain everything he has learned; this forgetting makes studying difficult for him. Maybe if he adopts another method of learning (active learning), he can still retain all the pieces of information he learned.
Similarly, in the field of sequence modeling, researchers initially relied on RNNs and LSTMs to capture long-range dependencies in sequences of data. Those models were effective to some degree but faced a similar forgetting problem: they struggled to retain crucial information from earlier parts of the sequence, reducing their ability to generate accurate predictions.
In general, RNN and LSTM networks have disadvantages that keep our models from reaching top accuracy. They also need more time to train, so they are slow to train, and the input data has to be passed sequentially, which does not take advantage of GPUs designed for parallel computation. So, is there another way to parallelize the processing of sequential data?
That’s why this paper introduces a new model called the Transformer, which overcomes the issue of model forgetting by using a different approach called attention. It allows the model to look back and recall relevant information from all parts of the sequence, just like adopting an efficient study technique that helps you retain and recall all the information.
2. Attention Mechanism:
In the architecture figure, we notice an encoder positioned on the left-hand side, while the decoder occupies the right-hand side. Both components are built from a fundamental unit composed of two integral elements: an attention mechanism and a feed-forward network. This unit is repeated N times to reinforce its effectiveness. However, before understanding this architecture, it is important to thoroughly comprehend the fundamental concept: the self-attention mechanism.
The self-attention mechanism:
I’d like to approach this intriguing concept with an example. Imagine you’re sitting in a room filled with people, and everyone is having a conversation. Instead of focusing on a single conversation, you discover you have an incredible power: the ability to pay attention to multiple conversations simultaneously, retaining the most important information from each one. This is similar to how the self-attention mechanism works: the model has the capacity to pay attention to different parts of the input sequence and capture the dependencies between them.
So the self-attention mechanism is a fundamental component that enables the model to capture relationships between different positions within a sequence. Let’s break down how it works in theory, step by step:
I’ve come across a brilliant lecture on Peter Bloem’s blog, and I will use its illustrations to explain the concept in my own words.
As we see in the illustration, the input is a sequence of vectors (the input embeddings) that we could get from any technique that converts words to numbers (like word2vec). The problem is that those vectors are independent of each other: x1 doesn’t know about x2, x2 doesn’t know about x3, and so on. By employing self-attention, Transformer models can attend to all positions in the input sequence simultaneously and produce output vectors yi that are, we could say, more context-aware. This is done by applying the operation illustrated below, called the weighted sum:
Weighted sum: yi = Σj Wij · xj

where Wij, the attention weight, is the softmax-normalized attention score between positions i and j, and xj, the value vector, is simply the input vector at position j.
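To make this concrete, here is a minimal NumPy sketch (with made-up toy numbers) of how one output vector is formed from the value vectors and their attention weights:

```python
import numpy as np

# A toy 4-dimensional "value" vector for each of 3 tokens (made-up numbers).
values = np.array([
    [1.0, 0.0, 2.0, 1.0],   # x1
    [0.5, 1.0, 0.0, 0.0],   # x2
    [0.0, 2.0, 1.0, 1.0],   # x3
])

# Softmax-normalized attention weights for one output position i (they sum to 1).
attention_weights = np.array([0.7, 0.2, 0.1])

# Weighted sum: scale each value vector by its weight, then add them up.
y_i = (attention_weights[:, None] * values).sum(axis=0)
print(y_i)   # the context-aware output vector yi
```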
So how do we get the attention weight Wij?
First, attention takes three inputs that we create from the input embeddings; they are sets of projections, as this video describes them: Query, Key, and Value. Why do we need them?
The query, key, and value components of the self-attention mechanism serve specific purposes and enable the model to capture meaningful relationships and dependencies in the input sequence. Every vector in self-attention occurs in three different roles: first, the value, which contains the actual information or representation associated with each position in the input sequence; second, the query, which expresses what the current position is looking for; and third, the vector the query is matched against, which is called the key.
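As a rough sketch of how these three roles are produced in practice, each input embedding is multiplied by three separate projection matrices (the dimensions and random matrices below are just illustrative stand-ins for learned weights):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8                      # 5 tokens, toy embedding size
x = rng.normal(size=(seq_len, d_model))      # the input embeddings

# Three projection matrices (learned during training; random here for illustration).
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q = x @ W_q   # query: what each position is looking for
K = x @ W_k   # key: what each position offers to be matched against
V = x @ W_v   # value: the information each position actually carries
print(Q.shape, K.shape, V.shape)   # (5, 8) (5, 8) (5, 8)
```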
For more understanding, let’s consider the sentence illustrated above, “Attention is all you need”. This sentence consists of five words or tokens. If we focus on the word “all”, we can see that the words “is” and “you” are nearby, but they don’t provide any meaningful context for understanding “all”. Instead, the words “Attention” and “need” are more closely related to “all” in the sentence, giving us a better understanding of its meaning. This shows that proximity alone is not always enough; context is crucially important.
When we input this sentence into a computer, it treats each word as a token, so we assign word embeddings (V1, V2, …), but those word embeddings lack context. So, the idea is to apply a weighting or similarity measure to obtain final word embeddings (Y1, Y2, …) that capture more of the overall context.
Stop, stop. Now that I’ve made sure you understand the concept at the surface level, you may wonder how we compute the weighted sum. Fine, let’s break it down into smaller steps:
- Obtaining the weights: What weights are we talking about? Wij in the weighted sum is not a parameter of the model but a value that we calculate from the inputs. This is done by taking the dot product between every pair of input vectors, followed by a normalization step where we apply the softmax function.
- New embeddings (with context): And here is where the magic happens: we simply take those weights and use them to form a weighted sum over all the word vectors in the sentence.
I’ve added an illustration that shows the process of obtaining Y1, which is repeated for each word in the sentence.
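Putting those two steps together, here is a minimal NumPy sketch of this basic form of self-attention, in the simplified spirit of Peter Bloem’s lecture (no query/key/value projections yet; the sizes are toy values I picked for illustration):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d = 5, 8                        # five tokens ("Attention is all you need"), toy dimension
X = rng.normal(size=(seq_len, d))        # context-free input embeddings x1..x5

# Step 1: raw weights = dot product between every pair of input vectors.
raw_weights = X @ X.T                    # shape (5, 5)

# Step 2: softmax over each row, so the weights for one output position sum to 1.
W = softmax(raw_weights, axis=-1)

# Step 3: new context-aware embeddings = weighted sum over all input vectors.
Y = W @ X                                # yi = sum_j Wij * xj
print(Y.shape)                           # (5, 8)
```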
A. Scaled Dot Product Attention:
Scaled Dot Product Attention is a key component of the self-attention mechanism in the Transformer architecture. It allows the model to compute the importance or relevance of different positions within a sequence.
The final output of the scaled dot product attention is a weighted sum of the value vectors, where the attention weights serve as the weights for the summation.
The figure below shows the steps of how this works. First, we calculate the context similarity matrix by multiplying the matrix Q by the transpose of K (they have the same size). Then we scale the resulting matrix, which the authors of the paper call the compatibility matrix, by a factor of 1 over the square root of the key dimension (dk). The masking step is optional, so we can skip it here. The scaled scores are then passed through a softmax to normalize them, giving the attention matrix. Finally, we compute one more matrix multiplication between the attention matrix and V.
And that’s a single attention head. In this per-token view we calculate it for one token of the sentence and then move on to the others; in practice, the queries for all the tokens are stacked into a matrix and computed at once.
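Here is a small NumPy sketch of scaled dot-product attention as just described; the matrix sizes are toy values of my choosing, and the optional masking step is included only as an argument:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V, following the steps described above."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # scaled compatibility matrix
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # optional masking step
    attention = softmax(scores, axis=-1)       # the attention matrix
    return attention @ V                       # weighted sum of the value vectors

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)   # (5, 8)
```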
B. Multi-Head Attention:
We can say that multi-head attention is just multiple self-attention heads computed and concatenated. It applies the scaled dot-product mechanism several times in parallel, each time with its own learned projections, and concatenates the results. Instead of using the vectors of individual words, we use matrices of those stacked vectors when computing attention.
(The Transformer uses eight attention heads.)
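A rough sketch of multi-head attention under those assumptions, with eight heads, a 512-dimensional model, and random matrices standing in for the learned per-head projections:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k), axis=-1) @ V

def multi_head_attention(X, num_heads=8, rng=np.random.default_rng(0)):
    """Run scaled dot-product attention in parallel over several heads, then concatenate."""
    d_model = X.shape[-1]
    d_head = d_model // num_heads              # 512 / 8 = 64 dimensions per head
    heads = []
    for _ in range(num_heads):
        # Per-head projections (learned in practice; random here for illustration).
        W_q = rng.normal(size=(d_model, d_head))
        W_k = rng.normal(size=(d_model, d_head))
        W_v = rng.normal(size=(d_model, d_head))
        heads.append(attention(X @ W_q, X @ W_k, X @ W_v))
    concat = np.concatenate(heads, axis=-1)    # back to (seq_len, d_model)
    W_o = rng.normal(size=(d_model, d_model))  # final output projection
    return concat @ W_o

X = np.random.default_rng(1).normal(size=(5, 512))   # five tokens, 512-dimensional embeddings
print(multi_head_attention(X).shape)                 # (5, 512)
```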
3. Encoder-Decoder Architecture:
The Encoder-Decoder architecture of the Transformer model is much like that of Recurrent Neural Networks, but the difference is that the input can be passed in parallel. In RNNs the word representations are generated step by step, and we need the previous hidden state to compute the operations on the current word’s hidden state. The Transformer breaks with this: all the words of the sentence are passed in simultaneously, and their representations are determined simultaneously, as you notice in the figure below.
So how does the Transformer architecture process this? Let’s dig deeper into the components of the Transformer model:
As we know, most transduction models use the encoder-decoder structure. Unlike traditional transduction models that heavily rely on recurrent or convolutional layers depending on the task at hand, the Transformer replaces these layers with attention mechanisms. The attention mechanism enables the model to capture dependencies between different positions within a sequence efficiently, allowing it to process and generate outputs in parallel. This parallelization leads to faster training and inference times compared to sequential models like RNNs.
Here, both the encoder and the decoder use stacked self-attention and point-wise, fully connected layers.
For the encoder part:
As you see in the figure above, the first step is converting the inputs into input embeddings. Of course, computers can’t understand a series of words; we need to convert them to numbers, and that’s what the embedding layer is for.
We could use some model to represent those words as numbers, such as a bag of words, where we give the machine a dictionary and represent each word by its index in that dictionary, or we could use one-hot encoding.
But actually, that’s not a great way to represent words; instead, we use word embeddings (similar words have similar vectors in vector space). The embedding layer in the encoder produces a 512-dimensional embedding vector for each token or word of the sentence.
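As a minimal sketch of what the embedding layer does, assume a toy five-word vocabulary and a randomly initialized lookup table standing in for the learned embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy vocabulary and a lookup table standing in for learned embeddings.
vocab = {"attention": 0, "is": 1, "all": 2, "you": 3, "need": 4}
d_model = 512
embedding_table = rng.normal(size=(len(vocab), d_model))

sentence = ["attention", "is", "all", "you", "need"]
token_ids = [vocab[word] for word in sentence]
input_embeddings = embedding_table[token_ids]   # one 512-dimensional vector per token
print(input_embeddings.shape)                   # (5, 512)
```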
Next, we pass those word embeddings to positional encoding (for the decoder as well). This step exists because the order or position of a word can change the meaning of a sentence, and the lack of positional information could confuse the model when it tries to capture the meaning and context of the sentence. To solve this, we add a representation of each word’s position in the sentence to its word embedding before feeding it to the encoder.
To this end, we add “positional encodings” to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension as the embeddings so that the two can be summed. There are many choices of positional encodings.
from the paper “Attention is all you need”, page 6
You can check this blog here; it explains this concept much better.
To calculate the positional encoding vectors, the authors of the paper use a sinusoidal function:
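The paper defines PE(pos, 2i) = sin(pos / 10000^(2i/dmodel)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/dmodel)). Here is a small NumPy sketch of these encodings (the sequence length of 5 is just an example):

```python
import numpy as np

def positional_encoding(seq_len, d_model=512):
    """Sinusoidal positional encodings:
    PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]                   # even dimension indices 0, 2, 4, ...
    angles = positions / np.power(10000.0, i / d_model)     # (seq_len, d_model / 2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                            # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)                            # odd dimensions get cosine
    return pe

# The encodings have the same dimension as the embeddings, so they can simply be added.
pe = positional_encoding(seq_len=5)
print(pe.shape)   # (5, 512)
```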
That was only the data-processing part; let’s see what the real encoder hides for us:
Now we feed the embeddings, which at this point carry positional information, into the encoder block, where they go through a multi-head attention layer and a feed-forward layer (we explained above how the attention mechanism works).
The multi-head attention layer produces eight matrices, and we have to find a way to condense these eight down into one matrix, because the feed-forward layer expects just a single matrix. How do we do that?
We concatenate the matrices and multiply the result by an additional weight matrix.
Then we pass through the Add & Norm layer, where we sum the output of the attention sub-layer with its input (the embeddings from the first step, in the case of the first block) and apply layer normalization. The output is then sent to the fully connected feed-forward network, which each of the layers in our encoder and decoder contains; it consists of two linear transformations with a ReLU activation in between. This transformation enhances the model’s ability to capture complex relationships and patterns within the data.
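Here is a minimal sketch of this sub-layer, assuming the paper’s dimensions (dmodel = 512, inner dimension dff = 2048) and a bare-bones layer normalization without learned scale and shift parameters:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's vector (no learned scale/shift in this sketch)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2: two linear maps with a ReLU in between."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048                  # the paper's dimensions
x = rng.normal(size=(5, d_model))          # output of the multi-head attention sub-layer

W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)

# Add & Norm: residual connection around the sub-layer, then layer normalization.
out = layer_norm(x + feed_forward(x, W1, b1, W2, b2))
print(out.shape)   # (5, 512)
```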
The output from the feed-forward layer is then forwarded to the subsequent encoder block, repeating the multi-head attention and feed-forward layer operations. This iterative process, conducted multiple times (N times, as determined by the architecture), refines the model’s understanding of the input sequence and extracts increasingly abstract representations.
To further illuminate the concept, envision this process as akin to a skilled detective team unraveling a complex case. Each encoder block acts as a detective with specific expertise, diligently investigating different aspects of the input sequence. The multi-head attention layer allows these “detectives” to collaborate effectively, sharing insights about different positions within the sequence.
As the model progresses through the encoder blocks, the input sequence’s contextual information is distilled and refined. The positional encoding ensures that the model understands the nuanced relationships between words, while the attention mechanism enables it to focus on different parts of the sequence concurrently.
The final output from the encoder represents a comprehensive understanding of the input sequence, incorporating both local and global dependencies. This rich representation is then passed to the decoder, where a similar process unfolds in the generation of the final output sequence. In essence, the encoder’s role is to distill meaningful information from the input sequence, leveraging attention mechanisms and parallelization to efficiently capture dependencies. The iterative nature of the encoder blocks ensures a progressively refined understanding, laying the foundation for the Transformer’s remarkable performance in natural language processing tasks.
To make the understanding more entertaining, let me extend this detective analogy to the full encoder-decoder architecture:
Encoder: Imagine the encoder as a team of detectives investigating a complex mystery. Each detective in the team has specific expertise and can pay attention to different aspects of the case. They work together in a coordinated manner to gather information, analyze clues, and understand the big picture. The team also has a communication network that allows them to share their findings and collaborate effectively. This network helps them combine their individual insights into a cohesive understanding of the mystery.
Decoder: The decoder can be likened to a group of storytellers who are given the task of crafting a compelling narrative based on the clues provided by the detective team. Each storyteller has their own style and imagination, but they need to stay grounded in the information provided by the detectives. They carefully listen to the detectives’ findings and use their storytelling skills to weave a coherent and engaging story. They pay attention to the details and sequence of events, ensuring that the story unfolds logically and stays faithful to the known information.
4. Key Takeaways:
The Transformer architecture brings several key takeaways to the forefront of AI research and application:
- Parallelization and Efficiency: One of the fundamental advantages of the Transformer is its ability to process inputs in parallel, a departure from sequential models like RNNs. This parallelization not only accelerates training and inference times but also harnesses the computational power of modern GPUs designed for parallel computation.
- Addressing Model Forgetting: The Transformer effectively addresses the issue of model forgetting encountered by traditional RNNs and LSTMs. Through the self-attention mechanism, it enables the model to retain and recall relevant information from all parts of the input sequence simultaneously. This leads to more context-aware predictions and robust sequence modeling.
- Attention Mechanism’s Power: The heart of the Transformer lies in its attention mechanism, specifically the self-attention mechanism. This powerful concept allows the model to focus on different positions within a sequence simultaneously, capturing intricate relationships and dependencies. The weighted sum calculation, driven by attention scores, ensures that the model can give varying degrees of importance to different elements of the input sequence.
- Multi-Head Attention: The introduction of multi-head attention further enhances the model’s capacity to capture complex dependencies. By applying the Scaled Dot Product Attention multiple times in parallel and concatenating the results, the Transformer achieves a more comprehensive understanding of the relationships within the data. This multi-head approach contributes to the model’s versatility and robustness.
- Encoder-Decoder Paradigm Shift: The Transformer’s Encoder-Decoder architecture marks a paradigm shift from traditional transduction models relying on recurrent or convolutional layers. The attention mechanism replaces these layers, allowing for efficient parallel processing. This shift results in faster training and inference, paving the way for improved performance across various natural language processing tasks.
5. Conclusion:
In conclusion, the Transformer architecture, as introduced in the seminal paper “Attention is all you need,” stands as a transformative milestone in the field of natural language processing. Its emphasis on attention mechanisms, parallelization, and addressing the challenges posed by model forgetting has propelled AI research into new realms of efficiency and accuracy. The ability to capture intricate dependencies within sequences has unlocked the potential for more nuanced understanding and generation of natural language.
As researchers continue to delve deeper into refining the Transformer architecture and exploring its applications, we anticipate further breakthroughs in natural language understanding and generation tasks. The Transformer’s impact extends beyond its initial introduction, promising a future where AI models can achieve unprecedented levels of intelligence and contextual awareness. The journey from traditional sequential models to the Transformer represents a significant leap forward, bringing us closer to the realization of truly intelligent machines in the realm of artificial intelligence.