
Attention

Alright so imagine we have $E \in \mathbb{R}^{n \times d_E}$, where $n$ is the number of words in a sentence and $d_E$ is the dimension of the embeddings (usually 12,288 or smth like that). Each $\vec{E}_i \in E$ is the word embedding of one word.
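Just to make the shapes concrete, here's a minimal NumPy sketch with toy sizes I made up (a 5-word sentence and 8-dimensional embeddings instead of the real ~12,288):

```python
import numpy as np

n, d_E = 5, 8                  # toy sizes: 5 words, 8-dim embeddings (real models use ~12,288)
E = np.random.randn(n, d_E)    # E[i] is the embedding vector of word i
print(E.shape)                 # (5, 8)
```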

Attention is basically the set of adjustments to each embedding that fold in its contextual information. The idea is to check how words correlate to each other. We do that by making a query vector out of each embedding, which "asks" the key vectors of the other embeddings through a dot product; how strongly a query and a key align is how strongly the embedding behind the query relates to the embedding behind the key. So $\vec{Q}_{\text{cat}}$ could have a large activation with $\vec{K}_{\text{whiskas}}$. How do we get these vectors? We take the embedding and multiply it by some weights, which gives us $\vec{Q}$. So we have $W_Q$, $W_K$, and $W_V$ (the last one not really, but we'll get to that in a minute), which are learnable parameters. In one head of attention we compute the queries and keys of all embeddings, dot each query with the keys of the words up to and including it in the sentence, and then, after normalizing the outputs, multiply the result by one more matrix to get "how much should we change the embedding".

$$Q = W_Q \cdot E \qquad K = W_K \cdot E \qquad Q \cdot K^T$$
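A quick NumPy sketch of this step, under a couple of assumptions: I keep each embedding as a row of `E` (so the products read `E @ W_Q` rather than $W_Q \cdot E$), and the weights are random stand-ins for what would actually be learned:

```python
import numpy as np

n, d_E, d_k = 5, 8, 4              # toy sizes: sentence length, embedding dim, query/key dim
E = np.random.randn(n, d_E)        # word embeddings, one row per word

W_Q = np.random.randn(d_E, d_k)    # learnable parameters in a real model, random here
W_K = np.random.randn(d_E, d_k)

Q = E @ W_Q                        # (n, d_k): one query vector per word
K = E @ W_K                        # (n, d_k): one key vector per word

scores = Q @ K.T                   # (n, n): scores[i, j] = Q_i . K_j, "how much does word i care about word j"
```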

This is the dot product of all queries with all keys, but the problem is that it also computes dot products between a query and the keys of the words that come after it, which makes no sense. What I mean by this is that a word's contextual semantics are given by the words before it, not after. So in a sentence like "The red dog is eating", "dog"'s context is "The red", not "is eating". So a $\vec{Q}_i$ should only dot product with $\vec{K}_j$ for $j \le i$. If we were to draw the table of this, it would look like this (the ., 0, and O are basically circles whose size is analogous to the strength of the correlation, but in reality they're real values).

|    | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 |
|----|----|----|----|----|----|----|----|
| K1 | O  | .  | 0  | O  | .  | 0  | .  |
| K2 |    | O  | .  | 0  | 0  | .  | 0  |
| K3 |    |    | O  | .  | 0  | O  | .  |
| K4 |    |    |    | 0  | .  | 0  | .  |
| K5 |    |    |    |    | .  | 0  | .  |
| K6 |    |    |    |    |    | .  | 0  |
| K7 |    |    |    |    |    |    | .  |

Alright so, this gives us a new matrix whose columns we have to normalize. What we do is mask the open values, the correlations we didn't calculate, by setting them to $-\infty$, and then apply a softmax, so that each column sums to one. Soooo

|    | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 |
|----|----|----|----|----|----|----|----|
| K1 | O | . | 0 | O | . | 0 | . |
| K2 | $-\infty$ | O | . | 0 | 0 | . | 0 |
| K3 | $-\infty$ | $-\infty$ | O | . | 0 | O | . |
| K4 | $-\infty$ | $-\infty$ | $-\infty$ | 0 | . | 0 | . |
| K5 | $-\infty$ | $-\infty$ | $-\infty$ | $-\infty$ | . | 0 | . |
| K6 | $-\infty$ | $-\infty$ | $-\infty$ | $-\infty$ | $-\infty$ | . | 0 |
| K7 | $-\infty$ | $-\infty$ | $-\infty$ | $-\infty$ | $-\infty$ | $-\infty$ | . |
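In code, the masking is just overwriting the entries we don't want, the keys that come after the query, with $-\infty$ before the softmax. A sketch (row-per-word convention again, so the mask is lower-triangular, i.e. the transpose of the K-rows/Q-columns table above):

```python
import numpy as np

n = 7
scores = np.random.randn(n, n)                     # pretend these are the Q_i . K_j dot products

causal = np.tril(np.ones((n, n), dtype=bool))      # True where key position j <= query position i
masked_scores = np.where(causal, scores, -np.inf)  # everything "in the future" becomes -inf
```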

And then softmax. Also, it's numerically convenient to divide everything by $\sqrt{d_k}$ (the square root of the dimension of the key vectors). Sooo we're now here:

$$\text{softmax}\left(\frac{Q \cdot K^T}{\sqrt{d_k}}\right)$$
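Putting the scaling and the softmax together, here's a sketch of the whole attention pattern. Note the softmax here runs along rows, because in this convention each row holds one query's scores, playing the role the columns play in the tables above:

```python
import numpy as np

def attention_pattern(Q: np.ndarray, K: np.ndarray) -> np.ndarray:
    """softmax(mask(Q K^T) / sqrt(d_k)): one row of weights per query, each row sums to 1."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                   # scaled dot products
    causal = np.tril(np.ones((n, n), dtype=bool))     # key j allowed only if j <= query i
    scores = np.where(causal, scores, -np.inf)        # mask out the future
    scores -= scores.max(axis=-1, keepdims=True)      # for numerical stability
    weights = np.exp(scores)                          # exp(-inf) = 0, so masked entries vanish
    return weights / weights.sum(axis=-1, keepdims=True)

# toy check: each row sums to 1 and the masked (upper-triangle) entries are 0
A = attention_pattern(np.random.randn(7, 4), np.random.randn(7, 4))
print(A.sum(axis=-1))   # all ones
```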

YUHHHH

Now we have the attention pattern, but how do we know how to change the embedding? We can't simply sum this up. SOOOO, enter the value matrix. Just like what we did to calculate $Q$ and $K$, we define $V := W_V \cdot E$. Cool, and this is what gives us what $\Delta E$ should be. Again, $W_V$ is made of trainable parameters. So, this gives us one head of attention.

$$\text{Attention}(Q, K, V) = E + \Delta E = E + \text{softmax}\left(\frac{Q \cdot K^T}{\sqrt{d_k}}\right)V$$
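And here's one full head as a sketch under the same toy assumptions (row-per-word $E$, random stand-ins for the learned $W_Q$, $W_K$, $W_V$, and $W_V$ kept square so $\Delta E$ can be added straight onto $E$). Real implementations usually keep the residual $E + \Delta E$ outside the attention function itself, but this follows the formula above:

```python
import numpy as np

def attention_head(E, W_Q, W_K, W_V):
    """E + softmax(mask(Q K^T) / sqrt(d_k)) V for a single masked (causal) head."""
    Q, K, V = E @ W_Q, E @ W_K, E @ W_V              # queries, keys, values: one per word
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # attention pattern, rows sum to 1
    delta_E = weights @ V                            # "how much should we change each embedding"
    return E + delta_E                               # residual: nudge the embeddings, don't replace them

n, d_E = 5, 8
E = np.random.randn(n, d_E)
W_Q, W_K = np.random.randn(d_E, 4), np.random.randn(d_E, 4)
W_V = np.random.randn(d_E, d_E)                      # square so the value output lives in embedding space
out = attention_head(E, W_Q, W_K, W_V)               # same shape as E: (5, 8)
```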

YUHHHH

So, what is multi-head attention? It's just doing this step multiple times in parallel and summing all the $\Delta E$'s. So, a layer with 10 heads would have 10 $W_Q$'s, 10 $W_K$'s, and 10 $W_V$'s. And for all those we would do

$$\text{MHA}(Q, K, V) = E + \sum_{i=1}^{h} \text{softmax}\left(\frac{Q_i \cdot K_i^T}{\sqrt{d_k}}\right)V_i$$

Where $h$ is the number of heads, and $Q$, $K$, and $V$ are now tensors (one matrix per head) rather than single matrices. And this update happens for every embedding in the sentence, but lucky for us the weight matrices are shared across all the embeddings; there isn't one weight matrix per embedding.
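Finally, multi-head attention as a sketch: $h$ heads, each with its own $W_Q$, $W_K$, $W_V$ stacked into tensors, and all the $\Delta E$'s summed onto $E$. (Real libraries usually concatenate the heads and apply one output projection instead, which works out to the same kind of per-head sum.)

```python
import numpy as np

def multi_head_attention(E, W_Q, W_K, W_V):
    """E + sum over heads of softmax(mask(Q_i K_i^T) / sqrt(d_k)) V_i.

    W_Q, W_K have shape (h, d_E, d_k); W_V has shape (h, d_E, d_E) so each head's
    output lands back in embedding space (a simplification of how real models factor this).
    """
    n, _ = E.shape
    h, _, d_k = W_Q.shape
    causal = np.tril(np.ones((n, n), dtype=bool))
    delta_E = np.zeros_like(E)
    for i in range(h):                                   # the heads run in parallel in practice
        Q, K, V = E @ W_Q[i], E @ W_K[i], E @ W_V[i]
        scores = np.where(causal, Q @ K.T / np.sqrt(d_k), -np.inf)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        delta_E += weights @ V                           # summing every head's delta E
    return E + delta_E

n, d_E, d_k, h = 5, 8, 4, 10                             # 10 heads, like the example above
E = np.random.randn(n, d_E)
out = multi_head_attention(E,
                           np.random.randn(h, d_E, d_k),
                           np.random.randn(h, d_E, d_k),
                           np.random.randn(h, d_E, d_E))
print(out.shape)                                         # (5, 8), same as E
```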