Attention
Alright so imagine we have $X \in \mathbb{R}^{n \times d}$, where $n$ is the number of words in a sentence and $d$ is the dimension of the embeddings (usually 12,288 or smth like that). Each row $x_i$ is the word embedding of one word.
Attention is basically the set of changes we make to each embedding so that it picks up the relevant contextual information. So, we do the following. The idea is to check how words correlate to each other. We do that by making a query vector $q_i$ from each embedding, which "asks" the key vectors $k_j$ of the other embeddings through a dot product; how correlated a query is with a key tells us how related the word behind the query is to the word behind the key. So, $q_4$ could have a large activation with $k_2$. How do we do this? We take the embedding $x_i$ and multiply it by some weights $W_Q$, which gives us $q_i = W_Q x_i$ (and likewise $k_i = W_K x_i$). So, we have $W_Q$, $W_K$, and $W_V$ (the last one not really, but we'll get to that in a minute), which are learnable parameters. In one head of attention we compute the queries and keys of all embeddings, dot every query with every key in the sentence, and then, after normalizing the outputs, use one more matrix to turn them into "how much should we change the embedding".
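Just to make the shapes concrete, here's a tiny NumPy sketch of the query/key step (the sizes and the random "weights" are made up for illustration; embeddings are stored as rows here):

```python
import numpy as np

n, d, d_k = 5, 16, 8                 # toy sizes: 5 words, 16-dim embeddings, 8-dim queries/keys
rng = np.random.default_rng(0)

X = rng.normal(size=(n, d))          # row i is the embedding x_i of word i
W_Q = rng.normal(size=(d, d_k))      # learned in a real model; random here
W_K = rng.normal(size=(d, d_k))

Q = X @ W_Q                          # row i is the query q_i
K = X @ W_K                          # row i is the key   k_i

scores = K @ Q.T                     # scores[j, i] = k_j . q_i  (rows = keys, columns = queries)
print(scores.shape)                  # (5, 5)
```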
This is the dot product of all queries with all keys, but the problem with this idea is that the resulting matrix also contains the dot product of each query with the keys of the words that come after it, which makes no sense. What I mean by this is that a word's contextual semantics aren't given by the words after it, but by the words before it. So in a sentence like "The red dog is eating", "dog"'s context is "The red", not "is eating". So, $q_3$ should only dot product with $k_1$, $k_2$, and $k_3$. If we were to make a table of this, it would look like this (the ., 0, and O are basically circles whose sizes are analogous to the correlations, but in reality they're real values):
|    | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 |
|----|----|----|----|----|----|----|----|
| K1 | O  | .  | 0  | O  | .  | 0  | .  |
| K2 |    | O  | .  | 0  | 0  | .  | 0  |
| K3 |    |    | O  | .  | 0  | O  | .  |
| K4 |    |    |    | 0  | .  | 0  | .  |
| K5 |    |    |    |    | .  | 0  | .  |
| K6 |    |    |    |    |    | .  | 0  |
| K7 |    |    |    |    |    |    | .  |
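In code, the triangle in that table is just the condition "key index ≤ query index"; a minimal sketch:

```python
import numpy as np

n = 7
j = np.arange(n)[:, None]        # key index   (rows K1..K7)
i = np.arange(n)[None, :]        # query index (columns Q1..Q7)

allowed = (j <= i)               # k_j is only dotted with q_i when word j comes at or before word i
print(allowed.astype(int))       # upper-triangular pattern of 1s, like the table above
```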
Alright so, this will give us a new matrix whose columns we have to normalize. What we will do is mask the open values, the correlations we didn't calculate, and set them to $-\infty$, and then apply a softmax, in such a way that the columns always sum to one. Soooo
|    | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 |
|----|----|----|----|----|----|----|----|
| K1 | O | . | 0 | O | . | 0 | . |
| K2 | $-\infty$ | O | . | 0 | 0 | . | 0 |
| K3 | $-\infty$ | $-\infty$ | O | . | 0 | O | . |
| K4 | $-\infty$ | $-\infty$ | $-\infty$ | 0 | . | 0 | . |
| K5 | $-\infty$ | $-\infty$ | $-\infty$ | $-\infty$ | . | 0 | . |
| K6 | $-\infty$ | $-\infty$ | $-\infty$ | $-\infty$ | $-\infty$ | . | 0 |
| K7 | $-\infty$ | $-\infty$ | $-\infty$ | $-\infty$ | $-\infty$ | $-\infty$ | . |
And then softmax. Also, it's numerically convenient to divide everything by $\sqrt{d_k}$ (the square root of the dimension of the key vectors). Sooo, writing $Q$ and $K$ for the matrices whose columns are the $q_i$ and $k_i$, we're now here:

$$A = \operatorname{softmax}\!\left(\frac{K^{\top} Q}{\sqrt{d_k}}\right),$$

where the softmax is applied down each column (with the masked entries set to $-\infty$).
YUHHHH
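Continuing the toy sketch from before (reusing the made-up `Q` and `K`), the mask, the $\sqrt{d_k}$ scaling, and the column-wise softmax together look roughly like this:

```python
import numpy as np

def attention_pattern(Q, K):
    """Columns are queries, rows are keys; each column sums to 1."""
    n, d_k = Q.shape
    scores = (K @ Q.T) / np.sqrt(d_k)            # scores[j, i] = k_j . q_i / sqrt(d_k)

    j = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    scores = np.where(j <= i, scores, -np.inf)   # mask: a query never looks at later keys

    scores -= scores.max(axis=0, keepdims=True)  # numerical stability trick for softmax
    weights = np.exp(scores)                     # exp(-inf) becomes 0, killing the masked entries
    return weights / weights.sum(axis=0, keepdims=True)

A = attention_pattern(Q, K)
print(A.sum(axis=0))                             # every column sums to 1
```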
Now we have the attention pattern, but how do we know how to change the embedding? We can't simply sum this up; the pattern only says how much each earlier word matters, not what to add. SOOOO, enter the value matrix. Just like what we did to calculate $q_i$ and $k_i$, we define $v_i = W_V x_i$. Cool, and weighting the values by the attention pattern gives us what the change to each embedding should be: $\Delta x_i = \sum_j A_{ji}\, v_j$. Again, $W_Q$, $W_K$, and $W_V$ are trainable parameters. So, this gives us one head of attention.
YUHHHH
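As a sketch of the value step, reusing `X`, `Q`, `K`, `rng`, and `attention_pattern` from the snippets above (again, `W_V` is just a random stand-in for a learned matrix):

```python
W_V = rng.normal(size=(d, d))    # value matrix: maps an embedding to "what to add if you attend to me"
V = X @ W_V                      # row j is the value v_j

A = attention_pattern(Q, K)      # A[j, i] = how much word i attends to word j
delta_X = A.T @ V                # row i is delta_x_i = sum_j A[j, i] * v_j
X_new = X + delta_X              # one head's update to every embedding
```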
So, what is multi-head attention? It's just doing this step multiple times in parallel, and summing all the $\Delta x_i$. So, a layer of 10 heads would have 10 $W_Q$'s, 10 $W_K$'s, and 10 $W_V$'s. And for all of those we would do

$$x_i^{\text{new}} = x_i + \sum_{h=1}^{10} \Delta x_i^{(h)},$$

where $\Delta x_i^{(h)}$ is the output of head $h$, and the stacked $W_Q$, $W_K$, and $W_V$ would be tensors (one matrix per head) rather than only matrices. And this is only for one embedding; we apply the same thing to all embeddings. But lucky for us, the weight matrices are shared across all the embeddings, not one weight matrix per embedding.
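And a rough sketch of that multi-head sum, again reusing the toy `X`, `d`, `d_k`, `rng`, and `attention_pattern` from earlier (real implementations usually concatenate the head outputs and apply one output projection, which works out to the same kind of per-head sum):

```python
import numpy as np

n_heads = 10
heads = [
    {
        "W_Q": rng.normal(size=(d, d_k)),
        "W_K": rng.normal(size=(d, d_k)),
        "W_V": rng.normal(size=(d, d)),
    }
    for _ in range(n_heads)
]

delta_total = np.zeros_like(X)
for h in heads:
    Qh, Kh, Vh = X @ h["W_Q"], X @ h["W_K"], X @ h["W_V"]
    Ah = attention_pattern(Qh, Kh)   # this head's attention pattern
    delta_total += Ah.T @ Vh         # add this head's delta_x_i for every word i

X_new = X + delta_total              # embeddings updated by all 10 heads at once
```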