The original Transformer architecture was introduced in the paper "Attention Is All You Need" by Vaswani et al. in 2017. It has since become the foundation for many state-of-the-art natural language processing (NLP) models, including BERT, GPT-3, and T5.
1.1.1 Terminology
Self-attention: an attention mechanism relating different positions of a single sequence to compute a representation of the sequence.
Q, K, V matrices: queries, keys, and values. The output is computed as a weighted sum of the values. All three are key components of the attention function. The following formula is known as scaled dot-product attention.
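\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V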
The dot products grow in variance as the dimension of q and k increases: if the components of q and k are independent with mean 0 and variance 1, then their dot product has variance d_k. To counteract this, the dot products are scaled by \frac{1}{\sqrt{d_k}}, which keeps their variance close to that of q and k themselves and avoids pushing the softmax into regions with very small gradients.
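To make the terminology above concrete, here is a minimal NumPy sketch of scaled dot-product attention (the function name, toy shapes, and random inputs are illustrative, not from the paper). For self-attention, the same sequence supplies Q, K, and V.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal scaled dot-product attention.

    Q, K: arrays of shape (seq_len, d_k); V: array of shape (seq_len, d_v).
    """
    d_k = Q.shape[-1]
    # Raw attention scores, scaled by 1/sqrt(d_k) so their variance
    # does not grow with the dimension of q and k.
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # The output is a weighted sum of the values.
    return weights @ V

# Example: self-attention over a toy sequence of 4 tokens with d_k = d_v = 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)  # Q = K = V for self-attention
print(out.shape)  # (4, 8)
```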