
Cosine Similarity vs Dot Product in Attention Mechanisms
For comparing the hidden states between the encoder and decoder, we need a similarity score. Two common approaches to calculate this are:

- Cosine similarity
- Dot product

Cosine Similarity

It performs a dot product on the vectors and then normalizes the result by the product of their magnitudes, so the score always falls between -1 and 1.

Example:

- Encoder output: [-0.76, 0.75]
- Decoder output: [0.91, 0.38]
- Cosine similarity ≈ -0.39

Interpreting the score:

- Close to 1 → very similar → strong attention
- Close to 0 → not related
- Negative → opposite → low attention

This is useful when:

- Values can vary a lot in size
- You want a consistent scale (-1 to 1)

The problem is that it’s a bit expensive. It requires extra calculations (division, square roots), and in attention we don’t always need that.

Dot Product

The dot product is much simpler. It does the following:

1. Multiply corresponding values
2. Add them up

Example:

(-0.76 × 0.91) + (0.75 × 0.38) ≈ -0.41

Dot product is preferred in attention because:

- It’s fast
- It’s simple
- It gives good relative scores

Even if the numbers are not normalized, the model can still learn from the relative scores.
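As a minimal sketch, the two scores above can be reproduced in a few lines of Python. The `encoder`/`decoder` vectors are the example values from this article; the function names are illustrative, not from any particular library.

```python
import math

# Example hidden states from the article above.
encoder = [-0.76, 0.75]
decoder = [0.91, 0.38]

def dot(a, b):
    """Multiply corresponding values and add them up."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product divided by the product of the vectors' magnitudes,
    which keeps the score in the range -1 to 1."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

print(round(dot(encoder, decoder), 2))                # -> -0.41
print(round(cosine_similarity(encoder, decoder), 2))  # -> -0.39
```

Note that cosine similarity is just the dot product with an extra normalization step, which is exactly the cost that attention implementations usually skip.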



