Cosine Similarity vs Dot Product in Attention Mechanisms

via Dev.to, by Rijul Rajesh

For comparing the hidden states between the encoder and the decoder, we need a similarity score. Two common ways to compute it are:

- Cosine similarity
- Dot product

Cosine Similarity

Cosine similarity performs a dot product on the vectors and then normalizes the result by the vectors' magnitudes.

Example:

- Encoder output: [-0.76, 0.75]
- Decoder output: [0.91, 0.38]
- Cosine similarity ≈ -0.39

Interpreting the score:

- Close to 1 → very similar → strong attention
- Close to 0 → not related
- Negative → opposite → low attention

This is useful when:

- Values can vary a lot in size
- You want a consistent scale (-1 to 1)

The drawback is that it is a bit expensive: it requires extra calculations (division, square roots), and in attention we don't always need that.

Dot Product

The dot product is much simpler. It does the following:

- Multiply corresponding values
- Add them up

Example:

(-0.76 × 0.91) + (0.75 × 0.38) ≈ -0.41

The dot product is preferred in attention because:

- It's fast
- It's simple
- It gives good relative scores

Even if the numbers are not normalized, the model c
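The two scores above can be reproduced with a short sketch. This is not the article's code, just a minimal illustration of both formulas applied to the example encoder and decoder vectors; the function names are my own.

```python
import math

def dot(a, b):
    """Multiply corresponding values and add them up."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product normalized by the vector magnitudes (range -1 to 1)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

encoder = [-0.76, 0.75]  # example encoder output from the article
decoder = [0.91, 0.38]   # example decoder output from the article

print(round(dot(encoder, decoder), 2))                # -0.41
print(round(cosine_similarity(encoder, decoder), 2))  # -0.39
```

Note how the raw dot product (-0.41) and the normalized cosine score (-0.39) are close here only because both vectors happen to have magnitudes near 1; for unnormalized vectors the two can diverge widely.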

Continue reading on Dev.to


