
Cosine Similarity vs Dot Product in Attention Mechanisms
For comparing the hidden states between the encoder and decoder, we need a similarity score. Two common approaches to calculate this are:

- Cosine similarity
- Dot product

Cosine Similarity

It performs a dot product on the vectors and then normalizes the result by the product of their magnitudes, so the score always falls between -1 and 1.

Example:

- Encoder output: [-0.76, 0.75]
- Decoder output: [0.91, 0.38]
- Cosine similarity ≈ -0.39

Interpreting the score:

- Close to 1 → very similar → strong attention
- Close to 0 → not related
- Negative → opposite → low attention

This is useful when:

- Values can vary a lot in size
- You want a consistent scale (-1 to 1)

The problem is that it’s a bit expensive. It requires extra calculations (division, square roots), and in attention we don’t always need that.

Dot Product

The dot product is much simpler. It does the following:

1. Multiply corresponding values
2. Add them up

Example:

(-0.76 × 0.91) + (0.75 × 0.38) ≈ -0.41

Dot product is preferred in attention because:

- It’s fast
- It’s simple
- It gives good relative scores

Even if the numbers are not normalized, the model can still learn from the relative scores.
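As a minimal sketch, the two scores above can be reproduced in a few lines of Python. The `encoder`/`decoder` vectors are the example values from this article; the function names are illustrative, not from any particular library.

```python
import math

# Example hidden states from the article above.
encoder = [-0.76, 0.75]
decoder = [0.91, 0.38]

def dot(a, b):
    """Multiply corresponding values and add them up."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product divided by the product of the vectors' magnitudes,
    which keeps the score in the range -1 to 1."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

print(round(dot(encoder, decoder), 2))                # -> -0.41
print(round(cosine_similarity(encoder, decoder), 2))  # -> -0.39
```

Note that cosine similarity is just the dot product with an extra normalization step, which is exactly the cost that attention implementations usually skip.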



