
Understanding Attention Mechanisms – Part 1: Why Long Sentences Break Encoder–Decoders
In the previous articles, we covered Seq2Seq models. Now, on the path toward transformers, we need to understand one more concept before we get there: Attention.

The encoder in a basic encoder–decoder, by unrolling its LSTMs over the input, compresses the entire input sentence into a single context vector. This works fine for short phrases like "Let's go". But if we had a bigger input vocabulary with thousands of words, we could feed in longer and more complicated sentences, like "Don't eat the delicious-looking and smelling pasta". For longer phrases, even with LSTMs, words that were input early on can be forgotten. In this case, if we forget the first word, "Don't", the sentence becomes "eat the delicious-looking and smelling pasta": the opposite of what was meant. So sometimes it is crucial to remember the first word.

Basic RNNs struggled with long-term memory because they ran both long- and short-term information through a single path. The main idea of Long Short-Term Memory (LSTM) units is that they solve this problem by splitting memory into two paths: a cell state that carries long-term information and a hidden state that carries short-term information.
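To make the bottleneck concrete, here is a minimal sketch (not code from this series) of a toy RNN encoder that squeezes the whole example sentence into one fixed-size context vector. The weights, sizes, and embeddings are all illustrative assumptions; a real encoder would use trained LSTM parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, embed_size = 4, 3

# Hypothetical embeddings for the 7-token example sentence
tokens = ["Don't", "eat", "the", "delicious-looking",
          "and", "smelling", "pasta"]
embeddings = rng.normal(size=(len(tokens), embed_size))

# Random weights stand in for trained parameters (illustration only)
W_xh = rng.normal(size=(embed_size, hidden_size))
W_hh = rng.normal(size=(hidden_size, hidden_size))

def encode(seq):
    """Run a plain tanh-RNN over seq and return only the FINAL state."""
    h = np.zeros(hidden_size)
    for x in seq:
        # Each step overwrites h, so early words ("Don't") fade away
        h = np.tanh(x @ W_xh + h @ W_hh)
    return h  # the single context vector handed to the decoder

context = encode(embeddings)
print(context.shape)  # 7 words squeezed into a vector of 4 numbers
```

However long the sentence grows, `context` stays the same size; that fixed capacity is exactly what attention will later relax by letting the decoder look back at every encoder state.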