
From Counting Words to Learning Meaning
TF-IDF, Cosine Similarity, and Word2Vec

By the end of this post, you'll understand two fundamentally different ways of representing words as vectors: sparse count-based vectors from information retrieval, and dense learned vectors from Word2Vec. You'll know how cosine similarity measures word closeness, how the skip-gram algorithm learns embeddings by training and then discarding a binary classifier, and why the resulting vectors can solve analogies like king - man + woman ≈ queen without anyone teaching the algorithm what "gender" or "royalty" means. You'll also understand why these embeddings inherit the biases of their training data, and what the difference is between static embeddings (one vector per word) and contextual embeddings (one vector per word per sentence).

Two ideas connect everything here. First: you can represent a word's meaning by the company it keeps. Second: predicting context is a better way to learn meaning than counting context. Those two ideas took NLP from sp
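The cosine similarity mentioned above can be sketched in a few lines of plain Python. The count vectors below are made-up toy values over a four-term vocabulary, not data from the post:

```python
import math

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy sparse count vectors: each slot is how often a vocabulary term occurs.
doc_a = [3, 0, 1, 0]
doc_b = [1, 0, 2, 0]
doc_c = [0, 4, 0, 1]

print(cosine_similarity(doc_a, doc_b))  # shared terms -> high similarity
print(cosine_similarity(doc_a, doc_c))  # no shared terms -> 0.0
```

Because cosine looks only at the angle between vectors, a long document and a short one about the same topic still score as similar; that is why it is preferred over raw Euclidean distance for count vectors.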
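The skip-gram idea of "the company a word keeps" starts with generating (center, context) training pairs from a sliding window. Here is a minimal sketch of that first step only; the actual algorithm then trains (and discards) a binary classifier over such pairs, which is omitted here:

```python
def skipgram_pairs(tokens, window=2):
    # For each position, pair the center word with every word
    # within `window` positions to its left and right.
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the cat sat on the mat".split()
for pair in skipgram_pairs(sentence, window=1):
    print(pair)  # e.g. ('cat', 'the'), ('cat', 'sat'), ...
```

Every pair becomes a positive training example ("these two words co-occur"); negative examples are drawn by pairing the center word with random vocabulary words.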
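The analogy arithmetic can be demonstrated with hand-picked 2-d toy vectors chosen so the arithmetic works out; real Word2Vec embeddings are learned and have hundreds of dimensions, but the nearest-neighbor lookup is the same:

```python
import math

# Hand-crafted toy embeddings: dimension 0 loosely encodes "royalty",
# dimension 1 loosely encodes "male-ness". Illustrative values only.
emb = {
    "king":  [0.9, 0.8],
    "queen": [0.9, 0.2],
    "man":   [0.1, 0.8],
    "woman": [0.1, 0.2],
    "apple": [0.4, 0.9],
}

def nearest(vec, emb, exclude):
    # Return the word whose embedding has the highest cosine
    # similarity to vec, skipping the analogy's input words.
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))
    return max((w for w in emb if w not in exclude), key=lambda w: cos(vec, emb[w]))

# king - man + woman, component-wise
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]
print(nearest(target, emb, exclude={"king", "man", "woman"}))  # -> queen
```

The point the analogy illustrates is that directions in the space become meaningful: subtracting "man" removes the male component, adding "woman" puts the female component back, and the nearest remaining vector is "queen".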



