
Finding meaning in text, an experiment in document clustering
Problem

For an assignment in the University of British Columbia's CPSC330 course in Applied Machine Learning, we were tasked with categorizing titles pulled from a sample of Food.com recipes. The goal was simple: use a subset of the bank's 180,000+ recipes to find categories of recipes based purely on their titles. Achieving that goal was the real challenge, because the data is text, and text forces many extra considerations in the modeling process.

The data

From our sample of recipes, we pulled a smaller subset of data consisting of 9100 words. We did this by removing duplicate entries, NaNs, and short names (< 5 characters), and by keeping only observations with tags that were among the top 300 tags in our sample. Below we can see what this unprocessed data of title names looks like, and a visualization of the words within the dataset.

Index  Recipe Name
42     i yam what i yam muffins
101    to your health muffins
129    250 00 chocolate chip cookies
138    lplermagronen
163    california ro
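The filtering steps described above (dropping missing names, short names, and duplicates, then keeping rows tagged with one of the most frequent tags) can be sketched in plain Python. This is a minimal illustration, not the code used in the assignment: the field names `name` and `tags`, the helper `filter_recipes`, and the toy tags are all hypothetical, and it assumes a row is kept when at least one of its tags is in the top set, with tag counts taken after the other filters.

```python
import math
from collections import Counter


def is_missing(value):
    """True for None or a float NaN, the usual markers for a missing value."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def filter_recipes(recipes, min_name_len=5, top_n_tags=300):
    """Apply the four filters to a list of recipe dicts (hypothetical schema)."""
    # 1. Drop rows whose name is missing.
    rows = [r for r in recipes if not is_missing(r.get("name"))]

    # 2. Drop short names (fewer than min_name_len characters).
    rows = [r for r in rows if len(r["name"]) >= min_name_len]

    # 3. Drop duplicate names, keeping the first occurrence.
    seen = set()
    unique = []
    for r in rows:
        if r["name"] not in seen:
            seen.add(r["name"])
            unique.append(r)

    # 4. Keep rows with at least one tag among the top_n_tags most frequent
    #    tags (assumption: "any tag" rather than "all tags").
    counts = Counter(tag for r in unique for tag in r["tags"])
    top_tags = {tag for tag, _ in counts.most_common(top_n_tags)}
    return [r for r in unique if any(t in top_tags for t in r["tags"])]


# Toy records: names echo the sample table, tags are invented for illustration.
recipes = [
    {"name": "to your health muffins", "tags": ["muffins", "breakfast"]},
    {"name": "to your health muffins", "tags": ["muffins", "breakfast"]},  # duplicate
    {"name": "i yam what i yam muffins", "tags": ["muffins", "breakfast"]},
    {"name": "soup", "tags": ["soups"]},  # shorter than 5 characters
    {"name": float("nan"), "tags": ["muffins"]},  # missing name
    {"name": "250 00 chocolate chip cookies", "tags": ["cookies"]},  # rare tag
]

kept = filter_recipes(recipes, top_n_tags=2)
for r in kept:
    print(r["name"])
```

With `top_n_tags=2`, only the two muffin recipes survive in this toy example, since `cookies` is the least frequent tag. On the real data the same function would be called with the default of 300.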



