
I Read One Paper and Ended Up Swapping Visual AI Models 3 Times
One day I stumbled across a paper called ShowUI: a vision model that looks at screenshots and understands UI elements. "That sounds fun," I thought. That curiosity led to three model swaps, an accessibility app concept, and a project I never shipped.

## 🧪 It Started with a Paper

I came across ShowUI-2B by OpenBMB. Feed it a screenshot, and it detects buttons, text fields, icons, all the UI elements on screen. A vision model purpose-built for understanding interfaces.

"I could build something with this." That thought started everything.

## Testing Reality: Underwhelming

When I actually ran it, the results didn't match the paper. On Korean-language UIs, especially heavily styled sites with custom CSS, it performed badly. It couldn't even locate the username and password input fields. Not "low accuracy": it couldn't find them at all. Maybe 1 success out of 10 attempts.

The model was also 4.7 GB, which is not small. The testing environment was painful too. I couldn't set up a proper GPU environment, so I force-q
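The core interface is simple: screenshot in, UI element location out. As a rough sketch of how that output might be consumed, assuming the model returns click points normalized to [0, 1] (the convention ShowUI-style grounding models typically use; the helper name here is my own), mapping to pixels is just a scale:

```python
# Hypothetical helper: convert a grounding model's normalized [x, y]
# click point into pixel coordinates on the original screenshot.
# Assumes the model emits coordinates scaled to [0, 1].

def to_pixels(point, width, height):
    """Map a normalized (x, y) point to integer pixel coordinates."""
    x, y = point
    if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        raise ValueError(f"expected normalized coords, got {point}")
    return round(x * width), round(y * height)

# e.g. a model answer of [0.5, 0.25] on a 1920x1080 screenshot
print(to_pixels((0.5, 0.25), 1920, 1080))  # (960, 270)
```

From there you could crop the region, drive a click with an automation tool, or overlay a marker for debugging, which is roughly what my tests did by hand.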
Continue reading on Dev.to




