The Roman Nepali Embedding Problem
I spelled the same Nepali word four different ways and asked four open-source embedding models whether the spellings meant the same thing. The model with the prettiest-looking cosine gap wasn’t the one that actually worked, and a twenty-line preprocessing script beat all four of them without touching a single weight. This post is a small experiment with a strong conclusion: if you are shipping NLP for Nepali users today, the best thing you can reach for is not a bigger model but a regex. ...
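To make the "regex instead of a bigger model" idea concrete, here is a minimal sketch of what such a preprocessing step could look like. It is not the script from this post: the function name `normalize_roman_nepali`, the specific rules, and the example word "dhanyabad" are all assumptions chosen for illustration, and the rules are deliberately lossy.

```python
import re

# Hypothetical normalization rules for romanized Nepali.
# These are illustrative assumptions, not the post's actual script.
_RULES = [
    (re.compile(r"[^a-z\s]"), ""),        # drop digits and punctuation
    (re.compile(r"([a-z])\1+"), r"\1"),   # collapse doubled letters: "aa" -> "a"
    (re.compile(r"[vw]"), "b"),           # treat v / w / b as one sound (lossy)
    (re.compile(r"\s+"), " "),            # squeeze runs of whitespace
]


def normalize_roman_nepali(text: str) -> str:
    """Map spelling variants of a romanized word onto one canonical form."""
    out = text.lower()
    for pattern, replacement in _RULES:
        out = pattern.sub(replacement, out)
    return out.strip()


if __name__ == "__main__":
    # Four plausible spellings of one word (an example picked for this sketch).
    for variant in ["dhanyabad", "dhanyabaad", "dhanyawad", "Dhanyavaad"]:
        print(variant, "->", normalize_roman_nepali(variant))
```

Run before embedding, a step like this means the model sees one string instead of four, so the cosine similarity between "variants" is trivially 1.0 regardless of which model sits behind it. The trade-off is that aggressive rules can merge words that really are different, so any real rule set would need checking against a word list.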