The Roman Nepali Embedding Problem

The Roman Nepali Embedding Problem I spelled the same Nepali word four different ways and asked four open-source embedding models whether the spellings meant the same thing. The model with the prettiest-looking cosine gap wasn’t the one that actually worked — and a twenty-line preprocessing script beat all four of them without touching a single weight. This post is a small experiment with a strong conclusion: if you are shipping NLP for Nepali users today, the best thing you can do is not a bigger model — it’s a regex. ...

April 21, 2026 · 11 min · Anil Paudel