<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>NLP on Blogs by Anil</title>
    <link>https://paudelanil9.com.np/tags/nlp/</link>
    <description>Recent content in NLP on Blogs by Anil</description>
    <generator>Hugo -- 0.138.0</generator>
    <language>en</language>
    <lastBuildDate>Tue, 21 Apr 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://paudelanil9.com.np/tags/nlp/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>The Roman Nepali Embedding Problem</title>
      <link>https://paudelanil9.com.np/posts/romannepali_embedding_1/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://paudelanil9.com.np/posts/romannepali_embedding_1/</guid>
      <description>
&lt;p&gt;&lt;strong&gt;I spelled the same Nepali word four different ways and asked four open-source embedding models whether the spellings meant the same thing. The model with the prettiest-looking cosine gap wasn&amp;rsquo;t the one that actually worked — and a twenty-line preprocessing script beat all four of them without touching a single weight.&lt;/strong&gt;&lt;/p&gt;
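&lt;p&gt;To make that concrete, here is a minimal sketch of the kind of normalizer the experiment points at. Both the rules and the example word are hypothetical stand-ins rather than the actual twenty-line script; the point is only that a handful of regex substitutions can collapse Roman Nepali spelling variants into one canonical form before any embedding model sees them.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import re

# Hypothetical rules for illustration only, not the actual script:
# each one collapses a common Roman Nepali spelling variation.
RULES = [
    (re.compile(r'aa+'), 'a'),      # long vowels: raamro becomes ramro
    (re.compile(r'ii+|ee+'), 'i'),  # meetho becomes mitho
    (re.compile(r'oo+|uu+'), 'u'),
    (re.compile(r'chh'), 'ch'),     # aspiration is spelled inconsistently
    (re.compile(r'[wv]'), 'b'),     # w, v and b are used interchangeably
]

def normalize(text):
    # Lowercase first, then apply each substitution over the whole string.
    text = text.lower()
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text

# Four hypothetical spellings of the same word all collapse to one form.
print({normalize(w) for w in ['chha', 'cha', 'chaa', 'chhaa']})  # {'cha'}
&lt;/code&gt;&lt;/pre&gt;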
&lt;p&gt;This post is a small experiment with a strong conclusion: if you are shipping NLP for Nepali users today, the best thing you can do is not a bigger model — it&amp;rsquo;s a regex.&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
