A Khmer-first RAG pipeline

Khmer has about seventeen million speakers, almost no dedicated LLM infrastructure, and a script that breaks most default tokenizers. Building RAG in Khmer means working around these things, not pretending they don't exist.

If you copy an English RAG tutorial and swap the corpus, you'll get a demo that works on short queries and falls apart on anything real. Here's what the real pipeline looks like — សួស្តី, and welcome.

Problem one — Khmer doesn't have word boundaries

Khmer is written without spaces between words. Your favourite splitter, which assumes whitespace, will produce chunks that are either one giant run-on or random character slices. Retrieval quality craters before you've even started.

Use a real segmenter. khmer-nltk in Python is the current default; a handful of transformer-based word-segmentation models exist if you need higher accuracy. Treat segmentation as an upstream step with its own eval — segmentation errors dominate retrieval errors more than any embedding choice you'll make later.

Problem two — embeddings that actually understand Khmer

OpenAI's text-embedding-3 does OK on Khmer but isn't great. It was trained on mostly English data; the representations are thinner and closer together, which hurts retrieval precision.

Multilingual-E5 is a better open default — trained explicitly across many languages including Khmer, and free to self-host. Cohere's multilingual embeddings are excellent but paid; worth benchmarking if your corpus is large enough to justify the spend.

Benchmark on your actual corpus with your actual queries. Don't trust generic multilingual leaderboards — they rarely include Khmer at all, and when they do, it's out-of-domain data that tells you nothing about how the model will perform on, say, agricultural extension documents or government circulars.

Chunk shape matters more in Khmer

Because tokenization is already lossy, you want larger chunks than you would in English. Our rule of thumb: 400–600 tokens per chunk, 100–150 token overlap. In English we'd say 200–400 with 50 overlap. The extra context compensates for segmentation errors and thinner embeddings.

Chunk on semantic boundaries where you can — paragraph or section breaks. Fixed-window chunking is a fallback, not a default.

The translation fallback

If Khmer embedding quality is genuinely blocking your product, translate everything to English at ingestion, embed the English, and retrieve against the English index. Keep the Khmer original for display and for the final LLM prompt.

It's not pure. It adds translation cost per document. You lose some Khmer-specific nuance in retrieval. But it ships, and for many products that's the right call for v1. Re-evaluate when multilingual embeddings catch up.

Evaluation is harder in Khmer

There are no public Khmer QA datasets that match your domain. You have to build your own eval set. Fifty to a hundred question/answer pairs from real user queries is usually enough to spot regressions — spend an afternoon with a bilingual collaborator, not a quarter with a contractor.

For scoring, use a judge model with a system prompt that explicitly evaluates meaning rather than literal overlap. Khmer has many ways to phrase the same answer — a strict string match will flag correct responses as wrong.

Khmer-specific gotchas

Digits — Khmer digits (០–៩) and Arabic digits (0–9) both appear in real documents. Normalise to one form before indexing or you'll miss obvious matches.
Homographs — some Latin characters render identically to Khmer ones. Handy for attackers who want to bypass filters; painful for your precision metrics. Normalise unicode aggressively.
Loanword spellings — terms like "AI", "API", "Vercel" appear in both Latin and Khmer transliteration. Index both. Users type whichever they typed last.
Zero-width space (U+200B) — extremely common in Khmer text for line-break hinting. Strip during preprocessing or your tokens and your queries won't line up.

Why this matters for PhnomShip

Serving Khmer-speaking users well is a moat. Most AI products deployed in Cambodia today are English-only — either because the builders don't know where to start, or because their providers made Khmer too hard. The teams that handle Khmer with care will win the Cambodian market.

Every gotcha on this list is a small piece of that moat. Build them in now, while your corpus is small and your eval set is new, and you'll compound an advantage that's hard to copy.

Tags · guide · khmer · rag · cambodia · embeddings