

tgv on Feb 12, 2022 | on: Lingua-Go, the most accurate language detection fo...

Overlap with those is not particularly large. n-grams are particularly sensitive to spelling: for example, Afrikaans writes "Hy het skool toe gegaan", whereas it would be "Hij is naar school gegaan" in Dutch. Indonesian and Malay insert vowels in consonant clusters and replace quite a few consonants, so they should be easily distinguishable from Dutch, even on Dutch loan words (which are not that frequent anyway). Dutch has a much larger overlap with German (probably the largest), but even those two can be distinguished (by a human) with just a few words of a meaningful sentence. I find it difficult to come up with three words that could be a grammatical fragment in both languages, and even then I expect the n-gram frequencies to diverge considerably.
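To make the spelling-sensitivity point concrete, here is a minimal sketch that compares the character trigrams of the two example sentences. It is only an illustration: the helper names `char_ngrams` and `jaccard` are made up for this example, and real detectors such as Lingua-Go score n-grams against trained frequency models rather than computing a set overlap between two sentences.

```python
def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Collect the character n-grams of every word, padding each word with spaces."""
    grams = set()
    for word in text.lower().split():
        padded = f" {word} "  # the padding marks word boundaries
        for i in range(len(padded) - n + 1):
            grams.add(padded[i:i + n])
    return grams


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# The two sentences quoted in the comment above.
AFRIKAANS = "Hy het skool toe gegaan"
DUTCH = "Hij is naar school gegaan"

af = char_ngrams(AFRIKAANS)
nl = char_ngrams(DUTCH)

print("shared trigrams:   ", sorted(af & nl))
print("only in Afrikaans: ", sorted(af - nl))
print("only in Dutch:     ", sorted(nl - af))
print(f"Jaccard similarity: {jaccard(af, nl):.2f}")
```

Most of the shared trigrams come from "gegaan", the one word spelled identically in both sentences; small spelling differences such as "skool" versus "school" or "hy" versus "hij" already produce trigrams that appear on only one side.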

yorwba on Feb 12, 2022

It's on single-word detection where the accuracy for Afrikaans and Dutch is between 50% and 60%. Understandable, considering "het" and "toe" are also Dutch words and "is" is also Afrikaans.
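A toy sketch of why a single word can be ambiguous: the two word sets below are tiny, hand-picked assumptions built only from the words mentioned in this thread, and the function `candidate_languages` is hypothetical. This is not how Lingua-Go works internally; it only shows that an isolated word like "het" gives a detector nothing to separate the two languages.

```python
# Hand-picked toy vocabularies (an assumption for illustration only).
AFRIKAANS_WORDS = {"hy", "het", "skool", "toe", "gegaan", "is"}
DUTCH_WORDS = {"hij", "is", "naar", "school", "gegaan", "het", "toe"}


def candidate_languages(word: str) -> list[str]:
    """Return every toy vocabulary that contains the word."""
    w = word.lower()
    langs = []
    if w in AFRIKAANS_WORDS:
        langs.append("Afrikaans")
    if w in DUTCH_WORDS:
        langs.append("Dutch")
    return langs


for word in "Hy het skool toe gegaan".split():
    print(f"{word!r}: {candidate_languages(word)}")
```

With a whole sentence, the unambiguous words ("hy", "skool") settle the question; with one word at a time, "het", "toe", and "gegaan" remain a coin flip.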

I meant that Indonesian and Malay would be difficult to distinguish from each other.
