LLM tokens and foreign languages

TL;DR: If you use a language other than English, you pay more per word sent to or received from an LLM. If you don’t use the Latin alphabet, you pay much more.

Here’s why. LLMs don’t charge by the word; they charge per token received and returned. Tokens are generated statistically, based on the frequency of byte combinations in the training material (explanation link).

Tokenizers operate on UTF-8 bytes, and UTF-8 produces shorter sequences for the Latin alphabet. Non-Latin characters are encoded as 2 or even 3 bytes, so for languages like Hebrew and Japanese you may get more tokens than characters. In fact, transliterating Russian or Hebrew into the Latin alphabet reduces the number of tokens by a factor of 2 in my small example. I used the Hugging Face tokenizer playground with the GPT-4 setting. Claude produces different results, but the overall picture is similar.
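
To see the byte-count effect for yourself, here is a rough Python sketch (the sentences are the same ones used in the table below) that just measures UTF-8 length:

```python
# Latin letters take 1 byte in UTF-8, Cyrillic and Hebrew letters take 2,
# and CJK characters take 3, so non-Latin sentences inflate quickly.
for s in ["I met a huge dog",                # English: all 1-byte characters
          "Я встретил огромную собаку",      # Russian: 2 bytes per letter
          "Ya vstretil ogromnuyu sobaku",    # the same sentence transliterated
          "大きな犬に出会った"]:             # Japanese: 3 bytes per character
    print(f"{len(s):>3} chars -> {len(s.encode('utf-8')):>3} UTF-8 bytes: {s}")
```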

The cheapest language to use is English: it uses the Latin alphabet, its words rarely change form, and the majority of the training material is in English. Next come Latin-based languages like Spanish, Icelandic and Polish, with a tokenization efficiency of about 1/2 of English. The tokenization efficiency of non-Latin-based languages like Hebrew and Russian is ~1/3 of English. Chinese and Japanese perform roughly like Polish and Icelandic, at 1/2 of English: on the one hand they use many bytes per character, but on the other hand they need fewer characters per sentence, since their scripts are not phonetic. Hebrew shows the worst character-to-token ratio, probably because relatively little of the training material is in it. Of course, this is based on a very small sample, and the numbers for large texts may differ, but I assume the general trends will hold.

| Language                 | Sentence                     | Characters | UTF-8 bytes | Tokens |
|--------------------------|------------------------------|------------|-------------|--------|
| English                  | I met a huge dog             | 16         | 16          | 5      |
| Spanish                  | Conocí a un perro enorme     | 24         | 25          | 8      |
| Icelandic                | Ég hitti risastóran hund     | 24         | 26          | 10     |
| Polish                   | Spotkałem ogromnego psa      | 23         | 24          | 8      |
| Russian                  | Я встретил огромную собаку   | 26         | 49          | 14     |
| Russian (transliterated) | Ya vstretil ogromnuyu sobaku | 28         | 28          | 11     |
| Hebrew                   | פגשתי כלב ענק                | 13         | 24          | 16     |
| Hebrew (transliterated)  | pgSti klv 3nq                | 13         | 13          | 9      |
| Japanese                 | 大きな犬に出会った           | 9          | 27          | 11     |
| Chinese                  | 我遇见了一只大狗             | 8          | 24          | 11     |
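
If you want to reproduce these counts locally rather than in the playground, here is a minimal Python sketch using OpenAI’s tiktoken library with the cl100k_base encoding (the GPT-4 tokenizer); the token counts may differ slightly from the Hugging Face playground numbers above.

```python
# Sketch: character, UTF-8 byte and token counts for the table's sentences,
# using tiktoken's cl100k_base encoding (the GPT-4 tokenizer).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

sentences = {
    "English": "I met a huge dog",
    "Spanish": "Conocí a un perro enorme",
    "Icelandic": "Ég hitti risastóran hund",
    "Polish": "Spotkałem ogromnego psa",
    "Russian": "Я встретил огромную собаку",
    "Russian (translit.)": "Ya vstretil ogromnuyu sobaku",
    "Hebrew": "פגשתי כלב ענק",
    "Hebrew (translit.)": "pgSti klv 3nq",
    "Japanese": "大きな犬に出会った",
    "Chinese": "我遇见了一只大狗",
}

for lang, text in sentences.items():
    chars = len(text)
    utf8_bytes = len(text.encode("utf-8"))
    tokens = len(enc.encode(text))
    print(f"{lang:<22} {chars:>3} chars {utf8_bytes:>3} bytes {tokens:>3} tokens")
```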


Comments


  1. I wonder if you know anything about DeepSeek. Does it have the same ratio? Was it trained on English data too? Or maybe Chinese gets some preference if the LLM was trained on Chinese data? I mean, it must be the same UTF-8-based calculation (is it?), but the LLM doesn’t always have to translate for reasoning (does it?). These are the questions I’m asking myself. Maybe you know something.


  2. Yes, I know about DeepSeek. According to what I heard, it was not trained in Chinese. It uses “distilled” versions of the Western models, trained in various languages, mostly (but not exclusively) in English.

    An LLM does not need to “translate”; it does not really have a concept of language per se. At its heart, it is a “next word predictor” or, rather, a “next character sequence predictor”, and that works in any language, as long as you can write it down. ChatGPT can converse in languages other than English reasonably well. The extra price I am talking about is not related to the language per se, but to the way it is encoded. Of course, one could invent an encoding other than UTF-8 that favors Chinese, but I doubt DeepSeek actually does that.

