LLM tokens and foreign languages

TL;DR: If you use a language other than English, you pay more per word sent to or received from an LLM. If you don’t use the Latin alphabet, you pay much more.

Here’s why. LLMs don’t charge by the word; they charge per token received and returned. Tokens are generated statistically, based on the frequency of byte combinations in the training material (explanation link).

Tokenizers operate on UTF-8 bytes, and UTF-8 produces shorter sequences for the Latin alphabet. Non-Latin characters are encoded as 2 or even 3 bytes, so for languages like Hebrew and Japanese you may get more tokens than characters. In fact, transliterating Russian or Hebrew into the Latin alphabet reduces the number of tokens by a factor of 2 in my small example. I used the Hugging Face tokenizer playground with the GPT-4 setting. Claude produces different results, but the overall picture is similar.
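
To see the byte-count effect for yourself, here is a rough Python sketch (the sentences are the same ones used in the table below) that just measures UTF-8 length:

```python
# Latin letters take 1 byte in UTF-8, Cyrillic and Hebrew letters take 2,
# and CJK characters take 3, so non-Latin sentences inflate quickly.
for s in ["I met a huge dog",                # English: all 1-byte characters
          "Я встретил огромную собаку",      # Russian: 2 bytes per letter
          "Ya vstretil ogromnuyu sobaku",    # the same sentence transliterated
          "大きな犬に出会った"]:             # Japanese: 3 bytes per character
    print(f"{len(s):>3} chars -> {len(s.encode('utf-8')):>3} UTF-8 bytes: {s}")
```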

The cheapest language to use is English: it uses the Latin alphabet, its words rarely change form, and the majority of the training material is in English. Next come Latin-based languages like Spanish, Icelandic and Polish, with a tokenization efficiency of about 1/2 of English. The tokenization efficiency of non-Latin-based languages like Hebrew and Russian is ~1/3 of English. Chinese and Japanese perform roughly like Polish and Icelandic, at 1/2 of English: on the one hand they use many bytes per character, but on the other hand they need fewer characters per sentence, since their scripts are not phonetic. Hebrew shows the worst character-to-token ratio, probably because relatively little of the training material is in it. Of course, this is based on a very small sample, and the numbers for large texts may differ, but I assume the general trends will hold.

| Language                 | Sentence                     | Characters | UTF-8 bytes | Tokens |
|--------------------------|------------------------------|------------|-------------|--------|
| English                  | I met a huge dog             | 16         | 16          | 5      |
| Spanish                  | Conocí a un perro enorme     | 24         | 25          | 8      |
| Icelandic                | Ég hitti risastóran hund     | 24         | 26          | 10     |
| Polish                   | Spotkałem ogromnego psa      | 23         | 24          | 8      |
| Russian                  | Я встретил огромную собаку   | 26         | 49          | 14     |
| Russian (transliterated) | Ya vstretil ogromnuyu sobaku | 28         | 28          | 11     |
| Hebrew                   | פגשתי כלב ענק                | 13         | 24          | 16     |
| Hebrew (transliterated)  | pgSti klv 3nq                | 13         | 13          | 9      |
| Japanese                 | 大きな犬に出会った           | 9          | 27          | 11     |
| Chinese                  | 我遇见了一只大狗             | 8          | 24          | 11     |
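
If you want to reproduce these counts locally rather than in the playground, here is a minimal Python sketch using OpenAI’s tiktoken library with the cl100k_base encoding (the GPT-4 tokenizer); the token counts may differ slightly from the Hugging Face playground numbers above.

```python
# Sketch: character, UTF-8 byte and token counts for the table's sentences,
# using tiktoken's cl100k_base encoding (the GPT-4 tokenizer).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

sentences = {
    "English": "I met a huge dog",
    "Spanish": "Conocí a un perro enorme",
    "Icelandic": "Ég hitti risastóran hund",
    "Polish": "Spotkałem ogromnego psa",
    "Russian": "Я встретил огромную собаку",
    "Russian (translit.)": "Ya vstretil ogromnuyu sobaku",
    "Hebrew": "פגשתי כלב ענק",
    "Hebrew (translit.)": "pgSti klv 3nq",
    "Japanese": "大きな犬に出会った",
    "Chinese": "我遇见了一只大狗",
}

for lang, text in sentences.items():
    chars = len(text)
    utf8_bytes = len(text.encode("utf-8"))
    tokens = len(enc.encode(text))
    print(f"{lang:<22} {chars:>3} chars {utf8_bytes:>3} bytes {tokens:>3} tokens")
```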


Comments


  1. I wonder if you know anything about DeepSeek. Does it have the same ratio? Was it trained on English data too? Or maybe Chinese gets some preference if the LLM was trained on Chinese data? I mean, it must be the same UTF-8-based calculation (is it?), but the LLM doesn’t always have to translate for reasoning (does it?). These are the questions I’m asking myself. Maybe you know something.


  2. Yes, I know about DeepSeek. According to what I heard, it was not trained in Chinese. It uses “distilled” versions of the Western models, trained in various languages, mostly (but not exclusively) in English.

    An LLM does not need to “translate”; it does not really have a concept of language per se. At its heart, it is a “next word predictor” or, rather, a “next character sequence predictor”, and that works in any language, as long as you can write it down. ChatGPT can converse in languages other than English reasonably well. The extra price I am talking about is not related to the language per se, but to the way it is encoded. Of course, one could invent an encoding other than UTF-8 that favors Chinese, but I doubt DeepSeek actually does that.

