TL;DR: If you use a language other than English, you pay more per word sent to or received from an LLM. If you don’t use the Latin alphabet, you pay far more.
Here’s why. LLMs don’t charge by the word; they charge per token received and returned. Tokens are generated statistically, based on the frequency of byte combinations in the training material (explanation link).
Tokenizers work on UTF-8 bytes, and UTF-8 produces shorter sequences for the Latin alphabet. Non-Latin characters are encoded as 2 or even 3 bytes each, so for languages like Hebrew and Japanese you may get more tokens than characters. In fact, transliterating Russian or Hebrew into the Latin alphabet roughly halves the number of tokens in my small example. I used the Hugging Face tokenizer playground with the GPT-4 setting. Claude produces different results, but the overall picture is similar.
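If you want to reproduce this locally rather than in the web playground, you can count tokens with OpenAI’s tiktoken library. This is a minimal sketch, assuming tiktoken is installed and that its cl100k_base encoding matches the playground’s GPT-4 setting; exact counts may differ slightly from the table below.

```python
# Minimal sketch: compare character, UTF-8 byte, and token counts per language.
# Assumes the tiktoken package is installed; cl100k_base is the GPT-4 encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "I met a huge dog",
    "Russian": "Я встретил огромную собаку",
    "Russian transliterated": "Ya vstretil ogromnuyu sobaku",
    "Hebrew": "פגשתי כלב ענק",
}

for language, sentence in samples.items():
    chars = len(sentence)                       # characters as typed
    utf8_bytes = len(sentence.encode("utf-8"))  # bytes the tokenizer actually sees
    tokens = len(enc.encode(sentence))          # what you are billed for
    print(f"{language:22}  chars={chars:2}  bytes={utf8_bytes:2}  tokens={tokens:2}")
```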
The cheapest language to use is English: it uses the Latin alphabet, its words rarely change form, and the majority of the training material is in English. Next come Latin-script languages like Spanish, Icelandic and Polish, with a tokenization efficiency of about 1/2 of English. Tokenization efficiency for non-Latin languages like Hebrew and Russian is ~1/3 of English. Chinese and Japanese land around the Polish and Icelandic level of 1/2 of English: on the one hand they use many bytes per character, on the other hand they need fewer characters per sentence, since their scripts are not phonetic. Hebrew shows the worst character-to-token ratio, probably because relatively little training material is in it. Of course, this is based on a very small sample, and the numbers for large texts may differ, but I assume the general trends will hold. (A small calculation after the table shows how these ratios fall out of the token counts.)
| Language | Sentence | Characters | UTF-8 bytes | Tokens |
|---|---|---|---|---|
| English | I met a huge dog | 16 | 16 | 5 |
| Spanish | Conocí a un perro enorme | 24 | 25 | 8 |
| Icelandic | Ég hitti risastóran hund | 24 | 26 | 10 |
| Polish | Spotkałem ogromnego psa | 23 | 24 | 8 |
| Russian | Я встретил огромную собаку | 26 | 49 | 14 |
| Russian Transliterated | Ya vstretil ogromnuyu sobaku | 28 | 28 | 11 |
| Hebrew | פגשתי כלב ענק | 13 | 24 | 16 |
| Hebrew Transliterated | pgSti klv 3nq | 13 | 13 | 9 |
| Japanese | 大きな犬に出会った | 9 | 27 | 11 |
| Chinese | 我遇见了一只大狗 | 8 | 24 | 11 |
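For concreteness, here is one way to read the efficiency figures quoted above off the table: divide the number of tokens English needs for the sentence by the number the other language needs. The metric is just my informal framing for this post, not a standard benchmark.

```python
# Token counts from the table above, all for the same sentence meaning.
# "Efficiency" = English tokens / this language's tokens (informal metric).
tokens = {
    "English": 5, "Spanish": 8, "Icelandic": 10, "Polish": 8,
    "Russian": 14, "Hebrew": 16, "Japanese": 11, "Chinese": 11,
}

english = tokens["English"]
for language, count in tokens.items():
    print(f"{language:10}  {count:2} tokens  efficiency {english / count:.2f} of English")
```

This gives roughly 0.5–0.6 for the Latin-script languages, ~0.45 for Chinese and Japanese, and ~0.3 for Russian and Hebrew, which is where the 1/2 and 1/3 figures above come from.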

I wonder if you know anything about DeepSeek. Does it have the same ratio? Was it trained on English data too? Or maybe Chinese gets some advantage if the LLM has been trained on data in Chinese? I mean, it must be the same UTF-8 based calculation (is it?), but the LLM doesn’t always have to translate for reasoning (does it?). These are the questions I’m asking myself. Maybe you know something.
Yes, I know about DeepSeek. According to what I have heard, it was not trained in Chinese. It uses “distilled” versions of the Western models, which were trained in various languages, mostly in English, but not exclusively.
An LLM does not need to “translate”; it does not really have a concept of language per se. At its heart, it is a “next word predictor”, or rather a “next character sequence predictor”, and that works in any language, as long as you can write it down. ChatGPT can converse in languages other than English reasonably well. The extra price I am talking about is not related to the language per se, but to the way it is encoded. Of course, one could invent an encoding other than UTF-8 that favors Chinese, but I doubt DeepSeek actually does that.
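As a quick illustration of that last point (and nothing more; I am not aware of any LLM tokenizer doing this): a legacy Chinese encoding like GB2312 spends two bytes per Chinese character where UTF-8 spends three, so the same sentence starts out smaller before tokenization even begins.

```python
# Byte counts for the Chinese sentence from the table under two encodings.
# GB2312 is used purely as an example of a Chinese-favoring encoding.
sentence = "我遇见了一只大狗"

print("UTF-8 bytes: ", len(sentence.encode("utf-8")))   # 24: three bytes per character
print("GB2312 bytes:", len(sentence.encode("gb2312")))  # 16: two bytes per character
```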