TL;DR: If you use a language other than English, you pay more per word sent to or received from an LLM. If you don’t use the Latin alphabet, you pay way more.
Here’s why. LLM providers don’t charge by the word; they charge by the token, both received and returned. Tokens are derived statistically, based on the frequency of byte combinations in the training material (explanation link).
Tokenizers work on UTF-8 bytes, and UTF-8 produces shorter sequences for the Latin alphabet: one byte per unaccented letter. Non-Latin characters are encoded as 2 or even 3 bytes, so for languages like Hebrew and Japanese you may get more tokens than characters. In fact, transliterating Russian or Hebrew into the Latin alphabet noticeably reduces the number of tokens in my small example: from 14 to 11 for Russian and from 16 to 9 for Hebrew. I used the Hugging Face tokenizer playground with the GPT-4 setting. Claude produces different results, but the overall picture is similar.
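
To make the byte counts concrete, here is a minimal sketch that prints how many bytes UTF-8 needs for a single character from several scripts. The sample characters and the helper name `utf8_len` are my own illustrative picks; the sketch only measures bytes and says nothing about how a particular tokenizer splits them.

```python
# Illustrative sample characters (an assumption, not from any tokenizer):
# basic Latin, Latin with a diacritic, Cyrillic, Hebrew, CJK.
SAMPLES = ["d", "ł", "я", "ש", "犬"]


def utf8_len(text: str) -> int:
    """Number of bytes UTF-8 uses to encode the given text."""
    return len(text.encode("utf-8"))


for ch in SAMPLES:
    print(f"{ch!r}: {utf8_len(ch)} byte(s)")
```

A tokenizer that starts from bytes has more raw material to merge for the same number of characters in Cyrillic, Hebrew or CJK text, which is one reason those scripts tend to end up with more tokens.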
The cheapest language to use is English: it uses the Latin alphabet, its words are rarely inflected, and the majority of the training material is in English. Next come other languages written in the Latin alphabet, like Spanish, Icelandic and Polish, with a tokenization efficiency of about 1/2 of English, measured by the number of tokens needed for the same sentence (8 to 10 versus 5). Tokenization efficiency for languages with non-Latin alphabets, like Hebrew and Russian, is ~1/3 of English. Chinese and Japanese perform approximately like Polish and Icelandic, at 1/2 of English: on the one hand they use lots of bytes per character, on the other hand they use fewer characters per sentence, since their writing is not phonetic. Hebrew fares worst, at 16 tokens for 13 characters, probably because little of the training material is in Hebrew. Of course, this is based on a very small sample, and the numbers for large texts may be different, but I assume the general trends will not change.
Language | Sentence | Characters | UTF-8 bytes | Tokens |
---|---|---|---|---|
English | I met a huge dog | 16 | 16 | 5 |
Spanish | Conocí a un perro enorme | 24 | 25 | 8 |
Icelandic | Ég hitti risastóran hund | 24 | 26 | 10 |
Polish | Spotkałem ogromnego psa | 23 | 24 | 8 |
Russian | Я встретил огромную собаку | 26 | 49 | 14 |
Russian Transliterated | Ya vstretil ogromnuyu sobaku | 28 | 28 | 11 |
Hebrew | פגשתי כלב ענק | 13 | 24 | 16 |
Hebrew Transliterated | pgSti klv 3nq | 13 | 13 | 9 |
Japanese | 大きな犬に出会った | 9 | 27 | 11 |
Chinese | 我遇见了一只大狗 | 8 | 24 | 11 |
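
The Characters and UTF-8 bytes columns can be recomputed with the standard library alone, and the "fraction of English" figures follow from the token counts. The sketch below does both. The token counts are copied from the table rather than computed, since that would need a tokenizer library; the helper names `measure` and `relative_efficiency` are hypothetical, and "efficiency" here is my reading of the text: English tokens divided by the language's tokens for the same sentence.

```python
# Rows copied from the table above: (language, sentence, token count).
# Token counts are the article's figures; they are not recomputed here.
ROWS = [
    ("English", "I met a huge dog", 5),
    ("Spanish", "Conocí a un perro enorme", 8),
    ("Icelandic", "Ég hitti risastóran hund", 10),
    ("Polish", "Spotkałem ogromnego psa", 8),
    ("Russian", "Я встретил огромную собаку", 14),
    ("Hebrew", "פגשתי כלב ענק", 16),
    ("Japanese", "大きな犬に出会った", 11),
    ("Chinese", "我遇见了一只大狗", 11),
]


def measure(sentence: str) -> tuple[int, int]:
    """Return (characters, UTF-8 bytes) for a sentence."""
    return len(sentence), len(sentence.encode("utf-8"))


def relative_efficiency(tokens: int, baseline_tokens: int) -> float:
    """Baseline token count divided by this language's token count."""
    return baseline_tokens / tokens


baseline = ROWS[0][2]  # English token count
for language, sentence, tokens in ROWS:
    chars, nbytes = measure(sentence)
    ratio = relative_efficiency(tokens, baseline)
    print(f"{language:<10} chars={chars:>2} bytes={nbytes:>2} "
          f"tokens={tokens:>2} vs English={ratio:.2f}")
```

Swapping in your own sentences and token counts from a tokenizer playground is enough to check whether the trend holds for longer texts.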