LLM tokens and foreign languages

TL;DR: If you use a language other than English, you pay more per word sent to or received from an LLM. If your language doesn't use the Latin alphabet, you pay way more.

Here's why. LLM providers don't charge by the word; they charge by the token, both received and returned. Tokens are derived statistically from how often byte combinations occur in the training material (explanation link).
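To make "byte combinations" concrete, here is a toy sketch of byte-pair merging, the idea behind many LLM tokenizers. It is an illustration only: real tokenizers learn their merges from a huge corpus and then apply that fixed list, whereas the hypothetical `toy_bpe` helper below learns merges from the input text itself.

```python
from collections import Counter


def most_frequent_pair(seq):
    """Return the most common adjacent pair of symbols, or None if there is none."""
    pairs = Counter(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(seq, pair, new_symbol):
    """Replace every non-overlapping occurrence of `pair` with `new_symbol`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def toy_bpe(text, num_merges):
    """Start from single UTF-8 bytes and greedily merge the most frequent pair."""
    seq = [bytes([b]) for b in text.encode("utf-8")]
    for _ in range(num_merges):
        pair = most_frequent_pair(seq)
        if pair is None:
            break
        seq = merge_pair(seq, pair, pair[0] + pair[1])
    return seq


# Each merge turns a frequent byte pair into a single token.
print(toy_bpe("the dog met the other dog", 5))
```

Byte sequences that occur often in the training data collapse into a single token, while rare ones stay split into small pieces. That is the mechanism behind the cost differences described here.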

Tokenizers work on UTF-8 bytes, and UTF-8 produces the shortest sequences for the basic Latin alphabet: one byte per letter. Most non-Latin characters are encoded as 2 or 3 bytes (Cyrillic and Hebrew letters take 2, Chinese and Japanese characters typically 3), so for languages like Hebrew and Japanese you may get more tokens than characters. In fact, transliterating into the Latin alphabet cut the token count from 14 to 11 for Russian and from 16 to 9 for Hebrew in my small example. I used the Hugging Face tokenizer playground with the GPT-4 setting. Claude produces different results, but the overall picture is similar.
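The character and byte counts are easy to reproduce with plain Python. The sketch below is a minimal illustration, not the method used for the measurements in this post. The `count_tokens` helper assumes the third-party `tiktoken` package is installed and that its `cl100k_base` encoding is a fair stand-in for the playground's GPT-4 setting, so its numbers may differ from the ones reported here.

```python
def char_and_byte_counts(text):
    """Return (number of Unicode code points, number of UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))


def count_tokens(text, encoding_name="cl100k_base"):
    """Count tokens with tiktoken.

    Assumption: tiktoken is installed and can load the named encoding
    (it fetches the vocabulary on first use).
    """
    import tiktoken

    return len(tiktoken.get_encoding(encoding_name).encode(text))


SENTENCES = {
    "English": "I met a huge dog",
    "Russian": "Я встретил огромную собаку",
    "Hebrew": "פגשתי כלב ענק",
    "Japanese": "大きな犬に出会った",
}

for language, sentence in SENTENCES.items():
    chars, nbytes = char_and_byte_counts(sentence)
    print(f"{language}: {chars} characters, {nbytes} UTF-8 bytes")
    # With tiktoken available, the token count could be added like this:
    # print(f"  tokens: {count_tokens(sentence)}")
```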

The cheapest language to use is English: it uses the Latin alphabet, its words are rarely inflected, and the majority of the training material is in English. By "tokenization efficiency" I mean the number of tokens English needs divided by the number another language needs for the same sentence. Next come other Latin-script languages like Spanish, Icelandic and Polish, with tokenization efficiency of about 1/2 to 2/3 of English. Tokenization efficiency for non-Latin-script languages like Hebrew and Russian is ~1/3 of English. Chinese and Japanese come in at roughly 1/2 of English, close to Icelandic: on the one hand they use lots of bytes per character, on the other hand they need fewer characters per sentence, since their scripts are not purely phonetic. Hebrew fares worst overall, needing 16 tokens for a 13-character sentence, probably because not much of the training material is in Hebrew. Of course, this is based on a very small sample, and the numbers for large texts may be different, but I assume the general trends will not change.

| Language | Sentence | Characters | UTF-8 bytes | Tokens |
|---|---|---|---|---|
| English | I met a huge dog | 16 | 16 | 5 |
| Spanish | Conocí a un perro enorme | 24 | 25 | 8 |
| Icelandic | Ég hitti risastóran hund | 24 | 26 | 10 |
| Polish | Spotkałem ogromnego psa | 23 | 24 | 8 |
| Russian | Я встретил огромную собаку | 26 | 49 | 14 |
| Russian Transliterated | Ya vstretil ogromnuyu sobaku | 28 | 28 | 11 |
| Hebrew | פגשתי כלב ענק | 13 | 24 | 16 |
| Hebrew Transliterated | pgSti klv 3nq | 13 | 13 | 9 |
| Japanese | 大きな犬に出会った | 9 | 27 | 11 |
| Chinese | 我遇见了一只大狗 | 8 | 24 | 11 |
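The efficiency fractions quoted above can be recomputed from the token column of the table. The short sketch below does just that; `relative_efficiency` is a hypothetical helper name, and the token counts are copied from the table rather than measured again.

```python
# Token counts copied from the table (transliterated rows left out).
TOKENS = {
    "English": 5,
    "Spanish": 8,
    "Icelandic": 10,
    "Polish": 8,
    "Russian": 14,
    "Hebrew": 16,
    "Japanese": 11,
    "Chinese": 11,
}


def relative_efficiency(tokens, baseline="English"):
    """Return baseline_tokens / language_tokens for each language (1.0 = baseline)."""
    base = tokens[baseline]
    return {language: base / count for language, count in tokens.items()}


for language, efficiency in relative_efficiency(TOKENS).items():
    print(f"{language}: {efficiency:.2f} of English")
```

Since price scales with token count, the inverse of each value is the cost multiplier relative to English for this one sentence.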

 
