
{"id":5322,"date":"2025-08-05T15:31:34","date_gmt":"2025-08-05T19:31:34","guid":{"rendered":"https:\/\/ikriv.com\/blog\/?p=5322"},"modified":"2025-08-05T15:44:33","modified_gmt":"2025-08-05T19:44:33","slug":"llm-tokens-and-foreign-languages","status":"publish","type":"post","link":"https:\/\/ikriv.com\/blog\/?p=5322","title":{"rendered":"LLM tokens and foreign languages"},"content":{"rendered":"<p><b>TL;DR<\/b> If you use language other than English, you pay more per word sent\/received from an LLM. If you don&#8217;t use Latin alphabet, you pay way more.<\/p>\n<p>Here&#8217;s why. LLMs don&#8217;t charge by word, they charge by <i>token<\/i> received and returned. Tokens are generated statistically, based on the frequency of byte combinations in the training material (<a href=\"https:\/\/medium.com\/thedeephub\/all-you-need-to-know-about-tokenization-in-llms-7a801302cf54\">explanation link<\/a>).<\/p>\n<p>Tokenizers use UTF-8, and it produces shorter sequences for Latin alphabet. Non-Latin characters are encoded as 2 or even 3 bytes, so for languages like Hebrew and Japanese you may get <b>more tokens than characters<\/b>. In fact, transliterating Russian or Hebrew into Latin alphabet reduces the number of tokens by the factor of 2 in my small example. I used <a href=\"https:\/\/huggingface.co\/spaces\/Xenova\/the-tokenizer-playground\">Huggingface tokenizer playground<\/a> with GPT-4 setting. Claude produces different results, but the overall picture is similar.<\/p>\n<p>The cheapest language to use is English: it uses Latin alphabet, its words rarely change, and the majority of the training material is in English. Second go Latin-based languages like Spanish, Icelandic and Polish with tokenization efficiency of about 1\/2 of English. Tokenization efficiency for non-Latin based languages like Hebrew and Russian is ~1\/3 of English. Chinese and Japanese perform approximately like Polish and Icelandic at 1\/2 of English: on one hand they use lots of bytes per character, from the other hand they use fewer characters per sentence, since they are not phonetic. Hebrew shows the worst character-to-token ratio, probably because not a lot of training material is in it. 
Of course, this is based on a very small sample, and the numbers for large texts may differ, but I expect the general trends to hold.

| Language | Sentence | Characters | UTF-8 bytes | Tokens |
|---|---|---|---|---|
| English | I met a huge dog | 16 | 16 | 5 |
| Spanish | Conocí a un perro enorme | 24 | 25 | 8 |
| Icelandic | Ég hitti risastóran hund | 24 | 26 | 10 |
| Polish | Spotkałem ogromnego psa | 23 | 24 | 8 |
| Russian | Я встретил огромную собаку | 26 | 49 | 14 |
| Russian Transliterated | Ya vstretil ogromnuyu sobaku | 28 | 28 | 11 |
| Hebrew | פגשתי כלב ענק | 13 | 24 | 16 |
| Hebrew Transliterated | pgSti klv 3nq | 13 | 13 | 9 |
| Japanese | 大きな犬に出会った | 9 | 27 | 11 |
| Chinese | 我遇见了一只大狗 | 8 | 24 | 11 |
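To make the "1/2 of English" and "1/3 of English" figures concrete, here is a small sketch that derives them from the token counts in the table above. Efficiency is computed as English tokens divided by the language's tokens for the same sentence; this is one way to read the numbers, and the ratios obviously depend on this single example sentence:

```python
# Token counts per sentence, taken from the table above (GPT-4 tokenizer).
token_counts = {
    "English": 5,
    "Spanish": 8,
    "Icelandic": 10,
    "Polish": 8,
    "Russian": 14,
    "Russian Transliterated": 11,
    "Hebrew": 16,
    "Hebrew Transliterated": 9,
    "Japanese": 11,
    "Chinese": 11,
}

english = token_counts["English"]
for language, tokens in token_counts.items():
    # Same meaning, fewer tokens = cheaper; 1.00 means as cheap as English.
    efficiency = english / tokens
    print(f"{language:24} tokens={tokens:3}  efficiency={efficiency:.2f} of English")
```

Running this gives roughly 0.5-0.6 for Spanish, Icelandic, Polish, Chinese, and Japanese, and roughly 0.3-0.4 for Russian and Hebrew, matching the 1/2 and 1/3 estimates in the text.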
LLMs <a href=\"https:\/\/ikriv.com\/blog\/?p=5322\" class=\"more-link\">[&hellip;]<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"Layout":"","footnotes":""},"categories":[1],"tags":[],"class_list":["entry","author-ikriv","post-5322","post","type-post","status-publish","format-standard","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/ikriv.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/5322","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ikriv.com\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ikriv.com\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ikriv.com\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/ikriv.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=5322"}],"version-history":[{"count":14,"href":"https:\/\/ikriv.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/5322\/revisions"}],"predecessor-version":[{"id":5336,"href":"https:\/\/ikriv.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/5322\/revisions\/5336"}],"wp:attachment":[{"href":"https:\/\/ikriv.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=5322"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ikriv.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=5322"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ikriv.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=5322"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}