
{"id":2533,"date":"2018-10-08T22:15:58","date_gmt":"2018-10-09T02:15:58","guid":{"rendered":"https:\/\/ikriv.com\/blog\/?p=2533"},"modified":"2019-12-01T22:39:57","modified_gmt":"2019-12-02T03:39:57","slug":"net-string-comparison-is-not-lexicographical","status":"publish","type":"post","link":"https:\/\/ikriv.com\/blog\/?p=2533","title":{"rendered":".NET string comparison is not lexicographical"},"content":{"rendered":"<p>This Friday I wrote a unit test and, to my astonishment, found that &#8220;a&#8221; &lt; &#8220;A&#8221; &lt; &#8220;ab&#8221; with the .NET\u00a0<strong>InvariantCulture<\/strong> and <strong>InvariantCultureIgnoreCase<\/strong> string comparers.<\/p>\n<p>That means that .NET string sorting is not <a href=\"https:\/\/en.wikipedia.org\/wiki\/Alphabetical_order\">lexicographical<\/a>, which came as a shock to me. If it were lexicographical, the\u00a0order would have been &#8220;a&#8221; &lt; &#8220;ab&#8221; &lt; &#8220;A&#8221;.<\/p>\n<p>If the strings differ by more than just case, both case-sensitive and case-insensitive comparers (except Ordinal, see below) will return the same result. Any of the strings in [&#8220;ab&#8221;, &#8220;aB&#8221;, &#8220;Ab&#8221;, &#8220;AB&#8221;] will be less than any of [&#8220;ac&#8221;, &#8220;aC&#8221;, &#8220;Ac&#8221;, &#8220;AC&#8221;].<\/p>\n<p>For strings that differ <em>only\u00a0<\/em>by case, case-insensitive comparers return &#8220;equal&#8221;, while case-sensitive comparers put lowercase before uppercase, so you get &#8220;ab&#8221; &lt; &#8220;aB&#8221; &lt; &#8220;Ab&#8221; &lt; &#8220;AB&#8221;.<\/p>\n<p>This produces a &#8220;natural&#8221; sorting where &#8220;google&#8221; and &#8220;Google&#8221; are close to each other, but it is not lexicographical. Consider the unsorted input &#8220;google, Google, human, zebra, Antwerp&#8221;. 
Lexicographically (with lowercase before uppercase) it would sort as &#8220;google, human, zebra, Antwerp, Google&#8221;, while most .NET comparers would sort it as &#8220;Antwerp, google, Google, human, zebra&#8221;.<\/p>\n<p>StringComparer.Ordinal and StringComparer.OrdinalIgnoreCase stand out: these two are truly lexicographical. Ordinal also puts capitals before small letters, because that is the order of their code points in Unicode. I created a little application that sorts strings using various comparers:<\/p>\n<p><a href=\"https:\/\/github.com\/ikriv-samples\/DotNetStringSorting\">https:\/\/github.com\/ikriv-samples\/DotNetStringSorting<\/a><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-2535\" src=\"https:\/\/ikriv.com\/blog\/wp-content\/uploads\/2018\/10\/sorter.png\" alt=\"\" width=\"510\" height=\"473\" \/><\/p>\n<p><strong>Input<\/strong>:<br \/>\na, ab, aB, ac, A, AB, Ab<\/p>\n<p>Sorted with <strong>StringComparer.InvariantCulture<\/strong>:<br \/>\na, A, ab, aB, Ab, AB, ac<\/p>\n<p>Sorted with <strong>StringComparer.InvariantCultureIgnoreCase<\/strong>:<br \/>\na, A, ab, aB, AB, Ab, ac<\/p>\n<p>Sorted with <strong>StringComparer.Ordinal<\/strong>:<br \/>\nA, AB, Ab, a, aB, ab, ac<\/p>\n<p>Sorted with <strong>StringComparer.OrdinalIgnoreCase<\/strong>:<br \/>\na, A, ab, aB, AB, Ab, ac<\/p>\n<p>The lesson learnt: never assume anything unless verified. I&#8217;ve been working with .NET for over 10 years, and I never doubted that string comparison is lexicographical (what else could it possibly be?). I was in for a big surprise.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This Friday I wrote a unit test and, to my astonishment, found that &#8220;a&#8221; &lt; &#8220;A&#8221; &lt; &#8220;ab&#8221; with the .NET\u00a0InvariantCulture and InvariantCultureIgnoreCase string comparers. 
That means that .NET <a href=\"https:\/\/ikriv.com\/blog\/?p=2533\" class=\"more-link\">[&hellip;]<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"Layout":"","footnotes":""},"categories":[3,8,4],"tags":[],"class_list":["entry","author-ikriv","post-2533","post","type-post","status-publish","format-standard","category-dotnet","category-cs","category-hack"],"_links":{"self":[{"href":"https:\/\/ikriv.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/2533","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ikriv.com\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ikriv.com\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ikriv.com\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/ikriv.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=2533"}],"version-history":[{"count":5,"href":"https:\/\/ikriv.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/2533\/revisions"}],"predecessor-version":[{"id":4600,"href":"https:\/\/ikriv.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/2533\/revisions\/4600"}],"wp:attachment":[{"href":"https:\/\/ikriv.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=2533"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ikriv.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=2533"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ikriv.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=2533"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}