.NET string comparison is not lexicographical

This Friday I wrote a unit test and, to my astonishment, found that “a” < “A” < “ab” with the .NET InvariantCulture string comparer. InvariantCultureIgnoreCase agrees, except that it treats “a” and “A” as equal.

That means that .NET string sorting is not lexicographical, which came as a shock to me. If it were lexicographical, the first differing character would decide, so “a” < “A” would imply “a” < “ab” < “A”.
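The observation can be checked with a few direct comparisons. This is a minimal sketch using C# top-level statements, not the original unit test; the variable names are mine. The comments on the culture-aware results repeat what the article reports, and the outcome may differ if the runtime is configured for globalization-invariant mode, where culture-aware comparisons fall back to ordinal behavior.

```csharp
using System;

StringComparer invariant = StringComparer.InvariantCulture;

// Negative means the left string sorts first, positive means it sorts last.
int aVsUpperA = Math.Sign(invariant.Compare("a", "A"));   // article reports "a" < "A"
int abVsUpperA = Math.Sign(invariant.Compare("ab", "A")); // article reports "A" < "ab"

// For contrast, ordinal comparison goes by UTF-16 code unit values:
// 'A' (65) comes before 'a' (97), so both results are positive.
int ordinalAVsUpperA = Math.Sign(string.CompareOrdinal("a", "A"));
int ordinalAbVsUpperA = Math.Sign(string.CompareOrdinal("ab", "A"));

Console.WriteLine($"InvariantCulture: a vs A = {aVsUpperA}, ab vs A = {abVsUpperA}");
Console.WriteLine($"Ordinal:          a vs A = {ordinalAVsUpperA}, ab vs A = {ordinalAbVsUpperA}");
```

If the first line shows a negative number followed by a positive one, the comparer is not lexicographical: the first character says “a” comes before “A”, yet “ab” lands after “A”.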

If the strings differ by more than just case, both case-sensitive and case-insensitive comparers (except Ordinal, see below) will return the same result. Any of the strings in [“ab”, “aB”, “Ab”, “AB”] will be less than any of [“ac”, “aC”, “Ac”, “AC”].

For strings that differ only by case, case-insensitive comparers return “equal”, while case-sensitive comparers use case as a tie-breaker with small letters first, so you get “ab” < “aB” < “Ab” < “AB”.
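Both rules can be probed with the two invariant comparers. The following is a rough sketch rather than a definitive test; the array names and the AllBefore helper are hypothetical, introduced only for illustration.

```csharp
using System;
using System.Linq;

string[] abGroup = { "AB", "Ab", "aB", "ab" };
string[] acGroup = { "AC", "Ac", "aC", "ac" };

StringComparer sensitive = StringComparer.InvariantCulture;
StringComparer insensitive = StringComparer.InvariantCultureIgnoreCase;

// Strings that differ only by case: the case-insensitive comparer
// should return 0 ("equal") for every pair in the group.
bool equalIgnoringCase = abGroup.All(x => abGroup.All(y => insensitive.Compare(x, y) == 0));

// The case-sensitive comparer orders them; the article reports ab < aB < Ab < AB.
string sensitiveOrder = string.Join(" < ", abGroup.OrderBy(s => s, sensitive));

// Strings that differ by more than case: check that every "ab" variant
// sorts before every "ac" variant under the given comparer.
bool AllBefore(StringComparer c) => abGroup.All(x => acGroup.All(y => c.Compare(x, y) < 0));
bool insensitiveAgrees = AllBefore(insensitive);
bool sensitiveAgrees = AllBefore(sensitive);

Console.WriteLine($"Equal ignoring case: {equalIgnoringCase}");
Console.WriteLine($"Case-sensitive order: {sensitiveOrder}");
Console.WriteLine($"ab group before ac group: sensitive={sensitiveAgrees}, insensitive={insensitiveAgrees}");
```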

This produces a “natural” sorting where “google” and “Google” are close to each other, but it is not lexicographical. Consider unsorted input “google, Google, human, zebra, Antwerp”. Lexicographically, with small letters before capitals, it would sort as “google, human, zebra, Antwerp, Google”, while the culture-aware .NET comparers would sort it as “Antwerp, google, Google, human, zebra”.
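To make the difference concrete, here is a sketch that sorts the same input three ways. No built-in .NET comparer produces the “small letters first” lexicographical order, so the Rank and CompareLowerFirst helpers below are hypothetical and written only for this illustration; they assume plain ASCII letters.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

string[] words = { "google", "Google", "human", "zebra", "Antwerp" };

// Hypothetical alphabet: a..z get ranks 0..25, A..Z get ranks 26..51 (ASCII letters only).
static int Rank(char c) => char.IsLower(c) ? c - 'a' : 26 + (c - 'A');

// Plain lexicographical comparison over that alphabet.
int CompareLowerFirst(string x, string y)
{
    int n = Math.Min(x.Length, y.Length);
    for (int i = 0; i < n; i++)
    {
        int diff = Rank(x[i]) - Rank(y[i]);
        if (diff != 0) return diff;  // the first differing character decides
    }
    return x.Length - y.Length;      // a prefix sorts before the longer string
}

string lowerFirst = string.Join(", ", words.OrderBy(w => w, Comparer<string>.Create(CompareLowerFirst)));
string ordinal = string.Join(", ", words.OrderBy(w => w, StringComparer.Ordinal));
string invariant = string.Join(", ", words.OrderBy(w => w, StringComparer.InvariantCulture));

Console.WriteLine($"Lexicographical, small letters first: {lowerFirst}");
Console.WriteLine($"Ordinal:                              {ordinal}");
Console.WriteLine($"InvariantCulture:                     {invariant}");
```

The first line should come out as “google, human, zebra, Antwerp, Google” and the ordinal line as “Antwerp, Google, google, human, zebra”. Only the culture-aware sort keeps “google” and “Google” together without sending all capitals to one end.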

StringComparer.Ordinal and StringComparer.OrdinalIgnoreCase stand out: these two are truly lexicographical. StringComparer.Ordinal also puts capitals before small letters, because this is the order in which they appear in Unicode, while OrdinalIgnoreCase treats them as equal. I created a little application that sorts strings using various comparers:

https://github.com/ikriv-samples/DotNetStringSorting

Input:
a, ab, aB, ac, A, AB, Ab

Sorted with StringComparer.InvariantCulture:
a, A, ab, aB, Ab, AB, ac

Sorted with StringComparer.InvariantCultureIgnoreCase:
a, A, ab, aB, AB, Ab, ac

Sorted with StringComparer.Ordinal:
A, AB, Ab, a, aB, ab, ac

Sorted with StringComparer.OrdinalIgnoreCase:
a, A, ab, aB, AB, Ab, ac
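
A listing like the one above can be produced with a short loop over the comparers. This is a sketch of the idea, not the code from the linked repository. It assumes LINQ's OrderBy, which is a stable sort: strings that a comparer treats as equal keep their input order, and that is consistent with the two IgnoreCase lines above. An unstable sort such as Array.Sort may arrange those equal strings differently.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

string[] input = { "a", "ab", "aB", "ac", "A", "AB", "Ab" };

var comparers = new (string Name, StringComparer Comparer)[]
{
    ("InvariantCulture", StringComparer.InvariantCulture),
    ("InvariantCultureIgnoreCase", StringComparer.InvariantCultureIgnoreCase),
    ("Ordinal", StringComparer.Ordinal),
    ("OrdinalIgnoreCase", StringComparer.OrdinalIgnoreCase),
};

// Keep each sorted line by comparer name so it can be inspected afterwards.
var results = new Dictionary<string, string>();

foreach (var (name, comparer) in comparers)
{
    // OrderBy is stable, so "equal" strings stay in input order.
    string line = string.Join(", ", input.OrderBy(s => s, comparer));
    results[name] = line;

    Console.WriteLine($"Sorted with StringComparer.{name}:");
    Console.WriteLine(line);
    Console.WriteLine();
}
```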

The lesson learnt: never assume anything unless verified. I’ve been working with .NET for over 10 years, and I never doubted that string comparison is lexicographical (what else could it possibly be?). I was in for a big surprise.

Ivan Krivyakov
