P/Invoke and Extended ASCII Black Hole

Background

Not so long ago I had to fight an interesting bug. I had a native C++ function that sent a char* string to some external device. A greatly simplified version of this function is given in Listing 1. The function can operate in one of three modes: normal, upper case, and lower case. Regular ASCII characters (0x01...0x7F) are printed according to the current output mode. Characters 0x80, 0x81, and 0x82 switch to normal, lower case, and upper case mode respectively. Characters 0x83 and higher are printed in the form of their hex codes, e.g. <88>.

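// C++
#include <cctype>   // toupper, tolower
#include <cstdio>   // printf, putchar

// Mode-switching control codes, as described above
enum OutputMode
{
    MODE_NORMAL     = 0x80,
    MODE_LOWER_CASE = 0x81,
    MODE_UPPER_CASE = 0x82
};

void PrintWithControlChars( char const* str )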
{
    OutputMode mode = MODE_NORMAL;

    while (unsigned char c = *str++)
    {
        if (c >= 0x80)
        {
            // control character
            switch (c)
            {
            case MODE_NORMAL:
            case MODE_LOWER_CASE:
            case MODE_UPPER_CASE:
                mode = (OutputMode)c;
                break;
            default:
                printf("<%02X>", c);
            }
        }
        else
        {
            switch (mode)
            {
            case MODE_UPPER_CASE: putchar(toupper(c)); break;
            case MODE_LOWER_CASE: putchar(tolower(c)); break;
            default: putchar(c); break;
            }
        }
    }
}

All was good and well, until I tried to use this function from C# via P/Invoke:

// C#
using System;
using System.Runtime.InteropServices;

class Program
{
    [DllImport("NativeDLL.dll")]
    extern static void PrintWithControlChars( [In][MarshalAs(UnmanagedType.LPStr)] string str );

    [STAThread]
    static void Main(string[] args)
    {
        PrintWithControlChars("Normal case,\x82 Upper case,\x81 Lower case,\x80 Normal case, \x88\n");
    }
}

The expected output was

Normal case, UPPER CASE, lower case, Normal case, <88>

while in reality I got

Normal case,? Upper case, lower case,? normal case, ?

What could be the problem? Why did the C++ function not see any of the control characters except for "lower case"? A bug perhaps?

Would You Like to Come in for a Little Byte?

String literals in C# and in C++ may look similar, but this similarity is a little deceptive. In C#, characters are UNICODE (16-bit UTF-16 code units), while in C++ characters are typically single bytes. When we call PrintWithControlChars() from C#, we pass it a UNICODE string, while the C++ side expects a single-byte string. The .NET Framework silently performs the conversion for us. It knows what to do thanks to the [MarshalAs(UnmanagedType.LPStr)] attribute attached to the declaration of the parameter str.

The documentation for MarshalAs does not seem to mention how exactly UNICODE strings are converted to LPSTR. Experiment shows that the conversion is performed using the system (ANSI) code page. If a particular UNICODE character is not found in the system code page, it is replaced with a question mark (ASCII code 0x3F). This means that the actual byte values received by the C++ function may differ depending on the system code page. For example, the UNICODE string literal "für" (0x0066, 0x00FC, 0x0072) will become single-byte "für" (0x66, 0xFC, 0x72) on an English system, and single-byte "f?r" (0x66, 0x3F, 0x72) on a Russian system.
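The behaviour observed in this experiment can be modeled in a few lines of C++. This is only a sketch of the idea, not the marshaler's actual implementation: the CodePage type, the ToSingleByte() function, and the two tiny code page excerpts are hypothetical names made up for illustration.

```cpp
#include <map>
#include <string>

// Hypothetical model of a single-byte code page: UNICODE code point -> byte.
using CodePage = std::map<char16_t, unsigned char>;

// Converts a UNICODE string to single-byte form. Characters that are
// missing from the code page become '?' (0x3F).
std::string ToSingleByte(const std::u16string& text, const CodePage& page)
{
    std::string result;
    for (char16_t ch : text)
    {
        auto it = page.find(ch);
        result += (it != page.end()) ? static_cast<char>(it->second) : '?';
    }
    return result;
}

// Tiny excerpts, just enough for "für". Real code pages have up to 256 entries.
const CodePage kEnglishExcerpt = { {u'f', 0x66}, {u'r', 0x72}, {u'\u00FC', 0xFC} };
const CodePage kRussianExcerpt = { {u'f', 0x66}, {u'r', 0x72} };  // no u-umlaut
```

With these excerpts, ToSingleByte(u"f\u00FCr", kEnglishExcerpt) yields the bytes 0x66, 0xFC, 0x72, while the same call with kRussianExcerpt yields "f?r".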

Extended ASCII Black Hole

But we ran our tests on an English system, and we still saw the question marks. What could be the problem? After all, the first 256 UNICODE characters seem to closely follow code page 1252 used on English systems. When converting between the two, one just needs to remove or add zeroes: 0x0066 becomes 0x66, 0x00FC becomes 0xFC, et cetera.

This is right for most characters in the range 0x00...0xFF, but not for all of them. By looking carefully at the chart for Windows code page 1252, we discover that the character range 0x80...0x9F is an exception to the common rule. For example, 0x80 (the Euro sign) is UNICODE 0x20AC, and not 0x0080 as one would expect. We also discover a number of suspicious-looking "holes": characters not defined in code page 1252. Specifically, 0x81, 0x8D, 0x8F, 0x90, and 0x9D are not defined.

Going the other way, we find that the UNICODE range 0x0080...0x009F contains control characters (the so-called C1 controls) that are not present in code page 1252 (see UNICODE chart). However, ranges 0x0000...0x007F and 0x00A0...0x00FF exactly follow code page 1252.
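For reference, the irregular part of the code page 1252 chart fits into a small lookup table. The sketch below follows the published chart. The names kCp1252High and Cp1252ToUnicode are made up for this illustration, and using 0 to mark a hole is merely a convention of the sketch.

```cpp
#include <cstdint>

// UNICODE code points for code page 1252 bytes 0x80...0x9F.
// 0 marks the "holes": bytes that the chart leaves undefined.
constexpr std::uint16_t kCp1252High[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,  // 0x80-0x87
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,       // 0x88-0x8F
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,  // 0x90-0x97
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178   // 0x98-0x9F
};

// Returns the UNICODE code point for a code page 1252 byte, or 0 for a hole.
// (Byte 0x00 also returns 0, because it is the NUL character itself.)
constexpr std::uint16_t Cp1252ToUnicode(unsigned char b)
{
    return (b < 0x80 || b >= 0xA0) ? b : kCp1252High[b - 0x80];
}

static_assert(Cp1252ToUnicode(0x66) == 0x0066, "ASCII maps to itself");
static_assert(Cp1252ToUnicode(0xFC) == 0x00FC, "0xA0...0xFF map to themselves");
static_assert(Cp1252ToUnicode(0x80) == 0x20AC, "Euro sign, not 0x0080");
static_assert(Cp1252ToUnicode(0x81) == 0,      "a hole");
```

Outside of 0x80...0x9F the function simply widens the byte, which is the "add zeroes" rule. Inside that range every defined byte maps to a code point above 0x00FF.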

Exception to Exception

So, UNICODE characters 0x0080...0x009F are not part of code page 1252, and thus are converted to question marks (0x3F) instead of single-byte 0x80...0x9F. Indeed, it would be incorrect to convert UNICODE 0x0085 ("next line" control) to 0x85 ("horizontal ellipsis"), which would then convert back to UNICODE 0x2026.

But this is still not the whole story. The "missing" characters 0x81, 0x8D, 0x8F, 0x90, and 0x9D are an exception to the exception: they are converted using the regular rule: 0x81 <-> 0x0081, 0x8D <-> 0x008D, etc. This explains why our "lower case" control "\x81" survived the conversion while the others did not.

To summarize:

C# to C++ string conversion depends on the system code page.
If the system code page is 1252 (Latin-1), UNICODE characters 0x0000...0x00FF are converted to their single-byte equivalents, with the exception of the following, which become question marks: 0x0080, 0x0082...0x008C, 0x008E, 0x0091...0x009C, 0x009E, 0x009F.
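The two rules can be checked with a small portable simulation. The sketch below is a model of the observed behaviour, not a call into the real marshaler. ToCp1252(), Marshal(), and Render() are hypothetical helper names, and for simplicity everything above 0x00FF is treated as unmappable, which the real conversion does not do. Render() repeats the logic of PrintWithControlChars(), but collects the output in a string.

```cpp
#include <cctype>
#include <cstdio>
#include <string>

// Models the conversion of one UNICODE character on a code page 1252 system.
// Simplification: characters above 0x00FF are not modeled and become '?'.
unsigned char ToCp1252(char16_t ch)
{
    if (ch < 0x80 || (ch >= 0xA0 && ch <= 0xFF))
        return static_cast<unsigned char>(ch);
    switch (ch)
    {
    case 0x81: case 0x8D: case 0x8F: case 0x90: case 0x9D:
        return static_cast<unsigned char>(ch);  // the "holes" pass through
    default:
        return '?';                             // everything else is lost
    }
}

// Models what the C++ side receives for a given C# string.
std::string Marshal(const std::u16string& text)
{
    std::string bytes;
    for (char16_t ch : text) bytes += static_cast<char>(ToCp1252(ch));
    return bytes;
}

// Same logic as PrintWithControlChars(), returning the output as a string.
std::string Render(const std::string& bytes)
{
    unsigned char mode = 0x80;  // 0x80 normal, 0x81 lower case, 0x82 upper case
    std::string out;
    for (unsigned char c : bytes)
    {
        if (c >= 0x80)
        {
            if (c <= 0x82) { mode = c; continue; }
            char buf[8];
            std::snprintf(buf, sizeof buf, "<%02X>", c);
            out += buf;
        }
        else if (mode == 0x82) out += static_cast<char>(std::toupper(c));
        else if (mode == 0x81) out += static_cast<char>(std::tolower(c));
        else out += static_cast<char>(c);
    }
    return out;
}
```

Passing the test string through Marshal() and then Render() reproduces the question marks from the actual output, while passing the same bytes to Render() directly gives the expected output.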

Why We Got into Trouble

The root of the problem is that C# and C++ have radically different ideas of what a character is. C++ inherits from C the idea that characters are integers (typically, bytes). Characters are indistinguishable from their integer codes. It is OK to add, subtract, and even multiply characters. It is also OK to mix characters with arbitrary binary codes, such as the control codes in our example. A char* may represent a string, or an arbitrary chunk of memory. Interpretation of what exactly it means is up to the programmer.
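A short illustration of the C++ view, assuming an ASCII-compatible execution character set. CountControlBytes() is a hypothetical helper written for this example.

```cpp
// Characters are small integers, so character arithmetic is integer arithmetic.
static_assert('a' + 1 == 'b', "holds for ASCII-compatible character sets");
static_assert('7' - '0' == 7, "digit characters are always consecutive");

// A char buffer can carry text and binary control codes side by side.
// Returns the number of bytes >= 0x80 in a NUL-terminated buffer.
int CountControlBytes(const char* buffer)
{
    int count = 0;
    for (; *buffer; ++buffer)
    {
        if (static_cast<unsigned char>(*buffer) >= 0x80) ++count;
    }
    return count;
}
```

Nothing in the type system distinguishes the text bytes from the control bytes. Only the programmer's convention does.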

C#, on the other hand, treats characters first and foremost as elements of text. Converting characters to integers is still implicit, but converting integers to characters is not. Probably as a tribute to C++, C# characters can be subtracted and multiplied, but the resulting expressions are of type int, not char. There is a separate type, byte, for representing single-byte integer values, and byte[] for representing arbitrary octet sequences. Mixing binary codes and characters does not work very well in C#.

When we write '\x85' in C#, it represents UNICODE character U+0085, "next line control", and the system will treat it accordingly. If this value is handed to an external consumer, the system will choose a suitable representation for the "next line control" character if one is available, or resort to a default (such as a question mark) when no suitable external representation exists. There is no promise to preserve the binary value 0x0085. The interpretation "next line control" is primary; the binary value 0x0085 is secondary.

When we write '\x85' in C++, it represents a binary value of 0x85 and nothing else, as far as the compiler is concerned. The binary value is primary; its interpretation is up to the programmer. '\x85' may mean "horizontal ellipsis" if code page 1252 is assumed, or Cyrillic capital letter "E" in code page 866, or integer value -123, or a million other things, depending on the programmer's imagination.
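The interpretations listed above can be written down explicitly. In this sketch Byte85AsUnicode() is a hypothetical function that models only the two code pages mentioned in the text and returns the UNICODE replacement character 0xFFFD for anything else.

```cpp
// One byte, several readings. The compiler does not prefer any of them.
constexpr unsigned char kByte = 0x85;

static_assert(kByte == 133, "read as an unsigned integer");
static_assert(static_cast<signed char>(kByte) == -123, "read as a signed 8-bit integer");

// UNICODE code point that byte 0x85 stands for under a given code page.
constexpr char16_t Byte85AsUnicode(int codePage)
{
    return codePage == 1252 ? 0x2026   // horizontal ellipsis
         : codePage == 866  ? 0x0415   // Cyrillic capital letter IE
         : 0xFFFD;                     // sketch only: other code pages not modeled
}

static_assert(Byte85AsUnicode(1252) != Byte85AsUnicode(866), "same byte, different text");
```

Neither reading yields 0x0085, the "next line" control that C# has in mind when it sees '\x85'.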

Source Code

Download the test project that demonstrates the extended ASCII black hole: ExtendedAsciiBlackHole.zip (72K).

This is a Visual Studio .NET 2003 solution that consists of a native DLL written in C++ and a managed executable written in C#. The DLL implements the PrintWithControlChars() function, and the executable calls it using the [MarshalAs(UnmanagedType.LPStr)] attribute on the string parameter.

Ivan Krivyakov
