Console code pages and child processes
Executive Summary
- GUI and console applications represent international characters differently.
- This may cause issues when exchanging files or redirecting I/O containing international texts.
- Console applications run as DETACHED_PROCESS are a special case. C++ and .NET programs may behave differently.
- Child processes of detached console apps use OEM encoding. This can create outright nightmares.
- If you use only English letters, you can safely ignore all of the above.
Character Encodings
Computers store information in bytes, but texts are composed of characters. Mapping between the two is a messy subject with a long history. English letters and punctuation are almost universally encoded using ASCII, and interoperability is good. However, international characters, such as accented Latin letters, Chinese, Greek, Hebrew, Japanese, Russian, and other "exotic" letters, are still encoded in a variety of ways, despite the advent of Unicode, the universal international encoding.
Windows uses a number of mutually incompatible encodings called code pages. There is a different encoding for each language or alphabet (see the list of code page identifiers in the References). For ease of reference, each code page has a number. The Unicode variant UTF-8 is technically also a code page, number 65001.
OEM and ANSI Code Pages
Different localized versions of Windows use different code pages. Furthermore, a single version of Windows typically uses different code pages for non-Unicode GUI applications and for console applications. Note that console applications are never truly Unicode. GUI code pages are called "ANSI" and console code pages are called "OEM". The reason for this duality is historic. Which particular code pages are used as ANSI and OEM depends on the "Language for non-Unicode programs" setting, which is found in the Windows 10 Control Panel under the "Region" icon, in the "Administrative" tab. Each localized version of Windows defaults to the ANSI and OEM code pages for its main language, but this can be changed.
Each console window, visible or not, has an input code page and an output code page associated with it. In practice, all console applications in the system use the OEM code page, except those that run as DETACHED_PROCESS. Detached process applications have their code page set to zero, which is usually interpreted as the "ANSI" code page. This creates another source of incompatibilities.
Practical example of poor interoperability
Suppose you copy the word café from this page to Notepad on English Windows, and save it to a file named cafe.txt. Then, depending on what you do with this file, you'd get the following results:
- cafΘ, with "Greek capital letter theta", if you do type cafe.txt in a command line window on English Windows.
- cafй, with "Cyrillic small letter short i", if you open it in Notepad on Russian Windows.
- cafщ, with "Cyrillic small letter shcha", if you do type cafe.txt in a command line window on Russian Windows.
File cafe.txt contains four bytes: 99, 97, 102, 233, and they don't change. The numbers 99, 97, 102 fall into the ASCII range of 0-127 and are invariably rendered as c, a, f. However, the meaning of the number 233 varies depending on the code page in use:
Environment | Code page | Mapping | Letter name |
---|---|---|---|
English GUI application | 1252 | 233 → "é" | Latin small letter e with acute accent |
English console application | 437 | 233 → "Θ" | Greek capital letter theta |
Russian GUI application | 1251 | 233 → "й" | Cyrillic small letter short i |
Russian console application | 866 | 233 → "щ" | Cyrillic small letter shcha |
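The table can be reproduced with a few lines of Python, whose standard codecs include all four code pages. This is only an illustrative sketch: the dictionary keys are labels taken from the table, not anything Windows itself reports.

```python
# Decode the same byte under the four code pages from the table.
raw = bytes([233])  # the last byte of cafe.txt

code_pages = {
    "English GUI application": "cp1252",
    "English console application": "cp437",
    "Russian GUI application": "cp1251",
    "Russian console application": "cp866",
}

# The byte never changes; only the code page used to interpret it does.
decoded = {env: raw.decode(cp) for env, cp in code_pages.items()}

for env, letter in decoded.items():
    print(f"{env}: 233 -> {letter}")
```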
In Western European languages, words usually contain only a few accented characters, so they may still be recognizable after such "transitions". Words from languages that use non-Latin alphabets become completely garbled. E.g. "кафе" typed in Russian Windows Notepad (CP 1251) becomes "ърЇх" in a Russian Windows command line (CP 866).
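The same effect can be simulated in Python. The sketch below assumes the text was saved as CP 1251 bytes and then read as CP 866. It also shows that the damage is reversible as long as the bytes themselves are untouched.

```python
original = "кафе"

# Bytes as a non-Unicode GUI program (ANSI, CP 1251) would store them.
stored = original.encode("cp1251")

# The same bytes as a Russian console (OEM, CP 866) would display them.
garbled = stored.decode("cp866")

# Reversing the wrong interpretation recovers the original text.
recovered = garbled.encode("cp866").decode("cp1251")

print(garbled, recovered)
```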
C/C++ vs .NET applications
Applications can query the current code page via the GetConsoleCP() and GetConsoleOutputCP() APIs. It is up to the application to alter its behavior (or not) in response to the value of the current code page. Most applications written in C and C++ ignore the issue of code pages. They treat all input and output as a stream of bytes and do not attempt any encoding conversions. Note that it is possible to add encoding conversions to a C or C++ program, but it is not trivial, and most programs don't bother with it.
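For quick experiments, the same APIs can be called from Python through ctypes. The following is a minimal sketch, not a recommended pattern: the function name query_code_pages and the dictionary keys are made up for this example, and outside Windows the function simply returns an empty dictionary.

```python
import ctypes
import sys


def query_code_pages():
    """Return console and system code pages; an empty dict when not on Windows."""
    if sys.platform != "win32":
        return {}
    kernel32 = ctypes.windll.kernel32
    return {
        "console_input": kernel32.GetConsoleCP(),         # 0 if there is no console
        "console_output": kernel32.GetConsoleOutputCP(),  # 0 if there is no console
        "ansi": kernel32.GetACP(),
        "oem": kernel32.GetOEMCP(),
    }


pages = query_code_pages()
print(pages if pages else "not running on Windows")
```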
.NET applications internally store text data as Unicode, and convert it on the fly to/from the current code page when doing input and output. This makes them more compliant, but may create its own unique set of problems when encoding conversion is not desired.
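The difference between the two approaches can be modeled roughly in Python. This is a simplified model, not how either runtime is implemented: write_like_dotnet and write_like_c are hypothetical helper names, and errors="replace" stands in for the question-mark substitution.

```python
def write_like_dotnet(text: str, code_page: str) -> bytes:
    """Unicode text is converted to the current code page on output."""
    # Characters missing from the code page become "?".
    return text.encode(code_page, errors="replace")


def write_like_c(data: bytes) -> bytes:
    """Bytes pass through unchanged; no conversion is attempted."""
    return data


accented = write_like_dotnet("café", "cp437")    # é exists in CP 437
russian = write_like_dotnet("Привет", "cp437")   # Cyrillic does not
passthrough = write_like_c("Привет".encode("cp1251"))

print(accented, russian, passthrough)
```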
More poor interoperability scenarios
International characters may be garbled in the following scenarios:
- Console application displays a file created by a GUI program on the same machine. The file would typically be written in UTF-8 or in ANSI encoding, and displayed in OEM encoding.
- Console application displays text received from a GUI program through any other means: input redirect, command line parameters, etc. The input would typically be ANSI encoded. Command line parameters may work with .NET programs, but this is also tricky, especially for C/C++ programs.
- .NET console application copies a file created by a non-Unicode GUI program on the same machine. The file encoding is ANSI, but the contents are converted to Unicode and back assuming OEM encoding. This may garble some characters. C/C++ programs don't have this problem, as they will not attempt any conversions.
- GUI application invokes a console application using any mode except DETACHED_PROCESS and reads its redirected output. The output will be in OEM encoding for both C++ and .NET applications. If the GUI application attempts to display it without conversion, the output will be garbled. If the GUI application attempts to parse it and extract file names, computer names, etc. from the output, and the file names contain international characters, the files will not be found.
- GUI application invokes a console application as DETACHED_PROCESS and reads its redirected output. .NET console applications will produce output in ANSI encoding. Detached processes have a code page of zero, which is interpreted as CP_ACP, or "current ANSI code page". C/C++ programs will most likely still produce output in OEM encoding. There is no reliable way to determine which is which, nor is there a way to force OEM encoding on the detached process. Don't use the DETACHED_PROCESS flag; use CREATE_NO_WINDOW instead.
- .NET console application running in DETACHED_PROCESS mode invokes a child process and reads its redirected output. Unless the child process is specifically created with the DETACHED_PROCESS flag, it will have its own console with the OEM code page, and the redirected output will be in OEM encoding, while the application will interpret it in its current code page, ANSI. This does not happen to C/C++ programs, since they ignore the current code page and always expect OEM.
- .NET application running in a console with non-default code page invokes a child process and reads its redirected output. This is a variation of the previous scenario, as newly created child process will have OEM code page.
- Console application attempts to show messages in a language not supported by current code page. E.g. if we run a Russian program on English Windows. English Windows is perfectly capable of displaying Russian text, but not inside a console. .NET application will replace international characters with question marks, C/C++ applications will show garbage.
- Output of a .NET console application with messages in another language is redirected to a file. Even though the text is not displayed, .NET still respects the current code page and converts all output to it, replacing unsupported characters with question marks. C/C++ programs ignore the current code page, and the resulting file will contain the message in its original encoding, which can still be recovered through encoding-aware editors.
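The redirected-output scenarios above come down to decoding the child's bytes with the code page the child actually used. Here is a minimal sketch, assuming English Windows (ANSI 1252, OEM 437). The child's output is simulated as a byte string rather than captured from a real process, and decode_child_output is a hypothetical helper name.

```python
def decode_child_output(raw: bytes, oem_code_page: str = "cp437") -> str:
    """Decode redirected console output, assuming the child wrote OEM bytes."""
    return raw.decode(oem_code_page, errors="replace")


# Simulated bytes a console child would emit for "café" under CP 437.
child_output = "café".encode("cp437")

# A GUI parent that assumes ANSI (CP 1252) gets the wrong text.
wrong = child_output.decode("cp1252")

# Decoding with the OEM code page recovers the intended text.
right = decode_child_output(child_output)

print(repr(wrong), repr(right))
```

On a real system, the OEM code page would have to be queried rather than hard-coded, and as noted above, a detached child may not use OEM at all.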
File names and other named objects
Non-Unicode Windows APIs that deal with file names, e.g. FindFirstFileA(), always use the ANSI code page. The same applies to user names, computer names, etc. This is not a problem for .NET applications, since they always use Unicode internally and convert the text to the OEM code page just before output. For C++ applications, however, calling non-Unicode APIs and then displaying the results verbatim will produce garbled names.
One is virtually guaranteed to have localized file names on a non-English Windows system, so care must be taken to avoid this problem. One must use Unicode APIs to handle file names, and convert them to the current OEM code page manually, even though it is quite messy in C++. Another possible solution is to use automatic Unicode-to-OEM translation similar to .NET via _setmode(..., _O_U16TEXT).
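The conversion step itself is simple once the two code pages are known. The Python sketch below models it for Russian Windows (ANSI 1251, OEM 866). The helper name ansi_to_oem and the sample file name are invented for illustration; a real C++ program would do the equivalent with the Windows conversion APIs.

```python
def ansi_to_oem(name_bytes: bytes, ansi: str = "cp1251", oem: str = "cp866") -> bytes:
    """Re-encode a name from the ANSI code page to the OEM code page."""
    # Characters that the OEM code page lacks are replaced with "?".
    return name_bytes.decode(ansi).encode(oem, errors="replace")


name = "файл.txt"

# What a non-Unicode ("A") API would hand back on Russian Windows.
ansi_bytes = name.encode("cp1251")

# Printing those bytes verbatim: the console reads them as CP 866.
verbatim = ansi_bytes.decode("cp866")

# Converting first: the console shows the intended name.
converted = ansi_to_oem(ansi_bytes).decode("cp866")

print(verbatim, converted)
```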
Other notes
- When a new console is created, its code page is always OEM. There is no way to specify the code page of the console for a newly created process.
- Detached process has code page zero.
- Code page is an attribute of a console, not of a process. If multiple processes share a console, every one of them may affect the console's code page.
- Windows documentation calls all code pages 'multi byte character sets', even though many of them are single byte.
- "ANSI" code pages are not really defined by ANSI, nor are "OEM" code pages defined by OEMs. Both are defined by Microsoft.
In conclusion
Dealing with international characters in console apps is a mess, and there is no easy way to achieve universal compatibility of everything with everything. However, being aware of the problem is 50% of the solution. Stick to English whenever possible, and always test localized text when applicable, especially when dealing with file names, user names, and other objects whose names are supplied by the operating system.
Windows 10 promises true Unicode console buffers, but it remains to be seen whether this catches on and how much time it will take.
References
Code pages, official Microsoft documentation.
List of the code page identifiers, official Microsoft documentation.
Creation of a console, official Microsoft documentation.
Unicode and UTF-8 Output Text Buffer by Rich Turner, Microsoft.
Notes on Unicode on the command line in Windows by A. Sinan Unur: using Turkish characters with command line and Perl.
Anyone who says the console can't do Unicode... by Michael S. Kaplan.