I have been quite annoyed by a Windows bug that causes a huge number of open-source command-line tools to choke on multi-byte characters at the Windows Command Prompt. The MSVCRT.DLL shipped with Windows Vista and later has serious trouble with such characters. While Microsoft tools and compilers after Visual Studio 6.0 do not use this DLL anymore, the GNU tools on Windows, usually built by MinGW or Mingw-w64, depend on this DLL and suffer from this problem. One cannot even use ls to display a Chinese file name when the system locale is set to Chinese.
The following simple code snippet demonstrates the problem:
    #include <locale.h>
    #include <stdio.h>
    #include <wchar.h>

    char msg[] = "\xd7\xd6\xb7\xfb Char";
    wchar_t wmsg[] = L"字符 char";

    /* Output the multi-byte string one byte at a time */
    void Test1()
    {
        char* ptr = msg;
        printf("Test 1: ");
        while (*ptr) {
            putchar(*ptr++);
        }
        putchar('\n');
    }

    /* Output the multi-byte string in one call */
    void Test2()
    {
        printf("Test 2: ");
        puts(msg);
    }

    /* Output the wide-character string one character at a time */
    void Test3()
    {
        wchar_t* ptr = wmsg;
        printf("Test 3: ");
        while (*ptr) {
            putwchar(*ptr++);
        }
        putwchar(L'\n');
    }

    int main()
    {
        char buffer[32];

        puts("Default C locale");
        Test1();
        Test2();
        Test3();
        putchar('\n');

        puts("Chinese locale");
        setlocale(LC_CTYPE, "Chinese_China.936");
        Test1();
        Test2();
        Test3();
        putchar('\n');

        puts("English locale");
        setlocale(LC_CTYPE, "English_United States.1252");
        Test1();
        Test2();
        Test3();
    }
When built with a modern version of Visual Studio, it gives the expected output (console code page is 936):
Default C locale
Test 1: 字符 Char
Test 2: 字符 Char
Test 3: char
Chinese locale
Test 1: 字符 Char
Test 2: 字符 Char
Test 3: 字符 char
English locale
Test 1: ×?·? Char
Test 2: ×?·? Char
Test 3: char
That is, when the locale is the default ‘C’, the ‘ANSI’ versions of the character output routines can successfully output single-byte and multi-byte characters, while putwchar, the ‘Unicode’ version of putchar, fails at the multi-byte characters (reasonably, as the C locale does not understand how to translate Chinese characters). When the locale is set correctly to code page 936 (Simplified Chinese), everything is correct. When the locale is set to code page 1252 (Latin), the ‘ANSI’ routines show the code page 1252 characters that sit at the same byte values as the original Chinese characters (‘×Ö·û’ instead of ‘字符’), though ‘Ö’ (\xd6) and ‘û’ (\xfb) are shown as ‘?’ because they do not exist in code page 936. The Chinese characters, of course, cannot be shown with putwchar in this locale, just as in the C locale.
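The two readings of the same bytes can be made concrete with a small sketch. It is not part of the test program: the helper names cp936_is_lead and count_chars are made up for illustration, and the lead-byte range 0x81..0xFE for code page 936 is stated here as an assumption. The sketch counts how many characters the bytes of msg form when read as a double-byte code page versus a single-byte one such as 1252.

```c
#include <stddef.h>

/* Assumption for this sketch: in code page 936 a byte in the range
 * 0x81..0xFE starts a two-byte character. */
static int cp936_is_lead(unsigned char b)
{
    return b >= 0x81 && b <= 0xFE;
}

/* Count the characters in s.  With dbcs != 0 the bytes are read as a
 * double-byte code page (lead byte + trail byte = one character);
 * with dbcs == 0 every byte is one character, as in code page 1252. */
static size_t count_chars(const char* s, int dbcs)
{
    size_t n = 0;
    while (*s) {
        if (dbcs && cp936_is_lead((unsigned char)*s) && s[1] != '\0')
            s += 2;  /* lead byte and trail byte together */
        else
            s += 1;  /* single-byte character */
        ++n;
    }
    return n;
}
```

Read this way, the four bytes \xd7\xd6\xb7\xfb are two characters under code page 936 and four characters under code page 1252, which matches ‘字符’ versus ‘×Ö·û’ in the output above.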
When built with GCC, the result is woeful:
Default C locale
Test 1: 字符 Char
Test 2: 字符 Char
Test 3: char
Chinese locale
Test 1: Char
Test 2: 字符 Char
Test 3: char
English locale
Test 1: ×?·? Char
Test 2: ×?·? Char
Test 3: char
Two things are worth noticing:
- putchar stops working for Chinese when the locale is correctly set.
- putwchar never works for Chinese.
Horrible and thoroughly broken! (Keep in mind that Microsoft is to blame here. You can compile the program with MSVC 6.0 using the /MD option, and the result will be the same: an executable that works in Windows XP but not in Windows Vista or later.)
I attacked this problem a few years ago, and tried some workarounds. The solution I came up with looked so fragile that I did not push it up to the MinGW library. It was a personal failure, as well as an indication that working around a buggy implementation without affecting the application code can be very difficult or just impossible.
The problem occurs only with the console, where the Microsoft runtime does some translation (broken in MSVCRT.DLL, but OK in newer MSVC runtimes). It vanishes when users redirect the output from the console. So one solution is not to use the Command Prompt at all. The Cygwin Terminal may be a good choice, especially for people familiar with Linux/Unix. I have Cygwin installed, but sometimes I still want to do things in the more Windows-y way. I figured I could make a small tool (like cat) to get the input from stdin and forward everything to stdout. As long as this tool is compiled by a Microsoft compiler, things should be OK. Then I thought a script could be faster. Finally, I came up with putting the following line into an mbf.bat:
@perl -p -e ""
(Perl is still wonderful for text processing, even in this ‘empty’ program!)
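For reference, the small cat-like tool mentioned above might look roughly like the following. This is only a minimal sketch that I have not verified against the console problem; the function name forward_stream is made up, and the premise that it behaves well when built with a Microsoft compiler is the expectation stated above, not something the sketch proves.

```c
#include <stdio.h>

/* Copy everything from in to out in fixed-size chunks.
 * Returns 0 on success and -1 on a read or write error. */
static int forward_stream(FILE* in, FILE* out)
{
    char buf[4096];
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n)
            return -1;  /* short write */
    }
    if (ferror(in) || fflush(out) == EOF)
        return -1;
    return 0;
}
```

A main function for the tool would then be little more than a call to forward_stream(stdin, stdout), with its result turned into the exit code.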
Now the executables built by GCC and MSVC give the same result, if we append ‘|mbf’ to the command line:
Default C locale
Test 1: 字符 Char
Test 2: 字符 Char
Test 3: char
Chinese locale
Test 1: 字符 Char
Test 2: 字符 Char
Test 3: 字符 char
English locale
Test 1: 字符 Char
Test 2: 字符 Char
Test 3: char
If you know how to make Microsoft fix the DLL problem, do it. Otherwise you know at least a workaround now.
The following code is my original partial solution to the problem, and it may be helpful to your GCC-based project. I don’t claim any copyright on it, nor will I take any responsibility for its use.
    /* mingw_mbcs_safe_io.c */

    #include <mbctype.h>
    #include <stdio.h>

    /* Output functions that work with the Windows 7+ MSVCRT.DLL
     * for multi-byte characters on the console.  Please notice
     * that buffering must not be enabled for the console (e.g.
     * by calling setvbuf); otherwise weird things may occur. */

    int __cdecl _mgw_flsbuf(int ch, FILE* fp)
    {
        static char lead = '\0';
        int ret = 1;

        if (lead != '\0')
        {
            /* A lead byte is pending: output both bytes together */
            ret = fprintf(fp, "%c%c", lead, ch);
            lead = '\0';
            if (ret < 0)
                return EOF;
        }
        else if (_ismbblead(ch))
            lead = ch;  /* Hold the lead byte until the next call */
        else
            return _flsbuf(ch, fp);
        return ch;
    }

    int __cdecl putc(int ch, FILE* fp)
    {
        static __thread char lead = '\0';
        int ret = 1;

        if (lead != '\0')
        {
            /* A lead byte is pending: output both bytes together */
            ret = fprintf(fp, "%c%c", lead, ch);
            lead = '\0';
        }
        else if (_ismbblead(ch))
            lead = ch;  /* Hold the lead byte until the next call */
        else
            ret = fprintf(fp, "%c", ch);
        if (ret < 0)
            return EOF;
        else
            return ch;
    }

    int __cdecl putchar(int ch)
    {
        return putc(ch, stdout);
    }

    int __cdecl _mgwrt_putchar(int ch)
    {
        return putc(ch, stdout);
    }
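The idea in the code above is easier to see without the runtime specifics: hold a lead byte back, then hand both bytes of the character to the output in a single call. Below is a portable sketch of just that state machine. Everything in it is hypothetical (pair_writer, pair_writer_put, count_sink, and the hard-coded code page 936 lead-byte range 0x81..0xFE, which stands in for _ismbblead); it is an illustration, not a replacement for the code above.

```c
#include <stddef.h>

/* Receives each group of bytes that should be written in one call. */
typedef void (*sink_fn)(const char* bytes, size_t len, void* ctx);

struct pair_writer {
    char lead;     /* pending lead byte, or '\0' if none */
    sink_fn sink;  /* where completed characters go */
    void* ctx;     /* passed through to the sink */
};

/* Assumption: code page 936 lead bytes are 0x81..0xFE. */
static int is_lead_byte(unsigned char b)
{
    return b >= 0x81 && b <= 0xFE;
}

/* Accept one byte; emit a whole character once it is complete. */
static void pair_writer_put(struct pair_writer* w, int ch)
{
    char out[2];

    if (w->lead != '\0') {
        out[0] = w->lead;
        out[1] = (char)ch;
        w->lead = '\0';
        w->sink(out, 2, w->ctx);  /* both bytes in one call */
    } else if (is_lead_byte((unsigned char)ch)) {
        w->lead = (char)ch;       /* wait for the trail byte */
    } else {
        out[0] = (char)ch;
        w->sink(out, 1, w->ctx);
    }
}

/* Example sink: counts how many times it was called. */
static void count_sink(const char* bytes, size_t len, void* ctx)
{
    (void)bytes;
    (void)len;
    ++*(size_t*)ctx;
}
```

Feeding the six bytes \xd7\xd6\xb7\xfb, a space and ‘C’ through pair_writer_put with count_sink results in four sink calls: one per Chinese character and one per ASCII character.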