| Summary: | Bug in src/stdlib/SDL_iconv.c | ||
|---|---|---|---|
| Product: | SDL | Reporter: | John Popplewell <john> |
| Component: | *don't know* | Assignee: | Ryan C. Gordon <icculus> |
| Status: | RESOLVED FIXED | QA Contact: | Sam Lantinga <slouken> |
| Severity: | normal | ||
| Priority: | P2 | ||
| Version: | HG 1.2 | ||
| Hardware: | x86 | ||
| OS: | Windows (All) | ||
| Attachments: |
Test program showing the problem
Proposed patch |
||
Sure, a test program and patch would be awesome, if you have time. Thanks for your time! :) Created attachment 455 [details]
Test program showing the problem
Created attachment 456 [details]
Proposed patch
My additional comments got dropped on the floor. The test program takes a UTF8 string, converts it to UTF-16LE (on Windows), shows the resulting 16-bit character codes, then converts it back to UTF8. I used PuTTY and OpenSSH to get a UTF8 shell on Windows :-) Here is the output using SDL-1.2.svn (no iconv support): C:\tpdev\test-junk>utf8test.exe Before: 'Здpaвcтвyй миp' UFFFD UFFFD U0070 U0061 UFFFD U0063 UFFFD UFFFD U0079 UFFFD U0020 UFFFD UFFFD U0070 After : '��pa�c��y� ��p' C:\tpdev\test-junk> Here is the output using SDL-1.2.svn (with iconv support): C:\tpdev\test-junk>utf8test.exe Before: 'Здpaвcтвyй миp' U0417 U0434 U0070 U0061 U0432 U0063 U0442 U0432 U0079 U0439 U0020 U043C U0438 U0070 After : 'Здpaвcтвyй миp' C:\tpdev\test-junk> Here is the output using patched SDL-1.2.svn (no iconv support): C:\tpdev\test-junk>utf8test.exe Before: 'Здpaвcтвyй миp' U0417 U0434 U0070 U0061 U0432 U0063 U0442 U0432 U0079 U0439 U0020 U043C U0438 U0070 After : 'Здpaвcтвyй миp' C:\tpdev\test-junk> The output of 'testiconv.exe' (no iconv support) is unchanged with the patch. Hope that helps. Great problem description and patch! I've applied your change to subversion, thanks! Here's a link to some binaries, source-code and a build script: http://johnnypops.demon.co.uk/files/utf8test-20091211.zip Rename one of the SDL-svn*.dll files to SDL.dll to test the various permutations. |
Originally reported by AKFoerster on the mailing list. Error decoding UTF8 Russian text to UTF-16LE on Windows, but specifically on platforms without iconv support (the default on Windows). Valid UTF8 characters are flagged as being overlong and then substituted by the UNKNOWN_UNICODE character. After studying the testiconv.c example program, reading the RFCs and putting some printf statements in SDL_iconv.c the problem is in a test for 'Maximum overlong sequences', specifically 4.2.1, which is carried out by the following code: } else if ( p[0] >= 0xC0 ) { if ( (p[0] & 0xE0) != 0xC0 ) { /* Skip illegal sequences return SDL_ICONV_EILSEQ; */ ch = UNKNOWN_UNICODE; } else { if ( (p[0] & 0xCE) == 0xC0 ) { <<<<<<<< here overlong = SDL_TRUE; } ch = (Uint32)(p[0] & 0x1F); left = 1; } } else { Here is the 2-byte encoding of a character in range 00000080 - 000007FF 110xxxxx 10xxxxxx The line in question is supposed to be checking for an overlong sequence which would be less than 11000001 10111111 which should be represented as a single byte. BUT, the mask value (0xCE) is wrong, it isn't checking the top-most bit: 11000001 value 11001110 mask (incorrect) ^ and should be (0xDE): 11000001 value 11011110 mask (correct) making the above code: } else if ( p[0] >= 0xC0 ) { if ( (p[0] & 0xE0) != 0xC0 ) { /* Skip illegal sequences return SDL_ICONV_EILSEQ; */ ch = UNKNOWN_UNICODE; } else { if ( (p[0] & 0xDE) == 0xC0 ) { <<<<<<<< here overlong = SDL_TRUE; } ch = (Uint32)(p[0] & 0x1F); left = 1; } } else { I can supply a test program and/or a patch if required, best regards, John Popplewell