We are currently migrating Bugzilla to GitHub issues.
Any changes made to the bug tracker now will be lost, so please do not post new bugs or make changes to them.
When we're done, all bug URLs will redirect to their equivalent location on the new bug tracker.

Bug 896

Summary: Bug in src/stdlib/SDL_iconv.c
Product: SDL Reporter: John Popplewell <john>
Component: *don't know*Assignee: Ryan C. Gordon <icculus>
Status: RESOLVED FIXED QA Contact: Sam Lantinga <slouken>
Severity: normal    
Priority: P2    
Version: HG 1.2   
Hardware: x86   
OS: Windows (All)   
Attachments: Test program showing the problem
Proposed patch

Description John Popplewell 2009-12-08 23:05:50 UTC
Originally reported by AKFoerster on the mailing list.

Error decoding UTF8 Russian text to UTF-16LE on Windows, but specifically on platforms without iconv support (the default on Windows).

Valid UTF8 characters are flagged as being overlong and then substituted by the UNKNOWN_UNICODE character.

After studying the testiconv.c example program, reading the RFCs and putting some printf statements in SDL_iconv.c the problem is in a test for 'Maximum overlong sequences', specifically 4.2.1, which is carried out by the following code:

      } else if ( p[0] >= 0xC0 ) {
        if ( (p[0] & 0xE0) != 0xC0 ) {
          /* Skip illegal sequences
            return SDL_ICONV_EILSEQ;
          */
          ch = UNKNOWN_UNICODE;
        } else {
          if ( (p[0] & 0xCE) == 0xC0 ) {    <<<<<<<< here
            overlong = SDL_TRUE;
          }
          ch = (Uint32)(p[0] & 0x1F);
          left = 1;
        }
      } else {

Here is the 2-byte encoding of a character in range 00000080 - 000007FF
    110xxxxx 10xxxxxx

The line in question is supposed to be checking for an overlong sequence which would be less than
    11000001 10111111

which should be represented as a single byte.

BUT, the mask value (0xCE) is wrong, it isn't checking the top-most bit:
    11000001     value
    11001110     mask (incorrect)
       ^
and should be (0xDE):
    11000001     value
    11011110     mask (correct)

making the above code:

      } else if ( p[0] >= 0xC0 ) {
        if ( (p[0] & 0xE0) != 0xC0 ) {
          /* Skip illegal sequences
            return SDL_ICONV_EILSEQ;
          */
          ch = UNKNOWN_UNICODE;
        } else {
          if ( (p[0] & 0xDE) == 0xC0 ) {    <<<<<<<< here
            overlong = SDL_TRUE;
          }
          ch = (Uint32)(p[0] & 0x1F);
          left = 1;
        }
      } else {

I can supply a test program and/or a patch if required,

best regards,
John Popplewell
Comment 1 Sam Lantinga 2009-12-09 00:46:10 UTC
Sure, a test program and patch would be awesome, if you have time.

Thanks for your time! :)
Comment 2 John Popplewell 2009-12-09 02:48:47 UTC
Created attachment 455 [details]
Test program showing the problem
Comment 3 John Popplewell 2009-12-09 02:50:05 UTC
Created attachment 456 [details]
Proposed patch
Comment 4 John Popplewell 2009-12-10 23:44:31 UTC
My additional comments got dropped on the floor. The test program takes a UTF8 string, converts it to UTF-16LE (on Windows), shows the resulting 16-bit character codes, then converts it back to UTF8. I used PuTTY and OpenSSH to get a UTF8 shell on Windows :-)

Here is the output using SDL-1.2.svn (no iconv support):

C:\tpdev\test-junk>utf8test.exe
Before: 'Здpaвcтвyй миp'
UFFFD UFFFD U0070 U0061 UFFFD U0063 UFFFD UFFFD U0079 UFFFD U0020 UFFFD UFFFD U0070
After : '��pa�c��y� ��p'

C:\tpdev\test-junk>

Here is the output using SDL-1.2.svn (with iconv support):

C:\tpdev\test-junk>utf8test.exe
Before: 'Здpaвcтвyй миp'
U0417 U0434 U0070 U0061 U0432 U0063 U0442 U0432 U0079 U0439 U0020 U043C U0438 U0070
After : 'Здpaвcтвyй миp'

C:\tpdev\test-junk>

Here is the output using patched SDL-1.2.svn (no iconv support):
C:\tpdev\test-junk>utf8test.exe
Before: 'Здpaвcтвyй миp'
U0417 U0434 U0070 U0061 U0432 U0063 U0442 U0432 U0079 U0439 U0020 U043C U0438 U0070
After : 'Здpaвcтвyй миp'

C:\tpdev\test-junk>

The output of 'testiconv.exe' (no iconv support) is unchanged with the patch.

Hope that helps.
Comment 5 Sam Lantinga 2009-12-11 00:04:46 UTC
Great problem description and patch!  I've applied your change to subversion, thanks!
Comment 6 John Popplewell 2009-12-11 00:09:42 UTC
Here's a link to some binaries, source-code and a build script:

http://johnnypops.demon.co.uk/files/utf8test-20091211.zip

Rename one of the SDL-svn*.dll files to SDL.dll to test the various permutations.