
Bug 2097

Summary: utf8 documentation for SDL_TextInputEvent and unicode support in general is wrong
Product: SDL
Reporter: loebel.marvin
Component: events
Assignee: Sam Lantinga <slouken>
Status: NEW ---
QA Contact: Sam Lantinga <slouken>
Severity: normal    
Priority: P2    
Version: 2.0.0   
Hardware: All   
OS: All   

Description loebel.marvin 2013-09-13 15:05:56 UTC
Hi, I just had a longer discussion about this with someone who is writing a Rust binding for SDL2. (http://www.rust-lang.org/)

On http://wiki.libsdl.org/SDL_TextInputEvent it reads as follows:

text: The null-terminated input text in UTF-8 encoding

However, this is contradictory: UTF-8 permits interior nulls (U+0000 is encoded as the single byte 0x00), so a null-terminated field cannot carry arbitrary UTF-8 text, and claiming the text field contains text in UTF-8 encoding is wrong.

To fix this, you could modify the documentation, depending on how the code actually treats it:

"The null-terminated input text in a subset of UTF-8 encoding that will never contain an interior null. Any text input that contains interior nulls will be discarded and not trigger an event." 

Or "The null-terminated input text in modified UTF-8 encoding; interior nulls will be replaced by the two-byte sequence 0xC0 0x80" (see http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8)
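To make the second option concrete, here is a minimal sketch of that replacement step. The function name encode_modified_utf8 and its signature are made up for illustration and are not part of SDL; it only assumes the input is a byte buffer with a known length.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of SDL: copies `len` bytes from `src` into
 * `dst`, rewriting every 0x00 byte as the two-byte sequence 0xC0 0x80, and
 * then appends one terminating 0x00. Returns the number of bytes written
 * (excluding the terminator), or (size_t)-1 if `dst_cap` is too small. */
size_t encode_modified_utf8(const unsigned char *src, size_t len,
                            unsigned char *dst, size_t dst_cap)
{
    size_t out = 0;
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < len; i++) {
        size_t need = (src[i] == 0x00) ? 2 : 1;
        /* Always keep one byte free for the final terminator. */
        if (out + need >= dst_cap) {
            return (size_t)-1;
        }
        if (src[i] == 0x00) {
            dst[out++] = 0xC0;
            dst[out++] = 0x80;
        } else {
            dst[out++] = src[i];
        }
    }
    if (out >= dst_cap) {
        return (size_t)-1;
    }
    dst[out] = 0x00;
    return out;
}
```

With this scheme the only 0x00 byte in the output is the terminator, so C string functions see the whole text. The cost is that the result is no longer strictly valid UTF-8, since 0xC0 0x80 is an overlong encoding that a conforming decoder rejects.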

Or you could allow interior nulls, which would mean changing the API to pass a length field (which one could argue is more idiomatic and modern).
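A rough sketch of what a length-carrying payload could look like follows. The struct LengthTextEvent, the constant TEXT_CAPACITY and the setter are all hypothetical names invented for this example; this is not how SDL defines its event, only an illustration of the idea.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TEXT_CAPACITY 32  /* illustrative buffer size chosen for this sketch */

/* Hypothetical event payload: an explicit byte count replaces the
 * terminator, so 0x00 bytes inside the text are preserved. */
typedef struct {
    size_t length;             /* number of valid bytes in text */
    char text[TEXT_CAPACITY];  /* UTF-8 bytes; may contain 0x00 */
} LengthTextEvent;

/* Fills `ev` from `len` bytes at `bytes`.
 * Returns 0 on success, -1 if the text does not fit. */
int length_text_event_set(LengthTextEvent *ev, const char *bytes, size_t len)
{
    assert(ev != NULL && bytes != NULL);
    if (len > TEXT_CAPACITY) {
        return -1;
    }
    memcpy(ev->text, bytes, len);
    ev->length = len;
    return 0;
}
```

A consumer would read exactly ev.length bytes instead of calling strlen, which is what lets interior nulls survive.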


I know this might seem silly, as in the wild you'll likely not see any interior nulls, precisely because low-level C-like APIs often use null-terminated strings, but it's nevertheless important to at least document it correctly.

Confusion caused by misinterpretation of the documentation can lead to subtle bugs, like memory leaks (a null-terminated API got passed a UTF-8 string with interior nulls), or security issues (let's say you have a valid filename "harmless.jpg\0.exe", but the display routine cuts off part of the string).
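The truncation case is easy to reproduce. The sketch below assumes the caller knows the real byte length of the buffer; has_interior_nul is a hypothetical helper name, not an SDL function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if the first `len` bytes of `buf` contain a 0x00 byte,
 * i.e. a null-terminated API would silently see a shorter string. */
int has_interior_nul(const char *buf, size_t len)
{
    assert(buf != NULL);
    return memchr(buf, '\0', len) != NULL;
}
```

For a 17-byte buffer holding "harmless.jpg", a 0x00 byte, and ".exe", strlen reports 12 and has_interior_nul returns 1, so anything that displays or copies the name as a C string only sees "harmless.jpg".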


On a related note, it should also be documented (or decided) whether the text can contain UTF-16 surrogates. According to the Unicode standard they are illegal in a UTF-8 encoded string: http://en.wikipedia.org/wiki/UTF-8#Invalid_code_points.

However, if you disallow them, you can no longer encode all Windows file paths as UTF-8, as they can contain broken UTF-16 sequences, which would get encoded as lone surrogates.
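Whichever way this is decided, checking for the case is cheap. Surrogate code points U+D800 to U+DFFF, if encoded with the three-byte UTF-8 pattern, always start with 0xED followed by a byte in the range 0xA0 to 0xBF. A minimal sketch of a scanner, with a hypothetical function name:

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if `buf` contains the byte pattern of an encoded surrogate
 * (0xED followed by 0xA0..0xBF), else 0. This checks only for that
 * pattern; it is not a full UTF-8 validator. */
int contains_encoded_surrogate(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    for (size_t i = 0; i + 1 < len; i++) {
        if (buf[i] == 0xED && buf[i + 1] >= 0xA0 && buf[i + 1] <= 0xBF) {
            return 1;
        }
    }
    return 0;
}
```

The neighbouring code points U+D7FF (0xED 0x9F 0xBF) and U+E000 (0xEE 0x80 0x80) do not match, so legitimate text is not flagged.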


So yeah, Unicode is hard. :)