I'm going to ask what is probably quite a controversial question: "Should one of the most popular encodings, UTF-16, be considered harmful?"
Why do I ask this question?
How many programmers are aware of the fact that UTF-16 is actually a variable-length encoding? By this I mean that there are code points that, represented as surrogate pairs, take more than one 16-bit code unit.
I know; lots of applications, frameworks and APIs use UTF-16, such as Java's String, C#'s String, Win32 APIs, Qt GUI libraries, the ICU Unicode library, etc. However, despite all of that, there are lots of basic bugs in the processing of characters outside the BMP (characters that must be encoded using two UTF-16 code units).
For example, try to edit one of these characters:
- 𝄞 (U+1D11E) MUSICAL SYMBOL G CLEF
- 𝕥 (U+1D565) MATHEMATICAL DOUBLE-STRUCK SMALL T
- 𝟶 (U+1D7F6) MATHEMATICAL MONOSPACE DIGIT ZERO
- 𠂊 (U+2008A) Han Character
You may miss some, depending on what fonts you have installed. These characters are all outside of the BMP (Basic Multilingual Plane). If you cannot see these characters, you can also try looking at them in the Unicode Character reference.
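As a rough illustration (a minimal sketch in Python 3, not part of the tests described in this question), the number of 16-bit code units each of these characters needs can be counted by encoding to UTF-16 and halving the byte length. The helper name `utf16_units` is made up for this example.

```python
def utf16_units(text: str) -> int:
    """Return how many 16-bit code units `text` occupies in UTF-16."""
    # "utf-16-le" is used so that no byte order mark is prepended.
    return len(text.encode("utf-16-le")) // 2


# The four characters listed above, plus a BMP character for comparison.
samples = ["\U0001D11E", "\U0001D565", "\U0001D7F6", "\U0002008A", "A"]

for ch in samples:
    print(f"U+{ord(ch):04X}: {utf16_units(ch)} code unit(s)")
```

Each of the four non-BMP characters should report two code units, while "A" reports one.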
For example, try to create file names in Windows that include these characters; try to delete these characters with a "backspace" to see how they behave in different applications that use UTF-16. I did some tests and the results are quite bad:
- Opera has problems editing them (deleting requires 2 presses on backspace)
- Notepad can't deal with them correctly (deleting requires 2 presses on backspace)
- File name editing in Windows dialogs is broken (deleting requires 2 presses on backspace)
- All Qt3 applications can't deal with them: they show two empty squares instead of one symbol.
- Python encodes such characters incorrectly when used directly: u'X' != unicode('X','utf-16') on some platforms when X is a character outside of the BMP.
- Python 2.5 unicodedata fails to get properties of such characters when Python is compiled with UTF-16 Unicode strings.
- StackOverflow seems to remove these characters from the text if they are entered directly as Unicode characters (these characters are shown using HTML Unicode escapes).
- WinForms TextBox may generate an invalid string when limited with MaxLength.
It seems that such bugs are extremely easy to find in many applications that use UTF-16.
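One plausible way such a backspace bug can arise is sketched below in Python 3. This is an assumption for illustration, not the actual code of any of the applications listed: it models a text buffer as a list of UTF-16 code units and compares a backspace that drops one code unit with one that drops a whole surrogate pair. All function names are hypothetical.

```python
def to_units(text):
    """Encode text as a list of 16-bit UTF-16 code units (no BOM)."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def is_high(u):
    return 0xD800 <= u <= 0xDBFF


def is_low(u):
    return 0xDC00 <= u <= 0xDFFF


def naive_backspace(units):
    """Treat UTF-16 as fixed width: always remove exactly one code unit."""
    return units[:-1]


def pair_aware_backspace(units):
    """Remove a whole surrogate pair when the buffer ends with one."""
    if len(units) >= 2 and is_low(units[-1]) and is_high(units[-2]):
        return units[:-2]
    return units[:-1]


def is_well_formed(units):
    """Check that every surrogate is part of a valid high/low pair."""
    i = 0
    while i < len(units):
        if is_high(units[i]):
            if i + 1 < len(units) and is_low(units[i + 1]):
                i += 2
                continue
            return False
        if is_low(units[i]):
            return False
        i += 1
    return True


buffer = to_units("a\U0001D11E")  # 'a' followed by MUSICAL SYMBOL G CLEF
print(is_well_formed(naive_backspace(buffer)))       # lone high surrogate is left behind
print(is_well_formed(pair_aware_backspace(buffer)))  # only 'a' remains
```

The sketch only deals with surrogate pairs; it makes no attempt to handle combining sequences or grapheme clusters, which are a separate problem in every encoding.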
So... Do you think that UTF-16 should be considered harmful?

I tried copying the characters to a filename and tried to delete them and had no problems. Some Unicode characters read right to left and keyboard input handling sometimes changes to accommodate that (depending on the program used). Can you post the numeric codes for the specific characters you are having trouble with? – None – 2009-06-26T17:30:47.687
Have you tried to work with them in Notepad to see how it behaves? For example, edit a file name containing this character, put the cursor to the right of the character and press backspace. You'll see that in both Notepad and the file name editing dialog you need to press "backspace" twice to remove this character. – None – 2009-06-27T07:50:22.063
The double backspace behavior is mostly intentional: http://blogs.msdn.com/michkap/archive/2005/12/21/506248.aspx – None – 2009-06-27T10:56:18.477
Not really correct. Let me explain: if you write "שָׁ", the compound character that consists of "ש", "ָ" and "ׁ" (vowels), then removing each one of them is logical: you remove one code point when you press "backspace" and remove the whole character including the vowels when you press "del". But you never produce an illegal state of the text, that is, illegal code points. Thus, the situation where you press backspace and get illegal text is incorrect. – None – 2009-06-27T12:43:13.190
Are you referring to how sin and shin are composed of two code points, and by deleting the code-point for the dot you get an "illegal" character? – None – 2009-06-29T16:39:06.647
No, you get "vowelless" writing. It is totally legal. More than that, in most cases vowels like these (shin/sin) are almost never written unless they are required to clarify something that is not obvious from context, like שׁם and שׂם: these are two different words, but from the context you know which one the vowelless שם means. – None – 2009-06-29T17:24:23.137
CiscoIPPhone: If a bug is "reported several different times, by many different people", and then a couple years later a developer writes on a dev blog that "Believe it or not, the behavior is mostly intentional!", then (to put it mildly) I tend to think it's probably not the best design decision ever made. :-) Just because it's intentional doesn't mean it's not a bug. – None – 2010-03-18T01:18:48.133
For the record, I don't have problems with any of these characters in Apple's TextEdit.app (which uses Cocoa and thus UTF-16), but trying to insert them in Emacs (which uses a variant of UTF-8 internally) produces garbage. I do think that such bugs are not the fault of the character encoding, but of the lack of competence of the programmers involved. – None – 2010-08-15T08:20:43.317
BTW, I've just checked editing these letters, and they don't give me any problems in either Opera or Windows 7. Opera seems to edit them properly, and so does Notepad. A file with these letters in its name was created successfully. – Malcolm – 2010-12-31T14:00:09.620
@Malcolm, first, there is no problem creating such files; the question is about editing them. I tested on XP; maybe MS fixed this issue in 7. Take a look at how backspace works: do you need to hit it once or twice? – None – 2010-12-31T14:52:42.830
Once. I specifically checked for this issue, and in Windows 7 the problem with characters beyond the BMP seems to be gone. Maybe this problem had already been solved in Vista. – Malcolm – 2011-01-01T00:45:58.283
@Malcolm - even so, that does not make UTF-16 less harmful :-) – None – 2011-01-01T08:21:56.987
Well, I don't think that the mere existence of crappy implementations indicates harmfulness of the standard at all. :p This is just an update on the current situation: how problematic characters beyond the BMP are in Windows (and Opera) now. – Malcolm – 2011-01-01T14:35:05.173
Great post. UTF-16 is indeed the "worst of both worlds": UTF8 is variable-length, covers all of Unicode, requires a transformation algorithm to and from raw codepoints, is ASCII-compatible, and has no endianness issues. UTF32 is fixed-length, requires no transformation, but takes up more space and has endianness issues. So far so good, you can use UTF32 internally and UTF8 for serialization. But UTF16 has no benefits: It's endian-dependent, it's variable length, it takes lots of space, it's not ASCII-compatible. The effort needed to deal with UTF16 properly could be spent better on UTF8. – Kerrek SB – 2011-06-09T11:38:43.320
UTF-8 has the same caveats as UTF-16. Buggy UTF-16 handling code exists; although probably less than buggy UTF-8 handling code (most code handling UTF-8 thinks it's handling ASCII, Windows-1252, or 8859-1) – Ian Boyd – 2011-08-12T00:10:12.293
@Ian: UTF-8 DOES NOT have the same caveats as UTF-8. You cannot have surrogates in UTF-8. UTF-8 does not masquerade as something it’s not, but most programmers using UTF-16 are using it wrong. I know. I've watched them again and again and again and again. – tchrist – 2011-08-15T19:44:52.403
@tchrist UTF-16 can sometimes require more than 16 bits to represent a single code point, UTF-8 can sometimes require more than 8 bits to represent a single code point. UTF-16 can sometimes use multiple code points to represent a single character, UTF-8 can sometimes use multiple code points to represent a single character. U+0061 U+0301 U+0317 forms one character: á̗. When converted to UTF-8 the byte sequence (without the BOM) is 61 CC 81 CC 97. When converted to UTF-16 the byte sequence (without the BOM) is 61 00 01 03 17 03. Same caveats. – Ian Boyd – 2011-08-15T21:21:15.730
@Ian You are welcome to spout off the theory all you want: it’s wasted on me. I teach this stuff myself. I can promise you that the UTF-16 problems are everywhere. These people can’t even get code points right. No one using UTF-8 ever screws that up. It’s these damned two-bit used-to-be-UCS2 P.O.S. UTF-16 interfaces that screw people up. That is the real world. That is the calibre of the average UTF-16 programmer out there. What are you, some Microsoft apologist or something? It’s a screwed-up choice that has caused endless misery in this world: you can’t make a silk purse from a sow’s ear. – tchrist – 2011-08-15T22:30:02.973
I don't see how someone fresh to a subject can be stymied simply because it is named "UTF-16". Yet if you change the name to "UTF-8" it becomes obvious and intuitive. – Ian Boyd – 2011-08-16T00:00:42.103
@Ian: You have listed common caveats, that doesn't mean that the two have the same caveats. UTF-16 has more: It has endianness issues, and it does not contain ASCII as a subset. Those two alone make a huge difference. – Kerrek SB – 2011-08-16T12:25:06.913
@Kerrek S: In terms of writing code to handle caveats, endian order is not an issue for programmers. Take me, for example, as a programmer who is dealing with UTF-8 and UTF-16: multi-character diacritics, the BMP and surrogate pairs are (still) difficult to handle. Endian order is trivial. UTF-16 not containing an ASCII subset? What is UTF-16 missing? ASCII has ACK (0x06), UTF-8 has ACK (0x06), UTF-16 has ACK (0x0006). – Ian Boyd – 2011-08-16T13:19:49.247
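The "ASCII subset" and endianness points in this exchange can be checked at the byte level. The following is a minimal Python 3 sketch (not written by any of the commenters) that encodes the ACK control character under each encoding; the helper name `encodings_of` is made up for this example.

```python
def encodings_of(ch):
    """Return the raw bytes of `ch` under several encodings (no BOM)."""
    names = ["ascii", "utf-8", "utf-16-le", "utf-16-be"]
    return {name: ch.encode(name) for name in names}


encoded = encodings_of("\x06")  # ACK
for name, raw in encoded.items():
    print(f"{name:10} {raw.hex(' ')}")

# UTF-8 output is byte-for-byte identical to ASCII for this character.
print(encoded["utf-8"] == encoded["ascii"])
# UTF-16 output is two bytes long and depends on the byte order.
print(encoded["utf-16-le"] == encoded["utf-16-be"])
```

The code point is the same in all three encodings, but only UTF-8 produces the same bytes as ASCII, and the two UTF-16 byte orders disagree with each other.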
@Ian Boyd: Just so you know, you're arguing with the author of http://98.245.80.27/tcpc/OSCON2011/gbu.pdf See also http://98.245.80.27/tcpc/OSCON2011/index.html – Christoffer Hammarström – 2011-08-18T20:15:04.377
Also, UTF-8 doesn't have the problem because everyone treats it as a variable width encoding. The reason UTF-16 has the problem is because everyone treats it like a fixed width encoding. – Christoffer Hammarström – 2011-08-18T20:22:41.797
@Christoffer Hammarström You can't blame one non-fixed width encoding for being non-fixed width, while embracing another non-fixed width encoding because it's non-fixed width. – Ian Boyd – 2011-08-18T20:57:40.893
UTF-32 good enough for you? – None – 2011-08-18T22:09:38.543
Tell me about it, I've been shouting this at my stupid Windows programming colleagues for years. The only safe encodings are UTF32 and UTF8 (as long as people don't treat it as a fixed-length encoding). – None – 2011-08-19T02:35:36.657
Can you elaborate on the assertion that "Python encodes such characters incorrectly"? How would you even write this into a file? AFAIK, Python cannot read files whose encoding is not a superset of ASCII (at byte level). – Ringding – 2011-08-19T09:31:34.917
Another example: JavaScript's charCodeAt selects UTF-16 words, not Unicode characters. This arguably isn't a bug, but applications that assume charCodeAt works on Unicode characters will be broken. – Joey Adams – 2011-08-21T00:13:04.370
I think this link provides some useful context to your question, though not related to an answer – Ryathal – 2011-12-07T13:57:38.447
I think @Ringding is right, the Python example seems flawed. In Python 2, unicode('', 'utf-16') interprets the bytes of '' as a UTF-8 string and then decodes that to UTF-16; that obviously goes wrong. – Fred Foo – 2012-04-20T22:00:11.787
Please enjoy the grand summary of the popular POV at: http://www.utf8everywhere.org/ – Pavel Radzivilovsky – 2012-04-20T20:26:38.697
I'd rather see UTF-8 as well; but I have to say, I've seen just as many people who have said "my char strings are now UTF-8" and not dealt with the problems therein at all. – Billy ONeal – 2012-04-27T15:59:09.163
http://www.utf8everywhere.org/ – alex – 2012-04-30T03:59:21.407
@larsmans: That's because using regular quotes like that tells the interpreter that there's bytes inside. If you use u'', it should work correctly. Python2 has the "helpful" feature of automatically encoding/decoding between utf8/ascii in some situations. In python3 your example works because the quotes denote a type that contains unicode codepoints, not bytes. – Daenyth – 2012-04-30T22:05:02.750
@tchrist: "UTF-8 DOES NOT have the same caveats as UTF-8." But surely UTF-8 has EXACTLY the same caveats that UTF-8 does? (Sorry, I couldn't help myself...) – Teemu Leisti – 2013-07-16T11:29:30.213
http://www.theregister.co.uk/2013/10/04/verity_stob_unicode/ – Pavel Radzivilovsky – 2013-10-30T12:54:33.133
I am not sure what the caveats of UTF-8 are, but at least those caveats (if they exist) should be a lot more visible than UTF-16's, because the non-ASCII result will look broken immediately. – Eonil – 2013-11-15T21:39:01.127