Description
XStream generates XML that is not well-formed according to the definition of the W3C recommendation [1].
The problem arises because not all characters are legal in an XML name [2].
XStream (in particular the ReflectionConverter) generates XML element names using Java identifier names. But not all legal Java identifiers [3] are legal XML element names.
XStream already has a mechanism to generate "XML friendly names" (XmlFriendlyReplacer), but this replacer only replaces the $ sign.
Since the JLS is a bit fuzzy about which identifiers are legal, I've implemented a small application that tests every Unicode character: if the character is legal to start or continue a Java identifier, the application checks whether it is also legal to start or continue an XML name.
The result is that there are 85 characters legal in Java identifiers, but not legal in XML names. For most of these characters, it is unlikely that they will be used in Java identifiers on a typical American or European system. But the following characters are legal in Java identifiers and are printable in these systems:
'$', '¢', '£', '¤', '¥', 'ª', 'µ', 'º'
Nevertheless, all the other characters are just as legal as part of a Java identifier.
Even the Unicode code point RIGHT-TO-LEFT OVERRIDE (0x202e) [4] is legal! (I tried it with Eclipse Helios on Windows XP. It looks (and feels) weird, but it is legal and does not produce compile errors!)
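For anyone who wants to reproduce the check described above, here is a minimal sketch. It is not the original test application: the class and method names are made up for illustration, and the XML side assumes the NameStartChar/NameChar productions of XML 1.0 (Fifth Edition). The exact number of characters found will probably depend on the Unicode version of the JDK it runs on.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of XStream.
class IdentifierVsXmlName {

    // NameStartChar as defined by XML 1.0 (Fifth Edition); assumed edition.
    static boolean isXmlNameStart(int cp) {
        return cp == ':' || cp == '_'
            || (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z')
            || (cp >= 0xC0 && cp <= 0xD6) || (cp >= 0xD8 && cp <= 0xF6)
            || (cp >= 0xF8 && cp <= 0x2FF) || (cp >= 0x370 && cp <= 0x37D)
            || (cp >= 0x37F && cp <= 0x1FFF) || (cp >= 0x200C && cp <= 0x200D)
            || (cp >= 0x2070 && cp <= 0x218F) || (cp >= 0x2C00 && cp <= 0x2FEF)
            || (cp >= 0x3001 && cp <= 0xD7FF) || (cp >= 0xF900 && cp <= 0xFDCF)
            || (cp >= 0xFDF0 && cp <= 0xFFFD) || (cp >= 0x10000 && cp <= 0xEFFFF);
    }

    // NameChar: everything in NameStartChar plus a few extra characters.
    static boolean isXmlNameChar(int cp) {
        return isXmlNameStart(cp) || cp == '-' || cp == '.'
            || (cp >= '0' && cp <= '9') || cp == 0xB7
            || (cp >= 0x0300 && cp <= 0x036F) || (cp >= 0x203F && cp <= 0x2040);
    }

    // Collects every code point that Java accepts in an identifier
    // (as first or as later character) but XML rejects in the same position.
    static List<Integer> findProblematic() {
        List<Integer> result = new ArrayList<>();
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            boolean badStart = Character.isJavaIdentifierStart(cp) && !isXmlNameStart(cp);
            boolean badPart = Character.isJavaIdentifierPart(cp) && !isXmlNameChar(cp);
            if (badStart || badPart) {
                result.add(cp);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> problematic = findProblematic();
        System.out.println(problematic.size() + " problematic code points");
        for (int cp : problematic) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

The sketch uses the int overloads of Character.isJavaIdentifierStart and Character.isJavaIdentifierPart so that code points outside the Basic Multilingual Plane are covered as well.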
The solution to this issue would be to enhance the XmlFriendlyReplacer to replace all 85 characters.
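One possible shape for such a replacement is sketched below. This is only an illustration, not XStream's actual encoding: the class name, the "_." hex prefix and the "__" escape for the underscore are assumptions chosen for the example. The legality checks are passed in as predicates, so any XML name test can be plugged in.

```java
import java.util.function.IntPredicate;

// Hypothetical helper, not part of XStream.
class NameEscaper {

    // Rewrites a Java identifier so that it only contains characters the
    // given predicates accept. "_" is doubled so that the escape prefix
    // stays unambiguous; every rejected character is written as "_."
    // followed by the four hex digits of each of its UTF-16 code units.
    static String escape(String name, IntPredicate isNameStart, IntPredicate isNameChar) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < name.length(); ) {
            int cp = name.codePointAt(i);
            boolean legal = (i == 0) ? isNameStart.test(cp) : isNameChar.test(cp);
            if (cp == '_') {
                out.append("__");
            } else if (legal) {
                out.appendCodePoint(cp);
            } else {
                for (char c : Character.toChars(cp)) {
                    out.append(String.format("_.%04x", (int) c));
                }
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in predicates that only accept ASCII; a real check would
        // follow the Name production of the XML recommendation.
        IntPredicate start = cp -> (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z') || cp == '_';
        IntPredicate part = cp -> start.test(cp) || (cp >= '0' && cp <= '9') || cp == '-' || cp == '.';
        System.out.println(escape("price$", start, part));
        System.out.println(escape("my_field", start, part));
    }
}
```

Because every escape sequence starts with an underscore and has a fixed width, a decoder could reverse the mapping by scanning for "__" and "_." in the element name.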
[1] http://www.w3.org/TR/REC-xml/#dt-wellformed
[2] http://www.w3.org/TR/REC-xml/#NT-Name
[3] http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#3.8
[4] http://www.unicode.org/charts/PDF/U2000.pdf page 195, format characters
The mechanism to encode Java identifiers has been changed for the upcoming 1.4.x, and the XmlFriendlyReplacer is now deprecated. However, thanks for your analysis. I was not aware that some Unicode chars beyond ASCII can be used in Java identifiers as well; I always wrongly assumed that isJavaIdentifier is restricted to chars within 8-bit ASCII. I'll use your code within a unit test and update the current functionality.