[XSTR-623] XStream generates XML that is not well-formed (according to the XML specification) by writing illegal characters in names

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.3.1
Fix Version/s: 1.4
Component/s: IO
Labels:
None

JDK version and platform:
Sun 1.6.0_18 Windows XP 32bit

Description

XStream generates XML that is not well-formed according to the definition of the W3C recommendation [1].

The problem arises because not all characters are legal in an XML name [2].

XStream (in particular the ReflectionConverter) generates XML element names using Java identifier names. But not all legal Java identifiers [3] are legal XML element names.

XStream already has a mechanism to generate "XML friendly names" (XmlFriendlyReplacer), but this replacer only replaces the $ sign.

Since the JLS is a bit fuzzy about which identifiers are legal, I've implemented a small application that tests, for every Unicode character, whether it is legal at the start or in the remainder of a Java identifier and, if so, whether it is also legal at the start or in the remainder of an XML name.

The result is that there are 85 characters legal in Java identifiers, but not legal in XML names. For most of these characters, it is unlikely that they will be used in Java identifiers on a typical American or European system. But the following characters are legal in Java identifiers and are printable in these systems:
'$', '¢', '£', '¤', '¥', 'ª', 'µ', 'º'

Nevertheless, all of the other characters are perfectly legal in a Java identifier as well.
Even the Unicode code point U+202E (RIGHT-TO-LEFT OVERRIDE, a format character [4]) is legal! (I tried it with Eclipse Helios on Windows XP. It looks (and feels) weird, but it is legal and does not produce compile errors!)
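
The check described above can be sketched roughly as follows. This is not the attached CharacterTest.java; the class name XmlNameCheck and its methods are made up for illustration, and the ranges are taken from the Name production of XML 1.0 (Fifth Edition). The number of hits depends on the Unicode version of the JDK and on how "legal" is defined for the start and remainder positions, so it will not necessarily be 85.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not the attached CharacterTest.java.
class XmlNameCheck {

    // NameStartChar ranges of the XML 1.0 (Fifth Edition) Name production.
    private static final int[][] NAME_START = {
        {':', ':'}, {'A', 'Z'}, {'_', '_'}, {'a', 'z'},
        {0xC0, 0xD6}, {0xD8, 0xF6}, {0xF8, 0x2FF}, {0x370, 0x37D},
        {0x37F, 0x1FFF}, {0x200C, 0x200D}, {0x2070, 0x218F},
        {0x2C00, 0x2FEF}, {0x3001, 0xD7FF}, {0xF900, 0xFDCF},
        {0xFDF0, 0xFFFD}, {0x10000, 0xEFFFF}
    };

    // Additional ranges that NameChar allows after the first character.
    private static final int[][] NAME_EXTRA = {
        {'-', '-'}, {'.', '.'}, {'0', '9'}, {0xB7, 0xB7},
        {0x300, 0x36F}, {0x203F, 0x2040}
    };

    private static boolean inRanges(int[][] ranges, int cp) {
        for (int[] r : ranges) {
            if (cp >= r[0] && cp <= r[1]) {
                return true;
            }
        }
        return false;
    }

    static boolean isXmlNameStart(int cp) {
        return inRanges(NAME_START, cp);
    }

    static boolean isXmlNameChar(int cp) {
        return isXmlNameStart(cp) || inRanges(NAME_EXTRA, cp);
    }

    // True if Java accepts cp in an identifier position where XML rejects it.
    static boolean isProblematic(int cp) {
        boolean badStart = Character.isJavaIdentifierStart(cp) && !isXmlNameStart(cp);
        boolean badPart = Character.isJavaIdentifierPart(cp) && !isXmlNameChar(cp);
        return badStart || badPart;
    }

    // Scans every Unicode code point, including the supplementary planes.
    static List<Integer> findProblematic() {
        List<Integer> result = new ArrayList<>();
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            if (isProblematic(cp)) {
                result.add(cp);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> hits = findProblematic();
        System.out.println(hits.size() + " problematic code points");
        // Print only the Latin-1 hits, as code points, to keep the output short.
        for (int cp : hits) {
            if (cp <= 0xFF) {
                System.out.printf("U+%04X%n", cp);
            }
        }
    }
}
```

Under these assumptions, the eight printable characters listed above and U+202E are all reported by isProblematic.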

The solution to this issue would be to enhance the XmlFriendlyReplacer so that it replaces all 85 of these characters.
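
One way such a replacement could work is sketched below. The escape scheme used here ("__" for a literal underscore, "_xHHHH" for each UTF-16 unit of an illegal character) is an assumption chosen for the example and is not the encoding used by XmlFriendlyReplacer or any other XStream class; NameEscaper and its methods are likewise hypothetical.

```java
// Hypothetical escaper for illustration; the escape format is an assumption
// and does not reproduce XStream's actual encoding.
class NameEscaper {

    // NameStartChar ranges of the XML 1.0 (Fifth Edition) Name production.
    private static final int[][] NAME_START = {
        {':', ':'}, {'A', 'Z'}, {'_', '_'}, {'a', 'z'},
        {0xC0, 0xD6}, {0xD8, 0xF6}, {0xF8, 0x2FF}, {0x370, 0x37D},
        {0x37F, 0x1FFF}, {0x200C, 0x200D}, {0x2070, 0x218F},
        {0x2C00, 0x2FEF}, {0x3001, 0xD7FF}, {0xF900, 0xFDCF},
        {0xFDF0, 0xFFFD}, {0x10000, 0xEFFFF}
    };

    // Additional ranges that NameChar allows after the first character.
    private static final int[][] NAME_EXTRA = {
        {'-', '-'}, {'.', '.'}, {'0', '9'}, {0xB7, 0xB7},
        {0x300, 0x36F}, {0x203F, 0x2040}
    };

    private static boolean inRanges(int[][] ranges, int cp) {
        for (int[] r : ranges) {
            if (cp >= r[0] && cp <= r[1]) {
                return true;
            }
        }
        return false;
    }

    private static boolean isLegal(int cp, boolean first) {
        return inRanges(NAME_START, cp) || (!first && inRanges(NAME_EXTRA, cp));
    }

    // Turns a Java identifier into a string that matches the XML Name production.
    static String encode(String javaName) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < javaName.length(); ) {
            int cp = javaName.codePointAt(i);
            if (cp == '_') {
                out.append("__"); // double the escape character itself
            } else if (isLegal(cp, i == 0)) {
                out.appendCodePoint(cp);
            } else {
                for (char unit : Character.toChars(cp)) {
                    out.append(String.format("_x%04X", (int) unit));
                }
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    // Reverses encode(); input that encode() did not produce is copied through.
    static String decode(String xmlName) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < xmlName.length(); ) {
            if (xmlName.startsWith("__", i)) {
                out.append('_');
                i += 2;
            } else if (xmlName.startsWith("_x", i) && i + 6 <= xmlName.length()) {
                out.append((char) Integer.parseInt(xmlName.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                out.append(xmlName.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String name = "Outer$Inner";
        String encoded = encode(name);
        System.out.println(encoded + " -> " + decode(encoded));
    }
}
```

Because every literal underscore is doubled, an escape sequence can never be confused with part of the original name, so decode(encode(name)) returns the original identifier.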

[1] http://www.w3.org/TR/REC-xml/#dt-wellformed
[2] http://www.w3.org/TR/REC-xml/#NT-Name
[3] http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#3.8
[4] http://www.unicode.org/charts/PDF/U2000.pdf page 195, format characters

Attachments

CharacterTest.jar

08/Jul/10 2:53 AM

1 kB

Michael Schnell
charactertest/CharacterTest.java 4 kB


Activity

Jörg Schaible added a comment - 08/Jul/10 3:11 AM

The mechanism to encode Java identifiers has been changed for the upcoming 1.4.x, and the XmlFriendlyReplacer is now deprecated. However, thanks for your analysis. I was not aware that some Unicode chars can be used in Java identifiers as well; I always wrongly assumed that isJavaIdentifier is restricted to chars within 8-bit ASCII. I'll use your code within a unit test and update the current functionality.

Jörg Schaible added a comment - 27/Jul/11 7:36 PM

Thanks for providing the test code. I've used some of it to implement the final solution in the new XmlFriendlyNameCoder. Fixed in HEAD.

People

Assignee:

Jörg Schaible

Reporter:

Michael Schnell

Votes:

0

Watchers:

1

Dates

Created:

08/Jul/10 2:53 AM

Updated:

05/Aug/11 5:04 PM

Resolved:

27/Jul/11 7:36 PM