XStream / XSTR-623

XStream generates XML that is not well-formed (according to the XML specification) by writing illegal characters in names

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.3.1
    • Fix Version/s: 1.4
    • Component/s: IO
    • Labels:
      None
    • JDK version and platform:
      Sun 1.6.0_18 Windows XP 32bit

      Description

      XStream generates XML that is not well-formed according to the definition of the W3C recommendation [1].

      The problem arises because not all characters are legal in an XML name [2].

      XStream (in particular the ReflectionConverter) generates XML element names using Java identifier names. But not all legal Java identifiers [3] are legal XML element names.

      XStream already has a mechanism to generate "XML friendly names" (XmlFriendlyReplacer), but of the characters that are illegal in XML names, this replacer only handles the $ sign.

      Since the JLS is a bit fuzzy about which identifiers are legal, I've implemented a small application that tests, for every Unicode character, whether it is legal to start or continue a Java identifier and, if so, whether it is also legal to start or continue an XML name.
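
      A minimal sketch of such a scan follows. It is not the reporter's original program: the class and method names are made up for illustration, and it assumes the Name productions of XML 1.0 (Fifth Edition). The set it finds depends on the Unicode version of the JDK it runs on, so the count may differ from the one reported here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of XStream.
final class IdentifierNameScan {

    // NameStartChar production, assuming XML 1.0 (Fifth Edition).
    static boolean isXmlNameStartChar(int c) {
        return c == ':' || c == '_'
            || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
            || (c >= 0xC0 && c <= 0xD6) || (c >= 0xD8 && c <= 0xF6)
            || (c >= 0xF8 && c <= 0x2FF) || (c >= 0x370 && c <= 0x37D)
            || (c >= 0x37F && c <= 0x1FFF) || (c >= 0x200C && c <= 0x200D)
            || (c >= 0x2070 && c <= 0x218F) || (c >= 0x2C00 && c <= 0x2FEF)
            || (c >= 0x3001 && c <= 0xD7FF) || (c >= 0xF900 && c <= 0xFDCF)
            || (c >= 0xFDF0 && c <= 0xFFFD) || (c >= 0x10000 && c <= 0xEFFFF);
    }

    // NameChar production: NameStartChar plus a few extra characters.
    static boolean isXmlNameChar(int c) {
        return isXmlNameStartChar(c) || c == '-' || c == '.'
            || (c >= '0' && c <= '9') || c == 0xB7
            || (c >= 0x300 && c <= 0x36F) || (c >= 0x203F && c <= 0x2040);
    }

    // Collects every code point that Java accepts in a position
    // (start or continuation) where XML rejects it.
    static List<Integer> findProblemCodePoints() {
        List<Integer> result = new ArrayList<>();
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            boolean badStart = Character.isJavaIdentifierStart(cp) && !isXmlNameStartChar(cp);
            boolean badPart = Character.isJavaIdentifierPart(cp) && !isXmlNameChar(cp);
            if (badStart || badPart) {
                result.add(cp);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> problems = findProblemCodePoints();
        System.out.println(problems.size() + " problematic code points:");
        for (int cp : problems) {
            System.out.println(String.format("U+%04X", cp));
        }
    }
}
```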

      The result is that 85 characters are legal in Java identifiers but not legal in XML names. Most of these characters are unlikely to be used in Java identifiers on a typical American or European system. But the following characters are legal in Java identifiers and are printable on these systems:
      '$', '¢', '£', '¤', '¥', 'ª', 'µ', 'º'

      Nevertheless, all the other characters are perfectly legal in a Java identifier as well.
      Even the Unicode code point RIGHT-TO-LEFT OVERRIDE (U+202E) [4] is legal! (I tried it with Eclipse Helios on Windows XP. It looks (and feels) weird, but it is legal and does not produce compile errors!)

      The solution to this issue would be to enhance the XmlFriendlyReplacer so that it replaces all 85 characters.
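
      One way such a replacement could work is sketched below. This is only an illustration, not the XmlFriendlyReplacer API: the class name and the escape scheme ("__" for an underscore, "_." followed by at least four hex digits for an illegal character) are assumptions. The XML name checks again assume XML 1.0 (Fifth Edition), and decoding is not shown.

```java
// Hypothetical escaper, not part of XStream.
final class NameEscaper {

    // NameStartChar production, assuming XML 1.0 (Fifth Edition).
    static boolean isXmlNameStartChar(int c) {
        return c == ':' || c == '_'
            || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
            || (c >= 0xC0 && c <= 0xD6) || (c >= 0xD8 && c <= 0xF6)
            || (c >= 0xF8 && c <= 0x2FF) || (c >= 0x370 && c <= 0x37D)
            || (c >= 0x37F && c <= 0x1FFF) || (c >= 0x200C && c <= 0x200D)
            || (c >= 0x2070 && c <= 0x218F) || (c >= 0x2C00 && c <= 0x2FEF)
            || (c >= 0x3001 && c <= 0xD7FF) || (c >= 0xF900 && c <= 0xFDCF)
            || (c >= 0xFDF0 && c <= 0xFFFD) || (c >= 0x10000 && c <= 0xEFFFF);
    }

    // NameChar production: NameStartChar plus a few extra characters.
    static boolean isXmlNameChar(int c) {
        return isXmlNameStartChar(c) || c == '-' || c == '.'
            || (c >= '0' && c <= '9') || c == 0xB7
            || (c >= 0x300 && c <= 0x36F) || (c >= 0x203F && c <= 0x2040);
    }

    // Lower-case hex, left-padded with zeros to at least four digits.
    static String hex4(int cp) {
        StringBuilder hex = new StringBuilder(Integer.toHexString(cp));
        while (hex.length() < 4) {
            hex.insert(0, '0');
        }
        return hex.toString();
    }

    // Escapes a Java identifier so that the result is a legal XML name.
    static String escape(String name) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < name.length(); ) {
            int cp = name.codePointAt(i);
            boolean legal = (i == 0) ? isXmlNameStartChar(cp) : isXmlNameChar(cp);
            if (cp == '_') {
                out.append("__");                    // keep the escape prefix unambiguous
            } else if (legal) {
                out.appendCodePoint(cp);             // already fine in an XML name
            } else {
                out.append("_.").append(hex4(cp));   // e.g. '$' becomes "_.0024"
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

      With this scheme, escape("price$") yields "price_.0024" and escape("a_b") yields "a__b". The escape sequences themselves consist only of characters that are legal at any position of an XML name.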

      [1] http://www.w3.org/TR/REC-xml/#dt-wellformed
      [2] http://www.w3.org/TR/REC-xml/#NT-Name
      [3] http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#3.8
      [4] http://www.unicode.org/charts/PDF/U2000.pdf page 195, format characters

        People

        • Assignee: Jörg Schaible
        • Reporter: Michael Schnell
        • Votes: 0
        • Watchers: 1
