  1. JBoss Web Services
  2. JBWS-1716

Erroneous UTF-8 character encoding when marshalling on machines with non-UTF-8 default encoding


    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • jbossws-1.2.1
    • None
    • Workaround Exists

      Set the system property file.encoding=utf-8.

      This workaround is equally bad, since it breaks apps that rely on platform-specific reading ...


      When a client sends a request containing a non-ASCII character such as the "ç" in "Français" from a machine whose default character encoding is not UTF-8, the character is encoded incorrectly. For example, when the machine's default character set is MS1252 (Windows), the "ç" is marshalled onto the network stream as 0xC3 0x83 0xC2 0xA7 instead of the valid UTF-8 sequence 0xC3 0xA7.
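The byte sequence reported above can be reproduced outside JBossWS. The following is a minimal sketch, not JBossWS code: the class and method names are hypothetical, and it assumes the wrong charset is windows-1252 (the usual `java.nio` name for the Windows default mentioned in the report). It decodes correct UTF-8 bytes with the wrong charset and encodes the result as UTF-8 again.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it only illustrates the double-encoding effect.
final class DoubleEncodingDemo {

    // Decode UTF-8 bytes with the wrong charset, then encode the result as UTF-8 again.
    static byte[] misdecodeAndReencode(byte[] utf8Bytes, Charset wrongCharset) {
        String misread = new String(utf8Bytes, wrongCharset);
        return misread.getBytes(StandardCharsets.UTF_8);
    }

    // Format bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Charset windows1252 = Charset.forName("windows-1252");
        // "\u00E7" is the character c-cedilla, written as an escape so the
        // example does not depend on the source file encoding.
        byte[] correct = "\u00E7".getBytes(StandardCharsets.UTF_8);
        byte[] corrupted = misdecodeAndReencode(correct, windows1252);
        System.out.println(hex(correct));   // C3 A7
        System.out.println(hex(corrupted)); // C3 83 C2 A7
    }
}
```

The two bytes 0xC3 and 0xA7 are read as the two windows-1252 characters "Ã" and "§", and each of those then takes two bytes in UTF-8, which gives the four-byte sequence from the report.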

      Setting the system property file.encoding=utf-8 works around this, but it causes as many problems elsewhere as it fixes (especially in the case of legacy platform-specific file reading) ...

      A forum post very likely describes the same phenomenon: http://www.jboss.com/index.html?module=bb&op=viewtopic&p=4030510#4030510

      After some good hours of stepping through the JBossWS code, I discovered what I guess must be the culprit in the method XMLFragment.writeSourceInternal(Writer writer):
      ....
      if (reader == null)
      reader = new InputStreamReader(streamSource.getInputStream());

      Here streamSource.getInputStream() is a stream that is already UTF-8 encoded. However, an InputStreamReader created around it without an explicit charset uses the machine's default character encoding, so the bytes of the UTF-8 stream are interpreted in a different encoding scheme, which corrupts the data.
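To illustrate the difference, here is a minimal sketch that reads the same UTF-8 bytes once with an explicit UTF-8 charset and once with windows-1252 standing in for a non-UTF-8 platform default. It is not the JBossWS fix: the class and helper names are hypothetical, and it assumes the stream really is UTF-8, as the analysis above states.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it is not part of JBossWS.
final class ExplicitCharsetReaderDemo {

    // Read the whole stream as text, decoding with the given charset.
    static String readAll(InputStream in, Charset charset) {
        try (Reader reader = new InputStreamReader(in, charset)) {
            StringBuilder sb = new StringBuilder();
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "Fran\u00E7ais".getBytes(StandardCharsets.UTF_8);

        // Explicit UTF-8: the text survives regardless of the platform default.
        String ok = readAll(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);

        // windows-1252 simulates what the one-argument InputStreamReader
        // constructor would do on a machine with that default encoding.
        String bad = readAll(new ByteArrayInputStream(utf8), Charset.forName("windows-1252"));

        System.out.println(ok.equals("Fran\u00E7ais"));        // true
        System.out.println(bad.equals("Fran\u00C3\u00A7ais")); // true
    }
}
```

Passing the charset explicitly to `InputStreamReader` avoids any dependence on `file.encoding`, so it does not affect code elsewhere that relies on the platform default.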

      Each time the data passes through marshalling, more corruption is added, so the number of wrong characters grows as the data is passed back and forth.
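The growth can be simulated with a short loop. This is a sketch under the same assumptions as the report (UTF-8 data, windows-1252 as the wrong default); the class and method names are hypothetical, and each loop iteration stands in for one pass through the faulty marshalling step.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; each pass mimics one faulty marshalling round.
final class RepeatedCorruptionDemo {

    // Returns the UTF-8 byte length of the text before any pass (index 0)
    // and after each mis-decode/re-encode pass (indices 1..passes).
    static int[] sizesAfterPasses(String text, Charset wrongCharset, int passes) {
        int[] sizes = new int[passes + 1];
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        sizes[0] = bytes.length;
        for (int i = 1; i <= passes; i++) {
            // Misread the UTF-8 bytes in the wrong charset, then write them as UTF-8 again.
            String misread = new String(bytes, wrongCharset);
            bytes = misread.getBytes(StandardCharsets.UTF_8);
            sizes[i] = bytes.length;
        }
        return sizes;
    }

    public static void main(String[] args) {
        int[] sizes = sizesAfterPasses("\u00E7", Charset.forName("windows-1252"), 3);
        for (int i = 0; i < sizes.length; i++) {
            System.out.println("after " + i + " pass(es): " + sizes[i] + " bytes");
        }
    }
}
```

A single "ç" starts at 2 bytes and is 4 bytes after one pass and 8 bytes after two, because every non-ASCII byte is turned into a new multi-byte character on each pass.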

      I would suggest attaching a reader to the StreamSource source instance var so that it keeps track of its encoding, but that might break things elsewhere ...

        1. JAXBSerializer.java--afterchange
          6 kB
          song andy
        2. JAXBSerializer.java--beforechange
          5 kB
          song andy

              Unassigned Unassigned
              floefliep Wim De Muynck (Inactive)