  1. Undertow
  2. UNDERTOW-1802

Improve FormEncodedDataDefinition to handle chars in configured encoding


    • Type: Enhancement
    • Resolution: Done
    • Priority: Major
    • 2.0.31.Final, 2.1.4.Final, 2.2.2.Final
    • Core
    • None

      Characters are garbled when EAP 7/Undertow's FormEncodedDataDefinition parses raw (not percent-encoded) multibyte characters in POST request data. The same request is handled without garbling on EAP 6 (JBossWeb) and Tomcat.

      example application:

      ...(snip)...
      @WebServlet(name = "TestServlet", urlPatterns = {"/test"})
      public class TestServlet extends HttpServlet {
          ...(snip)...
      
          protected void doPost(HttpServletRequest request, HttpServletResponse response)
                  throws ServletException, IOException {
              // The request encoding must be set before the first getParameter() call
              request.setCharacterEncoding("UTF-8");
              response.setCharacterEncoding("UTF-8");
              response.setContentType("text/html;charset=UTF-8");
              try (PrintWriter out = response.getWriter()) {
                  out.println("test = " + request.getParameter("test"));
              }
          }
      }
      

      result:

      • EAP 7 (Undertow)
        $ curl http://localhost:8080/example/test -d "test=テスト"
        test =  ̄テニ ̄ツᄍ ̄テネ 
        
      • EAP 6 (JBossWeb) and JWS 5 Tomcat 9
        $ curl http://localhost:8080/example/test -d "test=テスト"
        test = テスト
        

      IMO, as per the specification, a client should percent-encode (URL-encode) POST data that it sends as "Content-Type: application/x-www-form-urlencoded". Web browsers and other compliant HTTP clients do percent-encode such data, so I think the root cause of this issue is incorrect client behavior. In fact, the characters are not garbled when the value is percent-encoded, as in the following curl command:

      $ curl localhost:8080/test/ -d "test=%E3%83%86%E3%82%B9%E3%83%88"
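
For reference, a minimal sketch of how a Java client could produce the percent-encoded body used in the curl command above. The class and method names (FormBodyEncoder, encodeField) are hypothetical and only for illustration; java.net.URLEncoder is the standard JDK class.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds one "name=value" pair for an
// application/x-www-form-urlencoded request body.
final class FormBodyEncoder {

    static String encodeField(String name, String value) {
        try {
            // URLEncoder applies the form-encoding rules: bytes outside the
            // unreserved set are written as %XX using the given charset.
            return URLEncoder.encode(name, "UTF-8") + "=" + URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "\u30C6\u30B9\u30C8" is the Japanese word used in the report (katakana "tesuto").
        System.out.println(encodeField("test", "\u30C6\u30B9\u30C8"));
        // prints test=%E3%83%86%E3%82%B9%E3%83%88
    }
}
```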
      

      However, I think we can slightly improve Undertow's FormEncodedDataDefinition so that it is more compatible with EAP 6/JBossWeb and no longer corrupts raw multibyte characters in these cases.

      Here's an analysis of implementation differences between EAP 6 and EAP 7:

      • EAP 6/JBossWeb (org.apache.tomcat.util.http.Parameters#processParameters) parses the parameters as bytes, then converts the byte array of each key and value to a String through ByteChunk.
      • EAP 7/Undertow (io.undertow.server.handlers.form.FormEncodedDataDefinition#doParse) also parses the parameters as bytes. However, it converts each byte directly to one char with "StringBuilder#append((char) n)", where n is a single byte. A multibyte character (such as a Japanese character) spans several bytes, so no single byte maps to the correct character, and the result is garbled.
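
The difference between the two approaches can be reproduced outside the server. The following is a minimal sketch, not Undertow or JBossWeb code: it assumes that n is a Java byte (so the cast sign-extends), and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo of the two conversion strategies described above.
final class ByteCastDemo {

    // Per-byte cast, as in append((char) n): every byte becomes its own char.
    // A byte >= 0x80 is negative in Java, so the cast yields a char in U+FF80..U+FFFF.
    static String castEachByte(byte[] raw) {
        StringBuilder sb = new StringBuilder();
        for (byte b : raw) {
            sb.append((char) b);
        }
        return sb.toString();
    }

    // Whole-array decode: the bytes of the value are collected first and
    // decoded in one step with the request encoding.
    static String decodeWhole(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // UTF-8 bytes of the value from the report: E3 83 86 E3 82 B9 E3 83 88
        byte[] raw = "\u30C6\u30B9\u30C8".getBytes(StandardCharsets.UTF_8);
        System.out.println(castEachByte(raw)); // nine unrelated chars
        System.out.println(decodeWhole(raw));  // the original three characters
    }
}
```

With this input the per-byte cast produces nine chars (U+FFE3, U+FF83, U+FF86, ...), which are the fullwidth and halfwidth forms seen in the EAP 7 output above.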

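      So, I think we can improve FormEncodedDataDefinition by collecting the bytes in a ByteArrayOutputStream instead of appending them to a StringBuilder, and converting them to a String in the configured encoding: https://github.com/undertow-io/undertow/compare/master...msfm:master_UNDERTOW-1802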
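
To illustrate the idea only, here is a minimal standalone sketch of byte-buffered decoding. It is not the code from the linked branch: FormValueDecoder, decode and hex are hypothetical names, and the lenient handling of malformed escapes is an assumption of this sketch.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: decode one form-encoded key or value from raw bytes.
final class FormValueDecoder {

    static String decode(byte[] raw, Charset charset) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream(raw.length);
        for (int i = 0; i < raw.length; i++) {
            byte b = raw[i];
            if (b == '+') {
                buf.write(' ');                      // '+' means space in form data
            } else if (b == '%' && i + 2 < raw.length
                    && hex(raw[i + 1]) >= 0 && hex(raw[i + 2]) >= 0) {
                buf.write((hex(raw[i + 1]) << 4) | hex(raw[i + 2]));
                i += 2;                              // skip the two hex digits
            } else {
                buf.write(b);                        // raw byte, kept as-is
            }
        }
        // Decode all collected bytes at once, so multibyte sequences stay intact.
        return new String(buf.toByteArray(), charset);
    }

    // Returns the value of an ASCII hex digit, or -1 if b is not one.
    private static int hex(byte b) {
        if (b >= '0' && b <= '9') return b - '0';
        if (b >= 'a' && b <= 'f') return b - 'a' + 10;
        if (b >= 'A' && b <= 'F') return b - 'A' + 10;
        return -1;
    }

    public static void main(String[] args) {
        byte[] rawUtf8 = "\u30C6\u30B9\u30C8".getBytes(StandardCharsets.UTF_8);
        byte[] escaped = "%E3%83%86%E3%82%B9%E3%83%88".getBytes(StandardCharsets.US_ASCII);
        // Both forms decode to the same three characters.
        System.out.println(decode(rawUtf8, StandardCharsets.UTF_8)
                .equals(decode(escaped, StandardCharsets.UTF_8)));
    }
}
```

This sketch resolves percent escapes at the byte level, which is a design choice made here for brevity; the linked branch may handle escapes differently.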

      Of course, if a parameter value contains both % and raw multibyte characters, it still cannot be parsed correctly, because the value cannot be decoded as a valid percent-encoded value. However, I do not consider this a problem, because EAP 6/JBossWeb behaves the same way.

            rhn-support-rmartinc Ricardo Martin Camarero
            rhn-support-mmiura Masafumi Miura