Crazy Coding
Posted by Steve at 6/11/2007 10:59pm

I came across a problem in our code the other day. It took a while to figure out what was going on, so I thought I'd write up what I found. First, a bit of background:

Our app uses Struts, a Java web application framework that handles automatic conversion between HTML forms and server-side objects, among other things. Along the way we have been internationalising our software to make it more useful in non-English-speaking countries. One issue that arose from this is submitting a form with data that includes multi-byte characters. When a form is submitted, the server receives a stream of bytes, which it has to interpret as meaningful data. If the form contains only plain English text, everything usually works, because nearly every common encoding maps those characters to the same bytes as ASCII (the mapping between bytes and characters is the "character encoding"). Other scripts, such as Asian or some European alphabets, need an encoding that covers them, and different encodings represent those characters with different bytes. So if a user enters Japanese characters into a form, the browser needs to pick an encoding to convert the text into bytes and send it to the server, and the server then needs to decode the bytes back to the original text using the same encoding.
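To make the mismatch concrete, here is a minimal sketch using only the standard Java library. The class and method names (EncodingMismatch, roundTrip) are made up for illustration and are not part of our app or of Struts. The Japanese text is written with Unicode escapes so the source file itself stays plain ASCII.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatch {

    // Encode text to bytes with one charset (the "browser" side), then decode
    // those bytes with another charset (the "server" side).
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] wire = text.getBytes(encodeWith);
        return new String(wire, decodeWith);
    }

    public static void main(String[] args) {
        String japanese = "\u65e5\u672c\u8a9e"; // three Japanese characters

        // Each of these characters takes three bytes in UTF-8.
        System.out.println(japanese.getBytes(StandardCharsets.UTF_8).length); // 9

        // Same encoding on both sides: the text survives.
        System.out.println(roundTrip(japanese,
                StandardCharsets.UTF_8, StandardCharsets.UTF_8).equals(japanese)); // true

        // Server guesses ISO-8859-1: every byte becomes its own character.
        System.out.println(roundTrip(japanese,
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1).equals(japanese)); // false

        // Plain ASCII text is unaffected by the wrong guess, which is why
        // English-only testing never shows the problem.
        System.out.println(roundTrip("steve",
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1)); // steve
    }
}
```

The last call is the reason this kind of bug hides so well: English test data decodes correctly whichever encoding the server picks.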

The issue then is which encoding we should use, and how we make sure the browser sends data using that encoding. The first question is fairly easy. The Unicode character set can represent text from virtually all languages, and the de facto encoding used with Unicode on the web is UTF-8, so we prefer to use that.

The second question isn't so simple: browsers don't seem to consistently tell the server which encoding they used. The form can specify a list of accepted encodings using the accept-charset attribute, but the ancient version of Struts that we're using doesn't support it. There's another way for the server and the browser to agree on an encoding, which is even more of a hack. In the absence of any other suggested encoding, the browser will use whatever encoding the server used to send the form (if it has been specified). As far as I can tell, there's no formal specification that covers this behaviour, but it seems to work, and without any better option, this is what we rely on. We send our pages to the browser using UTF-8, and when the form is submitted to the server, we interpret it as UTF-8 data (unless the browser has explicitly said that it's using something else).

Here it gets a bit interesting. We use a custom request processor that sets the request encoding to UTF-8, and everything works great, whether the text in the form is in English, Japanese or Polish. Or so I thought. It seems that non-English text doesn't get decoded properly at the server if the form contains a file upload input. This seemed really strange, and I spent a fair bit of time debugging, making sure we were setting the encoding on the request, trying to figure out why it was only a problem when the form contained a file input.

Normally a form is submitted to the server using the application/x-www-form-urlencoded content type (this kind of "encoding" shouldn't be confused with character encoding; they are two separate things), which means that if you've got a form like this:

  <form method="post">
    <input type="text" name="name" />
    <input type="text" name="description" />
  </form>

It will be submitted something like this:

  name=steve&description=This%20is%20the%20description

Each input's name and value are put into one big long string and sent to the server. If there's a file input in the form, this type of encoding is inefficient for binary data, so the form is declared with enctype="multipart/form-data" instead. This basically creates a new section in the request for each input. For standard text inputs, the content of the section is simply the value entered into the form; for file inputs, the contents of the file are sent directly as a binary stream.
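Here is a rough sketch of the two body formats, again using only the standard Java library. FormBodySketch and its methods are hypothetical names for illustration; a real browser or servlet container does all of this for you. Note that java.net.URLEncoder writes a space as "+" where the example above shows %20; URLDecoder turns both back into a space.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class FormBodySketch {

    // Percent-encode one name or value using the given character encoding.
    static String encode(String text, String charset) {
        try {
            return URLEncoder.encode(text, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Reverse of encode(): the charset must match or non-ASCII text is garbled.
    static String decode(String text, String charset) {
        try {
            return URLDecoder.decode(text, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // application/x-www-form-urlencoded: name=value pairs joined with '&'.
    static String urlEncodedBody(Map<String, String> fields, String charset) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(encode(field.getKey(), charset))
                .append('=')
                .append(encode(field.getValue(), charset));
        }
        return body.toString();
    }

    // multipart/form-data: one section per input, with the value sent as-is
    // (no percent-encoding), delimited by a boundary string.
    static String multipartTextPart(String boundary, String name, String value) {
        return "--" + boundary + "\r\n"
             + "Content-Disposition: form-data; name=\"" + name + "\"\r\n"
             + "\r\n"
             + value + "\r\n";
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("name", "steve");
        fields.put("description", "This is the description");

        // prints name=steve&description=This+is+the+description
        System.out.println(urlEncodedBody(fields, "UTF-8"));

        // The same character produces different bytes in different encodings.
        System.out.println(encode("\u00e9", "UTF-8"));      // %C3%A9
        System.out.println(encode("\u00e9", "ISO-8859-1")); // %E9

        System.out.print(multipartTextPart("XyZ", "name", "steve"));
    }
}
```

In both formats the receiver only sees bytes, so it still has to know which character encoding the browser used for the text values.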

I figured that the sections in the request for the text inputs were not being decoded properly, but I didn't know why. This is what the relevant part of our request processor looked like:

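   protected void processPopulate(HttpServletRequest request, HttpServletResponse response,
      ActionForm form, ActionMapping mapping) throws ServletException {

      // Default to UTF-8 unless the browser declared a character encoding.
      if (request.getCharacterEncoding() == null) {
         try {
            request.setCharacterEncoding("UTF-8");
         } catch (UnsupportedEncodingException e) {
            // Checked exception from java.io, declared by setCharacterEncoding
            throw new ServletException(e);
         }
      }
      super.processPopulate(request, response, form, mapping);

   }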

I eventually discovered that the request object being passed to this method was a different type if the form included file inputs. I had a look at the Struts source (thank goodness for open source), and found that Struts was identifying the request as a multipart request, and wrapping the object in a MultipartRequestWrapper object. I had a look at the source for that class and found this:

    public String getCharacterEncoding() {
        return request.getCharacterEncoding();
    }

    /**
     * This method does nothing.  To use any Servlet 2.3 methods,
     * call on getRequest() and use that request object.  Once Servlet 2.3
     * is required to build Struts, this will no longer be an issue.
     */
    public void setCharacterEncoding(String encoding) {
        ;
    }

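Wow. That's helpful. I still haven't figured out for certain why the setCharacterEncoding method doesn't set the encoding on the underlying request object, though the comment hints at a reason: setCharacterEncoding was only added to the servlet API in version 2.3, and this version of Struts presumably still had to build against Servlet 2.2, so the wrapper couldn't call it. At least I'd figured out what I needed to do to fix it. The simple fix to our request processor was this: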

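   protected void processPopulate(HttpServletRequest request, HttpServletResponse response,
      ActionForm form, ActionMapping mapping) throws ServletException {

      // For multipart requests Struts hands us a wrapper whose
      // setCharacterEncoding does nothing, so unwrap it first.
      HttpServletRequest realRequest = request;
      if (request instanceof MultipartRequestWrapper) {
         realRequest = ((MultipartRequestWrapper)request).getRequest();
      }

      if (realRequest.getCharacterEncoding() == null) {
         try {
            realRequest.setCharacterEncoding("UTF-8");
         } catch (UnsupportedEncodingException e) {
            // Checked exception from java.io, declared by setCharacterEncoding
            throw new ServletException(e);
         }
      }
      super.processPopulate(request, response, form, mapping);

   }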

After this change, everything worked as expected.
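If you want to see the trap in isolation, here is a small self-contained sketch that runs without Struts or a servlet container. Everything in it is a hypothetical stand-in: Request plays the part of HttpServletRequest, and NoOpSetterWrapper imitates the two MultipartRequestWrapper methods quoted above.

```java
public class WrapperSketch {

    // Hypothetical stand-in for the two HttpServletRequest methods involved.
    interface Request {
        String getCharacterEncoding();
        void setCharacterEncoding(String encoding);
    }

    // Behaves like a normal request: remembers whatever encoding is set.
    static class PlainRequest implements Request {
        private String encoding; // null until someone sets it
        public String getCharacterEncoding() { return encoding; }
        public void setCharacterEncoding(String encoding) { this.encoding = encoding; }
    }

    // Imitates the wrapper: reads are delegated, writes are silently dropped.
    static class NoOpSetterWrapper implements Request {
        private final Request request;
        NoOpSetterWrapper(Request request) { this.request = request; }
        public Request getRequest() { return request; }
        public String getCharacterEncoding() { return request.getCharacterEncoding(); }
        public void setCharacterEncoding(String encoding) { /* does nothing */ }
    }

    // The original approach: set the default on whatever object we are given.
    static void naiveDefault(Request request) {
        if (request.getCharacterEncoding() == null) {
            request.setCharacterEncoding("UTF-8");
        }
    }

    // The fixed approach: unwrap first, then set the default on the real request.
    static void unwrappingDefault(Request request) {
        Request real = request;
        if (request instanceof NoOpSetterWrapper) {
            real = ((NoOpSetterWrapper) request).getRequest();
        }
        if (real.getCharacterEncoding() == null) {
            real.setCharacterEncoding("UTF-8");
        }
    }

    public static void main(String[] args) {
        Request wrapped = new NoOpSetterWrapper(new PlainRequest());

        naiveDefault(wrapped);
        System.out.println(wrapped.getCharacterEncoding()); // null

        unwrappingDefault(wrapped);
        System.out.println(wrapped.getCharacterEncoding()); // UTF-8
    }
}
```

The nasty part is that the naive version fails silently: no exception, no log message, just an encoding that never gets set.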


Note: I mentioned that we're using a fairly old version of Struts (version 1.1). From looking at the source of later versions, it looks like this is no longer an issue, and the accept-charset attribute is supported on forms as well. Hopefully this will save some time for anyone else who's stuck using an old version.
