PHP 6 and Request Decoding

» 21 February 2007 » In PHP »

It looks like we have finally settled on an approach for HTTP input (request) decoding in PHP 6. There have been no fewer than 4 different proposals floated before, but this one combines flexibility, performance, intuitiveness, and minimal architectural changes, and has only a couple of small drawbacks. Let’s take a closer look.

As you probably know, correctly determining the encoding of HTTP requests is somewhat of an unsolved problem. I know of no mainstream clients that send the charset specification along with the request. This means that it is up to the server or the application to figure out the encoding, which can be done in a number of ways, including encoding detection, looking at the Accept-Charset header, parsing the request to see if a _charset_ field is passed, and more. Unfortunately, none of them is completely reliable, and the best you can do is guess the encoding with some degree of confidence.
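
To make the guessing concrete, here is a small sketch using today's mbstring extension (PHP 5). The function names are made up for illustration; none of this is part of the PHP 6 proposal itself.

```php
<?php
// Two heuristics for guessing the request encoding. Both can
// disagree with each other and with reality, which is why the
// final decision is left to the application.

// Heuristic 1: the client's Accept-Charset header, if one was sent.
function charset_from_accept_header($header) {
    // E.g. "ISO-8859-1,utf-8;q=0.7,*;q=0.7" -- naively take the
    // first entry and ignore the q-values.
    $parts = explode(',', $header);
    $first = explode(';', trim($parts[0]));
    return strtoupper(trim($first[0]));
}

// Heuristic 2: statistical detection over the raw bytes. With
// strict mode on, valid UTF-8 wins before the Latin-1 fallback.
function charset_from_detection($raw) {
    return mb_detect_encoding($raw, array('UTF-8', 'ISO-8859-1'), true);
}
```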

The approach that we decided on is basically a lazy evaluation scheme. When PHP receives the request, it will simply store it internally as-is and not do any request decoding at all. However, if your script happens to access the $_GET, $_POST, or $_REQUEST arrays, the runtime JIT handler will kick in and convert the values in the array from binary (raw) to Unicode based on the current HTTP input encoding setting. This will be done for the whole array at once, not per element. The encoding setting can be changed at runtime via the tentatively named http_input_encoding() function. If the encoding is changed, the JIT handler is re-armed and the next access to the arrays will re-convert the stored raw data to Unicode based on the new setting.
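
In script terms, the flow described above might look something like this; keep in mind that http_input_encoding() is only the tentatively named function from the proposal, so the exact name and signature may still change:

```php
<?php
// Hypothetical usage of the proposed JIT request decoding.
// Nothing has been decoded yet when the script starts.

http_input_encoding('iso-8859-1'); // arm the JIT handler with our guess
$name = $_GET['name'];             // first access: whole array converted
                                   // from raw bytes to Unicode

http_input_encoding('utf-8');      // re-arm with a different encoding
$name = $_GET['name'];             // next access: the stored raw data is
                                   // re-converted under the new setting
```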

The advantages of this approach are numerous. For one, PHP is not forced to guess the encoding of the request during the request-parsing stage, which happens before the script is executed. This allows the application to explicitly set the expected encoding or query other sources for the possible encoding value. For example, there could be a function that performs encoding detection on the request and returns the guess along with the degree of confidence; or PHP could parse the request and provide the raw value of the _charset_ field. In either case, it is up to the application to set the encoding before accessing the request arrays. Secondly, PHP does not have to do request decoding until it is necessary to do so, removing the upfront cost for scripts that do not need the request arrays. Thirdly, in case there are conversion errors, they are processed using the same mechanism that PHP employs for other encoding conversions, allowing the application to set a custom conversion error handler.
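
As a rough analogue of such error handling in today's PHP, mbstring can already be told to drop offending characters during a conversion. This is only an illustration of the intended behavior; the PHP 6 conversions themselves will go through the Unicode error-handling machinery, not mbstring:

```php
<?php
// Tell mbstring to drop invalid sequences instead of substituting
// a marker character for them.
mb_substitute_character('none');

// "\x80" is not a valid UTF-8 sequence; converting UTF-8 to UTF-8
// validates the string, and the setting above removes the bad byte.
$raw   = "caf\x80e";
$clean = mb_convert_encoding($raw, 'UTF-8', 'UTF-8');
// $clean === "cafe"
```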

One possible problem with this approach was pointed out by Rasmus. Someone could try to inject bogus data into the request, so that when the app accesses a request array for the first time, the bogus data triggers errors in the conversion process. I think we can deal with this issue in a sensible way, and that the pros of our approach outweigh the cons. Note that the decoding of the request has nothing to do with filtering. The job of the filter extension is to validate or sanitize the data, and it has to operate on the results of the request conversion, i.e. Unicode strings.
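
To illustrate the separation, here is what filtering looks like with ext/filter on a value that has already been decoded; filter_var() is available as of PHP 5.2, and $decoded here is just a stand-in for a converted request value:

```php
<?php
// Filtering is a separate step from decoding: ext/filter validates
// or sanitizes values that have already been converted to strings.

$decoded = '42abc';  // stands in for a decoded $_GET value

// Validation rejects the value outright...
$age = filter_var($decoded, FILTER_VALIDATE_INT);        // false

// ...while sanitization strips everything but digits and signs.
$num = filter_var($decoded, FILTER_SANITIZE_NUMBER_INT); // "42"
```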

Hope this has been a useful preview of this very important part of PHP 6. Once this functionality is complete, we can finally make the Unicode preview release. Stay tuned.


  1. Basil Gohar
    22/02/2007 at 6:41 am Permalink

    I personally cannot find any serious flaw with this design – how do other environments handle dealing with Unicode? Do they even parse incoming request information, or do they just allow runtime functions to handle the decoding?

    Also, I am not entirely sure I understood Rasmus’ concerns. Isn’t the issue of bogus data present regardless of when/how the conversion process happens? Is it a security concern, such that if the errors happened prior to the JIT routine, less harm could be done?

  2. Andrew Magruder
    22/02/2007 at 7:18 am Permalink

    We’re experiencing this problem with PHP 5.x now. What I don’t understand from your proposal is how PHP handles errors and how we can address those errors at the application level.

    I think delaying it until usage time is a good thing. I think allowing the application to specify an encoding is a good thing.


    What happens when the conversion from raw to $_ variables fails because of an invalid character encoding?

    How does the application find out about it?

    Will the application (please!) get the converted data with offending characters dropped (rather than substituted with a question mark)?

    Are you planning on providing some way to get at the raw data directly? (So we can build what we need if it won’t be part of base-PHP.)

    Is there any provision for the PHP application developer to write a mapping function to replace well-known problematic characters at JIT-translation time? For example, we are routinely POSTed Latin-1-encoded Euro symbols in otherwise UTF-8-encoded POSTs.

    If suitable diagnostics are not provided by base-level PHP, the only application recourse I see, if you want to be *sure* that you’re doing the right thing, is to convert $_ to every encoding your application might expect to deal with (about 4-5 for us) and then do string comparisons to look for missing data. Did I miss something? (I looked at the php-dev list for a bit and didn’t find the answers to these questions.)

  3. andrei
    22/02/2007 at 9:16 am Permalink

    @Basil: If the conversion happened prior to script execution, we would have to stop at the first error and set some global error flag, since we cannot meaningfully continue. With the new approach, the conversions trigger the normal PHP error handler, which does not necessarily stop. Since the error handler may be a custom user one, there are some potential problems, depending on what the user error handler does. But like I said, this is a manageable issue.

    @Andrew: If there are conversion problems from raw data to the $_* variables, the conversion error handler will be invoked. The application may set a custom error handler, if so desired. If you want to receive data with offending characters dropped, you can set the global error mode in your .ini file to U_CONV_ERROR_SKIP. You will be able to get at the raw data directly. We have not considered providing sanitizing functions like the ones you described, but if we do, they will probably be part of ext/filter.

  4. andrei
    27/01/2008 at 4:02 am Permalink

    I use IIS and I have decoding problems too.
    When I click on a UTF-8-encoded URL, the resulting query string replaces any non-English characters with question marks. Why? I use UTF-8 encoding in my pages! Thank you

