PHP 6 and Request Decoding
It looks like we have finally settled on an approach for HTTP input (request) decoding in PHP 6. No fewer than four different proposals have been floated before, but this one combines flexibility, performance, intuitiveness, and minimal architectural changes, and has only a couple of small drawbacks. Let’s take a closer look.
As you probably know, correctly determining the encoding of HTTP requests is something of an unsolved problem. I know of no mainstream clients that send the charset specification along with the request. This means that it is up to the server or the application to figure out the encoding, which can be done in a number of ways, including encoding detection, looking at the Accept-Charset header, parsing the request to see if a _charset_ field is passed, and more. Unfortunately, none of them is completely reliable, and the best you can do is guess the encoding with some degree of confidence.
The approach that we decided on is basically a lazy evaluation scheme. When PHP receives the request, it will simply store it internally as-is and not do any request decoding at all. However, if your script happens to access the $_GET, $_POST, or $_REQUEST arrays, the runtime JIT handler will kick in and convert the values in the array from binary (raw) to Unicode based on the current HTTP input encoding setting. This will be done for the whole array at once, not per element. The encoding setting can be changed at runtime via the tentatively named http_input_encoding() function. If the encoding is changed, the JIT handler is re-armed and the next access to the arrays will re-convert the stored raw data to Unicode based on the new setting.
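To make the scheme concrete, here is a minimal sketch of the same idea, written in Python rather than as PHP internals. It is only a model under stated assumptions: the class LazyRequest, its conversions counter, and the get property are hypothetical names invented for illustration, and only the method name http_input_encoding mirrors the tentative PHP function mentioned above.

```python
from urllib.parse import unquote_to_bytes


class LazyRequest:
    """Toy model of lazy request decoding (hypothetical, not PHP's code)."""

    def __init__(self, raw_query: bytes, encoding: str = "utf-8"):
        self._raw = raw_query      # stored as-is; nothing is decoded up front
        self._encoding = encoding  # current HTTP input encoding setting
        self._decoded = None       # None means the handler is armed
        self.conversions = 0       # counts whole-array conversions

    def http_input_encoding(self, encoding: str) -> None:
        """Change the input encoding and re-arm the handler."""
        if encoding != self._encoding:
            self._encoding = encoding
            self._decoded = None

    @property
    def get(self) -> dict:
        """Convert the whole array on first access, then reuse the result."""
        if self._decoded is None:
            self._decoded = self._convert()
        return self._decoded

    def _convert(self) -> dict:
        # Decode every name and value at once from the stored raw bytes.
        self.conversions += 1
        decoded = {}
        for pair in self._raw.split(b"&"):
            if not pair:
                continue
            name, _, value = pair.partition(b"=")
            key = unquote_to_bytes(name.replace(b"+", b" "))
            val = unquote_to_bytes(value.replace(b"+", b" "))
            decoded[key.decode(self._encoding)] = val.decode(self._encoding)
        return decoded
```

In this model, reading get twice costs one conversion, and calling http_input_encoding() with a different encoding costs nothing until the array is read again, at which point the raw bytes are converted afresh.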
The advantages of this approach are numerous. For one, PHP is not forced to guess the encoding of the request during the request parsing stage, which happens before the script is executed. This allows the application to explicitly set the expected encoding or query other sources for the possible encoding value. For example, there could be a function that performs encoding detection on the request and returns the guess along with the degree of confidence; or PHP could parse the request and provide the raw value of the _charset_ field. In either case, it is up to the application to set the encoding before accessing the request arrays. Secondly, PHP does not have to do request decoding until it is necessary to do so, removing the upfront cost for scripts that do not need request arrays. Thirdly, in case there are conversion errors, they are processed using the same mechanism that PHP employs for other encoding conversions, allowing the application to set a custom conversion error handler.
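As a rough illustration of what "query other sources" could look like from the application side, the Python sketch below first checks the raw _charset_ field and then falls back to trying candidate encodings. Everything here is an assumption made for the example: the function names raw_fields and guess_encoding, the candidate list, and especially the confidence numbers, which are arbitrary placeholders rather than the output of any real detector.

```python
import codecs
from urllib.parse import unquote_to_bytes


def raw_fields(raw_query: bytes) -> dict:
    """Split a query string into raw (undecoded) name/value byte pairs."""
    fields = {}
    for pair in raw_query.split(b"&"):
        if not pair:
            continue
        name, _, value = pair.partition(b"=")
        key = unquote_to_bytes(name.replace(b"+", b" "))
        fields[key] = unquote_to_bytes(value.replace(b"+", b" "))
    return fields


def guess_encoding(raw_query: bytes, candidates=("utf-8", "iso-8859-1")):
    """Return (encoding, confidence); the confidence values are arbitrary."""
    fields = raw_fields(raw_query)

    # Source 1: the raw value of the _charset_ field, if it names a known codec.
    declared = fields.get(b"_charset_")
    if declared:
        try:
            name = declared.decode("ascii").lower()
            codecs.lookup(name)
            return name, 0.9
        except (LookupError, ValueError):
            pass  # unknown or malformed name: fall through to detection

    # Source 2: crude detection, taking the first candidate that decodes cleanly.
    for encoding in candidates:
        try:
            for value in fields.values():
                value.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding, 0.5

    return candidates[-1], 0.0
```

An application following the scheme would call something like this first, pass the result to the encoding setter, and only then touch the request arrays.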
One possible problem with this approach was pointed out by Rasmus. Someone could try to inject bogus data into the request, so that when the app accesses a request array for the first time, the bogus data triggers errors in the conversion process. I think we can deal with this issue in a sensible way, and that the pros of our approach outweigh the cons. Note that the decoding of the request has nothing to do with filtering. The job of the filter extension is to validate or sanitize the data, and it has to operate on the results of the request conversion, i.e. Unicode strings.
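To show what the conversion-error concern looks like in practice, here is a small Python sketch of a custom error handler that records the offending bytes and substitutes U+FFFD instead of aborting. It uses Python's own codecs.register_error mechanism purely as an analogy; the handler name log_and_replace, the bad_spans list, and the decode_value helper are hypothetical, and none of this is how PHP's conversion error handling is implemented.

```python
import codecs

# Raw byte spans that failed to convert, kept for later inspection.
bad_spans = []


def log_and_replace(exc):
    """Record the bytes that could not be decoded and emit U+FFFD."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad_spans.append(exc.object[exc.start:exc.end])
    # Resume decoding right after the offending bytes.
    return ("\ufffd", exc.end)


codecs.register_error("log_and_replace", log_and_replace)


def decode_value(raw: bytes, encoding: str) -> str:
    """Decode one request value, routing errors to the custom handler."""
    return raw.decode(encoding, errors="log_and_replace")
```

With a handler along these lines, a request carrying bytes that are invalid in the expected encoding still converts, and the application can decide afterwards whether the recorded spans justify rejecting the request.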
Hope this has been a useful preview of this very important part of PHP 6. Once this functionality is complete, we can finally make the Unicode preview release. Stay tuned.