All the Little Pieces, or TextIterator in PHP 6

» 13 July 2006 » In PHP »

I have been working on Unicode support in PHP for quite a while now, and I figure it is time to start talking about Unicode and I18N in general, and specifically about some of the new features that PHP 6 will be bringing to the table.
First up is the new TextIterator class, a Swiss Army knife for text. The purpose of this class is to provide access to various text units in a generic fashion. Actually, I lied. TextIterator implements ICU’s full boundary analysis API, so what it really gives you are the boundaries between the text units. A slight distinction, but well worth remembering. And what are these units, you might ask?

  • codepoints
  • combining sequences
  • characters (slightly different from combining sequences)
  • words
  • line breaks
  • sentences
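
To make the difference between codepoints and characters a bit more concrete, here is a small sketch. It does not use TextIterator itself: it assumes PHP 7 or later with the intl extension and leans on its IntlBreakIterator class, which is also built on ICU's boundary analysis. The helper name countUnits() is made up for this illustration.

```php
<?php
// Assumption: PHP 7+ with ext/intl; IntlBreakIterator stands in for TextIterator here.

// Hypothetical helper: count the text units a break iterator finds in $text.
function countUnits(IntlBreakIterator $it, string $text): int
{
    $it->setText($text);
    // false = ignore the iterator's keys and just collect the pieces
    return count(iterator_to_array($it->getPartsIterator(), false));
}

// "e" followed by U+0301 COMBINING ACUTE ACCENT:
// two codepoints that render as a single accented character.
$text = "e\u{0301}";

$codepoints = countUnits(IntlBreakIterator::createCodePointInstance(), $text);
$characters = countUnits(IntlBreakIterator::createCharacterInstance(), $text);

printf("codepoints: %d, characters: %d\n", $codepoints, $characters);
```

The same string gives a different count depending on which kind of boundary you ask for, which is exactly why the unit type is a parameter.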

As its name tells you, TextIterator also implements PHP’s Iterator interface and thus can be used in such constructs as foreach(). As a quick example, let’s go through a string and extract all words contained in it (skipping the whitespace pieces between them). Using foreach() it is as simple as:


$str = "The quick brown fox jumped over the lazy dog.";
foreach (new TextIterator($str, TextIterator::WORD) as $num => $word) {
    if ($word[0] != " ") {
        printf("%d. %s\n", $num, $word);
    }
}


The result is:

0. The
2. quick
4. brown
6. fox
8. jumped
10. over
12. the
14. lazy
16. dog
17. .
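
If you want to try the same loop on a PHP build that does not have TextIterator, here is a rough equivalent. Treat it as a sketch under assumptions: it uses IntlBreakIterator from the intl extension, and its getPartsIterator() method to get the pieces between boundaries rather than the boundaries themselves. The 'en_US' locale is just a choice made for the example.

```php
<?php
// Assumption: ext/intl is available; IntlBreakIterator is used in place of TextIterator.
$str = "The quick brown fox jumped over the lazy dog.";

$it = IntlBreakIterator::createWordInstance('en_US');
$it->setText($str);

$words = [];
// The parts iterator yields the text between consecutive boundaries,
// keyed sequentially by default.
foreach ($it->getPartsIterator() as $num => $word) {
    if (trim($word) !== '') {   // skip the whitespace-only pieces
        $words[$num] = $word;
        printf("%d. %s\n", $num, $word);
    }
}
```

Checking with trim() instead of looking at the first character is a small design choice: it also skips pieces made of tabs or newlines.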


Doing the same thing without foreach() is a bit more involved, but also more flexible. We’ll print out the words along with their boundaries’ offsets.


$it = new TextIterator($str, TextIterator::WORD);
$start = $it->first();
for ($end = $it->next(); $end != TextIterator::DONE; $start = $end, $end = $it->next()) {
    if ($str[$start] != " ") {
        printf("[%2d..%2d]  %s\n", $start, $end, substr($str, $start, $end-$start));
    }
}


And the result here:

[ 0.. 3]  The
[ 4.. 9]  quick
[10..15]  brown
[16..19]  fox
[20..26]  jumped
[27..31]  over
[32..35]  the
[36..40]  lazy
[41..44]  dog
[44..45]  .
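
The same boundary walk can be sketched with the intl extension's IntlBreakIterator, assuming it is installed. One caveat that is my assumption to flag: IntlBreakIterator reports byte offsets into the UTF-8 string, which only coincide with character offsets here because the sample text is plain ASCII.

```php
<?php
// Assumption: ext/intl is available; offsets are UTF-8 byte offsets.
$str = "The quick brown fox jumped over the lazy dog.";

$it = IntlBreakIterator::createWordInstance('en_US');
$it->setText($str);

$spans = [];
$start = $it->first();
// next() returns the next boundary offset, or DONE once the end has been passed.
for ($end = $it->next(); $end !== IntlBreakIterator::DONE; $start = $end, $end = $it->next()) {
    if ($str[$start] !== ' ') {
        $spans[] = [$start, $end];
        printf("[%2d..%2d]  %s\n", $start, $end, substr($str, $start, $end - $start));
    }
}
```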

One thing worth mentioning is that, at least for now, accessing random offsets in Unicode strings is somewhat slower than in binary strings. So the foreach() approach ends up being faster and is the recommended way of accessing text units in a linear fashion.
What else can we do with boundary analysis? At any point we can retrieve the text element at the current boundary with the current() method. Continuing the example:


$it->first();
$word = $it->current();

will give you “The”. We can move backward with the previous() method:


$it->last(); // positions the iterator beyond the last character
$it->previous(); // moves back to the boundary before the current one
$word = $it->current();

gives you “.”, which is the last word in the text. If you want to move through multiple boundaries in the same call, just pass that number to next() or previous():


$it->first();
$it->next(4); // skip the first 4 boundaries and stop
$word = $it->current();

gives you “brown”. You can check whether a certain offset is a boundary or not with isBoundary():


var_dump($it->isBoundary(10)); // true since 'brown' is at offset 10 and it's a boundary

Two more methods, following() and preceding(), allow you to locate the boundary immediately following or preceding the specified offset. This might be useful for truncating a piece of text with an ellipsis:


$limit = 25; // cut off at 25 chars or before
$offset = $it->preceding($limit);
echo substr($str, 0, $offset), "...\n";

gives “The quick brown fox …”. One more thing to note is that isBoundary(), following() and preceding() actually reposition the iterator to the located boundary.
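
Here is the truncation idea wrapped into a function, as a hedged sketch rather than anything TextIterator ships with. It assumes the intl extension and uses IntlBreakIterator::preceding(); the function name truncateAtWord() is hypothetical, and the limit is counted in bytes, which matches characters only for ASCII text like our sample.

```php
<?php
// Assumption: ext/intl is available; truncateAtWord() is a made-up name.
function truncateAtWord(string $text, int $limit): string
{
    if (strlen($text) <= $limit) {
        return $text;   // short enough, nothing to cut
    }
    $it = IntlBreakIterator::createWordInstance('en_US');
    $it->setText($text);
    // Last word boundary before $limit; preceding() also moves the iterator there.
    $offset = $it->preceding($limit);
    // Guard against DONE (-1), e.g. when $limit is 0.
    $offset = max(0, $offset);
    return substr($text, 0, $offset) . '...';
}

echo truncateAtWord("The quick brown fox jumped over the lazy dog.", 25), "\n";
```

Because the cut always lands on a word boundary, it never splits a word or a multibyte character in half.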
TextIterator has a counterpart that does everything (well, almost) in reverse. It’s called, wait for it... ReverseTextIterator. It has the exact same API and can be used transparently where needed:


foreach (new ReverseTextIterator($str, TextIterator::WORD) as $num => $word) {
    if ($word[0] != " ") {
        printf("%d. %s\n", $num, $word);
    }
}


The result here is:

0. .
1. dog
3. lazy
5. the
7. over
9. jumped
11. fox
13. brown
15. quick
17. The

Last but not least, if you are really lazy and just want to get all the text pieces defined by the boundaries, TextIterator provides a convenient getAll() method:


$it = new TextIterator($str, TextIterator::WORD);
print_r($it->getAll());

With the expected result of:

Array
(
[0] => The
[1] =>
[2] => quick
[3] =>
[4] => brown
[5] =>
[6] => fox
[7] =>
[8] => jumped
[9] =>
[10] => over
[11] =>
[12] => the
[13] =>
[14] => lazy
[15] =>
[16] => dog
[17] => .
)

Performance was an important consideration in designing TextIterator. It uses a few optimization tricks internally that make it much faster than using the offset operator, substr(), or even word boundaries in regular expressions.
Hopefully, this has been a useful preview of an important new piece of functionality in PHP 6. Stay tuned for more to come.
