April 2006 » Archive » Andrei Zmievski

Archive > April 2006

PHP-GTK Book is Out

30 April 2006 » In Books, PHP » No Comments

Well, I certainly never imagined back when I was starting work on PHP-GTK that one day there would be a 400-page book about it. Pro PHP-GTK by Scott Mattocks is the first English-language book on this topic and it is on the bookshelves, real and virtual, right now. I just wanted to say kudos to Scott, and congrats on his newborn as well. He’s very productive. 🙂

Notes from php|tek 2006

29 April 2006 » In Talks » 5 Comments

I was in Orlando this past week at the php|tek conference, put on by php|architect. I gave two presentations there: PHP 6 and Unicode and Regex Clinic. The slides for both of them are available on my Talks page.
The social highlight of the conference was, perhaps, the dinner outing on Thursday night which culminated in an impromptu speed beer drinking contest. Guess who started it. (hint: name starts with Ras and ends with beer glass slamming on the table)

A Day in the Life of Schmichael

21 April 2006 » In Funny, Work » 1 Comment

Let it be known that Yahoo! engineers are not without a sense of whimsy. Michael Radwin, who is the main engineering manager for my group, has been on paternity leave for the past 3 months. Evidently, his direct reports got somewhat lonely without the attention of the fearless leader and his friendly smile.

Emergency Supplies

21 April 2006 » In Funny » No Comments

Now that’s what I call an emergency supply kit.

Good Old Times

18 April 2006 » In PHP » 9 Comments

I was browsing the MARC archives the other day and decided to see what my very first posting to any of the PHP lists was. It turned out to be this one on the php-general list. And then shortly thereafter I posted on php-dev offering to help with … wait for it … PHP on Windows development. And here’s my first submitted bug, too. <tear up />.
Ah, good old times indeed. Especially considering Zeev’s reply to me. 🙂

Un-probable Sentences

12 April 2006 » In Hacks » 8 Comments

I decided to start a section on Language and Linguistics, since it’s one of my passions and I am, after all, pursuing a graduate degree in it. So, I will be posting some interesting tidbits and such from the classes, the Web, and my own experiments.
This semester’s class is Computers and Written Language. It basically deals with introductory computational linguistics. Last week we covered n-gram language models, which are statistical models of word sequences. They are called n-gram because n-1 previous words are used to predict the probability of the next one. Such models are useful for a variety of tasks, including speech recognition (“The sign says key pout” vs. “The sign says keep out”), handwriting recognition, spellchecking, document identification, etc.
The programming assignment we had from the class required us to build a trigram (n=3) model of a given corpus of text. This involves counting occurrences of each trigram and calculating the probability of the final word following the two preceding ones. For example, the probability of “see” following “want to” can be calculated as:

P(see|want to) = C(want to see) / C(want to)

That is, the probability of “see” given “want to” is the number of times we’ve seen the trigram “want to see” divided by the number of times we’ve seen the bigram “want to”, and it turns out to be low, since “want to” can be followed by many different verbs. P(tonic|gin and), on the other hand, is much higher. You also want to take sentence boundaries into account, since “I” is very likely to begin a sentence in a fiction corpus, but not so much in a financial one.
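To make the counting concrete, here is a minimal sketch in Python (not the Ruby code from the assignment). The tokenizer is deliberately naive, and the `<s>` and `</s>` boundary markers, along with every function name, are assumptions made for illustration. Bigrams are counted only as contexts, that is, as word pairs that are followed by something, so the probabilities for each context sum to 1.

```python
import re
from collections import Counter

# Hypothetical sentence-boundary markers; any tokens that cannot occur
# as real words would do.
START, END = "<s>", "</s>"

def split_sentences(text):
    """Yield each sentence as a list of lowercase words (naive tokenizer)."""
    for chunk in re.split(r"[.!?]+", text):
        words = re.findall(r"[a-z']+", chunk.lower())
        if words:
            yield words

def count_ngrams(text):
    """Count trigrams and the two-word contexts that precede their last word."""
    trigrams, contexts = Counter(), Counter()
    for words in split_sentences(text):
        # Two START markers let the first real word have a full context.
        padded = [START, START] + words + [END]
        for w1, w2, w3 in zip(padded, padded[1:], padded[2:]):
            trigrams[(w1, w2, w3)] += 1
            contexts[(w1, w2)] += 1
    return trigrams, contexts

def probability(trigrams, contexts, w1, w2, w3):
    """P(w3 | w1 w2) = C(w1 w2 w3) / C(w1 w2), or 0.0 for an unseen context."""
    seen = contexts[(w1, w2)]
    return trigrams[(w1, w2, w3)] / seen if seen else 0.0

corpus = "I want to see. I want to go. I want to see."
trigrams, contexts = count_ngrams(corpus)
p_see = probability(trigrams, contexts, "want", "to", "see")
p_first_i = probability(trigrams, contexts, START, START, "i")
```

On this toy corpus “want to” occurs three times and is followed by “see” twice, so `p_see` is 2/3, while `p_first_i` is 1.0 because every sentence begins with “I”.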
So the idea is: read corpus, tokenize, count, calculate probabilities. Probably 30-40 lines of code in a language like PHP or Ruby (which is what I used, just for fun). Once I was done though, I thought, well, I have this nice trigram model, what else can I do with it? Ah, apply it in reverse, to generate sentences!
This is also a fairly simple task. Take a pair of words, then look in the list of the words that can possibly follow them, as learned from the corpus, pick a probability at random and use it to make a selection from the list. Shift the sequence, so that the last word becomes next to last and the current one becomes last, rinse, repeat. The whole process is basically a Markov chain. I added some heuristics for comma insertion, a couple of controls, and called the resulting generator furby because it reminded me of that weird little toy from a few years ago that would sit there, absorb the sounds of the outside world, and regurgitate them back in a mangled, but eerily recognizable manner.
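The generation loop can be sketched roughly as follows. This is an illustration in Python, not furby itself: the function names, the boundary markers, and the `max_words` cap are assumptions, and the comma heuristics are left out. Each step draws the next word in proportion to how often it followed the current pair in the corpus, then shifts the pair along by one word.

```python
import random
from collections import Counter, defaultdict

# Hypothetical sentence-boundary markers.
START, END = "<s>", "</s>"

def build_followers(sentences):
    """Map each word pair to a Counter of the words seen right after it."""
    followers = defaultdict(Counter)
    for words in sentences:
        padded = [START, START] + list(words) + [END]
        for w1, w2, w3 in zip(padded, padded[1:], padded[2:]):
            followers[(w1, w2)][w3] += 1
    return followers

def generate(followers, rng=random, max_words=30):
    """Walk the chain from the sentence start until END or max_words."""
    w1, w2 = START, START
    output = []
    while len(output) < max_words:
        options = followers.get((w1, w2))
        if not options:
            break  # pair never seen as a context; nothing to pick from
        candidates = list(options)
        weights = list(options.values())
        # Weighted random pick: frequent followers are chosen more often.
        next_word = rng.choices(candidates, weights=weights, k=1)[0]
        if next_word == END:
            break
        output.append(next_word)
        # Shift the window: the last word becomes next to last.
        w1, w2 = w2, next_word
    return " ".join(output)

followers = build_followers([["the", "sign", "says", "keep", "out"]])
sentence = generate(followers, random.Random(0))
```

With a single training sentence there is only one path through the chain, so the generator can only reproduce that sentence. The novel sentences appear once many sentences share word pairs and the chain can hop from one to another.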
So what kind of sentences did I obtain? Let me quote good old Chomsky first:

The notion “probability of a sentence” is an entirely useless one… — Noam Chomsky, 1969

I am not going to argue against his statement here, but I will apply it for my own purposes. You see, the sentences that furby generates are not improbable. They are… un-probable. Sometimes they are poetry, sometimes they are normal sentences you’d find in a book, but mostly they feel like someone who knows English as a second language had a hit of LSD and was asked to write down his thoughts. It’s English, with a big dollop of whoa-a-ah.
I got a few texts from the Project Gutenberg site and fed them to furby. Everything from Sherlock Holmes stories to Alice in Wonderland to Robinson Crusoe. Here are some samples of what it spit out:

“I never have had a considerable household, he murmured.” (sane)
“I remember most vividly, three smashed bicycles in a fury of misery.” (poetry)
“He put his lips tight, and I wrote to the suspicion that the things had been shattered by his eager face.” (LSD)

The cool thing is that the results are in the style of the original text. Here are a couple generated from Twain’s Huckleberry Finn:

“There was them kind of a whoop nowheres.”
“You know bout dat chile stannin mos right in the night-time, sometimes another.”

Note that these are original sentences that do not occur in the texts. It was a lot of fun just running furby over and over again and seeing what it would come up with. But why not mix two authors? I tried a couple, but the best combination seemed to be DH Lawrence’s Sons and Lovers and the aforementioned Huckleberry Finn. Once it sucked in this unlikely duet, furby decided to become a comedian with a streak of soft-core pornography. Here are some gems:

“She wanted him and a half a sovereign.”
“Goodness man don’t be a fine woman.”
“Her mouth to begin working, till pretty late to-night.”
“She heard him buy threepennyworth of hot-cross buns, he talked to barmaids, to almost any woman whom he felt.”
“He shoved his muzzle in the wet.”
“Joking, laughing with their shafts lying idle on the downward track.”
“As the lads enjoyed it when i realised that she was warm, on the pavement then Dawes then Clara.”
“She had never been shaved.”
“He lay pressed hard against her and the electric light vanished, and I saw the wrist and the coconut, and shook her head.”
“She could think of the body as it were, prowling abroad.”
“The three brothers sat with his finger-tips.”
“Eh, dear, if i’m a trying to get as drunk as a bubble of foam.”

Priceless.
My next goal is to feed it php-general archives and see if furby can be more intelligent than most of the postings on that list. Stay tuned.

Notes from PHP Québec 2006

03 April 2006 » In Food, PHP, Talks » 1 Comment

I just got back from Montreal where I gave two talks at the PHP Québec conference: one on PHP 6 and Unicode and another on PHP-GTK 2. Both of my sessions were full and I got very positive comments from the attendees. I think I am getting close to figuring out the right proportions of theory, examples, and demos that should be present in a talk.
For PHP-GTK 2, I showed off a couple of apps that I quickly wrote a few days before and that make use of Yahoo! developer APIs. The first one lets you pick two airports and calculates the distance between them, as well as showing the local maps and weather info. The second one uses the Flickr API to display a continuous grid of the latest images from flickr.com. These were about 100-200 lines of code total each, so, of course, Rasmus had to brand them as Pidgets.
The conference itself was well organized and attended and had a number of interesting talks. Kudos to Sylvan, Yann, and others for their efforts.
The post-conference program for each day was full as well. On Thursday night we had a dinner for speakers and organizers at the always excellent À la Decouverte, a small and cozy restaurant with a big taste. I had marinated snails with mushrooms in garlic butter and phyllo dough for an appetizer, and ostrich medallions in blueberry sauce for the main course, and both were delicious.
On Friday night we made a visit to the ever-popular Les Deux Pierrots, which is hard to describe to someone who’s never been there before. Two bands alternate on the stage and play anything from popular rock tunes (think Take Me Out) to French camping songs to something resembling a hoedown. The level of energy is amazing, and you can’t help being pulled into the manic foot-stomping/hand-clapping atmosphere. Great place to let off steam, basically.
And Saturday morning found us at the Sucrerie de la Montagne, a “sugar shack” outside Montreal that lets you take a peek into the process of obtaining and making maple syrup and also manages to feed hundreds of visitors an hour at the rustic wooden tables in its giant restaurant. The rule of thumb is, you have a big (> 1 liter) bottle of syrup on the table and it has to be gone by the end of the meal. So you put maple syrup on and into everything: bread, pea soup, omelette, sausages, meat pie, mashed potatoes, pancakes, and coffee. We almost managed to finish ours.
Pics should be coming up soon.
