Un-probable Sentences

12 April 2006 » In Hacks » 8 Comments

I decided to start a section on Language and Linguistics, since it’s one of my passions and I am, after all, pursuing a graduate degree in it. So, I will be posting some interesting tidbits and such from the classes, the Web, and my own experiments.

This semester’s class is Computers and Written Language. It basically deals with introductory computational linguistics. Last week we covered n-gram language models, which are statistical models of word sequences. They are called n-gram because n-1 previous words are used to predict the probability of the next one. Such models are useful for a variety of tasks, including speech recognition (“The sign says key pout” vs. “The sign says keep out”), handwriting recognition, spellchecking, document identification, etc.

The programming assignment we had from the class required us to build a trigram (n=3) model of a given corpus of text. This involves counting occurrences of each trigram and calculating the probability of the final word following the two preceding ones. For example, the probability of see following want to can be calculated as:

P(see|want to) = C(want to see) / C(want to)

That is, the probability of see given want to is the number of times we’ve seen the trigram want to see divided by the number of times we’ve seen the bigram want to, and it turns out to be low, since want to can be followed by many different verbs. P(tonic|gin and), on the other hand, is much higher. You also want to take sentence boundaries into account, since I is very likely to begin a sentence in a fiction corpus, but not so much in a financial one.
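To make the counting concrete, here is a minimal sketch in Python (not the Ruby code from the assignment). The tokenizer is just a lowercase whitespace split, and the `<s>` and `</s>` boundary markers, the function names, and the toy corpus are all assumptions made up for illustration. Note that the bigram count here only counts a pair when something follows it, which is what makes the probabilities for a given context add up to 1.

```python
from collections import Counter

# Hypothetical sentence-boundary markers; any tokens that cannot
# occur in the corpus itself would do.
BOS, EOS = "<s>", "</s>"


def tokenize(sentence):
    # Simplistic tokenizer (an assumption): lowercase, split on whitespace.
    return sentence.lower().split()


def count_ngrams(sentences):
    """Count trigrams and the two-word contexts that precede a word."""
    trigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        # Pad with two start markers so the first real word has a context.
        words = [BOS, BOS] + tokenize(sentence) + [EOS]
        for i in range(len(words) - 2):
            w1, w2, w3 = words[i:i + 3]
            trigrams[(w1, w2, w3)] += 1
            bigrams[(w1, w2)] += 1
    return trigrams, bigrams


def probability(word, context, trigrams, bigrams):
    """P(word | context) = C(context + word) / C(context)."""
    seen = bigrams[context]
    if seen == 0:
        return 0.0  # context never observed in the corpus
    return trigrams[context + (word,)] / seen


# Toy corpus, invented for this example.
toy_corpus = [
    "I want to see a movie",
    "I want to eat",
    "We want to see",
    "Gin and tonic",
]
toy_trigrams, toy_bigrams = count_ngrams(toy_corpus)
print(probability("see", ("want", "to"), toy_trigrams, toy_bigrams))
print(probability("tonic", ("gin", "and"), toy_trigrams, toy_bigrams))
print(probability("i", (BOS, BOS), toy_trigrams, toy_bigrams))
```

The last call asks how likely I is to begin a sentence, which is where the start markers earn their keep. A real model would also need smoothing for unseen trigrams, which this sketch leaves out.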

So the idea is: read corpus, tokenize, count, calculate probabilities. Probably 30-40 lines of code in a language like PHP or Ruby (which is what I used, just for fun). Once I was done though, I thought, well, I have this nice trigram model, what else can I do with it? Ah, apply it in reverse, to generate sentences!

This is also a fairly simple task. Take a pair of words, look up the list of words that can follow them, as learned from the corpus, pick a probability at random, and use it to make a selection from the list. Shift the sequence, so that the last word becomes next to last and the new one becomes last; rinse, repeat. The whole process is basically a Markov chain. I added some heuristics for comma insertion, a couple of controls, and called the resulting generator furby because it reminded me of that weird little toy from a few years ago that would sit there, absorb the sounds of the outside world, and regurgitate them back in a mangled but eerily recognizable manner.
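The generation loop can be sketched roughly like this, again in Python rather than the original Ruby. This is not furby itself: the comma heuristics and controls are left out, and the names `build_followers` and `generate`, the boundary markers, and the `max_words` cap are assumptions for illustration. Picking a follower in proportion to its count is done with `random.choices`, which amounts to the same thing as picking a probability at random and walking the list.

```python
import random
from collections import Counter, defaultdict

# Hypothetical sentence-boundary markers (an assumption of this sketch).
BOS, EOS = "<s>", "</s>"


def build_followers(sentences):
    """Map each pair of words to a Counter of the words seen after it."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        words = [BOS, BOS] + sentence.split() + [EOS]
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            followers[(w1, w2)][w3] += 1
    return followers


def generate(followers, rng=random, max_words=30):
    """Walk the Markov chain from the sentence start until EOS or the cap."""
    context = (BOS, BOS)
    output = []
    while len(output) < max_words:
        candidates = followers.get(context)
        if not candidates:
            break  # dead end: this pair was never followed by anything
        words = list(candidates)
        counts = list(candidates.values())
        # Weighted random selection: frequent followers win more often.
        word = rng.choices(words, weights=counts, k=1)[0]
        if word == EOS:
            break
        output.append(word)
        # Shift: the last word becomes next to last, the new one becomes last.
        context = (context[1], word)
    return " ".join(output)


# Toy corpus, invented for this example.
toy_followers = build_followers([
    "she wanted him to stay",
    "she wanted a fine woman",
    "he wanted him to go",
])
print(generate(toy_followers))
```

Passing a seeded `random.Random` instance as `rng` makes a run repeatable, which is handy when a particularly good sentence shows up and you want it back.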

So what kind of sentences did I obtain? Let me quote good old Chomsky first:

The notion “probability of a sentence” is an entirely useless one… — Noam Chomsky, 1969

I am not going to argue against his statement here, but I will apply it for my own purposes. You see, the sentences that furby generates are not improbable. They are… un-probable. Sometimes they are poetry, sometimes they are normal sentences you’d find in a book, but mostly they feel like someone who knows English as a second language had a hit of LSD and was asked to write down his thoughts. It’s English, with a big dollop of whoa-a-ah.

I got a few texts from the Project Gutenberg site and fed them to furby. Everything from Sherlock Holmes stories to Alice in Wonderland to Robinson Crusoe. Here are some samples of what it spit out:

“I never have had a considerable household, he murmured.” (sane)
“I remember most vividly, three smashed bicycles in a fury of misery.” (poetry)
“He put his lips tight, and I wrote to the suspicion that the things had been shattered by his eager face.” (LSD)

The cool thing is that the results are in the style of the original text. Here are a couple generated from Twain’s Huckleberry Finn:

“There was them kind of a whoop nowheres.”
“You know bout dat chile stannin mos right in the night-time, sometimes another.”

Note that these are original sentences that do not occur in the texts. It was a lot of fun just running furby over and over again and seeing what it would come up with. But why not mix two authors? I tried a couple, but the best combination seemed to be DH Lawrence’s Sons and Lovers and the aforementioned Huckleberry Finn. Once it sucked in this unlikely duet, furby decided to become a comedian with a streak of soft-core pornography. Here are some gems:

“She wanted him and a half a sovereign.”
“Goodness man don’t be a fine woman.”
“Her mouth to begin working, till pretty late to-night.”
“She heard him buy threepennyworth of hot-cross buns, he talked to barmaids, to almost any woman whom he felt.”
“He shoved his muzzle in the wet.”
“Joking, laughing with their shafts lying idle on the downward track.”
“As the lads enjoyed it when i realised that she was warm, on the pavement then Dawes then Clara.”
“She had never been shaved.”
“He lay pressed hard against her and the electric light vanished, and I saw the wrist and the coconut, and shook her head.”
“She could think of the body as it were, prowling abroad.”
“The three brothers sat with his finger-tips.”
“Eh, dear, if i’m a trying to get as drunk as a bubble of foam.”


My next goal is to feed it php-general archives and see if furby can be more intelligent than most of the postings on that list. Stay tuned.

Notes from PHP Québec 2006

03 April 2006 » In Food, PHP, Talks » 1 Comment

I just got back from Montreal where I gave two talks at the PHP Québec conference: one on PHP 6 and Unicode and another on PHP-GTK 2. Both of my sessions were full and I got very positive comments from the attendees. I think I am getting close to figuring out the right proportions of theory, examples, and demos that should be present in a talk.

For PHP-GTK 2, I showed off a couple of apps that I quickly wrote a few days before and that make use of Yahoo! developer APIs. The first one lets you pick two airports and calculates the distance between them, as well as showing the local maps and weather info. The second one uses the Flickr API to display a continuous grid of the latest images from flickr.com. These were about 100-200 lines of code each, so, of course, Rasmus had to brand them as Pidgets.

The conference itself was well organized and attended and had a number of interesting talks. Kudos to Sylvan, Yann, and others for their efforts.

The post-conference program for each day was full as well. On Thursday night we had a dinner for speakers and organizers at the always excellent À la Decouverte, a small and cozy restaurant with a big taste. I had marinated snails with mushrooms in garlic butter and phyllo dough for an appetizer, and ostrich medallions in blueberry sauce for the main course, and both were delicious.

On Friday night we made a visit to the ever popular Les Deux Pierrots which is hard to describe to someone who’s never been there before. Two bands alternate on the stage and play anything from popular rock tunes (think Take Me Out) to French camping songs to something resembling a hoedown. The level of energy is amazing, and you can’t help being pulled into the manic foot-stomping/hand-clapping atmosphere. Great place to let off steam, basically.

And Saturday morning found us at the Sucrerie de la Montagne, a “sugar shack” outside Montreal that lets you take a peek into the process of obtaining and making maple syrup and also manages to feed hundreds of visitors an hour at the rustic wooden tables in its giant restaurant. The rule of thumb is, you have a big (> 1 liter) bottle of syrup on the table and it has to be gone by the end of the meal. So you put maple syrup on and into everything: bread, pea soup, omelette, sausages, meat pie, mashed potatoes, pancakes, and coffee. We almost managed to finish ours.

Pics should be coming up soon.

Discover New Music

16 February 2006 » In Tech » 1 Comment

Inspired by the latest entry at Yahoo! Music Blog, I registered at Last.fm and started feeding my playlist information to them via the AudioScrobbler iTunes plugin. My profile there is slowly building and I am looking forward to checking out what their customized radio station will start playing for me once they know enough of my tastes. Their recommendations so far seem to be decent, but I need to listen to about 300 tracks before the music “neighbors” become available. As a side benefit, I also put the RSS feed of the last 10 tracks played on frontpage of this site in the right hand column.

I also explored Pandora, the other service mentioned in the blog. From what little time I spent with Pandora, it seems fascinating. Created by the Music Genome Project, it allows you to specify an artist or a song and create a “radio station” that plays songs that are musically similar to the specified one. What does “similar” mean? Well, the folks at Music Genome Project analyzed over 10,000 songs of different artists and broke them down into traits, such as harmony, rhythm, instrumentation, orchestration, arrangement, lyrics, singing and vocal harmony. Given a song, for example “Volcano” by Damien Rice, Pandora will create a station that features songs with mellow rock instrumentation, folk influences, mild rhythmic syncopation, acoustic sonority, repetitive melodic phrasing, and other similarities to the original, picking things like “Smile” by Mia and Jonah and “Igloo Glass” by Holopaw. You can add more songs or artists to the station and Pandora will try to pick tracks that cover the whole gamut of what you are looking for. You can also thumbs up or down the individual songs to fine tune your preferences and build up a favorites list. I would say its recommendations range from good to very good and I definitely intend to use it as a tool for discovering new music.

SuperBowl XL

06 February 2006 » In Funny » 3 Comments

Since the Patriots missed their chance at the SuperBowl this year, I only had slight interest in the game itself and was mostly looking forward to the commercials. However, those proved to be quite sub-par, except for a couple that were fairly amusing. Here are the links:

There were a couple of AmeriQuest Mortgage ones too, but they don’t seem to be online.

NetflixQueueShuffler Update

12 January 2006 » In Hacks, Movies » Comments Off on NetflixQueueShuffler Update

I upgraded to Firefox 1.5 and found out that my NetflixQueueShuffler GreaseMonkey script no longer worked. So after some digging, I fixed it up and it’s available for your downloading pleasure.

Book Update

12 January 2006 » In Books, Reviews » Comments Off on Book Update

Thought it would be good to mention some memorable books that I have read in the last couple of months. I had Dark Star Safari on hiatus for a long time, but finally finished it a couple of weeks ago. The delay had nothing to do with the quality of the book itself, which gives a detailed and profound account of the Africa of modern times from the point of view of a westerner who is also intimately familiar with it. Paul Theroux spent many years of his youth teaching in Africa and his knowledge of the local people, languages, and customs allows for a much closer conversation with everyone he meets on his epic journey from Cairo to Capetown, be it on a ferry, canoe, or an armed convoy truck. Some might find him a bit crotchety, but I found the book to be a good eye opener on the problems facing Africa — especially sub-Saharan countries — today.

Redemption Ark is a sequel to Alastair Reynolds’ Revelation Space hard-SF space opera. Reynolds is at the top of his game once again, revealing a complex, gripping, and surprisingly insightful story full of awesome imagery and technothriller-like excitement. Looking forward to the conclusion of the series.

I’ve been meaning to read something by Tom Robbins, so I picked up Jitterbug Perfume. I honestly can say this is one of the best books I have ever come across: amazing and amazingly unique characters, a plot that is firmly rooted in the magic realism space, great dialogue, and to top it off, there are genuinely funny moments sprinkled throughout. Robbins is a master of the language; on almost every other page I found sentences and passages that I wanted to highlight and maybe I will do just that on the second reading. Give this one a try: you’ll never think of the beets in quite the same way again.

Started on: Guns, Germs, and Steel by Jared Diamond (another book that’s been on my list for a while), and Oldman’s Guide to Outsmarting Wine by Mark Oldman.

He saves… but does he commit?

06 January 2006 » In Funny » 2 Comments

Looking through the email inbox this morning I saw these headers, which provided a low-yield amusement factor.

Good to know that even the deities have to go through formalities.

Happy New 2006 to y’all by the way!

“They are all ball-shaped!”

06 December 2005 » In Funny » Comments Off on “They are all ball-shaped!”

Just a sample of the kind of emails I sometimes get. I have no idea why they decided that I am a 3D artist or that this is a good way to recruit someone to work on a shareware game without “a huge budget”, but it’s amusing. Good luck with the April 1 release date!

This Is Not “American Idol”

29 November 2005 » In PHP, Rants » 9 Comments

The latest round of discussions on the php-internals mailing list highlights something that has been a pet peeve of mine for a long time. As PHP became more and more popular, the number of people subscribed to the mailing list has grown as well, and lately this has resulted in a slew of interminable threads of will-crushing length. It seems that every time I open my mail reader, the counter next to “php-internals” blinks and jumps to over 100. And roughly half of the messages, if not more, are a) from people I have never heard of, and b) full of opinions, rants, and “votes” on fairly important issues, as in “I’m +2 on this namespace separator”.

A whole lot of these folks are under the impression that one can simply subscribe to the list, read discussions while lurking or semi-lurking, and start to vote on things that affect intimate parts of the language. That is… kind of gall-ish, if you ask me. I have lived in the United States for over 13 years, I pay all my taxes, I respect the law (except for occasional speeding), yet I still cannot vote in either federal or state elections. Whether it’s fair or not is debatable, but at least there is a vetting process in place that requires immigrants to fully adopt this country as their new home before being able to vote.

I appreciate the enthusiasm with which these people partake in the discussions, and I understand that they may have strong opinions on things that PHP does or does not do. However, in order to be taken seriously one has to have a certain amount of respect, experience, “karma” – call it what you will – and that has to be earned.

And how do you earn it? Through concrete participation, be it code contributions, documentation write-ups, bug triage, or just some good ideas that you design and promote in a respectful and polite manner. But to show up, issue forth proclamations on topics that you do not even necessarily understand, and assume that you can influence the course of development through sheer arrogance or grandiose rants is a misguided, if not brazen, attempt at “democracy”. And if your first email to the list ignores the customs and practices of the group, your subsequent ones are likely to be taken less than seriously. First impressions count, you know.

Why not just ignore posts like these, some would say? Because on average, the signal to noise ratio on php-internals is still pretty good, and there are occasional insightful posts from new people that I would like to read. But since they may be buried under an avalanche of superfluous messages, I have to take a deep breath and wade through until I find the worthy ones. And that takes time. Precious, precious time.

To sum up: make a difference, contribute something, think before you post, be polite, and try to consider that yours is not the only opinion out there, especially if you are new to the list.

Oh, if you are using a mail reader that screws up message threading, I will hunt you down and stuff you full of Perl internals until you look like a camel. I will go fucking ninja on you, and you will not see me coming.

PHP Developer’s Meeting 2005

22 November 2005 » In PHP » 1 Comment

It’s been a while since there was a small, focused meeting for the purposes of working out the evolution of the next version of PHP. The last, and only, time was probably in January 2000 when Rasmus, Zeev, Andi, Stig, Sascha, Thies, Frank, myself, and a few others gathered in Tel Aviv to hash out PHP 4. You can see how young we looked back then.

Last week in Paris saw the second iteration of the PDM and this time the focus was on PHP 6. We had a very productive discussion over two days and Derick did a great job taking notes and writing up the report. It has been posted on the internals mailing list, and once 5.1 is out, I think we can concentrate on the implementation of PHP 6, which should be great.
