
Duplicates Detection with ElasticSearch

28 March 2011 » In PHP, Tech » 14 Comments

As I mentioned on Twitter recently, duplicate detection (or de-duping) is one of the most unappreciated problems that the developers of certain types of applications face sooner or later. The applications I’m talking about share two main characteristics: item collection and some sort of social aspect. A prime example of this is Delicious. Users collect items (bookmarks), and, since multiple people can collect the same item (URL), there needs to be duplicate detection, so that it is possible to infer some information from the data, such as how many people saved a certain URL, what are the popular URLs of the day, etc.
In the case of URL bookmarking, de-duping is fairly easy, because usually a URL points to a unique piece of content, so grouping by the URL gives the desired result. There are some complications arising from redirections and URL shorteners, of course, but the general case is not difficult.
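Concretely, the easy case can be as simple as keying on a normalized URL; here is a toy PHP sketch with made-up data:

<?php
// Toy illustration of the easy case: grouping bookmarks by URL to count how
// many people saved each one. The data and the naive normalization are made
// up for illustration; real code would also resolve redirects and shorteners.
$bookmarks = array(
    array('user' => 'alice', 'url' => 'http://example.com/coffee/'),
    array('user' => 'bob',   'url' => 'http://example.com/coffee'),
    array('user' => 'carol', 'url' => 'http://example.com/tea'),
);

$counts = array();
foreach ($bookmarks as $bookmark) {
    $url = rtrim(strtolower($bookmark['url']), '/');  // naive normalization
    $counts[$url] = isset($counts[$url]) ? $counts[$url] + 1 : 1;
}

arsort($counts);  // most-saved URLs first
print_r($counts);
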
De-duping content that does not have a unique identifier for each item is more challenging. We ran into this while working on Mapalong, and I wanted to share the approach that we took to detect duplicates in our data.
Mapalong allows people to bookmark their favorite locations, and a placemark (as we call it) consists of a number of fields such as title, notes, tags, URL, geo coordinates, and a few more. Mapalong is a map-oriented product, and one of the ways to add a placemark is to simply zoom and drag the map to the location you want to add. Predictably, people are not always precise when they add placemarks this way. The fields that are likely to be inconsistent are the title (e.g., The Invisible Dog versus The Invisible Dog Art Center) and the location.
The title varies, because people often just type the name in rather than copy their placemark from a known business listing. This also means they can misspell the title, use different punctuation, or give the place a title that is special to them rather than the canonical one (e.g., My first kiss instead of Mulholland Drive overlook).
The location variance is mostly due to two reasons: people’s recollection of the actual location may be wrong, and the places on the map occupy a certain area, so each person can bookmark it at a different point in that area (such as different entrances to the building, etc.). Extended places like the Google campus or a popular beach can contain multiple placemarks that are equally valid and should be treated as duplicates for some purposes.
I felt that the tags and notes fields could vary too much from one placemark to another to provide adequate matching, and decided to do duplicate detection based only on the title and location. The approach I took relies on ElasticSearch, which is a full-text search engine we use for Mapalong already. It has a powerful and fast query engine and supports geo-based search as well.
Obviously, matching the title exactly, even case-insensitively, was going to get me only halfway, due to the reasons indicated above. To deal with possible spelling errors, punctuation, and missing or extra words, I used the FuzzyLikeThis (FLT) query type, which is a combination of the MoreLikeThis query that retrieves documents similar to the given text and a Levenshtein-based fuzzy search.
To do the initial classification of places into duplicate groups, the basic algorithm is as follows:

  1. Loop through all the places.
  2. Use FLT + geo query to find places that are likely to be duplicates.
  3. Create a duplicate group from the current place and the matches.
  4. Mark these places as done, so that they don’t get added to another group.

Once the groups are built, the subsequent duplicate matching is done only when a user adds a new place; then we simply run the FLT + geo query, determine which duplicate group contains the majority of the results, and add the new place to that group.
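A minimal PHP sketch of that grouping pass might look like the following; find_likely_duplicates() is just a stub standing in for the FLT + geo query shown further down, and the array layout of a place is an assumption for illustration:

<?php
// Sketch of the initial grouping pass. find_likely_duplicates() is a stub
// standing in for the FuzzyLikeThis + geo_distance query shown later in the
// post; the array layout of a place is an assumption for illustration.
function find_likely_duplicates(array $place)
{
    // ...query ElasticSearch with the place title and coordinates...
    return array();
}

// Assumed to be rows of array('id' => ..., 'title' => ..., 'lat' => ..., 'lon' => ...)
$places = array();

$groups = array();  // duplicate groups, keyed by the seed place id
$done   = array();  // place ids that already belong to a group

foreach ($places as $place) {
    if (isset($done[$place['id']])) {
        continue;  // step 4: this place was already assigned to a group
    }

    // Steps 2-3: the current place plus its likely duplicates form one group.
    $group = array($place['id']);
    $done[$place['id']] = true;

    foreach (find_likely_duplicates($place) as $match) {
        if (!isset($done[$match['id']])) {
            $group[]            = $match['id'];
            $done[$match['id']] = true;
        }
    }

    $groups[$place['id']] = $group;
}
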
FuzzyLikeThis has several useful tuning parameters:

max_query_terms
Specifies the maximum number of terms that will be included in the generated query
min_similarity
Excludes the terms that are too dissimilar
prefix_length
Enforces a common prefix on the fuzzified terms (which greatly affects the speed of the Levenshtein algorithm)

I experimented with different values for max_query_terms, but since the majority of the placemarks have titles of three words or less, the best results were obtained with max_query_terms set to 2. The prefix_length was set to 1, and min_similarity was left at the default of 0.5.
The FLT query can be run against multiple fields in the document. Initially, I tried matching the given text against title, notes, and tags, but there were too many false positives. In the end, I limited it to matching only on the title field.
Logically, the duplicates are likely to be fairly close to each other, so I applied the geo-distance filter to the FLT query matches and set the radius to be 50 meters (which might change, subject to testing).
The full FLT query looked like this:

{
    "query": {
        "filtered": {
            "query": {
                "fuzzy_like_this": {
                    "like_text": "Milk Bar",
                    "fields": [
                        "title"
                    ],
                    "max_query_terms": 2,
                    "prefix_length": 1
                }
            },
            "filter": {
                "geo_distance": {
                    "distance": 0.05,
                    "geo": {
                        "lat": 40.677795,
                        "lon": -73.968923
                    }
                }
            }
        }
    },
    "size": 60
}
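
For reference, here is a rough sketch of building and sending that same query from PHP with curl against ElasticSearch's HTTP _search endpoint; the index and type names (mapalong, place) are assumptions:

<?php
// Rough sketch: build the same filtered FLT + geo_distance query as above and
// send it to ElasticSearch over HTTP with curl. The index and type names
// ("mapalong", "place") are assumptions for illustration.
$query = array(
    'query' => array(
        'filtered' => array(
            'query' => array(
                'fuzzy_like_this' => array(
                    'like_text'       => 'Milk Bar',
                    'fields'          => array('title'),
                    'max_query_terms' => 2,
                    'prefix_length'   => 1,
                ),
            ),
            'filter' => array(
                'geo_distance' => array(
                    'distance' => 0.05,  // the 50 m radius mentioned above
                    'geo'      => array('lat' => 40.677795, 'lon' => -73.968923),
                ),
            ),
        ),
    ),
    'size' => 60,
);

$ch = curl_init('http://localhost:9200/mapalong/place/_search');
curl_setopt_array($ch, array(
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => json_encode($query),
    CURLOPT_RETURNTRANSFER => true,
));
$response = json_decode(curl_exec($ch), true);
curl_close($ch);

// Each hit carries the matched document and its relevance score.
foreach ($response['hits']['hits'] as $hit) {
    printf("%s  %.2f\n", $hit['_source']['title'], $hit['_score']);
}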

With these settings, the initial results were encouraging. The grouping seemed to work well, including fuzzy titles, and there were only a few false negatives, such as Grimaldi’s not being grouped with the places named Grimaldi’s Pizzeria. This turned out to be because of an overly aggressive score cutoff; my first take counted only the matches with a score ≥ 1.0, but there were a few cases where valid results had scores below 1.0, so I relaxed the cutoff to ≥ 0.5. With this change, the duplicate groups were 99% accurate, which was good enough for our purposes.
Below are some examples of the groups. The first line in each paragraph is the placemark that was used as the source for matching, and each line in the group shows the place ID, title, match score, and geo coordinates.

31: Gen
    31, Gen, 1.49, [40.677602, -73.96358]
   756, Gen, 1.49, [40.677602, -73.96358]
   141, Restaurant Gen Inc, 0.75, [40.677599, -73.96349]
33: Grimaldi's Pizzeria
    33, Grimaldi's Pizzeria, 1.41, [40.702677, -73.99350]
  1001, Grimaldi's Pizzeria, 1.41, [40.702677, -73.99350]
  2330, Grimaldi's Pizzeria, 1.41, [40.702677, -73.99350]
  1987, Grimaldi's, 0.52, [40.702702, -73.99345]
  2032, Grimaldi's, 0.52, [40.702677, -73.99350]
249: The Cock and Hoop
   249, The Cock and Hoop, 1.41, [52.951072, -1.14407]
  1102, The Cock and Hoop, 1.41, [52.951072, -1.14407]
  1128, The Cock & Hoop, 1.41, [52.950970, -1.14429]
897: Arnolfini bookshop
   897, Arnolfini bookshop, 1.41, [51.449433, -2.59744]
   921, Arnolfini Bookshop, 1.41, [51.449304, -2.59748]
1541: Powell's
  1541, Powell's, 1.50, [45.523022, -122.68136]
  1612, Powell's City of Books, 0.75, [45.522991, -122.68124]
  1895, Powell's City of Books, 0.75, [45.522987, -122.68123]
1542: Rogue Distillery and Pub House
  1542, Rogue Distillery and Pub House, 1.06, [45.525834, -122.68494]
  1597, Rogue Distillery and Public House, 1.06, [45.525848, -122.68504]
  1613, Rogue Distillery and Public House, 1.06, [45.525848, -122.68504]

My script was written in Python with pyes, but it would be very easy to write an equivalent one in PHP using the Elastica library, for example.
This is far from an exhaustive examination of the de-duplication issue, but I hope you can take away some ideas for how to deal with this problem in your product.
Update: @clintongormley correctly points out that max_query_terms actually specifies how many words from the beginning of the input text will be passed to the engine. I guess in our case most of the variation happens at the end of the titles, so passing the first 2 words worked fine. However, it might be better to vary max_query_terms with the length of the title, or just leave it at the default of 25.
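For illustration, varying it with the title length could be as simple as the following sketch (clamped between the 2 used above and the default of 25):

<?php
// Illustration only: pick max_query_terms from the number of words in the
// title, clamped between the 2 used above and the default of 25.
function max_query_terms_for($title)
{
    return min(25, max(2, str_word_count($title)));
}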

Ideas of March

15 March 2011 » In Beer, Me, Opinion, PHP, Tech » 3 Comments

I miss blogging and the conversations that used to go with it. My friend Chris Shiflett probably feels the same way, because he is trying to start a blog revival with his Ideas of March post.
I used to blog much more actively, but with Twitter gaining momentum in early 2008 I found it easier to fire off a quick thought, observation, funny link, or another short piece there rather than put it on my blog. It helped that building up an audience on Twitter was much faster, due to the one-click nature of creating relationships and keeping track of content. I feel that the problem of doing the same for blog conversations has not yet been solved, despite services like Disqus.
Still, I think regular blogs remain relevant and valuable, because they do what the microblogging services cannot. Here are a few reasons why I like blogs:

  • Blog posts usually present a more complete and rounded point of view rather than a quip.
  • Blog posts require time to research, ruminate on, and rewrite before publishing, which leads to higher quality content.
  • They are indexed by real search engines (Google), rather than the half-baked solution Twitter uses, which cannot search more than 2 weeks into the past.
  • Posts can be saved to Instapaper or a similar service.

I want to get back to blogging, but a couple of things are slowing me down. Most of the things that I like to blog about fit into 3 broad areas: tech stuff — coding, architecture, interesting problems and solutions to them, etc.; opinion pieces on various topics; and food & drink. So far I have been mixing them all in one blog, because I am not sure if it’s worth breaking each one out into a separate blog on the main site or even into a different site. I would be really interested to know if people mind reading mixed content like that or if they prefer a more compartmentalized approach. I also want a better design for my blog, because I don’t feel that the current one allows me to present richer content well, like embedding photos, videos, code samples, and so on. Hopefully, I can find a designer who can help me with that.
In the end, really, it’s all about overcoming inertia and actually writing posts rather than just thinking about them. So I promise to write at least 2 blog posts before the end of March and to resolve the aforementioned issues soon, to help me blog more.
The blogs are dead. Long live the blogs.

PHP UK 2011

26 February 2011 » In PHP, Talks » 1 Comment

Becoming an American citizen a couple of years ago greatly eased travel-related headaches for me – no more visas to worry about (in general), way faster processing at passport control, and so forth. Because of this, I was finally able to make it to the PHP UK conference this year, and I have to say that I enjoyed it very much. The organizers did a great job with the venue, the social events, food, drinks, making sure things ran on time, and a host of other things that I’m sure I didn’t even notice. A really nice touch was the badges, which had the name and the session schedule printed on both sides. It was also great to meet a bunch of new people, including some that I only knew from the mailing lists or Twitter. This was probably one of the best community-organized conferences that I’ve been to.
I gave a talk on ElasticSearch, which is part of the technology stack behind Mapalong. This was the first incarnation of the talk, but I thought it went fairly well, though I did run overtime slightly and couldn’t do all the demos that I wanted to. The slides are available on the Talks page.

The Prime That Wasn't

03 August 2010 » In PHP, Tech » 36 Comments

I’ve seen a lot of problems solved with regular expressions, but last Friday, thanks to Chris and Sean, I found out that one can use them to determine whether a given integer is a prime number. The original articles showed the following regular expression:

/^1?$|^(11+?)\1+$/

You don’t apply this to the integer itself. Instead, you create a string of repeating 1s, where the number of 1s is defined by the number you’re testing, and apply the regex to that. If it fails to match, the number is a prime. The regex relies on backtracking and thus would not work with a DFA-based engine, such as PHP’s (now deprecated) ereg* functions. But, it works just fine with the preg_* ones, or so I thought (more on that in a bit).
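In PHP, that check is a one-liner around preg_match(); a quick sketch:

<?php
// Regex-based primality "test": N is prime if a string of N ones fails to
// match the pattern. preg_match() returns 1 on a match, 0 on no match, and
// false on error, so any falsy result is reported as prime here.
function is_prime_regex($n)
{
    return !preg_match('/^1?$|^(11+?)\1+$/', str_repeat('1', $n));
}

foreach (range(0, 20) as $n) {
    printf("%2d => %s\n", $n, is_prime_regex($n) ? 'prime' : 'not prime');
}
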
So, how does it work? It may seem like a brain buster, but it’s actually pretty simple. Ignore the portion before the alternation operator (|), because it’s just a short-circuit case for checking whether the input string is empty or has a single 1. (If the string is empty, it means the number being tested is 0. If the string is a single 1, it means the number being tested is 1. Neither of these are prime.) The portion after the pipe is where the real magic happens.
The (11+?) sub-pattern matches strings 11, 111, etc. The \1+ matches whatever the sub-pattern matched, one or more times. So on the first try, the engine will match 11 and then attempt to match the same thing one or more times in the rest of the string. If it succeeds, the number is not a prime. Why? Because it just proved that the length of the string is divisible by 2 (the length of 11), therefore the original number is divisible by 2. If the overall match fails, the engine backtracks to the beginning and tries to match 111 two or more times and so on successively. If the first sub-pattern gets long enough (n/2 basically), and the overall match still fails, the number is a prime. Beautiful, isn’t it?
Coincidentally, Sean recently created a code evaluation plugin for the Phergie-based IRC bot that runs in the channel we hang out in. The plugin is a simple proxy to ideone.com, but is helpful for doing quick code tests. We had fun experimenting with this regex pattern implemented in a PHP function that returns the next largest prime number after a given one. The trouble started when Sean fed it 99999999, and the code spit out 100000001. This didn’t seem right, and Wolfram Alpha confirmed; the answer we got was not a prime. (It factors into 17 * 5882353.)
A few similar tests also gave us numbers that were not prime. But, where was the problem? The PHP code was too simple to have a bug, many answers were correct, and the regex algorithm itself was sound. It was time for brute force. I quickly wrote some code to feed increasing odd integers into the regex pattern and check its answers against the normal sieve algorithm to see where it starts failing. The first number to fail was 22201; the regex said it was prime, but it’s actually a perfect square (149²). Past that, the failure rate increased.
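A brute-force cross-check along those lines might look like this sketch, with plain trial division standing in for the sieve:

<?php
// Sketch of a brute-force cross-check: compare the regex answer against plain
// trial division and stop at the first disagreement (22201 at the default
// settings, per the text above).
function is_prime_regex($n)
{
    return !preg_match('/^1?$|^(11+?)\1+$/', str_repeat('1', $n));
}

function is_prime_trial($n)
{
    if ($n < 2) {
        return false;
    }
    for ($i = 2; $i * $i <= $n; $i++) {
        if ($n % $i === 0) {
            return false;
        }
    }
    return true;
}

for ($n = 3; $n < 30000; $n += 2) {
    if (is_prime_regex($n) !== is_prime_trial($n)) {
        echo "First disagreement at $n\n";
        break;
    }
}
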
It then dawned on me that the culprit might be the backtracking itself, specifically how it was implemented in PCRE, the library at the heart of PHP’s regular expressions. As I mention in my Regex Clinic talk, unbounded backtracking can dramatically slow down the matching and should be well-understood before attempting to write complex patterns. To manage the danger of runaway patterns, PCRE implemented the pcre.backtrack_limit setting a few years ago. In our case, backtracking is used to break up the string of 1s into successively larger chunks, and, with very long strings, the engine may run into this backtracking limit, which is set to 100000 by default. My guess was that with the 22201-character-long string, the default was not enough. Once the limit is reached, the match fails, and the number is declared prime.
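One way to confirm that the limit, rather than a genuine non-match, is to blame is to inspect preg_last_error() after the match attempt; a minimal sketch:

<?php
// When PCRE gives up because of backtracking, preg_match() returns false
// rather than 0, and preg_last_error() reports the reason. Per the text,
// a string of 22201 ones is enough to hit the default limit of 100000.
$ones   = str_repeat('1', 22201);  // 22201 = 149^2, so the pattern should match
$result = preg_match('/^1?$|^(11+?)\1+$/', $ones);

if ($result === false && preg_last_error() === PREG_BACKTRACK_LIMIT_ERROR) {
    echo "Hit pcre.backtrack_limit: the 'prime' verdict is meaningless\n";
} elseif ($result === 1) {
    echo "Composite\n";
} else {
    echo "Prime\n";
}
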
I bumped the limit up to 200000, and voilà, 22201 was no longer declared prime. To fix the 100000001 match, the limit had to be raised dramatically, to around 250000000! And, the program took 14 seconds to deliver the verdict on my new i5 MacBook Pro. Needless to say, don’t use this prime determination algorithm for any sort of real-life stuff. Instead, just appreciate it for its elegance. I certainly did, and was glad that my bit of investigative work showed that abstract, pure, good ideas can still fail in the messy world of software and hardware.

Time Machine Forensics

01 July 2010 » In Other, PHP » 2 Comments

About a year ago, I finally wised up to the obvious: 1) losing data sucks, and 2) doing daily (or frequent) backups manually is a pain in the ass, especially for portables. 2) is why most people forgo backups in the first place, resulting in 1). I needed something that I could set up once and pretty much forget about. Thankfully, there was just the thing: Time Capsule, a single device that encapsulates both an Airport Extreme base station and a huge hard drive, and can be used for automatic, transparent, wireless backups from any of your Mac devices via Time Machine.
I purchased the 1 TB version, configured it, and have been using it ever since. It has already saved my butt a couple of times, when I accidentally deleted (rm -rf) a directory and had that familiar sinking feeling, but then remembered that the Time Machine was there ready to lend a hand.
However, I noticed recently that the hourly backups were taking longer and growing in size, sometimes up to 1GB or more. It didn’t seem like there was that much data changing on an hourly basis, so I set out to investigate. A quick look around showed that Time Machine itself will not reveal the backup details, but someone wrote a script called timedog that displays the files that Time Machine backed up during its most recent run (or any run of your choosing). The output of the script looks something like this (abbreviated):

# cd /Volumes/Backup\ of\ foo/Backups.backupdb/foo
# timedog -d 5 -l
==> Comparing TM backup 2010-06-29-160846 to 2010-06-29-101853
     399B->     399B    [1] /Mac HD/Library/Application Support/CrashReporter/
[..skip..]
   52.6KB->   53.0KB        /Mac HD/Users/andrei/.dbshell
     863B->     866B        /Mac HD/Users/andrei/.lesshst
   11.0KB->   15.1KB        /Mac HD/Users/andrei/.viminfo
   10.4KB->   30.7KB        /Mac HD/Users/andrei/.zsh_hist
    6.7MB->    6.7MB        /Mac HD/Users/andrei/.dropbox/dropbox.db
    3.9MB->    3.9MB        /Mac HD/Users/andrei/.dropbox/migrate.db
   25.2MB->   50.3MB   [10] /Mac HD/Users/andrei/.dropbox/cache/
   21.0KB->   21.0KB    [1] /Mac HD/Users/andrei/Desktop/
  120.0MB->  120.4MB    [1] /Mac HD/Users/andrei/Documents/Linkinus 2 Logs/
  142.8MB->  146.2MB  [156] /Mac HD/Users/andrei/Library/Application Support/
[..skip..]
  608.0MB->  608.0MB    [5] /Mac HD/private/var/data/mongodb/
[..skip..]
==> Total Backup: 967 changed files/directories, 1.88GB

Looking at this, a couple of offenders are immediately obvious. Linkinus (an IRC client) keeps all the conversation logs in one single file, and since I’m frequently on IRC, that file grows and gets backed up every hour. The data files for MongoDB that I use for development are also fairly large and change often. Finally, there is something in Library/Application Support, but it wasn’t shown because of the depth limit I set. After increasing the depth to 7, I discovered that it was the Google Chrome cache and history.
Conversation logs and Chrome stuff are not important enough for me to back up hourly, and the MongoDB data I can copy periodically to an off-site server. By excluding these 3 items from the backup process via Time Machine Preferences, I was able to reduce the size of the hourlies to 50 MB or less.
This brings up an important point though: while Time Machine is great for doing automated backups over your local network, you should have a separate copy of the data off-site, for redundancy. How and when to do the off-site backups varies according to everyone’s needs, but I would suggest something like CrashPlan, which does unlimited online backups for about $5/month. Once again, it’s automatic and hands-off, which is how you want it.
The timedog script was the most useful tool I found while investigating this Time Machine issue.

Vim for Programmers on Slideshare

02 June 2010 » In PHP, Talks » 6 Comments

A few years ago, I was considering what proposal to submit to the Vancouver PHP Conference. The usual slate of “how to do this and that in PHP” was becoming a bit tired, so I decided to submit a talk about an essential skill that PHP (and other language) developers might need: using the Vim editor.
By that time I knew that I was firmly in the Vim camp (as opposed to Emacs or IDEs). Of course, writing a 45-minute talk about Vim is like trying to explain Mulholland Drive during an elevator ride, but I rose to the challenge and put together the first version of the slide deck. When I later received the feedback about the talk, I realized that it was the most highly rated one of the conference, above even Rasmus’s perennial PHP keynote. Clearly, I was onto something.
Since then I’ve expanded and adjusted the talk to fit the 45-60 minute slot, but I still usually run out of time due to the wealth of material. So I published the slides on Slideshare and created a GitHub repo for my time-tested Vim settings and plugins; feel free to fork it and submit pull requests. And in general, go forth and Vim.

Regex Clinic on Slideshare

03 May 2010 » In PHP, Talks » 5 Comments

Back in 2004, I submitted a proposal to the only PHP conference ever to have taken place aboard a cruise ship. Yes, the infamous PHP Cruise (“Do we all know each other?”). The talk was about regular expressions and their usage in PHP. I wrote the original version in PowerPoint, but when I boarded the ship and saw Keynote 1.0 on someone’s Powerbook, I was hooked and had to port all the slides to Keynote right then and there.
Well, the talk – and the cruise – went well, and since then Andrei’s Regex Clinic has grown and shrunk in size depending on which conference I was giving it at and how much time was allotted. I considered retiring the talk several times, because surely people should know how to use regular expressions by now, but whenever I give it, there is a roomful of people wanting to know what (?<=foo) does. And as much as I try, I always run out of time trying to cover everything from the basics to more advanced usage.
I feel that the slides are pretty polished by now and it’s better if people can read them at a comfortable pace, so the whole tutorial-sized presentation is now available on Slideshare. I omitted the section that covers PHP’s regular expression API, because it already has great online documentation. Otherwise, go forth and read about look-ahead assertions and recursive matching.

Please Start From The Beginning

11 January 2010 » In Development, Me, PHP » 3 Comments

Please Start From The Beginning is a video series that explores the career paths and experiences of web industry professionals. I was honored to be interviewed for the series along with such interesting people as Eric Meyer, Joe Stump, and Elliot Jay Stocks. If you want to hear about how the heck I started in Web development, this is the video for you.

PHP Advent 2009: GeoIP Wrangling

21 December 2009 » In Analog, Development, PHP » 7 Comments

There have been a few kind words and questions about the GeoIP-based blurb that greets the visitors to our Analog holding page, so I wanted to blog about how it was made and what obstacles I ran into. However, while writing I realized that it would be the perfect candidate for this year’s edition of PHP Advent instead of the article I originally planned to write about Git and Github. So head on over and read about GeoIP Wrangling and feel free to leave your comments here.

Say Hello to Analog

18 December 2009 » In Analog, Me, PHP, Work » 8 Comments

Wow.
That pretty much sums up my feelings about the response to the announcement of Analog, a web design and development co-operative that I started with a few of my friends. I am stunned and humbled by the many kind words of congratulations, praise, and encouragement that we received about our launch via Twitter, Facebook, and personal communication. Thank you. Many have been wondering what I’ve been up to since leaving Digg in early September, and organizing and setting up Analog has been a big part of it.
The first time I discussed the idea for such a company was when Chris Shiflett and I went to Iceland in June. During that time of renewal, reset, and inspiration, we talked about our desire to work on interesting projects with a great team of peers. People like Jon and Jon, who Chris had worked with on a few occasions. From the start, we wanted to be a bona fide co-operative: an organization owned and operated by a group of individuals for their mutual benefit and adhering to the principles of equality and equitability.
Co-operatives, especially tech ones, aren’t very common — though they are much more prevalent across the pond in the UK — so it took us a while to work out the legal and tax issues of having an equitable company comprised of people from multiple countries. This time was also spent refining our brand identity and brand promise, both of which we believe strongly in. We also needed a short statement to explain who we are and what we do, and in the end it became this (the last part of which harkens back to our original motivation):

Analog is a company of friends who make web sites. It’s a co-operative where imagination, design, and engineering thrive; good people doing good work.

The part about “making web sites” may sound simplistic, but we believe in taking back simple, honest phrases like web site and web developer. They are precise and descriptive despite having been shunned or dismissed by people in favor of things like web application, front-end/back-end engineer, and other seemingly sexier nomenclature meant to sound more important. It’s time to call things what they are. Say it with me, “I am proud to be a web developer.”
Analog origins
Of course, the most difficult thing to decide on was the name. We had a lot more latitude within the .coop top-level domain (TLD), but even then, we must have gone through a hundred or more names looking for one that would somehow reflect our philosophy while being memorable. The flash of inspiration struck while imbibing the potent Bee Sting cider at the inimitable Duke of York in Bristol. The name had an instant appeal, and I imagine we all thought, “Yes, Analog is it.” Jon Tan even left a mark on the table, since he couldn’t wait to see what Analog would look like in type.
Analog appealed to us because of its association with handmade things, craftsmanship, and a “warmer” feeling in general. Somehow it felt good to think that we were going to do digital things the analog way, where a personal touch of each of us would be evident in our work and communication.
The team at Analog is one of the best that I’ve had the honor to be a part of. Alan Colville is an accomplished UX designer and customer experience and usability researcher. He has helped a number of clients in the past, including Vodafone, Virgin Media, BlackBerry, and Visa. Chris Shiflett has an extensive background in web development, specializing in web security, and has worked on projects for Ning, National Geographic, Digg, and many other clients during his time as principal of Brain Bulb and OmniTI. Jon Gibbins is an ace developer and web accessibility expert who most recently lent his skills to OmniTI as well. And Jon Tan is simply the best designer and typography maven that I have the pleasure of knowing, with an extensive body of published work. Between us, we have many years of experience and a bountiful font of creative knowledge.
The type of work that we want to do is twofold. Firstly, we want to take on client projects that are built on an inventive concept, where we have as much creative freedom as possible. By this I mean that, as a group, we want to be part of the initial discussions and brainstorming, so that we can inject our own ideas into the process. We want the projects to utilize both our design and development expertise, involving aspects of the programmable web in a way that supports and enriches the original concept. Secondly, we want to incubate some of the ideas we’ve been knocking around into products that can be spun off later, if necessary. We’re especially interested in the possibilities presented by geo-location, geo-tagging, and other geo-things. We also want to share what we learn and produce as a team. The first thing we’re releasing is the JS grid overlay used on the Analog site; look for it shortly.
We’re on Twitter as @analogcoop. Get in touch if you have a cool project in mind and want to work with us to make it a reality, or use our nifty contact form at analog.coop.
