
Duplicates Detection with ElasticSearch

28 March 2011 » In PHP, Tech » 108 Comments

As I mentioned on Twitter recently, duplicate detection (or de-duping) is one of the most unappreciated problems that the developers of certain types of applications face sooner or later. The applications I’m talking about share two main characteristics: item collection and some sort of social aspect. A prime example of this is Delicious. Users collect items (bookmarks), and, since multiple people can collect the same item (URL), there needs to be duplicate detection, so that it is possible to infer some information from the data, such as how many people saved a certain URL, what are the popular URLs of the day, etc.

In the case of URL bookmarking, de-duping is fairly easy, because usually a URL points to a unique piece of content, so grouping by the URL gives the desired result. There are some complications arising from redirections and URL shorteners, of course, but the general case is not difficult.

De-duping content that does not have a unique identifier for each item is more challenging. We ran into this while working on Mapalong, and I wanted to share the approach that we took to detect duplicates in our data.

Mapalong allows people to bookmark their favorite locations, and a placemark (what we call it) consists of a number of fields such as title, notes, tags, URL, geo coordinates, and a few more. Mapalong is a map-oriented product, and one of the ways to add a placemark is to simply zoom and drag the map to the location you want to add. Predictably, people are not always precise when they add the placemarks this way. The fields that are likely to be inconsistent are the title (e.g., The Invisible Dog versus The Invisible Dog Art Center) and the location.

The title varies, because people often just type the name in rather than copy their placemark from a known business listing. This also means they can misspell the title, use different punctuation, or give the place a title that is special to them rather than the canonical one (e.g., My first kiss instead of Mulholland Drive overlook).

The location variance is mostly due to two reasons: people’s recollection of the actual location may be wrong, and the places on the map occupy a certain area, so each person can bookmark it at a different point in that area (such as different entrances to the building, etc.). Extended places like the Google campus or a popular beach can contain multiple placemarks that are equally valid and should be treated as duplicates for some purposes.

I felt that the tags and notes fields could vary too much from one placemark to another to provide adequate matching, and decided to do duplicate detection based only on the title and location. The approach I took relies on ElasticSearch, which is a full-text search engine we use for Mapalong already. It has a powerful and fast query engine and supports geo-based search as well.

Obviously, matching the title exactly, even case-insensitively, was going to get me only halfway, due to the reasons indicated above. To deal with possible spelling errors, punctuation, and missing or extra words, I used the FuzzyLikeThis (FLT) query type, which is a combination of the MoreLikeThis query that retrieves documents similar to the given text and a Levenshtein-based fuzzy search.
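For intuition, the fuzzy half of FLT boils down to Levenshtein edit distance. A quick illustration in Python (my own sketch, not Mapalong or ElasticSearch code; the similarity formula is only an approximation of Lucene’s, normalizing the distance by the longer string’s length):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]

def fuzzy_similarity(a, b):
    # Rough feel for a min_similarity threshold: 1.0 is an exact match,
    # and anything below the cutoff (0.5 by default) would be excluded.
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```

This is why a one-character slip like a missing letter still scores well above the 0.5 cutoff for typical title-length words.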

To do the initial classification of places into duplicate groups, the basic algorithm is as follows:

  1. Loop through all the places.
  2. Use FLT + geo query to find places that are likely to be duplicates.
  3. Create a duplicate group from the current place and the matches.
  4. Mark these places as done, so that they don’t get added to another group.

Once the groups are built, the subsequent duplicate matching is done only when a user adds a new place; then we simply run the FLT + geo query, determine which duplicate group contains the majority of the results, and add the new place to that group.
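The two passes above can be sketched in Python. This is a simplified illustration, not the actual Mapalong script; `find_candidates` is a hypothetical stand-in for the FLT + geo query:

```python
from collections import Counter

def build_groups(places, find_candidates):
    # Initial classification: loop over all places, query for likely
    # duplicates, and form a group from the current place plus any
    # matches that are not already grouped.
    assigned = {}   # place id -> group index
    groups = []     # list of sets of place ids
    for pid in places:
        if pid in assigned:
            continue  # already marked as done
        group = {pid} | {c for c in find_candidates(pid) if c not in assigned}
        for member in group:
            assigned[member] = len(groups)
        groups.append(group)
    return groups, assigned

def assign_new_place(pid, find_candidates, assigned, groups):
    # Incremental matching: put a newly added place into the group that
    # holds the majority of its query matches, or start a new group.
    votes = Counter(assigned[c] for c in find_candidates(pid) if c in assigned)
    if votes:
        gi = votes.most_common(1)[0][0]
        groups[gi].add(pid)
    else:
        groups.append({pid})
        gi = len(groups) - 1
    assigned[pid] = gi
    return gi
```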

FuzzyLikeThis has several useful tuning parameters:

  • max_query_terms: specifies the maximum number of terms that will be included in the generated query.
  • min_similarity: excludes the terms that are too dissimilar.
  • prefix_length: enforces a common prefix on the fuzzified terms (which greatly affects the speed of the Levenshtein algorithm).
I experimented with different values for max_query_terms, but since the majority of the placemarks have titles of three words or less, the best results were obtained with max_query_terms set to 2. The prefix_length was set to 1, and min_similarity was left at the default of 0.5.

The FLT query can be run against multiple fields in the document. Initially, I tried matching the given text against title, notes, and tags, but there were too many false positives. In the end, I limited it to matching only on the title field.

Logically, the duplicates are likely to be fairly close to each other, so I applied the geo-distance filter to the FLT query matches and set the radius to be 50 meters (which might change, subject to testing).

The full FLT query looked like this:

    {
        "query": {
            "filtered": {
                "query": {
                    "fuzzy_like_this": {
                        "like_text": "Milk Bar",
                        "fields": ["title"],
                        "max_query_terms": 2,
                        "prefix_length": 1
                    }
                },
                "filter": {
                    "geo_distance": {
                        "distance": 0.05,
                        "geo": {
                            "lat": 40.677795,
                            "lon": -73.968923
                        }
                    }
                }
            }
        },
        "size": 60
    }

With these settings, the initial results were encouraging. The grouping seemed to work well, including fuzzy titles, and there were only a few false negatives, such as Grimaldi’s not being grouped with the places named Grimaldi’s Pizzeria. This turned out to be caused by an overly aggressive score cutoff; my first take counted only the matches with a score ≥1.0, but there were a few cases where valid results had scores below 1.0, so I relaxed the cutoff to ≥0.5. With this change, the duplicate groups were 99% accurate, which was good enough for our purposes.
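For reference, a small Python helper that assembles this query body and applies the relaxed score cutoff client-side could look like the following. This is a sketch of my approach, not the actual Mapalong code, and the HTTP call to ElasticSearch is omitted:

```python
def build_dedupe_query(title, lat, lon, distance=0.05, size=60):
    # Mirrors the filtered query from the post: fuzzy_like_this on the
    # title field only, constrained by a geo_distance filter (in km).
    return {
        "query": {
            "filtered": {
                "query": {
                    "fuzzy_like_this": {
                        "like_text": title,
                        "fields": ["title"],
                        "max_query_terms": 2,
                        "prefix_length": 1,
                    }
                },
                "filter": {
                    "geo_distance": {
                        "distance": distance,
                        "geo": {"lat": lat, "lon": lon},
                    }
                },
            }
        },
        "size": size,
    }

def keep_matches(hits, min_score=0.5):
    # The relaxed cutoff: keep hits scoring >= 0.5 (>= 1.0 dropped
    # valid results such as the Grimaldi's variants).
    return [h for h in hits if h["score"] >= min_score]
```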

Below are some examples of the groups. The first line in each paragraph is the placemark that was used as the source for matching, and each line in the group shows the place ID, title, match score, and geo coordinates.

31: Gen
    31, Gen, 1.49, [40.677602, -73.96358]
   756, Gen, 1.49, [40.677602, -73.96358]
   141, Restaurant Gen Inc, 0.75, [40.677599, -73.96349]

33: Grimaldi's Pizzeria
    33, Grimaldi's Pizzeria, 1.41, [40.702677, -73.99350]
  1001, Grimaldi's Pizzeria, 1.41, [40.702677, -73.99350]
  2330, Grimaldi's Pizzeria, 1.41, [40.702677, -73.99350]
  1987, Grimaldi's, 0.52, [40.702702, -73.99345]
  2032, Grimaldi's, 0.52, [40.702677, -73.99350]

249: The Cock and Hoop
   249, The Cock and Hoop, 1.41, [52.951072, -1.14407]
  1102, The Cock and Hoop, 1.41, [52.951072, -1.14407]
  1128, The Cock & Hoop, 1.41, [52.950970, -1.14429]

897: Arnolfini bookshop
   897, Arnolfini bookshop, 1.41, [51.449433, -2.59744]
   921, Arnolfini Bookshop, 1.41, [51.449304, -2.59748]

1541: Powell's
  1541, Powell's, 1.50, [45.523022, -122.68136]
  1612, Powell's City of Books, 0.75, [45.522991, -122.68124]
  1895, Powell's City of Books, 0.75, [45.522987, -122.68123]

1542: Rogue Distillery and Pub House
  1542, Rogue Distillery and Pub House, 1.06, [45.525834, -122.68494]
  1597, Rogue Distillery and Public House, 1.06, [45.525848, -122.68504]
  1613, Rogue Distillery and Public House, 1.06, [45.525848, -122.68504]

My script was written in Python with pyes, but it would be very easy to write an equivalent one in PHP using the Elastica library, for example.

This is far from an exhaustive examination of the de-duplication issue, but I hope you can take away some ideas for how to deal with this problem in your product.

Update: @clintongormley correctly points out that max_query_terms actually specifies how many words from the beginning of the input text will be passed to the engine. I guess in our case most of the variation happens at the end of the titles, so passing the first 2 words worked fine. However, it might be better to vary max_query_terms with the length of the title or just leave it at the default of 25.
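A sketch of that suggestion: derive max_query_terms from the title itself, capped at the default of 25. This is a hypothetical heuristic, not something we actually shipped:

```python
def max_query_terms_for(title, default=25):
    # Pass as many terms as the title has words, capped at FLT's
    # default of 25 for unusually long titles.
    return min(len(title.split()), default)
```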

Ideas of March

15 March 2011 » In Beer, Me, Opinion, PHP, Tech » 20 Comments

I miss blogging and the conversations that used to go with it. My friend Chris Shiflett probably feels the same way, because he is trying to start a blog revival with his Ideas of March post.

I used to blog much more actively, but with Twitter gaining momentum in early 2008, I found it easier to fire off a quick thought, observation, funny link, or another short piece there rather than put it on my blog. It helped that building up an audience on Twitter was much faster, due to the one-click nature of creating relationships and keeping track of content. I feel that the problem of doing the same for blog conversations has not yet been solved, despite services like Disqus.

Still, I think regular blogs are relevant and valuable, because they do what the microblogging services cannot. Here are a few reasons why I like blogs:

  • Blog posts usually present a more complete and rounded point of view rather than a quip.
  • Blog posts require time to research, ruminate on, and rewrite before publishing, which leads to higher quality content.
  • They are indexed by real search engines (Google), rather than the half-baked solution Twitter uses that cannot search more than 2 weeks in the past.
  • Posts can be saved to Instapaper or a similar service.

I want to get back to blogging, but a couple of things are slowing me down. Most of the things that I like to blog about fit into 3 broad areas: tech stuff — coding, architecture, interesting problems and solutions to them, etc.; opinion pieces on various topics; and food & drink. So far I have been mixing them all in one blog, because I am not sure if it’s worth breaking out each one into a separate blog on the main site or even into a different site. I would be really interested to know if people mind reading mixed content like that or if they prefer a more compartmentalized approach. I also want a better design for my blog, because I don’t feel that the current one allows me to present richer content well, like embedding photos, videos, code samples, and so on. Hopefully, I can find a designer who can help me with that.

In the end, really, it’s all about overcoming inertia and actually writing posts rather than just thinking about them. So I promise to write at least 2 blog posts before the end of March and to resolve the above-mentioned issues soon to help me blog more.

The blogs are dead. Long live the blogs.

The Prime That Wasn’t

03 August 2010 » In PHP, Tech » 49 Comments

I’ve seen a lot of problems solved with regular expressions, but last Friday, thanks to Chris and Sean, I found out that one can use them to determine whether a given integer is a prime number. The original articles showed the following regular expression:

    /^1?$|^(11+?)\1+$/
You don’t apply this to the integer itself. Instead, you create a string of repeating 1s, where the number of 1s is defined by the number you’re testing, and apply the regex to that. If it fails to match, the number is a prime. The regex relies on backtracking and thus would not work with a DFA-based engine, such as PHP’s (now deprecated) ereg* functions. But, it works just fine with the preg_* ones, or so I thought (more on that in a bit).

So, how does it work? It may seem like a brain buster, but it’s actually pretty simple. Ignore the portion before the alternation operator (|), because it’s just a short-circuit case for checking whether the input string is empty or has a single 1. (If the string is empty, it means the number being tested is 0. If the string is a single 1, it means the number being tested is 1. Neither of these are prime.) The portion after the pipe is where the real magic happens.

The (11+?) sub-pattern matches strings 11, 111, etc. The \1+ matches whatever the sub-pattern matched, one or more times. So on the first try, the engine will match 11 and then attempt to match the same thing one or more times in the rest of the string. If it succeeds, the number is not a prime. Why? Because it just proved that the length of the string is divisible by 2 (the length of 11), therefore the original number is divisible by 2. If the overall match fails, the engine backtracks to the beginning and tries to match 111 two or more times and so on successively. If the first sub-pattern gets long enough (n/2 basically), and the overall match still fails, the number is a prime. Beautiful, isn’t it?
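The same trick works in any backtracking engine. Here is a sketch in Python’s re module rather than PHP’s preg_* functions (my translation, not from the original article):

```python
import re

# Matches unary strings whose length is NOT prime: ^1?$ catches 0 and 1,
# and ^(11+?)\1+$ catches any length divisible by a smaller chunk.
NOT_PRIME = re.compile(r"^1?$|^(11+?)\1+$")

def is_prime_regex(n):
    # No match means no divisor was found, i.e. n is prime.
    return NOT_PRIME.match("1" * n) is None
```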

Coincidentally, Sean recently created a code evaluation plugin for the Phergie-based IRC bot that runs in the channel we hang out in. The plugin is a simple proxy to, but is helpful for doing quick code tests. We had fun experimenting with this regex pattern implemented in a PHP function that returns the next largest prime number after a given one. The trouble started when Sean fed it 99999999, and the code spit out 100000001. This didn’t seem right, and Wolfram Alpha confirmed; the answer we got was not a prime. (It factors into 17 * 5882353.)

A few similar tests also gave us numbers that were not prime. But, where was the problem? The PHP code was too simple to have a bug, many answers were correct, and the regex algorithm itself was sound. It was time for brute force. I quickly wrote some code to feed increasing odd integers into the regex pattern and check its answers against the normal sieve algorithm to see where it started failing. The first number to fail was 22201; the regex said it was prime, but it’s actually a perfect square (149²). Past that, the failure rate increased.
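That brute-force comparison can be reproduced in Python (a sketch with trial division standing in for the sieve I used; CPython’s re engine has no backtracking limit, so the two methods should agree there):

```python
import re

NOT_PRIME = re.compile(r"^1?$|^(11+?)\1+$")

def regex_says_prime(n):
    return NOT_PRIME.match("1" * n) is None

def trial_division_prime(n):
    # Conventional primality check used as the reference answer.
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def first_disagreement(limit):
    # Feed increasing odd integers to both methods and report the first
    # n where they differ (None if they agree all the way up to limit).
    # In the PHP run described above, the first failure was 22201.
    for n in range(3, limit, 2):
        if regex_says_prime(n) != trial_division_prime(n):
            return n
    return None
```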

It then dawned on me that the culprit might be the backtracking itself, specifically how it was implemented in PCRE, the library at the heart of PHP’s regular expressions. As I mention in my Regex Clinic talk, unbounded backtracking can dramatically slow down the matching and should be well-understood before attempting to write complex patterns. To manage the danger of runaway patterns, PCRE implemented the pcre.backtrack_limit setting a few years ago. In our case, backtracking is used to break up the string of 1s into successively larger chunks, and, with very long strings, the engine may run into this backtracking limit, which is set to 100000 by default. My guess was that with the 22201-character-long string, the default was not enough. Once the limit is reached, the match fails, and the number is declared prime.

I bumped the limit up to 200000, and voilà, 22201 was no longer declared prime. To fix the 100000001 match, the limit had to be raised dramatically, to around 250000000! And, the program took 14 seconds to deliver the verdict on my new i5 MacBook Pro. Needless to say, don’t use this prime determination algorithm for any sort of real-life stuff. Instead, just appreciate it for its elegance. I certainly did, and was glad that my bit of investigative work showed that abstract, pure, good ideas can still fail in the messy world of software and hardware.

Chrome Not Quite Shiny Yet

27 June 2010 » In Opinion, Tech » 7 Comments

A month ago I decided to stop using Firefox as my main browser, because I was becoming frustrated with its sluggishness and general stability issues. I wanted to give a new, modern browser a chance and Google Chrome seemed like the most fitting choice. I’ve been using it diligently and exclusively since then and wanted to share some good and not so good things that I found out after a month of daily usage. The caveat is that I’ve only used it on the Mac platform, so my findings may differ for Windows/Linux users.

  1. It’s fast and stable. Pages load quite a bit faster, if only perceptually. And while the memory usage is not that much lower than FF’s – after all, those DOM trees and images have to go somewhere – I have never had to force-quit (or even just quit) Chrome the way I had to with FF after a couple of days. This is a big win in my book.
  2. The built-in developer tools are nice, but are not quite as good as Firebug, it seems. I will have to try them out on a project before I can say this with certainty.
  3. There is a lack of integration with 3rd-party tools and websites I used with FF, mainly 1Password and Delicious. Yes, there are Chrome alpha/beta extensions for both, but I miss the full functionality the FF versions offered. Hope the vendors hurry up and bring them up to feature parity.
  4. I miss the quick search engine access that FF has in the upper right corner. That’s how I searched Amazon and several other sites. I can’t find an equivalent in Chrome.
  5. There is no way to configure proxy settings through Chrome itself. You have to use a command-line parameter (who does that on a Mac though?), an extension, or do it through OS X’s System Preferences.
  6. It’s from Google. So it’s just what I’ve come to expect from their products – great engine and features combined with some usability problems and a lack of polish.

Overall, I’m pleased with Chrome and will continue using it as my main browser until something better comes along.

Update: As the comments below indicate, it is possible to configure and use various search engines from the “omnibox”, aka the search bar. Right click and select Edit Search Engines.

Voting opens for my SXSW panel

19 August 2009 » In PHP, Talks, Tech » 1 Comment

I have never been to the South by Southwest festival. And every time I mention this to friends and colleagues, they make big eyes and say, “You have to go!” Well, I finally cracked and submitted a panel proposal for the 2010 SXSW Interactive portion. The panel is titled “Travelog with Maps: When 1,000 Photos Aren’t Enough” and will present and discuss ways to preserve the memories and tell the stories of your trips using the latest geolocation tools, social data, and a bit of code. Chris Shiflett will be my co-panelist. We decided to do this panel after trying to share our Iceland trip in real-time and discovering what is and isn’t there to support it.

If you like this idea, please vote for this panel. Thank you!

Plaxo Pulse DOA

05 November 2007 » In Tech » 4 Comments

As if the current proliferation of social networks was not enough, Plaxo has recently launched its own offering called Pulse, in the best tradition of branding-via-metonymy. First of all, “Pulse”? I am generally not in the habit of checking my friends’ vital signs several times a day, so that kind of got lost on me. Maybe they could have done better with Spasm, Borborygm, or ultimately, Omphaloskepsis, since that’s basically what social networks are.

Anyway, what I really wanted to say is that Plaxo Pulse fails. Out of the gate. Dead on arrival. Why? Well, ever since it launched, I’ve been receiving notices that such-and-such has added me as a friend or wants to add me as a business contact. These notices provide a link to go to the Pulse site and confirm the connection. Not recognizing one of the names, I decided to click on the link to check it out, but all I saw was a page that said, “Not a member yet? Sign up!” Are you freaking kidding me? You expect me to sign up just to see who wanted to add me as a contact? No thanks, Pulse. You lost me at “click here”.

Web 2.0 Law #1

11 May 2007 » In Tech » No Comments

Now that Yahoo! Photos is closing down, Flickr is going to experience another surge in membership. It is already the most popular photo sharing service that matters (forget about Photobucket and such), and it has an unparalleled API. So with that in mind here’s the first of Andrei’s Web 2.0 Laws.

Law #1: As the number of mashups increases, the probability of a new mashup not using Flickr approaches 0.

Proof, what proof? This is not math or physics; this is Web 2.0. But do go and take a look for yourself; Flickr infiltration is everywhere. And if you find a mashup that doesn’t use the Flickr API, then there are only two explanations: a) the authors have thought or are thinking about using Flickr in one way or another; b) they are Amish.

I, for one, welcome our new megapixel overlords.

“PHP Eats Rails for Breakfast”

21 October 2006 » In PHP, Tech » 3 Comments

That is indeed the title of the article on Ohloh, a site that collects information on open source projects and gathers statistics by analyzing the source code of those projects. So far they’ve indexed over 3,000 projects and their conclusion seems to be that among Web scripting languages, PHP is the undisputed champion (as measured by the LOC count).

Measured purely by the number of new lines of code, PHP leaves all other web-scripting languages in the dust, and continues to grow. Quite simply, one-fifth of all open source code being written today is written in PHP.

This, combined with their observation that the relative number of developers working in PHP is not increasing and that the number of new projects being started has actually decreased, leads to an interesting premise: the trend for PHP-based projects points toward mature code bases. Meaning that developers prefer to work with existing applications or frameworks and increase their output by updating these applications rather than starting new ones. In general, that seems to be a good trend, if one can rely on the data gathered by Ohloh. I have not done an exhaustive examination, but there are some puzzling things in the data on the PHP project itself. For example, how do they calculate “man-years” for each developer? My number is 5.5, which puts me fourth on the list behind Rasmus, Derick, and Frank. I know I’ve been working on PHP longer than that. And then, Andi and Zeev are actually below me with 5.3 and 5.2 man-years respectively. The account with the most man-years seems to be our automatic changelog committer script, even though it committed a total of 0 lines of usable code. 🙂

Even without Ohloh’s data, one could venture to say that the code base of projects using PHP is larger than those of Ruby or Python. A lot of that is due to the accessibility of PHP, I think, but does a pure lines-of-code count give any indication of the quality of said code? Of course not. I think Ohloh is providing a useful service by tracking this kind of information, but I would take its results with a few grains of salt.

Hot and .. Not So

15 July 2006 » In Funny, Tech » 4 Comments

Thought I’d throw a couple of fun links your way. The first one is a project that almost won the $5,000 prize at the last Mashup Camp. It presents, shall we say, an innovative approach to user validation, combining so-called business with so-called pleasure. HotCaptcha gets a thumbs-up from me any day of the week.

The other is a supremely strong candidate for the title of Worst Music Video Ever. It is an inspired effort that immediately induces a cringing expression on your face and fails to release you from its grip until (in my case) 3 hours later. Have fun.

Half-kingdom for the Desktop

21 June 2006 » In Tech » 1 Comment

Last week was the first time I had to do a hard shutdown on my Powerbook. It was a strange feeling. There I was, doing something that used to be a regular occurrence in the Windows world, and now it really bothered me that I had to hold down the Power button for 5 seconds and wait for my Mac to commit temporary suicide. And I was apprehensive for a good reason.

After powering back up, I discovered that my desktop (lowercase) was gone. No icons, and I could not drag anything to the background at all. I still had the ~/Desktop folder, but it seemed to be disconnected from the actual display. No one I asked about this had encountered the problem before, but I had a hunch it had to do with a corrupted preference file or something similar.

After much mucking around (creating a new user, making sure its desktop was operational, running Preferential Treatment over all .plist files, doing a binary sieve on the .plists to find the offending one), I narrowed it down to A bit more yahoo’ing revealed that the culprit was a “secret” preference called CreateDesktop. If set to false, it tells Finder not to display the desktop. Brilliant. Somehow this got turned off during the hard shutdown and my desktop got lost at sea. Anyways, if you ever get sick of your Desktop (or want to restore it, like I did), you can change the setting via:

defaults write CreateDesktop -bool <true or false>

Until later. I am going to go click on all the icons.

UPDATE: Fixed the HTML escaping of the command (the bool arguments were hidden).
