Duplicates Detection with ElasticSearch

» 28 March 2011 » In PHP, Tech »

As I mentioned on Twitter recently, duplicate detection (or de-duping) is one of the most unappreciated problems that the developers of certain types of applications face sooner or later. The applications I’m talking about share two main characteristics: item collection and some sort of social aspect. A prime example of this is Delicious. Users collect items (bookmarks), and, since multiple people can collect the same item (URL), there needs to be duplicate detection, so that it is possible to infer some information from the data, such as how many people saved a certain URL, what the popular URLs of the day are, etc.

In the case of URL bookmarking, de-duping is fairly easy, because usually a URL points to a unique piece of content, so grouping by the URL gives the desired result. There are some complications arising from redirections and URL shorteners, of course, but the general case is not difficult.

De-duping content that does not have a unique identifier for each item is more challenging. We ran into this while working on Mapalong, and I wanted to share the approach that we took to detect duplicates in our data.

Mapalong allows people to bookmark their favorite locations, and a placemark (as we call it) consists of a number of fields such as title, notes, tags, URL, geo coordinates, and a few more. Mapalong is a map-oriented product, and one of the ways to add a placemark is to simply zoom and drag the map to the location you want to add. Predictably, people are not always precise when they add placemarks this way. The fields that are likely to be inconsistent are the title (e.g., The Invisible Dog versus The Invisible Dog Art Center) and the location.

The title varies, because people often just type the name in rather than copy their placemark from a known business listing. This also means they can misspell the title, use different punctuation, or give the place a title that is special to them rather than the canonical one (e.g., My first kiss instead of Mulholland Drive overlook).

The location variance is mostly due to two reasons: people’s recollection of the actual location may be wrong, and the places on the map occupy a certain area, so each person can bookmark it at a different point in that area (such as different entrances to the building, etc.). Extended places like the Google campus or a popular beach can contain multiple placemarks that are equally valid and should be treated as duplicates for some purposes.

I felt that the tags and notes fields could vary too much from one placemark to another to provide adequate matching, and decided to do duplicate detection based only on the title and location. The approach I took relies on ElasticSearch, which is a full-text search engine we use for Mapalong already. It has a powerful and fast query engine and supports geo-based search as well.

Obviously, matching the title exactly, even case-insensitively, was going to get me only halfway, due to the reasons indicated above. To deal with possible spelling errors, punctuation, and missing or extra words, I used the FuzzyLikeThis (FLT) query type, which is a combination of the MoreLikeThis query that retrieves documents similar to the given text and a Levenshtein-based fuzzy search.

To do the initial classification of places into duplicate groups, the basic algorithm is as follows:

  1. Loop through all the places.
  2. Use FLT + geo query to find places that are likely to be duplicates.
  3. Create a duplicate group from the current place and the matches.
  4. Mark these places as done, so that they don’t get added to another group.

Once the groups are built, the subsequent duplicate matching is done only when a user adds a new place; then we simply run the FLT + geo query, determine which duplicate group contains the majority of the results, and add the new place to that group.
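
To make the flow concrete, here is a rough sketch of both passes in Python. This is not the actual Mapalong script; find_candidates is a hypothetical helper standing in for the FLT + geo query shown further down, returning matching place IDs with their scores.

# Sketch of the grouping passes. find_candidates() is a stand-in for the
# FLT + geo_distance query described later in this post.

def build_duplicate_groups(places, find_candidates):
    groups = []        # each group is a set of place IDs
    assigned = set()   # IDs that already belong to some group

    for place in places:
        if place["id"] in assigned:
            continue
        # Places that are textually and geographically close to this one.
        matches = find_candidates(place["title"], place["lat"], place["lon"])
        group = {place["id"]} | {m["id"] for m in matches if m["id"] not in assigned}
        groups.append(group)
        assigned |= group

    return groups


def assign_new_place(place, groups, find_candidates):
    """Add a newly created place to the group holding most of its matches."""
    matches = find_candidates(place["title"], place["lat"], place["lon"])
    match_ids = {m["id"] for m in matches}

    best = max(groups, key=lambda g: len(g & match_ids), default=None)
    if best is not None and best & match_ids:
        best.add(place["id"])
    else:
        # No existing group matched: start a new one with the place and its matches.
        groups.append({place["id"]} | match_ids)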

FuzzyLikeThis has several useful tuning parameters:

max_query_terms
Specifies the maximum number of terms that will be included in the generated query
min_similarity
Excludes the terms that are too dissimilar
prefix_length
Enforces a common prefix on the fuzzified terms (which greatly affects the speed of the Levenshtein algorithm)

I experimented with different values for max_query_terms, but since the majority of the placemarks have titles of three words or less, the best results were obtained with max_query_terms set to 2. The prefix_length was set to 1, and min_similarity was left at the default of 0.5.

The FLT query can be run against multiple fields in the document. Initially, I tried matching the given text against title, notes, and tags, but there were too many false positives. In the end, I limited it to matching only on the title field.

Logically, the duplicates are likely to be fairly close to each other, so I applied the geo-distance filter to the FLT query matches and set the radius to be 50 meters (which might change, subject to testing).

The full FLT query looked like this:

{
    "query": {
        "filtered": {
            "query": {
                "fuzzy_like_this": {
                    "like_text": "Milk Bar",
                    "fields": [
                        "title"
                    ],
                    "max_query_terms": 2,
                    "prefix_length": 1
                }
            },
            "filter": {
                "geo_distance": {
                    "distance": 0.05,
                    "geo": {
                        "lat": 40.677795,
                        "lon": -73.968923
                    }
                }
            }
        }
    },
    "size": 60
}

With these settings, the initial results were encouraging. The grouping seemed to work well, including fuzzy titles, and there were only a few false negatives, such as Grimaldi’s not being grouped with the places named Grimaldi’s Pizzeria. This turned out to be because of an overly aggressive score cutoff; my first take counted only the matches with a score ≥1.0, but there were a few cases where valid results had scores below 1.0, so I relaxed the cutoff to ≥0.5. With this change, the duplicate groups were 99% accurate, which was good enough for our purposes.
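
In code, that cutoff amounts to nothing more than filtering the hits on their score before forming the groups; a minimal sketch, assuming the standard ElasticSearch response structure:

SCORE_CUTOFF = 0.5   # relaxed from the initial, too-strict 1.0

def filter_hits(es_response, cutoff=SCORE_CUTOFF):
    # Keep only the hits whose score clears the cutoff.
    return [h for h in es_response["hits"]["hits"] if h["_score"] >= cutoff]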

Below are some examples of the groups. The first line in each paragraph is the placemark that was used as the source for matching, and each line in the group shows the place ID, title, match score, and geo coordinates.

31: Gen
    31, Gen, 1.49, [40.677602, -73.96358]
   756, Gen, 1.49, [40.677602, -73.96358]
   141, Restaurant Gen Inc, 0.75, [40.677599, -73.96349]

33: Grimaldi's Pizzeria
    33, Grimaldi's Pizzeria, 1.41, [40.702677, -73.99350]
  1001, Grimaldi's Pizzeria, 1.41, [40.702677, -73.99350]
  2330, Grimaldi's Pizzeria, 1.41, [40.702677, -73.99350]
  1987, Grimaldi's, 0.52, [40.702702, -73.99345]
  2032, Grimaldi's, 0.52, [40.702677, -73.99350]

249: The Cock and Hoop
   249, The Cock and Hoop, 1.41, [52.951072, -1.14407]
  1102, The Cock and Hoop, 1.41, [52.951072, -1.14407]
  1128, The Cock & Hoop, 1.41, [52.950970, -1.14429]

897: Arnolfini bookshop
   897, Arnolfini bookshop, 1.41, [51.449433, -2.59744]
   921, Arnolfini Bookshop, 1.41, [51.449304, -2.59748]

1541: Powell's
  1541, Powell's, 1.50, [45.523022, -122.68136]
  1612, Powell's City of Books, 0.75, [45.522991, -122.68124]
  1895, Powell's City of Books, 0.75, [45.522987, -122.68123]

1542: Rogue Distillery and Pub House
  1542, Rogue Distillery and Pub House, 1.06, [45.525834, -122.68494]
  1597, Rogue Distillery and Public House, 1.06, [45.525848, -122.68504]
  1613, Rogue Distillery and Public House, 1.06, [45.525848, -122.68504]

My script was written in Python with pyes, but it would be very easy to write an equivalent one in PHP using the Elastica library, for example.
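
For readers who want a starting point without committing to a particular client library, the sketch below runs the same query through the plain HTTP search API. The index and type names (places/place) are made up for the example and are not necessarily what Mapalong uses.

import json
import urllib.request

# Hypothetical index/type names; adjust to your own mapping.
SEARCH_URL = "http://localhost:9200/places/place/_search"

def find_candidates(title, lat, lon, distance_km=0.05, size=60):
    query = {
        "query": {
            "filtered": {
                "query": {
                    "fuzzy_like_this": {
                        "like_text": title,
                        "fields": ["title"],
                        "max_query_terms": 2,
                        "prefix_length": 1,
                    }
                },
                "filter": {
                    "geo_distance": {
                        "distance": distance_km,
                        "geo": {"lat": lat, "lon": lon},
                    }
                },
            }
        },
        "size": size,
    }
    req = urllib.request.Request(
        SEARCH_URL,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        hits = json.load(resp)["hits"]["hits"]
    # Normalize raw hits into (id, score) records for the grouping sketch above.
    return [{"id": h["_id"], "score": h["_score"]} for h in hits]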

This is far from an exhaustive examination of the de-duplication issue, but I hope you can take away some ideas for how to deal with this problem in your product.

Update: @clintongormley correctly points out that max_query_terms actually specifies how many words from the beginning of the input text will be passed to the engine. I guess in our case most of the variation happens at the end of the titles, so passing the first 2 words worked fine. However, it might be better to scale max_query_terms with the length of the title or just leave it at the default of 25.
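
If you wanted to try that, scaling the parameter with the title length could be as simple as this (purely illustrative):

def query_terms_for(title, minimum=2):
    # Let the query use every word of the title, but never fewer than two terms.
    return max(minimum, len(title.split()))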


Comments on "Duplicates Detection with ElasticSearch"

  1. Peter
    28/03/2011 at 1:50 pm

    Thanks for sharing this.

    A new place can also create a new dup group, right? What if a place belongs to two groups? How are you grouping?

    For tweet duplicate detection I’ve implemented a similar approach without the fuzzy part. But to further enhance duplicate precision I’ve used the Jaccard index.

    It is also implemented with ElasticSearch now:

    https://github.com/karussell/Jetwick/blob/master/src/main/java/de/jetwick/es/ElasticTweetSearch.java#L972

  2. Andrei
    28/03/2011 at 3:00 pm

    Excellent question, Peter. The post briefly mentions what I do for a new place, but I guess I should expand on it. When a new place is created, I do the same FLT + geo query and then look at the results to see which duplicate groups they belong to (if any) and give preference to the group with the most hits. For example, if place A is being added, the results may come back with places B, E, F, G, and J. If B and E belong to dup group 1, and F, G, and J belong to dup group 2, A would be placed in group 2. If there is a tie, I pick the group with the closest centroid. If most of the results don’t have a duplicate group, then I create one and add them and the new place to it.

    Thank you for the Jaccard index reference – I’ll check it out to see whether it’s something I can use as well.

  3. Peter
    29/03/2011 at 6:07 am

    Hi Andrei, thanks for your answer. Now I understand this a bit better :)

    Regarding the Jaccard index I’m not sure … you would need to fuzzify each term before you can use it, otherwise it is too strict. (But maybe you could apply Lucene’s fuzzy algorithm directly to the terms in your app.)

    In my case fuzzifying would generate too many duplicates, I think, so I am only stemming and calculating the j-index on the stemmed terms. But I’ll try it out in the near future. Thanks again for sharing this!

  4. Dairon Medina
    29/03/2011 at 9:40 am

    Hi Andrei, nice post. I have used Lucene before with PHP, but now I’m working on a Python-based social network for organizing parties and events, and that library sounds good. I will give it a try. Nice post and best regards from Cuba,

    Dairon

  5. Iftikhan Nazeem
    11/04/2011 at 12:59 am

    Andrei,

    Thank you for sharing. I wanted to point out that the link to ElasticSearch is specified as http://elasticsearch.com instead of http://elasticsearch.org.

  6. Andrei
    11/04/2011 at 8:42 am

    Thank you. The .com is a redirect to .org, but I fixed the link anyway.

  7. Nick Berry
    02/04/2012 at 12:01 pm

    I worked on a similar problem a decade ago when I had to merge and clean up the restaurant/hotel/points-of-interest databases that were merged into Microsoft’s mapping products. Your posting brought back lots of memories.

    If I remember correctly, the entity sub-string that was the most popular was “BBQ”. It doesn’t take much thought to realize all the possible ways that “Bar-Be-Q”, “Bar BQ”, “Barbeque”, and “B-B-Q” can be written, with all possible combinations of punctuation.

    Also, many of the source materials gave a hint of their origin as hotel listings in printed directories: hotels entered in multiple ways with different combinations of word order (“Hilton Heathrow”, “Heathrow Hilton”, “London Heathrow Hilton”, etc.), as well as with additional words to try to get them listed in other sections (“Staines Hilton”, “Terminal 4 London Hilton”).

  8. Dennis
    29/09/2012 at 11:36 pm

    Hi, once you have the items in groups, how do you go about querying for items and collapsing on the groups you have created?

  9. Andrei
    30/09/2012 at 9:53 am

    The groups are stored in MongoDB in a separate collection. Each entry just lists the IDs of the places in the group. Querying is an easy $in operator, to find which group a given place belongs to.

  10. Lars Marius Garshol
    02/03/2013 at 1:19 am

    It’s interesting to see that you chose to approach this problem by simply querying the search engine directly. I’m surprised your results seem so good, because generally this is a tricky problem, where information from many fields (name, address, phone number, geoposition, etc.) all needs to be considered and weighed against one another.

    I had the same problem and chose to build a full record linkage engine on top of Lucene. That basically uses Lucene to find candidate matches (much like you do), but then does configurable detailed comparison with weighted Levenshtein, q-grams etc etc and combines results for different properties using Bayes’s Theorem. It also cleans and normalizes data before comparison.

    Even that requires a lot of tuning and work to produce good results, so, like I said, I’m surprised your results look so good. But maybe I’m missing something.

  11. Andrei
    05/03/2013 at 7:52 pm

    Your project looks quite interesting, and I’m sure it produces better-quality results than my approach. I wish I had run across it at the time. But I wanted to keep the number of pieces of technology as small as possible, and using ElasticSearch was a quick & dirty approach that seemed to yield decent results.

  12. Paweł Rychlik
    10/07/2013 at 4:24 am

    Andrei,
    I currently work on a quite similar problem with basically the same toolset. I’ve come across a problem of score relativity in ElasticSearch – i.e., the score yielded by ElasticSearch varies depending on, e.g., the number of already processed & deduplicated documents (of course the relation here is much more complex).

    Situation A:
    You are at the very beginning of your deduping process. You have very few items in your deduplicated dataset. You take a new one (say X), search for its probable duplicates against your dataset. You get some results from ES, with e.g. the highest score being 0.5.

    Situation B:
    You are far into the deduping process of your data. You have a very large set of deduplicated items already. Again – you take a new one (the same X), search for its probable duplicates against your dataset. You get some results from ES, but the scoring is now on a completely different level – say 2.0.

    In your article you mentioned that you have a static score cut-off threshold = 0.5 (which you have refined to get better results).
    How can you determine what a score of 0.5 means without knowing the whole context of your data? Shouldn’t the cut-off point be evaluated at runtime based on some statistics or the like? Maybe you would get even better results if you set the score threshold to 0.5 at the very beginning of the dedupe process, but pushed it towards, e.g., 1.5 when you have a broader amount of data?

  13. Andrei
    11/07/2013 at 11:30 am

    @Paweł, in our product we never remove the duplicates from the corpus, because they are places entered by people and it wouldn’t be nice if we simply removed their data. We used the process described in the article to generate “clusters” of places that could potentially be duplicates of one another and to represent them as such on the map.
