Duplicates Detection with ElasticSearch

» 28 March 2011 » In PHP, Tech »

As I mentioned on Twitter recently, duplicate detection (or de-duping) is one of the most unappreciated problems that the developers of certain types of applications face sooner or later. The applications I’m talking about share two main characteristics: item collection and some sort of social aspect. A prime example of this is Delicious. Users collect items (bookmarks), and, since multiple people can collect the same item (URL), there needs to be duplicate detection, so that it is possible to infer some information from the data, such as how many people saved a certain URL, what are the popular URLs of the day, etc.

In the case of URL bookmarking, de-duping is fairly easy, because usually a URL points to a unique piece of content, so grouping by the URL gives the desired result. There are some complications arising from redirections and URL shorteners, of course, but the general case is not difficult.

De-duping content that does not have a unique identifier for each item is more challenging. We ran into this while working on Mapalong, and I wanted to share the approach that we took to detect duplicates in our data.

Mapalong allows people to bookmark their favorite locations, and a placemark (what we call it) consists of a number of fields such as title, notes, tags, URL, geo coordinates, and a few more. Mapalong is a map-oriented product, and one of the ways to add a placemark is to simply zoom and drag the map to the location you want to add. Predictably, people are not always precise when they add the placemarks this way. The fields that are likely to be inconsistent are the title (e.g., The Invisible Dog versus The Invisible Dog Art Center) and the location.

The title varies, because people often just type the name in rather than copy their placemark from a known business listing. This also means they can misspell the title, use different punctuation, or give the place a title that is special to them rather than the canonical one (e.g., My first kiss instead of Mulholland Drive overlook).

The location variance is mostly due to two reasons: people’s recollection of the actual location may be wrong, and the places on the map occupy a certain area, so each person can bookmark it at a different point in that area (such as different entrances to the building, etc.). Extended places like the Google campus or a popular beach can contain multiple placemarks that are equally valid and should be treated as duplicates for some purposes.

I felt that the tags and notes fields could vary too much from one placemark to another to provide adequate matching, and decided to do duplicate detection based only on the title and location. The approach I took relies on ElasticSearch, which is a full-text search engine we use for Mapalong already. It has a powerful and fast query engine and supports geo-based search as well.

Obviously, matching the title exactly, even case-insensitively, was going to get me only halfway, due to the reasons indicated above. To deal with possible spelling errors, punctuation, and missing or extra words, I used the FuzzyLikeThis (FLT) query type, which is a combination of the MoreLikeThis query that retrieves documents similar to the given text and a Levenshtein-based fuzzy search.

To do the initial classification of places into duplicate groups, the basic algorithm is as follows:

  1. Loop through all the places.
  2. Use FLT + geo query to find places that are likely to be duplicates.
  3. Create a duplicate group from the current place and the matches.
  4. Mark these places as done, so that they don’t get added to another group.

Once the groups are built, the subsequent duplicate matching is done only when a user adds a new place; then we simply run the FLT + geo query, determine which duplicate group contains the majority of the results, and add the new place to that group.
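To make the loop concrete, here is a minimal sketch in Python (the language our script was written in); find_similar_places stands in for the FLT + geo query described below, and all names here are illustrative rather than taken from the actual script:

def build_duplicate_groups(places, find_similar_places):
    """Initial classification of places into duplicate groups.

    places: iterable of dicts with at least an 'id' key.
    find_similar_places(place): assumed to run the FLT + geo query and
    return the likely duplicates of the given place.
    """
    groups = []
    done = set()  # IDs of places already assigned to a group

    for place in places:
        if place["id"] in done:
            continue
        matches = find_similar_places(place)
        group = {place["id"]} | {m["id"] for m in matches}
        group -= done          # don't reassign places that already belong to a group
        groups.append(group)
        done |= group          # mark these places as done
    return groups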

FuzzyLikeThis has several useful tuning parameters:

max_query_terms
Specifies the maximum number of terms that will be included in the generated query
min_similarity
Excludes the terms that are too dissimilar
prefix_length
Enforces a common prefix on the fuzzified terms (which greatly affects the speed of the Levenshtein algorithm)

I experimented with different values for max_query_terms, but since the majority of the placemarks have titles of three words or less, the best results were obtained with max_query_terms set to 2. The prefix_length was set to 1, and min_similarity was left at the default of 0.5.

The FLT query can be run against multiple fields in the document. Initially, I tried matching the given text against title, notes, and tags, but there were too many false positives. In the end, I limited it to matching only on the title field.

Logically, the duplicates are likely to be fairly close to each other, so I applied the geo-distance filter to the FLT query matches and set the radius to be 50 meters (which might change, subject to testing).

The full FLT query looked like this:

{
    "query": {
        "filtered": {
            "query": {
                "fuzzy_like_this": {
                    "like_text": "Milk Bar",
                    "fields": [
                        "title"
                    ],
                    "max_query_terms": 2,
                    "prefix_length": 1
                }
            },
            "filter": {
                "geo_distance": {
                    "distance": 0.05,
                    "geo": {
                        "lat": 40.677795,
                        "lon": -73.968923
                    }
                }
            }
        }
    },
    "size": 60
}
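For illustration, here is one hedged way to run that query from Python over ElasticSearch's HTTP search API (the host, index, and type names are assumptions; the actual script used pyes):

import requests

def search_duplicates(title, lat, lon, es_url="http://localhost:9200"):
    """Run the FLT + geo_distance query for one placemark and return the raw hits."""
    query = {
        "query": {
            "filtered": {
                "query": {
                    "fuzzy_like_this": {
                        "like_text": title,
                        "fields": ["title"],
                        "max_query_terms": 2,
                        "prefix_length": 1,
                    }
                },
                "filter": {
                    "geo_distance": {
                        "distance": 0.05,  # 50 meters
                        "geo": {"lat": lat, "lon": lon},
                    }
                },
            }
        },
        "size": 60,
    }
    # Index and type names are made up for this example.
    resp = requests.post(es_url + "/mapalong/placemarks/_search", json=query)
    resp.raise_for_status()
    return resp.json()["hits"]["hits"]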

With these settings, the initial results were encouraging. The grouping seemed to work well, including fuzzy titles, and there were only a few false negatives, such as Grimaldi’s not being grouped with the places named Grimaldi’s Pizzeria. This turned out to be because of an overly aggressive score cutoff: my first take counted only the matches with a score ≥1.0, but there were a few cases where valid results had scores below 1.0, so I relaxed the cutoff to ≥0.5. With this change, the duplicate groups were 99% accurate, which was good enough for our purposes.
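The cutoff itself is then just a filter over the hits that the query returns; a minimal sketch:

MIN_SCORE = 0.5  # relaxed from the initial cutoff of 1.0

def filter_by_score(hits, min_score=MIN_SCORE):
    """Keep only the matches whose relevance score clears the cutoff."""
    return [hit for hit in hits if hit["_score"] >= min_score]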

Below are some examples of the groups. The first line in each paragraph is the placemark that was used as the source for matching, and each line in the group shows the place ID, the title, the match score, and the geo coordinates.

31: Gen
    31, Gen, 1.49, [40.677602, -73.96358]
   756, Gen, 1.49, [40.677602, -73.96358]
   141, Restaurant Gen Inc, 0.75, [40.677599, -73.96349]

33: Grimaldi's Pizzeria
    33, Grimaldi's Pizzeria, 1.41, [40.702677, -73.99350]
  1001, Grimaldi's Pizzeria, 1.41, [40.702677, -73.99350]
  2330, Grimaldi's Pizzeria, 1.41, [40.702677, -73.99350]
  1987, Grimaldi's, 0.52, [40.702702, -73.99345]
  2032, Grimaldi's, 0.52, [40.702677, -73.99350]

249: The Cock and Hoop
   249, The Cock and Hoop, 1.41, [52.951072, -1.14407]
  1102, The Cock and Hoop, 1.41, [52.951072, -1.14407]
  1128, The Cock & Hoop, 1.41, [52.950970, -1.14429]

897: Arnolfini bookshop
   897, Arnolfini bookshop, 1.41, [51.449433, -2.59744]
   921, Arnolfini Bookshop, 1.41, [51.449304, -2.59748]

1541: Powell's
  1541, Powell's, 1.50, [45.523022, -122.68136]
  1612, Powell's City of Books, 0.75, [45.522991, -122.68124]
  1895, Powell's City of Books, 0.75, [45.522987, -122.68123]

1542: Rogue Distillery and Pub House
  1542, Rogue Distillery and Pub House, 1.06, [45.525834, -122.68494]
  1597, Rogue Distillery and Public House, 1.06, [45.525848, -122.68504]
  1613, Rogue Distillery and Public House, 1.06, [45.525848, -122.68504]

My script was written in Python with pyes, but it would be very easy to write an equivalent one in PHP using the Elastica library, for example.

This is far from an exhaustive examination of the de-duplication issue, but I hope you can take away some ideas for how to deal with this problem in your product.

Update: @clintongormley correctly points out that max_query_terms actually specifies how many words from the beginning of the input text will be passed to the engine. I guess in our case most of the variation happens at the end of the titles, so passing the first 2 words worked fine. However, it might be better to vary max_query_terms with the length of the title or just leave it at the default of 25.
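If you wanted to follow the first suggestion, a tiny sketch of deriving max_query_terms from the title (hypothetical; not what the original script did):

def flt_max_query_terms(title, flt_default=25):
    # One query term per word in the title, falling back to FLT's default of 25
    # for empty or missing titles.
    words = len(title.split())
    return words if words else flt_default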
