Duplicates Detection with ElasticSearch
As I mentioned on Twitter recently, duplicate detection (or de-duping) is one of the most unappreciated problems that the developers of certain types of applications face sooner or later. The applications I’m talking about share two main characteristics: item collection and some sort of social aspect. A prime example of this is Delicious. Users collect items (bookmarks), and, since multiple people can collect the same item (URL), there needs to be duplicate detection, so that it is possible to infer some information from the data, such as how many people saved a certain URL, what are the popular URLs of the day, etc.
In the case of URL bookmarking, de-duping is fairly easy, because usually a URL points to a unique piece of content, so grouping by the URL gives the desired result. There are some complications arising from redirections and URL shorteners, of course, but the general case is not difficult.
De-duping content that does not have a unique identifier for each item is more challenging. We ran into this while working on Mapalong, and I wanted to share the approach that we took to detect duplicates in our data.
Mapalong allows people to bookmark their favorite locations, and a placemark (what we call it) consists of a number of fields such as title, notes, tags, URL, geo coordinates, and a few more. Mapalong is a map-oriented product, and one of the ways to add a placemark is to simply zoom and drag the map to the location you want to add. Predictably, people are not always precise when they add the placemarks this way. The fields that are likely to be inconsistent are the title (e.g., The Invisible Dog versus The Invisible Dog Art Center) and the location.
The title varies, because people often just type the name in rather than copy their placemark from a known business listing. This also means they can misspell the title, use different punctuation, or give the place a title that is special to them rather than the canonical one (e.g., My first kiss instead of Mulholland Drive overlook).
The location variance is mostly due to two reasons: people’s recollection of the actual location may be wrong, and the places on the map occupy a certain area, so each person can bookmark it at a different point in that area (such as different entrances to the building, etc.). Extended places like the Google campus or a popular beach can contain multiple placemarks that are equally valid and should be treated as duplicates for some purposes.
I felt that the tags and notes fields could vary too much from one placemark to another to provide adequate matching, and decided to do duplicate detection based only on the title and location. The approach I took relies on ElasticSearch, which is a full-text search engine we use for Mapalong already. It has a powerful and fast query engine and supports geo-based search as well.
Obviously, matching the title exactly, even case-insensitively, was going to get me only halfway, due to the reasons indicated above. To deal with possible spelling errors, punctuation, and missing or extra words, I used the FuzzyLikeThis (FLT) query type, which is a combination of the MoreLikeThis query that retrieves documents similar to the given text and a Levenshtein-based fuzzy search.
To do the initial classification of places into duplicate groups, the basic algorithm is as follows:
- Loop through all the places.
- Use FLT + geo query to find places that are likely to be duplicates.
- Create a duplicate group from the current place and the matches.
- Mark these places as done, so that they don’t get added to another group.
Once the groups are built, the subsequent duplicate matching is done only when a user adds a new place; then we simply run the FLT + geo query, determine which duplicate group contains the majority of the results, and add the new place to that group.
FuzzyLikeThis has several useful tuning parameters:
max_query_terms- Specifies the maximum number of terms that will be included in the generated query
min_similarity- Excludes the terms that are too dissimilar
prefix_length- Enforces a common prefix on the fuzzified terms (which greatly affects the speed of the Levenshtein algorithm)
I experimented with different values for max_query_terms, but since the majority of the placemarks have titles of three words or less, the best results were obtained with max_query_terms set to 2. The prefix_length was set to 1, and min_similarity was left at the default of 0.5.
The FLT query can be run against multiple fields in the document. Initially, I tried matching the given text against title, notes, and tags, but there were too many false positives. In the end, I limited it to matching only on the title field.
Logically, the duplicates are likely to be fairly close to each other, so I applied the geo-distance filter to the FLT query matches and set the radius to be 50 meters (which might change, subject to testing).
The full FLT query looked like this:
{
"query": {
"filtered": {
"query": {
"fuzzy_like_this": {
"like_text": "Milk Bar",
"fields": [
"title"
],
"max_query_terms": 2,
"prefix_length": 1
}
},
"filter": {
"geo_distance": {
"distance": 0.05,
"geo": {
"lat": 40.677795,
"lon": -73.968923
}
}
}
}
},
"size": 60
}
With these settings, the initial results were encouraging. The grouping seemed to work well, including fuzzy titles, and there only a few false negatives, such as Grimaldi’s not being grouped with the places named Grimaldi’s Pizzeria. This turned out to be because of a too aggressive score cutoff; my first take counted only the matches with the score ≥1.0, but there were a few cases where valid results had scores below 1.0, so I relaxed the cutoff to be ≥0.5. With this change, the duplicate groups were 99% accurate, which was good enough for our purposes.
Below are some examples of the groups. The first line in a paragraph is the placemark that was used as the source for matching and each line in the group has the place ID, title, the match score, and the geo coordinates.
31: Gen
31, Gen, 1.49, [40.677602, -73.96358]
756, Gen, 1.49, [40.677602, -73.96358]
141, Restaurant Gen Inc, 0.75, [40.677599, -73.96349]
33: Grimaldi's Pizzeria
33, Grimaldi's Pizzeria, 1.41, [40.702677, -73.99350]
1001, Grimaldi's Pizzeria, 1.41, [40.702677, -73.99350]
2330, Grimaldi's Pizzeria, 1.41, [40.702677, -73.99350]
1987, Grimaldi's, 0.52, [40.702702, -73.99345]
2032, Grimaldi's, 0.52, [40.702677, -73.99350]
249: The Cock and Hoop
249, The Cock and Hoop, 1.41, [52.951072, -1.14407]
1102, The Cock and Hoop, 1.41, [52.951072, -1.14407]
1128, The Cock & Hoop, 1.41, [52.950970, -1.14429]
897: Arnolfini bookshop
897, Arnolfini bookshop, 1.41, [51.449433, -2.59744]
921, Arnolfini Bookshop, 1.41, [51.449304, -2.59748]
1541: Powell's
1541, Powell's, 1.50, [45.523022, -122.68136]
1612, Powell's City of Books, 0.75, [45.522991, -122.68124]
1895, Powell's City of Books, 0.75, [45.522987, -122.68123]
1542: Rogue Distillery and Pub House
1542, Rogue Distillery and Pub House, 1.06, [45.525834, -122.68494]
1597, Rogue Distillery and Public House, 1.06, [45.525848, -122.68504]
1613, Rogue Distillery and Public House, 1.06, [45.525848, -122.68504]
My script was written in Python with pyes, but it would be very easy to write an equivalent one in PHP using the Elastica library, for example.
This is far from an exhaustive examination of the de-duplication issue, but I hope you can take away some ideas for how to deal with this problem in your product.
Update: @clintongormley correctly points out that max_query_terms actually specifies how many words from the beginning of the input text will be passed to the engine. I guess in our case most of the variation happens at the end of the titles, so passing the first 2 words worked fine. However, it might be better to change max_query_terms with the length of the title or just leave it at the default of 25.

Thanks for sharing this.
A new place can also create a new dup group, right? What if a place belongs to two groups? How are you grouping?
For tweet duplicate detection I’ve implemented a similar approach without the fuzzy part. But to further enhance duplicate precision I’ve used the jaccard index.
It is also implemented with ElasticSearch now:
https://github.com/karussell/Jetwick/blob/master/src/main/java/de/jetwick/es/ElasticTweetSearch.java#L972
Excellent question, Peter. The post briefly mentions what I do for a new place, but I guess I should expand on it. When a new place is created, I do the same FLT + geo query and then look at the results to see which duplicate groups they belong to (if any) and give preference to the group with the most hits. For example, if place A is being added, the results may come back with places B, E, F, G, and J. If B, and E belong to dup group 1, and F, G, and J belong to dup group 2, A would be placed in group 2. If there is a tie, I pick a group with the closest centroid. If most of the results don’t have a duplicate group, then I create one and add them and the new place to it.
Thank you for the Jaccard index reference – I’ll check it out to see whether it’s something I can use as well.
Hi Andrei, thanks for your answer. Now I understand this a bit better
Regarding the jaccard index I’m not sure … you would need to fuzzy each term before you can use it, otherwise it is too strict. (But maybe you could apply Lucene’s fuzzy algorithm directly to the terms in your app)
In my case fuzzying would generate too many duplicates I think, so I am only stemming and calculating the j-index on the stemmed terms. But I’ll try it out in the near future. Thanks again for sharing this!
Hi Andrei, nice post . I haved used lucene before with PHP but now im working on a python based social networking for organizing parties and event and that library sounds good. I will give it a try. Nice post and best reggards from Cuba,
Dairon
Andrei,
Thank you for sharing. I wanted to point out that the link to ElasticSearch is specified as http://elasticsearch.com instead of http://elasticsearch.org.
Thank you. The .com is a redirect to .org, but I fixed the link anyway.
I worked on a similar problem a decade ago when I had to merge/clean-up the restaurant/hotel/points-of-interest databases that were merged into Microsoft’s mapping products. Your posting brought lots of memories back.
If I remember correctly, the entity sub-string that was the most popular was “BBQ”. It doesn’t take much thought to realize all the possible combinations of ways that “Bar-Be-Q”, and “Bar BQ” and “Barbeque” and “B-B-Q” with all possible combinations of punctuation.
Also, many of the source materials give hint to their origin as hotel listings in printed directories. (Hotels entered in multiple ways with different combinations of word order “Hilton Heathrow”, “Heathrow Hilton”, “London Heathrow Hilton” … etc, as well as additional words to try and get them listed in other sections “Staines Hilton”, “Terminal 4 London Hilton”.
Hi, once you have the items in groups, how do you go about querying for items and collapsing on the groups you have created?
The groups are stored in MongoDB in a separate collection. Each entry just lists the IDs of the places in the group. Querying is an easy $in operator, to find which group a given place belongs to.
It’s interesting to see what you chose to approach this problem by simply querying the search engine directly. I’m surprised your results seem so good, because generally this is a tricky problem, where information from many fields (name, address, phone number, geoposition) etc all need to be considered and weighed against one another.
I had the same problem and chose to build a full record linkage engine on top of Lucene. That basically uses Lucene to find candidate matches (much like you do), but then does configurable detailed comparison with weighted Levenshtein, q-grams etc etc and combines results for different properties using Bayes’s Theorem. It also cleans and normalizes data before comparison.
Even that requires a lot of tuning and work to produce good results, so, like I said, I’m surprised your results look so good. But maybe I’m missing something.
Your project looks quite interesting, and I’m sure produces better quality results than my approach. I wish I had run across it at the time. But, I wanted to keep the number of pieces of technology as small as possible and using ElasticSearch was a quick & dirty approach that seemed to yield decent results.
This is really interesting, You are a very skilled
blogger. I’ve joined your feed and look forward to seeking more of your fantastic post. Also, I’ve shared your site in my
social networks!
Hi there to all, as I am actually keen of reading
this weblog’s post to be updated on a regular basis. It carries pleasant material.
I believe that is one of the so much vital information for me.
And i am happy studying your article. But should statement on some basic issues, The site taste is great, the articles is really excellent :
D. Good process, cheers
Your own write-up provides established necessary to
me. It’s very helpful and you’re simply certainly really experienced of this type. You have popped our eye to different opinion of this particular subject together with interesting and strong articles.
Thanks for sharing your thoughts on partypoker bonus code july.
Regards
An outstanding share! I’ve just forwarded this onto a coworker who was doing a little homework on this. And he actually ordered me lunch simply because I discovered it for him… lol. So allow me to reword this…. Thank YOU for the meal!! But yeah, thanx for spending the time to discuss this issue here on your web page.
The write-up offers proven necessary to me. It’s quite informative and you really are certainly quite experienced in this region.
You possess opened up my own sight in order to various views on this
kind of topic along with intriguing, notable and strong articles.