Duplicate Detection with ElasticSearch
As I mentioned on Twitter recently, duplicate detection (or de-duping) is one of the most unappreciated problems that developers of certain types of applications face sooner or later. The applications I’m talking about share two main characteristics: item collection and some sort of social aspect. A prime example of this is Delicious. Users collect items (bookmarks), and, since multiple people can collect the same item (URL), there needs to be duplicate detection so that it is possible to infer some information from the data, such as how many people saved a certain URL, what the popular URLs of the day are, and so on.
In the case of URL bookmarking, de-duping is fairly easy, because usually a URL points to a unique piece of content, so grouping by the URL gives the desired result. There are some complications arising from redirections and URL shorteners, of course, but the general case is not difficult.
De-duping content that does not have a unique identifier for each item is more challenging. We ran into this while working on Mapalong, and I wanted to share the approach that we took to detect duplicates in our data.
Mapalong allows people to bookmark their favorite locations, and a placemark (as we call it) consists of a number of fields such as title, notes, tags, URL, geo coordinates, and a few others. Mapalong is a map-oriented product, and one of the ways to add a placemark is to simply zoom and drag the map to the location you want to add. Predictably, people are not always precise when they add placemarks this way. The fields that are most likely to be inconsistent are the title (e.g., The Invisible Dog versus The Invisible Dog Art Center) and the location.
The title varies, because people often just type the name in rather than copy their placemark from a known business listing. This also means they can misspell the title, use different punctuation, or give the place a title that is special to them rather than the canonical one (e.g., My first kiss instead of Mulholland Drive overlook).
The location variance has two main causes: people’s recollection of the actual location may be wrong, and places on the map occupy a certain area, so each person can bookmark a different point within that area (different entrances to the building, for example). Extended places like the Google campus or a popular beach can contain multiple placemarks that are equally valid and should be treated as duplicates for some purposes.
I felt that the tags and notes fields could vary too much from one placemark to another to provide adequate matching, and decided to do duplicate detection based only on the title and location. The approach I took relies on ElasticSearch, which is a full-text search engine we use for Mapalong already. It has a powerful and fast query engine and supports geo-based search as well.
Obviously, matching the title exactly, even case-insensitively, was going to get me only halfway, due to the reasons indicated above. To deal with possible spelling errors, punctuation, and missing or extra words, I used the FuzzyLikeThis (FLT) query type, which combines the MoreLikeThis query (which retrieves documents similar to the given text) with a Levenshtein-based fuzzy search.
To do the initial classification of places into duplicate groups, the basic algorithm is as follows (a Python sketch of the loop appears after the list):
- Loop through all the places.
- Use FLT + geo query to find places that are likely to be duplicates.
- Create a duplicate group from the current place and the matches.
- Mark these places as done, so that they don’t get added to another group.
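Here is a minimal Python sketch of that loop. It is illustrative only: find_likely_duplicates stands in for the FLT + geo query described below, and the data structures (plain dicts with an "id" key) are assumptions, not our actual code.

def build_duplicate_groups(places, find_likely_duplicates):
    # find_likely_duplicates(place) is assumed to run the FLT + geo query
    # and return the matching places as dicts with an "id" key.
    groups = []        # each group is a set of place IDs
    assigned = set()   # IDs that already belong to a group

    for place in places:
        if place["id"] in assigned:
            continue   # this place was already grouped via an earlier match
        matches = find_likely_duplicates(place)
        group = {place["id"]}
        group.update(m["id"] for m in matches if m["id"] not in assigned)
        groups.append(group)
        assigned.update(group)   # mark these places as done
    return groups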
Once the groups are built, the subsequent duplicate matching is done only when a user adds a new place; then we simply run the FLT + geo query, determine which duplicate group contains the majority of the results, and add the new place to that group.
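A sketch of that incremental step might look like this; again, find_likely_duplicates and the group_of mapping (place ID to group ID) are assumed helpers for illustration, not the actual Mapalong code.

from collections import Counter

def assign_to_group(new_place, find_likely_duplicates, group_of):
    # group_of maps place ID -> group ID for places that are already grouped.
    matches = find_likely_duplicates(new_place)
    votes = Counter(group_of[m["id"]] for m in matches if m["id"] in group_of)
    if not votes:
        return None   # nothing matched an existing group; start a new one instead
    group_id, _count = votes.most_common(1)[0]   # group holding the majority of the results
    return group_id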
FuzzyLikeThis has several useful tuning parameters:
- max_query_terms - Specifies the maximum number of terms that will be included in the generated query
- min_similarity - Excludes the terms that are too dissimilar
- prefix_length - Enforces a common prefix on the fuzzified terms (which greatly affects the speed of the Levenshtein algorithm)
I experimented with different values for max_query_terms, but since the majority of the placemarks have titles of three words or less, the best results were obtained with max_query_terms set to 2. The prefix_length was set to 1, and min_similarity was left at the default of 0.5.
The FLT query can be run against multiple fields in the document. Initially, I tried matching the given text against title, notes, and tags, but there were too many false positives. In the end, I limited it to matching only on the title field.
Logically, the duplicates are likely to be fairly close to each other, so I applied the geo-distance filter to the FLT query matches and set the radius to be 50 meters (which might change, subject to testing).
The full FLT query looked like this:
{ "query": { "filtered": { "query": { "fuzzy_like_this": { "like_text": "Milk Bar", "fields": [ "title" ], "max_query_terms": 2, "prefix_length": 1 } }, "filter": { "geo_distance": { "distance": 0.05, "geo": { "lat": 40.677795, "lon": -73.968923 } } } } }, "size": 60 }
With these settings, the initial results were encouraging. The grouping seemed to work well, including fuzzy titles, and there were only a few false negatives, such as Grimaldi’s not being grouped with the places named Grimaldi’s Pizzeria. This turned out to be caused by an overly aggressive score cutoff: my first take counted only the matches with a score ≥1.0, but there were a few cases where valid results had scores below 1.0, so I relaxed the cutoff to ≥0.5. With this change, the duplicate groups were 99% accurate, which was good enough for our purposes.
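For completeness, here is a rough sketch of running the query and applying that cutoff from Python over the HTTP search API. The endpoint (a local node, a mapalong index with a placemark type) and the use of the requests library are assumptions for illustration; our actual script used pyes.

import json
import requests

# The index and type names below are assumptions for illustration only.
SEARCH_URL = "http://localhost:9200/mapalong/placemark/_search"

def search_duplicates(query, min_score=0.5):
    # Post the FLT + geo query (a dict like the one shown above) and keep
    # only the hits whose score clears the cutoff.
    resp = requests.post(SEARCH_URL, data=json.dumps(query))
    hits = resp.json()["hits"]["hits"]
    return [hit for hit in hits if hit["_score"] >= min_score]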
Below are some examples of the groups. The first line in a paragraph is the placemark that was used as the source for matching, and each line in the group has the place ID, title, the match score, and the geo coordinates.
31: Gen
  31, Gen, 1.49, [40.677602, -73.96358]
  756, Gen, 1.49, [40.677602, -73.96358]
  141, Restaurant Gen Inc, 0.75, [40.677599, -73.96349]

33: Grimaldi's Pizzeria
  33, Grimaldi's Pizzeria, 1.41, [40.702677, -73.99350]
  1001, Grimaldi's Pizzeria, 1.41, [40.702677, -73.99350]
  2330, Grimaldi's Pizzeria, 1.41, [40.702677, -73.99350]
  1987, Grimaldi's, 0.52, [40.702702, -73.99345]
  2032, Grimaldi's, 0.52, [40.702677, -73.99350]

249: The Cock and Hoop
  249, The Cock and Hoop, 1.41, [52.951072, -1.14407]
  1102, The Cock and Hoop, 1.41, [52.951072, -1.14407]
  1128, The Cock & Hoop, 1.41, [52.950970, -1.14429]

897: Arnolfini bookshop
  897, Arnolfini bookshop, 1.41, [51.449433, -2.59744]
  921, Arnolfini Bookshop, 1.41, [51.449304, -2.59748]

1541: Powell's
  1541, Powell's, 1.50, [45.523022, -122.68136]
  1612, Powell's City of Books, 0.75, [45.522991, -122.68124]
  1895, Powell's City of Books, 0.75, [45.522987, -122.68123]

1542: Rogue Distillery and Pub House
  1542, Rogue Distillery and Pub House, 1.06, [45.525834, -122.68494]
  1597, Rogue Distillery and Public House, 1.06, [45.525848, -122.68504]
  1613, Rogue Distillery and Public House, 1.06, [45.525848, -122.68504]
My script was written in Python with pyes, but it would be very easy to write an equivalent one in PHP using the Elastica library, for example.
This is far from an exhaustive examination of the de-duplication issue, but I hope you can take away some ideas for how to deal with this problem in your product.
Update: @clintongormley correctly points out that max_query_terms actually specifies how many words from the beginning of the input text will be passed to the engine. I guess in our case most of the variation happens at the end of the titles, so passing the first two words worked fine. However, it might be better to vary max_query_terms with the length of the title, or just leave it at the default of 25.