Birthday for Water
This year I’m turning 35 years old, and I’m giving up my birthday for water.
What does this mean? There is a water crisis in the world. I was vaguely aware of it, but as with many causes and charities, I didn’t feel compelled to act. Then I heard Viktoria Harrison, the co-founder of charity: water, speak at Brooklyn Beta about delivering the salvation of clean water to third-world countries and it drove the point home. Clean water not only saves lives; it enables education, jobs, and personal dignity. So, this year I’m foregoing any birthday celebrations and instead asking everyone to help me raise money for charity: water projects. Check out more on my campaign page.
Duplicates Detection with ElasticSearch
As I mentioned on Twitter recently, duplicate detection (or de-duping) is one of the most unappreciated problems that the developers of certain types of applications face sooner or later. The applications I’m talking about share two main characteristics: item collection and some sort of social aspect. A prime example of this is Delicious. Users collect items (bookmarks), and, since multiple people can collect the same item (URL), there needs to be duplicate detection, so that it is possible to infer some information from the data, such as how many people saved a certain URL, what are the popular URLs of the day, etc.
In the case of URL bookmarking, de-duping is fairly easy, because usually a URL points to a unique piece of content, so grouping by the URL gives the desired result. There are some complications arising from redirections and URL shorteners, of course, but the general case is not difficult.
De-duping content that does not have a unique identifier for each item is more challenging. We ran into this while working on Mapalong, and I wanted to share the approach that we took to detect duplicates in our data.
Mapalong allows people to bookmark their favorite locations, and a placemark (what we call it) consists of a number of fields such as title, notes, tags, URL, geo coordinates, and a few more. Mapalong is a map-oriented product, and one of the ways to add a placemark is to simply zoom and drag the map to the location you want to add. Predictably, people are not always precise when they add the placemarks this way. The fields that are likely to be inconsistent are the title (e.g., The Invisible Dog versus The Invisible Dog Art Center) and the location.
The title varies, because people often just type the name in rather than copy their placemark from a known business listing. This also means they can misspell the title, use different punctuation, or give the place a title that is special to them rather than the canonical one (e.g., My first kiss instead of Mulholland Drive overlook).
The location variance is mostly due to two reasons: people’s recollection of the actual location may be wrong, and the places on the map occupy a certain area, so each person can bookmark it at a different point in that area (such as different entrances to the building, etc.). Extended places like the Google campus or a popular beach can contain multiple placemarks that are equally valid and should be treated as duplicates for some purposes.
I felt that the tags and notes fields could vary too much from one placemark to another to provide adequate matching, and decided to do duplicate detection based only on the title and location. The approach I took relies on ElasticSearch, the full-text search engine we already use for Mapalong. It has a powerful and fast query engine and supports geo-based search as well.
Obviously, matching the title exactly, even case-insensitively, was going to get me only halfway, due to the reasons indicated above. To deal with possible spelling errors, punctuation, and missing or extra words, I used the FuzzyLikeThis (FLT) query type, which is a combination of the MoreLikeThis query that retrieves documents similar to the given text and a Levenshtein-based fuzzy search.
To do the initial classification of places into duplicate groups, the basic algorithm is as follows:
- Loop through all the places.
- Use FLT + geo query to find places that are likely to be duplicates.
- Create a duplicate group from the current place and the matches.
- Mark these places as done, so that they don’t get added to another group.
Once the groups are built, the subsequent duplicate matching is done only when a user adds a new place; then we simply run the FLT + geo query, determine which duplicate group contains the majority of the results, and add the new place to that group.
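To make the two passes concrete, here is a rough Python sketch. The data shapes and the find_similar helper (which would run the FLT + geo query described below and return the ids of likely duplicates) are assumptions for illustration, not the actual Mapalong code.

# A minimal sketch of the initial classification pass (data shapes assumed).
def build_duplicate_groups(places, find_similar):
    """places: iterable of dicts with at least an 'id' key.
    find_similar(place): runs the FLT + geo query and returns ids of likely duplicates."""
    groups = []       # each group is a set of place ids
    assigned = set()  # ids that already belong to some group

    for place in places:
        if place["id"] in assigned:
            continue
        # Group the current place with its matches, minus anything already grouped.
        group = ({place["id"]} | set(find_similar(place))) - assigned
        groups.append(group)
        assigned |= group
    return groups

def assign_new_place(place, groups, find_similar):
    """Later on, put a newly added place into the group holding most of its matches."""
    matches = set(find_similar(place))
    best = max(groups, key=lambda g: len(g & matches)) if groups else None
    if not best or not (best & matches):
        best = set()          # no overlap with any existing group: start a new one
        groups.append(best)
    best.add(place["id"])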
FuzzyLikeThis has several useful tuning parameters:

- max_query_terms – specifies the maximum number of terms that will be included in the generated query
- min_similarity – excludes the terms that are too dissimilar
- prefix_length – enforces a common prefix on the fuzzified terms (which greatly affects the speed of the Levenshtein algorithm)

I experimented with different values for max_query_terms, but since the majority of the placemarks have titles of three words or less, the best results were obtained with max_query_terms set to 2. The prefix_length was set to 1, and min_similarity was left at the default of 0.5.
The FLT query can be run against multiple fields in the document. Initially, I tried matching the given text against title, notes, and tags, but there were too many false positives. In the end, I limited it to matching only on the title field.
Logically, the duplicates are likely to be fairly close to each other, so I applied the geo-distance filter to the FLT query matches and set the radius to be 50 meters (which might change, subject to testing).
The full FLT query looked like this:
{ "query": { "filtered": { "query": { "fuzzy_like_this": { "like_text": "Milk Bar", "fields": [ "title" ], "max_query_terms": 2, "prefix_length": 1 } }, "filter": { "geo_distance": { "distance": 0.05, "geo": { "lat": 40.677795, "lon": -73.968923 } } } } }, "size": 60 }
With these settings, the initial results were encouraging. The grouping seemed to work well, including fuzzy titles, and there were only a few false negatives, such as Grimaldi’s not being grouped with the places named Grimaldi’s Pizzeria. This turned out to be because of an overly aggressive score cutoff; my first take counted only the matches with a score of ≥1.0, but there were a few cases where valid results had scores below 1.0, so I relaxed the cutoff to ≥0.5. With this change, the duplicate groups were 99% accurate, which was good enough for our purposes.
Below are some examples of the groups. The first line in each group is the placemark that was used as the source for matching, and each line in the group has the place ID, title, match score, and geo coordinates.
31: Gen
  31, Gen, 1.49, [40.677602, -73.96358]
  756, Gen, 1.49, [40.677602, -73.96358]
  141, Restaurant Gen Inc, 0.75, [40.677599, -73.96349]

33: Grimaldi's Pizzeria
  33, Grimaldi's Pizzeria, 1.41, [40.702677, -73.99350]
  1001, Grimaldi's Pizzeria, 1.41, [40.702677, -73.99350]
  2330, Grimaldi's Pizzeria, 1.41, [40.702677, -73.99350]
  1987, Grimaldi's, 0.52, [40.702702, -73.99345]
  2032, Grimaldi's, 0.52, [40.702677, -73.99350]

249: The Cock and Hoop
  249, The Cock and Hoop, 1.41, [52.951072, -1.14407]
  1102, The Cock and Hoop, 1.41, [52.951072, -1.14407]
  1128, The Cock & Hoop, 1.41, [52.950970, -1.14429]

897: Arnolfini bookshop
  897, Arnolfini bookshop, 1.41, [51.449433, -2.59744]
  921, Arnolfini Bookshop, 1.41, [51.449304, -2.59748]

1541: Powell's
  1541, Powell's, 1.50, [45.523022, -122.68136]
  1612, Powell's City of Books, 0.75, [45.522991, -122.68124]
  1895, Powell's City of Books, 0.75, [45.522987, -122.68123]

1542: Rogue Distillery and Pub House
  1542, Rogue Distillery and Pub House, 1.06, [45.525834, -122.68494]
  1597, Rogue Distillery and Public House, 1.06, [45.525848, -122.68504]
  1613, Rogue Distillery and Public House, 1.06, [45.525848, -122.68504]
My script was written in Python with pyes, but it would be very easy to write an equivalent one in PHP using the Elastica library, for example.
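In case a concrete example helps, here is a minimal sketch that skips the client libraries and posts the query from above straight to ElasticSearch’s REST search endpoint. The requests library, the index name places, and the host are assumptions for illustration; the query body itself is the one shown earlier (in the fuzzy_like_this/filtered syntax of the ElasticSearch versions of that era).

import requests  # assumed; any HTTP client will do

def find_similar(title, lat, lon, host="http://localhost:9200", index="places"):
    """Run the FLT + geo_distance query shown above and return the raw hits."""
    query = {
        "query": {
            "filtered": {
                "query": {
                    "fuzzy_like_this": {
                        "like_text": title,
                        "fields": ["title"],
                        "max_query_terms": 2,
                        "prefix_length": 1,
                    }
                },
                "filter": {
                    "geo_distance": {
                        "distance": 0.05,  # kilometers, i.e. the 50-meter radius
                        "geo": {"lat": lat, "lon": lon},
                    }
                },
            }
        },
        "size": 60,
    }
    resp = requests.post("%s/%s/_search" % (host, index), json=query)
    resp.raise_for_status()
    return resp.json()["hits"]["hits"]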
This is far from an exhaustive examination of the de-duplication issue, but I hope you can take away some ideas for how to deal with this problem in your product.
Update: @clintongormley correctly points out that max_query_terms actually specifies how many words from the beginning of the input text will be passed to the engine. I guess in our case most of the variation happens at the end of the titles, so passing the first 2 words worked fine. However, it might be better to vary max_query_terms with the length of the title, or just leave it at the default of 25.
Ideas of March
I miss blogging and the conversations that used to go with it. My friend Chris Shiflett probably feels the same way, because he is trying to start a blog revival with his Ideas of March post.
I used to blog much more actively, but with Twitter gaining momentum in early 2008, I found it easier to fire off a quick thought, observation, funny link, or another short piece there rather than put it on my blog. It helped that building up an audience on Twitter was much faster due to the one-click nature of creating relationships and keeping track of content. I feel that the problem of doing the same for blog conversations has not yet been solved, despite services like Disqus.
Still, I think regular blogs remain relevant and valuable, because they do what the microblogging services cannot. Here are a few reasons why I like blogs:
- Blog posts usually present a more complete and rounded point of view rather than a quip.
- Blog posts require time to research, ruminate on, and rewrite before publishing, which leads to higher quality content.
- They are indexed by real search engines (Google), rather than the half-baked solution Twitter uses that cannot search more than 2 weeks into the past.
- Posts can be saved to Instapaper or a similar service.
I want to get back to blogging, but a couple of things are slowing me down. Most of the things that I like to blog about fit into 3 broad areas: tech stuff — coding, architecture, interesting problems and solutions to them, etc.; opinion pieces on various topics; and food & drink. So far I have been mixing them all in one blog, because I am not sure if it’s worth breaking out each one into a separate blog on the main site or even into a different site. I would be really interested to know if people mind reading mixed content like that or if they prefer a more compartmentalized approach. I also want a better design for my blog, because I don’t feel that the current one allows me to present richer content well, like embedded photos, videos, code samples, and so on. Hopefully, I can find a designer who can help me with that.
In the end, really, it’s all about overcoming inertia and actually writing posts rather than just thinking about them. So I promise to write at least 2 blog posts before the end of March and to resolve the aforementioned issues soon to help me blog more.
The blogs are dead. Long live the blogs.
PHP UK 2011
Becoming an American citizen a couple of years ago greatly eased travel-related headaches for me – no more visas to worry about (in general), way faster processing at passport control, and so forth. Because of this, I was finally able to make it to the PHP UK conference this year, and I have to say that I enjoyed it very much. The organizers did a great job with the venue, the social events, food, drinks, making sure things ran on time, and a host of other things that I’m sure I didn’t even notice. A really nice touch was the badges that had the name and the session schedule printed on both sides. It was also great to meet a bunch of new people, including some that I only knew from the mailing lists or Twitter. This was probably one of the best community-organized conferences that I’ve been to.
I gave a talk on ElasticSearch, which is part of the technology stack behind Mapalong. This was the first incarnation of the talk, but I thought it went fairly well, though I did run slightly over time and couldn’t do all the demos that I wanted to. The slides are available on the Talks page.
42 Hills
I love San Francisco and I love to walk. So I started a new Tumblr blog called 42 Hills that tracks my progress through all 27 walks in the Stairway Walks of San Francisco book. Check it out and join me, if you like.
Photo Advent is here
My friends Kara, Marie, and Terry started a new project called Photo Advent this year based on a very simple idea: 25 days, 25 photos. In the spirit of Christmas, they’ve asked 25 photographers to each select a photo or two and share it with you. Well, they were kind enough to ask me as well, so I contributed one on stepping off the beaten path when taking your photos. Hope you like it.
Angel's Share Experiment
I haven’t yet compiled a list of my top 10 beers, but if I did, Lost Abbey Angel’s Share would definitely be on it. I first tasted it at the 2010 Barleywine Festival at Toronado, and it was love from the first sip. The intense, rich maltiness, full body, and surprisingly dry finish — enhanced by the bourbon, oak, and vanilla character developed by aging it in Heaven Hill Wheat Whiskey barrels — make it a beer to be savored and appreciated, especially at the end of the day in the company of good friends.
Here in California, we are sometimes lucky enough to find Angel’s Share on draft, and it might even be better than the bottled version. I even love the name, which refers to the portion (share) of a wine or spirit’s volume that is lost to evaporation during aging in the barrels.
Needless to say, I really want to brew a beer approaching this greatness.
The reason is not even economics. A small (375 ml) bottle of Angel’s Share goes for around $16–18 locally, but I consider that a fair price, given the ingredients, effort, and time that go into its making. I would simply like to extend my skills, learn, and, in the process, hopefully brew a great beer.
Now, there are a few obstacles. First, Lost Abbey does not release the base beer that goes into the bourbon barrels, and there is virtually no information on its composition, aside from a note on the website that it is “brewed with copious amounts of dark caramel malt.” Sources say that it is an English-style barleywine, an Imperial brown ale, or an American strong ale, so developing a clone recipe is a challenge. Second, I don’t have a bourbon barrel to age the beer in, and even if I did, the logistics of brewing 55 gallons of beer, fermenting it, aging it for a year, and then bottling over 300 bombers (22-oz bottles) are, shall we say, formidable. The closest thing a homebrewer can do to simulate this process is to use oak cubes and then add some bourbon at bottling time.
Unable to find a clone recipe of the base beer, I decided to use the recipe from The Mad Fermentationist, who said that it was “inspired by, but not a clone of, Angel’s Share.” I am still learning how to craft beers from scratch, but I couldn’t stay away from tweaking his recipe anyway. I removed the Crystal 55L malt, cut the wheat malt by two thirds, and added a good amount of Munich malt and a bit of Carafa II Special. Whether that was a mistake or not will be shown by the end result. All of the hops I used were leftovers that I’ve been keeping in the fridge, so the alpha acid percentage had to be adjusted down somewhat. A beer like this doesn’t really need much hop character, so a variety of hops can be used for bittering.
I also decided to brew a smaller batch, since the recipe is very experimental, and I don’t really need 5 gallons of 11% barleywine. An added bonus was that I could brew it on my stovetop, and ferment it in a 3-gallon Better Bottle.
Since this would be the biggest (highest ABV) beer I’ve ever brewed, I wanted to do it properly to ensure that the fermentation doesn’t get stuck halfway. To help with this, I made a 1.5 L starter from a very fresh vial of WLP001 yeast. I also followed the advice in the How to brew a really BIG beer article, which suggests a long step mash (146–149°F, then 154–156°F) to make a very fermentable wort. For some reason, I thought that raising the mash temperature by decoction was a good idea. It wasn’t. The decoction boiled, but after adding it back to the main mash, the temperature hardly increased, so I had to improvise. More in the notes.
Given my recipe hacking and the troubles I encountered, I doubt this will be close to Angel’s Share, but I still hope it will be an enjoyable barleywine.
Overview
――――――――
Type: All-grain
Batch Size: 2.5 gal
Total Grain: 9.72 lbs
Expected OG: 1.102
Expected SRM: 22.6
Expected IBU (Rager): 82.1
Brewhouse Efficiency: 75%
Wort boil time: 120 min
Fermentation Temperature: 67°F
Fermentables
————————————
6.75 lbs. Maris Otter 66.1%
1.75 lbs. Munich 10L Malt (US) 17.1%
0.32 lbs. White Wheat Malt 3.1%
0.32 lbs. CaraVienna Malt (Belgium) 3.1%
0.32 lbs. CaraMunich Malt (Belgium) 3.1%
0.13 lbs. Special B Malt (Belgium) 1.2%
0.09 lbs. Chocolate Malt (US) 0.9%
0.06 lbs. Carafa II Special (Germany) 0.6%
0.25 lbs. Sugar – Muscovado 4.9%
Hops
————
0.6 oz. Galena [13.8% AA, pellet] @ 60 min.
0.4 oz. Challenger [6.5% AA, pellet] @ 30 min.
0.4 oz. Cascade [6.5% AA, pellet] @ 30 min.
Extras
——————
1 Whirlfloc tablet @ 15 min.
1 Servomyces yeast nutrient tablet @ 10 min.
0.8 oz American/Hungarian oak cubes secondary
Yeast
—————
White Labs WLP001 – California Ale
Water Profile
—————————————
San Francisco tap
Mash Schedule
—————————————
Type: step mash
Saccharification rest: 45 min. @ 147°F
Saccharification rest: 60 min. @ 155°F
Notes
—————
6/19/10
Brew day, by myself.
Got some dough balls when mashing in, but the temp was 147°F exactly. After 15 minutes, pulled 3 quarts of medium-thickness mash into the kettle, and raised temp to 154°F and held 15 minutes for conversion.
Realized the chocolate malt I added from a leftover bag was uncrushed, so substituted 2.25 oz of crushed pale chocolate malt instead. Raised decoction temp to boiling, boiled for 15 minutes. Added back to the mash tun, but the temperature didn’t seem to go up at all. Infused 3 quarts of boiling water (by infusion calculator), temperature rose to only 151°F. Decided to leave at that, because adding more water would give me more than the desired boil volume. Mashed for another 40 minutes. Collected 3.75 gallons of 1.063 SG wort. Did a sparge with 1 quart to get more sugars out. Added after 30 minutes of the main boil. Obviously, the efficiency suffered a bit with such a small sparge.
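(For reference, the math behind such an infusion calculator is the standard infusion equation; a quick sketch follows, where the 12 quarts of mash water is only an illustrative guess since these notes don’t record the exact mash thickness. Heat soaked up by the tun and the grain bed is the usual reason the real rise falls short of the prediction.)

def infusion_quarts(grain_lbs, mash_quarts, current_f, target_f, water_f=212.0):
    """Quarts of water at water_f needed to raise the mash from current_f to target_f.
    Standard infusion equation: Wa = (T2 - T1) * (0.2 * G + Wm) / (Tw - T2)."""
    return (target_f - current_f) * (0.2 * grain_lbs + mash_quarts) / (water_f - target_f)

# Illustrative numbers only: ~9.7 lbs of grain, a guessed 12 quarts of mash water,
# going from 147°F to 155°F with boiling (212°F) water.
print(round(infusion_quarts(9.72, 12.0, 147, 155), 1))  # about 2 quarts on paper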
Extended pre-hops boil by 30 minutes to evaporate more. Decided to add second batch of hops at -20 minutes to bring down IBUs a tad. Added muscovado sugar at flameout.
Chilled to 70°F, transferred to fermenter. Pre-pitch volume was 2.3 gallons. Aerated well by shaking, then pitched the active starter (made the night before). Placed in cooler with ice pack. Signs of fermentation (krausen and airlock) after only 2.5 hours.
6/20/10
Good strong fermentation going by morning, nice thick krausen. Temp is staying around 64°F.
6/21/10
Realized that Beer Alchemy was set to Tinseth formula instead of Rager while I was tweaking the IBUs, so most likely I over-bittered the barleywine. Womp womp. Dropped 6 oak cubes (3 American, 3 Hungarian) into the carboy.
6/22/10
Swirled the carboy to expel sulfur. Blowoff followed shortly, so attached a blowoff tube.
6/27/10
Temperature is up to 69°F. Gravity is down to 1.019.
7/7/10
Racked to the secondary on top of 0.8 oz of Hungarian-American oak cube mixture that’s been soaking in bourbon.
8/6/10
After a month on oak, the wood character is definitely present. The bitterness is there, but not overwhelming. Probably best to bottle soon.
8/15/10
Bottled with 1.5 oz of dextrose.
The Prime That Wasn't
I’ve seen a lot of problems solved with regular expressions, but last Friday, thanks to Chris and Sean, I found out that one can use them to determine whether a given integer is a prime number. The original articles showed the following regular expression:
/^1?$|^(11+?)\1+$/
You don’t apply this to the integer itself. Instead, you create a string of repeating 1s, where the number of 1s is defined by the number you’re testing, and apply the regex to that. If it fails to match, the number is a prime. The regex relies on backtracking and thus would not work with a DFA-based engine, such as PHP’s (now deprecated) ereg* functions. But it works just fine with the preg_* ones, or so I thought (more on that in a bit).
So, how does it work? It may seem like a brain buster, but it’s actually pretty simple. Ignore the portion before the alternation operator (|), because it’s just a short-circuit case for checking whether the input string is empty or has a single 1. (If the string is empty, it means the number being tested is 0. If the string is a single 1, it means the number being tested is 1. Neither of these is prime.) The portion after the pipe is where the real magic happens.
The (11+?) sub-pattern matches the strings 11, 111, and so on. The \1+ matches whatever the sub-pattern matched, one or more times. So on the first try, the engine will match 11 and then attempt to match the same thing one or more times in the rest of the string. If it succeeds, the number is not a prime. Why? Because it just proved that the length of the string is divisible by 2 (the length of 11), and therefore the original number is divisible by 2. If the overall match fails, the engine backtracks to the beginning and tries to match 111 two or more times, and so on successively. If the first sub-pattern gets long enough (n/2, basically) and the overall match still fails, the number is a prime. Beautiful, isn’t it?
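If you want to play with it, here is a minimal sketch of the check in Python (our experiments used PHP’s preg_* functions, but any backtracking engine works the same way):

import re

# A number n is composite exactly when a string of n ones can be split
# into two or more equal chunks of length >= 2 (i.e. n has a divisor >= 2).
PRIME_RE = re.compile(r"^1?$|^(11+?)\1+$")

def is_prime_regex(n):
    # Prime if the regex fails to match a string of n ones.
    return PRIME_RE.match("1" * n) is None

print([n for n in range(2, 30) if is_prime_regex(n)])
# [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]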
Coincidentally, Sean recently created a code evaluation plugin for the Phergie-based IRC bot that runs in the channel we hang out in. The plugin is a simple proxy to ideone.com, but is helpful for doing quick code tests. We had fun experimenting with this regex pattern implemented in a PHP function that returns the next largest prime number after a given one. The trouble started when Sean fed it 99999999, and the code spit out 100000001. This didn’t seem right, and Wolfram Alpha confirmed; the answer we got was not a prime. (It factors into 17 * 5882353.)
A few similar tests also gave us numbers that were not prime. But where was the problem? The PHP code was too simple to have a bug, many answers were correct, and the regex algorithm itself was sound. It was time for brute force. I quickly wrote some code to feed increasing odd integers into the regex pattern and check its answers against the normal sieve algorithm to see where it started failing. The first number to fail was 22201; the regex said it was prime, but it’s actually a perfect square (149²). Past that, the failure rate increased.
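The brute-force check was along these lines; a rough sketch in Python with plain trial division standing in for the sieve. Whether it finds any disagreements depends entirely on the regex engine and its limits, which is exactly where the story goes next.

import re

PRIME_RE = re.compile(r"^1?$|^(11+?)\1+$")

def is_prime_regex(n):
    # Same regex check as in the earlier sketch.
    return PRIME_RE.match("1" * n) is None

def is_prime_trial(n):
    # Plain trial division as the ground truth.
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

# Feed increasing odd integers to both checks and report any disagreements.
# Python's re has no equivalent of pcre.backtrack_limit, so it grinds through
# long strings rather than silently giving up; with PHP's preg_match and the
# default limit, the first disagreement showed up at 22201.
for n in range(3, 1000, 2):
    if is_prime_regex(n) != is_prime_trial(n):
        print("regex disagrees with trial division at", n)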
It then dawned on me that the culprit might be the backtracking itself, specifically how it was implemented in PCRE, the library at the heart of PHP’s regular expressions. As I mention in my Regex Clinic talk, unbounded backtracking can dramatically slow down the matching and should be well understood before attempting to write complex patterns. To manage the danger of runaway patterns, PCRE implemented the pcre.backtrack_limit setting a few years ago. In our case, backtracking is used to break up the string of 1s into successively larger chunks, and, with very long strings, the engine may run into this backtracking limit, which is set to 100000 by default. My guess was that with the 22201-character-long string, the default was not enough. Once the limit is reached, the match fails, and the number is declared prime.
I bumped the limit up to 200000, and voilà, 22201 was no longer declared prime. To fix the 100000001 match, the limit had to be raised dramatically, to around 250000000! And, the program took 14 seconds to deliver the verdict on my new i5 MacBook Pro. Needless to say, don’t use this prime determination algorithm for any sort of real-life stuff. Instead, just appreciate it for its elegance. I certainly did, and was glad that my bit of investigative work showed that abstract, pure, good ideas can still fail in the messy world of software and hardware.
Time Machine Forensics
About a year ago, I finally wised up to the obvious: 1) losing data sucks, and 2) doing daily (or frequent) backups manually is a pain in the ass, especially for portables. 2) is why most people forgo backups in the first place, resulting in 1). I needed something that I could set up once and pretty much forget about. Thankfully, there was just the thing: Time Capsule — a single device that combines an AirPort Extreme base station and a huge hard drive, and can be used for automatic, transparent, wireless backups from any of your Macs via Time Machine.
I purchased the 1 TB version, configured it, and have been using it ever since. It has already saved my butt a couple of times, when I accidentally deleted (rm -rf) a directory and had that familiar sinking feeling, but then remembered that Time Machine was there, ready to lend a hand.
However, I noticed recently that the hourly backups were taking longer and growing in size, sometimes to 1 GB or more. It didn’t seem like there was that much data changing on an hourly basis, so I set out to investigate. A quick look around showed that Time Machine itself will not reveal the details of what it backs up, but someone wrote a script called timedog that displays the files that Time Machine backed up during its most recent run (or any run of your choosing). The output of the script looks something like this (abbreviated):
# cd /Volumes/Backup\ of\ foo/Backups.backupdb/foo
# timedog -d 5 -l
==> Comparing TM backup 2010-06-29-160846 to 2010-06-29-101853
     399B->     399B   [1] /Mac HD/Library/Application Support/CrashReporter/
[..skip..]
   52.6KB->   53.0KB       /Mac HD/Users/andrei/.dbshell
     863B->     866B       /Mac HD/Users/andrei/.lesshst
   11.0KB->   15.1KB       /Mac HD/Users/andrei/.viminfo
   10.4KB->   30.7KB       /Mac HD/Users/andrei/.zsh_hist
    6.7MB->    6.7MB       /Mac HD/Users/andrei/.dropbox/dropbox.db
    3.9MB->    3.9MB       /Mac HD/Users/andrei/.dropbox/migrate.db
   25.2MB->   50.3MB  [10] /Mac HD/Users/andrei/.dropbox/cache/
   21.0KB->   21.0KB   [1] /Mac HD/Users/andrei/Desktop/
  120.0MB->  120.4MB   [1] /Mac HD/Users/andrei/Documents/Linkinus 2 Logs/
  142.8MB->  146.2MB [156] /Mac HD/Users/andrei/Library/Application Support/
[..skip..]
  608.0MB->  608.0MB   [5] /Mac HD/private/var/data/mongodb/
[..skip..]
==> Total Backup: 967 changed files/directories, 1.88GB
Looking at this, a couple of offenders are immediately obvious. Linkinus (an IRC client) keeps all the conversation logs in one single file, and since I’m frequently on IRC, that file grows and gets backed up every hour. The data files for MongoDB that I use for development are also fairly large and change often. Finally, there is something in Library/Application Support, but it wasn’t shown because of the depth limit I set. After increasing the depth to 7, I discovered that it was the Google Chrome cache and history.
Conversation logs and Chrome stuff are not important enough for me to back up hourly, and the MongoDB data I can copy periodically to an off-site server. By excluding these 3 items from the backup process via Time Machine Preferences, I was able to reduce the size of the hourlies to 50 MB or less.
This brings up an important point though: while Time Machine is great for doing automated backups over your local network, you should have a separate copy of the data off-site, for redundancy. How and when to do the off-site backups varies according to everyone’s needs, but I would suggest something like CrashPlan, which does unlimited online backups for about $5/month. Once again, it’s automatic and hands-off, which is how you want it.
Tools I found useful while investigating this Time Machine issue:
- timedog – the script used above
- Time Machine Buddy – a Dashboard widget that organizes and displays the backup logs
- Time Machine Scheduler – in case you want to change how often the backups are done