Introducing OpenLSH

A few months ago I attended a Boston Data Mining meetup where J Singh, a cool fellow techie, gave a presentation on an algorithm called Locality Sensitive Hashing (LSH). During his presentation J expressed an interest in developing an open source library that implements LSH, and a few weeks later OpenLSH was born.

If you aren't familiar with Locality Sensitive Hashing, it is a probabilistic search technique that hashes documents so that similar ones are likely to collide in the same bucket, which lets you detect similar documents without comparing every pair. It works best on high-dimensional data where the pool of candidate documents, images, etc. is extremely large (see the resource links below for more info). Because of the nature of the algorithm, we needed a data source that offered not only high-dimensional data but also a practically endless supply of it, and the Twitter public sample stream immediately came to mind as a perfect data set to start with. We also decided to use Google App Engine as our development platform, primarily because it offered free resources for testing the algorithm at scale.
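To make the "similar documents collide" idea concrete, here is a minimal sketch of one common LSH recipe: MinHash signatures split into bands. This is not OpenLSH's code. The function names (shingles, minhash_signature, candidate_pairs), the MD5-based hash family, and the parameter values are all assumptions picked for illustration.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def _hash(seed, token):
    """One member of a hypothetical hash family: MD5 of 'seed:token' as a 64-bit int."""
    data = "{}:{}".format(seed, token).encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(tokens, num_hashes=20):
    """For each seeded hash function, keep the smallest hash over all tokens."""
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def candidate_pairs(docs, num_hashes=20, bands=10):
    """Return pairs of document ids whose signatures agree on at least one band."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        signature = minhash_signature(shingles(text), num_hashes)
        for b in range(bands):
            band = tuple(signature[b * rows:(b + 1) * rows])
            # Documents with an identical band land in the same bucket (a collision).
            buckets[(b, band)].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                pairs.add((first, second))
    return pairs


tweets = {
    "t1": "just landed in boston for the data mining meetup tonight",
    "t2": "just landed in boston for the data mining meetup tonight",
    "t3": "my cat refuses to get off the keyboard again",
}
print(candidate_pairs(tweets))
```

Only documents that share a bucket become candidates, so the expensive exact comparison runs on a small subset instead of on every pair. More bands with fewer rows each catch looser matches, at the cost of more false positives.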

Trouble in paradise

After going through the trial and discovery phase of working with the Twitter streaming API and evaluating libraries to access it, we needed to get a quick and dirty Google App Engine project up and running against the streaming API. The plan was to adapt the Python script in which we already had streaming API access working. Here was the first sign of trouble. We couldn't get it working, and after a lot of swearing (on my part) and research it seemed that no one had gotten the Twitter streaming API working on GAE, and if they had, they were keeping it to themselves.

The crux of the issue is that the Twitter streaming API requires a persistent connection. This presents a problem for running applications that use this API (or any API with a similar requirement) on GAE, because by default it does not support sockets. Instead, GAE's URL Fetch service handles HTTP requests. This means two things. First, the request is farmed out to and executed on another machine (not the box where your project lives). Second, GAE ships its own version of Python's httplib, built on URL Fetch, where socket support is “turned off.”
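To see why a persistent connection matters, it helps to look at how a streaming client consumes data: it reads messages off a connection that never closes, rather than waiting for one complete response. Below is a rough sketch of that read loop, not our actual project code. It assumes the stream delivers one JSON message per line and uses blank lines as keep-alives, which is my understanding of how the Twitter streaming API behaves. The iter_messages name is made up, and an in-memory buffer stands in for the real connection.

```python
import io
import json


def iter_messages(stream):
    """Yield one decoded JSON message per non-empty line read from stream."""
    for raw_line in stream:
        line = raw_line.strip()
        if not line:
            # Assumed keep-alive: a blank line carries no message, so skip it.
            continue
        yield json.loads(line)


# Stand-in for the long-lived HTTP connection; a real one would never reach EOF.
fake_stream = io.StringIO('{"text": "first"}\r\n\r\n{"text": "second"}\r\n')
for message in iter_messages(fake_stream):
    print(message["text"])
```

A fetch-style service that waits for the whole response before handing it back never gets to run a loop like this, because the response never finishes.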

Problem solved, but it came at a price

We decided to ignore the naysayers and try to troubleshoot our way to a working solution. Once we had a Google App Engine project with streaming API code that worked locally, we deployed it and watched it blow up. There were a couple of exceptions that led to the discovery of a solution to The Great Socket Mystery of 2014.

The first exception is the generic but somewhat helpful DeadlineExceededError. This is the exception you get when you try to open the connection for the streaming API. Eventually the request times out because we never receive a response back from the URL Fetch service. After a night of consulting the great Google, I found one lonely but extremely helpful forum post about how to activate sockets for a GAE project.

You can configure your project to use the “normal” version of httplib. To do so, set the GAE_USE_SOCKETS_HTTPLIB environment variable in your project's app.yaml file:

env_variables:  
    GAE_USE_SOCKETS_HTTPLIB : True

The WTF moment of the night came after I set this variable, deployed the updated project, and was greeted by a helpful but almost comical exception: TweepError: The Socket API will be enabled for this application once billing has been enabled in the admin console.

Wow, really? This is like watching an adult movie online: it is getting to the good part and they cut the cord until you give them your credit card number. I had clearly come too far to turn back now. So I activated billing, waited for them to verify my credit card, and voilà, we have tweets from the Twitter streaming API working in a Google App Engine project. Yay!

Conclusion

Hopefully this tale of twists and turns will be helpful to someone else. I was very happy to get this working so we could move on to more important bits of the project.

The OpenLSH project is still in its infancy, so if you are interested in keeping tabs on our progress or would like to be a contributor, see our GitHub repo (link below).

Resources

Google App Engine's Sockets API (see section “Making httplib use sockets”):
https://developers.google.com/appengine/docs/python/sockets/

Google App Engine's URL Fetch Service:
https://developers.google.com/appengine/docs/python/urlfetch/

OpenLSH GitHub Repo:
https://github.com/singhj/locality-sensitive-hashing

Locality Sensitive Hashing:
http://en.wikipedia.org/wiki/Locality-sensitive_hashing