For a while now, I’ve been wanting to do a programming project using CouchDB and third-party web service APIs. I started out with an application to sync Launchpad bug data into CouchDB so that I could analyze it locally, a bit like Bug Hugger. It quickly got too complex for my spare time, and stalled. I’d still like to pick it up someday when I can devote more time to it.
More recently, I noticed that Jono seemed to be having a rocking good time, sending a lot of awesome tweets about jams. This was only conjecture, though, and I needed hard data: I needed to quantify just how strong these influences were.
Now, this was a project I could get done in an evening of hacking and learning.
First, I threw together this quick proof of concept to learn the Twitter API and get some tantalizing preliminary data. Behold version 1.0 of the jonometer:
#!/usr/bin/python
# python-twitter
import sys
import twitter
import re

username = 'jonobacon'
updates_wanted = 100
patterns = ['rock', 'awesome', 'jam']

class Counter:
    """A simple accumulator which counts matches of a regex"""
    def __init__(self, pattern):
        self.pattern = pattern
        self.regex = re.compile(pattern, re.I)
        self.count = 0

    def update(self, s):
        """Increment count if the string s matches the pattern"""
        if self.regex.search(s):
            self.count += 1

def main():
    client = twitter.Api()
    counters = map(Counter, patterns)
    updates_found = 0
    for update in client.GetUserTimeline(username, updates_wanted):
        updates_found += 1
        for counter in counters:
            counter.update(update.GetText())
    for counter in counters:
        print counter.pattern, counter.count

if __name__ == '__main__':
    sys.exit(main())
The output looked like this:
rock 5
awesome 6
jam 10
In other words, about 5% of Jono’s recent tweets were rocking, another 6% were awesome, and a whopping 10% were jamming! I was definitely onto something, but I had to find out more.
One of the shortcomings of this quick prototype is that it would download the data from Twitter every time I ran it. This meant that it was fairly slow (about 2 seconds for 100 tweets), which is inconvenient for experimenting with different patterns, and that I wouldn’t want to try it with larger data sets (say, thousands of tweets, or multiple people).
Enter CouchDB, the darling of the NoSQL crowd: fast, scalable and simple, it was just what I wanted for the next version of the jonometer. I replaced the Counter objects with a single Database, which stores all of the tweets in CouchDB. This was incredibly simple to do, because python-twitter provides an .AsDict() method which returns a tweet as a dictionary object, and CouchDB can store this type of data structure directly into the database. Easy!
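As a rough sketch of that storage step (not the jonometer's actual code), the function below files each tweet dictionary under its tweet ID. The name `store_tweets` and the sample data are made up for illustration, and `db` is assumed to be anything that supports `key in db` and `db[key] = value`. A plain dict works for experimenting, and a couchdb-python `Database` object should behave the same way.

```python
def store_tweets(db, tweets):
    """Store each tweet dict under its ID; return how many were new.

    `db` only needs to support `key in db` and `db[key] = value`,
    so a plain dict can stand in for a real database here.
    """
    added = 0
    for tweet in tweets:
        doc_id = str(tweet['id'])   # CouchDB document IDs are strings
        if doc_id in db:
            continue                # already stored on an earlier run
        db[doc_id] = dict(tweet)    # store a copy, leave the caller's dict alone
        added += 1
    return added


# Stand-in data shaped roughly like the output of .AsDict();
# the IDs and text are invented for this example.
sample = [
    {'id': 101, 'user': {'screen_name': 'jonobacon'},
     'text': 'Awesome jam tonight'},
    {'id': 102, 'user': {'screen_name': 'jonobacon'},
     'text': 'This rocks'},
]

db = {}
store_tweets(db, sample)
store_tweets(db, sample)  # a second run finds nothing new to add
```

Keying documents by tweet ID means re-running the sync cannot create duplicates, which matters once the same timeline is fetched more than once.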
Each time the jonometer is run, it downloads all of the new tweets since the previous run. In order to do this, it needs to keep track of the most recent tweet ID it has seen, so that it can pick up where it left off. I had originally planned to store a record in the database with the sync state, but after Stuart reminded me that Gwibber does much the same thing, I followed its example and instead calculated it using a view. Each row in the “maxid” view records the highest tweet ID seen for a particular user:
…so although the jonometer is currently Jono-specific, it could be extended easily.
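To make the idea behind "maxid" concrete, here is a minimal sketch of the map and reduce logic in plain Python. It is not the real CouchDB view definition: `maxid_map`, `maxid_reduce` and `run_view` are hypothetical names, the tweets are invented, and the grouping that CouchDB does on the server is simulated with a dictionary.

```python
def maxid_map(doc):
    """Map step: emit (username, tweet ID) for one stored tweet."""
    yield doc['user']['screen_name'], doc['id']


def maxid_reduce(values):
    """Reduce step: keep only the highest tweet ID for a key."""
    return max(values)


def run_view(docs, map_fn, reduce_fn):
    """Group mapped values by key, then reduce each group.

    This mimics what a grouped map/reduce view returns: one row per key.
    """
    groups = {}
    for doc in docs:
        for key, value in map_fn(doc):
            groups.setdefault(key, []).append(value)
    return {key: reduce_fn(values) for key, values in groups.items()}


# Invented sample documents; 'someoneelse' is a hypothetical second user.
docs = [
    {'id': 101, 'user': {'screen_name': 'jonobacon'}, 'text': 'Jam time'},
    {'id': 105, 'user': {'screen_name': 'jonobacon'}, 'text': 'Awesome'},
    {'id': 7, 'user': {'screen_name': 'someoneelse'}, 'text': 'Hello'},
]

maxid = run_view(docs, maxid_map, maxid_reduce)
last_seen = maxid.get('jonobacon')  # None would mean "never synced"
```

The value in `last_seen` is what the next run would hand to Twitter as its "only tweets newer than this" marker, so no separate sync-state record is needed.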
For the core functionality, I created a view called “matches” to count how many tweets match each pattern. For each key (username and pattern), there is a row in this view which records how many tweets from that user matched that pattern:
The null pattern is used to keep a count of the total number of tweets for that user.
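The same kind of plain-Python sketch shows how the "matches" counts and the null-pattern total fit together. Again this only illustrates the logic under some assumptions: `matches_map` and `matches_view` are hypothetical names, the tweets are invented, and the summing that a reduce function would do in CouchDB is simulated with a `Counter`.

```python
import re
from collections import Counter

PATTERNS = ['rock', 'awesome', 'jam']


def matches_map(doc):
    """Map step: emit ((username, pattern), 1) for each matching pattern."""
    user = doc['user']['screen_name']
    yield (user, None), 1               # null pattern: counts every tweet
    for pattern in PATTERNS:
        if re.search(pattern, doc['text'], re.I):
            yield (user, pattern), 1


def matches_view(docs):
    """Sum the emitted values per key, as a summing reduce would."""
    counts = Counter()
    for doc in docs:
        for key, value in matches_map(doc):
            counts[key] += value
    return dict(counts)


# Invented sample documents.
tweets = [
    {'user': {'screen_name': 'jonobacon'}, 'text': 'Awesome jam tonight'},
    {'user': {'screen_name': 'jonobacon'}, 'text': 'This ROCKS'},
    {'user': {'screen_name': 'jonobacon'}, 'text': 'Coffee time'},
]

view = matches_view(tweets)
total = view[('jonobacon', None)]
jam_percent = 100.0 * view[('jonobacon', 'jam')] / total
```

Because the total lives in the same view under the null pattern, a percentage for any pattern is just one row divided by another, with no second query over the raw tweets.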
Once the data is loaded, the runtime for the CouchDB version is only about 0.3 seconds, including the Python interpreter startup as well as checking Twitter to see if there are new tweets. I doubled the size of the database to 200 tweets (which was about all Twitter would give me in one batch), and the runtime didn’t change measurably. If I’ve done all of this right, it should scale easily up to thousands of tweets. Awesome!

Adding or changing a pattern currently requires manually deleting the view so that it can be re-created. There is probably an established pattern for dealing with this, but I don’t know what it is yet.
Here’s the code for version 2: