We'll see | Matt Zimmerman

a potpourri of mirth and madness


Introducing the jonometer

a learning experiment using Python, Twitter, CouchDB and desktopcouch

For a while now, I’ve been wanting to do a programming project using CouchDB and third-party web service APIs. I started out with an application to sync Launchpad bug data into CouchDB so that I could analyze it locally, a bit like Bug Hugger. It quickly got too complex for my spare time, and stalled. I’d still like to pick it up someday when I can devote more time to it.

More recently, I noticed that Jono seemed to be having a rocking good time, sending a lot of awesome tweets about jams. This was only conjecture, though, and I needed hard data: I needed to quantify just how strong these influences were.

Now, this was a project I could get done in an evening of hacking and learning.

Version One

First, I threw together this quick proof of concept to learn the Twitter API and get some tantalizing preliminary data. Behold version 1.0 of the jonometer:

#!/usr/bin/python

# Requires: python-twitter

import sys
import twitter
import re

username = 'jonobacon'
updates_wanted = 100
patterns = ['rock', 'awesome', 'jam']

class Counter:
    """A simple accumulator which counts matches of a regex"""

    def __init__(self, pattern):
        self.pattern = pattern
        self.regex = re.compile(pattern, re.I)
        self.count = 0

    def update(self, s):
        """Increment count if the string s matches the pattern"""
        if self.regex.search(s):
            self.count += 1

def main():
    client = twitter.Api()
    # Build a real list, since it is iterated once per tweet and again for output
    counters = [Counter(pattern) for pattern in patterns]
    updates_found = 0
    for update in client.GetUserTimeline(username, updates_wanted):
        updates_found += 1
        for counter in counters:
            counter.update(update.GetText())

    for counter in counters:
        print counter.pattern, counter.count

if __name__ == '__main__':
    sys.exit(main())

The output looked like this:

rock 5
awesome 6
jam 10

In other words, about 5% of Jono’s recent tweets were rocking, another 6% were awesome, and a whopping 10% were jamming! I was definitely onto something, but I had to find out more.
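The percentages follow directly from the counts, because the script asked for 100 tweets. Here is a minimal sketch of that arithmetic, written for Python 3 (unlike the Python 2 listing above). The function name `match_percentages` and the sample strings are made up for illustration; they are not real tweets.

```python
import re

def match_percentages(texts, patterns):
    """Return {pattern: percentage of texts matching it, ignoring case}."""
    total = len(texts)
    result = {}
    for pattern in patterns:
        regex = re.compile(pattern, re.I)
        hits = sum(1 for text in texts if regex.search(text))
        # Guard against an empty timeline to avoid dividing by zero
        result[pattern] = 100.0 * hits / total if total else 0.0
    return result

# Hypothetical sample data, not real tweets
sample = [
    "This is going to ROCK",
    "Awesome jam session tonight",
    "Writing documentation",
    "Jamming again",
]
print(match_percentages(sample, ["rock", "awesome", "jam"]))
```

Like the Counter class, this counts each tweet at most once per pattern, so a tweet that says "jam" twice still contributes a single match.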

One of the shortcomings of this quick prototype is that it would download the data from Twitter every time I ran it. This meant that it was fairly slow (about 2 seconds for 100 tweets), which is inconvenient for experimenting with different patterns, and that I wouldn’t want to try it with larger data sets (say, thousands of tweets, or multiple people).

Version Two

Enter CouchDB, the darling of the NoSQL crowd: fast, scalable and simple, it was just what I wanted for the next version of the jonometer. I replaced the Counter objects with a single Database, which stores all of the tweets in CouchDB. This was incredibly simple to do, because python-twitter provides an .AsDict() method which returns a tweet as a dictionary object, and CouchDB can store this type of data structure directly into the database. Easy!

Each time the jonometer is run, it downloads all of the new tweets since the previous run. In order to do this, it needs to keep track of the most recent tweet ID it has seen, so that it can pick up where it left off. I had originally planned to store a record in the database with the sync state, but after Stuart reminded me that Gwibber does much the same thing, I followed its example and instead calculated it using a view. Each row in the “maxid” view records the highest tweet ID seen for a particular user:

The maxid view
Key          Value
jonobacon    10743678774

…so although the jonometer is currently Jono-specific, it could be extended easily.
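To make the idea concrete, here is a rough pure-Python model of what the maxid view computes: a map step that emits one (username, tweet ID) pair per document, and a reduce step that keeps the largest ID per username. It is a sketch in Python 3, not CouchDB itself. The helper names and all sample documents other than the ID shown in the table are invented for illustration, and it assumes the view is queried grouped by key.

```python
from collections import defaultdict

def map_maxid(doc):
    """Model of the map step: emit (screen_name, tweet id) for one tweet."""
    yield doc["user"]["screen_name"], doc["id"]

def reduce_max(values):
    """Model of the reduce step: keep only the largest id."""
    return max(values)

def maxid_view(docs):
    """Group the emitted values by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_maxid(doc):
            grouped[key].append(value)
    return {key: reduce_max(values) for key, values in grouped.items()}

# Hypothetical stored tweets; only the first id comes from the table above
docs = [
    {"id": 10743678774, "user": {"screen_name": "jonobacon"}},
    {"id": 10700000000, "user": {"screen_name": "jonobacon"}},
    {"id": 42, "user": {"screen_name": "someone_else"}},
]
view = maxid_view(docs)
print(view["jonobacon"])
```

Because taking the maximum of maxima gives the same answer as taking the maximum of everything, the same reduce logic remains correct when CouchDB combines partial results.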

For the core functionality, I created a view called “matches” to count how many tweets match each pattern. For each key (username and pattern), there is a row in this view which records how many tweets from that user matched that pattern:

The matches view
Key                         Value
["jonobacon", null]         100
["jonobacon", "Awesome"]    6
["jonobacon", "Jam"]        10
["jonobacon", "Rock"]       5

The null pattern is used to keep a count of the total number of tweets for that user.
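The same kind of pure-Python model works for the matches view. The sketch below (Python 3, with invented helper names and sample tweets) emits a row with a null pattern for every tweet, plus one row for each pattern the tweet matches, and then sums the rows per key. Python's `re` stands in for the JavaScript RegExp here, which is only a safe substitution for simple patterns like these.

```python
import re
from collections import Counter

def map_matches(doc, patterns):
    """Model of the map step: one row per tweet, plus one per matching pattern."""
    user = doc["user"]["screen_name"]
    yield (user, None), 1  # the null-pattern row counts every tweet
    for name, pattern in patterns.items():
        if re.search(pattern, doc["text"], re.I):
            yield (user, name), 1

def matches_view(docs, patterns):
    """Model of the reduce step: sum the emitted values for each key."""
    totals = Counter()
    for doc in docs:
        for key, value in map_matches(doc, patterns):
            totals[key] += value
    return dict(totals)

# Hypothetical sample tweets
docs = [
    {"user": {"screen_name": "jonobacon"}, "text": "Time to rock"},
    {"user": {"screen_name": "jonobacon"}, "text": "AWESOME jam"},
    {"user": {"screen_name": "jonobacon"}, "text": "Nothing to see"},
]
patterns = {"Rock": "rock", "Awesome": "awesome", "Jam": "jam"}
view = matches_view(docs, patterns)
print(view[("jonobacon", None)], view[("jonobacon", "Jam")])
```

Dividing a pattern's count by the null-pattern count for the same user gives the share of that user's tweets which match.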

Once the data is loaded, the runtime for the CouchDB version is only about 0.3 seconds, including the Python interpreter startup as well as checking Twitter to see if there are new tweets. I doubled the size of the database to 200 tweets (which was about all Twitter would give me in one batch), and the runtime didn’t change measurably. If I’ve done all of this right, it should scale easily up to thousands of tweets. Awesome!

Adding or changing a pattern currently requires manually deleting the view so that it can be re-created. There is probably an established pattern for dealing with this, but I don’t know what it is yet.
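One possible way around the manual deletion, offered as a sketch and not as established CouchDB practice, is to derive the view name from the patterns themselves. A changed set of patterns then yields a new name, so a check like `view_exists` fails and the view is created afresh. The function `view_name_for` is hypothetical, and the sketch is in Python 3.

```python
import hashlib
import json

def view_name_for(patterns, prefix="matches"):
    """Return a view name that changes whenever the patterns change."""
    # sort_keys makes the name independent of dictionary ordering
    canonical = json.dumps(patterns, sort_keys=True)
    digest = hashlib.sha1(canonical.encode("utf-8")).hexdigest()[:8]
    return "%s_%s" % (prefix, digest)

old_name = view_name_for({"Rock": "rock", "Jam": "jam"})
new_name = view_name_for({"Rock": "rock", "Jam": "jam", "Awesome": "awesome"})
print(old_name != new_name)
```

The trade-off is that views for old pattern sets would pile up in the design document until something removes them.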

Here’s the code for version 2:

#!/usr/bin/python

# Requires: python-twitter
# Requires: python-desktopcouch

import sys
import twitter
import re
from desktopcouch.records.server import CouchDatabase
from desktopcouch.records.record import Record

username = 'jonobacon'
# title string : JavaScript regex
patterns = { 'Rock' : 'rock',
        'Awesome' : 'awesome',
        'Jam' : 'jam' }

class Database(CouchDatabase):
    design_doc = "jonometer"
    database_name = "jonometer"

    def __init__(self, patterns):
        """patterns is a dictionary of (title string, JavaScript regex)"""

        CouchDatabase.__init__(self, self.database_name, create=True)
        self.patterns = patterns.copy()

        # set up maxid view
        if not self.view_exists("maxid", self.design_doc):
            mapfn = '''function(doc) { emit(doc.user.screen_name, doc.id); }'''
            viewfn = '''function(key, values, rereduce) {
    return Math.max.apply(Math, values);
}'''
            self.add_view("maxid", mapfn, viewfn, self.design_doc)

        # set up a view to count occurrences of each pattern
        if not self.view_exists("matches", self.design_doc):

            mapfn = '''
function(doc) {
    emit([doc.user.screen_name, null], 1);

    var pattern = null;
    var pattern_name = null;
'''

            mapfn += ''.join(['''   
    pattern = "%s";
    pattern_name = "%s";
    if (new RegExp(pattern, "i").exec(doc.text)) {
        emit([doc.user.screen_name, pattern_name], 1);
    }
    ''' % (pattern, pattern_name)
       for pattern_name, pattern in self.patterns.items()])

            mapfn += '}'

            viewfn = '''function(key, values, rereduce) { return sum(values); }'''
            self.add_view("matches", mapfn, viewfn, self.design_doc)

    def maxid(self, username):
        """Return the highest known tweet ID for the specified user"""

        view = self.execute_view("maxid", self.design_doc)
        result = view[username].rows
        if len(result) > 0:
            return result[0].value
        return None

    def count_matches(self, username, pattern_name=None):
        """Return the number of tweets from username which match 
        the specified pattern.

        If no pattern is specified, count all tweets."""

        assert pattern_name is None or pattern_name in self.patterns
        view = self.execute_view("matches", self.design_doc)
        result = view[[username, pattern_name]].rows
        if len(result) > 0:
            return result[0].value
        # No rows means no stored tweets matched, so report zero instead of None
        return 0

def main():
    client = twitter.Api()
    db = Database(patterns)

    maxid = db.maxid(username)
    if maxid:
        timeline = client.GetUserTimeline(username, since_id=maxid)
    else:
        timeline = client.GetUserTimeline(username, count=100)

    for tweet in timeline:
        print "new:", tweet.GetText()
        record = Record(tweet.AsDict(),
                record_type='https://mdzlog.alcor.net/tag/jonometer')
        record_id = db.put_record(record)

    for pattern in patterns:
        print pattern, db.count_matches(username, pattern)
    print "total", db.count_matches(username)

if __name__ == '__main__':
    sys.exit(main())
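The listing above builds the JavaScript map function by pasting each pattern between double quotes. That works for 'rock', 'awesome' and 'jam', but a pattern containing a quote or a backslash (such as \d+) would produce the wrong JavaScript. Here is a hedged sketch of a safer approach in Python 3, using `json.dumps` to produce an escaped string literal. The helper name `build_match_clauses` is made up for illustration.

```python
import json

def build_match_clauses(patterns):
    """Return JavaScript source with one emit() clause per pattern.

    json.dumps yields a double-quoted, escaped, ASCII-only literal,
    which JavaScript also accepts as a string literal.
    """
    clauses = []
    for name, pattern in sorted(patterns.items()):
        clauses.append(
            '    if (new RegExp(%s, "i").exec(doc.text)) {\n'
            '        emit([doc.user.screen_name, %s], 1);\n'
            '    }\n' % (json.dumps(pattern), json.dumps(name))
        )
    return "".join(clauses)

# A pattern with a backslash, which naive interpolation would mangle
js = build_match_clauses({"Digits": r"\d+"})
print(js)
```

The generated clauses could be placed between the opening lines of the map function and its closing brace, in the same spot as the joined strings in the original listing.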

Written by Matt Zimmerman

March 19, 2010 at 23:41