The code examples are just a sketch of the interface; to see the actual implementation, use the links at the bottom of the post.
The basic County class was fairly straightforward:
class County(object):
    """ Generic County """
    name = None # name of the county
    url = None # url of the vote page
    reg2000 = {'Rep':0, 'Dem':0, 'Oth':0} # Registrations 2000
    reg2004 = {'Rep':0, 'Dem':0, 'Oth':0} # Registrations 2004
    vote2000 = {'Rep':0, 'Dem':0, 'Oth':0} # Votes 2000
    vote2004 = {'Rep':0, 'Dem':0, 'Oth':0} # Find this out!

class Lehigh(County):
    """ Concrete class for Lehigh County """
    name = 'Lehigh'
    url = 'http://lehigh.pa.gov/'
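Because the tallies are plain dicts keyed by party, a one-line summary of a county is easy to produce from a `__str__` method on the base class. What follows is only a minimal sketch: the output layout (votes plus percentage share) is my assumption, the class is trimmed to the fields the summary needs, and the Lehigh numbers are made up for illustration.

```python
class County(object):
    """Generic county, trimmed to the fields the summary needs."""
    name = None
    vote2004 = {'Rep': 0, 'Dem': 0, 'Oth': 0}

    def __str__(self):
        # Assumed layout: votes and percentage share per party.
        total = sum(self.vote2004.values())
        parts = []
        for party in ('Rep', 'Dem', 'Oth'):
            votes = self.vote2004[party]
            share = 100.0 * votes / total if total else 0.0
            parts.append('%s %d (%.1f%%)' % (party, votes, share))
        return '%s: %s' % (self.name, ', '.join(parts))


class Lehigh(County):
    name = 'Lehigh'
    vote2004 = {'Rep': 400, 'Dem': 500, 'Oth': 100}  # made-up numbers


print(Lehigh())
```

Each subclass overrides `vote2004` with its own dict, so the zeroed default on the base class is never mutated.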
The 2004 vote data would be screen-scraped by the script; the other data was available in an easily machine-readable format from the PA state website. Adding a __str__ method to the County class gives us a handy hook to pretty-print statistics.

We also needed a way to parse the wildly varying county websites. I chose to parse the "lynx --dump" output, which turns the job into a text parsing problem instead of an HTML DOM parsing problem. Even then, each county's layout was different (samples here). The per-county "Catcher" class handled pulling the chunk of text for the Presidential election out of the page with all the county races on it, and parsing the Presidential numbers.
import re

class CountyCatcher(object):
    """ generic Catcher to parse voting results page """
    start = None # regular expression where presidential races start
    end = None # presidential data is finished
    def parse(self, lines): pass # function to parse results

class LehighCatcher(CountyCatcher):
    """ Catcher specific to the Lehigh County Results page """
    start = re.compile('PRESIDENT') # start of Presidential data
    end = re.compile('DOG CATCHER') # race after President
    def parse(self, lines):
        # work done here, about 10-15 lines of python per county
        pass
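To make the role of 'start' and 'end' concrete, here is a rough sketch of how the Presidential chunk could be cut out of the dumped text. It is not the code from catcher.py: the function name `extract_section` and the sample dump are invented for illustration.

```python
import re


def extract_section(lines, start, end):
    """Return the lines from the first match of `start` up to, but not
    including, the next match of `end`."""
    section = []
    inside = False
    for line in lines:
        if not inside:
            if start.search(line):
                inside = True
                section.append(line)
        elif end.search(line):
            break
        else:
            section.append(line)
    return section


# Invented sample resembling a text dump of a county results page.
dump = """\
   COUNTY RESULTS
   PRESIDENT
   Candidate A/Rep 400
   Candidate B/Dem 500
   DOG CATCHER
   Candidate C/Ind 12
""".splitlines()

section = extract_section(dump,
                          re.compile('PRESIDENT'),
                          re.compile('DOG CATCHER'))
for line in section:
    print(line)
```

Using the next race on the page as the end marker means the parser never has to know how many Presidential candidates a county lists.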
Several counties used a sensible columnar format of "Candidate/Party #Votes", so the base Catcher handled this by default and only the 'start' and 'end' regexps had to be specified.
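A default parser for that columnar layout might look roughly like the sketch below. The exact row pattern, the `parse_columnar` name, the handling of thousands separators, and the rule that any party other than 'Rep' or 'Dem' is counted as 'Oth' are all my assumptions, and the sample rows are made up.

```python
import re

# Assumed row shape: "<candidate>/<party> <votes>", votes may contain commas.
ROW = re.compile(
    r'^\s*(?P<candidate>.+?)\s*/\s*(?P<party>\w+)\s+(?P<votes>\d[\d,]*)\s*$')


def parse_columnar(lines):
    """Sum votes per party; anything that is not Rep or Dem goes to Oth."""
    totals = {'Rep': 0, 'Dem': 0, 'Oth': 0}
    for line in lines:
        match = ROW.match(line)
        if match is None:
            continue  # headings and blank lines are skipped
        votes = int(match.group('votes').replace(',', ''))
        party = match.group('party')
        key = party if party in ('Rep', 'Dem') else 'Oth'
        totals[key] += votes
    return totals


# Made-up rows in the "Candidate/Party #Votes" layout.
rows = [
    'PRESIDENT',
    'Candidate A/Rep 1,400',
    'Candidate B/Dem 1,500',
    'Candidate C/Grn 25',
    'Candidate D/Lib 30',
]
totals = parse_columnar(rows)
print(totals)
```

The returned dict has the same keys as the vote2004 attribute on County, so it can be assigned straight onto a county instance.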
In the end it was a bit disappointing: six hours to get all the code ready before the polls closed, and another few hours watching and adding new counties as they came online. The results lagged an hour or more behind CNN for the counties with websites (Philadelphia, the largest county, didn't even put results on its website). The same code did come in handy a few days later when USA Today put all counties for all states on their website. It wasn't in a machine-readable format (CSV, Excel, etc.), but I was able to bang out a 30-line class that ran over the 50 state abbreviations and gave me the nationwide data, so I could do fun stuff like make graphs like this:
The graphing library is also something I wrote in Python, but not for this project.
lynx_dumps.txt : example text output of various counties
files.py : the main program with all County definitions. This was written on election day in a rush, and it shows.
catcher.py : the guts of the multi-line text matcher. This should be readable; it was borrowed from another project.
Looking ahead to 2008, I'll add a GUI frontend (Tk? Wx? Who knows what I'll be using in four years) and some live graphs, possibly using matplotlib.
One step away (non-python material warning): here are three posts on live blogging the county numbers on election night and some more graphs.
Two steps away (politics & economics warning): I set up this "Python Stuff" section to separate it from the rest of the blog, but feel free to wander around.