The Geek Event Finder

What It Is

The Geek Event Finder automatically scans the content of websites that announce upcoming gatherings for geeks. Those sites may be single-group sites, like a user group's homepage announcing its monthly meetings, or they may be sites which are themselves aggregations, like DDJ's Conference Calendar. Once the data is gathered, it is uploaded to a Google Calendar account, so that anyone can search for upcoming events in the area near them.

Code

The code is hosted in a Mercurial repository at Assembla.

News

I'll announce any relevant developments regarding the Geek Event Aggregator on my blog under the tag geekeventaggregator.

How It Works

With Python, of course!

Sites to scan

Manually, I create a CSV list of URLs for the Geek Event Aggregator to scan. Each line contains a URL, the group's name, and its location. For sites that list events with multiple names/locations, the "name" and "location" columns instead contain integers giving the position of that data relative to the event date.
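For illustration, here's a sketch of how such a source list might be read. The sample rows and the helper's name are invented, not taken from the real list:

```python
import csv
import io

# Hypothetical sample rows: URL, name (or offset), location (or offset).
SOURCES = """\
http://example.org/usergroup.html,Dayton Dynamic Languages,"Dayton, OH"
http://www.devtownstation.com/ddj.asp,-1,1
"""

def read_sources(fileobj):
    """Yield (url, name, location) rows; integer strings are offsets."""
    for url, name, location in csv.reader(fileobj):
        yield url, name, location

rows = list(read_sources(io.StringIO(SOURCES)))
```

Note that csv.reader handles the quoted comma inside "Dayton, OH" for free, which is why CSV beats naive string splitting here.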

Scanning a site

At first, I imagined specifying exactly where on each page the description of an event could be found: either with a regular expression customized to each page, or with Beautiful Soup. Those were terrible ideas. There is absolutely no consistent pattern used to announce events, so it would have involved a significant chunk of work for every single website scanned. Who has time for that?

Interpreting HTML is hard, let's go shopping

The solution turned out to be text-based brute force; or, "doing the simplest thing that could possibly work". I start by discarding the HTML and scanning only the displayed text. Now, how do I pick out event announcements in all that text?
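Discarding the HTML can be done with the standard library alone. A minimal sketch (the class name is my own):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the displayed text, discarding all the markup."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

extractor = TextExtractor()
extractor.feed("<h2>PyCon</h2> <p>March 19 - 21, 2007</p>")
```

Each tag boundary naturally splits the displayed text into chunks, which turns out to matter later for locating names and locations relative to dates.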

The date is the key

The one thing that every event announcement has in common is a date. Better yet, the date is much easier to programmatically recognize than a location or an event name. I realized that, if I could pick out all the dates from a webpage's text, I'd be almost done. So all I had to do was feed the webpage's text through something that could recognize dates. No problem!

Recognizing a date

So what's a date? Uh-oh. Fortunately, the dateutil module has a parser that does the hard work of recognizing a huge variety of date formats. Whew!
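For instance (dateutil here is the third-party python-dateutil package; the sample strings are my own):

```python
from dateutil import parser

# One parser copes with formats that would each need a custom regex:
dates = [parser.parse(text).date()
         for text in ("14 Feb 2007", "February 14, 2007", "2007-02-14")]
```

All three strings come back as the same date, with no per-format work on my part.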

Cleanup

After chopping the text up into little bits, and feeding the bits through dateutil.parser to see what it would recognize as a date, I had a lot of screening of false hits to do.

Dateutil.parser is not designed to decide whether something is a date. It assumes it's being fed a legitimate date, and tries really, REALLY hard to find a date for you, no matter how far it has to stretch. So, for instance, if you call parser.parse('3'), it thinks, "Um... well... maybe you mean the third day of this month?"

This is why the text bits had to be fed in descending order of size. If the text contained "14 Feb", and you fed it "14" alone, the parser would see "the 14th of this month".

One trick here is including a default date in the call to parser.parse(). If a default date is included, elements of the date that the parser "can't find" are taken from the default date - so parsing "14" alone would give "the 14th of the default month". Then, dates whose month or day matches the default date's fall under suspicion, and we can check whether there's really a month in the string we fed it. It would be nice if the default could be a truly illegal (and thus completely recognizable) date, like the 32nd day of the 13th month, but it has to be a datetime object, so that's impossible.
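A minimal sketch of the default-date trick (the default value here is arbitrary):

```python
from datetime import datetime
from dateutil import parser

# Fields the parser "can't find" in the text are copied from the
# default, so a result matching the default's month is suspect.
DEFAULT = datetime(2000, 1, 1)

partial = parser.parse("14", default=DEFAULT)    # no month in the text
full = parser.parse("14 Feb", default=DEFAULT)   # month really present
```

Here `partial` comes back in the default's January, which flags it for the follow-up check that a month name really appears in the input.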

Repeating events

Compared to all that, finding regularly scheduled events is really, really easy. A simple regular expression finds phrases like "3rd Tuesday"; then, dateutil.rrule translates that into specific dates.
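That pipeline might look something like this sketch; the regex and the helper function are my own, simplified to four ordinals and full weekday names:

```python
import re
from datetime import datetime
from dateutil.rrule import rrule, MONTHLY, MO, TU, WE, TH, FR, SA, SU

WEEKDAYS = {"monday": MO, "tuesday": TU, "wednesday": WE, "thursday": TH,
            "friday": FR, "saturday": SA, "sunday": SU}

def upcoming(phrase, start, count=3):
    """Translate a phrase like '3rd Tuesday' into concrete dates."""
    match = re.search(r"(1st|2nd|3rd|4th)\s+(\w+day)", phrase, re.IGNORECASE)
    ordinal = int(match.group(1)[0])                 # '3rd' -> 3
    weekday = WEEKDAYS[match.group(2).lower()]
    return list(rrule(MONTHLY, byweekday=weekday(ordinal),
                      dtstart=start, count=count))

dates = upcoming("3rd Tuesday", datetime(2007, 1, 1))
```

The `weekday(ordinal)` call is dateutil's notation for "the nth such weekday of the month", which is exactly what phrases like "3rd Tuesday" mean.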

Name and Location

Once the date is found, the event name is relatively easy. For single-group websites, it is simply read as a text string from the list of URLs. For pages that list a variety of events, an integer describes the relative position of the name.

The location is found similarly, then fed to the Yahoo Maps web service to translate the string to latitude and longitude.

For instance, events in the DDJ conference calendar are listed as such:

    AJAXWorld Conference & Expo
    March 19 - 21, 2007
    New York City , New York

so the list of sources describes the site as

http://www.devtownstation.com/ddj.asp,-1,1

meaning, "Event Name will occur one tag boundary before the date (-1); Location will occur one tag boundary after the date (+1)." It only takes a moment's glimpse at the HTML for any given page to determine this much data; far better than writing a Beautiful Soup description for each page!
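In code, the offset lookup is nothing more than list indexing over the tag-boundary chunks (hard-coded here for illustration):

```python
# Tag-boundary chunks of the page text, with the date's index known:
chunks = ["AJAXWorld Conference & Expo", "March 19 - 21, 2007",
          "New York City , New York"]
date_index = 1                         # chunk the date parser matched
name_offset, location_offset = -1, 1   # from the source list entry

name = chunks[date_index + name_offset]
location = chunks[date_index + location_offset]
```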

The location information is fed through a script that looks up the words included and attaches appropriate region tags. For instance, if it finds the word "Berlin" in the location, it attaches "Germany Europe".
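A sketch of that lookup, with a deliberately tiny, made-up table; the real table would cover many more place words:

```python
# Hypothetical fragment of the place-word lookup table.
REGION_TAGS = {
    "berlin": "Germany Europe",
    "dayton": "Ohio Midwest USA NorthAmerica",
}

def tag_location(location):
    """Attach region tags for any known place words in the location."""
    words = location.lower().replace(",", " ").split()
    return " ".join(REGION_TAGS[w] for w in words if w in REGION_TAGS)

tags = tag_location("Berlin, Germany")
```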

Troubleshooting

The whole process demanded a lot of troubleshooting, especially at first. I found myself frequently staring at the Aggregator's results and yelling, "WHERE in the WORLD did you get THAT from?" Eventually, I rewrote the script as a TurboGears application - though one without a customer-facing webpage. Rather, the webpage is served only to me, for debugging purposes - data on each website is stored as part of the TG model. Then, TG can serve up an annotated version of the webpage, with extra tags added to show me what parts of a webpage the alleged event data was drawn from.

Uploading

The final step is to get the data somewhere you can see it. I'm still experimenting with various possibilities here. At the moment, I'm doing something fairly simplistic - I dump the events into a .csv file, then truncate the existing events in the Google Calendar account and import the .csv file. This requires a couple of manual steps, but for now, I don't mind doing that once a week or so.
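The dump step might look like this sketch. The column names follow Google Calendar's CSV import format as I understand it, and the sample event is made up, so double-check the headers against Google's documentation before relying on this:

```python
import csv
import io

# Assumed Google Calendar import columns - verify against their docs.
FIELDS = ["Subject", "Start Date", "All Day Event", "Location"]

events = [
    {"Subject": "AJAXWorld Conference & Expo", "Start Date": "03/19/2007",
     "All Day Event": "True", "Location": "New York City, New York"},
]

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(events)
```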

Flaws

False hits

No human being views the Aggregator's results (why, are you volunteering?), so there will always be false hits. Sure, I go through sometimes to see if there's any systematic source of error, but there will always be things I can't reasonably train my script to avoid. For instance, if your user group's website says that Python has an 8-1 productivity advantage over writing assembly code while blindfolded, the Aggregator thinks you have a meeting on August 1. Oh, well. Double-check everything.

No time of day

The Aggregator ignores time of day for the events it finds and treats them all as full-day events. Day-level granularity seemed sufficient, and reading times of day would be one more huge pain.

Non-English languages

It's not that I want to be an English-language chauvinist. (For Heaven's sake, I'm an Esperantist.) But I haven't yet taught the Aggregator to recognize meetings that occur jeder zweite Dienstag im Monat ("every second Tuesday of the month").

Non-American date conventions

This one I expect to fix soon. 03/08 is March 8 in the U.S., but August 3 in Europe. Dateutil.parser can account for this; I just have to add a column to my website list specifying which date convention I expect each site to use, based on its nationality. (Or maybe I should take it from the URL's country-code domain? But I don't know if I can count on non-American sites to always end with the country code.)
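dateutil's dayfirst flag handles exactly this distinction:

```python
from dateutil import parser

# The same string, read with American vs. European conventions:
us = parser.parse("03/08/2007")                      # month first: March 8
europe = parser.parse("03/08/2007", dayfirst=True)   # day first: August 3
```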

Calendar drill-down sites

There are some sites with very useful aggregated event information - TechVenue, for instance - whose format has got me stymied. First, I'd have to recognize the event listing, even though it's not adjacent to a date I can recognize - it may be adjacent to the day number, or it may not be. Then, I'd have to follow the link to drill down for location information. It's worth pursuing, but I don't know when I'll get around to it.

- Catherine Devlin
catherine.devlin at gmail
catherinedevlin.blogspot.com