underused.org by Michael Scharkow

Harvard enforces Open Access for all research

February 15th, 2008 | Science

Gary King just announced that the Faculty of Arts and Sciences at Harvard unanimously decided to enforce Open Access for all faculty members. This bold move should seriously advance OA world-wide.

I’d like to see similar steps forward from the German DFG and the like, or from my own university. I guess most of the faculty at UdK does not even know what OA means, and who needs to if there’s no research anyway ;-)

TypoScript’s RECORDS, how I love thee

January 6th, 2008 | Hacking

After hacking on it for so many years, I still find surprising nuggets hidden in TYPO3. Our TYPO3-newbie webmaster Johannes recently pointed me to the RECORDS type in our template, which is extremely useful for including portlet-style content elements. To include a plugin somewhere in your page template, add it to a sysfolder or hidden page and refer to the content record by its uid like this:

subparts.TAGCLOUD = RECORDS
subparts.TAGCLOUD {
  tables = tt_content
  source = 444
  dontCheckPid = 1
}

That’s it. The plugin content is rendered without USER_INT fiddling or COA tricks, and configuration is dummy-proof with FlexForms, which seem to be more popular than TS configuration anyway. You can also include link lists, search forms, user login forms or normal content this way, all nicely editable by your average users.

Scraping Youtube with Beautiful Soup

December 11th, 2007 | Hacking

For an upcoming project I need to track some usage statistics of Youtube videos which are not provided via the GData API. The common solution to this problem is screen scraping the HTML pages and extracting the information.

Here’s a quick howto using Python and the BeautifulSoup HTML/XML parser.

First off, we choose a Youtube video page, like this one and stuff it into the BeautifulSoup parser:

#!/usr/bin/env python
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup
import re # we need regular expressions later

monty_vid = urlopen('http://youtube.com/watch?v=Xe1a1wHxTyo')
page = BeautifulSoup(monty_vid)

print page.prettify()

The last line pretty-prints the HTML you just retrieved, so you can check if it’s an existing page or a 404. Next, we’d like to extract some metadata like title, description and tags for the video. Luckily, those are provided in the HTML head as meta tags, in that order, so we can extract the content attribute from each. Calling page.head('meta') returns a list whose elements act like dictionaries:

meta = {}
meta['title'] = page.head('meta')[0]['content']
meta['description'] = page.head('meta')[1]['content']
meta['tags'] = page.head('meta')[2]['content'].split(', ')

Notice that all the extracted strings are Unicode, and we made a list of tags by splitting the string. Next up, we want the number of views and the number of ratings. Luckily, the former is available in a span tag with a dedicated class which we can retrieve with the following search on the document body:

views = page.body('span',"viewCount")[0].string

The number of ratings is not readily marked up, but available as a string like “55 ratings”, so we need another technique, namely pattern matching within a certain div:

numratings = page.body('div', id='defaultRatingMessage',    
                        text=re.compile('ratings'))[0].string.split()[0]

The body() method with the text parameter gives us all tags in the named div that match our simple pattern, from which we extract the first part with split()[0]. Finally, we do not only want the number of ratings, but the rating itself. The rating is not available as a number in the document, but indicated by 0 to 5 star images which we can count:

rating = len(page.body('img','rating icn_star_full_19x20png'))
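The values scraped so far, apart from the star count, are still strings. Here is a minimal sketch of turning them into integers with a regular expression. The helper name parse_count is made up for this example, and it assumes counts use commas as thousands separators, which I have not checked against the live page:

```python
import re

def parse_count(text):
    """Return the first integer in text, ignoring comma thousands separators.

    Hypothetical helper: assumes strings shaped like "55 ratings" or "1,234,567".
    """
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        raise ValueError("no number found in %r" % text)
    return int(match.group(0).replace(",", ""))

ratings_count = parse_count("55 ratings")   # string format quoted in the post
views_count = parse_count("1,234,567")      # made-up view count
print(ratings_count, views_count)
```

Raising an error when no number is found means a layout change shows up as a failure rather than as a silently wrong value.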

Of course, there are dozens of different ways to query the HTML for some values, but these work with the current layout. Since you have to update your scraping script with every change, why bother with optimal queries?
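To illustrate one of those other ways, here is a rough sketch of the same queries using the html.parser module from the Python 3 standard library in place of BeautifulSoup, so that it runs offline without extra packages. The sample HTML is invented for this example: the meta tag names and the empty-star class are assumptions, and only viewCount, defaultRatingMessage and the full-star class come from the queries above.

```python
from html.parser import HTMLParser

# Made-up stand-in for a video page; the real markup is not reproduced here.
SAMPLE_PAGE = """
<html><head>
<meta name="title" content="Example video">
<meta name="description" content="An example description">
<meta name="keywords" content="monty, python, example">
</head><body>
<span class="viewCount">1,234</span>
<div id="defaultRatingMessage"><span>55 ratings</span></div>
<img class="rating icn_star_full_19x20png">
<img class="rating icn_star_full_19x20png">
<img class="rating icn_star_empty_19x20png">
</body></html>
"""

class VideoPageParser(HTMLParser):
    """Collects meta tags by name, the view count text and the full-star count."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.views = None
        self.full_stars = 0
        self._in_view_count = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and "name" in attrs:
            # Look meta tags up by name instead of relying on their order.
            self.meta[attrs["name"]] = attrs.get("content", "")
        elif tag == "span" and attrs.get("class") == "viewCount":
            self._in_view_count = True
        elif tag == "img" and "icn_star_full" in (attrs.get("class") or ""):
            self.full_stars += 1

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_view_count = False

    def handle_data(self, data):
        if self._in_view_count:
            self.views = data.strip()

parser = VideoPageParser()
parser.feed(SAMPLE_PAGE)
tags = parser.meta["keywords"].split(", ")
print(parser.meta["title"], tags, parser.views, parser.full_stars)
```

Looking up meta tags by name is a little less brittle than indexing by position, but it breaks just the same as soon as the class names change.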

Google Chart API released

December 7th, 2007 | Hacking

The awesome new Chart API is probably the best Google product since, um, Google Search. Gotta love the text-as-data pattern and the fact that you only need to fill an image tag with some parameters. And did I tell you it’s fast?!
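To make the "fill an image tag with some parameters" idea concrete, here is a small sketch that builds such a tag in Python. The endpoint and the parameter names (cht for chart type, chs for size, chd for data in text encoding, chl for labels) are as I remember them from the API documentation, so treat them as assumptions. The helper chart_img_tag is made up for this example.

```python
from urllib.parse import urlencode

# Endpoint as I recall it from the Chart API docs (an assumption, not verified here).
CHART_ENDPOINT = "http://chart.apis.google.com/chart"

def chart_img_tag(values, labels, size="250x100", chart_type="p3"):
    """Build an <img> tag whose URL carries the chart data as plain text."""
    params = {
        "cht": chart_type,                               # chart type, p3 = 3D pie
        "chs": size,                                     # width x height in pixels
        "chd": "t:" + ",".join(str(v) for v in values),  # data, text encoding
        "chl": "|".join(labels),                         # one label per data point
    }
    # Keep ":", "," and "|" readable instead of percent-encoding them.
    url = CHART_ENDPOINT + "?" + urlencode(params, safe=":,|")
    # Escape "&" so the URL is valid inside an HTML attribute.
    return '<img src="%s" alt="chart">' % url.replace("&", "&amp;")

tag = chart_img_tag([60, 40], ["Hello", "World"])
print(tag)
```

The whole chart lives in the URL, so there is nothing to upload or store: change a number in the query string and you get a different image.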

(via Tobias Lütke)

Another housemate on the underused server

October 31st, 2007 | underused.org

Welcome Christian Siefkes, text processing/spam filtering wizard and author of a very cool book on peer economy. Go visit and buy the book!