underused.org by Michael Scharkow

Scraping Youtube with Beautiful Soup

December 11th, 2007 | Hacking

For an upcoming project I need to track some usage statistics of Youtube videos which are not provided via the GData API. The common solution to this problem is screen scraping the HTML pages and extracting the information.

Here’s a quick howto using Python and the BeautifulSoup HTML/XML parser.

First off, we choose a Youtube video page (the URL appears in the code) and stuff it into the BeautifulSoup parser:

#!/usr/bin/env python
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup
import re # we need regular expressions later

monty_vid = urlopen('http://youtube.com/watch?v=Xe1a1wHxTyo')
page = BeautifulSoup(monty_vid)

print page.prettify()

The last line pretty-prints the HTML you just retrieved, so you can check that you got an existing video page and not an error page. Next, we’d like to extract some metadata like title, description and tags for the video. Luckily, those are provided in the HTML head as meta tags, in a fixed order, so we can extract the content attribute from each one by position. The result of page.head('meta') is a list with elements that act like dictionaries:

meta = {}
meta['title'] = page.head('meta')[0]['content']
meta['description'] = page.head('meta')[1]['content']
meta['tags'] = page.head('meta')[2]['content'].split(', ')
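Relying on the position of the meta tags is fragile, so here is a rough sketch of the same lookup by name instead. It is not the BeautifulSoup code from this post: it uses Python 3 and the standard library's html.parser, and the sample markup, the meta names (title, description, keywords) and the MetaCollector and extract_meta names are all assumptions made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical, simplified page head; the real markup may well differ.
SAMPLE_HEAD = """
<html><head>
<meta name="title" content="Example video">
<meta name="description" content="An example description.">
<meta name="keywords" content="example, sample, video">
</head><body></body></html>
"""


class MetaCollector(HTMLParser):
    """Collect <meta name=... content=...> pairs into a dict."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        # Only keep meta tags that carry both a name and a content attribute.
        if "name" in attrs and "content" in attrs:
            self.meta[attrs["name"]] = attrs["content"]


def extract_meta(html):
    """Return a dict mapping meta names to their content strings."""
    collector = MetaCollector()
    collector.feed(html)
    return collector.meta


meta = extract_meta(SAMPLE_HEAD)
# Split the comma-separated keywords into a list of tags.
tags = meta.get("keywords", "").split(", ")
print(meta.get("title"), tags)
```

Looking tags up by name means a reordered head no longer silently swaps title and description, at the cost of assuming the names stay stable.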

Notice that all the extracted strings are Unicode, and we made a list of tags by splitting the string. Next up, we want the number of views and the number of ratings. Luckily, the former is available in a span tag with a dedicated class which we can retrieve with the following search on the document body:

views = page.body('span',"viewCount")[0].string
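The value we get back is a string, so for statistics it probably needs converting to a number. A minimal Python 3 sketch, assuming the count is displayed as plain digits with optional thousands separators (the exact format on the page is an assumption, and parse_count is a hypothetical helper name):

```python
import re


def parse_count(text):
    """Turn a display string such as '1,234,567' into an int.

    Returns None if the string contains no digits at all. This simply
    strips every non-digit, so abbreviated forms like '1.2M' would be
    misread and need separate handling.
    """
    digits = re.sub(r"\D", "", text or "")
    return int(digits) if digits else None


print(parse_count("1,234,567"))  # 1234567
```

Returning None for an empty or missing string makes it easy to spot a layout change in the collected data instead of crashing halfway through a run.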

The number of ratings is not readily marked up, but available as a string like “55 ratings”, so we need another technique — pattern matching within a certain div:

numratings = page.body('div', id='defaultRatingMessage',    
                        text=re.compile('ratings'))[0].string.split()[0]
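Once the matching string is in hand, the number can also be pulled out with a regular expression rather than split(). This is only a sketch in Python 3 under the assumption that the text looks like "55 ratings", possibly with thousands separators; RATINGS_RE and parse_num_ratings are names invented here.

```python
import re

# One or more digits (commas allowed), whitespace, then "rating" or "ratings".
RATINGS_RE = re.compile(r"(\d[\d,]*)\s+ratings?\b")


def parse_num_ratings(text):
    """Return the rating count as an int, or None if the pattern is absent."""
    match = RATINGS_RE.search(text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


print(parse_num_ratings("55 ratings"))  # 55
```

Compared to split()[0], this version does not depend on the number being the very first word of the string.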

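In the BeautifulSoup query, the body() method with the text parameter gives us the text strings (not tags) that match our simple pattern, from which we extract the first part with split()[0]. Note that BeautifulSoup 3 documents the tag name and attribute filters as ignored when text is given, so this search effectively covers the whole body rather than only the named div. Finally, we want not only the number of ratings but the rating itself. The rating is not available as a number in the document, but is indicated by 0 to 5 star images, which we can count: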

rating = len(page.body('img','rating icn_star_full_19x20png'))
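For comparison, here is a sketch of the same star count in Python 3 with only the standard library's html.parser. The sample markup (including the "empty star" class and the pixel.gif source) is invented for illustration; only the full-star class token comes from the query above. Unlike that query, this sketch matches a single class token rather than the whole class string.

```python
from html.parser import HTMLParser

# Class token taken from the BeautifulSoup query; everything else is assumed.
FULL_STAR_CLASS = "icn_star_full_19x20png"

# Hypothetical rating markup: four full stars and one empty star.
SAMPLE_RATING = """
<div id="ratingStars">
<img class="rating icn_star_full_19x20png" src="pixel.gif">
<img class="rating icn_star_full_19x20png" src="pixel.gif">
<img class="rating icn_star_full_19x20png" src="pixel.gif">
<img class="rating icn_star_full_19x20png" src="pixel.gif">
<img class="rating icn_star_empty_19x20png" src="pixel.gif">
</div>
"""


class StarCounter(HTMLParser):
    """Count <img> tags whose class attribute contains a given token."""

    def __init__(self, star_class):
        super().__init__()
        self.star_class = star_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        # The class attribute may be missing, so fall back to an empty string.
        classes = (dict(attrs).get("class") or "").split()
        if self.star_class in classes:
            self.count += 1


def count_full_stars(html, star_class=FULL_STAR_CLASS):
    """Return the number of full-star images found in the HTML."""
    counter = StarCounter(star_class)
    counter.feed(html)
    return counter.count


print(count_full_stars(SAMPLE_RATING))  # 4
```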

Of course, there are dozens of different ways to query the HTML for these values, but these work with the current layout. Since you have to update your scraping script with every layout change anyway, why bother with optimal queries?