underused.org by Michael Scharkow

TypoScript’s RECORDS, how I love thee

January 6th, 2008 | Hacking | No Comments

Even after hacking on it for so many years, I still find surprising nuggets hidden in TYPO3. Our TYPO3-newbie webmaster Johannes recently pointed me to the RECORDS type in our template, which is extremely useful for including portlet-style content elements. To include a plugin somewhere in your page template, simply add it to a sysfolder or hidden page and refer to the content record by its uid (444 in this example) like this:

subparts.TAGCLOUD = RECORDS
subparts.TAGCLOUD {
tables = tt_content
source = 444
dontCheckPid = 1
}

That’s it. The plugin content is rendered without USER_INT fiddling or COA tricks, and configuration is dummy-proof with FlexForms, which seem to be more popular than TS configuration anyway. You can also include link lists, search forms, user logins or normal content this way, all nicely editable by your average users.

Scraping Youtube with Beautiful Soup

December 11th, 2007 | Hacking | 2 Comments

For an upcoming project I need to track some usage statistics of Youtube videos which are not provided via the GData API. The common solution to this problem is screen scraping the HTML pages and extracting the information.

Here’s a quick howto using Python and the BeautifulSoup HTML/XML parser.

First off, we choose a Youtube video page, like this one and stuff it into the BeautifulSoup parser:

#!/usr/bin/env python
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup
import re # we need regular expressions later

monty_vid = urlopen('http://youtube.com/watch?v=Xe1a1wHxTyo')
page = BeautifulSoup(monty_vid)

print page.prettify()

The last line pretty-prints the HTML you just retrieved, so you can check if it’s an existing page or a 404. Next, we’d like to extract some meta data like title, description and tags for the video. Luckily, those are provided in the HTML head as meta tags, in order, so we can extract the content attribute from those. The result object is a list with elements that act like dictionaries:

meta = {}
meta['title'] = page.head('meta')[0]['content']
meta['description'] = page.head('meta')[1]['content']
meta['tags'] = page.head('meta')[2]['content'].split(', ')

Notice that all the extracted strings are Unicode, and we made a list of tags by splitting the string. Next up, we want the number of views and the number of ratings. Luckily, the former is available in a span tag with a dedicated class which we can retrieve with the following search on the document body:

views = page.body('span',"viewCount")[0].string

The number of ratings is not readily marked up, but available as a string like “55 ratings”, so we need another technique — pattern matching within a certain div:

numratings = page.body('div', id='defaultRatingMessage',    
                        text=re.compile('ratings'))[0].string.split()[0]

The body() call with the text parameter gives us the strings that match our simple pattern, from which we extract the first part with split()[0]. Finally, we want not only the number of ratings but also the rating itself. The rating is not available as a number in the document, but it is indicated by 0 to 5 star images, which we can count:

rating = len(page.body('img','rating icn_star_full_19x20png'))

Of course, there are dozens of different ways to query the HTML for some values, but these work with the current layout. Since you have to update your scraping script with every change, why bother with optimal queries?
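For anyone who wants to try the same four lookups without Python 2, here is a rough sketch using only the Python 3 standard library (html.parser instead of BeautifulSoup, which is a different parser and not what the snippets above use). The sample markup is made up to mirror the structure described in this post, and the names VideoPageParser and scrape are hypothetical helpers, not part of any library.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample markup, modelled on the page structure described above.
SAMPLE_HTML = """
<html><head>
<meta name="title" content="Example video">
<meta name="description" content="A short example clip">
<meta name="keywords" content="monty, python, example">
</head><body>
<span class="viewCount">1234</span>
<div id="defaultRatingMessage">55 ratings</div>
<img class="rating icn_star_full_19x20png">
<img class="rating icn_star_full_19x20png">
<img class="rating icn_star_full_19x20png">
<img class="rating icn_star_empty_19x20png">
</body></html>
"""


class VideoPageParser(HTMLParser):
    """Collect meta contents, view count, rating count and full stars."""

    def __init__(self):
        super().__init__()
        self.meta_contents = []  # content attributes of <meta> tags, in order
        self.views = None
        self.numratings = None
        self.rating = 0          # number of full-star images seen
        self._context = None     # which element's text we are currently inside

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and "content" in attrs:
            self.meta_contents.append(attrs["content"])
        elif tag == "span" and attrs.get("class") == "viewCount":
            self._context = "views"
        elif tag == "div" and attrs.get("id") == "defaultRatingMessage":
            self._context = "ratings"
        elif tag == "img" and attrs.get("class") == "rating icn_star_full_19x20png":
            self.rating += 1

    def handle_data(self, data):
        if self._context == "views":
            self.views = data.strip()
        elif self._context == "ratings":
            # Pattern matching on a string like "55 ratings"
            match = re.search(r"(\d+)\s+ratings", data)
            if match:
                self.numratings = match.group(1)

    def handle_endtag(self, tag):
        # Any closing tag ends the element we were collecting text for.
        self._context = None


def scrape(html):
    parser = VideoPageParser()
    parser.feed(html)
    parser.close()
    title, description, tags = parser.meta_contents[:3]
    return {
        "title": title,
        "description": description,
        "tags": tags.split(", "),
        "views": parser.views,
        "numratings": parser.numratings,
        "rating": parser.rating,
    }


result = scrape(SAMPLE_HTML)
print(result)
```

Like the BeautifulSoup version, this depends entirely on the class names and ids in the markup, so it breaks just as easily when the layout changes.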

Google Chart API released

December 7th, 2007 | Hacking | No Comments

The awesome new Chart API is probably the best Google product since, um, Google Search. Gotta love the text-as-data pattern and the fact that you only need to fill an image tag with some parameters. And did I tell you it’s fast?!
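To give an idea of what "filling an image tag with some parameters" looks like, here is a small Python 3 sketch that builds such a URL. The endpoint and the parameter names (cht, chs, chd, chl) are written from memory of the Chart API docs, so treat them as assumptions and check the documentation before relying on them. The chart_url helper is hypothetical.

```python
from html import escape
from urllib.parse import urlencode

# Endpoint as remembered from the Chart API docs; treat it as an assumption.
CHART_ENDPOINT = "http://chart.apis.google.com/chart"


def chart_url(values, labels, size=(250, 100), chart_type="p3"):
    """Build a chart image URL (hypothetical helper, not part of any library)."""
    params = {
        "cht": chart_type,                                # chart type, p3 = 3D pie
        "chs": "%dx%d" % size,                            # width x height in pixels
        "chd": "t:" + ",".join(str(v) for v in values),   # text-encoded data
        "chl": "|".join(labels),                          # one label per value
    }
    # Keep ':', ',' and '|' readable instead of percent-encoding them.
    return CHART_ENDPOINT + "?" + urlencode(params, safe=":,|")


url = chart_url([60, 40], ["Hello", "World"])
print(url)
print('<img src="%s" alt="chart">' % escape(url))
```

The data travels as plain text in the chd parameter, which is the text-as-data pattern mentioned above.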

(via Tobias Lütke)

TYPO3 5.0 - the Enterprise strikes back

September 18th, 2006 | Hacking | No Comments

Today Robert and Sebastian announced that TYPO3 5.0 will work on top of a JSR-170 compliant content repository, which, of course, has to be implemented in PHP first [sic!]. What does this mean? Clearly a landmark in TYPO3 history, as we are now officially embracing the J2EE specificationism that is elsewhere rightfully ignored. Instead of becoming more agile, 5.0 might be even more of a 500lbs gorilla than ever before.

This hardly comes as a surprise: if you witnessed the discussions at the T3DD, there was a fairly large Java/Enterprise crowd who constantly argued a) for making anything exchangeable (components!), b) that content must be stored in RDBMS, XML, flat files and preferably punch cards (just in case), c) that we must support standards (lots of them) and d) that TYPO3 must be designed for any purpose. In short: 5.0 should be a kitchen sink that might also be used as a CMS.

So what does the 300-page spec of JSR-170 tell us?

  1. All data is stored in a tree, with nodes and properties (oh, really?!)
  2. You can have any persistence backend (see below)
  3. You can retrieve data via XPath or (optionally) SQL-like queries
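To make points 1 and 3 concrete, here is a loose analogy in Python 3 using the standard library's ElementTree. This is not the JSR-170 API (that is a Java interface), and the node and property names are invented for illustration. It only shows the general idea of a tree of nodes with properties that is queried by path.

```python
import xml.etree.ElementTree as ET

# Hypothetical content tree: elements play the role of nodes,
# attributes play the role of properties.
root = ET.Element("root")
pages = ET.SubElement(root, "pages")
home = ET.SubElement(pages, "page", {"title": "Home", "hidden": "0"})
ET.SubElement(home, "content", {"type": "text", "header": "Welcome"})
ET.SubElement(home, "content", {"type": "plugin", "header": "Tag cloud"})
news = ET.SubElement(pages, "page", {"title": "News", "hidden": "1"})
ET.SubElement(news, "content", {"type": "text", "header": "Latest"})

# ElementTree supports only a small XPath subset, which is enough here.
text_headers = [
    node.get("header") for node in root.findall(".//content[@type='text']")
]
visible_pages = [
    node.get("title") for node in root.findall("./pages/page[@hidden='0']")
]

print(text_headers)
print(visible_pages)
```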

I’m not going to rant about the fun of mixing Java and PHP, which is required until we have a PHP implementation, or the man-months that need to be spent on the core data implementation (we’re not even talking about the actual content schema that is needed for TYPO3 as a CMS).

Ignoring the API for a moment, let’s have a look at the JackRabbit implementations for the persistence layer (i.e. our database):

Three out of four mature persistence managers use binary serialization for storage, which is great for transparency and enables us to use 0 (zero) of the industry-tested enterprise tools, like RDBMS or file systems. The only other semi-mature option is XML-based storage, which is (surprise!) slow and even less reliable.

Please note that those are all implemented by Java people who already have experience with JSR-170. In the TYPO3 community, as represented by the people at T3DD, even a simple ORM solution seemed to be quite a new idea to a lot of them.

I can only hope that the 5.0 project will not get completely off the trail by re-implementing, among others, Zope and WebSphere in PHP while TYPO3 is still no more and no less than a CMS framework.

[The announcement was made on typo3.projects.typo3-5_0.general which is only available on the TYPO3 usenet (news://news.netfielders.de)]

Back from Dietikon

August 15th, 2006 | Hacking | 1 Comment

T3DD06 was definitely the best conf I’ve been to, and I had a lot of fun meeting all the smart guys from -core and the community. The new TER frontend has finally been deployed and even the most annoying TIMTAB bug was fixed last weekend.

Other news … my thesis has been somewhat neglected lately but I did insane amounts of computation and data presentation last week. In a comparison of 25 countries any small table with 5 regression coefficients takes up a whole page. Not to mention the fun of presenting a dozen goodness-of-fit measures for every model and every country. Cross-national research surely does suck at times.