Automatic Content Analysis and Machine Learning

In 2011, I defended my PhD thesis in Communication on the subject of Automatic Content Analysis and Machine Learning. The thesis is written in German, but a short English abstract follows below. Some parts of the book have also been published as journal articles, most notably the empirical study on classification quality for thematic analyses. The German version of this page is here.

Abstract

For some years, machine learning techniques have been used to automatically process digital media content - from search engines to automatic language translation. More recently, social scientists have applied machine learning to the quantitative analysis of texts. Starting from a methodological perspective, I discuss the benefits and disadvantages of automatic content analysis compared to traditional manual coding. Following these considerations, I introduce the methodological and conceptual foundations of machine learning approaches to text classification and their application in social science research. Empirically, the potential of machine learning for content analysis is investigated in an experimental study with German online news. The outcome variables of the study were (a) the quality of the classification and (b) the efficiency of the training process. Results show that classification quality varies with the categories chosen, but is only marginally influenced by most preprocessing steps discussed in the literature. Regarding the efficiency of the training process, the results show that actively choosing informative training material instead of sampling it at random often leads to faster learning and can save a lot of human coding effort.
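
To illustrate the idea of actively choosing informative training material, here is a minimal sketch of one common selection strategy, uncertainty sampling. It is an illustration under assumptions, not the procedure used in the thesis: the function names `select_most_uncertain` and `select_random` and the example scores are hypothetical, and the sketch assumes a binary classifier that outputs a probability for the positive category.

```python
import random


def select_most_uncertain(probabilities, k):
    """Return the indices of the k documents whose predicted probability
    for the positive category is closest to 0.5, i.e. the documents the
    classifier is least sure about (hypothetical helper)."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:k]


def select_random(probabilities, k, seed=0):
    """Baseline: pick k documents at random for manual coding."""
    rng = random.Random(seed)
    return rng.sample(range(len(probabilities)), k)


# Made-up classifier scores for five uncoded news articles.
scores = [0.97, 0.52, 0.10, 0.45, 0.80]

# Articles 1 and 3 lie closest to the decision boundary, so they would be
# handed to the human coder first.
print(select_most_uncertain(scores, 2))  # [1, 3]
```

In an active learning loop, the selected documents would be coded manually, added to the training set, and the classifier retrained before the next selection round.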

The complete book is available as a PDF download, or you can buy the print version on Amazon and elsewhere.