Monday, March 23, 2009

Portfolio Assignment 6

Movie Clustering

For this assignment, I've decided to try to cluster movies based on IMDB's user-submitted plot summaries. In theory, an accurate plot summary should contain enough relevant information about a movie to be suitable for clustering. However, since we are interested in the movies described by the summaries rather than the summaries themselves, we want to make sure that we group similar movies together even if the summaries we have for those movies are stylistically different, and that we do not group very different movies together because their summaries were submitted by the same author. Therefore, in designing an algorithm to rate the difference between two movies, we must make sure that we are scoring the summaries based on content and not style.

To begin, I downloaded the plot summaries and user-submitted ratings from IMDB. I will only be clustering the most-rated (not highest-rated -- I want well-known movies, not necessarily well-liked movies) 1000 movies. (Actually, IMDB lists both movies and TV series, and the data doesn't distinguish between the two.) In order to select the plot summaries of the 1000 most-rated movies, I wrote Python scripts to parse both files and input the results into a MySQL database. Then we can get the most-rated 1000 movies for which there is at least one plot summary. This will be our dataset. The summaries for those movies can be considered a collection of documents.

Next, I built an inverted index, storing the location of each token in each summary, using the method descriped in the PCI book (including the simplistic tokenizer that splits on alphanumeric characters).

Now, in order to use heirarchical clustering, we need to define a distance function for two movies. One approach would be to build a dictionary of terms, then construct a vector for each movie that relates each term to a frequency, then take the Euclidean distance between those vectors. This is the approach used in PCI (and assignment 4) for clustering blogs. However, it has a number of problems. First, that approach considers all words equally relevant. In this case, we want less common words to be more relevant. The more common a word is, the less likely it is to distinguish between two movies and the more likely it is to be biased by the author's writing style. Also, our summaries differ significantly in length, and this approach would tend to rate the distance between a short summary and a large summary too high, because there must be a large gap in the frequencies of terms, even if the terms are all the same.

The technique I will be using is cosine similarity, as described by the IR book. For each document, the algorithm builds a vector of tf-idf scores for each term, where tf-idf is the normalized term frequency (the number of times the term appears in the document divided by the length of the document) multiplied by the inverted document frequency, which is the log of (N / the number of documents that contain the term), where N is the total number of documents. Thus a term that is more frequent within the document is weighted higher (it adds more information about that document), and a term that appears in more documents overall is weighted lower (it does little to distinguish that document from the rest).

Since this technique of calculating similarities uses only word frequency, not word order, we can add together frequency counts for multiple summaries and treat a movie as a single document, regardless of how many summaries it has.

Writing some python scripts to do these calculations, we have a list of 1000 movies and an algorithm for computing a similarity score for each combination. This is all the data we need to perform clustering.

The method from PCI combines clusters by averaging their word vectors, so the new vector can be used to measure distances to other clusters based on Pearson coefficient or other distance measures. Since I haven't removed any words from the index, this method proved to be prohibitively time-consuming. The word vector for each movie is too large to do these calculations in a reasonable amount of time. Without pruning the dictionary, a vector-based approach is not feasible.

Fortunately, there are techniques for clustering that don't require combining word vectors. The IR book describes some of them, and recommends complete-link (when vector-based approaches aren't possible), in which the distance between two clusters is the distance between their farthest members. Since I had already computed and stored cosine similarities for each pair, this method was more practical. I modified the PCI clustering code so that a cluster contains a list of movie ids (rather than a word vector), and changed the clustering algorithm to implement complete-link clustering.

The dendrogram seems to be too big for Blogger but can be viewed here.

The results are somewhat mixed. There are many groupings of somewhat dissimilar movies (sometimes obviously attributable to a few common terms, sometimes not), but some small clusters in which the movies belong to the same genre or are loosely related by topic. The algorithm was extremely good at grouping sequels, remakes, and franchises, probably because their descriptions share terms like character names or the actual movie names that are very rare in the rest of the collection.

The results might be improved if some of the most and least common words were removed from the dictionary, both to try to reduce anomalies and to make the list of words small enough that more complex algorithms for measuring distances between clusters could be used. A smaller dictionary might also allow us to increase the number of movies, which could improve the usefulness of inverse document frequency as a measure of how important a term is.

I also tried using single-link clustering, in which the similarity between two clusters is the similarity of their closest members. The results show similar tendencies but seem less consistent. The dendrogram is here.

Thursday, March 12, 2009

Portfolio Assignment 5

Crazy Egg:
http://crazyegg.com/snapshots/report/18

The Map of Science:
http://seedmagazine.com/content/article/scientific_method_relationships_among_scientific_paradigms/

Visualization of GPW/GRUMP data:
http://manyeyes.alphaworks.ibm.com/manyeyes/visualizations/number-and-size-of-settlement-points