Data Mining: Portfolio Assignment 7 (late, short)

PCI Chapter 4 -- Searching and Ranking

Chapter 4 covers several information retrieval techniques, including collecting and storing a corpus of documents and retrieving and sorting search results based on various criteria.

Since I had already collected a corpus of small documents for Assignment 6, I decided to build a search engine for the movie plots (using only the 750 most-rated movies this time, because I was having trouble processing 1000). Because I was drawing information from a MySQL database instead of crawling web pages, I had to modify some of the PCI code, including removing all information about links from the index. This also meant that I couldn't use PageRank to sort search results. All the other techniques for querying the data, however, still work. I gave word frequency, document location, and word distance equal weight.
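To make the equal weighting concrete, here is a minimal sketch of how three metrics could be normalized and summed. It is not the actual code from my searcher class: the function names `normalize_scores` and `combine_scores` and all the raw numbers are made up for illustration, and it assumes each metric is scaled so that the best summary gets 1.0.

```python
def normalize_scores(scores, small_is_better=False):
    """Scale raw scores into 0..1 so that 1.0 is always the best value."""
    vsmall = 0.00001  # guards against division by zero
    if small_is_better:
        min_score = min(scores.values())
        return {k: float(min_score) / max(vsmall, v) for k, v in scores.items()}
    max_score = max(scores.values())
    if max_score == 0:
        max_score = vsmall
    return {k: float(v) / max_score for k, v in scores.items()}


def combine_scores(weighted_metrics):
    """Add up weight * normalized score for every summary over all metrics."""
    totals = {}
    for weight, scores in weighted_metrics:
        for sum_id, score in scores.items():
            totals[sum_id] = totals.get(sum_id, 0.0) + weight * score
    return totals


# Hypothetical raw values for three summaries (not real data).
frequency = {1: 4, 2: 2, 3: 1}    # more occurrences of the word is better
location = {1: 3, 2: 12, 3: 40}   # an earlier position in the summary is better
distance = {1: 1, 2: 1, 3: 1}     # a smaller gap between query words is better

# Equal weights: each metric can contribute at most 1.0 to the total.
weights = [
    (1.0, normalize_scores(frequency)),
    (1.0, normalize_scores(location, small_is_better=True)),
    (1.0, normalize_scores(distance, small_is_better=True)),
]
totals = combine_scores(weights)
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

With three metrics weighted 1.0 each, the highest possible total is 3.0, which a summary only reaches if it is the best on every metric.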

We can use the search engine to explain and confirm the clusters, particularly those that are likely to be based on a small number of specific and extremely uncommon words, and so may not be meaningful. For example, one of the clusters includes 'Constantine', 'Dogma', and the 'Charlie's Angels' movies, which most people probably wouldn't consider very similar to each other. We can explain this with the search engine:

>>> e.query('angels')
select w0.sumid,w0.location from wordlocation w0 where w0.wordid=2903
3.000000 Charlies Angels (2000)
2.083333 Aviator, The (2004)
2.076923 Dogma (1999)
2.040000 Charlies Angels (2000)
2.034483 Charlies Angels (2000)
2.029412 Dogma (1999)
2.000000 Dogma (1999)
1.562500 Aviator, The (2004)
1.550000 Dogma (1999)
1.545455 Aviator, The (2004)
1.522727 Constantine (2005)
1.514925 Collateral (2004)
([2903], [288, 125, 448, 289, 286, 449, 451, 123, 450, 124])

(Each result corresponds to a different summary, which is why there may be several results for a given movie.)

So it looks like those movies were clustered together largely because of the word 'angels', which appears in very few documents. However, 'The Aviator' and 'Collateral' are not near this cluster, suggesting that they found stronger associations elsewhere. Both are in different clusters in the dendrogram, slightly farther to the right. This suggests that they were clustered first, based on greater similarities to other movies. Once they had been clustered with other movies, their distances to the 'angels' movies were less relevant, because in complete-link clustering the distance between two clusters is the greatest distance between any pair of movies, one from each cluster. In this case, the search engine helps us confirm suspicions about why movies were clustered together and helps us find other movies that might have been close to being placed in the same cluster.
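The complete-link effect can be sketched in a few lines. The movie labels and distances below are invented for illustration (they are not values from my clustering run), and `complete_link_distance` is a hypothetical helper, not a function from PCI.

```python
from itertools import product


def complete_link_distance(cluster_a, cluster_b, distance):
    """Complete link: the largest distance between any cross-cluster pair."""
    return max(distance(a, b) for a, b in product(cluster_a, cluster_b))


# Made-up pairwise distances. "x" stands for a movie that shares the word
# 'angels' with the angels cluster; "y" is a movie that is close to "x"
# but far from the angels movies.
pairwise = {
    frozenset(["angels_1", "angels_2"]): 0.2,
    frozenset(["angels_1", "x"]): 0.4,
    frozenset(["angels_2", "x"]): 0.5,
    frozenset(["angels_1", "y"]): 0.9,
    frozenset(["angels_2", "y"]): 0.95,
    frozenset(["x", "y"]): 0.3,
}


def lookup(a, b):
    """Return the stored distance between two movies, in either order."""
    return pairwise[frozenset([a, b])]


angels = ["angels_1", "angels_2"]

# On its own, "x" is fairly close to the angels cluster.
alone = complete_link_distance(angels, ["x"], lookup)

# After "x" merges with "y", the cluster is judged by its worst pair.
merged = complete_link_distance(angels, ["x", "y"], lookup)
```

Here `alone` is 0.5 but `merged` is 0.95: once "x" has joined "y", the shared word no longer pulls it toward the angels cluster, because the farthest pair now decides the distance.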

Friday, April 24, 2009
