Data Mining

Wednesday, April 29, 2009

Portfolio Assignment 9

NERO -- Neuro-Evolving Robot Operatives

NERO is a game based on machine learning using neural networks and a genetic algorithm. The objective of the game is to train robots -- each controlled by its own neural network -- to perform various tasks, and pit a combination of these trained soldiers against others in combat. A player may want units that can effectively fire at enemy soldiers, navigate terrain, capture an enemy base, protect the player's base, or exhibit more complex behavior. However, default soldiers will not be effective for any of these tasks, so different teams must be trained to perform them well.

The game is built using the rtNEAT engine for evolving neural networks in real-time. The goal of rtNEAT is to build increasingly complex neural networks to perform increasingly complex tasks, using a genetic algorithm. The technique is similar to genetic programming, because both attempt to evolve structures capable of taking inputs, performing some kind of processing, and producing outputs. While genetic programming evolves program trees, rtNEAT evolves neural networks.

The neural networks are initially simple, with mutations generally adding complexity. This behavior is different from most implementations of genetic programming, which may begin at an arbitrary level of complexity and not strive to become more or less complex. There is no limit to how complex the networks can grow. The engine uses speciation, in which different organisms (neural networks) are classified into species based on their history. Speciation is used to encourage more similar organisms mating (an attempt to address the "competing conventions" problem) and to "preserve innovation" by ensuring that the amount of organisms of a species depends on the fitness of the species. Speciation also helps to reduce bloat (organisms becoming more complex for no benefit) by ensuring that newer, more complex organisms are assigned to a new species and that the old species does not die out unless it is actually less fit.

To train soldiers in NERO, the player sets positive or negative weights that let the game evaluate the fitness of each unit. Units continually spawn, act on their own on the field for a certain amount of time, then disappear, evaluate their fitness, and respawn. Occasionally, the less fit individuals disappear and new organisms are added to the population. Over time, the units become better at satisfying the weights. The player can assign weights to behaviors such as moving toward the enemy, avoiding fire, or staying together. In order to create more complex behaviors, the player must train the troops using a combination of these weights. The user can also set the amount of time each organism remains on the field. Training may proceed more quickly if each unit is on the field for a short time, but more time may be requried to evaluate a unit's completion of a more complex task.

I achieved very limited success with NERO. Training requires a significant amount of time, and the program occasionally crashes, so the player loses any unsaved training progress. Training troops to exhibit complex behaviors typically requires training them for simple behaviors and gradually building up to the complex task, requiring more time. My team members were more successful, training troops to navigate a series of walls and to circle around enemy soldiers. In the former, the troops were trained to navigate around the first and second wall and were then able to navigate around the remaining walls. In the latter, troops were trained to approach enemies, with enemies spaced evenly apart, and then learned to move in a circle. Both examples suggest that NERO soldiers can be successfully trained to handle different and more complex situations than those they encounter in training.

Friday, April 24, 2009

Portfolio Assignment 7 (late, short)

PCI Chapter 4 -- Searching and Ranking

Chapter 4 covers several information retrieval techniques, including collecting and storing a corpus of documents and retrieving and sorting search results based on various criteria.

Since I had already collected a corpus of small documents for Assignment 6, I decided to create a search engine for the movie plots (using only the most rated 750 movies this time, because I was having trouble processing 1000). Since I was drawing information from a MySQL database instead of crawling web pages, I had to modify some of the PCI code a bit, including removing information about links from the index. This also meant that I couldn't use PageRank to sort search results. All the other techniques for querying the data, however, work. I assigned word frequency, document location, and word distance each the same weight.

We can use the search engine to explain and confirm the clusters, particularly those that are likely to be based on a small number of specific and extremely uncommon words, which may not be meaningful clusters. For example, one of the clusters includes 'Constantine', 'Dogma', and the 'Charlie's Angels' movies, which most people probably wouldn't consider very similar to each other. We can explain this with the search engine:

>>> e.query('angels')
select w0.sumid,w0.location from wordlocation w0 where w0.wordid=2903
3.000000 Charlies Angels (2000)
2.083333 Aviator, The (2004)
2.076923 Dogma (1999)
2.040000 Charlies Angels (2000)
2.034483 Charlies Angels (2000)
2.029412 Dogma (1999)
2.000000 Dogma (1999)
1.562500 Aviator, The (2004)
1.550000 Dogma (1999)
1.545455 Aviator, The (2004)
1.522727 Constantine (2005)
1.514925 Collateral (2004)
([2903], [288, 125, 448, 289, 286, 449, 451, 123, 450, 124])

(Each result corresponds to a different summary, which is why there may be several results for a given movie.)

So it looks like those movies were clustered together largely based on their use of the word 'angels', which appears in very few documents. However, 'The Aviator' and 'Collateral' are not near this cluster, suggesting that they may have found stronger associations elsewhere. Both are in different clusters in the dendrogram, slightly farther to the right. This suggests that they were clustered first, based on greater similarities to other movies. Once they had been clustered with other movies, their distances to the 'angels' movies was less relevant, because in complete-link clustering, the distance between two clusters is the farthest distance of any pair of movies, one from each. In this case, using the search engine helps us confirm suspicions about why movies were clustered together and helps us find other movies that might have been close to being placed in the same cluster.

Thursday, April 16, 2009

Portfolio Assignment 8

Chapter 10 -- Finding Independent Features

The goal of independent feature extraction is to discover a set of features relating different instances of the dataset. Like clustering, it is an unsupervised learning technique. Feature extraction can take a large set of attributes, like a list of words for a corpus of documents, and describe the dataset using a much smaller set of features.

Feature extraction using non-negative matrix factorization requires that the dataset can be represented as a matrix, then finds two matrices whose product is (as close as possible to) the dataset. If the dataset has n instances (such as documents) with m attributes (such as words), we can represent each instance as a row and each attribute as a column in an n by m matrix. In order for the product of the two factors to result in an n by m matrix, they must be of size n by x and x by m for some x, which is the number of features. One factor is the weights matrix, which has a row for each instance and a column for each feature, indicating the weight of each feature for each instance (e.g., how strong a particular theme is in a particular document). The other is the features matrix, with a row for each feature and a column for each word, indicating the contribution of each word to each feature.

In order to use non-negative matrix factorization, we have to choose the number of features (x). Two matrices are created, n by x and x by m, and initialized with random values. Then, repeatedly perform the following calculation, which uses four new matrices fn, fd, tn, and td:

1. hn = the transposed weights matrix multiplied by the data matrix
2. hd = the transposed weights matrix multiplied by the weights matrix multiplied by the features matrix
3. wn = the data matrix multiplied by the transposed features matrix
4. wd = the weights matrix multiplied by the features matrix multiplied by the transposed features matrix
5. multiply every value in the features matrix by the corresponding value in hn
6. divide every value in the features matrix by the corresponding value in hd
7. multiply every value in the weights matrix by the corresponding value in wn
8. divide every value in the weights matrix by the corresponding value in wd

Each time, the weights and features matrix approach a correct factorization. Terminate when their product is exactly equal to the data matrix or when a specified maximum number of iterations is performed.

I ran the PCI code to extract 20 features from news stories. These are the 20 features the algorithm found, and the top 3 articles that contain those features, along with their weights.

['reuters', 'results', 'nbsp', 'bank', 'washington', 'times']
(26.956935368561908, u'US to unveil bank test results May 4 - Reuters')
(13.081470034098839, u'US discovers violations in surveillance program - Reuters')
(2.3762249288329689, u'Obama Unveils High-Speed Rail Plan - New York Times')

['marriage', 'paterson', 'york', 'legislation', 'legalize', 'same']
(27.324936892030976, u'Gov. Paterson introduces legislation to legalize gay marriage in ... - New York Daily News')
(7.6426727409171393, u'NY governor to push same-sex marriage bill')
(5.2740397524569902, u'City Room: Paterson Unveils Same-Sex Marriage Bill')

['obama', 'high', 'speed', 'rail', 'unveils', 'plan']
(28.840037916458279, u'Obama Unveils High-Speed Rail Plan - New York Times')
(10.256907276345128, u'Obama Unveils High-Speed Rail Plan')
(9.6323576369348576, u'Obama unveils high-speed rail plan')

['profit', 'nokia', 'nbsp', 'jpmorgan', 'journal', 'falls']
(24.705880143866498, u'Despite Profit Decline, Optimistic Words at Nokia - New York Times')
(23.127917441772848, u'JPMorgan's Profit Falls 10% But Beats Expectations - Wall Street Journal')
(5.5584765190223457, u'JPMorgan Chase earns $2.1B')

['kutcher', 'twitter', 'spears', 'news', 'race', 'nbsp']
(30.636814120256368, u'Twitter Race to 1 Million Followers: Can Kutcher Beat CNN and Spears? - ABC News')
(8.120470265490475, u'Twitter race between Kutcher and CNN narrows')
(1.1119497863206529, u'Why the Spam Carbon Footprint Study is Wrong - PC Magazine')

['news', 'nbsp', 'spam', 'pfizer', 'deal', 'articles']
(15.73995740082424, u'USA Today owner's profits slump - BBC News')
(15.473341233296507, u'Why the Spam Carbon Footprint Study is Wrong - PC Magazine')
(15.370063134872753, u'Pfizer, GlaxoSmithKline to launch HIV company - Bizjournals.com')

['demjanjuk', 'case', 'reopen', 'request', 'denied', 'deportation']
(28.515164940519234, u'Request to reopen Demjanjuk's deportation case denied - CNN')
(10.451745519268284, u"Accused Nazi's Request to Reopen Case Denied")
(1.2391981320144818, u'Prosecutor: Drop case against Bush officials')

['madden', 'coach', 'john', 'news', 'thursday', 'announced']
(20.409454524464191, u'Madden announces retirement - CNN')
(9.027209585795358, u'Football legend John Madden to retire')
(8.5345707222411189, u'NFL Broadcaster John Madden Retires')

['quest', 'million', 'misbranding', 'today', 'test', 'nbsp']
(34.409122200044429, u'Quest Pays $302 Million for Misbranding PTH Test - Renal Business Today')
(5.6338445286364731, u'Deadly attacks mar opening of Indian election')
(3.0262093117167779, u'USA Today owner's profits slump - BBC News')

['drug', 'epilepsy', 'nbsp', 'lower', 'york', 'times']
(27.800010078553523, u'IQ Harmed by Epilepsy Drug in Utero - New York Times')
(6.5091285086544897, u'Pfizer, GlaxoSmithKline to launch HIV company - Bizjournals.com')
(5.7316147346691162, u'Colombia Operation Nabs Most-Wanted Drug Lord')

['suleman', 'octomom', 'nbsp', 'show', 'reports', 'associated']
(27.434627505962908, u'Octomom reality show reports premature - Hollywood Reporter')
(6.4532931828328888, u"Nadya Suleman Wants to Trademark 'Octomom' Nickname")
(2.3115550604267008, u'Billy Ray not amused by Foxx's Miley jokes - msnbc.com')

['housing', 'claims', 'jobless', 'bloomberg', 'starts', 'fall']
(31.683114456995323, u'US Economy: Jobless Claims Fall, Housing Stabilizes - Bloomberg')
(5.1440452349075896, u'New Jobless Claims Drop for Second Straight Week- March Housing Construction Falls 10.8 Percent')
(3.0607894881438384, u'USA Today owner's profits slump - BBC News')

['times', 'robinson', 'york', 'nbsp', 'news', 'years']
(31.302811635807451, u'Sports of The Times Timeline Stretches 62 Years, From Robinson to ... - New York Times')
(9.0137397723407844, u'First-Round NBA Playoff Pairings - New York Times')
(4.1116560554028121, u'Madden announces retirement - CNN')

['growth', 'general', 'files', 'bankruptcy', 'mall', 'owner']
(34.087252834044925, u'Glendale Galleria mall owner General Growth files record bankruptcy - Los Angeles Times')
(3.4004102790728008, u"China's economic growth falls")
(2.6035002478418461, u'FOXBusiness: JPMorgan Posts $2.1 Billion Profit- FOXBusiness: Mall Giant Files for Bankruptcy')

['with', 'that', 'thursday', 'from', 'said', 'have']
(8.0266184808591703, u'Maersk crew reunites with family members')
(7.3256882178409031, u'Obama: Latin America on equal footing with U.S.')
(7.0650147758525792, u'N. Korea orders out nuclear inspectors')

['israel', 'state', 'nbsp', 'quot', 'news', 'minister']
(26.182178132567305, u'Israel demands recognition - Aljazeera.net')
(5.3592694059489618, u"Israel's pragmatic thug")
(4.6806406110833372, u'Just how bad off is the Republican Party (Part 2)?')

['veterans', 'security', 'apologizes', 'homeland', 'groups', 'report']
(28.392368413776488, u'Homeland security chief apologizes to veterans groups - CNN')
(13.277877485904455, u"DHS Sec'y Apologizes to Vets")
(10.06749268131458, u'Homeland security chief apologizes')

['chechnya', 'russia', 'nbsp', 'operation', 'guardian', 'timeline']
(29.64479055909738, u'Chechnya and Russia: timeline - guardian.co.uk')
(10.730427010763846, u"Russia declares end to Chechnya 'operation'")
(1.6589568347684713, u'Russia Ends Operations in Chechnya')

['thailand', 'nbsp', 'after', 'government', 'telegraph', 'political']
(28.958532247175231, u'Thailand's Anti-Government Protests End, but Questions Linger - Voice of America')
(3.8000889983596569, u'Billy Ray not amused by Foxx's Miley jokes - msnbc.com')
(1.841203744693537, u'World Briefing | Asia: Sri Lanka: Conflicting Reports After Cease-Fire Ends')

['monte', 'federer', 'carlo', 'round', 'third', 'nbsp']
(27.729401591662743, u'Federer loses in third round in Monte Carlo - The Associated Press')
(7.6970641509680853, u'First-Round NBA Playoff Pairings - New York Times')
(1.9373164154547438, u'US discovers violations in surveillance program - Reuters')

Each group represents a feature. The list of words is a short list of the words that contribute the most to an article's score in that feature. The three articles listed are the top three articles for that feature, along with the numbers that display how strongly they exhibit the feature. A higher number indicates that the article is a better example of the feature.

Monday, March 23, 2009

Portfolio Assignment 6

Movie Clustering

For this assignment, I've decided to try to cluster movies based on IMDB's user-submitted plot summaries. In theory, an accurate plot summary should contain enough relevant information about a movie to be suitable for clustering. However, since we are interested in the movies described by the summaries rather than the summaries themselves, we want to make sure that we group similar movies together even if the summaries we have for those movies are stylistically different, and that we do not group very different movies together because their summaries were submitted by the same author. Therefore, in designing an algorithm to rate the difference between two movies, we must make sure that we are scoring the summaries based on content and not style.

To begin, I downloaded the plot summaries and user-submitted ratings from IMDB. I will only be clustering the most-rated (not highest-rated -- I want well-known movies, not necessarily well-liked movies) 1000 movies. (Actually, IMDB lists both movies and TV series, and the data doesn't distinguish between the two.) In order to select the plot summaries of the 1000 most-rated movies, I wrote Python scripts to parse both files and input the results into a MySQL database. Then we can get the most-rated 1000 movies for which there is at least one plot summary. This will be our dataset. The summaries for those movies can be considered a collection of documents.

Next, I built an inverted index, storing the location of each token in each summary, using the method descriped in the PCI book (including the simplistic tokenizer that splits on alphanumeric characters).

Now, in order to use heirarchical clustering, we need to define a distance function for two movies. One approach would be to build a dictionary of terms, then construct a vector for each movie that relates each term to a frequency, then take the Euclidean distance between those vectors. This is the approach used in PCI (and assignment 4) for clustering blogs. However, it has a number of problems. First, that approach considers all words equally relevant. In this case, we want less common words to be more relevant. The more common a word is, the less likely it is to distinguish between two movies and the more likely it is to be biased by the author's writing style. Also, our summaries differ significantly in length, and this approach would tend to rate the distance between a short summary and a large summary too high, because there must be a large gap in the frequencies of terms, even if the terms are all the same.

The technique I will be using is cosine similarity, as described by the IR book. For each document, the algorithm builds a vector of tf-idf scores for each term, where tf-idf is the normalized term frequency (the number of times the term appears in the document divided by the length of the document) multiplied by the inverted document frequency, which is the log of (N / the number of documents that contain the term), where N is the total number of documents. Thus a term that is more frequent within the document is weighted higher (it adds more information about that document), and a term that appears in more documents overall is weighted lower (it does little to distinguish that document from the rest).

Since this technique of calculating similarities uses only word frequency, not word order, we can add together frequency counts for multiple summaries and treat a movie as a single document, regardless of how many summaries it has.

Writing some python scripts to do these calculations, we have a list of 1000 movies and an algorithm for computing a similarity score for each combination. This is all the data we need to perform clustering.

The method from PCI combines clusters by averaging their word vectors, so the new vector can be used to measure distances to other clusters based on Pearson coefficient or other distance measures. Since I haven't removed any words from the index, this method proved to be prohibitively time-consuming. The word vector for each movie is too large to do these calculations in a reasonable amount of time. Without pruning the dictionary, a vector-based approach is not feasible.

Fortunately, there are techniques for clustering that don't require combining word vectors. The IR book describes some of them, and recommends complete-link (when vector-based approaches aren't possible), in which the distance between two clusters is the distance between their farthest members. Since I had already computed and stored cosine similarities for each pair, this method was more practical. I modified the PCI clustering code so that a cluster contains a list of movie ids (rather than a word vector), and changed the clustering algorithm to implement complete-link clustering.

The dendrogram seems to be too big for Blogger but can be viewed here.

The results are somewhat mixed. There are many groupings of somewhat dissimilar movies (sometimes obviously attributable to a few common terms, sometimes not), but some small clusters in which the movies belong to the same genre or are loosely related by topic. The algorithm was extremely good at grouping sequels, remakes, and franchises, probably because their descriptions share terms like character names or the actual movie names that are very rare in the rest of the collection.

The results might be improved if some of the most and least common words were removed from the dictionary, both to try to reduce anomalies and to make the list of words small enough that more complex algorithms for measuring distances between clusters could be used. A smaller dictionary might also allow us to increase the number of movies, which could improve the usefulness of inverse document frequency as a measure of how important a term is.

I also tried using single-link clustering, in which the similarity between two clusters is the similarity of their closest members. The results show similar tendencies but seem less consistent. The dendrogram is here.

Thursday, March 12, 2009

Portfolio Assignment 5

Crazy Egg:
http://crazyegg.com/snapshots/report/18

The Map of Science:
http://seedmagazine.com/content/article/scientific_method_relationships_among_scientific_paradigms/

Visualization of GPW/GRUMP data:
http://manyeyes.alphaworks.ibm.com/manyeyes/visualizations/number-and-size-of-settlement-points

Friday, February 20, 2009

Portfolio Assignment 4 (2/3)

1. PCI Chapter 3

Went through the chapter on spidering and clustering. Using the downloaded code, generated a second dataset using the same list of feeds. Below are two dendrograms, the first generated using the original dataset (that appears in the book) and the second generated using my dataset.

The structures of the dendrograms are noticably different, but there are similar clusters. The first dendrogram contains a cluster of five blogs (The Unofficial Apple Weblog, Download Squad, Autoblog, Joystiq, and Engadget) that all remain close together in the second, though a few other blogs are grouped together with them. This group is indicated by the blue box in both diagrams. We can also see a larger cluster of many web-related blogs, most of which (including Blogger's Blog, Publishing 2.0, and several Google and search-related blogs) remain close in the second dendrogram. This group, indicated in green, seems to vary more between the two diagrams.

2. Clustering: Capitol Words data

I looked through the APIs available at programmableweb.com and decided to build a dataset using Capitol Words, which tracks word usage by members of the U.S. Congress. I obtained a list of legislators from the Sunlight Labs API, and used the Python interface to the Capitol Words API to build a list of 112 current and recent senators' most frequently used words. For each senator, I downloaded their 250 most used words along with the number of times they've used those words. The resulting dataset contains 5969 words, and the number of times each senator has used that word, according to CW, if it is in their top 250 words (0 if otherwise).

We can use this dataset and the PCI code to run several clustering algorithms. For example, k-means clustering with k = 5 returns this result:

['"Sen. Charles E. Grassley(R--IA)"', '"Sen. Daniel Kahikina Akaka(D--HI)"', '"Sen. Susan M. Collins(R--ME)"', '"Sen. Christopher J. Dodd(D--CT)"', '"Sen. Hillary Rodham Clinton(D--NY)"', '"Sen. Tim P. Johnson(D--SD)"']

['"Sen. Michael Bennet(D--CO)"', '"Sen. Arlen Specter(R--PA)"', '"Sen. James H. Webb(D--VA)"', '"Sen. Evan Bayh(D--IN)"', '"Sen. Russell D. Feingold(D--WI)"', '"Sen. Kay R. Hagan(D--NC)"']

['"Sen. Jeanne Shaheen(D--NH)"', '"Sen. Pete V. Domenici(R--NM)"', '"Sen. Harry M. Reid(D--NV)"', '"Sen. Robert C. Byrd(D--WV)"', '"Sen. Thomas Harkin(D--IA)"', '"Sen. Mark E. Udall(D--CO)"', '"Sen. Bernard Sanders(I--VT)"', '"Sen. Daniel K. Inouye(D--HI)"', '"Sen. Barbara A. Mikulski(D--MD)"', '"Sen. Charles T. Hagel(R--NE)"', '"Sen. Edward M. Kennedy(D--MA)"', '"Sen. Sheldon Whitehouse(D--RI)"', '"Sen. John D. Rockefeller(D--WV)"', '"Sen. Thomas Allen Coburn(R--OK)"', '"Sen. Thomas Richard Carper(D--DE)"', '"Sen. Dianne Feinstein(D--CA)"', '"Sen. Sherrod C. Brown(D--OH)"', '"Sen. Mark R. Warner(D--VA)"', '"Sen. Joseph R. Biden(D--DE)"', '"Sen. John E. Sununu(R--NH)"', '"Sen. Carl Levin(D--MI)"', '"Sen. Jeff Bingaman(D--NM)"', '"Sen. John H. Isakson(R--GA)"', '"Sen. Amy Klobuchar(D--MN)"', '"Sen. Max S. Baucus(D--MT)"', '"Sen. Tom S. Udall(D--NM)"', '"Sen. Frank R. Lautenberg(D--NJ)"', '"Sen. Mary L. Landrieu(D--LA)"', '"Sen. Richard G. Lugar(R--IN)"', '"Sen. Benjamin L. Cardin(D--MD)"', '"Sen. Robert Menendez(D--NJ)"', '"Sen. Elizabeth H. Dole(R--NC)"', '"Sen. Jon Tester(D--MT)"', '"Sen. E. Benjamin Nelson(D--NE)"', '"Sen. Ted Stevens(R--AK)"', '"Sen. Barack H. Obama(D--IL)"', '"Sen. Bill Nelson(D--FL)"', '"Sen. Maria Cantwell(D--WA)"', '"Sen. Larry E. Craig(R--ID)"', '"Sen. Patty Murray(D--WA)"', '"Sen. Edward E. Kaufman(D--DE)"', '"Sen. Ken Salazar(D--CO)"', '"Sen. Mark Begich(D--AK)"', '"Sen. John William Warner(R--VA)"', '"Sen. Roland W. Burris(D--IL)"']

['"Sen. Pat Roberts(R--KS)"', '"Sen. Jim W. DeMint(R--SC)"', '"Sen. Samuel D. Brownback(R--KS)"', '"Sen. Michael B. Enzi(R--WY)"', '"Sen. George V. Voinovich(R--OH)"', '"Sen. Richard C. Shelby(R--AL)"', '"Sen. Kent Conrad(D--ND)"', '"Sen. Robert F. Bennett(R--UT)"', '"Sen. Melquiades Rafael Martinez(R--FL)"', '"Sen. Claire McCaskill(D--MO)"', '"Sen. Bob Corker(R--TN)"', '"Sen. Lisa A. Murkowski(R--AK)"', '"Sen. Wayne A. Allard(R--CO)"', '"Sen. Lamar Alexander(R--TN)"', '"Sen. John Francis Reed(D--RI)"', '"Sen. John Eric Ensign(R--NV)"', '"Sen. Olympia Jean Snowe(R--ME)"', '"Sen. David B. Vitter(R--LA)"', '"Sen. C. Saxby Chambliss(R--GA)"', '"Sen. Lindsey O. Graham(R--SC)"', '"Sen. Charles E. Schumer(D--NY)"', '"Sen. John R. Thune(R--SD)"', '"Sen. Mitch McConnell(R--KY)"', '"Sen. Kay Bailey Hutchison(R--TX)"', '"Sen. James E. Risch(R--ID)"', '"Sen. Byron L. Dorgan(D--ND)"', '"Sen. John Sidney McCain(R--AZ)"', '"Sen. Jon Kyl(R--AZ)"', '"Sen. John Barrasso(R--WY)"', '"Sen. Jim Bunning(R--KY)"', '"Sen. John Cornyn(R--TX)"', '"Sen. Roger F. Wicker(R--MS)"', '"Sen. Christopher S. Bond(R--MO)"', '"Sen. Mike O. Johanns(R--NE)"', '"Sen. Joseph I. Lieberman(I--CT)"', '"Sen. Thad Cochran(R--MS)"']

['"Sen. Michael D. Crapo(R--ID)"', '"Sen. Gordon Harold Smith(R--OR)"', '"Sen. Barbara Boxer(D--CA)"', '"Sen. Patrick J. Leahy(D--VT)"', '"Sen. Mark Lunsford Pryor(D--AR)"', '"Sen. Debbie Ann Stabenow(D--MI)"', '"Sen. Herbert H. Kohl(D--WI)"', '"Sen. Norm Coleman(R--MN)"', '"Sen. Judd A. Gregg(R--NH)"', '"Sen. Richard M. Burr(R--NC)"']

We can also use heirarchical clustering, which generates the following dendrogram:

We can also use the book's method for visualizing data in two dimensions:

All of these methods show that the senators, as represented by this dataset, are roughly evenly spaced. However, there are some small clusters in which one party predominates, or a few locations in which senators from the same areas are close together. For example, Arizona Republicans John McCain and Jon Kyl are grouped together by both algorithms. This is particularly evident in the dendrogram, and I've marked a few clusters which show this.

Part of the reason that the senators are so evenly spaced may be that, because of the way the dataset was constructed, there is not enough overlapping data between them. The only data used to construct the dataset was each senator's top 250 words, so each senator has a maximum of 250 nonzero entries. (The reason I constructed it in this way is that the CW API doesn't provide a direct way to get the number of times a senator has used a particular word. The only option is to get the top X words a senator has used, and check to see if the particular word is in the list.)

A better dataset might be obtained by increasing the number of words to search through for each senator. However, including more words in the wordlist would quickly make the dataset larger and more difficult to process. A good strategy might be selecting the wordlist that is the union of each senator's top 250 (or possibly even less) words, and checking that wordlist against each senator's top 500 (or 1000, or more) words to obtain frequencies.

Thursday, February 12, 2009

Portfolio Assignment 3

Team 5's Last.fm recommendation system is written in C#, and presents a Windows Forms-based front-end for a few useful functions of the Last.fm API. The application lets the user execute three different searches: search for an artist, list albums by an artist, and get similar artists.

(All Lastfm objects and methods mentioned are in the Lastfm.Services namespace.)

The user types the name of an artist and selects "List Artist", "List Album", or "Recommend Artist" from a drop down box. Search results are displayed in the box below.

List Artist performs an ArtistSearch with the user's text as a query, gets the first page of results, represented as a collection of Artist objects, and iterates through them to print a list of artists whose names match the user's input.

List Album, which uses an AlbumSearch to get Album objects, produces a list of the most popular albums by a particular artist (like the artist search, it only lists the first page, so it might not return all of an artist's albums), in order.

Recommend Artist takes an artist's name as input, performs an ArtistSearch and gets the first result to get the Artist object representing that artist, then uses Artist.GetSimilar to list the most similar artists to that artist.

Each search displays a list of artist or album names, but allows the user to access more information using an embedded web browser. If the user double-clicks on an artist result, the artist's name is used to generate an Artist object, then an ArtistBio. Then the "Web Results" tab is selected and navigates to the appropriate page using ArtistBio.GetURL. Double-clicking on an album result (which take the form " - ") splits the result string into the artist name and the album name, gets the Artist by that name, then gets the Album by that artist with the appropriate name, then calls Ablum.GetURL and sends the user to the track listing for that album.

Example:

Suppose I'm looking for recommendations, so I enter in an artist I like -- let's use Iron Maiden -- and select "Recommended Artist":

The most similar artist I'm not very familiar with is Saxon. I'll search for their most popular albums:

If I double-click the most popular album listed, I can get a track listing for "Denim And Leather":

So, using this application, I can go from an artist I like to a list of songs I might be interested in.