Data Mining: April 2009

Wednesday, April 29, 2009

Portfolio Assignment 9

NERO -- Neuro-Evolving Robot Operatives

NERO is a game based on machine learning using neural networks and a genetic algorithm. The objective of the game is to train robots -- each controlled by its own neural network -- to perform various tasks, and pit a combination of these trained soldiers against others in combat. A player may want units that can effectively fire at enemy soldiers, navigate terrain, capture an enemy base, protect the player's base, or exhibit more complex behavior. However, default soldiers will not be effective for any of these tasks, so different teams must be trained to perform them well.

The game is built using the rtNEAT engine for evolving neural networks in real-time. The goal of rtNEAT is to build increasingly complex neural networks to perform increasingly complex tasks, using a genetic algorithm. The technique is similar to genetic programming, because both attempt to evolve structures capable of taking inputs, performing some kind of processing, and producing outputs. While genetic programming evolves program trees, rtNEAT evolves neural networks.

The neural networks are initially simple, with mutations generally adding complexity. This behavior is different from most implementations of genetic programming, which may begin at an arbitrary level of complexity and not strive to become more or less complex. There is no limit to how complex the networks can grow. The engine uses speciation, in which different organisms (neural networks) are classified into species based on their history. Speciation is used to encourage more similar organisms mating (an attempt to address the "competing conventions" problem) and to "preserve innovation" by ensuring that the amount of organisms of a species depends on the fitness of the species. Speciation also helps to reduce bloat (organisms becoming more complex for no benefit) by ensuring that newer, more complex organisms are assigned to a new species and that the old species does not die out unless it is actually less fit.

To train soldiers in NERO, the player sets positive or negative weights that let the game evaluate the fitness of each unit. Units continually spawn, act on their own on the field for a certain amount of time, then disappear, evaluate their fitness, and respawn. Occasionally, the less fit individuals disappear and new organisms are added to the population. Over time, the units become better at satisfying the weights. The player can assign weights to behaviors such as moving toward the enemy, avoiding fire, or staying together. In order to create more complex behaviors, the player must train the troops using a combination of these weights. The user can also set the amount of time each organism remains on the field. Training may proceed more quickly if each unit is on the field for a short time, but more time may be requried to evaluate a unit's completion of a more complex task.

I achieved very limited success with NERO. Training requires a significant amount of time, and the program occasionally crashes, so the player loses any unsaved training progress. Training troops to exhibit complex behaviors typically requires training them for simple behaviors and gradually building up to the complex task, requiring more time. My team members were more successful, training troops to navigate a series of walls and to circle around enemy soldiers. In the former, the troops were trained to navigate around the first and second wall and were then able to navigate around the remaining walls. In the latter, troops were trained to approach enemies, with enemies spaced evenly apart, and then learned to move in a circle. Both examples suggest that NERO soldiers can be successfully trained to handle different and more complex situations than those they encounter in training.

Friday, April 24, 2009

Portfolio Assignment 7 (late, short)

PCI Chapter 4 -- Searching and Ranking

Chapter 4 covers several information retrieval techniques, including collecting and storing a corpus of documents and retrieving and sorting search results based on various criteria.

Since I had already collected a corpus of small documents for Assignment 6, I decided to create a search engine for the movie plots (using only the most rated 750 movies this time, because I was having trouble processing 1000). Since I was drawing information from a MySQL database instead of crawling web pages, I had to modify some of the PCI code a bit, including removing information about links from the index. This also meant that I couldn't use PageRank to sort search results. All the other techniques for querying the data, however, work. I assigned word frequency, document location, and word distance each the same weight.

We can use the search engine to explain and confirm the clusters, particularly those that are likely to be based on a small number of specific and extremely uncommon words, which may not be meaningful clusters. For example, one of the clusters includes 'Constantine', 'Dogma', and the 'Charlie's Angels' movies, which most people probably wouldn't consider very similar to each other. We can explain this with the search engine:

>>> e.query('angels')
select w0.sumid,w0.location from wordlocation w0 where w0.wordid=2903
3.000000 Charlies Angels (2000)
2.083333 Aviator, The (2004)
2.076923 Dogma (1999)
2.040000 Charlies Angels (2000)
2.034483 Charlies Angels (2000)
2.029412 Dogma (1999)
2.000000 Dogma (1999)
1.562500 Aviator, The (2004)
1.550000 Dogma (1999)
1.545455 Aviator, The (2004)
1.522727 Constantine (2005)
1.514925 Collateral (2004)
([2903], [288, 125, 448, 289, 286, 449, 451, 123, 450, 124])

(Each result corresponds to a different summary, which is why there may be several results for a given movie.)

So it looks like those movies were clustered together largely based on their use of the word 'angels', which appears in very few documents. However, 'The Aviator' and 'Collateral' are not near this cluster, suggesting that they may have found stronger associations elsewhere. Both are in different clusters in the dendrogram, slightly farther to the right. This suggests that they were clustered first, based on greater similarities to other movies. Once they had been clustered with other movies, their distances to the 'angels' movies was less relevant, because in complete-link clustering, the distance between two clusters is the farthest distance of any pair of movies, one from each. In this case, using the search engine helps us confirm suspicions about why movies were clustered together and helps us find other movies that might have been close to being placed in the same cluster.

Thursday, April 16, 2009

Portfolio Assignment 8

Chapter 10 -- Finding Independent Features

The goal of independent feature extraction is to discover a set of features relating different instances of the dataset. Like clustering, it is an unsupervised learning technique. Feature extraction can take a large set of attributes, like a list of words for a corpus of documents, and describe the dataset using a much smaller set of features.

Feature extraction using non-negative matrix factorization requires that the dataset can be represented as a matrix, then finds two matrices whose product is (as close as possible to) the dataset. If the dataset has n instances (such as documents) with m attributes (such as words), we can represent each instance as a row and each attribute as a column in an n by m matrix. In order for the product of the two factors to result in an n by m matrix, they must be of size n by x and x by m for some x, which is the number of features. One factor is the weights matrix, which has a row for each instance and a column for each feature, indicating the weight of each feature for each instance (e.g., how strong a particular theme is in a particular document). The other is the features matrix, with a row for each feature and a column for each word, indicating the contribution of each word to each feature.

In order to use non-negative matrix factorization, we have to choose the number of features (x). Two matrices are created, n by x and x by m, and initialized with random values. Then, repeatedly perform the following calculation, which uses four new matrices fn, fd, tn, and td:

1. hn = the transposed weights matrix multiplied by the data matrix
2. hd = the transposed weights matrix multiplied by the weights matrix multiplied by the features matrix
3. wn = the data matrix multiplied by the transposed features matrix
4. wd = the weights matrix multiplied by the features matrix multiplied by the transposed features matrix
5. multiply every value in the features matrix by the corresponding value in hn
6. divide every value in the features matrix by the corresponding value in hd
7. multiply every value in the weights matrix by the corresponding value in wn
8. divide every value in the weights matrix by the corresponding value in wd

Each time, the weights and features matrix approach a correct factorization. Terminate when their product is exactly equal to the data matrix or when a specified maximum number of iterations is performed.

I ran the PCI code to extract 20 features from news stories. These are the 20 features the algorithm found, and the top 3 articles that contain those features, along with their weights.

['reuters', 'results', 'nbsp', 'bank', 'washington', 'times']
(26.956935368561908, u'US to unveil bank test results May 4 - Reuters')
(13.081470034098839, u'US discovers violations in surveillance program - Reuters')
(2.3762249288329689, u'Obama Unveils High-Speed Rail Plan - New York Times')

['marriage', 'paterson', 'york', 'legislation', 'legalize', 'same']
(27.324936892030976, u'Gov. Paterson introduces legislation to legalize gay marriage in ... - New York Daily News')
(7.6426727409171393, u'NY governor to push same-sex marriage bill')
(5.2740397524569902, u'City Room: Paterson Unveils Same-Sex Marriage Bill')

['obama', 'high', 'speed', 'rail', 'unveils', 'plan']
(28.840037916458279, u'Obama Unveils High-Speed Rail Plan - New York Times')
(10.256907276345128, u'Obama Unveils High-Speed Rail Plan')
(9.6323576369348576, u'Obama unveils high-speed rail plan')

['profit', 'nokia', 'nbsp', 'jpmorgan', 'journal', 'falls']
(24.705880143866498, u'Despite Profit Decline, Optimistic Words at Nokia - New York Times')
(23.127917441772848, u'JPMorgan's Profit Falls 10% But Beats Expectations - Wall Street Journal')
(5.5584765190223457, u'JPMorgan Chase earns $2.1B')

['kutcher', 'twitter', 'spears', 'news', 'race', 'nbsp']
(30.636814120256368, u'Twitter Race to 1 Million Followers: Can Kutcher Beat CNN and Spears? - ABC News')
(8.120470265490475, u'Twitter race between Kutcher and CNN narrows')
(1.1119497863206529, u'Why the Spam Carbon Footprint Study is Wrong - PC Magazine')

['news', 'nbsp', 'spam', 'pfizer', 'deal', 'articles']
(15.73995740082424, u'USA Today owner's profits slump - BBC News')
(15.473341233296507, u'Why the Spam Carbon Footprint Study is Wrong - PC Magazine')
(15.370063134872753, u'Pfizer, GlaxoSmithKline to launch HIV company - Bizjournals.com')

['demjanjuk', 'case', 'reopen', 'request', 'denied', 'deportation']
(28.515164940519234, u'Request to reopen Demjanjuk's deportation case denied - CNN')
(10.451745519268284, u"Accused Nazi's Request to Reopen Case Denied")
(1.2391981320144818, u'Prosecutor: Drop case against Bush officials')

['madden', 'coach', 'john', 'news', 'thursday', 'announced']
(20.409454524464191, u'Madden announces retirement - CNN')
(9.027209585795358, u'Football legend John Madden to retire')
(8.5345707222411189, u'NFL Broadcaster John Madden Retires')

['quest', 'million', 'misbranding', 'today', 'test', 'nbsp']
(34.409122200044429, u'Quest Pays $302 Million for Misbranding PTH Test - Renal Business Today')
(5.6338445286364731, u'Deadly attacks mar opening of Indian election')
(3.0262093117167779, u'USA Today owner's profits slump - BBC News')

['drug', 'epilepsy', 'nbsp', 'lower', 'york', 'times']
(27.800010078553523, u'IQ Harmed by Epilepsy Drug in Utero - New York Times')
(6.5091285086544897, u'Pfizer, GlaxoSmithKline to launch HIV company - Bizjournals.com')
(5.7316147346691162, u'Colombia Operation Nabs Most-Wanted Drug Lord')

['suleman', 'octomom', 'nbsp', 'show', 'reports', 'associated']
(27.434627505962908, u'Octomom reality show reports premature - Hollywood Reporter')
(6.4532931828328888, u"Nadya Suleman Wants to Trademark 'Octomom' Nickname")
(2.3115550604267008, u'Billy Ray not amused by Foxx's Miley jokes - msnbc.com')

['housing', 'claims', 'jobless', 'bloomberg', 'starts', 'fall']
(31.683114456995323, u'US Economy: Jobless Claims Fall, Housing Stabilizes - Bloomberg')
(5.1440452349075896, u'New Jobless Claims Drop for Second Straight Week- March Housing Construction Falls 10.8 Percent')
(3.0607894881438384, u'USA Today owner's profits slump - BBC News')

['times', 'robinson', 'york', 'nbsp', 'news', 'years']
(31.302811635807451, u'Sports of The Times Timeline Stretches 62 Years, From Robinson to ... - New York Times')
(9.0137397723407844, u'First-Round NBA Playoff Pairings - New York Times')
(4.1116560554028121, u'Madden announces retirement - CNN')

['growth', 'general', 'files', 'bankruptcy', 'mall', 'owner']
(34.087252834044925, u'Glendale Galleria mall owner General Growth files record bankruptcy - Los Angeles Times')
(3.4004102790728008, u"China's economic growth falls")
(2.6035002478418461, u'FOXBusiness: JPMorgan Posts $2.1 Billion Profit- FOXBusiness: Mall Giant Files for Bankruptcy')

['with', 'that', 'thursday', 'from', 'said', 'have']
(8.0266184808591703, u'Maersk crew reunites with family members')
(7.3256882178409031, u'Obama: Latin America on equal footing with U.S.')
(7.0650147758525792, u'N. Korea orders out nuclear inspectors')

['israel', 'state', 'nbsp', 'quot', 'news', 'minister']
(26.182178132567305, u'Israel demands recognition - Aljazeera.net')
(5.3592694059489618, u"Israel's pragmatic thug")
(4.6806406110833372, u'Just how bad off is the Republican Party (Part 2)?')

['veterans', 'security', 'apologizes', 'homeland', 'groups', 'report']
(28.392368413776488, u'Homeland security chief apologizes to veterans groups - CNN')
(13.277877485904455, u"DHS Sec'y Apologizes to Vets")
(10.06749268131458, u'Homeland security chief apologizes')

['chechnya', 'russia', 'nbsp', 'operation', 'guardian', 'timeline']
(29.64479055909738, u'Chechnya and Russia: timeline - guardian.co.uk')
(10.730427010763846, u"Russia declares end to Chechnya 'operation'")
(1.6589568347684713, u'Russia Ends Operations in Chechnya')

['thailand', 'nbsp', 'after', 'government', 'telegraph', 'political']
(28.958532247175231, u'Thailand's Anti-Government Protests End, but Questions Linger - Voice of America')
(3.8000889983596569, u'Billy Ray not amused by Foxx's Miley jokes - msnbc.com')
(1.841203744693537, u'World Briefing | Asia: Sri Lanka: Conflicting Reports After Cease-Fire Ends')

['monte', 'federer', 'carlo', 'round', 'third', 'nbsp']
(27.729401591662743, u'Federer loses in third round in Monte Carlo - The Associated Press')
(7.6970641509680853, u'First-Round NBA Playoff Pairings - New York Times')
(1.9373164154547438, u'US discovers violations in surveillance program - Reuters')

Each group represents a feature. The list of words is a short list of the words that contribute the most to an article's score in that feature. The three articles listed are the top three articles for that feature, along with the numbers that display how strongly they exhibit the feature. A higher number indicates that the article is a better example of the feature.

Data Mining