Thursday, April 16, 2009

Portfolio Assignment 8

Chapter 10 -- Finding Independent Features

The goal of independent feature extraction is to discover a set of features relating different instances of the dataset. Like clustering, it is an unsupervised learning technique. Feature extraction can take a large set of attributes, like a list of words for a corpus of documents, and describe the dataset using a much smaller set of features.

Feature extraction using non-negative matrix factorization requires that the dataset can be represented as a matrix, then finds two matrices whose product is (as close as possible to) the dataset. If the dataset has n instances (such as documents) with m attributes (such as words), we can represent each instance as a row and each attribute as a column in an n by m matrix. In order for the product of the two factors to result in an n by m matrix, they must be of size n by x and x by m for some x, which is the number of features. One factor is the weights matrix, which has a row for each instance and a column for each feature, indicating the weight of each feature for each instance (e.g., how strong a particular theme is in a particular document). The other is the features matrix, with a row for each feature and a column for each word, indicating the contribution of each word to each feature.

In order to use non-negative matrix factorization, we have to choose the number of features (x). Two matrices are created, n by x and x by m, and initialized with random values. Then, repeatedly perform the following calculation, which uses four new matrices fn, fd, tn, and td:

1. hn = the transposed weights matrix multiplied by the data matrix
2. hd = the transposed weights matrix multiplied by the weights matrix multiplied by the features matrix
3. wn = the data matrix multiplied by the transposed features matrix
4. wd = the weights matrix multiplied by the features matrix multiplied by the transposed features matrix
5. multiply every value in the features matrix by the corresponding value in hn
6. divide every value in the features matrix by the corresponding value in hd
7. multiply every value in the weights matrix by the corresponding value in wn
8. divide every value in the weights matrix by the corresponding value in wd

Each time, the weights and features matrix approach a correct factorization. Terminate when their product is exactly equal to the data matrix or when a specified maximum number of iterations is performed.

I ran the PCI code to extract 20 features from news stories. These are the 20 features the algorithm found, and the top 3 articles that contain those features, along with their weights.

['reuters', 'results', 'nbsp', 'bank', 'washington', 'times']
(26.956935368561908, u'US to unveil bank test results May 4 - Reuters')
(13.081470034098839, u'US discovers violations in surveillance program - Reuters')
(2.3762249288329689, u'Obama Unveils High-Speed Rail Plan - New York Times')

['marriage', 'paterson', 'york', 'legislation', 'legalize', 'same']
(27.324936892030976, u'Gov. Paterson introduces legislation to legalize gay marriage in ... - New York Daily News')
(7.6426727409171393, u'NY governor to push same-sex marriage bill')
(5.2740397524569902, u'City Room: Paterson Unveils Same-Sex Marriage Bill')

['obama', 'high', 'speed', 'rail', 'unveils', 'plan']
(28.840037916458279, u'Obama Unveils High-Speed Rail Plan - New York Times')
(10.256907276345128, u'Obama Unveils High-Speed Rail Plan')
(9.6323576369348576, u'Obama unveils high-speed rail plan')

['profit', 'nokia', 'nbsp', 'jpmorgan', 'journal', 'falls']
(24.705880143866498, u'Despite Profit Decline, Optimistic Words at Nokia - New York Times')
(23.127917441772848, u'JPMorgan's Profit Falls 10% But Beats Expectations - Wall Street Journal')
(5.5584765190223457, u'JPMorgan Chase earns $2.1B')

['kutcher', 'twitter', 'spears', 'news', 'race', 'nbsp']
(30.636814120256368, u'Twitter Race to 1 Million Followers: Can Kutcher Beat CNN and Spears? - ABC News')
(8.120470265490475, u'Twitter race between Kutcher and CNN narrows')
(1.1119497863206529, u'Why the Spam Carbon Footprint Study is Wrong - PC Magazine')

['news', 'nbsp', 'spam', 'pfizer', 'deal', 'articles']
(15.73995740082424, u'USA Today owner's profits slump - BBC News')
(15.473341233296507, u'Why the Spam Carbon Footprint Study is Wrong - PC Magazine')
(15.370063134872753, u'Pfizer, GlaxoSmithKline to launch HIV company - Bizjournals.com')

['demjanjuk', 'case', 'reopen', 'request', 'denied', 'deportation']
(28.515164940519234, u'Request to reopen Demjanjuk's deportation case denied - CNN')
(10.451745519268284, u"Accused Nazi's Request to Reopen Case Denied")
(1.2391981320144818, u'Prosecutor: Drop case against Bush officials')

['madden', 'coach', 'john', 'news', 'thursday', 'announced']
(20.409454524464191, u'Madden announces retirement - CNN')
(9.027209585795358, u'Football legend John Madden to retire')
(8.5345707222411189, u'NFL Broadcaster John Madden Retires')

['quest', 'million', 'misbranding', 'today', 'test', 'nbsp']
(34.409122200044429, u'Quest Pays $302 Million for Misbranding PTH Test - Renal Business Today')
(5.6338445286364731, u'Deadly attacks mar opening of Indian election')
(3.0262093117167779, u'USA Today owner's profits slump - BBC News')

['drug', 'epilepsy', 'nbsp', 'lower', 'york', 'times']
(27.800010078553523, u'IQ Harmed by Epilepsy Drug in Utero - New York Times')
(6.5091285086544897, u'Pfizer, GlaxoSmithKline to launch HIV company - Bizjournals.com')
(5.7316147346691162, u'Colombia Operation Nabs Most-Wanted Drug Lord')

['suleman', 'octomom', 'nbsp', 'show', 'reports', 'associated']
(27.434627505962908, u'Octomom reality show reports premature - Hollywood Reporter')
(6.4532931828328888, u"Nadya Suleman Wants to Trademark 'Octomom' Nickname")
(2.3115550604267008, u'Billy Ray not amused by Foxx's Miley jokes - msnbc.com')

['housing', 'claims', 'jobless', 'bloomberg', 'starts', 'fall']
(31.683114456995323, u'US Economy: Jobless Claims Fall, Housing Stabilizes - Bloomberg')
(5.1440452349075896, u'New Jobless Claims Drop for Second Straight Week- March Housing Construction Falls 10.8 Percent')
(3.0607894881438384, u'USA Today owner's profits slump - BBC News')

['times', 'robinson', 'york', 'nbsp', 'news', 'years']
(31.302811635807451, u'Sports of The Times Timeline Stretches 62 Years, From Robinson to ... - New York Times')
(9.0137397723407844, u'First-Round NBA Playoff Pairings - New York Times')
(4.1116560554028121, u'Madden announces retirement - CNN')

['growth', 'general', 'files', 'bankruptcy', 'mall', 'owner']
(34.087252834044925, u'Glendale Galleria mall owner General Growth files record bankruptcy - Los Angeles Times')
(3.4004102790728008, u"China's economic growth falls")
(2.6035002478418461, u'FOXBusiness: JPMorgan Posts $2.1 Billion Profit- FOXBusiness: Mall Giant Files for Bankruptcy')

['with', 'that', 'thursday', 'from', 'said', 'have']
(8.0266184808591703, u'Maersk crew reunites with family members')
(7.3256882178409031, u'Obama: Latin America on equal footing with U.S.')
(7.0650147758525792, u'N. Korea orders out nuclear inspectors')

['israel', 'state', 'nbsp', 'quot', 'news', 'minister']
(26.182178132567305, u'Israel demands recognition - Aljazeera.net')
(5.3592694059489618, u"Israel's pragmatic thug")
(4.6806406110833372, u'Just how bad off is the Republican Party (Part 2)?')

['veterans', 'security', 'apologizes', 'homeland', 'groups', 'report']
(28.392368413776488, u'Homeland security chief apologizes to veterans groups - CNN')
(13.277877485904455, u"DHS Sec'y Apologizes to Vets")
(10.06749268131458, u'Homeland security chief apologizes')

['chechnya', 'russia', 'nbsp', 'operation', 'guardian', 'timeline']
(29.64479055909738, u'Chechnya and Russia: timeline - guardian.co.uk')
(10.730427010763846, u"Russia declares end to Chechnya 'operation'")
(1.6589568347684713, u'Russia Ends Operations in Chechnya')

['thailand', 'nbsp', 'after', 'government', 'telegraph', 'political']
(28.958532247175231, u'Thailand's Anti-Government Protests End, but Questions Linger - Voice of America')
(3.8000889983596569, u'Billy Ray not amused by Foxx's Miley jokes - msnbc.com')
(1.841203744693537, u'World Briefing | Asia: Sri Lanka: Conflicting Reports After Cease-Fire Ends')

['monte', 'federer', 'carlo', 'round', 'third', 'nbsp']
(27.729401591662743, u'Federer loses in third round in Monte Carlo - The Associated Press')
(7.6970641509680853, u'First-Round NBA Playoff Pairings - New York Times')
(1.9373164154547438, u'US discovers violations in surveillance program - Reuters')

Each group represents a feature. The list of words is a short list of the words that contribute the most to an article's score in that feature. The three articles listed are the top three articles for that feature, along with the numbers that display how strongly they exhibit the feature. A higher number indicates that the article is a better example of the feature.

No comments:

Post a Comment