Data Mining: February 2009

Friday, February 20, 2009

Portfolio Assignment 4 (2/3)

1. PCI Chapter 3

Went through the chapter on spidering and clustering. Using the downloaded code, generated a second dataset using the same list of feeds. Below are two dendrograms, the first generated using the original dataset (that appears in the book) and the second generated using my dataset.

The structures of the dendrograms are noticably different, but there are similar clusters. The first dendrogram contains a cluster of five blogs (The Unofficial Apple Weblog, Download Squad, Autoblog, Joystiq, and Engadget) that all remain close together in the second, though a few other blogs are grouped together with them. This group is indicated by the blue box in both diagrams. We can also see a larger cluster of many web-related blogs, most of which (including Blogger's Blog, Publishing 2.0, and several Google and search-related blogs) remain close in the second dendrogram. This group, indicated in green, seems to vary more between the two diagrams.

2. Clustering: Capitol Words data

I looked through the APIs available at programmableweb.com and decided to build a dataset using Capitol Words, which tracks word usage by members of the U.S. Congress. I obtained a list of legislators from the Sunlight Labs API, and used the Python interface to the Capitol Words API to build a list of 112 current and recent senators' most frequently used words. For each senator, I downloaded their 250 most used words along with the number of times they've used those words. The resulting dataset contains 5969 words, and the number of times each senator has used that word, according to CW, if it is in their top 250 words (0 if otherwise).

We can use this dataset and the PCI code to run several clustering algorithms. For example, k-means clustering with k = 5 returns this result:

['"Sen. Charles E. Grassley(R--IA)"', '"Sen. Daniel Kahikina Akaka(D--HI)"', '"Sen. Susan M. Collins(R--ME)"', '"Sen. Christopher J. Dodd(D--CT)"', '"Sen. Hillary Rodham Clinton(D--NY)"', '"Sen. Tim P. Johnson(D--SD)"']

['"Sen. Michael Bennet(D--CO)"', '"Sen. Arlen Specter(R--PA)"', '"Sen. James H. Webb(D--VA)"', '"Sen. Evan Bayh(D--IN)"', '"Sen. Russell D. Feingold(D--WI)"', '"Sen. Kay R. Hagan(D--NC)"']

['"Sen. Jeanne Shaheen(D--NH)"', '"Sen. Pete V. Domenici(R--NM)"', '"Sen. Harry M. Reid(D--NV)"', '"Sen. Robert C. Byrd(D--WV)"', '"Sen. Thomas Harkin(D--IA)"', '"Sen. Mark E. Udall(D--CO)"', '"Sen. Bernard Sanders(I--VT)"', '"Sen. Daniel K. Inouye(D--HI)"', '"Sen. Barbara A. Mikulski(D--MD)"', '"Sen. Charles T. Hagel(R--NE)"', '"Sen. Edward M. Kennedy(D--MA)"', '"Sen. Sheldon Whitehouse(D--RI)"', '"Sen. John D. Rockefeller(D--WV)"', '"Sen. Thomas Allen Coburn(R--OK)"', '"Sen. Thomas Richard Carper(D--DE)"', '"Sen. Dianne Feinstein(D--CA)"', '"Sen. Sherrod C. Brown(D--OH)"', '"Sen. Mark R. Warner(D--VA)"', '"Sen. Joseph R. Biden(D--DE)"', '"Sen. John E. Sununu(R--NH)"', '"Sen. Carl Levin(D--MI)"', '"Sen. Jeff Bingaman(D--NM)"', '"Sen. John H. Isakson(R--GA)"', '"Sen. Amy Klobuchar(D--MN)"', '"Sen. Max S. Baucus(D--MT)"', '"Sen. Tom S. Udall(D--NM)"', '"Sen. Frank R. Lautenberg(D--NJ)"', '"Sen. Mary L. Landrieu(D--LA)"', '"Sen. Richard G. Lugar(R--IN)"', '"Sen. Benjamin L. Cardin(D--MD)"', '"Sen. Robert Menendez(D--NJ)"', '"Sen. Elizabeth H. Dole(R--NC)"', '"Sen. Jon Tester(D--MT)"', '"Sen. E. Benjamin Nelson(D--NE)"', '"Sen. Ted Stevens(R--AK)"', '"Sen. Barack H. Obama(D--IL)"', '"Sen. Bill Nelson(D--FL)"', '"Sen. Maria Cantwell(D--WA)"', '"Sen. Larry E. Craig(R--ID)"', '"Sen. Patty Murray(D--WA)"', '"Sen. Edward E. Kaufman(D--DE)"', '"Sen. Ken Salazar(D--CO)"', '"Sen. Mark Begich(D--AK)"', '"Sen. John William Warner(R--VA)"', '"Sen. Roland W. Burris(D--IL)"']

['"Sen. Pat Roberts(R--KS)"', '"Sen. Jim W. DeMint(R--SC)"', '"Sen. Samuel D. Brownback(R--KS)"', '"Sen. Michael B. Enzi(R--WY)"', '"Sen. George V. Voinovich(R--OH)"', '"Sen. Richard C. Shelby(R--AL)"', '"Sen. Kent Conrad(D--ND)"', '"Sen. Robert F. Bennett(R--UT)"', '"Sen. Melquiades Rafael Martinez(R--FL)"', '"Sen. Claire McCaskill(D--MO)"', '"Sen. Bob Corker(R--TN)"', '"Sen. Lisa A. Murkowski(R--AK)"', '"Sen. Wayne A. Allard(R--CO)"', '"Sen. Lamar Alexander(R--TN)"', '"Sen. John Francis Reed(D--RI)"', '"Sen. John Eric Ensign(R--NV)"', '"Sen. Olympia Jean Snowe(R--ME)"', '"Sen. David B. Vitter(R--LA)"', '"Sen. C. Saxby Chambliss(R--GA)"', '"Sen. Lindsey O. Graham(R--SC)"', '"Sen. Charles E. Schumer(D--NY)"', '"Sen. John R. Thune(R--SD)"', '"Sen. Mitch McConnell(R--KY)"', '"Sen. Kay Bailey Hutchison(R--TX)"', '"Sen. James E. Risch(R--ID)"', '"Sen. Byron L. Dorgan(D--ND)"', '"Sen. John Sidney McCain(R--AZ)"', '"Sen. Jon Kyl(R--AZ)"', '"Sen. John Barrasso(R--WY)"', '"Sen. Jim Bunning(R--KY)"', '"Sen. John Cornyn(R--TX)"', '"Sen. Roger F. Wicker(R--MS)"', '"Sen. Christopher S. Bond(R--MO)"', '"Sen. Mike O. Johanns(R--NE)"', '"Sen. Joseph I. Lieberman(I--CT)"', '"Sen. Thad Cochran(R--MS)"']

['"Sen. Michael D. Crapo(R--ID)"', '"Sen. Gordon Harold Smith(R--OR)"', '"Sen. Barbara Boxer(D--CA)"', '"Sen. Patrick J. Leahy(D--VT)"', '"Sen. Mark Lunsford Pryor(D--AR)"', '"Sen. Debbie Ann Stabenow(D--MI)"', '"Sen. Herbert H. Kohl(D--WI)"', '"Sen. Norm Coleman(R--MN)"', '"Sen. Judd A. Gregg(R--NH)"', '"Sen. Richard M. Burr(R--NC)"']

We can also use heirarchical clustering, which generates the following dendrogram:

We can also use the book's method for visualizing data in two dimensions:

All of these methods show that the senators, as represented by this dataset, are roughly evenly spaced. However, there are some small clusters in which one party predominates, or a few locations in which senators from the same areas are close together. For example, Arizona Republicans John McCain and Jon Kyl are grouped together by both algorithms. This is particularly evident in the dendrogram, and I've marked a few clusters which show this.

Part of the reason that the senators are so evenly spaced may be that, because of the way the dataset was constructed, there is not enough overlapping data between them. The only data used to construct the dataset was each senator's top 250 words, so each senator has a maximum of 250 nonzero entries. (The reason I constructed it in this way is that the CW API doesn't provide a direct way to get the number of times a senator has used a particular word. The only option is to get the top X words a senator has used, and check to see if the particular word is in the list.)

A better dataset might be obtained by increasing the number of words to search through for each senator. However, including more words in the wordlist would quickly make the dataset larger and more difficult to process. A good strategy might be selecting the wordlist that is the union of each senator's top 250 (or possibly even less) words, and checking that wordlist against each senator's top 500 (or 1000, or more) words to obtain frequencies.

Thursday, February 12, 2009

Portfolio Assignment 3

Team 5's Last.fm recommendation system is written in C#, and presents a Windows Forms-based front-end for a few useful functions of the Last.fm API. The application lets the user execute three different searches: search for an artist, list albums by an artist, and get similar artists.

(All Lastfm objects and methods mentioned are in the Lastfm.Services namespace.)

The user types the name of an artist and selects "List Artist", "List Album", or "Recommend Artist" from a drop down box. Search results are displayed in the box below.

List Artist performs an ArtistSearch with the user's text as a query, gets the first page of results, represented as a collection of Artist objects, and iterates through them to print a list of artists whose names match the user's input.

List Album, which uses an AlbumSearch to get Album objects, produces a list of the most popular albums by a particular artist (like the artist search, it only lists the first page, so it might not return all of an artist's albums), in order.

Recommend Artist takes an artist's name as input, performs an ArtistSearch and gets the first result to get the Artist object representing that artist, then uses Artist.GetSimilar to list the most similar artists to that artist.

Each search displays a list of artist or album names, but allows the user to access more information using an embedded web browser. If the user double-clicks on an artist result, the artist's name is used to generate an Artist object, then an ArtistBio. Then the "Web Results" tab is selected and navigates to the appropriate page using ArtistBio.GetURL. Double-clicking on an album result (which take the form " - ") splits the result string into the artist name and the album name, gets the Artist by that name, then gets the Album by that artist with the appropriate name, then calls Ablum.GetURL and sends the user to the track listing for that album.

Example:

Suppose I'm looking for recommendations, so I enter in an artist I like -- let's use Iron Maiden -- and select "Recommended Artist":

The most similar artist I'm not very familiar with is Saxon. I'll search for their most popular albums:

If I double-click the most popular album listed, I can get a track listing for "Denim And Leather":

So, using this application, I can go from an artist I like to a list of songs I might be interested in.

Monday, February 2, 2009

Portfolio Assignment 2

1. Python

Went through the code from Chapter 2, including del.icio.us and MovieLens examples.

Exercise 1: Tanimoto Coefficient

The Tanimoto Coefficient is an extended form of the Jaccard Coefficient, which is a value representing the similarity of two sets:

J(A,B) = size(INTERSECTION(A,B)) / size(UNION(A,B))

The Tanimoto Coefficient is an extended Jaccard Coefficient defined for vectors (as opposed to sets), meaning it can be applied to numeric values (though it is often used for binary values, in which case it is equivalent to the Jaccard Coefficient):

T(A,B) = DOTPRODUCT(A,B) / ( DOTPRODUCT(A,A) + DOTPRODUCT(B,B) - DOTPRODUCT(A,B) )

...where DOTPRODUCT(A,B) = A[1]B[1] + A[2]B[2] + ... + A[n]B[n], for vectors A and B with size n.

In Python:

# Compute the dot product of two dictionaries
def dotProduct(d1, d2):
result = 0.0
for item in d1:
   if item in d2:
       result += d1[item] * d2[item]
return result

#Compute the Tanimoto coefficient for two datasets
def sim_tanimoto(prefs, person1, person2):
a = prefs[person1]
b = prefs[person2]
num = dotProduct(a, b)
den = dotProduct(a, a) + dotProduct(b, b) - dotProduct(a, b)
return num / den

This can be used as the similarity function:

>>> getRecommendations(critics, 'Toby', sim_tanimoto)
[(3.3856740061121022, 'The Night Listener'), (2.8132359267203459, 'Lady in the Water'), (2.3460656698130458, 'Just My Luck')]

The Tanimoto Coefficient for two vectors will always be a value between 0 and 1, 0 if there are no elements that are nonzero in both vectors and 1 if the two vectors are identical. Unlike the Pearson Coefficient, it is not a strict measure of linear correlation and does not correct for grade inflation. An interesting feature of the Tanimoto Coefficient is that adding values that are not in either set (i.e. which are zero in both vectors) does not affect the result. In general, any value that is identical in both lists will increase the Pearson Coefficient between those lists. However, in our implementation for the recommendation system, this general difference is not relevant because we only consider values that are shared by both lists anyway (otherwise, each movie that neither of two critics had rated would increase the similarity between those critics).

2. Weka

Ran C4.5 (J4.8) on the Cleveland Heart Disease Dataset:


=== Run information ===

Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     cleveland-14-heart-disease
Instances:    303
Attributes:   14
             age
             sex
             cp
             trestbps
             chol
             fbs
             restecg
             thalach
             exang
             oldpeak
             slope
             ca
             thal
             num
Test mode:    evaluate on training data

=== Classifier model (full training set) ===

J48 pruned tree
------------------

thal = fixed_defect
|   ca <= 0 |   |   exang = no: <50 exang =" yes:">50_1 (3.06/1.0)
|   ca > 0: >50_1 (10.0)
thal = normal
|   ca <= 0: <50> 0
|   |   cp = typ_angina
|   |   |   trestbps <= 138: >50_1 (4.0/1.0)
|   |   |   trestbps > 138: <50 cp =" asympt:">50_1 (20.0/3.0)
|   |   cp = non_anginal: <50 cp =" atyp_angina" restecg =" left_vent_hyper" exang =" no:">50_1 (3.0)
|   |   |   |   exang = yes: <50 restecg =" normal:" restecg =" st_t_wave_abnormality:" thal =" reversable_defect" cp =" typ_angina"> 229
|   |   |   age <= 48: >50_1 (2.0)
|   |   |   age > 48: <50 cp =" asympt" restecg =" left_vent_hyper:">50_1 (8.0/1.0)
|   |   |   restecg = normal
|   |   |   |   trestbps <= 136 |   |   |   |   |   ca <= 0: <50> 0
|   |   |   |   |   |   thalach <= 151: <50> 151: >50_1 (3.0)
|   |   |   |   trestbps > 136: >50_1 (4.0)
|   |   |   restecg = st_t_wave_abnormality: >50_1 (0.0)
|   |   oldpeak > 0.6: >50_1 (57.39)
|   cp = non_anginal
|   |   slope = up: <50 slope =" flat"> 122: >50_1 (3.0)
|   |   |   ca > 0: >50_1 (8.0/1.0)
|   |   slope = down: <50 cp =" atyp_angina"> 0.1: >50_1 (2.75/0.75)
|   |   ca > 0: >50_1 (2.25/0.25)

Number of Leaves  :     30

Size of the tree :     51


Time taken to build model: 0.02 seconds

=== Evaluation on training set ===
=== Summary ===

Correctly Classified Instances         279               92.0792 %
Incorrectly Classified Instances        24                7.9208 %
Kappa statistic                          0.8396
Mean absolute error                      0.0532
Root mean squared error                  0.1624
Relative absolute error                 26.5595 %
Root relative squared error             51.542  %
Total Number of Instances              303   

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
 0.952     0.116      0.908     0.952     0.929      0.952    <50>50_1
 0         0          0         0         0          ?        >50_2
 0         0          0         0         0          ?        >50_3
 0         0          0         0         0          ?        >50_4

=== Confusion Matrix ===

  a   b   c   d   e   <-- classified as  157   8   0   0   0 |   a = <50 b =" ">50_1
  0   0   0   0   0 |   c = >50_2
  0   0   0   0   0 |   d = >50_3
  0   0   0   0   0 |   e = >50_4

...


=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances         235               77.5578 %
Incorrectly Classified Instances        68               22.4422 %
Kappa statistic                          0.5443
Mean absolute error                      0.1044
Root mean squared error                  0.2725
Relative absolute error                 52.0476 %
Root relative squared error             86.5075 %
Total Number of Instances              303   

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
 0.83      0.29       0.774     0.83      0.801      0.809    <50>50_1
 0         0          0         0         0          ?        >50_2
 0         0          0         0         0          ?        >50_3
 0         0          0         0         0          ?        >50_4

=== Confusion Matrix ===

  a   b   c   d   e   <-- classified as  137  28   0   0   0 |   a = <50 b =" ">50_1
  0   0   0   0   0 |   c = >50_2
  0   0   0   0   0 |   d = >50_3
  0   0   0   0   0 |   e = >50_4

The decision tree generated by the J4.8 algorithm correctly classifies 92% of the instances in the original dataset. This suggests that 92% is an upper bound on the accuracy of the model for a given dataset. Evaluating the model using 10-fold cross-validation yields a 77.6% success rate. This is likely to be a more appropriate estimate for the model's accuracy on a general dataset.The false positive and true positive rates show that the model will incorrectly classify 29% of patients at higher risk and 17% of patients at lower risk. Note that the precision scores for num = 0 and num = 1 are 77.4% and 77.8% respectively, so classifications of patients not in the original dataset are approximately equally likely to be accurate regardless of the result.

Data Mining