Data Mining: Portfolio Assignment 2

1. Python

Went through the code from Chapter 2, including del.icio.us and MovieLens examples.

Exercise 1: Tanimoto Coefficient

The Tanimoto Coefficient is an extended form of the Jaccard Coefficient, which is a value representing the similarity of two sets:

J(A,B) = size(INTERSECTION(A,B)) / size(UNION(A,B))
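
As a minimal sketch of the formula above, Python's built-in set operations are enough. The function name jaccard and the example tag sets are made up for illustration and are not taken from the book's code; returning 0.0 for two empty sets is an arbitrary convention chosen here.

```python
def jaccard(a, b):
    """Jaccard coefficient of two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Both sets are empty, so the ratio is undefined; 0.0 is an assumption
        return 0.0
    return len(a & b) / len(union)

# Hypothetical tag sets for two users
user1 = {'python', 'weka', 'datamining'}
user2 = {'python', 'datamining', 'statistics', 'ml'}

# 2 shared tags out of 5 distinct tags
print(jaccard(user1, user2))
```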

The Tanimoto Coefficient generalizes this from sets to vectors, so it can be applied to numeric values. It is often used for binary values, in which case it is equivalent to the Jaccard Coefficient:

T(A,B) = DOTPRODUCT(A,B) / ( DOTPRODUCT(A,A) + DOTPRODUCT(B,B) - DOTPRODUCT(A,B) )

...where DOTPRODUCT(A,B) = A[1]B[1] + A[2]B[2] + ... + A[n]B[n], for vectors A and B with size n.

In Python:

# Compute the dot product of two sparse vectors stored as dictionaries
def dotProduct(d1, d2):
    result = 0.0
    for item in d1:
        # Items missing from either dictionary count as zero
        if item in d2:
            result += d1[item] * d2[item]
    return result

# Compute the Tanimoto coefficient for two people in prefs
def sim_tanimoto(prefs, person1, person2):
    a = prefs[person1]
    b = prefs[person2]
    num = dotProduct(a, b)
    den = dotProduct(a, a) + dotProduct(b, b) - num
    # Both vectors are empty or all zero: avoid dividing by zero
    if den == 0:
        return 0.0
    return num / den

This can be used as the similarity function:

>>> getRecommendations(critics, 'Toby', sim_tanimoto)
[(3.3856740061121022, 'The Night Listener'), (2.8132359267203459, 'Lady in the Water'), (2.3460656698130458, 'Just My Luck')]
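
One way to check the earlier remark that the Tanimoto Coefficient equals the Jaccard Coefficient for binary values is to compute both on the same data. The sketch below is standalone: it redefines the dot product instead of reusing prefs, and the two binary vectors are hypothetical.

```python
def dot_product(d1, d2):
    # Keys missing from either dictionary contribute zero
    return sum(v * d2[k] for k, v in d1.items() if k in d2)

def tanimoto(a, b):
    num = dot_product(a, b)
    den = dot_product(a, a) + dot_product(b, b) - num
    return num / den if den else 0.0

# Hypothetical binary vectors: 1 means "this user has the item"
a = {'x': 1, 'y': 1, 'z': 1}
b = {'y': 1, 'z': 1, 'w': 1, 'v': 1}

# Jaccard on the key sets: 2 shared items out of 5 distinct items
jaccard_value = len(set(a) & set(b)) / len(set(a) | set(b))

print(tanimoto(a, b), jaccard_value)
```

With 0/1 entries, DOTPRODUCT(A,B) counts the shared items and DOTPRODUCT(A,A) counts the items in A, so the denominator is exactly the size of the union.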

For vectors with non-negative entries, such as ratings, the Tanimoto Coefficient is always a value between 0 and 1: it is 0 if there are no elements that are nonzero in both vectors and 1 if the two vectors are identical. Unlike the Pearson Coefficient, it is not a measure of linear correlation and does not correct for grade inflation.

An interesting feature of the Tanimoto Coefficient is that adding values that are not in either set (i.e. which are zero in both vectors) does not affect the result. The Pearson Coefficient behaves differently: padding both lists with identical values generally changes it, and for rating data it typically increases it. However, in the recommendation system this difference is not relevant, because the Pearson similarity there only considers items rated by both critics (otherwise, each movie that neither of two critics had rated would increase the similarity between those critics).
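
The effect of padding with shared zeros can be sketched with two small functions operating on plain lists. This is an illustration only: the ratings are hypothetical, and pearson here is a textbook implementation over full lists, not the book's sim_pearson (which restricts itself to shared items).

```python
from math import sqrt

def tanimoto(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical ratings by two critics for the same three movies
x = [4.0, 2.0, 5.0]
y = [3.0, 4.0, 3.5]

# Pad both lists with two movies that neither critic rated, encoded as 0
x_pad = x + [0.0, 0.0]
y_pad = y + [0.0, 0.0]

print(tanimoto(x, y), tanimoto(x_pad, y_pad))  # unchanged by the padding
print(pearson(x, y), pearson(x_pad, y_pad))    # negative before, strongly positive after
```

The padded (0, 0) pairs sit far from the real ratings, so they pull both lists in the same direction and dominate the correlation, while contributing nothing to any of the dot products.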

2. Weka

Ran C4.5 (J4.8) on the Cleveland Heart Disease Dataset:


=== Run information ===

Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     cleveland-14-heart-disease
Instances:    303
Attributes:   14
             age
             sex
             cp
             trestbps
             chol
             fbs
             restecg
             thalach
             exang
             oldpeak
             slope
             ca
             thal
             num
Test mode:    evaluate on training data

=== Classifier model (full training set) ===

J48 pruned tree
------------------

thal = fixed_defect
|   ca <= 0
|   |   exang = no: <50 (...)
|   |   exang = yes: >50_1 (3.06/1.0)
|   ca > 0: >50_1 (10.0)
thal = normal
|   ca <= 0: <50 (...)
|   ca > 0
|   |   cp = typ_angina
|   |   |   trestbps <= 138: >50_1 (4.0/1.0)
|   |   |   trestbps > 138: <50 (...)
|   |   cp = asympt: >50_1 (20.0/3.0)
|   |   cp = non_anginal: <50 (...)
|   |   cp = atyp_angina
|   |   |   restecg = left_vent_hyper
|   |   |   |   exang = no: >50_1 (3.0)
|   |   |   |   exang = yes: <50 (...)
|   |   |   restecg = normal: <50 (...)
|   |   |   restecg = st_t_wave_abnormality: <50 (...)
thal = reversable_defect
|   cp = typ_angina
|   |   ... <= 229: <50 (...)
|   |   ... > 229
|   |   |   age <= 48: >50_1 (2.0)
|   |   |   age > 48: <50 (...)
|   cp = asympt
|   |   oldpeak <= 0.6
|   |   |   restecg = left_vent_hyper: >50_1 (8.0/1.0)
|   |   |   restecg = normal
|   |   |   |   trestbps <= 136
|   |   |   |   |   ca <= 0: <50 (...)
|   |   |   |   |   ca > 0
|   |   |   |   |   |   thalach <= 151: <50 (...)
|   |   |   |   |   |   thalach > 151: >50_1 (3.0)
|   |   |   |   trestbps > 136: >50_1 (4.0)
|   |   |   restecg = st_t_wave_abnormality: >50_1 (0.0)
|   |   oldpeak > 0.6: >50_1 (57.39)
|   cp = non_anginal
|   |   slope = up: <50 (...)
|   |   slope = flat
|   |   |   ca <= 0
|   |   |   |   ... <= 122: <50 (...)
|   |   |   |   ... > 122: >50_1 (3.0)
|   |   |   ca > 0: >50_1 (8.0/1.0)
|   |   slope = down: <50 (...)
|   cp = atyp_angina
|   |   ca <= 0
|   |   |   ... <= 0.1: <50 (...)
|   |   |   ... > 0.1: >50_1 (2.75/0.75)
|   |   ca > 0: >50_1 (2.25/0.25)

Number of Leaves  :     30

Size of the tree :     51


Time taken to build model: 0.02 seconds

=== Evaluation on training set ===
=== Summary ===

Correctly Classified Instances         279               92.0792 %
Incorrectly Classified Instances        24                7.9208 %
Kappa statistic                          0.8396
Mean absolute error                      0.0532
Root mean squared error                  0.1624
Relative absolute error                 26.5595 %
Root relative squared error             51.542  %
Total Number of Instances              303   

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
 0.952     0.116      0.908     0.952     0.929      0.952    <50
 ...       ...        ...       ...       ...        ...      >50_1
 0         0          0         0         0          ?        >50_2
 0         0          0         0         0          ?        >50_3
 0         0          0         0         0          ?        >50_4

=== Confusion Matrix ===

  a   b   c   d   e   <-- classified as
157   8   0   0   0 |   a = <50
... ... ... ... ... |   b = >50_1
  0   0   0   0   0 |   c = >50_2
  0   0   0   0   0 |   d = >50_3
  0   0   0   0   0 |   e = >50_4

...


=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances         235               77.5578 %
Incorrectly Classified Instances        68               22.4422 %
Kappa statistic                          0.5443
Mean absolute error                      0.1044
Root mean squared error                  0.2725
Relative absolute error                 52.0476 %
Root relative squared error             86.5075 %
Total Number of Instances              303   

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
 0.83      0.29       0.774     0.83      0.801      0.809    <50
 ...       ...        ...       ...       ...        ...      >50_1
 0         0          0         0         0          ?        >50_2
 0         0          0         0         0          ?        >50_3
 0         0          0         0         0          ?        >50_4

=== Confusion Matrix ===

  a   b   c   d   e   <-- classified as
137  28   0   0   0 |   a = <50
... ... ... ... ... |   b = >50_1
  0   0   0   0   0 |   c = >50_2
  0   0   0   0   0 |   d = >50_3
  0   0   0   0   0 |   e = >50_4

The decision tree generated by the J4.8 algorithm correctly classifies 92% of the instances in the original dataset. Because the tree is evaluated here on the same data it was trained on, this figure is optimistic and is best read as a rough upper bound on the accuracy to expect on new data. Evaluating the model using 10-fold cross-validation yields a 77.6% success rate, which is likely to be a more appropriate estimate of the model's accuracy on unseen data.

The false positive and true positive rates for the <50 class (0.29 and 0.83) show that the model will incorrectly classify 29% of patients at higher risk and 17% of patients at lower risk. Note that the precision scores for the <50 (num = 0) and >50_1 (num = 1) classes are 77.4% and 77.8% respectively, so classifications of patients not in the original dataset are approximately equally likely to be accurate regardless of the result.
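
The rates quoted above can be re-derived from the cross-validation counts. The sketch below is an inference, not Weka output: only the <50 row of the confusion matrix (137 and 28) and the summary totals come from the run, and the >50_1 row is reconstructed under the assumption that every instance belongs to, and is classified as, one of those two classes. All variable names are hypothetical.

```python
total, correct = 303, 235      # from the cross-validation summary
a_as_a, a_as_b = 137, 28       # <50 row of the confusion matrix

# Assumption: only the classes <50 and >50_1 occur, as actual or predicted labels
b_total = total - (a_as_a + a_as_b)   # actual >50_1 instances
b_as_b = correct - a_as_a             # >50_1 classified correctly
b_as_a = b_total - b_as_b             # >50_1 misclassified as <50

tp_rate_a = a_as_a / (a_as_a + a_as_b)    # share of lower-risk patients classified correctly
fp_rate_a = b_as_a / b_total              # share of higher-risk patients classified as <50
precision_a = a_as_a / (a_as_a + b_as_a)  # how often a "<50" prediction is right
precision_b = b_as_b / (b_as_b + a_as_b)  # how often a ">50_1" prediction is right

# Kappa compares observed agreement with the agreement expected by chance
p_observed = correct / total
p_chance = ((a_as_a + a_as_b) * (a_as_a + b_as_a)
            + b_total * (b_as_b + a_as_b)) / total ** 2
kappa = (p_observed - p_chance) / (1 - p_chance)

print(round(tp_rate_a, 3), round(fp_rate_a, 3))
print(round(precision_a, 3), round(precision_b, 3))
print(round(kappa, 4))
```

Under that assumption the derived values agree with the reported TP rate (0.83), FP rate (0.29), precision figures (77.4% and 77.8%) and Kappa statistic (0.5443), which suggests the reconstructed row is consistent with the run.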

Data Mining

Monday, February 2, 2009
