Went through the code from Chapter 2, including del.icio.us and MovieLens examples.
Exercise 1: Tanimoto Coefficient
The Tanimoto Coefficient is an extended form of the Jaccard Coefficient, which is a value representing the similarity of two sets:
J(A,B) = size(INTERSECTION(A,B)) / size(UNION(A,B))
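As a quick illustration of the formula above, here is a minimal sketch in Python. The function name `jaccard`, the empty-set convention, and the example tag sets are assumptions made for this sketch, not part of the book's code.

```python
def jaccard(a, b):
    """Jaccard Coefficient of two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty sets count as identical.
        return 1.0
    # float() keeps the division exact even under Python 2's integer division.
    return float(len(a & b)) / len(union)


# Hypothetical example data (not from the del.icio.us or MovieLens datasets).
tags_x = {"python", "ai", "recommendations"}
tags_y = {"python", "ai", "search", "ranking"}

# Two shared tags out of five distinct tags overall.
print(jaccard(tags_x, tags_y))
```

Two tags are shared and five distinct tags appear overall, so the coefficient here is 2/5.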
The Tanimoto Coefficient is defined for vectors (as opposed to sets), so it can be applied to numeric values. It is often used for binary values, in which case it is equivalent to the Jaccard Coefficient:
T(A,B) = DOTPRODUCT(A,B) / ( DOTPRODUCT(A,A) + DOTPRODUCT(B,B) - DOTPRODUCT(A,B) )
...where DOTPRODUCT(A,B) = A[1]B[1] + A[2]B[2] + ... + A[n]B[n], for vectors A and B with size n.
In Python:
# Compute the dot product of two dictionaries
def dotProduct(d1, d2):
    result = 0.0
    for item in d1:
        if item in d2:
            result += d1[item] * d2[item]
    return result

# Compute the Tanimoto coefficient for two datasets
def sim_tanimoto(prefs, person1, person2):
    a = prefs[person1]
    b = prefs[person2]
    num = dotProduct(a, b)
    den = dotProduct(a, a) + dotProduct(b, b) - num
    # Avoid dividing by zero when neither person has any nonzero ratings
    if den == 0:
        return 0
    return num / den
This can be used as the similarity function:
>>> getRecommendations(critics, 'Toby', sim_tanimoto)
[(3.3856740061121022, 'The Night Listener'), (2.8132359267203459, 'Lady in the Water'), (2.3460656698130458, 'Just My Luck')]
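The equivalence with the Jaccard Coefficient for binary values, mentioned earlier, can be checked with a small self-contained sketch. The helper names `dot` and `tanimoto` and the two sets of seen movies are assumptions for illustration. The code is independent of `prefs` and `critics`.

```python
def dot(d1, d2):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(value * d2[key] for key, value in d1.items() if key in d2)


def tanimoto(a, b):
    """Tanimoto Coefficient of two sparse vectors stored as dicts.

    Assumes at least one vector has a nonzero entry (otherwise this divides by zero).
    """
    ab = dot(a, b)
    return ab / (dot(a, a) + dot(b, b) - ab)


# Hypothetical viewing histories: which movies each person has seen.
seen_x = {"Snakes on a Plane", "Superman Returns", "You, Me and Dupree"}
seen_y = {"Superman Returns", "You, Me and Dupree",
          "The Night Listener", "Just My Luck"}

# Binary vectors: 1.0 for every movie seen, missing keys count as 0.
vec_x = dict((title, 1.0) for title in seen_x)
vec_y = dict((title, 1.0) for title in seen_y)

tanimoto_value = tanimoto(vec_x, vec_y)
jaccard_value = float(len(seen_x & seen_y)) / len(seen_x | seen_y)

print(tanimoto_value)
print(jaccard_value)
```

For binary vectors the dot product counts the shared items and each self dot product counts the items in one set, so the denominator is exactly the size of the union.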
For vectors with non-negative entries, such as ratings, the Tanimoto Coefficient is always between 0 and 1: it is 0 if no element is nonzero in both vectors and 1 if the two vectors are identical. Unlike the Pearson Coefficient, it is not a measure of linear correlation and does not correct for grade inflation. An interesting feature of the Tanimoto Coefficient is that adding elements that are in neither set (i.e. that are zero in both vectors) does not affect the result. In contrast, adding a value that is identical in both lists generally does change the Pearson Coefficient between those lists, typically increasing it. However, in our implementation for the recommendation system, this difference is not relevant, because the Pearson similarity only considers items that both critics have rated (otherwise, each movie that neither of two critics had rated would increase the similarity between those critics).
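To make the point about zero entries concrete, here is a minimal sketch that pads two rating vectors with zeros and compares both measures before and after. The ratings are made up, and `pearson` is a plain textbook implementation over full vectors (an assumption for this sketch), not the shared-items version used by the recommender.

```python
from math import sqrt


def tanimoto(u, v):
    """Tanimoto Coefficient of two equal-length lists of numbers."""
    uv = sum(x * y for x, y in zip(u, v))
    uu = sum(x * x for x in u)
    vv = sum(y * y for y in v)
    return uv / (uu + vv - uv)


def pearson(u, v):
    """Pearson correlation of two equal-length lists of numbers."""
    n = float(len(u))
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((x - mean_u) * (y - mean_v) for x, y in zip(u, v))
    var_u = sum((x - mean_u) ** 2 for x in u)
    var_v = sum((y - mean_v) ** 2 for y in v)
    return cov / sqrt(var_u * var_v)


# Hypothetical ratings by two critics for the same four movies.
u = [2.5, 3.5, 3.0, 4.0]
v = [3.0, 3.0, 4.5, 2.5]

# The same ratings plus three movies that neither critic has rated.
padded_u = u + [0.0, 0.0, 0.0]
padded_v = v + [0.0, 0.0, 0.0]

print(tanimoto(u, v), tanimoto(padded_u, padded_v))  # unchanged by the padding
print(pearson(u, v), pearson(padded_u, padded_v))    # changed by the padding
```

In this example the two critics disagree on the movies they both rated, so the Pearson value is negative before padding. The shared zeros then pull it up to a positive value, which is the spurious similarity the shared-items restriction avoids.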
Exercise 2: Weka
Ran C4.5 (J4.8) on the Cleveland Heart Disease Dataset:
=== Run information ===
Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: cleveland-14-heart-disease
Instances: 303
Attributes: 14
age
sex
cp
trestbps
chol
fbs
restecg
thalach
exang
oldpeak
slope
ca
thal
num
Test mode: evaluate on training data
=== Classifier model (full training set) ===
J48 pruned tree
------------------
thal = fixed_defect
| ca <= 0 | | exang = no: <50 exang =" yes:">50_1 (3.06/1.0)
| ca > 0: >50_1 (10.0)
thal = normal
| ca <= 0: <50> 0
| | cp = typ_angina
| | | trestbps <= 138: >50_1 (4.0/1.0)
| | | trestbps > 138: <50 cp =" asympt:">50_1 (20.0/3.0)
| | cp = non_anginal: <50 cp =" atyp_angina" restecg =" left_vent_hyper" exang =" no:">50_1 (3.0)
| | | | exang = yes: <50 restecg =" normal:" restecg =" st_t_wave_abnormality:" thal =" reversable_defect" cp =" typ_angina"> 229
| | | age <= 48: >50_1 (2.0)
| | | age > 48: <50 cp =" asympt" restecg =" left_vent_hyper:">50_1 (8.0/1.0)
| | | restecg = normal
| | | | trestbps <= 136 | | | | | ca <= 0: <50> 0
| | | | | | thalach <= 151: <50> 151: >50_1 (3.0)
| | | | trestbps > 136: >50_1 (4.0)
| | | restecg = st_t_wave_abnormality: >50_1 (0.0)
| | oldpeak > 0.6: >50_1 (57.39)
| cp = non_anginal
| | slope = up: <50 slope =" flat"> 122: >50_1 (3.0)
| | | ca > 0: >50_1 (8.0/1.0)
| | slope = down: <50 cp =" atyp_angina"> 0.1: >50_1 (2.75/0.75)
| | ca > 0: >50_1 (2.25/0.25)
Number of Leaves : 30
Size of the tree : 51
Time taken to build model: 0.02 seconds
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances 279 92.0792 %
Incorrectly Classified Instances 24 7.9208 %
Kappa statistic 0.8396
Mean absolute error 0.0532
Root mean squared error 0.1624
Relative absolute error 26.5595 %
Root relative squared error 51.542 %
Total Number of Instances 303
=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0.952    0.116    0.908      0.952   0.929      0.952     <50
0.884    0.048    0.938      0.884   0.91       0.952     >50_1
0        0        0          0       0          ?         >50_2
0        0        0          0       0          ?         >50_3
0        0        0          0       0          ?         >50_4
=== Confusion Matrix ===
   a   b   c   d   e   <-- classified as
 157   8   0   0   0 |   a = <50
  16 122   0   0   0 |   b = >50_1
   0   0   0   0   0 |   c = >50_2
   0   0   0   0   0 |   d = >50_3
   0   0   0   0   0 |   e = >50_4
...
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 235 77.5578 %
Incorrectly Classified Instances 68 22.4422 %
Kappa statistic 0.5443
Mean absolute error 0.1044
Root mean squared error 0.2725
Relative absolute error 52.0476 %
Root relative squared error 86.5075 %
Total Number of Instances 303
=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0.83     0.29     0.774      0.83    0.801      0.809     <50
0.71     0.17     0.778      0.71    0.742      0.809     >50_1
0        0        0          0       0          ?         >50_2
0        0        0          0       0          ?         >50_3
0        0        0          0       0          ?         >50_4
=== Confusion Matrix ===
   a   b   c   d   e   <-- classified as
 137  28   0   0   0 |   a = <50
  40  98   0   0   0 |   b = >50_1
   0   0   0   0   0 |   c = >50_2
   0   0   0   0   0 |   d = >50_3
   0   0   0   0   0 |   e = >50_4
The decision tree generated by the J4.8 algorithm correctly classifies 92% of the instances in the dataset it was trained on. Because the model is evaluated on the same data that was used to build it, this figure is an optimistic estimate of its accuracy. Evaluating the model using 10-fold cross-validation yields a 77.6% success rate, which is likely to be a more appropriate estimate of the model's accuracy on unseen data. The false positive and true positive rates for the <50 class (0.29 and 0.83) show that the model will incorrectly classify 29% of patients at higher risk and 17% of patients at lower risk. Note that the precision scores for num = 0 (<50) and num = 1 (>50_1) are 77.4% and 77.8% respectively, so a classification of a patient not in the original dataset is approximately equally likely to be correct regardless of the predicted class.
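The rates quoted above can be recomputed from the cross-validation confusion matrix with a short Python 3 sketch. The function names `rates` and `kappa` are made up for this example. The `>50_1` row (40, 98) is an assumption: it is inferred from the summary totals (303 instances, 235 correctly classified) rather than copied from the output.

```python
# Cross-validation confusion matrix for the two populated classes.
# Outer keys are actual classes, inner keys are predicted classes.
# The ">50_1" row is inferred from the totals (an assumption, see above).
matrix = {
    "<50":   {"<50": 137, ">50_1": 28},
    ">50_1": {"<50": 40,  ">50_1": 98},
}


def rates(matrix, cls):
    """Return (TP rate, FP rate, precision) for one class."""
    tp = matrix[cls][cls]
    fn = sum(n for pred, n in matrix[cls].items() if pred != cls)
    fp = sum(row[cls] for actual, row in matrix.items() if actual != cls)
    tn = sum(n for actual, row in matrix.items() if actual != cls
             for pred, n in row.items() if pred != cls)
    return tp / (tp + fn), fp / (fp + tn), tp / (tp + fp)


def kappa(matrix):
    """Cohen's kappa: agreement beyond what chance alone would give."""
    total = sum(sum(row.values()) for row in matrix.values())
    observed = sum(matrix[c][c] for c in matrix) / total
    expected = sum(
        sum(matrix[c].values()) * sum(row[c] for row in matrix.values())
        for c in matrix
    ) / total ** 2
    return (observed - expected) / (1 - expected)


for cls in matrix:
    tp_rate, fp_rate, precision = rates(matrix, cls)
    print(cls, round(tp_rate, 3), round(fp_rate, 3), round(precision, 3))
print("kappa", round(kappa(matrix), 4))
```

With the inferred row, the recomputed kappa agrees with the 0.5443 reported in the cross-validation summary, which is some evidence that the inferred counts are right.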