1. Data
There was an error in the "recommendations.py" posted to the website that led to results inconsistent with the examples in the book. It contained the score critics['Michael Phillips']['Just My Luck'] = 3.0, but the data in the book did not have a rating for that critic and movie.
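One way to bring the downloaded data in line with the book is to delete the extra entry before running the examples. The following is a minimal sketch, not the actual data file: the critics dict here is a stand-in fragment, and 'Placeholder Movie' is a hypothetical entry added only so the dict is not left empty.

```python
# Stand-in fragment for the `critics` dict in recommendations.py.
# Only the 'Just My Luck' entry comes from the notes above; the other
# entry is a hypothetical placeholder for illustration.
critics = {
    'Michael Phillips': {
        'Just My Luck': 3.0,        # the extra rating found in the download
        'Placeholder Movie': 1.0,   # hypothetical, not real data
    },
}

# Drop the rating that the book's data does not contain.
# pop() with a default is a no-op if the entry is already absent.
critics['Michael Phillips'].pop('Just My Luck', None)
```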
2. Euclidean Distance function
The book gives the correct code for this function, which is similar to this:
from math import sqrt

# Calculate the Euclidean distance-based similarity score for two
# critics.
def sim_distance(prefs, person1, person2):
    # Get the list of shared items
    si = {}
    for item in prefs[person1]:
        if item in prefs[person2]:
            si[item] = 1
    # If they have no ratings in common, return 0
    if len(si) == 0:
        return 0
    # Add up the squares of all the differences
    squares = [pow(prefs[person1][item] - prefs[person2][item], 2)
               for item in si]
    return 1/(1+sqrt(sum(squares)))
However, several of the examples in the book appear to have been produced by a version that does not take the square root of the sum of squares, as if the last line were:
return 1/(1+sum(squares))
For the examples where my results differed from the book's, making this change made them match. However, the first version is correct: Euclidean distance is the square root of the sum of the squared differences.
For example:
>>> sim_distance(critics, 'Lisa Rose', 'Gene Seymour')
0.29429805508554946
The book gives the result 0.148148148148 for this similarity score. (The sum of squared differences is 5.75; 1/(1+5.75) = 0.148, while 1/(1+sqrt(5.75)) = 0.294.)
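The arithmetic above can be checked directly. This short sketch uses only the sum of squared differences quoted in the notes (5.75); the variable names are illustrative.

```python
from math import sqrt

sum_of_squares = 5.75  # sum of squared rating differences quoted above

# Variant the book's examples seem to have used (no square root)
without_root = 1 / (1 + sum_of_squares)

# Euclidean distance-based score (square root included)
with_root = 1 / (1 + sqrt(sum_of_squares))

print(round(without_root, 3))  # 0.148
print(round(with_root, 3))     # 0.294
```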
3. Pearson Coefficient
Successfully tested with the music grade/ITBS score data; got the same results as in class using a spreadsheet:
scores = {
    'band' : {
        1 : 3, 2 : 4, 3 : 2.5, 4 : 2, 5 : 4, 6 : 3, 7 : 3.5, 8 : 4,
        9 : 3, 10 : 4, 11 : 4.5, 12 : 2, 13 : 3, 14 : 4, 15 : 2.5,
        16 : 2, 17 : 4, 18 : 3, 19 : 3.5, 20 : 4, 21 : 3, 22 : 4,
        23 : 4.5, 24 : 2
    },
    'math' : {
        1 : 220, 2 : 240, 3 : 210, 4 : 215, 5 : 260, 6 : 230,
        7 : 240, 8 : 259, 9 : 245, 10 : 280, 11 : 300, 12 : 230,
        13 : 250, 14 : 275, 15 : 200, 16 : 200, 17 : 290, 18 : 250,
        19 : 270, 20 : 260, 21 : 260, 22 : 270, 23 : 310, 24 : 170
    },
    'language' : {
        1 : 215, 2 : 210, 3 : 250, 4 : 230, 5 : 240, 6 : 270,
        7 : 240, 8 : 220, 9 : 230, 10 : 270, 11 : 260, 12 : 250,
        13 : 245, 14 : 235, 15 : 260, 16 : 255, 17 : 250, 18 : 230,
        19 : 245, 20 : 270, 21 : 270, 22 : 250, 23 : 260, 24 : 250
    },
    'science' : {
        1 : 220, 2 : 225, 3 : 235, 4 : 210, 5 : 220, 6 : 250,
        7 : 220, 8 : 240, 9 : 250, 10 : 230, 11 : 220, 12 : 225,
        13 : 235, 14 : 210, 15 : 220, 16 : 250, 17 : 220,
        18 : 240, 19 : 250, 20 : 230, 21 : 260, 22 : 230,
        23 : 260, 24 : 220
    }
}
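The sim_pearson function used in this test is not listed in these notes. The following is a minimal sketch of a Pearson correlation score with the same call signature, assuming the same nested-dict layout as scores. It is an illustration and not necessarily the exact code that produced the results here.

```python
from math import sqrt

def sim_pearson(prefs, p1, p2):
    # Keys rated by both p1 and p2
    shared = [item for item in prefs[p1] if item in prefs[p2]]
    n = len(shared)
    if n == 0:
        return 0

    x = [prefs[p1][item] for item in shared]
    y = [prefs[p2][item] for item in shared]

    # Sums, sums of squares, and sum of products
    sum_x, sum_y = sum(x), sum(y)
    sum_x2 = sum(v * v for v in x)
    sum_y2 = sum(v * v for v in y)
    sum_xy = sum(a * b for a, b in zip(x, y))

    # Pearson r; float(n) avoids integer division on older Pythons
    num = sum_xy - sum_x * sum_y / float(n)
    den = sqrt((sum_x2 - sum_x ** 2 / float(n)) *
               (sum_y2 - sum_y ** 2 / float(n)))
    if den == 0:
        return 0
    return num / den
```

The score ranges from -1 (perfectly opposed) to 1 (perfectly correlated); returning 0 when the denominator is zero is a design choice of this sketch for the case where one side has no variance.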
>>> sim_pearson(scores, 'band', 'math')
0.86645069516014406
>>> sim_pearson(scores, 'band', 'language')
0.015891184923353452
>>> sim_pearson(scores, 'band', 'science')
0.010391799400586866
4. Recommendations
Consistent with the results in the book except when using sim_distance, because of the square-root discrepancy described in section 2. With the correct function (square root included):
>>> getRecommendations(critics, 'Toby', similarity=sim_distance)
[(3.457128694491423, 'The Night Listener'),
(2.7785840038149234, 'Lady in the Water'),
(2.4224820423619162, 'Just My Luck')]
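getRecommendations itself is not reproduced in these notes. As a rough sketch of the idea (a similarity-weighted average of other critics' ratings for items the person has not rated), something like the following would produce output in the same (score, item) format. It is an assumption about how the function works, not the book's exact code, and here similarity is a required argument.

```python
def getRecommendations(prefs, person, similarity):
    totals = {}    # item -> sum of (rating * similarity)
    sim_sums = {}  # item -> sum of similarities of critics who rated it

    for other in prefs:
        if other == person:
            continue
        sim = similarity(prefs, person, other)
        # Ignore critics with zero or negative similarity
        if sim <= 0:
            continue
        for item, rating in prefs[other].items():
            # Only score items the person has not rated yet
            if item not in prefs[person]:
                totals[item] = totals.get(item, 0) + rating * sim
                sim_sums[item] = sim_sums.get(item, 0) + sim

    # Normalize by total similarity, then sort best-first
    rankings = [(total / sim_sums[item], item)
                for item, total in totals.items()]
    rankings.sort(reverse=True)
    return rankings
```

Dividing by the summed similarities keeps an item reviewed by many critics from outranking others purely because it has more ratings.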
5. Manhattan Distance
Implementation:
# Compute the Manhattan distance-based similarity between two
# critics
def sim_manhattan(prefs, person1, person2):
    si = {}
    for item in prefs[person1]:
        if item in prefs[person2]:
            si[item] = 1
    # If they have no reviews in common, they can't be scored
    if len(si) == 0:
        return 0
    # Otherwise, compute the Manhattan distance by calculating
    # the distances between the two on each axis (i.e. for each
    # common item).
    distances = [abs(prefs[person1][item] - prefs[person2][item])
                 for item in si]
    # Add 1 to the sum of the distances, then invert to find the
    # similarity.
    return 1/(1+sum(distances))
Using this function to provide recommendations yields similar results to the other similarity functions:
>>> getRecommendations(critics,'Toby',similarity=sim_manhattan)
[(3.4958444851387513, 'The Night Listener'),
(2.7668281709638345, 'Lady in the Water'),
(2.4433656957928802, 'Just My Luck')]