Data Mining: January 2009

Implemented the movie critic example from the textbook.

1. Data
The copy of "recommendations.py" posted to the website contained an error that produced results inconsistent with the examples in the book. It included the score critics['Michael Phillips']['Just My Luck'] = 3.0, but the book's data has no rating for that critic and movie.
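
A minimal sketch of the fix, assuming the posted file defines a nested dict named critics as the book does. The dictionary here is a one-entry stand-in for illustration, not the full dataset:

```python
# Stand-in for the posted data: only the erroneous entry is shown.
critics = {
    'Michael Phillips': {
        'Just My Luck': 3.0,  # extra rating that is not in the book's data
    },
}

# Drop the extra rating. pop() with a default does nothing if the
# entry has already been removed, so this is safe to run twice.
critics['Michael Phillips'].pop('Just My Luck', None)

print('Just My Luck' in critics['Michael Phillips'])
```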

2. Euclidean Distance function
The book gives the correct code for this function, which is similar to this:

from math import sqrt

# Calculate the Euclidean distance-based similarity score for two
# critics.
def sim_distance(prefs, person1, person2):
    # Get the list of shared items
    si = {}
    for item in prefs[person1]:
        if item in prefs[person2]:
            si[item] = 1

    # If they have no ratings in common, return 0
    if len(si) == 0:
        return 0

    # Add up the squares of all the differences
    squares = [pow(prefs[person1][item] - prefs[person2][item], 2)
               for item in si]
    # Invert (1 + distance) so that closer critics score higher
    return 1/(1+sqrt(sum(squares)))

However, several of the examples in the book appear to have been computed with a function that omits the square root of the sum of squares, as if the last line were:

   return 1/(1+sum(squares))

Making this change reproduced the book's results in every example where mine had differed. The first version is still the correct one: Euclidean distance is the square root of the sum of squared differences.

For example:

>>> sim_distance(critics, 'Lisa Rose', 'Gene Seymour')
0.29429805508554946

The book gives the result 0.148148148148 for this similarity score. The sum of squared differences is 5.75, so 1/(1+5.75) = 0.148, whereas 1/(1+sqrt(5.75)) = 0.294.
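
The arithmetic can be checked directly. This short sketch uses only the sum of squared differences quoted above; the variable names are my own:

```python
from math import sqrt

sum_of_squares = 5.75  # sum of squared rating differences quoted above

# Variant that matches the book's printed result (no square root)
without_root = 1 / (1 + sum_of_squares)

# Euclidean distance-based score (square root applied)
with_root = 1 / (1 + sqrt(sum_of_squares))

print(round(without_root, 3))  # 0.148
print(round(with_root, 3))     # 0.294
```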


3. Pearson Coefficient

Successfully tested with the music grade/ITBS score data; got the same results as in class using a spreadsheet:

scores = {
 'band' : {
     1 : 3, 2 : 4, 3 : 2.5, 4 : 2, 5 : 4, 6 : 3, 7 : 3.5, 8 : 4,
     9 : 3, 10 : 4, 11 : 4.5, 12 : 2, 13 : 3, 14 : 4, 15 : 2.5,
     16 : 2, 17 : 4, 18 : 3, 19 : 3.5, 20 : 4, 21 : 3, 22 : 4,
     23 : 4.5, 24 : 2
 },
 'math' : {
     1 : 220, 2 : 240, 3 : 210, 4 : 215, 5 : 260, 6 : 230,
     7 : 240, 8 : 259, 9 : 245, 10 : 280, 11 : 300, 12 : 230,
     13 : 250, 14 : 275, 15 : 200, 16 : 200, 17 : 290, 18 : 250,
     19 : 270, 20 : 260, 21 : 260, 22 : 270, 23 : 310, 24 : 170
 },
 'language' : {
     1 : 215, 2 : 210, 3 : 250, 4 : 230, 5 : 240, 6 : 270,
     7 : 240, 8 : 220, 9 : 230, 10 : 270, 11 : 260, 12 : 250,
     13 : 245, 14 : 235, 15 : 260, 16 : 255, 17 : 250, 18 : 230,
     19 : 245, 20 : 270, 21 : 270, 22 : 250, 23 : 260, 24 : 250
 },
 'science' : {
     1 : 220, 2 : 225, 3 : 235, 4 : 210, 5 : 220, 6 : 250,
     7 : 220, 8 : 240, 9 : 250, 10 : 230, 11 : 220, 12 : 225,
     13 : 235, 14 : 210, 15 : 220, 16 : 250, 17 : 220,
     18 : 240, 19 : 250, 20 : 230, 21 : 260, 22 : 230,
     23 : 260, 24 : 220
 }
}

>>> sim_pearson(scores, 'band', 'math')
0.86645069516014406
>>> sim_pearson(scores, 'band', 'language')
0.015891184923353452
>>> sim_pearson(scores, 'band', 'science')
0.010391799400586866
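
For reference, here is a sketch of a Pearson similarity function in the usual sum-based form. It is an assumption that the sim_pearson used above follows this form. The toy data is hypothetical and chosen so the expected correlations are obvious:

```python
from math import sqrt

def sim_pearson(prefs, p1, p2):
    # Keys rated by both p1 and p2
    shared = [item for item in prefs[p1] if item in prefs[p2]]
    n = len(shared)
    if n == 0:
        return 0

    # Sums, sums of squares, and sum of products over the shared keys
    sum1 = sum(prefs[p1][it] for it in shared)
    sum2 = sum(prefs[p2][it] for it in shared)
    sum1_sq = sum(prefs[p1][it] ** 2 for it in shared)
    sum2_sq = sum(prefs[p2][it] ** 2 for it in shared)
    p_sum = sum(prefs[p1][it] * prefs[p2][it] for it in shared)

    # Pearson r = covariance term / product of the spread terms
    num = p_sum - (sum1 * sum2 / n)
    den = sqrt((sum1_sq - sum1 ** 2 / n) * (sum2_sq - sum2 ** 2 / n))
    if den == 0:
        return 0
    return num / den

# Hypothetical data: 'y' rises with 'x', 'z' falls as 'x' rises
toy = {
    'x': {1: 1, 2: 2, 3: 3},
    'y': {1: 2, 2: 4, 3: 6},
    'z': {1: 3, 2: 2, 3: 1},
}
print(sim_pearson(toy, 'x', 'y'))
print(sim_pearson(toy, 'x', 'z'))
```

A perfectly linear increasing relationship should give a value of 1, and a perfectly linear decreasing one a value of -1.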

4. Recommendations

The results match the book's except when using sim_distance, because the book's examples omit the square root as described in section 2. With the correct function:

>>> getRecommendations(critics, 'Toby', similarity=sim_distance)
[(3.457128694491423, 'The Night Listener'),
(2.7785840038149234, 'Lady in the Water'),
(2.4224820423619162, 'Just My Luck')]
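
The sketch below shows how a similarity-weighted recommender of this kind can work. It is an assumption that the book's getRecommendations behaves this way. The function name get_recommendations_sketch and the toy data are hypothetical:

```python
from math import sqrt

def sim_distance(prefs, person1, person2):
    # Euclidean distance-based similarity over the shared items
    shared = [it for it in prefs[person1] if it in prefs[person2]]
    if not shared:
        return 0
    squares = [(prefs[person1][it] - prefs[person2][it]) ** 2 for it in shared]
    return 1 / (1 + sqrt(sum(squares)))

def get_recommendations_sketch(prefs, person, similarity=sim_distance):
    totals = {}    # item -> sum of (rating * similarity)
    sim_sums = {}  # item -> sum of similarities
    for other in prefs:
        if other == person:
            continue
        sim = similarity(prefs, person, other)
        if sim <= 0:
            continue  # ignore critics with no positive similarity
        for item, rating in prefs[other].items():
            if item not in prefs[person]:  # only score unseen items
                totals[item] = totals.get(item, 0) + rating * sim
                sim_sums[item] = sim_sums.get(item, 0) + sim
    # Weighted average rating per item, highest first
    rankings = [(total / sim_sums[item], item) for item, total in totals.items()]
    rankings.sort(reverse=True)
    return rankings

# Hypothetical ratings: 'A' has not rated 'm2'
toy = {
    'A': {'m1': 4.0},
    'B': {'m1': 4.0, 'm2': 5.0},
    'C': {'m1': 2.0, 'm2': 1.0},
}
print(get_recommendations_sketch(toy, 'A'))
```

In the toy data, B agrees with A exactly (similarity 1) and C is further away (similarity 1/3), so B's rating of 'm2' counts three times as much as C's in the weighted average.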

5. Manhattan Distance

Implementation:

# Compute the Manhattan distance-based similarity between two
# critics.
def sim_manhattan(prefs, person1, person2):
    # Get the list of shared items
    si = {}
    for item in prefs[person1]:
        if item in prefs[person2]:
            si[item] = 1

    # If they have no reviews in common, they can't be scored
    if len(si) == 0:
        return 0

    # Otherwise, compute the Manhattan distance by calculating
    # the distances between the two on each axis (i.e. for each
    # common item).
    distances = [abs(prefs[person1][item] - prefs[person2][item])
                 for item in si]
    # Add 1 to the sum of the distances, then invert to find the
    # similarity. The 1.0 avoids integer division if the ratings
    # are whole numbers.
    return 1.0/(1+sum(distances))

Using this function to provide recommendations yields similar results to the other similarity functions:

>>> getRecommendations(critics,'Toby',similarity=sim_manhattan)
[(3.4958444851387513, 'The Night Listener'),
(2.7668281709638345, 'Lady in the Water'),
(2.4433656957928802, 'Just My Luck')]
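
To show why the scores are close to, but not the same as, the Euclidean ones, here is a small sketch with hypothetical per-item rating differences. The Manhattan distance is never smaller than the Euclidean distance for the same differences, so the Manhattan similarity is never larger:

```python
from math import sqrt

diffs = [1.0, 2.0, 2.0]  # hypothetical per-item rating differences

manhattan = sum(abs(d) for d in diffs)        # 1 + 2 + 2 = 5
euclidean = sqrt(sum(d ** 2 for d in diffs))  # sqrt(1 + 4 + 4) = 3

# Same inversion as in sim_manhattan and sim_distance
sim_m = 1 / (1 + manhattan)
sim_e = 1 / (1 + euclidean)

print(sim_m, sim_e)
```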

Friday, January 23, 2009

Portfolio Assignment 1
