Kurt Wilms

PWDTAD: Item-Based Collaborative Filtering

You may have noticed the “People who Dugg this also Dugg” feature on the Digg story list pages as an interesting new way to sift through all the available items on Digg and discover new content.

We recently rolled out a few subtle changes to the placement of this feature, along with Related by Keyword. Many users wanted to see the comments directly under the story summary, so we are exploring different locations of these features based on what’s more relevant to you.

So how are the items that appear in the “People who Dugg this also Dugg” section chosen? The short answer is that it is based on an new item-item collaborative filtering algorithm recently developed at Digg.

The Digg Recommendation Engine predicts what items on Digg you might like by finding the users most similar to you, and then using their digging history in order to make predictions for you. There are various tweaks to the algorithm, but that is the basic model. The “People who Dugg this also Dugg” feature explores the relationships between items on Digg rather than the relationship between users. Instead of computing correlations between people, here we compute correlations between pairs of items on Digg. Computer scientists call this type of collaborative filtering ‘item-item’ because the algorithm is focusing on items rather than users. When we want to show “People who Dugg this also Dugg” items for a target item, we look at all the other items on Digg and select those that are most similar to the target item. Again, there are various tweaks to the algorithm, but that is the basic model.

The critical step in calculating the “People who Dugg this also Dugg” items is to compute the similarity between items. The basic idea in similarity computation between two items is to first isolate the users who have dugg the two items you are comparing and then to apply a similarity computation technique to determine their similarity. There are various strategies for determining this value. For those interested I would suggest this paper for a good overview. We’ve chosen to use an implementation of Jaccard coefficients in our calculation.

Launching a recommender on Digg comes with a unique set of challenges, many of which have to do with the tens of thousands of items submitted daily and the incredible volume and rate of diggs occurring on those items. As you can imagine, computing similarity for all theses items on digg fast enough to produce relevant recommendations can be a tricky problem to solve! Our recommender, written in python, computes similarity in real time as digging activity occurs. When an item on Digg receives at least five diggs, we compute similarity scores, select the most similar items, and store the results in a special data structure. As digging activity continues, the similarities are recalculated and the most similar items are reselected. At any given point in time, the recommender is calculating similarities for tens of thousands of items on Digg.

Of course, even with some of the technical challenges solved, our work is far from finished. We’ll keep on studying how our users interact with the related stories and we’ll continue thinking of innovative ways to use the data we have to improve the Digg experience. Since this is the beta version of the feature we want to know what you think. Please contact us with your thoughts and suggestions.

-Kurt