The largest E-Commerce sites operate at a scale that stresses the
direct implementation of collaborative filtering. In
neighborhood-based CF systems, the
neighborhood formation process, especially the user-user similarity
computation step, turns out to be the performance bottleneck, which in
turn can make the whole process unsuitable for real-time
recommendation generation. One way of ensuring high scalability is to use a model-based
approach. Model-based systems have the potential to
allow recommender systems to operate at high
scale. The main idea here is to isolate the
neighborhood generation step from the prediction generation step.
In this paper, we present a model-based approach to precompute
item-item similarity scores. The similarity computation scheme is still
correlation-based but the computation is performed on the item
space. In a typical E-Commerce scenario, we usually have a set of items
that is relatively static compared to the number of users, which changes
more often. The static nature of items leads us to the idea of precomputing
the item similarities. One possible way of precomputing the item
similarities is to compute the all-to-all similarity and then perform a
quick table look-up to retrieve the required similarity values. This
method, although it saves time, requires O(n²) space for
n items.
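As an illustration, the following sketch precomputes the full all-to-all item-item similarity table. The exact correlation variant is not pinned down here, so the sketch assumes an adjusted-cosine (user-mean-centered) correlation on a dense users × items ratings matrix R with zeros for unrated entries; the function name all_pairs_item_similarity is ours.

```python
import numpy as np

def all_pairs_item_similarity(R):
    """Correlation-based similarity computed on the item space of a
    users x items ratings matrix R (0 = unrated).  Storing the full
    result is the all-to-all table: O(n^2) space for n items."""
    mask = R != 0
    counts = np.maximum(mask.sum(axis=1), 1)
    user_means = np.where(mask.any(axis=1), R.sum(axis=1) / counts, 0.0)
    # Center each user's ratings so the item-item cosine behaves like
    # a correlation on the item space (adjusted cosine).
    centered = np.where(mask, R - user_means[:, None], 0.0)
    norms = np.linalg.norm(centered, axis=0)
    norms[norms == 0] = 1.0            # guard items nobody rated
    S = (centered.T @ centered) / np.outer(norms, norms)
    np.fill_diagonal(S, 0.0)           # an item is not its own neighbor
    return S                           # n x n table for quick look-up
```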
The fact that we only need a small fraction of similar items to
compute predictions leads us to an alternate model-based scheme. In
this scheme, we retain only a small number of similar items. For each
item j we compute the k most similar items, where k ≪ n, and
record these item numbers and their similarities with j. We term k as the model size. Based on
this model building step, our prediction generation algorithm works as
follows. To generate a prediction for a user u on an item i, our
algorithm first retrieves the precomputed k most similar items
corresponding to the target item i. It then determines which of those
k items were purchased by the user u; based on this intersection,
the prediction is computed using the basic item-based collaborative
filtering algorithm.
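A minimal sketch of the model-building and prediction steps, under the same assumptions as above and using the weighted-sum form of the basic item-based algorithm; build_model and predict are illustrative names, not fixed by the text.

```python
def build_model(S, k):
    """For each item j, record the indices and similarities of its
    k most similar items (k << n) -- the model of size k."""
    top = np.argsort(-S, axis=1)[:, :k]           # k nearest items per row
    return top, np.take_along_axis(S, top, axis=1)

def predict(user_ratings, i, neighbors, sims):
    """Predict a user's rating for item i from the precomputed model:
    intersect the k most similar items of i with the items the user
    has rated, then take a similarity-weighted average over them."""
    nbrs, s = neighbors[i], sims[i]
    rated = user_ratings[nbrs] != 0               # the intersection
    if not rated.any():
        return 0.0                                # no overlap, no prediction
    w = s[rated]
    return float(w @ user_ratings[nbrs][rated] / np.abs(w).sum())
```

For example, neighbors, sims = build_model(S, k=20) builds a model of size 20, while setting k = n reproduces the full all-to-all table.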
We observe a quality-performance trade-off here: to ensure good quality we must use a large model size, which leads to the performance problems discussed above. At one extreme, we can set the model size to n, which ensures exactly the same quality as the original scheme but incurs high space complexity. However, our model building step retains only the most similar items, and while generating predictions these items contribute the most to the prediction scores. Accordingly, we hypothesize that this model-based approach will provide reasonably good prediction quality even with a small model size, and hence good performance. We validate our hypothesis experimentally later in this paper. In particular, we vary the model size, i.e., the number of similar items to be stored, and then perform experiments measuring prediction quality and response time to determine the impact of the model size on the quality and performance of the whole system.