A comparative analysis of similarity measures akin to the Jaccard index in collaborative recommendations: empirical and theoretical perspective

2020 
The Jaccard index, originally proposed by Jaccard (Bull Soc Vaudoise Sci Nat 37:241–272, 1901), is a measure of the similarity (or dissimilarity) between two sample data objects. It is defined as the ratio of the intersection size to the union size of the two data samples, which makes it a simple and intuitive measure of similarity. This research examines measures that are akin to the Jaccard index and may be used for modelling affinity between users (or items) in collaborative recommendations. In particular, the simple matching coefficient (SMC), Sorensen–Dice coefficient (SDC), Salton’s cosine index (SCI), and overlap coefficient (OLC) are compared with the Jaccard index and analysed from both theoretical and empirical perspectives. Since these measures capture only the structural similarity information (overlapping information) between the data samples, they are very useful in situations where only the associations between users and items are available, such as the browsing or buying behaviour of users on an e-commerce portal (i.e. unary rating data, a special case of ratings). Furthermore, a theoretical relation among these measures has been established. We have also derived an equivalent expression for each of these measures so that it can be applied directly to binary data samples, as is common in data mining and machine learning. To compare and validate the effectiveness of these structural similarity measures, several experiments have been conducted using standard benchmark datasets (MovieLens, FilmTrust, Epinions, Yahoo! Movies, and Yahoo! Music). The empirical results demonstrate that Salton’s cosine index (SCI) provides better accuracy (in terms of MAE, RMSE, and precision) for large datasets, whereas the overlap coefficient (OLC) yields more accurate recommendations for small datasets.
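As a rough illustration of the five measures named in the abstract, the following minimal sketch computes them for two users represented as sets of item IDs (unary data). It uses the commonly cited set-based definitions of each coefficient, which may differ in notation from the expressions derived in the paper. The function name `structural_similarities`, the parameter `n_items` (the size of the item catalogue, needed only for SMC), and the toy data are assumptions made for this example.

```python
from math import sqrt


def structural_similarities(a, b, n_items):
    """Overlap-based similarities between two item sets.

    a, b     : iterables of item IDs each user has interacted with
    n_items  : total number of items in the catalogue (hypothetical
               parameter; only SMC needs it, to count joint absences)
    """
    a, b = set(a), set(b)
    inter = len(a & b)               # items both users interacted with
    union = len(a | b)               # items at least one user interacted with
    both_absent = n_items - union    # items neither user interacted with

    return {
        # Jaccard: |A ∩ B| / |A ∪ B|
        "jaccard": inter / union if union else 0.0,
        # Simple matching: (joint presences + joint absences) / all items
        "smc": (inter + both_absent) / n_items if n_items else 0.0,
        # Sorensen-Dice: 2|A ∩ B| / (|A| + |B|)
        "sdc": 2 * inter / (len(a) + len(b)) if (a or b) else 0.0,
        # Salton's cosine: |A ∩ B| / sqrt(|A| * |B|)
        "sci": inter / sqrt(len(a) * len(b)) if (a and b) else 0.0,
        # Overlap: |A ∩ B| / min(|A|, |B|)
        "olc": inter / min(len(a), len(b)) if (a and b) else 0.0,
    }


# Toy example: two users over a catalogue of 10 items (made-up data).
u = {1, 2, 3, 4}
v = {3, 4, 5}
scores = structural_similarities(u, v, n_items=10)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

With these standard definitions, any two non-empty sets satisfy Jaccard ≤ SDC ≤ SCI ≤ OLC, and the Jaccard and Sorensen–Dice values are linked by J = S / (2 − S). SMC is the only measure here that also counts items neither user touched, so its value depends on the catalogue size.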