Statistically Robust Evaluation of Stream-Based Recommender Systems

Abstract
Online incremental models for recommendation are nowadays pervasive in both the industry and the academia. However, there is not yet a standard evaluation methodology for the algorithms that maintain such models. Moreover, online evaluation methodologies available in the literature generally fall short on the statistical validation of results, since this validation is not trivially applicable to stream-based algorithms. We propose a k-fold validation framework for the pairwise comparison of recommendation algorithms that learn from user feedback streams, using prequential evaluation. Our proposal enables continuous statistical testing on adaptive-size sliding windows over the outcome of the prequential process, allowing practitioners and researchers to make decisions in real time based on solid statistical evidence. We present a set of experiments to gain insights on the sensitivity and robustness of two statistical tests - McNemar's and Wilcoxon signed rank - in a streaming data environment. Our results show that besides allowing a real-time, fine-grained online assessment, the online versions of the statistical tests are at least as robust as the batch versions, and definitely more robust than a simple prequential single-fold approach.
Funding Information
  • Fundação para a Ciência e a Tecnologia (UID/EEA/50014/2019)

This publication has 29 references indexed in Scilit: