A SURVEY OF RETRIEVAL ALGORITHMS AND THEIR PARALLELIZATION IN LARGE-SCALE SYSTEMS

Abstract

This article presents a survey of two well-known document-ranking algorithms, TF-IDF and BM25, run on a single CPU and as parallel processes on a high-performance computing (HPC) system. An Amazon review dataset with more than two million reviews was used to measure the ranking parameters. In the experiment, the number of workers for parallel processing was set to one and three. Four benchmarks evaluated the reading and preprocessing time, vectorization time, TF-IDF transformation time, and overall time. The resulting metrics show a significant difference in speed between the two worker settings.

Keywords

SURVEY
PARALLEL
TF-IDF
WORKERS
HPC
DOCUMENT
READING