A distributed frequent itemset mining algorithm based on Spark

1 May 2015

conference paper
conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

p. 271-275
https://doi.org/10.1109/cscwd.2015.7230970

Abstract

Frequent itemset mining is an important step of association rules mining. Traditional frequent itemset mining algorithms have certain limitations. For example Apriori algorithm has to scan the input data repeatedly, which leads to high I/O load and low performance, and the FP-Growth algorithm is limited by the capacity of computer's inner stores because it needs to build a FP-tree and mine frequent itemset on the basis of the FP-tree in memory. With the coming of the Big Data era, these limitations are becoming more prominent when confronted with mining large-scale data. In this paper, DPBM, a distributed matrix-based pruning algorithm based on Spark, is proposed to deal with frequent itemset mining. DPBM can greatly reduce the amount of candidate itemset by introducing a novel pruning technique for matrix-based frequent itemset mining algorithm, an improved Apriori algorithm which only needs to scan the input data once. In addition, each computer node reduces greatly the memory usage by implementing DPBM under a latest distributed environment-Spark, which is a lightning-fast distributed computing. The experimental results show that DPBM have better performance than MapReduce-based algorithms for frequent itemset mining in terms of speed and scalability.

Keywords

This publication has 4 references indexed in Scilit:

Apriori-based frequent itemset mining algorithms on MapReduce
Published by Association for Computing Machinery (ACM) ,2012
An improved apriori algorithm
Published by Institute of Electrical and Electronics Engineers (IEEE) ,2011
Pfp
Published by Association for Computing Machinery (ACM) ,2008
Mining frequent itemsets in distributed and dynamic databases
Published by Institute of Electrical and Electronics Engineers (IEEE) ,2004

Cited by 12 articles