Efficient parallel data mining for association rules

Abstract
In this paper, we develop an algorithm, called PDM, to con- duct parallel data mining for association rules. Consider a transaction as a collection of items, and a large item- set is a set of items such that the number of transactions containing it exceeds a pre-specilied threshold. PDM is so designed that the global set of large itemsets can be identi- fied efficiently and the amount of inter-node data exchange required is minimized. SpecificaUy, with a given database partition, each processing node will collect (count ) informa- tion on each itemset from its local database efficiently via a hashing method. The information discovered by each node is next shared with other nodes via some communication schemes. Then, PDM employs a technique, called clue-and- poll, to address the uncertainty due to the partial knowledge collected at each node by judiciously selecting a small frac- tion of the itemsets for the exchange of count information among nodes, thus reducing the communication cost. The global set of large iternsets can hence be determined based on the aggregate count of itemsets. It is experimentally shown that PDM not only attains very good parallelization efficiencies, but also provides robust performance for various input patterns.