Toolkit-Based High-Performance Data Mining of Large Data on MapReduce Clusters

1 December 2009

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

No. 23759232, p. 296-301
https://doi.org/10.1109/icdmw.2009.34

Abstract

The enormous growth of data in a variety of applications has increased the need for high-performance data mining in distributed environments. However, standard data mining toolkits per se do not allow the use of computing clusters. The success of MapReduce for analyzing large data has raised general interest in applying this model to other data-intensive applications. Unfortunately, current research has not led to an integration of GUI-based data mining toolkits with MapReduce systems built on distributed file systems. This paper defines novel principles for the modeling and design of the user interface, the storage model, and the computational model necessary for the integration of such systems. Additionally, it introduces a novel system architecture for interactive, GUI-based data mining of large data on MapReduce clusters that overcomes the limitations of data mining toolkits. As an empirical demonstration, we show an implementation based on Weka and Hadoop.

Cited by 29 articles