Toolkit-Based High-Performance Data Mining of Large Data on MapReduce Clusters
- 1 December 2009
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- No. 23759232, p. 296-301
- https://doi.org/10.1109/icdmw.2009.34
Abstract
The enormous growth of data in a variety of applications has increased the need for high-performance data mining based on distributed environments. However, standard data mining toolkits per se do not allow the usage of computing clusters. The success of MapReduce for analyzing large data has raised a general interest in applying this model to other data-intensive applications. Unfortunately, current research has not led to an integration of GUI-based data mining toolkits with distributed-file-system-based MapReduce systems. This paper defines novel principles for modeling and design of the user interface, the storage model, and the computational model necessary for the integration of such systems. Additionally, it introduces a novel system architecture for interactive GUI-based data mining of large data on clusters based on MapReduce that overcomes the limitations of data mining toolkits. As an empirical demonstration, we show an implementation based on Weka and Hadoop.