A Spark-Based Rough Hypercuboid Approach for Scalable Feature Selection

Abstract
Feature selection refers to choosing an optimal, non-redundant feature subset with minimal degradation of learning performance and maximal avoidance of data overfitting. The explosive growth of data makes the sequential execution of such algorithms extremely time-consuming, which necessitates their scalable parallelization through efficient exploitation of distributed computational capabilities. In this paper, we present parallel feature selection algorithms underpinned by the rough hypercuboid approach in order to scale to growing data volumes. Metrics defined in terms of rough hypercuboids are highly amenable to parallel distributed processing and fit well with the Apache Spark cluster computing paradigm. Two data parallelism strategies, vertical partitioning and horizontal partitioning, are implemented to decompose the data into concurrent iterative computing streams. Experimental results on representative datasets show that our algorithms are significantly faster than their sequential counterpart while guaranteeing the quality of the results. Furthermore, the proposed algorithms can exploit distributed-memory clusters to accomplish computation tasks that fail on a single node due to memory constraints. Analyses of parallel scalability and extensibility confirm that our parallelization handles massive amounts of data well and scales with the number of computational nodes.
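
To make the horizontal-partitioning strategy concrete, the sketch below shows the general shape of such a computation on Apache Spark. It is an illustrative assumption rather than the paper's implementation: the object HorizontalPartitionSketch, the toy dataset, and the per-feature consistency count (a crude stand-in for the rough hypercuboid dependency degree) are all hypothetical.

    // Illustrative sketch only: objects (rows) are spread across Spark
    // partitions (horizontal partitioning), and each feature's relevance is
    // estimated with ordinary map/reduceByKey steps. The consistency count
    // below is a simplified stand-in for the rough hypercuboid measure.
    import org.apache.spark.sql.SparkSession

    object HorizontalPartitionSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("RoughHypercuboidSketch")
          .master("local[*]")
          .getOrCreate()
        val sc = spark.sparkContext

        // Toy data: (feature vector, class label); purely hypothetical.
        val data = sc.parallelize(Seq(
          (Array(1.0, 0.0), 0),
          (Array(1.0, 1.0), 1),
          (Array(0.0, 1.0), 1),
          (Array(0.0, 0.0), 0)
        ))
        val numFeatures = 2

        // For each feature, count objects whose feature value co-occurs with
        // exactly one class label, aggregating partial results per partition.
        (0 until numFeatures).foreach { f =>
          val score = data
            .map { case (x, y) => (x(f), (Set(y), 1L)) }
            .reduceByKey { case ((c1, n1), (c2, n2)) => (c1 union c2, n1 + n2) }
            .filter { case (_, (classes, _)) => classes.size == 1 }
            .map { case (_, (_, n)) => n }
            .sum()
          println(s"feature $f consistency score = $score")
        }
        spark.stop()
      }
    }

Vertical partitioning would instead distribute feature columns across workers, so that each worker evaluates its own candidate features over all objects; the choice between the two strategies trades off communicating rows against communicating per-feature statistics.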
Funding Information
  • National Major Science and Technology Project of China (2018AAA0100201)
  • National Natural Science Foundation of China (62076171, 61573292, 61976182)