Grid Search-Based Hyperparameter Tuning and Classification of Microarray Cancer Data

Abstract
Cancer is a group of diseases caused by abnormal cell growth. Advances in microarray technology have produced a large variety of microarray cancer datasets, opening avenues for research across disciplines such as statistics, computational biology, genomics, and related fields. The main challenges in analyzing microarray cancer data are the curse of dimensionality, small sample size, noisy data, and class imbalance. In this work, we propose grid search-based hyperparameter tuning (GSHPT) of random forest parameters to classify microarray cancer data. The grid search is defined over a fixed set of candidate parameter values and selects the combination that gives the best accuracy under n-fold cross-validation; in our work, 10-fold cross-validation is used. The grid search returns the best values for the number of features to consider at each split, the number of trees in the forest, the maximum depth of each tree, and the minimum number of samples required to split a node. The maximum number of trees considered is 10, 20, and 70 for the Ovarian, 3-class Leukemia, and 3-class Leukemia cancer data, respectively. For MLL and SRBCT, 50 trees are generated to achieve the maximum classification accuracy. The Gini index is used as the node-splitting criterion, and the maximum tree depth is set to 2 for all datasets. Experimental results show an improvement over state-of-the-art methods. The proposed method is evaluated using standard metrics, namely classification accuracy, precision, recall, F1-score, confusion matrix, and misclassification rate, and a comparative analysis against existing methods is provided.
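The tuning procedure described above can be illustrated with a minimal sketch, assuming scikit-learn is used. The synthetic data, the stratified shuffled folds, the held-out test split, and the candidate values for `max_features` and `min_samples_split` are assumptions made for illustration; only the tree counts, the depth of 2, the Gini criterion, and the 10 folds come from the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in for a microarray dataset (assumption):
# few samples, many features, three classes.
X, y = make_classification(
    n_samples=90, n_features=200, n_informative=10,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=30, stratify=y, random_state=0
)

# Fixed grid of candidate values. Tree counts and max_depth follow the
# abstract; the other candidate values are illustrative assumptions.
param_grid = {
    "n_estimators": [10, 20, 50, 70],
    "max_depth": [2],
    "max_features": ["sqrt", "log2"],
    "min_samples_split": [2, 4],
}

# 10-fold cross-validation; stratification and shuffling are assumptions.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(criterion="gini", random_state=0),
    param_grid=param_grid,
    cv=cv,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out samples.
y_pred = search.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
misclassification_rate = 1.0 - accuracy
cm = confusion_matrix(y_test, y_pred)

print("best parameters:", search.best_params_)
print("mean CV accuracy of best setting:", search.best_score_)
print("test accuracy:", accuracy)
print("misclassification rate:", misclassification_rate)
print(cm)
```

Every combination in the grid is scored by its mean accuracy across the 10 folds, and the best combination is refitted on all training samples before the held-out evaluation. The numbers this sketch prints describe the synthetic data only and say nothing about the datasets in the paper.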