Classification of a real-life heart failure clinical dataset: Is TAN Bayes better than other Bayes?

Abstract
Real-life clinical data often present a number of common challenges, such as class imbalance, high dimensionality and missing data. There is the added complexity that the data are non-uniformly distributed and skewed. As a result, classical classification methods perform worse on this type of data than on other types. Bayesian classification is often suggested as a better method; however, the typical assumptions made by Bayesian classifiers, such as those about variables and data distributions, are not satisfied by real clinical data. This paper focuses on improving the performance of Bayesian classifiers and on how the underlying structure of the data affects that performance. To this end, it considers two Bayesian methodologies: non-parametric Kernel Density Estimation (KDE) and Tree Augmented Naïve Bayes (TAN). The aim is to measure performance on the heart failure dataset, focusing on how the data structure improves classification. The missing data in the clinical heart failure datasets are replaced using two imputation methods, and the results are compared. We also evaluate three classifiers on the imputed datasets: J48 (decision tree), naïve Bayesian multinomial and Bayesian network. The experiments show that KDE improves on naïve Bayes; however, TAN achieves a significant improvement with the different missing value imputation methods. TAN not only improves the performance of the classifier but also enhances prediction accuracy while maintaining efficiency and model simplicity.
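
To illustrate why a non-parametric density estimate can help naïve Bayes on skewed or multimodal data, the following is a minimal sketch of a naïve Bayes classifier whose per-feature class-conditional densities are Gaussian kernel density estimates. It is not the implementation used in this paper: the class name `KDENaiveBayes`, the fixed bandwidth and the toy data are all hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def gaussian_kde_logpdf(x, samples, bandwidth):
    """Log of a Gaussian kernel density estimate at x, built from 1-D samples."""
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)
    density = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm
    return math.log(density + 1e-300)  # small floor avoids log(0)


class KDENaiveBayes:
    """Hypothetical sketch: naive Bayes with one KDE per (class, feature)."""

    def __init__(self, bandwidth=1.0):
        self.bandwidth = bandwidth  # assumed fixed; in practice it would be tuned

    def fit(self, X, y):
        by_class = defaultdict(list)
        for row, label in zip(X, y):
            by_class[label].append(row)
        n = len(y)
        # Class priors estimated from class frequencies.
        self.log_prior = {c: math.log(len(rows) / n) for c, rows in by_class.items()}
        # For each class, store the training values of each feature as a column.
        self.columns = {c: list(zip(*rows)) for c, rows in by_class.items()}
        return self

    def predict(self, row):
        def score(c):
            # Naive Bayes: log prior plus the sum of per-feature log densities.
            return self.log_prior[c] + sum(
                gaussian_kde_logpdf(v, col, self.bandwidth)
                for v, col in zip(row, self.columns[c])
            )
        return max(self.log_prior, key=score)


# Toy data (hypothetical): class 0 is bimodal (near 0 and near 10), class 1 sits near 5.
X = [[0.0], [0.5], [10.0], [9.5], [5.0], [5.2], [4.8], [5.1]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
model = KDENaiveBayes(bandwidth=0.5).fit(X, y)
predictions = [model.predict([0.2]), model.predict([5.0]), model.predict([9.8])]
```

In this toy example, a single Gaussian fitted to class 0 would place its mean near 5, exactly where class 1 lies, whereas the kernel estimate follows both modes of class 0. This is the kind of distributional mismatch that motivates the use of KDE here.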