A Similarity Measurement with Entropy-Based Weighting for Clustering Mixed Numerical and Categorical Datasets
Open Access
- 15 June 2021
- journal article
- research article
- Published by MDPI AG in Algorithms
- Vol. 14 (6), 184
- https://doi.org/10.3390/a14060184
Abstract
Many mixed datasets with both numerical and categorical attributes have been collected in various fields, including medicine and biology. Designing an appropriate similarity measurement plays an important role in clustering these datasets. Many traditional measurements treat all attributes equally when measuring similarity. However, different attributes may contribute differently, because the amount of information they contain can vary considerably. In this paper, we propose a similarity measurement with entropy-based weighting for clustering mixed datasets. The numerical data are first transformed into categorical data by an automatic categorization technique. Then, an entropy-based weighting strategy is applied to denote the different importance of the various attributes. We incorporate the proposed measurement into an iterative clustering algorithm, and extensive experiments show that this algorithm outperforms the OCIL and K-Prototype methods with 2.13% and 4.28% improvements, respectively, in terms of accuracy on six mixed datasets from UCI.
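The two steps named in the abstract (categorizing numerical attributes, then weighting attributes by entropy) can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas, so everything here is an assumption made for illustration: equal-width binning stands in for the automatic categorization technique, each weight is the attribute's Shannon entropy divided by the sum of all entropies, and similarity is the total weight of the attributes on which two records agree. The function names are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def equal_width_bins(values, k):
    """Map numerical values to bin indices 0..k-1.

    Assumption: equal-width binning, used here only as a stand-in for
    the paper's automatic categorization technique.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / k
    # The maximum value would land in bin k, so clamp it to k-1.
    return [min(int((v - lo) / width), k - 1) for v in values]


def shannon_entropy(column):
    """Shannon entropy (natural log) of a categorical column."""
    n = len(column)
    return sum((c / n) * math.log(n / c) for c in Counter(column).values())


def entropy_weights(columns):
    """Normalize per-attribute entropies so the weights sum to 1.

    Assumption: higher entropy means a larger weight. The paper may
    define the weighting differently.
    """
    entropies = [shannon_entropy(col) for col in columns]
    total = sum(entropies)
    if total == 0:
        # Every attribute is constant, so fall back to equal weights.
        return [1.0 / len(columns)] * len(columns)
    return [e / total for e in entropies]


def weighted_similarity(x, y, weights):
    """Sum the weights of the attributes on which records x and y agree."""
    return sum(w for a, b, w in zip(x, y, weights) if a == b)


# Toy data: one numerical attribute and one constant categorical attribute.
ages = equal_width_bins([20, 30, 40, 50], k=2)
colors = ["r", "r", "r", "r"]
weights = entropy_weights([ages, colors])
records = list(zip(ages, colors))
print(weights)
print(weighted_similarity(records[0], records[1], weights))
```

In this toy example the constant attribute carries no information and receives zero weight, so agreement on it does not raise the similarity between two records.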
Funding Information
- National Natural Science Foundation of China (No. 61502135)