A Similarity Measurement with Entropy-Based Weighting for Clustering Mixed Numerical and Categorical Datasets

Open Access

15 June 2021

journal article
research article
Published by MDPI AG in Algorithms

Vol. 14 (6), 184
https://doi.org/10.3390/a14060184

Abstract

Mixed datasets with both numerical and categorical attributes are collected in many fields, including medicine and biology. Designing an appropriate similarity measurement plays an important role in clustering these datasets. Many traditional measurements treat all attributes equally when measuring similarity. However, attributes may contribute differently, because the amount of information they contain can vary considerably. In this paper, we propose a similarity measurement with entropy-based weighting for clustering mixed datasets. The numerical data are first transformed into categorical data by an automatic categorization technique. Then, an entropy-based weighting strategy is applied to reflect the differing importance of the attributes. We incorporate the proposed measurement into an iterative clustering algorithm, and extensive experiments show that this algorithm outperforms the OCIL and K-Prototype methods, with accuracy improvements of 2.13% and 4.28%, respectively, on six mixed datasets from UCI.
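
The two steps named in the abstract (categorizing numerical attributes, then weighting attributes by entropy) can be illustrated with a minimal Python sketch. It is not the paper's method: the abstract does not specify the categorization technique or the weighting formula, so the sketch assumes equal-width binning, Shannon entropy, and weights proportional to each attribute's entropy. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def equal_width_bins(values, n_bins=3):
    """Map numerical values to bin indices (assumed stand-in for the
    paper's automatic categorization, which the abstract does not detail)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # Clamp so the maximum value falls into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def shannon_entropy(column):
    """Shannon entropy (natural log) of a categorical column."""
    n = len(column)
    return -sum((c / n) * math.log(c / n) for c in Counter(column).values())


def entropy_weights(columns):
    """Assumption: weight each attribute in proportion to its entropy,
    normalized so the weights sum to 1."""
    entropies = [shannon_entropy(col) for col in columns]
    total = sum(entropies)
    if total == 0:
        return [1.0 / len(columns)] * len(columns)
    return [h / total for h in entropies]


def weighted_similarity(x, y, weights):
    """Sum of the weights of the attributes on which x and y agree."""
    return sum(w for a, b, w in zip(x, y, weights) if a == b)


# Toy data (hypothetical): one numerical and one categorical attribute.
ages = [20, 22, 40, 41, 60, 62]
colors = ["r", "r", "r", "r", "r", "b"]

age_bins = equal_width_bins(ages, n_bins=3)
weights = entropy_weights([age_bins, colors])
rows = list(zip(age_bins, colors))

print(age_bins)
print(weights)
print(weighted_similarity(rows[0], rows[2], weights))
```

In this toy data the binned age attribute is spread evenly over three categories, while the color attribute is dominated by one value, so under the stated assumption the age attribute receives the larger weight. Whether the paper assigns larger weights to higher-entropy or lower-entropy attributes is not stated in the abstract.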

Funding Information

National Natural Science Foundation of China (No. 61502135)
