A Similarity Measurement with Entropy-Based Weighting for Clustering Mixed Numerical and Categorical Datasets

Abstract
Many mixed datasets with both numerical and categorical attributes have been collected in various fields, including medicine, biology, etc. Designing appropriate similarity measurements plays an important role in clustering these datasets. Many traditional measurements treat various attributes equally when measuring the similarity. However, different attributes may contribute differently as the amount of information they contained could vary a lot. In this paper, we propose a similarity measurement with entropy-based weighting for clustering mixed datasets. The numerical data are first transformed into categorical data by an automatic categorization technique. Then, an entropy-based weighting strategy is applied to denote the different importances of various attributes. We incorporate the proposed measurement into an iterative clustering algorithm, and extensive experiments show that this algorithm outperforms OCIL and K-Prototype methods with 2.13% and 4.28% improvements, respectively, in terms of accuracy on six mixed datasets from UCI.
Funding Information
  • National Natural Science Foundation of China (No. 61502135)