On the Application of Cluster Analysis to Growing Season Precipitation Data in North America East of the Rockies

Abstract
Cluster analysis (CA) has been applied to geophysical research for over two decades, although its popularity has increased dramatically over the past few years. To date, systematic methodological reviews have not appeared in the geophysical literature. In this paper, after a review of a large number of applications of cluster analysis, an intercomparison of various cluster techniques was carried out on a well-studied dataset (7-day precipitation data from 1949 to 1987 in central and eastern North America). The cluster methods tested were single linkage, complete linkage, average linkage between groups, average linkage within a new group, Ward's method, k-means, the nucleated agglomerative method, and rotated principal component analysis. Three different dissimilarity measures (Euclidean distance, inverse correlation, and theta angle) and three initial partition methods were also tested on the hierarchical and nonhierarchical methods, respectively. Twenty-two of the 23 cluster algorithms yielded natural grouping solutions. Monte Carlo simulations were undertaken to examine the reliability of the cluster solutions. This was done by bootstrap resampling from the full dataset with four different sample sizes, then testing significance by the t test and the minimum significant difference test. Results showed that nonhierarchical methods outperformed hierarchical methods. The rotated principal component methods were found to be the most accurate methods, the nucleated agglomerative method was found to be superior to all other hard cluster methods, and Ward's method performed best among the hierarchical methods. Single linkage always yielded “chaining” solutions and, therefore, had poor matches to the input data. Of the three distance measures tested, Euclidean distance appeared to generate slightly more accurate solutions than inverse correlation. The theta angle was quite variable in its accuracy.
Tests of the initial partition method revealed a sensitivity of k-means CA to the selection of the seed points. The spatial patterns of cluster analysis applied to the full dataset were found to differ for various CA methods, thereby raising questions about how to interpret the resulting spatial regionalizations. Several methods were shown to incorrectly place geographically separated portions of the domain into a single cluster. The authors termed this type of result “aggregation error.” It was found to be most problematic at small sample sizes and more severe for specific distance measures. The choice of clustering technique and dissimilarity measure/initial partition may indeed significantly affect the results of cluster analysis. Cluster analysis accuracy was also found to be linearly to logarithmically related to the sample size. This relationship was statistically significant. Several methods, such as Ward's, k-means, and the nucleated agglomerative method, were found to reach a higher level of accuracy at a lower sample size than the other CA methods tested. The level of accuracy reached by the rotated principal component clustering compared with the other methods tested suggests that application of a hard and nonoverlapping clustering methodology to fuzzy and overlapping geophysical data results in a substantial degradation in the regionalizations presented.
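
As an illustration of why the choice of dissimilarity measure can matter, the three measures named in the abstract can be sketched as follows. The abstract does not give their formulas, so the definitions here are assumptions: inverse correlation is taken as 1 - r (Pearson correlation), and theta angle as the angle between the two data vectors. The function names and the two station series are hypothetical and are not drawn from the precipitation dataset.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation coefficient r between two series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)

def inverse_correlation(x, y):
    """Assumed form: 1 - r, so perfectly correlated series have dissimilarity 0."""
    return 1.0 - pearson(x, y)

def theta_angle(x, y):
    """Assumed form: angle in radians between the two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    cosine = max(-1.0, min(1.0, dot / (norm_x * norm_y)))
    return math.acos(cosine)

# Hypothetical 4-period precipitation totals at two stations;
# station_b is exactly twice station_a.
station_a = [1.0, 2.0, 3.0, 4.0]
station_b = [2.0, 4.0, 6.0, 8.0]

d_euclid = euclidean(station_a, station_b)
d_invcor = inverse_correlation(station_a, station_b)
d_theta = theta_angle(station_a, station_b)
```

Under these assumed definitions, the two series are far apart in Euclidean terms but essentially identical under the correlation- and angle-based measures, because the latter ignore differences in magnitude. This is one way in which the same clustering method could group stations differently depending on the measure used.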