Cluster Analysis (2)
Classic Clustering Methods
Luc Anselin^{1}
Latest update 11/19/2018
Introduction
In this second of three chapters that deal with multivariate clustering methods, we will cover two classic clustering methods, i.e., kmeans, and hierarchical clustering.
The problem addressed by a clustering method is to group the n observations into k clusters such that the intracluster similarity is maximized (or, dissimilarity minimized), and the betweencluster similarity minimized (or, dissimilarity maximized). Kmeans is a socalled partitioning clustering method in which the data are partitioned into k groups, with k determined beforehand. In constrast, hierarchical clustering builds up the clusters from the bottom up (or top down) and can be considered for many values of k.
The clustering methods are standard tools of socalled unsupervised learning and constitute a core element in any machine learning toolbox. An extensive technical discussion is beyond the scope of this document, but a thorough treatment can be found in Hastie, Tibshirani, and Friedman (2009) and James et al. (2013), among others.
GeoDa implements the cluster algorithms by leveraging the C Clustering Library (Hoon, Imoto, and Miyano 2017), augmented by the kmeans++ algorithm from Arthur and Vassilvitskii (2007).
To illustrate these techniques, we will continue to use the Guerry data set on moral statistics in 1830 France, which comes preinstalled with GeoDa.
Objectives

Carry out kmeans clustering

Interpret the characteristics of a cluster analysis

Carry out a sensitivity analysis to starting options

Impose a bound on the clustering solutions

Compute aggregate values for the new clusters

Carry out hierarchical clustering
GeoDa functions covered
 Clusters > K Means
 select variables
 select kmeans starting algorithms
 kmeans characteristics
 mapping the clusters
 changing the cluster labels
 saving the cluster classification
 setting a minimum bound
 Clusters > Hierarchical
 select distance criterion
 Table > Aggregate
Getting started
As before, with GeoDa launched and all previous projects closed, we again load the Guerry sample data set from the Connect to Data Source interface. We either load it from the sample data collection and then save the file in a working directory, or we use a previously saved version. The process should yield the familiar themeless base map, showing the 85 French departments, as in Figure 1.
K Means
Principle
The kmeans algorithm starts with an initial (random) partioning of the n data points into k groups, and then incrementally improves upon this until convergence. In a general sense, observations are assigned to the cluster centroid to which they are closest, using an Euclidean (squared difference) dissimilarity criterion.
A key element in this method is the choice of the number of clusters, k. Typically, several values for k are considered, and the resulting clusters are then compared in terms of the objective function. Since the total variance equals the sum of the withingroup variances and the total betweengroup variance, a common criterion is to assess the ratio of the total betweengroup variance to the total variance. A higher value for this ratio suggests a better separation of the clusters. However, since this ratio increases with k, the selection of a best k is not straightforward. In practice, one can plot the ratio against values for k and select a point where the additional improvement in the objective is no longer meaningful. Several ad hoc rules have been suggested, but none is totally satisfactory.
The kmeans algorithm does not guarantee a global optimum, but only a local one. In addition, it is sensitive to the starting point used to initiate the iterative procedure. In GeoDa, two different approaches are implemented. One uses a series of random initial assignments, creating several initial clusters and starting the iterative process with the best among these solutions. In order to ensure replicability, it is important to set a seed value for the random number generator. Also, to assess the sensitivity of the result to the starting point, different seeds should be tried (as well as a different number of initial solutions).
A second approach uses a careful consideration of initial seeds, following the procedure outlined in Arthur and Vassilvitskii (2007), commonly referred to as kmeans++. While generally being faster and resulting in a superior solution in small to medium sized data sets, this method does not scale well (as it requires k passes through the whole data set to find the initial values). Note that a choice of a large number of random initial allocations may yield a better outcome than the application of kmeans++, at the expense of a somewhat longer execution time.
Implementation
We invoke the kmeans functionality from the Clusters toolbar icon, in Figure 2.
We select K Means from the list of options. Alternatively, from the main menu, shown in Figure 3, we can select Clusters > K Means.
This brings up the K Means Clustering Settings dialog, shown in Figure 4, the main interface through which variables are chosen, options selected, and summary results are provided.
Variable Settings Panel
We select the variables and set the parameters for the K Means cluster analysis through the options in the left hand panel of the interface. We choose the same six variables as in the previous chapter: Crm_prs, Crm_prp, Litercy, Donatns, Infants, and Suicids. These variables appear highlighted in the Select Variables panel.
The next option is to select the Number of Clusters. The initial setting is blank. One can either choose a value from the dropdown list, or enter an integer value directly.^{2} In our example, we set the number of clusters to 5.
A default option is to use the variables in standardized form, i.e., in standard deviational units,^{3} expressed with Transformation set to Standardize (Z).
The default algorithm is KMeans++ with initialization reruns set to 150 and maximal iterations to 1000. The seed is the global random number seed set for GeoDa, which can be changed by means of the Use specified seed option. Finally, the default Distance Function is Euclidean. For kmeans, this option cannot be changed, but for hierarchical clustering, there is also a Manhattan distance metric.
The cluster classification will be saved in the variable specified in the Save Cluster in Field box. The default of CL is likely not very useful if several options will be explored. In our first example, we set the variable name to CLa.
Cluster results
After pressing Run, and keeping all the settings as above, a cluster map is created as a new view and the characteristics of the cluster are listed in the Summary panel.
The cluster map in Figure 5 reveals quite evenly balanced clusters, with 22, 19, 18, 16 and 10 members respectively. Keep in mind that the clusters are based on attribute similarity and they do not respect contiguity or compactness (we will examine this aspect in a later chapter).
The cluster characteristics are listed in the Summary panel, shown in Figure 6. This lists, for each cluster, the method (KMeans), the value of k (here, 5), as well as the parameters specified (i.e., the initialization methods, number of initialization reruns, the maximum iterations, transformation, and distance function). Next follows the values of the cluster centers for each of the variables involved in the clustering algorithm (with the Standardize option on, these variables have been transformed to have zero mean and variance one overall, but not in each of the clusters).
In addition, some summary measures are provided to assess the extent to which the clusters achieve withincluster similarity and betweencluster dissimilarity. The total sum of squares is listed, as well as the withincluster sum of squares for each of the clusters. Finally, these statistics are summarized as the total withincluster sum of squares, the total betweencluster sum of squares, and the ratio of betweencluster to total sum of squares. In our initial example, the latter is 0.497467.
The cluster labels (and colors in the map) are arbitrary and can be changed in the cluster map, using the same technique we saw earlier for unique value maps (in fact, the cluster maps are a special case of unique value maps). For example, if we wanted to switch category 4 with 3 and the corresponding colors, we would move the light green rectangle in the legend with label 4 up to the third spot in the legend, as shown in Figure 7.
Once we release the cursor, an updated cluster map is produced, with the categories (and colors) for 3 and 4 switched, as in Figure 8.
As the clusters are computed, a new categorical variable is added to the data table (the variable name is specified in the Save Cluster in Field option). It contains the assignment of each observation to one of the clusters as an integer variable. However, this is not automatically updated when we change the labels, as we did in the example above.
In order to save the updated classification, we can still use the generic Save Categories option available in any map view (right click in the map). After specifying a variable name (e.g., cat_a), we can see both the original categorical variable and the new classification in the data table.
In the table, shown in Figure 9, wherever CLa (the original classification) is 3, the new classification (cat_a) is 4. As always, the new variables do not become permanent additions until the table is saved.
Saving the cluster results
The summary results listed in the Summary panel can be saved to a text file. Right clicking on the panel to brings up a dialog with a Save option, as in Figure 10. Selecting this and specifying a file name for the results will provide a permanent record of the analysis.
Options and sensitivity analysis
The kmeans algorithm depends crucially on its initialization. In GeoDa, there are two methods to approach this. The default kmeans++ algorithm usually picks a very good starting point. However, the number of initial runs may need to be increased to obtain the best solution.
The second method is socalled random initialization. In this approach, k observations are randomly picked and used as seeds for an initial cluster assignment, for which the summary characteristics are then computed. This is repeated many times and the best solution is used as the starting point for the actual kmeans algorithm.
It is important to assess the sensitivity of the results to the starting point, and several combinations of settings should be compared.
In addition to the starting options, it is also possible to constrain the kmeans clustering by imposing a minimum value for a spatially extensive variable, such as a total population. This ensures that the clusters meet a minimum size for that variable. For example, we may want to create neighborhood types (clusters) based on a number of census variables, but we also want to make sure that each type has a minimum overall population size, to avoid creating clusters that are too small in terms of population. We will encounter this approach again when we discuss the maxp method for spatially constrained clustering.
Changing the number of initial runs in kmeans++
The default number of initial reruns for the kmeans++ algorithm is 150. Sometimes, this is not sufficient to guarantee the best possible result (in terms of the ratio of between to total sum of squares). We can change this value in the Initialization Reruns dialog, as illustrated in Figure 11 for a value of 1000 iterations.
The result is slightly different from what we obtained for the default setting. As shown in Figure 12, the first category now has 23 elements, and the second 18. The other groupings remain the same.
The revised initialization results in a slight improvement of the sum of squares ratio, changing from 0.497467 to 0.497772, as shown in Figure 13.
Random initialization
The alternative to the KMeans++ initialization is to select Random as the Initialization Method, as shown in Figure 14. We keep the number of initialization reruns to the default value of 150 and save the result in the variable CLr.
The result is identical to what we obtained for Kmeans++ with 1000 initialization reruns, with the cluster map as in Figure 12 and the cluster summary as in Figure 13. However, if we had run the random initialization with much fewer runs (e.g., 10), the results would be inferior to what we obtained before. This highlights the effect of the starting values on the ultimate result.
Setting a minimum bound
The minimum bound is set in the variable settings dialog by checking the box next to Minimum Bound, as in Figure 15. In our example, we select the variable Pop1831 to set the constraint. Note that we specify a bound of 15% (or 4855) rather than the default 10%. This is because the standard kmeans solution satisfies the default constraint already, so that no actual bounding is carried out.
With the cluster size at 5 and all other options at their default value, we obtain the cluster result shown in the map in Figure 16. These results differ slightly from the unconstrained map in Figure 5.
The cluster characteristics show a slight deterioriation of our summary criterion, to a value of 0.484033 (compared to 0.497772), as shown in Figure 17. This is the price to pay to satisfy the minimum population constraint.
Aggregation by cluster
With the clusters at hand, as defined for each observation by the category in the cluster field, we can now compute aggregate values for the new clusters. We illustrate this as a quick check on the population totals we imposed in the bounded cluster procedure.
The aggregation is invoked from the Table as an option by rightclicking. This brings up the list of options, from which we select Aggregate, as in Figure 18.
The following dialog, shown in Figure 19, provides the specific aggregation method, i.e., count, average, max, min, or Sum, the key on which to aggregate and a selection of variable to aggregate. In our example, we use the cluster field key, e.g., CLb and select only the population variable Pop1831, which we will sum over the departments that make up each cluster. This can be readily extended to multiple variables, as well as to different summary measures, such as the average.^{4}
Pressing the Aggregate key brings up a dialog to select the file in which the new results will be saved. For example, we can select a dbf format and specify the file name. The contents of the new file are given in Figure 20, with the total population for each cluster. Clearly, each cluster meets the minimum requirement of 4855 that was specified.
The same procedure can be used to create new values for any variable, aggregated to the new cluster scale.
Hierarchical Clustering
Principle
In contrast to a partioning method (like kmeans), a hierarchical clustering approach builds the clusters step by step. This can be approached in a topdown fashion or in a bottomup fashion.
In a topdown approach, we start with the full data set as one cluster, and find the best break point to create two clusters. This process continues until each observation is its own cluster. The result of the successive divisions of the data is visualized in a socalled dendrogram, i.e., a representation as a tree.
The bottomup approach starts with each observation being assigned to its own cluster. Next, the two observations are found that are closest (using a given distance criterion), and they are combined into a cluster. This process repeats itself, by using a representative point for each grouping once multiple observations are combined. At the end of this process, there is a single cluster, containing all the observations. Again, the results of the sequential grouping of observations (and clusters) are visually represented by a dendrogram.
The dendrogram is used to set a cut level for a specified k. By placing the cut point at different levels in the tree, clusters with varying dimensions are obtained.
A key aspect of the hierarchical clustering process is how to compute the distance between two existing clusters in order to decide how to group the closest two together. There are several criteria in use, such as single linkage, complete linkage, average linkage, and Ward’s method (or centroid linkage).
With single linkage, the distance between two clusters is defined by the distance between the closest (in attribute space) observations from each cluster. In contrast, for complete linkage, the distance is between the observations that are furthest apart. Average linkage uses the average of all the pairwise distances, whereas Ward’s method utilizes the distance between a central point in each cluster.
A common default is to use Ward’s method, which tend to result in nicely balanced clusters. The complete linkage method yields similar clusters. In contrast, single linkage and average linkage tends to result in many singletons and a few very large clusters.
Implementation
Hierarchical clustering is invoked in GeoDa from the same toolbar icon as kmeans, shown in Figure 2, by selecting the proper item from the drop down list. The desired clustering functionality can also be selected by using Clusters > Hierarchical from the menu, as shown in Figure 21.
Variable Settings Panel
As before, the variables to be clustered are selected in the Variables Settings panel. We continue with the same six variables, shown in Figure 22.
The panel also allows one to set several options, such as the Transformation (default value is a standardized zvalue), the linkage Method (default is Ward’s method), and the Distance Function (default is Euclidean). In the same way as for kmeans, the cluster classification is saved in the data table under the variable name specified in Save Cluster in Field.
Cluster results
The actual computation of the clusters proceeds in two steps. In the first step, after clicking on Run, a dendrogram is presented. The default cut point is set to 5, but this can be changed interactively (see below). Once a cut point is selected, clicking on Save/Show Map creates the cluster map, computes the summary characteristics, and saves the cluster classification in the data table.
Dendrogram
With all options set to the default, the resulting dendrogram is as in Figure 23. The dashed red line corresponds to a cut point that yields five clusters (the default). The dendrogram shows how individual observations are combined into groups of two, and subsequently into larger and larger groups, by combining pairs of clusters. The colors on the right hand side match the colors of the observations in the cluster map (see next).
The dashed line (cut point) can be moved interactively. For example, in Figure 24, we grabbed the line at the top (it can equally be grabbed at the bottom), and moved it to the right to yield eight clusters. The corresponding colors are shown on the right hand bar.
Cluster map
As mentioned before, once the dendrogram cut point is specified, clicking on Save/Show Map will generate the cluster map, shown in Figure 25. Note how the colors for the map categories match the colors in the dendrogram. Also, the number of observations in each class also are the same between the groupings in the dendrogram and the cluster map.
Cluster summary
Similarly, once Save/Show Map has been selected, the cluster descriptive statistics become available from the Summary button in the dialog. The same characteristics are reported as for kmeans. In comparison to our kmeans solution, this set of clusters is slightly inferior in terms of the ratio of between to total sum of squares, achieving 0.482044. However, setting the number of clusters at five is by no means necessarily the best solution. In a real application, one would experiment with different cut points and evaluate the solutions relative to the kmeans solution.
The two kmeans and hierarchical clustering approaches can also be used in conjunction with each other. For example, one could explore the dendrogram to find a good cutpoint, and then use this value for k in a kmeans or other partitioning method.
Options and sensitivity analysis
The main option of interest in hierarchical clustering is the linkage Method. In addition, we can alter the Distance Function and the Transformation. The latter operates in the same fashion as for kmeans.
So far, we have used the default setting for Ward’slinkage. We now consider each of the other linkage options in turn and illustrate the associated dendrogram, cluster map and cluster characteristics.
Single linkage
The linkage options are chosen from the Method item in the dialog. For example, in Figure 27, we select Singlelinkage. The other options are chosen in the same way.
The cluster results for single linkage are typically characterized by one or a few very large clusters and several singletons (one observation per cluster). In our example, this is illustrated by the dendrogram in Figure 28, and the corresponding cluster map in Figure 29. Four clusters consist of a single observation, with the main cluster collecting the 81 other observations. This situation is not remedied by moving the cut point such that more clusters result, since almost all of the additional clusters are singletons as well.
The characteristics of the single linkage hierarchical cluster are similarly dismal. Since four clusters are singeltons, their within cluster sum of squares is 0. Hence, the total withincluster sum of squares equals the sum of squares for cluster 5. The resulting ratio of between to total sum of squares is only 0.214771.
In practice, in most situations, single linkage will not be a good choice, unless the objective is to identify a lot of singletons.
Complete linkage
The complete linkage method yields clusters that are similar in balance to Ward’s method. For example, in Figure 31, the dendrogram is shown for our example, using a cut point with five clusters. The corresponding cluster map is given as Figure 32. The map is similar in structure to that obtained with Ward’s method (Figure 25), but note that the largest category (at 39) is much larger than the largest for Ward (25).
In terms of the cluster characteristics, shown in Figure 33, we note a slight deterioration relative to Ward’s results, with the ratio of between to total sum of squares at 0.423101 (but much better than single linkage).
Average linkage
Finally, the average linkage criterion suffers from some of the same problems as single linkage, although it yields slightly better results. The dendrogram and cluster map are shown in Figures 34 and 35.
As given in Figure 36, the summary characteristics are slighly better than in the single linkage case, with only two singletons. However, the overall ratio of between to total sum of squares is still much worse than for the other two methods, at 0.296838.
Distance metric
The default metric to gauge the distance between the center of each cluster is the Euclidean distance. In some contexts, it may be preferable to use absolute or Manhattan block distance, which penalizes larger distances less. This option can be selected through the Distance Function item in the dialog, as in Figure 37.
We run this for the same variables using Ward’s linkage and a cut point yielding 5 clusters. The corresponding cluster map is shown in Figure 38, with the summary characteristics given in Figure 39.
Relative to the Euclidean distance results, the ratio of between to total sum of squares is somewhat worse, at 0.44549 (compared to 0.482044). However, this is not a totally fair comparison, since the criterion for grouping is not based on a variance measure, but on an absolute difference.
References
Arthur, David, and Sergei Vassilvitskii. 2007. “kmeans++: The Advantages of Careful Seeding.” In SODA 07, Proceedings of the Eighteenth Annual AcmSiam Symposium on Discrete Algorithms, edited by Harold Gabow, 1027–35. Philadelphia, PA: Society for Industrial and Applied Mathematics.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning (2nd Edition). New York, NY: Springer.
Hoon, Michiel de, Seiya Imoto, and Satoru Miyano. 2017. “The C Clustering Library.” Tokyo, Japan: The University of Tokyo, Institute of Medical Science, Human Genome Center.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning, with Applications in R. New York, NY: SpringerVerlag.

University of Chicago, Center for Spatial Data Science – anselin@uchicago.edu↩

The dropdown list goes from 2 to 85, which may be insufficient in big data settings. Hence GeoDa now offers the option to enter a value directly.↩

The Z standardization subtracts the mean and divides by the standard deviation. An alternative standardization is to use the mean absolute deviation, MAD.↩

In the current implementation, the same summary method needs to be applied to all the variables.↩