Cluster Analysis (3)

Introduction

In this chapter, we consider some more advanced partitioning methods. First, we cover two variants of K-means, i.e., K-medians and K-medoids. These operate in the same manner as K-means, but differ in the way the central point of each cluster is defined and the manner in which the nearest points are assigned. In addition, we discuss spectral clustering, a graph partitioning method that can be interpreted as simultaneously implementing dimension reduction with cluster identification.

As implemented in GeoDa, these methods share almost all the same options with the partitioning and hierarchical clustering methods discussed in the previous chapters. These common aspects will not be considered again. We refer to the previous chapters for details on the common options and sensitivity analyses.

We continue to use the Guerry data set to illustrate k-medians and k-medoids, but introduce a new sample data set, spirals.csv, for the spectral clustering examples.

Objectives

Understand the difference between k-median and k-medoid clustering
Carry out and interpret the results of k-median clustering
Gain insight into the logic behind the PAM, CLARA and CLARANS algorithms
Carry out and interpret the results of k-medoid clustering
Understand the graph-theoretic principles underlying spectral clustering
Carry out and interpret the results of spectral clustering

GeoDa functions covered

Clusters > K Medians
- select variables
- MAD standardization
Clusters > K Medoids
Clusters > Spectral

Getting started

The Guerry data set can be loaded in the same way as before.

The spirals data set is specifically designed to illustrate some of the special characteristics of spectral clustering. It is one of the GeoDaCenter sample data sets.

To activate this data set, you load the file spirals.csv and select x and y as the coordinates (the data set only has two variables), as in Figure 1. This will ensure that the resulting layer is represented as a point map.

Figure 1: Spirals convert csv file to point map

The result shows the 300 points, consisting of two distinct but interwoven spirals, as in Figure 2.

Figure 2: Spirals themeless point map

We will not be needing this data set until we cover spectral clustering. For k-medians and k-medoids, we use the Guerry data set.

K Medians

Principle

K-medians is a variant of k-means clustering. As a partitioning method, it starts by randomly picking k starting points and assigning observations to the nearest initial point. After the assignment, the center for each cluster is re-calculated and the assignment process repeats itself. In this way, k-medians proceeds in exactly the same manner as k-means. It is in fact also an EM algorithm.

In contrast to k-means, the central point is not the average (in multiattribute space), but instead the median of the cluster observations. The median center is computed separately for each dimension, so it is not necessarily an actual observation (similar to what is the case for the cluster average in k-means).
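As a minimal illustration of this point, the following Python sketch computes the median center dimension by dimension. The three observations are made up for the example and are not taken from the Guerry data; the function names are likewise hypothetical.

```python
from statistics import fmean, median

# Hypothetical cluster of three observations on two variables.
cluster = [(1.0, 8.0), (2.0, 1.0), (10.0, 2.0)]

def median_center(points):
    """Median computed separately for each dimension."""
    return tuple(median(col) for col in zip(*points))

def mean_center(points):
    """Average computed separately for each dimension, as in k-means."""
    return tuple(fmean(col) for col in zip(*points))

med = median_center(cluster)
avg = mean_center(cluster)

print("median center:", med)
print("mean center:", avg)
print("median center is an observation:", med in cluster)
```

Each coordinate of the median center is an observed value here, but the combination (2.0, 2.0) does not correspond to any of the three observations.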

The objective function for k-medians is to find the allocation \(C(i)\) of observations \(i\) to clusters \(h = 1, \dots, k\), such that the sum of the Manhattan distances between the members of each cluster and the cluster median is minimized: \[\mbox{argmin}_{C(i)} \sum_{h=1}^k \sum_{i \in h} || x_i - x_{h_{med}} ||_{L_1},\] where the distance metric follows the \(L_1\) norm, i.e., the Manhattan block distance.
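The objective can be evaluated for any given allocation. The sketch below is one way to do so in plain Python, assuming a list of points and a parallel list of cluster labels; the toy data and the names `manhattan` and `kmedians_objective` are illustrative assumptions, not part of GeoDa.

```python
from statistics import median

def manhattan(a, b):
    """L1 (Manhattan) distance between two points."""
    return sum(abs(u - v) for u, v in zip(a, b))

def kmedians_objective(points, labels):
    """Sum over clusters of the L1 distances to the cluster median."""
    total = 0.0
    for h in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == h]
        center = tuple(median(col) for col in zip(*members))
        total += sum(manhattan(p, center) for p in members)
    return total

# Hypothetical data: two visually separate groups of three points.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
          (5.0, 5.0), (6.0, 5.0), (5.0, 7.0)]
good = [0, 0, 0, 1, 1, 1]   # allocation that follows the two groups
bad = [0, 1, 0, 1, 0, 1]    # allocation that mixes the groups

print(kmedians_objective(points, good))
print(kmedians_objective(points, bad))
```

The allocation that follows the two groups yields the smaller value, which is what the argmin in the expression above selects for.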

K-medians is often confused with k-medoids. However, there is an important difference in that in k-medoids, the central point has to be one of the observations (Kaufman and Rousseeuw 2005). We consider k-medoids in the next section.

The Manhattan distance metric is used to assign observations to the nearest center. From a theoretical perspective, this is superior to using Euclidean distance since it is consistent with the notion of a median as the center (Hoon, Imoto, and Miyano 2017, 16).
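The alternation between assignment and center update can be sketched as follows. This is a simplified illustration in plain Python, not the C clustering library routine used by GeoDa: the function name `k_medians`, the convergence test and the handling of empty clusters are assumptions made for the example.

```python
import random
from statistics import median

def manhattan(a, b):
    """L1 (Manhattan) distance between two points."""
    return sum(abs(u - v) for u, v in zip(a, b))

def k_medians(points, k, max_iter=100, seed=0):
    """Simplified k-medians: random start, L1 assignment, median update."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)   # k random observations as starting points
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest center by Manhattan distance.
        new_labels = [
            min(range(k), key=lambda h: manhattan(p, centers[h]))
            for p in points
        ]
        if new_labels == labels:      # no change in assignment: stop
            break
        labels = new_labels
        # Update step: dimension-wise median of each cluster.
        for h in range(k):
            members = [p for p, lab in zip(points, labels) if lab == h]
            if members:               # keep the old center if a cluster is empty
                centers[h] = tuple(median(col) for col in zip(*members))
    return labels, centers

# Hypothetical data: two well-separated groups.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
          (10.0, 10.0), (11.0, 10.0), (10.0, 12.0)]
labels, centers = k_medians(points, k=2)
print(labels)
print(centers)
```

As with k-means, the result of a single run depends on the random starting points, which is why several initial re-runs are typically used in practice.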

In all other respects, the implementation and interpretation are the same as for k-means. To illustrate the logic, a simple worked example is provided in the Appendix.

GeoDa employs the k-medians implementation that is part of the C clustering library of Hoon, Imoto, and Miyano (2017).

Implementation

Just as with the previous clustering techniques, k-medians is invoked from the Clusters toolbar. From the menu, it is selected as Clusters > K Medians, the second item in the classic clustering subset, as shown in Figure 3.

Figure 3: K Medians Option

This brings up the K Medians Clustering Settings dialog, with the Input options in the left-hand side panel, shown in Figure 4.

Variable Settings Panel

The user interface is identical to that for k-means, to which we refer for details. The main difference is that the Distance Function is Manhattan distance. In the example in Figure 4, we again select the same six variables as before, with the Number of Clusters set to 5 and all other options left at their default settings.

Figure 4: K Medians variable selection

Selecting Run brings up the cluster map and fills out the right-hand panel with some cluster characteristics, listed under Summary. The cluster categories are added to the Table using the variable name specified in the dialog (the default is CL; in our example, we use CLme1).

Cluster results

The cluster map is shown in Figure 5. The three largest clusters (they are labeled in sequence of their size) are well-balanced, with 24, 20 and 20 observations. The two others are much smaller, at 12 and 9. Interestingly, the clusters are also geographically quite compact, except for cluster 4, which consists of four different spatial subgroups. Cluster 2, in the south of the country, is actually fully contiguous (without imposing any spatial constraints). This is not the case for k-means.

Figure 5: K Medians cluster map (k=5)

While the grouping may seem similar to what we obtained with other methods, this is in fact not the case. In Figure 6, the cluster maps for k-means and k-medians are shown next to each other, with the labels for k-medians adjusted so as to get similar colors for each category. This highlights some of the important differences between the two methods. First of all, the size of the different “matching” clusters is not the same, nor is their geographic configuration. Considering the clusters for k-medians (with their new labels), we see that the largest cluster, with 24 observations, corresponds most closely with cluster 3 for k-means, which had 18 observations.

The closest match between the two results is for cluster 2, with only one mismatch out of 9 observations, although that cluster is much larger for k-means, with 19 observations. The worst match is for cluster 5, where only three observations are shared by the two methods for that cluster (out of 12). For the others, there is about a 3/4 match. In other words, the two methods pick out different patterns of similarity in the data. There is no “best” method, since each uses a different objective function. It is up to the analyst to decide which of the objectives makes most sense, in light of the goals of a particular study.
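Outside of GeoDa, this kind of comparison amounts to a cross-tabulation of the two sets of cluster labels. The sketch below shows the idea in plain Python with made-up labels for eight observations (they are not the Guerry results), assuming the labels have already been adjusted so that "matching" clusters share the same number.

```python
from collections import Counter

# Hypothetical cluster labels from two methods for the same observations.
kmeans_labels = [1, 1, 1, 2, 2, 3, 3, 3]
kmedians_labels = [1, 1, 2, 2, 2, 3, 3, 1]

def cross_tab(labels_a, labels_b):
    """Count observations for each pair (label in a, label in b)."""
    return Counter(zip(labels_a, labels_b))

table = cross_tab(kmeans_labels, kmedians_labels)

for h in sorted(set(kmeans_labels)):
    size_a = kmeans_labels.count(h)
    size_b = kmedians_labels.count(h)
    shared = table[(h, h)]  # observations assigned to cluster h by both methods
    print(f"cluster {h}: sizes {size_a} and {size_b}, shared {shared}")
```

The diagonal counts give the number of shared observations per cluster, while the off-diagonal counts show where the mismatches end up.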

Figure 6: K Means and K Medians compared (k=5)

Further insight into the characteristics of the clusters obtained by the k-medians algorithm is found in the Summary panel on the right side of the settings dialog, shown in Figure 7.

The first set of items summarizes the settings for the analysis, such as the method used, the number of clusters and the various options for initialization, standardization, etc. Next follow the values for each of the variables associated with the median center of each cluster. These results are given in the original scale for the variables, whereas the other summary measures depend on the standardization used. Typically, the median center values are used to interpret the type of grouping that is obtained. This is not always easy, since one has to look for systematic combinations of variables with high or low values for the median so as to characterize the cluster.

The third set of items contains the summary statistics, using the squared difference and mean as the criterion, similar to what is used for k-means. Note that this is only for a general comparison, since this is not the criterion used in the objective function. So, in a sense, it gives a general impression of how the k-medians results compare using the standard used for k-means. In our example, we obtain a ratio of between to total sum of squares of 0.447, compared to 0.497 for k-means (with the default settings). This does not mean that the k-medians result is worse than that for k-means, but it gives a sense of how it performs under a different criterion than the one it is optimized for.
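For reference, the ratio of between to total sum of squares can be computed for any set of cluster labels, whichever method produced them. The following plain Python sketch uses a small made-up data set; the function names are hypothetical and the code is only meant to show the arithmetic, not GeoDa's internal computation.

```python
from statistics import fmean

def sum_of_squares(points):
    """Sum of squared deviations from the mean center, over all dimensions."""
    center = [fmean(col) for col in zip(*points)]
    return sum((v - c) ** 2 for p in points for v, c in zip(p, center))

def bss_tss_ratio(points, labels):
    """Ratio of between-cluster to total sum of squares."""
    tss = sum_of_squares(points)
    wss = sum(
        sum_of_squares([p for p, lab in zip(points, labels) if lab == h])
        for h in set(labels)
    )
    return (tss - wss) / tss

# Hypothetical data: two tight groups along the first dimension.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
labels = [0, 0, 1, 1]

ratio = bss_tss_ratio(points, labels)
print(round(ratio, 3))
```

Because this measure is built on means and squared deviations, a k-medians solution will generally score somewhat lower on it than a k-means solution for the same data and the same k.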

The final set of summary characteristics are the proper ones for the objective of minimizing the within-cluster Manhattan distance relative to the cluster median. The total sum of the distances is 372.318. This is the sum of the distances between all observations and the overall median (using the z-standardized values for the variables). For k-medians, the objective is to decrease this value by grouping the observations into clusters with their own medians. The within-cluster total distance is listed for each cluster. In our results, there is quite a range in these values, going from 15.97 in the smallest cluster (with only 9 observations) to 70.49 in cluster 3 (with 20 observations). Clusters 1 and 2, which are at least as large as cluster 3, have a much better fit. This is also reflected in the average within-cluster distance results, with the smallest value of 1.77 for C5, followed by 2.61 for C1. Interestingly, the latter has about double the total distance compared to C4, but its average is better (2.61 compared to 2.84). The averages correct for the size of the cluster and are thus a good comparative measure of fit.
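The distinction between the total and the average within-cluster distance can be made concrete with a small sketch. The data below are invented so that two clusters of different size have the same total distance; the function `within_cluster_summary` is a hypothetical helper, not a GeoDa function.

```python
from statistics import median

def manhattan(a, b):
    """L1 (Manhattan) distance between two points."""
    return sum(abs(u - v) for u, v in zip(a, b))

def within_cluster_summary(points, labels):
    """Map each label to (size, total L1 distance to median, average)."""
    summary = {}
    for h in sorted(set(labels)):
        members = [p for p, lab in zip(points, labels) if lab == h]
        center = tuple(median(col) for col in zip(*members))
        total = sum(manhattan(p, center) for p in members)
        summary[h] = (len(members), total, total / len(members))
    return summary

# Hypothetical data: a cluster of four points and a cluster of two points.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0),
          (5.0, 5.0), (9.0, 5.0)]
labels = [1, 1, 1, 1, 2, 2]

summary = within_cluster_summary(points, labels)
for h, (size, total, average) in summary.items():
    print(f"C{h}: n={size}, total={total}, average={average}")
```

Both clusters have the same total distance in this example, but the larger one has half the average, which is why the averages are the more useful measure when cluster sizes differ.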

The total of the within-cluster distances is 250.399, a decrease of 121.9 from the original total. As a fraction of the original total, the final result is 0.673. When comparing results for different values of k, we would look for a bend in the elbow plot as this ratio decreases with increasing values of k.
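The arithmetic behind these figures is easy to verify. The short Python check below uses only the values reported in the Summary panel; the variable names are chosen for the example.

```python
# Values as reported in the Summary panel for k = 5 (z-standardized variables).
total_distance = 372.318    # sum of Manhattan distances to the overall median
within_distance = 250.399   # total of the within-cluster distances

decrease = total_distance - within_distance
ratio = within_distance / total_distance

print(f"decrease: {decrease:.3f}")
print(f"ratio: {ratio:.3f}")

# Average within-cluster distance for the smallest cluster (9 observations).
average_c5 = 15.97 / 9
print(f"C5 average: {average_c5:.2f}")
```

The same ratio computed for a range of values of k provides the points for the elbow plot.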

Figure 7: K Medians cluster characteristics (k=5)

Options and sensitivity analysis

The variables settings panel contains all the same options as for k-means, except that initialization is always by randomization, since there is no k-means++ method for k-medians. One option that is particularly useful in the context of k-medians (and k-medoids) is the use of a different standardization.

MAD standardization

The default z-standardization uses the mean and the variance of the original variables. Both of these are sensitive to the influence of outliers. Since the use of Manhattan distance and the median center for clusters in k-medians already reduces the effect of such outliers, it makes sense to also use a standardization that is less sensitive to those. We considered range standardization in the discussion of k-means. Here, we look at the mean absolute deviation, or MAD. As usual, this is selected as one of the Transformation options, as shown in Figure 8.
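To see how the two transformations differ, consider the sketch below. It assumes that MAD standardization replaces the standard deviation in the denominator by the mean absolute deviation around the mean, and that z-standardization uses the sample standard deviation; both are assumptions about the exact formulas, and the data and function names are made up for the illustration.

```python
from statistics import fmean, stdev

def z_standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    m, s = fmean(values), stdev(values)
    return [(v - m) / s for v in values]

def mad_standardize(values):
    """Subtract the mean and divide by the mean absolute deviation."""
    m = fmean(values)
    mad = fmean([abs(v - m) for v in values])
    return [(v - m) / mad for v in values]

# Hypothetical variable with one large outlier.
values = [1.0, 2.0, 3.0, 4.0, 100.0]

z_vals = z_standardize(values)
mad_vals = mad_standardize(values)

for v, z, m in zip(values, z_vals, mad_vals):
    print(f"{v:6.1f}  z = {z:7.3f}  mad = {m:7.3f}")
```

Since the mean absolute deviation is never larger than the standard deviation, the MAD-standardized values are at least as large in magnitude as the z-standardized ones, so distances computed from them are on a different scale.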

Figure 8: MAD variable standardization

The resulting cluster map and summary characteristics are shown in Figures 9 and 10.

The main effect seems to be on the largest cluster, which grows from 24 to 27 observations, mostly at the expense of what was the second largest cluster (which goes from 20 to 18 observations). As a result, none of the clusters are fully contiguous any more.

Figure 9: K Medians cluster map - MAD standardization (k=5)

The distance measures listed in the summary show a different starting point, with a total distance sum of 490.478, compared to 372.318 for z-standardization (recall that these measures are expressed in whatever units were used for the standardization). Therefore, the values for the within-cluster distance and their averages are not directly comparable to those using z-standardization. Only relative comparisons are warranted.

In the end, the total within-cluster distance is reduced to 0.677 of the original total, a slightly worse result than for z-standardization. However, this does not necessarily mean that z-standardization is superior. The choice of a particular transformation should be made within the context of the substantive research question. When no strong guidelines exist, a sensitivity analysis comparing, for example, z-standardization, range standardization and MAD may be the best strategy.

Figure 10: K Medians cluster characteristics - MAD standardization (k=5)