Dimension Reduction Methods (2)

Distance Preserving Methods

Luc Anselin¹

06/28/2020 (latest update)

Introduction
- Objectives
  - GeoDa functions covered
Multidimensional Scaling (MDS)
t-SNE
Attribute and Locational Similarity
Appendix
References

Introduction

In this chapter, we continue our review of ways to reduce the variable dimension of a problem, but now focus on so-called distance preserving methods, i.e., techniques that attempt to represent, or, more precisely, embed data points from a high-dimensional attribute space in a lower dimensional space (typically 2D or 3D) for easy visualization. This representation is such that the relative distances between the data points are preserved as much as possible. We focus on two methods, the classic multidimensional scaling (MDS) and the more recently developed t-SNE, a variant of a principle referred to as stochastic neighbor embedding. While MDS is a linear technique, t-SNE may be more appropriate in the case when the relationships among the variables may be nonlinear. Also, MDS tends to assure that points that are far apart in high-dimensional space are also far apart in the lower-dimensional space, whereas t-SNE is more attuned at preserving the relative location of close points.

For each method, we first present some basic concepts and the implementation in GeoDa. We then consider more specifically a spatialized approach, where we exploit geovisualization, linking and brushing to explore the connection between neighbors in multi-attribute space and neighbors in geographic space.

To illustrate these techniques, we continue to use the Guerry data set on moral statistics in 1830 France, which comes pre-installed with GeoDa.

Objectives

Understand the mathematics behind classic metric multidimensional scaling (MDS)
Understand the principle behind the SMACOF algorithm
Carry out multidimensional scaling for a set of variables
Gain insight into the various options used in MDS analysis
Interpret the results of MDS
Understand the mathematical principles behind t-SNE
Carry out t-SNE to visualize multi-attribute data
Understand the role of the various t-SNE optimization parameters
Interpret the results of t-SNE
Compare closeness in attribute space to closeness in geographic space
Carry out local neighbor match test using MDS neighbors

GeoDa functions covered

Clusters > MDS
- select variables
- MDS methods
- MDS parameters
- saving MDS results
- spatial weights from MDS results
Clusters > t-SNE
- select variables
- t-SNE optimization parameters
- saving t-SNE results
- spatial weights from t-SNE results
Weights Manager
- weights intersection
Space > Local Neighbor Match Test

Multidimensional Scaling (MDS)

Principle

Multidimensional Scaling or MDS is a classic multivariate approach designed to portray the embedding of a high-dimensional data cloud in a lower dimension, typically 2D or 3D. The technique goes back to pioneering work on so-called metric scaling by Torgerson (Torgerson 1952, 1958), and its extension to non-metric scaling by Shepard and Kruskal (Shepard 1962 a, 1962 b; Kruskal 1964). A very nice overview of the principles and historical evolution of MDS as well as a more complete list of the classic references can be found in Mead (1992).

MDS is based on the notion of distance between observation points in multiattribute space. For \(k\) variables, the Euclidean distance between observations \(x_i\) and \(x_j\) in \(k\)-dimensional space is \[d_{ij} = || x_i - x_j|| = [\sum_{h=1}^k (x_{ih} - x_{jh})^2]^{1/2},\] the square root of the sum of the squared differences between two data points over each variable dimension. In this expression, \(x_i\) and \(x_j\) are \(k\)-dimensional column vectors, with one value for each of the \(k\) variables (dimensions). In practice, it turns out that working with the squared distance is mathematically more convenient, and it does not affect the properties of the method. In order to keep the notation straight, we will use \(d_{ij}^2\) for the squared distance.

The squared distance can also be expressed as the inner product of the difference between two vectors: \[d_{ij}^2 = (x_i - x_j)'(x_i - x_j),\]

with \(x_i\) and \(x_j\) represented as \(k \times 1\) column vectors. It is important to keep the dimensions straight, since the data points are typically contained in an \(n \times k\) matrix \(X\), where each observation corresponds to a row of the matrix. However, in order to obtain the sum of squared differences as a inner product, each observation (on the \(k\) variables) must be represented as a column vector.

An alternative to Euclidean distance is a Manhattan (city block) distance, which is less susceptible to outliers, but does not lend itself to being expressed as an inner product of two vectors: \[d_{ij} = \sum_{h=1}^k |x_{ih} - x_{jh}|,\] the sum of the absolute differences in each dimension.

The objective of MDS is to find the coordinates of data points \(z_1, z_2, \dots, z_n\) in 2D or 3D that mimic the distance in multiattribute space as closely as possible. This can be thought of as minimizing a stress function, \(S(z)\): \[S(z) = \sum_i \sum_j (d_{ij} - ||z_i - z_j||)^2.\] In other words, a set of coordinates in a lower dimension (\(z_i\), for \(i = 1, \dots, n\)) are found such that the distances between the resulting pairs of points ( \(||z_i - z_j||\) ) are as close as possible to their pair-wise distances in high-dimensional multi-attribute space ( \(d_{ij}\) ). Note that, due to the use of a squared difference, this objective functions penalizes large differentials more. Hence a tendency of MDS to represent points that are far apart rather well, but maybe with less attention to points that are closer together.

Mathematical details

Classic metric scaling

Classic metric MDS, one of the two methods implemented in GeoDa, approaches the optimization problem indirectly by focusing on the cross product between the actual vectors of observations \(x_i\) and \(x_j\) rather than between their difference, i.e., the cross-product between each pair of rows in the observation matrix \(X\). The values for all these cross-products are contained in the matrix expression \(XX'\). Note that this matrix, often referred to as a Gram matrix, is of dimension \(n \times n\) (and not \(k \times k\) as is the case for the more familiar \(X'X\) used in PCA).

If the actual observation matrix \(X\) is available, then the eigenvalue decomposition of the Gram matrix provides the solution to MDS. In the same notation as used for PCA (but now pertaining to the matrix \(XX'\)): \[XX' = VGV',\] where V is the matrix with eigenvectors (each of dimension \(n \times 1\)) and \(G\) is a \(n \times n\) diagonal matrix that contains the eigenvalues on the diagonal. It turns out that the new coordinates are found as: \[Z = VG^{1/2},\] i.e., each eigenvector (for \(n\) observations) is multiplied by the square root of the matching eigenvalue, with the eigenvalues ordered in decreasing order. Typically, only the first two or three columns of \(VG^{1/2}\) are used. This also implies that only the first few (largest) eigenvalues need to be computed, rather than the full set of \(n\) (see the Appendix for details).

More generally, MDS only requires the specification of a dissimilarity or squared distance matrix \(D^2\), not necessarily the full set of observations \(X\). As it turns out, the Gram matrix can be obtained from \(D^2\) by double centering: \[XX' = - \frac{1}{2} (I - M)D^2(I - M),\] where \(M\) is the \(n \times n\) matrix \((1/n)\iota \iota'\) and \(\iota\) is a \(n \times 1\) vector of ones (so, \(M\) is an \(n \times n\) matrix containing 1/n in every cell). The matrix \((I - M)\) converts values into deviations from the mean. In the literature, the joint pre- and post-multiplication by \((I - M)\) is referred to as double centering.

Given this equivalence, we don’t actually need the matrix \(X\) (in order to construct \(XX'\)) since the eigenvalue decomposition can be applied directly to the double centered squared distance matrix (see the Appendix). A more formal explanation of the mathematical properties can be found in Borg and Groenen (2005), Chapter 7, and Lee and Verleysen (2007), pp. 74-78.

SMACOF

SMACOF stands for “scaling by majorizing a complex function,” and was initially suggested by de Leeuw (1977) as an alternative to the gradient descent method that was typically employed to minimize the MDS stress function. It was motivated in part because the eigenvalue approach outlined above for classic metric MDS does not work unless the distance function is Euclidean. In many early applications of MDS in psychometrics this was not the case.

The idea behind the iterative majorization method is actually fairly straightforward, but its implementation in this particular case is quite complex.² In essence, a majorizing function is found that is much simpler than the original function, and, for a minimization problem, is always above the actual function. With a function \(f(x)\), a majorizing function \(g(x,z)\) is such that for a fixed point \(z\) the two functions are equal, such that \(f(z) = g(z,z)\) (\(z\) is called the supporting point). The auxiliary function should be easy to minimize (easier than \(f(x)\)) and always dominate the original function, such that \(f(x) \leq g(x,z)\). This leads to the so-called sandwich inequality (coined as such by de Leeuw): \[f(x^*) \leq g(x^*,z) \leq g(z,z) = f(z),\] where \(x^*\) is the value that minimizes the function \(g\).

The iterative part stands for the way in which one proceeds. One starts with a value \(x_0\), such that \(g (x,x_0) = f(x_0)\). For example, consider a parabola that sits above a complex nonlinear function and touches it at \(x_0\), as shown in Figure 1. We can then easily find the minimum for the parabola, say at \(x_1\). Next, we move a new parabola so that it touches our function at \(x_1\), with again \(g(x,x_1) = f(x_1)\). In the following step, we find the minimum for the new parabola at \(x_2\) and keep proceeding in this iterative fashion until the difference between \(x_k\) and \(x_{k-1}\) for two consecutive steps is less than a critical value (the convergence criterion). At that point we consider the minimum of our parabola to be the same as the minimum for the function \(f(x)\), or, close enough, given our convergence criterion.

Figure 1: Iterative majorization (source: Borg and Groenen, 2005, p. 180)

In general, the stress function for a solution \(Z\) to the MDS problem can be written as: \[S(Z) = \sum_{i < j} (\delta_{ij} - d_{ij}(Z))^2 = \sum_{i<j} \delta_{ij}^2 + \sum_{i<j} d_{ij}^2(Z) - 2 \sum_{i<j} \delta_{ij}d_{ij}(Z),\] where \(\delta_{ij}\) is the distance between \(i\) and \(j\) for the original configuration, and \(d_{ij}(Z)\) is the corresponding distance for a proposed solution \(Z\). Note that \(\delta_{ij}\) can be based on any distance metric or dissimilarity measure, but \(d_{ij}(Z)\) is a Euclidean distance. In the stress function, the first term is a constant, since it does not change with the values for the coordinates in \(Z\). The second term is a sum of squared distances between pairs of points in \(Z\), and the third term is a weighted sum of the pairwise distances (each weighted by the initial distances).³ We need to find a set of coordinates \(Z\) that minimizes \(S(Z)\).

The upshot of the iterative majorization is the so-called Guttman transform, which expresses an updated solution \(Z\) as a function of a tentative solution to the majorizing condition, \(Y\), through the following equality: \[Z = (1/n)B(Y)Y,\] where the \(n \times n\) matrix \(B(Y)\) has off-diagonal elements \(B_{ij} = - \delta_{ij} / d_{ij}(Y)\), and diagonal elements \(B_{ii} = - \sum_{j, j\neq i} B_{ij}\). Note that this negative of a sum of negatives turns out to be positive. Basically, the diagonal elements equal the sum of the absolute values of all the column/row elements, so that rows and columns sum to zero, i.e., the matrix \(B\) is double centered (details are given in the section on Iterative majorization of the stress function in the Appendix).

In practice, this means that the majorization algorithm proceeds through a number of simple steps:

start by picking a set of (random) starting values for \(Z\) and set \(Y\) to these values
compute the stress function for the current value of \(Z\)
find a new value for \(Z\) by means of the Guttman transform, using the computed distances included in \(B(Y)\), based on the current value of \(Y\)
update the stress function and check its change; stop when the change is smaller than a pre-set difference
if no convergence yet, set \(Y\) to the new value of \(Z\) and proceed with the update, etc.

Implementation

Default settings

The MDS functionality in GeoDa is invoked from the Clusters toolbar icon, or from the main menu, as Clusters > MDS. It is the second item in the dimension reduction submenu, as shown in Figure 2.

Figure 2: MDS menu option

This brings up the MDS Settings panel through which the variables are selected and various computational parameters are set. We continue with the same six variables as used previously in the PCA example, i.e., Crm_prs, Crm_prp, Litercy, Donatns, Infants, and Suicids. For now, we leave all options to their default settings, shown in Figure 3. The default method is classic metric with the variables standardized. Since this method only works for Euclidean distances, the Distance Function option is not available. Importantly, the default number of dimensions is set to 2 (see below for 3D MDS).

Figure 3: MDS input variable selection

Saving the MDS coordinates

After clicking the Run button, a small dialog is brought up to select the variable names for the new MDS coordinates, in Figure 4.

Figure 4: MDS result variable selection

As mentioned, GeoDa offers both the standard 2D MDS as well as a 3D MDS. The default used here is 2D, with the new variables V1 and V2 checked (the coordinates in the new space). In a typical application, the default variable names of V1 and V2 should be changed to a more appropriate choice.

The coordinates are added to the data table, as illustrated for the first five observations in Figure 5.

Figure 5: MDS coordinates in data table

With some additional computations, we can verify how these results are obtained. First, we need to create the matrix \(XX'\) from the standardized values for the six variables. Next, we compute the eigenvalues and eigenvectors for this \(85 \times 85\) matrix.⁴ The results are summarized in Figure 6.

The top row shows the first two eigenvalues, with their square roots listed immediately below. The eigenvector elements are also given for the first five observations (out of 85). The columns V1 and V2 give the corresponding MDS coordinates, calculated by multiplying the square root of the eigenvalue with the matching elements of the eigenvector (see the mathematical derivation in Classic metric scaling). The results are the same as those reported in Figure 5.

Figure 6: MDS coordinate calculations

MDS scatter plot

After the variable selection, OK generates a two-dimensional scatter plot with the observations, illustrated in Figure 7. In this scatter plot, points that are close together in two dimensions should also be close together in the six-dimensional attribute space. In addition, two new variables with the coordinate values will be added to the data table (as usual, to make this addition permanent, we need to save the table). Below the scatter plot are two lines with statistics. They can be removed by turning the option View > Statistics off (the default is on).

The statistics list the variables included in the analysis and two rough measures of fit. One is the value of the stress function for the solution, in our example, this is 0.343 (a lower value indicates a better fit). The other is the rank correlation between the distances in multi-attribute space and the distances in the lower-dimensional space, here 0.825 (a higher value suggests a stronger correlation between the relative ranks of the distances). The method used (classic metric) is included in the window top banner.

Figure 7: MDS plot

To illustrate the mathematical similarity between PCA and classic metric MDS, consider the scatter plot of the first two principal components (using the same six variables) next to the MDS plot (here shown without the statistics), as in Figure 8. Both plots have the same shape, except for being flipped around the x-axis.

Figure 8: PCA and MDS scatter plots

This is further highlighted by selecting the upper-left quadrant in the PCA plot. The matching points in the MDS plot are in the lower-left quadrant in Figure 9.

Figure 9: Selected points in PCA and MDS scatter plots

Recall that the signs of the elements of the eigenvectors depend on the computational approach, so that the relative positioning is really what counts. Consequently, points (observations) that are close in MDS space, will also be close in PCA space, and vice versa.

3D MDS

When the # of Dimensions option is set to 3, the results of MDS are visualized in a 3-dimensional scatterplot cube, as in Figure 10. Both the stress value (0.196) as well as the rank correlation (0.931) are superior to the two-dimensional solution. This is to be expected, since less dimension reduction is involved.⁵

Points that are close in the 2D MDS scatter plot are not necessarily close in 3D. In Figure 10, this seems to be the case, since the three close points selected in the 2D graph on the right also look close in 3D (the points highlighted in yellow).

Figure 10: Selected points in 3D and 2D MDS scatter plots (a)

However, this is not necessarily the case, and it is important to take advantage of the rotation of the 3D scatterplot cube to get a better sense of the location of the points in three dimensions. For example, the two points selected in the 2D plot in Figure 11 turn out not to be that close in 3D, after some rotation of the cube.

Figure 11: Selected points in 3D and 2D MDS scatter plots (b)

In our typical applications, we will focus on the 2D visualization.

Power approximation

The Parameters panel in the MDS Settings dialog allows for a number of options to be set. The Transformation options are the same as for PCA, with the default set to Standardize (Z), but Standardize (MAD), Raw, Demean, Range Adjust and Range Standardize are available as well. Best practice is to use the variables in standardized form, either the traditional z-transformation, or a more robust alternative (MAD or range standardize).

An option that is particularly useful in larger data sets is to use a Power Iteration to approximate the first few largest eigenvalues needed for the MDS algorithm. Since the first two or three eigenvalue/eigenvector pairs are all that is needed for the classic metric implementation, this provides considerable computational advantages in large data sets. Recall that the matrix for which the eigenvalues/eigenvectors need to be obtained is of dimension \(n \times n\). For large \(n\), the standard eigenvalue calculation will tend to break down.

The Power Iteration option is selected by checking the box in the Parameters panel, as in Figure 12. This also activates the Maximum # of Iterations option. The default is set to 1000, which should be fine in most situations. However, if needed, it can be adjusted by entering a different value in the corresponding box.

Figure 12: MDS power iteration option

At first sight, the result in Figure 13 may seem different from the standard output in Figure 7, but this is due to the indeterminacy of the signs of the eigenvectors. The stress value and rank correlation statistics are identical to the default solution. This is not unexpected, since \(n = 85\) in our example, so that the power approximation is actually not needed.

Figure 13: MDS plot (power approximation)

Closer inspection of the two graphs suggests that the axes are flipped. We can clearly see this by selecting the points in the upper right quadrant in the power graph and locating them in the other, as in Figure 14. The relative position of the points is the same in the two graphs, but they are on opposite sides of the y-axis.

Figure 14: Computation algorithm comparison

SMACOF

The Method option provides two approaches to minimize the stress function. The default is classic metric, but the smacof method is provided as an alternative. When the latter is selected, as in Figure 15, the Maximum # of iterations and the Convergence Criterion can be changed as well (those options are not available to the default classic metric method). The default setting is for 1000 iterations and a convergence criterion of 0.000001. These should be fine for most applications, but they may need to be adjusted if the minimization is not accomplished by the maximum number of iterations.

Figure 15: SMACOF method

As before, after selecting variable names for the two dimensions, a scatter plot is provided (it lists the method in the banner). For our example, the results are as in Figure 16, with the statistics listed at the bottom.

Figure 16: MDS plot using smacof method

Relative to the results for the classic metric method, the stress value is poorer, at 0.487, but the rank correlation is better, at 0.905. It is also shown that convergence (for the default criterion) is reached after 424 iterations. If the result would show 1000/1000 iterations, that would indicate that convergence has not been reached. This means that the default number of iterations would need to be increased. Alternatively, the convergence criterion could be relaxed, but that is not recommended.

We can compare the relative position of the points obtained by means of each method through linking and brushing, in the usual way. For example, in Figure 17 some close points in the SMACOF solution are shown to be somewhat farther apart in the classic metric solution.

Figure 17: Selected points in SMACOF and classic metric scatter plots

With the smacof method selected, it is now possible to choose an alternative distance metric for the original points, such as the Manhattan block distance. This computes absolute differences instead of squared differences. The option is shown in Figure 18. The effect is to lessen the impact of outliers, or large distances in the original high dimensional space.

Figure 18: SMACOF Manhattan distance function

The result that follows from a Manhattan distance metric is quite different from the default Euclidean SMACOF plot. As seen in Figure 19, the scale of the coordinates is different. Note that the range of values for V1sm and V2sm is much smaller than for the metric plot, so even if the points look farther apart, they are actually quite close on the scale of Figure 16. However, the identification of close observations can also differ quite a bit between the two plots. This can be further explored using linking and brushing (not shown).

Figure 19: SMACOF plot for Manhattan distance

Note that in terms of fit, the stress value of 0.317 is better than for SMACOF with Euclidean distance, but the rank correlation of 0.786 is worse. Convergence is obtained after 626 iterations, slightly more than for the Euclidean distance option.

The moral of the story is that the MDS scatter plots should not be interpreted in a mechanical way, but careful linking and brushing should explore the trade-offs between the different options.

Categorical variable

The MDS interface includes an option to visualize the position of data points that correspond to a particular category. One of the main uses of embedding methods is to discover hidden structure in high-dimensional data, such as particular groupings. The Category Variable option provides a way to visualize how the points that correspond to pre-defined groups in the data position themselves in the lower dimensional embedding. For example, in Figure 20, the Category Variable was set to Region, five broad regions of France, as defined in the Guerry data set (see also the left-hand panel of Figure 36).

Figure 20: MDS categorical variable selection

As a result, after the variable names are entered, the resulting MDS scatter plot designates each point with a color corresponding to its category, as in Figure 21. This primarily for visualization during the exploratory process. An alternative way to assess the way the point locations might correspond to different groups is given in the section on Visualizing a categorical variable, discussed in the context of t-SNE, but equally applicable to MDS scatter plots.

Figure 21: MDS scatter plot with categorical variables

Interpretation

In a general sense, the multidimensional scaling method essentially projects points from a high-dimensional multivariate attribute space onto a two or three-dimensional space. To get a more precise idea of how this process operates, we can link points that are close in the MDS scatter plot to the matching lines in a parallel coordinate plot. Data points that are close in multidimensional variable space should be represented by lines that are close in the PCP.

In Figure 22, we selected two points that are close in the MDS scatter plot and assess the matching lines in the six-variable PCP. Since the MDS works on standardized variables, we have expressed the PCP in those dimensions as well by means of the Data > View Standardized Data option. While the lines track each other closely, the match is not perfect, and is better on some dimensions than on others. For example, for Crm_prp and Suicids, the values are near identical, whereas for Infants the gap between the two values is quite large.

Figure 22: MDS and PCP, close points

To provide some context, Figure 23 provides an example of two points that are far from each other in the MDS plot (in the top and bottom of the left hand panel). The gaps between the PCP lines are generally much farther from each other, although for Litercy and Infants, they are quite close.

Figure 23: MDS and PCP, remote points

For a proper interpretation, this needs to be examined more carefully, for example, by brushing the MDS scatter plot, or, alternatively, by brushing the PCP and examining the relative locations of the matching points in the MDS scatter plot.