# Dimension Reduction Methods (1)

### Principal Component Annalysis (PCA)

#### Luc Anselin^{1}

#### 07/29/2020 (latest update)

## Introduction

In this chapter, we move fully into multivariate analysis and begin by addressing some approaches to reduce the dimensionality of the variable space. These are methods that help to address the so-called *curse of dimensionality*, a concept originally formulated by Bellman (1961). In a nutshell, the curse of dimensionality means that techniques that work well in small dimensions (e.g., k-nearest neighbor searches), either break down or become unmanageably complex (i.e., computationally impractical) in higher dimensions.

After a brief example to illustrate the curse of dimensionality in concrete terms, we move on to **principal components analysis** (PCA), a core method of both multivariate statistics and machine learning. Dimension reduction is particularly relevant in situations where many
variables are available that are highly intercorrelated. In essence, the original variables are replaced by a smaller number of proxies that represent them well.

In addition to a standard description of PCA and its implementation in `GeoDa`

, we also introduce a
*spatialized* approach, where we exploit geovisualization, linking and brushing to illustrate how the dimension reduction is represented in geographic space.

To illustrate these techniques, we will again use the Guerry data set on
moral statistics in 1830 France, which comes pre-installed with `GeoDa`

.
Before proceeding with a discussion of the PCA functionality, we briefly illustrate some aspects of the curse of dimensionality.

### Objectives

Illustrate the curse of dimensionality

Illustrate the mathematics behind principal component analysis

Compute principal components for a set of variables

Interpret the characteristics of a principal component analysis

Spatialize the principal components

#### GeoDa functions covered

- Clusters > PCA
- select variables
- PCA parameters
- PCA summary statistics
- saving PCA results

## The Curse of Dimensionality

The curse of dimensionality was originally described by Bellman (1961) in the context of optimization problems. Consider the following simple example where we look for the maximum of a function in one variable dimension, as in Figure 1.

One simple way to try to find the maximum is to evaluate the function at a number of discrete values for \(x\) and find the point(s) where the function value is highest. For example, in Figure 1, we have 10 evaluation points, equally dividing the x-range of \(0-10\). In our example, the maximum is between 5 and 6. In a naive one-dimensional numerical optimization routine, we could then divide the interval between 5 and 6 into ten more subintervals and proceed. However, let’s consider the problem from a different perspective. What if the optimization was with respect to two variables? Now, we are in a two-dimensional attribute domain and using 10 discrete points for each variable actually results in 100 evaluation points, as shown in Figure 2.

Similarly, for three dimensions, there would be \(10^3\) or 1,000 evaluation points, as in Figure 3.

In general, if we keep the principle of 10 evaluation points per dimension, the required number of evaluation points is \(10^k\), with \(k\) as the number of variables. This very quickly becomes unwieldy, e.g., for 10 dimensions, there would need to be \(10^{10}\), or 10 billion evaluation points, and for 20 dimensions \(10^{20}\) or 100,000,000,000,000,000,000, hence the *curse of dimensionality*. Some further illustrations of methods that break down or become impractical in high dimensional spaces are
given in Chapter 2 of Hastie, Tibshirani, and Friedman (2009), among others.

### The Empty Space Phenomenon

A related and strange property of high-dimensional analysis is the so-called
*empty space phenomenon*. In essence, the same number of data points are much further apart in higher dimensional spaces than
in lower dimensional ones. This means that we need (very) large data sets to do effective analysis in high dimensions.

To illustrate this point, consider three variables, say \(x\), \(y\), and \(z\), each generated as 10 uniform random variates in the unit interval (0-1). In one dimension, the observations on these variables are fairly close together. For example, in Figure 4, the one-dimensional spread for each of the variables is shown on the horizontal axes of a parallel coordinate plot (PCP). Clearly, the points on each of the lines that correspond to the dimensions are spaced fairly closely together.

However, when represented in a two-dimensional scatter plot, we observe large empty spaces in the unit quadrant, as in Figure 5 for \(x\) and \(y\) (similar graphs are obtained for the other pairs). The sparsity of the attribute space gets even more pronounced in three dimensions, as in Figure 6.

To quantify this somewhat, we can compute the nearest neighbor distances between the observation points in each of
the dimensions.^{2} In our example (the results will depend on the actual random numbers generated), the smallest nearest neighbor distance in the \(x\) dimension is between observations 5 and 8 at 0.009. In two dimensions, the smallest distance is between observations 1 and 4 at 0.094, and in three dimensions, it is also between 1 and 4, but now at 0.146. All distances are in the same units, since we are working with a unit interval for each variable.

Two observations follow from this small experiment. One is that the nearest neighbors in a lower dimension are not necessarily also nearest neighbors in higher dimensions. The other is that the nearest neighbor distance increases with the dimensionality. In other words, more of the attribute space needs to be searched before neighbors are found (this is in general a problem with k-nearest neighbor searches in high dimensions).

Further examples of some strange aspects of data in higher dimensional spaces can be found in Chapter 1 of Lee and Verleysen (2007).

## Principal Component Analysis (PCA)

### Principles

Principal components analysis has an eminent historical pedigree, going back to pioneering work in the early twentieth century by the statistician Karl Pearson (Pearson 1901) and the economist Harold Hotelling (Hotelling 1933). The technique is also known as the Karhunen-Loève transform in probability theory and as empirical orthogonal functions or EOF in meteorology (see, for example, in applications of space-time statistics in Cressie and Wikle 2011; Wikle, Zammit-Mangion, and Cressie 2019).

The derivation of principal components can be approached from a number of different perspectives, all leading to the same solution. Perhaps the most common treatment considers the components as the solution of a problem of finding new variables that are constructed as a linear combination of the original variables, such that they maximize the explained variance. In a sense, the principal components can be interpreted as the best linear approximation to the multivariate point cloud of the data.

More formally, consider \(n\) observations on \(k\) variables \(x_h\),
with \(h = 1, \dots, k\), organized
as a \(n \times k\) matrix \(X\) (each variable is a column in the matrix). In practice, each of the variables is typically standardized, such that its mean is zero and variance equals one. This avoids problems with (large) scale differences between the variables (i.e., some are very small numbers and others very large). For such standardized variables, the \(k \times k\) cross product matrix \(X'X\) corresponds to the correlation matrix (without standardization, this would be the variance-covariance matrix).^{3}

The goal is to find a small number of new variables, the so-called *principal components*, that explain the bulk of the variance (or, in practice, the correlation) in the original variables. If this can be accomplished with a much smaller number of variables than in the original set, the objective of *dimension reduction* will have been achieved.

We can think of each principal component \(z_u\) as a linear combination of the original variables \(x_h\), with \(h\) going from \(1\) to \(k\) such that: \[z_u = a_1 x_1 + a_2 x_2 + \dots + a_k x_k\] The mathematical problem is to find the coefficients \(a_h\) such that the new variables maximize the explained variance of the original variables. In addition, to avoid an indeterminate solution, the coefficients are scaled such that the sum of their squares equals \(1\).

A full mathematical treatment of the derivation of the optimal solution to this problem is beyond our scope (for details, see, e.g., Lee and Verleysen 2007, Chapter 2). Suffice it to say that the principal components are based on the eigenvalues and eigenvectors of the cross-product matrix \(X'X\).

The coefficients by which the original variables need to be multiplied to obtain each principal component can be shown to correspond to the elements of the eigenvectors of \(X'X\), with the associated eigenvalue giving the explained variance. Note that while the matrix \(X\) is not square, the cross-product matrix \(X'X\) is of dimension \(k \times k\), so it is square. See the Appendix for a quick overview of the concept of eigenvalue and eigenvector.

Operationally, the principal component coefficients are obtained by means of a
matrix decomposition. One option is to compute the *eigenvalue decomposition* of the
\(k \times k\) matrix \(X'X\), i.e., the correlation matrix, as
\[X'X = VGV',\]
where \(V\) is a \(k \times k\) matrix with the eigenvectors as columns (the coefficients needed
to construct the principal components), and \(G\) a \(k \times k\) diagonal matrix of the
associated eigenvalues (the explained variance). For further details, see the Appendix.

A second, and computationally preferred way to approach this is as a *singular value decomposition* (SVD)
of the \(n \times k\) matrix \(X\), i.e., the matrix of (standardized) observations, as
\[X = UDV',\]
where again \(V\) (the transpose of the \(k \times k\) matrix \(V'\)) is the matrix with the eigenvectors of \(X'X\) as columns, and \(D\) is a \(k \times k\)
diagonal matrix such that \(D^2\) is the matrix of eigenvalues of
\(X'X\).^{4} A brief discussion of singular value decomposition is given in the Appendix.

It turns out that this second approach is the solution to viewing the principal components explicity as a dimension reduction problem, originally considered by Karl Pearson. More formally, consider the observed vector on the \(k\) variables \(x\) as a function of a number of unknown *latent variables* \(z\), such that there is a linear relationship between them:
\[ x = Az, \]
where \(x\) is a \(k \times 1\) vector of the observed variables, and \(z\) is a \(p \times 1\) vector of the (unobserved) latent variables, ideally with \(p\) much smaller than \(k\). The matrix \(A\) is of dimension \(k \times p\) and contains the coefficients of the transformation. Again, in order to avoid indeterminate solutions, the coefficients are scaled such that \(A'A = I\), which
ensures that the sum of squares of the coefficients associated with a given component equals one.
Instead of maximizing explained variance, the objective is now to find \(A\) and \(z\) such that the so-called reconstruction error is minimized.^{5}

Note that different computational approaches to obtain the eigenvalues and eigenvectors (there is no analytical solution) may yield opposite signs for the elements of the eigenvectors (but not for the eigenvalues). This will affect the sign of the resulting component (i.e., positives become negatives). This is the case in `GeoDa`

for the **eigen** and **SVD** options, as illustrated below.

In a principal component analysis, we are typically interested in three main results. First, we need the principal component scores as a replacement for the original variables. This is particularly relevant when a small number of components explain a substantial share of the original variance. Second, the relative contribution of each of the original variables to each principal component is of interest. Finally, the variance proportion explained by each component in and of itself is also important.

In addition to these traditional interpretations, we will be particularly
interested in *spatializing* the principal components, i.e., analyzing
and visualizing their spatial patterns.

### Implementation

We invoke the principal components functionality from the **Clusters** toolbar icon,
shown in Figure 7.

From the main menu, we can select **Clusters > PCA**. The full list of options is shown in Figure 8. The dimension reduction methods are included in the top
part (separated from the others by a thin line). **PCA** is the first item on the list of options.

This brings up the **PCA Settings** dialog, the main interface through which variables
are chosen, options selected, and summary results are provided.

#### Variable Settings Panel

We select the variables and set the parameters for the principal components analysis
through the options in the left hand panel of the interface. We choose the six
same variables as in the multivariate analysis presented in Dray and Jombart (2011): **Crm_prs** (crimes against persons),
**Crm_prp** (crimes against property), **Litercy** (literacy), **Donatns** (donations), **Infants** (births out of wedlock), and **Suicids** (suicides)
(see also Anselin 2019, for an extensive discussion of the variables). Note that, as in the original Guerry data set, the scale of each variable is such that larger is better. Specifically, the rates are not the usual events/population (like a crime rate, number of crimes over population), but the reverse, population over events. For example, a large value for **Crm_prs** actually denotes a low crime rate, since it corresponds with more people per crime committed. The chosen variables
appear highlighted in the **Select Variables** panel, Figure 9.

The default settings for the **Parameters** are likely fine in most practical situations.
The **Method** is set to **SVD**, i.e., singular value decomposition. The alternative
**Eigen** carries out an explicit eigenvalue decomposition of the correlation matrix. In our example, the two approaches yield opposite signs for the loadings of three of the components. We return to this below.
For larger data sets, the singular value decomposition approach is faster and is the preferred option.

The **Transformation** option is set to **Standardize (Z)** by default, which converts all variables
such that their mean is zero and variance one, i.e., it creates a z-value as
\[z = \frac{(x - \bar{x})}{\sigma(x)},\]
with \(\bar{x}\) as the mean of the original variable \(x\), and \(\sigma(x)\) as its standard deviation.

An alternative standardization is **Standardize (MAD)**, which uses the *mean absolute deviation* (MAD) as the denominator in the standardization. This is preferred in some of the clustering literature, since it diminishes the effect of outliers on the standard deviation
(see, for example, the illustration in Kaufman and Rousseeuw 2005, 8–9). The mean absolute deviation for a variable \(x\) is computed as:
\[\mbox{mad} = (1/n) \sum_i |x_i - \bar{x}|,\]
i.e., the average of the absolute deviations between an observation and the mean for that variable. The estimate for \(\mbox{mad}\) takes the place of \(\sigma(x)\) in the denominator of the standardization expression.

Two additional transformations that are based on the range of the observations, i.e., the difference between the maximum and minimum, are the
**Range Adjust** and the **Range Standardize** options. **Range Adjust** divides each value by the range of observations:
\[r_a = \frac{x_i}{x_{max} - x_{min}}.\]
While **Range Adjust** simply rescales observations in function of their range, **Range Standardize** turns them into a value
between zero (for the minimum) and one (for the maximum):
\[r_s = \frac{x_i - x_{min}}{x_{max} - x_{min}}.\]
Whereas these transformations are used less commonly in the context of PCA, they are quite useful in cluster analysis (in `GeoDa`

, all
multivariate methods share the same standardization options).

The remaining options are
to use the variables in their
original scale (**Raw**), or as deviations from the mean (**Demean**).

After clicking on the **Run** button, the summary statistics appear in the right hand panel, as shown in Figure 10.
We return for a more detailed interpretation below.

#### Saving the Results

Once the results have been computed, a value appears next to the **Output** **Components** item
in the left panel, shown in Figure 11. This value corresponds to the number of components that explain 95% of the
variance, as indicated in the results panel. This determines the default number of components that
will be added to the data table upon selecting **Save**. The value can be changed in the drop-down
list. For now, we keep the number of components as **5**, even though that is not a very
good result (given that we started with only six variables).

The **Save** button brings up
a dialog to select variable names for the principal components,
shown in Figure 12. The default is generic and
not suitable for a situation where a large number of analyses will be carried out. In practice, one would customize the components names to keep the results from different computations distinct.

Clicking on **OK** adds the principal components to the data table, as in Figure 13, shown for the first
five observations. The principal components now become
available for all types of analysis and visualization. As always, the addition is only made permanent after
saving the table.

### Interpretation

The panel with summary results provides several statistics pertaining to the variance decomposition, the eigenvalues, the variable loadings and the contribution of each of the original variables to the respective components.

#### Explained variance

The first item lists the **Standard deviation** explained by each of the components. This corresponds to the
square root of the **Eigenvalues** (each eigenvalue equals the variance explained
by the corresponding principal component), which are listed as well. In our example, the first eigenvalue is 2.14047, which is thus the variance of the first component. Consequently, the standard deviation is the square root of this value, i.e., 1.463034, given as the first item in the list.

The sum of all the eigenvalues is 6, which equals the number of variables, or, more precisely, the rank of the matrix \(X'X\). Therefore, the proportion of variance explained by the first component is 2.14047/6 = 0.356745, as reported in the list. Similarly, the proportion explained by the second component is 0.200137, so that the cumulative proportion of the first and second component amounts to 0.356745 + 0.200137 = 0.556882. In other words, the first two components explain a little over half of the total variance.

The fraction of the total variance explained is listed both as a separate
**Proportion** and as a **Cumulative proportion**. The latter is typically used to choose a
cut-off for the number of components. A common convention is to take a threshold of 95%,
which would suggest 5 components in our
example. Note that this is not a good result, given that we started with 6 variables, so there is not much of a dimension reduction.

An alternative criterion to select the number of components is the so-called
**Kaiser** criterion (Kaiser 1960), which suggests to take the components for which the eigenvalue
exceeds **1**. In our example, this would yield 3 components (they explain about 74% of the total variance).

The bottom part of the results panel is occupied by two tables that have the original variables as rows and the components as columns.

#### Variable loadings

The first table shows the
**Variable Loadings**. For each component (column), this lists the elements
of the corresponding eigenvector. The eigenvectors are standardized such that the sum
of the squared coefficients equals one. The elements of the eigenvector are the coefficients by which the (standardized) original
variables need to be multiplied to construct each component. We illustrate this below.

It is important to keep in mind that the signs of the loadings may change, depending on the algorithm that is used in their computation, but the absolute value of the coefficients
will be the same. In our example, setting **Method** to **Eigen** yields the loadings shown in Figure 14.

For PC1, PC4, and PC6, the signs for the loadings are the opposite of those reported in Figure 10. This needs to be kept in mind when interpreting the actual value (and sign) of the components.

When the original variables are all standardized, each eigenvector coefficient
gives a measure of the relative contribution of a variable to the component in question.
These loadings are also the vectors employed in a principal component *biplot*,
a common graph produced in a
principal component analysis (but not currently available in `GeoDa`

).

#### Variable loadings and principal components

To further illustrate how the principal components are obtained, consider the standardized values for the six variables, listed in Figure 15 for the first five observations.

Each principal component is obtained by multiplying the values in the columns for each observation
by the corresponding element of the eigenvector, listed as **Variable Loadings** in Figure 10
(using the results from the default option **SVD**). For the first principal component, the eigenvector is listed
under **PC1**: -0.0659, -0.5123, 0.5118, -0.1062, -0.4513, -0.5063. For example, the value of the first principal
component for the first observation is found as the product of the first row in Figure 15 with
this eigenvector. Specifically, the result is
1.2205 \(\times\) -0.0659 + 2.6270 \(\times\) -0.5123 - 0.1228 \(\times\) 0.5118 - 0.3342 \(\times\) -0.1062 + 1.5973 \(\times\) -0.4513
- 0.0469 \(\times\) -0.5063 = - 2.1508, the value in the first row/column in Figure 13.

Similar calculations can be carried out to verify the other values.

#### Substantive interpretation of principal components

The interpretation and substantive meaning of the principal components can
be a challenge. In *factor analysis*, a number of rotations are applied to clarify the contribution of each variable to the different components. The latter are then imbued with meaning such as “social deprivation”, “religious climate”, etc. Principal component analysis tends to stay away from this, but nevertheless, it is useful to consider the relative contribution of each variable to the respective components.

The table labeled as **Squared correlations** lists those statistics between each of the original
variables in a row and the principal component listed in the column. Each row of the table shows how much of the variance
in the original variable is explained by each of the components. As a result, the values
in each row sum to one.

More insightful is the analysis of each column, which indicates
which variables are important in the computation of the matching component. In our example,
we see that **Crm_prp**, **Litercy**, **Infants** and **Suicids** are the most important
for the first component, whereas for the second component this is **Crm_prs** and **Donatns**.
As we can see from the cumulative proportion listing, these two components explain
about 56% of the variance in the original variables.

Since the correlations are squared, they do not depend on the sign of the eigenvector elements, unlike the loadings.

#### Visualizing principal components

Once the principal components are added to the data table, they are available for use in any graph (or map).

A useful graph is a scatter plot of any pair of principal components. For example, such a graph is shown for the first two components (based on the **SVD** method) in Figure 16.
By construction, the principal component variables are uncorrelated,
which yields the characteristic circular cloud plot. A regression line fit to this scatter
plot yields a horizontal line (with slope zero). Through linking and brushing, we can
identify the locations on the map for any point in the scatter plot.

In addition, this type of bivariate plot is sometimes employed to visually identify *clusters* in the data.
Such clusters would be suggested by distinct high density clumps of points in the graph. Such points are close in a multivariate sense,
since they correspond to a linear combination of the original variables that maximizes explained
variance.^{6}

To illustrate the effect of the choice of eigenvalue computation, the scatter plot in Figure 17 again plots the values for the first and second component, but now using the loadings from the **Eigen** method. Note how the scatter has been flipped around the vertical axis, since what used to be positive for PC1, is now negative, and vice versa.

An even more graphic illustration of the difference between the principal components obtained through the **SVD** method (PC1) and
the **Eigen** method (PC1e) can be seen by the perfect negative correlation in the scatter plot in Figure 16.

In order to gain further insight into the composition of a principal component, we
combine a box plot of the values for the component with a parallel coordinate plot of its main contributing variables. For example, in the box plot in Figure 19, we select the
observations in the top quartile of PC1 (using SVD).^{7}

We link the box plot to a parallel coordinate plot for the
four variables that contribute most to this component. From the analysis above, we know that
these are **Crm_prp**, **Litercy**, **Infants** and **Suicids**. The corresponding linked selection
in the PCP shows *low* values for all but **Litercy**, but the lines in the plot are all close together. This nicely illustrates how the
first component captures a *clustering* in multivariate attribute space among these four
variables.

Since the variables are used in standardized form, low values will tend to have a negative sign, and high values a positive sign. The loadings for PC1 are negative for **Crm_prp**, **Infants** and **Suicids**, which, combined with negative standardized values, will make a (large) positive contribution to the component. The loadings for **Litercy** are positive and they multiply a (large) postive value, also making a positive contribution. As a result, the values for the principal component end up in the top quartile.

The substantive interpretation is a bit of a challenge, since it suggests that high property crime, out of wedlock births and suicide rates (recall that low values for the variables are *bad*, so actually high rates) coincide with high literacy rates (although the latter are limited to military conscripts, so they may be a biased reflection of the whole population).

## Spatializing Principal Components

We can further *spatialize* the visualization of principal components by explicitly
considering their spatial distribution in a map. In addition, we can explore spatial patterns more formally through a local spatial autocorrelation analysis.

### Principal component maps

The visualization of the spatial distribution of the value of a principal component by means of a choropleth map is mostly useful to suggest certain patterns. Caution needs to be used to interpret these in terms of *high* or *low*, since the latter depends on the sign of the component loadings (and thus on the method used to compute the eigenvectors).

For example, using the results from **SVD** for the first principal component, a quartile map shows a distinct pattern, with higher values above a diagonal
line going from the north west to the middle of the
south east border, as shown in Figure 20.

A quartile map for the first component computed using the **Eigen** method shows the same overall spatial pattern, but the roles of high and low are reversed,
as in Figure 21.

Even clearer evidence of the need for care in the interpretation of the components is seen in Figures
22 and 23 that show a box map for the first principal component,
respectively for **SVD** and for **Eigen**. What is shown as a lower outlier in for **SVD** is an upper
outlier in the box map for **Eigen**.

### Cluster maps

We pursue the nature of this pattern more formally through a local Moran cluster map. Again, using the first component from the **SVD** method, we find a strong high-high cluster in the northern part of the country, ranging
from east to west (with p=0.01 and 99999 permutations, using queen contiguity). In addition, there is a pronounced Low-Low cluster in the center of
the country. With the component based on the **Eigen** method, the locations of the clusters would be the same, but their labels would be opposite, i.e., what was high-high becomes low-low and vice versa. In other words, the location of the clusters is not affected by the method used, but their interpretation is.

### Principal components as multivariate cluster maps

Since the principal components combine the original variables, their patterns of spatial clustering could be viewed as an alternative to a full multivariate spatial clustering. The main difference is that in the latter each variable is given equal weight, whereas in the principal component some variables are more important than others. In our example, we saw that the first component is a reasonable summary of the pattern among four variables, **Crm_prp**, **Litercy**, **Infants** and **Suicids**.

First, we can check on the makeup of the component in the identified cluster. For example, we select the high-high values in the cluster map and find the matching lines in a parallel coordinate plot for the four variables, shown in Figure 25. The lines track closely (but not perfectly), suggesting some clustering of the data points in the four-dimensional attribute space. In other words, the high-high spatial cluster of the first principal component seems to match attribute similarity among the four variables.

In addition, we can now compare these results to a cluster or significance map from a multivariate local Geary analysis for the four variables. In Figure 26, we show the significance map rather than a cluster map, since all significant locations are for positive spatial autocorrelation (p < 0.01, 99999 permutations, queen contiguity). While the patterns are not identical to those in the principal component cluster map, they are quite similar and pick up the same general spatial trends.

This suggests that in some instances, a univariate local spatial autocorrelation analysis for one or a few dominant principal components may provide a viable alternative to a full-fledged multivariate analysis.

## Appendix

### Basics - eigenvalues and eigenvectors

Consider a square symmetric matrix:^{8}
\[
A = \begin{bmatrix}
1 & 3\\ 3 & 2
\end{bmatrix}
\]
and a column vector, say:
\[
v = \begin{bmatrix}
1\\
2
\end{bmatrix}
\]
As illustrated in Figure 27, we can visualize the vector as an *arrow* from the origin of a \(x-y\) scatter plot
going to the point (\(x = 1\), \(y = 2\)).

Pre-multiplying this vector by a scalar just moves the point further on the same slope. For example, \[2 \times \begin{bmatrix} 1\\ 2 \end{bmatrix} = \begin{bmatrix} 2\\ 4 \end{bmatrix} \] This is equivalent to moving the arrow over on the same slope from (1,2) to the point (2,4) further from the origin, as shown by the dashed line in Figure 27.

In contrast, pre-multiplying the vector by the matrix does both rescaling and rotation of the vector. In our example, \[ Av = \begin{bmatrix} (1 \times 1) + (3 \times 2)\\ (3 \times 1) + (2 \times 2) \end{bmatrix} = \begin{bmatrix} 7 \\ 5 \end{bmatrix}. \] In Figure 27, the arrow is both moved further away from the origin and also rotated down to the point (7,5) (now the x-value is larger than the y-value, so the slope of the arrow is less steep).

The eigenvectors and eigenvalues of a square symmetric matrix are a special
pair of scalar-vector, such that \(Av = \lambda v\), where \(\lambda\) is the *eigenvalue* and \(v\) is the *eigenvector*.
In addition, the eigenvectors are orthogonal to each other, i.e., \(v_u'v_k = 0\) (for different eigenvectors),
and \(v_u'v_u = 1\) (the sum of the squares of the eigenvector elements sum to one).

What does this mean? For this particular vector (i.e., arrow from the origin), the transformation by \(A\) does not rotate the vector, but simply rescales it (i.e., moves it further or closer to the origin), by exactly the factor \(\lambda\). In our example, the two eigenvectors turn out to be [0.6464 0.7630] and [-0.7630 0.6464], with associated eigenvalues 4.541 and -1.541. Each square matrix has as many eigenvectors and matching eigenvalues as its rank, in this case 2 – for a 2 by 2 nonsingular matrix. The actual computation of eigenvalues and eigenvectors is rather complicated, so let’s just assume we have the result.

Consider post-multiplying the matrix \(A\) with the eigenvector [0.6464 0.7630]: \[\begin{bmatrix} (1 \times 0.6464) + (3 \times 0.7630)\\ (3 \times 0.6464) + (2 \times 0.7630) \end{bmatrix} = \begin{bmatrix} 2.935\\ 3.465 \end{bmatrix} \] Now, for comparison purposes, we rescale the eigenvector with the matching eigenvalue: \[4.541 \times \begin{bmatrix} 0.6464\\ 0.7630 \end{bmatrix} = \begin{bmatrix} 2.935\\ 3.465 \end{bmatrix} \] This yields the same result as the matrix multiplication.

Furthermore, if we stack the eigenvectors in a matrix \(V\), we can verify that they are orthogonal and the sum of squares of the coefficients sum to one, i.e., \(V'V = I\): \[\begin{bmatrix} 0.6464 & -0.7630\\ 0.7630 & 0.6464 \end{bmatrix}' \begin{bmatrix} 0.6464 & -0.7630\\ 0.7630 & 0.6464 \end{bmatrix} = \begin{bmatrix} 1 & 0\\ 0 & 1 \end{bmatrix} \] In addition, it is easily verified that \(VV' = I\) as well. This means that the transpose of \(V\) is also its inverse (per the definition of an inverse matrix, i.e., a matrix for which the product with the original matrix yields the identity matrix), or \(V^{-1} = V'\).

As a result, we have found a particular vector (and associated scalar) so that multiplying this vector by \(A\) does not result in any rotation, but only in a rescaling.

Eigenvectors and eigenvalues are central in many statistical analyses, but it is important to realize they
are not as complicated as they may seem at first sight. On the other hand, computing them efficiently *is* complicated, and best left
to specialized programs (`GeoDa`

uses the `Eigen`

C++ library for these computations, http://eigen.tuxfamily.org/).

Finally, a couple of useful properties of eigenvalues are worth mentioning.

The sum of the eigenvalues equals the *trace* of the matrix. The trace is the sum of the diagonal elements, in our example, the trace is \(1 + 2 = 3\). The sum of the two eigenvalues is \(4.541 -1.541 = 3\). Also, the product of the eigenvalues equals the *determinant* of the matrix. For a \(2 \times 2\) matrix, the determinant is \(ab - cd\), or the product of the diagonal elements minus the product of the off-diagonal elements. In our example, that is \((1 \times 2) - (3 \times 3) = -7\). The product of the two eigenvalues is \(4.541 \times -1.541 = -7.0\).

### Eigenvector decomposition of a square matrix

For each eigenvector \(v\) of a square symmetric matrix \(A\), we have \(Av = \lambda v\). We can write this compactly for all the eigenvectors and
eigenvalues by organizing the individual eigenvectors as columns in a
\(k \times k\) matrix \(V\), as we have seen above. Similarly, we can organize
the matching eigenvalues as the diagonal elements of a diagonal matrix,
say \(G\). The basic eigenvalue expression can then be written
as \(AV = VG\). Note that \(V\) goes first in the matrix multiplication on the right hand side to ensure that each column of
\(V\) is multiplied by the corresponding eigenvalue on the diagonal of \(G\) to yield \(\lambda v\). Now, we take advantage of the fact that the eigenvectors are orthogonal, namely that \(VV' = I\). Post-multiplying each side of the equation by \(V'\) gives
\(AVV' = VGV'\), or \(A = VGV'\), the so-called *eigenvalue decomposition*
or *spectral decomposition* of a square symmetric matrix.

Since the correlation matrix \(X'X\) is square and symmetric, its spectral decomposition has the same structure: \[X'X = VGV',\] with \(V\) and \(G\) have the same meaning as before. In the context of principal component analysis, the coefficients \(a_h\) needed to construct the component from the original variables are the elements of the eigenvector column, with the row matching the column of the original variable. For example, the coefficients \(a_2\) needed to multiply the variable \(x_2\) would be in the second row of the matrix \(V\).

The variance associated with the h-th principal component is the h-th element on the diagonal of \(G\). With these values in hand, we have all the information to construct and interpret the principal components.

In `GeoDa`

, these computations are carried out by means of the `EigenSolver`

routine from the `Eigen`

C++ library (https://eigen.tuxfamily.org/dox/classEigen_1_1EigenSolver.html).

### Singular Value Decomposition (SVD)

The eigenvalue decomposition described above only applies to square matrices. A more general decomposition is the so-called
*singular value decomposition*, which applies to any rectangular matrix, such as the \(n \times k\) matrix \(X\) with (standardized)
observations.

For a full-rank \(n \times k\) matrix \(X\) (there are many more general cases where SVD can be applied), the decomposition takes the following form: \[X = UDV',\] where \(U\) is a \(n \times k\) orthonormal matrix (i.e., \(U'U = I\)), \(D\) is a \(k \times k\) diagonal matrix, and \(V\) is a \(k \times k\) orthonormal matrix (i.e., \(V'V = I\)). In essence, the decomposition shows how a transformation \(X\) applied to a vector \(y\) (through a matrix multiplication of \(X\) with a \(k\)-dimensional vector \(y\)) can be viewed to consist of a series of rotations and rescalings. Note that in the general rectangular case, the transformation \(Xy\) not only involves rotation and rescaling (as in the square case), but also a change in dimensionality from a \(k\)-dimensional space (the length of \(y\)) to an \(n\)-dimensional space (the length of \(Xy\), i.e., an n-dimensional vector). An intuitive visualization of the operations involved in such a transformation that are reflected in the elements of the SVD can be found at https://en.wikipedia.org/wiki/File:Singular_value_decomposition.gif.

While SVD is very general, we don’t need that full generality for the PCA case. It turns out that there is a direct connection between the eigenvalues and eigenvectors of the (square) matrix \(X'X\) and the SVD of \(X\). Using the expression above, we can write \(X'X\) as: \[X'X = (UDV')'(UDV') = VDU'UDV' = VD^2V',\] since \(U'U = I\).

It thus turns out that the columns of \(V\) from the SVD contain the eigenvectors of \(X'X\) and the square of the diagonal elements of \(D\) in the SVD are the eigenvalues associated with the corresponding principal component.

The concept of singular value decomposition is very broad and a detailed discussion is beyond our scope. A number of computational approaches are available. In `GeoDa`

, the `JacobiSVD`

routine is used from the `Eigen`

C++ library (https://eigen.tuxfamily.org/dox/classEigen_1_1JacobiSVD.html). Note that other software (such as `R`

’s `pca`

) uses an algorithm based on the Householder transformation (from the `lapack`

library). This yields the opposite signs for the eigenvector elements.

## References

Anselin, Luc. 2019. “A Local Indicator of Multivariate Spatial Association, Extending Geary’s c.” *Geographical Analysis* 51 (2): 133–50.

Bellman, R. E. 1961. *Adaptive Control Processes*. Princeton, N.J.: Princeton University Press.

Cressie, Noel, and Christopher K. Wikle. 2011. *Statistics for Spatio-Temporal Data*. Hoboken, NJ: John Wiley; Sons.

Dray, Stéphane, and Thibaut Jombart. 2011. “Revisiting Guerry’s Data: Introducing Spatial Constraints in Multivariate Analysis.” *The Annals of Applied Statistics* 5 (4): 2278–99.

Everitt, Brian S., Sabine Landau, Morven Leese, and Daniel Stahl. 2011. *Cluster Analysis, 5th Edition*. New York, NY: John Wiley.

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. *The Elements of Statistical Learning (2nd Edition)*. New York, NY: Springer.

Hotelling, Harold. 1933. “Analysis of a Complex of Statistical Variables into Principal Components.” *Journal of Educational Psychology* 24: 417–41.

Kaiser, H. F. 1960. “The Application of Electronic Computers to Factor Analysis.” *Educational and Psychological Measurement* 20: 141–51.

Kaufman, L., and P. Rousseeuw. 2005. *Finding Groups in Data: An Introduction to Cluster Analysis*. New York, NY: John Wiley.

Lee, John A., and Michel Verleysen. 2007. *Nonlinear Dimensionality Reduction*. New York, NY: Springer-Verlag.

Pearson, Karl. 1901. “On Lines and Planes of Closest Fit to Systems of Points in Space.” *Philosophical Magazine* 2: 559–72.

Wikle, Christopher K., Andrew Zammit-Mangion, and Noel Cressie. 2019. *Spatio-Temporal Statistics with R*. Boca Raton, FL: CRC Press.

University of Chicago, Center for Spatial Data Science – anselin@uchicago.edu↩︎

In

`GeoDa`

, this can be accomplished in the weights manager, using the**variables**option for distance weights, and making sure the transformation is set to**raw**. The nearest neighbor distances are the third column in a weights file created for \(k = 1\).↩︎The standardization should not be done mechanically, since there are instances where the variance differences between the variables are actually meaningful, e.g., when the scales on which they are measured have a strong substantive meaning (e.g., in psychology).↩︎

Since the eigenvalues equal the variance explained by the corresponding component, the diagonal elements of \(D\) are thus the standard deviation explained by the component.↩︎

The concept of reconstruction error is somewhat technical. If \(A\) were a square matrix, we could solve for \(z\) as \(z = A^{-1}x\), where \(A^{-1}\) is the inverse of the matrix \(A\). However, due to the dimension reduction, \(A\) is not square, so we have to use something called a pseudo-inverse or Moore-Penrose inverse. This is the \(p \times k\) matrix \((A'A)^{-1}A'\), such that \(z = (A'A)^{-1}A'x\). Furthermore, because \(A'A = I\), this simplifies to \(z = A'x\) (of course, so far the elements of \(A\) are unknown). Since \(x = Az\), if we knew what \(A\) was, we could find \(x\) as \(Az\), or, with our solution in hand, as \(AA'x\). The reconstruction error is then the squared difference between \(x\) and \(AA'x\). We have to find the coeffients for \(A\) that minimize this expression. For an extensive technical discussion, see Lee and Verleysen (2007), Chapter 2.↩︎

For an example, see, e.g., Chapter 2 of Everitt et al. (2011).↩︎

Note that, when using the

**Eigen**method, these would be the bottom quartile of the first component.↩︎Not all the results for symmetric matrices hold for asymmetric matrices. However, since we only deal with symmetric matrices in the applications covered here, we just stick to the simpler case.↩︎