# Density-Based Clustering Methods

#### Luc Anselin^{1}

#### 10/30/2020 (latest update)

## Introduction

In this chapter, we consider density based clustering methods. These approaches look in the data for high density subregions of arbitrary shape, separated by low-density regions. Alternatively, the regions can be interpreted as modes in the spatial distribution over the support of the observations.

These methods are situated in between the local spatial autocorrelation statistics and the regionalization methods
we consider in later chapters. They pertain primarily to point patterns, but can also be extended to a full
multivariate setting. Even though they show many similarities with spatially constrained clustering methods,
they are not quite the same, in that they do not necessarily yield a complete partitioning of the data.
Therefore, they are considered separately, even though they are included under the **Cluster Methods** in `GeoDa`

.

Attempts to discover high density regions in the data distribution go back to the classic paper on mode analysis by Wishart (1969),
and its refinement in Hartigan (1975). In the literature, these methods are also referred to as *bump hunting*, i.e., looking for
bumps (high regions) in the data distribution.

In this chapter, we will focus on the application of density-based clustering methods to the geographic location of points, but they are generalizable to locations in high-dimensional attribute space as well.

An important concept in this respect is that of a *level set* associated with a given density level \(\lambda\):
\[L(\lambda; p) = \{ x | p(x) \gt \lambda\},\]
i.e., the collection of data points \(x\) for which the probability exceeds the given threshold. A subset of such points that contains the maximum number of points that are *connected* (so-called connected components) is called a *density-contour cluster* of the density function. A
*density-contour tree* is then a tree formed of nested clusters that are obtained by varying the level \(\lambda\) (Hartigan 1975; Müller and Sawitzki 1991; Stuetzle and Nugent 2010).

To visualize this concept, consider the data distribution represented as a three-dimensional surface. Then, a level set consists of those data points that are above a horizontal plane at \(\lambda\), e.g., as islands sticking out of the ocean. As \(\lambda\) increases, the level set becomes smaller. In our island analogy, this corresponds to a rising ocean level (the islands become smaller and may even disappear). In the other direction, as \(\lambda\) decreases, or the ocean level goes down, connections between islands (a land bridge) may become visible so that they no longer appear separated, but rather as a single entity.

As mentioned, in contrast to the classic cluster methods discussed in later methods, density-based methods do not
necessarily yield a complete regionalization, and some observations (points)
may not be assigned to any cluster. In turn, some of these points can be interpreted
as *outliers*, which is of main interest in certain applications.
In a sense, the density-based cluster methods are similar in spirit to the local spatial autocorrelation statistics, although
most of them are not formulated as hypothesis tests.

We consider four approaches. We begin with a simple heat map as a uniform density kernel centered on each location.
The logic behind this graph is similar to that of Openshaw’s
*geographical analysis machine* (Openshaw et al. 1987) and the approach
taken in spatial
scan statistics (Kulldorff 1997), i.e., a simple count of the points within the given radius.

The next methods are all related to DBSCAN (Ester et al. 1996), or *Density Based
Spatial Clustering of Applications with Noise*. We consider both the original DBSCAN,
as well as its *improved* version, referred to as DBSCAN*,
and its *Hierarchical* version, referred to as HDBSCAN (or, sometimes, HDBSCAN*)
(Campello, Moulavi, and Sander 2013; Campello et al. 2015).

The methods are illustrated here using the point locations of liquor stores in Chicago (in 2015), but they can pertain to other geometric forms as well, such as clustering of polygons (Sander et al. 1998). In general, they can also be applied to any set of points in multi-attribute space (i.e., non-geographical), although that use is perhaps less common.

### Objectives

Interpret the results of a uniform kernel heat map

Understand the principles behind DBSCAN

Set the parameters for a DBSCAN approach

Interpret the clustering results yielded by DBSCAN

Understand the difference between DBSCAN and DBSCAN*

Analyze the dendrogram generated by DBSCAN*

Understand the logic underlying HDBSCAN

Interpret the condensed tree and clustering results yielded by HDBSCAN

Understand soft clustering and outlier identification in HDBSCAN

#### GeoDa functions covered

- Map option > Heat Map
- Clusters > DBscan
- Clusters > HDBscan

### Getting started

We will use the **liq_Chicago** shape file with 571 point locations of liquor stores
in the city of Chicago during 2015. This data set is included as one of the sample
data sets, named liquor stores. The point locations
were scraped from Google maps and converted to the Illinois State Plane projection.

In Figure 1, the points are shown against a backdrop of the Chicago community area boundaries.^{2}

## Heat Map

### Principle

The heat map as implemented in `GeoDa`

is a simple uniform kernel for a given bandwidth. A circle with radius equal
to the bandwidth is centered on each point in turn, and the number of points within the radius are counted. The
density of the points is reflected in the color shading.

This is the same principle as underlying the *Geographical Analysis Machine* or GAM of Openshaw et al. (1987), where the
uniform kernel is computed for a range of bandwidths to identify high intensity *clusters*. In our
implementation, no significance
is assigned, and the heat map is used primarily as a visual device to highlight differences in the density
of the points.

### Implementation

The heat map is not invoked from the cluster menu, but as an option on a point map. Right click to bring up
the list of options and select **Heat Map > Specify Bandwidth** as in Figure 2.

This brings up the default heat map, created for a bandwidth that corresponds with the max-min distance, i.e., the largest of the nearest neighbor distances. This ensures that each point has at least one neighbor. The corresponding distance is shown in the dialog, as in Figure 3.

The default setting is almost never very informative, especially when the points are distributed unevenly. For example, after setting the bandwidth to 3000, a much more interesting pattern is revealed, with a greater concentration of points in the near north side of the city, shown in Figure 4.

In addition to the bandwidth, the heat map also has the option to select a variable other than the x-y coordinates
to compute the kernel. This is selected in **Core Distance**. We do not pursue this further here.

Finally, the heat map has all the usual customization options, such as changing the fill color, the outline color, or the transparency.

## DBSCAN

### Principle

The DBSCAN algorithm was originally outlined in Ester et al. (1996) and Sander et al. (1998), and was more recently elaborated upon in Gan and Tao (2017) and Schubert et al. (2017). Its logic is similar to that just outlined for the uniform kernel. In essence, the method again consists of placing circles of a given radius on each point in turn, and identifying those groupings of points where a lot of locations are within each others range.

In order to quantify this, DBSCAN introduces a special terminology. Points are classified
as *Core*, *Border* or *Noise* depending on how many other points are within a critical
distance band, the so-called *Eps neighborhood*. To visualize this, Figure 5
contains nine points, each with a circle centered on it with radius equal to *Eps*.^{3}
Note that in
DBSCAN, any distance metric can be used, not just Euclidean distance as in this illustration.

A second critical concept is the number of points that need to be included in the distance band
in order for the spatial distribution of points to be considered as dense. This is the so-called
minimum number of points, or *MinPts* criterion. In our example, we take that to be four. Note, that
in contrast to the convention we used before in defining k-nearest neighbors, the *MinPts* criterion
includes the point itself. So a *MinPts* = 4 corresponds to a point having 3 neighbors within the *Eps* radius
.^{4}

In Figure 5, all red
points with associated red circles have at least four points within the critical range (including the central point). They are labeled
as *Core* points and are included in the cluster (point C and the other red points). The points in magenta (labeled B), with associated magenta circles have some points within
the distance range (one, to be precise), but not sufficient to meet the *MinPts* criterion. They
are potential *Border* points and may or may not be included in the cluster. Finally, the blue point (labeled N) with associated blue circle does not have
any points within the critical range and is labeled *Noise*. Such a point cannot become part of any cluster.

A point is *directly density reachable* from another point if it belongs to the *Eps* neighborhood of that point and
is one of *MinPts* neighbors of that point.
This is not a symmetric relationship.

In our example, any red point is directly density reachable from at least
one other red point. In turn that point is within the critical range from the original point. However, for points
B, the relationship only holds in one direction, as shown by the arrow. They are directly density reachable from a red point, but since
the B points only have one neighbor, their range does not meet the minimum criterion. Therefore, the neighbor is *not*
directly density reachable from B. A chain of points in which each point is directly density reachable from the previous one is called
*density reachable*. In order to be included, each point in the chain has to have at least *MinPts* neighbors and could serve as the core
of a cluster. All our red points are density reachable.

In order to decide whether a border point should be included in a cluster, the concept of *density connected* is introduced.
Two points are density connected if they are both density reachable from a third point. In our example, the points B are density
connected to the other core points through their neighbor, which is itself directly density reachable from the red points in its
neighborhood.

In DBSCAN, a cluster is then defined as collection of points that are density connected and maximize the density reachability.
The algorithm starts by randomly selecting a point and determining whether it can be classified as *Core* – with at least *MinPts*
in its *Eps* neighborhood – or rather as *Border* or *Noise*. A *Border* point can later be included in a cluster if it is
*density connected* to another *Core* point. It is then assigned the cluster label of the core point and no longer further considered.
One implication of this aspect of the algorithm is that once a *Border* point is assigned a cluster label, it cannot be assigned
to a different cluster, even though it might actually be closer to the corresponding core point. In a later version, labeled
DBSCAN* (considered in the next section), the notion of border points is dropped, and only *Core* points are considered to form clusters.

The algorithm systematically moves through all the points in the data set and repeats the process. The search for neighbors is facilitated by using an efficient spatial data structure, such as an R* tree. When two clusters are density connected, they are merged. The process continues until all points have been evaluated.

Two critical parameters in the DBSCAN algorithm are the distance range and the minimum number of neighbors. In addition, sometimes a tolerance for noise points is specified as well. The latter constitute zones of low density that are not deemed to be interesting.

In order to avoid any noise points, the critical distance
must be large enough so that every point is at least density connected. An analogy is the specification of a max-min distance
band in the creation of spatial weights, which ensures that each point has at least one neighbor. In practice, this is typically not desirable, but in some implementations a maximum
percentage of noise points can be set to avoid too many low density areas.^{5}

Ester et al. (1996) recommend that *MinPts* be set at 4 and the
critical distance adjusted accordingly to make sure that sufficient observations can be classified as *Core*. As in other cluster
methods, some trial and error it typically necessary. Finding a proper value for *Eps* is often considered a major
drawback of the DBSCAN algorithm.

An illustration of the detailed steps and logic of the algorithm is provided by a worked example in the Appendix.

### Implementation

DBSCAN is invoked from the cluster toolbar icon as the first item in the density cluster group, or
from the menu as **Clusters > DBScan **, as shown in Figure 6.

The usual cluster interface appears, listing the variables in the data table and the various
parameters to be selected, as in Figure 7. In our example, we only have **<X-Centroids>**
and **<Y-Centroids>**, since only the location information of the stores has been included. The default
is to have the **Method** selected as **DBScan** (DBSCAN* is uses the same interface and is discussed next).
In addition, we make sure that the **Transformation** is set to **Raw**, in order to use the original coordinates,
without any transformations.

As seen in Figure 7, the default uses the largest nearest neighbor distance of 8179 (feet) such that
all observations are connected (note that this is the same distance as the default bandwidth in
Figure 3). We can check in the weights manager dialog that this is indeed the distance
that would be the default to create a fully connected distance band spatial weight. This yields a median number of neighbors
of 30, with a minimum of 1 and a maximum of 79. Needless to say, the default **Min Points** value of 4 is not
very meaningful in this context.

The resulting cluster map is shown in Figure 8. It contains two clusters, one very large
one with 561 observations, and a small one with 7 observations. In addition, three points are classified as *Noise*.
The **Summary** panel contains the usual information on cluster centers
and sum of squares, which is not very relevant in this example. There is also a button for **Dendrogram**, which
is only used for **DBSCAN*** and is empty in the current example.

We assess more meaningful parameter values in the next set of examples.

#### Border points and minimum cluster size

The DBSCAN algorithm depends critically on the choice of two parameters. One is the **Distance Threshold**, or *Eps*.
The other is the
minimum number of neighbors within this epsilon distance band that an observation must have in order to
be considered a *Core* point, i.e., the **Min Points** parameter. Each cluster must contain at least one core point.

As *Eps* decreases, more and more points will become disconnected and thus will be automatically categorized as *Noise*.
For example, for a distance band of 3000 (as in the heat map in Figure 4), we can visualize the connectivity graph of the corresponding distance band
spatial weights, as in Figure 9. This yields 37 unconnected points with neighbors
ranging from 0 to 30, with a median of 4.

With the distance set to 3000 and leaving Min Points at 4, we obtain the result for DBSCAN shown in Figure 10. There are 19 clusters identified, ranging in size from 233 to 4 observations, with 122 points designated as noise.

In rare instances, a cluster may be identified that has less than the **Min Points** specified. This is
an artifact of
the treatment of *Border* points in the algorithm. Since border points are assigned to the first cluster
to which they are connected, they are no longer available for inclusion in a later cluster. Say a later
core point is connected to the same border point, which is counted as part of the **Min Points** for the
new core. However, since it is already assigned to another cluster, it is no longer available and the
new cluster seems to have fewer members than the specified **Min Points**. This particular feature
of the DBSCAN algorithm can lead to
counterintuitive results, although this rarely happens. We illustrate such an instance below.^{6}

#### Sensitivity analysis

We further assess the sensitivity of the cluster results to the two main parameters by increasing **Min Points** to
10, while keeping the distance at 3000. The corresponding cluster map is shown in Figure 11.
The number of clusters is drastically reduced to 6, ranging in size from 110 to 12. In all, 310 points (more than half)
are designated as noise.

Finally, we decrease the distance to 1000, while keeping **Min Points** at 4. In Figure 12, we now find 21 clusters, with the largest
one consisting of 12 observations, whereas the smallest one has only 3 observations. A majority of the points (456) are
not assigned to any cluster.

Note that this is an instance where a cluster (cluster 21) has *less than* **Min Points** observations assigned to it.
Consider the close up view in Figure 13, which gives the connectivity structure that corresponds
to a distance band of 1000. Cluster 21 consists of points with **id** 501, 72 and 807. Point 501 is clearly a *Core* point,
since it is connected to 72 and 807, as well as to 710, so it has three neighbors and meets the **Min Points** criterion of 4.
Both 72 and 807 are assigned to 501 as *Border* points, since they do not meet the minimum connectivity requirement (each only has two neighbors). The same is the case for 710, but since 710 is already assigned to a different cluster (cluster 2),
it cannot be reassigned to cluster 21. As a result, the latter only has 3 observations.

## DBSCAN*

### Principle

As we have seen, one of the potentially confusing aspects of DBSCAN is the inclusion of so-called
*Border* points in a cluster. The DBSCAN* algorithm, outlined in Campello, Moulavi, and Sander (2013) and further
elaborated upon in Campello et al. (2015), does away
with the notion of border points, and only considers *Core* and *Noise* points.

Similar to the approach in DBSCAN, a *Core* object is defined with respect to a distance threshold (\(\epsilon\)) as the
center of a neighborhood that contains at least *Min Points* other observations. All non-core
objects are classified as *Noise* (i.e., they do not have any other point within the distance range \(\epsilon\)).
Two observations are \(\epsilon\)-*reachable* if each is part of the \(\epsilon\)-neighborhood of the other.
*Core* points are *density-connected* when they are part of a chain of \(\epsilon\)-reachable points.
A *cluster* is then any largest (i.e., all eligible points are included) subset of density
connected pairs. In other words, a cluster consists of a chain of pairs of points that belong to
each others \(\epsilon\)-neighborhoods.

An alternative way to view \(\epsilon\)-reachability is to consider the minimum radius such that two points
satisfy this criterion for a given *Min Points*, or k, where k = *Min Points* - 1 (more precisely, whereas *Min Points*
includes the point itself, k does not). For k > 1 (or, Min Points > 2), this is the so-called
*mutual reachability distance*, the maximum between the distance needed to include the required number
of neighbors for each point (the k-nearest neighbor distance, called *core distance* in HDBSCAN, \(d_{core}\), see below)
and the actual inter-point distance. This has the effect of *pushing*
points apart that may be close together, but otherwise are in a region of low density, so
that their respective k-nearest neighbor distances are (much) larger than the distance between them. Formally,
the concept of
*mutual reachability distance* between points A and B is defined as:
\[d_{mr}(A,B) = max[d_{core}(A),d_{core}(B),d_{AB}].\]
The mutual reachibility distance between the points forms the basis to construct a *minimum spanning tree* (MST) from
which a hierarchy of clusters can be obtained. The hiearchy follows from applying cuts to the edges in the MST in
decreasing order of core distance.

Conceptually, this means that one starts by applying a cut between two edges that
are the furthest apart. This either results in a single point splitting off (a *Noise* point), or in the single cluster
to split into two (two subtrees of the MST). Subsequent cuts are applied for smaller and smaller distances. Decreasing the critical distance
is the same as increasing the density of points, referred to as \(\lambda\) (see below for a formal definition).

As Campello et al. (2015) show, applying cuts to the MST with decreasing distance (or increasing density) is equivalent to applying cuts to a dendrogram obtained from single linkage hierarchical clustering using the mutual reachability distance as the dissimilarity matrix.

DBSCAN* derives a set of *flat* clusters by applying a *cut* to the dendrogram associated with a given
distance threshold. The main differences with the original DBSCAN is that *border* points are no longer
considered and the clustering process is visualized by means of a dendrogram. However, in all other respects it is similar in spirit,
and it still suffers from the need to select a distance threshold.

HDBSCAN (or HDBSCAN*) improves upon this by eliminating the arbitrariness of the threshold. This is considered next.

### Implementation

DBSCAN* is invoked from the cluster menu as **Clusters > DBSCAN** in the same way as DBSCAN, but by
selecting the **Method** as **DBSCAN* **. This also activates the **Min Cluster Size** option, set to equal to
**Min Points** by default. The latter is initialized at **4**. As shown in Figure 14, the starting
point for the **Distance Threshold** is again the distance that ensures that each point has at least one neighbor. The
value is in feet, which follows from using **Raw** for the **Transformation**, to keep the original coordinate units.
Since this max-min value is typically not a very informative distance, we will change it to **3000**. All the other
settings are the same as before.

After invoking **Run**, the dendrogram is populated, as shown in Figure 15. The horizontal scale
shows the mutual reachability distance, and the *cut* line (dashed red line) is positioned at 3000. This yields
12 clusters, color coded in the graph. As was the case for classical hierarchical clustering, the dendrogram supports
linking and brushing (not illustrated here).

**Save/Show Map** saves the cluster classification to the data table and renders the corresponding cluster map,
as in Figure 16. The 12 clusters range in size from 221 to 4, with 194 observations classified
as noise. They follow the same general pattern
as for DBSCAN in Figure 10, but seem more compact. Recall, the latter yielded 19 clusters ranging
in size from 233 to 4, with 122 noise points.

In addition to the cluster map, the **Summary** panel contains the usual measures of fit, which are not that
meaningful in this context.

#### Exploring the dendrogram

As long as **Min Points** (and the minimum cluster size) remains the same, we can explore the dendrogram for
different threshold distance cut values. This can be accomplished by moving the cut line in the graph, or
by typing a different value in the **Distance Threshold** box. For example, in Figure 17,
the cut line has been moved to 2000. There are still 12 clusters, but their makeup is quite different from before.

As we see in the cluster map in Figure 18, the clusters are now much smaller, ranging in size from 74 to 4, with 334 noise points. More interestingly, it seems like the large cluster in the northern part of the city (cluster 1 in Figure 16) is being decomposed into a number of smaller sub-clusters.

Note that for a different value of **Min Points**, the dendrogram needs to be reconstructed, since this critically
affects the mutual reachability distance (through the new core distance).

The main advantage of DBSCAN* over DBSCAN is a much more straightforward treatment of *Border* points (they are all classified as *Noise*), and an easy visualization of the sensitivity to the threshold distance by means of a
dendrogram. However, just as for DBSCAN, the need to select a fixed threshold distance (to obtain a *flat* cluster) remains
an important drawback.

## HDBSCAN

### Principle

HDBSCAN was originally proposed by Campello, Moulavi, and Sander (2013), and more recently elaborated upon by Campello et al. (2015) and
McInnes and Healy (2017).^{7} As in DBSCAN*, the algorithm implements the notion of *mutual reachability distance*. However,
rather than applying a fixed value of \(\epsilon\) or \(\lambda\) to produce a level set,
a hierarchical process is implemented that finds an optimal set of clusters. This allows for different levels of
\(\lambda\) for each cluster. In order to accomplish this, the concept of cluster
*stability* or *persistence* is introduced. Optimal clusters are selected based on their *relative excess of mass* value,
which is a measure of their persistence.

HDBSCAN keeps the notion of *Min Points* from DBSCAN, but introduces the concept of *core distance* of an object (\(d_{core}\))
as the distance between an object and its k-nearest neighbor, where k = Min Points - 1 (in other words, as for DBSCAN, the
object itself is included in *Min Points*).

Intuitively, for the same k, the core distance will be much smaller for
densely distributed points than for sparse distributions. Associated with the core distance for each object \(p\) is
an estimate of *density* or \(\lambda_p\) as the inverse of the core distance. As the core distance becomes smaller,
\(\lambda\) increases, indicating a higher density of points.

As in DBSCAN*, the mutual reachability distance is employed to construct a minimum spanning tree, or, rather,
the associated dendrogram from single linkage clustering. In constrast to DBSCAN*, there is not a global
cut applied to the dendrogram that uses the same value of \(\lambda\) for all clusters. Instead, HDBSCAN derives
an optimal solution, based on the notion of *persistence*
or *stability* of the cluster. A detailed illustration of these concepts and their implementation in the
algorithm is given in the Appendix. Next, we consider each of these concepts more formally.

#### Persistence

As the value of \(\lambda\) increases (or, equivalently, the core distance becomes smaller), larger clusters break into subclusters (our analogy with rising ocean levels), or shed *Noise* points. Note that a *Noise* point dropping out is not
considered a split of the cluster. The cluster remains, but just becomes smaller.

The objective is to identify the most prominent clusters
as those that continue to exist longest. This is formalized through the concept of *excess of mass* (Müller and Sawitzki 1991), or
the total density contained in a cluster after it is formed, i.e., after a level \(\lambda_{min}\) is reached.

In our island analogy, \(\lambda_{min}\) would correspond to the ocean level where the land bridge is first covered by water, such that the islands are no longer connected and appear as separate entities. The excess of mass would be the volume of the island above that ocean level, as illustrated in Figure 19. At the highest density (p in the Figure), two separate clusters are shown on the left, which appear at p = 0.10. With lower density, they are united into a single cluster, which appears around 0.03. At that level, there is an additional smaller cluster as well. With density below this level, there are no separate clusters.

More formally, a notion of *relative excess of mass* of a cluster \(C_i\) that appears at level \(\lambda_{min}(C_i)\) is
used to define the *stability* or *persistence* of a cluster \(C_i\) as:
\[S(C_i) = \sum_{x_j \in C_i} (\lambda_{max}(x_j,C_i) - \lambda_{min} C_i),\]
where \(\lambda_{min} C_i\) is the minimum density level at which \(C_i\) exists, and \(\lambda_{max}(x_j,C_i)\) is the
density level after which point \(x_j\) no longer belongs to the cluster.

In Figure 19, we see how the cluster on the left side comes into being at p = 0.03. At p = 0.10, it splits into two new clusters. So, for the large cluster, \(\lambda_{min}\), or the point where the cluster comes into being is 0.03. The cluster stops existing (it splits) for \(\lambda_{max}\) = 0.10. In turn, p = 0.10 represents \(\lambda_{min}\) for the two smaller clusters.

A *condensed dendrogram* is obtained by evaluating the stability of each cluster relative to the clusters into which it splits.
If the sum of the stability of the two descendants is larger than the stability of the parent node, then the
two children are kept as clusters and the parent is discarded. Alternatively, if the stability of the parent is larger
than that of the children, the children and all their descendants are discarded. In other words, the large cluster is
kept and all the smaller clusters that descend from it are eliminated. Here again, a split that sheds a single point
is not considered a cluster split, but only a shrinking of the original cluster (which retains its label). Only *true* splits that result in subclusters that contain at least two observations are considered in the simplification of the tree. Sometimes a further constraint is imposed on the minimum size of a cluster, which would pertain to the splits as well.
As suggested in Campello et al. (2015), a simple solution is to set the minimum cluster size equal to *Min Points*.

For example, in the illustration in Figure 19, the area in the left cluster between 0.03 and 0.10 is larger than the sum of the areas in the smaller clusters obtained at 0.10. As a result, the latter would not be considered as separate clusters.

This yields a much simpler tree and an identification of *clusters* that maximize the sum of the individual cluster
stabilities.

The C++ software implementation of HDBSCAN in `GeoDa`

is based on the Python code published in McInnes, Healy, and Astels (2017).

#### Cluster membership

The first distance for which a point can become part of a cluster is \(d_{p,core}\), i.e., the k-nearest neighbor
distance that satisfies the *Min Points* criterion. The inverse of this distance is an estimate of density associated
with that entry point, or
\(\lambda_p = 1 / d_{p,core}\). Once a point is part of a cluster, it remains there until the threshold distance
becomes small enough such that all the points in the cluster become *Noise*. This smallest distance, \(d_{min}\), corresponds with the largest density,
or, \(\lambda_{max} = 1/d_{min}\). The *strength* of the cluster membership of point \(p\) can now be expressed as the
ratio of the smallest core distance for the cluster to the core distance for point p, \(d_{min} / d_{p,core}\), or,
equivalently, as the ratio of the density that corresponds to the point’s core distance to the maximum density,
or, \(\lambda_p / \lambda_{max}\).

This ratio gives a measure of the degree to which a point *belongs* to a cluster. For example, for points that
meet the \(d_{min}\) thresholds, i.e., observations in the highest density region, the ratio will be 1. On the other
hand, for *Noise* points, the ratio is set to 0. Points that are part of the cluster, but do not meet the
strictest density criterion will have a value smaller than 1. The higher the ratio, the stronger the membership
of the point in the corresponding cluster.

This ratio can be exploited as the basis of a *soft* or *fuzzy* clustering approach, in which the degree of membership
in a cluster is considered. This is beyond our scope. Nevertheless, the ratio, or *probability* of cluster
membership, remains a useful indicator of the extent to which the cluster is concentrated in high density points.

#### Outlier detection

An interest related to the identification of density clusters is to find points that do *not* belong to
a cluster and are therefore classified as *outliers*. Parametric approaches are typically based on some function
of the spread of the underlying distribution, such as the well-known 3-sigma rule for a normal density.^{8}

An alternative approach is based on non-parametric principles. Specifically, the distance of an observation to
the nearest cluster can be considered as a criterion to classify outliers. We already saw a form of this
with the *Noise* points in DBSCAN, i.e., points that are *too far* from other points (given a neighborhood
radius, or *Eps*) to be considered members of a cluster.

Campello et al. (2015) proposed a post-processing of the HDBSCAN results to characterize the degree to which
an observation can be considered an outlier. The method is referred to as GLOSH, which stands for
*Global-Local Outlier Scores from Hierarchies*.

The logic underlying the outlier detection is closely related to the notion of cluster membership just discussed. In fact, the probability of being an outlier can be seen to be the complement of the cluster membership.

The rationale behind the index is to compare the lowest density threshold for which the point is attached to
a cluster (\(\lambda_p\) in the previous discussion) and the highest density threshold \(\lambda_{max}\). The
GLOSH index for a point \(p\) is then:
\[\mbox{GLOSH}_p = \frac{\lambda_{max} - \lambda_p}{\lambda_{max}},\]
or, equivalently, \(1 - \lambda_p / \lambda_{max}\), the complement of the cluster membership for
all but *Noise* points. For the latter, since the cluster membership was set to zero, the
actual \(\lambda_{max}\) needs to be used. Given the inverse relationship between density and
distance threshold, an alternative formulation of the outlier index is as:
\[\mbox{GLOSH}_p = 1 - \frac{d_{min}}{d_{p,core}}.\]

### Implementation

HDBSCAN is invoked from the Menu or the cluster toolbar icon as the second item in
the density clusters subgroup, **Clusters > HDBSCAN**, as in Figure 20.

The overall setup is very similar to that for DBSCAN, except that there is no threshold distance. The two main parameters
are **Min Cluster Size** and **Min Points**. These are typically set to the same value, but there may be
instances where a larger value for **Min Cluster Size** is desired. The **Min Points** parameter drives the
computation of the core distance for each point, which forms the basis for the construction of the minimum
spanning tree. As before, we make sure that the **Transformation** is set to **Raw**, as in Figure 21.

The first item that appears after selecting **Run** is the dendrogram representation of the minimum spanning tree,
shown in Figure 22.
This is mostly for the sake of completeness, and to allow for comparisons with HDBSCAN*. In and of itself, the raw dendrogram is
not of that much interest. One interesting feature of the dendrogram is that it supports linking
and brushing, which makes it possible to explore what branches in the tree subsets of points or clusters belong to (selected in one of the maps). The reverse is supported as well, so that branches in the tree can be selected and their counterparts
identified in other maps and graphs.

In addition to the dendrogram, a cluster map is generated, with the cluster categories saved to the data table. In our example, as seen in Figure 23, there are five clusters, ranging in size from 181 to 13 observations. There are 221 noise points. The overall pattern is quite distinct from the results of DBSCAN and DBSCAN*, showing more clustering in the southern part of the city. For example, compare to the cluster map for DBSCAN with d = 3000 and MinPts = 10, shown in Figure 11, where the six clusters are all located in the nothern part of the city.

As for the other clustering methods, the usual measures of fit are contained in the **Summary** panel.

#### Exploring the condensed tree

An important and distinct feature of the HDBSCAN method is the visualization of cluster persistence (or stability)
in the *condensed tree* or condensed dendrogram. This is invoked by means of the middle button in the results panel.
The tree visualizes the relative excess of mass that results in the identification of clusters for different values
of \(\lambda\). Starting at the root (top), with \(\lambda\) = 0, the left hand axis shows how increasing values of
\(\lambda\) (going down) results in branching of the tree. Note how the shaded areas are not rectangles, but they
gradually taper off, as single points are shedded from the main cluster.

The optimal clusters are identified by an oval with the same color as the cluster label in the cluster map. The shading of the tree branches corresponds to the number of points contained in the branch, with lighter suggesting more points.

In our application, the corresponding tree is shown in Figure 24. We see how some clusters correspond with leaf nodes, whereas others occur at higher stages in the tree. For example, cluster 1, the left-most cluster in the tree takes precedence over its descendants. In contrast, the second cluster from left (cluster 5) is the last node of the branch, which exists until all its elements become singletons (or, are smaller than the minimum cluster size).

The condensed tree supports linking and brushing as well as a limited form of zoom operations. The latter may be necessary for larger data sets, where the detail is difficult to distinguish in the overall tree. Zooming is implemented as a specialized mouse scrolling operation, slightly different in each system. For example, on a MacBook track pad, this works by moving two fingers up or down. There is no panning, but the focus of the zoom can be moved to specific subsets of the graph.

Figure 25 illustrates this feature. All points belonging to cluster 1 are selected
in the cluster map, and the corresponding observations are highlighted in red in the zoomed in condensed tree.
While some observations are shedded in the core part of the cluster (the area surrounded by the oval), we
see how many only disappear from the cluster at lower nodes in the tree. However, the relative excess of mass
of those lower branches was dominated by the one of the parent node, hence its selection as the *cluster*.

#### Cluster membership

The **Save** button provides the option to add the core distance, cluster membership probability and outlier
index (GLOSH) to the data table. The interface, shown in Figure 26, suggests default names
for the variables, but in most instances one will want to customize those.

The membership probability provides some useful insight into the strength of the clusters. For example,
in Figure 27, the 23 observations that belong to cluster 4 have been selected in
the data table (and moved to the top). The column labeled **HDB_PVAL** lists the probabilities.
The column labeled **HDB_CORE** shows the core distance for each point. Clearly, points are grouped in
the same cluster with a range of different core distances, corresponding to different values of \(\lambda\).

Ten of the 23 points have a probability of 1.0, which means that they meet the strictest \(\lambda\) criterion for the cluster. The other values range from 0.961 to 0.786, suggesting a quite strong degree of overall clustering.

The probabilities can be visualized in any map in order to further visualize the strength of cluster membership for a particular cluster. For example, in Figure 28, the points in cluster 4 are selected in the cluster map on the left, and their probability values are visualized in a natural breaks map on the right. This clearly illustrates how points further from the core of the cluster have lower probabilities.

#### Outliers

A final characteristic of the HDBSCAN output is the GLOSH outlier index, contained in the data table as
the column **HDB_OUT**. In Figure 29, this column has been sorted in descending order,
showing the points with **id** 185 and 630 with an outlier index of respectively 0.661 and 0.614. These observations
correspond to two points in the very southern part of the city and they appear quite separated from the
nearest cluster (cluster 3, the southernmost cluster). The two points are surrounded by other *Noise* points and
are far removed from cluster 3.

As any other variable in the data table, the outlier index can be mapped to provide further visual interpretation of these data patterns.

## Appendix

#### DBSCAN worked example

In order to illustrate the logic behind the DBSCAN algorithm, we consider a toy example of 9 points, the coordinates of which are listed in Figure 30.

We can use `GeoDa`

to create this as a point layer and use the distance weights functionality to
obtain the connectivity structure that ensures that each point has at least one neighbor. From the
weights characteristics, we find that for a distance of 29.1548, the number of neighbors ranges from
1 to 5. The matching connectivity graph is shown in Figure 31.

To illustrate the concept of *Noise* points (i.e., unconnected points for a given critical distance), we
lower the distance cut-off to 20. Now, we have one isolate (point 3) and the maximum number of neighbors
drops to 3. The corresponding connectivity graph is shown in Figure 32 with the
matching weights in Figure 33.

With *MinPts* as 4 (i.e., 3 neighbors for a core point), we can now move through each point, one at a time. Starting with point 1, we find that it only has one neighbor
(point 2) and thus does not meet the *MinPts* criterion. Therefore, we label it as *Noise* for now. Next,
we move to point 2. It has 3 neighbors, meaning that it meets the *Core* criterion and it is labeled cluster 1.
All its neighbors, i.e., 1, 4 and 5 are also labeled cluster 1 and are no longer considered. Note that this
changes the status of point 1 from *Noise* to *Border*. We next move
to point 3, which has no neighbors and is therefore labeled *Noise*.

We move to point 6, which is a neighbor of point 4, which belongs to cluster 1. Since 6 is therefore *density connected*,
it is added to cluster 1 (as a *Border* point). The same holds for point 7, which is similarly added to cluster 1.

Point 8 has two neighbors, which is insufficient to reach the *MinPts* criterion. Therefore, it is labeled *Noise*, since
point 7 is not a *Core* point. Finally, point 9 has only point 8 as neighbor and is therefore also labeled *Noise*.

In the end, we have one cluster consisting of 6 points, shown as the dark blue points in Figure 34.
The remaining points are *Noise*, labeled as light blue in the Figure. They are not part of any cluster.

We illustrate a more interesting outcome by lowering *Eps* to 15 with *MinPts* at 2. As a result, the points are
less connected, shown in the connectivity graph in Figure 35.

Using the same logic as before, we find points 1, 2, and 3 to be unconnected. Point 4 is connected to 5 and 6, which
meets the *MinPts* criterion and results in all three points being labeled cluster 1. Point 7 is again *Noise*,
but points 8 and 9 are connected to each other, yielding cluster 2. The outcome is summarized in Figure 36,
which shows four *Noise* points, a cluster consisting of 4, 5, and 6, and a cluster consisting of points 8 and 9.

This example illustrates the sensitivity of the results to the parameters *Eps* and *MinPts*.

#### HDBSCAN worked example

We apply the HDBSCAN algorithm to the same nine points from Figure 30. We set
**Min Cluster Size** and **Min Points** to 2. As a result, the *mutual reachability distance* is the
same as the original distance, given in Figure 37. Since the core distance for each
point is its nearest neighbor distance (smallest distance to have one neighbor), the maximum of the
two core distances and the actual distance will always be the actual distance. While this avoids one
of the innovations of the algorithm, it keeps our illustration simple.

The distance matrix can be used to find the *core distance* for each point as its nearest neighbor distance
(the smallest element in a row of the distance matrix), as well as the associated \(\lambda_p\) value. The latter is
simply the inverse of the core distance. These results are shown in Figure 38.

The distance matrix also forms the basis for a minimum spanning tree (MST), which is essential for the *hierarchical* nature
of the algorithm. The MST is shown in Figure 39, with both the node identifiers and the edge weights listed (the
distance between the two nodes the edge connects).

One interpretation of the HDBSCAN algorithm is to view it as a succession of *cuts* in the MST. These cuts
start with the highest edge weight and move through all the edges in decreasing order of the weight.
The first cut is illustrated in Figure 40, where 3 is separated from 5 as a result of
its edge weight of 29.15.

The next cut is a bit more problematic in our example, since both 1-2, 5-7, and 7-8 are separated by the same distance of 18.03. This is an artifact of our simple example, and is much less likely to occur in practice. We proceed with a cut 7 and 8, as shown in Figure 41. A different cut will yield a different solution for this example. We ignore that aspect in our illustration.

The process of selecting cuts continues until all the points are singletons and constitute their own cluster. Formally, this is equivalent to carrying out single linkage hierarchical clustering based on the mutual reachability distance matrix.

An alternative visualization of the pruning process is in the form of a tree, shown in Figure 42.
In the *root* node, all the points form a single cluster. The corresponding \(\lambda\) level is set to zero. As
\(\lambda\) increases, or, equivalently, the *core distance* decreases, points fall out of the cluster, or the cluster
splits into two subclusters. This process continues until each point is a *leaf* in the tree (i.e., it is by itself).

We can think of this process using our island analogy. At first all points constitute a large island. As the
ocean level rises, i.e., \(\lambda\) increases, points on the island become submerged, dropping out of the cluster.
Alternatively, two *hills* on the island connected by a valley become separate as the ocean submerges the valley.

This process is formally represented in the tree in Figure 42. As \(\lambda\) reaches 0.034 (for a
distance of 29.15), point 3 gets removed and becomes a leaf (in our island analogy, it becomes submerged). In the `GeoDa`

implementation of HDBSCAN, singletons
are not considered to be valid clusters, so we move on to node C1, which becomes the new root, with \(\lambda\) reset
to zero.

The next meaningful cut is for \(\lambda = 0.055\), or a distance of 18.03. As we see from Figure 41,
this creates two *clusters*, one consisting of the six points 1-7, but without 3, say C2, and the other
consisting of 8 and 9, C3. The process continues for increasing values of \(\lambda\), although in each case
the split involves a singleton and no further valid clusters are obtained. We consider this more closely when
we examine the *stability* of the clusters.

First, note how the \(\lambda_p\) values in Figure 38 correspond to the \(\lambda\) value where
the corresponding point *falls out* of the clusters in the tree in Figure 42.
For example, for point 7, this happens for \(\lambda\) = 0.055, or a core distance of 18.03. The stability or
persistence of each cluster is defined as the sum over all the points of the difference between \(\lambda_p\) and
\(\lambda_{min}\), where \(\lambda_{min}\) is the \(\lambda\) level where the cluster forms.

To make this concrete, consider cluster C3, which is formed (i.e., splits off from C1) for \(\lambda\) = 0.055463.
Therefore, the value of \(\lambda_{min}\) for C3 is 0.055463. The value of \(\lambda_p\) for points 8 and 9 is
0.066667, corresponding with a distance of 15.0 at which they become singletons (or leaves in the tree).
The contribution of each point to the persistence of the cluster C3 is thus 0.066667 - 0.055463 = 0.011204,
as shown in Figure 43. The total value of the persistence of cluster C3 is therefore
0.022408. Similar calculations yield the persistence of cluster C2 as 0.041400. For the sake of completeness,
we also include the results for cluster C1, although those are ignored by the algorithm, since C1 is reset
as *root* of the tree.

The *condensed* tree only includes C1 (as root), C2 and C3. Since the persistence of a singleton is zero,
and C4 includes one less point than C2, but with the same value for \(\lambda_{min}\), the sum of the
persistence of C4 and 7 is less than the value for C2, hence the condensed tree stops there.
This allows for two different values of \(\lambda\) to play a role in the clustering mechanism, rather
than the single value that needs to be set a priori in DBSCAN.

To compare, consider the clusters that would result from setting \(\lambda\) = 0.055, or a core distance of 18.03, using the logic of DBSCAN* (no border points). From Figure 39, we see that this would result in a grouping of 8-9 and 2-4-5-6, with 1, 3 and 7 as noise points. In our tree, that would be represented by C3 and C5. The main contribution of HDBScan is that the optimal level of \(\lambda\) is determined by the algorithm itself, and is adjusted in function of the structure of the data, rather than having a one size fits all that needs to be specified beforehand.

The corresponding cluster map for our toy example is shown in Figure 44.

A final set of outputs from the algorithm pertains to soft clustering and the identification of outliers.
`GeoDa`

saves these values for each point, together with the associated core distance, as shown in
Figure 45. The core distance is the same as in Figure 38. The
*probability* or strength for each point to be part of a cluster is computed as the ratio
of \(\lambda_p\) to \(\lambda_{max}\), with \(\lambda_{max}\) as the largest value of the individual \(\lambda_p\).
Equivalently, this is the ratio of the smallest core distance over the core distance of point p.
This is listed as **HDB_PVAL** in the table. For example, since \(\lambda_{max}\) is 0.066667 in our example,
points 4-5-6 and 8-9 obtain a probability of 1.0, or certainty to be part of the cluster. In contrast,
points 1, 2 and 7 have a lower value. For example, for point 7 the ratio is 15.0/18.03 = 0.832. For noise
points, the probability is set to zero.

The last piece of information is the probability that a point is an outlier. This is computed as \(1 - \lambda_p / \lambda_{max}\). For all but the noise points, this is the complement of the cluster probability. For example, for point 1, this is 1.0 - 0.0555/0.0667 = 1.0 - 0.832 = 0.168. However, for point 3, the noise point, the computation is slightly different. At first sight, one would expect the outlier probability to be 1.0. However, the \(\lambda_{max}\) for this point is the \(\lambda\) value for C1, or 0.055. Consequently the outlier probability is 1.0 - 0.034/0.055 = 0.382.

## References

Campello, Ricardo J. G. B., Davoud Moulavi, and Jörg Sander. 2013. “Density-Based Clustering Based on Hierarchical Density Estimates.” In *Advances in Knowledge Discovery and Data Mining. PAKDD 2013. Lecture Notes in Computer Science, Vol. 7819*, edited by Jian Pei, Vincent S. Tseng, Longbing Cao, Hiroshi Motoda, and Guandong Xu, 160–72.

Campello, Ricardo J. G. B., Davoud Moulavi, Arthur Zimek, and Jörg Sandler. 2015. “Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection.” *ACM Transactions on Knowledge Discovery from Data* 10,1. https://doi.org/10.1145/2733381.

Ester, Martin, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.” In *KDD-96 Proceedings*, 226–31.

Gan, Junhao, and Yufei Tao. 2017. “On the Hardness and Approximation of Euclidean DBSCAN.” *ACM Transactions on Database Systems (TODS)* 42 (3): 14.

Hartigan, John A. 1975. *Clustering Algorithms*. New York, NY: John Wiley.

Kulldorff, Martin. 1997. “A Spatial Scan Statistic.” *Communications in Statistics – Theory and Methods* 26: 1481–96.

McInnes, Leland, and John Healy. 2017. “Accelerated Hierarchical Density Clustering.” In *2017 IEEE International Conference on Data Mining Workshops (ICDMW)*. New Orleans, LA. https://doi.org/10.1109/ICDMW.2017.12.

McInnes, Leland, John Healy, and Steve Astels. 2017. “hdbscan: Hierarchical Density Based Clustering.” *The Journal of Open Source Software* 2 (11). https://doi.org/10.21105/joss.00205.

Müller, D. W., and G. Sawitzki. 1991. “Excess Mass Estimates and Tests for Multimodality.” *Journal of the American Statistical Association* 86: 738–46.

Openshaw, Stan, Martin E. Charlton, C. Wymer, and A. Craft. 1987. “A Mark I Geographical Analysis Machine for the Automated Analysis of Point Data Sets.” *International Journal of Geographical Information Systems* 1: 359–77.

Sander, Jörg, Martin Ester, Hans-Peter Kriegel, and Xiaowei Xu. 1998. “Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications.” *Data Mining and Knowledge Discovery* 2: 169–94.

Schubert, Erich, Jörg Sander, Martin Ester, Peter Kriegel, and Xiaowei Xu. 2017. “DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN.” *ACM Transactions on Database Systems (TODS)* 42 (3): 19.

Stuetzle, Werner, and Rebecca Nugent. 2010. “A Generalized Single Linkage Method for Estimating the Cluster Tree of a Density.” *Journal of Computational and Graphical Statistics* 19: 397–418.

Wishart, David. 1969. “Mode Analysis: A Generalization of Nearest Neighbor Which Reduces Chaining Effects.” In *Numerical Taxonomy*, edited by A. J. Cole, 282–311. New York, NY: Academic Press.

University of Chicago, Center for Spatial Data Science – anselin@uchicago.edu↩︎

To obtain the figure as shown, the size of the points was changed to 1pt, their fill and outline color set to black. Similarly, the community area outline color was also changed to black.↩︎

The figure is loosely based on Figure 1 in Schubert et al. (2017).↩︎

This definition of

*MinPts*is from the original paper. In some software implementations, the minimum points pertain to the number of nearest neighbors, i.e.,*MinPts*- 1.`GeoDa`

follows the definition from the original papers.↩︎`GeoDa`

currently does not support this option.↩︎The DBSCAN* algorithm avoids such issues by only considering core points to construct the clusters.↩︎

This method is variously referred to as HDBSCAN or HDBSCAN*. For simplicity, we will use the former.↩︎

These approaches are sensitive to the influence of the outliers on the estimates of central tendency and spread. This works in two ways. On the one hand, outliers may influence the parameter estimates such that their presence could be

*masked*, e.g., when the estimated variance is larger than the true variance. The reverse effect is called*swamping*, where observations that are legitimately part of the distribution are made to look like outliers.↩︎