# Density-based clustering in spatial data (2)

*This is the second of a series of posts on cluster-algorithms and ideas in data analysis (and related fields).*

*Ordering points to identify the clustering structure* (OPTICS) is a data clustering algorithm presented by Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel and Jörg Sander in 1999 [1]. It builds on the ideas of DBSCAN, which I described in a previous post. But let’s cut the intro and dive right into it.

Let $(X,d)$ be a finite discrete metric space, i.e.~let $X$ be a data set on which we have a notion of similarity expressed in terms of a suitable distance function $d$. Given a positive integer $m \in \mathbb{N}$ we can define a notion of *density* on points of our data set, which in turn can be used to define a *perturbed metric*. Given a starting point $x_0 \in X$ the OPTICS-algorithm iteratively defines a linear ordering on $X$ in terms of a filtration $X_0 \subset X_1 \subset \ldots \subset X_n$ of $X$, where $X_{n+1}$ is obtained from $X_n$ by appending the closest element (with respect to the perturbed metric) in its complement.

The original OPTICS-algorithm also takes an additional parameter $\varepsilon > 0$. However this is only needed to reduce the time complexity and is ignored in our discussion for now — to be more precise, we implicitly set $\varepsilon = \infty$.

### Definitions

Let $(X,d)$ be a finite discrete metric space and $m \in \mathbb{N}$ a positive integer. We define the **(co-)density $\delta_m(x)$ of a point $x$** by \[ \delta_m(x) := d(x, \mathfrak{nn}_m(x) ), \] where $\mathfrak{nn}_m(x) $ is an $m$-nearest-neighbor of $x$. Loosely speaking: the lower the value $\delta_m(x)$ the closer the neighbors of $x$ are distributed around $x$. Note that with $\varepsilon = \delta_m(x)$ the point $x$ is a *core point* in the sense of [2] — in a previous post about DBSCAN we called such a point *$\varepsilon$-dense*. That is why in the literature $\delta_m(x)$ is referred to as **core-distance**. I however like to think of it, and hence refer to it, as a notion of co-density.

Given the co-density $\delta_m(.)$ we can define the **reachability distance $r_x(y)$ of $y$ from $x$** by \[ r_x(y) := \text{max} \big\{ \delta_m(x), d(x,y) \big\}. \] Note that $r_x(y)$ is not symmetric, since the density of $x$ and $y$ may differ.

### Sketch of the algorithm

Choose a starting point $x_0 \in X$. Then we can iteratively define a filtration $X_0 \subset \ldots \subset X_n$ of the data set $X$ by \[ X_0 := \{ x_0 \} \ \ \text{and} \ \ X_{k+1} := X_k \cup \{ x_{k+1} \}, \] where $x_{k+1}$ minimizes $r_{X_k}(.)$ over $X \setminus X_k$. In parallel we define a sequence $(r_n)$ of distances by \[ r_0 = 0 \ \ \text{and} \ \ r_{k+1} := r_{X_k}(x_{k+1}). \] Note that a small distance $r_k$ may be understood as $x_k$ being close to a rather dense region. Therefore the filtration tries to stay in dense regions for as long as possible, before it passes a less dense region. The cluster structure can now be extracted by analyzing the *reachability-plot*, that is a $2$-dimensional plot, with the ordered $x_k$ on the $x$-axis and the associated distances $r_k$ on the $y$-axis. By the above considerations it should be clear that clusters show up as valleys in the reachability plot. The deeper the valley, the denser the cluster.

### Example

Consider the data points $X\subset \mathbb{R}^2$ in the euclidian plane given in the following figure:

Below you see the reachability plot corresponding the OPTICS-algorithm applied to the above data set with $m=4$. The starting point is chosen among the group of points on the left.

### References

**[1]** M. Ankerst, M.M. Breunig, H.-P. Kriegel, *OPTICS: Ordering Points To Identify the Clustering Structure*, ACM SIGMOD international conference on Management of data (1999), pp 49–60.

**[2]** M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, *A density-based algorithm for discovering clusters in large spatial databases with noise*, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (1996), pp. 226–231.

Dear Mirko,

this is very interesting again.

I have one question or thought about the reachability and the filtration. If I were to implement that algorithm, my intuition would tell me “let’s try to define the core-distance/co-density with respect to X_k instead of X”. Because this way the algorithm would prefer to extend the filtration on those points which are dense in X_k (instead of dense in X). I think it would be interesting to see what difference that makes.

To me that would have been a more natural choice at first, but the longer I think about it I am already reconsidering. My variation of the algorithm would probably not have the ability (or ‘tendency’) to drift to more dense regions if it started in a sparse region. That would seem to be the main difference. Ok, not a real question after all..

But very interesting post again, I enjoy reading your blog.