Density-based clustering in spatial data (2)

Density-based clustering in spatial data (2)

This is the second of a series of posts on cluster-algorithms and ideas in data analysis (and related fields).

Ordering points to identify the clustering structure (OPTICS) is a data clustering algorithm presented by Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel and Jörg Sander in 1999 [1]. It builds on the ideas of DBSCAN, which I described in a previous post. But let’s cut the intro and dive right into it.

Let $(X,d)$ be a finite discrete metric space, i.e.~let $X$ be a data set on which we have a notion of similarity expressed in terms of a suitable distance function $d$. Given a positive integer $m \in \mathbb{N}$ we can define a notion of density on points of our data set, which in turn can be used to define a perturbed metric. Given a starting point $x_0 \in X$ the OPTICS-algorithm iteratively defines a linear ordering on $X$ in terms of a filtration $X_0 \subset X_1 \subset \ldots \subset X_n$ of $X$, where $X_{n+1}$ is obtained from $X_n$ by appending the closest element (with respect to the perturbed metric) in its complement.

The original OPTICS-algorithm also takes an additional parameter $\varepsilon > 0$. However this is only needed to reduce the time complexity and is ignored in our discussion for now — to be more precise, we implicitly set $\varepsilon = \infty$.


Let $(X,d)$ be a finite discrete metric space and $m \in \mathbb{N}$ a positive integer. We define the (co-)density $\delta_m(x)$ of a point $x$ by \[ \delta_m(x) := d(x, \mathfrak{nn}_m(x) ), \] where $\mathfrak{nn}_m(x) $ is an $m$-nearest-neighbor of $x$. Loosely speaking: the lower the value $\delta_m(x)$ the closer the neighbors of $x$ are distributed around $x$. Note that with $\varepsilon = \delta_m(x)$ the point $x$ is a core point in the sense of [2] — in a previous post about DBSCAN we called such a point $\varepsilon$-dense. That is why in the literature $\delta_m(x)$ is referred to as core-distance. I however like to think of it, and hence refer to it, as a notion of co-density.

Given the co-density $\delta_m(.)$ we can define the reachability distance $r_x(y)$ of $y$ from $x$ by \[ r_x(y) := \text{max} \big\{ \delta_m(x), d(x,y) \big\}. \] Note that $r_x(y)$ is not symmetric, since the density of $x$ and $y$ may differ.

Sketch of the algorithm

Choose a starting point $x_0 \in X$. Then we can iteratively define a filtration $X_0 \subset \ldots \subset X_n$ of the data set $X$ by \[ X_0 := \{ x_0 \} \ \ \text{and} \ \ X_{k+1} := X_k \cup \{ x_{k+1} \}, \] where $x_{k+1}$ minimizes $r_{X_k}(.)$ over $X \setminus X_k$. In parallel we define a sequence $(r_n)$ of distances by \[ r_0 = 0 \ \ \text{and} \ \ r_{k+1} := r_{X_k}(x_{k+1}). \] Note that a small distance $r_k$ may be understood as $x_k$ being close to a rather dense region. Therefore the filtration tries to stay in dense regions for as long as possible, before it passes a less dense region. The cluster structure can now be extracted by analyzing the reachability-plot, that is a $2$-dimensional plot, with the ordered $x_k$ on the $x$-axis and the associated distances $r_k$ on the $y$-axis. By the above considerations it should be clear that clusters show up as valleys in the reachability plot. The deeper the valley, the denser the cluster.


Consider the data points $X\subset \mathbb{R}^2$ in the euclidian plane given in the following figure:

Below you see the reachability plot corresponding the OPTICS-algorithm applied to the above data set with $m=4$. The starting point is chosen among the group of points on the left.


[1] M. Ankerst, M.M. Breunig, H.-P. Kriegel, OPTICS: Ordering Points To Identify the Clustering Structure, ACM SIGMOD international conference on Management of data (1999), pp 49–60.

[2] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (1996), pp. 226–231.