Here’s the link to the interview: Q&A with Mirko Klukas, Numenta’s First Visiting Scholar

**Theorem** (see [1, Theorem 1], Section 4 on page 8). There exists a two-layer neural network with ReLU activations and $2n+d$ weights that can represent any function on a sample of size $n$ in $d$ dimensions.

That means if we choose $n$ mutually distinct samples $z_1, \ldots, z_n \in \mathbb{R}^d$ and real-valued labels $y_1, \ldots, y_n \in \mathbb{R}$, there is a $2$-layer neural network $C$ depending on only $2n+d$ weight parameters such that for $i=1, \ldots , n$ we have $$ C(z_i) = y_i. $$ Moreover the network is of the form $\mathbb{R}^d \stackrel{F}{\to} \mathbb{R}^n \stackrel{w}{\to} \mathbb{R} $ with activation function given by $ \alpha(x) = \max \{ x ,0 \}$.

**Remark and Question.** What is somewhat interesting is that the $d$ parameters defining the projection can be chosen generically, i.e. a random choice will almost surely give the desired result. This is not immediately clear from the presentation in [1]. If someone has a comment on this, I'd be happy to hear about it!

First we are going to reduce the problem to the case where the samples are chosen from $\mathbb{R}$. Choose $a \in \mathbb{R}^d$ such that $x_i := a \cdot z_i$ are mutually distinct, i.e. we have
\[
x_i \neq x_j \text{, for $i \neq j$.}
\]
Relabel the data points and add a value $x_0$ such that we have
$$
x_0 < x_1 < \ldots < x_n.
$$
(Note that a generic $a$ will do the job.)
We are now in the one-dimensional case and proceed by defining a family $\{f_i\}$ of affine-linear functions depending on only $n$ parameters. Combined with the previous projection and the rectifier, these will form the first layer of the desired network. We define $f_i \colon\thinspace \mathbb{R} \to \mathbb{R}$ by
$$
f_i(x) := \tfrac{1}{x_i - x_{i-1}} (x - x_{i-1}),
$$
and note that we have $f_i(x_i) = 1$, and $f_i(x) \leq 0$ for $x \leq x_{i-1}$. We are now ready to define the final layer of the network. Set $w_1 := y_1$ and define iteratively
$$
w_j := y_j - \sum_{i=1}^{j-1} f_i(x_j)\cdot w_i.
$$
Since $f_i(a \cdot z_j) \leq 0$ for $i > j$ (so the rectifier annihilates these terms) and $f_j(a \cdot z_j) = 1$, one easily computes that
$$
y_j = \sum_{i=1}^{n} \max\big\{ f_i(a \cdot z_j) , 0 \big\} \cdot w_i.
$$
With $F(z) = ( f_1(a \cdot z), \ldots, f_n(a \cdot z) )$ this can be expressed as $\max\{ F(z_j),0 \}^T \cdot w = y_j$. Note that $F$ is an affine-linear function $\mathbb{R}^d \to \mathbb{R}^n$ that depends on $d + n$ weights, namely those coming from $a$ and $x_1,\ldots,x_n$. With the additional $n$ weights defining $w$ we conclude the proof. **QED**
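The construction in the proof is concrete enough to implement directly. Below is a minimal NumPy sketch (the function name and interface are mine, not from [1]): it draws a generic projection $a$, sorts the projected samples, builds the affine functions $f_i$, and solves for the output weights $w$ iteratively as above.

```python
import numpy as np

def fit_relu_net(Z, y, seed=0):
    """Construct the network from the proof: a generic projection a,
    thresholds x_0 < x_1 < ... < x_n, and iteratively solved weights w."""
    Z = np.asarray(Z, float)
    y = np.asarray(y, float)
    n, d = Z.shape
    a = np.random.default_rng(seed).standard_normal(d)  # generic projection
    order = np.argsort(Z @ a)                           # relabel so x_1 < ... < x_n
    Z, y = Z[order], y[order]
    x = Z @ a
    xs = np.concatenate([[x[0] - 1.0], x])              # prepend an auxiliary x_0

    def F(z):
        # first layer (before the rectifier): F_i(z) = f_i(a . z)
        return ((z @ a) - xs[:-1]) / (xs[1:] - xs[:-1])

    w = np.zeros(n)
    for j in range(n):
        # w_j = y_j - sum_{i<j} f_i(x_j) w_i ; note f_j(x_j) = 1
        h = np.maximum(F(Z[j]), 0.0)
        w[j] = y[j] - h[:j] @ w[:j]

    return lambda z: np.maximum(F(np.asarray(z, float)), 0.0) @ w
```

The returned closure is the network $C$; by construction it interpolates the given sample exactly.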

Let $X$ be a finite alphabet and let $X^\ast$ denote the set of all possible strings over $X$. Let $\epsilon$ denote the empty string. The aim is to encode (compress) a sequence of inputs $d = x_1\ldots x_l \in X^\ast$. The approach of Lempel–Ziv splits into two phases: parsing and encoding. In the parsing phase the algorithm splits the sequence $d$ into adjacent non-empty words $ w_1,\ldots,w_p \in X^\ast$, i.e.

\[

d = w_1 \ldots w_p,

\] such that the following conditions hold

- Except for the last word the $w_i$ are mutually distinct, i.e. for $i\neq j$ with $i,j < p$ we have $w_i \neq w_j.$
- Each word is either just one element, or obtained by concatenation of one of its predecessors in the sequence and an element in $X$, i.e. for each $j$ there is an $x \in X$ such that either $w_j = x$ or there is an index $i < j$ with $ w_j = w_i x$.

Such a decomposition can be obtained by forming a dictionary whose keys (ordered by insertion) define the desired decomposition as follows:

*At each step take the longest prefix of the sequence that is not yet contained (as a key) in the dictionary and add it to the dictionary. Continue with the remaining sequence (i.e. after removing the prefix) until the whole sequence is consumed. The last prefix of the sequence, the one containing the last element, will either be a new word or will already be in the dictionary (cf. the first condition in the above list).*

With $w_0 := \epsilon$, the conditions in the above list imply that each word $w$ in $\{w_1,\ldots,w_p\}$ can be uniquely expressed in the form $w = w_ix$, hence we may define

\[

\texttt{tail-index}\thinspace w := i \ \ \ \text{ and } \ \ \ \texttt{head} \thinspace w := x.

\] The compressed version of the sequence $d$ is a sequence in $\{0,\ldots,p\} \times X$ and is defined by

\[

\texttt{LZ78}(d) := \big[ \big( \texttt{tail-index} \thinspace w_1 ,\texttt{head} \thinspace w_1 \big),

\ldots ,

\big( \texttt{tail-index} \thinspace w_p ,\texttt{head} \thinspace w_p \big) \big].

\] The sequence can be decoded by forming a trie $T$ whose nodes are indexed by $i=1,\ldots,p$ and which store their associated elements $\texttt{head} \thinspace w_i$. We further add a node indexed by $0$ storing the empty string $\epsilon$; this will be the root of the tree. The parent of a node is given by its associated tail-index, i.e. two nodes $i<j$ are connected by an edge if $\texttt{tail-index} \thinspace w_j = i$. Each node has up to $| X |$ children, and no two children store the same element, with potentially one exception: the node indexed by $p$ (recall that the last word is the only one that might have already appeared in the list). In case of such a duplicate, we identify the node indexed by $p$ with the sibling that stores the same element. The $i$th word $w_i$ can now easily be recovered from $T$ by traversing the path from the root (node $0$) down to node $i$ and concatenating the elements stored along the way.

Suppose we want to compress $d = $ "abracadabrarabarbar". This sequence decomposes as:

a | b | r | ac | ad | ab | ra | rab | ar | ba | r.

Number the words from 1 to 11, going from left to right. In each word, the tail is the part that can already be found earlier in the dictionary. Replace the tail by the position it was assigned earlier, where zero corresponds to the empty string; e.g. the tail 'ra' of the 8th word 'rab' can be found at position 7, and hence 'rab' translates to '7b'. The resulting sequence reads as

0a | 0b | 0r | 1c | 1d | 1b | 3a | 7b | 1r | 2a | 3.
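The parsing and decoding described above fit in a few lines of Python. The following sketch (function names are mine) emits `(tail-index, head)` pairs and, for a final word that is already in the dictionary, its index alone, matching the trailing '3' in the example.

```python
def lz78_encode(d):
    """LZ78 parsing: a list of (tail-index, head) pairs over the input string.
    A final word that is already a dictionary key is emitted as its index alone."""
    dictionary = {}          # word -> index (1-based); index 0 is the empty string
    result, w = [], ""
    for x in d:
        if w + x in dictionary:
            w += x           # extend the current prefix
        else:
            result.append((dictionary.get(w, 0), x))
            dictionary[w + x] = len(dictionary) + 1
            w = ""
    if w:                    # leftover word, already seen earlier
        result.append((dictionary[w],))
    return result

def lz78_decode(pairs):
    """Rebuild the words from their tail-indices and heads, then concatenate."""
    words = {0: ""}
    out = []
    for p in pairs:
        w = words[p[0]] + (p[1] if len(p) == 2 else "")
        words[len(words)] = w
        out.append(w)
    return "".join(out)
```

Running the encoder on the example string reproduces exactly the sequence 0a | 0b | 0r | 1c | 1d | 1b | 3a | 7b | 1r | 2a | 3 above.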

**[1]** Abraham Lempel and Jacob Ziv, *Compression of individual sequences via variable-rate coding*, IEEE Transactions on Information Theory (1978).

Quite a while ago I stumbled over the white paper [3] of a science startup called Numenta. At the core of the paper lies the model of a sequence memory that enables the prediction of elements in a time series based on its history of predecessors. An essential ingredient in the model are *sparse distributed representations*, a binary data structure motivated by recent findings in neuroscience, where it was observed that sparse activation patterns are used by the brain to process information (e.g. in early auditory and visual areas).

The goal of this post is (a) to briefly recall what *sparse representations* are, and (b) to explain how to encode real-world data as sparse representations. We roughly follow Numenta's approach and connect it to a construction well known in computational topology, the *witness complex*: a simplicial complex that can be understood as a simplicial approximation of the input space.

**Table of contents**

1. SDRs

1.1. The space of patterns

1.2. Sparse patterns and their natural metric

1.3. Sparseness after Olshausen and Field

2. Spatial pooling and weak witness complexes

2.1. Witnesses with respect to similarity

2.2. Restricting the sight of witnesses

2.3. Approximation by simplicial complexes

Let $X$ denote a discrete set of $n$ elements, where $n$ is a positive integer. Think of it as an (a priori unordered) collection of bits, $\{1,\ldots, n\}$ say. Denote by $\mathcal{P} =\mathcal{P}_n$ the power set of $X$, i.e.

\[

\mathcal{P} := \big\{ p : p \subset X \big\}.

\] We will refer to elements in $\mathcal{P}$ as **patterns**. Another intuitive example: think of $X$ as representing a family of pixels, and a pattern as the collection of black pixels of a black-and-white image. There are two natural operations on $\mathcal{P}$, namely the union of sets $\cup$ and the intersection of sets $\cap$. The latter in fact gives rise to a notion of **similarity** of two patterns $p$ and $p'$ in terms of their intersection or overlap. We define the **overlap count** of $p,p'$ as

\[

\omega(p,p') := |p\cap p'|.

\] Intuitively: the bigger the overlap, the more **similar** the two patterns are. This notion is somewhat vague, since it does not incorporate the individual sizes of the patterns, but for now we will not bother too much about that. Once we restrict our attention to *sparse patterns*, in the respective section below, this notion of similarity becomes more accurate.

Note that we can naturally identify the power set of $n$ elements $\mathcal{P}$ with $\{0,1\}^n$, the boolean algebra of $2^n$ elements, as follows: recall that $\mathcal{P}$ denotes the power set of a set of $n$ distinct elements, $\{ 1,\ldots,n \}$ say. Then one easily checks that the desired isomorphism from $\{0,1\}^n$ into $\mathcal{P}$ is given by

\[

(v_1,\ldots,v_n) \mapsto \{ i : v_i=1 \}.

\] With respect to the above identification the union of sets $\cup$ corresponds to the componentwise or-operation $\vee$ on $\{0,1\}^n$, and the intersection of sets $\cap$ corresponds to the componentwise and-operation $\wedge$ on $\{0,1\}^n$. Therefore we indeed have

\[

\big(\mathcal{P};\cup,\cap \big) \cong \big(\{0,1\}^n ; \vee, \wedge \big).

\] Note that we could understand $\{0,1\}^n$ as embedded into $n$-dimensional Euclidean space $\mathbb{R}^n$. Then the above notion of similarity corresponds to the Euclidean scalar product of two vectors. This viewpoint allows us to pull back any distance, or scalar product, defined on $\mathbb{R}^n$ to $\mathcal{P}$. We can pull back the metric induced by the $L^1$-norm on $\mathbb{R}^n$, for instance, which yields the *Hamming distance* on $\{0,1\}^n$ (interpreted as binary strings). We may come back to that later.
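As a quick illustration of these identifications, here is a small Python sketch (helper names are mine) checking that union and intersection of patterns correspond to componentwise or/and on bit vectors, and that the overlap count is the dot product of the corresponding vectors:

```python
def overlap(p, q):
    """Overlap count ω(p, q) = |p ∩ q| of two patterns (subsets of bits)."""
    return len(p & q)

def to_bits(p, n):
    """Identify a pattern p ⊂ {1, ..., n} with a vector in {0,1}^n."""
    return [1 if i in p else 0 for i in range(1, n + 1)]

p, q, n = {1, 3, 5, 8}, {3, 5, 9, 12}, 16
v, w = to_bits(p, n), to_bits(q, n)

# union/intersection correspond to componentwise or/and ...
assert to_bits(p | q, n) == [a | b for a, b in zip(v, w)]
assert to_bits(p & q, n) == [a & b for a, b in zip(v, w)]
# ... and the overlap count equals the Euclidean dot product of the bit vectors
assert overlap(p, q) == sum(a * b for a, b in zip(v, w))
```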

The central objects in [3] are *sparse representations*. For some fixed positive integer $k \ll n$ we call a pattern $p \in \mathcal{P}$ *sparse* if the number of elements in $p$ satisfies $|p| = k$. Finally we denote by $SDR = SDR(k,n)$ the set of sparse patterns in $\mathcal{P}$ and refer to its elements as **sparse distributed representations**:

\[

SDR :=

\big\{ p \in\mathcal{P} : |p| = k \big\}.

\]

The terminology anticipates (and only makes sense in combination with) a natural metric on $SDR$, defined by

\[

d_H(p,q) := k - |p \cap q|.

\] This is equivalent to the Hamming distance restricted to sparse patterns, up to a factor of $2$. It will also be convenient to consider the relaxed set of patterns with at most $k$ active bits,

\[

SDR_{\leq k} :=

\big\{ p \in\mathcal{P} : |p| \leq k \big\}.

\]

**REMARK. (Sparseness after Olshausen and Field)** In [4] *sparseness* turned up within a probabilistic framework, where the authors try to match the probability distribution over images observed in nature by a generative linear model. In contrast to our approach, however, the coefficients in [4] are allowed to take values in $\mathbb{R}$. The notion of sparseness then corresponds to assuming a certain prior probability distribution over the coefficients, whose density is shaped to be unimodal and peaked at zero with heavy tails, implying that coefficients are mostly inactive. There is, however, no restriction on the actual number of active units. When we want to emphasize this difference, we will refer to a sparse vector in the sense presented here as a *binary sparse vector*; it should always be clear from the context which definition is meant.

In the present section I introduce a geometric perspective on how to encode inputs sampled from a metric space as sparse representations. We follow an idea analogous to the approach in [3] and connect it to a construction well known in computational topology, the *witness complex* (see [1], cf. also [2]).

Let $(X,d)$ be a metric space, not necessarily discrete, e.g. $X = \mathbb{R}^m$ endowed with the Euclidean distance. Our goal is the construction of a map

\[

\Phi \colon\thinspace X \to SDR \subset \mathcal{P}_n.

\] We refer to such a map $\Phi$ as an (untrained) **spatial pooler** or (binary) **sparse encoder**. To construct such a map, fix a collection of $n$ points in $X$, the **landmarks**,

\[

L=\{ l_1,\ldots,l_n\} \subset X.

\] These landmarks can be understood as reference points in the input space. For a fixed positive integer $k \ll n$ we assign to each $x \in X$ the index set of its $k$ closest landmarks,

\[

\Lambda (x) := \Lambda^{(k)}_L(x) := \{ i_1, \ldots , i_k \},

\] such that $d(x, l_{i}) < d(x, L \setminus \{ l_{i_1},\ldots,l_{i_k} \} )$ for each $i \in \Lambda (x)$. Here $d(x,Y)$ denotes the minimal distance of a point $x$ to the elements of a finite set $Y$. Note that in general this assignment is not well-defined, since the set of $k$ closest landmarks may not be unique. In practice, however, this can always be resolved by adding a tiny random perturbation to each landmark, or by fixing a rule of precedence.
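A minimal sketch of the assignment $\Lambda$ in NumPy (the function name and the tie-breaking rule are my choices): it returns the index set of the $k$ closest landmarks, with ties resolved in favor of the lower index as a simple rule of precedence.

```python
import numpy as np

def sparse_encode(x, landmarks, k):
    """Λ(x) ∈ SDR(k, n): the indices of the k closest landmarks.
    Ties are broken by lower index, a simple rule of precedence."""
    x = np.asarray(x, float)
    dists = np.linalg.norm(np.asarray(landmarks, float) - x, axis=1)
    # a stable argsort resolves equal distances to the lower index
    return frozenset(np.argsort(dists, kind="stable")[:k].tolist())
```

Nearby inputs share close landmarks, so their encodings have large overlap, which is exactly the notion of similarity from the first section.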

**REMARK. (Landmarks in Numenta’s Spatial Pooling)** We can interpret the permanence values of the synaptic connections of a particular column in [3] as a landmark.

For each $i \in \Lambda (x)$ we call $x$ a **weak witness** (or, for simplicity, just a **witness**) for the landmark $l_i$.

For a lack of better imagination we refer to the collection of witnesses

\[

W(l) := \{ x : \text{$x$ is a witness for $l$} \}.

\] associated to a particular landmark $l$ as its **(depth-$k$) witness cell**. Note that for $k=1$ the witness cell of a landmark equals its associated Voronoi cell. More generally, for a collection of landmarks $l_{i_1},\ldots,l_{i_q}$ we define

\[

W( l_{i_1},\ldots, l_{i_q} ) := \bigcap_{j=1}^{q} W(l_{i_j}).

\] This obviously generalizes the above notion of a depth-$k$ witness cell, and we therefore keep the terminology.

Analogous to the simpler version, for $k=q$ this equals the associated higher-order Voronoi cell, i.e. the set of points whose $q$ closest landmarks are exactly $l_{i_1},\ldots,l_{i_q}$.

Obviously we can understand $\Lambda (x)$ as an element of $SDR(k,n)$ for each $x \in X$, and hence the assignment

\[

\Phi \colon\thinspace x \mapsto \Lambda (x)

\] defines the desired map. To *train* an encoder on a collection of inputs translates into the right choice of landmarks. I will address that in another post.

Suppose, instead of an honest metric we are given a notion of similarity on $X$ expressed in terms of a non-negative function

\[

\omega \colon\thinspace X\times X \to \mathbb{R}_{\geq 0}.

\] Consider, for instance, the $m$-dimensional cube $[0,1]^m$ endowed with the Euclidean dot product. Then the above construction of $\Lambda$ also goes through if we look for the $k$ *most similar* landmarks instead of the $k$ closest ones. To be more precise, we define $\Lambda (x) := \{ i_1, \ldots , i_k \}$ such that $\omega(x, l_{i}) > \omega(x, l_j)$ for each $i \in \Lambda (x)$ and $j \not\in \Lambda (x)$. (This is the approach taken in [3].)

Instead of assigning the $k$ closest (or most similar, respectively) landmarks to a point, we can further restrict the set of potential landmarks to those within a certain radius, $\theta > 0$ say. To be more precise, $x$ is a **witness with sight $\theta$ for $l$** iff $l$ is among the $k$ closest landmarks and $x$ is contained in the ball of radius $\theta$ centered at $l$, i.e. $d(x,l) < \theta$. The problem with this approach is that we have to ensure that we find $k$ landmarks to form a complete sparse representation, i.e. one that has exactly $k$ active bits. This can be achieved e.g. by adding random elements to the representation until completion, or by fixing a rule of precedence upon which we choose elements to complete the representation.

A way around this is to establish a softer version of sparse representations, e.g. to allow patterns with at most $k$ elements. (This is effectively what was going on in Numenta's open source implementation NuPIC up to version 0.3.5: if there are fewer than $k$ landmarks within sight of $x$, i.e. with distance $d(x,l) < \theta$, the incomplete representation is randomly extended to a complete one. Update: apparently the random choice is fixed once at the beginning.)

This then yields a variation of the above sparse encoder with values in $SDR_{\leq k}$ defined by

\[

\Lambda_\theta(x) := \big\{ i : d(x,l_i) < \theta \text{ and $l_i$ is among the $k$ closest landmarks} \big\}.

\]

Suppose we are given a collection of inputs $Y \subset X$. Then there is a simplicial complex $\mathcal{W}(Y,L)$, the **(weak) witness complex** (cf. [1] or [2]), whose vertex set is given by $L$, and where landmarks $l_{i_1},\ldots, l_{i_q}$ span a $(q-1)$-simplex iff there is a common witness for all of them, i.e. if $W(l_{i_1},\ldots, l_{i_q}) \neq \emptyset$. An obvious variation of this complex is obtained by restricting the set of witnesses to those with sight $\theta$. We denote this complex by $\mathcal{W}_\theta(Y,L)$; it obviously defines a subcomplex of the (weak) witness complex.
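For small data sets the weak witness complex can be enumerated by brute force. The sketch below (function name is mine; no sight restriction) records, for every input $y$, all subsets of its $k$ nearest landmarks as simplices, since $y$ is a common witness for each such subset:

```python
import math
from itertools import combinations

def weak_witness_complex(Y, L, k):
    """Simplices of the weak witness complex W(Y, L): a set of q landmark
    indices spans a (q-1)-simplex iff some input y has all of them among
    its k nearest landmarks. Y and L are sequences of coordinate tuples."""
    simplices = set()
    for y in Y:
        by_dist = sorted(range(len(L)), key=lambda i: math.dist(y, L[i]))
        nearest = tuple(sorted(by_dist[:k]))        # Λ(y), the k closest landmarks
        for q in range(1, k + 1):                   # y witnesses every subset of Λ(y)
            simplices.update(combinations(nearest, q))
    return simplices
```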

Usually we are interested in the subcomplexes of dimension at most $k$, and we can understand the witness complexes as a simplicial approximation of the data $Y \subset X$.

**[1]** Vin de Silva and Gunnar Carlsson, *Topological estimation using witness complexes*, Eurographics Symposium on Point-Based Graphics (2004).

**[2]** Gunnar Carlsson, *Topology and data*, Bull. Amer. Math. Soc. (N.S.) 46 (2009), no. 2, 255–308.

**[3]** Numenta, *Hierarchical Temporal Memory (HTM) Whitepaper* (2011), available at http://numenta.com/learn/hierarchical-temporal-memory-white-paper.html.

**[4]** B. A. Olshausen and D. J. Field, *Sparse Coding with an Overcomplete Basis Set: A Strategy Employed by V1?*, Vision Research 37 (1997), 3311–3325.

A few posts ago I wrote about the mapper construction by Carlsson-Memoli-Singh and want to follow up on that a little.

I wrote a straightforward implementation of the construction in Python. It can be found here: github.com/mirkoklukas/tda-mapper-py.

```python
from tdamapper.clusterfunctions import VietorisRipsClustering
from tdamapper import mapper
from tdamapper.referenceMap import create_functional_cover, coordinate_projection
import json

# Example data set
with open("./example/dataset.json") as f:
    data = json.load(f)
data = [tuple(p) for p in data]

# Gather the mapper input
VR = VietorisRipsClustering(epsilon=0.6)
zAxis = coordinate_projection(axis=2, domain=data)
funcCover = create_functional_cover(endpoints=range(-12, 12), overlap=0.5)

# Run the algorithm
result = mapper(VR, zAxis, funcCover)
```

Below you see a visualization of the mapper result. The graph is colored by the values of **zAxis**, the projection onto the z-axis. The size of the nodes reflects the size of the associated clusters.

The takeaway should be that there are actually two separate branches growing out of a bigger cluster. You shouldn't focus too much on the fact that the two branches cross each other. Although it reflects the reality of the situation pretty well, it is rather a by-product of the fact that I used a very simple data set in so few dimensions, and that I was too lazy to reduce the number of crossings in the graph.

And indeed (what a terrible example it would have been if that was not the case) looking at a 3d plot of the original dataset we see that this reflects the shape pretty well.

**Again:** it’s the two branches that matter, not the crossings, I chose a misleading example and really should update that in the future.

The $k$-Means algorithm computes a Voronoi partition of the data set such that each landmark is given by the centroid of the corresponding cell. Let me quickly quote Wikipedia on the history of the algorithm before I explain what it is about: *The term “$k$-means” was first used by James MacQueen in 1967, though the idea goes back to Hugo Steinhaus in 1957. The standard algorithm was first proposed by Stuart Lloyd in 1957.*

Let $X \subset \mathbb{R}^n$ be a finite collection of data points. Fix a positive integer $k$. Then our aim is to find a partition $\boldsymbol S = \{S_1, \ldots, S_k\}$ of $X$ into $k$ subsets that minimizes the following function \[ J(\boldsymbol S) = \sum_{i=1}^k \sum_{x \in S_i} \| x - \mu(S_i) \|^2, \] where $\mu(S)$ denotes the mean of the points in $S$, i.e. \[ \mu(S) = \frac{1}{|S|}\sum_{x \in S} x. \] We denote by $\mu_* \big( \boldsymbol S \big)$ the collection of means of the sets in $\boldsymbol S$.

**As a rule of thumb**: in most of my posts, the $*$-functor applied to some construction (or function) $f$ can in functional-programming terms be translated to \[ f_*(Z) := \verb+map+ \ f \ Z. \]

Let $(Y,d)$ be a metric space. Let $\Lambda \subset Y$ be a finite subset called the **landmarks**. Given a landmark $\lambda \in \Lambda$ we define its associated **Voronoi cell $V_\lambda$** by \[ V_\lambda := \{ y \in Y \ | \ d(y,\lambda) \leq d(y, \Lambda) \}. \] Suppose we are given a subset $X \subset Y$, then we introduce the following shorthand notation for a relative version of a Voronoi cell \[ V_{X, \lambda} := V_\lambda \cap X. \] When it is clear whether we are dealing with the *relative* or *ordinary* version we may omit the extra index. We write $V_*(\Lambda)$ resp. $(V_X)_*(\Lambda)$ for the whole collection of Voronoi cells associated to the landmarks $\Lambda$, i.e. for the relative version we have \[ (V_X)_*(\Lambda) := \{ V_{X,\lambda} \ | \ \lambda \in \Lambda \}. \]

Suppose we have a discrete set $X$ embedded in $m$-dimensional Euclidean space $\mathbb{R}^m$ endowed with the Euclidean metric $d$. Suppose further we have chosen a family $\Lambda = \{\lambda_1,\ldots,\lambda_k\}$ of landmarks. We would like to produce a partition of $X$, i.e. a decomposition of $X$ into mutually disjoint sets, based on the Voronoi cells associated to $\Lambda$. However we are facing an ambiguity for points $x \in X$ with \[ d(x, \lambda_i) = d(x, \lambda_j), \text{ for some $i \neq j$}. \] We have to make a choice to which set we are assigning $x$ (and from which cell we are removing $x$). For the remaining part of this post we will:

*Assign $x$ to the cell with the lower index, and remove it from the other.*

There is no particular reason to go with this rule other than it is the easiest I could come up with. After reassigning all problematic points we end up with an honest partition of $X$. We will continue to denote these sets by $V_\lambda$ resp. $V_{X,\lambda}$, and continue to refer to them as Voronoi cells.

Computing the minimum of the function $J$ described above is usually too expensive. Instead one uses a heuristic algorithm to compute a local minimum. The most common is Lloyd's algorithm, which I will sketch in the following: Suppose $X$ is a finite discrete set in $m$-dimensional Euclidean space $\mathbb{R}^m$ endowed with the Euclidean metric $d$. Suppose further we fixed a positive integer $k$. Then choose an arbitrary partition $\boldsymbol S$, i.e. a decomposition of $X$ into a family of mutually disjoint sets $S_1,\ldots,S_k \subset X$. Then define a sequence $(C_n)_{n \in \mathbb{N}}$ of partitions as follows \[ C_n := L^n(\boldsymbol S) \ \ \ \text{, where } L:= (V_X \circ \mu)_*. \] It is not hard to show that this sequence converges (see the section below). Hence one can define the result of Lloyd's algorithm applied to the initial partition $\boldsymbol S$ as follows \[ \mathscr{L}_{V,\mu}\big(\boldsymbol S \big) := \lim_{n \to \infty} C_{n} . \]
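A compact NumPy sketch of the iteration $L = (V_X \circ \mu)_*$ (the function name is mine, and I assume no cell runs empty along the way): partitions are lists of index sets, ties go to the lower-indexed cell as fixed above, and the loop stops once $C_{n+1} = C_n$.

```python
import numpy as np

def lloyd(X, S, max_iter=100):
    """Iterate L = (V_X ∘ μ)_*: means of the current partition become
    landmarks, whose relative Voronoi cells form the next partition.
    S is a list of disjoint index sets covering X; assumes no cell
    becomes empty during the iteration."""
    X = np.asarray(X, float)
    for _ in range(max_iter):
        mu = np.array([X[sorted(cell)].mean(axis=0) for cell in S])   # μ_*(S)
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        labels = dists.argmin(axis=1)      # argmin takes the lowest index on ties
        S_next = [set(np.flatnonzero(labels == j).tolist()) for j in range(len(S))]
        if S_next == S:                    # the sequence C_n has stabilized
            return S
        S = S_next
    return S
```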

Observe that for any partition $\boldsymbol S$ of $X$ we have \[ J(\boldsymbol S) \geq J \big( L(\boldsymbol S) \big), \] and equality holds if and only if $\boldsymbol S = L(\boldsymbol S)$. Therefore $J(C_n)$ defines a descending sequence in $\mathbb{R}$ which is bounded below by zero, hence it converges. Since the set of partitions of $X$ is finite, $J$ only takes a finite number of values. This implies that $J(C_n)$ takes a constant value for $n$ sufficiently large. By the observation above, $C_n$ is then constant for sufficiently large $n$ as well.

A solution to the $k$-Means problem will always be a partition of $X$ into $k$ subsets, but it is not always suited to be interpreted as a partition into "clusters". Imagine a point cloud distributed according to a probability distribution centered along a straight line. Intuitively one would suggest a single "connected" cluster; $k$-Means, by definition, would suggest otherwise. Without further analysis we couldn't tell the difference between that particular data set and another one scattered around $k$ different centers. So one should really see the solution for what it is: a partition, or "cover", of $X$. Luckily there are further directions to go from here and to build on top of $k$-Means. We briefly sketch one possible extension in the section below.

Let $\boldsymbol S = \{ S_1,\ldots,S_k \} $ be a partition of $X$, obtained by Lloyd's algorithm say. We would like to associate a simplicial complex to $\boldsymbol S$. In a previous post on the Mapper construction I explained how to construct the nerve of a covering of $X$. However, since $\boldsymbol S$ is a partition, i.e. the sets are mutually disjoint, this construction will only yield a trivial zero-dimensional complex. All we have to do is to slightly enlarge the sets in the partition $\boldsymbol S$. For $\varepsilon > 0$ we define \[ \boldsymbol S_\varepsilon := \big\{ N_\varepsilon(S_1), \ldots, N_\varepsilon(S_k) \big\}, \] where $N_\varepsilon(S)$ denotes the epsilon neighbourhood of $S$ in $X$, i.e. the set of points in $X$ whose distance to $S$ is at most $\varepsilon$. We can compute the nerve $\check{N}(\boldsymbol S_\varepsilon)$ of the enlarged cover $\boldsymbol S_\varepsilon$. For the "right" choice of $\varepsilon$ we are now able to distinguish the two data sets given in the previous section.

A construction that is closely related (almost identical) to the above is the following. Suppose $\boldsymbol S$ is the set of Voronoi cells associated to a family $\Lambda$ of landmarks. The **strong witness complex $\mathcal{W}^s(X,\Lambda ,\varepsilon)$** is defined to be the complex whose vertex set is $\Lambda$, and where a collection $(\lambda_{i_0},\ldots, \lambda_{i_l})$ defines an $l$-simplex if and only if there is a **witness $x \in X$** for it, i.e. a point with \[ d(x,\lambda_{i_j}) \leq d(x,\Lambda) + \varepsilon, \text{ for $j=0,\ldots,l$}. \]

*This is the third of a series of posts on cluster-algorithms and ideas in data analysis.*

*Mapper* is a construction that uses a given cluster-algorithm to associate a simplicial complex to a reference map on a given data set. It was introduced by Carlsson–Mémoli–Singh in [1] and lies at the core of the topological data analysis startup Ayasdi. A good reference, which I personally enjoyed reading, is [2]. Mapper is closely related to the *Reeb graph* of a real-valued function on a topological space. Just like the Reeb graph, it is able to capture certain topological and geometrical features of the underlying space.

Let $X$ be a topological space. We associate to each open cover $\mathcal{U}=\{U_i\}_I$ a simplicial complex $\check N(\mathcal{U})$ called the **nerve of $\mathcal{U}$** defined as follows: there is a vertex $i$ for each $U_i$, and a $k$-simplex spanned by $i_0,…,i_k$ whenever the intersection $U_{i_0} \cap \ldots \cap U_{i_k}$ of the corresponding sets is nonempty.
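For a finite cover the nerve is a direct translation of this definition. A brute-force Python sketch (function name is mine; exponential in the cover size, so only for small examples) in which a tuple of $q$ indices represents a $(q-1)$-simplex:

```python
from itertools import combinations

def nerve(cover):
    """Nerve of a finite cover (a list of sets): a tuple of q indices is a
    simplex iff the corresponding q sets have nonempty common intersection."""
    simplices = []
    for q in range(1, len(cover) + 1):
        for idx in combinations(range(len(cover)), q):
            if set.intersection(*(set(cover[i]) for i in idx)):
                simplices.append(idx)
    return simplices
```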

Let $f:X \to Y$ be a continuous map between topological spaces $X$ and $Y$. Any open cover $\mathcal{V}$ of $Y$ induces an open cover $f^*\mathcal{V}$ of $X$ obtained by the pullback under $f$, i.e. \[ f^*\mathcal{V} := \big\{ f^{-1}(V) \ | \ {V \in \mathcal{V}} \big\}. \] For example set $Y = \mathbb{R}$ and think of $f$ as a height function on $X$. In the context of the Mapper construction these maps are called **filters** or **lenses**.

For a given space $X$ let $\mathcal{P}(X)$ denote its power set. Then we call an assignment \[ X \leadsto \pi(X)\subset \mathcal{P}(X), \] that associates to a space $X$ a family $\pi(X)$ of subsets of $X$, a **cluster function** or **cluster algorithm** — note that this is not really a function in the mathematical sense, but rather in the sense of computer science and programming. We refer to an element in $\pi(X)$ as a **cluster**. We don’t require $\pi(X)$ to satisfy any properties at that point. From a programming perspective think of it as the signature of a function implementing a certain cluster algorithm.

Given a family of open sets $\mathcal{U}=\{ U_i \}_I$ we define another family $\pi_*(\mathcal{U})$ of open sets by applying $\pi$ to each set $U_i$ and collecting the resulting clusters, i.e. we define \[ \pi_*(\mathcal{U}) := \bigcup_{i \in I} \pi(U_i). \] Finally we have everything in place to define the Mapper-construction.

Let $X$ be a topological space (the data set), $\pi$ be a cluster algorithm, $f:X \to Y$ a reference map to a topological space $Y$, and $\mathcal{V}$ an open cover of $Y$. Then the result of Mapper applied to this triple is the simplicial complex $\mathcal{M}(\pi, f, \mathcal{V})$ defined by \[ \mathcal{M}(\pi, f, \mathcal{V}) := \check{N} \big( \pi_*(f^*\mathcal{V}) \big). \]

**[1]** G. Carlsson, F. Mémoli and G. Singh, *Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition*, Eurographics Symposium on Point Based Graphics (2007), pp. 91–100.

**[2]** G. Carlsson, *Topology and data*, Bull. Amer. Math. Soc. 46 (2009), pp. 255–308.

*Ordering points to identify the clustering structure* (OPTICS) is a data clustering algorithm presented by Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel and Jörg Sander in 1999 [1]. It builds on the ideas of DBSCAN, which I described in a previous post. But let’s cut the intro and dive right into it.

Let $(X,d)$ be a finite discrete metric space, i.e. let $X$ be a data set on which we have a notion of similarity expressed in terms of a suitable distance function $d$. Given a positive integer $m \in \mathbb{N}$ we can define a notion of *density* on points of our data set, which in turn can be used to define a *perturbed metric*. Given a starting point $x_0 \in X$ the OPTICS-algorithm iteratively defines a linear ordering on $X$ in terms of a filtration $X_0 \subset X_1 \subset \ldots \subset X_n$ of $X$, where $X_{k+1}$ is obtained from $X_k$ by appending the closest element (with respect to the perturbed metric) in its complement.

The original OPTICS-algorithm also takes an additional parameter $\varepsilon > 0$. However this is only needed to reduce the time complexity and is ignored in our discussion for now — to be more precise, we implicitly set $\varepsilon = \infty$.

Let $(X,d)$ be a finite discrete metric space and $m \in \mathbb{N}$ a positive integer. We define the **(co-)density $\delta_m(x)$ of a point $x$** by \[ \delta_m(x) := d(x, \mathfrak{nn}_m(x) ), \] where $\mathfrak{nn}_m(x) $ is an $m$-nearest-neighbor of $x$. Loosely speaking: the lower the value $\delta_m(x)$ the closer the neighbors of $x$ are distributed around $x$. Note that with $\varepsilon = \delta_m(x)$ the point $x$ is a *core point* in the sense of [2] — in a previous post about DBSCAN we called such a point *$\varepsilon$-dense*. That is why in the literature $\delta_m(x)$ is referred to as **core-distance**. I however like to think of it, and hence refer to it, as a notion of co-density.

Given the co-density $\delta_m(.)$ we can define the **reachability distance $r_x(y)$ of $y$ from $x$** by \[ r_x(y) := \text{max} \big\{ \delta_m(x), d(x,y) \big\}. \] Note that $r_x(y)$ is not symmetric, since the density of $x$ and $y$ may differ.

Choose a starting point $x_0 \in X$. Then we can iteratively define a filtration $X_0 \subset \ldots \subset X_n$ of the data set $X$ by \[ X_0 := \{ x_0 \} \ \ \text{and} \ \ X_{k+1} := X_k \cup \{ x_{k+1} \}, \] where $x_{k+1}$ minimizes $r_{X_k}(y) := \min_{x \in X_k} r_x(y)$ over $y \in X \setminus X_k$. In parallel we define a sequence $(r_n)$ of distances by \[ r_0 = 0 \ \ \text{and} \ \ r_{k+1} := r_{X_k}(x_{k+1}). \] Note that a small distance $r_k$ may be understood as $x_k$ being close to a rather dense region. Therefore the filtration tries to stay in dense regions for as long as possible before it passes through a less dense region. The cluster structure can now be extracted by analyzing the *reachability plot*, a $2$-dimensional plot with the ordered $x_k$ on the $x$-axis and the associated distances $r_k$ on the $y$-axis. By the above considerations it should be clear that clusters show up as valleys in the reachability plot. The deeper the valley, the denser the cluster.
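The filtration and the reachability values can be computed naively in a few lines (the function name is mine; this is the $\varepsilon = \infty$ variant and makes no attempt at the original paper's efficiency):

```python
import math

def optics_order(X, m, x0=0):
    """Reachability ordering: grow X_k by the point minimizing
    r_{X_k}(y) = min over x in X_k of max(δ_m(x), d(x, y)).
    X is a list of coordinate tuples; returns (order, reachability)."""
    d = lambda i, j: math.dist(X[i], X[j])
    def core_dist(i):
        # δ_m(i): distance to the m-th nearest neighbor of point i
        return sorted(d(i, j) for j in range(len(X)) if j != i)[m - 1]
    order, reach = [x0], [0.0]
    remaining = set(range(len(X))) - {x0}
    while remaining:
        best = min(remaining,
                   key=lambda y: min(max(core_dist(x), d(x, y)) for x in order))
        reach.append(min(max(core_dist(x), d(x, best)) for x in order))
        order.append(best)
        remaining.remove(best)
    return order, reach
```

On two well-separated dense groups the ordering exhausts the first group before jumping to the second, and the jump shows up as a spike in the reachability values, i.e. the wall between two valleys.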

Consider the data points $X\subset \mathbb{R}^2$ in the Euclidean plane given in the following figure:

Below you see the reachability plot corresponding to the OPTICS-algorithm applied to the above data set with $m=4$. The starting point is chosen among the group of points on the left.

**[1]** M. Ankerst, M.M. Breunig, H.-P. Kriegel, *OPTICS: Ordering Points To Identify the Clustering Structure*, ACM SIGMOD international conference on Management of data (1999), pp 49–60.

**[2]** M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, *A density-based algorithm for discovering clusters in large spatial databases with noise*, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (1996), pp. 226–231.

*Density-based spatial clustering of applications with noise* (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996 [1]. It uses notions of *connectivity* and *density* of points in the data set to compute the connected components of dense enough regions of the data set. But let’s cut the intro and dive right into it.

Let $(X,d)$ be a finite discrete metric space, i.e. let $X$ be a data set on which we have a notion of similarity expressed in terms of a suitable distance function $d$. Assume we have fixed a pair of parameters $\varepsilon > 0$ and $m \in \mathbb{N}$. We say that two points $x, x' \in X$ are **$\varepsilon$-connected** if there is a sequence of points $x = x_0, \ldots, x_n = x'$ such that \[ d(x_i,x_{i+1}) < \varepsilon, \text{ for $i=0,\ldots,n-1$}. \] Note that **$\varepsilon$-connectivity** defines an equivalence relation on $X$, which we denote by $\sim_\varepsilon$.
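The equivalence classes of $\sim_\varepsilon$ can be computed with a plain breadth-first search. The following is my own sketch (the name `epsilon_components` and the callable `dist` parameter are not from [1]); it works for any distance function on the data points:

```python
from collections import deque

def epsilon_components(points, eps, dist):
    """Equivalence classes of eps-connectivity (d(x, y) < eps), via BFS."""
    n = len(points)
    seen = [False] * n
    components = []
    for s in range(n):
        if seen[s]:
            continue
        comp, queue = [], deque([s])
        seen[s] = True
        while queue:
            i = queue.popleft()
            comp.append(i)
            # Enqueue every unseen point within distance eps of x_i.
            for j in range(n):
                if not seen[j] and dist(points[i], points[j]) < eps:
                    seen[j] = True
                    queue.append(j)
        components.append(sorted(comp))
    return components
```

This is the $O(n^2)$ brute-force version; with a spatial index the neighbor scan becomes much cheaper.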

For a subset $A \subset X$ we define its **$\varepsilon$-boundary $\Delta_\varepsilon A$** to be the set of points whose distance to $A$ is at most $\varepsilon$, i.e. we set \[ \Delta A := \Delta_\varepsilon A := \{ x \in X \setminus A \ | \ d(A,x) \leq \varepsilon \}. \] Here $d(A,x)$ denotes the minimum $\text{min}_{a \in A} \{ d(a,x) \}$ over all points in $A$. This notion of *boundary* is borrowed from graph theory and slightly adapted to our situation. Don't worry if you don't like that notion: it could also be omitted, and I will elaborate on that later on. The choice of the parameters above allows us to define a discrete notion of **density $\rho(x)$ of a point $x$** by \[ \rho(x) = \rho_{\varepsilon}(x) := |N_\varepsilon(x)|, \] where $|.|$ counts the number of elements of a set and $N_\varepsilon(x)$ denotes the $\varepsilon$-neighborhood of $x$ (i.e.~the set of points whose distance to $x$ is at most $\varepsilon$). A point is called **$m$-dense** if its ($\varepsilon$-)density is greater than or equal to $m$, i.e.~its $\varepsilon$-neighborhood $N_\varepsilon(x)$ contains at least $m$ points. In the literature these points are usually referred to as **core points**.
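The density $\rho_\varepsilon$ is a one-liner; here is my own small sketch (hypothetical name `density`, arbitrary distance function), counting the point itself since $d(x,x) = 0 \leq \varepsilon$:

```python
def density(points, eps, dist):
    """Discrete density rho(x): size of the eps-neighborhood of each point
    (the point itself is included, since d(x, x) = 0 <= eps)."""
    return [sum(1 for q in points if dist(p, q) <= eps) for p in points]
```

The $m$-dense (core) points are then simply those indices `i` with `density(...)[i] >= m`.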

The idea behind DBSCAN can be formulated as follows. Choose a pair of parameters $(\varepsilon, m) \in \mathbb{R}_{\geq 0} \times \mathbb{N}$. The first parameter, $\varepsilon > 0$, gives rise to the notions of connectivity, density, and the boundary of a subset as described above. Let \[ X_{m \leq \rho} := \rho^{-1}\big( [m,\infty) \big) \subset X \] denote the points of density at least $m$ and let $C_1,\ldots,C_n$ denote its connected components with respect to the notion of $\varepsilon$-connectivity defined above, i.e. \[ \pi_0^\varepsilon(X_{m \leq \rho}) := \big( X_{m \leq \rho} \big)/_{x \sim_\varepsilon x'} = \big\{ C_1,\ldots,C_n \big\}. \] The components $C_1,\ldots,C_n$ can be understood as **dense cores** of the final clusters. Note that these sets are mutually disjoint. The final clusters $\overline{C}_1,\ldots,\overline{C}_n$ are obtained from $C_1,\ldots,C_n$ by adding their respective boundaries, i.e. for $i=1,\ldots,n$ we define \[ \overline{C}_i := C_i \cup \Delta C_i. \] Actually $\overline{C}_i$ is just the $\varepsilon$-neighborhood $N_\varepsilon(C_i)$ of $C_i$, but I like the formulation in terms of $\Delta_\varepsilon$. Note that $\overline{C}_1,\ldots,\overline{C}_n$ are not necessarily pairwise disjoint. To achieve that we need to make a choice (usually following the order in which the points are visited), and in that sense an implementation of DBSCAN is usually not deterministic.
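Putting the pieces together, here is my own compact sketch of the whole procedure (hypothetical name `dbscan`; for simplicity it uses $\leq \varepsilon$ throughout, whereas the definitions above mix $< \varepsilon$ for connectivity with $\leq \varepsilon$ for neighborhoods): compute the densities, take the $\varepsilon$-connected components of the core points, and enlarge each component by its $\varepsilon$-boundary. A boundary point reachable from several components is claimed by the first cluster that visits it, which is exactly the non-determinism mentioned above:

```python
from collections import deque

def dbscan(points, eps, m, dist):
    """Label each point with a cluster index, or -1 for noise."""
    n = len(points)
    # Density rho(x): number of points within distance eps of x.
    rho = [sum(1 for j in range(n) if dist(points[i], points[j]) <= eps)
           for i in range(n)]
    labels = [-1] * n
    cluster = 0
    for s in range(n):
        # Start a BFS only at unlabeled core (m-dense) points.
        if rho[s] < m or labels[s] != -1:
            continue
        labels[s] = cluster
        queue = deque([s])
        while queue:
            i = queue.popleft()
            for j in range(n):
                if labels[j] != -1 or dist(points[i], points[j]) > eps:
                    continue
                labels[j] = cluster        # core or boundary point joins
                if rho[j] >= m:            # but only core points propagate
                    queue.append(j)
        cluster += 1
    return labels
```

Stopping the expansion at non-core points is what realizes $\overline{C}_i = C_i \cup \Delta C_i$: boundary points belong to a cluster but never pull in further points.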

**[1]** M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, *A density-based algorithm for discovering clusters in large spatial databases with noise*, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (1996), AAAI Press, pp. 226–231.