Finite-Sample Expressivity of Neural Networks

In the present blog post I would like to present a slighlty altered version of the proof of a theorem in [1]. The statement is about theĀ finite-sample expressivity of neural networks. You can basically take the approach and present it in a arguably more direct and compact fashion, and explicitly writing down the desired network. But let us first take a look at said theorem.

Theorem (see [1, Theorem 1], Section 4 on page 8). There exists a two-layer neural network with ReLU activations and $2n+d$ weights that can represent any function on a sample of size $n$ in $d$ dimensions.

That means if we choose $n$ mutually distinct samples $z_1, \ldots, z_n \in \mathbb{R}^d$ and real valued labels $y_1, \ldots, y_n \in \mathbb{R}$ there is a $2$-layer neural network $C$ depending only on $2n+d$ weight parameters such that for $i=1, \ldots , n$ we have $$ C(z_i) = y_i. $$ Moreover the network is of the form $\mathbb{R}^d \stackrel{F}{\to} \mathbb{R}^n \stackrel{w}{\to} \mathbb{R} $ with activation function given by $ \alpha(x) = \max \{ x ,0 \}$.

Remark and Question. What is somewhat interesting is that $d$ parameters can be chosen generically, i.e. a random choice will almost certainly give the desired result. This is not immediately clear from the presentation in [1]. If someone has a comment on this I’d be happy to hear about it!

Alternative version of the proof (cf. [1]).

First we are going to reduce the problem to the case where the samples are chosen from $\mathbb{R}$. Choose $a \in \mathbb{R}^d$ such that $x_i := a \cdot z_i$ are mutually distinct, i.e. we have \[ x_i \neq x_j \text{, for $i \neq j$.} \] Relabel the data points and add a value $x_0$ such that we have $$ x_0 < x_1 < \ldots < x_n. $$ (Note that a generic $a$ will do the job.) We are now in the one-dimensional case, and will proceed by defining a family $\{f_i\}$ of affine-linear functions, that will only depend on $n$ parameters, which combined with the previous projection, and the rectifier will give the first layer of the desired network. We define $f_i \colon\thinspace \mathbb{R} \to \mathbb{R}$ by $$ f_i(x) := \tfrac{1}{x_i – x_{i-1}} (x – x_{i-1}), $$ and note that we have $f_i(x_i) = 1$, and $f_i(x) \leq 0$ for $x \leq x_{i-1}$. We are now ready to define the final layer of the network. Set $w_1 := y_1$ and define iteratively $$ w_j := y_j – \sum_{i=1}^{j-1} f_i(x_j)\cdot w_i. $$ One easily computes that $$ y_j = \sum_{i=1}^{n} \max\big\{ f_i(a \cdot z_j) , 0 \big\} \cdot w_i. $$ With $F(z) = ( f_1(a \cdot z), \ldots, f_n(a \cdot z) )$ this can be expressed as $\max\{ F(z_j),0 \}^T \cdot w = y_j$. Note that $F$ is a affine linear function $\mathbb{R}^d \to \mathbb{R}^n$ that depends on $d + n$ weights, namely those coming from $a$ and $x_1,\ldots,x_n$. With the additional $n$ weights defining $w$ we conclude the proof. QED


[1] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht and Oriol Vinyals, “Understanding deep learning requires rethinking generalization”, under review as a conference paper at ICLR 2017