On the Ability of Neural Nets to Express Distributions

ABSTRACT. Deep neural nets have caused a revolution in many classification tasks. A related ongoing revolution---also theoretically not understood---concerns their ability to serve as generative models for complicated types of data such as images and texts. These models are trained using ideas like variational autoencoders and Generative Adversarial Networks.

We take a first cut at explaining the expressivity of multilayer nets by giving a sufficient criterion for a function to be approximable by a neural network with $n$ hidden layers. A key ingredient is Barron's Theorem \citep{Barron1993}, which gives a Fourier criterion for approximability of a function by a neural network with 1 hidden layer. We show that a composition of $n$ functions which satisfy certain Fourier conditions (``Barron functions'') can be approximated by a $n+1$-layer neural network.

For probability distributions, this translates into a criterion for a probability distribution to be approximable in Wasserstein distance---a natural metric on probability distributions---by a neural network applied to a fixed base distribution (e.g., multivariate gaussian).

Building up recent lower bound work, we also give an example function that shows that composition of Barron functions is more expressive than Barron functions alone.

ABSTRACT. Let $f:\sphere^{d-1}\times \sphere^{d-1}\to\reals$ be a function of the form $f(\x,\x') = g(\inner{\x,\x'})$ for $g:[-1,1]\to \reals$. We give a simple proof that shows that poly-size depth two neural networks with (exponentially) bounded weights cannot approximate $f$ whenever $g$ cannot be approximated by a low degree polynomial. Moreover, for many $g$'s, such as $g(x)=\sin(\pi d^3x)$, the number of neurons must be $2^{\Omega\left(d\log(d)\right)}$. Furthermore, the result holds w.r.t.\ the uniform distribution on $\sphere^{d-1}\times \sphere^{d-1}$.
As many functions of the above form can be well approximated by poly-size depth three networks with poly-bounded weights, this establishes a separation between depth two and depth three networks w.r.t.\ the uniform distribution on $\sphere^{d-1}\times \sphere^{d-1}$.

ABSTRACT. We analyze dropout in deep networks with rectified linear units and
the quadratic loss. Our results expose surprising differences between
the behavior of dropout and more traditional regularizers like weight
decay. For example, on some simple data sets dropout training
produces negative weights even though the output is the sum of the
inputs. This provides a counterpoint to the suggestion that dropout
discourages co-adaptation of weights. We also show that the dropout
penalty can grow exponentially in the depth of the network while the
weight-decay penalty remains essentially linear, and that dropout is
insensitive to various re-scalings of the input features, outputs, and
network weights. This last insensitivity implies that there are no
isolated local minima of the dropout training criterion. Our work
uncovers new properties of dropout, extends our understanding of why
dropout succeeds, and lays the foundation for further progress.

ABSTRACT. We give the first dimension-efficient algorithms for learning Rectified Linear Units (ReLUs), which are functions of the form $\mathbf{x} \mapsto \max(0, \mathbf{w} \cdot \mathbf{w})$ with $\mathbf{w} \in \mathbb{S}^{n−1}$. Our algorithm works in the challenging Reliable Agnostic learning model of Kalai et al. (2012) where the learner is given access to a distribution D on labeled examples but the labeling may be arbitrary. We construct a hypothesis that simultaneously minimizes the false-positive rate and the loss on inputs given positive labels by D, for any convex, bounded, and Lipschitz loss function.

The algorithm runs in polynomial-time (in n) with respect to any distribution on $\mathbb{S}^{n−1}$ (the unit sphere in n dimensions) and for any error parameter $\epsilon = \Omega(1/ \log n)$ (this yields a PTAS for a question raised by F. Bach on the complexity of maximizing ReLUs). These results are in contrast to known efficient algorithms for reliably learning linear threshold functions, where $\epsilon$ must be $\Omega(1)$ and strong assumptions are required on the marginal distribution.

We can compose our results to obtain the first set of efficient algorithms for learning constant-depth networks of ReLU with fixed polynomial-dependence in the dimension. For depth-2 networks of sigmoids, we obtain the first algorithms that have a polynomial dependency in all parameters.

Our techniques combine kernel methods and polynomial approximations with a “dual-loss” approach to convex programming. As a byproduct we obtain a number of applications including the first set of efficient algorithms for “convex piecewise-linear fitting” and the first efficient algorithms for noisy polynomial reconstruction of low-weight polynomials on the unit sphere.

Nearly-tight VC-dimension bounds for neural networks

ABSTRACT. We prove new upper and lower bounds on the VC-dimension of deep neural networks with the ReLU activation function. These bounds are tight for almost the entire range of parameters. Letting $W$ be the number of weights and $L$ be the number of layers, we prove that the VC-dimension is $O(W L \log(W))$ and $\Omega( W L \log(W/L) )$. This improves both the previously known upper bounds and lower bounds. In terms of the number $U$ of non-linear units, we prove a tight bound $\Theta(W U)$ on the VC-dimension. All of these results generalize to arbitrary piecewise linear activation functions.

ABSTRACT. We obtain the first polynomial-time algorithm for exact tensor completion that improves over the bound implied by reduction to matrix completion.
The algorithm recovers an unknown 3-tensor with $r$ incoherent, orthogonal components in $\R^n$ from $r\cdot \tilde O(n^{1.5})$ randomly observed entries of the tensor.
This bound improves over the previous best one of $r\cdot \tilde O(n^{2})$ by reduction to exact matrix completion.
Our bound also matches the best known results for the easier problem of approximate tensor completion (Barak \& Moitra, 2015).

Our algorithm and analysis extends seminal results for exact matrix completion (Candes \& Recht, 2009) to the tensor setting via the sum-of-squares method.
The main technical challenge is to show that a small number of randomly chosen monomials are enough to construct a degree-3 polynomial with a precisely planted orthogonal global optima over the sphere and that this fact can be certified within the sum-of-squares proof system.

Fast and robust tensor decomposition with applications to dictionary learning

ABSTRACT. We develop fast spectral algorithms for tensor decomposition that match the robustness guarantees of the best known polynomial-time algorithms for this problem based on the sum-of-squares (SOS) semidefinite programming hierarchy.

Our algorithms can decompose a 4-tensor with $n$-dimensional orthonormal components in the presence of error with constant spectral norm (when viewed as an $n^2$-by-$n^2$ matrix).
The running time is $n^5$ which is close to linear in the input size $n^4$.

We also obtain algorithms with the same running time to learn sparsely-used orthogonal dictionaries even when feature representations have constant relative sparsity and non-independent coordinates.

The only previous polynomial-time algorithms to solve these problem are based on solving large semidefinite programs.
In contrast, our algorithms are easy to implement directly and are based on spectral projections and tensor-mode rearrangements.

Or work is inspired by recent of Hopkins, Schramm, Shi, and Steurer (STOC'16) that shows how fast spectral algorithms can achieve the guarantees of SOS for average-case problems.
In this work, we introduce general techniques to capture the guarantees of SOS for worst-case problems.

ABSTRACT. Developing efficient and guaranteed nonconvex algorithms has been an important challenge in modern machine learning. Algorithms with good empirical performance such as stochastic gradient descent often lack theoretical guarantees. In this paper, we analyze the class of homotopy or continuation methods for global optimization of nonconvex functions. These methods start from an objective function that is efficient to optimize (e.g. convex), and progressively modify it to obtain the required objective, and the solutions are passed along the homotopy path. For the challenging problem of tensor PCA, we prove global convergence of the homotopy method in the ``high noise'' regime. The signal-to-noise requirement for our algorithm is tight in the sense that it matches the recovery guarantee for the {\em best} degree-$4$ sum-of-squares algorithm. In addition, we prove a phase transition along the homotopy path for tensor PCA. This allows us to simplify the homotopy method to a local search algorithm, viz., tensor power iterations, with a specific initialization and a noise injection procedure, while retaining the theoretical guarantees.

Fundamental limits of symmetric low-rank matrix estimation

ABSTRACT. We consider the high-dimensional inference problem where the signal is a low-rank symmetric matrix which is corrupted by an additive Gaussian noise. Given a probabilistic model for the low-rank matrix, we compute the limit in the large dimension setting for the mutual information between the signal and the observations, as well as the matrix minimum mean square error, while the rank of the signal remains constant. We also show that our model extends beyond the particular case of additive Gaussian noise and we prove an universality result connecting the community detection problem to our Gaussian framework. We unify and generalize a number of recent works on PCA, sparse PCA, submatrix localization or community detection by computing the information-theoretic limits for these problems in the high noise regime. In addition, we show that the posterior distribution of the signal given the observations is characterized by a parameter of the same dimension as the square of the rank of the signal (i.e. scalar in the case of rank one).
This allows to locate precisely the information-theoretic thresholds for the above mentioned problems.
Finally, we connect our work with the hard but detectable conjecture in statistical physics.

Matrix Completion from O(n) Samples in Linear Time

ABSTRACT. We consider the problem of reconstructing a rank-$k$ $n \times n$ matrix $M$ from a sampling of its entries. Under a certain incoherence assumption on $M$ and for the case when both the rank and the condition number of $M$ are bounded, it was shown in \citep{CandesRecht2009, CandesTao2010, keshavan2010, Recht2011, Jain2012, Hardt2014} that $M$ can be recovered exactly or approximately (depending on some trade-off between accuracy and computational complexity) using $O(n \, \text{poly}(\log n))$ samples in super-linear time $O(n^{a} \, \text{poly}(\log n))$ for some constant $a \geq 1$.

In this paper, we propose a new matrix completion algorithm using a novel sampling scheme based on a union of independent sparse random regular bipartite graphs. We show that under the same conditions w.h.p. our algorithm recovers an $\epsilon$-approximation of $M$ in terms of the Frobenius norm using $O(n \log^2(1/\epsilon))$ samples and in linear time $O(n \log^2(1/\epsilon))$. This provides the best known bounds both on the sample complexity and computational cost for reconstructing an unknown low-rank matrix.

The novelty of our algorithm is a new step of thresholding singular values and rescaling singular vectors in the application of the ``vanilla'' alternating minimization algorithm. The structure of sparse random regular graphs is used heavily for controlling the impact of these regularization steps.

High-Dimensional Regression with Binary Coefficients. Estimating Squared Error and a Phase Transition.

ABSTRACT. We consider a sparse linear regression model Y=X\beta^{*}+W where X is n\times p matrix Gaussian i.i.d. entries, W
is n\times 1 noise vector with i.i.d. mean zero Gaussian entries and standard deviation \sigma,
and \beta^{*} is p\times 1 binary vector with support size (sparsity) k.
Using a novel conditional second moment method we obtain a tight up to a
multiplicative constant approximation of the optimal squared error \min_{\beta}\|Y-X\beta\|_{2},
where the minimization is over all k-sparse binary
vectors \beta.
The approximation reveals interesting structural properties of the
underlying regression problem. In particular,

[(a)] We establish that n^{*}=2k\log p/\log (2k/\sigma^{2}+1) is a phase transition point
with the following ``all-or-nothing'' property.
When n exceeds n^{*}, (2k)^{-1}\|\beta_{2}-\beta^*\|_0\approx 0, and when n is below n^{*},
(2k)^{-1}\|\beta_{2}-\beta^*\|_0\approx 1, where \beta_2 is the optimal solution achieving the smallest squared error.
As a corollary n^{*} is the asymptotic threshold for recovering \beta^* information theoretically. Note that n^* is asymptotically
below the threshold n_{\text{LASSO/CS}}=(2k+\sigma^2)\log p, above which
the LASSO and Compressive Sensing methods are able to recover \beta^*.

[(b)] We compute the squared error for an intermediate problem \min_{\beta}\|Y-X\beta\|_{2} where minimization is restricted to vectors
\beta with \|\beta-\beta^{*}\|_0=2k \zeta, for some fixed ratio \zeta\in [0,1].
We show that a lower bound part \Gamma(\zeta) of the estimate, which essentially corresponds to the estimate based on the first moment method,
undergoes a phase transition at three different thresholds, namely
n_{\text{inf,1}}=\sigma^2\log p, which is information theoretic bound for recovering \beta^* when k=1
and \sigma is large, then at n^{*} and finally at n_{\text{LASSO/CS}}.

[(c)] We establish a certain Overlap Gap Property (OGP) on the space of all k-sparse binary vectors \beta when n\le ck\log p for sufficiently small constant c.
By drawing a connection with a similar OGP exhibited by many randomly generated constraint satisfaction problems and statistical physics models,
we conjecture that OGP is the source of algorithmic
hardness of solving the minimization problem \min_{\beta}\|Y-X\beta\|_{2} in the regime n

Rates of estimation for determinantal point processes

ABSTRACT. Determinantal point processes (DPPs) have wide-ranging applications in machine learning, where they are used to enforce the notion of diversity in subset selection problems. Many estimators have been proposed, but surprisingly the basic properties of the maximum likelihood estimator (MLE) have received little attention. In this paper, we study the local geometry of the expected log-likelihood function to prove several rates of convergence for the MLE. We also give a complete characterization of the case where the MLE converges at a parametric rate. Even in the latter case, we also exhibit a potential curse of dimensionality where the asymptotic variance of the MLE is exponentially large the dimension of the problem.

ABSTRACT. We consider a new PAC-style learning model in which a joint distribution over vector pairs (x, y) is determined by an unknown function c(x) that maps input vectors x not to individual outputs, but to entire distributions over output vectors y. Our main results take the form of rather general reductions from our model to algorithms for PAC learning the function class and the distribution class separately, and show that virtually every such combination yields an efficient algorithm in our model. Our methods include a randomized reduction to classification noise and an application of Le Cam’s method to obtain robust learning algorithms.

A second-order look at stability and generalization

ABSTRACT. Using differentiability assumptions on the loss function and a concentration inequality for bounded second order differences it is shown that the generalization error for classification with L2 regularisation obeys a Bernstein-type inequality.

Optimal learning via local entropies and sample compression

ABSTRACT. Under margin assumptions, we prove several risk bounds, represented via the distribution dependent local entropies of the classes or the sizes of specific sample compression schemes. In some cases, our guarantees are optimal up to constant factors for families of classes. We discuss limitations of our approach and give several applications. In particular, we provide improved high probability bounds for the hard-margin SVM, extended analysis of certain empirical risk minimizers under log-concave distributions, a new variant of an online to batch conversion, and distribution dependent localized bounds in the aggregation framework.