COLT 2016: THE 29TH ANNUAL CONFERENCE ON LEARNING THEORY
PROGRAM FOR FRIDAY, JUNE 24TH

09:00-09:50 Session 7: Deep Networks
09:00
The Power of Depth for Feedforward Neural Networks

ABSTRACT. We show that there are simple functions on R^d, expressible by small 3-layer feedforward neural networks, which cannot be approximated by any 2-layer network, to more than a certain constant accuracy, unless its width is exponential in the dimension. The result holds for most continuous activation functions, including rectified linear units and sigmoids, and is a formal demonstration that depth -- even if increased by 1 -- can be exponentially more valuable than width for standard feedforward neural networks. Moreover, compared to related results in the context of Boolean functions, our result requires fewer assumptions, and the proof techniques and construction are very different.

09:10
Benefits of depth in neural networks

ABSTRACT. For any positive integer k, there exist neural networks with Θ(k^3) layers, Θ(1) nodes per layer, and Θ(1) distinct parameters which cannot be approximated by networks with O(k) layers unless they are exponentially large --- they must possess Ω(2^k) nodes. This result is proved here for a class of nodes termed semi-algebraic gates which includes the common choices of ReLU, maximum, indicator, and piecewise polynomial functions, therefore establishing benefits of depth against not just standard networks with ReLU gates, but also convolutional networks with ReLU and maximization gates, and boosted decision trees (in this last case with a stronger separation: Ω(2^(k^3)) total tree nodes are required).
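
As a purely illustrative aside (a minimal NumPy sketch, not the construction from the paper, with made-up sizes): composing a two-ReLU "tent" map with itself k times produces a function with 2^k monotone linear pieces using only O(k) layers and O(1) units per layer, which is the flavour of function that a shallow network can only match by being very wide.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def tent(x):
    # Tent map on [0, 1] written with two ReLUs: 2*relu(x) - 4*relu(x - 0.5).
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def deep_tent(x, k):
    # k compositions of the tent map: a depth-O(k) ReLU network with O(1)
    # units per layer whose graph has 2^k monotone linear pieces on [0, 1].
    for _ in range(k):
        x = tent(x)
    return x

xs = np.linspace(0.0, 1.0, 2001)
ys = deep_tent(xs, 4)
# Count the monotone pieces by counting sign changes of the discrete slope.
print(int(np.sum(np.diff(np.sign(np.diff(ys))) != 0)) + 1)  # prints 16 = 2^4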

09:20
On the Expressive Power of Deep Learning: A Tensor Analysis

ABSTRACT. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition.

Using tools from measure theory and matrix algebra, we prove that, besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
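
As a small illustration of the shallow-network/CP correspondence stated above (a hedged sketch with illustrative sizes and a generic feature map, not the paper's SimNet architecture): a single-hidden-layer network that takes inner products per input, multiplies them across inputs, and then forms a linear combination computes exactly the contraction of a rank-Z CP coefficient tensor with the inputs' feature vectors.

import numpy as np

rng = np.random.default_rng(0)
N, M, Z = 3, 4, 5   # 3 inputs, 4 features per input, 5 hidden units (illustrative sizes)

a = rng.standard_normal((Z, N, M))   # hidden-layer weights
lam = rng.standard_normal(Z)         # output weights

def shallow_score(feats):
    # Shallow arithmetic circuit: inner products per input, product pooling, linear output.
    return sum(lam[z] * np.prod([a[z, i] @ feats[i] for i in range(N)]) for z in range(Z))

def cp_coefficient_tensor():
    # The same function's coefficient tensor is a rank-Z CP decomposition:
    # T = sum_z lam[z] * a[z,0] (outer) a[z,1] (outer) a[z,2].
    T = np.zeros((M,) * N)
    for z in range(Z):
        rank_one = np.ones(())
        for i in range(N):
            rank_one = np.multiply.outer(rank_one, a[z, i])
        T += lam[z] * rank_one
    return T

feats = [rng.standard_normal(M) for _ in range(N)]
T = cp_coefficient_tensor()
contraction = np.einsum('abc,a,b,c->', T, *feats)      # einsum string written for N = 3
print(np.allclose(shallow_score(feats), contraction))  # True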

09:30
Cortical Computation via Iterative Constructions

ABSTRACT. We study Boolean functions of an arbitrary number of input variables that can be realized by simple iterative constructions based on constant-size primitives. This restricted type of construction needs little global coordination or control and thus is a candidate for neurally feasible computation. Valiant's construction of a majority function can be realized in this manner and, as we show, can be generalized to any uniform threshold function. We study the rate of convergence, finding that while linear convergence to the correct function can be achieved for any threshold using a fixed set of primitives, for quadratic convergence, the size of the primitives must grow as the threshold approaches 0 or 1. We also study finite realizations of this process and the learnability of the functions realized. We show that the constructions realized are accurate outside a small interval near the target threshold, where the size of the construction grows as the inverse square of the interval width. This phenomenon, that errors are higher closer to thresholds (and thresholds closer to the boundary are harder to represent), is a well-known cognitive finding.
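
To make the amplification idea concrete (an illustrative sketch only, using the familiar majority-of-3 primitive, which amplifies around the threshold 1/2): if each wire carries a 1 independently with probability p, a MAJ3 gate fed three fresh wires outputs 1 with probability 3p^2 - 2p^3, and iterating this map drives p toward 0 or 1 according to which side of 1/2 it starts on -- slowly when p starts near the threshold, matching the convergence-rate discussion above.

# Iterating the majority-of-3 amplification map p -> 3p^2 - 2p^3.
# Biases above 1/2 are driven to 1 and biases below 1/2 to 0; convergence
# is slow for starting biases close to the threshold.
def maj3_map(p):
    return 3 * p**2 - 2 * p**3

for p0 in (0.45, 0.49, 0.51, 0.55):
    p = p0
    for _ in range(20):
        p = maj3_map(p)
    print(f"start {p0:.2f} -> after 20 rounds {p:.6f}")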

09:40
A Guide to Learning Arithmetic Circuits

ABSTRACT. An \emph{arithmetic circuit} is a directed acyclic graph in which the operations are $\{+,\times\}$. In this paper, we exhibit several connections between learning algorithms for arithmetic circuits and other problems. In particular, we show that:

\begin{itemize} \item Efficient learning algorithms for arithmetic circuit classes imply explicit exponential lower bounds.

\item General circuits and formulas can be learned efficiently with membership and equivalence queries iff they can be learned efficiently with membership queries only.

\item Low-query learning algorithms for certain classes of circuits imply explicit rigid matrices.

\item Learning algorithms for multilinear depth-3 and depth-4 circuits must compute square roots. \end{itemize}
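
For readers less familiar with the model, the following is a minimal illustrative sketch (not taken from the paper) of an arithmetic circuit: a DAG whose internal gates compute + or ×, evaluated with memoization so that shared subcircuits are computed once.

from math import prod

# A tiny arithmetic circuit computing (x + y) * (x + 2); the input gate 'x' is shared.
# Each gate is ('var', name), ('const', value), or (op, [child gates]) with op in {'+', '*'}.
circuit = {
    'x':   ('var', 'x'),
    'y':   ('var', 'y'),
    'two': ('const', 2),
    'g1':  ('+', ['x', 'y']),
    'g2':  ('+', ['x', 'two']),
    'out': ('*', ['g1', 'g2']),
}

def evaluate(gate, assignment, cache=None):
    cache = {} if cache is None else cache
    if gate not in cache:
        node = circuit[gate]
        if node[0] == 'var':
            cache[gate] = assignment[node[1]]
        elif node[0] == 'const':
            cache[gate] = node[1]
        else:
            children = [evaluate(c, assignment, cache) for c in node[1]]
            cache[gate] = sum(children) if node[0] == '+' else prod(children)
    return cache[gate]

print(evaluate('out', {'x': 3, 'y': 1}))  # (3 + 1) * (3 + 2) = 20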

09:50-10:10 Coffee Break
10:10-11:10 Session 8: Tensor Methods/Programming Hierarchies
10:10
How to calculate partition functions using convex programming hierarchies: provable bounds for variational methods

ABSTRACT. We consider the problem of approximating partition functions for Ising models. We make use of recent tools in combinatorial optimization -- the Sherali-Adams and Lasserre convex programming hierarchies -- in combination with variational methods, to get algorithms for calculating partition functions in these families. These techniques give new, non-trivial approximation guarantees for the partition function beyond the regime of correlation decay. They also generalize some classical results from statistical physics about the Curie-Weiss ferromagnetic Ising model, as well as provide a partition function counterpart of classical results about max-cut on dense graphs (\cite{arora1995polynomial}). With this, we connect techniques from two areas that have apparently been disparate up to now -- optimization, and counting/partition function approximation (i.e., #P-type problems).

Furthermore, we design what are, to the best of our knowledge, the first provable convex variational methods. Though the literature contains a host of convex versions of variational methods (\cite{wainwright2003tree, wainwright2005new, heskes2006convexity, meshi2009convexifying}), they come with no guarantees (apart from some extremely special cases, such as when the graph has a single cycle \cite{weiss2000correctness}). We consider dense and expander-like graphs, and interestingly, the reason our approach works on these types of graphs is that local correlations propagate to global correlations -- completely the opposite of algorithms based on correlation decay. In the process we design novel entropy approximations based on the low-order moments of a distribution.
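
For context, the variational methods referred to above start from the Gibbs variational principle for the log-partition function, stated here for reference (a standard identity, not a contribution of the paper):

\[
\log Z \;=\; \log \sum_{x \in \{\pm1\}^n} \exp\Big(\sum_{i<j} J_{ij} x_i x_j + \sum_i h_i x_i\Big)
\;=\; \max_{\mu}\ \Big\{ \mathbb{E}_{x\sim\mu}\Big[\sum_{i<j} J_{ij} x_i x_j + \sum_i h_i x_i\Big] + H(\mu) \Big\},
\]
where the maximum ranges over all distributions $\mu$ on $\{\pm1\}^n$ and $H(\mu)$ is the Shannon entropy. Variational methods replace this intractable maximization by one over a relaxed set of (pseudo-)moments together with an entropy approximation; the convex programming hierarchies above supply the pseudo-moment relaxations, and the entropy approximations built from low-order moments play the role of $H(\mu)$.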

10:30
Noisy Tensor Completion via the Sum-of-Squares Hierarchy

ABSTRACT. In the noisy tensor completion problem we observe $m$ entries (whose location is chosen uniformly at random) from an unknown $n_1 \times n_2 \times n_3$ tensor $T$. We assume that $T$ is entry-wise close to being rank $r$. Our goal is to fill in its missing entries using as few observations as possible. Let $n = \max(n_1, n_2, n_3)$. We show that if $m = n^{3/2} r$ then there is a polynomial time algorithm based on the sixth level of the sum-of-squares hierarchy for completing it. Our estimate agrees with almost all of $T$'s entries almost exactly and works even when our observations are corrupted by noise. This is also the first algorithm for tensor completion that works in the overcomplete case when $r > n$, and in fact it works all the way up to $r = n^{3/2-\epsilon}$.

Our proofs are short and simple and are based on establishing a new connection between noisy tensor completion (through the language of Rademacher complexity) and the task of refuting random constraint satisfaction problems. This connection seems to have gone unnoticed even in the context of matrix completion. Furthermore, we use this connection to show matching lower bounds. Our main technical result is in characterizing the Rademacher complexity of the sequence of norms that arise in the sum-of-squares relaxations to the tensor nuclear norm. These results point to an interesting new direction: Can we explore computational vs. sample complexity tradeoffs through the sum-of-squares hierarchy?
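
A minimal NumPy sketch of the observation model in the problem statement above (illustrative sizes, sampling with replacement; the sum-of-squares algorithm itself is not reproduced here):

import numpy as np

rng = np.random.default_rng(1)
n, r, m, sigma = 30, 3, 4000, 0.1      # illustrative sizes, not the paper's exact regime

# An exactly rank-r n x n x n tensor T = sum_k A[:,k] (outer) B[:,k] (outer) C[:,k].
A, B, C = (rng.standard_normal((n, r)) for _ in range(3))
T = np.einsum('ik,jk,lk->ijl', A, B, C)

# Reveal m uniformly random entries and corrupt each observation with additive noise.
idx = rng.integers(0, n, size=(m, 3))
observations = T[idx[:, 0], idx[:, 1], idx[:, 2]] + sigma * rng.standard_normal(m)
print(observations.shape)              # (m,) noisy samples handed to a completion algorithm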

10:50
Basis Learning as an Algorithmic Primitive

ABSTRACT. A number of important problems in theoretical computer science and machine learning can be interpreted as recovering a certain basis. These include symmetric matrix eigendecomposition, certain tensor decompositions, Independent Component Analysis (ICA), spectral clustering and Gaussian mixture learning. Each of these problems reduces to an instance of our general model, which we call a "Basis Encoding Function" (BEF). We show that learning a basis within this model can then be provably and efficiently achieved using a first order iteration algorithm (gradient iteration). Our algorithm goes beyond tensor methods while generalizing a number of existing algorithms---e.g., the power method for symmetric matrices, the tensor power iteration for orthogonally decomposable tensors, and cumulant-based FastICA---all within a broader function-based dynamical systems framework. Our framework also gives a unified account of an unusual phenomenon observed in these domains: that they can be solved using efficient non-convex optimization. Specifically, we describe a class of BEFs such that their local maxima on the unit sphere are in one-to-one correspondence with the basis elements. This description relies on a certain "hidden convexity" property of these functions.

We provide a complete theoretical analysis of the gradient iteration even when the BEF is perturbed. We show convergence and complexity bounds polynomial in dimension and other relevant parameters, such as perturbation size. Our perturbation results can be considered as a non-linear version of the classical Davis-Kahan theorem for perturbations of eigenvectors of symmetric matrices. In addition we show that our algorithm exhibits fast (superlinear) convergence and relate the speed of convergence to the properties of the BEF. Moreover, the gradient iteration algorithm can be easily and efficiently implemented in practice. Finally we apply our framework by providing the first provable algorithm for recovery in a general perturbed ICA model.
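
A minimal sketch of the gradient iteration in its simplest, unperturbed instance (illustrative only; the paper's framework covers far more general basis encoding functions): for $f(u) = u^\top A u$ with a symmetric positive definite $A$, normalizing the gradient at every step is exactly the power method, and the same template -- swap in a different gradient -- yields the tensor power iteration and FastICA-style updates mentioned above.

import numpy as np

def gradient_iteration(grad, u0, iters=200):
    # Generic template: u <- grad f(u) / ||grad f(u)||.
    u = u0 / np.linalg.norm(u0)
    for _ in range(iters):
        g = grad(u)
        u = g / np.linalg.norm(g)
    return u

rng = np.random.default_rng(0)
n = 6
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)            # symmetric positive definite test matrix

# For f(u) = u^T A u the gradient is 2 A u, so gradient iteration is the power method.
u = gradient_iteration(lambda v: 2 * A @ v, rng.standard_normal(n))
leading = np.linalg.eigh(A)[1][:, -1]  # leading eigenvector, for comparison
print(abs(u @ leading))                # close to 1.0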

11:10-11:30 Break
12:30-14:30 Lunch Break
14:30-15:50 Session 10: Bandits and Reinforcement Learning
14:30
Multi-scale exploration of convex functions and bandit convex optimization (Best Paper Award)

ABSTRACT. We construct a new map from a convex function to a distribution on its domain, with the property that this distribution is a multi-scale exploration of the function. We use this map to solve a decade-old open problem in adversarial bandit convex optimization by showing that the minimax regret for this problem is $\tilde{O}(\mathrm{poly}(n) \sqrt{T})$, where $n$ is the dimension and $T$ the number of rounds. This bound is obtained by studying the dual Bayesian maximin regret via the information ratio analysis of Russo and Van Roy, and then using the multi-scale exploration to construct a new algorithm for the Bayesian convex bandit problem.

14:50
Delay and Cooperation in Nonstochastic Bandits

ABSTRACT. We study networks of communicating learning agents that cooperate to solve a common nonstochastic bandit problem. Agents use an underlying communication network to get messages about actions selected by other agents, and drop messages that took more than $d$ hops to arrive, where $d$ is a delay parameter. We introduce \textsc{Exp3-Coop}, a cooperative version of the \textsc{Exp3} algorithm, and prove that with $K$ actions and $N$ agents the average per-agent regret after $T$ rounds is at most of order $\sqrt{\bigl(d+1 + \tfrac{K}{N}\alpha_{\le d}\bigr)(T\ln K)}$, where $\alpha_{\le d}$ is the independence number of the $d$-th power of the connected communication graph $G$. We then show that for any connected graph, when $d=\sqrt{K}$ the regret bound is $K^{1/4}\sqrt{T}$, strictly better than the minimax regret $\sqrt{KT}$ for noncooperating agents. More informed choices of $d$ lead to bounds that are arbitrarily close to the full-information minimax regret $\sqrt{T\ln K}$ when $G$ is dense. When $G$ has sparse components, we show that a variant of \textsc{Exp3-Coop}, allowing agents to choose their parameters according to their centrality in $G$, strictly improves the regret. Finally, as a by-product of our analysis, we provide the first characterization of the minimax regret for bandit learning with delay.
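
For orientation, here is a minimal single-agent sketch of the importance-weighted exponential-weights update that \textsc{Exp3} (and hence \textsc{Exp3-Coop}) is built on; the cooperative ingredient -- folding in delayed loss estimates received from neighbours -- is deliberately omitted, and all parameter choices below are illustrative.

import numpy as np

rng = np.random.default_rng(0)
K, T = 5, 10000
eta = np.sqrt(2 * np.log(K) / (T * K))        # a standard learning-rate choice

losses = rng.uniform(size=(T, K))             # an oblivious adversary, drawn in advance
losses[:, 2] *= 0.5                           # arm 2 is better on average

cum_est = np.zeros(K)                         # cumulative importance-weighted loss estimates
incurred = 0.0
for t in range(T):
    p = np.exp(-eta * (cum_est - cum_est.min()))
    p /= p.sum()
    arm = rng.choice(K, p=p)
    incurred += losses[t, arm]
    cum_est[arm] += losses[t, arm] / p[arm]   # unbiased estimate of the full loss vector
print(round(incurred / T, 3), round(losses[:, 2].mean(), 3))  # algorithm vs. best arm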

15:10
Policy Error Bounds for Model-Based Reinforcement Learning with Factored Linear Models

ABSTRACT. In this paper we study a model-based approach to calculating approximately optimal policies in Markovian Decision Processes. In particular, we derive novel bounds on the loss of using a policy derived from a factored linear model, a class of models which generalize virtually all previous models that come with strong computational guarantees. For the first time in the literature, we derive performance bounds for model-based techniques where the model inaccuracy is measured in weighted norms. Moreover, our bounds show a decreased sensitivity to the discount factor and, unlike similar bounds derived for other approaches, they are insensitive to measure mismatch. Similarly to previous works, our proofs are based on contraction arguments, but with two main differences: we use carefully constructed norms building on Banach lattices, and the contraction property is only assumed for operators acting on ``compressed'' spaces, thus weakening previous assumptions while strengthening previous results.

15:30
Reinforcement Learning of POMDPs using Spectral Methods

ABSTRACT. We propose a new reinforcement learning algorithm for partially observable Markov decision processes (POMDPs) based on spectral decomposition methods. While spectral methods have been previously employed for consistent learning of (passive) latent variable models such as hidden Markov models, POMDPs are more challenging since the learner interacts with the environment and possibly changes the future observations in the process. We devise a learning algorithm that runs through episodes; in each episode, we employ spectral techniques to learn the POMDP parameters from a trajectory generated by a fixed policy. At the end of the episode, an optimization oracle returns the optimal memoryless planning policy which maximizes the expected reward based on the estimated POMDP model. We prove an order-optimal regret bound with respect to the optimal memoryless policy and efficient scaling with respect to the dimensionality of observation and action spaces.

15:50-16:20 Coffee Break
16:20-17:30 Session 11: PAC Learning
16:20
Optimal Learning via the Fourier Transform for Sums of Independent Integer Random Variables

ABSTRACT. We study the structure and learnability of sums of independent integer random variables (SIIRVs). For $k \in \Z_{+}$, a {\em$k$-SIIRV of order $n \in \Z_{+}$} is the probability distribution of the sum of $n$ mutually independent random variables each supported on $\{0, 1, \dots, k-1\}$. We denote by ${\cal S}_{n,k}$ the set of all $k$-SIIRVs of order $n$.

How many samples are required to learn an arbitrary distribution in ${\cal S}_{n,k}$? In this paper, we tightly characterize the sample and computational complexity of this problem. More precisely, we design a computationally efficient algorithm that uses $\widetilde{O}(k/\eps^2)$ samples, and learns an arbitrary $k$-SIIRV within error $\eps,$ in total variation distance. Moreover, we show that the {\em optimal} sample complexity of this learning problem is $\Theta((k/\eps^2)\sqrt{\log(1/\eps)}),$ i.e., we prove an upper bound and a matching information-theoretic lower bound. Our algorithm proceeds by learning the Fourier transform of the target $k$-SIIRV in its effective support. Its correctness relies on the {\em approximate sparsity} of the Fourier transform of $k$-SIIRVs -- a structural property that we establish, roughly stating that the Fourier transform of $k$-SIIRVs has small magnitude outside a small set.

Along the way we prove several new structural results about $k$-SIIRVs. As one of our main structural contributions, we give an efficient algorithm to construct a sparse {\em proper} $\eps$-cover for ${\cal S}_{n,k},$ in total variation distance. We also obtain a novel geometric characterization of the space of $k$-SIIRVs. Our characterization allows us to prove a tight lower bound on the size of $\eps$-covers for ${\cal S}_{n,k}$ -- establishing that our cover upper bound is optimal -- and is the key ingredient in our tight sample complexity lower bound.

Our approach of exploiting the sparsity of the Fourier transform in distribution learning is general, and has recently found additional applications. In a subsequent work~\cite{DKS15c}, we use a generalization of this idea to obtain the first computationally efficient learning algorithm for Poisson multinomial distributions. In~\cite{DKS15b}, we build on our Fourier-based approach to obtain the fastest known proper learning algorithm for Poisson binomial distributions ($2$-SIIRVs).
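
A heavily simplified, purely illustrative NumPy sketch of the Fourier-flavoured idea (the paper's algorithm restricts to the effective support and controls the error carefully; here we just take the DFT of an empirical histogram, discard small coefficients, and invert -- all sizes and thresholds below are made up):

import numpy as np

rng = np.random.default_rng(0)
n, k, samples = 100, 3, 20000          # a 3-SIIRV of order 100; illustrative sizes

# Draw samples from a random k-SIIRV: a sum of n independent {0,...,k-1}-valued variables.
probs = rng.dirichlet(np.ones(k), size=n)
data = np.zeros(samples, dtype=int)
for i in range(n):
    data += rng.choice(k, size=samples, p=probs[i])

# Empirical histogram on the observed support, DFT, keep only large coefficients, invert.
lo = data.min()
emp = np.bincount(data - lo) / samples
coeffs = np.fft.fft(emp)
coeffs[np.abs(coeffs) < 0.02] = 0.0    # illustrative sparsification threshold
est = np.maximum(np.fft.ifft(coeffs).real, 0.0)
est /= est.sum()
print(0.5 * np.abs(est - emp).sum())   # total variation distance to the raw histogram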

16:40
Properly Learning Poisson Binomial Distributions in Almost Polynomial Time

ABSTRACT. We give an algorithm for properly learning Poisson binomial distributions. A Poisson binomial distribution (PBD) of order $n \in \Z_+$ is the discrete probability distribution of the sum of $n$ mutually independent Bernoulli random variables. Given $\widetilde{O}(1/\eps^2)$ samples from an unknown PBD $\p$, our algorithm runs in time $(1/\eps)^{O(\log \log (1/\eps))}$, and outputs a hypothesis PBD that is $\eps$-close to $\p$ in total variation distance. The sample complexity of our algorithm is known to be nearly-optimal, up to logarithmic factors, as established in previous work~\cite{DDS12stoc}. However, the previously best known running time for properly learning PBDs~\cite{DDS12stoc, DKS15} was $(1/\eps)^{O(\log(1/\eps))}$, and was essentially obtained by enumeration over an appropriate $\eps$-cover. We remark that the running time of this cover-based approach cannot be improved, as any $\eps$-cover for the space of PBDs has size $(1/\eps)^{\Omega(\log(1/\eps))}$~\cite{DKS15}.

As one of our main contributions, we provide a novel structural characterization of PBDs, showing that any PBD $\p$ is $\eps$-close to another PBD $\q$ with $O(\log(1/\eps))$ distinct parameters. More precisely, we prove that, for all $\eps >0,$ there exists an explicit collection $\cal{M}$ of $(1/\eps)^{O(\log \log (1/\eps))}$ vectors of multiplicities, such that for any PBD $\p$ there exists a PBD $\q$ with $O(\log(1/\eps))$ distinct parameters whose multiplicities are given by some element of ${\cal M}$, such that $\q$ is $\eps$-close to $\p.$ Our proof combines tools from Fourier analysis and algebraic geometry.

Our approach to the proper learning problem is as follows: Starting with an accurate non-proper hypothesis, we fit a PBD to this hypothesis. This fitting problem can be formulated as a natural polynomial optimization problem. Our aforementioned structural characterization allows us to reduce the corresponding fitting problem to a collection of $(1/\eps)^{O(\log \log(1/\eps))}$ carefully constructed systems of low-degree polynomial inequalities. We show that each such system can be solved in time $(1/\eps)^{O(\log \log(1/\eps))}$, which yields the overall running time of our algorithm.

16:50
Learning and Testing Junta Distributions

ABSTRACT. We consider the problem of learning distributions in the presence of irrelevant features. This problem is formalized by introducing a new notion of $k$-junta distributions. Informally, a distribution $D$ over the domain $X^n$ is a $k$-junta distribution with respect to another distribution $U$ over the same domain if there is a set $J \subseteq [n]$ of size $|J| \leq k$ that captures the difference between $D$ and $U$.

We show that it is possible to learn $k$-junta distributions with respect to the uniform distribution over the Boolean hypercube $\{0,1\}^n$ in time $\poly(n^k, 1/\epsilon)$. This result is obtained via a new Fourier-based learning algorithm inspired by the Low-Degree Algorithm of Linial, Mansour, and Nisan (1993).

We also consider the problem of testing whether an unknown distribution is a $k$-junta distribution with respect to the uniform distribution. We give a nearly-optimal algorithm for this task. Both the analysis of the algorithm and the lower bound showing its optimality are obtained by establishing connections between the problem of testing junta distributions and testing uniformity of weighted collections of distributions.
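
As a small illustration of the Fourier-style statistics underlying such algorithms (a hedged sketch, not the paper's algorithm or tester): under the natural reading of the definition above, the non-trivial character averages $\mathbb{E}_{x\sim D}[(-1)^{\sum_{i\in S} x_i}]$ of a $k$-junta distribution (with respect to uniform) can be non-zero only for subsets $S$ of the $k$ relevant coordinates, so estimating these averages for all small $S$ points to the junta.

import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, k, samples = 8, 2, 50000

# A toy 2-junta distribution over {0,1}^n: coordinates 0 and 1 are correlated,
# every other coordinate is an independent uniform bit.
x = rng.integers(0, 2, size=(samples, n))
x[:, 1] = np.where(rng.random(samples) < 0.9, x[:, 0], 1 - x[:, 0])

# Estimate the character averages E[(-1)^{sum_{i in S} x_i}] for all non-empty |S| <= k.
# Under the uniform distribution all of these are 0, so a large estimate flags a
# subset of the relevant (junta) coordinates.
for size in range(1, k + 1):
    for S in combinations(range(n), size):
        estimate = np.mean((-1.0) ** x[:, list(S)].sum(axis=1))
        if abs(estimate) > 0.1:
            print(S, round(estimate, 3))   # only (0, 1) survives, with value about 0.8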

17:00
Efficient algorithms for learning and 1-bit compressed sensing under asymmetric noise

ABSTRACT. We study the \emph{approximate recovery} problem: Given corrupted $1$-bit measurements of the form $\sgn(w^* \cdot x_i)$, recover a vector $w$ that is a good approximation to $w^* \in \R^d$. This problem has been considered by the learning theory and signal processing communities from different perspectives. The fundamental question of interest in this space is how to design noise-tolerant and computationally-efficient algorithms. Moreover, when $w^*$ is $t$-sparse, we must use a number of measurements $x_i$ that scales polynomially in $t$ and only logarithmically in $d$, the ambient dimension. In both communities, prior work has mainly focused on the case of symmetric noise. We consider this problem in two highly asymmetric noise models: the \emph{bounded noise model} and the \emph{adversarial (agnostic) noise model}.

The {\em bounded noise} (a.k.a Massart) model is a realistic noise model where an adversary can flip the measurement of each point $x$ with probability $\eta(x)\leq \eta< 1/2$. In this work, we first consider the non-sparse setting and introduce a new computationally-efficient algorithm that returns $w$ such that $\| w- w^*\| \leq \epsilon$ for any constant $\beta =\frac{1-\eta}{2}$, in time $\poly(d, \frac 1 \epsilon)$ when the distribution is isotropic log-concave. This improves over the work of \cite{awasthi2015efficient} that required the noise to be as small as $\eta\approx 10^{-7}$. We then introduce an attribute-efficient variant of this algorithm that achieves the same guarantee with a number of measurements $\poly(t, \log(d), \frac 1 {\epsilon})$ when $\|w^*\|_0\leq t$. For the adversarial noise, where any $\tau$ fraction of measurements can be corrupted, we introduce an attribute-efficient variation of the algorithm of \cite{awasthi2014power} that returns $w$ such that $\|w-w^*\| \leq O(\tau) + \epsilon$, with $\tilde\Omega(t ~ \polylog(d) ~ \frac 1 {\epsilon^3})$ measurements. This improves over the best known result of \cite{Plan2014robust} in $1$-bit compressed sensing, which guarantees $\|w-w^*\| \leq O( \tau\sqrt{\log \frac 1 \tau} ) + \epsilon$, with $\tilde\Omega(t ~ \polylog(d) ~ \frac 1 {\epsilon^6})$ measurements, in terms of both the sample complexity and the approximation factor.
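
A minimal NumPy sketch of the measurement model in the bounded (Massart) noise setting described above (purely illustrative, with made-up sizes; neither recovery algorithm is reproduced): each label is $\sgn(w^* \cdot x)$, flipped independently with a point-dependent probability $\eta(x)\leq \eta < 1/2$.

import numpy as np

rng = np.random.default_rng(0)
d, t, m, eta = 200, 5, 1000, 0.3        # illustrative sizes

w_star = np.zeros(d)
w_star[:t] = rng.standard_normal(t)     # a t-sparse target direction
w_star /= np.linalg.norm(w_star)

x = rng.standard_normal((m, d))         # isotropic (Gaussian, hence log-concave) measurements
clean = np.sign(x @ w_star)

# Bounded (Massart) noise: each measurement is flipped with its own probability
# eta(x) <= eta < 1/2 -- here an arbitrary point-dependent rate capped at eta.
flip_prob = eta * np.abs(np.tanh(x[:, 0]))
flips = rng.random(m) < flip_prob
y = np.where(flips, -clean, clean)
print(f"{flips.mean():.1%} of the 1-bit measurements were flipped")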

17:10
Complexity theoretic limitations on learning DNF's

ABSTRACT. Using the recently developed framework of Daniely, Shalev-Shwartz and Linial, we show that under a natural assumption on the complexity of random K-SAT, learning DNF formulas is hard. Furthermore, the same assumption implies the hardness of various learning problems, including intersections of $\omega(\log(n))$ halfspaces, agnostically learning conjunctions, as well as virtually all (distribution-free) learning problems that were previously shown hard (under various complexity assumptions).

17:20
Sign rank versus VC dimension

ABSTRACT. We study the maximum possible sign rank of $N \times N$ sign matrices with a given VC dimension $d$. For $d=1$, this maximum is three. For $d=2$, this maximum is $\tilde{\Theta}(N^{1/2})$. For $d>2$, similar but slightly less accurate statements hold. Our lower bounds improve over previous ones by Ben-David et al.~and can be interpreted as exhibiting a weakness of kernel-based classifiers. Our upper bounds, on the other hand, can be interpreted as exhibiting the universality of kernel-based classifiers.
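
For reference, the quantity being bounded is the standard notion of sign rank (a textbook definition, not specific to this paper):

\[
\mathrm{sign\text{-}rank}(S) \;=\; \min\bigl\{ \mathrm{rank}(M) : M \in \mathbb{R}^{N \times N},\ M_{ij} S_{ij} > 0 \text{ for all } i,j \bigr\},
\]
i.e., the smallest rank of a real matrix whose entries agree in sign with the $\pm 1$ matrix $S$.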

The lower bounds are obtained by probabilistic constructions, using a theorem of Warren in real algebraic topology. The upper bounds are obtained using a result of Welzl about spanning trees with low stabbing number, and using the moment curve.

The upper bound technique is also used to: (i) provide estimates on the number of classes of a given VC dimension, and the number of maximum classes of a given VC dimension -- answering a question of Frankl from '89, and (ii) design an efficient algorithm that provides an $O(N/\log(N))$ multiplicative approximation for the sign rank (computing the sign rank is equivalent to the existential theory of the reals).

We also observe a general connection between sign rank and spectral gaps which is based on Forster's argument. Consider the $N \times N$ adjacency matrix of a $\Delta$ regular graph with a second eigenvalue of absolute value $\lambda$ and $\Delta \leq N/2$. We show that the sign rank of the signed version of this matrix is at least $\Delta/\lambda$. We use this connection to prove the existence of a maximum class $C\subseteq\{\pm 1\}^N$ with VC dimension $2$ and sign rank $\tilde{\Theta}(N^{1/2})$. This answers a question of Ben-David et al.~regarding the sign rank of large VC classes. We also describe limitations of this approach, in the spirit of the Alon-Boppana theorem.

We further describe connections to communication complexity, geometry, learning theory, and combinatorics.

17:30-17:50 Break