COLT 2015: 28TH ANNUAL CONFERENCE ON LEARNING THEORY
PROGRAM FOR MONDAY, JULY 6TH

09:00-10:00 Session 18: Invited Talk
09:00
Synthetic theory of Ricci curvature - when information theory, optimization, geometry and gradient flows meet
10:00-10:40 Session 19: Probabilistic Models and Reinforcement Learning
10:00
Computational Lower Bounds for Community Detection on Random Graphs
SPEAKER: unknown

ABSTRACT. This paper studies the problem of detecting the presence of a small dense community planted in a large Erd\H{o}s-R\'enyi random graph $\mathcal{G}(N,q)$, where the edge probability within the community exceeds $q$ by a constant factor. Assuming the hardness of the planted clique detection problem, we show that the computational complexity of detecting the community exhibits the following phase transition phenomenon: As the graph size $N$ grows and the graph becomes sparser according to $q=N^{-\alpha}$, there exists a critical value of $\alpha = \frac{2}{3}$, below which there exists a computationally intensive procedure that can detect far smaller communities than any computationally efficient procedure, and above which a linear-time procedure is statistically optimal. The results also lead to average-case hardness results for recovering the dense community and approximating the densest $K$-subgraph.

10:05
Bad Universal Priors and Notions of Optimality
SPEAKER: unknown

ABSTRACT. A big open question of algorithmic information theory is the choice of the universal Turing machine (UTM). For Kolmogorov complexity and Solomonoff induction we have invariance theorems: the choice of the UTM changes bounds only by a constant. For the universally intelligent agent AIXI (Hutter, 2005) no invariance theorem is known. Our results are entirely negative: we discuss cases in which unlucky or adversarial choice of the UTM causes AIXI to misbehave drastically. We show that Legg-Hutter intelligence and thus balanced Pareto optimality is entirely subjective, and that every policy is Pareto optimal in the class of all computable environments. This undermines all existing optimality properties for AIXI. While it may still serve as a gold standard for AI, our results imply that AIXI is a relative theory, dependent on the choice of the UTM.

10:10
Thompson Sampling for Learning Parameterized Markov Decision Processes
SPEAKER: unknown

ABSTRACT. We consider reinforcement learning in parameterized Markov Decision Processes (MDPs), where the parameterization may induce correlation across transition probabilities or rewards. Consequently, observing a particular state transition might yield useful information about other, unobserved, parts of the MDP. We present a version of Thompson sampling for parameterized reinforcement learning problems, and derive a frequentist regret bound for priors over general parameter spaces. The result shows that the number of instants where suboptimal actions are chosen scales logarithmically with time, with high probability. It holds for priors without any additional, specific closed-form structure such as conjugate or product-form priors. Moreover, the constant factor in the logarithmic scaling encodes the information complexity of learning the MDP, in terms of the Kullback-Leibler geometry of the parameter space.
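To fix the intuition behind the scheme, here is a minimal sketch of Thompson sampling on a Bernoulli bandit (a one-state MDP) — sample a parameter from the posterior, act greedily under the sample, update the posterior. This is an illustration of the general idea only, not the paper's parameterized-MDP algorithm; all names and defaults below are our own.

```python
import random

def thompson_sampling(arms_true_p, horizon, rng=random.Random(0)):
    """Thompson sampling for a Bernoulli bandit (a one-state MDP).

    Beta(1, 1) priors; at each step sample a parameter per arm from the
    posterior, act greedily under the sample, then update the posterior.
    """
    n = len(arms_true_p)
    successes = [1] * n  # Beta posterior alpha for each arm
    failures = [1] * n   # Beta posterior beta for each arm
    total_reward = 0
    for _ in range(horizon):
        # Sample a plausible parameter for each arm from its posterior.
        samples = [rng.betavariate(successes[i], failures[i]) for i in range(n)]
        arm = max(range(n), key=lambda i: samples[i])
        reward = 1 if rng.random() < arms_true_p[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        total_reward += reward
    return total_reward
```

With two arms of means 0.1 and 0.9, the sampler concentrates on the better arm after a handful of pulls, matching the logarithmic scaling of suboptimal choices described in the abstract.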

10:15
Fast Mixing for Discrete Point Processes
SPEAKER: unknown

ABSTRACT. We investigate systematic mechanisms for designing fast mixing Markov chain Monte Carlo algorithms to sample from discrete point processes. Such processes are defined as probability distributions $\mu(S)\propto \exp(f(S))$ over all subsets $S\subseteq V$ of a finite set $V$ through a bounded set function $f:2^V\rightarrow \mathbb{R}$. In particular, a subclass of discrete point processes characterized by submodular functions (which include determinantal point processes, log-submodular distributions, and submodular point processes) has recently gained a lot of interest in machine learning and has been shown to be effective for modeling diversity and coverage. We show that if the set function (not necessarily submodular) displays a natural notion of decay of correlation, then it is possible to design fast mixing Markov chain Monte Carlo methods that provide size-free error bounds on marginal approximations. The conditions that we introduce involve a control on the second order (discrete) derivatives of set functions. We then provide sufficient conditions for fast mixing when the set function is submodular, and specialize our results for canonical examples.
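For orientation, the kind of chain in question can be sketched as a single-site Gibbs sampler whose stationary distribution is $\mu(S)\propto\exp(f(S))$. The function below is illustrative only (its conventions are our assumptions, and it says nothing about mixing time, which is the paper's subject):

```python
import math
import random

def gibbs_sample(f, V, steps, rng=random.Random(0)):
    """Single-site Gibbs sampler for mu(S) proportional to exp(f(S)),
    S ranging over subsets of the finite ground set V (a list).

    Each step picks an element uniformly and resamples its membership
    conditional on the rest; the chain's stationary distribution is mu.
    """
    S = set()
    for _ in range(steps):
        i = rng.choice(V)
        with_i = f(S | {i})
        without_i = f(S - {i})
        # P(i in S | rest) from the ratio of the two subset weights.
        p_in = 1.0 / (1.0 + math.exp(without_i - with_i))
        if rng.random() < p_in:
            S.add(i)
        else:
            S.discard(i)
    return S
```

For a modular $f$ the elements are included independently; the interesting regimes are exactly the correlated ones the abstract's decay-of-correlation condition addresses.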

10:20
On Convergence of Emphatic Temporal-Difference Learning
SPEAKER: Huizhen Yu

ABSTRACT. We consider emphatic temporal-difference learning algorithms for policy evaluation in discounted Markov decision processes with finite spaces. Such algorithms were recently proposed by Sutton, Mahmood, and White as an improved solution to the problem of divergence of off-policy temporal-difference learning with linear function approximation. We present in this paper the first convergence proofs for two emphatic algorithms, ETD($\lambda$) and ELSTD($\lambda$). We prove, under general off-policy conditions, the convergence in $L^1$ for ELSTD($\lambda$) iterates and the almost sure convergence of the approximate value functions calculated by both algorithms. Our analysis involves new techniques with applications beyond emphatic algorithms leading, for example, to the first proof that standard TD($\lambda$) also converges under off-policy training for $\lambda$ sufficiently large.
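For readers unfamiliar with the emphatic scheme, one ETD($\lambda$) update in the style of Sutton, Mahmood, and White can be sketched as below. The notation, constant-interest default, and trace bookkeeping here are our assumptions for illustration, not taken from this abstract.

```python
def etd_step(theta, e, F, phi, phi_next, reward, rho,
             alpha=0.01, gamma=0.99, lam=0.9, interest=1.0):
    """One emphatic TD(lambda) update; theta, e, phi, phi_next are
    equal-length lists of floats.

    rho is the importance-sampling ratio pi(a|s)/b(a|s) for behaviour
    policy b; F is the follow-on trace, and the emphasis M reweights
    this step's update to restore off-policy stability.
    """
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    F = gamma * F + interest                       # follow-on trace
    M = lam * interest + (1 - lam) * F             # emphasis
    e = [rho * (gamma * lam * ei + M * pi) for ei, pi in zip(e, phi)]
    delta = reward + gamma * dot(theta, phi_next) - dot(theta, phi)
    theta = [ti + alpha * delta * ei for ti, ei in zip(theta, e)]
    return theta, e, rho * F                       # carry rho forward
```

On a single repeated on-policy transition with $\gamma=\lambda=0$ this reduces to a plain TD(0) step, which is a convenient sanity check.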

10:25
Faster Algorithms for Testing under Conditional Sampling
SPEAKER: unknown

ABSTRACT. There has been considerable recent interest in distribution tests whose run-time and sample requirements are sublinear in the domain size $k$. We study two of the most important tests under the conditional-sampling model, where each query specifies a subset $S$ of the domain, and the response is a sample drawn from $S$ according to the underlying distribution.

For identity testing, which asks whether the underlying distribution equals a specific given distribution or $\delta$-differs from it, we reduce the known time and sample complexities from $O(\delta^{-4})$ to $O(\delta^{-2})$, thereby matching the information-theoretic lower bound. For closeness testing, which asks whether two distributions underlying observed data sets are equal or different, we reduce the existing complexity from $O(\delta^{-4} \log^5 k)$ to an even sub-logarithmic $O(\delta^{-5} \log \log k)$, thereby providing a better bound for an open problem posed at the Bertinoro Workshop on Sublinear Algorithms.
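The conditional-sampling oracle itself is easy to simulate. The sketch below shows what a single query returns; the uniform fallback for zero-mass query sets is one common convention that we assume here, not something specified by the abstract.

```python
import random

def conditional_sample(p, S, rng=random.Random(0)):
    """Answer one conditional-sampling query.

    p: dict mapping domain elements to probabilities; S: queried subset.
    The response is distributed as p restricted to S, renormalized,
    which is exactly the oracle the conditional-sampling model assumes.
    """
    weights = {x: p.get(x, 0.0) for x in S}
    total = sum(weights.values())
    if total == 0:
        # Convention varies when S has zero mass; we draw uniformly.
        return rng.choice(sorted(S))
    r = rng.random() * total
    acc = 0.0
    for x in sorted(S):
        acc += weights[x]
        if r <= acc:
            return x
    return x  # numerical safety for floating-point round-off
```

The power of the model comes from choosing $S$ adaptively, e.g. pairing a heavy element with a suspected-light one to compare their masses directly.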

10:30
Interactive Fingerprinting Codes and the Hardness of Preventing False Discovery
SPEAKER: unknown

ABSTRACT. We show an essentially tight bound on the number of adaptively chosen statistical queries that a computationally efficient algorithm can answer accurately given n samples from an unknown distribution. A statistical query asks for the expectation of a predicate over the underlying distribution, and an answer to a statistical query is accurate if it is "close" to the correct expectation over the distribution. This question was recently studied by Dwork et al., who showed how to answer ~n^2 queries efficiently, and also by Hardt and Ullman, who showed that answering ~n^3 queries is hard. We close the gap between the two bounds and show that, under a standard hardness assumption, there is no computationally efficient algorithm that, given n samples from an unknown distribution, can give valid answers to O(n^2) adaptively chosen statistical queries. An implication of our results is that computationally efficient algorithms for answering arbitrary, adaptively chosen statistical queries may as well be differentially private.

We obtain our results using a new connection between the problem of answering adaptively chosen statistical queries and a combinatorial object called an interactive fingerprinting code (Fiat and Tassa '01). In order to optimize our hardness result, we give a new Fourier-analytic approach to analyzing fingerprinting codes that is simpler, more flexible, and yields better parameters than previous constructions.

10:40-11:10 Coffee Break
11:10-12:25 Session 20: Regression
11:10
Learning with Square Loss: Localization through Offset Rademacher Complexity
SPEAKER: unknown

ABSTRACT. We consider regression with square loss and general classes of functions without the boundedness assumption. We introduce a notion of offset Rademacher complexity that provides a transparent way to study localization both in expectation and in high probability. For any (possibly non-convex) class, the excess loss of a two-step estimator is shown to be upper bounded by this offset complexity through a novel geometric inequality. In the convex case, the estimator reduces to an empirical risk minimizer. The method recovers the results of \citep{RakSriTsy15} for the bounded case while also providing guarantees without the boundedness assumption. Our high-probability statements for the unbounded case are based on the pathbreaking small-ball analysis of \cite{Mendelson14}.
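As a point of reference for the notation (our rendering, to be checked against the paper itself), the offset Rademacher complexity of a class $\mathcal{F}$ with offset parameter $c>0$ is commonly written as

```latex
\mathfrak{R}^{\mathrm{off}}_n(\mathcal{F}, c)
  \;=\; \mathbb{E}_{\epsilon}\,
  \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n}
  \left[ \epsilon_i f(x_i) \;-\; c\, f(x_i)^2 \right],
```

where the $\epsilon_i$ are i.i.d. Rademacher signs. The negative quadratic term is what produces localization: functions with large empirical norm are penalized inside the supremum.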

11:30
Minimax rates for memory-bounded sparse linear regression
SPEAKER: unknown

ABSTRACT. We establish a minimax lower bound of $\Omega\left(\frac{kd}{b\epsilon}\right)$ on the number of samples needed to estimate the parameters in a $k$-sparse linear regression of dimension $d$ given a memory bound of $b$ bits, where $\epsilon$ is the $L_2$ parameter error. When the covariance of the regressors is the identity matrix, we also provide an algorithm that uses $\tilde{O}(b+k)$ bits and requires $\tilde{O}(\frac{kd}{b\epsilon^2})$ samples to achieve error $\epsilon$. Our lower bound also holds in the more general communication-bounded setting, where instead of a memory bound, at most $b$ bits of information are allowed to be (adaptively) communicated about each sample.

11:50
Algorithms for Lipschitz Learning on Graphs
SPEAKER: unknown

ABSTRACT. We develop fast algorithms for solving regression problems on graphs where one is given the value of a function at some vertices, and must find its smoothest possible extension to all vertices. The extension we compute is the absolutely minimal Lipschitz extension, and is the limit for large $p$ of $p$-Laplacian regularization. We present an algorithm that computes a minimal Lipschitz extension in expected linear time, and an algorithm that computes an absolutely minimal Lipschitz extension in expected time $O (m n)$. The latter algorithm has variants that seem to run much faster in practice. These extensions are particularly amenable to regularization: we can perform $l_{0}$ regularization on the given values in polynomial time and $l_{1}$ regularization on the graph edge weights in time $\widetilde{O} (m^{3/2})$. Our algorithms naturally extend to directed graphs.
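To see what the object is, note that on an unweighted graph the absolutely minimal Lipschitz (inf-harmonic) extension satisfies, at every free vertex, $u(v) = \tfrac{1}{2}(\max_{w\sim v} u(w) + \min_{w\sim v} u(w))$. The sketch below simply iterates that local rule to a fixed point; it is an illustrative approximation, not the paper's expected-$O(mn)$ algorithm.

```python
def amle_iterate(adj, fixed, iters=1000):
    """Approximate the absolutely minimal Lipschitz extension on an
    unweighted graph by fixed-point iteration of the local rule
    u(v) = (max neighbour value + min neighbour value) / 2.

    adj: dict vertex -> list of neighbours; fixed: dict vertex -> value
    at the vertices where the function is given.
    """
    u = {v: fixed.get(v, 0.0) for v in adj}
    free = [v for v in adj if v not in fixed]
    for _ in range(iters):
        for v in free:
            vals = [u[w] for w in adj[v]]
            u[v] = 0.5 * (max(vals) + min(vals))
    return u
```

On a path with endpoints pinned at 0 and 1 the iteration converges to the linear interpolation, which is the unique absolutely minimal extension there.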

12:10
Variable Selection is Hard
SPEAKER: unknown

ABSTRACT. Variable selection for sparse linear regression is the problem of finding, given an m x p matrix B and a target vector y, a sparse vector x such that Bx approximately equals y. Assuming a standard complexity hypothesis, we show that no polynomial-time algorithm can find a k'-sparse x with ||Bx-y||^2 <= h(m,p), where k' = k*2^{log^{1-delta} p} and h(m,p) = p^{C_1} m^{1-C_2}, for arbitrary delta>0, C_1>0, and C_2>0. This is true even under the promise that there is an unknown k-sparse vector x^* satisfying B x^* = y. We prove a similar result for a statistical version of the problem in which the data are corrupted by noise.

To the authors' knowledge, these are the first hardness results for sparse regression that apply when the algorithm simultaneously has k'>k and h(m,p)>0.

12:15
Regularized Linear Regression: A Precise Analysis of the Estimation Error
SPEAKER: unknown

ABSTRACT. Non-smooth regularized convex optimization procedures have emerged as a powerful tool to recover structured signals (sparse, low-rank, etc.) from (possibly compressed) noisy linear measurements. We focus on the problem of linear regression and consider a general class of optimization methods that minimize a loss function measuring the misfit of the model to the observations with an added "structure-inducing" regularization term. Celebrated instances include the LASSO, Group-LASSO, Least-Absolute-Deviations method, etc. We develop a quite general framework for determining precise prediction performance guarantees (e.g. mean-square-error) of such methods for the case of a Gaussian measurement ensemble. The machinery builds upon Gordon's Gaussian min-max theorem under additional convexity assumptions that arise in many practical applications. This theorem associates with a primary optimization (PO) problem a simplified auxiliary optimization (AO) problem from which we can tightly infer properties of the original (PO), such as the optimal cost, the norm of the optimal solution, etc. Our theory applies to general loss functions and regularization and provides guidelines on how to optimally tune the regularizer coefficient when certain structural properties (such as sparsity level, rank, etc.) are known.
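As a concrete member of the class of procedures analyzed — this is a standard solver sketch, not the paper's framework — the LASSO can be minimized by proximal gradient descent (ISTA), alternating a gradient step on the smooth misfit with soft-thresholding for the $\ell_1$ regularizer:

```python
import numpy as np

def lasso_ista(A, y, lam, steps=500):
    """Minimize 0.5 * ||Ax - y||^2 + lam * ||x||_1 by ISTA
    (proximal gradient with soft-thresholding)."""
    x = np.zeros(A.shape[1])
    # Step size 1/L, with L the Lipschitz constant of the smooth part.
    L = np.linalg.norm(A, 2) ** 2
    t = 1.0 / L
    for _ in range(steps):
        grad = A.T @ (A @ x - y)
        z = x - t * grad
        # Soft-thresholding: the proximal operator of lam * ||.||_1.
        x = np.sign(z) * np.maximum(np.abs(z) - t * lam, 0.0)
    return x
```

When $A$ is the identity, the solver reduces in one step to soft-thresholding $y$ at level $\lambda$, which matches the closed-form LASSO solution in that case.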

12:20
Truthful Linear Regression
SPEAKER: unknown

ABSTRACT. We consider the problem of fitting a linear model to data held by individuals who are concerned about their privacy. Incentivizing most players to report their data to the analyst truthfully constrains our design to mechanisms that provide a privacy guarantee to the participants; we use differential privacy to model individuals' privacy losses. This immediately poses a problem, as differentially private computation of a linear model necessarily produces a biased estimate, and existing approaches to designing mechanisms that elicit data from privacy-sensitive individuals do not generalize well to biased estimators. We overcome this challenge through appropriate design of the computation and payment scheme.