ABSTRACT. We consider the online sparse linear regression problem, which is the problem of sequentially making predictions observing only a limited number of features in each round, to minimize regret with respect to the best sparse linear regressor, where prediction accuracy is measured by square loss. We give an {\em inefficient} algorithm that obtains regret bounded by $\tilde{O}(\sqrt{T})$ after $T$ prediction rounds. We complement this result by showing that no algorithm running in polynomial time per iteration can achieve regret bounded by $O(T^{1-\delta})$ for any constant $\delta > 0$ unless $NP \subseteq BPP$. This computational hardness result resolves an open problem presented in COLT 2014 (Kale, 2014) and also posed by Zolghadr et al. (2013). This hardness result holds even if the algorithm is allowed to access more features than the best sparse linear regressor up to a logarithmic factor in the dimension.

ABSTRACT. We consider the problem of prediction with expert advice when the losses of the experts have low-dimensional structure: they are restricted to an unknown $d$-dimensional subspace.

We devise algorithms with regret bounds that are independent of the number of experts and depends only on the rank $d$.
For the stochastic model we show a tight bound of $\Theta(\sqrt{dT})$, and extend it to a setting of an approximate $d$ subspace.
For the adversarial model we show an upper bound of $O(d\sqrt{T})$ and a lower bound of $\Omega(\sqrt{dT})$.

ABSTRACT. We consider the online version of the isotonic regression problem. Given a set of linearly ordered points (e.g., on the real line), the learner must predict labels sequentially at adversarially chosen positions and is evaluated by her total squared loss compared against the best isotonic (non-decreasing) function in hindsight. We survey several standard online learning algorithms and show that none of them achieve the optimal regret exponent; in fact, most of them (including Online Gradient Descent, Follow the Leader and Exponential Weights) incur linear regret. We then prove that the Exponential Weights algorithm played over a covering net of isotonic functions has regret is bounded by $O(T^{1/3} \log^{2/3}(T))$ and present a matching $\Omega(T^{1/3})$ lower bound on regret. We also provide a computationally efficient version of this algorithm. We also analyze the noise-free case, in which the revealed labels are isotonic, and show that the bound can be improved to $O(\log(T)$ or even to $O(1)$ (when the labels are revealed in the isotonic order). Finally, we extend the analysis beyond squared loss and give bounds for log-loss and absolute loss.

Online Learning and Blackwell Approachability in Quitting Games

SPEAKER: unknown

ABSTRACT. We consider sequential decision problems, seen as a game between a decision maker and Nature, known as regret minimization, or more generally, as Blackwell approachability. Similarly to Markov Decision Processes, we will assume that the payoff at one stage also depends on some state parameter, whose evolution is controlled by the actions of both the decision maker and Nature. More specifically, we will suppose that Nature, the decision maker, or both, might have some terminating (or absorbing) actions so that the stream of payoff is constant whenever they are chosen. We call those environments ``quitting games''.

We characterize convex target sets C that are approachable, i.e. learnable in the sense of Blackwell, in the sense that the decision maker has a policy ensuring that the expected average payoff converge to C. We provide a simple geometric sufficient condition (in the spirit of Blackwell's condition) ensuring that C is approachable. We also provide a closely related necessary condition such that, whenever it is not satisfied, the target set C is excludable (i.e., Nature can approach a complement of the target set). We provide such conditions when the horizon is known in advance and show that, unlike standard Blackwell approachability, the necessary or sufficient conditions for the anytime version of this problem are drastically different. More specifically, in the special case when only the decision maker can terminate the game, our two conditions are equivalent (and coincide with Blackwell's condition), which allows a full characterization of approachable convex sets. However, if only nature can terminate, there is a gap to which we give some insight.

ABSTRACT. We present a series of theoretical and algorithmic results combining the benefits of the statistical learning approach to time series prediction with that of on-line learning. We prove new generalization guarantees for hypotheses derived from regret minimization algorithms in the general scenario where the data is generated by a non-stationary non-mixing stochastic process. Our theory enables us to derive model selection techniques with favorable theoretical guarantees in the scenario of time series, thereby solving a problem that is notoriously difficult in that scenario. It also helps us devise new ensemble methods with favorable theoretical guarantees for the task of forecasting non-stationary time series.

ABSTRACT. If a concept class can be represented with a certain amount of memory, can it be efficiently learned with the same amount of memory? What concepts can be efficiently learned by algorithms that extract only a few bits of information from each example? We introduce a formal framework for studying these questions, and investigate the relationship between the fundamental resources of memory or communication and the sample complexity of the learning task. We relate our memory-bounded and communication-bounded learning models to the well-studied statistical query model. This connection can be leveraged to obtain both upper and lower bounds: we show strong lower bounds on learning parity functions with bounded communication, as well as upper bounds on solving sparse linear regression problems with limited memory.

Matching Matrix Bernstein with Little Memory: Near-Optimal Finite Sample Guarantees for Oja's Algorithm

SPEAKER: unknown

ABSTRACT. In this paper we provide improved guarantees for streaming principle component analysis (PCA). Given $A_1, \ldots, A_n \in R^{d\times d}$ sampled independently from distributions satisfying $E[A_i] = \Sigma$ for $\Sigma \succeq 0$, we present an $O(d)$-space linear-time single-pass streaming algorithm for estimating the top eigenvector of $\Sigma$. The algorithm nearly matches (and in certain cases improves upon) the accuracy obtained by the standard batch method that computes top eigenvector of the empirical covariance $\frac{1}{n} \sum_{i \in [n]} A_i$ as analyzed by the matrix Bernstein inequality. Moreover, to achieve constant accuracy, our algorithm improves upon the best previous known sample complexities of streaming algorithms by either a multiplicative factor of $O(d)$ or $1/gap$ where $gap$ is the relative distance between the top two eigenvalues of $\Sigma$.

We achieve these results through a novel analysis of the classic Oja's algorithm, one of the oldest and perhaps, most popular algorithms for streaming PCA. We show that simply picking a random initial point $w_0$ and applying the natural update rule $w_{i + 1} = w_i + \eta_i A_i w_i$ suffices for suitable choice of $\eta_i$. We believe our result sheds light on how to efficiently perform streaming PCA both in theory and in practice and we hope that our analysis may serve as the basis for analyzing many variants and extensions of streaming PCA.

An Improved Gap-Dependency Analysis of the Noisy Power Method

SPEAKER: unknown

ABSTRACT. We consider the noisy power method algorithm, which has wide applications in machine learning and statistics, especially those related to principal component analysis (PCA) under resource (communication, memory or privacy) constraints.
Existing analysis of the noisy power method (Hardt and Price, 2014; Li et al., 2016) shows an unsatisfactory dependency over the ``consecutive" spectral gap $(\sigma_k-\sigma_{k+1})$ of an input data matrix, which could be very small and hence limits the algorithm's applicability. In this paper, we present a new analysis of the noisy power method that achieves improved gap dependency for both sample complexity and noise tolerance bounds. More specifically, we improve the dependency over $(\sigma_k-\sigma_{k+1})$ to dependency over $(\sigma_k-\sigma_{q+1})$, where $q$ is an intermediate algorithm parameter and could be much larger than the target rank $k$.Our proofs are built upon a novel characterization of proximity between two subspaces that differ from canonical angle characterizations analyzed in previous works (Hardt and Price, 2014; Li et al., 2016).Finally, we apply our improved bounds to distributed private PCA and memory-efficient streaming PCA and obtain bounds that are superior to existing results in the literature.

ABSTRACT. It is well known that Sparse PCA (Sparse Principal Component Analysis) is NP-hard to solve exactly on worst-case instances.
What is the complexity of solving Sparse PCA approximately?
Our contributions include:
a simple and efficient algorithm that achieves an n^{-1/3}-approximation;
NP-hardness of approximation to within (1-\epsilon), for some small constant \epsilon > 0;
SSE-hardness of approximation to within {\em any} constant factor; and
a (``quasi-quasi-polynomial'') gap for the standard semidefinite program.