A General Characterization of the Statistical Query Complexity
ABSTRACT. Statistical query (SQ) algorithms are algorithms that have access to an SQ oracle for the input distribution $D$ instead of i.i.d. samples from $D$. Given a query function $\phi: X \rightarrow [-1,1]$, the oracle returns an estimate of $E_{x\sim D}[\phi(x)]$ within some tolerance $\tau_\phi$; the tolerance roughly corresponds to the number of samples that would be needed to estimate the expectation empirically.
In this work we demonstrate that the complexity of solving an arbitrary statistical problem using SQ algorithms can be captured by a relatively simple notion of statistical dimension that we introduce. SQ algorithms capture a broad spectrum of algorithmic approaches used in theory and practice, most notably convex optimization techniques. Hence our statistical dimension allows one to investigate the power of a variety of algorithmic approaches by analyzing a single linear-algebraic parameter. Such characterizations have been investigated over the past 20 years in learning theory, but prior characterizations are restricted to the much simpler setting of classification problems relative to a fixed distribution on the domain (Blum et al., 1994; Bshouty and Feldman, 2002; Yang, 2001; Yang, 2005; Balcázar et al., 2007; Simon, 2007; Feldman, 2012; Szörényi, 2009). Our characterization is also the first to precisely characterize the necessary tolerance of queries. We give applications of our techniques to two open problems in learning theory and to algorithms that are subject to memory and communication constraints.
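To make the oracle model concrete, here is a minimal sketch of a valid SQ oracle simulated from samples (our illustration; the function names and the use of an empirical average are assumptions, but simulation-by-sampling is the standard way to realize such an oracle):

import numpy as np

rng = np.random.default_rng(0)

def make_sq_oracle(draw, tau):
    # Simulated statistical-query oracle STAT_D(tau): given a query
    # phi: X -> [-1, 1], answer E_{x~D}[phi(x)] up to tolerance tau.
    # The oracle is realized with O(1/tau^2) fresh samples, so its answers
    # are tau-accurate only with high probability (Hoeffding); this is the
    # sense in which the tolerance corresponds to a sample size.
    n = int(np.ceil(4.0 / tau ** 2))
    def oracle(phi):
        xs = draw(n)  # n i.i.d. samples from D
        return float(np.mean([phi(x) for x in xs]))
    return oracle

# Example: D = N(0.3, 1) on the real line, query phi(x) = sign(x) in [-1, 1].
oracle = make_sq_oracle(lambda n: rng.normal(0.3, 1.0, size=n), tau=0.05)
print(oracle(np.sign))  # close to E[sign(x)] = 2 * Phi(0.3) - 1 ~ 0.236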
Mixing Implies Lower Bounds for Space Bounded Learning
ABSTRACT. One can learn any hypothesis class $H$ with $O(\log |H|)$ labeled examples. Alas, learning with so few examples requires saving the examples in memory, and this requires $|X|^{O(\log |H|)}$ memory states, where $X$ is the set of all labeled examples. A question that arises is how many labeled examples are needed in case the memory is bounded. Previous work showed, using techniques such as linear algebra and Fourier analysis, that parities cannot be learned with bounded memory and fewer than $|H|^{\Omega(1)}$ examples. One might wonder whether a general combinatorial condition exists for unlearnability with bounded memory, as we have with the condition $VCdim(H) = \infty$ for PAC unlearnability.
In this paper we give such a condition. We show that if a hypothesis class $H$, when viewed as a bipartite graph between hypotheses $H$ and labeled examples $X$, is mixing, then learning it requires $|H|^{\Omega(1)}$ examples under a certain bound on the memory. Note that the class of parities is mixing. As an immediate corollary, we get that most hypothesis classes are unlearnable with bounded memory. Our proof technique is combinatorial in nature and very different from previous analyses.
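For intuition, mixing of such a bipartite graph is typically measured spectrally (via the second singular value of a normalized biadjacency matrix). A small sketch of one standard spectral proxy (our illustration, not the paper's exact definition): the ±1 sign matrix of the parity class is a Hadamard-type matrix, whose singular values are all equal, which is the extreme mixing case.

import numpy as np
from itertools import product

# Parities chi_S(x) = (-1)^{<S, x>} over {0,1}^k: one row per hypothesis S,
# one column per point x. This is a 2^k x 2^k Hadamard-type matrix.
k = 4
pts = np.array(list(product([0, 1], repeat=k)))
M = np.array([[(-1) ** int(s @ x) for x in pts] for s in pts], dtype=float)

# All singular values equal sqrt(2^k); after normalization the second
# singular value is as small as possible, i.e., parities are maximally mixing.
print(np.linalg.svd(M, compute_uv=False))  # sixteen values, all equal to 4.0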
ABSTRACT. Daniely, Linial, and Shalev-Shwartz (STOC 2014) recently developed a beautiful new approach to proving computational hardness results for learning, based on the conjectured hardness of refuting constraint-satisfaction problems. This approach was used to give the first evidence of the hardness of PAC learning DNF formulas (Daniely and Shalev-Shwartz, COLT 2016) and of agnostically learning halfspaces (Daniely, STOC 2016).
In this paper, we illuminate the connection between PAC learning and refutation. Specifically, we prove that for every concept class P, PAC-learning P is {\em equivalent} to RRHS-refuting a dual class P*, where RRHS-refutation of a class Q refers to refuting systems of equations where the constraints are (worst-case) functions from the class Q but the right-hand-sides of the equations are uniform and independent random bits. The reduction from refutation to PAC learning can be viewed as an abstraction of (part of) the work of Daniely et al. The converse, however, is new, and is based on a combination of techniques from pseudorandomness (Yao '82) with boosting (Schapire '90).
In addition, we show that PAC-learning the class of DNF formulas is polynomially equivalent to PAC-learning its dual class DNF*, and thus PAC-learning DNF is equivalent to RRHS-refutation of DNF, suggesting an avenue to obtain stronger lower bounds for PAC-learning DNF (e.g., exponential rather than quasipolynomial).
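To unpack the RRHS notion, here is a hedged sketch of what an instance looks like (our illustration; the helper names and instance shape are assumptions, and for simplicity we sample the constraints at random, although the definition allows them to be worst-case):

import numpy as np

rng = np.random.default_rng(0)

def rrhs_instance(sample_constraint, m):
    # An RRHS instance over a class Q: m constraints q_i from Q, each paired
    # with a uniform, independent right-hand-side bit b_i. A refuter must
    # certify (w.h.p. over the b_i) that no single input x in the domain
    # satisfies q_i(x) = b_i for almost all i.
    return [(sample_constraint(), int(rng.integers(0, 2))) for _ in range(m)]

# Example with Q = parities over n bits, q_S(x) = <S, x> mod 2. Here Gaussian
# elimination over GF(2) refutes, since random right-hand sides are
# inconsistent with every fixed x with overwhelming probability.
n, m = 10, 50
instance = rrhs_instance(lambda: rng.integers(0, 2, size=n), m)
x = rng.integers(0, 2, size=n)  # any candidate solution
satisfied = sum(int(S @ x % 2) == b for S, b in instance)
print(f"{satisfied}/{m} equations satisfied")  # ~ m/2 for any fixed x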
Inapproximability of VC Dimension and Littlestone's Dimension
ABSTRACT. We study the complexity of computing the VC Dimension and Littlestone's Dimension. Given an explicit description of a finite universe and a concept class (a binary matrix whose $(x,c)$-th entry is 1 iff element $x$ belongs to concept $c$), both dimensions can be computed exactly in quasi-polynomial time ($n^{O(\log n)}$). Assuming the randomized Exponential Time Hypothesis (ETH), we prove nearly matching lower bounds on the running time that hold even for approximation algorithms.
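The quasi-polynomial upper bound comes from the observation that $VCdim(C) \le \log_2 |C|$, so exhaustive search only needs to examine subsets of logarithmic size. A brute-force sketch of the exact computation (our illustration):

from itertools import combinations

def vc_dimension(C):
    # Exact VC dimension of a concept class given as a 0/1 matrix with one
    # row per concept c and one column per element x (C[c][x] = 1 iff x in c).
    # Since VCdim(C) <= log2(len(C)), only subsets of logarithmic size can be
    # shattered, which is the source of the n^{O(log n)} running time.
    n_pts = len(C[0])
    d = 0
    for k in range(1, n_pts + 1):
        found = False
        for S in combinations(range(n_pts), k):  # candidate shattered sets
            patterns = {tuple(row[x] for x in S) for row in C}
            if len(patterns) == 2 ** k:          # S is shattered
                found = True
                break
        if not found:
            break  # subsets of shattered sets are shattered, so we can stop
        d = k
    return d

# example: a small class of intervals on 4 points has VC dimension 2
C = [[0, 0, 0, 0], [1, 0, 0, 0], [1, 1, 0, 0],
     [0, 1, 1, 0], [0, 0, 1, 1], [1, 1, 1, 1]]
print(vc_dimension(C))  # -> 2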
ABSTRACT. One way to define the randomness of a fixed individual sequence is to ask how hard it is to predict. When prediction error is measured via squared loss, it has been established that memoryless sequences (which are, in a precise sense, hard to predict) have some of the stochastic attributes of truly random sequences. In this paper, we ask how changing the loss function used changes the set of memoryless sequences, and in particular, the stochastic attributes they possess. We answer this question using tools from property elicitation, showing that the property elicited by the loss function determines the stochastic attributes of the corresponding memoryless sequences. We apply our results to the question of price calibration in prediction markets.
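As a concrete instance of the connection (a standard elicitation fact, not specific to this paper): squared loss elicits the mean, since for any distribution $p$ with finite variance
\[
\mathbb{E}_{X \sim p}\big[(r - X)^2\big] \;=\; \big(r - \mathbb{E}_p[X]\big)^2 + \mathrm{Var}_p(X),
\]
so the unique optimal report is $r^* = \mathbb{E}_p[X]$, consistent with the mean playing the central role in the squared-loss results mentioned above.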
ABSTRACT. We study loss functions that measure the accuracy of a prediction based on multiple data points simultaneously. To our knowledge, such loss functions have not been studied before in the area of property elicitation or in machine learning more broadly. As compared to traditional loss functions that take only a single data point, these multi-observation loss functions can in some cases drastically reduce the dimensionality of the hypothesis required. In elicitation, this corresponds to requiring many fewer reports; in empirical risk minimization, it corresponds to algorithms on a hypothesis space of much smaller dimension. We explore some examples of the tradeoff between dimensionality and number of observations, give some geometric characterizations and intuition for relating loss functions and the properties that they elicit, and discuss some implications for both elicitation and machine-learning contexts.
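A standard example of this dimensionality/observations tradeoff (our illustration, a known fact in elicitation, not necessarily the paper's example): the variance is not directly elicitable from single observations and is classically elicited jointly with the mean via a two-dimensional report, but with two i.i.d. observations a one-dimensional report suffices, since
\[
\mathbb{E}\!\left[\tfrac{1}{2}(X_1 - X_2)^2\right] = \mathrm{Var}(X)
\quad\Longrightarrow\quad
\arg\min_{r}\; \mathbb{E}\!\left[\Big(r - \tfrac{1}{2}(X_1 - X_2)^2\Big)^{2}\right] = \mathrm{Var}(X).
\]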
ABSTRACT. This work initiates a systematic investigation of testing {\em high-dimensional} structured distributions by focusing on testing {\em Bayesian networks} -- the prototypical family of directed graphical models. A Bayesian network is defined by a directed acyclic graph, where we associate a random variable with each node. The value at any particular node is conditionally independent of all the other non-descendant nodes once its parents are fixed. Specifically, we study the properties of identity testing and closeness testing of Bayesian networks. Our main contribution is the first non-trivial efficient testing algorithms for these problems and corresponding information-theoretic lower bounds. For a wide range of parameter settings, our testing algorithms have sample complexity {\em sublinear} in the dimension and are sample-optimal, up to constant factors.
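Equivalently (a standard fact, stated here for concreteness), a Bayesian network over a DAG with parent sets $\Pi_v$ factorizes as
\[
P(x_1, \dots, x_n) \;=\; \prod_{v=1}^{n} P\big(x_v \mid x_{\Pi_v}\big),
\]
which is just another form of the conditional-independence condition described above.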
Square Hellinger Subadditivity for Bayesian Networks and its Applications to Identity Testing
ABSTRACT. We show that the square Hellinger distance between two Bayesian networks on the same directed graph, $G$, is subadditive with respect to the neighborhoods of $G$. Namely, if $P$ and $Q$ are the probability distributions defined by two Bayesian networks on the same DAG, our inequality states that the square Hellinger distance, $H^2(P,Q)$, between $P$ and $Q$ is upper bounded by the sum, $\sum_v H^2(P_{\{v\} \cup \Pi_v}, Q_{\{v\} \cup \Pi_v})$, of the square Hellinger distances between the marginals of $P$ and $Q$ on every node $v$ and its parents $\Pi_v$ in the DAG. Importantly, our bound does not involve the conditionals but the marginals of $P$ and $Q$. We derive a similar inequality for more general Markov Random Fields.
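For intuition, in the special case where $P$ and $Q$ are product distributions (the empty DAG, so $\Pi_v = \emptyset$ and the marginals are single-coordinate), the inequality reduces to the classical subadditivity of the square Hellinger distance, which follows from the multiplicativity of the Bhattacharyya coefficient $1 - H^2$ over products:
\[
1 - H^2(P,Q) \;=\; \prod_{v} \big(1 - H^2(P_v, Q_v)\big) \;\ge\; 1 - \sum_{v} H^2(P_v, Q_v).
\]
The paper's inequality extends this to arbitrary DAG dependencies, with marginals on $\{v\} \cup \Pi_v$ replacing the single-coordinate marginals.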
As an application of our inequality, we show that distinguishing whether two (unknown) Bayesian networks $P$ and $Q$ on the same (but potentially unknown) DAG satisfy $P=Q$ versus $d_{TV}(P,Q)>\epsilon$ can be performed from $\tilde{O}(|\Sigma|^{3(d+1)/4} \cdot n/\epsilon^2)$ samples, where $d$ is the maximum in-degree of the DAG and $\Sigma$ the domain of each variable of the Bayesian networks. If $P$ and $Q$ are defined on potentially different and potentially unknown trees, the sample complexity becomes $\tilde{O}(|\Sigma|^{4.5} n/\epsilon^2)$. In both cases the dependence of the sample complexity on $n$ and $\epsilon$ is optimal up to logarithmic factors. Lastly, if $P$ and $Q$ are product distributions over $\{0,1\}^n$ and $Q$ is known, the sample complexity becomes $O(\sqrt{n}/\epsilon^2)$, which is optimal up to constant factors.
Two-Sample Tests for Large Random Graphs using Network Statistics
ABSTRACT. We consider a two-sample hypothesis testing problem, where the distributions are defined on the space of undirected graphs, and one has access to only one observation from each model. A motivating example for this problem is comparing the friendship networks on Facebook and LinkedIn. The practical approach to such problems is to compare the networks based on certain network statistics. In this paper, we present a general principle for two-sample hypothesis testing in such scenarios without making any assumption about the network generation process. The main contribution of the paper is a general formulation of the problem based on concentration of network statistics, and consequently, a consistent two-sample test that arises as the natural solution for this problem. We also show that the proposed test is minimax optimal for certain network statistics.
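Schematically, a statistic-based test of this kind has a very simple form (our sketch; the paper derives the rejection threshold from the concentration of the chosen statistic, whereas here the radius rho is taken as given):

import numpy as np

def triangle_density(A):
    # trace(A^3) counts each triangle 6 times, and there are n(n-1)(n-2)/6
    # vertex triples, so this ratio is the fraction of triples forming
    # triangles.
    n = len(A)
    return np.trace(A @ A @ A) / (n * (n - 1) * (n - 2))

def two_sample_network_test(A1, A2, statistic, rho):
    # Reject H0 ("both graphs come from the same model") when the statistics
    # of the two observed graphs differ by more than twice the concentration
    # radius rho: under H0 both values lie within rho of the common
    # population value, so their gap cannot exceed 2*rho.
    return abs(statistic(A1) - statistic(A2)) > 2 * rho

# usage: reject = two_sample_network_test(A1, A2, triangle_density, rho=0.01)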
Empirical Risk Minimization for Stochastic Convex Optimization: $O(1/n)$- and $O(1/n^2)$-type of Risk Bounds
ABSTRACT. Although there exist plentiful theories of empirical risk minimization (ERM) for supervised learning, current theoretical understanding of ERM for the related problem of stochastic convex optimization (SCO) is limited. In this work, we strengthen the theory of ERM for SCO by exploiting smoothness and strong convexity conditions to improve the risk bounds. First, we establish an $O(d/n + \sqrt{F_*/n})$ risk bound when the random function is nonnegative, convex and smooth, and the expected function is Lipschitz continuous, where $d$ is the dimensionality of the problem, $n$ is the number of samples, and $F_*$ is the minimal risk. Thus, when $F_*$ is small we obtain an $O(d/n)$ risk bound, which is analogous to the $O(1/n)$ optimistic rate of ERM for supervised learning. Second, if the objective function is also $\lambda$-strongly convex, we prove an $O(d/n + \kappa F_*/n)$ risk bound, where $\kappa$ is the condition number, and improve it to $O(1/[\lambda n^2] + \kappa F_*/n)$ when $n=\Omega(\kappa d)$. As a result, we obtain an $O(\kappa/n^2)$ risk bound under the condition that $n$ is large and $F_*$ is small, which, to the best of our knowledge, is the first $O(1/n^2)$-type risk bound for ERM. Third, we stress that the above results are established in a unified framework, which allows us to derive new risk bounds under weaker conditions, e.g., without convexity of the random function or Lipschitz continuity of the expected function. Finally, we demonstrate that to achieve an $O(1/[\lambda n^2] + \kappa F_*/n)$ risk bound for supervised learning, the $\Omega(\kappa d)$ requirement on $n$ can be replaced with $\Omega(\kappa^2)$, which is dimensionality-independent.
Stochastic Composite Least-Squares Regression with convergence rate O(1/n)
ABSTRACT. We consider the minimization of composite objective functions composed of the expectation of quadratic functions and an arbitrary convex function. We study the stochastic dual averaging algorithm with a constant step-size, showing that it leads to a convergence rate of O(1/n) without strong convexity assumptions. This thus extends earlier results on least-squares regression with the Euclidean geometry to (a) all convex regularizers and constraints, and (b) all geometries represented by a Bregman divergence. This is achieved by a new proof technique that relates stochastic and deterministic recursions.
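For concreteness, here is a minimal sketch of constant step-size dual averaging on a composite least-squares objective, instantiated with an $\ell_1$ regularizer as the arbitrary convex term (our illustration: the choice of regularizer, the Euclidean geometry, the parameter values, and returning the averaged iterate are all assumptions, not the paper's prescriptions):

import numpy as np

def soft_threshold(u, tau):
    # proximal map of tau * ||.||_1
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def sda_composite_least_squares(sample, d, n, gamma, lam):
    # Dual averaging with constant step-size gamma for
    #     min_x  E[(a'x - b)^2 / 2] + lam * ||x||_1.
    # z_t accumulates stochastic gradients of the quadratic part; the l1
    # term is handled exactly through its proximal map:
    #     x_{t+1} = argmin_x <z_t, x> + t*lam*||x||_1 + ||x||^2/(2*gamma)
    #             = soft_threshold(-gamma * z_t, gamma * t * lam).
    x, z, x_avg = np.zeros(d), np.zeros(d), np.zeros(d)
    for t in range(1, n + 1):
        a, b = sample()
        z += a * (a @ x - b)                      # gradient at the current x_t
        x = soft_threshold(-gamma * z, gamma * t * lam)
        x_avg += (x - x_avg) / t                  # running average of iterates
    return x_avg

# usage: recover a sparse vector from noisy linear measurements
rng = np.random.default_rng(0)
d = 50
w_star = np.zeros(d); w_star[:5] = 1.0
def sample():
    a = rng.standard_normal(d)
    return a, a @ w_star + 0.1 * rng.standard_normal()
w_hat = sda_composite_least_squares(sample, d, n=20000, gamma=0.005, lam=0.01)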
A Unified Analysis of Stochastic Optimization Methods Using Jump System Theory and Quadratic Constraints
ABSTRACT. We develop a simple routine unifying the analysis of several important recently-developed stochastic optimization methods including SAGA, Finito, and stochastic dual coordinate ascent (SDCA). First, we show an intrinsic connection between stochastic optimization methods and dynamic jump systems, and propose a general jump system model for stochastic optimization methods. Our proposed model recovers SAGA, SDCA, Finito, and SAG as special cases. Then we combine jump system theory with several simple quadratic inequalities to derive sufficient conditions for convergence rate certifications of the proposed jump system model under various assumptions (with or without individual convexity, etc.). The derived conditions are linear matrix inequalities (LMIs) whose sizes roughly scale with the size of the training set. We make use of the symmetry in the stochastic optimization methods to reduce these LMIs to equivalent small LMIs whose sizes are at most three by three. We solve these small LMIs to provide analytical proofs of new convergence rates for SAGA, Finito and SDCA (with or without individual convexity). We also explain why our proposed LMI fails in analyzing SAG, and discuss a key difference between SAG and the other methods. An advantage of our approach is that the proposed analysis can be automated for a large class of stochastic methods under various assumptions.
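To illustrate what "solving a small LMI to certify a rate" looks like, here is a schematic sketch using CVXPY (our illustration with a generic mean-square Lyapunov condition for an i.i.d. jump system; the paper's actual LMIs encode the specific methods and quadratic constraints, and the cvxpy/SCS toolchain is an assumed choice):

import cvxpy as cp
import numpy as np

def certify_rate(As, rho):
    # Look for P >> 0 with (1/m) * sum_i A_i' P A_i <= rho^2 * P, a generic
    # mean-square Lyapunov LMI certifying E||x_t||^2 = O(rho^(2t)) when the
    # mode A_i is drawn i.i.d. uniformly at each step.
    n = As[0].shape[0]
    P = cp.Variable((n, n), symmetric=True)
    avg = sum(A.T @ P @ A for A in As) / len(As)
    constraints = [P >> np.eye(n), rho ** 2 * P - avg >> 0]
    problem = cp.Problem(cp.Minimize(0), constraints)
    problem.solve(solver=cp.SCS)
    return problem.status in (cp.OPTIMAL, cp.OPTIMAL_INACCURATE)

# Example: two 2x2 modes; one can bisect on rho for the best certified rate.
As = [np.array([[0.9, 0.1], [0.0, 0.5]]), np.array([[0.5, 0.0], [0.2, 0.8]])]
print(certify_rate(As, rho=0.95))  # True iff the rate 0.95 is certified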
Memory and Communication Efficient Distributed Stochastic Optimization with Minibatch Prox
ABSTRACT. We present and analyze statistically optimal, communication and memory efficient distributed stochastic optimization algorithms with near-linear speedups (up to log factors). This improves over prior work, which includes methods with near-linear speedups but polynomial communication requirements (accelerated minibatch SGD) and communication efficient methods which do not exhibit any runtime speedups over a naive single-machine approach. We first analyze a distributed SVRG variant as a distributed stochastic optimization method and show that it can achieve near-linear speedups with logarithmic rounds of communication, at the cost of high memory requirements. We then present a novel method, stochastic DANE, which trades off memory for communication and still allows for optimization with communication which scales only logarithmically with the desired accuracy, while also being memory efficient. Stochastic DANE is based on a minibatch prox procedure, solving a non-linearized subproblem on a minibatch at each iteration. We provide a novel analysis for this procedure which achieves the statistically optimal rate regardless of minibatch size and smoothness, thus significantly improving on prior work.
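A minimal sketch of the minibatch-prox step itself (our illustration; the inner solver, the loss_grad interface, and the subproblem accuracy are assumptions, and the distributed/communication aspects of stochastic DANE are omitted):

import numpy as np
from scipy.optimize import minimize

def minibatch_prox(loss_grad, sample_batch, d, T, b, gamma):
    # At each step, (approximately) solve the non-linearized proximal
    # subproblem on a fresh minibatch of b examples:
    #     w_{t+1} = argmin_w (1/b) sum_i f(w; z_i) + ||w - w_t||^2/(2*gamma)
    w = np.zeros(d)
    for _ in range(T):
        batch = sample_batch(b)                # fresh minibatch of b examples
        def obj(v):
            val, grad = loss_grad(v, batch)    # avg loss and gradient on batch
            val += np.sum((v - w) ** 2) / (2 * gamma)
            grad = grad + (v - w) / gamma
            return val, grad
        w = minimize(obj, w, jac=True, method="L-BFGS-B").x
    return w

# usage: for least squares, loss_grad(v, (A, y)) could return
#     (0.5 * np.mean((A @ v - y) ** 2), A.T @ (A @ v - y) / len(y))
# with sample_batch(b) returning a fresh (A, y) of b rows.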
The Sample Complexity of Optimizing a Convex Function
ABSTRACT. In this paper we study optimization from samples of convex functions. There are many scenarios in which we do not know the function we wish to optimize but can learn it from data. In such cases, we are interested in bounding the number of samples required to optimize the function. Our main result shows that in general, the number of samples required to obtain a non-trivial approximation to the optimum of a convex function is exponential in its dimension, even when the function is PAC-learnable. We also obtain strong lower bounds for strongly convex and Lipschitz continuous functions. On the positive side, we show that there are interesting classes of functions and distributions for which the sample complexity is polynomial in the dimension of the function.
The Simulator: Understanding Adaptive Sampling in the Moderate-Confidence Regime
ABSTRACT. We propose a novel technique for analyzing adaptive sampling called the {\em Simulator}. Our approach differs from existing methods by considering not how much information could be gathered by any fixed sampling strategy, but how difficult it is to distinguish a good sampling strategy from a bad one given the limited amount of data collected up to any given time. This change of perspective allows us to match the strength of both Fano and change-of-measure techniques, without succumbing to the limitations of either method. For concreteness, we apply our techniques to a structured multi-armed bandit problem in the fixed-confidence pure exploration setting, where we show that the constraints on the means imply a substantial gap between the moderate-confidence sample complexity and the asymptotic sample complexity as $\delta \to 0$ found in the literature. We also prove the first instance-based lower bounds for the top-k problem which incorporate the appropriate log-factors. Moreover, our lower bounds zero in on the number of times each \emph{individual} arm needs to be pulled, uncovering new phenomena which are drowned out in the aggregate sample complexity. Our new analysis inspires a simple and near-optimal algorithm for best-arm and top-k identification, the first {\em practical} algorithm of its kind for the latter problem, which removes extraneous log factors and outperforms the state-of-the-art in experiments.
Learning with Limited Rounds of Adaptivity: Coin Tossing, Multi-Armed Bandits, and Ranking from Pairwise Comparisons
ABSTRACT. In many learning settings, active/adaptive querying is possible, but the number of rounds of adaptivity is limited. We study the relationship between query complexity and adaptivity in identifying the k most biased coins among a set of n coins with unknown biases. This problem is a common abstraction of many well-studied problems, including the problem of identifying the k best arms in a stochastic multi-armed bandit, and the problem of top-k ranking from pairwise comparisons.
An r-round adaptive algorithm for the k most biased coins problem specifies in each round the set of coin tosses to be performed based on the observed outcomes in earlier rounds, and outputs the set of k most biased coins at the end of r rounds. When r=1, the algorithm is known as non-adaptive; when r is unbounded, the algorithm is known as fully adaptive. While the power of adaptivity in reducing query complexity is well known, full adaptivity requires repeated interaction with the coin tossing (feedback generation) mechanism, and is highly sequential, since the set of coins to be tossed in each round can only be determined after we have observed the outcomes of the coin tosses from the previous round. In contrast, algorithms with only a few rounds of adaptivity require fewer rounds of interaction with the feedback generation mechanism, and offer the benefits of parallelism in algorithmic decision-making. Motivated by these considerations, we consider the question of how much adaptivity is needed to realize the optimal worst case query complexity for identifying the k most biased coins. Given any positive integer r, we derive essentially matching upper and lower bounds on the query complexity of r-round algorithms. We then show that $\Theta(\log^* n)$ rounds are both necessary and sufficient for achieving the optimal worst case query complexity for identifying the k most biased coins. In particular, our algorithm achieves the optimal query complexity in at most $\log^* n$ rounds, which implies that on any realistic input, 5 parallel rounds of exploration suffice to achieve the optimal worst-case sample complexity. The best known algorithm prior to our work required $\Theta(\log n)$ rounds to achieve the optimal worst case query complexity even for the special case of $k=1$.
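The "realistic input" remark is just the extremely slow growth of the iterated logarithm: $\log^* n \le 5$ for every $n \le 2^{65536}$. A minimal illustration:

import math

def log_star(n: float) -> int:
    # iterated logarithm: how many times log2 must be applied
    # before the value drops to at most 1
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count

print(log_star(2), log_star(16), log_star(2 ** 16), log_star(2.0 ** 1000))
# -> 1 3 4 5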
Nearly Optimal Sampling Algorithms for Combinatorial Pure Exploration
ABSTRACT. We study the combinatorial pure exploration (CPE) problem in a stochastic multi-armed bandit game. In a CPE instance, we are given $n$ stochastic arms with unknown reward distributions, as well as a family $\mathcal{F}$ of feasible subsets over the arms. Let the weight of an arm be the mean of its reward distribution. Our goal is to identify the feasible subset in $\mathcal{F}$ with the maximum total weight, using as few samples as possible. The problem generalizes the classical best arm identification problem and the top-$k$ arm identification problem, both of which have attracted significant attention in recent years. We provide a novel \emph{instance-wise} lower bound for the sample complexity of the problem, as well as a nontrivial sampling algorithm, matching the lower bound up to a factor of $\ln|\mathcal{F}|$. For an important class of combinatorial families (including spanning trees, matchings, and path constraints), we also provide a polynomial-time implementation of the sampling algorithm, using the equivalence of separation and optimization for convex programs, and the notion of approximate Pareto curves in multi-objective optimization (note that $|\mathcal{F}|$ can be exponential in $n$). We also show that the $\ln|\mathcal{F}|$ factor is inevitable in general, through a nontrivial lower bound construction utilizing a combinatorial structure resembling the Nisan-Wigderson design. Our results significantly improve several previous results for several important combinatorial constraints, and provide a tighter understanding of the general CPE problem.
We further introduce an even more general problem, formulated in geometric terms. We are given $n$ Gaussian arms with unknown means and unit variance. Consider the $n$-dimensional Euclidean space $\mathbb{R}^n$, and a collection $\mathcal{C}$ of disjoint subsets. Our goal is to determine the subset in $\mathcal{C}$ that contains the mean profile (which is the $n$-dimensional vector of the means), using as few samples as possible. The problem generalizes most pure exploration bandit problems studied in the literature. We provide the first nearly optimal sample complexity upper and lower bounds for the problem.
ABSTRACT. We consider a sequential subset selection problem under parameter uncertainty, where at each time step, the decision maker selects a subset of cardinality K from N possible items (arms), and observes (bandit) feedback in the form of the index of one of the items in said subset, or none. Each item in the index set is ascribed a certain value (reward), and the feedback is governed by a Multinomial Logit (MNL) choice model whose parameters are a priori unknown. The objective of the decision maker is to maximize the expected cumulative rewards over a finite horizon T, or alternatively, minimize the regret relative to an oracle that knows the MNL parameters. We refer to this as the MNL-Bandit problem. This problem is representative of a larger family of exploration-exploitation problems that involve a combinatorial objective and arise in several important application domains. We present an approach to adapt Thompson Sampling to this problem and show that it achieves near-optimal regret as well as attractive numerical performance.
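A heavily simplified sketch of Thompson sampling for this setting (our illustration, not the paper's exact algorithm or posterior; the epoch trick of offering a fixed assortment until a no-purchase, which makes per-epoch purchase counts unbiased estimates of the MNL weights, and the crude Gaussian posterior are assumptions, and the brute-force assortment search only suits small N):

import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

def mnl_revenue(v, r, S):
    # expected revenue of assortment S: sum_{i in S} r_i v_i / (1 + sum v_i),
    # with the no-purchase option carrying weight 1
    idx = list(S)
    return float(r[idx] @ v[idx] / (1.0 + v[idx].sum()))

def thompson_mnl(true_v, r, K, n_epochs):
    N = len(true_v)
    sum_est, n_ep = np.zeros(N), np.zeros(N)
    for _ in range(n_epochs):
        # sample plausible MNL weights from a crude per-item posterior
        mean = sum_est / np.maximum(n_ep, 1)
        std = 1.0 / np.sqrt(np.maximum(n_ep, 1))
        v_hat = np.maximum(rng.normal(mean, std), 0.01)
        # offer the assortment of size <= K maximizing sampled revenue
        S = max((c for k in range(1, K + 1)
                 for c in combinations(range(N), k)),
                key=lambda c: mnl_revenue(v_hat, r, c))
        counts = np.zeros(N)
        while True:  # one epoch: offer S until the no-purchase option occurs
            w = np.concatenate(([1.0], true_v[list(S)]))
            choice = rng.choice(len(w), p=w / w.sum())
            if choice == 0:
                break
            counts[S[choice - 1]] += 1  # E[counts_i per epoch] = v_i
        sum_est[list(S)] += counts[list(S)]
        n_ep[list(S)] += 1
    return sum_est / np.maximum(n_ep, 1)

# usage (small weights keep epochs short):
# est = thompson_mnl(true_v=np.array([0.3, 0.2, 0.5, 0.1, 0.4]),
#                    r=np.array([1.0, 0.8, 0.5, 0.9, 0.6]), K=2, n_epochs=500)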