COLT 2018: 31ST ANNUAL CONFERENCE ON LEARNING THEORY
PROGRAM FOR MONDAY, JULY 9TH

09:00-10:00 Session 22
09:00
Learning from Unreliable Datasets

ABSTRACT. A wide range of learning tasks require human input in labeling massive data. The collected data, however, are usually of low quality and contain inaccuracies and errors. As a result, modern science and business face the problem of learning from unreliable datasets.

In this work, we provide a generic approach, based on verifying only a few records of the dataset, that guarantees high-quality learning outcomes for various optimization objectives. Our method identifies small sets of critical records and verifies their validity. We show that many problems require only poly(1/epsilon) verifications to ensure that the output of the computation is within a factor of (1 ± epsilon) of the truth. For any given instance, we provide an instance-optimal solution that verifies the minimum possible number of records needed to approximately certify correctness. Using this instance-optimal formulation of the problem, we prove our main result: ``every function that satisfies some Lipschitz continuity condition can be certified with a small number of verifications''. We show that the required Lipschitz continuity condition is satisfied even by some NP-complete problems, which illustrates the generality and importance of this theorem.

If this certification step fails, an invalid record is identified. Removing such records and repeating until success guarantees that the result is accurate and depends only on the verified records. Surprisingly, as we show, more efficient methods are possible for several computation tasks. These methods always guarantee that the produced result is not affected by the invalid records, since any invalid record that affects the output will be detected and verified.
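
The procedure described above amounts to a certify-remove-repeat loop. The following is a minimal Python sketch under assumed interfaces (the names compute, certify, and verify are illustrative, not from the paper):

    def certify_or_remove(dataset, compute, certify, verify):
        # compute(data): maps a dataset to an output.
        # certify(data, output): returns a small set of critical records
        #   whose validity would certify the output.
        # verify(record): checks a single record against ground truth.
        data = list(dataset)
        while True:
            output = compute(data)
            critical = certify(data, output)
            invalid = [r for r in critical if not verify(r)]
            if not invalid:
                return output  # certified: depends only on verified records
            for r in invalid:  # drop invalid records and recompute
                data.remove(r)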

09:10
Privacy-preserving Prediction

ABSTRACT. Ensuring differential privacy of models learned from sensitive user data is an important goal that has been studied extensively in recent years. It is now known that for some basic learning problems, especially those involving high-dimensional data, producing an accurate private model requires much more data than learning without privacy. At the same time, in many applications it is not necessary to expose the model itself. Instead users may be allowed to query the prediction model on their inputs only through an appropriate interface. Here we formulate the problem of ensuring privacy of individual predictions and investigate the overheads required to achieve it in several standard models of classification and regression.

We first describe a simple baseline approach based on training several models on disjoint subsets of the data and using standard private aggregation techniques to predict. We show that this approach has nearly optimal sample complexity for (realizable) PAC learning of any class of Boolean functions. At the same time, without strong assumptions on the data distribution, the aggregation step introduces a substantial overhead. We demonstrate that this overhead can be avoided for the well-studied class of thresholds on a line and for a number of standard settings of convex regression. The analysis of our algorithm for learning thresholds relies crucially on strong generalization guarantees that we establish for all private prediction algorithms.
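
For concreteness, the baseline can be rendered as a subsample-and-aggregate style predictor with a Laplace-noised vote (one standard private aggregation technique; the paper's exact aggregator may differ, and the noise calibration below is deliberately conservative):

    import numpy as np

    def private_predict(models, x, eps, rng=np.random.default_rng()):
        # models: classifiers trained on disjoint subsets of the data.
        # Changing one training record changes at most one model's vote,
        # so each vote count below has sensitivity 1.
        votes = np.array([m(x) for m in models])            # labels in {0, 1}
        counts = np.array([(votes == c).sum() for c in (0, 1)], dtype=float)
        noisy = counts + rng.laplace(scale=2.0 / eps, size=2)
        return int(np.argmax(noisy))                        # noisy-majority label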

09:20
Calibrating Noise to Variance in Adaptive Data Analysis

ABSTRACT. Datasets are often used multiple times and each successive analysis may depend on the outcome of previous analyses. Standard techniques for ensuring generalization and statistical validity do not account for this adaptive dependence. A recent line of work studies the challenges that arise from such adaptive data reuse by considering the problem of answering a sequence of ``queries'' about the data distribution where each query may depend arbitrarily on answers to previous queries.

The strongest results obtained for this problem rely on differential privacy -- a strong notion of algorithmic stability with the important property that it ``composes'' well when data is reused. However, the notion is rather strict, as it requires stability under replacement of an arbitrary data element. The simplest algorithm is to add Gaussian (or Laplace) noise to distort the empirical answers. However, analyzing this technique using differential privacy yields suboptimal accuracy guarantees when the queries have low variance.

Here we propose a relaxed notion of stability that also composes adaptively. We demonstrate that a simple and natural algorithm based on adding noise scaled to the standard deviation of the query provides our notion of stability. This implies an algorithm that can answer statistical queries about the dataset with substantially improved accuracy guarantees for low-variance queries. The only previous approach that provides such accuracy guarantees is based on a more involved differentially private median-of-means algorithm and its analysis exploits stronger ``group'' stability of the algorithm.
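
A minimal sketch of the mechanism described above, assuming a scalar-valued query phi and a noise multiplier t that governs the stability/accuracy tradeoff (the paper's calibration is more careful):

    import numpy as np

    def answer_query(sample, phi, t, rng=np.random.default_rng()):
        vals = phi(sample)                      # phi applied elementwise
        scale = t * vals.std() / np.sqrt(len(vals))
        return vals.mean() + rng.normal(scale=scale)  # noise tracks the
                                                      # query's std deviation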

09:30
A Data Prism: Semi-verified learning in the small-alpha regime

ABSTRACT. We consider a simple model of unreliable or crowdsourced data where there is an underlying set of $n$ binary variables, each ``evaluator'' contributes a (possibly unreliable or adversarial) estimate of the values of some subset of $r$ of the variables, and the learner is given the true value of a \emph{constant} number of variables. We show that, provided an $\alpha$-fraction of the evaluators are ``good'' (either correct, or with independent noise rate $p < 1/2$), the true values of a $(1-\epsilon)$ fraction of the $n$ underlying variables can be deduced as long as $r > \log_{2-2p}(1/\alpha)$. For example, if the fraction of ``good'' evaluators is larger than $1/16$ and there is no noise in their responses, then accurate recovery is possible provided each worker evaluates a random set of $4$ items. This result is optimal in that if $r \leq \log_{2-2p}(1/\alpha)$ the large dataset can contain no information. This setting can be viewed as an instance of the \emph{semi-verified} learning model introduced in CSV17, which explores the tradeoff between the number of items evaluated by each worker and the fraction of ``good'' evaluators. In the standard adversarial setting, our algorithm requires $\tilde{O}\left(n^{\log_{2-2p}(1/\alpha)}\right)$ evaluators. However, the algorithm runs in near-linear time, $\tilde{O}_{r,\epsilon}(n)$, and hence would require only a near-linear number of evaluations in the weaker model in which the adversary's responses to each $r$-tuple of items must be independent of the set of evaluations collected. These settings and results can also be viewed as examining a general class of semi-adversarial CSPs with a planted assignment.

This extreme parameter regime, where the fraction of reliable data is small (inverse exponential in the amount of data provided by each source), is relevant to a number of practical settings. Consider, for example, a dataset of customer preferences in which each customer specifies preferences for a small (constant) number of items, and the goal is to ascertain the preferences of a specific demographic of interest. Our results show that this large dataset (which lacks demographic information) can be leveraged, together with the preferences of the demographic of interest for a \emph{constant} (polynomial in $1/\alpha$ but independent of $n$) number of randomly selected items, to recover an accurate estimate of the entire set of preferences, even if the fraction of the original dataset contributed by the demographic of interest is inverse exponential in the number of preferences supplied by each customer. In this sense, our results can be viewed as a ``data prism'' allowing one to extract the behavior of specific cohorts from a large, mixed dataset.

09:40
Unleashing Linear Optimizers for Group-Fair Learning and Optimization
SPEAKER: Daniel Alabi

ABSTRACT. Most systems and learning algorithms optimize average performance or average loss -- one reason being computational complexity. However, many objectives of practical interest are more complex than simply average loss. This arises, for example, when balancing performance or loss with fairness across people. We prove that, from a computational perspective, objectives that take into account performance over a small number of groups are not significantly harder to optimize than average performance. Our main result is a polynomial-time reduction that uses a linear optimizer to optimize an arbitrary (Lipschitz continuous) function of performance over a (constant) number of possibly overlapping groups. This includes fairness objectives over small numbers of groups; we further point out that other existing notions of fairness, such as individual fairness, can be cast as convex optimization, so more standard convex techniques can be used. Beyond learning, our approach applies more generally to multi-objective optimization.
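
One standard way to drive a convex objective using only a linear optimizer is the Frank-Wolfe method; the sketch below illustrates that style of reduction (an illustration only, not the authors' construction, and it assumes a smooth convex objective and a feasible starting point):

    import numpy as np

    def frank_wolfe(grad, linear_opt, x0, steps=100):
        # grad(x): gradient of the objective at x.
        # linear_opt(c): the linear optimizer, returning a minimizer of
        #   <c, s> over the feasible set (e.g., achievable vectors of
        #   per-group performances).
        x = np.asarray(x0, dtype=float)
        for k in range(steps):
            s = linear_opt(grad(x))            # one linear-oracle call
            gamma = 2.0 / (k + 2)              # standard step size
            x = (1 - gamma) * x + gamma * s    # move toward the oracle's point
        return x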

09:50
Private Sequential Learning

ABSTRACT. We formulate a private learning model to study an intrinsic tradeoff between privacy and query complexity in sequential learning. Our model involves a learner who aims to determine a scalar value, $v^*$, by sequentially querying an external database with binary responses. In the meantime, an adversary observes the learner's queries, though not the responses, and tries to infer from them the value of $v^*$. The objective of the learner is to obtain an accurate estimate of $v^*$ using only a small number of queries, while simultaneously protecting her privacy by making $v^*$ provably difficult for the adversary to learn. Our main results provide tight upper and lower bounds on the learner's query complexity as a function of desired levels of privacy and estimation accuracy. We also construct explicit query strategies whose complexity is optimal up to an additive constant.
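
For intuition, the non-private baseline is plain bisection, which pins down $v^*$ to accuracy $\epsilon$ in about $\log_2(1/\epsilon)$ queries but lets an adversary watching the query points localize $v^*$ just as well -- precisely what the strategies in the paper avoid. A sketch, assuming $v^* \in [0,1]$ and threshold responses:

    import math

    def bisect(respond, eps):
        # respond(q) is True iff v* <= q (the database's binary response).
        lo, hi = 0.0, 1.0
        for _ in range(math.ceil(math.log2(1 / eps))):
            mid = (lo + hi) / 2
            if respond(mid):
                hi = mid
            else:
                lo = mid
        return (lo + hi) / 2   # within eps of v*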

10:00-10:30 Coffee Break
10:30-12:00 Session 23
10:30
Hardness of Learning Noisy Halfspaces using Polynomial Thresholds

ABSTRACT. We prove the hardness of weakly learning halfspaces in the presence of adversarial noise using polynomial threshold functions (PTFs). In particular, we prove that for any constants $d \in \mathbb{Z}^+$ and $\epsilon > 0$, it is NP-hard to decide, given a set of $\{-1,1\}$-labeled points in $\mathbb{R}^n$, whether (YES case) there exists a halfspace that classifies a $(1-\epsilon)$-fraction of the points correctly, or (NO case) any degree-$d$ PTF classifies at most a $(1/2+\epsilon)$-fraction of the points correctly. This strengthens to all constant degrees the previous NP-hardness of learning using degree-2 PTFs shown by Diakonikolas et al. (2011). The latter result had remained the only progress over the works of Feldman et al. (2006) and Guruswami et al. (2006), which rule out weak proper learning of adversarially noisy halfspaces.

10:40
Active Tolerant Testing

ABSTRACT. In this work, we give the first algorithms for tolerant testing of nontrivial classes in the active model: estimating the distance of a target function to a hypothesis class C with respect to some arbitrary distribution D, using only a small number of label queries to a polynomial-sized pool of unlabeled examples drawn from D. Specifically, we show that for the class C of unions of d intervals on the line, we can estimate the error rate of the best hypothesis in the class to an additive error $\epsilon$ using only $O(\frac{1}{\epsilon^6}\log \frac{1}{\epsilon})$ label queries to an unlabeled pool of size $O(\frac{d}{\epsilon^2}\log \frac{1}{\epsilon})$. The key point here is that the number of labels needed is independent of the VC dimension of the class. This extends the work of Balcan et al. (2012), who solved the non-tolerant testing problem for this class (distinguishing the zero-error case from the case that the best hypothesis in the class has error greater than $\epsilon$).

We also consider the related problem of estimating the performance of a given learning algorithm A in this setting. That is, given a large pool of unlabeled examples drawn from distribution D, can we, from only a few label queries, estimate how well A would perform if the entire dataset were labeled? We focus on k-Nearest Neighbor style algorithms, and also show how our results can be applied to the problem of hyperparameter tuning (selecting the best value of k for the given learning problem).

10:50
Time-Space Tradeoffs for Learning Finite Functions from Random Tests, with Applications to Polynomials

ABSTRACT. We develop an extension of recent analytic methods for obtaining time-space tradeoff lower bounds for problems of learning from uniformly random test samples. With our methods we can obtain bounds for learning finite functions from tests even when the space of tests is significantly smaller than the space of inputs and the feedback from each test can come from an arbitrary finite set.

At the core of our results, we reduce the time-space complexity of learning from random tests to the question of how much the corresponding evaluation matrix amplifies the 2-norms of ``almost uniform'' probability distributions. To analyze the latter, we formulate it as a semidefinite program, and we analyze its dual. In order to handle multivalued feedback from tests, we apply this norm amplification analysis to complex matrices.

As applications of our new techniques, we show that any algorithm that learns $m$-variate polynomial functions of degree at most $d$ over $\mathbb{F}_2$ with success at least $2^{-O(m)}$ from evaluations on randomly chosen inputs requires either space $\Omega(mn/d)$ or time $2^{\Omega(m/d)}$, where $n=(m/d)^{\Theta(d)}$ is the dimension of the space of such polynomials. These bounds are asymptotically optimal for polynomials of arbitrary constant degree, since they match the tradeoffs achieved by natural learning algorithms for the problems. We also extend these results to learning linear and quadratic polynomials over any odd prime field $\mathbb{F}_p$, where we show that $\Omega(mn\log_2 p)$ space or $p^{\Omega(m)}$ time is required.

To derive our bounds for learning degree-$d$ polynomials over finite fields, we show that analyzing the dual of the semidefinite program corresponds to studying the distribution of the bias of all degree-$d$ polynomials with respect to uniformly random inputs. For example, in the case of polynomials over $\mathbb{F}_2$, the distribution of bias corresponds to the weight distribution of Reed-Muller codes over $\mathbb{F}_2$.
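
For concreteness, the bias of a polynomial $p$ on a uniformly random input $x$ is the standard quantity (this definition is supplied for the reader; it is not spelled out in the abstract):

    $\mathrm{bias}(p) \;=\; \bigl|\,\mathbb{E}_{x}[(-1)^{p(x)}]\,\bigr| \;=\; \bigl|\,1 - 2\Pr_x[p(x)=1]\,\bigr|$

so for $p$ over $\mathbb{F}_2$ the bias equals $|1 - 2\,\mathrm{wt}(p)/2^m|$, where $\mathrm{wt}(p)$ is the Hamming weight of $p$'s truth table -- the link to the Reed-Muller weight distribution.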

11:00
Empirical bounds for functions with weak interactions

ABSTRACT. We provide sharp empirical estimates of expectation, variance and normal approximation for a class of statistics whose variation in any argument does not change too much when another argument is modified. Examples of such weak interactions are furnished by U- and V-statistics, Lipschitz L-statistics and various error functionals of $L_2$-regularized algorithms and Gibbs algorithms.
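
As a concrete member of this class, an order-2 U-statistic averages a kernel h over all unordered pairs of sample points; with $h(x,y) = (x-y)^2/2$ it recovers the unbiased sample variance. A minimal sketch:

    from itertools import combinations

    def u_statistic(sample, h):
        # Average of the kernel h over all unordered pairs.
        pairs = list(combinations(sample, 2))
        return sum(h(x, y) for x, y in pairs) / len(pairs)

    # Unbiased sample variance as a U-statistic:
    # u_statistic(data, lambda x, y: (x - y) ** 2 / 2)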

11:10
A Direct Sum for Information Learners

ABSTRACT. How many bits of information are required to PAC learn a class of hypotheses of VC dimension $d$? The mathematical setting we follow is that of Bassily et al. (2018), where the value of interest is the mutual information $I(S;A(S))$ between the input sample $S$ and the hypothesis output by the learning algorithm $A$. We introduce a class of functions of VC dimension $d$ over the domain $X$ with information complexity at least $\Omega(d \log \log (|X|/d))$ bits for any consistent and proper algorithm (deterministic or random). Bassily et al. proved a similar (but quantitatively weaker) result for the case $d=1$.

The above result is in fact a special case of a more general phenomenon we explore. We define the notion of *information complexity* of a given class of functions H. Roughly speaking, it is the minimum amount of information that an algorithm for H must use to ensure consistency and properness. We prove a direct sum result for information complexity in this context: informally, the information complexity is additive when combining several classes.

11:20
Exact and Robust Conformal Inference Methods for Predictive Machine Learning With Dependent Data

ABSTRACT. We extend conformal inference to general settings that allow for time series data. Our proposal is developed as a randomization method and accounts for potential serial dependence by including block structures in the permutation scheme. As a result, the proposed method retains exact, model-free validity when the data are i.i.d. or, more generally, exchangeable, similar to usual conformal inference methods. When exchangeability fails, as is the case for common time series data, the proposed approach is approximately valid under weak assumptions on the conformity score.
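
A minimal sketch of the block-randomization idea, assuming a one-dimensional series of conformity scores and circular block shifts as the permutation scheme (both are assumptions; the paper's scheme is more general):

    import numpy as np

    def block_pvalue(scores, block_len, n_perm=999, rng=np.random.default_rng()):
        # Compare the last block's mean score against its distribution
        # under random circular shifts, which preserve serial dependence
        # within blocks.
        scores = np.asarray(scores, dtype=float)
        observed = scores[-block_len:].mean()
        stats = np.array([
            np.roll(scores, rng.integers(len(scores)))[-block_len:].mean()
            for _ in range(n_perm)
        ])
        return (1 + (stats >= observed).sum()) / (n_perm + 1)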

11:30
Approximation beats concentration? An approximation view on inference with smooth radial kernels

ABSTRACT. Positive definite kernels and their associated Reproducing Kernel Hilbert Spaces provide a mathematically compelling and practically competitive framework for learning from data.

In this paper we take the approximation theory point of view to explore various aspects of smooth kernels related to their inferential properties. We analyze the eigenvalue decay of kernel operators and matrices, properties of eigenfunctions/eigenvectors, and ``Fourier'' coefficients of functions in the kernel space restricted to a discrete set of data points. We also investigate the fitting capacity of kernels, giving explicit bounds on the fat-shattering dimension of balls in Reproducing Kernel Hilbert Spaces. Interestingly, the same properties that make kernels very effective approximators for functions in their ``native'' kernel space also limit their capacity to represent arbitrary functions. We discuss various implications, including those for gradient descent type methods.

It is important to note that most of our bounds are measure independent. Moreover, at least in moderate dimension, the bounds for eigenvalues are much tighter than those obtainable from the usual matrix concentration results. For example, we see that the eigenvalues of kernel matrices show nearly exponential decay, with constants depending only on the kernel and the domain. We call this the ``approximation beats concentration'' phenomenon: even when the data are sampled from a probability distribution, some of their aspects are better understood in terms of approximation theory.
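
The decay is easy to observe numerically; a quick illustration with a Gaussian kernel on points drawn from an interval (the bandwidth and sample size are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(500, 1))       # data on an interval
    K = np.exp(-(X - X.T) ** 2)                 # Gaussian kernel matrix
    eigs = np.sort(np.linalg.eigvalsh(K))[::-1]
    print(eigs[:10] / eigs[0])                  # ratios drop roughly geometrically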

11:40
Approximate Nearest Neighbors in Limited Space

ABSTRACT. We consider the $(1+\epsilon)$-approximate nearest neighbor search problem: given a set $X$ of $n$ points in a $d$-dimensional space, build a data structure that, given any query point $y$, finds a point $x \in X$ whose distance to $y$ is at most $(1+\epsilon) \min_{x \in X} \|x-y\|$ for an accuracy parameter $\epsilon \in (0,1)$. Our main result is a data structure that occupies only $O(\epsilon^{-2} n \log(n) \log(1/\epsilon))$ bits of space, assuming all point coordinates are integers in the range $\{-n^{O(1)} \ldots n^{O(1)}\}$, i.e., the coordinates have $O(\log n)$ bits of precision. This improves over the best previously known space bound of $O(\epsilon^{-2} n \log(n)^2)$, obtained via the randomized dimensionality reduction method of Johnson and Lindenstrauss (1984). We also consider the more general problem of estimating all distances from a collection of query points to all data points $X$, and provide almost tight upper and lower bounds for the space complexity of this problem.
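
For reference, the Johnson-Lindenstrauss baseline mentioned above amounts to a random projection to $k = O(\epsilon^{-2} \log n)$ dimensions followed by a linear scan (a sketch of the baseline only, not the paper's improved structure; the constant 8 is an arbitrary choice):

    import numpy as np

    def jl_index(X, eps, rng=np.random.default_rng()):
        n, d = X.shape
        k = int(np.ceil(8 * np.log(n) / eps ** 2))   # JL target dimension
        P = rng.normal(size=(d, k)) / np.sqrt(k)     # random projection
        return X @ P, P

    def query(Xp, P, y):
        yp = y @ P                                   # project the query
        return int(np.argmin(((Xp - yp) ** 2).sum(axis=1)))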

11:50
Efficient Active Learning of Sparse Halfspaces

ABSTRACT. We study the problem of efficient PAC active learning of $d$-dimensional linear classifiers, where the goal is to learn a classifier with low error using as few label queries as possible. Under the extra assumption that there is a $t$-sparse linear separator that performs well on the data ($t \ll d$), we would like our active learning algorithm to be {\em attribute efficient}, i.e., to have label complexity sublinear in $d$. In this paper, we provide an efficient algorithm that achieves this goal. Under certain distributional assumptions on the data, our algorithm achieves a label complexity of $O(t \cdot \mathrm{polylog}(d, 1/\epsilon))$. In contrast, existing algorithms in this setting are either computationally inefficient or have label complexity bounds polynomial in $d$ or $1/\epsilon$.

12:00-14:00 Lunch Break
12:00-14:00 Session 24

Poster Session