Beyond Hartigan Consistency: Merge Distortion Metric for Hierarchical Clustering
SPEAKER: unknown
ABSTRACT. Hierarchical clustering is a popular method for analyzing data that
associates a tree to a dataset. Hartigan consistency has been used
extensively as a framework to analyze such clustering algorithms from a
statistical point of view. Still, as we show in the paper, a tree which is
Hartigan consistent with a given density can look very different from the
correct limit tree. Specifically, Hartigan consistency permits two types of
undesirable configurations which we term \emph{over-segmentation} and \emph{improper
nesting}. Moreover, Hartigan consistency is a limit property and does not
directly quantify the difference between trees.
In this paper we identify two limit properties, separation and minimality,
which address both over-segmentation and improper nesting and together imply
(but are not implied by) Hartigan consistency. We proceed to introduce a
merge distortion distance between hierarchical clusterings and show that
convergence in that distance implies both separation and minimality. We also
show that uniform separation and minimality imply convergence in the merge
distortion distance. Furthermore, we show that our merge distortion distance
is stable under perturbations of the density.
Finally, we demonstrate the applicability of these concepts by proving
convergence results for two clustering algorithms. First, we show
convergence (and hence separation and minimality) of the recent robust single
linkage algorithm of Chaudhuri and Dasgupta (2010). Second, we provide
convergence results on manifolds for topological split tree clustering.
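As a rough illustration only (not the paper's formal definition over densities), suppose each hierarchical clustering over a finite reference set is summarized by a merge-height matrix M, with M[i, j] the highest level at which points i and j still lie in a common cluster; the merge distortion between two clusterings can then be read off as the largest pointwise disagreement between the two matrices. The function name and matrix convention below are hypothetical.

```python
import numpy as np

def merge_distortion(M1: np.ndarray, M2: np.ndarray) -> float:
    """Sup-norm distance between two merge-height matrices (illustrative)."""
    assert M1.shape == M2.shape, "matrices must cover the same reference set"
    return float(np.max(np.abs(M1 - M2)))
```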
Simple, Efficient, and Neural Algorithms for Sparse Coding
SPEAKER: unknown
ABSTRACT. Sparse coding is a basic task in many fields including signal processing, neuroscience and machine learning where the goal is to learn a basis that enables a sparse representation of a given set of data, if one exists. Its standard formulation is as a non-convex optimization problem which is solved in practice by heuristics based on alternating minimization. Recent work has resulted in several algorithms for sparse coding with provable guarantees, but somewhat surprisingly these are outperformed by the simple alternating minimization heuristics. Here we give a general framework for understanding alternating minimization which we leverage to analyze existing heuristics and to design new ones also with provable guarantees. Some of these algorithms seem implementable on simple neural architectures, which was the original motivation of Olshausen and Field \cite{OF} in introducing sparse coding. We also give the first efficient algorithm for sparse coding that works almost up to the information theoretic limit for sparse recovery on incoherent dictionaries. All previous algorithms that approached or surpassed this limit run in time exponential in some natural parameter. Finally, our algorithms improve upon the sample complexity of existing approaches. We believe that our analysis framework will have applications in other settings where simple iterative algorithms are used.
Efficient Representations for Lifelong Learning and Autoencoding
SPEAKER: unknown
ABSTRACT. It has been a long-standing goal in machine learning, as well as in AI more generally, to develop lifelong learning systems that learn many different tasks over time, and reuse insights from tasks already learned, ``learning to learn'' as they do so. In this work we pose and provide efficient algorithms for several natural theoretical formulations of this goal. Specifically, we consider the problem of learning many different target functions over time that share certain commonalities which are initially unknown to the learning algorithm. Our aim is to learn new internal representations as the algorithm learns new target functions; these representations capture the commonality and allow subsequent learning tasks to be solved more efficiently and from less data. We develop efficient algorithms for two very different kinds of commonalities that target functions might share: one based on learning common low-dimensional subspaces and unions of low-dimensional subspaces, and one based on learning nonlinear Boolean combinations of features. Our algorithms for learning Boolean feature combinations additionally have a dual interpretation, and can be viewed as giving an efficient procedure for constructing near-optimal sparse Boolean autoencoders under a natural ``anchor-set'' assumption.
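As a toy illustration of the first kind of commonality (a shared low-dimensional subspace), the sketch below estimates a common subspace from previously learned linear predictors and then fits a new task inside it. This is a hypothetical simplification for intuition only, not the paper's algorithm; the function names and the SVD-based estimate are assumptions.

```python
import numpy as np

def shared_subspace(task_weights, r):
    """Estimate a common r-dimensional subspace from previously learned
    per-task linear predictors (columns of W), via the top-r left
    singular vectors."""
    W = np.column_stack(task_weights)              # shape (d, num_tasks)
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    return U[:, :r]                                # orthonormal basis, (d, r)

def fit_new_task_in_subspace(U, X, y):
    """Fit a new regression task using only the shared representation:
    project the features onto U and solve ordinary least squares."""
    Z = X @ U                                      # reduced features, (n, r)
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return U @ beta                                # predictor back in R^d
```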
ABSTRACT. We study a statistical model for the \emph{tensor principal component analysis problem} introduced by Montanari and Richard:
Given an order-$3$ tensor $\bT$ of the form $\bT = \tau \cdot v_0^{\tensor 3} + \bA$, where $\tau \geq 0$ is a signal-to-noise ratio, $v_0$ is a unit vector, and $\bA$ is a random noise tensor, the goal is to recover the planted vector $v_0$.
For the case that $\bA$ has iid standard Gaussian entries, we give an efficient algorithm to recover $v_0$ whenever $\tau \gg n^{3/4} \log(n)^{1/4}$, and certify that the recovered vector is close to a maximum likelihood estimator, all with high probability over the random choice of $\bA$.
The previous best algorithms with provable guarantees required $\tau \geq \Omega(n)$.
In the regime $\tau \ll n$, natural tensor-unfolding-based spectral relaxations for the underlying optimization problem break down.
To go beyond this barrier, we use convex relaxations based on the sum-of-squares method.
Indeed, our recovery algorithm proceeds by rounding a degree-$4$ sum-of-squares relaxation of the maximum-likelihood-estimation problem for the statistical model.
To complement our algorithmic results, we show that degree-$4$ sum-of-squares relaxations break down for $\tau \ll n^{3/4}$, which demonstrates that improving our current guarantees (by more than logarithmic factors) would require new techniques or might even be intractable.
Finally, we show how to exploit additional problem structure in order to solve our sum-of-squares relaxations, up to some approximation, very efficiently.
Our fastest algorithm runs in nearly-linear time using shifted (matrix) power iteration and has guarantees similar to the above.
The analysis of this algorithm also confirms a conjecture of Montanari and Richard about singular vectors of tensor unfoldings.
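The following toy sketch recovers the planted vector from an unfolding of the tensor by plain matrix power iteration, in the spirit of the tensor-unfolding approach of Montanari and Richard referenced above; it is not the paper's sum-of-squares or shifted power iteration algorithm, and the chosen signal-to-noise ratio is only an illustrative value above $n^{3/4}$.

```python
import numpy as np

def recover_via_unfolding(T, iters=100, seed=0):
    """Recover the planted vector from the n x n^2 unfolding M of T by
    plain power iteration on M M^T (i.e. the top left singular vector of M)."""
    n = T.shape[0]
    M = T.reshape(n, n * n)
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = M @ (M.T @ v)
        v /= np.linalg.norm(v)
    return v

# Toy instance T = tau * v0^{(x)3} + A with iid standard Gaussian noise.
n = 50
tau = 5 * n ** 0.75                                # illustrative SNR above n^{3/4}
rng = np.random.default_rng(1)
v0 = rng.standard_normal(n)
v0 /= np.linalg.norm(v0)
T = tau * np.einsum('i,j,k->ijk', v0, v0, v0) + rng.standard_normal((n, n, n))
print(abs(recover_via_unfolding(T) @ v0))          # correlation with v0 (sign-free)
```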
ABSTRACT. In this work we study the widely used spectral clustering algorithms, which partition a graph into k clusters by (1) embedding the vertices of the graph into a low-dimensional space using the bottom eigenvectors of the Laplacian matrix, and (2) partitioning the embedded points via a k-means algorithm. We show that, for a wide class of well-clustered graphs, spectral clustering algorithms can give a good approximation of the optimal clustering. To the best of our knowledge, this is the first theoretical analysis of spectral clustering algorithms for a wide family of graphs, even though this approach was proposed in the early 1990s and is used extensively in applications.
We also give a nearly-linear time algorithm for partitioning well-clustered graphs, which is based on heat kernel embeddings and approximate nearest neighbor data structures.
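For concreteness, the standard pipeline referred to above can be sketched as follows; the use of the normalized Laplacian and the particular k-means initialization are illustrative choices rather than the exact procedure analyzed in the paper.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.cluster.vq import kmeans2

def spectral_clustering(A, k):
    """Spectral clustering pipeline: embed vertices with the bottom k
    eigenvectors of the normalized Laplacian, then run k-means on the
    embedded points.  A is a symmetric adjacency matrix (dense ndarray)."""
    deg = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    L = np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, vecs = eigh(L, subset_by_index=[0, k - 1])  # bottom k eigenvectors
    _, labels = kmeans2(vecs, k, minit='++')       # cluster the embedded points
    return labels
```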
ABSTRACT. This paper revisits the online PCA problem.
Given a stream of $n$ vectors $x_t \in \R^d$ (columns of $X$) the algorithm must output $y_t \in \R^\ell$ (columns of $Y$) before receiving $x_{t+1}$.
The goal of online PCA is to simultaneously minimize the target dimension $\ell$ and the error $\|X - (XY^\pinv)Y\|^2$.
We describe two simple and deterministic algorithms.
The first receives a parameter $\Delta$ and guarantees that $\|X - (XY^\pinv)Y\|^2$ is not significantly larger than $\Delta$.
It requires a target dimension of $\ell = O(k/\eps)$ for any $k,\eps$ such that $\Delta \ge \eps\sigma_1^2 + \sigma_{k+1}^2$.
The second receives $k$ and $\eps$ and guarantees that $\|X - (XY^\pinv)Y\|^2 \le \eps\sigma_1^2 + \sigma_{k+1}^2$.
It requires a target dimension of $\ell = O(k\log n/\eps^2)$.
Different models and algorithms for online PCA have been considered in the past.
Ours is the first that achieves a bound on the spectral norm of the residual matrix.
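The error measure used above can be evaluated offline as follows; this is only a sketch of the quantity being bounded (with a hypothetical function name), not of the streaming algorithms themselves.

```python
import numpy as np

def online_pca_error(X, Y):
    """Spectral-norm-squared reconstruction error ||X - (X Y^+) Y||^2,
    where X is (d, n) with columns x_t and Y is (ell, n) with columns y_t."""
    residual = X - (X @ np.linalg.pinv(Y)) @ Y
    return np.linalg.norm(residual, ord=2) ** 2
```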
Correlation Clustering with Noisy Partial Information
SPEAKER: unknown
ABSTRACT. In this paper, we propose and study a semi-random model for the Correlation Clustering problem on arbitrary graphs $G$. We give two approximation algorithms for Correlation Clustering instances from this model. The first algorithm finds a solution of value $(1+\delta)\,\mathit{optcost} + O_{\delta}(n\log^3 n)$ with high probability, where $\mathit{optcost}$ is the value of the optimal solution (for every $\delta > 0$). The second algorithm finds the ground truth clustering with an arbitrarily small classification error $\eta$ (under some additional assumptions on the instance).
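For reference, the objective such guarantees refer to is the number of pairwise disagreements between a clustering and the observed similarity labels; a minimal sketch of that cost is given below, with the sign-matrix input convention being an illustrative assumption rather than the paper's notation.

```python
import numpy as np

def correlation_clustering_cost(signs, labels):
    """Number of disagreements between a clustering and pairwise labels:
    a '+' edge cut across clusters or a '-' edge kept inside a cluster
    contributes 1 to the cost.
    signs  : (n, n) symmetric matrix with entries +1 / -1 (0 means no edge)
    labels : (n,) cluster assignment of each vertex."""
    same = labels[:, None] == labels[None, :]
    disagree = ((signs == 1) & ~same) | ((signs == -1) & same)
    return int(np.triu(disagree, k=1).sum())
```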
Stochastic Block Model and Community Detection in the Sparse Graphs: A spectral algorithm with optimal rate of recovery
SPEAKER: unknown
ABSTRACT. In this paper, we present and analyze a simple and robust spectral algorithm
for the stochastic block model with $k$ blocks, for any fixed $k$. Our algorithm works with graphs having constant edge density,
under an optimal condition on the gap between the density inside a block and the density between the blocks.
As a by-product, we settle an open question posed by Abbe et al. concerning censored block models.
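A generic spectral recovery procedure for a k-block model in the sparse regime might look like the sketch below; the degree-trimming step is a standard regularization for graphs of constant average degree, and none of the thresholds or steps should be read as the paper's exact algorithm.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_sbm(A, k, trim_factor=10.0):
    """Generic spectral recovery for a k-block model in the sparse regime:
    drop vertices of unusually high degree, take the k leading eigenvectors
    of the trimmed adjacency matrix, and cluster the rows with k-means."""
    deg = A.sum(axis=1)
    keep = deg <= trim_factor * deg.mean()
    A_trim = A * np.outer(keep, keep)              # zero out trimmed rows/columns
    vals, vecs = np.linalg.eigh(A_trim)
    top = vecs[:, np.argsort(np.abs(vals))[-k:]]   # k eigenvectors of largest magnitude
    _, labels = kmeans2(top, k, minit='++')
    return labels
```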
ABSTRACT. The Laplacian matrices of graphs arise in many fields including Machine Learning, Computer Vision, Optimization, Computational Science, and of course Network Analysis. We will explain what these matrices are and why they arise in so many applications. We will then survey recent progress on the design of algorithms that allow us to solve systems of linear equations in Laplacian matrices in nearly linear time.
In particular, we will show how fast algorithms for graph sparsification directly lead to fast Laplacian system solvers. As an application, we will explain how Laplacian system solvers can be used to quickly solve linear programs arising from natural graph problems.
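As a small, self-contained example of the kind of system the talk is about, the sketch below builds a graph Laplacian and solves L x = b with ordinary conjugate gradients; the nearly-linear-time solvers surveyed in the talk rely on sparsification and preconditioning techniques not shown here.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

def graph_laplacian(edges, n):
    """Build the sparse Laplacian L = D - A of an undirected graph."""
    rows, cols = zip(*edges)
    A = sp.coo_matrix((np.ones(len(edges)), (rows, cols)), shape=(n, n))
    A = A + A.T
    D = sp.diags(np.asarray(A.sum(axis=1)).ravel())
    return (D - A).tocsr()

# Solve L x = b on a 5-cycle.  The Laplacian is singular, so b must sum to
# zero; conjugate gradients then finds a solution (up to the all-ones shift).
L = graph_laplacian([(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)], 5)
b = np.array([1.0, -1.0, 0.0, 0.0, 0.0])
x, info = cg(L, b, atol=1e-10)
print(x, info)                                     # info == 0 means convergence
```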