SPAA 2016: 28TH ACM SYMPOSIUM ON PARALLELISM IN ALGORITHMS AND ARCHITECTURES
PROGRAM FOR WEDNESDAY, JULY 13TH

09:00-10:20 Session 11: Concurrent Data Structures
09:00
Lock-free Transactions without Aborts for Linked Data Structures
SPEAKER: unknown

ABSTRACT. Non-blocking data structures allow scalable and thread-safe accesses to shared data. They provide individual operations that appear to execute atomically. However, it is often desirable to execute multiple operations atomically in a transactional manner. Previous solutions, such as software transactional memory (STM) and transactional boosting, manage transaction synchronization in an external layer separate from the data structure's own thread-level concurrency control. Although this reduces programming effort, it leads to overhead associated with additional synchronization and the need to roll back aborted transactions.

In this work, we present a new methodology for transforming high-performance lock-free linked data structures into high-performance lock-free transactional linked data structures without revamping the data structures' original synchronization design. Our approach leverages the semantic knowledge of the data structure to eliminate the overhead of false conflicts and rollbacks. We encapsulate all operations, operands, and transaction status in a transaction descriptor, which is shared among the nodes accessed by the same transaction. We coordinate threads to help finish the remaining operations of delayed transactions based on their transaction descriptors. For each data structure operation, we introduce a procedure that allows the interpretation of the logical status of a node based on the transaction status. Write operations are invisible to operations outside the scope of the transaction until the transaction commits.

In our experimental evaluation using transactions with randomly generated operations, our lock-free transactional lists and skiplists outperform the transactional boosted ones by 40% on average and by as much as 125% for large transactions. Our methodology outperforms the alternative STM-based approaches by a factor of 3 to 10 across all scenarios. More importantly, we incur 4 to 6 orders of magnitude fewer spurious conflicts than the alternatives.
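
As a rough illustration of the descriptor idea (a sequential sketch only: the class and field names are hypothetical, and the paper's actual lock-free synchronization via CAS and thread helping is omitted), a node's logical status can be interpreted from the status of the transaction that wrote it:

```python
from enum import Enum

class TxStatus(Enum):
    ACTIVE = 0
    COMMITTED = 1
    ABORTED = 2

class Descriptor:
    """Shared per-transaction record: the status plus remaining operations."""
    def __init__(self, ops):
        self.status = TxStatus.ACTIVE
        self.ops = ops  # list of (operation, operand) pairs still to perform

class Node:
    def __init__(self, key, inserted_by):
        self.key = key
        self.desc = inserted_by  # descriptor of the transaction that wrote this node

def logically_present(node):
    # A node written by an uncommitted transaction is invisible to operations
    # outside that transaction; its logical status follows the descriptor.
    return node.desc is None or node.desc.status is TxStatus.COMMITTED
```

Committing the transaction flips the single shared status field, which atomically makes all its writes logically visible at once.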

09:20
Concurrent Search Data Structures Can Be Blocking and Practically Wait-Free
SPEAKER: Tudor David

ABSTRACT. We argue that there is virtually no practical situation in which one should seek a "theoretically wait-free" algorithm at the expense of a state-of-the-art blocking algorithm in the case of search data structures: blocking algorithms are simple, fast, and can be made "practically wait-free".

We draw this conclusion based on the most exhaustive study of blocking data structures to date. We consider (a) different data structures of different sizes, (b) numerous uniform and non-uniform workloads, representative of a wide range of practical scenarios, with different percentages of update operations, (c) with and without delayed threads, (d) on different hardware technologies, including processors providing HTM instructions.

We explain our claim that blocking search data structures are practically wait-free through an analogy with the birthday paradox, revealing that, in state-of-the-art algorithms implementing such data structures, the probability of conflicts is extremely small. When conflicts occur as a result of context switches and interrupts, we show HTM-based locks enable blocking algorithms to cope wit them.

09:40
Fast and Robust Memory Reclamation for Concurrent Data Structures
SPEAKER: unknown

ABSTRACT. In concurrent systems without automatic garbage collection, it is challenging to determine when it is safe to reclaim memory, especially for lock-free data structures. Existing concurrent memory reclamation schemes are either fast but do not tolerate process delays, robust to delays but with high overhead, or both robust and fast but narrowly applicable.

This paper proposes QSense, a novel concurrent memory reclamation technique. QSense is a hybrid technique with a fast path and a fallback path. In the common case (without process delays), a high-performing memory reclamation scheme is used (fast path). If process delays block memory reclamation through the fast path, a robust fallback path is used to guarantee progress. The fallback path uses hazard pointers, but avoids their notorious need for frequent and expensive memory fences.

QSense is widely applicable, as we illustrate through several lock-free data structure algorithms. Our experimental results show that QSense has an overhead comparable to the fastest memory reclamation techniques, while still tolerating delays.

10:00
Benchmarking Concurrent Priority Queues: Performance of k-LSM and Related Data Structures
SPEAKER: unknown

ABSTRACT. A number of concurrent, relaxed priority queues have recently been proposed and implemented. Results are commonly reported for a \emph{throughput benchmark} that uses a uniform distribution of keys drawn from a large integer range, and mostly for single systems. We have conducted more extensive benchmarking of three recent, relaxed priority queues on four different types of systems with different key ranges and distributions. While we can show superior throughput and scalability for our own $k$-LSM priority queue for the uniform key distribution, the picture changes drastically for other distributions, both with respect to achieved throughput and relative merit of the priority queues. The throughput benchmark alone is thus not sufficient to characterize the performance of concurrent priority queues. Our benchmark code and $k$-LSM priority queue are publicly available to foster future comparison.

10:10
Fast Concurrent Cuckoo Kick-out Eviction Schemes for High-Density Tables

ABSTRACT. Cuckoo hashing guarantees constant-time lookups regardless of table density, making it a viable candidate for high-density tables. Cuckoo hashing insertions perform poorly at high table densities, however. In this paper, we mitigate this problem through the introduction of novel kick-out eviction algorithms. Experimentally, our algorithms reduce the number of bins viewed per insertion for high-density tables by as much as a factor of ten.

We also introduce an optimistic concurrency scheme for transactional multi-writer cuckoo hash tables (not using hardware transactional memory). For delete-light workloads, one of our kick-out algorithms avoids all competition between insertions with high probability, and significantly reduces transaction-abort frequency. This result is extended to arbitrary workloads using a new synchronization mechanism called a claim flag.
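
The kick-out mechanism itself can be sketched in a few lines. Below is a generic random-walk eviction scheme for a sequential two-choice cuckoo table, not the paper's specific kick-out algorithms or its concurrency machinery; the class name, hash constants, and kick limit are illustrative choices:

```python
import random

class CuckooTable:
    """Two-choice cuckoo hash table with random-walk kick-out eviction."""
    def __init__(self, nbins=64, max_kicks=500):
        self.bins = [None] * nbins
        self.n = nbins
        self.max_kicks = max_kicks

    def _slots(self, key):
        # Two simple multiplicative hashes (illustrative, not cryptographic).
        return ((key * 0x9E3779B1) % self.n, (key * 0x85EBCA77 + 1) % self.n)

    def insert(self, key):
        for _ in range(self.max_kicks):
            h1, h2 = self._slots(key)
            for h in (h1, h2):
                if self.bins[h] is None:
                    self.bins[h] = key
                    return True
            # Both candidate bins full: kick out a random occupant and retry with it.
            victim = random.choice((h1, h2))
            key, self.bins[victim] = self.bins[victim], key
        return False  # give up; a real table would rehash or grow

    def lookup(self, key):
        # Constant-time lookup: a key can only ever sit in one of its two bins.
        return any(self.bins[h] == key for h in self._slots(key))
```

The kick limit bounds the eviction chain; the paper's contribution lies in choosing which occupant to evict so that far fewer bins are viewed at high densities.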

10:40-12:00 Session 12: Graph Algorithms
10:40
The Family Holiday Gathering Problem or Fair and Periodic Scheduling of Independent Sets
SPEAKER: unknown

ABSTRACT. We introduce and examine the {\em Holiday Gathering Problem} which models the difficulty in scheduling non-interfering transmissions in (wireless) networks. Our goal is to schedule transmission rounds so that the antennas that transmit in a given round will not interfere with each other, i.e.\ all of the other antennas that can interfere will not transmit in that round, while minimizing the number of consecutive rounds in which antennas do not transmit.

Following a long tradition in Computer Science, we introduce the problem by an intuitive story. Assume we live in a perfect world where families enjoy being together. Consequently, parents, whose children are in a monogamous relation, would like to have {\em all} their children at home for the holiday meal (i.e.\ there is a special pleasure gained by hosting all the children simultaneously and they wish to have this event occur as frequently as possible). However, the conflict is that the in-laws would also be happiest if all their children come to them. Our goal can be described as scheduling an infinite sequence of ``guest lists'' in a distributed setting so that each child knows where it will spend the holiday. The holiday gathering problem is closely related to several classical problems in computer science, such as the {\em dining philosophers problem} on a general graph and periodic scheduling.

The process of the scheduling should be done with no further communication after initialization, by using a small amount of local data. The result should minimize the number of consecutive holidays where the family is not together. In a good sequence this number depends on local properties of the parents (e.g., their number of children). Furthermore, solutions that are periodic, i.e.\ a gathering occurs every fixed number of rounds, are useful for maintaining a small amount of information at each node and reducing the amount of ongoing communication and computation.

Our algorithmic techniques show interesting connections between periodic scheduling, coloring, and universal prefix-free encodings. We develop a coloring-based construction where the period of each node colored with color $c$ is at most $2^{1+\log^*c}\cdot\prod_{i=0}^{\log^*c} \log^{(i)}c$ (where $\log^{(i)}$ means iterating the $\log$ function $i$ times). This is achieved via a connection with {\it prefix-free encodings}. We prove that this is the best possible for coloring-based solutions. We also show a construction with period at most $2d$ for a node of degree $d$.
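
The coloring connection can be made concrete with the simplest periodic schedule: properly color the conflict graph with $k$ colors and let color class $r \bmod k$ transmit in round $r$, which yields non-interfering rounds with period $k$. This is only the naive baseline that the paper's prefix-free-encoding construction refines; the greedy coloring below is a standard placeholder:

```python
def greedy_coloring(adj):
    """Greedy proper coloring of a graph given as {node: set(neighbors)}."""
    color = {}
    for v in sorted(adj):
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

def round_robin_schedule(adj, rounds):
    """Round r schedules color class (r mod k): each round is an independent
    set, and every node transmits with period exactly k."""
    color = greedy_coloring(adj)
    k = max(color.values()) + 1
    return [{v for v in adj if color[v] == r % k} for r in range(rounds)]
```

Each node's waiting time here is the global k; the paper's point is that a good schedule should instead depend on local parameters such as a node's own color or degree.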

11:00
Election vs. Selection: How Much Advice is Needed to Find the Largest Node in a Graph?
SPEAKER: unknown

ABSTRACT. Finding the node with the largest label in a network, modeled as an undirected connected graph, is one of the fundamental problems in distributed computing. This is the way in which leader election is usually solved. We consider two distinct tasks in which the largest-labeled node is found deterministically. In selection, this node must output 1 and all other nodes must output 0. In election, the other nodes must additionally learn the largest label. Our aim is to compare the difficulty of these two tasks executed under stringent running time constraints. The measure of difficulty is the amount of information that nodes of the network must initially possess in order to solve the given task in an imposed amount of time. Following the standard framework of algorithms with advice, this information (a single binary string) is provided to all nodes at the start by an oracle knowing the entire graph. The length of this string is called the size of advice. Consider the class of n-node graphs with any diameter diam ≤ D. If time is larger than diam, then both tasks can be solved without advice. For the task of election, we show that if time is smaller than diam, then the optimal size of advice is Θ(log n), and if time is exactly diam, then the optimal size of advice is Θ(log D). For the task of selection, the situation changes dramatically, even within the class of rings. Indeed, for the class of rings, we show that, if time is O(diam^ϵ), for any ϵ<1, then the optimal size of advice is Θ(log D), and, if time is Θ(diam) (and at most diam), then this optimal size is Θ(log log D).

11:20
Near-Optimal Distributed Algorithms for Fault-Tolerant Tree Structures
SPEAKER: unknown

ABSTRACT. Tree structures such as breadth-first search (BFS) trees and minimum spanning trees (MST) are among the most fundamental graph structures in distributed network algorithms. However, by definition, these structures are not robust against failures and even a single edge's removal can disrupt their functionality. A well-studied concept which attempts to circumvent this issue is Fault-Tolerant Tree Structures, where the tree gets augmented with additional edges from the network so that the functionality of the structure is maintained even when an edge fails. These structures, or other equivalent formulations, have been studied extensively from a centralized viewpoint\cite{PPFTBFS13, Nardelli1997, Chechik2010, Grandoni-Williams2012}. However, despite the fact that the main motivations come from distributed networks, their distributed construction has not been addressed before.

In this paper, we present distributed algorithms for constructing fault-tolerant BFS and MST structures. The time complexities of our algorithms are nearly optimal in the following strong sense: they almost match even the lower bounds for constructing (basic) BFS and MST trees.

11:40
Applications of Uniform Sampling: Densest Subgraph and Beyond
SPEAKER: unknown

ABSTRACT. In this paper we provide a framework to analyze the effect of uniform sampling on graph optimization problems. Interestingly, we apply this framework to a general class of graph optimization problems that we call \emph{heavy subgraph problems}, and show that uniform sampling preserves a $1-\epsilon$ approximate solution to these problems. This class contains many interesting problems such as densest subgraph, directed densest subgraph, densest bipartite subgraph, $d$-max cut, and $d$-sum-max clustering. As an immediate impact of this result, one can use uniform sampling to solve these problems in streaming, turnstile or Map-Reduce settings. Indeed, our results, by characterizing heavy subgraph problems, address Open Problem 13 at the IITK Workshop on Algorithms for Data Streams in 2006 regarding the effects of subsampling, in the context of graph streams.

The densest subgraph problem has several applications across different areas such as community detection, link spam detection, distance query indexing, analyzing communication in social networks, and computational biology. Bahmani, Kumar and Vassilvitskii in PVLDB 2012 consider the \emph{densest subgraph problem} in the streaming setting and present a $\frac{1}{2(1+\epsilon)}$-approximation streaming algorithm in $\log_{1+\epsilon}(n)$ passes. Recently, Bhattacharya et al. in STOC 2015 improve this result and provide the first one-pass algorithm for the densest subgraph problem in the streaming model with additions and deletions to its edges, i.e., for dynamic graph streams. They present a $(0.5-\epsilon)$-approximation algorithm using $\tilde{O}(n)$ space and $\tilde{O}(1)$ update time, where factors of $\epsilon$ and $\log(n)$ are suppressed in the $\tilde{O}$ notation.

In this paper we improve the $(0.5-\epsilon)$-approximation algorithm of Bhattacharya et al. by providing a $(1-\epsilon)$-approximation algorithm using $\tilde{O}(n)$ space and $\tilde{O}(1)$ update time. In simultaneous and independent works McGregor et al. and Mitzenmacher et al. provide similar results, which we compare to below. In addition to this, one can implement our algorithm in time $\tilde{O}(n^2)$ on a graph in the offline setting. To the best of our knowledge this is the fastest $(1-\epsilon)$-approximation algorithm for the densest subgraph problem in the offline setting.
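
To make the sampling framework concrete, here is a sketch pairing uniform edge sampling with Charikar's classical greedy peeling (a well-known offline 1/2-approximation, standing in here for an arbitrary offline solver; this is not the paper's algorithm, and the rescaling by 1/p is the natural unbiased estimate):

```python
import random

def peel_densest(n, edges):
    """Charikar's greedy peeling: repeatedly remove a minimum-degree vertex
    and return the best density |E(S)|/|S| seen along the way."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v); adj[v].add(u)
    m, best = len(edges), 0.0
    alive = set(adj)
    while alive:
        best = max(best, m / len(alive))
        v = min(alive, key=lambda x: len(adj[x]))  # peel a min-degree vertex
        m -= len(adj[v])
        for u in adj[v]:
            adj[u].discard(v)
        adj[v].clear()
        alive.remove(v)
    return best

def sampled_density(n, edges, p):
    """Estimate the densest-subgraph density from a uniform edge sample,
    rescaling the sampled density by 1/p."""
    sample = [e for e in edges if random.random() < p]
    return peel_densest(n, sample) / p
```

The heavy-subgraph characterization in the paper is what justifies that the sampled estimate stays within a $1-\epsilon$ factor of the true optimum.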

11:50
Brief Announcement: Relaxed Byzantine Vector Consensus
SPEAKER: unknown

ABSTRACT. Byzantine vector consensus requires that non-faulty processes reach agreement on a decision (or output) that is in the convex hull of the inputs at the non-faulty processes. Recent work has shown that, for n processes with up to f Byzantine failures, when the inputs are d-dimensional vectors of reals, n>=max(3f+1,(d+1)f+1) is the tight bound for synchronous systems, and n>=(d+2)f+1 is tight for approximate consensus in asynchronous systems.

Due to the dependence of the lower bound on vector dimension d, the number of processes necessary becomes large when the vector dimension is large. With the hope of reducing the lower bound on n, we propose relaxed versions of Byzantine vector consensus: k-relaxed Byzantine vector consensus and (delta,p)-relaxed Byzantine vector consensus. k-relaxed consensus only requires consensus for projections of inputs on every subset of k dimensions. (delta,p)-relaxed consensus requires that the output be within distance delta of the convex hull of the non-faulty inputs, where distance is defined using the Lp-norm. An input-dependent delta allows the distance from the non-faulty convex hull to be dependent on the maximum distance between the non-faulty inputs.

We show that for k-relaxed consensus and (delta,p)-relaxed consensus with constant delta>=0, the bound on n is identical to the bound stated above for the original vector consensus problem. On the other hand, when delta depends on the inputs, we show that the bound on n is smaller when d>=3. Input-dependent delta may be of interest in practice -- in essence, input-dependent delta scales with the spread of the inputs.

13:30-14:40 Session 13: Distributed Algorithms
13:30
The Cost of Unknown Diameter in Dynamic Networks
SPEAKER: unknown

ABSTRACT. For dynamic networks with {\em unknown diameter}, we prove novel lower bounds on the time complexity of a range of basic distributed computing problems. Together with trivial upper bounds under dynamic networks with {\em known diameter} for these problems, our lower bounds show that the complexities of all these problems are {\em sensitive} to whether the diameter is known to the protocol beforehand: Not knowing the diameter increases the time complexities by a large poly$(N)$ factor as compared to when the diameter is known, resulting in an exponential gap. Our lower bounds are obtained via communication complexity arguments and by reducing from the two-party {\sc DisjointnessCP} problem. We further prove that sometimes this large poly$(N)$ cost can be completely avoided if the protocol is given a good estimate on $N$. In other words, having such an estimate makes some problems no longer sensitive to unknown diameter.

13:50
Churn- and DoS-resistant Overlay Networks Based on Network Reconfiguration
SPEAKER: unknown

ABSTRACT. We present three robust overlay networks: First, we present a network that organizes the nodes into an expander and is resistant to even massive adversarial churn. Second, we develop a network based on the hypercube that maintains connectivity under adversarial DoS-attacks. For the DoS-attacks we use the notion of an Omega(log log n)-late adversary, which only has access to topological information that is at least Omega(log log n) rounds old. Finally, we develop a network that combines both churn- and DoS-resistance. The networks gain their robustness through constant network reconfiguration, i.e., the topology of the networks changes constantly. Our reconfiguration algorithms are based on node sampling primitives for expanders and hypercubes that allow each node to sample a logarithmic number of nodes uniformly at random in O(log log n) communication rounds. These primitives are specific to overlay networks and their optimal runtime represents an exponential improvement over known techniques. Our results have a wide range of applications, for example in the area of scalable and robust peer-to-peer systems.

14:10
Fast Distributed Algorithms for Connectivity and MST in Large Graphs
SPEAKER: unknown

ABSTRACT. Motivated by the increasing need to understand the algorithmic foundations of distributed large-scale graph computations, we study a number of fundamental graph problems in a message-passing model for distributed computing where $k \geq 2$ machines jointly perform computations on graphs with $n$ nodes (typically, $n \gg k$). The input graph is assumed to be initially randomly partitioned among the $k$ machines, a common implementation in many real-world systems. Communication is point-to-point, and the goal is to minimize the number of communication rounds of the computation.

Our main result is an (almost) optimal distributed randomized algorithm for graph connectivity. Our algorithm runs in $\tilde{O}(n/k^2)$ rounds ($\tilde{O}$ notation hides a $\polylog(n)$ factor and an additive $\polylog(n)$ term). This improves over the best previously known bound of $\tilde{O}(n/k)$ [Klauck et al., SODA 2015], and is optimal (up to a polylogarithmic factor) in view of an existing lower bound of $\tilde{\Omega}(n/k^2)$. Our improved algorithm uses several techniques, including linear graph sketching, that prove useful in the design of efficient distributed graph algorithms. We then present fast randomized algorithms for computing minimum spanning trees, (approximate) min-cuts, and for many graph verification problems. All these algorithms take $\tilde{O}(n/k^2)$ rounds, and are optimal up to polylogarithmic factors. We also show an almost matching lower bound of $\tilde{\Omega}(n/k^2)$ for many graph verification problems using lower bounds in random-partition communication complexity.
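
The component-merging core that such connectivity algorithms parallelize can be sketched sequentially as Boruvka-style hook-and-contract rounds over a union-find structure; the linear-sketching and $k$-machine machinery is omitted here, but the round structure (each component hooks along one incident edge per round, so the number of components at least halves) is the same:

```python
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb
            return True
        return False

def connected_components(n, edges):
    """Boruvka-style rounds: each component hooks along one incident
    inter-component edge per round until no such edge remains."""
    uf = UnionFind(n)
    changed = True
    while changed:
        changed = False
        hooks = {}  # one chosen outgoing edge per component root
        for u, v in edges:
            ru, rv = uf.find(u), uf.find(v)
            if ru != rv:
                hooks.setdefault(ru, (u, v))
            # merge the chosen edges; this is one "round"
        for u, v in hooks.values():
            changed |= uf.union(u, v)
    return len({uf.find(v) for v in range(n)})
```

Because the component count at least halves per round, only logarithmically many rounds are needed; the paper's algorithm compresses each round's communication with sketches.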

14:30
A Tight Distributed Algorithm for All Pairs Shortest Paths and Applications
SPEAKER: unknown

ABSTRACT. Given an unweighted and undirected graph, this paper aims to give a tight distributed algorithm for computing the all pairs shortest paths (APSP) under synchronous communications and the $\mathcal{CONGEST}(B)$ model, where each node can only transfer $B$ bits of information along each incident edge in a round. The best previous results for distributively computing APSP need $O(N+D)$ time, where $N$ is the number of nodes and $D$ is the diameter~\cite{DBLP:conf/podc/HolzerW12,DBLP:conf/icalp/PelegRT12}. This result is obtained by scheduling $N$ parallel distributed Breadth-First-Search (BFS) tree constructions. However, there is still a $B$ factor gap from the lower bound $\Omega(N/B+D)$~\cite{DBLP:conf/podc/HolzerW12}. In order to close this gap, we propose a multiplexing technique to push the parallelization of distributed BFS tree constructions to the limit, such that we can solve APSP in $O(N/B+D)$ time, which meets the lower bound. This result also implies a $\Theta(N/B+D)$ time distributed algorithm for the diameter. In addition, we extend our distributed algorithm to compute the girth, which is the length of the shortest cycle, and the clustering coefficient, which is related to counting the number of triangles incident to each node. The time complexities for computing these two graph properties are also $O(N/B+D)$.
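
In the unweighted setting, APSP is conceptually $N$ BFS computations, one per source; the paper's contribution is multiplexing them under the per-edge bandwidth constraint. For reference, the sequential counterpart is simply:

```python
from collections import deque

def bfs_distances(adj, src):
    """Hop distances from src in an unweighted graph {node: [neighbors]}."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def apsp(adj):
    """All pairs shortest paths: one BFS per source."""
    return {s: bfs_distances(adj, s) for s in adj}

def diameter(adj):
    d = apsp(adj)
    return max(max(row.values()) for row in d.values())
```

The diameter (and, with small modifications, the girth and triangle counts) falls out of the same distance table, which is why all these problems inherit the $O(N/B+D)$ bound.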

15:00-16:00 Session 14: Parallel Graph Algorithms
15:00
Parallel Shortest-Paths Using Radius Stepping
SPEAKER: Yan Gu

ABSTRACT. The single-source shortest path problem (SSSP) with nonnegative edge weights is a notoriously difficult problem to solve efficiently in parallel---it is one of the graph problems said to suffer from the transitive-closure bottleneck. In practice, the $\Delta$-stepping algorithm of Meyer and Sanders (J. Algorithms, 2003) often works efficiently but has no known theoretical bounds on general graphs. The algorithm takes a sequence of steps, each increasing the radius by a user-specified value $\Delta$. Each step settles the vertices in its annulus but can take $\Theta(n)$ substeps, each requiring $\Theta(m)$ work ($n$ vertices and $m$ edges).

In this paper, we describe Radius-Stepping, an algorithm with the best-known tradeoff between work and depth bounds for SSSP with nearly-linear ($\tilde{O}(m)$) work. The algorithm is a $\Delta$-stepping-like algorithm but uses a variable instead of fixed-size increase in radii, allowing us to prove a bound on the number of steps. In particular, by using what we define as a vertex $k$-radius, each step takes at most $k+2$ substeps. Furthermore, we define a $(k, \rho)$-graph property and show that if an undirected graph has this property, then the number of steps can be bounded by $O(\frac{n}{\rho} \log(\rho L))$, for a total of $O(\frac{kn}{\rho} \log(\rho L))$ substeps, each parallel. We describe how to preprocess a graph to have this property. Altogether, Radius-Stepping takes $O((m+n\log n)\log \frac{n}{\rho})$ work and $O(\frac{n}{\rho}\log n \log (\rho{}L))$ depth per source after preprocessing. The preprocessing step can be done in $O(m\log n + n\rho^2)$ work and $O(\rho^2)$ depth or in $O(m\log n + n\rho^2\log n)$ work and $O(\rho\log \rho)$ depth.
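
The fixed-radius scheme that Radius-Stepping generalizes can be sketched as follows. This is a simplified sequential $\Delta$-stepping in the spirit of Meyer and Sanders, omitting the light/heavy edge split of the full algorithm: vertices are kept in buckets of tentative-distance width $\Delta$, and each bucket is repeatedly relaxed (the substeps) until it settles.

```python
def delta_stepping(adj, src, delta):
    """Simplified Delta-stepping on {u: [(v, weight), ...]} with nonnegative
    weights: process vertices bucket by bucket in distance ranges of width
    delta, re-relaxing within a bucket until it is settled."""
    INF = float('inf')
    dist = {v: INF for v in adj}
    dist[src] = 0.0
    buckets = {0: {src}}
    i = 0
    while buckets:
        while i not in buckets:          # advance to the next non-empty bucket
            i = min(buckets)
        frontier = buckets.pop(i)
        while frontier:                  # substeps within the current bucket
            nxt = set()
            for u in frontier:
                for v, w in adj[u]:
                    d = dist[u] + w
                    if d < dist[v]:
                        dist[v] = d
                        b = int(d // delta)
                        if b == i:       # stays in this bucket: re-relax now
                            nxt.add(v)
                        else:            # falls into a later bucket
                            buckets.setdefault(b, set()).add(v)
            frontier = nxt
        i += 1
    return dist
```

The number of substeps per bucket is exactly what $\Delta$-stepping cannot bound on general graphs; Radius-Stepping's variable radii cap it at $k+2$.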

15:20
Parallel Metric Tree Embedding based on an Algebraic View on Moore-Bellman-Ford
SPEAKER: unknown

ABSTRACT. A metric tree embedding of expected stretch $\alpha$ maps a weighted n-node graph $G = (V, E, \omega)$ to a weighted tree $T = (V_T, E_T, \omega_T)$ with $V \subseteq V_T$, and $dist(v, w, G) \leq dist(v, w, T)$ and $E[dist(v, w, T)] \leq \alpha dist(v, w, G)$ for all $v, w \in V$. Such embeddings are highly useful for designing fast approximation algorithms, as many hard problems are easy to solve on tree instances. However, to date the best parallel $\polylog(n)$ depth algorithm that achieves an asymptotically optimal expected stretch of $\alpha \in O(\log n)$ uses $\Omega(n^2)$ work and requires a metric as input. In this paper, we show how to achieve the same guarantees using $\tilde{O}(m^{1+\epsilon})$ work, where $m$ is the number of edges of $G$ and $\epsilon > 0$ is an arbitrarily small constant. Moreover, one may reduce the work further to $\tilde{O}(m + n^{1+\epsilon})$, at the expense of increasing the expected stretch $\alpha$ to $O(\epsilon^{-1} \log n)$, using the spanner construction of Baswana and Sen as a preprocessing step. Our main tool in deriving these parallel algorithms is an algebraic characterization of a generalization of the classic Moore-Bellman-Ford algorithm. We consider this framework, which subsumes a large variety of previous "Moore-Bellman-Ford-flavored" algorithms, to be of independent interest.
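
The algebraic view can be illustrated directly: one Moore-Bellman-Ford relaxation round is a matrix-vector product over the min-plus (tropical) semiring, where "addition" is min and "multiplication" is +. A minimal sketch (for the classic shortest-path case only, not the paper's generalization):

```python
INF = float('inf')

def minplus_matvec(W, x):
    """y[v] = min over u of (x[u] + W[u][v]): one Bellman-Ford relaxation
    round as a matrix-vector product over the (min, +) semiring.
    W is the weight matrix with W[v][v] = 0 and INF for missing edges."""
    n = len(x)
    return [min(x[u] + W[u][v] for u in range(n)) for v in range(n)]

def bellman_ford(W, src):
    """Single-source distances: n-1 semiring products suffice in the
    absence of negative cycles."""
    n = len(W)
    x = [INF] * n
    x[src] = 0.0
    for _ in range(n - 1):
        x = minplus_matvec(W, x)
    return x
```

Swapping the semiring (or the vector entries being propagated) is what lets a single framework subsume the various "Moore-Bellman-Ford-flavored" algorithms.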

15:40
Online Packet Scheduling for CIOQ and Buffered Crossbar Switches
SPEAKER: unknown

ABSTRACT. We consider the problem of online packet scheduling in Combined Input and Output Queued (CIOQ) and buffered crossbar switches. In the widely used CIOQ switches, packet buffers (queues) are placed at both input and output ports. An $N \times N$ CIOQ switch has $N$ input ports and $N$ output ports, where each input port is equipped with $N$ queues, each of which corresponds to an output port, and each output port is equipped with only one queue. In each time step, arbitrarily many packets may arrive at each input port, and only one packet can be transmitted from each output port. Packets are transferred from the queues of input ports to the queues of output ports through the internal fabric. Buffered crossbar switches follow a similar design, but are equipped with additional buffers in their internal fabric. In either model, our goal is to maximize the number of transmitted packets or, in case the packets have weights, their total weight.

Our main objective is to devise online algorithms that are both competitive and efficient. We improve the previously known results for both switch models, both for weighted and unweighted packets.

For unweighted packets, Kesselman and Ros\'{e}n (J.~Algorithms '06) give an online algorithm that is 3-competitive for CIOQ switches. We give a faster, more practical algorithm achieving the same competitive ratio. In the buffered crossbar model, we also show 3-competitiveness, improving the previously known ratio of 4.

For weighted packets, we give 5.83- and 14.83-competitive algorithms with an elegant analysis for CIOQ and buffered crossbar switches, respectively. This improves upon the previously known ratios of 6 and 16.24.