SPAA 2016: 28TH ACM SYMPOSIUM ON PARALLELISM IN ALGORITHMS AND ARCHITECTURES
PROGRAM FOR MONDAY, JULY 11TH


09:00-10:20 Session 1: Parallel Algorithms
09:00
Shuffles and Circuits (On Lower Bounds for Modern Parallel Computation)
SPEAKER: unknown

ABSTRACT. The goal of this paper is to identify fundamental limitations on how efficiently algorithms implemented on platforms such as MapReduce and Hadoop can compute the central problems in the motivating application domains, such as graph connectivity problems.

We introduce an abstract model of massively parallel computation, where essentially the only restrictions are that the “fan-in” of each machine is limited to s bits, where s is smaller than the input size n, and that computation proceeds in synchronized rounds, with no communication between different machines within a round. Lower bounds on the round complexity of a problem in this model apply to every computing platform that shares the most basic design principles of MapReduce-type systems.

We prove that computations in our model that use few rounds can be represented as low-degree polynomials over the reals. This connection allows us to translate a lower bound on the (approximate) polynomial degree of a Boolean function to a lower bound on the round complexity of every (randomized) massively parallel computation of that function. These lower bounds apply even in the “unbounded width” version of our model, where the number of machines can be arbitrarily large. As one example of our general results, computing any non-trivial monotone graph property — such as connectivity — requires a super-constant number of rounds when every machine can accept only a sub-polynomial (in n) number of input bits s.

Finally, we prove that, in two senses, our lower bounds are the best one could hope for. For the unbounded-width model, we prove a matching upper bound. Restricting to a polynomial number of machines, we show that asymptotically better lower bounds require proving that P ≠ NC^1.

09:20
Parallel Algorithms for Summing Floating-Point Numbers
SPEAKER: unknown

ABSTRACT. We provide several efficient parallel algorithms for summing $n$ floating point numbers, so as to produce a correctly rounded floating-point representation of the sum. We present algorithms in PRAM, external-memory, and MapReduce models, and we also provide an experimental analysis of our MapReduce algorithms, due to their simplicity and practical efficiency.
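The abstract does not spell out the algorithms, but the sequential core of correctly rounded summation is the error-free transformation: the rounded sum fl(a+b) plus an exactly representable error term. The Python sketch below illustrates that building block (two_sum is the standard Knuth transformation; math.fsum is the standard-library correctly rounded sum used for comparison); it is an illustration only, not the paper's PRAM, external-memory, or MapReduce algorithms.

import math, random

def two_sum(a, b):
    # Error-free transformation (Knuth): s = fl(a + b) and a + b = s + err exactly.
    s = a + b
    b_virtual = s - a
    err = (a - (s - b_virtual)) + (b - b_virtual)
    return s, err

# Naive left-to-right summation can lose significance on ill-conditioned inputs,
# while a correctly rounded sum (here math.fsum) does not.
xs = [1e16, 1.0, -1e16] * 1000 + [random.uniform(-1.0, 1.0) for _ in range(1000)]
print(sum(xs))        # accumulates rounding error
print(math.fsum(xs))  # correctly rounded double-precision sum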

09:40
Randomized approximate nearest neighbor search with limited adaptivity
SPEAKER: unknown

ABSTRACT. We study the fundamental problem of approximate nearest neighbor search in $d$-dimensional Hamming space $\{0,1\}^d$. We consider algorithms in the famous cell-probe model with a limited number of rounds of parallel accesses to the data structure. For any $k\ge 1$, we give a simple randomized algorithm solving approximate nearest neighbor search with $k$ rounds of parallel memory accesses and $O(k(\log d)^{1/k})$ memory accesses in total. For large enough $k$, we also give a more sophisticated randomized algorithm for the problem with $O(k+(\frac{1}{k}\log d)^{O(1/k)})$ total memory accesses in $k$ rounds. Both algorithms use data structures of size polynomial in $n$, the number of points in the database.

We also prove an $\Omega(\frac{1}{k}(\log d)^{1/k})$ lower bound on the total number of memory accesses required by any randomized algorithm solving approximate nearest neighbor search within $k$ rounds of parallel memory accesses on a data structure of polynomial size. This lower bound shows that our first, simple algorithm is asymptotically optimal for any constant number of rounds. Our second algorithm approaches the asymptotically optimal tradeoff between the number of rounds and the number of memory accesses, in the sense that the lower bound on memory accesses for any $k_1$ rounds can be matched by the algorithm within $k_2=O(k_1)$ rounds.
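To make the round/probe tradeoff concrete, evaluating the first bound $O(k(\log d)^{1/k})$ at a few round counts (our arithmetic, not additional claims from the paper) gives:

\begin{align*}
k = 1 &: \quad O(\log d) \text{ total memory accesses,} \\
k = 2 &: \quad O\!\left(2\sqrt{\log d}\right) = O\!\left(\sqrt{\log d}\right), \\
k = \lceil \log\log d \rceil &: \quad (\log d)^{1/k} = O(1), \text{ hence } O(\log\log d) \text{ accesses in total.}
\end{align*}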

10:00
Encoding Short Ranges in TCAM Without Expansion: Efficient Algorithm and Applications
SPEAKER: unknown

ABSTRACT. We present RENE, a novel encoding scheme for short ranges on ternary content-addressable memory (TCAM) which, unlike previous solutions, does not impose row expansion and uses a number of bits proportional to the maximal range length. We provide a theoretical analysis showing that our encoding comes closest to the lower bound on the number of bits used. In addition, we show several applications of our technique in the field of packet classification, and also how the same technique can be used to efficiently solve other hard problems, such as the nearest-neighbor search problem and its variants. We show that, using TCAM, one can solve such problems at much higher rates than previously suggested solutions and outperform known lower bounds in traditional memory models. We show by experiments that the translation process of RENE on switch hardware induces only a negligible 2.5% latency overhead. Our nearest-neighbor implementation on a TCAM device provides search rates that are up to four orders of magnitude higher than the best prior solutions.
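For context, the baseline that RENE improves on is classic prefix expansion, which can blow a single w-bit range up into as many as 2w - 2 TCAM rows. The small Python sketch below implements that baseline (not RENE's encoding, which the abstract does not specify) to make the row-expansion problem concrete.

def prefix_expand(lo, hi, w):
    # Cover the inclusive range [lo, hi] of w-bit integers with ternary prefixes,
    # where 'x' marks a don't-care bit. A single w-bit range can require up to
    # 2w - 2 TCAM rows, which is the expansion that RENE's short-range encoding avoids.
    entries = []
    while lo <= hi:
        # Largest aligned power-of-two block that starts at `lo` and stays in range.
        size = 1
        while lo % (size * 2) == 0 and lo + size * 2 - 1 <= hi:
            size *= 2
        free_bits = size.bit_length() - 1
        prefix = format(lo >> free_bits, f"0{w - free_bits}b") if free_bits < w else ""
        entries.append(prefix + "x" * free_bits)
        lo += size
    return entries

# The 4-bit range [1, 14] already needs six rows under prefix expansion:
print(prefix_expand(1, 14, 4))
# ['0001', '001x', '01xx', '10xx', '110x', '1110']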

13:30-15:00 Session 3: Scheduling Parallel Computation
13:30
Extending the Nested Parallel Model to the Nested Dataflow Model with Provably Efficient Schedulers
SPEAKER: unknown

ABSTRACT. The nested parallel (a.k.a. fork-join) model is widely used for writing parallel programs. However, its two composition constructs, "||" (parallel) and ";" (serial), are insufficient for expressing "partial dependencies" or "partial parallelism" in a program. We propose a new dataflow composition construct, "fire", to express partial dependencies in algorithms in a processor- and cache-oblivious way, thus extending the Nested Parallel (NP) model to the Nested Dataflow (ND) model. We redesign several divide-and-conquer algorithms, ranging from dense linear algebra to dynamic programming, in the ND model and prove that they all have optimal span while retaining optimal cache complexity. We propose the design of runtime schedulers that map ND programs to multicore processors with multiple levels of possibly shared caches (i.e., Parallel Memory Hierarchies \cite{AlpernCaFe93}) and provide theoretical guarantees on their ability to preserve locality and load balance. For this, we adapt space-bounded (SB) schedulers to the ND model. We show that algorithms have increased "parallelizability" in the ND model, and that SB schedulers can exploit this extra parallelizability to achieve asymptotically optimal bounds on cache misses and running time on a greater number of processors than in the NP model for "reasonably regular" algorithms. The running time for the algorithms in this paper is asymptotically optimal: $O\!\left(\frac{\sum_{i=0}^{h-1} Q^{*}(t;\,\sigma\cdot M_i)\cdot C_i}{p}\right)$, where $Q^{*}(t;\,M)$ is the cache complexity of task $t$ on a cache of size $M$, $C_i$ is the cost of a miss at the level-$i$ cache of size $M_i$, $\sigma\in(0,1)$ is a constant, and $p$ is the number of processors in an $h$-level cache hierarchy.
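As a rough illustration of what "partial dependencies" buy (this is not the ND model's construct or its space-bounded schedulers), compare a fork-join barrier with per-element dependencies expressed via futures: each C[i] in the Python sketch below fires as soon as its own inputs A[i] and B[i] are ready, rather than waiting for a global join over all A- and B-tasks.

from concurrent.futures import ThreadPoolExecutor

# Toy blocked computation in which C[i] depends only on A[i] and B[i]. A pure
# fork-join version would join on *all* A- and B-tasks before starting any C-task.
def make_a(i): return i * i
def make_b(i): return i + 1
def make_c(a, b): return a + b

with ThreadPoolExecutor() as pool:
    a_futs = [pool.submit(make_a, i) for i in range(8)]
    b_futs = [pool.submit(make_b, i) for i in range(8)]
    # Each C-task waits only on its own two inputs, not on a global join.
    c_futs = [pool.submit(lambda a=a, b=b: make_c(a.result(), b.result()))
              for a, b in zip(a_futs, b_futs)]
    print([c.result() for c in c_futs])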

13:50
A Practical Solution to the Cactus Stack problem
SPEAKER: unknown

ABSTRACT. Work-stealing is a popular method for load-balancing dynamic multithreaded computations on shared-memory systems. In theory, a randomized work-stealing scheduler can achieve near linear speedup when the computation has sufficient parallelism and requires stack space that is linear in the number of processors. In practice, however, work-stealing runtimes sacrifice interoperability with serial code to achieve these bounds. For example, both Cilk and Cilk++ prohibit a C function from calling a Cilk function. Other work-stealing runtime systems that do not have this restriction either lack a strong time bound, which might cause them to deliver little or no speedup in the worst case, or lack a strong space bound, which might lead to an excessive memory footprint. This problem was previously described as the cactus stack problem.

In this paper, we present Fibril, a new multithreading library that supports a fork-join programming model using work-stealing. Fibril solves the cactus stack problem by (1) implementing a cactus stack that conforms to the calling convention of serial code and (2) returning unused memory pages of suspended stacks to the operating system to bound consumption of physical memory. Theoretically, Fibril achieves strong bounds on both time and memory usage without sacrificing interoperability with serial code. Empirically, Fibril achieves up to 2.8× the performance of Intel Cilk Plus, a state-of-the-art work-stealing platform, for the ten benchmarks we evaluated.

14:10
Latency-Hiding Work Stealing
SPEAKER: unknown

ABSTRACT. The rise of multicore computers has changed the landscape of parallel computing. Parallel applications no longer consist solely of computational, batch workloads but also include applications that may communicate with the external world. For example, a parallel program may take input from users, perform I/O on a secondary storage device, communicate with other devices over a network, perform remote procedure calls, etc. Such operations can incur substantial latency, requiring the program to wait for a response. In the current state of the art, the theoretical models of parallelism and parallel scheduling algorithms do not account for latency.

In this work, we extend the dag (Directed Acyclic Graph) model for parallelism to account for latency and present a work-stealing algorithm that hides latency to improve performance. Our latency-hiding work-stealing algorithm allows user-level threads to suspend without blocking the underlying worker, usually a system thread. When a user-level thread suspends, the algorithm seeks another thread to execute. To ensure provable performance, the algorithm creates multiple deques per worker and treats each deque essentially in the same way as the original work-stealing algorithm. We show that the algorithm requires W/P + S(U+1)(1+lg(U+1)) time to run a computation with W work, S span, and U suspension width. The suspension width bounds the number of threads that can be suspended at any time during an execution and can be calculated by using a technique similar to s-t cuts in flow algorithms. We also briefly present a prototype implementation of the algorithm and some preliminary empirical findings.
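The toy Python sketch below (generators stand in for user-level threads; it deliberately ignores the paper's time bound, inter-worker stealing, and the suspension-width analysis) illustrates the core mechanism: a worker owns several deques, and when the thread it is running suspends, the worker resumes work from another deque instead of blocking.

import random
from collections import deque, defaultdict

def task(name, latency_steps):
    for _ in range(latency_steps):
        yield "waiting"            # e.g. an outstanding I/O or remote call
    return f"{name} done"

class Worker:
    def __init__(self):
        self.deques = defaultdict(deque)    # several deques owned by one worker

    def spawn(self, gen, slot=0):
        self.deques[slot].append(gen)

    def run(self):
        results = []
        while any(self.deques.values()):
            # Resume a thread from any non-empty deque; a real scheduler would
            # also steal from other workers' deques here.
            slot = random.choice([s for s, d in self.deques.items() if d])
            gen = self.deques[slot].popleft()
            try:
                next(gen)                        # run until it suspends again
                self.deques[slot].append(gen)    # suspended: requeue, don't block
            except StopIteration as finished:
                results.append(finished.value)
        return results

w = Worker()
for i in range(4):
    w.spawn(task(f"t{i}", latency_steps=i), slot=i % 2)
print(w.run())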

14:30
Provably good and practically efficient parallel race detection for fork-join programs
SPEAKER: unknown

ABSTRACT. Determinacy races occur when different schedules of a program can result in memory accesses that observe different values. These race conditions are often bugs in the program. This type of bug is notoriously difficult to expose, but various race-detection tools exist. A key component of race detectors is an algorithm for series-parallel (SP) maintenance, which identifies whether two accesses are logically parallel. This paper describes an asymptotically optimal algorithm, called WSP-Order, for performing SP maintenance in parallel programs with fork-join (or nested) parallelism. In particular, given an input fork-join program with $T_1$ work and $T_\infty$ span, WSP-Order augments the program to maintain series-parallel relationships and executes it in $O(T_1/P + T_\infty)$ time on $P$ processors, which is asymptotically optimal. At the heart of WSP-Order is a work-stealing scheduler designed specifically for SP maintenance. We also implemented a race detector, called C-RACER, based on the WSP-Order algorithm within the Cilk Plus runtime system and evaluated its performance on five benchmarks. Empirical results demonstrate that, when run sequentially, it performs almost as well as the best previous sequential race detectors. More importantly, as the number of processors increases, it achieves almost as high a speedup as the original program without race detection.
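For readers unfamiliar with SP maintenance, the sketch below illustrates the order-maintenance idea that this line of work (e.g., the published SP-order algorithm) builds on: keep two total orders over the strands of the fork-join parse tree, one of which reverses the children of parallel nodes; two strands are logically parallel exactly when the two orders disagree. This is background intuition only, not the WSP-Order algorithm or its work-stealing integration.

# Two DFS orders over an SP parse tree: an "English" order that visits every
# node's children left-to-right, and a "Hebrew" order that flips the children
# of P (parallel) nodes. Strands are parallel iff the orders disagree.
class Node:
    def __init__(self, kind, children=None, label=None):
        self.kind, self.children, self.label = kind, children or [], label

def dfs_order(node, flip_parallel, order=None):
    if order is None:
        order = {}
    if node.kind == "leaf":
        order[node.label] = len(order)
    else:
        kids = node.children
        if flip_parallel and node.kind == "P":
            kids = list(reversed(kids))
        for c in kids:
            dfs_order(c, flip_parallel, order)
    return order

def parallel(a, b, english, hebrew):
    return (english[a] < english[b]) != (hebrew[a] < hebrew[b])

# spawn e1; e2; sync; e3  corresponds to the parse tree  S( P(e1, e2), e3 )
tree = Node("S", [Node("P", [Node("leaf", label="e1"), Node("leaf", label="e2")]),
                  Node("leaf", label="e3")])
eng, heb = dfs_order(tree, False), dfs_order(tree, True)
print(parallel("e1", "e2", eng, heb))  # True: e1 and e2 may race
print(parallel("e1", "e3", eng, heb))  # False: e3 runs after the sync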

14:50
Dynamic Determinacy Race Detection for Task Parallelism with Futures
SPEAKER: unknown

ABSTRACT. Existing determinacy race detectors for dynamic task-parallel languages are limited to programs with strict computation graphs, where a task can only wait for its descendant tasks to complete. However, modern task-parallel programming models include parallel constructs that allow for more general synchronization patterns than those that can be expressed by strict computation graphs. Examples of such constructs include futures (which are present in many programming models), task dependences in OpenMP 4.0, doacross synchronization in OpenMP 4.5, and phaser synchronization in Java 7 and Habanero-Java (HJ). In this paper, we present the first known determinacy race detector for dynamic task parallelism with futures. As with past determinacy race detectors, our algorithm guarantees that all potential determinacy races will be checked, so that if no race is reported for a given input, then all executions with that input are guaranteed to be race-free. Thus, the output of our race detection is independent of how the parallel tasks are scheduled.

The space and time complexity of the algorithm are similar to those of the classical SP-bags algorithm when only structured parallel constructs, such as spawn-sync and async-finish, are used. In the presence of point-to-point synchronization using futures, the complexity of the algorithm increases by a factor determined by the number of future operations, which includes future task creation and future get operations. Further, our paper states additional theoretical guarantees relating data races, functional determinism, structural determinism, and deadlock freedom. We have completed a prototype implementation of our algorithm and have evaluated it using seven benchmark programs. The experimental results show that the slowdown factor observed for our algorithm relative to the sequential version is in the range of 1.00× to 9.92×, which is very much in line with the slowdowns experienced for fully strict computation graphs.

In summary, the theoretical and experimental results in this paper promise to make our algorithm desirable for modern task-parallel programming models with futures, such as C++11, Java, and the .NET Task Parallel Library, as well as research languages such as Chapel, HJ, and X10, where point-to-point synchronization coexists with structured parallelism.

15:30-16:40 Session 4: Transactional Memory and Beyond
15:30
RUBIC: Online Parallelism Tuning for Collocated Transactional Memory Applications
SPEAKER: unknown

ABSTRACT. With the advent of Chip-Multiprocessors, Transactional Memory (TM) emerged as a powerful paradigm to simplify parallel programming. Unfortunately, as more and more cores become available in commodity systems, the scalability limits of a wide class of TM applications become more evident.

Hence, online parallelism tuning techniques have been proposed to dynamically adapt the number of threads used by TM applications. However, state-of-the-art solutions are exclusively tailored to single-process systems with relatively static workloads, and exhibit pathological behaviors in scenarios where multiple multi-threaded TM processes contend for the shared hardware resources.

This paper proposes RUBIC, a novel parallelism tuning method for TM applications in both single and multi-process scenarios. RUBIC helps the co-running processes adapt their parallelism level so that they can adequately space-share the available cores in a system.

When compared to previous online parallelism tuning solutions, RUBIC achieves unprecedented system-wide fairness and efficiency, both in single- and multi-process scenarios. Our evaluation with different workloads and scenarios shows that, on average, RUBIC improves overall performance by 26% with respect to the best-performing state-of-the-art online parallelism tuning technique in multi-process scenarios, while incurring no overhead in single-process cases. RUBIC is also unique in converging to a fair and efficient state.
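The abstract does not describe RUBIC's policy in detail; as a generic illustration of what "online parallelism tuning" means, here is a toy hill-climbing controller that adjusts a thread count based on measured throughput (all names are our own; a real system such as RUBIC must additionally coordinate across co-running processes to space-share the cores).

def tune_threads(measure_throughput, t=4, t_min=1, t_max=64, step=2, rounds=20):
    # Toy hill climbing: keep a thread-count change only if throughput improves,
    # otherwise reverse direction on the next round.
    best = measure_throughput(t)
    direction = +1
    for _ in range(rounds):
        candidate = max(t_min, min(t_max, t + direction * step))
        perf = measure_throughput(candidate)
        if perf > best:
            t, best = candidate, perf
        else:
            direction = -direction
    return t

# Example with a synthetic throughput curve that peaks at 16 threads:
print(tune_threads(lambda n: -(n - 16) ** 2))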

15:50
Extending TM Primitives using Low Level Semantics
SPEAKER: unknown

ABSTRACT. Transactional Memory (TM) has recently emerged as an optimistic concurrency control technique that isolates concurrent executions at the level of reads and writes, thereby providing an easy programming interface. However, such transparency can be overly conservative from an application-level perspective. In this work, we propose an extension to the classical TM primitives (read and write) that captures program code semantics while maintaining the same level of programming abstraction. We deployed this extension on two state-of-the-art STM algorithms and integrated it into the GCC compiler and the RSTM software framework. Results show speedups of up to 4× on different applications, including micro-benchmarks and STAMP.

16:10
Investigating the performance of hardware transactions on a multi-socket machine
SPEAKER: unknown

ABSTRACT. The introduction of hardware transactional memory (HTM) into commercial processors opens a door for designing and implementing scalable synchronization mechanisms. One example for such a mechanism is transactional lock elision (TLE), where lock-based critical sections are executed concurrently using hardware transactions. So far, however, the effectiveness of TLE and other HTM-based mechanisms has been assessed mostly on small, single-socket machines.

This paper investigates the behavior of hardware transactions on a large multi-socket machine. Using TLE as an example, we show that a system can scale as long as all threads run on the same socket, but a single thread running on a different socket can wreck performance. We identify the reason for this phenomenon, and present a simple adaptive technique that overcomes this problem by throttling threads as necessary to optimize system performance. Using extensive evaluation of multiple microbenchmarks and real applications, we demonstrate that our technique achieves the full performance of the system for workloads that scale across sockets, and avoids the performance degradation that cripples TLE for workloads that do not.

16:30
Brief Announcement: Transactional Data Structure Libraries
SPEAKER: unknown

ABSTRACT. We introduce transactions into libraries of concurrent data structures; such transactions can be used to ensure atomicity of sequences of data structure operations. By restricting transactional access to a well-defined set of data structure operations, we strike a balance between the ease-of-programming of transactions and the efficiency of custom-tailored data structures. We exemplify this concept by designing and implementing a library supporting transactions on any number of maps, sets (implemented as skiplists), and queues. Our library offers efficient and scalable transactions, which are an order of magnitude faster than state-of-the-art transactional memory toolkits. Moreover, our approach treats stand-alone data structure operations (like put and enqueue) as first class citizens, and allows them to execute with virtually no overhead, at the speed of the original data structure library.

17:00-18:00 Session 5: Parallel Algorithms
17:00
Cache-Adaptive Analysis
SPEAKER: unknown

ABSTRACT. Memory efficiency and locality have substantial impact on the performance of programs, particularly when operating on large data sets. Thus, memory- or I/O-efficient algorithms have received significant attention both in theory and practice. The widespread deployment of multicore machines, however, brings new challenges. Specifically, since the memory (RAM) is shared across multiple processes, the effective memory-size allocated to each process fluctuates over time.

This paper presents techniques for designing and analyzing algorithms in a cache-adaptive setting, where the RAM available to the algorithm changes over time. These techniques make analyzing algorithms in the cache-adaptive model almost as easy as in the DAM model. Our techniques enable us to analyze a wide variety of algorithms — Master-Method-style algorithms, Akra-Bazzi-style algorithms, collections of mutually recursive algorithms, and algorithms, such as FFT, that break problems of size N into subproblems of size Θ(N^c).
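For orientation, this is the kind of recurrence such techniques target, recalled here in the classical DAM setting with a fixed memory size $M$ and block size $B$ (cache-oblivious matrix multiplication); the paper's contribution is making such analyses go through when the available memory varies over time:

\begin{align*}
Q_{\mathrm{MM}}(N) &= 8\,Q_{\mathrm{MM}}(N/2) + \Theta\!\left(1 + \tfrac{N^2}{B}\right),
  \qquad Q_{\mathrm{MM}}(N) = \Theta\!\left(\tfrac{N^2}{B}\right) \text{ once } N^2 \le cM,\\
\Rightarrow\quad Q_{\mathrm{MM}}(N) &= \Theta\!\left(\tfrac{N^3}{B\sqrt{M}}\right)
  \quad\text{for } N^2 > cM.
\end{align*}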

We demonstrate the effectiveness of these techniques by deriving several results:
• We give a simple recipe for determining whether common divide-and-conquer cache-oblivious algorithms are optimally cache adaptive and, if not, how far they are from optimal.
• We show how to bound an algorithm’s non-optimality. We give a tight analysis showing that a class of cache-oblivious algorithms is a logarithmic factor worse than optimal.
• We show the generality of our techniques by analyzing the cache-oblivious FFT algorithm, which is not covered by the above theorems. Nonetheless, the same general techniques can show that it is at most O(log log N) away from optimal in the cache-adaptive setting, and that this bound is tight.

These general theorems give concrete results about several algorithms that could not be analyzed using earlier techniques. For example, our results apply to Fast Fourier Transform, matrix multiplication, Jacobi Multipass Filter, and cache-oblivious dynamic-programming algorithms, such as Longest Common Subsequence and Edit Distance.

Our results also give algorithm designers clear guidelines for creating optimally cache-adaptive algorithms.

17:20
Parallel Algorithms for Asymmetric Read-Write Costs
SPEAKER: unknown

ABSTRACT. Motivated by the significantly higher cost of writing than reading in emerging memory technologies, we consider parallel algorithm design under such asymmetric read-write costs, with the goal of reducing the number of writes while preserving work-efficiency and low span. We present a nested-parallel model of computation that combines (i) small per-task stack-allocated memories with symmetric read-write costs and (ii) an unbounded heap-allocated shared memory with asymmetric read-write costs, and show how the costs in the model map efficiently onto a more concrete machine model under a work-stealing scheduler. We use the new model to design reduced write, work-efficient, low span parallel algorithms for a number of fundamental problems such as reduce, list contraction, tree contraction, breadth-first search, ordered filter, and planar convex hull. For the latter two problems, our algorithms are output-sensitive in that the work and number of writes decrease with the output size. We also present a reduced write, low span minimum spanning tree algorithm that is nearly work-efficient (off by the inverse Ackermann function). Our algorithms reveal several interesting techniques for significantly reducing shared memory writes in parallel algorithms without asymptotically increasing the number of shared memory reads.
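A common way to formalize such read/write asymmetry (stated here in our own notation with a write-cost parameter $\omega \ge 1$, and omitting the paper's distinction between small symmetric task stacks and the asymmetric shared heap) charges

\[
  \text{cost of an execution} \;=\; \#\{\text{shared-memory reads}\} \;+\; \omega \cdot \#\{\text{shared-memory writes}\},
\]

so that, when $\omega$ is large, reducing the number of writes by some factor reduces the total cost by up to that same factor.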

17:40
Brief Announcement: Preserving Happens-before in Persistent Memory
SPEAKER: unknown

ABSTRACT. It is expected that nonvolatile, byte-addressable memory (NVBM) will be commercially available in the near future. However, registers and caches are expected to remain transient. In the event of a power failure, NVBM content will be preserved, but data stored in the cache and processor will be lost. Since main memory is customarily seen through the filter of current cache contents, the data preserved in NVBM after a crash, absent careful management, are likely to be inconsistent and thus unusable.

Previous work has explored the semantics of instructions used to push the contents of cache to NVBM. These semantics comprise a "memory persistency model," analogous to a traditional "memory consistency model." In this brief announcement we introduce explicit epoch persistency, a memory persistency model that captures the current and expected semantics of Intel x86 and ARM v8 persistent memory instructions. We also present a construction that augments any data-race-free program (for release consistency or any stronger memory model) in such a way that the contents of NVBM, in the wake of a crash, are guaranteed to represent a consistent cut in the happens-before graph of the program's execution.

17:50
Brief Announcement: Approximating the I/O Complexity of One-Shot Red-Blue Pebbling
SPEAKER: unknown

ABSTRACT. Red-blue pebbling is a model of computation that captures the complexity of I/O operations in systems with external memory access. We focus on one-shot pebbling strategies, that is, strategies without re-computation. Prior work on this model has focused on finding upper and lower bounds on the I/O complexity of certain families of graphs. We give a polynomial-time bi-criteria approximation algorithm for this problem for graphs with bounded out-degree. More precisely, given an $n$-vertex DAG that admits a pebbling strategy with $R$ red pebbles and I/O complexity $\mathrm{OPT}$, our algorithm outputs a strategy using $O(R \cdot \log^{3/2} n)$ red pebbles and I/O complexity $O(\mathrm{OPT} \cdot \log^{3/2} n)$. We further extend our result to the generalization of red-blue pebble games that corresponds to multi-level memory hierarchies. Finally, we complement our theoretical analysis with an experimental evaluation of our algorithm for red-blue pebbling.