SPAA 2016: 28TH ACM SYMPOSIUM ON PARALLELISM IN ALGORITHMS AND ARCHITECTURES
PROGRAM FOR TUESDAY, JULY 12TH

09:00-10:30 Session 6: Scheduling
09:00
General Profit Scheduling and the Power of Migration on Heterogeneous Machines
SPEAKER: unknown

ABSTRACT. In this paper we consider the power of migration in heterogeneous machine settings and general profit scheduling. We begin by showing that, on related machines or on related machines with restricted assignment, any migratory algorithm can be simulated by a non-migratory algorithm given $1+\eps$ speed augmentation and $O(\frac{1}{\eps})$ or $O(\frac{1}{\eps^2})$ machine augmentation, respectively, for any $0 < \eps \leq 1$. Similar results were previously known only in the case of identical machines, and our results effectively show that migration does not give too much additional power to an algorithm, even in heterogeneous environments. Our results are constructive and the schedules can be computed efficiently in the offline setting. We complement this by showing that there exist migratory schedules on related machines which require $\Omega(\frac{1}{\eps})$ machine augmentation with $(1+\eps)$-speed to be simulated by any non-migratory scheduler for any $0 < \eps \leq 1/2$, showing that machine augmentation without speed augmentation is insufficient for a non-migratory scheduler to simulate a migratory one.

We then use these results to study general profit scheduling, where a set of $n$ jobs arrives online over time and every job $i$ has a function $g_i(t)$ specifying the profit of completing job $i$ at time $t$. The goal of the schedule is to maximize the total profit obtained. We give a $(1+\eps)$-speed $O(\frac{1}{\eps^2})$-competitive algorithm in the unrelated machines setting for any $\eps > 0$ when compared against a non-migratory adversary. Previous results were known only in the identical machines setting. As an example of the usefulness of the migration results, combining them with the results on general profit scheduling gives a $(1+\eps)$-speed $O(\frac{1}{\eps^4})$-competitive algorithm for general profit scheduling when comparing against a migratory algorithm on related machines with restricted assignment for any $\eps > 0$.

09:20
The Power of Migration in Online Machine Minimization
SPEAKER: unknown

ABSTRACT. In this paper we investigate the power of migration in online scheduling on multiple parallel machines. The problem is to schedule preemptable jobs with release dates and deadlines on a minimum number of machines. We show that migration, that is, allowing a preempted job to be continued on a different machine, has a huge impact on the performance of a schedule. More precisely, let m be the number of machines required by a migratory solution; then the increase in the number of machines when disallowing migration is unbounded in m. This complements and strongly contrasts previous results on variants of this problem. In both the offline variant and a model allowing extra speed, the power of migration is limited, as the increase in the number of machines (resp. speed) can be bounded by a small constant.

In this paper, we also derive the first non-trivial bounds on the competitive ratio for non-migratory online scheduling to minimize the number of machines without extra speed. We show that in general no online algorithm can achieve a competitive ratio of f(m), for any function f, and give a lower bound of Omega(log n). For agreeable instances and instances with "loose" jobs, we give O(1)-competitive algorithms and, for laminar instances, we derive an O(log m)-competitive algorithm.

09:40
Fair Online Scheduling for Selfish Jobs on Heterogeneous Machines
SPEAKER: unknown

ABSTRACT. Scheduling jobs on multiple machines has numerous applications and has been a central topic of research in the scheduling literature. Recently, much progress has been made, particularly in online scheduling, with the development of powerful analysis tools. In this line of work, a centralized scheduler typically dispatches jobs to machines so as to best exploit the given resources and achieve the best system performance, measured by a certain global scheduling objective. While this approach has been very successful in attacking scheduling problems of growing complexity, the underlying assumption that jobs follow a centralized scheduler may not be realistic in certain scheduling settings.

In this paper we initiate the study of online scheduling for selfish jobs in the presence of multiple machines. Selfish behavior of jobs is a common aspect observed in the absence of a centralized scheduler. We explore this question in the unrelated machines setting, arguably one of the most general multiple machine models. In this model each job can have a completely different processing time on each machine. Motivated by several practical scenarios, we assume that when a job arrives it {\em chooses the machine that completes the job the earliest} i.e. minimizes the flow time of the job. The goal is to design a local scheduling algorithm on each machine with the goal of minimizing the total (weighted) flow time. We show that the algorithm Smoothed Latest Arrival Processor Sharing, which was introduced in a recent work by Im et al. \cite{ImKM14, ImKMP14}, yields an $O(1 / \eps^2)$-competitive schedule when given $(1 + \eps)$ speed. We also extend our result to minimize total flow-time plus energy consumed. To show this result we establish several interesting properties of the algorithm which could be of potential use for other scheduling problems.

10:00
Scheduling Parallelizable Jobs Online to Minimize Maximum Flow Time
SPEAKER: unknown

ABSTRACT. In this paper we study the problem of scheduling a set of dynamic multithreaded jobs with the objective of minimizing the maximum latency experienced by any job. We assume that jobs arrive online and the scheduler has no information about the arrival rate, arrival time or work distribution of the jobs. The scheduling goal is to minimize the maximum amount of time between the arrival of a job and its completion --- this goal is referred to in scheduling literature as maximum flow time. While theoretical online scheduling of parallel jobs has been studied extensively, most prior work has focussed on a highly stylized model of parallel jobs called the "speedup curves model." We model parallel jobs as directed acyclic graphs, which is a more realistic way to model dynamic multithreaded jobs.

In this context, we prove that a simple First-In-First-Out scheduler is $(1+\eps)$-speed $O(\frac{1}{\eps})$-competitive for any $\eps > 0$. We then develop a more practical work-stealing scheduler and show that it has a maximum flow time of $O(\frac{1}{\eps^2} \max\{\mathrm{OPT},\ln(n)\})$ for $n$ jobs, with $(1+\eps)$-speed. This result is essentially tight, as we also provide a lower bound of $\Omega(\log(n))$ for work stealing. In addition, for the case where jobs have weights (typically representing priorities) and the objective is minimizing the maximum weighted flow time, we show a non-clairvoyant algorithm is $(1+\eps)$-speed $O(\frac{1}{\eps^2})$-competitive for any $\eps > 0$, which is essentially the best positive result that can be shown in the online setting for the weighted case due to strong lower bounds without resource augmentation.

After establishing theoretical results, we perform an empirical study of work-stealing. Our results indicate that, on both real world and synthetic workloads, work-stealing performs almost as well as an optimal scheduler.

10:20
A QPTAS for Non-preemptive Speed-scaling
SPEAKER: unknown

ABSTRACT. Modern processors typically allow dynamic speed-scaling, offering an effective trade-off between high throughput and energy efficiency. In a classical model, a processor/machine runs at speed $s$ while consuming power $s^\alpha$, where $\alpha > 1$ is a constant. In their seminal work, Yao et al. [FOCS 1995] studied the problem of completing all jobs before their deadlines on a single machine with the minimum energy and gave a nice polynomial time algorithm. This influential work has been extended to various settings. In particular, the problem has been extensively studied in the presence of multiple machines as multi-core processors have become dominant computing units.

However, when jobs must be scheduled \emph{non-preemptively}, our understanding of the problem remains fairly unsatisfactory. Often, preempting a job is prohibited since it could be very costly. Previously, an $O((w_{\max}/w_{\min})^{\alpha})$-approximation was known for the non-preemptive setting, where $w_{\max}$ and $w_{\min}$ denote the maximum and minimum job sizes, respectively. Even when there is only one machine, the best known approximation factor had a dependency on $\alpha$. In this paper, for any fixed $\alpha > 1$ and $\eps > 0$, we give the \emph{first} $(1+\eps)$-approximation for this problem on multiple machines, which runs in $n^{O(\mathrm{polylog}(n))}$ time.
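The power-$s^\alpha$ model above admits a quick sanity check: since power is convex in speed, a single job is cheapest to run at one constant speed over its whole window. A minimal numerical sketch (our own illustration of the model, not the paper's algorithm):

```python
# Speed-scaling energy model from the abstract: running at speed s consumes
# power s**alpha (alpha > 1).  By convexity, finishing a single job of work
# w by deadline d at the constant speed w/d minimizes energy; we compare
# that against an uneven two-phase schedule.

def energy(speed, duration, alpha=3.0):
    """Energy = power * time, with power = speed**alpha."""
    return (speed ** alpha) * duration

def constant_speed_energy(w, d, alpha=3.0):
    """Run the whole job at the single speed w/d."""
    return energy(w / d, d, alpha)

def two_phase_energy(w, d, frac_work, frac_time, alpha=3.0):
    """Do frac_work of the work in frac_time of the deadline, rest after."""
    e1 = energy((w * frac_work) / (d * frac_time), d * frac_time, alpha)
    e2 = energy((w * (1 - frac_work)) / (d * (1 - frac_time)),
                d * (1 - frac_time), alpha)
    return e1 + e2

w, d = 10.0, 2.0
best = constant_speed_energy(w, d)         # speed 5, energy 5**3 * 2 = 250
uneven = two_phase_energy(w, d, 0.8, 0.5)  # front-loaded schedule costs more
assert best <= uneven
```

The gap grows quickly with $\alpha$, which is why non-preemptive scheduling (where a job cannot be split around others) is so costly to approximate.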

13:30-15:00 Session 8: Scheduling and Resource Allocation
13:30
Robust and Probabilistic Failure-Aware Placement
SPEAKER: unknown

ABSTRACT. Motivated by the growing complexity and heterogeneity of modern data centers, and the prevalence of commodity component failures, this paper studies the failure-aware placement problem: placing the tasks of a parallel job on machines in the data center so as to increase availability. We consider two models of failures: adversarial and probabilistic. In the adversarial model, each node has a weight (higher weight implying higher reliability), the adversary can remove any subset of nodes of total weight at most a given bound $W$, and our goal is to find a placement that incurs the least disruption against such an adversary. In the probabilistic model, each node has a probability of failure and we need to find a placement that maximizes the probability that at least $K$ out of $N$ tasks survive at any time.

For adversarial failures, we first show that (i) the problems are in $\Sigma_2$, the second level of the polynomial hierarchy, (ii) a basic variant, which we call RobustFAP, is co-NP-hard, and (iii) an all-or-nothing version of RobustFAP is $\Sigma_2$-complete. We then give a PTAS for RobustFAP, a key ingredient of which is a solution that we design for a fractional version of RobustFAP. We then look at fractional RobustFAP over hierarchies, denoted HierRobustFAP, and design a novel {\it Generalized Spreading}\/ algorithm which is simultaneously optimal for all $W$. This algorithm generalizes the classical notion of {\em max-min fairness}\/ to work with nodes of differing capacities, differing reliability weights and hierarchical structures. Using randomized rounding, we extend this to give an algorithm for integral HierRobustFAP.
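As background for the max-min fairness notion that Generalized Spreading extends, here is a minimal sketch of classical water-filling allocation over capacitated nodes (an illustration of the standard notion only; the paper's algorithm additionally handles reliability weights and hierarchical structures):

```python
# Classical max-min (water-filling) allocation: spread `total` tasks over
# nodes with given capacities by repeatedly giving every unsaturated node
# an equal share, capping saturated nodes at their capacity.

def max_min_allocation(total, capacities):
    alloc = [0.0] * len(capacities)
    remaining = float(total)
    active = set(range(len(capacities)))      # nodes not yet at capacity
    while remaining > 1e-9 and active:
        share = remaining / len(active)       # equal share this round
        for i in list(active):
            take = min(share, capacities[i] - alloc[i])
            alloc[i] += take
            remaining -= take
            if alloc[i] >= capacities[i] - 1e-9:
                active.discard(i)             # node is saturated
    return alloc

# Allocating 8 units over capacities [1, 3, 6]: the small nodes fill up,
# and the leftover goes to the large node.
print(max_min_allocation(8, [1, 3, 6]))       # approximately [1, 3, 4]
```

Generalized Spreading must in addition be simultaneously optimal for every adversary budget $W$, which the classical procedure above does not address.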

For the probabilistic version, we first give an algorithm that achieves an additive $\eps$ approximation in the failure probability for the single-level version, called ProbFAP, while giving up a $(1+\eps)$ multiplicative factor in the number of failures. We then extend the result to the hierarchical version, HierProbFAP, achieving an $\eps$-additive approximation in failure probability while giving up an $(L+\eps)$ multiplicative factor in the number of failures, where $L$ is the number of levels in the hierarchy.

13:50
Brief Announcement: Flexible Resource Allocation for Clouds and All-Optical Networks
SPEAKER: unknown

ABSTRACT. Motivated by the cloud computing paradigm, and by key optimization problems in all-optical networks, we study two variants of the classic job interval scheduling problem, where a reusable resource is allocated to competing job intervals in a {\em flexible} manner. Each job, $J_i$, requires the use of up to $r_{max}(i)$ units of the resource, with a profit of $p_i \geq 1$ accrued for each allocated unit. The goal is to feasibly schedule a subset of the jobs so as to maximize the total profit. The resource can be allocated either in {\em contiguous} or {\em non-contiguous} blocks. These problems can be viewed as flexible variants of the well known {\em storage allocation} and {\em bandwidth allocation} problems.

We show that the contiguous version is strongly NP-hard, already for instances where all jobs have the same profit and the same maximum resource requirement. For such instances, we derive the best possible positive result, namely, a polynomial time approximation scheme. We further show that the contiguous variant admits a $(\frac{5}{4}+\eps)$-approximation algorithm, for any fixed $\eps >0$, on instances whose job intervals form a {\em proper} interval graph. At the heart of the algorithm lies a non-standard parameterization of the approximation ratio itself, which is of independent interest.

For the non-contiguous case, we uncover an interesting relation to the classic {\em paging} problem that leads to a simple $O(n \log n)$ algorithm for {\em uniform} profit instances of $n$ jobs. The algorithm is easy to implement and is thus practical.

14:00
Clairvoyant Dynamic Bin Packing for Job Scheduling with Minimum Server Usage Time
SPEAKER: unknown

ABSTRACT. The MinUsageTime Dynamic Bin Packing (DBP) problem aims to minimize the accumulated usage time of all the bins in the packing process. It models the server acquisition and job scheduling issues in many cloud-based systems. Earlier work has studied MinUsageTime DBP in the non-clairvoyant setting, where the departure time of each item is not known at the time of its arrival. In this paper, we investigate MinUsageTime DBP in the clairvoyant setting, where the departure time of each item is known on arrival and can be used for packing purposes. We study both the offline and online versions of Clairvoyant MinUsageTime DBP. We present two approximation algorithms for the offline problem: a 5-approximation Duration Descending First Fit algorithm and a 4-approximation Dual Coloring algorithm. The latter gives the current best upper bound for offline Clairvoyant MinUsageTime DBP. For the online problem, we establish a lower bound of $\frac{1+\sqrt{5}}{2}$ on the competitive ratio of any online packing algorithm. We propose two item classification strategies for online packing: a classify-by-departure-time strategy and a classify-by-duration strategy. We analyze the competitiveness of these strategies when they are applied to the classical First Fit packing algorithm. It is shown that both strategies substantially reduce the competitive ratio for Clairvoyant MinUsageTime DBP compared to the original First Fit algorithm.
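To make the objective concrete, here is a minimal sketch of the plain First Fit baseline under the MinUsageTime objective (our own illustration; the paper's contribution is the classification strategies and their analysis, not this baseline):

```python
# Illustrative sketch of MinUsageTime DBP with plain First Fit: items are
# (arrival, departure, size) triples with sizes in (0, 1]; each arriving
# item goes into the first bin with room at that moment, and each bin is
# charged for the span from its first item's arrival to its last item's
# departure.

def first_fit_usage_time(items):
    items = sorted(items)                  # process in arrival order
    bins = []                              # each bin: list of items it holds
    for arr, dep, size in items:
        for b in bins:
            # load of this bin at the arrival moment (only live items count)
            load = sum(s for a, d, s in b if a <= arr < d)
            if load + size <= 1.0:
                b.append((arr, dep, size))
                break
        else:
            bins.append([(arr, dep, size)])  # open a new bin
    return sum(max(d for _, d, _ in b) - min(a for a, _, _ in b)
               for b in bins)

# Three overlapping items of size > 0.5 each force three bins:
# spans 4 + 2 + 3 = 9 units of total usage time.
print(first_fit_usage_time([(0, 4, 0.6), (1, 3, 0.5), (2, 5, 0.6)]))
```

The paper's strategies instead route items of similar departure time (or similar duration) to dedicated bin groups before applying First Fit, which is what improves the competitive ratio.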

14:20
Improved Approximation Algorithms for Scheduling Co-Flows
SPEAKER: unknown

ABSTRACT. Co-flow scheduling is a recent networking abstraction introduced to capture application-level communication patterns in datacenters. In this framework, a datacenter is modeled as a single non-blocking switch with $m$ input ports and $m$ output ports. A co-flow $j$ is a collection of flow demands $\{d^j_{io}\}_{i \in [m], o \in [m]}$ that is said to be complete once all of its requisite flows have been scheduled.

We consider the offline co-flow scheduling problem with release times to minimize the total weighted completion time. Co-flow scheduling generalizes the well studied concurrent open shop scheduling problem and is thus NP-hard. Recently, Qiu, Stein and Zhong [12] obtained the first constant approximation algorithms for this problem with a deterministic $67/3$-approximation and a randomized $(9 + \frac{16\sqrt{2}}{3}) \approx 16.54$-approximation. In this paper, we improve upon their algorithm to yield a deterministic 12-approximation algorithm. For the special case when all release times are zero, we obtain a deterministic 8-approximation and a randomized $(3 + 2\sqrt{2}) \approx 5.83$-approximation.

14:30
Parallelism in Randomized Incremental Algorithms
SPEAKER: unknown

ABSTRACT. In this paper we show that most sequential randomized incremental algorithms are in fact parallel. We consider several randomized incremental algorithms, including algorithms for sorting; convex hull, Delaunay triangulation, and trapezoidal maps in two dimensions; linear programming, closest pair, and smallest enclosing disk in constant dimensions; as well as least-element lists and strongly connected components on graphs.

We analyze the dependence between iterations in an algorithm, and show that the dependence structure is shallow for all of the algorithms, implying high parallelism. We identify three types of dependences found in the algorithms studied and present a framework for analyzing each type of algorithm. The framework is then applied to each of the algorithms to give work-efficient polylogarithmic-depth parallel algorithms for most of the problems that we study. Some of these algorithms are straightforward (e.g., convex hull and linear programming), while others are more novel and require more effort to obtain the desired bounds (e.g., Delaunay triangulation and strongly connected components). The most surprising of these results is for planar Delaunay triangulation for which the incremental approach is by far the most commonly used in practice, but for which it was not previously known whether it is theoretically efficient in parallel.

14:50
Energy optimization of memory intensive parallel workloads
SPEAKER: unknown

ABSTRACT. Energy consumption is an important concern in modern multicore processors. The energy consumed by a multicore processor during the execution of an application can be minimized by tuning the hardware state utilizing knobs such as frequency, voltage etc. The existing theoretical work on energy minimization using Global DVFS (Dynamic Voltage and Frequency Scaling), despite being thorough, ignores the time and the energy consumed by the CPU on memory accesses and the dynamic energy consumed by the idle cores. This article presents an analytical energy-performance model for parallel workloads that accounts for the time and the energy consumed by the CPU chip on memory accesses in addition to the time and energy consumed by the CPU on CPU instructions. In addition, the model we present also accounts for the dynamic energy consumed by the idle cores. The existing work on global DVFS for parallel workloads suggests that using a single frequency for the entire duration of a parallel application is not energy optimal and that varying the frequency according to the changes in the parallelism of the workload can save energy. We present an analytical framework around our energy-performance model to predict the operating frequencies (which depend upon the amount of parallelism) for global DVFS that minimize the overall CPU energy consumption. We show how the optimal frequencies in our model differ from the optimal frequencies in a model that does not account for memory accesses. We further show how the memory intensity of an application affects the optimal frequencies.

15:30-16:40 Session 9: Parallel Algorithms
15:30
Just Join for Parallel Ordered Sets and Maps
SPEAKER: unknown

ABSTRACT. Ordered sets (and maps, when data is associated with each key) are among the most important and useful data types. The set-set functions union, intersection and difference are particularly useful in certain applications. Brown and Tarjan first described an algorithm for these functions, based on 2-3 trees, that meets the optimal $\Theta\left(n\log\left(\frac{m}{n}+1\right)\right)$ time bounds in the comparison model ($n$ and $m > n$ are the input sizes). Later, Adams showed very elegant algorithms for these functions, and others, that are generic across different balancing schemes. They require only a single function that is specific to the balancing scheme---a function that joins two balanced trees. Furthermore, the algorithms are naturally parallel. However, in the twenty-four years since, no one has shown that the algorithms are work efficient (sequentially or in parallel).

In this paper we show that Adams' algorithms are both work efficient and highly parallel (polylog depth) across four different balancing schemes---AVL trees, red-black trees, weight-balanced trees and treaps. To do this we need careful, but simple, algorithms for Join that maintain certain invariants, and our proof is (mostly) generic across the schemes.
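To illustrate how the set functions reduce to a single Join primitive, here is a sketch of join-based union using plain unbalanced BST nodes; in the paper, join is the only balance-scheme-specific function and also restores the AVL/red-black/weight-balance/treap invariants, which this naive version omits:

```python
# Sketch of a join-based union in the style the abstract describes, with a
# deliberately naive, non-rebalancing `join` for clarity.

class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def join(left, key, right):
    # Balancing-scheme-specific in the paper; here: naive root creation.
    return Node(key, left, right)

def split(t, key):
    """Split tree t by key into (left part, key-present flag, right part)."""
    if t is None:
        return None, False, None
    if key < t.key:
        l, found, r = split(t.left, key)
        return l, found, join(r, t.key, t.right)
    if key > t.key:
        l, found, r = split(t.right, key)
        return join(t.left, t.key, l), found, r
    return t.left, True, t.right

def union(a, b):
    """Union of two BSTs; split/join calls on independent halves are the
    operations that run in parallel in the paper's algorithms."""
    if a is None: return b
    if b is None: return a
    l, _, r = split(b, a.key)
    return join(union(a.left, l), a.key, union(a.right, r))

def keys(t):
    return [] if t is None else keys(t.left) + [t.key] + keys(t.right)

# Example: union of {2, 5, 8} and {3, 8, 9}.
t1 = t2 = None
for k in [5, 2, 8]: t1 = union(Node(k), t1)
for k in [3, 8, 9]: t2 = union(Node(k), t2)
assert keys(union(t1, t2)) == [2, 3, 5, 8, 9]
```

The two recursive `union` calls touch disjoint key ranges, which is what makes the algorithm naturally parallel once `join` maintains balance.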

To understand how they perform in practice we have also implemented the parallel algorithms (all code except Join is generic across the balancing schemes). Interestingly the implementations on all four balancing schemes and three set functions perform similarly in time and speedup (more than 45x on 64 cores). We also compare the performance of our implementation to other existing libraries and algorithms including the standard template library (STL) implementation of red-black trees, the multicore standard template library (MCSTL), and a recent parallel implementation based on weight-balanced trees. Our implementations are not as fast as the best of these on fully overlapping keys, but better than all on keys with a skewed overlap (two Gaussians with different means).

15:50
Parallel Equivalence Class Sorting: Algorithms, Lower Bounds, and Distribution-Based Analysis
SPEAKER: unknown

ABSTRACT. We study parallel comparison-based algorithms for finding all equivalence classes of a set of $n$ elements, where sorting according to some total order is not possible. Such scenarios arise, for example, in applications such as distributed computer security, where each of $n$ agents is working to identify the private group to which it belongs, with the only operation available being a zero-knowledge pairwise comparison (sometimes called a ``secret handshake'') that reveals only whether two agents are in the same group or in different groups. We provide new parallel algorithms for this problem, as well as new lower bounds and distribution-based analysis.

16:10
Parallel Approaches to the String Matching Problem on the GPU
SPEAKER: unknown

ABSTRACT. We design a family of parallel algorithms and GPU implementations for the exact string matching problem, based on Rabin-Karp (RK) randomized string matching. We describe and analyze three primary parallel approaches to binary string matching: cooperative (CRK), divide-and-conquer (DRK), and a novel hybrid of both (HRK). The CRK is most effective for large patterns (>8K characters), while the DRK approach is superior for shorter patterns. We then generalize the DRK to support any alphabet size without loss of performance. Our DRK method achieves up to a 63 GB/s processing rate on 4-character patterns from an 8-bit alphabet on an NVIDIA Tesla K40c GPU. We next demonstrate a novel parallel two-stage matching method (DRK-2S), which first skims the text for a smaller subset of the pattern and then verifies all potential matches in parallel. Our DRK-2S method is superior for pattern sizes up to 128K compared to the fastest CPU-based string matching implementations. With an 8-bit alphabet and up to 1K-character patterns, we get a geometric mean speedup of 3.85x against the best CPU methods, and can achieve a processing rate of at least 53 GB/s.
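For reference, the sequential Rabin-Karp core that the CRK/DRK kernels parallelize looks roughly as follows (a textbook sketch, not the paper's GPU code):

```python
# Sequential Rabin-Karp: a rolling polynomial hash over the text, with each
# hash hit verified against the pattern to rule out collisions.

def rabin_karp(text, pattern, base=256, mod=1_000_000_007):
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)            # weight of the outgoing character
    ph = th = 0
    for i in range(m):                      # hash pattern and first window
        ph = (ph * base + ord(pattern[i])) % mod
        th = (th * base + ord(text[i])) % mod
    matches = []
    for i in range(n - m + 1):
        if th == ph and text[i:i + m] == pattern:   # verify on hash hit
            matches.append(i)
        if i < n - m:                       # roll the window one character
            th = ((th - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return matches

print(rabin_karp("abracadabra", "abra"))    # pattern occurs at 0 and 7
```

The cooperative and divide-and-conquer GPU variants differ in how this window scan is partitioned across threads, and the two-stage DRK-2S first scans for a short pattern fragment before verifying full candidates in parallel.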

16:30
Brief Announcement: MIC++: Accelerating Maximal Information Coefficient Calculation with GPUs and FPGAs
SPEAKER: unknown

ABSTRACT. Discovering relationships and associations between pairs of variables in large biological data sets, e.g. genome sequencing and biological annotations, has become one of the most significant challenges for bioinformatics scientists. To tackle this problem, Maximal Information-based Nonparametric Exploration (MINE) statistics have been widely applied for classifying relationships in areas such as global health, gene expression, recommendation systems, and the human gut microbiota. In particular, the maximal information coefficient (MIC) has been used as a measure of the linear or non-linear association between two variables. However, little work has focused on high performance architectures and accelerators for state-of-the-art MIC calculations. In order to achieve high computing speed for MIC, in this paper we present MIC++, a parallel approach based on heterogeneous accelerators including Graphics Processing Unit (GPU) and Field Programmable Gate Array (FPGA) engines, targeting both coarse-grained and fine-grained parallelism. To augment the proposed MIC++ framework, we also extend the traditional two-variable algorithm to a general multi-variable algorithm, which is able to discover the relationships between multiple sets of variables. We also present a novel method to uncover the parallelism between different iterations in the main computation. To evaluate MIC++, we have measured its performance on the Hadoop cloud computing framework, state-of-the-art GPU accelerators, and FPGA-based accelerators. Experimental results demonstrate that the proposed parallel implementation achieves speedups of more than 6x-14x using GPUs, and 4x-13x using FPGA-based accelerators.

17:00-18:00 Session 10: Robots, Amoebots, and Cobras (Oh my!)
17:00
Universal Shape Formation for Programmable Matter
SPEAKER: unknown

ABSTRACT. We envision programmable matter consisting of systems of computationally limited devices (which we call particles) that are able to self-organize in order to achieve a desired collective goal without the need for central control or external intervention. Central problems for these particle systems are shape formation and coating problems. In this paper, we present a universal shape formation algorithm which takes an arbitrary shape composed of equilateral triangles of unit size and lets the particles build that shape at a scale depending on the number of particles in the system. Our algorithm runs in O(sqrt(n)) asynchronous execution rounds, where n is the number of particles in the system, provided we start from a well-initialized configuration of the particles. This is optimal in the sense that for any shape deviating from the initial configuration, any movement strategy would require Omega(sqrt(n)) rounds in the worst case (over all asynchronous activations of the particles). Our algorithm relies only on local information (e.g., particles do not have ids, nor do they know n, or have any sort of global coordinate/orientation system), and requires only a constant-size memory per particle.

17:20
Asymptotically Optimal Gathering on a Grid
SPEAKER: unknown

ABSTRACT. In this paper, we solve the local gathering problem of a swarm of n indistinguishable, point-shaped robots on a two-dimensional grid in asymptotically optimal time O(n) in the fully synchronous FSYNC time model. Given an arbitrarily distributed (yet connected) swarm of robots, the gathering problem on the grid is to locate all robots within a 2x2-sized area that is not known beforehand. Two robots are connected if they are vertical or horizontal neighbors on the grid. The locality constraint means that no global control, no compass and no global communication are available, and only local vision is: a robot can only see its grid neighbors up to a constant $L_1$-distance, which also limits its movements. A robot can move to one of its eight neighboring grid cells, and if two or more robots move to the same location they are \emph{merged} into a single robot. The locality constraint is the main challenge here, since robot movements must not harm the (only globally checkable) swarm connectivity. To solve the gathering problem, we provide a synchronous algorithm -- executed by every robot -- which ensures that robots merge without breaking the swarm connectivity. In our model, robots can take on a special state, which marks such a robot as performing specific connectivity-preserving movements that allow later merge operations of the swarm. Compared to the grid, for gathering in the Euclidean plane under the same robot and time model the best known upper bound is O(n^2) [].

17:40
Better bounds for coalescing-branching random walks on graphs
SPEAKER: unknown

ABSTRACT. Coalescing-branching random walks, or cobra walks for short, are a natural variant of random walks on graphs that can model the spread of disease through contacts or the spread of information in networks. In a k-cobra walk, at each time step a subset of the vertices are active; each active vertex chooses k random neighbors (sampled independently and uniformly with replacement) that become active at the next step, and these are the only active vertices at the next step. A natural quantity to study for cobra walks is the cover time, which corresponds to the expected time when all nodes have become infected or received the disseminated information. In this work, we extend previous results for cobra walks in multiple ways. We show that the cover time for the 2-cobra walk on $[0,n]^d$ is $O(n)$ (where the order notation hides constant factors that depend on $d$); previous work had shown the cover time was $O(n \cdot \mathrm{polylog}(n))$. We show that the cover time for a 2-cobra walk on an n-vertex d-regular graph with conductance $\phi_G$ is $O(\phi_G^{-2} \log^2 n)$, significantly generalizing a previous result that held only for expander graphs with sufficiently high expansion. And finally we show that the cover time for a 2-cobra walk on a graph with n vertices is always $O(n^{11/4} \log n)$; this is the first result showing that the bound of $\Theta(n^3)$ on the worst-case cover time for random walks can be beaten using 2-cobra walks.
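The k-cobra walk process defined above is easy to simulate directly; a minimal sketch (illustrative only, with a cycle graph as an arbitrary example):

```python
# Simulating a k-cobra walk as defined in the abstract: at each step every
# active vertex picks k random neighbors (with replacement), and exactly
# those picks form the next active set.  The cover time is the first step by
# which every vertex has been active at least once.

import random

def cobra_cover_time(neighbors, start, k=2, seed=0, max_steps=10**6):
    rng = random.Random(seed)
    active = {start}
    covered = {start}
    steps = 0
    while len(covered) < len(neighbors) and steps < max_steps:
        nxt = set()
        for v in active:
            for _ in range(k):
                nxt.add(rng.choice(neighbors[v]))  # branch, then coalesce
        active = nxt
        covered |= nxt
        steps += 1
    return steps

# Example: 2-cobra walk on a cycle of 64 vertices, started at vertex 0.
n = 64
cycle = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}
t = cobra_cover_time(cycle, start=0)
```

Note the coalescing in `nxt.add`: several active vertices picking the same neighbor merge into one token, which is what distinguishes cobra walks from plain branching processes.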