ICA3PP 2016: 16TH INTERNATIONAL CONFERENCE ON ALGORITHMS AND ARCHITECTURES FOR PARALLEL PROCESSING
PROGRAM FOR FRIDAY, DECEMBER 16TH

09:00-10:00 Session 11: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Rafael Asenjo

  • Associate Professor at the Dept. of Computer Architecture, Univ. Malaga

Heterogeneous computing is seen as a path forward to deliver the energy and performance improvements needed over the next decade. To that end, heterogeneous systems feature GPUs (Graphics Processing Units) or FPGAs (Field Programmable Gate Arrays) that excel at accelerating complex tasks while consuming less energy. There are also heterogeneous architectures on-chip, like the processors developed for mobile devices (laptops, tablets and smartphones), which comprise multiple cores and a GPU. More recently, some architectures have also paired multicores with an FPGA on the same die. Examples of the latter are the Xilinx Zynq (2 Cortex-A9 cores + FPGA), the Xilinx UltraScale+ MPSoC (4 Cortex-A53 cores + Mali-400 GPU + FPGA) and Intel HARP (12 Xeon cores + FPGA).

This talk covers hardware and software aspects of this kind of heterogeneous architecture. Regarding the HW, we briefly discuss the underlying architecture of some heterogeneous chips composed of multicores+GPU and multicores+FPGA, delving into the differences between both kinds of accelerators and how to measure the energy they consume. We also address the different solutions to get a coherent view of the memory shared between the cores and the GPU, or between the cores and the FPGA. With respect to the SW, different heterogeneous programming models will be introduced, paying more attention to those aimed at exploiting several devices at the same time (CPU + GPU or CPU + FPGA). Again, the different optimization techniques and the levels of parallelism that are suitable for the GPU and for the FPGA will be identified. Finally, we present our own proposal, which tackles heterogeneous execution of applications based on the pipeline and parallel_for patterns. We discuss our extensions to the Threading Building Blocks (TBB) pipeline and parallel_for templates to automatically distribute the workload between the multicore and the accelerator, taking care of load balancing and considering energy consumption in the scheduling policies. We evaluate the performance and energy efficiency of the different approaches on several heterogeneous processors: Intel Ivy Bridge, Intel Haswell, Samsung Exynos 5 Octa, Xilinx Zynq and Intel Broadwell + Altera Stratix V FPGA.
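For readers unfamiliar with the baseline the talk builds on, below is a minimal sketch of a stock tbb::parallel_for; the heterogeneous scheduling extensions described above are the speaker's own work and are not part of standard TBB.

```cpp
// Minimal sketch of stock TBB parallel_for (not the talk's heterogeneous
// extension): the range is split into chunks that the TBB scheduler
// load-balances across CPU worker threads. The talk's proposal additionally
// offloads some chunks to the GPU/FPGA under energy-aware policies.
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <vector>

void scale(std::vector<float>& data, float factor) {
    tbb::parallel_for(
        tbb::blocked_range<size_t>(0, data.size()),
        [&](const tbb::blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                data[i] *= factor;  // each chunk runs on one worker thread
        });
}
```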

Rafael Asenjo received his B.S. and M.S. degrees in Telecommunications Engineering in 1993 and his Ph.D. degree in 1997, both from the University of Malaga, Spain. He has been an Associate Professor at the Dept. of Computer Architecture, Univ. Malaga, Spain, since 2001, where he leads a research group working on “productivity” in the context of high performance computing. He collaborated on the IBM XL-UPC compiler in 2008 and has contributed to Cray’s Chapel runtime development since 2011. This year he served as General Chair of ACM PPoPP’16, and he has also served as a Program Committee member for IPDPS’13, IPDPS’14 and SC’15. His research interests include programming models, parallelization of irregular codes, parallel I/O, parallelizing compilers and heterogeneous architectures.

Location: Room Alejandra
10:00-10:30 Coffee Break
10:30-12:30 Session 12A: ICA3PP: Distributed and Network-based Computing (II)
Location: Room Alejandra
10:30
Impact of Shutdown Techniques for Energy-Efficient Cloud Data Centers
SPEAKER: Issam Rais

ABSTRACT. Electricity consumption is a worrying concern in current large-scale systems like datacenters and supercomputers. These infrastructures are often dimensioned according to the workload peak. However, their consumption is not power-proportional: when the workload is low, the consumption is still high. Shutdown techniques have been developed to adapt the number of switched-on servers to the actual workload. However, datacenter operators are reluctant to adopt such approaches because of their potential impact on reactivity and hardware failures, and because their energy gain is often largely misjudged. In this article, we evaluate the potential gain of shutdown techniques by taking into account shutdown and boot-up costs in time and energy. This evaluation is made on recent server architectures and future hypothetical energy-aware architectures. We also determine whether knowledge of the future is required for saving energy with such techniques. We present simulation results exploiting real traces collected on different infrastructures under various machine configurations with several shutdown policies, with and without workload prediction.
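The underlying trade-off can be illustrated with a simple break-even computation; the sketch below is a hedged illustration under idealized assumptions (zero power while off, fixed transition costs), not the paper's model.

```cpp
// Hedged break-even sketch (not the paper's model): a node idles at power
// p_idle; shutting down and booting back up take t_off + t_on seconds and
// cost e_off + e_on joules. Shutdown saves energy only for idle periods
// longer than the break-even time returned here.
#include <iostream>

double breakeven_seconds(double p_idle_w, double t_off_s, double t_on_s,
                         double e_off_j, double e_on_j) {
    // Staying idle for T seconds costs p_idle * T joules; shutting down costs
    // e_off + e_on (node assumed to draw no power in between). They are equal
    // at T = (e_off + e_on) / p_idle, provided the idle gap fits t_off + t_on.
    double t = (e_off_j + e_on_j) / p_idle_w;
    double t_min = t_off_s + t_on_s;
    return t > t_min ? t : t_min;
}

int main() {
    // Illustrative numbers only: 100 W idle, 30 s + 150 s transitions,
    // 6 kJ + 30 kJ transition energies => break-even at 360 s.
    std::cout << breakeven_seconds(100.0, 30.0, 150.0, 6e3, 30e3) << " s\n";
}
```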

11:00
Processing Partially Ordered Requests in Distributed Stream Processing Systems
SPEAKER: Ning Huang

ABSTRACT. In many application scenarios of distributed stream processing, there may be partial order relations among the requests. However, existing stream processing systems cannot directly handle partially ordered requests, while indirect mechanisms are usually strongly coupled with the business logic, which makes them inflexible and limits their performance. We propose Pork, a novel distributed stream processing system targeting partially ordered requests, with a simple but efficient fault-tolerance strategy that makes it a complete and realistic system. In order to conduct a thorough evaluation of the new system, a typical example of partially ordered request processing is described. In the experiments, the new system achieves a degree of parallelism and a request throughput several times higher than the traditional mechanism in the presented example, while the performance overhead due to parallelism remains small. We also discuss the scalability of the new system. The experimental results further show that the new system has a more flexible load balancing ability.

11:30
Microcities: a Platform based on Microclouds for Neighborhood Services

ABSTRACT. The current datacenter-centralized architecture limits the cloud to the location of the datacenters, generally far from the user. This architecture collides with the latest trend of ubiquity of Cloud computing. Distance leads to increased utilization of the broadband Wide Area Network and poor user experience, especially for interactive applications. A semi-decentralized approach can provide a better Quality of Experience (QoE) in large urban populations in mobile cloud networks. To do so, it should confine local traffic close to the user while still maintaining centralized characteristics, running on user and network devices. In this paper, we propose a novel semi-decentralized cloud architecture based on microclouds. Microclouds are dynamically created and allow users to contribute resources from their computers, mobile and network devices to the cloud. This way, they provide a dynamic and scalable system without the need for extra investment in infrastructure. We also provide a description of a realistic mobile cloud use case, and the adaptation of microclouds to it. Our results indicate that this architecture scales with the number of mobile devices and provides a latency significantly lower than regular datacenter-centralized approaches.

12:00
Implementation and Optimization of an Indoor Positioning System Based on Wi-Fi Signals
SPEAKER: Xin Li

ABSTRACT. As wireless routers come into wide use, indoor positioning technology based on Wi-Fi signals has drawn increasing attention. In order to improve the accuracy of the indoor positioning system and reduce the workload of data collection in the offline phase, we introduce the Kalman filter algorithm in this paper. This algorithm improves the credibility of the collected data. All the data are clustered using the K-Means algorithm to increase positioning efficiency, and the K-Nearest-Neighbor (KNN) algorithm is performed in the online phase. Experimental results show that the proposed approach achieves high positioning accuracy with the use of filtered data and a weighted KNN algorithm.
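A hedged sketch of the online phase as the abstract describes it is given below; the struct names and the inverse-distance weighting are assumptions for illustration, not taken from the paper.

```cpp
// Hedged sketch of weighted KNN fingerprint positioning. The inverse-distance
// weighting and data layout are assumptions; the paper's exact formulation
// may differ. The caller must ensure k <= db.size().
#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

struct Fingerprint {
    std::vector<double> rssi;  // filtered RSSI vector, one entry per AP
    double x, y;               // surveyed position of this fingerprint
};

// Estimate (x, y) as the weighted average of the k nearest fingerprints.
std::pair<double, double> weighted_knn(const std::vector<Fingerprint>& db,
                                       const std::vector<double>& sample,
                                       size_t k) {
    std::vector<std::pair<double, size_t>> dist;  // (RSSI distance, index)
    for (size_t i = 0; i < db.size(); ++i) {
        double d2 = 0.0;
        for (size_t j = 0; j < sample.size(); ++j) {
            double diff = db[i].rssi[j] - sample[j];
            d2 += diff * diff;
        }
        dist.push_back({std::sqrt(d2), i});
    }
    std::partial_sort(dist.begin(), dist.begin() + k, dist.end());
    double wsum = 0.0, x = 0.0, y = 0.0;
    for (size_t n = 0; n < k; ++n) {
        double w = 1.0 / (dist[n].first + 1e-6);  // closer fingerprints weigh more
        wsum += w;
        x += w * db[dist[n].second].x;
        y += w * db[dist[n].second].y;
    }
    return {x / wsum, y / wsum};
}
```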

10:30-12:30 Session 12B: ICA3PP: Service Dependability and Security in Distributed and Parallel Systems
Location: Room Beatriz I
10:30
OBC Based Optimization of Re-encryption for Cryptographic Cloud Storage
SPEAKER: Huidong Qiao

ABSTRACT. In a cryptographic cloud storage system, it is still very inefficient to revoke a user's access right to a large file. This is because an encrypted file stored in the cloud has to be decrypted and encrypted again under a new key (re-encryption) in order to prevent the revoked user from accessing the file with the previous key. To improve the performance of the re-encryption operation, we propose the orderly block chaining (OBC) encryption mode. OBC has the special property that ciphertext produced by OBC must be decrypted in the precise permutation order of its blocks; without knowledge of the correct permutation order, it is computationally infeasible to decrypt any one of the blocks, even when holding the encryption key. A file encrypted with OBC can therefore be re-encrypted simply by re-permuting the ciphertext blocks into another random permutation order. Experimental results show that the OBC-based optimization sharply cuts down the cost of re-encryption while preserving the security of the data. This makes the revocation of dynamic policies truly practical for cryptographic cloud storage systems.
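A hedged sketch of the re-encryption step as the abstract describes it appears below; the OBC chaining mode itself is the paper's contribution and is not reproduced here, so only the re-permutation idea is shown.

```cpp
// Hedged sketch: OBC-style re-encryption as re-permutation. Revoking access
// amounts to shuffling the stored ciphertext blocks into a fresh random order
// and keeping the new order secret; no block is cryptographically re-encrypted.
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

using Block = std::vector<uint8_t>;

// Reorders the blocks in place and returns the new secret permutation,
// which would be stored/shared only with users who retain access.
std::vector<size_t> repermute(std::vector<Block>& blocks, std::mt19937& rng) {
    std::vector<size_t> perm(blocks.size());
    std::iota(perm.begin(), perm.end(), 0);
    std::shuffle(perm.begin(), perm.end(), rng);
    std::vector<Block> out(blocks.size());
    for (size_t i = 0; i < blocks.size(); ++i)
        out[i] = std::move(blocks[perm[i]]);  // move blocks; no re-encryption
    blocks = std::move(out);
    return perm;
}
```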

11:00
Dynamic Verifiable Search over Encrypted Data in Untrusted Clouds
SPEAKER: Xiaohong Nie

ABSTRACT. The scalable and elastic storage capabilities of cloud computing motivate enterprises and individuals to outsource their data and query services to cloud platforms. Since the cloud service provider (CSP) is outside the trusted domain of cloud users, existing research suggests encrypting data before outsourcing and employing searchable symmetric encryption (SSE) to facilitate keyword-based search on the ciphertexts. To make SSE more applicable in cloud computing, Kurosawa et al. proposed a dynamic verifiable SSE (DVSSE) scheme, which employs inverted indexes and the RSA accumulator to enable the user to search and update files in a verifiable way. However, their scheme works only under the assumption of an honest-but-curious CSP. In this paper, we propose a secure DVSSE scheme, DVSSES, for untrusted cloud environments. Specifically, DVSSES is constructed in two different ways. The basic DVSSES, called DVSSES-1, is built on the Merkle hash tree (MHT) and BLS signatures, and can be easily extended from DVSSE. Since DVSSES-1 incurs a heavy cost during the update phase, the advanced DVSSES, called DVSSES-2, utilizes random permutations to improve performance. Extensive experiments on a real data set demonstrate the efficiency and effectiveness of our proposed scheme.
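As background on the MHT building block behind DVSSES-1, below is a minimal Merkle-root sketch; std::hash stands in for a cryptographic hash such as SHA-256 and the scheme's BLS signatures are omitted, so this is an illustration of the verification idea only.

```cpp
// Minimal Merkle hash tree sketch. std::hash is a placeholder for a
// cryptographic hash (e.g. SHA-256); a real deployment must not use it.
// The server can return the sibling digests along a leaf's path so the
// client recomputes the root and verifies a search result against it.
#include <functional>
#include <string>
#include <vector>

using Digest = size_t;

Digest h(const std::string& s) { return std::hash<std::string>{}(s); }
Digest combine(Digest a, Digest b) {
    return h(std::to_string(a) + ":" + std::to_string(b));
}

// Root digest over the leaf digests (last leaf duplicated on odd levels).
Digest merkle_root(std::vector<Digest> level) {
    while (level.size() > 1) {
        std::vector<Digest> next;
        for (size_t i = 0; i < level.size(); i += 2) {
            Digest right = (i + 1 < level.size()) ? level[i + 1] : level[i];
            next.push_back(combine(level[i], right));
        }
        level = std::move(next);
    }
    return level.empty() ? 0 : level[0];
}
```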

11:30
Reducing TCB of Linux Kernel Using User-Space Device Driver

ABSTRACT. The Linux kernel has an enormous code size, which makes it a prime target for attackers seeking to steal private system data or even crash the system. In particular, the untrusted Linux device drivers, which account for the largest share of the kernel's code, pose a great threat to the kernel. However, current research efforts that try to isolate Linux device drivers either have a large attack surface, due to the complex features exposed to applications, or leave the device drivers in the TCB (trusted computing base). We move the device drivers into user space to reduce the TCB of the kernel, and port OS features as libraries to decrease the kernel's attack surface. This paper presents an architecture based on a proxy driver and library OSes that separates untrusted and unmodified device drivers from kernels enhanced with a narrower system call interface. We discuss the implementation of a prototype, as well as a case study of an unmodified Ethernet card driver supported by the prototype. The evaluation of the case study shows an acceptable performance overhead. We manage to narrow the attack surface by removing 81.6% of the system calls, and we reduce the TCB by decreasing the code base (inside the TCB) of the Ethernet card driver to 900 LoC.

10:30-12:30 Session 12C: ICA3PP: Big Data and its Applications
Location: Room Beatriz II
10:30
Energy-Aware Query Processing on Parallel Database Cluster Nodes
SPEAKER: Amine Roukh

ABSTRACT. In the last few years, we have seen a significant increase in research on the energy efficiency of hardware and software components in both centralized and parallel platforms. In data centers, DBMSs are among the major energy consumers, as large amounts of data are queried daily by complex queries. Having green nodes is a pre-condition for designing an energy-aware parallel database cluster. Most existing DBMSs focus on high performance during the query optimization phase, while usually ignoring the energy consumption of the queries. In this paper, we propose a methodology, supported by a tool called EnerQuery, that makes the nodes of parallel database clusters save energy when optimizing queries. To show its effectiveness, we implement it in the query optimizer of the PostgreSQL DBMS. A mathematical cost model is used to estimate the energy consumption; its parameters are identified by a machine learning technique. We conduct intensive experiments using our cost models and a measurement tool dedicated to computing energy, using the TPC-H benchmark dataset. Based on the obtained results, a probabilistic proof is given to demonstrate the confidence bounds of our model and results.
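To make the idea of a parameterized energy cost model concrete, here is a hedged sketch; the linear form, the field names and the coefficients are assumptions for illustration, not EnerQuery's actual model.

```cpp
// Hedged sketch of a linear power cost model (form and parameters are
// assumptions, not EnerQuery's model): predicted power is a weighted sum of
// the optimizer's per-query resource estimates; the beta coefficients would
// be fitted offline from wattmeter measurements, e.g. by least-squares.
struct QueryEstimates {
    double cpu_cost;  // optimizer's CPU cost estimate for the plan
    double io_cost;   // optimizer's I/O cost estimate for the plan
};

struct PowerModel {
    double beta0, beta_cpu, beta_io;  // learned from measured traces
    double predict_watts(const QueryEstimates& q) const {
        return beta0 + beta_cpu * q.cpu_cost + beta_io * q.io_cost;
    }
};
// Predicted energy for a plan ~= predict_watts(q) * estimated runtime; an
// energy-aware optimizer can then trade a small slowdown for a larger drop
// in predicted power when ranking candidate plans.
```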

11:00
Current flow betweenness centrality with Apache Spark
SPEAKER: Laura Ricci

ABSTRACT. The identification of the most central nodes of a graph is a fundamental task of data analysis. The current flow betweenness is a centrality index which considers how information flows along all the paths of a graph, not only the shortest ones. Finding the exact value of the current flow betweenness is computationally expensive for large graphs, so algorithms returning an approximation of this measure are mandatory. In this paper we propose a solution that estimates the current flow betweenness in a distributed setting using the Apache Spark framework. The computation is defined and organized for the distributed environment in order to provide an approximate solution within an acceptable computational time. Our experimental evaluation shows that the algorithm achieves high correlation with the exact value of the current flow betweenness, is scalable, and outperforms other algorithms.

11:30
Optimizing Inter-server Communications by Exploiting Overlapping Communities in Online Social Networks
SPEAKER: Jingya Zhou

ABSTRACT. With the rapid growth of online social networks (OSNs), inter-server communications are becoming an obstacle to scaling OSN storage systems. To address the problem, network partitioning and data replication are two commonly used approaches. In this paper, we exploit the combination of both approaches simultaneously and propose a data placement scheme based on overlapping community detection. The principle behind the proposed scheme is to co-locate frequently interacting users as long as doing so yields a positive traffic reduction and satisfies the load constraint. We conduct trace-driven experiments, and the results show that our scheme significantly reduces inter-server communications while preserving good load balancing.

10:30-12:30 Session 12D: ICA3PP: Parallel and Distributed Algorithms (IV)
Location: Room Beatriz III
10:30
On a Parallel Algorithm for the Determination of Multiple Optimal Solutions for the LCSS Problem

ABSTRACT. For particular real-world combinatorial optimization problems, e.g. the longest common subsequence problem (LCSSP) from bioinformatics, determining multiple optimal solutions (DMOS) is quite useful for experts. However, for large problem sizes this may be too time consuming, hence the resort to parallel computing. We address here the parallelization of an algorithm for DMOS for the LCSSP. Starting from the dynamic programming algorithm that solves it, we derive a generic algorithm for DMOS (A-DMOS). Since the latter is a non-perfect DO-loop nest, we adopt a three-step approach. The first step transforms the A-DMOS into a perfect nest. The second chooses the granularity, and the third carries out a dependency analysis in order to determine the type of each loop, i.e. parallel or serial. The practical performance of our approach is evaluated through experiments on benchmark inputs and random DNA sequences, targeting a parallel multicore machine.
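For context, below is the standard sequential LCS dynamic program that the paper starts from (a textbook baseline, not the paper's parallel A-DMOS); multiple optimal solutions arise because the traceback can tie.

```cpp
// Standard LCS dynamic program (the sequential baseline, not the parallel
// A-DMOS). Multiple optimal subsequences exist whenever the traceback ties:
// when dp[i-1][j] == dp[i][j-1], both directions lead to an optimum.
#include <algorithm>
#include <string>
#include <vector>

int lcs_length(const std::string& a, const std::string& b) {
    std::vector<std::vector<int>> dp(a.size() + 1,
                                     std::vector<int>(b.size() + 1, 0));
    for (size_t i = 1; i <= a.size(); ++i)
        for (size_t j = 1; j <= b.size(); ++j)
            dp[i][j] = (a[i - 1] == b[j - 1])
                           ? dp[i - 1][j - 1] + 1              // match: extend
                           : std::max(dp[i - 1][j], dp[i][j - 1]);  // skip one
    return dp[a.size()][b.size()];
}
```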

11:00
GPU computing to speed up the resolution of microrheology models
SPEAKER: Gloria Ortega

ABSTRACT. Active microrheology is a technique to obtain rheological properties of soft matter from the microscopic motion of colloidal tracers used as probes and subjected to external forces. This technique extends the measurement of the friction coefficient to the nonlinear-response regime of strongly driven probes. Active microrheology can be described starting from microscopic equations of motion for the whole system, including both the host-fluid particles and the tracer. While the main observable is the effective friction coefficient with the bath, tracer position correlation functions describe the tracer motion and reveal the underlying dynamics of the host bath. On the other hand, pulling the tracer provokes a non-linear, non-affine strain field in the host bath, which requires a deep understanding of the dynamics of the system. Different theoretical approaches have been proposed to deal with this problem. In this work, we present simulations of a tracer dragged by a constant force through a dense bath of hard colloids. The size of the tracer has been varied, keeping the bath density constant, approaching the hydrodynamic limit. In order to calculate the tracer's position, iterative methods have to be used. These methods are computationally highly demanding, especially when the number of colloidal tracers is high. Therefore, it is necessary to use HPC in order to develop and validate this kind of model. The present work shows the results of the microrheology method, varying the number of colloidal tracers and using GPU computing to solve the problem of interest.
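The abstract does not name the iterative method used; as a generic illustration of why such solvers are costly and GPU-friendly, here is a hedged Jacobi iteration sketch for a dense linear system.

```cpp
// Generic Jacobi iteration sketch (illustrative only; the paper's solver may
// differ). Each sweep touches all n^2 matrix entries and the row updates are
// independent, which is exactly the structure GPUs exploit.
#include <algorithm>
#include <cmath>
#include <vector>

using Mat = std::vector<std::vector<double>>;
using Vec = std::vector<double>;

Vec jacobi(const Mat& A, const Vec& b, int max_iter, double tol) {
    size_t n = b.size();
    Vec x(n, 0.0), x_new(n, 0.0);
    for (int it = 0; it < max_iter; ++it) {
        double delta = 0.0;
        for (size_t i = 0; i < n; ++i) {   // rows independent: one GPU thread each
            double s = 0.0;
            for (size_t j = 0; j < n; ++j)
                if (j != i) s += A[i][j] * x[j];
            x_new[i] = (b[i] - s) / A[i][i];
            delta = std::max(delta, std::fabs(x_new[i] - x[i]));
        }
        x.swap(x_new);
        if (delta < tol) break;            // converged
    }
    return x;
}
```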

11:30
Locality of Computation for Stencil Optimization
SPEAKER: Junhong Liu

ABSTRACT. Stencil computation is a performance-critical kernel that is widely used in scientific and engineering applications. In the case of highly accurate numerical simulation, high-order stencil computation is compute-intensive and lacks an analytical model to guide its optimization. In this paper we propose a novel insight into stencil computation optimization: compute locality (locality of computation), which is classified into spatial locality and temporal locality. Moreover, we develop a redundant computation elimination (RCE) algorithm to exploit temporal locality. It is able to detect and eliminate inter-iteration redundant computation in multiple directions of stencil loops without hindering the spatial locality of SIMD. We implement the RCE optimization strategy using the ROSE compiler infrastructure. Experiments with a benchmark of eleven stencil applications show that the temporal locality exploited by RCE improves performance by 15.4% and 10.1% on average for the benchmark without and with SIMD optimization, respectively.
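To illustrate what inter-iteration redundant computation looks like, below is a hedged 1D sliding-window example (a generic reuse trick, not the paper's RCE algorithm); for a high-order (2k+1)-point stencil the same idea replaces 2k additions per point with about two operations.

```cpp
// Hedged illustration of inter-iteration redundancy in stencils (generic
// sliding-window reuse, not the paper's RCE). The naive 3-point sum
// recomputes two of its three operands at every i; carrying a running
// window sum removes that redundancy. Caller sizes out like in.
#include <vector>

// Naive: 2 additions per point, most operands shared with the previous i.
void stencil_naive(const std::vector<double>& in, std::vector<double>& out) {
    for (size_t i = 1; i + 1 < in.size(); ++i)
        out[i] = in[i - 1] + in[i] + in[i + 1];
}

// Reuse: update the window incrementally, ~1 add + 1 subtract per point.
void stencil_reuse(const std::vector<double>& in, std::vector<double>& out) {
    double window = in[0] + in[1] + in[2];
    for (size_t i = 1; i + 1 < in.size(); ++i) {
        out[i] = window;
        if (i + 2 < in.size())
            window += in[i + 2] - in[i - 1];  // slide the window by one cell
    }
}
```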

12:00
Bin Recycling Strategy for an Accuracy-Aware Implementation of Two-Point Angular Correlation Function on GPU

ABSTRACT. Cosmological studies, in particular those relating to the large-scale distribution of galaxies, have to cope with an extraordinary increase in data volume with the current and upcoming sky surveys. These usually involve the estimation of N-point correlation functions of galaxy properties. Because the correlation functions are based on histogram construction, they have a high computational cost, which worsens with the ever-growing size of the datasets and the standard sample. At the same time, correlation functions exhibit a high sensitivity to the accuracy of the estimation. Therefore, their implementations must maintain high accuracy within a reasonable processing time. GPU computing can be adopted to overcome the latter problem, but the standard implementation of histogram construction on GPU lacks the accuracy needed for calculating the cosmological correlation functions. In this work, the "bin recycling strategy" is implemented and evaluated for the estimation of the Two-Point Angular Correlation Function, and the lack of appropriate accuracy in several GPU implementations of histogram construction is demonstrated. The "bin recycling strategy" for the Two-Point Angular Correlation Function outperforms other implementations while enabling the processing of a large number of galaxies. As a result of this work, an accuracy-aware GPU implementation of the Two-Point Angular Correlation Function is presented and evaluated to ensure the correctness of the results.
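For reference, here is a hedged CPU sketch of the pair-counting histogram at the heart of the two-point angular correlation function; the paper's GPU "bin recycling strategy" is not reproduced, and the binning details are assumptions.

```cpp
// Hedged CPU reference for the DD(theta) histogram underlying the two-point
// angular correlation function: count galaxy pairs per angular-separation
// bin. Linear binning up to theta_max is an assumption for illustration.
#include <cmath>
#include <cstdint>
#include <vector>

struct Galaxy { double ra, dec; };  // right ascension / declination, radians

std::vector<uint64_t> dd_histogram(const std::vector<Galaxy>& g,
                                   size_t bins, double theta_max) {
    std::vector<uint64_t> hist(bins, 0);
    for (size_t i = 0; i < g.size(); ++i) {
        for (size_t j = i + 1; j < g.size(); ++j) {  // all distinct pairs
            // Angular separation on the sphere via the spherical law of cosines.
            double c = std::sin(g[i].dec) * std::sin(g[j].dec) +
                       std::cos(g[i].dec) * std::cos(g[j].dec) *
                           std::cos(g[i].ra - g[j].ra);
            // Clamp against rounding before acos: precisely the kind of
            // floating-point accuracy issue the paper worries about on GPUs.
            c = std::fmax(-1.0, std::fmin(1.0, c));
            double theta = std::acos(c);
            if (theta < theta_max)
                ++hist[static_cast<size_t>(theta / theta_max * bins)];
        }
    }
    return hist;
}
```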

12:30-13:00 Session 13: Conference closing

Conference closing and presentation of the next conference location.

Location: Room Alejandra
13:00-14:00 Lunch Break