previous day
next day
all days

View: session overviewtalk overview

08:30-08:45 Session 5: Conference inauguration
Franck Cappello (Argonne National Laboratory, USA)
Jesus Carretero (Universidad Carlos III de Madrid, Spain)
Javier Garcia Blas (Carlos III University, Spain)
Manish Parashar (State University of New Jersey University, USA)
Location: Barrio de Salamanca
08:45-09:30 Session 6: Keynote Speaker 1. Rick Stevens.


The adoption of machine learning is proving to be an amazingly successful strategy in improving predictive models for cancer and infectious disease. In this talk I will discuss two projects my group is working on to advance biomedical research through the use of machine learning and HPC. In cancer, machine learning and in deep learning in particular, is used to advance our ability to diagnosis and classify tumors. Recently demonstrated automated systems are routinely out performing human expertise. Deep learning is also being used to predict patient response to cancer treatments and to screen for new anti-cancer compounds. In basic cancer research its being use to supervise large-scale multi-resolution molecular dynamics simulations used to explore cancer gene signaling pathways. In public health it’s being used to interpret millions of medical records to identify optimal treatment strategies. In infectious disease research machine learning methods are being used to predict antibiotic resistance and to identify novel antibiotic resistance mechanisms that might be present. More generally machine learning is emerging as a general tool to augment and extend mechanistic models in biology and many other fields. It’s becoming an important component of scientific workloads. From a computational architecture standpoint, deep neural network (DNN) based scientific applications have some unique requirements. They require high compute density to support matrix-matrix and matrix-vector operations, but they rarely require 64bit or even 32bits of precision, thus architects are creating new instructions and new design points to accelerate training. Most current DNNs rely on dense fully connected networks and convolutional networks and thus are reasonably matched to current HPC accelerators. However future DNNs may rely less on dense communication patterns. Like simulation codes power efficient DNNs require high-bandwidth memory be physically close to arithmetic units to reduce costs of data motion and a high-bandwidth communication fabric between (perhaps modest scale) groups of processors to support network model parallelism. DNNs in general do not have good strong scaling behavior, so to fully exploit large-scale parallelism they rely on a combination of model, data and search parallelism. Deep learning problems also require large-quantities of training data to be made available or generated at each node, thus providing opportunities for NVRAM. Discovering optimal deep learning models often involves a large-scale search of hyperparameters. It’s not uncommon to search a space of tens of thousands of model configurations. Naïve searches are outperformed by various intelligent searching strategies, including new approaches that use generative neural networks to manage the search space. HPC architectures that can support these large-scale intelligent search methods as well as efficient model training are needed.

Franck Cappello (Argonne National Laboratory, USA)
Location: Barrio de Salamanca
09:30-10:30 Session 7: Best Paper Award
Rajkumar Buyya (The University of Melbourne, Australia)
Location: Barrio de Salamanca
Hamid Arabnejad (Dublin City University (DCU), Ireland)
Claus Pahl (Free University of Bozen-Bolzano, Italy)
Pooyan Jamshidi (Imperial College London, UK)
Giovani Estrada (Intel, Leixlip, Ireland)
A Comparison of Reinforcement Learning Techniques for Fuzzy Cloud Auto-Scaling

ABSTRACT. A goal of cloud service management is do design self-adaptable auto-scaler to react to workload fluctuations and changing the resources assigned. The key problem is how and when to add/remove resources in order to meet agreed service-level agreements. Reducing application cost and guaranteeing service-level agreements (SLAs) are two critical factors of dynamic controller design. In this paper, we compare two dynamic learning strategies based on a fuzzy logic system, which learns and modifies fuzzy scaling rules at runtime. A self-adaptive fuzzy logic controller is combined with two reinforcement learning (RL) approaches: (i) Fuzzy SARSA learning (FSL) and (ii) Fuzzy Q-learning (FQL). As an off-policy approach, Q-learning learns independent of the policy currently follows, whereas SARSA as an on-policy always incorporates the actual agent’s behavior and leads to faster learning. Both approaches are implemented and compared in their advantages and disadvantages, here in the OpenStack cloud platform. We demonstrate that both auto-scaling approaches can handle various load traffic situations, sudden and periodic, and delivering resources on demand while reducing operating costs and preventing SLA violations. The experimental results demonstrate that FSL and FQL have acceptable performance in terms of adjust number of virtual machine targeted to optimize SLA compliance and response time.

Phuong Nguyen (University of Illinois at Urbana-Champaign, USA)
Steven Konstanty (University of Illinois at Urbana-Champaign, USA)
Todd Nicholson (University of Illinois at Urbana-Champaign, USA)
Thomas O'Brien (University of Illinois at Urbana-Champaign, USA)
Aaron Schwartz-Duval (University of Illinois at Urbana-Champaign, USA)
Timothy Spila (University of Illinois at Urbana-Champaign, USA)
Klara Nahrstedt (University of Illinois at Urbana-Champaign, USA)
Roy Campbell (University of Illinois at Urbana-Champaign, USA)
Indranil Gupta (University of Illinois at Urbana-Champaign, USA)
Michael Chan (University of Illinois at Urbana-Champaign, USA)
Kenton McHenry (University of Illinois at Urbana-Champaign, USA)
Normand Paquin (University of Illinois at Urbana-Champaign, USA)
4CeeD: Real-Time Data Acquisition and Analysis Framework for Material-related Cyber-Physical Environments

ABSTRACT. In this paper, we present a data acquisition and analysis framework for materials-to-devices processes, named 4CeeD, that focuses on the immense potential of capturing, accurately curating, correlating, and coordinating materials-to-devices digital data in a real-time and trusted manner before fully archiving and publishing them for wide access and sharing. In particular, 4CeeD consists of novel services: a curation service for collecting data from microscopes and fabrication instruments, curating, and wrapping of data with extensive metadata in real-time and in a trusted manner, and a cloud-based coordination service for storing data, extracting meta-data, analyzing and finding correlations among the data. Our evaluation results show that our novel cloud framework is able to help researchers significantly save time and cost spent on experiments, and is efficient in dealing with high-volume and fast-changing workload of heterogeneous types of experimental data.

10:30-11:00Coffee Break
11:00-12:30 Session 8A: Mobile, Hybrid and Emerging Clouds
Dhabaleswar Panda (The Ohio State University, USA)
Location: Velazquez
Francisco J. Clemente-Castelló (Universidad Jaume I, Spain)
Bogdan Nicolae (Huawei Research Germany, Germany)
Rafael Mayo Gual (University Jaume I, Spain)
Juan Carlos Fernandez (Universitat Jaume I, Spain)
M. Mustafa Rafique (IBM Research, Ireland)
Data Locality Strategies for Iterative MapReduce Applications in Hybrid Clouds

ABSTRACT. Hybrid cloud bursting (i.e., leasing temporary off-premise cloud resources to boost the overall capacity during peak utilization) is a popular and cost-effective way to deal with the increasing complexity of big data analytics, where the explosion of data sizes frequently leads to insufficient local data center capacity. While attractive, hybrid cloud bursting is not without challenges: the network link between the on-premise and the off-premise computational resources often exhibit high latency and low throughput ("weak link") compared to the links within the same data-center. How this reflects on the data transfer patterns exhibited by big data anlytics applications is not well understood. This paper studies the impact of such inter-cloud data transfers that need to pass over the weak links and proposes novel data locality strategies to minimize the negative consequences. We focus our study on iterative MapReduce applications, which are a class of large-scale data intensive applications particularly popular on hybrid clouds. We show that data movements need to be optimized in close coordination with the task scheduling for optimal results, especially when the weak link is of low throughput. We run extensive experiments in a distributed, multi-VM setup and report multi-fold improvement over traditonal approaches both in terms of performance and cost-effectiveness.

Yunbo Li (Mines Nantes, France)
Anne-Cécile Orgerie (CNRS, France)
Ivan Rodero (Rutgers University, USA)
Manish Parashar (Rutgers University, USA)
Jean-Marc Menaud (Mines Nantes, France)
Leveraging Renewable Energy in Edge Clouds for Data Stream Analysis in IoT

ABSTRACT. The emergence of Internet of Things (IoT) is participating to the increase of data- and energy-hungry applications. As connected devices do not yet offer enough capabilities for sustaining these applications, users perform computation offloading to the cloud. To avoid network bottlenecks and reduce the costs associated to data movement, edge cloud solutions have started being deployed, thus improving the Quality of Service. In this paper, we advocate for leveraging on-site renewable energy production in the different edge cloud nodes to green IoT systems while offering improved QoS compared to core cloud solutions. We propose an analytical model to decide whether to offload computation from the objects to the edge or to the core Cloud, depending on the renewable energy availability and the desired application QoS. This model is validated on our application use-case that deals with video stream analysis from vehicle cameras.

Longjie Ma (Guangdong University of Technology, China)
Jigang Wu (Guangdong University of Technology, China)
Long Chen (Guangdong University of Technology, China)
DOTA: Delay Bounded Optimal Cloudlet Deployment and User Association in WMANs

ABSTRACT. In the large-scale Wireless Metropolitan Area Network (WMAN) consisting of many wireless Access Points (APs), choosing the appropriate position to place cloudlet is very important for reducing the user’s access delay. For service provider, it is always very costly to deployment cloudlets. How many cloudlets should be placed in a WMAN and how much resource each cloudlet should have is very important for the service provider. In this paper, we study the cloudlet placement and resource allocation problem in a large-scale Wireless WMAN, we formulate the problem as an novel cloudlet placement problem that given an average access delay between mobile users and the cloudlets, place K cloudlets to some strategic locations in the WMAN with the objective to minimize the number of use cloudlet K. We then propose an exact solution to the problem by formulating it as an Integer Linear Programming (ILP). Due to the poor scalability of the ILP, we devise a clustering algorithm K-Medoids (KM) for the problem. For a special case of the problem where all cloudlets computing capabilities have been given, we propose an efficient heuristic for it. We finally evaluate the performance of the proposed algorithms through experimental simulations. Simulation result demonstrates that the proposed algorithms are effective.

11:00-12:30 Session 8B: Applications and Big Data (I)
Roy Campbell (University of Illinois at Urbana-Champaign, USA)
Location: Goya
Wonbae Kim (UNIST, South Korea)
Young-Ri Choi (UNIST, South Korea)
Beomseok Nam (UNIST, South Korea)
Mitigating YARN Container Overhead with Input Splits

ABSTRACT. We analyze YARN container overhead and present early results of reducing its overhead by dynamically adjusting the input split size. YARN is designed as a generic resource manager that decouples programming models from resource management infrastructures. We demonstrate that YARN’s generic design incurs significant overhead because each container must perform various initialization steps, including authentication. To reduce container overhead without changing the existing YARN framework significantly, we propose leveraging the input split, which is the logical representation of physical HDFS blocks. With input splits, we can combine multiple HDFS blocks and increase the input size of each container, thereby enabling a single map wave and reducing the number of containers and their initialization overhead. Experimental results shows that we can avoid recurring container overhead by selecting the right size for input splits and reducing the number of containers.

Jonathan Wang (UC Berkeley, USA)
Wucherl Yoo (Lawrence Berkeley National Laboratory, USA)
Alex Sim (Lawrence Berkeley National Laboratory, USA)
Peter Nugent (Lawrence Berkeley National Laboratory, USA)
Kesheng Wu (Berkeley Lab, USA)
Parallel Variable Selection for Effective Performance Prediction

ABSTRACT. Large data analysis problems often involve a large number of variables, and the corresponding analysis algorithms may examine all variable combinations to find the optimal solution. For example, to model the time required to complete a scientific workflow, we need to consider the impact of dozens of parameters. To reduce the model building time and reduce the likelihood of overfitting, we look to variable selection methods to identify the critical variables for the performance model. In this work, we create a combination of variable selection and performance prediction methods that is as effective as the exhaustive search approach when the exhaustive search could be completed in a reasonable amount of time. To handle the cases where the exhaustive search is too time consuming, we develop the parallelized variable selection algorithm. Additionally, we develop a parallel grouping mechanism that further reduces the variable selection time by 70%. As a case study, we exercise the variable selection technique with the performance measurement data from the Palomar Transient Factory (PTF) workflow. The application scientists have determined that about 50 variables and parameters are important to the performance of the workflows. Our tests show that the Sequential Backward Selection algorithm is able to approximate the optimal subset relatively quickly. By reducing the number of variables used to build the model from 50 to 4, we are able to maintain the prediction quality while reducing the model building time by a factor of 6. Using the parallelization and grouping techniques we developed in this work, the variable selection process was reduced from over 18 hours to 15 minutes while ending up with the same variable subset.

Dongyao Wu (Data61, CSIRO, Australia)
Sherif Sakr (The University of New South Wales, Australia)
Liming Zhu (Data61, CSIRO, Australia)
Huijun Wu (Data61, CSIRO, Australia)
Towards Big Data Analytics across Multiple Clusters

ABSTRACT. Big data are increasingly collected and stored in a highly distributed infrastructures due to the development of sensor network, cloud computing, IoT and mobile computing among many other emerging technologies. In practice, the majority of existing big-data-processing frameworks (e.g., Hadoop and Spark) are designed based on the single-cluster setup with the assumptions of centralized management and homogeneous connectivity which makes them sub-optimal and sometimes infeasible to apply for scenarios that require implementing data analytics jobs on highly distributed data sets (across racks, data centers or multi-organizations). In order to tackle this challenge, we present HDM-MC, a multi-cluster big data processing framework which is designed to enable the capability of performing large scale data analytics across multi-clusters with minimum extra overhead due to additional scheduling requirements. In this paper, we present the architecture and realization of the system. In addition, we evaluate the performance of our framework in comparison to other state-of-art single cluster big data processing frameworks.

11:00-12:30 Session 8C: Scheduling and Resource Management (I)
Shikharesh Majumdar (Carleton Univeristy, Canada)
Location: Serrano
Hyunwook Baek (University of Utah, USA)
Abhinav Srivastava (AT&T Labs - Research, USA)
Jacobus Van der Merwe (University of Utah, USA)
CloudSight: A tenant-oriented transparency framework for cross-layer cloud troubleshooting

ABSTRACT. Troubleshooting in an infrastructure-as-a-Service (IaaS) cloud platform is an inherently difficult task because it is a multi-player as well as multi-layer environment where tenant and provider effectively share administrative duties. To address these concerns, we present our work on CloudSight in which cloud providers allow tenants greater system-wide visibility through a transparency-as-a-service abstraction. We present the design, implementation, and evaluation of CloudSight in the OpenStack cloud platform.

Fei Zhang (Gessellschaft für wissenschafttiche Datenverarbeitung mbH Göttingen (GWDG), Göttingen, Germany, Germany)
Xiaoming Fu (Georg-August-Universität Göttingen, Germany)
Ramin Yahyapour (Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG), Germany)
CBase: A New Paradigm for Fast Virtual Machine Migration across Data Centers

ABSTRACT. Live Virtual Machine (VM) migration offers a couple of benefits to cloud providers and users, but it is limited within a data center. With the development of cloud computing and the cooperation between data centers, live VM migration is also desired across data centers. Based on a detailed analysis of VM deployment models and the nature of VM image data, we design and implement a new migration framework called CBase. The key concept of CBase is a newly introduced central base image repository for reliable and efficient data sharing between VMs and data centers. With this central base image repository, live VM migration and further performance optimizations are made possible. The results from an extensive experiment show that CBase is able to support VM migration efficiently, outperforming existing solutions in terms of total migration time and network traffic.

Chao Zheng (University of Notre Dame, USA)
Ben Tovar (University of Notre Dame, USA)
Douglas Thain (University of Notre Dame, USA)
Deploying High Throughput Scientific Workflows on Container Schedulers with Makeflow and Mesos

ABSTRACT. Workflows are a widely used abstraction for describing large scientific applications and running them on distributed systems. However, most workflow systems have been silent on the question of what execution environment each task in the workflow is expected to run in. Consequently, a workflow may run successfully in the environment it developed, but often fail on other platform due to the differences in operating system, installed software, and network configuration. Container-based scheduler has recently arisen as a potential solution to this problem, which adopts container to distribute computing resources and deliver well-defined execution environments to applications in need. In this paper, we consider how to better connect workflow system to container schedulers. As an example of current technology, we use Makeflow and Mesos. We present five design challenges, and address them by using four configurations that connecting workflow system to container scheduler from different level of the infrastructure. In order to take full advantage of the resource sharing schema of Mesos, we enable the resource monitor of Makeflow to dynamically update the task resource requirement. We explore the performance of a large bioinformatics workflow, and observe that using Makeflow, Work Queue and Resource monitor together not only increase the transfer throughput but also achieves highest resource usage rate.

11:00-12:30 Session 8D: Scale Challenge
Antonio J. Peña (Barcelona Supercomputing Center (BSC), Spain)
Michela Täufer (University of Delaware, USA)
Location: Latina
Abhinav Bhatele (Lawrence Livermore National Laboratory, USA)
Jae-Seung Yeom (Lawrence Livermore National Laboratory, USA)
Nikhil Jain (Lawrence Livermore National Laboratory, USA)
Chris Kuhlman (Virginia Tech, USA)
Yarden Livnat (University of Utah, USA)
Keith Bisset (Virginia Tech, USA)
Laxmikant Kale (University of Illinois at Urbana-Champaign, USA)
Madhav Marathe (Virginia Tech, USA)
Massively Parallel Simulations of Spread of Infectious Diseases over Realistic Social Networks

ABSTRACT. Controlling the spread of infectious diseases in large populations is an important societal challenge. Mathematically, the problem is best captured as a certain class of reaction-diffusion processes (referred to as contagion processes) over appropriate synthesized interaction networks. Agent-based models have been successfully used in the recent past to study such contagion processes. We describe EpiSimdemics, a highly scalable, parallel code written in Charm++ that uses agent-based modeling to simulate disease spreads over large, realistic, co-evolving interaction networks. We present a new parallel implementation of EpiSimdemics that achieves unprecedented strong and weak scaling on different architectures --- Blue Waters, Cori and Mira. EpiSimdemics achieves five times greater speedup than the second fastest parallel code in this field. This unprecedented scaling is an important step to support the long term vision of real-time epidemic science. Finally, we demonstrate the capabilities of EpiSimdemics by simulating the spread of influenza over a realistic synthetic social contact network spanning the continental United States (~280 million nodes and 5.8 billion social contacts).

Mahmoud Ismail (KTH - Royal Institute of Technology, Sweden)
Salman Niazi (KTH - Royal Institute of Technology, Sweden)
Mikael Ronström (Oracle, Sweden)
Seif Haridi (KTH - Royal Institute of Technology, Sweden)
Jim Dowling (KTH - Royal Institute of Technology, Sweden)
Scaling HDFS to more than 1 million operations per second with HopsFS

ABSTRACT. HopsFS is an open-source, next generation distribution of the Apache Hadoop Distributed File System (HDFS) that replaces the main scalability bottleneck in HDFS, single node in-memory metadata service, with a no-shared state distributed system built on a NewSQL database. By removing the metadata bottleneck in Apache HDFS, HopsFS enables significantly larger cluster sizes, more than an order of magnitude higher throughput, and significantly lower client latencies for large clusters. In this paper, we detail the techniques and optimizations that enable HopsFS to surpass 1 million file system operations per second - at least 16 times higher throughput than HDFS. In particular, we discuss how we exploit recent high performance features from NewSQL databases, such as application defined partitioning, partition-pruned index scans, and distribution aware transactions. Together with more traditional techniques, such as batching and write-ahead caches, we show how many incremental optimizations have enabled a revolution in distributed hierarchical file system performance.

Jordi Torres (Barcelona Supercomputing Center (BSC) & Universitat Politècnica de Catalunya - UPC Barcelona Tech, Spain)
Francesc Sastre (Barcelona Supercomputing Center (BSC-CNS), Spain)
Maurici Yagües (Barcelona Supercomputing Center (BSC-CNS), Spain)
Víctor Campos (Barcelona Supercomputing Center (BSC-CNS), Spain)
Xavier Giró-I-Nieto (Universitat Politècnica de Catalunya - UPC Barcelona Tech, Spain)
Scaling a Convolutional Neural Network for classification of Adjective Noun Pairs with TensorFlow on GPU Clusters

ABSTRACT. Deep neural networks have gained popularity in recent years, obtaining outstanding results in a wide range of applications such as computer vision in both academia and multiple industry areas. The progress made in recent years cannot be understood without taking into account the technological advancements seen in key domains such as High Performance Computing, more specifically in the Graphic Processing Unit (GPU) domain. These kind of deep neural networks need massive amounts of data to effectively train the millions of parameters they contain, and this training can take up to days or weeks depending on the computer hardware we are using. In this work, we present how the training of a deep neural network can be parallelized on a distributed GPU cluster. The effect of distributing the training process is addressed from two different points of view. First, the scalability of the task and its performance in the distributed setting are analyzed. Second, the impact of distributed training methods on the training times and final accuracy of the models is studied. We used TensorFlow on top of the GPU cluster of servers with 2 K80 GPU cards, at Barcelona Supercomputing Center (BSC). The results show an improvement for both focused areas. On one hand, the experiments show promising results in order to train a neural network faster. The training time is decreased from 106 hours to 16 hours in our experiments. On the other hand we can observe how increasing the numbers of GPUs in one node rises the throughput, images per second, in a near-linear way. Morever an additional distributed speedup of 10.3 is achieved with 16 nodes taking as baseline the speedup of one node.

Shaoliang Peng (NUDT, China)
Bertil Schmidt (Johannes Gutenberg University Mainz, Germany)
Yutong Lu (NUDT, China)
Xiangke Liao (NUDT, China)
Weiliang Zhu (Shanghai Institute of Materia Medica, Chinese Academy of Sciences, China)
Kuan-Ching Li (Providence University, Taiwan)
mD3DOCKxb: An Ultra-Scalable CPU-MIC Coordinated Virtual Screening Framework

ABSTRACT. Molecular docking is an important method in computational drug discovery. In large-scale virtual screening, millions of small drug-like molecules (chemical compounds) are compared against a designated target protein (receptor). Depending on the utilized docking algorithm for screening, this can take several weeks on conventional HPC systems. However, for certain applications including large-scale screening tasks for newly emerging infectious diseases such high runtimes can be highly prohibitive. In this paper, we investigate how the massively parallel neo-heterogeneous architecture of Tianhe-2 Supercomputer consisting of thousands of nodes comprising CPUs and MIC coprocessors that can efficiently be used for virtual screening tasks. Our proposed approach is based on a coordinated parallel framework called mD3DOCKxb in which CPUs collaborate with MICs to achieve high hardware utilization. mD3DOCKxb comprises a novel efficient communication engine for dynamic task scheduling and load balancing between nodes in order to reduce communication and I/O latency. This results in a highly scalable implementation with parallel efficiency of over 84% (strong scaling) when executing on 8,000 Tianhe-2 nodes comprising 192,000 CPU cores and 1,368,000 MIC cores. As a case study, we have completed the screening of two million chemical compounds against the Ebola (VP35) virus in around one hour, which is an important and promising fact for handling acute infectious diseases. Availability: More source code and results are available at: Source Code:, Results:

Meng Jintao (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China)
Ning Guo (Shenzhen Institutes of Advanced Technology, Chinese Academy of Science, China)
Jianqiu Ge (Shenzhen Institutes of Advanced Technology, Chinese Academy of Science, China)
Yanjie Wei (Shenzhen Institutes of Advanced Technology, Chinese Academy of Science, China)
Pavan Balaji (Argonne National Laboratory, USA)
Bingqiang Wang (National Supercomputer Center in Guangzhou, Sun Yat-sen University, China)
Scalable Assembly for Massive Genomic Graphs

ABSTRACT. Scientists increasingly want to assemble large genomes, metagenomes, and large numbers of individual genomes. In order to meet the demand for processing these huge datasets, parallel genome assembly is a vital step. Among all the parallel genome assemblers, de Bruijn graph based ones are most popular. However, the size of de Bruijn graph is determined by the number of distinct kmers used in the algorithm, thus redundant kmers in the genome datasets donot contribute to the graph size. The scalability of genome assemblers is influenced directly by the distinct kmers in the dataset or de Bruijn graph size, rather than the input dataset size. In order to assembly large genomes, we have artificially created 16 datasets of 4 Terabytes in total from the human reference genome. The human reference genome is firstly mutated with a 5% mutation rate, and then subject to a genome sequencing data simulator ART. The simulated datasets have linearly increasing number of distinct kmers as the size/number of the combined datasets increases. We then evaluate all five time-consuming steps of the SWAP-Assembler 2.0 (SWAP2) using these 16 simulated datasets. Compared with our previous experiment on 1000 human dataset with fixed de Bruijn graph size, the weak-scaling test shows that SWAP2 can scale well from 1024 cores using one dataset to 16,384 cores with 16 simulated datasets. The percentage of time usage for all five steps of SWAP2 is fixed, and total time usage is also constant. The result showed that the time usage of graph simplification occupied almost 75% of the total time usage, which will be subject to further optimization for future work.

12:30-14:00Lunch Break
14:00-16:00 Session 9A: Architecture and Networking
Klara Nahrstedt (University of Illinois at Urbana-Champaign, USA)
Location: Goya
George Michelogiannakis (LBNL, USA)
Jeremiah J. Wilke (Sandia national laboratories, USA)
Joseph P. Kenny (Sandia national laboratories, USA)
Samuel Knight (Sandia national laboratories, USA)
Khaled Ibrahim (LBNL, USA)
John Shalf (LBNL, USA)
APHiD: Hierarchical task placement to enable a tapered fat tree topology for lower power and cost in HPC networks

ABSTRACT. The power and procurement cost of bandwidth in system-wide networks has forced a steady drop in the byte/flop ratio. This trend of computation becoming faster relative to the network is expected to hold. In this paper, we explore how cost-oriented task placement enables reducing the cost of system-wide networks by enabling high performance even on tapered topologies where more bandwidth is provisioned at lower levels. We describe APHiD, an efficient hierarchical placement algorithm that uses new techniques to improve the quality of heuristic solutions and reduces the demand on high-level, expensive bandwidth in hierarchical topologies. We apply APHiD to a tapered fat-tree topology, demonstrating that APHiD maintains application scalability even for severely tapered network configurations. Using simulation, we show the for tapered networks APHiD improves performance by more than 50% over random placement and even 15% in some cases over costlier, state-of-the-art placement algorithms.

Shashank Gugnani (The Ohio State University, India)
Xiaoyi Lu (The Ohio State University, USA)
Dhabaleswar Panda (The Ohio State University, USA)
Swift-X: Accelerating OpenStack Swift with RDMA for Building an Efficient HPC Cloud

ABSTRACT. Running Big Data applications in the cloud has become extremely popular in recent times. To enable the storage of data for these applications, cloud-based distributed storage solutions are a must. OpenStack Swift is an object storage service which is widely used for such purposes. Swift is one of the main components of the OpenStack software package. Although Swift has become extremely popular in recent times, its proxy server based design limits the overall throughput and scalability of the cluster. Swift is based on the traditional TCP/IP sockets based communication which has known performance issues such as context-switch and buffer copies for each message transfer. Modern high-performance interconnects such as InfiniBand and RoCE offer advanced features such as RDMA and provide high bandwidth and low latency communication. In this paper, we propose two new designs to improve the performance and scalability of Swift. We propose changes to the Swift architecture and operation design. We propose high-performance implementations of network communication and I/O modules based on RDMA to provide the fastest possible object transfer. In addition, we use efficient hashing algorithms to accelerate object verification in Swift. Experimental evaluations with microbenchmarks, Swift stack benchmark (ssbench), and synthetic application workloads reveal upto 2x and 7.3x performance improvement with our two proposed designs for put and get operations. To the best of our knowledge, this is the first work towards accelerating OpenStack Swift with RDMA over high-performance interconnects in the literature.

Elena Agostini (La Sapienza, Italy)
Davide Rossetti (NVIDIA, USA)
Sreeram Potluri (NVIDIA, USA)
Offloading communication control logic in GPU accelerated applications

ABSTRACT. NVIDIA GPUDirect is a family of technologies aimed at optimizing data movement among GPUs (P2P) or between GPUs and third-party devices (RDMA). GPUDirect Async, introduced in CUDA 8.0, is a new addition which allows direct synchronization between GPU and third party devices. For example, Async allows an NVIDIA GPU to directly trigger and poll for completion of communication operations queued to an InfiniBand Connect-IB network adapter, removing involvement of CPU from critical communication path of GPU applications. In this paper we describe the motivations, supported usage models and the building blocks of GPUDirect Async. We then present a performance evaluation using a micro-benchmark and a synthetic stencil benchmark. Finally, we demonstrate the use of Async in a several number of apps: HPGMG-FV, a proxy for real-world geometric multi-grid applications, achieving up to 25% improvement in the total execution time; CoMD-CUDA, a proxy for Classical Molecular Dynamics codes, reducing communications times up to 30%; LULESH2-CUDA, a proxy for hydrodynamics equations, with an average performance improvement of 12% in the total execution time.

Noah Wolfe (Rensselaer Polytechnic Institute, USA)
Misbah Mubarak (Argonne National Labs, USA)
Nikhil Jain (Lawrence Livermore National Laboratory, USA)
Jens Domke (Technische Universität Dresden, Germany)
Abhinav Bhatele (Lawrence Livermore National Laboratory, USA)
Chris Carothers (RPI, USA)
Robert Ross (Argonne National Laboratory, USA)
Preliminary Performance Analysis of Multi-rail Fat-tree Networks

ABSTRACT. Among the low-diameter and high-radix networks being deployed in next-generation HPC systems, dual-rail fattree networks, such as the one being proposed for Summit at Oak Ridge National Laboratory, are a promising approach. While multi-rail fat-tree networks alleviate communication bottlenecks arising from increasing processor speeds, it opens new design considerations, such as routing choices, job placements, and scalability of rails. We extend our fat-tree model in the CODES parallel simulation framework to support multi-rail configurations and static routings of all types, resulting in a complex research vehicle for fat-tree system analysis. Our detailed packet-level simulations use communication traces from real applications on a Summit-like fat-tree network to make performance predictions in three areas: (1) performance comparison of single- and multi-rail networks, (2) finding efficient job placements and injection policies for multi-rails and studying their impact in conjunction with the intra-rail routing schemes, and (3) we perform a scaling study where we vary the number of application ranks per node.

Alireza Ranjbar (Ericsson Research, Finland)
Miika Komu (Ericsson Research, Finland)
Patrik Salmela (Ericsson Research, Finland)
Tuomas Aura (Aalto University, Finland)
SynAPTIC: Secure And Persistent connecTIvity for Containers

ABSTRACT. Cloud virtualization technology is shifting towards light-weight containers, which provide isolated environments for running cloud-based services. The emerging trends such as container-based micro-service architectures and hybrid cloud deployments result in increased traffic volumes between the micro-services, mobility of the communication endpoints, and some of the communication taking place over untrusted networks. Yet, the services are typically designed with the assumption of scalable, persistent and secure connectivity. In this paper, we present the SynAPTIC architecture, which enables secure and persistent connectivity between mobile containers, especially in the hybrid cloud and in multi-tenant cloud networks. The solution is based on the standardized Host Identity Protocol (HIP) that tenants can deploy on top of existing cloud infrastructure independently of their cloud provider. Optional cloud-provider extensions based on Software-Defined Networking (SDN) further optimize the networking architecture. Our qualitative and quantitative evaluation shows that SynAPTIC performs better than some of the existing solutions.

14:00-16:00 Session 9B: Performance Modeling and Evaluation (I)
Laurent Lefevre (INRIA, France)
Location: Velazquez
Abhinandan Sridhara Rao Prasad (University of Goettingen, Germany)
David Koll (University of Goettingen, Germany)
Jesus Omana Iglesias (Nokia Bell Labs, Ireland)
Jordi Arjona Aroca (Nokia Bell Labs, Ireland)
Volker Hilt (Nokia Bell Labs, Germany)
Xiaoming Fu (University of Goettingen, Germany)
Optimal Resource Configuration of Complex Services in the Cloud

ABSTRACT. Virtualization helps to deploy the functionality of expensive and rigid hardware appliances on scalable virtual resources running on commodity servers. However, optimal resource provisioning for non-trivial services is still an open problem. While there have been efforts to answer the questions of when provision additional resources in a running service, and how many resources are needed, the question of what should be provisioned has not been investigated yet. In particular for complex applications or services, which consist of a set of connected components, where each component in turn potentially consists of multiple component instances (e.g., VMs or containers). Each instance of a component can be run in different resource configurations or flavors (i.e., number of cores or amount of memory), while the service constructed by the combination of these resource configurations must satisfy the customer Service Level Objective (SLO). In this work, we offer to service providers an answer to the what to deploy question by introducing RConf, a system that automatically chooses the optimal combination of component instance configurations for non-trivial network services. In particular, we propose an analytical model based robust queuing theory that is able to accurately model arbitrary components, and develop an algorithm that finds the combination of these instances, such that the overall utilization the running instances is maximized while meeting SLO requirements.

Anchen Chai (Universite de Lyon, France)
Mohammad Bazm (Insa de Lyon, France)
Sorina Camarasu Pop (CREATIS - CNRS, France)
Tristan Glatard (Concordia University, Canada)
Hugues Benoit-Cattin (CREATIS - INSA de Lyon, France)
Frederic Suter (CC IN2P3 / CNRS, France)
Modeling Distributed Platforms from Application Traces for Realistic File Transfer Simulation

ABSTRACT. Simulation is a fast, controlled, and reproducible way to evaluate new algorithms for distributed computing platforms in a variety of conditions. However, the realism of simulations is rarely assessed, which critically questions the applicability of a whole range of findings.

In this paper we present our efforts to build platform models from application traces, to allow for the accurate simulation of file transfers across a distributed infrastructure. File transfers are key to performance, as the variability of file transfer times has important consequences on the dataflow of the application. We present a methodology to build realistic platform models from application traces and provide a quantitative evaluation of the accuracy of the derived simulations. Results show that the proposed models are able to correctly capture real-life variability and significantly outperform the state-of-the-art model.

Zengxiang Li (Institute of High Performance Computing Agency for Science, Technology and Research Singapore, Singapore)
Bowen Zhang (National University of Singapore, Singapore)
Shen Ren (Institute of High Performance Computing Agency for Science, Technology and Research Singapore, Singapore)
Yong Liu (IHPC, Singapore)
Zheng Qin (Institute of High Performance Computing, Singapore)
Rick Siow Mong Goh (Institute of High Performance Computing, Singapore)
Mohan Gurusamy (National University of Singapore, Singapore)
Performance Modelling and Cost Effective Execution for Distributed Graph Processing on Configurable VMs

ABSTRACT. Graph Processing has been widely used to capture complex data dependency and uncover relationship insights. Due to the ever-growing graph scale and algorithm complexity, distributed graph processing has become more and more popular. In this paper, we investigate how the tradeoff between performance and cost can be achieved for large scale graph processing on configurable virtual machine (VMs). We analyze the system architecture and implementation details of a Pregel-like distributed graph processing framework and then develop a system-aware model to predict the execution time. Consequently, cost effective execution scenarios are recommended by selecting a certain number of VMs with specified capability subject to the predefined resource price and user preference. Experiments using synthetic and real world graphs have verified that system-aware model can achieve much higher prediction accuracy than popular machine-learning models. As a result, its recommended execution scenarios can obtain cost efficiency close to the optimal scenarios.

Hani Nemati (Polytechnique Montreal, Canada)
Suchakrapani Datt Sharma (Polytechnique Montreal, Canada)
Michel Dagenais (Polytechnique Montreal, Canada)
Fine-grained Nested Virtual Machine Performance Analysis Through First Level Hypervisor Tracing

ABSTRACT. The overhead and complexity of virtualization can be decreased by the introduction of hardware-assisted virtualization. As a result, the ability to pool resources as Virtual Machines (VMs) on a physical host increases. In nested virtualization, a VM can in turn support the execution of one or more VM(s). Nowadays, nested VMs are often being used to address compatibility issues, security concerns, software scaling and continuous integration scenarios. With the increased adoption of nested VMs, there is a need for newer techniques to troubleshoot any unexpected behavior. Because of privacy and security issues, ease of deployment and execution overhead, these investigation techniques should preferably limit their data collection in most cases to the physical host level, without internal access to the VMs.

This paper introduces the Nested Virtual Machine Detection Algorithm (NDA) - a host hypervisor based analysis method which can investigate the performance of nested VMs. NDA can uncover the CPU overhead entailed by the host hypervisor and guest hypervisors, and compare it to the CPU usage of Nested VMs. We also developed a nested VM process state analyzer to detect the state of processes along with the reason for being in that state. We further developed several graphical views, for the TraceCompass trace visualization tool, to display the virtual CPUs of VMs and their corresponding nested VMs, along with their states. We investigate wake-up latency for nested VM along with added latency in different levels of virtualization. These approaches are based on host hypervisor tracing, which brings a lower overhead (around 1%) as compared to other approaches. Based on our analysis and the implemented graphical views, our techniques can quickly detect different problems and their root causes, such as unexpected delays inside nested VMs.

14:00-16:00 Session 9C: Scheduling and Resource Management (II)
Daniele Lezzi (Barcelona Supercomputing Center (BSC-CNS), Spain)
Location: Serrano
Giorgio Lucarelli (INRIA Grenoble Rhône-Alpes, France)
Fernando Mendonca (INRIA, France)
Denis Trystram (Grenoble university, France)
A new on-line method for scheduling independent tasks

ABSTRACT. We present a new method for scheduling independent tasks on a parallel machine composed of identical processors. This problem has been studied extensively for a long time with many variants. We are interested here in designing a generic algorithm in the on-line non-preemptive setting whose performance is good for various objectives. The basic idea of this algorithm is to detect some problematic tasks that are responsible for the delay of other shorter tasks. Then the former tasks are redirected to be executed in a dedicated part of the machine. We show through an extensive experimental campaign that this method is effective and in most cases is closer to some standard lower bounds than the base-line method for the problem.

Nishant Saurabh (University of Innsbruck, Austria)
Dragi Kimovski (University of Innsbruck, Austria)
Francesco Gaetano (University of Innsbruck, Austria)
Radu Prodan (University of Innsbruck, Austria)
A Two-Stage Multi-Objective Optimization of Erasure Coding in Overlay Networks

ABSTRACT. In the recent years, overlay networks have emerged as a crucial platform for deployment of various distributed applications. Many of these applications rely on data redundancy techniques, such as erasure coding, to achieve higher fault tolerance. However, erasure coding applied in large scale overlay networks entails various overheads in terms of storage, latency and data rebuilding costs. These overheads are largely attributed to the selected erasure coding scheme and the encoded chunk placement in the overlay network. This paper explores a multi-objective optimization approach for identifying appropriate erasure coding schemes and encoded chunk placement in overlay networks. The uniqueness of our approach lies in the consideration of multiple erasure coding objectives such as encoding rate and redundancy factor, with overlay network performance characteristics like storage consumption, latency and system reliability. Our approach enables a variety of tradeoff solutions with respect to these objectives to be identified in the form of a Pareto front. To solve this problem, we propose a novel two stage multi-objective evolutionary algorithm, where the first stage determines the optimal set of encoding schemes, while the second stage optimizes placement of the corresponding encoded data chunks in overlay networks of varying sizes. We study the performance of our method by generating and analyzing the Pareto optimal sets of tradeoff solutions. Experimental results demonstrate that the Pareto optimal set produced by our multi-objective approach includes and even dominates the chunk placements delivered by a related state-of-the-art weighted sum method.

Marcus Carvalho (Universidade Federal da Paraiba, Brazil)
Francisco Brasileiro (Universidade Federal de Campina Grande, Brazil)
Raquel Lopes (Universidade Federal de Campina Grande, Brazil)
Giovanni Farias (Universidade Federal de Campina Grande, Brazil)
Alessandro Fook (Universidade Federal de Campina Grande, Brazil)
João Mafra (Universidade Federal de Campina Grande, Brazil)
Daniel Turull (Ericsson Research, Sweden)
Multi-dimensional admission control and capacity planning for IaaS clouds with multiple service classes

ABSTRACT. Infrastructure as a Service (IaaS) providers typically offer multiple service classes to deal with the wide variety of users adopting this cloud computing model. In this scenario, IaaS providers need to perform efficient admission control and capacity planning in order to minimize infrastructure costs, while fulfilling the different Service Level Objectives (SLOs) defined for all service classes offered. However, most of the previous work on this field consider a single resource dimension -- typically CPU -- when making such management decisions. We show that this approach will either increase infrastructure costs due to over-provisioning, or violate SLOs due to lack of capacity for the resource dimensions being ignored. To fill this gap, we propose admission control and capacity planning methods that consider multiple service classes and multiple resource dimensions. Our results show that our admission control method can guarantee a high availability SLO fulfillment in scenarios where both CPU and memory can become the bottleneck resource. Moreover, we show that our capacity planning method can find the minimum capacity required for both CPU and memory to meet SLOs with good accuracy. We also analyze how the load variation on one resource dimension can affect another, highlighting the need to manage resources for multiple dimensions simultaneously.

Robayet Nasim (Karlstad University, Sweden)
Andreas Kassler (Karlstad University, Sweden)
A Robust Tabu Search Heuristic for VM Consolidation under Demand Uncertainity in Virtualized Datacenters

ABSTRACT. In virtualized datacenters (vDCs), dynamic consolidation of virtual machines (VMs) is used as one of the most common techniques to achieve both energy- and resource- utilization efficiency. Live migrations of VMs are used for dynamic consolidation but due to dynamic resource demand variation of VMs may lead to frequent and non-optimal migrations. Assuming deterministic workload of the VMs may ensure the most energy/resource-efficient VM allocations but eventually may lead to significant resource contention or under-utilization if the workload varies significantly over time. On the other hand, adopting a conservative approach by allocating VMs depending on their peak demand may lead to low utilization, if the peak occurs infrequently or for a short period of time. Therefore, in this work we design a robust VM migration scheme that strikes a balance between protection for resource contention and additional energy costs due to powering on more servers while considering uncertainties on VMs resource demands. We use the theory of $\Gamma$ robustness and derive a robust Mixed Integer Linear programming (MILP) formulation. Due to the complexity, the problem is hard to solve for online optimization and we propose a novel heuristic based on Tabu search. Using several scenarios, we show that that the proposed heuristic can achieve near optimal solution qualities in a short time and scales well with the instance sizes. Moreover, we quantitatively analyze the trade-off between energy cost versus protection level and robustness.

16:00-16:30Coffee Break
16:30-18:00 Session 10A: Applications and Big Data (II)
Marco Frincu (West University of Timisoara, Romania)
Location: Goya
Long Cheng (Eindhoven University of Technology, Netherlands)
Boudewijn van Dongen (Eindhoven University of Technology, Netherlands)
Wil van der Aalst (Eindhoven University of Technology, Netherlands)
Efficient Event Correlation over Distributed Systems

ABSTRACT. Event correlation is a cornerstone for process discovery over event logs crossing multiple data sources. The computed correlation rules and process instances will greatly help us to unleash the power of process mining. However, exploring all possible event correlations over a log could be time consuming, especially when the log is large. State-of-the-art methods based on MapReduce designed to handle this challenge have offered significant performance improvements over standalone implementations. Regardless, all the techniques are still based on a conventional generating-and-pruning scheme, which could result in event partitioning across multiple machines being inefficient. In this paper, following the principle of filtering-and-verification, we propose a new algorithm, called RF-GraP, which targets for efficient correlation over distributed systems. We present the detailed implementation of our approach and conduct a quantitative evaluation using the Spark platform. Experimental results demonstrate that the proposed method is indeed efficient. Compared to the state-of-the-art, we are able to achieve significant performance speedups with obviously less network communication.

Li-Yung Ho (Academia Sinica, Taiwan)
Jan-Jan Wu (Institute of Information Science, Academia Sinica, Taiwan)
Pangfeng Liu (Department of Computer Science and Information Engineering, National Taiwan University, Taiwan)
Chia-Chun Shih (Chunghwa Telecom Laboratories, Taiwan)
Chi-Chang Huang (Chunghwa Telecom Laboratories, Taiwan)
Chao-Wen Huang (Chunghwa Telecom Laboratories, Taiwan)
Efficient Cache Update for In-Memory Cluster Computing with Spark

ABSTRACT. This paper proposes a scalable and efficient cache update technique to improve the performance of in-memory cluster computing in Spark, a popular open-source system for big data computing. Although the memory cache speeds up data processing in Spark, its data immutability constraint requires reloading the whole RDD when part of its data is updated. Such constraint makes the RDD update inefficient. To address this problem, we divide an RDD into partitions, and propose the partial-update RDD (PRDD) method to enable users to replace individual partition(s) of an RDD. We devise two solutions to the RDD partition problem – a dynamic programming algorithm and a nonlinear programming method. Experiment results suggest that, PRDD achieves 4.32x speedup when compared with the original RDD in Spark. We apply PRDD to a billing system for Chunghwa Telecomm, the largest telecommunication company in Taiwan. Our result shows that the PRDD based billing system outperforms the original billing system in CHT by a factor of 24x in throughput. We also evaluate PRDD using the TPC-H benchmark, which also yields promising result.

Sumin Hong (Ulsan National Institute of Science and Technology, South Korea)
Woohyuk Choi (Ulsan National Institute of Science and Technology, South Korea)
Won-Ki Jeong (Ulsan National Institute of Science and Technology, South Korea)
GPU in-memory processing using Spark for iterative computation

ABSTRACT. Due to its simplicity and scalability, MapReduce has become a de facto standard computing model for big data processing. Since the original MapReduce model was only appropriate for embarrassingly parallel batch processing, many follow-up studies have focused on improving the efficiency and performance of the model. Spark follows one of these recent trends by providing in-memory processing capability to reduce slow disk I/O for iterative computing tasks. However, the acceleration of Spark's in-memory processing using graphics processing units (GPUs) is challenging due to its deep memory hierarchy and host-to-GPU communication overhead. In this paper, we introduce a novel GPU-accelerated MapReduce framework that extends Spark's in-memory processing so that iterative computing is performed only in the GPU memory. Having discovered that the main bottleneck in the current Spark system for GPU computing is data communication on a Java virtual machine, we propose a modification of the current Spark implementation to bypass expensive data management for iterative task offloading to GPUs. We also propose a novel GPU in-memory processing and caching framework that minimizes host-to-GPU communication via lazy evaluation and reuses GPU memory over multiple mapper executions. The proposed system employs message-passing interface (MPI)-based data synchronization for inter-worker communication so that more complicated iterative computing tasks, such as iterative numerical solvers, can be efficiently handled. We demonstrate the performance of our system in terms of several iterative computing tasks in big data processing applications, including machine learning and scientific computing.

16:30-18:00 Session 10B: Programming Models and Runtime Systems (I)
Taisuke Boku (University of Tsukuba, Japan)
Location: Velazquez
Emiliano Silvestri (DIAG - Sapienza Universita' di Roma, Italy)
Simone Economo (Sapienza Università di Roma, Italy)
Pierangelo Di Sanzo (DIAG, Sapienza University of Rome, Italy)
Alessandro Pellegrini (Sapienza University of Roma, Italy)
Francesco Quaglia (Sapienza Universit�di Roma, Italy)
Preemptive Software Transactional Memory

ABSTRACT. In state of the art Software Transactional Memory (STM) systems, threads carry out the execution of transactions as non-interruptible tasks. Hence a thread can react to the injection of a higher priority transactional task and take care of its processing only at the end of the currently executed transaction. In this article we pursue a paradigm shift where the execution of an in-memory transaction is carried out as a preemptable task, so that a thread can start processing a higher priority transactional task before finalizing its current transaction. We achieve this goal in an application-transparent manner, by only relying on innovative operating system facilities we include in our preemptive STM architecture. With our approach we are able to reevaluate CPU assignment across transactions along a same thread with period of the order of few tens of microseconds. This is mandatory for an effective priority management architecture given the typically finergrain nature of in-memory transactions compared to their counterpart in database systems. We integrated our preemptive STM architecture within the TinySTM package, and released it as open source. We also provide the results of an experimental assessment of our proposal based on running a port of the TPC-C benchmark to the STM environment.

Jiaqi Liu (The Ohio State University, USA)
Gagan Agrawal (The Ohio State University, USA)
Supporting Fault-Tolerance in Presence of In-Situ Analytics

ABSTRACT. In-situ analytics have recently emerged as a promising approach to reduce I/O, network, and storage congestion for scientific data analysis. At the same time, Mean Time To Failure (MTTF) has also continuously decreased for HPC clusters, making failures during the execution of a simulation more likely. It is difficult to apply Checkpointing and Restart (C/R) to a simulation executed with in-situ analytics, because we not only need to store analytic and simulation states, but also need to ensure consistency between them.

To solve the above problem, we present a novel solution to apply C/R approach. The solution exploits a common property of analytics applications, which is that they fit the reduction-style processing pattern. Based on the observation that the global reduction result is correct if all local reduction objects have been processed, only global reduction objects and progress of the reduction procedure is checkpointed, together with an application-level checkpoint of the simulation. We also discuss the specific details for handling time-sharing and space sharing modes, which are suitable for multi-core and many-core environments, respectively.

We have evaluated our approach in the context of Smart middleware for developing in-situ analytics, which provides a Map-Reduce like API. We demonstrate the low overheads of our approach with different scientific simulations and analytics tasks.

Hoang-Vu Dang (University of Illinois at Urbana-Champaign, USA)
Sangmin Seo (Argonne National Laboratory, USA)
Abdelhalim Amer (Argonne National Laboratory, USA)
Pavan Balaji (Argonne National Laboratory, USA)
Advanced Thread Synchronization for Multithreaded MPI Implementations

ABSTRACT. Concurrent multithreaded access to the Message Passing Interface (MPI) is gaining importance to support emerging hybrid MPI applications. Efficient threading support, however, is nontrivial and demands thread synchronization methods that adapt to MPI runtime workloads. Prior studies showed that threads in waiting states, which occurs in routines with blocking semantics, often interfere with other threads (active) and degrade their progress. This occurs when both classes of threads compete for the same resource, and ownership passing to threads in waiting states does not guarantee communication to advance. The best known practical solution tackled the challenge by prioritizing active threads and adapting first-in-first-out arbitration within each class. This approach, however, suffers from lingering wasted resource acquisitions (waste), ignores data locality, and thus results in poor scalability.

In this work, we propose thread synchronization improvements to eliminate waste while preserving data locality in a production MPI implementation. First, we leverage the MPI runtime knowledge and a fast selective wakeup method to reactivate waiting threads, thus, accelerating progress and eliminating waste. Second, we rely on a cooperative progress model that dynamically elects and restricts a single waiting thread to drive a communication context for improved data locality. Finally, active threads are prioritized and synchronized with a locality-preserving lock that is hierarchical and exploits unbounded bias for high throughput. Results show significant improvement in synthetic microbenchmarks and two MPI+OpenMP applications.

16:30-18:00 Session 10C: Storage and I/O (I)
Hong Jiang (University of Texas Arlington, USA)
Location: Serrano
Jiang Zhou (Texas Tech University, USA)
Wei Xie (Texas Tech University, USA)
Dong Dai (Texas Tech University, USA)
Yong Chen (Texas Tech University, USA)
Pattern-Directed Replication Scheme for Heterogeneous Object-based Storage

ABSTRACT. Data replication is a key technique to achieve data availability, reliability, and optimized performance in distributed storage systems and data centers. In recent years, with the emergence of new storage devices, heterogeneous objectbased storage systems, such as a storage system with the coexistence of hard disk drives and solid state drives, have become increasingly attractive as they combine merits of different storage devices to deliver better promise. However, existing data replication schemes do not well distinguish heterogeneous storage device characteristics, and they do not consider the impact of applications’ access patterns in a heterogeneous environment either, leading to suboptimal performance. This paper introduces a new data replication scheme called Patterndirected Replication Scheme (PRS) to achieve efficient data replication for heterogeneous storage systems. Different from traditional schemes, the PRS selectively replicates data objects and distributes replicas to different storage devices according to their features. It reorganizes objects with object distance calculation and arranges replicas according to application’s data access pattern identified. In addition, the PRS uses a pseudo random algorithm to optimize replica layout by considering the storage device performance and capacity features. We have evaluated the pattern-directed replication scheme with extensive tests in Sheepdog, a widely used object-based storage system. The experimental results confirm that it is a highly efficient replication scheme for heterogeneous storage systems. For instance, the read performance was improved by 105% to nearly 10x compared with existing replication schemes.

Jiwoong Park (Seoul National University, South Korea)
Cheolgi Min (Seoul National University, South Korea)
Heonyoung Yeom (Seoul National University, South Korea)
A New File System I/O Mode for Efficient User-level Caching

ABSTRACT. Abstract—A large number of cloud datastores have been developed to handle the cloud OLTP workload. Double caching problem where the same data resides both at the user buffer and the kernel buffer has been identified as one of the problems and has been largely solved by using direct I/O mode to bypass the kernel buffer. However, maintaining the caching layer only in user-level has the disadvantage that the user process may monopolize memory resources and that it is difficult to fully utilize the system memory due to the risks of the forced termination of the process or the unpredictable performance degradation in case of memory pressure.

In this paper, we propose a new I/O mode, DBIO, to efficiently exploit OS kernel buffer as a victim cache for user- level file content cache, enjoying the strengths of kernel-level cache rather than just skipping it. DBIO provides the extended file read/write system calls, which enable user programs to dynamically choose the right I/O behavior based on their context when issuing I/Os instead of when opening the file. On the various OLTP benchmarks with the modified version of MySQL/InnoDB, DBIO improves the in-memory cache hit ratio and the transaction performance compared to both buffered and direct I/O mode, fully utilizing the user buffer and the kernel buffer without double caching.

Djillali Boukhelef (USTHB University, Algeria)
Kamel Boukhalfa (USTHB University, Algeria)
Jalil Boukhobza (Université Bretagne Occidentale, France)
Hamza Ouarnoughi (Université Bretagne Occidentale, France)
Laurent Lemarchand (Université de Bretagne Occidentale, France)
COPS: Cost Based Object Placement Strategies on Hybrid Storage System for DBaaS Cloud

ABSTRACT. Solid State Drive (SSD) became a must-have technology in a Cloud context. However, it is too expensive to replace magnetic drives. As a consequence, they are used together with Hard Disk Drive (HDD) in Hybrid Storage Systems (HSS), and when it comes to storing data, some placement strategies are used to find the best location (SSD or HDD). While for many applications, those strategies need to be I/O performance driven, in Cloud context, they must be cost driven: minimize the cost of data placement while satisfying Service Level Objectives. This paper presents two Cost based Object Placement Strategies (COPS) for DBaaS objects in HSS: a genetic based approach (G-COPS) and an ad-hoc heuristic approach (H-COPS) based on incremental optimization. Both algorithms were tested for small and large instance problems. While G-COPS proved to be closer to the optimal solution in case of small instance problems, H-COPS showed a better scalability as it approached the exact solution even for large instance problems (by 10% in average). In addition, H-COPS showed execution times of ~1s for up to one thousand (1000) objects making it good candidate to be used in runtime. Both H-COPS and G-COPS performed better than state-of-the-art solutions as they satisfied SLOs while reducing the overall cost by 60% for both small and large instance problems.

19:00-21:00 Session 11: Welcome Cocktail & Poster session
Richard Knepper (Indiana University, USA)
Eric Coulter (IU, USA)
Sudhakar Pamidighantam (Indiana University, USA)
Using the Jetstream Research Cloud to provide Science Gateway resources

ABSTRACT. We describe the use of the Jetstream research-cloud, a purpose-built system with the goal of supporting “long-tail” research by providing a flexible, on-demand research infrastruc- ture, to provide scalable back-end resources for science gateways. In addition to providing cloud-like resources for on-demand science, Jetstream offers the capability to instantiate long-running clusters which support science gateways. Science gateways are web-based systems built on computational infrastructure which provide commonly-used tools to a community of users. We created a persistent cluster on the Jetstream system which is connected to the SEAGrid science gateway and provides additional compute resources for a variety of quantum chem- istry calculations. We discuss the further application of toolkits provided by the Extreme Science and Engineering Discovery Environment (XSEDE) to build general-purpose clusters on the research cloud.

Walter Dos Santos Filho (Universidade Federal de Minas Gerais, Brazil)
Luiz Fernando Carvalho (UFMG, Brazil)
Gustavo Paula Avelar (UFMG, Brazil)
Atila Silva Jr (UFMG, Brazil)
Lucas Ponce (UFMG, Brazil)
Dorgival Guedes (UFMG, Brazil)
Wagner Meira Jr. (UFMG, Brazil)
Lemonade: A scalable and efficient Spark-based platform for data analytics

ABSTRACT. Data Analytics is a concept related to pattern and relevant knowledge discovery from large amounts of data. In general, the task is complex and demands knowledge in very specific areas, such as massive data processing and parallel programming languages. However, analysts are usually not versed in Computer Science, but in the original data domain. In order to support them in such analysis, we present Lemonade --- Live Exploration and Mining Of a Non-trivial Amount of Data from Everywhere --- a plataform for visual creation and execution of data analysis workflows. Lemonade encapsulates storage and data processing environment details, providing higher-level abstractions for data source access and algorithms coding. The goal is to enable batch and interactive execution of data analysis tasks, from basic ETL to complex data mining algorithms, in parallel, in a distributed environment. The current version supports HDFS, local filesystems and distributed environments such as Apache Spark, the state-of-art framework for Big Data analysis.

Ainhoa Azqueta-Alzúaz (Universidad Politécnica de Madrid, Spain)
Marta Patiño-Martinez (Universidad Politécnica de Madrid, Spain)
Ivan Brondino (LeanXcale, Spain)
Ricardo Jimenez-Peris (LeanXcale, Spain)
Massive Data Load on Distributed Database Systems over HBase

ABSTRACT. Big Data has become a pervasive technology to manage the ever-increasing volumes of data. Among Big Data solutions, scalable data stores play an important role, especially, key-value data stores due to their large scalability (thousands of nodes). The typical workflow for Big Data applications include two phases. The first one is to load the data into the data store typically as part of an ETL (Extract-Transform-Load) process. The second one is the processing of the data itself. BigTable and HBase are the preferred key-value solutions based on hash-partitioned data stores. However, they suffer from an important efficiently issue in the loading phase that creates a single node bottleneck. In this paper, we identify and quantify this bottleneck and propose a tool for parallel massive data loading that solves satisfactorily the bottleneck enabling all the parallelism and associated throughput of the underlying key-value data store during the loading phase as well. The proposed solution has been implemented as a tool for parallel massive data loading over HBase, the key-value data store of the Hadoop ecosystem.

Norman Lim (Carleton University, Canada)
Shikharesh Majumdar (Carleton Univeristy, Canada)
Peter Ashwood-Smith (Huawei Technologies Canada, Canada)
Techniques for Handling Error in User-estimated Execution Times During Resource Management on Systems Processing MapReduce Jobs

ABSTRACT. In our previous work, we described a resource allocation and scheduling technique for processing an open stream of MapReduce jobs with SLAs (characterized by an earliest start time, an execution time, and a deadline) called the Hadoop Constraint Programming based Resource Management technique (HCP-RM). Since the user-estimated job execution times are used to perform resource allocation and scheduling, error/inaccuracies in the execution times can hinder the ability of HCP-RM from making effective scheduling decisions, leading to a degradation in system performance. This paper focuses on improving the robustness of HCP-RM by introducing a mechanism to handle errors/inaccuracies in user estimates of job execution times that are submitted as part of the job’s SLA. A Prescheduling Error Handling technique (PSEH) is devised to adjust the user-estimated execution times of the jobs to make them more accurate before they are used by the resource management algorithm. Results of experiments conducted on a Hadoop cluster deployed on Amazon EC2 demonstrate the effectiveness of the PSEH technique in improving system performance.

Somayeh Sobati (LyonII, France)
Privacy Preserving Data Outsourcing via Homomorphic Secret Splitting Schemes

ABSTRACT. Data confidentiality is a major concern in cloud outsourcing scenarios due to the untrustworthiness of the cloud service providers. In this paper, we propose a lightweight additive homomorphic scheme inspired by Shamir’s secret sharing scheme to preserve the privacy of outsourced data. We split sensitive values into several meaningless splits and store them at a cloud service provider. Then, we extend the proposed scheme to preserve ordering and support different kinds of queries. The proposed schemes are extended with respect to the Paillier’s cryptosystem which is widely used scheme in the literature. We will deeply look at efficiency issues without sacrificing data privacy. The key contribution of our work is order-preserving property over our additive homomorphic scheme that improves the practicality of our schemes.

Sanna Aizad (University of Derby, UK)
Ashiq Anjum (University of Derby, UK)
Rizos Sakellariou (The University of Manchester, UK)
Representing Variant Calling Format as Directed Acyclic Graphs to enable the use of cloud computing for efficient and cost effective genome analysis

ABSTRACT. Since the Human Genome Project, the human genome has been represented as a sequence of 3.2 billion base pairs and is referred to as the “Reference Genome”. With advancements in technology, it has become easier to sequence genomes of individuals, creating a need to represent this information in a new way. Several attempts have been made to represent the genome sequence as a graph. The Variant Calling Format (VCF) file carries information about variations within genomes and is the primary format of choice for genome analysis tools. This short paper aims to motivate work in representing the VCF file as Directed Acyclic Graphs (DAGs) to run on a cloud in order to exploit the high performance capabilities provided by cloud computing.

Lorraine Herger (IBM Research, USA)
Carlos Fonseca (IBM, USA)
IBM Research Hybrid Cloud

ABSTRACT. The ability for IBM researchers and development teams to take an idea or business opportunity and quickly solution it by requesting resources from a self-service cloud portal or an application-programming interface, API, has drastically changed how IBM Research procures information technology resources. In this poster, we’ll present a brief history, the current IBM Research Hybrid cloud and our analysis supporting the need for Hybrid cloud.

Wang Binfeng (School of Computer, National University of Defense Technology, China)
Su Jinshu (Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, China)
EffiEye: Application-aware Large Flow Detection in Data Center

ABSTRACT. With the rapid development of cloud computing, thousands of servers and various cloud applications are involved in data center. These changes result in more and more complex flows in data center, which motivates the need for faster, lower overhead, more scalable large flow detection technology. This paper firstly shows the shortcomings of the traditional large flow detection technologies. Then it proposes a new method named EffiEye, which efficiently realizes application-aware large flow detection in the controller. EffiEye mainly replies on two different mechanisms: one is the flow classification based on the pre-classification of cloud applications in App Info module, which can ensure the fast detecting speed; the other is the flow-stat triggering supported by OpenFlow 1.5, which can ensure the high detecting accuracy.

Vamis Xhagjika (UPC Universitat Politècnica de Catalunya - KTH Royal Institute of Technology, Sweden)
Oscar Divorra Escoda (Tokbox Inc. a Telefonica company, Spain)
Leandro Navarro (UPC, Spain)
Vladimir Vlassov (Royal Institute of Technology (KTH), School for Information and Communication Technology, Sweden)
Load and Video Performance Patterns of a Cloud Based WebRTC Architecture

ABSTRACT. Web Real-Time Communication or Real-time communication in the Web (WebRTC/RTCWeb) is a prolific new standard and technology stack, providing full audio/videoagnostic communications for the Web. Service providers implementing such technology deal with various levels of complexity ranging anywhere from: high service distribution, multi-client integration, P2P and Cloud assisted communication backends, content delivery, real-time constraints and across clouds resource allocation. To the best of our knowledge, a study of the joint factors of multi-cloud distribution, network and media parameters as well as back-end resource loads and respective correlations as seen by passive user measurements, is necessary. These patterns are crucial in designing resource allocation algorithms and media Service Level Objectives (SLO) spanning over multiple data-centers or servers. Based on our analysis, we discover strong periodical load patterns even though the nature of user interaction with the system is mostly not predetermined with variable user churn.

Sergio Rivas-Gomez (KTH Royal Institute of Technology, Sweden)
Stefano Markidis (KTH Royal Institute of Technology, Sweden)
Ivy Bo Peng (KTH Royal Institute of Technology, Sweden)
Erwin Laure (KTH Royal Institute of Technology, Sweden)
Gokcen Kestor (Pacific Northwest National Laboratory, USA)
Roberto Gioiosa (Pacific Northwest National Laboratory, USA)
Extending Message Passing Interface Windows to Storage

ABSTRACT. This paper presents an extension to MPI supporting the one-sided communication model and window allocations in storage. Our design transparently integrates with the current MPI implementations, enabling applications to target MPI windows in storage, memory or both simultaneously, without major modifications. Initial performance results demonstrate that the presented MPI window extension could potentially be helpful for a wide-range of use-cases and with low-overhead.

Juan Fang (Beijing University of Technology, China)
Qingwen Fan (Beijing University of Technology, China)
Xiaoting Hao (Beijing University of Technology, China)
Yanjin Cheng (Beijing University of Technology, China)
Performance Optimization by Dynamicly Altering Cache Replacement Algorithm in CPU-GPU Heterogeneous Multi-Core Architecture

ABSTRACT. Cache is an important resource between cores in heterogeneous multi-core architecture which is the main factor that affects system performance and power consumption. The implementation algorithm of cache replacement in current heterogeneous multi-core environment is thread-blinded, leading to a lower utilization of the cache. In fact, each of the CPU and GPU applications has its own characteristics, where CPU is responsible for the implementation of tasks and serial logic control, while GPU has a great advantage in parallel computing, which causes the need of cache blocks for CPU more sensitive than those for GPU. With that in mind, this research gives full consideration to the increment of thread priority in the cache replacement algorithm and takes a novel strategy to improve the work efficiency of last-level-cache (LLC), where the CPU and GPU applications share LLC dynamically and not in an absolutely fair status. Furthermore, our methodology switches policies between the LRU and LFU effectively by comparing the number of cache misses on the LLC, which takes both the time and frequency of the accessing cache block into consideration. The experimental results indicate that this optimization method can effectively improve system performance.

Agostino Forestiero (ICAR-CNR, Italy)
Multi-agent recommendation system in Internet of Things

ABSTRACT. The Internet of Things (IoT) aims to bridge the gap between the physical and the cyber world to allow a deeper understanding of user preferences and behaviors. The interactions and relations between users and things need of an effective and efficient recommendation approaches to better meet users interests. Suggesting useful things in IoT environment is a very important task for many applications such as urban computing, smart cities, health care, etc., and it needs to be widely investigated. The goal of recommendation systems is to produce a set of significant suggestions for a user with given characteristics. In this paper, a multi-agent algorithm that, by exploiting of a decentralized and self organizing strategy, builds a distributed recommendation system in IoT environment, is proposed. Things are represented through bit vectors, the thing descriptors, obtained through a locality preserving hash function that maps similar things into similar bit vectors. Cyber agents manage the thing descriptors and exchange them on the basis of ad-hoc probability functions. The outcome is the emergence of an organized overlay-network of cyber agents that allows to obtain an efficient things recommender system. Preliminaries results confirm the validity of the approach.

Nikolaos Chalvantzis (National Technical University of Athens, Greece)
Ioannis Konstantinou (National Technical University of Athens, Greece)
Nectarios Koziris (National Technical University of Athens, Greece)
BBQ: Elastic MapReduce over Cloud Platforms

ABSTRACT. Cloud infrastructure services such as Amazon EMR allow users to have access to tailor-made Big Data processing clusters within a few clicks from their web browser, thanks to the elastic property of the cloud. In virtual cloud environments, resource management is desired to be performed in a way which optimizes utilization, thus maximizing the value of the resources acquired. As cloud infrastructures become increasingly popular for Big Data analysis, the execution of programs with respect to user selected performance goals, such as job completion deadlines, remains a challebge. In this work we present BARBECUE (joB AwaRe Big-data Elasticity CloUd managEment System), a system that allows a Hadoop MapReduce virtual cluster to automatically adjust its size to the workload it is required to execute in order to respect individual jobs’ completion deadlines without acquiring more resources than the least necessary. To that end we use a Performance Model for MapReduce jobs which can express cluster resources (i.e., YARN Container capacity) and execution time as a function of the number of nodes in the cluster. We also add a new feature to Hadoop MapReduce which can now dynamically, on-the-fly update the number of selected Reduce Tasks in cases where the cluster is expanded, so that our system makes full use of the resources it has acquired during the reduce phase of the execution. BBQ uses an adaptation of the hill climbing algorithm to estimate the optimal combination of number of nodes and reduce waves given a known job, its data input and an execution deadline. The attendees will be able to watch, in real-time, the system perform cluster resizes well in order to execute its assigned jobs in time.

Jinhong Zhou (University of Science and Technology of China, China)
Shaoli Liu (SKLCA, ICT, China)
Qi Guo (SKLCA, ICT, China)
Xuda Zhou (USTC, China)
Tian Zhi (SKLCA, ICT, China)
Daofu Liu (SKLCA, ICT, China)
Chao Wang (University of Science and Technology of China, China)
Xuehai Zhou (University of Science and Technology of China, China)
Yunji Chen (SKLCA, ICT, China)
Tianshi Chen (SKLCA, ICT, China)
TuNao: A High-Performance and Energy-Efficient Reconfigurable Accelerator for Graph Processing

ABSTRACT. Large-scale graph processing is now a crucial task of many commercial applications, and it is conventionally supported by general-purpose processors. These processors are designed to flexibly support highly diverse workloads with classic techniques such as on-chip cache and dynamic pipelining. Yet, it is difficult for the on-chip cache to exploit irregular data locality in large-scale graph processing, even though there are a few high-degree vertices that are frequently accessed in real-world graphs; it is not efficient to perform regular arithmetic operations via sophisticated dynamic pipelining. In short, general-purpose processors could not be the ideal platforms to graph processing.

In this paper, we design a reconfigurable graph processing accelerator, with the purpose of providing an energy-efficient and flexible hardware platform for large-scale graph processing. This accelerator features two main components, i.e., the on-chip storage to exploit the data locality of graph processing, and the reconfigurable functional units to adapt to diversified operations in different graph processing tasks. On a total of 36 practical graph processing tasks, we demonstrate that, on average, our accelerator design achieves 1.58x and 25.56x better performance and energy efficiency, respectively, than the GPU baseline.

Jinhong Zhou (University of Science and Technology of China, China)
Chongchong Xu (University of Science and Technology of China, China)
Xianglan Chen (University of Science and Technology of China, China)
Chao Wang (University of Science and Technology of China, China)
Xuehai Zhou (University of Science and Technology of China, China)
Mermaid: Integrating Vertex-Centric with Edge-Centric for Real-World Graph Processing

ABSTRACT. There has been increasing interests in processing large-scale real-world graphs, and recently many graph systems have been proposed. Vertex-centric GAS (Gather-Apply-Scatter) and Edge-centric GAS are two graph computation models being widely adopted, and existing graph analytics systems commonly follow only one computation model, which is not the best choice for real-world graph processing. In fact, vertex degrees in real-world graphs often obey skewed power-law distributions: most vertices have relatively few neighbors while a few have many neighbors. We observe that vertex-centric GAS for high-degree vertices and edge-centric GAS for low-degree vertices is a much better choice for real-world graph processing.

In this paper, we present Mermaid, a system for processing large-scale real-world graphs on a single machine. Mermaid skillfully integrates vertex-centric GAS with edge-centric GAS through a novel vertex-mapping mechanism, and supports streamlined graph processing. On a total of 6 practical natural graph processing tasks, we demonstrate that, on average, Mermaid achieves 1.83x better performance than the state-of-the-art graph system on a single machine.

Javier Prades (Universitat Politècnica de València, Spain)
Federico Silla (Technical University of Valencia, Spain)
A Live Demo for Showing the Benefits of Applying the Remote GPU Virtualization Technique to Cloud Computing

ABSTRACT. Cloud computing has become pervasive nowadays. Additionally, cloud computing customers increasingly demand the use of accelerators such as CUDA GPUs. This has motivated that Amazon, for example, provides virtual machine instances comprising up to 16 NVIDIA GPUs. However, the use of GPUs in cloud computing deployments is not exempt from important concerns. In order to overcome many of these concerns, the remote GPU virtualization technique can be used. In this paper we present the design of a live demo to be used in exhibitions in order to show the benefits of using such a virtualization technique in the context of cloud computing systems. The demo was designed to be technically sound at the same time that it draws the attention of attendees. The demo was successfully used in the recent SuperComputing'16 exhibition, attracting more than 100 people to the booth.

Ioannis Giannakopoulos (National Technical University of Athens, Greece)
Ioannis Konstantinou (National Technical University of Athens, Greece)
Dimitrios Tsoumakos (Department of Informatics, Ionian University, Greece)
Nectarios Koziris (National Technical University of Athens, Greece)
AURA: Recovering from Transient Failures in Cloud Deployments

ABSTRACT. In this work, we propose AURA, a cloud deployment tool used to deploy applications over providers that tend to present transient failures. The complexity of modern cloud environments imparts an error-prone behavior during the deployment phase of an application, something that hinders automation and magnifies costs both in terms of time and money. To overcome this challenge, we propose AURA, a framework that formulates an application deployment as a Directed Acyclic Graph traversal and re-executes the parts of the graph that failed. AURA achieves to execute any deployment script that updates filesystem related resources in an idempotent manner through the adoption of a layered filesystem technique. Our demonstration indicates that any application can be deployed with AURA, with minimum deployment script re-executions, even in the most unstable environments.

Bo Xu (University of Science and Technology of China, China)
Changlong Li (University of Science and Technology of China, China)
Hang Zhuang (University of Science and Technology of China, China)
Jiali Wang (University of Science and Technology of China, China)
Qingfeng Wang (University of Science and Technology of China, China)
Jinhong Zhou (University of Science and Technology of China, China)
Xuehai Zhou (University of Science and Technology of China, China)
DSA: Scalable Distributed Sequence Alignment System Using SIMD Instructions

ABSTRACT. Sequence alignment algorithms are a basic and critical component of many bioinformatics fields. With rapid development of sequencing technology, the fast growing reference database volumes and longer length of query sequence become new challenges for sequence alignment. However, the algorithm is prohibitively high in terms of time and space complexity. In this paper, we present DSA, a scalable distributed sequence alignment system that employs Spark to process sequences data in a horizontally scalable distributed environment, and leverages data parallel strategy based on Single Instruction Multiple Data (SIMD) instruction to parallelize the algorithm in each core of worker node. The experimental results demonstrate that 1) DSA has outstanding performance and achieves up to 201x speedup over SparkSW. 2) DSA has excellent scalability and achieves near linear speedup when increasing the number of nodes in cluster.