ICA3PP 2016: 16TH INTERNATIONAL CONFERENCE ON ALGORITHMS AND ARCHITECTURES FOR PARALLEL PROCESSING
PROGRAM FOR THURSDAY, DECEMBER 15TH


09:00-10:00 Session 6A: Quo Vadis Ubiquitous Computing?

Pedro José Marrón

  • Full Professor for Pervasive Computing at the University of Duisburg-Essen and Co-Founder of Locoslab GmbH

The world is changing at an extremely rapid pace and it seems impossible even for computer scientists to keep up with the evolution of technologies. Fifteen years ago, smart phones were just a dream and people were reluctant to go online for many things. Nowadays, everything seems to be moving to the virtual realm and, as of today, 3 billion people have regular access to the Internet and to communication technologies. This number is equal to the world population in 1967. In this talk, we will look back at some of the predictions about the future of computing made by “experts” over the years and will analyze the state of Ubiquitous Computing technologies using examples from current research projects, not with virtual entities, but with real people in real cities.

Prof. Dr. Pedro José MARRÓN received his bachelor's and master's degrees in computer engineering from the University of Michigan in Ann Arbor in 1996 and 1998. At the end of 1999, he moved to the University of Freiburg in Germany to work on his Ph.D., which he received with honors in 2001. From 2003 until 2007, he worked at the University of Stuttgart as a senior researcher, leading the mobile data management and sensor network group. In 2007, he left Stuttgart to become a Professor of Computer Science at the University of Bonn, where he led the sensor networks and pervasive computing group. In 2009 he left Bonn to become a full Professor at the University of Duisburg-Essen. He is currently head of the “Networked Embedded Systems Group” (NES), which comprises almost 20 researchers working on fields related to ubiquitous computing. He is also co-founder of Locoslab GmbH, a spin-off of the University of Duisburg-Essen specialized in providing complete solutions for location-based services. Additionally, Pedro Marrón is the initiator and president of UBICITEC, the European Center for Ubiquitous Technologies and Smart Cities, which brings together over 20 institutional partners from industry and academia to form a virtual European Center with clear research and dissemination objectives. The goal of UBICITEC is to coordinate research efforts on enabling technologies for Smart Cities, e.g. the Internet of Things, and to encourage the transfer of technology to industry.

Location: Room Alejandra
09:00-10:00 Session 6B: En2ESO (I)
Location: Room Beatriz III
09:00
5G-XHaul: Enabling scalable virtualization for future 5G Transport Networks
SPEAKER: Daniel Camps

ABSTRACT. Network slicing is a major trend in the design of future 5G networks that will enable operators to effectively service multiple industry verticals with a single network infrastructure. Thus, network slicing will shape all segments of the future 5G networks, including the radio access, the transport network and the core network. In this paper we introduce the control plane design produced by the 5G-XHaul project. 5G-XHaul envisions a future 5G transport network composed of heterogeneous technology domains, including wireless and optical segments, which will be able to transport end user and operational services. Consequently, 5G-XHaul proposes a hierarchical SDN control plane where each controller is responsible for a limited network domain, and proposes a multi-technology virtualization framework that enables a scalable slicing of the transport network by operating at the edge of the network.

09:30
Generalized Orchestration of IT/Cloud and Networks for SDN/NFV 5G Services

ABSTRACT. It is widely accepted that the so-called 5G services are conceived around the joint allocation and use of heterogeneous resources, including network, computing and storage. Resources are placed at distributed locations constrained by the different service requirements, resulting in cloud infrastructures that need to be interconnected, relying on overarching control and generalized orchestration, loosely defined as the coherent coordination of heterogeneous systems. In this context, the 5G-Crosshaul project defines a new generation of transport networks for 5G, integrating both fronthaul and backhaul segments into a common transport infrastructure and addressing the requirements associated with the 5G mobile transport network. In this talk, we will provide an overview of network orchestration, considering different models; we extend them to take into account cloud management while mentioning relevant existing initiatives, and conclude with the NFV architecture. We will briefly discuss the 5G-Crosshaul approach to the SDN/NFV-based control plane, and its support for multi-tenancy and network slicing.

10:00
Multi-tenancy architectures for heterogeneous resource allocations

ABSTRACT. The talk will first highlight some architectural aspects to be considered to ensure the allocation of heterogeneous resources in support of multi-tenancy. Then, some use cases, as well as results on optimal and sub-optimal resource allocation algorithms, will be presented and discussed.

10:00-10:30 Coffee Break
10:30-12:30 Session 7A: ICA3PP: Distributed and Network-based Computing (I)
Location: Room Beatriz I
10:30
Graphein: A Novel Optical High-radix Switch Architecture for 3D Integration
SPEAKER: Jian Jie

ABSTRACT. The demand from exascale computing has made the design of high-radix switch chips an attractive and challenging research field in EHPC (Exascale High-Performance Computing). Recent development of silicon photonic and 3D integration technologies has inspired new methods of designing high-radix switch chips. In this paper, we propose Graphein — a novel optical high-radix switch architecture, which significantly reduces the radix of the switch network by distributing a high-radix switch network over multiple layers via 3D integration, and which improves switch bandwidth while lowering switch-chip power consumption by using silicon photonic technology. Our theoretical analysis shows that the Graphein architecture can achieve 100% throughput. Our simulation shows that the average latencies under both random and hotspot patterns are less than 10 cycles, and the throughput under the random pattern is almost 100%. Compared to the Hi-Rise architecture, Graphein ensures that packets from different source ports receive fairer service, thereby yielding a more concentrated latency distribution. In addition, the power consumption of the Graphein chip is about 19.2 W, which satisfies the power constraint on a high-radix switch chip.

11:00
Online Resource Coalition Reorganization for Efficient Scheduling on the Intercloud

ABSTRACT. While users running applications on the intercloud can run their applications on configurations unavailable on single clouds, they are faced with VM performance fluctuations among providers, and even within the same provider, as recent papers have indicated. These fluctuations can impact an application's objectives. A solution is to cluster resources into coalitions working together towards a common goal, i.e., ensuring that the deviation from the objectives is minimal. These coalitions are formed based on historical information on the performance of the underlying resources, by assuming that patterns in the deployment of the applications are repeatable. However, static coalitions can lead to underutilized resources due to the fluctuating job flow, which renders the historical information obsolete. Thus, we propose an online coalition formation metaheuristic which allows us to update existing coalitions and create new ones at run time based on the job flow. We test our AntClust online coalition formation method against a static coalition formation approach.

11:30
Improving the Performance of Volunteer Computing with Data Volunteers: A Case Study with the ATLAS@home Project

ABSTRACT. Volunteer computing is a type of distributed computing in which ordinary people donate processing and storage resources to one or more scientific projects. Most of the existing volunteer computing systems have the same basic structure: a client program runs on the volunteer's computer, periodically contacting project-operated servers over the Internet to request jobs and report the results of completed jobs. BOINC is the main middleware system for this type of distributed computing. The aim of volunteer computing is that organizations be able to attain large computing power thanks to the participation of volunteer clients instead of a high investment in infrastructure. There are projects, like the ATLAS@home project, in which the number of running jobs has reached a plateau, due to a high load on data servers caused by file transfer. This is why we have designed an alternative, using the same BOINC infrastructure, in order to improve the performance of BOINC projects that have reached their limit due to the I/O bottleneck in data servers. This alternative involves having a percentage of the volunteer clients running as data servers, called data volunteers, that improve the performance by reducing the load on data servers. This paper describes our alternative in detail and shows the performance of the solution, applied to the ATLAS@home project, using a simulator of our own, ComsimBOINC.

12:00
3-additive Approximation Algorithm for Multicast Time in 2D Torus Networks

ABSTRACT. In this paper, we propose a 3-additive approximation algorithm for multicast time in wormhole-routed 2D torus networks. HMDIAG (Hybrid Modified DIAGonal) divides the 2D torus into four meshes and performs preprocessing at the source node to create the Diagonal Paths (DP), along which the message is sent in each mesh. At the source node and every intermediate node, another procedure is performed to send the message to a subset of destination nodes along a path branching from the DP. HMDIAG is a tree-based multicast algorithm that uses two startup times. Simulation results show that the multicast time, latency, and coefficient of variation of multicast time of HMDIAG are better than those of TASNEM and Multipath-HCM.

10:30-12:30 Session 7B: ICA3PP: Performance Modeling and Evaluation (II)
Location: Room Beatriz II
10:30
Porting Matlab applications to high-performance C++ codes: CPU/GPU-accelerated spherical deconvolution of diffusion MRI data

ABSTRACT. In many scientific research fields, Matlab has been established as a de facto tool for application design. This approach offers multiple advantages, such as rapid prototyping and the use of high-performance linear algebra, among others. However, the applications developed are highly dependent on the Matlab runtime, limiting deployment on heterogeneous platforms. In this paper we migrate a Matlab-implemented application to the C++ programming language, enabling parallelization on GPUs. In particular, we have chosen RUMBA-SD, a spherical deconvolution algorithm which estimates the intra-voxel white-matter fiber orientations from diffusion MRI data. We describe the methodology used along with the tools and libraries leveraged during the translation of the application. To demonstrate the benefits of the migration, we perform a series of experiments using different HPC heterogeneous platforms and linear algebra libraries. The results show that the C++ version attains, on average, a speedup of 3x over the Matlab one.

11:00
Modeling Performance of Hadoop Applications: A Journey from Queueing Networks to Stochastic Well Formed Nets

ABSTRACT. Nowadays, many enterprises commit to the extraction of actionable knowledge from huge datasets as part of their core business activities. Applications belong to very different domains, such as fraud detection or one-to-one marketing, and encompass business analytics and support to decision making in both private and public sectors. In these scenarios, a central place is held by the MapReduce framework and in particular its open source implementation, Apache Hadoop. In such environments, new challenges arise in the area of job performance prediction, with the need to provide Service Level Agreement guarantees to the end user and to avoid waste of computational resources. In this paper we provide performance analysis models to estimate MapReduce job execution times in Hadoop clusters governed by the YARN Capacity Scheduler. We propose models of increasing complexity and accuracy, ranging from queueing networks to stochastic well-formed nets, able to estimate job performance under a number of scenarios of interest, including also unreliable resources. The accuracy of our models is evaluated by considering the TPC-DS industry benchmark, running experiments on Amazon EC2 and at the CINECA Italian supercomputing center. The results have shown that the average accuracy we can achieve is in the range 9–14%.
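
As a rough illustration of the queueing-network end of this modelling spectrum, the sketch below applies exact Mean Value Analysis to a closed network; the three "stations", their service demands and the job count are invented placeholders, not values or models from the paper.

```python
# Illustrative sketch: exact Mean Value Analysis (MVA) for a closed,
# product-form queueing network.  Stations and demands are hypothetical.

def mva(service_demands, n_jobs):
    """Return per-station residence times, throughput and mean queue lengths."""
    q = [0.0] * len(service_demands)              # mean queue length per station
    r, x = list(service_demands), 0.0
    for n in range(1, n_jobs + 1):
        r = [d * (1.0 + qk) for d, qk in zip(service_demands, q)]  # residence times
        x = n / sum(r)                            # system throughput (jobs/s)
        q = [x * rk for rk in r]                  # Little's law per station
    return r, x, q

# Hypothetical per-job demands (seconds) at "map", "shuffle" and "reduce" stations
residence, throughput, queues = mva([0.8, 0.3, 0.5], n_jobs=10)
print(f"estimated job response time: {sum(residence):.2f} s")
```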

11:30
D-SPACE4Cloud: A Design Tool for Big Data Applications

ABSTRACT. Recent years have seen a steep rise in data generation worldwide, with the development and widespread adoption of several software projects targeting the Big Data paradigm. Many companies currently engage in Big Data analytics as part of their core business activities; nonetheless, there are no tools or techniques to support the design of the underlying infrastructure configuration backing such systems. In particular, the focus of this paper is on Cloud-deployed clusters, which represent a cost-effective alternative to on-premises installations. We propose a novel tool implementing a battery of optimization and prediction techniques, integrated so as to efficiently assess several alternative resource configurations and determine the minimum-cost cluster deployment satisfying Quality of Service constraints. Further, the experimental campaign conducted on real systems shows the validity and relevance of the proposed method.

12:00
On Stochastic performance and cost-aware optimal capacity planning of unreliable Infrastructure-as-a-Service cloud
SPEAKER: Weiling Li

ABSTRACT. Cloud computing is a kind of Internet-based computing, where shared resources, data and information are provided to computers and other devices on demand. It is a model for enabling ubiquitous, on-demand access to a shared pool of configurable computing resources. This novel computing paradigm brings both opportunities and challenges which reshape the way that computational resources and services are provisioned. Performance evaluation of cloud data centers is one of these challenges and has drawn considerable attention from academia and industry. In this study, we present an analytical approach to the performance analysis of Infrastructure-as-a-Service cloud data centers with unreliable task executions and resubmission of unsuccessful tasks. Several performance metrics are considered and analyzed under variable load intensities, failure frequencies, multiplexing abilities, and service intensities. We also conduct a case study based on a real-world cloud data center and employ a confidence interval check to validate the correctness of the proposed model. For performance optimization and optimal capacity planning purposes, we are also interested in minimizing the expected response time subject to constraints on the request rejection rate and the hardware cost (in terms of the cost of physical machines and the request buffer). We show that the optimization problem can be numerically solved through a simulated-annealing-based algorithm.
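
To make the last step concrete, here is a minimal sketch of a simulated-annealing loop for picking the number of physical machines; the toy cost model, the rejection-rate proxy and all constants are assumptions for the example, not the authors' model.

```python
# Illustrative sketch of simulated annealing for capacity planning.
# The cost model below is a placeholder, not the paper's analytical model.
import math
import random

def cost(n_machines, load=50.0, mu=1.0, max_reject=0.05):
    """Toy objective: response time + hardware cost + constraint penalty."""
    rho = load / (n_machines * mu)                # utilisation of the pool
    if rho >= 1.0:
        return float("inf")                       # unstable configuration
    response = 1.0 / (mu * (1.0 - rho))           # crude response-time estimate
    reject = max(0.0, rho - 0.9)                  # toy rejection-rate proxy
    penalty = 1e3 if reject > max_reject else 0.0
    return response + 0.1 * n_machines + penalty

def anneal(start=60, steps=5000, t0=10.0):
    current, current_cost = start, cost(start)
    for i in range(steps):
        temp = t0 * (1.0 - i / steps) + 1e-6      # linear cooling schedule
        candidate = max(1, current + random.choice([-1, 1]))
        cand_cost = cost(candidate)
        delta = cand_cost - current_cost
        if delta < 0 or random.random() < math.exp(-delta / temp):
            current, current_cost = candidate, cand_cost
    return current, current_cost

print(anneal())
```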

10:30-12:30 Session 7C: BigTrust (I)
Location: Room Beatriz IV
10:30
Traffic Sign Recognition Based on Parameter-free Detector and Multi-modal Representation
SPEAKER: Xiaoping Ren

ABSTRACT. For traffic signs that are difficult to detect in real traffic environments, a traffic sign detection and recognition method is proposed in this paper. First, the image is segmented according to the color characteristics of traffic signs, the region of interest is expanded, and edges are extracted. The edges are then roughly partitioned by line drawing and removal of spurious points. The turning-angle curvature is computed and, according to the relations between the curvatures of the vertices, the vertex types are classified. Standard shapes such as circles, triangles and rectangles are detected by a parameter-free detector. To improve recognition accuracy, two different methods were used to classify the detected candidate regions of traffic signs. The first method applied the dual-tree complex wavelet transform (DT-CWT) and 2D independent component analysis (2DICA) to represent candidate regions in the grayscale image and reduce the feature dimension, and then employed a nearest-neighbor classifier to classify traffic sign images and reject noise regions. The second method was template matching based on the inner pictograms of the traffic signs. The recognition results obtained by the two methods were fused by a set of decision rules. The experimental results show that the detection and recognition rate of the proposed algorithm is higher under conditions such as occluded traffic signs, uneven illumination and color distortion, and that it achieves real-time processing.

10:55
Statistical analysis of CCM.M-K1 International Comparison based on Monte Carlo method
SPEAKER: unknown

ABSTRACT. The Monte Carlo method is applied to the processing of the measurement results of the CCM.M-K1 international comparison. This method can overcome the limitations that apply in certain cases to the method described in the GUM. The CCM.M-K1 measurement results are introduced and analyzed, and the commercial software @RISK is used to perform the numerical simulation; the outcome is compared with the final report of CCM.M-K1, showing that the differences between the two are negligible.
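
The sketch below shows the general "propagation of distributions" idea behind such a Monte Carlo evaluation, using NumPy rather than @RISK; the measurement model and the input distributions are invented for illustration and are unrelated to the actual CCM.M-K1 data.

```python
# Illustrative sketch of Monte Carlo propagation of distributions
# (GUM Supplement 1 style).  Model and inputs are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
M = 200_000                                   # number of Monte Carlo trials

# Hypothetical input quantities of a mass-comparison model y = x1 - x2 + d
x1 = rng.normal(1000.000, 0.010, M)           # reading, mg
x2 = rng.normal( 999.985, 0.008, M)           # reference, mg
d  = rng.uniform(-0.005, 0.005, M)            # rectangular correction, mg

y = x1 - x2 + d                               # output quantity
lo, hi = np.percentile(y, [2.5, 97.5])        # 95 % coverage interval
print(f"estimate = {y.mean():.4f} mg, u = {y.std(ddof=1):.4f} mg")
print(f"95 % coverage interval: [{lo:.4f}, {hi:.4f}] mg")
```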

11:20
Secure data access in Hadoop using elliptic curve cryptography

ABSTRACT. Big data analytics allows valuable information to be obtained from different data sources. It is important to maintain control of those data, because unauthorized copies could be used by other entities or companies interested in them. Hadoop is widely used for processing large volumes of information and is therefore ideal for developing big data applications. Its security model focuses on control within a cluster, preventing access by unauthorized users or encrypting data distributed among nodes. Sometimes, data theft is carried out by personnel who have access to the system and can therefore bypass most of the security features. In this paper, we present an extension to the Hadoop security model that allows the information to be controlled from the source, preventing data from being used by unauthorized users and improving corporate e-governance. We use an eToken with elliptic curve cryptography that ensures robust operation of the system and prevents it from being falsified, duplicated or manipulated.
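
As a rough sketch of the kind of elliptic-curve operation involved, the snippet below signs and verifies a data-access request with ECDSA using the Python cryptography package; the request format, curve choice and key handling are assumptions for the example and do not reproduce the paper's eToken protocol.

```python
# Illustrative sketch only: ECDSA-signed data-access request.
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec
from cryptography.exceptions import InvalidSignature

private_key = ec.generate_private_key(ec.SECP256R1())   # would live on the eToken
public_key = private_key.public_key()                    # registered with the cluster

request = b"user=alice;action=read;path=/data/sales.parquet"  # hypothetical request
signature = private_key.sign(request, ec.ECDSA(hashes.SHA256()))

try:                                                      # verification on the Hadoop side
    public_key.verify(signature, request, ec.ECDSA(hashes.SHA256()))
    print("access request authenticated")
except InvalidSignature:
    print("request rejected")
```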

11:45
The Research of Recommendation System based on User-Trust Mechanism and Matrix decomposition
SPEAKER: Zhang Panpan

ABSTRACT. A recommendation system is a tool that can help users quickly and effectively obtain useful resources in the face of large amounts of information. Collaborative filtering is a widely used recommendation technique which generates recommendations for a user from the scores of similar neighbors, but it faces the problems of data sparseness and cold start. Although recommendation systems based on trust models can solve these problems to some extent, their coverage still needs further improvement. To address these issues, the paper proposes a matrix decomposition algorithm combined with a user-trust mechanism (hereinafter referred to as UTMF). The algorithm uses matrix decomposition to fill the score matrix and incorporates the trust rating information of users in the filling process. According to the results of experiments on the Epinions dataset, the UTMF algorithm can improve the precision of the recommendations and effectively ease the cold start problem.
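
A minimal sketch of the general idea follows: rating-based matrix factorization by SGD combined with a trust term that pulls a truster's latent factors towards the trustee's. The update rules, hyperparameters and toy data are assumptions for illustration, not the UTMF algorithm itself.

```python
# Illustrative sketch: matrix factorization with a trust-based regularizer.
import numpy as np

rng = np.random.default_rng(1)
n_users, n_items, k = 5, 6, 3
ratings = [(0, 1, 4.0), (0, 3, 2.0), (1, 1, 5.0), (2, 4, 3.0), (3, 0, 1.0)]
trust = [(0, 1), (2, 0), (3, 2)]              # (truster, trustee) pairs

P = 0.1 * rng.standard_normal((n_users, k))   # user factors
Q = 0.1 * rng.standard_normal((n_items, k))   # item factors
lr, reg, reg_t = 0.01, 0.05, 0.05

for epoch in range(200):
    for u, i, r in ratings:                   # fit observed ratings
        err = r - P[u] @ Q[i]
        pu, qi = P[u].copy(), Q[i].copy()
        P[u] += lr * (err * qi - reg * pu)
        Q[i] += lr * (err * pu - reg * qi)
    for u, v in trust:                        # pull trusting users' factors together
        P[u] -= lr * reg_t * (P[u] - P[v])

print("predicted rating of user 0 for item 4:", P[0] @ Q[4])
```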

12:10
Reversible data hiding using non-local means prediction
SPEAKER: Yingying Fang

ABSTRACT. In this paper, we propose a prediction-error based reversible data hiding scheme incorporating non-local means (NLM) prediction. The traditional local predictors reported in the literature rely on local correlation and perform poorly when predicting textural pixels. By globally exploiting the self-similarity contained in the image itself, NLM can achieve better prediction in texture regions. More specifically, the pixels are predicted adaptively without the need to transmit side information. In this paper, the textural pixels, distinguished by their local complexity, are predicted by NLM, while the smooth pixels having high local correlation are predicted by a local predictor. The incorporation of NLM makes it possible for the proposed method to achieve accurate predictions in both smooth and texture regions. Optimal parameters of the method are obtained by minimizing the prediction-error entropy. Experimental results compared with state-of-the-art methods show that the proposed method can yield high-fidelity embedded images with a satisfactory embedding capacity, especially for texture images.
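
For readers unfamiliar with NLM, the sketch below predicts a single pixel as a similarity-weighted average of pixels whose surrounding patches look alike. Patch size, search window and the weight kernel are arbitrary choices for the example, not the paper's parameters, and a real reversible-data-hiding predictor would exclude the centre pixel so the decoder can reproduce the prediction.

```python
# Illustrative sketch of non-local means (NLM) prediction for one pixel.
import numpy as np

def nlm_predict(img, y, x, patch=3, search=7, h=10.0):
    half, s = patch // 2, search // 2
    ref = img[y - half:y + half + 1, x - half:x + half + 1].astype(float)
    weights, values = [], []
    for j in range(y - s, y + s + 1):
        for i in range(x - s, x + s + 1):
            if (j, i) == (y, x):
                continue
            cand = img[j - half:j + half + 1, i - half:i + half + 1].astype(float)
            d2 = np.mean((ref - cand) ** 2)          # patch dissimilarity
            weights.append(np.exp(-d2 / (h * h)))
            values.append(float(img[j, i]))
    w = np.array(weights)
    return float(np.dot(w, values) / w.sum())        # weighted-average prediction

img = np.random.default_rng(0).integers(0, 256, (32, 32))
print("predicted value:", nlm_predict(img, 16, 16))
```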

10:30-12:30 Session 7D: SCDT (I)

Supercomputing Co-Design Technology Workshop (SCDT-2016)

Co-Design for Extreme Scale Computing through Runtime Technologies

Thomas Sterling
Professor, Intelligent Systems Engineering Department, Indiana University
Director, Center for Research in Extreme Scale Technologies (CREST), Indiana University


 

Brief Biography
Dr. Thomas Sterling holds the position of Professor of Informatics and Computing at the Indiana University (IU) School of Informatics and Computing, Department of Intelligent Systems Engineering (ISE), and serves as Director of the IU Center for Research in Extreme Scale Technologies (CREST). Since receiving his Ph.D. from MIT in 1984 as a Hertz Fellow, Dr. Sterling has engaged in applied research in parallel computing system structures, semantics, and operation in industry, government labs, and academia. Dr. Sterling is best known as the "father of Beowulf" for his pioneering research in commodity/Linux cluster computing, for which he shared the Gordon Bell Prize in 1997. He led the HTMT Project, sponsored by multiple agencies, to explore advanced technologies and their implications for high-end computer system architectures. Other research projects to which he contributed include the DARPA DIVA PIM architecture project with USC-ISI, the Cray-led Cascade Petaflops architecture sponsored by the DARPA HPCS program, and the Gilgamesh high-density computing project at NASA JPL. Sterling is currently involved in research associated with the innovative ParalleX execution model for extreme scale computing, to establish the foundational principles guiding the development of future-generation Exascale computing systems. ParalleX is currently the conceptual centerpiece of the XPRESS project, part of the DOE X-Stack program, and has been demonstrated via the proof-of-concept HPX-5 runtime system software. Dr. Sterling is the co-author of six books and holds six patents. He was the recipient of the 2013 Vanguard Award and is a Fellow of the AAAS. He is also co-guest editor, with Bill Gropp, of the HPCwire Exascale Edition.

 

Location: Room Alejandra
10:30
Co-Design for Extreme Scale Computing through Runtime Technologies

ABSTRACT. As HPC enters the 100-Petaflops era with the introduction of the TaihuLight computer in China, the co-design of algorithms, hardware architecture, and enabling system software is becoming imperative. The recent generations of systems have exemplified what has been convenient for hardware technologies for maximum flops and reduction of energy, but not application execution efficiency or user productivity. This is particularly true for those problems that require strong scaling or have significant dynamic components. Runtime system software offers an opportunity to address these and other challenges but suffers from overhead costs that hinder its effectiveness, at least for some problems requiring the exploitation of near fine-grain parallelism. Co-design of system hardware and software architecture with dynamic adaptive application requirements may drive advances in future classes of scalable systems for extreme scale. This presentation will describe the challenges and the gaps between system architecture and dynamic execution that may be bridged through advances in algorithms, runtime software, and hardware architecture design. Results from recent research with the HPX-5 runtime software system, and considerations from the investigation of FPGA runtime hardware support, will be discussed and conclusions derived in support of advanced co-design principles. Questions are welcome from the audience throughout the presentation.

11:30
System monitoring-based holistic resource utilization analysis for every user of a large HPC center

ABSTRACT. The problem of effective resource utilization is very challenging nowadays, especially for HPC centers running top-level supercomputing facilities with high energy consumption and a significant number of workgroups. The weakness of many system-monitoring-based approaches to efficiency studies is their orientation towards professionals and the analysis of specific jobs, with low availability for regular users. The proposed all-round performance analysis approach covers single-application performance as well as project-level and overall system resource utilization based on system monitoring data, and promises to be an effective and low-cost technique aimed at all types of HPC center users. Every user of an HPC center can access details on any of their executed jobs to better understand application behavior and sequences of job runs, including scalability studies, helping in turn to perform appropriate optimizations and implement co-design techniques. Taking into consideration all levels (user, project manager, administrator), the approach helps to improve the output of HPC centers.

11:50
Automated Parallel Simulation of Heart Electrical Activity Using Finite Element Method

ABSTRACT. In this paper we present an approach to the parallel simulation of heart electrical activity using the finite element method with the help of the FEniCS automated scientific computing framework. FEniCS allows scientific software development using near-mathematical notation and provides automatic parallelization on MPI clusters. We implemented the ten Tusscher–Panfilov (TP06) cell model of cardiac electrical activity. The scalability testing of the implementation was performed using up to 240 CPU cores, and a 95x speedup was achieved. We evaluated various combinations of the parallel Krylov linear solvers and preconditioners available in FEniCS. The best performance was provided by the conjugate gradient and biconjugate gradient stabilized solvers with the successive over-relaxation preconditioner. Since the FEniCS-based implementation of the TP06 model uses notation close to the mathematical one, it can be used by computational mathematicians, biophysicists, and other researchers without extensive parallel computing skills.

12:10
Using hStreams programming library for accelerating a real-life application on Intel MIC

ABSTRACT. The main goal of this paper is to assess the suitability of the hStreams programming library for porting a real-life scientific application to heterogeneous platforms with Intel Xeon Phi coprocessors. This emerging library offers a higher level of abstraction to provide effective concurrency among tasks and control over overall performance. In our study, we focus on applying the FIFO streaming model to a parallel application which implements a numerical model of alloy solidification. In the paper, we show how scientific applications can benefit from multiple streams. To take full advantage of hStreams, we propose a decomposition of the studied application that allows us to distribute tasks belonging to the computational core of the application among two logical streams within two logical/physical domains. Effectively overlapping computations with data transfers is another goal achieved in this way. The proposed approach allows us to execute the whole application 3.5 times faster than the original parallel version running on two CPUs.

10:30-12:30 Session 7E: En2ESO (II)
Location: Room Beatriz III
10:30
Seer: Empowering Software Defined Networking with Data Analytics

ABSTRACT. Network complexity is increasing, making network control and orchestration a challenging task. The proliferation of network information and tools for data analytics can provide important insight into resource provisioning and optimisation. The network knowledge incorporated in software defined networking can facilitate knowledge-driven control, leveraging network programmability. We present Seer: a flexible, highly configurable data analytics platform for network intelligence based on software defined networking and big data principles. Seer combines a computational engine with a distributed messaging system to provide a scalable, fault-tolerant and real-time platform for knowledge extraction. Our first prototype uses Apache Spark for streaming analytics and the Open Network Operating System (ONOS) controller to program a network in real time. The first application we developed aims to predict the mobility pattern of mobile devices inside a smart city environment.

11:00
Multi-Domain Orchestration for NFV: Challenges and Research Directions

ABSTRACT. In this paper we focus on the problem of multi-domain orchestration over multi-technology environments, with the focus on the LTE segment. In order to facilitate service deployment in end-to-end setups, new orchestration designs are required that exploit and advance existing methodologies. We examine in detail the challenges of multi-domain Network Function Virtualization (NFV) orchestration for the general case, surveying the current landscape and existing technologies, and, focusing on the LTE network side, we provide an analysis with LTE-specific considerations. We also describe a reference architecture that jointly considers the problem of NFV orchestration and supports the concept of Network Slicing.

11:30
Baguette: Towards end-to-end service orchestration in heterogeneous networks

ABSTRACT. Network services are the key mechanism for operators to introduce intelligence and generate profit from their infrastructures. The growth in the number of network users and the stricter application network requirements have highlighted a number of challenges in orchestrating services using existing production management and configuration protocols and mechanisms. Recent networking paradigms like Software Defined Networking (SDN) and Network Function Virtualization (NFV) provide a set of novel control and management interfaces that enable unprecedented automation, flexibility and openness in operator infrastructure management. This paper presents Baguette, a novel and open service orchestration framework for operators. Baguette supports a wide range of network technologies, namely optical and wired Ethernet technologies, and allows service providers to automate the deployment and dynamic re-optimization of network services. We present the design of the orchestrator and elaborate on the integration of Baguette with existing low-level network and cloud management frameworks.

12:00
Energy efficient orchestration of virtual services in 5G integrated fronthaul/backhaul infrastructures
SPEAKER: Giada Landi

ABSTRACT. The 5G-Crosshaul project is developing an integrated fronthaul/backhaul for 5G infrastructures, composed of heterogeneous forwarding elements (XFEs) and processing units (XPUs), controlled through an SDN- and NFV-based control platform to deliver multi-tenant network slices and services. In this scenario, we propose a novel energy management SDN application, called EMMA, for energy-wise service orchestration. EMMA monitors power consumption based on power measurements or traffic/CPU load statistics and orchestrates resources through energy-aware routing and VNF placement algorithms, automatically adjusting the XFEs' and XPUs' power states to guarantee energy efficiency on a per-tenant basis or at the level of the whole physical infrastructure.

12:30-14:00 Lunch Break
14:00-14:30 Session 8: Industrial session
Location: Room Alejandra
14:00
Intel Software Development Tools for Parallel Computing
SPEAKER: Edmund Preiss

ABSTRACT.

  • Overview of the Intel Software Development Tools (Intel Parallel Studio XE)
    ◦ Recently introduced new features: Python, Big Data and Machine Learning, high-level storage/application and performance analysis

  • Intel's parallelization concepts
    ◦ Threading Models
    ◦ Vectorization

14:30-16:00 Session 9A: ICA3PP: Parallel and Distributed Algorithms (III)
Location: Room Alejandra
14:30
The Impact of Panel Factorization on the Gauss-Huard Algorithm for the Solution of Linear Systems on Modern Architectures

ABSTRACT. The Gauss-Huard algorithm (the GHA) is a specialized version of Gauss-Jordan elimination for the solution of linear systems that, enhanced with column pivoting, exhibits numerical stability and computational cost close to those of the conventional solver based on the LU factorization with row pivoting. Furthermore, the GHA can be formulated as a procedure rich in matrix multiplications, so that high performance can be expected on current architectures with multi-layered memories. Unfortunately, in principle the GHA does not admit the introduction of look-ahead, a technique that has been demonstrated to be rather useful to improve the performance of the LU factorization on multi-threaded platforms with high levels of hardware concurrency. In this paper we analyze the effect of this drawback on the implementation of the GHA on systems accelerated with graphics processing units (GPUs), exposing the roles of the CPU-to-GPU and single precision-to-double precision performance ratios, as well as the contribution from the operations in the algorithm’s critical path.

15:00
Improving Hash Distributed A* for shared-memory architectures using abstraction

ABSTRACT. The A* algorithm is generally used to solve combinatorial optimization problems, but it requires high computing power and a large amount of memory; hence, efficient parallel A* algorithms for different architectures are needed. In this sense, Hash Distributed A* (HDA*) parallelizes A* by applying a decentralized strategy and a hash-based node distribution scheme. However, this distribution scheme results in frequent node transfers among processors. In this paper, we present Optimized AHDA*, a version of HDA* for shared-memory architectures that uses an abstraction-based node distribution scheme and a technique to group several nodes before transferring them to the corresponding thread. Both methods reduce the number of node transfers and mitigate communication and contention. We assess the effect of each technique on algorithm performance. Finally, we evaluate the scalability of the proposed algorithm on a multicore machine, using the 15-puzzle as a case study.
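
To illustrate the hash-based work distribution idea behind HDA*, the sketch below assigns each generated state to an owning thread, either by hashing the full state or by hashing only an abstraction of it so that similar successors land on the same thread. The state encoding, hash function and abstraction are toy assumptions, not the paper's scheme.

```python
# Illustrative sketch: hash-based vs abstraction-based node ownership.
from hashlib import blake2b

N_THREADS = 8

def owner_plain(state):
    """Plain HDA*: hash the full state."""
    digest = blake2b(bytes(state), digest_size=8).digest()
    return int.from_bytes(digest, "big") % N_THREADS

def owner_abstract(state, keep=4):
    """Abstraction-based variant: hash only part of the state, so that
    successors (which differ in few positions) often map to the same
    thread and fewer node transfers are needed."""
    digest = blake2b(bytes(state[:keep]), digest_size=8).digest()
    return int.from_bytes(digest, "big") % N_THREADS

state = (1, 2, 3, 4, 5, 6, 7, 8, 0)           # a small sliding-puzzle-like tuple
print(owner_plain(state), owner_abstract(state))
```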

15:30
Leveraging the Performance of LBM-HPC for Large Sizes on GPUs using Ghost Cells

ABSTRACT. Today, we are witnessing a growing demand from the scientific community for larger and more efficient computational resources. The appearance of GPUs for general-purpose computing represented an important advance towards covering such demand. These devices offer impressive computational capacity at low cost with efficient power consumption. However, the memory available in these devices is sometimes not enough, making computationally expensive memory transfers between CPU and GPU necessary, which causes a dramatic fall in performance. Recently, the Lattice-Boltzmann Method has positioned itself as an efficient methodology for fluid simulations. Although this method presents some interesting features particularly amenable to efficient exploitation on parallel computers, it requires considerable memory capacity, which can be an important drawback, in particular on GPUs. In the present paper, a new GPU-based implementation is proposed which minimizes such requirements with respect to other state-of-the-art implementations. It allows us to execute problems almost 2× bigger without additional memory transfers, achieving faster executions when dealing with large problems.
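
As a small illustration of the ghost-cell idea, the sketch below splits a field into chunks that each carry one extra boundary layer, so a chunk can be updated without touching its neighbour's interior. The 1-D decomposition, replicated domain-edge boundary and toy stencil update are assumptions for the example, not the paper's LBM kernels.

```python
# Illustrative sketch of ghost-cell (halo) decomposition and update.
import numpy as np

def split_with_ghosts(field, n_chunks):
    chunks = np.array_split(field, n_chunks, axis=0)
    padded = []
    for k, c in enumerate(chunks):
        top = chunks[k - 1][-1:] if k > 0 else c[:1]           # ghost row from neighbour,
        bot = chunks[k + 1][:1] if k < n_chunks - 1 else c[-1:]  # or replicated domain edge
        padded.append(np.vstack([top, c, bot]))
    return padded

def update(chunk):
    """Toy stencil update on the interior; ghost rows are read-only."""
    return 0.25 * (chunk[:-2] + chunk[2:] + 2.0 * chunk[1:-1])

field = np.random.default_rng(0).random((16, 8))
new_chunks = [update(c) for c in split_with_ghosts(field, n_chunks=4)]
result = np.vstack(new_chunks)                # same shape as the original field
assert result.shape == field.shape
```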

16:00
Hardware-Based Sequential Consistency Violation Detection Made Simpler

ABSTRACT. Modern architectures aggressively reorder and overlap memory accesses, causing Sequential Consistency Violations (SCVs). An SCV is practically always a bug. This paper proposes Dissector, a combined hardware-software approach to detect SCVs in a conventional TSO machine. Dissector hardware works by piggybacking information about pending stores onto cache coherence messages. Later, it detects if any of those pending stores can cause an SCV cycle. Dissector keeps hardware modifications minimal and simple by sacrificing some degree of detection accuracy. Dissector recovers the loss in detection accuracy by using a postprocessing software phase which filters out false positives and extracts detailed debugging information. Dissector hardware is lightweight, keeps the cache coherence protocol clean, does not generate any extra messages, and is unaffected by branch mispredictions. Moreover, due to the postprocessing phase, Dissector does not suffer from false positives. This paper presents a detailed design and implementation of Dissector in a conventional TSO machine. Our experiments with different concurrent algorithms, bug kernels, Splash2 and Parsec applications show that Dissector has a better SCV detection ability than a state-of-the-art hardware-based approach, with much less hardware. Dissector hardware induces a negligible execution overhead of 0.02%. Moreover, with more processors, the overhead remains virtually the same.

14:30-16:00 Session 9B: TAPEMS (I)
Location: Room Beatriz II
14:30
Improving the energy efficiency of Evolutionary Multiobjective algorithms

ABSTRACT. Problems for which many objective functions have to be simultaneously optimized can be easily found in many fields of science and industry. Solving this kind of problem in a reasonable amount of time while taking energy efficiency into account is still a relevant task. Most evolutionary multi-objective optimization algorithms based on parallel computing focus only on performance. In this paper, we propose a parallel implementation of the most time-consuming parts of evolutionary multi-objective algorithms with particular attention to energy consumption. Specifically, we focus on the most computationally expensive part of the state-of-the-art evolutionary NSGA-II algorithm, the Non-Dominated Sorting (NDS) procedure. GPU platforms have been considered due to their high acceleration capacity and energy efficiency. A new version of the NDS procedure is proposed (referred to as EFNDS). A made-to-measure data structure to store the dominance information has been designed to take advantage of the GPU architecture when computing NDS. NSGA-II based on EFNDS is comparatively evaluated against another state-of-the-art GPU version, and also against a widely used sequential version. In the evaluation we adopt a benchmark that is scalable in the number of objectives as well as decision variables (the DTLZ test suite), using a large number of individuals (from 500 up to 30000). The results clearly indicate that our proposal achieves the best performance and energy efficiency for solving large-scale multi-objective optimization problems on GPUs.
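
For reference, the sketch below shows what the Non-Dominated Sorting step of NSGA-II computes, written as plain sequential Python; the paper's EFNDS is a GPU formulation with a dedicated data structure, which this sketch does not attempt to reproduce.

```python
# Illustrative sketch of fast non-dominated sorting (sequential).
import numpy as np

def dominates(a, b):
    return np.all(a <= b) and np.any(a < b)   # minimisation of every objective

def non_dominated_sort(F):
    n = len(F)
    dominated_by = [[] for _ in range(n)]     # solutions that i dominates
    dom_count = [0] * n                       # how many solutions dominate i
    fronts, current = [], []
    for i in range(n):
        for j in range(n):
            if i != j and dominates(F[i], F[j]):
                dominated_by[i].append(j)
            elif i != j and dominates(F[j], F[i]):
                dom_count[i] += 1
        if dom_count[i] == 0:
            current.append(i)
    while current:
        fronts.append(current)
        nxt = []
        for i in current:
            for j in dominated_by[i]:
                dom_count[j] -= 1
                if dom_count[j] == 0:
                    nxt.append(j)
        current = nxt
    return fronts

F = np.random.default_rng(0).random((8, 3))   # 8 individuals, 3 objectives
print(non_dominated_sort(F))
```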

15:00
Network-aware Optimization of MPDATA on Homogeneous Multi-core Clusters with Heterogeneous Network

ABSTRACT. The communication layer of modern HPC platforms is becoming increasingly heterogeneous and hierarchical. As a result, even on platforms with homogeneous processors, the communication cost of many parallel applications will vary significantly depending on the mapping of their processes to the processors of the platform. The optimal mapping, minimizing the communication cost of the application, will strongly depend on the network structure and performance as well as on the logical communication flow of the application. In our previous work, we proposed a general approach and two approximate heuristic algorithms aimed at minimizing the communication cost of data-parallel applications with a two-dimensional symmetric communication pattern on heterogeneous hierarchical networks, and tested these algorithms in the context of the parallel matrix multiplication application. In this paper, we develop a new algorithm that builds on one of these heuristic approaches in the context of a real-life application, MPDATA, which is one of the major parts of the EULAG geophysical model. We carefully study the communication flow of MPDATA and discover that, even under the assumption of a perfectly homogeneous communication network, the logical communication links of this application will have different bandwidths, which makes the optimization of its communication cost particularly challenging. We propose a new algorithm, based on the cost functions of one of our general heuristic algorithms, and apply it to the optimization of the communication cost of MPDATA, which has an asymmetric heterogeneous communication pattern. We also present experimental results demonstrating performance gains due to this optimization.

15:30
Comparative Analysis of OpenACC Compilers
SPEAKER: Daniel Barba

ABSTRACT. OpenACC has been under development for a few years now. The OpenACC 2.5 specification was recently made public, and there are several initiatives for developing full implementations of the standard to make use of accelerator capabilities. There is much to be done yet, but currently OpenACC for GPUs is reaching a good maturity level in various implementations of the standard, using CUDA and OpenCL as backends. Nvidia is investing in this project and has released an OpenACC Toolkit, including the PGI Compiler. There are, however, more developments out there. In this work, we analyze the different available OpenACC compilers that have been developed by companies or universities during the last years. We check their performance and maturity, keeping in mind that OpenACC is designed to be used without extensive knowledge about parallel programming. Our results show that the compilers are on their way to a reasonable maturity, presenting different strengths and weaknesses.

 

14:30-16:00 Session 9C: SCDT (II)
Location: Room Beatriz I
14:30
Efficient Distributed Computations with DIRAC

ABSTRACT. High Energy Physics (HEP) experiments at the LHC collider at CERN were among the first scientific communities with very high computing requirements. Nowadays, researchers in other scientific domains are in need of similar computational power and storage capacity. A solution for the HEP experiments was found in the form of the computational grid: a distributed computing infrastructure integrating a large number of computing centers based on commodity hardware. These infrastructures are very well suited for high-throughput applications used for the analysis of large volumes of data with trivial parallelization into multiple independent execution threads. More advanced applications in HEP and other scientific domains can exploit complex parallelization techniques using multiple interacting execution threads. A growing number of High Performance Computing (HPC) centers, or supercomputers, support this mode of operation. One of the software toolkits developed for building distributed computing systems is the DIRAC Interware. It allows seamless integration of computing and storage resources based on different technologies into a single coherent system. This product was very successful in solving the problems of large HEP experiments and was upgraded in order to offer a general-purpose solution. The DIRAC Interware can also help to include HPC centers in a common federation, to achieve goals similar to those of computational grids. However, the integration of HPC centers imposes certain requirements on their internal organization and external connectivity, presenting a complex co-design problem. A distributed infrastructure including supercomputers is planned for construction. It will be applied to interdisciplinary large-scale problems of modern science and technology.

14:50
The Co-design of Astrophysical Code for Massively Parallel Supercomputers
SPEAKER: Igor Chernykh

ABSTRACT. The rapid growth of supercomputer technologies has become a driver for the development of the natural sciences. Most of the discoveries in astronomy, in the physics of elementary particles, in the design of new materials and in DNA research are connected with numerical simulation and with supercomputers. Supercomputer simulation has become an important tool for processing the great volume of observational and experimental data accumulated by mankind. Modern scientific challenges give the highest relevance to work on computer systems and scientific software design. The architecture of future exascale systems is still being discussed. Nevertheless, it is necessary to develop the algorithms and software for such systems right now. It is necessary to develop software that is capable of using tens and hundreds of thousands of processors and of transmitting and storing large volumes of data. In the present work, a technology for the development of such algorithms and software is proposed. As an example of the use of the technology, the process of software development is considered for some problems of astrophysics and plasma physics.

15:10
Generalized Approach to Scalability Analysis of Parallel Applications
15:30
Co-design of a particle-in-cell plasma simulation code for Intel Xeon Phi: a first look at Knights Landing

ABSTRACT. Three-dimensional particle-in-cell laser-plasma simulation is an important area of computational physics. Solving state-of-the-art problems requires large-scale simulation on a supercomputer using specialized codes. A growing demand for computational resources inspires research into improving efficiency and co-design for supercomputers based on manycore architectures. This paper presents first performance results of the particle-in-cell plasma simulation code PICADOR on the recently introduced Knights Landing generation of Intel Xeon Phi. A straightforward rebuild of the code yields a 2.43x speedup compared to the previous Knights Corner generation. Further code optimization results in an additional 1.89x speedup. The optimization performed is beneficial not only for Knights Landing, but also for high-end CPUs and Knights Corner. The optimized version achieves 100 GFLOPS double-precision performance on a Knights Landing device, with speedups of 2.35x compared to a 14-core Haswell CPU and 3.47x compared to a 61-core Knights Corner Xeon Phi.

14:30-16:00 Session 9D: UCER (I)
Location: Room Beatriz III
14:30
Implementation of the Beamformer Algorithm for the NVIDIA Jetson

ABSTRACT. Nowadays, the focus of the technology industry is shifting intensively towards improving the Gflop/watt ratio of computation. Many processors implement the low-power design of the ARM architecture, e.g. the NVIDIA TK1, a chip which also includes a GPU embedded in the same die to improve performance at low energy consumption. This type of device is a very suitable target machine for applications that require mobility, e.g. those that manage and reproduce real acoustic environments. One of the most used algorithms in these reproduction environments is the beamformer algorithm. We have implemented the variant called Beamformer QR-LCMV, based on the QR decomposition, which is a very computationally demanding operation. We have explored different options, differing basically in the high performance computing library used. We have also built our own version with the aim of approaching the real-time processing goal when working on this type of low-power device.
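
For context, the sketch below computes LCMV beamformer weights through a QR decomposition of the snapshot matrix, one common way to avoid forming the covariance matrix explicitly. The array geometry, constraint, data and solver calls are assumptions for the example; this is not the paper's QR-LCMV implementation or its library choices.

```python
# Illustrative sketch of an LCMV beamformer computed via QR.
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
M, N = 8, 200                                  # microphones, snapshots
X = rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))  # rows = snapshots

C = np.ones((M, 1), dtype=complex)             # constraint: unit response at broadside
f = np.ones((1,), dtype=complex)

_, R = np.linalg.qr(X)                         # X = Q R, so X^H X = R^H R
# Solve (R^H R) B = C with two triangular solves instead of inverting X^H X
B = solve_triangular(R.conj().T, C, lower=True)
B = solve_triangular(R, B, lower=False)
w = B @ np.linalg.solve(C.conj().T @ B, f)     # w = (X^H X)^-1 C (C^H (X^H X)^-1 C)^-1 f

print("constraint response:", (C.conj().T @ w).real)  # should be ~1
```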

15:00
Efficiency of GPUs for Relational Database Engine Processing

ABSTRACT. Relational database management systems (RDBMS) are still widely required by numerous business applications. Boosting performance without compromising functionality represents a big challenge. To achieve this goal, we propose to boost an existing RDBMS by making it able to use hardware architectures with high memory bandwidth, like GPUs. In this paper we present a solution named CuDB. We compare the performance and energy efficiency of our approach with different GPU ranges. We focus on the technical specificities of GPUs which are most relevant for designing highly energy-efficient solutions for database processing.

15:30
MARL-Ped+Hitmap: Towards improving agent-based simulations with distributed arrays

ABSTRACT. Multi-agent systems allow the modelling of complex, heterogeneous, and distributed systems in a realistic way. MARL-Ped is a multi-agent system tool, based on the MPI standard, for the simulation of different scenarios of pedestrians who autonomously learn the best behavior by Reinforcement Learning. MARL-Ped uses one MPI process for each agent by design, with a fixed fine-grain granularity. This requirement limits the performance of the simulations when the number of processors is smaller than the number of agents. On the other hand, Hitmap is a library to ease the programming of parallel applications based on distributed arrays. It includes abstractions for the automatic partition and mapping of arrays at runtime with arbitrary granularity, as well as functionalities to build flexible communication patterns that transparently adapt to the data partitions.
In this work, we present the methodology and techniques of granularity selection in Hitmap, applied to the simulation of agent systems. As a first approximation, we use the MARL-Ped multi-agent pedestrian simulation software as a case study for intra-node scenarios. Hitmap allows agents to be transparently mapped to processes, reducing oversubscription and intra-node communication overheads. The evaluation results show significant advantages when using Hitmap, increasing the flexibility, performance, and agent-number scalability for a fixed number of processing elements, allowing a better exploitation of isolated nodes.

14:30-16:00 Session 9E: SUT4Coaching (I)
Location: Room Beatriz IV
14:30
Improved Track Path Method in Real Time by using GPS and Accelerometer Data
SPEAKER: Haklin Kimm

ABSTRACT. In this paper we present an improved algorithm that traces a track path in real time by using GPS and accelerometer data. In our previous work, we proposed the use of a small, 16-bit processor in conjunction with a stand-alone GPS receiver to gather GPS data through a wristwatch-like device for runners. However, we were faced with the problems of erroneous raw GPS data and limited memory size. Obtaining a trace of a running path requires a larger memory capacity on the wearable GPS device, so it is recommended to sample raw GPS data less frequently and generate a reduced track path. We propose a method that can obtain more acceptable GPS track paths by applying 3-degree accelerometer data and selecting less erroneous data.

15:00
An Event-Based Approach for Discovering Activities of Daily Living by Hidden Markov Models
SPEAKER: Kévin Viard

ABSTRACT. Smart home technologies may improve the comfort and safety of frail people in their houses. To achieve this goal, models of Activities of Daily Living (ADL) are often used to detect dangerous situations or behavioral changes in the habits of these persons. In this paper, an approach is proposed to build a model of ADLs, in the form of Hidden Markov Models (HMMs), from a training database of observed events emitted by binary sensors. The main advantage of our approach is that no knowledge of the actions actually performed during the learning period is required. Finally, we apply our approach to a real case study and discuss the quality of the results obtained.
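
To illustrate what scoring a sensor-event sequence against an HMM involves, the sketch below runs the forward algorithm on a tiny hand-written model; the states, events and probabilities are made up for the example, whereas the paper learns its HMMs from a training database of binary-sensor events.

```python
# Illustrative sketch: forward algorithm on a toy ADL-style HMM.
import numpy as np

states = ["sleep", "cook", "wash"]
events = {"bed_on": 0, "stove_on": 1, "tap_on": 2}

pi = np.array([0.6, 0.2, 0.2])                 # initial state distribution
A = np.array([[0.8, 0.1, 0.1],                 # state transition matrix
              [0.1, 0.7, 0.2],
              [0.2, 0.2, 0.6]])
B = np.array([[0.9, 0.05, 0.05],               # event emission probabilities
              [0.05, 0.8, 0.15],
              [0.05, 0.15, 0.8]])

def log_likelihood(obs):
    alpha = pi * B[:, obs[0]]                  # forward variables
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return np.log(alpha.sum())

sequence = [events[e] for e in ["stove_on", "stove_on", "tap_on"]]
print("log-likelihood of a cooking-like sequence:", log_likelihood(sequence))
```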

15:30
A Learning System to Support Social and Empathy Disorders Diagnosis through Affective Avatars
SPEAKER: Ramon Hervas

ABSTRACT. Nowadays, the diagnosis and treatment of cognitive and physical health issues can be empowered through the use of information technologies. However, there is a significant gap between the potential of those technologies and their real application. One example is the use of serious games for health purposes, a trending research area still not implemented in health systems. This paper proposes the use of serious games, particularly an interactive and affective avatar-based application, to support the diagnosis and treatment of empathy and socialization issues in an autonomous way, through the implementation of a learning algorithm based on the ground truth obtained from evaluation with real users, including normotypical users, users with Down syndrome and users with intellectual disability.

16:00
Analysis of the Innovation Outputs in mHealth for Patient Monitoring

ABSTRACT. In the last decade, mobile health (mHealth) has developed as a natural consequence of the advances in mobile technologies, the growing spread of mobile devices, and their application in the provision of novel health services. mHealth has demonstrated the potential to make the health care sector more efficient and sustainable and to increase healthcare quality. Considering the boost that mHealth will provide to the healthcare area, many organizations and governments have engaged in innovation in this area. In this context, this work investigated the role of innovation in the area of mHealth for patient monitoring in order to determine the trends and the performance of innovation activities in this domain. Proxy indicators, such as intellectual property statistics and scientific publication statistics, were utilized to measure the outputs of innovation in Europe during the period from 2006 to 2015. Two studies were performed to provide quantitative measures for the indicators of innovation outputs in the domain of mHealth for patient monitoring, and three main conclusions were drawn. First, even though there was a lot of research in Europe about mHealth for patient monitoring, the vast majority of enterprises did not protect their inventions. Second, strong research collaboration in the area of mHealth for patient monitoring took place between researchers affiliated with institutions of different European countries and even with researchers working in Asian or American institutions. Finally, an increasing trend in the number of published articles about mHealth for patient monitoring was identified. Therefore, the findings of the studies demonstrate the great interest that the field of mHealth has aroused and the strong involvement in innovation activities in the area of mHealth for patient monitoring.

16:00-16:30 Coffee Break & ho-Computer raffle
16:30-18:30 Session 10A: ICA3PP: Applications of Parallel and Distributed Computing (II)
Location: Room Alejandra
16:30
Shared Memory Tile-based vs Hybrid Memory GOP-based Parallel Algorithms for HEVC Encoder

ABSTRACT. Since the emergence of the new High Efficiency Video Coding (HEVC) standard, several strategies have been followed to take advantage of the parallel features available in it. Many of the parallelization approaches in the literature target the decoder side, aiming at achieving real-time decoding. However, the most complex part of the HEVC codec is the encoder. In this paper, we perform a comparative analysis of two parallelization proposals: one based on tiles, employing shared memory architectures, and the other based on Groups Of Pictures (GOPs), employing distributed shared memory architectures. The results show that good speed-ups are obtained for the tile-based proposal, especially for high-resolution video sequences, but its scalability decreases for low-resolution video sequences. The GOP-based proposal outperforms the tile-based proposal as the number of processes increases, and this benefit grows when low-resolution video sequences are compressed.
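As a rough illustration of the GOP-level parallelism compared in the paper (not the authors' implementation), the sketch below splits a frame sequence into independent Groups Of Pictures and encodes them in parallel processes; encode_gop is a hypothetical placeholder for a real HEVC encoder call.

```python
from multiprocessing import Pool

def encode_gop(gop_frames):
    """Placeholder for a GOP encoder; a real implementation would run an
    HEVC encoder on this independent Group Of Pictures."""
    return f"bitstream for {len(gop_frames)} frames".encode()

def split_into_gops(frames, gop_size):
    """GOPs are coded independently, so they can be distributed to workers."""
    return [frames[i:i + gop_size] for i in range(0, len(frames), gop_size)]

if __name__ == "__main__":
    frames = list(range(240))            # stand-in for decoded pictures
    gops = split_into_gops(frames, gop_size=32)
    with Pool(processes=4) as pool:      # one worker per process/node
        chunks = pool.map(encode_gop, gops)
    # map preserves order, so the per-GOP chunks concatenate directly.
    bitstream = b"".join(chunks)
    print(len(gops), "GOPs encoded,", len(bitstream), "bytes")
```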

17:00
GPU-Based Heterogeneous Coding Architecture for HEVC

ABSTRACT. The High Efficiency Video Coding (HEVC) standard has nearly doubled the compression efficiency of prior standards. Nonetheless, this increase in coding efficiency comes with a notably higher computational complexity that must be overcome in order to achieve real-time encoding. For this reason, this paper focuses on applying parallel processing techniques to the HEVC encoder with the aim of significantly reducing its computational cost without affecting compression performance. Firstly, we propose a coarse-grained slice-based parallelization technique that is executed on a multi-core CPU, and then, at a finer level of parallelism, a GPU-based motion estimation algorithm. Together, both techniques define a heterogeneous parallel coding architecture for HEVC. Results show that speed-ups of up to 4.06× can be obtained on a quad-core platform with a low impact on coding performance.
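The motion estimation kernel that is typically offloaded to the GPU boils down to block matching; the following NumPy sketch shows a sequential full-search SAD block matcher purely to illustrate the computation being parallelized (it is not the paper's GPU algorithm).

```python
import numpy as np

def full_search_sad(cur_block, ref_frame, top, left, search_range):
    """Exhaustive block matching: return the motion vector (dy, dx) that
    minimizes the Sum of Absolute Differences within +/- search_range."""
    bh, bw = cur_block.shape
    best = (None, np.inf)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + bh > ref_frame.shape[0] or x + bw > ref_frame.shape[1]:
                continue  # candidate block falls outside the reference frame
            cand = ref_frame[y:y + bh, x:x + bw]
            sad = np.abs(cur_block.astype(np.int32) - cand.astype(np.int32)).sum()
            if sad < best[1]:
                best = ((dy, dx), sad)
    return best

# Toy data: the current block is copied from the reference, so the search
# should find it with SAD = 0 at offset (2, 4).
ref = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
cur = ref[10:26, 12:28].copy()
print(full_search_sad(cur, ref, top=8, left=8, search_range=8))
```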

17:30
Optimizing GPU code for CPU execution using OpenCL and vectorization: a case study on image coding

ABSTRACT. Although OpenCL (Open Computing Language) aims to achieve portability at the code level, different hardware platforms require different approaches in order to extract the best performance from OpenCL-based code. In this work, we take an image encoder originally tuned for OpenCL on GPUs (OpenCL-GPU) and optimize it for multi-CPU platforms. We produce two OpenCL-based versions: i) a regular one (OpenCL-CPU) and ii) a CPU vector-based one (OpenCL-CPU-Vect). CPU vectorization exploits OpenCL support, making it much simpler than coding directly with SIMD instructions such as SSE and AVX. Globally, while the OpenCL-GPU version is the fastest when run on a high-end GPU, requiring around 580 seconds to encode the Lenna image, its performance drops roughly 65% when run unchanged on a multicore CPU machine. Regarding the versions tuned for CPU, the OpenCL-CPU version encodes the Lenna image in 805 seconds, while the vectorization-based approach executes the same operation in 672 seconds. The results show that meaningful performance gains can be achieved by tailoring the OpenCL code to the CPU, and that the use of CPU vectorization instructions through OpenCL is both relatively simple and rewarding in terms of performance.
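To illustrate the kind of rewrite that exposes CPU vector units through OpenCL (these are generic kernels, not the paper's encoder code), the sketch below contrasts a scalar kernel with a float4-vectorized variant that lets a CPU OpenCL runtime map the work onto SSE/AVX lanes; the strings could be built and launched with any OpenCL host API.

```python
# Scalar kernel: one work-item processes one pixel.
SCALAR_KERNEL = """
__kernel void scale(__global const float *src, __global float *dst, float gain) {
    int i = get_global_id(0);
    dst[i] = gain * src[i];
}
"""

# Vectorized kernel: one work-item processes four pixels via float4, so the
# CPU OpenCL runtime can issue the multiplication as a single SIMD operation
# (the global work size shrinks by a factor of 4 accordingly).
VECTOR_KERNEL = """
__kernel void scale4(__global const float4 *src, __global float4 *dst, float gain) {
    int i = get_global_id(0);
    dst[i] = gain * src[i];
}
"""
```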

18:00
Efficient Parallel Algorithm for Optimal DAG Structure Search on Parallel Computer with Torus Network

ABSTRACT. The optimal directed acyclic graph (DAG) search problem consists of searching for a DAG with a minimum score, where the score of a DAG is defined on its structure. Because the number of candidate structures grows super-exponentially with the number of vertices, it is not feasible to solve large instances using a single processor. Some parallel algorithms have therefore been developed to solve larger instances. A recently proposed parallel algorithm can solve an instance of 33 vertices, the largest solved size reported thus far. In the study presented in this paper, we developed a novel parallel algorithm designed specifically to operate on a parallel computer with a torus network. Our algorithm crucially exploits the torus network structure, thereby obtaining good scalability. Through computational experiments, we confirmed that a run of our proposed method using up to 20,736 cores showed a parallelization efficiency of 0.94 compared to a 1,296-core run. Finally, we successfully computed an optimal DAG structure for an instance of 36 vertices, which is the largest solved size reported in the literature.
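For orientation, a serial version of exact DAG structure search can be written as dynamic programming over variable subsets, as in the hedged Python sketch below; local_score is a hypothetical stand-in for a real decomposable score, and the exponential blow-up over subsets is precisely what motivates the parallel, torus-aware algorithm.

```python
from itertools import combinations

def local_score(v, parents):
    """Hypothetical decomposable score (e.g. BIC/BDeu) of variable v with the
    given parent set; a real implementation would compute it from data."""
    return -0.1 * len(parents) - abs(v - len(parents))   # toy stand-in

def optimal_dag(variables):
    """Exact structure search by dynamic programming over variable subsets
    (exponential in the number of variables)."""
    # 1. For every variable v and candidate set C, the best parent set P <= C.
    best_ps = {v: {} for v in variables}
    for v in variables:
        others = [u for u in variables if u != v]
        for r in range(len(others) + 1):
            for c in combinations(others, r):
                best_ps[v][frozenset(c)] = max(
                    ((local_score(v, p), frozenset(p))
                     for k in range(r + 1) for p in combinations(c, k)),
                    key=lambda t: t[0])
    # 2. Best "sink" ordering: process subsets in increasing size.
    best = {frozenset(): (0.0, [])}
    for r in range(1, len(variables) + 1):
        for s in combinations(variables, r):
            sub = frozenset(s)
            best[sub] = max(
                ((best[sub - {v}][0] + best_ps[v][sub - {v}][0],
                  best[sub - {v}][1] + [(v, sorted(best_ps[v][sub - {v}][1]))])
                 for v in sub),
                key=lambda t: t[0])
    return best[frozenset(variables)]

# Score and (variable, parent set) assignments of an optimal 4-vertex DAG.
print(optimal_dag([0, 1, 2, 3]))
```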

16:30-18:30 Session 10B: TAPEMS (II)
Location: Room Beatriz II
16:30
A parallel model for heterogeneous cluster

ABSTRACT. The LogP model was introduced to measure the effects of latency, occupancy and bandwidth on distributed memory multiprocessors. The idea was to characterize a distributed memory multiprocessor using these key parameters and to study their impact on performance in simulation environments. This work proposes a new model, based on LogP, that describes the performance impact of applications executing on a heterogeneous cluster. In the near future, this model can be used to help choose the best way to split a parallel application to be executed on such an architecture. The model considers that a heterogeneous cluster is composed of distinct types of processors, accelerators and networks.
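A minimal sketch of how LogP-style parameters can be evaluated per resource type in a heterogeneous cluster is shown below; the parameter values and resource names are invented for illustration and are not taken from the paper.

```python
# Per-resource LogP-style parameters (latency L, overhead o, gap g), all in
# microseconds; the figures are illustrative, not measured.
PARAMS = {
    "cpu_node":  {"L": 5.0,  "o": 1.0, "g": 0.5},
    "gpu_node":  {"L": 8.0,  "o": 2.0, "g": 0.8},
    "slow_link": {"L": 50.0, "o": 1.5, "g": 4.0},
}

def msg_time(n_msgs, L, o, g):
    """LogP estimate for sending n_msgs pipelined small messages:
    per-message injection gap plus one end-to-end latency and two overheads."""
    return o + (n_msgs - 1) * max(g, o) + L + o

def exchange_time(kind, n_msgs):
    """Estimated exchange time over a given resource type of the cluster."""
    return msg_time(n_msgs, **PARAMS[kind])

# Compare the same 100-message exchange over two parts of a heterogeneous cluster.
print(exchange_time("cpu_node", 100), exchange_time("slow_link", 100))
```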

17:00
Formalizing Data Locality in Task Parallel Applications

ABSTRACT. Task-based programming provides programmers with an intuitive abstraction to express parallelism, and runtimes with the flexibility to adapt scheduling and load balancing to the hardware. Although many profiling tools have been developed to understand these characteristics, the interplay between task scheduling and data reuse in the cache hierarchy has not been explored. These interactions are particularly intriguing because of the flexibility task-based runtimes have in scheduling tasks, which may allow them to improve cache behaviour. This work presents StatTask, a novel statistical cache model that can predict cache behaviour for arbitrary task schedules and cache sizes from a single execution, without programmer annotations. StatTask enables fast and accurate modeling of data locality in task-based applications for the first time. We demonstrate the potential of this new analysis for scheduling by examining applications from the BOTS benchmark suite and identifying several important opportunities for reuse-aware scheduling.
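The core ingredient of such a statistical cache model is the reuse (stack) distance of each memory access; the hedged sketch below computes reuse distances for a toy reference trace and estimates an LRU hit ratio for two different task orderings, illustrating why scheduling affects cache behaviour (it is not StatTask itself).

```python
from collections import OrderedDict

def reuse_distances(trace):
    """Stack (reuse) distances of a memory-reference trace: for each access,
    the number of distinct addresses touched since its previous use."""
    stack = OrderedDict()              # most recently used address is last
    dists = []
    for addr in trace:
        if addr in stack:
            keys = list(stack)
            dists.append(len(keys) - 1 - keys.index(addr))
            del stack[addr]
        else:
            dists.append(float("inf"))  # cold miss
        stack[addr] = None
    return dists

def hit_ratio(trace, cache_lines):
    """Fully-associative LRU estimate: an access hits iff its reuse distance
    is smaller than the cache size (in lines)."""
    d = reuse_distances(trace)
    return sum(x < cache_lines for x in d) / len(d)

# Two toy task schedules touching the same data in different orders.
schedule_a = [0, 1, 2, 3, 0, 1, 2, 3]
schedule_b = [0, 0, 1, 1, 2, 2, 3, 3]
print(hit_ratio(schedule_a, cache_lines=2), hit_ratio(schedule_b, cache_lines=2))
```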

17:30
OTFX: An In-memory Event Tracing Extension to the Open Trace Format 2

ABSTRACT. In event-based performance analysis, the amount of collected data is one of the most pressing challenges. It can massively slow down application execution, overwhelm the underlying file system and introduce significant measurement bias due to intermediate memory buffer flushes. To address these issues, we propose an in-memory event tracing approach that dynamically adapts the volume of application events to an amount that is guaranteed to fit into a single memory buffer, thereby avoiding file interaction entirely. The concepts include runtime filtering, enhanced encoding techniques and novel strategies for runtime event reduction, as well as the hierarchical memory buffer, a multi-dimensional, hierarchical data structure that allows these concepts to be realized with minimal overhead. We demonstrate the capabilities of our concepts with a prototype implementation called OTFX, based on the Open Trace Format 2, a state-of-the-art open source tracing library used by the performance analyzers Vampir, Scalasca, and TAU.
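As a crude stand-in for the runtime filtering and event-reduction strategies described (not the actual OTFX data structures), the following sketch keeps an event trace within a fixed-size memory buffer by progressively thinning both stored and future events instead of flushing to disk.

```python
class InMemoryTraceBuffer:
    """Minimal sketch of an in-memory event buffer with fixed capacity:
    when full, it halves its effective resolution by dropping every second
    stored event and filtering future events accordingly, rather than
    flushing to the file system."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = []
        self.keep_every = 1        # current reduction factor
        self._counter = 0

    def record(self, event):
        self._counter += 1
        if self._counter % self.keep_every:
            return                          # filtered out at runtime
        if len(self.events) >= self.capacity:
            self.events = self.events[::2]  # reduce already stored events
            self.keep_every *= 2            # and thin future ones to match
        self.events.append(event)

buf = InMemoryTraceBuffer(capacity=1000)
for i in range(10_000):
    buf.record(("enter_region", i))
print(len(buf.events), "events kept, reduction factor", buf.keep_every)
```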

18:00
Tuning the Blocksize for Dense Linear Algebra Factorization Routines with the Roofline Model

ABSTRACT. The optimization of dense linear algebra operations is a fundamental task in the solution of many scientific computing applications. The Roofline Model (RM) is a tool that provides an estimate of the performance that a computational kernel can attain on a hardware platform. The RM can therefore be used to investigate whether a computational kernel can be further accelerated. We present an approach, based on the RM, to optimize the algorithmic parameters of dense linear algebra kernels. In particular, we perform a basic analysis to identify the optimal values for the kernel parameters. As a proof of concept, we apply this technique to optimize a blocked algorithm for matrix inversion via Gauss-Jordan elimination. In addition, we extend the technique to multi-block computational kernels. An experimental evaluation validates the method and shows its usefulness. We remark that the results obtained can be extended to other computational kernels similar to Gauss-Jordan elimination, such as matrix factorizations and the solution of linear least squares problems.
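A hedged sketch of the idea: the Roofline bound is the minimum of the compute roof and bandwidth times arithmetic intensity, so sweeping the blocksize through an assumed intensity model shows where a blocked kernel stops being memory bound; the machine numbers and the intensity formula below are illustrative only.

```python
def attainable_gflops(arith_intensity, peak_gflops, peak_bw_gbs):
    """Roofline Model: performance is bounded by the compute roof or by
    memory bandwidth times arithmetic intensity (flops per byte)."""
    return min(peak_gflops, peak_bw_gbs * arith_intensity)

# Illustrative machine: 500 GFLOPS peak compute, 50 GB/s memory bandwidth.
PEAK, BW = 500.0, 50.0

def blocked_intensity(b):
    """Assumed intensity model for a blocked kernel: O(b^3) flops over
    O(b^2) bytes moved per block (8-byte doubles); the exact constants
    depend on the kernel being tuned."""
    return (2 * b ** 3) / (3 * 8 * b ** 2)

# Sweep candidate blocksizes and report the Roofline-attainable performance.
for b in (16, 32, 64, 128, 256):
    print(b, attainable_gflops(blocked_intensity(b), PEAK, BW))
```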

16:30-18:30 Session 10C: SCDT (III)
Location: Room Beatriz I
16:30
Hardware-Specific Selection of the Fastest-Running Software Components
SPEAKER: Alexey Sidnev

ABSTRACT. Software development problems include, in particular, the selection of the fastest-running software components among those available. In this paper, we propose to address this problem by developing a prediction model that estimates software component runtime. Such a model is built as a function of algorithm parameters and computational system characteristics. We also study which of these features are the most representative. As a result of these studies, a two-stage scheme for building the prediction model, based on linear and non-linear machine learning algorithms, has been formulated. The paper presents a comparative analysis of runtime prediction results for several linear algebra problems on 84 personal computers and servers. The proposed approach shows an error of less than 22% for the computational systems represented in the training data set.
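One plausible reading of such a two-stage scheme (the details below are assumptions, not the paper's exact models) is a linear regression on hand-crafted features followed by a non-linear model on its residuals, sketched here with scikit-learn on synthetic data and evaluated in-sample for brevity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic training set: each row mixes an algorithm parameter (matrix size n)
# with machine characteristics (clock GHz, cores, memory bandwidth GB/s).
X = rng.uniform([500, 2.0, 2, 20], [5000, 4.0, 32, 100], size=(300, 4))
base = X[:, 0] ** 3 / (X[:, 1] * X[:, 2] * 1e9)     # idealized O(n^3) runtime
runtime = base * (1 + rng.normal(0, 0.05, 300))     # plus 5% noise

# Stage 1: a linear model on a hand-crafted feature captures the main trend.
stage1 = (X[:, 0] ** 3 / (X[:, 1] * X[:, 2])).reshape(-1, 1)
lin = LinearRegression().fit(stage1, runtime)

# Stage 2: a non-linear model learns what the linear stage leaves unexplained.
resid = runtime - lin.predict(stage1)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, resid)

pred = lin.predict(stage1) + rf.predict(X)
print("mean relative error: %.1f%%" % (100 * np.mean(np.abs(pred - runtime) / runtime)))
```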

16:50
Educational and Research Systems for Evaluating the Efficiency of Parallel Computations
SPEAKER: Victor Gergel

ABSTRACT. In this paper we consider educational and research systems that can be used to estimate the efficiency of parallel computing. ParaLab allows parallel computation methods to be studied. With the ParaLib library, we can compare parallel programming languages and technologies. The Globalizer Lab system is capable of estimating the efficiency of algorithms for solving computationally intensive global optimization problems. These systems can build models of various high-performance systems, formulate the problems to be solved, perform computational experiments in simulation mode and analyze the results. Crucially, the described systems support a visual representation of the parallel computation process. Combined, these systems can be useful for developing high-performance parallel programs that take the specific features of modern supercomputing systems into account.

17:10
Workshop Closing
16:30-18:30 Session 10D: UCER (II)
Location: Room Beatriz III
16:30
Exploring a Distributed Iterative Reconstructor based on Split Bregman using PETSc

ABSTRACT. The proliferation in recent years of iterative algorithms for Computed Tomography is a result of the need to find new ways of obtaining high-quality images using low-dose acquisition methods. These iterative algorithms are, in many cases, computationally much more expensive than traditional analytic ones. Based on the resolution of large linear systems, they normally make use of backprojection and projection operators in an iterative way, reducing their performance compared to traditional algorithms. They also rely on a large amount of memory because they need to work with large coefficient matrices. As the resolution of the available detectors increases, the size of these matrices becomes unmanageable on standard workstations. In this work we propose a distributed solution of an iterative reconstruction algorithm with the help of the PETSc library. Our preliminary results show the good scalability of the solution on one node (close to the ideal one) and the possibilities offered by a larger number of nodes. However, when the number of nodes increases, performance degrades due to the poor scalability of some fundamental pieces of the algorithm as well as the increase in the time spent in both MPI communication and reduction operations.

17:00
I/O-focused Cost Model for the Exploitation of Public Cloud Resources in Data-Intensive Workflows

ABSTRACT. Ultrascale computing systems will blur the line between HPC and cloud platforms, transparently offering the end user every available computing resource, independently of its characteristics, location, and philosophy. However, this horizon is still far from being reached. In this work, we propose a model for calculating the costs related to the deployment of data-intensive applications on IaaS cloud platforms. The model focuses especially on I/O-related costs in data-intensive applications and on the evaluation of alternative I/O solutions. This paper also evaluates the cost differences between a typical cloud storage service and our proposed in-memory I/O accelerator, Hercules, showing great flexibility potential in the price/performance trade-off. With Hercules, execution time reductions of up to 25% are achieved in the best case, while costs remain similar to those of Amazon S3.
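A minimal sketch of the kind of cost model involved, with placeholder prices rather than actual provider rates: compute, storage and per-request I/O costs are combined so that a faster in-memory I/O layer (fewer billed VM hours, no request charges) can be compared against an S3-like service.

```python
def workflow_cost(vm_hours, vm_price_h, data_gb, io_requests,
                  storage_price_gb_month, request_price_per_1000,
                  months_stored=1):
    """Illustrative IaaS cost model for a data-intensive workflow: compute
    cost plus object-storage cost plus per-request I/O cost. All prices are
    placeholders, not actual Amazon rates."""
    compute = vm_hours * vm_price_h
    storage = data_gb * storage_price_gb_month * months_stored
    requests = io_requests / 1000.0 * request_price_per_1000
    return compute + storage + requests

# Same workflow served by object storage vs. an in-memory I/O layer that
# shortens execution (fewer billed VM hours) at the price of larger VMs.
s3_like   = workflow_cost(100, 0.20, 500, 2_000_000, 0.023, 0.005)
in_memory = workflow_cost(75, 0.28, 500, 0, 0.023, 0.005)   # 25% faster run
print(s3_like, in_memory)
```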

17:30
Geocon: A Middleware for Location-aware Ubiquitous Applications

ABSTRACT. A core functionality of any location-aware ubiquitous system is storing, indexing, and retrieving information about the entities that are commonly involved in these scenarios, such as users, places, events and other resources. The goal of this work is to design and provide the prototype of a service-oriented middleware, called Geocon, which can be used by mobile application developers to implement such functionality. In order to represent information about users, places, events and resources of mobile location-aware applications, Geocon defines a basic metadata model that can be extended to match most application requirements. The middleware includes a geocon-service for storing, searching and selecting metadata about users, resources, events and places of interest, and a geocon-client library that allows mobile applications to interact with the service through the invocation of local methods. The paper describes the metadata model and the components of the Geocon middleware. A prototype of Geocon is available at https://github.com/SCAlabUnical/Geocon.
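A hedged sketch of how a mobile client might talk to such a service is given below; the endpoint paths, field names and deployment URL are assumptions for illustration and are not taken from the actual Geocon API (see the GitHub repository for the real prototype).

```python
import requests

# Hypothetical deployment of the geocon-service; not the project's real URL.
BASE_URL = "http://localhost:8080/geocon"

# Register a place of interest using an assumed metadata representation.
place = {
    "type": "place",
    "name": "Room Beatriz III",
    "location": {"lat": 37.55, "lon": -4.73},
}
requests.post(f"{BASE_URL}/metadata", json=place, timeout=5)

# Query for nearby events; parameter names are assumptions for illustration.
nearby = requests.get(
    f"{BASE_URL}/metadata",
    params={"type": "event", "lat": 37.55, "lon": -4.73, "radius_m": 500},
    timeout=5,
).json()
print(nearby)
```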

16:30-18:30 Session 10E: SUT4Coaching (II)
Location: Room Beatriz IV
16:30
Study of Wearable And 3D-Printable Vibration-Based Energy Harvesters

ABSTRACT. There has been a significant increase in research on vibration-based energy harvesting in recent years. The aim of this study is to present a vibration-based harvester whose structure is as 3D-printable as possible. The harvester has an electromagnetic power generator and is designed to convert body motion vibration energy into electrical power. The structure of the generator has been 3D printed and is designed to contain the different parts of the generator. The mechanism consists of a 3D-printed spiral, a magnet and a non-magnetic bar. The magnet movement is defined by the spiral and controlled by a cylindrical structure on top of the complete design. Its performance is measured in vibration tests, and the investigation shows that the designed device works with low-frequency vibrations in the 8-16 Hz range.

17:00
Activity Recognition in a Home Setting using Off the Shelf Smart Watch Technology

ABSTRACT. Being able to detect in real time the activity performed by a user in a home setting provides highly valuable context. It can allow more effective use of novel technologies in a large variety of applications, from comfort and safety to energy efficiency, remote health monitoring and assisted living. In a home setting, activity recognition has traditionally been studied based on either a large sensor network infrastructure already set up in the home, or a network of wearable sensors attached to various parts of the user's body. We argue that both approaches suffer considerably in terms of practicality and propose instead the use of commercial off-the-shelf smart watches already owned by the users. We test the feasibility of this approach with two smart watches of very different capabilities, on a variety of activities performed daily in a domestic environment, from brushing teeth to preparing food. Our experimental results are encouraging: using standard Support Vector Machine based classification, accuracy rates range between 88% and 100%, depending on the type of smart watch and the window size chosen for data segmentation.
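A minimal sketch of this kind of classification pipeline, using synthetic accelerometer-like data instead of real smart-watch recordings: overlapping windows are reduced to simple per-axis statistics and fed to a Support Vector Machine (scikit-learn assumed).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

def window_features(acc, window=128, step=64):
    """Segment a 3-axis accelerometer stream into overlapping windows and
    extract simple per-axis statistics (mean, std, min, max)."""
    feats = []
    for start in range(0, len(acc) - window + 1, step):
        w = acc[start:start + window]
        feats.append(np.concatenate([w.mean(0), w.std(0), w.min(0), w.max(0)]))
    return np.array(feats)

# Synthetic stand-in for smart-watch data: two "activities" with different
# motion energy (real data would come from the watch's accelerometer).
rng = np.random.default_rng(1)
brushing = rng.normal(0, 2.0, (5000, 3))
resting  = rng.normal(0, 0.2, (5000, 3))
X = np.vstack([window_features(brushing), window_features(resting)])
y = np.array([0] * (len(X) // 2) + [1] * (len(X) // 2))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))
```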

17:30
The Effectiveness of Upward and Downward Social Comparison of Physical Activity in an Online Intervention
SPEAKER: Julia Mollee

ABSTRACT. It has been established that social processes play an important role in achieving and maintaining a healthy lifestyle, but there are still gaps in the knowledge of how to apply such processes in behavior change interventions. One of these mechanisms is social comparison, i.e. the tendency to self-evaluate by comparing oneself to others. Social comparison can be either downward or upward, depending on whether individuals compare themselves to a target that performs worse or better. Depending on personal preferences, the two variants can have beneficial or adverse effects. In this paper, we present the results of an experiment in which participants (who indicated a preference for either upward or downward comparison) were sequentially shown both directions of social comparison, in order to influence their physical activity levels. The results show that presenting users with the type of social comparison they do not prefer may indeed be counterproductive. It is therefore important to take this risk into account when designing physical activity promotion programs with social comparison features.

18:00
First Approach to Automatic Performance Status Evaluation and Physical Activity Recognition in Cancer Patients

ABSTRACT. The evaluation of cancer patients' recovery is still highly subjective and depends largely on the physician. Many systems have been successfully implemented for physical activity evaluation; nonetheless, there is still a large gap in Performance Status evaluation with the ECOG and Karnofsky Performance Status scores. An automatic data collection system based on an Android smartphone and wearables has been developed. A gamification scheme has been designed to increase patients' motivation during their recovery. Furthermore, novel algorithms for Performance Status (PS) and Physical Activity (PA) assessment have been developed to help oncologists in their diagnoses.