NPC 2016: THE 13TH IFIP INTERNATIONAL CONFERENCE ON NETWORK AND PARALLEL COMPUTING
PROGRAM FOR FRIDAY, OCTOBER 28TH


10:00-10:20 Coffee Break
10:20-12:00 Session 3: Regular Paper: Models and Systems
10:20
Toward A Parallel Turing Machine Model
SPEAKER: unknown

ABSTRACT. In the field of parallel computing, the late leader Ken Kennedy raised an alarming question in the early 1990s: "Is Parallel Computing Dead?" In recent years, we have witnessed the tremendous momentum of the "second spring" of parallel computing. But what lessons should we learn from the history of the field as it climbs out of its low point? To this end, this paper examines the disappointing state of work on parallel Turing machine models over the past 50 years of parallel computing research. The lack of a solid yet intuitive parallel Turing machine model will continue to be a serious challenge. This paper presents an attempt to address that challenge by proposing a parallel Turing machine model -- the PTM model.

10:45
dCompaction: Delayed Compaction for LSM-tree
SPEAKER: Fengfeng Pan

ABSTRACT. Key-value (KV) stores have become a backbone of large-scale applications in today's data centers. Write-optimized data structures like the LSM-tree (Log-Structured Merge-tree) and its variants are widely used in KV storage systems like BigTable and RocksDB. A conventional LSM-tree organizes KV items into multiple, successively larger components, and uses compaction to push KV items from one smaller component to the adjacent larger component until they reach the largest component. Unfortunately, the current compaction scheme incurs significant write amplification due to repeated reads and writes of KV items, which results in poor throughput. We propose a new compaction scheme, delayed compaction (dCompaction), that decreases write amplification. dCompaction postpones parts of compactions and gathers them into a following compaction. In this way, it avoids redundant reads and writes of KV items during compaction, and consequently improves the throughput of LSM-tree based KV stores. We implement dCompaction on RocksDB and conduct extensive experiments. Validation using the YCSB framework shows that, compared with RocksDB, dCompaction achieves about 30% higher write performance with comparable read performance.
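
(A minimal sketch of the delayed-compaction idea as described in the abstract; the toy Level class and merge logic below are illustrative, not RocksDB's or the paper's actual code.)

    # Toy LSM level: a normal compaction would merge an upper component into
    # this level immediately; dCompaction-style deferral instead records the
    # pending merge and folds it into this level's next compaction, saving a
    # round of KV reads and writes.
    class Level:
        def __init__(self):
            self.items = {}        # materialized KV items of this component
            self.pending = []      # merges deferred by delayed compaction

        def delayed_compact(self, upper_items):
            self.pending.append(upper_items)   # defer: no I/O performed now

        def compact_into(self, next_level):
            merged = dict(self.items)
            for batch in self.pending:         # fold deferred batches in one pass
                merged.update(batch)           # newer items override older ones
            self.pending.clear()
            next_level.delayed_compact(merged) # push down, again deferred
            self.items = {}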

11:10
FOG: A Fast Out-of-core Graph Processing Framework
SPEAKER: unknown

ABSTRACT. In this paper we present FOG, an open-source graph processing framework designed for out-of-core (external memory) graph processing. FOG provides a set of programming interfaces that break the update functions of vertices down to their incident edges, so that the functions are processed in an edge-centric manner. In this way, FOG offers intuitive and productive programming interfaces while achieving high main-memory utilization and processing efficiency at the same time. Moreover, FOG proposes an in-place update shuffling mechanism that improves performance by dramatically reducing disk I/O during computation. Through extensive evaluations on typical graph algorithms and large real-world graphs, we show that FOG outperforms existing out-of-core graph processing systems, including GraphChi, X-Stream and TurboGraph. Comparing FOG with state-of-the-art distributed graph processing frameworks, we show that, using just a commodity PC, FOG achieves comparable or even better performance than the best distributed graph processing framework running on an Amazon EC2 cluster with 128 nodes.
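
(A sketch of an edge-centric interface in the spirit the abstract describes, using a PageRank-style update; the function names are illustrative, not FOG's actual API.)

    # The vertex update is broken down to incident edges: scatter emits a
    # contribution per out-edge, gather folds contributions into the target.
    # An out-of-core engine can then stream edges sequentially from disk.
    def scatter(src_value, out_degree):
        return src_value / out_degree          # contribution along one edge

    def gather(acc, contribution):
        return acc + contribution              # fold into destination vertex

    def edge_centric_step(rank, out_deg, edges, damping=0.85):
        acc = {v: 0.0 for v in rank}
        for src, dst in edges:                 # sequential edge streaming
            acc[dst] = gather(acc[dst], scatter(rank[src], out_deg[src]))
        n = len(rank)
        return {v: (1 - damping) / n + damping * acc[v] for v in rank}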

11:35
Introspection-based Memory Pruning for Fast and Effective Live VM Migration
SPEAKER: unknown

ABSTRACT. Virtual Machine (VM) migration is an appealing technique on today's cloud platforms for achieving high availability, load balancing and power saving. Unfortunately, migrating a VM involves transferring a large amount of data, which imposes high overhead on network traffic and consequently results in significant application performance degradation. In this paper, we propose an introspection-based memory pruning method for fast and effective live VM migration. First, we classify memory pages into five categories -- anonymous, inode, kernel, free and cache pages -- according to how they are used by the OS. Then, upon migration, we drop the free pages, whose contents are insignificant, and the cache pages, which are redundant. In this way, a large amount of unnecessary data is precluded and the migration time is reduced accordingly. Our system classifies memory pages into these categories precisely using introspection. Besides cache pages, we also eliminate pages that were used earlier but have since been freed; this differs from most prior work, which only eliminates free pages marked as zero pages by the OS. Experiments show that our work reduces the total migration time by 72% on average compared with the original pre-copy algorithm in QEMU/KVM.
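
(A schematic of the pruning step, assuming the page categories have already been obtained by introspection; the filter itself is our illustration, not the paper's implementation.)

    # The five categories from the abstract; free and cache pages are
    # reconstructible, so they are dropped and never put on the wire.
    SKIP = {"free", "cache"}

    def pages_to_transfer(pages):
        """pages: iterable of (page_frame_number, category) pairs."""
        return [pfn for pfn, category in pages if category not in SKIP]

    sample = [(0, "anonymous"), (1, "free"), (2, "cache"), (3, "kernel")]
    print(pages_to_transfer(sample))  # [0, 3]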

12:00-13:30 Lunch Break
13:30-15:10 Session 4: Regular Paper: Non-Volatile Memory
13:30
Efficient Management for Hybrid Memory in Managed Language Runtime
SPEAKER: unknown

ABSTRACT. Hybrid memory, which leverages the benefits of traditional DRAM and emerging memory technologies, is a promising alternative for future main memory design. However, memory access recording and page migration may incur non-trivial overhead in execution time and hardware space. Nowadays, managed-language applications are increasingly dominant on every kind of platform, and the managed runtime provides services for automatic memory management, so it is important to adapt it to the underlying hybrid memory. This paper explores two opportunities inside the managed runtime -- heap partition placement and object promotion -- for allocating hot data in the fast memory space without any access recording or data migration overhead. For heap partition placement, we quantitatively analyze the LLC miss density and performance effect of each partition. Results show that most LLC misses, especially store misses, hit in the nursery partition. Placing the nursery in fast space uses only 20% of the total memory footprint as fast space on average, yet performs within 10% of placing everything in fast space. During object promotion, hot objects are allocated to fast space directly. We develop a tool to analyze the LLC miss density of each method of a workload, since we find that LLC misses normally concentrate in a few percent of the methods. The objects visited by the top-ranked methods are recognized as hot. Results show that hot objects do have higher access density -- more than 3 times that of a random distribution for SPECjbb and pmd -- and placing them in fast space further reduces execution time by 6% and 13%, respectively.

13:55
Application-Based Coarse-Grained Incremental Checkpointing Based on Non-volatile Memory
SPEAKER: unknown

ABSTRACT. The MTTF (Mean Time To Failure) continues to decrease as computing systems scale up. To maintain the reliability of computing systems, checkpoints have to be taken more frequently. Incremental checkpointing is a well-researched technique that makes frequent checkpointing possible: it monitors changes to memory blocks and copies only the modified portion to persistent memory. Most implementations of incremental checkpointing monitor memory data at fine granularity. Fine-grained incremental checkpointing minimizes the checkpoint size but suffers from significant monitoring overhead (accounting for 70% of the total overhead). We observe the memory access patterns of applications at page granularity and find that the size of a contiguous memory region visited by an application tends to be proportional to the size of the corresponding memory allocation. Based on this observation, we propose an Application-Based Coarse-Grained Incremental Checkpointing (ACCK) mechanism that leverages prior information about memory allocation to relax the memory monitoring granularity in an incremental and appropriate way. This provides better opportunities for balancing the tradeoff between monitoring overhead and copying overhead in incremental checkpointing. ACCK is also assisted by huge pages to alleviate the TLB overhead. Our experimental results show that ACCK delivers a 2.56x performance improvement over baseline page-grained incremental checkpointing.
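
(A sketch of coarse-grained write monitoring as we read the abstract: dirtiness is tracked per allocated region rather than per page, so one fault marks a whole allocation; the data structures are illustrative.)

    # One write-protection fault per region marks the whole allocation dirty,
    # so a large allocation costs one fault instead of one fault per page.
    class Region:
        def __init__(self, start, size):
            self.start, self.size = start, size
            self.dirty = False

    def on_write_fault(regions, addr):
        for r in regions:                    # find the enclosing allocation
            if r.start <= addr < r.start + r.size:
                r.dirty = True               # coarse-grained: whole region
                break

    def take_checkpoint(regions, copy_fn):
        for r in regions:
            if r.dirty:
                copy_fn(r.start, r.size)     # copy modified region to NVM
                r.dirty = False              # re-protect until next write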

14:20
Exploiting Cross-Layer Hotness Identification to Improve Flash Memory System Performance
SPEAKER: unknown

ABSTRACT. Flash memory has been widely deployed in modern storage systems. However, density improvement and technology scaling decrease its endurance and I/O performance, which motivates work on improving flash performance and reducing cell wear. Wear reduction can be achieved by lowering the threshold voltages, but at the cost of slower reads. In this paper, access hotness characteristics are exploited to improve read performance and endurance. First, building on the reliability characteristics of flash memory, we introduce the relationship among flash cell wear, read latency and bit error rate. Then, based on the hotness information provided by buffer management, the threshold voltages of cells holding write-hot data are decreased to reduce wear, while those for read-hot data are increased to reduce read latency. We demonstrate through simulation that the proposed technique achieves significant endurance and read performance improvements without sacrificing write throughput.
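
(A schematic of the cross-layer policy: the buffer manager's hotness label steers a per-block voltage choice. The labels and modes are illustrative stand-ins for the paper's mechanism.)

    # Write-hot blocks are rewritten often, so lower voltages cut their wear;
    # read-hot blocks are read often, so higher voltages cut their latency.
    def choose_voltage_mode(hotness):
        if hotness == "write_hot":
            return "lowered"     # less cell wear, slower reads
        if hotness == "read_hot":
            return "raised"      # lower read latency
        return "nominal"

    for h in ("write_hot", "read_hot", "cold"):
        print(h, "->", choose_voltage_mode(h))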

14:45
VIOS: A Variation-aware I/O Scheduler for Flash-based Storage Systems
SPEAKER: unknown

ABSTRACT. NAND flash memory has gained widespread acceptance in storage systems because of its superior write/read performance, shock resistance and low power consumption. I/O scheduling for Solid State Drives (SSDs) has received much attention in recent years for its ability to exploit the rich parallelism within SSDs. However, most state-of-the-art I/O scheduling algorithms are oblivious to the increasingly significant inter-block variation introduced by advanced technology scaling. This paper proposes a variation-aware I/O scheduler, VIOS, which exploits the speed variation among blocks to minimize the access conflict latency of I/O requests. VIOS schedules I/O requests into a hierarchical-batch structured queue to preferentially exploit channel-level parallelism, followed by chip-level parallelism. Moreover, the blocks of each chip are organized in a red-black tree according to their detected write speeds, and conflicting write requests are allocated to faster blocks to reduce the access conflict of waiting requests, guided by a global chip-state vector that records the current number of requests for each chip. Our study, based on extensive experiments, shows that VIOS reduces write latency significantly compared with the state-of-the-art NOOP and PIQ schedulers while attaining high read efficiency.
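
(A sketch of the dispatch rule as we read the abstract: a per-chip pending counter stands in for the global chip-state vector, and a sorted list stands in for the red-black tree of blocks ordered by detected write speed.)

    class Chip:
        def __init__(self, blocks):
            # (write_speed, block_id), kept sorted so the fastest free block
            # is at the end -- a red-black tree in the paper, a list here.
            self.blocks = sorted(blocks)
            self.pending = 0               # this chip's chip-state entry

    def dispatch_write(chips):
        chip = min(chips, key=lambda c: c.pending)  # least-conflicted chip
        speed, block = chip.blocks.pop()            # fastest available block
        chip.pending += 1
        return chip, block

    chips = [Chip([(10, "b0"), (30, "b1")]), Chip([(20, "b2")])]
    print(dispatch_write(chips)[1])   # "b1": fastest block on the chosen chip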

15:10-16:00 Session 5: Posters
15:10
A Mixed Clustering Algorithm Based on Agglomerative Clustering and DBSCAN
SPEAKER: unknown

ABSTRACT. Clustering data with varying density and complicated structure is important. Agglomerative clustering and DBSCAN can each handle this problem to some extent, but both face difficulties. We propose a new clustering algorithm that mixes agglomerative clustering and DBSCAN, inheriting their advantages while overcoming their disadvantages. Our algorithm uses DBSCAN with different parameters as a density-connected partition detector, in place of the traditional linkage criteria of agglomerative clustering (such as Ward and average linkage) at each level. Density-connected partitions are then aggregated gradually into agglomerative clusters. The mean silhouette coefficient indicates clustering performance, and a denoising function is implemented to eliminate the influence of noise and outliers. Many challenging experiments indicate that: 1) our algorithm performs well on data with widely varying densities and extremely complex structures; 2) it decides the optimal number of clusters automatically; 3) no background knowledge is needed and parameter tuning is easy.
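
(A compact sketch of the combination, using scikit-learn's DBSCAN as the density-connected partition detector at successively looser density levels; selecting one level by mean silhouette is our simplification of the full hierarchy.)

    from sklearn.cluster import DBSCAN
    from sklearn.metrics import silhouette_score

    def mixed_clustering(X, eps_levels=(0.3, 0.6, 1.2), min_samples=5):
        """X: NumPy array (n_samples, n_features); returns (labels, score)."""
        best, best_score = None, -1.0
        for eps in eps_levels:              # one density level per pass
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
            mask = labels != -1             # denoise: ignore outliers in scoring
            if len(set(labels[mask])) > 1:
                score = silhouette_score(X[mask], labels[mask])
                if score > best_score:      # keep the best-scoring level
                    best, best_score = labels, score
        return best, best_score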

15:10
CSMLRC: Novel Erasure Codes with Optimal Local Multi-group Repair in Cloud Storage
SPEAKER: unknown

ABSTRACT. To achieve reliability in cloud storage systems, erasure codes like Reed-Solomon (RS) codes are widely adopted. RS codes have optimal storage efficiency but introduce a huge burden when repairing failed storage nodes. For high repair efficiency, a new family of erasure codes, called Locally Repairable Codes (LRCs), has recently been proposed and become popular. Prior work on LRCs mainly focuses on the fundamental theory, with very few explicit or practical coding constructions. Building on prior theoretical work, we present m-LRCs, novel explicit and practical LRCs with multiple repair groups. In this paper, we extend the coding theory of our m-LRCs and implement them together with RS codes as well as LRCs with a single repair group (s-LRCs). Specifically, we design and implement a novel cloud storage system, CSMLRC, that adopts our m-LRCs, s-LRCs and RS codes. Extensive experiments show that our m-LRCs halve the repair cost (I/O cost) compared with RS codes while delivering much higher repair performance.
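
(A toy illustration of why local repair is cheap, using a single XOR parity per local group; the paper's m-LRC construction is more general than this.)

    # One lost block is rebuilt from its own small group (survivors + local
    # parity) instead of reading k blocks as a Reed-Solomon repair would.
    def xor_blocks(blocks):
        out = bytearray(len(blocks[0]))
        for b in blocks:
            for i, byte in enumerate(b):
                out[i] ^= byte
        return bytes(out)

    group = [b"\x01\x02", b"\x04\x08", b"\x10\x20"]      # one local group
    parity = xor_blocks(group)                           # local parity block
    repaired = xor_blocks([group[0], group[2], parity])  # group[1] lost
    assert repaired == group[1]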

15:10
Dynamic-Priority based Profit-Driven Scheduling in Mobile Cloud Computing
SPEAKER: unknown

ABSTRACT. Mobile cloud computing is emerging as a promising way to enlarge the capabilities of mobile devices through computation offloading. One of the critical challenges faced by mobile cloud providers today is how to increase the profitability of their cloud services. In this paper, we address the problem of scheduling parallelizable computation jobs offloaded by mobile users to a public cloud so as to maximize the cloud provider's profit. We propose an efficient dynamic-priority based, profit-driven scheduling mechanism. We first compute the inherent profitability of a job based on marginal-effect theory and use it as the initial priority. Then, we update the priorities of waiting jobs as time elapses, using the proposed time-dependent dynamic-priority model. Numerical experiments and simulations show that our approach is efficient at scheduling such jobs in cloud data centers external to mobile devices, and that considerable profit improvements can be achieved by the proposed dynamic-priority based scheduling mechanism.
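
(A sketch of the two-stage priority scheme as we read it: an initial profit-per-time priority plus a boost that grows while a job waits. The linear aging term is an illustrative stand-in for the paper's time-dependent model.)

    def initial_priority(profit, exec_time):
        return profit / exec_time            # marginal profitability of the job

    def dynamic_priority(base, wait_time, aging=0.01):
        return base + aging * wait_time      # priority rises as the job waits

    def pick_next(jobs, now):
        """jobs: list of dicts with 'profit', 'exec_time' and 'arrival'."""
        return max(jobs, key=lambda j: dynamic_priority(
            initial_priority(j["profit"], j["exec_time"]), now - j["arrival"]))

    jobs = [{"profit": 10, "exec_time": 5, "arrival": 0},
            {"profit": 4, "exec_time": 1, "arrival": 8}]
    print(pick_next(jobs, now=10))   # the job with higher profit per second wins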

15:10
Persistent-Process: A Fault Tolerant Process Model Based on Persistent Memory
SPEAKER: unknown

ABSTRACT. Current High Performance Computing (HPC) systems are facing a reliability wall: a large share of computational resources is dedicated to achieving fault tolerance, which leads to poor overall system utilization. New persistent memory technologies (e.g., Phase Change Memory, STT-RAM) provide fast access, byte-addressability and non-volatility. These features bring new opportunities for fault tolerance. This paper aims to achieve fault tolerance transparently at the whole-process level based on persistent memory. We propose persistent-process, a new fault-tolerant process model. A persistent-process runs directly on persistent memory, and we introduce a DRAM buffer to store intermediate states and accelerate access to persistent memory. For user-space data, we introduce shadow update to update persistent data atomically. For data structures in the operating system kernel, we introduce a novel record-and-replay mechanism to manage and recover them. Experimental results show that persistent-process achieves fault tolerance with little overhead (less than 20%) and remains resilient as program scale increases.
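
(A file-based analogy for the shadow-update idea: build the new version off to the side, make it durable, then switch atomically. The paper applies this to persistent memory, not files; the names here are illustrative.)

    import json, os

    def shadow_update(path, new_state):
        shadow = path + ".shadow"
        with open(shadow, "w") as f:
            json.dump(new_state, f)    # write the new version to the shadow
            f.flush()
            os.fsync(f.fileno())       # make the shadow durable first
        os.replace(shadow, path)       # atomic switch: old or new, never a mix

    shadow_update("state.json", {"iteration": 42})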

15:10
S-Storm: A Slot-Aware Scheduling Strategy for Even Scheduler in Storm
SPEAKER: unknown

ABSTRACT. Storm is a popular distributed real-time computation system for stream data processing. It currently provides an even scheduler that distributes all executors and workers of a topology among the worker nodes. In this paper, we find that the even scheduler ignores the allocation of, and dependences among, slots, and causes load imbalance when a topology fails and is killed by its user, or when new machines are added to the cluster. To solve these problems, we design S-Storm, a slot-aware scheduling strategy for the even scheduler in Storm, which achieves a fine-grained EvenScheduler using a slot-aware sorting queue and a merger factor. S-Storm has the following desirable features: 1) it evenly allocates slots for multiple topologies in a load-balanced cluster; 2) when load imbalance happens, it distributes workers to slots on lightly loaded worker nodes; 3) it achieves a fine-grained EvenScheduler by adding one slot at a time, from the worker node with the lowest load, to the slot-aware sorting queue; 4) it uses the merger factor to avoid over-competition for the same resources among identical spout/bolt tasks on a machine. We implemented S-Storm on Storm 0.10.0 and used two different applications for performance evaluation. Extensive experimental results show that, compared with Storm, S-Storm achieves better average processing time and throughput while recovering balance from load imbalance.

15:10
An RCP-Based Multi-Path Congestion Control Mechanism in Named Data Networking
SPEAKER: unknown

ABSTRACT. Named data networking (NDN), a representative of next-generation network architectures, has attracted more and more attention in recent years. A congestion control mechanism is indispensable for NDN to make efficient use of link bandwidth while ensuring fairness among data streams. Much work has been done to achieve congestion control by limiting the transmission rate of interest packets, but most of it overlooks NDN's intelligent forwarding feature. In view of this, we propose a new congestion control algorithm that combines the Rate Control Protocol (RCP) with a smart forwarding strategy to achieve multi-path congestion control. We use RCP to control the transmission rate of interest packets and present a forwarding strategy that better utilizes link bandwidth. In addition, simulation results in ndnSIM show that, compared with other congestion control mechanisms, our algorithm takes better advantage of link bandwidth and enhances data transfer performance while converging to fairness quickly.

15:10
Cooperative Communication with Sliding Window Network Coding in Wireless Networks
SPEAKER: unknown

ABSTRACT. Cooperative communication is a communication paradigm that allows multiple transceivers to collaborate as a cluster for data transmission; such clustering can greatly improve transmission quality thanks to cooperative diversity. Network coding (NC) has been shown to provide significant performance improvements for cooperative communication systems in wireless networks. Sliding-window network coding is a variation of NC that suits multimedia streaming and improves data delay in wireless networks. In this paper, we propose Cooperative Communication with Sliding-Window Network Coding (CC-SWNC) for wireless networks. CC-SWNC preserves the degree distribution of the encoded packets through recombination at the nodes. CC-SWNC makes it possible to control the decoding complexity of each sliding window independently of the packets received and to recover the original data. The performance of CC-SWNC is studied using NS2 and evaluated in terms of encoding overhead, decoding delay, packet loss probability and throughput. The simulation results show that CC-SWNC significantly improves network throughput and achieves a higher diversity order.

15:10
Molecular computation based on tile assembly model: modular-multiplication and modular-square over finite field GF(2^n)
SPEAKER: Yongnan Li

ABSTRACT. With the rapid development of DNA computing, a question worth studying is how to implement the arithmetic operations used in cryptosystems on DNA computing models. This paper proposes a DNA computing model to calculate modular multiplication and modular squaring over the finite field GF(2^n). Compared with related work, the seed configuration with input parameters is easier to encode, and both the assembly time complexity and the space complexity are better. The model requires 274 types of computation tiles and 10 types of boundary tiles. Computation tiles performing 4 different functions assemble onto the seed configuration with the inputs to produce the result. We show how the computation tiles are coded bitwise and how the assembly rules work. The assembly time complexity is 3n−2 and the space complexity is 2n^2−n.
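
(A small software sketch of the arithmetic being computed -- modular multiplication and squaring over GF(2^n) -- shown for GF(2^8) with the AES polynomial. This illustrates the math only, not the paper's tile-assembly encoding.)

    def gf2n_mul(a, b, poly, n):
        """Multiply a and b as GF(2) polynomials, reduced modulo poly."""
        result = 0
        for i in range(n):                       # carry-less multiplication:
            if (b >> i) & 1:                     # XOR shifted copies of a
                result ^= a << i
        for i in range(2 * n - 2, n - 1, -1):    # reduce terms of degree >= n
            if (result >> i) & 1:
                result ^= poly << (i - n)
        return result

    def gf2n_square(a, poly, n):
        return gf2n_mul(a, a, poly, n)

    AES_POLY = 0b100011011   # x^8 + x^4 + x^3 + x + 1
    print(hex(gf2n_mul(0x57, 0x83, AES_POLY, 8)))   # 0xc1, a known test value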

15:10
An Optimization of Double Buffering in Streaming Dataflow Architectures
SPEAKER: unknown

ABSTRACT. Streaming dataflow architectures are promising for high-performance computing. Double buffering is one of the most effective mechanisms for hiding the latency of data transfers between on-chip and off-chip memory. However, swapping the two buffers across the execution of many tiles decreases performance, because the dataflow accelerator is repeatedly filled and drained. In this paper, we propose an optimization of double buffering for streaming dataflow architectures. The optimization modifies the tile control logic and the workflow program to assign tiles to the PE array without stopping the execution of the PEs, so that the PE array is filled and drained only once across the execution of all tiles belonging to the same dataflow graph. Experimental results show that the proposed optimization achieves a 16.2% average efficiency improvement over double buffering without the optimization.
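
(A generic double-buffering sketch, not the paper's accelerator logic: computation on one buffer overlaps with the transfer filling the other, and the two swap roles each tile. load_tile and compute_tile are stand-ins.)

    import threading

    def load_tile(i, buf):
        buf[:] = [i] * len(buf)            # stand-in for an off-chip transfer

    def compute_tile(buf):
        return sum(buf)                    # stand-in for PE-array work

    def run(num_tiles, tile_size=4):
        buffers = [[0] * tile_size, [0] * tile_size]
        load_tile(0, buffers[0])           # prime the first buffer
        for i in range(num_tiles):
            cur, nxt = buffers[i % 2], buffers[(i + 1) % 2]
            loader = None
            if i + 1 < num_tiles:          # prefetch the next tile...
                loader = threading.Thread(target=load_tile, args=(i + 1, nxt))
                loader.start()
            compute_tile(cur)              # ...while computing on this one
            if loader:
                loader.join()

    run(8)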

15:10
OpenMP 4.0 based Heterogeneous Parallel Optimization of an Engine Combustion Simulation Application
SPEAKER: unknown

ABSTRACT. LESAP is a combustion simulation application capable of simulating the chemical reactions and supersonic flows in scramjet engines. It can be used to solve practical engineering problems and involves a large amount of computation. In this paper, we port and optimize LESAP with OpenMP 4.0, targeting the emerging heterogeneous many-core platforms composed of general-purpose CPUs and the Intel Many Integrated Core (MIC) architecture. Several performance optimization techniques are applied, including data movement optimization, grid-partition based load balancing and SIMD optimization. Our experiments on a practical combustion simulation problem with 5,320,896 grid cells, run on one Tianhe-2 supercomputer node (two 12-core Intel Xeon E5-2692 CPUs and three Intel Xeon Phi 31S1P coprocessors), show that the simulation time is reduced from 64.72 seconds to 21.06 seconds, a speedup of 3.07x over the original CPU-only computation.

15:10
Identifying Emerging Influential Nodes in Time-Evolving Networks
SPEAKER: unknown

ABSTRACT. Identifying the nodes or items of a network that will become influential or popular in the future is of current interest to researchers. We consider time-evolving networks. Leaders can be identified in three ways: from the structure of the network, from the activity of a node, or from both together. To determine the link formation rate of a node, we consider all temporal details within a given time period. If a node has received many links in the recent past, its probability of becoming popular in the future is high. Conversely, even if a node received many ratings in the past (before a given time window), its popularity decays exponentially if it has not received links recently. Exploiting this insight, we propose a model that considers both the structure and the recent activity of a node. We compare the results with the PageRank algorithm and a model based on linear preferential attachment. Our method excels especially at finding emerging influential nodes that were not popular in the past time window, and it identifies novel popular items without any significant loss of precision. We consider three datasets: MovieLens, Netflix, and Facebook wall posts.
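
(A hedged sketch of the scoring idea as we read the abstract, not the authors' exact model: a node's score mixes structural popularity with recent activity and decays exponentially with the time since its last received link.)

    import math

    def emerging_score(total_links, recent_links, t_since_last,
                       decay=0.1, alpha=0.5):
        base = alpha * total_links + (1 - alpha) * recent_links
        return base * math.exp(-decay * t_since_last)   # exponential decay

    # A node with modest history but fresh links can outrank an old star:
    print(emerging_score(total_links=1000, recent_links=2, t_since_last=30))
    print(emerging_score(total_links=50, recent_links=40, t_since_last=1))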

16:00-17:40 Session 6: Cloud and Big Data
16:00
DSS: A Scalable and Efficient Stratified Sampling Algorithm for Large-Scale Datasets
SPEAKER: unknown

ABSTRACT. Statistical analysis of aggregated records is widely used in domains such as market research, sociological investigation and network analysis. Stratified sampling (SS), which divides the population into distinct groups and samples each group separately, is preferred in practice for its high effectiveness and accuracy. In this paper, we propose DSS, a scalable and efficient SS algorithm for processing large datasets. DSS executes all sampling operations in parallel by calculating the exact subsample size for each partition according to the data distribution. We implement DSS on Spark, a big-data processing system, and show through large-scale experiments that DSS achieves lower data-transmission cost and higher efficiency than state-of-the-art methods.
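
(A minimal sketch of proportional per-stratum allocation, the core of the approach as described; plain Python rather than Spark, with illustrative names.)

    import random

    def stratified_sample(groups, total_n):
        """groups: stratum name -> list of records; returns samples per stratum."""
        population = sum(len(v) for v in groups.values())
        samples = {}
        for name, records in groups.items():
            k = round(total_n * len(records) / population)  # exact stratum share
            samples[name] = random.sample(records, min(k, len(records)))
        return samples

    data = {"A": list(range(100)), "B": list(range(900))}
    print({k: len(v) for k, v in stratified_sample(data, 50).items()})
    # {'A': 5, 'B': 45}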

16:25
DASM: Dynamic Adaptive Forward Assembly Area Method to Accelerate Restore Speed for Deduplication-based Backup Systems
SPEAKER: unknown

ABSTRACT. Data deduplication plays an important role in modern backup systems for its demonstrated ability to improve storage efficiency. However, in deduplication-based backup systems, the resulting fragmentation problem grows worse over successive backups and degrades restore speed, and it has drawn ever-increasing attention. Various methods have been proposed to address this problem; most of them buy restore speed at the expense of a reduced deduplication ratio, which is costly. In this paper, we present a Dynamic Adaptive Forward Assembly Area Method, called DASM, to accelerate restore speed in deduplication-based backup systems. DASM exploits the fragmentation information within the restored backup streams and dynamically trades off between chunk-level caching and container-level caching. DASM is a pure data restoration module that pursues optimal read performance without sacrificing the deduplication ratio. Meanwhile, DASM is resource-independent and cache-efficient, and works well under different memory footprint restrictions. To demonstrate the effectiveness of DASM, we conduct several experiments under different backup workloads. The results show that DASM is sensitive to the degree of fragmentation and adapts accurately to fragmentation changes. Experiments also show that DASM improves the restore speed of the traditional LRU and ASM methods by up to 58.7% and 57.1%, respectively.

16:50
LCS: An Efficient Data Eviction Strategy for Spark
SPEAKER: unknown

ABSTRACT. As an in-memory distributed computing system, Spark is often used to speed up iterative applications. It caches intermediate data generated by previous iterations in memory, so there is no need to regenerate the data when it is reused later. This mechanism of caching data in memory makes Spark much faster than other systems. When the memory used for caching reaches its capacity limit, data eviction is performed to make room for new data, and evicted data must be recovered when used again. However, classical strategies are unaware of recovery cost, which can degrade system performance. This paper shows that recovery costs differ significantly in Spark, so a cost-aware eviction strategy can clearly reduce the total recovery cost. To this end, we propose a strategy named LCS, which obtains dependency information among cached data by analyzing the application, and calculates recovery costs at runtime. By predicting how many times cached data will be reused and using this to weight the recovery cost, LCS always evicts the data that lead to the minimum future recovery cost. Experimental results show that this approach achieves better performance when memory space is insufficient, reducing total execution time by 30% to 50%.
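
(A sketch of the eviction rule as we read the abstract: each cached item carries a recovery cost and a predicted reuse count, and the item with the smallest weighted cost goes first. The weighting here is illustrative, not the paper's exact formula.)

    def evict(cache, need_bytes):
        """cache: name -> (size, recovery_cost, predicted_reuses)."""
        victims, freed = [], 0
        # Cheapest-to-lose first: low recovery cost weighted by future reuses.
        for name, (size, cost, reuses) in sorted(
                cache.items(), key=lambda kv: kv[1][1] * kv[1][2]):
            if freed >= need_bytes:
                break
            victims.append(name)
            freed += size
        for name in victims:
            del cache[name]
        return victims

    cache = {"rdd_a": (100, 5.0, 3), "rdd_b": (100, 1.0, 1), "rdd_c": (100, 2.0, 4)}
    print(evict(cache, 150))   # ['rdd_b', 'rdd_c']: lowest weighted cost first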

17:15
Pathfinder: Application-aware Distributed Path Computation in Clouds
SPEAKER: unknown

ABSTRACT. Path computation in a network depends on the network's processes and resource usage patterns. While distributed traffic control methods improve the scalability of a system, their topology and link-state conditions may lead to sub-optimal path computation. Herein, we present Pathfinder, an application-aware distributed path computation scheme. Pathfinder improves path computation through software-defined networking (SDN) controls. In this paper, we first analyze the key issues in distributed path computation, then present Pathfinder's system architecture, followed by its design principles and orchestration environment. Simulations quantifying Pathfinder's performance gains are also presented.