10:20 | Toward A Parallel Turing Machine Model SPEAKER: unknown ABSTRACT. In the field of parallel computing, the late leader Ken Kennedy raised an alarming warning in the early 1990s: "Is Parallel Computing Dead?" In recent years we have witnessed the tremendous momentum of the "second spring" of parallel computing. But what lessons should we learn from the history of parallel computing as the field climbs out of that low point? To this end, this paper examines the disappointing state of work on parallel Turing machine models over the past 50 years of parallel computing research. The lack of a solid yet intuitive parallel Turing machine model will continue to be a serious challenge. This paper presents an attempt to address that challenge by proposing a parallel Turing machine model, the PTM model.
10:45 | dCompaction: Delayed Compaction for LSM-tree SPEAKER: Fengfeng Pan ABSTRACT. Key-value (KV) stores have become a backbone of large-scale applications in today's data centers. Write-optimized data structures like the LSM-tree (Log-Structured Merge-tree) and its variants are widely used in KV storage systems such as BigTable and RocksDB. A conventional LSM-tree organizes KV items into multiple, successively larger components, and uses compaction to push KV items from one smaller component to the adjacent larger component until they reach the largest component. Unfortunately, the current compaction scheme incurs significant write amplification due to repeated reads and writes of KV items, which results in poor throughput. We propose a new compaction scheme, delayed compaction (dCompaction), that decreases write amplification. dCompaction postpones parts of compactions and gathers these compactions into the following compaction. In this way, it avoids reads and writes of KV items during compaction, and consequently improves the throughput of LSM-tree based KV stores. We implement dCompaction on RocksDB and conduct extensive experiments. Validation using the YCSB framework shows that, compared with RocksDB, dCompaction achieves about a 30% write performance improvement with comparable read performance.
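The write-amplification argument behind dCompaction can be illustrated with a toy model (this is not the paper's RocksDB implementation; the two functions and the run layout are invented for illustration). Merging runs one compaction at a time rewrites items repeatedly, while a "virtual" compaction that records pending runs and merges them in one batch rewrites each item once:

```python
# Toy model of delayed compaction. A "virtual" compaction only records which
# runs would be merged; the real merge is deferred and batched, so each KV
# item is rewritten once instead of once per compaction step.

def eager_compact(runs):
    """Merge runs one at a time, rewriting the output at every step."""
    writes = 0
    merged = []
    for run in runs:
        merged = sorted(merged + run)
        writes += len(merged)          # every step rewrites the whole output
    return merged, writes

def delayed_compact(runs):
    """Postpone merges: gather all pending runs and merge them once."""
    pending = []
    for run in runs:
        pending.append(run)            # virtual compaction: just record the run
    merged = sorted(x for run in pending for x in run)
    writes = len(merged)               # each item is rewritten a single time
    return merged, writes
```

Both paths produce the same sorted component, but the delayed path performs strictly fewer item writes, which is the amplification the paper targets.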
11:10 | FOG: A Fast Out-of-core Graph Processing Framework SPEAKER: unknown ABSTRACT. In this paper we present FOG, an open source graph processing framework designed for out-of-core (external memory) graph processing. FOG provides a set of programming interfaces that break down the update functions of vertices to their incident edges, so that the functions are processed in an edge-centric manner. As a result, FOG offers intuitive and productive programming interfaces while achieving a high main memory utilization rate and high processing efficiency at the same time. Moreover, FOG proposes an in-place update shuffling mechanism that improves performance by dramatically reducing disk I/O during computation. Through extensive evaluations on typical graph algorithms and large real-world graphs, we show that FOG outperforms existing out-of-core graph processing systems, including GraphChi, X-Stream, and TurboGraph. Comparing FOG with state-of-the-art distributed graph processing frameworks, we show that, using just a commodity PC, FOG achieves comparable or even better performance than the best distributed graph processing framework running on an Amazon EC2 cluster with 128 nodes.
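The edge-centric style the abstract describes can be sketched generically (this is an illustrative pass, not FOG's API): instead of looping over vertices and chasing their edges with random access, the edge list is scanned sequentially and each vertex's contribution is scattered along its out-edges, which is what makes the pattern disk-friendly:

```python
# Illustrative edge-centric scatter pass: one sequential scan of the edge
# list, each source spreading its value evenly over its out-edges. The
# update function (summing scaled contributions) is a placeholder.

def edge_centric_step(num_vertices, edges, values):
    """edges: list of (src, dst) pairs; returns the next vertex values."""
    out_degree = [0] * num_vertices
    for src, _ in edges:
        out_degree[src] += 1
    new_values = [0.0] * num_vertices
    for src, dst in edges:             # sequential scan, no random vertex walks
        new_values[dst] += values[src] / out_degree[src]
    return new_values
```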
11:35 | Introspection-based Memory Pruning for Fast and Effective Live VM Migration SPEAKER: unknown ABSTRACT. Virtual Machine (VM) migration is an appealing technique on today's cloud platforms for achieving high availability, load balancing, and power saving. Unfortunately, VM migration involves transferring a large amount of data, imposing high overhead on network traffic and resulting in significant application performance degradation. In this paper, we propose an introspection-based memory pruning method for fast and effective live VM migration. First, we classify memory pages into five categories, anonymous, inode, kernel, free, and cache pages, according to how they are used by the OS. Then, upon migration, we drop the free pages, which are insignificant, and the cache pages, which are redundant. In this way, a large amount of unnecessary data is precluded and the migration time is reduced. Our system classifies memory pages into specific categories precisely using introspection. Besides cache pages, we also eliminate pages that were once used but later freed, unlike most prior work, which only eliminates free pages marked as zero pages by the OS. Experiments show that our work achieves a substantial reduction (72% on average) in total migration time compared with the original pre-copy algorithm in QEMU/KVM.
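The pruning step reduces to a filter over classified pages. A minimal sketch, assuming the five categories from the abstract and a made-up `(page_id, category)` page model (the real system classifies pages via introspection of guest OS state):

```python
# Hypothetical sketch of the pruning policy: classify each page, then skip
# free pages (insignificant) and cache pages (redundant) when building the
# set of pages to transfer during migration.

CATEGORIES = {"anonymous", "inode", "kernel", "free", "cache"}
SKIPPED = {"free", "cache"}            # dropped per the paper's policy

def prune_pages(pages):
    """pages: list of (page_id, category). Return ids worth transferring."""
    keep = []
    for pid, cat in pages:
        assert cat in CATEGORIES, "unknown page category"
        if cat not in SKIPPED:
            keep.append(pid)
    return keep
```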
15:10 | A Mixed Clustering Algorithm Based on Agglomerative Clustering and DBSCAN SPEAKER: unknown ABSTRACT. Clustering data with varying density and complicated structure is important. Agglomerative clustering and DBSCAN can each handle this problem to some extent, but each still faces difficulties. We propose a new clustering algorithm that mixes agglomerative clustering and DBSCAN, combining their advantages and overcoming their disadvantages. Our algorithm uses DBSCAN with different parameters as a density-connected partition detector at each level, instead of the traditional linkage criteria of agglomerative clustering such as Ward and average linkage. Density-connected partitions are then gradually aggregated into agglomerative clusters. The mean silhouette coefficient indicates clustering performance, and a denoising function is implemented to eliminate the influence of noise and outliers. Many challenging experiments indicate that: 1. our algorithm performs well on data with widely varying densities and extremely complex structures; 2. it determines the optimal number of clusters automatically; 3. no background knowledge is needed, and parameter tuning is easy.
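The quality signal the abstract relies on, the mean silhouette coefficient, is standard and easy to state concretely. A pure-Python sketch for 1-D points (the paper's pipeline around it, DBSCAN levels and denoising, is not reproduced here):

```python
# Mean silhouette coefficient: for each point, a = mean distance to its own
# cluster, b = mean distance to the nearest other cluster; the score is
# (b - a) / max(a, b), averaged over all points. Higher is better.

def mean_silhouette(points, labels):
    n = len(points)
    scores = []
    for i in range(n):
        same = [j for j in range(n) if j != i and labels[j] == labels[i]]
        others = sorted(set(labels) - {labels[i]})
        if not same or not others:
            scores.append(0.0)         # singleton or single-cluster edge case
            continue
        a = sum(abs(points[i] - points[j]) for j in same) / len(same)
        b = min(
            sum(abs(points[i] - points[j]) for j in range(n) if labels[j] == lab)
            / labels.count(lab)
            for lab in others
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / n
```

Comparing this score across candidate cluster counts is how a clustering level can be selected automatically.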
15:10 | CSMLRC: Novel Erasure Codes with Optimal Local Multi-group Repair in Cloud Storage SPEAKER: unknown ABSTRACT. To achieve reliability in cloud storage systems, erasure codes like Reed-Solomon (RS) codes are widely adopted. RS codes have optimal storage efficiency but place a heavy burden on the repair of failed storage nodes. For higher repair efficiency, a new family of erasure codes, called Locally Repairable Codes (LRCs), has recently been proposed and become popular. Prior work on LRCs mainly focuses on the fundamental theory, with very few explicit or practical code constructions. Building on prior theoretical work, we present m-LRCs, a novel family of explicit and practical LRCs with multiple repair groups. In this paper, we extend the coding theory of our m-LRCs and implement them together with RS codes as well as LRCs with a single repair group (s-LRCs). Specifically, we design and implement CSMLRC, a novel cloud storage system that adopts our m-LRCs, s-LRCs, and RS codes. Extensive experiments show that our m-LRCs halve the repair cost (I/O cost) compared with RS codes while delivering much higher repair performance.
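The locality idea that distinguishes LRCs from RS codes can be shown with the simplest possible local parity, an XOR over a small repair group (a toy, not the paper's m-LRC construction): a single lost block is rebuilt from its small group rather than from a full RS stripe.

```python
# Toy local repair: each repair group of data blocks carries an XOR parity,
# so one lost block is recovered by XOR-ing the group's survivors with the
# parity, touching only the group instead of the whole stripe.

def xor_parity(blocks):
    p = 0
    for b in blocks:
        p ^= b
    return p

def local_repair(group, lost_index, parity):
    """Rebuild group[lost_index] from the surviving members plus the parity."""
    survivors = [b for i, b in enumerate(group) if i != lost_index]
    return xor_parity(survivors) ^ parity
```

Multi-group codes like m-LRCs generalize this by giving each group its own parity, keeping repair I/O proportional to the group size.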
15:10 | Dynamic-Priority based Profit-Driven Scheduling in Mobile Cloud Computing SPEAKER: unknown ABSTRACT. Mobile cloud computing is emerging as a promising way to enlarge the capabilities of mobile devices through computation offloading. One of the critical challenges faced by mobile cloud providers today is how to increase the profitability of their cloud services. In this paper, we address the problem of scheduling parallelizable computation jobs offloaded by mobile users to a public cloud so as to maximize the cloud provider's profit. We propose an efficient dynamic-priority based, profit-driven scheduling mechanism. We first compute the inherent profitability of a job based on marginal effect theory and use it as the initial priority. Then we update the priorities of waiting jobs as time elapses, using the proposed time-dependent dynamic-priority calculation model. The results of numerical experiments and simulations show that our approach is efficient in scheduling such jobs in cloud data centers external to mobile devices, and that considerable profit improvements can be achieved by the proposed dynamic-priority based scheduling mechanism.
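The shape of such a scheme, an initial profit-based priority aged over waiting time, can be sketched as follows. The linear aging rule and the `aging_rate` parameter are assumptions for illustration; the paper's time-dependent model is not specified in the abstract.

```python
# Hedged sketch of dynamic-priority scheduling: a job's priority starts at
# its inherent profitability and grows with its waiting time, so profitable
# jobs run early but long-waiting jobs are not starved.

def update_priorities(jobs, now, aging_rate=0.1):
    """jobs: list of dicts with 'profit' and 'arrival' keys.
    Returns the jobs sorted by dynamic priority, highest first."""
    def priority(job):
        wait = now - job["arrival"]
        return job["profit"] + aging_rate * wait
    return sorted(jobs, key=priority, reverse=True)
```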
15:10 | Persistent-Process: A Fault Tolerant Process Model Based on Persistent Memory SPEAKER: unknown ABSTRACT. Current High Performance Computing (HPC) systems are facing a reliability wall: most of the computational resources are dedicated to achieving fault tolerance, which leads to poor overall system utility. New persistent memory technologies (e.g., Phase Change Memory, STT-RAM) provide fast access, byte-addressability, and non-volatility, features that bring new opportunities for fault tolerance. This paper aims to achieve fault tolerance transparently at the whole-process level based on persistent memory. We propose the persistent-process, a new fault tolerant process model. A persistent-process runs directly on persistent memory, and we introduce a DRAM buffer to store intermediate states and accelerate access to persistent memory. For user-space data, we introduce shadow update to update persistent data atomically. For the data structures in the operating system kernel, we introduce a novel record-and-replay mechanism to manage and recover them. Experimental results show that the persistent-process achieves fault tolerance with little overhead (less than 20%) and remains resilient as program scale increases.
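The shadow-update idea is a general out-of-place commit pattern and can be simulated in a few lines (a simulation of the concept, not the paper's kernel-level mechanism): write the new version into a shadow copy, then commit by atomically switching which copy is current, so a crash before the commit leaves the old consistent version intact.

```python
# Simulated shadow update: two copies of the data, one "current" index.
# Updates go to the idle copy; flipping the index is the atomic commit.

class ShadowCell:
    def __init__(self, value):
        self.copies = [value, None]
        self.current = 0               # index of the committed copy

    def update(self, new_value, crash_before_commit=False):
        shadow = 1 - self.current
        self.copies[shadow] = new_value     # write out of place
        if crash_before_commit:
            return                          # simulated crash: old copy survives
        self.current = shadow               # the single atomic commit point

    def read(self):
        return self.copies[self.current]
```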
15:10 | S-Storm: A Slot-Aware Scheduling Strategy for Even Scheduler in Storm SPEAKER: unknown ABSTRACT. Storm is a popular distributed real-time computation system for stream data processing, which currently provides an even scheduler to distribute all executors and workers of a topology among the worker nodes. In this paper, we find that the even scheduler ignores the allocation and dependence relationships among slots, and can cause load imbalance when a topology fails and is killed by its user, or when new machines are added to the cluster. To solve these problems, we design S-Storm, a slot-aware scheduling strategy for the even scheduler in Storm, which achieves a fine-grained EvenScheduler using a slot-aware sorting queue and a merger factor. S-Storm has the following desirable features: 1) it evenly allocates slots for multiple topologies in a load-balanced cluster; 2) when load imbalance occurs, it distributes workers to slots on lightly loaded worker nodes; 3) it achieves a fine-grained EvenScheduler by adding one slot from the worker node with the lowest load to the slot-aware sorting queue; 4) it uses the merger factor to avoid over-competition for the same resources among identical spout/bolt tasks on a machine. We implemented S-Storm on Storm 0.10.0 and used two different applications for performance evaluation. Extensive experimental results show that, compared with Storm, S-Storm achieves better average processing time and throughput while recovering balance from load imbalance.
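The core selection step, repeatedly taking one free slot from the currently least-loaded node, can be sketched with a heap (an illustration of the slot-aware idea, not S-Storm's actual scheduler; the data model is invented):

```python
# Illustrative slot-aware assignment: a min-heap keyed on each node's load,
# from which we repeatedly draw one free slot, so workers spread evenly
# across lightly loaded nodes before any node receives a second worker.

import heapq

def assign_workers(capacity, num_workers):
    """capacity: {node: free slot count}. Returns the node chosen per worker."""
    heap = [(0, node) for node in sorted(capacity)]   # (load, node)
    heapq.heapify(heap)
    remaining = dict(capacity)
    out = []
    while num_workers > 0 and heap:
        load, node = heapq.heappop(heap)   # node with the lowest load
        out.append(node)
        remaining[node] -= 1
        num_workers -= 1
        if remaining[node] > 0:
            heapq.heappush(heap, (load + 1, node))
    return out
```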
15:10 | A RCP-Based Multi-Path Congestion Control Mechanism in Named Data Networking SPEAKER: unknown ABSTRACT. Named data networking (NDN), a representative of next-generation networks, has attracted more and more attention in recent years. A congestion control mechanism is indispensable for NDN to make efficient use of network link bandwidth while ensuring fairness among data streams. Much work has been done to achieve congestion control by limiting the transmission rate of interest packets, but most of it overlooks NDN's intelligent forwarding feature. In view of this, we propose a new congestion control algorithm that combines the Rate Control Protocol (RCP) with a smart forwarding strategy to achieve multi-path congestion control: RCP controls the transmission rate of interest packets, while the forwarding strategy better utilizes link bandwidth across multiple paths. Simulation results in ndnSIM show that, compared with other congestion control mechanisms, our algorithm makes better use of link bandwidth and enhances data transfer performance while ensuring fast fairness convergence.
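For readers unfamiliar with RCP, its router-side behavior can be sketched roughly as follows. This is a simplified, textbook-style version of the RCP rate update (spare capacity pushes the advertised fair rate up, a standing queue pushes it down), not the paper's NDN-specific algorithm; the gains `alpha` and `beta` and the exact form are assumptions.

```python
# Simplified RCP-style fair-rate update at a router, run once per control
# interval: the advertised rate grows when the link has spare capacity and
# shrinks when a queue must be drained.

def rcp_update(rate, capacity, input_traffic, queue, rtt,
               alpha=0.5, beta=0.25):
    spare = capacity - input_traffic       # unused bandwidth this interval
    drain = queue / rtt                    # bandwidth needed to drain the queue
    feedback = (alpha * spare - beta * drain) / capacity
    return max(rate * (1.0 + feedback), 1e-9)   # keep the rate positive
```

In NDN the analogous rate would bound the interest-sending rate, since one interest packet pulls back one data packet.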
15:10 | Cooperative Communication with Sliding Window Network Coding in Wireless Networks SPEAKER: unknown ABSTRACT. Cooperative communication is a new communication paradigm that allows multiple transceivers to collaborate as a cluster for data transmission; such clustering can greatly improve transmission quality thanks to cooperative diversity. Network coding (NC) has been shown to provide significant performance improvements for cooperative communication systems in wireless networks. Sliding-window network coding is a variant of NC that suits multimedia streaming and improves data delay on wireless networks. In this paper, we propose Cooperative Communication with Sliding Window Network Coding in wireless networks (CC-SWNC). CC-SWNC preserves the degree distribution of the encoded packets through recombination at the nodes. CC-SWNC controls the decoding complexity of each sliding window independently of the packets received and recovers the original data. The performance of CC-SWNC is studied using NS2 and evaluated in terms of encoding overhead, decoding delay, packet loss probability, and throughput. The simulation results show that CC-SWNC can significantly improve network throughput and achieves a higher diversity order.
15:10 | Molecular computation based on tile assembly model: modular-multiplication and modular-square over finite field GF(2^n) SPEAKER: Yongnan Li ABSTRACT. With the rapid development of DNA computing, an open question worth studying is how to implement the arithmetic operations used in cryptosystems on DNA computing models. This paper proposes a DNA computing model to calculate modular-multiplication and modular-square over the finite field GF(2^n). Compared with related work, the seed configuration with input parameters is easier to encode, and both the assembly time complexity and the space complexity are better. The model requires 274 types of computation tiles and 10 types of boundary tiles. The computation tiles, performing 4 different functions, assemble onto the seed configuration with the inputs to compute the result. We describe how the computation tiles are bitwise coded and how the assembly rules work. The assembly time complexity is 3n−2 and the space complexity is 2n^2−n.
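The tile assembly itself cannot be run in software here, but the function it computes, multiplication in GF(2^n), is easy to check: a carry-less (XOR) multiply followed by reduction modulo an irreducible polynomial of degree n, with field elements represented as bit masks.

```python
# Multiplication in GF(2^n): carry-less multiply, then reduce the product
# modulo the field's irreducible polynomial. Elements are bit masks, e.g.
# 0x57 stands for x^6 + x^4 + x^2 + x + 1.

def gf2n_mul(a, b, mod_poly, n):
    # carry-less multiplication (addition in GF(2) is XOR, so no carries)
    prod = 0
    while b:
        if b & 1:
            prod ^= a
        a <<= 1
        b >>= 1
    # reduce: clear bits above degree n-1, highest first
    for bit in range(2 * n - 2, n - 1, -1):
        if prod & (1 << bit):
            prod ^= mod_poly << (bit - n)
    return prod
```

With the AES field GF(2^8) and its polynomial x^8 + x^4 + x^3 + x + 1 (mask 0x11B), this reproduces the worked example from the AES specification, {57}·{83} = {c1}.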
15:10 | An Optimization of Double Buffering in Streaming Dataflow Architectures SPEAKER: unknown ABSTRACT. Streaming dataflow architectures are promising for high-performance computing. Double buffering is one of the most effective mechanisms for hiding the latency of data transfers between on-chip and off-chip memory. However, swapping the two buffers during the execution of many tiles decreases performance because the dataflow accelerator is repeatedly filled and drained. In this paper, we propose an optimization of double buffering in streaming dataflow architectures. The optimization modifies the tile control logic and the workflow program to assign tiles to the PE array without stopping the execution of the PEs. The PE array is filled and drained only once across the execution of all tiles belonging to the same dataflow graph. Experimental results show that the proposed optimization achieves a 16.2% average efficiency improvement over double buffering without the optimization.
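Classic double buffering, the baseline the paper improves on, can be sketched sequentially (real hardware overlaps the two stages in time; here the interleaving is only modeled): while the compute stage drains one buffer, the transfer stage fills the other, and the roles swap each tile.

```python
# Minimal ping-pong double buffering over a sequence of tiles. `load`
# models the off-chip-to-on-chip transfer, `compute` the PE-array work;
# the next tile is loaded into the idle buffer before the current one
# is consumed.

def double_buffered(tiles, load, compute):
    buffers = [None, None]
    results = []
    current = 0
    buffers[current] = load(tiles[0])          # prefetch the first tile
    for i in range(len(tiles)):
        nxt = 1 - current
        if i + 1 < len(tiles):
            buffers[nxt] = load(tiles[i + 1])  # fill the idle buffer
        results.append(compute(buffers[current]))  # drain the full buffer
        current = nxt                          # swap roles
    return results
```

The paper's observation is that the swap itself is not free: each tile boundary re-fills and re-drains the PE pipeline, which their modified tile control logic avoids.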
15:10 | OpenMP 4.0 based Heterogeneous Parallel Optimization of an Engine Combustion Simulation Application SPEAKER: unknown ABSTRACT. LESAP is a combustion simulation application capable of simulating the chemical reactions and supersonic flows in scramjet engines. It can be used to solve practical engineering problems and involves a large amount of computation. In this paper, we port and optimize LESAP with OpenMP 4.0, targeting the emerging heterogeneous many-core platform composed of general-purpose CPUs and the Intel Many Integrated Core (MIC) architecture. Several performance optimization techniques are applied, including data movement optimization, grid-partition based load balancing, and SIMD optimization. Our experiments on a practical combustion simulation problem with 5,320,896 grid cells, run on one Tianhe-2 supercomputer node (two 12-core Intel Xeon E5-2692 CPUs and three Intel Xeon Phi 31S1P coprocessors), show that the simulation time is reduced from 64.72 seconds to 21.06 seconds, a speedup of 3.07 over the original CPU-only computation.
15:10 | Identifying emerging Influential Nodes in time evolving networks SPEAKER: unknown ABSTRACT. Identifying nodes or items that will become influential or popular on a network is of current interest to researchers. We consider time-evolving networks. Leaders can be identified in three ways: from the structure of the network, from the activity of a node, or from both together. To determine a node's link formation rate, we consider all temporal details within a given time period. If a node has received many links in the recent past, its probability of becoming popular in the future is high. Conversely, even if a node received many ratings in the past (before a given past time window), its popularity decays exponentially if it has not received links recently. Exploiting this observation, we propose a model that considers both the structure and the recent activity of a node. We compare the results with the PageRank algorithm and a linear preferential attachment based model. Our method performs especially well at finding emerging influential nodes that were not popular in the past time window, and it identifies novel popular items without any significant loss of precision. We consider three data sets: Movielens, Netflix, and the Facebook wall post data set.
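The scoring rule implied by the abstract, structural popularity plus recent activity, with the old contribution decaying exponentially when no recent links arrive, can be sketched as follows. The additive combination and the `half_life` parameter are assumptions; the abstract does not give the exact weighting.

```python
# Hedged sketch of a decaying popularity score: past links count for less
# the longer a node has been inactive, while recent links count in full.

import math

def popularity_score(total_links, recent_links, time_since_last, half_life=10.0):
    """total_links: links received before the recent window;
    recent_links: links in the recent window;
    time_since_last: time since the node last received a link."""
    decay = math.exp(-math.log(2) * time_since_last / half_life)
    return total_links * decay + recent_links
```

Under this rule a long-dormant node with a large historical degree can score below a fresh, currently active node, which is exactly the "emerging influential node" the model is designed to surface.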