SOICT 2017: 8TH INTERNATIONAL SYMPOSIUM ON INFORMATION AND COMMUNICATION TECHNOLOGY
PROGRAM FOR THURSDAY, DECEMBER 7TH


09:00-09:45 Session 3: Keynote Talk: Deep Learning for Food Recognition (SoICT 2017)
Location: Yersin Ballroom
09:00
Deep Learning for Food Recognition

ABSTRACT. In multimedia, dish recognition is regarded as a difficult problem due to the diverse appearance of food in shape and color resulting from different cooking and cutting methods. As a result, while there is a large number of cooking recipes posted on the Internet, finding the right recipe for a food picture remains a challenge. The problem is also shared among health-related applications. For example, food-log management, which records daily food intake, often requires manual input of food/ingredients for nutrition estimation. This talk will share with you the challenge of recognizing ingredients in dishes for recipe retrieval. Finding a recipe that exactly describes a dish is challenging because ingredient compositions vary across geographical regions, cultures, seasons and occasions. I will introduce deep neural architectures that explore the relationship among food, ingredients and recipes for recognition. The learnt deep features are used for cross-modal retrieval of food and recipes.

09:45-10:30 Session 4: Keynote Talk: AI and Big Data analytics for health and bioinformatics (CSBio 2017)
Location: Yersin Ballroom
09:45
AI and Big Data Analytics for Health and Bioinformatics

ABSTRACT. Technological advances now allow for high-throughput profiling of biological systems at low cost. This low cost of data generation is leading us into the "big data" era. The availability of big data provides unprecedented opportunities but also raises new challenges for data mining and analysis. In this talk, I will start with the concepts involved in the analysis of big data, specifically the AI algorithms.

My group, the Biomedical Informatics Lab (BIL), is a research centre that is the focus of education, research and development, and human-resource training in health informatics and bioinformatics at NTU. The mission of BIL is to provide an interdisciplinary environment and training for students and researchers to engage in leading, cutting-edge research in bioinformatics, and thereby become a part of the life sciences workforce in Singapore and elsewhere.

By presenting selected research activities, this talk will provide an overview of some of the innovative and creative approaches that apply AI to big data analytics to address challenges and solutions in both health informatics and bioinformatics.

10:35-10:50 CSBio Poster session and Coffee Break
10:50-12:10 Session 6A: Genome & Transcriptome Analysis (CSBio 2017)
Location: Hon Chong Room
10:50
The effect of reference species on reference-guided genome assembly

ABSTRACT. The rapid improvement of next-generation sequencing (NGS) technologies has enabled unprecedented production of huge amounts of DNA sequence data at low cost. However, NGS technologies are still limited to generating short DNA sequences, which has led to the development of many assembly algorithms to recover whole genome sequences from those short sequences. Unfortunately, the assembly algorithms alone can only construct scaffold sequences, which are generally much shorter than chromosome sequences. To generate chromosome sequences, additional expensive experimental data are required. To overcome this problem, there have been many studies developing new computational algorithms that further merge the scaffold sequences and produce chromosome-level sequences by utilizing an existing genome assembly of a related species, called a reference. However, even though the quality of the chosen reference assembly is critical for generating a good final assembly, its effect has not been well characterized. In this study, we measured the effect of the reference genome assembly on the quality of the final assembly generated by reference-guided assembly algorithms. Using the genome assemblies of a total of eleven reference species (eight primates and three rodents), human genome sequences were assembled from scaffold sequences by one of the reference-guided assembly algorithms, called RACA, and they were compared with known genome sequences to measure their quality in terms of the number of misassemblies. The effect of the quality of the reference assemblies was investigated in terms of divergence time from human, alignment coverage between the reference and human, and the degree of inclusion of core eukaryotic genes. We found that the divergence time is a good indicator of the quality of the final assembly when high-quality reference assemblies are used. We believe this study will contribute to broadening our understanding of the effect and importance of a reference assembly on the reference-guided assembly task.

11:10
Position-Residue Specific Dynamic Gap Penalty Scoring Strategy for Multiple Sequence Alignment

ABSTRACT. Multiple Sequence Alignment (MSA) is a basic tool for biological sequence analysis. Effective alignment of multiple sequences with biological relevance is still an open problem. MSA is a crucial step used by biologists to analyze phylogenetics, gene regulation, homology markers, drug discovery, and protein structure and function prediction. The accuracy of MSA is highly dependent on the scoring function, which aligns a given residue to its appropriate position during alignment. The scoring function has three possible cases when scoring a pair of residues: (i) a residue with the same residue, (ii) a residue with a different residue, and (iii) a residue with a gap. A number of biologically meaningful approaches have been developed for the first two cases. For the third case, however, most approaches use a default gap-penalty score provided as input by an expert. In this study, we propose a new, biologically relevant, position-residue-specific dynamic scoring approach for the gap penalty. The Position-Residue Specific Dynamic Gap Penalty (PRSDGP) scoring function is tested on a benchmark dataset. The proposed PRSDGP scoring approach is compared with the CLUSTAL O program, and the quality-metric improvement ranges from 46.2% to 81.5%.

11:30
Proposal of application method of Inductive Logic Programming to microarray data

ABSTRACT. This paper describes a method for identifying common terms of genes from microarray data in three steps. First, we use a random forest to extract disease-related genes; it gives each gene a variable importance. The higher the variable importance, the more effective the feature is for classification. We extract genes whose variable importance is greater than 0 and use them as positive samples; the remaining genes are used as negative samples for ILP. Next, we annotate the extracted genes using the Gene Ontology (GO) and use the terms as predicates for ILP. Annotation is the process of assigning GO terms to gene products. Finally, we obtain rules about common terms in positive samples using ILP. ILP is a subfield of machine learning which uses logic programming as a uniform representation for examples, background knowledge and hypotheses. ILP learns based on background knowledge, which is represented in first-order logic. In the result, we extracted 1051 mRNAs as positive samples for ILP from the random forest, and its F-measure was 65.1%. We obtained about 4000 terms for each dataset and used them as predicates for ILP. We eventually obtained some rules about the positive samples.
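As an illustration of the first step only, the following sketch (assuming scikit-learn; the expression matrix X, labels y and gene identifiers are hypothetical inputs, not the authors' data) ranks genes by random-forest variable importance and splits them into positive and negative samples for ILP.

from sklearn.ensemble import RandomForestClassifier

# Hypothetical inputs: X is a (samples x genes) expression matrix,
# y holds disease/control labels, gene_ids names each column of X.
def split_genes_for_ilp(X, y, gene_ids):
    rf = RandomForestClassifier(n_estimators=500, random_state=0)
    rf.fit(X, y)
    importance = rf.feature_importances_                    # one score per gene
    positives = [g for g, w in zip(gene_ids, importance) if w > 0]
    negatives = [g for g, w in zip(gene_ids, importance) if w == 0]
    return positives, negatives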

11:50
Meta-analysis of whole-transcriptome data for prediction of novel genes associated with autism spectrum disorder

ABSTRACT. Autism spectrum disorder (ASD) is a neurodevelopmental disorder with typical symptoms such as impaired social interaction, language and communication abnormalities, and stereotypical behavior. Since the genetics of ASD is so diverse, information on genome function as provided by transcriptomic data is essential to further our understanding, because the transcriptome is a key link between protein levels and genetic information. Such studies have often been performed by comparing groups of individuals with ASD and control samples to identify which genes are dysregulated in the ASD group using statistical techniques. However, these statistical techniques can only find genes individually accounting for ASD and cannot reflect relationships among genes, which could be part of the etiology of ASD. In this study, we propose a novel method to find ASD-associated genes which are predictive for ASD. To this end, we analyze whole-transcriptome data from previous ASD studies, which were performed using different expression profiling platforms on different tissues of interest. These predictive genes, which can differentiate a sample into either ASD or non-ASD, are selected by an optimization process. Comparing subsets selected from different tissues/platforms, we conclude that different tissues contain different gene sets that account for ASD. In addition, a platform can supply ASD-associated genes that other platforms cannot. The genes found are compared to those documented in SFARI, the most comprehensive and up-to-date database of ASD. Interestingly, we found 16 novel genes with evidence from the literature which have not yet been recorded in this database. Taken together, meta-analysis of whole-transcriptome data of ASD could shed light on the etiology of ASD.

10:50-12:10 Session 6B: Image and Video Processing (SoICT 2017)
Location: Yersin Ballroom A
10:50
Image Restoration With Total Variation and Iterative Regularization Parameter Estimation

ABSTRACT. Regularization techniques are widely used for solving ill-posed image processing problems, in particular image noise removal. Total variation (TV) regularization is one of the foremost edge-preserving methods for noise removal from images and can overcome the over-smoothing effects of classical Tikhonov regularization. An important aspect of this approach is the regularization parameter, which needs to be set appropriately to obtain optimal restoration results. In this work, we utilize a fast split-Bregman-based implementation of TV regularization for denoising, along with an iterative estimation of the parameter from local image information. Experimental results on a variety of noisy images indicate the promise of our TV regularization with local-variance-based iterative parameter estimation, and comparisons with related schemes show better edge preservation and more robust noise removal.
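A minimal sketch of the idea (not the authors' implementation), assuming scikit-image's split-Bregman TV denoiser and a simple heuristic in which the regularization weight is re-estimated from the variance of the current residual:

from skimage.restoration import denoise_tv_bregman

def tv_denoise_iterative(noisy, n_iter=5, init_weight=10.0):
    """Split-Bregman TV denoising with an illustrative iterative weight update."""
    weight = init_weight
    denoised = denoise_tv_bregman(noisy, weight=weight)
    for _ in range(n_iter):
        residual = noisy - denoised
        noise_var = max(float(residual.var()), 1e-8)
        weight = 1.0 / noise_var       # heuristic update, not the paper's exact rule
        denoised = denoise_tv_bregman(noisy, weight=weight)
    return denoised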

11:10
Compressive Online Robust Principal Component Analysis with Optical Flow for Video Foreground-Background Separation

ABSTRACT. In the context of online Robust Principal Component Analysis (RPCA) for video foreground-background separation, we propose a compressive online RPCA with optical flow that recursively separates a sequence of frames into sparse (foreground) and low-rank (background) components. Our method considers a small set of measurements taken per data vector (frame), unlike conventional batch RPCA, which processes all the data directly. The proposed method also incorporates multiple pieces of prior information, namely previous foreground and background frames, to improve the separation, and then updates the prior information for the next frame. Moreover, the foreground prior frames are improved by estimating the motion between previous foreground frames using optical flow and compensating for this motion to achieve a higher-quality foreground prior. The proposed method is applied to online video foreground and background separation from compressive measurements. The visual and quantitative results show that our method outperforms existing methods.

11:30
3D Graphical Representation of DNA Sequences and its application for long sequence searching over whole genomes

ABSTRACT. With the development of next-generation sequencing techniques, research on whole genomes has accelerated, so it is now common to analyze biological sequences at the level of megabyte-sized whole genomes rather than DNA segments of a few kilobytes. In general, genome sequence comparison is conducted with dynamic-programming-based alignment algorithms. This approach is accurate but assumes that the target sequence is short (less than a few kilobytes), since it requires quadratic time and space complexity, O(n^2), where n is the length of the target and query sequences. To overcome these drawbacks in whole-genome-scale comparison, we suggest a new method for finding locally similar subsequences among whole genomes. First, we propose a new visualization algorithm which transforms long DNA sequences into a random walk plot in 3D space. Next, we find the similar parts between the two geometric random walks obtained from the reference genome and a query sequence. The sequence searching problem in DNA strings can thus be reduced to finding parts of a random walk within a relatively small-scale geometric space. This means our random walk is a good approximate representation of very long genomic sequences. Our experiments showed that the algorithm successfully and efficiently locates a long query sequence over a whole genome whose size is more than 100 megabytes.
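One common way to build such a 3D random walk (the paper's exact nucleotide-to-step mapping is not given here, so the mapping below is illustrative) is to assign each base a fixed step vector and take the cumulative sum:

import numpy as np

# Illustrative mapping; the paper's actual step vectors may differ.
STEPS = {"A": (1, 0, 0), "C": (0, 1, 0), "G": (0, 0, 1), "T": (-1, -1, -1)}

def dna_to_random_walk(sequence):
    """Turn a DNA string into an (N+1) x 3 array of 3D walk coordinates."""
    steps = np.array([STEPS[base] for base in sequence.upper() if base in STEPS])
    return np.vstack([np.zeros(3), np.cumsum(steps, axis=0)])

# Example: walk = dna_to_random_walk("ACGTACGGTT")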

11:50
Pedestrian Localization and Trajectory Reconstruction in a Surveillance Camera Network

ABSTRACT. In this paper, we propose a high-accuracy solution for locating pedestrians from video streams in a surveillance camera network. For each camera, we formulate the vision-based localization service as detecting the foot-points of pedestrians on the ground plane. We address two critical issues that strongly affect foot-point detection results: cast shadows and pruning detection results affected by occlusion. For the first issue, we adopt a shadow-removal technique based on a learning-based approach. For the second issue, a regression model is proposed to prune wrong foot-point detections. The regression model estimates the position using human factors such as height, width and their ratio. The correlation between the detected foot-points and the results estimated from the regression model is examined. When a foot-point is missed due to a lack of correlation, a Kalman filter is deployed to predict the current location. To link the trajectory of a person across the camera network, we rely on the observation that the cameras view the same ground plane/floor, so the transformation between a pair of cameras can be computed offline. In the experiments, high accuracy in locating pedestrians and real-time computation are achieved. The proposed method is therefore particularly feasible for deploying the vision-based localization service in scalable indoor environments such as hallways, squares in public buildings, and offices, where surveillance cameras are commonly used.

10:50-12:10 Session 6C: Time Series and Predictive Models (SoICT 2017)
Location: Yersin Ballroom B
10:50
DTA Hunter System: A new statistic-based framework of predicting future demand for taxi drivers

ABSTRACT. The ever-growing popularity of taxi services in modern cities creates a demand for making taxi activities more efficient. Specifically, the main aims are to reduce the cruising time of taxis when drivers hunt for new passengers and to maximize the potential profit of the next trip, which has attracted much interest from researchers. However, most research uses historical GPS tracks without considering (1) the data of the current day, especially the last few hours before the current time, and (2) road passengers (traditional passengers who hail a taxi on the road), who account for a large portion of taxi demand in reality. To overcome such drawbacks, we propose the DTA Hunter system, which incorporates this information into a statistical model by vectorizing historical data and using probability equations. Given a taxi's information (current location and time), the model suggests k parking places and optimal paths to reach them that maximize the probability of picking up new passengers and the expected distance of the next trip. We evaluate the model on one month of Vinasun Taxi data from Vietnam (from 18/10/2015 to 14/11/2015), and the resulting probability of picking up new passengers is better than that achieved by the actual daily behavior of taxi drivers.

11:10
A Robust Approach for Multivariate Time Series Forecasting

ABSTRACT. Time series forecasting is often confronted with multivariate data, but few models are available for this situation. Besides, data distortion aggravates the difficulty of predicting multivariate time series. To tackle these problems, we propose an approach based on a convolutional neural network with a feature-extraction layer added before the convolution layer to extract multivariate features and handle multivariate time series data; it also decreases the effect of distortion by transforming each sample into a denser representation that combines its own information with that of its temporal neighbours. A fully connected layer then fuses these extracted features and produces the final result. Given that events in the world are always related, using both the target time series and other related time series to forecast future changes of the target dimension achieves a better prediction. The proposed approach can process multivariate time series data and is robust to the number of samples, the numeric ranges of the data, etc. Extensive experiments validate the effectiveness of the approach for multivariate time series forecasting.

11:30
An Application of Similarity Search in Streaming Time Series under DTW: Online Forecasting

ABSTRACT. Time-series forecasting has long attracted researchers in time-series data mining. In this paper, we introduce an efficient online forecasting method based on similarity search in streaming time series under Dynamic Time Warping (DTW). The proposed method takes the newly incoming time-series subsequence, finds its k nearest neighbor subsequences, and makes predictions based on how these best matches evolved in the past. Prior to the similarity search, these subsequences are extracted from the original time series by a novel segmentation technique using major extrema of the time series. Experimental results show that for trending and seasonal streaming time series, the proposed method produces short-term forecasts with high prediction accuracy and remarkable time efficiency. Furthermore, if the streaming time series has some linear features and no trend, another version of our online forecasting method, which hybridizes the aforementioned method with simple exponential smoothing, can improve the prediction accuracy.
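A minimal sketch of the k-nearest-neighbour DTW forecasting idea described above (the extrema-based segmentation is omitted and all names are illustrative): the continuations of the k stored subsequences closest to the query are averaged.

import numpy as np

def dtw_distance(a, b):
    """Plain O(len(a) * len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def knn_dtw_forecast(history_subsequences, query, horizon=1, k=3):
    """Average the continuations of the k stored subsequences closest to the query.

    Each stored subsequence must be at least len(query) + horizon samples long.
    """
    dists = [dtw_distance(s[:len(query)], query) for s in history_subsequences]
    nearest = np.argsort(dists)[:k]
    continuations = [history_subsequences[i][len(query):len(query) + horizon]
                     for i in nearest]
    return np.mean(continuations, axis=0)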

11:50
A New Model for Stock Price Movements Prediction in Vietnam Stock Market

ABSTRACT. In this paper, we introduce a new prediction model based on a Bidirectional Gated Recurrent Unit (BGRU). Our predictive model relies on both online financial news and historical stock price data to predict stock movements in the future. Experimental results show that our model achieves nearly 60% accuracy for S&P 500 index prediction, whereas accuracy for individual stock prediction is over 65%.

10:50-12:10 Session 6D: Network and Applications (SoICT 2017)
Location: Hon tre Room
10:50
Entrusting a Responsibility of Personal Information Management to Users from a Service Provider

ABSTRACT. Nowadays, various network services, such as online shops and facility reservations, are used widely with the spread of the Internet. Some of these services request users to provide personal information. However, incidents such as information leakage occur frequently, so the responsibility of personal information management is a burden for service providers. On the other hand, users cannot know how service providers use the personal information they provide, and thus feel uneasy about offering personal information to service providers. Therefore, we propose a framework in which a user can designate the usage procedure for his/her personal information. A service provider can entrust the responsibility of personal information management to a user because the usage procedures are determined by the user rather than the service provider. Additionally, since the user knows the usage procedure for his/her personal information, the user feels at ease offering it. In this paper, we discuss the policies (a protection policy and a use policy) defined in this framework.

11:10
Limiting the Spread of Epidemics within Time Constraint on Online Social Networks

ABSTRACT. In this paper, we investigate the problem of limiting the spread of epidemics on online social networks (OSNs), with the aim of finding a set of nodes of size at most k to remove from the network such that the number of saved nodes is maximal, for cases where the set of infected nodes on the network is already known. The problem is proved to be NP-hard, and it is NP-hard to approximate the problem within a ratio of n^(1-ε), for 0 < ε < 1. We also propose two algorithms to solve the problem. Experimental results show that our proposed algorithms outperform baseline algorithms.

11:30
Implementing Genetic Algorithm Accelerated By Intel Xeon Phi

ABSTRACT. In this paper, a genetic algorithm (GA) accelerated by the Intel Xeon Phi coprocessor, based on the Intel Many Integrated Core (MIC) architecture, is proposed and called the GAPhi framework. The GAPhi framework solves power-aware task scheduling (PATS) problems in shorter execution time than a sequential genetic algorithm. We evaluate GAPhi, sequential GA (SGA) and GAGPU from [8] on PATS problems of the same size. Due to limited hardware resources (i.e. memory) for executing the simulation, we created a workload with a maximum problem size of 1000 jobs and 1000 physical machines. The experimental results show that the GAPhi program executed on a single Intel Xeon Phi coprocessor (61 cores) obtains a significant speedup in comparison to the SGA program executed on an Intel Xeon CPU and the GAGPU program executed on an NVIDIA Tesla with the same input problem size. They share the same GA parameters (e.g. number of generations, crossover and mutation probabilities, etc.).

11:50
A Study of Uber-based Applications

ABSTRACT. Uber-based applications have recently created a new business model: a taxi company without any cars, a tutoring company without any tutors, or a hotel without any rooms. These applications combine mobile computing and peer-to-peer technology to facilitate the peer-to-peer provision of services. This paper presents a study of Uber-based applications. The paper first explains the driving forces of mobile computing and peer-to-peer technology that exploit direct communication between mobile applications for services. It then describes a common application framework with the system architecture and prevailing components. We use virtual healthcare and software outsourcing case studies to demonstrate prototype systems and their functions and to evaluate service availability and performance.

12:10-13:30 Lunch at Feast Restaurant - 1st Floor
13:30-14:15 Session 7A: Keynote Talk: Revealing deep proteome diversity with community-scale proteomics big data (CSBio 2017)
Location: Hon Chong Room
13:30
Revealing deep proteome diversity with community-scale proteomics big data

ABSTRACT. Translating the growing volumes of proteomics mass spectrometry data into reusable evidence of the occurrence and provenance of proteomics events requires the development of novel algorithms and community-scale computational workflows. MassIVE (http://massive.ucsd.edu) proposes to address this challenge in three stages.

First, systematic annotation of human proteomics big data requires automated reanalysis of all public data using open source workflows with detailed records of search parameters and of individual Peptide Spectrum Matches (PSMs). As such, our large-scale reanalysis of tens of terabytes of human data has now increased the total number of proper public PSMs by over 10-fold to over 320 million PSMs whose coverage includes over 95% of public human HCD data.

Second, proper synthesis of community-scale search results into a reusable knowledge base (KB) requires scalable workflows imposing strict statistical controls. Our MassIVE-KB spectral library has thus properly assembled 2+ million precursors from over 1.5 million peptides covering over 6.2 million amino acids in the human proteome, all of which at least double the numbers covered by the popular NIST spectral libraries. Moreover, MassIVE-KB detects 723 novel proteins (PE 2-5) for a total of 16,852 proteins observed in non-synthetic LCMS runs and 19,610 total proteins when including the recent ProteomeTools data.

Third, we show how advanced identification algorithms combine with public data to reveal dozens of unexpected putative modifications supported by multiple highly-correlated spectra. These show that protein regions can be observed in over 100 different variants with various combinations of post-translational modifications and cleavage events, thus suggesting that current coverage of proteome diversity (at ~1.3 variants per protein region) is far below what is observable in experimental data.

13:30-14:15 Session 7B: Keynote Talk: Trustworthy Software and Automatic Program Repair (SoICT 2017)
Location: Yersin Ballroom
13:30
Trustworthy Software and Automatic Program Repair

ABSTRACT. Software controls many critical infrastructures, and a variety of software analysis methods have been proposed to enhance the quality, reliability and security of software components. In this talk, we will first survey the gamut of methods developed so far in software validation research - ranging from systematic testing, to analysis of program source code and binaries, to formal reasoning about software components. We will also discuss the research on trustworthy software at NUS which makes software vulnerability detection, localization and patching much more systematic. We will specifically explore research on futuristic programming environments which enable auto-patching of software vulnerabilities, with a focus on automatic program repair - where software errors get detected and fixed continuously. This research aims to realize the vision of self-healing software for autonomous cyber-physical systems, where autonomous devices may need to modify the code controlling the device on-the-fly to maintain strict guarantees about trust.

14:15-15:15 Session 8A: Drug discovery (CSBio 2017)
Location: Hon Chong Room
14:15
A machine learning approach for drug discovery from herbal medicine: Metabolite profiles to Therapeutic effects

ABSTRACT. Vietnam has an abundance of traditional herbal medicine, with experience accumulated over thousands of years. These medicines play an important role in drug development. However, several therapeutic effects of these plants remain unknown. To explore the active ingredients in effective Vietnamese herbal medicine formulations for individual diseases and to understand their therapeutic effects from a scientific viewpoint, this project predicts therapeutic effects based on metabolite profiles. The herbal medicine database has been processed to extract useful information with the support of computational approaches, particularly the Random Forest algorithm, the Generalized Boosted Model and the Support Vector Machine. Binary therapeutic effect-metabolite classification models for three specific therapeutic effects are built to deal with the multi-class classification and imbalanced-class data problems. Since this project can reveal the main predictors of specific therapeutic effects, it provides valuable information for further drug development research.

14:35
Drug Repurposing: Targeting mTOR Inhibitors for Anticancer Activity

ABSTRACT. In search of better and safer drugs, and due to the high cost and decreasing productivity of novel drug discovery programs, scientists are becoming increasingly interested in finding new therapeutic indications for existing drugs, popularly known as drug repurposing. In drug repurposing, a conventional drug is used to treat a condition for which it was not previously known to be therapeutically effective. Many drugs which have failed clinical trials for not being effective in their intended therapeutic indication have been repurposed, leading to huge fortunes for the pharmaceutical industry. For instance, sildenafil failed its clinical trials, was repurposed, and is currently in use as a repurposed drug. Many methods are available for drug repurposing, but computational docking is a very cheap and convenient one, using computer software to find a possible binding site of a drug within a protein. For these advantages, a computational docking approach was used for the present drug repurposing study of the mTOR protein, where the drugs chosen were metformin, aspirin and rosuvastatin. AutoDock Vina and PyMOL were used to complete the study, and it was found that aspirin and metformin have poor affinity (-5.8 kcal/mol) for this protein, which is upregulated in various types of cancer such as breast cancer and ovarian cancer. On the other hand, rosuvastatin was found to have a high affinity (-7.8 kcal/mol for flexible docking and -10.2 kcal/mol for rigid docking) for mTOR and binds to the same binding pocket as the immunosuppressant and anticancer drug rapamycin. The study therefore indicates that rosuvastatin might have significant immunosuppressive and anticancer activity by downregulating the activity of mTOR, and further studies are needed to prove this.

14:55
In silico Structure Based Designing of Dihydrofolate Reductase Enzyme Antagonists and Potential Small Molecules That Target DHFR Protein to Inhibit the Folic Acid Biosynthetic Pathways.

ABSTRACT. Molecular docking has exerted a profound effect on the concept of drug discovery in recent years. Novel parameters and procedures have enabled us to discover numerous new molecules in the field of cancer research. DHFR, a protein, permits the genetic evolution of carcinoma cells; in order to mitigate such consequences, the DHFR pathway needs to be blocked. Principles such as structure-based drug design, computational issues, complementarity, docking strategies, and rigid and flexible docking approaches will provide the means to develop new drugs in the coming years. At present, with numerous drugs and discovery tools, identification of effective and potent small-molecule candidates is expedient. The protein DHFR (dihydrofolate reductase) and the drugs of interest are downloaded from databases such as RCSB PDB (Protein Data Bank) and PubChem. PDB provides the PDB files of the protein and PubChem provides the SDF files of the drugs, which are later converted with OpenBabel, a tool that can convert almost every protein and drug format into the format of interest. If sequence similarities need to be examined, MEGA6 is handy for viewing the conserved regions of the protein sequence against a reference sequence, as is Clustal Omega, an EMBL-EBI software tool. To find the drug-binding site within the protein, POCASA 1.2 and CASTp are used. CASTp provides more detailed information about the binding pocket, while POCASA gives the specific and exact positions of the amino acids involved in binding the drug within the protein of interest. Within the protein or receptor, AutoDock, PyMOL, PyRx and Vina use rectangular boxes for binding-site identification. AutoDockTools provides an interactive method for defining the torsional tree, mostly for a given ligand and rarely for the receptor. PyRx was also used to determine the binding affinity of the drugs towards the protein and the binding site; PyRx is automated software in which all parameters and algorithms are preset or default. In the analysis of established cancer drugs and small molecules of different classes, different binding affinities towards the DHFR protein are observed. Among the established drugs, a second-generation antifolate showed a higher affinity of -11.6 kcal/mol, and a small molecule which is also a glucosidase inhibitor showed an affinity of -13.0 kcal/mol. Our approach provides a distinct and clear picture of the efficiency of established anticancer drugs working through the DHFR pathway across generations. On the other hand, small molecules from different classes also show good binding affinity towards the DHFR protein, which leads us to extend this work to investigate the anticancer activity of specific classes of molecules not otherwise related to cancer.

14:15-15:15 Session 8B: Industrial Talks (SoICT 2017)
Location: Yersin Ballroom
14:15
Leveraging advanced technologies and operation excellence to build the world's first Knowledge as a Service (KaaS) platform

ABSTRACT. People have dozens of questions every day, and some are difficult enough to leave users stuck in work or study. There are two popular ways to find answers online: Google and community sites such as Reddit, StackOverflow, or forums. While Google can give instant search results for free, it can only find generic information that cannot address a user's personalized needs. Community sites can provide users with personalized answers to very specific questions, but they are often slow and there is no guarantee about the service. Got It is the world's first Knowledge as a Service (KaaS) platform to address the above issues with Google and community sites. A user with a question is connected instantly with an expert who can help via 10-minute chat sessions, anytime and anywhere. In this presentation we will describe the approaches Got It has employed, leveraging advanced technologies like AI and operational excellence, to successfully deliver millions of sessions to millions of users around the world. We will also present a few directions that our R&D team is heading toward to stay ahead in the marketplace.

14:35
BigData insights, Machine learning and AI in VCCORP

ABSTRACT. "We think the future of coding is no coding at all" - CEO Gitub Chris Wanstrath has predicted recently, opening many debate questions about the future of Artificial Intelligence (AI). Will artificial intelligence replace humans?. It is highly possible. Nowadays, computer vision algorithms - automated translation, image recognition - have surpassed others in the industry even humans. AI technology improves human life, facilitating their working performance, thanks to the breakthroughs in computational technology with the rapid development of hardware (CPUs/GPUs). In this presentation, we will be discussing AI platforms in VCCORP, the challenges and possibilities.

14:55
Cooperation between research institutions and enterprises: challenges and solutions

ABSTRACT. Industry 4.0 is taking place at unprecedented speed. It has both positive and negative impacts on every economic and social aspect. One of the most important components of Industry 4.0 is AI. Big corporations and research institutions in developed countries are investing substantial resources in this field, and amazing results have been achieved.
In Vietnam, universities and research institutes are showing great interest in the field of AI. One of the problems in the country, however, is that it takes a long time to go from research to application.
This talk discusses the challenges in cooperation between research institutions and enterprises; it then proposes some solutions that can improve the efficiency of cooperation, benefiting all parties.

15:15-15:35 CSBio Poster session and Coffee Break
15:35-17:15 Session 9A: Computational Methods (CSBio 2017)
Location: Hon Chong Room
15:35
Formal Validation of Neural Networks as Timed Automata

ABSTRACT. We propose a formalisation of spiking neural networks based on timed automata networks. Neurons are modelled as timed automata waiting for inputs on a number of different channels (synapses) for a given amount of time (the accumulation period). When this period is over, the current potential value is computed taking into account the current inputs and the previous decayed potential value. If the current potential exceeds a given threshold, the automaton emits a broadcast signal over its output channel; otherwise it starts another accumulation period. After each emission, the automaton is constrained to remain inactive for a fixed refractory period. Spiking neural networks are formalised as sets of automata, one for each neuron, running in parallel and sharing channels according to the structure of the network. The model is then validated against some crucial properties defined via appropriate temporal logic formulae.

15:55
Estimating respiration rate using an accelerometer sensor

ABSTRACT. Breathing activity can be measured independently and electronically, e.g., using a thoracic belt or a nasal thermistor, or reconstructed from noninvasive measurements such as an ECG. In this paper, the use of an accelerometer sensor to measure respiratory activity is presented. Movement of the chest was recorded by an accelerometer sensor attached to a belt around the chest. The acquisition was performed in different states - normal breathing, apnea, deep breathing and after exhaustion - and in different postures: vertical (sitting, standing) or horizontal (lying down). The results of the experimental evaluation indicate that a chest accelerometer can correctly capture the breathing waveform and the respiration rate. This method could therefore be suitable for automatic identification of some respiratory malfunctions, for example obstructive apnea.
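A minimal sketch of one way to estimate the respiration rate from such a recording (assuming SciPy; the band limits, filter order and peak spacing below are illustrative choices, not the authors' exact pipeline):

import numpy as np
from scipy.signal import butter, filtfilt, find_peaks

def respiration_rate(acc, fs):
    """Estimate breaths per minute from a 3-axis accelerometer recording.

    acc: (N, 3) array of accelerometer samples; fs: sampling rate in Hz.
    Band-pass the acceleration magnitude around typical breathing
    frequencies (0.1-0.7 Hz), then count peaks of the filtered signal.
    """
    magnitude = np.linalg.norm(acc, axis=1)
    b, a = butter(2, [0.1, 0.7], btype="bandpass", fs=fs)
    breathing = filtfilt(b, a, magnitude - magnitude.mean())
    peaks, _ = find_peaks(breathing, distance=max(1, int(fs * 1.5)))  # breaths >= 1.5 s apart
    duration_min = len(magnitude) / fs / 60.0
    return len(peaks) / duration_min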

16:15
Modelling and Formal Verification of Neuronal Archetypes Coupling

ABSTRACT. In the literature, neuronal networks are often represented as graphs where each node symbolizes a neuron and each arc stands for a synaptic connection. Some specific neuronal graphs have biologically relevant structures and behaviors, and we call them archetypes. Six of them have already been characterized and validated using formal methods. In this work, we take the next logical step and proceed to study the properties of their couplings. For this purpose, we rely on Leaky Integrate and Fire neuron modeling and use the synchronous programming language Lustre to implement the neuronal archetypes and formalize their expected properties. We then exploit an associated model checker called Kind 2 to automatically validate these behaviors. We show that, when the archetypes are coupled, either these behaviors are slightly modulated or they give way to a brand new behavior. We also observe that different archetype couplings can give rise to strictly identical behaviors. Our results show that time coding modeling is better suited than rate coding modeling for this kind of study.

16:35
Identifying microRNA targets in epithelial-mesenchymal transition using joint-intervention causal inference

ABSTRACT. microRNAs (miRNAs) are important gene regulators, controlling a wide range of biological processes and being involved in several types of cancer. Thus, exploring miRNA functions is important for diagnostics and therapeutics. Currently, several computational approaches have been developed to elucidate miRNA-mRNA regulatory relationships. However, these approaches have their own limitations, and we are still far from understanding miRNA-mRNA relationships, especially in specific biological processes. In this paper, we adapt a causal inference method to infer miRNA targets from an Epithelial-Mesenchymal Transition (EMT) dataset. EMT is a key process in cancer metastasis, and therefore elucidating miRNA-mRNA relationships in EMT plays an important role in understanding cancer metastasis. Our approach utilises a causality-based method that estimates the causal effect of each miRNA on an mRNA while controlling for the effects of other miRNAs on that mRNA. The inferred causal effect is similar to the effect of a miRNA on an mRNA when all the other miRNAs are knocked out. The experimental results show that our method is better than existing benchmark methods at finding experimentally confirmed miRNA targets. Moreover, we have found that the miR-200 family members (miR-141, miR-200a/b/c, and miR-429) synergistically regulate a number of target genes in EMT, suggesting their roles in controlling cancer metastasis. In addition, functional and pathway enrichment analyses show that the discovered miRNA-mRNA regulatory relationships are highly enriched in EMT, implying the validity of the proposed method. The novel miRNA-mRNA regulatory relationships discovered by our method provide a rich resource for follow-up wet-lab experiments and EMT-related studies.

16:55
Extraction of disease-related genes from PubMed paper using word2vec

ABSTRACT. Finding disease-related genes is important in drug discovery. Many genes are involved in each disease, and many studies have been conducted and reported for each disease. However, it is very costly to check these one by one, so machine learning is a suitable way to address this problem. By extracting study results from research papers with text mining, it is possible to make use of that knowledge. In this research, we aim to extract disease-related genes from PubMed papers using word2vec, a text mining method. The method extracts the top 10 genes whose word2vec vectors are closest to those of known disease genes. Based on these, genes other than the known disease-related genes are extracted and used as disease-related genes. We conducted experiments using schizophrenia and confirmed the plausibility of these disease-related genes using a random forest. Pattern 1: only known genes. Pattern 2: Pattern 1 plus the disease-related genes extracted in this study. Pattern 3: Pattern 1 plus the same number of random genes. Using these three patterns, we ran a random forest on microarray data and compared the classification accuracy. Pattern 2 had the highest accuracy. Therefore, our method can extract disease-related genes starting from known disease-related genes.
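A minimal sketch of the nearest-neighbour step (assuming gensim's word2vec with its version 4 API; tokenized_abstracts, known_genes and the hyper-parameters are illustrative, not the authors' settings):

from gensim.models import Word2Vec

def candidate_disease_genes(tokenized_abstracts, known_genes, topn=10):
    """Return terms closest to the known disease genes in word2vec space."""
    model = Word2Vec(sentences=tokenized_abstracts, vector_size=200,
                     window=5, min_count=5, workers=4)
    seeds = [g for g in known_genes if g in model.wv]
    similar = model.wv.most_similar(positive=seeds, topn=topn)
    # Keep only terms that are not already known disease genes.
    return [(word, score) for word, score in similar if word not in known_genes]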

15:35-17:35 Session 9B: Natural Language Processing I (SoICT 2017)
Location: Yersin Ballroom A
15:35
Enhancing extractive summarization using non-negative matrix factorization with semantic aspects and sentence features

ABSTRACT. The main task in extractive text summarization is to evaluate the importance of the sentences in a document. This paper aims at improving the quality of an unsupervised summarization method, non-negative matrix factorization (NMF), by using sentence features and considering semantically related words through word embeddings (word2vec) in sentence scoring. The experiments were carried out with different scenarios using the DUC 2007 dataset. Experimental results showed that when NMF was combined with three types of sentence features (surface, content, and relevance features) and word2vec, the system achieved its best performance, with 42.34% for Rouge-1 and 10.77% for Rouge-2, an increase of 0.57% Rouge-1 and 0.78% Rouge-2 compared with NMF alone.
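For reference, a minimal sketch of a plain NMF baseline for sentence scoring (assuming scikit-learn; the sentence features and word2vec similarity described above are not included, and all parameter values are illustrative):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

def nmf_sentence_scores(sentences, n_topics=5):
    """Score sentences by their weighted loading over NMF topics."""
    tfidf = TfidfVectorizer(stop_words="english")
    sentence_term = tfidf.fit_transform(sentences)           # sentences x terms
    model = NMF(n_components=n_topics, init="nndsvd", random_state=0)
    W = model.fit_transform(sentence_term)                    # sentence-topic weights
    topic_weights = W.sum(axis=0)                              # overall weight of each topic
    return (W * topic_weights).sum(axis=1)

# Example: indices of the 3 highest-scoring sentences
# summary_idx = np.argsort(-nmf_sentence_scores(sentences))[:3]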

15:55
Utilizing User Posts to Enrich Web Document Summarization with Matrix Co-factorization

ABSTRACT. In the context of social media, users usually post information relevant to the content of an event mentioned in a Web document. This information (called user posts) has two important characteristics: (i) it reflects the content of the event and (ii) it shares hidden topics with the sentences in the main document. In this paper, we present a model to capture the nature of the relationship between sentences and user posts, such as comments, in sharing hidden topics for summarization. Unlike previous methods, which are usually based on hand-crafted features, our approach ranks sentences and comments based on their importance to the topics. The sentence-comment relation is formulated in a shared-topic matrix, which represents their mutual reinforcement. Our newly proposed matrix co-factorization algorithm computes the score of each sentence and comment and extracts the top m ranked sentences and m comments as the summary. Experimental results on two datasets in two languages for the social context summarization task (English and Vietnamese) and on DUC 2004 confirm the efficiency of our model in summarizing Web documents.

16:15
Parallel Multi-feature Attention on Neural Sentiment Classification

ABSTRACT. The analysis of the review's sentiment polarity is a fundamental task in NLP. However, most of the existing sentiment classification models only focus on extracting features but ignore features' own differences. Additionally, these models only pay attention to content information but ignore the user's ranking preference. To address these issues, we propose a novel Parallel Multi-feature Attention (PMA) neural network which concentrates on fine-grained information between user and product level content features. Moreover, we use multi-feature, user's ranking preference included, to improve the performance of sentiment classification. Experimental results on IMDB and Yelp datasets show that PMA model achieves state-of-the-art performance.

16:35
Combining Convolution and Recursive Neural Networks for Sentiment Analysis

ABSTRACT. This paper addresses the problem of sentence-level sentiment analysis. In recent years, Convolutional and Recursive Neural Networks have been proven to be effective network architectures for sentence-level sentiment analysis. Nevertheless, each of them has its own potential drawbacks. To alleviate their weaknesses, we combined Convolutional and Recursive Neural Networks into a new network architecture. In addition, we employed transfer learning from a large document-level labeled sentiment dataset to improve the word embeddings in our models. The resulting models outperform all recent Convolutional and Recursive Neural Networks. Beyond that, our models achieve performance comparable to the state-of-the-art systems on the Stanford Sentiment Treebank.

16:55
News Classification from Social Media Using Twitter-based Doc2Vec Model and Automatic Query Expansion

ABSTRACT. With the development of the Internet and mobile devices, people are surrounded by an abundance of information from various online sources. News classification is among the essential needs for people to organize, better understand, and utilize information from the Internet. This motivates the authors to propose a novel method to classify news from social media. First, we propose to vectorize an article with TD2V, our pre-trained Twitter-based universal document representation following the Doc2Vec approach. We then define a Modified Distance to better measure the semantic distance between two document vectors. Finally, we apply retrieval and automatic query expansion to get the most relevant labeled documents in a corpus to determine the category of a new article. As TD2V is created from 297 million words in 420,351 news articles collected from more than one million tweets on Twitter from 2010 to 2017, it can be used as an efficient pre-trained model for English document representation in various applications. Experiments on datasets from different online sources show that our method achieves better classification accuracy than existing methods, specifically 98.4 +/- 0.3% (BBC dataset), 98.9 +/- 0.7% (BBC Sport dataset), 94.1 +/- 0.2% (Amazon4 dataset), and 78.6% (20NewsGroup dataset). Furthermore, in the classification training process, we simply encode all articles in the training set with TD2V rather than training a dedicated classification model for each of these datasets.
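A minimal sketch of this retrieval-style classification (assuming gensim's Doc2Vec rather than the TD2V model itself; k, the vector size and all other parameters are illustrative, and the paper's Modified Distance and query expansion are replaced here by plain cosine similarity):

import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def train_doc2vec(tokenized_docs):
    """tokenized_docs: list of token lists. Returns a trained Doc2Vec model."""
    tagged = [TaggedDocument(words, [i]) for i, words in enumerate(tokenized_docs)]
    return Doc2Vec(tagged, vector_size=300, window=5, min_count=2, epochs=20)

def classify(model, labeled_docs, labels, new_doc, k=5):
    """Label a new article by majority vote over its k most similar labeled documents."""
    query = model.infer_vector(new_doc)
    vectors = np.array([model.infer_vector(doc) for doc in labeled_docs])
    sims = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query) + 1e-12)
    nearest = np.argsort(-sims)[:k]
    votes = [labels[i] for i in nearest]
    return max(set(votes), key=votes.count)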

17:15
Towards State-of-the-art English-Vietnamese Neural Machine Translation

ABSTRACT. Machine translation is one of the most challenging topics in natural language processing. Common approaches to machine translation are based on either statistical or rule-based methods. Rule-based translation analyzes sentence structures and requires extensive lexicons with morphological, syntactic, and semantic information, as well as large sets of manually created rules. Statistics-based translation faces the challenge of collecting bilingual text corpora, which is particularly difficult for low-resource language pairs such as English-Vietnamese. This research aims at building state-of-the-art English-Vietnamese machine translation. Our contribution includes: (1) an enormous effort in collecting a training dataset, (2) an application of advanced methods in neural machine translation to optimize the translation model, and (3) an experimental result suggesting that Vietnamese tokenization, a common pre-processing step, is unnecessary. Our model achieves the highest BLEU score in comparison with other research.

15:35-17:35 Session 9C: Security (SoICT 2017)
Location: Yersin Ballroom B
15:35
Protecting consensus seeking NIDS modules against multiple attacks

ABSTRACT. This work concerns distributed consensus algorithms and their application to a network intrusion detection system (NIDS) [20]. We consider the problem of defending the system against multiple data falsification attacks (Byzantine attacks), a vulnerability of distributed peer-to-peer consensus algorithms that has not been widely addressed in practical settings. We consider both naive (independent) and colluding attackers. We test three defense strategy implementations: two classified as outlier detection methods and one reputation-based method. We have narrowed our attention to outlier and reputation-based methods because they are relatively light computationally. We have left out control-theoretic methods, which are likely the most effective, but whose computational cost increases rapidly with the number of attackers. We compare the efficiency of these three implementations in terms of computational cost, detection performance, convergence behavior and possible impact on the intrusion detection accuracy of the NIDS. Tests are performed based on simulations of distributed denial-of-service attacks using the NSL-KDD data set.

15:55
FDDA: A Framework For Fast Detecting Source Attack In Web Application DDoS Attack

ABSTRACT. Anomaly detection techniques are used in Intrusion Detection System/Intrusion Prevention System (IDS/IPS) products to find zero-day attacks. However, anomaly detection requires a training phase to learn or set up the parameters of the system while the system is in an attack-free state. Moreover, the efficiency of detecting abnormal signals mainly depends on the data learned during the training phase, as well as the data learned as updates during the detection phase. In this research, we propose a framework named FDDA which can improve the speed and efficiency of defending against DDoS attacks on web applications. FDDA can detect and quickly remove the (IP) sources of requesting packets in DDoS attacks on web applications, i.e., it greatly reduces the slow training phase. Additionally, FDDA introduces a procedure to automatically update dynamic feature data (for detecting and blocking attacking requests). It hence provides the flexibility and strength to deal with hackers who change their methods and forms of attack.

16:15
GINTATE: Scalable and Extensible Deep Packet Inspection System for Encrypted Network Traffic

ABSTRACT. Deep packet inspection (DPI) is a basic monitoring technology, which realizes network traffic control based on application payload. The technology is used to prevent threats (e.g., intrusion detection systems, firewalls) and extract information (e.g., content filtering systems). Additionally, transport layer security (TLS) monitoring is required as the use of the TLS protocol, including hypertext transfer protocol secure (HTTPS), is increasing. TLS monitoring is different from TCP monitoring in two aspects. First, monitoring systems cannot inspect the contents in TLS communication, which is encrypted. Second, TLS communication is a session unit composed of one or more TCP connections.

In enterprise networks, dedicated TLS proxies are deployed to perform TLS monitoring. However, the proxies cannot be used when monitored devices are unable to use a custom certificate. Additionally, these networks present problems of scale and complexity which affect monitoring. Therefore, performing DPI by other means requires both high-speed processing and various protocol analyses across TCP connections in TLS monitoring. However, it is difficult to achieve both simultaneously.

We propose GINTATE, which decrypts TLS communication using shared keys and monitors the results. GINTATE is a scalable architecture that uses distributed computing and considers relational sessions across multiple TCP connections in TLS communication. Additionally, GINTATE performs DPI processing through an extensible analysis module. We show that GINTATE performs DPI processing by treating relational sessions in distributed computing, and that it is scalable by comparing it with other systems.

16:35
DGA Botnet Detection Using Supervised Learning Methods

ABSTRACT. Modern botnets are based on Domain Generation Algorithms (DGAs) to build resilient communication between bots and the Command and Control (C&C) server. The basic aim is to avoid blacklisting and evade Intrusion Protection Systems (IPS). Given the prevalence of this mechanism, numerous solutions have been developed in the literature. In particular, supervised learning has received increased interest, as it can operate on raw domains and is amenable to real-time applications. Hidden Markov Models, C4.5 decision trees, Extreme Learning Machines and Long Short-Term Memory networks have become the state of the art in DGA botnet detection. There also exist several advanced supervised learning methods, namely Support Vector Machine (SVM), Recurrent SVM, CNN+LSTM and Bidirectional LSTM, which have not yet been suitably applied in this domain. This paper presents a first attempt to thoroughly investigate all the above methods, evaluating them on a real-world collected DGA dataset involving 38 classes with 168,900 samples; it should provide a valuable reference point for future research in this field.

16:55
Using CPR metric to Detect and Filter Low-rate DDoS Flows

ABSTRACT. Low-rate distributed TCP-targeted denial-of-service (LDDoS) attacks have become a big challenge for existing defense mechanisms. They throttle TCP throughput by exploiting TCP's timeout mechanism, which relies on a common minimum retransmission timeout (minRTO) of 1 second. The congestion participation rate (CPR) metric and a CPR-based approach have been proposed by Zhang et al. to detect and filter LDDoS flows. The approach uses a threshold τ to judge whether a flow is an attack flow or not: if a flow has a CPR greater than τ, it is considered an attack flow; otherwise it is not. A problem arises when the CPR-based approach is used with a fixed τ: the approach cannot simultaneously achieve high TCP throughput under attack and fairness to new TCP flows in normal time. We therefore propose a method for adapting τ to solve this problem. Simulation results show that the adaptive CPR-based approach can preserve TCP throughput under attack fairly well, while maintaining fairness between new TCP flows in normal time.
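A minimal sketch of one reading of the CPR-based filter (the exact CPR definition and the rule for adapting τ come from Zhang et al. and the paper; the computation and names below are illustrative):

def congestion_participation_rate(packet_times, congestion_intervals):
    """Fraction of a flow's packets sent while the link is congested.

    packet_times: arrival times of the flow's packets.
    congestion_intervals: list of (start, end) times marking congestion periods.
    """
    in_congestion = sum(
        any(start <= t <= end for start, end in congestion_intervals)
        for t in packet_times)
    return in_congestion / max(len(packet_times), 1)

def filter_flows(flows, congestion_intervals, tau):
    """Keep only flows whose CPR does not exceed the (possibly adaptive) threshold tau."""
    return {flow_id: times for flow_id, times in flows.items()
            if congestion_participation_rate(times, congestion_intervals) <= tau}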

17:15
Efficient Secure Text Retrieval on Multi-Keyword Search

ABSTRACT. It is necessary to protect data while still allowing the data owner to let users retrieve information. In this paper, we present a secure text retrieval scheme for multi-keyword search, in which data owners and users can guarantee the privacy of their documents and search keywords against semi-trusted document servers while maintaining the functionality of ranked text retrieval. Our scheme also supports access control, whereby data owners can specify the users that can search and access their files. We build our scheme on the term-frequency ranking function that is widely used in many real text retrieval systems. The efficiency of our secure scheme is verified empirically with a real text corpus.

15:35-17:15 Session 9D: Software Engineering (SoICT 2017)
Location: Hon tre Room
15:35
A Compositional Type System for Finding Log Memory Bounds of Transactional Programs

ABSTRACT. In our previous works, we proposed several type systems that can guarantee log memory bounds of transactional programs. One drawback of these type systems is their restricted compositionality. In this work, we develop a type system that is completely compositional. It allows us to type any sub-term of the program, instead of the bottom-up style of our previous works. In addition, we extend the language with basic elements that are close to real-world languages, instead of the abstract languages of our previous works. This increases the applicability of our type systems to real-world languages.

15:55
A Test Data Generation Method for C/C++ Projects

ABSTRACT. This research proposes an automated test data generation method for C/C++ projects that generates fewer test data while achieving higher code coverage in comparison with KLEE, CAUT, PathCrawler, and CREST. To do this, the proposed method contributes an algorithm named loop depth-first search that combines static testing and concolic testing. Besides, the paper also provides an improved symbolic execution for avoiding the initial test data problem in concolic testing. A tool supporting the proposed method has been developed and applied to test different C/C++ projects in several software companies. The experimental results show higher coverage with fewer test data compared with existing methods, demonstrating the effectiveness and practical usefulness of the proposed method for automated test data generation in practice.

16:15
Mutants Generation For Testing Lustre Programs

ABSTRACT. Lustre is a synchronous language widely used for the development of reactive systems, control systems and monitoring systems, such as nuclear reactors, civil aircraft, and automotive vehicles. In particular, Lustre is suitable for developing real-time systems. In such applications, testing activities for fault detection play a very important role. Mutation testing is one of the most commonly used techniques for evaluating the fault-detection capability of test data. Typically, the set of mutants generated by the mutation operators of a programming language is very large, so manual mutant generation is often very costly. In this paper, we present a mutant generator that uses the set of mutation operators defined for the Lustre language. An automatic mutant generation strategy is implemented in the generator in order to reduce testing cost. Mutant generation and random test data generation are also evaluated on different Lustre programs.

16:35
Design and implementation of a new execution model for CAPE

ABSTRACT. CAPE, which stands for Checkpointing-Aided Parallel Execution, is an approach based on checkpoints to automatically translate and execute OpenMP programs on distributed-memory architectures. This approach demonstrates high performance and complete compatibility with OpenMP on distributed-memory systems. This paper presents a new design and implementation model for CAPE that improves performance and makes CAPE even more flexible.

16:55
USL: Towards Precise Specification of Use Cases for Model-Driven Development

ABSTRACT. Use cases have been widely employed as an efficient means to capture and structure software requirements. A use case model is often represented by a loose combination of a UML use case diagram and a textual description in natural language. A use case model expressed in such a form often contains ambiguous and imprecise parts. This prevents integrating it into model-driven approaches, where use case models are often taken as the source of transformations. This paper introduces a domain-specific language named the Use case Specification Language (USL) to precisely specify use cases, with two main features: (1) USL has a concrete syntax in graphical form that allows us to achieve the usability goal; (2) the precise semantics of USL, defined by mapping USL to a Labelled Transition System (LTS), opens the possibility of transformations from USL models to other artifacts such as test cases and analysis class models.

19:00-21:30Gala Dinner at Champa Island, 304, 2/4 road, Nha Trang