Program for Friday, December 4th

09:15	Huong Le Thanh (Hanoi University of Science and Technology, Viet Nam) Luan Tran Van (Hanoi University of Science and Technology, Viet Nam) Hoai Nguyen Xuan (Hanoi University, Viet Nam) Hien Nguyen Thi (Le Quy Don Technical University, Viet Nam) Optimizing Genetic Algorithm in Feature Selection for Named Entity Recognition SPEAKER: unknown ABSTRACT. This paper proposes several strategies to reduce the running time of genetic algorithms used in a feature selection task for the problem of named entity recognition. They include: (i) reduction of population size during the evolution process of the genetic algorithm; (ii) parallelization of the fitness computation; and (iii) use of progressive sampling for calculating the optimal sample size of the training data. A Maximum Entropy algorithm is then used, as a test classifier, to compute the accuracy of the named entity recognition system with the reduced feature sets identified by the genetic algorithm. Experimental results show that our improved genetic algorithm runs three times faster than the standard genetic algorithm, while the accuracy of the named entity recognition system (using Maximum Entropy) on the induced feature subset does not decrease. In addition, the feature subset induced by our improved genetic algorithm is much smaller than the original feature set and has helped Maximum Entropy to achieve higher accuracy than with the original one.
09:35	Anton Dries (KU Leuven, Belgium) Declarative Data Generation with ProbLog SPEAKER: Anton Dries ABSTRACT. In this paper we describe a novel declarative approach to data generation based on probabilistic logic programming. We show that many data generation tasks can be described as a probabilistic logic program. To this end, we extend the ProbLog language with continuous distributions and we develop a simple sampling algorithm for this language. We demonstrate that many data generation tasks can be described as a model in this language and we provide examples of generators for attribute-value data, sequences, graphs and logical interpretations and we show how to model common extensions such as noise, missing values and concept drift.
09:55	Van-Hau Nguyen (Hung Yen University of Technology and Education, Viet Nam) Thai Son Mai (University of Transport, HoChiMinh City, Viet Nam) A New Method to Encode the At-Most-One Constraint into SAT SPEAKER: unknown ABSTRACT. One of the most widely used constraints during the process of translating a practical problem into a propositional satisfiability (SAT) instance is the at-most-one (AMO) constraint. This paper proposes a new encoding for the AMO constraint, the so-called AMO bimander encoding, which can be easily extended to encode cardinality constraints, which are often used in constraint programming. Experimental results reveal that the new encoding is very competitive compared with all other state-of-the-art encodings. Furthermore, we prove that the AMO bimander encoding allows unit propagation to achieve arc consistency, an important technique in constraint programming. We also show that a special case of the AMO bimander encoding outperforms the AMO binary encoding, a widely used encoding, in all our experiments.
10:15	Duy Du Nguyen (Hanoi University of Science and Technology, Viet Nam) Hoang Huy Nguyen (Vietnam National University of Agriculture, Viet Nam) Xuan Hoai Nguyen (Hanoi University, Viet Nam) The impact of high dimensionality on SVM when classifying ERP data - A solution from LDA SPEAKER: unknown ABSTRACT. Brain-computer interfaces (BCI) based on P300 event-related potentials (ERP) could help to select characters from a visually presented character-matrix. They provide a communication channel for users with neurodegenerative disease. Associated with these kinds of BCI systems is the problem of determining whether or not a P300 was actually produced in response to the stimuli. The design of this classification step involves the choice of one or several classification algorithms from many alternatives. Support Vector Machines (SVM) and Linear Discriminant Analysis (LDA) have been used to achieve acceptable results in numerous P300 BCI applications. However, both of them suffer from the high-dimensionality problem, which leads to deterioration of their performance. In this paper, we introduce a novel combined approach of LDA and SVM to reduce the negative effect of high-dimensional data on SVM and LDA, and investigate the performance of our method. The results show that the new approach achieves similar or slightly better performance than the state-of-the-art method.

09:15-10:35 Session 8: Security and Distributed application

Presentation session

Chair:

Michel Toulouse (Vietnam Germany University, Viet Nam)

Location: Meeting room 2

09:15	Thanh Nguyen (VNG Corporation, Viet Nam) Anh Nguyen (VNG Corporation, Viet Nam) Ly Vu (VNG Corporation, Viet Nam) Tuan Nguyen (VNG Corporation, Viet Nam) Uy Nguyen (VNG Corporation, Viet Nam) Long Dao Hai (VNG Corporation, Viet Nam) Unsupervised Anomaly Detection in Online Game SPEAKER: unknown ABSTRACT. Online gaming is one of the most successful businesses on the Internet. As the online game business grows, cheating in games has become common and is the biggest challenge for online game systems. In this paper, we investigate the application of anomaly detection techniques to cheating detection in an online game (JX2) of VNG company. A method to evaluate the performance of unsupervised anomaly detection techniques was proposed. Six unsupervised anomaly detection algorithms were tested. The experimental results show that the kernel density based technique and ensemble techniques performed best on this game data. Our post analysis helped to identify and eliminate some cheating players in the game.
09:35	Tuan-Dung Cao (Hanoi University of Science and Technology, Viet Nam) Trung-Duc Nguyen (Viettel, Viet Nam) Linh Giang Nguyen (Hanoi University of Science and Technology, Viet Nam) A detection method for DGA-based Botnet on BigData Platform SPEAKER: unknown ABSTRACT. Botnets are groups of malware-compromised machines, or bots, that can be remotely controlled by an attacker (the botmaster) through a command and control (C&C) communication channel [4]. They are typically used for malicious activities such as sending spam email, launching DDoS attacks, mining Bitcoins or stealing users' information. Nowadays, reverse engineering malware samples is the main way to detect botnets. This task is difficult, tedious and ineffective in many cases. Moreover, new-generation botnets have the ability to avoid detection and exclusion by traditional methods. By using a domain generation algorithm (DGA) to create a list of candidate C&C server domains, a botnet can stay alive even when a C&C server domain is detected and taken down. In this paper we present a novel method to detect DGA botnets without reverse engineering. After logging all DNS traffic of a monitored network, we use a combination of clustering and classification algorithms that relies on the similarity in the character distribution of domain names to remove noise and group similar domains into corresponding clusters. In the next step, we apply a Collaborative Filtering (CF) technique to find the bots in each botnet. With the CF algorithm, we can find offline malware-infected machines in each botnet, which do not create access logs to NXDomains. In this way, botnet detection can be performed automatically and we can find many botnets simultaneously. We applied Big Data techniques to analyze a huge amount of DNS traffic logs (6 Gbps bandwidth and 18,000 users) of Viettel Group and obtained positive results.
09:55	Son Nguyen (University of Engineering and Technology, Viet Nam) Hieu Vo (University of Engineering and Technology, Viet Nam) Hung Pham (University of Engineering and Technology, Viet Nam) A Correlation-aware Negotiation Approach for Service Composition SPEAKER: unknown ABSTRACT. Composing existing services to create new services is considered an important activity in developing service-oriented architecture systems. The growing number of services that provide the same functionality with different qualities makes it more complex to find the best solution for a composite service. In a heterogeneous and dynamic environment, QoS negotiation provides a flexible means for choosing suitable atomic services for service compositions. However, most proposed negotiation approaches assume that services are independent of others in terms of quality. Consequently, these negotiation approaches are not able to handle the correlation factors among services. This paper presents a flexible correlation-aware negotiation approach for service compositions. In our approach, the service correlations are considered as factors affecting the choice of concrete services. The approach employs a gradual benefit reduction mechanism for determining suitable QoS proposals of the composite side. The effectiveness of the approach is demonstrated via experiments.
10:15	Amiza Amir (School of Computer and Communication Engineering, Universiti Malaysia Perlis, Malaysia) Bala Srinivasan (Clayton School of IT, Monash University, Australia) Asad Khan (Clayton School of IT, Monash University, Australia) A Communication-Efficient Distributed Algorithm for Large-scale Classification within P2P Networks SPEAKER: unknown ABSTRACT. This paper proposes a supervised and fully-distributed intelligent classification algorithm that is accurate and scalable for large networks. In addition, the resulting algorithm has the following interesting features: fully-distributed, asynchronous, light-weight, online learning, and fast responses. These characteristics make it scalable for large networks. A major distinction of our method compared to other approaches is that it forms a single global classifier, instead of building many local classifiers (one at every site). Fine-granularity components of the classifier are distributed across the network using a Distributed Hash Table (DHT), which provides efficient linking to these components and ensures the system remains fully-distributed. Our simulation results also show that the proposed method is more communication-efficient than several other distributed algorithms. The results also show that the distributed algorithm is able to produce accurate results that are comparable to the available state-of-the-art machine learning techniques.

10:35-10:55 Coffee Break

10:55-12:35 Session 9: Data Retrieval and Extraction

Presentation session

Chair:

Yew-Soon Ong (Nanyang Technological University, Singapore)

Location: Meeting room 2

10:55	Quang Vu Bui (Ecole Pratique des Hautes Etudes (EPHE), France) Karim Sayadi (University Pierre and Marie Curie, France) Marc Bui (Ecole Pratique des Hautes Etudes, France) A multi-criteria document clustering method based on topic modeling and pseudoclosure function SPEAKER: unknown ABSTRACT. We address in this work the problem of document clustering. Our approach is based on the following pipeline. First, we quantify the topics in a document. Then, a number of clusters is set automatically. Finally, a multi-criteria distance is defined to cluster the documents. The advantage of this approach is that it allows us to have a number of multi-criteria clusters based on structural analysis of each document. We applied our method to Twitter data and showed the accuracy of our results compared to a randomly chosen number of clusters.
11:15	Cam Vu-Manh (FPT University, Viet Nam) Anh Tuan Luong (FPT University, Viet Nam) Phuong Le-Hong (Hanoi University of Science, Viet Nam) Improving Vietnamese Dependency Parsing using Distributed Word Representations SPEAKER: unknown ABSTRACT. Dependency parsing has become an important line of research in natural language processing in recent years. This is due to its usefulness in a wide variety of real world applications. This paper presents the improvement of Vietnamese dependency parsing using distributed word representations. Our parser achieves an accuracy of 76.29% of unlabelled attachment score or 69.25% of labelled attachment score. This is the most accurate dependency parser for the Vietnamese language in comparison to others which are trained and tested on the same dependency treebank. The distributed word representations are produced by two recent unsupervised learning models, the Skip-gram model and the GloVe model. We also show that distributed representations produced by the GloVe model are better than those produced by the Skip-gram model when used in dependency parsing. Our dependency parsing system, including software, corpus and distributed word representations, is released as an open source project, freely available for research purposes.
11:35	Hien Luong (Department of Information System, SOICT, HUST, Viet Nam) Oanh Nguyen (Department of Information System, SOICT, HUST, Viet Nam) A Copy Detection Method Based on SCAM and PPCHECKER SPEAKER: unknown ABSTRACT. With the widespread use of the Internet and the availability of a huge amount of digital documents online, plagiarism is increasing. This is a serious problem not only in the publishing of scientific documents but also in education. Copying is a common form of plagiarism. Documents can be copied completely or in part. Many document copy detection (DCD) methods have been proposed; however, few of them allow us to detect partial copies with high efficiency and in reasonable time. In this paper, we propose a scheme for detecting copies, including partial copies. The proposed method is based on the SCAM and PPCHECKER methods and combines the advantages of both. Experimental results with high precision demonstrate the effectiveness of the proposed method.
11:55	Tai Dinh (Department of Computer Science, Ho Chi Minh City Industry and Trade College, Ho Chi Minh City, Viet Nam) Minh Nguyen Quang (Academy of Cryptography Techniques, Ho Chi Minh City, Viet Nam) Bac Le (NU HCMC, University of Science, Ho Chi Minh City, Viet Nam) A Novel Approach for Hiding High Utility Sequential Patterns SPEAKER: Tai Dinh ABSTRACT. Privacy Preserving Data Mining (PPDM) has become an important research topic in recent years. Hiding high utility sequential patterns is necessary in business, health and security applications, among others. The goal of hiding is to find a way to hide all high utility sequential patterns so that adversaries cannot mine them from the sanitized database. However, there are few methods in the literature for hiding high utility sequential patterns. In this paper, we present a new approach for solving this problem. First, we use an expansion algorithm of USpan [2] for mining all high utility sequential patterns. Then we present two proposed algorithms, HHUSP (Hiding High Utility Sequential Pattern) and MSPCF (Maximum Sensitive Patterns Conflict First), to hide all high utility sequential patterns. Experimental results evaluate execution time and memory usage on large-scale datasets.
12:15	Chien Ta (The HCM University of Technology, Viet Nam) Tuoi Phan (HCM City University of Technology, Viet Nam) An Approach for Searching Semantic-based Keywords over Relational Database SPEAKER: unknown ABSTRACT. Ontologies have been applied in many applications in recent years, especially on the semantic web and in information retrieval, information extraction, and question answering. The purpose of a domain-specific ontology is to get rid of conceptual and terminological confusion. It accomplishes this by specifying a set of generic concepts that characterizes the domain as well as their definitions and interrelationships. Several languages exist for representing ontologies, such as RDF and OWL. However, these languages are only suitable for ontologies with a small amount of data. Ontologies with a large amount of data are usually represented in a database. However, most databases do not sufficiently support semantic-oriented search via Structured Query Language (SQL). Therefore, this paper introduces an approach for semantic-based keyword search over relational databases. This approach can be applied to any relational database system.

10:55-12:35 Session 10: Communication Systems

Presentation session

Chair:

The Ngoc Dang (Posts and Telecommunications Institute of Technology, Viet Nam)

Location: Meeting room 2

10:55	Vinh Tran (Australia center for space engineering research, Australia) Nagaraj Shivaramaiah (Australia center for space engineering research, Australia) Andrew Dempster (Australia center for space engineering research, Australia) A pipeline dynamically configured GNSSs baseband circuit SPEAKER: unknown ABSTRACT. This paper proposes a pipelined, programmable, interleaved baseband circuit for modern Global Navigation Satellite System (GNSS) signals. The involvement of new GNSSs leads to the requirement of designing a programmable multi-GNSS baseband receiver that can be reconfigured across GNSS signals. Baseband circuit parallelism is the easiest approach, but its resource consumption is significant. Hardware time multiplexing is a more effective technique. However, it requires an efficient memory hierarchy and an efficient design. These challenges are tackled in the proposed architecture, which utilises 3.2%, 6.6%, 12.5% and 50% of the resources of conventional baseband circuitry consisting of 16 GPS L1 C/A channels, 8 BEIDOU B1I channels, 4 GALILEO E1 channels, and 1 GPS L5 channel, respectively. The power consumption is also reduced by 29% to 70% compared to the corresponding conventional digital correlator circuits, while still meeting the timing constraint of the baseband receiver.
11:15	Tat-Nam Nguyen (Le Qui Don Technical University, Viet Nam) Quoc-Binh Nguyen (Le Qui Don Technical University, Viet Nam) Thanh Nguyen (Le Qui Don Technical University, Viet Nam) A novel signal constellation set for communication system using APSK signals of DVB-S2 standard with high nonlinearity SPEAKER: unknown ABSTRACT. Being robust to nonlinear distortion owing to its lower peak-to-average power ratio in comparison with QAM signals, APSK was recommended for use in the second generation of digital satellite video broadcasting (DVB-S2). However, the adverse effects of nonlinearity could still be reduced by redistributing signal points in the constellation. In this paper, we propose a new signal set and mapping for the APSK signal conforming to the DVB-S2 standard at a high nonlinearity level. Moreover, this newly proposed APSK signal further improves the performance of the HPA in terms of power efficiency.
11:35	Truc Tran Thanh (Danang city Department of Information and Communication, Viet Nam) Sy Ngo Van (Danang city Department of Information and Communication, Viet Nam) Nhu Nguyen Gia (Duy Tan University, Viet Nam) Modulation-based Collaboration: Interference-free for Cooperative Spectrum Sharing for 4-QAM based Co-existing System SPEAKER: unknown ABSTRACT. This article proposes a type of cooperative spectrum sharing in which the secondary signal does not cause any degradation in the primary signal. Consequently, the performance of the primary user system when the simultaneous spectrum sharing occurs is the same as when the secondary user system collaborates with the primary user system without accessing the primary user’s spectrum for the secondary usage. This simplifies the spectrum management because the interference constraint is not necessary. The investigation is limited to the application of the quadrature amplitude modulation (QAM) alphabet and the decode and forward protocol. The symbol error rates (SERs) are used as the figure of merit for evaluating the performance of the PU and SU systems. They are analyzed theoretically, and their agreement with practice is examined by simulation.
11:55	Linh Truong Hoang (Hanoi University of Science and Technology, Viet Nam) Khanh Bui Duy (Hanoi University of Science and Technology, Viet Nam) Trung Tran Viet (Hanoi University of Science and Technology, Viet Nam) GPSInsights: Towards an efficient framework for storing and mining massive real-time vehicle location data SPEAKER: unknown ABSTRACT. Intelligent Transport Systems (ITS) have seen growing interest in collecting vehicle location data in order to build real-time traffic monitoring and analytic systems. However, handling these data creates challenges, as they are massive in volume and arrive in near real-time. In this paper, we propose GPSInsights, a distributed system that is scalable and efficient in processing huge volumes of location data streams. GPSInsights is built on open-source, scalable and distributed components. We demonstrate our system with a scalable map matching implementation and perform experiments with real large-scale datasets.

12:35-13:30 Lunch