Program for Thursday, December 8th

10:55	Van Ho (Hoa Sen University, Viet Nam) Ioanis Nikolaidis (University of Alberta, Canada) Modeling Time-Varying Topologies of Duty-Cycled Wireless Sensor Networks SPEAKER: unknown ABSTRACT. Transceiver duty-cycling (DC) is a popular technique to conserve energy in a wireless sensor network (WSN). We therefore focus on studying the performance of a DC WSN with respect to throughput and the related energy consumption. We assume a deterministic DC behavior. Our elementary notion of traffic is that of a traffic flow, routed via multiple hops from a source to a destination node. Since DC causes different network “links” to be present at different points in time (a “link” exists only if both adjacent nodes are ON at the same time), the topology of the network varies over time. In this paper, we propose a general approach to modeling time-varying topologies of duty-cycled networks via node phases: node cycles are divided into unchanged-topology stages, in which the rates allocated to multi-hop flows can be expressed as the solution of a lex-max problem (assuming predetermined single-shortest-path routing) and computed via the maxmin programming (MP) algorithm.
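The idea of dividing node cycles into unchanged-topology stages can be illustrated with a minimal sketch. It is not the authors' model: it assumes every node shares one cycle length and is ON for a single contiguous interval per cycle, and the names `topology_stages`, `is_on`, `phase` and `on_time` are hypothetical.

```python
def is_on(phase, on_time, period, t):
    """True if a node that switches ON at `phase` for `on_time` is ON at time t."""
    return ((t - phase) % period) < on_time


def topology_stages(nodes, links, period):
    """Split one cycle into stages during which the set of active links is constant.

    nodes: name -> (phase, on_time); links: iterable of (u, v) pairs.
    Assumption (ours, for illustration): all nodes share the same period.
    """
    # The topology can only change when some node toggles ON or OFF.
    cuts = {0, period}
    for phase, on_time in nodes.values():
        cuts.add(phase % period)
        cuts.add((phase + on_time) % period)
    cuts = sorted(cuts)

    stages = []
    for start, end in zip(cuts, cuts[1:]):
        mid = (start + end) / 2  # any instant inside the stage is representative
        active = {
            (u, v)
            for u, v in links
            if is_on(*nodes[u], period, mid) and is_on(*nodes[v], period, mid)
        }
        stages.append((start, end, active))
    return stages


if __name__ == "__main__":
    example_nodes = {"A": (0, 4), "B": (2, 4), "C": (5, 3)}
    example_links = [("A", "B"), ("B", "C")]
    for stage in topology_stages(example_nodes, example_links, period=10):
        print(stage)
```

Each returned stage is a fixed graph, so a per-stage rate allocation could then be computed independently on it; that step is outside this sketch.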
11:15	Dang Thanh Hai (University of Dalat, Viet Nam) Nguyen Thi Tam (VNU University of Science Ha Noi, Viet Nam) Le Hoang Son (VNU University of Science Ha Noi, Viet Nam) Le Trong Vinh (VNU University of Science Ha Noi, Viet Nam) A novel energy-balanced unequal fuzzy clustering algorithm for 3D Wireless Sensor Networks SPEAKER: unknown ABSTRACT. In this paper, we consider the problem of designing an efficient topology in a 3D Wireless Sensor Network (3D-WSN) that balances node energy consumption, improves the efficiency of data transmission and prolongs network lifetime. 3D-WSNs have attracted significant interest in recent years due to their applications in fields such as target detection, object tracking, and security surveillance. The proposed method, called FCM-PSOEB, first generates energy-efficient clusters consisting of cluster heads (CHs) and cluster members (non-CHs) using an improved Fuzzy C-Means algorithm. Then, Particle Swarm Optimization is used to determine optimal CHs, reducing the number of network disconnects from the current clusters. Finally, a new procedure is proposed to assign non-CHs to the most appropriate clusters in order to maintain load balancing between clusters. FCM-PSOEB is empirically validated on real 3D datasets against relevant protocols such as LEACH, LEACH-C and K-Means. The experimental results demonstrate the efficiency of the proposed method.
11:35	Nga Nguyen Thi Thanh (Hanoi University of Science and Technology, Viet Nam) Khanh Nguyen Kim (Hanoi University of Science and Technology, Viet Nam) Son Ngo Hong (Hanoi University of Science and Technology, Viet Nam) Entropy Correlation and Its Impact on Routing with Compression in Wireless Sensor Network SPEAKER: unknown ABSTRACT. The existence of correlation characteristics brings significant potential advantages for the development of efficient routing protocols in wireless sensor networks. However, one of the main problems is to identify the correlation region. This paper proposes an estimated joint entropy model to evaluate the joint entropy of multiple sensed data. Using the proposed model, a definition of correlation region based on entropy theory is proposed, and a correlation clustering scheme with less computation is then developed. In addition, using the proposed model, several routing protocols with compression schemes are considered in order to select the most appropriate ones. The clustering of correlation regions to optimize the communication cost is also considered in this paper.
11:55	Huy Vu (Hanoi University of Science and Technology, Viet Nam) Tien D. Nguyen (People’s Police University of Technology and Logistics, Viet Nam) Chi Q. Nguyen (Posts and Telecommunications Institute of Technology, Viet Nam) Van K. Nguyen (Hanoi University of Science and Technology, Viet Nam) An efficient geographic algorithm for routing in the proximity of a large hole in Wireless Sensor Networks SPEAKER: unknown ABSTRACT. Geographic routing is well suited for large-scale wireless sensor networks (WSNs) because of its simplicity and scalability. With the occurrence of routing holes, however, geographic routing suffers from the so-called local minimum phenomenon and from traffic concentrating on the hole boundary. Several recent proposals attempt to fix these issues by deploying a special keep-away area around the hole, which helps to relieve congestion on the hole boundary, but they are still deficient if the source or destination is close to the hole. We propose a novel approach that targets this problem of routing in the proximity of a hole while ensuring the two main requirements of energy efficiency and load balancing. Our simulation experiments show that our proposed routing scheme strongly outperforms previous approaches to routing in the proximity of a hole, especially in energy efficiency and load balancing.

10:55-12:15 Session 3B: Data analytics I

Chair:

Sergey Dvoenko (Tula State University, Russia)

Location: Room Sunflower 2

10:55	Khanh Hiep Tran (FPT University, Viet Nam) Minh Duc Nguyen (FPT University, Viet Nam) Quoc Trung Bui (FPT Technology Research Institute, Viet Nam) Local Search Approach For The Pairwise Constrained Clustering Problem SPEAKER: unknown ABSTRACT. Pairwise constrained clustering is the problem of partitioning a set of data points into clusters when we know in advance that some pairs of points should be in the same cluster and some pairs should not. Previous studies on this problem can be divided into three categories: modifying a traditional clustering algorithm to incorporate constraints, learning a distance measure, or combining both approaches. Local search is a heuristic method for finding high-quality solutions to hard optimization problems in a reasonable computation time. It has been applied to the traditional clustering problem in many studies. However, it has never been used for the pairwise constrained clustering problem. Therefore, this paper proposes Tabu search algorithms for this problem, which were tested on several datasets. The experimental results show that these algorithms are promising in comparison with some state-of-the-art algorithms: COP K-means, MPC K-means and LCVQE.
11:15	Van The Huy (Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology, Viet Nam) Duong Tuan Anh (Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology, Viet Nam) An Efficient Implementation of Anytime K-medoids Clustering for Time Series under Dynamic Time Warping SPEAKER: unknown ABSTRACT. Time series clustering is one of the crucial tasks in time series data mining. So far, time series clustering has mostly been used with Euclidean distance. The Dynamic Time Warping (DTW) distance measure has increasingly been used as a similarity measure for various data mining tasks in place of the traditional Euclidean distance due to its superior flexibility in sequence alignment. However, there exist some difficulties in clustering time series with DTW distance, for example, the problem of speeding up DTW distance calculation in the context of clustering. Recently, Zhu et al. proposed a framework of anytime clustering for time series with DTW which uses a data-adaptive approximation to DTW. In this paper, we present an efficient implementation of anytime K-medoids clustering for time series data with DTW distance. In our method, we exploit the anytime clustering framework with DTW proposed by Zhu et al., apply a method for medoid initialization, and develop a multithreading technique to speed up DTW distance calculation. Experimental results on benchmark datasets validate our proposed implementation method for anytime K-medoids clustering for time series with DTW.
11:35	Van Do Luong (Ho Chi Minh City University of Technology, Viet Nam) Tuan Anh Duong (Ho Chi Minh City University of Technology, Viet Nam) Some Improvements for Time Series Subsequence Join based on Pearson Correlation Coefficients SPEAKER: unknown ABSTRACT. The exact method JOCOR proposed by Mueen et al. is the first method for joining two time series on subsequence correlation. However, this algorithm still has some weaknesses. First, users of JOCOR are required to choose a suitable value of the parameter minLength, which is unknown. Second, JOCOR still suffers from high computational cost even for medium-size time series. In this paper, we propose some novel techniques to improve the JOCOR algorithm. These techniques consist of (i) applying a time series segmentation method to divide the outer time series into subsequences, (ii) using a dynamic programming technique in computing the shifted cross product between two time series and (iii) speeding up the subsequence join process by a new way of shifting the sliding window on the outer time series. Extensive experiments have demonstrated that the proposed approach not only improves the time efficiency of the subsequence join but also guarantees the accuracy of the result.
11:55	Sergey Dvoenko (Tula State University, Russia) Denis Pshenichny (Tula State University, Russia) A Recovering of Violated Metric in Machine Learning SPEAKER: unknown ABSTRACT. Experimental results in machine learning, data analysis and data mining often appear as comparisons between elements from a limited set. If a matrix of pairwise similarities is positive definite, then the set of elements is considered to be immersed in some metric space, e.g. Euclidean, with dimensionality not higher than the rank of the matrix. But this matrix does not necessarily turn out to be positive definite, because measurements are not scalar products. The metric needs to be recovered for correct use in clustering or machine learning. Violations arise not only in the triangle inequality, but also relative to positions of more than three elements. In general, all similarity submatrices for triples of elements can be positive definite while the whole matrix is not. We discuss here an approach to recovering a violated metric based on the idea of appropriate corrections of the normalized similarity matrix, and develop it for non-normalized similarity and dissimilarity matrices.

12:15-13:30 LUNCH

13:30-14:15 Session 4: Keynote II

Chair:

Michel Toulouse (Vietnamese-German University, Binh Duong New City & CIRRELT, Viet Nam)

Location: Room Sunflower

13:30

André Langevin (CIRRELT, Canada)

Quantitative Approaches for Road Maintenance

SPEAKER: André Langevin

14:15-15:15 Session 5A: Networks and Communication systems

Chair:

The Ngoc Dang (PTIT, Viet Nam)

Location: Room Sunflower 1

14:15	Trang Ngo (Posts and Telecommunications Institute of Technology, Viet Nam) Nhan D. Nguyen (Posts and Telecommunications Institute of Technology, Viet Nam) Hieu T. Bui (Posts and Telecommunications Institute of Technology, Viet Nam) A simple performance analysis of IM-DD OFDM WDM systems in long range PON application SPEAKER: unknown ABSTRACT. Orthogonal Frequency Division Multiplexing (OFDM) has been proposed as a promising technology for the WDM-based long range passive optical network (LR-PON) because of its resilience in the presence of fiber dispersion and its high bandwidth efficiency. However, there are many system parameters that need to be considered in practical system design. In this paper, we provide a simple performance analysis of IM-DD OFDM-WDM systems under the impact of clipping noise, the FWM effect and noise at the photodetector. The obtained results show the influence of important system parameters, such as launched power, optical gain, modulation index and transmission distance, on system performance, which is significant for practical system design.
14:35	Hoang Trong Minh (Posts and Telecoms Institute of Technology, Viet Nam) Lang Tuan Nguyen (Posts and Telecoms Institute of Technology, Viet Nam) Nguyen Thi (The Voice of Viet Nam, Viet Nam) Analyzing the Multi hop Connectivity Performance in Brownian Underwater Wireless Sensor Networks SPEAKER: unknown ABSTRACT. Wireless sensor networks have emerged as a major enabler of the Internet of Things evolution. To transport a data packet from sensing nodes to the data collection node, a typical wireless sensor network usually operates in a multi-hop fashion due to its coverage and transmission limitations. Moreover, several applications require the wireless sensor network to operate in dynamic environments such as underwater, micro-sensor or social settings. These operating environments pose a new challenge to network performance because the natural movement of nodes is unpredictable. The paper presents a novel analysis for evaluating the performance of multi-hop connectivity in an underwater wireless sensor network in which all sensor nodes except the sink node follow Brownian motion. Moreover, a destination-oriented routing strategy for optimizing path quality is proposed and evaluated by numerical results.
14:55	Truong Thao Nguyen (SOKENDAI (The Graduate University for Advanced Studies), Japan) Ikki Fujiwara (National Institute of Informatics, Japan) Michihiro Koibuchi (National Institute of Informatics / SOKENDAI, Japan) A Diagonal Cabling Approach to Datacenter and HPC Systems SPEAKER: unknown ABSTRACT. Low cable delay is becoming a critical concern in High Performance Computing systems and high-density data centers, since switch delay has become very low, e.g., 40ns/switch, as technology advances. The cable delay roughly corresponds to the cable length. A short total end-to-end cable length is therefore strongly required when deploying a logical topology onto a physical cabinet layout in a server room. Recent works show how to map any topology with an efficient cable length by modeling the mapping problem as an optimization problem based on Manhattan cabling, i.e., cables are organized into several horizontal and vertical pathways and the length of each cable becomes the Manhattan distance. To reduce the aggregate cable length, we adopt a diagonal cabling method that increases the number of cabling directions on a surface. The analysis results with different topologies show that our approach can reduce the total cable length by up to 11%, which leads to a 7% reduction in network latency.

14:15-15:15 Session 5B: Data analytics II

Chair:

Sergey Dvoenko (Tula State University, Russia)

Location: Room Sunflower 2

14:15	Van Long Tran (University of Transport and Communications, Hanoi, Viet Nam) iRadviz: An Inversion Radviz for Class Visualization of Multivariate Data Visualization SPEAKER: Van Long Tran ABSTRACT. Multivariate data visualization is an interesting research field with applications in many fields of science. Radial visualization (Radviz) is one of the most common information visualization techniques for multivariate data. Unfortunately, Radviz displays different information about the structure of multivariate data depending on the order of the data dimensions, and all points that differ only in scale map to the same point in the visual space. In this paper, we propose a method that improves the Radviz layout for class visualization of multivariate data. The basic idea of our method is to find a good corner view of a hypercube. Our method improves the visualization of class structures of multivariate data sets in Radviz. We present our method with two kinds of quality measurement. We demonstrate the efficiency of our method on several data sets.
14:35	Nguyen Bui (University of Information Technology, VNU-HCM, Viet Nam) Bay Vo (HUTECH, Viet Nam) Van-Nam Huynh (Japan Advanced Institute of Science and Technology, Japan) Chun-Wei Lin (Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China) Loan Nguyen (Nguyen Tat Thanh University, Viet Nam) Mining Closed High-Utility Itemsets from Uncertain Databases SPEAKER: unknown ABSTRACT. In order to reduce the number of high-utility itemsets (HUIs), closed high-utility itemsets (CHUIs) have been proposed. However, most techniques for mining CHUIs require certain databases, i.e., databases without probabilities. In many real-world applications, an item or itemset may have a probability; for example, actual data can be affected by the use of noisy sensors. Many algorithms have been proposed to effectively mine frequent itemsets from uncertain databases; however, there are no algorithms for mining CHUIs from uncertain databases. This paper proposes an algorithm called CPHUI-List (closed potential high-utility itemset PEU-List-based mining algorithm) for mining closed potential high-utility itemsets (CPHUIs) from uncertain databases without generating candidates. CPHUI-List performs a depth-first search of the search space, and uses the downward closure property of high transaction-weighted probabilistic and utilization itemsets to prune non-closed potential high-utility itemsets. Experiments show that the runtime and memory consumption of CPHUI-List are lower than those of CHUI-Miner.
14:55	Van-Dai Ta (National Taipei University of Technology, Taiwan) Chuan-Ming Liu (National Taipei University of Technology, Taiwan) Stock Market Analysis Using Clustering Techniques: The Impact of Foreign Ownership on Stock Volatility in Vietnam SPEAKER: unknown ABSTRACT. Data mining techniques have been used for various aspects of the financial market, such as stock index and price prediction, portfolio risk management, and trend detection. In the stock market, there is a huge amount of data, including firms’ profiles, characteristics, and historical trading data. This paper investigates the impact of foreign ownership on stock market volatility in the Vietnam stock market by using one year of historical daily live trading data for 100 major stocks on the Ho Chi Minh Stock Exchange (HOSE), for the period from September 01, 2015 to August 31, 2016. The K-means clustering algorithm and hierarchical clustering methods are used to visualize the stock market in terms of net trading volume, price variation and return volatility ratio. Based on the visualized patterns of the stock market, we can evaluate the impact of foreign capital on market volatility. Furthermore, those patterns indicate the investment behavior that will optimize portfolio investment management.

15:15-15:35 COFFEE BREAK

15:35-17:35 Session 6A: Security

Chair:

John K. Zao (National Chiao-Tung University, Taiwan)

Location: Room Sunflower 1

15:35	Michel Toulouse (Vietnamese-German University, Viet Nam) Hai Le (Vietnamese-German University, Viet Nam) Cao Vien Phung (Vietnamese-German University, Viet Nam) Denis Hock (Frankfurt University of Applied Sciences, Germany) Robust Consensus-Based Network Intrusion Detection in Presence of Byzantine Attacks SPEAKER: unknown ABSTRACT. Consensus is a form of distributed algorithm to compute global states in a network of computing nodes that share information only with adjacent nodes (no routing). Consensus algorithms have been used for applications in wireless and sensor networks, spectrum sensing for cognitive radio, and even some IoT services. Applications based on consensus algorithms can be the target of Byzantine attacks, where a compromised node sends falsified data to its neighbors. Several solutions have been proposed in the literature, inspired by reputation-based systems, outlier detection or model-based fault detection techniques in process control. Inspired by existing solutions, this paper proposes two mitigation techniques against Byzantine attacks. These techniques are then applied to protect the consensus phases of the Network Intrusion Detection System proposed by [1]. We analyze several implementation issues such as computational overhead, fine tuning of the solution parameters, impact on the convergence of the consensus phase, and accuracy of the intrusion detection system.
15:55	Van Tong (HUST, Viet Nam) Giang Nguyen (HUST, Viet Nam) A method for detecting DGA Botnet based on semantic and cluster analysis SPEAKER: unknown ABSTRACT. Botnets play major roles in a vast number of threats to network security, such as DDoS attacks, generation of spam emails, and information theft. Detecting botnets is a difficult task due to the complexity and performance issues of analyzing huge amounts of data from real large-scale networks. In major botnet malware, the use of Domain Generation Algorithms (DGAs) decreases the chance of being detected by whitelist/blacklist schemes, and thus DGA botnets have higher survivability. This paper proposes a DGA botnet detection scheme based on DNS traffic analysis which utilizes semantic measures such as entropy, meaning level of the domain, frequency of n-gram appearances and others, together with Mahalanobis distance for domain classification. The proposed method is an improvement of the Phoenix botnet detection mechanism: in the classification phase, a modified Mahalanobis distance is used instead of the original one, and the clustering phase is based on a modified k-means algorithm to achieve better effectiveness. The effectiveness of the proposed method was measured and compared with the Phoenix, Linguistic and SVM Light methods. The experimental results show that the accuracy of the proposed botnet detection scheme ranges from 90% to 99.97% depending on the botnet type.
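One of the measures named in the abstract above, the entropy of a domain name, can be sketched as character-level Shannon entropy. This is only an illustration under our own assumptions: the abstract does not say how the authors compute entropy, and the function name `char_entropy` is hypothetical.

```python
import math
from collections import Counter


def char_entropy(label: str) -> float:
    """Shannon entropy (bits per character) of the characters in a domain label.

    Assumption: entropy is taken over single characters of one label;
    the paper's exact definition may differ.
    """
    counts = Counter(label)
    n = len(label)
    # Sum of -p * log2(p) over the observed character frequencies.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


if __name__ == "__main__":
    for name in ("google", "xkqzvbtrwp"):
        print(name, round(char_entropy(name), 3))
```

Labels made of many distinct, evenly spread characters score higher than labels that repeat a few characters, which is why such a measure is a plausible feature for algorithmically generated names; on its own it is not a classifier.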
16:15	Ly Vu Thi (Le Quy Don University, Viet Nam) Dong Van Tra (Technical Economic College, Viet Nam) Quang Uy Nguyen (Le Quy Don University, Viet Nam) Learning from Imbalanced Data for Encrypted Traffic Identification Problem SPEAKER: unknown ABSTRACT. Identifying encrypted application traffic represents an important issue for many network tasks including quality of service, firewall enforcement and security. Solutions should ideally be both simple – therefore efficient to deploy – and accurate. One of the challenging problems of classifying encrypted application traffic is the imbalanced property of network data. Usually, the amount of unencrypted traffic is much higher than the amount of encrypted traffic. To date, machine learning based approaches for identifying encrypted traffic have often focused purely on examining and improving algorithms. In this paper, we present a thorough analysis of the impact of various techniques for handling imbalanced data when machine learning approaches are applied to identifying encrypted traffic. The experiments are conducted on a well-known network traffic data set, and the results show that using some methods for addressing imbalanced data helps machine learning algorithms to achieve better performance on this problem.
16:35	Duy An Ha (National Chiao Tung University, Taiwan) Khả Thọ Nguyen (National Chiao Tung University, Taiwan) John K. Zao (National Chiao Tung University, Taiwan) Efficient Authentication of Resource-Constrained IoT Devices Based on ECQV Implicit Certificates and Datagram Transport Layer Security (DTLS) Protocol SPEAKER: unknown ABSTRACT. This paper introduces the design and implementation of a security scheme for the Internet of Things (IoT) based on ECQV Implicit Certificates and the Datagram Transport Layer Security (DTLS) protocol. In the proposed security scheme, the ECQV implicit certificate, which is based on elliptic curve cryptography, plays a key role in allowing mutual authentication and key establishment between two resource-constrained IoT devices. We present how IoT devices get ECQV implicit certificates and use them for authenticated key exchange in DTLS. An evaluation of the execution time of the implementation is also conducted to assess the efficiency of the solution.
16:55	Cuong Pham (Japan Advanced Institute of Science and Technology, Japan) Dat Tang (Japan Advanced Institute of Science and Technology, Japan) Ken-Ichi Chinen (Japan Advanced Institute of Science and Technology, Japan) Razvan Beuran (Japan Advanced Institute of Science and Technology, Japan) CyRIS: A Cyber Range Instantiation System for Facilitating Security Training SPEAKER: unknown ABSTRACT. Cyber ranges are well-defined controlled virtual environments used in cybersecurity training as an efficient way for trainees to gain practical knowledge through hands-on activities. However, creating an environment that contains all the necessary features and settings, such as virtual machines, network topology and security-related content, is not an easy task, especially for a large number of participants. Therefore, we propose CyRIS (Cyber Range Instantiation System) as a solution to this problem. CyRIS provides a mechanism to automatically prepare and manage cyber ranges for cybersecurity education and training based on specifications defined by the instructors. In this paper, we first describe the design and implementation of CyRIS, as well as its utilization. We then present an evaluation of CyRIS in terms of feature coverage compared to the Technical Guide to Information Security Testing and Assessment of the U.S. National Institute of Standards and Technology, and in terms of functionality compared to other similar tools. We also discuss the execution performance of CyRIS for several representative scenarios.
17:15	Qutaibah Malluhi (KINDI Lab, Qatar University, Qatar) Abdullatif Shikfa (KINDI Lab, Qatar University, Qatar) Viet Cuong Trinh (KINDI Lab, Qatar University, Viet Nam) An Efficient Instance Hiding Scheme SPEAKER: unknown ABSTRACT. Delegating computation, which is applicable to many practical contexts such as cloud computing or pay-TV systems, concerns the task where a computationally weak client wants to securely compute a very complex function $f$ on a given input with the help of a remote, computationally strong but untrusted server. The requirement is that the computation complexity of the client is much lower than that of $f$; ideally it should be constant time or in $NC^0$. This task has been investigated in several contexts such as instance hiding, randomized encoding, fully homomorphic encryption, garbling schemes, and verifiable schemes. In this work, we specifically consider the context where only the client has an input and gets an output, also called instance hiding. Concretely, we first give a survey of delegating computation; we then propose an efficient instance hiding scheme with passive input privacy. In our scheme, the computation complexity of the client is in $NC^0$ and that of the server is exactly the same as that of the original function $f$. Regarding communication complexity, the client in our scheme just needs to transfer $4\|f\|+\|x\|$ bits to the server, where $\|f\|$ is the size of the circuit representing $f$ and $\|x\|$ is the length of the input of $f$.

15:35-17:35 Session 6B: Natural Language Processing

Chair:

Xuan Hoai Nguyen (HANU, Viet Nam)

Location: Room Sunflower 2

15:35	Nhi-Thao Tran (University of Science, Viet Nam) Viet-Thang Luong (University of Science, Viet Nam) Ngan Luu-Thuy Nguyen (University of Information Technology, Viet Nam) Minh-Quoc Nghiem (University of Science, Viet Nam) Effective Attention-based Neural Architectures for Sentence Compression with Bidirectional Long Short-Term Memory SPEAKER: unknown ABSTRACT. We propose a novel model that applies an extension of the Long Short-Term Memory neural network to the sentence compression task. Our model concentrates only on the most relevant context of each word to avoid redundant information. It is based on two models that have recently been used successfully in neural machine translation. The first is the Bidirectional model, which can be trained using all the available input information in the past and future. The second is the Attention model, which focuses not only on the whole-sentence information but also on the particular context of each word in the sentence. Experimental results show that our model significantly outperforms the recent state-of-the-art method, the Bidirectional model and the Attention model on the Google sentence compression dataset.
15:55	Hy Nguyen (University of Science, Viet Nam) Tung Le (University of Science, Viet Nam) Viet-Thang Luong (University of Science, Viet Nam) Minh-Quoc Nghiem (University of Science, Viet Nam) Dien Dinh (University of Science, Viet Nam) The Combination of Similarity Measures for Extractive Summarization SPEAKER: unknown ABSTRACT. The key task in extractive summarization is to determine the importance of each sentence in the input. To effectively assess a sentence's significance, several recent studies have focused on comparing the similarity between sentences. Each comparison method has its own strengths and weaknesses. In this paper, we propose a combination of similarity measures for sentence comparison. Experiments conducted on both English and Vietnamese datasets demonstrate the efficiency of our proposed approach. Our model outperforms recent works in English with a significant improvement (9.4 ROUGE-2 F1-score) and achieves the state-of-the-art result in Vietnamese.
16:15	Nguyen Le Thanh (Saigon Hi-tech Park Incubation Center, Viet Nam) Toan Nguyen Xuan (University of Information Technology, Vietnam National University of HCMC, Viet Nam) Dien Dinh (University of Science, Vietnam National University of HCMC, Viet Nam) Vietnamese plagiarism detection method SPEAKER: Nguyen Le Thanh ABSTRACT. Nowadays, in the era of information technology development, plagiarism is becoming widespread in many areas of life, not only in Vietnam but all over the world. Meanwhile, the application of natural language processing to plagiarism detection has been studied and has made significant progress in recent years. However, Vietnamese plagiarism detection methods are still at a basic level, covering exact copy and near copy cases, and even foreign plagiarism detection software cannot detect Vietnamese plagiarism cases effectively, especially paraphrasing cases. In fact, violators can create plagiarism cases very easily by rewriting sentences using synonyms and words with similar meanings. In this paper, we propose a Vietnamese plagiarism detection method combining four methods: substring n-gram, LCS, CS and Fuzzy-based. Our model not only achieves good results in simple cases like exact copy and near copy but also detects paraphrasing cases effectively. The experimental results show that our model has 90% precision, 88.3% recall and 89.1% F-measure.
16:35	Ngo Xuan Bach (Posts and Telecommunications Institute of Technology, Viet Nam) Vu Thanh Hai (FPT Software Research Lab, Viet Nam) Tu Minh Phuong (Posts and Telecommunications Institute of Technology, Viet Nam) Cross-Domain Sentiment Classification with Word Embeddings and Canonical Correlation Analysis SPEAKER: unknown ABSTRACT. A common approach for automatic sentiment classification is using classifiers trained on labeled text data (reviews, blog posts etc.) to predict the sentiment polarity of new data. Because people express sentiment differently in different domains, this approach requires annotated corpora for each domain. However, annotating data for every domain of interest is laborious and impractical. In this paper, we address the domain adaptation problem for sentiment classification. We explore the effect of generic methods for feature learning and feature subspace mapping, namely word embeddings and canonical correlation analysis (CCA), on cross-domain sentiment classifiers. We show that by using only such rather generic methods, it is possible to get results very competitive with those of sophisticated methods specially developed for the considered problem. An advantage of using word embeddings and CCA is their availability out-of-the-box, which is important for the applicability of the proposed method. Experiments on a widely used benchmark dataset show that both word embeddings and CCA contribute to accuracy improvement and that their combination provides the best results.
16:55	Karim Sayadi (CHArt Laboratory EA 4004, Ecole Pratique des Hautes Etudes, PSL Research University, France) Quang Vu Bui (CHArt Laboratory EA 4004, Ecole Pratique des Hautes Etudes, PSL Research University and Hue University of Sciences, France) Marc Bui (CHArt Laboratory EA 4004, Ecole Pratique des Hautes Etudes, PSL Research University, France) Distributed Implementation of the Latent Dirichlet Allocation on Spark SPEAKER: unknown ABSTRACT. Latent Dirichlet Allocation (LDA) is one of the most used topic models for discovering complex semantic structure. However, for massive corpora of text LDA can be very slow and can require days or even months. This problem has created particular interest in parallel solutions, like the Approximate Distributed LDA (AD-LDA), where clusters of computers are used to approximate the popular Gibbs sampling used by LDA. Nevertheless, this solution has two main issues: first, it requires local copies on each partition of the cluster (this can be inconvenient for large datasets); second, read/write memory conflicts are common. In this article we propose an algorithm which can be considered an extension of the AD-LDA algorithm, in which we provide in-memory computation and good communication between the processors. The implementation of the algorithm was made possible with the syntax of Spark. We show empirically with a set of experiments that our parallel implementation with Spark has the same predictive power as the sequential version and achieves a considerable speedup. We finally document an analysis of the scalability of our implementation and the super-linearity that we obtained. We provide an open source version of our Spark LDA.
17:15	Hiroshi Suzuki (Waseda University, Japan) Reiko Hishiyama (Waseda University, Japan) An analysis of expert knowledge transmission using machine translation services SPEAKER: unknown ABSTRACT. Due to globalization, there is an increase not only in the use of machine translation services but also in the risk of failed knowledge transfer. This risk is especially notable when transmitting country-specific knowledge, know-how and cultural information to foreigners through a machine translation service. To overcome this risk, users need to adapt to the translation services, in addition to improvements in translation accuracy. Thus, this study focuses on the transmission effect of teaching technical knowledge to knowledge receivers with a different native language (Chinese) using a machine translation service. In the experiments, we simulated expert knowledge transmission classes and tasks with a system consisting of multilingual chat and a terminology corpus. By evaluating the transmission, this study reveals how well Chinese knowledge receivers could learn depending on the pattern of organizational structure and the level of detail of the information. Furthermore, by analyzing these results and the system data, this study reveals that symbols and numbers are used frequently in the groups with good results and, finally, proposes how to use machine translation in transmitting technical knowledge.