SOICT 2023: 12TH INTERNATIONAL SYMPOSIUM ON INFORMATION AND COMMUNICATION TECHNOLOGY
PROGRAM FOR FRIDAY, DECEMBER 8TH

08:30-09:50 Session 8: Keynote Speaker
Location: Conference Hall
08:30
Achieving Cloud Resiliency with Expert-Augmented AI

ABSTRACT. In the dynamic landscape of cloud computing, ensuring the resiliency of systems and applications is paramount for maintaining business continuity and minimizing downtime. This paper explores the integration of expert-augmented artificial intelligence (AI) as a strategic approach to enhance cloud resiliency. Leveraging the capabilities of automated monitoring, predictive analysis, and incident response, AI systems play a pivotal role in identifying and mitigating potential failures. The collaboration between AI and human expertise introduces a nuanced decision-making paradigm, where the strengths of each are harnessed for optimal outcomes in complex situations. The paper delves into the adaptive nature of AI, continually learning from incidents and improving resiliency over time. It also addresses the significance of redundancy strategies, adaptive security measures, and simulation testing in fortifying cloud infrastructures. Through this comprehensive exploration, the paper advocates for a holistic model that combines the efficiency of AI with the contextual understanding of human experts to achieve a robust and resilient cloud environment.

09:10
Heads-Up Computing: Towards The Next Generation Interactive Computing Interaction

ABSTRACT. Heads-up computing is an emerging concept in human-computer interaction (HCI) that focuses on natural and intuitive interaction with technology. By making technology more seamlessly integrated into our lives, heads-up computing has the potential to revolutionize the way we interact with devices. With the rise of large language models (LLMs) such as ChatGPT and GPT-4, the vision of heads-up computing is becoming much easier to realize. The combination of LLMs and heads-up computing can create more proactive, personalized, and responsive systems that are more human-centric. However, technology is a double-edged sword. While technology provides us with great power, it also comes with the responsibility to ensure that it is used ethically and for the benefit of all. That is why it is essential to place fundamental human values at the center of research programs and work collaboratively among disciplines. As we navigate through this historic transition, it is crucial to shape a future that reflects our values and enhances our quality of life.

09:50-10:20 Session 9: Multimedia Processing & Artificial Intelligence and Data Science for Smart Health and Medicine (Poster Session)
Deep Learning-based Prediction of Alertness and Drowsiness using EEG Signals

ABSTRACT. In this paper, we propose to apply two advanced deep learning models, a Bidirectional long short-term memory (LSTM) network and a Transformer, to predict the alertness or drowsiness of individuals using electroencephalogram (EEG) signals. In both models, we extract features from brainwave data, including alpha, beta, delta, and theta waves. For the Bidirectional LSTM model, we construct a four-layer deep learning network to learn brainwave features and classify human states. For the Transformer model, we use an encoder and replace the positional encoding with a vector representation of time to capture good features of sequential brainwave data. Experimental results demonstrate that the two proposed models provide very high accuracy, up to 99.55%, in predicting the alertness and drowsiness of individuals.
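Band features like those mentioned above are commonly computed as spectral power within fixed frequency ranges; a minimal numpy sketch, assuming a single-channel signal and the standard band boundaries (the paper's exact feature extraction may differ):

```python
import numpy as np

# Conventional EEG frequency bands in Hz (illustrative boundaries).
BANDS = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}

def band_powers(signal, fs):
    """Mean spectral power of a single EEG channel in each classic band."""
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    psd = np.abs(np.fft.rfft(signal)) ** 2 / len(signal)  # simple periodogram
    return {name: psd[(freqs >= lo) & (freqs < hi)].mean()
            for name, (lo, hi) in BANDS.items()}
```

A feature vector of these four band powers per window is what a Bi-LSTM or Transformer classifier would then consume.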

Enhancing Intensive Care Patient Prognostics with Machine Learning

ABSTRACT. This article addresses the challenge of predicting patient discharges and unplanned readmissions to the intensive care unit. Its primary objective is to improve decision-making for healthcare providers and administrators, facilitate resource allocation, and elevate the quality of patient care. We predict patient discharges and unplanned readmissions using the comprehensive Medical Information Mart for Intensive Care (MIMIC-III) database. Our approach employs a range of machine learning algorithms, including logistic regression, support vector machines (SVM), random forest, and others. These algorithms are trained to estimate the probabilities of both patient discharge and readmission, using a diverse set of variables including vital signs, demographic information, and age. For patient readmissions, the ensemble of Neural Network, SVM, and Random Forest models demonstrates exceptional performance, achieving an accuracy of 99%, while the Gradient Boosting model reaches 97%. For patient discharges, the Random Forest model is the most proficient, with an accuracy of approximately 79%. The findings underscore the effectiveness of these machine learning models in anticipating the likelihood of ICU patient readmissions and discharges, although the outcomes vary with the specific model employed. Collectively, these results hold the promise of augmenting patient care and refining administrative practices within medical clinics and hospitals.

Improving Depression Classification in Social Media Text with Transformer Ensembles

ABSTRACT. Depression is an emotional disorder that can affect any of us. In the era of the Fourth Industrial Revolution, social media platforms provide an opportunity for people to share their thoughts and feelings, which may reveal early signs of depression. Our goal is to classify depression levels based on social media posts, specifically from Reddit. Building on the existing literature, we propose a three-step method for this task. We leverage external data to pre-train RoBERTa, a general language model, into DepRoBERTa, a domain-specific language model. We also employ ensemble and preprocessing techniques to improve the performance of our method. We test our method on a dataset from “Data Set Creation and Empirical Analysis for Detecting Signs of Depression from Social Media Postings”, which consists of three depression-level labels: “Not depression”, “Moderate”, and “Severe”. We achieve a macro-averaged F1-score of 59.6% and an accuracy of 66.8%. These results underscore the practical value and utility of our approach in effectively predicting depression.

IncepSE: Leveraging InceptionTime's performance with Squeeze and Excitation mechanism in ECG analysis

ABSTRACT. Our study focuses on the potential of modifying Inception-like architectures within the ECG domain. To this end, we introduce IncepSE, a novel network that leverages the strengths of both InceptionTime and channel attention mechanisms. Furthermore, we propose a training setup that employs stabilization techniques aimed at tackling the formidable challenges of the severely imbalanced PTB-XL dataset and gradient corruption. By this means, we set a new benchmark for supervised deep learning models across the majority of tasks. Our model consistently surpasses InceptionTime and other state-of-the-art models in this domain by substantial margins, notably a 0.013 AUROC improvement on the "all" task, while also mitigating the inherent dataset fluctuations during training.

Efficient Machine Learning-based Gene Selection Exploiting Immune-related Biomarkers and Recursive Feature Elimination for Sepsis Diagnosis

ABSTRACT. Differential expression gene (DEG) analysis of transcriptomic data allows for a comprehensive examination of the regulation of gene expression profiles related to specific biological states. The result of this analysis typically consists of an extensive record of genes that display varying levels of expression among two or more groups. A portion of these genes with altered expression could potentially function as candidate biomarkers, chosen through either existing biological insights or data-driven techniques. In diagnosing sepsis, a life-threatening health problem, our work proposes a novel approach using immune-related gene data to identify the optimal gene combination as signature biomarkers to improve diagnosis performance. Our proposed method involves sequential gene selection procedures, including DEG analysis and machine learning-based importance assessment, and a Recursive Feature Elimination (RFE) process supported by Principal Component Analysis (PCA). The selected gene combination, which consists of twelve immune-related genes, shows remarkable cross-validation results with an accuracy of 99.35%, an AUC score of 99.56%, and Sensitivity and Specificity of 99.44% and 90.00%, respectively. Besides, the proposed 12 gene markers combined with the XGBoost algorithm were also tested in three individual cohorts with significant results, demonstrating the effectiveness of our developed method in different cohorts and the reliability of the proposed gene selection procedure.
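An RFE loop of the kind described above repeatedly refits a model and discards the least important feature; the sketch below uses a least-squares coefficient magnitude as an illustrative stand-in for the paper's machine-learning-based importance score:

```python
import numpy as np

def rfe(X, y, n_keep):
    """Minimal recursive feature elimination: repeatedly refit and drop
    the feature whose coefficient has the smallest magnitude (illustrative
    importance score; any model-based score can be substituted)."""
    keep = list(range(X.shape[1]))
    while len(keep) > n_keep:
        w, *_ = np.linalg.lstsq(X[:, keep], y, rcond=None)
        keep.pop(int(np.argmin(np.abs(w))))  # eliminate least important
    return keep
```

With a noise-free target that depends on only a subset of features, the loop recovers exactly that subset.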

OsteoGA: An Explainable AI Framework for Knee Osteoarthritis Severity Assessment

ABSTRACT. Knee osteoarthritis is among the most common joint disorders. Recent studies have investigated the application of Artificial Intelligence (AI) in automated diagnosis using knee joint X-ray images. However, these studies have primarily focused on diagnosing the severity of osteoarthritis without providing explanations for the underlying reasons that led to those results. In this paper, we present OsteoGA, an AI framework that focuses on the interpretability of the model in order to assist in the diagnosis of knee osteoarthritis. OsteoGA introduces a novel generative adversarial autoencoder model, called GAE, to reconstruct an assumed healthy knee joint image from the original image. The reconstruction process in OsteoGA combines image-to-image translation, data imputation, and adversarial learning to generate high-quality images that are consistent and coherent with the patient's image. Besides effectively diagnosing the severity of knee osteoarthritis, the OsteoGA framework also produces an anomaly map. This map highlights valuable information about the abnormal regions in X-ray images, offering additional insights to medical experts during the diagnosis process.

Optimizing Results in Aerial Images through Post-Processing Techniques on YOLOv7

ABSTRACT. Object detection in aerial images has garnered significant attention from the research community in recent years. The challenges posed by small objects, diverse orientations, and complex backgrounds have spurred extensive research efforts. In this paper, we focus on object detection in a YOLOv7-based framework, highlighting its limitations and proposing a post-processing method to enhance object detection results in cases of overlapping predicted regions. The analyses are extended and demonstrated to be effective on the UCAS AOD dataset.

Entangled topologies for quanvolutional neural networks in quantum image processing
PRESENTER: Vu Tuan Hai

ABSTRACT. Image classification and image processing, in general, are critical problems in computer vision. Recently, numerous classification techniques based on quantum machine learning have been proposed, such as quanvolutional neural network (QNN) - a hybrid quantum-classical model inspired by the convolutional neural network (CNN) which has the potential to process high-resolution images and outperforms current image processing techniques. In this article, we investigate the use of entangled topologies in QNN to extract features more efficiently. We also propose a training strategy for the quantum part of the QNN model. Our approach aims to improve the accuracy of QNN and address its scalability issues, which can hinder its practical use in real-world scenarios.

Evaluating the Symbol Binding Ability of Large Language Models for Multiple-Choice Questions in Vietnamese General Education

ABSTRACT. In this paper, we evaluate the ability of large language models (LLMs) to perform multiple choice symbol binding (MCSB) for multiple choice question answering (MCQA) tasks in zero-shot, one-shot, and few-shot settings. We focus on Vietnamese, which has fewer challenging MCQA datasets than English. The two existing datasets, ViMMRC 1.0 and ViMMRC 2.0, focus on literature. Recent research in Vietnamese natural language processing (NLP) has used the Vietnamese National High School Graduation Examination (VNHSGE) from 2019 to 2023 to evaluate ChatGPT, although these studies have mainly examined how ChatGPT solves the VNHSGE step by step. We aim to create a novel, high-quality dataset by providing structured guidelines for typing LaTeX formulas for mathematics, physics, chemistry, and biology. Because it is typed in a strict LaTeX style, this dataset can be used to evaluate the MCSB ability of both LLMs and smaller language models (LMs). We focus on predicting the character (A, B, C, or D) that is the most likely answer to a question, given its context. Our evaluation of six well-known LLMs, namely BLOOMZ-7.1B-MT, LLaMA-2-7B, LLaMA-2-70B, GPT-3, GPT-3.5, and GPT-4.0, on the ViMMRC 1.0 and ViMMRC 2.0 benchmarks and our proposed dataset shows promising results on the MCSB ability of LLMs for Vietnamese. The dataset is available for research purposes only.

Analyze video action with a deep learning model based on space-time and context information

ABSTRACT. Action detection aims to determine the temporal boundaries during which actions occur in untrimmed videos. In the real world, each moment in a video can involve multiple actions, making it challenging to detect and label actions accurately. In recent years, the effective interaction between short-term and long-term information has been an important step forward for extracting significant factors from temporal information. However, one aspect that can provide useful context information has not been highlighted in previous studies: the use of features related to the region of interest (RoI). Adding information about regions containing major interactions, such as human RoI features, can help focus the model on the most important parts of the temporal data. This is especially useful in complex situations and helps predict actions more accurately. However, when applying RoI features, some frames in the video may be missed, which can lead to the loss of important information and hurt predictive performance. In this work, we propose a Context Normalization Module to address this problem. In summary, we make three main contributions: proposing an enhanced framework that combines spatial-temporal information and context information, applying RoI features (specifically, human features) as context information to improve the performance of the model, and applying KNN to handle information loss when extracting RoI information.

A New Coarse-To-Fine 3D Face Reconstruction Method Based On 3DMM Flame and Transformer: CoFiT-3D FaRe

ABSTRACT. The process of 3D face reconstruction has the remarkable capability to recover 3D information from a single human facial image, enabling its utilization in accelerated 3D modeling, facial animation, and enhanced face recognition performance. Currently, the predominant approach in 3D reconstruction employs deep learning Convolutional Neural Networks (CNNs) to forecast intermediary representations of three-dimensional facial structures, including sets of 3D Morphable Model (3DMM) parameters or UV maps. While these methods demonstrate promise, they still encounter limitations when confronted with occlusions stemming from factors such as varying viewpoints and the presence of accessories. Furthermore, these techniques heavily depend on pre-existing linear 3DMM, which restricts the utilization of input image data. In this paper, we introduce CoFiT-3D FaRe, a novel Coarse-to-Fine methodology for reconstructing 3D human faces. This approach makes use of the Dual Vision Transformer network (DaViT) and 3DMM Flame. The DaViT proficiently captures both local features and global context within the images, effectively overcoming limitations imposed by CNNs. In the initial coarse stage, the DaViT is employed to regress 3DMM parameters from the input image, facilitating the reconstruction of 3D human face’s shape and texture at a coarse level. During this phase, self-supervised learning techniques and widely acknowledged loss functions are employed, akin to previous research endeavors. While the 3DMM provides valuable three-dimensional information, its linear nature imposes constraints on achieving a realistic representation. To tackle this issue, the subsequent fine reconstruction stages extract more intricate information from the image and refine the outcomes of the initial coarse reconstruction. 
The paper employs detail reconstruction block of DECA (Detailed Expression Capture and Animation) for precise shape reconstruction and introduces an innovative approach for accurate texture reconstruction. This enables the restoration of occluded regions and the retention of intricate facial features. Our experiments on the NoW evaluation set demonstrate competitive outcomes in both shape and texture reconstruction when compared to state-of-the-art (SOTA) solutions.

Bayesian method for bee counting with noise-labeled data

ABSTRACT. Bee counting is an essential task for monitoring the health of bee colonies. However, it is challenging, as bees are often small and difficult to see. One approach to bee counting is identifying individuals and using that information to count. However, this approach can be unreliable if the data are noisy, meaning some labels are incorrect. This paper presents a novel approach to bee counting with noise-labeled data. The proposed method uses a density-based counter that is robust with noisy data. The results of experiments show that the proposed method can achieve significantly higher accuracy than the detection-based approach for bee counting with noise-labeled data. The proposed method can improve the accuracy of bee counting in various scenes, such as different weather conditions and the density of bees. The approach could also be used to develop new tools for monitoring bee populations and tracking changes in their abundance. Source codes of the proposed method and the dataset are publicly available on GitHub.

A Novel Approach to Streaming QoE Score Calculation by Integrating Error Impacts

ABSTRACT. Video streaming services have become prominent in the last decade. Like any other cloud service, these services are error-prone, and errors during startup and/or playback affect the viewing experience of end-users. Hence, the calculation of Quality-of-Experience (QoE) scores should also account for error impacts. In this paper, we introduce a player-based error classification scheme, which classifies errors based on origin and severity. We use this scheme to quantify the quality degradation due to errors, and propose to improve the QoE score by integrating these quality factors. We instrument the open-source media players dash.js and Exoplayer in our proposed system, which follows the guidelines of various multimedia streaming standards. We define several scenarios focusing on different QoE influencing factors, and assess our proposed model’s performance. Comparisons with various state-of-the-art QoE models show that our model better captures the effect on user experience in scenarios with player-related errors.

LiPoSeg: A Lightweight Encoder-Decoder Network for LiDAR-based Road-Object Semantic Segmentation

ABSTRACT. LiDAR point cloud segmentation is one of the most challenging tasks in autonomous driving systems, as it requires a cutting-edge perception method that is accurate, low-cost, and real-time. In the last decade, several deep encoder-decoder neural networks such as PointSeg were proposed for semantic segmentation of point clouds with noticeable achievements, but they still suffered from high complexity and limited compatibility with resource-constrained devices. With a keen awareness of these limitations, this paper introduces LiPoSeg, an intricately designed network inspired by PointSeg. LiPoSeg significantly reduces the network size with almost no accuracy degradation. In particular, two vital innovations characterize the LiPoSeg architecture: (i) a grouped enlargement layer that replaces regular convolution layers with grouped convolution layers, reducing complexity while maintaining good segmentation accuracy, and (ii) a novel region-wise attention mechanism that captures locally meaningful features. The combination of these two innovations results in a proposed network that is 63% smaller than PointSeg in terms of network size (i.e., the total number of learnable parameters). Remarkably, this reduction is achieved with only a 1% decrease in segmentation accuracy when evaluated on the WPI and Panda datasets for LiDAR-based road-object semantic segmentation.
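The parameter saving from grouped convolution follows from simple arithmetic: each of the g groups connects only c_in/g input channels to c_out/g output channels. A minimal sketch (bias terms ignored; the layer sizes below are illustrative, not LiPoSeg's actual configuration):

```python
def conv_params(c_in, c_out, k, groups=1):
    """Learnable weights in a k x k 2-D convolution: each of the `groups`
    groups maps c_in/groups input channels to c_out/groups output channels,
    so the count shrinks by a factor of `groups` versus a regular conv."""
    assert c_in % groups == 0 and c_out % groups == 0
    return groups * (c_in // groups) * (c_out // groups) * k * k
```

For example, a 64-to-64 channel 3x3 convolution drops from 36,864 weights to 4,608 with 8 groups, an 8x reduction.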

Language Knowledge-Assisted in Topology Construction for Skeleton-Based Action Recognition

ABSTRACT. Skeleton-based action recognition is a challenging problem due to the high dimensionality and noisy nature of skeleton data. Graph convolution networks (GCNs), which use graph topology to extract representative features, have been effective for skeleton-based action recognition in recent years. However, learning topology information and aggregating features effectively remains challenging. In this work, we propose a strategy for constructing topology representations for skeleton-based action recognition that uses language knowledge to learn the topology. Specifically, we borrow the idea from Language Knowledge-Assisted Representation Learning (LA-GCN) [20], which uses a large language model (LLM) to learn an a priori global relationship (GPR) topology that captures the global structural relationships between the joints, and an a priori category relationship (CPR) topology between nodes in the skeleton graph that captures the category-specific relationships between the joints. We propose to apply the GPR topology to Channel-wise Topology Refinement Graph Convolution (CTR-GCN) [4] as a prior topology, which provides strong momentum for learning the model, along with the CPR topology, which is used to learn class-distinguishable features. The proposed approach is evaluated on the NTU RGB+D and NW-UCLA datasets and achieves promising results, with 97% on the NTU RGB+D dataset and 96.76% on the NW-UCLA dataset on the cross-view benchmark.

Lite FPN_SSD: A Reconfiguration SSD with Adapting Feature Pyramid Network Scheme for Small Object Detection

ABSTRACT. Detecting small objects poses a significant challenge in computer vision because of their low resolution and fuzzy feature representation. Although one-stage detection techniques alleviate the problem caused by scale differences to some extent, they also retain redundant features, resulting in resource wastage and slower processing speeds. This research first elucidates the SSD algorithm; a new improved framework, Lite FPN-SSD (Lite Single Shot Multibox Detector with Adapting Feature Pyramid Network), is then proposed to address the weaknesses of the SSD algorithm. Lite FPN-SSD is built upon the popular FPN and SSD architectures to create a learnable fusion scheme that controls the feature information that deep layers deliver to shallow layers. Its lightweight nature, with a minimal increase in parameters, ensures high efficiency for real-time applications. Extensive experiments conducted on the VOC, VEDAI, and SOHAS datasets demonstrate impressive results for the proposed models in comparison with the original SSD and its other variations. In particular, by adding only 2 million parameters, the proposed model achieves a mean average precision (mAP) of 78.36% on the VOC dataset, close to another architecture that achieved 78.40% mAP but requires more than 2.6 million parameters.

10:20-12:00 Session 10A: AI Foundation and Big Data
Location: Conference Hall
10:20
A Generalized Autorec Framework Applying Content-based Information for Resolving Data Sparsity Problem

ABSTRACT. Recent years have witnessed several neural network-based collaborative filtering systems that yield immense success in providing users with personalized suggestions. These systems, however, severely suffer from the data sparsity problem. As in other neural network-based collaborative filtering systems, sparse rating information also affects Autorec at the input layer. This paper presents a generalized hybrid Autorec framework that can accept a variety of content-based representations as input to create a more efficient system. To fully employ the capability of the Autoencoder architecture in compressing information, rating vectors compressed via the conventional Autorec, together with content-based information, form a robust hybrid input for our framework to resolve the sparsity problem. Empirical experiments demonstrate the state-of-the-art performance of our system compared to other hybrid recommendation systems.

10:40
Sharpness and Gradient Aware Minimization for Memory-based Continual Learning

ABSTRACT. Continual Learning (CL) is an emerging machine learning problem that requires deep neural networks to learn new data without catastrophically forgetting previously learnt data, similar to the process of knowledge accumulation in humans. Among CL methods, one group of works, the replay-based approach, preserves performance on old data by storing a small buffer of seen samples to re-learn alongside current data. Despite their impressive results, these methods may still reach sub-optimal solutions by overfitting the training data, especially the limited buffer. This can be attributed to their use of empirical risk minimization, which directly minimizes the empirical loss on training data. To overcome this problem, we leverage Sharpness Aware Minimization (SAM), a recently proposed training technique, to improve models' generalization, and thus CL performance. In particular, SAM seeks flat minima whose neighbors' loss values are also low by guiding a model toward the SAM gradient direction corresponding to low-loss, flat regions. However, we conjecture that directly applying SAM to replay-based CL methods, whose loss function contains multiple objectives, may cause gradient conflict among them. We therefore propose to manipulate each objective's SAM gradient so that their potential conflict is minimized. Finally, through extensive experiments, we empirically verify our hypothesis and show consistent improvements of our method over strong memory-replay baselines.
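The two-step SAM update described above can be sketched generically: ascend to the worst-case neighbour within an L2 ball of radius rho, then descend from the original weights using the gradient taken at that neighbour. A minimal numpy sketch of this generic update (the paper's CL-specific per-objective gradient manipulation is not reproduced here):

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One generic Sharpness-Aware Minimization update."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascent to worst-case neighbour
    g_sam = grad_fn(w + eps)                     # gradient at the perturbed point
    return w - lr * g_sam                        # descend from the original weights
```

On a simple quadratic loss, iterating this update drives the weights toward the (flat, zero-loss) minimum.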

11:00
Tensor Analysis Of Convolutional Neural Network For Reducing Network Parameters

ABSTRACT. Convolutional Neural Networks have demonstrated excellent ability in image recognition, yet they frequently have significant computational and memory requirements. To address this problem, it is advantageous to create a tensor factorization framework that can effectively compress networks. With its ability to express high-dimensional tensors using a smaller set of core tensors and fewer parameters, the Tensor-Train (TT) structure is a viable choice for compressing neural networks. The choice of appropriate TT-ranks, however, still lacks theoretical guarantees. In our study, we introduce a method that employs the widely recognized heuristic network-slimming structure through batch normalization techniques. This method serves as a metric to gauge the requisite parameter count, reflecting the complexity of the training data as captured by the original CNNs. Our focus remains on the TT-rank, the scale parameter whose choice is facilitated by the estimated size. This approach provides a comprehensive strategy for determining network compression requirements and optimizing resources. Our results reveal that our calculated TT-ranks lead to substantial reductions in computational complexity and memory usage, while maintaining competitive accuracy compared to baseline CNNs.
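The Tensor-Train structure mentioned above can be computed with the standard TT-SVD procedure, which the sketch below illustrates with a uniform maximum TT-rank; this is a generic illustration, not the paper's rank-selection method:

```python
import numpy as np

def tt_svd(tensor, max_rank):
    """Tensor-Train factorization via sequential truncated SVDs:
    returns one 3-way core (r_prev, n_k, r_k) per tensor mode."""
    shape, d = tensor.shape, tensor.ndim
    cores, r = [], 1
    rest = tensor.reshape(r * shape[0], -1)
    for k in range(d - 1):
        u, s, vt = np.linalg.svd(rest, full_matrices=False)
        rk = min(max_rank, len(s))
        cores.append(u[:, :rk].reshape(r, shape[k], rk))
        rest = s[:rk, None] * vt[:rk]  # carry the remainder forward
        r = rk
        if k < d - 2:
            rest = rest.reshape(r * shape[k + 1], -1)
    cores.append(rest.reshape(r, shape[-1], 1))
    return cores

def tt_to_tensor(cores):
    """Contract the TT cores back into a full tensor."""
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([-1], [0]))
    return out.reshape([c.shape[1] for c in cores])
```

When the chosen TT-ranks are at least the tensor's true TT-ranks, the reconstruction is exact; smaller ranks trade accuracy for fewer parameters, which is the compression discussed in the abstract.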

11:20
GAT-FP: Addressing Imperfect Multimodal Learning using Graph Attention Networks and Feature Propagation

ABSTRACT. Multimodal learning tries to increase generalization performance by leveraging information from several data modalities. However, effectively integrating such multi-modality data remains a complex endeavor, particularly when dealing with incomplete data. In real-world scenarios, modalities are often not entirely absent but rather incomplete due to various external or internal factors. For instance, audio data can be corrupted by noise, and text data may suffer from inaccuracies stemming from automatic speech recognition errors. To address these challenges, we propose a novel framework for incomplete multimodal learning in conversational contexts, named GAT-FP. Our GAT-FP model incorporates two key graph neural network-based modules, namely, “Feature Propagation” and “Graph Attention Network”. These modules are designed to estimate missing features and discern the significance of interactions among incomplete feature nodes within the graph structure. To validate the effectiveness of our approach, we conduct extensive experiments on two well-established benchmark multimodal conversational datasets: IEMOCAP and MELD. The experimental results demonstrate that our GAT-FP model surpasses existing state-of-the-art methods in the realm of incomplete multimodal learning.
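Generic feature propagation of the kind the “Feature Propagation” module builds on can be sketched as a diffusion that repeatedly averages neighbour features while clamping the observed ones; the sketch below is an illustrative simplification, not the exact GAT-FP formulation:

```python
import numpy as np

def propagate_features(adj, x, known, n_iters=50):
    """Impute missing node features on a graph: unknown rows are repeatedly
    replaced by the average of their neighbours' features, while rows whose
    features are observed (known[i] == True) are kept fixed."""
    x = x.copy()
    x[~known] = 0.0                               # initial guess for missing rows
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)
    for _ in range(n_iters):
        x_new = adj @ x / deg                     # neighbour average
        x_new[known] = x[known]                   # clamp observed features
        x = x_new
    return x
```

On a 3-node path graph with the middle node's feature missing, the imputed value converges to the mean of its two observed neighbours.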

11:40
High Dimensional Data Classification Approach with Deep Learning and Tucker Decomposition

ABSTRACT. Along with the success of deep learning come extensive models and huge amounts of data. Therefore, efficiency in deep learning is one of the most pressing problems. Many methods have been proposed to reduce the complexity of models and have achieved promising results. In this paper, we look at this problem from the data perspective. By leveraging the strengths of tensor methods in data processing as well as the efficiency of deep learning models, we aim to reduce costs in many aspects such as storage and computation. We present a data-driven deep learning approach for high-dimensional data classification problems. Specifically, we use Tucker Decomposition, a tensor decomposition method, to factorize large, complexly structured raw data into small factors and use them as the input of a lightweight deep model architecture. We evaluate our approach on high-dimensional, complex datasets: video classification on the Jester dataset and 3D object classification on the ModelNet dataset. Our proposal achieves competitive results at a reasonable cost.
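A Tucker Decomposition of the kind used here factorizes a tensor into a small core plus one factor matrix per mode; a minimal truncated higher-order SVD (HOSVD) sketch, which is one common way to compute it (the paper may use a different solver):

```python
import numpy as np

def unfold(t, mode):
    """Mode-`mode` matricization of a tensor."""
    return np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)

def hosvd(tensor, ranks):
    """Truncated HOSVD: one orthonormal factor per mode from the SVD of the
    mode unfolding, then the core obtained by projecting onto those factors."""
    factors = []
    for mode, r in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :r])
    core = tensor
    for mode, u in enumerate(factors):
        core = np.moveaxis(np.tensordot(u.T, np.moveaxis(core, mode, 0), axes=1), 0, mode)
    return core, factors

def tucker_to_tensor(core, factors):
    """Multiply the core by each factor along its mode to rebuild the tensor."""
    out = core
    for mode, u in enumerate(factors):
        out = np.moveaxis(np.tensordot(u, np.moveaxis(out, mode, 0), axes=1), 0, mode)
    return out
```

The small core (and optionally the factors) then serves as the compact input to a downstream model, which is the cost reduction the abstract describes.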

10:20-12:00 Session 10B: Artificial Intelligence and Data Science for Smart Health and Medicine
Location: Tulip
10:20
Automatic Surgical Step Recognition Using Video and Kinematic Data for Intelligent Operating Room

ABSTRACT. With the continuous development of intelligent operating room systems, the segmentation and automatic recognition of surgical workflow have become challenging research fields. In recent years, an increasing number of models have been proposed to address this challenge, with deep learning becoming the mainstream approach. In this paper, we propose a multi-stage network for surgical step recognition using surgical video and kinematic data. First, a convolutional neural network (ResNet34) is used to extract visual features from video frames. Next, since surgical videos are a form of sequential data, a Temporal Convolutional Network (TCN) is employed as a temporal extractor to process temporal information between video frames for classification. Finally, a multi-stage TCN network, consisting of Encoder-Decoder TCN and Dilated TCN architectures, is used to refine the result. The proposed network is compared against an LSTM network from our prior work and is evaluated on a surgical dataset named MISAW in two modes: video data with and without kinematic data. Experimental results indicate that kinematic data is crucial for robot motion control in the operating rooms of the future. The technology will also find application in robotic labs for the development and optimization of chemical manufacturing processes.

10:40
A Novel Approach for Extracting Key Information from Vietnamese Prescription Images

ABSTRACT. In this paper, we present a novel approach for the task of Key Information Extraction (KIE) from medical prescriptions, a crucial endeavor in healthcare systems. We identify and address the limitations of existing models, including ViBERTGrid and LayoutLMv2, by integrating a series of architectural refinements. Specifically, our model replaces the conventional Convolutional Neural Network (CNN) block with a Swin Transformer to achieve superior spatial recognition. Furthermore, we employ PhoBERT in place of BERT to offer enhanced performance on prescriptions in Vietnamese. The architecture is further optimized by eliminating the redundant semantic segmentation head, streamlining the overall model. Empirical evaluations on the VAIPE$_P$ dataset—a comprehensive compilation of prescriptions from multiple leading Vietnamese healthcare institutions—indicate that our model sets a new benchmark in the field. It outperforms the existing state-of-the-art by achieving an average F1 score improvement of $0.58\%$ over LayoutLMv2 and $1.05\%$ over ViBERTGrid. Our findings demonstrate the robustness and effectiveness of the proposed method, making it a viable candidate for real-world healthcare applications.

11:00
Combining Deep Learning and Medical Knowledge to Detect Cardiomegaly and Pleural Effusion in Chest X-ray Diagnosis

ABSTRACT. X-ray imaging plays a crucial role in diagnosing various medical conditions, especially those affecting the respiratory and cardiovascular systems. However, interpreting X-ray images can be time-intensive for radiologists. This paper addresses this challenge by developing algorithms to assist radiologists in identifying two specific anomalies in chest X-ray images: cardiomegaly and pleural effusion. Additionally, a key focus is to enhance the understandability and trustworthiness of AI-generated results for medical professionals. To achieve this, we merge deep learning techniques with medical expertise to transform the detection of cardiomegaly and pleural effusion in chest X-rays. We introduce precise U-Net-based segmentation algorithms that delineate critical structures like the heart, lungs, and diaphragm. Furthermore, we propose a novel algorithm to calculate the cardiothoracic ratio, improving cardiomegaly detection accuracy. We also present a method for measuring costophrenic angles to aid in pleural effusion diagnosis and introduce the innovative pneumophrenic contact rate concept for assessing pleural effusion severity. Our performance evaluations reveal superior results compared to the YOLOv5 model, with precision/recall rates of 78%/93% for cardiomegaly and 72%/93% for pleural effusion. This research advances chest X-ray diagnostics, promising more precise disease identification and facilitating AI integration into clinical practice.
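The cardiothoracic ratio computed from such segmentations is conventionally the maximum transverse width of the heart divided by the maximum internal width of the thorax. A minimal sketch on toy binary masks (the helper names and masks are made up for illustration; the paper's exact measurement procedure may differ):

```python
def max_width(mask):
    """Widest horizontal extent (in pixels) of a binary mask given as rows."""
    best = 0
    for row in mask:
        cols = [c for c, v in enumerate(row) if v]
        if cols:
            best = max(best, cols[-1] - cols[0] + 1)
    return best

def cardiothoracic_ratio(heart_mask, thorax_mask):
    return max_width(heart_mask) / max_width(thorax_mask)

# Toy 4x10 masks: the heart is 4 px wide at its widest, the thorax 8 px.
thorax = [[0, 1, 1, 1, 1, 1, 1, 1, 1, 0]] * 4
heart = [[0, 0, 0, 1, 1, 1, 1, 0, 0, 0]] * 4
print(cardiothoracic_ratio(heart, thorax))  # -> 0.5
```

A ratio above roughly 0.5 is the classic rule of thumb for suspected cardiomegaly on a frontal chest film, which is why an accurate ratio computation directly helps the detection task.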

11:20
Multiclass Skin Disease Classification within Dermoscopic Images Using Deep Neural Networks

ABSTRACT. Computer-aided skin lesion classification has been gaining attention in dermoscopy and skin cancer diagnosis, as early detection reduces the complexity of the treatment process. Various techniques utilizing the power of deep neural networks have emerged in recent years to tackle this problem. In this study, we evaluate the effects of different preprocessing methods as well as classifiers on the multiclass categorization of seven skin diseases, with InceptionV3 used as our backbone. These methods are evaluated on the ISIC 2018 Task 3 dataset, on which we achieve 76.32% accuracy after employing segmentation with dilating and cropping on our training dataset. This study reveals that proper image preprocessing can enhance the models' performance in classifying skin diseases, and that deep learning has great potential for assisting human experts in repetitive tasks.

11:40
ConvTransNet: Merging Convolution with Transformer to Enhance Polyp Segmentation

ABSTRACT. Colonoscopy is widely acknowledged as the most efficient screening method for detecting colorectal cancer and its early stages, such as polyps. However, the procedure faces challenges with high miss rates due to the heterogeneity of polyps and the dependence on individual observers. Therefore, several deep learning systems have been proposed given the criticality of polyp detection and segmentation in clinical practice. While existing approaches have shown advancements in their results, they still possess important limitations. Convolutional Neural Network (CNN)-based methods have a restricted ability to leverage long-range semantic dependencies. On the other hand, transformer-based methods struggle to learn the local relationships among pixels. To address this issue, we introduce ConvTransNet, a novel deep neural network that combines the hierarchical representation of vision transformers with comprehensive features extracted from a convolutional backbone. In particular, we leverage the features extracted from two powerful backbones: ConvNeXt as the CNN-based backbone and Dual Attention Vision Transformer as the transformer-based one. By incorporating multi-stage features through residual blocks, ConvTransNet effectively captures both global and local relationships within the image. Through extensive experiments, ConvTransNet demonstrates impressive performance on the Kvasir-SEG dataset, achieving a Dice coefficient of 0.928 and an IoU score of 0.882. Additionally, when compared to previous methods on various datasets, ConvTransNet consistently achieves competitive results, showcasing its effectiveness and potential.
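For reference, the two metrics reported above have simple set-overlap definitions; a minimal sketch for flat binary masks (illustrative helper, not from the paper):

```python
def dice_iou(pred, target):
    """Dice coefficient and IoU for two flat binary masks (lists of 0/1)."""
    inter = sum(p and t for p, t in zip(pred, target))
    p_sum, t_sum = sum(pred), sum(target)
    union = p_sum + t_sum - inter
    dice = 2 * inter / (p_sum + t_sum) if (p_sum + t_sum) else 1.0
    iou = inter / union if union else 1.0
    return dice, iou

# One true positive, one false positive, one false negative.
print(dice_iou([1, 1, 0, 0], [1, 0, 1, 0]))  # -> (0.5, 0.3333333333333333)
```

Note that Dice is always at least as large as IoU on the same prediction, which matches the 0.928 versus 0.882 scores reported.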

10:20-12:00 Session 10C: Multimedia Processing
10:20
MCLDA: Multi-level Contrastive Learning for Domain Adaptive Semantic Segmentation

ABSTRACT. In this paper, we present a domain adaptation method for semantic segmentation in the context of autonomous vehicles. Our method builds upon a self-training approach with a teacher-student architecture and integrates multi-level contrastive learning mechanisms. More specifically, we apply contrastive learning at two levels, the feature level and the output level, which encourages pixel representations of the same category to cluster together while promoting dispersion between different categories. Additionally, we adapt a mixing technique to address the dataset imbalance problem. The central concept involves the iterative copying of pixels from common classes in source images and pasting them into target images based on their occurrence frequency. The experiments conducted on the widely used GTA5 -> Cityscapes benchmark yield a mIoU score of 56.76%, comparable with the best methods using the same segmentation architecture, DeepLabv2 (ResNet-101). It is worth noting that our approach performs notably well on difficult classes such as 'truck' and 'bus', with respective scores of 56.7% and 57.0%. This further underscores the effectiveness of our proposal.
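Contrastive objectives of this kind are typically instances of the InfoNCE loss: for each anchor, one positive and several negatives compete in a softmax over similarities. A minimal single-anchor sketch (illustrative only; the paper's exact formulation, temperature, and sampling scheme are not specified here):

```python
from math import exp, log

def info_nce(sim_pos, sim_negs, temperature=0.1):
    """Contrastive loss for one anchor: pull the positive similarity up,
    push negative similarities down. Inputs are e.g. cosine values."""
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    m = max(logits)                       # subtract the max for stability
    denom = sum(exp(l - m) for l in logits)
    return -(logits[0] - m - log(denom))

# An anchor close to its positive and far from the negatives has a tiny loss.
print(round(info_nce(0.9, [0.1, -0.2]), 4))  # -> 0.0004
```

Averaging this loss over sampled pixels at both the feature and output levels yields exactly the cluster-and-disperse behavior the abstract describes.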

10:40
PointGANet: A Lightweight 3D Point Cloud Learning Architecture for Semantic Segmentation

ABSTRACT. PointNet++ has gained significant acknowledgement for its point cloud processing capabilities. Over time, various network improvements have been developed to enhance its global learning efficiency, thus boosting the correct segmentation rate. However, these improvements have often resulted in a significant increase in complexity, i.e., in model size and processing time. Meanwhile, improvements that focus on complexity reduction while preserving accuracy have been relatively scarce, particularly compared to some simpler models like SqueezeSegV2. To overcome this challenge, we develop a compact version of the PointNet++ model, namely PointGANet, tailored specifically for three-dimensional point cloud semantic segmentation. In PointGANet, we introduce a grouped attention mechanism in the encoder, with grouped convolution incorporated with element-wise multiplication to enrich the feature extraction capability and emphasise relevant features. In the decoder, we replace unit pointnet modules with mini pointnet modules to save a massive number of trainable parameters. Through rigorous experimentation, we successfully fine-tune the network to obtain a significant reduction in model size while maintaining accuracy, resulting in a substantial enhancement in overall performance. Remarkably, in an intensive evaluation on the DALES dataset, PointGANet is approximately five times more lightweight than the original PointNet++, with noteworthy improvements in mean accuracy by 9.637% and mean IoU by 9.964%. These innovations open up exciting possibilities for developing point cloud segmentation applications on IoT and resource-constrained devices.
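The parameter saving from grouped convolution is easy to quantify: splitting the channels into g groups divides the weight count by g. A quick sketch (the helper and layer sizes are hypothetical, not PointGANet's actual configuration):

```python
def conv_params(c_in, c_out, k, groups=1):
    """Trainable weights in a 1D convolution: each of the `groups` groups
    connects c_in/groups input channels to c_out/groups output channels."""
    assert c_in % groups == 0 and c_out % groups == 0
    return groups * (c_in // groups) * (c_out // groups) * k

full = conv_params(64, 64, 3)               # 12288 weights
grouped = conv_params(64, 64, 3, groups=4)  # 3072 weights, 4x fewer
print(full, grouped)
```

Combining this with element-wise attention multiplications (which add almost no parameters) is what makes the grouped attention encoder cheap relative to the accuracy it recovers.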

11:00
A Shrinkage Method for Learning, Registering and Clustering Shapes of Curves

ABSTRACT. This paper introduces a shrinkage statistical model for analyzing, registering, and clustering multi-dimensional curves. The model utilizes reparametrization functions that act as local distributions on curves. Given the intricate nature of the model, we establish a connection with well-understood Riemannian manifolds. This connection enables us to simplify the reparametrization space and enhance the manageability of the optimization task. Moreover, we provide empirical evidence of the practical usefulness of our proposed method by applying it to a potential application involving the clustering of hominin cochlear shapes. Looking ahead, our research interests lie in developing theoretical extensions that can accommodate more complex spaces. By exploring new aspects of manifold learning and inference on high-dimensional manifolds, we aim to advance the field further.

11:20
Abusive Span Detection for Vietnamese Narrative Texts

ABSTRACT. Abuse in its various forms, including physical, psychological, verbal, sexual, financial, and cultural, has a negative impact on mental health. However, there are limited studies on applying natural language processing (NLP) in this field in Vietnam. Therefore, we aim to contribute by building a human-annotated Vietnamese dataset for detecting abusive content in Vietnamese narrative texts. We sourced these texts from VnExpress, Vietnam's popular online newspaper, where readers often share stories containing abusive content. Identifying and categorizing abusive spans in these texts posed significant challenges during dataset creation, but it also motivated our research. We experimented with lightweight baseline models by freezing PhoBERT and XLM-RoBERTa and using their hidden states in a BiLSTM to assess the complexity of the dataset. According to our experimental results, PhoBERT outperforms other models in both labeled and unlabeled abusive span detection tasks. These results indicate that it has the potential for future improvements.

11:40
Improving Single Positive Multi-label Classification via Knowledge-based Label-weighted Large Loss Rejection

ABSTRACT. It is undeniable that high-quality data plays a crucial role in achieving good outcomes. However, obtaining such a dataset is always challenging, particularly in multi-label classification, where traditional approaches require a fully labeled dataset. This challenge has led to the emergence of several effective learning techniques, collectively called single positive multi-label learning (SPML), which utilize multi-label training images annotated with only a single positive label. In our work, we propose an effective method that improves the cutting-edge BoostLU baseline by leveraging an efficient label-weighted loss and prior-knowledge-based regularization strategies. Firstly, we present a novel approach for reweighting the contribution of each label to the total loss, giving higher weight to the true positive label while eliminating unreliable pseudo-negative labels by assigning them zero weight. This is achieved through the exploration of large per-label losses. Additionally, we introduce an auxiliary loss to regulate the expected number of positive labels, encouraging the model to predict a reasonable number of positive labels per image based on prior knowledge about the dataset. Experimental results across several benchmark datasets showcase the superior performance of our method compared to both the baseline and other state-of-the-art methods.
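The large-loss rejection idea can be sketched in a few lines: in SPML every unobserved label is treated as negative, so an assumed-negative label with an unusually large loss is probably a false negative and should be zero-weighted. The sketch below is a deliberate simplification for illustration; the paper's actual weighting scheme and BoostLU integration are richer than this:

```python
def large_loss_rejection(losses, observed_positive, reject_k):
    """Weight per-label losses: the observed positive keeps full weight;
    the reject_k largest losses among assumed-negative labels get weight 0
    (treated as unreliable pseudo-negatives); all others keep weight 1."""
    neg = [i for i in range(len(losses)) if i != observed_positive]
    rejected = set(sorted(neg, key=lambda i: losses[i], reverse=True)[:reject_k])
    return [0.0 if i in rejected else 1.0 for i in range(len(losses))]

# Label 0 is the observed positive; label 2 has a suspiciously large loss.
print(large_loss_rejection([0.3, 0.1, 2.5, 0.4], observed_positive=0, reject_k=1))
# -> [1.0, 1.0, 0.0, 1.0]
```

In practice reject_k (or an equivalent threshold) is scheduled over training, rejecting more aggressively as the model's loss estimates become trustworthy.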

13:30-14:10 Session 11: Keynote Speaker
Location: Conference Hall
13:30
Some Interesting Issues in Data Mining

ABSTRACT. Data mining discovers useful rules or patterns from a set of data. Its role is increasingly important in the era of big data because it can provide compact and relevant knowledge, instead of detailed data, to users. Meanwhile, research on computational intelligence has been very popular because it combines intelligent approaches such as fuzzy logic, evolutionary computation, and neural networks to make programs or systems effective and efficient, although the solutions may not be optimal. In this speech, I would like to link the techniques of computational intelligence and data mining together for different kinds of mining problems. I would also like to discuss some other interesting issues, including incremental data mining, duality of data mining, federated data mining, and so on.

14:10-15:10 Session 12A: Software Engineering, Human Computer Interaction & Intelligent Systems
Location: Conference Hall
14:10
Sketch2Reality: Immersive 3D Indoor Scene Synthesis via Sketches

ABSTRACT. Sketching indoor scenes is helpful in daily activities as it allows for quick visualization and planning of room layouts, furniture arrangements, design ideas, or scene creation for games and entertainment. This motivates our proposal of Sketch2Reality, a system to simplify the creation of immersive 3D indoor scenes from 2D sketch images. Users sketch their desired scene, and our system identifies sketched objects and their positions, then retrieves and populates corresponding 3D models into the generating 3D scene. Users can then modify the scene, rearrange furniture, adjust lighting, and add or remove objects. Integration with Virtual Reality technology allows users to experience and interact with the scene realistically. Our experiments with three groups of users with different experience levels in 3D scene design and creation demonstrate the efficiency and usefulness of our solution. Sketch2Reality empowers users to dynamically bring their ideas to life, combining sketching, AI assistance for 3D generation, and VR for enhanced creativity and design exploration.

14:30
A Helpful Reviews Composition Assistance System for Online Shopping Users

ABSTRACT. This paper proposes an assistance system for composing helpful reviews for online shopping users. In online shopping, reviews are essential for users to make informed purchase decisions. However, the quality of reviews varies greatly, and it can be difficult for users to write helpful reviews. To address this issue, we propose an assistance system for writing helpful reviews. The proposed system has two modules: a helpfulness evaluation module and a random example providing module. The helpfulness evaluation module is used to assess the helpfulness of a review. It is implemented by learning labeled review texts with fastText. The random example providing module is used to provide users with examples of helpful reviews. It randomly shows a helpful review from the dataset. We conducted an experiment with 15 participants to evaluate the effectiveness of the proposed system. The results showed that both the helpfulness evaluation module and the random example providing module were effective in supporting writing helpful reviews. Based on the results of our experiment, we conclude that the proposed system is a promising approach for improving the quality of online shopping reviews.

14:50
SER-Fuse: An Emotion Recognition Application Utilizing Multi-Modal, Multi-Lingual, and Multi-Feature Fusion

ABSTRACT. Speech emotion recognition (SER) is a crucial aspect of affective computing and human-computer interaction, yet effectively identifying emotions in different speakers and languages remains challenging. This paper introduces SER-Fuse, a multi-modal SER application that is designed to address the complexities of multiple speakers and languages. Our approach leverages diverse audio/speech embeddings and text embeddings to extract optimal features for multi-modal SER. We subsequently employ multi-feature fusion to integrate embedding features across modalities and languages. Experimental results achieved on the English-Chinese emotional speech (ECES) dataset reveal that SER-Fuse attains competitive performance in the multi-lingual approach compared to the single-lingual approaches. Furthermore, we provide the implementation of SER-Fuse for download at https://github.com/nhattruongpham/SER-Fuse to support reproducibility and local deployment.

14:10-15:10 Session 12B: Artificial Intelligence and Data Science for Smart Health and Medicine
Location: Tulip
14:10
ALGNet: Attention Light Graph Memory Network for Medical Recommendation System

ABSTRACT. Medication recommendation is a challenging task that requires considering the patient's medical history, the drug's efficacy and safety, and the potential drug-drug interactions (DDI). Existing methods often fail to capture the complex and dynamic relationships among these factors. In this paper, we propose ALGNet, a novel model that leverages light graph convolutional networks (LGCN) and augmentation memory networks (AMN) to enhance medication recommendation. The LGCN component efficiently encodes the patient records and the DDI graph into low-dimensional embeddings, while the AMN augments the patient representation with external knowledge. We evaluate our model on the MIMIC-III dataset and show that it outperforms several baselines in terms of recommendation accuracy and DDI avoidance. We also conduct an ablation study to analyze the effects of different components of our model. Our work demonstrates the effectiveness and efficiency of using ALGNet for medication recommendation and provides a new perspective for improving patient care.

14:30
Improving Loss Function for a Deep Neural Network for Lesion Segmentation

ABSTRACT. Identifying and segmenting lesions are challenging tasks in the automatic analysis of endoscopic images in computer-aided diagnosis systems. Models based on encoder-decoder architectures have been proposed to segment lesions with promising results. However, those approaches have limitations in modeling local appearance, dealing with imbalanced data, and over-fitting. This paper proposes a novel method to address these limitations. We improve a state-of-the-art encoder-decoder model for image segmentation by introducing a new loss function for training. The novel loss function, called Focal - Binary Cross Entropy - Intersection over Union (FBI), consists of three terms: a Focal term, a Binary Cross Entropy (BCE) term, and an Intersection over Union (IoU) term. In addition, we employ the lasso regression sparsity technique in learning to reduce over-fitting. As a result, the proposed model can effectively segment lesions of various sizes and shapes, leading to improved accuracy on the lesion segmentation task. Our proposed model outperforms existing deep learning models on two challenging gastrointestinal endoscopy datasets for cancerous lesion segmentation.
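As a concrete, deliberately simplified reading of such a three-term loss, the terms can be combined per pixel as below. The equal weighting, gamma value, and soft-IoU form are assumptions for illustration, not the paper's exact formulation:

```python
from math import log

def fbi_loss(probs, targets, gamma=2.0, eps=1e-7):
    """Sketch of a combined Focal + BCE + soft-IoU loss for binary
    segmentation, averaged over pixels, with equal term weights assumed."""
    n = len(probs)
    bce = focal = inter = union = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1 - eps)
        pt = p if t else 1 - p            # probability of the true class
        bce += -log(pt)
        focal += -((1 - pt) ** gamma) * log(pt)   # down-weights easy pixels
        inter += p * t
        union += p + t - p * t
    iou_loss = 1 - inter / (union + eps)          # soft (differentiable) IoU
    return focal / n + bce / n + iou_loss

print(round(fbi_loss([0.9, 0.8, 0.2], [1, 1, 0]), 4))  # -> 0.4175
```

The focal term counters class imbalance by focusing on hard pixels, the BCE term gives per-pixel gradients, and the IoU term optimizes the region-overlap metric directly, which is why such combinations help with lesions of varying sizes.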

14:50
Volumetric CT Segmentation with Mask Propagation Using Segment Anything

ABSTRACT. Medical imaging is both an interesting and challenging field for researchers, mainly due to the lack of labelled data, especially for segmentation tasks. Many interactive systems have been introduced to help streamline the annotation workflow, and most recently the Segment Anything Model (SAM) has been a breakthrough as a foundation model for interactive segmentation using prompts. In this paper, we adapt SAM, a 2D segmentation model, to the task of interactive 3D organ segmentation, propose several performance improvement strategies, and achieve a comparable mean DSC of 0.8694 in a fully-supervised setting with a small amount of data. Furthermore, to reduce the human effort required by our algorithm, we develop a novel method inspired by the beam search algorithm from the NLP domain, which can run volumetric segmentation for one target (i.e., the liver) with very limited manual input while still achieving a DSC of 0.9410 for the liver, allowing it to be integrated into annotation systems to aid medical experts.

14:10-15:10 Session 12C: Multimedia Processing
14:10
LSegDiff: A Latent Diffusion Model for Medical Image Segmentation

ABSTRACT. Initially designed for image generation, diffusion models can also be effectively applied to various tasks, including semantic segmentation. However, most existing diffusion-based approaches for semantic segmentation operate in high-dimensional pixel space, demanding a lot of computing and memory resources during training and inference. This paper makes the first attempt to utilize latent diffusion models for semantic segmentation. Specifically, we propose a fast yet effective latent diffusion model and evaluate it on medical image segmentation tasks. Firstly, we train a Variational Autoencoder (VAE) network to convert binary image masks into compact latent vectors. The diffusion process can then be executed in this low-dimensional latent space and thus drastically accelerated. Subsequently, we employ the VAE's decoder to reconstruct a precise prediction map from the latent output vector produced by the diffusion process. Eventually, we refine the final segmentation results through a straightforward post-processing step using morphological operations. We report our results on two public datasets, including colon polyp images and skin cancer images. Experiments show that our approach achieves competitive accuracy compared to traditional diffusion models while having much better training and inference speed, as well as much more efficient memory consumption.

14:30
Boosting Facial Landmark Detection via Self-supervised and Semi-supervised Learning

ABSTRACT. Keypoint detection is one of the main focused fields in computer vision with various applications. Traditional fully-supervised deep learning methods currently dominate the field with impressive accuracy, but typically require careful, expensive, and laborious effort for keypoint annotations. To tackle this problem, recent semi-supervised methods have emerged and shown great potential in utilizing a large amount of available unlabeled data. In this work, we explore a novel semi-supervised keypoint detection method, aiming to reduce the annotations required while maintaining the accuracy of traditional fully-supervised methods. We further augment the method by integrating it with a robust backbone network that has been pre-trained through self-supervised learning, thereby enabling better utilization of unlabeled data. Experimental results on three different datasets show that models trained using our semi-supervised method outperform their fully-supervised counterparts in accuracy despite using the same amount of labeled data. Additionally, under specific settings, our method can match the performance of existing semi-supervised methods even when using a reduced set of labeled data.

14:50
Few-shot Object Counting with Low-cost Counting Phase

ABSTRACT. Object counting, a fundamental task in the field of computer vision, holds a pivotal role in a wide range of applications, including surveillance systems, environmental monitoring, and crowd management. However, conventional models that rely on the early-fusion mechanism often face difficulties when seamlessly integrating with retrieval systems. While this mechanism is effective, it demands significant computational resources, thereby impeding the scalability of counting processes and limiting their practical applicability. In response to these challenges, our study introduces an innovative two-phase approach to object counting. By breaking down the counting task into distinct phases, we substantially reduce computational overhead while maintaining performance levels comparable to those of previous methods relying on resource-intensive counting decoders. This approach opens up exciting new possibilities for real-world applications.

15:10-15:40 Session 13: Software Engineering & Recent Advances in Cyber Security & Human Computer Interaction and Intelligent Interactive Systems (Poster Session)
RAC-SAC: An improved Actor-Critic algorithm for Continuous Multi-task manipulation on Robot Arm Control

ABSTRACT. Controlling a robot arm in a complex environment is a challenging task. In this problem, the robot arm interacts with objects of uncertain shapes. To move an object to a required position where it can be grasped, the robot arm needs the ability to learn autonomously. It must be agile and intelligent enough to identify interaction points with the object, move it, and rotate it in a sensible direction without prior learning in complex environmental conditions. In this paper, we propose a Deep Reinforcement Learning approach to tackle this problem, utilizing the model-free methods Soft Actor-Critic (SAC) and Realistic Actor-Critic (RAC). To enhance accuracy and expedite the learning process, we fine-tune the model by redesigning the reward function to enable the agent to learn more effectively. Additionally, we combine it with methods such as Hindsight Experience Replay (HER), Relay Hindsight Experience Replay (RHER), and Projected Conflicting Gradients (PCGrad) to increase stability when executing sequential actions over extended periods. The experiments were simulated on a KUKA 7-degree-of-freedom robot arm. The results demonstrate that the proposed method achieves stable and faster convergence with an accuracy of up to 97% when pushing objects such as cubes and scissors. Comparative results also indicate that the proposed method is more effective than previous state-of-the-art methods. Additionally, we have built a framework that supports various Reinforcement Learning algorithms and can be applied to different objects when addressing this problem.
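Of the techniques listed, HER has a particularly compact core idea: replay a failed trajectory as if the state actually reached at the end had been the goal all along, turning sparse failures into useful learning signal. A minimal sketch (the episode layout and reward function are illustrative, not the paper's code):

```python
def her_relabel(episode, reward_fn):
    """Hindsight Experience Replay: relabel every step of an episode with
    the goal that was actually achieved at the end, recomputing rewards."""
    achieved_goal = episode[-1]["achieved"]
    relabeled = []
    for step in episode:
        new = dict(step, goal=achieved_goal)
        new["reward"] = reward_fn(step["achieved"], achieved_goal)
        relabeled.append(new)
    return relabeled

# Sparse reward: 0 when the achieved state matches the goal, else -1.
episode = [
    {"achieved": (0, 0), "goal": (5, 5), "reward": -1},
    {"achieved": (2, 1), "goal": (5, 5), "reward": -1},
    {"achieved": (3, 2), "goal": (5, 5), "reward": -1},
]
replay = her_relabel(episode, lambda a, g: 0 if a == g else -1)
print(replay[-1]["goal"], replay[-1]["reward"])  # -> (3, 2) 0
```

Both the original and the relabeled transitions go into the replay buffer, so the off-policy SAC/RAC learner always sees some successful outcomes even early in training.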

Applying Deep Learning for UAV Obstacle Avoidance: A Case Study in High-Rise Fire Victim Search

ABSTRACT. This article presents a comprehensive scenario for utilizing drones to search for victims during high-rise fires. Given the complexity and difficulty of accessing such environments, employing drones for rapid victim searches may significantly reduce rescue operation time. Specifically, we have devised a two-phase drone control scenario. In phase 1, the drone autonomously navigates to the target location, benefiting from Deep Learning technology to automatically avoid obstacles during its movement. In the second phase, we have introduced a scanning algorithm to assist the drone in systematically surveying the entire rescue and recovery area to locate victims. We have constructed a test scenario and assessed the results using AirSim simulation software.
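The article does not spell out its scanning algorithm, but a standard choice for systematically covering a rectangular area is a back-and-forth (boustrophedon) sweep. A minimal grid-based sketch, purely for illustration of the idea:

```python
def scan_path(width, height):
    """Back-and-forth (boustrophedon) sweep covering a width x height grid
    one cell at a time: left-to-right on even rows, right-to-left on odd,
    so the drone never retraces a lane."""
    path = []
    for y in range(height):
        xs = range(width) if y % 2 == 0 else range(width - 1, -1, -1)
        path.extend((x, y) for x in xs)
    return path

print(scan_path(3, 2))  # -> [(0, 0), (1, 0), (2, 0), (2, 1), (1, 1), (0, 1)]
```

Mapping each grid cell to a waypoint spaced by the camera's footprint guarantees full visual coverage of the search area with minimal overlap.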

An Ensemble Approach to Graph Layout for Movie Poster Similarity Visualization

ABSTRACT. An integral element within the realm of movie-making is the movie poster image. This visual medium holds significant importance, conveying essential details about the movie, including its title, characters, and genre. Hence, the features extracted from the movie poster are valuable for measuring the similarity of movies. This study proposes an ensemble approach to identifying movie similarity for visualizing the similarity of movie posters. Specifically, we consider three methods for constructing the similarity matrix: i) pixel-by-pixel, ii) handcrafted feature-based, and iii) deep learning feature-based. In this way, we can estimate movie similarity more accurately and provide it as the input dataset for personalized movie recommendation systems. In particular, our method focuses on extending color-based search algorithms that use low-level image histogram features. This approach helps reduce classification processing time, leading to the quick generation and comparison of feature vectors.
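A color-histogram search of the kind extended here typically reduces each poster to a normalized histogram and compares histograms by intersection. A toy grayscale sketch (the helpers are hypothetical; real posters would use per-channel color histograms):

```python
def color_histogram(pixels, bins=4):
    """Normalized histogram of 8-bit intensity values."""
    hist = [0] * bins
    for v in pixels:
        hist[min(v * bins // 256, bins - 1)] += 1
    total = len(pixels)
    return [h / total for h in hist]

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]: 1 means identical bin distributions."""
    return sum(min(a, b) for a, b in zip(h1, h2))

# Two "posters" whose pixels fall into the same dark/bright bins.
a = color_histogram([10, 20, 200, 250])
b = color_histogram([15, 30, 210, 240])
print(histogram_intersection(a, b))  # -> 1.0
```

Because histograms are tiny fixed-length vectors, pairwise intersections over a whole catalog are cheap, which is exactly the speed advantage the abstract points to.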

A Comprehensive Web Annotation Application for Organ Image Segmentation and Predictive Inference

ABSTRACT. In this paper, we present a web annotation application tailored to streamline the process of annotating 3D organ volumes using preprocessed 2D image data. Our platform introduces advanced annotation assistance and predictive inference to significantly enhance medical imaging workflows. Although we leverage curated sets of 2D images rather than the original 3D volume data, our approach focuses on optimizing the annotation process and integrating predictive algorithms to improve accuracy. By capitalizing on the preprocessing step of the 3D volume segmentation dataset, we offer a solution that expedites the creation of accurate organ segmentations, ultimately contributing to more efficient medical image analysis.

An Eye-Tracking-Based System for Capturing Visual Strategies in Reading of Children with Dyslexia
PRESENTER: Duc Duy Le

ABSTRACT. Dyslexia is a learning disability that makes reading and language-related tasks more challenging. In recent years, the use of eye tracking to analyze eye movements during reading in children with dyslexia has gained popularity and received increased attention in the community. However, most previous studies have focused on English-speaking children, leaving limited research applicable to Vietnamese-speaking children, whose language exhibits distinct characteristics compared to English. This study aims to develop a system that utilizes eye-tracking technology to provide a visual representation of the eye movements of Vietnamese children with dyslexia, helping us understand their visual strategies during reading. We have designed a system capable of capturing children's eye movements through several tests suitable for Vietnamese children. This allows us to analyze specific eye characteristics during reading, such as fixations and saccades, and visualize them using heatmaps or scanpaths. By conducting tests on both children with dyslexia and children with typical development to initially compare their eye movement patterns, the findings suggest the potential use of this system in addressing subsequent issues, such as detection, intervention, and application development aimed at assisting Vietnamese children with dyslexia based on their visual reading strategies.
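Fixations and saccades are commonly separated with a dispersion-threshold (I-DT) style algorithm: consecutive gaze samples whose bounding box stays small form a fixation, and the jumps between fixations are saccades. A simplified sketch (the thresholds and data format are illustrative; the system's actual detector may differ):

```python
def detect_fixations(gaze, max_dispersion=1.0, min_samples=3):
    """I-DT sketch: grow a window of gaze points while its dispersion
    (bounding-box width + height) stays under max_dispersion; a window of
    at least min_samples points becomes a fixation (centroid, duration)."""
    fixations, start = [], 0
    while start < len(gaze):
        end = start
        while end + 1 < len(gaze):
            xs = [x for x, _ in gaze[start:end + 2]]
            ys = [y for _, y in gaze[start:end + 2]]
            if (max(xs) - min(xs)) + (max(ys) - min(ys)) > max_dispersion:
                break
            end += 1
        if end - start + 1 >= min_samples:
            pts = gaze[start:end + 1]
            cx = sum(x for x, _ in pts) / len(pts)
            cy = sum(y for _, y in pts) / len(pts)
            fixations.append((cx, cy, len(pts)))
            start = end + 1
        else:
            start += 1
    return fixations

# Three tight samples (one fixation), then a large jump (a saccade).
print(detect_fixations([(0, 0), (0.1, 0), (0, 0.1), (5, 5)]))
```

The resulting fixation centroids and durations are exactly the inputs needed to render heatmaps and scanpaths for comparing reading strategies.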

Resnet Video 3D for Gait Retrieval: A Deep Learning Approach to Human Identification

ABSTRACT. Gait, the distinctive way a person walks, is a useful biometric trait for various applications such as crime prevention, forensic identification, and social security. Gait retrieval, which aims to find the person who matches a given gait, is an active research area that has drawn significantly increasing attention. However, learning discriminative temporal features from gait data is difficult due to the subtle variations in the spatial domain of the silhouette. Recent deep learning methods have demonstrated their effectiveness for gait retrieval by learning more robust features from raw video data. In this paper, we propose a baseline network based on the ResNet video R3D-18 model, which can capture both spatial and temporal information from the data, to address the gait retrieval problem. Our experimental results show that our optimized backbone network can extract powerful vector representations of gait and achieve high performance in retrieving the person who matches the gait from the database. On the CASIA-B dataset, we obtain a Rank-1 accuracy of 97.09% and a Rank-10 accuracy of 99.27% under the normal walking condition. The source code will be made available.

Log Analysis for Network Attack Detection Using Deep Learning Models

ABSTRACT. System logs play a vital role in upholding information security by capturing events to address potential risks. Numerous research initiatives have harnessed log data to create machine learning models geared towards spotting unusual activities within systems. In this pragmatic study, we introduce an innovative approach to detecting anomalies in log data, employing a three-step process encompassing preprocessing, advanced natural language processing (NLP) using BERT, and a custom 1D-CNN classification model. During the preprocessing phase, we tokenize the data and eliminate non-essential elements, while BERT enriches log message representations. Our sliding window and overlapping mechanism ensures consistent input dimensions. The 1D-CNN model extracts temporal features for robust anomaly detection. Empirical findings on the HDFS, BGL, Spirit, and Thunderbird datasets illustrate that our method outperforms prior approaches in identifying network attacks.
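The sliding window with overlap that fixes the input dimension can be sketched in a few lines (window and stride values below are illustrative, not the paper's settings):

```python
def sliding_windows(tokens, window=4, stride=2):
    """Split a long token sequence into fixed-size overlapping windows so
    every model input has the same dimension; the final window is shifted
    back to cover the tail when the stride does not divide evenly."""
    if len(tokens) <= window:
        return [tokens]
    starts = list(range(0, len(tokens) - window + 1, stride))
    if starts[-1] + window < len(tokens):
        starts.append(len(tokens) - window)
    return [tokens[s:s + window] for s in starts]

print(sliding_windows(list("abcdefg"), window=4, stride=2))
# -> [['a', 'b', 'c', 'd'], ['c', 'd', 'e', 'f'], ['d', 'e', 'f', 'g']]
```

The overlap (stride < window) means every log token appears in at least one full-context window, so anomalies near window boundaries are not missed.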

Contextual Language Model and Transfer Learning for Reentrancy Vulnerability Detection in Smart Contracts

ABSTRACT. The proliferation of smart contracts on blockchain technology has led to several security vulnerabilities, causing significant financial losses and instability in the contract layer. Existing machine learning-based static analysis tools have limited detection accuracy, even for known vulnerabilities. In this study, we propose a novel deep learning-based model combined with attention mechanisms for identifying security vulnerabilities in smart contracts. Our experiments on two large datasets (containing approximately 70,000 and 42,000 smart contracts) demonstrate that our approach achieves 90% detection accuracy in identifying smart contract reentrancy attacks, outperforming existing state-of-the-art deep learning-based approaches. In addition, this work establishes the practical applicability of deep learning-based technology for smart contract reentrancy vulnerability detection, which can promote future research in this domain.

IU-TransCert: A Blockchain-Based System for Academic Credentials with Auditability

ABSTRACT. Recent blockchain-based systems for managing credentials show advantages over paper-based procedures. However, issuing credentials with blockchain could conflict with current management rules and policies; one possible conflict concerns auditability. Most blockchain-based systems for credentials focus on security, efficiency, and privacy while ignoring the auditability of the system. In this paper, we propose a new system, IU-TransCert, for issuing, verifying, and auditing academic credentials. The system uses a new data structure, the Auditable Merkle Tree, that enables credential issuance with built-in auditing capabilities, and its auditable data fields can be customized to meet regulations. Credentials are published to the blockchain as the root node of the Auditable Merkle Tree, allowing access for auditors while preserving privacy. The system provides automated and transparent auditing processes for educational authorities to independently verify credentials without involving issuers. We also present a prototype to demonstrate feasibility and a security analysis to examine protections against threats. The analysis and discussion show that the proposed system could enhance credential privacy, efficiency, integrity, and auditability across the university ecosystem.
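The Auditable Merkle Tree itself is not specified in the abstract; the sketch below only illustrates the standard Merkle-root construction it builds on, where each credential field is hashed into a leaf and only the root is published on-chain (the field names are hypothetical):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Compute a Merkle root over a list of byte-string leaves.

    Leaves are hashed, then adjacent pairs are hashed together level by
    level; an odd node at any level is promoted unchanged (one common
    convention). Only the 32-byte root needs to be published.
    """
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(h(level[i] + level[i + 1]))
        if len(level) % 2 == 1:
            nxt.append(level[-1])
        level = nxt
    return level[0]

# Hypothetical credential fields; only the root would go on-chain.
fields = [b"name:Alice", b"degree:BSc", b"gpa:3.8", b"issued:2023"]
root = merkle_root(fields)
```

Because any change to a field changes the root, an auditor can verify a disclosed field against the on-chain root without seeing the other fields.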

Shark-Eyes: A multimodal fusion framework for multi-view-based phishing website detection

ABSTRACT. In the era of escalating cyber threats, phishing attacks continue to exploit vulnerabilities in online security. This paper presents Shark-Eyes, a novel multimodal fusion framework designed for the detection of phishing websites using a multi-view approach. The proposed approach leverages a combination of two distinct attributes, namely domain features and HTML tag features, extracted from the target websites. The framework's effectiveness is evaluated through comprehensive experiments on a dataset sourced from Phishtank, OpenPhish, and Alexa, encompassing real-world phishing instances. Our results demonstrate the robustness and efficiency of the Shark-Eyes framework in accurately identifying phishing websites, showcasing its potential as a powerful tool for enhancing online security and thwarting malicious activities.

Towards Privacy-aware Manufacturing Data Exchange Platform

ABSTRACT. Reducing operating costs and optimizing manufacturing processes are key challenges for manufacturers, who undoubtedly need help from their machine suppliers (OEMs). However, like roughly 64% of business entities, they are reluctant to collaborate as long as their confidential data can be seen by others. Existing market solutions built on technologies such as Confidential Computing, Differential Privacy, and Multi-Party Computation do not fully meet industrial requirements: data are sometimes only partially encrypted, or must be analyzed in clear form inside a trusted execution environment (a "bunker"). As a result, no existing secure-computation or privacy-preserving data analysis solution is yet completely satisfactory with respect to privacy and security constraints, and such solutions are often tested on disparate applications and datasets. Fully Homomorphic Encryption (FHE) is set to change the game: it allows service providers to work directly on encrypted data without ever decrypting it, offering privacy protection for both customers and OEMs. In collaboration with Siemens France, we provide an FHE-based manufacturing data exchange space as part of the RaiseSens® Data eXchange Platform (RS – DXP). The API-driven RS – DXP architecture allows the practical and easy integration of FHE techniques, combined with an optimization engine and non-moving-data techniques, in lightweight yet real-world manufacturing applications, and supports deploying them in a cloud computing environment at low software engineering cost. This paves the way for wide deployment, boosting data-enabled manufacturing services.
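As a self-contained illustration of computing on ciphertexts, here is a toy Paillier-style additively homomorphic scheme with tiny fixed primes. This is not FHE (real FHE schemes such as BFV or CKKS also support multiplication on ciphertexts) and bears no relation to the RS – DXP implementation; it is only a sketch of the principle that a server can combine encrypted values without decrypting them:

```python
import math
import random

# Toy Paillier keypair with tiny fixed primes (insecure, illustration only).
p, q = 17, 19
n = p * q                     # public modulus
n2 = n * n
g = n + 1                     # standard choice of generator
lam = math.lcm(p - 1, q - 1)  # private key component
mu = pow(lam, -1, n)          # valid simplification because g = n + 1

def encrypt(m: int) -> int:
    """Encrypt plaintext m (0 <= m < n) with fresh randomness r."""
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    x = pow(c, lam, n2)
    return (((x - 1) // n) * mu) % n

# Homomorphic addition: multiplying ciphertexts adds the plaintexts,
# so a server holding only ciphertexts can compute an encrypted sum.
c_sum = (encrypt(12) * encrypt(30)) % n2
assert decrypt(c_sum) == 42
```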

A Machine Learning-Based Anomaly Packets Detection for Smart Home

ABSTRACT. The advent of smart homes has revolutionized residential living, integrating advanced technologies and intelligent devices to create secure, comfortable, and efficient environments. However, this integration of diverse smart devices has brought significant cybersecurity challenges. Detecting and analyzing abnormal network packets, which may signify intrusions, malicious activities, or system errors, has become paramount for ensuring the security and stability of smart home systems. Machine learning techniques, such as Decision Trees, Support Vector Machines (SVM), Convolutional Neural Networks (CNN), K-Nearest Neighbors (KNN), Recurrent Neural Networks (RNN), and Random Forests, have shown promise in addressing these challenges. However, most research has concentrated on anomaly detection rather than malicious activity in smart homes, and the vast datasets collected from various scenarios pose methodological and algorithmic challenges for applying machine learning techniques. To fill these research gaps, our study applies traditional machine learning methods to detecting abnormal network packets in smart homes using the IoT-23 dataset. This involves preprocessing the dataset, extracting relevant features, and training various machine learning models; a correlation matrix helps validate the feature selection of the best models based on performance metrics such as precision, F1-score, recall, accuracy, training score, and training time cost. Additionally, the study classifies 12 types of malware across different machine learning models, considering performance within the context of smart home devices, and implements real-time anomaly detection on the Raspberry Pi using packet captures and Zeek flowmeter methods. The findings contribute insights into models suitable for smart home security and enhance the understanding and application of machine learning methods for bolstering security in smart homes.

On the Value of Code Embedding and Imbalanced Learning Approaches for Software Defect Prediction

ABSTRACT. Automated software defect prediction aims to identify and estimate the likelihood of defects in software source code elements, seeking to enhance software quality while reducing testing costs. Previous research on software defect prediction primarily concentrated on design-related features, such as source code complexity and object-oriented design metrics, for classifying program elements into two categories: (i) defective and (ii) non-defective. Nevertheless, the majority of these studies have relied solely on hand-crafted software metrics, neglecting the valuable information in source code instructions, which can play a pivotal role in detecting bugs. This study leverages source code embedding techniques to extract essential information from program elements through a convolutional neural network. The likelihood of a source file element (e.g., class or method) being defective is then estimated by a fully connected network that incorporates both source code features and design-related attributes. Additionally, we explore specific imbalanced learning strategies to address the skewed distribution of defect data. To assess the effectiveness of our proposed approach, we conducted experiments on the publicly available PROMISE dataset with 34 projects. The empirical results consistently showcase the superior performance of our method, as it effectively predicts defective source files, outperforming other state-of-the-art models.

A Systematic Literature Review of DevOps Success Factors and Adoption Models

ABSTRACT. This paper investigates DevOps adoption, emphasizing its cultural dimensions in the context of the information systems domain. Utilizing a robust systematic literature review methodology, we discerned nine salient success factors and delineated three stage models instrumental for a smooth transition into DevOps-centric projects. Despite the increasing breadth of research in the DevOps arena, the absence of a consolidated definition remains a challenge, underscoring the importance of a unified interpretative framework. This research notably extends the current body of knowledge by focusing on the cultural dynamics of DevOps adoption. As a result, it offers a refined lens through which both practitioners involved in the adoption process and academic researchers can comprehend the nuances and implications of cultural shifts in the evolving landscape of DevOps within information systems.

15:40-16:40 Session 14A: Software Engineering, Human Computer Interaction & Intelligent Systems
Location: Conference Hall
15:40
Multi-Branch Network for Imagery Emotion Prediction

ABSTRACT. Images have long proved effective at both storing and conveying rich semantics, especially human emotions, and much research has been conducted to give machines the ability to recognize emotions in photos of people. Previous methods mostly focus on facial expressions or body texture but fail to consider the scene context. Recent studies have shown that scene context plays an important role in predicting emotions, leading to more accurate results. In addition, Valence-Arousal-Dominance (VAD) values offer a more precise quantitative description of continuous emotions, yet predicting them has received less emphasis than discrete emotional categories. In this paper, we present a novel Multi-Branch Network (MBN) to predict both discrete and continuous emotions. Our method utilizes various perceptual information, including facial features, body features, and contextual features, to predict human emotions in an image. Experimental results on the EMOTIC dataset, which contains large-scale images of people in unconstrained situations labeled with 26 discrete emotion categories and VAD values, show that our proposed method significantly outperforms state-of-the-art methods, with 28.4% mAP and 0.93 MAE. The results highlight the importance of perceptual information in emotion prediction and illustrate the potential of our proposed method in a wide range of applications, such as affective computing, human-computer interaction, and social robotics.

16:00
An Enhanced Tendermint Consensus Protocol Powered by Elliptic Curve VRF for Beacon Chain Mode

ABSTRACT. Blockchain technology has been undergoing a major shift recently, with the adoption of the Proof-of-Stake consensus mechanism in lieu of Proof-of-Work due to the former's efficiency and speed. One notable example of Proof-of-Stake is the Tendermint protocol, which powers the entire Cosmos system, an ecosystem of multiple interlocked chains. However, Tendermint's choice of deterministically deciding the next block proposer, instead of utilizing a randomization function, gives malicious actors a large window to prepare and coordinate attacks on upcoming validator nodes, possibly crippling the attacked chain, especially small chains with only a few working nodes. Furthermore, random number generation in this blockchain ecosystem remains a challenge because of blockchain's inherently deterministic nature. Aiming at the above problems, in this paper we propose an improvement over the Tendermint consensus protocol utilizing an Elliptic Curve Verifiable Random Function, a fast and secure pseudorandom generation algorithm suitable for deterministic systems like blockchain. This novel approach solves the problem of knowing the validator ahead of time, while the verifiable random function module can supply reliable random numbers to the overlaying beacon chain, a blockchain capable of distributing random numbers to users through smart contracts, and even to other blockchains through Cosmos' Inter-Blockchain Communication Protocol. Experiments with a prototype blockchain using the enhanced consensus protocol demonstrate that the new protocol improves Tendermint's resilience against network-layer attack vectors while maintaining adequate fairness and performance.

16:20
An Approach to Generating API Test Scripts Using GPT

ABSTRACT. As more software systems publish and consume web services or APIs today, automated API testing is an important activity for effectively ensuring the quality of software services before they are released. Generating test scripts and data is a crucial step in performing API test automation successfully. In this paper, we propose an approach leveraging GPT, a large language model, and an API's Swagger specification to automatically generate test scripts and test data for API testing. Our approach also applies GPT's self-refining with feedback obtained by executing the tests on Katalon. We evaluate the proposed approach using a dataset of seven APIs consisting of 157 endpoints and 179 operations. The results show that while our approach generates fewer test scripts and data inputs, it covers more successful 2xx status codes than a state-of-the-art tool. This suggests that leveraging the ability of GPT, as a large language model, to interpret an API's Swagger specification has the potential to improve the efficacy of generating test scripts and data for API testing.
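The abstract's pipeline feeds an API's Swagger/OpenAPI specification to GPT; a hedged sketch of what the prompt-construction step might look like (the prompt wording, the simplified endpoint schema, and the helper function are our own assumptions, and no model call is shown):

```python
def build_test_prompt(endpoint: dict) -> str:
    """Turn one simplified Swagger-style endpoint description into a
    test-generation prompt. Only a minimal subset of the schema is used."""
    params = ", ".join(
        f"{p['name']} ({p.get('type', 'string')}, "
        f"{'required' if p.get('required') else 'optional'})"
        for p in endpoint.get("parameters", [])
    ) or "none"
    return (
        f"Generate API test scripts with test data for:\n"
        f"  {endpoint['method'].upper()} {endpoint['path']}\n"
        f"  Summary: {endpoint.get('summary', 'n/a')}\n"
        f"  Parameters: {params}\n"
        f"Cover both valid inputs (expect 2xx) and invalid inputs."
    )

# Hypothetical endpoint as it might be extracted from a Swagger document.
endpoint = {
    "method": "get",
    "path": "/pets/{petId}",
    "summary": "Find pet by ID",
    "parameters": [{"name": "petId", "type": "integer", "required": True}],
}
prompt = build_test_prompt(endpoint)
```

In the paper's loop, the generated scripts would then run on Katalon, and failures would be fed back to the model for self-refinement.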

15:40-16:40 Session 14B: Recent Advances in Cyber Security
Location: Tulip
15:40
Black-Box Adversarial Attacks Against Language Model Detector

ABSTRACT. The Language Model (LM) Detector has gained attention for its remarkable performance in detecting machine-generated texts. It remains unclear, however, how this detector would perform against different adversarial attacks. In this paper, we aim to address this question by conducting a systematic analysis of the resilience of the LM Detector against eight black-box adversarial attack methods. We also propose a new technique, called StrictPWWS, that introduces a semantic similarity constraint into the conventional Probability Weighted Word Saliency (PWWS). Our findings reveal that the choice of search algorithm helps the attack methods generate better adversarial samples that can bypass the LM Detector. Moreover, tightening linguistic constraints emerges as an effective way to improve the attack success rate, and StrictPWWS achieves superior performance compared to other adversarial attack methods.

16:00
WebGuardRL: An Innovative Reinforcement Learning-based Approach for Advanced Web Attack Detection

ABSTRACT. Web-based applications are often potential targets for attackers due to the important data and assets that they manage. With the explosion and increasing complexity of recent attacks aiming at these applications, traditional security solutions such as intrusion detection systems (IDS) or web application firewalls (WAF) become ineffective against unpredictable threats. Meanwhile, in the trend of applying AI techniques to achieve practical effectiveness in various fields, cutting-edge reinforcement learning (RL) has also gained more attention for its promising applications, one of which is sophisticated attack detection. In this study, we introduce an RL-based model, named WebGuardRL, to detect multiple advanced web attacks by analyzing URLs in HTTP requests containing various attack types. To achieve this, our model is equipped with the capability of representing URLs that differ from attack to attack in the same form for use in RL training. The experimental results and comparisons with other methods indicate the high accuracy and remarkable capability of our WebGuardRL in web attack detection.

16:20
A Machine Learning-Based Framework for Detecting Malicious HTTPS Traffic

ABSTRACT. The use of encryption protocols, especially HTTPS, to secure communications creates new challenges for attack detection methods that are based on observation and analysis of network traffic because malicious data is also encrypted. Attackers can utilize encryption protocols to hide attack behaviors, including malware. Various machine learning techniques have been employed to identify encrypted network traffic originating from systems infected with malware. In this paper, we propose a framework to detect HTTPS traffic generated from malware-infected computers. The significance of this framework lies in its proposal of using tools and network traffic processing techniques, as well as conducting a comparative analysis of the detection performance produced by some machine learning algorithms. Additionally, we provide empirical findings to demonstrate the influence of the selected features on the overall accuracy and computation time of the detection framework.

15:40-16:40 Session 14C: Multimedia Processing
15:40
Applying Adaptive Sharpness-Aware Minimization to Improve Out-of-distribution Generalization

ABSTRACT. Out-of-distribution (OoD) generalization problems in machine learning occur when models trained on specific source domains struggle to generalize to unseen target domains due to variations in distributions, data collection conditions, and biases. These differences, known as domain shifts, can considerably degrade model performance on domains not encountered during training. This circumstance arises partially from the limitations of commonly used optimizers like SGD or Adam, which lack the capability to preferentially converge towards optimal points with high generalization capacity, commonly referred to as flat minima. Sharpness-Aware Minimization (SAM) emerges as a powerful tool for facilitating generalization on independent and identically distributed (i.i.d.) data by seeking flat minima in the loss landscape that remain robust under input sample perturbations. One notable variant is ASAM, an adaptive version designed to withstand parameter re-scaling and ensure robustness. Diverging from the conventional use of ASAM, which predominantly operates under i.i.d. conditions, this paper applies ASAM in OoD data scenarios to evaluate its robustness. Our findings reveal that ASAM exhibits stable training behavior and superior generalization capabilities compared to the Adam optimizer. Our experiments, conducted on the NICO dataset encompassing multiple domains, showcase ASAM's remarkable performance, surpassing other state-of-the-art methods without the need for intricate data augmentation; ASAM achieves accuracy rates of 87.25% (Animal) and 80.79% (Vehicle).
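A minimal NumPy sketch of a single SAM update on a toy loss may clarify the flat-minima idea (ASAM additionally rescales the perturbation per parameter; the radius ρ, learning rate, and quadratic loss here are illustrative choices, not the paper's setup):

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One Sharpness-Aware Minimization step:
    1) perturb the weights towards the (approximate) worst point within
       an L2 ball of radius rho,
    2) evaluate the gradient at that perturbed point,
    3) apply that gradient at the original weights."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascent direction
    g_adv = grad_fn(w + eps)                     # gradient at perturbed point
    return w - lr * g_adv

# Toy quadratic loss L(w) = ||w||^2 / 2, so grad L(w) = w.
grad_fn = lambda w: w
w = np.array([3.0, -4.0])
for _ in range(50):
    w = sam_step(w, grad_fn)
# w converges toward the minimum at the origin.
```

Because the update uses the gradient from the worst nearby point, minima that are sharp in some direction keep producing large adversarial gradients, biasing the trajectory toward flatter regions.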

16:00
Deep Learning Hierarchical Methods for Insect Pest Recognition on Plants

ABSTRACT. Insect pests have always been a global agricultural problem because the severity and extent of their occurrence threaten crop yield. Recognizing them early can help farmers take efficient measures to handle them, mitigating the negative impacts of insect pests. However, insect pest recognition still relies heavily on experts, which is expensive and time-consuming. Harnessing the power of deep learning, in this paper we propose two methods to solve this task. First, we propose a method that uses models pre-trained on the ImageNet dataset, namely ResNet-50, EfficientNet-B4, and VisionTransformer-B16, adding a Dropout layer before the output layer of each pre-trained model to avoid overfitting. Second, we apply hierarchical learning to this task. In the latter approach, we first use the baseline model to create a confusion matrix. Through this matrix, we cluster classes that the baseline model confuses with one another, owing to their similar appearance, into larger super-classes and treat them as sub-datasets. Then, we build a model for each sub-dataset using the same backbones as the baseline method, with the expectation that this helps classify these classes better. We conduct experiments to evaluate the performance of the methods on the IP102 dataset; our proposed method using the VisionTransformer-B16 backbone combined with hierarchical learning achieves the best accuracy of 74.50% on IP102.
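The clustering step above can be sketched as grouping classes the baseline confuses with each other: treat the confusion matrix as a graph, link pairs whose mutual confusion exceeds a threshold, and take connected components (the matrix values and threshold below are made up for illustration):

```python
def confused_groups(conf, threshold):
    """Group classes whose mutual confusion exceeds `threshold`.

    conf[i][j] = fraction of class-i samples predicted as class j.
    Classes i, j are linked if conf[i][j] + conf[j][i] > threshold;
    groups are the connected components of that link graph (union-find).
    """
    n = len(conf)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if conf[i][j] + conf[j][i] > threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical row-normalized confusion matrix for 4 classes.
conf = [
    [0.70, 0.25, 0.03, 0.02],
    [0.30, 0.65, 0.03, 0.02],
    [0.02, 0.03, 0.90, 0.05],
    [0.01, 0.02, 0.04, 0.93],
]
groups = confused_groups(conf, threshold=0.2)  # → [[0, 1], [2], [3]]
```

Each multi-class group would then get its own fine-grained classifier trained only on that sub-dataset.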

16:20
Improving Bird Sounds Classification with Additional Fine-tuning on Augmented Data

ABSTRACT. Many bird species are currently on the verge of extinction or already functionally extinct. Therefore, measures to record the abundance and distribution of bird species using AI models have long been considered as support for the biodiversity monitoring process. This study aims to enhance the accuracy of bird sound identification by applying fine-tuning with data augmentation to an existing CNN-based model. On the dataset from the BirdCLEF 2021 competition, the model achieves an F1-score of 0.6928, a noticeable increase of 0.0465 over the rebuilt bird classification solution proposed by Naoki Murakami, Hajime Tanaka, and Masataka Nishimori, and can predict nearly 400 species of birds more accurately. The improved performance of the model is expected to support the efforts of conservationists in monitoring bird distribution and promoting species preservation.