A hybrid multifactorial evolutionary algorithm for the minimum s-Club cover problem
ABSTRACT. The Minimum s-Club Cover problem (Min s-Club Cover) is widely applied in social network analysis and group object modeling. Its objective is to find the fewest s-Clubs that together cover all of a graph’s vertices. This study describes a hybrid of evolutionary multitask optimization and a local search inspired by simulated annealing to address Min s-Club Cover. The local search is applied to the best individual of each task in every generation. The proposed algorithm is evaluated on datasets from the DIMACS library, and experiments are carried out to demonstrate its effectiveness.
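For readers unfamiliar with the term: an s-club is a vertex subset whose induced subgraph has diameter at most s. The sketch below is only an illustration of that definition and of what a valid cover is; it is not the hybrid algorithm from the paper. The function names and the adjacency-dictionary graph representation are assumptions made for this example.

```python
from collections import deque


def is_s_club(adj, members, s):
    """Return True if the subgraph induced by `members` has diameter <= s.

    `adj` maps each vertex to an iterable of its neighbours (hypothetical
    representation chosen for this sketch).
    """
    members = set(members)
    if not members:
        return False
    for source in members:
        # Breadth-first search restricted to `members`, cut off at depth s.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            if dist[u] == s:
                continue  # vertices further away would violate the bound
            for v in adj[u]:
                if v in members and v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) != len(members):
            return False  # some member is farther than s from `source`
    return True


def is_s_club_cover(adj, clubs, s):
    """Return True if every vertex is in some club and every club is an s-club."""
    covered = set().union(*clubs) if clubs else set()
    return covered == set(adj) and all(is_s_club(adj, c, s) for c in clubs)


# Path graph 1 - 2 - 3 - 4
path = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(is_s_club(path, {1, 2, 3}, 2))               # a 2-club
print(is_s_club(path, {1, 3}, 2))                  # not a 2-club: vertex 2 is outside the induced subgraph
print(is_s_club_cover(path, [{1, 2, 3}, {4}], 2))  # a valid cover with two 2-clubs
```

Note that distances are measured inside the induced subgraph, which is why {1, 3} fails even though the two vertices are two hops apart in the full graph.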
Leveraging Dynamic Graph Word Embedding for Efficient Contextual Representations
ABSTRACT. A sentence is a complex structure of words and the relationships between them, which is hard to capture with a purely sequential representation and makes long-range dependencies challenging to learn. This research presents a novel dynamic word embedding method designed to improve text classification performance. The method leverages a next word prediction model trained on a massive text corpus to extract dynamic text representations. These representations capture the evolving meaning of words based on context and are then combined with static embeddings like Word2Vec. Notably, the method incorporates an undirected graph model to capture contextual relationships between words. Three variations of the method are explored: ELMo-Like Baseline Dynamic, ARMA Graph Dynamic, and ARMA+ELMo Dynamic. Experiments utilizing deep learning models for sentiment analysis and disaster tweet classification demonstrate the effectiveness of the proposed approach.
H-LSHADE: An Efficient Hybrid Approach for Solving Heterogeneous Target Coverage in Visual Sensor Networks
ABSTRACT. Visual Sensor Networks (VSNs) have become a key domain within sensor-based distributed intelligent systems, presenting distinct challenges in ensuring the quality of service (QoS) by guaranteeing coverage requirements and extending network lifetime. Among these challenges, the Heterogeneous Coverage of Targets (HCTs) problem stands out, where coverage requirements differ for each target based on its significance, and sensor orientation is restricted to discrete directions. This paper solves the HCTs problem under two critical scenarios: over-provisioned environments, where a sufficient number of sensors can meet all coverage demands, and under-provisioned environments, where sensor resources fall short of fulfilling the required coverage. Leveraging the success of evolutionary algorithms in finding near-optimal solutions, we propose an enhanced version of the traditional L-SHADE algorithm, namely Hybrid LSHADE with novel Local Search strategy (H-LSHADE). Our improvements include a specialized mutation strategy to maintain search space diversity and a local search mechanism to refine solutions. Comprehensive simulations demonstrate the proposed approach's efficiency and effectiveness compared to existing methods.
Developing a Mobile Virtual Assistant using Large Language Models for Task Automation
ABSTRACT. Mobile virtual assistants have come a long way, but they still stumble when faced with complex requests and adapting to different app interfaces. This is largely due to their reliance on pre-defined commands and limited contextual awareness. Our research introduces a novel approach to overcome these limitations by integrating Multimodal Large Language Models (MLLMs) into mobile virtual assistants. By harnessing the power of MLLMs, our proposed virtual assistant can process complex requests and seamlessly interact with diverse app interfaces. Traditional systems struggle with maintaining context and understanding dynamic app interfaces, but our approach tackles this by enabling the assistant to analyze and interpret visual elements within these interfaces. Our methodology centers on a comprehensive learning process where the virtual assistant learns to interact with various applications and understand their functionalities. This is further enhanced by the MLLM’s ability to capture and interpret visual cues, allowing for more precise handling of complex queries and task executions.
BKCrawler: A Scalable Web Data Extraction System Using Weak Supervision
ABSTRACT. Automated data collection from diverse websites poses a significant challenge in web mining. This paper introduces BKCrawler, a novel system that automatically detects and extracts information from multiple websites without predefined structure definitions. We propose an integrated architecture comprising two key components: a Site Scouter for automatic website analysis, and a Data Harvester incorporating machine learning models for data collection and extraction. Notably, we apply weak supervision techniques in data labeling for the information extraction model, substantially reducing manual labeling costs while maintaining high accuracy. Experiments on real estate data from 17 websites demonstrate the system’s effectiveness, achieving high accuracy in page classification (F1-score 0.93) and competitive performance in extracting key information fields (F1-scores ranging from 0.65 to 0.88). BKCrawler has been successfully deployed and is currently in use, proving the effectiveness and scalability of the proposed method. This research opens avenues for developing intelligent data extraction systems applicable to various domains.
ABSTRACT. Clustering is used to find structure in unlabeled data sets and is one of the fundamental problems underlying advanced data mining steps. It plays an important role in many fields, especially big data analysis and mining. Usually, clustering is based on the similarity between data samples in a cluster or the difference between data samples in two different clusters. However, there are data samples located at the border of clusters, and it is difficult to determine which cluster they belong to. Clustering algorithms based on fuzzy sets have advantages in handling uncertain data, but they have difficulties with large data, noisy data, outliers, data with unclear cluster boundaries, etc. The paper presents the border fuzzy c-means clustering algorithm to mitigate these disadvantages. The proposed algorithm consists of two stages: stage 1 initializes the cluster centers based on the data distribution; stage 2 applies the proposed border fuzzy c-means clustering (B-FCM) and border semi-supervised fuzzy c-means clustering (B-SSFCM) techniques, in which data samples located on the fuzzy boundaries of clusters are assigned to a cluster based on border information. Stage 2 is repeated until the samples on the boundaries are assigned to clusters. This approach also helps to quickly assign edge data samples to clusters, thereby helping the algorithm converge faster. Experiments on three large data sets show that the proposed method not only gives better clustering results but also runs much faster than some other algorithms.
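To make the notion of a "border" sample concrete, the sketch below computes the standard fuzzy c-means membership of a point to each cluster center and flags points whose two highest memberships are close. The membership formula is the textbook FCM one; the margin-based border test, the function names, and the default `margin` are assumptions for illustration and are not the B-FCM or B-SSFCM rule from the paper.

```python
import math


def fcm_memberships(point, centers, m=2.0):
    """Standard FCM membership of `point` to each center (fuzzifier m > 1)."""
    dists = [math.dist(point, c) for c in centers]
    if any(d == 0.0 for d in dists):
        # The point coincides with one or more centers: share membership among them.
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        total = sum(hits)
        return [h / total for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]


def is_border_sample(memberships, margin=0.2):
    """Hypothetical border test: the two largest memberships differ by less than `margin`.

    Requires at least two clusters.
    """
    top, second = sorted(memberships, reverse=True)[:2]
    return (top - second) < margin


centers = [(0.0, 0.0), (4.0, 0.0)]
for p in [(1.0, 0.0), (2.0, 0.0)]:
    u = fcm_memberships(p, centers)
    print(p, [round(x, 3) for x in u], "border" if is_border_sample(u) else "interior")
```

A point midway between two centers receives equal memberships and is flagged as a border sample, whereas a point close to one center is not.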
On the Effects of Training Objectives of Multi-agent Reinforcement Learning for Energy Consumption in Residential Buildings
ABSTRACT. The application of reinforcement learning (RL) to optimize residential power systems presents a promising research avenue, particularly by leveraging the flexibility of RL in learning and the vast data resources made available by the advancement of IoT technologies. This approach holds significant potential in addressing challenges related to reducing power consumption and carbon emissions. However, several challenges hinder the practical application of RL models in this domain. A primary obstacle is the difficulty in accurately defining the reward function (objective function), a critical factor that profoundly impacts both the training process and the alignment with the actual needs of the power system users, due to the diverse range of evaluation parameters. This study is undertaken to explore the practicality of RL in controlling residential power systems and to assess the impact of various reward functions on both the agent's learning process and the system's performance.
The research aims to identify the effective reward functions that yield a favorable balance among energy consumption, system stability, and user comfort under specific scenarios.
Enhancing Software Fault Localization with Variational Autoencoder and Residual Neural Networks
ABSTRACT. Debugging is a critical, costly, and labor-intensive activity in software development. Many fault localization techniques have been proposed to mitigate this issue. Spectrum-based fault localization is a widely used technique that analyzes execution traces (spectra) from test cases and applies a ranking formula to determine the suspiciousness score of each program unit. However, most of the existing spectrum-based fault localization techniques fail to consider complex dependencies between program units and test results. To overcome this limitation, deep-learning-based fault localization techniques have been developed, which utilize an artificial neural network to capture and learn the complex nonlinear relationship between the program spectra and test results. In this study, we propose an effective framework that integrates the Variational Autoencoder and Residual Neural Networks (ResNet) to enhance the accuracy of fault localization. First, VAE is utilized to address imbalanced input data issues, and then the ResNet network is used to capture nonlinear relationships in program execution data. The experimental results show that our approach outperforms state-of-the-art techniques in terms of EXAM and RImp metrics.
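As background for the spectrum-based baseline described above, the sketch below ranks program units by a suspiciousness score computed from test coverage and pass/fail outcomes. It uses the well-known Ochiai formula as one example of a ranking formula; the abstract does not say which formulas the paper compares against, and the data layout and function names here are assumptions for illustration. It does not implement the proposed VAE and ResNet framework.

```python
import math


def ochiai(ef, ep, total_failed):
    """Ochiai suspiciousness: ef / sqrt(total_failed * (ef + ep)).

    ef / ep = number of failing / passing tests that execute the unit.
    """
    denom = math.sqrt(total_failed * (ef + ep))
    return ef / denom if denom else 0.0


def rank_units(coverage, outcomes):
    """Rank units from most to least suspicious.

    coverage[t] is the set of units executed by test t;
    outcomes[t] is True if test t passed, False if it failed.
    """
    total_failed = sum(1 for passed in outcomes if not passed)
    units = set().union(*coverage)
    scores = {}
    for unit in units:
        ef = sum(1 for cov, ok in zip(coverage, outcomes) if unit in cov and not ok)
        ep = sum(1 for cov, ok in zip(coverage, outcomes) if unit in cov and ok)
        scores[unit] = ochiai(ef, ep, total_failed)
    # Highest score first; ties broken by unit name for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


coverage = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
outcomes = [True, False, False]
for unit, score in rank_units(coverage, outcomes):
    print(unit, round(score, 3))
```

In this toy spectrum, unit "c" is executed only by failing tests and therefore ranks first.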
GCGE: GAN+CFM-powered Data Augmentation and GBT Ensemble Learning for Improving Diabetes Mellitus Prediction
ABSTRACT. Non-invasive diabetes diagnosis poses significant challenges, particularly when critical information is missing or incomplete in the dataset. This paper focuses on improving predictive accuracy for non-invasive diabetes diagnosis by leveraging advanced tabular data augmentation techniques. We explore a range of oversampling methods to enhance machine learning models' ability to make decisive predictions from limited data. Conditional Flow Matching (CFM) showed particularly promising results, especially when combined with other augmentation methods. After extensive testing with oversampling techniques and using Gradient-Boosted Trees for synthetic data generation, we identified that a combination of WGAN and CFM, trained with the CatBoost algorithm, produced the best results. This approach achieved a specificity of 98%, a sensitivity of 95.91%, an accuracy of 96.26%, and an F1-Score of 96.05% on our non-invasive diabetes dataset, outperforming single-method augmentation strategies. These findings demonstrate the potential of multi-method augmentation to significantly improve the performance of machine learning models in non-invasive medical diagnostics.
Analysis of Behavioral Facilitation Information During Typhoon Period Based on Victim Attributes
ABSTRACT. During a disaster, a large amount of Behavioral Facilitation information (BF information) is posted on SNSs. We have proposed a method for extracting BF information and categorizing it into four labels (behavioral axes).
In this paper, we analyze the differences in how BF information is received before, during, and after a typhoon, focusing on the attributes of disaster victims. This research aims to clarify the appropriate information for the target victims in each typhoon period and analyze their relationship. The results have shown differences in the relationship between the victims' attributes and their perception of BF information.
Towards a Unified Delegated Authorization Framework for Microservice-based ERP Systems
ABSTRACT. Microservice-based ERP systems face the challenge of providing a consistent authorization and delegation mechanism to control actions on resources for different authorities, including their users and the services they use or connect to. These challenges arise due to the complex and varied requirements of different business management models, such as position-based, task-based, or mixed models.
This paper proposes a simple, unified delegated authorization model that allows for defining authorization and delegation policies based on controlled actions by entities on objects, using predefined and customizable attributes related to positions, tasks, and organizational policies applied to the authorities. Building on this model, we introduce a framework that addresses the need for a consistent and manageable delegation process and an efficient handling of dynamic authorization and delegation requirements for both users and services across the ERP system. The framework consists of four key components: AuthorizationController, DelegationController, PolicyController, and AttributeController. It enables the retrieval of attributes for authorization checks and the establishment of delegation. The framework is integrated into a microservice-based ERP system and experimented with real-world authorities, demonstrating its effectiveness in making access control and delegation decisions on their actions.
Power and Subcarrier Optimization for Heterogeneous QoS Requirement in Wireless Sensor Networks
ABSTRACT. A hierarchical wireless sensor network consists of multiple sensor nodes sending their sensed data to multiple access points (APs), which relay the aggregated data to a system server. In data collection scenarios, the access points can reuse the sub-carriers used by the sensors in the data relaying phase to improve spectral efficiency. The data rate requirements of sensors are heterogeneous, and energy consumption is the primary concern in any sensor-based application. This work aims to minimize the total power consumption of all devices while satisfying each device’s minimum data rate requirement. The paper formulates the problem as a joint optimization of power allocation, AP selection, and sub-carrier assignment. Two-phase communication is deployed, and the Lagrangian method is applied in each phase to allocate power and sub-carriers for sensor nodes and APs, yielding the optimal solution. Numerical results show that the proposed algorithm achieves significant power savings compared to the existing methods.
Improving Quality of Vietnamese to Khmer Neural Machine Translation Using Multi-stage Fine-tuning Strategy
ABSTRACT. Machine translation for low-resource language pairs remains a significant challenge for large language models, despite their demonstrated superiority in many other tasks. For the translation task, multilingual large language models require a substantial amount of monolingual data for the pre-training phase, as well as large, high-quality parallel datasets for the fine-tuning phase. In this paper, we present our research on an effective fine-tuning solution for a pre-trained LLM to enhance the quality of machine translation for the Vietnamese-Khmer language pair. Our experiments show that applying self-supervised learning to the pre-trained LLM, then fine-tuning it on related tasks, and finally fine-tuning it on both the original and augmented datasets yields better results, with a BLEU score improvement of over 13% compared to the best results from previous studies and 7% compared to Google Translator and gpt-4o on the in-domain test.
Developing A Vietnamese Regional Voice Dataset and Benchmark For Region Recognition Based On Speech
ABSTRACT. To enhance human-computer interaction, enable personalized customer service, and improve accessibility for diverse populations across various regions, recognizing regional accents in speech is a critical challenge to address. In this paper, we developed a comprehensive speech dataset comprising over 9,000 audio samples from 63 provinces in Vietnam, representing six distinct regional accents. We applied various machine learning and deep learning techniques for accent classification, achieving classification accuracies ranging from 71.01% to 83.17%. Extracting regional speech features and designing robust classification algorithms remain challenging tasks. The dataset, alongside the performance comparison of state-of-the-art models, provides valuable insights for developing more efficient speech classification techniques.
Enhancing Retrieval Augmented Generation with Hierarchical Text Segmentation Chunking
ABSTRACT. Retrieval-Augmented Generation (RAG) systems commonly use chunking strategies for retrieval, which enhance large language models (LLMs) by enabling them to access external knowledge, ensuring that the retrieved information is up-to-date and domain-specific. However, traditional methods often fail to create chunks that capture sufficient semantic meaning, as they do not account for the underlying textual structure. This paper proposes a novel framework that enhances RAG by integrating hierarchical text segmentation and clustering to generate more meaningful and semantically coherent chunks. During inference, the framework retrieves information by leveraging both segment-level and cluster-level vector representations, thereby increasing the likelihood of retrieving more precise and contextually relevant information. Evaluations on the NarrativeQA, QuALITY, and QASPER datasets indicate that the proposed method achieved improved results compared to traditional chunking techniques.
A Novel Gradient-based Defense Method against Model Poisoning Attacks in Federated Learning
ABSTRACT. Federated learning (FL) allows multiple clients to train a model without sharing data. However, the decentralized nature of FL may make the model vulnerable to attacks such as model poisoning attacks (MPA), which aim to reduce the accuracy of the central model. Most current defense mechanisms perform well against model poisoning attacks when the attacker has access to only a few genuine clients. However, if the attacker injects a sufficiently large number of fake clients into the FL system, conventional defense methods may fail to recognize the attack and protect the global model. To address this challenge, we propose an effective defense mechanism against the MPA that can accurately detect all fake clients and disregard their contribution to the aggregation process. Specifically, we analyze the sum of the final layer’s bias gradients and demonstrate that this sum theoretically should be zero for genuine classification models. The evaluation results show that, regardless of the number of fake clients, our proposed method precisely detects all of them and maintains the system's performance as if it had not been attacked. In addition, our proposed defense method exhibits much lower computational complexity than the other defense methods while outperforming them in system accuracy.
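The zero-sum property mentioned above can be checked numerically under one common assumption: a softmax output layer trained with cross-entropy loss. In that setting the gradient with respect to the final-layer bias of class k is (p_k - y_k) averaged over the batch, and since both the probabilities and the one-hot labels sum to one, the entries sum to zero. The sketch below illustrates only this identity; the loss choice and the function names are assumptions, and the paper's actual detection rule is not reproduced here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def final_bias_gradient(batch_logits, batch_labels):
    """Batch-averaged gradient of softmax cross-entropy w.r.t. the final-layer bias.

    For each sample the gradient is softmax(logits) - one_hot(label).
    """
    num_classes = len(batch_logits[0])
    grad = [0.0] * num_classes
    for logits, label in zip(batch_logits, batch_labels):
        probs = softmax(logits)
        for k in range(num_classes):
            grad[k] += probs[k] - (1.0 if k == label else 0.0)
    return [g / len(batch_logits) for g in grad]


grad = final_bias_gradient([[2.0, -1.0, 0.5], [0.1, 0.2, 3.0]], [0, 2])
print(grad)
print(abs(sum(grad)) < 1e-12)  # the entries cancel up to floating-point error
```

An update whose final-layer bias gradients do not sum to (approximately) zero is therefore inconsistent with honest training under this loss, which is the kind of signal a server-side check could look for.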
Dual-Domain Reconstruction Network for Enhancing Sparse-View and Low-Dose CT Imaging
ABSTRACT. Computed Tomography (CT) is a crucial diagnostic tool, but concerns about patient radiation exposure have increased due to its widespread use. Sparse-view CT, which reduces the number of projection angles, has been proposed as a potential solution. However, traditional reconstruction methods, such as Filtered Back Projection (FBP), struggle to produce high-quality images from sparse data. To address these challenges, the DD-ReconNet model for CT image reconstruction is introduced. This model leverages both the Sinogram and Image domains to enhance image quality and consists of three stages: Sinogram Restoration, Image Reconstruction using FBPConvNet, and Image Restoration. The Sinogram and Image Restoration modules integrate the Swin Transformer V2 block and an Improved Edge Convolution layer to boost restoration performance. Additionally, a hybrid objective function is used to optimize the reconstructed images. Experimental results demonstrate the superiority of the DD-ReconNet model over conventional methods, positioning it as a promising approach for low-dose and sparse-view CT reconstruction, improving diagnostic accuracy while minimizing radiation exposure.
An Evaluation of HTTP/3 and WebTransport over QUIC in Live Low Latency Video Streaming
ABSTRACT. Live video streaming on the Internet has become increasingly popular in recent years. This trend highlights the need for efficient and robust streaming protocols. The advent of HTTP/3, with its core component QUIC, introduces new features that could potentially enhance media delivery in areas such as cloud gaming and real-time rendering applications. WebTransport’s ability to facilitate access to HTTP/3 and QUIC within web browsers makes it highly attractive to cloud gaming providers and game engines, which predominantly use WebRTC to stream cloud-rendered content. Accessible directly within browsers, WebRTC has been a key technology in media streaming due to its universal availability and low-latency features. However, the emerging use of WebTransport promises scalable, lower latency solutions, potentially similar to existing low-latency techniques like WebRTC. This paper analyzes the performance of HTTP/3 and WebTransport in a live low-latency media streaming context, discusses the possible advantages of adopting these protocols, and compares them with WebRTC. To evaluate the performance of WebTransport and HTTP/3, a comprehensive testbed was developed, enabling a comparative analysis with WebRTC focused on end-to-end latency. Utilizing the widely used gaming engine Unity for remote rendering, we established a pipeline architecture that employs various streaming protocols. The findings reveal that WebTransport and HTTP/3 are capable of achieving low-latency levels comparable to WebRTC. This underscores the potential of HTTP/3 and WebTransport to perform well in low-latency applications while requiring less configuration overhead than WebRTC in client-server setups.
A MAC Protocol for multi-cluster scheduling based on geographical segmentation and Precoloring Extension
ABSTRACT. Large-scale monitoring Wireless Sensor Networks present a specific challenge for data collection when the network is broken down into multiple isolated clusters, specifically in carrying out the network's data packet transmissions. While this can be handled by combining a TDMA-based MAC protocol with a limit on the number of timeslots each cluster may use, we find that this approach only sidesteps the problem of multi-cluster transmission scheduling. In this work, we investigate this problem, addressing the existence and discovery of cross-cluster interference and how to avoid it, and then propose a scheduling procedure that takes this interference avoidance into account. First-round results show our algorithm's potential for scalability and reliability when applied to networks of increasing size.
CoverNexus: Multi-Agent LLM System for Automated Code Coverage Enhancement
ABSTRACT. This paper presents CoverNexus, a novel multi-agent system leveraging Large Language Models (LLMs) to improve code coverage through automated unit test generation. We introduce a flexible architecture combining LLMs with specialized testing components, outperforming existing methods in coverage and correctness. Our approach is evaluated using CoverBench, a new benchmark derived from HumanEval, tailored for assessing test generation and coverage improvement. Comprehensive experiments demonstrate CoverNexus's superiority, with GPT-4 achieving 99.91% coverage and 77.44% correctness in multi-agent setups. We observe that closed-source models excel in multi-agent configurations, while open-source models perform better in single-agent scenarios. This work provides valuable insights into the trade-offs between coverage and correctness, contributing to the advancement of AI-assisted software testing and more efficient software development processes.
ABSTRACT. Convolutional Neural Networks (CNNs) are crucial for processing multimedia data. Their growth has been fueled by advancements in computing platforms, e.g. GPUs and TPUs. With massively parallel computing capabilities, GPUs can significantly speed up the training and inference of CNNs. The advent of CUDA, a parallel computing and programming model, has also eased efforts to parallelize these computational tasks, primarily attributed to the convolutional layers, on GPUs. However, the current implementation of convolution algorithms still has many limitations, lowering the performance of the convolutional layers. In this paper, we propose a method to optimize Winograd-based convolution on GPUs to speed up convolution operations. Experiments on a commercial GPU demonstrate that the proposed method remarkably outperforms two state-of-the-art algorithms in the cuDNN library, GEMM and Winograd. In addition, the proposed method also brings significant benefits in terms of reducing the required memory usage. These results have a positive impact on accelerating CNNs.
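For context on what Winograd-based convolution computes, the sketch below implements the standard one-dimensional minimal filtering algorithm F(2,3), which produces two outputs of a 3-tap filter using four multiplications instead of six. This is the textbook building block only, written in plain Python for readability; it is not the GPU optimization proposed in the paper, and the function names are chosen for this example.

```python
def direct_conv_f23(d, g):
    """Reference: two outputs of a 3-tap filter g over a 4-element tile d (6 multiplications)."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]


def winograd_f23(d, g):
    """Winograd F(2,3): the same two outputs using 4 element-wise multiplications."""
    # Filter transform; in a CNN this can be precomputed once per filter.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # Input transform followed by the four multiplications.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    # Output transform.
    return [m1 + m2 + m3, m2 - m3 - m4]


tile = [1.0, 2.0, 3.0, 4.0]
taps = [0.5, 1.0, -2.0]
print(direct_conv_f23(tile, taps))
print(winograd_f23(tile, taps))
```

Both functions return the same values for the tile above; the saving in multiplications comes at the cost of extra additions and the transformed-filter storage, which is where implementation choices on a GPU matter.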
ASC: Aggregating Sentence-level Classifications for Multi-label Long Text Classification
ABSTRACT. Classification is a fundamental task for metadata estimation in archival document management within a digital library. Although pre-trained language models (PLMs) have evolved significantly, multi-label long text classification (MLLTC) remains challenging for PLM-based text classification methods due to their input text length limitations. Existing PLM-based classifiers typically utilize a single representation for a long text. In contrast, this paper explores a sentence-level classification approach.
The basic idea is two-fold: a sentence in a text can often focus on one or a few classes, meaning multiple classes can be derived from the individual sentences; furthermore, sentences can typically fit within the length limit. There are two main issues with implementing a sentence-level classifier: the loss of context for each sentence and the increased training cost due to the larger number of documents that need to be processed by a PLM-based model.
To address these issues, this paper proposes a framework, ASC, that uses sentence-level n-grams to form a sentence representation and employs a sentence selection method to reduce the number of sentences needed for training. The experimental results demonstrate that ASC outperforms existing text-level classifiers, achieving 25% and 48% improvements in Macro F1 metrics.
VSum-HB: A Vietnamese Text Summarization Dataset For Reinforcement Learning From Human Feedback
ABSTRACT. This paper introduces a novel Vietnamese text summarization dataset named VSum-HB, designed specifically for Reinforcement Learning from Human Feedback (RLHF). A total of 5,000 samples from the Vietnews corpus were carefully selected, covering a wide range of topics. We employed a hybrid method combining automated summarization using ChatGPT 4.o with human annotation to refine the summaries. This approach ensures the dataset is highly suitable for training RLHF models to generate summaries closely aligned with human preferences. Experimental results indicate that models trained on this dataset, enhanced by RLHF, significantly outperform traditional summarization models, with marked improvements in ROUGE scores and positive evaluations from humans. Our work highlights the potential of this dataset to advance Vietnamese NLP research, with future studies aiming to expand its applications to other NLP tasks.
Exploring Vegan Dining Experiences: Insights from User-Generated Content Analysis
ABSTRACT. User-generated content (UGC) is a vital source of information for understanding consumer experiences and preferences in the hospitality industry. This study evaluates factors influencing the dining experience of vegan consumers by analyzing 13,935 Google Reviews of vegan-friendly restaurants. Utilizing Latent Dirichlet Allocation (LDA) to identify latent topics and VADER sentiment analysis to assess sentiment polarity, the research uncovers key themes affecting customer satisfaction and dissatisfaction. The analysis reveals eight primary topics influencing vegan dining experiences. Space emerged as the most frequently mentioned topic, significantly enhancing customer satisfaction, while "staff attitude" was a major factor in dissatisfaction. These insights offer actionable recommendations for restaurant managers to improve service quality and better meet vegan consumer needs. The study demonstrates the efficacy of combining LDA and VADER sentiment analysis to evaluate UGC, providing a comprehensive understanding of vegan dining experiences and guiding future research and practical applications in the hospitality industry.
ABSTRACT. In military applications, detecting camouflaged objects is a challenging task due to the ability of targets to blend seamlessly into their surroundings. This study investigates the impact of style transfer approaches on synthetic data generation and their effectiveness in improving camouflaged object detection. By utilizing style transfer techniques, we augment existing datasets with synthetic imagery that mimics various environmental textures and conditions. The goal is to enhance the training of detection models, enabling them to better recognize camouflaged objects under diverse operational scenarios. Experimental results demonstrate that style-transfer-based data augmentation improves detection accuracy and robustness in military camouflaged object detection systems. This research highlights the potential of style transfer for augmenting training data in detecting camouflaged military objects.
KidRisk: Benchmark Dataset for Children Dangerous Action Recognition
ABSTRACT. Children are naturally energetic, and during their spontaneous activities, they often encounter potentially dangerous situations, especially when lacking parental supervision. Identifying actions that pose risks plays a crucial role in ensuring their safety. This paper builds two challenging datasets, comprising 2,500 short videos of children's actions and 10,000 images of children's dangerous actions. We also introduce a benchmark on our newly constructed datasets and find that traditional deep learning models demonstrate limited effectiveness on them. Therefore, we develop vision-language based baselines with exceptional context understanding of visual information. Our proposed methods achieved an accuracy of 83.53% in classifying children's actions and 96.14% in recognizing children's dangerous actions, significantly outperforming traditional approaches. These results confirm that vision-language models are not only feasible but also highly effective in detecting hazardous actions, contributing positively to children's safety.
ABSTRACT. In this paper, we aimed to develop a neural parser for Vietnamese based on simplified Head-Driven Phrase Structure Grammar (HPSG). In the existing corpora, VietTreebank and VnDT, around 15% of constituency and dependency tree pairs did not adhere to simplified HPSG rules. To address this issue, we randomly permuted samples from the training and development sets to make them compliant with simplified HPSG. We then modified the first simplified HPSG Neural Parser for the Penn Treebank by replacing its encoder with the PhoBERT or XLM-RoBERTa models, which can encode Vietnamese texts. We conducted experiments on our modified VietTreebank and VnDT corpora. Our extensive experiments showed that the simplified HPSG Neural Parser achieved a new state-of-the-art F-score of 82% for constituency parsing when using the same predicted part-of-speech (POS) tags as the self-attentive constituency parser. Additionally, it outperformed previous studies in dependency parsing with a higher Unlabeled Attachment Score (UAS). However, our parser obtained a lower Labeled Attachment Score (LAS), likely because we focused on arc permutation without changing the original labels, as we did not consult a linguistic expert. Lastly, our findings suggest that linguistic experts should give more attention to simplified HPSG when developing treebanks for Vietnamese natural language processing.
DanceDuo: Bridging Human Movement and AI Choreography
ABSTRACT. In recent years, advancements in deep learning and generative models have revolutionized music-driven dance generation. This paper introduces a novel platform, namely DanceDuo, leveraging diffusion models to generate AI-choreographed dance sequences synchronized with a variety of music genres, to encourage dancing practice. The system allows users to interact with AI by selecting music tracks, humanoid models, and importing personal dance videos for comparison, fostering a rich and engaging user experience. DanceDuo not only offers dance generation but also integrates human pose estimation models to provide users with insightful comparisons of their own performances with AI-generated sequences. We conducted a comprehensive user study, revealing that users found the interface intuitive, with particular praise for the dance comparison feature. Our DanceDuo contributes significantly to the integration of AI in dance choreography, offering novel avenues for both recreational and professional applications.
ABSTRACT. Referring Video Object Segmentation (RVOS) is a challenging computer vision task that requires segmenting and tracking objects in video based on natural language descriptions. Traditional RVOS methods typically focus on static visual features such as color and shape, often extending image segmentation techniques to video using mask propagation and memory attention. While these approaches have seen varying levels of success, they often struggle with the dynamic nature of video content. Current RVOS datasets and methodologies have not fully addressed the complexity posed by motion and other temporal factors in video. To bridge this gap, the MeViS dataset emphasizes motion expressions in conjunction with language-based object segmentation. MeViS presents unique challenges, including the use of motion-centric language expressions, complex scenes with multiple objects of the same category, interactions between text and objects, and long video sequences. These complexities require a deeper understanding of both temporal and spatial information in video. This paper enhances existing RVOS techniques to meet the specific demands of the MeViS dataset. Our model is built upon the Swin-Large architecture and is initially trained on the Ref-Youtube-VOS-2021 dataset before being fine-tuned with the MeViS dataset. We implement a multi-step approach that leverages masks generated during training to accurately track object movement and eliminate misidentifications. Our solution achieves a J & F score of 0.5319 on the validation set, demonstrating its effectiveness in handling the complexities of motion and dynamic video content in RVOS.
A User Privacy Risk-Driven Approach to Web Cookie Classification
ABSTRACT. This paper presents a novel approach for classifying web cookies based on five levels of user privacy risks, ranging from very low (rare / unlikely occurrence or no impact) to very high (frequent / certain occurrence or significant impact).
The approach consists of two main phases: risk identification and risk assessment. To identify risks, a GRU-based model is first defined to categorize both encrypted and non-encrypted cookies into five purposes of use (necessary, functionality, analytics, advertising, or undisclosed). Next, user consent is detected from the HTML content of cookie banners, which are extracted from website screenshots using retrained YOLOv8-L visual detection combined with DOM-based feature detection. Potential user privacy risks are then determined by cross-referencing the cookies' usage purposes with user consent, and real-time web cookie contents.
To assess the user privacy risk level, risks are mapped into a matrix with two dimensions: the legal dimension, which represents the purpose of use of cookies, and the technical dimension, which represents the risks' frequency and impact according to the OWASP Top 10 Privacy Risks V2.0.
The approach is evaluated on a dataset of 300 cookies from 10 Vietnamese websites, where it correctly classifies web cookies into five user privacy risk levels.
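The two-dimensional risk matrix described above can be pictured with a minimal sketch. Everything numeric here is a hypothetical illustration: the purpose ranks, the 1 to 5 technical score, and the equal-width banding are assumptions made for the example, not values taken from the paper or from the OWASP Top 10 Privacy Risks.

```python
# Hypothetical 5x5 risk matrix. All ranks and thresholds are illustrative
# assumptions, not values from the paper.
LEVELS = ["very low", "low", "medium", "high", "very high"]

# Legal dimension: purpose of use, with an assumed sensitivity rank (1..5).
PURPOSE_RANK = {
    "necessary": 1,
    "functionality": 2,
    "analytics": 3,
    "advertising": 4,
    "undisclosed": 5,
}


def risk_level(purpose: str, technical_score: int) -> str:
    """Combine the legal rank with a technical score (1..5) into one of five levels."""
    if not 1 <= technical_score <= 5:
        raise ValueError("technical_score must be in 1..5")
    product = PURPOSE_RANK[purpose] * technical_score  # ranges over 1..25
    # Map the product onto five equal-width bands: 1-5, 6-10, 11-15, 16-20, 21-25.
    return LEVELS[(product - 1) // 5]
```

A multiplicative matrix is only one possible design; a lookup table with hand-assigned cells would serve equally well if the levels are set by policy rather than by formula.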
MADFuzz: A Study on Automatic Exploitation of Smart Contract Vulnerabilities Using Multi-Agent Reinforcement Learning-guided Fuzzing
ABSTRACT. Smart contracts, which serve as the backbone of decentralized applications (dApps), self-execute key functions on blockchain platforms. They currently manage assets valued in the trillions of dollars in the cryptocurrency space. However, their immutability after deployment makes them particularly vulnerable to exploitation if any weaknesses are present. Identifying and addressing these vulnerabilities is critical to avoiding significant financial and reputational damage. One commonly used automated method for fast and efficient vulnerability detection is fuzzing. However, both traditional fuzzing techniques and those based on machine learning encounter challenges, such as selecting ineffective transaction sequences, either by generating them randomly or by pre-generating them before running the fuzzer. As a result, transaction sequences are not updated based on the dynamic smart contract states during the fuzzing process. Additionally, some methods mutate test cases and store them in a pool, which becomes problematic when physical memory is limited. In this paper, we present MADFuzz, a Multi-Agent Deep Reinforcement Learning (DRL)-based approach designed to address the challenges of smart contract fuzzing. To improve the selection of effective transaction sequences, we develop agents that dynamically generate optimal functions and arguments based on the current state of the smart contract. By utilizing DRL, our approach generates transaction sequences in real-time without the need for memory storage, efficiently overcoming the limitations of previous methods. Finally, we conduct experiments to compare MADFuzz with existing state-of-the-art techniques, and the results demonstrate that our approach significantly outperforms the competition.
TL-SOINN: A Transfer Learning-Enhanced Self-Organizing Incremental Neural Network for Network Intrusion Detection
ABSTRACT. The increasing sophistication of cyberattacks has escalated the need for advanced network intrusion detection systems (NIDS). This thesis presents TL-SOINN, an innovative algorithm combining Deep Transfer Learning with a Self-Organizing Incremental Neural Network (SOINN). TL-SOINN leverages the strengths of both approaches, enabling effective learning from diverse datasets and rapid adaptation to new attack types. The model uses a Convolutional Neural Network (CNN) trained on a comprehensive dataset of cyberattacks to extract key features, which are then processed by SOINN. This method allows the system to retain knowledge of known threats while continuously adapting to new ones. Experimental results demonstrate that TL-SOINN outperforms traditional methods in detecting a broad spectrum of cyberattacks, making it a robust solution for evolving network security needs. Additionally, the thesis integrates the Suricata tool to enrich the dataset, enhancing the model’s generalization capabilities across different network environments. The successful application of TL-SOINN across various datasets highlights its potential as an effective and adaptable NIDS solution, capable of improving cybersecurity in increasingly complex and dynamic systems.
ABSTRACT. In the Internet of Things (IoT) domain, the development and efficiency of Intrusion Detection Systems (IDS) powered by machine learning (ML) have seen substantial growth. Especially, Federated Learning-based (FL-based) IDS have experienced notable growth, focusing on reducing data privacy violations and alleviating the communication and high-cost burdens associated with dataset collection. However, these approaches continue to face several challenges, particularly the presence of non-independent and identically distributed (Non-IID) data and a lack of labeled data on the client side, which remain significant concerns. Additionally, adversarial attacks also pose a significant concern for ML classification models in general, and particularly for ML-based IDS. To overcome these challenges, our paper proposes a semi-supervised federated learning approach for IDS, called SeFed-IDS, designed to mitigate the impact of limited labeled data. Additionally, we incorporate an autoencoder network alongside the Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) to augment data, effectively addressing the challenges posed by Non-IID data and adversarial attacks. In this study, we carried out experiments to assess the effectiveness of our approach in various scenarios, including those involving Non-IID data and different data distribution patterns. These experiments are conducted on two real-world datasets, NF-UNSW-NB15 and NF-CSE-CIC-IDS2018. Furthermore, the results indicate that our approach outperforms the original FL approach when dealing with adversarial data.
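The FGSM step named above perturbs an input in the direction of the sign of the loss gradient. The following is a minimal sketch on a toy logistic-regression classifier; the model, the function name `fgsm_example`, and the parameter `eps` are illustrative assumptions and do not reproduce the SeFed-IDS pipeline.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def sign(v: float) -> int:
    return (v > 0) - (v < 0)


def fgsm_example(x, y, w, b, eps):
    """Return x + eps * sign(dL/dx) for binary cross-entropy on a logistic model.

    x: feature list, y: label in {0, 1}, w: weight list, b: bias.
    For this model the input gradient is (p - y) * w, with p the predicted probability.
    """
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]
```

PGD can be viewed as repeating this step several times with a smaller step size and projecting back into an `eps`-ball around the original input after each step.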
Hybrid Compression: Integrating Pruning and Quantization for Optimized Neural Networks
ABSTRACT. Deep neural networks have witnessed remarkable advancements in recent years and have become integral to various applications. However, alongside these developments, training and deployment of neural network models on embedded and edge devices face significant challenges due to limited memory and computational resources. These problems can be addressed with deep neural network compression, which involves a trade-off between model size and performance. In this paper, we propose a novel method for model compression through two phases. First, we utilize model compression techniques, such as pruning and quantization, to significantly reduce the model size. Then, we use a Mixture of Experts (MoE) to route among the previously compressed models to enhance performance while maintaining a balance in inference efficiency. The MoE consists of multiple expert models (i.e., compressed models) that are moderately sized and deliver stable performance. Experimental results on several benchmark datasets show that our method successfully compresses CNN models, achieving substantial reductions in FLOPs and parameters with a negligible accuracy drop.
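As a rough illustration of the first phase, the sketch below applies magnitude pruning followed by symmetric int8 quantization to a flat list of weights. These are common textbook variants chosen for the example; the abstract does not say which pruning or quantization schemes the authors use, and the function names are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantize_int8(weights):
    """Symmetric uniform quantization; returns (integer codes, scale)."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]
```

With this scheme the reconstruction error of each weight is bounded by half the quantization step, which is one way to reason about the size versus accuracy trade-off mentioned above.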
Enhancing Unsupervised Person Re-identification with Multi-View Image Representation
ABSTRACT. Person Re-Identification (ReID) is a critical research domain with applications spanning social security and education, aimed at recognizing specific pedestrians across cross-camera surveillance footage. While initial ReID methods relied heavily on supervised approaches utilizing deep network models, the substantial data and manual annotation requirements have spotlighted the need for more scalable solutions. This paper addresses these challenges by focusing on unsupervised ReID methods, specifically delving into unsupervised domain adaptation and fully unsupervised learning techniques. Although these methods, which employ clustering algorithms, memory banks, and contrastive loss functions, are robust, they often suffer from noise in pseudo-labels, which, because of weak image feature representations, makes learning accurate features inefficient. To mitigate these limitations, we introduce a novel Multi-View Embedding Model designed to enhance image representation by capturing diverse views of an image, thus improving the robustness and discriminativeness of the features. Additionally, we propose a diversity loss function to facilitate the learning of these multi-view representations. Extensive experiments on two well-known datasets, Market-1501 and MSMT17, demonstrate that our method can significantly enhance the performance of unsupervised person ReID.
OSA: FPGA-based Octa-core SPHINCS+ Accelerator for IoT Security Applications
ABSTRACT. The rapid expansion of large-scale computing systems, including those focused on IoT security, necessitates the development of high-speed, low-power hardware platforms for post-quantum cryptography (PQC) to meet stringent security and data protection requirements. However, current research on PQC software and hardware accelerators continues to encounter issues with limited performance and excessive power consumption. Therefore, this paper introduces the Octa-core SPHINCS+ Accelerator (OSA), designed to optimize both processing speed and power efficiency for SPHINCS+-SHA-256 operations. OSA employs two major optimizations: optimal OSA memory organization for multiple instances and long message processing support and a SHA-256 octa-core to accelerate WOTS+ procedures. Evaluation conducted on the Xilinx ZCU102 FPGA shows that OSA achieves signing speeds that are 7.34 to 521 times faster than existing FPGA-based implementations. Additionally, compared to baseline designs, OSA reduces signing latency by 83.2%, increases signing speed by 4.7 to 5.95 times, and improves the power delay product (PDP) by 5.95 to 8.6 times. Real-time comparisons further highlight OSA's superior performance, outperforming high-end CPUs with speeds 2.19 to 7.61 times faster and significantly enhancing power efficiency, achieving a PDP improvement of 17.46 to 45.17 times.
Decoding Deepfakes: Caption Guided Learning for Robust Deepfake Detection
ABSTRACT. The rapid advancement of generative image models has sparked concerns over their misuse, making the development of detection tools a critical area of research. While progress has been made, most existing methods prioritize short-term gains and overlook the long-term generalization ability of these models. This paper focuses on detecting deepfakes (i.e., images generated by deep learning models) across diverse visual data types (e.g., faces, landscapes, objects, and scenes). In particular, we investigate the use of the visual-language model CLIP for deepfake detection. Previous works have shown promising results with CLIP, but its effectiveness in generative image detection has not been fully explained. Our analysis reveals that CLIP’s strength lies in aligning image features with their language-based descriptions, allowing it to distinguish deepfakes effectively. By extracting CLIP’s features and generating captions through a decoding model, we demonstrate its ability to identify deepfake images. Additionally, we propose a novel training method combining pseudo-captions and semantic captions to enhance the generalization of deepfake detection. Extensive experiments show that our proposed method achieved 98.6% accuracy on the ProGAN dataset and 95.9% on unknown diffusion model datasets. Our code is available at: https://github.com/genkerizer/CGL.
AYO-GAN: A novel GAN-based adversarial attack on YOLO object detection models
ABSTRACT. Adversarial attacks present a significant challenge in artificial intelligence (AI) and deep learning by subtly altering input data to cause misclassification or incorrect outputs. These attacks manipulate input data in ways imperceptible to humans, tricking even state-of-the-art models into making errors. This vulnerability affects many AI applications, including image recognition, natural language processing, and autonomous driving. To counter these threats, researchers are developing methods to improve the robustness and reliability of AI systems. This study proposes AYO-GAN, an adversarial attack method utilizing a Generative Adversarial Network to generate perturbations designed to deceive object detection models. The goal is to evaluate the model's robustness against such attacks. Experimental results on the COCO dataset show an average Structural Similarity Index (SSIM) of 0.936 and an average Attack Success Rate (ASR) of 22.25%, surpassing Xiao et al.'s best SSIM of 0.842 and average ASR of 12.67%.
Distortion-Resilient DIBR for Novel View Synthesis from a Single Image
ABSTRACT. It is challenging to render novel views from a single image input due to inherent ambiguities in the geometry and texture information of the desired scene. As a consequence, existing methods often encounter various types of distortions in synthesized views. To this end, we propose a distortion-resilient Depth-Image-Based Rendering (DIBR) method for synthesizing novel views given a single image input. The proposed method is qualitatively and quantitatively evaluated on the Real-Estate 10K dataset, showing superior results compared to baselines.
DehazeCLNet: A Contrastive Learning Framework with Advanced Feature Extraction for Image Dehazing
ABSTRACT. In recent years, significant progress has been made in image dehazing, particularly within indoor environments, as demonstrated by the SOTS-Indoor dataset. Previous approaches, such as MSBDN, FFA-Net, DeHamer, and MAXIM-2S, have utilized advanced techniques like feature fusion, attention mechanisms, and multi-scale feature extraction to address the challenge of haze removal. MSBDN achieved a Peak Signal-to-Noise Ratio (PSNR) of 33.67 and a Structural Similarity Index (SSIM) of 0.985 on the SOTS-Indoor dataset, while FFA-Net further improved these results with a PSNR of 36.39 and SSIM of 0.989. Subsequent methods, such as DeHamer and MAXIM-2S, continued to improve performance, reaching PSNRs of 36.63 and 38.11, respectively. In this study, we introduce DehazeCLNet, a novel model incorporating contrastive learning to enhance haze suppression capabilities. By leveraging contrastive loss and feature extraction at multiple depth levels, our method significantly improves image restoration quality. DehazeCLNet achieved a PSNR of 42.57 and SSIM of 0.996 on the SOTS-Indoor dataset, outperforming existing methods. These results underscore the effectiveness of our proposed approach, which not only addresses the limitations of prior techniques but also establishes new benchmarks for dehazing performance. This work highlights the potential of contrastive learning in the image dehazing domain and offers a promising direction for future research in efficient haze removal.
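For readers unfamiliar with the PSNR figures quoted above, the metric is 10·log10(MAX² / MSE), where MAX is the largest possible pixel value. A minimal sketch over flat pixel lists follows; the function name and the flat-list image representation are assumptions made for brevity, and published benchmark numbers depend on further evaluation details not covered here.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because the scale is logarithmic, a gain of about 6 dB corresponds to roughly halving the root-mean-square pixel error.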
A Lightweight End-to-End Multi-task Learning System for Vietnamese Speaker Verification
ABSTRACT. Automatic speaker verification (ASV) in low-capacity devices utilized for industrial Internet of Things (IoT) applications is faced with two major challenges: lack of annotated training data and model complexity. To address these challenges, this paper introduces the first Vietnamese audio dataset for training a multi-task learning method named Vi-LMM that jointly performs command detection, fake voice recognition, and speaker verification tasks. To optimize Vi-LMM for low-capacity devices, we further employ knowledge distillation to reduce the number of parameters by 3.5 times. An empirical experiment is conducted to evaluate the effectiveness of the proposed method and the results show that Vi-LMM outperforms strong single-task models in terms of both reducing the number of learnable parameters and achieving higher F1 scores while maintaining comparable error rates.
Boosting Image Super-Resolution: Incorporating Locally-enhanced FFN and Data Augmentation in the Swin Transformer architecture
ABSTRACT. Image super-resolution is a critical task in computer vision that aims to enhance the resolution of low-resolution images by generating high-resolution counterparts. In recent years, Image Super-Resolution models have evolved significantly, with traditional methods being replaced by deep learning-based approaches. The trend has shifted from basic Convolutional Neural Networks to more advanced architectures like Generative Adversarial Networks and, more recently, Transformer-based models. SwinIR, a state-of-the-art model, applies the Swin Transformer architecture to address both image super-resolution and restoration tasks, leveraging window-based self-attention mechanisms to efficiently model long-range dependencies. In this work, we propose two key enhancements to the SwinIR model: (1) the application of the CutBlur data augmentation method to diversify the training data in the training step, and (2) the addition of the LeFF layer to the Swin Transformer blocks. CutBlur is a method that enhances the model’s ability to learn where and how to super-resolve by cutting and pasting patches between matching areas of low-resolution and high-resolution image pairs. LeFF, or Locally-enhanced Feed-Forward Network, is a component of the Uformer architecture that enhances the capture of local context in image resolution by incorporating a depth-wise convolutional block within the feed-forward network of the Transformer block, which improves the network’s ability to capture feature interactions. Our experimental results demonstrate that the proposed SwinIR-Lecut model consistently outperforms the baseline across several datasets. It is worth noting that on the Set5 and Set14 datasets (simple complexity), the model achieves a PSNR of 38.37 dB and 34.17 dB, respectively, which are better than the baseline SwinIR. On the BSD100 and Manga100 datasets (medium complexity), our model reaches a PSNR of 39.61 dB, maintaining superior performance.
However, on the more challenging Urban100 dataset (high complexity), the performance remains comparable to the baseline, suggesting areas for further improvement.
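The cut-and-paste idea behind CutBlur can be sketched on small grayscale images stored as nested lists. This sketch assumes the low-resolution image has already been upsampled to the high-resolution size so that regions match pixel for pixel; the function name, arguments, and fixed patch coordinates are illustrative choices, and in practice the patch location and direction would be sampled randomly.

```python
def cutblur(hr, lr_up, top, left, height, width, lr_into_hr=True):
    """Paste a rectangular patch between matching regions of an HR/LR image pair.

    hr, lr_up: 2D lists of equal size (lr_up is the LR image upsampled to HR size).
    lr_into_hr: if True, paste the LR patch into a copy of HR; otherwise the reverse.
    """
    if len(hr) != len(lr_up) or any(len(a) != len(b) for a, b in zip(hr, lr_up)):
        raise ValueError("images must have the same size")
    base, source = (hr, lr_up) if lr_into_hr else (lr_up, hr)
    out = [row[:] for row in base]  # copy so the inputs stay untouched
    for r in range(top, min(top + height, len(out))):
        for c in range(left, min(left + width, len(out[r]))):
            out[r][c] = source[r][c]
    return out
```

The augmented image is used as the network input while the original high-resolution image remains the training target, so the model sees inputs whose regions differ in how much restoration they need.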
Distribution-Guided Object Counting with Optimal Transport and DINO-Based Density Refinement
ABSTRACT. Prompt-based object counting refers to estimating the number of objects corresponding to a selected category based on the text description provided by the user. Current state-of-the-art methods estimate object counts by summing the values in the predicted density map, without regard to the distribution of object locations. This is reflected in their loss function, mainly MSE loss, which focuses solely on quantity. This leads the model to overestimate the count of an object class due to factors such as overlapping, occlusion, or self-similar objects. To address this, we propose OptiCount, a framework using an Optimal Transport plan to measure the difference between the density map and ground truth for training. Furthermore, we introduce a density-refinement module that validates the number of objects counted to avoid overcounting. This module significantly reduces the counting error of the model, making it more robust to various challenges. Experiments on the FSC147 dataset show that OptiCount outperforms state-of-the-art methods in terms of Mean Absolute Error (MAE), demonstrating its effectiveness in the counting task.
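The counting-by-summation step and the MAE metric mentioned above are simple enough to sketch directly. The density maps here are tiny hand-made nested lists, and the function names are hypothetical; the sketch does not implement the optimal transport loss or the refinement module.

```python
def count_from_density(density_map):
    """Estimated object count: the sum of all values in a 2D density map."""
    return sum(sum(row) for row in density_map)


def mean_absolute_error(predicted_counts, true_counts):
    """MAE between predicted and ground-truth counts over a set of images."""
    if len(predicted_counts) != len(true_counts) or not true_counts:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = zip(predicted_counts, true_counts)
    return sum(abs(p - t) for p, t in pairs) / len(true_counts)
```

Two maps that place the same total mass in different locations produce the same count, which is why a loss based only on quantity cannot tell them apart.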
FDE-Net: Lightweight Depth Estimation for Monocular Cameras
ABSTRACT. Depth estimation techniques typically involve extracting features of objects in the environment and their relationships. However, these methods require multiple images, making them less feasible for many real-time scenarios. To alleviate these challenges, the rise of efficient convolutional neural networks (CNNs) with the ability to infer depth from a single image opens a new avenue for investigation. This research introduces an efficient FDE-Net designed to generate cost-effective depth maps from a single image. The new framework consists of a PP-LCNet as the convolutional encoder and a fast decoder. Moreover, this combination integrates the Squeeze-Exploit (SE) module using the MKLDNN optimizer to enhance convolutional efficiency and rationalize model size with efficient training. Meanwhile, the proposed multi-scale pixel-wise fast decoder generates state-of-the-art depth maps while maintaining an efficient structure. Experimental results demonstrate that our model achieves state-of-the-art performance on four datasets: NYU-V2, KITTI, Cityscapes, and a simulated environment. Notably, FDE-Net utilizes merely 0.04 times the parameter count of Resnet-Upconv. The computational efficiency is underscored by FLOP and MAC, showcasing a considerable superiority relative to competing models. FDE-Net reduces latency by a factor of 4.2 and improves throughput by a factor of 3.9 compared with Resnet18-Upconv.
Minimalist Preprocessing Approach for Image Synthesis Detection
ABSTRACT. Generative networks, particularly those specializing in image generation (GANs and Diffusion models), have achieved significant advancements, resulting in the production of images that are increasingly indistinguishable from authentic ones. The advancement of hardware, coupled with pretrained models and readily available tools on the internet, has facilitated the rapid and efficient generation of images. However, the creation and dissemination of fake images with malicious intent is becoming a growing concern for society. In this study, we introduce a method that is both straightforward and highly effective. Our experimental results indicate that this method attains accuracy levels comparable to several state-of-the-art approaches. Notably, our approach requires very low computational resources, making it feasible to deploy and run offline on low-end devices such as smartphones.
A Novel Reversible Data Hiding for JPEG Images Based on Zero AC Coefficients Shifting
ABSTRACT. JPEG is a widely used lossy compression format for digital images, significantly reducing storage requirements and transmission bandwidth. It is of great significance to use reversible data hiding (RDH) technology to protect the privacy and confidentiality of data. This paper proposes a novel RDH scheme based on optimal zero AC coefficients shifting for JPEG images. The proposed scheme decomposes a cover JPEG image to retrieve quantized DCT coefficient blocks. Each block is scanned in zigzag order and divided into two areas with different DCT frequencies. One contains some lower-frequency DCT coefficients called the embedding area, and the other contains middle and high-frequency DCT coefficients called the non-embedding area. To achieve high imperceptibility under a given payload, an optimal zero AC coefficient selection strategy is employed in the embedding area. This strategy determines the location of zero AC coefficients that occur most frequently across all blocks, prioritizing the shifting of these zero AC coefficients for data embedding. Since zero AC coefficients located in low-frequency components correspond to small quantization steps, image quality remains visually good. Additionally, our method avoids shifting zero AC coefficients in high-frequency components, resulting in a minimal increase in file storage size. Experimental results demonstrate that the proposed method significantly outperforms previous works in terms of both embedding capacity and image quality.
ABSTRACT. In this paper, we address the recognition of motion illusions in static images. To this end, we collect a new dataset containing images both with and without motion illusions. We then benchmark state-of-the-art deep learning models to determine the presence of illusions in the images. Additionally, we assess the role of color in the recognition process. The experimental results show that deep learning models are effective in identifying motion illusions, with superior performance on color images, highlighting the importance of color in analyzing motion within static images.
AI-Generated Image Recognition via Fusion of CNNs and Vision Transformers
ABSTRACT. Recent advancements in synthetic data technology have opened a new era where images of remarkable quality are generated, blurring the lines between real-life images and those produced by Artificial Intelligence (AI). This evolution poses a significant challenge to ensuring the reliability and authenticity of data, underscoring the need for robust detection methods. In this paper, we present a robust approach aimed at addressing these pressing concerns. Our methodology revolves around leveraging fusion strategies, combining the strengths of multiple detection methods for identifying AI-generated images. Through extensive experimentation on the CIFAKE dataset, our model showcases remarkable performance, achieving an impressive accuracy rate of 97.32%. This accomplishment underscores the efficacy of our approach in accurately distinguishing between AI-generated images and real-life images, thus contributing to the advancement of data authentication techniques amidst the proliferation of synthetic data.
Diffusion-Based Purification for Adversarial Defense in Medical Image Classification
ABSTRACT. Adversarial attacks pose a significant threat to the robustness and reliability of machine learning models in medical-related tasks. These attacks involve the deliberate manipulation of input data to deceive models into making incorrect predictions, often by introducing subtle perturbations that lead to drastic changes in predicted outcomes. In the medical domain, where machine learning models are increasingly relied upon for disease diagnosis and treatment recommendation, the consequences of such attacks can be severe, potentially leading to misdiagnosis or inappropriate clinical decisions.
Research has been conducted into building defense systems to safeguard machine learning models from adversarial attacks. Recently, Shi et al. proposed the Zero-shot Image Purification (ZIP) framework that leverages diffusion models to transform adversarially-attacked images back into clean ones without knowing the attack mechanism a priori. However, the datasets used in their experiments are just standard benchmark datasets (i.e., CIFAR-10, GTSRB, and Imagenette), which might differ a lot from real-world datasets.
Therefore, in this paper, we evaluate the vulnerability of deep learning architectures commonly used in medical imaging, and investigate the effectiveness of the ZIP framework under different attack strategies.
We also fine-tune the framework parameters and the classifiers for performance enhancement. The results demonstrate that after applying image purification and model fine-tuning, the classifiers exhibit improved robustness against adversarial perturbations, where the defense accuracy increases significantly across all medical datasets and attack methods.
Source code is available at: https://github.com/ELO-Lab/ZIP-medical-adversarial-defense.
A combination of YOLO and OSNet Re-ID neuronal networks for tracking abnormalities in Upper Gastrointestinal Endoscopy Videos
ABSTRACT. Endoscopic image processing is vital in medicine, but manual analysis is time-consuming and labor-intensive. Tracking abnormalities in endoscopic videos is a key focus of artificial intelligence (AI), yet it remains a complex task that depends heavily on medical experts. Utilizing AI can provide benefits like real-time detection during procedures and improved diagnostic support afterward, along with aiding medical research and education. However, challenges such as tissue deformation, unstable lighting, and camera movement complicate the process, requiring sophisticated image processing and machine learning techniques for accurate tracking. Given these challenges, this research explores the use of recent advances in neural networks, including YOLO (You Only Look Once), the StrongSORT tracking algorithm, and OSNet (Omni-Scale Feature Learning Network) Re-identification models, for tracking abnormalities in upper gastrointestinal (GI) endoscopy videos. The results show that combining these models improves detection efficiency and accuracy, enabling real-time operation. Moreover, the ratio of accurately identified detections to the average number of validated detections (IDF1 70.5%), along with Precision (90.9%) and Recall (56.1%), are relatively satisfactory results, indicating that the integration of these technologies has great potential in improving the process of diagnosis and monitoring of pathological conditions through endoscopic videos. Notably, the reduction of ID switches to an insignificant amount further highlights the robustness of the system, ensuring continuous and accurate tracking over time. This significant improvement offers substantial potential for advancing endoscopic diagnosis and healthcare quality.
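The three metrics reported above follow standard definitions, sketched here from raw counts. The function and argument names are hypothetical, and the counts in any usage are made up for illustration; they are not derived from the paper's results.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of reported detections that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of ground-truth objects that were detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """Identity F1: correctly identified detections over the average of
    ground-truth and computed detections, i.e. 2*IDTP / (2*IDTP + IDFP + IDFN)."""
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom else 0.0
```

A high precision paired with a lower recall, as in the figures above, indicates a tracker that rarely reports false abnormalities but misses a share of the true ones.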
Integrating Graph and Transformer-Based Models for Enhanced Chemical-Drug Relation Extraction in Document-Level Contexts
ABSTRACT. The relationship between chemicals and drugs, such as Chemical-Induced Diseases (CID), is essential to many aspects of biological research and health care. Traditionally, methods divide CID tasks into two categories: those that extract relations within sentences and those that extract relations between sentences, namely intra-sentence and inter-sentence, respectively. The inter-sentence level is more sophisticated because long-distance interactions must be captured. Our focus in this research is on the inter-sentence, or document, level. Numerous approaches have been put forth to tackle this problem: some researchers have focused on graph-based approaches to understand sentence relationships, while others have chosen to directly encode entities and then calculate their relationships using a transformer-based model. In this work, we present a CID relation extraction model that combines graph and transformer-based models, specifically extracting node embeddings using pre-trained transformer-based language models as an encoder and using an enhanced GCNII to learn more effective node embeddings. The results of our experiments demonstrate that our method works much better than baseline techniques, highlighting its usefulness in the extraction of CID relations between sentences.
MedGraph-RPE: Graph-Based Medical Segmentation Enhanced by Novel Relative Positioning Encoding
ABSTRACT. Medical image segmentation is crucial in various clinical applications, such as surgical planning and disease monitoring. However, existing learning approaches often struggle with accurately capturing the complex structures and spatial relationships in medical images. While graph neural networks (GNNs) offer a versatile approach by modeling element relationships, these methods face challenges in effectively learning intricate graph structures and relationships for medical image segmentation. To this end, we introduce MedGraph-RPE, which enhances Vision Graph UNet (ViG-UNet) with our novel Relative Positioning Encoding (RPE) module. During graph construction, RPE integrates relative positional relationships between nodes, improving learning capabilities without increasing model complexity. This module provides explicit spatial relationship information, enabling more precise segmentation of complex structures. MedGraph-RPE also preserves contextual information, ensuring that crucial spatial data is retained throughout the learning process. Empirically, our comprehensive experiments demonstrate state-of-the-art performance in brain tumor and skin lesion segmentation tasks.
Predicting Bee Swarming: Leveraging Machine Learning and Audio Feature Extraction
ABSTRACT. Swarming is a natural process that leads to reduced honey production and poses a challenge for beekeepers. Precision beekeeping provides swarm notifications to help prevent this phenomenon. The paper investigates the utilization of machine learning for the early detection of bee swarming behavior through the analysis of audio data. It employs three feature extraction methods, namely Mel Frequency Cepstral Coefficients (MFCCs), Short-Time Fourier Transform (STFT), and Chroma, to capture important characteristics of bee sounds. The effectiveness of five machine learning models (K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Naive Bayes (NB), Random Forest (RF), and Gradient Boosting (GB)) is evaluated in distinguishing between swarming and non-swarming states using two real-world datasets collected in Vietnam. To ensure the models’ generalizability, the models are assessed on a completely separate validation set that was not used during training. The experiment results reveal the significant potential of employing machine learning methods for the detection of bee swarming.
BSRBF-KAN: A combination of B-splines and Radial Basis Functions in Kolmogorov-Arnold Networks
ABSTRACT. In this paper, we propose BSRBF-KAN, a Kolmogorov-Arnold Network (KAN) that combines B-splines and radial basis functions (RBFs) to fit input data during training. We perform experiments with BSRBF-KAN, multi-layer perceptron (MLP), and other popular KANs, including EfficientKAN, FastKAN, FasterKAN, and GottliebKAN, over the MNIST and Fashion-MNIST datasets. BSRBF-KAN shows stability in 5 training runs with a competitive average accuracy of 97.55% on MNIST and 89.33% on Fashion-MNIST and converges better than other networks. We expect BSRBF-KAN to open up many combinations of mathematical functions for designing KANs. Our repo is publicly available at: https://github.com/hoangthangta/BSRBF_KAN.
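One way to picture the combination of B-splines and RBFs is the sketch below. The abstract does not state the exact formulation, so this assumes an element-wise sum of a B-spline basis expansion and a Gaussian RBF expansion evaluated at a scalar input; the learned weights of a real layer are omitted, and all function names, the grid size, and the RBF width are illustrative assumptions rather than the authors' implementation.

```python
import math


def bspline_basis(x, knots, degree):
    """Cox-de Boor recursion: values of all B-spline basis functions at x."""
    n = len(knots) - 1
    # Degree-0 basis: indicator of each knot interval.
    basis = [1.0 if knots[i] <= x < knots[i + 1] else 0.0 for i in range(n)]
    for d in range(1, degree + 1):
        nxt = []
        for i in range(n - d):
            left_den = knots[i + d] - knots[i]
            right_den = knots[i + d + 1] - knots[i + 1]
            left = (x - knots[i]) / left_den * basis[i] if left_den > 0 else 0.0
            right = ((knots[i + d + 1] - x) / right_den * basis[i + 1]
                     if right_den > 0 else 0.0)
            nxt.append(left + right)
        basis = nxt
    return basis


def gaussian_rbf(x, centers, width):
    """Gaussian radial basis functions centred on `centers`."""
    return [math.exp(-((x - c) / width) ** 2) for c in centers]


def bsrbf_features(x, grid_size=5, degree=3, lo=0.0, hi=1.0):
    """Assumed combination: element-wise sum of the two basis expansions."""
    h = (hi - lo) / grid_size
    # Uniform knot vector extended by `degree` intervals on both sides.
    knots = [lo + (i - degree) * h for i in range(grid_size + 2 * degree + 1)]
    n_basis = grid_size + degree
    centers = [lo + i * (hi - lo) / (n_basis - 1) for i in range(n_basis)]
    bs = bspline_basis(x, knots, degree)
    rbf = gaussian_rbf(x, centers, h)
    return [b + r for b, r in zip(bs, rbf)]
```

Calling `bsrbf_features(0.5)` returns a vector of `grid_size + degree` combined basis values; in a trainable layer these would be multiplied by learned coefficients.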
Diverse Adversarial Samples for Text-to-Image Generation via Quality-Diversity Optimization
ABSTRACT. The popularity of text-to-image generative models comes with the potential danger of intentional attacks from input texts that might lead to misleading image generation. Recently, several works have studied the robustness of these models by automatically designing adversarial prompts. Prior work, namely RIATIG, utilizes a genetic algorithm evolving from an irrelevant prompt to craft a visually natural attack prompt that generates a desired target image without explicitly describing it to avoid detection. This method avoids being semantically similar to the target prompt that describes the target image, with the similarity controlled by a fixed threshold. However, many attack scenarios require different ranges of stealthiness, represented by distinctive threshold values of similarity. Quality-Diversity optimization, which searches for diverse high-quality solutions, is a natural fit for this problem. We propose applying MAP-Elites, a Quality-Diversity optimization method, to seek diverse adversarial texts that are versatile for different types of stealthiness in black-box settings. Experimental results on three widely-used generative models suggest that our method successfully finds various adversarial prompts with similarities to target and initial texts spread out over many values, allowing transfer to different attack settings.
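The core loop of MAP-Elites can be sketched on a toy problem. This is not the prompt-attack setting of the paper: solutions here are short lists of floats, and the `fitness` and `descriptor` callables, the bin count, and the mutation scale are all hypothetical stand-ins chosen only to show how one elite is kept per descriptor bin.

```python
import random


def map_elites(fitness, descriptor, dim=2, n_bins=10, iterations=2000, seed=0):
    """Toy MAP-Elites: keep the best solution found in each descriptor bin."""
    rng = random.Random(seed)
    archive = {}  # bin index -> (fitness, solution)

    def try_insert(x):
        d = descriptor(x)  # assumed to lie in [0, 1]
        cell = min(int(d * n_bins), n_bins - 1)
        f = fitness(x)
        if cell not in archive or f > archive[cell][0]:
            archive[cell] = (f, x)

    for _ in range(iterations):
        if archive and rng.random() < 0.9:
            # Mutate a randomly chosen elite, clipping genes to [0, 1].
            _, parent = rng.choice(list(archive.values()))
            child = [min(1.0, max(0.0, g + rng.gauss(0, 0.1))) for g in parent]
        else:
            # Occasionally sample a fresh random solution.
            child = [rng.random() for _ in range(dim)]
        try_insert(child)
    return archive


# Hypothetical usage: the first gene plays the role of a similarity score
# (the diversity axis), the second the role of attack quality.
elites = map_elites(fitness=lambda x: x[1], descriptor=lambda x: x[0])
```

The returned archive holds at most one elite per similarity bin, which mirrors the idea of collecting good solutions across a range of stealthiness levels rather than a single optimum.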
Contour-enhanced Segmentation: A Novel Approach for Ambiguous Boundary in Polyp Segmentation
ABSTRACT. Polyp segmentation has greatly benefited doctors in the early diagnosis and evaluation of colorectal cancer since it began to take advantage of modern deep-learning methods. The precision and robustness of polyp segmentation models have gradually improved in recent years despite hard challenges such as tiny polyps or missing edges. However, these problems have not been completely alleviated and also exist in other complex medical image processing tasks. Our study proposes a new method called contour-enhanced segmentation to enhance the capability of capturing the exact edge of polyps for accurate segmentation, which is important not only in this task but also in other segmentation tasks. Rather than attempting to build an end-to-end model to resolve the problem, we first train a model to detect the contour of the polyps and then use transfer learning for the downstream segmentation. We also devise a simple but efficient method to infer the contour of the polyps. By leveraging cutting-edge advances in computer vision, including the attention mechanism and MetaFormer, our model achieves results competitive with other state-of-the-art methods and even outperforms them in some cases on three popular datasets: CVC-ClinicDB, Kvasir-SEG, and ETIS-PolypDB.
Adversarial Robustness of Medical Image Classifiers via Denoised Smoothing
ABSTRACT. Deep neural networks (DNNs) have been shown to be highly susceptible to small changes in input data, making them vulnerable to adversarial attacks. These sensitivities in artificial intelligence (AI) systems can be easily exploited, posing substantial risks to the reliability and safety of DNNs, especially in critical applications like medical imaging. Enhancing adversarial robustness is essential to the secure deployment of DNNs in these sensitive domains. Traditional defense methods often require full access to the target model or re-training with adversarial augmentation data. Yet, medical models are trained on sensitive or domain-specific data, and typically undergo rigorous validation. Modifications to the parameters or architecture could inadvertently disrupt the model’s performance. Recently, Denoised Smoothing (DS) has emerged as a promising defense mechanism that prepends a denoiser neural network to a pre-trained image classifier. The denoiser is trained to remove Gaussian noise from the input data, potentially mitigating adversarial perturbations. The DS-based defense is particularly suitable for medical imaging systems since no alteration to the original classifier is required. In this paper, we experiment with different denoiser training objectives to properly adopt the DS technique for defending brain tumor and cervical cancer classification models. We then evaluate the performance of the models under three widely-used attack methods: Fast Gradient Sign Method (FGSM), Projected Gradient Descent (PGD), and Decoupled Direction and Norm (DDN).
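Of the three attacks named above, FGSM is the simplest to illustrate. The sketch below applies it to a toy logistic-regression classifier rather than to the medical image models of the paper; the weights, input, and step size `eps` are hypothetical values chosen only to show the single signed-gradient step.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_loss(w, b, x, y):
    """Binary cross-entropy of a logistic model on one example."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def fgsm(w, b, x, y, eps):
    """One FGSM step: move x by eps along the sign of the loss gradient."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    # For logistic regression, d(loss)/dx = (p - y) * w.
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]


# Hypothetical toy model and input.
w, b = [2.0, -1.0], 0.1
x, y = [0.5, 0.2], 1
x_adv = fgsm(w, b, x, y, eps=0.1)
```

Here the perturbed input `x_adv` has a higher loss than `x` under the same model, which is the effect a denoiser placed in front of the classifier is meant to counteract.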
Towards Unsupervised Speaker Diarization System for Multilingual Telephone Calls Using Pre-trained Whisper Model and Mixture of Sparse Autoencoders
ABSTRACT. Existing speaker diarization systems typically rely on large amounts of manually annotated data, which is labor-intensive and difficult to obtain, especially in real-world scenarios. Additionally, language-specific constraints in these systems significantly hinder their effectiveness and scalability in multilingual settings. In this paper, we propose a cluster-based speaker diarization system designed for multilingual telephone call applications. Our proposed system supports multiple languages and eliminates the need for large-scale annotated data during training by utilizing the multilingual Whisper model to extract speaker embeddings. Furthermore, we introduce a network architecture called Mixture of Sparse Autoencoders (Mix-SAE) for unsupervised speaker clustering. Experimental results on the evaluation dataset derived from two-speaker subsets of benchmark CALLHOME and CALLFRIEND telephonic speech corpora demonstrate the superior performance of the proposed Mix-SAE network to other autoencoder-based clustering methods. The overall performance of our proposed system also highlights the promising potential for developing unsupervised, multilingual speaker diarization systems within the context of limited annotated data. It also indicates the system's capability for integration into multi-task speech analysis applications based on general-purpose models such as those that combine speech-to-text, language detection, and speaker diarization.
Domain Generalization in Vietnamese Dependency Parsing: A Novel Benchmark and Domain Gap Analysis
ABSTRACT. Dependency parsing has received significant attention from the research community due to its recognized applications across diverse areas of natural language processing (NLP). However, the majority of dependency parsing studies to date have not addressed the out-of-domain problem, where the data in the testing phase follow a different distribution from the data in the training domains, despite this being a common problem in practice. Furthermore, Vietnamese is still considered a low-resource language in parsing tasks, as most standard treebanks are primarily developed for more widely spoken languages such as English and Chinese. This shortage makes research on Vietnamese dependency parsing even more difficult. To advance research on domain generalization in the Vietnamese dependency parsing task, this paper introduces a new treebank called DGDT (Vietnamese [D]omain [G]eneralization [D]ependency [T]reebank), where the domains in the train/dev/test sets are completely separated. This distinguishes our treebank from other Vietnamese dependency treebanks. We also release DGDTMark, a cross-domain Vietnamese dependency parsing benchmark suite that uses our treebank to assess the generalization ability of parsers across domains. Moreover, our suite can support further research in analyzing the impacts of domain gaps on the dependency parsing task. Through experiments, we observe that the performance of parsers is most affected by two gaps: newspaper topics and writing styles. In addition, performance drops remarkably, by 3.27% UAS and 5.09% LAS, in the scenario with the largest domain gap, which indicates that our treebank poses a significant challenge for further research.
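The UAS and LAS figures reported above follow the usual attachment-score definitions, which can be sketched as follows. The data layout (one `(head_index, relation_label)` pair per token) and the example parse are assumptions made for illustration and are not drawn from DGDT.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) for one sentence.

    gold, pred: lists of (head_index, relation_label), one entry per token.
    UAS counts tokens with the correct head; LAS also requires the
    correct relation label.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equal in length")
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n
    las = sum(g == p for g, p in zip(gold, pred)) / n
    return uas, las


# Hypothetical four-token sentence.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "nmod")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (2, "nmod")]
uas, las = attachment_scores(gold, pred)
```

In this example three of four heads are correct (UAS 0.75) but only two tokens have both the correct head and label (LAS 0.5), which shows why LAS is never higher than UAS.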
TI-JEPA: An Innovative Energy-based Joint Embedding Strategy for Text-Image Multimodal Systems
ABSTRACT. This paper focuses on multimodal alignment within the realm of Artificial Intelligence, particularly in text and image modalities. The semantic gap between the textual and visual modalities hinders effective multimodal fusion. Therefore, we introduce Text-Image Joint Embedding Predictive Architecture (TI-JEPA), an innovative pre-training strategy that leverages an energy-based model (EBM) framework to capture complex cross-modal relationships. TI-JEPA uses the flexibility of EBMs in self-supervised learning to facilitate compatibility between textual and visual elements. Through extensive experiments across multiple benchmarks, we demonstrate that TI-JEPA achieves state-of-the-art performance on the multimodal sentiment analysis task (and potentially on a wide range of multimodal tasks, such as Visual Question Answering), outperforming existing pre-training methodologies. Our findings highlight the potential of energy-based frameworks in advancing multimodal fusion and suggest significant improvements for downstream applications.
Unifying Convolution and Self-Attention for Liver Lesion Diagnosis on Multi-phase Magnetic Resonance Imaging
ABSTRACT. Accurate liver lesion diagnosis is crucial for effective treatment planning, with Magnetic Resonance Imaging (MRI) being a key diagnostic tool due to its ability to provide detailed anatomical and functional information. Despite its benefits, the manual analysis of 3D multi-phase MR images is challenging for radiologists due to the complexity of the data and the variability in lesion characteristics. To address this issue, in this paper, we propose a novel approach that integrates convolutional neural networks and self-attention mechanisms using the UniFormer framework. This method combines local and global feature extraction to enhance the accuracy of liver lesion classification. By leveraging pretrained weights from video tasks, the model performs better in identifying and classifying lesions than traditional methods. Extensive experiments with the LLD-MMRI2023 dataset, which includes multiphase MR images for liver lesions, demonstrate significant advancements in diagnostic accuracy. This approach not only aids in automating the analysis process but also supports radiologists by reducing diagnostic errors and improving patient care. The research highlights the effectiveness of combining convolutional and self-attention mechanisms in medical image analysis and suggests promising avenues for future automated diagnostic systems.
Log-based Representation Transferable Learning for Cross-System Anomaly Detection
ABSTRACT. Log-based anomaly detection is vital for identifying system failures as well as early attacks. However, analyzing logs and identifying anomalies in newly deployed systems is often challenging due to insufficient labeled log data, particularly anomalies, which can diminish the effectiveness of supervised learning approaches. In this paper, we introduce a novel cross-system log-based anomaly detection method, termed Log Representation Transferable (LogReT). This method harnesses prior knowledge from established systems and integrates a small number of logs (with few anomalies) from the new system to improve its ability to adapt across systems. Our method employs a shared Long Short-Term Memory model to learn representations that effectively distinguish between normal and anomalous log data, while ensuring that normal logs are consistently mapped across systems. Experimental results on the BGL and Thunderbird log datasets indicate that LogReT frequently surpasses recent log-based anomaly detection methods. This demonstrates significant promise for few-shot anomaly detection in cross-system contexts.
A Deep Learning Approach to Early Identification of Remote Access Trojans
ABSTRACT. Remote Access Trojans (RATs) have grown dramatically, becoming more complex and difficult to detect. Although there is increased interest in RATs, it remains difficult to deliver accurate and rapid RAT detection. In this paper, we propose RATID, a multi-layer deep learning architecture that can accurately identify different kinds of RATs. This architecture is based on Convolutional Neural Networks, leveraging network flows for the early detection of RATs during their connection establishment phase. We have created a dataset that contains 19,000 sessions of 23 different RAT families. Our comprehensive analysis indicates that the proposed mechanism can achieve high accuracy in both binary and multi-class classification scenarios, surpassing existing state-of-the-art methods. Additionally, it only takes 2 milliseconds to render a decision, thus making it suitable for practical use.
Privacy Challenges in Genomic Data: A Scoping Review of Risks, Mitigation Strategies, and Research Gaps
ABSTRACT. Advances in genomic research have created new privacy challenges. This scoping review analyzes the risks associated with the processing, storage, and sharing of genomic data including epigenetics, and examines current privacy protection strategies. It also attempts to identify research gaps in this area. Using the PRISMA methodology, 35 relevant studies were identified and analyzed. The results of the risk assessment can be grouped into four main themes: Risks posed by processing of functional genomic data, sharing of genomic data, patient (re-)identification, and dividuality, i.e. the extending of privacy risks to blood relatives. The identified risk mitigation approaches and strategies were systematically categorized into five themes: pre-release measures, governance, secure data processing and exchange, access restriction and transparency, anonymization and masking. However, there are some important research gaps that still need to be addressed. The current literature neglects to assess the likelihood of potential breaches and tends to focus only on assessing possible scenarios of privacy risks. It also mainly fails to assess the role of contextualized data and the effectiveness of policies and governance systems with respect to privacy risks.
An Efficient Explainable Unsupervised Machine Learning Approach for Network Intrusion Detection in IoMT
ABSTRACT. The Internet of Things is progressively becoming prevalent in different industries, including medicine and healthcare. Implementing the Internet of Medical Things (IoMT) offers substantial advantages in diagnosis and treatment. Nonetheless, IoMT-based healthcare systems encounter security concerns that adversely impact the quality of therapy and directly jeopardize patient health. Many studies have employed Machine Learning to detect network intrusion on IoMT systems; however, most utilize supervised learning techniques. This research presents a detection method employing unsupervised machine learning algorithms to identify potential future attack techniques. The proposed approach incorporates Explainable AI principles to identify significant elements that enhance prediction accuracy. We evaluated three distinct algorithms: K-means, One-Class SVM, and Autoencoder. The One-Class SVM model demonstrated superior performance, with an accuracy of 99.94% and a false positive rate below 2.6% on the CIC-IoMT2024 dataset.
Post-Correction of Handwriting Recognition Using Large Language Models
ABSTRACT. Handwriting recognition enables the automatic transcription of large volumes of digitized collections, providing access to the content. However, regardless of the system used, some recognition errors still occur. With the advancement of Large Language Models (LLMs), the question arises whether these models can improve handwriting recognition as a post-processing step. We have developed a method for LLM-based post-correction and evaluated it on three benchmark datasets, namely Washington, Bentham, and IAM. We consistently achieved a character error rate reduction of up to 30%, though we observed significant variability depending on the prompt and the LLM used.
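The character error rate (CER) mentioned above is conventionally the Levenshtein edit distance between the recognized text and the reference, divided by the reference length. The sketch below assumes that definition and adds a helper for relative reduction; the function names and the example strings are hypothetical and do not come from the evaluated datasets.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate relative to the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


def relative_reduction(cer_before: float, cer_after: float) -> float:
    """Fraction by which post-correction lowers the CER."""
    return (cer_before - cer_after) / cer_before
```

For instance, a recognizer output of "handwritting" against the reference "handwriting" has a CER of 1/11; if post-correction removes the extra character, the relative reduction for that line is 100%.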
A Proposed Large Language Model-Based Smart Search for Archive System
ABSTRACT. This study presents a novel framework for smart search in digital archival systems, leveraging the capabilities of Large Language Models (LLMs) to enhance information retrieval. By employing a Retrieval-Augmented Generation (RAG) approach, the framework enables the processing of natural language queries and the transformation of non-textual data into meaningful textual representations. The system integrates advanced metadata generation techniques, a hybrid retrieval mechanism, a router query engine, and robust response synthesis, which together improve search precision and relevance. We present the architecture and implementation of the system and evaluate its performance in four experiments concerning LLM efficiency, hybrid retrieval optimizations, multilingual query handling, and the impacts of individual components. The results show significant improvements over conventional approaches and demonstrate the potential of AI-powered systems to transform modern archival practices.
SCA-DS: Face Anti-Spoofing Leveraging Enhanced Spatial and Channel-wise Attention and Depth Supervision
ABSTRACT. In this paper, we present an effective method for addressing the face anti-spoofing (FAS) problem. We approach it as a binary classification task and propose a method based on the simple concept of end-to-end binary supervision, augmented with several enhancements to boost model performance. Firstly, features extracted from a CNN model are refined using an upgraded version of the spatial/channel-wise attention module. Secondly, we introduce auxiliary branches to supervise the accuracy of depth map estimation at two different levels. This addition aids in capturing depth-related information effectively, providing the model with additional cues to discriminate between real and spoofed faces. We evaluated our proposed method on various public datasets using both intra-testing and cross-testing scenarios. The experimental results demonstrate the effectiveness of our approach, outperforming both referenced methods and current state-of-the-art methods.
DOLG-CNet: Deep Orthogonal Fusion of Local and Global Features combined with Contrastive Learning and Deep Supervision for Polyp Segmentation
ABSTRACT. Accurate polyp segmentation is vital for diagnosing colorectal cancer but faces challenges due to varying sizes, colors, and clinical conditions. Despite advancements, deep learning systems still have significant limitations in effectively detecting and segmenting polyps. Convolutional Neural Network-based methods struggle to capture long-range semantic relationships, whereas Transformer-based approaches often fail to understand local pixel interactions effectively. Moreover, these methods sometimes inadequately extract detailed features and face limitations in scenarios requiring optimized local and global feature modeling. To tackle these challenges, we introduce DOLG-CNet, a novel one-stage, end-to-end framework specifically crafted for polyp segmentation. Initially, we employ the cutting-edge ConvNeXt for its superior segmentation capabilities. Additionally, we integrate an orthogonal fusion module that adeptly merges global and local features to generate a rich combined feature set. We also introduce a unique training strategy that marries contrastive learning with segmentation training, enhanced by an auxiliary deep supervision loss to boost performance. Specifically, we create both high and low augmented versions for each input image and train the system to align their vector embeddings closely, regardless of the augmentation level. This method, combined with standard segmentation loss and deep supervision, facilitates faster and more effective convergence. Our experimental results demonstrate that DOLG-CNet achieves impressive performance, with a dice coefficient score of 0.913 on Kvasir-SEG, 0.761 on CVC-ColonDB, and 0.722 on ETIS. Additionally, in qualitative and quantitative benchmarks across various datasets, DOLG-CNet consistently outperforms well-known methods, proving its efficacy and potential in the field.
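The Dice coefficient used to report these results measures the overlap between a predicted mask and the ground truth. A minimal sketch follows, assuming binary masks flattened into lists of 0/1 values; the function name and the convention of returning 1.0 for two empty masks are assumptions for illustration.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 2 * intersection / total if total else 1.0
```

For example, `dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0])` evaluates to 0.5, since one of the two foreground pixels in each mask overlaps.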
VisChronos: Revolutionizing Image Captioning Through Real-Life Events
ABSTRACT. This paper aims to bridge the semantic gap between visual content and natural language understanding by leveraging historical events in the real world as a source of knowledge for caption generation. We propose VisChronos, a novel framework that utilizes large language models and dense captioning models to identify and describe real-life events from a single input image. Our framework can automatically generate detailed and context-aware event descriptions, enhancing the descriptive quality and contextual relevance of generated captions to address the limitations of traditional methods in capturing contextual narratives. Furthermore, we introduce a new dataset, specifically constructed using the proposed framework, designed to enhance the model's ability to identify and understand complex events. The user study demonstrates the efficacy of our solution in generating accurate, coherent, and event-focused descriptions, paving the way for future research in event-centric image understanding.
ViEduQA: A New Vietnamese Dataset for Question Answer Generation in Education
ABSTRACT. Large-scale and high-quality corpora are essential for evaluating question-answer generation (QAG) models, especially in low-resource languages such as Vietnamese. Within the education domain, QAG holds significant potential for practical applications; however, research in this field remains limited. To fill the gap, this paper introduces ViEduQA, a novel dataset designed to advance QAG in Vietnamese educational contexts. The dataset, covering four high school subjects (Biology, Geography, History, and Civic Education), consists of a QAG task that generates question-answer pairs directly from educational texts. We explore the capabilities of pre-trained language models (PLMs) like ViT5 and BARTPho and large language models (LLMs) such as SeaLLMs and Qwen2, leveraging fine-tuning and instruction-tuning methods. Finally, our analysis shows that the instruction tuning method has the potential to enhance the performance of QAG models, though it requires additional data. We also provide a system demonstrating how our QAG models function in education. The code and datasets are available at https://github.com/Shaun-le/AlphaEdu.
VOI-VR: Voice-driven Object Interaction in Virtual Reality with Large Language Models
ABSTRACT. This study explores the integration of voice interaction in virtual reality environments to enhance user engagement and accessibility. Using a virtual reality headset, users can interact with 3D objects, such as selecting a cup hidden behind a flower vase, through voice commands instead of traditional controllers, which can be cumbersome in occluded scenarios. Leveraging advancements in large language models (LLMs), we enhance the processing of user voice input for more intuitive interactions. To evaluate effectiveness, we conducted a user study comparing object search and arrangement using controllers versus voice commands in a VR object-finding game. Results indicate that voice interaction significantly improves object identification speed and overall user satisfaction, demonstrating the potential for more immersive VR experiences through innovative interaction modalities.
Towards Real-Time Open World Instance Segmentation
ABSTRACT. Instance segmentation is a common task in computer vision specifically, and computer science in general. Its applications are widely used in areas such as autonomous driving and automotive systems. However, current instance segmentation models are often limited, as they only perform well on fixed training sets. This creates a significant challenge in real-world applications, where the number of classes is strongly dependent on the training data. To address this limitation, we propose the concept of Open World Instance Segmentation (OWIS) with two main objectives: (1) segmenting instances not present in the training set as an "unknown" class, and (2) enabling models to incrementally learn new classes without forgetting previously learned ones, with minimal cost and effort. These objectives are derived from the open world object detection task of Joseph et al. We also introduce new datasets following a novel protocol for evaluation, along with a strong baseline method called ROWIS (Real-Time Open World Instance Segmentor), which incorporates an advanced energy-based strategy for unknown class identification. Our evaluation, based on the proposed protocol, demonstrates the effectiveness of ROWIS in addressing real-world challenges. This research will encourage further exploration of the OWIS problem and contribute to its practical adoption. Our code was published at https://github.com/4ursmile/ROWIS.
MythraGen: Two-Stage Retrieval Augmented Art Generation Framework
ABSTRACT. Text-to-image generation has seen rapid advancements, especially with the development of generative models. However, challenges remain in achieving high-quality, contextually accurate image outputs that faithfully match the provided textual descriptions, especially in artistic generation. In this paper, we present a simple yet efficient retrieval augmented generation framework, namely MythraGen, for text-to-artistic image generation by integrating an art retrieval mechanism with LoRA-based model fine-tuning. Our method extracts features from a large-scale art dataset, optimizing the generation process by combining artist-specific styles and content. Particularly, retrieved images from an external art database that have the highest similarity to the query prompt are used to finetune Stable Diffusion using LoRA for desired art generation. Experimental results and user studies on the WikiArt dataset show that our proposed method can generate artworks that closely match the user’s input, significantly outperforming existing solutions.
A Study On Explainable Graph Presentation Learning With Semantic Features Embedding For Windows Malware Detection
ABSTRACT. The Windows operating system is widely used around the world, and the amount of malware is increasing accordingly. Automatic malware detection methods are therefore essential, and Machine Learning (ML) and Deep Learning (DL) models have been widely applied in this domain. In this paper, we use several Graph Neural Network (GNN) models: Graph Convolutional Network (GCN), Graph Isomorphism Network (GIN), GraphSAGE, and Graph Attention Network (GAT) in two architectures, including GAT version 2 (GATv2). API functions are key information in malware analysis, so using them as input data is essential; we therefore conducted experiments on a dataset of approximately 26,000 PE files collected from the APIMDS and MalBehavD-V1 datasets, with their APIs extracted. The documentation of each API is then embedded using two versions of the Bidirectional Encoder Representations from Transformers (BERT) model and the Word2Vec method to generate the nodes of the graph. A graph can be created because the API functions of an executable file are called sequentially, which makes them suitable for representation as a graph. Our approach yields outstanding results, achieving an accuracy of 99.46% in distinguishing between malware and benign files. This approach addresses the limitations of static analysis by using graphs and the GATv2 model to capture complex relationships between information in the graph. In addition, we utilize Explainable Artificial Intelligence (XAI) models, specifically Parameterized Explainer for GNN (PGExplainer), to provide insights into why the model classifies a file as either malicious or benign.
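The idea that a sequence of API calls can be represented as a graph can be sketched as follows. The abstract does not specify how edges are defined, so this assumes one plausible construction: each distinct API is a node, and a directed edge links each call to the call that follows it, weighted by how often that transition occurs. The function name and the call sequence are hypothetical.

```python
from collections import Counter


def build_api_graph(calls):
    """Build a directed graph from an ordered list of API call names.

    Returns (nodes, edges): nodes is a sorted list of distinct API names,
    edges maps (source, destination) to the number of times destination
    was called immediately after source.
    """
    nodes = sorted(set(calls))
    edges = Counter(zip(calls, calls[1:]))
    return nodes, dict(edges)


# Hypothetical trace of Windows API calls from one executable.
trace = ["CreateFileW", "WriteFile", "CloseHandle", "CreateFileW", "WriteFile"]
nodes, edges = build_api_graph(trace)
```

In a pipeline like the one described, each node would then carry an embedding of the API's documentation text as its feature vector before being passed to a GNN.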
A Study on Efficient Provenance-Based Intrusion Detection System using Few-shot Graph Representation Learning
ABSTRACT. Today, the evolution of attacks has made traditional defense methods insufficient for modern complex situations. Advanced Persistent Threats (APTs), characterized by their persistence, sophistication, and diversity, are often initiated by large, well-organized, highly skilled hacker groups with clear objectives. Provenance-based Intrusion Detection Systems have become increasingly popular for their ability to detect sophisticated APT attacks. Despite their potential, they face significant challenges related to accuracy, practicality, and scalability, especially in situations with insufficient training data. We propose PROVSHOT, a few-shot graph representation learning framework for provenance-based intrusion detection systems, combined with the Model-Agnostic Meta-Learning (MAML) algorithm to effectively classify malicious entities in scenarios with limited data. PROVSHOT incorporates semantic encoding of node attributes to enhance the representational capability of the nodes, helping the model make better predictions. We evaluate the model on three public datasets: StreamSpot, Unicorn, and DARPA E3. The results indicate that PROVSHOT can accurately predict APT attack types across all datasets, even with limited data.
ABSTRACT. Application Programming Interfaces (APIs) continue to be the primary and most accessible data source for malware detection and classification methods. With recent Deep Learning (DL) breakthroughs, techniques for analysing API call sequences have become increasingly effective at extracting valuable insights. However, the length and complexity of these sequences can pose challenges, making them difficult to interpret and analyse comprehensively. Furthermore, traditional DL models may struggle to capture long-range dependencies and sequential patterns in such extended API call sequences, which are essential for accurate malware detection. This paper proposes a novel malware detection approach based on API call sequences that leverages the power of the Bidirectional Encoder Representations from Transformers (BERT) and a DL model combining Convolutional Neural Network (CNN) + Extended Long Short-Term Memory (xLSTM) techniques. BERT effectively captures the contextual relationships between API calls, while CNN-xLSTM proves highly effective at classifying sequences by preserving long-term dependencies and handling the complexities of sequential data. Experimental results on the EMBER dataset show that our approach performs better than existing state-of-the-art embedding and detection methods in both accuracy and robustness.