PRICAI 2025: PACIFIC RIM INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE 2025
PROGRAM FOR WEDNESDAY, NOVEMBER 19TH

09:00-10:00 Session 10: Keynote 1
Location: RHLT1
09:00
Evolutionary Intelligence: From DNA to Deployment

ABSTRACT. While AI has become synonymous with deep learning for many in science and society, traditionally, the field of AI is defined as something much larger. By focusing exclusively on deep learning, we risk overlooking powerful synergies with other AI paradigms. In this keynote, I will explore the concept of Evolutionary Intelligence, representing combinations of evolutionary computation and machine learning, and highlight how cross-pollination between these paradigms has already led to algorithmic advances. I will showcase how Evolutionary Intelligence can unlock new possibilities for real-world applications, particularly when harnessed with modern GPU architectures and applied toward the development of more interpretable, explainable AI systems. Drawing from personal experience, I will present concrete examples from the medical domain, including how cancer patients are today being treated using plans generated by EI algorithms, demonstrating that EI is not just theoretical promise, but also clinical reality. I will argue that meaningful impact requires deep integration, not just between AI techniques, but between AI and the domains we aim to transform as well. Moving beyond getting top scores on benchmarks, we should embrace close collaboration with domain experts to ensure that algorithmic innovation is truly meaningful and can be translated into tangible societal benefit.

10:00-10:30 Coffee Break
10:30-12:30 Session 11A: Computer Vision 1
Location: RHLT1
10:30
ReDo-Net: Reconstruction of Depth under Occlusion for 2D and 3D shape analysis of fruits and vegetables
PRESENTER: Geoffroy Heurtel

ABSTRACT. Reconstructing occluded objects is essential for accurate product size estimation in industrial and agricultural contexts, where occlusions often result from overlapping items or environmental clutter. Existing approaches often rely on class-specific priors, which limits their generalization across diverse product types and occlusion patterns. In this work, we present ReDO-Net, a novel shape-conditioned pipeline for 2D and 3D object completion under occlusion. Our approach encodes visible object shapes into a continuous latent space using a Variational Autoencoder (ReDO-VAE), which guides a U-Net-based module (ReDO-UNet) for amodal silhouette reconstruction. A dedicated GAN module (ReDO-GAN) then performs depth inpainting to recover full 3D shape information. Unlike class-conditioned methods, ReDO-Net leverages geometric priors directly from partial observations, allowing robust and category-agnostic completion. Experiments on both synthetic and real-world datasets demonstrate significant improvements over baselines: a 15% gain in 2D completion accuracy and high-quality 3D reconstructions, achieving an RMSE of 0.083 on occluded vegetable samples. To promote reproducibility, we release Fruits360Occlu, a public dataset with strong occlusions tailored for amodal mask prediction on agricultural products.

10:50
Extracting Automaton from Video Recognition Model

ABSTRACT. Video recognition is a powerful tool for identifying various actions from videos. Recent video recognition methods are based on deep neural networks (DNNs), which suffer from a lack of interpretability, making it difficult to understand their underlying decision logic. The aim of this study is to improve the interpretability of a video recognition model by extracting an automaton from the model and visualizing this automaton, which represents the model’s decision logic. The automaton is an effective representational framework from the perspective of temporal state transitions. We propose a novel approach to identifying transition destinations by dynamically searching at runtime, rather than training them entirely, thereby enabling the extraction of an automaton from a complex video recognition model. We quantitatively demonstrate that a video recognition model can be reproduced by an automaton, and qualitatively show that the extracted automaton can make the decision logic of a video recognition model interpretable. To the best of our knowledge, this study is the first attempt to extract an automaton from a video recognition model.

11:10
DEF-YOLO: Leveraging YOLO for Concealed Weapon Detection in Thermal Imaging

ABSTRACT. Concealed weapon detection aims at detecting weapons hidden beneath a person's clothing or luggage. Various imaging modalities like Millimeter Wave, Microwave, Terahertz, Infrared, etc. are exploited for the concealed weapon detection task. These imaging modalities have their own limitations, such as poor resolution in microwave imaging, privacy concerns in millimeter wave imaging, etc. To provide a real-time, 24x7, low-cost and privacy-preserving surveillance solution, we opt for thermal imaging despite the lack of a benchmark dataset. We propose a novel approach and a dataset for concealed weapon detection in thermal imagery. Our YOLO-based architecture, DEF-YOLO, is built with key enhancements in YOLOv8 tailored to the unique challenges of concealed weapon detection in thermal vision. We adopt deformable convolutions at the SPPF layer to exploit multi-scale features, and at the backbone and neck layers to extract low-, mid- and high-level features, enabling DEF-YOLO to adaptively focus on localization around objects in thermally homogeneous regions, without sacrificing much speed or throughput. In addition to these simple yet effective architectural changes, we introduce a new, large-scale Thermal Imaging Concealed Weapon dataset, TICW, featuring a diverse set of concealed weapons and capturing a wide range of scenarios. To the best of our knowledge, this is the first large-scale contributed dataset for this task. We also incorporate focal loss to address the significant class imbalance inherent in the concealed weapon detection task. Extensive experimentation demonstrates the efficacy of the proposed work and establishes a new benchmark for concealed weapon detection in thermal imagery.
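As an illustration of the focal loss the abstract mentions, the sketch below shows the standard binary formulation; the alpha and gamma values and exactly how DEF-YOLO applies the loss inside its detection head are assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights easy examples so that rare
    positives (e.g., concealed weapons) dominate the gradient."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)            # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```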

11:30
MDGFusion: A Mask-Guided Depth-Reliable Generative Approach for 3D Object Detection via 4D Radar-Camera Fusion
PRESENTER: Chenghao Wang

ABSTRACT. 4D millimeter-wave radar, an emerging sensor capturing spatial and Doppler velocity information, has gained significant attention in autonomous driving. However, its performance in 3D object detection remains limited due to the sparsity of its point clouds and its susceptibility to noise. Recent studies have explored temporal multi-frame fusion of radar point clouds to mitigate sparsity, but this often leads to depth distortions that hinder accurate object localization in multi-modal fusion approaches. To address these limitations, we propose a novel radar-camera fusion framework, MDGFusion, which contains two key innovations: (1) an Adaptive Radar Virtual Points Generation Module (ARGM) that generates depth-reliable dense virtual radar points by filtering anomalous depth information through segmentation-derived ROIs and coarse semantic labels; and (2) a Dual-Stream BEV Enhancing Fusion Module (DBEM) that employs multi-attention to fuse complementary radar and camera features, combining radar's precise depth cues and the camera's rich semantics through dual-stream feature enhancement and dynamic modality weighting. Our framework significantly improves 3D object detection accuracy and enhances robustness under adverse illumination. Extensive experiments on the VoD and TJ4DRadSet datasets demonstrate that our method achieves state-of-the-art results: 60.20% and 80.04% mAP in VoD's EAA and driving corridor regions, and 38.46%/44.04% 3D/BEV mAP on TJ4DRadSet.

11:50
UniPT: A Unified Representation Pre-Training for Multi-dataset 3D Object Detection
PRESENTER: Chenghao Wang

ABSTRACT. Current multi-dataset 3D object detection pipelines directly train the detectors from scratch, which achieves suboptimal performance owing to intrinsic discrepancies between various scenarios. In this paper, we observe that there are underlying shared spatial and semantic characteristics that can be excavated in a self-supervised and fully-supervised manner. Motivated by this, we propose a novel pre-training method (UniPT) with geometry-aware and semantic-aware supervision to obtain a universal representation of multiple datasets, thus reducing the vast differences. Concretely, we first perform point cloud reconstruction and point density estimation. Then, occupancy prediction and classification guidance are provided as semantic-aware supervision. Moreover, we devise a cross-dataset integration scheme to minimize the differences of dataset-level features between domains. Extensive experiments on the Waymo Open Dataset (WOD), nuScenes and KITTI with consolidation settings illustrate the effectiveness of our method. It is notable that UniPT surpasses the previous work Uni3D by 2.29 mAPH (L1) on WOD, 1.45 mAP (BEV) on nuScenes and 2.08 mAP (3D) on KITTI. Our code will be available at https://github.com/microjie372/UniPT

12:10
Enhanced YOLO11-based Real-time Helmet Detection for Construction Safety
PRESENTER: Wei Lou

ABSTRACT. Effective helmet detection is critical for construction safety but challenging to deploy in resource-limited environments. This paper proposes an enhanced object detector based on YOLO11, integrating Deformable Weighted Residual (DWR) modules and a Bidirectional Feature Pyramid Network (BiFPN) architecture. This integration significantly reduces computational demands while improving accuracy over YOLO11. Comprehensive experiments confirm that our proposed model outperforms other leading detection methods, providing a highly accurate, efficient solution for real-time helmet detection in resource-constrained environments.

10:30-12:30 Session 11B: Large Language Model 1
Location: RHLT2
10:30
UniAVLM: Unified Large Audio-Visual Language Models for Comprehensive Video Understanding
PRESENTER: Lecheng Yan

ABSTRACT. Modern video understanding requires integrating multimodal signals, but current Multimodal Large Language Models (MLLMs) often process audio and visual streams separately, missing key relationships and causing fragmented understanding with a disjointed audio-visual representation. In this work, we propose UniAVLM, a large audio-visual language model for comprehensive video understanding, which first employs Whisper-style audio feature extraction to capture relevant auditory information. We then introduce spatiotemporal position encoding to enhance the video representation with temporal dynamics. Finally, we implement cross-modal attention mechanisms to explicitly fuse the audio and visual features, allowing the model to learn the intricate relationships between these modalities and creating a cohesive multimodal representation. We conduct extensive experiments on the Audio-Visual Scene-Aware Dialogue (AVSD) benchmark, comparing our model against seven representative multimodal baselines, and demonstrate state-of-the-art performance, with our model achieving 48.91% accuracy and 89.93 BERTScore-F1. Specifically, our model outperforms the best vision-language model by 6.79% accuracy, and surpasses the state-of-the-art full multimodal model by 4.07% accuracy, while using only parameter-efficient fine-tuning. Comprehensive ablation studies highlight the critical impact of lightweight integration strategies and thorough cross-modal fusion on comprehensive video understanding.

10:50
TLoRA: Tri-Matrix Low-Rank Adaptation of Large Language Models

ABSTRACT. We propose TLoRA, a novel tri-matrix low-rank adaptation method that decomposes weight updates into three matrices: two fixed random matrices and one trainable matrix, combined with a learnable, layer-wise scaling factor. This tri-matrix design enables TLoRA to achieve highly efficient parameter adaptation while introducing minimal additional computational overhead. Through extensive experiments on the GLUE benchmark, we demonstrate that TLoRA achieves comparable performance to existing low-rank methods such as LoRA and Adapter-based techniques, while requiring significantly fewer trainable parameters. Analyzing the adaptation dynamics, we observe that TLoRA exhibits Gaussian-like weight distributions, stable parameter norms, and scaling factor variability across layers, further highlighting its expressive power and adaptability. Additionally, we show that TLoRA closely resembles LoRA in its eigenvalue distributions, parameter norms, and cosine similarity of updates, underscoring its ability to effectively approximate LoRA's adaptation behavior. Our results establish TLoRA as a highly efficient and effective fine-tuning method for LLMs, offering a significant step forward in resource-efficient model adaptation.
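The tri-matrix idea described above can be pictured with a small PyTorch sketch: two frozen random projections sandwich a trainable core, and a learnable per-layer scale modulates the update. Only the two-fixed-plus-one-trainable structure and the layer-wise scaling come from the abstract; the initialisation and the exact placement of the trainable matrix are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TriLoRALinear(nn.Module):
    """Illustrative tri-matrix adapter: frozen base weight plus an
    update A @ C @ B, where A and B are fixed random projections and
    only the small core C and a per-layer scale are trained."""
    def __init__(self, base: nn.Linear, r: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        d_out, d_in = base.weight.shape
        # fixed random projections (buffers, not trained)
        self.register_buffer("A", torch.randn(d_out, r) / r ** 0.5)
        self.register_buffer("B", torch.randn(r, d_in) / d_in ** 0.5)
        # trainable core (zero-initialised so the update starts at zero)
        self.C = nn.Parameter(torch.zeros(r, r))
        self.scale = nn.Parameter(torch.tensor(1.0))   # layer-wise scaling factor

    def forward(self, x):
        delta = self.A @ self.C @ self.B               # (d_out, d_in) low-rank update
        return self.base(x) + self.scale * (x @ delta.T)
```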

11:10
Can Large Language Models Handle Numeric Constraints? A Comprehensive Study and Solutions

ABSTRACT. Large Language Models (LLMs) perform well in reasoning and generation but often fail to meet explicit numerical constraints, such as fixed word counts or token lengths. This limitation affects tasks like summarization and structured generation but remains underexplored. In this work, we present a systematic study on numerically constrained generation with LLMs, aiming to assess their ability to process and follow quantitative requirements. We introduce a bilingual benchmark in English and Chinese to assess LLMs' ability to follow seven common types of numerical constraints, from simple to logically complex cases. Evaluation on six LLMs shows that performance drops as constraints become stricter, especially when targeting larger values or specific numerical categories (e.g., 13, non-powers of two), revealing controllability gaps. To address this, we explore three strategies: (1) prompt engineering, (2) stepwise generation, and (3) fine-tuning with chain-of-thought data. We release our benchmark and code to support future work on controllable text generation.
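A toy checker of the kind such an evaluation needs, here for a single word-count constraint; the function name, modes, and whitespace tokenisation rule are illustrative and not taken from the paper's benchmark.

```python
import re

def check_word_count(text: str, target: int, mode: str = "exact") -> bool:
    """Verify a simple numerical constraint on generated text.
    `mode` is one of "exact", "at_most", "at_least" (illustrative names)."""
    n = len(re.findall(r"\b\w+\b", text))
    if mode == "exact":
        return n == target
    if mode == "at_most":
        return n <= target
    return n >= target

# e.g. verify an output that was asked to contain exactly 50 words
print(check_word_count("word " * 50, 50))   # True
```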

11:30
Combining Non-Numerical Text and Numerical Sequences in LLM-based Survival Prediction

ABSTRACT. In clinical diagnosis, medical corpora often comprise many numerical values, which poses a challenge for Large Language Models (LLMs) to make accurate decisions. To evaluate LLMs' ability to reason with both non-numerical text and numerical sequences, we simulate real-world clinical scenarios and set up survival prediction tasks, thereby developing the Survival Prediction Dataset for COVID-19 (SPDC), which contains three datasets sampled under different conditions. Based on SPDC, we propose a highly adaptable framework using Concatenated Embedding of Non-Numerical Text and Numerical Sequences, denoted as CETS. Compared to the conventional method of processing plain-text input, our framework embeds non-numerical text and numerical sequences separately, achieving up to a 3.46% average increase in accuracy on SPDC. Through comparative experiments, we further clarify the impacts of standardization, patch length and stride, and position embedding on the performance of CETS. As a reusable and easy-to-implement framework, CETS facilitates the performance of LLMs in processing clinical corpora and has extensive application potential in clinical medicine.

11:50
Can Large Language Models Play to Win? Game-Theoretic Benchmarks in Poker for Probabilistic Reasoning Evaluation
PRESENTER: Xinyu Wang

ABSTRACT. Recent advances in Large Language Models (LLMs) have raised new questions about their abilities in reasoning and decision-making, particularly under uncertainty. While traditional evaluation benchmarks primarily focus on text understanding or deterministic tasks, they fail to probe the complex probabilistic reasoning and strategic planning essential for real-world intelligence. To bridge this gap, we propose PokerBench, a game-theoretic evaluation framework based on Texas Hold'em poker, for systematically assessing LLMs' capacity for probabilistic reasoning and decision making under incomplete information. Experiments with several state-of-the-art LLMs reveal significant limitations in their strategic behavior when compared with established algorithmic baselines, highlighting the challenges that current models face in game-theoretic and real-world decision scenarios. These findings emphasize the need for more comprehensive and fine-grained evaluation methods to drive progress toward robust and general AI reasoning. Data and code will be made available.

12:10
When Vision Becomes a Threat: Adversarial Prompt Injection via Visual Embedding Manipulation
PRESENTER: Yajing Ma

ABSTRACT. Recent progress in multimodal large language models (MLLMs) has brought impressive capabilities but also introduced critical safety vulnerabilities due to their susceptibility to adversarial manipulation. Unlike textual inputs that pass through symbolic-level filtering, visual inputs are mapped into continuous embeddings by frozen vision encoders and injected directly into the language model without explicit safety checks. This unveils an overlooked security risk. We propose the first stealthy embedding-level jailbreak attack that directly perturbs visual token embeddings to inject harmful semantics, thereby bypassing alignment filters and reliably triggering unsafe behavior in MLLMs. Our method constructs a latent semantic embedding matrix from a curated and model-assisted harmful text set and blends it into selected visual token embeddings. To validate the effectiveness of our injection approach, we systematically evaluate multiple spatial attack strategies guided by a segment-wise sensitivity analysis. Experiments on three representative MLLMs (LLaVA-1.5, LLaVA-1.6, and mPLUG-Owl2) demonstrate that our method achieves significantly higher attack success rates (ASR), outperforming the strongest baselines by up to 4.5% absolute ASR. Our results demonstrate that embedding-level injection presents a potent and stealthy jailbreak vector, outperforming prior methods and revealing an overlooked threat surface in MLLMs.

10:30-12:30 Session 11C: Time Series Analysis
Location: RHMZ02
10:30
Diff-DTF: Dynamic Temporal Feature Extraction and Refinement with Diffusion Model in Time Series Anomaly Detection

ABSTRACT. In recent years, generative models have demonstrated promising performance for anomaly detection in time series. However, time series data in the real world often presents inherent spatiotemporal uncertainties caused by noise and non-stationary environmental factors in sensor measurements. Moreover, there is a significant amount of redundant information association between various dimensions of data attributes. Existing approaches are unable to dynamically capture critical features and suppress noise interference under complex conditions such as network traffic, where the importance of each individual dimension in the temporal features of a time series often evolves over time. To solve these problems, we propose Diff-DTF, a novel anomaly detection approach integrating diffusion models with dynamic dimension-aware mechanisms. Our method introduces a dynamic temporal feature extraction mechanism that adaptively allocates dimension-wise weights based on temporal characteristics of the dataset to achieve dynamic focus on critical features. Furthermore, we improve information propagation by innovatively combining partial convolution (PConv) and depthwise separable convolution (DWConv), enabling our model to refine important information by adaptively emphasizing the most critical temporal characteristics. Diff-DTF represents the successful integration of diffusion models with dynamic temporal feature extraction and refinement, significantly advancing multivariate time series anomaly detection performance. Extensive experiments on four real-world time series datasets demonstrate substantial improvements compared to baselines, validating its effectiveness in detecting anomalies within complex multivariate time series data.

10:50
Dynamic-Segment-Masking Pre-Training for Multivariate Time-Series Classification

ABSTRACT. We propose a self-supervised pre-training framework for multivariate time-series classification that addresses the mismatch between fixed-window tokenization and the inherently variable temporal structure of real-world signals. Our framework combines Dynamic-Segment Masking (DSM) with a channel-independent Transformer encoder. DSM uses a recursive linear-fit validator to partition each sequence into content-adaptive, ε-linear segments and then randomly masks a proportion of segments. A lightweight, channel-independent Transformer encoder is then trained to reconstruct the missing intervals, thereby learning temporal dependencies between observed and missing intervals. Despite having only 2.4 million parameters, our model achieves an average accuracy of 0.74 across 14 UEA benchmark datasets—exceeding a randomly initialized baseline by 7% and outperforming four larger state-of-the-art models, while also converging more rapidly. A comprehensive comparison with ten existing methods shows that our model achieves higher average accuracy than four baselines and ranks better than six. Ablation studies and sensitivity analyses further demonstrate the effectiveness of DSM and the robustness of the framework across a wide range of hyperparameters.
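The dynamic segmentation step can be sketched as a recursive linear-fit splitter followed by random segment masking. The split rule below (split at the largest residual whenever a least-squares line fit exceeds ε) is one plausible realisation of the validator described in the abstract, not the authors' exact procedure.

```python
import numpy as np

def split_eps_linear(x, eps, start=0):
    """Recursively split a 1-D series into segments whose line fit has
    max absolute residual <= eps (sketch of an epsilon-linear validator)."""
    n = len(x)
    if n <= 2:
        return [(start, start + n)]
    t = np.arange(n)
    slope, intercept = np.polyfit(t, x, deg=1)
    resid = np.abs(x - (slope * t + intercept))
    if resid.max() <= eps:
        return [(start, start + n)]
    k = min(max(int(resid.argmax()), 1), n - 1)      # keep both halves non-empty
    return (split_eps_linear(x[:k], eps, start)
            + split_eps_linear(x[k:], eps, start + k))

def mask_segments(x, segments, ratio=0.4, seed=0):
    """Randomly zero out a proportion of segments for masked reconstruction."""
    rng = np.random.default_rng(seed)
    x = x.copy()
    picked = rng.choice(len(segments),
                        size=max(1, int(ratio * len(segments))), replace=False)
    for i in picked:
        s, e = segments[i]
        x[s:e] = 0.0
    return x

x = np.concatenate([np.linspace(0, 1, 50), np.linspace(1, -1, 30), np.full(40, -1.0)])
segs = split_eps_linear(x, eps=0.05)
print(segs)                       # roughly the three underlying linear pieces
masked = mask_segments(x, segs)
```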

11:10
XicorAttention: Time Series Transformer Using Attention with Nonlinear Correlation

ABSTRACT. Various Transformer-based models have been proposed for time series forecasting. These models leverage the self-attention mechanism to capture long-term temporal or variate dependencies in sequences. Existing methods can be divided into two approaches: (1) reducing the computational cost of attention by making the calculations sparse, and (2) reshaping the input data to aggregate temporal features. However, existing attention mechanisms may not adequately capture inherent nonlinear dependencies present in time series data, leaving room for improvement. In this study, we propose a novel attention mechanism based on Chatterjee's rank correlation coefficient, which measures nonlinear dependencies between variables. Specifically, we replace the matrix multiplication in standard attention mechanisms with this rank coefficient to measure the query-key relationship. Since computing Chatterjee's correlation coefficient involves sorting and ranking operations, we introduce a differentiable approximation employing SoftSort and SoftRank. Our proposed mechanism, "XicorAttention," is integrated into several state-of-the-art Transformer models. Experimental results on real-world datasets demonstrate that incorporating nonlinear correlation into the attention improves forecasting accuracy by up to approximately 9.1% compared to existing models.
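For reference, Chatterjee's rank correlation that XicorAttention builds on is straightforward to compute in its original, non-differentiable form (the paper replaces the sorting and ranking steps with SoftSort/SoftRank to make it trainable). The snippet assumes continuous data with no ties.

```python
import numpy as np

def chatterjee_xi(x, y):
    """Chatterjee's xi(x, y), no-ties formula: values near 1 indicate y is
    close to a (possibly nonlinear) function of x; near 0, independence."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    order = np.argsort(x, kind="stable")      # sort the pairs by x
    y_sorted = y[order]
    r = np.argsort(np.argsort(y_sorted)) + 1  # ranks of y along the x-sorted order
    return 1.0 - 3.0 * np.abs(np.diff(r)).sum() / (n ** 2 - 1)

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 1000)
print(chatterjee_xi(x, np.sin(x)))              # high: strong nonlinear dependence
print(chatterjee_xi(x, rng.normal(size=1000)))  # near 0: no dependence
```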

11:30
VARMA-Enhanced Transformer for Time Series Forecasting
PRESENTER: Jiajun Song

ABSTRACT. Although Transformer-based models have significantly advanced time series forecasting, their effectiveness and architectural complexity remain subjects of intense debate. Recent work, such as the Cross-Attention-only Time Series transformer (CATS), has demonstrated that eliminating the permutation-invariant self-attention mechanism can lead to superior performance and efficiency. However, these streamlined architectures may overlook the fine-grained, local temporal dependencies effectively captured by classical statistical models like VARMA. To address this gap, we propose VARMAformer, a novel architecture that synergizes the efficiency of a cross-attention-only framework with the principles of classical time series analysis. Our model introduces two key innovations: (1) a dedicated VARMA-inspired Feature Extractor (VFE) that explicitly models autoregressive (AR) and moving-average (MA) patterns at the patch level, and (2) a VARMA-Enhanced Attention (VE-atten) mechanism that employs a temporal gate to make queries more context-aware. By fusing these classical insights into a modern backbone, VARMAformer captures both global, long-range dependencies and local, statistical structures. Through extensive experiments on widely-used benchmark datasets, we demonstrate that our model consistently outperforms existing state-of-the-art methods. Our work validates the significant benefit of integrating classical statistical insights into modern deep learning frameworks for time series forecasting.

11:50
Detecting Domain Shifts in Myoelectric Activations: Challenges and Opportunities in Stream Learning

ABSTRACT. Detecting domain shifts in myoelectric activations poses a significant challenge due to the inherent non-stationarity of electromyography (EMG) signals. This paper explores the detection of domain shifts using data stream (DS) learning techniques, focusing on the DB6 dataset from the Ninapro database. We define domains as distinct time-series segments based on different subjects and recording sessions, applying Kernel Principal Component Analysis (KPCA) with a cosine kernel to preprocess and highlight these shifts. By evaluating multiple drift detection methods such as CUSUM, Page-Hinckley, and ADWIN, we reveal the limitations of current techniques in achieving high performance for real-time domain shift detection in EMG signals. Our results underscore the potential of streaming-based approaches for maintaining stable EMG decoding models, while highlighting areas for further research to enhance robustness and accuracy in real-world scenarios.
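A minimal version of this detection setup, assuming features are first projected with a cosine-kernel KPCA and a hand-rolled Page-Hinckley test then monitors the leading component; the thresholds and the random placeholder data are illustrative only, not tuned for EMG.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

def page_hinkley(stream, delta=0.005, lam=5.0):
    """Minimal Page-Hinckley test on a 1-D stream: returns the index at
    which the cumulative deviation from the running mean exceeds lam."""
    mean, cum, cum_min = 0.0, 0.0, 0.0
    for t, x in enumerate(stream, start=1):
        mean += (x - mean) / t
        cum += x - mean - delta
        cum_min = min(cum_min, cum)
        if cum - cum_min > lam:
            return t
    return None

# project EMG feature windows with a cosine-kernel KPCA,
# then monitor the first component for shifts across sessions
X = np.random.randn(500, 64)                    # placeholder feature windows
z = KernelPCA(n_components=1, kernel="cosine").fit_transform(X)[:, 0]
print(page_hinkley(z))
```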

12:00
MEET-Sepsis: Multi-Endogenous-View Enhanced Time-Series Representation Learning for Early Sepsis Prediction

ABSTRACT. Sepsis is a severe infectious complication that leads to high mortality in the Intensive Care Unit (ICU). In clinical practice, early and reliable Sepsis Prediction (SP) has a crucial impact on the timely initiation of antibiotic therapy and fluid resuscitation. With the help of Artificial Intelligence (AI) techniques, the efficiency and cost of SP have been considerably improved. However, the weak initial signals of early-stage sepsis and its steeply rising mortality over time pose significant challenges in practical SP, as existing solutions struggle to accurately capture the subtle early-stage sepsis patterns. To bring the prediction timing forward under the constraint of extremely limited early-stage weak temporal signals, e.g., temperature, lactate and respiratory rate, this paper proposes a Multi-Endogenous-view Representation Enhancement (MERE) mechanism to construct sufficient endogenous views of the input data to mine latent information, and then enhances temporal-level representation learning through a Cascaded Dual-convolution Time-series Attention (CDTA) mechanism. This design fully explores the distribution patterns of the temporal descriptions of subjects, and achieves competitive SP accuracy using only 20% of the ICU duration required by existing SOTA counterparts. Extensive experimental results demonstrate the effectiveness of the proposed Multi-Endogenous-view Enhanced Time-series representation learning for Sepsis prediction (MEET-Sepsis) approach. The code of the paper is available at: https://anonymous.4open.science/r/132345pric

10:30-12:30 Session 11D: Machine Learning 1
Location: RHMZ03
10:30
AURA-Net: Adaptive Uncertainty-weighted Ranking and Attention-driven Network for Generalized Colon Polyp Segmentation

ABSTRACT. Polyp segmentation in colonoscopy plays a crucial role in early colorectal cancer diagnosis, requiring efficient and accurate models to ensure clinical deployment. In this paper, we propose AURA-Net (Adaptive Uncertainty-weighted Ranking and Attention-driven Network), an innovative Mixture-of-Encoders framework for robust polyp segmentation. Our method integrates multiple encoder networks, including ResNetV2-50, EfficientNetV2-S, DenseNet-121, and MobileNetV2, with a pixel-wise spatial soft-gating mechanism that adaptively assigns weights to predictions based on local uncertainty. Additionally, we introduce the Spatial-Enhanced Edge Attention (SEEA) module, which refines boundary features without adding significant computational overhead, and the Adaptive Contrastive-Correlation Loss (AC2L) to improve segmentation performance by balancing encoder diversity and expert-ensemble consistency. Extensive evaluations on the Kvasir-SEG dataset demonstrate the superiority of AURA-Net over state-of-the-art methods, achieving a mean IoU of 0.85, F2 score of 0.90, and a segmentation accuracy of 91.5%. We also analyze the model’s performance in an out-of-distribution (OOD) scenario, where AURA-Net exhibits a robust accuracy drop of only 3%, compared to a 15% degradation observed in traditional models. These results highlight the effectiveness of AURA-Net in not only achieving high accuracy but also maintaining reliability when faced with unseen data, making it a promising solution for real-time clinical applications.

10:50
An End-to-end Deep Learning Framework for Alzheimer's Disease Diagnosis by Using Multi-Site and Multi-Modal MRI Data
PRESENTER: Qichen Zhang

ABSTRACT. Alzheimer’s disease (AD) is an irreversible neurodegenerative disorder, and early diagnosis is crucial for prognosis and delaying progression. Multi-modal neuroimaging, particularly T1-weighted (T1w) magnetic resonance imaging (MRI) and diffusion MRI (dMRI), has been widely adopted for the early diagnosis of AD. T1w MRI captures cortical atrophy patterns, while dMRI reveals microstructural degeneration (e.g., myelin damage and axonal loss), with both modalities demonstrating strong diagnostic potential and high inter-modal correlation. While existing multi-modal deep learning models benefit from large-scale, multi-site datasets, they often neglect the site-specific biases that may compromise model performance. Meanwhile, image-level harmonization approaches are generally challenging to integrate directly into diagnostic pipelines and are limited in their ability to handle multi-modal data effectively. To address this issue, we propose an end-to-end deep learning framework for Alzheimer's disease diagnosis using multi-site and multi-modal MRI data, which includes a multi-site adversarial harmonization module (MAHM) and a mild cognitive impairment (MCI) enhancement module (MCIEM). Specifically, MAHM mitigates site-specific data shifts by aligning data from different sites to a pseudo target domain, while adversarially preserving the model’s ability to recognize site-specific features, thus enhancing its domain adaptation capabilities. Additionally, MCIEM includes a feature-level adaptive boundary loss and a classifier-level penalty term, which are used to increase the margin between MCI and Natural Control (NC). In summary, our framework mitigates site-related biases in multi-site MRI data, improving diagnostic accuracy. Evaluated on a multi-site dataset (N=860, 7 sites) with T1w MRI and dMRI, it achieved advanced performance in NC/MCI/AD classification, outperforming existing methods. Code is publicly available at https://anonymous.4open.science/r/UMMAD-1FEC.

11:10
AMNN: Uniting Attention Mechanism and Adversarial Contrastive Learning for Adaptive Multimodal Knowledge Graph Completion
PRESENTER: Jian Gou

ABSTRACT. Multi-Modal Knowledge Graph Completion (MMKGC) aims to predict missing triples in knowledge graphs by integrating multisource heterogeneous information—including structured entity relationships, visual semantics, and textual descriptions—and constructing discriminative models through deep collaborative reasoning. However, existing MMKGC methods commonly suffer from insufficient fine-grained cross-modal alignment when fusing multimodal information. To address this issue, this paper proposes an Adaptive Multi-modal Neural Network framework, named AMNN. Specifically, AMNN employs a Bidirectional Alternating Attention Mechanism to enhance the fine-grained association process from local to global levels across semantic units of different modalities, achieving precise mapping between image regions and textual phrases. Simultaneously, an adversarial neural network utilizes attention weights to generate modality-aware adversarial samples that augment contrastive learning, thereby improving the robustness of the MMKGC model. These two components engage in closed-loop collaborative training: the Bidirectional Alternating Attention Mechanism directs the adversarial contrastive learning module to generate modality-aware adversarial samples, while the adversarial contrastive learning module, through its contrastive loss, inversely optimizes the attention distribution of the Bidirectional Alternating Attention Mechanism. Experiments conducted on two benchmark datasets demonstrate that AMNN achieves significant improvements: MRR increases by 5.1% and 2.0%, and Hit@1 increases by 5.2% and 2.5% on the DB15K and MKG-W datasets, respectively. These results surpass the performance of 14 existing state-of-the-art models, validating the effectiveness of the proposed framework.

11:30
PBNAT: Overcoming the Accuracy-Robustness Trade-off via Parallel Batch Normalization

ABSTRACT. The efficiency and convergence of adversarial training are compromised by the pronounced distributional divergence between clean and adversarial samples, which has been largely attributed to Batch Normalization (BN). Although researchers have attempted to address this mismatch via BN-free or dual-BN frameworks, these approaches invariably sacrifice natural accuracy for adversarial robustness or vice versa. To overcome these limitations, we introduce Parallel Batch Normalization Adversarial Training (PBNAT), which augments the network with multiple BN branches and a trainable selector that models each input’s feature statistics as a weighted combination of these branches. During training, an alternative BN-scheduling scheme and a novel BN-pruning algorithm work in concert to reduce computational overhead and bolster generalization. During inference, the selector generates a sample-specific weighted combination over all normalization branches, enabling a more flexible and adaptive normalization strategy. This dynamic normalization mechanism enables the model to adapt seamlessly to both clean and adversarial distributions without manual tuning. Empirical results also demonstrate that PBNAT reconciles the accuracy–robustness trade-off, achieving superior natural accuracy and adversarial robustness compared to single-BN, BN-free, and dual-BN baselines.
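The parallel-BN idea can be sketched as several BatchNorm branches mixed by a small trainable selector. The selector below (global average pooling plus a linear layer with softmax weights) is an assumed design, and PBNAT's BN-scheduling and BN-pruning steps are not shown.

```python
import torch
import torch.nn as nn

class ParallelBN2d(nn.Module):
    """Sketch of a parallel-BN layer: K BatchNorm branches whose outputs
    are mixed per sample by a learned softmax-weighted selector."""
    def __init__(self, channels: int, n_branches: int = 3):
        super().__init__()
        self.branches = nn.ModuleList(nn.BatchNorm2d(channels)
                                      for _ in range(n_branches))
        self.selector = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, n_branches))

    def forward(self, x):                                   # x: (B, C, H, W)
        w = torch.softmax(self.selector(x), dim=-1)         # (B, K) branch weights
        outs = torch.stack([bn(x) for bn in self.branches], dim=1)  # (B, K, C, H, W)
        return (w[:, :, None, None, None] * outs).sum(dim=1)
```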

11:50
VRGNet: A Relative Geometric-Driven Network for Point Cloud Registration with Virtual Correspondences

ABSTRACT. With the continuous development of 3D laser scanning technology, point clouds can quickly and intuitively capture real-world information. Point cloud registration addresses the limitation that a single data source cannot fully reflect a scene. However, partial overlap and the difficulty of identifying matchable and repeatable corresponding points lead to a large number of outliers and low registration accuracy. To address these issues, we propose a relative geometric-driven registration network with virtual correspondences (VRGNet). First, robust features between scenes are extracted through self-attention and cross-attention mechanisms. The virtual point generation (VPG) module is used to optimize point cloud position information and improve the point matching probability in overlapping areas. Then, the original and generated point clouds are combined to learn rotation-invariant geometric features through relative geometric embedding (RGE). Finally, a coarse-to-fine point matching method is used to obtain reliable correspondences and high-precision transformation matrices. We conduct extensive testing on the large-scale outdoor KITTI dataset. The experimental results demonstrate that our method achieves higher efficiency and registration accuracy.

12:10
AFD-STA: Adaptive Filtering Denoising with Spatiotemporal Attention for Chaos Prediction
PRESENTER: Chunlin Gong

ABSTRACT. This paper presents AFD-STA Net, a neural framework integrating adaptive filtering and spatiotemporal dynamics learning for predicting high-dimensional chaotic systems governed by partial differential equations. The architecture combines: 1) An adaptive exponential smoothing module with position-aware decay coefficients for robust attractor reconstruction, 2) Parallel attention mechanisms capturing cross-temporal and spatial dependencies, 3) Dynamic gated fusion of multiscale features, and 4) Deep projection networks with dimension-scaling capabilities. Numerical experiments on nonlinear PDE systems demonstrate the model's effectiveness in maintaining prediction accuracy under both smooth and strongly chaotic regimes while exhibiting noise tolerance through adaptive filtering. Component ablation studies confirm critical contributions from each module, particularly highlighting the essential role of spatiotemporal attention in learning complex dynamical interactions. The framework shows promising potential for real-world applications requiring simultaneous handling of measurement uncertainties and high-dimensional nonlinear dynamics.
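The adaptive filtering component can be pictured as exponential smoothing with a learnable decay per spatial position; the sigmoid parameterisation below is an assumption, and AFD-STA's exact position-aware formulation may differ.

```python
import torch
import torch.nn as nn

class AdaptiveExpSmoothing(nn.Module):
    """Exponential smoothing s_t = a * x_t + (1 - a) * s_{t-1} with a
    learnable, per-position decay a in (0, 1)."""
    def __init__(self, n_positions: int):
        super().__init__()
        self.raw_alpha = nn.Parameter(torch.zeros(n_positions))   # sigmoid(0) = 0.5

    def forward(self, x):                      # x: (batch, time, n_positions)
        alpha = torch.sigmoid(self.raw_alpha)
        s = x[:, 0]
        out = [s]
        for t in range(1, x.size(1)):
            s = alpha * x[:, t] + (1 - alpha) * s
            out.append(s)
        return torch.stack(out, dim=1)         # smoothed sequence, same shape as x
```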

10:30-12:30 Session 11E: Real-World Applications 1
Location: RH103
10:30
RetroLEE: Bridging the Gap Between Synthons in Single-step Retrosynthesis

ABSTRACT. Retrosynthesis aims to identify reactants for synthesizing target molecules, serving as a fundamental tool in advancing organic synthesis and drug discovery. Some methods formulate retrosynthesis as predicting a sequence of molecular edits on the product molecule graph to generate intermediate synthons, which ultimately lead to the final reactants. However, these methods typically capture only limited associations among synthetic units, especially the relationships between molecular edits, which are vital for guiding both the content and positioning of subsequent edits. To address this limitation, we propose RetroLEE, a model that enhances single-step retrosynthesis by incorporating Last Edit Embedding (LEE). First, we introduce a Last Edit Embedding Graph, which integrates the information from the previous edit into each atom's representation to facilitate more accurate inferences. Second, we present a neighborhood aggregator that leverages neighboring edits to improve the localization of reaction centers. Experiments on the USPTO-50K dataset show that RetroLEE outperforms existing semi-template methods, achieving a new state-of-the-art benchmark with 91.2% top-10 accuracy. Case studies further demonstrate that RetroLEE can accurately predict plausible reactants, even for products with complex structures.

10:50
NewtonPIR: Communication Efficient Single-Server PIR

ABSTRACT. Private information retrieval (PIR) is widely used for privacy protection. Although some schemes achieve communication overhead that is independent of the database size N, they remain inefficient and thus impractical for real-world use. In this paper, we propose NewtonPIR, a communication-efficient single-server PIR scheme. NewtonPIR can directly generate query values for the entire index without splitting the index and sending multiple query ciphertexts. Specifically, NewtonPIR achieves communication overhead that is 7.5× better than the state-of-the-art PIR protocol and 35.9-75× better than the other protocols. In experiments, when the database size and entry size increase, the communication overhead of NewtonPIR remains stable. By utilizing the simple Newton interpolation polynomial and precomputing coefficients offline, we reduce the computation time from hours in the previous non-preprocessing scheme to seconds. Moreover, when new entries are added to the database, NewtonPIR performs incremental preprocessing of the interpolation coefficients without recomputing them from scratch.
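The Newton-interpolation machinery behind this incremental-update claim is easy to illustrate: in the Newton form, adding a database entry appends one term without disturbing the existing coefficients. The sketch works over the rationals for clarity; an actual PIR scheme would do the same arithmetic over its plaintext ring, and nothing here reproduces NewtonPIR's cryptographic layer.

```python
from fractions import Fraction

def newton_coeffs(xs, ys):
    """Divided-difference coefficients c_k of the Newton form
    p(x) = c_0 + c_1 (x - x_0) + c_2 (x - x_0)(x - x_1) + ..."""
    c = [Fraction(y) for y in ys]
    for j in range(1, len(xs)):
        for i in range(len(xs) - 1, j - 1, -1):
            c[i] = (c[i] - c[i - 1]) / (xs[i] - xs[i - j])
    return c

def newton_eval(c, xs, x):
    acc = c[-1]
    for k in range(len(c) - 2, -1, -1):        # Horner-style evaluation
        acc = acc * (x - xs[k]) + c[k]
    return acc

def add_point(c, xs, x_new, y_new):
    """Incremental update: old coefficients stay valid, only the newest
    one must be computed."""
    denom = Fraction(1)
    for xi in xs:
        denom *= (x_new - xi)
    c.append((Fraction(y_new) - newton_eval(c, xs, x_new)) / denom)
    xs.append(x_new)
    return c, xs

# interpolate a tiny "database": entry i stored at interpolation point x = i
xs, ys = [0, 1, 2, 3], [7, 11, 13, 42]
c = newton_coeffs(xs, ys)
assert all(newton_eval(c, xs, i) == v for i, v in zip(xs, ys))
c, xs = add_point(c, xs, 4, 99)                # append a new entry incrementally
assert newton_eval(c, xs, 4) == 99 and newton_eval(c, xs, 2) == 13
```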

11:10
Efficient Selection of Low-Fidelity Data for Multi-Fidelity Surrogate Models
PRESENTER: Mitra Heidari

ABSTRACT. Multi-fidelity surrogate models are commonly used in industries such as aerospace, marine engineering, and bioprocessing, where high-fidelity data is limited, as collecting such data is time-consuming or expensive. To supplement high-fidelity data, low-fidelity data, which is faster and cheaper to collect but less accurate, is often used. However, not all low-fidelity data necessarily improves model performance. In this research, we examine the sample spaces of high- and low-fidelity data with different distributions to identify those subsets of low-fidelity data that enhance model performance. Our hypothesis is that choosing low-fidelity data that is 'more similar' to high-fidelity data will result in better performance. The question is how to define similarity here. To explore this, we introduce two novel methods: a distance-based measure and a rank-based method, i.e. Kendall's tau, to quantify similarity and determine the best low-fidelity subsets for further investigation. A non-parametric permutation method for Hotelling's T² is then applied to assess the similarity between selected low-fidelity subsets and the high-fidelity data. We assess the effectiveness of these methods in identifying more informative low-fidelity subsets and their impact on the performance of multi-fidelity models, i.e. Autoregressive Gaussian Process and Deep Gaussian Process. We conduct a comprehensive evaluation of two selection strategies on both synthetic functions and a real-world bioprocessing case study. The experimental results further support our hypothesis. In summary, this research provides valuable insights into effectively identifying the subsets of low-fidelity data that could enhance the performance of multi-fidelity models.
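One plausible way to turn the two similarity notions into scores for a candidate low-fidelity subset is sketched below: a nearest-neighbour distance in input space and Kendall's tau between low- and high-fidelity responses at matched points. Both definitions are illustrative stand-ins, not the paper's exact measures.

```python
import numpy as np
from scipy.stats import kendalltau
from scipy.spatial.distance import cdist

def score_subset(X_lf, y_lf, X_hf, y_hf):
    """Two illustrative 'similarity to high-fidelity' scores for a candidate
    low-fidelity subset: (1) mean distance from each LF point to its nearest
    HF point, (2) Kendall's tau between LF responses and the HF responses at
    those nearest neighbours."""
    d = cdist(X_lf, X_hf)                      # (n_lf, n_hf) pairwise distances
    nearest = d.argmin(axis=1)
    dist_score = d.min(axis=1).mean()          # smaller = more similar inputs
    tau, _ = kendalltau(y_lf, y_hf[nearest])   # larger = more similar rankings
    return dist_score, tau

rng = np.random.default_rng(0)
X_hf = rng.uniform(size=(20, 2)); y_hf = X_hf.sum(axis=1)
X_lf = rng.uniform(size=(60, 2)); y_lf = X_lf.sum(axis=1) + 0.1 * rng.normal(size=60)
print(score_subset(X_lf, y_lf, X_hf, y_hf))    # (distance score, Kendall's tau)
```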

11:30
Weighted Epistemic Logic: Skill Assessment and Rough Set Applications

ABSTRACT. This paper presents a weighted epistemic logic tailored for skill assessment, incorporating fuzzy skill sets to represent agent capabilities. By employing implicit weights for belief operators and explicit proficiency formulas, the logic enables reasoning about the interplay between belief and capability. An extension with quantified updates supports analysis of skill set ranges under specified conditions, addressing attribute selection in rough set contexts. The logic establishes direct correspondences with classical Pawlak rough sets and close alignments with fuzzy rough sets. The computational complexity of model checking is analyzed, with the basic logic in P and the extended logic co-NP-complete.

11:50
GAN-Prototype: Generative Adversarial Networks for Rapid UX Prototype Generation

ABSTRACT. Generative Adversarial Networks (GANs) have shown remarkable potential in various domains, and their application in user experience (UX) design offers a transformative approach. In this paper, we present GAN-Prototype, a novel framework designed to automate UX prototype generation. By employing an adversarial model, GAN-Prototype consists of a generator that creates realistic prototypes grounded on specified design parameters, while a discriminator assesses these designs for authenticity and usability. This dual mechanism allows for rapid development of high-fidelity prototypes, significantly reducing the time typically needed in conventional design practices. The framework's learning process utilizes a rich dataset of existing UX designs, capturing the complex interplay between design features and user preferences. Additionally, incorporating user feedback mechanisms facilitates iterative improvements to prototypes based on actual user interactions. Experimental results confirm that GAN-Prototype accelerates the design process without compromising user satisfaction, evidencing its role in enhancing efficiency and innovation within UX design. Practical case studies further illustrate the framework's applicability across diverse design scenarios, affirming its potential impact on designers' productivity and creativity.

10:30-12:30 Session 11F: Online Session
Location: RH104
10:30
LMSQuant: Learnable Multiscale Post-Training Quantization for LLMs

ABSTRACT. Large Language Models (LLMs) exhibit exceptional capabilities in diverse and challenging tasks but pose significant challenges for deployment on resource-constrained devices due to their vast computational and memory requirements. Post-training quantization (PTQ) techniques alleviate this issue by compressing weights and activations to lower precision. However, existing approaches, especially those based on smooth-based techniques, struggle to effectively diminish the impact of outliers in activations and weights, resulting in substantial performance degradation under challenging quantization settings. To address these challenges, we propose LMSQuant, a novel learnable multiscale PTQ framework comprising three core components. For activations, Learnable Multiscale Activation Scaling (LMAS) adaptively combines token-wise and channel-wise statistics to compute the quantization step size, effectively mitigating activation outliers. For weights, Learnable Multiscale Weight Scaling (LMWS) introduces bi-directional scaling, smoothing the distribution and suppressing outliers. Additionally, we incorporate Lookahead Composite Alignment (LCA) to optimize the parameters of both LMAS and LMWS through a multi-block strategy guided by a composite loss. Experimental results demonstrate that LMSQuant consistently outperforms existing leading PTQ methods in both language modeling and zero-shot tasks.

10:50
LID-Drug: A Localized Interactive Domain-Aggregated (LID) Framework for Protein Drug Editing

ABSTRACT. Large language models (LLMs) are revolutionizing drug discovery by providing data-driven insights into biomolecular design. However, their application to protein-based drug editing systems is hindered by the complexity of protein sequences, high computational demands, and data privacy concerns associated with remote APIs. To address these issues, we present LID-Drug, a Localized Interactive Domain-aggregated framework for protein drug editing. We first propose the Aggregator-Driven Domain Reasoning (ADDR) module which converts raw amino acid sequences into domain-aggregated input to enhance the LLMs' understanding of complex protein structures. Secondly, we design an interactive mechanism driven by two key modules, one is Domain-Aware Prompt Construction (DAPC), and the other is Retrieval and Domain Feedback (ReDF). This interactive feedback loop incrementally refines each generation step by incorporating domain-specific retrieval and structured expert feedback. To preserve data privacy, LID-Drug fine-tunes LLMs locally using domain-specific datasets. To address the time demands of fine-tuning large models, we introduce a dynamic low-rank projection optimizer to accelerate fine-tuning convergence. Empirical results demonstrate that LID-Drug achieves state-of-the-art hit ratio on public protein datasets, outperforming baseline methods by 22.5% to 42.62% while also reducing fine-tuning steps by 25%.

11:10
Bridging Confidence and Competence: Evaluating Self-Assessment Alignment in LLM Mathematical Reasoning

ABSTRACT. Large Language Models (LLMs) have achieved near-human performance on a wide array of benchmarks, yet deploying them in high-stakes settings still requires that their internal confidence track actual competence. In this study, we evaluate the alignment between LLMs’ self-assessment and their actual competence in solving mathematical problems. We probe this gap on three mathematical datasets (MATH, Math500, GSM8K) by eliciting each model's confidence and comparing it with its solution accuracy. Experiments on 8 open-source models from the Qwen and LLaMA families show three key findings. (1) All models display a certain degree of misalignment; (2) scale and domain-specific fine-tuning matter: 7B-parameter and math-tuned Qwen variants narrow the confidence–performance gap, whereas similarly sized but untuned LLaMA models remain poorly calibrated; (3) misalignment is also sensitive to prompt design. These results expose a persistent weakness in current LLM self-assessment. We release our evaluation pipeline and metrics to spur research on prompt design and fine-tuning strategies that make LLMs more trustworthy for high-stakes reasoning applications.

11:30
MADCAP: A Multi-Agent Deliberative Framework for Robust Assessment of Open-Ended Questions

ABSTRACT. The assessment of Open-Ended Questions (OEQs), while crucial for evaluating higher-order thinking, is persistently hampered by subjectivity and scalability challenges. Although Large Language Models (LLMs) show considerable potential for automating this task, their application is often limited by inconsistent outputs, a singular evaluative perspective, and an inability to adapt to nuanced criteria. This paper introduces MADCAP (Multi-Agent Deliberation with Cluster-Aware Pairwise-compared criteria), a novel framework designed to overcome these limitations. MADCAP operationalizes the principles of collective human intelligence by structuring the evaluation process into three synergistic stages: (1) it employs unsupervised clustering to establish context-aware evaluation environments; (2) it dynamically induces and weights cluster-specific criteria via an LLM-driven, AHP-inspired method; and (3) it deploys a panel of specialized agents to score answers through a multi-round deliberative protocol that promotes consensus. Experiments conducted on three OEQ datasets showed that MADCAP significantly enhanced the adaptability and contextual relevance of the assessments, with an average improvement of 7% across multiple metrics compared to a strong baseline. The core contribution of this work lies in shifting the evaluation paradigm from single-model direct assessment to a structured, multi-agent deliberative system, thereby advancing the objectivity, reliability, and sophistication of automated OEQ assessment.
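For context, the classical AHP recipe that the criteria-weighting stage draws on reduces to extracting the principal eigenvector of a reciprocal pairwise-comparison matrix; in MADCAP the matrix entries would presumably come from LLM judgements, which the sketch simply hard-codes.

```python
import numpy as np

def ahp_weights(pairwise):
    """Criterion weights from a reciprocal pairwise-comparison matrix via
    the principal eigenvector, the standard AHP recipe."""
    A = np.asarray(pairwise, dtype=float)
    vals, vecs = np.linalg.eig(A)
    w = np.abs(vecs[:, vals.real.argmax()].real)    # principal eigenvector
    return w / w.sum()

# e.g. three rubric criteria, where criterion 1 was judged 3x as important
# as criterion 2 and 5x as important as criterion 3
A = [[1, 3, 5],
     [1/3, 1, 2],
     [1/5, 1/2, 1]]
print(ahp_weights(A))   # roughly [0.65, 0.23, 0.12]
```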

11:50
ADSC: LLM-Augmented Dual-Stream Cooperative Learning for Robust Automated Essay Scoring
PRESENTER: Kexuan Zhang

ABSTRACT. The rapid expansion of educational systems and the widespread adoption of online learning platforms have highlighted the limitations of traditional manual essay scoring, such as inefficiency, high cost, and subjective bias. Automated Essay Scoring (AES) seeks to address these challenges, particularly in open-ended questions, where the absence of well-defined reference answers and flexible evaluation criteria makes accurate assessment inherently difficult. Existing approaches, including traditional feature engineering and large language model (LLM)-based methods, often struggle with semantic understanding, scoring consistency, and alignment with human evaluation rubrics. To tackle these issues, we propose ADSC, a novel LLM-augmented dual-stream framework that enhances small models' feature representation through few-shot-guided LLM reasoning and integrates both semantic and comment-based features via interactive attention mechanisms. This design improves interpretability while enhancing scoring robustness. Experiments on the ASAP-AES dataset show that ADSC achieves state-of-the-art performance, attaining the highest average Quadratic Weighted Kappa (QWK) score and delivering substantial gains on narrative essays, while maintaining superior computational efficiency.

12:10
PsyChild: A Child-Centric Psychological Companionship LLM with Fine-Grained Multiturn Dialogue Evaluation Benchmark
PRESENTER: Tian Wei

ABSTRACT. Early intervention in children's mental health is critical, and large language models (LLMs) offer effective psychological companionship support. However, existing psychological LLMs still face significant limitations in addressing children's unique needs, particularly in language cognition and companionship strategies. To overcome these issues, we introduce PsyChild, the first framework specifically designed as an AI expert for psychological companionship in children, featuring the dataset PsyCINS, the LLM PsyCGPT, and the benchmark PsyCBEN. To overcome the constraints of data privacy and crowdsourced annotations, we innovatively construct PsyCINS, a 25.4k multi-turn dataset of child psychology dialogues, using two LLM-guided methods: (1) multi-channel children's daily conversations grounded in psychological theory, and (2) single-turn dialogues from a professional psychological counseling platform. Unlike most methods that directly generate dialogues based on LLMs, our PsyCINS achieves superior public acceptance and language style in comprehensive evaluations. Leveraging PsyCINS, we fine-tune the Qwen2.5-7B-Instruct base LLM to develop the psychological Q&A LLM PsyCGPT. Moreover, we introduce PsyCBEN, the first benchmark with 583 dialogues powered by the advanced, human-like GPT-4.1 as the judge, to assess the fine-grained abilities of LLMs in children's psychological dialogues. Evaluations across 19 mainstream LLMs reveal significant differences between general-purpose and psychological LLMs, with our PsyCGPT excelling. The PsyChild framework demonstrates exceptional effectiveness and superiority in the field of children's mental health. More details are available at: https://anonymous.4open.science/r/PsyChild.

12:30-13:30 Lunch Break
13:30-15:30 Session 12A: Computer Vision 2
Location: RHLT1
13:30
YOLOv8-UT: A Unified Training Approach for Cross-Environment Tree Crown Instance Segmentation in Aerial Imagery

ABSTRACT. Instance segmentation of individual tree crowns in aerial imagery is a critical task for forest management, carbon storage estimation, and biodiversity modeling. However, achieving effective segmentation faces significant challenges including dense canopy overlapping, diverse crown characteristics, and varying environmental conditions across different geographical regions. This paper proposes YOLOv8-UT, a unified training approach for cross-environment tree crown instance segmentation that enhances model generalization across rural and urban environments. YOLOv8-UT employs a two-stage training strategy that leverages unified pre-training on combined datasets followed by environment-specific fine-tuning to learn robust cross-environment features. Moreover, YOLOv8-UT incorporates the Large Kernel Attention mechanism to enhance feature representation for complex tree crown identification. Comprehensive experiments on aerial imagery from the Greater Wellington region demonstrate that YOLOv8-UT outperforms other recent peer competitors, achieving Box AP of 39.7 and Mask AP of 34.4 on the rural dataset, and Box AP of 48.2 and Mask AP of 40.5 on the urban dataset.
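A common formulation of the Large Kernel Attention mechanism mentioned above is the VAN-style decomposition into a depthwise convolution, a dilated depthwise convolution, and a pointwise convolution used as a multiplicative attention map; whether YOLOv8-UT uses exactly this decomposition and these kernel sizes is an assumption.

```python
import torch
import torch.nn as nn

class LargeKernelAttention(nn.Module):
    """VAN-style Large Kernel Attention: approximate a large receptive
    field with 5x5 depthwise + 7x7 dilated depthwise + 1x1 convolutions,
    then use the result to modulate the input features."""
    def __init__(self, dim: int):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        self.dw_dilated = nn.Conv2d(dim, dim, 7, padding=9, dilation=3, groups=dim)
        self.pw = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        attn = self.pw(self.dw_dilated(self.dw(x)))
        return attn * x            # attention map gates the input features
```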

13:50
SFormer: SNR-guided Transformer for Underwater Image Enhancement from the Frequency Domain

ABSTRACT. Recent learning-based underwater image enhancement (UIE) methods have significantly advanced performance by incorporating physical priors into deep neural networks. Among these, the signal-to-noise ratio (SNR) prior has demonstrated effectiveness in mitigating wavelength-dependent attenuation. However, existing methods mainly apply SNR priors in the spatial domain, presenting two critical limitations: (i) spatial SNR maps are incapable of effectively disentangling cross-channel interference, and (ii) they offer limited guidance in selectively amplifying informative structures while suppressing periodic noise. To address these limitations, we propose migrating the SNR prior into the frequency domain by decomposing each feature into its amplitude and phase spectra. This approach enables fine-grained, channel-aware modulation. We introduce a novel Fourier Attention SNR-prior Transformer (FAST), which synergistically combines amplitude-phase interactions with SNR cues to selectively emphasize discriminative spectral components. We further propose a Frequency Adaptive Transformer (FAT) bottleneck to dynamically integrate low- and high-frequency branches through a gated channel-spatial attention mechanism, selectively retaining information that significantly enhances perceptual quality. These two modules are embedded within a unified U-shaped architecture, integrating a conventional RGB processing stream with a dedicated SNR-guided branch, collectively named SFormer. SFormer, trained on 4,800 paired images from UIEB, EUVP, and LSUI, outperforms recent methods with a 3.1 dB gain in PSNR and 0.08 in SSIM. Qualitative analyses confirm its ability to restore natural colors, textures, and contrast in underwater scenes.
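The amplitude-phase decomposition that SFormer's frequency-domain modules operate on can be illustrated with a toy modulation step; the way the SNR map is folded in here is purely illustrative and does not reproduce the learned FAST or FAT modules.

```python
import torch

def amplitude_phase_modulate(feat, snr_map, gain=0.5):
    """Toy frequency-domain step: split a feature map into amplitude and
    phase spectra, boost the amplitude where an SNR map suggests reliable
    signal, and transform back."""
    spec = torch.fft.rfft2(feat, norm="ortho")           # (B, C, H, W//2+1)
    amp, phase = spec.abs(), spec.angle()
    snr_gain = 1.0 + gain * torch.fft.rfft2(snr_map, norm="ortho").abs()
    amp = amp * snr_gain                                  # frequency-wise boost shared across channels
    spec = torch.polar(amp, phase)                        # recombine amplitude and phase
    return torch.fft.irfft2(spec, s=feat.shape[-2:], norm="ortho")

feat = torch.randn(1, 16, 64, 64)
snr = torch.rand(1, 1, 64, 64)
print(amplitude_phase_modulate(feat, snr).shape)          # torch.Size([1, 16, 64, 64])
```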

14:10
HCTLR: A Hybrid CNN-Conformer Framework for Offline Handwritten Chinese Text Line Recognition
PRESENTER: Chaozong Chen

ABSTRACT. Handwritten Chinese text line recognition and handwriting verification play a crucial role in various fields such as office automation, classification of anonymous letters, and identity authentication. However, existing algorithms face significant challenges in extracting features from handwritten Chinese characters due to the complexity of their structure, image distortion and blurriness, as well as the limited availability of data samples. Furthermore, most existing studies focus on recognizing individual characters and verifying handwritten signatures, whereas in practical applications Chinese handwriting typically appears in text-line format. To address these challenges, we construct the HW-CN dataset and propose an adaptive image interpolation algorithm (Otsu-Better) to tackle problems such as broken strokes and blurred characters in low-resolution images. Additionally, we introduce a recognition model specifically designed for handwritten Chinese text lines, which we refer to as Handwritten Chinese Text Line Recognition (HCTLR), to better meet the demands of real-world scenarios and reduce the impact caused by text segmentation. Experimental results demonstrate that the proposed HCTLR model achieves a total recognition accuracy of 78.7% on the HW-CN dataset, representing an improvement of 6.3% compared to the CRNN model.

14:30
Local Context-Aware Buoyancy Prediction for Mussel Farm Floats

ABSTRACT. Aquaculture is a key contributor to Aotearoa New Zealand's economy, with greenshell mussels representing a major export. As farms move further offshore into more demanding conditions, maintaining float line buoyancy becomes increasingly challenging. Buoyancy failures can lead to mussel loss, highlighting the need for automated, scalable monitoring solutions. Our deep learning-based buoyancy estimation approach offers a cost-effective, automated solution for aquaculture maintenance. We analyze the benchmark deep learning model's learned representations for buoyancy estimation by visualising intermediate feature activations; the saliency map analyses reveal that neighbouring floats serve as contextual cues. This motivates the proposed context-aware strategy, which considers both the scale and the location of the context region included relative to the target float in image-based assessment. By expanding each float's background context to include neighbouring floats, we achieve a 7-8% increase in balanced accuracy across ResNet-18, DenseNet-169, and Vision Transformer models. An ablation study confirms that leveraging the right amount of surrounding context resolves the trade-off between useful signal and background noise. These findings demonstrate that our context-aware method generalises well across both CNN and Transformer-based benchmarks and yields substantial performance gains in mussel float buoyancy monitoring.

14:50
SED-SLAM: Enhancing Monocular SLAM under Image Distortions via Spatially Equalized Deep Feature

ABSTRACT. Achieving robust and accurate monocular Simultaneous Localization and Mapping (SLAM) in the presence of image distortions remains a significant challenge. This paper introduces SED-SLAM, a novel monocular SLAM system that enhances distortion resilience through spatially equalized deep feature extraction. At its core, SED-SLAM integrates a lightweight and efficient Mobile-Superpoint network—built upon Inverted Residual blocks—for extracting rich, discriminative keypoints and descriptors. To address the issue of uneven keypoint distribution, we propose a Spatially Adaptive Thresholding (SAT) module, which adaptively regulates keypoint selection to ensure uniform spatial coverage across the image. The extracted features are incorporated into a streamlined SLAM pipeline that includes offline visual vocabulary training and online modules for tracking, mapping, loop closure, and relocalization. Extensive experiments on public benchmarks demonstrate that SED-SLAM achieves superior trajectory accuracy and robustness under various image distortions, while maintaining real-time computational efficiency comparable to existing methods. These results validate that jointly optimizing feature quality and spatial balance significantly enhances monocular SLAM performance in challenging environments.

15:10
Learning 3D Proposals in Spatio-Temporal Transformer for Multi-Camera Driving Scene Object Detection

ABSTRACT. 3D object detection is a critical technology in autonomous driving systems, particularly with the increasing preference for visual cameras due to their cost-effectiveness. Despite advancements in deep learning, exclusive reliance on visual data for 3D detection poses significant challenges, primarily due to: 1) The difficulty in deriving accurate depth information from single-frame images, which is essential for precise localization of objects in three-dimensional space; 2) The complexity of consistently projecting 3D bounding boxes onto the correct images within a multi-camera system, which is critical for coherent spatial mapping across different views. To address these challenges, we have developed a sophisticated spatio-temporal transformer network. This network effectively tracks the movement of objects across frames using homography pose transformation matrices, thereby utilizing historical data to improve current frame predictions. We have also integrated an advanced 3D Region of Interest (ROI) pooling technique that refines the generation of 3D proposals. This approach significantly enhances the precision of our object detection system. Our experimental results robustly demonstrate the superior performance of our methodology when compared to existing approaches, confirming its efficacy in real-world autonomous driving scenarios.

13:30-15:30 Session 12B: Large Language Model 2
Location: RHLT2
13:30
LLM-Distilled Surrogate Model for Expensive Multi-Objective Optimization
PRESENTER: Bingting Du

ABSTRACT. In expensive multi-objective optimization problems (EMOPs), surrogate-assisted evolutionary algorithms (SAEAs) are among the most widely used solutions. However, surrogate models often suffer from degraded performance due to limited training data, a prevalent and critical challenge in this domain. To address this issue, we propose a novel framework named DuSiM that leverages the capabilities of Large Language Models (LLMs) to assist surrogate model training. Specifically, DuSiM uses LLMs to generate additional high-quality training data, which enhances the surrogate model's approximation accuracy despite the scarcity of evaluated training data. To this end, DuSiM first uses the surrogate model to guide the prompt-feedback tuning of the LLM. Once the LLM adapts to predicting evaluation function values and uncertainties, it subsequently generates a substantial amount of high-quality synthetic data to assist in training the surrogate model. To evaluate the effectiveness of DuSiM, we compare it with five state-of-the-art algorithms on various problems. Experimental results demonstrate that our framework can accelerate the convergence of SAEAs and outperforms other algorithms in most cases.

13:50
Knowledge Graph Multi-Hop Reasoning Framework Based on LLM and Relation Path Matching

ABSTRACT. Knowledge graphs are widely used in applications such as question answering and recommendation systems, where multi-hop reasoning is a key task. Existing methods often overlook relational semantics, which limits their performance in complex reasoning tasks. While LLMs offer strong semantic capabilities, current approaches fail to simplify reasoning paths and make limited use of semantic cues. To address these limitations, we propose KGMRF-LRPM (Knowledge Graph Multi-hop Reasoning Framework based on LLM and Relation Path Matching), which divides the reasoning process into three stages: relation semantic enhancement, reasoning path planning, and relation path matching. The framework leverages prompt-based LLM collaboration and relation path alignment to improve reasoning efficiency and accuracy. Experimental results on three benchmark knowledge graph datasets demonstrate that KGMRF-LRPM achieves better performance than existing methods, including GQE, Q2B, PERM, TOG, and ROG, in terms of accuracy and generalization.

14:10
A Little Less Conversation, a Little More Action, Please: Investigating the Physical Common-Sense of LLMs in a 3D Embodied Environment

ABSTRACT. Large Language Models (LLMs) are increasingly used to reason about everyday physical environments and control the actions of agentic systems. The vast majority of research into how capable LLMs are at reasoning in physical environments has used static text or image-based benchmarks that do not capture the complexity and nuance of real-life physical processes. To address this issue, we present LLM-AAI, a framework allowing direct comparison between LLMs and other embodied agents, and use it to perform the first embodied and cognitively meaningful evaluation of physical common-sense reasoning in LLMs. Our framework employs the Animal-AI environment, a simulated 3D virtual laboratory, and we compare LLMs to the entrants of the 2019 Animal-AI Olympics competition and to human children. Our results show that LLMs are currently outperformed by human children on tasks from the competition. We argue that this approach allows the study of physical reasoning using ecologically valid experiments drawn directly from cognitive science, improving the predictability and reliability of LLMs.

14:30
Best of Both Worlds? A Glance at Efficient Reasoning for LLM-based Machine Translation

ABSTRACT. Large language models (LLMs) have demonstrated significant performance improvements in numerous tasks, including machine translation (MT). Large reasoning models (LRMs) have further improved upon existing LLMs with a long reasoning process known as chain-of-thought (CoT). LRMs excel in reasoning tasks with fixed answers, such as mathematics or coding challenges. Despite this success, LRMs often incur additional response latency and sometimes meaningless computational overhead, a behaviour known as the "overthinking phenomenon". This also impedes applying LRMs to practical machine translation, where response-time requirements are more stringent. Therefore, effectively reducing CoT length in LRMs while preserving or improving translation quality has become a critical research problem for MT. In this paper, we present MT-CoT-Compressor, a pipeline for reducing CoT length in machine translation tasks. It comprises three stages: CoT summarization, format-quality mixed reward modeling, and CoT calibration. Experimental results show that our method effectively reduces CoT length by 49% to 91% without sacrificing translation quality.

14:50
TBERT: Bridging Text Generation and Score Regression through Hierarchical Feature Fusion based LLM for Automated Essay Scoring
PRESENTER: Hongxing Zhang

ABSTRACT. Most existing research on automatic essay scoring has concentrated on learning feature representations to enhance the accuracy of score predictions. However, both early studies focused on hand-crafted feature engineering and more recent approaches utilizing neural networks to model various structural features have treated feature extraction and score prediction as distinct tasks. This separation has led to limited interpretability and neglected the meaningful connection between feedback and scores. To address this challenge, we propose a novel framework, TBERT, which innovatively combines prompt-based feature extraction, multi-head attention fusion, and curriculum-style loss weighting to effectively integrate regression scoring with feedback generation. TBERT not only improves scoring accuracy but also provides interpretable, rubric-aligned, multi-level aware feedback, thereby advancing the overall evaluation process with enhanced interpretability and practical value. Extensive experiments conducted on public datasets demonstrate the effectiveness and interpretability of our proposed method, particularly highlighting its exceptional adaptability to narrative essays and long-form writing.

15:10
HiGraph-LLM: Hierarchical Graph Encoding and Integration with Large Language Models

ABSTRACT. Graph Neural Networks (GNNs) have achieved remarkable performance on graph-centric tasks such as node classification and link prediction. Meanwhile, Large Language Models (LLMs) have shown impressive performance in language understanding across diverse domains. GNNs effectively capture structural information but struggle with rich semantic modeling, while LLMs offer strong contextual reasoning yet fail to encode graph topology. This dual challenge necessitates addressing both the inherent limitations in node representation learning and the complexities involved in aligning graph-structured data with the token space of LLMs. To address these challenges, we introduce HiGraph-LLM, a novel framework designed for hierarchical graph encoding and integration with large language models. HiGraph-LLM refines node representations by integrating multi-level structural features and aligns them with LLMs through curriculum-driven prompt learning. Specifically, HiGraph-LLM consists of two modules: the Hierarchical Node Information Learning Module, which effectively consolidates information from hierarchical node levels to improve node representations, and the LLM’s Graph Information Integration Module, which optimizes the alignment of graph data with the LLM. Comprehensive experiments on multiple benchmark datasets demonstrate the effectiveness of our proposed method. The code will be released upon acceptance of the paper.

13:30-15:30 Session 12C: Planning, Scheduling, and Optimization
Location: RHMZ02
13:30
Boost Cross-distribution Generalization by Expert Multi-head Attention for Capacitated Arc Routing Problem

ABSTRACT. The Capacitated Arc Routing Problem (CARP) is a classic combinatorial optimization problem in transportation and logistics. Neural solvers based on the attention mechanism for routing problems have gained attention because they require little expert knowledge. However, existing neural solvers are typically trained on tasks with fixed distributions, leading to poor performance when solving problems with different distributions. To address this issue, we propose an Expert Multi-Head Attention (EMHA) neural solver for cross-distribution generalization in CARP. The EMHA module combines a top-K router with multiple independent parameter query vectors. This allows the model to dynamically adjust attention mechanisms based on input distribution features, generating more suitable feature representations. In addition, our model employs an encoder-decoder architecture trained via the REINFORCE algorithm augmented with an auxiliary loss to encourage balanced utilization of the different query vectors. Extensive experimental results demonstrate that the proposed method improves both cross-distribution performance and solution quality.
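A toy PyTorch sketch of the routing idea follows: a top-K router selects among several independent query projections and mixes them before attention. The dimensions, the router design, and the use of nn.MultiheadAttention are assumptions for illustration, not the paper's EMHA module.

    # Illustrative top-K expert routing over query projections (not the authors' EMHA).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ExpertQueryAttention(nn.Module):
        def __init__(self, dim: int, num_heads: int = 4, num_experts: int = 4, top_k: int = 2):
            super().__init__()
            self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
            self.router = nn.Linear(dim, num_experts)
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.top_k = top_k

        def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: (B, N, dim)
            scores = self.router(x)                                # (B, N, E) routing logits
            topv, topi = scores.topk(self.top_k, dim=-1)
            weights = F.softmax(topv, dim=-1)                      # renormalise kept experts
            all_q = torch.stack([e(x) for e in self.experts], dim=-2)          # (B, N, E, dim)
            picked = torch.gather(all_q, -2, topi.unsqueeze(-1).expand(*topi.shape, x.size(-1)))
            q = (weights.unsqueeze(-1) * picked).sum(dim=-2)       # mixture-of-experts query
            out, _ = self.attn(q, x, x)
            return out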

13:50
Scalable Knee-Point Guided Activity Group Selection in Multi-Tree Genetic Programming for Dynamic Multi-Mode Project Scheduling
PRESENTER: Yuan Tian

ABSTRACT. The dynamic multi-mode resource-constrained project scheduling problem is a challenging scheduling problem that requires making decisions on both the execution order of activities and their corresponding execution modes. Genetic programming has been widely applied as a hyper-heuristic to evolve priority rules that guide the selection of activity-mode pairs from the current eligible set. Recently, an activity group selection strategy has been proposed to select a subset of activities rather than a single activity at each decision point, allowing for more effective scheduling by considering the interdependence between activities. Although effective in small-scale instances, this strategy suffers from scalability issues when applied to larger problems. In this work, we enhance the scalability of the group selection strategy by introducing a knee-point-based selection mechanism to identify a promising subset of activities before evaluating their combinations. An activity ordering rule is first used to rank all eligible activity-mode pairs, followed by a knee point selection to find the promising pairs. Then, a group selection rule selects the best activity combination. We develop a multi-tree GP framework to evolve both types of rules simultaneously. Experimental results demonstrate that our approach scales well to large instances and outperforms GP with sequential decision-making in most scenarios.
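For intuition on the knee-point cut-off, the sketch below applies the common maximum-distance-to-chord heuristic to a ranked list of priority scores; this is a generic formulation under assumed inputs, not necessarily the authors' exact selection rule.

    # Illustrative knee-point cut-off on descending priority scores (generic heuristic).
    import numpy as np

    def knee_point_index(sorted_scores) -> int:
        # sorted_scores: priority values sorted in descending order
        y = np.asarray(sorted_scores, dtype=float)
        n = len(y)
        if n < 3:
            return n - 1
        x = np.arange(n, dtype=float)
        # Line through the first and last points; knee = farthest point from it.
        p1, p2 = np.array([x[0], y[0]]), np.array([x[-1], y[-1]])
        d = (p2 - p1) / (np.linalg.norm(p2 - p1) + 1e-12)
        vecs = np.stack([x, y], axis=1) - p1
        dist = np.abs(vecs[:, 0] * d[1] - vecs[:, 1] * d[0])   # perpendicular distance to the chord
        return int(np.argmax(dist))

    # Usage: keep only activity-mode pairs ranked at or before the knee.
    # promising = ranked_pairs[: knee_point_index(scores) + 1]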

14:10
ERL-BSA: Evolution Strategies-Enhanced Reinforcement Learning for Context-Aware and Workload Balanced Dynamic Bus Scheduling

ABSTRACT. Dynamic bus scheduling plays a critical role in reducing operating costs and maintaining service quality amid real-world uncertainties such as traffic congestion and vehicle breakdowns. However, existing approaches often overlook two key aspects: (1) the lack of in-service vehicle information, limiting situational awareness and leading to suboptimal scheduling decisions; and (2) the absence of performance metrics that explicitly account for workload imbalance, resulting in uneven vehicle utilization and long-term inefficiencies. To address these limitations, we propose ERL-BSA, a reinforcement learning method enhanced by OpenAI Evolution Strategies. ERL-BSA features a dual-branch multi-head attention network that integrates both in-service and depot vehicle data, and a new reward function that jointly optimizes fleet size, deadhead trips, and workload balance. Experiments on three real-world datasets demonstrate that ERL-BSA significantly improves workload distribution while reducing fleet size, highlighting its practical value for smart transit systems.

14:30
A Quadratic Programming Framework Unifying Different Types of Visual Servoing With Obstacle Avoidance for Joint-Constrained Robots

ABSTRACT. Visual servoing (VS) is a control technique that employs visual features captured by a camera to guide robots toward desired targets. According to the retrieved visual features, VS is commonly divided into position-based VS (PBVS), image-based VS (IBVS) and homography-based VS (HBVS). Apart from the specified VS task, obstacle avoidance (OA) and joint-limit avoidance (JLA) are crucial for ensuring safety and reliability of the robot. This paper focuses on developing a quadratic programming (QP) framework that unifies the aforementioned different types of visual servoing with OA and JLA capabilities for joint-constrained redundant robots. Then, a gradient-dynamics based neurodynamic network (GDNN) is designed to serve as a QP solver. Simulations and experiments conducted using two Franka Emika Panda robots demonstrate the validity and practicality of the established QP framework for achieving VS tasks with OA and JLA considered.

14:50
Simplicity Wins: Benchmarking Evolutionary Multi-objective Optimization Algorithms for Subset Selection Problems
PRESENTER: Ke Shang

ABSTRACT. Subset selection is a fundamental problem in various domains, where the goal is to choose an optimal subset of elements from a large candidate set under given objectives and constraints. Pareto Optimization for Subset Selection (POSS) and its variants, such as PORSS and TPOSS, have recently shown promising results by reformulating the problem as a two-objective optimization and solving it using simple evolutionary multi-objective optimization (EMO) algorithms. However, it remains unclear how these algorithms compare to other well-established EMO algorithms, including both classical and specialized approaches. In this paper, we conduct a comprehensive benchmarking study involving nine EMO algorithms across two representative subset selection tasks: unsupervised feature selection and hypervolume subset selection. Our results show that TPOSS, a simple yet effective algorithm, consistently outperforms more complex classical and specialized EMO algorithms. These findings reveal that simpler EMO strategies can be surprisingly competitive and even superior in subset selection scenarios, offering valuable insights for designing future EMO algorithms tailored to such tasks.

13:30-15:30 Session 12D: Machine Learning 2
Location: RHMZ03
13:30
Improving Intent Detection with Hierarchical Multimodal Representation and Triplet Contrastive Learning

ABSTRACT. The aim of multimodal intent detection is to understand user intent through multimodal data. This task currently faces three main challenges. First, modality-specific information is easily lost during fusion. Second, insufficient cross-modal interaction leads to limited synergy. Third, the representation space lacks discriminative power, making it difficult to achieve intraclass compactness and interclass separability. To address these issues, we propose a novel framework called HMTC, which combines hierarchical modality representation learning and triplet-based contrastive representation learning. Hierarchical modality representation learning captures unique intramodal features, shared cross-modal information, and local interactions between modality pairs through dual modeling (MSE and GSE) and the pairwise synergistic encoder (PSE). It ensures information integrity and cross-modal consistency using reconstruction constraints and a triplet alignment loss. Triplet-based contrastive representation learning introduces a triplet contrastive loss to enhance intraclass compactness and interclass separability in the global intent embedding space. It compensates for the shortcomings of traditional classification loss in embedding space discrimination and improves the model's generalization ability. Comprehensive experiments on two datasets demonstrate that our framework outperforms multiple baseline methods.
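A minimal sketch of a triplet contrastive objective over intent embeddings is shown below, assuming anchors and positives share an intent label while negatives do not; this is a generic cosine-distance formulation, not the authors' exact loss.

    # Generic triplet contrastive loss over L2-normalised intent embeddings.
    import torch
    import torch.nn.functional as F

    def triplet_contrastive_loss(anchor, positive, negative, margin: float = 0.5):
        # All inputs: (B, D) intent embeddings.
        anchor, positive, negative = (F.normalize(t, dim=-1) for t in (anchor, positive, negative))
        pos_dist = 1.0 - (anchor * positive).sum(dim=-1)    # cosine distance to positive
        neg_dist = 1.0 - (anchor * negative).sum(dim=-1)    # cosine distance to negative
        return F.relu(pos_dist - neg_dist + margin).mean()  # pull positives in, push negatives out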

13:50
MACT: Mutation-Aware CNN-Transformer for ESG Forecasting

ABSTRACT. Environmental, Social, and Governance (ESG) indicators are core metrics for evaluating corporate sustainability and long-term resilience. Against the backdrop of escalating climate risks and increasingly stringent environmental regulations, timely and reliable forecasting of the environmental (E) dimension has become critical—yet remains challenging in the presence of abrupt structural changes. Using the Huazheng ESG Ratings dataset, which covers 2,270 mainland A-shares and Hong Kong–listed companies from 2013 to 2022, we formulate E-score prediction as a multivariate annual time-series task and generate training samples via overlapping sliding windows. We propose the Mutation-Aware CNN-Transformer (MACT), the first hybrid architecture explicitly designed to model ESG “mutations.” MACT employs convolutional encoders to capture short-term patterns, Transformer blocks to learn long-range dependencies, and two mutation-aware augmentation strategies—synthetic mutation injection and temporal masking—that introduce sudden shocks and missing segments during training. Extensive experiments show that MACT reduces the Root Mean Square Error (RMSE) to 3.7951 and the Mean Absolute Percentage Error (MAPE) to 3.67%. These results correspond to a reduction in RMSE and MAPE of 34.02% and 46.73%, respectively, compared to the state-of-the-art (SOTA) model Long Short‑Term Memory (LSTM), and an improvement of 34.96% and 46.89% relative to the Transformer baseline. Our findings demonstrate that integrating convolutional feature extraction, attention-based sequence modeling, and mutation-aware augmentation yields a highly accurate and robust framework for forecasting corporate environmental performance.
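The following NumPy sketch illustrates the data preparation described above: overlapping sliding windows over a multivariate annual series, plus the two mutation-aware augmentations (synthetic shock injection and temporal masking). Window length, shock scale, and masking probability are assumed values, not the paper's configuration.

    # Illustrative sliding-window sample generation and mutation-aware augmentation.
    import numpy as np

    def sliding_windows(series: np.ndarray, window: int = 5):
        # series: (T, F) annual observations -> X: (T - window, window, F), y: (T - window, F)
        X = np.stack([series[i:i + window] for i in range(len(series) - window)])
        y = series[window:]
        return X, y

    def mutation_augment(X: np.ndarray, shock_scale: float = 2.0, mask_prob: float = 0.1, seed: int = 0):
        rng = np.random.default_rng(seed)
        Xa = X.copy()
        n, w, f = Xa.shape
        shock_rows = rng.integers(0, w, size=n)
        Xa[np.arange(n), shock_rows] += rng.normal(0.0, shock_scale, size=(n, f))  # synthetic mutation injection
        Xa[rng.random((n, w, f)) < mask_prob] = 0.0                                # temporal masking
        return Xa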

14:10
DBFormer: Dual Branch Transformer for Visible-infrared Person Re-identification

ABSTRACT. Conventional visible-light-based person re-identification (Re-ID) techniques suffer performance degradation in nighttime or low-light conditions, while visible-infrared (RGB-IR) cross-modal Re-ID can adapt to multiple indoor and nocturnal scenarios. However, the latter faces dual challenges: significant feature distribution discrepancies and local-global feature representation imbalance. Recently, Transformer architectures and part-based methods have demonstrated great progress in traditional Re-ID tasks; however, their direct application to cross-modal scenarios exhibits critical limitations, such as compromised feature integrity from excessive fine-grained segmentation and substantially increased computational complexity. To address these challenges, we propose a Dual-Branch Transformer network (DBFormer) which horizontally partitions the feature encoding process into upper-body and lower-body branches, thereby enhancing the detailed feature modeling capability. Moreover, we design a dual-branch alignment loss function to enforce feature distribution consistency and mitigate inter-branch discrepancies, and a cross-modal alignment loss function to significantly improve Re-ID performance by optimizing cross-modal feature distances. Extensive experiments demonstrate that our method achieves superior accuracy in person re-identification, outperforming state-of-the-art approaches in recent years.

14:30
Proxy-Mamba: Training-Free Architecture Search for Mamba via Gradient-Weight Correlation

ABSTRACT. State Space Models (SSMs) such as Mamba have recently emerged as efficient alternatives to Transformers for sequence modeling, especially in long-context and low-resource scenarios. However, the design of Mamba-based architectures still relies on manual heuristics and exhaustive tuning, limiting their scalability and performance. In this work, we propose Proxy-Mamba, a training-free Neural Architecture Search (NAS) framework tailored for Mamba models. At the core of Proxy-Mamba is a novel proxy metric, SigScore, designed to estimate the performance of Mamba-based architectures without any training. By computing a Sigmoid-normalized aggregation of the magnitudes of parameter weights and their corresponding gradients within the SSM module, our proxy exhibits strong correlation with actual model performance, enabling fast and cost-effective architecture search. Experiments on CIFAR10, CIFAR100, and ImageNet16-120 demonstrate that Proxy-Mamba outperforms existing training-free NAS baselines in terms of correlation with test accuracy and the final model's classification performance. Our findings open a promising path toward scalable and efficient Mamba architecture design.
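As a rough illustration of a sigmoid-normalised weight-gradient proxy in the spirit of SigScore, the sketch below aggregates |weight| x |gradient| saliencies after a single backward pass; the exact modules scored and the aggregation used in the paper may differ.

    # Hedged sketch of a training-free weight-gradient proxy (not the paper's exact SigScore).
    import torch

    def sigscore_proxy(model: torch.nn.Module, loss: torch.Tensor) -> float:
        # Assumes gradients are freshly zeroed and `loss` was computed on one mini-batch.
        loss.backward()
        score = 0.0
        for p in model.parameters():
            if p.grad is None:
                continue
            # Sigmoid-normalise the per-parameter saliency before averaging and summing.
            score += torch.sigmoid(p.detach().abs() * p.grad.detach().abs()).mean().item()
        return score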

14:50
EDGM: Efficient and Dynamic Generative Model for Dataset Distillation

ABSTRACT. Generative dataset distillation aims at knowledge condensation of complete datasets through generative modeling, preserving key training information. Current more advanced methods mostly use diffusion modeling to generate compact synthetic data with high quality and diversity; however, such methods face many challenges in practical applications. For example, it is difficult to cope with long-tailed distributions by treating different categories equally, and it is difficult to adequately express the hierarchical complexity within categories with a large number of hyponyms. In addition, there is the problem of high resource utilization in the process of high-resolution image generation. To cope with the above problems, this paper proposes Efficient and Dynamic Generative Model for Dataset Distillation (EDGM). EDGM proposes dynamic noise control, dynamic clustering center, and dynamic prototype-image generation strategies by introducing a semantic-image dual complexity assessment mechanism focusing on the complex categories. In addition, EDGM introduces PCA dimensionality reduction and dimensionality enhancement strategies in the dynamic clustering phase, which effectively alleviates the memory bottleneck in high-resolution image processing and improves the computational efficiency of the system. Extensive experiments on four standard datasets verify that EDGM offers a good trade-off between accuracy and efficiency, and its comprehensive performance outperforms existing methods.

15:10
UCTPose: Uncertainty-aware Multi-view 3D Animal Pose Estimation

ABSTRACT. Data-driven quantitative analysis of animal behavior relies critically on precise video segmentation, accurate 3D animal pose estimation, and behavioral pattern interpretation. Three persistent challenges impede accurate pose estimation: cross-view occlusions, perspective-induced domain shifts, and scarcity of 3D annotation data. To address these limitations, we present UCTPose, a weakly supervised multi-view animal pose estimation framework that synergistically integrates uncertainty-aware 2D modeling with confidence-guided 3D triangulation. The core innovations are twofold: (1) UCTPose employs a reparameterized perturbation module that simulates view-dependent feature uncertainties, enhancing confidence calibration for 2D keypoint predictions under occlusion; (2) a geometry-constrained triangulation head reconstructs 3D poses by incorporating per-joint confidence scores, optimized via a reprojection residual loss to enforce spatial consistency. Comprehensive evaluations on three multi-view mouse behavioral datasets demonstrate that UCTPose achieves state-of-the-art performance. These results validate UCTPose's superior cross-view generalization and occlusion resilience. The framework provides a robust tool for high-fidelity 3D kinematic profiling of naturalistic animal behaviors, significantly reducing dependency on exhaustive 3D annotations.
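To illustrate confidence-guided triangulation, the sketch below performs standard confidence-weighted linear (DLT) triangulation of one joint from multiple calibrated views; camera matrices and per-view confidences are assumed inputs, and this is a generic baseline rather than the authors' triangulation head.

    # Generic confidence-weighted DLT triangulation of a single joint.
    import numpy as np

    def weighted_triangulate(points_2d, proj_mats, confidences):
        # points_2d: (V, 2) pixel coords; proj_mats: V projection matrices (3, 4); confidences: (V,)
        rows = []
        for (u, v), P, c in zip(points_2d, proj_mats, confidences):
            rows.append(c * (u * P[2] - P[0]))   # each view contributes two confidence-weighted equations
            rows.append(c * (v * P[2] - P[1]))
        A = np.stack(rows)
        _, _, vt = np.linalg.svd(A)
        X = vt[-1]
        return X[:3] / X[3]                      # homogeneous -> Euclidean 3D point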

13:30-15:30 Session 12E: Online Session
Location: RH103
13:30
Spatio-Temporal Multi-Granularity Gated Recurrent Transformer Model for individual mobility prediction

ABSTRACT. While some studies have shown positive results, forecasting individual mobility remains a challenging endeavor due to the high variability and complexity of personal movement patterns. The challenge lies in grasping the complex spatial and temporal relationships, as well as the ever-changing patterns of individual movement. To tackle these challenges, we introduce the Spatio-Temporal Multi-Granularity Gated Recurrent Transformer Model (STMGGRT), which integrates multi-granularity structural encoding with spatio-temporal data within a Gated Recurrent Transformer framework. The Gated Recurrent Transformer layer in our model effectively captures intricate sequential patterns and long-range dependencies, which are essential for accurate mobility prediction. Through the use of stacked multiple levels of modules, our method is able to reveal hidden multi-level patterns within mobility data, leading to improved prediction performance. Comprehensive experiments on three publicly available datasets demonstrate that STMGGRT significantly outperforms current state-of-the-art methods in both accuracy and efficiency, highlighting its robustness and effectiveness in predicting future mobility locations.

13:50
HFCNet: A Spatial-Frequency Collaborative Multi-Scale Network for Infrared Small Target Detection

ABSTRACT. Infrared small target detection (IRSTD) focuses on the accurate localization of small, low-contrast targets under cluttered background conditions, with wide-ranging deployments in practice, particularly for maritime rescue operations and traffic surveillance systems. However, this task still faces multiple challenges, including vulnerability to background noise, poor adaptability to multi-scale features, and limited feature representation capacity. In response to these challenges, we present HFCNet, a multi-scale detection network that leverages collaborative modeling in both spatial and frequency domains. We design a Hessian-based feature extraction branch (HB) to capture coarse-grained structural textures of the target, followed by multi-stage feature fusion to enhance fine-grained details. Meanwhile, a Local Frequency Representation Module (LFRM) is introduced to decouple the target’s energy and spatial information in the frequency domain. This representation is further constrained by a Frequency-Domain Loss (FD) to align predicted features with ground truth in the frequency space. To further enhance performance, we integrate the Residual Coordinate Attention Block (RCB), which adaptively refines multi-scale feature representations with spatial awareness, ultimately boosting the network’s sensitivity to small and weak targets. Experimental evaluations demonstrate that the proposed HFCNet consistently surpasses leading state-of-the-art techniques when tested on the public IRSTD-1k and NUDT-SIRST datasets.

14:10
Directional Gradient Attacks for Inducing Controllable Detection Errors in DETR-based Models

ABSTRACT. End-to-end object detection models such as DETR and its variants have received widespread attention due to their unique architectural design and impressive detection performance. However, adversarial attacks targeting these models remain relatively underexplored. Existing approaches have primarily focused on traditional two-stage or anchor-based detectors, making them difficult to apply directly to the DETR architecture. To gain an in-depth understanding of the behavioral characteristics of DETR-like models under adversarial examples, this paper proposes a structured adversarial attack method based on directional gradient projection, centering on four types of attack objectives: object vanishing, misclassification, object fabrication, and random output. The method guides the projection of gradients toward targeted directions by intervening in the bipartite matching mechanism, perturbing the decision boundaries of the classification head, reinforcing the self-enhancement mechanism, and disrupting the collaboration of feature streams, thereby enabling the model to generate controllable adversarial misdetections at the output stage. Extensive experiments on the MS COCO dataset demonstrate that the proposed attacks induce an average AP drop of 0.378 across DETR variants, reaching up to 0.408 under object-vanishing objectives. These results significantly degrade detection performance and highlight the structural vulnerability of DETR-based models to adversarial manipulation.

14:30
Focus on What Matters: Object-level Semantic Alignment for Multimodal Named Entity Recognition with multiple images

ABSTRACT. Multimodal Named Entity Recognition (MNER) is an important research direction in Natural Language Processing, which aims to enhance Named Entity Recognition (NER) performance via additional image information. As posts with multiple images become increasingly common on social media, existing methods primarily focus on single-image scenarios, leaving a significant research gap for multi-image MNER. Current multi-image approaches often fuse holistic image representations with text, a strategy susceptible to noise from irrelevant visual information and which overlooks fine-grained, object-level cues crucial for entity recognition. To address this challenge, we propose a novel Object-level Semantic Alignment Framework. The framework utilizes an object detector to decompose the multiple images into a set of fine-grained candidate visual objects. A semantic alignment module then calculates the semantic relevance of each object to the text, based on which it selects the Top-K most critical visual proofs from the candidates, which are then fed into a multimodal interaction module for deep fusion with the text. Extensive experiments on public multi-image MNER datasets demonstrate that our proposed method significantly outperforms existing baselines. Ablation studies further validate the effectiveness of our object alignment and selection mechanism, providing a novel solution for the MNER-MI task.
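The object selection step can be pictured with the short PyTorch sketch below: each detected object region is scored against the sentence embedding and only the Top-K most relevant objects are kept. The encoders, feature shapes, and value of K are assumptions, not the authors' configuration.

    # Illustrative Top-K object selection by text-object cosine similarity.
    import torch
    import torch.nn.functional as F

    def select_topk_objects(object_feats: torch.Tensor, text_feat: torch.Tensor, k: int = 4):
        # object_feats: (N, D) pooled features of candidate objects from all images
        # text_feat:    (D,)   sentence-level text representation
        sims = F.cosine_similarity(object_feats, text_feat.unsqueeze(0), dim=-1)  # (N,)
        k = min(k, object_feats.size(0))
        top_vals, top_idx = sims.topk(k)
        return object_feats[top_idx], top_vals   # visual proofs passed to the fusion module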

14:50
Bayesianly-Corrected, Bandit-Optimized Multi-Agent LLMs: Rethinking Agents via Control-Theoretic Dynamics
PRESENTER: Xunfei Zhu

ABSTRACT. We present a rigorous theoretical foundation for multi-agent large language model (LLM) systems. Our framework, Control-Theoretic Multi-Agent LLM Dynamics (CT-MALD), models each agent as a controlled jump-diffusion with actuation delay and committed (non-preemptive) action windows. We prove a dynamic programming principle (DPP) and a viscosity characterization for the resulting nonlocal Hamilton-Jacobi-Bellman (HJB) equation with delay and commitment. Horizontal collaboration is modeled via f-divergence-bounded information exchange; we prove exponential reductions in expected time-to-decision under Rényi divergence budgets. Vertical coordination is formulated as a continuous-time hierarchical Stackelberg game; we prove equilibrium existence, uniqueness under strict quasi-concavity, and sensitivity via the implicit function theorem. For prompt/policy selection we develop Thompson-Regularized Gaussian Process Contextual Bandits (TR-GPCB) and prove high-probability regret bounds with delayed, heteroscedastic feedback and instance-dependent refinements. Finally, we propose a Hierarchical Empirical Bayes Correction (HEBC) mechanism and prove conjugate posteriors, posterior contraction, an optimal decision threshold, and strict average error reduction.

15:10
Latent Intention-guided Prediction and Planning with Transformer in Autonomous Driving
PRESENTER: Runshan Huang

ABSTRACT. Despite recent rapid progress driven by learning-based methods in autonomous driving, existing models often struggle to produce reliable plans in complex multi-agent scenarios. A key reason is the insufficient utilization of contextual information, particularly the rich semantic and interaction cues embedded in the scene, which limits the model's ability to anticipate how agents will behave. In this paper, we propose a latent-intention-guided structure that enhances the model's ability to extract and utilize underlying behavioral preferences from the scene context. We introduce a latent variable z as an intermediate representation of fine-grained future motion tendencies, such as turning cautiously or aggressively, which semantically guides both trajectory prediction and motion planning. This latent variable is modeled as a Gaussian Mixture Model (GMM), capturing the discrete and continuous nature of future intentions in complex multi-agent scenarios. By integrating z into a Transformer-based encoder-decoder architecture, our method produces intention-aware predictions that are more aligned with scene semantics and agent interactions. Experiments on the nuPlan benchmark demonstrate that our approach significantly improves prediction accuracy and planning performance, while providing interpretable latent structures that reflect diverse behavioral preferences.
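A minimal sketch of a Gaussian-mixture latent intention head follows: the network predicts mixture weights, means, and variances for z, samples a discrete mode, and draws a reparameterised latent from that component. The dimensions and the specific parameterisation are illustrative assumptions, not the paper's model.

    # Illustrative GMM latent-intention head (assumed dimensions, not the authors' module).
    import torch
    import torch.nn as nn

    class IntentionGMMHead(nn.Module):
        def __init__(self, feat_dim: int = 256, z_dim: int = 8, n_modes: int = 6):
            super().__init__()
            self.n_modes, self.z_dim = n_modes, z_dim
            self.to_params = nn.Linear(feat_dim, n_modes * (1 + 2 * z_dim))

        def forward(self, scene_feat: torch.Tensor) -> torch.Tensor:   # scene_feat: (B, feat_dim)
            p = self.to_params(scene_feat).view(-1, self.n_modes, 1 + 2 * self.z_dim)
            logits = p[..., 0]                                          # mixture weights (B, M)
            mu = p[..., 1:1 + self.z_dim]                               # component means (B, M, z)
            log_var = p[..., 1 + self.z_dim:]                           # component log-variances
            mode = torch.distributions.Categorical(logits=logits).sample()   # discrete intention
            idx = mode.view(-1, 1, 1).expand(-1, 1, self.z_dim)
            mu_m = mu.gather(1, idx).squeeze(1)
            std_m = (0.5 * log_var.gather(1, idx).squeeze(1)).exp()
            return mu_m + std_m * torch.randn_like(std_m)               # reparameterised sample of z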

13:30-15:30 Session 12F: Online Session
Location: RH104
13:30
FluLLM: Speech Fluency Classification Based on Multi-modal Large Language Models

ABSTRACT. Large language models (LLMs), because of their powerful multi-modal understanding and reasoning capabilities, provide a new technical pathway for automated speech fluency assessment. Inspired by LLM-based automatic speech recognition (ASR) models and text-related scoring models, we propose FluLLM, an LLM-based speech fluency classification framework for second-language learners, capable of accommodating both the open scenario (free expression without reference text) and the follow-up scenario (read-aloud with reference text). The framework employs a pre-trained Whisper model as the speech encoder and integrates its acoustic and semantic features at various hierarchical levels through a learnable dynamic-weighting fusion strategy. A lightweight modality adapter is designed to align the fused features with the LLM input space, and a linear classification head is attached to the final hidden states of the LLM to map them to fluency levels. Clear, structured prompts are devised for both the open scenario and the follow-up scenario to guide the LLM in generating classification outputs. On the Avalinguo Audio Dataset (AAD), which represents the open scenario, FluLLM improved accuracy and F1-score by 2.83 and 2.93 percentage points, respectively, over the baseline, and on SpeechOcean762 (SO762), which represents the follow-up scenario, it improved accuracy and F1-score by 7.04 and 8.92 percentage points, respectively. Both results are significantly better than the baseline models.
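The learnable dynamic-weighting fusion can be pictured with the small sketch below: a softmax over per-layer scalars mixes the encoder's layer-wise hidden states into one feature sequence. The number of layers and tensor shapes are assumptions, not the paper's exact setup.

    # Illustrative learnable layer-weighted fusion over encoder hidden states.
    import torch
    import torch.nn as nn

    class LayerWeightedFusion(nn.Module):
        def __init__(self, num_layers: int = 12):
            super().__init__()
            self.layer_logits = nn.Parameter(torch.zeros(num_layers))

        def forward(self, hidden_states):
            # hidden_states: list of (B, T, D) tensors, one per encoder layer
            w = torch.softmax(self.layer_logits, dim=0)           # (L,) learned mixing weights
            stacked = torch.stack(hidden_states, dim=0)           # (L, B, T, D)
            return (w.view(-1, 1, 1, 1) * stacked).sum(dim=0)     # fused features (B, T, D)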

13:50
SCD-HDC: A Hallucination Detection and Correction Method for LLMs Based on Syntactic Component Decomposition

ABSTRACT. The issue of hallucination detection and correction in LLMs is receiving increasing attention. Existing methods primarily pursue fine-grained hallucination detection through various fact unit decomposition techniques; however, we observe the following limitations in these approaches: (1) excessive redundant judgments lead to low detection efficiency; (2) semantic information within detection units is isolated, resulting in one-sided judgments. Inspired by syntactic structures in linguistics and syntactic parsing work in natural language processing, we propose a novel two-stage hallucination detection paradigm for LLMs called "syntactic decomposition-hallucination detection", along with a complementary method for hallucination detection and correction, termed SCD-HDC. In the hallucination detection phase, SCD-HDC incorporates a syntactic decomposition step to refine the detection granularity to syntactic component quadruples, which maintains the semantic integrity both within and between fact units while avoiding redundant detections. In the hallucination correction phase, the method utilizes the hallucinated syntactic component labels from the detection results as a guide, achieving flexible multi-scale corrections with high precision. Experimental results on three datasets from RAGTruth indicate that, in the hallucination detection phase, SCD-HDC achieves an overall F1 score that is more than 10% higher for the response level and over 24% higher for the span level compared to the best baseline methods. In the hallucination correction phase, it reduces the correction scope by an average of 65.4% compared to baseline methods, while still obtaining the best correction accuracy on two datasets. Furthermore, experiments demonstrate that SCD-HDC has good adaptability to multiple models.

14:10
Do the Instruction-fine-tuned Large Language Models Challenge Flawed Instructions? A Study on Over-Compliance and Hallucinations

ABSTRACT. Instruction fine-tuning significantly enhances the task performance of large language models (LLMs) and their ability to generalize to unseen tasks. However, existing instruction fine-tuning techniques may overlook the training of critical thinking skills in models when responding to instructions. As a result, when presented with logically flawed instructions, models often comply, generating content inconsistent with objective facts or misaligned with user inputs, thus exhibiting sycophancy and the corresponding hallucination. In this paper, we introduce the NCA-MCQ (No-Correct-Answer Multiple-Choice Questions) dataset, derived from diverse tasks, and propose a three-step testing framework to evaluate models' ability to challenge logically flawed instructions. Experimental results and sample analysis reveal that instruction-fine-tuned models, such as GPT-4o, Llama3.1, and the Qwen2.5 series, often exhibit excessive compliance with flawed instructions, leading to sycophantic hallucinations. Furthermore, the parameter scale of the model significantly influences its ability to question defective instructions.

14:30
Gradient Reweighting-Based Representation Intervention and Prompting Framework for Emotion Recognition in Conversation

ABSTRACT. Emotion recognition in conversation (ERC) is an interdisciplinary field focused on identifying the emotional state of each utterance in a dialogue. However, existing ERC studies use fixed, single prompt templates, and the resulting utterance embeddings fail to adequately represent the subtle semantic differences between utterances. Additionally, the relationship between the way utterances are expressed and their corresponding emotional states is unclear, which impacts model inference and generation quality. ERC datasets are also typically class-imbalanced, leading to imbalanced gradients when handling different emotion categories. To address these issues, we propose the gradient reweighting-based representation intervention and prompting framework (GRIP-ERC). GRIP-ERC consists of a representation extractor and an unbiased classifier. In the representation extractor, GRIP-ERC adds soft prompts in the hidden layers of the PLMs, composed of task-specific and instance-specific elements, allowing for a more complete representation of the semantic differences between utterances. Additionally, GRIP-ERC intervenes in hidden representations within the linear subspace spanned by a low-rank projection matrix to guide model behavior during reasoning and improve generation quality. In the unbiased classifier, GRIP-ERC reweights the imbalanced gradient matrix on a per-class basis and uses a balance vector adaptively adjusted by historical accumulated gradients. Experimental results show that GRIP-ERC outperforms state-of-the-art methods on all three benchmark datasets, validating its effectiveness.

14:50
Fine-tuning Alignment of Large Language Models via Label Smoothing and Intermediate Contrastive Learning

ABSTRACT. Fine-tuning alignment plays a key role in large language model training. However, the ORPO method suffers from overfitting and unstable training due to the standard cross-entropy loss, as well as semantic degradation in intermediate layers caused by excessive focus on output optimization. To address these issues, this paper proposes a Hierarchical Contrastive Alignment (HCA) framework that combines dynamic label smoothing and intermediate-layer contrastive learning. The dynamic smoothing module adaptively adjusts the supervision based on the training stage and input structure, reducing overconfidence and improving generalization. Meanwhile, the contrastive module introduces structured supervision into intermediate representations to improve semantic discrimination and prevent representational collapse. Experiments on benchmarks validate the effectiveness of HCA, showing improved performance over existing baselines and highlighting its potential for stable, semantically aligned preference learning.

15:10
FLEX-AD: AutoML for Anomaly Detection via LLM-Guided Feature Generation and Selective Ensembling
PRESENTER: Zehao Gong

ABSTRACT. Anomaly detection on tabular data is critical in finance and cybersecurity, where graph structures are unavailable or unreliable. However, building effective detection models often requires labor-intensive feature engineering, model selection, and ensemble design, limiting scalability and robustness. To address this, recent work has explored automated machine learning (AutoML) as a means of streamlining the end-to-end modeling process. Despite progress, existing AutoML frameworks face structural limitations. They typically discard features based on single-model performance, include untuned models in ensembles, and fail to account for the quality of each model in final voting. These issues are especially problematic in anomaly detection, where heterogeneous patterns and varying anomaly types demand both high feature diversity and robust decision mechanisms. We propose FLEX-AD (Feature-Level and Exitable Ensemble eXploration for Anomaly Detection), a two-stage AutoML framework tailored for tabular anomaly detection. FLEX-AD first uses large language models (LLMs) to iteratively generate candidate feature engineering code and evaluates features across base models, retaining those that improve performance on any. In the second stage, each model undergoes grid-based hyperparameter tuning ranked by validation performance. Top models are selected for weighted soft-voting, ensuring reliable ensemble decisions. Experiments on 19 real-world datasets show that FLEX-AD achieves superior performance over existing baselines, especially in clustering with up to 8% ARI gain, offering a scalable, robust solution for tabular anomaly detection.
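The weighted soft-voting stage can be illustrated with the short sketch below: tuned base models are weighted by their validation scores and their predicted probabilities averaged. The weighting rule and inputs are assumptions, not FLEX-AD's exact configuration.

    # Illustrative validation-score-weighted soft voting over base-model probabilities.
    import numpy as np

    def weighted_soft_vote(prob_list, val_scores):
        # prob_list: list of (N, 2) predicted class probabilities from the top-ranked models
        # val_scores: matching list of validation scores (e.g., ROC-AUC)
        w = np.asarray(val_scores, dtype=float)
        w = w / w.sum()
        fused = sum(wi * p for wi, p in zip(w, prob_list))   # (N, 2) weighted average
        return fused[:, 1]                                    # anomaly probability per sample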

15:30-16:00Coffee Break
16:00-17:30 Session 13: Poster Session
DisBERT: Discrepancy Modeling with Transformers for Metaphor Detection

ABSTRACT. Metaphors are used in almost every sentence in human communication; they serve to enhance the conveyance of ideas between individuals. By clarifying and drawing parallels between various concepts, metaphors deepen and enrich our language and expression. However, they pose challenges for Large Language Models (LLMs), which remain limited in metaphorical and analogical comprehension. With the introduction of encoder-only transformers such as BERT, prior studies have demonstrated superior metaphor detection performance compared to earlier machine learning techniques, though further architectural research is still desirable. In this paper, we propose DisBERT, a transformer-based model for metaphor detection that introduces a novel Word Sentence Discrepancy (WSD) module. Rooted in Black's linguistic theory of metaphor interaction, WSD reflects the semantic divergence between a word's semantics and its sentence context. Evaluated on four standard benchmarks---VUA18, VUA20, MOH-X, and TroFi---DisBERT effectively captures additional contextualized semantic features and consistently achieves results that are competitive with state-of-the-art models. Overall, our findings suggest that DisBERT generalizes well across datasets and offers a promising approach to detecting metaphors at the word level.

Fake Money, Real Threat: Fooling Wavelet-Based Banknote Authentication with AdvGAN

ABSTRACT. The vulnerability of machine learning systems to adversarial examples poses a significant threat to security-relevant applications, especially as such models are increasingly deployed. Financial transactions, in particular, rely on the assumption that payments are legitimate, which makes banknote authentication an essential use case. Banknotes incorporate several security features, and the applied printing technique itself can be leveraged for authentication. Specifically, Intaglio printing results in fine line work and microstructures that can be analyzed and distinguished by spatial frequency analysis, e.g., the wavelet packet transform. By evaluating statistical moments of wavelet coefficient histograms, fast and reliable authentication is achieved.

This paper adapts the AdvGAN framework to the context of wavelet-based banknote authentication. By proposing a customized loss function that constrains the feature space, highly effective yet subtle adversarial examples are generated. These perturbations deceive the authentication system, causing it to classify forgeries as genuine banknotes and genuine ones as forgeries. Under attack, classification accuracy drops from 100% to 0% in the worst case. Furthermore, limitations and countermeasures are outlined, highlighting potential challenges of deploying such attacks in practical scenarios.
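For a concrete picture of the feature side of this setup, the sketch below decomposes an image patch with a plain 2D discrete wavelet transform (standing in for the wavelet packet transform used in the paper) and summarises coefficient histograms by statistical moments. Wavelet choice, level, and moments are assumptions.

    # Hedged sketch: wavelet sub-band statistical-moment features for a grayscale patch.
    import numpy as np
    import pywt
    from scipy import stats

    def wavelet_moment_features(patch: np.ndarray, wavelet: str = "db4", level: int = 3) -> np.ndarray:
        coeffs = pywt.wavedec2(patch.astype(float), wavelet=wavelet, level=level)
        feats = []
        for band in coeffs[1:]:                  # detail sub-bands per level (cH, cV, cD)
            for c in band:
                c = c.ravel()
                feats.extend([c.mean(), c.std(), stats.skew(c), stats.kurtosis(c)])
        return np.asarray(feats)                 # feature vector fed to the classifier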

RALLM-POI: Retrieval Augmented LLM for Zero-shot Next POI Recommendation with Geographical Reranking

ABSTRACT. Next point-of-interest (POI) recommendation is a fundamental task in spatiotemporal data mining, which predicts a user's subsequent destination based on historical movements. Most existing approaches rely on training-intensive models tailored to specific domains, which demand significant computational resources. Recent advances in large language models (LLMs) have ushered in a new era of flexible, zero-shot solutions that demonstrate strong generalization capabilities. However, when directly applied to next POI recommendation, LLMs lack trajectory references and geographical relationship information and are therefore prone to generating generic or irrelevant recommendations; without geographical understanding, their results may fail to consider geographical and contextual relevance. To this end, we present RALLM-POI, a novel framework that couples LLM capabilities with retrieval-augmented generation and self-rectification for next POI recommendation. Specifically, we first propose a Historical Trajectory Retriever (HTR), which retrieves relevant historical trajectories from the training database to serve as contextual references for the LLM prompt. These references are reranked by a Geographical Distance Reranker (GDR) according to their geographical relevance and trajectory similarity, ensuring that the most contextually and spatially appropriate examples are prioritized. To further refine prediction reliability, we introduce an Agentic LLM Rectifier (ALR), where the LLM reflects on its recommendation and revises its output if necessary. Our framework significantly enhances recommendation accuracy while maintaining the zero-shot adaptability of LLMs without training. Evaluations on three real-world Foursquare datasets show that RALLM-POI delivers substantial improvements over conventional training-based and LLM-based methods.
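Geographical reranking of retrieved trajectories can be pictured with the sketch below: candidates retrieved by similarity are re-ordered by a score that mixes similarity with haversine proximity to the user's last check-in. The candidate fields and the mixing weight are assumptions, not the paper's GDR formulation.

    # Illustrative similarity-plus-distance reranking of retrieved trajectories.
    import math

    def haversine_km(lat1, lon1, lat2, lon2):
        r = 6371.0
        p1, p2 = math.radians(lat1), math.radians(lat2)
        dphi, dlmb = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
        a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
        return 2 * r * math.asin(math.sqrt(a))

    def rerank(candidates, last_checkin, alpha: float = 0.7):
        # candidates: list of dicts like {"similarity": float, "end_lat": float, "end_lon": float}
        def score(c):
            dist = haversine_km(last_checkin["lat"], last_checkin["lon"], c["end_lat"], c["end_lon"])
            return alpha * c["similarity"] - (1 - alpha) * dist   # prefer similar and nearby examples
        return sorted(candidates, key=score, reverse=True)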

StructuralCoder: Repository Structure based RAG for Repository-Level Code Completion

ABSTRACT. Large language models (LLMs) with retrieval-augmented generation (RAG) have demonstrated encouraging performance in repository-level code completion. These approaches often employ a retriever to search for code snippets based on unfinished code. However, an often-neglected observation is that the presence of similarity does not inherently guarantee the provision of assistance. Furthermore, similarity-based retrieval strategies can only provide partial, localized information within the repository, missing the big picture of the entire repository. In this paper, we propose StructuralCoder, a framework that replaces the retriever with an extractor that builds the repository structure to provide a comprehensive and more helpful context. It traverses the entire repository, generating a hybrid tree structure that combines the directory tree with abstract syntax trees (ASTs). Throughout this process, StructuralCoder requires no access to or updates of the LLM's weights. Our evaluations on CrossCodeEval show that StructuralCoder significantly outperforms existing techniques in repository-level code completion, showcasing enhanced accuracy when compared to several baselines.
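A hedged sketch of the extraction idea follows: walk the directory tree and attach, for each Python file, its top-level classes and functions parsed from the AST. The output layout is illustrative and not StructuralCoder's actual hybrid-tree format.

    # Illustrative repository walk producing a file -> top-level-definitions map.
    import ast
    import os

    def build_repo_structure(root: str) -> dict:
        tree = {}
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                if not name.endswith(".py"):
                    continue
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as f:
                    try:
                        module = ast.parse(f.read())
                    except SyntaxError:
                        continue                         # skip files that do not parse
                symbols = [n.name for n in module.body
                           if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))]
                tree[os.path.relpath(path, root)] = symbols   # file -> top-level definitions
        return tree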

Training strategies for nonlinearity removal from Optical Coherence Tomography signals

ABSTRACT. Optical Coherence Tomography (OCT) is a light-based, three-dimensional imaging technique used in ophthalmology for eye disease detection. Every so often, the OCT device requires re-calibration, a procedure where setup parameters are re-evaluated to create updated correction vectors, which are critical for obtaining high-resolution OCT images in post-processing. More specifically, the correction vectors enable the removal of signal nonlinearities that lead to resolution degradation. Here, we demonstrate that a nonlinear raw OCT signal can be efficiently corrected, i.e., linearised, with a neural network without a priori knowledge of any OCT device parameters. We focus on the second-order nonlinearity, the biggest contributor to resolution degradation, and discuss the optimum strategies implemented to realise this universal OCT signal linearisation, involving signal pre-processing and the training dataset size. The neural network model trained using the optimum strategies is tested on experimental signals from different OCT machines and shows very good performance compared to traditional approaches.

SAM-Lightning: Segment Anything Model for Efficient Inference and Reduced Memory Footprint

ABSTRACT. The Segment Anything Model (SAM) has demonstrated remarkable zero-shot generalization ability in instance segmentation tasks with proper prompting by end users. However, its practical application is limited by slow inference speed and high memory requirements, primarily due to the attention mechanism. While previous work has focused on optimizing the encoder, the inefficiency of the attention mechanism itself has not been adequately addressed in SAM, even in distilled smaller models. To tackle this issue, we introduce SAM-Lightning, a variant of SAM featuring a novel attention mechanism called Dilated Flash Attention. This re-engineered attention mechanism enhances parallelism and processing efficiency in SAM while maintaining compatibility with existing FlashAttention. We also propose a progressive distillation technique for efficient knowledge transfer from the vanilla SAM, avoiding costly training from scratch. Experiments on COCO and LVIS datasets show that SAM-Lightning significantly outperforms state-of-the-art methods in both runtime efficiency and segmentation accuracy. On images of size 1024×1024 pixels, SAM-Lightning achieves an inference speed of 7 milliseconds per image, which is 30.1× faster than vanilla SAM and 2.1× faster than the best competing method. Moreover, it requires only 244MB of memory, just 3.5% of vanilla SAM's requirements.

A Feature-aware Label Selection Approach Based on Matrix Interpolation Decomposition

ABSTRACT. Multi-label learning addresses a special classification problem in which each instance is associated with multiple class labels simultaneously. This implies that its classes may overlap one another, and thus its label matrix is low-rank and compressible. This property makes label-space dimensionality reduction possible, including label embedding and label selection. In this paper, we focus on single-stage label selection, in which a selected label subset and its corresponding recovery matrix are created at the same time. Interpolative matrix decomposition (IMD) approximates a primary matrix by a product of two low-rank matrices, where the left matrix is a column subset of the original matrix and the right matrix acts as a recovery matrix. This mathematical formulation naturally describes our label selection problem in multi-label classification. We build a joint matrix from both the feature matrix and the label matrix, which is factorized using deterministic IMD, resulting in a feature-aware label selection algorithm, LS-IMDf for short. To the best of our knowledge, LS-IMDf is the first feature-aware single-stage label selection technique. When feature information is discarded, LS-IMDf reduces to a simplified version, LS-IMD. Finally, we validate the effectiveness of the proposed LS-IMDf empirically through extensive experiments on three benchmark data sets with more than 100 labels, comparing against five existing methods and our LS-IMD according to two classification evaluation metrics (precision@k and DCG@k, k=1, 3 and 5).
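
How the joint feature-label matrix is assembled is specific to LS-IMDf, but the underlying deterministic interpolative decomposition step can be illustrated with SciPy's interpolative module: it selects k label columns and returns a recovery matrix that approximately reconstructs the full label matrix. The toy label matrix below is our own placeholder, not data from the paper.

    import numpy as np
    import scipy.linalg.interpolative as sli

    rng = np.random.default_rng(0)
    Y = (rng.random((500, 120)) < 0.05).astype(np.float64)    # toy label matrix: 500 instances, 120 labels

    k = 20                                                     # number of labels to keep
    idx, proj = sli.interp_decomp(Y, k, rand=False)            # deterministic interpolative decomposition
    selected_labels = np.sort(idx[:k])                         # indices of the selected label columns
    P = sli.reconstruct_interp_matrix(idx, proj)               # recovery matrix, shape (k, 120)

    Y_hat = Y[:, idx[:k]] @ P                                  # approximate reconstruction of all labels
    print("selected labels:", selected_labels[:10], "...")
    print("relative reconstruction error:", np.linalg.norm(Y - Y_hat) / np.linalg.norm(Y))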

Multifaceted Mastery: Novel LLM Training for Swedish and Vietnamese with Focus on Data Efficiency and RAG Applications

ABSTRACT. This work studies the multilingual capabilities of Large Language Models (LLMs), focusing on their multifaceted mastery of low-resource languages. Although previous work has explored the multilingual abilities of LLMs on specific tasks, little is known about the transferability of instruction-following abilities to low-resource languages, as reflected in the pretraining corpus. We present an approach to effectively fine-tune pretrained LLMs to align with human preferences in low-resource languages such as Swedish and Vietnamese. Our methodology leverages fully automated data generation and demonstrates effectiveness in real-world scenarios, e.g., Retrieval Augmented Generation. To address a gap in the development and evaluation of instruction-tuned LLMs in these languages, we curate and present new benchmarking datasets tailored for LLMs. Our resulting multilingual model (MultiSV) not only achieves state-of-the-art performance in both languages, but also surpasses prior monolingual models despite its moderate size, with up to 10.1% improvement. Finally, the experimental results underscore the practical applicability of our models, affirming the potential of our approach for real-world use. Data, models, and source code will be made publicly available.

Automated Fabric Defect Detection Using RTMDet: Application in Denim Manufacturing

ABSTRACT. The textile industry faces significant challenges due to the high volume of waste generated by defective fabrics. Traditional human inspection methods are labor-intensive and prone to error, making consistent quality control difficult to achieve, particularly in the Pacific Rim, where thousands of textile and garment factories across China, Southeast Asia, and neighboring regions fuel local economies and provide millions of jobs. This research aims to develop an Automated Fabric Inspection System utilizing Artificial Intelligence (AI) to address these issues. By integrating high-resolution cameras with an advanced object detection algorithm, Real-Time Models for object Detection (RTMDet), the proposed system can accurately and efficiently identify defects of various types and scales in real time during the inspection process. While previous research on fabric defect detection has often relied on YOLO-based models, this work is the first to apply RTMDet in the textile industry, achieving superior speed and accuracy for real-time inspection tasks. The implementation of this automated system is expected to significantly reduce waste, enhance production efficiency, and ensure higher quality standards in the textile industry. Initial experiments demonstrate the system's potential in detecting fabric defects such as streak, chalk, nip, hole, tag pin, and thread defects. Additionally, a comparative analysis between human-eye inspection and AI inspection reveals that the AI system is approximately 2.5 times more efficient, significantly reducing inspection time and labor costs. This offers a promising solution for modernizing quality control in fabric manufacturing.

Industrial-Scale Autonomous eXplainable Iterative Learning

ABSTRACT. Explainable AI (XAI) aims to improve trust and transparency, particularly in high-risk domains. It has recently gained momentum, since industries are urged to comply with new regulations such as the EU AI Act. A valuable added benefit of XAI is the potential for further model improvement, ensuring that predictions are right for the right reasons. This is the focus of eXplainable Interactive Learning (XIL). In practice, XIL is challenging, since ground-truth labels for explanations are even harder to obtain than those for the predictions of an ML model. This is harder still in industry, where matching an explanation of a prediction with the causes of events and outcomes in the system essentially amounts to a full-fledged root cause analysis, which is complex and cost-intensive. Researchers and AI practitioners working at industrial scale struggle to provide a quantifiable evaluation of explanations, which impedes XIL. Using domain knowledge, we combine multimodal deep learning with XAI and propose Autonomous eXplainable Iterative Learning (AXIL), a new method for training and improving a model in a self-supervised manner, without the need for ground-truth explanation labels.

A Logic for Ramsey’s Measurement of Credence

ABSTRACT. Building on Ramsey’s seminal proposal that subjective probabilities can be derived from rational preference between actions, we can identify an approach for qualitatively comparing belief strengths via preference orderings. Specifically, an agent’s belief in proposition p exceeds that in q if and only if she prefers an action which brings about a better outcome conditional on p over another action which brings about the same outcome conditional on q. The paper presents a formal logical characterization of Ramsey’s qualitative credence measurement, introducing three key contributions: 1) a novel desire model, 2) a corresponding semantics, and 3) a complete axiomatic system. Through soundness and completeness proofs, we demonstrate how Ramsey’s behavioral credence criteria can be systematically reconstructed within logical calculus, formally bridging decision-theoretic norms with epistemic reasoning.
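
In our notation (not necessarily the paper's), the comparative criterion being axiomatized can be written as follows, where o+ is an outcome strictly preferred to o and [o+ if p; o otherwise] denotes the corresponding conditional action:

    \mathrm{Bel}(p) > \mathrm{Bel}(q)
      \;\iff\;
    [\,o^{+}\ \text{if}\ p;\ o\ \text{otherwise}\,] \;\succ\; [\,o^{+}\ \text{if}\ q;\ o\ \text{otherwise}\,],
      \qquad o^{+} \succ o .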

GLIMPSE: A Lightweight SHAP-Based Framework for Interpretable ALS Biomarker Discovery from Gene Expression Data

ABSTRACT. Amyotrophic Lateral Sclerosis is a progressive neurodegenerative disease with high genetic heterogeneity and limited interpretable biomarkers for clinical use. While machine learning models trained on transcriptomic data have shown promising classification performance, their black-box nature limits biological insight and clinical translation. In this study, we present a focused interpretability analysis using Shapley Additive Explanations applied to a tree-based ALS classifier. Our objective is to investigate SHAP’s utility in identifying gene-level contributors to ALS diagnosis, with a particular emphasis on the underexplored genes, such as BIN2. Through SHAP value analysis on transcriptomic features, BIN2 emerged as a consistently high-ranking gene alongside known ALS markers such as CYTIP and SASH3. Literature review suggests BIN2’s association with neuroimmune and mitochondrial pathways implicated in neurodegenerative processes. These findings highlight the power of SHAP to extract biologically meaningful signals from complex models and suggest that BIN2 may serve as a candidate biomarker for ALS. Unlike model-centric studies, this work isolates the interpretability layer to demonstrate how post hoc analysis can independently drive hypothesis generation and gene prioritization in precision medicine contexts.
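
A minimal sketch of this kind of post hoc analysis, using the shap library's TreeExplainer on synthetic data (placeholder gene names and a random-forest stand-in for the paper's classifier), is shown below. The shape of the returned SHAP values varies slightly across shap versions, which the sketch handles explicitly.

    import numpy as np
    import shap
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 50))                        # toy expression matrix: samples x genes
    y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=200) > 0).astype(int)
    genes = [f"gene_{i}" for i in range(X.shape[1])]      # placeholder gene identifiers

    model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

    explainer = shap.TreeExplainer(model)                 # SHAP values for tree ensembles
    sv = explainer.shap_values(X)
    sv = sv[1] if isinstance(sv, list) else sv[..., 1]    # per-sample values for the positive class
    importance = np.abs(sv).mean(axis=0)                  # mean |SHAP| as a gene ranking score

    for i in np.argsort(importance)[::-1][:10]:
        print(genes[i], round(float(importance[i]), 4))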

Scheduling Heuristic Learning via Genetic Programming for Dynamic Flexible Job Shop Scheduling with Heterogeneous Batch Arrivals

ABSTRACT. Dynamic flexible job shop scheduling (DFJSS) is a challenging combinatorial optimisation problem that requires effective decision-making in dynamic environments. Although genetic programming (GP) has shown success in automatically learning scheduling heuristics, existing research predominantly addresses dynamic events involving single job arrivals, which does not always reflect real-world situations. In practice, heterogeneous batch arrivals, where different jobs arrive simultaneously, are quite common and introduce a new decision-making challenge: handling multiple routing decisions (machine assignments) concurrently. However, this problem has received little attention in the literature. To fill this gap, we first formulate the DFJSS problem with heterogeneous batch arrivals. Furthermore, to effectively coordinate the simultaneous routing decisions introduced by batch arrivals, GP is used to evolve scheduling heuristics that prioritise hybrid operation-machine pairs instead of jobs. An update strategy is incorporated to reflect the latest system status after each routing assignment, improving decision quality under dynamic changes. Experimental results across 18 scenarios demonstrate that using routing rules to prioritise pairs achieves the best average rank and outperforms the compared methods. The effectiveness of the update strategy is also verified. Further analysis shows that concurrent routing decisions have little impact on the size and feature occurrence of routing rules, suggesting that concurrent routing essentially follows the same logic as traditional routing. Therefore, a single routing rule can be effectively applied to both concurrent and individual routing decisions.

Dual-branch Mamba based multi-label Anuran species classification

ABSTRACT. Multi-label audio classification is a challenging task due to complex soundscapes, overlapping events, and class imbalance, particularly in fine-grained biodiversity monitoring scenarios. In this study, we present CNNDualMamba, a new neural architecture designed for multi-label anuran species classification. Our model combines three main components. First, it uses a convolutional neural network backbone to efficiently extract time-frequency features. Second, it introduces a loss function better suited to multi-label tasks, combining a zero-bounded log-sum-exp term with a pairwise rank-based loss. Third, it designs a new dual-branch Mamba module that models temporal and frequency dependencies separately. In addition, we apply label-aware data augmentation to mitigate class imbalance. Experimental results on AnuraSet show that our model outperforms existing baselines and achieves an overall F1 score of 60.1%. Across the three subsets, our best model achieves F1 scores of 86.5%, 81.1%, and 49.8% for frequent, common, and rare classes, respectively.
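
The abstract does not spell out the exact loss; one widely used formulation combining a zero-bounded log-sum-exp with pairwise ranking (the ZLPR loss) is sketched below in PyTorch, assuming raw logits and multi-hot targets. Treat it as an illustration of the idea rather than the paper's exact objective.

    import torch

    def zlpr_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        """Zero-bounded log-sum-exp / pairwise-rank multi-label loss.

        logits:  (batch, num_classes) raw scores
        targets: (batch, num_classes) multi-hot {0, 1} labels
        """
        neg = logits.masked_fill(targets == 1, float("-inf"))      # scores of negative classes
        pos = (-logits).masked_fill(targets == 0, float("-inf"))   # negated scores of positive classes
        zeros = torch.zeros(logits.size(0), 1, device=logits.device)
        loss_neg = torch.logsumexp(torch.cat([zeros, neg], dim=1), dim=1)   # log(1 + sum_i e^{s_i})
        loss_pos = torch.logsumexp(torch.cat([zeros, pos], dim=1), dim=1)   # log(1 + sum_j e^{-s_j})
        return (loss_neg + loss_pos).mean()

    scores = torch.randn(4, 6)                        # 4 clips, 6 species
    labels = torch.randint(0, 2, (4, 6)).float()
    print(zlpr_loss(scores, labels))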

Dynamic Balance Sorting and Co-evolutionary Algorithm for Expensive Many-objective Optimization

ABSTRACT. The central challenge of expensive many-objective optimization problems (EMaOPs) is selecting promising solutions as the number of objectives increases. Current research tends to focus on either convergence or diversity alone, which is insufficient for solving EMaOPs with complex Pareto fronts. For this reason, we propose a novel Dynamic Balance Sorting and Co-evolutionary Algorithm (DBS-CEO) for EMaOPs. DBS-CEO employs a dynamic balance sorting (DBS) algorithm that combines dynamic decomposition sorting with dimensionality-decreasing non-dominated sorting. DBS balances convergence and diversity dynamically during the sorting process and utilizes boundary information to sort population solutions. DBS-CEO uses a co-evolutionary algorithm with Generative Adversarial Networks as an auxiliary optimizer for MOEA/D to generate better solutions. In comparison with five state-of-the-art surrogate-assisted evolutionary algorithms (SAEAs), DBS-CEO achieves competitive performance and computational efficiency on commonly used benchmark problems.

RAGCD: Relation-Aware Graph for Cognitive Diagnosis

ABSTRACT. Cognitive Diagnosis (CD) aims to infer students' latent knowledge states by analyzing their historical response logs. However, a limitation of most existing models is cognitive state blurring, as they tend to overlook the heterogeneous diagnostic signals embedded in different response behaviors and aggregate all interactions indiscriminately. This monolithic modeling approach makes it difficult to distinguish between students with similar overall proficiency but different specific weaknesses, leading to representations with low differentiability. Therefore, we propose a Relation-Aware Graph for Cognitive Diagnosis (RAGCD) framework to obtain more discriminative representations and enhance diagnostic accuracy. Specifically, our model first constructs a Student Cognitive Interaction Graph that decomposes interactions into a right-response subgraph and a wrong-response subgraph to capture distinct signals of mastery and misconception, respectively. Furthermore, we design a Heterogeneous View Attention mechanism to adaptively fuse the information from these different relational views. Finally, the fused representations are used to predict student performance, thus diagnosing their knowledge states more precisely. Extensive experiments on three real-world datasets show that RAGCD outperforms baseline models, indicating the effectiveness of our relation-aware approach.

Prompt-Driven Knowledge Retrieval in Arabic Medical Agents via Graph-RAG and LLM

ABSTRACT. Human-computer interaction has been significantly transformed by the rise of intelligent agents, now widely deployed across domains such as healthcare, education, and customer service. However, Arabic-language agents remain underdeveloped due to the linguistic complexity of Arabic—including its rich morphology and context-sensitive syntax—and the scarcity of domain-specific, high-quality datasets for model training. To bridge this gap, we propose an intelligent Arabic medical agent that integrates Large Language Models (LLMs) with a Graph-based Retrieval-Augmented Generation (Graph-RAG) framework. Our method begins by annotating an Arabic medical dataset to construct a Neo4j-based knowledge graph, using LLMs and AraBERT to extract and semantically embed key clinical entities such as diseases, symptoms, and treatments. Leveraging the capabilities of LLMs, the system incorporates a preprocessing module designed to correct linguistic ambiguities and errors, followed by a classification mechanism that determines whether the query requires direct generation or retrieval from the knowledge graph. For complex queries, relevant entities are extracted, matched through semantic similarity, and the most appropriate subgraph is selected to guide the generation process. This architecture enables precise, coherent, and context-aware responses tailored to medical needs. Evaluations conducted on the Arabic Healthcare Dataset (AHD) demonstrate its superiority over existing approaches.
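
For a flavour of the retrieval step, the sketch below runs a simple Cypher query against a Neo4j graph with a hypothetical (:Disease)-[:HAS_SYMPTOM]->(:Symptom) schema using the official Python driver. The actual schema, AraBERT embeddings, and subgraph selection in the paper are richer than this.

    from neo4j import GraphDatabase

    URI, AUTH = "bolt://localhost:7687", ("neo4j", "password")   # placeholder connection details

    CYPHER = """
    MATCH (d:Disease)-[:HAS_SYMPTOM]->(s:Symptom)
    WHERE s.name IN $symptoms
    WITH d, count(s) AS overlap
    RETURN d.name AS disease, overlap
    ORDER BY overlap DESC
    LIMIT 5
    """

    def candidate_diseases(symptoms):
        """Return diseases whose symptom sets best overlap the entities extracted from the query."""
        driver = GraphDatabase.driver(URI, auth=AUTH)
        try:
            with driver.session() as session:
                return [(r["disease"], r["overlap"]) for r in session.run(CYPHER, symptoms=symptoms)]
        finally:
            driver.close()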

WaveDSTG: A Multiscale Wavelet-Based Spatio-Temporal Attention for Temporal Knowledge Graphs Reasoning

ABSTRACT. In recent years, Temporal Knowledge Graph (TKG) reasoning has made progress in modeling periodic events. However, it still faces challenges in reasoning about emergent and aperiodic events: traditional methods fail to effectively model the topological dependencies inherent in the dynamic evolution of non-periodic events, owing to inadequate spatio-temporal correlation modeling. To mitigate these challenges, this paper proposes WaveDSTG, a novel multiscale wavelet transform and spatio-temporal attention-driven TKG reasoning model. By collaboratively encoding multiscale temporal features through the wavelet transform and a Fourier basis, it effectively captures both periodic and non-periodic events, thus overcoming the limitations of single-scale modeling. Meanwhile, the proposed model incorporates spatio-temporal dynamic attention with a dual-projection contrastive learning mechanism to construct both historical and non-historical subgraphs, which are independently utilized for dual-path prediction. By leveraging contrastive objectives in conjunction with orthogonal regularization, the model effectively disentangles spatio-temporal representations, enabling more structured and discriminative feature learning. Experimental results show that WaveDSTG outperforms comparable models in both effectiveness and efficiency on four TKG benchmarks.
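
A small sketch of the multiscale decomposition idea, assuming a 1-D series of per-timestamp event counts and the PyWavelets package; the model itself combines these coefficients with a Fourier basis and attention, which is not shown here.

    import numpy as np
    import pywt

    t = np.arange(256)
    # toy event-frequency series with a periodic component plus noise
    series = np.sin(2 * np.pi * t / 32) + 0.3 * np.random.default_rng(0).normal(size=t.size)

    # three-level discrete wavelet decomposition: one coarse approximation plus detail bands
    coeffs = pywt.wavedec(series, "db4", level=3)
    approx, details = coeffs[0], coeffs[1:]

    print("approximation coefficients:", approx.shape[0])
    for level, d in enumerate(details, start=1):
        print(f"detail band {level}: {d.shape[0]} coefficients")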

CARGO-IoT: Cost-Aware Repair-based Genetic Optimization for Budget-Constrained IoT Service Composition

ABSTRACT. Budget-constrained IoT service composition aims to find service compositions that respect a given budget while optimizing quality of service, e.g., response time. Effectively addressing budget constraints in IoT service composition is practically important because real-world IoT compositions frequently operate under strict financial limits, necessitating solutions that deliver optimal performance without overspending. Traditional constraint-handling techniques, such as penalty methods, multi-objective approaches, and random repair operators, often struggle to properly balance budget feasibility and time optimization. To tackle this issue, we propose a Cost-Aware Repair-based Genetic Optimization (CARGO-IoT) algorithm that integrates priority-based selection of infeasible solutions with a two-stage repair process. Specifically, CARGO-IoT prioritizes recently generated solutions with lower constraint violations and response times for repair. It further introduces a two-stage repair process supported by two complementary methods: a novel replace-based method, which guides service replacements using a probability distribution learned from prior feasible solutions, and a reduce-based method that removes excess services to meet budget constraints. Extensive experiments demonstrate that our approach significantly outperforms penalty-based, ε-constrained, and existing repair-based methods, achieving lower response time, faster convergence, and better budget feasibility.

Contrastive Graph Clustering with Structure-Aware Enhancement

ABSTRACT. Contrastive graph clustering methods have attracted increasing attention due to their ability to jointly capture node semantics and structural information in an unsupervised manner. However, most existing methods rely heavily on shallow adjacency structures or sample pairs constructed in the feature space, without explicitly modeling or enhancing the underlying graph structure. This leads to limited capability in preserving topological patterns and thus restricts clustering performance. Moreover, conventional contrastive objectives struggle to capture intra-cluster structural patterns and inter-cluster differences. To address these limitations, we propose CGCSE, a contrastive graph clustering framework with structure-aware enhancement. CGCSE introduces a structural enhancement module to explicitly model graph semantics and construct a structure-aware similarity graph, generates structure-perturbed views guided by spectral features, and employs a clustering-aware contrastive loss to jointly optimize node representations and clustering discriminability. Extensive experiments on five public datasets demonstrate that CGCSE consistently outperforms existing methods in clustering performance.

IB-ToM: Human-AI Coordination for Unseen Partners with Evolving Strategies

ABSTRACT. We study the problem of human-AI collaboration for unseen human partners with evolving strategies. Existing zero-shot coordination algorithms enable agents to cooperate with human partners without prior adaptation. However, these methods typically assume that the human strategies remain fixed throughout the interaction. As a result, they are unable to accommodate shifts in human behavior. To address this limitation, we propose IB-ToM, a Theory of Mind module that leverages the Information Bottleneck principle to model the evolving behavioral patterns of human partners over time. IB-ToM learns a compact, behaviorally relevant latent representation from the human partner's past observations and actions, which is subsequently incorporated into the agent's policy to improve coordination performance. Furthermore, the Information Bottleneck constraint ensures that the ToM module extracts features most relevant to decision-making. Experimental results in the Overcooked environment demonstrate that IB-ToM significantly improves human-agent collaboration across a range of zero-shot coordination algorithms, yielding up to 89.68% higher rewards and demonstrating strong generalization to unseen partners and tasks.

Industrial Hierarchical Transformer: Process-Aware Patching and Hierarchical Attention for Industrial Tabular Fault Diagnosis

ABSTRACT. Data-driven fault diagnosis in modern industrial processes relies heavily on analyzing multi-sensor tabular data. While tree-based models like GBDT are widely used, they face limitations in complex end-to-end learning scenarios. Conversely, standard Transformer architectures often fail to exploit the implicit hierarchical structure inherent in industrial processes, treating variables as a flat or independent set. To address these limitations, we propose the Industrial Hierarchical Transformer (IHT), a novel architecture tailored for industrial fault diagnosis. IHT introduces a process-aware variable patching strategy combined with positional encoding, grouping variables into semantically coherent patches that reflect physical process stages. The model employs a hierarchical attention mechanism to capture both fine-grained dependencies among variables and high-level interactions across stages. In addition, we design a Top-Down Cross Attention (TDCA) module that enables semantic feedback from stage-level representations to refine variable-level features, improving representation quality and diagnostic accuracy. Extensive experiments on the Tennessee Eastman Process (TEP) benchmark demonstrate that IHT consistently outperforms state-of-the-art baselines in classification accuracy and F1-score. Moreover, IHT shows strong robustness under various noisy conditions, underscoring its practicality for real-world industrial deployment.

LLM-Based Simulation Tool for Clinician-Patient Communication Training: A Dual-Mode AI Approach

ABSTRACT. Effective clinician-patient communication remains a persistent challenge in healthcare, often hindered by medical jargon, cultural barriers, and time constraints. We introduce an AI-driven communication training tool leveraging large language models (LLMs) to help clinicians enhance clarity, empathy, and cultural sensitivity in clinical dialogue. The system supports both structured and open-ended dialogue simulations, providing real-time feedback to clinicians. The structured dialogue mode is designed for early-career clinicians: it presents common clinical scenarios (e.g., explaining a diagnosis) with multiple-choice response options, and each selection guides the conversation with the AI patient while maintaining clinical accuracy and patient-centered language. The open-ended dialogue mode, designed for experienced health workers, allows free-text input, simulating more natural, unscripted conversations; the AI assumes the role of a virtual patient, adapting its responses dynamically to the clinician's input. Feedback is available on demand and focuses on how well the clinician communicates complex information, responds empathetically, and maintains cultural appropriateness. Unlike prior work focused on patient-facing chatbots, our tool addresses the clinician communication training gap by enabling realistic, dynamic dialogue tailored to varying clinician experience levels. It is particularly useful for building foundational skills in a low-pressure environment. Evaluations with healthcare experts demonstrate the system's potential to improve medical communication, offering a scalable, adaptive communication training framework for modern healthcare settings.

OminiAdapt: Learning Cross-Task Invariance for Robust and Environment-Aware Robotic Manipulation

ABSTRACT. With the rapid development of embodied intelligence, leveraging large-scale human data for high-level imitation learning on humanoid robots has become a focal point of interest in both academia and industry. However, applying humanoid robots to precision operation domains remains challenging due to the complexities they face in perception and control processes, the long-standing physical differences in morphology and actuation mechanisms between humanoid robots and humans, and the lack of task-relevant features obtained from egocentric vision. To address the issue of covariate shift in imitation learning, this paper proposes an imitation learning algorithm tailored for humanoid robots. By focusing on the primary task objectives, filtering out background information, and incorporating channel feature fusion with spatial attention mechanisms, the proposed algorithm suppresses environmental disturbances and utilizes a dynamic weight update strategy to significantly improve the success rate of humanoid robots in accomplishing target tasks. Experimental results demonstrate that the proposed method exhibits robustness and scalability across various typical task scenarios, providing new ideas and approaches for autonomous learning and control in humanoid robots. The project will be open-sourced on GitHub.

Temporal Knowledge Graph Reasoning with Evolutionary Patterns and Global Latent Associations

ABSTRACT. Temporal knowledge graph reasoning aims to infer implicit knowledge from historical snapshots. Temporal reasoning methods generally need to consider both local and global information, yet traditional methods typically perform only local modeling over several recent consecutive snapshots to capture their evolutionary patterns: concurrent structural dependencies and the temporal evolution of facts. Existing methods have attempted to handle local and global snapshots concurrently and have achieved some progress. Yet, they still face two primary problems: (1) they are unable to fully and efficiently exploit the latent associations among cross-time snapshots, which makes it difficult to extract rich semantic information at the global scope; (2) the integration of local and global information is inadequate, making it difficult to combine representations at different levels adaptively. To this end, this paper proposes a novel reasoning model, EPGLA. In EPGLA, effective modeling of local and global information is achieved through the collaborative work of the Evolutionary Graph Unit, the Association Graph Unit, and the Gating Unit. EPGLA constructs an association graph module that connects snapshots across time based on historical facts, mining the latent associations among cross-time snapshots and the semantic information available at the global scope. A gating function is employed to fuse local and global information adaptively. Extensive experiments on five benchmark datasets demonstrate that EPGLA outperforms state-of-the-art extrapolation models in temporal knowledge graph reasoning tasks, validating the model's effectiveness.

Faster Words Memorization: Reaction Time-Aware Interval Repeat Memory Optimization

ABSTRACT. SSP-MMC is a strategy for memorizing English words based on optimizing spaced repetition; it aims to minimize the memorization cost while ensuring the memorization effect (such as reaching a target half-life). However, SSP-MMC still incurs a high memory cost. To address this issue, we propose SSP-MMC++, a new spaced-repetition algorithm based on dynamic adjustment for users' reaction time (the time from seeing a word to recalling its meaning). We argue that reaction time is an important parameter that reflects the fluency of memory retrieval. Specifically, we first add the reaction-time variable to the memory model to improve the state transition equation. We then add the reaction-time variable to the Bellman equation to find the optimal review policy. Experiments show that, compared with the original SSP-MMC algorithm, SSP-MMC++ improves the target half-life achievement rate by 22.8%, reduces the cost of memory retention by 16.8%, and increases the total amount of learning by 23.8%.
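
Schematically (our notation, not the paper's), the reaction-time-aware optimization can be read as a Bellman recursion over memory states augmented with the observed reaction time:

    V(h, rt) \;=\; \min_{\Delta t}\ \Big[\, c(rt) \;+\; \mathbb{E}_{(h', rt') \,\mid\, (h, rt),\, \Delta t}\, V(h', rt') \,\Big],

where h is the current half-life, rt the reaction time, Δt the chosen review interval, and c(rt) the per-review cost; the recursion terminates once the target half-life is reached.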

Hebb-inspired Low Rank Adapters for Large Language Models Fine-tuning

ABSTRACT. Backpropagation is the predominant method for pre-training and fine-tuning Large Language Models. At the same time, it is considerably demanding in terms of memory and hardware. This makes fine-tuning and pre-training very expensive, harmful to the environment due to the large carbon footprint, and raises barriers to the development of frontier models by new companies. This paper presents a novel fine-tuning strategy, Hebb-inspired Low Rank Adapters (HiLoRA), based on partially replacing backpropagation with a localized learning rule. Theoretically, the new strategy can bring up to 2x acceleration to the fine-tuning of any Large Language Model. Early experiments demonstrate the flexibility of the new method, resulting in a lossless 1.47x acceleration and memory reduction for LLaMA-2-7B fine-tuning.
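
The abstract does not give the local rule itself; as a loose sketch, a Hebbian-style update for a low-rank adapter pair (A, B) could correlate the adapter's low-rank code with a post-synaptic signal instead of a backpropagated gradient, as below. The shapes, the decay term, and the choice of post-synaptic signal are all illustrative assumptions, not HiLoRA's actual rule.

    import torch

    d_model, r, lr = 768, 8, 1e-3
    A = torch.randn(r, d_model) * 0.01      # down-projection, kept fixed in this sketch
    B = torch.zeros(d_model, r)             # up-projection, updated by the local rule

    def hebbian_adapter_step(x: torch.Tensor, post: torch.Tensor) -> None:
        """One local update: no backward pass, only pre/post activity correlations.

        x:    (batch, d_model) activations entering the adapter
        post: (batch, d_model) post-synaptic signal standing in for a gradient
        """
        global B
        code = x @ A.t()                              # (batch, r) low-rank code
        B += lr * post.t() @ code / x.size(0)         # Hebbian outer-product update
        B -= lr * 0.01 * B                            # mild decay keeps the adapter bounded

    x = torch.randn(16, d_model)
    hebbian_adapter_step(x, post=torch.tanh(x))       # toy post-synaptic signal
    print(float(B.abs().mean()))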

An Active Structure-learning Strategy for LLMs-based Document-level Event Extraction
PRESENTER: Jinming Zhang

ABSTRACT. Document-level event extraction (DocEE) is a challenging task that requires identifying structured event arguments spread across a whole document. Large language models (LLMs) have strong language comprehension ability, but they know nothing about the schemas of different event types. To align the output of LLMs, each document is typically concatenated with predefined event schemas and descriptions before being fed into the LLM, so that the model knows the structure of the event. In our view, this is a passive structure-learning strategy, because an explicit prompt must be given in advance to tell the LLM the structure of the output. In this paper, we propose an active structure-learning strategy for LLMs, in which LLMs are directly tuned with structured objective outputs. This enables LLMs to "understand" the structure of events without any schemas or descriptions in the input. Our experiments achieve state-of-the-art performance on two evaluation datasets and in online testing.

Empowering Māori Automatic Speech Recognition through EMD-Based Augmentation

ABSTRACT. Low-resource languages like Māori face significant challenges in developing robust Automatic Speech Recognition (ASR) systems due to limited annotated data and linguistic resources. This paper proposes a novel data augmentation framework that enriches training data for ASR models through Empirical Mode Decomposition (EMD) based frequency band perturbation. EMD is employed to decompose speech signals into intrinsic mode functions (IMFs), enabling selective removal of specific frequency components to simulate variations in speaker traits and acoustic environments. To further enhance data diversity, phase spectrum augmentations are also incorporated, providing additional variability without significantly altering signal intelligibility. Experiments on a self-collected 17-hour Māori speech corpus demonstrate consistent improvements across three ASR architectures, including DeepSpeech, Wav2Vec 2.0 XLS-R, and HuBERT. The proposed method significantly reduces Word Error Rates (WER), especially when combined with SpecAugment, underscoring its complementary benefits and effectiveness in enhancing generalization for Māori ASR.
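
A minimal sketch of the band-removal augmentation, using the PyEMD (EMD-signal) package on a toy waveform: decompose into IMFs, drop one band, and resynthesise. Which bands are perturbed, and the accompanying phase-spectrum augmentation, follow the paper's policy and are not reproduced here.

    import numpy as np
    from PyEMD import EMD

    sr = 16000
    t = np.linspace(0, 1, sr, endpoint=False)
    signal = np.sin(2 * np.pi * 220 * t) + 0.4 * np.sin(2 * np.pi * 1800 * t)   # toy stand-in for speech

    imfs = EMD().emd(signal)                 # intrinsic mode functions, highest frequency first

    def drop_imf(imfs: np.ndarray, k: int) -> np.ndarray:
        """Resynthesise the signal with the k-th intrinsic mode function removed."""
        keep = [i for i in range(imfs.shape[0]) if i != k]
        return imfs[keep].sum(axis=0)

    augmented = drop_imf(imfs, k=0)          # simulate removing the highest-frequency band
    print(imfs.shape, augmented.shape)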

ADR-CSL and SPSA-Assisted Feature Enhancement for Aerial Detection

ABSTRACT. Object detection in aerial images is highly challenging due to the arbitrary orientation of targets and the scarcity of detail in small targets. To address the periodicity and boundary discontinuity of the angle representation in traditional rotated-bounding-box detection, as well as the difficulty of balancing spatial detail with semantic enhancement in feature representation, this paper proposes an Adaptive Radius Circular Smooth Label (ADR-CSL) and a Lightweight Feature Enhancement Module (SPSA). Detection performance is improved through geometric prior guidance and hierarchical feature interaction. Experiments on the DOTA and HRSC2016 datasets show that the proposed method achieves an mAP of 63.1% on the DOTA1.0 dataset, a 1.1% improvement over the baseline, with significantly improved detection accuracy for high-aspect-ratio targets and boundary-angle targets. On the HRSC2016 dataset, the mAP reaches 89.8%.

Hybrid CNN-LSTM-AE Framework with LLM-Driven Sentiment Analysis for Anomaly Detection within the Cryptocurrency Markets

ABSTRACT. Recent advances in deep learning and sentiment analysis have opened new avenues for anomaly detection in volatile financial markets, particularly within the cryptocurrency sector. Traditional statistical and price-based anomaly detection methods often struggle to adapt to the non-stationary and sentiment-driven dynamics of these markets. To address this, we propose a novel hybrid framework that integrates a Convolutional Neural Network-Long Short Term Memory-Autoencoder (CNN-LSTM-AE) architecture with sentiment analysis derived from large language models (LLMs). This approach fuses spatial, temporal, and behavioural signals to identify subtle market irregularities and regime shifts frequently overlooked by conventional models. Recognising the inherent limitations and occasional inconsistencies in LLM-generated sentiment data, we incorporate an ensemble sentiment scoring mechanism to enhance reliability. The refined sentiment signals, combined with market features such as OHLCV data, improve the model’s detection accuracy under highly dynamic conditions. Experimental results demonstrate that the proposed sentiment-augmented anomaly detection framework outperforms baseline models, offering a more resilient and context-aware tool for cryptocurrency market surveillance and risk management.

Unveiling Bias in the Autism AI Dataset: A Patterned Sampling Approach for Balanced Learning

ABSTRACT. Autism Spectrum Disorder (ASD) is a complex and heterogeneous neurodevelopmental condition for which early detection is critical to ensure effective treatment and support. One potential approach to address these challenges and expedite autism referrals and diagnosis is to leverage Artificial Intelligence (AI) algorithms and models. However, AI systems often suffer from algorithmic bias due to dataset class imbalance. Conventional resampling methods, such as random undersampling, oversampling, and synthetic data generation, are commonly used to mitigate this imbalance, but they can distort the data distribution or introduce artificial noise, often leading to overfitting. In this study, we used the Autism AI dataset, one of the largest available behavioural datasets, which contains over 12,000 caregiver-reported samples. Despite its size, the dataset suffers from class imbalance, with a significantly higher number of autistic samples than non-autistic ones, leading to biased model outcomes. To address these challenges, this study proposes a novel patterned non-synthetic sampling strategy called Rule-Based Screening Sampling (RSS). RSS is a pattern-driven approach that selects confident minority-class samples using rule-based confidence criteria, maintaining data integrity while effectively handling class imbalance. Comparative evaluation with conventional sampling methods demonstrates that RSS significantly improves key performance metrics, such as unweighted average recall and precision, which are often compromised on imbalanced clinical datasets. While evaluated on the Autism AI dataset, the approach could be generalized to other clinical datasets by identifying dataset-specific patterns and applying rule-based sampling without relying on synthetic data.

Player position classification with fuzzy clustering
PRESENTER: Tomasz Górecki

ABSTRACT. The football transfer market is inherently complex, as a player's valuation depends on many intertwined factors—nationality, individual performance metrics, agents, rival clubs, league prestige, fan engagement, and media narratives—all of which can introduce biases in decision making. Effective team building requires a clear understanding of the existing squad's strengths and playing styles, so that recruits complement the team's tactical needs and long-term strategy. This study systematically analyzes a wide range of on-pitch and contextual attributes to identify those most predictive of specific roles within each position. Using a data-driven clustering and classification framework, we derive clear player profiles that support more objective, transparent transfer decisions. Our findings provide clubs with actionable insights into the traits that drive both individual performance and squad cohesion, ultimately enhancing their strategic planning for future seasons.
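
The abstract does not name a specific fuzzy algorithm; a common choice, fuzzy c-means as implemented in scikit-fuzzy, is sketched below on a toy standardised player-feature matrix. The membership matrix then indicates how strongly each player fits each role profile.

    import numpy as np
    import skfuzzy as fuzz

    rng = np.random.default_rng(0)
    players = rng.normal(size=(300, 4))      # toy standardised features, e.g. tackles, passes, shots, pressures

    n_roles, m = 4, 2.0                      # number of role profiles, fuzziness exponent
    cntr, u, _, _, _, _, fpc = fuzz.cluster.cmeans(
        players.T, c=n_roles, m=m, error=1e-5, maxiter=1000, seed=0
    )

    print("fuzzy partition coefficient:", round(float(fpc), 3))
    print("role memberships of player 0:", np.round(u[:, 0], 2))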

Graph of Now and Past Network: A Novel Approach for Dynamic Temporal Graphs Learning

ABSTRACT. Dynamic temporal graphs have become essential in developing recommendation systems and in social science research. In this study, we introduce a novel approach for dynamic temporal graph networks named Graph of Now and Past Network (GNP), which enhances temporal graph learning by splitting what is traditionally treated as past data into the past and the now. The proposed approach simultaneously addresses the issue of data staleness and emphasizes new edge features and active neighbor information within the existing memory structure. GNP supports real-world applications that demand continuous adaptation, offering timely and context-aware predictions by modeling both recent and historical interaction patterns. Extensive experiments reveal that the proposed model outperforms state-of-the-art baselines, including TGN, on four public temporal datasets. The results also indicate that GNP demonstrates high stability and rapid convergence during training. The source code is available at https://anonymous.4open.science/r/GNP-2CB4/.

AutoSAD: An Adaptive Framework for Streaming Anomaly Detection

ABSTRACT. Streaming anomaly detection faces significant challenges due to the evolving nature of data distributions and the lack of labeled data for model optimization. Existing approaches rely on fixed hyperparameters and single algorithms, which limits their adaptability in dynamic environments. We present AutoSAD, an adaptive framework that automatically optimizes model selection and hyperparameters for unsupervised streaming anomaly detection. Our approach maintains a diverse set of anomaly detectors and employs multi-armed bandit optimization for intelligent model selection. Additionally, the framework incorporates evolutionary hyperparameter adaptation, enabling it to continuously adjust to changing data characteristics over time. Experimental evaluation shows that AutoSAD outperforms state-of-the-art anomaly detection algorithms, effectively balancing exploration and exploitation. It achieves significantly better predictive performance while maintaining computational efficiency in streaming scenarios.
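
A compact sketch of bandit-style model selection of the kind described, using a UCB1 rule over a pool of streaming detectors. The reward signal in a fully unsupervised stream is a design choice of the framework; a random stand-in proxy and an illustrative detector pool are used below.

    import math
    import random

    detectors = ["half_space_trees", "streaming_iforest", "loda", "knn_cad"]   # illustrative pool
    counts = {d: 0 for d in detectors}
    total_reward = {d: 0.0 for d in detectors}

    def select_detector(t: int) -> str:
        """UCB1: play each arm once, then balance mean reward against an exploration bonus."""
        for d in detectors:
            if counts[d] == 0:
                return d
        return max(detectors,
                   key=lambda d: total_reward[d] / counts[d] + math.sqrt(2 * math.log(t) / counts[d]))

    for t in range(1, 501):                       # simulated stream of 500 windows
        d = select_detector(t)
        reward = random.random() * (0.8 if d == "half_space_trees" else 0.5)   # stand-in reward proxy
        counts[d] += 1
        total_reward[d] += reward

    print(counts)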

Affective Resonance to Agency Alignment: Sense of Agency in Human–centered AI

ABSTRACT. While affective computing has established itself as a cornerstone of human-centered AI, a subtle but consequential epistemic gap has emerged: a mismatch between a system's affective resonance and its causal intelligibility. When AI makes decisions that feel right emotionally but are opaque in reasoning or uncontrollable in behavior, users lose the agentic thread that ties them to causal outcomes. This gap highlights a fundamental dimension of human experience that remains underrepresented in AI system design, the so-called sense of agency (SoA) -- the feeling of being in control of one's own actions and their causal outcomes. In this paper, we argue that computing for affect and aligning with SoA are methodologically and functionally distinct, even if interrelated. While the former emphasizes emotion recognition and affective feedback, the latter shifts the focus to the preservation and enhancement of the user's SoA during interactions. We further argue that an AI system that misreads affect but supports agency might be trusted more than a system that reads affect but suppresses agency. We build on recent work in the neurocognitive and behavioral sciences and in human-centered AI to propose the research agenda of agency alignment, i.e., the development of AI systems that preserve and reinforce the user's sense of self and agency. Thus, we advocate for the AI community to treat human SoA not as a soft design issue, but as a computational problem in its own right, one that is central to trust and accountability in intelligent systems.

Output Structure Simplification to Enhance Transformer-Based Text-to-Workflow Translation

ABSTRACT. Workflows are structured sequences of tasks that define how a process is executed; they can be used to organize, standardize, and automate processes. Manually creating a workflow from a natural language description is challenging and time-consuming. One potential solution is to automate this process with a Transformer model. However, translating text into a particular workflow format with Transformers can be challenging due to the structural complexity of the target format and a diverse vocabulary, which can reduce accuracy. A data-centric method (Text-to-SSF-Workflow) is proposed to improve Transformer-based Text-to-Workflow translation by simplifying the output structure and reducing vocabulary size through abstraction, without modifying the Transformer's architecture. In the absence of publicly available Text-to-Workflow datasets, this study utilizes a Text-to-Source-Code dataset, where the source code is converted into workflows represented in a JSON format and a Simplified Structured Format (SSF). The conversion from source code is achieved through Abstract Syntax Tree (AST) parsing, followed by node-level abstraction using FAISS K-Means clustering. A domain-adapted tokenizer, based on BERT, is also introduced to better handle code-like syntax in workflow formats during Transformer model training. Preliminary experiments show that training a standard Transformer model with a simplified-format dataset can significantly improve translation accuracy.

MACGEC: An LLM-Driven Multi-Agent Framework for Chinese Grammatical Error Correction

ABSTRACT. To address the limitations of monolithic analysis and uncontrollable revisions in traditional Chinese grammatical error correction (CGEC) methods, this paper proposes a multi-agent collaborative CGEC framework with three core capabilities: memory, execution, and planning. Driven by large language models (LLMs), the framework operates as follows: (1) the Tokenization Agent generates structured part-of-speech sequences as semantic primitives; (2) the Syntax Validation Agent and the Semantic Verification Agent conduct, in parallel, formal rule detection and logical-semantic verification to identify conflicts and inconsistencies; (3) the Error Classification Agent integrates multi-source features for dynamic error classification and precise localization with linguistically logical labels; and (4) the Error Correction Agent adheres to the minimal-intervention principle to generate and select optimal revision plans via LLMs and a three-dimensional evaluation model. Experiments show the framework significantly boosts error-correction performance, with MACGEC achieving an F0.5 of 44.72 on NaCGEC, outperforming DeepSeek-R1 32B CoT (37.32) by 19.8%. In terms of time efficiency, MACGEC processes in 98s, 15.5% faster than CoT's 116s, while maintaining higher accuracy. This demonstrates its advantages in detection, flexibility, revision quality, and real-time performance, along with modular and adaptive capabilities.

In-Vivo Biosensors and Visual Data for Precision Agriculture: a Multimodal Approach for Water Stress Detection in Tomato Plants
PRESENTER: Riccardo Pecori

ABSTRACT. Early and accurate plant stress prediction is fundamental in precision agriculture to optimize resource use and enhance crop yield. This paper introduces a multimodal framework for classifying water stress in tomato plants by exploiting data from novel in-vivo biosensors and images. The method combines temporal electronic signals from the biosensors with RGB and NIR images captured from different viewpoints. The approach employs a Transformer for the biosensor data and pretrained CLIP-based encoders for the visual data, which are fused before a cross-attention mechanism is applied. The system classifies plant health status into four categories: Healthy, Stress, Uncertain, and Recovery. The results demonstrate that the multimodal model outperforms single-modal baselines and also performs well in distinguishing ambiguous states such as the "Recovery" class. This work highlights the effectiveness of the innovative biosensor combined with a multimodal framework for smart agriculture, with implications for sustainable crop management and stress mitigation, a pressing socioeconomic challenge worldwide.

Patching LLMs Efficiently for Edge Devices

ABSTRACT. Over-the-air (OTA) updates are essential for maintaining deployed large language models (LLMs) on edge devices, a trend accelerated by the success of compact models such as DeepSeek R1. However, we find existing delta encoding algorithms often perform poorly when patching LLMs and other AI models, as their core assumptions do not hold for model data. We propose ResComp, a residual-based differencing algorithm tailored to the structural alignment and low compressibility of model weights. Instead of indexing the old version and scanning the new for scattered matches, ResComp directly computes the residual sequence between aligned models and compresses it using the bzip3 compressor, which achieves better compression ratios and runs faster than traditional high-ratio alternatives. Extensive experiments on popular open-weight LLMs and Stable Diffusion variants show that ResComp significantly outperforms traditional algorithms in patch size, memory use, and differencing speed. An additional Run-Length Encoder (RLE) enhancement further improves patching speed by ~30% on a real edge device, making ResComp an efficient and practical choice for industrial model updates.
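
The core of residual-based differencing fits in a few lines: XOR (or subtract) the aligned weight tensors and compress the structured residual. The sketch below uses NumPy and the standard library's lzma as a stand-in for the bzip3 compressor named in the paper; the exact residual definition and the achievable patch sizes depend on the real models being compared.

    import lzma
    import numpy as np

    rng = np.random.default_rng(0)
    old = rng.normal(size=1_000_000).astype(np.float32)                        # "old" flattened weights
    new = old + rng.normal(scale=1e-3, size=old.size).astype(np.float32)       # lightly updated weights

    residual = old.view(np.uint32) ^ new.view(np.uint32)     # aligned element-wise delta (bitwise XOR)
    patch = lzma.compress(residual.tobytes(), preset=6)      # bzip3 in the paper; lzma as a stand-in
    print("full model:", new.nbytes, "bytes; patch:", len(patch), "bytes")

    # on-device patching: old weights + residual -> new weights
    delta = np.frombuffer(lzma.decompress(patch), dtype=np.uint32)
    restored = (old.view(np.uint32) ^ delta).view(np.float32)
    assert np.array_equal(restored, new)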

PGMN: Subgraph Partition-Based Graph Matching Neural Network

ABSTRACT. Subgraph matching aims to find all isomorphic embeddings of a query in a data graph, and it has gained increasing importance in many real-world applications such as computer vision, knowledge graph search, and bioinformatics. Traditional subgraph matching algorithms provide precise results but face challenges on large graphs because subgraph matching is an NP-complete problem, limiting their practical applicability. With the development of Graph Neural Networks (GNNs), various GNN-based algorithms have been proposed that first obtain the embeddings of the query and the data graph and then deliver results by comparing these embeddings. Compared with traditional methods, GNN-based algorithms can improve efficiency to some extent, but they still cannot achieve satisfactory matching efficiency on real-world workloads. To this end, we introduce a partition-based neural network algorithm called Subgraph Partition-based Graph Matching Neural Network (PGMN). In PGMN, we first propose a partitioning framework that effectively partitions the initial data graph into smaller partitions to accelerate the subgraph matching process. Then, a novel GNN-based method is proposed to efficiently obtain the subgraph matching results in each partition. Extensive experiments demonstrate that our proposed method outperforms state-of-the-art methods for subgraph matching, especially on large-scale graph datasets.

CLRec: A Contrastive Learning Framework with Enhanced Labeling for Multi-functional LLM-Based Recommendations

ABSTRACT. Recent advances have demonstrated that large language models (LLMs) possess strong capabilities in capturing complex patterns within user-item interactions, thereby attracting increasing attention in recommendation systems. However, significant challenges remain in accurately mapping user and item identifiers into the semantic space of LLMs and effectively distinguishing subtle differences among items during the recommendation process. In this study, we propose CLRec: A Contrastive Learning Framework with Enhanced Labeling for Multi-functional LLM-Based Recommendations. To strengthen the model's sensitivity to semantic differences, we propose a structured label augmentation mechanism that integrates word-level semantics, role types, and template structural information. This fusion refines the representational boundaries among users, items, and contexts. Additionally, we develop a collaborative contrastive learning framework combining implicit and explicit strategies: the implicit component enhances representation alignment by implicitly maximizing the mutual information between latent embeddings, while the explicit component introduces semantic-level constraints via QA-style supervision. Together, these dual strategies significantly improve the model's ability to distinguish semantically similar items within a high-dimensional latent space, with explicit contrastive learning being proposed and applied to recommendation systems for the first time. Extensive experiments on real-world Amazon datasets demonstrate that our method consistently outperforms several state-of-the-art baselines across three key tasks (sequential recommendation, top-N recommendation, and recommendation explanation), achieving average performance gains of 8%, 5%, and 4.6%, respectively.

Enhancing Keyword Spotting in Mongolian Lead-Type Newspapers through Intermediate Encoding within a Multimodal Framework

ABSTRACT. This paper proposes an innovative deep learning model based on a multimodal framework for efficient keyword spotting in Mongolian lead-type newspaper images. The model integrates visual and textual data to enhance keyword localization accuracy and robustness, supporting both Query-by-Example (QbE) and Query-by-String (QbS) tasks. To address homoglyphic heterogeneity in Mongolian graphemes (i.e., different characters sharing identical glyphs), an intermediate code mapping mechanism based on Unicode standardization is applied, which unifies morphologically identical characters into a normalized intermediate representation layer, achieving a QbS accuracy of 95.57%. Additionally, to tackle the contextual dependency of character forms in Mongolian cursive writing, learnable positional encoding is incorporated into the word vector embedding layer. This enables the model to dynamically capture morphological changes of characters at different word positions, further improving QbS accuracy to 95.69%. Experiments demonstrate the model’s robustness under noisy conditions, confirming its effectiveness for Mongolian lead-type newspaper retrieval. This study provides a highly practical and reliable deep learning solution for the field of Mongolian image retrieval, laying a solid foundation for further enhancing the digital retrieval capabilities of Mongolian books, newspapers, and other literary resources.

CDFG:Enhancing Chain-of-Thought Distillation with Feedback

ABSTRACT. Chain-of-thought (CoT) prompting has shown great potential in enhancing the reasoning capabilities of large language models (LLMs), and recent studies have explored distilling this ability into smaller models. However, existing CoT distillation methods often overlook student model errors as valuable learning signals. In this paper, we propose CDFG, a two-stage distillation framework that treats model errors as opportunities for improvement. After an initial imitation-based training phase, the teacher model analyzes the student’s incorrect outputs and generates natural language feedback that highlights reasoning flaws and suggests correction strategies. The student model is then retrained using this guided input. Experiments on several mathematical reasoning benchmarks demonstrate that CDFG consistently improves student model performance. Our results show that incorporating feedback-driven learning into CoT distillation can enhance reasoning accuracy.

MoCo-ANA: MoCo-like Adaptive Neighbor Aggregation for Face Clustering

ABSTRACT. Current face clustering methods rely on global graph structures for feature propagation, which leads to high computational costs and limited constraints about global distribution. We propose MoCo-ANA, a framework that eliminates global graph dependency and decouples local adaptive aggregation from global representation learning. Our method employs a feature-structure co-attention (FS-CoAttn) mechanism for dynamic neighbor selection and introduces a multi-constraint objective including a MoCo-like supervised contrastive loss and hypersphere uniformity regularization. Extensive experiments demonstrate that MoCo-ANA achieves SOTA performance on MS-Celeb-1M and DeepFashion, with ablation studies validating the efficacy of the proposed core components.

A Study on the Performance of Real-Coded Genetic Algorithms for the Circle Packing Problem Using Q-Learning

ABSTRACT. The Circle Packing Problem (CPP), a key problem in combinatorial optimization, involves arranging circles to maximize packing density within a defined space. The problem, known for its computational complexity and classified as NP-hard, is difficult to solve using conventional optimization methods, particularly in higher dimensions. Genetic Algorithms (GAs) are employed in this study due to their robustness and effectiveness in navigating the complex search spaces typical of NP-hard problems. We utilize Real-Coded Genetic Algorithms (RCGAs) to address the CPP. The efficacy of nine distinct RCGA variants is evaluated by comparing their performance against the benchmark instances listed on the Packomania website and existing state-of-the-art results. Additionally, Q-learning is incorporated to dynamically control the genetic parameters, enabling adaptive search behavior and further enhancing packing performance. The performance of these algorithms is evaluated using the mean and best objective function values, together with Friedman's mean ranking test. On the basis of the computational analysis, the RCGA variant using Burr Crossover with Polynomial Mutation (BX-PLYM) outperforms all other RCGA variants.

FATS: A Prompt Injection Attack Utilizing Feign Security Agents with Deceptive Few-shots Learning

ABSTRACT. Large Language Models (LLMs) face significant security risks despite their advanced capabilities. While techniques like Reinforcement Learning with Human Feedback (RLHF) improve ethical alignment, excessive exposure to security-related training data may cause LLMs to overtrust such information, creating new vulnerabilities. To investigate this issue, we propose a novel attack method termed FATS (Feign Agent Attack with Toxic-shots). By obfuscating preference extraction, compromising toxicity samples, and inducing malicious behavior, this method manipulates LLMs into generating harmful content. To evaluate FATS effectiveness, we introduce the FAQuery dataset and conduct experiments on various LLMs. Well-known benchmarks like Advbench were selected to assess the approach. Results demonstrate that mainstream models, including GPT-4.1 (61.6%) and Deepseek-R1 (99.3%) are highly susceptible. Subsequent ablation and defense experiments reveal flaws in existing safeguards (e.g., Llama-Guard), further characterizing FATS vulnerabilities. These findings underscore the need to rigorously analyze security-related data sources during model training, a crucial step toward developing more secure and reliable LLMs.

LLM-KGPlan: Long-Horizon Task Planning via Knowledge-Guided Reasoning

ABSTRACT. Long-horizon task planning by LLM-based intelligent agents plays a critical role in advancing robotic autonomy. While Large Language Models (LLMs) exhibit strong reasoning abilities, they often lack perception of the physical environment, leading to long-horizon task plans that are either logically inconsistent or infeasible to execute in real-world scenarios. In this work, we propose a systematic planning framework that enhances LLM-based reasoning via rule-guided Chain-of-Thought prompting and symbolic validation grounded in a domain-specific knowledge graph (KG). The proposed approach enables intelligent agents to decompose complex instructions into coherent subgoals, reason about inter-task dependencies, and generate executable action plans. The KG further serves as a symbolic verifier, enforcing object-action relationships and environmental constraints to ensure the correctness and feasibility of the generated plans. Experimental results in the VirtualHome simulation environment demonstrate that our method improves task planning success rates from 34% to 78%, significantly outperforming existing LLM-based baselines in both plan quality and execution reliability.

Leveraging Runtime Information for LLM Quantization

ABSTRACT. The increasing size and context length of large language models (LLMs) pose significant challenges for memory usage during inference, limiting their deployment on edge devices. Post-training quantization (PTQ) offers a promising solution by reducing memory requirements and improving computational efficiency, but aggressive PTQ methods often lead to significant performance degradation. To address this, we propose LazyQuant, which leverages two key insights based on runtime information during the LLM inference process: (1) the precision of the initial key-value (KV) cache segments strongly influences model performance, and (2) space for the KV cache can be allocated later during inference. Instead of relying on static, fully quantized weights, LazyQuant reduces weight size only when memory is tight, leveraging previously generated KV caches, created with higher-precision weights, to mitigate precision loss. Our pilot experiments show that LazyQuant surpasses state-of-the-art methods under limited memory budgets.
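The first insight, that early KV-cache entries deserve higher precision, can be pictured with a toy mixed-precision cache that keeps the first few cached positions in float16 and stores later ones as int8 with a scale. This is only a sketch of the general idea, not the LazyQuant implementation.

```python
"""Toy illustration of precision-aware KV caching: the first `keep_fp16`
cached positions stay in float16 while later positions are stored as int8
with a per-token scale.  This mirrors only the first observation in the
abstract (early KV entries matter most); it is not LazyQuant itself."""
import numpy as np

def quantize_int8(x):
    scale = np.abs(x).max() / 127.0 + 1e-12
    return (x / scale).round().astype(np.int8), np.float32(scale)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

class MixedPrecisionKVCache:
    def __init__(self, keep_fp16=32):
        self.keep_fp16 = keep_fp16
        self.entries = []                    # one entry per cached position

    def append(self, k, v):
        if len(self.entries) < self.keep_fp16:
            self.entries.append(("fp16", k.astype(np.float16), v.astype(np.float16)))
        else:
            self.entries.append(("int8", quantize_int8(k), quantize_int8(v)))

    def materialize(self):
        """Return full-precision K and V matrices for attention."""
        ks, vs = [], []
        for kind, k, v in self.entries:
            if kind == "fp16":
                ks.append(k.astype(np.float32)); vs.append(v.astype(np.float32))
            else:
                ks.append(dequantize(*k)); vs.append(dequantize(*v))
        return np.stack(ks), np.stack(vs)

cache = MixedPrecisionKVCache(keep_fp16=2)
for _ in range(4):                           # simulate four decoded tokens
    cache.append(np.random.randn(64), np.random.randn(64))
K, V = cache.materialize()
print(K.shape, V.shape)                      # (4, 64) (4, 64)
```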

Model Evaluation with Precision, Recall, and f1 Measure Based on Block-regularized mx2 Cross Validation for Text Corpus

ABSTRACT. For a given text corpus, model evaluation primarily comprises model prediction performance estimation and model comparison. Traditional methods are usually based directly on point estimates of performance evaluation metrics, such as precision (p), recall (r), and the f1 measure, to compare model performance. However, these methods can easily lead to poorly replicable or even incorrect model comparison conclusions owing to the lack of probability evaluation (statistical error estimation). Thus, this study proposes voting aggregation estimators of p, r, and the f1 measure based on block-regularized mx2 cross validation. A new Bayes test is then constructed on top of the proposed estimators to compare model performance. The proposed Bayes test provides a probability estimate of the superiority of one model over another, giving more reliable model comparison conclusions. Furthermore, several theoretical properties are proved: an upper bound on the expectations of the proposed estimators, which represents the theoretical maximum performance the model can achieve on a given text corpus, and a lower bound on the Bayes factor of the proposed Bayes test, which is controlled by the signal-to-noise ratio (SNR). Finally, extensive experiments on several natural language processing (NLP) tasks, such as semantic role labeling (SRL), named entity recognition (NER), and organization (ORG) entity recognition within the NER task (NER-ORG), demonstrate the effectiveness and superiority of the proposed estimators and Bayes test.
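For readers unfamiliar with the m×2 setup, the sketch below computes precision, recall, and F1 over the 2m folds of an m×2 cross-validation. It uses plain random 2-fold splits and a simple mean over folds; the paper's block-regularized partitioning, voting-aggregation estimators, and Bayes test are not reproduced.

```python
"""Minimal sketch of m x 2 cross-validation estimates of precision, recall
and F1 on synthetic data.  Plain random 2-fold splits and a simple mean over
the 2m folds are used here; the block-regularized partitioning and
voting-aggregation estimators of the paper are not reproduced."""
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
m = 3                                               # number of 2-fold repetitions
scores = {"p": [], "r": [], "f1": []}

for rep in range(m):
    skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=rep)
    for train_idx, test_idx in skf.split(X, y):
        clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        scores["p"].append(precision_score(y[test_idx], pred))
        scores["r"].append(recall_score(y[test_idx], pred))
        scores["f1"].append(f1_score(y[test_idx], pred))

for name, vals in scores.items():
    vals = np.array(vals)
    print(f"{name}: mean={vals.mean():.3f}  std over 2m folds={vals.std(ddof=1):.3f}")
```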

TFA-DQAS: Temperature-Feedback-Adapted Differentiable Quantum Architecture Search

ABSTRACT. Quantum Architecture Search (QAS) facilitates the automated creation of gate-based quantum circuits tailored for specific computational tasks, showing remarkable effectiveness in areas such as quantum state preparation, quantum image processing, and calculating molecular ground-state energies. Building on differentiable neural architecture search techniques, Differentiable Quantum Architecture Search (DQAS) improves search efficiency through continuous relaxation and gradient descent methods. Nonetheless, current differentiable techniques struggle to effectively balance exploration and exploitation, leading to reduced accuracy and unnecessary computational efforts. To overcome this issue, we introduce Temperature Feedback-Adapted Differentiable Quantum Architecture Search (TFA-DQAS), which dynamically adjusts exploration-exploitation strategies by tracking real-time convergence trends. Experimental findings indicate that our approach achieves ground-state energy prediction accuracies of 4.3×10⁻¹⁰ (H₂), 2.0×10⁻⁵ (LiH), 1.1×10⁻⁶ (TFIM), and 2.9×10⁻⁵ (Heisenberg), with peak accuracy exceeding DQAS by factors ranging from 1.3× to 10,000×, while also demonstrating better overall performance compared to existing methods. The adaptive characteristics of TFA-DQAS make it particularly well-suited for near-term quantum devices with limited circuit depth and gate counts. Comprehensive validation in quantum chemistry applications confirms the effectiveness of our method in producing optimized quantum circuits.
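The temperature-feedback idea can be pictured with a toy differentiable search in which candidate operations are mixed through a temperature-scaled softmax, and the temperature is raised when the loss plateaus (more exploration) and lowered while it keeps improving (more exploitation). The update rule and thresholds below are invented for illustration and are not the TFA-DQAS schedule.

```python
"""Toy illustration of temperature feedback for a differentiable architecture
search: architecture logits are mixed through a temperature-scaled softmax,
and the temperature reacts to the recent loss trend.  The feedback rule and
thresholds are illustrative assumptions, not the TFA-DQAS schedule."""
import numpy as np

rng = np.random.default_rng(0)
n_ops = 4
alpha = np.zeros(n_ops)                      # architecture logits
op_quality = np.array([0.9, 0.3, 0.5, 0.1])  # hidden "true" usefulness of each op
tau, lr = 1.0, 0.5
history = []

def softmax(z, t):
    z = (z - z.max()) / t
    e = np.exp(z)
    return e / e.sum()

for step in range(200):
    w = softmax(alpha, tau)
    loss = 1.0 - w @ op_quality + 0.01 * rng.standard_normal()
    # gradient of the loss w.r.t. alpha through the softmax mixture
    grad = -(np.diag(w) - np.outer(w, w)) @ op_quality / tau
    alpha -= lr * grad
    history.append(loss)
    if len(history) >= 10:                   # temperature feedback on recent trend
        improvement = history[-10] - history[-1]
        tau = min(tau * 1.05, 5.0) if improvement < 1e-3 else max(tau * 0.97, 0.1)

print("selected op:", int(softmax(alpha, tau).argmax()), "final tau:", round(tau, 3))
```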

Controlled Cluster Separation for Class Incremental Learning

ABSTRACT. Class Incremental Learning (CIL) has attracted a lot of attention due to its ability to continuously acquire knowledge from streaming data. However, catastrophic forgetting remains a central challenge in CIL. To alleviate this issue, we propose using Controlled Cluster Separation Gaussian Mixture Model (CCS-GMM) to preserve knowledge of category features. CCS can guide the feature distributions of each class into multiple clusters. This multi-cluster structure enables a more fine-grained characterization of class features from multiple dimensions, thereby enhancing the model's class representation capability and constructing clearer class boundaries. As a result, features of new classes are less likely to invade the regions of old classes, significantly reducing interference between classes and effectively mitigating catastrophic forgetting. Meanwhile, CCS improves the GMM fitting accuracy of feature distributions, further boosting overall recognition performance. Extensive experiments demonstrate that our method improves performance across several CIL benchmarks.
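The multi-cluster modelling at the heart of this approach can be sketched by fitting one small Gaussian mixture per class and classifying by per-class mixture log-likelihood; the controlled-cluster-separation objective itself is the paper's contribution and is not reproduced here.

```python
"""Minimal sketch of the per-class multi-cluster idea: each class's features
are modelled by a Gaussian mixture with several components, and a test
feature is assigned to the class with the highest mixture log-likelihood.
The controlled-cluster-separation objective of CCS-GMM is not reproduced."""
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

def make_class(centres, n=200):
    """Synthetic multi-modal 2-D features for one class."""
    return np.vstack([rng.normal(c, 0.3, size=(n, 2)) for c in centres])

features = {0: make_class([(0, 0), (2, 2)]), 1: make_class([(4, 0), (6, 2)])}

# fit one small GMM per class (two clusters per class here)
gmms = {c: GaussianMixture(n_components=2, random_state=0).fit(X)
        for c, X in features.items()}

def predict(x):
    """Class with the highest per-class mixture log-likelihood."""
    return max(gmms, key=lambda c: gmms[c].score_samples(x[None])[0])

print(predict(np.array([2.1, 1.9])))   # expected: class 0
print(predict(np.array([5.8, 2.1])))   # expected: class 1
```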

A Dual mmWave Radar System Using Density-based Sequential Clustering

ABSTRACT. This paper presents a dual mmWave radar system with our sequential clustering method to address two significant challenges in radar-based human detection. The first challenge is that conventional clustering algorithms are highly dependent on predefined parameters, such as the number of clusters, MinPts, and epsilon, and are susceptible to environmental noise. The second is the performance degradation of many detection methods in real-world environments, primarily due to the sparse data from single-chip radar systems. To overcome these limitations, we propose a dual-radar cooperative framework to enrich the input data, together with a sequential clustering algorithm. Our algorithm first employs density-based clustering to sequentially identify potential targets while effectively filtering noise. Subsequently, a board-peeling technique is introduced to confirm merged clusters, separating individuals in dense crowds. Finally, the refined parameters of the potential targets are used to initialize a Gaussian Mixture Model (GMM), which generates the final, accurate clustering results. Results show that our clustering method is effective on public datasets and in real-world environments, such as offline marketing and social distancing monitoring. The proposed system demonstrates its effectiveness as a low-cost, privacy-preserving solution for smart space area management.
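A rough sketch of the "density-based clustering, then GMM refinement" stage on a toy 2-D point cloud is given below: DBSCAN proposes candidate targets and filters noise, and the surviving cluster centroids initialise a Gaussian mixture that produces the final assignment. The dual-radar fusion and the board-peeling step are not reproduced.

```python
"""Minimal sketch of the density-based-then-GMM step on a toy 2-D point
cloud: DBSCAN proposes candidate targets and filters noise, and the
surviving cluster centroids initialise a Gaussian mixture for the final
assignment.  Radar fusion and board-peeling are not reproduced."""
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# two nearby "people" plus scattered noise points
people = np.vstack([rng.normal((0.0, 0.0), 0.15, (150, 2)),
                    rng.normal((1.5, 0.0), 0.15, (150, 2))])
noise = rng.uniform(-3, 3, (30, 2))
points = np.vstack([people, noise])

labels = DBSCAN(eps=0.25, min_samples=8).fit_predict(points)
clusters = [points[labels == c] for c in set(labels) if c != -1]
print("DBSCAN candidate targets:", len(clusters))

if clusters:
    means_init = np.array([c.mean(axis=0) for c in clusters])
    gmm = GaussianMixture(n_components=len(clusters), means_init=means_init,
                          random_state=0).fit(points[labels != -1])
    final = gmm.predict(points[labels != -1])
    print("final cluster sizes:", np.bincount(final))
```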

Evaluating Semantic Representations in Multimodal Word Grounding

ABSTRACT. Word grounding tasks aim to associate individual words with corresponding elements in visual scenes, enabling machines to link language with perception for effective human–machine interaction. However, existing grounding models struggle to generalize to synonyms or unseen lexical variants, limiting their performance in open-domain scenarios. In this paper, we present a Bayesian multimodal grounding model that incorporates word embeddings as priors within a probabilistic generative process to improve robustness under lexical variation. We compare the effects of static FastText and contextual BERT embeddings on grounding accuracy by conditioning word–visual associations on their semantic representations. Experiments use CLEVR-generated 3D scenes paired with structured compositional descriptions to test the grounding of object categories, colors, and spatial relations across lexical shifts. Results show that contextual embeddings such as BERT consistently outperform static embeddings like FastText in overall grounding accuracy and in resolving spatial relations. We demonstrate that integrating structured probabilistic inference with rich semantic embeddings offers a principled and scalable solution for robust, interpretable word grounding.
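The embedding-as-prior idea can be illustrated with a toy Bayes update in which a query word's vector induces a prior over grounded categories via similarity to category prototype vectors, combined with a synthetic visual likelihood. The vectors, categories, and likelihoods below are random stand-ins, not the FastText/BERT embeddings or the generative model of the paper.

```python
"""Toy sketch of the embedding-as-prior idea: a query word's vector yields a
prior over grounded categories via cosine similarity to category prototype
vectors, which is combined with a synthetic visual likelihood through Bayes'
rule.  All vectors and likelihoods here are random stand-ins."""
import numpy as np

rng = np.random.default_rng(0)
categories = ["cube", "sphere", "cylinder"]
proto = {c: rng.normal(size=50) for c in categories}   # category prototype vectors
# a synonym's vector: close to the "sphere" prototype plus noise
word_vec = proto["sphere"] + 0.3 * rng.normal(size=50)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# prior over categories from embedding similarity (softmax of cosines)
sims = np.array([cosine(word_vec, proto[c]) for c in categories])
prior = np.exp(5 * sims) / np.exp(5 * sims).sum()

# synthetic visual likelihood p(observed object | category)
likelihood = np.array([0.2, 0.5, 0.3])

posterior = prior * likelihood
posterior /= posterior.sum()
for c, p in zip(categories, posterior):
    print(f"P({c} | word, scene) = {p:.3f}")
```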

EffiPerception: A Plug-and-Play Efficiency Enhancement Framework for 2D and 3D Perception Models

ABSTRACT. Improving the balance between accuracy, speed, and memory usage remains a central challenge in visual perception, particularly across diverse tasks and modalities. While many existing models are tailored to specific tasks like 2D detection or 3D segmentation, few solutions generalize across modalities without redesigning model architectures. In this work, we propose EffiPerception, a plug-and-play framework that enhances the efficiency of existing perception models across both 2D and 3D domains. EffiPerception comprises three modular components: (1) Efficient Feature Extractors that adapt to different input modalities while reducing spatial redundancy, (2) Efficient Layers that improve feature quality and reduce computational overhead without modifying the backbone, and (3) EffiOptim, an 8-bit optimizer that minimizes training-time memory consumption. Designed to be task-compatible and architecture-agnostic, EffiPerception improves the performance-efficiency trade-offs of diverse models without compromising generality. Experiments on COCO, KITTI, and SemanticKITTI demonstrate consistent improvements in accuracy, inference speed, and memory efficiency across four core perception tasks: 2D object detection, 2D instance segmentation, 3D object detection, and 3D point cloud segmentation.
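Of the three components, the 8-bit optimizer is the easiest to picture; the toy sketch below stores an SGD momentum buffer as int8 with a per-tensor scale between steps. It is a simplified stand-in for the general idea only, since EffiOptim's actual scheme is not detailed in the abstract and production 8-bit optimizers typically use blockwise dynamic quantization.

```python
"""Toy sketch of an 8-bit optimiser state: the SGD momentum buffer is stored
as int8 with a per-tensor scale between steps and dequantised only during
the update.  This is a simplified stand-in for illustration, not EffiOptim."""
import numpy as np

def quantize(x):
    scale = np.abs(x).max() / 127.0 + 1e-12
    return (x / scale).round().astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

class Int8MomentumSGD:
    def __init__(self, shape, lr=0.1, beta=0.9):
        self.lr, self.beta = lr, beta
        self.q, self.scale = quantize(np.zeros(shape, dtype=np.float32))

    def step(self, params, grad):
        m = self.beta * dequantize(self.q, self.scale) + (1 - self.beta) * grad
        self.q, self.scale = quantize(m)         # re-quantise the optimiser state
        return params - self.lr * m

# minimise f(w) = ||w||^2 / 2 as a smoke test (gradient is w)
w = np.ones(1000, dtype=np.float32)
opt = Int8MomentumSGD(w.shape)
for _ in range(100):
    w = opt.step(w, grad=w)
print("final loss:", float(0.5 * (w ** 2).sum()))
```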

MC-GNNAS-Dock: Multi-criteria GNN-based Algorithm Selection for Molecular Docking

ABSTRACT. Molecular docking constitutes a crucial computational technique in drug design and discovery, enabling prediction of ligand-target interactions. While diverse conventional search-based and machine learning driven docking algorithms exist, no universally optimal docking protocol has emerged due to scenario-specific performance variability. To address this limitation, algorithm selection systems such as GNNAS-Dock, which leverages graph neural networks (GNNs), have been proposed. The present study advances GNNAS-Dock through three key modifications. First, a multi-criteria evaluation framework is introduced, integrating docking validity assessment via PoseBusters checks alongside the commonly used root-mean-square deviation (RMSD) metric for binding pose accuracy; this enhancement yields Multi-criteria GNNAS-Dock (MC-GNNAS-Dock). Second, MC-GNNAS-Dock is extended with rank-aware loss functions to refine learning performance. Third, architectural modifications are implemented to improve predictive robustness. An empirical evaluation and analysis are reported on a dataset of 3200 docking scenarios originating from the PDBBind database. MC-GNNAS-Dock demonstrates superior performance, achieving 3.2-4.1% higher accuracy compared to the overall or single best-performing baseline algorithm. These findings underscore the efficacy of multi-criteria evaluation metrics and ranking-aware training paradigms in optimizing complex algorithm selection tasks within computational drug discovery pipelines.
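The rank-aware training signal can be illustrated with a pairwise margin ranking loss that penalises the selector whenever it scores a worse docking algorithm above a better one for the same scenario. The GNN encoder and the multi-criteria (PoseBusters plus RMSD) labels are not reproduced; the scores and true performances below are random stand-ins.

```python
"""Minimal sketch of a rank-aware training signal for algorithm selection:
a pairwise margin ranking loss penalises the selector whenever it scores a
worse docking algorithm above a better one for the same scenario.  The GNN
encoder and multi-criteria labels of MC-GNNAS-Dock are not reproduced."""
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_scenarios, n_algorithms = 8, 5
pred_scores = torch.randn(n_scenarios, n_algorithms, requires_grad=True)
true_perf = torch.rand(n_scenarios, n_algorithms)     # higher = better algorithm

def pairwise_rank_loss(scores, perf, margin=0.1):
    losses = []
    for i in range(n_algorithms):
        for j in range(n_algorithms):
            if i == j:
                continue
            # target +1 where algorithm i truly outperforms algorithm j
            target = torch.sign(perf[:, i] - perf[:, j])
            losses.append(F.margin_ranking_loss(scores[:, i], scores[:, j],
                                                target, margin=margin))
    return torch.stack(losses).mean()

loss = pairwise_rank_loss(pred_scores, true_perf)
loss.backward()
print("ranking loss:", float(loss), "grad norm:", float(pred_scores.grad.norm()))
```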

VGHTCoder Multi-Agent Code Generation with Hypothesis Testing and Verification-Guided

ABSTRACT. Large Language Models (LLMs) have made remarkable progress in code generation and problem-solving. Code generation tasks not only demand a deep understanding of complex natural language problem descriptions but also require the generation of correct and efficient code capable of passing comprehensive unit tests. Current mainstream approaches typically prompt LLMs to generate initial code drafts by targeting segments corresponding to different reasoning steps. However, these methods still suffer from reasoning errors and demonstrate limited capacity in guiding LLMs to solve complex real-world tasks. To address these limitations, this paper proposes VGHTCoder, a code generation framework that leverages a multi-agent prompting mechanism to simulate the full programming lifecycle. This framework integrates Chain of Verification and Adaptive Debugging to evaluate the code generation tasks completed by other agents. In addition to the programmer and code executor agents, the verification stage includes a novel hypothesis-testing mechanism that simulates human-like reasoning. At each stage, the verification agent of VGHTCoder drafts an initial response, then plans verification queries to fact-check the draft, and finally guides the code executor through validated responses to ensure effective task completion. This collaborative system outperforms both single-agent models and earlier multi-agent strategies. Our framework achieves state-of-the-art pass@1 performance on multiple benchmarks, including HumanEval (94.5%), MBPP (92.17%), and MBPP-ET (64.2%). Furthermore, our approach demonstrates substantial potential for further advancement in code generation.

Development of fault trees from safety reports to model process risk using CorEx-ARM framework

ABSTRACT. Traditionally, process industries have relied on engineering first-principles-based accident models and inputs from subject-matter experts to assess risks. However, with the advent of Industry 4.0, there is huge potential in using data-mining models to identify faults/failures from data and to discover cause-effect relationships between the faults, enabling fault diagnosis and accident modelling in the form of chains of events and fault trees. Safety documents, in the form of visit/observation reports, audits, and incident investigation reports, contain considerable information about faults and their propagation paths, and text mining models can reveal insights into various latent fault/failure events and how their causal interactions result in the propagation of hazards into accidents. Here, a novel semi-supervised text mining methodology is presented that enables the user to automatically identify various faults in the form of Risk Control System (RCS) failures in process plants. Using user-defined RCS-relevant anchor words as inputs, the model performs CorEx topic modelling on incident descriptions to identify the latent topics representing the RCS failure events. Association Rule Mining (ARM) is then applied to the RCS failure item-sets to develop major chains of events. Using the event chains as cut-sets, various fault trees depicting failure propagation paths are developed. The methodology was applied to incident investigation reports at a petrochemicals plant, and six fault trees representing failure propagation paths were identified.
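Assuming the corextopic and mlxtend Python packages, the two-stage pipeline might be sketched roughly as below: anchored CorEx topics flag latent RCS-failure events in incident texts, and association rule mining over the per-incident event sets surfaces candidate chains of events. The incident snippets and anchor words are invented for illustration, the exact package APIs may differ by version, and fault-tree construction from the resulting cut-sets is not shown.

```python
"""Rough sketch of anchored CorEx topics followed by association rule mining,
assuming the `corextopic` and `mlxtend` packages.  Incident snippets and
anchor words are invented for illustration; fault-tree construction from the
resulting cut-sets is not shown."""
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from corextopic import corextopic as ct
from mlxtend.frequent_patterns import apriori, association_rules

docs = [
    "pump seal leak caused flammable release near flange",
    "alarm ignored by operator during pump start up",
    "corroded pipe leak and delayed alarm response",
    "relief valve failed to open during overpressure event",
    "operator bypassed interlock, alarm not raised, minor leak",
    "valve passing, leak detected late, alarm system under maintenance",
]
anchors = [["leak", "seal", "corroded"],        # containment failure
           ["alarm", "ignored", "bypassed"],    # detection/response failure
           ["valve", "relief", "interlock"]]    # protection system failure

vec = CountVectorizer(binary=True)
X = vec.fit_transform(docs)
words = list(vec.get_feature_names_out())

model = ct.Corex(n_hidden=len(anchors), seed=1)
model.fit(X, words=words, anchors=anchors, anchor_strength=3)

# binary incident-by-event matrix: which failure events appear in each report
events = pd.DataFrame(model.labels,
                      columns=["containment", "detection", "protection"]).astype(bool)
itemsets = apriori(events, min_support=0.3, use_colnames=True)
rules = (association_rules(itemsets, metric="confidence", min_threshold=0.6)
         if not itemsets.empty else itemsets)
print(rules)
```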

A Hyper-Heuristic Approach to Bi-Space Search for Bin Packing Problems

ABSTRACT. Metaheuristics such as genetic algorithms and local search traditionally explore a solution space to solve combinatorial optimisation problems such as scheduling or packing problems. Hyper-heuristics explore a heuristic space rather than the solution space directly and were introduced to overcome some of the challenges associated with searching the solution space. Usually, only one space, i.e., either the solution space or the heuristic space, is explored to solve the problem at hand. More recently, there have been initiatives to explore both the solution and heuristic spaces at the same time. This has been referred to as bi-space search. Previous work showed the effectiveness of bi-space search, using iterated local search, for one-dimensional bin packing. This study evaluates the hypothesis that a bi-space search that uses genetic algorithms to explore the solution and heuristic spaces, together with a single-point-search selection perturbative hyper-heuristic (SPHH) to optimise when to switch between spaces, will improve on the bi-space searches of previous work for bin packing. The proposed approach is evaluated on one-dimensional (1D) and two-dimensional (2D) bin packing problems. The study found that the SPHH performed better than searching each of the spaces separately for both 1D and 2D bin packing problem instances. Furthermore, the proposed approach outperformed the state-of-the-art bi-space searches for bin packing.
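The switching idea can be pictured with a toy bi-space loop for 1D bin packing: a simple score-based selection mechanism picks either a heuristic-space move (perturbing the item ordering decoded by first-fit) or a solution-space move (relocating an item between bins) and keeps the result if it does not use more bins. The moves and selection rule are illustrative stand-ins, not the GA and SPHH configuration evaluated in the study.

```python
"""Toy sketch of the bi-space idea for 1D bin packing: a score-based selector
chooses between a heuristic-space move (perturb the item ordering decoded by
first-fit) and a solution-space move (relocate an item between bins), keeping
any result that uses no more bins.  The moves and selection rule are
illustrative stand-ins, not the GA + SPHH configuration in the paper."""
import random

random.seed(0)
CAPACITY = 10
items = [random.randint(2, 7) for _ in range(40)]

def first_fit(order):
    bins = []
    for size in order:
        for b in bins:
            if sum(b) + size <= CAPACITY:
                b.append(size); break
        else:
            bins.append([size])
    return bins

def heuristic_move(order):
    """Perturb the ordering, then re-decode with first-fit."""
    order = order[:]
    i, j = random.sample(range(len(order)), 2)
    order[i], order[j] = order[j], order[i]
    return order, first_fit(order)

def solution_move(order, bins):
    """Relocate a random item to another bin with room, if possible."""
    bins = [b[:] for b in bins if b]
    src = random.randrange(len(bins))
    size = bins[src].pop(random.randrange(len(bins[src])))
    for b in bins:
        if b is not bins[src] and sum(b) + size <= CAPACITY:
            b.append(size); break
    else:
        bins.append([size])
    return order, [b for b in bins if b]

order = sorted(items, reverse=True)           # first-fit decreasing start
bins = first_fit(order)
scores = {"heuristic": 1.0, "solution": 1.0}  # selection scores per space

for _ in range(500):
    space = max(scores, key=lambda s: scores[s] * random.random())
    new_order, new_bins = heuristic_move(order) if space == "heuristic" \
                          else solution_move(order, bins)
    if len(new_bins) <= len(bins):
        if len(new_bins) < len(bins):
            scores[space] += 1.0              # reward the space that improved
        order, bins = new_order, new_bins

print("bins used:", len(bins), "lower bound:", -(-sum(items) // CAPACITY))
```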