10:30 | StructCoh: Structured Contrastive Learning for Context-Aware Text Semantic Matching PRESENTER: Ziyuan Gao ABSTRACT. Text semantic matching requires nuanced understanding of both structural relationships and fine-grained semantic distinctions. While pre-trained language models excel at capturing token-level interactions, they often overlook hierarchical structural patterns and struggle with subtle semantic discrimination. In this paper, we propose StructCoh, a graph-enhanced contrastive learning framework that synergistically combines structural reasoning with representation space optimization. Our approach features two key innovations: (1) A dual-graph encoder constructs semantic graphs via dependency parsing and topic modeling, then employs graph isomorphism networks to propagate structural features across syntactic dependencies and cross-document concept nodes. (2) A hierarchical contrastive objective enforces consistency at multiple granularities: node-level contrastive regularization preserves core semantic units, while graph-aware contrastive learning aligns inter-document structural semantics through both explicit and implicit negative sampling strategies. Experiments on three legal document matching benchmarks and academic plagiarism detection datasets demonstrate significant improvements over state-of-the-art methods. Notably, StructCoh achieves 86.7% F1-score (+6.2% absolute gain) on legal statute matching by effectively identifying argument structure similarities. |
10:50 | Structure-Aware Dynamic Fusion with Modality Balance for Multimodal KGC ABSTRACT. Multimodal knowledge graphs (MMKGs) integrate structural, visual, and textual modalities to enhance entity and relation representations. However, existing MMKG completion methods often rely on static fusion strategies that overlook context-specific modality relevance, and they tend to underutilize structural information encoded in the graph topology. In this paper, we present SDMF-MKG, a structure-aware dynamic fusion framework designed to address modality bias and structural underrepresentation in MMKGs. The model incorporates three key components: a structure-guided semantic encoder that preserves topological signals, a dynamic weighting mechanism that adaptively calibrates modality contributions based on triple context, and a KL-regularized loss to encourage balanced modality utilization. We evaluate SDMF-MKG on four benchmark datasets spanning both multimodal-rich and structure-only settings. The model achieves state-of-the-art or competitive performance across most metrics, with notable gains on multimodal datasets such as VTKG-C. Ablation studies further confirm the complementary effects of structure awareness, adaptive fusion, and modality balancing. |
11:10 | PRESENTER: Hai Dang Nguyen ABSTRACT. Spatial Transcriptomics (ST) enables the measurement of gene expression while preserving spatial information, offering critical insights into tissue architecture and disease pathology. Recent developments have explored the use of hematoxylin and eosin (H&E)-stained whole-slide images (WSIs) to predict transcriptome-wide gene expression profiles through deep neural networks. This task is commonly framed as a regression problem, where each input corresponds to a localized image patch extracted from the WSI. However, predicting spatial gene expression from histological images remains a challenging problem due to the significant modality gap between visual features and molecular signals. Recent studies have attempted to incorporate both local and global information into predictive models. Nevertheless, existing methods still suffer from two key limitations: (1) insufficient granularity in local feature extraction, and (2) inadequate coverage of global spatial context. In this work, we propose a novel framework, MMAP (Multi-MAgnification and Prototype-enhanced architecture), that addresses both challenges simultaneously. To enhance local feature granularity, MMAP leverages multi-magnification patch representations that capture fine-grained histological details. To improve global contextual understanding, it learns a set of latent prototype embeddings that serve as compact representations of slide-level information. Extensive experimental results demonstrate that MMAP consistently outperforms all existing state-of-the-art methods across multiple evaluation metrics, including Mean Absolute Error (MAE), Mean Squared Error (MSE), and Pearson Correlation Coefficient (PCC). |
11:30 | Interpreting Safety: A LLM and STPA Approach ABSTRACT. Artificial Intelligence (AI) models are increasingly used in complex systems such as autonomous vehicles (AVs), where safety and explainability are critical. However, existing explainable AI (xAI) methods focus on model-level transparency while neglecting system-level safety explanations (Gap 1), and prior applications of large language models (LLMs) in AVs often view the AV as a whole, overlooking potential risks arising from interactions among its internal components (Gap 2). To address these gaps, we propose a framework that integrates LLMs with System Theoretic Process Analysis (STPA), a structured method to analyse hazards and assess safety, to improve AV safety assurance. Our framework leverages LLMs for scenario analysis while incorporating STPA to identify unsafe control actions (UCAs) and filter them with real-world video data. We evaluated our method against Lingo-2 (a vision-language-action model developed by Wayve) in a simulated environment, demonstrating superior STPA-based explanations. To evaluate the framework, we employed two ground truth references for accuracy verification and conducted robustness testing; the framework outperformed traditional LLM-based explainers, as also confirmed by expert evaluations. |
11:50 | Test-Time Recommendation for Safe Medication Combination ABSTRACT. Recommending appropriate medication combinations is crucial to intelligent healthcare. Recent studies leverage Large Language Models (LLMs) to streamline traditional recommendation architectures. However, LLMs are prone to hallucinations during training, often generating nonexistent medications or suggesting incompatible combinations. Moreover, patient queries (i.e., personal information, medical history, and current symptoms) often arrive as an online stream from distributions that differ from the training data. To address this, we propose a novel test-time recommendation method for medication combination, enabling on-the-fly and robust recommendations. During training, we introduce a learnable output layer and a drug–drug interaction (DDI)-aware objective to guide the LLM in generating clinically valid and safe medication combination recommendations. To handle distribution shifts at test time, we further design a self-distillation task that enables off-the-shelf pretrained models to dynamically adapt to unseen patient queries based on their feature representations. Extensive experiments conducted on the MIMIC-III and MIMIC-IV datasets demonstrate that our approach excels in both recommendation accuracy and safety, showing its potential for deployment in real-world clinical settings. |
12:10 | Watermarking Counterfactual Explanations ABSTRACT. Counterfactual (CF) explanations for ML model predictions provide actionable recourse recommendations to individuals adversely impacted by predicted outcomes. However, despite being preferred by end-users, CF explanations have been shown to pose significant security risks in real-world applications; in particular, malicious adversaries can exploit CF explanations to perform query-efficient model extraction attacks on the underlying proprietary ML model. To address this security challenge, we propose CFMark, a novel model-agnostic watermarking framework for detecting unauthorized model extraction attacks relying on CF explanations. CFMark involves a novel bi-level optimization problem to embed an indistinguishable watermark into the generated CF explanation such that any future model extraction attacks using these watermarked CF explanations can be detected using a null hypothesis significance testing (NHST) scheme. At the same time, the embedded watermark does not compromise the quality of the CF explanations. We evaluate CFMark across diverse real-world datasets, CF explanation methods, and model extraction techniques. Our empirical results demonstrate CFMark's effectiveness, achieving an F-1 score of ~0.89 in identifying unauthorized model extraction attacks using watermarked CF explanations. Importantly, this watermarking incurs only a negligible degradation in the quality of generated CF explanations (i.e., ~1.3% degradation in validity and ~1.6% in proximity). Our work establishes a critical foundation for the secure deployment of CF explanations in real-world applications. |
10:30 | Enhancing LLM Abductive Reasoning through MCMC Premise Retrieval PRESENTER: Yuanyi Wang ABSTRACT. We present a framework that leverages Markov Chain Monte Carlo (MCMC) to enhance abductive reasoning in large language models (LLMs). Abductive reasoning—the task of inferring the most plausible explanation for a given observation—remains a significant challenge for LLMs, particularly in scenarios with incomplete or ambiguous information. Existing methods typically rely on static retrieval strategies that struggle to adapt to diverse reasoning contexts. In contrast, our approach employs an unsupervised MCMC algorithm to efficiently explore large premise spaces, balancing exploration and exploitation to identify the most relevant supporting evidence. These premises are dynamically reordered to appear at the beginning of the prompt, guiding LLMs toward generating more accurate and coherent hypotheses. Experimental results demonstrate substantial gains in both premise recall and hypothesis consistency, highlighting the effectiveness of probabilistic modeling in complex reasoning tasks. When evaluated on the Entailment Bank dataset, our method significantly improves premise retrieval and enables LLMs to generate hypotheses that better align with the ground truth. |
10:50 | Understanding Cross-Lingual Generalization of English-Centric LLMs: The Role of Representation Similarity and Data Exposure PRESENTER: Suchun Xie ABSTRACT. English-centric large language models (LLMs), such as LLaMA, have gained prominence in NLP research and practice. Although these models are predominantly trained on English data, their widespread adoption has prompted important attention regarding their cross-lingual generalization capabilities. While cross-lingual capabilities have been extensively explored in the context of multilingual masked language models (MMLMs), corresponding research on English-centric LLMs remains limited. However, due to their decoder-only architecture and constrained access to multilingual training data, it remains unclear whether insights gained from MMLMs apply to these English-centric models. To fill this gap, we conduct a systematic analysis of cross-lingual generalization capabilities in English-centric LLMs. Our experiments demonstrate that even when fine-tuned solely on English data, English-centric LLMs generalize across languages in both classification and generation tasks. Further analysis reveals that representation similarity to English plays a crucial role in enabling this generalization, outweighing the influence of the multilingual data ratio during pretraining. This finding contrasts with prevailing assumptions in the MMLM literature. Additionally, we propose and empirically validate a similarity-reversed data allocation strategy, one that assigns more data to languages less similar to English, which can effectively enhance overall multilingual performance, particularly under constrained data budgets. |
11:10 | Text2Omni: A Text-only Training Strategy for MLLMs ABSTRACT. Text2Omni is an innovative framework for generating high-quality multimodal synthetic data using text alone, targeting the advancement of Multimodal Large Language Models (MLLMs). Text2Omni addresses the significant challenge of acquiring large-scale multimodal datasets by eliminating the need for real images or audio. The framework leverages the geometric structure of multimodal contrastive representations to generate diverse, high-quality datasets that facilitate pretraining and instruction-tuning for multimodal models. The process involves a three-stage pipeline: (1) Diverse Caption Data Synthesis, where text descriptions are enriched with more detailed semantic information; (2) Instruction-Tuning Data Generation, producing data for complex tasks like multiple-choice and reasoning; and (3) Modality Representation Transfer, where textual descriptions are converted into synthetic image or audio representations. The resulting datasets, Text2Omni-1.8M for pretraining and Text2Omni-540K-Instruction for instruction-tuning, significantly reduce training costs while supporting the development of small- to medium-scale multimodal models. The paper also introduces a two-phase multimodal training paradigm to enhance multimodal understanding and reasoning capabilities efficiently. Experimental results across image-to-text and audio-to-text tasks demonstrate that the Text2Omni framework improves the performance of existing models on a variety of benchmarks, establishing its potential as an effective tool for advancing multimodal learning without requiring large-scale real-world data. |
11:30 | Tunnel Vision in Online Discourse: Formalization and Entropy-Based Quantification with LLM-Simulated Agents ABSTRACT. Online social networks have reshaped public discourse by enabling large-scale, user-driven discussions on societal topics. However, such discussions often exhibit a narrowing of attention, where collective focus converges on a limited subset of topic aspects while neglecting alternative viewpoints. We term this phenomenon tunnel vision. Distinct from ideological echo chambers or filter bubbles, tunnel vision emerges at the aspect level, reflecting reduced topical diversity rather than alignment of opinions. This paper presents a formal framework for defining and quantifying tunnel vision in online discourse. We propose two entropy-based metrics, i.e., Aspect-Sentiment Pairwise Entropy (ASPE) and Coverage-Adjusted Aspect-Sentiment Entropy (CASE), to measure both the distribution and completeness of aspect-level engagement. To investigate the emergence and dynamics of tunnel vision, we simulate discourse using LLM-simulated agents guided by Bayesian cognitive modeling within an artificial social environment. Our experiments show that tunnel vision naturally arises over time and is shaped by cognitive constraints, including users’ perception windows and expressive capacity. These results offer a new lens on attention dynamics in digital spaces and establish a quantitative basis for detecting and mitigating aspect-level discourse narrowing. |
11:50 | From Evolution to Generation: Leveraging LLMs to Redefine Genetic Programming for Symbolic Regression ABSTRACT. Mathematical equations describe fundamental laws across various disciplines, yet discovering concise and effective mathematical expressions from data remains a challenging task. Traditional symbolic regression methods often overlook domain-specific prior knowledge that scientists rely on, while large language model (LLM)-driven symbolic regression approaches can effectively leverage it. However, existing LLM-driven symbolic regression methods typically require substantial computational resources to generate equations while still suffering from low efficiency in producing high-quality expressions. To address this issue, we propose LLM-Guided Genetic Programming for Symbolic Regression (LLMGP-SR), a prompt-guided equation evolutionary search algorithm. LLMGP-SR integrates LLMs into the initialization, crossover, and mutation operations of genetic programming, achieving an organic integration of semantic generation and structural evolution of expressions. By leveraging an adaptive prompt strategy, LLMGP-SR constructs carefully designed prompts to guide LLMs in generating effective expressions. Experimental results demonstrate that LLMGP-SR significantly outperforms traditional genetic programming in symbolic regression problems across six common benchmark datasets, while maintaining diversity in the solution space. |
12:10 | Explainable LLM-Guided Evidence Retrieval for Claim Verification using Knowledge Graphs ABSTRACT. Claim verification has become an increasingly important field with the rise of online misinformation. Content-based claim verification systems do not provide explicit evidence and are vulnerable to adversarial attacks. Current evidence-based approaches typically employ embedding-based Retrieval-Augmented Generation (RAG), but these conventional RAG systems struggle with multi-hop reasoning across documents. Recent work has proposed RAG systems that address this by indexing the document store into a Knowledge Graph (KG) for improved inter-document connectivity. In this study, we evaluate an explainable LLM-guided evidence retrieval framework using KGs for claim verification on the AVeriTeC dataset and propose a novel "ask first, index later" sparse retrieval approach to improve efficiency and cost-effectiveness. With an AVeriTeC score of 0.57, 0.46 above the baseline, our framework ranks among the top-performing systems. Additionally, it achieved Q only and Q+A scores of 0.47 and 0.34, respectively, which are highly comparable to state-of-the-art results, demonstrating its strong evidence retrieval capabilities and effectiveness in verifying real-world claims. |
10:30 | A Skin Cancer Classification Method Based on Genetic Programming with New Region Detection Operators ABSTRACT. Skin cancer is one of the most prevalent malignant tumours worldwide, and its incidence has continued to climb in recent years. Traditional feature extraction methods often struggle with the high variability and complex patterns in skin cancer images, necessitating more adaptive and automated approaches. This study proposes a genetic programming (GP)-based method with flexible region detection operators (GPFRD) for automatically and flexibly learning discriminative features for various classification tasks of skin cancer images. The proposed GPFRD method integrates preprocessing, region detection, feature extraction, and feature concatenation into a cohesive framework, significantly enhancing flexibility. The newly designed operators precisely localize diagnostically critical regions based on lesion masks while suppressing irrelevant background interference. These operators enable the proposed method to evolve effective feature extraction solutions based on the characteristics of different image datasets. Experimental results on five datasets of varying difficulties demonstrate that the proposed method outperforms the benchmark GP-based method and four traditional feature extraction methods in the majority of cases. |
10:50 | Prompt Efficient Generation Agent for Skin Lesion Segmentation PRESENTER: Xinxu Xie ABSTRACT. Skin lesion segmentation is a medical image analysis task that involves automatically delineating lesion boundaries from dermatoscopic or clinical images. It plays a critical role in the early detection of skin cancers such as melanoma. Low contrast and fuzzy boundaries are the main challenges, especially when training images are limited. To tackle these challenges with limited annotation data, we propose a controllable prompt generation agent to activate the skin lesion segmentation capability in vision foundation models for medical image analysis. Specifically, we reduce the pixel-level action space to the grid level for efficient search. With Convolutional Neural Networks as the backbone, the agent performs spatial reasoning over an image to simultaneously find the prompt coordinate and its label using a policy function, and provides the selected prompt points to the vision foundation model. The interaction process terminates within a fixed number of iterations. For optimization, we propose asymmetric rewards aligned with the value function and introduce them into proximal policy optimization to save computation and memory cost. Interestingly, better performance is achieved with fewer prompt points than the threshold number, along with some background points. Thus, the proposed agent is referred to as the Prompt Efficient Generation (PEG) agent. Experimental results on public benchmark skin lesion segmentation datasets show that PEG outperforms state-of-the-art methods, and the IoU improvement is at least 4% compared with SAM. |
11:10 | CMoD-VD: Cross-Modal Distillation with Privileged Motion Supervision for Violence Detection PRESENTER: Pierre Lefebvre ABSTRACT. Automatic violence detection in videos (VD) has become a major challenge in the field of Computer Vision with the deployment of smart cameras and the increasing volume of videos shared online. Recent works primarily rely on CNN-based models paired with 3D or recurrent layers to capture the spatiotemporal dynamics of video streams. The integration of additional modalities, such as audio or optical flow, has recently attracted growing interest. Particularly, optical flow has demonstrated strong relevance in modeling motion patterns associated with violent events. However, its estimation is computationally intensive, limiting its use for real-time applications. In this work, we introduce CMoD-VD, a novel method for violence detection based on two CNN+BiLSTM models enhanced with spatial, channel, and temporal attentions. Our method relies on cross-modal distillation with privileged motion supervision. A teacher model is first trained with both RGB and optical flow videos. Then, a student model learns to reproduce its behavior using RGB frames only. This strategy enables accurate inference without relying on motion estimation. Experiments on three public datasets RWF-2000, Hockey Fight and Violent-Flows demonstrate that our student model achieves competitive results close to the teacher and state-of-the-art methods, while significantly reducing computational costs. |
11:30 | BLAH: Enhancing Small Object Detection via a Bi-Level Interactive Head with Multi-Level Self-Attention ABSTRACT. The detection head framework critically influences the balance between classification and localization in small object detection, yet existing designs often neglect task-specific feature interactions, leading to optimization conflicts. To address this, we propose Bi-Level Attention Head (BLAH), a novel framework that harmonizes dual-task learning through structured attention mechanisms and adaptive loss optimization. BLAH introduces two key innovations: (1) Channel Group Self-Attention (CGSA) stacks, which dynamically recalibrate channel-group dependencies to align classification and localization features, resolving spatial-channel decoupling limitations in conventional attention. (2) Dual-Task Attention (DTA), integrating global channel attention for classification robustness (translation invariance) and local spatial attention for precise localization (translation variability), enabling synergistic task interaction without computational overhead. Further, we design a Differentiable Task-Balanced Loss (DTBL) that adaptively modulates gradients between tasks via cosine similarity constraints, ensuring stable optimization without extra parameters. Extensive experiments on MS COCO and VisDrone demonstrate BLAH’s superiority. When integrated with DETR, Deformable DETR, and YOLOv10, BLAH achieves +1.2% mAP on COCO over state-of-the-art detectors (e.g., YOLO-based, DETR-based) while maintaining inference efficiency, and significantly improves small-object detection (e.g., +4.5% AP_S on YOLOv12). Ablation studies validate each component's necessity. |
11:50 | Multiscale Masking Knowledge Distillation for Dense Visual Prediction ABSTRACT. Object detection and semantic segmentation are fundamental tasks in computer vision, but deploying deep learning models for these tasks in resource-constrained environments remains challenging due to their high computational demands. Knowledge distillation (KD) has emerged as a promising solution, enabling the transfer of knowledge from a large, high-performance teacher model to a lightweight student model. However, existing KD methods often struggle with dense visual prediction tasks due to their complex feature hierarchies and multiscale object representations. In this paper, we propose Multiscale Masking Knowledge Distillation (MMKD), a novel approach that enhances knowledge transfer by leveraging multiscale feature maps and attention-guided masking mechanisms. Our method systematically distills knowledge by focusing on discriminative regions at different scales, ensuring that the student model learns both fine-grained details and high-level contextual information. We introduce a Feature Attention Module (FAM) that dynamically highlights critical regions in feature maps, improving the student’s ability to detect small objects and reduce false positives. We conduct extensive experiments on benchmark datasets, including COCO and Cityscapes, evaluating our method across multiple architectures (RetinaNet, Faster R-CNN, GFL, DeepLabV3, and PSPNet). Our results demonstrate that MMKD significantly outperforms existing distillation techniques, achieving state-of-the-art performance in both object detection and semantic segmentation. For instance, on COCO, our method improves the mAP of RetinaNet-Res50 from 37.4 to 41.2, surpassing previous approaches like FGD (39.6) and MasKD (39.8). Similarly, in semantic segmentation, MMKD boosts DeepLabV3-MobileNetV2’s mIoU from 73.12 to 76.21, outperforming competitors such as CIRKD (75.42) and MasKD (75.26). |
10:30 | UniTCP: Traffic Prediction via UniBasis Spectral Filtering and Temporal Convolutional Projection ABSTRACT. Accurate and efficient traffic flow prediction is crucial for modern urban transportation systems, directly impacting the effectiveness of intelligent traffic management and sustainable mobility solutions. Current spatio-temporal graph neural networks often fail to balance prediction accuracy and computational efficiency when modeling complex traffic patterns – a critical limitation for real-time applications requiring both precision and rapid processing. This paper presents UniTCP, a novel framework advancing urban traffic flow prediction through three key innovations: (1) The introduction of Universal Polynomial Basis (UniBasis) overcomes limitations of traditional spectral graph convolution by adaptively constructing optimal polynomial filters through data-driven learning, extending the concept of homophily ratio from node classification to multivariate time series forecasting and enabling dynamic modeling of complex spatial dependencies across heterogeneous traffic networks. (2) The innovative Temporal Convolutional Projection Module (TCPM) synergizes multi-scale convolutional branches with trend-aware pooling to comprehensively capture both transient traffic fluctuations and persistent periodic patterns, establishing a new paradigm for efficient temporal feature extraction. (3) A unified architecture integrating node-adaptive parameter learning with time-variant graph structure generation, which achieves optimal performance-efficiency balance through spectral domain parameterization and spatio-temporal embedding fusion. Experimental validation across four public datasets confirms the framework's superior performance in addressing three core challenges: precise modeling of nonlinear spatio-temporal dependencies, computational resource optimization, and effective generalization across diverse traffic networks. The results demonstrate significant improvements in both prediction accuracy and operational efficiency compared to existing state-of-the-art approaches. |
10:50 | QAACoder: A Question Answering Approach to Actor Detection in the Conflict and Mediation Domain ABSTRACT. Monitoring, analyzing, and predicting political turmoil and violence is of utmost importance to a host of political scientists. This is still usually done using event coding systems that use pattern matching and fixed-size dictionaries. Recently, BERT and ConfliBERT have achieved state-of-the-art results for event coding. However, these methods use a sequence classification paradigm, and thus are unable to explicitly model the semantics of the labels and the rich interactions among them. In this paper, we propose a novel method for political event extraction on the standard CAMEO-based data set by formulating the problem as question answering, overcoming the above drawbacks. We achieve superior results, improving on ConfliBERT, the previous state-of-the-art model, by an absolute F1 of 2.02%. We also propose a new method for multi-source, multi-target sentences that increases the F1 by 2.29% compared to the previous best method. |
11:10 | CASD: An Accelerated Sampling Framework for ECG Denoising ABSTRACT. Real-time and accurate analysis of electrocardiogram (ECG) signals is a core requirement for clinical applications such as ambulatory monitoring. However, existing ECG denoising methods struggle to balance efficiency and fidelity. Traditional approaches, such as filtering and methods based on deep encoders, often compromise the fidelity of critical waveforms, especially in the presence of strong noise interference. Although methods based on Denoising Diffusion Probabilistic Models (DDPMs) have achieved significant progress in signal reconstruction, their high iterative inference cost and architectural limitations in modeling long-range dependencies severely restrict their clinical application potential. To address these challenges, this paper proposes a Conditional Accelerated Sampling Denoising (CASD) framework. This framework formulates the denoising task as an efficient, deterministic sampling process, capable of achieving high-fidelity signal reconstruction in as few as 10 sampling steps. At the core of CASD is a Global Feature Enhancement (GFE) module, which explicitly captures the long-range dependencies of the signal via a self-attention mechanism, thereby effectively suppressing baseline wander noise. |
11:30 | Automated Design of Neural Networks for River Flow Prediction using Weather Data PRESENTER: Junhao Huang ABSTRACT. River flow plays a vital role in the hydrologic cycle. In recent years, numerous machine learning approaches, particularly deep neural networks (DNNs), have demonstrated success in forecasting river flow. However, many of these models depend heavily on extensive historical flow data and very specific hydrological features, which are often labor-intensive to gather. Moreover, most existing predictive models, especially DNN-based, are handcrafted, requiring substantial domain expertise and extensive hyperparameter tuning, making the modeling process potentially inefficient. In this study, we propose a novel approach to achieve flexible and efficient river flow prediction that relies solely on openly accessible weather forecast data. A one-dimensional convolutional neural network (1D-CNN) is designed to effectively capture the temporal relationships between weather variables and river flow. Additionally, an efficient evolutionary neural architecture search (NAS) algorithm is developed to automatically discover the best 1D-CNN architectures, thereby improving predictive performance while reducing the need for manual architecture tuning. To evaluate our approach (named AutoNN-Flow), experiments are conducted using the Kaeo River in New Zealand as a case study. Without access to river-specific attributes, AutoNN-Flow achieves highly accurate 7-day and 14-day river flow predictions, greatly outperforming classic machine learning models. |
11:50 | Subset Selection for Stratified Sampling in Online Controlled Experiments ABSTRACT. Online controlled experiments, also known as A/B testing, are the digital equivalent of randomized controlled trials for estimating the impact of marketing campaigns on website visitors. Stratified sampling is a traditional technique for variance reduction to improve the sensitivity (or statistical power) of controlled experiments; this technique first divides the population into strata (homogeneous subgroups) based on stratification variables and then draws samples from each stratum to avoid sampling bias. To enhance the estimation accuracy of stratified sampling, we focus on the problem of selecting a subset of stratification variables that are effective in variance reduction. We design an efficient algorithm that selects stratification variables one by one by simulating a series of stratified sampling processes. We also estimate the computational complexity of our subset selection algorithm. Computational experiments using synthetic and real-world datasets demonstrate that our method can outperform other variance reduction techniques especially when multiple variables have a certain correlation with the outcome variable. Our subset selection method for stratified sampling can improve the sensitivity of online controlled experiments, thus enabling more reliable marketing decisions. |
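As a rough illustration of why a well-chosen stratification variable reduces variance, the sketch below compares the variance of the sample mean under simple random sampling with that of a stratified mean under proportional allocation. It is not the authors' subset selection algorithm: the function names, the toy population, and the stratum labels are hypothetical, and the finite-population correction is ignored for simplicity.

```python
from statistics import pvariance


def srs_variance(values, n):
    """Variance of the sample mean under simple random sampling.

    Assumption: sampling with replacement (no finite-population correction).
    """
    return pvariance(values) / n


def stratified_variance(strata, n):
    """Variance of the stratified mean with proportional allocation.

    strata: dict mapping a stratum label to the list of outcome values in it.
    """
    total = sum(len(values) for values in strata.values())
    variance = 0.0
    for values in strata.values():
        weight = len(values) / total      # W_h, the stratum's population share
        n_h = n * weight                  # proportional allocation of the sample
        variance += weight ** 2 * pvariance(values) / n_h
    return variance


# Hypothetical population in which the outcome depends strongly on the stratum.
strata = {
    "new_visitor": [1.0, 1.2, 0.8, 1.1, 0.9],
    "returning": [5.0, 5.3, 4.7, 5.1, 4.9],
}
population = [x for values in strata.values() for x in values]

n = 4
v_srs = srs_variance(population, n)
v_strat = stratified_variance(strata, n)
print(f"SRS variance: {v_srs:.4f}, stratified variance: {v_strat:.4f}")
```

In this toy population almost all of the outcome variance lies between the two strata, so stratifying on that variable removes most of it. A variable unrelated to the outcome would leave the two variances nearly equal, which is the kind of difference a selection procedure over stratification variables would have to detect.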
12:10 | A Dynamic Time-Frequency Representation and Cross-Attention Model for Production Forecasting in Waterflooding Oilfields ABSTRACT. During the high water-cut stage of waterflood development, the enhanced nonlinearity of reservoir seepage fields and spatiotemporal non-stationarity of injection-production system dynamics pose major challenges to production forecasting. Existing frequency-domain feature modeling methods exhibit critical limitations: sluggish response to abrupt injection-production dynamic shifts, an inherent imbalance in capturing local transients versus extracting global periodicity, and susceptibility to feature coupling confusion when processing non-stationary production data. To address these issues, this paper proposes the dynamically perceptive hybrid deep learning model DynaTCN-Wave-BiLSTM-CA. The framework features: (1) a Dynamic Temporal Convolutional Network (Dynamic-TCN) employing input-feature-driven adaptive dilation coefficients and channel attention to dynamically adjust convolutional receptive fields for efficient cross-scale feature extraction; (2) third-order db4 wavelet packet decomposition to segregate key parameters into frequency bands, integrated with a dual time-frequency attention mechanism for precise multi-scale feature decoupling; and (3) a bidirectional cross-attention fusion module leveraging BiLSTM networks to capture spatiotemporal dynamics and backward-delayed responses of injection-production systems, thereby deeply integrating transient abrupt features in the time domain with frequency-separated components (low-frequency seepage trends and high-frequency equipment noise). Validation using actual production data from an offshore oilfield confirms the model’s superior performance in non-stationary production sequence prediction compared to mainstream methods. |
10:30 | Vim-F: Visual State Space Model Benefiting from Learning in the Frequency Domain PRESENTER: Kun Bian ABSTRACT. In recent years, State Space Models (SSMs) with efficient hardware-aware designs, known as Mamba deep learning models, have made significant progress in modeling long sequences. Compared to Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), Vision Mamba (ViM) methods have not yet achieved fully competitive performance. To enable SSMs to process image data, ViMs typically flatten 2D images into 1D sequences, inevitably ignoring some 2D local dependencies, thereby weakening the model's ability to interpret spatial relationships from a global perspective. We believe that the introduction of frequency domain information can enable ViM to achieve a better global receptive field during the scanning process. We propose a novel model called Vim-F, which employs pure Mamba encoders and scans in both the frequency and spatial domains. Moreover, considering that Mamba remains essentially a recurrent neural network (RNN), we question the necessity of position embedding in ViM and remove it accordingly in Vim-F. Vim-F has good scalability. As far as we know, its variant Vim-F(CF) is the first ViM model to use a convolution-free ViM encoder. Another variant, Vim-F(H), introduces a linear attention mechanism. This reduces the model's sensitivity to the input sequence and achieves better performance. |
10:50 | D-FRGAT: Event Prediction Technology Based on Temporal Knowledge Graph Reasoning ABSTRACT. "Event" refers to a specific incident or occurrence that has a significant impact on human society and the natural world. Predicting such events helps reduce the risk of potential losses. Event prediction technology plays a vital role in ensuring safety, reducing risks, and controlling infectious diseases, among other aspects. Some studies use temporal knowledge graphs to mine the relationships between events and predict the occurrence of future events by analyzing the evolution patterns of historical events. However, there are some challenges in the prediction process. Firstly, some studies model historical events as a time point process to learn the entity evolution representation, ignoring the interaction between concurrent events occurring at the same time. Secondly, when focusing on the interaction of concurrent events, most studies update node features through simple static linear transformations, without considering the importance of semantic information and without dynamically distinguishing different edges under the same relationship type for node representation. Thirdly, existing methods cannot jointly model the temporal information and semantic information of historical events, and lack the ability to dynamically perceive the importance of both. To improve prediction accuracy, we propose a temporal event prediction model (D-FRGAT). Specifically, D-FRGAT incorporates a third-level control mechanism to decay the temporal information, and integrates multi-relational attention networks and node feature adjustment techniques to better capture the correlations among events over time. Experimental results demonstrate that our method achieves superior performance: the MRR on the GDELT and ICEWS18 datasets increased by 10.93% and 11.39%, respectively. |
11:10 | A Comparative Study of Variational and Vector Encoders in Graph User Matching ABSTRACT. Cross-Platform User Identification (CPUI) aims to identify social media accounts belonging to the same real-world user across different platforms. This task is vital for combating cybercrime, where malicious users create multiple accounts, and for enhancing user modeling in fields such as sociology, economics, and epidemiology. Prior research suggests that vector-based encoding of a local network graph may fall short when faced with real-world inconsistencies such as platform dependency and data sparsity. In response, variational encoding, which explicitly models the data as normal distributions, has been proposed as a more robust alternative. In this paper, we present a comparative study of vector and variational encoding approaches in the context of a binary CPUI classification task. To this end, we constructed a synthetic heterogeneous graph derived from 277 research papers authored within an engineering department. Using vector embeddings of the textual context of the papers as features, various models were trained to evaluate the advantage of variational encoding in CPUI. Experimental results show that the standard vector encoding consistently outperforms the variational models in terms of accuracy, F1-score, and AUC-ROC. While all models achieved high performance (accuracy around 90%), there was no empirical advantage to using variational encoding in our experiments. These findings suggest that the benefits of variational encoding may depend on the presence of real-world data inconsistencies that our synthetic dataset lacks. |
11:30 | Improving Nystrom Spectral Clustering with Unsupervised Vector Quantization and Incomplete Cholesky Decomposition ABSTRACT. Spectral clustering is a popular clustering approach with wide applications in various fields including machine learning and pattern recognition. However, when dealing with large-scale datasets, it requires building a very large pairwise similarity matrix, which can be very time-consuming. The Nystrom method is well known for its ability to approximate the feature space with a small number of samples (landmarks), thereby reducing the computation overhead significantly. Motivated by this observation, in this paper we address this problem by approximating the similarity matrix and eigenvectors based on the Nystrom method. First, we present a sampling method to determine the landmarks, which are used in the Nystrom method to obtain the approximate similarity matrix and eigenvectors. By careful use of the k-means++ method and cosine similarity, our method improves the quality of landmarks. Second, we use incomplete Cholesky decomposition to accelerate the approximation method, and therefore improve the efficiency of the whole algorithm. In experiments with synthetic and real datasets, our algorithm is shown to be effective in comparison with some other approaches. |
11:50 | AMCCL: Adaptive Multi-Scale Convolution Fusion Network with Contrastive Learning for Multimodal Sentiment Analysis PRESENTER: Jiakang Yu ABSTRACT. Multimodal Sentiment Analysis (MSA) requires robust representations that capture both cross-modal consistency and intra-modal distinctions. Existing fusion methods often fail to adapt to diverse sentiment cues and neglect inter-modal correlations, while contrastive learning approaches insufficiently consider pair distribution and loss design. We propose an Adaptive Multi-scale Convolution fusion network with Contrastive Learning for multimodal sentiment analysis (AMCCL), which dynamically fuses multimodal information using an Adaptive Multi-scale Convolution (AMC) module. The AMC module fuses features through multi-scale convolutions with adaptive weighting and a squeeze-and-excitation block to enhance salient channels. Our fine-grained contrastive learning leverages sentiment polarity and intensity, with tailored loss functions to strengthen the positive pairs and balance the inter-modal and intra-modal relations. Extensive evaluations on the MOSI and MOSEI datasets confirm that AMCCL delivers superior performance relative to state-of-the-art approaches. |
12:00 | Enhancing Generalized Category Discovery via Chaotic Sparsity Matching ABSTRACT. Generalized Category Discovery (GCD) is a task focused on identifying both known and novel categories within an unlabeled dataset by leveraging another labeled dataset containing only known categories. However, current research in GCD faces several challenges. First, it is difficult to prevent noisy representations in the unlabeled data, which hampers the transfer of knowledge from known to novel categories and consequently leads to suboptimal performance in discovering new categories. Second, after knowledge transfer through calibration from labeled to unlabeled data, the category boundaries are often weakly constrained, resulting in overlaps and ambiguity among categories. To address these issues, this paper proposes Chaotic Sparsity Matching and Transfer (CSMT). Specifically, we utilize the Iterative Truncated Mean to select prototypes from unlabeled data. During the calibration process, we employ Sparsity Matching for knowledge transfer, which helps mitigate the influence of noisy data and improves transfer effectiveness. Additionally, we introduce two levels of alignment: instance-level and category-level. At the instance level, we integrate LORS to generate chaotic instances, enabling the acquisition of instance-level knowledge by aligning original features with chaotic features, thereby enhancing learning for novel categories. At the category level, we incorporate a Margin Constraint mechanism to strengthen category separability and prevent ambiguous prototype assignments. Experiments conducted on three benchmark datasets demonstrate that CSMT significantly outperforms state-of-the-art methods. |
13:30 | Chain-of-Conceptual-Thought Elicits Daily Conversation in Large Language Models PRESENTER: Qingqing Gu ABSTRACT. Chain-of-Thought (CoT) is widely applied to enhance LLM capability in math, coding, and reasoning tasks. However, its performance is limited for open-domain tasks, where there are no clearly defined reasoning steps or logical transitions. To mitigate such challenges, we propose a new prompt-based paradigm called Chain of Conceptual Thoughts (CoCT), which prompts the LLM to first produce a concept tag and then complete the detailed content following that concept. To encourage this hierarchical way of thinking, we implement the concepts with emotions, strategies, and topics. We experiment with this paradigm in daily and emotional support conversations, covering tasks with both in-domain and out-of-domain concept settings. Automatic, human, and LLM-based evaluations reveal that CoCT surpasses several prompt-based baselines such as self-refine, ECoT, SoT, and RAG, suggesting a potential LLM prompting paradigm for a wider scope of tasks. |
13:50 | A Comparative Study of Demonstration Selection for Practical Large Language Models-based Next POI Prediction ABSTRACT. This paper investigates demonstration selection strategies for predicting a user's next point-of-interest (POI) using large language models (LLMs), aiming to accurately forecast a user's subsequent location based on historical check-in data. While in-context learning (ICL) with LLMs has recently gained attention as a promising alternative to traditional supervised approaches, the effectiveness of ICL significantly depends on the selected demonstration. Although previous studies have examined methods such as random selection, embedding-based selection, and task-specific selection, there remains a lack of comprehensive comparative analysis among these strategies. To bridge this gap and clarify the best practices for real-world applications, we comprehensively evaluate existing demonstration selection methods alongside simpler heuristic approaches such as geographical proximity, temporal ordering, and sequential patterns. Extensive experiments conducted on three real-world datasets indicate that these heuristic methods consistently outperform more complex and computationally demanding embedding-based methods. Notably, in certain scenarios, these simpler heuristic methods even surpass fine-tuned models without requiring further training. Our source code is available at: https://anonymous.4open.science/r/poi-demonstration-selection-5B1C. |
14:10 | CLIP-LMFA: Few-Shot Anomaly Detection via Large Language Model-Driven Hybrid Prompts and Multi-Scale Adaptive Fusion ABSTRACT. Industrial anomaly detection plays a key role in ensuring production quality and operational safety. Although large-scale language vision models are gradually being applied to the field of industrial anomaly detection due to their advantages in few-shot scenarios, their limited semantic generalization ability and lack of fine-grained spatial sensitivity hinder their deployment in real-world and high-precision industrial environments. To address these challenges, we propose a CLIP-LMFA framework for few-shot anomaly detection based on CLIP. We introduce a hybrid textual prompt strategy driven by a large language model (LLM) to enhance semantic discrimination while reducing the manual design cost. We design a multi-scale local adaptive fusion (MFEAF) encoder that can jointly capture global semantics and local fine-grained anomalies to achieve pixel-level anomaly segmentation. Without additional fine-tuning or retraining, CLIP-LMFA achieves significant performance improvements on benchmark datasets, outperforming the baseline by 1.3% and 4.5% in the I-AUROC test on the MVTec-AD and Brain datasets, respectively, demonstrating its effectiveness and practicality in real-world industrial applications. Our code is available at: https://github.com/PRICAI25/CLIP-LMFA. |
14:30 | Towards Automation in Log Parsing: Auto-Prompt Optimization with Natural Language Gradients ABSTRACT. Log parsing aims to transform raw log information into fixed log templates and dynamic log parameters, thereby achieving structured data conversion to break down automated analysis barriers and improve the performance of anomaly analysis applications. Existing LLM-based log parsing methods primarily rely on manually crafted prompts or selecting a static candidate prompt through prompt optimization to guide the model in gradually understanding the parsing task. However, due to the high diversity of real-world log data, existing log parsing methods may not be able to dynamically update prompt information based on the input logs, resulting in limited adaptability to different parsing scenarios. To overcome this limitation, we propose a framework called NLGLP, which aims to achieve automated prompt tuning to adapt to diverse data scenarios. Specifically, NLGLP first automatically selects several examples that best match the current task through a candidate example set, then combines automated prompt optimization techniques to dynamically adjust prompts based on natural language gradients, thereby improving adaptability and parsing performance under different log data. Experiments conducted on 16 public loghub datasets demonstrate that NLGLP achieves an average parsing accuracy of 98.9%, obtaining the best current performance. |
14:50 | Learn from the Past: Language-conditioned Object Rearrangement with Large Language Models ABSTRACT. Object manipulation for rearrangement into a specific goal state is a significant task for collaborative robots. Accurately determining object placement is a key challenge, as misalignment can increase task complexity and the risk of collisions, affecting the efficiency of the rearrangement process. Most current methods heavily rely on pre-collected datasets to train the model for predicting the goal position. As a result, these methods are restricted to specific instructions, which limits their broader applicability and generalisation. In this paper, we propose a framework of flexible language-conditioned object rearrangement based on the Large Language Model (LLM). Our approach mimics human reasoning by making use of successful past experiences as a reference to infer the best strategies to achieve a current desired goal position. Based on LLM's strong natural language comprehension and inference ability, our method generalises to handle various everyday objects and free-form language instructions in a zero-shot manner. Experimental results demonstrate that our methods can effectively execute the robotic rearrangement tasks, even those involving long sequences of orders. |
15:10 | KALE-LM-Chem: Vision and Practice Toward an AI Brain for Chemistry PRESENTER: Weichen Dai ABSTRACT. Recent advancements in large language models (LLMs) have demonstrated strong potential for enabling domain-specific intelligence. In this work, we present our vision for building an AI-powered chemical brain, which frames chemical intelligence around four core capabilities: information extraction, semantic parsing, knowledge-based QA, and reasoning & planning. We argue that domain knowledge and logic are essential pillars for enabling such a system to assist and accelerate scientific discovery. To initiate this effort, we introduce our first generation of large language models for chemistry: KALE-LM-Chem and KALE-LM-Chem-1.5, which have achieved outstanding performance in tasks related to the field of chemistry. We hope that our work serves as a strong starting point, helping to realize more intelligent AI and promoting the advancement of human science and technology, as well as societal development. |
13:30 | Enhancing the Forward Forward Algorithm with Label Based Similarity for Improved Neural Network Training ABSTRACT. The Forward Forward (FF) algorithm has been proposed as a biologically plausible alternative to backpropagation for training deep neural networks. It replaces backward gradient computations with a dual forward pass strategy, where each layer independently optimizes a local "goodness" function to distinguish between positively and negatively labeled data. However, due to the lack of global gradient flow and local training, FF based training suffers from poor inter layer coordination and suboptimal label alignment. In this research, we enhance the Forward Forward framework by using the Hilbert Schmidt Independence Criterion (HSIC) to improve the goodness function at each layer. HSIC serves as a label aware statistical dependence measure, encouraging each layer’s output to retain relevant input structure while aligning more closely with the true class labels. Our formulation introduces distinct HSIC based objectives for positive and negative passes: the positive pass maximizes dependence with the true label, while the negative pass penalizes alignment with incorrect labels. This design maintains the local and backpropagation free nature of FF training while promoting global task coherence and refines its ability to differentiate between positive and negative data more effectively, leading to more robust feature representations and improved learning dynamics. Our experimental results demonstrate that this approach substantially improves the accuracy of the FF algorithm across multiple benchmark datasets, narrowing the performance gap with backpropagation while preserving the FF algorithm’s intrinsic advantages. |
13:50 | BMSR: A Bidirectional Multi-hop Predictor with Structure-aware Ranking Loss for NAS ABSTRACT. Performance evaluation is crucial in neural architecture search (NAS), but full training is costly and slow. Performance predictors offer an efficient way to quickly evaluate architectures, significantly speeding up the process. However, existing predictors often trade accuracy for speed or depend on complex encoders and costly pretraining, making it difficult to balance accuracy and efficiency with limited labeled data. In this paper, we propose a Bidirectional Multi-hop predictor with Structure-aware Ranking Loss (BMSR), which is designed for speedy and accurate performance prediction. During feature extraction, BMSR applies a bidirectional multi-hop graph convolution network with hop-aware attention to capture long-range and directional dependencies from architectures. Once the architecture embeddings are obtained, a progressively shrinking MLP is employed to compress them layer by layer, enhancing nonlinear modeling and improving representation quality. In the optimization stage, BMSR adopts a structure-aware ranking loss that leverages topological and operational similarity to encourage stable rankings among architectures. Experiments across multiple NAS benchmarks demonstrate that BMSR achieves competitive performance in both efficiency and accuracy. On NAS-Bench-201, BMSR identifies the optimal architecture using only 100 labeled samples and 8.45 seconds—just 0.3% of the computation time required by prior SOTA methods. Code is anonymously available. |
14:10 | More Imperceptible Adversarial Attack Method on Graph Neural Networks ABSTRACT. Graph Neural Networks (GNNs) have achieved notable success in various graph-related tasks, yet they remain highly susceptible to adversarial attacks. Minor perturbations to graph data, particularly to structure, can significantly impair model performance. Most existing attacks manipulate the graph structure (e.g., adding or removing edges), but due to the sparsity of real-world graphs, such changes often violate the imperceptibility constraint. Furthermore, these methods typically perturb either structure or features in isolation, neglecting the inherent coupling between them. To address these issues, we propose More Imperceptible Adversarial Attack (MIAA), a novel method that jointly perturbs both structural and feature information while maintaining imperceptibility. MIAA introduces a Gradient-Adaptive Permutation Attack (GAPA) method, which disrupts node-edge-feature semantics by permuting existing elements in the graph. Guided by the gradient direction, the attack maximizes prediction loss while preserving global statistical properties. Experiments on benchmark datasets including Cora, Citeseer, and Polblogs show that MIAA significantly outperforms existing methods. Under a perturbation budget constrained to half the node degree, it improves attack success rates by 5.31% to 35.06%, demonstrating both effectiveness and subtlety in adversarial manipulation. |
14:30 | CoTraX: An Efficient Parallel Training Method for On-Policy Deep Reinforcement Learning PRESENTER: Hao Dai ABSTRACT. Deep reinforcement learning (DRL) has significantly advanced artificial agents in complex environments by integrating deep learning with reinforcement learning, demonstrating success in domains such as robotics, reinforcement learning from human feedback (RLHF), and game-playing. However, the alternating training and execution phases, particularly in on-policy methods, introduce substantial synchronization overhead, limiting efficiency. To address this challenge, we analyze the computational interplay between these phases and propose CoTraX, a novel framework that strategically overlaps training and execution to optimize resource utilization and accelerate training. Furthermore, we develop an adaptive control algorithm to mitigate potential adverse effects of overlapping. Extensive experiments demonstrate that CoTraX reduces training time by an average of 9.89% without compromising performance. |
14:50 | SWD-HTM: A Novel Hierarchical Temporal Memory Model Integrating Optimal Transport and Sparse Autoencoder PRESENTER: Ye Wang ABSTRACT. Hierarchical Temporal Memory (HTM) is a biologically inspired, online learning algorithm that emulates neocortical computation for time series modeling. However, its reliance on hand-crafted encoders limits adaptability. Meanwhile, independent encoding and concatenation of multivariate feature embeddings often cause dimension explosion. To overcome these limitations, we propose a novel HTM architecture integrating deep representation learning via a Sparse Autoencoder (SAE) with optimal transport theory. The SAE replaces the original manual encoder and spatial pooler components with a data-driven, end-to-end framework, enhancing generalization. The Sliced Wasserstein Distance (SWD) is introduced to align the SAE’s hidden-layer activation distribution with the target Sparse Distributed Representation (SDR), ensuring sparsity, similarity, and distributivity simultaneously. This alignment minimizes distributional discrepancy while reducing computational complexity. Extensive experiments demonstrate that the proposed SWD-HTM model significantly improves prediction accuracy, achieving 14.3% and 22.1% gains on short-term and long-term forecasting tasks, respectively, outperforming traditional HTM and state-of-the-art baselines. |
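For readers unfamiliar with the Sliced Wasserstein Distance mentioned above, the following is a minimal, generic sketch of a Monte Carlo estimate between two point sets. It is not the SWD-HTM loss itself: the function name, the number of projections, the fixed seed, and the toy "activation" vectors are all hypothetical, and the sketch assumes both sets contain the same number of points.

```python
import math
import random


def sliced_wasserstein(xs, ys, num_projections=64, seed=0):
    """Monte Carlo estimate of the sliced 1-Wasserstein distance.

    xs, ys: equally sized lists of equal-length vectors (assumption of this sketch).
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally sized point sets")
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(num_projections):
        # Draw a random direction on the unit sphere.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(d * d for d in direction))
        direction = [d / norm for d in direction]
        # Project both sets onto the direction and sort the 1D values.
        px = sorted(sum(a * d for a, d in zip(x, direction)) for x in xs)
        py = sorted(sum(a * d for a, d in zip(y, direction)) for y in ys)
        # In one dimension, optimal transport pairs sorted samples in order.
        total += sum(abs(a - b) for a, b in zip(px, py)) / len(px)
    return total / num_projections


# Hypothetical sparse target codes versus dense activations.
sparse_target = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
dense_activations = [[0.5, 0.5, 0.5, 0.5]] * 3

same = sliced_wasserstein(sparse_target, sparse_target)
different = sliced_wasserstein(dense_activations, sparse_target)
print(same, different)
```

Each projection reduces the comparison to sorting two lists of scalars, which is what makes the sliced variant cheap compared with solving a full optimal transport problem in the original dimension.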
13:30 | Multi-Stage Variance-Controlled Gradient Updates: Toward Robust Continual Learning ABSTRACT. Large Language Models (LLMs) have demonstrated remarkable performance and strong generalization across diverse tasks; however, catastrophic forgetting remains a fundamental challenge in continual learning scenarios. MIGU, a label-free approach, alleviates forgetting by selectively updating parameters based on gradient magnitude, thereby improving adaptability. Despite its effectiveness, MIGU relies heavily on manually tuned mask generation thresholds, which incur significant computational overhead and limit scalability. To address these limitations, this paper proposes MVGU, an improved method employing multi-stage variance-controlled gradient updates. At its core, MVGU optimizes pre-mask vector generation and threshold selection strategies to reduce dependence on empirical hyperparameter tuning inherent in MIGU, enhancing training efficiency. Extensive continual learning experiments on T5-Large and LLaMA3-8B Instruct architectures demonstrate that MVGU achieves comparable or superior performance to MIGU with fewer training iterations. Results indicate that MVGU is an effective continual learning strategy, capable of reducing training overhead, mitigating task interference during continual learning, and strengthening model adaptability in dynamic learning environments. |
13:50 | Evolving Task-Specific Fine-Tuning Strategies in Transfer Learning ABSTRACT. Transfer learning plays a critical role in addressing the challenges of limited training data and restricted computing resources. Pre-training enables models to learn general feature representations, and fine-tuning adapts the pre-trained weights to downstream tasks. However, designing suitable fine-tuning strategies often requires extensive manual trial-and-error or exhaustive grid search, which is time-consuming and prone to overfitting when data are scarce. To address this challenge, we propose an evolutionary computation-based framework to optimize task-specific fine-tuning strategies. The framework encodes fine-tuning strategies as chromosomes and employs a tailored genetic algorithm with novel crossover and mutation operators, alongside a performance predictor to guide the evolution towards promising regions with minimal extra training cost. Experimental results on four benchmark datasets demonstrate that the proposed method outperforms hand-crafted strategies and peer adaptive fine-tuning methods in terms of classification accuracy. Further analysis reveals the effectiveness of the newly designed operators and performance predictor, as well as the necessity of task-specific fine-tuning strategies. This study bridges the gap between evolutionary computation and transfer learning by introducing an automated framework for optimizing fine-tuning strategies, offering a practical tool for improving task-specific adaptation. |
14:10 | Enhancing Replay-Based Continual Learning via Predictive Uncertainty Controller ABSTRACT. Continual Learning (CL) aims to develop AI models that learn effectively from sequential tasks while mitigating catastrophic forgetting. Replay-based methods have emerged as a promising solution for CL, which stores a subset of past exemplars and then replays it to preserve prior knowledge. Existing exemplar selection strategies predominantly focus on feature-space representativeness but overlook output distribution variation. In this work, we identify that neighboring samples in feature space may sustain significantly different output probability distributions. This indicates that the nearest neighbors to class-wise mean feature vectors do not consistently serve as optimal representative samples. We further demonstrate that predictive uncertainty serves as a reliable indicator of such non-representative samples. Building on this insight, we propose Predictive Uncertainty Controller (PUC), which aims to benefit replay-based CL methods by filtering out samples with excessive uncertainty. Extensive experiments validate our approach, showing that PUC consistently enhances CL performance when integrated with existing replay-based methods. |
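The idea of filtering replay candidates by predictive uncertainty can be sketched as follows. The abstract does not state how uncertainty is measured or how the cutoff is set, so this sketch assumes Shannon entropy of the predicted class distribution and a fixed threshold; the function names, sample identifiers, and probabilities are hypothetical.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def filter_exemplars(candidates, max_entropy):
    """Keep candidates whose predictive entropy does not exceed max_entropy.

    candidates: list of (sample_id, probs) pairs, where probs sums to 1.
    """
    return [
        sample_id
        for sample_id, probs in candidates
        if predictive_entropy(probs) <= max_entropy
    ]


# Hypothetical candidates that are all close to the class mean in feature space
# but differ in how confident the model's output distribution is.
candidates = [
    ("a", [0.97, 0.02, 0.01]),
    ("b", [0.40, 0.35, 0.25]),
    ("c", [0.85, 0.10, 0.05]),
]

kept = filter_exemplars(candidates, max_entropy=0.6)
print(kept)
```

Here sample "b" is dropped because its output distribution is nearly flat, even though it could sit as close to the class mean as the others. In practice the threshold would need to be tuned or set adaptively.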
14:30 | Predict Social Economic Outcomes by Transferred Knowledge with Satellite Imagery PRESENTER: Zhiqiang Zou ABSTRACT. Traditional deep learning methods and econometric models have played a crucial role in the field of data mining, particularly in the prediction of socioeconomic outcomes. However, socioeconomic information cannot be directly extracted from remote sensing data. Therefore, in this paper, we propose a method that leverages transfer learning to predict socioeconomic indicators (outcomes) through satellite imagery. Specifically, we use road network types as a proxy for socioeconomic factors, which is more effective and stable than using nightlight. We have extracted eleven distinct road topological features to generate reasonable road network types. Given the unique characteristics of road networks, we have constructed and fine-tuned a hybrid pre-trained model that combines ResNet50 and Vision Transformer architectures for the transfer learning task. Through extensive experiments conducted across multiple regions, we demonstrate that our approach outperforms state-of-the-art methods in this field. This work highlights the potential of leveraging road network types as a proxy for socioeconomic information and the effectiveness of our transfer learning-based framework in extracting valuable insights from satellite imagery to support socioeconomic policy decisions. The code has been released at https://github.com/xiachan254/PredSocecOut. |
13:30 | Enhancing Graph Anomaly Detection with Contrastive Pre-training and Pseudo-label Learning ABSTRACT. Graph anomaly detection has been instrumental in many domains, playing a critical role in identifying unusual patterns in graph-structured data. Recent advancements that integrate graph neural networks with contrastive learning have demonstrated significant potential and achieved promising results. Despite this progress, designing a method that is both effective and efficient remains a key challenge. In this paper, we propose a new graph anomaly detection method, TSGAD. Specifically, TSGAD adopts a two-stage design of pre-training and pseudo-label learning. The first stage uses contrastive pre-training to make initial predictions, and the second stage uses pseudo-label learning based on the predictions of the first stage. We also carefully design the two-stage framework to ensure that it scales linearly w.r.t. the graph size. Extensive experimental results on four real-world datasets demonstrate both the effectiveness and efficiency of the proposed method. Overall, TSGAD is fully unsupervised and does not require any manual label annotations; meanwhile, it achieves results comparable to those of its supervised counterpart. |
13:50 | LN_Net: Lightweight Non-interventionist Network for Weakly-Supervised Video Anomaly Detection ABSTRACT. Traditional spatiotemporal modeling for video anomaly detection often suffers from high computational costs and relies on heuristic assumptions (e.g., feature magnitude) that can be flawed and hinder robust representation learning. To address this, we propose LN_Net, a Lightweight Network adopting a non-interventionist strategy to overcome the pitfalls of imposing such potentially misleading priors. This approach grants the model greater flexibility to learn discriminative patterns directly from observational data. LN_Net implements this strategy through two core, efficient innovations: (1) an Efficient Temporal Modeling Module (ETMM) capturing multi-faceted temporal dynamics without convolutions, and (2) an Adaptive Focusing Module (AFM) highlighting salient temporal evidence. Our non-interventionist method achieves competitive detection accuracy (97.77% SH-AUC, 86.21% UCF-AUC). Simultaneously, it demonstrates state-of-the-art efficiency, requiring only about 1/35th the parameters, 1/135th the model size, and 1/51st the inference time compared to recent complex methods like VadCLIP. This highlights its significant practical value for deployment. Our code is available at: https://anonymous.4open.science/r/LN_Net-0324 |
14:10 | PCGS: Pose Conditioned Generative Steganography ABSTRACT. Secure transmission of private information over public channels without arousing suspicion remains a fundamental challenge in steganography. Traditional methods modify pixel-level or frequency-domain features, making them vulnerable to detection and degradation. Recent synthesis-based approaches leverage generative models to embed data but often suffer from limited capacity or visual artifacts. In this work, we propose a pose conditioned generative steganographic framework that decouples message representation from image content. Binary messages are first mapped to human poses using a geometry-aware codebook derived from real-world data. These poses then serve as structural conditions to guide diffusion-based image generation, producing semantically coherent and visually natural stego images. By encoding multiple human poses in a single image, our framework increases message capacity while preserving visual coherence. To enhance robustness, we introduce a randomized linear expansion scheme to stabilize pose-code mapping under occlusion and detection noise. We evaluate the method under various perturbations and assess detectability using state-of-the-art steganalysis models. Experimental results show strong imperceptibility, decoding accuracy, and semantic flexibility, highlighting the effectiveness of our framework in enabling secure and coverless generative steganography. The code for PCGS will be released prior to publication. |
14:30 | MASM-Net: A Multi-Scale Adaptive Mamba Network for Cuffless Blood Pressure Estimation Using Photoplethysmographic Signals PRESENTER: Peiyu Fan ABSTRACT. Continuous blood pressure monitoring is crucial for the prevention and management of cardiovascular diseases. However, conventional cuff-based measurement methods are unsuitable for prolonged ambulatory use due to discomfort and restricted mobility. To overcome these challenges, we present MASM-Net, a deep learning architecture for accurate cuffless blood pressure estimation from single-channel photoplethysmographic (PPG) signals. The proposed model integrates three key components including stacked Adaptive Multi-Scale Convolution (AMSC) modules for comprehensive temporal feature extraction, a Dimension Expansion (DE) module for enhanced feature representation, and a Mamba module for efficient long-range temporal dependency modeling with linear computational complexity. Extensive experiments on two public datasets demonstrate that MASM-Net achieves state-of-the-art performance, with mean absolute errors and standard deviations of 2.25 ± 3.90 mmHg (systolic) and 1.26 ± 2.14 mmHg (diastolic) on the UCI dataset, and 2.69 ± 3.80 mmHg (systolic) and 1.63 ± 2.26 mmHg (diastolic) on the BCG dataset. These results surpass those of existing methods, establishing a robust framework for continual, noninvasive blood pressure monitoring. |
14:50 | Wind Power Curve Data Cleaning Model Based on LOF-ITSM ABSTRACT. Traditional image threshold segmentation methods cannot effectively clean outlier anomalous data in the wind power curve and adapt poorly to the fuzzy boundaries in the curve. To address these problems, this paper proposes a data cleaning method that combines the local outlier factor with image threshold segmentation (LOF-ITSM). First, the LOF algorithm is used to pre-clean the wind power data by identifying and removing outlier anomalous data. Then, the image threshold segmentation method is improved by introducing a local adaptive threshold optimization mechanism, which adapts better to the fuzzy boundaries in the curve and compensates for the LOF algorithm's insufficient sensitivity to stacked anomalous data. The experimental results indicate that this method cleans the abnormal data in the wind power curve more effectively, laying a data foundation for promoting the efficient utilization of wind energy and the green transformation of the energy structure. |
15:10 | HGTUL: A Hypergraph-based Model For Trajectory User Linking ABSTRACT. Trajectory User Linking (TUL) focuses on linking anonymous trajectories with the users who generated them. It is essential for understanding and modeling human mobility patterns. Despite significant advancements in this field, existing studies largely neglect high-order inter-trajectory relationships, i.e., complex associations among multiple trajectories that are often revealed through co-occurrence across multiple locations. Furthermore, they fail to consider the variable influence of Points of Interest (POIs) on different trajectories, as well as the user class imbalance problem caused by disparities in user activity levels and check-in frequencies. To address these limitations, we propose a novel HyperGraph-based Trajectory User Linking model (HGTUL). Our model learns trajectory representations from both relational and spatiotemporal perspectives: (1) It models high-order trajectory associations via a hypergraph and incorporates an attention mechanism to learn the variable impact of POIs; (2) It encodes spatiotemporal characteristics by feeding the temporal and spatial features of each trajectory into a sequential encoder. Furthermore, we introduce a data balancing method to mitigate user class imbalance, and experimentally validate its significance in TUL. Extensive experiments on three real-world datasets show that HGTUL outperforms state-of-the-art baselines, achieving improvements of 2.57%∼20.09% in ACC@1 and 5.68%∼26.00% in Macro-F1 scores. The code is available at https://github.com/changfengjie3003/HGTUL. |
13:30 | Training-free Clothing Region of Interest Self-correction for Virtual Try-On PRESENTER: Shengjie Lu ABSTRACT. Virtual Try-On (VTON) aims at synthesizing the target clothing on a given person, preserving the details of the target clothing while keeping the rest of the person unchanged. Existing methods suffer from discrepancies between the generated clothing and the target clothing in terms of patterns, textures and boundaries. Therefore, we propose to use an energy function to impose constraints on the attention map extracted during the generation process. Thus, at each generation step, the attention can be more focused on the clothing region of interest, making the generation results more consistent with the target clothing details. Furthermore, to address the limitation that existing evaluation metrics concentrate solely on image realism and overlook the alignment with target elements, we design a new metric, Virtual Try-on Inception Distance (VTID), to bridge this gap and ensure a more comprehensive assessment. On the VITON-HD and DressCode datasets, our approach outperforms the previous state-of-the-art (SOTA) methods by 1.4%, 2.3%, 12.3%, and 5.8% on the traditional metrics LPIPS, FID, and KID and the new VTID metric, respectively. Additionally, by applying the generated data to downstream Clothing-Change Re-identification (CC-Reid) methods, we achieve Rank-1 improvements of 2.5%, 1.1%, and 1.6% on the LTCC, PRCC, and VC-Clothes datasets, respectively. The code of our method will be made public after acceptance. |
13:50 | A Vision-Language Fusion Framework for Ethnic Minority Costume Image Captioning PRESENTER: Yujing Huang ABSTRACT. Image description generation for ethnic minority costumes is an important research direction at the intersection of computer vision and natural language processing, with application potential in cultural protection, education, tourism and other fields. On ethnic minority images, existing image description methods face challenges such as oversimplified descriptions, insufficient semantic accuracy and "hallucination", as well as a lack of well-annotated datasets. To address these issues, this paper proposes a framework for generating image descriptions of ethnic minority costumes that integrates memory mechanisms, cognitive computing and multimodal large language models. First, an abstract semantic memory bank is constructed to store ethnic semantic information and to be dynamically invoked. Second, an image semantic understanding method based on cognitive computing is designed, which enhances the expressiveness and interpretability of descriptions through entity recognition, attribute analysis and relationship reasoning. Finally, a multimodal large model is combined to generate more detailed and accurate descriptions through modality alignment. Experiments show that this framework significantly outperforms traditional methods in the accuracy and semantic richness of image descriptions of ethnic minority costumes, achieving the best results in both image recognition accuracy and comprehensive indicators of description generation, providing an innovative technical solution for the digital protection and dissemination of ethnic minority cultures. |
14:10 | Improved GJO optimized CNN-BiLSTM-Attention touchdown speed prediction model ABSTRACT. Accurately predicting the touchdown point speed of civil aviation aircraft is crucial for identifying unstable approaches and estimating runway occupancy time post-landing. To achieve accurate prediction of the touchdown point speed, a CNN-BiLSTM-ATTENTION speed prediction model optimized by an improved golden jackal optimization (IGJO) algorithm is proposed. Multiple factors affecting the aircraft's touchdown speed are comprehensively considered. Input feature data for the model are constructed by combining ADS-B data, flight plan information, meteorological data, and runway information. The CNN-BiLSTM-ATTENTION network is used to extract the deep spatial and temporal features of the data. Meanwhile, the GJO algorithm is improved through Gaussian random walk, spiral search, sine-cosine search strategy, and lens imaging opposition-based learning strategy. The resulting IGJO algorithm has stronger global search ability and higher convergence accuracy. The CNN-BiLSTM-ATTENTION model optimized by IGJO has higher prediction accuracy. Experimental results show that the MAE, MAPE, and RMSE of the prediction results of the IGJO-CNN-BiLSTM-ATTENTION model are 3.2017, 3.06%, and 3.8817 respectively. It has higher prediction accuracy compared with the unoptimized model and the models optimized by GJO, PSO, and DA. This prediction method provides strong support for air traffic control departments to obtain accurate touchdown speed predictions. |
14:30 | BAD: Bidirectional Attention-Guided Distillation for Object Detection ABSTRACT. Feature-based knowledge distillation has been widely adopted to boost lightweight detectors. However, existing methods often rely on ground-truth bounding boxes to localize salient regions for guiding the feature learning of student models. These approaches ignore the significant structural and semantic gap between teacher and student networks, making it challenging to transfer effective knowledge. We propose BAD, a novel distillation framework that introduces bidirectional attention guidance in both spatial and channel dimensions. The core of BAD is to extract spatial response patterns from both teacher and student networks, and fuse them into a joint attention mask. This mask identifies semantically aligned regions without relying on ground-truth boxes, enabling adaptive guidance for the student to focus on task-relevant areas while preserving its representational flexibility. Meanwhile, the masking mechanism helps mitigate training imbalance caused by overly strict alignment. The spatial attention module helps localize contextually important regions, while the channel attention enhances global semantic alignment. We conduct extensive experiments on several popular object detectors, including YOLOv8, RetinaNet, FCOS and Faster R-CNN. The results demonstrate that BAD achieves stable and effective knowledge transfer under heterogeneous architectures and limited model capacity. |
14:50 | Towards Effective Event Argument Extraction via Enhanced Contextual Understanding ABSTRACT. Event Argument Extraction is a crucial task that aims to extract the arguments of specified events from a text and predict their roles. Recent mainstream methods for event argument extraction still fall short in handling long-distance dependencies among arguments, resulting in limited contextual understanding and suboptimal performance. To address these limitations, we propose an effective event argument extraction model named DSEAE. The proposed DSEAE model mainly consists of a dependency-guided module and a structure-aware module, each of which employs a distinct and newly improved self-attention mechanism. The dependency-guided module aims to guide the model in associating different prompts with their corresponding event contexts, whereas the structure-aware module aims to strengthen interaction between event information for better contextual understanding. Experimental results show that our method outperforms the baselines, demonstrating its effectiveness. |
15:10 | Neural Architecture Search of Sample Reweighting Networks for Complex Distribution Shift ABSTRACT. Sample reweighting is a major approach to addressing distribution shifts, such as label noise and class imbalance. Meta-Weight-Net (MW-Net) is a promising sample reweighting network that computes weights based on classification loss. Although MW-Net improves prediction performance under a single type of distribution shift using a simple neural network, its performance degrades when facing both label noise and class imbalance, where it is hard to determine appropriate weights from classification loss alone with a simple network. In this study, we introduce neural architecture search to MW-Net to mitigate such performance degradation. Using the tree-structured Parzen estimator, we search for the optimal number of hidden layers and nodes and select the most suitable intermediate layer in the classification model to serve as the input for MW-Net. Experimental results on versions of the CIFAR-10 and CIFAR-100 datasets modified to include both label noise and class imbalance demonstrate the effectiveness of neural architecture search for MW-Net. |