| 10:30 | Lightweight Neural Architecture Search via Training-free ZiCo-Block Evaluation ABSTRACT. Neural architecture search (NAS) has gained great attention thanks to its ability to automatically discover promising network architectures for given tasks. However, most existing NAS methods require extensive computational resources for architecture evaluation. Meanwhile, the increasing demand for deploying deep neural networks on mobile devices highlights the importance of reducing the model size. To address these challenges, this paper proposes an evolutionary NAS algorithm with efficient blocks using training-free ZiCo-Block evaluation, called EZB-NAS. We design two computationally efficient blocks and construct a hierarchical and variable-length search space to discover lightweight architectures. In addition, we improve the ZiCo proxy to reduce the structural bias in block evaluation by averaging the ZiCo scores across all layers within a block. The developed zero-cost proxy, named ZiCo-Block, is integrated into an evolutionary computation approach for lightweight architecture design. With improved genetic operators, we simultaneously optimize the ZiCo-Block score and the number of parameters to discover highly accurate and lightweight architectures. Experimental results on the CIFAR datasets demonstrate the good performance of the proposed EZB-NAS algorithm in terms of accuracy, model size, and computational cost, compared to efficient NAS algorithms. |
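The block-level averaging described in this abstract can be sketched in a few lines. The per-layer score below follows the general ZiCo form (absolute mean over standard deviation of gradients, summed over parameters); the gradient-statistics shape, epsilon terms, and normalization are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def layer_zico(grads):
    """Per-layer ZiCo-style score from gradients collected over a few
    mini-batches: log of the summed |mean gradient| / std-gradient ratio.
    `grads` has shape (n_batches, n_params)."""
    mean = np.abs(grads.mean(axis=0))
    std = grads.std(axis=0) + 1e-8  # avoid division by zero
    return np.log(np.sum(mean / std) + 1e-8)

def zico_block_score(block_layer_grads):
    """ZiCo-Block: average the per-layer scores within a block rather
    than summing over layers, reducing the structural bias toward
    blocks with more layers."""
    return float(np.mean([layer_zico(g) for g in block_layer_grads]))
```

Averaging (rather than summing) makes scores comparable between blocks of different depth, which is the bias-reduction idea the abstract describes.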
| 10:50 | A Consistent Lebesgue Measure for Multi-label Learning ABSTRACT. Multi-label loss functions are usually non-differentiable, requiring surrogate loss functions for gradient-based optimisation. The consistency of surrogate loss functions is not proven and is exacerbated by the conflicting nature of multi-label loss functions. To directly learn from multiple related, yet potentially conflicting multi-label loss functions, we propose a Consistent Lebesgue Measure-based Multi-label Learner (CLML) and prove that CLML can achieve theoretical consistency under a Bayes risk framework. Empirical evidence supports our theory by demonstrating that: (1) CLML can consistently achieve state-of-the-art results; (2) the primary performance factor is the Lebesgue measure design, as CLML optimises a simpler feedforward model without additional label graph, perturbation-based conditioning, or semantic embeddings; and (3) an analysis of the results not only distinguishes CLML's effectiveness but also highlights inconsistencies between the surrogate and the desired loss functions. |
| 11:10 | Reasoning from Norms to Collective Agency ABSTRACT. In this paper, we investigate collective agency from both normative and formal perspectives. Moving beyond prior literature that primarily emphasizes preference dependency or coalitional power, our analysis highlights the critical role of norms --- encoded as ordered (conditional) imperatives --- in facilitating the emergence of collective agency. Drawing insights from game theory and default logic, we propose a novel default theory to capture how norms change game scenarios. We present formal proofs establishing corresponding relationships between this default theory and strategic game models, and extend these findings to multi-game scenarios. Our technical results enhance the understanding of the interplay between norms and agency, providing an initial formalization for applying these interconnected theories. |
| 11:30 | Evolving Task-Specific Fine-Tuning Strategies in Transfer Learning ABSTRACT. Transfer learning plays a critical role in addressing the challenges of limited training data and restricted computing resources. Pre-training enables models to learn general feature representations, and fine-tuning adapts the pre-trained weights to downstream tasks. However, designing suitable fine-tuning strategies often requires extensive manual trial-and-error or exhaustive grid search, which is time-consuming and prone to overfitting when data are scarce. To address this challenge, we propose an evolutionary computation-based framework to optimize task-specific fine-tuning strategies. The framework encodes fine-tuning strategies as chromosomes and employs a tailored genetic algorithm with novel crossover and mutation operators, alongside a performance predictor to guide the evolution towards promising regions with minimal extra training cost. Experimental results on four benchmark datasets demonstrate that the proposed method outperforms hand-crafted strategies and peer adaptive fine-tuning methods in terms of classification accuracy. Further analysis reveals the effectiveness of the newly designed operators and performance predictor, as well as the necessity of task-specific fine-tuning strategies. This study bridges the gap between evolutionary computation and transfer learning by introducing an automated framework for optimizing fine-tuning strategies, offering a practical tool for improving task-specific adaptation. |
| 11:50 | Enhancing Replay-Based Continual Learning via Predictive Uncertainty Controller ABSTRACT. Continual Learning (CL) aims to develop AI models that learn effectively from sequential tasks while mitigating catastrophic forgetting. Replay-based methods have emerged as a promising solution for CL, which stores a subset of past exemplars and then replays it to preserve prior knowledge. Existing exemplar selection strategies predominantly focus on feature-space representativeness but overlook output distribution variation. In this work, we identify that neighboring samples in feature space may sustain significantly different output probability distributions. This indicates that the nearest neighbors to class-wise mean feature vectors do not consistently serve as optimal representative samples. We further demonstrate that predictive uncertainty serves as a reliable indicator of such non-representative samples. Building on this insight, we propose Predictive Uncertainty Controller (PUC), which aims to benefit replay-based CL methods by filtering out samples with excessive uncertainty. Extensive experiments validate our approach, showing that PUC consistently enhances CL performance when integrated with existing replay-based methods. |
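The filtering step this abstract describes can be sketched as an entropy cutoff on candidate exemplars. Treating "excessive uncertainty" as predictive entropy above a fixed threshold is an assumption for illustration, and `max_entropy` is a hypothetical hyperparameter, not a value from the paper.

```python
import numpy as np

def predictive_entropy(probs):
    """Shannon entropy of each sample's predicted class distribution.
    `probs` has shape (n_samples, n_classes), rows summing to 1."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def puc_filter(candidate_probs, max_entropy):
    """Keep only candidate exemplars whose predictive uncertainty is
    at most the threshold; the surviving indices can feed any
    replay-based CL method's exemplar buffer."""
    ent = predictive_entropy(candidate_probs)
    return np.where(ent <= max_entropy)[0]
```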
| 12:10 | Predict Social Economic Outcomes by Transferred Knowledge with Satellite Imagery PRESENTER: Zhiqiang Zou ABSTRACT. Traditional deep learning methods and econometric models have played a crucial role in the field of data mining, particularly in the prediction of socioeconomic outcomes. However, socioeconomic information cannot be directly extracted from remote sensing data. In this paper, we therefore propose a method that leverages transfer learning to predict socioeconomic indicators (outcomes) from satellite imagery. Specifically, we use road network types as a proxy for socioeconomic factors, which is more effective and stable than using nightlight data. We extract eleven distinct road topological features to generate reasonable road network types. Given the unique characteristics of road networks, we construct and fine-tune a hybrid pre-trained model that combines ResNet50 and Vision Transformer architectures for the transfer learning task. Through extensive experiments conducted across multiple regions, we demonstrate that our approach outperforms state-of-the-art methods in this field. This work highlights the potential of leveraging road network types as a proxy for socioeconomic information and the effectiveness of our transfer learning-based framework in extracting valuable insights from satellite imagery to support socioeconomic policy decisions. The code has been released at https://github.com/xiachan254/PredSocecOut. |
| 10:30 | Towards Interpretable Load Forecasting: A Liquid Neural Network Approach with Temporal and Feature Importance Modeling PRESENTER: Ziqian Liu ABSTRACT. Accurate and interpretable load forecasting is critical for modern power systems. Existing methods have made notable progress; however, with the development of smart grids, load data have become increasingly non-stationary under the influence of multiple factors. Furthermore, model interpretability is subject to higher demands due to the need for regulatory compliance in power system operations. To address these challenges, we propose an interpretable forecasting model, the Feature-weighted Liquid-core model with INterpretable Temporal attention (FLINT), which incorporates a feature-weighted strategy to dynamically assess the contribution of heterogeneous input features, leverages Liquid Neural Networks to capture the dynamic non-stationarity of load data, and integrates a time-aware attention mechanism to model temporal dependencies and highlight critical time steps. Further, we introduce a multi-level interpretability module that, from both global and local perspectives, explains prediction outcomes by assessing input feature importance, highlighting critical time steps, and tracing the causes of abrupt load changes. Specifically, gradient attribution and gating quantify global feature contributions, the sparse, bio-inspired Liquid Neural Networks (LNNs) architecture enables traceable mutation-level reasoning, and attention highlights key temporal points. Empirically, a comparison with six baseline models on three real-world load datasets demonstrates that FLINT achieves approximately a 4% improvement in forecasting performance compared to the strongest baseline, as measured by MAE, while offering superior interpretability. |
| 10:50 | FuzzyProbNet: An Interpretable Fuzzy Probabilistic Network for Cement Compressive Strength Prediction ABSTRACT. The compressive strength of cement is a critical indicator for evaluating its quality and ensuring the safety and durability of engineering structures. However, traditional physical testing methods, characterized by long durations and high costs, fail to meet the demands of modern intelligent construction for rapid and economical assessment. Consequently, the development of advanced predictive models is of paramount importance. Currently, prevailing predictive models often face a "trilemma" where prediction accuracy, uncertainty quantification, and model interpretability are difficult to achieve simultaneously. The "black-box" nature of these models restricts their application in safety-critical domains. To address this challenge, this paper proposes a novel Fuzzy Probabilistic Network (FuzzyProbNet). This model transforms numerical inputs into interpretable semantic concepts through a learnable fuzzification process, extracts robust deep features using a Variational Autoencoder, and ultimately generates a complete predictive probability distribution via a Gaussian Mixture output head. Experimental results demonstrate that the proposed FuzzyProbNet outperforms baseline models across various metrics for both point and probabilistic prediction. Furthermore, visualization and analysis of the model's internal workings validate its clear decision-making logic and inherent interpretability. |
| 11:10 | Uncertainty Estimation by Human Perception versus Neural Models ABSTRACT. Modern neural networks (NNs) often achieve high predictive accuracy but remain poorly calibrated, producing overconfident predictions even when wrong. This miscalibration poses serious challenges in applications where reliable uncertainty estimates are critical. In this work, we investigate how human perceptual uncertainty compares to uncertainty estimated by NNs. Using three vision benchmarks annotated with both human disagreement and crowdsourced confidence, we assess the correlation between model-predicted uncertainty and human-perceived uncertainty. Our results show that current methods only weakly align with human intuition, with correlations varying significantly across tasks and uncertainty metrics. Notably, we find that incorporating human-derived soft labels into the training process can improve calibration without compromising accuracy. These findings reveal a persistent gap between model and human uncertainty and highlight the potential of leveraging human insights to guide the development of more trustworthy AI systems. |
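The soft-label idea mentioned at the end of this abstract can be sketched as a convex blend of gold one-hot labels with crowd annotation distributions. The linear blend and the mixing weight `alpha` are illustrative assumptions, not the paper's exact training scheme.

```python
import numpy as np

def soft_labels(onehot, human_dist, alpha=0.3):
    """Blend gold one-hot labels with human-derived label distributions.
    Each row of the result is still a valid probability distribution and
    can serve as the target of a standard cross-entropy loss, letting
    human disagreement soften overconfident targets."""
    return (1.0 - alpha) * np.asarray(onehot) + alpha * np.asarray(human_dist)
```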
| 11:30 | Belief Change with Full Memory and Trust ABSTRACT. We consider belief change in a situation where agents have full memory of all information that has been reported over time. In this context, we no longer have an a priori initial belief state. Instead, we have a history of past reports along with a trust state that indicates how strongly each information source is trusted. If we have a static model of trust, then this approach essentially gives a variation of regular iterated revision. However, we introduce a model of trust change, where trust levels can increase or decrease based on agreement between sources. In this case, we end up with a new kind of belief change operator. The new operator can abandon sources and re-integrate them over time, while maintaining beliefs that are justified both by trust and by an underlying Darwiche-Pearl operator. |
Zoom link: https://vuw.zoom.us/j/97046252862
| 10:30 | PRISM: Principled Reasoning for Identifying and Suppressing Model Biases at Scale PRESENTER: Xunfei Zhu ABSTRACT. Large language models (LLMs) have shown impressive capabilities in diverse applications, from complex reasoning to creative generation. However, these models often rely on spurious correlations rather than causal understanding, leading to systematic biases that compromise their fairness and reliability. Current debiasing methods frequently approach bias as a single-dimensional problem, lack frameworks to differentiate between causal relationships and spurious patterns, and typically require extensive model modifications or domain-specific knowledge. We introduce PRISM, a novel framework that treats bias as a multi-dimensional causal phenomenon and operates through prompt-based learning without model modification. PRISM consists of three core elements: Dimensional Bias Identification (DBI), which isolates distinct causal dimensions of bias; Targeted Example Synthesis (TES), which creates counterfactual examples highlighting specific bias aspects; and Discriminative Learning Enhancement (DLE), which uses these examples to help models distinguish genuine features from spurious correlations. Our comprehensive evaluation across multiple datasets and model architectures demonstrates that PRISM consistently outperforms existing debiasing techniques, particularly for complex, multi-dimensional biases. Additional experiments confirm PRISM's generalizability across different models and datasets, establishing it as a flexible and effective approach to creating more fair and reliable language models. |
| 10:50 | PRESENTER: Muyang Xu ABSTRACT. Code-switching (CSW) refers to the phenomenon where multilingual speakers integrate multiple languages within a single utterance. Although recent studies have made notable progress in developing Grammatical Error Correction (GEC) systems for CSW scenarios involving Chinese lexical items, many still rely on simplistic translation-based data generation, which often limits semantic diversity and fails to capture the complexity of natural CSW expressions. To address this issue, we propose a multi-stage data construction approach to enrich training datasets and improve model generalization. Specifically, we first employ a model-based generation method to produce monolingual augmented data, followed by a perplexity-based (PPL) adaptive filtering algorithm to ensure data diversity and quality. Next, we apply three levels of translation-based augmentation to both the filtered and the original datasets, effectively simulating natural CSW patterns at varying levels of complexity. Finally, we perform multi-stage model training on the combined datasets to progressively enhance model robustness across diverse data distributions. Experimental results show that our optimized model achieves an average improvement of 1.82 $F_{0.5}$ points across two CSW GEC test sets, demonstrating the effectiveness of the proposed approach. |
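The perplexity-based adaptive filtering step can be sketched as follows. Computing perplexity from per-token log-probabilities is standard; the keep-band around the corpus median is an illustrative assumption, and `ppl_of` stands in for whatever language-model scorer supplies the PPL values.

```python
import math

def perplexity(logprobs):
    """Sentence perplexity from per-token log-probabilities (natural log)."""
    return math.exp(-sum(logprobs) / len(logprobs))

def adaptive_ppl_filter(samples, ppl_of, keep_band=(0.5, 1.5)):
    """Keep augmented sentences whose perplexity lies in an adaptive band
    around the corpus median, rejecting both degenerate (too predictable)
    and noisy (too surprising) generations."""
    ppls = sorted(ppl_of(s) for s in samples)
    median = ppls[len(ppls) // 2]
    lo, hi = keep_band[0] * median, keep_band[1] * median
    return [s for s in samples if lo <= ppl_of(s) <= hi]
```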
| 11:10 | A Data Augmentation Approach Using Sentiment Analysis and WordNet-Based Comments Transformation ABSTRACT. With the proliferation of social media, cyberbullying has emerged as a significant societal issue, necessitating robust detection mechanisms. This paper presents a novel data augmentation approach using Sentiment Analysis and WordNet-Based Comment Transformation (SAWCT), designed to enhance the performance of cyberbullying detection models. Our method leverages Aspect Sentiment Triplet Extraction (ASTE) to identify aspect and opinion words within comments and employs WordNet to generate semantically rich augmentations. We demonstrate the effectiveness of SAWCT across three Large Language Models (LLMs) and three datasets: HateXplain, OLID, and Cyberbullying Tweets (CT). The results indicate consistent improvements in precision, recall, and F1 score, with the most significant enhancements observed when combining aspect and opinion words. Our ablation study further validates the importance of both components in improving detection accuracy. By addressing the Long Tail phenomenon and mitigating overfitting in LLMs, SAWCT provides a substantial advancement in the field of cyberbullying detection. This work contributes a new perspective on leveraging sentiment analysis and WordNet to augment training data and enhance model understanding of nuanced language use in cyberbullying contents. |
| 11:30 | HyKAG: Hybrid Knowledge-Aware Retrieval-Augmented Generation for Knowledge-Intensive Questions ABSTRACT. Knowledge-intensive Questions typically require Large Language Models (LLMs) to retrieve external knowledge beyond their parametric memory to generate factually accurate and human-aligned answers. Retrieval-Augmented Generation (RAG), a reliable technique for supplementing LLMs with external information, enhances generation quality and mitigates hallucination by incorporating retrieved knowledge into the reasoning process. However, existing multi-step retrieval RAG methods are prone to introducing a large number of irrelevant passages during deep exploration of external knowledge bases and remain constrained by one-sided exploration strategies. This hinders effective exploration and utilization of high-quality knowledge, ultimately leading to unreliable reasoning and answers. To this end, we propose a novel Hybrid Knowledge-Aware RAG (HyKAG) framework for knowledge-intensive questions. Specifically, to enable deeper exploration of high-quality external knowledge and enhance the model’s knowledge awareness, we first propose hybrid knowledge expansion and refinement modules that enrich retrieved content from dual retrieval perspectives and refine it through an incremental cross-step integration strategy. Furthermore, we introduce a hybrid knowledge-aware adaptive retrieval module that formulates high-quality retrieval decisions by leveraging the refined hybrid knowledge, thereby facilitating deeper knowledge exploration. Extensive empirical results on four datasets demonstrate the superiority of HyKAG. |
| 11:50 | ABSTRACT. Autoregressive decoding makes inference for Large Language Models (LLMs) both memory bandwidth-bound and time-consuming. In this paper, we reconsider the draft head paradigm in speculative decoding and derive two key observations. First, existing draft heads are sequentially independent, speculating on draft tokens without considering their preceding context within the continuation. Second, highly ambiguous tokens disproportionately corrupt the effective length of draft sequences generated by draft heads. Based on these insights, we propose Morpheus, a draft head that generates draft tokens sequentially in an autoregressive manner. By integrating features from the target model and from the draft head itself at the previous time step, Morpheus effectively extends the average acceptance length, thereby increasing the end-to-end decoding rate. We conducted comprehensive evaluations of Morpheus on code generation and text generation tasks. For Vicuna 7B, Morpheus improves decoding speed by 1.15x and 2.5x compared to Medusa decoding and autoregressive decoding, respectively. |
| 12:00 | AMCCL: Adaptive Multi-Scale Convolution Fusion Network with Contrastive Learning for Multimodal Sentiment Analysis PRESENTER: Jiakang Yu ABSTRACT. Multimodal Sentiment Analysis (MSA) requires robust representations that capture both cross-modal consistency and intra-modal distinctions. Existing fusion methods often fail to adapt to diverse sentiment cues and neglect inter-modal correlations, while contrastive learning approaches insufficiently consider pair distribution and loss design. We propose an Adaptive Multi-scale Convolution fusion network with Contrastive Learning for multimodal sentiment analysis (AMCCL), which dynamically fuses multimodal information using an Adaptive Multiscale Convolution (AMC) module. The AMC module dynamically fuses features through multi-scale convolutions with adaptive weighting and squeeze-and-excitation block to enhance salient channels. Our fine-grained contrastive learning leverages sentiment polarity and intensity, with tailored loss functions to strengthen the positive pairs and balance the intermodal and intra-modal relations. Extensive evaluations on the MOSI and MOSEI datasets confirm that AMCCL delivers superior performance relative to state-of-the-art approaches. |
| 12:10 | KALE-LM-Chem: Vision and Practice Toward an AI Brain for Chemistry PRESENTER: Weichen Dai ABSTRACT. Recent advancements in large language models (LLMs) have demonstrated strong potential for enabling domain-specific intelligence. In this work, we present our vision for building an AI-powered chemical brain, which frames chemical intelligence around four core capabilities: information extraction, semantic parsing, knowledge-based QA, and reasoning & planning. We argue that domain knowledge and logic are essential pillars for enabling such a system to assist and accelerate scientific discovery. To initiate this effort, we introduce our first generation of large language models for chemistry: KALE-LM-Chem and KALE-LM-Chem-1.5, which have achieved outstanding performance in tasks related to the field of chemistry. We hope that our work serves as a strong starting point, helping to realize more intelligent AI and promoting the advancement of human science and technology, as well as societal development. |
Zoom link: https://vuw.zoom.us/j/93664289896
| 10:30 | FreezeSeg2RL: Frozen Segmentation Pretraining for Reinforcement Learning on Vascular Interventional Robot Autonomous Delivering PRESENTER: Ziyang Mei ABSTRACT. Vascular interventional robotic systems play a critical role in protecting physicians from X-ray radiation exposure during vascular surgery procedures. An AI-copilot autonomous delivering capability can further enhance physicians' experience with robotic systems. However, extracting features from real-time X-ray images and generating operational decisions is challenging. To address these challenges, this paper proposes an X-ray simulation platform, which provides data and environments for pre-training vision encoders and training an agent by reinforcement learning. The agent integrates a pre-trained vision encoder for interventional instruments with an actor-critic network. This paper validates the impact of different vision encoders, different feature fusion schemes, and different reinforcement learning methods on agent performance. The optimized solution demonstrated superior performance in simulation benchmarks and was successfully transferred to a real robotic system. Experiments demonstrate this method's ability to generate effective strategies from real-time X-ray inputs and show promising clinical robotics applications. |
| 10:50 | Residual-based Adaptive Domain Decomposition Method for the Physics-Informed Neural Network ABSTRACT. The physics-informed neural network (PINN) has emerged as a powerful framework for solving partial differential equations (PDEs) by incorporating physical laws as constraints in the loss function. However, utilizing a single neural network approximator across the entire computational domain may encounter challenges in converging to the correct solutions, especially for equations where different regions exhibit diverse properties. In this paper, we introduce the residual-based adaptive domain decomposition method for the PINN (RA-PINN), which adaptively partitions the computational domain and assigns sub-networks to solve each subdomain independently. The RA-PINN eliminates the need for manual domain division in existing PINN domain decomposition methods by enabling the network to autonomously infer subdomains based on residual information. The RA-PINN offers a more efficient alternative for solving PDEs, achieving enhanced adaptivity and reduced reliance on prior knowledge about the solution or domain characteristics. We conducted experiments on three representative cases with sharp solution variations, demonstrating that the RA-PINN outperforms the PINN and other non-adaptive domain decomposition methods in capturing localized features and ensuring solution accuracy. |
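The core residual-based partitioning idea can be sketched in a few lines. This is a one-shot split by residual magnitude for illustration only; the paper's method infers subdomains adaptively during training, and the fixed `threshold` is an assumed hyperparameter.

```python
import numpy as np

def residual_partition(points, residual_fn, threshold):
    """Split collocation points by PDE residual magnitude: points with
    large residual form a subdomain that gets its own sub-network,
    while the rest stay with the base network."""
    r = np.abs(residual_fn(points))
    return points[r <= threshold], points[r > threshold]
```

Concentrating extra network capacity where the residual is large is what lets the decomposition track sharp solution variations without manual domain division.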
| 11:10 | Empowering Graph Contrastive Learning with Topological Rationale ABSTRACT. Graph contrastive learning (GCL) methods are dedicated to modeling the invariant information from graphs via well-crafted graph augmentation or stochastic encoder perturbation. Although prevailing methods have achieved great progress, we argue that they overlook the essential topological invariant information, referred to as topological rationale. In this regard, we conduct exploratory experiments that visually demonstrate the deficiency of GCL methods in capturing topological rationale and reveal the positive correlation between this deficiency and the discriminability degeneration of GCL methods. To this end, we introduce a novel plug-and-play approach, termed Topological Rationale-enhanced Graph Contrastive Learning (TRGCL). Specifically, TRGCL integrates the node-level and substructure-level topological rationale learning modules in the topological rationale learning stage, thereby empowering the GCL encoder to capture topological invariant information sufficiently. Furthermore, we introduce a semantic-orthogonal adaptive weighting module to ensure that the derived topological rationale remains complementary to semantic information. Theoretically, we revisit the paradigm of GCL from the causal perspective and substantiate the theoretical validity of TRGCL. Experimental results on various datasets in the domains of social networks and biochemical molecules demonstrate the effectiveness of TRGCL. |
| 11:30 | HDCompression: Hybrid-Diffusion Image Compression for Ultra-Low Bitrates PRESENTER: Lei Lu ABSTRACT. Image compression under ultra-low bitrates remains challenging for both conventional learned image compression (LIC) and generative vector-quantized (VQ) modeling. Conventional LIC suffers from severe artifacts due to heavy quantization, while generative VQ modeling gives poor fidelity due to the mismatch between learned generative priors and specific inputs. In this work, we propose Hybrid-Diffusion Image Compression (HDCompression), a dual-stream framework that utilizes both generative VQ-modeling and diffusion models, as well as conventional LIC, to achieve both high fidelity and high perceptual quality. Different from previous hybrid methods that directly use pre-trained LIC models to generate low-quality fidelity-preserving information from heavily quantized latent, we use diffusion models to extract high-quality complementary fidelity information from the ground-truth input, which can enhance the system performance in several aspects: improving indices map prediction, enhancing the fidelity-preserving output of the LIC stream, and refining conditioned image reconstruction with VQ-latent correction. In addition, our diffusion model is based on a dense representative vector (DRV), which is lightweight with very simple sampling schedulers. Extensive experiments demonstrate that our HDCompression outperforms the previous conventional LIC, generative VQ-modeling, and hybrid frameworks in both quantitative metrics and qualitative visualization, providing balanced and robust compression performance at ultra-low bitrates. |
| 11:50 | Spatio-temporal dynamic multi-scale graph convolutional networks for traffic flow prediction PRESENTER: Jiawei Wang ABSTRACT. Graph-constructed traffic flow prediction has recently achieved significant advances in transportation research. Existing methods predominantly rely on predefined spatial adjacency graphs to model spatio-temporal relationships. However, these static adjacency matrices inadequately represent the complex spatio-temporal correlations among road network nodes and fail to capture dynamic interactions that evolve over time. This paper proposes a novel traffic prediction model called the Spatial-Temporal Dynamic Multiscale Graph Convolutional Network (SDMGCN). The SDMGCN first models the dynamic properties of node spatial correlations through attention mechanisms, constructing the Dynamic Interaction Perception Graph (DIPG). It then innovatively proposes the Multi-Order Augmented Graph Convolution Module (MOAGCM), which adaptively adjusts node weights through a multi-order information aggregation mechanism. When combined with the DIPG, it captures deeper dynamic spatial dependencies between nodes. Finally, the multiscale time-gated convolution module captures temporal dependencies at various time scales. Experimental evaluations on two real-world traffic datasets demonstrate that the SDMGCN model significantly outperforms state-of-the-art methods. |
| 12:10 | UniAVLM: Unified Large Audio-Visual Language Models for Comprehensive Video Understanding PRESENTER: Lecheng Yan ABSTRACT. Modern video understanding requires integrating multimodal signals, but current Multimodal Large Language Models (MLLMs) often process audio and visual streams separately, missing key relationships and causing fragmented understanding with a disjointed audio-visual representation. In this work, we propose UniAVLM, a large audio-visual language model for comprehensive video understanding, which first employs Whisper-style audio feature extraction to capture relevant auditory information. We then introduce spatiotemporal position encoding to enhance the video representation with temporal dynamics. Finally, we implement cross-modal attention mechanisms to explicitly fuse the audio and visual features, allowing the model to learn the intricate relationships between these modalities and creating a cohesive multimodal representation. We conduct extensive experiments on the Audio-Visual Scene-Aware Dialogue (AVSD) benchmark, comparing our model against seven representative multimodal baselines, and demonstrate state-of-the-art performance, with our model achieving 48.91% accuracy and 89.93 BERTScore-F1. Specifically, our model outperforms the best vision-language model by 6.79% accuracy and surpasses the state-of-the-art full multimodal model by 4.07% accuracy, while using only parameter-efficient fine-tuning. Comprehensive ablation studies highlight the critical impact of lightweight integration strategies and thorough cross-modal fusion on comprehensive video understanding. |
| 13:30 | PEARL: Plan Exploration and Adaptive Reinforcement Learning for Multihop Tool Use ABSTRACT. Large Language Models (LLMs) show great potential with external tools, but face significant challenges in complex, multi-turn tool invocation. They often exhibit weak planning, tool hallucination, and erroneous parameter generation, and struggle with robust interaction. To tackle these issues, we present PEARL, a novel framework to enhance LLM planning and execution for sophisticated tool use. PEARL adopts a two-stage approach: an offline phase where the agent explores tools to learn valid usage patterns and failure conditions, and an online reinforcement learning phase. In the online phase, a dedicated Planner is trained via Group Relative Policy Optimization (GRPO) with a carefully designed reward function that provides distinct signals for planning quality. Experiments on the ToolHop and T-Eval benchmarks show PEARL significantly outperforms existing methods, achieving a new state-of-the-art success rate of 56.5% on ToolHop while maintaining a low invocation error rate. Our work marks a key advance in addressing the complex planning challenges of tool use, contributing to the development of more robust and reliable LLM-based agents. |
| 13:50 | ColorFP: Improving AI-Generated Text Detection via Fixed Vocabulary Partitioning and Half-Bit Fingerprinting PRESENTER: He Li ABSTRACT. With the rapid proliferation of large language models (LLMs), their misuse has engendered significant societal concern. Accordingly, the development of efficient and robust AI-generated text detection has emerged as a pivotal strategy for mitigating the potential abuse of LLMs. Existing approaches predominantly rely on a compute-intensive fine-tuning paradigm to capture the implicit stylistic cues of AI-generated text. However, fine-tuning such detectors not only incurs substantial overhead but also yields poor robustness, as classification based solely on stylistic cues fails against textual adversarial attacks. This paper introduces ColorFP, a robust AI-generated text detection framework based on fixed vocabulary partitioning and half-bit fingerprinting. Specifically, to achieve the optimal trade-off between detection success rates and generated text quality, we introduce a novel probabilistically biased half-bit fingerprint encoding. To enhance detection robustness, we employ a static hash-seeded pseudorandom number generator to ensure consistent vocabulary partitioning across distinct fingerprints, thereby mitigating the challenges posed by textual adversarial attacks. To comprehensively evaluate ColorFP, we assembled a corpus of fingerprinted text outputs from five LLMs; results show that ColorFP outperforms all baselines—achieving a 93.70% average F1 in a five-class setting—while reducing time and computational overhead by up to 30× compared to state-of-the-art approaches. |
| 14:10 | SASP-NMT: Syntax-Aware Structured Prompting for Low-Resource Neural Machine Translation PRESENTER: Hao Xing ABSTRACT. In Neural Machine Translation (NMT) with Large Language Models (LLMs), prompting has become the predominant approach for adapting to a new translation task without requiring extensive fine-tuning data. However, when translating low-resource language pairs, conventional prompts, built as simple linear text, struggle to represent richer dependency or constituency syntax, making it difficult for LLMs to grasp the source language's syntactic patterns and semantic nuances and thus impairing translation quality. To address this challenge, this paper proposes Syntax-Aware Structured Prompting (SASP). Since word-level embeddings are insufficient for capturing the overall semantics of a sentence and are susceptible to interference from sentence length and word frequency, we encode source and candidate sentences with sentence-level embeddings and retrieve several semantically similar sentences from the target-language monolingual corpus. Subsequently, each retrieved sentence undergoes fine-grained dependency parsing to extract clause-level subject-verb-object structures as well as part-of-speech information. These syntactic patterns are then organized into clause-level structural templates and integrated with the retrieved example sentences to form a structured prompt, enhancing translation quality. We evaluate SASP on the Mongolian-Chinese (Mo-Zh), Uyghur-Chinese (Ug-Zh), and Tibetan-Chinese (Ti-Zh) language pairs using the CCMT2019 corpus. Experimental results show that SASP consistently improves translation quality across all tasks, achieving up to a 13.4% improvement over zero-shot baselines. These findings indicate that incorporating structured syntactic knowledge into prompt design can significantly enhance the performance of LLMs in low-resource machine translation, particularly in terms of syntactic accuracy and target-language consistency. |
| 14:30 | AgentFactory: Towards Automated Agentic System Design and Optimization ABSTRACT. Large Language Models (LLMs) have demonstrated remarkable capabilities as powerful components in agentic systems, enabling sophisticated reasoning and complex task execution. However, current approaches to designing and optimizing agentic systems rely heavily on manual effort, limiting their adaptability and scalability. Recent work has explored the automated optimization of workflow designs. However, these approaches often overlook the crucial role of model capabilities and focus on single performance metrics, failing to address real-world deployment constraints. In this paper, we present AgentFactory, a framework that jointly optimizes both foundation models and workflow structures in agentic systems while considering multiple objectives including performance, cost, and efficiency. AgentFactory leverages advanced LLMs as optimizers (such as GPT-4o and DeepSeek V3) to navigate the vast search space of possible configurations, employing a three-stage optimization pipeline (planning, tuning, and workflow design) to automatically discover effective combinations of fine-tuned models and optimized workflows. Through an iterative optimization process, our framework systematically explores and evaluates different agentic system designs, adapting to task-specific requirements while maintaining operational efficiency. We evaluate AgentFactory across eight benchmarks spanning five domains, including general reasoning, coding, mathematics, medicine, and finance. Our experiments demonstrate that AgentFactory consistently outperforms both manually designed methods and existing automated approaches, achieving an average improvement of 9.1% across all benchmarks, with particularly significant gains in domain-specific tasks (19.6% on MedQA and 18.7% on FinEval). These results establish AgentFactory as a promising approach for developing more capable and efficient agentic systems through automated optimization. |
| 14:50 | Context-Aware and Knowledge-Grounded Conversational Recommendation with Prompt Learning ABSTRACT. Conversational Recommender Systems (CRSs) aim to provide personalized recommendations through multi-turn dialogues. While Large Language Models (LLMs) have shown promise in handling conversational and recommendation tasks, integrating user preference, contextual knowledge, and generation quality remains a significant challenge. In this work, we propose GraphPromptCRS, a prompt-based and knowledge-grounded CRS framework that jointly performs recommendation and response generation with a frozen LLM. Our system leverages soft prompt learning to encode task-specific information without fine-tuning all model parameters. To enhance the model’s reasoning capabilities, we introduce a GraphRAG-based knowledge construction pipeline that builds dynamic knowledge graphs from dialogue history using structured prompts. Additionally, we incorporate a Community Prompt Enhancer to capture users’ topical preferences, guiding personalized and context-aware generation. Experimental results on the ReDial dataset demonstrate that GraphPromptCRS significantly outperforms baselines in both recommendation accuracy and conversational diversity, validating the effectiveness of our approach. |
Zoom link: https://vuw.zoom.us/j/94581076597?pwd=HIwJVdF6FunOCzb7THbay7h6vIpMOc.1
Password: 563873
| 13:30 | When Vision Becomes a Threat: Adversarial Prompt Injection via Visual Embedding Manipulation PRESENTER: Yajing Ma ABSTRACT. Recent progress in multimodal large language models (MLLMs) has brought impressive capabilities but also introduced critical safety vulnerabilities due to their susceptibility to adversarial manipulation. Unlike textual inputs that pass through symbolic-level filtering, visual inputs are mapped into continuous embeddings by frozen vision encoders and injected directly into the language model without explicit safety checks. This unveils an overlooked security risk. We propose the first stealthy embedding-level jailbreak attack that directly perturbs visual token embeddings to inject harmful semantics, thereby bypassing alignment filters and reliably triggering unsafe behavior in MLLMs. Our method constructs a latent semantic embedding matrix from a curated and model-assisted harmful text set and blends it into selected visual token embeddings. To validate the effectiveness of our injection approach, we systematically evaluate multiple spatial attack strategies guided by a segment-wise sensitivity analysis. Experiments on three representative MLLMs (LLaVA-1.5, LLaVA-1.6, and mPLUG-Owl2) demonstrate that our method achieves significantly higher attack success rates (ASR), outperforming the strongest baselines by up to 4.5% absolute ASR. Our results demonstrate that embedding-level injection presents a potent and stealthy jailbreak vector, outperforming prior methods and revealing an overlooked threat surface in MLLMs. |
| 13:50 | Phoneme-Based Optimization of Enrollment Selection for Speaker Identification PRESENTER: Long-Quoc Le ABSTRACT. This paper presents a novel phoneme-based approach to enrollment utterance selection for speaker identification. Unlike conventional strategies that ignore linguistic diversity, our method explicitly maximizes phoneme coverage in the enrollment set, yielding more representative and robust speaker profiles. We demonstrate that increasing phoneme diversity directly improves speaker embeddings and identification accuracy, even under real-world speech variability. Experiments on the Vietnam-Celeb dataset with the state-of-the-art ECAPA-TDNN model show that our approach boosts identification accuracy from 93.6% to 95.7% and F1-score from 95.5% to 96.1%, relative to standard selection methods. Remarkably, these gains are achieved with fewer enrollment utterances, substantially reducing user effort. Analysis reveals a near-linear relationship between phoneme coverage and classification performance, highlighting phoneme diversity as a critical factor for effective enrollment. These findings underscore the practical value of our method for building more accurate, efficient, and user-friendly speaker identification systems. |
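The selection criterion described in the abstract above can be illustrated with a greedy maximum-coverage sketch; the function name and data layout below are illustrative assumptions, not the paper's implementation:

```python
def select_enrollment(utterances, k=3):
    """Greedily pick up to k utterances that maximize phoneme coverage.

    utterances: dict mapping utterance id -> set of phonemes it contains
    (a hypothetical representation for illustration).
    Returns the chosen utterance ids and the phonemes they cover.
    """
    chosen, covered = [], set()
    pool = dict(utterances)
    for _ in range(k):
        if not pool:
            break
        # pick the utterance contributing the most new phonemes
        best = max(pool, key=lambda u: len(pool[u] - covered))
        if not pool[best] - covered:
            break  # no remaining utterance adds new phonemes
        chosen.append(best)
        covered |= pool.pop(best)
    return chosen, covered
```

The near-linear relationship the authors report between phoneme coverage and accuracy is what motivates maximizing coverage rather than simply taking more utterances.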
| 14:10 | A Quadratic Programming Framework Unifying Different Types of Visual Servoing With Obstacle Avoidance for Joint-Constrained Robots PRESENTER: Shan Zhang ABSTRACT. Visual servoing (VS) is a control technique that employs visual features captured by a camera to guide robots toward desired targets. According to the retrieved visual features, VS is commonly divided into position-based VS (PBVS), image-based VS (IBVS) and homography-based VS (HBVS). Apart from the specified VS task, obstacle avoidance (OA) and joint-limit avoidance (JLA) are crucial for ensuring safety and reliability of the robot. This paper focuses on developing a quadratic programming (QP) framework that unifies the aforementioned different types of visual servoing with OA and JLA capabilities for joint-constrained redundant robots. Then, a gradient-dynamics based neurodynamic network (GDNN) is designed to serve as a QP solver. Simulations and experiments conducted using two Franka Emika Panda robots demonstrate the validity and practicality of the established QP framework for achieving VS tasks with OA and JLA considered. |
| 14:30 | Few-Shot Document-Level Relation Extraction Based on Chain-of-thought with Discriminative Multi-view Prototype Tuning ABSTRACT. Few-shot document-level relation extraction (FSDLRE) seeks to uncover semantic relations among entities in a document when only a handful of labeled examples are available. Existing prototype-based meta-learning methods build class prototypes for matching but suffer from two key limitations: 1) Inadequate NOTA modeling. By focusing on "learning-to-match," they neglect robust representations for None-of-the-Above (NOTA) cases. 2) Underutilized supervision and reasoning. They do not fully exploit multi-view supervisory signals nor harness the structured reasoning capabilities of large language models (LLMs), limiting their ability to adapt to new domains under scarce labels. In this paper, we propose Chain-of-Thought with Discriminative Multi-view Prototype Tuning (CDMPT), which harnesses the discriminative power of large language models to mine supervisory signals from scarce labeled data through multiple complementary views. By harnessing the chain-of-thought mechanism, our method treats each few-shot episode's target relation classes as individual sub-domains and performs a structured prototype construction analysis and discriminative reasoning process to complete the FSDLRE task. Extensive experiments demonstrate that our approach significantly boosts performance in few-shot document-level relation extraction on average, especially under cross-domain settings. |
| 14:50 | DEIMerge: An Automatic Program Repair Framework Based on Multi-agent Collaboration and Intelligent Patch Merging ABSTRACT. Large language models (LLMs) have made significant progress in software engineering (SWE) tasks, such as code generation and automatic repair. However, current SWE agents have limited effectiveness when handling complex software defects and often overlook potentially useful information in failed patches. To address this, this study proposes an automatic program repair (APR) framework called DeIMerge, which is based on multi-agent collaboration and intelligent patch merging. After voting to select the optimal patch, the framework uses an LLM to analyse all failed patches deeply, fuse scattered repair clues, and generate high-quality merged patches. Experimental results show that this method increases the single-patch repair rate of open-source SWE agents from 27.3% to 37.0%, with an overall maximum repair rate of 57.3%. This framework has been validated for its versatility across multiple mainstream LLMs. Multi-agent patch merging can effectively extract repair clues from failed patches and significantly improve automatic repair performance, providing new insights into solving complex defects. |
| 15:10 | Chain-of-Conceptual-Thought Elicits Daily Conversation in Large Language Models PRESENTER: Qingqing Gu ABSTRACT. Chain-of-Thought (CoT) is widely applied to enhance the LLM capability in math, coding and reasoning tasks. However, its performance is limited for open-domain tasks, where there are no clearly defined reasoning steps or logical transitions. To mitigate such challenges, we propose a new prompt-based paradigm called Chain of Conceptual Thoughts (CoCT), which prompts the LLM to first produce a concept tag and then complete the detailed content following that concept. To encourage this hierarchical way of thinking, we implement the concepts with emotions, strategies and topics. We experiment with this paradigm in daily and emotional support conversations, covering tasks with both in-domain and out-of-domain concept settings. Automatic, human, and LLM-based evaluations reveal that CoCT surpasses several prompt-based baselines such as self-refine, ECoT, SoT and RAG, suggesting a potential solution of LLM prompting paradigm for a wider scope of tasks. |
| 13:30 | Solving Low-dose Computed Tomography inverse problem by learning the first-order score of the sparse sinogram samples' distribution PRESENTER: Yuchen Quan ABSTRACT. Computed Tomography is widely used to acquire the internal structure of a target object in a non-invasive way. To obtain a high-quality reconstruction, densely distributed detectors are typically used to raise the sampling rate and avoid the artifacts caused by angular undersampling. However, a high X-ray dose is harmful to the human body, which makes sparse-view measurement urgently needed. Many current methods fail to make full use of the information from different domains, and therefore cannot produce reliable reconstructions. This paper shows that the low-dose Computed Tomography inverse problem can be solved from the perspective of an inpainting task in the measurement domain (Radon domain). In addition, a method based on a score-based diffusion model is proposed, and several properties of the sinogram are exploited to achieve a more reliable result: a 14% improvement in PSNR and a further improvement in SSIM. |
| 13:50 | A Dual-Domain Perception and Fuzzy Learning Enhanced Framework for Diabetic Retinopathy Grading PRESENTER: Ye Wang ABSTRACT. Diabetic retinopathy (DR) is a leading cause of preventable vision loss worldwide. Accurate DR grading remains a major challenge due to substantial variability in lesion size and morphology, indistinct lesion boundaries, and subtle lesion characteristics that often resemble normal retinal tissue. To address these issues, we propose a novel Transformer-based framework that integrates dual-domain perception with fuzzy learning to enhance DR grading performance. Specifically, we design an Inverted Residual Fuzzy Block (IRFB) to improve lesion localization. It assigns adaptive fuzzy weights in both channel and spatial domains, effectively enhancing lesion-relevant features while suppressing irrelevant information. Furthermore, we introduce a Fuzzy Learning-based Multi-Scale Feature Enhancement (FMFE) module, which captures refined multi-scale representations and mitigates feature redundancy. To further improve global lesion feature extraction and contextual information, we propose the Dual-Domain Perception Transformer (DDPT). This module models both spatial and frequency domain characteristics via domain-specific self-attention mechanisms and employs a cross-attention strategy to fuse complementary information across domains. By combining spatial and frequency-domain features, our model achieves deeper contextual understanding and robust representation of complex lesion structures. Our model achieves a Quadratic Weighted Kappa (QWK) of 94.7% and an accuracy of 90.8% on the APTOS-2019 dataset, and a QWK of 85.9% and an accuracy of 86.3% on the DDR dataset, outperforming existing methods and demonstrating the effectiveness and robustness of our approach. |
| 14:10 | From Catch to Product: Machine Learning Driven Spectral Analysis for Fish Processing Line Allocation ABSTRACT. Efficiently allocating fish catch to appropriate production lines is vital for maximizing economic value in the seafood industry. This paper proposes a machine learning-based framework for non-destructive fish-to-product classification using vibrational spectroscopy techniques. Instead of predicting accurate biochemical compositions, the allocation task is formulated as a multi-class classification problem, with classes derived from expert-driven clustering of biochemical profiles. To address the challenges posed by noisy and limited spectral data, a novel data processing framework is proposed, which integrates data augmentation, feature selection, and feature fusion. This framework enriches the training dataset through domain-inspired linear spectral augmentation and employs selective feature fusion to extract robust and complementary features from multiple spectral modalities. The resulting fused features are then used to train standard classifiers, leading to improved classification performance. Experimental results on real-world fish spectral datasets demonstrate the effectiveness of this approach, offering a practical tool for intelligent production line allocation in fish processing. |
| 14:30 | MediVerse: AI-Powered Interactive Voice-Driven Virtual Reality for Health Data Analytics PRESENTER: Rani Adam ABSTRACT. Biomedical data is increasingly complex, and existing interfaces often fall short in supporting intuitive, immersive exploration. We present MediVerse, a novel edge-cloud architecture that integrates voice-based natural language interfaces, immersive virtual reality (VR) visualization, and large language model (LLM)-based query translation to enable real-time, hands-free interaction with complex biomedical datasets. MediVerse leverages head-mounted VR displays for voice input, cloud-based orchestration for query interpretation and generation, and real-time 3D data rendering in an immersive environment. We demonstrate the platform through two case studies with biomedical data and evaluate its performance across 20 benchmark queries. Our findings highlight the system’s ability to accurately interpret user intent, maintain low-latency responsiveness, and deliver immersive, context-aware visualizations. This work introduces a reusable, modular framework that enhances voice-driven, LLM-assisted biomedical analytics in VR and lays the foundation for next-generation immersive data systems. |
Zoom link: https://vuw.zoom.us/j/97046252862
| 13:30 | Personalized Knowledge Tracing Model with Memory Reinforcement and Forgetting-Aware Mechanisms ABSTRACT. Knowledge Tracing (KT), a fundamental technology in online intelligent education systems, is designed to model learners' learning processes and monitor the dynamic evolution of their knowledge states. Learners' memory of acquired knowledge decays over time, with forgetting patterns varying based on individual cognitive characteristics. However, most existing KT models adopt a unified and simplified forgetting function, which fails to accurately capture individualized memory decay and distinguish between the effects of content similarity and time on learning performance, leading to a significant decline in accuracy when predicting long-term learning outcomes. To address this, we propose a Memory-Enhanced Personalized Diagnostic Knowledge Tracing model (MLEKT), which integrates a forgetting-aware linear bias, an error-boosted spaced repetition algorithm, and genetic algorithm optimization to precisely model forgetting behaviors in long learning sequences. Specifically, this paper first designs a forgetting-enhancement module based on a spaced repetition algorithm to provide more fine-grained and personalized forgetting enhancement for different types of learning interactions. Second, a forgetting-aware linear bias mechanism is introduced to effectively distinguish the effects of content similarity and time. Finally, a genetic algorithm-based optimization method for personalized forgetting enhancement values is proposed, enabling personalized parameter configurations for different learners. Experiments on three public datasets demonstrate that the MLEKT model outperforms baseline methods in both accuracy and stability, with significant advantages in long-sequence learning interactions. |
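As background for the personalized forgetting modeling in the abstract above, a minimal sketch of the classical exponential forgetting curve with a per-learner stability parameter (MLEKT's actual forgetting function, linear bias, and spaced-repetition update are more elaborate; this only illustrates the individualized-decay idea):

```python
import math

def retention(delta_t, stability):
    """Predicted recall probability after delta_t time units.

    stability is a per-learner parameter: larger values mean
    slower forgetting (an Ebbinghaus-style exponential curve).
    """
    return math.exp(-delta_t / stability)
```

A model with a single shared `stability` for all learners is exactly the "unified and simplified forgetting function" the authors argue against; fitting it per learner is the personalization step.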
| 13:50 | High-Order Information Embedding Transfer for Clustering with Constrained Laplacian Rank PRESENTER: Lijuan Wang ABSTRACT. Most Constrained Laplacian Rank methods rely on first-order similarity graphs, which capture only direct neighbor relations and constrain the discovery of latent high-order structures, such as indirect connections, particularly in single-view settings. Furthermore, simply fusing multi-order proximity matrices without structural alignment results in redundancy and inconsistency. As the proximity order increases, it tends to introduce irrelevant high-order information, degrading clustering performance and stability. To address these issues, we propose a high-order clustering method that constructs anchor-sample high-order bipartite graphs using recursively computed SVD-based similarity matrices, effectively capturing indirect neighborhood relations and enhancing global structural representation. A unified feature embedding space is designed to enable cross-order knowledge transfer via co-clustering, where low-order embeddings guide high-order representations, improving structural consistency and feature discriminability while filtering irrelevant links. Additionally, nuclear norm and sparsity regularization are applied to suppress redundancy and enhance robustness. Experiments on five public datasets show that our method consistently outperforms state-of-the-art approaches across four clustering metrics, validating its effectiveness and resilience. |
| 14:10 | Balanced Learning for Incremental Multi-view Clustering ABSTRACT. In practice, the number of views can increase over time, and repeatedly fusing all views upon the arrival of each new view can result in high computational costs and accumulated redundancy. Additionally, early-acquired views may become unavailable due to privacy, storage, or data expiration issues, leading to reduced consistency and poor clustering performance. To solve these issues, we propose Balanced Learning for Incremental Multi-View Clustering (BIMC), which incrementally constructs a unified matrix to preserve view information over time. To further enhance clustering performance, each new view is integrated using balanced learning that reduces feature distribution shifts and erroneous connections between clusters while maintaining consistency within clusters. Finally, to further enhance consistency, cluster labels are directly obtained from the consensus graph by enforcing a Laplacian rank constraint, enabling unified graph construction and clustering. Experimental results demonstrate that BIMC achieves superior clustering performance and efficient view fusion across diverse multi-view datasets. |
| 14:30 | Federated Dual-Clustered Prototype Learning under Domain Heterogeneity ABSTRACT. Federated learning allows clients to collaboratively train models while safeguarding the privacy of their data. Existing methods typically assume that data from different clients originates from the same domain or distribution. Nonetheless, owing to regional constraints, data features from diverse clients demonstrate notable variations, termed as domain heterogeneity. The naive aggregation of models trained on such heterogeneous data can result in a global model that is biased towards dominant domains and generalizes poorly to others. Therefore, we expect the global model to have better generalization performance in different domains. In this paper, we propose a federated dual-clustered prototype learning(FedCPL) framework, a novel approach designed to counteract domain heterogeneity and improve model generalization. The key insight is to construct a shareable global prototype through dual-clustering, effectively minimizing the discrepancy among feature representations from disparate domains. On the client side, we introduce weighted contrastive learning and feature fusion to align local features, thereby mitigating domain-specific biases during model training. On the server side, an adaptive weighted aggregation strategy is introduced to prioritize contributions from more challenging domains. Extensive experiments on multiple benchmark datasets demonstrate that FedCPL significantly outperforms existing methods in scenarios with domain heterogeneity. |
| 14:50 | TVL-Filter: Total Variation Loss–Based Sample Filter for Efficient Adversarial Detection PRESENTER: Fei Zhang ABSTRACT. DNN models in computer vision are vulnerable to adversarial samples that are crafted with imperceptible perturbations, which can lead to unpredictable security risks. Currently, there are many countermeasures proposed in the literature to detect adversarial samples and mitigate their impact. However, these detection algorithms introduce significant computational overhead, limiting their practicality. To address this, two insights motivate this study: 1) for those deployed DNN models, the majority of inputs are benign samples that do not need to undergo detection; 2) the crafted perturbations of adversarial samples can be regarded as a type of high-frequency noise signal. To this end, we propose the Total Variation Loss–Based Sample Filter (TVL-Filter), a plug-in module designed for efficient adversarial detection, which employs the TV-loss value to evaluate samples' high-frequency noise signals, and filters out a significant portion of benign samples before detection accordingly. TVL-Filter helps to substantially reduce the adversarial detection overhead with an acceptable sacrifice of the detection precision. Our experiments indicate that after employing the TVL-Filter, three state-of-the-art detection algorithms achieve speedups of up to 8.73x, 8.32x, and 7.06x, with adversarial sample detection accuracy losses of only 2%, 2.90%, and 1.13%, respectively. |
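The filtering idea in the abstract above can be sketched with the standard anisotropic total-variation loss, which grows with high-frequency content; the threshold, exact TV variant, and function names here are illustrative assumptions rather than the paper's configuration:

```python
import numpy as np

def tv_loss(img):
    """Anisotropic total variation of a 2-D image: the sum of absolute
    differences between vertically and horizontally adjacent pixels."""
    img = np.asarray(img, dtype=float)
    dv = np.abs(np.diff(img, axis=0)).sum()  # vertical neighbours
    dh = np.abs(np.diff(img, axis=1)).sum()  # horizontal neighbours
    return dv + dh

def tvl_filter(images, threshold):
    """Forward only high-TV samples (likely carrying adversarial
    high-frequency noise) to the downstream detector; low-TV samples
    are treated as benign and skip detection entirely."""
    return [img for img in images if tv_loss(img) > threshold]
```

Because smooth benign inputs fall below the threshold and bypass the detector, the expensive detection algorithm runs only on the suspicious minority, which is the source of the reported speedups.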
| 15:10 | Dual-Aspect Enhancement of Data Replay: Influence-Guided Replay and Contrastive Gradient Modulation ABSTRACT. Recent advances in language models have significantly improved natural language processing. However, these models face challenges in continual learning (CL), particularly in retaining previously acquired knowledge while assimilating new information, a problem known as catastrophic forgetting. We revisited the concept of data replay in continual learning and introduced two novel improvements: the Influence-Guided Sampling (IGS) strategy for memory buffer construction and the Contrastive Gradient Modulation (CGM) mechanism for parameter update, aiming to mitigate catastrophic forgetting and enhance knowledge transfer. IGS-CGM not only replays past data but also modulates the current task's gradient through a contrastive analysis with gradients from previous tasks, thereby preserving the model's proficiency in previously acquired domains while learning new ones. We conducted extensive experiments on three CL benchmarks, covering traditional finetuning and instruction finetuning for large language models, demonstrating its effectiveness in mitigating catastrophic forgetting and enhancing knowledge transfer. |
Zoom link: https://vuw.zoom.us/j/93664289896
| 16:00 | VISP: Volatility Informed Stochastic Projection for Adaptive Regularization ABSTRACT. We propose VISP: Volatility Informed Stochastic Projection, an adaptive regularization method that leverages gradient volatility to guide stochastic noise injection in deep neural networks. Unlike conventional techniques that apply uniform noise or fixed dropout rates, VISP dynamically computes volatility from gradient statistics and uses it to scale a stochastic projection matrix. This mechanism selectively regularizes inputs and hidden nodes that exhibit higher uncertainty while preserving stable representations, thereby mitigating overfitting. Extensive experiments on MNIST, CIFAR-10, and SVHN demonstrate that VISP consistently improves generalization performance over baseline models and fixed-noise alternatives. In addition, detailed analyses of the evolution of volatility, the spectral properties of the projection matrix, and activation distributions reveal that VISP not only stabilizes the internal dynamics of the network but also fosters a more robust feature representation. These findings suggest that data-dependent, volatility-driven regularization is a promising direction for enhancing the performance of deep neural architectures. |
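The volatility-scaled noise injection in the VISP abstract above can be sketched as follows, simplifying the paper's stochastic projection matrix to per-unit scaling; the function names, the base scale, and the use of a plain standard-normal draw are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def volatility(grad_history):
    """Per-unit gradient volatility: the standard deviation over a
    window of recent gradient snapshots (list of equal-shape arrays)."""
    return np.std(np.stack(grad_history), axis=0)

def visp_noise(activations, grad_history, base_scale=0.1):
    """Add noise scaled by each unit's gradient volatility, so units
    with unstable gradients are regularized more strongly while
    stable units are left essentially untouched."""
    vol = volatility(grad_history)
    noise = rng.standard_normal(activations.shape) * base_scale * vol
    return activations + noise
```

A unit whose gradient never changes receives zero noise, which captures the selective aspect the abstract emphasizes: regularization concentrates where uncertainty is highest.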
| 16:20 | Incremental Hashing with Asymmetric Distance for Image Retrieval in Non-stationary Environments PRESENTER: Zihao Zhan ABSTRACT. Existing online hashing methods generally employ the Hamming distance for similarity evaluation, which leads to the loss of data location information. Candidate images may have the same Hamming distance from the query but different similarity to it, which reduces the retrieval accuracy. This is especially problematic in non-stationary data environments, where concept drift is prevalent: the loss of location information makes it harder to capture distribution changes in the data environment. To alleviate these concerns, Incremental Hashing with Asymmetric Distance (ICHAD) is proposed in this paper for image retrieval in non-stationary environments. In ICHAD, an online asymmetric distance based on learned hash codes is employed for similarity evaluation. It preserves the location information of the data more accurately and is computed efficiently without accessing the old data. Experimental results show that ICHAD outperforms existing hashing methods in various non-stationary data scenarios with concept drift. |
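The general idea of asymmetric distance, keeping the query real-valued while the database stays binary, can be sketched as below; this is a generic illustration of the concept, not ICHAD's exact formulation:

```python
import numpy as np

def hamming_dist(a, b):
    """Symmetric Hamming distance between two binary codes."""
    return np.sum(np.asarray(a) != np.asarray(b))

def asymmetric_dist(query_real, code):
    """Distance between an unquantized query embedding and a binary
    code. Keeping the query real-valued preserves location
    information that quantizing both sides would discard."""
    # map {0, 1} codes to {-1, +1} so signs match the embedding space
    signed = 2 * np.asarray(code) - 1
    return np.linalg.norm(np.asarray(query_real) - signed)
```

Two candidates at the same Hamming distance from the quantized query can thus still be ranked differently, which is precisely the ambiguity the abstract identifies.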
| 16:40 | A Neural Subgraph Counting Method based on Matching Matrix ABSTRACT. Subgraph counting aims to compute the number of subgraphs in a data graph G that match a given query graph q, which has been applied in various fields such as bioinformatics, data mining, and social network analysis. Early methods fundamentally rely on enumerating all possible subgraphs, but they face high computational cost because enumerating all possible subgraphs is an NP-complete problem. To reduce this complexity, approximate methods have gained attention, as in many cases approximate counts are sufficient for decision-making or identifying trends. Recently, researchers have begun applying GNNs to approximate subgraph counting tasks, yet existing GNN-based methods suffer from inefficiencies caused by unpromising data vertices and limited use of the matching information between query and data vertices. To address these challenges, we propose a Neural Subgraph Counting method based on Matching Matrix, namely MMNSC, which consists of two key components: (1) Candidates Extraction, which retrieves candidate substructures from the data graph using a new filtering method, and (2) Matching Matrix Estimator, a learning-based estimator that generates a matching matrix between the query graph and the data graph. Through experiments on five real-world data graphs, MMNSC demonstrates superior performance over existing state-of-the-art methods. |
| 17:00 | Partial Multi-Label Feature Selection Based on Matrix Elastic-Net ABSTRACT. Partial Multi-Label Learning (PML) is an extension of multi-label classification where each training instance is associated with a set of candidate labels that contains both relevant and noisy labels. The presence of high-dimensional feature representations in PML exacerbates learning complexity, thereby heightening the model’s sensitivity to noise and irrelevant information. To tackle this issue, we propose a Partial Multi-Label Feature Selection method based on Matrix Elastic-Net (PMLFS-MEN). Firstly, we employ a low-rank and sparse decomposition to separate the candidate label matrix into a low-rank ground-truth label matrix and a sparse noise label matrix. Subsequently, we incorporate matrix elastic-net regularization, where the nuclear norm is regarded as the l_{1}-norm of the singular values of the ground-truth matrix and the Frobenius norm as its l_{2}-norm, thereby encouraging a balanced low-rank structure and improving stability. Moreover, a non-convex l_{2,1-2}-norm is introduced to achieve a sparse solution, thus improving the ability of the selected features to discriminate between labels. Extensive experiments on both synthetic and real-world PML datasets validate that PMLFS-MEN achieves superior performance over state-of-the-art partial and traditional multi-label feature selection methods. |
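The nuclear-plus-Frobenius penalty described above admits a closed-form proximal operator obtained by shrinking singular values; the sketch below is a minimal illustration of that operator in isolation (the function name and the stand-alone prox setting are assumptions for exposition, not the authors' full PMLFS-MEN algorithm):

```python
import numpy as np

def matrix_elastic_net_prox(Y, lam, mu):
    """Proximal operator of lam*||X||_* + (mu/2)*||X||_F^2 at Y.

    By unitary invariance this reduces to a per-singular-value problem:
    each sigma is shrunk to max(sigma - lam, 0) / (1 + mu), which both
    lowers the rank (the l1 part on singular values) and stabilizes the
    solution (the l2 part on singular values).
    """
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_shrunk = np.maximum(s - lam, 0.0) / (1.0 + mu)
    return U @ (s_shrunk[:, None] * Vt)
```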
| 17:20 | A Dynamic Time-Frequency Representation and Cross-Attention Model for Production Forecasting in Waterflooding Oilfields ABSTRACT. During the high water-cut stage of waterflood development, the enhanced nonlinearity of reservoir seepage fields and spatiotemporal non-stationarity of injection-production system dynamics pose major challenges to production forecasting. Existing frequency-domain feature modeling methods exhibit critical limitations: sluggish response to abrupt injection-production dynamic shifts, an inherent imbalance in capturing local transients versus extracting global periodicity, and susceptibility to feature coupling confusion when processing non-stationary production data. To address these issues, this paper proposes the dynamically perceptive hybrid deep learning model DynaTCN-Wave-BiLSTM-CA. The framework features: (1) a Dynamic Temporal Convolutional Network (Dynamic-TCN) employing input-feature-driven adaptive dilation coefficients and channel attention to dynamically adjust convolutional receptive fields for efficient cross-scale feature extraction; (2) third-order db4 wavelet packet decomposition to segregate key parameters into frequency bands, integrated with a dual time-frequency attention mechanism for precise multi-scale feature decoupling; and (3) a bidirectional cross-attention fusion module leveraging BiLSTM networks to capture spatiotemporal dynamics and backward-delayed responses of injection-production systems, thereby deeply integrating transient abrupt features in the time domain with frequency-separated components (low-frequency seepage trends and high-frequency equipment noise). Validation using actual production data from an offshore oilfield confirms the model’s superior performance in non-stationary production sequence prediction compared to mainstream methods. |
Zoom link: https://vuw.zoom.us/j/94581076597?pwd=HIwJVdF6FunOCzb7THbay7h6vIpMOc.1
Password: 563873
| 16:00 | GAM: A Generative Autoencoder for Diverse Human Motion Prediction ABSTRACT. Diverse human motion prediction focuses on forecasting plausible future human motions from a past motion sequence, and has attracted widespread attention. Note that there exists a discrepancy between the latent space dimension and the original human motion dimension. This discrepancy makes generative methods hard to train, which in turn limits the diversity of the generated samples. In this paper, we propose a novel method, called GAM, which maps both the observed human motion sequence and the reconstructed human motion sequence into the latent space with an encoder network, and then minimizes the divergence between these sequences in the latent space. Specifically, we enforce consistency in both the data space and the latent space with latent reconstruction losses. This effectively aligns the human motion sequence with its latent representation, and mitigates the challenges posed by the uncertainty of human motion and the inherent dimensionality differences. In addition, we employ the Mamba model to extract the spatio-temporal features of human motion dynamics. Extensive experiments were conducted on two standard benchmark datasets, Human3.6M and HumanEva-I; the results demonstrate that our method surpasses the current state-of-the-art baselines in terms of both sample diversity and accuracy. |
| 16:20 | CLAF: A Critical Learning Period-Aware Adaptive Framework for Federated Learning in Heterogeneous Environments ABSTRACT. Federated Learning (FL) enables privacy-preserving collaborative model training across decentralized clients. While adaptive client selection and knowledge distillation (KD) offer potential efficiency gains by monitoring client progress, existing methods lack systematic understanding of pervasive client and data heterogeneity in practical settings. Prevailing FL approaches assume homogeneous clients with equal importance and capability, selected uniformly at random – an assumption contradicted by Critical Learning Periods (CLP) theory, which demonstrates that minor gradient disturbances during early sensitive phases irreparably degrade model accuracy. To address this, we propose the Critical Learning Period-Aware adaptive Framework (CLAF), a novel FL framework for heterogeneous environments. CLAF introduces dual-granularity (coarse- and fine-grained) CLP detection to intelligently optimize client selection and drive adaptive KD strategies. Extensive experiments on diverse models and datasets show CLAF outperforms state-of-the-art methods by up to 22% in accuracy while maintaining robust generalization capabilities. |
| 16:40 | HiGraph-LLM: Hierarchical Graph Encoding and Integration with Large Language Models ABSTRACT. Graph Neural Networks (GNNs) have achieved remarkable performance on graph-centric tasks such as node classification and link prediction. Meanwhile, Large Language Models (LLMs) have shown impressive performance in language understanding across diverse domains. GNNs effectively capture structural information but struggle with rich semantic modeling, while LLMs offer strong contextual reasoning yet fail to encode graph topology. This dual challenge necessitates addressing both the inherent limitations in node representation learning and the complexities involved in aligning graph-structured data with the token space of LLMs. To address these challenges, we introduce HiGraph-LLM, a novel framework designed for hierarchical graph encoding and integration with large language models. HiGraph-LLM refines node representations by integrating multi-level structural features and aligns them with LLMs through curriculum-driven prompt learning. Specifically, HiGraph-LLM consists of two modules: the Hierarchical Node Information Learning Module, which effectively consolidates information from hierarchical node levels to improve node representations, and the LLM’s Graph Information Integration Module, which optimizes the alignment of graph data with the LLM. Comprehensive experiments on multiple benchmark datasets demonstrate the effectiveness of our proposed method. The code will be released upon acceptance of the paper. |
| 17:00 | ELM-Based Finite-Time State Observer Designs for Uncertain Robotic Systems ABSTRACT. This paper focuses on finite-time state observer design for a class of robotic systems with lossy state measurement and nonlinear dynamics including uncertainties and disturbances. The extreme learning machine (ELM) algorithm is applied to approximate the nonlinear dynamics, and simultaneously, an adaptive technique is employed to adjust the output weights of the ELM network and to remove the adverse effects of residual errors and disturbances. Then, a finite-time state observer based on the adaptive signals is developed to accurately estimate the unmeasurable states within a finite time. Ultimately, the estimation accuracy of the designed finite-time ELM network-based observer is demonstrated by simulation results on a robotic manipulator platform. |
| 17:20 | Document-level Relation Extraction with Multi-scale and Non-bridge Reasoning ABSTRACT. Document-level relation extraction (DocRE) aims to extract semantic relations among entities distributed across multiple sentences. Existing methods, whether graph-based or sequence-based, predominantly rely on single-granularity representations, overlooking the fact that different relational triples often require distinct semantic granularities for accurate inference. Moreover, current DocRE approaches commonly retain useful information for relation prediction via bridge entities, which allows the model to elaborately capture the intrinsic interdependence between target entities. However, these studies ignore the potential contributions of non-bridge elements in the reasoning process. To address these limitations, we propose MNR (Multi-scale and Non-bridge Reasoning), a novel framework that introduces multi-scale semantic spaces and cross-axial attention mechanisms to enhance both relation extraction and relational reasoning. Experiments conducted on several widely used DocRE benchmarks demonstrate the effectiveness and generalization capability of our method. |
| 17:40 | VARMA-Enhanced Transformer for Time Series Forecasting PRESENTER: Jiajun Song ABSTRACT. Although Transformer-based models have significantly advanced time series forecasting, their effectiveness and architectural complexity remain subjects of intense debate. Recent work, such as the Cross-Attention-only Time Series transformer (CATS), has demonstrated that eliminating the permutation-invariant self-attention mechanism can lead to superior performance and efficiency. However, these streamlined architectures may overlook the fine-grained, local temporal dependencies effectively captured by classical statistical models like VARMA. To address this gap, we propose VARMAformer, a novel architecture that synergizes the efficiency of a cross-attention-only framework with the principles of classical time series analysis. Our model introduces two key innovations: (1) a dedicated VARMA-inspired Feature Extractor (VFE) that explicitly models autoregressive (AR) and moving-average (MA) patterns at the patch level, and (2) a VARMA-Enhanced Attention (VE-atten) mechanism that employs a temporal gate to make queries more context-aware. By fusing these classical insights into a modern backbone, VARMAformer captures both global, long-range dependencies and local, statistical structures. Through extensive experiments on widely-used benchmark datasets, we demonstrate that our model consistently outperforms existing state-of-the-art methods. Our work validates the significant benefit of integrating classical statistical insights into modern deep learning frameworks for time series forecasting. |
| 18:00 | Multiscale Masking Knowledge Distillation for Dense Visual Prediction ABSTRACT. Object detection and semantic segmentation are fundamental tasks in computer vision, but deploying deep learning models for these tasks in resource-constrained environments remains challenging due to their high computational demands. Knowledge distillation (KD) has emerged as a promising solution, enabling the transfer of knowledge from a large, high-performance teacher model to a lightweight student model. However, existing KD methods often struggle with dense visual prediction tasks due to their complex feature hierarchies and multiscale object representations. In this paper, we propose Multiscale Masking Knowledge Distillation (MMKD), a novel approach that enhances knowledge transfer by leveraging multiscale feature maps and attention-guided masking mechanisms. Our method systematically distills knowledge by focusing on discriminative regions at different scales, ensuring that the student model learns both fine-grained details and high-level contextual information. We introduce a Feature Attention Module (FAM) that dynamically highlights critical regions in feature maps, improving the student’s ability to detect small objects and reduce false positives. We conduct extensive experiments on benchmark datasets, including COCO and Cityscapes, evaluating our method across multiple architectures (RetinaNet, Faster R-CNN, GFL, DeepLabV3, and PSPNet). Our results demonstrate that MMKD significantly outperforms existing distillation techniques, achieving state-of-the-art performance in both object detection and semantic segmentation. For instance, on COCO, our method improves the mAP of RetinaNet-Res50 from 37.4 to 41.2, surpassing previous approaches like FGD (39.6) and MasKD (39.8). Similarly, in semantic segmentation, MMKD boosts DeepLabV3-MobileNetV2’s mIoU from 73.12 to 76.21, outperforming competitors such as CIRKD (75.42) and MasKD (75.26). |
Zoom link: https://vuw.zoom.us/j/92860465891
| 16:00 | Guided Attention Mechanism in Multi-turn Dialogue Summarization ABSTRACT. Multi-turn dialogue summarization aims to efficiently extract core information from vast amounts of conversational data. However, this task often suffers from a "structure-saliency conflict" when balancing the structural perception and saliency focus of the summary, resulting in chaotic summary logic or vacuous content. This paper proposes a guided attention mechanism based on the synergy of structure and saliency. First, the macro-topic flow of the dialogue is predicted and a structural attention mask is constructed to impose hard constraints on the summary range, ensuring the overall logical coherence of the summary and alleviating the problem of logical confusion. Second, within the constrained range, the saliency of each utterance is scored by fusing multi-dimensional features, and the attention weights are softly and dynamically guided to keep the focus on key content, thereby mitigating the issue of vacuous or generic content. To verify the mechanism, the STGSum summarization model is constructed. Experiments on two public datasets, CSDS and DialogSum, show that STGSum performs significantly better than mainstream baseline models such as TGDS and TODS on key metrics such as ROUGE, and exhibits excellent robustness, especially when dealing with dialogues with complex structure. This study provides an effective solution for generating high-quality dialogue summaries with clear logic and a prominent focus. |
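The combination of a hard structural mask and soft saliency guidance described in this abstract can be sketched as a single masked, biased softmax (a minimal illustration with assumed inputs, not the STGSum implementation):

```python
import numpy as np

def guided_attention(scores, structure_mask, saliency):
    # hard constraint: positions outside the predicted topic range are
    # excluded entirely by setting their logits to -inf before the softmax
    logits = np.where(structure_mask, scores + saliency, -np.inf)
    # soft guidance: within the allowed range, saliency scores bias the
    # attention weights toward the most informative utterances
    e = np.exp(logits - np.max(logits, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

Masked-out positions receive exactly zero attention (the hard constraint), while the remaining weights still sum to one and are tilted toward high-saliency positions (the soft constraint).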
| 16:20 | Beyond Local Balance: A Global Perspective for Signed Network Embedding PRESENTER: Ziyi Hou ABSTRACT. Learning low-dimensional node representations is crucial for analysis in signed networks. Existing embedding methods, often based on local structural balance theory, tend to overlook network-wide information. This is a critical omission, as negative links can create globally pivotal nodes—such as those bridging conflicting communities—whose importance cannot be captured by local analysis alone. To address this limitation, we propose SAGA (Signed-Aware Global Attention), a novel signed network embedding framework. SAGA first utilizes a signed graph neural network to learn local representations that differentiate between positive and negative ties. It then introduces a global pooling mechanism that generates a graph-level summary, enabling the model to generate node embeddings that reflect both their global context and structural significance. Experiments on five real-world datasets demonstrate that SAGA consistently outperforms existing methods on downstream tasks, validating its effectiveness in capturing both local and global network properties. |
| 16:40 | PBNAT: Overcoming the Accuracy-Robustness Trade-off via Parallel Batch Normalization ABSTRACT. The efficiency and convergence of adversarial training are compromised by the pronounced distributional divergence between clean and adversarial samples, which has been largely attributed to Batch Normalization (BN). Researchers have attempted to address this mismatch via BN-free or dual-BN frameworks, but these approaches invariably sacrifice natural accuracy for adversarial robustness or vice versa. To overcome these limitations, we introduce Parallel Batch Normalization Adversarial Training (PBNAT), which augments the network with multiple BN branches and a trainable selector that models each input’s feature statistics as a weighted combination of these branches. During training, an alternating BN-scheduling scheme and a novel BN-pruning algorithm work in concert to reduce computational overhead and bolster generalization. During inference, the selector generates a sample-specific weighted combination over all normalization branches, enabling a more flexible and adaptive normalization strategy. This dynamic normalization mechanism enables the model to adapt seamlessly to both clean and adversarial distributions without manual tuning. Empirical results demonstrate that PBNAT reconciles the accuracy–robustness trade-off, achieving superior natural accuracy and adversarial robustness compared to single-BN, BN-free, and dual-BN baselines. |
| 17:00 | QAACoder: A Question Answering Approach to Actor Detection in the Conflict and Mediation Domain ABSTRACT. Monitoring, analyzing, and predicting political turmoil and violence is of utmost importance to a host of political scientists. This is still usually done using event coding systems that use pattern matching and fixed-size dictionaries. Recently, BERT and ConfliBERT have achieved state-of-the-art results for event coding. However, these methods use a sequence classification paradigm, and thus are unable to explicitly model the semantics of the labels and the rich interactions among them. In this paper, we propose a novel method for political event extraction on the standard CAMEO-based data set by formulating the problem as question answering, overcoming the above drawbacks. We achieve superior results, improving over ConfliBERT, the previous state-of-the-art model, by an absolute F1 of 2.02%. We also propose a new method for multi-source, multi-target sentences that increases the F1 by 2.29% compared to the previous best method. |
| 17:20 | Assessing Nuanced Personality Inducing in Language Models via Vignette Tests ABSTRACT. Personality inducing has emerged as a critical research area in modern intelligent systems, focusing on adapting to the traits of specific individuals to deliver tailored experiences. Although large language models (LLMs) have become increasingly proficient at simulating personality traits, two major challenges remain. First, existing research focuses on psychological questionnaires, which exhibit a significant gap from real-world scenarios, making it unclear how to measure personality-induction performance in realistic situations. Second, subtle differences between personalities can lead to significantly different behaviors. In this paper, we present a benchmark, VTPI (Vignette Tests for Nuanced Personality Inducing), comprising vignette questions that assess whether inducement methods successfully induce the target personality traits. We find that current inducing approaches fail catastrophically at inducing nuanced personalities on our questions constructed from real scenarios. We thus develop a simple yet effective induction method (DPI) that is capable of capturing subtle differences between nuanced personality traits for precise behavior induction. While VTPI remains challenging, we show that DPI scales well with LLMs (e.g., ChatGPT-4o and DeepSeek-R1) and outperforms previous methods by a large margin (an average F1 improvement of 19.65% on Qwen2.5-14B and 32B models). |
| 17:40 | Wave–PDE Nets: Trainable Wave-Equation Layers as an Alternative to Attention ABSTRACT. We introduce Wave–PDE Nets, a neural architecture whose elementary operation is a differentiable simulation of the second-order wave equation. Each layer propagates its hidden state as a continuous field through a medium with trainable spatial velocity c(x) and damping γ(x). A symplectic spectral solver based on FFTs realises this propagation in O(n log n) time. This oscillatory, global mechanism provides a powerful alternative to attention and first-order state-space models. We prove that a single Wave-PDE layer is a universal approximator. On language and vision benchmarks, Wave-PDE Nets match or exceed Transformer performance while demonstrating superior practical efficiency, reducing wall-clock time by up to 30% and peak memory by 25%. Ablation studies confirm the critical role of symplectic integration and a spectral Laplacian for stability and performance. Visualizations of the learned physical parameters reveal that the model learns intuitive strategies for information propagation. These results position Wave-PDE Nets as a computationally efficient and robust architecture with a strong physical inductive bias. |
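As a rough sketch of the core mechanism (with assumed scalar c and γ standing in for the trainable spatial fields c(x) and γ(x) of the paper), one damped wave-equation step with a spectral FFT Laplacian might look like:

```python
import numpy as np

def wave_step(u, v, c, gamma, dt):
    """One semi-implicit (symplectic-Euler-style) step of the damped wave
    equation u_tt = c^2 * laplacian(u) - gamma * u_t on a periodic,
    unit-length 1-D grid, using a spectral FFT Laplacian (O(n log n))."""
    n = u.shape[-1]
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=1.0 / n)        # wavenumbers
    lap_u = np.fft.ifft(-(k ** 2) * np.fft.fft(u)).real   # spectral Laplacian
    v = (v + dt * (c ** 2) * lap_u) / (1.0 + dt * gamma)  # velocity first...
    u = u + dt * v                                        # ...then position
    return u, v
```

Updating the velocity first and then the position with the new velocity is what makes the integrator symplectic (up to the damping term), which is the stability property the paper's ablations highlight.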
| 17:50 | AFD-STA: Adaptive Filtering Denoising with Spatiotemporal Attention for Chaos Prediction PRESENTER: Chunlin Gong ABSTRACT. This paper presents AFD-STA Net, a neural framework integrating adaptive filtering and spatiotemporal dynamics learning for predicting high-dimensional chaotic systems governed by partial differential equations. The architecture combines: 1) An adaptive exponential smoothing module with position-aware decay coefficients for robust attractor reconstruction, 2) Parallel attention mechanisms capturing cross-temporal and spatial dependencies, 3) Dynamic gated fusion of multiscale features, and 4) Deep projection networks with dimension-scaling capabilities. Numerical experiments on nonlinear PDE systems demonstrate the model's effectiveness in maintaining prediction accuracy under both smooth and strongly chaotic regimes while exhibiting noise tolerance through adaptive filtering. Component ablation studies confirm critical contributions from each module, particularly highlighting the essential role of spatiotemporal attention in learning complex dynamical interactions. The framework shows promising potential for real-world applications requiring simultaneous handling of measurement uncertainties and high-dimensional nonlinear dynamics. |
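An adaptive exponential smoothing module of the kind described reduces to a per-dimension EMA recurrence with decay coefficients; the sketch below uses fixed coefficients standing in for the learned, position-aware ones in AFD-STA Net:

```python
import numpy as np

def adaptive_ema(x, alphas):
    """Exponentially smooth a (T, D) sequence with a per-position decay.

    alphas: shape (D,), each in (0, 1); larger values trust the new
    observation more, smaller values smooth (denoise) more aggressively.
    In the paper these coefficients would be position-aware and learned.
    """
    s = np.empty_like(x)
    s[0] = x[0]
    for t in range(1, x.shape[0]):
        s[t] = alphas * x[t] + (1.0 - alphas) * s[t - 1]
    return s
```

A constant signal passes through unchanged, while the variance of additive noise is reduced, which is the filtering behavior the abstract credits for the model's noise tolerance.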
Zoom link: https://vuw.zoom.us/j/97046252862
| 16:00 | Multimodal Named Entity Recognition with Synthesized SVG Graphics and Structural Semantic Consistency Scoring PRESENTER: Shujun Xia ABSTRACT. Named entity recognition (NER) is a fundamental task in natural language processing (NLP). However, it often has difficulty in handling entity ambiguity due to limited contextual clues. To address this challenge, traditional multimodal methods introduce social media images to assist text understanding, but the semantic deviation and noise of images seriously restrict the multimodal modeling effect. This paper proposes NER-S, a novel multimodal NER framework that integrates text-guided Scalable Vector Graphics (SVG) as reliable visual information and combines structural semantic consistency score (SSCS) to select images with high visual and semantic consistency as auxiliary information to improve entity recognition performance. Specifically, the original text is first input into the SVG image generation model to generate candidate images. Then, the optimal image is selected by SSCS and input into the multimodal named entity recognition model as the final visual supplement. Experiments on Twitter-2015 and Twitter-2017 datasets demonstrate the effectiveness of NER-S, with F1 scores of 76.60% and 86.87%, respectively. Our model outperforms all text-only baselines and exhibits comparable or superior robustness and generalization capabilities to existing multimodal models with real-world images. |
| 16:20 | DREAM: A Dual Representation Learning Model for Multimodal Recommendation ABSTRACT. Multimodal recommendation focuses primarily on effectively exploiting both behavioral and multimodal information for the recommendation task. However, most existing models suffer from the following issues when fusing information from two different domains: (1) Previous works do not pay attention to the sufficient utilization of modal information by only using direct concatenation, addition, or simple linear layers for modal information extraction. (2) Previous works treat modal features as learnable embeddings, which causes the modal embeddings to gradually deviate from the original modal features during learning. We refer to this issue as Modal Information Forgetting. (3) Previous approaches fail to account for the significant differences in the distribution between behavior and modality, leading to the issue of representation misalignment. To address these challenges, this paper proposes a novel \textbf{D}ual \textbf{RE}present\textbf{A}tion learning model for \textbf{M}ultimodal Recommendation called \textbf{DREAM}. For sufficient information extraction, we introduce separate dual lines, including Behavior Line and Modal Line, in which the Modal-specific Encoder is applied to empower modal representations. To address the issue of Modal Information Forgetting, we introduce the Similarity Supervised Signal to constrain the modal representations. Additionally, we design a Behavior-Modal Alignment module to fuse the dual representations through Intra-Alignment and Inter-Alignment. Extensive experiments on three public datasets demonstrate that the proposed DREAM method achieves state-of-the-art (SOTA) results, with the source code available at https://anonymous.4open.science/r/DREAM-8497. |
| 16:40 | MeDRNet: A Knowledge-Augmented Multi-Model Framework for Robust Medical Language Understanding ABSTRACT. While Large Language Models (LLMs) hold great promise for medical applications, their susceptibility to subtle linguistic variations—such as terminology differences, colloquial phrasing, and altered word order—poses a critical challenge for clinical reliability. We propose MeDRNet, a knowledge-enhanced multi-model medical AI framework that dynamically fuses generic and domain-specific models through an adaptive routing mechanism. Its modular design integrates innovative techniques, including adversarial training, logical consistency constraints, and a knowledge alignment module, leveraging medical knowledge graphs and retrieval-augmented generation to effectively mitigate hallucinations and enhance factual accuracy. MeDRNet is designed to maintain semantic stability under diverse clinical inputs, including noisy, informal, and domain-specific queries, making it suitable for high-stakes healthcare scenarios. Extensive experiments on PromptCBLUE, MultiMedBench, and a Real-World Query Set demonstrate that MeDRNet consistently outperforms leading baselines—including GPT-4, Aquila-Med LLM, and HuatuoGPT-o1—in terms of accuracy, robustness, and hallucination resistance. These findings establish MeDRNet as a scalable and trustworthy foundation for real-world clinical language understanding tasks. The framework is readily extensible to downstream applications such as diagnostic decision support, electronic health record (EHR) summarization, and multilingual medical QA, offering a promising pathway for integrating LLMs into next-generation clinical workflows. |
| 17:00 | Multi-modal Multi-objective Particle Swarm Optimization Using Growing Neural Gas Network ABSTRACT. In multi-modal multi-objective optimization (MMO), multiple Pareto optimal solutions with distinct decision variables can be projected onto an identical objective vector on the Pareto front. Numerous optimization algorithms develop sophisticated diversity preserving mechanisms to extensively explore the Pareto set (PS). However, existing work has neglected explicit learning of the Pareto set, while the intersection of machine learning and MMO remains largely unexplored. To advance the field, a multi-modal multi-objective particle swarm optimization that incorporates a growing neural gas network is proposed, termed MMPSO-GNG. The algorithm incrementally learns the topological structure of the PS to construct the network, in which a network-based solution generator and a selector are developed to facilitate exploration and maintain diversity. The generator leverages the network nodes to identify the neighborhood of each particle and guide the update of its position. The selector combines crowding distance and node-associated particle count to maintain a diverse set of particles. Performance evaluation on the CEC 2020 benchmark suite reveals that MMPSO-GNG surpasses five competing algorithms, therefore validating the effectiveness of integrating machine learning into particle swarm optimization to address complex multi-modal multi-objective problems. |
| 17:20 | BiGMF: Multimodal Sentiment Analysis By Bidirectional Cross-Modal Attention with Geometric Volume Regularization PRESENTER: Youwei Zhang ABSTRACT. Multimodal Sentiment Analysis (MSA) aims to integrate text, audio, and visual information to better understand human emotions. However, existing approaches lack a structured bidirectional mechanism for the exchange of semantic information between text, audio, and video modalities. This difficulty in modeling complex cross-modal dependencies consequently restricts their capacity to capture detailed semantic correlations. Moreover, previous methods typically fail to effectively align representations across different modalities, leading to semantic inconsistencies and redundant information during the fusion process. To address these issues, this paper proposes a novel bidirectional cross-modal fusion framework named BiGMF. The method is built upon a hierarchical cross-modal interaction architecture that enables bidirectional information exchange at multiple levels, enhancing the modeling capacity for cross-modal interactions. In addition, a geometric volume regularization strategy is introduced to reinforce semantic consistency. This strategy explicitly promotes the alignment of modality-specific features by constraining the geometric volume of their joint distribution in a shared embedding space. Extensive experiments on two MSA benchmarks demonstrate the effectiveness of the proposed method. |
| 17:40 | TRUST: Transparent, Robust and Ultra-Sparse Trees ABSTRACT. Piecewise-constant regression trees remain popular for their interpretability, yet often lag behind black-box models like Random Forest in predictive accuracy. In this work, we introduce TRUST (Transparent, Robust, and Ultra-Sparse Trees), a novel regression tree model that combines the accuracy of Random Forests with the interpretability of shallow decision trees and sparse linear models. TRUST further enhances transparency by leveraging Large Language Models to generate tailored, user-friendly explanations. Extensive validation on synthetic and real-world benchmark datasets demonstrates that TRUST consistently outperforms other interpretable models -- including CART, Lasso, and Node Harvest -- in predictive accuracy, while matching the accuracy of Random Forest and offering substantial gains in both accuracy and interpretability over M5', a well-established model that is conceptually related. |
| 18:00 | Interpretable Brain Network Analysis for Psychiatric Diagnosis Using Fuzzy Logic ABSTRACT. Psychiatric disorders impose a significant burden on healthcare systems, necessitating accurate and interpretable diagnostic tools. Functional magnetic resonance imaging (fMRI) provides insights into brain functional connectivity (FC), yet traditional models often lack transparency. We propose a novel fuzzy logic-based approach to model and interpret brain networks for psychiatric diagnosis. This method employs fuzzy rules to capture causal relationships between brain regions and diagnostic outcomes, delivering individualized explanations without relying on graph neural networks. Evaluations on large-scale fMRI datasets, such as REST-meta-MDD, demonstrate competitive performance and clinically relevant interpretability, with our model being explainable. |
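As a toy illustration of the rule-based idea (Gaussian memberships, a product t-norm, and weighted-average defuzzification are common fuzzy-system choices assumed here, not necessarily the authors' exact design), a fuzzy diagnostic score over functional-connectivity features might look like:

```python
import numpy as np

def gaussian_mf(x, center, sigma):
    # degree to which x belongs to the fuzzy set "around center"
    return np.exp(-0.5 * ((x - center) / sigma) ** 2)

def fuzzy_diagnose(fc, rules):
    """fc: dict mapping a connectivity-feature name to its value.
    rules: list of (antecedents, consequent) pairs, where antecedents is
    a list of (feature_name, center, sigma) triples and consequent is a
    diagnostic score. AND is the product t-norm; the output is the
    firing-strength-weighted average of rule consequents, so each
    prediction can be explained by the rules that fired for it."""
    num, den = 0.0, 0.0
    for antecedents, consequent in rules:
        strength = np.prod([gaussian_mf(fc[name], c, s)
                            for name, c, s in antecedents])
        num += strength * consequent
        den += strength
    return num / den if den > 0 else 0.0
```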
Zoom link: https://vuw.zoom.us/j/93664289896
| 16:00 | Balancing Fairness and Performance Under Multiple Sensitive Attributes PRESENTER: Muxiang Zhang ABSTRACT. Advances in machine learning enable solutions to increasingly complex problems. However, the predominant focus on predictive accuracy in many models often results in insufficient attention to potential biases against certain groups, thereby highlighting the critical need for fairness-aware machine learning. While most existing studies focus solely on debiasing with respect to a single sensitive attribute (e.g., race or gender), they fail to simultaneously consider fairness under multiple sensitive attributes. Furthermore, current fairness-enhancing approaches frequently degrade model performance. To address these limitations, we propose a novel framework named BFPM that achieves a better balance between fairness and performance across multiple sensitive attributes. BFPM consists of two parts. First, in the data pre-processing stage, we generate synthetic samples to balance the proportion of multiple sensitive attributes in the dataset, thereby enhancing fairness. Second, in the in-processing stage, we employ a retrieval-augmented model to obtain the context of each sample, thereby strengthening its representation. Comprehensive experiments across benchmark datasets demonstrate that BFPM significantly outperforms state-of-the-art methods, simultaneously improving fairness while maintaining or enhancing performance. |
| 16:20 | CRQCDM: A Causal Representation and Contextual Q-matrix Cognitive Diagnosis Model PRESENTER: Zhiwei Cai ABSTRACT. Cognitive diagnosis is a fundamental task in intelligent education, aiming to accurately assess students' latent mastery of knowledge concepts. However, existing models typically face two critical challenges. First, they generally treat knowledge concepts as isolated entities, failing to model the pedagogically-grounded causal dependencies among them. Second, as the scale of exercise pools and knowledge systems on online education platforms continues to grow, omissions often occur when annotating exercises with their associated fine-grained knowledge concepts. To address these issues, this paper proposes a causal representation and contextual Q-matrix cognitive diagnosis model (CRQCDM). The model operates through two synergistic mechanisms. First, a causal information-guided representation learning module is used to model the dependencies among knowledge concepts based on a predefined causal graph, generating more interpretable and nuanced student and exercise representations. Second, a contextual Q-matrix enhancement module integrates the student and exercise representations to uncover implicit knowledge concepts associated with the exercises. Extensive experiments were conducted with CRQCDM on three real-world datasets. The results demonstrate that the performance of CRQCDM is superior to that of existing methods. |
| 16:40 | HCTLR: A Hybrid CNN-Conformer Framework for Offline Handwritten Chinese Text Line Recognition PRESENTER: Chaozong Chen ABSTRACT. Handwritten Chinese text line recognition and handwriting verification play a crucial role in various fields such as office automation, classification of anonymous letters, and identity authentication. However, existing algorithms face significant challenges in extracting features from handwritten Chinese characters due to the complexity of their structure, image distortion and blurriness, and the limited availability of data samples. Furthermore, most existing studies focus on recognizing individual characters and verifying handwritten signatures, whereas in practical applications Chinese handwriting typically appears in text line format. To address these challenges, we construct the HW-CN dataset and propose an adaptive image interpolation algorithm (Otsu-Better) to tackle problems such as broken strokes and blurred characters in low-resolution images. Additionally, we introduce a recognition model specifically designed for handwritten Chinese text lines, which we refer to as Handwritten Chinese Text Line Recognition (HCTLR), to better meet the demands of real-world scenarios and reduce the impact caused by text segmentation. Experimental results demonstrate that the proposed HCTLR model achieves a total recognition accuracy of 78.7% on the HW-CN dataset, an improvement of 6.3% over the CRNN model. |
| 17:00 | UI-Most: Leveraging Multi-Agent Systems for One-Shot Automatic GUI Testing ABSTRACT. GUI automation testing is a mainstream approach to ensure the software quality of mobile applications. To reduce manual testing costs, a large number of automated test cases are typically executed for regression and compatibility testing whenever there are requirement changes or version updates. Although extensive research has applied LLMs and MLLMs to GUI automation, most of these works conduct testing in stable, interference-free environments. In contrast, real-world business scenarios often involve numerous dynamic interference factors and strong business-specific contexts, leading to lower success rates for these methods. To address these challenges, we propose a novel UI automation testing technology (UI-Most) based on a multi-agent architecture. This method is designed to enhance the robustness of UI automation testing by assigning specialized roles to independent agents and enabling their collaboration. At the same time, it leverages the business knowledge from AppGraph for one-shot learning, thereby improving the recognition of UI elements in new scenarios. The effectiveness of our approach has been validated on real test case sets. Furthermore, this method has already been applied to automated regression testing, significantly reducing both manual testing costs and maintenance overhead of test cases. |
| 17:20 | DBFormer: Dual Branch Transformer for Visible-infrared Person Re-identification ABSTRACT. Conventional visible-light-based person re-identification (Re-ID) techniques suffer performance degradation in nighttime or low-light conditions, while visible-infrared (RGB-IR) cross-modal Re-ID can adapt to multiple indoor and nocturnal scenarios. However, the latter faces dual challenges: significant feature distribution discrepancies and local-global feature representation imbalance. Recently, Transformer architectures and part-based methods have demonstrated great progress in traditional Re-ID tasks; however, their direct application to cross-modal scenarios exhibits critical limitations, such as compromised feature integrity from excessive fine-grained segmentation and substantially increased computational complexity. To address these challenges, we propose a Dual-Branch Transformer network (DBFormer) which horizontally partitions the feature encoding process into upper-body and lower-body branches, thereby enhancing the detailed feature modeling capability. Moreover, we design a dual-branch alignment loss function to enforce feature distribution consistency and mitigate inter-branch discrepancies, and a cross-modal alignment loss function to significantly improve Re-ID performance by optimizing cross-modal feature distances. Extensive experiments demonstrate that our method achieves superior accuracy in person re-identification, outperforming state-of-the-art approaches in recent years. |
| 17:40 | RT-DETR-MO: A Lightweight Detector for Small Object Detection in Open-Water UAV Imagery PRESENTER: Yongtao Luo ABSTRACT. Detecting small objects in open-water UAV imagery is challenging due to low contrast, scale variation, and tight on-board latency constraints. We present RT-DETR-MO, where "MO" stands for Maritime Open-water, a lightweight transformer-based detector tailored for maritime scenarios. The design introduces three targeted components: a Dynamic Inception-style Mixed Convolution block (DiMConv) for adaptive multi-scale representation, a Locally-enhanced Token Statistics Self-Attention (LTSSA) that injects neighborhood priors into linear-time attention to emphasize small or clustered targets, and a lightweight Modulation Fusion Module (MFM) for branch-aware feature integration. On the SeaDronesSee benchmark, RT-DETR-MO achieves 83.9% mAP@50 and 49.9% mAP@50:95, surpassing the RT-DETR baseline by 2.4 and 2.0 points, respectively. It also cuts parameters by 35.7% and boosts inference speed by 40.7%. These results demonstrate a more favorable accuracy-efficiency-size trade-off for real-time maritime UAV detection. |
| 17:50 | MGDD: Multidimensional Graph Data Distillation via Contrastive Learning and Feature-Enhanced Propagation ABSTRACT. With the ongoing advancement of Graph Neural Networks (GNNs) in modeling graph-structured data, Knowledge Distillation (KD) has emerged as an efficient paradigm for structured knowledge transfer, aiming to migrate the expressive capacity of a teacher model to a more lightweight and interpretable student model. However, existing graph distillation methods often focus on single-dimensional alignment (e.g., structural or attribute-based) and primarily rely on contrastive learning in the feature space. This limits their ability to retain the decision boundaries and semantic structures encoded in the teacher model, thereby impairing the completeness and expressiveness of the transferred knowledge. To address this, we propose Multidimensional Graph Data Distillation via Contrastive Learning and Feature-Enhanced Propagation (MGDD), a unified distillation framework based on the coordinated integration of multiple modules. Rather than a simple combination of existing strategies, MGDD tightly couples three core components: parameterized label propagation, feature transformation, and logits-based contrastive distillation, enabling joint modeling of label semantics, node attributes, and global topology. In particular, MGDD incorporates a node-wise fusion weighting mechanism and a decision-space alignment strategy to enhance the student model's capacity for structural modeling and boundary-aware classification. We assess MGDD on Cora, Citeseer, and Pubmed, with six widely used GNNs serving as teacher models: GCN, GAT, APPNP, GraphSAGE, SGC, and GCNII. Experimental results demonstrate that MGDD improves classification accuracy by 7.35% on average over the corresponding teacher models and outperforms the state-of-the-art graph distillation baseline by 1.99%. |
| 18:00 | Enhancing Generalized Category Discovery via Chaotic Sparsity Matching ABSTRACT. Generalized Category Discovery (GCD) is a task focused on identifying both known and novel categories within an unlabeled dataset by leveraging another labeled dataset containing only known categories. However, current research in GCD faces several challenges. First, it is difficult to prevent noisy representations in the unlabeled data, which hampers the transfer of knowledge from known to novel categories and consequently leads to suboptimal performance in discovering new categories. Second, after knowledge transfer through calibration from labeled to unlabeled data, the category boundaries are often weakly constrained, resulting in overlaps and ambiguity among categories. To address these issues, this paper proposes Chaotic Sparsity Matching and Transfer (CSMT). Specifically, we utilize the Iterative Truncated Mean to select prototypes from unlabeled data. During the calibration process, we employ Sparsity Matching for knowledge transfer, which helps mitigate the influence of noisy data and improves transfer effectiveness. Additionally, we introduce two levels of alignment: instance-level and category-level. At the instance level, we integrate LORS to generate chaotic instances, enabling the acquisition of instance-level knowledge by aligning original features with chaotic features, thereby enhancing learning for novel categories. At the category level, we incorporate a Margin Constraint mechanism to strengthen category separability and prevent ambiguous prototype assignments. Experiments conducted on three benchmark datasets demonstrate that CSMT significantly outperforms state-of-the-art methods. |