10:30 | Towards Autonomous Building Construction: A Multi-Agent Framework Leveraging Large Multimodal Models ABSTRACT. Large Language Models (LLMs) have advanced creative AI agents capable of complex planning and decision-making. Yet, autonomous building construction remains difficult due to challenges in handling diverse structures and inferring precise spatial layouts from visual data. We propose a novel multi-agent framework that leverages Large Multimodal Models (LMMs) to integrate textual and visual inputs for iterative building reproduction. In our system, an Advisor Agent performs high-level visual reasoning and planning, while a Constructor Agent translates this guidance into precise blueprints for execution. Additional innovations include part-based structural modeling, blueprint consistency across cycles, and multi-view perception to handle occlusion. Experiments in Minecraft demonstrate that this multi-agent architecture substantially improves accuracy, flexibility, and error correction compared to single-agent baselines. |
10:50 | PTFA: An LLM-based Agent that Facilitates Online Consensus Building through Parallel Thinking ABSTRACT. Consensus building is inherently challenging due to the diverse opinions held by stakeholders. Effective facilitation is crucial to support the consensus building process and enable efficient group decision making. However, the effectiveness of facilitation is often constrained by human factors such as limited experience and scalability. In this research, we propose a Parallel Thinking-based Facilitation Agent (PTFA) that facilitates online, text-based consensus building processes. The PTFA automatically collects real-time textual input and leverages large language models (LLMs) to perform all six distinct roles of the well-established Six Thinking Hats technique in parallel thinking. To illustrate the potential of the agent, a pilot study was conducted, demonstrating its capabilities in idea generation, emotional probing, and deeper analysis of idea quality. Additionally, future open research challenges, such as optimizing scheduling and managing behavior in the divergent phase, are identified. Furthermore, a comprehensive dataset that contains not only the conversational content among the participants but also the exchanges between the participants and the agent is constructed for future study. |
11:10 | Prompt Attacks and Safeguards in Large Language Models: A Survey ABSTRACT. Large Language Models (LLMs) have quickly improved in how well they understand and generate human-like language. But as they become more capable, they also become more vulnerable to adversarial manipulation. This survey looks at different types of prompt-based attacks that take advantage of the tendency of models to follow instructions, often in ways that can undermine safety, privacy, or reliability. We organize these threats into a clear taxonomy and also explore a range of defense strategies. In addition, we review tools and benchmarks used to test how robust these models are (including PyRIT, Giskard, Garak, and PromptBench). By mapping attacks to defenses in a layered framework, this work emphasizes the need for thoughtful, flexible safeguards when using LLMs in real-world settings. |
11:30 | Assessing Nuanced Personality Inducing in Language Models via Vignette Tests ABSTRACT. Personality induction has emerged as a critical research area in modern intelligent systems, focusing on adapting to the traits of specific individuals to deliver tailored experiences. Although large language models (LLMs) have become increasingly proficient at simulating personality traits, two major challenges remain. First, existing research focuses on psychological questionnaires, which exhibit a significant gap from real-world scenarios, making it unclear how to measure personality induction performance in realistic situations. Second, subtle differences between personalities can lead to significantly different behaviors. In this paper, we present a benchmark, VTPI (Vignette Tests for Nuanced Personality Inducing), comprising vignette questions that assess whether inducement methods successfully induce the target personality traits. We find that current inducing approaches fail catastrophically at inducing nuanced personalities on our questions constructed from real scenarios. We thus develop a simple yet effective induction method (DPI) that captures subtle differences between nuanced personality traits for precise behavior induction. While VTPI remains challenging, we show that DPI scales well with LLMs (e.g., ChatGPT-4o and DeepSeek-R1) and outperforms previous methods by a large margin (an average F1 improvement of 19.65% on Qwen2.5-14B and 32B models). |
11:50 | AgentFactory: Towards Automated Agentic System Design and Optimization ABSTRACT. Large Language Models (LLMs) have demonstrated remarkable capabilities as powerful components in agentic systems, enabling sophisticated reasoning and complex task execution. However, current approaches to designing and optimizing agentic systems rely heavily on manual effort, limiting their adaptability and scalability. Recent work has explored the automated optimization of workflow designs. However, these approaches often overlook the crucial role of model capabilities and focus on single performance metrics, failing to address real-world deployment constraints. In this paper, we present AgentFactory, a framework that jointly optimizes both foundation models and workflow structures in agentic systems while considering multiple objectives including performance, cost, and efficiency. AgentFactory leverages advanced LLMs as optimizers (such as GPT-4o and DeepSeek V3) to navigate the vast search space of possible configurations, employing a three-stage optimization pipeline (planning, tuning, and workflow design) to automatically discover effective combinations of fine-tuned models and optimized workflows. Through an iterative optimization process, our framework systematically explores and evaluates different agentic system designs, adapting to task-specific requirements while maintaining operational efficiency. We evaluate AgentFactory across eight benchmarks spanning five domains, including general reasoning, coding, mathematics, medicine, and finance. Our experiments demonstrate that AgentFactory consistently outperforms both manually designed methods and existing automated approaches, achieving an average improvement of 9.1% across all benchmarks, with particularly significant gains in domain-specific tasks (19.6% on MedQA and 18.7% on FinEval). These results establish AgentFactory as a promising approach for developing more capable and efficient agentic systems through automated optimization. |
12:10 | SASP-NMT: Syntax-Aware Structured Prompting for Low-Resource Neural Machine Translation PRESENTER: Hao Xing ABSTRACT. In Neural Machine Translation (NMT) with Large Language Models (LLMs), prompting has become the predominant approach for adapting to a new translation task without requiring extensive fine-tuning data. However, when translating low-resource language pairs, conventional prompts, built as simple linear text, struggle to represent richer dependency or constituency syntax, making it difficult for LLMs to grasp the source language's syntactic patterns and semantic nuances and thus impairing translation quality. To address this challenge, this paper proposes Syntax-Aware Structured Prompting (SASP). Since word-level embeddings are insufficient for capturing the overall semantics of a sentence and are susceptible to interference from sentence length and word frequency, we encode source and candidate sentences with sentence-level embeddings and retrieve several semantically similar sentences from the target-language monolingual corpus. Subsequently, each retrieved sentence undergoes fine-grained dependency parsing to extract clause-level subject-verb-object structures as well as part-of-speech information. These syntactic patterns are then organized into clause-level structural templates and integrated with the retrieved example sentences to form a structured prompt, enhancing translation quality. We evaluate SASP on the Mongolian-Chinese (Mo-Zh), Uyghur-Chinese (Ug-Zh), and Tibetan-Chinese (Ti-Zh) language pairs using the CCMT2019 corpus. Experimental results show that SASP consistently improves translation quality across all tasks, achieving up to a 13.4% improvement over zero-shot baselines. These findings indicate that incorporating structured syntactic knowledge into prompt design can significantly enhance the performance of LLMs in low-resource machine translation, particularly in terms of syntactic accuracy and target-language consistency. |
10:30 | Towards Interpretable Load Forecasting: A Liquid Neural Network Approach with Temporal and Feature Importance Modeling PRESENTER: Ziqian Liu ABSTRACT. Accurate and interpretable load forecasting is critical for modern power systems. Existing methods have made notable progress; however, with the development of smart grids, load data have become increasingly non-stationary under the influence of multiple factors. Further, model interpretability is subject to higher demands due to the need for regulatory compliance in power system operations. To address these challenges, we propose an interpretable forecasting model, Feature-weighted Liquid-core model with INterpretable Temporal attention (FLINT), that incorporates a feature-weighted strategy to dynamically assess the contribution of heterogeneous input features, leverages Liquid Neural Networks to capture the dynamic non-stationarity of load data, and integrates a time-aware attention mechanism to model temporal dependencies and highlight critical time steps. Further, we innovatively introduce a multi-level interpretability module that, from both global and local perspectives, explains prediction outcomes by assessing input feature importance, highlighting critical time steps, and tracing the causes of abrupt load changes. Specifically, gradient attribution and gating quantify global feature contributions, the sparse, bio-inspired Liquid Neural Networks (LNNs) architecture enables traceable mutation-level reasoning, and attention highlights key temporal points. Empirically, a comparison with six baseline models on three real-world load datasets demonstrates that FLINT achieves approximately a 4% improvement in forecasting performance compared to the strongest baseline, as measured by MAE, while offering superior interpretability. |
10:50 | FuzzyProbNet: An Interpretable Fuzzy Probabilistic Network for Cement Compressive Strength Prediction ABSTRACT. The compressive strength of cement is a critical indicator for evaluating its quality and ensuring the safety and durability of engineering structures. However, traditional physical testing methods, characterized by long durations and high costs, fail to meet the demands of modern intelligent construction for rapid and economical assessment. Consequently, the development of advanced predictive models is of paramount importance. Currently, prevailing predictive models often face a "trilemma" where prediction accuracy, uncertainty quantification, and model interpretability are difficult to achieve simultaneously. The "black-box" nature of these models restricts their application in safety-critical domains. To address this challenge, this paper proposes a novel Fuzzy Probabilistic Network (FuzzyProbNet). This model transforms numerical inputs into interpretable semantic concepts through a learnable fuzzification process, extracts robust deep features using a Variational Autoencoder, and ultimately generates a complete predictive probability distribution via a Gaussian Mixture output head. Experimental results demonstrate that the proposed FuzzyProbNet outperforms baseline models across various metrics for both point and probabilistic prediction. Furthermore, visualization and analysis of the model's internal workings validate its clear decision-making logic and inherent interpretability. |
11:10 | Interpretable Brain Network Analysis for Psychiatric Diagnosis Using Fuzzy Logic ABSTRACT. Psychiatric disorders impose a significant burden on healthcare systems, necessitating accurate and interpretable diagnostic tools. Functional magnetic resonance imaging (fMRI) provides insights into brain functional connectivity (FC), yet traditional models often lack transparency. We propose a novel fuzzy logic-based approach to model and interpret brain networks for psychiatric diagnosis. This method employs fuzzy rules to capture causal relationships between brain regions and diagnostic outcomes, delivering individualized explanations without relying on graph neural networks. Evaluations on large-scale fMRI datasets, such as REST-meta-MDD, demonstrate competitive performance and clinically relevant interpretability, with our model being explainable. |
11:30 | Uncertainty Estimation by Human Perception versus Neural Models ABSTRACT. Modern neural networks (NNs) often achieve high predictive accuracy but remain poorly calibrated, producing overconfident predictions even when wrong. This miscalibration poses serious challenges in applications where reliable uncertainty estimates are critical. In this work, we investigate how human perceptual uncertainty compares to uncertainty estimated by NNs. Using three vision benchmarks annotated with both human disagreement and crowdsourced confidence, we assess the correlation between model-predicted uncertainty and human-perceived uncertainty. Our results show that current methods only weakly align with human intuition, with correlations varying significantly across tasks and uncertainty metrics. Notably, we find that incorporating human-derived soft labels into the training process can improve calibration without compromising accuracy. These findings reveal a persistent gap between model and human uncertainty and highlight the potential of leveraging human insights to guide the development of more trustworthy AI systems. |
11:50 | Belief Change with Full Memory and Trust ABSTRACT. We consider belief change in a situation where agents have full memory of all information that has been reported over time. In this context, we no longer have an a priori initial belief state. Instead, we have a history of past reports along with a trust state that indicates how strongly each information source is trusted. If we have a static model of trust, then this approach essentially gives a variation of regular iterated revision. However, we introduce a model of trust change, where trust levels can increase or decrease based on agreement between sources. In this case, we end up with a new kind of belief change operator. The new operator can abandon sources and re-integrate them over time, while maintaining beliefs that are justified both by trust and an underlying Darwiche-Pearl operator. |
10:30 | FreezeSeg2RL: Frozen Segmentation Pretraining for Reinforcement Learning On vascular Interventional Robot Autonomous Delivering PRESENTER: Ziyang Mei ABSTRACT. Vascular interventional robotic systems play a critical role in protecting physicians from X-ray radiation exposure during vascular surgery procedures. An AI-copilot autonomous delivering capability can further enhance physicians’ experience with robotic systems. However, extracting features from real-time X-ray images and generating operational decisions is challenging. To address these challenges, this paper proposes an X-ray simulation platform, which provides data and environments for pre-training vision encoders and training an agent by reinforcement learning. The agent integrates a pre-trained vision encoder for interventional instruments with an actor-critic network. This paper validates the impact of different vision encoders, different feature fusion schemes, and different reinforcement learning methods on agent performance. The optimized solution demonstrated superior performance in simulation benchmarks and was successfully transferred to a real robotic system. Experiments demonstrate this method’s ability to generate effective strategies from real-time X-ray inputs and show promising clinical robotics applications. |
10:50 | Residual-based Adaptive Domain Decomposition Method for the Physics-Informed Neural Network ABSTRACT. The physics-informed neural network (PINN) has emerged as a powerful framework for solving partial differential equations (PDEs) by incorporating physical laws as constraints in the loss function. However, utilizing a single neural network approximator across the entire computational domain may encounter challenges in converging to the correct solutions, especially for equations where different regions exhibit diverse properties. In this paper, we introduce the residual-based adaptive domain decomposition method for the PINN (RA-PINN), which adaptively partitions the computational domain and assigns sub-networks to solve each subdomain independently. The RA-PINN eliminates the need for manual domain division in existing PINN domain decomposition methods by enabling the network to autonomously infer subdomains based on residual information. The RA-PINN offers a more efficient alternative for solving PDEs, achieving enhanced adaptivity and reduced reliance on prior knowledge about the solution or domain characteristics. We conducted experiments on three representative cases with sharp solution variations, demonstrating that the RA-PINN outperforms the PINN and other non-adaptive domain decomposition methods in capturing localized features and ensuring solution accuracy. |
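The residual-guided partitioning at the core of RA-PINN can be illustrated with a minimal sketch; the 1-D viscous Burgers residual, the median split, and the function names below are illustrative assumptions rather than the authors' actual implementation.

```python
# Illustrative sketch (not the authors' code): residual-guided splitting of
# collocation points for a 1-D viscous Burgers equation
# u_t + u*u_x - nu*u_xx = 0, given any PyTorch network `model` mapping (x, t) -> u.
import torch

def pde_residual(model, x, t, nu=0.01):
    x = x.clone().requires_grad_(True)
    t = t.clone().requires_grad_(True)
    u = model(torch.stack([x, t], dim=-1)).squeeze(-1)
    grad = lambda y, v: torch.autograd.grad(y, v, torch.ones_like(y), create_graph=True)[0]
    u_t, u_x = grad(u, t), grad(u, x)
    u_xx = grad(u_x, x)
    return u_t + u * u_x - nu * u_xx           # point-wise PDE residual

def split_by_residual(model, x, t):
    """Assign collocation points to a high- or low-residual subdomain;
    a dedicated sub-network would then be trained on each part."""
    r = pde_residual(model, x, t).abs().detach()
    mask = r > r.median()                      # data-driven, adaptive partition
    return (x[mask], t[mask]), (x[~mask], t[~mask])
```

The key design choice the abstract describes is that the partition is inferred from residual magnitudes rather than drawn by hand, so sharp-gradient regions naturally receive their own sub-network.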
11:10 | Empowering Graph Contrastive Learning with Topological Rationale ABSTRACT. Graph contrastive learning (GCL) methods are dedicated to modeling the invariant information from graphs via well-crafted graph augmentation or stochastic encoder perturbation. Although prevailing methods have achieved great progress, we argue that they overlook the essential topological invariant information, referred to as topological rationale. In this regard, we conduct exploratory experiments that visually demonstrate the deficiency of GCL methods in capturing topological rationale and reveal the positive correlation between this deficiency and the discriminability degeneration of GCL methods. To this end, we introduce a novel plug-and-play approach, termed Topological Rationale-enhanced Graph Contrastive Learning (TRGCL). Specifically, TRGCL integrates the node-level and substructure-level topological rationale learning modules in the topological rationale learning stage, thereby empowering the GCL encoder to capture topological invariant information sufficiently. Furthermore, we introduce a semantic-orthogonal adaptive weighting module to ensure that the derived topological rationale remains complementary to semantic information. Theoretically, we revisit the paradigm of GCL from the causal perspective and substantiate the theoretical validity of TRGCL. Experimental results on various datasets in the domains of social networks and biochemical molecules demonstrate the effectiveness of TRGCL. |
11:30 | HDCompression: Hybrid-Diffusion Image Compression for Ultra-Low Bitrates PRESENTER: Lei Lu ABSTRACT. Image compression under ultra-low bitrates remains challenging for both conventional learned image compression (LIC) and generative vector-quantized (VQ) modeling. Conventional LIC suffers from severe artifacts due to heavy quantization, while generative VQ modeling gives poor fidelity due to the mismatch between learned generative priors and specific inputs. In this work, we propose Hybrid-Diffusion Image Compression (HDCompression), a dual-stream framework that utilizes both generative VQ-modeling and diffusion models, as well as conventional LIC, to achieve both high fidelity and high perceptual quality. Different from previous hybrid methods that directly use pre-trained LIC models to generate low-quality fidelity-preserving information from heavily quantized latent, we use diffusion models to extract high-quality complementary fidelity information from the ground-truth input, which can enhance the system performance in several aspects: improving indices map prediction, enhancing the fidelity-preserving output of the LIC stream, and refining conditioned image reconstruction with VQ-latent correction. In addition, our diffusion model is based on a dense representative vector (DRV), which is lightweight with very simple sampling schedulers. Extensive experiments demonstrate that our HDCompression outperforms the previous conventional LIC, generative VQ-modeling, and hybrid frameworks in both quantitative metrics and qualitative visualization, providing balanced and robust compression performance at ultra-low bitrates. |
11:50 | Spatio-temporal dynamic multi-scale graph convolutional networks for traffic flow prediction ABSTRACT. Graph-constructed traffic flow prediction has recently achieved significant advances in transportation research. Existing methods predominantly rely on predefined spatial adjacency graphs to model spatio-temporal relationships. However, these static adjacency matrices inadequately represent the complex spatio-temporal correlations among road network nodes and fail to capture dynamic interactions that evolve over time. This paper proposes a novel traffic prediction model called the Spatial-Temporal Dynamic Multiscale Graph Convolutional Network (SDMGCN). The SDMGCN first models the dynamic properties of node spatial correlations through attention mechanisms, constructing the Dynamic Interaction Perception Graph (DIPG). It then innovatively proposes the Multi-Order Augmented Graph Convolution Module (MOAGCM), which adaptively adjusts node weights through a multi-order information aggregation mechanism. When combined with the DIPG, it captures deeper dynamic spatial dependencies between nodes. Finally, the multiscale time-gated convolution module captures temporal dependencies at various time scales. Experimental evaluations on two real-world traffic datasets demonstrate that the SDMGCN model significantly outperforms state-of-the-art methods. |
13:30 | PEARL: Plan Exploration and Adaptive Reinforcement Learning for Multihop Tool Use ABSTRACT. Large Language Models (LLMs) show great potential with external tools, but face significant challenges in complex, multi-turn tool invocation. They often exhibit weak planning, tool hallucination, erroneous parameter generation, and struggle with robust interaction. To tackle these issues, we present PEARL, a novel framework to enhance LLM planning and execution for sophisticated tool use. PEARL adopts a two-stage approach: an offline phase where the agent explores tools to learn valid usage patterns and failure conditions, and an online reinforcement learning phase. In the online phase, a dedicated Planner is trained via Group Relative Policy Optimization (GRPO) with a carefully designed reward function that provides distinct signals for planning quality. Experiments on the ToolHop and T-Eval benchmarks show PEARL significantly outperforms existing methods, achieving a new state-of-the-art success rate of 56.5% on ToolHop while maintaining a low invocation error rate. Our work marks a key advance in addressing the complex planning challenges of tool use, contributing to the development of more robust and reliable LLM-based agents. |
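The group-relative advantage underlying GRPO is simple to state; the sketch below shows the standard formulation, with the reward values and group size purely hypothetical (the paper's planning-specific reward shaping is not reproduced).

```python
# Minimal sketch of the group-relative advantage at the heart of GRPO:
# each of G sampled plans for the same prompt is scored, and its advantage
# is its reward standardized against the group's mean and standard deviation.
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# e.g. four candidate tool-use plans scored by a (hypothetical) reward function
print(grpo_advantages([1.0, 0.0, 0.5, 0.0]))   # better-than-average plans get positive advantage
```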
13:50 | ColorFP: Improving AI-Generated Text Detection via Fixed Vocabulary Partitioning and Half-Bit Fingerprinting PRESENTER: He Li ABSTRACT. With the rapid proliferation of large language models (LLMs), their misuse has engendered significant societal concern. Accordingly, the development of efficient and robust AI-generated text detection has emerged as a pivotal strategy for mitigating the potential abuse of LLMs. Existing approaches predominantly rely on a compute-intensive fine-tuning paradigm to capture the implicit stylistic cues of AI-generated text. However, fine-tuning such detectors not only incurs substantial overhead but also yields poor robustness, as classification based solely on stylistic cues fails against textual adversarial attacks. This paper introduces ColorFP, a robust AI-generated text detection framework based on fixed vocabulary partitioning and half-bit fingerprinting. Specifically, to achieve the optimal trade-off between detection success rates and generated text quality, we introduce a novel probabilistically biased half-bit fingerprint encoding. To enhance detection robustness, we employ a static hash-seeded pseudorandom number generator to ensure consistent vocabulary partitioning across distinct fingerprints, thereby mitigating the challenges posed by textual adversarial attacks. To comprehensively evaluate ColorFP, we assembled a corpus of fingerprinted text outputs from five LLMs; results show that ColorFP outperforms all baselines—achieving a 93.70% average F1 in a five-class setting—while reducing time and computational overhead by up to 30× compared to state-of-the-art approaches. |
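As an illustration of what a static hash-seeded vocabulary partition could look like, the sketch below derives one fixed split of token ids from a secret key; the key, the 50/50 ratio, and the shuffle-based construction are assumptions for exposition, not the paper's exact scheme.

```python
# Illustrative sketch: partition a tokenizer vocabulary into two fixed halves
# using a static hash-derived seed, so the same reproducible token subsets can
# be biased during generation and checked again at detection time.
import hashlib
import random

def partition_vocab(vocab_size, secret_key="colorfp-demo"):
    seed = int.from_bytes(hashlib.sha256(secret_key.encode()).digest()[:8], "big")
    ids = list(range(vocab_size))
    random.Random(seed).shuffle(ids)            # same key -> same partition
    half = vocab_size // 2
    return set(ids[:half]), set(ids[half:])     # "bit 0" tokens, "bit 1" tokens

bit0_tokens, bit1_tokens = partition_vocab(32000)
print(len(bit0_tokens), len(bit1_tokens))
```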
14:10 | UI-Most: Leveraging Multi-Agent Systems for One-Shot Automatic GUI Testing ABSTRACT. GUI automation testing is a mainstream approach to ensure the software quality of mobile applications. To reduce manual testing costs, a large number of automated test cases are typically executed for regression and compatibility testing whenever there are requirement changes or version updates. Although extensive research has applied LLMs and MLLMs to GUI automation, most of these works conduct testing in stable, interference-free environments. In contrast, real-world business scenarios often involve numerous dynamic interference factors and strong business-specific contexts, leading to lower success rates for these methods. To address these challenges, we propose a novel UI automation testing technology (UI-Most) based on a multi-agent architecture. This method is designed to enhance the robustness of UI automation testing by assigning specialized roles to independent agents and enabling their collaboration. At the same time, it leverages the business knowledge from AppGraph for one-shot learning, thereby improving the recognition of UI elements in new scenarios. The effectiveness of our approach has been validated on real test case sets. Furthermore, this method has already been applied to automated regression testing, significantly reducing both manual testing costs and maintenance overhead of test cases. |
14:30 | DEIMerge: An Automatic Program Repair Framework Based on Multi-agent Collaboration and Intelligent Patch Merging ABSTRACT. Large language models (LLMs) have made significant progress in software engineering (SWE) tasks, such as code generation and automatic repair. However, current SWE agents have limited effectiveness when handling complex software defects and often overlook potential useful information in failed patches. To address this, this study proposes an automatic program repair (APR) framework called DeIMerge, which is based on multi-agent collaboration and intelligent patch merging. After voting to select the optimal patch, the framework uses an LLM to analyse all failed patches deeply, fuse scattered repair clues, and generate high-quality merged patches. Experimental results show that this method increases the single-patch repair rate of open-source SWE agents from 27.3% to 37.0%, with an overall maximum repair rate of 57.3%. This framework has been validated for its versatility across multiple mainstream LLM models. Multi-agent patch merging can effectively extract repair clues from failed patches and significantly improve automatic repair performance, providing new insights into solving complex defects. |
14:50 | Morpheus: Accelerating Large Language Models with Feature-Augmented Autoregressive Drafting ABSTRACT. Autoregressive decoding makes inference for Large Language Models (LLMs) both memory bandwidth-bound and time-consuming. In this paper, we reconsider the draft head paradigm in speculative decoding and derive two key observations. Firstly, existing draft heads are sequentially independent, speculating on draft tokens without considering their preceding context within the continuation. Secondly, highly ambiguous tokens disproportionately corrupt the effective length of draft sequences generated by draft heads. Based on these insights, we propose Morpheus, a draft head that generates draft tokens sequentially in an autoregressive manner. By integrating features from the target model and the draft head itself from the previous time step, Morpheus effectively extends the average acceptance length, thereby increasing the end-to-end decoding rate. We conducted comprehensive evaluations of Morpheus, including code generation and text generation tasks. For Vicuna 7B, Morpheus improves the speed of decoding by 1.15x and 2.5x compared to Medusa decoding and autoregressive decoding, respectively. |
13:30 | Prompt-Enhanced Multimodal Learning for Robust Sentiment Analysis with Incomplete Data ABSTRACT. Multimodal sentiment analysis faces significant challenges when processing incomplete data, which is a common scenario in real-world applications due to sensor failures or transmission errors. In this paper, we propose a Prompt-Enhanced Multimodal Learning (PEML) framework that mimics human cognitive process for handling incomplete information. It comprises three core components: (1) Modality-Specific Prompt Encoder (MSPE) that activates prior knowledge through learnable prompt templates, providing adaptive enhancement for different missing patterns; (2) Cross-Modal Adaptive Alignment (CMAA) that establishes inter-modal information exchange channels through a dynamic gating mechanism; (3) Quality-Aware Fusion (QAF) that dynamically fuses high-quality features based on multi-level quality assessment, achieving confidence-based information integration. Extensive experiments across various missing data scenarios demonstrate that PEML outperforms existing state-of-the-art methods, validating the effectiveness of modeling human cognitive processes for robust multimodal learning. |
13:50 | Topology-guided Hypergraph Transformer Network: Unveiling Structural Insights ABSTRACT. Graph Transformers (GTs), a specialized variant of Graph Neural Networks (GNNs), have gained attention for their effectiveness in node and graph representation. However, GTs primarily capture simple dyadic relations, missing higher-order relations in graphs. Hypergraphs can naturally represent these complex, high-order relationships, leading to the development of hypergraph transformers. Existing hypergraph transformers, however, focus on semantic feature-based attention, often losing structural attributes. To address this, we propose a Topology-guided Hypergraph Transformer Network (THTN). Our model formulates a graph into a hypergraph while preserving structural integrity, facilitating the learning of higher-order relations within the network. Then, THTN introduces a novel structure-aware attention mechanism, identifying the importance of nodes and hyperedges from both semantic and structural perspectives. Additionally, a structural and spatial encoding module incorporates topological and spatial information into node representations. By integrating these modules, THTN captures a range of local and global topological features. Extensive experiments on node classification tasks demonstrate that THTN consistently outperforms existing models, showcasing its effectiveness and potential. |
14:10 | Meta Learning-enhanced Iterative Learning Control for Tracking PRESENTER: Shuaiwei Zhang ABSTRACT. Iterative Learning Control (ILC) fundamentally suffers from sensitivity to uncertainties and poor cross-task generalization. Addressing these limitations, we propose the first unified Meta Learning-enhanced Iterative Learning Control (Meta-ILC) framework—integrating meta-learning principles with neural network-based adaptive control. Our approach replaces fixed ILC gain matrices with context-sensitive operators Lp(t) generated by deep residual networks, while a meta-optimizer extracts transferable knowledge across tasks to initialize near-optimal controllers. The framework autonomously adapts to system variations without model reliance or manual tuning, resolving the core deficiencies of conventional ILC. Experimental validation confirms significant advantages: accelerated convergence, minimal initial tracking errors, and robust performance across unseen trajectories, demonstrating transformative potential for high-precision applications including multi-axis robotics and semiconductor manufacturing. |
14:30 | Wave–PDE Nets: Trainable Wave-Equation Layers as an Alternative to Attention ABSTRACT. We introduce Wave–PDE Nets, a neural architecture whose elementary operation is a differentiable simulation of the second-order wave equation. Each layer propagates its hidden state as a continuous field through a medium with trainable spatial velocity c(x) and damping γ(x). A symplectic spectral solver based on FFTs realises this propagation in O(n log n) time. This oscillatory, global mechanism provides a powerful alternative to attention and first-order state-space models. We prove that a single Wave-PDE layer is a universal approximator. On language and vision benchmarks, Wave-PDE Nets match or exceed Transformer performance while demonstrating superior practical efficiency, reducing wall-clock time by up to 30% and peak memory by 25%. Ablation studies confirm the critical role of symplectic integration and a spectral Laplacian for stability and performance. Visualizations of the learned physical parameters reveal that the model learns intuitive strategies for information propagation. These results position Wave-PDE Nets as a computationally efficient and robust architecture with a strong physical inductive bias. |
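A minimal sketch of one propagation step for the damped wave equation u_tt = c(x)^2 Δu - γ(x) u_t, using an FFT-based Laplacian and a leapfrog-style (kick-drift) update, is given below; the 1-D discretization and step size are assumptions, not the authors' solver.

```python
# Sketch of one wave-equation propagation step on a 1-D hidden field,
# with a spectral (FFT-based) Laplacian and a leapfrog-style update.
import numpy as np

def wave_step(u, v, c, gamma, dt=0.1, length=1.0):
    """u: field, v: velocity du/dt, c: trainable speed, gamma: damping."""
    n = u.shape[-1]
    k = 2 * np.pi * np.fft.fftfreq(n, d=length / n)         # angular wavenumbers
    lap_u = np.fft.ifft(-(k ** 2) * np.fft.fft(u)).real     # spectral Laplacian
    v = v + dt * (c ** 2 * lap_u - gamma * v)                # "kick": update velocity
    u = u + dt * v                                           # "drift": update field
    return u, v

u, v = np.random.randn(256), np.zeros(256)
u, v = wave_step(u, v, c=np.full(256, 0.5), gamma=np.full(256, 0.01))
```

The FFT gives the O(n log n) cost the abstract refers to, and the kick-drift ordering is what keeps the undamped dynamics stable over many layers.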
13:30 | Phoneme-Based Optimization of Enrollment Selection for Speaker Identification ABSTRACT. This paper presents a novel phoneme-based approach to enrollment utterance selection for speaker identification. Unlike conventional strategies that ignore linguistic diversity, our method explicitly maximizes phoneme coverage in the enrollment set, yielding more representative and robust speaker profiles. We demonstrate that increasing phoneme diversity directly improves speaker embeddings and identification accuracy, even under real-world speech variability. Experiments on the Vietnam-Celeb dataset with the state-of-the-art ECAPA-TDNN model show that our approach boosts identification accuracy from 93.6% to 95.7% and F1-score from 95.5% to 96.1%, relative to standard selection methods. Remarkably, these gains are achieved with fewer enrollment utterances, substantially reducing user effort. Analysis reveals a near-linear relationship between phoneme coverage and classification performance, highlighting phoneme diversity as a critical factor for effective enrollment. These findings underscore the practical value of our method for building more accurate, efficient, and user-friendly speaker identification systems. |
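A greedy selection that maximizes phoneme coverage over a pool of candidate enrollment utterances might look like the sketch below; the data structures, budget, and greedy criterion are assumptions, since the abstract does not fix a specific algorithm.

```python
# Greedy enrollment selection: repeatedly pick the utterance that adds the
# most not-yet-covered phonemes, until a budget of utterances is reached.
def select_enrollment(utt_phonemes, budget=3):
    """utt_phonemes: dict mapping utterance id -> set of phonemes it contains."""
    covered, chosen = set(), []
    for _ in range(budget):
        best = max(utt_phonemes, key=lambda u: len(utt_phonemes[u] - covered))
        if not utt_phonemes[best] - covered:
            break                          # no remaining utterance adds coverage
        chosen.append(best)
        covered |= utt_phonemes[best]
    return chosen, covered

utts = {"a": {"t", "a", "m"}, "b": {"a", "o"}, "c": {"k", "o", "ng"}}
print(select_enrollment(utts))             # favors utterances spanning new phonemes
```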
13:50 | Solving Low-dose Computer Tomography inverse problem by learning the first-order score of the sparse sinogram samples' distribution PRESENTER: Yuchen Quan ABSTRACT. Computer Tomography is widely used to acquire the internal structures of a target object in a non-invasive way. In order to obtain a high-quality reconstruction, densely distributed detectors are typically used to improve the sampling rate and avoid the artifacts caused by angular undersampling. However, a high dose of X-rays is harmful to the human body, which makes sparse-view measurement urgently needed. Many current methods cannot make full use of the information from different domains, which prevents them from producing a reliable reconstruction. This paper shows that the Low-dose Computer Tomography inverse problem can be solved from the perspective of an inpainting task in the measurement domain (Radon domain). In addition, a method based on a score-based diffusion model is proposed, and properties of the sinogram are exploited to achieve a more reliable result: a 14% improvement in PSNR and a further improvement in SSIM. |
14:10 | A Dual-Domain Perception and Fuzzy Learning Enhanced Framework for Diabetic Retinopathy Grading PRESENTER: Ye Wang ABSTRACT. Diabetic retinopathy (DR) is a leading cause of preventable vision loss worldwide. Accurate DR grading remains a major challenge due to substantial variability in lesion size and morphology, indistinct lesion boundaries, and subtle lesion characteristics that often resemble normal retinal tissue. To address these issues, we propose a novel Transformer-based framework that integrates dual-domain perception with fuzzy learning to enhance DR grading performance. Specifically, we design an Inverted Residual Fuzzy Block (IRFB) to improve lesion localization. It assigns adaptive fuzzy weights in both channel and spatial domains, effectively enhancing lesion-relevant features while suppressing irrelevant information. Furthermore, we introduce a Fuzzy Learning-based Multi-Scale Feature Enhancement (FMFE) module, which captures refined multi-scale representations and mitigates feature redundancy. To further improve global lesion feature extraction and contextual information, we propose the Dual-Domain Perception Transformer (DDPT). This module models both spatial and frequency domain characteristics via domain-specific self-attention mechanisms and employs a cross-attention strategy to fuse complementary information across domains. By combining spatial and frequency-domain features, our model achieves deeper contextual understanding and robust representation of complex lesion structures. Our model achieves a Quadratic Weighted Kappa (QWK) of 94.7% and an accuracy of 90.8% on the APTOS-2019 dataset, and a QWK of 85.9% and an accuracy of 86.3% on the DDR dataset, outperforming existing methods and demonstrating the effectiveness and robustness of our approach. |
14:30 | ELM-Based Finite-Time State Observer Designs for Uncertain Robotic Systems ABSTRACT. This paper focuses on the finite-time state observer design for a class of robotic systems with lossy state measurement and nonlinear dynamics including uncertainties and disturbances. The extreme learning machine (ELM) algorithm is applied to approximate the nonlinear dynamics, and simultaneously, an adaptive technique is employed to adjust the output weights of the ELM network and to remove the adverse effects of residual errors and disturbances. Then, a finite-time state observer based on the adaptive signals is developed to estimate the immeasurable states accurately within a finite time. Ultimately, the estimation accuracy of the designed finite-time ELM network-based observer is demonstrated by simulation results on a robotic manipulator platform. |
14:50 | From Catch to Product: Machine Learning Driven Spectral Analysis for Fish Processing Line Allocation ABSTRACT. Efficiently allocating fish catch to appropriate production lines is vital for maximizing economic value in the seafood industry. This paper proposes a machine learning-based framework for non-destructive fish-to-product classification using vibrational spectroscopy techniques. Instead of predicting accurate biochemical compositions, the allocation task is formulated as a multi-class classification problem, with classes derived from expert-driven clustering of biochemical profiles. To address the challenges posed by noisy and limited spectral data, a novel data processing framework is proposed, which integrates data augmentation, feature selection, and feature fusion. This framework enriches the training dataset through domain-inspired linear spectral augmentation and employs selective feature fusion to extract robust and complementary features from multiple spectral modalities. The resulting fused features are then used to train standard classifiers, leading to improved classification performance. Experimental results on real-world fish spectral datasets demonstrate the effectiveness of this approach, offering a practical tool for intelligent production line allocation in fish processing. |
15:10 | MediVerse: AI-Powered Interactive Voice-Driven Virtual Reality for Health Data Analytics PRESENTER: Rani Adam ABSTRACT. Biomedical data is increasingly complex, and existing interfaces often fall short in supporting intuitive, immersive exploration. We present MediVerse, a novel edge-cloud architecture that integrates voice-based natural language interfaces, immersive virtual reality (VR) visualization, and large language model (LLM)-based query translation to enable real-time, hands-free interaction with complex biomedical datasets. MediVerse leverages head-mounted VR displays for voice input, cloud-based orchestration for query interpretation and generation, and real-time 3D data rendering in an immersive environment. We demonstrate the platform through two case studies with biomedical data and evaluate its performance across 20 benchmark queries. Our findings highlight the system’s ability to accurately interpret user intent, maintain low-latency responsiveness, and deliver immersive, context-aware visualizations. This work introduces a reusable, modular framework that enhances voice-driven, LLM-assisted biomedical analytics in VR and lays the foundation for next-generation immersive data systems. |
13:30 | Personalized Knowledge Tracing Model with Memory Reinforcement and Forgetting-Aware Mechanisms ABSTRACT. Knowledge Tracing (KT), a fundamental technology in online intelligent education systems, is designed to model learners' learning processes and monitor the dynamic evolution of their knowledge states. Learners' memory of acquired knowledge decays over time, with forgetting patterns varying based on individual cognitive characteristics. However, most existing KT models adopt a unified and simplified forgetting function, which fails to accurately capture individualized memory decay and distinguish between the effects of content similarity and time on learning performance, leading to a significant decline in accuracy when predicting long-term learning outcomes. To address this, we propose a Memory-Enhanced Personalized Diagnostic Knowledge Tracing model (MLEKT), which integrates a forgetting-aware linear bias, an error-boosted spaced repetition algorithm, and genetic algorithm optimization to precisely model forgetting behaviors in long learning sequences. Specifically, this paper first designs a forgetting-enhancement module based on a spaced repetition algorithm to provide more fine-grained and personalized forgetting enhancement for different types of learning interactions. Second, a forgetting-aware linear bias mechanism is introduced to effectively distinguish the effects of content similarity and time. Finally, a genetic algorithm-based optimization method for personalized forgetting enhancement values is proposed, enabling personalized parameter configurations for different learners. Experiments on three public datasets demonstrate that the MLEKT model outperforms baseline methods in both accuracy and stability, with significant advantages in long-sequence learning interactions. |
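The combination of exponential memory decay with error-boosted spaced repetition can be illustrated with a toy update; the half-life parameterization and the boost factors below are illustrative assumptions, not the model's actual forgetting function.

```python
# Toy illustration: memory strength decays exponentially with elapsed time,
# and an incorrect answer shortens the next review interval ("error boost").
import math

def recall_probability(elapsed_hours, half_life_hours):
    return math.exp(-math.log(2) * elapsed_hours / half_life_hours)

def update_half_life(half_life_hours, correct, gain=1.5, error_boost=0.6):
    # correct reviews lengthen the half-life; errors shrink it so the item
    # is revisited sooner, which is the error-boosted spaced repetition idea.
    return half_life_hours * (gain if correct else error_boost)

h = 24.0
print(recall_probability(12, h))          # ~0.71 after half a half-life
h = update_half_life(h, correct=False)    # schedule an earlier review after a slip
```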
13:50 | High-Order Information Embedding Transfer for Clustering with Constrained Laplacian Rank PRESENTER: Lijuan Wang ABSTRACT. Most Constrained Laplacian Rank methods rely on first-order similarity graphs, which capture only direct neighbor relations and constrain the discovery of latent high-order structures, such as indirect connections, particularly in single-view settings. Furthermore, fusing simply multi-order proximity matrices without structural alignment results in redundancy and inconsistency. As the proximity order increases, it tends to introduce irrelevant high-order information, degrading clustering performance and stability. To address these issues, we propose a high-order clustering method that constructs anchor-sample high-order bipartite graphs using recursively computed SVD-based similarity matrices, effectively capturing indirect neighborhood relations and enhancing global structural representation. A unified feature embedding space is designed to enable cross-order knowledge transfer via co-clustering, where low-order embeddings guide high-order representations, improving structural consistency and feature discriminability while filtering irrelevant links. Additionally, nuclear norm and sparsity regularization are applied to suppress redundancy and enhance robustness. Experiments on five public datasets show that our method consistently outperforms state-of-the-art approaches across four clustering metrics, validating its effectiveness and resilience. |
14:10 | Balanced Learning for Incremental Multi-view Clustering ABSTRACT. In practice, the number of views can increase over time, and repeatedly fusing all views upon the arrival of each new view can result in high computational costs and accumulated redundancy. Additionally, early-acquired views may become unavailable due to privacy, storage, or data expiration issues, leading to reduced consistency and poor clustering performance. To solve these issues, we propose Balanced Learning for Incremental Multi-View Clustering (BIMC), which incrementally constructs a unified matrix to preserve view information over time. To further enhance clustering performance, each new view is integrated using balanced learning that reduces feature distribution shifts and erroneous connections between clusters while maintaining consistency within clusters. Finally, to further enhance consistency, cluster labels are directly obtained from the consensus graph by enforcing a Laplacian rank constraint, enabling unified graph construction and clustering. Experimental results demonstrate that BIMC achieves superior clustering performance and efficient view fusion across diverse multi-view datasets. |
14:30 | Federated Dual-Clustered Prototype Learning under Domain Heterogeneity ABSTRACT. Federated learning allows clients to collaboratively train models while safeguarding the privacy of their data. Existing methods typically assume that data from different clients originates from the same domain or distribution. Nonetheless, owing to regional constraints, data features from diverse clients demonstrate notable variations, termed domain heterogeneity. The naive aggregation of models trained on such heterogeneous data can result in a global model that is biased towards dominant domains and generalizes poorly to others. Therefore, we expect the global model to have better generalization performance in different domains. In this paper, we propose a federated dual-clustered prototype learning (FedCPL) framework, a novel approach designed to counteract domain heterogeneity and improve model generalization. The key insight is to construct a shareable global prototype through dual-clustering, effectively minimizing the discrepancy among feature representations from disparate domains. On the client side, we introduce weighted contrastive learning and feature fusion to align local features, thereby mitigating domain-specific biases during model training. On the server side, an adaptive weighted aggregation strategy is introduced to prioritize contributions from more challenging domains. Extensive experiments on multiple benchmark datasets demonstrate that FedCPL significantly outperforms existing methods in scenarios with domain heterogeneity. |
14:50 | TVL-Filter: Total Variation Loss–Based Sample Filter for Efficient Adversarial Detection PRESENTER: Fei Zhang ABSTRACT. DNN models in computer vision are vulnerable to adversarial samples that are crafted with imperceptible perturbations, which can lead to unpredictable security risks. Currently, there are many countermeasures proposed in the literature to detect adversarial samples and mitigate their impact. However, these detection algorithms introduce significant computational overhead, limiting their practicality. To address this, two insights motivate this study: 1) for those deployed DNN models, the majority of inputs are benign samples that do not need to undergo detection; 2) the crafted perturbations of adversarial samples can be regarded as a type of high-frequency noise signal. To this end, we propose the Total Variation Loss–Based Sample Filter (TVL-Filter), a plug-in module designed for efficient adversarial detection, which employs the TV-loss value to evaluate samples' high-frequency noise signals, and filters out a significant portion of benign samples before detection accordingly. TVL-Filter helps to substantially reduce the adversarial detection overhead with an acceptable sacrifice of the detection precision. Our experiments indicate that after employing the TVL-Filter, three state-of-the-art detection algorithms achieve speedups of up to 8.73x, 8.32x, and 7.06x, with adversarial sample detection accuracy losses of only 2%, 2.90%, and 1.13%, respectively. |
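Total variation as a cheap proxy for high-frequency perturbation energy is easy to sketch; the anisotropic TV form and the median threshold below are a minimal illustration, not the paper's exact filtering rule.

```python
# Sketch: compute the (anisotropic) total variation of each image and forward
# only high-TV samples to the expensive adversarial detector.
import torch

def tv_loss(x):
    """x: batch of images, shape (B, C, H, W). Returns a per-image TV value."""
    dh = (x[..., 1:, :] - x[..., :-1, :]).abs().sum(dim=(1, 2, 3))
    dw = (x[..., :, 1:] - x[..., :, :-1]).abs().sum(dim=(1, 2, 3))
    return dh + dw

def filter_batch(x, threshold):
    """Split a batch into samples that skip detection and samples to inspect."""
    suspicious = tv_loss(x) > threshold     # likely perturbed: run the detector
    return ~suspicious, suspicious

x = torch.rand(8, 3, 32, 32)
skip_mask, check_mask = filter_batch(x, threshold=tv_loss(x).median())
```

The point of the plug-in is that only the samples flagged by `check_mask` would be passed to the downstream detector, which is where the reported speedups come from.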
15:10 | Dual-Aspect Enhancement of Data Replay: Influence-Guided Replay and Contrastive Gradient Modulation ABSTRACT. Recent advances in language models have significantly improved natural language processing. However, these models face challenges in continual learning (CL), particularly in retaining previously acquired knowledge while assimilating new information, a problem known as catastrophic forgetting. We revisited the concept of data replay in continual learning and introduced two novel improvements: the Influence-Guided Sampling (IGS) strategy for memory buffer construction and the Contrastive Gradient Modulation (CGM) mechanism for parameter update, aiming to mitigate catastrophic forgetting and enhance knowledge transfer. IGS-CGM not only replays past data but also modulates the current task's gradient through a contrastive analysis with gradients from previous tasks, thereby preserving the model's proficiency in previously acquired domains while learning new ones. We conducted extensive experiments on three CL benchmarks, covering traditional finetuning and instruction finetuning for large language models, demonstrating its effectiveness in mitigating catastrophic forgetting and enhancing knowledge transfer. |
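One plausible reading of contrastive gradient modulation is a projection step that removes the component of the current-task gradient that conflicts with a gradient computed on replayed data, in the spirit of A-GEM-style constraints; this is an assumption about the mechanism, sketched below, since the abstract does not spell out the update rule.

```python
# Sketch: if the current-task gradient conflicts with the gradient computed on
# replayed memory data (negative inner product), project the conflict away.
import torch

def modulate_gradient(g_new, g_mem, eps=1e-12):
    """g_new, g_mem: flattened gradient vectors of equal length."""
    dot = torch.dot(g_new, g_mem)
    if dot < 0:                                   # update would hurt old tasks
        g_new = g_new - dot / (g_mem.dot(g_mem) + eps) * g_mem
    return g_new

g_task = torch.randn(1000)
g_replay = torch.randn(1000)
g_used = modulate_gradient(g_task, g_replay)      # use g_used in the optimizer step
```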
16:00 | VISP: Volatility Informed Stochastic Projection for Adaptive Regularization ABSTRACT. We propose VISP: Volatility Informed Stochastic Projection, an adaptive regularization method that leverages gradient volatility to guide stochastic noise injection in deep neural networks. Unlike conventional techniques that apply uniform noise or fixed dropout rates, VISP dynamically computes volatility from gradient statistics and uses it to scale a stochastic projection matrix. This mechanism selectively regularizes inputs and hidden nodes that exhibit higher uncertainty while preserving stable representations, thereby mitigating overfitting. Extensive experiments on MNIST, CIFAR-10, and SVHN demonstrate that VISP consistently improves generalization performance over baseline models and fixed-noise alternatives. In addition, detailed analyses of the evolution of volatility, the spectral properties of the projection matrix, and activation distributions reveal that VISP not only stabilizes the internal dynamics of the network but also fosters a more robust feature representation. These findings suggest that data-dependent, volatility-driven regularization is a promising direction for enhancing the performance of deep neural architectures. |
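The volatility-driven noise scaling can be captured in a few lines; the running-moment estimator, the Gaussian noise form, and the scale factor below are illustrative assumptions rather than the paper's exact projection matrix.

```python
# Sketch: estimate per-unit gradient volatility (std over recent steps) and
# inject proportionally larger Gaussian noise on the more volatile units.
import torch

class VolatilityNoise:
    def __init__(self, num_units, momentum=0.9, scale=0.1):
        self.mean = torch.zeros(num_units)
        self.sq_mean = torch.zeros(num_units)
        self.momentum, self.scale = momentum, scale

    def update(self, grad):            # grad: (num_units,) per-unit gradient
        m = self.momentum
        self.mean = m * self.mean + (1 - m) * grad
        self.sq_mean = m * self.sq_mean + (1 - m) * grad ** 2

    def perturb(self, activations):    # activations: (batch, num_units)
        vol = (self.sq_mean - self.mean ** 2).clamp_min(0).sqrt()
        return activations + self.scale * vol * torch.randn_like(activations)

vn = VolatilityNoise(128)
vn.update(torch.randn(128))
h = vn.perturb(torch.randn(32, 128))   # noisier units get stronger regularization
```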
16:20 | Beyond Local Balance: A Global Perspective for Signed Network Embedding PRESENTER: Ziyi Hou ABSTRACT. Learning low-dimensional node representations is crucial for analysis in signed networks. Existing embedding methods, often based on local structural balance theory, tend to overlook network-wide information. This is a critical omission, as negative links can create globally pivotal nodes—such as those bridging conflicting communities—whose importance cannot be captured by local analysis alone. To address this limitation, we propose SAGA (Signed-Aware Global Attention), a novel signed network embedding framework. SAGA first utilizes a signed graph neural network to learn local representations that differentiate between positive and negative ties. It then introduces a global pooling mechanism that generates a graph-level summary, enabling the model to generate node embeddings that reflect both their global context and structural significance. Experiments on five real-world datasets demonstrate that SAGA consistently outperforms existing methods on downstream tasks, validating its effectiveness in capturing both local and global network properties. |
16:40 | CLAF: A Critical Learning Period-Aware Adaptive Framework for Federated Learning in Heterogeneous Environments ABSTRACT. Federated Learning (FL) enables privacy-preserving collaborative model training across decentralized clients. While adaptive client selection and knowledge distillation (KD) offer potential efficiency gains by monitoring client progress, existing methods lack systematic understanding of pervasive client and data heterogeneity in practical settings. Prevailing FL approaches assume homogeneous clients with equal importance and capability, selected uniformly at random – an assumption contradicted by Critical Learning Periods (CLP) theory, which demonstrates that minor gradient disturbances during early sensitive phases irreparably degrade model accuracy. To address this, we propose the Critical Learning Period-Aware adaptive Framework (CLAF), a novel FL framework for heterogeneous environments. CLAF introduces dual-granularity (coarse- and fine-grained) CLP detection to intelligently optimize client selection and drive adaptive KD strategies. Extensive experiments on diverse models and datasets show CLAF outperforms state-of-the-art methods by up to 22% in accuracy while maintaining robust generalization capabilities. |
17:00 | Incremental Hashing with Asymmetric Distance for Image Retrieval in Non-stationary Environments PRESENTER: Zihao Zhan ABSTRACT. Existing online hashing methods generally employ the Hamming distance for similarity evaluation, which leads to the loss of data location information. Candidate images may have the same Hamming distance from the query but different similarity, which reduces retrieval accuracy. Especially in non-stationary data environments, concept drift problems are prevalent, and the loss of location information makes it more difficult to capture distribution changes in the data environment. To alleviate these concerns, Incremental Hashing with Asymmetric Distance (ICHAD) is proposed in this paper for image retrieval in non-stationary environments. In ICHAD, an online asymmetric distance based on learned hash codes is employed for similarity evaluation. It preserves the location information of data more accurately and is computed efficiently without accessing old data. Experimental results show that ICHAD outperforms existing hashing methods in various non-stationary data scenarios with concept drift. |
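Asymmetric distance keeps the query in real-valued space and binarizes only the database side, which preserves location information that symmetric Hamming matching discards; the inner-product form below is one common choice and an assumption here, not necessarily the paper's exact formulation.

```python
# Sketch: symmetric Hamming distance vs. an asymmetric distance where the
# query stays real-valued and only database items are binary hash codes.
import numpy as np

def hamming(query_code, db_codes):
    # both sides binarized to {-1, +1}: within-bit location information is lost
    return (query_code[None, :] != db_codes).sum(axis=1)

def asymmetric(query_embed, db_codes):
    # query kept continuous; a larger inner product means a smaller distance
    return -db_codes @ query_embed

rng = np.random.default_rng(0)
q = rng.standard_normal(32)                     # real-valued query embedding
db = np.sign(rng.standard_normal((1000, 32)))   # stored binary codes in {-1, +1}
print(hamming(np.sign(q), db)[:3], asymmetric(q, db)[:3])
```

Candidates that tie under Hamming distance generally separate under the asymmetric score, which is the accuracy gain the abstract points to.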
17:20 | UCTPose: Uncertainty-aware Multi-view 3D Animal Pose Estimation ABSTRACT. Data-driven quantitative analysis of animal behavior relies critically on precise video segmentation, accurate 3D animal pose estimation, and behavioral pattern interpretation. Three persistent challenges impede accurate pose estimation: cross-view occlusions, perspective-induced domain shifts, and scarcity of 3D annotation data. To address these limitations, we present UCTPose, a weakly supervised multi-view animal pose estimation framework that synergistically integrates uncertainty-aware 2D modeling with confidence-guided 3D triangulation. The core innovations include a reparameterized perturbation module that simulates view-dependent feature uncertainties, enhancing confidence calibration for 2D keypoint predictions under occlusion, and a geometry-constrained triangulation head that reconstructs 3D poses by incorporating per-joint confidence scores, optimized via a reprojection residual loss to enforce spatial consistency. Comprehensive evaluations on three multi-view mice behavioral datasets demonstrate that UCTPose achieves state-of-the-art performance. These results validate UCTPose’s superior cross-view generalization and occlusion resilience. The framework provides a robust tool for high-fidelity 3D kinematic profiling of naturalistic animal behaviors, significantly reducing dependency on exhaustive 3D annotations. |
17:40 | A Neural Subgraph Counting Method based on Matching Matrix ABSTRACT. Subgraph counting aims to compute the number of subgraphs in a data graph G that match a given query graph q, which has been applied in various fields such as bioinformatics, data mining, and social network analysis. Early methods fundamentally rely on enumerating all possible subgraphs, but they face high computational cost because enumerating all possible subgraphs is an NP-complete problem. To reduce the complexity, approximate methods have gained attention, as in many cases approximate counts are sufficient for decision-making or identifying trends. Recently, researchers have begun applying GNNs to approximate subgraph counting tasks, yet existing GNN-based methods suffer from inefficiencies caused by unpromising data vertices and limited use of the matching information between query and data vertices. To address these challenges, we propose a Neural Subgraph Counting method based on a Matching Matrix, namely MMNSC, which consists of two key components: (1) Candidates Extraction, which retrieves candidate substructures from the data graph using a new filtering method, and (2) Matching Matrix Estimator, a learning-based estimator that generates a matching matrix between the query graph and the data graph. Through experiments on five real-world data graphs, MMNSC demonstrates superior performance over existing state-of-the-art methods. |
16:00 | Few-Shot Document-Level Relation Extraction Based on Chain-of-thought with Discriminative Multi-view Prototype Tuning ABSTRACT. Few-shot document-level relation extraction (FSDLRE) seeks to uncover semantic relations among entities in a document when only a handful of labeled examples are available. Existing prototype-based meta-learning methods build class prototypes for matching but suffer from two key limitations: 1) Inadequate NOTA modeling: by focusing on “learning-to-match,” they neglect robust representations for None-of-the-Above (NOTA) cases. 2) Underutilized supervision and reasoning: they do not fully exploit multi-view supervisory signals, nor do they harness the structured reasoning capabilities of large language models (LLMs), limiting their ability to adapt to new domains under scarce labels. In this paper, we propose Chain-of-Thought with Discriminative Multi-view Prototype Tuning (CDMPT), which leverages the discriminative power of large language models to mine supervisory signals from scarce labeled data through multiple complementary views. Using a chain-of-thought mechanism, our method treats each few-shot episode’s target relation classes as individual sub-domains and performs structured prototype construction and discriminative reasoning to complete the FSDLRE task. Extensive experiments demonstrate that our approach significantly boosts average performance in few-shot document-level relation extraction, especially under cross-domain settings. |
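The prototype-matching baseline the abstract builds on can be summarized in a few lines (a generic sketch, not CDMPT itself): class prototypes are mean support embeddings, and a query falls back to NOTA when its best similarity stays below a threshold. The cosine metric and the threshold value are illustrative choices.

```python
# Generic prototype matching with a NOTA fallback for few-shot relation extraction.
import numpy as np

def build_prototypes(support_embs: np.ndarray, support_labels: np.ndarray) -> dict:
    """Mean embedding per relation class."""
    return {c: support_embs[support_labels == c].mean(axis=0) for c in np.unique(support_labels)}

def classify(query_emb: np.ndarray, prototypes: dict, nota_threshold: float = 0.5):
    """Nearest prototype by cosine similarity; return NOTA if nothing is similar enough."""
    best_label, best_sim = "NOTA", -1.0
    for label, proto in prototypes.items():
        sim = float(query_emb @ proto /
                    (np.linalg.norm(query_emb) * np.linalg.norm(proto) + 1e-9))
        if sim > best_sim:
            best_label, best_sim = label, sim
    return (best_label if best_sim >= nota_threshold else "NOTA"), best_sim
```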
16:20 | Context-Aware and Knowledge-Grounded Conversational Recommendation with Prompt Learning ABSTRACT. Conversational Recommender Systems (CRSs) aim to provide personalized recommendations through multi-turn dialogues. While Large Language Models (LLMs) have shown promise in handling conversational and recommendation tasks, jointly modeling user preferences, contextual knowledge, and high-quality response generation remains a significant challenge. In this work, we propose GraphPromptCRS, a prompt-based and knowledge-grounded CRS framework that jointly performs recommendation and response generation with a frozen LLM. Our system leverages soft prompt learning to encode task-specific information without fine-tuning the model parameters. To enhance the model’s reasoning capabilities, we introduce a GraphRAG-based knowledge construction pipeline that builds dynamic knowledge graphs from dialogue history using structured prompts. Additionally, we incorporate a Community Prompt Enhancer to capture users’ topical preferences, guiding personalized and context-aware generation. Experimental results on the ReDial dataset demonstrate that GraphPromptCRS significantly outperforms baselines in both recommendation accuracy and conversational diversity, validating the effectiveness of our approach. |
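A minimal sketch of soft prompt learning with a frozen LLM (illustrative, not the GraphPromptCRS code): only a small matrix of prompt embeddings is trainable, and it is prepended to the token embeddings before they enter the frozen transformer.

```python
# Soft prompt: learnable vectors prepended to frozen-model token embeddings.
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, n_prompt_tokens: int, d_model: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_prompt_tokens, d_model) * 0.02)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, d_model)
        batch = token_embeddings.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, token_embeddings], dim=1)

# Usage idea: freeze the backbone and train only the prompt (plus any task heads), e.g.
#   for p in llm.parameters(): p.requires_grad_(False)
```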
16:40 | Guided Attention Mechanism in Multi-turn Dialogue Summarization ABSTRACT. Multi-turn dialogue summarization aims to efficiently extract core information from vast amounts of conversational data. However, this task often suffers from a "structure-saliency conflict" when balancing structural perception against saliency focus, resulting in summaries with chaotic logic or vacuous content. This paper proposes a guided attention mechanism based on the synergy of structure and saliency. First, the macro-topic flow of the dialogue is predicted and a structural attention mask is constructed to impose hard constraints on the summarization scope, ensuring the overall logical coherence of the summary and alleviating logical confusion. Second, within this constrained scope, each utterance is scored for saliency by fusing multi-dimensional features, and the attention weights are softly and dynamically guided to keep the focus on key content, mitigating vacuous or generic output. To verify the mechanism, the STGSum summarization model is constructed. Experiments on two public datasets, CSDS and DialogSum, show that STGSum significantly outperforms mainstream baseline models such as TGDS and TODS on key metrics such as ROUGE, and exhibits excellent robustness, especially on dialogues with complex structure. This study provides an effective solution for generating high-quality dialogue summaries with clear logic and a prominent focus. |
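The two-stage guidance can be pictured as a single operation on attention logits (a sketch under assumptions; the paper's exact formulation may differ): the structural mask hard-blocks utterances outside the predicted topic range, and the saliency scores are added as a soft bias before the softmax.

```python
# Hard structural masking plus soft saliency biasing of attention logits.
import torch

def guided_attention(scores: torch.Tensor, structure_mask: torch.Tensor,
                     saliency: torch.Tensor) -> torch.Tensor:
    """scores: (B, H, Lq, Lk) raw attention logits;
    structure_mask: (B, Lk) bool, True = utterance inside the allowed topic range;
    saliency: (B, Lk) per-utterance saliency scores used as an additive bias."""
    scores = scores.masked_fill(~structure_mask[:, None, None, :], float("-inf"))  # hard constraint
    scores = scores + saliency[:, None, None, :]                                   # soft guidance
    return torch.softmax(scores, dim=-1)
```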
17:00 | Adaptive Persona Context Modulation for Personalized Emotional Support Conversation ABSTRACT. Personalized Emotional Support Conversation (ESC) systems (i.e., supporters) assist users (i.e., seekers) in navigating negative emotional states through personalized, empathetic interactions, and are often equipped with a persona extractor. Currently, personalized ESC systems face two key challenges. First, while existing persona extractors attempt to infer persona from dialogue to understand seekers, they often struggle to distinguish the speakers’ roles and thus only consider utterances from the seeker’s side. Second, incorporating personal information without regard for contextual relevance risks damaging the naturalness and coherence of responses. Therefore, we present a novel Adaptive Persona Context Modulation approach (APCM) for the ESC task. For more effective persona extraction, we reconstruct the Persona-Chat dataset to adapt it to our task and propose a role-cognitive persona extractor, enhancing comprehensive understanding of the seeker’s persona from utterances on both sides while preventing role ambiguity. For persona-context integration, our model introduces an Adaptive Attention Balancing Module that dynamically adjusts the influence of persona and context information during response generation, better reflecting real-world conversation patterns in which the seeker’s persona is only considered in appropriate circumstances. Extensive experiments on benchmark datasets demonstrate the effectiveness of APCM, which achieves state-of-the-art (SOTA) performance in emotional support dialogue generation. Our code is public at https://anonymous.4open.science/r/APCM. |
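One way to picture the Adaptive Attention Balancing Module (a hedged sketch, not the authors' implementation) is a learned, input-dependent gate that decides how much persona information to blend into the context representation at each step.

```python
# Gated blending of persona and context vectors.
import torch
import torch.nn as nn

class AdaptivePersonaGate(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, 1)

    def forward(self, context_vec: torch.Tensor, persona_vec: torch.Tensor) -> torch.Tensor:
        # context_vec, persona_vec: (batch, d_model)
        g = torch.sigmoid(self.gate(torch.cat([context_vec, persona_vec], dim=-1)))
        return g * persona_vec + (1.0 - g) * context_vec   # g -> 0 when persona is irrelevant
```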
17:20 | Integrating Sensemaking in a Question-Driven Development Framework for Textual Narrative Reporting Applications PRESENTER: Ruilin Wang ABSTRACT. Although traditional data-to-text Natural Language Generation (NLG) technology is useful for disseminating insights from data science projects as textual narratives to wider audiences, traditional NLG applications rely on application knowledge that is not transferable to new data science projects without significant effort. This limitation arises because application knowledge is not acquired and organized around key concepts of data science, such as the questions that motivate investigations, the algorithms that answer these questions, and the sensemaking processes that apply this knowledge at all stages. This paper introduces a "rapid prototyping followed by iterative refinement" methodology and its corresponding development framework, which leverages the strengths of cognitive sensemaking within a question-driven data science process. In the context of human-machine collaboration in data science, our approach shifts a greater share of the sensemaking responsibility to the machine side, enhancing the collaboration between humans and machines. The paper also proposes an information consistency principle to ensure alignment among input data, model results, user requirements, and generated reports. The effectiveness of this framework has been demonstrated through experiments, confirming that the quality of the generated reports matches that of reports crafted by experts. |
17:40 | HyKAG: Hybrid Knowledge-Aware Retrieval-Augmented Generation for Knowledge-Intensive Questions ABSTRACT. Knowledge-intensive Questions typically require Large Language Models (LLMs) to retrieve external knowledge beyond their parametric memory to generate factually accurate and human-aligned answers. Retrieval-Augmented Generation (RAG), a reliable technique for supplementing LLMs with external information, enhances generation quality and mitigates hallucination by incorporating retrieved knowledge into the reasoning process. However, existing multi-step retrieval RAG methods are prone to introducing a large number of irrelevant passages during deep exploration of external knowledge bases and remain constrained by one-sided exploration strategies. This hinders effective exploration and utilization of high-quality knowledge, ultimately leading to unreliable reasoning and answers. To this end, we propose a novel Hybrid Knowledge-Aware RAG (HyKAG) framework for knowledge-intensive questions. Specifically, to enable deeper exploration of high-quality external knowledge and enhance the model’s knowledge awareness, we first propose hybrid knowledge expansion and refinement modules that enrich retrieved content from dual retrieval perspectives and refine it through an incremental cross-step integration strategy. Furthermore, we introduce a hybrid knowledge-aware adaptive retrieval module that formulates high-quality retrieval decisions by leveraging the refined hybrid knowledge, thereby facilitating deeper knowledge exploration. Extensive empirical results on four datasets demonstrate the superiority of HyKAG. |
16:00 | Multimodal Named Entity Recognition with Synthesized SVG Graphics and Structural Semantic Consistency Scoring PRESENTER: Shujun Xia ABSTRACT. Named entity recognition (NER) is a fundamental task in natural language processing (NLP), but it often struggles with entity ambiguity due to limited contextual clues. To address this challenge, traditional multimodal methods introduce social media images to assist text understanding, yet the semantic deviation and noise in these images severely limit the effectiveness of multimodal modeling. This paper proposes NER-S, a novel multimodal NER framework that integrates text-guided Scalable Vector Graphics (SVG) as reliable visual information and uses a structural semantic consistency score (SSCS) to select images with high visual-semantic consistency as auxiliary information for entity recognition. Specifically, the original text is first fed into an SVG image generation model to produce candidate images. The optimal image is then selected by SSCS and passed to the multimodal named entity recognition model as the final visual supplement. Experiments on the Twitter-2015 and Twitter-2017 datasets demonstrate the effectiveness of NER-S, with F1 scores of 76.60% and 86.87%, respectively. Our model outperforms all text-only baselines and exhibits comparable or superior robustness and generalization to existing multimodal models that use real-world images. |
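The selection step amounts to ranking candidate SVG renderings by their consistency with the text. The sketch below (an assumption, not the actual SSCS) scores pre-computed text and image embeddings with cosine similarity and keeps the best candidate.

```python
# Pick the candidate image whose embedding is most consistent with the text embedding.
import numpy as np

def select_best_image(text_emb: np.ndarray, image_embs: np.ndarray):
    """text_emb: (d,); image_embs: (n_candidates, d). Returns (best index, all scores)."""
    scores = image_embs @ text_emb / (
        np.linalg.norm(image_embs, axis=1) * np.linalg.norm(text_emb) + 1e-9
    )
    return int(np.argmax(scores)), scores
```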
16:20 | DREAM: A Dual Representation Learning Model for Multimodal Recommendation ABSTRACT. Multimodal recommendation focuses on effectively exploiting both behavioral and multimodal information for the recommendation task. However, most existing models suffer from the following issues when fusing information from these two domains: (1) they underutilize modal information, relying only on direct concatenation, addition, or simple linear layers for modal information extraction; (2) they treat modal features as learnable embeddings, causing the modal embeddings to gradually deviate from the original modal features during learning, an issue we refer to as Modal Information Forgetting; and (3) they fail to account for the significant differences between the behavior and modality distributions, leading to representation misalignment. To address these challenges, this paper proposes a novel Dual REpresentAtion learning model for Multimodal recommendation, called DREAM. For sufficient information extraction, we introduce two separate lines, a Behavior Line and a Modal Line, in which a Modal-specific Encoder is applied to empower the modal representations. To address Modal Information Forgetting, we introduce a Similarity Supervised Signal that constrains the modal representations. Additionally, we design a Behavior-Modal Alignment module that fuses the dual representations through Intra-Alignment and Inter-Alignment. Extensive experiments on three public datasets demonstrate that DREAM achieves state-of-the-art (SOTA) results. The source code is available at https://anonymous.4open.science/r/DREAM-8497. |
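The Similarity Supervised Signal can be read as a simple anchoring loss (a sketch under assumptions, not the released code): the learnable modal embedding is pulled back toward the frozen original modal feature so it does not drift away during training, countering Modal Information Forgetting.

```python
# Anchor learnable modal embeddings to the original (frozen) modal features.
import torch
import torch.nn.functional as F

def modal_anchor_loss(learned_emb: torch.Tensor, frozen_feature: torch.Tensor) -> torch.Tensor:
    """Penalize drift of learnable embeddings from the original modal features."""
    return (1.0 - F.cosine_similarity(learned_emb, frozen_feature.detach(), dim=-1)).mean()
```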
16:40 | MeDRNet: A Knowledge-Augmented Multi-Model Framework for Robust Medical Language Understanding ABSTRACT. While Large Language Models (LLMs) hold great promise for medical applications, their susceptibility to subtle linguistic variations—such as terminology differences, colloquial phrasing, and altered word order—poses a critical challenge for clinical reliability. We propose MeDRNet, a knowledge-enhanced multi-model medical AI framework that dynamically fuses generic and domain-specific models through an adaptive routing mechanism. Its modular design integrates innovative techniques, including adversarial training, logical consistency constraints, and a knowledge alignment module, leveraging medical knowledge graphs and retrieval-augmented generation to effectively mitigate hallucinations and enhance factual accuracy. MeDRNet is designed to maintain semantic stability under diverse clinical inputs, including noisy, informal, and domain-specific queries, making it suitable for high-stakes healthcare scenarios. Extensive experiments on PromptCBLUE, MultiMedBench, and a Real-World Query Set demonstrate that MeDRNet consistently outperforms leading baselines—including GPT-4, Aquila-Med LLM, and HuatuoGPT-o1—in terms of accuracy, robustness, and hallucination resistance. These findings establish MeDRNet as a scalable and trustworthy foundation for real-world clinical language understanding tasks. The framework is readily extensible to downstream applications such as diagnostic decision support, electronic health record (EHR) summarization, and multilingual medical QA, offering a promising pathway for integrating LLMs into next-generation clinical workflows. |
17:00 | Multi-modal Multi-objective Particle Swarm Optimization Using Growing Neural Gas Network ABSTRACT. In multi-modal multi-objective optimization (MMO), multiple Pareto optimal solutions with distinct decision variables can be projected onto an identical objective vector on the Pareto front. Numerous optimization algorithms develop sophisticated diversity-preserving mechanisms to extensively explore the Pareto set (PS). However, existing work has neglected explicit learning of the Pareto set, and the intersection of machine learning and MMO remains largely unexplored. To advance the field, a multi-modal multi-objective particle swarm optimization algorithm that incorporates a growing neural gas network, termed MMPSO-GNG, is proposed. The algorithm incrementally learns the topological structure of the PS to construct the network, and a network-based solution generator and selector are developed to facilitate exploration and maintain diversity. The generator leverages the network nodes to identify the neighborhood of each particle and guide the update of its position. The selector combines crowding distance and node-associated particle count to maintain a diverse set of particles. Performance evaluation on the CEC 2020 benchmark suite reveals that MMPSO-GNG surpasses five competing algorithms, thereby validating the effectiveness of integrating machine learning into particle swarm optimization for complex multi-modal multi-objective problems. |
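The selector combines a node-associated particle count with the standard crowding distance; the latter is shown below as a small, self-contained sketch (the node-count term is omitted, so this is only one half of the described selector).

```python
# Standard crowding distance over a set of particles' objective vectors.
import numpy as np

def crowding_distance(objectives: np.ndarray) -> np.ndarray:
    """objectives: (n_particles, n_objectives). Boundary particles get infinite distance."""
    n, m = objectives.shape
    distance = np.zeros(n)
    for k in range(m):
        order = np.argsort(objectives[:, k])
        distance[order[0]] = distance[order[-1]] = np.inf      # keep the extremes
        span = objectives[order[-1], k] - objectives[order[0], k]
        if span == 0:
            continue
        distance[order[1:-1]] += (objectives[order[2:], k] - objectives[order[:-2], k]) / span
    return distance
```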
17:20 | BiGMF: Multimodal Sentiment Analysis By Bidirectional Cross-Modal Attention with Geometric Volume Regularization PRESENTER: Youwei Zhang ABSTRACT. Multimodal Sentiment Analysis (MSA) aims to integrate text, audio, and visual information to better understand human emotions. However, existing approaches lack a structured bidirectional mechanism for exchanging semantic information among the text, audio, and visual modalities, which makes it difficult to model complex cross-modal dependencies and restricts their capacity to capture detailed semantic correlations. Moreover, previous methods typically fail to align representations across modalities effectively, leading to semantic inconsistencies and redundant information during fusion. To address these issues, this paper proposes a novel bidirectional cross-modal fusion framework named BiGMF. The method is built on a hierarchical cross-modal interaction architecture that enables bidirectional information exchange at multiple levels, enhancing the modeling capacity for cross-modal interactions. In addition, a geometric volume regularization strategy is introduced to reinforce semantic consistency: it explicitly promotes the alignment of modality-specific features by constraining the geometric volume of their joint distribution in a shared embedding space. Extensive experiments on two MSA benchmarks demonstrate the effectiveness of the proposed method. |
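One concrete reading of geometric volume regularization (an assumption; the paper's definition may differ) is to penalize the volume spanned by each sample's text, audio, and visual embeddings: the Gram determinant of the normalized vectors shrinks toward zero as the three modalities align.

```python
# Gram-determinant "volume" penalty over per-sample modality triplets (illustrative).
import torch
import torch.nn.functional as F

def volume_regularizer(z_text, z_audio, z_vision, eps: float = 1e-4) -> torch.Tensor:
    """Each z_*: (batch, d). Returns the mean squared volume of the modality triplets."""
    Z = torch.stack(
        [F.normalize(z_text, dim=-1), F.normalize(z_audio, dim=-1), F.normalize(z_vision, dim=-1)],
        dim=1,
    )                                            # (batch, 3, d)
    gram = Z @ Z.transpose(1, 2)                 # (batch, 3, 3) pairwise cosine similarities
    eye = eps * torch.eye(3, device=Z.device, dtype=Z.dtype)
    return torch.det(gram + eye).mean()          # -> 0 when the modalities are aligned
```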
16:00 | TRUST: Transparent, Robust and Ultra-Sparse Trees ABSTRACT. Piecewise-constant regression trees remain popular for their interpretability, yet often lag behind black-box models like Random Forest in predictive accuracy. In this work, we introduce TRUST (Transparent, Robust, and Ultra-Sparse Trees), a novel regression tree model that combines the accuracy of Random Forests with the interpretability of shallow decision trees and sparse linear models. TRUST further enhances transparency by leveraging Large Language Models to generate tailored, user-friendly explanations. Extensive validation on synthetic and real-world benchmark datasets demonstrates that TRUST consistently outperforms other interpretable models -- including CART, Lasso, and Node Harvest -- in predictive accuracy, while matching the accuracy of Random Forest and offering substantial gains in both accuracy and interpretability over M5', a well-established model that is conceptually related. |
16:20 | Balancing Fairness and Performance Under Multiple Sensitive Attributes ABSTRACT. Advances in machine learning enable solutions to increasingly complex problems. However, the predominant focus on predictive accuracy in many models often results in insufficient attention to potential biases against certain groups, thereby highlighting the critical need for fairness-aware machine learning. While most existing studies focus solely on debiasing with respect to a single sensitive attribute (e.g., race or gender), they fail to simultaneously consider fairness under multiple sensitive attributes. Furthermore, current fairness-enhancing approaches frequently degrade model performance. To address these limitations, we propose a novel framework named BFPM that achieves a better balance between fairness and performance across multiple sensitive attributes. BFPM consists of two parts. First, in the data pre-processing stage, we generate synthetic samples to balance the proportion of multiple sensitive attributes in the dataset, thereby enhancing fairness. Second, in the in-processing stage, we employ a retrieval-augmented model to obtain the context of each sample, thereby strengthening its representation. Comprehensive experiments across benchmark datasets demonstrate that BFPM significantly outperforms state-of-the-art methods, simultaneously improving fairness while maintaining or enhancing performance. |
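The pre-processing idea can be approximated with simple oversampling across the joint groups formed by the multiple sensitive attributes (a stand-in sketch; BFPM generates synthetic samples rather than duplicating existing ones).

```python
# Oversample minority joint-attribute groups until all groups are equally represented.
import numpy as np

def balance_joint_groups(X: np.ndarray, y: np.ndarray, groups: np.ndarray, seed: int = 0):
    """groups: (n,) joint group ids, e.g. 'female|>=40', combining several sensitive attributes."""
    rng = np.random.default_rng(seed)
    uniq, counts = np.unique(groups, return_counts=True)
    target = counts.max()
    keep = []
    for g, c in zip(uniq, counts):
        members = np.where(groups == g)[0]
        keep.append(members)
        if c < target:                                       # top up minority groups
            keep.append(rng.choice(members, target - c, replace=True))
    idx = np.concatenate(keep)
    return X[idx], y[idx], groups[idx]
```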
16:40 | CRQCDM: A Causal Representation and Contextual Q-matrix Cognitive Diagnosis Model PRESENTER: Zhiwei Cai ABSTRACT. Cognitive diagnosis is a fundamental task in intelligent education, aiming to accurately assess students' latent mastery of knowledge concepts. However, existing models typically face two critical challenges. First, they generally treat knowledge concepts as isolated entities, failing to model the pedagogically-grounded causal dependencies among them. Second, as the scale of exercise pools and knowledge systems on online education platforms continues to grow, omissions often occur when annotating exercises with their associated fine-grained knowledge concepts. To address these issues, this paper proposes a causal representation and contextual Q-matrix cognitive diagnosis model (CRQCDM). The model operates through two synergistic mechanisms. First, a causal information-guided representation learning module is used to model the dependencies among knowledge concepts based on a predefined causal graph, generating more interpretable and nuanced student and exercise representations. Second, a contextual Q-matrix enhancement module integrates the student and exercise representations to uncover implicit knowledge concepts associated with the exercises. Extensive experiments were conducted with CRQCDM on three real-world datasets. The results demonstrate that the performance of CRQCDM is superior to that of existing methods. |
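The causal-graph-guided representation step might look like a single message-passing pass over the predefined causal graph (an illustrative sketch, not CRQCDM's module): each concept embedding is enriched with the mean embedding of its causal prerequisites.

```python
# One propagation step of concept embeddings along a predefined causal graph.
import torch

def propagate_over_causal_graph(concept_emb: torch.Tensor, causal_adj: torch.Tensor) -> torch.Tensor:
    """concept_emb: (K, d); causal_adj: (K, K), causal_adj[i, j] = 1 if concept i is a prerequisite of j."""
    in_degree = causal_adj.sum(dim=0, keepdim=True).clamp(min=1.0)   # (1, K)
    parent_mean = (causal_adj.t() @ concept_emb) / in_degree.t()     # mean embedding of prerequisites
    return concept_emb + parent_mean
```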
17:00 | A Multi-Stage Data Construction Approach for Code-Switching Grammatical Error Correction PRESENTER: Muyang Xu ABSTRACT. Code-switching (CSW) refers to the phenomenon where multilingual speakers integrate multiple languages within a single utterance. Although recent studies have made notable progress in developing Grammatical Error Correction (GEC) systems for CSW scenarios involving Chinese lexical items, many still rely on simplistic translation-based data generation, which often limits semantic diversity and fails to capture the complexity of natural CSW expressions. To address this issue, we propose a multi-stage data construction approach to enrich training datasets and improve model generalization. Specifically, we first employ a model-based generation method to produce monolingual augmented data, followed by a perplexity-based (PPL) adaptive filtering algorithm to ensure data diversity and quality. Next, we apply three levels of translation-based augmentation to both the filtered and the original datasets, effectively simulating natural CSW patterns at varying levels of complexity. Finally, we perform multi-stage model training on the combined datasets to progressively enhance model robustness across diverse data distributions. Experimental results show that our optimized model achieves an average improvement of 1.82 $F_{0.5}$ points across two CSW GEC test sets, demonstrating the effectiveness of the proposed approach. |
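The PPL-based filter can be sketched directly from the definition of perplexity (hedged: the band thresholds and the log-probability source are illustrative, and `logprob_fn` is a hypothetical helper; the paper's adaptive rule may differ).

```python
# Keep augmented sentences whose language-model perplexity falls in an accepted band.
import math

def perplexity(token_logprobs):
    """token_logprobs: natural-log probabilities of each token under a language model."""
    return math.exp(-sum(token_logprobs) / max(len(token_logprobs), 1))

def ppl_filter(sentences, logprob_fn, low: float, high: float):
    """logprob_fn(sentence) -> list of per-token log-probs (hypothetical helper)."""
    return [s for s in sentences if low <= perplexity(logprob_fn(s)) <= high]
```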
17:20 | PRISM: Principled Reasoning for Identifying and Suppressing Model Biases at Scale PRESENTER: Xunfei Zhu ABSTRACT. Large language models (LLMs) have shown impressive capabilities in diverse applications, from complex reasoning to creative generation. However, these models often rely on spurious correlations rather than causal understanding, leading to systematic biases that compromise their fairness and reliability. Current debiasing methods frequently approach bias as a single-dimensional problem, lack frameworks to differentiate between causal relationships and spurious patterns, and typically require extensive model modifications or domain-specific knowledge. We introduce PRISM, a novel framework that treats bias as a multi-dimensional causal phenomenon and operates through prompt-based learning without model modification. PRISM consists of three core elements: Dimensional Bias Identification (DBI), which isolates distinct causal dimensions of bias; Targeted Example Synthesis (TES), which creates counterfactual examples highlighting specific bias aspects; and Discriminative Learning Enhancement (DLE), which uses these examples to help models distinguish genuine features from spurious correlations. Our comprehensive evaluation across multiple datasets and model architectures demonstrates that PRISM consistently outperforms existing debiasing techniques, particularly for complex, multi-dimensional biases. Additional experiments confirm PRISM's generalizability across different models and datasets, establishing it as a flexible and effective approach to creating more fair and reliable language models. |
17:40 | RT-DETR-MO: A Lightweight Detector for Small Object Detection in Open-Water UAV Imagery PRESENTER: Yongtao Luo ABSTRACT. Detecting small objects in open-water UAV imagery is challenging due to low contrast, scale variation, and tight on-board latency constraints. We present RT-DETR-MO, where “MO” stands for Maritime Open-water, a lightweight transformer-based detector tailored for maritime scenarios. The design introduces three targeted components: a Dynamic Inception-style Mixed Convolution block (DiMConv) for adaptive multi-scale representation, a Locally-enhanced Token Statistics Self-Attention (LTSSA) that injects neighborhood priors into linear-time attention to emphasize small or clustered targets, and a lightweight Modulation Fusion Module (MFM) for branch-aware feature integration. On the SeaDronesSee benchmark, RT-DETR-MO achieves 83.9% mAP@50 and 49.9% mAP@50:95, surpassing the RT-DETR baseline by 2.4 and 2.0 points, respectively. It also cuts parameters by 35.7% and boosts inference speed by 40.7%. These results demonstrate a more favorable accuracy–efficiency–size trade-off for real-time maritime UAV detection. |
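DiMConv is described only at a high level; as a rough Inception-style stand-in (an assumption, not the paper's block), channels can be split across depthwise convolutions with different kernel sizes and re-concatenated to capture multiple receptive fields cheaply.

```python
# Inception-style mixed-kernel depthwise convolution over channel splits.
import torch
import torch.nn as nn

class MixedKernelConv(nn.Module):
    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        splits = [channels // len(kernel_sizes)] * len(kernel_sizes)
        splits[0] += channels - sum(splits)          # absorb the remainder in the first split
        self.splits = splits
        self.convs = nn.ModuleList([
            nn.Conv2d(c, c, k, padding=k // 2, groups=c)   # depthwise conv per split
            for c, k in zip(splits, kernel_sizes)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        parts = torch.split(x, self.splits, dim=1)
        return torch.cat([conv(p) for conv, p in zip(self.convs, parts)], dim=1)
```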