ISAIR2025: THE 10TH INTERNATIONAL SYMPOSIUM ON ARTIFICIAL INTELLIGENCE AND ROBOTICS
PROGRAM FOR SATURDAY, SEPTEMBER 20TH

08:30-18:00 Session 1

JianZhi Building, Room 116 (謇智楼116)

08:30
G-NeRF: A novel view synthesis method for complex scenes based on Neural Radiance Fields

ABSTRACT. Neural Radiance Fields (NeRF) have recently become an active research area in view synthesis, and a diverse set of related models has been explored. However, the vast majority of existing methods focus only on view synthesis for a single target in a simple scene and cannot achieve remarkable visual quality for complex scenes, which include uneven illumination distribution, massive background interference, and abundant target texture. To alleviate these problems, a novel view synthesis method based on the Mixture-of-Experts (MoE) model and NeRF, named G-NeRF, is proposed. We present two important innovations. First, the MoE model processes various 3D points and global information with the Permanent-NeRF-expert to improve the model's view synthesis ability in complex scenes. Second, the Detail-enhancement module optimizes the brightness changes of the scene to improve the model's performance in scenes with complex highlights and reflections, and the synthesized novel view is exported from the External-Network. Finally, to validate the effectiveness of the proposed method, this paper conducts comparative experiments on complex scene datasets. Qualitative results show that MoE with the Permanent-NeRF-expert and the Detail-enhancement module substantially enhances the model and improves the performance of G-NeRF in complex-scene view synthesis tasks.

08:35
CASA-YOLO: A High-Performance Underwater Object Detection Model Based on Sonar Images

ABSTRACT. Underwater object detection is essential in marine surveying, fisheries assessment, and autonomous underwater robot navigation. However, traditional object detection models struggle with forward-looking sonar images due to noise, blurred boundaries, low contrast, and high computational complexity, resulting in reduced accuracy and limited real-time performance. This thesis proposes CASA-YOLO, an improved underwater object detection model based on YOLOv8, optimized in feature extraction, fusion, and detection head design. The original C2f module is replaced by a novel C2f-CAFormer-CGLU structure, where CAFormer integrates self-attention with convolution to enhance feature modeling, and CGLU utilizes channel gating to improve feature distinction. SimAM attention mechanisms are applied to both Backbone and Neck structures, enhancing sensitivity to underwater targets and suppressing background interference. Additionally, an Auxiliary Detection Head is introduced to capture and fuse multi-scale features, improving detection of small and low-contrast targets. Experiments on a public sonar dataset show CASA-YOLO achieves a mean Average Precision (mAP@0.5:0.95) of 49.4%, a 3.6% improvement over YOLOv8, while maintaining computational efficiency. CASA-YOLO thus offers an effective solution for underwater detection and monitoring, supporting more precise and efficient marine exploration.

08:40
A Lightweight Underwater Object Detection Model for Sonar Images: UWFAST-YOLO

ABSTRACT. Underwater object detection plays a crucial role in marine exploration, fisheries monitoring, and underwater robotic navigation. However, traditional object detection models often struggle with sonar images due to noise interference, low contrast, and high computational complexity, which limits detection accuracy and real-time performance. To address these challenges, this paper proposes UWFAST-YOLO, an improved lightweight object detection model based on YOLOv8, designed to enhance detection performance and computational efficiency for underwater sonar images. To reduce computational costs and better adapt to the unique characteristics of sonar images, this study introduces the Fast and Efficient Adaptive Attention Module (FastE). By integrating Partial Convolution (PConv) and Efficient Multi-head Attention (EMA), FastE enhances feature extraction capability while minimizing computational overhead, improving the model's adaptability in complex underwater environments. For feature fusion, this paper proposes an improved High-level Screening Feature Pyramid Network (HSFPN), which integrates Coordinate Attention (CA) to efficiently fuse multi-scale features. This significantly enhances detection robustness, particularly in low signal-to-noise ratio (SNR) conditions and blurry boundary detection tasks. Additionally, this study designs a Task Adaptive Dynamic Detection Head (TADDH) to reinforce information interaction between classification and localization tasks. This not only boosts detection accuracy but also reduces model parameters and computational complexity, further optimizing the model's lightweight characteristics. Extensive experiments were conducted on publicly available underwater sonar datasets, demonstrating that UWFAST-YOLO significantly reduces computational costs while maintaining detection accuracy compared to the original YOLOv8 model.
Specifically, the proposed model achieves a 60% reduction in parameter size, a 50% decrease in computational cost, and a substantial improvement in inference efficiency. The results validate the effectiveness of UWFAST-YOLO for underwater sonar object detection, offering an efficient and lightweight detection solution that enhances the real-time detection capabilities of autonomous underwater exploration systems.

08:45
A review of object detection techniques for novel power systems

ABSTRACT. With the rapid evolution of smart grid technologies, UAV-assisted power equipment inspection has emerged as a critical approach for ensuring operational stability and efficiency in next-generation power systems. Recent advancements in deep learning-driven object detection methodologies have demonstrated promising results for power infrastructure applications, where symmetry in equipment geometry and asymmetry in defect patterns present unique analytical challenges. However, three critical issues persist: (1) the scarcity of annotated datasets creating asymmetry in data distribution, (2) the prevalence of small-scale targets with symmetrical structural features, and (3) the interference from asymmetrical environmental backgrounds - all impacting maintenance effectiveness. This paper systematically surveys global research progress while analyzing the evolution of detection frameworks through the lens of symmetry-aware feature extraction and asymmetry-tolerant modeling. We establish evaluation criteria emphasizing balanced performance in symmetrical component localization and asymmetrical defect characterization. Comparative analyses of leading models reveal how symmetry-enhanced architectures improve small-target recognition, while asymmetry-adaptive mechanisms mitigate complex background interference. Our investigation highlights specialized deep learning implementations for power system applications, including symmetry-preserving component detection and asymmetry-sensitive defect diagnosis. The synthesis identifies key research directions: addressing data asymmetry through synthetic augmentation, enhancing symmetry exploitation in multi-scale detection, and developing hybrid architectures that balance structural symmetry learning with environmental asymmetry suppression. 
Practical implementation strategies are proposed to advance intelligent inspection capabilities while maintaining methodological alignment with symmetry-asymmetry principles in data analysis.

08:50
RUMNet: Reconstructed Attention and Unified Multimodal Network for Medical Image Segmentation

ABSTRACT. Medical image segmentation is a critical task in medical image analysis, and accurate segmentation results often require high-quality medical image datasets as a foundation. Due to the challenges in acquiring medical datasets, efficient utilization of existing data has become an urgent problem to address. However, current methods do not fully leverage medical data. Existing methods directly embed text using attention mechanisms without performing feature alignment between text and images. To address this, we propose a novel and efficient segmentation model, RUMNet, which introduces a reconstruction attention mechanism for image-text fusion in medical semantic segmentation. We design a feature fusion module to accommodate the different natures of textual and image information, ensuring the full integration of both modalities at different stages. We evaluate RUMNet on the QaTa-COVID19 and MosMedData+ datasets, and experimental results demonstrate that RUMNet achieves superior segmentation performance with fewer parameters. RUMNet, with a total of 24.5M parameters, is trained on two medical datasets that contain both textual and image information. On the QaTa-COVID19 dataset, it achieves a Dice score of 83.82% and an mIoU of 75.16%. On the MosMedData+ dataset, it achieves a Dice score of 74.19% and an mIoU of 61.26%.

08:55
3D Hand Pose Estimation Based on Multi-scale Feature Fusion and 3D Convolutional Neural Network

ABSTRACT. This work addresses the challenging problem of estimating the 3D hand pose from a single RGB image. Most current methods regress the 3D hand pose from 2D key points, which are obtained from 2D heatmaps regressed from 2D image features. Because the image features have a single scale and much information is lost during the 2D key point regression, the accuracy of hand pose estimation suffers. To this end, we propose a hand pose estimation method based on multi-scale feature fusion and a 3D convolutional neural network. This method consists of two parts. The feature extraction part extracts multi-scale features and then fuses them through upsampling to obtain 3D image features. The pose estimation part performs 3D convolution on the 3D image features to regress 3D heatmaps and then obtains the 3D hand pose from them directly. Experimental results show that the proposed method has a mean per joint position error of only 7.32 mm on the Stereo Hand Pose Tracking Benchmark, which is better than most 3D hand pose estimation methods. (Note: This paper was accepted by ISAIR 2024 (ISAIR 2024 paper number 140), and the registration fee of RMB 6,000 was paid on time, but the paper was not included in the ISAIR 2024 proceedings. On the editor-in-chief's advice, it has been resubmitted to ISAIR 2025 so that it can be included in the ISAIR 2025 proceedings.)

09:00
An Integrated Deep Learning and Fuzzy Clustering Framework for Highway Congestion Prediction and Emergency Lane Activation

ABSTRACT. This paper addresses the issue of highway traffic congestion by proposing an integrated traffic flow prediction and congestion warning model based on YOLOv8 and deep learning. The model combines fuzzy C-means (FCM) clustering, Particle Swarm Optimization (PSO) algorithm, and Radial Basis Function (RBF) neural networks to provide real-time traffic state predictions and emergency lane activation decisions. By collecting and analyzing key traffic parameters such as flow rate, density, and speed, the model effectively forecasts dynamic traffic changes and issues congestion warnings. Further, a real-time decision-making model based on FCM-RBF, combined with a dual-lane cellular automaton model, simulates and verifies the traffic flow improvements after emergency lane activation. Experimental results demonstrate that the proposed model offers high accuracy and practicality, providing scientific decision-making support for traffic management departments to alleviate congestion and enhance road throughput.

09:05
Multi-Condition Underwater Image Enhancement Based on Rectified Flow

ABSTRACT. Underwater image enhancement (UIE) is essential for improving visual quality in underwater environments, which often suffer from issues such as color distortion, low contrast, and blurring caused by light attenuation and scattering. Traditional physics-based models and deep learning approaches dominate current research, yet both face challenges: rigid assumptions and unstable or inefficient training, respectively. To overcome these limitations, we propose an adaptive enhancement method based on Rectified Flow, which optimizes the probability transition path to achieve efficient and high-quality image restoration. Our method further incorporates physical priors, including background light maps and transmission maps, to enhance the model's understanding of underwater degradation. Moreover, an adaptive masking strategy is introduced to exclude poorly enhanced regions during training, improving data quality and generalization. Extensive experiments on benchmark datasets demonstrate that our method achieves superior performance over state-of-the-art techniques in terms of both visual fidelity and computational efficiency.

09:10
A Two-Stream Manipulation Trace Network for Face Forgery Detection in Videos

ABSTRACT. The widespread dissemination of synthetic facial images poses a serious threat to social security. Existing facial forgery detection techniques still lack a robust detector to reveal forged facial images in complex scenes. To overcome this deficiency, we propose a new network named Two-Stream Manipulation Trace Network (TSMT) to learn the subtle manipulation traces of deep forgery methods on facial regions. TSMT includes a spatial manipulation trajectory extraction module and a frequency domain manipulation trajectory extraction module, which respectively learn subtle manipulation traces in the image space via convolutional neural networks and extract features of high-frequency information in the image frequency domain. This high-frequency information is easier to detect because facial forgery areas usually exist at the edges and details of the face. Additionally, we design a feature fusion module combined with a self-attention mechanism. This module fuses the manipulation traces in the spatial and frequency domains to generate the final features to be classified. Experiments are conducted to evaluate the performance of the proposed model on three commonly used video datasets. The results demonstrate that our model outperforms the state-of-the-art models and has stronger robustness and generalization ability.

09:15
PromptMoE: Query Enhancement for Code Search via Prompted Mixture of Experts

ABSTRACT. Code search can be seen as a multimodal task, where natural language queries must be effectively mapped to code snippets. Code search systems are essential for developers, facilitating the retrieval of relevant code snippets based on natural language queries. However, the efficacy of these systems is often hindered by the variability and ambiguity inherent in user queries. In this paper, we introduce PromptMoE (Prompted Mixture of Experts), an innovative framework designed to enhance query understanding and improve code search performance. PromptMoE leverages a mixture of experts architecture, where multiple specialized models, each proficient in a particular aspect of code semantics, are dynamically selected based on the given query. By incorporating prompting techniques, the system tailors the query to better align with the capabilities of the chosen experts, thereby refining the search results. Our extensive experiments on benchmark datasets demonstrate that PromptMoE significantly outperforms existing state-of-the-art code search methods. This enhancement is attributed to the system's ability to process diverse query types adaptively and its robust handling of complex code semantics. Our findings suggest that PromptMoE represents a promising direction for advancing intelligent code search systems.

09:20
Semantic Information-Driven 6DOF Pose Estimation

ABSTRACT. Unordered grasping represents a critically significant and profoundly challenging task within the field of robotic manipulation. Its applications are extensive, particularly in industrial settings. 6D pose estimation serves as a pivotal technological underpinning for such robotic applications and has attracted substantial attention from both academia and industry. In this study, we introduce a 6D pose estimation method tailored specifically for texture-less rigid industrial parts. Our approach is guided by prior semantic information and combines deep learning-based object detection with traditional pose estimation solutions, organized into two successive stages. In the initial stage, a state-of-the-art deep learning-based object detection method is employed to classify the object's category and potential location, followed by the use of point-pair features to recover the object's pose. We evaluate the performance of the proposed method on the MVTec ITODD [32] dataset, demonstrating its superiority over the majority of existing methods. Furthermore, we integrate this method into a robotic grasping system, with experimental results affirming its high precision and rapid execution speed, rendering it highly suitable for bin-picking tasks.

09:25
A review of multi-source and multi-modal anomaly data detection for new power systems

ABSTRACT. The power grid is becoming increasingly digitalized as a result of the widespread use of power electronic equipment and the wide availability of new energy sources. Multimodal anomaly detection technology is especially crucial for handling the increasing complexity and volatility of the power system and guaranteeing its safe and stable operation. Based on an analysis of the architecture of the new power system and the multisource, multimodal characteristics of its data, this paper analyzes anomaly detection methods for single-modal data and for fused multisource and multimodal data, comparing single-modal with multimodal methods and multimodal fusion with non-fusion methods. Lastly, the difficulties facing current research are enumerated and directions for future development are suggested.

09:30
Mutton adulteration identification using multispectral imaging

ABSTRACT. This study explores the use of multispectral imaging (MSI) combined with machine learning to detect duck meat adulteration in mutton rolls. A total of 240 samples (120 pure mutton and 120 adulterated with duck meat) were analyzed using MSI in the 400–970 nm wavelength range. Significant spectral differences were observed between lamb and duck meat, particularly in the 600–970 nm band. Principal component analysis (PCA) captured 95.44% of the spectral variance using the first three components. Classification models were developed with Support Vector Machine (SVM), Backpropagation Neural Network (BPNN), Partial Least Squares Discriminant Analysis (PLS-DA), and Extreme Gradient Boosting (XGBoost). The SVM model with first-order derivative preprocessing (1stDe-SVM) achieved the highest validation accuracy of 98.33%. Feature selection methods (SPA, CARS, OPBS, and PCA) were applied to simplify the model, with SPA identifying four key wavelengths (365, 620, 730, and 780 nm). The simplified SPA-SVM model achieved a validation accuracy of 95.00%. Visualization maps generated using these wavelengths effectively highlighted the distribution of adulterated duck meat in mutton samples. The results demonstrate that MSI and machine learning can efficiently detect duck meat adulteration, providing a foundation for developing portable spectral detection devices.

09:35
Multi-source data fusion algorithm based on deep neural network for underground power cable

ABSTRACT. Underground power cables are a critical component of modern power systems, and monitoring their operational status is vital for ensuring grid stability. However, due to the complex operating environment of underground cables, traditional single-sensor data analysis methods face significant limitations when dealing with multi-source data fusion and nonlinear fault patterns. This paper proposes a multi-source data fusion algorithm (MDF-DNN) for underground power cables based on deep neural networks (DNN), which combines convolutional neural networks (CNN) and long short-term memory networks (LSTM) to efficiently fuse heterogeneous sensor data and perform fault diagnosis. Experimental results indicate that, compared with traditional methods such as weighted averaging, Kalman filtering, and support vector machines (SVM), the proposed approach exhibits significant advantages in accuracy, robustness, and fault diagnosis efficiency. This research provides an efficient and reliable solution for complex cable monitoring scenarios.

09:40
Sought-after game Wordle: Cracking the secret of the mysterious grid diagram

ABSTRACT. Wordle is a popular word-guessing game that requires players to guess a 5-letter word correctly within six chances. After each guess, the system displays each letter on a differently colored background square and moves to the next guess until the chances run out. Each player receives a chart of their results, which can be shared on Twitter. Against this background, we build an ARIMA time series model to predict the number of result reports collected and the percentage of players choosing hard mode on a given date. A BP neural network model is proposed to predict the percentage of tries. K-means clustering and a decision tree classification model are employed to classify words. Finally, we combine cultural context and psychology to mine other interesting characteristics of the data. For problem 1, we consider that the number of reported results is time-dependent, so we develop a time series model to predict it. First, the data are pre-processed and some outliers are removed. Because the time series model requires time continuity, we interpolate over these outliers to obtain the processed data set. Then, with the help of SPSS software, we find the most suitable model to be ARIMA(0,1,13). From this model, the predicted number of reports on March 1, 2023 is 15246 with a prediction interval of (8049, 26368), and the percentage of hard mode players will be 9.625%. To evaluate the relationship between word attributes and the percentage of players choosing hard mode (HMP), we first established four attribute indicators: WF, LR, SLF, and IF. By calculating the Spearman correlation coefficients, we were surprised to find that none of the four indicators was correlated with HMP. For problem 2, we employ a BP neural network model to make predictions for the case where multiple input variables affect multiple output variables.
The four attributes of the word (LR, WF, SLF, IF) and HMP are normalized as the input signals, and the percentages of tries are the output signals. Using the HMP obtained from the time series model of problem 1, we predict that when the word is EERIE on March 1, 2023, the percentage distribution of attempts for that day from 1 to 7 tries is 0.0120%, 1.4470%, 8.2390%, 28.8315%, 33.7353%, 21.9698%, 7.7510%. For problem 3, we first perform systematic clustering and obtain the optimal number of classes, 3, through the elbow method. By analyzing the distribution of these three types of words, we define the three categories as difficult, medium, and easy. Second, by observing the clustering results, we found significant differences in LR and WF among the three types of words. Therefore, we use these two indicators as the classification basis of the decision tree and obtain the following classification rules: when LR < 1 and WF > 0.519972, the difficulty is classified as easy; when LR >= 1 and 0.527278 <= WF < 2.02459, the difficulty is classified as difficult; otherwise, it is classified as medium. Based on the attributes of the word 'EERIE', we classify it as difficult. In addition, by digging further into the raw data, we found some interesting features. For example, on some special days, the number of people playing Wordle decreased significantly. A more difficult word of the day also reduces the desire to share.
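The classification rules quoted in this abstract can be written out directly. The following is a minimal sketch, assuming LR and WF are already computed as plain numbers (the abstract does not define how they are derived); the function name `classify_difficulty` and the sample inputs are hypothetical and are not taken from the paper.

```python
def classify_difficulty(lr: float, wf: float) -> str:
    """Apply the decision-tree rules quoted in the abstract.

    lr and wf stand for the LR and WF word attributes; how they are
    computed is not stated in the abstract, so they are taken as given.
    """
    if lr < 1 and wf > 0.519972:
        return "easy"
    if lr >= 1 and 0.527278 <= wf < 2.02459:
        return "difficult"
    # Every other combination falls through to the middle class.
    return "medium"


if __name__ == "__main__":
    # Illustrative inputs only; these are not values from the paper.
    for lr, wf in [(0, 1.0), (2, 1.0), (2, 3.0), (0, 0.1)]:
        print(lr, wf, classify_difficulty(lr, wf))
```

Note that the rules leave gaps (for example LR < 1 with a low WF), all of which land in the medium class through the final fall-through.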

09:45
Design of an Intelligent DC Regulated Power Supply

ABSTRACT. With the rapid advancement of artificial intelligence and IoT technologies, power systems play an increasingly critical role in modern society. As electronic devices become more pervasive, ensuring safe, stable, and efficient operation of power systems and power electronics has emerged as a major research focus. Traditional analog power supplies can no longer meet the growing demands for compact size, low cost, and high reliability. This paper presents the design of an intelligent DC-regulated power supply capable of precise voltage regulation (±0.02V accuracy for a 20V output from a 12V input) and integrated overcurrent protection, significantly enhancing system safety and stability. The design adopts a modular architecture with high-reliability components and incorporates a fuzzy PID control algorithm to improve dynamic response and adaptability compared to conventional PID methods. Key hardware components include a Boost converter as the main topology, an STM32F103C8T6 microcontroller for control, and dedicated circuits for voltage/current sampling and protection. Software implementation features PWM modulation, real-time PID adjustment, and AD conversion. Simulation (via Simulink) and physical prototyping validate the design’s performance, demonstrating stable operation under test conditions with optimized cost, precision, and scalability for future upgrades.

09:50
Research on Tobacco Leaf Mildew Detection Method Based on Multi-Channel Imaging System

ABSTRACT. Tobacco leaf mildew is a critical issue affecting the quality of tobacco during storage and transportation. Traditional detection methods rely on manual inspection, which suffers from low efficiency and strong subjectivity. This study proposes a non-destructive detection method for tobacco leaf mildew based on multi-channel imaging technology. By constructing a multi-channel imaging system, data from healthy and mildewed tobacco leaves were collected. Data quality was enhanced using preprocessing methods such as the Savitzky-Golay (SG) smoothing filter, Standard Normal Variate (SNV) transformation, and Multiplicative Scatter Correction (MSC). Key spectral features were extracted, and a mildew discrimination model was established. Experimental results demonstrated that the PCA-RF detection model, based on Principal Component Analysis (PCA) and Random Forest (RF), achieved an accuracy rate of over 96%, significantly outperforming traditional manual inspection methods. This approach provides a reliable technical means for early mildew detection during tobacco storage, offering significant value for improving quality control in the tobacco industry.
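Of the preprocessing steps named in this abstract, the Standard Normal Variate transformation is simple enough to illustrate: each spectrum is centered on its own mean and scaled by its own standard deviation. The sketch below assumes a spectrum is a plain list of intensities and uses the sample standard deviation; the function name `snv` and the example values are hypothetical, and the authors' actual pipeline is not described in the abstract.

```python
import statistics


def snv(spectrum):
    """Standard Normal Variate: center and scale a single spectrum.

    Assumes the sample standard deviation (n - 1 denominator); some
    implementations use the population form instead.
    """
    mean = statistics.fmean(spectrum)
    std = statistics.stdev(spectrum)
    return [(value - mean) / std for value in spectrum]


if __name__ == "__main__":
    # Illustrative intensities only, not measured data.
    print(snv([1.0, 2.0, 3.0]))
```

Because each spectrum is normalized independently, SNV reduces sample-to-sample offsets and scaling differences without needing a reference spectrum.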

09:55
System design of an underwater trash removal robot

ABSTRACT. Global ocean plastic pollution has surpassed 150 million tons and increases by 8–12 million tons annually. Because traditional manual and trawl-based cleanup approaches lack sufficient single-machine salvage capacity, consume high energy, and have a limited operating radius, they cannot meet expanding cleanup demands. To address this, we present an innovative autonomous underwater trash-cleaning robotic system comprising four mechanical subsystems (buoyancy, collection, clamping, and propulsion) and a double closed-loop PID controller featuring a gyroscope-based positional outer loop and an incremental velocity inner loop. Our visual recognition module builds on lightweight YOLOv5n, integrating GhostNetV2, BiFPN, and RFCBAMConv, trimming parameters from 2.50 M to 1.67 M and raising precision from 0.948 to 0.959 while maintaining mAP50 and mAP50-95. Ablation, real-world deployment, and comparative experiments demonstrate that our improved detector surpasses mainstream architectures such as DETR, EfficientDet, and CenterNet in detection performance. The complete robotic design provides a practical, scalable, energy-efficient, and environmentally friendly solution for large-scale marine plastic debris removal.
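One common way to realize a double closed-loop PID structure like the one named in this abstract is to let a positional-form outer loop turn an angle error into a speed setpoint, and an incremental-form inner loop turn the speed error into an actuator command. The sketch below assumes that reading; the class names, gains, and the `cascade_step` helper are hypothetical and do not come from the paper.

```python
class PositionalPID:
    """Positional-form PID: output computed from the accumulated error."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint, measured, dt):
        error = setpoint - measured
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


class IncrementalPID:
    """Incremental-form PID: each step adds a change to the last output."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.e1 = 0.0  # error at step k-1
        self.e2 = 0.0  # error at step k-2
        self.output = 0.0

    def update(self, setpoint, measured):
        e = setpoint - measured
        delta = (self.kp * (e - self.e1)
                 + self.ki * e
                 + self.kd * (e - 2.0 * self.e1 + self.e2))
        self.e2, self.e1 = self.e1, e
        self.output += delta
        return self.output


def cascade_step(outer, inner, target_angle, gyro_angle, measured_speed, dt):
    """One control step: the outer loop sets the inner loop's speed target."""
    speed_setpoint = outer.update(target_angle, gyro_angle, dt)
    return inner.update(speed_setpoint, measured_speed)
```

The incremental form only stores the last two errors and the last output, which makes it convenient on small microcontrollers and less prone to large jumps when the setpoint changes.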

10:00
Research on Detection and Separation Technology of Carbon Dioxide and Water Vapor

ABSTRACT. In today's industrial manufacturing, processes such as high-temperature combustion, chemical reactions, cooking, and drying inevitably produce and emit large amounts of carbon dioxide and water vapor. These two gases not only exacerbate the greenhouse effect and increase atmospheric humidity, placing long-term pressure on the global climate and local environmental quality, but also affect the operation of production lines. Therefore, carbon dioxide and water vapor are not only key targets of environmental management but also key parameters that must be monitored and separated to optimize process stability and improve product quality. In this paper, an integrated detection and separation technology is proposed for the real-time monitoring and efficient separation of carbon dioxide and water vapor at industrial sites.

10:05
Facial expression recognition using SqueezeNet with convolutional block attention module

ABSTRACT. Intelligent facial expression recognition (FER) plays a crucial role in human-computer interaction and mental health disorder detection. In this study, an improved SqueezeNet model combines the convolutional block attention module (CBAM) to extract subtle features to achieve high accuracy in real-time emotion detection. The lightweight convolutional neural network model incorporating the attention mechanism can speed up model inference without losing recognition accuracy. The experimental result shows the improved CBAM-based SqueezeNet model achieves 66.73%, 99.66%, and 95.09% accuracy on the FER2013, CK+, and JAFFE datasets, respectively, while running at an average speed of 79 FPS with a significant reduction in the number of parameters and computation.

10:10
HiResSiamNet: Hierarchical Residual Siamese Networks for Source Camera Device Linking on Small-sized Images

ABSTRACT. Source camera device linking plays a crucial role in image forensics to verify whether two digital images were taken by the same camera device. While significant progress has been made by the PRNU-based correlation methods over the past decade, the primary challenge remains unresolved in scenarios with small-sized images. In this paper, we formulate the task of source camera device linking as a binary classification problem and propose a novel framework, hierarchical residual Siamese networks, to solve the challenge of small-sized images. By leveraging the encoder-decoder architecture with skip connections and the capabilities of hierarchical residual blocks, the proposed hierarchical residual network achieves hierarchical multi-scale feature aggregation across spatial and channel dimensions, facilitating the extraction of subtle sensor pattern noises from input images. Compared with the existing state-of-the-art methods, including both the PRNU-based correlation methods and deep learning-based methods, our proposed method not only achieves the best overall performance but also demonstrates a superior ability to balance sensitivity and specificity, thereby providing a more reliable and less biased classification.

10:15
Research on Music Emotion Recognition Based on Multi-scale Attention and Cross-modal Contrastive Learning

ABSTRACT. Music emotion recognition is a subfield of Music Information Retrieval (MIR). In recent years, thanks to continuing advances in deep learning, using computers to recognize human emotions has attracted an increasing number of researchers. However, because music emotion is subjective, extracting emotion-related features is difficult. On the one hand, it is hard to establish a connection between complex music encodings and the corresponding emotions. On the other hand, labeled data pairing audio with emotion tags is scarce [1]. To address these problems, this paper proposes an attention-based multi-scale music emotion recognition model. Most existing models overlook the low-level features of emotions. Here, a multi-scale parallel branch structure ensures that the model learns both the low-level and high-level features of music simultaneously, both of which play an important role in the music emotion recognition task. Compared with other algorithms, this approach lets the model learn more comprehensive emotional information, thereby improving recognition accuracy. In addition, most current work holds that music emotions involve spatial and temporal features. Therefore, a polar self-attention mechanism is used to help the model learn emotion-related features: this attention focuses on fine-grained image information and enhances the model's ability to extract spectrogram features. Finally, a Temporal Convolutional Network (TCN) is used to extract music features. The dilated and causal convolutions of the TCN achieve a larger receptive field with fewer parameters, which helps the model focus on the historical information of the music. On the PMEmo music dataset, the final model achieves a Root Mean Squared Error (RMSE) of 0.1104 and an R² score of 0.6724 for Arousal, and an RMSE of 0.1155 and an R² score of 0.5022 for Valence.
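The claim above that dilated causal convolutions reach a large receptive field with few parameters can be illustrated with a small NumPy sketch. The kernel size and dilation schedule below are assumptions chosen for illustration; the abstract does not state the configuration used in the paper.

```python
import numpy as np

def receptive_field(kernel_size, dilations):
    """Input steps visible to one output step of stacked causal convs
    (one convolution per entry in `dilations`)."""
    return 1 + (kernel_size - 1) * sum(dilations)

def causal_dilated_conv1d(x, weights, dilation):
    """y[t] = sum_j weights[j] * x[t - j * dilation], zero-padded on the left,
    so y[t] never depends on samples later than t."""
    k = len(weights)
    pad = (k - 1) * dilation
    padded = np.concatenate([np.zeros(pad), x])
    y = np.zeros(len(x))
    for j, w in enumerate(weights):
        start = pad - j * dilation
        y += w * padded[start:start + len(x)]
    return y

# Hypothetical stack: kernel size 3, dilations doubling per layer.
print(receptive_field(3, [1, 2, 4, 8]))  # 31
```

With this hypothetical stack, four layers of three weights each (12 weights in total) cover 31 time steps, whereas a single undilated causal filter would need 31 taps to see the same history.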

10:20
A Study of Self-Trained Unsupervised Semantic Segmentation Based on Dual-Branch

ABSTRACT. Image semantic segmentation, a core research field of computer vision, plays a crucial role in image understanding. However, training semantic segmentation networks requires a large amount of fine-grained pixel-level segmentation labels, and obtaining these labels requires significant human effort. Unsupervised semantic segmentation methods, which use labeled or easily labeled source datasets along with unlabeled target-domain image datasets to achieve high accuracy on the target-domain test dataset, can save considerable human resources and have become a current research hotspot. However, existing unsupervised methods extract insufficient semantic information when they are pre-trained on large-scale domain datasets, fine-tuned, and applied directly to the target domain. To solve this problem, we design a dual-branch unsupervised semantic segmentation algorithm in this paper. First, to address the problems of current mainstream Transformer-based methods, a semantic branch is designed on the backbone network to capture semantic context information. Since the target domain has no true labels, while retaining the pseudo-labels generated by the teacher network, we propose a dual-branch internal loss method that uses the pseudo-labels generated by the student network to guide the semantic branch, enhancing its ability to extract image context information. In the decoding stage, the feature fusion ability of the decoder is improved through the polar self-attention mechanism. Finally, we evaluate on two main unsupervised domain semantic segmentation tasks, GTA5→Cityscapes and SYNTHIA→Cityscapes. Compared with the baseline model, the mIoU increases by 2.6% and 3.7%, respectively, significantly improving the segmentation quality of the model.

10:25
Research on Controllable Music Generation Algorithm Based on Multi-Branch Fusion

ABSTRACT. With the accelerating progress of information technology, the demand for music composition continues to grow. However, existing music generation algorithms primarily focus on improving the quality of generated samples, with most methods offering only limited control over the generated sequences. To address this issue, this paper proposes a music generation algorithm based on multi-branch fusion. The algorithm enhances the diversity and quality of generated music by incorporating a melody description branch and fusing expert description features, learned description features, and melody description features through parallel cross-attention. To further optimize the model's generative capabilities, this paper introduces the RoBERTa pre-trained model and a contrastive learning method based on instance discrimination. The contrastive learning method treats each sample as an independent category, maximizing the consistency of the same sample in the feature space while minimizing the similarity between different samples to learn discriminative representations. Based on the aforementioned research, comparative and ablation experiments were conducted on the LakhMIDI dataset. The results demonstrate that the proposed algorithm achieves improvements of 0.059 in chord accuracy, 0.056 in two cosine similarity metrics, and 0.055 in note density, validating the algorithm's effectiveness and advantages.
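The instance-discrimination objective described above (each sample is its own class; agreement of the same sample is maximized and similarity to other samples is minimized) is commonly written as an InfoNCE loss. The following NumPy sketch is one standard formulation, not necessarily the exact loss used in the paper; the temperature value and the two-view setup are assumptions.

```python
import numpy as np

def instance_discrimination_loss(z_a, z_b, temperature=0.1):
    """InfoNCE loss: row i of z_a should match row i of z_b and no other row."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature                    # (N, N) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))             # positives sit on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
aligned = instance_discrimination_loss(z, z + 0.01 * rng.normal(size=z.shape))
mismatched = instance_discrimination_loss(z, z[::-1])
print(aligned < mismatched)
```

The loss is low when the two embeddings of the same sample agree and high when each row is paired with an unrelated sample.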

10:30
Melody Extraction Based on Dual-Branch Feature Fusion and Spatial Direction Attention

ABSTRACT. With the development of music information retrieval, melody extraction has become an important research direction. Results from this task show great application potential in many fields, such as music transcription, cover song recognition, and query-by-humming systems. In this paper, a melody extraction algorithm based on dual-branch fusion and spatial directional attention is proposed. Traditional feature representation methods need to suppress harmonic and sub-harmonic signals, which causes some information loss during preprocessing. To improve the effectiveness of the input feature representation, a pre-trained network model for extracting high-level semantic features from audio is introduced to form a dual-branch structure with the original feature input. On this basis, a feature fusion module is proposed to mine deep-level feature information and realize multi-granularity feature integration. Meanwhile, to improve feature extraction efficiency, a channel convolution module is proposed, together with a new attention mechanism that efficiently acquires fine-grained local feature information along both the horizontal and vertical directions.

10:35
Generating Video with Conditional Control Diffusion Model

ABSTRACT. We present the Conditional Control Diffusion Model (CCDM), a neural network that converts a text-to-image (T2I) model into a video model by using conditional control while keeping the image quality of the original model. CCDM first trains on real video data, creating a composite model to fuse multiple frames and learn action priors. Then, CCDM adopts the Stable Diffusion architecture and integrates the T2I model, ensuring no changes to the T2I model during video generation. Finally, CCDM feeds the generated frames back into the model, reducing flickering caused by content changes. We test CCDM on various T2I models from CivitAI with different styles and features. Using prompts from the T2I model’s website, we generate videos and show that CCDM can produce dynamic information and handle generation tasks with 8GB of VRAM. CCDM has excellent potential for video generation applications.

10:40
QMB: A Quaternion-based Modality Balancing Framework for Multimodal Multilabel Emotion Recognition

ABSTRACT. Multimodal Multilabel Emotion Recognition (MMER) aims to identify multiple emotions from heterogeneous modalities. Two challenges of MMER are modality imbalance and insufficient modal interaction, which lead to limited representation capacity. In this paper, we propose a novel Quaternion-based Modality Balancing framework (QMB) to address these challenges through two key components: (1) an Adversarial Temporal Masking (ATM) strategy that masks emotionally salient segments of the dominant modality during training, thereby encouraging the model to attend to underutilized modalities and learn more balanced representations; and (2) a Hypercomplex Quaternion Fusion (HQF) module that projects modality-specific features into a quaternion space and performs Hamilton product operations, which enables efficient modeling of high-order inter-modal interactions while preserving modality-specific semantics. We evaluate our method on two benchmark datasets, CMU-MOSEI and M3ED. Experimental results demonstrate that our model achieves state-of-the-art performance, confirming the effectiveness of QMB in improving modal balance and modeling higher-order interactions among modalities.
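For readers unfamiliar with the Hamilton product mentioned above, here is a minimal sketch of the operation on quaternions stored as (real, i, j, k) arrays. How the HQF module maps modality features onto quaternion components is not specified in the abstract, so this shows only the algebraic operation itself.

```python
import numpy as np

def hamilton_product(p, q):
    """Hamilton product of quaternions p = (a1, b1, c1, d1), q = (a2, b2, c2, d2)."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return np.array([
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k component
    ])

i = np.array([0.0, 1.0, 0.0, 0.0])
j = np.array([0.0, 0.0, 1.0, 0.0])
print(hamilton_product(i, j))  # k:  [0, 0, 0, 1]
print(hamilton_product(j, i))  # -k: [0, 0, 0, -1]
```

The product is non-commutative, and every output component mixes all four input components of both operands, which is the property that makes it a candidate for modeling cross-component interactions.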

10:45
DCP-CLIP: A Coarse-to-Fine Framework for Open-Vocabulary Semantic Segmentation with Dual Interaction

ABSTRACT. Open-vocabulary semantic segmentation (OVSS) plays an essential role in real-life applications such as intelligent surveillance, self-driving, and robotic vision. Recent OVSS methods have advanced remarkably by using visual-language foundation models, yet they still suffer from the following fundamental challenges: (1) insufficient cross-modal communication between textual and visual spaces, and (2) significant computational costs from interactions with a massive number of categories. To address these issues, this paper describes a novel coarse-to-fine framework for OVSS, called DCP-CLIP. Unlike prior efforts that mainly relied on pre-defined category content and the inherent spatial-class interaction capability of CLIP, we dynamically construct category-relevant textual features and explicitly model dual interactions between spatial image features and textual class semantics. Specifically, we first leverage CLIP’s open-vocabulary recognition capability to identify semantic categories relevant to the image context, upon which we dynamically generate corresponding textual features to serve as initial textual guidance. Subsequently, we conduct a coarse segmentation by cross-modally integrating semantic information from the textual guidance into the visual representations, and achieve refined segmentation by integrating spatially enriched features from the encoder to recover fine-grained details and enhance spatial resolution. Finally, we leverage spatial information from the segmentation side to refine category predictions for each mask, facilitating more precise semantic labeling. Experiments on multiple OVSS benchmarks demonstrate that DCP-CLIP outperforms existing methods by delivering both higher accuracy and greater efficiency.

10:50
Lightweighted MX-YOLO Model for Traffic Object Detection

ABSTRACT. Traffic safety is a pivotal component of modern smart cities, yet persistent challenges remain due to frequent accidents. While surveillance systems generate extensive real-time video data, the computational demands imposed on edge devices hinder real-time processing, especially as complex models with high computational costs compromise detection efficiency. To address these limitations in traffic scenarios and enhance small object detection performance, this study proposes MX-YOLOv10, an optimized lightweight model derived from DM-YOLOv10. The proposed method integrates two key innovations: a C3Ghost module to reduce the parameters in the neck network and Reparameterized Convolution modules to replace the standard Conv-BN-SiLU blocks in the backbone. Experimental results demonstrate that MX-YOLOv10 achieves 104.2 FPS, a significant increase from the baseline 70.4 FPS, while maintaining a mAP@50 of 88.52% compared to the original 90.10%. Furthermore, the model’s parameters decrease from 2.9 million to 2.5 million, and computational costs reduce from 8.7 GFLOPs to 6.8 GFLOPs. These advancements position MX-YOLOv10 as an efficient solution for real-time traffic monitoring on resource-constrained edge devices.
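To put the figures reported above on a common scale, the following sketch converts them into relative changes. It uses only the numbers stated in the abstract; the helper name `relative_change` is introduced here for illustration.

```python
def relative_change(before, after):
    """Percentage change from `before` to `after` (negative means a reduction)."""
    return (after - before) / before * 100.0

fps_change = relative_change(70.4, 104.2)     # about +48.0 %
param_change = relative_change(2.9, 2.5)      # about -13.8 %
gflops_change = relative_change(8.7, 6.8)     # about -21.8 %
map_drop_points = 90.10 - 88.52               # 1.58 percentage points of mAP@50

print(f"FPS {fps_change:+.1f}%, params {param_change:+.1f}%, "
      f"GFLOPs {gflops_change:+.1f}%, mAP@50 -{map_drop_points:.2f} points")
```

Read this way, the trade-off is roughly a 48% throughput gain in exchange for a 1.58-point drop in mAP@50.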

10:55
MFW-RTDETR: Lightweighted Model Using MobileNetV4 and LAMP for Aerial Object Detection

ABSTRACT. This paper proposes MFW-RTDETR, a lightweight aerial photography target detection algorithm for UAV embedded devices, aiming to solve the computing power bottleneck that arises when large-scale detection models are deployed on edge devices. By integrating MobileNetV4ConvSmall as the backbone network, combining the Varifocal Loss classification loss with the Focaler-WIoUv3 bounding box regression loss, and adopting the Layer-Adaptive Magnitude Pruning strategy, the model achieves a 66% reduction in parameter scale and a 42.3% decrease in computational complexity on the VisDrone2019 dataset, while maintaining a detection accuracy of 52.08% mAP@0.5. Experimental results show that the algorithm has significant real-time advantages and robustness in complex aerial photography scenarios such as fog and nighttime.

11:00
T-Splines local refinement based on Half-edge Data Structure

ABSTRACT. Compared with other geometric modeling methods, T-splines have had a profound impact on multiple aspects of intelligent manufacturing through their flexible local refinement ability and efficient data representation. For convenience of programming, the half-edge data structure is usually used to compute T-splines. Building on this, this paper presents T-spline local refinement based on the half-edge data structure. We define a blending function object and use it to simplify the criteria for determining the legitimacy of a T-mesh. This paper also describes how the T-mesh information stored in the half-edge data structure is updated during local refinement, and avoids generating L-junctions, which raise the complexity of the T-mesh.

11:05
MENet: Learning Memory-enhanced Network for Real-time Video Lane Detection

ABSTRACT. Detecting lanes in videos accurately and quickly plays an essential role in self-driving. Although numerous approaches have been dedicated to improving lane detection accuracy using deep neural networks in challenging scenarios, e.g., severe occlusion, ambiguous lanes, and poor lighting conditions, very few efforts address lane detection in resource-constrained environments, e.g., limited computational resources, small model size, and the requirement of high running speeds. This paper addresses video lane detection in resource-constrained scenarios, where the trade-off between detection accuracy and implementation efficiency is taken into account. To achieve this, a memory-enhanced network, called MENet, is presented for real-time lane detection by fully capturing temporal contextual information across different frames. More specifically, MENet employs a temporal context refinement module (TCRM) to provide memory-based guidance for lane predictions. To capture long-term temporal context, the high-level semantics of all previous frames are implicitly memorized and updated in hidden states. In contrast, the short-term interactions of neighboring frames are recursively encoded in low-level feature embeddings. The two components complement each other to predict the lanes in the current frame. To ensure real-time estimation, TCRM adopts a lightweight LaneGRU unit that integrates long-term and short-term temporal context without relying on repeated feature extraction, thereby saving substantial computational costs. Comprehensive experimental results demonstrate that MENet is simple and effective, achieving a state-of-the-art trade-off between detection accuracy and running efficiency on the VIL-100 and OpenLane-V datasets.

11:10
Development of a Multimodal Dialogue Robot for Multi-Speakers

ABSTRACT. This study aims to develop a multimodal dialogue robot capable of engaging in natural conversations with multiple users. To achieve this, we constructed a system that integrates the MMDAgent speech dialogue platform with a Kinect sensor. The proposed dialogue robot utilizes speech recognition and image processing to detect the direction of the sound source and the face orientation of each speaker, enabling it to identify the current speaker and determine the intended addressee. This allows the robot to direct its responses to the appropriate participant in real time. We implemented a communication mechanism that transmits the estimated face orientations and sound source directions to MMDAgent, which manages the dialogue logic. The system was evaluated through functional tests involving two users and three dialogue scenarios. In each case, the robot consistently generated appropriate responses based on the user's position and gaze, confirming its effectiveness in handling dynamic multi-user interactions. This research contributes to the development of physically embodied robots capable of context-aware and socially intelligent behavior. In future work, we plan to extend the system to handle more than two users and integrate gesture recognition. We also aim to explore alternative image processing approaches such as OpenPose and OpenFace for more natural and robust human-robot interaction.

11:15
Consonant-Enhanced Hearing Aid for Speech Intelligibility in Older Adults with Mild Hearing Loss – A Listening Evaluation of Consonant Enhancement –

ABSTRACT. This study presents a listening evaluation of a consonant enhancement method designed to improve speech intelligibility for older adults with mild hearing loss. The proposed method selectively enhances word-initial consonants based on their phonetic categories — fricatives, plosives, affricates, and nasals. Unlike conventional approaches that simply apply uniform amplification, this method applies consonant category-specific enhancement, including high-frequency boosting, amplitude gain, and time stretching, tailored to the perceptual characteristics of each consonant type. Experiments were conducted using synthetic Japanese speech stimuli under three processing conditions: unprocessed, high-frequency enhancement only (conventional method), and consonant category-specific enhancement (proposed method). Twelve older participants (aged 68–77), none of whom used hearing aids, were asked to transcribe the words they heard in controlled listening sessions. Pre- and post-experiment hearing assessments were conducted, including both pure-tone and speech audiometry. The results indicated that the proposed method improved intelligibility for voiceless plosives and nasals, while results for voiced plosives varied across individuals. Recognition accuracy was positively correlated with speech audiometry scores but showed limited correlation with pure-tone thresholds. Additionally, consonant recognition accuracy varied by individual and phoneme, suggesting that personalized enhancement strategies may be required. These findings demonstrate the potential of selective consonant enhancement to support speech understanding in older adults and highlight the importance of tailoring processing methods to both phonetic features and individual hearing profiles.

11:20
Phoneme Category Classification for Consonant-Enhanced Hearing Aid System

ABSTRACT. In an increasingly aged society, the demand for hearing aids is growing due to the rise in age-related hearing loss. However, adoption remains low, particularly among older adults with mild hearing loss, who often avoid using hearing aids due to cost, discomfort, and the limited improvement in speech intelligibility. To address this issue, we propose a hearing aid system that emphasizes consonants in speech input based on their phoneme categories, such as fricatives, plosives, and nasals, to enhance speech intelligibility. This study focuses on the classification of these phoneme categories as a core component of the system. We implement and evaluate classification methods using convolutional neural networks (CNN) and long short-term memory with fully convolutional networks (LSTM-FCN). Training data were prepared from two large-scale Japanese speech corpora, and multiple segmentation strategies were tested. Experimental results show that the CNN outperforms the LSTM-FCN in classification accuracy, and the proposed approach remains robust even when applied to speech data with ambient noise. This paper presents an overview of the proposed system, the classification framework, and the results of classification experiments. Taken together, these findings demonstrate the feasibility of phoneme-category-based consonant enhancement and highlight its potential for real-time, personalized auditory support.

11:25
Geospatial Target Recognition Using Feature-Enhanced YOLOv11

ABSTRACT. This paper presents an enhanced YOLOv11 model integrated with BiFPN and GLSA attention mechanisms to advance geospatial target recognition. BiFPN enables efficient bidirectional multi-scale feature fusion, while GLSA balances global context and local detail extraction, addressing challenges like subtle target features, background interference, and small object detection. Evaluations on the COCO dataset and a self-built dataset encompassing 6 urban road target categories demonstrate the improved model outperforms YOLOv8, Faster R-CNN, and the original YOLOv11. On the self-built dataset, it achieves 57.4% mAP50, 33.2% mAP50-95, and 74.5 FPS. Experimental results validate its significant real-time performance and robustness in geospatial scenarios, providing technical support for resource exploration and environmental monitoring.

11:30
Adaptive Confidence and Credibility-Aware Advising for Robust Ad Hoc Teamwork

ABSTRACT. Action advising accelerates collaboration in ad hoc teams, but existing protocols are often communication-inefficient and vulnerable to unreliable teammates. To overcome these limitations, we propose ACCA (Adaptive Confidence and Credibility-Aware Advising), a novel framework for robust and efficient ad hoc teamwork. ACCA integrates two innovations: (1) a State-Importance Gating mechanism that filters advice requests to conserve the communication budget for critical decisions, and (2) a Dynamic Credibility Model that learns teammate reliability to select the most trustworthy guidance. Our comprehensive evaluation in a mixed cooperative-adversarial environment demonstrates that ACCA significantly outperforms standard baselines. Ablation studies confirm that both components are synergistic and essential to its success. ACCA achieves superior performance with substantially higher communication efficiency. Moreover, it maintains robustness against malicious advisors, a scenario where baseline methods suffer severe performance degradation.

11:35
TAC-Net: A Time-Frequency-Adaptive Correlation Network for A-share forecasting

ABSTRACT. Stock price forecasting poses a formidable challenge in quantitative finance due to nonlinear market behaviors, high-dimensional feature spaces, and temporal volatility. Despite advancements in deep learning for financial time-series modeling, integrating temporal patterns, frequency-domain insights, and stock-market interactions remains complex. We propose the Time-Frequency-Adaptive Correlation Network (TAC-Net), an innovative framework for A-share market forecasting, predicting t+4 day stock returns through market-state perception, frequency-aware modeling, and inter-stock relationship mining. TAC-Net features market-aware temporal attention for joint modeling of time dependencies and market conditions, Fourier-enhanced modules for multi-scale periodic signal extraction, a stock relation module capturing market-driven dependencies, and an IC-weighted loss function optimizing predictive accuracy and ranking performance. Evaluated on daily CSI 300 and CSI 800 data (2010–2025), TAC-Net surpasses baselines like LSTM, Transformer, and MASTER across metrics including IC, ICIR, RIC, RICIR, AR, and IR, demonstrating its theoretical rigor and practical efficacy for quantitative investment and risk management.
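Several of the metrics listed above have conventional definitions in quantitative finance, sketched below under the assumption that the abstract uses them in the usual sense: IC as the Pearson correlation between predicted scores and realized returns across stocks on one day, RIC as the same correlation computed on ranks, and ICIR as the mean of daily IC divided by its standard deviation. The abstract does not define them, and the paper's IC-weighted loss is not reproduced here.

```python
import numpy as np

def information_coefficient(pred, ret):
    """Pearson correlation between predictions and realized returns for one day."""
    return float(np.corrcoef(pred, ret)[0, 1])

def _ranks(x):
    """Ordinal ranks 0..n-1; this simple version assumes no tied values."""
    return np.argsort(np.argsort(x)).astype(float)

def rank_ic(pred, ret):
    """Spearman-style rank IC: Pearson correlation of the ranks."""
    return information_coefficient(_ranks(pred), _ranks(ret))

def icir(daily_ic):
    """Mean of the daily IC series divided by its sample standard deviation."""
    daily_ic = np.asarray(daily_ic, dtype=float)
    return float(daily_ic.mean() / daily_ic.std(ddof=1))

# Hypothetical one-day cross-section of four stocks.
pred = np.array([0.1, 0.3, 0.2, 0.5])
ret = pred ** 3            # monotone in pred, but not linear
print(rank_ic(pred, ret))  # ranks agree exactly, so rank IC is 1
```

Rank IC rewards getting the ordering of stocks right even when the predicted magnitudes are off, which is why it is often reported alongside IC for ranking-based strategies.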

11:40
Research on Informer Wind Power Prediction Method Based on Wavelet Multi-Scale Fusion and GRU Optimization

ABSTRACT. Wind power generation is an important part of renewable energy, and its output power has high volatility and uncertainty. To address the nonstationary, multiscale, and intertwined trend-periodic characteristics of wind power time series, this paper proposes a hybrid forecasting model combining multiscale decomposition, learnable wavelet transform, GRU, and Informer architectures. The model decomposes the original sequence into trend and seasonal components using wavelet analysis, which are then modeled by GRU and Informer, respectively. A hierarchical fusion strategy is adopted: seasonal features are aggregated bottom-up from fine to coarse scales, while trend features flow top-down. GRU captures long-term trends, and Informer, with ProbSparse attention, efficiently models high-frequency variations. Experimental results on real-world datasets show that the proposed model outperforms state-of-the-art methods in both accuracy and generalization.

11:45
Linear Regression based Self-intersection Detection Algorithm of Bézier Surfaces

ABSTRACT. Surface self-intersection detection plays an important role in industrial software for geometric modeling, CAD, and CAE, where it is essential for ensuring the geometric consistency and construction stability of the model. This paper presents a linear-regression-based self-intersection detection method for Bézier surfaces. A sample set of Bézier surfaces is constructed, and the input features and regression labels are defined from spatial point pairs associated with geometric and parametric distances. A linear regression algorithm for surface self-intersection is proposed to learn the mapping relationship between the point pairs. To address the imbalance in the distribution of self-intersecting and non-self-intersecting samples, the ADASYN adaptive sampling method is used for sample enhancement. The experimental results show that the new algorithm significantly reduces the misjudgment rate while maintaining high detection accuracy for surface self-intersection.

11:50
Automated Handwritten Character Recognition on Log Ends

ABSTRACT. This paper proposes a deep learning-based method for automated handwritten character recognition on log ends, addressing challenges posed by complex textures and character variability. The approach integrates an improved YOLOv11 localization module, featuring a novel Partial Wavelet Downsampling (PWD) for better feature preservation and MPDIoU loss for robust bounding box regression, with a recognition module employing image processing for segmentation and MobileNetV2 for classification. Experimental results on a custom dataset demonstrate high localization (98.58%) and recognition (89%) accuracy, validating the effectiveness of the proposed techniques. This work provides a practical solution for automated log end character recognition, contributing to industrial automation and scene text recognition in challenging environments.

11:55
Semantic DETR: Enhancing Defect Detection via Semantically-Guided Feature Refinement

ABSTRACT. While DETR-like detectors have performed successfully in generic object detection tasks, their accuracy significantly declines in real-world scenarios where images are often corrupted by degradation. A natural solution is to enhance degraded images before detection, but the misaligned optimization objectives between image enhancement and object detection tasks often lead to suboptimal results. To address this, we present an end-to-end degradation-oriented object detection framework, named Semantic DETR, which enhances the performance of DETR through a collaborative approach that combines semantic prompt and feature refinement. First, we introduce a realistic degradation simulation pipeline to simulate real-world degradations. Besides, we develop a LoRA-based image enhancement backbone by integrating low-rank adaptation into pre-trained backbone, enabling robust feature extraction from degraded inputs. Furthermore, we propose a semantic prompt generation pipeline that extracts target-aware semantic prompt from degraded images, which is incorporated into the transformer encoder to guide the model to efficiently identify objects. Finally, we introduce a feature refinement-guided collaborative learning strategy, which combines multi-scale feature refinement module with multiple auxiliary detection heads to improve query assignment under degraded conditions. Extensive experiments on both generic and task-specific degraded datasets demonstrate the effectiveness and generalizability of Semantic DETR.

12:00
Improved BEVFormer for Complex Object Detection in Autonomous Driving

ABSTRACT. Achieving accurate and efficient 3D object detection from surround-view cameras remains a key challenge in autonomous driving, especially in complex scenarios with constrained computational resources. This paper revisits BEVFormer and introduces two key enhancements to improve accuracy and inference speed. Firstly, an adaptive BEV grid density mechanism is proposed in which deformable convolution dynamically reallocates spatial resolution. Secondly, the BEV encoder is redesigned with depth-wise separable convolutions to replace the standard Self-Attention and FFN stacks. The experiments on datasets nuScenes, Waymo, and KITTI show that the improved BEVFormer achieves 63.2% mAP and 75.1% NDS on nuScenes, outperforming BEVFormer, CenterPoint, and PV-RCNN, while running at 30 FPS on an RTX 3090. Consistent performance improvements across diverse hardware platforms validate the method’s suitability for resource-constrained deployment.
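The efficiency argument for depth-wise separable convolutions mentioned above can be made concrete by counting weights. The channel width and kernel size below are assumptions for illustration, since the abstract does not give the redesigned encoder's dimensions; biases are omitted.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k):
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    return k * k * c_in + c_in * c_out

# Hypothetical layer: 256 input channels, 256 output channels, 3 x 3 kernel.
std = standard_conv_params(256, 256, 3)        # 589824
dws = depthwise_separable_params(256, 256, 3)  # 67840
print(std, dws, round(std / dws, 2))
```

For this hypothetical layer, the separable form uses about 11.5% of the weights of the standard convolution, matching the general ratio 1/c_out + 1/k².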

12:05
Movie Poster Design Based on Composite Template Learning

ABSTRACT. This study explores the use of composite template learning methods to enhance movie poster design. By analyzing a curated dataset of movie posters, compositional elements such as object layout and scene arrangement were extracted and used to train a template-learning model for generating new posters. The results demonstrate that the proposed approach effectively captures key design features, providing realistic posters with enhanced visual coherence, genre-specific consistency, and scalability. This method offers valuable insights into AI-driven design automation for the film industry.

12:10
An improved G-mean and its application on Class-Specific cost regulation Extreme Learning Machine over imbalanced dataset

ABSTRACT. In recent years, many practical datasets have proven to be imbalanced, which degrades classifier performance. This paper delves into an effective modified model of the Extreme Learning Machine (ELM) tailored for imbalanced datasets. Conventional ELM-based models often exhibit a bias towards the majority class. To address this issue, we introduce an advanced binary classifier, dubbed the Maximized Geometric Mean (G-mean), Class-Specific, Cost-Regulation Extreme Learning Machine (MG-CCR-ELM). Our model integrates class-specific cost parameters, employs an enhanced inertial time-varying weight bat algorithm, and aims to maximize G-mean as the optimization objective. The cost parameters, hidden-layer input weights, and biases are determined by our refined bat algorithm. Furthermore, we propose an improved G-mean metric, accompanied by its definition, properties, and theoretical proofs. To evaluate the model's performance, we benchmark it against ELM-based approaches such as Weighted-ELM and CCR-ELM, as well as existing imbalanced learning methods such as bagging and AdaBoost. Four methods are devised to implement our model in experiments, utilizing 21 public imbalanced datasets from the UCI and KEEL repositories. Experimental outcomes reveal that our proposed model surpasses ELM-based methods in terms of accuracy and G-mean on most public imbalanced datasets. When compared to existing imbalanced learning techniques, our fourth method achieves the best performance on approximately half of the 21 public imbalanced datasets. Overall, the model demonstrates promising classification outcomes for the majority of datasets.
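As background for the optimization objective above, the standard G-mean for binary classification is the geometric mean of sensitivity and specificity. The sketch below implements that standard definition only; the improved G-mean proposed in the paper is not defined in the abstract and is not reproduced here. The example labels are hypothetical.

```python
import math

def g_mean(y_true, y_pred, positive=1):
    """Standard G-mean: sqrt(sensitivity * specificity). Assumes both classes occur in y_true."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)

# Hypothetical imbalanced set: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
always_negative = [0] * 100
print(g_mean(y_true, always_negative))  # 0.0, although plain accuracy is 0.95
```

A classifier that ignores the minority class scores 95% accuracy here but a G-mean of zero, which is why G-mean is a common objective on imbalanced data.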

12:15
SpatioTemporal Dynamic-Aware Mamba for Video Paragraph Captioning

ABSTRACT. Video paragraph captioning aims to generate coherent and contextual descriptions for a video consisting of multiple consecutive frames. Effectively capturing dynamic information is crucial for understanding the entire video. Moreover, video paragraph captioning requires extracting rich spatiotemporal representations, which has mainly been achieved using self-attention mechanisms in transformer-based models. However, self-attention introduces significant computational overhead. To address this issue, we propose a novel Mamba-based SpatioTemporal dynamic-aware (MSTD) model for video paragraph captioning. Our MSTD model features a "spatiotemporal dynamic acquisition" module that gathers continuous perspectives across multiple frames, ensuring sufficient contextual information for generating accurate and logically consistent captions. Additionally, to overcome two limitations of the Mamba model when processing input tokens, namely historical decay and element contradiction, we use masked backward computation and element residual connections. We conducted extensive experiments on several large-scale video paragraph captioning datasets, including ActivityNet, YouCookII, and VideoStory, to validate the effectiveness of our proposed MSTD model. The results demonstrate that MSTD is highly capable of generating precise, thoughtful, and narratively coherent video paragraph captions.

12:20
Feature Fusion Based Automatic Chord Recognition Model: BTC-FDAA-FGF

ABSTRACT. Automatic chord recognition is a significant topic in the field of Music Information Retrieval (MIR). Serving as one of the cornerstone features of music, the chords obtained by chord recognition algorithms are the basis of many high-level semantic tasks. At present, a severe class imbalance problem exists in the domain of automatic chord recognition, where the recognition accuracy of rare chords is much lower than that of common chords, which significantly affects the overall performance of chord recognition algorithms. In this paper, a chord recognition algorithm based on feature fusion is designed. First, in the feature extraction part, the Hybrid Constant-Q Transform (HCQT) is introduced to assist the Constant-Q Transform (CQT) in obtaining richer and finer musical signal features, enabling better tracking of overtones. Next, the extracted chord features are sent to the network, where the frequency-domain adaptive attention (FDAA) mechanism is used to enhance feature saliency, ensuring that the network can adaptively adjust the weights for different frequency components during training, thereby selectively enhancing frequency-domain features that contain important information. The enhanced features are then fed into an aggregation module that integrates a bidirectional self-attention module and a Fourier transform module, enabling more effective capture of fine-grained features, global context information, and periodic structures in chords. The experimental results show that the proposed method outperforms existing mainstream baseline methods by 1.2% to 2.2% on the MIREX metrics, validating the effectiveness of the algorithm.

12:25
Beat tracking algorithm based on multi-scale feature fusion and attention mechanism

ABSTRACT. Automatic beat and downbeat tracking is an important research direction in the field of music information retrieval. This paper proposes a beat tracking algorithm based on multi-scale feature fusion and an attention mechanism for the joint tracking of beat and downbeat. First, we propose a convolutional feature extraction layer based on multi-scale feature fusion, which makes the model attend to different levels of music information and exchange musical instrument information across separated tracks. Then, based on dilated self-attention, we introduce a dilated neighborhood attention module and a global attention module with multi-scale features. The former not only reduces the time complexity but also realizes information exchange across the time and instrument dimensions, improving the accuracy of beat detection. The latter can determine the globally optimal beat sequence while fusing temporal information at different scales, which improves the stability of beat detection. By comprehensively utilizing information from different musical levels and a variety of attention mechanisms, our model can better perceive the global and local characteristics of beats. We performed experimental verification on four widely used datasets: Ballroom, Hainsworth, Harmonic, and Carnatic. The experimental results show that, compared with recent deep learning methods, our proposed model performs better in both beat tracking and downbeat tracking. Compared with the baseline, the F-measures of beat tracking and downbeat tracking on the Ballroom dataset are improved by 1.2% and 2.8%, respectively.

12:30
DualSpinNet: A Crop Yield Prediction Model based on LSTM and GRU

ABSTRACT. Accurate prediction of crop yield is of great significance for agricultural production and food security. In this paper, we introduce DualSpinNet, a novel model that integrates long short-term memory network (LSTM) and gated recurrent unit (GRU) architectures to address this challenge. This model employs a dual-stream approach to extract temporal features of climate and soil data, utilizing parallel LSTM and GRU layers. These features are subsequently refined through an additional GRU layer to enhance time-series dependencies. The dual recurrent structure allows for more precise extraction and processing of multi-level temporal features, thereby improving the accuracy of prediction. The final yield predictions are generated through a fully connected layer. We trained and validated the model using the Kaggle dataset, and compared its performance with other state-of-the-art models. Empirical results demonstrate that our proposed model achieves lower MSE and MAE, making it effective for crop yield prediction. The proposed method offers a highly accurate tool for agricultural producers and decision-makers, contributing to improvements in crop yield and quality, and promoting food security and sustainable agricultural development.

12:35
Underwater image super-resolution via multi-domain learning

ABSTRACT. Underwater images suffer from the haze effect and low contrast due to wavelength- and distance-dependent scattering and attenuation. These issues present significant challenges for various underwater vision tasks. Super-resolution (SR) of underwater images offers an effective solution, enhancing both detail refinement and overall image visibility. However, underwater image SR is challenging due to the serious degradation in image texture and color information. Aiming to enhance the performance of underwater image SR, this paper presents an underwater image SR network via multi-domain learning. Concretely, we first propose a multi-domain encoder network, which incorporates the gray and two color spaces into a unified structure. Such an architecture enables our proposed model to improve underwater image quality through texture improvement and color correction. Coupled with the channel attention mechanism, the most discriminative features extracted from multiple domains are adaptively integrated and highlighted. Consequently, our network effectively boosts image resolution and improves the visual quality of underwater images by leveraging multi-domain data and the strengths of learning-based approaches. Comprehensive experiments confirm the superior performance of our proposed model in underwater image SR.

12:40
A Benchmark for Document Understanding of LLMs in the Field of Electric Power

ABSTRACT. The construction of a smart grid heavily relies on the image understanding capabilities of large language models. Existing evaluation methods often use general-purpose benchmarks, making it difficult to assess performance in specialized fields like the power industry. To address this challenge, we developed a dataset specifically designed to evaluate the Q&A capabilities of large models within the power domain. This dataset consists of 1,995 evaluation questions based on electricity-related images sourced from utility websites, academic papers, and open-source image libraries. It covers a wide range of power system aspects, such as policy, smart grid construction, and power finance, and includes diverse visual styles and content. Various question types were designed, including text extraction, counting, date recognition, tabular data extraction, and cross-paragraph comprehension, to thoroughly assess model performance in Q&A tasks. Experimental evaluations were conducted on general-purpose large models like Claude-3 and GPT-4, as well as multimodal models like Qwen2-VL. Results indicate that while these models perform well, there is still room for improvement, especially with complex tasks. This evaluation framework not only enhances understanding of model capabilities in real-world applications but also provides a valuable reference for intelligent power system management. Moreover, it offers new insights into the evaluation of large models in other industry-specific applications.

12:45
Prior-Guided Attention Network for Underwater Enhancement

ABSTRACT. Underwater images often suffer from color deviation, low contrast, blurring, and other degradation issues due to the attenuation characteristics of water and the presence of particles in the aquatic environment. In this paper, we propose an underwater image enhancement method based on the transformer architecture and underwater prior knowledge to achieve visually improved results. Specifically, we introduce a prior-guided attention network (PGBNet), which comprises a prior-guided block (PGB) and a multi-feature attention block (MFAB). On the one hand, considering the varying degrees of color degradation, we employ the PGB to direct the network in capturing features that are significantly degraded. On the other hand, the multi-feature attention block is incorporated to explore rich-feature information at multiple scales in the underwater image. Experimental results demonstrate that our method effectively corrects color biases and removes haze across diverse underwater datasets.

12:50
Small Object Detection Using DM-YOLOv10 in Complex Scenes

ABSTRACT. Small object detection is challenging, especially in complex traffic scenarios with varied backgrounds and multi-scale issues. This paper proposes an enhanced DM-YOLOv10n model, integrating Deformable Convolutional Network v3 (DCNv3) and Multi-dimensional Channel Attention (MCA) to improve detection accuracy. Key improvements include embedding DCNv3 into C2f modules of the backbone network to enhance small object feature extraction, adding a P2 detection head alongside P3, P4, and P5 to capture fine details from high-resolution feature maps, and incorporating MCA to better focus on important features in complex backgrounds. Experimental results on a self-constructed traffic dataset show a 7.0% improvement in mean Average Precision (mAP), reaching 90.1%, demonstrating the model's enhanced performance in detecting small objects.

12:55
The capacity prediction of lithium-ion batteries based on an innovate mode decomposition technique

ABSTRACT. Predicting the remaining useful life (RUL) of lithium-ion batteries (LIBs) accurately is crucial for ensuring the reliable operation and timely servicing of battery systems. However, the occurrence of capacity recovery during the degradation process poses a significant challenge to the precision of capacity forecasting. To boost forecasting accuracy, we introduce the Time-Varying Filter-based Empirical Mode Decomposition (TVF-EMD) to break down the initial capacity data into multiple sub-series. Additionally, the Box-counting Dimension (BCD) is used to assess the complexity of these components, categorizing them into high-complexity and low-complexity sub-layers. Moreover, an Elman Neural Network (Elman) and a Bidirectional Long Short-Term Memory Network (BiLSTM) are used to forecast the sub-components' values. The final estimated capacity sequences are then derived by merging the forecasted sub-components. Finally, the efficacy of the proposed predictive framework is confirmed using two distinct battery datasets. The experimental results show that the maximum values of root mean squared error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) of our proposed hybrid model are merely 4.033, 3.837, and 5.236%, respectively. These figures confirm that our proposed architecture outperforms other comparative models and holds great potential for application in the field of RUL forecasting.
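
The three error measures reported above follow standard definitions, sketched below for reference. This is a generic illustration, not the authors' evaluation code; the function names and the toy capacity values are assumptions chosen only to make the arithmetic easy to follow.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical capacity readings and forecasts (arbitrary units).
actual = [2.0, 4.0]
predicted = [1.0, 5.0]
print(rmse(actual, predicted), mae(actual, predicted), mape(actual, predicted))
```

RMSE and MAE carry the units of the capacity signal, whereas MAPE is scale-free, which is why the abstract reports only the last figure as a percentage.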

13:00
Improved CycleGAN for Mine Low-Light Image Enhancement

ABSTRACT. To address the challenges of collecting paired datasets in coal mine underground environments and the issues of low brightness in collected images that hinder subsequent recognition tasks, an improved CycleGAN-based low-light image enhancement method is proposed. To solve the difficulty of paired dataset collection, CycleGAN is chosen for unsupervised learning. To enhance the feature extraction capability of the generator, Efficient Channel Attention (ECA) is introduced. A dilated residual convolutional brightness enhancement module is added to the generator to improve the brightness of underground images. Finally, to avoid over-enhancement and distortion, InstanceNorm (IN) in CycleGAN is replaced with Adaptive Layer-Instance Normalization (AdaLIN). Experimental results show that the improved method outperforms the original CycleGAN, with average increases of 9.00% in PSNR, 5.55% in SSIM, 1.70% in EN, and 7.61% in VIF, effectively enhancing the brightness and clarity of contours in coal mine underground images.

13:05
Financial Sentiment Analysis for Pre-trained Language Models Incorporating Dictionary Knowledge and Neutral Features

ABSTRACT. With the increasing complexity of financial markets, accurate sentiment analysis of financial texts has become increasingly important. However, there are obvious shortcomings in traditional methods: firstly, they tend to misjudge the sentiment tendency when dealing with professional terms in the financial field, and fail to accurately understand the actual meanings of these terms in different market environments; and secondly, they have a high error rate in identifying expressions that are superficially neutral but contain market signals, and it is difficult to capture the implied market orientations in them. Such limitations seriously affect the reliability of financial text sentiment analysis. This study proposes a financial sentiment analysis model, EnhancedFinSentiBERT, which incorporates financial domain pre-training, dictionary knowledge embedding and neutral feature extraction to improve the accuracy of financial text sentiment analysis. In order to comprehensively evaluate the model performance, this study conducts experimental validation on the widely used FinancialPhraseBank and FiQA datasets, and the results show that the EnhancedFinSentiBERT model achieves a better performance enhancement compared to the current mainstream methods on these two datasets. In particular, the model shows better recognition ability in identifying neutral sentiment, which reduces the misclassification rate to a certain extent and helps to improve the accuracy of sentiment analysis in the financial domain. Through comparative experiments and ablation analysis, it is observed that each component of the model has a positive impact on the performance, with dictionary knowledge embedding and neutral feature extraction contributing more significantly to the model performance improvement.

13:10
Spatial and Spectrum-Based Dual-Branch Domain Generalization for Face Anti-Spoofing

ABSTRACT. Previous face anti-spoofing (FAS) works have demonstrated limited generalization performance in practical applications due to the unpredictability of unseen domains. Given the significant diversity of environmental conditions, attack materials, and camera devices, it is challenging to find domain-invariant features even with domain generalization or domain adaptation. Considering that phase information from the Fourier transform can preserve edge information and contains high-level semantic information of images, it can be used as a domain-invariant feature for the FAS task. Based on this assumption, this paper presents a novel dual-branch domain generalization (DBDG) framework to improve the generalization ability of face anti-spoofing. One branch is responsible for extracting the phase information from the spectrum domain and redistributing the weights of domain-invariant features by a frequency filter module. The other branch is designed to directly extract the intrinsic liveness cues from the spatial domain of the original RGB images. To bridge the representation bias between the spatial and spectrum domains, an approximation loss is introduced as auxiliary supervision by minimizing the distance between the features of the two branches. A feature fusion strategy is then used for binary classification in the spatial branch. Finally, an adversarial learning approach is applied to live cues to enhance domain generalization for diverse live faces, aiming to establish a compact live distribution. Extensive experiments show that our DBDG approach is effective and outperforms the state-of-the-art methods on four public databases.

13:15
Fault Prediction of Electro-Mechanical Actuators Based on Time Series Analysis

ABSTRACT. Electro-Mechanical Actuators (EMAs) are high-precision servo control components on aircraft and are critical actuators in flight control systems, which makes fault prediction for EMAs highly important. Existing approaches typically address this task through nonlinear modeling and time series analysis. However, when dealing with highly complex nonlinearities and long-term trends, these methods face issues such as insufficient accuracy, model overfitting, and strong dependence on data. To address these limitations, we propose a fault trend prediction method for EMAs based on association analysis and time series prediction. First, dynamic time warping is applied to extract health-related features. We then use fuzzy clustering to classify the data into three stages: healthy, degenerative, and near-failure. The distance from the healthy stage center is used as a key indicator to quantify the actuator’s health status. Based on this health indicator, we develop an improved Transformer model to predict the fault trend of EMAs. Experiments with NASA's FLEA dataset show that the proposed method has significant advantages over existing approaches in terms of mean absolute error, mean squared error, and root mean squared error.

13:20
Dynamic Drift Compensation Federated Semi-Supervised Learning for Lung Nodule Segmentation

ABSTRACT. Accurate lung nodule segmentation enables quantitative characterizations for monitoring the evolution of lung cancer. However, the high cost and time required for annotating medical images, along with increasingly stringent data privacy concerns, have limited the application of traditional supervised learning methods in medical image segmentation tasks. In this paper, we focus on a practically significant yet challenging federated semi-supervised segmentation problem, where some clients possess fully annotated data while others merely have unlabeled data. The lack of labeled information may cause model updates on unlabeled clients to deviate from the global optimal path, leading to client drift and making collaborative learning across clients more challenging. To address this issue, we propose a novel Dynamic Drift Compensation Federated (DDCS-Fed) Semi-Supervised Learning framework. To mitigate client drift, we design a Dynamic Drift Compensation strategy that dynamically adjusts the loss by quantifying the deviation between the global and local models, allowing the model to better adapt to each client's data characteristics and achieve a smoother optimal solution for each local model. Additionally, to more accurately assess the contributions of unlabeled clients in federated learning, we introduce an adaptive uncertainty-based dynamic aggregation method, which uses the model uncertainty and data quality of each client as the basis for aggregation weights. Extensive experiments on public datasets LIDC-IDRI, LNDb, MSD, and two in-house datasets demonstrate that our method achieves the state-of-the-art performance.

13:25
Auto-weighted Multi-view Efficient Spectral Clustering via Anchor-based Bipartite Graph

ABSTRACT. Spectral clustering is often hindered by its extremely high computational complexity and limited scalability for large-scale problems, particularly when applied to multi-view clustering scenarios. Recently, various efficient spectral clustering methods for large-scale multi-view data have emerged to address this limitation. However, these methods primarily concentrate on enhancing computational efficiency, while neglecting the adaptive integration of multi-view information. Additionally, they disregard the desirable low-rank nature of the learned similarity matrix, which represents the true underlying clustering structure in multi-view data, thereby leading to a degradation of clustering performance. To address these problems, we propose an Auto-weighted Multi-view Efficient Spectral Clustering method via anchor-based bipartite graph (AMESC). The AMESC framework leverages anchor-based bipartite graphs and introduces a novel essential unified bipartite graph learning strategy, which adaptively assigns weights to individual views during the fusion of multi-view graphs. Furthermore, we design a regularizer function for the learned bipartite graph by using a non-convex reformulation of the nuclear norm, which induces the associated similarity matrix to exhibit low-rank properties within a linear time cost. Consequently, the unified bipartite graph that represents the essential clustering information in multi-view data can be learned in an auto-weighted and efficient manner, enabling direct acquisition of clustering results through a landmark-based spectral clustering algorithm. Extensive experiments on a series of multi-view large-scale datasets validate the competitive performance of AMESC.

13:30
UAV Maritime Rescue Object Detection Based on YOLO11

ABSTRACT. Target detection based on deep learning is a critical visual detection method for marine rescue. In complex marine environments, existing methods frequently fall short of performance expectations and face the following challenges. 1) Feature extraction capability is insufficient for small and medium-sized targets in complex marine environments. 2) The computational efficiency needs to be optimized for UAVs (Unmanned Aerial Vehicles). 3) Multi-scale features cannot be robustly integrated. 4) The unbalanced regression loss can affect the accuracy of object detection. To tackle these challenges, we introduce an improved model that builds upon YOLO11. Specifically, we integrate the SCConv (Spatial and Channel Reconstruction Convolution) module, which optimizes feature extraction capabilities within the C3k2 framework. Furthermore, to augment the model's sensitivity and selectivity towards crucial features, we incorporate the SE (Squeeze-and-Excitation) attention mechanism. At the same time, SimSPPF (Simplified Spatial Pyramid Pooling Fast) is introduced, which can improve computational efficiency while ensuring accuracy. In addition, the adoption of AFPN (Asymptotic Feature Pyramid Network) promotes robust and effective multi-scale feature extraction, improving YOLO11’s capacity to detect objects at various scales. Finally, we utilize WIoUv3 (Wise IoUv3) to achieve a balance between regression penalties, thereby further enhancing the model's overall performance. The results indicate that, in comparison to the baseline YOLO11 model, the enhanced version achieves a 0.75% improvement in accuracy, a 5.63% increase in mAP50, and a 1.76% rise in mAP50-95. Additionally, the model reduces its parameter count by 16.12%, thereby attaining a balance between detection precision and computational efficiency.

13:35
Uncertainty-aware Semi-supervised Human Pose Estimation

ABSTRACT. The performance of semi-supervised human pose estimation models relies highly on the quality of pseudo-labels. To refine the quality of pseudo-labels, earlier attempts are mainly ensembling-based, resulting in a considerably heavier training workload. In this work, we present a simple approach to estimate the uncertainty of predicted pseudo-labels in a disentangled manner, with a unique integration of the estimated uncertainty into the training scheme to improve pseudo-label quality. We introduce a pretext task for heatmap regression to model encoder uncertainty and incorporate heteroscedasticity into unsupervised learning as decoder uncertainty. We propose a novel formulation for pseudo-label fusion with estimated uncertainties as guidance. Experiments show that our method alone achieves comparable performance without data or model ensembling. Meanwhile, our uncertainty estimation technique can further improve model performance when combined with these ensembling-based methods. We also visualize the estimated uncertainty, which further demonstrates the effectiveness of our method.

13:40
Masked Denoising Diffusion for Accurate Anomaly Recognition and Reconstruction

ABSTRACT. The scarcity of defect samples and the emergence of unknown defect types pose significant challenges for data-driven industrial anomaly detection models. Unsupervised learning models extract features from normal samples and identify abnormal samples based on differences in feature distributions, thus avoiding reliance on labeled abnormal samples. Currently, generative unsupervised models demonstrate powerful performance in industrial anomaly detection. The Denoising Diffusion Anomaly Detection (DDAD) model relies on the Stable Diffusion framework, which uses normal samples to train the model and achieves responsiveness to the locations of unknown defect samples. We notice that the DDAD model, trained solely on defect-free samples, encounters difficulties in accurately reconstructing defect areas. In this paper, a Masked DDAD (mDDAD) model is proposed, which employs the Masked Image Modeling (MIM) strategy to enhance sensitivity to critical features by masking parts of images and training the model to reconstruct them. Additionally, the Dynamic Noise Adjustment Mechanism (DNAM) is introduced to apply lower noise intensity to masked areas during denoising, thereby preserving crucial information and improving reconstruction accuracy. Moreover, the loss function is optimized through Dynamic Weighted Loss Optimization (DWLO), which incorporates mask weights and dynamic noise adjustment. Experiments on the MVTec-AD and VisA datasets demonstrate that the mDDAD model significantly outperforms existing methods in both anomaly detection and reconstruction. Code is available at https://github.com/zhg-SZPT/mDDAP.

13:45
CBLoRAS: A hybrid re-sampling strategy for imbalanced learning

ABSTRACT. Imbalanced learning refers to machine learning conducted on imbalanced datasets. Imbalanced classification is an important task and a significant challenge. Existing strategies alleviate the imbalance issue by synthesizing new samples; a representative one is LoRAS, which selects shadow points for affine combination. Although LoRAS enhances the complexity of synthetic samples, it pays little attention to boundary samples during synthesis and does not effectively prevent noise. This paper introduces a new oversampling technique called Clear Boundary LoRAS (CBLoRAS). CBLoRAS enhances the representational ability of generated minority-class samples by combining boundary sample selection with RkNN denoising. CBLoRAS first identifies boundary samples, then applies the LoRAS algorithm to generate new synthetic samples among the boundary samples, and finally removes noise through RkNN to improve sample quality. Experiments on 12 class-imbalanced datasets show that CBLoRAS outperforms various popular methods in terms of the F1 score; ablation studies show that CBLoRAS effectively focuses on representative samples and removes noisy instances, demonstrating its superiority in dealing with imbalanced datasets.
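
The "shadow points for affine combination" idea can be illustrated with the rough sketch below. It assumes the commonly described LoRAS recipe (perturb minority points with small Gaussian noise to obtain shadow points, then take a random convex combination of a few of them); the function name, parameters, and noise model are assumptions for illustration, and the boundary sample selection and RkNN denoising steps of CBLoRAS are not shown.

```python
import random

def loras_like_sample(neighbourhood, n_shadow, n_affine, sigma, rng):
    """Generate one synthetic point from a minority-class neighbourhood.

    Hypothetical sketch: shadow points are noisy copies of the neighbours,
    and the output is a random convex combination of n_affine shadow points.
    """
    # Step 1: create shadow points by adding Gaussian noise to each neighbour.
    shadows = []
    for point in neighbourhood:
        for _ in range(n_shadow):
            shadows.append([value + rng.gauss(0.0, sigma) for value in point])
    # Step 2: pick a subset of shadow points and random convex weights.
    chosen = rng.sample(shadows, n_affine)
    raw = [rng.random() + 1e-12 for _ in chosen]  # small offset avoids a zero sum
    total = sum(raw)
    weights = [r / total for r in raw]
    # Step 3: the synthetic sample is the weighted sum of the chosen shadows.
    dim = len(neighbourhood[0])
    return [sum(w * s[d] for w, s in zip(weights, chosen)) for d in range(dim)]

rng = random.Random(0)
minority_neighbours = [[1.0, 2.0], [1.5, 2.5], [2.0, 2.0]]
print(loras_like_sample(minority_neighbours, n_shadow=4, n_affine=3, sigma=0.05, rng=rng))
```

Because the weights are non-negative and sum to one, each synthetic point stays inside the convex hull of the chosen shadow points, which keeps it close to the local minority-class region.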

13:50
Improved Multi-Objective Marine Predators Algorithm for Image Denoising Network Architecture Search

ABSTRACT. Convolutional neural networks (CNNs) have proven to be highly effective in image denoising. However, optimizing their hyperparameters and structure is a complex, expensive multi-objective optimization problem. Traditional algorithms for solving such a problem suffer from high computational costs and search stagnation. To tackle these challenges, we present a multi-objective algorithm, called QIMOMPA, that combines quasi-opposition-based learning (QOBL) with an improved marine predators algorithm. The QOBL method is introduced to initialize the population, improving initial solution quality while ensuring initial population diversity. To overcome the slow convergence and search stagnation of the original marine predators algorithm (MPA), we propose a method to adaptively update the population by adopting different MPA stages for varying levels of sub-populations. In this method, different sub-populations follow different updating strategies according to their ranking levels, effectively balancing the convergence and diversity of the population. Meanwhile, the random walk stage of the MPA is dynamically introduced to prevent search stagnation. Subsequently, a surrogate model based on Gaussian process regression, combining a radial basis function kernel with a white noise kernel, is proposed to reduce computational resources and time consumption. Experimental results indicate that QIMOMPA surpasses other MOMPA variants in convergence accuracy, stability, and quality of the Pareto frontier. The proposed CNN optimized by QIMOMPA improves the peak signal-to-noise ratio and the structural similarity index by approximately 1.54% and 1.01% on dataset MRI, respectively, and by approximately 0.79% and 1.79% on dataset CheXNet, respectively, compared to the best competing algorithm. Additionally, the surrogate model substantially decreases the consumption of computational resources without compromising the algorithm’s search accuracy, effectively accelerating the search.
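
As a rough illustration of the QOBL initialization step mentioned above, the sketch below uses the usual textbook definition: for a value x in [low, high], the opposite point is low + high - x, and the quasi-opposite point is drawn uniformly between the interval center and that opposite point. The function names and the decision to return both candidate sets are assumptions; the selection of the final population (for example by non-dominated ranking) is omitted, and the authors' exact procedure may differ.

```python
import random

def quasi_opposite(x, low, high, rng):
    """Quasi-opposite value of x within [low, high]."""
    center = (low + high) / 2.0
    opposite = low + high - x
    lo, hi = sorted((center, opposite))
    return rng.uniform(lo, hi)

def qobl_candidates(pop_size, bounds, rng):
    """Return 2 * pop_size candidates: random individuals and their quasi-opposites.

    A full algorithm would evaluate all candidates and keep the best pop_size;
    that selection step is intentionally left out of this sketch.
    """
    candidates = []
    for _ in range(pop_size):
        individual = [rng.uniform(low, high) for low, high in bounds]
        mirrored = [quasi_opposite(x, low, high, rng)
                    for x, (low, high) in zip(individual, bounds)]
        candidates.append(individual)
        candidates.append(mirrored)
    return candidates

rng = random.Random(42)
# Hypothetical search space: e.g. a learning rate and a number of filters.
bounds = [(0.0001, 0.1), (8.0, 128.0)]
for candidate in qobl_candidates(3, bounds, rng):
    print(candidate)
```

Pairing each random individual with a quasi-opposite counterpart spreads the initial candidates over both halves of each variable's range, which is the diversity benefit the abstract attributes to QOBL.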

13:55
RMPFuse: Infrared-Visible Fusion via Residual Networks and MultiPath Convolutional Modules

ABSTRACT. In the field of image processing, the fusion of infrared and visible images has become increasingly significant, which aims to integrate the advantages of both modalities to enhance visual quality and information richness. Traditional image fusion methods typically use concatenated infrared and visible images as input. However, these approaches often fail to separately extract and preserve the intrinsic information of each modality, limiting feature learning capacity. To overcome these limitations, residual networks and multipath convolutional modules are proposed for infrared and visible image fusion in this paper. The method consists of four modules: the residual network module (RN), parallel multipath convolutional module (PMC), channel-spatial attention module (CSA), and decoder module. First, RN extracts joint features from concatenated infrared and visible images, leveraging a residual structure to deepen the network and mitigate gradient vanishing issues. Second, PMC processes infrared images independently, utilizing a parallel multipath convolution mechanism to synthesize multi-scale feature information and achieve a richer representation. Third, CSA enhances feature extraction from visible images by integrating channel and spatial attention mechanisms for multi-dimensional feature representation. Finally, the features extracted by these three modules are fed into the decoder module to generate the fused image. Additionally, a two-stage fusion strategy with different loss functions is adopted. The proposed staged optimization strategy ensures that the model meets the different training requirements of each stage. Extensive experiments on public datasets reveal that our RMPFuse outperforms other representative state-of-the-art approaches in both qualitative and quantitative assessments. Meanwhile, extended experiments demonstrate its strong generalization ability by applying the model to other image fusion tasks.

14:00
AeroYOLOv9 for Airport Surface Object Detection via the ASMOD Dataset

ABSTRACT. This study introduces AeroYOLOv9, a novel object detection algorithm specifically designed for airport environments, along with the ASMOD dataset, a curated collection of annotated images for evaluating object detection methods. The ASMOD dataset comprises three primary object categories: planes, cars, and persons. Each object is labeled with a bounding box to facilitate precise evaluation. AeroYOLOv9 integrates a Swin Transformer backbone, which enhances the detection of small objects. Additionally, it incorporates an Adaptive Receptive Field mechanism to enable dynamic scale adaptation and applies Optimal Transport Assignment to mitigate class imbalance. Experimental results demonstrate that AeroYOLOv9 surpasses existing state-of-the-art algorithms, including YOLOv6, YOLOv8, and YOLOv9, in terms of mean Average Precision at 0.5 IoU. The ablation study highlights the complementary contributions of each module, with the highest performance achieved when all three components are combined. The ASMOD dataset and AeroYOLOv9 together establish a new benchmark for object detection in airport environments, providing valuable insights for real-world applications in safety, surveillance, and automation.

14:05
DMFSO-YOLO: A Dynamic Multi-scale Fusion and Speed-Optimized Network for Steel Surface Defect Detection

ABSTRACT. Steel surface defects seriously reduce the durability and usability of steel products, posing great challenges to industrial production. Current defect detection methods show limitations in handling multiscale feature variations and precisely localizing small defects, primarily because the defects closely resemble the background, which complicates accurate identification. This research introduces an innovative method called DMFSO-YOLO, a Dynamic Multi-scale Fusion and Speed-Optimized Network tailored for identifying defects on steel surfaces. Firstly, the network's backbone incorporates a speed-optimized precision module (SOPM) to enhance computation efficiency, lower memory consumption, and decrease the chances of overfitting. Secondly, we create a dynamic multi-scale feature fusion module (DMFF) within the neck of YOLOv8 to enhance feature extraction and integration across various dimensions and layers. Finally, the adoption of a normalized Gaussian Wasserstein distance (NWD) loss function offers stable gradient feedback by accurately measuring the difference between inferred and actual bounding boxes. Experiments on the NEU-DET dataset demonstrate that DMFSO-YOLO attains a 79.3% mAP and 245.19 FPS, indicating its potential as a robust and efficient solution for real-time defect identification in industrial applications.
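
As a rough illustration of the NWD loss mentioned above, the sketch below follows the commonly published formulation: each box (cx, cy, w, h) is modeled as a 2D Gaussian, and the 2-Wasserstein distance between the two Gaussians is mapped into (0, 1] with an exponential. The function names and the constant `c` are assumptions made for this example; the abstract does not state the exact variant or constant used in DMFSO-YOLO.

```python
import math

def nwd(box_a, box_b, c=12.8):
    """Normalized Gaussian Wasserstein distance between two boxes.

    Boxes are (cx, cy, w, h). `c` is a dataset-dependent scale constant;
    the default here is only a placeholder, not a value from the abstract.
    """
    cxa, cya, wa, ha = box_a
    cxb, cyb, wb, hb = box_b
    # Squared 2-Wasserstein distance between the Gaussians
    # N([cx, cy], diag(w^2 / 4, h^2 / 4)) fitted to each box.
    w2_squared = (
        (cxa - cxb) ** 2
        + (cya - cyb) ** 2
        + ((wa - wb) / 2) ** 2
        + ((ha - hb) / 2) ** 2
    )
    return math.exp(-math.sqrt(w2_squared) / c)

def nwd_loss(pred_box, target_box, c=12.8):
    """Loss is 0 for identical boxes and approaches 1 as boxes diverge."""
    return 1.0 - nwd(pred_box, target_box, c)
```

Unlike IoU, this similarity stays non-zero and smooth even when two small boxes do not overlap, which is consistent with the "stable gradient feedback" the abstract describes.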

14:10
DESC-Net: Weakly Supervised Nuclei Detection and Segmentation with Partial Point Labels

ABSTRACT. Nuclei segmentation is a crucial step in digital pathology image analysis and serves as the foundation for subsequent research. The advancement of automated nuclei segmentation techniques facilitates the quantitative analysis of morphometric features of nuclei within histopathological images. However, manually annotating tens of thousands of nuclei demands substantial human effort and domain-specific expertise in pathology. In this paper, we propose a weakly supervised nuclei detection and segmentation method, termed DESC-Net, which relies exclusively on partial point labels. Specifically, we design a two-stage dual-branch fusion colorization self-supervised framework that uses the hematoxylin component of the original image as input to the segmentation network, thereby effectively enhancing the contrast between the nuclei and the background. Further, a colorization self-supervision strategy is introduced to solve the problem of image color information loss, and an edge information mining step is introduced to provide more accurate edge information for nuclei. The final segmentation probability map is generated by fusing the outputs of two networks running at different levels. Moreover, we consider using only partial point annotations for weakly supervised nuclei tasks. We propose a novel detection network named FB-net, which integrates a Feature Denoising (FD) strategy alongside a Background Weakening (BW) module and significantly enhances nuclei detection performance while providing more comprehensive annotations for subsequent segmentation tasks. We conducted systematic evaluations of our proposed methods using two public datasets: MoNuSeg and CPM. Experimental results demonstrate that our approach exhibits considerable advantages in terms of accuracy, effectiveness, and robustness, outperforming the latest state-of-the-art approaches.

14:15
A Transformer-based Multi-label Defect Image Classification Algorithm with Label Correlation Fusion

ABSTRACT. The structural quality and performance safety of urban infrastructure, such as sewer pipes and bridges, are of paramount importance. However, due to the complex structure and unique locations of these facilities, traditional manual inspection methods are not only inefficient but also pose significant safety risks. Therefore, developing automated defect detection and classification technologies is of great significance for improving the efficiency and safety of infrastructure maintenance. This paper proposes a multi-label defect image classification algorithm based on Transformer networks and label correlation fusion (LabelMDIC), aiming to address the limitations of existing methods in utilizing label co-occurrence relationships and fusing visual features across different scales in building facility defect detection. LabelMDIC employs the Swin Transformer to extract multi-scale defect features and leverages the self-attention mechanism to model the contextual semantics and co-occurrence relationships of defect labels, thereby supervising the visual feature reasoning process of defect images. Additionally, the algorithm integrates the co-occurrence relationship features of defect labels at different stages of the network to enhance the expression of information during the feature extraction process and introduces an Asymmetric Loss function to improve the model's ability to learn positive labels while reducing the impact of negative labels. Comparative experimental results on multiple datasets demonstrate that the LabelMDIC model not only achieves excellent classification accuracy but also shows significant advantages in terms of model complexity and inference speed, providing an efficient and practical solution for multi-label defect image classification tasks in building facilities.
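
For readers unfamiliar with the Asymmetric Loss mentioned above, the following is a minimal sketch of its usual multi-label form: positives use a focal-style term with a small focusing exponent, while negatives use a larger exponent plus a probability margin so that easy negatives contribute nothing. The default values of `gamma_pos`, `gamma_neg`, and `margin` are illustrative assumptions, not settings reported for LabelMDIC.

```python
import math

def asymmetric_loss(probs, targets, gamma_pos=0.0, gamma_neg=4.0,
                    margin=0.05, eps=1e-8):
    """Sum of per-label asymmetric losses for one sample.

    probs   -- predicted probabilities in [0, 1], one per label
    targets -- 0/1 ground-truth labels, same length as probs
    """
    total = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            # Positive label: focal-style weighting with a small exponent.
            total += -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
        else:
            # Negative label: shift the probability down by the margin so
            # very easy negatives (p <= margin) are discarded entirely.
            p_m = max(p - margin, 0.0)
            total += -(p_m ** gamma_neg) * math.log(max(1.0 - p_m, eps))
    return total
```

With these settings a negative label predicted at 0.03 contributes zero loss, whereas a missed positive is penalized at full cross-entropy strength, which is the asymmetry the abstract relies on.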

14:20
Second-order LSTM networks for time series forecasting

ABSTRACT. Long Short-Term Memory (LSTM) networks have shown excellent performance in time series prediction tasks. However, learning long-term dependencies with LSTM is challenging due to its limited internal memory. In this paper, we propose a new hidden-memory-enhanced model called Second-Order LSTM (SndLSTM). This architecture has a distinct advantage over the existing LSTM: its output is computed from the outputs of the preceding two hidden layers, rather than only one as in the ordinary LSTM. On periodic sequences, non-periodic sequences, wind energy data, and electric load data, the experimental results demonstrate that the proposed SndLSTM is promising for practical applications, and that SndLSTM outperforms the ordinary LSTM by 35.89% and 5.99% on average in terms of RMSE and R², respectively.
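
The two metrics quoted above can be computed as in the sketch below, which uses the standard definitions of root-mean-square error and the coefficient of determination. The function names and sample values are illustrative only and are not taken from the paper's experiments.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up forecast used only to exercise the two functions.
actual = [1.0, 2.0, 3.0, 4.0]
forecast = [1.0, 2.0, 3.0, 5.0]
print(rmse(actual, forecast), r2_score(actual, forecast))
```

Lower RMSE and higher R² both indicate a better forecast, so the two reported improvements point in the same direction.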

14:25
Deep Learning-Based Object Grasping Detection for Industrial Robots

ABSTRACT. With the rapid development of machine vision technology, the application level of robots in industrial automation is constantly improving, and the scope of industrial applications is also expanding. To meet the demanding efficiency requirements of industrial assembly lines, industrial robots need to achieve high-precision and rapid target recognition. Deep learning algorithms have attracted much attention due to their speed and accuracy, and are therefore widely used in the field of robot visual recognition. This article adopts an industrial-grade robot target recognition algorithm based on the YOLOv5 network model as the visual foundation. The research focuses on the following aspects: first, SRCNN is used for image super-resolution enhancement to improve image quality; second, by pre-processing the target image and using the trained YOLOv5 network model, efficient recognition and classification of industrial robot targets is achieved; finally, a dataset for industrial parts is created, providing the necessary data support for training the algorithm models. This study has important theoretical and practical significance for improving the performance of industrial robots in target recognition and classification tasks.

14:30
Deep Learning-Based 3D Point Cloud Instance Segmentation: A Survey

ABSTRACT. With the rapid development of 3D acquisition technology, various 3D sensors such as LiDAR and RGB-D cameras have become increasingly widespread. These sensors provide rich geometric information, with point clouds serving as one of the most efficient representations for 3D data. 3D point cloud instance segmentation has gained significant research attention in recent years, as it plays a crucial role in understanding and modeling three-dimensional environments by distinguishing different objects in space. This paper presents a comprehensive survey of deep learning-based 3D point cloud instance segmentation methods, which can be broadly categorized into two-stage and single-stage approaches. Two-stage methods first generate object proposals and then refine them for instance segmentation, achieving high accuracy at the cost of increased computational complexity. In contrast, single-stage methods eliminate the proposal generation step, significantly reducing computational overhead while maintaining competitive performance. Additionally, we introduce widely used benchmark datasets, including S3DIS and ScanNetV2, which provide large-scale annotated point cloud data for training and evaluation. We also discuss common evaluation metrics, such as Mean Intersection over Union (MIoU), Mean Class Coverage (mCov), and Mean Class Weighted Coverage (WCov), which are essential for assessing segmentation performance.
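
To make the coverage metrics above concrete, here is a minimal sketch under one common definition: for each ground-truth instance, take the best IoU against any predicted instance, then average these values either uniformly (coverage) or weighted by ground-truth instance size (weighted coverage). Instances are represented as sets of point indices; that representation and the function names are assumptions for this example, and the survey may define the per-class averaging differently.

```python
def iou(a, b):
    """Intersection over union of two sets of point indices."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def coverage(gt_instances, pred_instances):
    """Return (coverage, weighted coverage) for one scene.

    Each ground-truth instance is matched to the prediction with the
    highest IoU; unmatched instances score 0.
    """
    best = [
        max((iou(g, p) for p in pred_instances), default=0.0)
        for g in gt_instances
    ]
    cov = sum(best) / len(gt_instances)
    total_points = sum(len(g) for g in gt_instances)
    wcov = sum(len(g) / total_points * b for g, b in zip(gt_instances, best))
    return cov, wcov

# Toy scene: one instance is split by the prediction, one is recovered exactly.
gt = [{0, 1, 2, 3}, {4, 5}]
pred = [{0, 1, 2}, {4, 5}, {3}]
print(coverage(gt, pred))
```

The weighted variant gives large instances more influence, so over-segmenting a big object lowers it more than the unweighted score.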

14:35
A 3D Point Cloud Instance Segmentation Algorithm Based on Sparse Convolution and Proposal Generation

ABSTRACT. With the widespread use of 3D point cloud data in scene understanding, efficiently extracting features from point clouds and performing instance segmentation has become a key research focus. Due to the sparsity and disorder of 3D point clouds, traditional 2D convolution operations face many challenges when applied to point cloud processing, especially with voxelization, which can cause loss of point information and local texture features. To address these issues, this paper proposes a 3D point cloud instance segmentation algorithm based on sparse convolution and proposal generation, aiming to improve the accuracy of point cloud feature extraction and instance segmentation. The algorithm consists of two stages: In the first stage, sparse convolution is used to design a sparse convolution-based U-Net backbone network to extract deep features from the point cloud and generate instance proposals. In the second stage, the generated instance proposals are refined. A smaller U-Net network is used to further extract features and predict the category, instance mask, and mask score for each instance. By correcting classification errors and suppressing background, the model's accuracy and robustness are improved. Experimental results show that this method effectively enhances instance segmentation performance in complex scenes, with promising application prospects.

14:40
Methods of Raw Material Ordering and Transportation Schemes Based on Data Analysis

ABSTRACT. This work investigates the raw material ordering and transportation issues in production enterprises. By conducting data analysis and establishing models, it proposes optimal ordering and transportation schemes. For supplier ranking, the TOPSIS model is employed to sort suppliers based on their performance. Using selection matrices and 0-1 programming models, combined with considerations of supply errors and transportation losses, the best suppliers and the most economical ordering and transportation plans are determined. Further, to meet higher requirements for supply schemes, models are established to solve the cost-minimizing ordering and transportation plans. Finally, by predicting future supply volumes using LSTM, a linear programming model is established to optimize the ordering and transportation plans under maximum production capacity.
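
The TOPSIS ranking step described above can be sketched as follows, using the textbook procedure: vector-normalize each criterion, apply weights, locate the ideal and anti-ideal alternatives, and score each supplier by its relative closeness to the ideal. The criteria, weights, and supplier figures are invented for illustration; the abstract does not list the indicators actually used.

```python
import math

def topsis(matrix, weights, benefit):
    """Return a closeness score in [0, 1] for each row (alternative).

    matrix  -- rows are alternatives, columns are criteria
    weights -- one weight per criterion
    benefit -- True if larger is better for that criterion, else False
    Assumes no all-zero criterion column and at least two distinct rows.
    """
    n_crit = len(weights)
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_crit)]
    weighted = [
        [weights[j] * row[j] / norms[j] for j in range(n_crit)] for row in matrix
    ]
    columns = list(zip(*weighted))
    ideal = [max(col) if benefit[j] else min(col) for j, col in enumerate(columns)]
    anti = [min(col) if benefit[j] else max(col) for j, col in enumerate(columns)]
    scores = []
    for row in weighted:
        d_ideal = math.dist(row, ideal)
        d_anti = math.dist(row, anti)
        scores.append(d_anti / (d_ideal + d_anti))
    return scores

# Hypothetical suppliers: (weekly supply volume, fulfilment error rate).
suppliers = [[120.0, 0.05], [80.0, 0.02], [100.0, 0.10]]
scores = topsis(suppliers, weights=[0.6, 0.4], benefit=[True, False])
ranking = sorted(range(len(suppliers)), key=lambda i: scores[i], reverse=True)
print(scores, ranking)
```

Sorting by the closeness score gives the supplier ranking, which could then feed a 0-1 selection model of the kind the abstract describes.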

14:45
TAFNet: Temporal Attention Fusion Network for Robust Deepfake Detection

ABSTRACT. Existing deepfake detection methods usually focus on static spatial features or simplistic spatiotemporal modeling, failing to effectively capture complex temporal dynamics and cross-modal dependencies across video frames, thereby limiting their robustness and accuracy in complex scenarios. To address this issue, we propose a novel Temporal Attention Fusion Network (TAFNet), which enhances detection accuracy through multi-level feature extraction, cross-modal feature fusion, and temporal modeling. First, TAFNet employs a High-Resolution Network (HRNet) to extract high-quality, fine-grained spatial features. Second, we introduce a Laplacian pyramid module combined with a Window-based Attention Module (WAM) to enrich spatial details and frequency-domain features, significantly improving sensitivity to subtle forgery traces. Third, a Multi-head Cross-Relational Attention (MCRA) mechanism is proposed to efficiently capture long-range spatial dependencies across and within video frames. Finally, a lightweight Bi-directional Long Short-Term Memory (Bi-LSTM) module is utilized for temporal modeling, deeply exploring dynamic inconsistencies across frames. Extensive within-dataset and cross-dataset experiments demonstrate the effectiveness and robustness of our method in various deepfake detection problems.

14:50
A Defect Detection Model for Transmission Line Stockbridge Dampers Based on YOLOv11 with Privacy Protection

ABSTRACT. As a vital component of transmission lines, the condition of Stockbridge dampers is essential for maintaining the stability and safety of power transmission systems. In recent years, deep learning techniques have proven effective in addressing the challenges of detecting defects in Stockbridge dampers, including complex backgrounds and the diverse shapes and sizes of target regions. However, current methods often fail to offer robust data privacy protection during their application. This issue is particularly pronounced when vast amounts of data are spread across various regions, which in turn complicates the effective integration and utilization of the data. To resolve these critical issues, this study presents a privacy-enhanced defect detection framework for transmission line Stockbridge dampers, integrating the YOLOv11 architecture with systematic privacy preservation mechanisms. Our solution ensures data confidentiality through localized client training and secure parameter sharing protocols, effectively decoupling raw data transmission from model optimization. The framework embodies three key innovations: 1) Development of the FDRD benchmark dataset comprising real-world transmission line inspection imagery, establishing a standardized evaluation platform for damper defect analysis. 2) Implementation of a federated learning architecture that simultaneously addresses data privacy concerns and resolves cross-regional data isolation challenges through decentralized model training. 3) Comprehensive performance evaluations demonstrating significant improvements over baseline models (YOLOv9/YOLOv10), achieving state-of-the-art metrics including 0.9 mAP50, 0.928 precision, and 0.785 recall, validating the framework's dual capability in enhancing detection accuracy while maintaining rigorous privacy standards. This work advances intelligent inspection methodologies by providing a balanced approach to technical performance and data security requirements in modern power infrastructure maintenance.
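
The idea of sharing parameters rather than raw images can be illustrated with a minimal federated averaging step. This is a sketch under an explicit assumption: the abstract does not name its aggregation rule, so plain sample-size-weighted averaging (FedAvg) is used here, and the flat parameter lists stand in for real YOLOv11 weights.

```python
def fed_avg(client_weights, client_sizes):
    """Aggregate client model parameters, weighted by local dataset size.

    client_weights -- list of flat parameter lists, one per client
    client_sizes   -- number of local training samples per client
    Only parameters leave each client; the raw inspection images do not.
    """
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]

# Two hypothetical regional clients with different amounts of local data.
regional_models = [[1.0, 2.0], [3.0, 4.0]]
regional_sizes = [1, 3]
global_model = fed_avg(regional_models, regional_sizes)
print(global_model)
```

In a full training loop the server would send the aggregated parameters back to every client for the next round of local training.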

14:55
Multi Action Unit Feature Fusion Network for Micro-Expression Recognition

ABSTRACT. Recent advances in deep learning have sparked significant interest in micro-expression recognition. However, most existing methods process the entire facial region directly, which makes it challenging to capture the subtle variations in facial action units and thus limits recognition performance. To overcome this limitation, this paper proposes the Multi-Action Unit Feature Fusion Network (MAUFFN), which effectively captures subtle changes in facial action units. Specifically, we first develop a multi-facial action optical flow estimation module that simultaneously extracts both global facial optical flow information after motion amplification and localized optical flow information from the eye and mouth-nose regions, enabling a more precise capture of facial expression nuances. Then, we design an effective feature extraction module and a novel feature fusion module to capture both global and local information. We validate our proposed method on four public datasets: CASME II, SAMM, SMIC-HS, and MMEW. Our approach achieves state-of-the-art results on all four datasets, with 86.30% UAR and 92.88% UF1 on the composite dataset, outperforming the latest method by 1.77% and 6.0%, respectively.
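
The two reported metrics are class-balanced scores: UAR is the unweighted mean of per-class recall, and UF1 is the unweighted mean of per-class F1. The sketch below computes both from label lists; the function name and toy labels are assumptions for illustration and do not reproduce the paper's evaluation protocol.

```python
def uar_uf1(y_true, y_pred):
    """Return (UAR, UF1), averaged over the classes present in y_true."""
    classes = sorted(set(y_true))
    recalls, f1s = [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        recalls.append(tp / (tp + fn))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(recalls) / len(classes), sum(f1s) / len(classes)

# Toy three-class example (labels are arbitrary integers).
truth = [0, 0, 1, 1, 2, 2]
guess = [0, 1, 1, 1, 2, 0]
print(uar_uf1(truth, guess))
```

Because every class counts equally, both scores remain informative on the imbalanced class distributions typical of micro-expression datasets.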

15:00
Real-Time Soft Tissue Modeling and Simulation Based on FEM

ABSTRACT. Neurosurgical cases often arise as emergencies and require experienced doctors to perform the operation, yet experienced neurosurgeons are often concentrated in central cities. Therefore, preoperative training for neurosurgery is essential. This paper proposes an improved finite element method to simulate meningioma surgery and intracerebral hematoma removal surgery. A preoperative training platform is designed to allow doctors to immerse themselves in surgical training and improve surgical accuracy and efficiency.

15:05
Research on energy consumption prediction method of pure electric tugboat based on machine learning

ABSTRACT. Predicting the energy consumption of ships is currently a popular area of study, but little research exists on predicting the energy consumption of pure electric tugboats. Using vessel operational data and meteorological data, this paper develops six distinct machine learning models for predicting the energy consumption of a pure electric tugboat while it is cruising. It also investigates the impact of meteorological factors on energy consumption prediction. A case study was conducted to predict the power consumption per minute of an all-electric tugboat. According to the experimental data, the Random Forest model performs better than the other models in terms of prediction accuracy, with a much lower error. According to the two sets of comparison studies, forecast accuracy increases after meteorological elements are added as model input variables. This study closes a gap in the literature on predicting the energy consumption of pure electric tugboats and offers marine industry operators and practitioners a high-performing energy consumption prediction model that aids in improving energy utilization efficiency and in developing driving strategies, tugboat management, scheduling, and charging strategies.