Few-shot Medical Image Segmentation via Query Transformation Learning
ABSTRACT. Few-shot segmentation, which aims to segment unseen classes from a small number of labeled images, has made great progress in both natural and medical images in recent years. However, current few-shot medical image segmentation methods still have several flaws. The diversity of base classes in the training stage is insufficient, so it remains difficult to train a model with strong generalization from large numbers of images and classes, as is possible for natural images. Besides, because of discrepancies between the query and support images, even when representative information is mined from the support images, it is difficult for this information to exert a stable effect when guiding query image segmentation. To address these issues, we propose a novel method that constructs support-query pairs based on superpixel hierarchies, aiming to simulate the variation of the same medical class across different slice sequences and to make the model adapt to the differences between support and query. Furthermore, we design a pipeline to learn an optimized prototypical network for prediction by leveraging the invariance of gray-scale transformation and the equivariance of geometric transformation. Such an operation improves the prototype guidance in feature space for two query views with different transformations. Extensive experiments on two abdominal medical datasets (MRI and CT) effectively demonstrate the superiority of our network when compared with current state-of-the-art methods.
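The abstract does not spell out the transformation-consistency objective; as a hedged, minimal sketch of the stated idea (gray-scale invariance and geometric equivariance between two views of the same query), one might enforce a consistency loss such as the following, where the `model` interface, the gamma adjustment standing in for a gray-scale transform, and the 90-degree rotation standing in for a geometric transform are all assumptions:

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, query, angle_k=1, gamma=1.2):
    """Hedged sketch: gray-scale invariance + rotation equivariance.

    `model(x)` is assumed to return per-pixel class probabilities (B, C, H, W).
    A gamma adjustment stands in for a gray-scale transform, and a 90-degree
    rotation (k quarter turns) stands in for a geometric transform.
    """
    p_ref = model(query)                                   # reference prediction

    # Invariance: a gray-scale change of the input should not change the output.
    q_gray = query.clamp(min=1e-6) ** gamma
    loss_inv = F.mse_loss(model(q_gray), p_ref)

    # Equivariance: rotating the input should rotate the output the same way.
    q_rot = torch.rot90(query, k=angle_k, dims=(-2, -1))
    p_ref_rot = torch.rot90(p_ref, k=angle_k, dims=(-2, -1))
    loss_equi = F.mse_loss(model(q_rot), p_ref_rot)

    return loss_inv + loss_equi
```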
Boosting Remote Semantic Segmentation Using Vision-and-Language Foundation Model
ABSTRACT. In recent years, visual analysis and processing of remote sensing images have become increasingly popular. Vision-language foundation models (such as RemoteClip) embed rich prior knowledge from a large number of remote sensing images through extensive pre-training. Although these models perform well in image-level tasks, their prior knowledge has not been fully utilized in pixel-level segmentation tasks. To address this issue, we propose a lightweight fusion framework named Remote Foundation Model for Segmentation (RFM-Seg). This framework trains V-branch connectors, VL-branch connectors, and the VL-map module while freezing both the foundation model and the remote sensing segmentation model. These modules effectively integrate multi-scale and multi-modal prior knowledge from remote sensing images into mainstream remote sensing segmentation models, thereby enhancing performance in pixel-level segmentation tasks. We validate the effectiveness of this framework on four challenging aerial image segmentation benchmark datasets: ISPRS Vaihingen, ISPRS Potsdam, Aerial, and LoveDA Urban. Experimental results show that RFM-Seg achieves state-of-the-art outcomes on these four benchmarks. Additionally, the lightweight design of the fusion components allows RFM-Seg to maintain high training and inference efficiency. The source code will be available at https://github.com/fffsbx/RFM-Seg.
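The connector design is not detailed in the abstract; as a hedged sketch of the general recipe it describes (freeze both the foundation model and the segmentation model, train only small connector modules that project foundation features into the segmentation feature space), one could write something like the following, where all module names, shapes, and the optimizer setup are assumptions:

```python
import torch
import torch.nn as nn

class Connector(nn.Module):
    """Tiny trainable bridge from frozen foundation features to segmentation features."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_dim, out_dim, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(out_dim, out_dim, kernel_size=3, padding=1),
        )

    def forward(self, feat):
        return self.proj(feat)

def freeze(module):
    """Disable gradients for a frozen backbone."""
    for p in module.parameters():
        p.requires_grad = False

# foundation_model and seg_model stand in for the frozen networks; only the
# connector's parameters would go to the optimizer, e.g.:
#   freeze(foundation_model); freeze(seg_model)
#   optimizer = torch.optim.AdamW(connector.parameters(), lr=1e-4)
```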
RaEUNet: A Retentive and Efficient UNet for Medical Image Segmentation
ABSTRACT. The combination of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) in U-shaped networks offers great potential for medical image segmentation. To address the lack of explicit spatial priors in CNN-Transformer architectures, we propose a novel network named RaEUNet, integrating Retentive Networks Meet Vision Transformers (RMT) and a modified Efficient Multi-scale Convolutional Attention Decoding (EMCAD). RMT, with its explicit spatial priors, replaces ViT to enhance spatial features in the encoder. For the decoder, we propose the efficient channel attention up-convolution block (ECAUB), which mitigates the insensitivity to channel relationships caused by depth-wise convolution by introducing an efficient channel attention module (ECAM). Additionally, we design a multi-scale re-weighted attention module (MRAM) by introducing Spatial and Channel Reconstruction Convolution (SCConv) and our proposed Re-weighted Adjustment Module (RAM), which fully integrates high-level and low-level features, reducing feature redundancy and improving information capture. Experiments on the Synapse, Polyp, and ACDC datasets show superior performance over state-of-the-art methods.
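ECAM is not specified in the abstract; a hedged sketch in the spirit of efficient channel attention (ECA-Net style: global average pooling followed by a 1D convolution across channels, without dimensionality reduction) could look like this, with the kernel size and placement being assumptions:

```python
import torch
import torch.nn as nn

class EfficientChannelAttention(nn.Module):
    """ECA-style channel attention: GAP + 1D conv over channels + sigmoid gate."""
    def __init__(self, k_size=3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=k_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                       # x: (B, C, H, W)
        y = self.pool(x)                        # (B, C, 1, 1)
        y = y.squeeze(-1).transpose(1, 2)       # (B, 1, C)
        y = self.conv(y)                        # 1D conv across the channel axis
        y = self.sigmoid(y).transpose(1, 2).unsqueeze(-1)  # (B, C, 1, 1)
        return x * y                            # re-weight channels
```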
Generating 3D fish motion skeleton via iterative optimization method and FishSkeletonNet
ABSTRACT. The fish motion skeleton serves as the foundation for 3D fish motion modeling, enabling the manipulation of fish posture deformations and movements. In agricultural applications, it facilitates the observation of fish behavior to detect their health status or enhance feeding practices. However, the joints within the fish motion skeleton, responsible for driving the fish's movements, are not always stable, as they undergo changes while the fish grows. This unstable skeleton topology poses a challenge when attempting to simulate a lifelike fish skeleton. In this paper, we present a novel method for generating a 3D fish skeleton based on fish posture data. Our approach establishes an initial motion skeleton including the spine and fins. We then determine its parameters, encompassing joint positions and the number of joints, through iterative optimization, employing collected data from fish with various shapes and five common postures as constraints. Furthermore, the skeletons generated through this optimization process are utilized as sample data for training the FishSkeletonNet network, a framework introduced in this paper for predicting the motion skeletons of input 3D fish bodies. To validate the effectiveness of our approach, we introduce a new dataset of grass carp postures, on which we carry out experiments and conduct both quantitative and qualitative evaluations. The experiments illustrate that our method generates fish motion skeletons that closely emulate the actual motion skeleton structure of fish, demonstrating a higher level of biological plausibility compared to existing methods.
Semantic-Orthogonal Multi-modal Attention Network for RGB-D Salient Object Detection
ABSTRACT. In recent years, RGB-D salient object detection has made significant advancements in the field of computer vision. However, existing methods still face challenges in feature extraction, cross-modal fusion, and multi-scale processing, limiting their performance in complex scenarios. To tackle these challenges, we propose SOMA-Net (Semantic Orthogonal Multi-Modal Attention Network), a novel and efficient RGB-D salient object detection model that incorporates three key innovations: First, inspired by the “local focus-global reasoning” dual-path mechanism of the human visual system, we introduce a novel method for semantic token sparsification—Dual-Stage Sparse Semantic Enhancement (DSSE), based on the Swin Transformer architecture. DSSE filters out redundant semantic information, improving generalization and enabling focus on crucial semantics. This method enhances feature extraction efficiency by reducing FLOPs by over 33%, without sacrificing accuracy compared to the original Swin Transformer backbone. Second, we propose the Orthogonal Multi-Modal Mutual Attention Fusion (O-MMAF) module, which integrates mutual attention with orthogonal channel attention. This module effectively leverages the complementary relationship between RGB and depth features, improving both accuracy and robustness in cross-modal fusion. Finally, inspired by the visual processing mechanisms of primates, we design the Multi-Scale Self-Calibrating Spatial Recursive Attention (MSRA) module. By extracting multi-scale information and performing deep optimization, MSRA simulates the brain’s approach to information processing, generating high-precision saliency predictions in a coarse-to-fine manner. Experimental results show that SOMA-Net achieves outstanding performance across four evaluation metrics on nine publicly available RGB-D datasets, surpassing 12 state-of-the-art models and demonstrating its effectiveness in this field. Our code is published at https://github.com/jiaweiXu1029.
Unified Cross-Domain Refinement Network for Camouflaged Object Detection
ABSTRACT. Camouflaged Object Detection (COD) is a challenging task in computer vision, aiming to segment objects that seamlessly blend into their backgrounds. In the spatial domain, pixel-level information effectively captures salient regions in an image but is highly sensitive to small variations or noise. In contrast, the frequency domain, by separating high and low-frequency components—especially utilizing the smoothing effect of low-frequency components—provides stronger robustness against noise. Therefore, this paper aims to explore a comprehensive fusion approach to enhance the model's inference capabilities. We propose a novel Cross-Domain Refinement Network (CDRNet), which enhances information by establishing correlations and differences between the frequency and spatial domains, followed by iterative refinement of the prediction results. In the first stage of coarse segmentation, we introduce a Domain Correlation Fusion (DCF) module, which uses spatial domain information to guide the frequency domain. Additionally, we design a Domain Difference Convolution (DDC) to exploit the differences between the two domains to enhance spatial domain information. In the second stage of fine-grained optimization, we adopt a newly developed Iterative Refinement Masking (IRM) to restore details. Experimental results across four COD datasets demonstrate that CDRNet achieves state-of-the-art performance. The source code is available at https://anonymous.4open.science/r/CDRNet-E4B0/.
StyleRegLoRA: An Improved Framework for SDXL LoRA Fine-tuning Towards Art Category Generation
ABSTRACT. Diffusion models excel in general text-to-image (T2I) synthesis but often generate stereotypical images lacking diversity when targeting specific art categories. While fine-tuning methods like LoRA can adapt models to new styles using datasets, standard approaches struggle to simultaneously balance generation diversity, instance accuracy (fidelity to prompt content rendered in the target style), and style accuracy (category fidelity), often overfitting to common features rather than capturing the true category distribution. Addressing this challenge, particularly when fine-tuning on a specific art category dataset, we propose StyleRegLoRA, an improved LoRA framework built upon SDXL. Its core contribution is a style regularization loss applied to intermediate UNet features. This loss combines feature moment matching to ensure style accuracy representative of the category, with an instance feature push-away mechanism to promote diversity and instance accuracy. Optimizing this joint objective guides LoRA towards learning a representation that better balances these three crucial elements. Comprehensive experiments on two art datasets, including quantitative metrics, qualitative analysis, and user studies, demonstrate StyleRegLoRA's advantages over baseline T2I models and standard LoRA fine-tuning approaches in achieving balanced and high-fidelity art category generation.
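The exact regularizer is not given in the abstract; as a hedged sketch of its two stated components, feature moment matching toward category-level statistics and an instance push-away term, one could write the following, where the reference statistics, the features used, and all weights are assumptions:

```python
import torch
import torch.nn.functional as F

def style_reg_loss(feats, ref_mean, ref_std, push_margin=0.5, lam=0.1):
    """Hedged sketch of a style regularizer on intermediate UNet features.

    feats: (B, C, H, W) intermediate features of generated samples.
    ref_mean / ref_std: (C,) channel statistics of the target art category
    (how they are estimated is an assumption).
    """
    mu = feats.mean(dim=(2, 3))                    # (B, C) per-sample channel means
    sigma = feats.std(dim=(2, 3))                  # (B, C) per-sample channel stds

    # Moment matching: keep generated features close to the category statistics.
    loss_style = F.mse_loss(mu, ref_mean.expand_as(mu)) \
               + F.mse_loss(sigma, ref_std.expand_as(sigma))

    # Push-away: discourage different instances from collapsing onto identical features.
    flat = F.normalize(feats.flatten(1), dim=1)    # (B, C*H*W)
    sim = flat @ flat.t()                          # pairwise cosine similarity
    off_diag = sim - torch.eye(sim.size(0), device=sim.device)
    loss_push = F.relu(off_diag - push_margin).mean()

    return loss_style + lam * loss_push
```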
Latent Interpretation for Multi-View Face Synthesis across GAN and Diffusion via Conditional Reconstruction
ABSTRACT. This paper presents a novel multi-view face synthesis framework that utilizes the latent spaces of models such as GANs and diffusion models. By incorporating a conditional reconstruction loss and a bidirectional editing strategy, our method effectively guides the conditional encoder to capture meaningful facial-orientation semantic vectors directly in the latent space, thus eliminating the need for scoring functions in the image space. It allows latent-code editing to migrate neatly from the GAN model to the diffusion model, addressing the limitations of existing GAN-based latent interpretation in face reconstruction and multi-view synthesis, particularly in preserving details and maintaining style consistency. Extensive experiments on the CelebA-HQ, FFHQ, and synthetic datasets validate the effectiveness of our method in both GAN and diffusion models. The results demonstrate that our approach outperforms state-of-the-art methods in terms of generation quality, detail preservation, and identity consistency. Ablation studies further confirm the critical contributions of the proposed method to enhancing the model's performance and stability. The code and trained models are available at https://github.com/Xenithon/LMFS.
PeelMesh: Efficient Interactive Segmentation via Geodesic-Driven Dynamic Topological Updates
ABSTRACT. Mesh segmentation constitutes a fundamental challenge in digital geometry processing, shape analysis, and geometric modeling. While automatic segmentation techniques have been extensively studied, they often lack the user controllability needed to produce application-specific segmentation results. Conventional interactive methods incorporate user guidance through landmark selection but rely heavily on harmonic field computations for geodesic boundary generation, incurring substantial computational overhead. Furthermore, existing interactive frameworks typically operate on individual models, necessitating labor-intensive repetitions for batch processing of structurally similar datasets. To address these limitations, we present PeelMesh, an open-source, user-friendly interactive mesh segmentation framework. We propose a dynamic KD-tree integration mechanism coupled with real-time geodesic extraction, avoiding the expensive computational overhead of conventional harmonic-field-based approaches. By replacing global energy minimization with incremental topological updates, this method maintains geometric consistency while achieving per-model segmentation within 100 milliseconds on the PSB (Princeton Segmentation Benchmark). For batch processing challenges, we implement predefined anatomical landmark configurations (e.g., FaceScape's 68 standardized markers) with connectivity constraints, eliminating repetitive manual interventions across structurally similar datasets. Experimental validation on 842 facial models demonstrates efficient extraction of 19 distinct sub-regions with a total execution time of 372.73 seconds (avg. 442.67 ms/model). The modular architecture facilitates seamless integration with 3D modeling workflows, augmented by Python bindings that lower technical barriers for cross-domain adoption.
ENViT-FAS: ENViT with Attack-Oriented Augmentation for Domain-Generalized Face Anti-Spoofing
ABSTRACT. Face Anti-Spoofing (FAS) aims to detect malicious attempts to invade a face recognition system by presenting spoofed faces. State-of-the-art FAS techniques predominantly rely on deep learning models but their cross-domain generalization capabilities are often hindered by the domain shift problem, which arises due to different distributions between training and testing data. In this study, we propose ENViT-FAS, a novel framework that integrates enhanced neighborhood attention vision transformer (ENViT) with attack-oriented data augmentation for domain-generalized FAS.
By performing an attack-oriented data augmentation strategy on real face samples to generate pseudo-negative samples, the model's robustness to samples from different environments is enhanced. To strengthen the model's ability to perceive local features, we introduce a neighborhood attention with feature enhancement module. We also employ a dynamic consistency constraint and multi-scale style regularization to further enhance cross-domain adaptability. Intra-dataset and inter-dataset comparison experiments verify that our method, ENViT-FAS, achieves strong generalization performance.
Machine Learning Approach for Color Selection near Light/Dark Boundaries Considering Emotional Impressions
ABSTRACT. In illustrations, colors near light/dark boundaries significantly impact the visual impression. This study proposes an automatic coloring method for these boundary areas based on desired emotional impressions. Our system processes line drawings with base colors to generate normal maps using a GAN-based approach. This normal map is then used to identify shadows and light/dark boundaries, followed by applying appropriate colors. For shade colors, we trained a multilayer perceptron on 536 base-shade color pairs from professional character designs, achieving accurate shade coloring. For boundary colors, we modeled impressions from a questionnaire survey (53 participants rating 7 levels across 3 dimensions: ``like-dislike,'' ``harmony-disharmony,'' and ``gorgeous-simple'') and implemented this in a learning network. Our evaluation shows the system can effectively support expressive illustration creation with minimal error (MAE of 2.03). In a feedback survey with professional illustrators, 80\% expressed interest in using the system, particularly appreciating how it maintains consistency in emotional expression while reducing coloring time. The main contributions are: (1) a machine learning approach for shadow color prediction, and (2) an impression-based coloring system for light/dark boundaries.
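The network details for shade coloring are not given beyond "multilayer perceptron on 536 base-shade color pairs"; a hedged, minimal sketch of such a regressor (RGB in [0, 1]; layer sizes and loss are assumptions) is:

```python
import torch
import torch.nn as nn

class ShadeColorMLP(nn.Module):
    """Minimal sketch: predict a shade color from a base color (both RGB in [0, 1])."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),   # keep the output in [0, 1]
        )

    def forward(self, base_rgb):                  # (N, 3)
        return self.net(base_rgb)

# Training would regress the collected base/shade pairs, e.g. with an L1 loss:
#   loss = nn.functional.l1_loss(model(base_colors), shade_colors)
```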
EX-YOLO: Improving YOLOv11 for Efficient Detection of Small Targets in Drone Imagery
ABSTRACT. In recent years, with the rapid development of intelligent traffic systems and unmanned technology, the detection and identification of traffic objects has become an important technology to ensure road safety and optimize traffic management.
However, the detection of small targets in traffic scenarios still faces great challenges.
Traditional computer vision methods show obvious limitations in complex traffic scenarios, especially for small-target detection.
Deep learning-based neural network models have made revolutionary progress in image processing and target detection, especially small-target detection, effectively compensating for the shortcomings of traditional methods.
This paper proposes an improved model based on YOLOv11, which aims to improve the detection and recognition of small traffic-object targets in drone imagery.
By introducing an improved pyramid context enhancement (SPPC) module, a lightweight attention convolution (DBSS) module, and an NWD (Normalized Gaussian Wasserstein Distance) loss function, the detection accuracy and stability for small targets are significantly improved.
The main contributions of this article include:
1. We propose the SPPC module, which combines a context enhancement module (CAM) with an FPN module to enrich the contextual information of feature maps and enhance small-target detection.
2. We design the lightweight attention convolution module (DBSS), which improves detection accuracy while reducing the model's computational burden and enabling efficient feature processing.
3. We adopt the NWD loss function, which optimizes the loss computation and significantly improves small-target recognition in response to the challenges of small-target detection (a minimal sketch of NWD is given below).
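For reference, the NWD formulation from the tiny-object-detection literature models each box as a 2D Gaussian and compares boxes with an exponentially normalized 2nd-order Wasserstein distance; a minimal sketch (the constant C is a dataset-dependent scale, and the loss pairing below is an assumption) is:

```python
import torch

def nwd(boxes_a, boxes_b, C=12.8):
    """Normalized Gaussian Wasserstein Distance for boxes in (cx, cy, w, h) format.

    Each box is modeled as a 2D Gaussian; the squared 2nd-order Wasserstein
    distance between two such Gaussians reduces to a squared Euclidean distance
    over (cx, cy, w/2, h/2). C is a dataset-dependent scale constant.
    """
    pa = torch.stack([boxes_a[:, 0], boxes_a[:, 1],
                      boxes_a[:, 2] / 2, boxes_a[:, 3] / 2], dim=1)
    pb = torch.stack([boxes_b[:, 0], boxes_b[:, 1],
                      boxes_b[:, 2] / 2, boxes_b[:, 3] / 2], dim=1)
    w2 = ((pa - pb) ** 2).sum(dim=1)                  # squared Wasserstein distance
    return torch.exp(-torch.sqrt(w2 + 1e-12) / C)     # similarity in (0, 1]

def nwd_loss(pred_boxes, gt_boxes):
    # Typically used in place of, or blended with, an IoU-based regression loss.
    return (1.0 - nwd(pred_boxes, gt_boxes)).mean()
```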
A Novel Approach to Defocus Region Detection in Phase-Shifting Structured Light 3D Reconstruction
ABSTRACT. Structured light 3D reconstruction technology has been widely applied in fields such as industrial inspection and medical imaging due to its advantages of non-contact operation, high accuracy, and efficiency. However, in practical acquisition scenarios—especially in handheld small field-of-view Fringe Projection Profilometry (FPP) systems—defocus of the camera or projector often results in local blurring or even loss of fringe patterns, which significantly compromises the accuracy of phase calculation and the reliability of subsequent depth estimation. To address the issue of fringe blurring caused by defocus in such systems, our study proposes a defocus region detection method based on wrapped phase analysis. The proposed method utilizes the phase difference stability among three wrapped phase maps to construct a rule-based mask determination mechanism without requiring any learning process. By applying a fixed threshold, our method effectively identifies and removes unusable data points such as defocus regions, low-reflectivity areas, and background regions. Compared with traditional methods that rely on modulation thresholding, our method offers stronger system generalization capability and lower parameter dependency. Experiments conducted on both a handheld visible-light FPP system and a desktop near-infrared FPP system demonstrate that the proposed method can accurately extract valid fringe regions, thereby improving the accuracy and robustness of 3D reconstruction. Our method shows strong practical value in engineering applications.
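The paper's exact masking rule is not reproduced in the abstract; as a hedged sketch of the stated idea (checking the stability of differences among three wrapped phase maps and applying a fixed threshold), one possible reading is a local-stability test such as the following, where the window size and threshold are assumptions:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def wrap(p):
    """Wrap phase values into [-pi, pi)."""
    return (p + np.pi) % (2 * np.pi) - np.pi

def local_std(x, size=7):
    """Windowed standard deviation via box filtering."""
    mean = uniform_filter(x, size)
    mean_sq = uniform_filter(x * x, size)
    return np.sqrt(np.maximum(mean_sq - mean ** 2, 0.0))

def valid_fringe_mask(phi1, phi2, phi3, tau=0.15, size=7):
    """Hedged sketch: keep pixels where pairwise wrapped phase differences are
    locally stable; defocused, low-reflectivity, and background regions show
    unstable differences and are removed with a fixed threshold."""
    instability = np.maximum.reduce([
        local_std(wrap(phi1 - phi2), size),
        local_std(wrap(phi2 - phi3), size),
        local_std(wrap(phi1 - phi3), size),
    ])
    return instability < tau            # True where the fringe data is usable
```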
MatLayerNet: A Multi-Agent-Based Method for Text-to-PBR Material Generation
ABSTRACT. In the field of 3D generation, producing high quality physically based material maps has long been a key research challenge. However, mainstream diffusion-based material generation methods often lack fine-grained details and logical consistency. In contrast, manual material creation typically relies on extensive material layering and grouping to enrich details and ensure logical coherence. In this paper, we propose MatLayerNet, a multi-agent-based method that organizes large language model (LLM) agents to simulate the human workflow of material creation for material generation tasks. Firstly, we construct an iterative material segmentation pipeline that decomposes textual descriptions of objects into multiple material texts. Secondly, we deconstruct the logic of material layering, guiding the LLM to generate structured layer plans based on substrate material, material texture, and material aging. Thirdly, we introduce a vertically structured multi-agent architecture to generate consistent and coherent material parameters across all layers. Finally, we incorporate a retrieval-augmented generation (RAG) approach, where the LLM invokes a material mask generator to produce layer-specific masks and determine spatial distribution. Experimental results demonstrate that our method can generate more accurate material details while maintaining consistency and logical integrity among all material components.
Embedding Space Decomposition Meets Invertible Networks: A New Paradigm for Unpaired Low-Light Enhancement
ABSTRACT. Most existing low-light image enhancement methods require paired low/normal-light training data that is difficult to acquire in practice. While unpaired learning methods circumvent this need, they face challenges in establishing reliable supervision. To address this problem, we propose a novel two-stage framework based on invertible networks. First, we propose a Retinex decomposition framework that separates and mutually refines illumination/reflectance components in an embedding space through cross-component affine transformations, where a CNN-Mamba hybrid architecture generates modulation parameters to enable synergistic component optimization. Second, an invertible neural network learns bidirectional mappings between low- and normal-light domains, utilizing pseudo-references formed by combining decomposed low-light reflectance with normal-light illumination while enforcing cycle consistency. Extensive experiments on three benchmark datasets demonstrate our method’s superiority, outperforming state-of-the-art approaches in multiple quantitative metrics while achieving better noise suppression and detail preservation in visual results.
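As a hedged illustration of the pseudo-reference construction described above (Retinex view: image ≈ reflectance × illumination, so a low-light reflectance can be recombined with a normal-light illumination), assuming a `decompose` function that returns the two components:

```python
import torch

def make_pseudo_reference(low_img, normal_img, decompose):
    """Hedged sketch: build a pseudo-reference by pairing the reflectance of a
    low-light image with the illumination of an (unpaired) normal-light image.

    `decompose(x)` is assumed to return (reflectance, illumination) satisfying
    the Retinex relation x ≈ reflectance * illumination.
    """
    refl_low, _ = decompose(low_img)
    _, illum_normal = decompose(normal_img)
    pseudo = refl_low * illum_normal          # content of the low-light scene,
    return pseudo.clamp(0.0, 1.0)             # lit like the normal-light image
```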
Robust bi-level optimization for monocular human and scene reconstruction
ABSTRACT. We propose a method integrating bi-level optimization and image inpainting for the simultaneous reconstruction of human bodies and their surrounding scenes from monocular RGB videos. Recognizing that human-scene interaction is a critical component in monocular reconstruction, our approach utilizes bi-level optimization to leverage contact information. To further enhance reconstruction robustness, particularly in occluded environments, we incorporate image inpainting techniques for recovering and unveiling hidden scene elements, thereby effectively addressing complex scenarios. Both qualitative and quantitative evaluations performed on the PROX and i3DB datasets demonstrate the significance of image inpainting within our framework and validate the effectiveness of the proposed bi-level optimization strategy.
A Privacy-preserving Collaborative 3D Design Framework for Copyright Protection
ABSTRACT. Traditional collaborative modelling design and project file management platforms typically rely on centralized system architectures, which can lead to single points of failure, resulting in data manipulation or loss, project delays, or even necessitating the complete redo of entire projects. Additionally, the intellectual property of designers' works is not effectively protected, posing risks of infringement and piracy. However, blockchain and zero-knowledge proof technologies promise to address these issues by ensuring data immutability and privacy protection. In this paper, we propose a privacy-preserving collaborative 3D modelling and design framework for copyright protection, which can authenticate the intellectual property rights of works and verify their integrity without revealing the information of the works. This framework utilizes blockchain to track the revision history of design data by designing transaction data models and smart contracts, and zero-knowledge proof technology to recognize and protect intellectual property through algorithms based on zk-SNARK. Finally, we take Blender 3D animation as an example to illustrate and evaluate the performance of our framework. The results indicate that our framework is effective in achieving secure 3D animation production, with latency, TPS (transactions per second), QPS (queries per second), storage costs, IPFS upload speed, and zero-knowledge proof generation cost all within acceptable ranges.
Enhancing Vision Transformer for Fine-Grained Classification with Selective Attention Aggregation and Multi-Head Noise Suppression
ABSTRACT. Fine-grained visual classification (FGVC) is a challenging computer vision task that aims to distinguish highly similar subcategories by identifying subtle visual differences in specific parts of objects. Traditional convolutional neural network (CNN)-based methods, constrained by their reliance on translation invariance and local receptive fields, struggle to capture global contextual information, making it difficult to extract fine-grained details and local discriminative features. Recently, Vision Transformers (ViTs) have demonstrated exceptional performance in various image tasks by leveraging attention mechanisms to capture global context and establish long-range dependencies, resulting in powerful feature representations. However, ViT tends to focus more on global coarse-grained information while paying insufficient attention to local fine-grained details, limiting its effectiveness in FGVC tasks. To address this limitation, we propose a novel fine-grained visual classification framework based on selective attention aggregation and multi-head noise suppression (SaM). The framework consists of two key components: First, to avoid relying solely on high-level weights while neglecting the importance of low-level information, we introduce a Selective Attention Aggregation (SAA) module. Second, to further refine the selection of discriminative regions, we propose a Multi-Head Noise Suppression (MHNS) module, which reduces the impact of background noise while preserving feature diversity. Experimental results on several popular FGVC datasets demonstrate that our method achieves competitive performance compared to state-of-the-art approaches, highlighting its effectiveness and robustness in fine-grained classification tasks.
L2H-NeRF: Low- to High-Frequency-Guided NeRF for 3D Reconstruction With a Few Input Scenes
ABSTRACT. Nowadays, three-dimensional (3D) reconstruction techniques are becoming increasingly important in the fields of architecture, game development, movie production, and more. Due to common issues in the reconstruction process, such as perspective distortion and occlusion, traditional 3D reconstruction methods face significant challenges in achieving high-precision results, even when dense data are used as inputs. With the advent of neural radiance field (NeRF) technology, high-fidelity 3D reconstruction results are now possible. However, high computational resources are usually required for NeRF computations. Recently, approaches that use only a few input views have emerged, but ensuring the highest quality from such sparse inputs remains difficult. In this paper, we propose an innovative low- to high-frequency-guided NeRF (L2H-NeRF) framework that decomposes scene reconstruction into coarse and fine stages. For the first stage, a low-frequency enhancement network based on a vision transformer (ViT) is proposed, where the low-frequency-based globally coherent geometric structure is recovered, with dense depth restored in a depth-completion manner. In the second stage, a high-frequency enhancement network is incorporated, where the high-frequency-related detail is compensated by robust feature alignment across adjacent views using a plug-and-play feature extraction and matching module. Experiments demonstrate that both the accuracy of the geometric structure and the feature detail of the proposed L2H-NeRF outperform state-of-the-art methods.
Advancements and Challenges of Deep Learning in Virtual Try-On: A Comprehensive Review
ABSTRACT. Virtual try-on is a new technology that allows users to preview how clothing will look on them in a virtual environment before buying, thereby reducing the cost of trying on clothes. With the growing convenience of online shopping, virtual try-on technology has become increasingly significant in the retail industry. This technology enhances the interactivity of the shopping experience and influences consumer purchasing decisions by allowing customers to mix, match, and visualize products on themselves without the constraints of physical conditions. In addition, the combination of virtual try-on technology and animation is further enhancing its realism and dynamic performance. However, the widespread adoption and accuracy of virtual try-on solutions still face significant technical challenges requiring further advancements. This paper examines the development of virtual try-on technology through the lens of deep learning methods. It analyzes the performance of various methods in handling product details, adapting to diverse datasets, and improving user experience. By exploring the limitations of current approaches and identifying potential areas for improvement, this study aims to provide a comprehensive perspective to advance research in the field of virtual try-on technology.
FaceCapGes: Real-Time Frame-by-Frame Gesture Generation from Audio, Facial Capture, and Head Pose
ABSTRACT. Recent AI-based gesture generation methods primarily rely on speech modalities to synthesize gestural motions for virtual avatars, reducing the need for manually crafted animations. However, these models typically require full speech or text segments as input, making them unsuitable for latency-sensitive applications such as live streaming or metaverse interactions.
Real-time gesture generation for user-facing applications poses significant challenges. Without access to future information, models must infer appropriate gestures solely from past inputs, while maintaining temporal alignment with speech and ensuring natural expressiveness. While prior work has explored facial cues to enhance generation quality, head pose — an easily captured modality that naturally accompanies speech — remains underutilized.
To address this, we propose FaceCapGes, a multimodal cascaded network that integrates speech, facial capture, and head pose for real-time gesture generation. Without relying on future context, our model incorporates head pose through a cascaded architecture to improve naturalness. This design is especially suited for seated or constrained settings, where users can drive expressive gestures for virtual avatars using only facial and head movements.
Subjective evaluations show that our method achieves naturalness on par with existing state-of-the-art models. The generated gestures demonstrate good alignment with speech and exhibit a significant advantage in real-time responsiveness. The model can run on lightweight devices such as an iPhone, provided the input is compatible with ARKit-based facial capture formats, enabling a wide range of real-time interactive scenarios.
Alphanumeric Fingerspelling: A New Large-Scale Dataset and Comparative Analysis of Methods for Indian Sign Language
ABSTRACT. Alphanumeric fingerspelling is a key component of Indian Sign Language (ISL), especially for expressing names, acronyms, and uncommon words. While a few datasets exist for ISL alphabet and number recognition, they are often small in size, lack diversity, and typically focus only on isolated hand images. In this paper, we present a new large-scale image dataset for ISL fingerspelling that captures full-body visuals along with the hand signs. This provides richer contextual information, which can improve real-world recognition tasks where background, body posture, and hand shape interplay. Our dataset spans all 36 alphanumeric signs (A–Z, 0–9), with varying sample counts ranging from 92 to 399 images per sign (average of 287 images per class), contributed by 80 diverse participants across varying lighting conditions and poses. The dataset includes a total of approximately 13,000 images, with letters generally having more samples (average 329 per letter) than digits (average 202 per digit). This makes it one of the most comprehensive and context-rich resources for ISL fingerspelling to date. We also conduct a comparative study using multiple state-of-the-art vision models to benchmark performance on this dataset. The results highlight the challenges of sign variability and the importance of holistic context in improving classification accuracy. We hope this resource serves as a foundation for future research in sign language recognition and inclusive communication technologies.
PLNet: Entropy-Guided Pseudo Label Refinement with Sensitivity-Specificity Enhancement for Medical Image Segmentation
ABSTRACT. Pseudo-labels play a crucial role in weakly supervised medical image segmentation. However, their quality is often compromised by uncertainty and noise, which can significantly degrade segmentation performance. To address this challenge, we propose a novel pseudo-label generation and optimization strategy designed to improve label reliability under weak supervision. Specifically, we employ an entropy-based weighted averaging mechanism that assigns greater importance to low-entropy predictions, thereby suppressing uncertain regions and enhancing structural consistency.
To further refine pseudo-labels, we introduce a sensitivity-specificity enhancement module that balances attention between foreground and background areas, effectively reducing false positives and false negatives. This dual strategy significantly mitigates noise interference and improves the stability and accuracy of the generated pseudo-labels. We integrate these components into a unified framework, PLNet (Pseudo Label Network), which is explicitly designed for high-quality pseudo-label learning.
Our segmentation backbone adopts a hybrid CNN-Transformer architecture, leveraging the strengths of CNNs in local feature extraction and Transformers in capturing global contextual dependencies. Additionally, we design a hybrid loss supervision mechanism that combines scribble supervision loss, co-teaching loss, and branch ensemble strategies, enabling the model to learn effectively from limited annotations and generalize better across diverse cases. Extensive experiments on multiple public medical imaging datasets demonstrate that our method achieves superior performance compared to state-of-the-art approaches, particularly in scenarios involving complex anatomical structures or incomplete annotations. The proposed PLNet notably enhances both segmentation accuracy and robustness under weak supervision.
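The entropy-based weighting is described above only at a high level; a hedged sketch of one way to fuse several soft predictions while down-weighting high-entropy (uncertain) ones, with the number of predictions K and the softmax weighting being assumptions, is:

```python
import torch

def entropy_weighted_pseudo_label(prob_maps, eps=1e-8):
    """Hedged sketch: fuse several soft predictions into one pseudo-label,
    giving more weight to low-entropy (confident) predictions.

    prob_maps: (K, C, H, W) softmax outputs from K predictions or branches.
    Returns a fused (C, H, W) probability map.
    """
    # Per-pixel entropy of each prediction: high entropy means uncertain.
    entropy = -(prob_maps * (prob_maps + eps).log()).sum(dim=1)      # (K, H, W)
    weights = torch.softmax(-entropy, dim=0).unsqueeze(1)            # (K, 1, H, W)
    fused = (weights * prob_maps).sum(dim=0)                         # (C, H, W)
    return fused / fused.sum(dim=0, keepdim=True).clamp_min(eps)
```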
Texture and Geometry Optimization for 3D Reconstruction
ABSTRACT. To obtain high-quality texture for the reconstructed 3D model, existing methods leverage the inherent coupling between camera poses, geometry, and texture, through joint optimization frameworks based on differentiable rendering. However, these approaches often struggle with complex textures, limited detail recovery, and unstable convergence. To overcome these challenges, we propose a dynamic joint optimization framework based on differentiable rendering that adaptively and collaboratively refines camera poses, mesh geometry, and texture mapping for 3D reconstruction. First, we introduce a novel multi-resolution pyramid residual bias optimization strategy that hierarchically decouples global structures from fine-grained local details. Second, we develop a dynamic threshold adjustment mechanism that adaptively modulates the optimization intensity of each component based on real-time loss trends. Finally, we employ a hybrid loss function that combines L1, perceptual, and SSIM losses to guide texture refinement. Extensive experiments on both synthetic and real-world datasets demonstrate that our approach achieves state-of-the-art performance, effectively suppressing error propagation, and improving geometric precision and texture realism.
Enhancing Small Object Detection in Aerial Images Under Multi-Weather Conditions Using Modified YOLO
ABSTRACT. To address the challenges of small object detection under complex weather conditions—particularly those with blurred features caused by rain and fog—this paper proposes We-YOLO, an enhanced YOLOv10-based model. The framework integrates an Enhanced Multi-path Feature Pyramid Network (EMFPN) for improved multi-scale feature fusion, a C2f-TKSA module that combines adaptive sparse selection with channel attention to suppress weather-induced noise, and a CSP-OMni block for better scale adaptability. A Dense-Mosaic data augmentation strategy is also introduced to enhance small-object feature learning. Experimental results show that We-YOLO achieves a 5.5% improvement on the custom WeDrone dataset, with additional gains of 5.1% on VisDrone and 4.4% on AI-TOD, while maintaining an average detection time of 21.8 ms, meeting real-time application requirements. The WeDrone dataset is available on GitHub.
BadHPE: Backdoor Attack for Precise Manipulation of Human Pose Estimation Models
ABSTRACT. Deep learning-based human pose estimation (HPE) models have demonstrated remarkable success, achieving high accuracy in predicting keypoints across various scenarios.
However, these models remain vulnerable to backdoor attacks, which can subtly manipulate predictions when specific triggers are introduced.
Existing backdoor attacks on HPE remain in their infancy, suffering from inconsistencies after image transformations or weak mappings between triggers and target outputs.
In this paper, we propose BadHPE, a novel backdoor attack targeting HPE models.
BadHPE embeds triggers as fake keypoints, establishing a direct mapping between trigger positions and model outputs.
To enhance effectiveness, we introduce adaptive triggers and poisoned indicators to maintain the consistency and performance of the victim model.
Evaluated on the MPII and COCO datasets, BadHPE achieves high poisoned accuracy and outperforms existing methods in generalizability and robustness.
It also demonstrates resilience against classical backdoor defenses, making it a highly effective and versatile attack.
Our code is now publicly available at https://anonymous.4open.science/r/BadHPE-E4AF/
Fuzzy Logic-Based Virtual Leading Path Planning for Autonomous Underwater Vehicles: An Evaluation Index Approach
ABSTRACT. Autonomous underwater vehicles (AUVs) face significant challenges in dynamic navigation due to unknown obstacles and constrained environments. This work proposes a local path planning algorithm for AUVs, utilizing a forward-looking sonar detection strategy that divides the visual region into three sections and evaluates object pixels. To facilitate obstacle avoidance and goal-directed movement, a fuzzy inference system is implemented for the design of the AUV's coordinate controller. A virtual leading AUV is introduced to guide the real AUV along an online trajectory. Path length rate and path maintenance rate are defined as evaluation indices to ensure the real AUV follows the planned path. The stability of the closed-loop control system is analytically proven using the Lyapunov function. Numerical simulations demonstrate the performance of the control system in a constrained environment.
Weakly Supervised Video Anomaly Detection via Temporal Dynamic Modeling and Semantic-Assisted Approach
ABSTRACT. The widespread application of video surveillance systems has made anomaly event detection a critical technology in the field of security. Traditional frame-level supervised methods suffer from high annotation costs and an emphasis on low-level features, while weakly supervised approaches often rely on single modalities or static clip modeling, struggling to balance semantic discrimination and temporal coherence, which easily leads to missed detections and false alarms. To tackle these challenges, this paper proposes a weakly supervised learning-based anomaly detection method that relies solely on video-level labels. We design a Temporal Dynamic Modeling (TDM) module that integrates global and local self-attention, incorporates dynamic relative position encoding (DRPE), and employs an adaptive fusion network (AFN) to address the challenges of modeling long- and short-term dependencies and mitigating temporal jitter. Additionally, we construct a Semantic Assisted Anomaly Recognition (SAAR) module, which leverages external knowledge bases to generate semantic cues and achieves cross-modal alignment, compensating for the semantic deficiencies of single visual modalities. Finally, we adopt a Time Series Confidence Smoothing (TCS) strategy to reduce false alarms and enhance detection coherence. Experimental results demonstrate that this method achieves AUCs of 97.64% and 85.62%, along with an AP of 85.13%, on the ShanghaiTech, UCF-Crime, and XD-Violence datasets, respectively.
DAN: Dual-Attention Network for Occlusion-Robust 3D Hand Pose Estimation
ABSTRACT. Accurate 3D hand pose estimation from depth images is a fundamental yet challenging task in computer vision due to self-occlusion and the complex spatial dependencies between hand joints. Therefore, to improve the accuracy and robustness of 3D hand pose estimation, in this paper we propose a dual attention network (DAN) that effectively captures joint-discriminative features in a coarse-to-fine manner. First, a ResNet-based backbone is employed to extract multi-scale shallow features from the input depth image. These features are then refined by a dual attention module, consisting of spatial and channel attention mechanisms, to highlight joint-related information. Finally, to regress accurate 3D joint positions, we introduce a two-branch decoder: one branch estimates coarse 3D hand joint positions directly, while the other predicts depth offsets to refine the final output. Our method is evaluated on the NYU and ICVL datasets and achieves competitive performance. Experimental results demonstrate that the proposed DAN enhances hand pose estimation by focusing on informative regions and refining joint localization.
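DAN's exact attention design is not reproduced here; as a hedged sketch of a generic dual attention block (channel attention followed by spatial attention, in the style of CBAM), one might write:

```python
import torch
import torch.nn as nn

class DualAttention(nn.Module):
    """Hedged sketch of sequential channel + spatial attention (CBAM-style)."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial_conv = nn.Conv2d(2, 1, spatial_kernel,
                                      padding=spatial_kernel // 2, bias=False)

    def forward(self, x):                                 # x: (B, C, H, W)
        b, c, _, _ = x.shape
        # Channel attention: squeeze spatial dims, re-weight channels.
        avg = x.mean(dim=(2, 3))
        mx = x.amax(dim=(2, 3))
        ca = torch.sigmoid(self.channel_mlp(avg) + self.channel_mlp(mx))
        x = x * ca.view(b, c, 1, 1)
        # Spatial attention: squeeze channels, re-weight locations.
        sa_in = torch.cat([x.mean(dim=1, keepdim=True),
                           x.amax(dim=1, keepdim=True)], dim=1)
        sa = torch.sigmoid(self.spatial_conv(sa_in))
        return x * sa
```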
Coordinate-Aware Implicit Image Function for Arbitrary-Scale Image Super-Resolution
ABSTRACT. Real-world images commonly suffer from an imbalanced distribution of smooth regions and texture-rich hard regions. Current super-resolution methods mainly employ uniform sampling strategies for training data acquisition, where the massive number of smooth samples significantly affects reconstruction performance in hard regions. Meanwhile, the imbalance in model optimization can also reduce attention to high-frequency details. To address these imbalance problems, we propose a Coordinate-Aware Sampler, which performs pixel-level selection directly on high-resolution references and transfers the corresponding coordinates to guide input queries for high-resolution reconstruction. Furthermore, we design a Confidence-Regulated Decoder, which employs progressive refinement with a point-wise attention strategy to adaptively fuse coarse and fine predictions and recover details based on regional complexity. Through evaluation across diverse models, datasets, and scaling factors, our method demonstrates consistent performance advantages over existing approaches, particularly in reconstructing high-frequency hard image regions.
Real-Time EEG-Based Fear Intensity Estimation for Virtual Reality Exposure Therapy
ABSTRACT. Electroencephalography (EEG) has been widely applied in emotion recognition due to its non-invasive nature and real-time responsiveness. In particular, fear intensity estimation in virtual reality (VR)-based therapies remains a promising yet underexplored application. Despite advances in classification-based methods, accurate continuous regression of fear intensity is hindered by noisy labeling and limited model capability. To address this gap, we propose a real-time EEG-based regression framework incorporating a refined labeling strategy and a novel model, 4DCNN-Transformer (4DCT). The model leverages multi-scale and grouped convolutions, a residual attention mechanism, and frequency-domain features including power spectral density, spectral centroid, peak frequency, and differential entropy. We implement a VR-based system and validate performance under LOTO and LOSO settings. Experimental results show that 4DCT significantly outperforms baseline models in both cross-validation accuracy and subjective user evaluations, while maintaining competitive computational latency.
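The feature set is listed but not defined in the abstract; a hedged sketch of these frequency-domain features for a single EEG channel (band limits, sampling rate, and the Gaussian assumption behind differential entropy are placeholders) could be:

```python
import numpy as np
from scipy.signal import welch

def band_features(eeg, fs=128, band=(4.0, 45.0)):
    """Hedged sketch of the named frequency-domain features for one channel:
    band power (from the PSD), spectral centroid, peak frequency, and
    differential entropy under a Gaussian assumption."""
    freqs, psd = welch(eeg, fs=fs, nperseg=min(len(eeg), int(2 * fs)))
    sel = (freqs >= band[0]) & (freqs <= band[1])
    f, p = freqs[sel], psd[sel]

    band_power = np.trapz(p, f)                         # total power in the band
    centroid = np.sum(f * p) / (np.sum(p) + 1e-12)      # spectral centroid
    peak_freq = f[np.argmax(p)]                         # frequency of max PSD
    # Differential entropy of a Gaussian signal: 0.5 * log(2*pi*e*variance).
    de = 0.5 * np.log(2 * np.pi * np.e * (np.var(eeg) + 1e-12))
    return np.array([band_power, centroid, peak_freq, de])
```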
Feature Replacement in Gaussian Splatting for 3D Stylization
ABSTRACT. 3D generation and editing, as creative tools, have attracted increasing attention, with 3D stylization being a significant area of scene editing. However, existing 3D stylization methods suffer from several limitations, including the need for retraining for new style inputs, inadequate separation of scene content and style information, and mismatches between the stylized results and the reference style images. In this paper, we introduce a feature replacement module that utilizes a reversible network to decouple content and style features, ensuring the effective substitution of style information while preserving scene content. Additionally, we propose a Feature Chamfer Loss to align the high-dimensional feature space of the generated image with that of the reference style image, improving consistency and visual coherence. Experimental results demonstrate that our method outperforms existing techniques in terms of generation quality and multi-view consistency, advancing the state of 3D scene stylization.
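The Feature Chamfer Loss is described only at a high level; a hedged sketch of a symmetric Chamfer distance between two sets of feature vectors (e.g., per-location features of the stylized render and of the reference style image; how the features are extracted is an assumption) is:

```python
import torch

def feature_chamfer_loss(feat_a, feat_b):
    """Symmetric Chamfer distance between two feature sets.

    feat_a: (N, D) and feat_b: (M, D) feature vectors, e.g. per-location features
    flattened from the generated image and the reference style image.
    """
    d = torch.cdist(feat_a, feat_b)          # (N, M) pairwise distances
    a_to_b = d.min(dim=1).values.mean()      # each generated feature -> nearest style feature
    b_to_a = d.min(dim=0).values.mean()      # each style feature -> nearest generated feature
    return a_to_b + b_to_a
```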
RelieFormer: End-to-End Bas Relief Generation via Transformer
ABSTRACT. Bas relief is a sculpture made from a flat surface, in which the sculpture slightly projects from the background. Generating bas relief has traditionally required extensive manual labor. With the advancement of geometric deep learning, predicting the height field of a relief from an input height field using neural networks has become possible, saving time and labor. This paper introduces RelieFormer, which integrates attention mechanisms into the bas relief generation task, enabling the production of a bas relief height field in a coarse-to-fine manner. The main contributions of this work are three-fold: first, the introduction of RelieFormer, which combines the local feature learning capabilities of Convolutional Neural Networks (CNNs) with the powerful feature extraction and generalization abilities of transformers; second, the construction of a high-quality bas relief dataset for public use; and third, the demonstration of RelieFormer's effectiveness through comprehensive experimental evaluations. This work advances the field by presenting a robust framework for generating detailed and accurate bas relief representations.
Multi-Granularity Feature Extraction Based on Long-Short Chains for Motion Retargeting
ABSTRACT. Motion retargeting is an issue of great significance in the fields of computer graphics and vision. However, current motion retargeting methods struggle with two main problems, both of which affect the quality of the retargeting. 1) A single granularity in spatial modeling of the skeleton cannot simultaneously ensure both the integrity of the motion and the accuracy of the details. 2) Incorrect joint set partitioning methods lead to an inability to learn effective local features. In this paper, a multi-granularity feature extraction method based on long-short chains is proposed to address these problems. By leveraging biomechanical prior knowledge, we partition joint chains at different granularities and utilize an attention mechanism to model the spatial relationships of character motion in a multi-granularity manner, extracting hierarchical features. Specifically, we use three granularities to partition the joint set: the All Joint Set, the Long Chain Joint Set, and the Short Chain Joint Set. The All Joint Set covers all joints and is used to extract global features. The long chain and short chain joint sets correspond to motion chains from one end joint to another end joint, and from an end joint to the central joint, respectively, and are employed to extract local motion features. Finally, a feature fusion module is used to blend multi-granularity features for motion retargeting. Extensive evaluations demonstrate that our approach significantly outperforms state-of-the-art methods. Furthermore, visualization results indicate that our method can reduce prediction errors in motion retargeting.
D^2-Diff: Controllable Fashion Image Generation with Disentangled Style and Content
ABSTRACT. Fashion image generation has gained increasing attention due to its applications in virtual try-on, online retail, and creative design.
However, precisely controlling style and content features remains a significant challenge, especially when guided by both text and reference style images.
To address this problem, we propose D^2-Diff, a disentangled diffusion framework that enables controllable fashion image generation. Specifically, D^2-Diff disentangles style attributes (e.g., texture, color) from content attributes (e.g., silhouette) by training with text-image pairs. To support this disentanglement, we design a dynamic gated cross-modal attention fusion module to encode semantic and stylistic cues into separate attention pathways. In addition, we construct a new multimodal fashion dataset consisting of text-image pairs, where each pair is designed to express either content or style, in an alternating manner, addressing the lack of suitable datasets for disentangled fashion generation.
Extensive experiments show that D^2-Diff effectively captures subtle design elements and achieves a faithful and nuanced expression of fashion styles.
MELD: A Multi-Element Latent Diffusion Model for Urban Greenspace Layout Generation
ABSTRACT. Layout generation for urban scenes remains a challenging task. While most prior work concentrates on synthesizing road networks, city blocks, and parcel patterns, the design of urban greenspaces poses additional difficulties: these areas exhibit intricate internal relationships and geometries that go well beyond simple rectangles. Furthermore, the inner relationships between greenspace components are not well-captured by existing generative models. To address this gap, we introduce MELD, a Multi-Element Latent Diffusion model for the generation of urban greenspace layouts. MELD leverages structured scene encoding and compact latent-space diffusion to generate semantically rich and spatially coherent environments. Greenspace scenes are decomposed into greenspace, metadata, flora, and non-flora regions, each described by category, geometry, location, and size. Experimental results demonstrate that MELD produces diverse, semantically aligned greenspace layouts, providing a scalable foundation for AI-assisted urban planning and generative design.
BIDP: Brain-Inspired Dual-Process CNN-Transformer for Salient Object Detection
ABSTRACT. In the extensive research on salient object detection, most existing methods can accurately predict salient regions. However, the perceptual accuracy of salient object boundaries remains suboptimal. Inspired by the brain's dual-process mechanism, we design a novel Pyramid Efficient Channel Attention U-Net (PECA-U-Net) model. It primarily handles logical and local detail processing, facilitating the extraction of initial salient features. Subsequently, we develop a Residual Efficient Channel Attention Transformer (RECA-Transformer) model that focuses on global details and holistic perception, further refining edge details. As a result, this sequential integration of PECA-U-Net and RECA-Transformer ultimately achieves comprehensive, boundary-aware predictions of salient objects. Specifically, the model integrates a robust encoder-decoder network with the RECA-Transformer, responsible for initial saliency detection and subsequent refinement of the saliency map. Additionally, we introduce an innovative boundary optimization loss function that leverages attention mechanisms, including both base and boundary losses. By adjusting the weight of the boundary loss, the model encourages a greater focus on boundary details. Comprehensive experiments and evaluations across five benchmark datasets validate the proposed model's superior performance in salient object detection. The code is published at https://github.com/1245179435/PECA-REAT.
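The boundary optimization loss is described only as a base loss plus a weighted boundary loss; a hedged sketch that extracts a boundary band from the ground-truth mask with a morphological gradient (max-pool minus min-pool) and up-weights errors there might look like this, with the band width and weight being assumptions:

```python
import torch
import torch.nn.functional as F

def boundary_aware_bce(pred, target, boundary_weight=2.0, width=5):
    """Hedged sketch: BCE base loss plus an extra-weighted term on a thin band
    around the ground-truth object boundary.

    pred: (B, 1, H, W) logits; target: (B, 1, H, W) float binary mask in {0, 1}.
    """
    pad = width // 2
    dilated = F.max_pool2d(target, width, stride=1, padding=pad)
    eroded = -F.max_pool2d(-target, width, stride=1, padding=pad)
    boundary = (dilated - eroded).clamp(0, 1)            # 1 on the boundary band

    base = F.binary_cross_entropy_with_logits(pred, target)
    per_pixel = F.binary_cross_entropy_with_logits(pred, target, reduction="none")
    bnd = (per_pixel * boundary).sum() / boundary.sum().clamp_min(1.0)
    return base + boundary_weight * bnd
```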
SigNet-TAM: Indian Sign Language Recognition with ResNet50-BiLSTM and Temporal Attention Mechanism
ABSTRACT. This paper presents SigNet-TAM, an architecture designed for isolated Indian Sign Language (ISL) recognition, evaluated on a custom dataset comprising 20 location-based signs performed by 50 signers. The proposed approach integrates a ResNet-50 backbone for spatial feature extraction, bidirectional LSTM layers for capturing temporal dynamics, and a temporal attention mechanism that adaptively weights informative frames. On this challenging dataset, SigNet-TAM achieves a top-1 accuracy of 81.5% and a macro F1-score of 81.0%, demonstrating a strong trade-off between computational efficiency and recognition accuracy. Extensive ablation studies analyze the impact of architectural choices, and confusion matrix analyses highlight common misclassifications, providing insights into difficult sign categories. The dataset, which includes varied lighting conditions and signer diversity, further establishes its value as a benchmark for advancing robust ISL recognition systems.
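The abstract outlines a ResNet-50 + BiLSTM + temporal attention pipeline; a hedged sketch of the temporal-attention pooling step (scoring each frame's BiLSTM output and taking a weighted sum; all dimensions and the surrounding pipeline names are assumptions) is:

```python
import torch
import torch.nn as nn

class TemporalAttentionPooling(nn.Module):
    """Hedged sketch: weight each frame's BiLSTM output by a learned score and
    pool the sequence into a single clip-level descriptor."""
    def __init__(self, feat_dim):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, seq):                               # seq: (B, T, D) BiLSTM outputs
        alpha = torch.softmax(self.score(seq), dim=1)     # (B, T, 1) frame weights
        return (alpha * seq).sum(dim=1)                   # (B, D) pooled feature

# In a SigNet-TAM-like pipeline (names are assumptions):
#   frame_feats = resnet50_features(frames)   # (B, T, 2048) per-frame features
#   seq, _ = bilstm(frame_feats)              # (B, T, 2*hidden)
#   clip_feat = TemporalAttentionPooling(seq.size(-1))(seq)
#   logits = classifier(clip_feat)
```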