A Two-Stage Controllable Co-Speech Gesture Generation Method
ABSTRACT. Co-speech gesture generation plays a key role in virtual reality interaction, synthesizing appropriate digital-human actions driven by speech. Generative methods create realistic gestures that follow the rhythm and semantic content of speech, improving the interactive experience. To match the persona of a digital human, however, generated gestures often require additional modification before being applied to a virtual character, and motion sequences are difficult to edit when they are produced from hidden motion representations. To make motion synthesis editable, we develop a two-stage controllable pipeline for co-speech gesture generation. In stage 1, a novel large-language-model-based IK-Decoder takes speech and a style label as input and synthesizes a time series of inverse-kinematics style control points, which are highly editable. In stage 2, the motion sequence is divided into body and finger parts for separate VQ-based latent motion representation learning, and a diffusion-based IK-Denoiser synthesizes latent motion representations conditioned on the control-point time series. In co-speech gesture generation experiments, the training loss converges to a stable state, indicating the stability of the modules. Compared with other representative algorithms, the proposed method achieves competitive performance on metrics such as Fréchet Gesture Distance, Beat Consistency, and Diversity. For multi-style generation, the joint-location distributions of different styles are clearly distinguished. To demonstrate controllability, we further provide three explicit control strategies for motion editing. By introducing control points, we offer a new co-speech gesture generation paradigm, and the experiments demonstrate its effectiveness.
Multi-Agent Learning with Hierarchical Prompt for Efficient 3D Human Pose Estimation in Virtual Reality
ABSTRACT. In virtual reality (VR) applications, real-time and robust 3D human pose estimation is paramount to enhance user experience, yet existing methodologies often encounter challenges such as high computational burden, occlusion sensitivity, and inadequate adaptation to complex actions. To mitigate these issues, we propose a novel 3D human pose estimation method based on a multi-agent hierarchical prompting architecture. This method achieves efficient heatmap prediction in the local feature space through a parallel-agent architecture, while simultaneously integrating a hierarchical loss function and dynamic context modeling. It incorporates virtual avatars' geometric constraints into network training, thereby enhancing pose plausibility and effectively addressing occlusion and intricate actions. Moreover, it substantially improves cross-frame stability and estimation accuracy in occlusion scenarios through multiview spatiotemporal consistency optimization. Compared to existing methods, the proposed framework provides adaptability to the unique demands of virtual environments with reduced computational cost. We experimentally validate our approach on the widely used Human3.6M and MPI-INF-3DHP datasets, and further demonstrate through ablation experiments that the dynamic occlusion compensation module, which fuses multimodal perception with a spatio-temporal diffusion mechanism, significantly enhances the robustness of pose estimation under occlusion scenarios with virtual costumes.
Turbulence Estimation in Smoke Simulation via Curvature-Based Image Features
ABSTRACT. Smoke simulation is a crucial element in entertainment applications, such as movies and video games. In particular, the required smoke texture varies depending on the scene, ranging from smooth textures for cigarette smoke to highly irregular ones for explosions. The texture of smoke is primarily influenced by light sources, scattering properties, and turbulence components. Light sources and scattering properties affect the appearance of smoke during rendering, influencing its color, brightness, and density. Turbulence components significantly influence the shape of smoke. While many methods have been proposed for generating turbulence components, they all require manual adjustment of turbulence parameters, which is both time-consuming and labor-intensive. To address this issue, we propose a method for easily generating the desired turbulence by estimating turbulence parameters from images. In our approach, users input images including the desired turbulence components, and optimal parameters representing those components are automatically estimated. This reduces the time and effort required for parameter adjustment, allowing the desired turbulence components to be represented more efficiently.
Prototype XR Elastodynamics System for Disaster Medical Response
ABSTRACT. This paper presents a prototype XR system for disaster medical response, demonstrating the feasibility of real-time interactive elastodynamics simulations in emergency scenarios. The system delivers an end-to-end workflow utilizing XR technology, from on-site data acquisition to remote simulation. Specifically, we propose an image-guided mesh-processing pipeline that converts photographs of injured individuals into solver-ready tetrahedral meshes. We also develop a constraint-based elastodynamics solver capable of simulating deformable bodies and visualizing internal stresses. Additionally, the system integrates multiple advanced XR devices and addresses the coordinate-alignment problem between these devices and the simulator. We validate the system's performance in both AR/VR modes, under textured and stress-visualization configurations, and demonstrate its applicability for remote medical guidance. Beyond whole-body elastic simulations, we conduct preliminary organ-level experiments to inform future remote surgical applications. This prototype, validated using a two-room setup, provides a feasible solution for remote emergency medical response.
Person-In-Situ: Scene-Consistent Human Image Insertion with Occlusion-Aware Pose Control
ABSTRACT. Compositing human figures into scene images has broad applications in areas such as entertainment and advertising. However, existing methods often cannot handle occlusion of the inserted person by foreground objects and unnaturally place the person in the frontmost layer. Moreover, they offer limited control over the inserted person's pose. To address these challenges, we propose two methods. Both allow explicit pose control via a 3D body model and leverage latent diffusion models to synthesize the person at a contextually appropriate depth, naturally handling occlusions without requiring occlusion masks. The first is a two-stage approach: the model first learns a depth map of the scene with the person through supervised learning, and then synthesizes the person accordingly. The second method learns occlusion implicitly and synthesizes the person directly from input data without explicit depth supervision. Quantitative and qualitative evaluations show that both methods outperform existing approaches by better preserving scene consistency while accurately reflecting occlusions and user-specified poses.
StencilQR: Connectivity-enhanced Fabricable QR Codes for Stencil
ABSTRACT. Quick Response (QR) codes are ubiquitous in logistics, URL embedding, and digital transactions due to their ability to store and convey data efficiently. Despite their widespread use, applying QR codes to materials unsuitable for stickers or impractical for laser engraving remains challenging. Stenciling offers a viable alternative, yet creating QR code stencils introduces difficulties in maintaining connectivity and structural integrity, as isolated regions (islands) can detach during fabrication. This paper presents StencilQR, a novel adaptation of QR codes specifically designed for stencil applications. StencilQR addresses the island issue by detecting and connecting these areas with strategically placed bridges, thereby enhancing structural integrity. Furthermore, StencilQR reinforces weak connections, such as cantilever-like parts, by employing a mass-spring simulation to efficiently identify and fortify fragile regions. Our evaluations demonstrate that StencilQR is robust across diverse applications and materials, ensuring high-fidelity data encoding and decoding while maintaining durability during stencil creation and use. Moreover, because StencilQR does not exploit structures unique to QR codes, it can be readily adapted to other 2D patterns requiring minimal yet effective connectivity modifications.
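For reference, the island-detection step described above can be illustrated with a minimal sketch: treating the stencil's remaining material as a binary grid, any connected solid region that does not reach the outer frame would detach and therefore needs a bridge. The function name and grid encoding below are illustrative assumptions, not the authors' implementation, and the bridge placement and mass-spring reinforcement are omitted.

```python
# Minimal sketch (not the authors' implementation): flag "islands" in a stencil,
# i.e. connected solid regions that do not touch the outer frame and would
# therefore detach when the surrounding modules are cut out.
import numpy as np
from scipy.ndimage import label

def find_islands(solid):
    """solid: 2D boolean grid, True where material remains after cutting."""
    labels, n = label(solid)                      # 4-connected components
    islands = []
    for k in range(1, n + 1):
        mask = labels == k
        touches_frame = (mask[0].any() or mask[-1].any()
                         or mask[:, 0].any() or mask[:, -1].any())
        if not touches_frame:                     # unsupported, needs a bridge
            islands.append(mask)
    return islands

# toy example: a closed ring of cut-out modules leaves its centre unsupported
grid = np.ones((7, 7), dtype=bool)
grid[2:5, 2:5] = False        # cut-out ring ...
grid[3, 3] = True             # ... around a solid centre module
print(len(find_islands(grid)))  # -> 1
```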
Recognize Me If You Can: Two-stream Adversarial Transfer for Facial Privacy Protection using Fine-grained Makeup
ABSTRACT. The popularity of social media brings a surge in privacy risks, e.g., the abuse of face recognition (FR) systems for excessive surveillance. Adversarial attack techniques can prevent the unauthorized recognition of facial images, but at the cost of low visual quality. Recent attempts to integrate adversarial perturbations with makeup transfer demonstrate improved natural appearance. However, a conflict remains between visual quality and adversarial effectiveness in facial privacy protection. Existing works focus merely on adjusting the adversarial network framework to achieve a rough balance. Instead, we conduct a theoretical analysis to break through this trade-off. We observe that identity-related features and makeup information occupy distinct frequency bands within an image. Based on this insight, we decompose the image to separate these two sets of features, which enables fine-grained makeup transfer independent of the adversarial generation. Accordingly, we design a two-stream adversarial transfer network. Consequently, we successfully protect facial privacy against malicious black-box FR systems with high transferability and visual quality. Extensive experiments demonstrate that our solution defends against two commercial APIs (i.e., Face++ and Aliyun) with little image quality degradation.
Transforming Time and Space: Efficient Video Super-Resolution with Hybrid Attention and Deformable Transformers
ABSTRACT. Space-time video super-resolution (STVSR) aims to enhance low frame rate (LFR) and low resolution (LR) videos into high frame rate (HFR) and high resolution (HR) outputs. Traditional two-stage methods decompose STVSR into video super-resolution (VSR) and video frame interpolation (VFI), resulting in significant computational overhead. The challenge of designing efficient and high-performance one-stage STVSR methods remains unsolved. While Transformer-based one-stage approaches have shown promise by processing frames in parallel and effectively capturing temporal dependencies, they suffer from large model sizes, hindering practical applications. Key to optimizing such methods is the effective utilization of extracted features, as improper feature management can degrade performance. In this work, we propose a novel one-stage STVSR framework, termed DHAT, which leverages guided deformable attention (GDA) and hybrid attention mechanisms. In the feature propagation stage, we introduce a recurrent feature refinement module based on GDA, balancing parallelism with recurrent processing. Additionally, we design a hybrid attention block that combines cross-attention and channel-attention, enabling refined spatio-temporal feature aggregation. The cross-attention mechanism plays a pivotal role in fusing multi-scale temporal information across frames. Extensive experiments demonstrate that DHAT outperforms state-of-the-art methods on several benchmark datasets, achieving superior performance as evidenced by higher PSNR and SSIM scores.
Exploring Structural Lines for Interior Floorplan Segmentation
ABSTRACT. In this work, we address the semantic segmentation of various types of 2D floorplans.
Previous works mainly focus on the segmentation of furnished floorplans and are not competent at segmenting floorplans with bare walls and very few semantic symbols and furniture, e.g., typical finishing floorplans, because these methods usually neglect important structural information, such as walls and doors, for the floorplan segmentation as shown in Fig.~\ref{fig:teaser}.
In contrast, interior designers determine the semantics of different rooms based on structural information, e.g., straight walls and curved doors, for further refurbishing and furnishing design.
Based on this observation, we propose a structural line primitive-based framework to tackle this problem by incorporating line contexts and mutual line relations with a Transformer-based network.
Besides, we further collect an interior finishing floorplan dataset with more diverse semantic labels for evaluation.
When applied to the proposed dataset, the method outperforms existing approaches by a large margin (+5.33\% mIoU, +4.10\% mAcc).
Experiments on furnished floorplan segmentation datasets showcase that the proposed method also outperforms previous counterparts with 80.48(+2.95)\% mIoU, 89.48(+1.68)\% mAcc on R2V, 89.12(+4.92)\% mIoU, 93.87(+2.56)\% mAcc on CubiCasa-5k.
The project will be released.
OSH-Splat: Optimizable Semantic Hyperplanes for Enhanced 3D Language Feature Gaussian Splatting
ABSTRACT. With rapid technological advances in computer vision, building 3D language field models to support open language queries in 3D space has recently received increasing attention. This article introduces OSH-Splat, which constructs a 3D language field that allows accurate and efficient open-ended lexical queries in 3D space. First, we use the Segment Anything Model (SAM) to obtain hierarchical semantic information divided into three semantic levels (part, subpart, and whole), which not only resolves target ambiguity but also yields pixel-aligned CLIP embeddings. Then, the memory requirement is reduced by a pair of scene-specialized encoder and decoder, and the semantic features are learned as 3D Gaussian Splatting (3DGS) features in the second training stage, yielding an expanded 3D language field that supports open semantic queries. Furthermore, we propose the Optimizable Semantic Hyperplane (OSH), an innovative query strategy that enhances our 3D language feature Gaussians; it moves away from traditional methods that rely on a fixed empirical threshold and shows better accuracy and robustness in 3D semantic segmentation tasks. For each text query, the OSH is iteratively optimized with the help of a Referring Expression Segmentation (RES) model to achieve accurate target-region localization. Extensive experimental results show that our approach outperforms state-of-the-art methods.
BWANet: A low-light image detail enhancement network with multi-scale large receptive fields
ABSTRACT. Recently, enhancement methods based on large convolutional kernels have significantly improved model performance by expanding the receptive field. However, these approaches often neglect the inherent multi-scale nature of images and lack global modeling capabilities. Furthermore, most existing methods fail to dynamically adapt to frequency-specific information, limiting their ability to reconstruct fine texture details. To address these challenges, we propose a novel low-light image enhancement network, termed BWANet. Specifically, we design a Parallel Large Kernel Convolution Block (PLKB) that captures multi-scale structural features using multi-branch large kernel convolutions. Additionally, we introduce a Frequency-domain Channel Attention (FCA) mechanism to enhance global feature representation. Moreover, this paper introduces the Bidirectional Wavelet Transform Attention Block (BWAB), which enables the effective recovery of fine-grained texture information through multi-frequency sub-band decomposition and adaptive modulation. Extensive experiments on seven publicly available low-light datasets demonstrate that our model achieves outstanding performance with only 6.68 GFLOPs and 4.32M parameters.
Guiding Steady Fluid Flow with Terrain-Based Repulsive Forces
ABSTRACT. We introduce a novel approach for controlling steady fluid flows, such as rivers and waterfalls, simulated using Smoothed Particle Hydrodynamics (SPH). Our method enables the user to intuitively guide fluid motion by specifying target points through which the fluid should pass. Repulsive forces generated during fluid-terrain interaction are utilized to dynamically control the flow, guiding it to pass around the target points. Our method iteratively updates the repulsive forces by the following two processes: first, it identifies critical positions on the terrain that significantly influence fluid behavior near the target points; second, it dynamically adjusts the magnitudes of these forces using a feedback control mechanism. This framework allows the user to locally control steady fluid flows. We demonstrate its effectiveness through several examples, emphasizing its capability to create visually realistic and user-directed fluid motion with low computational cost.
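As an illustration of the feedback-control idea described above, the sketch below adjusts a single repulsive-force magnitude in proportion to how far the flow still misses a target point. The SPH solver, the critical-position search on the terrain, and the paper's actual control law are not reproduced; all names and constants are assumptions.

```python
# Illustrative proportional feedback (assumed form, not the paper's exact controller):
# increase a terrain repulsive-force magnitude while the flow misses the target point,
# relax it once particles pass close enough.
import numpy as np

def update_force_magnitude(force_mag, particle_positions, target, r_pass=0.1, gain=0.5):
    """One feedback step: returns the adjusted force magnitude."""
    d_min = np.min(np.linalg.norm(particle_positions - target, axis=1))
    error = d_min - r_pass                # > 0: flow still misses the target region
    return max(0.0, force_mag + gain * error)

# usage inside the (not shown) SPH loop:
# force_mag = update_force_magnitude(force_mag, positions, np.array([1.0, 0.5, 0.0]))
```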
Knowledge Graph Completion by Integrating Local and Global Features
ABSTRACT. Most existing knowledge graph completion (KGC) models primarily focus on local semantic features while neglecting the influence of global structural information among entity relationships. To address this issue, we propose CRAKGC, a knowledge graph completion model that integrates both local and global features. The model employs a local feature extraction module to capture semantic interactions between entities and relations, while a global feature extraction module captures deep structural dependencies among entity relationships. Experimental results demonstrate that, compared to traditional models, CRAKGC improves Hits@3 by 2.7% and Hits@10 by 3.1% on the FB15K-237 dataset. On the WN18RR dataset, it achieves a 2.2% increase in MRR, a 2.9% increase in Hits@1, and a 5.3% increase in Hits@3. On the Kinship dataset, it improves MRR by 2.2%, Hits@3 by 0.9%, and Hits@10 by 0.8%, demonstrating the effectiveness of the proposed model.
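For reference, the reported MRR and Hits@k figures follow the standard link-prediction definitions, computed from the rank assigned to each ground-truth entity; a minimal sketch (not the authors' evaluation code) is given below.

```python
# Standard link-prediction metrics cited in the abstract (MRR, Hits@k),
# computed from the rank of each ground-truth entity (1 = best).
def mrr(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    return sum(1 for r in ranks if r <= k) / len(ranks)

ranks = [1, 3, 2, 10, 50]
print(round(mrr(ranks), 3), hits_at_k(ranks, 3), hits_at_k(ranks, 10))
```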
TerraFusion: Joint Generation of Terrain Shape and Texture Using Latent Diffusion Models
ABSTRACT. 3D terrain models are essential in fields such as video game development and film production. Since surface color often correlates with terrain geometry, capturing this relationship is crucial to achieving realism. However, most existing methods generate either a heightmap or a texture, without sufficiently accounting for the inherent correlation. In this paper, we propose a method that jointly generates terrain heightmaps and textures using a latent diffusion model. First, we train the model in an unsupervised manner to randomly generate paired heightmaps and textures. Then, we perform supervised learning of an external adapter to enable user control via hand-drawn sketches. Experiments show that our approach allows intuitive terrain generation while preserving the correlation between heightmaps and textures.
Bimodal VIT Adaptive Cross-Fusion Micro-Expression Recognition Based on ECG-QRS Waves
ABSTRACT. Combining multiple modalities is an important research direction in micro-expression recognition, yet the scarcity of multimodal micro-expression data has slowed its development. In this paper, based on existing multimodal micro-expression data comprising ECG signals and video signals, we propose an adaptive cross-fusion micro-expression recognition algorithm based on the ECG QRS wave for a bimodal Vision Transformer. First, to represent the emotional content of ECG signals, we construct two-dimensional QRS temporal-domain emotion features and extract deep ECG features with a convolutional network. Second, to address the multimodal fusion problem, we propose a bimodal Vision Transformer adaptive cross-fusion algorithm: the deep ECG features and RGB features are fed into a shallow Transformer module, a bimodal multi-head attention mechanism is built on the two features to realize feature cross-fusion and adaptive weight extraction, and fusion is further enhanced with adaptively weighted feed-forward features. Finally, the adaptive cross-fusion algorithm is assembled, effectively addressing the fusion problem in multimodal micro-expression recognition. We validate the algorithm on the CAS(ME)3 dataset with four-class recognition. The experimental results show that, compared with unimodal RGB micro-expression recognition, the multimodal recognition combined with ECG signals significantly improves model performance, reaching a recognition rate of 0.9699 and demonstrating the algorithm's effectiveness and feasibility.
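A minimal sketch of the kind of bimodal cross-attention fusion described above follows: ECG tokens query RGB tokens and vice versa, and the two streams are combined with a learned adaptive weight before a feed-forward layer. Module names, pooling, and wiring are illustrative assumptions rather than the paper's exact architecture.

```python
# Illustrative bimodal cross-attention fusion (names and exact wiring are assumptions,
# not the paper's architecture): each modality queries the other, and the two fused
# streams are combined with a learned adaptive weight.
import torch
import torch.nn as nn

class BimodalCrossFusion(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.ecg_to_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.rgb_to_ecg = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.alpha = nn.Parameter(torch.tensor(0.5))   # adaptive fusion weight
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, ecg_tokens, rgb_tokens):
        # ECG queries attend over RGB tokens, and vice versa
        e, _ = self.ecg_to_rgb(ecg_tokens, rgb_tokens, rgb_tokens)
        r, _ = self.rgb_to_ecg(rgb_tokens, ecg_tokens, ecg_tokens)
        w = torch.sigmoid(self.alpha)
        fused = w * e.mean(dim=1) + (1 - w) * r.mean(dim=1)   # pooled joint embedding
        return fused + self.ffn(fused)

x_ecg, x_rgb = torch.randn(2, 16, 256), torch.randn(2, 49, 256)
print(BimodalCrossFusion()(x_ecg, x_rgb).shape)   # torch.Size([2, 256])
```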
PDFT: Parameter-Diminish Fine-Tuning for Transformer-based Models
ABSTRACT. Recent research in deep learning has focused on various large models that excel in many non-industrial environments. However, deploying these models in practical applications often faces significant computational and storage challenges. The research community commonly uses knowledge distillation methods to optimize the spatial dimensions of large models to meet industrial requirements. Nevertheless, knowledge distillation typically separates the processes of knowledge transfer and downstream adaptation. To solve this, we propose Parameter-Diminish Fine-Tuning (PDFT), a technique that compresses Transformer-based large models during fine-tuning, enabling initial lightweighting on downstream datasets without significantly sacrificing performance. We further introduce a Probabilistic Stepping Replacement (PSR) method and an advanced training schedule to enhance the performance of PDFT. Our PDFT allows the compression of SAM and BERT models to parameter levels acceptable for computation-constrained devices based on specific needs. Experiments conducted on SAM and BERT models validate the versatility and effectiveness of our PDFT. For both models, our PDFT achieves up to 0.93% and 1.6% accuracy improvements, respectively. Code is available in the supplementary material.
Dual-Path Spatio-Temporal Mamba for Skeleton-Based Action Recognition
ABSTRACT. Mamba emerges as a new paradigm for modeling long sequences, presenting a compelling alternative to Transformers. Although Mamba has demonstrated its utility in modeling temporal structures within skeleton-based action recognition, its joint spatio-temporal context modeling has not been fully explored. To bridge this gap, we propose SkeMamba, a novel approach centered entirely on Mamba. First, we introduce Adaptive Topology Transformation (ATT) to convert skeletal graphs into sequential representations while preserving anatomical semantics. Second, we formulate the Dual-Path Spatio-Temporal Mamba (DSTMba) to thoroughly account for the dual spatio-temporal nature of joint features. Lastly, we propose the Spatio-temporal Gated Mamba (ST-GateMba) as the pivotal component of DSTMba, which strategically mediates between transient motion dynamics and sustained behavioral semantics. Experimental results demonstrate that SkeMamba outperforms existing Mamba-based methods on the NTU series and NW-UCLA datasets.
Boosting Memory Network for Video Object Segmentation in Complex Scenes
ABSTRACT. Memory-based methods are the leading solutions for video object segmentation. However, due to incorrect feature matching, single-scale memory reading, and inefficient memory management, these methods often struggle in complex scenarios with similar distractors, small objects, or large deformations. In this paper, we propose a novel boosting memory network (BMN), which consists of a context-aware module (CAM), a multi-scale memory readout module (MMRM), and a reinforcement learning-based memory refinement module (RL-MRM) to enhance segmentation performance in complex scenarios. Specifically, we introduce the CAM into the encoding process, which employs transformer blocks that contain global and local branches to effectively model long-range and short-range spatial dependencies. Additionally, we employ the MMRM to perform efficient multi-scale memory readout, thereby generating multi-scale memory features that deliver strong cues for the final segmentation. Furthermore, during inference, we utilize the RL-MRM to construct a query-specific refined memory to provide precise segmentation guidance for each query frame, effectively mitigating temporal drift and avoiding error accumulation. Extensive experiments on five challenging benchmarks demonstrate that our BMN can achieve competitive performance compared to state-of-the-art methods. Our source code and models are available at: https://github.com/csustYyh/BMN.
ViT-BF: Vision Transformer with Border-aware Features for Visual Tracking
ABSTRACT. Existing object trackers commonly approach the tracking process by utilizing classification and regression techniques. However, they often encounter difficulties in managing complex scenarios such as occlusions and appearance changes. Moreover, the quality of candidate boxes is a critical factor affecting tracking performance. To overcome these challenges, this study introduces a border-aware tracking framework based on a Vision Transformer (ViT), termed ViT-BF. Through the integration of a boundary alignment operation, ViT-BF extracts boundary features from the extremal points of objects, thereby enhancing classification and regression precision. To handle the dynamic appearance variations of objects, ViT-BF integrates a template update mechanism through a score prediction module (SPM), which enhances the tracker's robustness and accuracy. Experimental results demonstrate that ViT-BF achieves state-of-the-art performance across multiple standard datasets, including LaSOT, TrackingNet, GOT-10k and UAV123, showing exceptional stability and adaptability in handling complex scenarios.
Physics-Guided Deep Learning Framework with Attention for Image Denoising
ABSTRACT. Deep neural networks (DNNs) have achieved remarkable success in image denoising, yet their design is predominantly empirical without clear theoretical guidance. Recent studies have revealed connections between neural networks and physics-based differential equations, offering a reliable guideline for network design. Nevertheless, most of the theories used to guide model design are not specific to the task; the mismatch between the guiding theory and the specific task undermines the suitability and strength of data-driven models in specific scientific applications. To address this, we propose a novel physics-guided learning framework that incorporates the structure of physics-based differential equations specialized for image denoising into an advanced deep model. Our framework features an asymmetric multi-scale U-Net architecture, combining an attention-based encoder with a physics-guided decoder and loss function. Experimental results show that our approach not only surpasses state-of-the-art methods in both Gaussian and real noise removal tasks but also reduces the model's reliance on large datasets.
Interpretable Two-Stage Action Quality Assessment via 3D Human Pose Estimation and Dynamic Feature Alignment
ABSTRACT. With the shift towards online learning and autonomous training in physical education, Action Quality Assessment (AQA) has emerged as a crucial component in creating a closed-loop learning system. Existing methods, while improving accuracy and reliability, often lack interpretability. This paper proposes an interpretable and reliable AQA framework comprising two stages. First, a Motion Information Enhanced Transformer (MiE Transformer) is introduced for 3D human pose estimation. By integrating Graph Convolutional Networks (GCNs) with Transformers, the MiE Transformer enhances action detail representation and motion dynamics, reducing joint motion errors through motion constraints in the loss function. Second, the 3D pose data are decomposed into static and dynamic features, such as velocity, center of gravity changes, and normalized bone vectors. An improved Dynamic Time Warping (DTW) algorithm is then applied to quantify multidimensional differences between practice and standard actions. The final action quality score integrates multiple feature scores, demonstrating strong generalization and reliability in evaluating martial arts like TaiChi across varying body sizes and skill levels. This work not only advances the field of AQA but also highlights its potential for broad applications in online physical education.
SDOD: Towards Reliable Object Detection under Diverse Rainy Conditions with Deraining and Mutual Learning
ABSTRACT. Object detection has achieved impressive performance under normal weather conditions. However, under rainy weather, raindrops and rainstreaks can significantly reduce image quality, leading to severe performance degradation of detectors. To address this issue, we propose an unsupervised domain adaptation (UDA) training framework, called Self-training with Deraining for Object Detection (SDOD), which leverages unlabeled rainy images to transfer knowledge from labeled normal images to improve the performance of detectors under rainy weather conditions. The proposed SDOD consists of a learnable lightweight deraining module and a cross-weather self-training architecture. Specifically, we introduce a deraining module as a weak augmentation to restore the rainy image, which is then input to the teacher model, thereby strengthening its ability to explore object features in the rain. Next, we employ mutual learning between the teacher model (taking images from rainy weather) and the student model (taking images from both normal and rainy weather), which iteratively optimizes the performance of both the detectors and the deraining module. Moreover, we design a rainy object estimation strategy to produce more reliable pseudo-labels for rainy images. Extensive experiments conducted on rainy benchmarks demonstrate the superiority of our SDOD compared with existing UDA methods, showing detection performance improvements in rainy weather conditions.
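The teacher-student mutual learning described above is commonly realized with an exponential-moving-average (EMA) teacher and confidence-filtered pseudo-labels; the sketch below shows that generic pattern under assumed names and thresholds, not SDOD's exact rainy-object estimation strategy.

```python
# Mean-teacher style update and pseudo-label filtering (a generic sketch of the
# self-training loop described in the abstract; names and thresholds are assumptions).
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Teacher weights follow an exponential moving average of the student."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)

def filter_pseudo_labels(boxes, scores, score_thr=0.7):
    """Keep only confident teacher detections as pseudo-labels for the student."""
    keep = scores >= score_thr
    return boxes[keep], scores[keep]
```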
ABSTRACT. With the expansion of global agricultural production, pests have become a major constraint on agricultural development. Accurate pest identification in natural environments remains a significant challenge for precision agriculture. To overcome the limitations of traditional detection methods in feature extraction and complex background recognition, this paper proposes YOLO-Pest, a pest detection model based on YOLOv11. YOLO-Pest incorporates three key modules to improve detection accuracy and efficiency. First, the AvgPool and MaxPool dual-branch downsampling module optimizes downsampling, preserving critical image information and enhancing multi-scale feature extraction for pests of different sizes. Second, the Dynamic Texture Attention module amplifies pest texture features in shallow layers, improving small pest detection. Third, the Spatial Reconstruction Multi-Scale Dilated Attention module enhances feature representation through spatial reconstruction and multi-scale dilation rates, capturing both local and global features while reducing computational redundancy. Experiments on the IP102 and Pest24 datasets show that YOLO-Pest outperforms state-of-the-art models in mean Average Precision (mAP) and recall. Ablation studies confirm the contribution of each module, demonstrating the effectiveness of the proposed approach. These findings indicate that YOLO-Pest provides an efficient and reliable solution for intelligent pest monitoring in complex agricultural environments.
DOMVS:Unsupervised Multi-View Stereo for Dealing With Occlusion Scenes
ABSTRACT. Deep learning-based multi-view stereo (MVS) methods have made significant progress in recent years. Due to limited access to large-scale annotated datasets, researchers explore unsupervised MVS methods that do not require ground-truth depth data. However, unsupervised MVS methods struggle with occluded or texture-less regions, as they rely on the assumption of photometric consistency. To address these issues, we propose an unsupervised MVS method named Unsupervised Multi-View Stereo for Dealing With Occlusion Scenes (DOMVS). We first propose a feature-level perceptual consistency module that minimizes reconstruction errors by comparing the differences in high-level semantic features between images. Meanwhile, we propose a structured occlusion generation module, which improves the accuracy and completeness of depth estimation by generating augmented samples for contrastive learning. Moreover, we propose the DLA-Net module for normalization to address the limitations of the receptive field, enhancing the accuracy of the depth map through the aggregation of global information. We test DOMVS on the DTU and Tanks & Temples datasets. Results demonstrate that DOMVS achieves an overall score of 0.339 on the DTU dataset, higher than the state-of-the-art method RC-MVSNet. On the Tanks & Temples dataset, DOMVS outperforms ADR-MVSNet and JDACS-MS by 4.31% and 20.34%, respectively.
GartransNet: 3D garments animation via transmission optimized networks
ABSTRACT. 3D garment animation is a challenging problem in computer vision and computer graphics. Although realistic animations can be generated by mapping human poses or time series to clothing deformations, these methods are often limited to specific garments and struggle to predict loose clothing. Additionally, current methods rely on predefined material parameters to control fabric deformation, yet they often lack physical plausibility with respect to the stretching behavior of clothing. We propose GartransNet, a transmission-optimized network that enhances the neighborhood sampling strategy, reduces computational redundancy, and improves real-time fabric simulation performance. We refine the blending weights between the human body and clothing, so that the model is no longer limited to specific garments, and introduce a collision-sensitive geometric perception constraint mechanism to improve the physical consistency of clothing deformation. The experimental results indicate that our method accelerates training and inference. In addition, it is suitable for all types of clothing and exhibits excellent performance, especially when simulating loose clothing.
IMFDM: Improved Mamba-based Feature Decomposition Model for Multi-Modality Image Fusion
ABSTRACT. Multi-Modality Image Fusion (MMIF) aims to integrate complementary information from multiple source images into a single image while preserving the original highlights and detailed textures. To tackle the critical problem of cross-modality feature modeling and decomposition, an Improved Mamba-based Feature Decomposition Model (IMFDM) is proposed. First, IMFDM extracts cross-modality shallow features using the Shallow Feature Encoder (SFE). Second, a dual-branch feature extractor decomposes these shallow features into low-frequency and high-frequency components. Finally, the decomposed features are adaptively integrated through the Content-Guided Attention Fusion module (CGAFusion), and the Decoder outputs the fusion result. This method explores the application of Mamba to image fusion and introduces a dual-branch State-Space Model (SSM)-Convolutional Neural Network (CNN) encoder: the Base Feature Encoder (BFE) leverages Mamba's efficient long-range modeling for global low-frequency features, while the Detail Feature Encoder (DFE) extracts local high-frequency details with a multi-branch strategy. Extensive experiments show that IMFDM delivers advanced performance in various fusion tasks. Furthermore, IMFDM also achieves satisfactory results in downstream multi-modality object detection on unified benchmarks.
Task-adaptive Channel Attention Graph Network for Few-shot 3D Point Cloud Classification
ABSTRACT. Few-shot 3D point cloud classification remains a challenging task due to the irregular geometry and sparse nature of point clouds, coupled with the scarcity of labeled data. Existing methods often rely on task-agnostic feature extractors, limiting their ability to adapt deeper features to novel tasks effectively. To address this limitation, we propose the Task-Adaptive Channel Attention Graph Network (TCAGN), a novel framework that dynamically adapts feature extraction to both instance-specific and task-aware contexts. TCAGN extends the widely used DGCNN backbone by integrating two attention mechanisms: Instance Channel Attention Module (ICAM), which emphasizes semantically meaningful channels for individual instances, and Task-adaptive Channel Attention Module (TCAM), which generates task-specific channel attention for all task features. These modules are hierarchically integrated into each EdgeConv block, enabling progressive feature refinement tailored to the unique characteristics of each few-shot task. We evaluate TCAGN on three benchmark datasets, ModelNet40-FS, ShapeNet70-FS, and ScanObjectNN-FS, demonstrating state-of-the-art performance with improvements of 1.1%–2.7% over existing methods. Ablation studies indicate that the synergistic combination of ICAM and TCAM contributes 3.0%–6.2% of the overall performance gain.
DL-DETR: Exploring the Future Directions of Fusion Between Dictionary Learning and DETR
ABSTRACT. The recently proposed RT-DETR and its variants are designed to achieve end-to-end real-time object detection while demonstrating strong performance. In this paper, we introduce a novel dictionary learning-based improvement method called ADDSC (Adaptive Dictionary-Driven Dynamic Sparse Convolution). ADDSC comprises three core sub-modules that together enable efficient feature extraction and optimization. The Adaptive Dictionary Learning (ADL) module utilizes dictionary learning to decompose input features and generate sparse coefficients, adaptively adjusting the granularity of feature decomposition. The Dynamic Sparse Convolution (DSC) module applies sparse convolution to the sparse coefficients, thereby reducing computational complexity through learnable sparsity. The Learnable Parameter Optimization (LPO) module jointly optimizes both the ADL and DSC, achieving a dynamic balance between feature representation and computational efficiency. Experimental results show that the DL-DETR model based on ADDSC significantly improves multi-scale object detection accuracy through feature decoupling and enhanced sparsity, while maintaining computational complexity comparable to state-of-the-art detection models such as D-FINE. Notably, ADDSC exhibits superior robustness in small object detection and complex scenarios, making it particularly well-suited for real-time object detection tasks.
Facial Action Unit Detection with Iterative Rank Reduction Adapter and Directional Attention
ABSTRACT. Facial action unit (AU) detection is a challenging task, as AUs are subtle, dynamic, and diverse. Recently, the prevailing techniques of visual foundation models (VFMs) and large-model fine-tuning have been introduced to many computer vision tasks. However, most existing AU detection methods neglect the fine-tuning of VFMs, and thus still suffer from the difficulty of learning powerful feature representations. In this paper, we propose a new iterative rank reduction adapter (IR2A) to fine-tune a VFM for AU detection. In particular, we freeze the pre-trained model parameters and introduce trainable rank decomposition matrices into the top self-attention blocks. We set an initial rank for the rank decomposition matrices, and then iteratively reduce the rank via principal component analysis. Moreover, we propose a directional attention to learn features relevant to each AU, in which important information in different directions is captured. Extensive experiments show that our method outperforms state-of-the-art AU detection approaches on challenging benchmarks including BP4D, DISFA, and GFT.
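One way to picture the iterative rank reduction described above is a truncated decomposition of the current low-rank update: the sketch below uses an SVD as a stand-in for the PCA step, with illustrative shapes and names. The reduction schedule and its placement in the top self-attention blocks are not reproduced.

```python
# One rank-reduction step for a low-rank adapter update dW = B @ A (rank r -> r_new),
# using truncated SVD as a stand-in for the PCA step described in the abstract.
# This is an illustrative sketch, not the authors' IR2A implementation.
import torch

def reduce_adapter_rank(A, B, r_new):
    """A: (r, d_in), B: (d_out, r). Returns (A_new, B_new) with rank r_new."""
    delta_w = B @ A                                   # (d_out, d_in) current update
    U, S, Vh = torch.linalg.svd(delta_w, full_matrices=False)
    U, S, Vh = U[:, :r_new], S[:r_new], Vh[:r_new]    # keep principal components
    B_new = U * S.sqrt()                              # (d_out, r_new)
    A_new = S.sqrt().unsqueeze(1) * Vh                # (r_new, d_in)
    return A_new, B_new

A, B = torch.randn(16, 768), torch.randn(768, 16)
A2, B2 = reduce_adapter_rank(A, B, r_new=8)
print(A2.shape, B2.shape)   # torch.Size([8, 768]) torch.Size([768, 8])
```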
Intrinsic Reflective Symmetry Axis Curve Generation for Meshes
ABSTRACT. In computer graphics, symmetry serves as a critical indicator of an object's shape and structure, motivating extensive research into various types of symmetry. In this study, we introduce a novel method for extracting an intrinsic reflective symmetry axis curve that partitions a mesh into two nearly equal halves. First, we sample points on the mesh using geodesic-based sampling methods. Next, we perform point-to-point matching by computing geodesic paths from pairs of points to their respective bisector regions and extracting histograms based on spectral signatures of vertices along or near these paths. The histogram bins are then compared using optimal transport in their normalized forms. To enhance the robustness of our algorithm, we employ multiple paths to various points within each bisector region, thereby generating a richer set of histograms for comparison. Subsequently, we use a voting procedure based on the midpoints of these point pairs to select the optimal bisector region, which defines our intrinsic reflective symmetry axis curve. Finally, we stabilize the curve through successive iterations of point matching. Our approach outperforms two state-of-the-art symmetry matching techniques in comparing the areas between the symmetric halves on the complete SCAPE and TOSCA datasets. We also obtain plausible results on the diverse collection of the Princeton dataset.
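For the histogram-comparison step, note that when two normalized histograms share a 1D binning, their optimal-transport (Wasserstein-1) distance reduces to the L1 distance between cumulative sums; the sketch below shows that standard identity and is only an assumption about how the comparison could be instantiated, not the paper's full pipeline.

```python
# For two normalized 1D histograms on the same bins, the optimal-transport
# (Wasserstein-1) distance equals the L1 distance between their CDFs; a small
# sketch of the comparison step described in the abstract.
import numpy as np

def wasserstein1_hist(p, q, bin_width=1.0):
    p = p / p.sum()
    q = q / q.sum()
    return float(np.abs(np.cumsum(p) - np.cumsum(q)).sum() * bin_width)

p = np.array([0.0, 1.0, 3.0, 1.0])
q = np.array([1.0, 3.0, 1.0, 0.0])
print(wasserstein1_hist(p, q))   # mass shifted by one bin -> distance 1.0
```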
SCAU-Net: An Efficient Module for Multi-scale Feature Extraction and Aggregation in Video Super-Resolution
ABSTRACT. Video Super-Resolution (VSR) is a critical technology for enhancing the quality of low-resolution videos. Existing methods, particularly those based on the U-Net architecture, exhibit limitations in effectively transferring information across multi-scale feature maps due to the absence of robust mapping mechanisms. These shortcomings result in suboptimal preservation of granular details and textures, compromising the clarity and quality of super-resolved videos.
To address these challenges, we propose the Scale-aware Cross-Attention U-Net (SCAU-Net), a novel architecture designed to enhance multi-scale feature restoration for VSR. At the core of SCAU-Net is the Scale-aware Pixel Cross-Attention Module (SPCAM), which is integrated into the U-Net framework. This module facilitates efficient mapping between features at different scales, enabling one-step extraction and aggregation of multi-scale information. By optimizing the transfer and integration of features across adjacent layers, SCAU-Net significantly improves the preservation of fine details and textures.
Extensive experiments demonstrate that the proposed SCAU-Net achieves a remarkable balance between computational efficiency and performance, delivering state-of-the-art results in VSR tasks.
A generalised pre-training strategy for deep learning networks in semantic segmentation of remotely sensed images
ABSTRACT. In the segmentation of remotely sensed images, deep learning models are typically pre-trained using large image databases like ImageNet before being fine-tuned on domain-specific datasets. However, the performance of these fine-tuned models is often hindered by the large domain gaps (i.e., differences in scenes and modalities) between ImageNet's images and the remotely sensed images being processed. Therefore, many researchers have undertaken efforts to establish large-scale domain-specific image datasets for pre-training, aiming to enhance model performance. However, establishing such datasets is often challenging, requiring significant effort, and these datasets often exhibit limited generalizability to other application scenarios. To address these issues, this study introduces a novel yet simple pre-training strategy designed to guide a model away from learning domain-specific features of the pre-training dataset during pre-training, thereby improving the generalisation ability of the pre-trained model. To evaluate the strategy's effectiveness, deep learning models are pre-trained on ImageNet and subsequently fine-tuned on four semantic segmentation datasets with diverse scenes and modalities, including iSAID, MFNet, PST900 and Potsdam. Experimental results show that the proposed pre-training strategy leads to state-of-the-art accuracies on all four datasets, namely 67.4% mIoU for iSAID, 56.9% mIoU for MFNet, 84.22% mIoU for PST900, and 91.88% mF1 for Potsdam. This research lays the groundwork for developing a unified foundation model applicable to both computer vision and remote sensing applications. The code and pre-trained models will be released after peer review.
CoFormer:Coupling Attentive Model for Visual Sentiment Analysis with Hierarchical Emotion Loss
ABSTRACT. Visual sentiment analysis aims to identify the creator's attitude towards the overall contextual polarity of an image. Existing CNN-based schemes derive emotional representations from local features, missing the positive effects of both global image objects and long-range dependencies on emotion representation learning. To work around this restriction, we present a hybrid network structure named \textit{CoFormer} to enhance representation learning with convolution operations and self-attention mechanisms in this paper. Our model extracts visual features through a series of residual blocks, which are then transformed into semantic tokens via hybrid attention. We insert a tokenizer-based Transformer to discover the correlation between these tokens and local visual features and achieve a conclusive sentiment prediction. In addition, considering the inherent hierarchical structure of emotion granularity, we put forward a hierarchical emotion loss that distinguishes hard false examples to optimize the model. We also design an automatic image augmentation agent to enhance the model's generalization ability.
Multilevel Monte Carlo for Asymptotically Efficient Path Tracing
ABSTRACT. Efficiency improvement techniques are widely used to improve the computational efficiency of Monte Carlo path tracing. While numerous methods have been proposed with different strategies, they fundamentally aim to control the number of samples at each depth to improve efficiency. To achieve this, previous approaches sample contributions at each vertex of the incrementally constructed path and employ Russian roulette and splitting to adjust the sample count at each depth. However, additional errors introduced by correlations from path sharing can limit the potential efficiency of these approaches by increasing the number of samples needed to optimize efficiency. To address this limitation, we propose an alternative efficiency improvement technique for Monte Carlo path tracing using Multilevel Monte Carlo. We start by defining a multilevel estimator that sums independent Monte Carlo estimators, each of which samples contributions at a specified depth. Then, the efficiency is optimized by adjusting each estimator's sample size, eliminating the need for spatial data structures. While this multilevel setup increases sampling costs, it reduces variance by removing correlations between contributions. Essentially, the reduced variance leads to substantial performance gains by reducing the number of samples required for efficiency optimization. Consequently, our approach achieves noticeable speedup over the state-of-the-art methods without relying on complex spatial data structures.
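For context, adjusting each estimator's sample size is the classic multilevel Monte Carlo allocation problem: minimizing total cost for a target variance yields per-level sample counts proportional to sqrt(V_l / C_l). The sketch below shows this standard Giles-style formula with assumed inputs; the paper's own efficiency optimization may differ in detail.

```python
# Classic multilevel Monte Carlo sample allocation: for a target variance eps^2,
# the cost-optimal per-level sample counts are
#   N_l = ceil( eps^-2 * sqrt(V_l / C_l) * sum_k sqrt(V_k * C_k) ).
# Shown as a generic illustration of per-depth sample-size optimization.
import math

def mlmc_sample_sizes(variances, costs, eps):
    total = sum(math.sqrt(v * c) for v, c in zip(variances, costs))
    return [max(1, math.ceil(math.sqrt(v / c) * total / eps**2))
            for v, c in zip(variances, costs)]

# deeper path depths: smaller per-level variance, larger per-sample cost
print(mlmc_sample_sizes(variances=[1.0, 0.2, 0.05], costs=[1.0, 2.0, 4.0], eps=0.05))
```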
Video Sketching using Multi-domain Guidance and Implicit Encoding
ABSTRACT. Sketch data is a common element in visual communication. While synthesizing sketches from photos has been extensively explored, creating sketches from video remains a complex challenge due to its inherent intricacy and the necessity for temporal consistency. This study delves into the generation of a sequence of vector sketches from a video clip. We have developed an optimization framework that utilizes the CLIP perceptual loss with guidance from multiple domains, including natural images and stylized line drawings. This approach aids in capturing the prominent visual content within a complex scene. We initialize the sketches by propagating control points from the keyframes through the video content deformation field. These initial points are implicitly encoded and serve as input to a transformer network that predicts the control point offsets for each frame. We also conduct an additional temporal refinement stage by using more precise initial points for optimization. Experimental results on the DAVIS video dataset demonstrate that our method successfully delivers high visual fidelity and temporal consistency.
Scene-Enhanced Social Interpretable Movement Behavior for Multimodal Pedestrian Trajectory Prediction
ABSTRACT. Pedestrian trajectory prediction aims to forecast pedestrians' future positions based on their historical movements and surrounding environmental information. This capability is crucial in applications such as autonomous driving. Many existing methods utilize deep learning models to analyze historical trajectories and interactions among pedestrians. While these models often achieve high predictive accuracy, their data-driven nature can result in trajectories that do not accurately reflect pedestrians' real-world behavioral responses in specific contexts, lacking interpretability. Additionally, most studies do not consider the influence of specific scene information on pedestrians' movement behavior decisions during social interaction modeling. In this paper, we propose a framework named SE-MBMP (Scene-Enhanced Social Interpretable Movement Behavior for Multimodal Pedestrian Trajectory Prediction). By clustering extensive real-world pedestrian movement behavior data, SE-MBMP constructs an interpretable Movement Behavior Set that encompasses potential future behaviors. Furthermore, scene information is incorporated into social interaction modeling, enhancing the accuracy of predicted trajectories. Extensive experiments on the ETH and UCY datasets demonstrate that our strategy achieves average ADE and FDE scores of 0.32 and 0.57, respectively, representing reductions of 28.1% and 31.3% compared to the METF method, underscoring its potential in trajectory prediction.
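The reported ADE and FDE follow the standard trajectory-prediction metrics, i.e., the average and final L2 displacement of the best of K predicted modes against the ground truth; a short reference sketch (not the paper's evaluation code) follows.

```python
# ADE / FDE as reported in the abstract (standard multimodal "best-of-K" form):
# average / final L2 displacement of the best predicted mode against ground truth.
import numpy as np

def min_ade_fde(pred, gt):
    """pred: (K, T, 2) candidate trajectories, gt: (T, 2) ground truth."""
    dists = np.linalg.norm(pred - gt[None], axis=-1)     # (K, T)
    ade = dists.mean(axis=1).min()
    fde = dists[:, -1].min()
    return ade, fde

pred = np.random.randn(20, 12, 2)
gt = np.zeros((12, 2))
print(min_ade_fde(pred, gt))
```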
ConDT: A 2D curve reconstruction algorithm based on a constrained-neighbor proximity graph
ABSTRACT. We introduce the ConDT algorithm, a proximity-based reconstruction method relying on Delaunay triangulation; the underlying proximity graph is also referred to as ConDT. In addition to being simple, the algorithm successfully handles various challenging cases where classical reconstruction algorithms often struggle. Outlier removal is performed in a post-processing phase using the Interquartile Range (IQR) criterion, computed for the specific instance of the proximity graph. Relying on a recent benchmark for 2D reconstruction, we show that our method performs better than or on par with state-of-the-art methods.
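The post-processing step applies the usual interquartile-range rule; the sketch below shows it applied to proximity-graph edge lengths as an illustrative choice, since the exact statistic thresholded by ConDT is not specified in the abstract.

```python
# Standard IQR rule, as used in the post-processing step: flag values beyond
# Q1 - 1.5*IQR or Q3 + 1.5*IQR (e.g., unusually long proximity-graph edges).
# Illustrative only; the exact statistic thresholded by ConDT may differ.
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

edge_lengths = np.array([0.9, 1.0, 1.1, 1.0, 0.95, 6.0])
print(iqr_outlier_mask(edge_lengths))   # only the 6.0 edge is flagged
```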
MindCanvas: A Human-Centric Approach to AI-Driven Artistic Creation Through Brain Signals
ABSTRACT. The intersection of neuroscience and artificial intelligence (AI) offers new avenues for understanding and replicating human creativity. However, current AI systems struggle to capture the depth, emotion, and cognitive complexity inherent in human artistic expression. In response, we propose MindCanvas, an AI framework that integrates fMRI with diffusion models to generate artwork directly from neural signals. MindCanvas decodes brain activity, reconstructs mental images, and refines them through text prompts, producing not only the final artwork but also a dynamic video that captures the stroke-by-stroke creative process. Two user studies demonstrated the effectiveness of the model in translating neural activity into visually compelling and cognitively resonant artworks. By maintaining coherence between neural signals and artistic output, MindCanvas addresses key limitations in existing AI art systems, offering a novel approach that mirrors the evolving and emotional nature of human creativity. Our results underscore the potential of merging neuroscience and AI to create art that transcends technical inputs, moving toward a deeper, more holistic representation of human creativity.
Graph Convolutional Networks for 3D Skeleton-Based Scoliosis Screening Using Gait Sequences
ABSTRACT. Adolescent idiopathic scoliosis is a significant health concern, ranked as the third most prevalent issue among adolescents after obesity and myopia. Traditional screening methods rely on the use of complex and expensive measuring instruments and expert physicians to interpret X-ray images. These methods can be both time-consuming and inaccessible for widespread screening efforts. To address these challenges, we propose a standardized protocol for the collection of scoliosis gait dataset. This protocol enables the systematic capture of relevant gait characteristics associated with scoliosis, leading to the creation of a comprehensive, annotated dataset tailored for research and diagnostic purposes. Leveraging this dataset, we developed a sophisticated deep learning algorithm based on graph convolutional networks. This algorithm is specifically designed to analyze the intricate patterns of gait asymmetry associated with scoliosis. We also explored various optimization strategies to enhance the model's accuracy and efficiency, ensuring robust performance across diverse scenarios. Our innovative approach allows for the rapid and non-invasive recognition of scoliosis. This method is not only scalable but also eliminates the need for specialized equipment or extensive medical expertise, making it ideal for large-scale screening initiatives. By improving the accessibility and efficiency of scoliosis detection, our approach has the potential to facilitate early intervention.
GUM-DiT: A Foundation Model for Generating Urban Morphological Layouts
ABSTRACT. Layout generation has become a critical frontier in computer vision and computational design, with especially strong impact in urban planning. However, existing research often suffers from limited datasets and rarely addresses the diversity of city layout styles. To tackle these issues, we introduce learnable morphology tokens that embed urban style information directly into our model. This approach lets us capture both structural features and stylistic nuances across different cities. We also curate a large-scale dataset of urban layouts paired with descriptive style annotations and develop a multiscale data processing pipeline. Finally, we propose a Diffusion Transformer (DiT) framework, combining diffusion-based generation with transformer architectures, to produce a wide variety of realistic urban patterns. Qualitative and quantitative evaluations demonstrate that DiT outperforms baseline methods in generating distinct, style-consistent urban layouts. Our code and dataset are publicly available at \url{https://anonymous.4open.science/r/label2layout-dit-78C3/README.md}.
WA-FDNet: A Unified Weight Adaptation Network for Multimodal Image Fusion and Object Detection
ABSTRACT. Multimodal image fusion and object detection are critical tasks in computer vision, particularly in scenarios requiring robust perception under low-illumination conditions. Existing approaches that attempt to combine these tasks often rely on cascaded or loosely coupled designs, which can result in suboptimal performance due to gradient conflicts and task imbalance. In this paper, we propose WA-FDNet, a novel Weight Adaptation Fusion Detection Network that unifies multimodal image fusion and object detection into a single end-to-end framework. WA-FDNet adopts a shared encoder–private decoder architecture, enabling efficient feature sharing while preserving task-specific characteristics. The image fusion branch employs a spatial attention-based feature reconstruction module to generate high-quality fused images by emphasizing semantically important regions. Meanwhile, the detection branch introduces a dual-cross attention feature interaction module that enhances inter-modal representation learning for accurate object detection. To address training instability caused by conflicting objectives, we propose a Dynamic Task Weight Adaptation (DTWA) strategy that dynamically balances gradient contributions across tasks based on optimization feedback. Extensive experiments on public benchmarks demonstrate that WA-FDNet achieves state-of-the-art performance in both fusion quality and detection accuracy, validating the effectiveness of our unified multitask learning approach. The source code will be made publicly available upon acceptance.
Sequential Fluid Image Generation Network Based on Spatio-Temporal Swin Transformer
ABSTRACT. In recent years, fluid dynamics simulation has attracted widespread attention in computer graphics and scientific computing. To address the issues of long-term dependency capture and insufficient local feature extraction in multi-frame fluid image prediction methods, this paper proposes a spatio-temporal deep learning network model that combines ConvLSTM and Swin Transformer. The model leverages ConvLSTM to capture both long-term and short-term dependencies in time series, while utilizing the hierarchical attention mechanism of the Swin Transformer to extract multi-scale spatial features. Experimental results demonstrate that the proposed method achieves an average PSNR of 39.00 dB across five fluid prediction datasets, a 9.1% improvement over the best baseline model (Swin Transformer), and improves SSIM by 6.5% over the same baseline, confirming the significant advantages of the generated images in terms of both physical consistency and visual quality. This research provides a new approach for efficient fluid dynamics prediction and has potential application value in related fields.
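For reference, a ConvLSTM cell, the standard temporal building block referenced above, can be written compactly as follows. The layer sizes are illustrative, and the paper's full model additionally stacks Swin Transformer blocks for multi-scale spatial features.

```python
# Minimal ConvLSTM cell (illustrative sizes; not the paper's complete network).
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # A single convolution produces the input, forget, output, and candidate gates.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel_size, padding=pad)

    def forward(self, x, state):
        h, c = state                              # hidden and cell state: (B, hid_ch, H, W)
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)             # update cell state
        h = o * torch.tanh(c)                     # emit new hidden state
        return h, (h, c)
```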
Learning Human-Object Interactions in Videos with Optical Flow
ABSTRACT. Learning human-object interactions (HOI) in videos is a critical yet challenging task. Existing methods often rely solely on appearance cues and struggle to capture fine-grained temporal dynamics. We propose FlowHOI, a novel two-stream framework that explicitly incorporates optical flow as a motion prior for HOI learning. FlowHOI consists of two parallel streams: a spatial stream that extracts appearance features from RGB frames, and a temporal stream that encodes motion patterns from dense optical flow. Both streams capture complementary spatial-temporal dependencies through a spatial-temporal mixing module, and their features are fused before HOI prediction. The decoupling of appearance and motion cues enables the model to focus on transient interaction patterns such as brief hand-to-object contact and object displacement, which reduces ambiguity in feature learning. Experimental results on two public human-object interaction video benchmarks demonstrate that FlowHOI achieves significant improvements over existing methods, e.g., 94.2% and 94.5% F1 score for sub-activity and affordance labelling on the CAD-120 dataset.
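The two-stream layout described above can be sketched schematically as an RGB (appearance) stream and an optical-flow (motion) stream whose pooled features are concatenated before classification. The backbones and the fusion rule below are placeholders, not FlowHOI's actual spatial-temporal mixed module.

```python
# Schematic two-stream appearance/flow model with late fusion (placeholder backbones).
import torch
import torch.nn as nn

class TwoStreamHOI(nn.Module):
    def __init__(self, feat_dim=256, num_classes=10):
        super().__init__()
        self.rgb_stream  = nn.Sequential(nn.Conv3d(3, feat_dim, 3, padding=1),
                                         nn.AdaptiveAvgPool3d(1))
        self.flow_stream = nn.Sequential(nn.Conv3d(2, feat_dim, 3, padding=1),
                                         nn.AdaptiveAvgPool3d(1))
        self.classifier  = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, rgb, flow):
        # rgb: (B, 3, T, H, W); flow: (B, 2, T, H, W) with (dx, dy) channels.
        appearance = self.rgb_stream(rgb).flatten(1)
        motion     = self.flow_stream(flow).flatten(1)
        return self.classifier(torch.cat([appearance, motion], dim=1))
```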
ABSTRACT. Currently, most trackers treat tracking as a combination of classification and regression tasks, and tracking performance is determined to a large extent by how positive and negative samples are allocated. In this paper, Adaptive Training Sample Selection (ATSS) is employed to define positive and negative samples according to a dynamic IoU threshold. Ranking candidate boxes accurately is another key factor affecting a tracker's performance. Traditional methods sort candidate boxes by classification score, which fails to produce a reliable ranking. Therefore, this paper introduces an IoU-aware Classification Score (IACS) for sorting candidate boxes, and varifocal loss is employed to train the tracker to generate the IACS. Furthermore, to refine the predicted bounding boxes and predict the IACS, this paper utilizes a star-shaped bounding box feature representation. Combining these components, we propose a transformer-based anchor-free tracker, TADT, which incorporates the ATSS and IACS strategies and addresses the misalignment between classification and regression. The proposed tracking framework achieves remarkable performance on five benchmarks. In particular, TADT sets state-of-the-art performance on TrackingNet, with an AUC of 85.3%.
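Varifocal loss, mentioned above for training the IoU-aware classification score, has a standard formulation (from VarifocalNet): positives are weighted by their target IoU, while negatives are down-weighted by a focal term. A minimal version is shown below; the hyper-parameters are the common defaults, not necessarily those used by TADT.

```python
# Standard varifocal loss for IoU-aware classification scores (default hyper-parameters).
import torch
import torch.nn.functional as F

def varifocal_loss(pred_logits, target_iacs, alpha=0.75, gamma=2.0):
    """pred_logits: raw scores; target_iacs: gt IoU for positives, 0 for negatives."""
    p = torch.sigmoid(pred_logits)
    pos = (target_iacs > 0).float()
    # Positives weighted by their target IoU; negatives down-weighted by alpha * p^gamma,
    # which focuses training on hard negatives.
    weight = pos * target_iacs + (1.0 - pos) * alpha * p.detach().pow(gamma)
    bce = F.binary_cross_entropy_with_logits(pred_logits, target_iacs, reduction="none")
    return (weight * bce).sum()
```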
Reproducing the Appearance of Metallic Materials by Capturing Long-Range Dependencies using Image-to-Image Translation Network with Vision Transformer
ABSTRACT. Seamlessly integrating virtual objects into real scenes is a critical challenge in Augmented Reality. To achieve this, it is important to provide globally consistent estimates of the lighting conditions in the real scene. In this study, we propose a method to reproduce the appearance of metallic materials using an image-to-image translation network incorporating a Vision Transformer (ViT). The network receives an image consisting of a background and the normal map of a virtual object, and transforms the normal map into a virtual object with a metallic material appearance. Specifically, the ViT, which effectively captures long-range dependencies in the input image, is introduced into the network's encoder to extract lighting information, and the network produces a natural appearance of the virtual object that reflects the extracted lighting information in the output image. We created a synthetic dataset and compared images generated by the proposed method with those generated by a CNN-based image-to-image translation network. The results showed that the proposed method outperformed the CNN-based method on three quantitative metrics and reproduced a more natural appearance of metallic materials that is consistent with the lighting conditions.
Veiled Diffusion: A Diffusion-Based Anti-Customization Adversarial Attack Method via Semantic Misalignment and Low-Frequency Information
ABSTRACT. With the advancement of diffusion-based customization methods like DreamBooth, users can generate highly realistic personalized images from only a few samples and textual prompts. However, this technology poses risks such as image forgery, threatening privacy and security. To address this, active protection techniques introduce adversarial perturbations during customization to suppress the spread of falsified content. Existing methods often fail to fully exploit low-frequency information and exhibit limited perturbation intensity, leaving residual facial features recognizable. To overcome these limitations, we propose Veiled Diffusion (VeilDiff), a novel adversarial attack method that enhances privacy protection in diffusion customization. Specifically, a low-frequency information enhancement module integrates dominant structural features into the image before latent variable generation, while a Gaussian low-pass filter induces low-frequency loss for more targeted perturbations. Additionally, we design a semantic misalignment module to weaken the influence of textual prompts by manipulating the cross-attention mechanism. We further propose a multi-level feature guidance loss that dynamically enhances perturbations by focusing on critical semantic regions. Experimental results indicate that VeilDiff surpasses the previous state-of-the-art method by 13.1% in FDFR and 14.8% in ISM, achieving superior overall performance across all evaluated settings and demonstrating strong potential for privacy protection.
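The abstract's low-frequency components can be isolated with a frequency-domain Gaussian low-pass filter. The snippet below shows one standard way to build such a filter; the cut-off value is illustrative and the module names are not taken from the paper.

```python
# Minimal frequency-domain Gaussian low-pass filter (illustrative cut-off).
import torch

def gaussian_low_pass(img, sigma=10.0):
    """img: (B, C, H, W). Keeps low spatial frequencies, attenuates high ones."""
    B, C, H, W = img.shape
    fy = torch.fft.fftfreq(H, device=img.device).view(H, 1) * H   # vertical frequencies
    fx = torch.fft.fftfreq(W, device=img.device).view(1, W) * W   # horizontal frequencies
    mask = torch.exp(-(fx ** 2 + fy ** 2) / (2 * sigma ** 2))     # (H, W) Gaussian mask
    spec = torch.fft.fft2(img)
    return torch.fft.ifft2(spec * mask).real                      # filtered image
```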
Generating Multi-Illumination Leaf Images through Single-Image Material Estimation for Robust Disease Recognition
ABSTRACT. Visual conditions are critical for vision-based smart agricultural production tasks. However, traditional field acquisition methods struggle to capture complex illumination variations and are often time-consuming, labor-intensive, and expensive. In this work, we propose an innovative method for generating datasets of crop leaf images under diverse visual conditions, which synthesizes high-fidelity leaf data through physics-based simulation of realistic agricultural illumination variations. Our methodology comprises three stages. First, we use advanced techniques to estimate depth from the input image and reconstruct accurate leaf geometry. Second, we employ a physics-based inverse rendering framework to estimate the material and illumination characteristics of leaf surfaces. Finally, high dynamic range (HDR) environment maps are utilized to achieve photorealistic relighting across diverse agricultural conditions. Extensive experiments conducted on the AppleLeaf9 dataset validate the effectiveness of the proposed image generation method for leaf disease classification across various deep learning architectures. The synthesized images achieved classification accuracies of 97.10% with MobileNetV2, 95.65% with DenseNet121, 96.62% with ResNet50, and 98.07% with VGG16. These results underscore the strong potential of the proposed method in agricultural vision applications, particularly in addressing domain-specific challenges that are often insufficiently handled by existing image generation approaches.
Te3DFR: Texture-enabled 3D Face Reconstruction from Monocular Image via Self-supervised Learning
ABSTRACT. Reconstructing high-fidelity 3D faces from monocular RGB images has garnered significant research interest, with most approaches relying on 3D Morphable Models (3DMM). However, the capabilities of 3DMM-based methods are often constrained by the limited dimensionality of the representation space, leading to inaccuracies in detailed facial shapes and realistic skin textures. While several recent works aim to enhance 3DMM-based face reconstruction, many focus on fine-tuning the outputs of pre-trained 3DMM models—such as estimating offset maps—rather than improving the underlying pre-trained model itself, which is the root of these inaccuracies.
To address this gap, we propose a Vision Transformer (ViT)-based framework for predicting 3DMM coefficients for face reconstruction, thereby creating a new pre-trained model that improves reconstruction performance. Additionally, our model incorporates an optimization module with innovative filter constraints for face textures, significantly enhancing the high-frequency components that correspond to fine details in facial textures. Extensive experiments conducted on publicly available face reconstruction datasets, including SCUT-FBP5500 and FaceScape, demonstrate that our method achieves promising generalization performance and markedly improves the precision of texture details compared to state-of-the-art techniques.
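As a rough illustration of the overall idea, a ViT-style encoder that regresses a flat vector of 3DMM coefficients from a face image could look like the toy model below. The patch size, coefficient count, and transformer dimensions are placeholders; the paper's actual model and its texture optimization module are considerably more elaborate.

```python
# Toy ViT-style regressor from a face image to 3DMM coefficients (illustrative sizes).
import torch
import torch.nn as nn

class ViTCoeffRegressor(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=384, depth=6, num_coeffs=257):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        n_tokens = (img_size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=6,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_coeffs)   # identity/expression/texture/pose/light

    def forward(self, x):                         # x: (B, 3, 224, 224)
        tok = self.patch_embed(x).flatten(2).transpose(1, 2) + self.pos
        tok = self.encoder(tok)
        return self.head(tok.mean(dim=1))         # (B, num_coeffs)
```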
Lightweight Image Super-Resolution Using Fine-Grained Feature Distillation in a Dense Residual U-Net
ABSTRACT. In recent years, convolutional neural networks (CNNs) have achieved remarkable progress in single image super-resolution (SISR). Meanwhile, Transformers are gaining increasing attention for their exceptional capabilities in extracting global and non-local features. However, most existing studies focus primarily on improving reconstruction accuracy and expanding the receptive field, with relatively less emphasis on the requirements of edge devices and fine-grained feature extraction. To address this gap, we propose a lightweight model, Fine-Grained Feature Distillation Dense Residual U-Net (FDDRU), which enhances fine-grained feature extraction while significantly reducing model parameters and maintaining superior performance. FDDRU incorporates an innovative Fine-Grained Feature Distillation Block (FFDB) and builds its core modules using Fine-Grained Shallow Residual Blocks (FSRB). By leveraging a residual multiplication mechanism, it achieves efficient collaboration between depth-wise and point-wise convolutions to emphasize fine-grained information and reduce model complexity. Based on the U-Net architecture, FDDRU exhibits three main advantages: First, a Dense Residual Connection Mechanism (DRCM) is introduced in the encoder to enhance feature transmission efficiency by accommodating channel expansion patterns. Second, a Multi-Level Information Supplementation Mechanism (MISM) bridges the encoder and decoder, compensating for information loss during channel compression in the decoder. Finally, a Bottom Module (BM) integrates encoder features to explore inter-channel correlations and ensure smooth transmission to the decoder. To achieve fine-grained reconstruction while enhancing visual perceptual quality, we optimize the training strategy by jointly applying L1 loss and rigid loss functions, ensuring both pixel-level reconstruction accuracy and local structural consistency in the reconstructed results, thus recovering high-frequency details. Experimental results demonstrate that our method outperforms state-of-the-art approaches on benchmark datasets, showcasing outstanding performance and satisfactory visual results.
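The collaboration between depth-wise and point-wise convolutions with a residual multiplication, as described above, can be sketched as a small block like the one below. This is only an interpretation of the idea for illustration; it is not the exact FSRB definition from the paper.

```python
# Illustrative shallow residual block: depth-wise + point-wise convolutions whose
# output multiplicatively modulates the identity path (a sketch, not the paper's FSRB).
import torch
import torch.nn as nn

class ShallowResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)
        self.act = nn.GELU()

    def forward(self, x):
        y = self.act(self.pointwise(self.depthwise(x)))
        # Residual multiplication: the branch gates the identity path instead of
        # simply being added to it, emphasising fine-grained responses.
        return x * torch.sigmoid(y) + x
```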
Extending Implicit Density Projection for Multiphase Fluids
ABSTRACT. To increase the realism of fluid simulations, interactions between different fluids (such as water and air, or water and oil) must be considered.
This allows for the simulation of phenomena such as the `glugging' effect seen when a bottle is turned upside down and air must fill the space left by the liquid.
A common approach to fluid animation is to use Particle-in-Cell (PIC) methods.
However, such methods are known to lose fluid volume over time due to accumulating numerical errors, an issue that is exacerbated when simulating multiple interacting fluids.
Implicit Density Projection (IDP) is used to overcome volume preservation issues for PIC methods; however, it is formulated for single fluids. To address this, we present two novel extensions to IDP: generalised IDP and D1-IDP. Generalised IDP extends IDP to multiple fluids, and D1-IDP further improves its volume preservation capabilities. We show that D1-IDP performs particularly well in multiphase fluid simulations with complex fluid interfaces, and we demonstrate its applicability to multiphase animations involving variable density fluids.
D1-IDP achieves a maximum volume error of $<1.0\%$ for the majority of the presented scenarios, while having a negligible impact on computational performance compared to generalised IDP.
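For readers less familiar with density-based projections, the display below summarises the generic form that IDP-style volume preservation builds on: a pressure Poisson problem driven by the deviation of the advected density from the rest density, with positions corrected along the pressure gradient. It is a schematic summary only, not the exact discretisation used by generalised IDP or D1-IDP.

```latex
% Schematic density-invariance projection (generic form, not the paper's exact scheme).
\[
  \nabla \cdot \!\left( \frac{\Delta t^{2}}{\rho}\, \nabla p \right)
    = \frac{\rho^{*} - \rho_{0}}{\rho_{0}},
  \qquad
  \Delta \mathbf{x} = -\,\frac{\Delta t^{2}}{\rho}\, \nabla p ,
\]
% where \rho^{*} is the density after advection and \rho_{0} the rest density.
```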
CartoonAnimalPose: A New Dataset and Method for Cartoon Animal Pose Estimation
ABSTRACT. This paper introduces CartoonAnimalPose, the first dataset for non-photorealistic animal pose estimation, containing over 4,000 images and 8,800 instances with precise annotations of 21 keypoints and bounding boxes across various artistic styles and species. We also propose a high-resolution network with a joint learning strategy and a Channel-Spatial Collaborative Attention (CSA) module. This framework enables cross-domain learning between real and cartoon animal data, while the CSA module improves feature representation by modeling both channel and spatial dimensions. Our approach achieves an AP@0.5 of 66.4\% on the CartoonAnimalPose dataset. This work establishes the first standardized benchmark for cartoon animal pose estimation and provides new insights into cross-domain learning and attention mechanisms for keypoint detection. The dataset and code are available at https://github.com/Jll0716/CartoonAnimalPose.
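A rough analogue of the channel-spatial attention idea described above is a CBAM-style module that first re-weights channels from pooled descriptors and then re-weights spatial locations from channel-wise statistics. The paper's actual CSA design may differ; the sketch below is illustrative only.

```python
# CBAM-style channel + spatial attention as a rough analogue of a CSA module.
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                      # x: (B, C, H, W)
        # Channel attention from average- and max-pooled descriptors.
        avg = x.mean(dim=(2, 3))
        mx, _ = x.flatten(2).max(dim=2)
        ca = torch.sigmoid(self.channel_mlp(avg) + self.channel_mlp(mx))
        x = x * ca[:, :, None, None]
        # Spatial attention from channel-wise mean and max maps.
        sa = torch.sigmoid(self.spatial_conv(
            torch.cat([x.mean(1, keepdim=True), x.max(1, keepdim=True).values], dim=1)))
        return x * sa
```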