Batch Specular Manifold Sampling for Caustics Rendering
ABSTRACT. Caustics rendering has long been a formidable challenge in light transport simulation. Zeltner et al. recently introduced Specular Manifold Sampling (SMS), a technique that handles caustics without bias within the traditional Monte Carlo framework. SMS uses Bernoulli trials to estimate the inverse probability of sampled specular paths. However, in scenes with complex geometry this estimation becomes computationally intensive and its accuracy suffers. To address this problem, this paper presents a method that speeds up the estimation and improves its precision. Our approach allocates Bernoulli trials among the diverse specular solutions, spreading the computational burden and increasing the effective sampling rate for both the shading point and the individual solutions, which in turn refines the accuracy of the inverse probability estimation. Empirical results show that the proposed method surpasses existing state-of-the-art techniques and markedly reduces the variance of caustics rendering.
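To make the estimator concrete, here is a minimal Python sketch of the Bernoulli-trial inverse-probability estimate that SMS-style methods rely on, together with an illustrative batched variant that shares one trial budget across several solutions. The helper manifold_walk, the solution identifiers, and the allocation policy are assumptions for illustration, not the paper's implementation.

```python
import random

def estimate_inverse_prob(shading_point, target_solution, manifold_walk, max_trials=10_000):
    """Unbiased 1/p estimate used in SMS-style estimators: repeat independent
    Bernoulli trials (randomly seeded manifold walks) until a walk converges to
    `target_solution` again; the trial count is a geometric sample with mean 1/p."""
    for n in range(1, max_trials + 1):
        seed = random.random()
        if manifold_walk(shading_point, seed) == target_solution:
            return n          # E[n] = 1/p, so n is an unbiased estimate of 1/p
    return max_trials         # truncation keeps the estimator bounded (small bias)

def batched_inverse_probs(shading_point, solution_ids, manifold_walk, trial_budget=256):
    """Hypothetical batched variant: spend one shared pool of trials and credit each
    trial to whichever solution it converges to, estimating all solutions at once."""
    hits = {s: 0 for s in solution_ids}
    for _ in range(trial_budget):
        s = manifold_walk(shading_point, random.random())   # returns a solution id
        if s in hits:
            hits[s] += 1
    # relative frequencies give low-variance (though biased) reciprocal estimates
    return {s: trial_budget / max(h, 1) for s, h in hits.items()}
```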
Distilling Complementary Information from Temporal Context for Enhancing Human Appearance in Human-Specific NeRF
ABSTRACT. Reconstructing and animating digital avatars with free views from monocular videos has long been an interesting research task in computer vision. Recently, some methods have introduced a new line of work that leverages the neural radiance field to represent the human body in a canonical space with the help of the SMPL model. By deforming points from the observation space into the canonical space, human appearance can be learned across various poses and viewpoints. However, previous methods rely heavily on pose-dependent representations learned through frame-independent optimization and ignore the temporal context across the continuous motion video, which degrades the generated dynamic appearance textures. To overcome these problems, we propose TMIHuman, a novel free-viewpoint rendering framework. It aims to introduce temporal information into NeRF-based rendering and to distill task-relevant information from complex pixel-wise representations. Specifically, we build a Temporal Fusion Encoder that imports timestamps into the learning of non-rigid deformation and fuses the visual features of other frames into the human representation. We then propose to disentangle the fused features and extract useful visual cues via mutual information objectives. We have extensively evaluated our method and achieved state-of-the-art performance on different public datasets.
Attention-Guided Self-Supervised Distinctive Region Detection in Point Clouds
ABSTRACT. Detecting distinctive regions in point clouds is a fundamental task in shape analysis, critical for applications such as fine-grained classification, shape retrieval, and shape matching. Recent unsupervised deep-learning approaches have shown promise, moving beyond hand-crafted features and labeled data. However, their results as well as their specific distinctive point selection mechanisms leave room for improvement. This work aims to enhance these approaches and extend them to more general learning scenarios. We propose two key algorithmic improvements. First, an attention-based mechanism for selecting distinctive points, and second, a novel Semantic Consistency Loss that enhances the framework's ability to identify meaningful distinctive regions consistently within a given shape. Additionally, we extend the framework to a few-shot learning setup, useful in cases where distinctive regions are ambiguous or poorly defined. To support our research, we have constructed what we believe to be the first benchmark with ground-truth distinctive region labels. Our experimental results, conducted across multiple real and synthetic datasets, demonstrate that our approach, dubbed Distinctive Region Attention-Guided detection in point clouds (DRAG), provides significant improvements over state-of-the-art methods.
Energy-guided Test-time Adaptation for Data Shifts in Multi-modal Perception
ABSTRACT. In multi-modal perception tasks, test-phase data often suffers from environmental interference and sensor degradation. This causes distribution shifts from the training phase, significantly degrading model performance. Test-Time Adaptation (TTA) is an emerging unsupervised learning strategy that allows pre-trained models to adapt to new data distributions during testing without requiring labeled data. However, existing TTA methods often perform poorly on multi-modal data, as they typically focus on single-modal tasks and fail to fully leverage complementary information between modalities. Furthermore, some TTA methods rely on high-confidence pseudo-labels to update model parameters, which can result in a lack of effective information to support model parameter fine-tuning when all modalities are corrupted, potentially leading to worse performance than before fine-tuning. To address these issues, we propose a two-stage TTA framework that incorporates an energy-guided loss function and a memory bank mechanism. The energy-guided loss function smooths class distributions within each batch, reducing overconfidence from noisy pseudo-labels. The memory bank stores high-confidence samples for each class, allowing the model to refine predictions for low-confidence samples without additional parameter updates, mitigating catastrophic forgetting. Our method demonstrates superior robustness in multi-modal tasks, significantly outperforming state-of-the-art methods in scenarios with varying levels of modality corruption, particularly under severe distribution shifts.
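As an illustration of the kind of objective described above, the following PyTorch sketch combines a per-sample free-energy term with a batch-level entropy term that discourages overconfident, collapsed predictions; the exact temperature, weighting, and formulation used by the paper are assumptions, not its actual loss.

```python
import torch
import torch.nn.functional as F

def energy_adaptation_loss(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Illustrative energy-guided TTA objective: the free energy of a sample is
    E(x) = -T * logsumexp(logits / T). Minimizing the mean energy while maximizing
    the entropy of the batch-averaged prediction smooths per-batch class
    distributions and reduces overconfidence from noisy pseudo-labels."""
    energy = -temperature * torch.logsumexp(logits / temperature, dim=1)   # (B,)
    probs = F.softmax(logits, dim=1)
    marginal = probs.mean(dim=0)                                           # batch-level class distribution
    marginal_entropy = -(marginal * torch.log(marginal + 1e-8)).sum()
    return energy.mean() - marginal_entropy
```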
Bijective Spherical Parameterization via Stereographic Projection
ABSTRACT. Spherical parameterization serves as a critical technique in computer graphics, with existing methods primarily focusing on minimizing isometric distortion. Despite significant advancements, current approaches often struggle to simultaneously achieve both bijectivity and low isometric distortion. This paper presents a novel method for computing continuous spherical parameterizations that addresses these challenging constraints.
Our approach combines inverse stereographic projection with a bijective mapping of the mesh to the extended complex plane. Unlike traditional methods that map triangles to geodesic spherical triangles, our technique continuously maps input mesh triangles to stereographic triangles, resulting in a strictly one-to-one and onto mapping. We also identify a critical limitation in existing approaches: vanilla triangles (i.e., triangles with straight edges) in the extended complex plane can induce significant distortion near the poles during stereographic projection. To mitigate this issue, we propose using Bézier triangles instead of vanilla triangles, which notably reduces isometric distortion. Through extensive experiments and comparisons with state-of-the-art techniques, we demonstrate the effectiveness and robustness of our method across various input scenarios.
Dynamic Voxel Grid Optimization for High-Fidelity RGB-D Supervised Surface Reconstruction
ABSTRACT. Direct optimization of interpolated features on multi-resolution voxel grids has emerged as a more efficient alternative to MLP-like modules. However, this approach is constrained by higher memory expenses and limited representation capabilities. In this paper, we introduce a novel dynamic grid optimization method for high-fidelity 3D surface reconstruction that incorporates both RGB and depth observations. Rather than treating each voxel equally, we optimize the process by dynamically modifying the grid and assigning finer-scale voxels to regions of higher complexity, allowing us to capture more intricate details. Furthermore, we develop a scheme to quantify the dynamic subdivision of the voxel grid during optimization without requiring any priors. The proposed approach generates high-quality 3D reconstructions with fine details on both synthetic and real-world data while maintaining computational efficiency, running substantially faster than the baseline method NeuralRGBD.
MOT FCG++: Enhanced Representation of Spatio-temporal Motion and Appearance Features
ABSTRACT. The goal of multi-object tracking (MOT) is to detect and track all objects in a scene across frames while maintaining a unique identity for each object. Most existing methods rely on the spatio-temporal motion features and appearance embedding features of the detected objects in consecutive frames. Effectively and robustly representing the spatial and appearance features of long trajectories has become a critical factor affecting MOT performance. We propose a novel approach for appearance and spatio-temporal motion feature representation, improving upon the hierarchical clustering association method MOT FCG. For spatio-temporal motion features, we first propose Diagonal Modulated GIoU, which more accurately represents the relationship between the position and shape of objects. Second, Mean Constant Velocity Modeling is proposed to reduce the effect of observation noise on target motion state estimation. For appearance features, we utilize a dynamic appearance representation that incorporates confidence information, making trajectory appearance features more robust and global. Based on the baseline model MOT FCG, we achieve further improvements across all metrics: 63.1 HOTA, 76.9 MOTA, and 78.2 IDF1 on the MOT17 test set, reaching SOTA level in the FP, Prcn, and DetPr metrics and ranking first in the private detection rankings. The code will be publicly available at the time of publication.
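For orientation, the sketch below computes a standard GIoU score with a hypothetical diagonal-based penalty; the actual Diagonal Modulated GIoU in MOT FCG++ is not specified here, so the modulation term should be read as an assumption.

```python
import numpy as np

def diag_modulated_giou(a, b):
    """Boxes are (x1, y1, x2, y2). Standard GIoU plus a hypothetical penalty that
    compares box diagonals, so boxes with similar scale and shape are favoured."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # smallest enclosing box
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    c_area = (cx2 - cx1) * (cy2 - cy1)
    giou = iou - (c_area - union) / c_area
    # assumed diagonal modulation: penalize mismatched box diagonals
    diag_a = np.hypot(a[2] - a[0], a[3] - a[1])
    diag_b = np.hypot(b[2] - b[0], b[3] - b[1])
    return giou - (1.0 - min(diag_a, diag_b) / max(diag_a, diag_b))
```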
Attention-enhanced 3D craniomaxillofacial anatomical landmark detection based on projection
ABSTRACT. Craniomaxillofacial deformities can substantially disrupt quality of life, and treatments for these conditions are complex and risky. Intelligent surgical planning systems hold the potential to address these challenges effectively, and one key aspect lies in the accurate detection of anatomical landmarks within the craniomaxillofacial region. However, in clinical applications, landmark detection based on 2D images does not adequately represent the structural details of the craniomaxillofacial region, and landmark detection on 3D models is largely confined to CBCT datasets, which limits its adaptability to more general modeling platforms. We propose an attention-enhanced method based on common 3D models for automatically detecting craniomaxillofacial anatomical landmarks. The approach relies on the dimensional reduction of a 3D craniomaxillofacial model through multi-angle projections under multiple lighting conditions. To enhance the robustness of our method, an attention module is introduced to integrate detection results from projections under diverse lighting conditions. We also design a user-friendly interactive detection interface that allows users to inspect the models and the associated detection outcomes in detail. Our method has been evaluated using data from professional surgeons at the Chinese Academy of Medical Sciences and Peking Union Medical College. The average landmark localization error on the 3D model is constrained to approximately 2.51 mm. In both the experimental evaluations and the user study, the qualitative and quantitative results consistently indicate that our technique can accurately detect 3D craniomaxillofacial anatomical landmarks. Our prototype system can provide an efficient and precise solution for the potential intelligent automation of craniomaxillofacial surgery.
SWAN: A Synergistic Wavelet Attention Network for Enhanced Underwater Image Enhancement
ABSTRACT. Underwater images are essential for advancing research in marine biology, archaeology, and robotics; however, capturing high-quality images in aquatic environments remains a significant challenge due to factors such as light attenuation, scattering, and color distortion. To address these limitations, Underwater Image Enhancement (UIE) techniques have been developed to improve image clarity and fidelity, thereby enabling more accurate scientific analysis. In this work, we introduce the Synergistic Wavelet Attention Network (SWAN), a novel UIE model built upon the U-Net architecture, which incorporates global frequency-domain feature refinement and a hybrid attention mechanism. Specifically, our Hierarchical Wavelet Fusion Convolution (HWF-Conv) module leverages wavelet transforms to perform multi-frequency decomposition of input images, effectively capturing both low-frequency structural information and high-frequency details. This approach significantly enhances the network's multiscale representation capabilities, which are critical for processing complex underwater scenes. Furthermore, the Synergistic Attention Mixing Transformer (SAM-Transformer) integrates Dimensional and Hierarchical Receptive Mixing layers to capture both local and global dependencies within the image, thereby improving the recovery of fine details and overall visual quality. Extensive experimental evaluations demonstrate that SWAN achieves superior performance compared to existing deep learning-based UIE methods, excelling in both objective quality metrics and perceptual similarity measures. Quantitative and qualitative results on multiple benchmark underwater image datasets confirm that our proposed model outperforms state-of-the-art approaches, establishing a new standard for underwater image enhancement.
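The following sketch shows, under stated assumptions, how a wavelet-based module in the spirit of HWF-Conv could split an image into low- and high-frequency bands using PyWavelets; it is illustrative only and does not reproduce the paper's module.

```python
import numpy as np
import pywt

def wavelet_bands(image: np.ndarray, wavelet: str = "haar"):
    """Illustrative single-level 2D DWT: splits each channel of an H x W x C image
    into a low-frequency approximation (global structure) and three high-frequency
    detail bands (horizontal, vertical, diagonal edges and texture)."""
    lows, highs = [], []
    for c in range(image.shape[2]):
        ll, (lh, hl, hh) = pywt.dwt2(image[:, :, c], wavelet)
        lows.append(ll)
        highs.append(np.stack([lh, hl, hh], axis=-1))
    return np.stack(lows, axis=-1), np.stack(highs, axis=-2)

# The low band could feed a coarse enhancement path while the high bands feed a
# detail-preserving path, before the branches are fused back together.
```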
MDSAM: Intrinsic Cues Guided Segmentation for Mirror Detection
ABSTRACT. Mirror detection is crucial for avoiding collisions and misrecognition of reflections in real-world scenes. Existing methods often rely on assumptions, such as semantic similarity between objects and their reflections or contextual correlations, which may not hold in cluttered environments, limiting generalization. Recent advances in foundation models like the Segment Anything Model (SAM) have boosted general-purpose segmentation, yet SAM struggles with mirrors due to prompt ambiguity and reflective complexity. Moreover, prior methods neither exploit SAM nor generalize well to diverse scenes. To address this, we adapt SAM for mirror segmentation by introducing a Frequency-Chirality Adapter (FCA), which encodes frequency-domain textures and visual chirality to enhance mirror-specific perception. Additionally, we design a Prior-Aware Localization (PAL) module to provide automatic prompt guidance from effective semantic priors, eliminating the need for manual inputs. Experiments on PMD, MSD, and RGBD-Mirror datasets show that our method consistently outperforms previous state-of-the-art approaches, significantly enhancing SAM’s performance on mirror segmentation.
Hierarchical Action Recognition: A Contrastive Video-Language Approach with Hierarchical Interactions
ABSTRACT. Video recognition remains an open challenge, requiring the identification of diverse content categories within videos. Mainstream approaches often perform flat classification, overlooking the intrinsic hierarchical structure relating categories. To address this, we formalize the novel task of hierarchical video recognition, and propose a video-language learning framework tailored for hierarchical recognition. Specifically, our framework encodes dependencies between hierarchical category levels, and applies a top-down constraint to filter recognition predictions. We further construct a new fine-grained dataset based on medical assessments for rehabilitation of stroke patients, serving as a challenging benchmark for hierarchical recognition. Through extensive experiments, we demonstrate the efficacy of our approach for hierarchical recognition, significantly outperforming conventional methods, especially for fine-grained subcategories. The proposed framework paves the way for hierarchical modeling in video understanding tasks, moving beyond flat categorization.
Dynamic Prompting and Cross-Modal Attention for Context-Aware Multimodal Emotion Recognition
ABSTRACT. Multimodal emotion recognition is crucial for applications like human computer interaction (HCI) and mental health, but capturing fine-grained emotions from complex scenes remains challenging. Existing methods often generate descriptions lacking context-specific emotional nuance, struggle to effectively fuse visual and textual cues, and may suffer from inconsistencies between explanations and predictions. This paper proposes CARE (Context-Aware Reasoning for Emotion), a novel framework that dynamically integrates scene-level context and textual reasoning. CARE first generates emotion-guided prompts by combining predicted affective cues with structured scene features, guiding a vision-language model to produce contextually rich descriptions. It then applies a targeted cross-modal attention mechanism that aligns visual observations with relevant semantic details. Finally, a consistency loss enforces agreement between the predicted emotion distribution and that inferred from the generated text, improving robustness and interpretability. Experiments show our framework achieves 36.82% mAP on EMOTIC and 90.17% accuracy on CAER-S, while also significantly improving description quality. These results underscore the framework's effectiveness in modeling complex scene-emotion interactions for context-aware emotion recognition.
UNet-3D with Adaptive TverskyCE Loss for Pancreas Medical Image Segmentation
ABSTRACT. Pancreatic cancer, which has a low survival rate, is among the most intractable of all cancers. Most diagnoses of this cancer depend heavily on abdominal computed tomography (CT) scans; pancreas segmentation is therefore crucial but challenging. Because the pancreas is small and sits in an obscure position surrounded by other large organs, it is often occluded and difficult to detect. Given these challenges, segmentation results from Deep Learning (DL) models still need improvement. In this research, we propose a novel adaptive TverskyCE loss for DL model training, which combines Tversky loss with cross-entropy loss using learnable weights. Our method enables the model to adjust each loss contribution automatically and find the best objective function during training. All experiments were conducted on the National Institutes of Health (NIH) Pancreas-CT dataset. We evaluated the adaptive TverskyCE loss on UNet-3D and Dilated UNet-3D, and our method achieved a Dice Similarity Coefficient (DSC) of 85.59%, with peak performance up to 95.24%, and a score of 85.14%. DSC and the score improved by 9.47% and 8.98%, respectively, compared with the baseline UNet-3D with Tversky loss for pancreas segmentation.
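A minimal PyTorch sketch of a TverskyCE-style loss with a learnable mixing weight is given below; the alpha/beta settings and the sigmoid parameterization of the weight are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveTverskyCELoss(nn.Module):
    """Sketch of a TverskyCE-style loss: Tversky loss and cross-entropy combined with
    a learnable mixing weight, so the balance is found during training rather than
    being fixed by hand."""
    def __init__(self, alpha=0.3, beta=0.7, eps=1e-6):
        super().__init__()
        self.alpha, self.beta, self.eps = alpha, beta, eps
        self.mix = nn.Parameter(torch.zeros(1))        # sigmoid(mix) lies in (0, 1)

    def forward(self, logits, target):
        # logits: (B, C, D, H, W); target: (B, D, H, W) integer labels
        ce = F.cross_entropy(logits, target)
        probs = F.softmax(logits, dim=1)
        onehot = F.one_hot(target, probs.shape[1]).permute(0, 4, 1, 2, 3).float()
        dims = (0, 2, 3, 4)
        tp = (probs * onehot).sum(dims)
        fp = (probs * (1 - onehot)).sum(dims)
        fn = ((1 - probs) * onehot).sum(dims)
        tversky = (tp + self.eps) / (tp + self.alpha * fp + self.beta * fn + self.eps)
        tversky_loss = 1 - tversky.mean()
        w = torch.sigmoid(self.mix)
        return w * tversky_loss + (1 - w) * ce
```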
Building 3DGS Representation for Single Interested Object via Joint Segmentation-training Framework
ABSTRACT. Interactive segmentation of a single object of interest in 3D Gaussians brings new opportunities for 3D scene understanding, thanks to the real-time representation offered by 3D Gaussian Splatting (3DGS). However, current 3D segmentation methods on 3D Gaussians follow a two-stage framework that first reconstructs the entire scene and then performs segmentation, leading to ambiguous Gaussians near boundaries and a large time cost. To address these challenges, we introduce a joint segmentation-training framework whose purpose is to rapidly build an accurate 3DGS representation of the object of interest. Our method aims to obtain multiview-consistent segmentation masks and to rapidly build the 3DGS representation from them. To obtain multiview-consistent masks, we use a confidence-based filtering strategy to divide the masks provided by a pre-trained segmentation model into correct and incorrect masks, and train the Gaussians with the correct ones to build a coarse 3DGS representation of the object. The coarse representation is then used in our mask correction process to fix the incorrect masks. In addition, to accurately represent the object in scenes with complex occlusions, we use a weighted loss function that avoids computing gradients in objectively occluded areas. Experiments show that our method achieves the highest segmentation accuracy and the lowest time cost for single objects of interest, and improves robustness against inaccurate masks from a pre-trained video segmentation model.
Poll-Sketcher: Visual Exploration of Time-Varying Air Pollutant Data Based on Hand-drawn Sketches
ABSTRACT. Air pollution has recently attracted growing attention from different stakeholders, including governments, industry, and the public. Understanding the common change patterns of air pollutants is crucial for effective air pollution prevention and treatment. To achieve this, stakeholders often need to query for an intended change pattern, which can be conveniently specified by sketching, motivating the development of a sketch-based querying system. This is nevertheless a challenging task, as air pollutant data are typically multivariate (e.g., PM2.5, PM10, SO2), multiscale (e.g., day, week, and month scales), and large in volume, spanning several years. To address these challenges, we present Poll-Sketcher, a sketch-query visualization system that allows users to sketch freehand and then view the air pollutant changes that match the sketch at different scales. Poll-Sketcher leverages a set of time series transformation methods, including smoothing and sampling, to handle multi-scale and large-volume time series data. We propose a new similarity measure, SDist-based Sketch Matching, to compute the similarity between hand-drawn sketches and the trends of time series data, accurately matching scaleless hand sketches with air pollutant series at varying scales. We further develop multiple coordinated views to help users explore insightful information on air pollutants from spatial, temporal, and query perspectives. Quantitative comparisons with existing sketch-based query methods demonstrate the effectiveness of our proposed sketch query method. Case studies and expert feedback further demonstrate that Poll-Sketcher can meet analysts' needs for querying patterns in multivariate time series and support an understandable, flexible, and user-friendly exploration process.
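As a rough illustration of scaleless sketch-to-series matching, the snippet below resamples and z-normalizes a hand-drawn sketch before comparing it against sliding windows of a pollutant series; the actual SDist measure is not reproduced here, and the normalization and distance are generic assumptions.

```python
import numpy as np

def trend_distance(sketch: np.ndarray, window: np.ndarray) -> float:
    """Scaleless trend matching (illustrative): resample the sketch to the window
    length, z-normalize both so only the shape of the change pattern matters, and
    compare with a Euclidean distance. Lower is a better match."""
    xs = np.linspace(0, 1, len(window))
    resampled = np.interp(xs, np.linspace(0, 1, len(sketch)), sketch)
    znorm = lambda v: (v - v.mean()) / (v.std() + 1e-8)
    return float(np.linalg.norm(znorm(resampled) - znorm(window)))

def best_matches(sketch, series, window_len, top_k=5):
    """Slide over the series (e.g. smoothed PM2.5 values at day or week scale) and
    return the starting indices of the top-k closest windows."""
    dists = [trend_distance(sketch, series[i:i + window_len])
             for i in range(len(series) - window_len + 1)]
    return np.argsort(dists)[:top_k]
```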
An Asymmetric Calibrated Transformer Network for Underwater Image Restoration
ABSTRACT. Underwater image restoration remains challenging due to complex light propagation and scattering effects in aqueous environments. While recent transformer-based methods have shown promising results in image restoration tasks, they often struggle with the domain gap between underwater and air-captured images and lack explicit mechanisms to handle underwater-specific degradations. This paper presents AsymCT-UIR, a novel asymmetric transformer network specifically designed for underwater image restoration. Unlike conventional symmetric architectures, our model employs distinct processing pathways: an encoder with learnable calibration modules for underwater-to-air domain adaptation, and a decoder with dual attention transformers for feature restoration. The asymmetric design enables effective distortion correction while preserving fine details through adaptive feature calibration and efficient attention mechanisms. Extensive experiments on benchmark datasets demonstrate that AsymCT-UIR achieves superior performance compared to state-of-the-art methods. The proposed method shows robust performance across various underwater conditions and degradation types, making it practical for real-world applications.
Instance-guided Cartoon Editing with a Curated Large-scale Dataset
ABSTRACT. Cartoon editing, appreciated by both professional illustrators and hobbyists, allows extensive creative freedom and the development of original narratives within the cartoon domain. However, existing cartoon editing workflows are complex and lean heavily on manual operations, owing to the challenge of automatically identifying individual character instances. Automated segmentation of these elements is therefore imperative to facilitate a variety of cartoon editing applications such as visual style editing, motion decomposition and transfer, and the computation of stereoscopic depths for an enriched visual experience. Unfortunately, most current segmentation methods are designed for natural photographs and fail to cope with the intricate aesthetics of cartoon subjects, lowering segmentation quality. The major challenge stems from two key shortcomings: the rarity of high-quality cartoon-dedicated datasets and the absence of competent models for high-resolution instance extraction on cartoons. To address this, we introduce a high-quality dataset of over 100k paired high-resolution cartoon images and their instance labeling masks. We also present an instance-aware image segmentation model that can generate accurate, high-resolution segmentation masks for characters in cartoon images. We show that the proposed approach enables a range of segmentation-dependent cartoon editing applications such as 3D Ken Burns parallax effects, text-guided cartoon style editing, and puppet animation from illustrations and manga.
ABSTRACT. This paper proposes a method to estimate the locations of grid handles in free-form deformation (FFD) while preserving the local shape characteristics of the 2D/3D input model embedded in the grid, named locality-preserving FFD (lp-FFD). Users first specify some vertex locations in the input model and grid handle locations. The system then optimizes all grid handle locations by minimizing the distortion of the input model's mesh elements. The proposed method is fast and stable, allowing the user to directly and indirectly shape the deformed mesh model and grid. This paper shows several deformation results to demonstrate the robustness of our lp-FFD. In addition, we conducted a user study and confirmed that our lp-FFD's efficiency and effectiveness in shape deformation are higher than those of existing methods used in commercial software.
ABSTRACT. Path tracing methods typically generate incoherent rays due to randomized direction sampling, leading to inefficiency on modern processors that rely on coherence. While previous approaches aimed to enhance coherence by reordering rays based on origins and directions, they often suffer from significant overhead of ray encoding and sorting, and show limited effectiveness in large, complex scenes. To further accelerate performance, we propose a technique to generate coherent rays directly by reusing secondary ray directions within spatially grouped pixels, thereby eliminating the need for reordering. Additionally, to control sampling correlation and preserve visual quality, we introduce an interleaved grouping strategy that distributes shared directions while maintaining local coherence. Compared to traditional reordering-based methods, our approach achieves significant speedup while maintaining high rendering quality with minimal artifacts, as demonstrated across a variety of test scenes.
Submodular-based View Selection for Low-Quality Points Rendering with Multi-Feature Point-based NeRF
ABSTRACT. NeRF has revolutionized view synthesis and 3D reconstruction. However, significant challenges persist when using RGB/RGB-D sensors for 3D reconstruction. Issues such as uneven lighting, object occlusions, or incomplete scanning often result in incomplete reconstructed point clouds, even when the captured images contain comprehensive information about the entire scene. To tackle these challenges effectively, we propose a robust approach with three essential components: (1) a submodular-driven view selection strategy that maximizes scene coverage from limited views; (2) a multi-feature fusion technique that combines point, voxel, and pixel features to enhance the rendering of scenes with low-quality point clouds, where point features are derived from neighboring surface points, voxel features are learned through a 3D-UNet, and pixel features are extracted from pre-selected high-quality views; and (3) a hybrid rendering approach that balances non-trainable and trainable features for efficient, high-quality rendering. Experiments on the NeRF Synthetic and ScanNet datasets demonstrate that our approach improves rendering and reconstruction quality, particularly in scenes with low-quality point clouds, outperforming existing point-based neural rendering methods across various environmental conditions.
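The view-selection component can be illustrated with the classic greedy algorithm for monotone submodular coverage maximization, sketched below; the coverage oracle covered_points is a hypothetical helper, and the paper's actual objective may differ.

```python
def greedy_view_selection(views, covered_points, budget):
    """Greedy maximization of a submodular coverage objective: repeatedly pick the
    view whose newly covered scene points add the most coverage. For monotone
    submodular objectives this greedy rule is within (1 - 1/e) of optimal.
    `covered_points(view)` is an assumed helper returning the set of points a view sees."""
    selected, covered = [], set()
    candidates = list(views)
    for _ in range(budget):
        best, best_gain = None, 0
        for v in candidates:
            gain = len(covered_points(v) - covered)     # marginal coverage gain
            if gain > best_gain:
                best, best_gain = v, gain
        if best is None:                                # no remaining view adds coverage
            break
        selected.append(best)
        covered |= covered_points(best)
        candidates.remove(best)
    return selected
```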
MAS-KCL: Knowledge Component Graph Structure Learning with Large Language Model-based Agentic Workflow
ABSTRACT. Knowledge components (KCs) are the fundamental units of knowledge in the field of education. A KC graph illustrates the relationships and dependencies between KCs. An accurate KC graph can assist educators in identifying the root causes of students' poor performance on specific KCs, thereby enabling targeted instructional interventions. To achieve this, we have developed a KC graph structure learning algorithm, named MAS-KCL, which employs a multi-agent system driven by large language models for fine-tuning and optimization of the KC graph. Additionally, a bidirectional feedback mechanism is integrated into the algorithm, where AI agents leverage this mechanism to assess the value of edges within the KC graph and adjust the distribution of generation probabilities for different edges, thereby accelerating the efficiency of structure learning. We applied the proposed algorithm to a real-world educational dataset, and experimental results validate its effectiveness in learning path recognition. By accurately identifying students' learning paths, teachers are able to design more comprehensive learning plans, enabling learners to achieve their educational goals more effectively, thus promoting the sustainable development of education.
AMNet: An Attention-Enhanced Multi-Branch Network for Micro-Expression Recognition
ABSTRACT. Micro-expressions, subtle facial movements that reveal concealed emotions, are challenging to recognize due to their short duration and low intensity, as well as the scarcity of balanced datasets. This paper proposes an attention-enhanced multi-branch network (AMNet) with three core innovations to overcome these limitations in micro-expression recognition. First, an improved attention mechanism is designed to dynamically emphasize the discriminative facial regions critical for identifying subtle expressions. Second, a spatiotemporal fusion module, built on the synergy of a lightweight STSTNet and ConvLSTM, efficiently integrates spatial and temporal information for comprehensive modeling of micro-expression dynamics. Third, a hierarchical feature fusion strategy progressively refines multi-branch features, ensuring robust learning of micro-expression characteristics. Extensive experiments on CASME II, SAMM, SMIC, and CAS(ME)³ demonstrate AMNet's superiority over state-of-the-art approaches, with exceptional accuracy and generalization. The code will be released publicly upon acceptance of this paper.
ABSTRACT. Virtual try-on (VTON) technology enables the rapid creation of realistic try-on experiences, which makes it highly valuable for the metaverse and e-commerce. However, 2D VTON methods struggle to convey depth and immersion, while existing 3D methods require multi-view garment images and face challenges in generating high-fidelity garment textures. To address these limitations, this paper proposes PG-VTON, a panoramic Gaussian VTON framework guided solely by front-and-back garment information, which uses an adapted locally controllable diffusion model to generate virtual dressing effects in specific regions. Specifically, PG-VTON adopts a coarse-to-fine architecture consisting of two stages. The coarse editing stage employs the locally controllable diffusion model with a score distillation sampling (SDS) loss to generate coarse garment geometries with high-level semantics. The refinement stage applies the same diffusion model with a photometric loss not only to enhance garment details and reduce artifacts but also to correct unwanted noise and distortions introduced during the coarse stage, thereby effectively enhancing realism. To improve training efficiency, we further introduce a dynamic noise scheduling (DNS) strategy, which ensures stable training and high-fidelity results. Experimental results demonstrate the superiority of our method, which achieves geometrically consistent and highly realistic 3D virtual try-on generation.
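For reference, score distillation sampling is usually written as the following gradient; the weighting w(t) and the conditioning signal used in PG-VTON may differ from this generic form.

```latex
\nabla_{\theta}\mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\,
    \bigl(\epsilon_{\phi}(\mathbf{x}_t;\, y,\, t) - \epsilon\bigr)\,
    \frac{\partial \mathbf{x}}{\partial \theta} \right],
\qquad \mathbf{x}_t = \alpha_t\,\mathbf{x}(\theta) + \sigma_t\,\epsilon .
```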
Towards Extended Reality in Emergency Response: Guidelines and Challenges for First Responder Friendly Augmented Interfaces
ABSTRACT. As Extended Reality (XR) technologies continue gaining popularity, various domains seek to integrate them into their workflows to enhance performance and user satisfaction. However, integrating XR technologies into emergency response presents unique challenges. Unlike other fields, such as healthcare, entertainment, or education, emergency response involves physically demanding environments and information-intensive tasks that first responders (FRs) must perform. Augmented reality (AR) head-mounted displays (HMDs) present promising solutions for improving situational awareness and reducing the cognitive load of the FRs. However, there has been limited research focused on the specific needs of FRs. Moreover, existing studies investigating FR needs have primarily been conducted in controlled laboratory settings, revealing a significant gap in the literature concerning FR requirements in real-life scenarios.
This work addresses this gap through a comprehensive user study with subject matter experts (SMEs) and FRs. User studies were conducted after two different real-life scenarios using AR HMDs. To further understand FR needs, we extensively reviewed the literature for similar studies that reported FR needs, explicitly focusing on studies including interviews with SMEs and FRs. Our findings identified key design guidelines for FR-friendly AR interfaces while also highlighting the direction for future research to improve the user experience of the FRs.
Predicting and Optimizing Crowd Evacuations: An Explainable AI Approach
ABSTRACT. In this paper, we explore the usability of an explainable Artificial Neural Network (ANN) model to provide recommendations for architectural improvements aimed at enhancing crowd safety and comfort during emergency situations. We trained an ANN to predict the outcomes of crowd simulations without the need for direct simulation while also generating recommendations for the studied space. Our dataset comprises approximately 36,000 simulations of diverse crowds evacuating rooms of different sizes, capturing data on room characteristics, crowd composition, evacuation times, densities, and velocities. To identify the most influential environmental factors affecting evacuation performance, we employ Shapley values. Based on these insights, we propose modifications to the architectural design of the space. Our results demonstrate that the proposed model effectively predicts crowd dynamics and provides meaningful recommendations for improving evacuation efficiency and safety.
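A small, self-contained sketch of how Shapley values can rank input factors for a surrogate evacuation model is shown below; the model, feature names, and synthetic data are stand-ins, not the study's actual setup.

```python
import numpy as np
import shap
from sklearn.neural_network import MLPRegressor

# Hypothetical stand-in for the evacuation ANN: features are room/crowd descriptors
# (exit width, room area, crowd size, obstacles) and the target is evacuation time.
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 4))
y = 3.0 * X[:, 2] - 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)
model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000).fit(X, y)

# Shapley values attribute each prediction to the input features; averaging their
# absolute values ranks the environmental factors by influence on the outcome.
explainer = shap.Explainer(model.predict, X[:100])   # background sample as reference
shap_values = explainer(X[:200])
importance = np.abs(shap_values.values).mean(axis=0)
print(dict(zip(["exit_width", "room_area", "crowd_size", "obstacles"], importance)))
```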
Adaptive Sampling for Interactive Simulation of Granular Material
ABSTRACT. We present a method for simulating granular materials faster within a position-based dynamics framework. We do this by combining an adaptive particle sampling scheme with an upsampling approach, allowing faster simulations in interactive applications while maintaining visual resolution. Particles are merged or split based on their distance from the boundary, allowing for high detail in areas of importance such as the surface and edges. Merging particles into a single particle reduces the number of particles for which collisions have to be simulated, thus reducing the overall simulation time. The adaptive sampling technique is then combined with an upsampling scheme that gives the coarser particle simulation the appearance of a much finer resolution.
Dynamic Translational Gains Manipulation for Tiny Object Interaction
ABSTRACT. Interacting with small objects in virtual reality (VR) can be challenging due to the physical limitations of controllers and headsets, which often lead to unintended collisions and tracking loss when devices come too close, thereby disrupting the user's immersive experience. While researchers have developed techniques like translational gain, hand remapping, and specialized interaction methods to address these challenges, these approaches are often task-specific or insufficient for precise, detailed interaction or observation. To address these challenges, we introduce a novel interaction technique called dynamic translational gains manipulation (DTGM), which adjusts scaling in real time based on the user's proximity to objects. We conducted a user study to evaluate how effectively DTGM improves precision during object manipulation and to understand the subjective mental workload of the proposed technique. Our results reveal that DTGM improved interaction efficiency, making it suitable for various VR applications where precision and space optimization are crucial.
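As a simple illustration of a proximity-based gain schedule of the kind DTGM describes, the function below reduces translational gain as the hand approaches a small object; the thresholds and the linear ramp are assumptions, not the paper's mapping.

```python
def dynamic_gain(distance_to_object, near=0.05, far=0.50, min_gain=0.25, max_gain=1.0):
    """Illustrative DTGM-style mapping: as the hand approaches a small object, the
    translational gain is reduced smoothly, so large physical motions map to small,
    precise virtual motions near the target. Distances are assumed to be in metres."""
    t = (distance_to_object - near) / (far - near)
    t = max(0.0, min(1.0, t))                 # clamp interpolation factor to [0, 1]
    return min_gain + (max_gain - min_gain) * t

# Per frame, the virtual hand displacement would be scaled by this gain:
# virtual_delta = dynamic_gain(d) * physical_delta
```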
Text-driven High-quality 3D Human Generation via Variational Gradient Estimation and Latent Reward Models
ABSTRACT. Recent advances in Score Distillation Sampling (SDS) have enabled text-driven 3D human generation, yet the standard classifier-free guidance (CFG) framework struggles with semantic misalignment and texture oversaturation due to limited model capacity. We propose a novel framework that decouples conditional and unconditional guidance via a dual-model strategy: a pretrained diffusion model ensures geometric stability, while a preference-tuned latent reward model enhances semantic fidelity. To further refine noise estimation, we introduce a lightweight U-shaped Swin Transformer (U-Swin) that regularizes predicted noise against the reward model, reducing gradient bias and local artifacts. Additionally, we design a time-varying noise weighting mechanism to dynamically balance the two guidance signals during denoising, improving stability and texture realism. Extensive experiments show that our method significantly improves alignment with textual descriptions, enhances texture details, and outperforms state-of-the-art baselines in both visual quality and semantic consistency.
PPVSR: Video Super-Resolution Reconstruction Algorithms Based on Progressive Processing
ABSTRACT. Video super-resolution reconstruction (VSR) refers to the process of enhancing low-resolution video frames into high-resolution continuous frames using advanced super-resolution techniques. Unlike single-frame super-resolution, VSR must take into account several critical factors, such as feature extraction between adjacent frames, maintaining spatiotemporal consistency, and accurately estimating and compensating for motion. The inherent complexity of these factors makes frame alignment one of the most significant challenges in the field. Effectively utilizing information from neighboring frames for precise frame alignment is essential for successful video reconstruction. In this paper, we propose a novel network called Phased Processing Video Super-Resolution (PPVSR), which addresses these challenges through a structured approach. Our method comprises three main steps: Pre-cleaning enhances the quality of the input video by reducing noise and improving clarity; Deformable convolution frame alignment ensures that the frames maintain spatiotemporal consistency by adapting convolution operations to align the features across frames effectively; finally, the reconstruction stage converts the low-resolution video into a high-resolution output, synthesizing details that were not present in the original frames. Experimental results demonstrate that our proposed algorithm significantly improves video reconstruction performance on multiple benchmark datasets, including REDS4 and Vid4. These results underscore the effectiveness and superiority of our approach in achieving high-quality video super-resolution.
Reliable AI-Driven Decision Analysis of Ophthalmic Plastic Surgery Parameters with SparseInst Network
ABSTRACT. Addressing the common issue of inaccuracies in measuring human eye regions during cosmetic procedures due to physician error, this study presents an innovative detection algorithm based on the advanced SparseInst network framework. This method automates complex computations, achieving high-precision quantification of multiple ophthalmic features that are typically challenging to discern and analyze computationally. The Strip Pooling Module (SPM) is integrated to capture broader and more distant contextually rich information, enhancing the fusion of discriminative features. Additionally, the BasicBlock of the backbone network is fortified with Deformable Convolutional Networks version 2 (DCNv2), effectively leveraging image features while preserving essential structural details. The Global Attention Mechanism (GAM) is implemented in the final convolutional layer of the backbone network, strategically improving the precision of the resultant segmentation map and meeting stringent measurement criteria. The proposed algorithm synergistically combines with tailored mathematical models to accurately parameterize crucial ocular dimensions, including area, eyebrow-eye distance, pupil diameter, and inner canthus angle. Empirical evaluations demonstrate superior segmentation performance, with an Average Precision (AP) of 71.2%, AP50 of 99.3%, and AP75 of 84.6% on the challenging MAD dataset. This exceptional accuracy enables precise measurement of human eye data when combined with relevant mathematical algorithms, surpassing the precision and reliability required for demanding industrial applications.
DTKD-UIE: Underwater Image Enhancement based on Dual Teacher knowledge distillation
ABSTRACT. The demand for underwater development is gradually increasing; however, light refraction, absorption, and scattering in underwater scenes lead to color distortion, blurriness, and low contrast in underwater images. To address color distortion and fog-induced blurriness, we propose the DTKD-UIE network, a dual-teacher underwater image enhancement framework based on knowledge distillation. It utilizes a "unified" network that trains a single set of weights capable of concurrently handling color correction and defogging in underwater images. To facilitate the effective assimilation of multi-teacher knowledge during knowledge distillation, we propose a feature matching and distillation module. Furthermore, we design a vision Fourier processor in the network to enhance the texture and detail information in images. We validate our method on the ImageNet and UIEB datasets, and the experimental results show that our method achieves promising results.
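To illustrate the dual-teacher distillation idea, the sketch below pulls a student feature toward a color-correction teacher and a defogging teacher through assumed projection heads; it is not the paper's feature matching and distillation module.

```python
import torch.nn.functional as F

def dual_teacher_distillation_loss(student_feat, teacher_color_feat, teacher_defog_feat,
                                   proj_color, proj_defog, w_color=0.5, w_defog=0.5):
    """Sketch of a feature-matching distillation objective with two teachers: the
    student feature is projected into each teacher's feature space and pulled toward
    the color-correction teacher and the defogging teacher. The projection heads
    (e.g. small nn.Linear or conv layers) and the weights are assumptions."""
    loss_color = F.mse_loss(proj_color(student_feat), teacher_color_feat.detach())
    loss_defog = F.mse_loss(proj_defog(student_feat), teacher_defog_feat.detach())
    return w_color * loss_color + w_defog * loss_defog
```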
An Improved Small Object Detection Method for Shuttlecock Activity Analysis
ABSTRACT. In the task of shuttlecock tracking and strike counting, existing methods suffer from significant limitations in detection accuracy and tracking stability due to the small size and high speed of the target. Furthermore, the lack of a dedicated shuttlecock dataset further restricts the improvement of model performance. To address these issues, this paper proposes a novel network model that integrates a small object detection branch and a convolutional attention mechanism to enhance the detection and tracking capabilities of shuttlecock targets. Specifically, a small object detection branch is added to the detection framework to strengthen high-resolution feature extraction, and a convolutional block attention module is introduced to enhance the representation of key features under complex backgrounds and scale variations. Meanwhile, a dedicated shuttlecock dataset is constructed through real-scene recording and rigorous filtering, effectively alleviating the shortage of training data. Experimental results demonstrate that the proposed method achieves significant improvements in detection precision, recall, and overall mAP metrics, with the overall mAP increasing by approximately 6 percentage points compared to the baseline model, while maintaining high detection speed. Additionally, a Web-based client system is developed to encapsulate the model and visualize the results, further verifying the feasibility and efficiency of the proposed approach in real-world applications. This work provides new insights and technical support for small object detection and intelligent sports analysis.
PGI-ViMamba: A Robust Neural Network Architecture for Enhanced Small Target Detection Against Information Loss
ABSTRACT. This paper addresses the critical issue of information loss in deep neural networks, particularly in small target detection tasks where the problem is exacerbated. Contrary to the prevailing theory attributing decreased recognition rates to information bottlenecks, we propose PGI-ViMamba, a novel model designed to mitigate information attenuation in neural networks. Our approach integrates two key components: the Multi-level Attention Gated PGI (MLAG PGI) and the SPD-Conv-VSS module. The model architecture features an enhanced VSS module as its backbone feature extractor, complemented by an auxiliary branch inspired by YOLOv9. This innovative structure not only preserves small target feature information during forward propagation but also achieves parameter efficiency. Notably, the MLAG backbone provides robust gradient information, enhancing model performance. Experimental results demonstrate significant improvements, with our model achieving a 4.7% increase in small target recognition accuracy on the VisDrone dataset compared to state-of-the-art models, while maintaining recall rates. Ablation studies further validate the effectiveness of both the Multi-level-AG PGI and SPD-Conv-VSS components in our proposed architecture.
Hybrid-SANet: Hybrid Self-Attention Transformer for Efficient Image Super-Resolution
ABSTRACT. While Convolutional Neural Networks (CNNs) excel at local feature modeling, they struggle with global context in lightweight super-resolution (SR) tasks and suffer from high computational complexity. Vision Transformers (ViTs) are effective at capturing long-range dependencies but tend to be computationally expensive and less efficient at handling local details. To address these challenges, we propose Hybrid-SANet, a novel hybrid self-attention network architecture that combines the strengths of both CNNs and ViTs. To this end, we design a Local Self-Attention (LSA) module consisting of a wavelet transform and a variable large-kernel block to efficiently extract high-frequency details and capture short-range dependencies within the image. This allows the model to effectively preserve fine textures and local features, which are crucial for high-quality SR reconstruction. To enhance global feature extraction, Hybrid-SANet also incorporates frequency projection into sparse global self-attention (GSA), which enables the model to capture long-range dependencies and global context. Experimental results demonstrate that Hybrid-SANet outperforms existing SR methods on benchmark datasets, particularly in terms of detail recovery and computational efficiency.
PCM-Net: A Hierarchical Medical Image Registration Framework Integrating Channel Adaptability and Multi-Scale Awareness
ABSTRACT. Deformable medical image registration aims to align the underlying anatomical structures of moving and fixed images by identifying the optimal spatial transformation. Current registration methods face difficulties in removing redundant information and balancing local and global deformations. This paper proposes PCM-Net, a two-stream pyramid model based on channel adaptability and multi-scale spatial capture. PCM-Net dynamically regulates dual-stream encoder feature fusion through the Adaptive Channel Refinement Gate (ACRG) and employs the Multi-scale Depthwise Convolution Block (MDCB) to achieve progressive, coarse-to-fine deformation field modeling. This design effectively balances global deformation estimation with precise local displacement capture. Additionally, the model introduces an innovative cross-layer difference propagation mechanism, enabling synergistic optimization of error correction and detail preservation. PCM-Net thus achieves accurate and interpretable registration of moving images. We have also designed three variants of PCM-Net to achieve higher registration robustness and accuracy. Experimental evaluations on two public brain MRI datasets demonstrate that PCM-Net and its variants consistently outperform various state-of-the-art methods across several evaluation metrics. The code is publicly available at: https://anonymous.4open.science/r/PCM-Net-E3B9/.
Real-Time Immersive Haptic Sculpting with Elastoplastic Virtual Clay
ABSTRACT. Virtual sculpting has evolved significantly, yet existing tools often neglect the material physics and haptic rendering critical for realism. This paper presents a real-time immersive sculpting system that integrates material point method (MPM) based elastoplastic material simulation with haptic rendering. To model complex elastoplastic behaviors and seamless tool interactions within the MPM framework, we introduce a deformation-aware function that dynamically adjusts the first Piola-Kirchhoff (PK1) stress computation, enabling a seamless shift from elasticity to plasticity in the material. For realistic haptic feedback, we propose a novel three-degree-of-freedom haptic rendering algorithm that solves a nonlinear least-squares problem. Additionally, to alleviate the visual artifacts of the marching cubes approach for surface reconstruction during cutting operations, we propose a dual-field marching cubes algorithm that maintains topological consistency through adaptive isosurface blending. The experiments demonstrate real-time performance (78.4 FPS at 104k particles) and superior penetration resistance compared to the traditional MPM approach. A user study with 32 participants revealed significantly lower cognitive load (NASA-TLX, p<0.05) and higher usability (SUS, p<0.05), with haptic force correlating strongly (r=0.983) with real-world sculpting forces. The proposed framework advances immersive sculpting by unifying physical accuracy, haptic realism, and computational efficiency.
SemanticAvatar: Human Surface Reconstruction Based on Semantically Consistent Biplane Features
ABSTRACT. Efficient semantic embedding is important for fine 3D human surfaces or avatars reconstructed from single images. However, this embedding is weak in existing methods. In particular, the widely adopted triplane features extracted from single images often exhibit semantic inconsistencies and mutual interference. This paper aims to resolve those limitations for better single-image human surface reconstruction. First, we simplify the triplane features into a semantically consistent two-plane (biplane) representation. Then, based on the biplane features, we propose SemanticAvatar, a novel diffusion-based framework. It adopts large vision models to infer the occluded view as a complement and further obtains semantically consistent biplane features through shared inference. Two feature-level semantic enhancement strategies are further incorporated: 1) a semantic alignment module, which embeds image semantics by aligning multi-scale image features with the latent representations, thus promoting semantic consistency during the diffusion process; and 2) a human-structure-based feature cropping strategy, which enables more precise semantic alignment through body-part-based decomposition, ultimately yielding more accurate reconstruction details. Experimental results demonstrate the effectiveness of SemanticAvatar.
GenericAvatar: Generic Human Modeling from Monocular Video Based on Mesh-guided Gaussians
ABSTRACT. We propose GenericAvatar, a universal human avatar modeling framework that leverages mesh-guided Gaussian splatting to achieve personalized, high-fidelity reconstruction of human bodies or heads from monocular videos. Our method consists of two steps. First, the Gaussian initialization module, based on explicit triangular meshes, embeds Gaussian splats onto the mesh surface and then transforms them into the global coordinate system, thereby stably capturing the low-frequency motions and surface deformations of human avatars. Second, the Gaussian adjustment module employs a triplane representation to encode the 3D Gaussian splats, followed by a spatial-posture cross-attention module and an MLP module to adjust Gaussian attributes. This second module effectively overcomes the limitations of traditional linear blend skinning (LBS) in modeling complex non-rigid deformations, enabling precise modeling of high-frequency details such as clothing wrinkles and dynamic hair. By fully integrating the geometric priors provided by explicit meshes with implicit Gaussian representations, GenericAvatar demonstrates high-fidelity reconstruction on PeopleSnapshot, ZJU-MoCap, and a monocular head dataset, preserving complex texture details. Experimental results indicate that GenericAvatar outperforms state-of-the-art methods on both human body and head reconstruction. The code is publicly available at https://genericavatar2025.github.io.
EE-Head: Emotion Estimation for Precise Facial Expression in NeRF Head Avatars
ABSTRACT. Reconstructing animatable avatars from images and videos plays an important role in virtual domains and immersive telepresence. Existing methods that combine parametric face models with neural radiance fields ignore the influence of subtle expressions on facial geometry, and they have difficulty reconstructing photo-realistic avatars with accurate geometry and expressions. To tackle this problem, we propose EE-Head, which introduces a parametric face model with highly accurate facial expressions into neural radiance fields, enhancing the rendering quality of animatable avatars. Specifically, we first propose an emotion estimation algorithm that uses an emotion consistency loss to encourage emotion similarity between input images and parametric face models. This algorithm estimates a parametric face model with more accurate facial expressions, particularly in the lip region. Then, we propose a joint training method that optimizes a neural radiance field by adjusting the weights of the image and parametric models across different frames. Our optimization adaptively adjusts the impact of different frames on the neural radiance field, improving rendering quality. As a result, EE-Head can reconstruct photo-realistic and animatable avatars with accurate expressions from multi-view images and monocular dynamic videos. Extensive experiments on various subjects demonstrate that EE-Head outperforms SOTA methods in both quantitative and qualitative rendering quality.
PhysAvatar: Physically Plausible Avatar Generation from Sparse Tracking
ABSTRACT. Recently, many studies achieved impressive performance in generating full-body motion from tracked head and hand data. However, to our knowledge, all existing approaches suffer from jittering of the knees and feet, foot skating, and foot ground penetration. These motion artifacts are because head-mounted devices(HMD) do not provide any lower-body data, and lower-body motion is generated solely depending on head and hand data.
Humans tend to be sensitive to the end-effectors, i.e., lower legs and arms, when they recognize the pose, and therefore, even minor motion artifacts in the knees and feet break in their presence.
This introduces the challenge of minimizing artifacts and generating physically plausible motion.
This paper presents the first approach, PhysAvatar, which combines neural kinematics regression and physically plausible motion optimization to generate full-body motions from sparse tracking.
The kinematic module is a neural network that regresses the full-body motion.
The physics module is an optimization that refines the motion to satisfy the physical constraints while reproducing the reference pose.
The reference pose is a blend of the inferred pose and the controlled pose.
The pose control detects foot-skating artifacts and corrects the legs using motion-captured clips.
Experiments demonstrate a clear improvement over the state of the art in terms of motion artifacts and physical plausibility.
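Two of the ideas above lend themselves to a short illustration: blending an inferred pose with a corrective controlled pose, and flagging foot-skating frames. The thresholds, joint layout, and function names below are assumptions for demonstration only.

```python
# Hedged sketch: pose blending and a simple foot-skating detector (a foot near the ground
# that keeps sliding horizontally is flagged). Not the paper's implementation.
import numpy as np

def blend_pose(inferred, controlled, w):
    """Linear blend of two joint-position poses; w=0 keeps the inferred pose."""
    return (1.0 - w) * inferred + w * controlled

def detect_foot_skating(foot_positions, ground_height=0.0, contact_eps=0.03, slide_eps=0.02):
    """foot_positions: (T, 3) trajectory of one foot joint. Returns per-frame skating flags."""
    heights = foot_positions[:, 1] - ground_height
    horiz_speed = np.linalg.norm(np.diff(foot_positions[:, [0, 2]], axis=0), axis=1)
    in_contact = heights[1:] < contact_eps
    return in_contact & (horiz_speed > slide_eps)

# Toy usage: a foot that slides along x while staying on the ground.
traj = np.stack([np.linspace(0, 0.5, 10), np.full(10, 0.01), np.zeros(10)], axis=1)
print(detect_foot_skating(traj))                     # mostly True -> skating detected
print(blend_pose(np.zeros((24, 3)), np.ones((24, 3)), 0.3).shape)
```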
Malicious Manipulation of Incident Light: Enabling Adversarial Relighting Attack with Physical Interpretability
ABSTRACT. Illumination variation is a pervasive challenge for vision models deployed in real-world scenarios. Previous works on the illumination robustness of vision models via adversarial relighting concentrate primarily on naturalness rather than physical interpretability. This oversight exacerbates the disparity between the physical world and its digital representation. In this paper, we propose IL-Attack, an adversarial relighting framework that maliciously manipulates the incident light using inverse rendering (IR) and physically based rendering (PBR). The intrinsic decoupling ability of IR and the physical interpretability of PBR facilitate the realism of our attack, thereby providing a potential solution for physical attacks. Moreover, inspired by natural effects such as occlusion, we propose an optimizable soft mask applied to the incident light as a portable plug-in. This soft mask is compatible with any ray-tracing technique and is able to mimic natural light effects such as neon lighting and shadow casting. We conduct extensive experiments to validate that the proposed methods can generate natural adversarial relighting results with physical interpretability. The rendering results obtained in a simulated environment guided by our attacks further underscore the potential threats of adversarial relighting examples.
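A minimal, hedged sketch of the "optimizable soft mask on incident light" idea: a sigmoid mask over an environment map attenuates incoming radiance before shading. The toy Lambertian sum below stands in for the paper's physically based renderer; all names and constants are assumptions.

```python
# Illustrative only: a differentiable soft mask modulating incident radiance.
import torch

H, W = 16, 32                                        # environment map resolution (toy)
env = torch.rand(H, W, 3)                            # incident radiance per direction
mask_logits = torch.zeros(H, W, requires_grad=True)  # optimizable soft mask parameters

def shade(env_map, mask_logits, albedo, cos_weights):
    mask = torch.sigmoid(mask_logits).unsqueeze(-1)       # soft values in (0, 1)
    incident = env_map * mask                             # maliciously modulated light
    # Toy diffuse shading: cosine-weighted sum over all environment directions.
    return albedo * (incident * cos_weights.unsqueeze(-1)).sum(dim=(0, 1))

cos_w = torch.rand(H, W)
color = shade(env, mask_logits, albedo=torch.tensor([0.7, 0.6, 0.5]), cos_weights=cos_w)
color.sum().backward()                               # gradients flow back into the mask
print(color.detach(), mask_logits.grad.abs().mean())
```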
Decoding Emotions: How Eye Features Influence Perception of Emotional Intensity in Virtual Characters
ABSTRACT. The way we perceive emotions in virtual characters significantly impacts our engagement with digital applications. A deeper understanding of how eye features such as astigmatism, eye size, and pupil color influence the perception of emotional intensity can provide more targeted guidance for virtual character design, enhancing emotional intensity and evoking richer emotional experiences. In this study, we investigated how variations in these eye features affect the perception of emotional intensity. We compared virtual characters displaying different states of these features and analyzed how each state influences six primary emotions. Our findings reveal that astigmatism is a key factor in conveying happiness, while eye size is crucial for expressing fear, surprise, and anger.
Conversely, pupil color does not significantly affect the perception of emotional intensity. Our findings offer valuable insight into the creation of targeted emotional expressions in interactive virtual agents, especially in face-to-face communication.
ODFM: Orientation-Aware Dual-Branch Functional Maps for Unsupervised Non-rigid Shape Matching
ABSTRACT. Accurately matching non-rigid shapes is a significant challenge in computer vision and graphics, primarily due to the intricate variations and deformations that these shapes undergo. This study introduces a novel unsupervised learning method for establishing 3D shape correspondences within the functional map framework. Traditional methods for shape matching are typically based on fully intrinsic representations, which can lead to symmetry ambiguities during prediction and result in unstable correspondences. To address this, we propose a new approach for computing complex functional maps that allows direct induction of these maps from their underlying representations. Based on this framework, we design an innovative dual-branch network and introduce a novel unsupervised loss function that couples point-wise maps with standard complex functional maps. This coupling establishes an intrinsic relationship that helps resolve symmetry ambiguities. Extensive experiments across multiple near-isometric and non-isometric datasets demonstrate our method's effectiveness.
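For readers unfamiliar with the framework, the following sketch shows the standard (real-valued) functional-map relation that underlies methods like the one above: a point-wise map induces a functional map C = Phi_Y^+ Pi Phi_X in truncated Laplacian bases. The random bases below are stand-ins, and this is not the paper's complex-functional-map variant.

```python
# Hedged sketch of the textbook functional-map relation, with random stand-in bases.
import numpy as np

rng = np.random.default_rng(0)
n_x, n_y, k = 100, 120, 20
phi_x = rng.standard_normal((n_x, k))           # basis on source shape X
phi_y = rng.standard_normal((n_y, k))           # basis on target shape Y
pi = np.zeros((n_y, n_x))                       # point-wise map as a binary matrix
pi[rng.integers(0, n_y, size=n_x), np.arange(n_x)] = 1.0

# Functional map induced by the point-wise map (least-squares projection onto phi_y).
C = np.linalg.pinv(phi_y) @ pi @ phi_x          # (k, k)

# Transfer a function f on X to Y through the map.
f = rng.standard_normal(n_x)
a_x = np.linalg.lstsq(phi_x, f, rcond=None)[0]  # coefficients of f in the X basis
f_on_y = phi_y @ (C @ a_x)
print(C.shape, f_on_y.shape)
```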
MIC-OPCC: Multi-Indexed Convolution model for Octree Point Cloud Compression
ABSTRACT. For point cloud compression, capturing sufficient spatial context is crucial for accurately modeling the point cloud distribution.
However, voxel-based methods tend to lose effectiveness when dealing with sparse point clouds of higher precision, as the context they gather becomes less comprehensive.
This study introduces an octree-based point cloud compression method that utilizes an entropy model powered by deep learning to estimate probabilities, which are then used to guide an Arithmetic Range Coder in reducing the bit rate of the serialized octree code.
Our proposed model extracts local features using lightweight 1D convolutions applied along varied orderings and models causal relationships by optimizing the cross-entropy.
This approach efficiently replaces the voxel-convolution techniques and attention models used in previous works, providing significant improvements in both time and memory consumption.
The effectiveness of our model is demonstrated on two datasets, where it outperforms recent deep learning-based compression models in this field.
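A minimal sketch of the kind of 1D-convolutional entropy model described above, written from general principles rather than the authors' code: each serialized octree symbol is an 8-bit child occupancy pattern (256 classes), and a causal 1D convolution predicts its distribution from past context for the range coder. Layer sizes and context length are assumptions.

```python
# Illustrative causal Conv1d entropy model over serialized octree occupancy codes.
import torch
import torch.nn as nn

class Conv1dEntropyModel(nn.Module):
    def __init__(self, n_symbols=256, dim=64, context=8):
        super().__init__()
        self.embed = nn.Embedding(n_symbols, dim)
        self.pad = context - 1
        self.conv = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=context), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=1), nn.ReLU())
        self.head = nn.Conv1d(dim, n_symbols, kernel_size=1)

    def forward(self, symbols):
        x = self.embed(symbols).transpose(1, 2)                 # (B, dim, T)
        # Left-pad and drop the last step so position t only sees symbols < t.
        x = nn.functional.pad(x, (self.pad + 1, 0))[..., :-1]
        return self.head(self.conv(x))                          # (B, 256, T)

model = Conv1dEntropyModel()
codes = torch.randint(0, 256, (2, 50))                          # toy serialized octree codes
logits = model(codes)
loss = nn.functional.cross_entropy(logits, codes)               # bits driving the range coder
print(logits.shape, loss.item())
```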
A Visual Self-attention Mechanism Facial Expression Recognition Network beyond Convnext
ABSTRACT. Facial expression recognition is an important research direction in the field of artificial intelligence. Although new breakthroughs have been made in recent years, the uneven distribution of datasets, the similarity between different categories of facial expressions, and the differences within the same category among different subjects remain challenges. This paper proposes a visual facial expression signal feature processing network based on a truncated ConvNeXt approach (Conv-cut) to improve the accuracy of FER under challenging conditions. The network uses a truncated ConvNeXt-Base as the feature extractor; we then design a Detail Extraction Block to extract detailed features and introduce a self-attention mechanism to enable the network to learn the extracted features more effectively. To evaluate the proposed Conv-cut approach, we conducted experiments on the RAF-DB and FERPlus datasets, and the results show that our model achieves state-of-the-art performance. Our code can be accessed on GitHub.
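A hedged sketch of the general recipe described above: truncate a ConvNeXt-Base backbone and let a self-attention layer re-weight the extracted feature tokens before classification. The truncation depth, projection, and head sizes are assumptions, not the paper's settings.

```python
# Illustrative truncated-ConvNeXt + self-attention classifier (not the authors' model).
import torch
import torch.nn as nn
from torchvision.models import convnext_base

class TruncatedConvNeXtFER(nn.Module):
    def __init__(self, n_classes=7, cut_stage=6, dim=512):
        super().__init__()
        backbone = convnext_base(weights=None)
        self.features = nn.Sequential(*list(backbone.features.children())[:cut_stage])
        self.proj = nn.LazyConv2d(dim, kernel_size=1)      # match channels after truncation
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, x):
        f = self.proj(self.features(x))                    # (B, dim, h, w)
        tokens = f.flatten(2).transpose(1, 2)              # (B, h*w, dim)
        attended, _ = self.attn(tokens, tokens, tokens)    # self-attention over regions
        return self.head(attended.mean(dim=1))             # pooled expression logits

model = TruncatedConvNeXtFER()
print(model(torch.rand(2, 3, 224, 224)).shape)             # -> (2, 7)
```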
Efficient Modeling of Long-Range Morphology for 3D Neuron Reconstruction from Electron Microscopy Images
ABSTRACT. Reconstructing 3D neurons from large-scale electron microscopy (EM) images is crucial in brain connectomics research. Current 3D neuron reconstruction pipelines require labor-intensive proofreading to manually merge over-segmented neuron fragments generated by automatic image segmentation algorithms. We propose a novel neuron tracing method based on a point cloud representation to streamline the neuron proofreading process. We frame the reconstruction problem as a binary classification task that predicts whether a pair of segments should be merged into the same neuron, based on their 3D morphological features. Existing models based on multi-layer perceptrons and Transformer architectures encounter challenges in processing segments with complicated morphological structures and substantial variations in scale. This limitation arises from their restricted ability to model large-scale contextual information and their quadratic time complexity. Inspired by selective state space models (SSMs), which offer sequence modeling capabilities similar to Transformers but with linear time complexity, we propose MambaTracingNet (MTN) to integrate the growth direction of neuron segments into the dependency learning for neuron tracing.
Our MTN provides robust long-range dependency modeling for neuron segments while maintaining superior computational efficiency based on the selective SSM architecture. We conducted comprehensive experiments on 20,000 segment pairs from the FlyTracing dataset. The results demonstrate that our method outperforms previous MLP- and Transformer-based point cloud approaches with a more lightweight network and significantly reduces GPU memory usage during training.
Obstacle Avoidance Strategy Based on Spatial Contraction for Multi-User Redirected Walking
ABSTRACT. To reduce collisions in limited physical space during redirected walking, this paper proposes an obstacle avoidance strategy for multi-user redirected walking based on virtual spatial contraction. During redirected walking, this method adjusts the virtual scene within the user's field of view and enables users to reach their destination more quickly. Compared to acceleration achieved by adjusting the translation gain, this method directly shortens the distance to the destination. The spatial contraction for redirected walking does not remap movement between physical and virtual spaces, thereby minimizing the risk of inducing 3D motion sickness. Applications in different obstacle avoidance algorithms show that spatial contraction can reduce the frequency of user resets compared to pre-contraction conditions and remains effective across various avoidance strategies. Moreover, the results show that under the same obstacle avoidance algorithm, smaller scaling factors lead to shorter physical walking distances without compromising user experience.
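A minimal illustration of the contraction idea, under the assumption of uniform scaling of virtual scene points toward the user (the paper's field-of-view-dependent adjustment is not reproduced here):

```python
# Illustrative only: contracting virtual scene points toward the user so the remaining
# virtual distance to the destination shrinks by the scaling factor.
import numpy as np

def contract_scene_point(point, user_pos, scale):
    """Scale a virtual scene point toward the user; scale in (0, 1] contracts the space."""
    return np.asarray(user_pos) + scale * (np.asarray(point) - np.asarray(user_pos))

user = np.array([0.0, 0.0])
destination = np.array([10.0, 0.0])
contracted = contract_scene_point(destination, user, scale=0.6)
print(np.linalg.norm(contracted - user))   # 6.0 m of virtual walking instead of 10.0 m
```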
ABSTRACT. Artistic style transfer is a prominent area of research with significant practical applications. However, existing methods often struggle to achieve a harmonious blend between the content of one image and the style of another, resulting in noticeable artifacts or loss of content fidelity. To address these challenges, we present a novel framework called the Dynamic Style-Adaptive Image Transformation Network (DySAIN), designed to improve the quality of stylized images. We introduce a Dynamic Style Module (DSM) that refines style adaptation by dynamically adjusting style features. Additionally, Attention-Guided Region Styling (AGRS) is used to enable precise stylization by focusing on specific image regions, while Progressive Style Blending (PSB) ensures smooth transitions between different style elements. Finally, a Refinement Module polishes the final output for enhanced coherence and visual appeal. Extensive experiments show that the proposed method produces high-quality stylized images while significantly reducing artifacts compared to existing state-of-the-art approaches.
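A hedged sketch of the dynamic style-modulation idea: an AdaIN-style operation re-normalizes content features with statistics predicted from the style image, and a learned gate softens or strengthens those statistics. This approximates the spirit of a dynamic style module; it is not the paper's DSM.

```python
# Illustrative AdaIN-style modulation with a learned per-channel gate (assumed design).
import torch
import torch.nn as nn

def adain(content, style_mean, style_std, eps=1e-5):
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    return style_std * (content - c_mean) / c_std + style_mean

class DynamicStyleGate(nn.Module):
    """Predicts per-channel gates that soften or strengthen the style statistics."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * channels, channels), nn.Sigmoid())

    def forward(self, content, style):
        s_mean = style.mean(dim=(2, 3), keepdim=True)
        s_std = style.std(dim=(2, 3), keepdim=True)
        g = self.gate(torch.cat([s_mean, s_std], dim=1).flatten(1))[:, :, None, None]
        # Gate near 0 keeps the content statistics; near 1 applies the full style.
        return adain(content, g * s_mean, g * s_std + (1 - g))

dsm = DynamicStyleGate(64)
print(dsm(torch.rand(2, 64, 32, 32), torch.rand(2, 64, 32, 32)).shape)
```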
A continuous test-time adaptation method for dynamic haze removal
ABSTRACT. Image dehazing is a crucial research topic in computer vision, focusing on removing haze from images to enhance visibility, contrast, saturation, and clarity, thereby improving overall image quality. Most existing dehazing methods rely on synthetic datasets and often struggle to generalize well in complex real-world scenarios. To address this issue, we propose a dynamic test-time adaptation method designed for continuously changing haze conditions. Specifically, we integrate the continual test-time adaptation (CoTTA) approach into the dehazing task, establishing a test-adaptive framework for image dehazing. Additionally, based on the Cutout technique, we introduce a region-cropping-based secondary haze-addition data augmentation strategy to improve the model's generalization capability. Furthermore, we incorporate the dark channel prior into the framework, which further boosts dehazing performance. Extensive experiments on our custom-built HazeImg dataset demonstrate the effectiveness of the proposed method. Experimental results show that, compared to existing dehazing methods and other test-time adaptation approaches, our region-cropping-based secondary haze addition significantly improves dehazing performance in open scenarios, with notable improvements in image clarity and detail restoration. Overall, the proposed method not only significantly improves dehazing accuracy but also enhances the model's ability to adapt to dynamic and complex environments.
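For reference, the dark channel prior the abstract incorporates has a simple textbook form: the per-pixel minimum over color channels followed by a local minimum filter. The sketch below is written from that standard definition, not from the paper's code; the patch size is an assumption.

```python
# Standard dark channel computation (He et al. style formulation).
import numpy as np
from scipy.ndimage import minimum_filter

def dark_channel(image, patch=15):
    """image: (H, W, 3) float array in [0, 1]. Returns the (H, W) dark channel."""
    per_pixel_min = image.min(axis=2)
    return minimum_filter(per_pixel_min, size=patch, mode='nearest')

hazy = np.random.rand(64, 64, 3) * 0.5 + 0.5     # bright, haze-like toy image
print(dark_channel(hazy).mean())                  # large values hint at heavy haze
```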
Bearing Remaining Useful Life Prediction Using Multimodal Features by Balanced Optimization Between Time-series and Images
ABSTRACT. Estimating the remaining useful life (RUL) of bearings is crucial to ensure their safe operation. In recent years, data-driven methods have garnered significant research interest for accurate RUL prediction. However, most existing methods focus on unimodal data or overlook the imbalance between modalities during training, which limits the effective utilization of multimodal data. In this article, we represent the degradation state by combining bearing time-series data with images generated from these time series. An LSTM structure is applied to extract temporal features from the time-series data, while a graph convolutional network (GCN) extracts spatial features from the generated images. This joint representation allows us to capture both temporal and spatial characteristics of the degradation process. In addition, we propose a balanced optimization approach via gradient modulation to address the imbalance between different modalities, where faster and better-trained modalities may inhibit relatively weaker ones. This approach dynamically adjusts the gradients of different modalities and each modality's contribution to feature extraction in real time during backpropagation. This not only balances the training process but also enhances the sufficiency and effectiveness of feature extraction from both modalities. An experimental study on the PRONOSTIA platform is used to evaluate the proposed method.
The results show that the proposed method effectively balances the time-series and image modalities and achieves performance superior to existing methods.
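A hedged sketch of gradient modulation for balancing two modality branches: after backpropagation, gradients of the currently dominant branch are scaled down so the weaker one is not suppressed. The dominance score and damping rule below are simplified assumptions, not the paper's exact scheme.

```python
# Illustrative per-modality gradient damping for a two-branch RUL regressor.
import torch
import torch.nn as nn

ts_branch = nn.Linear(16, 8)        # stand-in for the LSTM time-series branch
img_branch = nn.Linear(32, 8)       # stand-in for the GCN image branch
head = nn.Linear(16, 1)
opt = torch.optim.SGD(list(ts_branch.parameters()) + list(img_branch.parameters())
                      + list(head.parameters()), lr=1e-2)

x_ts, x_img, rul = torch.rand(4, 16), torch.rand(4, 32), torch.rand(4, 1)
f_ts, f_img = ts_branch(x_ts), img_branch(x_img)
loss = nn.functional.mse_loss(head(torch.cat([f_ts, f_img], dim=1)), rul)
loss.backward()

# Dominance score from feature norms; damp the stronger branch's gradients.
ratio = f_ts.norm().item() / (f_img.norm().item() + 1e-8)
damp_ts = min(1.0, 1.0 / ratio)     # < 1 if the time-series branch dominates
damp_img = min(1.0, ratio)          # < 1 if the image branch dominates
for p in ts_branch.parameters():
    p.grad.mul_(damp_ts)
for p in img_branch.parameters():
    p.grad.mul_(damp_img)
opt.step()
print(damp_ts, damp_img)
```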
Optimisation of Web-Edge-Cloud Intelligent Collaborative Rendering System For Dynamic Web3D Scenes
ABSTRACT. In this article, we present an intelligent rendering framework that leverages a "web-edge-cloud" collaboration approach. Our framework employs cloud and edge servers to distribute the task of rendering globally illuminated scenes, while web-based clients or specific edge servers handle post-processing and display the final rendering results on the web. For dynamic scenes, our system leverages the additional rendering capacity of the edge/cloud servers to achieve the desired rendering results. To enhance the rendering efficiency of the collaborative system, we implement a multi-objective optimization based on the particle swarm algorithm, which ensures high-quality rendering while simultaneously reducing transmission delays. Our optimization algorithm focuses on minimizing interaction latency while maximizing rendering quality, both of which significantly affect the user experience. We validate the final image quality and operational efficiency of the optimized system visually and quantitatively, and the results demonstrate a substantial improvement in both rendering quality and interaction latency due to our optimization algorithm.
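A minimal sketch of the particle-swarm idea behind such scheduling: each particle encodes a candidate task split between edge and cloud, and a weighted objective trades interaction latency against rendering quality. The objective model, constants, and scalarization are toy assumptions, not the paper's multi-objective formulation.

```python
# Illustrative single-objective PSO over a toy latency/quality trade-off.
import numpy as np

rng = np.random.default_rng(1)

def objective(x):
    # x in [0, 1]: fraction of rendering work sent to the cloud (toy model).
    latency = 20 + 80 * x             # more cloud work -> more transmission delay (ms)
    quality = 0.5 + 0.5 * x           # more cloud work -> better global illumination
    return 0.01 * latency + (1.0 - quality)   # minimize the weighted combination

n, iters = 20, 50
pos, vel = rng.random(n), np.zeros(n)
pbest, pbest_val = pos.copy(), np.array([objective(p) for p in pos])
gbest = pbest[pbest_val.argmin()]

for _ in range(iters):
    r1, r2 = rng.random(n), rng.random(n)
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, 0.0, 1.0)
    vals = np.array([objective(p) for p in pos])
    improved = vals < pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[pbest_val.argmin()]

print("best cloud share:", round(float(gbest), 3))
```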
Image Halftoning Using a Single Closed Non-self-intersecting Curve
ABSTRACT. Continuous-line-based halftoning represents a distinct category of image stylization technology, where a single closed non-self-intersecting curve is used instead of pixels to render a continuous-tone image. This category includes many variants that differ significantly in how the lines are generated. We propose a novel method for creating continuous line drawings (CLDs) by offsetting the medial axes of connected cells generated under the control of a density function and a directional field. Our algorithm first generates cells based on the centroidal Voronoi tessellation (CVT), and a spanning tree of the cell adjacency graph is constructed to control the generation of the final curve. During this step, we minimize an energy function that combines the CVT energy with a point-alignment energy so that the density and direction of the tree area better approximate the image tone. Second, a one-stroke curve is created by offsetting the medial axis of the region. To address situations involving many relatively simple polygons, we employ a discrete approach to approximate the medial axes of those polygons and develop a fast and robust technique to offset the medial axes into a single-stroke curve. We present a range of experimental results demonstrating that the method is quick and capable of producing aesthetically pleasing curves.
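A hedged sketch of the density-controlled CVT step: a discrete Lloyd-style relaxation in which sites move to the density-weighted centroid of the sample points assigned to them. This is a generic approximation of CVT, not the paper's combined CVT plus point-alignment energy.

```python
# Illustrative discrete CVT / weighted Lloyd relaxation.
import numpy as np

def discrete_cvt(sites, samples, density, iters=20):
    """sites: (K, 2); samples: (N, 2); density: (N,) weights from image darkness."""
    for _ in range(iters):
        d = np.linalg.norm(samples[:, None, :] - sites[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for k in range(len(sites)):
            m = assign == k
            if m.any():
                w = density[m][:, None]
                sites[k] = (samples[m] * w).sum(0) / w.sum()
    return sites

rng = np.random.default_rng(0)
samples = rng.random((2000, 2))
density = 1.0 + 4.0 * (samples[:, 0] < 0.5)      # toy image tone: darker on the left
sites = discrete_cvt(rng.random((50, 2)), samples, density)
print(sites.shape)
```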
Towards Culturally Aware AI: Quantified Evaluation of Relevance and Similarity in AI-Generated Images
ABSTRACT. Despite their continuous advancements, text-to-image models still suffer from multiple limitations, including cultural bias and inaccuracy, highlighting the need for a reliable, unbiased quantitative metric to evaluate cultural representation in AI-generated images. In this work, we first introduce a prompt-guided, context-aware cultural relevance index (CRIX) as a quantified metric to evaluate image relevance to a given culture. We employ the underlying knowledge of visual language models (VLMs), leveraging their visual capabilities in performing visual question answering (VQA) to calculate the proposed metric while taking into account the contextual aspects of the image in addition to its abstract content. Applied to the Arabic and South Asian cultures, the proposed metric achieves the lowest mean squared errors of 0.0022 and 0.0053, respectively, surpassing those of CRI and baseline metrics. Additionally, we propose the cultural similarity score (CSS), which aims to quantify the similarity between two images in representing a given culture.
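As an illustration only, one plausible way to turn VQA outputs into a scalar relevance index is to average a model's "yes" probabilities over a set of culture-specific questions. The question set, weighting, and the `vqa_yes_probability` backend below are all hypothetical; they are not the paper's CRIX definition.

```python
# Hypothetical aggregation of VQA answers into a cultural relevance score in [0, 1].
def cultural_relevance_index(image, culture, vqa_yes_probability, questions=None):
    if questions is None:
        questions = [
            f"Does the clothing in the image reflect {culture} culture?",
            f"Does the architecture or setting reflect {culture} culture?",
            f"Do objects or symbols in the image belong to {culture} culture?",
        ]
    scores = [vqa_yes_probability(image, q) for q in questions]
    return sum(scores) / len(scores)

# Toy usage with a fake VQA backend that always answers 0.8.
print(cultural_relevance_index("img.png", "Arabic", lambda img, q: 0.8))
```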