CGI 2025: 42ND COMPUTER GRAPHICS INTERNATIONAL 2025
PROGRAM FOR WEDNESDAY, JULY 16TH

09:00-09:15 Session 17: Welcome to CGI 2025

Zoom Link: https://polyu.zoom.us/j/89298697634?pwd=lWl32BN6kaRycVtGjmi3RW9BqhtiEj.1   Meeting ID: 892 9869 7634   Passcode: CGI2025

Chair:
Location: M1603, PolyU
09:15-10:00 Session 18: CGI 2025 Panel Discussion 1

Zoom Link: https://polyu.zoom.us/j/89298697634?pwd=lWl32BN6kaRycVtGjmi3RW9BqhtiEj.1   Meeting ID: 892 9869 7634   Passcode: CGI2025

Chair:
Location: M1603, PolyU
10:25-12:00 Session 19A: TVC 4 Animation and Simulation

Zoom Link: https://polyu.zoom.us/j/89298697634?pwd=lWl32BN6kaRycVtGjmi3RW9BqhtiEj.1   Meeting ID: 892 9869 7634   Passcode: CGI2025

Chair:
Location: M1603, PolyU
10:25
SelfAge: Personalized Facial Age Transformation Using Self-reference Images

ABSTRACT. Age transformation of facial images is a technique that edits a person's age-related appearance while preserving their identity. Existing deep learning-based methods can reproduce natural age transformations; however, they only reproduce averaged transitions and fail to account for individual-specific appearances influenced by their life histories. In this paper, we propose the first diffusion model-based method for personalized age transformation. Our diffusion model takes a facial image and a target age as input and generates an age-edited face image as output. To reflect individual-specific features, we incorporate additional supervision using self-reference images, which are facial images of the same person at different ages. Specifically, we fine-tune a pretrained diffusion model for personalized adaptation using approximately 3 to 5 self-reference images. Additionally, we design an effective prompt to enhance the performance of age editing and identity preservation. Experiments demonstrate that our method achieves superior performance both quantitatively and qualitatively compared to existing methods.

10:40
Enhanced Material Point Method with Affine Projection Stabilizer for Efficient Hyperelastic Simulations

ABSTRACT. The explicit integration scheme in the Moving Least Squares Material Point Method (MLS-MPM) is often constrained by small time steps, as larger steps approaching the CFL condition can lead to instability. As a result, traditional explicit MLS-MPM struggles to achieve interactive performance in large-scale simulations. Inspired by recent advancements in position-based MPM, we address this limitation by introducing the Affine Projection Stabilizer (APS), a novel technique that enhances the stability of explicit MLS-MPM while preserving visually plausible results. This enhancement is achieved through an XPBD-style affine projection at the particle level, applied across multiple grid-to-particle-to-grid (G2P2G) stages. By leveraging the computational power of modern GPUs, APS-MPM demonstrates a speedup of 2.9 to 3.3 times over GPU-optimized traditional explicit MLS-MPM across a range of large-scale scenarios. This enables interactive simulations of hyperelastic materials with approximately one million material points, paving the way for real-time simulation and interaction in practical applications.
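
For readers unfamiliar with the time-step constraint mentioned above, the sketch below illustrates a generic CFL-style bound for explicit MPM, where the step must scale with the grid spacing divided by the elastic wave speed plus the particle speed. The function name, parameters, and safety factor are illustrative assumptions, not APS-MPM's actual scheme.

import math

def cfl_time_step(dx, youngs_modulus, density, max_particle_speed, safety=0.3):
    # Stable explicit time step: dt <= safety * dx / (c + |v|_max),
    # where c = sqrt(E / rho) is an elastic wave-speed estimate.
    # All names and the safety factor are illustrative, not from the paper.
    wave_speed = math.sqrt(youngs_modulus / density)
    return safety * dx / (wave_speed + max_particle_speed)

# Example: 1 cm grid, soft material (E = 1e5 Pa, rho = 1000 kg/m^3), |v| <= 2 m/s.
print(cfl_time_step(dx=0.01, youngs_modulus=1e5, density=1000.0, max_particle_speed=2.0))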

10:55
CADGCL: Unsupervised Retrieval of CAD Models Via Boundary Representations

ABSTRACT. With the widespread application of CAD technology in the industrial manufacturing sector, the efficient retrieval of target models has become a critical research topic. Despite the outstanding performance of traditional supervised retrieval methods, their reliance on large amounts of labeled data significantly limits practical applications. Data labeling is not only time-consuming and costly, but its accuracy and consistency are also difficult to guarantee. To tackle this issue, this paper introduces CADGCL, an unsupervised method for retrieving CAD models based on boundary representations. The proposed method transforms boundary-representation CAD models into B-rep attributed graphs that integrate geometric information and topological structures, and then applies Graph Contrastive Learning (GCL) for unsupervised CAD model retrieval. To overcome the limitations of traditional GCL methods in data augmentation and negative sample selection, two novel strategies are introduced: an edge perturbation strategy based on Edge Betweenness Centrality and a negative sample selection strategy based on the Beta Mixture Model. These strategies effectively improve the performance of contrastive learning. Experimental results show that the proposed method outperforms existing approaches in mAP and F1 scores under unsupervised scenarios, validating its potential for applications in industrial manufacturing.

11:10
From Sci-Fi to Reality: Exploring Teenagers' Perceptions and Ideas of AI Robots

ABSTRACT. This paper presents a user study investigating how exposure to science fiction films and real-world interaction with a humanoid robot influence teenagers’ emotional responses, technology acceptance, and visions of ideal AI robots. Addressing the limited empirical research comparing mediated and embodied experiences in adolescents, we conducted a three-phase study combining questionnaires and thematic analysis involving 55 teenagers. Our quantitative results show that while science fiction films temporarily decreased comfort and increased nervousness, direct robot interaction moderated these effects without significantly altering perceptions of usefulness, ease of use, or intention to adopt. Our qualitative findings reveal that adolescents predominantly envision robots for practical domestic and personal assistance. However, critical and rejecting views emerged following exposure, indicating a complex engagement with robotic technologies. These insights provide a foundation for designing adolescent-centered robots and developing targeted media literacy initiatives to promote realistic perceptions of AI robots among teenagers.

11:25
HumanIR-MGI: Human Inverse Rendering via Jointly Optimizing Geometry, Material, and Illumination

ABSTRACT. Accurately decoupling geometry, material, and illumination remains a significant challenge in human inverse rendering. Meanwhile, detailed geometric reconstruction and accurate lighting modeling can enhance the quality of material reconstruction. To improve the accuracy of estimated human materials and their resemblance to reality, we propose a novel method, HumanIR-MGI, which jointly optimizes materials, geometry, and illumination from multi-view images captured under unknown lighting. Specifically, our method employs a two-stage framework: a pre-training stage to reconstruct an initial geometry and a fine-tuning stage to jointly refine geometry while estimating illumination and materials. In addition, we treat the outgoing radiance from indirect-bounce points as indirect illumination and estimate visibility to differentiate between direct and indirect illumination. Moreover, we incorporate an anisotropy model to account for the complex material properties of the human body. Through extensive quantitative and qualitative experiments, our method demonstrates its superiority over state-of-the-art methods.

11:40
Deep Learning-based Digital Twin for Real-time Fire Safety Management

ABSTRACT. The complexity of a building's indoor environment exacerbates its fire safety issues, making them particularly challenging. Traditional fire protection systems only detect fires when their characteristics are already evident, which creates a delay that does not align with the demands of modern building fire management. In response, this paper introduces a digital twin framework for the real-time management of building fire safety. This framework leverages sensor data and the Bert_C model to analyze the transition of fire from the negative ignition phase to the positive ignition phase. The Bert_C model is trained on a dataset derived from numerical simulations, enabling the early prediction of fire occurrences. This model is then incorporated into a digital twin platform created with Unity 3D. The findings indicate that the proposed framework markedly enhances the timeliness and precision of early fire warnings. The digital twin platform is capable of visualizing 3D fire scenes, offering robust technical support for firefighting and emergency rescue operations. This research underscores the viability of employing 3D environments and digital twins for real-time fire safety management, presenting innovative solutions for managing fire safety in built environments.

10:25-12:00 Session 19B: LNCS 5

Zoom Link: https://us06web.zoom.us/j/81900726710?pwd=aEYm8qBzCJspHAYNhC3tBNbZvaoPBk.1    Meeting ID: 819 0072 6710    Passcode: CGI2025

Location: BC201, PolyU
10:25
Generation of Realistic Synthetic ECG using Uncertainty-Aware Diffusion Model

ABSTRACT. Electrocardiogram (ECG) synthesis plays a crucial role in medical research, education, and device development. However, achieving high-fidelity ECG signal synthesis remains challenging, particularly in accurately reproducing specific waveform patterns at the sample level. In this paper, we propose an Uncertainty-Aware Diffusion Model that integrates uncertainty estimation into the ECG synthesis process. Unlike traditional diffusion models that focus primarily on texture, our approach simultaneously leverages the uncertainty of ECG signals. The uncertainty module preserves meaningful waveform characteristics. The model combines diffusion models, known for generating high-quality samples from complex distributions, with an uncertainty module that captures and propagates uncertainty throughout the pipeline. Extensive experiments demonstrate that our approach outperforms existing methods in terms of both distribution-level and sample-level evaluation.

10:40
Remote Sensing Cross-Domain Semantic Segmentation for Unknown Class Detection in Real-World Scenarios

ABSTRACT. Existing unsupervised domain adaptation (UDA) methods have shown success in semantic segmentation of high-resolution remote sensing (HRS) images. However, these methods assume that the source and target domains share the same set of labeled categories, which becomes problematic when new classes appear in the target domain. This assumption complicates the accurate prediction of boundaries and shapes of unknown classes. Additionally, self-supervised easy-hard adaptation strategies may result in the model learning erroneous knowledge from noisy pseudo-labels. To address these challenges, we propose a novel Open Set Domain Adaptation method, OpenRS-Net, specifically designed for remote sensing images. Our framework introduces a boundary-aware loss module, MorphoCon, based on morphological operations (dilation and erosion) to improve the representation of object boundaries. Additionally, we introduce a prototype-based pseudo-label denoising module (PPD) to reduce pseudo-label noise by calculating prototype distances. We conduct experiments on benchmark datasets, including Potsdam, Vaihingen, and the custom MultiLandRS dataset, demonstrating superior performance on both known and unknown class datasets, verifying the generalizability of our method.

10:55
Efficient Document Shadow Removal with Contrast-Aware Guidance

ABSTRACT. Document shadows pose a significant challenge in digitization, as they obscure dense textual and pattern information, necessitating specialized removal techniques. Many existing methods either rely on additional inputs such as shadow masks, or operate without them but suffer from limited effectiveness and generalization across diverse shadow conditions. This often results in incomplete shadow removal or loss of original document content and tones. In this paper, we refocus our approach on the document images themselves, which inherently contain rich information. We investigate the role of contrast information in document shadow removal and design a dedicated shadow representation module, a contrast-aware loss function, and a contrast-guided network to effectively leverage this information. By extracting document contrast information, we can effectively and quickly locate shadow shapes and positions without the need for additional masks. This information is then integrated into the refined shadow removal process, providing better guidance for network-based removal and feature fusion. Extensive qualitative and quantitative experiments show that our method achieves state-of-the-art performance.

11:10
Line Drawing Abstraction Based on Line Importance Evaluation

ABSTRACT. Line drawing is a frequently used art form that depicts an object or a scene with intensive strokes in the field of design. In particular, creating line drawings at different levels of detail is an important task for different design and presentation purposes. Unfortunately, existing line drawing methods either cannot deal with arbitrary line drawings or fail to generate high-fidelity line drawings at different levels of detail. In this paper, we propose a novel line drawing abstraction method that can create different levels of detail for an input line drawing based on a line importance evaluation system. In particular, we prepare a new paired dataset consisting of line drawings with varying levels of detail and propose a novel distance transform-based metric to estimate the importance of lines. A novel line removal method is further proposed to achieve abstraction at multiple levels of detail based on the predicted line importance and the lengths of the lines. We evaluated our method visually and quantitatively. Convincing results are obtained in all cases.

11:25
Cartoon Animation Shading Removal

ABSTRACT. Cartoon animation is a popular form of visual representation that utilizes shading to enhance the visual effect and create a three-dimensional appearance. However, shading presents significant challenges in the editing and processing of cartoon animations, affecting tasks such as segmentation, structural line extraction, and region tracking. Existing shading removal methods cannot be directly applied to cartoon animations due to differences in image features and shading patterns. In this paper, we propose a novel network architecture specifically designed for the shading removal task in cartoon animations. The model leverages attention mechanisms and a feature matching loss function to enhance contextual awareness, facilitating precise shading removal while preserving animation details. Since the proposed approach is based on supervised learning, the training process necessitates a high-quality labeled dataset. To this end, we manually created a dataset comprising 2,000 high-quality images and introduced an innovative fully automated shading removal method to generate additional training data. Furthermore, we demonstrate the applicability of shading-free cartoon animations across multiple tasks, including image segmentation, structural line extraction, and region tracking. Extensive qualitative and quantitative evaluations confirm that our proposed method consistently outperforms existing approaches in terms of both visual quality and statistical metrics, providing strong validation of its effectiveness and superiority.

11:40
SpectralVAE: spectral variational autoencoder for 3D mesh representation learning

ABSTRACT. In the field of geometric deep learning based on polygonal meshes, researchers have worked on designing specialized convolution and pooling operators for the irregular structure of mesh connectivity in the spatial domain to accomplish representation learning tasks. Inspired by spectral mesh processing, we propose a network, named SpectralVAE, which is a variational autoencoder network based entirely on the self-attention mechanism. This network first transforms mesh vertices to spectral coordinates by using eigenvectors of the Laplacian matrix; these spectral coordinates contain compressed information about the geometry of the original vertices. Then, we train SpectralVAE in an end-to-end manner in the spectral domain, thus avoiding the need to deal with the irregular connectivity of the mesh. We design transformer layers, spectral pooling layers, and spectral unpooling layers as basic components to form a hierarchical VAE framework. The effectiveness and superiority of our model are verified through comparative and ablation experiments on public large-scale human datasets. SpectralVAE achieves better performance than existing methods on mesh reconstruction and mesh interpolation tasks.

10:25-12:00 Session 19C: LNCS 6

Zoom Link: https://us06web.zoom.us/j/82346226482?pwd=Rwc3dmFUMa4UZ7qcAcfcsBPjJ7baXv.1    Meeting ID: 823 4622 6482    Passcode: CGI2025

Location: BC202, PolyU
10:25
Cross-Layer Feature Aggregation Object Detection Networks for Modality-enhanced UAV Images

ABSTRACT. Unmanned Aerial Vehicle (UAV) object detection has recently emerged as an active and challenging field in remote sensing. UAV imagery often encompasses various perspectives and altitudes, leading to significant scale variations, occlusions, and dense small objects, which reduce detection performance. To address these challenges, we propose a cross-layer feature aggregation object detection network for modality-enhanced UAV images, named CLFAMENet. First, the wavelet pooling modal enhancement (WPME) module is constructed to effectively utilize high-frequency and low-frequency information through wavelet domain learning, fusing these features with the original feature map to produce a modality-enhanced feature map. Second, the efficient spatial edge information aggregation (ESEIA) module combines edge information extracted by SobelConv with spatial information extracted by convolution, thereby generating feature representations encompassing rich edge and spatial information. Finally, the proposed cross-layer feature aggregation (CLFA) module introduces a bidirectional fusion mechanism between high-resolution and low-resolution features, enhancing the interaction between shallow and deep features. Experimental results on the VisDrone dataset and the UAVDT dataset show that CLFAMENet effectively improves the accuracy and robustness of object detection, outperforming existing methods.

10:40
Weakly Semi-supervised Classroom Teacher Visual Tracking by Single-point Annotations

ABSTRACT. We study a weakly semi-supervised teacher object tracking framework based on point annotations. The training dataset labels comprise a small portion of fully annotated information and a large portion of weakly annotated information. The main objective is to train a point-to-box regressor using a small-scale dataset that contains fully labeled information. Specifically, we introduce an effective method that utilizes the prevalent ViT architecture to predict bounding boxes based on single-point annotations. Our approach generates plausible pseudo-labels for images that only have point annotations. The process of constructing datasets for object tracking tasks often involves a significant amount of labeling, and practical applications require model fine-tuning. By leveraging more efficient point annotations, we can effectively advance the task solution and alleviate the labeling burden. To validate our method, we conducted experiments on a classroom teacher dataset. We demonstrate the advantages of our method and provide a new solution for weakly semi-supervised single-object tracking based on point annotations, with potential applications in various domains.

10:55
RegistrationBooster: Enhancing Rigid Registration Performance through Correspondence Refinement

ABSTRACT. Rigid registration of point clouds is a critical component in 3D vision and plays a significant role in many downstream tasks. Registration methods typically involve feature extraction followed by the selection of point-to-point correspondences to infer the unknown rigid transformation. Each correspondence links a point in the source point cloud P to the closest point in the target point cloud Q, based on the similarity of features. The ratio of inliers in these correspondences is crucial for the accuracy of the registration. However, an often-neglected problem is that multiple points in P may correspond to a single point in Q, but usually, only one of these correspondences is useful. This situation is unfavorable for predicting accurate transformations. In this paper, we introduce RegistrationBooster, a novel method for refining correspondences. We formulate the refinement of correspondences as a minimum cost maximum flow problem, solvable using the Push-relabel algorithm and the Network simplex algorithm. During experiments with 3DLoMatch as the dataset and FCGF as the feature descriptor, we found that inputting our refined correspondences into the recently developed SC2-PCR++ results in a significant improvement in registration recall, up to 6%, compared to using unprocessed correspondences. This highlights the efficacy of RegistrationBooster in enhancing the quality of the input correspondences.
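
As a concrete illustration of casting one-to-one correspondence selection as a minimum cost maximum flow problem, the sketch below builds a small flow network with NetworkX and keeps at most one source point per target point. The graph construction, integer cost scaling, and solver choice here are simplifying assumptions, not the paper's implementation.

import networkx as nx

def refine_correspondences(candidates):
    # candidates: {source_index: [(target_index, feature_cost), ...]}
    # Build source -> ("p", i) -> ("q", j) -> sink, all with unit capacity,
    # so each source point and each target point is used at most once.
    G = nx.DiGraph()
    for p, pairs in candidates.items():
        G.add_edge("s", ("p", p), capacity=1, weight=0)
        for q, cost in pairs:
            G.add_edge(("p", p), ("q", q), capacity=1, weight=int(1000 * cost))
            G.add_edge(("q", q), "t", capacity=1, weight=0)
    flow = nx.max_flow_min_cost(G, "s", "t")
    matches = []
    for p in candidates:
        for (_, q), amount in flow[("p", p)].items():
            if amount > 0:
                matches.append((p, q))
    return matches

# Two source points compete for target 7; the cheaper pairing survives.
print(refine_correspondences({0: [(7, 0.2)], 1: [(7, 0.6), (9, 0.5)]}))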

11:10
Face Sketch Synthesis via Sparse Spatial Channel Generators and Multi-Scale Discriminators

ABSTRACT. Facial sketch synthesis has broad applications in law enforcement. Although existing generative adversarial network (GAN) methods have made progress in synthesizing high-quality sketches from facial photographs, significant limitations remain in accurately reconstructing facial features. To address this issue, the proposed method, based on sparse spatial channel generators (SSCGs) and multi-scale discriminators (MSDs), aims to enhance both the quality and efficiency of generated sketches. Facial feature extraction and the capture of key features are efficiently improved through the integration of non-local sparse attention and polarized filtering. The performance of the discriminator is also improved by the multi-scale feature enhancement module (MSFEM), which effectively guides the flow of channel information by merging information from different scales, eliminating ambiguity, and enhancing the quality and consistency of the generated images. Experimental results on the CUHK, AR, XM2VTS, and CUFSF datasets demonstrate that the proposed method significantly outperforms existing methods in terms of generation quality, facial consistency, and preservation of texture details.

11:25
Enhancing the Transferability of Adversarial Examples against No-Reference Image Quality Assessment Models

ABSTRACT. Learning-based No-Reference Image Quality Assessment (NR-IQA) metrics have demonstrated remarkable performance compared to traditional methods. However, NR-IQA models that rely on neural network architectures are vulnerable to adversarial attacks, significantly undermining their credibility and reliability. Current research on the robustness of NR-IQA metrics predominantly focuses on white-box attack settings, with limited exploration of enhancing the black-box transferability of adversarial examples. This study aims to bridge this gap. We propose a Score Boosting Adaptive Transferable (SBAT) adversarial attack against NR-IQA models. Specifically, we introduce a customized loss function to generate adversarial examples, which seeks to increase the quality scores predicted by the victim model when estimating perturbed images. To extend the applicability of adversarial examples to black-box scenarios, we design an adaptive post-filtering module that strategically regulates the distribution of injected global noise in response to different models. Extensive experiments demonstrate that adversarial examples generated through our proposed method not only effectively disrupt the objective scoring performance of victim NR-IQA models, but also exhibit competitive transferability in black-box scenarios. Furthermore, these adversarial examples are more imperceptible than those originating from baseline attacks, as evidenced by their superior image quality quantified using FR-IQA metrics.
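
The core "score boosting" objective, raising the victim model's predicted quality score within a perturbation budget, can be sketched as plain projected gradient ascent, as below. SBAT's customized loss and adaptive post-filtering module are not reproduced here, and all hyperparameter values are illustrative assumptions.

import torch

def score_boosting_attack(model, image, steps=10, step_size=2 / 255, eps=8 / 255):
    # Projected gradient ascent on the predicted quality score.
    # `model` maps an image batch to quality scores; the step count, step
    # size, and budget are illustrative, not the paper's settings.
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        score = model(adv).mean()
        score.backward()
        with torch.no_grad():
            adv = adv + step_size * adv.grad.sign()        # raise the score
            adv = image + (adv - image).clamp(-eps, eps)   # stay inside the L-inf budget
            adv = adv.clamp(0.0, 1.0)
    return adv.detach()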

11:40
Exploring Procedural Content Generation of 3D City Scenes for 2D Semantic Segmentation

ABSTRACT. 3D virtual environments can be alternative domains for gathering synthetic images to train a computer vision (CV) model. While 3D virtual environments could potentially generate unlimited data, manually creating 3D objects and setting up the scenes would still be time-consuming and challenging. Furthermore, real-world constraints must be incorporated into the 3D scene to make them consistent with real-world scenarios. One promising approach is to explore grammar-based procedural content generation (PCG) for generating diverse 3D scenes with real-world constraints represented as a set of grammar rules. In this study, we explored the use of an L-grammar PCG for generating 3D city scenes. Using the generated city scene (XXX-PCG), we gather synthetic images for training a DeepLabV3+ semantic segmentation network (SSN) and observe if the SSN could perform well on real-world images, such as in the Cityscapes dataset. In addition to our proposed PCG method, we propose a new approach for bridging the synthetic-real domain gap where instead of using domain adaptation (DA) or image-to-image style transfer like other works, we propose manipulating the atlas textures directly by sampling image patches from the Cityscapes dataset. Thus, the synthetic images we gather from XXX-PCG no longer need a separate DA operation, potentially saving time and cost when generating training data. Quantitative results demonstrate the viability of our approach where our SSN, trained purely on XXX-PCG images, achieves an almost 5% increase in segmentation accuracy.
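
As a minimal reminder of how grammar-based PCG works, the sketch below expands an L-system by parallel string rewriting; the symbols and rules are hypothetical toy values, not the city grammar used for XXX-PCG.

def expand_lsystem(axiom, rules, iterations):
    # Rewrite every symbol in parallel at each iteration; symbols without a
    # rule are copied unchanged. Purely illustrative toy grammar below.
    s = axiom
    for _ in range(iterations):
        s = "".join(rules.get(ch, ch) for ch in s)
    return s

# Hypothetical rules: a block (B) spawns roads (R) and nested lots (L).
print(expand_lsystem("B", {"B": "R[LB]R", "L": "LL"}, iterations=3))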

13:30-15:10 Session 20A: TVC 5 Image Analysis and Processing

Zoom Link: https://polyu.zoom.us/j/89298697634?pwd=lWl32BN6kaRycVtGjmi3RW9BqhtiEj.1   Meeting ID: 892 9869 7634   Passcode: CGI2025

Chair:
Location: M1603, PolyU
13:30
Privacy Image Secrecy Scheme Based on Chaos-Driven FSM and FQM

ABSTRACT. In the digital age, the proliferation of social communication platforms has significantly heightened public awareness regarding privacy security. The potential leakage of image privacy poses a grave threat, as it can expose sensitive information and lead to severe crises. To address this critical issue, this paper introduces an advanced privacy protection scheme that integrates chaos-driven Fractal Sorting Matrix (FSM) and Fibonacci Q-Matrix (FQM) techniques. The proposed scheme, known as privacy image secrecy scheme based on chaos-driven FSM and FQM (PICFF), utilizes a 2D-LSM chaotic system to generate secure pseudorandom sequences for encryption. The FSM permutation technique, driven by chaos, effectively alters the positional information of the image, enhancing security through iterative and self-similar transformations. Additionally, the FQM encryption matrix, which traditionally employs static encryption, is dynamically enhanced by incorporating a chaotic system, ensuring strong correlation with the plaintext and further bolstering encryption reliability. Experimental validation demonstrates that the PICFF scheme excels in terms of information entropy and robustness. The encrypted images exhibit uniform pixel distribution, reduced pixel correlation, and high information entropy, closely aligning with theoretical values. The scheme also showcases exceptional resistance to various attacks, including cropping attacks and salt-and-pepper noise, thereby effectively preventing privacy information leakage and providing robust support for maintaining information security and enhancing privacy protection.
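
For reference, the Fibonacci Q-matrix underlying the FQM step is the standard 2x2 matrix whose integer powers contain consecutive Fibonacci numbers (a well-known identity; the chaos-driven parameter selection in PICFF is not shown here):

\[
Q = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}, \qquad
Q^{n} = \begin{pmatrix} F_{n+1} & F_{n} \\ F_{n} & F_{n-1} \end{pmatrix}, \qquad
\det Q^{n} = (-1)^{n}.
\]

In typical FQM-based image encryption schemes, 2x2 pixel blocks are multiplied by Q^n modulo 256 during encryption and by its modular inverse during decryption; since the determinant is plus or minus one, that inverse always exists modulo 256.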

13:45
Geometry Guidance Diffusion Image Morphing with Large Shape Difference

ABSTRACT. Image diffusion models have facilitated the generation of visually compelling images, and this powerful generative capability has also opened new avenues for tasks such as image morphing. Previous image morphing approaches using diffusion models primarily focus on interpolating text embeddings and latent vectors. However, these methods lack explicit shape control, resulting in morphing processes that lack smooth shape transitions. Moreover, the direct interpolation of text embeddings and latent vectors without constraints pushes the process out of the domain of the diffusion model, leading to noticeable artifacts. To address these issues, we propose a novel diffusion model-based method that leverages normal maps as geometric guidance to control image morphing. By integrating 3D reconstruction techniques with variational implicit surface methods, our approach ensures smoother and more stable morphing sequences, preserving shape consistency throughout the transformation. Comparative experiments demonstrate that our method produces smooth, consistent, and stable results and outperforms existing SOTA techniques, such as IMPUS and DiffMorpher, especially when dealing with images with large shape differences.

14:00
RainRWKV: A Deep RWKV Model for Video Deraining

ABSTRACT. In driving scenarios, videos recorded in rainy weather conditions are often distorted by rain streaks and raindrops, posing a significant challenge in recovering the obscured background details. The inherent temporal redundancy in videos offers stability advantages for rain removal. Traditional video deraining techniques primarily depend on optical flow estimation and kernel-based methods, which are constrained by a limited receptive field. Although transformer architectures can capture long-term dependencies, they introduce substantial computational complexity. Recently, the Receptance Weighted Key Value Model (RWKV), characterized by its linear computational complexity, has emerged as an effective tool for efficient long-term temporal modeling, which is essential for the removal of rain streaks and raindrops in video sequences. To optimize RWKV for video deraining, we introduce a wavelet transform shift mechanism that enhances low-frequency features by targeting distinct frequency bands. Additionally, we present a tubelet embedding mechanism for RWKVs, augmenting the model's capacity to capture high-frequency details by integrating the spatio-temporal context of input frames. Extensive experiments demonstrate that our approach achieves superior performance over state-of-the-art methods.

14:15
Single Image Shadow Removal using 2D Signed Distance Field

ABSTRACT. Due to substantial fluctuations in brightness at shadow boundaries, existing shadow removal algorithms often struggle to accurately eliminate these boundaries. This challenge is further compounded by the reliance on binary mask representations for shadow regions. Therefore, we propose a novel approach to shadow removal using a 2D Signed Distance Field (SDF), which serves as a smooth weight prior for handling shadow boundaries more effectively. First, we introduce a Fast Fourier Transform (FFT) framework to capture both global frequency and local spatial features, enhancing the overall quality of shadow removal. Additionally, we propose an Information Interaction Module (IIM) to fuse local spatial information with global frequency information obtained from the FFT, thereby improving the precision of shadow boundary handling. Second, we specially design a Boundary Refinement Module (BRM) for shadow boundaries, leveraging the characteristics of the SDF to ensure smoother and more natural elimination of shadow boundaries. Finally, we introduce a global feature modulation technique to combine features from SDF, FFT, and non-shadow regions, further enhancing the overall shadow removal results. Extensive experiments demonstrate that our method is comparable to state-of-the-art approaches, particularly in effectively removing shadow boundaries and achieving high-quality results.
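
A 2D signed distance field of the kind used as a smooth boundary prior can be built from a binary shadow mask with two Euclidean distance transforms, as in the generic sketch below. The mask here is a toy example, and the paper's actual weighting and network modules are not reproduced.

import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_distance_field(shadow_mask):
    # Negative inside the shadow region, positive outside; the magnitude
    # grows smoothly with distance to the shadow boundary.
    mask = shadow_mask.astype(bool)
    outside = distance_transform_edt(~mask)  # distance from non-shadow pixels to the shadow
    inside = distance_transform_edt(mask)    # distance from shadow pixels to the non-shadow area
    return outside - inside

mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 2:6] = True                        # toy square "shadow"
print(signed_distance_field(mask))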

14:30
Adaptive Box-Level Supervision with Superpixel Shape Guidance for Ultrasound Image Segmentation

ABSTRACT. Compared with fully supervised segmentation, which relies heavily on pixel-wise annotations, weakly supervised segmentation based on box labels can maximize the utilization of annotations from clinical practice. However, existing weakly supervised methods lack direct perception of the shape of segmentation targets and fail to balance shape and location constraints. In this paper, we propose a weakly supervised segmentation method based on shape-guided adaptive mutual training. Specifically, we design the pixel aggregation-based superpixel generation (PASG) module, which utilizes neural networks to generate more cohesive superpixel blocks from superpixel segmentation and selects appropriate ones to produce the so-called superpixel labels. Then, we propose the confidence-based cross pseudo supervision (CCPS) mechanism, which adaptively selects superpixel labels, pseudo-labels, and box labels based on their consistency with the actual lesion location, mitigating misleading network training under shape-only guidance. Experimental results on two ultrasound datasets demonstrate that our method outperforms existing weakly supervised methods in segmentation performance and is comparable to fully supervised methods. The proposed method facilitates the integration of deep learning methods into practical medical workflows.

14:45
SEGNet: Shot-Flexible Exposure Guided Image Reconstruction Network

ABSTRACT. Multi-Exposure image Fusion (MEF) is a fundamental task in computer vision. While mainstream deep learning methods have made some progress, they are primarily limited to processing exposure sequences with a fixed number of shots, which restricts their ability to capture scene information and makes it difficult to meet the high dynamic range requirements of real-world scenarios. Therefore, this paper proposes SEGNet, which includes three core components: a pair of pseudo-twin Spatial-Frequency Attention Modules (SFAM) that correct features based on image correlations by combining spatial and frequency domain attention mechanisms to achieve effective feature complementation; an Exposure Guided-Merging Module (EGM) that adaptively weights image features according to the exposure values of the input images, enabling flexible adaptation to different exposure shots; and an Exposure Restoration Module (ERM) that employs a Vision State-Space Module (VSSM) combined with convolutional modules to balance global restoration and local enhancement, achieving natural transitions in texture, color, and illumination. Additionally, a structural tensor regularization term is innovatively introduced into the loss function to further preserve image details. Extensive qualitative and quantitative experiments demonstrate that SEGNet outperforms existing mainstream models.

13:30-15:10 Session 20B: LNCS 7

Zoom Link: https://us06web.zoom.us/j/81900726710?pwd=aEYm8qBzCJspHAYNhC3tBNbZvaoPBk.1    Meeting ID: 819 0072 6710    Passcode: CGI2025

Location: BC201, PolyU
13:30
PortraitFormer: Global Illumination Helps Portrait Shadow Removal

ABSTRACT. Shadows are a common optical phenomenon, and when present on the face, they often cause visual discomfort. Due to the complexity of facial geometry and shadows, removing portrait shadows while maintaining a natural and consistent visual effect is challenging. In this paper, we propose a global illumination fusion mechanism based on feature extraction and illumination balancing, followed by the design of a Transformer model for portrait shadow removal, named PortraitFormer. The model effectively removes shadows by leveraging global illumination to re-illuminate the facial region. Specifically, the PortraitFormer Block (PFB) with global illumination balancing capability captures both texture details and global illumination of the face at different scales through the Feature Illumination Balancing (FIB) mechanism, then adaptively applies global illumination to fine-grained features, progressively restoring lighting in shadowed regions while maintaining detail consistency, ultimately producing a shadow-free image. Experimental results demonstrate that this method outperforms existing techniques in facial shadow removal, providing greater accuracy and better visual effects.

13:45
UTFusion: A General Image Fusion Framework with a Unified Transformer

ABSTRACT. The goal of image fusion tasks is to extract and integrate complementary information from multiple image sources to create a single image that captures the most relevant and salient features from each source. However, existing methods often cannot employ a single, unified model across various fusion tasks. In this paper, we introduce a novel unified model named UTFusion to handle different image fusion tasks. Our method uses patches as the fusion unit, adds an embedding of the fusion scheme, and changes the fusion strategy by switching this embedding. Qualitative and quantitative experiments show the competitiveness of our UTFusion compared to state-of-the-art general image fusion models in terms of visual effects and quantitative metrics.

14:00
DDVT: Dynamic Dual-level Vision Transformer Fusion Network for Answer Grounding in Visual Question Answering

ABSTRACT. Answer grounding in visual question answering aims to locate the image region associated with a given natural language question, a task that has garnered significant attention due to its practical applications. To address it, we propose DDVT, a Dynamic Dual-level Vision Transformer fusion network. Specifically, we propose a question-guided dynamic regional-level module (QGDR) that combines complementary image context, obtained through ROI Align, with text content, enabling precise localization of text-related visual content. Moreover, we present a cross-modal multi-scale aggregation module (CMA) that enhances feature fusion between pixel-level and region-level features, facilitating the effective localization of visual content associated with grounded answers. Furthermore, we fuse the located visual content with text features to locate the region and provide answers to questions posed about the image. Experimental results demonstrate that our DDVT outperforms state-of-the-art methods on several widely used benchmarks.

14:15
MIGEdit: Multimodal Interactive Garment Editing

ABSTRACT. Existing clothing image editing methods typically rely on single-modal traditional models. However, as user demands evolve, these methods still face challenges in global consistency, fine-grained control, and intuitive interaction. Traditional GAN-based approaches produce limited editing effects, while current diffusion-based methods struggle with region-aware editing. To address these issues, we propose MIGEdit, a multimodal interactive clothing editing framework. It integrates latent space optimization, inversion guidance, and region-aware editing into a pre-trained diffusion model. MIGEdit supports the generation of clothing images from clothing sketches, as well as point-based interactive editing and text- or region-guided modifications, enabling precise and flexible clothing adjustments. Experimental results demonstrate that MIGEdit outperforms existing methods in visual quality and editing accuracy, making it well-suited for intelligent fashion design and virtual try-on applications.

14:30
LA-mUNet: An Efficient Hybrid CNN-Mamba Network with Lesion Attention for 3D Brain Tumor MRI Segmentation

ABSTRACT. Volumetric medical image segmentation has emerged as a fundamental component of modern diagnostic imaging and treatment planning, where it also plays a crucial role in the accurate diagnosis of 3D brain tumors. Convolutional neural networks have dominated this task thanks to their local feature extraction, but their limited receptive fields hinder the capture of long-range dependencies in 3D brain MRI scans. Although Transformer-based approaches are effective at modeling global relationships in input data, their high computational cost poses challenges for scalability and clinical deployment. Inspired by recent advances in the application of Mamba state space modeling to medical imaging, we propose LA-mUNet, an efficient brain tumor segmentation method with: 1) a hybrid Mamba-CNN encoder that seamlessly extracts local and global features from all relevant tumor tissues, and 2) a decoder based on a Lesion Attention Module (LAM) and efficient Depthwise Separable Convolutions (DSC) that captures fine-grained details of small tumor lesions. Our lightweight design with Mamba-LAM integration balances performance and computational cost. The proposed method is evaluated on the public BraTS 2023 and BraTS-Africa datasets. The experimental results show that the proposed method achieves an average Dice score of 91.71% and an average HD95 of 3.11 mm on the BraTS 2023 dataset, and an average Dice score of 91.27% and an average HD95 of 3.56 mm on the limited-size BraTS-Africa dataset, outperforming state-of-the-art methods.
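
The Depthwise Separable Convolutions mentioned above factor a standard 3D convolution into a per-channel spatial filter followed by a 1x1x1 pointwise mix, which is where most of the parameter savings come from. A generic PyTorch sketch is given below; it is not LA-mUNet's actual decoder configuration.

import torch
import torch.nn as nn

class DepthwiseSeparableConv3d(nn.Module):
    # Depthwise step: one 3x3x3 filter per input channel (groups=in_ch).
    # Pointwise step: a 1x1x1 convolution that mixes channels.
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.depthwise = nn.Conv3d(in_ch, in_ch, kernel_size, padding=padding, groups=in_ch)
        self.pointwise = nn.Conv3d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 32, 16, 64, 64)                    # (batch, channels, D, H, W)
print(DepthwiseSeparableConv3d(32, 64)(x).shape)      # torch.Size([1, 64, 16, 64, 64])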

14:45
Scale-aware Guidance Network for SAR Ship Object Detection

ABSTRACT. In the ship detection task, synthetic aperture radar (SAR) is regarded as a powerful tool for obtaining data. However, the scale variation of ship objects in SAR images poses challenges during the detection process. To address it, we decouple the detection head and propose a new detection framework based on global precision optimization. The framework is composed of a scale-aware subnetwork and a detection subnetwork. During training, the scale-aware subnetwork guides the detection subnetwork, enhancing the extraction of multi-scale features. This approach optimizes detection precision across three scales, ensuring balanced performance. During inference, the scale-aware subnetwork is discarded to eliminate additional parameters and computation costs. Moreover, unlike Grad-CAM, this paper provides a more faithful post-hoc interpretability method based on hierarchical similarity. The proposed method improves the performance of six classical detectors on three widely used ship datasets.

13:30-15:10 Session 20C: LNCS 8

Zoom Link: https://us06web.zoom.us/j/82346226482?pwd=Rwc3dmFUMa4UZ7qcAcfcsBPjJ7baXv.1    Meeting ID: 823 4622 6482    Passcode: CGI2025

Chair:
Location: BC202, PolyU
13:30
Small Object Detection Algorithm with Selective Fusion of Small Scale Features

ABSTRACT. Current object detection technologies have become relatively mature, yet significant challenges remain in the detection of small objects, such as frequent false positives and missed detections. To address the issues associated with small objects, including small scale, high noise levels, and variable shapes, several improvements have been proposed. Firstly, since small objects are prone to be missed due to their size, an additional shallow feature map was introduced to facilitate subsequent feature fusion, enhancing the ability to capture small objects. Secondly, a novel feature aggregation module, Deformable Selective Aggregation (DSA), was designed to reduce noise and focus the model more effectively on the fine features of small objects. Lastly, Focaler-IoU combined with MPDIoU was utilized as the loss function, improving the accuracy of small object bounding box localization. Experimental results demonstrate that the improved model significantly enhances performance metrics such as mAP50 and mAP50-95 on the VisDrone2019-DET, CARPK, and WiderPerson datasets, reflecting excellent generalization and robustness.

13:45
VCHFC: Visual-Content Hybrid Feature Cue Module for Lipreading

ABSTRACT. Lipreading is a task in which the visual expression of a speaker can be understood solely through lip movements. Although many lipreading models utilizing audio and visual information have achieved good results, relying only on visual information in lipreading tasks is still a challenge. In this paper, we focus on the challenge of feature fusion between lip movements and content in word-level lipreading tasks and propose a Visual-Content Hybrid Feature Cue Module (called VCHFC). In the training phase, global content information is transformed into a content sequence as the content stream through word embedding processing, while the video is processed as a visual stream by the front-end network. To facilitate the interaction between the content and visual modalities, the VCHFC integrates a simplified attention mechanism in which the content modality guides the visual modality to perform correlation calculations, deriving cues that in turn enhance the visual features. We compare the performance of VCHFC against other state-of-the-art algorithms, and the results demonstrate the effectiveness of the VCHFC structure. The code related to this work will be available upon acceptance.

14:00
Multi-Scale Attention and Adaptive Self-Attention for Occlusion-Aware Depth Estimation in Light Field

ABSTRACT. Light field depth estimation is a fundamental and active research topic that aims to exploit angular and spatial information to estimate the depth of object surfaces. Although many methods have been proposed for the light field depth estimation task, few can effectively capture depth information in challenging scenes, especially occluded ones. Meanwhile, these methods rarely reduce the impact of occluded pixels on the cost values in the cost construction stage of depth estimation. In this paper, we propose a novel Occlusion-Aware Depth Estimation Network, named OADENet, for light field image depth estimation. Specifically, to capture the depth variations between distant backgrounds and foreground objects, which are challenging to represent using single-scale feature extraction, we propose a Multi-Scale Attention Fusion Module (MSAFM) for capturing and integrating features across different scales. Subsequently, we propose a cost constructor based on an adaptive self-attention mechanism to adaptively capture global pixel-wise matching costs and reduce the impact of occluded pixels, improving the reliability of the cost volume. Finally, we propose an occlusion mask supervision strategy to train our network effectively, enhancing its depth estimation capability in occluded regions. Extensive experiments on synthetic and real light field datasets demonstrate that our method achieves good performance in both accuracy and efficiency.

14:15
UMan: Multi-Scale Feature Reconstruction and Super-Resolution for Single-Image 3D Human Modeling

ABSTRACT. Generating accurate multi-view human images from a single, low-resolution input remains challenging due to limited texture information and uncertainties in 3D pose estimation. To address these issues, we propose a novel pipeline that first utilizes a super-resolution (SR) module to enhance the reference image quality. Subsequently, a diffusion-based multi-view synthesis component leverages SMPL-X geometric priors to ensure consistent body shape. In order to refine structural fidelity, we introduce a Geometry-Enhanced Two-Branch Module, which outputs both RGB images and normal maps, reinforcing geometric details. Most importantly, we present a multi-scale feature iteration strategy, wherein the newly generated multi-view images iteratively update SMPL-X parameters to correct pose inaccuracies and reduce shape distortions. Extensive experiments on both scanned 3D human datasets and in-the-wild images demonstrate that our approach achieves superior results in multi-view consistency and reconstruction quality, even in scenarios with complex poses and diverse clothing.

14:30
FusionCraft: A New Paradigm for Fine-Grained Multimodal Fashion Design Generation

ABSTRACT. The growing demand for fine-grained local detail control and global stylistic coherence in garment design has exposed the limitations of traditional workflows, which rely heavily on manual sketching, resulting in prolonged design cycles and limited scalability. In this paper, we propose FusionCraft, a multi-modal collaborative generation framework that integrates diffusion models and Mamba layers. Our framework comprises two core modules: 1) the Hierarchical Semantic Prioritization Module (HSPM), which addresses the semantic fragmentation issue in traditional text descriptions. By employing semantic disentanglement, HSPM decomposes textual descriptions into hierarchical design elements, ensuring that global attributes govern local details and thus maintaining semantic consistency during design translation. 2) The Contextual Mask Integration Module (CMIM), which introduces Mamba layers, combined with a Mask-Adaptive Cross-Attention (MACA) mechanism, to dynamically adjust attention weights and enable the precise fusion of garment style and structural constraints while preserving geometric fidelity. Furthermore, we contribute a novel tripartite dataset comprising fine-grained masks, textual descriptions, and high-resolution images, and compare our method with existing approaches using both this dataset and the public Fashionpedia dataset. Extensive experiments demonstrate that FusionCraft enhances the fidelity of local detail and stylistic consistency, while improving personalization capabilities. Code and dataset will be released at: https://github.com/GHH1010/FashionDataset

14:45
An Effective Wavelet Neural Network for Ultra-High-Definition Image Deraining

ABSTRACT. In order to overcome the high computational complexity, existing ultra-high-definition (UHD) image deraining methods usually leverage downsampling operations in the spatial domain, such that UHD deraining becomes possible on resource-limited devices. However, directly performing downsampling in the spatial domain easily leads to information loss, which further increases the difficulty of image deraining. To address this issue, we design an effective wavelet neural network (EWNet) for UHD image deraining. EWNet introduces the wavelet transform to alleviate the information loss of image downsampling and, moreover, decomposes the high-frequency and low-frequency information of the image to remove rain streaks at different levels respectively. To retain the expected texture information of the target image when removing the rain information of the low-frequency image, we design a low-frequency detail preservation module (LDPM) in EWNet. Furthermore, we propose a high-frequency detail enhancement module (HDEM), which not only takes the decomposed high-frequency information as input but also uses the output information of LDPM to assist in generating the clean high-frequency details. We conducted experiments on publicly available datasets, and the results show that, with fewer than 1M parameters, our proposed method achieves superior rain removal performance.
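
The wavelet decomposition that replaces plain spatial downsampling can be illustrated with a single-level 2D DWT: the image splits into a half-resolution low-frequency band and three high-frequency detail bands, and the transform is invertible, so no information is discarded. The snippet below uses PyWavelets with a Haar wavelet purely for illustration; EWNet's learned modules are not shown.

import numpy as np
import pywt

image = np.random.rand(256, 256).astype(np.float32)

# One-level 2D discrete wavelet transform: low-frequency approximation cA
# plus horizontal/vertical/diagonal detail bands, each at half resolution.
cA, (cH, cV, cD) = pywt.dwt2(image, "haar")
print(cA.shape, cH.shape)                      # (128, 128) (128, 128)

# The transform is invertible, unlike plain downsampling.
reconstructed = pywt.idwt2((cA, (cH, cV, cD)), "haar")
print(np.allclose(image, reconstructed, atol=1e-5))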

15:40-17:30 Session 21A: TVC 6 3D Modeling and LNCS (1)

Zoom Link: https://polyu.zoom.us/j/89298697634?pwd=lWl32BN6kaRycVtGjmi3RW9BqhtiEj.1   Meeting ID: 892 9869 7634   Passcode: CGI2025

Chair:
Location: M1603, PolyU
15:40
RevolRecon: Neural Representation for Reconstructing Surface of Revolution

ABSTRACT. Neural signed distance functions (SDFs) have proven to be highly effective in reconstructing organic models. These neural SDFs are known for their expressiveness and ability to predict reasonable shapes even in the presence of incomplete data. However, when the target object is a surface of revolution, incorporating this prior into surface reconstruction remains a significant challenge. In this paper, we observe that for any point p on a surface of revolution, the normal vector must align with the sectional plane passing through p. Our approach, named RevolRecon, leverages this key observation in a self-supervised manner to handle missing data. Additionally, we propose a dynamic sampling strategy to extract points from the underlying surface, ensuring the loss function is estimated across the entire surface. Our approach also applies to the scenario of a curved rotation axis. A comprehensive comparison with state-of-the-art methods demonstrates the significant advantages of our approach.
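
The key observation can be written as a single constraint. For a rotation axis with unit direction a passing through a point c, the circumferential tangent at a surface point p is a x (p - c); requiring the normal n(p) to lie in the sectional plane through p is therefore equivalent to (notation is ours, not the paper's):

\[
\mathbf{n}(\mathbf{p}) \cdot \big(\mathbf{a} \times (\mathbf{p} - \mathbf{c})\big) = 0 ,
\]

which suggests a residual that can be penalized over sampled surface points in a self-supervised loss.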

15:55
High-Quality Neural Surface Reconstruction from Unoriented Point Clouds via Multilevel Tensor Product B-spline Hash Encoding and Viscosity Regularization

ABSTRACT. Surface reconstruction is a fundamental and critical task in computer graphics, computer vision and geometric modeling. Recent learning-based reconstruction methods have made significant progress, but reconstructing high-quality surfaces from unoriented point clouds remains very challenging. This paper tackles this issue by directly learning a neural implicit representation from raw point clouds, leveraging the power of multilevel tensor product B-spline hash encoding and viscosity regularization. Our approach consists of two key components: (1) A hybrid representation model that utilizes multilevel tensor product B-spline functions to parameterize the bounding box of point clouds for positional encoding and MLPs for representing implicit functions. Using cubic B-spline functions, our positional encoding can achieve C² continuity, which is crucial to the quality of reconstruction. To reduce memory usage and speed up convergence, we employ a hash table to store the coefficients of B-spline basis functions; (2) A new loss function that integrates data, viscosity, Hessian, and minimal surface terms. Our innovation lies in the introduction of the viscosity term, inspired by the vanishing viscosity method, to alleviate the instability in the optimization process and yield a smooth signed distance function solution. To prevent undesired shape variations and ghost geometries, the loss function also incorporates the Hessian term and the minimal surface term. Extensive experimental results demonstrate that our method effectively captures intricate geometric and topological details, and outperforms existing reconstruction methods in both quality and accuracy across a diverse range of 3D datasets.

16:10
ImS: Implicit Shell for the Sandwich-Walled Space Surrounding Polygonal Meshes

ABSTRACT. In computer graphics, simplifying a polygonal mesh surface M into a geometric proxy that maintains close conformity to M is crucial, as it can significantly reduce computational demands in various applications. In this paper, we introduce the Implicit Shell (ImS), a concept designed to implicitly represent the sandwich-walled space surrounding M, defined as {x ∈ ℝ³ : ε₁ ≤ f(x) ≤ ε₂, ε₁ < 0, ε₂ > 0}. Here, f is an approximation of the signed distance function (SDF) of M, and we aim to minimize the thickness ε₂ − ε₁. To achieve a balance between mathematical simplicity and expressive capability in f, we employ a first-degree tri-variate tensor-product B-spline to represent f. This representation is coupled with adaptive knot grids that adapt to the inherent shape variations of M. In this manner, the analytical form of f can be rapidly determined by solving a sparse linear system. Moreover, the process of identifying the extreme values of f among the infinitely many points on M can be simplified to seeking extremes among a finite set of candidate points. By exhausting the candidate points, we find the extreme values ε₁ < 0 and ε₂ > 0 that define the thickness. The constructed ImS is guaranteed to wrap M strictly, without any intersections between the bounding surfaces and M. ImS offers numerous potential applications thanks to its rigorousness, tightness, expressiveness, and computational efficiency. We demonstrate the efficacy of ImS in mesh simplification through the control of global error.

16:25
ZAP-2.5DSAM: Zero Additional Parameters Advancing 2.5D SAM Adaptation to 3D Tumor Segmentation

ABSTRACT. The Segment Anything Model (SAM) demonstrated outstanding performance in 2D segmentation tasks, exhibiting robust generalization to natural images through its prompt-driven design. However, due to the lack of volumetric spatial information modeling and the domain gap between natural and medical images, its direct application to 3D medical image segmentation is suboptimal. Existing approaches to adapting SAM for 3D segmentation typically involve architectural adjustments by integrating additional components, thereby increasing trainable parameters and requiring higher GPU memory during fine-tuning. Moreover, retraining the prompt encoder may result in degraded spatial localization, especially when annotated data is scarce. To address these limitations, we propose ZAP-2.5DSAM, a parameter-efficient fine-tuning framework, which effectively extends the segmentation capacity of SAM to 3D medical images through a 2.5D decomposition scheme without introducing any additional adapter modules. Our method fine-tunes only 3.51M parameters from the original SAM, significantly reducing GPU memory requirements during training. Extensive experiments on multiple 3D tumor segmentation benchmarks demonstrate that ZAP-2.5DSAM achieves superior segmentation accuracy compared to conventional fine-tuning methods.

16:40
TriAlign: Revisiting Deep Functional Map from Map Representation Alignment Perspectives

ABSTRACT. Current deep functional map methods face a critical gap in map representation alignment. While shape correspondence can be represented as point-wise maps, functional maps, and complex functional maps in the spatial, spectral, and complex spectral domains, respectively, existing approaches typically integrate at most two representations, resulting in symmetry ambiguity or spatial inconsistency. In this paper, we propose the \textbf{TriAlign} (\textbf{Tri}ple Maps \textbf{Align}ment) framework, a novel three-branch deep functional map-based method that harmonizes map representations across spatial, spectral, and complex spectral domains. Additionally, we introduce an alignment loss function to align the point-wise map with the complex functional map. Extensive experiments on (near-)isometric and non-isometric datasets demonstrate the superior accuracy of our method and its generalization capabilities across different datasets and mesh discretizations. Furthermore, the new loss function improves the stability of network training.
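For background, the standard conversion underlying these representations is recalled below; TriAlign's actual alignment loss may differ, so this should be read as the generic functional-map relation rather than the paper's definition. With $\Phi_X, \Phi_Y$ the (real) Laplace-Beltrami eigenbases and $\Pi_{YX}$ a point-wise map matrix from $Y$ to $X$,

\[
C_{XY} = \Phi_Y^{\dagger} \, \Pi_{YX} \, \Phi_X,
\qquad
\mathcal{L}_{\mathrm{align}} = \big\lVert \Pi_{YX} \, \Phi_X - \Phi_Y \, C_{XY} \big\rVert_F^2,
\]

and an analogous residual can be written in the complex spectral domain by replacing $\Phi$ with a complex basis and $C_{XY}$ with the complex functional map.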

16:55
Efficient selective labeling strategies for enhanced few-shot 3D point cloud semantic segmentation

ABSTRACT. Few-shot 3D point cloud semantic segmentation, a critical area in computer vision and 3D data processing, addresses the challenge of accurately segmenting new object categories with minimal supervision, typically just one to five labeled examples. Current methods often neglect selective annotation, instead assuming random or uniform selection of support samples, which can be suboptimal. To address this limitation, we propose a novel selective labeling approach that strategically selects and annotates the most informative and representative instances from a large pool of unlabeled data using a greedy algorithm, thereby maximizing information gain. This approach enhances model performance on unseen categories while minimizing the annotation budget. In extensive experiments, our strategy yields significant improvements in mIoU scores in few-shot scenarios, outperforming other selective annotation methods. This work not only advances few-shot learning in 3D spaces but also facilitates more efficient annotation processes and stronger adaptive learning abilities in practical scenarios.
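A minimal sketch of such a greedy selection over unlabeled-sample embeddings is shown below, using a facility-location style coverage objective as a surrogate for information gain; the embedding source, the cosine-similarity objective, and the budget handling are illustrative assumptions rather than the paper's exact criterion.

import numpy as np

def greedy_select(embeddings, budget):
    """Greedily pick `budget` samples whose embeddings best cover the pool.

    Facility-location objective: at each step, add the candidate that most
    increases the summed similarity between every pool item and its closest
    already-selected item (a standard greedy scheme for submodular coverage).
    """
    normed = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-8)
    sim = normed @ normed.T                      # (n, n) cosine similarities
    n = len(embeddings)
    selected = []
    best_cover = np.zeros(n)                     # best similarity of each item to the selected set
    for _ in range(budget):
        gains = np.maximum(sim, best_cover[None, :]).sum(axis=1) - best_cover.sum()
        gains[selected] = -np.inf                # never pick the same sample twice
        pick = int(np.argmax(gains))
        selected.append(pick)
        best_cover = np.maximum(best_cover, sim[pick])
    return selected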

15:40-17:30 Session 21B: LNCS 9

Zoom Link: https://us06web.zoom.us/j/81900726710?pwd=aEYm8qBzCJspHAYNhC3tBNbZvaoPBk.1    Meeting ID: 819 0072 6710    Passcode: CGI2025

Location: BC201, PolyU
15:40
An Industrial Multi-Machining Feature Dataset and Contrastive Learning-based Network for Feature Recognition

ABSTRACT. Machining feature recognition plays a crucial role in Computer-Aided Design (CAD) and Manufacturing (CAM), enabling the effective integration of CAD, Computer-Aided Process Planning (CAPP), and CAM systems. However, current machining feature datasets lack diversity and authenticity in terms of both machining features and base models. Current recognition methods also face challenges, such as losing geometric information of long-range complex intersecting features and difficulties in instance segmentation and localization. In this paper, we first construct a novel Multi-Machining Feature model Dataset (MMFD) that includes various non-machining-feature shapes as base models and complex machining features commonly encountered in real industrial manufacturing processes. We then propose CRNet, a Contrastive learning-based network for machining feature Recognition. The network uses a Transformer-based encoder to capture long-distance contextual information and employs a novel Face Geometric Position Encoding (FGPE) to enrich the spatial geometric relationship information between complex features. Furthermore, we utilize contrastive learning to directly supervise the feature embeddings, enhancing the robustness of the feature representations and improving the generalization and scalability of the network in recognizing complex mechanical machining features. Finally, our experimental results demonstrate the effectiveness and accuracy of CRNet on MMFD compared to state-of-the-art methods.
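As an illustration of directly supervising per-face embeddings with contrastive learning, a SupCon-style loss is sketched below; the temperature, the batch layout, and the use of machining-feature labels to define positives are generic assumptions about the technique, not CRNet's implementation.

import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings, labels, temperature=0.07):
    """Pull B-rep face embeddings with the same machining-feature label
    together and push all other pairs apart (SupCon-style)."""
    z = F.normalize(embeddings, dim=1)                        # (N, D) unit vectors
    logits = z @ z.t() / temperature                          # (N, N) scaled cosine similarity
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    logits = logits.masked_fill(eye, float('-inf'))           # drop self-similarity
    positives = (labels[:, None] == labels[None, :]) & ~eye   # same-label pairs
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_counts = positives.sum(dim=1).clamp(min=1)
    loss = -(log_prob.masked_fill(~positives, 0.0)).sum(dim=1) / pos_counts
    # Average only over anchors that actually have at least one positive.
    return loss[positives.any(dim=1)].mean()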

15:55
IG-Diff: Complex Night Scene Restoration with Illumination-Guided Diffusion Model

ABSTRACT. In nighttime circumstances, it is challenging for individuals and machines to perceive their surroundings. While prevailing image restoration methods adeptly handle singular forms of degradation, they falter when confronted with intricate nocturnal scenes, such as the concurrent presence of adverse weather and low-light conditions. Compounding this challenge, the lack of paired data that captures the coexistence of low-light conditions and other forms of degradation hinders the development of a comprehensive end-to-end solution. In this work, we contribute complex nighttime scene datasets that simulate both illumination degradation and other forms of deterioration. To address the complexity of night degradation, we embed an illumination-guided module in the diffusion model to guide the illumination restoration process. Our model preserves texture fidelity while contending with the adversities posed by the various degradations present in low-light scenarios.

16:10
DualShield: Dual-Layer Protection for Secure and Robust Stable Diffusion Steganography

ABSTRACT. Stable diffusion image steganography (SDIS) provides reliable protection for generated-image traceability and the secret transmission of information. However, existing SDIS methods still struggle to simultaneously achieve robust watermark traceability and secure transmission of secret information. To address this problem, we propose DualShield, a dual-protection robust steganography scheme for generated images based on a stable diffusion model. DualShield uses controllable public and private conditional texts as keys and embeds the preprocessed watermark into the reverse-encoded latent vectors in the latent space, a process in which the public key participates. To fuse the watermark with the latent vectors more thoroughly, we employ a U-Net encoder capable of fusing low- and high-level features. To ensure the accuracy of the extracted watermark, we add random noise perturbations during the training phase. The design of the private key ensures that the receiver can reverse-decode the high-value images transmitted in secret. Experimental results show that DualShield performs satisfactorily against common image attacks, steganalysis, and removal attacks.

16:25
Accelerating Error Diffusion Halftoning for High-Throughput Wide-Format Industrial Printing

ABSTRACT. Digital halftoning is a technique that converts grayscale or color images into binary or multi-level formats, enabling high-quality image rendering with a limited color palette. It is widely employed in printing and digital imaging. However, with the growing demand for real-time printing, accelerating halftone processing has become a critical challenge. In industrial printing, traditional high-performance hosts are constrained by serial computing models and high costs, making it difficult to process massive image data efficiently. As a result, the speed of halftone image generation often falls short of the requirements of high-speed printing. This paper analyzes the principles of the error diffusion algorithm and designs a parallel, pipelined processing core on an FPGA platform to significantly accelerate image processing. Additionally, the development of the hardware accelerator follows an agile approach, using differential testing for functional verification, which greatly improves development efficiency. Experimental results show that a single hardware accelerator achieves image quality nearly identical to that of the software algorithm, with processing speed improved by more than 7 times. Under a multi-core deployment scheme, it surpasses the performance of a host computer, validating the solution for high-throughput, high-speed, wide-format printing on low-cost, low-power control boards.
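For reference, the serial loop being accelerated is the classic error-diffusion scan; the sketch below uses the Floyd-Steinberg kernel as a representative choice (the paper's exact diffusion kernel, bit depth, and FPGA pipelining scheme are not reflected here).

import numpy as np

def error_diffusion_halftone(gray, threshold=127.5):
    """Binarize a grayscale image with Floyd-Steinberg error diffusion.

    Each pixel is thresholded and its quantization error is pushed to the
    yet-unvisited neighbors; this raster-order dependency is exactly what an
    FPGA pipeline has to break to process many pixels per clock.
    """
    img = gray.astype(np.float64).copy()
    h, w = img.shape
    out = np.zeros((h, w), dtype=np.uint8)
    for y in range(h):
        for x in range(w):
            old = img[y, x]
            new = 255.0 if old >= threshold else 0.0
            out[y, x] = int(new)
            err = old - new
            if x + 1 < w:
                img[y, x + 1] += err * 7 / 16
            if y + 1 < h:
                if x > 0:
                    img[y + 1, x - 1] += err * 3 / 16
                img[y + 1, x] += err * 5 / 16
                if x + 1 < w:
                    img[y + 1, x + 1] += err * 1 / 16
    return out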

16:40
A Novel Registration Framework For Large-scale Point Clouds via Geometric Salience Computation

ABSTRACT. Large-scale point cloud registration is crucial for 3D scene modeling, mapping, and understanding. However, even state-of-the-art approaches still struggle to build correct 3D correspondences and accurately estimate rigid transformations for large-scale point clouds containing massive numbers of points with low geometric salience. In this paper, we propose a novel geometric salience-aware framework for large-scale point cloud registration. First, we present a simple but efficient approach to compute geometric salience priors for both source and target point clouds and classify all points into high-, low-, and non-salience points. Then, we leverage a salience-aware voting method to rank and select correspondences from the initial correspondence set under pairwise compatibility constraints. Finally, using the confidence values of the selected correspondences, we apply a robust estimation function to compute the rigid transformation. Extensive experiments on several benchmarks demonstrate substantial improvements over state-of-the-art methods, both qualitatively and quantitatively.
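One generic way to obtain a per-point geometric-salience prior is the surface-variation measure from a PCA of local neighborhoods, after which points can be thresholded into high-, low-, and non-salience classes; the estimator below illustrates that idea only and is an assumption about the prior, not the paper's formula.

import numpy as np
from scipy.spatial import cKDTree

def surface_variation(points, k=20):
    """Per-point salience as surface variation l0 / (l0 + l1 + l2), where l0 is
    the smallest eigenvalue of the local covariance matrix. Values near zero
    indicate flat regions; larger values indicate edges, corners, or clutter."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)             # k nearest neighbors per point
    salience = np.empty(len(points))
    for i, nbrs in enumerate(idx):
        cov = np.cov(points[nbrs].T)             # 3x3 covariance of the neighborhood
        eigvals = np.sort(np.linalg.eigvalsh(cov))
        salience[i] = eigvals[0] / (eigvals.sum() + 1e-12)
    return salience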

16:55
Hi-Reco: High-Fidelity Real-Time Conversational Digital Humans

ABSTRACT. High-fidelity digital humans are increasingly used in interactive applications, yet achieving both visual realism and real-time responsiveness remains a major challenge. We present a high-fidelity, real-time conversational digital human system that seamlessly combines a visually realistic 3D avatar, persona-driven expressive speech synthesis, and knowledge-grounded dialogue generation. To support natural and timely interaction, we introduce an asynchronous execution pipeline that coordinates multi-modal components with minimal latency. The system supports advanced features such as wake word detection, emotionally expressive prosody, and highly accurate, context-aware response generation. It leverages novel retrieval-augmented methods, including history augmentation to maintain conversational flow and intent-based routing for efficient knowledge access. Together, these components form an integrated system that enables responsive and believable digital humans, suitable for immersive applications in communication, education, and entertainment.

15:40-17:30 Session 21C: LNCS 10

Zoom Link: https://us06web.zoom.us/j/82346226482?pwd=Rwc3dmFUMa4UZ7qcAcfcsBPjJ7baXv.1    Meeting ID: 823 4622 6482    Passcode: CGI2025

Location: BC202, PolyU
15:40
MC-Gaussian: 3D Gaussian Splatting with Multi-camera in Autonomous Driving

ABSTRACT. Novel view synthesis plays a crucial role in developing closed-loop autonomous driving simulation systems. Modern autonomous driving systems typically employ multiple cameras to achieve wide perception, but this also introduces challenges for neural rendering, i.e., color inconsistency among different cameras and sparse observations from the side cameras. While NeRF-based methods offer solutions, their inefficiency in rendering speed limits practical application in simulation. In this paper, we propose a real-time rendering solution for multi-camera settings based on the 3D Gaussian representation. Since naively combining the multi-camera solutions developed for NeRF with 3DGS is ineffective, we propose MC-Gaussian, a novel rendering framework tailored for 3D Gaussians. We introduce separate Gaussian models for the foreground and the sky background, applying pose-conditioned color correction to each to resolve color inconsistencies across cameras. Additionally, we introduce regularization terms for geometric consistency across cameras to alleviate overfitting caused by sparse view observations. Our method achieves state-of-the-art results on static scenes from the Waymo and nuScenes datasets, producing high-fidelity, real-time rendering.
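A common way to absorb per-camera exposure and white-balance differences in splatting pipelines is a learnable per-camera affine color transform applied to the rendered image; the module below sketches that generic idea (the pose conditioning and the exact parameterization used in MC-Gaussian are not specified in the abstract and are assumed here).

import torch
import torch.nn as nn

class PerCameraColorCorrection(nn.Module):
    """Learnable affine color transform per camera: c' = A_cam @ c + b_cam.

    Applied after rendering, so the shared Gaussian model keeps one color per
    Gaussian while each camera's photometric differences are absorbed by its
    own correction parameters."""
    def __init__(self, num_cameras):
        super().__init__()
        # Initialize every camera's transform to the identity.
        self.A = nn.Parameter(torch.eye(3).repeat(num_cameras, 1, 1))  # (C, 3, 3)
        self.b = nn.Parameter(torch.zeros(num_cameras, 3))             # (C, 3)

    def forward(self, rendered, cam_id):
        # rendered: (H, W, 3) image produced by the shared Gaussian model.
        corrected = rendered @ self.A[cam_id].t() + self.b[cam_id]
        return corrected.clamp(0.0, 1.0)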

15:55
EDiffNet: An Enhanced DiffusionNet for Non-isometric 3D Shape Correspondence and Matching

ABSTRACT. Non-rigid 3D shape correspondence and matching are key problems in computer graphics. However, owing to the coupling of complex non-isometric deformation, dynamic topological changes and strong noise interference, existing methods have significant bottlenecks in maintaining geometric consistency, optimizing local feature sensitivity and improving computational efficiency. Although deep learning methods based on functional maps (such as DiffusionNet) have improved feature expression capabilities through geometric deep learning, the Laplace basis functions they use are inefficient at representing high-frequency details, and these methods lack a multi-scale correspondence optimization mechanism. In view of these shortcomings, this study proposes an enhanced deep functional maps scheme, EDiffNet, which replaces the Laplace operator with an approximate basis and integrates it into the DiffusionNet structure to improve computational efficiency. Simultaneously, a Scalable-ZoomOut iterative optimization layer is designed to achieve multi-scale correspondence by dynamically adjusting the radius. For feature similarity computation, the Euclidean distance metric on the WKS (Wave Kernel Signature) spectral descriptor is used to effectively overcome the interference of rigid transformations with the matching metric. On the FAUST and SCAPE deformation benchmarks, the average geodesic errors of EDiffNet are reduced to 0.025 and 0.0268, improvements of 8.75% and 1.8% over DiffusionNet. Meanwhile, by computing Euclidean distances between the WKS descriptors of point-wise correspondences across models, the EDiffNet framework is verified to exhibit high accuracy and robustness in the feature similarity matching task. The two-module synergistic architecture proposed in this study provides a new methodological framework for 3D shape analysis under complex deformations, and its efficient basis construction strategy generalizes to the design of geometric deep learning models.
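The WKS-based similarity used here is essentially a Euclidean distance between spectral descriptors; a minimal sketch of computing WKS from Laplace-Beltrami eigenpairs and matching points by nearest descriptor is given below (the energy grid, variance, and brute-force nearest-neighbor matching are generic choices, not necessarily EDiffNet's).

import numpy as np

def wave_kernel_signature(evals, evecs, n_energies=100):
    """WKS descriptor per vertex from Laplace-Beltrami eigenvalues (K,) and
    eigenvectors (V, K)."""
    evals = np.maximum(evals, 1e-10)
    log_e = np.log(evals)
    energies = np.linspace(log_e[1], log_e[-1], n_energies)
    sigma = 7.0 * (energies[1] - energies[0])
    # Gaussian bands over log-eigenvalues, one column per energy level.
    weights = np.exp(-(energies[None, :] - log_e[:, None]) ** 2 / (2 * sigma ** 2))  # (K, E)
    wks = (evecs ** 2) @ weights                                                      # (V, E)
    return wks / (weights.sum(axis=0, keepdims=True) + 1e-12)

def match_by_wks(wks_x, wks_y):
    """For each vertex of X, return the Y vertex with the closest WKS descriptor
    in Euclidean distance (brute force; fine for small meshes)."""
    dists = np.linalg.norm(wks_x[:, None, :] - wks_y[None, :, :], axis=-1)
    return dists.argmin(axis=1)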

16:10
Enhancing INR-based Super-Resolution Performance in Scientific Visualization via a Priori and a Posteriori Constraints

ABSTRACT. In scientific visualization, preserving low resolution volumetric data and employing super-resolution techniques to restore or enhance resolution for further analysis is an effective approach. Currently, neural network methods, particularly those based on implicit neural representations (INR), are widely used for this purpose. However, serious overfitting often arises when INR networks are used in super-resolution tasks, significantly impacting visualization results. This problem has not been previously discussed or resolved. Therefore, our aim is to address the challenges faced by INR networks in super-resolution tasks, pursuing improved reconstruction results and a more stable training process. We propose an approach based on a priori and a posteriori constraints that enhances super-resolution performance with minimal computational overhead. The core idea of this approach is to introduce constraints with additional information to bridge the gap between the model’s optimization goal and the specific objectives of the super-resolution task, thereby mitigating overfitting. We conducted multiple experiments on ensemble simulation datasets covering domains such as the universe, turbulence, and ignition. Comparisons with several state-of-the-art methods demonstrate the effectiveness of our approach.

16:25
Enhancing Single-View 3D Clothed Human Reconstruction with Hybrid Prior Integration

ABSTRACT. Achieving high-fidelity 3D reconstructions of clothed humans from a single image is pivotal for applications in virtual reality, gaming, and the fashion industry. However, the challenge of accurately reconstructing subjects in loose clothing and complex poses has yet to be fully addressed. To bridge this gap, we propose a novel hybrid approach that combines decoupled side-view features with a rebalanced parametric body-model prior to guide detailed 3D human reconstruction. This approach handles loose clothing and unusual poses simultaneously. Specifically, guided by the SMPL-X-based parametric human body prior, we leverage a Transformer-based framework to effectively decouple side-view features from input images. This process significantly enhances the accuracy of implicit-function-based reconstruction for complex poses, enabling a more precise representation of human body postures. Furthermore, to better adapt to diverse clothing types and avoid overfitting to training data consisting mainly of tight clothing, we introduce a rebalancing coefficient within a positional-embedding-based strategy. This coefficient adjusts the model's reliance on the parametric body prior, enhancing its ability to capture the details of loose clothing. Consequently, the model can generate more reliable SDF (Signed Distance Function) values, which are essential for creating high-fidelity 3D clothed human bodies. Extensive experiments demonstrate superior performance in representing loose clothing in detail while maintaining robust reconstruction of complex poses.

16:40
VS-DSN: Variable-Speed Dual-Stream Network for Continuous Sign Language Recognition

ABSTRACT. In Continuous Sign Language Recognition (CSLR), effectively capturing both spatial and temporal semantic properties, such as handshape, facial expressions, trajectories, and orientation, remains a significant challenge. While the RGB modality has become the dominant input due to its rich visual information, it still has limitations in capturing the intricate temporal features of sign language. To address this, we propose the Variable-Speed Dual-Stream Network (VS-DSN), which combines the spatial semantics of the RGB modality with the dynamic motion captured by the optical flow modality. We introduce a variable-speed sampling strategy based on semantic importance to selectively sample key frames from the RGB sequence, while reducing the feature dimension of the optical flow sequence. This strategy strikes a balance between computational efficiency and recognition performance. Furthermore, we design a Spatio-Temporal Interaction Module (STIM) that effectively integrates the spatial details of the RGB modality with the motion information from the optical flow modality. Experimental results on the CSL-Daily, PHOENIX14, and PHOENIX14-T datasets demonstrate the superior performance of the VS-DSN model. Compared to existing methods, VS-DSN achieves a Word Error Rate (WER) of 27.5% on CSL-Daily and 19.4% on both PHOENIX14 and PHOENIX14-T, while reducing computational complexity by 36% in terms of FLOPs. These results confirm the effectiveness of VS-DSN in CSLR tasks, highlighting its accuracy and computational efficiency.
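The variable-speed sampling idea can be illustrated as importance-proportional frame selection: frames with higher semantic-importance scores are kept at a denser rate. The inverse-CDF sketch below shows that generic idea; the scoring function, budget handling, and deduplication are illustrative assumptions rather than the paper's definition.

import numpy as np

def variable_speed_sample(importance, budget):
    """Pick up to `budget` frame indices so that sampling density follows the
    per-frame semantic-importance scores (denser where importance is high)."""
    scores = np.maximum(np.asarray(importance, dtype=np.float64), 1e-8)
    cdf = np.cumsum(scores) / scores.sum()           # monotone, ends at 1
    targets = (np.arange(budget) + 0.5) / budget      # evenly spaced quantiles
    idx = np.searchsorted(cdf, targets)
    # Duplicates are merged, so the effective budget can shrink slightly.
    return np.unique(np.clip(idx, 0, len(scores) - 1))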

16:55
Enhancing Nvshu Recognition Based on Polarity-aware Linear Attention and Learnable Local Salient Kernel

ABSTRACT. Nvshu is a unique Chinese writing system and the world's only gender-specific script. Its characters feature rich variations in stroke direction, numerous intersecting and decorative strokes, and specific writing rules. These characteristics often lead to inaccurate recognition and poor robustness when processed by traditional methods. To address these challenges, this paper proposes a new Nvshu recognition model, Polarity-aware Linear Attention and Learnable Local Salient Kernel (PALLS), whose network architecture collaboratively optimizes multiple modules. The PALLS model enhances feature extraction by introducing a Learnable Local Salient Kernel Module (LLSKM) after the traditional convolution. This module performs multi-scale feature extraction, significantly improving the model's ability to capture the edges and texture details of Nvshu strokes. Additionally, the model replaces traditional linear attention with a Polarity-aware Linear Attention (PLAttention) module. By decomposing and interactively modeling positive and negative directional features, this module enhances the model's ability to discriminate stroke directions. Furthermore, a Multi-scale Feature Fusion Module (MSFFM) is constructed to integrate semantic information from different levels, enhancing the model's adaptability to diverse samples. Experiments show that the PALLS model achieves a recognition accuracy of 92.2% on a self-built Nvshu dataset, an improvement of 4.6% to 6.8% over baseline models. This gain in precision and robustness fills a gap in deep learning-based research on Nvshu recognition and provides crucial technical support for the preservation and dissemination of Nvshu culture.