CASA 2025: COMPUTER ANIMATION AND SOCIAL AGENTS 2025
PROGRAM FOR WEDNESDAY, JUNE 4TH

08:30-09:30 Session 10: Keynote: Jehee Lee

Generative GaitNet and Beyond: Foundational Models for Human Motion Analysis and Simulation by Prof. Jehee Lee (Seoul National University)

Understanding the relationship between human anatomy and motion is fundamental to effective gait analysis, realistic motion simulation, and the creation of human body digital twins. We will begin with Generative GaitNet (SIGGRAPH 2022), a foundational model for human gait that drives a comprehensive full-body musculoskeletal system comprising 304 Hill-type musculotendons. Generative GaitNet is a pre-trained, integrated system of artificial neural networks that operates in a 618-dimensional continuous space defined by anatomical factors (e.g., mass distribution, body proportions, bone deformities, and muscle deficits) and gait parameters (e.g., stride and cadence). Given specific anatomy and gait conditions, the model generates corresponding gait cycles via real-time physics-based simulation. Next, we will discuss Bidirectional GaitNet (SIGGRAPH 2023), which consists of forward and backward models. The forward model predicts the gait pattern of an individual based on their physical characteristics, while the backward model infers physical conditions from observed gait patterns. Finally, we will present MAGNET (Muscle Activation Generation Networks)—another foundational model (SIGGRAPH 2025)—designed to reconstruct full-body muscle activations across a wide range of human motions. We will demonstrate its ability to accurately predict muscle activations from motions captured in video footage. We will conclude by discussing how these foundational models collectively contribute to the development of human body digital twins, and explore their future potential in personalized rehabilitation, surgery planning, and human-centered simulation.

Location: Amphie 23
09:35-10:50 Session 11A: AR/VR for Interaction
Location: Amphie 23
09:35
Perspective Matters: Investigating the Effects of Vibrotactile Mode Design on User Experience in Action-Role Playing Game and Media
PRESENTER: Hongyu Liu
09:53
Exploring Cultural Heritage with AR: The TAM Case Study of Nvshu
PRESENTER: Yejuan Xie
10:11
A Design Study on Contextual and Interactive Serious Games for Children’s Learning of Chinese Character Culture
PRESENTER: Xu Lang
10:29
Summon Arcane: An AI-Driven Pixel Art Game with Interactive Narrative and Immersive Summoning Experience
PRESENTER: Haoxiang Yang
09:35-10:50 Session 11B: Detection & Recognition
Location: Amphie 24
09:35
YOLOv8-HAC: Safety helmet detection model for complex underground coal mine scene
PRESENTER: Rui Liu

ABSTRACT. The working environment in underground coal mines is complex, and detecting whether safety helmets are worn is vital for ensuring worker safety. This paper proposes YOLOv8-HAC, an improved YOLOv8 safety helmet detection model, to address the coexistence of strong light exposure and low illumination, equipment occlusions that cause partial target loss, and missed detections of small targets due to limited surveillance perspectives in underground coal mines. The model replaces the C2f module in YOLOv8's backbone network with the proposed HAC-Net to improve feature extraction and detection performance for targets affected by motion blur and low image resolution. To improve detection stability in complex scenes and reduce background interference, an AGC-Block module is included for dynamic feature selection. Additionally, a small-target detection layer is added to increase the detection rate of distant, small safety helmets. Experimental results show that the enhanced model outperforms existing popular object detection algorithms, achieving a mAP of 94.8% and a recall rate of 90.4%. This demonstrates the effectiveness of the proposed approach for identifying safety helmets under complex lighting and with low-resolution images.
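
The abstract does not specify the internal design of the AGC-Block, so the following is only a minimal PyTorch sketch of one common form of dynamic feature selection (a squeeze-and-excitation-style channel gate); the block name, reduction ratio, and feature sizes are illustrative assumptions, not the paper's architecture.

import torch
import torch.nn as nn

class ChannelGate(nn.Module):
    """Illustrative channel-gating block: re-weights feature channels adaptively."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze spatial dimensions
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                            # per-channel gate in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        gate = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * gate                              # emphasize informative channels

# Example: gate a 256-channel backbone feature map
feats = torch.randn(2, 256, 40, 40)
print(ChannelGate(256)(feats).shape)                 # torch.Size([2, 256, 40, 40])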

10:00
STA-TAD: Spatial-Temporal Adapter on ViT for Temporal Action Detection
PRESENTER: Tingwei Wu
10:25
AU-guided Feature Aggregation for Micro-Expression Recognition
PRESENTER: Weiqi Xu

ABSTRACT. Micro-expressions (MEs) are spontaneous and transient facial movements that reflect genuine internal emotions, and they have been widely applied in various fields. Recent deep learning-based methods for micro-expression recognition (MER) have developed rapidly, but they typically focus on a one-sided view of MEs, covering only representational features or low-order Action Unit (AU) features. The subtle changes in MEs make their feature representations weak and inconspicuous, so analyzing MEs from a single piece or a small amount of information is unlikely to achieve satisfactory recognition. In addition, lower-order information can only distinguish MEs from a single low-dimensional perspective and neglects the potential correspondence between MEs and AU combinations. To address these issues, we first show through statistical analysis that higher-order relations among different AU combinations correspond to MEs. Based on this property, we propose an end-to-end multi-stream model that integrates global feature learning and local muscle movement representation guided by AU semantic information. Comparative experiments on benchmark datasets show better performance than state-of-the-art methods, and ablation experiments demonstrate the necessity of introducing AU information and its relations into MER.
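
As a rough illustration of fusing a global stream with AU-guided local streams, the sketch below weights per-AU local features by AU activation scores and concatenates them with a global embedding; the layer sizes, number of AUs, and fusion rule are assumptions for illustration, not the paper's exact multi-stream design.

import torch
import torch.nn as nn

class AUGuidedFusion(nn.Module):
    def __init__(self, feat_dim: int = 128, num_aus: int = 12, num_classes: int = 5):
        super().__init__()
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, global_feat, au_feats, au_scores):
        # global_feat: (B, D), au_feats: (B, num_aus, D), au_scores: (B, num_aus)
        w = torch.softmax(au_scores, dim=-1).unsqueeze(-1)   # (B, num_aus, 1)
        local_feat = (w * au_feats).sum(dim=1)               # AU-weighted local summary
        return self.classifier(torch.cat([global_feat, local_feat], dim=-1))

model = AUGuidedFusion()
logits = model(torch.randn(4, 128), torch.randn(4, 12, 128), torch.rand(4, 12))
print(logits.shape)   # torch.Size([4, 5])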

11:05-12:20 Session 12A: Cross-Modal and Semantic Representation Learning

4 LNCS

Location: Amphie 23
11:05
Potential Representation Learning for Visible-Infrared Person Re-Identification in Virtual Surveillance Systems
PRESENTER: Haoyuan Du
11:30
Hybrid-Granularity Image-Music Retrieval Using Contrastive Learning between Images and Music
PRESENTER: Xudong He
11:55
Text-driven Tree Modeling via CLIP-based Optimization
PRESENTER: Yudai Ichimura
11:05-12:20 Session 12B: Image Restoration & Enhancement

2 CAVW 2 LNCS

Location: Amphie 24
11:05
UTMCR: 3U-Net Transformer with Multi-Contrastive Regularization for Single Image Dehazing
PRESENTER: Hangbin Xu

ABSTRACT. Convolutional neural networks have a long history in single-image dehazing, but they have gradually been overtaken by Transformer-based frameworks, whose global modeling capability and large parameter counts suit the task. However, existing Transformer networks adopt a single U-Net structure, which is insufficient for multi-level, multi-scale feature fusion and modeling. We therefore propose an end-to-end dehazing network, UTMCR-Net, consisting of two parts: 1) a UT module that connects three U-Net networks in series, improving global modeling capability and capturing multi-scale information at different levels to achieve multi-level, multi-scale feature fusion; and 2) an MCR module that improves the original contrastive regularization by splitting the UT module's output into four equal blocks and applying contrastive regularization to each block separately. Specifically, the three serial U-Nets enhance UTMCR's global modeling and multi-scale feature fusion capabilities, and the MCR module further strengthens the dehazing ability. Experimental results show that our method achieves better results on most datasets.
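
A small sketch of the block-wise step described for the MCR module: split the dehazed output and its reference into four equal blocks and apply a per-block regularization term. The actual contrastive regularization also uses negative (hazy) samples and pretrained features; a plain L1 term stands in here as a placeholder.

import torch
import torch.nn.functional as F

def split_into_quadrants(x: torch.Tensor):
    """Split a (B, C, H, W) tensor into its four equal spatial blocks."""
    b, c, h, w = x.shape
    return [x[..., :h // 2, :w // 2], x[..., :h // 2, w // 2:],
            x[..., h // 2:, :w // 2], x[..., h // 2:, w // 2:]]

def blockwise_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    losses = [F.l1_loss(p, t) for p, t in zip(split_into_quadrants(pred),
                                              split_into_quadrants(target))]
    return torch.stack(losses).mean()

print(blockwise_loss(torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256)))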

11:23
SCNet: A Dual-Branch Network for Strong Noisy Image Denoising Based on Swin Transformer and ConvNeXt
PRESENTER: Chuchao Lin

ABSTRACT. Image denoising plays a vital role in restoring high-quality images from noisy inputs and directly impacts downstream vision tasks. Traditional methods often fail under strong noise, causing detail loss or excessive smoothing. While recent CNN-based and Transformer-based models have shown progress, they struggle to jointly capture global structure and preserve local details. To address this, we propose SCNet, a dual-branch fusion network tailored for strong-noise denoising. It combines a Swin Transformer branch for global context modeling and a ConvNeXt branch for fine-grained local feature extraction. Their outputs are adaptively merged via a Feature Fusion Block (FFB) using joint spatial and channel attention, ensuring semantic consistency and texture fidelity. A multi-scale upsampling module and the Charbonnier loss further improve structural accuracy and visual quality. Extensive experiments on four benchmark datasets show that SCNet outperforms state-of-the-art methods, especially under severe noise, and proves effective in real-world tasks such as mural image restoration.
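
The Charbonnier loss mentioned in the abstract is a smooth, robust variant of L1; a minimal PyTorch version is given below (the epsilon value is a typical choice made here for illustration, not necessarily the paper's setting).

import torch

def charbonnier_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    # sqrt((x - y)^2 + eps^2), averaged over all pixels
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()

denoised = torch.rand(1, 3, 64, 64)
clean = torch.rand(1, 3, 64, 64)
print(charbonnier_loss(denoised, clean))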

11:41
ShadowCraft-NeRF: Occlusion and Shadow Mitigation via SAM-Guided NeRF
PRESENTER: Xun Chen
11:59
Visualizing the Invisible: An Efficient Framework for Microscopic Visualization
PRESENTER: Haoran Jia

ABSTRACT. The growing focus on microscopic entities such as cells or viruses in medical education and public health has highlighted the need for better visualizations of microscopic life. Traditional methods like microscopy provide detailed images but are often hard for non-specialists to understand. Existing 3D models for visualization are mostly manually created and face issues with accuracy, time, and cost, limiting their applicability in large-scale educational and outreach efforts. This project presents an efficient and high-quality visualization framework that directly generates high-fidelity 3D models from real biological scanning data. The proposed approach integrates 3D reconstruction, texture mapping, coloring, and lighting, enabling the creation of detailed and accurate microscopic models with reduced labor and time costs. By overcoming the limitations of existing methods, this framework has the potential to enhance both medical education and public engagement with the microscopic world, offering an efficient, scalable, and accurate solution for visualizing complex biological structures.

14:00-15:40 Session 13A: Human Motion & Gesture Synthesis

CAVW

Location: Amphie 23
14:00
RIDGE: Rule-Infused Deep Learning for Realistic Co-Speech Gesture Generation
PRESENTER: Ghazanfar Ali

ABSTRACT. Co-speech gestures are essential for natural human communication, yet existing synthesis methods fall short in delivering semantically aligned and contextually appropriate motions. In this paper, we present RIDGE, a hybrid system that combines rule-based and deep learning approaches to generate realistic gestures for virtual avatars and human-computer interaction. RIDGE employs a high-fidelity rule base, generated from motion capture data with the assistance of large language models, to select reliable gesture mappings. When a high-confidence match is not available, a contrastively trained deep learning model steps in to produce semantically appropriate gestures. Evaluated using a novel Gesture Cluster Affinity (GCA) metric, our system outperforms existing baselines, achieving a GCA score of 0.73, compared with 0.60 for the rule-based baseline and 0.52 for the end-to-end model; the ground truth scores 0.90. Detailed analyses of the system architecture, data preprocessing, and evaluation methodology demonstrate RIDGE's potential to enhance gesture synthesis.
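
A schematic of the hybrid selection logic the abstract describes: look up the utterance in a rule base and keep the mapped gesture only if the match confidence clears a threshold, otherwise fall back to the learned generator. The rule-base format, threshold value, and generator interface are assumptions for illustration.

from typing import Callable, Optional

def select_gesture(utterance: str,
                   rule_base: dict[str, tuple[str, float]],
                   generator: Callable[[str], str],
                   threshold: float = 0.8) -> str:
    match: Optional[tuple[str, float]] = rule_base.get(utterance)
    if match is not None and match[1] >= threshold:
        return match[0]                 # high-confidence rule-based gesture
    return generator(utterance)         # deep-learning fallback

rules = {"hello everyone": ("wave_right_hand", 0.95)}
print(select_gesture("hello everyone", rules, lambda u: "beat_gesture"))
print(select_gesture("quantum entanglement", rules, lambda u: "beat_gesture"))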

14:20
Precise Motion Inbetweening via Bidirectional Autoregressive Diffusion Models
PRESENTER: Jiawen Peng

ABSTRACT. Conditional motion diffusion models have demonstrated significant potential for generating natural, plausible motions in response to constraints such as keyframes, which makes them applicable to motion inbetweening. However, most methods struggle to match keyframe constraints accurately, resulting in unsmooth transitions between keyframes and the generated motion. In this paper, we propose Bidirectional Autoregressive Motion Diffusion Inbetweening (BAMDI) to generate seamless motion between start and target frames. The main idea is to cast the motion diffusion model in an autoregressive paradigm that predicts sub-sequences of motion adjacent to both the start and target keyframes, infilling the missing frames over several iterations. This improves the local consistency of the generated motion, and the bidirectional generation ensures smoothness at both the start and target keyframes. Experiments show our method achieves state-of-the-art performance compared with other diffusion-based motion inbetweening methods.
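
A conceptual sketch of the bidirectional autoregressive infilling loop: short sub-sequences are generated alternately from the start side and the target side until the gap between keyframes is filled. The model interface, chunk length, and dummy interpolation stand-in are illustrative assumptions.

import numpy as np

def bidirectional_infill(start, target, total_frames, chunk, model):
    """start/target: (D,) keyframe poses; returns a (total_frames, D) motion."""
    front, back = [start], [target]
    while len(front) + len(back) < total_frames:
        remaining = total_frames - len(front) - len(back)
        # extend the front segment, conditioned on both current ends
        front.extend(model(front[-1], back[-1], min(chunk, remaining)))
        remaining = total_frames - len(front) - len(back)
        if remaining > 0:
            # extend the back segment toward the front
            back.extend(model(back[-1], front[-1], min(chunk, remaining)))
    return np.stack(front + back[::-1])

# Dummy "model": linearly interpolate n poses toward the opposite end.
dummy = lambda a, b, n: [a + (b - a) * (i + 1) / (n + 1) for i in range(n)]
motion = bidirectional_infill(np.zeros(3), np.ones(3), 10, 3, dummy)
print(motion.shape)   # (10, 3)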

14:40
Motion In-betweening via Recursive Keyframe Prediction
PRESENTER: Rui Zeng

ABSTRACT. Motion in-betweening is a flexible and efficient technique for generating 3D animations. In this paper, we propose a keyframe-driven method that effectively addresses the pose-ambiguity issue and achieves robust in-betweening performance. We introduce a keyframe-driven synthesis framework: at each recursion step, the key poses at both ends of an interval predict a new pose at its midpoint. This recursive breakdown reduces motion ambiguity by decomposing the in-betweening sequence into short clips. A hybrid positional encoding scales the hidden states to adapt to long- and short-term dependencies. Additionally, we employ a temporal refinement network to capture local motion relationships, thereby enhancing the consistency of the predicted pose sequence. Through comprehensive evaluations including both quantitative and qualitative comparisons, the proposed model demonstrates its competitiveness in prediction accuracy and in-betweening flexibility.
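
A minimal sketch of the recursive midpoint scheme: the poses at the two ends of an interval predict the pose at its midpoint, and recursion halves the interval until every frame is filled. The predict_mid callable is a stand-in for the learned keyframe predictor; a simple average is used here only to make the example runnable.

import numpy as np

def recursive_inbetween(poses, lo, hi, predict_mid):
    """Fill poses[lo+1:hi] in place, given known poses[lo] and poses[hi]."""
    if hi - lo <= 1:
        return
    mid = (lo + hi) // 2
    poses[mid] = predict_mid(poses[lo], poses[hi])    # predict the midpoint pose
    recursive_inbetween(poses, lo, mid, predict_mid)  # recurse on the left half
    recursive_inbetween(poses, mid, hi, predict_mid)  # recurse on the right half

frames, dim = 9, 4
poses = np.zeros((frames, dim))
poses[0], poses[-1] = np.zeros(dim), np.ones(dim)
recursive_inbetween(poses, 0, frames - 1, lambda a, b: (a + b) / 2)
print(poses[:, 0])   # monotone ramp from 0 to 1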

15:00
Interaction with Virtual Objects using Human Pose and Shape Estimation
PRESENTER: Hong Son Nguyen

ABSTRACT. In this paper, we propose an AR system that facilitates a user's natural interaction with virtual objects in an augmented reality environment. The system consists of three modules: human pose and shape estimation, camera-space calibration, and physics simulation. The first module estimates a user's 3D pose and shape from a single RGB video stream, reducing the system setup cost and broadening potential applications. The camera-space calibration module estimates the user's camera-space position to align the user with the input RGB image. The physics simulation enables seamless and physically natural interaction with virtual objects. Two prototype applications built upon the system demonstrate improved interaction quality, fostering a more immersive and intuitive user experience.

15:20
Motion Style Transfer: Methods, Challenges, and Future Directions
PRESENTER: Siyao Du
14:00-15:40 Session 13B: 3D Reconstruction & Representation

CAVW

Location: Amphie 24
14:00
LGNet: Local-and-Global Feature Adaptive Network for Single Image Two-Hand Reconstruction
PRESENTER: Haowei Xue

ABSTRACT. Accurate 3D interacting-hand mesh reconstruction from RGB images is crucial for applications such as robotics, augmented reality (AR), and virtual reality (VR). In robotics especially, accurate interacting-hand mesh reconstruction can significantly improve the accuracy and naturalness of human-robot interaction. The task requires an accurate understanding of the complex interactions between two hands and a reasonable alignment of the hand mesh with the image. Recent Transformer-based methods directly use the features of the two hands as input tokens, ignoring the correlation between local and global features of the interacting hands, which leads to hand ambiguity, self-occlusion, and self-similarity problems. We propose LGNet, a Local-and-Global Feature Adaptive Network, which separates the hand mesh reconstruction process into three stages: a joint stage that predicts hand joints, a mesh stage that predicts a rough hand mesh, and a refine stage that fine-tunes the mesh-image alignment using an offset mesh. LGNet enables high-quality fingertip-level mesh-image alignment, effectively models the spatial relationship between the two hands, and supports real-time prediction. Comprehensive quantitative and qualitative evaluations on benchmark datasets reveal that LGNet surpasses existing methods in mesh accuracy and alignment accuracy, while also showing robust generalization on in-the-wild images. Our source code will be made available to the community.
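
A bare-bones skeleton of the three-stage decomposition described above (joints, then a coarse mesh, then an offset-based refinement). All layer types, feature dimensions, and vertex counts are placeholder assumptions; the actual stages are Transformer-based and reason over both hands jointly.

import torch
import torch.nn as nn

class ThreeStageHandNet(nn.Module):
    def __init__(self, feat_dim=512, num_joints=21, num_verts=778):
        super().__init__()
        self.joint_stage = nn.Linear(feat_dim, num_joints * 3)                 # 3D joints
        self.mesh_stage = nn.Linear(feat_dim + num_joints * 3, num_verts * 3)  # coarse mesh
        self.refine_stage = nn.Linear(feat_dim + num_verts * 3, num_verts * 3) # offset mesh

    def forward(self, img_feat):
        joints = self.joint_stage(img_feat)
        coarse = self.mesh_stage(torch.cat([img_feat, joints], dim=-1))
        offset = self.refine_stage(torch.cat([img_feat, coarse], dim=-1))
        return joints, coarse + offset   # refined mesh = coarse mesh + predicted offsets

net = ThreeStageHandNet()
joints, mesh = net(torch.randn(2, 512))
print(joints.shape, mesh.shape)   # torch.Size([2, 63]) torch.Size([2, 2334])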

14:25
Joint-learning: A Robust Segmentation Method for 3D Point Clouds under Label Noise
PRESENTER: Tingyun Miao

ABSTRACT. Most point cloud segmentation methods are trained on clean datasets and are easily affected by label noise. We present a novel method called Joint-learning, which is the first attempt to apply a dual-network framework to point cloud segmentation with noisy labels. Two networks are trained simultaneously, and each network selects clean samples to update its peer network. The communication between the two networks exchanges the knowledge they have learned, giving the method good robustness and generalization ability. We further propose adaptive sample selection to maximize learning capacity: when the accuracies of both networks stop improving, the selection rate is reduced, yielding cleaner selected samples. To further reduce the impact of noisy labels, we provide a joint label correction algorithm that rectifies the labels of unselected samples via the two networks' predictions. We conduct extensive experiments on the S3DIS and ScanNet-v2 datasets under different types and rates of noise. Both quantitative and qualitative results verify the reasonableness and effectiveness of the proposed method. Our method is substantially superior to the state-of-the-art methods and achieves the best results in all noise settings, with an average performance improvement of more than 7.43% and a maximum of 11.42%.
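
A sketch of the dual-network "small-loss" exchange described above: each network keeps the points whose loss is lowest under its own predictions, and those points supervise the peer network. The selection rate, class count, and per-point cross-entropy are illustrative; the adaptive selection schedule and label correction step are omitted.

import torch
import torch.nn.functional as F

def peer_update_losses(logits_a, logits_b, labels, r=0.7):
    """logits_*: (N, C) per-point predictions from the two networks; labels: (N,) noisy labels."""
    loss_a = F.cross_entropy(logits_a, labels, reduction="none")
    loss_b = F.cross_entropy(logits_b, labels, reduction="none")
    k = int(r * labels.numel())
    idx_a = torch.topk(loss_a, k, largest=False).indices   # clean set chosen by network A
    idx_b = torch.topk(loss_b, k, largest=False).indices   # clean set chosen by network B
    # A trains on B's clean set and vice versa
    return (F.cross_entropy(logits_a[idx_b], labels[idx_b]),
            F.cross_entropy(logits_b[idx_a], labels[idx_a]))

la, lb = peer_update_losses(torch.randn(100, 13), torch.randn(100, 13),
                            torch.randint(0, 13, (100,)))
print(la.item(), lb.item())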

14:50
Weisfeiler-Lehman kernel augmented product representation for queries on large-scale BIM scenes
PRESENTER: Xiaojun Liu

ABSTRACT. To achieve efficient querying of BIM products in large-scale virtual scenes, this study introduces a Weisfeiler-Lehman (WL) kernel augmented representation for BIM products based on Product Attributed Graphs (PAGs). Unlike conventional data-driven approaches that demand extensive labeling and preprocessing, our method directly processes raw BIM product data to extract stable semantic and geometric features. Initially, a PAG is constructed to encapsulate product features. Subsequently, a WL kernel enhanced multi-channel node aggregation strategy is employed to integrate BIM product attributes effectively. Leveraging the bijective relationship in graph isomorphism, an unsupervised convergence mechanism based on attribute value differences is established. Experiments demonstrate that our method achieves convergence within an average of 3 iterations, completes graph isomorphism testing in minimal time, and attains an average query accuracy of 95%. This approach outperforms 1-WL and 3-WL methods, especially in handling products with topologically isomorphic but oppositely attributed spaces.
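
For readers unfamiliar with the WL kernel, the sketch below shows the classic Weisfeiler-Lehman relabeling loop on a toy attributed graph, the iterative node-label refinement the kernel is built on. The graph, initial labels, and iteration count are illustrative; the paper additionally aggregates multi-channel BIM attributes on top of this scheme.

def wl_refine(adj, labels, iterations=3):
    """adj: {node: [neighbors]}, labels: {node: label}; returns refined labels."""
    for _ in range(iterations):
        new_labels = {}
        for node, nbrs in adj.items():
            # combine own label with the sorted multiset of neighbor labels
            signature = (labels[node], tuple(sorted(labels[n] for n in nbrs)))
            new_labels[node] = hash(signature)
        labels = new_labels
    return labels

adj = {"wall": ["door", "slab"], "door": ["wall"], "slab": ["wall"]}
labels = {"wall": "Wall", "door": "Door", "slab": "Slab"}
print(wl_refine(adj, labels))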

15:15
DTGS: Defocus-Tolerant View Synthesis using Gaussian Splatting
PRESENTER: Xinying Dai

ABSTRACT. Defocus blur poses a significant challenge for 3D reconstruction, as traditional methods often struggle to maintain detail and accuracy in blurred regions. Building upon recent advances in 3D Gaussian Splatting (3DGS), we propose an architecture for 3D scene reconstruction from defocused, blurry images. Because the point clouds initialized by SfM are sparse, we improve the scene representation by filling in new Gaussians where the Gaussian field is insufficient. During optimization, we adjust the gradient field based on the depth values of the points and introduce a perceptual loss in the objective function to reduce the reconstruction bias caused by blurriness and enhance the realism of the rendered results. Experimental results on both synthetic and real datasets show that our method outperforms existing approaches in reconstruction quality and robustness, even under challenging defocus blur conditions.
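
A sketch of adding a perceptual term to a photometric rendering objective, as the abstract describes; the VGG layer cut-off, loss weight, and lack of input normalization are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

# Frozen VGG16 feature extractor used as the perceptual backbone
vgg_feats = vgg16(weights=VGG16_Weights.DEFAULT).features[:16].eval()
for p in vgg_feats.parameters():
    p.requires_grad_(False)

def render_loss(rendered, target, w_perc=0.05):
    """rendered/target: (B, 3, H, W) images in [0, 1]."""
    photometric = F.l1_loss(rendered, target)
    perceptual = F.l1_loss(vgg_feats(rendered), vgg_feats(target))
    return photometric + w_perc * perceptual

loss = render_loss(torch.rand(1, 3, 128, 128), torch.rand(1, 3, 128, 128))
print(loss.item())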