CASA 2025: COMPUTER ANIMATION AND SOCIAL AGENTS 2025
PROGRAM FOR WEDNESDAY, JUNE 4TH

08:30-09:30 Session 10: Keynote: Jehee Lee

Generative GaitNet and Beyond: Foundational Models for Human Motion Analysis and Simulation by Prof. Jehee Lee (Seoul National University)

Understanding the relationship between human anatomy and motion is fundamental to effective gait analysis, realistic motion simulation, and the creation of human body digital twins. We will begin with Generative GaitNet (SIGGRAPH 2022), a foundational model for human gait that drives a comprehensive full-body musculoskeletal system comprising 304 Hill-type musculotendons. Generative GaitNet is a pre-trained, integrated system of artificial neural networks that operates in a 618-dimensional continuous space defined by anatomical factors (e.g., mass distribution, body proportions, bone deformities, and muscle deficits) and gait parameters (e.g., stride and cadence). Given specific anatomy and gait conditions, the model generates corresponding gait cycles via real-time physics-based simulation. Next, we will discuss Bidirectional GaitNet (SIGGRAPH 2023), which consists of forward and backward models. The forward model predicts the gait pattern of an individual based on their physical characteristics, while the backward model infers physical conditions from observed gait patterns. Finally, we will present MAGNET (Muscle Activation Generation Networks)—another foundational model (SIGGRAPH 2025)—designed to reconstruct full-body muscle activations across a wide range of human motions. We will demonstrate its ability to accurately predict muscle activations from motions captured in video footage. We will conclude by discussing how these foundational models collectively contribute to the development of human body digital twins, and explore their future potential in personalized rehabilitation, surgery planning, and human-centered simulation.

Location: Amphie 23
09:35-10:50 Session 11A: AR/VR for Interaction
Location: Amphie 23
09:35
Perspective Matters: Investigating the Effects of Vibrotactile Mode Design on User Experience in Action-Role Playing Game and Media
PRESENTER: Hongyu Liu
09:53
Exploring Cultural Heritage with AR: The TAM Case Study of Nvshu
PRESENTER: Yejuan Xie
10:11
A Design Study on Contextual and Interactive Serious Games for Children’s Learning of Chinese Character Culture
PRESENTER: Xu Lang
10:29
Summon Arcane: An AI-Driven Pixel Art Game with Interactive Narrative and Immersive Summoning Experience
PRESENTER: Haoxiang Yang
09:35-10:50 Session 11B: Detection & Recognition
Location: Amphie 24
09:35
YOLOv8-HAC: Safety helmet detection model for complex underground coal mine scene
PRESENTER: Rui Liu

ABSTRACT. The working environment in underground coal mines is complex, and detecting whether safety helmets are worn is vital for ensuring worker safety. This paper proposes YOLOv8-HAC, an improved YOLOv8 safety helmet detection model, to address the coexistence of strong light exposure and low illumination, equipment occlusions that cause partial target loss, and missed detections of small targets due to limited surveillance perspectives in underground coal mines. The model replaces the C2f module in YOLOv8's backbone network with the proposed HAC-Net to improve feature extraction and detection performance for targets affected by motion blur and low image resolution. To improve detection stability in complex scenes and reduce background interference, an AGC-Block module is included for dynamic feature selection. Additionally, a small-target detection layer is added to increase the detection rate of distant, small safety helmets. Experimental results show that the enhanced model outperforms existing popular object detection algorithms, achieving a mAP of 94.8% and a recall rate of 90.4%. This demonstrates the effectiveness of the proposed approach for identifying safety helmets under complex lighting and with low-resolution images.
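
The abstract does not specify the internal design of the AGC-Block, so the following is only a minimal PyTorch sketch of one common form of dynamic feature selection (a squeeze-and-excitation-style channel gate); the block name, reduction ratio, and feature sizes are illustrative assumptions, not the paper's architecture.

import torch
import torch.nn as nn

class ChannelGate(nn.Module):
    """Illustrative channel-gating block: re-weights feature channels adaptively."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze spatial dimensions
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                            # per-channel gate in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        gate = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * gate                              # emphasize informative channels

# Example: gate a 256-channel backbone feature map
feats = torch.randn(2, 256, 40, 40)
print(ChannelGate(256)(feats).shape)                 # torch.Size([2, 256, 40, 40])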

10:00
STA-TAD: Spatial-Temporal Adapter on ViT for Temporal Action Detection
PRESENTER: Tingwei Wu
10:25
AU-guided Feature Aggregation for Micro-Expression Recognition
PRESENTER: Weiqi Xu

ABSTRACT. Micro-expressions (MEs) are spontaneous and transient facial movements that reflect genuine internal emotions, and they have been widely applied in various fields. Recent deep learning-based methods for micro-expression recognition (MER) have developed rapidly, but they typically focus on a one-sided view of MEs, covering only representational features or low-order Action Unit (AU) features. The subtle changes in MEs make their feature representations weak and inconspicuous, so analyzing MEs from a single piece or a small amount of information is unlikely to achieve satisfactory recognition. In addition, lower-order information can only distinguish MEs from a single low-dimensional perspective and neglects the potential correspondence between MEs and AU combinations. To address these issues, we first show through statistical analysis that higher-order relations among different AU combinations correspond to MEs. Based on this property, we propose an end-to-end multi-stream model that integrates global feature learning and local muscle movement representation guided by AU semantic information. Comparative experiments on benchmark datasets show better performance than state-of-the-art methods, and ablation experiments demonstrate the necessity of introducing AU information and its relations into MER.
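
As a rough illustration of fusing a global stream with AU-guided local streams, the sketch below weights per-AU local features by AU activation scores and concatenates them with a global embedding; the layer sizes, number of AUs, and fusion rule are assumptions for illustration, not the paper's exact multi-stream design.

import torch
import torch.nn as nn

class AUGuidedFusion(nn.Module):
    def __init__(self, feat_dim: int = 128, num_aus: int = 12, num_classes: int = 5):
        super().__init__()
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, global_feat, au_feats, au_scores):
        # global_feat: (B, D), au_feats: (B, num_aus, D), au_scores: (B, num_aus)
        w = torch.softmax(au_scores, dim=-1).unsqueeze(-1)   # (B, num_aus, 1)
        local_feat = (w * au_feats).sum(dim=1)               # AU-weighted local summary
        return self.classifier(torch.cat([global_feat, local_feat], dim=-1))

model = AUGuidedFusion()
logits = model(torch.randn(4, 128), torch.randn(4, 12, 128), torch.rand(4, 12))
print(logits.shape)   # torch.Size([4, 5])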

11:05-12:20 Session 12A: Cross-Modal and Semantic Representation Learning

4 LNCS

Location: Amphie 23
11:05
Potential Representation Learning for Visible-Infrared Person Re-Identification in Virtual Surveillance Systems
PRESENTER: Haoyuan Du
11:30
Hybrid-Granularity Image-Music Retrieval Using Contrastive Learning between Images and Music
PRESENTER: Xudong He
11:55
Text-driven Tree Modeling via CLIP-based Optimization
PRESENTER: Yudai Ichimura
11:05-12:20 Session 12B: Image Restoration & Enhancement

2 CAVW 2 LNCS

Location: Amphie 24
11:05
UTMCR: 3U-Net Transformer with Multi-Contrastive Regularization for Single Image Dehazing
PRESENTER: Hangbin Xu

ABSTRACT. Convolutional neural networks have a long history in single-image dehazing, but they have gradually been overtaken by Transformer-based frameworks, whose global modeling capability and large parameter counts suit the task. However, existing Transformer networks adopt a single U-Net structure, which is insufficient for multi-level, multi-scale feature fusion and modeling. We therefore propose an end-to-end dehazing network, UTMCR-Net, consisting of two parts: 1) a UT module that connects three U-Net networks in series, improving global modeling capability and capturing multi-scale information at different levels to achieve multi-level, multi-scale feature fusion; and 2) an MCR module that improves the original contrastive regularization by splitting the UT module's output into four equal blocks and applying contrastive regularization to each block separately. Specifically, the three serial U-Nets enhance UTMCR's global modeling and multi-scale feature fusion capabilities, and the MCR module further strengthens the dehazing ability. Experimental results show that our method achieves better results on most datasets.
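
A small sketch of the block-wise step described for the MCR module: split the dehazed output and its reference into four equal blocks and apply a per-block regularization term. The actual contrastive regularization also uses negative (hazy) samples and pretrained features; a plain L1 term stands in here as a placeholder.

import torch
import torch.nn.functional as F

def split_into_quadrants(x: torch.Tensor):
    """Split a (B, C, H, W) tensor into its four equal spatial blocks."""
    b, c, h, w = x.shape
    return [x[..., :h // 2, :w // 2], x[..., :h // 2, w // 2:],
            x[..., h // 2:, :w // 2], x[..., h // 2:, w // 2:]]

def blockwise_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    losses = [F.l1_loss(p, t) for p, t in zip(split_into_quadrants(pred),
                                              split_into_quadrants(target))]
    return torch.stack(losses).mean()

print(blockwise_loss(torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256)))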

11:23
SCNet: A Dual-Branch Network for Strong Noisy Image Denoising Based on Swin Transformer and ConvNeXt
PRESENTER: Chuchao Lin

ABSTRACT. Image denoising plays a vital role in restoring high-quality images from noisy inputs and directly impacts downstream vision tasks. Traditional methods often fail under strong noise, causing detail loss or excessive smoothing. While recent CNN-based and Transformer-based models have shown progress, they struggle to jointly capture global structure and preserve local details. To address this, we propose SCNet, a dual-branch fusion network tailored for strong-noise denoising. It combines a Swin Transformer branch for global context modeling and a ConvNeXt branch for fine-grained local feature extraction. Their outputs are adaptively merged via a Feature Fusion Block (FFB) using joint spatial and channel attention, ensuring semantic consistency and texture fidelity. A multi-scale upsampling module and the Charbonnier loss further improve structural accuracy and visual quality. Extensive experiments on four benchmark datasets show that SCNet outperforms state-of-the-art methods, especially under severe noise, and proves effective in real-world tasks such as mural image restoration.
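
The Charbonnier loss mentioned in the abstract is a smooth, robust variant of L1; a minimal PyTorch version is given below (the epsilon value is a typical choice made here for illustration, not necessarily the paper's setting).

import torch

def charbonnier_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    # sqrt((x - y)^2 + eps^2), averaged over all pixels
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()

denoised = torch.rand(1, 3, 64, 64)
clean = torch.rand(1, 3, 64, 64)
print(charbonnier_loss(denoised, clean))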

11:41
ShadowCraft-NeRF: Occlusion and Shadow Mitigation via SAM-Guided NeRF
PRESENTER: Xun Chen
11:59
Visualizing the Invisible: An Efficient Framework for Microscopic Visualization
PRESENTER: Haoran Jia

ABSTRACT. The growing focus on microscopic entities such as cells or viruses in medical education and public health has highlighted the need for better visualizations of microscopic life. Traditional methods like microscopy provide detailed images but are often hard for non-specialists to understand. Existing 3D models for visualization are mostly manually created and face issues with accuracy, time, and cost, limiting their applicability in large-scale educational and outreach efforts. This project presents an efficient and high-quality visualization framework that directly generates high-fidelity 3D models from real biological scanning data. The proposed approach integrates 3D reconstruction, texture mapping, coloring, and lighting, enabling the creation of detailed and accurate microscopic models with reduced labor and time costs. By overcoming the limitations of existing methods, this framework has the potential to enhance both medical education and public engagement with the microscopic world, offering an efficient, scalable, and accurate solution for visualizing complex biological structures.

14:00-15:40 Session 13A: Human Motion & Gesture Synthesis

CAVW

Location: Amphie 23
14:00
RIDGE: Rule-Infused Deep Learning for Realistic Co-Speech Gesture Generation
PRESENTER: Ghazanfar Ali

ABSTRACT. Co-speech gestures are essential for natural human communication, yet existing synthesis methods fall short in delivering semantically aligned and contextually appropriate motions. In this paper, we present RIDGE, a hybrid system that combines rule-based and deep learning approaches to generate realistic gestures for virtual avatars and human-computer interaction. RIDGE employs a high-fidelity rule base, generated from motion capture data with the assistance of large language models, to select reliable gesture mappings. When a high-confidence match is not available, a contrastively trained deep learning model steps in to produce semantically appropriate gestures. Evaluated using a novel Gesture Cluster Affinity (GCA) metric, our system outperforms existing baselines, achieving a GCA score of 0.73, compared with 0.60 for the rule-based baseline and 0.52 for the end-to-end model; the ground truth scores 0.90. Detailed analyses of the system architecture, data preprocessing, and evaluation methodology demonstrate RIDGE's potential to enhance gesture synthesis.
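
A schematic of the hybrid selection logic the abstract describes: look up the utterance in a rule base and keep the mapped gesture only if the match confidence clears a threshold, otherwise fall back to the learned generator. The rule-base format, threshold value, and generator interface are assumptions for illustration.

from typing import Callable, Optional

def select_gesture(utterance: str,
                   rule_base: dict[str, tuple[str, float]],
                   generator: Callable[[str], str],
                   threshold: float = 0.8) -> str:
    match: Optional[tuple[str, float]] = rule_base.get(utterance)
    if match is not None and match[1] >= threshold:
        return match[0]                 # high-confidence rule-based gesture
    return generator(utterance)         # deep-learning fallback

rules = {"hello everyone": ("wave_right_hand", 0.95)}
print(select_gesture("hello everyone", rules, lambda u: "beat_gesture"))
print(select_gesture("quantum entanglement", rules, lambda u: "beat_gesture"))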

14:20
Precise Motion Inbetweening via Bidirectional Autoregressive Diffusion Models
PRESENTER: Jiawen Peng

ABSTRACT. Conditional motion diffusion models have demonstrated significant potential for generating natural, plausible motions in response to constraints such as keyframes, which makes them applicable to motion inbetweening. However, most methods struggle to match keyframe constraints accurately, resulting in unsmooth transitions between keyframes and the generated motion. In this paper, we propose Bidirectional Autoregressive Motion Diffusion Inbetweening (BAMDI) to generate seamless motion between start and target frames. The main idea is to cast the motion diffusion model in an autoregressive paradigm that predicts sub-sequences of motion adjacent to both the start and target keyframes, infilling the missing frames over several iterations. This improves the local consistency of the generated motion, and the bidirectional generation ensures smoothness at both the start and target keyframes. Experiments show our method achieves state-of-the-art performance compared with other diffusion-based motion inbetweening methods.
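
A conceptual sketch of the bidirectional autoregressive infilling loop: short sub-sequences are generated alternately from the start side and the target side until the gap between keyframes is filled. The model interface, chunk length, and dummy interpolation stand-in are illustrative assumptions.

import numpy as np

def bidirectional_infill(start, target, total_frames, chunk, model):
    """start/target: (D,) keyframe poses; returns a (total_frames, D) motion."""
    front, back = [start], [target]
    while len(front) + len(back) < total_frames:
        remaining = total_frames - len(front) - len(back)
        # extend the front segment, conditioned on both current ends
        front.extend(model(front[-1], back[-1], min(chunk, remaining)))
        remaining = total_frames - len(front) - len(back)
        if remaining > 0:
            # extend the back segment toward the front
            back.extend(model(back[-1], front[-1], min(chunk, remaining)))
    return np.stack(front + back[::-1])

# Dummy "model": linearly interpolate n poses toward the opposite end.
dummy = lambda a, b, n: [a + (b - a) * (i + 1) / (n + 1) for i in range(n)]
motion = bidirectional_infill(np.zeros(3), np.ones(3), 10, 3, dummy)
print(motion.shape)   # (10, 3)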

14:40
Motion In-betweening via Recursive Keyframe Prediction
PRESENTER: Rui Zeng

ABSTRACT. Motion in-betweening is a flexible and efficient technique for generating 3D animations. In this paper, we propose a keyframe-driven method that effectively addresses the pose-ambiguity issue and achieves robust in-betweening performance. We introduce a keyframe-driven synthesis framework: at each recursion step, the key poses at both ends of an interval predict a new pose at its midpoint. This recursive breakdown reduces motion ambiguity by decomposing the in-betweening sequence into short clips. A hybrid positional encoding scales the hidden states to adapt to long- and short-term dependencies. Additionally, we employ a temporal refinement network to capture local motion relationships, thereby enhancing the consistency of the predicted pose sequence. Through comprehensive evaluations including both quantitative and qualitative comparisons, the proposed model demonstrates its competitiveness in prediction accuracy and in-betweening flexibility.
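
A minimal sketch of the recursive midpoint scheme: the poses at the two ends of an interval predict the pose at its midpoint, and recursion halves the interval until every frame is filled. The predict_mid callable is a stand-in for the learned keyframe predictor; a simple average is used here only to make the example runnable.

import numpy as np

def recursive_inbetween(poses, lo, hi, predict_mid):
    """Fill poses[lo+1:hi] in place, given known poses[lo] and poses[hi]."""
    if hi - lo <= 1:
        return
    mid = (lo + hi) // 2
    poses[mid] = predict_mid(poses[lo], poses[hi])    # predict the midpoint pose
    recursive_inbetween(poses, lo, mid, predict_mid)  # recurse on the left half
    recursive_inbetween(poses, mid, hi, predict_mid)  # recurse on the right half

frames, dim = 9, 4
poses = np.zeros((frames, dim))
poses[0], poses[-1] = np.zeros(dim), np.ones(dim)
recursive_inbetween(poses, 0, frames - 1, lambda a, b: (a + b) / 2)
print(poses[:, 0])   # monotone ramp from 0 to 1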

15:00
Interaction with Virtual Objects using Human Pose and Shape Estimation
PRESENTER: Hong Son Nguyen

ABSTRACT. In this paper, we propose an AR system that facilitates a user's natural interaction with virtual objects in an augmented reality environment. The system consists of three modules: human pose and shape estimation, camera-space calibration, and physics simulation. The first module estimates a user's 3D pose and shape from a single RGB video stream, reducing the system setup cost and broadening potential applications. The camera-space calibration module estimates the user's camera-space position to align the user with the input RGB image. The physics simulation enables seamless and physically natural interaction with virtual objects. Two prototype applications built upon the system demonstrate improved interaction quality, fostering a more immersive and intuitive user experience.

15:20
Motion Style Transfer: Methods, Challenges, and Future Directions
PRESENTER: Siyao Du
14:00-15:40 Session 13B: 3D Reconstruction & Representation

CAVW

Location: Amphie 24
14:00
LGNet: Local-and-Global Feature Adaptive Network for Single Image Two-Hand Reconstruction
PRESENTER: Haowei Xue

ABSTRACT. Accurate 3D interacting-hand mesh reconstruction from RGB images is crucial for applications such as robotics, augmented reality (AR), and virtual reality (VR). In robotics especially, accurate interacting-hand mesh reconstruction can significantly improve the accuracy and naturalness of human-robot interaction. The task requires an accurate understanding of the complex interactions between two hands and a reasonable alignment of the hand mesh with the image. Recent Transformer-based methods directly use the features of the two hands as input tokens, ignoring the correlation between local and global features of the interacting hands, which leads to hand ambiguity, self-occlusion, and self-similarity problems. We propose LGNet, a Local-and-Global Feature Adaptive Network, which separates the hand mesh reconstruction process into three stages: a joint stage that predicts hand joints, a mesh stage that predicts a rough hand mesh, and a refine stage that fine-tunes the mesh-image alignment using an offset mesh. LGNet enables high-quality fingertip-level mesh-image alignment, effectively models the spatial relationship between the two hands, and supports real-time prediction. Comprehensive quantitative and qualitative evaluations on benchmark datasets reveal that LGNet surpasses existing methods in mesh accuracy and alignment accuracy, while also showing robust generalization on in-the-wild images. Our source code will be made available to the community.
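
A bare-bones skeleton of the three-stage decomposition described above (joints, then a coarse mesh, then an offset-based refinement). All layer types, feature dimensions, and vertex counts are placeholder assumptions; the actual stages are Transformer-based and reason over both hands jointly.

import torch
import torch.nn as nn

class ThreeStageHandNet(nn.Module):
    def __init__(self, feat_dim=512, num_joints=21, num_verts=778):
        super().__init__()
        self.joint_stage = nn.Linear(feat_dim, num_joints * 3)                 # 3D joints
        self.mesh_stage = nn.Linear(feat_dim + num_joints * 3, num_verts * 3)  # coarse mesh
        self.refine_stage = nn.Linear(feat_dim + num_verts * 3, num_verts * 3) # offset mesh

    def forward(self, img_feat):
        joints = self.joint_stage(img_feat)
        coarse = self.mesh_stage(torch.cat([img_feat, joints], dim=-1))
        offset = self.refine_stage(torch.cat([img_feat, coarse], dim=-1))
        return joints, coarse + offset   # refined mesh = coarse mesh + predicted offsets

net = ThreeStageHandNet()
joints, mesh = net(torch.randn(2, 512))
print(joints.shape, mesh.shape)   # torch.Size([2, 63]) torch.Size([2, 2334])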

14:25
Joint-learning: A Robust Segmentation Method for 3D Point Clouds under Label Noise
PRESENTER: Tingyun Miao

ABSTRACT. Most point cloud segmentation methods are trained on clean datasets and are easily affected by label noise. We present a novel method called Joint-learning, which is the first attempt to apply a dual-network framework to point cloud segmentation with noisy labels. Two networks are trained simultaneously, and each network selects clean samples to update its peer network. The communication between the two networks exchanges the knowledge they have learned, giving the method good robustness and generalization ability. We further propose adaptive sample selection to maximize learning capacity: when the accuracies of both networks stop improving, the selection rate is reduced, yielding cleaner selected samples. To further reduce the impact of noisy labels, we provide a joint label correction algorithm that rectifies the labels of unselected samples via the two networks' predictions. We conduct extensive experiments on the S3DIS and ScanNet-v2 datasets under different types and rates of noise. Both quantitative and qualitative results verify the reasonableness and effectiveness of the proposed method. Our method is substantially superior to the state-of-the-art methods and achieves the best results in all noise settings, with an average performance improvement of more than 7.43% and a maximum of 11.42%.
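
A sketch of the dual-network "small-loss" exchange described above: each network keeps the points whose loss is lowest under its own predictions, and those points supervise the peer network. The selection rate, class count, and per-point cross-entropy are illustrative; the adaptive selection schedule and label correction step are omitted.

import torch
import torch.nn.functional as F

def peer_update_losses(logits_a, logits_b, labels, r=0.7):
    """logits_*: (N, C) per-point predictions from the two networks; labels: (N,) noisy labels."""
    loss_a = F.cross_entropy(logits_a, labels, reduction="none")
    loss_b = F.cross_entropy(logits_b, labels, reduction="none")
    k = int(r * labels.numel())
    idx_a = torch.topk(loss_a, k, largest=False).indices   # clean set chosen by network A
    idx_b = torch.topk(loss_b, k, largest=False).indices   # clean set chosen by network B
    # A trains on B's clean set and vice versa
    return (F.cross_entropy(logits_a[idx_b], labels[idx_b]),
            F.cross_entropy(logits_b[idx_a], labels[idx_a]))

la, lb = peer_update_losses(torch.randn(100, 13), torch.randn(100, 13),
                            torch.randint(0, 13, (100,)))
print(la.item(), lb.item())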

14:50
Weisfeiler-Lehman kernel augmented product representation for queries on large-scale BIM scenes
PRESENTER: Xiaojun Liu

ABSTRACT. To achieve efficient querying of BIM products in large-scale virtual scenes, this study introduces a Weisfeiler-Lehman (WL) kernel augmented representation for BIM products based on Product Attributed Graphs (PAGs). Unlike conventional data-driven approaches that demand extensive labeling and preprocessing, our method directly processes raw BIM product data to extract stable semantic and geometric features. Initially, a PAG is constructed to encapsulate product features. Subsequently, a WL kernel enhanced multi-channel node aggregation strategy is employed to integrate BIM product attributes effectively. Leveraging the bijective relationship in graph isomorphism, an unsupervised convergence mechanism based on attribute value differences is established. Experiments demonstrate that our method achieves convergence within an average of 3 iterations, completes graph isomorphism testing in minimal time, and attains an average query accuracy of 95%. This approach outperforms 1-WL and 3-WL methods, especially in handling products with topologically isomorphic but oppositely attributed spaces.
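
For readers unfamiliar with the WL kernel, the sketch below shows the classic Weisfeiler-Lehman relabeling loop on a toy attributed graph, the iterative node-label refinement the kernel is built on. The graph, initial labels, and iteration count are illustrative; the paper additionally aggregates multi-channel BIM attributes on top of this scheme.

def wl_refine(adj, labels, iterations=3):
    """adj: {node: [neighbors]}, labels: {node: label}; returns refined labels."""
    for _ in range(iterations):
        new_labels = {}
        for node, nbrs in adj.items():
            # combine own label with the sorted multiset of neighbor labels
            signature = (labels[node], tuple(sorted(labels[n] for n in nbrs)))
            new_labels[node] = hash(signature)
        labels = new_labels
    return labels

adj = {"wall": ["door", "slab"], "door": ["wall"], "slab": ["wall"]}
labels = {"wall": "Wall", "door": "Door", "slab": "Slab"}
print(wl_refine(adj, labels))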

15:15
DTGS: Defocus-Tolerant View Synthesis using Gaussian Splatting
PRESENTER: Xinying Dai

ABSTRACT. Defocus blur poses a significant challenge for 3D reconstruction, as traditional methods often struggle to maintain detail and accuracy in blurred regions. Building upon recent advances in 3D Gaussian Splatting (3DGS), we propose an architecture for 3D scene reconstruction from defocused, blurry images. Because the point clouds initialized by SfM are sparse, we improve the scene representation by filling in new Gaussians where the Gaussian field is insufficient. During optimization, we adjust the gradient field based on the depth values of the points and introduce a perceptual loss in the objective function to reduce the reconstruction bias caused by blurriness and enhance the realism of the rendered results. Experimental results on both synthetic and real datasets show that our method outperforms existing approaches in reconstruction quality and robustness, even under challenging defocus blur conditions.
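
A sketch of adding a perceptual term to a photometric rendering objective, as the abstract describes; the VGG layer cut-off, loss weight, and lack of input normalization are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

# Frozen VGG16 feature extractor used as the perceptual backbone
vgg_feats = vgg16(weights=VGG16_Weights.DEFAULT).features[:16].eval()
for p in vgg_feats.parameters():
    p.requires_grad_(False)

def render_loss(rendered, target, w_perc=0.05):
    """rendered/target: (B, 3, H, W) images in [0, 1]."""
    photometric = F.l1_loss(rendered, target)
    perceptual = F.l1_loss(vgg_feats(rendered), vgg_feats(target))
    return photometric + w_perc * perceptual

loss = render_loss(torch.rand(1, 3, 128, 128), torch.rand(1, 3, 128, 128))
print(loss.item())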