Harmonized XR: Seamlessly Bridging Physical and Perceptual Realism by Prof. Xubo Yang (Shanghai Jiao Tong University)
Extended Reality (XR) represents a spectrum of immersive technologies that seamlessly blend the digital and physical worlds, creating environments where users can interact with virtual content as if it were part of their reality. This keynote synthesizes cutting-edge research across visual perception, physical simulation, and interactive rendering to explore how XR can achieve both physical realism (accurate representation of physical phenomena) and perceptual realism (alignment with human visual and sensory perception).
We begin by addressing the challenges of visual fidelity in XR through innovative techniques that enhance occlusion, color accuracy, and rendering efficiency, ensuring that virtual content aligns seamlessly with human perception. Next, we delve into advancements in simulation methodologies that bring unprecedented physical accuracy to virtual environments, enabling the realistic representation of complex phenomena such as fluids, bubbles, and surface tension effects. Finally, we explore interactive experiences that bridge the gap between physical and perceptual realism by optimizing virtual interactions to align with natural human behavior and visual focus.
By integrating these advancements, XR can achieve a harmonious balance between physical and perceptual realism, creating immersive environments that are not only computationally efficient but also deeply engaging and believable. This keynote will highlight the interplay between these dimensions, offering a comprehensive roadmap for the future of XR technologies.
3 CAVW 1 LNCS
09:35 | Talking Face Generation with Lip and Identity Priors PRESENTER: Jiajie Wu ABSTRACT. Speech-driven talking face video generation has attracted significant research attention. While person-specific methods can produce high-fidelity videos, they require training or fine-tuning on target speakers' videos. In contrast, general-person methods often struggle with generating lip-synced videos, failing to accurately preserve identity information and produce clear and natural lip movements. To address these issues, we propose a novel network architecture comprising an alignment model and a rendering model. Given facial landmarks extracted from speech signals, the rendering model integrates a partially occluded target face, multi-reference lip features, and audio features to generate identity-consistent lip movements. Meanwhile, the alignment model leverages the partially occluded target face and a static reference-image prior to predict optical flow, aligning facial poses and lip shapes. This facilitates the rendering model in producing more realistic and identity-preserving results. Extensive experiments demonstrate that our approach generates high-quality talking face videos with improved lip details and identity retention. |
09:53 | Speech-Driven 3D Facial Animation with Regional Attention for Style Capture PRESENTER: Jiahao Pan |
10:11 | Coarse-to-Fine 3D Craniofacial Landmark Detection via Heat Kernel Optimization PRESENTER: Xingfei Xue ABSTRACT. Accurate 3D craniofacial landmark detection is critical for applications in medicine and computer animation, yet remains challenging due to the complex geometry of craniofacial structures. In this work, we propose a coarse-to-fine framework for anatomical landmark localization on 3D craniofacial models. First, we introduce a Diffused Two-Stream Network (DTS-Net) for heatmap regression, which effectively captures both local and global geometric features by integrating pointwise scalar flow, tangent space vector flow, and spectral features in the Laplace-Beltrami space. This design enables robust representation of complex anatomical structures. Second, we propose a heat kernel-based energy optimization method to extract landmark coordinates from the predicted heatmaps. This approach exhibits strong performance across various geometric regions, including boundaries, flat surfaces, and high-curvature areas, ensuring accurate and consistent localization. Our method achieves state-of-the-art results on both a 3D cranial dataset and the BU-3DFE facial dataset. |
10:29 | GSFaceMorpher: High-Fidelity 3D Face Morphing via Gaussian Splatting PRESENTER: Xiwen Shi ABSTRACT. High-fidelity 3D face morphing aims to achieve seamless transitions between realistic 3D facial representations of different identities. While 3D Gaussian Splatting (3DGS) excels in high-quality rendering, its application to morphing is hindered by the lack of Gaussian primitive correspondence and variations in primitive quantities. To address this, we propose GSFaceMorpher, a novel framework for high-fidelity 3D face morphing based on 3DGS. Our method constructs an auxiliary model that bridges the source and target face models by aligning geometry through Radial Basis Function (RBF) warping and optimizing appearance in image space. This auxiliary model enables smooth parameter interpolation, while a diffusion-based refinement step enhances critical facial details through attention replacement from reference faces. Experiments demonstrate that our method produces visually coherent and high-fidelity morphing sequences, significantly outperforming NeRF-based baselines in both quantitative metrics and user preference. Our work establishes a new benchmark for high-fidelity 3D face morphing, with applications in visual effects, animation, and immersive experiences. |
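For readers unfamiliar with RBF warping, the sketch below illustrates the generic technique of interpolating sparse landmark displacements to a set of 3D points (here, Gaussian centers). It is a minimal illustration under assumed inputs (`src_landmarks`, `tgt_landmarks`, `gaussian_centers`), not the authors' implementation.

```python
# Illustrative sketch (not the paper's code): warp source Gaussian centers toward a target
# face by Radial Basis Function interpolation of sparse landmark displacements.
import numpy as np

def rbf_warp(src_landmarks, tgt_landmarks, gaussian_centers, eps=1e-8):
    """Interpolate landmark displacements to all Gaussian centers with a triharmonic RBF kernel."""
    def kernel(r):
        return r ** 3  # triharmonic kernel in 3D; other radial kernels could be substituted

    # Solve the regularized RBF system for per-landmark displacement weights.
    d = np.linalg.norm(src_landmarks[:, None, :] - src_landmarks[None, :, :], axis=-1)
    A = kernel(d) + eps * np.eye(len(src_landmarks))
    weights = np.linalg.solve(A, tgt_landmarks - src_landmarks)

    # Evaluate the interpolated displacement field at every Gaussian center.
    d_pts = np.linalg.norm(gaussian_centers[:, None, :] - src_landmarks[None, :, :], axis=-1)
    return gaussian_centers + kernel(d_pts) @ weights
```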
CAVW
09:35 | Chinese Painting Generation with A Stroke-by-Stroke Renderer and a Semantic Loss PRESENTER: Yuan Ma ABSTRACT. Chinese painting is the traditional way of painting in China, with distinctive artistic characteristics and strong national style. Creating Chinese paintings is a complex and difficult process for ordinary people, so utilizing computer-aided Chinese painting generation is a meaningful topic. In this paper, we propose a novel Chinese painting generation model, which can generate vivid Chinese paintings in a stroke-by-stroke manner. In contrast to previous neural renderers, we design a Chinese painting renderer that can generate two classic stroke types of Chinese painting (i.e., middle-tip stroke and side-tip stroke), without the aid of any neural network. To capture the subtle semantic representation from the input image, we design a semantic loss to compute the distance between the input image and the output Chinese painting. Experiments demonstrate that our method can generate vivid and elegant Chinese paintings. |
10:00 | Research on Multi-Feature Fusion Shadow Puppet Motifs Generation Based on CSPMotifsGAN and Cultural Heritage Preservation PRESENTER: Rui Wang ABSTRACT. As quintessential cultural symbols in traditional shadow puppetry, artistic motifs encapsulate profound historical narratives and serve as vital conduits for intangible cultural heritage preservation. However, this craft confronts existential threats from digital entertainment proliferation and practitioner attrition. To address these challenges, this study proposes CSPMotifsGAN, an enhanced CycleGAN framework for constructing a motif dataset through three-stage processing: adaptive denoising, hierarchical classification, and multi-branch feature extraction (contour, texture, color). By integrating adversarial loss, cycle-consistency loss, and identity preservation loss, the model effectively resolves the color distortion and textural degradation inherent in conventional CycleGAN. Experimental results demonstrate significant improvements in Fréchet Inception Distance (FID), Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity Index (SSIM), validated through both subjective evaluations and statistical analysis. |
10:25 | CLPFusion: A Latent Diffusion Model Framework for Realistic Chinese Landscape Painting Style Transfer PRESENTER: Jiahui Pan ABSTRACT. This study focuses on transforming real-world scenery into Chinese landscape painting masterpieces through style transfer. Traditional methods using convolutional neural networks (CNNs) and generative adversarial networks (GANs) often yield inconsistent patterns and artifacts. The rise of diffusion models (DMs) presents new opportunities for realistic image generation, but their inherent noise characteristics make it challenging to synthesize pure white or black images. Consequently, existing DM-based methods struggle to capture the unique style and color information of Chinese landscape paintings. To overcome these limitations, we propose CLPFusion, a novel framework that leverages pre-trained diffusion models for artistic style transfer. A key innovation is the Bidirectional State Space Models-CrossAttention (BiSSM-CA) module, which efficiently learns and retains the distinct styles of Chinese landscape paintings. Additionally, we introduce two latent space feature adjustment methods, Latent-AdaIN and Latent-WCT, to enhance style modulation during inference. Experiments demonstrate that CLPFusion produces more realistic and artistic Chinese landscape paintings than existing approaches, showcasing its effectiveness and uniqueness in the field. |
CAVW
11:05 | Decoupling Density Dynamics: A Neural Operator Framework for Adaptive Multi-Fluid Interactions PRESENTER: Yuhang Xu ABSTRACT. The dynamic interface prediction of multi-density fluids presents a fundamental challenge across computational fluid dynamics and graphics, rooted in nonlinear momentum transfer and multi-scale aliasing induced by density disparities. We present Density-Conditioned Dynamic Convolution, a novel neural operator framework that establishes a differentiable density-dynamics mapping through decoupled operator response. The core theoretical advancement lies in continuously adaptive neighborhood kernels that transform local density distributions into tunable filters, enabling a unified representation from homogeneous media to multi-phase fluids. Experiments demonstrate autonomous evolution of physically consistent interface separation patterns in density-contrast scenarios, including cocktail and bidirectional hourglass flows. |
11:30 | A Control Simulation of Multiple Bubbles for Representing Desired Shapes PRESENTER: Syuhei Sato ABSTRACT. This paper presents a control simulation that represents user-desired shapes using multiple connected soap bubbles. A previous method attempted to control a single soap bubble using external forces. However, because strong surface tension keeps the bubbles spherical, elongated shapes could not be achieved. To address this issue, this paper aims to develop a control simulation that achieves diverse soap bubble shapes by dividing the target shape into connected soap bubbles. In our approach, we first generate an initial soap bubble configuration composed of multiple bubbles to represent the target shape. Then, by applying external forces to each bubble, we simulate the bubbles so that they maintain their shape along the target form. We use an implicit-function-like representation for the connected soap bubbles and develop a new polygonizer that constructs shapes including the inner faces of the bubbles. By demonstrating examples with various target shapes such as objects and text, we show the effectiveness of our proposed control method. |
11:55 | A versatile energy-based SPH surface tension with spatial gradients PRESENTER: Qianwei Wang ABSTRACT. We propose a novel simulation method for surface tension effects based on the Smoothed Particle Hydrodynamics framework, capturing versatile tension effects using a unified interface energy description. Guided by the principle of energy minimization, we compute the interface energy from multiple interfaces solely using the original kernel function estimation, which eliminates the dependence on second-order derivative discretization. Subsequently, we incorporate an inertia term into the energy function to strike a balance between tension effects and other forces. To simulate tension, we propose an energy-diffusion-based method for minimizing the objective energy function. The particles at the interface are iteratively shifted from high-energy regions to low-energy regions, thereby achieving global interface energy minimization. Furthermore, our approach incorporates surface tension parameters as variable quantities within the energy framework, enabling automatic resolution of tension spatial gradients without requiring explicit computation of interfacial gradients. Experimental results demonstrate that our method effectively captures the wetting, capillary, and Marangoni effects, showcasing significant improvements in both the accuracy and stability of tension simulation. |
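The toy sketch below illustrates only the general idea of iteratively shifting interface particles from high-energy toward low-energy regions using a first-order SPH kernel estimate. The kernel choice, the energy definition, and the update rule are simplifications assumed purely for illustration and are not the paper's formulation.

```python
# Toy illustration of energy-driven particle shifting (not the paper's method).
import numpy as np

def poly6(r, h):
    """Standard poly6-style SPH kernel, left unnormalized for simplicity."""
    return np.maximum(h ** 2 - r ** 2, 0.0) ** 3

def shift_particles(positions, sigma, h, step=0.1, iters=10):
    """Iteratively move particles from high- toward low-energy regions (toy scheme)."""
    pos = positions.copy()
    for _ in range(iters):
        d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
        w = poly6(d, h)
        density_like = w.sum(axis=1)
        energy = sigma / (density_like + 1e-6)          # per-particle toy interface energy
        # Kernel-weighted average position of lower-energy neighbors.
        lower = (energy[None, :] < energy[:, None]) * w
        denom = lower.sum(axis=1, keepdims=True) + 1e-6
        target = lower @ pos / denom
        has_lower = denom[:, 0] > 1e-5
        pos[has_lower] += step * (target[has_lower] - pos[has_lower])
    return pos
```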
11:05 | Virtual Guides and Crowd Behaviors: Understanding Evacuation Decision-Making in Virtual Reality PRESENTER: Ziyuan Feng |
11:23 | BACH: Bi-stage Data-driven Piano Performance Animation for Controllable Hand motion PRESENTER: Jihui Jiao ABSTRACT. This paper presents a novel framework for generating piano performance animations using a two-stage deep learning model. By using discrete musical score data, the framework transforms sparse control signals into continuous, natural hand motions. Specifically, in the first stage, by incorporating musical temporal context, the keyframe predictor is leveraged to learn keyframe motion guidance. Meanwhile, the second stage synthesizes smooth transitions between these keyframes via an inter-frame sequence generator. Additionally, a Laplacian operator-based motion retargeting technique is introduced, ensuring that the generated animations can be adapted to different digital human models. We demonstrate the effectiveness of the system through an audiovisual multimedia application. Our approach provides an efficient, scalable method for generating realistic piano animations and holds promise for broader applications in animation tasks driven by sparse control signals. |
11:41 | Risk-Aware Pedestrian Behavior Using Reinforcement Learning in Mixed Traffic PRESENTER: Tzu-Yu Chen ABSTRACT. This paper introduces a reinforcement learning method to simulate agents crossing roads in unsignalized, mixed-traffic environments. These agents represent individual pedestrians or small groups. The method ensures that agents adopt safe interactions with nearby dynamic objects (bikes, motorcycles, or cars) by considering factors such as conflict zones and post-encroachment times. Risk assessments based on interaction times encourage agents to avoid hazardous behaviors. Additionally, risk-informed reward terms incentivize agents to perform safe actions, while collision penalties deter collisions. The method achieved collision-free crossings and demonstrated normal, conservative, and aggressive pedestrian behaviors in various scenarios. Finally, ablation tests revealed the impact of reward weights, reward terms, and key agent state components. |
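As a generic illustration of risk-informed reward shaping of the kind described above, the sketch below combines a progress term, a risk penalty that grows as the post-encroachment time (PET) falls below a threshold, and a collision penalty. The weights, threshold, and function names are hypothetical, not taken from the paper.

```python
# Hypothetical reward-shaping sketch for a road-crossing agent.
def pedestrian_reward(progress, min_pet, collided,
                      w_progress=1.0, w_risk=0.5, w_collision=10.0, pet_threshold=2.0):
    """Reward = progress toward the goal - risk penalty for short PET - collision penalty."""
    risk = max(0.0, (pet_threshold - min_pet) / pet_threshold)  # grows as PET drops below threshold
    reward = w_progress * progress - w_risk * risk
    if collided:
        reward -= w_collision
    return reward
```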
11:59 | Improving Fidelity of Close Social Interaction Animations in Social VR with a Machine Learning-based Refinement Framework PRESENTER: Roberta Macaluso |
4 CAVW 1 LNCS
14:00 | Scene-EEGCNN: Visualization of Zen Meditation Experience Based on EEG-Cultural Heritage Integration PRESENTER: Longfei Yang ABSTRACT. With the deepening of global cultural exchange, Zen culture, as one of the traditional Chinese cultures, has gradually gained admiration from modern society. Zazen is the core practice of Zen, but the emotional changes of a practitioner during Zazen are difficult to perceive and visualize. By constructing Zen-inspired scenes, the inner world of the practitioner can be depicted. Chinese cultural heritage, including Buddhist sculptures, poetry, and landscape paintings, provides rich materials for presenting these Zen scenes. This paper proposes a method for analyzing emotional changes based on EEG assessment and maps them to elements of traditional Chinese cultural heritage, using virtual scenes to showcase the emotional fluctuations of the Zazen practitioner. Specifically, this paper introduces the Scene-EEGCNN algorithm, which reads the EEG signals of the practitioner in real-time to assess their emotional state and inner fluctuations. Since the emotional changes of a Zazen practitioner are often difficult for the outside world to detect, this algorithm maps the emotional data to specific elements of Zen culture, constructing a Zen-inspired virtual scene to intuitively represent the practitioner's inner world. With this technology, practitioners can not only gain a deeper understanding of their emotional changes but also share and communicate their Zen meditation experiences with others in a visual way, thus promoting global cultural exchange and understanding. |
14:20 | Exploring the Therapeutic Potential of VR-Based ASMR Animation: A Comparative Study on Relaxation and Sleep Aid PRESENTER: Jiahao Du ABSTRACT. Although numerous studies have explored relaxation and sleep aid through Autonomous Sensory Meridian Response (ASMR) videos or conventional Virtual Reality (VR) relaxation methods, the integration of VR 3D animation with ASMR and its comparison to traditional VR relaxation methods remains underexplored. To address this gap, this study proposes a VR-based ASMR 3D animation and examines its potential therapeutic benefits in promoting relaxation, aiding sleep, and alleviating stress. First, we investigate a standardized process for creating VR-based ASMR 3D animation games and its impact on triggering the ASMR tingling sensation in VR environments. Then, we develop a VR 3D environment game featuring four different natural environments, along with one ASMR video as a control group. Finally, a comprehensive experiment is conducted to compare the effects of VR-based ASMR 3D animation, conventional VR relaxation, and traditional ASMR videos viewed on a smartphone. Forty-seven participants aged 18-35 from Bournemouth University were recruited and divided into three experimental groups. Participants' emotional and physiological responses were monitored using both subjective questionnaires and physiological data collection, i.e., heart rate (HR) and electrodermal activity (EDA). Our findings show that VR-based ASMR 3D animation effectively triggers the ASMR tingling experience and offers superior relaxation, sleep assistance, and emotional regulation compared to watching ASMR videos and conventional VR relaxation methods, resulting in a significant reduction in anxiety and stress, as well as increased feelings of calmness and sleepiness. This research highlights the potential of VR-based ASMR 3D animation as a promising tool for relaxation and sleep aid, offering new insights into VR-assisted therapeutic interventions. |
14:40 | Immersion Discrepancies in Educational Serious Games Among Children's Age Groups PRESENTER: Yukun Li ABSTRACT. With the development of virtual reality technology, serious games have become a new type of teaching tool, and exploring the differences in their sense of immersion is of great significance in enhancing user experience and promoting personalized education. In this study, we designed three educational-themed serious games and compared the power spectral densities (PSD) of immersion-related brain waves of children of different ages by using a difference analysis algorithm based on the game test model. The results showed that the PSDs of theta, alpha, and beta waves differed significantly across age groups; in the tutor-guided experiment, only the theta wave differed significantly. The younger group had higher levels of theta- and alpha-wave activity and was more relaxed and creative during the game; the older children had higher levels of beta-wave activity and showed better attention and cognitive performance during the game. This study reveals the influence of age on children's cognitive and emotional participation in educational games from a neurophysiological point of view, and provides a neuroscientific basis for the development of personalized educational tools. |
15:00 | Immersive Video Game Experience through Naturalistic and Emotive Dialogue Agent PRESENTER: Michael Adjeisah ABSTRACT. We present a game companion that dynamically personalizes its responses based on players' emotional shifts in real-time, enhancing user immersion in games. The model integrates sequential data, leveraging the dynamic nature of streams to benefit from player information, spoken lines, and in-game context. We leverage recent progress in LLMs' tool-calling capabilities to extract vital information from memory and recognize potential constraints for accurate reasoning, thereby tackling the complexity of NPC conversation scenarios. |
15:15 | Photorealistic 3D Head Reconstruction via 2D Gaussians PRESENTER: Anil Bas ABSTRACT. Radiance fields have significantly enhanced novel view synthesis and 3D reconstruction techniques. Recently, 3D Gaussian Splatting has proven to be a milestone, offering compact and differentiable volumetric primitives that enable photorealistic rendering with efficient training. However, while 3DGS excels at view synthesis, extracting high-quality surface meshes remains a challenge due to the loose geometric alignment of ellipsoidal Gaussians. This issue is particularly pronounced in high-detail domains like human head reconstruction, where capturing subtle anatomical features and precise surface geometry is essential. Following recent studies, we represent 3D Gaussians as 2D ellipses (disks) to address surface misalignment, which improves reconstruction quality. Furthermore, our approach allows for efficient optimisation through multiple loss functions and uses Poisson reconstruction to extract photorealistic meshes. We apply our method to the NeRSemble dataset, preprocessing data similar to GaussianAvatars. We provide known camera parameters, background masks, and normal maps to support optimisation and accurate surface alignment. Despite using only 16 input images per subject, our method successfully reconstructs high-fidelity, textured meshes, as shown in Figure 1. Unlike previous studies, our method is tailored for mesh extraction rather than view synthesis only, and does not require morphable face models (e.g. FLAME or BFM) or facial landmark detection, tracking, and annotation processes. |
16:00 | Peridynamics-Based Simulation of Viscoelastic Solids and Granular Materials PRESENTER: Jiamin Wang ABSTRACT. Viscoelastic solids and granular materials have been extensively studied in Classical Continuum Mechanics (CCM). However, CCM faces inherent limitations when dealing with discontinuity problems. Peridynamics, as a non-local continuum theory, provides a novel approach for simulating complex material behavior. We propose a unified viscoelasto-plastic simulation framework based on State-Based Peridynamics (SBPD), which derives a time-dependent unified force density expression through the introduction of the Prony model. Within SBPD, we integrate various yield criteria and mapping strategies to support granular flow simulation, and dynamically adjust material stiffness according to local density. Additionally, we construct a multi-material coupling system incorporating viscoelastic materials, granular flows, and rigid bodies, enhancing computational stability while expanding the diversity of simulation scenarios. Experiments show that our method can effectively simulate relaxation, creep, and hysteresis behaviors of viscoelastic solids, as well as flow and accumulation phenomena of granular materials, all of which are very challenging to simulate with earlier methods. Furthermore, our method allows flexible parameter adjustment to meet various simulation requirements. |
16:20 | Automating Visual Narratives: Learning Cinematic Camera Perspectives from 3D Human Interaction PRESENTER: Boyuan Cheng ABSTRACT. Cinematic camera control is a cornerstone of visual storytelling in film, animation, and interactive media, yet remains a labor‑intensive task typically handled by expert artists. While recent deep learning methods automate camera placement and movement from video, they depend heavily on large, annotated video corpora and struggle to generalize to novel character interactions. In this work, we propose a novel framework that learns to predict Toric camera parameters directly from two‑person 3D motion data, bypassing the need for preexisting visual datasets. Our model employs a dual‑stream Transformer to encode each character’s motion, fuses these streams via bidirectional cross‑attention to capture inter‑character dynamics, and incorporates explicit spatial vectors to ground geometric relationships. A lightweight fusion network then regresses per‑frame Toric parameters, yielding smooth, compositionally balanced camera trajectories. To enable training and evaluation, we introduce a new dataset of over 3,400 motion–camera sequences spanning diverse interaction scenarios. Experiments demonstrate that our approach significantly outperforms a strong Example‑Driven Camera baseline and ablated variants in trajectory accuracy, framing quality, and temporal coherence. |
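One plausible, minimal way to realize the bidirectional cross-attention fusion described above is sketched below in PyTorch; the module names, dimensions, and six-parameter output head are assumptions for illustration, not the authors' architecture.

```python
# Hypothetical sketch: fuse two per-character motion encodings with bidirectional
# cross-attention and regress per-frame camera parameters (not the paper's code).
import torch
import torch.nn as nn

class BiCrossAttentionFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.a_to_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.b_to_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(2 * dim, 6)   # assumed 6 Toric-style parameters per frame

    def forward(self, motion_a, motion_b):
        # motion_a, motion_b: (batch, frames, dim) encodings of each character's motion
        attn_a, _ = self.a_to_b(motion_a, motion_b, motion_b)  # character A attends to B
        attn_b, _ = self.b_to_a(motion_b, motion_a, motion_a)  # character B attends to A
        fused = torch.cat([attn_a, attn_b], dim=-1)
        return self.head(fused)             # (batch, frames, 6) camera parameters
```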
16:40 | Intelligent Compilation System for Chinese Character Animation Based on Dynamic Data Sets PRESENTER: Xin Luo ABSTRACT. With the rapid evolution of artificial intelligence and human-computer interaction technology, Chinese character animation, as a form of intersection between visual communication and semantic expression, is being widely used in many scenarios, such as film and television special effects, digital education, synchronous display of speech, and digitisation of cultural heritage. Traditional Chinese character animation relies on static datasets, which suffer from low generation efficiency, limited stylistic variety, and difficulty in adapting to real-time changes. In this paper, we propose an intelligent compilation system for Chinese character animation based on dynamic datasets, construct a character description library that supports dynamic updating of Chinese characters, and design a decoupled compilation-rendering animation generation architecture, which realises fast and dynamic drawing of more than 3,000 Chinese characters and adapts to a variety of styles. The system adopts stroke feature point extraction and stroke order reduction algorithms to generate the corresponding animation in real time from the query characters, and supports curve smoothness and speed adjustment. The experimental results show that the system significantly improves the efficiency of animation generation and offers good consistency and interactivity across platforms, styles, and scenes, which verifies its technical feasibility and application potential in the field of intelligent Chinese character animation generation. |
17:00 | Unsupervised Salient Object Detection with Pseudo-Labels Refinement PRESENTER: Hao Liu ABSTRACT. In Salient Object Detection (SOD), most methods rely on manually annotated labels, which are costly. As a result, unsupervised methods have gained significant attention. Existing methods often generate noisy pseudo-labels using traditional techniques, which can affect model performance. To address this, we propose an unsupervised method for RGB image salient object detection that generates high-quality pseudo-labels without manual annotation and uses them to train the detection model. The method generates initial pseudo-labels and improves their quality by introducing contrastive learning pre-trained weights and a pseudo-label self-updating strategy. Additionally, we design a detection network with a Multi-Feature Aggregation (MFA) module and a Context Feature Interaction (CFI) module to enhance the model’s ability to detect salient objects in complex scenarios. The model we proposed, trained with our pseudo-labels, shows significant improvement on USOD and achieves excellent scores on public benchmarks. |
17:20 | Using Large Language Models for Evaluation of Radiological Textual Reports PRESENTER: Nicolay Rusnachenko ABSTRACT. The accurate assessment of practitioners' knowledge in the field of radiology represents a critical task that heavily relies on the expertise and availability of experts. Textual narratives are a common way of reporting patient conditions. To reduce the need for manual annotation by experts, the development of autonomous systems in radiology is essential. At present, Large Language Models (LLMs) offer a promising framework for developing autonomous, explainable systems. Models of this type have shown promise across various fields of Natural Language Processing (NLP), including the Information Retrieval (IR) domain. In this poster, we propose a methodology for adopting LLMs in the evaluation of radiological textual reports by leveraging their IR capabilities over facts. We use the SN-hcc-tcia dataset of structured MR/CT textual narratives on image acquisitions in liver cancer imaging. Our evaluation results provide insights into the relationship between model size and performance. Practical application of our experimental findings is demonstrated through a web application, in which each case represents a task for scoring a patient's liver condition. |
17:32 | AssetMask: Mask R-CNN-based approach for Asset detection in railroad track health monitoring PRESENTER: Aradhya Saini ABSTRACT. Railroad track safety is of utmost importance, as negligence may lead to loss of life and property. Track assets are inspected for the smooth functioning of the railroad track environment. A Mask R-CNN-based framework is considered useful for the detection of track assets, namely ‘Train’, ‘Track’, ‘Vegetation’, ‘Sign’, and ‘Person’. The monitoring and safety of these assets is useful for implementing track safety in the railroad environment. An asset management framework is developed for this purpose. |
17:44 | LLM-Powered VR Nursing Training for Dynamic Risk Assessment PRESENTER: Ehtzaz Chaudhry ABSTRACT. This project presents a Virtual Reality (VR) training simulation system powered by Large Language Models (LLMs), aimed at improving nurses' skills by enabling them to practice various procedures through realistic patient interactions and to enhance risk-management skills within patient home environments. In this poster, we present our work on developing a VR-based nurse training simulation system integrated with OpenAI. The LLM architecture enables learners to interact with realistic patient scenarios, ask questions, make decisions, and receive context-relevant responses accompanied by appropriate facial expressions. Our system lays the foundation for the future development of LLM-powered Non-Player Characters (NPCs) in VR.