Keynote 1
| 09:00 | The search for quantum advantage in optimisation: myths, maths, and the travelling salesman problem |
Conference Overview & 40th Anniversary
Poster Barking Session
| 11:00 | Exploring Image-Language data to enhance Football Understanding ABSTRACT. To advance the tactical analysis of football (soccer) plays, we introduce GoalActions, a novel dataset focusing on the last five seconds of open-play scoring opportunities. Each clip is paired with descriptive captions and a vector of player actions inspired by sports action descriptive language. Our process involves using natural language processing techniques to extract key actions from commentary and an automated action-spotting pipeline to identify relevant events in the clips. We then present an experimental framework based on an image-text model to evaluate how supervision with these action vectors impacts feature representation. Our findings show that while a specialized model performs better at predicting the final outcome, models trained on specific actions learn features that are more useful for identifying individual tactical elements. |
| 11:02 | Measuring Teacher Empathy in a Virtual Reality Scenario Simulating Racial Bias ABSTRACT. This pilot study explored whether a Virtual Reality (VR) scenario in which participants embody an Indigenous 12-year-old student experiencing bias can evoke empathic responses in educators. Participants engaged with a 15-minute VR narrative. Physiological measures including heart rate (HR), heart rate variability (HRV), and electrodermal activity (EDA) were recorded. Further, self-reported empathy was assessed using the Interpersonal Reactivity Index (IRI). Repeated measures ANOVA indicated a significant overall effect for HR; however, pairwise comparisons did not reveal any significant increase during the VR scenario. No significant changes were observed in HRV, EDA, or IRI subscale scores. These findings suggest that the VR scenario may not robustly induce the anticipated physiological or self-reported empathic responses and point to areas for future research. |
| 11:04 | Enhancing Wide-Angle VR Video Transmission Using Human Perception ABSTRACT. Limited radio transmission bandwidth severely constrains the use of wide-angle video in current technology. This problem arises because current video compression techniques are substantially sub-optimal for large fields of view, so video cannot be transmitted fast enough. While modern video compression leverages spatial redundancy to minimise data from the visual periphery, it often neglects temporal redundancy. Our research focuses on video processing that considers spatial-temporal human visual models, and demonstrates that incorporating temporal knowledge of human visual perception into video transmission could reduce data by up to 60% without noticeable loss in quality. To enable optimal transmission of wide-angle video, we developed a new method for processing spatio-temporal information. Its performance was evaluated through subjective quality tests, in which 20 participants rated reference videos and videos processed with our method and modern foveal compression techniques in a controlled VR environment. We believe our research will open users' eyes to real-time viewing through 180-degree headsets. |
| 11:06 | Genetic Algorithms For Parameter Optimization for Disparity Map Generation of Radiata Pine Branch Images ABSTRACT. Traditional stereo matching algorithms like Semi-Global Block Matching (SGBM) with Weighted Least Squares (WLS) filtering offer speed advantages over neural networks for UAV applications, generating disparity maps in approximately 0.5 seconds per frame. However, these algorithms require meticulous parameter tuning. We propose a Genetic Algorithm (GA) based parameter optimization framework that systematically searches for optimal parameter configurations for SGBM and WLS, enabling UAVs to measure distances to tree branches with enhanced precision while maintaining processing efficiency. Our contributions include: (1) a novel GA-based parameter optimization framework that eliminates manual tuning; (2) a comprehensive evaluation methodology using multiple image quality metrics; and (3) a practical solution for resource-constrained UAV systems. Experimental results demonstrate that our GA-optimized approach reduces Mean Squared Error by 42.86% while increasing Peak Signal-to-Noise Ratio and Structural Similarity by 8.47% and 28.52%, respectively, compared with baseline configurations. Furthermore, our approach demonstrates superior generalization performance across varied imaging conditions, which is critical for real-world forestry applications. |
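The GA-based search described in this abstract is concrete enough to sketch, though the paper does not spell out its encoding, fitness metric, or parameter ranges. The Python sketch below is therefore only a plausible reconstruction: a small genetic algorithm over SGBM/WLS configurations, scored against a hypothetical ground-truth disparity map. The ranges, the negative-MSE fitness, and the GA operators are all illustrative assumptions.

```python
# Illustrative GA over SGBM/WLS parameters. Ranges, fitness, and operators are
# assumptions; a ground-truth disparity map (gt) is presumed available.
import random
import cv2
import numpy as np

PARAM_RANGES = {
    "numDisparities": [64, 128, 192, 256],      # must be divisible by 16
    "blockSize": [3, 5, 7, 9, 11],              # must be odd
    "uniquenessRatio": list(range(5, 21)),
    "wls_lambda": [4000, 8000, 16000, 32000],
    "wls_sigma": [0.8, 1.0, 1.2, 1.5, 2.0],
}

def filtered_disparity(left, right, p):
    matcher = cv2.StereoSGBM_create(
        minDisparity=0, numDisparities=p["numDisparities"],
        blockSize=p["blockSize"], uniquenessRatio=p["uniquenessRatio"])
    raw = matcher.compute(left, right)                    # fixed-point, scaled by 16
    wls = cv2.ximgproc.createDisparityWLSFilter(matcher)  # needs opencv-contrib-python
    wls.setLambda(p["wls_lambda"])
    wls.setSigmaColor(p["wls_sigma"])
    return wls.filter(raw, left).astype(np.float32) / 16.0

def fitness(left, right, gt, p):
    return -float(np.mean((filtered_disparity(left, right, p) - gt) ** 2))

def evolve(left, right, gt, pop_size=20, gens=15, mut_rate=0.2):
    keys = list(PARAM_RANGES)
    pop = [{k: random.choice(PARAM_RANGES[k]) for k in keys} for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=lambda p: fitness(left, right, gt, p), reverse=True)
        elite = pop[: pop_size // 2]                      # truncation selection
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = random.sample(elite, 2)
            child = {k: random.choice([a[k], b[k]]) for k in keys}  # uniform crossover
            if random.random() < mut_rate:                # point mutation
                k = random.choice(keys)
                child[k] = random.choice(PARAM_RANGES[k])
            children.append(child)
        pop = elite + children
    return max(pop, key=lambda p: fitness(left, right, gt, p))
```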
| 11:08 | ChatPPG: Computational Analysis and Statistics of Table Tennis Games ABSTRACT. This paper presents ChatPPG, an innovative system that combines large language models (LLMs) fine-tuned with Low-Rank Adaptation (LoRA) and computer vision for real-time data analysis and coaching in table tennis games. By integrating multi-camera 3D reconstruction, object detection and object tracking, ChatPPG processes match data such as player speed, ball trajectories, and service legality, transforming raw metrics into actionable insights. The fine-tuned model achieved a Q/A accuracy of 92.3%, surpassing the baseline model's 83.7%, with sub-second response times enabled by 8-bit quantization. Practical applications demonstrated its ability to deliver personalized training plans and tactical recommendations tailored to individual player profiles. User feedback from professional coaches and athletes rated tactical suggestions at 9.3/10 and training recommendations at 8.9/10. Integrating structured CV outputs with LLM capabilities enhanced transparency and interpretability, allowing users to trace recommendations to data-driven decisions. Despite dataset limitations and the need for advanced query handling, ChatPPG bridges the gap between data analysis and decision-making, setting a new standard for integrating LLMs and CV technologies in fast-paced sports analytics. |
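The LoRA-plus-8-bit-quantization recipe mentioned in this abstract has a standard shape in the Hugging Face stack (transformers, peft, bitsandbytes); a minimal sketch follows. The base model name, rank, and target modules are assumptions for illustration, not ChatPPG's reported configuration.

```python
# Sketch of LoRA fine-tuning on an 8-bit-quantized base model.
# Base model, rank, and target modules below are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"                        # hypothetical base model
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights
    device_map="auto",
)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],                 # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # only the low-rank adapters are trainable
```

The 8-bit weights cut memory and, together with the small adapter matrices, help explain the sub-second response times the abstract reports.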
| 11:10 | Performance Evaluation of Deep Learning Architectures for Tree Branch Segmentation in Autonomous Forestry Systems ABSTRACT. UAV-based autonomous forestry operations require rapid and precise tree branch segmentation for safe navigation and automated pruning across varying pixel resolutions and operational conditions. We evaluate different deep learning methods at three resolutions (256×256, 512×512, 1024×1024) using the Urban Street Tree Dataset, employing standard metrics (IoU, Dice) and specialized measures including Thin Structure IoU (TS-IoU) and Connectivity Preservation Rate (CPR). Among 22 configurations tested, U-Net with MiT-B4 backbone achieves strong performance at 256×256. At 512×512, MiT-B4 leads in IoU, Dice, TS-IoU, and Boundary-F1. At 1024×1024, U-Net+MiT-B3 shows the best validation performance for IoU/Dice and precision, while U-Net++ excels in boundary quality. PSPNet provides the most efficient option (2.36/9.43/37.74 GFLOPs) with 25.7/19.6/11.8 percentage point IoU reductions compared to top performers at respective resolutions. These results establish multi-resolution benchmarks for accuracy–efficiency trade-offs in embedded forestry systems. Implementation is available at https://github.com/BennyLinntu/Performance_Tree_Branch_Segmentation. |
| 11:12 | Improving Multi-organ Segmentation in Abdomen CT images incorporating Shape priors and Spatial information in Deep Learning ABSTRACT. Accurate and reliable segmentation of multiple abdominal organs in CT scans is fundamental for computer-assisted diagnosis, treatment planning, and longitudinal follow-up. Yet state-of-the-art deep networks either struggle with small or elongated structures or demand prohibitive computational resources. We introduce a lightweight distillation framework that couples Pseudo-3D inputs with Shape-Intensity Knowledge Distillation (SIKD). During training, a high-capacity teacher ingests three adjacent slices together with an auxiliary Z-spacing channel, enabling it to capture inter-slice context and anatomical priors. The distilled student receives only single-slice images at inference time but is guided by feature-, prediction-, and shape-aware losses to internalise this 3D information at minimal computational cost. Evaluated on the AMOS benchmark, our method increases the mean Dice similarity coefficient by 5% and increases the average Dice of six small organs by 3 pp compared to a 2D UNet baseline, while cutting GPU memory by 47% and latency by 38%. These results demonstrate a practical route for deploying accurate multi-organ segmentation in real-world clinical workflows. |
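The feature-, prediction-, and shape-aware losses this abstract names are not specified in detail, so the sketch below shows only the generic teacher-to-student pattern they build on: a supervised term plus prediction-level KL distillation plus feature matching. The shape-aware SIKD term is not reproduced, and the temperature and weights are placeholders.

```python
# Generic teacher→student distillation loss; the paper's shape-aware SIKD term
# is not public in the abstract and is therefore omitted. T, w_kd, w_feat are
# placeholder values.
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, student_feat, teacher_feat,
                 target, T=2.0, w_kd=0.5, w_feat=0.1):
    ce = F.cross_entropy(student_logits, target)              # supervised term
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * (T * T)            # prediction level
    feat = F.mse_loss(student_feat, teacher_feat)             # feature level
    return ce + w_kd * kd + w_feat * feat
```

For segmentation, the logits are (B, C, H, W) maps and the target is a (B, H, W) label map; the same function applies unchanged.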
| 11:14 | Enhancing Domain Generalisability for Lung Nodule Detection: A Hybrid Strategy with Multi-Source Training and MixStyle ABSTRACT. Domain shift presents a significant obstacle to deploying deep learning models for lung nodule detection in chest X-rays (CXRs) across varied clinical environments. Such environments differ in imaging protocols, scanner technologies, and patient demographics, each introducing inconsistencies that can degrade model performance. This study presents a hybrid approach aimed at enhancing the domain generalisability of YOLOv5s, a lightweight object detection framework, by combining multi-source training with MixStyle, a feature-level style perturbation method. Using the NODE21 challenge dataset and a subset of the NIH ChestX-ray8 dataset (Kaggle), the approach measures the extent of domain shift, applies multi-source training, and determines the optimal placement of MixStyle within the YOLOv5s architecture through an ablation study. Relative to a multi-source baseline, the proposed model achieves significant performance improvements: an 8.6% gain in mAP@.5 and 14.0% in mAP@.5:.95 on NODE21, and a 2.4% gain in mAP@.5 and 3.8% gain in mAP@.5:.95 on Kaggle. Qualitative results also indicate improved detection of difficult nodules, including very small or partially obscured lesions. The proposed strategy is computationally efficient, improves robustness, and offers a scalable pathway to reliable early lung cancer diagnosis across diverse clinical settings, thereby strengthening the real-world applicability of AI-enabled medical imaging. |
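MixStyle itself is published (Zhou et al., 2021), so its core operation can be shown faithfully: per-instance feature statistics are mixed across the batch to synthesize new feature-level styles. Where to insert the module inside YOLOv5s is precisely this paper's ablation question and is not fixed in the sketch.

```python
# Minimal MixStyle module (Zhou et al., 2021): mix per-instance channel
# statistics across the batch to perturb feature styles during training.
import torch
import torch.nn as nn

class MixStyle(nn.Module):
    def __init__(self, p=0.5, alpha=0.1, eps=1e-6):
        super().__init__()
        self.p, self.eps = p, eps
        self.beta = torch.distributions.Beta(alpha, alpha)

    def forward(self, x):                            # x: (B, C, H, W)
        if not self.training or torch.rand(1).item() > self.p:
            return x
        B = x.size(0)
        mu = x.mean(dim=[2, 3], keepdim=True)
        sig = (x.var(dim=[2, 3], keepdim=True) + self.eps).sqrt()
        x_norm = (x - mu) / sig                      # instance-normalize
        lam = self.beta.sample((B, 1, 1, 1)).to(x.device)
        perm = torch.randperm(B, device=x.device)    # partner instances
        mu_mix = lam * mu + (1 - lam) * mu[perm]
        sig_mix = lam * sig + (1 - lam) * sig[perm]
        return x_norm * sig_mix + mu_mix             # re-style with mixed stats
```

Because only first- and second-order statistics are perturbed, the module adds no parameters and is active only at training time, which matches the abstract's emphasis on computational efficiency.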
| 11:16 | From 2D X-rays to a 3D surgical plan: progress with AI reconstruction ABSTRACT. Surgeons and patients achieve better outcomes when planning total hip arthroplasty surgery in 3D than in 2D. However, planning in 3D currently requires Computed Tomography (CT) imagery which introduces disadvantages in terms of cost, accessibility and radiation exposure. Reconstructing 3D models of a patient's anatomy from 2D X-ray images would achieve the best of both worlds. This paper demonstrates the training of an AI model for 3D reconstruction from X-rays and its intended application to 3D surgical planning. The training process makes use of Digitally Reconstructed Radiographs (DRRs) - synthetic images that are close approximations to X-rays, with the advantages of more flexible augmentation possibilities and perfect alignment to CT (and the associated 3D ground truth). Reconstruction results are shown and compared to ground truth. The resulting meshes can be seamlessly integrated into existing 3D surgical planning software, and are demonstrated to give only small differences when compared to plans generated using 3D segmentation of a CT scan. |
| 11:18 | Camera Pose Estimation in Multi-object Scenes Using Ray Diffusion and Point Cloud Alignment ABSTRACT. Accurate camera pose estimation from image sets is a fundamental problem in various applications, including 3D reconstruction, view synthesis, and robotic vision. Recently, Ray Diffusion has demonstrated its effectiveness in estimating relative camera poses for sparsely sampled, single-object-centric image sets. However, its performance decreases if a scene contains multiple objects. In this study, we propose a novel method that extends the Ray Diffusion framework by integrating monocular depth estimation and Iterative Closest Point (ICP) alignment, enabling robust pose estimation for densely sampled, multi-object image sets. The proposed approach first detects object bounding boxes and extracts object masks. According to the number of objects in the scene, the image sets are divided into several sub-scenes. For a single-object sub-scene, we directly apply Ray Diffusion. For a multi-object sub-scene, we reconstruct 3D point clouds of the objects by employing a monocular depth network and perform ICP alignment between two consecutive frames to recover relative camera poses. To solve the similarity transformation up to scale between the sub-scenes, the projection error of the 3D points in the multi-object sub-scene to the neighboring single-object sub-scenes is minimized. Experimental results show that the proposed method consistently outperforms Ray Diffusion in various multi-object scenarios. |
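A minimal sketch of the multi-object branch described above: back-project a monocular depth map into a point cloud and recover relative camera motion between consecutive frames with point-to-point ICP (Open3D). The intrinsics and correspondence threshold are placeholder values, and the sub-scene partitioning and scale-alignment steps from the abstract are omitted.

```python
# Depth map -> point cloud -> ICP relative pose between consecutive frames.
# Intrinsics (fx, fy, cx, cy) and the ICP threshold are illustrative values.
import numpy as np
import open3d as o3d

def backproject(depth, fx, fy, cx, cy):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    pts = np.stack([(u - cx) * z / fx, (v - cy) * z / fy, z], axis=-1)
    pc = o3d.geometry.PointCloud()
    pc.points = o3d.utility.Vector3dVector(pts.reshape(-1, 3))
    return pc

def relative_pose(depth_a, depth_b, fx, fy, cx, cy, thresh=0.05):
    src = backproject(depth_a, fx, fy, cx, cy)
    dst = backproject(depth_b, fx, fy, cx, cy)
    reg = o3d.pipelines.registration.registration_icp(
        src, dst, thresh, np.eye(4),
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return reg.transformation     # 4x4 camera motion between the two frames
```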
| 11:20 | Targetless Extrinsic Calibration of Fisheye Cameras Using Vehicle Detection and Monodepth Alignment in Cylindrical Image Space ABSTRACT. We propose a novel targetless method for estimating the extrinsic parameters of multiple fisheye cameras mounted outside of a vehicle. Without using any artificial calibration target, we use image features in the detected road vehicles and depth data of the surrounding scenes. To solve the severe image distortion problem, fisheye images are first transformed into cylindrical image space. This enables the direct use of existing object detection and monocular depth estimation networks. After the image transformation, feature points in the detected vehicles are extracted and matched across multiple camera views. At the same time, the estimated depth from the monodepth network is used to convert the cylindrical images into 3D point clouds. To solve the inherent scale ambiguity in monocular depth, the depth information of the surrounding scene is rescaled using the assumption that the xz-plane of the front camera is parallel to the ground. Then, we simultaneously optimize both the global scale and the per-point depth (expressed as a radial distance to an object point) to align the point clouds and estimate the extrinsic parameters. Video sequences from the fisheye cameras are used to collect enough image features and depth for optimization. We validate our method through both real-world and simulation-based experiments. The evaluation results show that the proposed method outperforms COLMAP and DUSt3R even under conditions of wide baseline and low frame rate. These results indicate that the proposed method is effective in practical scenarios and can be applied without the use of calibration targets. |
| 11:22 | Beyond Deep Learning: Agentic AI Framework for Object Detection ABSTRACT. Object detection remains a fundamental yet challenging problem in machine vision. Over the past decade, numerous state-of-the-art solutions have been developed, predominantly based on deep learning. While effective, these models typically require large-scale annotated datasets and substantial computational resources, limiting their scalability and adaptability. To address these constraints, zero-shot and few-shot learning approaches have been introduced. However, they often struggle with generalization and task-specific performance. Agentic AI has recently emerged as a promising paradigm, enabling autonomous task execution by leveraging powerful vision-language models without the need for task-specific training. In this paper, we propose an agentic AI framework for object detection and investigate its feasibility in the context of assistive robotics. Our experimental results demonstrate the framework’s potential for real-world deployment, highlighting its ability to perform zero-shot detection and reasoning in indoor environments. |
| 11:24 | Player Perceptions of Path-First Procedural Content Generation Level Design for 3D Platformer Games ABSTRACT. Video game development faces increasing challenges from higher player expectations and expanding budgets. Procedural Content Generation (PCG) addresses these issues by reducing developer resource requirements. For platformer games, PCG methods often require time-consuming post-generation validation. Path-first PCG, designed for 2D platformers, first creates a playable path and then constructs the level around it. Existing path-first PCG techniques solve resource and playability concerns in 2D platformers, but no guarantee exists that these levels are perceived the same as manually designed levels, and few have attempted to apply path-first to 3D platformers. This paper prototypes a game to assess the effectiveness of a rhythm-based path-first PCG level design for 3D platformers. Empirical evidence from a controlled user study (n=18) reveals minimal differences in perceptions between path-first PCG and manually designed levels, showcasing promise for the automated PCG technique. |
| 11:26 | Keypoint Estimation for Real-Time Pinus Radiata Cutpoint Detection ABSTRACT. In this paper, we propose a real-time capable method for detecting and tracking Pinus Radiata branch cutpoints in a forestry environment using a UAV platform. Our method applies the YOLOv11 pose estimation network to detect 2D cutpoint positions. Depth estimates around each cutpoint are improved through DBSCAN clustering, frame differencing, and filtering to remove fine-structure noise. A paired Kanade-Lucas-Tomasi tracker was used for frame-to-frame tracking and re-acquisition of cutpoints after occlusion. The method achieved a Mean Average Precision of 0.88 on video sequences captured in a Pinus Radiata plantation, with a depth error of less than 50 mm at a distance of up to 1200 mm. An average pipeline processing time of 5.4ms was achieved on an Nvidia Jetson Orin AGX 32GB. These results and frame processing times indicate that the proposed method would be suitable for enabling autonomous, real-time pruning of Pinus Radiata. |
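The KLT step in this pipeline maps directly onto OpenCV's pyramidal Lucas-Kanade tracker; a hedged sketch follows. The window size and termination criteria are generic defaults, and the paper's occlusion re-acquisition logic is not reproduced.

```python
# Frame-to-frame KLT tracking of detected cutpoints with pyramidal
# Lucas-Kanade; parameters are generic defaults, not the paper's settings.
import cv2
import numpy as np

LK_PARAMS = dict(winSize=(21, 21), maxLevel=3,
                 criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT,
                           30, 0.01))

def track(prev_gray, gray, points):
    """points: (N, 1, 2) float32 cutpoint locations in the previous frame."""
    nxt, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, gray, points,
                                                 None, **LK_PARAMS)
    good = status.ravel() == 1
    return nxt[good], good        # tracked positions and survival mask
```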
| 11:28 | US-Loss: Integrating Uncertainty Estimation in the Loss Function of Image Segmentation ABSTRACT. While deep learning models achieve excellent results in image segmentation, their ability to express uncertainty in predictions remains a key limitation. This is especially crucial in safety-critical domains like medical imaging and autonomous driving where prediction errors can lead to life-threatening situations. In this research, we propose an Uncertainty-aware Segmentation Loss (US-Loss) function that jointly optimizes for higher segmentation accuracy and reduced epistemic uncertainty. Unlike existing methods that require multiple forward passes or complex architectural modifications, US-Loss provides uncertainty estimates through a single forward pass. Moreover, this loss function can easily be integrated into any segmentation architecture. We evaluated our approach against two popular uncertainty measurement approaches, Deep Ensemble and Monte Carlo Dropout, on two challenging datasets: skin lesion images (ISIC-2018) and breast ultrasound images (BUS). Both quantitative and qualitative results show that our proposed approach performs best across both datasets. |
| 11:30 | (Online) Joint Super-Resolution and Segmentation for Low-Resolution Brain MRI Analysis ABSTRACT. Low-resolution brain MRI acquisitions compromise anatomical segmentation accuracy, limiting clinical utility in time-constrained imaging protocols. We present a unified framework integrating generative super-resolution with volumetric segmentation to enable precise tissue delineation from degraded inputs. Our architecture employs advanced generative modeling to reconstruct perceptually faithful high-resolution MRI while preserving anatomical fidelity. Segmentation labels are appropriately scaled through parallel processing, and a deep neural network with a robust architectural backbone generates corresponding high-resolution segmentation maps. The framework coordinates both components through joint optimization: the super-resolution module prioritizes features critical for tissue boundary detection, while the segmentation network adapts to characteristics of synthetically enhanced images. |
| 11:32 | (Online) Empathic Risk Companion: Multimodal Vision-Language Fusion with Emotion Prediction Error for Decision Support ABSTRACT. We present the Empathic Risk Companion (ERC), a multimodal vision–language system that fuses facial, speech, and text affect estimates via a reliability-aware late-to-hybrid strategy, and operationalises emotion prediction error (EPE) as a first-class signal for adaptive decision support. ERC couples a dedicated emotion detection module with a domain risk module and a vision-language policy to generate transparent, just-in-time guidance. We evaluate ERC in a controlled trading task with four between-subjects conditions (dashboard baseline, vision-only, ERC-noEPE, ERC-Full). Although there is no real financial loss, participants reported above-midpoint perceived stress in early rounds, indicating non-trivial affective engagement. ERC achieved high usability (SUS = 84.37) and reduced absolute EPE over six rounds, indicating improved affect calibration and decision consistency. Exploratory ablations benchmark ERC against (i) a dashboard baseline, (ii) a unimodal vision-only pipeline, and (iii) ERC without EPE (same interface). ERC-Full consistently yielded the largest drop in |EPE| and higher decision consistency, suggesting that EPE contributes beyond interface and unimodal effects. |
| 11:34 | (Online) An Improved ORB-SLAM2 Algorithm Based on Extended Kalman Filtering and Particle Swarm Optimization ABSTRACT. To address the challenges of pose estimation jitter and cumulative errors in visual SLAM systems operating in complex dynamic environments, this paper proposes an enhanced ORB-SLAM2 algorithm that integrates an Extended Kalman Filter (EKF)-based frontend pose smoother with a Particle Swarm Optimization (PSO)-driven backend optimizer. At the frontend, the EKF pose smoother fuses Inertial Measurement Unit (IMU) data with visual odometry to correct camera trajectories in real-time, effectively suppressing short-term pose drift. At the backend, the PSO algorithm dynamically optimizes node constraints in the pose graph and adaptively adjusts loop-closure thresholds, refining the weights of reprojection errors to enhance consistency between mapping outputs and the original environment. Experimental results demonstrate that compared to the baseline ORB-SLAM2 and PV-LIO algorithms, the proposed method significantly improves efficiency across three map scales—reducing mapping time and scanning duration while maintaining mapping quality—and achieves notable error suppression. |
| 11:36 | (Online) Hybrid SIFT-SNN for Efficient Anomaly Detection of Traffic Flow-Control Infrastructure ABSTRACT. This paper presents the SIFT-SNN framework, based on a low-latency neuromorphic signal processing pipeline for real-time detection of structural anomalies in transport infrastructure. The proposed approach integrates Scale-Invariant Feature Transform (SIFT) for spatial feature encoding with a latency-driven spike conversion layer and a Leaky Integrate-and-Fire (LIF) Spiking Neural Network (SNN) for classification. The Auckland Harbour Bridge dataset was recorded under various weather and lighting conditions, comprising 6,000 labelled frames that include both real and synthetically augmented unsafe cases. The presented system achieves a classification accuracy of 92.3% with a per-frame inference time of 9.5 ms. The achieved sub-10-millisecond latency, combined with sparse spike activity (8.1%), enables real-time, low-power edge deployment. Unlike conventional CNN-based approaches, the hybrid SIFT-SNN pipeline explicitly preserves spatial feature grounding, supports transparent decision-making, and operates efficiently on embedded hardware. The SIFT-SNN framework is validated through a working prototype deployed on a consumer-grade system and framed as a generalisable case study in structural safety monitoring for movable concrete barriers, which, as traffic flow-control infrastructure, are deployed in over 20 cities worldwide. |
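The LIF dynamics underlying this SNN classifier are standard and can be sketched in a few lines. The time constant, threshold, and reset value below are arbitrary example settings, not those of the SIFT-SNN pipeline.

```python
# Minimal discrete-time leaky integrate-and-fire layer; parameters are
# arbitrary example values, not the paper's configuration.
import numpy as np

def lif_run(currents, tau=10.0, v_th=1.0, v_reset=0.0):
    """currents: (T, N) input current per timestep; returns (T, N) spikes."""
    T, N = currents.shape
    v = np.zeros(N)
    spikes = np.zeros((T, N), dtype=np.uint8)
    for t in range(T):
        v += (currents[t] - v) / tau     # leaky integration toward the input
        fired = v >= v_th
        spikes[t] = fired
        v[fired] = v_reset               # reset membrane after a spike
    return spikes
```

The sparse spike activity the abstract reports (8.1%) corresponds to most entries of such a spike train being zero, which is what enables low-power event-driven inference.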
| 11:38 | (Online) Text-to-Floorplan Synthesis via Graph-Conditioned Diffusion Processes ABSTRACT. This paper introduces a novel AI-based framework for automated 2D architectural floor plan generation from natural language descriptions, employing graph-conditioned denoising diffusion probabilistic models (DDPM) integrated with a U-Net architecture. The proposed system transforms user-provided textual specifications of room types, counts, and spatial relationships into structured room-adjacency graphs. These semantic graphs are encoded into compact embeddings and used to condition the diffusion model during the generative process. The framework produces segmented floor plans with clearly labelled architectural components, which are further refined through post-processing into editable 2D layouts. We evaluate our method on the RPLAN dataset and incorporate qualitative assessments from domain experts. Results demonstrate superior performance in generating diverse, semantically valid, and spatially coherent floor plans compared to existing GAN- and transformer-based approaches. This work highlights the effectiveness of combining semantic graph representations with diffusion-based generative models in advancing AI-driven architectural design. |
| 11:40 | (Online) RobotFlags: AI-Powered Semaphore Interactions Between Chatbot and Humanoid Robot ABSTRACT. This paper introduces RobotFlags, an intelligent system that integrates flag language with deep learning, large language models, and humanoid robots to support learning and interactive communication. The core of the system is the improved YOLO-AKEMA model, which incorporates attention mechanisms and adaptive convolutions to achieve high-accuracy recognition across 27 flag categories, forming a reliable foundation for gesture analysis. The user interface is implemented on the Dify AI platform, with a retrieval-augmented generation (RAG) framework constructed from curated semaphore documents and the BGE-M3 embedding model, enabling context-aware responses. A humanoid robot is seamlessly integrated as both a demonstrator and evaluator: it performs flag gestures, assesses learners’ performance, and provides detailed feedback. To ensure real-time interaction, optimization strategies such as half-precision computation, streaming inference, and caching are employed, maintaining average response times under three seconds. Altogether, RobotFlags delivers a robust, multimodal learning environment that advances flag language education and creates new opportunities for gesture-based human–robot interaction. |
| 11:42 | (Online) Rotation-Invariant Game State Evaluation via Canonical Board Tensors ABSTRACT. Evaluating game states benefits from exploiting rotational symmetries in board configurations. We introduce a parameter-free canonicalisation layer that maps any board tensor to a unique representative under the rotation group, enforcing rotation invariance without data augmentation or group-equivariant architectures. We formalise the induced equivalence relation over game states and define a total order to select canonical representatives. Inserted between the input and a standard MLP, the layer reduces the effective state space while leaving the learnable parameter count unchanged. On $m$–$n$–$k$ games, the resulting model consistently achieves lower training and test MSE than an otherwise identical baseline that consumes raw board tensors, and it converges faster across architectures and minimum-depth filters. These results indicate that explicit symmetry reduction via canonicalisation is a simple, general mechanism for improving game state evaluation. |
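The canonicalisation idea above is concrete enough to sketch: enumerate the rotations of the board tensor and pick a unique representative under a total order. The flattened-tuple lexicographic order below is one valid choice; the paper's exact order may differ.

```python
# Parameter-free canonicalisation under the board's rotation group: generate
# the valid rotations and keep the lexicographically smallest flattening.
import numpy as np

def canonical(board: np.ndarray) -> np.ndarray:
    """board: (..., m, n) tensor; returns its canonical rotation."""
    square = board.shape[-1] == board.shape[-2]
    turns = range(4) if square else (0, 2)   # rectangular boards: 180° only
    rots = [np.rot90(board, r, axes=(-2, -1)) for r in turns]
    return min(rots, key=lambda b: tuple(b.ravel()))
```

Because every rotation of a state maps to the same representative, the layer shrinks the effective input space seen by the downstream MLP without adding any learnable parameters.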
| 11:44 | (Online) Visual Question Answering using Multimodal Data Augmentation for Hausa ABSTRACT. This paper presents a classification-based Visual Question Answering (VQA) framework that integrates Large Language Models (LLMs) and vision transformers for Hausa, a low-resource African language. By fine-tuning LLMs on monolingual Hausa text and fusing their embeddings with state-of-the-art vision encoders, our system predicts answers from a fixed vocabulary. Evaluations on no-augmentation, inline, and offline text-image enhancement strategies conducted on the HaVQA dataset demonstrate that the offline augmentation strategy yields the best performance, achieving 35.85% accuracy, 35.89% WUPS, and 15.32% F1-score. These results surpass the baseline by more than 5%. Our findings underscore the importance of language-specific pretraining and comprehensive data enrichment for robust classification-based VQA in under-resourced settings. |
| 11:46 | (Online) Comparative Study of DINOv2, I-JEPA, and ViT Embeddings for Unsupervised Anomaly Detection ABSTRACT. This paper presents a unified, unsupervised framework for Visual Anomaly Detection (VAD) in dynamic, fixed-view scenes, leveraging modern Vision Transformers (ViTs) without additional fine-tuning. We investigate whether embeddings from backbones like a generic ViT, DINOv2, and I-JEPA, combined with a simple clustering approach, are sufficient for identifying anomalies. Our methodology extracts both global (CLS token) and local (patch-level) embeddings, applies clustering (k-Means, HDBSCAN) to model the distribution of normal scenes, and uses a scalable vector database (ChromaDB) for efficient similarity search. The three backbones delivered comparable performance, with the generic ViT showing a small but consistent advantage in balanced accuracy. Although local embeddings provided a 3% gain in balanced accuracy, their sequential processing time is considerably higher. This limitation could be mitigated through parallelization, potentially bringing their efficiency closer to that of global embeddings. In contrast, global embeddings deliver comparable performance while being approximately 256 times faster, enabling near real-time batch processing of 15 s video segments. The k-Means clustering and the proposed retrieval strategy proved most effective, achieving a practical operational trade-off with a balanced accuracy of 82%. These results suggest that ViT embeddings are effective for separating normal and anomalous patterns. Local embeddings appear to capture complementary and pertinent information, and we expect that with modest adjustments in the detection strategy, they could be further leveraged to improve anomaly detection performance. |
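A rough sketch of the clustering-based scoring this abstract describes: fit k-Means on embeddings of normal frames, then score new frames by distance to the nearest centroid. The number of clusters and the decision threshold are validation choices, and the ChromaDB retrieval path is omitted.

```python
# Fit k-Means on embeddings of normal frames; score new frames by distance to
# the nearest centroid. k and the threshold tau are validation choices.
import numpy as np
from sklearn.cluster import KMeans

def fit_normal_model(normal_embeddings: np.ndarray, k: int = 8) -> KMeans:
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(normal_embeddings)

def anomaly_scores(model: KMeans, embeddings: np.ndarray) -> np.ndarray:
    # transform() returns distances to every centroid; keep the nearest one
    return model.transform(embeddings).min(axis=1)

# usage (names hypothetical): flag segments above a calibrated threshold
# km = fit_normal_model(train_cls_embeddings)
# is_anomalous = anomaly_scores(km, test_cls_embeddings) > tau
```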
| 11:48 | (Online) M-AIDE: Mechanistic Agentic Interpretability for Decoding Empathy in Language Models ABSTRACT. Large language models (LLMs) have transformed conversational agents, powering applications from everyday assistants to domain-specific systems. Yet, their internal mechanisms remain opaque, limiting our understanding of how complex behaviours are represented. Therapeutic conversational agents provide a compelling setting to study this problem, as they require models to encode empathic behaviours. For a better understanding of these behaviours, we present M-AIDE, an agentic framework designed to systematically interpret empathy-related features in LLMs. We apply this technique to therapeutic dialogue data, specifically to understand how LLMs may encode perceived empathy. Our approach leverages mechanistic interpretability to uncover artificial empathy features aligned with psychological categories of empathy. M-AIDE integrates automated interpretability into its pipeline, enabling large-scale classification and explanation of discovered features without exhaustive manual inspection. Our experiments reveal a gradient of representation: low-level features predominate at early layers, while distinct empathy features emerge as layers become deeper. We will release the code upon acceptance. |
| 11:50 | (Online) Leveraging Pupil Facial Fusion for Enhanced Micro-Expression Recognition ABSTRACT. Micro-expressions—fleeting, involuntary facial movements that reveal genuine emotions—are difficult to recognize because they span only a few frames and occur in highly class-imbalanced data. Prior work concentrates on facial appearance while overlooking pupillary dynamics that provide complementary arousal cues. We propose the Enhanced Merged Divided Space–Time Transformer (EM-DSTA), a lightweight yet accurate framework that unifies facial, peri-ocular, and pupil information within a single early–mid fusion backbone. A two-stage detector localizes the iris and pupil in high-resolution frames, encoding the pupil-to-eye area ratio as a scalar arousal token. Full-face and eye crops are channel-concatenated, and the scalar token is injected via a Pupil Gate positioned between spatial and temporal attention blocks. Variable-length videos are handled by a sliding-window schedule with attention masks, while depth-wise convolutions enhance locality without incurring quadratic cost. On CAS(ME)3, EM-DSTA attains 62.50% top-1 accuracy, 51.52% UAR, and UF1 of 51.83%. Ablation studies confirm the dominant role of the pupil signal and show that early fusion outperforms late or dual-stream alternatives. |
| 11:52 | (Online) Machine Learning Models for Predicting Post-Wildfire Methane Emissions in Australia Using Multivariate Data ABSTRACT. Wildfires are significant geophysical disasters that destroy ecosystems and accelerate climate change by emitting greenhouse gases like methane (CH4) and carbon dioxide (CO2). CH4, in particular, is a potent greenhouse gas that has a significant impact on global warming and atmospheric composition. Wildfires emit CH4 during biomass combustion, making proper quantification of these emissions an essential research objective. This work aims to estimate CH4 emissions from wildfires in Australia using satellite-based remote sensing data, specifically fire radiative power (FRP) measurements from MODIS (NASA EOS-Terra and Aqua), as incorporated in the CAMS Global Fire Assimilation System (GFAS). Our predictive system combines advanced model optimisation approaches with SHAP (Shapley Additive Explanations) values to uncover significant features affecting CH4 prediction accuracy. Using this framework, we analyse the performance of various machine learning models and recommend the most effective ways to anticipate wildfire-related CH4 emissions in Australia. |
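The abstract names SHAP but not the model family, so the sketch below uses a gradient-boosted tree regressor with TreeExplainer on a synthetic stand-in feature matrix, purely to show the analysis pattern; the real work would substitute the FRP-derived features.

```python
# SHAP feature-attribution pattern on a tree regressor; the data here is a
# synthetic stand-in for the real FRP / meteorological feature matrix.
import numpy as np
import shap
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                                 # toy features
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=500) # toy CH4 target

model = xgb.XGBRegressor(n_estimators=200).fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)   # global ranking of features by impact
```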
| 11:54 | (Online) SAIMNet: Object Detection Based on Semantic Alignment of Infrared Image and Microwave Non-image Information Fusion ABSTRACT. Due to the inability of unimodal data to provide comprehensive information and its poor environmental adaptability, which fails to meet detection accuracy requirements, this paper proposes an object detection algorithm that integrates infrared images and microwave non-image information based on semantic alignment, leveraging their complementary characteristics. We improve radar image quality by converting radar point cloud data into radar RD (Range-Doppler) images and introducing hierarchical wavelet threshold denoising techniques. Subsequently, to address the noise issue in infrared images, we design an adaptive median filtering algorithm to effectively remove salt-and-pepper noise while preserving edge details. Furthermore, we propose the Adaptive Convolutional Context Attention (ACCA) module, which dynamically adjusts the size of the convolutional kernel, enhancing the model's ability to perceive features across different image regions. To further improve feature representation capability, we design the Adaptive Feature Enhancement (AFE) module, which strengthens the model’s ability to capture and integrate semantic information across channels. Experimental results demonstrate that SAIMNet improves object detection accuracy. |
| 11:56 | (Online) Deep Spectral Analytics based Soil Nutrient Prediction using Spatial-Semantic Feature Embedding with Prototype-Guided Perturbation ABSTRACT. Soil nutrient prediction from hyperspectral imagery is not an easy task, owing to the high-dimensional data, spatial redundancy, and non-adaptability of feature-learning techniques. In most cases, conventional deep learning methods are inefficient when it comes to capturing spatial-spectral semantics and are vulnerable to noise interference or irrelevant band information. A hybrid deep learning approach integrating spatial-semantic embedding from a convolutional autoencoder, a multi-granular CNN (MG-CNN) for richer spectral feature extraction, prototype-guided attention (PGA) for task-aware interpretability, and prototype-guided perturbation (PGP) to promote robustness has been proposed to address these problems. Each module has been tailored to address a certain aspect of hyperspectral complexity, from spatial encodings to adversarial resistance. Experimental results obtained on Land Use/Cover Area frame statistical Survey (LUCAS) soil nutrient data demonstrate the effectiveness of the proposed model with an R2 score of 0.9944, 0.9958, and 0.9960 for N, P, and K, respectively, and low prediction error. This framework is a promising and reliably effective method for precise soil nutrient estimation in precision agriculture. |
| 11:58 | (Online) Synthetic Chest X-ray Augmentation via Generative Variational Autoencoding for Pneumonia Detection ABSTRACT. Pneumonia remains a major global health concern, particularly in low-resource environments or settings where access to radiological expertise is limited. Developing AI methods, especially deep learning models, for pneumonia detection using chest X-ray (CXR) is often hindered by the scarcity of well-annotated data. This study proposes a data augmentation approach using a Variational Autoencoder with the Generative Adversarial Network (VAE-GAN) to generate pneumonia-specific synthetic CXR images. By combining variational encoding with adversarial training, the method produces anatomically consistent images that supplement limited datasets. These synthetic samples are used to train a Vision Transformer (ViT) for pneumonia classification. Compared to conventional augmentation and GAN-based alternatives such as DCGAN and Autoencoder, the use of VAE-GAN improves the model's performance when evaluated on real test data. Grad-CAM visualizations further show that the ViT model trained with VAE-GAN data attends more consistently to relevant lung regions. These results suggest that incorporating structured synthetic images can help improve classification outcomes and interpretability in environments or settings with limited access to expert-labeled medical data. |
| 12:00 | (Online) DPL: Spatial-Conditioned Diffusion Prototype Enhancement for One-Shot Medical Segmentation ABSTRACT. One-shot medical image segmentation faces fundamental challenges in prototype representation due to limited annotated data and significant anatomical variability across patients. Traditional prototype-based methods rely on deterministic averaging of support features, creating brittle representations that fail to capture intra-class diversity essential for robust generalization. This work introduces Diffusion Prototype Learning (DPL), a novel framework that reformulates prototype construction through diffusion-based feature space exploration. DPL models one-shot prototypes as learnable probability distributions, enabling controlled generation of diverse yet semantically coherent prototype variants from minimal labeled data. The framework operates through three core innovations: (1) a diffusion-based prototype enhancement module that transforms single support prototypes into diverse variant sets via forward-reverse diffusion processes, (2) a spatial-aware conditioning mechanism that leverages geometric properties derived from prototype feature statistics, and (3) a conservative fusion strategy that preserves prototype fidelity while maximizing representational diversity. DPL ensures training-inference consistency by using the same diffusion enhancement and fusion pipeline in both phases. This process generates enhanced prototypes that serve as the final representations for similarity calculations, while the diffusion process itself acts as a regularizer. Extensive experiments on abdominal MRI and CT datasets demonstrate significant improvements on both modalities, establishing new state-of-the-art performance in one-shot medical image segmentation. |
| 12:02 | (Online) GRASP-former: A Lightweight Global-Random Sparse Attention for Domain-Aware Multi-Class Obscenity Detection PRESENTER: Shukla Mondal ABSTRACT. The availability of obscene content across diverse domains poses a threat to user well-being, especially children, as well as adults at the workplace. Detecting obscene content across image domains is vital, whereas a single classifier approach lacks the ability to represent the domain differences semantically. To tackle this issue, we propose a domain-aware obscenity classification framework in this paper. It trains the domain-aware classifier with shared head that aligns the semantic boundary between obscene and non-obscene content across domains. We present GRASP-former, a novel feature representation architecture for obscene image classification, where the random–global sparse attention builds a lightweight global context by attending to a small set of learnable global tokens and randomly sampled tokens. We fuse it with depthwise local convolution, which further refines the obscenity features to distinguish visual ambiguities present in obscene and non-obscene classes. We evaluate the proposed architecture with standard performance metrics using samples from the NPDI and NSFW datasets. |
| 12:04 | (Online) Utilizing Keypoint R-CNN for Automated Root Angulation Detection in OPGs PRESENTER: Hira Qayyum ABSTRACT. Accurate measurement of root angulation is crucial in orthodontics for treatment planning and diagnosis. Traditional methods rely on manual estimation, which is time-consuming and prone to errors. Existing automated approaches focus mainly on tooth segmentation and lack precise detection of key points at both the crown and root. There is a need for a deep learning based solution that can accurately identify these keypoints and compute root angulation from panoramic dental X-rays (OPGs). This study uses keypoint R-CNN to detect tooth crown and root points in OPGs, enabling an automated calculation of angulation. The model was trained on annotated dental X-rays and tested on unseen images for validation. The results show that our approach achieves high accuracy in keypoint detection and provides consistent angulation measurements, reducing the risk of human error. The automated system can assist orthodontists in making precise assessments, improving efficiency in dental diagnostics. Our work contributes to a novel application of Keypoint R-CNN for root angulation analysis, enhancing automated orthodontic evaluation. |
| 12:06 | (Online) Cybersickness in VR: State-of-the-Art and Future Research Agenda ABSTRACT. Cybersickness (CS) remains a critical barrier to the widespread adoption of virtual reality (VR), manifesting as nausea, disorientation, and visual fatigue despite significant technological advancements. This paper presents a systematic umbrella review of 25 survey papers (January 2020–May 2025) to quantitatively synthesize evidence on CS influencing factors, assessment methods, and mitigation strategies. Our analysis reveals that latency is the most consistently cited hardware factor (72% of reviews), while demographic factors like age and gender show inconclusive results due to methodological confounds. Content design choices, particularly continuous locomotion, are major contributors to CS, whereas discrete movement methods like teleportation significantly reduce symptoms (76% of reviews). A critical finding is the field’s overreliance on subjective questionnaires like the Simulator Sickness Questionnaire (SSQ), despite widespread recognition of their limitations, and a persistent gap between the discussion of objective physiological measures and their implementation. Based on these insights, we propose a research agenda advocating for: 1) multi-factorial studies to disentangle variable interactions, 2) standardized multi-modal assessment protocols, and 3) rigorous testing of closed-loop adaptive systems. This review provides a comprehensive evidence map and actionable roadmap to advance CS research from fragmented findings toward engineered solutions for improved VR accessibility and comfort. |
| 12:08 | (Online) Enhanced Emphysema Classification in CT Images Using RIU4-LQP and Spatial Texture Features ABSTRACT. Emphysema is a major subtype of chronic obstructive pulmonary disease (COPD) which presents significant challenges in early detection and subtype classification. Computed tomography (CT) imaging provides valuable structural detail, and texture-based analysis has become a key strategy for identifying emphysema patterns. This study introduces an improved texture descriptor, Rotation-Invariant Uniform Local Quinary Pattern (RIU4-LQP), designed to reduce feature dimensionality while preserving fine-grained local details. Additionally, the Average Nearest Neighbor Index (ANNI) is integrated to capture spatial distribution features not adequately represented by traditional descriptors. A feature selection step based on recursive feature elimination (RFE) further optimizes the feature set for classification. Using a leave-one-subject-out cross-validation strategy with support vector machines (SVMs), experiments demonstrate that combining RIU4-LQP with ANNI yields superior classification performance compared to standard LQP. The findings highlight the importance of integrating local and spatial texture descriptors, alongside feature selection, for robust emphysema subtype classification in CT imaging. |
| 12:10 | (Online) Depth-Aware YOLO Segmentation: Enhancing Small Object Detection via MiDaS-Based Spatial Reasoning ABSTRACT. In this paper, we propose a depth-enhanced object detection system that couples MiDaS monocular depth estimation with YOLO-based segmentation to improve the robustness and accuracy of small object detection. Conventional object detectors such as the YOLO family depend on 2D spatial cues and are depth-unaware, so they misdetect or lose objects in visually complex or depth-ambiguous scenes. This limitation is especially acute when detecting small objects at varied distances. By using a MiDaS-computed depth map as an input during post-processing, we propose a spatial reasoning mechanism with depth-guided filtering and prioritization that improves the accuracy of YOLO segmentation. The method was tested on a small fruit dataset, with YOLO segmentation fine-tuned for 20 epochs. Experimental results demonstrate fewer false positives and improved localization of small objects, especially objects closer to the camera. Our findings suggest that the use of depth cues leads to dramatically improved object detection performance, with interesting directions for future work in depth-guided object recognition. |
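A hedged sketch of the depth-guided post-processing idea from this last abstract: estimate a relative depth map with MiDaS, then rank YOLO boxes by the median inverse depth inside each box so nearer objects are prioritized. The model variant and the nearest-first priority rule are assumptions, not the paper's exact mechanism.

```python
# Estimate a relative (inverse) depth map with MiDaS and rank detection boxes
# by median closeness. Model variant and priority rule are assumptions.
import cv2
import numpy as np
import torch

midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform
midas.eval()

def depth_map(bgr: np.ndarray) -> np.ndarray:
    rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
    with torch.no_grad():
        pred = midas(transform(rgb))                    # (1, H', W')
        pred = torch.nn.functional.interpolate(
            pred.unsqueeze(1), size=rgb.shape[:2],
            mode="bicubic", align_corners=False).squeeze()
    return pred.numpy()       # relative inverse depth: larger = closer

def prioritise(boxes, depth):
    """boxes: iterable of (x1, y1, x2, y2); returns nearest-first ordering."""
    def closeness(b):
        x1, y1, x2, y2 = map(int, b)
        return float(np.median(depth[y1:y2, x1:x2]))
    return sorted(boxes, key=closeness, reverse=True)
```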
Paper Session 1
Paper Session 3