Keynote 1
| 09:00 | The search for quantum advantage in optimisation: myths, maths, and the travelling salesman problem |
Conference Overview & 40th Anniversary
Poster Barking Session
| 11:00 | Exploring Image-Language data to enhance Football Understanding ABSTRACT. To advance the tactical analysis of football (soccer) plays, we introduce GoalActions, a novel dataset focusing on the last five seconds of open-play scoring opportunities. Each clip is paired with descriptive captions and a vector of player actions inspired by sports action descriptive language. Our process involves using natural language processing techniques to extract key actions from commentary and an automated action-spotting pipeline to identify relevant events in the clips. We then present an experimental framework based on an image-text model to evaluate how supervision with these action vectors impacts feature representation. Our findings show that while a specialized model performs better at predicting the final outcome, models trained on specific actions learn features that are more useful for identifying individual tactical elements. |
| 11:02 | Measuring Teacher Empathy in a Virtual Reality Scenario Simulating Racial Bias ABSTRACT. This pilot study explored whether a Virtual Reality (VR) scenario in which participants embody an Indigenous 12-year-old student experiencing bias can evoke empathic responses in educators. Participants engaged with a 15-minute VR narrative. Physiological measures including heart rate (HR), heart rate variability (HRV), and electrodermal activity (EDA) were recorded. Further, self-reported empathy was assessed using the Interpersonal Reactivity Index (IRI). Repeated measures ANOVA indicated a significant overall effect for HR; however, pairwise comparisons did not reveal any significant increase during the VR scenario. No significant changes were observed in HRV, EDA, or IRI subscale scores. These findings suggest that the VR scenario may not robustly induce the anticipated physiological or self-reported empathic responses and point to areas for future research. |
| 11:04 | Enhancing Wide-Angle VR Video Transmission Using Human Perception ABSTRACT. Limited radio transmission bandwidth severely constrains the use of wide-angle video in current technology. This problem arises because current video compression techniques are substantially sub-optimal for large fields of view, so video cannot be transmitted fast enough. While modern video compression leverages spatial redundancy to minimise data from the visual periphery, it often neglects temporal redundancy. Our research focuses on video processing that considers spatial-temporal human visual models, and demonstrates that incorporating temporal knowledge of human visual perception into video transmission could reduce data by up to 60% without noticeable loss in quality. To enable optimal transmission of wide-angle video, we developed a new method for processing spatio-temporal information. Its performance was evaluated through subjective quality tests, in which 20 participants rated reference videos and videos processed with our method and modern foveal compression techniques in a controlled VR environment. We believe our research will open users' eyes to real-time viewing through 180-degree headsets. |
| 11:06 | Genetic Algorithms For Parameter Optimization for Disparity Map Generation of Radiata Pine Branch Images ABSTRACT. Traditional stereo matching algorithms like Semi-Global Block Matching (SGBM) with Weighted Least Squares (WLS) filtering offer speed advantages over neural networks for UAV applications, generating disparity maps in approximately 0.5 seconds per frame. However, these algorithms require meticulous parameter tuning. We propose a Genetic Algorithm (GA) based parameter optimization framework that systematically searches for optimal parameter configurations for SGBM and WLS, enabling UAVs to measure distances to tree branches with enhanced precision while maintaining processing efficiency. Our contributions include: (1) a novel GA-based parameter optimization framework that eliminates manual tuning; (2) a comprehensive evaluation methodology using multiple image quality metrics; and (3) a practical solution for resource-constrained UAV systems. Experimental results demonstrate that our GA-optimized approach reduces Mean Squared Error by 42.86% while increasing Peak Signal-to-Noise Ratio and Structural Similarity by 8.47% and 28.52%, respectively, compared with baseline configurations. Furthermore, our approach demonstrates superior generalization performance across varied imaging conditions, which is critical for real-world forestry applications. |
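The GA-based search described in this abstract is concrete enough to sketch, though the paper does not spell out its encoding, fitness metric, or parameter ranges. The Python sketch below is therefore only a plausible reconstruction: a small genetic algorithm over SGBM/WLS configurations, scored against a hypothetical ground-truth disparity map. The ranges, the negative-MSE fitness, and the GA operators are all illustrative assumptions.

```python
# Illustrative GA over SGBM/WLS parameters. Ranges, fitness, and operators are
# assumptions; a ground-truth disparity map (gt) is presumed available.
import random
import cv2
import numpy as np

PARAM_RANGES = {
    "numDisparities": [64, 128, 192, 256],      # must be divisible by 16
    "blockSize": [3, 5, 7, 9, 11],              # must be odd
    "uniquenessRatio": list(range(5, 21)),
    "wls_lambda": [4000, 8000, 16000, 32000],
    "wls_sigma": [0.8, 1.0, 1.2, 1.5, 2.0],
}

def filtered_disparity(left, right, p):
    matcher = cv2.StereoSGBM_create(
        minDisparity=0, numDisparities=p["numDisparities"],
        blockSize=p["blockSize"], uniquenessRatio=p["uniquenessRatio"])
    raw = matcher.compute(left, right)                    # fixed-point, scaled by 16
    wls = cv2.ximgproc.createDisparityWLSFilter(matcher)  # needs opencv-contrib-python
    wls.setLambda(p["wls_lambda"])
    wls.setSigmaColor(p["wls_sigma"])
    return wls.filter(raw, left).astype(np.float32) / 16.0

def fitness(left, right, gt, p):
    return -float(np.mean((filtered_disparity(left, right, p) - gt) ** 2))

def evolve(left, right, gt, pop_size=20, gens=15, mut_rate=0.2):
    keys = list(PARAM_RANGES)
    pop = [{k: random.choice(PARAM_RANGES[k]) for k in keys} for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=lambda p: fitness(left, right, gt, p), reverse=True)
        elite = pop[: pop_size // 2]                      # truncation selection
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = random.sample(elite, 2)
            child = {k: random.choice([a[k], b[k]]) for k in keys}  # uniform crossover
            if random.random() < mut_rate:                # point mutation
                k = random.choice(keys)
                child[k] = random.choice(PARAM_RANGES[k])
            children.append(child)
        pop = elite + children
    return max(pop, key=lambda p: fitness(left, right, gt, p))
```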
| 11:08 | ChatPPG: Computational Analysis and Statistics of Table Tennis Games ABSTRACT. This paper presents ChatPPG, an innovative system that combines large language models (LLMs) fine-tuned with Low-Rank Adaptation (LoRA) and computer vision for real-time data analysis and coaching in table tennis games. By integrating multi-camera 3D reconstruction, object detection and object tracking, ChatPPG processes match data such as player speed, ball trajectories, and service legality, transforming raw metrics into actionable insights. The fine-tuned model achieved a Q/A accuracy of 92.3%, surpassing the baseline model's 83.7%, with sub-second response times enabled by 8-bit quantization. Practical applications demonstrated its ability to deliver personalized training plans and tactical recommendations tailored to individual player profiles. User feedback from professional coaches and athletes rated tactical suggestions at 9.3/10 and training recommendations at 8.9/10. Integrating structured CV outputs with LLM capabilities enhanced transparency and interpretability, allowing users to trace recommendations to data-driven decisions. Despite dataset limitations and the need for advanced query handling, ChatPPG bridges the gap between data analysis and decision-making, setting a new standard for integrating LLMs and CV technologies in fast-paced sports analytics. |
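The LoRA-plus-8-bit-quantization recipe mentioned in this abstract has a standard shape in the Hugging Face stack (transformers, peft, bitsandbytes); a minimal sketch follows. The base model name, rank, and target modules are assumptions for illustration, not ChatPPG's reported configuration.

```python
# Sketch of LoRA fine-tuning on an 8-bit-quantized base model.
# Base model, rank, and target modules below are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"                        # hypothetical base model
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights
    device_map="auto",
)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],                 # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # only the low-rank adapters are trainable
```

The 8-bit weights cut memory and, together with the small adapter matrices, help explain the sub-second response times the abstract reports.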
| 11:10 | Performance Evaluation of Deep Learning Architectures for Tree Branch Segmentation in Autonomous Forestry Systems ABSTRACT. UAV-based autonomous forestry operations require rapid and precise tree branch segmentation for safe navigation and automated pruning across varying pixel resolutions and operational conditions. We evaluate different deep learning methods at three resolutions (256×256, 512×512, 1024×1024) using the Urban Street Tree Dataset, employing standard metrics (IoU, Dice) and specialized measures including Thin Structure IoU (TS-IoU) and Connectivity Preservation Rate (CPR). Among 22 configurations tested, U-Net with MiT-B4 backbone achieves strong performance at 256×256. At 512×512, MiT-B4 leads in IoU, Dice, TS-IoU, and Boundary-F1. At 1024×1024, U-Net+MiT-B3 shows the best validation performance for IoU/Dice and precision, while U-Net++ excels in boundary quality. PSPNet provides the most efficient option (2.36/9.43/37.74 GFLOPs) with 25.7/19.6/11.8 percentage point IoU reductions compared to top performers at respective resolutions. These results establish multi-resolution benchmarks for accuracy–efficiency trade-offs in embedded forestry systems. Implementation is available at https://github.com/BennyLinntu/Performance_Tree_Branch_Segmentation. |
| 11:12 | Improving Multi-organ Segmentation in Abdomen CT images incorporating Shape priors and Spatial information in Deep Learning ABSTRACT. Accurate and reliable segmentation of multiple abdominal organs in CT scans is fundamental for computer-assisted diagnosis, treatment planning, and longitudinal follow-up. Yet state-of-the-art deep networks either struggle with small or elongated structures or demand prohibitive computational resources. We introduce a lightweight distillation framework that couples Pseudo-3D inputs with Shape-Intensity Knowledge Distillation (SIKD). During training, a high-capacity teacher ingests three adjacent slices together with an auxiliary Z-spacing channel, enabling it to capture inter-slice context and anatomical priors. The distilled student receives only single-slice images at inference time but is guided by feature-, prediction-, and shape-aware losses to internalise this 3D information at minimal computational cost. Evaluated on the AMOS benchmark, our method increases the mean Dice similarity coefficient by 5% and increases the average Dice of six small organs by 3 pp compared to a 2D UNet baseline, while cutting GPU memory by 47% and latency by 38%. These results demonstrate a practical route for deploying accurate multi-organ segmentation in real-world clinical workflows. |
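The feature-, prediction-, and shape-aware losses this abstract names are not specified in detail, so the sketch below shows only the generic teacher-to-student pattern they build on: a supervised term plus prediction-level KL distillation plus feature matching. The shape-aware SIKD term is not reproduced, and the temperature and weights are placeholders.

```python
# Generic teacher→student distillation loss; the paper's shape-aware SIKD term
# is not public in the abstract and is therefore omitted. T, w_kd, w_feat are
# placeholder values.
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, student_feat, teacher_feat,
                 target, T=2.0, w_kd=0.5, w_feat=0.1):
    ce = F.cross_entropy(student_logits, target)              # supervised term
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * (T * T)            # prediction level
    feat = F.mse_loss(student_feat, teacher_feat)             # feature level
    return ce + w_kd * kd + w_feat * feat
```

For segmentation, the logits are (B, C, H, W) maps and the target is a (B, H, W) label map; the same function applies unchanged.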
| 11:14 | Enhancing Domain Generalisability for Lung Nodule Detection: A Hybrid Strategy with Multi-Source Training and MixStyle ABSTRACT. Domain shift presents a significant obstacle to deploying deep learning models for lung nodule detection in chest X-rays (CXRs) across varied clinical environments. Such environments differ in imaging protocols, scanner technologies, and patient demographics, each introducing inconsistencies that can degrade model performance. This study presents a hybrid approach aimed at enhancing the domain generalisability of YOLOv5s, a lightweight object detection framework, by combining multi-source training with MixStyle, a feature-level style perturbation method. Using the NODE21 challenge dataset and a subset of the NIH ChestX-ray8 dataset (Kaggle), the approach measures the extent of domain shift, applies multi-source training, and determines the optimal placement of MixStyle within the YOLOv5s architecture through an ablation study. Relative to a multi-source baseline, the proposed model achieves significant performance improvements: an 8.6% gain in mAP@.5 and 14.0% in mAP@.5:.95 on NODE21, and a 2.4% gain in mAP@.5 and 3.8% gain in mAP@.5:.95 on Kaggle. Qualitative results also indicate improved detection of difficult nodules, including very small or partially obscured lesions. The proposed strategy is computationally efficient, improves robustness, and offers a scalable pathway to reliable early lung cancer diagnosis across diverse clinical settings, thereby strengthening the real-world applicability of AI-enabled medical imaging. |
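MixStyle itself is published (Zhou et al., 2021), so its core operation can be shown faithfully: per-instance feature statistics are mixed across the batch to synthesize new feature-level styles. Where to insert the module inside YOLOv5s is precisely this paper's ablation question and is not fixed in the sketch.

```python
# Minimal MixStyle module (Zhou et al., 2021): mix per-instance channel
# statistics across the batch to perturb feature styles during training.
import torch
import torch.nn as nn

class MixStyle(nn.Module):
    def __init__(self, p=0.5, alpha=0.1, eps=1e-6):
        super().__init__()
        self.p, self.eps = p, eps
        self.beta = torch.distributions.Beta(alpha, alpha)

    def forward(self, x):                            # x: (B, C, H, W)
        if not self.training or torch.rand(1).item() > self.p:
            return x
        B = x.size(0)
        mu = x.mean(dim=[2, 3], keepdim=True)
        sig = (x.var(dim=[2, 3], keepdim=True) + self.eps).sqrt()
        x_norm = (x - mu) / sig                      # instance-normalize
        lam = self.beta.sample((B, 1, 1, 1)).to(x.device)
        perm = torch.randperm(B, device=x.device)    # partner instances
        mu_mix = lam * mu + (1 - lam) * mu[perm]
        sig_mix = lam * sig + (1 - lam) * sig[perm]
        return x_norm * sig_mix + mu_mix             # re-style with mixed stats
```

Because only first- and second-order statistics are perturbed, the module adds no parameters and is active only at training time, which matches the abstract's emphasis on computational efficiency.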
| 11:16 | From 2D X-rays to a 3D surgical plan: progress with AI reconstruction ABSTRACT. Surgeons and patients achieve better outcomes when planning total hip arthroplasty surgery in 3D than in 2D. However, planning in 3D currently requires Computed Tomography (CT) imagery which introduces disadvantages in terms of cost, accessibility and radiation exposure. Reconstructing 3D models of a patient's anatomy from 2D X-ray images would achieve the best of both worlds. This paper demonstrates the training of an AI model for 3D reconstruction from X-rays and its intended application to 3D surgical planning. The training process makes use of Digitally Reconstructed Radiographs (DRRs) - synthetic images that are close approximations to X-rays, with the advantages of more flexible augmentation possibilities and perfect alignment to CT (and the associated 3D ground truth). Reconstruction results are shown and compared to ground truth. The resulting meshes can be seamlessly integrated into existing 3D surgical planning software, and are demonstrated to give only small differences when compared to plans generated using 3D segmentation of a CT scan. |
| 11:18 | Camera Pose Estimation in Multi-object Scenes Using Ray Diffusion and Point Cloud Alignment ABSTRACT. Accurate camera pose estimation from image sets is a fundamental problem in various applications, including 3D reconstruction, view synthesis, and robotic vision. Recently, Ray Diffusion has demonstrated its effectiveness in estimating relative camera poses for sparsely sampled, single-object-centric image sets. However, its performance decreases if a scene contains multiple objects. In this study, we propose a novel method that extends the Ray Diffusion framework by integrating monocular depth estimation and Iterative Closest Point (ICP) alignment, enabling robust pose estimation for densely sampled, multi-object image sets. The proposed approach first detects object bounding boxes and extracts object masks. According to the number of objects in the scene, the image sets are divided into several sub-scenes. For a single-object sub-scene, we directly apply Ray Diffusion. For a multi-object sub-scene, we reconstruct 3D point clouds of the objects by employing a monocular depth network and perform ICP alignment between two consecutive frames to recover relative camera poses. To solve the similarity transformation up to scale between the sub-scenes, the projection error of the 3D points in the multi-object sub-scene to the neighboring single-object sub-scenes is minimized. Experimental results show that the proposed method consistently outperforms Ray Diffusion in various multi-object scenarios. |
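A minimal sketch of the multi-object branch described above: back-project a monocular depth map into a point cloud and recover relative camera motion between consecutive frames with point-to-point ICP (Open3D). The intrinsics and correspondence threshold are placeholder values, and the sub-scene partitioning and scale-alignment steps from the abstract are omitted.

```python
# Depth map -> point cloud -> ICP relative pose between consecutive frames.
# Intrinsics (fx, fy, cx, cy) and the ICP threshold are illustrative values.
import numpy as np
import open3d as o3d

def backproject(depth, fx, fy, cx, cy):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    pts = np.stack([(u - cx) * z / fx, (v - cy) * z / fy, z], axis=-1)
    pc = o3d.geometry.PointCloud()
    pc.points = o3d.utility.Vector3dVector(pts.reshape(-1, 3))
    return pc

def relative_pose(depth_a, depth_b, fx, fy, cx, cy, thresh=0.05):
    src = backproject(depth_a, fx, fy, cx, cy)
    dst = backproject(depth_b, fx, fy, cx, cy)
    reg = o3d.pipelines.registration.registration_icp(
        src, dst, thresh, np.eye(4),
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return reg.transformation     # 4x4 camera motion between the two frames
```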
| 11:20 | Targetless Extrinsic Calibration of Fisheye Cameras Using Vehicle Detection and Monodepth Alignment in Cylindrical Image Space ABSTRACT. We propose a novel targetless method for estimating the extrinsic parameters of multiple fisheye cameras mounted outside of a vehicle. Without using any artificial calibration target, we use image features in the detected road vehicles and depth data of the surrounding scenes. To solve the severe image distortion problem, fisheye images are first transformed into cylindrical image space. This enables the direct use of existing object detection and monocular depth estimation networks. After the image transformation, feature points in the detected vehicles are extracted and matched across multiple camera views. At the same time, the estimated depth from the monodepth network is used to convert the cylindrical images into 3D point clouds. To solve the inherent scale ambiguity in monocular depth, the depth information of the surrounding scene is rescaled using the assumption that the xz-plane of the front camera is parallel to the ground. Then, we simultaneously optimize both the global scale and the per-point depth (expressed as a radial distance to an object point) to align the point clouds and estimate the extrinsic parameters. Video sequences from the fisheye cameras are used to collect enough image features and depth for optimization. We validate our method through both real-world and simulation-based experiments. The evaluation results show that the proposed method outperforms COLMAP and DUSt3R even under conditions of wide baseline and low frame rate. These results indicate that the proposed method is effective in practical scenarios and can be applied without the use of calibration targets. |
| 11:22 | Beyond Deep Learning: Agentic AI Framework for Object Detection ABSTRACT. Object detection remains a fundamental yet challenging problem in machine vision. Over the past decade, numerous state-of-the-art solutions have been developed, predominantly based on deep learning. While effective, these models typically require large-scale annotated datasets and substantial computational resources, limiting their scalability and adaptability. To address these constraints, zero-shot and few-shot learning approaches have been introduced. However, they often struggle with generalization and task-specific performance. Agentic AI has recently emerged as a promising paradigm, enabling autonomous task execution by leveraging powerful vision-language models without the need for task-specific training. In this paper, we propose an agentic AI framework for object detection and investigate its feasibility in the context of assistive robotics. Our experimental results demonstrate the framework’s potential for real-world deployment, highlighting its ability to perform zero-shot detection and reasoning in indoor environments. |
| 11:24 | Player Perceptions of Path-First Procedural Content Generation Level Design for 3D Platformer Games ABSTRACT. Video game development faces increasing challenges from higher player expectations and expanding budgets. Procedural Content Generation (PCG) addresses these issues by reducing developer resource requirements. For platformer games, PCG methods often require time-consuming post-generation validation. Path-first PCG, designed for 2D platformers, first creates a playable path and then constructs the level around it. Existing path-first PCG techniques solve resource and playability concerns in 2D platformers, but no guarantee exists that these levels are perceived the same as manually designed levels, and few have attempted to apply path-first to 3D platformers. This paper prototypes a game to assess the effectiveness of a rhythm-based path-first PCG level design for 3D platformers. Empirical evidence from a controlled user study (n=18) reveals minimal differences in perceptions between path-first PCG and manually designed levels, showcasing promise for the automated PCG technique. |
| 11:26 | Keypoint Estimation for Real-Time Pinus Radiata Cutpoint Detection ABSTRACT. In this paper, we propose a real-time capable method for detecting and tracking Pinus Radiata branch cutpoints in a forestry environment using a UAV platform. Our method applies the YOLOv11 pose estimation network to detect 2D cutpoint positions. Depth estimates around each cutpoint are improved through DBSCAN clustering, frame differencing, and filtering to remove fine-structure noise. A paired Kanade-Lucas-Tomasi tracker was used for frame-to-frame tracking and re-acquisition of cutpoints after occlusion. The method achieved a Mean Average Precision of 0.88 on video sequences captured in a Pinus Radiata plantation, with a depth error of less than 50 mm at a distance of up to 1200 mm. An average pipeline processing time of 5.4ms was achieved on an Nvidia Jetson Orin AGX 32GB. These results and frame processing times indicate that the proposed method would be suitable for enabling autonomous, real-time pruning of Pinus Radiata. |
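The KLT step in this pipeline maps directly onto OpenCV's pyramidal Lucas-Kanade tracker; a hedged sketch follows. The window size and termination criteria are generic defaults, and the paper's occlusion re-acquisition logic is not reproduced.

```python
# Frame-to-frame KLT tracking of detected cutpoints with pyramidal
# Lucas-Kanade; parameters are generic defaults, not the paper's settings.
import cv2
import numpy as np

LK_PARAMS = dict(winSize=(21, 21), maxLevel=3,
                 criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT,
                           30, 0.01))

def track(prev_gray, gray, points):
    """points: (N, 1, 2) float32 cutpoint locations in the previous frame."""
    nxt, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, gray, points,
                                                 None, **LK_PARAMS)
    good = status.ravel() == 1
    return nxt[good], good        # tracked positions and survival mask
```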
| 11:28 | US-Loss: Integrating Uncertainty Estimation in the Loss Function of Image Segmentation ABSTRACT. While deep learning models achieve excellent results in image segmentation, their ability to express uncertainty in predictions remains a key limitation. This is especially crucial in safety-critical domains like medical imaging and autonomous driving where prediction errors can lead to life-threatening situations. In this research, we propose an Uncertainty-aware Segmentation Loss (US-Loss) function that jointly optimizes for higher segmentation accuracy and reduced epistemic uncertainty. Unlike existing methods that require multiple forward passes or complex architectural modifications, US-Loss provides uncertainty estimates through a single forward pass. Moreover, this loss function can easily be integrated into any segmentation architecture. We evaluated our approach against two popular uncertainty measurement approaches, Deep Ensemble and Monte Carlo Dropout, on two challenging datasets: skin lesion images (ISIC-2018) and breast ultrasound images (BUS). Both quantitative and qualitative results show that our proposed approach performs best across both datasets. |
| 11:30 | (Online) Joint Super-Resolution and Segmentation for Low-Resolution Brain MRI Analysis ABSTRACT. Low-resolution brain MRI acquisitions compromise anatomical segmentation accuracy, limiting clinical utility in time-constrained imaging protocols. We present a unified framework integrating generative super-resolution with volumetric segmentation to enable precise tissue delineation from degraded inputs. Our architecture employs advanced generative modeling to reconstruct perceptually faithful high-resolution MRI while preserving anatomical fidelity. Segmentation labels are appropriately scaled through parallel processing, and a deep neural network with a robust architectural backbone generates corresponding high-resolution segmentation maps. The framework coordinates both components through joint optimization: the super-resolution module prioritizes features critical for tissue boundary detection, while the segmentation network adapts to characteristics of synthetically enhanced images. |
| 11:32 | (Online) Empathic Risk Companion: Multimodal Vision-Language Fusion with Emotion Prediction Error for Decision Support ABSTRACT. We present the Empathic Risk Companion (ERC), a multimodal vision–language system that fuses facial, speech, and text affect estimates via a reliability-aware late-to-hybrid strategy, and operationalises emotion prediction error (EPE) as a first-class signal for adaptive decision support. ERC couples a dedicated emotion detection module with a domain risk module and a vision-language policy to generate transparent, just-in-time guidance. We evaluate ERC in a controlled trading task with four between-subjects conditions (dashboard baseline, vision-only, ERC-noEPE, ERC-Full). Although there is no real financial loss, participants reported above-midpoint perceived stress in early rounds, indicating non-trivial affective engagement. ERC achieved high usability (SUS = 84.37) and reduced absolute EPE over six rounds, indicating improved affect calibration and decision consistency. Exploratory ablations benchmark ERC against (i) a dashboard baseline, (ii) a unimodal vision-only pipeline, and (iii) ERC without EPE (same interface). ERC-Full consistently yielded the largest drop in |EPE| and higher decision consistency, suggesting that EPE contributes beyond interface and unimodal effects. |
| 11:34 | (Online) An Improved ORB-SLAM2 Algorithm Based on Extended Kalman Filtering and Particle Swarm Optimization ABSTRACT. To address the challenges of pose estimation jitter and cumulative errors in visual SLAM systems operating in complex dynamic environments, this paper proposes an enhanced ORB-SLAM2 algorithm that integrates an Extended Kalman Filter (EKF)-based frontend pose smoother with a Particle Swarm Optimization (PSO)-driven backend optimizer. At the frontend, the EKF pose smoother fuses Inertial Measurement Unit (IMU) data with visual odometry to correct camera trajectories in real-time, effectively suppressing short-term pose drift. At the backend, the PSO algorithm dynamically optimizes node constraints in the pose graph and adaptively adjusts loop-closure thresholds, refining the weights of reprojection errors to enhance consistency between mapping outputs and the original environment. Experimental results demonstrate that compared to the baseline ORB-SLAM2 and PV-LIO algorithms, the proposed method significantly improves efficiency across three map scales—reducing mapping time and scanning duration while maintaining mapping quality—and achieves notable error suppression. |
| 11:36 | (Online) Hybrid SIFT-SNN for Efficient Anomaly Detection of Traffic Flow-Control Infrastructure ABSTRACT. This paper presents the SIFT-SNN framework, based on a low-latency neuromorphic signal processing pipeline for real-time detection of structural anomalies in transport infrastructure. The proposed approach integrates Scale-Invariant Feature Transform (SIFT) for spatial feature encoding with a latency-driven spike conversion layer and a Leaky Integrate-and-Fire (LIF) Spiking Neural Network (SNN) for classification. The Auckland Harbour Bridge dataset was recorded under various weather and lighting conditions, comprising 6,000 labelled frames that include both real and synthetically augmented unsafe cases. The presented system achieves a classification accuracy of 92.3% with a per-frame inference time of 9.5 ms. The achieved sub-10-millisecond latency, combined with sparse spike activity (8.1%), enables real-time, low-power edge deployment. Unlike conventional CNN-based approaches, the hybrid SIFT-SNN pipeline explicitly preserves spatial feature grounding, supports transparent decision-making, and operates efficiently on embedded hardware. The SIFT-SNN framework is validated through a working prototype deployed on a consumer-grade system and framed as a generalisable case study in structural safety monitoring for movable concrete barriers, which, as traffic flow-control infrastructure, are deployed in over 20 cities worldwide. |
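The LIF dynamics underlying this SNN classifier are standard and can be sketched in a few lines. The time constant, threshold, and reset value below are arbitrary example settings, not those of the SIFT-SNN pipeline.

```python
# Minimal discrete-time leaky integrate-and-fire layer; parameters are
# arbitrary example values, not the paper's configuration.
import numpy as np

def lif_run(currents, tau=10.0, v_th=1.0, v_reset=0.0):
    """currents: (T, N) input current per timestep; returns (T, N) spikes."""
    T, N = currents.shape
    v = np.zeros(N)
    spikes = np.zeros((T, N), dtype=np.uint8)
    for t in range(T):
        v += (currents[t] - v) / tau     # leaky integration toward the input
        fired = v >= v_th
        spikes[t] = fired
        v[fired] = v_reset               # reset membrane after a spike
    return spikes
```

The sparse spike activity the abstract reports (8.1%) corresponds to most entries of such a spike train being zero, which is what enables low-power event-driven inference.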
| 11:38 | (Online) Text-to-Floorplan Synthesis via Graph-Conditioned Diffusion Processes ABSTRACT. This paper introduces a novel AI-based framework for automated 2D architectural floor plan generation from natural language descriptions, employing graph-conditioned denoising diffusion probabilistic models (DDPM) integrated with a U-Net architecture. The proposed system transforms user-provided textual specifications of room types, counts, and spatial relationships into structured room-adjacency graphs. These semantic graphs are encoded into compact embeddings and used to condition the diffusion model during the generative process. The framework produces segmented floor plans with clearly labelled architectural components, which are further refined through post-processing into editable 2D layouts. We evaluate our method on the RPLAN dataset and incorporate qualitative assessments from domain experts. Results demonstrate superior performance in generating diverse, semantically valid, and spatially coherent floor plans compared to existing GAN- and transformer-based approaches. This work highlights the effectiveness of combining semantic graph representations with diffusion-based generative models in advancing AI-driven architectural design. |
| 11:40 | (Online) RobotFlags: AI-Powered Semaphore Interactions Between Chatbot and Humanoid Robot ABSTRACT. This paper introduces RobotFlags, an intelligent system that integrates flag language with deep learning, large language models, and humanoid robots to support learning and interactive communication. The core of the system is the improved YOLO-AKEMA model, which incorporates attention mechanisms and adaptive convolutions to achieve high-accuracy recognition across 27 flag categories, forming a reliable foundation for gesture analysis. The user interface is implemented on the Dify AI platform, with a retrieval-augmented generation (RAG) framework constructed from curated semaphore documents and the BGE-M3 embedding model, enabling context-aware responses. A humanoid robot is seamlessly integrated as both a demonstrator and evaluator: it performs flag gestures, assesses learners’ performance, and provides detailed feedback. To ensure real-time interaction, optimization strategies such as half-precision computation, streaming inference, and caching are employed, maintaining average response times under three seconds. Altogether, RobotFlags delivers a robust, multimodal learning environment that advances flag language education and creates new opportunities for gesture-based human–robot interaction. |
| 11:42 | (Online) Rotation-Invariant Game State Evaluation via Canonical Board Tensors ABSTRACT. Evaluating game states benefits from exploiting rotational symmetries in board configurations. We introduce a parameter-free canonicalisation layer that maps any board tensor to a unique representative under the rotation group, enforcing rotation invariance without data augmentation or group-equivariant architectures. We formalise the induced equivalence relation over game states and define a total order to select canonical representatives. Inserted between the input and a standard MLP, the layer reduces the effective state space while leaving the learnable parameter count unchanged. On $m$–$n$–$k$ games, the resulting model consistently achieves lower training and test MSE than an otherwise identical baseline that consumes raw board tensors, and it converges faster across architectures and minimum-depth filters. These results indicate that explicit symmetry reduction via canonicalisation is a simple, general mechanism for improving game state evaluation. |
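The canonicalisation idea above is concrete enough to sketch: enumerate the rotations of the board tensor and pick a unique representative under a total order. The flattened-tuple lexicographic order below is one valid choice; the paper's exact order may differ.

```python
# Parameter-free canonicalisation under the board's rotation group: generate
# the valid rotations and keep the lexicographically smallest flattening.
import numpy as np

def canonical(board: np.ndarray) -> np.ndarray:
    """board: (..., m, n) tensor; returns its canonical rotation."""
    square = board.shape[-1] == board.shape[-2]
    turns = range(4) if square else (0, 2)   # rectangular boards: 180° only
    rots = [np.rot90(board, r, axes=(-2, -1)) for r in turns]
    return min(rots, key=lambda b: tuple(b.ravel()))
```

Because every rotation of a state maps to the same representative, the layer shrinks the effective input space seen by the downstream MLP without adding any learnable parameters.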
| 11:44 | (Online) Visual Question Answering using Multimodal Data Augmentation for Hausa ABSTRACT. This paper presents a classification-based Visual Question Answering (VQA) framework that integrates Large Language Models (LLMs) and vision transformers for Hausa, a low-resource African language. By fine-tuning LLMs on monolingual Hausa text and fusing their embeddings with state-of-the-art vision encoders, our system predicts answers from a fixed vocabulary. Evaluations on no-augmentation, inline, and offline text-image enhancement strategies conducted on the HaVQA dataset demonstrate that the offline augmentation strategy yields the best performance, achieving 35.85% accuracy, 35.89% WUPS, and 15.32% F1-score. These results surpass the baseline by more than 5%. Our findings underscore the importance of language-specific pretraining and comprehensive data enrichment for robust classification-based VQA in under-resourced settings. |
| 11:46 | (Online) Comparative Study of DINOv2, I-JEPA, and ViT Embeddings for Unsupervised Anomaly Detection ABSTRACT. This paper presents a unified, unsupervised framework for Visual Anomaly Detection (VAD) in dynamic, fixed-view scenes, leveraging modern Vision Transformers (ViTs) without additional fine-tuning. We investigate whether embeddings from backbones like a generic ViT, DINOv2, and I-JEPA, combined with a simple clustering approach, are sufficient for identifying anomalies. Our methodology extracts both global (CLS token) and local (patch-level) embeddings, applies clustering (k-Means, HDBSCAN) to model the distribution of normal scenes, and uses a scalable vector database (ChromaDB) for efficient similarity search. The three backbones delivered comparable performance, with the generic ViT showing a small but consistent advantage in balanced accuracy. Although local embeddings provided a 3% gain in balanced accuracy, their sequential processing time is considerably higher. This limitation could be mitigated through parallelization, potentially bringing their efficiency closer to that of global embeddings. In contrast, global embeddings deliver comparable performance while being approximately 256 times faster, enabling near real-time batch processing of 15 s video segments. The k-Means clustering and the proposed retrieval strategy proved most effective, achieving a practical operational trade-off with a balanced accuracy of 82%. These results suggest that ViT embeddings are effective for separating normal and anomalous patterns. Local embeddings appear to capture complementary and pertinent information, and we expect that with modest adjustments in the detection strategy, they could be further leveraged to improve anomaly detection performance. |
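A rough sketch of the clustering-based scoring this abstract describes: fit k-Means on embeddings of normal frames, then score new frames by distance to the nearest centroid. The number of clusters and the decision threshold are validation choices, and the ChromaDB retrieval path is omitted.

```python
# Fit k-Means on embeddings of normal frames; score new frames by distance to
# the nearest centroid. k and the threshold tau are validation choices.
import numpy as np
from sklearn.cluster import KMeans

def fit_normal_model(normal_embeddings: np.ndarray, k: int = 8) -> KMeans:
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(normal_embeddings)

def anomaly_scores(model: KMeans, embeddings: np.ndarray) -> np.ndarray:
    # transform() returns distances to every centroid; keep the nearest one
    return model.transform(embeddings).min(axis=1)

# usage (names hypothetical): flag segments above a calibrated threshold
# km = fit_normal_model(train_cls_embeddings)
# is_anomalous = anomaly_scores(km, test_cls_embeddings) > tau
```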
| 11:48 | (Online) M-AIDE: Mechanistic Agentic Interpretability for Decoding Empathy in Language Models ABSTRACT. Large language models (LLMs) have transformed conversational agents, powering applications from everyday assistants to domain-specific systems. Yet, their internal mechanisms remain opaque, limiting our understanding of how complex behaviours are represented. Therapeutic conversational agents provide a compelling setting to study this problem, as they require models to encode empathic behaviours. For a better understanding of these behaviours, we present M-AIDE, an agentic framework designed to systematically interpret empathy-related features in LLMs. We apply this technique to therapeutic dialogue data, specifically to understand how LLMs may encode perceived empathy. Our approach leverages mechanistic interpretability to uncover artificial empathy features aligned with psychological categories of empathy. M-AIDE integrates automated interpretability into its pipeline, enabling large-scale classification and explanation of discovered features without exhaustive manual inspection. Our experiments reveal a gradient of representation: low-level features predominate at early layers, while distinct empathy features emerge as layers become deeper. We will release the code upon acceptance. |
| 11:50 | (Online) Leveraging Pupil Facial Fusion for Enhanced Micro-Expression Recognition ABSTRACT. Micro-expressions—fleeting, involuntary facial movements that reveal genuine emotions—are difficult to recognize because they span only a few frames and occur in highly class-imbalanced data. Prior work concentrates on facial appearance while overlooking pupillary dynamics that provide complementary arousal cues. We propose the Enhanced Merged Divided Space–Time Transformer (EM-DSTA), a lightweight yet accurate framework that unifies facial, peri-ocular, and pupil information within a single early–mid fusion backbone. A two-stage detector localizes the iris and pupil in high-resolution frames, encoding the pupil-to-eye area ratio as a scalar arousal token. Full-face and eye crops are channel-concatenated, and the scalar token is injected via a Pupil Gate positioned between spatial and temporal attention blocks. Variable-length videos are handled by a sliding-window schedule with attention masks, while depth-wise convolutions enhance locality without incurring quadratic cost. On CAS(ME)3, EM-DSTA attains 62.50% top-1 accuracy, 51.52% UAR, and UF1 of 51.83%. Ablation studies confirm the dominant role of the pupil signal and show that early fusion outperforms late or dual-stream alternatives. |
| 11:52 | (Online) Machine Learning Models for Predicting Post-Wildfire Methane Emissions in Australia Using Multivariate Data ABSTRACT. Wildfires are significant geophysical disasters that destroy ecosystems and accelerate climate change by emitting greenhouse gases like methane (CH4) and carbon dioxide (CO2). CH4, in particular, is a potent greenhouse gas that has a significant impact on global warming and atmospheric composition. Wildfires emit CH4 during biomass combustion, making proper quantification of these emissions an essential research objective. This work aims to estimate CH4 emissions from wildfires in Australia using satellite-based remote sensing data, specifically fire radiative power (FRP) measurements from MODIS (NASA EOS-Terra and Aqua), as incorporated in the CAMS Global Fire Assimilation System (GFAS). Our predictive system combines advanced model optimisation approaches with SHAP (Shapley Additive Explanations) values to uncover significant features affecting CH4 prediction accuracy. Using this framework, we analyse the performance of various machine learning models and recommend the most effective ways to anticipate wildfire-related CH4 emissions in Australia. |
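The abstract names SHAP but not the model family, so the sketch below uses a gradient-boosted tree regressor with TreeExplainer on a synthetic stand-in feature matrix, purely to show the analysis pattern; the real work would substitute the FRP-derived features.

```python
# SHAP feature-attribution pattern on a tree regressor; the data here is a
# synthetic stand-in for the real FRP / meteorological feature matrix.
import numpy as np
import shap
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                                 # toy features
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=500) # toy CH4 target

model = xgb.XGBRegressor(n_estimators=200).fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)   # global ranking of features by impact
```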
| 11:54 | (Online) SAIMNet: Object Detection Based on Semantic Alignment of Infrared Image and Microwave Non-image Information Fusion ABSTRACT. Due to the inability of unimodal data to provide comprehensive information and its poor environmental adaptability, which fails to meet detection accuracy requirements, this paper proposes an object detection algorithm that integrates infrared images and microwave non-image information based on semantic alignment, leveraging their complementary characteristics. We improve radar image quality by converting radar point cloud data into radar RD (Range-Doppler) images and introducing hierarchical wavelet threshold denoising techniques. Subsequently, to address the noise issue in infrared images, we design an adaptive median filtering algorithm to effectively remove salt-and-pepper noise while preserving edge details. Furthermore, we propose the Adaptive Convolutional Context Attention (ACCA) module, which dynamically adjusts the size of the convolutional kernel, enhancing the model's ability to perceive features across different image regions. To further improve feature representation capability, we design the Adaptive Feature Enhancement (AFE) module, which strengthens the model’s ability to capture and integrate semantic information across channels. Experimental results demonstrate that SAIMNet improves object detection accuracy. |
| 11:56 | (Online) Deep Spectral Analytics based Soil Nutrient Prediction using Spatial-Semantic Feature Embedding with Prototype-Guided Perturbation ABSTRACT. Soil nutrient prediction from hyperspectral imagery is not an easy task, owing to the high-dimensional data, spatial redundancy, and non-adaptability of feature-learning techniques. In most cases, conventional deep learning methods are inefficient when it comes to capturing spatial-spectral semantics and are vulnerable to noise interference or irrelevant band information. A hybrid deep learning approach integrating spatial-semantic embedding from a convolutional autoencoder, a multi-granular CNN (MG-CNN) for richer spectral feature extraction, prototype-guided attention (PGA) for task-aware interpretability, and prototype-guided perturbation (PGP) to promote robustness has been proposed to address these problems. Each module has been tailored to address a certain aspect of hyperspectral complexity, from spatial encodings to adversarial resistance. Experimental results obtained on Land Use/Cover Area frame statistical Survey (LUCAS) soil nutrient data demonstrate the effectiveness of the proposed model with an R2 score of 0.9944, 0.9958, and 0.9960 for N, P, and K, respectively, and low prediction error. This framework is a promising and reliably effective method for precise soil nutrient estimation in precision agriculture. |
| 11:58 | (Online) Synthetic Chest X-ray Augmentation via Generative Variational Autoencoding for Pneumonia Detection ABSTRACT. Pneumonia remains a major global health concern, particularly in low-resource environments or settings where access to radiological expertise is limited. Developing AI methods, especially deep learning models, for pneumonia detection using chest X-ray (CXR) is often hindered by the scarcity of well-annotated data. This study proposes a data augmentation approach using a Variational Autoencoder with the Generative Adversarial Network (VAE-GAN) to generate pneumonia-specific synthetic CXR images. By combining variational encoding with adversarial training, the method produces anatomically consistent images that supplement limited datasets. These synthetic samples are used to train a Vision Transformer (ViT) for pneumonia classification. Compared to conventional augmentation and GAN-based alternatives such as DCGAN and Autoencoder, the use of VAE-GAN improves the model's performance when evaluated on real test data. Grad-CAM visualizations further show that the ViT model trained with VAE-GAN data attends more consistently to relevant lung regions. These results suggest that incorporating structured synthetic images can help improve classification outcomes and interpretability in environments or settings with limited access to expert-labeled medical data. |
| 12:00 | (Online) DPL: Spatial-Conditioned Diffusion Prototype Enhancement for One-Shot Medical Segmentation ABSTRACT. One-shot medical image segmentation faces fundamental challenges in prototype representation due to limited annotated data and significant anatomical variability across patients. Traditional prototype-based methods rely on deterministic averaging of support features, creating brittle representations that fail to capture intra-class diversity essential for robust generalization. This work introduces Diffusion Prototype Learning (DPL), a novel framework that reformulates prototype construction through diffusion-based feature space exploration. DPL models one-shot prototypes as learnable probability distributions, enabling controlled generation of diverse yet semantically coherent prototype variants from minimal labeled data. The framework operates through three core innovations: (1) a diffusion-based prototype enhancement module that transforms single support prototypes into diverse variant sets via forward-reverse diffusion processes, (2) a spatial-aware conditioning mechanism that leverages geometric properties derived from prototype feature statistics, and (3) a conservative fusion strategy that preserves prototype fidelity while maximizing representational diversity. DPL ensures training-inference consistency by using the same diffusion enhancement and fusion pipeline in both phases. This process generates enhanced prototypes that serve as the final representations for similarity calculations, while the diffusion process itself acts as a regularizer. Extensive experiments on abdominal MRI and CT datasets demonstrate significant improvements on both modalities, establishing new state-of-the-art performance in one-shot medical image segmentation. |
| 12:02 | (Online) GRASP-former: A Lightweight Global-Random Sparse Attention for Domain-Aware Multi-Class Obscenity Detection PRESENTER: Shukla Mondal ABSTRACT. The availability of obscene content across diverse domains poses a threat to user well-being, especially children, as well as adults at the workplace. Detecting obscene content across image domains is vital, whereas a single classifier approach lacks the ability to represent the domain differences semantically. To tackle this issue, we propose a domain-aware obscenity classification framework in this paper. It trains the domain-aware classifier with shared head that aligns the semantic boundary between obscene and non-obscene content across domains. We present GRASP-former, a novel feature representation architecture for obscene image classification, where the random–global sparse attention builds a lightweight global context by attending to a small set of learnable global tokens and randomly sampled tokens. We fuse it with depthwise local convolution, which further refines the obscenity features to distinguish visual ambiguities present in obscene and non-obscene classes. We evaluate the proposed architecture with standard performance metrics using samples from the NPDI and NSFW datasets. |
| 12:04 | (Online) Utilizing Keypoint R-CNN for Automated Root Angulation Detection in OPGs PRESENTER: Hira Qayyum ABSTRACT. Accurate measurement of root angulation is crucial in orthodontics for treatment planning and diagnosis. Traditional methods rely on manual estimation, which is time-consuming and prone to errors. Existing automated approaches focus mainly on tooth segmentation and lack precise detection of key points at both the crown and root. There is a need for a deep learning based solution that can accurately identify these keypoints and compute root angulation from panoramic dental X-rays (OPGs). This study uses keypoint R-CNN to detect tooth crown and root points in OPGs, enabling an automated calculation of angulation. The model was trained on annotated dental X-rays and tested on unseen images for validation. The results show that our approach achieves high accuracy in keypoint detection and provides consistent angulation measurements, reducing the risk of human error. The automated system can assist orthodontists in making precise assessments, improving efficiency in dental diagnostics. Our work contributes to a novel application of Keypoint R-CNN for root angulation analysis, enhancing automated orthodontic evaluation. |
| 12:06 | (Online) Cybersickness in VR: State-of-the-Art and Future Research Agenda ABSTRACT. Cybersickness (CS) remains a critical barrier to the widespread adoption of virtual reality (VR), manifesting as nausea, disorientation, and visual fatigue despite significant technological advancements. This paper presents a systematic umbrella review of 25 survey papers (January 2020–May 2025) to quantitatively synthesize evidence on CS influencing factors, assessment methods, and mitigation strategies. Our analysis reveals that latency is the most consistently cited hardware factor (72% of reviews), while demographic factors like age and gender show inconclusive results due to methodological confounds. Content design choices, particularly continuous locomotion, are major contributors to CS, whereas discrete movement methods like teleportation significantly reduce symptoms (76% of reviews). A critical finding is the field’s overreliance on subjective questionnaires like the Simulator Sickness Questionnaire (SSQ), despite widespread recognition of their limitations, and a persistent gap between the discussion of objective physiological measures and their implementation. Based on these insights, we propose a research agenda advocating for: 1) multi-factorial studies to disentangle variable interactions, 2) standardized multi-modal assessment protocols, and 3) rigorous testing of closed-loop adaptive systems. This review provides a comprehensive evidence map and actionable roadmap to advance CS research from fragmented findings toward engineered solutions for improved VR accessibility and comfort. |
| 12:08 | (Online) Enhanced Emphysema Classification in CT Images Using RIU4-LQP and Spatial Texture Features ABSTRACT. Emphysema is a major subtype of chronic obstructive pulmonary disease (COPD) which presents significant challenges in early detection and subtype classification. Computed tomography (CT) imaging provides valuable structural detail, and texture-based analysis has become a key strategy for identifying emphysema patterns. This study introduces an improved texture descriptor, Rotation-Invariant Uniform Local Quinary Pattern (RIU4-LQP), designed to reduce feature dimensionality while preserving fine-grained local details. Additionally, the Average Nearest Neighbor Index (ANNI) is integrated to capture spatial distribution features not adequately represented by traditional descriptors. A feature selection step based on recursive feature elimination (RFE) further optimizes the feature set for classification. Using a leave-one-subject-out cross-validation strategy with support vector machines (SVMs), experiments demonstrate that combining RIU4-LQP with ANNI yields superior classification performance compared to standard LQP. The findings highlight the importance of integrating local and spatial texture descriptors, alongside feature selection, for robust emphysema subtype classification in CT imaging. |
| 12:10 | (Online) Depth-Aware YOLO Segmentation: Enhancing Small Object Detection via MiDaS-Based Spatial Reasoning ABSTRACT. In this paper, we propose a depth-enhanced object detection system that couples MiDaS monocular depth estimation with YOLO-based segmentation to improve the robustness and accuracy of small object detection. Conventional object detectors such as the YOLO family depend on 2D spatial cues and are depth-unaware, so they misdetect or lose objects in visually complex or depth-ambiguous scenes. This limitation is especially acute when detecting small objects at varied distances. By using a MiDaS-computed depth map as an input during post-processing, we propose a spatial reasoning mechanism with depth-guided filtering and prioritization that improves the accuracy of YOLO segmentation. The method was tested on a small fruit dataset, with YOLO segmentation fine-tuned for 20 epochs. Experimental results demonstrate fewer false positives and improved localization of small objects, especially objects closer to the camera. Our findings suggest that the use of depth cues leads to dramatically improved object detection performance, with interesting directions for future work in depth-guided object recognition. |
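A hedged sketch of the depth-guided post-processing idea from this last abstract: estimate a relative depth map with MiDaS, then rank YOLO boxes by the median inverse depth inside each box so nearer objects are prioritized. The model variant and the nearest-first priority rule are assumptions, not the paper's exact mechanism.

```python
# Estimate a relative (inverse) depth map with MiDaS and rank detection boxes
# by median closeness. Model variant and priority rule are assumptions.
import cv2
import numpy as np
import torch

midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform
midas.eval()

def depth_map(bgr: np.ndarray) -> np.ndarray:
    rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
    with torch.no_grad():
        pred = midas(transform(rgb))                    # (1, H', W')
        pred = torch.nn.functional.interpolate(
            pred.unsqueeze(1), size=rgb.shape[:2],
            mode="bicubic", align_corners=False).squeeze()
    return pred.numpy()       # relative inverse depth: larger = closer

def prioritise(boxes, depth):
    """boxes: iterable of (x1, y1, x2, y2); returns nearest-first ordering."""
    def closeness(b):
        x1, y1, x2, y2 = map(int, b)
        return float(np.median(depth[y1:y2, x1:x2]))
    return sorted(boxes, key=closeness, reverse=True)
```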
Paper Session 1
Paper Session 3