| 13:30 | Echocardiogram to CMR Image Synthesis using Generative Models ABSTRACT. Echocardiograms provide noninvasive real-time data for assessing the structure and function of the heart and can assist in diagnosing several conditions. However, they are highly operator-dependent, often yield incomplete or suboptimal views, and can be challenging to interpret. In contrast, Cardiac Magnetic Resonance (CMR) delivers comprehensive and detailed evaluations but remains time-consuming and costly. To address these limitations, this study investigates cross-modal generative modeling for synthesizing CMR sequences directly from 2D Transthoracic Echocardiography (TTE) with temporal information. We propose a novel model that combines an autoencoder (AE) backbone for feature extraction with a vision transformer (ViT) to capture global temporal and spatial dependencies, thereby enabling the prediction of CMR sequences with preserved dynamics. The performance of this architecture is compared with alternative generative models to assess quantitative accuracy. Experimental results show that the proposed ViT-AE model with 12 layers achieved the best performance, with an MAE of 0.08, an SSIM of 0.67, and a PSNR of 18.45. |
| 13:45 | (Online talk) Progressive Distillation Attention for Robust Left Ventricular Ejection Fraction Estimation ABSTRACT. Accurate estimation of the left ventricular ejection fraction (LVEF) from echocardiography is essential for the assessment of cardiovascular risks, but remains highly dependent on expert interpretation. We propose a lightweight encoder–decoder architecture with attention-guided skip refinements for robust left ventricle segmentation and automated LVEF computation from echocardiogram video and image samples. Unlike conventional U-Net models, our approach integrates dense connections, feedback loops, and Progressive Distillation Attention Blocks (PDABs). The PDAB modules selectively refine shallow encoder features before fusion with decoder representations, enabling progressive distillation of fine spatial details and deeper semantic cues. This refinement strategy ensures precise delineation of ventricular boundaries even in low-quality echocardiographic frames, directly improving the reliability of ejection fraction estimation. With only 2.4M parameters, our model outperforms prior state-of-the-art methods. On the EchoNet-Dynamic dataset, it achieves a 15.85% improvement in R2 with 69.4% and 4.6% reductions in MAE and RMSE, respectively. On the CAMUS dataset, it reduces MAE by 79.06% while improving Pearson’s correlation coefficient and R2 by 19.51% and 1.02%, respectively. These results demonstrate that the PDAB-enhanced method provides a compact and accurate solution for automated LVEF estimation from echocardiography. |
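Downstream of segmentation, the LVEF computation itself reduces to a volume ratio between end-diastole and end-systole; a minimal sketch of that final step (the function name, units, and example volumes are illustrative, not from the paper):

```python
def ejection_fraction(edv_ml: float, esv_ml: float) -> float:
    """Left ventricular ejection fraction (%) from end-diastolic (EDV)
    and end-systolic (ESV) volumes, the quantity a segmentation
    pipeline ultimately feeds: EF = 100 * (EDV - ESV) / EDV."""
    if edv_ml <= 0 or esv_ml < 0 or esv_ml > edv_ml:
        raise ValueError("volumes must satisfy 0 <= ESV <= EDV, EDV > 0")
    return 100.0 * (edv_ml - esv_ml) / edv_ml
```

A normal heart with EDV 120 mL and ESV 50 mL yields an EF of roughly 58%, inside the commonly cited normal range.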
| 14:00 | Nissl-stained Histological Slice Image Completion Based on Generated Masks ABSTRACT. In neuroanatomy, studying histological brain sections cut from non-human primate brains provides valuable insights into how the human brain operates, as they share structural and functional similarities. However, acquiring histological brain sections of a primate animal can be challenging; inevitable human error can damage the sections, making the samples harder to interpret. Histological image completion therefore plays a critical role, paving the way to repairing damaged slices in digitized form. In this study, we used the common marmoset monkey (Callithrix jacchus), a New World monkey whose brain shares functional similarities with humans. Two state-of-the-art models were tested and compared, namely Deep Fusion Network (DFNet) and Pluralistic Image Completion with Reduced Information Loss (PUT). These models were first introduced for scenery and face image completion, and have not previously been applied to histological images of the brain. DFNet achieved a mean squared error of 0.0061±0.0025 with the square mask, and 0.0361±0.0095 with the irregularly shaped mask. For the PUT model, the mean squared error reached 0.0105±0.0035. These results demonstrate that both models can successfully reconstruct missing regions in Nissl-stained brain slice images, contributing to the development of future digital histological repair pipelines. |
| 14:15 | Seeing Beyond the Airways: Asthma Prediction via Cross-Attention on Dual Retinal Modalities ABSTRACT. Asthma’s systemic effects and chronic underdiagnosis trigger avoidable exacerbations and a heavy healthcare burden. Effort-dependent spirometry and specialist-only tests block population-scale screening. We propose a cross-modal attention framework for non-invasive asthma classification from dual retinal modalities, colour fundus photographs (CFP; Type 1: posterior pole view; Type 2: optic nerve head view) and optical coherence tomography (OCT) measurements, where systemic changes manifest in retinal structure and vasculature. Fundus images are encoded by a CNN backbone followed by multi-head self-attention (MHSA); OCT metrics are embedded by a lightweight feed-forward encoder; cross-modal attention (CMA) then fuses the two streams to capture intermodal dependencies. Trained and evaluated on a novel dual-modality dataset curated at the UNSW Centre for Eye Health (CFEH), the model achieves an AUC of 0.97 and offers improved interpretability via attention-weight visualisations. These results support retinal biomarkers as a scalable pathway for early asthma detection and open a window to population-level oculomic screening for other systemic diseases (e.g., neurodegenerative and cardiovascular diseases), highlighting the promise of CMA for ocular imaging. |
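The cross-modal attention (CMA) fusion described above can be sketched in its standard single-head form, with queries drawn from one modality and keys/values from the other; the weight matrices, shapes, and inputs below are illustrative, not the paper's architecture:

```python
import numpy as np

def cross_attention(q_src, kv_src, wq, wk, wv):
    """Single-head cross-modal attention: tokens of one modality
    (e.g. fundus features) attend over tokens of the other
    (e.g. OCT embeddings): softmax(QK^T / sqrt(d)) V."""
    q, k, v = q_src @ wq, kv_src @ wk, kv_src @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # numerically stable softmax over the key axis
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v          # each output row: convex mix of value rows
```

Because each output row is a convex combination of the value rows, the attention weights themselves can be visualised for interpretability, as the abstract describes.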
| 14:30 | (Online talk) Enhanced Graph Convolutional Network with Chebyshev Spectral Graph and Graph Attention for Autism Spectrum Disorder Classification PRESENTER: Adnan Ferdous Ashrafi ABSTRACT. Autism Spectrum Disorder (ASD) is a complicated neurodevelopmental disorder marked by variation in symptom presentation and neurological underpinnings, making early and objective diagnosis highly challenging. This paper presents a Graph Convolutional Network (GCN) model, incorporating Chebyshev Spectral Graph Convolution and Graph Attention Networks (GAT), to increase the classification accuracy of ASD utilizing multimodal neuroimaging and phenotypic data. Leveraging the ABIDE I dataset, which contains resting-state functional MRI (rs-fMRI), structural MRI (sMRI), and phenotypic variables from 870 patients, the model uses a multi-branch architecture that processes each modality individually before merging them via concatenation. Graph structure is encoded using site-based similarity to generate a population graph, capturing relationships across individuals. Chebyshev polynomial filters provide localized spectral learning with lower computational complexity, whereas GAT layers enrich node representations through attention-weighted aggregation of neighbourhood information. The proposed model is trained using stratified five-fold cross-validation with a total input dimension of 5,206 features per individual. Extensive experiments demonstrate the enhanced model’s superiority, achieving a test accuracy of 74.82% and an AUC of 0.82 on the entire dataset, surpassing multiple state-of-the-art baselines, including conventional GCNs, autoencoder-based deep neural networks, and multimodal CNNs. |
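The Chebyshev filtering step admits a compact sketch via the standard recurrence T_0 = I, T_1 = L̃, T_k = 2L̃T_{k-1} − T_{k-2} on the rescaled Laplacian, which is what keeps the filter localized to K-hop neighbourhoods; the coefficients and toy inputs below are illustrative, not the paper's trained model:

```python
import numpy as np

def chebyshev_filter(L_norm, x, thetas):
    """K-order Chebyshev spectral graph filter:
    y = sum_k theta_k * T_k(L~) x, with the three-term recurrence
    T_0 = I, T_1 = L~, T_k = 2 L~ T_{k-1} - T_{k-2}."""
    t_prev = x                        # T_0(L~) x = x
    y = thetas[0] * t_prev
    if len(thetas) > 1:
        t_curr = L_norm @ x           # T_1(L~) x
        y = y + thetas[1] * t_curr
        for theta in thetas[2:]:
            t_next = 2 * (L_norm @ t_curr) - t_prev
            y = y + theta * t_next
            t_prev, t_curr = t_curr, t_next
    return y
```

Only sparse matrix-vector products with L̃ are needed, which is the source of the lower computational complexity the abstract mentions.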
| 14:45 | 3D Gaussian Splatting Reconstruction from Simulated CT Projections with Geometric Initialization ABSTRACT. We develop a customized initialization method for 3D Gaussian Splatting, aimed at extending its application to Computed Tomography (CT) reconstruction. Initialization in 3D Gaussian Splatting is a crucial step and can be accomplished using several techniques. The official 3D Gaussian Splatting pipeline uses the Structure-from-Motion (SfM) technique in its initialization step. While SfM works well for natural scene photographs, it is not directly applicable in the medical domain, more specifically in the CT environment. To address this limitation, we propose a customized, geometry-aware initialization method that is compatible with parallel-beam CT geometry. We investigated 16 simulated CT datasets along with the Shepp-Logan phantom. These simulated models were generated with the TomoPhantom toolbox, which provided 2D projection images as the ground truth. These ground truth images and the 3D models were used in our customized 3D Gaussian placement strategy, ensuring accurate camera orientation and 3D point sampling for parallel-beam CT reconstruction. We obtained rendered images corresponding to their ground truth projections that largely preserved the true geometric structures. For the Shepp-Logan phantom, we achieved a test PSNR of 29.987 and an L1 loss of 0.015 after 30,000 iterations. Future work may extend this approach to real CT data with different scanner acquisition geometries, such as cone beam or helical. |
| 15:00 | Multimodal Cross-Attention for Range of Motion Assessment ABSTRACT. Accurate and automated Range of Motion (ROM) assessment is essential for rehabilitation, physical therapy, and post-surgical recovery. Traditional manual goniometer-based evaluations suffer from subjectivity, inter-rater variability, and reliance on trained professionals. Although RGB-based computer vision enables markerless ROM estimation, it remains susceptible to occlusions, pose estimation errors, and difficulties detecting subtle joint movements. Furthermore, vision alone cannot capture neuromuscular activation, which is crucial for understanding joint dynamics. To address these challenges, we propose a multi-modal deep learning framework that integrates RGB-based motion tracking with electromyography (EMG) signals. EMG provides neuromuscular activation data, enhancing robustness against visual occlusions and improving sensitivity to subtle joint displacements. Our method employs an Hourglass-based convolutional neural network (CNN) for spatial feature extraction and a gated recurrent unit (GRU)-based model for temporal EMG processing. To further enhance performance, we introduce feature-level and modality-level attention modules, dynamically emphasizing the most informative features and modality contributions. Experimental results demonstrate that our proposed model achieves an overall RMSE of 2.55, with additional gains contributed by the feature-level and modality-level attention mechanisms. Moreover, the fully fused RGB-EMG model outperforms RGB-only approaches, particularly in accurately predicting subtle ROM movements. |
| 15:15 | Risk-Controlled Multimodal Emotion Coaching for Autism Support Using Self-Supervised Vision and Speech Encoders ABSTRACT. Challenges in social communication and emotion recognition characterize autism spectrum disorder (ASD). However, many existing digital interventions fall short, lacking the multimodal, adaptive, and safety-conscious frameworks necessary for adequate real-world support. This paper introduces a risk-controlled multimodal emotion coaching (RC-MEC) system designed to provide personalized and safe affective learning for individuals with autism. Our framework combines vision and speech using self-supervised learning (SSL) to capture rich, contextualized representations. We validated RC-MEC on the FER-2013 facial emotion and RAVDESS speech datasets. At the core of RC-MEC is a risk-control module, driven by conformal prediction, which dynamically regulates the agent’s actions. Interventions are delivered only within a predefined low-risk confidence limit, achieving 90.1% coverage at a target error tolerance of α = 0.1 and 92.2% singleton accuracy when confident. The results confirm that the proposed multimodal and risk-aware system offers a safer, more effective, and reliable tool for emotion coaching in autism, thereby paving the way for more responsible and user-centered assistive technologies. |
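The conformal risk control described above can be sketched with standard split conformal prediction: calibrate a nonconformity threshold at tolerance α, then act only on confident (ideally singleton) prediction sets. A minimal sketch under that assumption (the scores and classes below are illustrative, not RC-MEC's):

```python
import math

def conformal_threshold(cal_scores, alpha=0.1):
    """Split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest
    calibration nonconformity score (e.g. 1 - p_true per sample)."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))   # rank of the quantile
    return sorted(cal_scores)[min(k, n) - 1]

def prediction_set(probs, qhat):
    """All labels whose nonconformity 1 - p stays under the threshold;
    a risk-aware coach would intervene only when this is a singleton."""
    return [c for c, p in enumerate(probs) if 1 - p <= qhat]
```

The marginal guarantee is that the true label falls in the set with probability at least 1 − α, matching the ~90% coverage reported at α = 0.1.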
| 16:00 | Mobile Robot Navigation Method based on Multiple External Cameras in Crowded Environment ABSTRACT. Existing navigation approaches for mobile robots in crowded environments predominantly rely on on-board sensors like LiDAR and monocular cameras, suffering from limited sensing coverage and occlusion issues that hinder comprehensive perception of dynamic surroundings. This paper presents a novel navigation framework leveraging a multi-camera system deployed in the environment to enable holistic environmental perception and robust robot navigation. The framework introduces a Generalized Multi-View Detection (GMVD) algorithm with learnable adaptive projection and dynamic view fusion, which uses markers to assist in robot localization. The navigation layer integrates an improved A* algorithm with a hierarchical strategy combining speed barriers and dynamic window approaches to achieve collision-free path planning. Real-world experiments comparing the proposed method with previous crowd navigation algorithms demonstrate that it significantly enhances the robot’s navigation performance, generating obstacle-free paths for safe and efficient navigation in crowded scenarios. |
| 16:15 | SAM-Based Leaf Segmentation with Morphological Quality Assessment for Enhanced Plant Disease Detection ABSTRACT. Plant diseases threaten global food security through substantial agricultural losses, yet traditional visual inspection often fails to detect early-stage symptoms critical for timely intervention. While deep learning models have shown promise for automated disease detection, their performance often degrades in realistic field conditions. This study investigates whether data-centric preprocessing can improve computer vision performance for apple leaf disease detection. We present the first systematic evaluation of the Segment Anything Model (SAM) combined with morphological quality assessment for leaf segmentation, compared against whole-image classification using the PlantPathology FGVC7 dataset (3,642 apple orchard images). To ensure segmentation reliability, we introduce a five-metric morphological framework (area ratio, aspect ratio, spatial coverage, centroid proximity, border penalty). Experiments with ResNet-18 under 3-fold cross-validation reveal class-specific effects: SAM improves F1 by 3.0% for the minority multiple diseases class, but decreases it by 1.4% for healthy leaves, where contextual cues aid detection. Rust and scab remain stable above 95% F1, reflecting their distinctive visual signatures. GradCAM++ confirms that preprocessing redirects attention toward disease-relevant regions, particularly in complex multiple-disease cases. Overall, these findings show that adaptive preprocessing, rather than universal background removal, offers practical benefits for precision agriculture. |
| 16:30 | Attention‑Guided Band Pruning for Efficient Hyperspectral Early Grape Leaf Disease Detection ABSTRACT. Early detection of grapevine diseases, particularly grapevine leafroll-associated virus (GLRaV) and grapevine red blotch virus (GRBV), is impeded by presymptomatic presentation and on-device computing limitations. This work introduces attention-guided band pruning (AGBP), an embedded selector that learns per-band importance within a 3D ResNet-18 hyperspectral pipeline under an entropy penalty, then retrains compact k-band models whose cost scales with k. Two aggregation rules, EMA-stream and trimmed mean, convert per-sample weights into stable global rankings. Under a unified protocol with fixed plant-level splits and standardized preprocessing on a 40-band grapevine dataset, AGBP improves baseline accuracy and yields compact models that retain about 99% of baseline AUROC at k=4–6 while cutting FLOPs and latency by up to an order of magnitude. Compared with Pearson, ReliefF, and CARS, AGBP performs best at very small spectral budgets and provides predictable accuracy–efficiency curves suitable for edge deployment. |
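The selector idea behind attention-guided band pruning can be sketched as a softmax over learnable per-band logits, an entropy penalty that pushes toward a peaked band distribution, and a global top-k ranking; all names here are illustrative, and the paper's 3D ResNet-18 pipeline and aggregation rules are not reproduced:

```python
import math

def band_importance(logits):
    """Softmax over learnable per-band logits -> importance weights."""
    mx = max(logits)                      # subtract max for stability
    exps = [math.exp(l - mx) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy_penalty(weights):
    """Entropy of the band distribution; adding this to the loss
    penalizes diffuse attention, encouraging a few decisive bands."""
    return -sum(w * math.log(w) for w in weights if w > 0)

def top_k_bands(weights, k):
    """Global ranking -> indices of the k most important bands,
    which a compact k-band model would then be retrained on."""
    return sorted(range(len(weights)), key=lambda i: -weights[i])[:k]
```

With k of 4–6 out of 40 bands, only the selected channels need to be acquired and processed, which is where the FLOPs and latency savings come from.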
| 16:45 | Automated Activity Monitoring of Cryptic Species in a Zoo Environment ABSTRACT. Automated monitoring of cryptic species in a zoo environment provides a valuable opportunity to study behaviours and ecological patterns to inform conservation efforts that would otherwise remain poorly documented. In this study, we explore the problem of monitoring a population of endangered Whitaker's Skink (*Oligosoma whitakeri*), a cryptic species that hides under leaf litter for most of the day, held in captivity at Te Nukuao Wellington Zoo, New Zealand. We propose a lightweight pipeline based on the Structural Similarity Index (SSIM) between frames sampled from videos and identify time points where the skinks are active in their habitats. We benchmark our pipeline against a deep-learning approach based on the MegaDetector model, explore the influence of both the sampling rate and choice of difference metric on detection performance, and discuss the challenges of implementing our pipeline. The proposed method provides scalable, interpretable, and low-cost monitoring of skink activity and can be integrated with existing zoo camera systems to reduce the daily workload of keepers without requiring intensive computing resources. |
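The SSIM comparison at the heart of the pipeline can be sketched with the single-window form of the index (a simplification: a practical implementation would slide a window over full frames; the constants follow the standard formulation):

```python
from statistics import fmean

def global_ssim(x, y, data_range=255.0):
    """Single-window SSIM between two flattened grayscale frames:
    ((2*mu_x*mu_y + C1)(2*cov + C2)) /
    ((mu_x^2 + mu_y^2 + C1)(var_x + var_y + C2)).
    A low score between consecutive samples flags activity."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mx, my = fmean(x), fmean(y)
    vx = fmean((px - mx) ** 2 for px in x)
    vy = fmean((py - my) ** 2 for py in y)
    cov = fmean((px - mx) * (py - my) for px, py in zip(x, y))
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```

Identical frames score 1.0; thresholding the score between sampled frames yields the interpretable, low-compute activity signal the abstract describes.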
| 17:00 | YOLO and SGBM Integration for Autonomous Tree Branch Detection and Depth Estimation in Radiata Pine Pruning Applications ABSTRACT. Manual pruning of radiata pine trees poses significant safety risks due to extreme working heights and challenging terrain. This paper presents a computer vision framework that integrates YOLO object detection with Semi-Global Block Matching (SGBM) stereo vision for autonomous drone-based pruning operations. Our system achieves precise branch detection and depth estimation using only stereo camera input, eliminating the need for expensive LiDAR sensors. Experimental evaluation demonstrates YOLO’s superior performance over Mask R-CNN, achieving 82.0% mask mAP50–95 for branch segmentation. The integrated system accurately localizes branches within a 2-meter operational range with processing times under one second per frame. These results establish the feasibility of cost-effective autonomous pruning systems that enhance worker safety and operational efficiency in commercial forestry. |
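Once SGBM yields a disparity map, metric depth follows from the standard rectified-stereo relation Z = f·B/d; a minimal sketch (the focal length and baseline values below are illustrative, not the paper's rig):

```python
def depth_from_disparity(disparity_px: float, focal_px: float,
                         baseline_m: float) -> float:
    """Depth Z (metres) of a matched point under the rectified
    stereo model Z = f * B / d, with f in pixels, baseline B in
    metres, and disparity d in pixels."""
    if disparity_px <= 0:
        return float("inf")   # no match / point at infinity
    return focal_px * baseline_m / disparity_px
```

For example, with an 800 px focal length and a 0.25 m baseline, a 100 px disparity places a branch at 2 m, at the edge of the stated operational range.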
| 17:15 | (Online talk) Y-LIChess: Live and Interactive Over-the-Board Chess Recognition and Play with YOLO ABSTRACT. Chess is widely played on computers, yet over-the-board (OTB) chess remains the official and preferred format for many players due to its tactile and immersive nature. Bridging digital and physical play requires accurate recognition of OTB positions. Prior research has explored modular pipelines for board localization, square occupancy, and piece classification, as well as one-shot detectors. While these approaches demonstrate strong accuracy in controlled conditions, they often accumulate errors across stages, face latency and robustness issues, and rarely support interactive play. Thus, in this work, we present Y-LIChess, a YOLO-based system for live, interactive OTB play with engines and online platforms. Y-LIChess employs semi-automatic calibration, event-triggered recognition, and legality-aware validation to ensure seamless, low-latency interaction. On our wood180 dataset, built with an active learning process to reduce manual annotation, Y-LIChess achieves 99.36 AP50 with only 0.21% per-square error, reconstructs 100% of boards within one mistake, and performs FEN reconstruction in ~7 ms on a GPU, more than an order of magnitude faster than prior pipelines. |
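Board-state reconstruction ends in a FEN string; a minimal sketch of its piece-placement field, assuming the detector has already produced an 8×8 array of piece letters (empty string for empty squares; the representation is illustrative, not Y-LIChess's internal one):

```python
def board_to_fen_placement(board):
    """Piece-placement field of a FEN string from an 8x8 array,
    rank 8 first; runs of empty squares collapse into digits."""
    ranks = []
    for rank in board:
        row, empties = "", 0
        for sq in rank:
            if sq == "":
                empties += 1
            else:
                if empties:
                    row += str(empties)
                    empties = 0
                row += sq
        if empties:
            row += str(empties)
        ranks.append(row)
    return "/".join(ranks)
```

Feeding the reconstructed FEN to an engine or online platform is what enables the interactive play the abstract describes; legality-aware validation can then reject FENs unreachable from the previous position.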
| 17:30 | Evaluating Human Perception of Automatically Created Synthetic Road Networks that Integrate Real-world Cost Factors and Terrain Features ABSTRACT. Creating synthetic road networks that are both realistic and convincing within virtual environments can be time-consuming and difficult. Several researchers have investigated the use of procedural algorithms for automatically creating road networks. While the results are often visually pleasing, they may not be plausible since they do not consider real-world factors such as the construction costs of different designs. We designed, implemented and evaluated an automated system that can generate virtual road networks based upon real-world information, including terrain features, land usage details, and economic cost estimates, to procedurally create realistic road layouts. A user study (n=32) showed that our system produced more realistic roads than previous work, and the resulting road network was perceived to be as realistic and cost-efficient as the equivalent real road network of the given land area. Changing parameters can significantly alter outcomes; in our experiments, a small population size produced the most plausible results. |
| 17:45 | Fake Money, Real Threat: Fooling Wavelet-Based Banknote Authentication with AdvGAN PRESENTER: Julian Knaup ABSTRACT. As machine learning models are increasingly deployed, their vulnerability to adversarial examples poses a significant threat to security-relevant applications. Financial transactions, in particular, rely on the assumption that payments are legitimate, which makes banknote authentication an essential use case. Banknotes incorporate several security features, and the applied printing technique itself can be leveraged for authentication. Specifically, Intaglio printing results in fine line work and microstructures that can be analyzed and distinguished by spatial frequency analysis, e.g. the wavelet packet transform. By evaluating statistical moments of wavelet coefficient histograms, fast and reliable authentication is achieved. This paper adapts the AdvGAN framework to the context of wavelet-based banknote authentication. By proposing a customized loss function that constrains the feature space, highly effective yet subtle adversarial examples are generated. These perturbations deceive the authentication system, causing it to classify forgeries as genuine banknotes and genuine ones as forgeries. Under attack, classification accuracy drops from 100% to as low as 0%. Finally, limitations and countermeasures are outlined, highlighting potential challenges of deploying such attacks in practical scenarios. |
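The wavelet-moment features that the attacked classifier relies on can be sketched with a one-level Haar transform standing in for one packet of the wavelet packet transform (a simplification; the moments are the standard mean, variance, skewness, and kurtosis of the coefficient distribution):

```python
from statistics import fmean

def haar_detail(signal):
    """One-level Haar detail coefficients (scaled pairwise differences)
    of a 1D signal, e.g. one scan line of a printed banknote; fine
    Intaglio line work shows up as large detail coefficients."""
    return [(signal[i] - signal[i + 1]) / 2 ** 0.5
            for i in range(0, len(signal) - 1, 2)]

def coefficient_moments(coeffs):
    """First four statistical moments of the coefficient distribution,
    the kind of histogram features an authentication classifier uses."""
    m = fmean(coeffs)
    var = fmean((c - m) ** 2 for c in coeffs)
    std = var ** 0.5 if var > 0 else 1.0
    skew = fmean(((c - m) / std) ** 3 for c in coeffs)
    kurt = fmean(((c - m) / std) ** 4 for c in coeffs)
    return m, var, skew, kurt
```

An AdvGAN-style attack perturbs the image so these moment features cross the classifier's decision boundary while the perturbation itself stays visually subtle.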