SOICT 2025: THE 14TH INTERNATIONAL SYMPOSIUM ON INFORMATION AND COMMUNICATION TECHNOLOGY
PROGRAM FOR SATURDAY, DECEMBER 13TH

08:30-09:10 Session 7: Keynote III: Josiah Poon (University of Sydney, Australia)

Like water that adapts to any container, documents need Natural Language Processing (NLP) systems that adapt and move across modalities, pages, and scales to find verifiable evidence. In this talk, I will share a practical agenda to build NLP systems that ingest text, images, layout, tables, and figures and produce traceable answers. We emphasise three pillars: integration, learning and retrieval. Integration: fuse multimodal features and layout-aware encodings so text and visual content are interpreted together. Learning: train specialist teachers across modalities and distil their feature knowledge into compact, deployable students for NLP tasks. Retrieval: adopt a retrieval-first approach, using multipage and multimodal retrieval to find candidate passages, tables and figures, then chain those candidates into a clear evidence trail. I demonstrate how graph-based encodings and multiscale reasoning work together, and how multiteacher distillation compacts expert knowledge into deployable students. Then, with concise multimodal case studies and retrieval-centric metrics, I show measurable gains in evidence grounding, generalisation and operational readiness. I conclude with practical measures to control complexity and annotation cost, and present simple experiments and evaluation criteria for different domains.

Location: Ballroom
09:10-09:50 Session 8: Keynote IV: Tung Kum Hoe Anthony (National University of Singapore, Singapore)

In an era dominated by massive foundation models, smaller players risk being left behind—unable to afford the scale, data, or manpower that large AI systems demand. This talk introduces the concept of Prudent AI—an approach that emphasizes right-sized, lightweight, and explainable intelligence delivered through just-in-time, Plug-and-Play AI Boxes. Focusing on applications like early anomaly detection in multivariate time series, we demonstrate how our AI Boxes use sparse data, minimal compute, and human-guided refinement to detect rare but critical events. The architecture integrates symbolic reasoning, data-driven refinement, and secure edge deployment, showing how being small can actually be a strength in resource-constrained settings. Through this, we reimagine how organizations can adopt AI that is transparent, agile, and sustainable.

Location: Ballroom
10:20-12:00 Session 9A: SOICT Technical Session XIII: Lifelog Event Retrieval
Location: Ballroom A
10:20
Toward Abstraction-Level Event Retrieval in Large Video Collections: Leveraging Human Knowledge and LLM-Based Reasoning in the Ho Chi Minh City AI Challenge 2025

ABSTRACT. The Ho Chi Minh City AI Challenge 2025 focused on advancing abstraction-level event retrieval from large, diverse video collections. Key tasks included Known-Item Search (KIS), Question Answering (Q&A), and the demanding Temporal Retrieval and Alignment of Key Events (TRAKE), requiring precise temporal coherence. Solutions integrated robust multimodal architectures (VLM, OCR, ASR) with Large Language Models (LLMs) for semantic reasoning and query refinement. Critical advancements in temporal modeling, often using Dynamic Programming, ensured event sequence coherence. This collection provides 48 papers from participating teams, demonstrating significant progress in scalable and context-aware retrieval systems.

10:40
Real-Time Hybrid Multimodal Retrieval System for AI Challenge HCMC 2025

ABSTRACT. This paper presents our team’s system for the AI Challenge 2025 (AIC 2025), addressing three video retrieval tasks: Known-Item Search (KIS), Question Answering (Q&A), and Tracking-Based Known Event (TRAKE). Building on our previous work [1], we redesigned the real-time framework to improve semantic precision, temporal consistency, and multi-user scalability. The system integrates a hybrid CLIP-SigLIP2 embedding engine, context-aware reranking, and temporal fusion, deployed on an asynchronous FastAPI-Redis-Milvus stack for low-latency processing. Tested on the official AIC 2025 dataset, it achieved 9% higher Recall@10 over single-model baselines with < 180ms average latency, ranking first in the High-School Division. These results demonstrate that combining hybrid multimodal embeddings with contextual post-processing yields a robust, real-time solution for large-scale video retrieval.
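
The abstract does not detail how the hybrid CLIP-SigLIP2 engine fuses scores. A minimal late-fusion sketch, assuming both encoders emit L2-normalized embeddings and that the mixing weight `alpha` is a hypothetical parameter tuned on validation queries (neither assumption is stated in the paper):

```python
import numpy as np

def hybrid_similarity(q_clip, q_siglip, f_clip, f_siglip, alpha=0.5):
    """Late fusion of two embedding spaces (hypothetical weighting).

    q_*: (d,) query embeddings; f_*: (n, d) frame embeddings.
    All vectors are assumed L2-normalized, so dot product = cosine.
    """
    s1 = f_clip @ q_clip        # CLIP cosine similarities, shape (n,)
    s2 = f_siglip @ q_siglip    # SigLIP similarities, shape (n,)
    return alpha * s1 + (1.0 - alpha) * s2

# Top-10 frames for a Recall@10-style evaluation:
# scores = hybrid_similarity(qc, qs, Fc, Fs)
# top10 = np.argsort(-scores)[:10]
```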

11:00
Towards Conversational Video Retrieval with an Intelligent Search Agent

ABSTRACT. Advances in technology have led to an explosive growth of multimedia content, especially videos. This creates an increasing demand for reliable video event retrieval frameworks capable of efficiently extracting meaningful events from large-scale video databases. Although recent advances have enabled multimodal and temporal matching, most video retrieval systems still rely on rigid query interfaces and lack the ability to engage in conversational refinement. This limitation arises from the absence of an intelligent agent capable of understanding natural language, maintaining context, and dynamically linking user intent to the retrieval modules. To overcome this limitation, we propose a video event retrieval system that incorporates an intelligent search agent, enabling seamless, conversation-guided video retrieval. The agent engages users in dialogue to interpret intent, answer content-related questions, and iteratively refine search results. It leverages existing multimodal and temporal retrieval modules to ground responses in both textual and visual evidence, while a lightweight plug-and-play feedback fusion allows composed retrieval without retraining. Through this interactive loop, users can explore large-scale video collections in a natural and effective manner. Evaluation in the Ho Chi Minh City AI Challenge 2025 demonstrates the effectiveness of our approach, achieving 85 out of 88 successful retrievals and highlighting the potential of conversational video retrieval.

11:20
Applying Large Language Model (LLM) Agents for Automated Lifelog Retrieval

ABSTRACT. Recent advances in LLMs have enabled new levels of autonomy in complex multimedia retrieval systems. This work introduces an LLM-based Planning Agent for fully automated lifelog retrieval, capable of interpreting user queries and orchestrating multimodal retrieval components without human intervention. Given a query and a registry of available retrieval modules (e.g., by text embedding, by OCR, by activity labels, and by object tags), the agent generates and executes complete retrieval plans, selecting appropriate modalities, configuring parameters, and fusing results through normalization and ranking strategies. The system exhibits agentic behavior by reasoning about task structure, adapting execution flow dynamically, and self-evaluating retrieval outcomes to refine its strategy. This fully autonomous mode demonstrates how LLM-driven agents can transform lifelog retrieval from an expert-guided, interactive process into an intelligent, self-directed search paradigm, which bridges the gap between human expertise and machine understanding.

11:40
Leveraging Composed Image Retrieval Principles for Efficient Textual Feedback in Multimodal Retrieval

ABSTRACT. People watch countless videos every day for learning, entertainment, or work, and as video content continues to grow, finding a specific clip or moment among the vast collections has become increasingly difficult. This growing challenge highlights the need for more effective video retrieval systems that can help users quickly locate the videos they are looking for. However, most existing systems produce static results that do not always align with user intent, compelling users to reformulate their queries multiple times, while the absence of interactive refinement mechanisms makes it difficult to achieve smooth, conversational search experiences. To address this challenge, we propose a feedback-driven textual search system that enables users to provide direct feedback on retrieved images, iteratively refining retrieval results until the desired outcomes are achieved. Through this approach, our system achieved a score of 84.6 out of a maximum of 88 in the preliminary round of the HCM AI Challenge 2025, demonstrating strong real-world performance.

12:00
Estimating size of lesions in Endoscopic Images using depth model-based approaches

ABSTRACT. In clinical environments, especially during minimally invasive procedures such as endoscopy, accurate measurement of lesions such as polyps or tumors plays a vital role in determining disease severity and choosing appropriate treatment methods. However, endoscopic cameras typically provide only two-dimensional RGB images that lack depth information, making the size estimation of internal organs or lesions highly challenging. To address this issue, this study utilizes the ZoeDepth model, a deep learning model capable of estimating depth from a single RGB image. Since depth estimation models such as ZoeDepth have not been trained on endoscopic images, this study addresses that obstacle in order to estimate lesion size from endoscopic cameras. The proposed method is based on a setup with chessboards under controlled conditions. It combines a series of steps that include geometric correction, depth value optimization, and measurement error assessment. The evaluation of the system was performed on multiple datasets, including synthetic and real endoscopic images, showing that the proposed method could predict object sizes with an estimation error smaller than ±2.5 mm. The measurements were stable and relatively accurate, with small deviations, across 100 real polyp images covering different object sizes. This opens up a wide range of applications in the medical field, especially diagnostic and surgical endoscopy, with AI-driven clinical support systems.
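
The abstract omits the measurement formula; under the standard pinhole-camera model, which the calibrated chessboard setup suggests, object size follows from pixel extent, estimated depth, and focal length. A minimal sketch with illustrative numbers (not the paper's calibration values):

```python
def lesion_size_mm(pixel_extent_px, depth_mm, focal_px):
    """Back-project a pixel extent to metric size via the pinhole model:
    real_size = pixel_extent * depth / focal_length (consistent units).
    depth_mm would come from a calibrated monocular depth model
    such as ZoeDepth; focal_px from chessboard camera calibration.
    """
    return pixel_extent_px * depth_mm / focal_px

# Example: a 120 px lesion at 30 mm depth, 1400 px focal length
print(lesion_size_mm(120, 30.0, 1400.0))  # ~2.57 mm
```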

10:20-12:00 Session 9B: SOICT Technical Session XIV: AI Applications
Location: Ballroom B
10:20
FA-Net: A Dual-Branch Attention Architecture for Extracting Fine-Grained Anatomical Features of Wood

ABSTRACT. Accurate identification of wood species is a challenging Fine-Grained Visual Classification (FGVC) task, playing a crucial role in supply chain management and in combating illegal logging. Conventional Convolutional Neural Networks (CNNs) often fail to capture subtle morphological details due to feature compression (global pooling), even though macro-images inherently contain both global structural context and fine-grained cues. To overcome this limitation, we propose FA-Net (Fine-Anatomical Network), a novel dual-branch architecture that employs a global branch to capture global structural context (e.g., porosity types and vessel distribution) and a local branch to preserve local morphological details (e.g., parenchyma patterns and vessel/ray sizes) from macro-scale images. Both branches are enhanced with channel-spatial attention mechanisms and are adaptively fused through a pyramid self-attention module, yielding a highly discriminative representation. Comprehensive experiments across five benchmark datasets demonstrate that FA-Net achieves state-of-the-art accuracy, reaching up to 99.32% (outperforming the DenseNet121 baseline by 4.0%) while maintaining near-real-time inference speed. Interpretability analysis via EigenCAM further confirms that FA-Net successfully attends to critical anatomical traits (such as porosity types and parenchyma patterns). FA-Net provides an efficient, transparent, and deployment-ready solution for practical applications in forestry and customs inspection.

10:40
Adaptive Rainfall Forecasting from Multiple Geographical Models Using Matrix Profile and Ensemble Learning

ABSTRACT. Rainfall forecasting in Vietnam is highly challenging due to its diverse climatic conditions and strong geographical variability across river basins, yet accurate and reliable forecasts are vital for flood management, hydropower operation, and disaster preparedness. In this work, we propose a Matrix Profile-based Weighted Ensemble (MPWE), a regime-switching framework that dynamically captures covariant dependencies among multiple geographical model forecasts while incorporating redundancy-aware weighting to balance contributions across models. We evaluate MPWE using rainfall forecasts from eight major basins in Vietnam, spanning five forecasting horizons (1-hour and accumulated rainfall over 12, 24, 48, 72, and 84 hours). Experimental results show that MPWE consistently achieves lower mean and standard deviation of prediction errors compared to geographical models and ensemble baselines, demonstrating both improved accuracy and stability across basins and horizons.
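
The abstract does not specify the redundancy-aware weighting; one plausible reading, offered purely as an assumption rather than the paper's MPWE formula, is inverse-error weighting discounted by inter-model correlation:

```python
import numpy as np

def redundancy_aware_weights(errors, preds):
    """Hypothetical sketch: weight each member by inverse recent error,
    then discount members highly correlated (redundant) with the rest.
    Not the paper's actual MPWE formulation.

    errors: (k,) mean absolute error of each model on a recent window
    preds:  (k, t) recent predictions of each model
    """
    errors = np.asarray(errors, dtype=float)
    inv_err = 1.0 / (errors + 1e-8)
    corr = np.corrcoef(preds)                                 # (k, k)
    k = len(errors)
    redundancy = (np.abs(corr).sum(axis=1) - 1.0) / (k - 1)   # mean |corr| to others
    w = np.clip(inv_err * (1.0 - redundancy), 1e-12, None)
    return w / w.sum()

# ensemble forecast:
# np.average(current_preds, axis=0, weights=redundancy_aware_weights(err, hist))
```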

11:00
Toward a Culture-Aware Vietnamese Mental Health Support Chatbot with Large Language Models

ABSTRACT. In recent years, mental health has become a pressing concern in Vietnam, driven by increasing academic, professional, and social pressures. Nonetheless, access to mental health support remains constrained by a shortage of professionals and cultural stigmas that discourage help-seeking. This paper presents an AI-based mental health chatbot powered by Large Language Models (LLMs) to deliver accessible, empathetic, and culturally tailored interactions. The model is pretrained on a 100 MB Vietnamese mental health corpus, comprising 200 documents, forum threads, and books (approximately 1.5 million tokens), capturing local linguistic patterns and psychological themes such as academic stress and familial expectations. It is fine-tuned on an adapted CACTUS dataset, consisting of 36,577 Vietnamese counseling dialogues (31,577 translated from English and 5,000 synthetically generated), incorporating Cognitive Behavioral Therapy (CBT) techniques such as decatastrophizing and alternative perspectives. The chatbot serves as a virtual companion, providing accurate information, emotional support, and practical strategies for everyday psychological challenges. Its effectiveness is evaluated using 100 client profiles with detailed intake forms and predefined attitudes, assessed via the Cognitive Therapy Rating Scale (CTRS) for general counseling skills (understanding, interpersonal effectiveness, collaboration) and CBT-specific skills (guided discovery, focus, strategy), alongside a custom Change in Attitude Towards Guidance metric to measure shifts in client attitudes, both rated on a 0-6 scale. Results show the model's performance closely mirrors professional counseling standards, highlighting its potential for real-world deployment. With strengths in linguistic and cultural localization, the model offers a scalable solution to bridge mental health care gaps in Vietnam. Future work will focus on dataset expansion, real-world testing, and integration with digital health platforms to enhance scalability and reliability.

11:20
MiRAGE: Misconception Detection with Retrieval-Guided Multi-Stage Reasoning and Ensemble Fusion

ABSTRACT. Detecting student misconceptions in open-ended responses is a longstanding challenge, demanding semantic precision and logical reasoning. We propose MiRAGE - Misconception Detection with Retrieval-Guided Multi-Stage Reasoning and Ensemble Fusion, a novel framework for automated misconception detection in mathematics. MiRAGE operates in three stages: (1) a Retrieval module narrows a large candidate pool to a semantically relevant subset; (2) a Reasoning module employs chain-of-thought generation to expose logical inconsistencies in student solutions; and (3) a Reranking module refines predictions by aligning them with the reasoning. These components are unified through an ensemble-fusion strategy that enhances robustness and interpretability. On mathematics datasets, MiRAGE achieves Mean Average Precision scores of 0.82/0.92/0.93 at levels 1/3/5, consistently outperforming individual modules. By coupling retrieval guidance with multi-stage reasoning, MiRAGE reduces dependence on large-scale language models while delivering a scalable and effective solution for educational assessment.

10:20-12:00 Session 9C: SOICT Technical Session XV: Human Computer Interaction and Intelligent Interactive Systems
Location: Yersin A
10:20
A Co-Simulation Approach for UAV-Network-AI Interaction in Digital Twin Visual Context

ABSTRACT. The growing use of Unmanned Aerial Vehicles (UAVs) in both civilian and military applications demands robust communication and smart real-time decision-making. Existing simulation platforms often lack integration between physical flight dynamics, wireless communication, and AI, resulting in a "reality gap" during real-world deployment. With an emphasis on Wi-Fi networks in contexts with obstacle modeling, this study builds on previous efforts such as CAVIAR by introducing an integrated co-simulation framework as a digital twin for UAVs. The framework combines ns-3 for realistic network modeling, Microsoft AirSim with Unreal Engine for high-fidelity physical and visual simulation, and a Python-based AI component for object detection through YOLOv11 [6], allowing for synchronized bidirectional interactions under network constraints. Using a PUB/SUB model for scalable and decoupled synchronization, ZeroMQ enables soft real-time data sharing. Key contributions include: (1) a quantitative analysis of how communication constraints impact UAV behavior and AI decision-making; (2) a scalable platform that can model indoor Wi-Fi in high fidelity with obstacle effects; and (3) a reliable testbed for real-time, AI-in-the-loop UAV applications in intricate and safety-critical scenarios. In simulations, experimental results show system stability with an end-to-end latency of less than 80 ms, and real-world tests validate the high fidelity. This adaptable platform promotes research in UAV networks and edge AI while advancing autonomous UAV systems.
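
The ZeroMQ PUB/SUB layer described above can be sketched as follows; the topic name and payload fields are illustrative, not taken from the paper, and publisher and subscriber would normally live in separate processes:

```python
import zmq

ctx = zmq.Context()

# Publisher side (e.g., the simulator process broadcasting UAV state)
pub = ctx.socket(zmq.PUB)
pub.bind("tcp://*:5556")
pub.send_multipart([b"uav.state", b'{"x": 1.0, "y": 2.0, "z": -10.0}'])

# Subscriber side (e.g., the AI component, in another process)
sub = ctx.socket(zmq.SUB)
sub.connect("tcp://localhost:5556")
sub.setsockopt(zmq.SUBSCRIBE, b"uav.state")   # topic filter
# topic, payload = sub.recv_multipart()       # soft real-time receive loop
```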

10:40
Fairy360VR: Immersive 360° Storytelling with Large Language Models and Generative Diffusion

ABSTRACT. In the growing digital storytelling landscape, most current systems mainly focus on text or static images, which do not effectively exploit the potential of panoramic environments to create immersive experiences. In addition, automatically converting short stories into multimodal storytelling experiences remains a challenge due to the requirements of deep semantic understanding and the ability to synthesize appropriate images according to the story timeline. In this study, we propose a system that automatically generates multimodal storytelling experiences from short stories, combining both text and 360° images. First, we build a dataset of pairs of short stories and scene descriptions, which are preprocessed and mapped into semantically rich text segments to serve as a basis for training and evaluation. Next, we design a system architecture consisting of two main components: (1) a text processing module that uses a large language model to extract and organize events in a timeline, and (2) a 360° biome module based on the biome model, allowing users to observe the story in panoramic space. The system is deployed on the web, allowing users to experience it directly by dragging the mouse to rotate and explore the surrounding space. For evaluation, we conducted a user study comparing two storytelling formats: text-based stories generated by Gemini Story Book and the 360° visual experience produced by our system. Participants were surveyed on criteria such as attractiveness, comprehensibility, immersion, and personal preference. Preliminary results show that the 360° method provides a higher level of immersion and is rated by users as more appealing than the pure text-based stories. This study contributes a new approach to the field of digital storytelling, where the combination of language models and generative models can bring a more interactive and immersive experience to users.

11:00
Enhancing VR Drink Taste Believability using Olfactory Stimulation

ABSTRACT. Taste perception is inherently multisensory, with olfaction playing a key role alongside vision and other senses. Although virtual reality (VR) has been increasingly applied to simulate multisensory experiences, drinking interactions remain largely underexplored. This paper presents the design and evaluation of a VR system that integrates visual and olfactory cues to enrich drinking experiences and enhance immersion. The system associates predefined liquid colors with specific odors, which are automatically released when a drinking gesture is detected. Guided by a user-centered design approach, a user study was conducted to examine whether combining color and scent could modulate taste perception in VR. Results show that while liquid color alone had limited impact, olfactory cues—particularly strawberry and lemon scents—significantly shaped perceptions of sweetness and sourness. These findings demonstrate the potential of incorporating olfaction into VR to advance Human-Food Interaction, paving the way for future multisensory applications in entertainment, education, and food commerce.

11:20
An eye-tracking system for extracting and visualizing visual features of dyscalculia in children

ABSTRACT. Dyscalculia is a learning disability characterised by poor performance on tasks involving spatial and numerical processing. Eye tracking has been widely used to study developmental disorders, but its application to dyscalculia has been relatively neglected. Most previous studies lack specific technical methods to identify the visual features characteristic of children with dyscalculia. Our study seeks to bridge this gap by introducing an eye-tracking system designed to elucidate the underlying perceptual problems of Vietnamese children with dyscalculia as they respond to task demands. It not only explores visual attention strategies but also uncovers fundamental issues such as auditory or spatial order perception when performing calculations. In addition, the visual problems regarding space and time are clarified. The system collects multimodal signal information from the child's eye and mouse-based interactions during the task. Preliminary findings from the comparative analysis between dyscalculic and typically developing children suggest that the system may provide a solid foundation for future research on detection, intervention, and development of assistive applications tailored to the visual processing abilities of these children.

11:40
MO-PO MR: A Collaborative Mixed Reality Board Game for Engaging Players and Audience in Learning through Playing

ABSTRACT. Collaborative educational activities often rely on physical artifacts such as board games, which encourage manipulation, peer interaction, and active engagement. Yet traditional board games suffer from several limitations: they require fragile physical resources and manual orchestration by a facilitator, and they remain difficult to adapt for formal evaluation. To address these issues, we designed and evaluated MO-PO MR, a Mixed Reality (MR) version of Mono-Poly, a chemistry card game where players create polymers by combining monomers. MO-PO MR serves as an exemplary collaborative MR board game that eliminates the need for fragile physical components, automates rule enforcement and scorekeeping, and reduces opportunities for error or cheating. The system supports both players and audiences: co-located participants use tablets to interact with an augmented game board, while spectators view a large third-person MR display where virtual objects are spatially fused with real-world participants. This hybrid design ensures accessibility even for users without MR headsets. We conducted a within-subject study with groups of three players and one observer, comparing the MR version against the physical game. Results indicate that MO-PO MR improves clarity, coordination, and immersion, while making the game easier to follow for both players and observers. These findings highlight the potential of MR-based collaborative games to enhance STEM education by offering scalable, traceable, and engaging learning experiences.

10:20-12:00 Session 9D: SOICT Technical Session XVI: Multimedia Processing
Location: Yersin B
10:20
LOGOS: Language-guided Oriented Object Detection in Aerial Scenes

ABSTRACT. Object detection in geospatial scenes, such as satellite and aerial imagery, presents significant challenges due to the varying orientations and densities of objects, as well as the complex backgrounds inherent in remote sensing images. Traditional methods for oriented object detection have struggled to address issues such as angular discontinuity, fixed query sizes, and inefficiencies in handling sparse or cluttered scenes. In this paper, we propose LOGOS, a novel transformer-based approach that leverages textual prompts to guide the detection of oriented objects in aerial scenes. In particular, our proposed approach incorporates prompt-modulated content queries to dynamically adjust the model’s focus based on the given text, ensuring more accurate object detection in complex environments. Empirically, extensive experiments on the DOTA dataset demonstrate that LOGOS outperforms existing state-of-the-art methods, particularly in densely packed and rotated object scenarios. Our approach offers a significant step forward in improving the robustness and scalability of oriented object detection in remote sensing applications.

10:40
From Text to Thumbnail: A Unified Framework for Automated News Image Generation and Evaluation for Daily Activities

ABSTRACT. Images play a crucial role in online news consumption, attracting attention and driving user engagement. While recent Text-to-Image (T2I) models can generate high-quality images from text, selecting images that accurately reflect news content remains subjective and influenced by user preferences. In this paper, we propose a unified framework for news image generation that systematically addresses this challenge, targeting daily activities. Our approach introduces a novel criteria-learning phase to extract salient visual attributes from 6,000 thumbnail images across various domains and a prompt enrichment pipeline to create Grounded Summaries from article text. Finally, each generated image is assessed using an evaluation protocol that quantifies semantic and visual quality via established human preference models. To validate our approach and facilitate future research, we apply our full pipeline to generate and evaluate over 10,000 images for 1,500 news articles, releasing this collection as the first benchmark dataset for grounded news image generation. Our experiments on this dataset demonstrate the framework's effectiveness and highlight how our prompt enrichment method successfully balances semantic fidelity with aesthetic appeal.

11:00
Self-Supervised ViT for Endoscopy: I-JEPA Pretraining with Label-Free Diffusion Assessment

ABSTRACT. The scarcity of expert-labeled data remains a major barrier for gastrointestinal endoscopy image analysis, as annotation is costly and time-consuming while vast amounts of clinical images remain unlabeled. To address this challenge, we propose a label-efficient pipeline that combines self-supervised pretraining and label-free evaluation. Specifically, we pretrain a Vision Transformer (ViT) encoder on 66,820 unlabeled endoscopy frames using I-JEPA, a joint-embedding predictive approach. To assess the quality of learned representations without relying on annotated data, we introduce a diffusion-based probing mechanism—Reconstruction-Conditioned Diffusion Modeling (RCDM)—which reconstructs images from latent features to provide a qualitative, label-free evaluation. Finally, we transfer the pretrained encoder to downstream tasks including classification, lesion/polyp segmentation, and multitask learning. Experiments on public benchmarks (e.g., Kvasir, CVC series, ETIS) and private datasets demonstrate that I-JEPA pretraining, particularly when combined with sequential MAE to I-JEPA adaptation, yields superior segmentation performance compared to strong baselines such as RaBiT, with larger gains in low-label regimes. Multitask analysis further highlights the role of decoder architecture, where ViT encoders paired with RaBiT-style decoders surpass EndoUnet in most tasks. These results show that our pretrain–probe–transfer framework enables domain-aware, label-efficient representation learning for endoscopic image analysis, providing both practical benefits under label scarcity and actionable insights for multitask model design.

11:20
Generalizability Evaluation and Anchor-Guided Approach for Category-Agnostic Pose Estimation

ABSTRACT. Category-agnostic pose estimation models may overfit to the limited set of predefined landmarks within the training dataset, resulting in large errors for query points far from these landmarks. We demonstrate this effect by analyzing the error distribution of query points on human faces. We find that query points distant from landmarks exhibit high errors, suggesting existing models struggle to generalize beyond the training data. To handle this problem, we introduce a training-free, anchor-guided geometric mapping approach to improve keypoint prediction. Our method leverages reliably predicted anchor points to construct a pose-consistent geometric basis via Delaunay triangulation. It then uses barycentric coordinate interpolation to map any query point from a support image to a target image, preserving geometric structure across different poses. Quantitative evaluation on human faces and qualitative analysis across diverse categories confirm the overfitting issues and show that our approach significantly improves keypoint accuracy without requiring additional training.
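
The training-free mapping can be illustrated with SciPy: triangulate the support anchors, locate the query point's triangle, compute its barycentric coordinates, and apply the same weights to the corresponding target anchors. A sketch assuming the two anchor sets are index-aligned:

```python
import numpy as np
from scipy.spatial import Delaunay

def map_query_point(support_anchors, target_anchors, query_xy):
    """Map a 2D query point from support to target image using
    barycentric coordinates over a Delaunay triangulation of anchors.
    Assumes support_anchors[i] corresponds to target_anchors[i].
    """
    tri = Delaunay(support_anchors)
    s = int(tri.find_simplex(query_xy[None, :])[0])
    if s == -1:
        raise ValueError("query point lies outside the anchor hull")
    T = tri.transform[s]                   # affine map of simplex s, shape (3, 2)
    b = T[:2] @ (query_xy - T[2])          # first two barycentric coordinates
    bary = np.append(b, 1.0 - b.sum())     # weights over the 3 triangle vertices
    verts = tri.simplices[s]
    return bary @ target_anchors[verts]    # same weights in the target image

# support = np.random.rand(20, 2); target = support + 0.1
# print(map_query_point(support, target, np.array([0.5, 0.5])))
```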

11:40
RIOT: Robust Incremental Few-Shot Instance Segmentation via Synthetic Feature Generation with Optimal Transport

ABSTRACT. Few-shot instance segmentation (FSIS) extends few-shot detection by requiring both object localization and accurate mask prediction, but remains underexplored in incremental settings where new categories arrive over time with limited annotations. Existing incremental FSIS methods mainly focus on classifier adaptation and knowledge preservation, while neglecting data augmentation or feature generation, which are crucial to mitigate overfitting and distribution mismatch under extreme data scarcity. In this work, we present RIOT, Robust Incremental Few-Shot Instance Segmentation via Synthetic Feature Generation with Optimal Transport. RIOT follows a two-stage pipeline: (1) base training on abundant categories to learn strong segmentation features, and (2) generator training with both Optimal Transport and KL-divergence losses to produce class-conditional synthetic features aligned with real distributions. Unlike prior FSIS approaches, RIOT supports incremental learning without additional fine-tuning stages. Extensive experiments on standard benchmarks demonstrate that RIOT significantly improves recognition of novel classes while maintaining base-class knowledge, establishing a strong baseline for incremental FSIS with synthetic feature generation.

10:20-12:00 Session 9E: Poster Exhibition
Integrated Semantic and Temporal Alignment for Interactive Video Retrieval

ABSTRACT. The growing volume of video data and the introduction of complex retrieval challenges, such as the Temporal Retrieval and Alignment of Key Events (TRAKE) task, expose critical limitations in existing systems. Many methodologies lack scalable, holistic architectures and rely on "frozen" embedding models that fail on out-of-knowledge (OOK) or real-world queries. This paper introduces a comprehensive, modular video retrieval framework designed to address these gaps. Our system features a scalable architecture integrating TransNetV2 for scene segmentation, BEiT-3 for visual embeddings in Milvus, and Gemini OCR for metadata in Elasticsearch. We propose two novel components: (1) QUEST (Query Understanding and External Search for Out-of-Knowledge Tasks), a two-branch framework that leverages a Large Language Model (LLM) for query rewriting and an external image search pathway to resolve OOK queries; and (2) DANTE (Dynamic Alignment of Narrative Temporal Events), a novel dynamic programming algorithm that efficiently solves the temporally-incoherent TRAKE task, which has an efficient O(NT) time complexity. These contributions form a robust, scalable, and intelligent system that significantly advances the state-of-the-art in handling complex, real-world video search queries.

HelioSearch: A Multimodal Video Retrieval Framework with LLM-Driven Query Expansion and Hybrid Filtering

ABSTRACT. Video event retrieval in large-scale multimedia databases remains a critical challenge due to the inherent complexity of multimodal understanding and semantic alignment across heterogeneous data sources. This paper presents the design and development of a unified multimodal video retrieval system that addresses key limitations of existing approaches in cross-modal representation learning, temporal reasoning, and semantic consistency. The proposed framework leverages BEiT-3 and CLIP as unified transformer encoders to learn shared semantic representations. To overcome single-modality constraints, the system integrates Optical Character Recognition (OCR), Automatic Speech Recognition (ASR), and YOLOv12-based object detection for fine-grained entity filtering. These modalities are organized within a specialized database architecture that combines vector, document, and search indexes to enable efficient multimodal fusion. The system supports diverse query types, further enhanced through Large Language Model (LLM)-assisted query expansion. Comprehensive experiments conducted on the Ho Chi Minh City AI Challenge 2025 (AIC 2025) dataset demonstrate substantial improvements in retrieval precision and ranking stability, validating the system’s effectiveness and generalization capability. Overall, the proposed framework offers a scalable, interpretable, and extensible foundation for real-world multimodal video event retrieval applications.

Enhanced Multimodal Video Retrieval System: Integrating Query Expansion and Cross-modal Temporal Event Retrieval

ABSTRACT. Multimedia information retrieval from videos remains a challenging problem. While recent systems have advanced multimodal search through semantic, object, and OCR queries - and can retrieve temporally consecutive scenes - they often rely on a single query modality for an entire sequence, limiting robustness in complex temporal contexts. To overcome this, we propose a cross-modal temporal event retrieval framework that enables different query modalities to describe distinct scenes within a sequence. Another key contribution is the Kernel Density Gaussian Mixture Thresholding (KDE-GMM) algorithm, which adaptively determines decision thresholds for scene transition and slide change detection, ensuring optimal keyframe selection. These extracted keyframes act as compact, high-quality visual exemplars that retain each segment's semantic essence, improving retrieval precision and efficiency. Additionally, the system incorporates a large language model (LLM) to refine and expand user queries, enhancing overall retrieval performance. The proposed system's effectiveness and robustness were demonstrated through its strong results in the Ho Chi Minh AI Challenge 2025.
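
The KDE-GMM thresholding details are not given in the abstract; the general idea of fitting a two-component mixture to one-dimensional transition scores and cutting where component ownership flips can be sketched as follows (an illustration, not the authors' exact algorithm):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def adaptive_threshold(scores):
    """Fit a 2-component GMM to 1-D scene-transition scores and return
    the value between the component means where posterior ownership
    flips to the high-mean ("transition") component."""
    gm = GaussianMixture(n_components=2, random_state=0).fit(
        np.asarray(scores, dtype=float).reshape(-1, 1))
    lo, hi = np.sort(gm.means_.ravel())
    grid = np.linspace(lo, hi, 512).reshape(-1, 1)
    post = gm.predict_proba(grid)
    hi_comp = int(np.argmax(gm.means_.ravel()))
    crossing = int(np.argmax(post[:, hi_comp] >= 0.5))  # first owned grid point
    return float(grid[crossing, 0])

# cuts = [i for i, s in enumerate(frame_scores) if s > adaptive_threshold(frame_scores)]
```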

VidAlign: Integrating Multi-Event Alignment and LLM Co-Searching for Video Retrieval

ABSTRACT. The exponential growth of video content demands efficient retrieval systems capable of understanding complex, multi-event scenarios. In this work, we present VidAlign, a fine-grained video retrieval framework that effectively aligns multi-event textual queries with temporally distributed visual content. The framework introduces a novel TDP-Fuse (Temporal Dynamic Programming Fusion) algorithm to dynamically align and fuse partial retrieval results over time. Furthermore, an LLM-guided Co-Searching mechanism is incorporated to assist users in query formulation and refinement, leveraging large language models to enhance semantic understanding and interactive retrieval. VidAlign’s architecture combines fast approximate search via FAISS with intelligent reranking, temporal fusion, and adaptive co-searching, ensuring both scalability and retrieval accuracy. The system has been rigorously evaluated in real-world settings and demonstrated outstanding performance at the Ho Chi Minh AI Challenge 2025, ranking among the top solutions. Experiments show that VidAlign effectively enhances semantic alignment and temporal coherence, making it a strong competitor in large-scale event-based video retrieval.
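
The fast approximate search stage of such a pipeline can be sketched with FAISS's inverted-file index; dimensions and parameters below are illustrative, not the paper's configuration:

```python
import numpy as np
import faiss

d, nlist = 512, 1024                      # embedding dim, number of IVF cells
xb = np.random.rand(100_000, d).astype("float32")
faiss.normalize_L2(xb)                    # cosine similarity via inner product

quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(xb)                           # learn the coarse quantizer
index.add(xb)
index.nprobe = 16                         # cells visited per query: speed/recall knob

xq = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(xq)
scores, ids = index.search(xq, 10)        # top-10 candidates for reranking/fusion
```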

PerceptionBrowser: Enhancing an information retrieval system with spatial-temporal knowledge

ABSTRACT. The ubiquity of the internet and media services in the modern world has produced enormous amounts of data, considered the new gold of the informatics era. This emergence has nurtured the development of media retrieval systems, in which users have the ability to extract the most relevant piece of information from an enormous dataset. Therefore, we propose an effective retrieval system that can run on a portable machine such as a laptop without internet access, which enhances the system's robustness and usefulness in privacy-sensitive situations. First, we process the enormous dataset to reduce its size significantly, which enables operation on resource-constrained machines. Second, to facilitate information retrieval by text queries, we use CLIP-based features obtained from visual foundation models. This allows us to integrate both spatial (image) and temporal (video) features in our system. Furthermore, we also introduce a temporal combination algorithm to enhance temporal understanding and retrieval performance. Benchmarking our system on the set of queries provided in the elimination round of the Ho Chi Minh AI Challenge 2025 (AIC25), we achieved an impressive score of 84.4/88, equivalent to 95.91% accuracy, with an average query response time of under 15 seconds. These results underscore our system's robustness in managing diverse and complex queries, demonstrating it as an efficient tool for life-log and media retrieval purposes by significantly enhancing the user experience for both common and advanced usage while maintaining minimal resource requirements. Our code is publicly available at https://github.com/trnKhanh/past-beggars.

TARS: Temporal Alignment Retrieval System for Efficient Multi-Segment Video Event Retrieval

ABSTRACT. Temporal video event retrieval requires returning video segments whose frames follow the action order stated by a natural-language query. Existing systems built on global or scene-level similarity often surface visually plausible yet order-inconsistent matches; learning temporal encoders improves ordering but adds training cost and degrades robustness under domain shift. We present TARS, a training-free, order-aware framework that performs temporal reasoning entirely at inference time. A query is decomposed into sub-events, which are then embedded by complementary vision-language encoders; a monotonic dynamic-programming alignment searches for the best ordered path on the frame-subevent similarity matrix. A prefix-maximum recurrence yields O(nm) time and O(m) memory per shot and integrates cleanly with candidate retrieval and lightweight re-ranking. On the AI Challenge HCMC 2025 benchmark, TARS attains 93.15% Top-1 accuracy, demonstrating that explicit inference-time temporal alignment over frozen embeddings is a simple, robust, and deployable solution for order-sensitive video retrieval.
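
The stated prefix-maximum recurrence translates directly to code: for a frame-by-subevent similarity matrix S, a single best[] array tracks the best ordered assignment of the first j+1 sub-events seen so far. A sketch consistent with the stated O(nm) time / O(m) memory bounds (the paper's exact variant may differ):

```python
import numpy as np

def monotonic_alignment_score(S):
    """Best sum S[i_1,0] + ... + S[i_m,m-1] over frames i_1 < ... < i_m.
    Iterating sub-events in descending order makes best[j-1] a running
    prefix-maximum over strictly earlier frames: O(n*m) time, O(m) memory.
    """
    n, m = S.shape
    best = np.full(m, -np.inf)
    for i in range(n):
        for j in range(m - 1, -1, -1):
            prev = best[j - 1] if j > 0 else 0.0
            if prev > -np.inf:
                best[j] = max(best[j], prev + S[i, j])
    return best[m - 1]   # -inf if there are fewer frames than sub-events

# S = np.random.rand(200, 4); print(monotonic_alignment_score(S))
```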

KPT: Enhancing Temporal Event Retrieval in Vietnamese News Videos

ABSTRACT. This paper presents KPT, an enhanced Vietnamese news event retrieval system developed for the 2025 Ho Chi Minh City AI Challenge. The system represents an evolution of our previous solution, focusing on improving temporal event search accuracy and overall system efficiency. First, a key upgrade is the integration of the Milvus vector database, which replaces the prior in-memory implementation and enables faster retrieval with scalable handling of large video datasets. Second, the web-based user interface and core functionalities have been completely redesigned to provide a more intuitive and efficient user experience, supporting complex multimodal querying and result submission. Third, the temporal search algorithm has been upgraded from a frame-level to a scene-level indexing approach, incorporating a new scoring function that jointly models semantic similarity and temporal order, significantly improving both computational efficiency and the precision of event boundary detection. The enhanced KPT system was successfully deployed in the preliminary round, achieving substantially better performance than the previous version and demonstrating the effectiveness of the proposed architectural and algorithmic enhancements.

TEMPO: A Multimodal Video Retrieval System with Sequential Query Support

ABSTRACT. The Ho Chi Minh AI Challenge 2025 sets the ambitious goal of building a powerful video retrieval system. The competition requires teams to handle a medium-size dataset under a tight timeline, which demands solutions that balance both speed and accuracy. To stay competitive, we design and implement TEMPO, a system that integrates multiple search strategies, including Textual Search, Visual Search, and, most importantly, Temporal Search. Starting from the official dataset, we separate and process both audio and visual streams. These pipelines enable us to build the strong features that drive the system: Semantic Search, OCR Search, ASR Search, and Temporal Search. Among them, Temporal Search stands out by supporting sequence-based queries, also known as Sequential Query Video Retrieval. This feature is still rare in commercial systems and represents a novel contribution to the competition. Our system achieves promising results and is currently ranked among the top teams in the preliminary round, based on 50% of the ground-truth data provided by the organizers. These outcomes highlight the practical potential of TEMPO and its effectiveness in addressing real-world video retrieval tasks.

FRED: Unified Multimodal Fusion and Dynamic Temporal Reasoning with Semantic Query Expansion and Exclusionary Search

ABSTRACT. Our proposed system introduces an innovative approach to interactive multimodal video retrieval, developed for the AI Challenge Ho Chi Minh City 2025. The system enhances both retrieval accuracy and user interaction through the integration of Large Language Models (LLMs) for semantic reasoning and query expansion, effectively addressing query ambiguities and improving contextual relevance. The retrieval framework is built upon Vision-Language Models (VLMs) to support text-to-video and image-based search, while incorporating auxiliary components such as Optical Character Recognition (OCR), Automatic Speech Recognition (ASR), and Object Detection to enrich multimodal understanding. These complementary signals enable the system to capture textual, auditory, and visual cues from videos, creating a more comprehensive search foundation. Furthermore, a dynamic temporal search mechanism evaluates frame-level relevance and temporal dependencies, providing adaptive and context-aware retrieval. Overall, our system demonstrates the effectiveness of combining multimodal perception with LLM-driven intelligence to advance the precision, adaptability, and interactivity of modern video retrieval systems.

A Video Retrieval System with Advanced Temporal Algorithm and Vision Language Models Integration

ABSTRACT. Video retrieval is the demanding task of locating the exact moment within an unbounded video collection that corresponds to a user query. This capability is critical as the volume of video data explodes, becoming a "digital haystack" where finding a specific piece of information is practically impossible. Recent retrieval systems have successfully taken advantage of technological innovations to overcome many challenges of the task, especially in the era of Large Language Models (LLMs) and Vision Language Models (VLMs). However, limitations remain: (i) most current systems only utilize LLMs and VLMs for retrieval operations (query detailing, assisted fusion, etc.), undermining their potential; and (ii) temporally capable retrieval methods are lacking. Addressing these challenges, this paper presents a pipeline that utilizes the strengths of VLMs and LLMs in both the pre-processing and retrieval operations, integrated into a system with an advanced dynamic-anchoring temporal algorithm and a refined semantic extraction workflow. Using AIC 2025, a video retrieval competition, as a benchmarking ground, the proposed system achieved 97 percent accuracy during the preliminary rounds, demonstrating its practicality in terms of performance, interactivity, and scalability.

FrameSeeker: Shot-Level Captioning with Multimodal Hints for Efficient Video Retrieval

ABSTRACT. The exponential growth of multimedia data has created an urgent need for video retrieval systems that can deliver fine-grained semantic understanding without excessive computational cost. Existing methods often rely on detailed captioning for every frame, which fails to capture temporal context and results in redundant inference. To address these challenges, a multimodal retrieval framework is proposed that integrates hybrid semantic keyword search, query enhancement, and temporal reasoning. A key feature of the system is the use of multimodal cues and few-shot prompt engineering within a Vision Language Model to generate a single, temporally coherent caption for all frames in a shot. By jointly using information from visual content, object detection, and text recognition, this approach produces rich and semantically grounded embeddings that improve retrieval precision while reducing inference cost. The effectiveness of the proposed method is validated through its strong performance in the 2025 Ho Chi Minh City AI Challenge.

Adaptive Agent-Guided Dynamic Programming for Temporal Optimization in Multi-Event Video Retrieval

ABSTRACT. As multimedia content continues to expand rapidly, achieving accurate and efficient video retrieval has become increasingly critical. Among various retrieval tasks, multi-event retrieval poses greater challenges, as it requires understanding both semantic content and temporal dependencies across events. However, existing approaches often rely on locally greedy matching strategies that enforce only pairwise temporal consistency, leading to suboptimal alignments and disrupted event order. To overcome these limitations, we propose a Dynamic Programming (DP)-based video retrieval framework that formulates the task as a global temporal optimization problem. Our method jointly optimizes the entire event sequence to identify the globally optimal keyframe path that best preserves both chronological flow and semantic coherence. This DP formulation effectively reduces the search space while maintaining temporal consistency and semantic integrity. To enhance automation and adaptivity, the framework is supported by an agent-guided coordination layer powered by a Large Language Model (LLM). This layer interprets user intent, decomposes multi-event queries into structured representations, and autonomously triggers the DP-based retrieval pipeline. Experimental results demonstrate strong performance in the 2025 Ho Chi Minh AI Challenge, highlighting the potential of this agent-guided, DP-centered framework for large-scale, intelligent multimedia retrieval.

GalaxyAssistant: An Intelligent Assistant for Multimedia Event Retrieval

ABSTRACT. The exponential growth of large-scale multimedia data necessitates efficient event retrieval systems, a challenge addressed by competitions like LSC, VBS, and the Ho Chi Minh City AI Challenge. To address this, we propose GalaxyAssistant, an intelligent assistant framework designed for in-depth analysis and information retrieval from complex video data. Our system's intelligence is rooted in its tri-modal indexing, which concurrently processes video via three parallel pipelines following shot detection (TransNetV2). First, a visual-language model (SigLIP) generates dense visual embeddings. Second, an image captioning model (InternVL) creates textual descriptions. Third, an ASR model (ChunkFormer) transcribes the audio. Both caption and audio transcripts are encoded using a hybrid dense-and-sparse model (BGE-M3 + BM25) to create robust textual indices. During retrieval, the assistant dually encodes a user’s query using both SigLIP and BGE-M3+BM25 text encoders. A concurrent search is then executed via the Hierarchical Navigable Small Worlds (HNSW) algorithm against all three vector databases (visual, caption, and audio). This tri-modal fusion allows our assistant to perform high-precision analysis and retrieval for complex, event-based queries.
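
The HNSW search step can be sketched with the hnswlib package (the system presumably uses a vector database's built-in HNSW; parameters here are illustrative):

```python
import numpy as np
import hnswlib

dim, n = 768, 50_000
vectors = np.random.rand(n, dim).astype("float32")   # e.g., SigLIP visual embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(vectors, np.arange(n))
index.set_ef(64)                                     # search-time recall/speed knob

query = np.random.rand(1, dim).astype("float32")
labels, distances = index.knn_query(query, k=10)
# Repeat against the caption and audio indices, then fuse the three result lists.
```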

Text-Guided Filtering to Enhance Open-Vocabulary Object Detection for Sport Event Retrieval
PRESENTER: Duc-Thang Nguyen

ABSTRACT. Accurately interpreting sports broadcasts requires not only visual perception but also understanding textual cues such as player names, jersey numbers, and scoreboard information. Traditional object detectors, constrained by closed-set vocabularies, struggle to generalize across unseen entities and lack the ability to reason about text embedded in the scene. This paper presents a text-guided filtering framework that integrates YOLO-World, an open-vocabulary object detector, with an Optical Character Recognition module to enhance fine-grained football scene understanding. Experiments on FIFA World Cup 2022 broadcast frames demonstrate that incorporating textual cues improves recognition accuracy and contextual coherence, particularly in event-specific cases such as goals and substitutions. The results highlight the effectiveness of text-guided filtering for multimodal reasoning, offering a scalable direction for open-vocabulary object detection in structured visual domains.

Fusurge: An Accelerated Query-Driven System for Multimodal Information Retrieval

ABSTRACT. With the rapid expansion of multimedia archives, video retrieval systems must strike a balance between scalability, accuracy, and responsiveness. Fusurge addresses this challenge by integrating a compact yet powerful pipeline. Key components include data processing for rapid keyframe sampling, PaddleOCR for multilingual OCR, and faster-whisper for efficient ASR. On the semantic side, multiple CLIP-based encoders are fused to widen coverage, while compact vision–language models support interactive question answering. Retrieval quality is further refined through reranking with user-guided clarification. An intuitive interface lowers entry barriers and ensures ease of use. Empirically, Fusurge demonstrates strong retrieval quality and robust system performance, thereby achieving consistent top-rank effectiveness up to 86% with responsive, stable runtime — making it suitable for real-world, large-scale use.

ATLAS: Adaptive Temporal Low-rank Alignment System for AI Challenge 2025

ABSTRACT. Text-Video Retrieval (TVR) on large-scale event data requires scalable and semantically rich solutions, particularly within the AI Challenge 2025 - a national competition fostering research in multimodal retrieval. This paper presents ATLAS, the system submitted to the competition, designed to address computational inefficiency and contextual misalignment in cross-modal search. Built upon Milvus and Elasticsearch, and leveraging foundation models such as CLIP and BLIP-2, ATLAS adopts a novel Adaptive Fusion architecture. The system introduces three key innovations: (1) Low-rank Modulation (LoRM), adapted from the RAP architecture, to mitigate temporal redundancy and generate highly representative LoRM-enhanced Activity Vectors; (2) Selective Lite-SGG, which encodes structural context only on salient keyframes, balancing efficiency and expressiveness; and (3) Weighted Reciprocal Rank Fusion (WRRF), a dynamic ranking mechanism that adjusts weights based on query complexity to integrate multi-modal retrieval results effectively. Experimental evaluations demonstrate that ATLAS achieves robust, accurate, and adaptive retrieval performance on large-scale news datasets, setting a strong benchmark for future AI Challenge systems.

MERVIN: A Unified Framework for Multimodal Event Retrieval in Vietnamese News Videos

ABSTRACT. The rapid growth of online video platforms has created an increasing need for effective and semantically grounded event retrieval systems. To address this, we propose MERVIN, a unified multimodal framework for Vietnamese news video retrieval that integrates visual and textual representations through keyframes, transcripts, and video summarizations. The framework enhances textual quality using the Gemini 1.5 Flash model for transcript cleaning and summarization, effectively reducing noise caused by accents, background interference, and recognition errors. For visual understanding, features are extracted using the Perception Encoder model, while a Vietnamese-specific language model generates textual embeddings to ensure linguistic relevance. Both visual and textual embeddings are indexed in a Milvus vector database, enabling efficient similarity-based retrieval. On top of this, a web-based interactive interface built with React allows users to iteratively refine queries across modalities, leading to more accurate and semantically aligned search results. Experimental results on Vietnamese news videos demonstrate the effectiveness of the proposed system, with MERVIN achieving 79 out of 88 points in the AI Challenge HCMC 2025 qualification phase.

Aligning Time and Semantics (ATS): A System for Temporal Retrieval and Alignment of Key Events

ABSTRACT. Video retrieval is the process of locating the exact moment in a vast video collection that matches the textual description provided by the user. This problem presents a considerable challenge in the digital era, where vast amounts of media content, including videos, images, and audio, are everywhere on the Internet. For this reason, as part of the Ho Chi Minh AI Challenge 2025, we developed an innovative framework called Aligning Time and Semantics (ATS), capable of specifying the video segment that corresponds to the question from the users. This framework is integrated with multimodal models, with the ability not only to combine various types of models but also to effectively manage temporal searches through multi-stage processes.

Unlocking Arbitrary-Length Querying for Video Retrieval via Advanced Vision-Language Models and Hybrid Temporal Search

ABSTRACT. Video retrieval, a critical task in multimedia retrieval, faces challenges in aligning complex textual queries with video content, especially for long or multimodal queries. This paper introduces a novel multimodal retrieval system designed for the AI Challenge 2025, emphasizing efficient video retrieval. Our approach utilizes LongCLIP as the primary Vision-Language Model (VLM) to address token limitations of previous models, enabling robust processing of extended queries. We integrate OCR, ASR, and object detection, fused via a weighted sum strategy to enhance retrieval accuracy. Additionally, we propose a temporal search algorithm to precisely identify frames corresponding to specific actions described in the query, improving temporal alignment. Experimental results from the AI Challenge 2025 demonstrate the system's superior precision and efficiency in real-world video search scenarios.

Tournament-Inspired Elimination Reranking for Multi-Modal Video Retrieval

ABSTRACT. This paper presents a novel multi-modal video retrieval framework designed to enable efficient and accurate content discovery across large-scale video datasets. The system integrates multiple vision–language models to enhance semantic alignment between textual queries and video content. A key contribution is a double-elimination reranking mechanism inspired by tournament structures, which improves recall and ranking stability through multi-stage candidate evaluation. The framework employs a three-tier storage architecture comprising Milvus for vector similarity search, Elasticsearch for hybrid lexical–semantic retrieval, and MongoDB for metadata management. Reciprocal Rank Fusion is used to combine results from multiple encoders, while post-retrieval reasoning with Gemini is applied to answer user queries and align frames with corresponding events. Demonstrated in the competition setting, the system shows strong potential for large-scale, intelligent multimedia retrieval.
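
Reciprocal Rank Fusion itself is standard and compact: each item accumulates 1/(k + rank) across the ranked lists, with k = 60 the conventional constant. A minimal sketch:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of item IDs; higher fused score = better."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused = reciprocal_rank_fusion([encoder_a_ids, encoder_b_ids, lexical_ids])
```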

Multi-modal and Temporally-aware Video Retrieval

ABSTRACT. In this work, we present a multi-modal and temporally-aware video retrieval framework designed for multi-event video search. The proposed method captures diverse semantic information for each keyframe, including image embeddings, optical character recognition text, object detection features, and audio transcripts. Moreover, event dependencies are modelled with a decay-based weighting mechanism to improve long-sequence matching across multiple modalities. In addition, an enhanced user interface with an integrated chatbot assistant is developed to support faster and more streamlined query formulation. Comparative experiments with another video retrieval system demonstrate that the proposed approach enhances contextual understanding and retrieval performance for complex, time-dependent video queries in the HCMC AI Challenge 2025.

Cross Segment Coherence Scorer: A Training Free Temporal Framework for Multimodal Video Retrieval

ABSTRACT. Video event retrieval involves identifying and aligning segments that correspond to the events or actions described in a user’s query. As large-scale video archives continue to grow, managing their multimodal and temporal complexity becomes a critical challenge, requiring accurate cross-modal retrieval over visuals, text, and audio while preserving the correct order of events. Many existing systems incorporate temporal checks by verifying frame order or expanding fixed time windows; however, these rigid approaches often fail in real-world editing scenarios where interleaved scenes are rejected as out of order and events separated by longer gaps are missed. To overcome these issues, we propose a training-free video event retrieval framework that combines late fusion with explicit inference-time temporal scoring. Specifically, multimodal fusion based on Reciprocal Rank Fusion (RRF) unifies visual, OCR, and ASR evidence without retraining, and a Cross Segment Coherence Scorer (CSCS) applies soft penalties for order reversals, long temporal gaps, and jumps across shots to handle interleaved and repeated scenes while preserving temporal coherence. This design provides temporal flexibility with minimal inference overhead. The framework was evaluated on the AI Challenge HCMC 2025 benchmark, achieving 95% accuracy and demonstrating competitive performance in event-level video retrieval.
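
A hedged sketch of the soft-penalty idea behind CSCS: a chain of candidate segments is scored with penalties for order reversals and long gaps. The penalty forms and constants are assumptions for illustration; the paper's exact scorer may differ.

```python
# Hedged sketch of cross-segment coherence scoring with soft penalties
# for order reversals and long gaps, applied at inference time.
# Penalty weights and functional forms are illustrative assumptions.
import math

def coherence_score(times: list[float], base: float,
                    alpha: float = 0.5, lam: float = 0.02) -> float:
    penalty = 0.0
    for prev, cur in zip(times, times[1:]):
        if cur < prev:                      # order reversal: soft penalty
            penalty += alpha
        else:                               # long gap: saturating penalty
            penalty += 1.0 - math.exp(-lam * (cur - prev))
    return base - penalty

# Candidate segment chain with timestamps (s) and a fused base score.
print(coherence_score([12.0, 15.5, 14.0, 40.0], base=3.2))
```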

Poly-Temporal Search: Bridging Composed and Temporal Queries for Multimodal Video Retrieval

ABSTRACT. Retrieving relevant video content from large-scale datasets requires understanding both what appears in a scene and how events unfold over time. However, existing approaches typically focus on a single aspect of this problem: compositional models capture detailed object–text relationships but fail to model temporal evolution, whereas temporal models track event sequences but overlook fine-grained scene semantics. This separation limits the ability to reason over complex, narrative-style queries that combine spatial composition with chronological order. Moreover, current systems lack robust keyframe selection and unified multimodal representations, frequently relying on static frame-level alignment or local similarity fusion that fails to maintain global temporal coherence. To address this gap, we propose Poly-Temporal Search, a unified framework that integrates compositional and temporal reasoning within a single retrieval process. Our method introduces Adaptive Sampling Keyframe Selection for stable and representative frame selection, Spherical Linear Interpolation for text-grounded compositional retrieval, and a beam-based temporal search to ensure event-level coherence. Together, these components enable joint modeling of intra-scene semantics and inter-event dependencies. Evaluated on the HCMAI 2025 benchmark, Poly-Temporal Search achieved finalist-level performance, demonstrating the effectiveness of unified multimodal reasoning for complex video retrieval tasks and highlighting the importance of bridging compositional and temporal paradigms in multimodal understanding.
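
Spherical linear interpolation, which the abstract uses for text-grounded compositional retrieval, can be sketched directly; the interpolation weight and vectors below are illustrative.

```python
# Minimal sketch of spherical linear interpolation (slerp) between two
# normalised embeddings, e.g. to blend an image embedding with a text
# embedding for composed retrieval. Inputs here are random placeholders.
import numpy as np

def slerp(p: np.ndarray, q: np.ndarray, t: float) -> np.ndarray:
    p, q = p / np.linalg.norm(p), q / np.linalg.norm(q)
    omega = np.arccos(np.clip(np.dot(p, q), -1.0, 1.0))
    if omega < 1e-6:                 # nearly parallel: fall back to lerp
        return (1 - t) * p + t * q
    return (np.sin((1 - t) * omega) * p + np.sin(t * omega) * q) / np.sin(omega)

a, b = np.random.rand(512), np.random.rand(512)
mid = slerp(a, b, 0.5)               # blended query embedding
print(np.linalg.norm(mid))           # stays unit length on the sphere
```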

LGCA: Enhancing Semantic Representation via Progressive Expansion

ABSTRACT. Recent advancements in large-scale pretraining in natural language processing have enabled pretrained vision-language models such as CLIP to effectively align images and text, significantly improving performance in zero-shot image classification tasks. Subsequent studies have further demonstrated that cropping images into smaller regions and using large language models to generate multiple descriptions for each caption can further enhance model performance. However, due to the inherent sensitivity of CLIP, random image crops can introduce misinformation and bias, as many images share similar features at small scales. To address this issue, we propose Localized-Globalized Cross-Alignment (LGCA), a framework that first captures the local features of an image and then repeatedly selects the most salient regions and expands them. The similarity score is designed to incorporate both the original and expanded images, enabling the model to capture both local and global features while minimizing misinformation. Additionally, we provide a theoretical analysis demonstrating that the time complexity of LGCA remains the same as that of the original model prior to the repeated expansion process, highlighting its efficiency and scalability. Extensive experiments demonstrate that our method substantially improves zero-shot performance across diverse datasets, outperforming state-of-the-art baselines.

Visual Retrieval-Augmented Generation for Silhouette-Guided Animal Art

ABSTRACT. Generative AI has advanced the ability to render photorealistic or artistic images, yet it remains limited in a key aspect of human creativity: interpreting ambiguous shapes. This phenomenon, rooted in pareidolia, allows humans to perceive meaningful forms in random patterns such as clouds, stones, or leaves. To computationally replicate this imaginative process, we introduce Visual Retrieval-Augmented Generation (Visual-RAG), a framework that generates animal art directly from natural silhouettes. Our method retrieves structurally similar animal shapes from a curated corpus of 28,586 high-quality silhouettes and uses them as reference exemplars to guide diffusion-based generation with ControlNet and IP-Adapter. Ablation studies confirm that Shape Context with RANSAC provides the most accurate alignment, while removing shape standardization reduces the inlier ratio to just 13.4%, underscoring the importance of structural fidelity in Visual-RAG. A user study with 12 participants evaluated the outputs in terms of aesthetics, silhouette fidelity, and overall impression. Results reveal that while Visual-RAG provides plausible interpretations, challenges remain in achieving high perceptual impact. This work lays the foundation for computational pareidolia, showing how machines can contribute to the early stages of imaginative discovery.

FLUID: Flow-Latent Unified Integration via Token Distillation for Expert Specialization in Multimodal Learning

ABSTRACT. Multimodal classification requires robust integration of visual and textual signals, yet common fusion strategies are brittle and vulnerable to modality-specific noise. In this paper, we present FLUID (Flow-Latent Unified Integration via Token Distillation for Expert Specialization), a principled token-level pipeline that improves cross-modal robustness and scalability. FLUID contributes three core elements: (1) Q-transforms, learnable query tokens that distill and retain salient token-level features from modality-specific backbones; (2) a two-stage fusion scheme that enforces cross-modal consistency via contrastive alignment and then performs adaptive, task-aware fusion through a gating mechanism and a Q-bottleneck that selectively compresses information for downstream reasoning; and (3) a lightweight, load-balanced Mixture-of-Experts at prediction time that enables efficient specialization to diverse semantic patterns. Extensive experiments demonstrate that FLUID attains 91% accuracy on the GLAMI-1M benchmark, significantly outperforming prior baselines and exhibiting strong resilience to label noise, long-tail class imbalance, and semantic heterogeneity. Targeted ablation studies corroborate both the individual and synergistic benefits of the proposed components, positioning FLUID as a scalable, noise-resilient solution for multimodal product classification.

Research Paper Quality Recognition Through Textual Feature Analysis

ABSTRACT. Knowledge and innovation are shaped by the quality and credibility of scientific research. Yet, distinguishing impactful, high-quality work from flawed studies remains a challenge. This paper introduces a benchmark for classifying research papers into two categories: good (highly cited) and non-good (retracted), using only textual features from titles and abstracts. We evaluate multiple embedding techniques, including SBERT, Word2Vec, FastText, USE, and TF-IDF, combined with classifiers such as Support Vector Machines (SVM), Random Forests, and Neural Networks. Our contributions include: (1) hyperparameter transparency, (2) feature space visualizations using t-SNE, (3) model interpretability analysis with SHAP, and (4) detailed examination of error cases. Experimental results show that a neural network with SBERT embeddings achieves 87.22% accuracy, while FastText combined with SVM reaches 91.12%. These findings highlight the value of textual information in assessing research quality, with ethical considerations for deployment. This work contributes toward the development of academic integrity tools that promote trustworthy scholarship.
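
For illustration, a minimal sketch of one feature-plus-classifier combination of the kind evaluated here (TF-IDF with a linear SVM); the toy texts and labels are placeholders, not the paper's data.

```python
# Minimal sketch of a text-feature + classifier pipeline: TF-IDF features
# from title/abstract text feeding a linear SVM. Toy data only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["robust method with extensive validation and ablations",
         "results later retracted due to image duplication concerns"]
labels = [1, 0]  # 1 = good (highly cited), 0 = non-good (retracted)

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["novel benchmark with transparent hyperparameters"]))
```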

Efficient Probabilistic Cross-Modal Retrieval via Top-k Selection and Fast Embedding Learning

ABSTRACT. Image-Text Matching (ITM) is a core task in vision-language research, enabling cross-modal retrieval and zero-shot classification. While deterministic embedding methods map inputs to fixed vectors in a shared space, they often struggle to capture the semantic diversity inherent in multimodal data. Probabilistic embeddings offer a more expressive alternative by modeling inputs as distributions, but existing approaches face challenges in feature selection and training efficiency. In this work, we propose FAST-PCME, a novel probabilistic ITM framework that enhances the PCME architecture with two key innovations: (i) a top-k token selection strategy that filters out less informative features before pooling, and (ii) a fast probabilistic embedding learning mechanism that reformulates the matching objective for accelerated convergence. Extensive experiments on ECCV Caption, CxC, and MS COCO benchmarks demonstrate that FAST-PCME achieves state-of-the-art performance while reducing training time and improving semantic expressiveness.
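
A hedged sketch of top-k token selection before pooling; the salience measure (token L2 norm) is an illustrative assumption, since the paper's selection criterion is not spelled out in the abstract.

```python
# Hedged sketch of top-k token selection: keep only the k tokens with
# the highest salience scores, then mean-pool them. The salience measure
# used here (L2 norm) is an illustrative assumption.
import torch

def topk_pool(tokens: torch.Tensor, k: int) -> torch.Tensor:
    """tokens: (batch, seq_len, dim) -> pooled (batch, dim)."""
    salience = tokens.norm(dim=-1)                      # (batch, seq_len)
    idx = salience.topk(k, dim=-1).indices              # (batch, k)
    gathered = torch.gather(
        tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
    return gathered.mean(dim=1)

x = torch.randn(2, 77, 512)      # e.g., a CLIP-style token sequence
print(topk_pool(x, k=16).shape)  # torch.Size([2, 512])
```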

Text-Based Person Search in Low-Resource Scenarios

ABSTRACT. Text-based person search (TBPS), which aims to identify individuals from natural language descriptions, has many potential applications in surveillance, security, and forensics. While impressive results have been achieved with large-scale annotated training data, performance in low-resource scenarios, where only a few text–image pairs are available or only images exist, remains a significant challenge. This paper addresses TBPS in low-resource scenarios where the number of image–text pairs is limited. To address the scarcity of high-quality image–text pairs in TBPS, we integrate a new description generation module into the existing TBPS-CLIP framework [1]. Three scenarios are proposed for the description generation module: Scenario 1, direct multimodal generation; Scenario 2, captioning via fine-tuning; and Scenario 3, VQA-based generation. Scenarios 2 and 3 require a small number of image–text pairs to fine-tune the models, whereas Scenario 1 operates without image–text pairs. Furthermore, to better guide the description generation module, we design a new set of questions. Extensive experiments on the CUHK-PEDES dataset demonstrate that our method achieves strong retrieval performance, reaching 71.60% R@1, 90.66% R@5, and 95.25% R@10, even with only a limited number of image–text pairs for fine-tuning the description generation module. Moreover, in the best case, the proposed question set yields a 16.1% improvement in R@1 over the baseline question list.

GigaCount: Enhancing Crowd Counting by Integrating a Multi-Scale Feature Fusion Model into CLIP-EBC

ABSTRACT. Crowd counting has emerged as a vital task in computer vision, driving applications ranging from urban planning to public safety. Despite advances, challenges remain in handling diverse crowd scenarios, such as low-light scenes, distorted human figures, and extremely dense crowds. To handle these problems, we propose GigaCount, a multi-scale vision-language model that leverages Contrastive Language-Image Pretraining with Enhanced Blockwise Classification to enhance crowd counting performance. Our approach integrates ConvNeXt and its multi-scale feature fusion capabilities into CLIP, further addressing key challenges in crowd analysis. We evaluate the model’s effectiveness by analyzing density maps and conducting ablation studies, which reveal patterns in prediction errors and their underlying causes. These findings guide targeted enhancements, including data augmentation to boost robustness across diverse lighting conditions, loss function adjustments to enhance accuracy in dense scenes, and layer removal to minimize model size and computational cost. Achieving a competitive MAE of 103.3, our model introduces a novel, lightweight architecture that integrates multi-scale feature fusion into CLIP’s image encoder. Although it does not outperform state-of-the-art methods, this approach surpasses almost all conventional CNN-based techniques, underscoring the potential of multi-scale vision-language models in crowd analysis and laying a foundation for further advancements. The implementation is available at https://github.com/AdamHermes/GigaCount.

Integrating Motion-based Technique and Deep Learning for Expression Analysis in Vietnamese Traditional Chèo

ABSTRACT. Due to the lack of specific datasets and studies on expression in Vietnam’s traditional Chèo theatre, this paper presents a framework that integrates motion-based preprocessing with deep learning to analyze those expressions within this art form. We utilize Eulerian Video Magnification (EVM) and dense optical flow techniques to automatically detect and segment subtle facial movements, resulting in a dataset of 7,166 expression segments. Of these, 3,353 apex frames are manually annotated by experts at the Hanoi Academy of Theatre and Cinema. Using this dataset, we evaluate several popular convolutional and transformer-based architectures under transfer-learning settings. Among them, VGGFace achieves the highest accuracy (81.15%) and Cohen’s kappa score (0.703), closely followed by ResNet18 (80.77% accuracy and 0.6993 kappa). These results demonstrate the effectiveness of motion-based extraction in the challenging performing-arts context of Chèo and lay a foundation for future cultural-heritage preservation and the development of educational tools.

VisionGuard: Synergistic Framework for Helmet Violation Detection

ABSTRACT. Enforcing helmet regulations among motorcyclists is essential for enhancing road safety and ensuring the effectiveness of traffic management systems. However, automatic detection of helmet violations faces significant challenges due to environmental variability, camera angles, and inconsistencies in the data. These factors hinder reliable detection of motorcycles and riders and disrupt consistent object classification. To address these challenges, we propose VisionGuard, a synergistic multi-stage framework designed to overcome the limitations of frame-wise detectors, especially in scenarios with class imbalance and inconsistent annotations. VisionGuard integrates two key components: Adaptive Labeling and Contextual Expander modules. The Adaptive Labeling module is a tracking-based refinement technique that enhances classification consistency by leveraging a tracking algorithm to assign persistent labels across frames and correct misclassifications. The Contextual Expander module improves recall for underrepresented classes by generating virtual bounding boxes with appropriate confidence scores, effectively addressing the impact of data imbalance. Experimental results show that VisionGuard improves overall mAP by 3.1% compared to baseline detectors, demonstrating its effectiveness and potential for real-world deployment in traffic surveillance systems, ultimately promoting safety and regulatory compliance.

Edit3DGS: Unified Framework for Dynamic Head Editing via 2D Instruction-Guided Diffusion and 3D Gaussian Splatting

ABSTRACT. We present Edit3DGS, a unified framework for dynamic 3D head editing that integrates 2D instruction-guided diffusion with 3D Gaussian splatting. Unlike prior approaches that separately address frame-based edits or static 3D reconstruction, our method couples semantic controllability in the image domain with photorealistic, temporally consistent 3D representations. Given an input video, editable facial regions are masked and modified using a text-conditioned diffusion model to support fine-grained operations such as expression transformation, attribute modification, and appearance refinement. The edited frames are then aggregated through 3D Gaussian splatting to produce a coherent, high-fidelity avatar that preserves both identity and motion dynamics. To enforce consistency, Edit3DGS incorporates multi-view batch editing and lightweight inpainting strategies that recover lost expressions across timesteps. Experimental results demonstrate that our framework enables controllable, artifact-free head editing with smooth temporal transitions, offering practical applications in virtual avatars, immersive communication, film production, and interactive media.

VNProductKIE: A Dataset and Three-Stage Pipeline for Key Product Information Recognition on Vietnamese Packaging Labels

ABSTRACT. Food waste poses serious environmental and economic concerns, often worsened by the lack of accessible product information. Automated extraction from packaging labels offers a promising solution, yet existing datasets fall short in representing the linguistic and visual diversity found in the Vietnamese market. This paper introduces VNProductKIE, a new dataset of high-resolution images capturing Vietnamese food and beverage packaging. It features both English and Vietnamese text, diacritic-rich scripts, local date formats, and real-world distortions such as blur, curvature, and clutter. To extract structured information, this paper proposes a three-stage pipeline including: (1) a YOLO11x-based detector for locating key regions (e.g., product name, weight, brand, expiration date), (2) a word-level detector for segmenting individual words, and (3) a VietOCR-based recognizer for transcription. The final output is structured into complete product metadata. In experiments on VNProductKIE, the pipeline achieved a word-level recognition accuracy of 98.85%, highlighting the effectiveness of the proposed approach.

MMCS: Multimodal Mamba Channel Switching for Object Detection via RGB-IR Fusion

ABSTRACT. Object detection in low-light conditions presents significant challenges due to the presence of noise and reduced contrast in conventional RGB images. Furthermore, optimizing the number of model parameters and computational efficiency remains problematic. The paper proposes MMCS, an efficient object detection framework that harnesses multimodal data derived from paired RGB and Infrared (IR) images. Environmental features are processed via two distinct streams, extracted using the Mamba backbone, and propagated through successive layers. To effectively integrate and prioritize salient information, channel switching spatial attention module blocks are incorporated, employing pooling and attention mechanisms. The refined features are subsequently forwarded to the decoder for final processing. The proposed model was evaluated on two benchmark datasets, LLVIP and FLIR. Experimental results demonstrate that MMCS outperforms existing approaches, achieving substantially higher accuracy, exhibiting a 1.3-fold increase in mean Average Precision (mAP) compared to YOLO-based models and other methods utilizing both RGB and IR modalities. Ultimately, the combination of the robust state-space modeling capabilities of Mamba with an intelligent multimodal information exploitation strategy enhances object recognition performance under varying environmental conditions.

Balancing Quality, Speed, and Compactness of 3D Gaussian Splatting

ABSTRACT. 3D Gaussian Splatting has transformed the field of novel view synthesis by enabling real-time rendering of high-quality, photorealistic scenes. However, its practical application is often hindered by long training times and the large memory footprint of the resulting models. While methods like DashGaussian accelerate training and GaussianSpa creates compact models, they operate independently; GaussianSpa’s sparsification comes at a significant training time cost, and accelerated methods like DashGaussian still produce large final models. The potential for a combined framework that integrates these approaches to efficiently tackle these bottlenecks remains unexplored. To address this, we first introduce a novel three-stage training schedule that integrates the coarse-to-fine acceleration of DashGaussian with the sparsity-enforcing framework of GaussianSpa. Our experiments demonstrate that this method achieves state-of-the-art model compactness, reducing model sizes by up to an order of magnitude with an acceptable trade-off in visual fidelity. Crucially, this compression is achieved efficiently, with training times significantly shorter than standalone sparsification methods. Furthermore, we introduce an experimental variant that replaces gradient-based densification with a direct sampling strategy to explore the limits of training acceleration. Our results show that this second approach achieves the fastest training times by a significant margin. Overall, our work presents two distinct solutions to key 3DGS bottlenecks: one that yields exceptionally compact models, and another that generates models in a fraction of the standard training time.

OTGen-FSIS: Optimal Transport–Driven Feature Generation for Few-Shot Instance Segmentation

ABSTRACT. Few-shot instance segmentation (FSIS) extends the challenges of few-shot object detection (FSOD) by requiring not only object localization but also precise pixel-level mask prediction for novel categories with only a few labeled samples. This task is particularly difficult because limited supervision makes it hard to capture intra-class variations, and existing generative approaches often produce synthetic features that misalign with real distributions, resulting in degraded segmentation quality. To overcome these limitations, we propose OTGen-FSIS, an Optimal Transport–Driven Feature Generator for FSIS. Our approach introduces a conditional generator trained with an OT-based loss and clustering, enabling the synthesis of diverse and representative features for novel classes by leveraging variations from base classes. Unlike prior work, our method relies solely on an unsupervised OT loss to optimize the generator. This design not only stabilizes generator learning but also offers flexibility for future extensions to tasks without labeled samples. By capturing global geometric relationships between distributions, the optimal transport (OT) loss provides stronger alignment than point-wise losses, reducing distributional mismatch and improving generalization to novel categories. Extensive experiments on standard FSIS benchmarks demonstrate that OTGen-FSIS significantly enhances novel class segmentation while maintaining strong performance on base classes, validating the effectiveness of OT-based distribution matching in few-shot instance segmentation.

DAKTA: Directional Kolmogorov-Arnold Classifier for Task Arithmetic in Continual Learning

ABSTRACT. Continual learning requires models to acquire new knowledge while preserving previously learned information, a fundamental challenge known as the stability-plasticity tradeoff. While recent advances in model compositionality through task arithmetic show promise, existing approaches primarily rely on linear classifiers that struggle to maintain stable classification spaces during incremental learning. In this paper, we propose Directional Arithmetic Kolmogorov Task Architecture (DAKTA), a strategy that addresses both compositionality and classification stability in continual learning. DAKTA leverages second-order Taylor approximation theory to ensure task vectors remain within the pre-training basin, enabling effective model composition for enhanced stability. Our key innovation lies in replacing conventional linear classifiers with learnable Gaussian Radial Basis Functions (RBF) that provide selective channel activation, enhanced locality properties, and significantly reduced interference between tasks, thus boosting plasticity capabilities. Furthermore, we propose a novel directional logit fusion mechanism that combines RBF-based magnitude information with feature directional cues, enabling the classifier to capture both local similarity patterns and global directional relationships for more robust class discrimination. Experiments on ImageNet-R, CUB-200, and CIFAR-100 demonstrate significant improvements of DAKTA compared to state-of-the-art methods.
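
A minimal sketch of a Gaussian RBF classification head of the kind the abstract describes, with one learnable prototype and bandwidth per class; the directional logit fusion is omitted and all shapes are illustrative.

```python
# Hedged sketch: a Gaussian RBF head replacing a linear classifier. Each
# class keeps a learnable prototype (center) and bandwidth, and the logit
# is the RBF response of the feature vector to that prototype.
import torch
import torch.nn as nn

class RBFHead(nn.Module):
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, dim))
        self.log_gamma = nn.Parameter(torch.zeros(num_classes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Squared distances between features (B, D) and centers (C, D).
        d2 = torch.cdist(x, self.centers).pow(2)          # (B, C)
        return torch.exp(-self.log_gamma.exp() * d2)      # RBF logits

head = RBFHead(dim=768, num_classes=100)
print(head(torch.randn(4, 768)).shape)  # torch.Size([4, 100])
```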

CIAN: Multi-Stage Framework for Event-Enriched Image Captioning via Retrieval-Augmented Generation

ABSTRACT. Event-enriched image captioning seeks to generate descriptions that convey not only what is visible in an image but also the broader context surrounding the depicted event. However, most existing models remain limited to pixel-level information and fail to integrate non-visual knowledge such as timing, location, or participants. To address this limitation, we propose the Contextual Image-Article Narrator (CIAN), a multi-stage framework that enriches image captions with external contextual narratives. CIAN first employs a context retrieval stage leveraging the SigLIP model to retrieve semantically relevant articles for each query image. The retrieved textual content is then summarized and used to guide a narrative generation stage, where a LoRA-fine-tuned Qwen model produces an event-aware caption. Finally, an n-gram-based refinement step enhances linguistic fluency and domain-specific coherence. Evaluated on the OpenEvents-V1 benchmark, CIAN achieves strong retrieval performance (mAP of 0.979) and improves captioning quality, boosting the CIDEr score from 0.030 to 0.094 after refinement. These results demonstrate the effectiveness of combining retrieval-augmented reasoning with progressive linguistic refinement, advancing the development of AI systems capable of generating holistic, human-like visual narratives.

Improving Code-Switching Speech Synthesis via Concatenated Tokenizers

ABSTRACT. Code-switching text-to-speech (CS-TTS) remains a challenging task due to phonetic ambiguity, orthographic overlap, and prosodic discontinuity between languages. Existing approaches have not fully addressed the issue of cross-lingual interference, often resulting in degraded pronunciation accuracy and unnatural transitions. In this work, we draw inspiration from the concatenated tokenizer strategy originally introduced for code-switching automatic speech recognition (ASR) and adapt it to the CS-TTS setting. The core idea is to assign each language its own tokenizer and allocate disjoint token identifier spaces by shifting the token IDs of the secondary language by an offset equal to the vocabulary size of the primary language. This design ensures that even visually identical graphemes are mapped to distinct token IDs according to their language origin, thereby encoding both character form and language identity directly at the token level. By eliminating cross-lingual ambiguity in the text encoding stage, the model learns language-specific pronunciation rules while maintaining the ability to generate smooth and natural cross-lingual speech. Experimental evaluations, including subjective mean opinion scores and objective metrics, demonstrate that our method significantly improves pronunciation accuracy, naturalness, and the smoothness of cross-lingual transitions. To the best of our knowledge, this is the first application of the concatenated tokenizer paradigm to CS-TTS, providing a simple yet principled solution to cross-lingual interference in multilingual speech synthesis.
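
The token-ID offset idea is simple enough to sketch: each language keeps its own tokenizer, and secondary-language IDs are shifted by the primary vocabulary size so identical graphemes never collide. The toy character tokenizers below are placeholders for real ones.

```python
# Minimal sketch of the concatenated-tokenizer scheme: disjoint token-ID
# spaces per language, with the secondary language offset by the primary
# vocabulary size. ToyTokenizer stands in for real tokenizers.
class ToyTokenizer:
    def __init__(self, vocab: str):
        self.vocab = {ch: i for i, ch in enumerate(vocab)}
    def encode(self, text: str) -> list[int]:
        return [self.vocab[ch] for ch in text if ch in self.vocab]
    @property
    def vocab_size(self) -> int:
        return len(self.vocab)

primary = ToyTokenizer("abcdeghinotuv ")        # e.g., Vietnamese side
secondary = ToyTokenizer("abcdefghilmnorstw ")  # e.g., English side
OFFSET = primary.vocab_size                     # start of secondary IDs

def encode_cs(segments: list[tuple[str, str]]) -> list[int]:
    """segments: (text, lang) chunks of a code-switched sentence."""
    ids = []
    for text, lang in segments:
        if lang == "primary":
            ids += primary.encode(text)
        else:  # shift secondary-language IDs into their own range
            ids += [t + OFFSET for t in secondary.encode(text)]
    return ids

print(encode_cs([("toi dung ", "primary"), ("smartwatch", "secondary")]))
```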

Impact of Foggy Weather on Anomaly Detection in Aerial Traffic Surveillance Videos: An In-Depth Analysis

ABSTRACT. Video anomaly detection is vital for intelligent traffic surveillance systems to enhance public safety and security. Despite significant advancements, current methods struggle with adverse weather conditions, particularly fog, which degrades image quality and visibility, complicating the identification of abnormal events. This study investigates the impact of fog on anomaly detection performance in aerial traffic surveillance videos. We utilize the DW-GAN algorithm to generate realistic fog scenarios on two benchmark datasets, UIT-ADrone and Drone-Anomaly. Extensive experiments are conducted with six state-of-the-art anomaly detection methods to evaluate their performance under foggy conditions. Our results reveal significant performance degradation across all methods, highlighting the challenges posed by fog. Additionally, we evaluate the impact of data preprocessing by testing model performance on datasets dehazed with GridFormer, demonstrating the advantage of dehazing as a preprocessing step for anomaly detection under foggy conditions. Furthermore, we perform in-depth analyses to identify the practical limitations of current approaches and discuss potential directions for future research. Our findings contribute to a better understanding of anomaly detection in adverse weather and provide insights for developing more robust models. The datasets and source code are available online at https://github.com/PhatNC/AnomalyDetection.

Lightweight digital signature algorithms based on linear public keys

ABSTRACT. Digital signatures represent one of the most widely adopted applications of public-key cryptography. Over the years, numerous schemes have been introduced, underscoring the importance and continued relevance of research in this field. As a cornerstone of modern security infrastructures, digital signatures support a wide range of applications, including secure economic systems, financial transactions, and national security frameworks. Consequently, a comprehensive understanding and effective implementation of these schemes are essential not only for cryptography specialists but also for the broader user community. Classical signature algorithms such as RSA and ElGamal have long served as foundational cryptographic solutions. Over time, advanced models such as blind signatures and ring signatures have emerged, extending their applicability across diverse domains. Moreover, the adoption of digital signatures in cloud-based environments has grown significantly in recent years. Among current approaches, schemes that utilize public keys of the form k + r × p have attracted considerable attention due to their efficiency and structural simplicity. Building on this trend, the present work introduces a lightweight digital signature algorithm grounded in this principle, aimed at addressing the constraints of resource-limited environments. In addition to detailing the proposed construction, we provide a thorough security analysis and performance evaluation, comparing it with existing solutions to demonstrate its practicality and effectiveness.

Exploring Multi-Modal Large Language Models and Two-Stage Fine-Tuning for Fashion Image Retrieval

ABSTRACT. Composed Image Retrieval (CIR) retrieves a target image using a composed query of a reference image and a modified text description. In the fashion domain, this task requires understanding subtle attribute variations such as color, pattern, and texture. However, existing approaches face limitations due to scarce annotated data and simplistic negative sampling. We propose a novel framework that integrates a multi-modal large language model (LLaVA) to generate attribute-aware triplets and introduces a two-stage fine-tuning strategy to enhance contrastive learning. We leverage pretrained vision-language models, such as CLIP-ViT/B32, to generate and concatenate sentence-level prompts with the relative caption and to scale the number of negatives using static representations. Due to resource constraints, experiments were conducted on a representative subset of the dataset. Preliminary findings demonstrate improved compositional reasoning and fine-grained retrieval behavior, indicating the feasibility and potential of the proposed framework for future full-scale evaluation.

An RGB-D Dataset of Isolated Vietnamese Sign Language

ABSTRACT. In this paper, we introduce ViSL120, the first large-scale multimodal dataset for Vietnamese Sign Language (ViSL). In contrast to previous tiny, RGB-only, or multi-view datasets, ViSL120 provides more than 50,000 videos in 120 glosses that were taken with a single Intel RealSense D435 camera in both RGB and depth. This dataset ensures easy data collection and deployment while offering rich indications from hands, bodies, and facial expressions. To demonstrate its utility, we establish benchmark results with state-of-the-art sign language recognition models, revealing both the challenges of ViSL and the potential for robust model development. ViSL120 enhances Vietnamese sign language resources, supports assistive technology for the deaf community, and advances the research community.

13:30-15:30 Session 10A: SOICT Technical Session XVII: Lifelog Event Retrieval
Location: Ballroom A
13:30
U-CESE: Unified Clip-based Event Search Engine for AI Challenge HCMC 2025

ABSTRACT. Retrieving events from large-scale video datasets is challenging due to complex temporal, spatial, and multimodal information. This paper presents U-CESE, our solution for the AI Challenge HCMC 2025, a Unified Clip-based Event Search Engine for multimodal event retrieval across diverse video sources. Building on CESE, U-CESE integrates its three modules into a single cohesive framework, ensuring consistent processing and retrieval across query types. A core component is the Unified Clipping Algorithm, which merges separate clipping algorithms into one efficient pipeline. To handle large-scale data, we propose DAKE, a lightweight, training-free keyframe extraction method that uses JPEG file-size variations to identify significant scene changes. Finally, we introduce ReCap, a temporally consistent captioning framework inspired by recurrent neural networks, which generates detailed and context-aware textual descriptions. Experiments show that U-CESE delivers robust, consistent, and efficient performance in large-scale multimodal event retrieval.
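
A hedged sketch of the DAKE idea: re-encode each frame as JPEG and flag a keyframe wherever the compressed size jumps, since scene changes alter compressibility. The OpenCV pipeline, threshold, and input filename are assumptions, not the paper's settings.

```python
# Hedged sketch of JPEG-size-based keyframe extraction: a relative jump
# in compressed frame size suggests a scene change. Threshold and input
# path are hypothetical.
import cv2

def dake_keyframes(video_path: str, rel_jump: float = 0.25) -> list[int]:
    cap = cv2.VideoCapture(video_path)
    keyframes, prev_size, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        ok, buf = cv2.imencode(".jpg", frame)   # compress the frame
        size = len(buf)
        if prev_size is not None and abs(size - prev_size) / prev_size > rel_jump:
            keyframes.append(idx)               # size jumped: candidate keyframe
        prev_size, idx = size, idx + 1
    cap.release()
    return keyframes

print(dake_keyframes("news_clip.mp4"))  # hypothetical input file
```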

13:50
Visionary: Optimized Temporal Video Retrieval via Large Language Model-Enhanced Query Processing

ABSTRACT. The rapid growth of video content necessitates efficient, real-time event retrieval systems. Addressing the Ho Chi Minh City AI Challenge 2025, we present Visionary, a new generation of the NewsInsight systems. Our system introduces four key contributions: (1) a novel adaptive keyframe extraction algorithm; (2) an enhanced pre-processing pipeline using the Qwen3-VL model for metadata generation and integrated optical character recognition; (3) a flexible architecture supporting multiple embedding models; and (4) the use of Reciprocal Rank Fusion to synthesize retrieval results. These enhancements aim to substantially improve retrieval accuracy and overall performance for complex, large-scale video retrieval tasks.

14:10
KPTER: K-Pointer for Temporal Event Retrieval

ABSTRACT. The explosive proliferation of online video content necessitates advanced retrieval systems, as promoted by the Ho Chi Minh AI Challenge (AIC) 2025. This competition comprises three tasks: Known-item Search (KIS), Visual Question Answering (VQA), and the newly introduced Temporal Retrieval and Alignment of Key Events (TRAKE). To address these challenges, we propose a comprehensive multimodal retrieval framework. Our system integrates heterogeneous data sources, including semantic embeddings from CLIP and BEIT-3, Optical Character Recognition (OCR), Automatic Speech Recognition (ASR), and open-vocabulary object detection. The architecture features a dual-layer search mechanism. For discrete queries (KIS, VQA), a Multi-modal Search layer retrieves and ranks results from parallel data streams using the Weighted Reciprocal Rank Fusion (WRRF) algorithm. To address the sequential nature of the TRAKE task, we introduce a novel Temporal Search module built upon an efficient K-pointer sequential re-ranking algorithm. This algorithm effectively validates and ranks video segments containing events in a specified temporal order. Finally, we present a competition-oriented user interface designed for real-time interaction, supporting multi-stage query construction and precise temporal refinement tools.
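
A simplified sketch of a K-pointer sequential check in the spirit of the module described above: one candidate list per event, with pointers advancing to enforce temporal order. The greedy choice is an illustrative simplification of the paper's re-ranking algorithm.

```python
# Hedged sketch of K-pointer sequential validation: one candidate list
# per query event (each entry (timestamp, score), sorted by score), and
# each event's pointer advances past out-of-order candidates.
def k_pointer_align(event_lists):
    """event_lists[k] = [(timestamp, score), ...] sorted by score desc."""
    chain, last_t, total = [], float("-inf"), 0.0
    for candidates in event_lists:
        pick = next(((t, s) for t, s in candidates if t > last_t), None)
        if pick is None:
            return None, 0.0              # no valid in-order completion
        chain.append(pick[0])
        last_t, total = pick[0], total + pick[1]
    return chain, total

events = [[(12.0, 0.9), (80.0, 0.6)],     # event 1 candidates
          [(5.0, 0.95), (30.0, 0.7)],     # event 2 candidates
          [(44.0, 0.8)]]                  # event 3 candidates
print(k_pointer_align(events))            # ([12.0, 30.0, 44.0], ~2.4)
```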

14:30
MADTempo: An Interactive System for Multi-Event Temporal Video Retrieval with Query Augmentation

ABSTRACT. The rapid expansion of video content across online platforms has intensified the need for retrieval systems capable of understanding not only isolated visual moments but also the temporal structure of complex events. Existing approaches often fall short in modeling temporal dependencies across multiple events and in handling queries that reference unseen or rare visual concepts. To address these challenges, we introduce MADTempo, a video retrieval framework that unifies temporal search with web-scale visual grounding. Our temporal search mechanism captures event-level continuity by aggregating similarity scores across sequential video segments, enabling coherent retrieval of multi-event queries. Complementarily, a Google Image Search–based fallback module expands query representations with external web imagery, effectively bridging gaps in pretrained visual embeddings and improving robustness against out-of-distribution (OOD) queries. Together, these components advance the temporal reasoning and generalization capabilities of modern video retrieval systems, paving the way for more semantically aware and adaptive retrieval across large-scale video corpora.

14:50
AIthena-Vision: Adaptive Temporal Multimodal Event Retrieval with LLM-generated Multiperspective Fusion

ABSTRACT. Event retrieval in large-scale video collections remains a challenging task due to the complexity of multimodal content and the semantic gap between user queries and visual data. In this paper, we present AIthena-Vision, our entry for the AI Challenge HCMC 2025, an adaptive system for large-scale video event retrieval designed to effectively address these challenges. The core of our system combines a state-of-the-art Perception Encoder (a CLIP variant) for superior text–visual alignment with LLM-driven multiperspective query expansion to improve retrieval accuracy. Furthermore, the system incorporates adaptive temporal search and integrates multiple sources of multimodal evidence, including OCR, ASR, and object detection. With an interactive interface that supports collaborative searching, AIthena-Vision offers a competitive and effective retrieval solution. In our latest evaluation against the competition ground truth, the system achieved 92% precision, outperforming our previous model, which obtained the highest score among all teams in the third preliminary round last year with 81% precision, demonstrating significant improvement and robustness.

15:10
Lucifer-TRACE: Dynamic Programming and LVLM-Aided Verification for Event-Based Video Retrieval

ABSTRACT. Retrieving complex real-world events from large-scale recorded video has been one of the most fundamental yet underexplored challenges. Most existing systems still treat videos as collections of independent frames, overlooking the temporal continuity and semantic coherence that define real events. As a result, they often fail to align fragmented evidence across time or to verify whether the retrieved segments truly match the user’s intent. We propose Lucifer-TRACE, a multi-modal event retrieval framework that unifies structured temporal reasoning with early semantic validation. At its core, the Soft Temporal Search mechanism employs dynamic programming to link semantically related yet temporally scattered frames into coherent event chains. To complement this, we present an Early Semantic Verification (ESV) module, leveraging a large vision–language model (LVLM) to assess the semantic correctness of candidate segments, providing interpretable feedback without altering retrieval scores. By integrating visual and textual cues in a unified pipeline, Lucifer-TRACE achieves both precision and transparency. Evaluated in the Ho Chi Minh City AI Challenge 2025, our system obtained a score of 86/88, ranking among the top-performing teams. These results highlight the potential of combining dynamic programming–based temporal alignment with lightweight VLM-aided verification for robust, interactive event retrieval.
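
A minimal dynamic-programming sketch in the spirit of Soft Temporal Search: pick one frame per query event so that timestamps increase and total similarity is maximal. The quadratic formulation below is a simplified stand-in, not the paper's implementation.

```python
# Hedged sketch of DP temporal alignment: best[e][f] is the maximal total
# similarity when event e is matched to frame f, with timestamps strictly
# increasing along the chain.
def dp_align(sims, times):
    """sims[e][f]: similarity of frame f to event e; times[f]: seconds."""
    E, F = len(sims), len(times)
    NEG = float("-inf")
    best = [[NEG] * F for _ in range(E)]
    best[0] = list(sims[0])
    for e in range(1, E):
        for f in range(F):
            prev = max((best[e - 1][g] for g in range(F)
                        if times[g] < times[f]), default=NEG)
            if prev > NEG:
                best[e][f] = prev + sims[e][f]
    return max(best[-1])

times = [3.0, 10.0, 25.0, 40.0]
sims = [[0.9, 0.4, 0.2, 0.1],    # event 1 vs each frame
        [0.1, 0.8, 0.5, 0.3],    # event 2
        [0.2, 0.1, 0.7, 0.9]]    # event 3
print(dp_align(sims, times))     # 0.9 + 0.8 + 0.9 = ~2.6
```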

13:30-15:30 Session 10B: SOICT Technical Session XVIII: AI Applications
Location: Ballroom B
13:30
The Privacy–Utility Trade-off in Brain MRI Synthesis: A Comparative Framework for Generative Models

ABSTRACT. Generative Adversarial Networks (GANs) showed significant potential for synthesizing realistic brain MRI scans. However, this capability introduced a critical risk of sensitive patient data leakage due to sample memorization. This study evaluated data leakage mitigation techniques for GANs applied to brain MRI synthesis. We assessed a standard Deep Convolutional GAN (DCGAN) model and its variants incorporating four distinct mitigation strategies: Maximum Entropy GAN (MEGAN), Spectral Normalization (SN), Wasserstein GAN (WGAN) and Differential Privacy (DP). The models’ efficacy was assessed based on their ability to reduce memorization and maintain image quality. We revealed that both MEGAN-SN and WGAN-SN-D provide an optimal balance, significantly reducing privacy risks while maintaining acceptable image quality. Conversely, DPGANs substantially compromised image quality to achieve their strong theoretical privacy guarantees.

13:50
Task-Aware Harmonization of Sentinel-2 for Canopy Height Mapping: A Deep Learning Application in the Ngoc Linh Mountains, Vietnam

ABSTRACT. Accurate mapping of forest canopy height is essential for biomass estimation, carbon accounting, and long-term forest monitoring. However, the heterogeneous spatial resolutions of Sentinel-2 imagery (10 m, 20 m, and 60 m bands) present significant challenges for reliable canopy height estimation. Conventional approaches typically decouple the task into two stages, super-resolution followed by regression, which often introduces error propagation and reduces accuracy. In this study, we propose an end-to-end deep learning framework that jointly performs resolution harmonization and canopy height regression. The model incorporates frequency-enhanced residual blocks and channel attention mechanisms to align all Sentinel-2 bands to a uniform 10 m resolution while extracting task-specific spectral-spatial features. Experiments conducted over the Ngoc Linh Mountains in Central Vietnam demonstrate that the proposed method achieves a mean absolute error of 7.061 ± 0.110 m and a root mean square error of 9.175 ± 0.155 m, outperforming existing baselines. Qualitative analyses further confirm robust canopy structure reconstruction with low prediction uncertainty (STD < 15). These results highlight that integrating harmonization and regression into a unified architecture leads to more accurate and stable canopy height predictions in complex tropical forest landscapes.

14:10
Adaptive Multi-Level Attention for Effective Cross-Domain Brain Tumor Detection

ABSTRACT. We present an innovative approach for unsupervised domain adaptation (UDA) in brain tumor classification, Adaptive Multi-Level Attention (AMLA), which confronts the UDA challenge of domain shift in medical imaging datasets. AMLA addresses cross-domain brain tumor classification without labeled target-domain data through the use of Efficient Channel Attention (ECA) and Dual Self-Attention (DSA) mechanisms. ECA operates at the early stages of the network, where it captures low-level features efficiently, while DSA, which incorporates a Spatial Attention Module (SAM) and a Channel Attention Module (CAM), is applied at the final network stage, where it captures the long-range spatial and channel interactions that are important for separating complicated tumor types. This stage-specific adaptation strikes a balance between cost-effectiveness and feature expressiveness, which is important in alleviating overfitting when performing UDA. Experimental results over three brain tumor datasets demonstrate that the target test accuracy of AMLA outperforms previous UDA methods. Our backbone-agnostic approach ensures robustness and scalability for medical imaging applications.

14:30
Critical Success Factors for AI Adoption: A Multivocal Literature Review and a Top Management Perspective

ABSTRACT. The strategic adoption of Artificial Intelligence (AI) is a critical determinant of competitive advantage, yet organizations face high failure rates. This paper aims to identify and categorize the Critical Success Factors (CSFs) that influence successful AI adoption by synthesizing academic and practitioner knowledge from a top management perspective. We conducted a multivocal literature review of 57 academic and 20 practitioner sources, screened per PRISMA. From vote-count synthesis we identified 16 CSFs grouped via the Technology-Organization-Environment (TOE) framework, with organizational factors dominating, such as leadership support, AI literacy, and cultural readiness. Practitioner evidence corroborates most academic CSFs and adds implementation-centric aspects (e.g., AI scalability, use-case–value alignment, partnerships). Implications for top management and Chief Information Officers include prioritizing data governance, change capability, and portfolio-level value realization. From a scholarly perspective, the paper validates the TOE framework's continued relevance for AI while highlighting the amplified importance of its organizational dimension.

14:50
A Computational Framework for the Personalized Remediation of Reading Difficulties Using Dynamic Bayesian Networks

ABSTRACT. Reading disorders in children pose a significant clinical challenge, with current interventions often limited by their reliance on static assessments. These approaches fail to capture the dynamic nature of skill acquisition and the hierarchical dependencies between cognitive domains that underlie reading ability. This study develops an intervention approach based on the Dynamic Bayesian Network (DBN), a computational model that represents skill acquisition over time, as a diagnostic tool. By mapping the relationships between underlying language abilities, DBNs aim to identify core deficits that lead to reading failure. We conducted an experimental study with two children with word decoding difficulties. The DBN model was used not only to assess performance but also to create a diagnostic profile of each child's specific underlying weaknesses. This profile is then used to develop interventions tailored to the nature of the child's intrinsic difficulties. The positive results demonstrate the effectiveness of this targeted intervention, opening up promising new opportunities for individualized remediation of reading disorders.

15:10
Towards Reliable Oriented Surgical Instrument Detection: Benchmark and Evaluation

ABSTRACT. Accurate detection of surgical instruments is crucial for computer- and robotic-assisted minimally invasive surgery (MIS). Segmentation-based methods provide pixel-level localization but are computationally demanding for real-time use, while axis-aligned detection cannot capture the orientation of elongated and articulated tools such as threads and suturing needles. To address these limitations, we adapt the SAR-RARP50 dataset by converting segmentation masks into oriented bounding box annotations, establishing the first benchmark for oriented detection in robotic surgery. Using this benchmark, we evaluate ten state-of-the-art detectors. Experimental results show that YOLO-based single-stage models achieve the highest mean Average Precision, with YOLOv9 reaching 81.2%, while rotation-aware architectures such as Rotated RetinaNet and SASM achieve the most accurate orientation predictions, with 9.9° error and 92.1% accuracy, and 14.0° error and 86.3% accuracy respectively. These findings highlight a trade-off between real-time detection robustness and orientation precision, providing a foundation for hybrid designs, class-balanced strategies, and the development of reliable perception systems for surgical robotics.

13:30-15:30 Session 10C: SOICT Technical Session XIX: Recent Advances in Cyber Security
Location: Yersin A
13:30
Robust Intrusion Detection and Classification in EVSE Using Ensemble Methods

ABSTRACT. This paper presents a novel machine learning-based approach for intrusion detection and classification in Electric Vehicle Supply Equipment (EVSE). Focusing exclusively on network traffic data, we explore how ensemble learning methods can enhance threat detection in both binary and multi-class classification tasks. By leveraging a combination of classifiers such as K-Nearest Neighbors (KNN), Multi-Layer Perceptron (MLP), and Extreme Gradient Boosting (XGBoost), our approach demonstrates improved accuracy and robustness over individual models. The study evaluates multiple ensemble strategies, including majority voting and soft voting, to identify the most effective techniques for real-time threat classification in EVSE environments. Evaluated on the CICEVSE2024 dataset, our approach achieves exceptional performance, with 99.93% accuracy in binary classification and 99.57% accuracy in multi-class classification. Our findings contribute to the growing field of intelligent cybersecurity solutions in electric mobility systems, highlighting the critical role of network-level machine learning analysis in protecting EVSE against evolving cyber threats.
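
A minimal sketch of the soft-voting ensemble strategy evaluated here, using scikit-learn's VotingClassifier over KNN, MLP, and XGBoost; the random feature matrix stands in for CICEVSE2024 network-traffic features.

```python
# Minimal sketch of a soft-voting ensemble combining KNN, MLP, and
# XGBoost by averaging class probabilities. Toy data stands in for the
# CICEVSE2024 features; xgboost's sklearn wrapper is assumed installed.
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
import numpy as np

X, y = np.random.rand(200, 20), np.random.randint(0, 2, 200)  # toy data

ensemble = VotingClassifier(
    estimators=[("knn", KNeighborsClassifier(n_neighbors=5)),
                ("mlp", MLPClassifier(max_iter=500)),
                ("xgb", XGBClassifier(eval_metric="logloss"))],
    voting="soft")           # average predicted probabilities
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```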

13:50
FOAMI: Enhancing ICS Threat Detection via Feature Optimization, Realistic Augmentation, and Mutual Inference

ABSTRACT. Industrial Control Systems (ICS) now face a multitude of advanced cyber threats, particularly those associated with the Internet. However, the existing literature still exhibits several research gaps, including restricted sample sizes, subpar training-data quality, and a failure to capitalize on the complementary strengths of different AI models. This study introduces a novel ICS threat detection system, named FOAMI, which incorporates feature optimization, realistic data augmentation techniques, and mutual inference. FOAMI facilitates superior feature performance, increased layer separation through targeted aggregation, and overall learning enhancement via mutual inference. Extensive experimental findings on the standard industrial dataset IEC 60870-5-104 indicate that our approach significantly enhances detection accuracy, attaining a detection rate of 89.00% and decreasing the false alarm rate to only 0.99%, outperforming state-of-the-art approaches.

14:10
A Novel Framework for Android Malware Detection Based on Function Call Graph Pruning and Contrastive Learning

ABSTRACT. The widespread popularity of the Android operating system, along with its open-source nature, has made it a prime target for malware attacks. Given the rapid evolution and increasing sophistication of Android malware, developing effective detection techniques remains a critical challenge. Recently, many studies have explored the potential of function call graphs (FCGs) combined with graph neural networks for Android malware detection. However, existing approaches still face two main limitations: (1) the FCGs of Android applications are often extremely large, leading to high computational costs and reduced effectiveness in representation learning; and (2) the embedding spaces generated by graph representation learning models often lack clear separability between benign and malicious applications, which negatively impacts classification performance. To address these limitations, we propose PruCLDroid, a novel Android malware detection framework with two key contributions. First, we introduce a graph pruning technique based on structural similarity between node pairs to remove less significant edges, thereby simplifying the graph while preserving essential call relationships. Second, we employ a contrastive learning strategy to optimize the embedding space by pulling together representations of samples from the same class and pushing apart those from different classes. Experimental results demonstrate that PruCLDroid consistently outperforms existing methods across all evaluation metrics.

14:30
MPPO-GEM: Reinforcement Learning Approach for Generating Evasive Malware against Static and Dynamic Malware Detectors

ABSTRACT. Machine learning (ML)-based malware detectors are widely adopted in both research and practice. However, they remain vulnerable to adversarial attacks in both static and dynamic settings. In the static case, simple semantic-preserving edits (e.g., byte padding, section manipulation, control-flow redividing) can alter model features without breaking executability. In dynamic settings, attackers may evade sandboxing by delaying payload execution, simulating benign API interactions, or employing anti-debug/user-interaction techniques. To analyze these challenges in depth, we propose MPPO-GEM, a MaskablePPO-based framework that jointly applies problem-space primitives for generating adversarial malware, including call-based redividing, section-level modifications, semantic NOP insertion, user-interaction simulation, and anti-debugging bypasses, within a constrained, action-masked Reinforcement Learning (RL) environment. The generated adversarial samples are evaluated across multiple platforms (VirusTotal, Kaspersky static/dynamic, MalConv, LightGBM). Empirical results show that our method nearly doubled the evasion rate compared with the original malware while preserving executable integrity.

14:50
Pri-WeDec: A Private Deep Learning Approach for Weapon Detection in Digital Forensics

ABSTRACT. Modern digital forensic investigations increasingly rely on Artificial Intelligence (AI) tools to screen and analyze vast volumes of image-based evidence. However, uploading this sensitive data to cloud systems or third-party servers for analysis poses significant challenges to privacy, data security, and the integrity of the chain of custody. To address this issue, we propose Pri-WeDec, a novel framework that enables the detection of weapon imagery directly on encrypted data, ensuring that the original content of the evidence is never exposed to untrusted environments. Our solution integrates Fully Homomorphic Encryption (FHE) with a specially customized Convolutional Neural Network (CNN). Specifically, we utilize the CKKS encryption scheme, which supports real-number arithmetic, to encrypt the images. Subsequently, a custom-designed "FHE-friendly" CNN—which employs polynomial activation functions in place of ReLU and utilizes Average Pooling layers—performs inference directly on these ciphertexts. Experimental results show that our model achieves high accuracy on encrypted data, demonstrating the feasibility of performing complex forensic image analysis securely and with full privacy. This work opens a new avenue for the development of next-generation digital forensic tools, where the efficiency of AI can be leveraged without sacrificing core security principles.
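
To illustrate what makes a CNN "FHE-friendly", the sketch below applies the two substitutions the abstract names, a polynomial (square) activation in place of ReLU and average pooling in place of max pooling, in plain PyTorch; layer sizes are illustrative, and real inference would operate on CKKS ciphertexts rather than plaintext tensors.

```python
# Hedged sketch of an FHE-friendly CNN in plaintext PyTorch. Real Pri-WeDec
# inference would evaluate these operations homomorphically under CKKS.
import torch
import torch.nn as nn

class Square(nn.Module):
    def forward(self, x):           # x**2 is a polynomial, so it can be
        return x * x                # evaluated on CKKS ciphertexts

model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, stride=2), Square(),
    nn.AvgPool2d(2),                # AvgPool is linear, unlike MaxPool
    nn.Flatten(),
    nn.Linear(8 * 15 * 15, 2),      # e.g., weapon / no-weapon logits
)
print(model(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 2])
```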

15:10
Few-Shot Intrusion Detection using Model-Agnostic Meta-Learning with Deep Neural Networks

ABSTRACT. Nowadays, Intrusion Detection Systems (IDS) have to deal with dynamic and evolving cyberattacks. However, traditional methods often fail to generalize to new threats because they require large amounts of labeled data. This study applies Model-Agnostic Meta-Learning (MAML) to a few-shot intrusion detection scenario. We developed a MAML-based multilayer perceptron (MAML-MLP) and experimented with it on the NSL-KDD and CICIDS2017 datasets, focusing on fast adaptation to new attack classes from only a few labeled samples. In this paper, comparative experiments of MAML-MLP and baseline deep learning models, including supervised MLP, LSTM, GRU, 1D-CNN, and CNN-LSTM, are presented to demonstrate that MAML delivers superior and stable performance in the few-shot setting. We conclude that in a few-shot scenario, MAML is an effective solution for intrusion detection, enabling quick responses to new and emerging cyber threats with less supervision. The model achieves average accuracies of 90.62% and 95.89% on completely unseen attack types in the CICIDS2017 and NSL-KDD datasets, respectively.
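As an illustration of MAML's inner/outer loop structure, a compact PyTorch sketch; random tensors stand in for support and query sets, and the architecture and step sizes are placeholders, not the paper's configuration:

```python
# Compact MAML sketch: one inner gradient step per task (create_graph=True
# keeps second-order terms), then a meta-update on query losses. Data,
# model size, and learning rates are illustrative placeholders.
import torch

def forward(params, x):
    w1, b1, w2, b2 = params
    return torch.relu(x @ w1 + b1) @ w2 + b2

params = [torch.randn(8, 16, requires_grad=True),
          torch.zeros(16, requires_grad=True),
          torch.randn(16, 2, requires_grad=True),
          torch.zeros(2, requires_grad=True)]
meta_opt = torch.optim.Adam(params, lr=1e-3)
loss_fn = torch.nn.functional.cross_entropy
inner_lr = 0.05

for step in range(100):                                   # meta-training loop
    meta_opt.zero_grad()
    for _ in range(4):                                    # a batch of few-shot tasks
        xs, ys = torch.randn(5, 8), torch.randint(0, 2, (5,))  # support set
        xq, yq = torch.randn(5, 8), torch.randint(0, 2, (5,))  # query set
        grads = torch.autograd.grad(loss_fn(forward(params, xs), ys),
                                    params, create_graph=True)
        adapted = [p - inner_lr * g for p, g in zip(params, grads)]
        loss_fn(forward(adapted, xq), yq).backward()      # accumulates meta-grads
    meta_opt.step()
```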

13:30-15:30 Session 10D: SOICT Technical Session XX: Multimedia Processing
Chair:
Location: Yersin B
13:30
Scene Graph for Vietnamese Video Understanding: An Agentic Approach with Reasoning

ABSTRACT. Vietnamese Video Understanding (VVU) requires not only recognizing objects and actions but also reasoning about their interactions over time. A central approach is Video Scene Graph Generation (VSGG), yet traditional methods are prohibitively costly, demanding dense frame-level annotations of object–relation pairs and language-specific supervision—constraints infeasible for low-resource Vietnamese. Even modern VSGG pipelines and Vision–Language Models (VLMs) often rely on benchmark-specific fine-tuning or adapters, resulting in high computational/memory demands and brittleness under context shifts. These challenges motivate zero-shot, training-free approaches as a natural alternative. A key insight we emphasize is masking/segmentation before feeding frames into VLMs. Isolating objects and relevant background regions helps preserve scene information, reduce noise from irrelevant context, and avoid spurious correlations. While frequently overlooked, scene context is crucial for accurate relational reasoning. We present VISTA (Video Intelligence with Scene-graph Team-based Agents), a zero-shot multi-agent framework that decomposes video reasoning to reduce computational and memory load. VISTA orchestrates pre-trained components—frame selection, segmentation/masking, vision–language grounding, entity linking/refinement, and graph construction—via role-specialized agents connected through a lightweight shared memory. This design produces dynamic temporal scene graphs and answers questions without further training. By activating only query-relevant tools, pruning candidates early, and reusing cached evidence, VISTA achieves cost-efficient compositional inference. To evaluate, we adapt part of the NEXT-QA dataset into Vietnamese and conduct real-world YouTube case studies. Results across graph quality, agent efficiency, and QA performance show that coordinated multi-agent decomposition can rival large-scale training, underscoring modularity and collaboration as practical pathways to advance video understanding in underrepresented languages.
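A toy sketch of the orchestration pattern described above: role-specialised agents reading and writing a lightweight shared memory, with only query-relevant tools activated. The agent bodies are stubs for illustration, not VISTA's actual components:

```python
# Toy multi-agent orchestration via shared memory, in the spirit of a
# training-free pipeline; agents are stubs and the routing is hypothetical.
shared_memory = {"query": "what does the person hand to the child?",
                 "frames": [0, 14, 52]}

def frame_selector(mem):
    mem["selected"] = mem["frames"][:2]          # stub: pick candidate frames

def segmenter(mem):
    mem["masks"] = {f: ["person", "child", "object"] for f in mem["selected"]}

def graph_builder(mem):
    mem["scene_graph"] = [("person", "hands", "object"), ("object", "to", "child")]

# Only the tools relevant to a query type are activated.
PIPELINE = {"temporal": [frame_selector], "relational": [segmenter, graph_builder]}

def run(query_type: str):
    for agent in PIPELINE.get(query_type, []):
        agent(shared_memory)                     # each agent reads/writes memory

run("temporal")
run("relational")
print(shared_memory["scene_graph"])
```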

13:50
OpenLifelogQA: An Open-Ended Multimodal Lifelog Question-Answering Dataset

ABSTRACT. We introduce OpenLifelogQA, a large-scale open-ended lifelog QA dataset constructed from 18 months of multimodal lifelog data. Lifelogging is the passive collection and analysis of personal daily activities using wearable devices, producing rich multimodal data such as images, locations, and biometrics. Question answering (QA) over lifelog data enables users to interactively query their own experiences, supporting applications in memory support, lifestyle analysis, and personal assistance. OpenLifelogQA contains 14,187 Q&A pairs spanning multiple question types and difficulty levels, designed to support robust evaluation in realistic settings. Compared with prior resources, OpenLifelogQA offers greater diversity and practicality for real-world applications. To establish baselines, we evaluate the LLaVA-NeXT-Interleave 7B model, achieving 89.7% BERTScore, 25.87% ROUGE-L, and an average LLM Score of 3.97. By releasing OpenLifelogQA, we aim to promote future research on lifelog technologies, paving the way for personal lifelog assistants capable of memory augmentation, healthcare support, and lifestyle coaching.

14:10
EnAug: ENT Endoscopy Images Classification Using Ensemble and Augmentation Methods

ABSTRACT. Ear, nose, and throat (ENT) endoscopy is a key diagnostic tool for detecting a variety of head and neck conditions. However, automated analysis of endoscopic images remains challenging due to inconsistent image quality, limited labeled data, and the inherent variability in human interpretation. In this work, we present a robust classification framework based on an ensemble of deep learning models, designed specifically for ENT endoscopy images. To address class imbalance and improve generalization, we employ a novel data augmentation strategy that combines symmetry-based label flipping with Mixup, Mosaic, and other augmentation techniques. Our approach is evaluated on a curated ENT dataset covering seven anatomical categories, achieving an accuracy of 95.82%, which surpasses several competitive baselines. In addition to strong overall performance, our method demonstrates improved robustness on underrepresented classes, showing its potential for real-world deployment in clinical settings. This work highlights the effectiveness of deep learning ensembles and thoughtful augmentation in building scalable AI tools for medical imaging. The full implementation and models are available at our repository at https://github.com/thanhson28/acmmm2025_entrep_challenge_was.
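A minimal sketch of Mixup, one of the augmentations named above; the alpha value and batch conventions are illustrative:

```python
# Minimal Mixup: blend a batch of images and one-hot labels with a
# Beta-sampled weight. Alpha and the 7-class setting are illustrative.
import numpy as np
import torch

def mixup(x, y_onehot, alpha: float = 0.4):
    lam = np.random.beta(alpha, alpha)
    idx = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[idx], lam * y_onehot + (1 - lam) * y_onehot[idx]

x = torch.randn(8, 3, 224, 224)                  # a batch of endoscopy frames
y = torch.nn.functional.one_hot(torch.randint(0, 7, (8,)), 7).float()
xm, ym = mixup(x, y)
print(xm.shape, ym.sum(dim=1))                   # mixed labels still sum to 1
```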

14:30
EDGER: EDge-Guided with HEatmap Refinement for Generalizable Image Forgery Localization

ABSTRACT. Text-guided inpainting has made image forgery increasingly realistic, challenging both synthetic image detection (SID) and image forgery localization (IFL). However, existing methods often struggle to surface suspicious signals across domains. To address this problem, we propose EDGER, a patch-based, dual-branch framework that localizes manipulated regions in images of arbitrary resolution without sacrificing native resolution. The first branch, Edge-Guided Segmentation, introduces a Frequency-based Edge Detector to emphasize high-frequency inconsistencies at manipulation boundaries, and fine-tunes a SegFormer to fuse RGB and edge features for pixel-level masks. Since edge evidence is most informative only when patches contain both authentic and manipulated pixels, we complement Edge-Guided Segmentation with a Synthetic Heatmapping branch, a classification-based localizer that fine-tunes a CLIP-ViT image encoder with LoRA to flag fully synthetic patches. Together, Synthetic Heatmapping provides coarse, patch-level synthetic priors, while Edge-Guided Segmentation sharpens boundaries within partially manipulated patches, yielding comprehensive localization. Evaluated in the setting of the Manipulated Region Localization Task of the MediaEval 2025 SynthIM challenge, our approach scales to multi-megapixel imagery, remains robust to resizing, compression, and cropping, and exhibits strong cross-domain generalization. Extensive ablations highlight the complementary roles of frequency-based edge cues and patch-level synthetic priors in driving accurate, resolution-agnostic localization.
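One plausible reading of a frequency-based edge cue is a Fourier-domain high-pass filter that concentrates energy at boundaries; the cutoff below is hypothetical, and this is not the paper's detector:

```python
# Sketch of a frequency-based edge cue: zero out low frequencies of a patch
# and inverse-transform, so energy concentrates at sharp boundaries.
# The cutoff is an illustrative assumption.
import numpy as np

def highpass_edges(patch: np.ndarray, cutoff: int = 4) -> np.ndarray:
    f = np.fft.fftshift(np.fft.fft2(patch))
    cy, cx = np.array(f.shape) // 2
    f[cy - cutoff:cy + cutoff + 1, cx - cutoff:cx + cutoff + 1] = 0  # kill low freqs
    return np.abs(np.fft.ifft2(np.fft.ifftshift(f)))

patch = np.zeros((32, 32))
patch[:, 16:] = 1.0                               # a hard vertical boundary
edges = highpass_edges(patch)
print(edges[:, 15:17].mean() > edges[:, 7:9].mean())  # True: energy at the seam
```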

14:50
Hierarchical Multi-Modal Retrieval for Knowledge-Grounded News Image Captioning

ABSTRACT. Traditional image captioning methods often struggle to generate comprehensive, context-rich descriptions, especially for details not directly observable from visual cues. To overcome this, we propose a novel retrieval-augmented image captioning framework that generates captions with deeper insights, such as object attributes, event context, and underlying significance, by leveraging external knowledge. Our approach features a hierarchical multi-modal article retrieval mechanism that moves beyond monolithic text entities. This retrieval considers article structure-aware features, including weighted textual components (e.g., headlines, body sections) and visual placement patterns, alongside multi-faceted similarity computations (content--visual, visual--visual, and discourse positioning). A subsequent contextual relevance refinement stage further enhances the retrieved information. The retrieved articles then serve as the knowledge base for caption generation: first, a VLM generates a concise image description; second, we segment relevant information from the retrieved articles based on this description; and finally, an LLM utilizes both the description and extracted knowledge to generate a comprehensive, contextually detailed caption. We participated in the ACM Multimedia EVENTA 2025 Challenge and achieved 5th place with an overall score of 0.2824 on the private test set of the OpenEvent-V1 dataset. Source code is publicly released at https://github.com/mf0212/EVENTA-Challange.

15:10
From Relative to Absolute: Monocular Depth Estimation in Aerial Imagery

ABSTRACT. Monocular depth estimation from single aerial images is fundamental to autonomous navigation, 3D reconstruction, and terrain analysis, yet practical deployment requires absolute (metric) depth instead of the purely relative outputs produced by most deep models. This work benchmarks three state-of-the-art methods (Marigold, NDDepth, and Unidepth) on three aerial datasets (ESPADA, ENRICH-Aerial, Skyscene) with a comprehensive metric suite, then investigates two routes to metric recovery: a least-squares scale-shift alignment and a learned scale-shift predictor based on ResNet50. On ESPADA, Marigold outperforms Unidepth with lower error and higher accuracy (absErrRel 0.1699 vs 0.1972, LinearRmse 5.1449 vs 6.0058, delta1Acc 0.8467 vs 0.8255). On ENRICH-Aerial, the two models are comparable (Marigold absErrRel 0.0673, delta1Acc 0.9733; Unidepth 0.0694, 0.9751). Prompt conditioning analysis on ESPADA shows that adding aerial-scene prompts improves over no-prompt (e.g., absErrRel drops from 0.1699 to 0.1402 with true prompts and to 0.1365–0.1373 with partly-correct prompts), while ENRICH-Aerial exhibits near-neutral sensitivity to prompts. The classical least-squares alignment yields only marginal numerical gains across datasets. On ESPADA, absErrRel improves only marginally, from 0.1699 to 0.169815, while on ENRICH-Aerial it changes from 0.0673 to 0.06726. These negligible differences indicate that once predictions are stable, the benefit of least-squares calibration is limited. Finally, our ResNet50-based regressor directly predicts scale and shift to convert relative depth to absolute depth, achieving strong results that match the best calibrated outputs (ESPADA: absErrRel 0.1694, delta1Acc 0.8471; ENRICH-Aerial: 0.0674, 0.9734). Collectively, these findings demonstrate that absolute depth from monocular aerial imagery is attainable with high fidelity via lightweight calibration or learned scale–shift prediction, enabling reliable downstream drone and GIS applications.
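The classical least-squares alignment benchmarked above has a simple closed form: fit scale s and shift t minimizing ||s·d + t − g||² between relative depth d and metric ground truth g. A minimal sketch with synthetic data:

```python
# Closed-form least-squares scale-shift alignment of relative depth d to
# metric ground truth g, via the normal equations (np.linalg.lstsq).
# The data below is synthetic, for illustration only.
import numpy as np

def align_scale_shift(d: np.ndarray, g: np.ndarray):
    A = np.stack([d.ravel(), np.ones(d.size)], axis=1)   # columns [d, 1]
    (s, t), *_ = np.linalg.lstsq(A, g.ravel(), rcond=None)
    return s * d + t, s, t

rng = np.random.default_rng(0)
rel = rng.random((64, 64))                               # relative depth in [0, 1]
gt = 3.2 * rel + 1.5 + rng.normal(0, 0.01, rel.shape)    # synthetic metric depth
aligned, s, t = align_scale_shift(rel, gt)
print(round(s, 2), round(t, 2))                          # ~3.2, ~1.5
```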

13:30-17:20 Session 10E: Poster Exhibition
Anatomy-based Brain Hemorrhage Segmentation and Application in Assessment of Traumatic Brain Injury Severity

ABSTRACT. Traumatic brain injury (TBI) requires an objective and timely severity assessment to support clinical decision making and integration into digital health workflows. In this study, we present a multimodal pipeline based on features extracted from CT images, integrated with structured clinical variables. The proposed method utilizes a dataset named 103_TBI, which comprises 504 records. To extract features from brain CT images, a U-Net-based neural network is used to quantify lesion volumes (e.g., epidural, subdural, and intraparenchymal hematomas), midline shift, and subarachnoid characteristics. In addition, the unified workflow performs clinically informed binning, KNN-based imputation, Z-score normalization, and class rebalancing using SMOTE. Tree-based ensemble models (Random Forest and XGBoost) trained on the 103_TBI dataset achieve accuracies of up to 94.20%. The results highlight the added value of segmentation features, particularly midline shift and hematoma burden, when combined with key clinical indicators such as the Glasgow Coma Scale. The proposed framework demonstrates a practical approach to integrating CT imaging features with clinical data to assess the severity of TBI. The experimental results confirm the feasibility of deploying this solution to support the evaluation of the severity of TBI in clinical diagnostics.
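A sketch of the tabular half of such a workflow, chaining the preprocessing steps the abstract lists (KNN imputation, z-score normalization, SMOTE rebalancing) into an imbalanced-learn pipeline; the data here is synthetic, not the 103_TBI dataset:

```python
# Sketch of the preprocessing + classification chain named in the abstract,
# using imblearn's Pipeline so SMOTE is applied only during fitting.
# Synthetic data stands in for the clinical/imaging features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=500, weights=[0.8], random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

clf = Pipeline([
    ("impute", KNNImputer(n_neighbors=5)),       # KNN-based imputation
    ("scale", StandardScaler()),                 # Z-score normalization
    ("smote", SMOTE(random_state=0)),            # class rebalancing (fit only)
    ("rf", RandomForestClassifier(random_state=0)),
])
clf.fit(Xtr, ytr)
print(round(clf.score(Xte, yte), 3))
```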

House Price Prediction via Attribute, Visual, and Economic Features

ABSTRACT. In this paper, we investigate the impact of economic factors such as interest rates, inflation, employment, and local market conditions on house pricing, alongside attribute and visual features. To this end, we collect the House Attribute, Visual, and Economic Network Dataset (HAVEN-3000), which includes 3,000 houses from 22 states and 74 cities. Then, we propose a multimodal approach that combines house photos with property attributes and economic factors. Our proposed work achieves a Mean Absolute Error (MAE) of 10.9178 and an R2 of 0.3863. We further benchmark a direct prediction XGBoost model, which delivers improved results with MAE values of 9.245805 (2023), 10.978892 (2024), and 6.896334 (2025), alongside R2 scores of 0.6528, 0.6106, and 0.8219. Results highlight interest rate fluctuations as the strongest driver of housing price dynamics, underscoring the importance of incorporating multimodal data into predictive models for real estate.

AEye: Avian Monitoring from Streaming Videos

ABSTRACT. The conservation of bird species, especially those that are endangered or at risk of extinction such as eagles and hawks, has become an urgent ecological priority. In this paper, we address the bird detection and classification problem, aimed at supporting large-scale avian monitoring and conservation efforts. Our work focuses on recognizing both adult birds and their chicks, with chicks observed from the time of hatching until fledging. We meticulously collect a realistic and diverse AEye dataset using YouTube streaming videos, covering 20 bird species, each classified into two categories, namely, parent and chick. The AEye dataset includes both daytime and night vision footage to evaluate model performance under different lighting conditions. We leverage the YOLO object detection model on the newly collected dataset, which demonstrates strong performance in detecting and classifying birds in different environments. We also evaluate performance in both daytime and nighttime settings.

Optimizing UAV Swarm Routing with Optical Communication Systems

ABSTRACT. Fifth generation (5G) and beyond 5G (5G/B5G) networks deliver high-speed, widespread data transfer with reduced latency and enhanced connectivity compared to earlier systems. Among enabling technologies, unmanned aerial vehicles (UAVs) and optical communication have been increasingly applied in 5G/B5G networks, addressing the growing need for rapid data transfer and extensive connectivity in dynamic environments. This study explores routing algorithms for multi-hop UAV swarm networks, with a focus on minimizing latency during data transmission. We develop both path-delay optimal and heuristic algorithms to address routing challenges in large-scale UAV swarm networks, ensuring efficient data delivery across complex topologies. We also add must-pass intermediate UAVs that ensure critical information delivery, as additional constraints on the routing design. Numerical results demonstrate that the proposed path-delay optimal algorithm consistently achieves lower transmission latency than the heuristic algorithm, though it requires higher computational complexity, highlighting a trade-off between performance and resource demands.
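As a baseline illustration of must-pass routing (the paper's optimal algorithm is not specified here), one can chain shortest-delay subpaths through the required intermediate UAVs, assuming their visiting order is given:

```python
# Baseline sketch: route from src to dst through must-pass UAVs in order,
# chaining Dijkstra shortest paths on a delay-weighted graph. This is an
# illustrative baseline, not the paper's optimal algorithm.
import networkx as nx

def route_via(g: nx.Graph, src, dst, must_pass):
    hops, total = [], 0.0
    waypoints = [src, *must_pass, dst]
    for a, b in zip(waypoints, waypoints[1:]):
        sub = nx.shortest_path(g, a, b, weight="delay")
        total += nx.path_weight(g, sub, weight="delay")
        hops += sub if not hops else sub[1:]     # avoid duplicating joints
    return hops, total

g = nx.Graph()
g.add_weighted_edges_from(
    [(0, 1, 2.0), (1, 2, 1.0), (0, 2, 5.0), (2, 3, 1.5), (1, 3, 4.0)],
    weight="delay")
print(route_via(g, 0, 3, must_pass=[2]))         # ([0, 1, 2, 3], 4.5)
```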

Practical multivariate algebraic signature scheme with one hidden group

ABSTRACT. In the field of post-quantum two-key cryptography, there is significant interest in developing practical algebraic electronic digital signature (EDS) schemes with a secret (hidden) group, whose security is based on the computational complexity of solving systems of power equations in many unknowns. Associative non-commutative finite algebras (ANFA) are used as the algebraic carrier in such crypto schemes. A critical concern in developing these EDS schemes lies in guaranteeing a sufficient level of randomness in the fitting element of the digital signature, a vector S that is repeatedly included in the verification equation as a multiplier. A known solution to this problem calculates S from two vectors randomly selected from two commutative secret groups such that the elements of one are non-commutative with the elements of the other. This mechanism requires auxiliary fitting signature elements, which are calculated independently of the vector S, creating potential prerequisites for computing the secret key in parts. The paper proposes a new method for enhancing signature randomization, distinguished by calculating the value of S from two vectors randomly selected from a single commutative secret group. Based on the proposed method, a practical post-quantum algebraic digital signature algorithm is developed in which the auxiliary fitting elements of the signature and the vector S are calculated jointly and simultaneously. This eliminates the specified potential vulnerability of algebraic digital signature algorithms with a hidden group.

QuantaMind: A Robust and Efficient Framework for Quantum Machine Learning Applications

ABSTRACT. Quantum Neural Networks (QNNs) possess the capability to significantly reduce the complexity of training neural networks. This research examines the potential advantages of QNNs compared to classical neural networks. We present QuantaMind, a novel framework for rapidly building hybrid classical-quantum models. We investigate hybrid classical-quantum neural networks combining quantum circuits with classical layers, specifically examining their performance on classification and regression tasks across many datasets by constructing Hybrid Quantum Feedforward Neural Networks (HQFNNs) and Hybrid Quantum Convolutional Neural Networks (HQCNNs). In most circumstances, these models show competitive performance with their classical counterparts in terms of accuracy and loss. QuantaMind demonstrates its suitability for many application domains, marking an important step forward in Quantum Machine Learning.

GENLog: Enhance Generalization to Log-based Anomaly Detection

ABSTRACT. Log-based anomaly detection is crucial for maintaining the reliability and security of modern software systems. While deep learning models, particularly those based on Transformers like NeuralLog, have shown significant promise, they often suffer from limited generalization. This limitation arises because purely discriminative training, driven by cross-entropy loss, can lead to "shallow" representations over-specialized to the training data. To address this, we propose GENLog, a novel architecture designed to enhance generalization by integrating generative and discriminative learning. GENLog augments the standard Transformer encoder with a decoder module, creating a multi-task framework. In addition to the primary classification task, the model is simultaneously trained to reconstruct the original input sequence. This reconstruction objective acts as a powerful regularizer, compelling the encoder to learn a rich, comprehensive latent representation that preserves essential semantic and structural information. We conduct extensive experiments on two benchmark datasets, HDFS and BGL, demonstrating that GENLog significantly outperforms state-of-the-art methods, especially in challenging low-data regimes. Our analysis further includes an ablation study on the trade-off hyperparameter, providing insights into the synergistic relationship between the two learning objectives.
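A minimal sketch of the joint generative-discriminative objective described above: a shared encoder feeds both a classification head and a reconstruction head, combined with a trade-off weight lambda. Dimensions, heads, and data are illustrative, not GENLog's actual architecture:

```python
# Multi-task sketch: classification loss + lambda * reconstruction loss over
# a shared Transformer encoder. The reconstruction head here is a simple
# linear stand-in for the paper's decoder module.
import torch
import torch.nn as nn

class GenDiscModel(nn.Module):
    def __init__(self, d: int = 64):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.cls_head = nn.Linear(d, 2)           # anomaly / normal
        self.recon_head = nn.Linear(d, d)         # stand-in reconstruction head

    def forward(self, x):
        h = self.encoder(x)
        return self.cls_head(h.mean(dim=1)), self.recon_head(h)

model, lam = GenDiscModel(), 0.5
x = torch.randn(8, 20, 64)                        # 8 sequences of 20 log events
y = torch.randint(0, 2, (8,))
logits, recon = model(x)
loss = (nn.functional.cross_entropy(logits, y)
        + lam * nn.functional.mse_loss(recon, x)) # reconstruction as regularizer
loss.backward()
```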

Architecting Trustworthy AI: The Cyber-Resilient AI (CRAI) Framework

ABSTRACT. The inductive biases and optimization objectives integral to modern artificial intelligence, particularly in large-scale generative models, create inherent and exploitable security vulnerabilities. The differentiability and high-dimensional nature of deep neural networks, for instance, give rise to adversarial attack surfaces that are not mere implementation flaws but fundamental properties of the models themselves. Current research often addresses these vulnerabilities in isolation, resulting in a fragmented landscape of point defenses that lack a unifying structure. This paper addresses this methodological gap by conducting a structured critical analysis of the AI-driven threat landscape. A novel, unified framework for proactive defense—the Cyber-Resilient AI (CRAI) architecture—is introduced. The CRAI framework is built on three synergistic pillars: (1) Hardening collaborative learning through cryptographic integrity proofs, (2) Enhancing model auditability via causal and counterfactual explainability (XAI), and (3) Implementing adaptive governance informed by real-time model state analysis. This work's primary contribution is a formal synthesis that connects specific AI model properties to emergent threat vectors and maps them to a coherent, multi-layered defense strategy. It provides a new research roadmap for developing verifiably robust and secure AI systems, moving beyond reactive patching toward a paradigm of security-by-design.

Adaptive Federated Learning for Software Vulnerability Detection

ABSTRACT. Federated learning has recently been adopted in cybersecurity to enable collaborative model training on distributed threat data without disclosing sensitive information. However, existing approaches for software vulnerability detection rely on static aggregation and require shared test data to monitor convergence. To address these challenges, we introduce Adaptive Federated Learning (Adaptive FL), which dynamically allocates computation to clients with harder-to-learn vulnerability profiles and uses client-side validation to orchestrate the learning process without any centralized test data. We evaluate Adaptive FL on multiple vulnerability datasets and demonstrate that it achieves an F1-score of approximately 70%, outperforming VDBFL baselines by 5–10 percentage points, while maintaining stable runtimes of 100–150 s per round—over 20 times faster than competing methods. These results establish Adaptive FL as a practical, efficient, and privacy-preserving solution for federated vulnerability detection in heterogeneous software development environments.
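A toy sketch of the adaptive idea: allocate more local computation to clients whose client-side validation loss is highest, then aggregate with FedAvg. The allocation rule below is a hypothetical illustration, not the paper's policy:

```python
# Sketch: epoch allocation proportional to client-side validation loss,
# followed by size-weighted FedAvg aggregation. Both rules are illustrative.
import numpy as np

def allocate_epochs(val_losses, budget: int = 12, min_epochs: int = 1):
    losses = np.asarray(val_losses, dtype=float)
    share = losses / losses.sum()
    extra = np.floor(share * (budget - min_epochs * len(losses))).astype(int)
    return (min_epochs + extra).tolist()

def fedavg(client_weights, sizes):
    sizes = np.asarray(sizes, dtype=float)
    return np.average(np.stack(client_weights), axis=0, weights=sizes / sizes.sum())

print(allocate_epochs([0.9, 0.3, 0.6]))     # the hardest client gets more epochs
print(fedavg([np.ones(4), np.zeros(4)], sizes=[3, 1]))
```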

A Method for Building QA Corpora for Low-Resource Languages

ABSTRACT. Building high-quality question–answering (QA) datasets for low-resource languages is challenging due to the lack of annotated corpora. We propose a fully automated and scalable pipeline for constructing large-scale QA corpora from authoritative online sources. The pipeline includes five stages: (i) selecting authoritative QA websites; (ii) automated crawling; (iii) extracting question–context–answer (QCA) triples via site-specific templates that leverage semantic HTML; (iv) applying an AI-assisted fact-checking filter that uses Sentence-BERT retrieval with a high-similarity threshold followed by LLM verification, both against curated references; and (v) final canonicalization and deduplication to remove redundant items and maintain corpus diversity. Unlike conventional QA pairs, the QCA structure preserves contextual grounding, enhancing corpus utility and model robustness. Applied to Vietnamese, our method produced 30,000 QCA triples from four reputable sources. To demonstrate usability, we fine-tuned vit5-base, a Vietnamese sequence-to-sequence model, achieving strong results on a 1,000-triple test set (BLEU 89.1; semantic similarity >=0.8: 91.5%) and in a human evaluation of 500 samples (grammaticality 4.58/5, usefulness 4.29/5). Compared with existing baselines (question-generation-vietnamese-v2, Ollama, GPT-5), our model yields substantially higher performance, underscoring both the effectiveness of the pipeline and the quality of the corpus. The released dataset and tools are publicly available as open-source resources, providing a valuable benchmark for future research on Vietnamese QA and question generation in low-resource settings.
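A minimal sketch of stage (iv)'s first filter: retain a candidate only when it is highly similar to some curated reference, deferring borderline cases to LLM verification. The model name follows common Sentence-BERT usage and is an assumption, as is the example data:

```python
# Sketch of the Sentence-BERT similarity filter with a high threshold (the
# abstract reports >= 0.8); model choice and examples are illustrative, and
# the subsequent LLM verification step is elided.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
references = ["Hanoi is the capital of Vietnam."]          # curated references
candidates = ["The capital of Vietnam is Hanoi.",
              "Vietnam borders five countries."]

ref_emb = model.encode(references, convert_to_tensor=True)
for cand in candidates:
    sim = util.cos_sim(model.encode(cand, convert_to_tensor=True),
                       ref_emb).max().item()
    verdict = "keep" if sim >= 0.8 else "send to LLM check"
    print(f"{sim:.2f}  {verdict}  {cand}")
```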

JuanQueue: A Digital Appointment and Queuing System for a Government Organization

ABSTRACT. This paper presents JuanQueue, a web-based digital appointment and queuing system designed to modernize document-request services in local government offices. The study addressed inefficiencies in manual workflows that cause long waiting times and administrative bottlenecks. It followed a quantitative research design and Rapid Application Development (RAD) for system creation and iterative refinement. Evaluation was performed across four objectives: (1) correlation between manual waiting time and citizen satisfaction, (2) expert validation of the developed system under ISO/IEC 25010 software-quality standards, (3) user-experience and adoption analysis using the Unified Theory of Acceptance and Use of Technology (UTAUT), and (4) assessment of data-driven decision support via the Technology-to-Performance Chain (TPC) framework. Results showed very weak negative correlations between waiting time and perceived efficiency, usability, and satisfaction, confirming that citizens tolerate manual processes for familiarity rather than effectiveness. IT experts rated all ISO/IEC 25010 criteria “Very Acceptable,” while users reported mean ratings of ≈5.6 / 6 across UTAUT constructs. Barangay staff rated information quality and decision support highly. Findings indicate that JuanQueue is technically ready, widely accepted, and capable of supporting data-informed local governance. The study demonstrates the feasibility of digitizing barangay services and highlights opportunities for scalable Software-as-a-Service deployment.

Smart Mobility through Hybrid Offline-Online Scheduling for Ridesharing

ABSTRACT. In recent years, ridesharing has emerged as one of the most cost-effective and efficient transportation solutions, allowing multiple people to share a single vehicle. Nonetheless, scheduling remains a critical challenge that must be addressed to enhance user adoption. This paper addresses this issue through a two-fold objective: firstly, by providing frequent riders with reliable, pre-arranged routes; and secondly, by enabling the dynamic insertion of new riders into active shared trips. To this end, we develop an algorithm named aVC, which clusters regular users into ridesharing groups based on the similarity of their frequent travel routes. This approach eliminates the need for users to repeatedly search for rides or experience extended waiting times, as trip details and driver assignments are communicated to users in advance. Additionally, when a regular ridesharing trip commences and vacant seats remain, drivers can accommodate real-time requests from new users without disrupting the planned itinerary. To efficiently handle such scenarios, we introduce a method named biSearchIns designed to quickly process shared trip queries. The proposed methods are assessed against existing approaches using simulated datasets. Experimental results demonstrate that our methods surpass current solutions in terms of computational efficiency, the number of riders served, and the overall reduction in the required number of vehicles.

16:00-17:20 Session 11A: SOICT Technical Session XXI: Lifelog Event Retrieval
Location: Ballroom A
16:00
CLIPAR: Multimodal and Temporal-Aware Video Retrieval System

ABSTRACT. The Ho Chi Minh City AI Challenge 2025 sets the ambitious goal of building a powerful, competitive video retrieval system. To address this challenge, CLIPAR is designed and implemented, integrating multiple search strategies, including Semantic Search, OCR Search, ASR Search, Object Detection, and Image Matching. CLIPAR processes audio and visual streams separately, extracting keyframes, transcripts, and embeddings to support diverse retrieval tasks. One of the system's key strengths is its flexibility: users can search using text, objects, or images, and the system quickly returns the most relevant video segments. CLIPAR demonstrated strong performance in the preliminary stage of the competition, achieving a top ranking among participating teams. Its ability to handle multiple types of queries allows it to return relevant video segments quickly and accurately. These results highlight CLIPAR's practical potential for real-world video retrieval tasks and show that a system designed with flexible, multimodal search capabilities can outperform more specialized approaches.

16:20
Vortex: A Multi-Modal Fusion System for Intelligent Video Retrieval

ABSTRACT. This paper presents Vortex, a multimodal video retrieval system developed for the Ho Chi Minh City AI Challenge 2025, designed to advance intelligent multimedia search and temporal reasoning. The system integrates adaptive keyframe extraction, multimodal metadata generation from vision-language and speech models, and a hybrid retrieval strategy that fuses CLIP and SigLIP2 embeddings through Reciprocal Rank Fusion to balance global and fine-grained semantics. To enhance interactivity, Vortex incorporates Rocchio-based relevance feedback and a multi-stage temporal search mechanism for sequential event alignment. Built on Milvus and Elasticsearch, the architecture enables scalable indexing and efficient retrieval. Evaluated in the official competition, Vortex achieved a final score of 79.6/88 (90.5%), demonstrating the complementary strengths of CLIP and SigLIP2 and confirming the effectiveness of the hybrid retrieval approach. The system establishes a robust foundation for future research in intelligent, context-aware, and interactive video retrieval.
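Reciprocal Rank Fusion, used above to merge the CLIP and SigLIP2 rankings, scores each candidate as the sum over rankers of 1/(k + rank), conventionally with k = 60. A minimal sketch with illustrative candidate IDs:

```python
# Reciprocal Rank Fusion: score(d) = sum over rankers of 1 / (k + rank_d).
# k = 60 is the conventional constant; the keyframe IDs are illustrative.
def rrf(rankings, k: int = 60):
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

clip_top = ["kf_12", "kf_07", "kf_33"]      # ranking from CLIP embeddings
siglip_top = ["kf_07", "kf_33", "kf_02"]    # ranking from SigLIP2 embeddings
print(rrf([clip_top, siglip_top]))          # kf_07, present in both, rises to the top
```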

16:40
Efficient Video Retrieval for Less-Resourced Languages via Multi-Modal Semantic Search

ABSTRACT. The surge of multimedia data has increased the demand for intelligent video retrieval systems capable of understanding complex natural-language queries. Current multimodal approaches often underperform in less-resourced languages, where semantic ambiguity, linguistic diversity, and cultural context hinder retrieval accuracy. To address this, we propose a multilingual video retrieval framework that integrates multimodal embeddings and a caption-based module to enhance performance. The caption-based module provides fine-grained, language-adaptive captions for each keyframe, segmented into overlapping chunks for embedding and re-ranking. This design enables precise semantic alignment between video content and user queries, improving both contextual relevance and retrieval precision. We evaluate the framework on Vietnamese, an under-resourced but linguistically rich language, demonstrating its adaptability and effectiveness. Experiments on large-scale Vietnamese video datasets, including the 2025 Ho Chi Minh AI Challenge, show that our approach significantly improves cross-modal understanding and retrieval performance, highlighting its potential as a robust foundation for multilingual multimedia search systems.

16:00-17:20 Session 11B: SOICT Technical Session XXII: AI Applications
Chair:
Location: Ballroom B
16:00
AuMoM: A Framework for Learning Discriminative Speaker Embeddings using a Mamba-based Mixture of Experts and Contrastive Loss

ABSTRACT. This paper presents Audio MoE-Mamba (AuMoM), an innovative speaker verification system optimized for scalability and efficiency. AuMoM begins by transforming audio waveforms into spectrogram patches, which are then embedded as vectors and processed by a Mamba Encoder enhanced with a Mixture of Experts (MoE) architecture. Unlike conventional attention-based models, the bidirectional Mamba Encoder leverages state-space modeling and convolutional operations to achieve faster processing speeds. The MoE layer dynamically routes inputs to specialized expert sub-models, increasing model capacity and efficiency. A classification token placed within the sequence facilitates learning bidirectional context. Finally, a Siamese Network compares the learned audio features using a contrastive loss function to distinguish between different speakers. This architecture enables AuMoM to produce discriminative speaker embeddings effectively, combining attention-free processing with MoE for enhanced computational efficiency.

16:20
A Survey on Challenges and Emerging Frontiers of Multi-Agent Systems

ABSTRACT. Multi-Agent Systems (MAS) have emerged as a fundamental approach for solving dynamic and distributed problems across domains such as robotics, communication networks, and intelligent decision-making. MAS agents are characterized by autonomy, sociality, and flexibility [1]. Recent advances such as deep reinforcement learning (DRL), large language model (LLM)-based agents, and context awareness have broadened MAS capabilities, but existing surveys are often limited to specific subdomains or focus on outdated platforms, ignoring the convergence between learning-based and language-based systems. This review provides a comprehensive, technically rigorous view that connects classical MAS theory with emerging paradigms. We analyze cross-cutting system challenges, including scalability, cybersecurity, and privacy, and synthesize recent model families across five functional research areas: (1) MAS for Complex Problem Solving and Planning; (2) Embodied MAS for Physical Environments; (3) MAS for Emergent Communication and Decentralized Coordination; (4) MAS for Human-AI Teaming and Social Intelligence; (5) Toward Generalist and Multi-tasking MAS. This work aims to consolidate the fragmented literature, highlight common challenges, and outline future research opportunities for developing scalable, general, and interoperable MAS.

16:40
Improving Plant Species Distribution Models with Hydrologic and Topographic Features

ABSTRACT. Species distribution models (SDMs) for plants typically prioritize climate and broad remote sensing while under-using hydrologic and topographic information. Using European presence–absence surveys from GeoPlant (~94k plots) and an XGBoost baseline built on Location + Climate + Land-cover (LCL), we quantify the added value of hydrologic and terrain context and test sensitivity to elevation source. We derive river and lake descriptors from HydroRIVERS/HydroLAKES (e.g., network position, connectivity, discharge, and lentic morphology), compute the Topographic Position Index (TPI) at multiple radii (150–3000 m), and compare predictors derived from ASTER GDEM versus Copernicus GLO-30 (COP). Models are trained and evaluated over five seeds using standard discrimination and retrieval metrics. Three consistent findings emerge. First, DEM choice matters: across like-for-like configurations, COP provides more informative elevation for SDM predictors than ASTER. Second, fine-scale topography helps when derived from high-quality elevation: among TPI scales, the 150 m radius yields the clearest improvement when paired with COP. Third, hydrologic context is complementary: river features improve the LCL baseline, and combining rivers with lake descriptors yields further gains. The best configuration—COP elevation + TPI (150 m) + river + lake features—achieves the highest overall accuracy, raising the F1 score to 29.97. Feature importances corroborate these trends: latitude dominates among location variables, several BIO-climatic predictors remain near the top, river-network distance appears among the top five features, and TPI adds a measurable (though smaller) contribution alongside land cover and elevation. Together, these results provide a practical recipe for integrating hydro-topography into scalable plant SDMs and highlight the benefits of high-quality elevation data and fine-scale terrain context for continent-scale prediction.
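The Topographic Position Index used above has a simple definition: the elevation of a cell minus the mean elevation of its circular neighborhood. A minimal sketch, with the radius given in cells (converting the paper's 150-3000 m radii to cells depends on the DEM resolution):

```python
# TPI sketch: cell elevation minus the mean of a circular neighbourhood.
# Positive values indicate ridges/peaks, negative values valleys.
# The toy DEM and radius are illustrative.
import numpy as np
from scipy.ndimage import convolve

def tpi(dem: np.ndarray, radius: int) -> np.ndarray:
    y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    kernel = (x * x + y * y <= radius * radius).astype(float)
    kernel /= kernel.sum()                       # mean over the circular window
    return dem - convolve(dem, kernel, mode="nearest")

dem = np.add.outer(np.arange(50), np.arange(50)).astype(float)  # toy slope
dem[25, 25] += 10                                               # a local peak
print(tpi(dem, radius=5)[25, 25] > 0)                           # True: ridge-like
```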

16:00-17:20 Session 11C: SOICT Technical Session XXIII: Recent Advances in Cyber Security
Chair:
Location: Yersin A
16:00
Password Generation Based on GenAI for Evaluating the Security of Password-Based Control Systems

ABSTRACT. The rapid growth of Generative AI (GenAI) brings new challenges for password security. Traditional rules based only on length or character complexity are insufficient to measure real strength. Currently, many AI models for password guessing show severe limitations: they often repeat guesses, generate candidates in random order, and fail to follow real attack patterns. This study presents a new framework for pattern-aware, search-based, ordered password generation. It combines a pattern-conditioned Generative Pretrained Transformer (PagPassGPT) with a tree-search algorithm (SOPG). This design creates password guesses in order, without repetition, and with correct patterns. The model was trained on the RockYou dataset and compared with advanced password-guessing models. Results show substantial improvement: our model reached 20.8% success within 1,000 guesses, while PassGPT reached only 2.2%. It also avoided repetition entirely (0%), unlike others that repeated up to 99.9% of guesses. In conclusion, the framework is reliable for testing password policies and simulating AI-based attacks.

16:20
FusionMalNet: A Hybrid Ensemble Architecture for Windows Malware Detection

ABSTRACT. Malware detection remains a critical yet challenging task, as traditional static analysis techniques struggle with increasingly sophisticated obfuscation methods. In this paper, we propose FusionMalNet, an ensemble deep-learning framework leveraging multimodal representations for robust Windows malware classification. FusionMalNet combines visual patterns extracted from Gramian Angular Field images via a RegNetY convolutional architecture and structural information captured from static file features using XGBoost. These complementary representations are integrated through a compact neural fusion module, enabling accurate and confident predictions. To facilitate evaluation, we also introduce a dataset of Portable Executable (PE) files from which aligned visual and tabular representations are derived. Extensive experiments demonstrate that FusionMalNet achieves state-of-the-art performance among the compared methods, attaining 99.67% accuracy and 99.97% ROC-AUC, surpassing multiple existing approaches across diverse evaluation metrics.
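The Gramian Angular Field transform used above maps a 1-D series (e.g., bytes of a PE file) to an image: rescale to [-1, 1], take the angular encoding phi = arccos(x), and form G[i, j] = cos(phi_i + phi_j). A minimal sketch of the standard summation variant, with a toy byte sequence:

```python
# Gramian Angular (Summation) Field sketch: the standard construction,
# applied here to a toy byte sequence for illustration.
import numpy as np

def gaf(series: np.ndarray) -> np.ndarray:
    lo, hi = series.min(), series.max()
    x = 2 * (series - lo) / (hi - lo) - 1        # rescale to [-1, 1]
    phi = np.arccos(np.clip(x, -1, 1))           # angular encoding
    return np.cos(phi[:, None] + phi[None, :])   # G[i, j] = cos(phi_i + phi_j)

bytes_seq = np.frombuffer(b"MZ\x90\x00\x03\x00\x00\x00\x04", dtype=np.uint8)
img = gaf(bytes_seq.astype(float))
print(img.shape, img.min() >= -1, img.max() <= 1)
```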

16:40
PowerGAN: Enhancing PowerShell Attack Detection through GAN-Driven Data Generation

ABSTRACT. As defensive solutions advance, Living Off the Land (LoTL) attacks have emerged as a powerful evasion technique by exploiting native system tools. Among them, PowerShell—deeply integrated into Windows—has become a prime vector for stealthy LoTL attacks that frequently bypass traditional detection methods. While machine learning (ML) and deep learning (DL) approaches have been widely applied, limitations in data quantity and class balance for PowerShell scripts hinder their effectiveness. To address this challenge, we propose PowerGAN, a generative deep learning approach for producing additional training data to enhance ML- and DL-based PowerShell detection. Experimental results demonstrate that PowerGAN significantly improves detection performance, and we further compare different GAN variants to identify the most suitable model for this problem.

16:00-17:20 Session 11D: SOICT Technical Session XXIV: Multimedia Processing
Chair:
Location: Yersin B
16:00
SimGraph: A Unified Framework for Scene Graph-Based Image Generation and Editing

ABSTRACT. Recent advancements in Generative Artificial Intelligence (GenAI) have significantly enhanced the capabilities of both image generation and editing. However, current approaches often treat these tasks separately, leading to inefficiencies and challenges in maintaining spatial consistency and semantic coherence between generated content and edits. Moreover, a major obstacle is the lack of structured control over object relationships and spatial arrangements. Scene graph-based methods, which represent objects and their interrelationships in a structured format, offer a solution by providing greater control over composition and interactions in both image generation and editing. To address this, we introduce SimGraph, a unified framework that simultaneously integrates scene graph-based image generation and editing, enabling precise control over object interactions, layouts, and spatial coherence. In particular, our framework integrates token-based generation and diffusion-based editing within a single scene graph-driven model, ensuring high-quality and consistent results. Through extensive experiments, we empirically demonstrate that our approach outperforms existing state-of-the-art methods.

16:20
Forged Calamity: Benchmark for Cross-Domain Synthetic Disaster Detection in the Age of Diffusion

ABSTRACT. The rapid advancement of text-to-image diffusion models has enabled the creation of highly photorealistic synthetic images that closely resemble real photographs, making it increasingly difficult to distinguish authentic content from AI-generated fabrications. This poses challenges for cybersecurity, digital forensics, and disaster response, where fake imagery of floods, fires, or earthquakes can spread misinformation or disrupt emergency operations. To address this, we introduce Forged Calamity, a benchmark dataset for synthetic disaster detection containing 30,000 images, including 6,000 real and 24,000 synthetic samples generated by four diffusion models. Comprehensive experiments across fine-tuned and zero-shot settings reveal consistent weaknesses in current forensic approaches. Fine-tuned detectors perform well in-distribution but lose up to 50% accuracy on unseen generators or disaster types, showing overfitting to model-specific artifacts. Zero-shot generalized detectors also struggle to maintain stable accuracy, with only limited resilience in a few representation-robust models. These findings highlight persistent generalization gaps and the urgent need for domain- and model-agnostic detection methods to ensure visual authenticity in the diffusion era.

16:40
HERF: Hybrid Evidence Retrieval Framework for Entity-Centric Question Answering

ABSTRACT. Question answering (QA) systems are typically built on either knowledge bases or the open web, each with distinct advantages and limitations. Structured knowledge bases provide high-precision answers but are often incomplete, while open web data offer broader coverage at the cost of factual reliability. To address this trade-off, this paper presents HERF (Hybrid Evidence Retrieval Framework), a hybrid architecture designed for entity-centric question answering—a task focused on queries about specific entities. HERF’s parallel architecture retrieves complementary evidence: one stream queries a knowledge base, while the other extracts contextual evidence from open web text. An aggregator then fuses evidences from both streams into a unified set of candidate answers, ensuring both precision and coverage. On the EntityQuestions benchmark, HERF achieves superior performance compared to existing comparative models. The results demonstrate that a parallel fusion strategy is a highly effective approach, highlighting the potential of integrating hybrid evidence to build more robust and accurate QA systems.