SOICT 2024: THE 13TH INTERNATIONAL SYMPOSIUM ON INFORMATION AND COMMUNICATION TECHNOLOGY
PROGRAM FOR SATURDAY, DECEMBER 14TH, 2024

09:50-10:20 Session 12: Poster session III
MAVERICS: Multimodal Advanced Visual Event Retrieval with Integrated CPU-Optimized Search

ABSTRACT. The increasing volume of visual data in news archives and media sources poses significant challenges for efficient event retrieval. This paper presents a multimodal approach to tackle the problem of Event Retrieval from Visual Data. Our system integrates several techniques to process diverse query types, including text, image, and video. For image-text retrieval, the BLIP2 model is used to embed both images and text descriptions. In cases where queries are in Vietnamese, we employ the pre-trained VietAI/envit5-translation model to translate prompts into English before processing them with BLIP2. Object detection is handled by YOLOWorldv2, and text extraction from images utilizes PP-OCRv3 and VGG Transformer. Additionally, WhisperX is employed for audio-to-text conversion. Embeddings from textual data, whether derived from OCR or audio, are generated using sentence-transformers/all-MiniLM-L6-v2. These embeddings are indexed using Usearch, enabling fast and efficient retrieval. Furthermore, we developed a high-speed temporal search mechanism that calculates scores and combinations for consecutive related frames to improve performance in temporal queries. The system is capable of running efficiently on CPUs, with a maximum query processing time of 2 seconds for advanced queries, such as Temporal search, which require multiple models to run consecutively, making it a scalable solution for large-scale video data retrieval. Additionally, we have built a user-friendly interface using Streamlit, enabling users to easily interact with and utilize the system.
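
A minimal sketch of the Usearch indexing step described above, assuming the BLIP2 embeddings are already computed (random vectors stand in for them here); the dimensionality and key scheme are illustrative, not the authors' exact configuration:

```python
# Hedged sketch: index keyframe embeddings with Usearch, then run a
# cosine-similarity search for an embedded text query. Random vectors
# stand in for real BLIP2 features.
import numpy as np
from usearch.index import Index

dim = 256                                        # embedding size (illustrative)
index = Index(ndim=dim, metric="cos")            # cosine metric, CPU-friendly

keys = np.arange(10_000, dtype=np.uint64)        # one key per keyframe
vectors = np.random.rand(10_000, dim).astype(np.float32)
index.add(keys, vectors)

query = np.random.rand(dim).astype(np.float32)   # embedded text query
matches = index.search(query, 10)                # top-10 nearest keyframes
for key, dist in zip(matches.keys, matches.distances):
    print(int(key), float(dist))
```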

Forecasting Traffic Flow under Uncertainty: A Case Study in Da Nang

ABSTRACT. This paper discusses the design and implementation of a modern traffic flow prediction system using data from street surveillance cameras deployed on the website 0511.vn. The core objective of the research was to develop an accurate and efficient prediction model based on direct image analysis and real-time data, providing instant traffic information and forecasting short-term traffic trends. First, existing image processing and machine learning methods were identified and evaluated to extract and classify vehicles from the collected video data. Subsequently, models combining ARIMA and LSTM methods were designed to predict the density and movement of vehicles on the roads. These methods were tested and optimized through a series of experiments on historical and real-time data collected from 0511.vn, marking a significant advancement in applying video surveillance technology to urban traffic management. The research results not only contribute to the field of data science and image processing but also have practical potential in supporting the decision-making of traffic management agencies and improving the community's commuting experience.
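
The abstract does not specify how ARIMA and LSTM are combined; one common hybrid, sketched below on synthetic data, lets ARIMA model the linear component of the series and trains an LSTM on its residuals. The model orders, window size, and training schedule are illustrative assumptions, not the paper's exact design.

```python
# Hedged sketch of an ARIMA+LSTM hybrid: ARIMA captures the linear trend,
# an LSTM learns the nonlinear residuals. Synthetic data stands in for the
# 0511.vn vehicle-density series.
import numpy as np
import torch
import torch.nn as nn
from statsmodels.tsa.arima.model import ARIMA

series = np.sin(np.linspace(0, 40, 500)) + 0.1 * np.random.randn(500)

arima = ARIMA(series, order=(2, 0, 1)).fit()
residuals = series - arima.fittedvalues          # nonlinear part for the LSTM

def windows(x, w=24):
    X = np.stack([x[i:i + w] for i in range(len(x) - w)])
    y = x[w:]
    return (torch.tensor(X, dtype=torch.float32).unsqueeze(-1),
            torch.tensor(y, dtype=torch.float32))

X, y = windows(residuals)

class ResidualLSTM(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(1, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)
    def forward(self, x):
        out, _ = self.lstm(x)
        return self.head(out[:, -1]).squeeze(-1)

model = ResidualLSTM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), y)
    loss.backward()
    opt.step()

# One-step-ahead forecast = linear ARIMA forecast + predicted residual.
last_window = torch.tensor(residuals[-24:], dtype=torch.float32).view(1, 24, 1)
forecast = arima.forecast(1)[0] + model(last_window).item()
print(forecast)
```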

Enhanced Video Retrieval System: Leveraging GPT-4 for Multimodal Query Expansion and Open Image Search

ABSTRACT. The exponential growth of video data on digital media and video-sharing platforms has created an urgent need for efficient content-based video retrieval systems. Traditional methods, such as object recognition, text extraction, and color analysis, have been extensively explored, while current approaches leverage pre-trained multimodal models like CLIP, which give impressive results in text-based video retrieval. However, as large-scale databases grow, their diversity and complexity increase, and fixed text queries often produce limited results. To address this, we propose an enhanced video retrieval framework that integrates CLIP with GPT-4 through API interaction. By employing advanced prompt engineering techniques, we dynamically expand and refine text queries, enabling broader and more effective exploration of video datasets. Additionally, our framework translates human-generated queries into machine-optimized formats for vision-language models, enhancing retrieval precision. Furthermore, to utilize large image databases such as Google, we introduce an open image-based search functionality that allows users to import reference images similar to their text queries, improving the system’s ability to find relevant content. This dual approach enhances query-model alignment, increasing the likelihood of retrieving accurate and contextually relevant results.

ReViMM: Enhanced Video Retrieval with Reweighting Mechanism for Multi-Modal Queries

ABSTRACT. The increasing volume of video data is posing significant challenges for efficient event retrieval. Applications in fields such as security, multimedia management, and event analysis require systems that are not only robust but also flexible enough to handle various input types, such as text, images, audio, and data from multiple sources. Traditional approaches often struggle to handle the large volume and diversity of data, leading to inefficient retrieval in complex scenarios. Our proposed system integrates FAISS (Facebook AI Similarity Search) for fast similarity search and ElasticSearch to make the search process more efficient. The system can process various input types, including descriptive text and similar images, while using Whisper for speech recognition and transcription. Additionally, Large Language Models (LLMs) are employed to generate detailed and accurate image descriptions. A key highlight of our approach is the use of reweighting to adjust the importance of each word or image during the search process. By reweighting, the system can optimize the prioritization of the most relevant information for each query, significantly enhancing the accuracy of the search results. This approach not only improves the precision of data retrieval but also greatly enhances the user experience. By offering comprehensive and precise results for complex queries, the system helps users easily find relevant information in large and diverse datasets, meeting the growing demands of applications in security and multimedia analysis.

LLM-Powered Video Search: A Comprehensive Multimedia Retrieval System

ABSTRACT. Image and video search has become an important problem amid the rapid growth of image and video-sharing platforms. The explosion of video content on the Internet has created an urgent demand for effective content management and retrieval. Traditional approaches, such as text-based, image, object, audio, and color searches, have achieved some success but often lack consistency and fail to meet user needs comprehensively. To address this issue, we propose an artificial intelligence (AI) system that integrates multiple existing search methods with a Combined Ranking Score (CRS) algorithm. CRS balances the ranking and scores from different approaches, optimizing query results. The system also uses a Retrieval-Augmented Generation (RAG) architecture combined with large language models (LLMs) to enhance contextual understanding and deliver more accurate results. Users can interact directly with our AI system to easily and efficiently search for desired moments in videos. This approach promises to significantly improve the user experience in multimedia content search and retrieval.
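
The abstract does not give the exact CRS formula; the sketch below shows one plausible reading of the idea, fusing reciprocal ranks with max-normalized scores from each search method under tunable weights. Names and weights are illustrative assumptions.

```python
# Hedged sketch of a combined ranking score: fuse per-method ranks and
# scores into one ranking. The real CRS formula may differ.
def combined_ranking_score(result_lists, rank_weight=0.5, score_weight=0.5):
    """result_lists: {method_name: [(frame_id, score), ...] sorted best-first}"""
    fused = {}
    for method, results in result_lists.items():
        max_score = max(s for _, s in results) or 1.0
        for rank, (frame_id, score) in enumerate(results, start=1):
            contribution = (rank_weight / rank                   # reciprocal rank
                            + score_weight * score / max_score)  # normalized score
            fused[frame_id] = fused.get(frame_id, 0.0) + contribution
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

results = {
    "clip":  [("v1_f120", 0.91), ("v2_f030", 0.85)],
    "ocr":   [("v2_f030", 12.0), ("v3_f555", 9.5)],
    "audio": [("v1_f120", 0.70)],
}
print(combined_ranking_score(results)[:3])
```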

Interactive Video Retrieval System for AI Challenge 2024 Using CLIP, RAM++, and LLM-Enhanced Tag Matching

ABSTRACT. In this paper, we present an interactive video retrieval system developed for the AI Challenge 2024. The system offers multi-modal search functionality, allowing users to search using text, images, or tags. At its core, the system leverages CLIP (Contrastive Language-Image Pre-training) to enable efficient video retrieval from both natural language and image-based inputs. For tag-based queries, we incorporate the state-of-the-art Recognize Anything Plus (RAM++) image tagging model. However, due to the large number of tags produced by RAM++, it becomes impractical to manually select the most relevant tags for the input query. To address this challenge, we use the Gemini Large Language Model (LLM) to automatically select the most appropriate tags for the given query. Additionally, we describe our temporal search algorithm, which further enhances retrieval performance. Our experiments show that this combination of models provides a scalable and high-performance solution for video search applications in real-world scenarios.

Transforming Video Search: Leveraging Multimodal Techniques and LLMs for Optimal Retrieval
PRESENTER: Truong Dinh

ABSTRACT. The fast development of online video material has made video searching critical in the digital era. Traditional approaches, such as image-text retrieval, object, audio, color, and text-based searches, have made considerable advances in this field. However, these approaches frequently require refinement when dealing with numerous users’ inquiries at the same time, which might result in overlapping searches. Furthermore, present video search algorithms must increase their capacity to respond to complicated queries involving data extraction from several frames. Overcoming these limits is critical for creating scalable and user-friendly video search engines. In this research, we provide an improved video search system that includes three significant breakthroughs. We enhance text detection for Vietnamese, employ picture captioning to increase search relevance, and allow users to modify queries with a large language model (LLM) for more precision. These innovations considerably increase the search process’s efficiency and accuracy. The intuitive interface enables seamless searches by queries, frame IDs, and related images, while offering sophisticated features such as query expansion, result aggregation, and integrated feedback for enhanced search accuracy.

Real-Time Multi-User Multimedia Event Retrieval Application System Using WebSocket Protocol

ABSTRACT. Event query systems for video have become essential tools in various fields such as surveillance, sports analytics, and media management, where accurately retrieving significant moments is crucial. This paper presents an event query system optimized for multi-user access, utilizing the WebSocket protocol to enhance real-time interaction. The system allows multiple users to simultaneously query specific events in video content while supporting cross-validation of results to improve accuracy.

The primary research method focuses on integrating WebSocket technology, enabling continuous communication between users and the system, thus enhancing user experience. Additionally, the system features a ranking mechanism for events based on user votes, which aids in optimizing suggestions and encourages community participation.

Experimental results demonstrate that the system operates effectively in environments with multiple users, providing accurate and rapid queries while enhancing user engagement through real-time interactive features. This system marks a significant advancement in applying modern technologies to video event querying, opening up substantial opportunities for practical applications across various fields.
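
As a concrete illustration of the WebSocket pattern described above, the sketch below uses the Python websockets package (v13+ asyncio API) with a stubbed-out retrieval backend; the message schema and vote logic are invented for illustration, not taken from the paper.

```python
# Hedged sketch: each client sends a query over a persistent WebSocket,
# the server answers it and broadcasts vote updates to all connected users.
import asyncio
import json

from websockets.asyncio.server import broadcast, serve  # websockets >= 13

CLIENTS = set()
VOTES = {}  # event_id -> vote count (in-memory stand-in)

async def handle(ws):
    CLIENTS.add(ws)
    try:
        async for raw in ws:
            msg = json.loads(raw)
            if msg["type"] == "query":
                results = [{"event_id": "e1", "score": 0.9}]  # stubbed retrieval
                await ws.send(json.dumps({"type": "results", "items": results}))
            elif msg["type"] == "vote":
                VOTES[msg["event_id"]] = VOTES.get(msg["event_id"], 0) + 1
                broadcast(CLIENTS, json.dumps({"type": "votes", "votes": VOTES}))
    finally:
        CLIENTS.discard(ws)

async def main():
    async with serve(handle, "localhost", 8765):
        await asyncio.get_running_loop().create_future()  # run forever

asyncio.run(main())
```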

Application of the SFE Feature Selection Method for Multi-Omic Biomarker Discovery in Brain Cancer Subtyping

ABSTRACT. Glioblastoma (GBM) is an aggressive brain cancer with poor prognosis, making the identification of reliable molecular biomarkers vital for early detection and improving treatment strategies. This study introduces a two-phase framework for discovering and validating GBM biomarkers. In the first phase, we applied the Simple, Fast, and Efficient (SFE) feature selection algorithm to high-dimensional multi-omics data from The Cancer Genome Atlas (TCGA) GBM cohort to identify potential biomarkers. In the second phase, we assessed the explainability of these biomarkers through two approaches: first, by comparing them with reference data from established databases, and second, by evaluating their performance using classical machine learning models. This two-phase framework is versatile and can be adapted to other cancer datasets, offering a promising approach to biomarker discovery for improving cancer treatment.

Enhancing Video Retrieval via Synergized Image Embeddings and RAG

ABSTRACT. This paper introduces an advanced video retrieval system designed to efficiently process and retrieve video content through innovative information extraction and embedding techniques. The system indexes videos by extracting keyframes and removing redundancies using CNN-based embeddings. Keyframes are embedded with the BEiT-3 model and enriched with metadata from object detection. Videos are segmented into overlapping sub-videos, with transcripts aligned and embedded using the Alibaba-NLP model. All data is stored in cloud storage and indexed in a vector database for efficient retrieval. User queries are processed through multiple embedding models, supporting versatile search capabilities across transcripts, frames, and descriptions. The retrieval process includes a re-ranking algorithm that filters and ranks keyframes, providing users with the most relevant results. Additionally, Retrieval-Augmented Generation (RAG) is employed to enhance search precision, offering a robust solution for large-scale video content analysis and retrieval.

A Comprehensive Video Event Retrieval System for Vietnamese News: Integrating CLIP ViT, TASK-former, Transcripts, and OCR

ABSTRACT. In response to the growing need for precise and efficient video retrieval, we present a versatile video event retrieval system that supports multiple query modalities. Our system integrates five query modes: quick search, temporal search, hybrid text-and-sketch search, transcript-based search, and OCR-based search. Powered by the CLIP ViT-L model, the quick search matches user queries to relevant video segments. Temporal search combines CLIP embeddings with mathematical techniques to pinpoint specific timeframes, while the TASK-former model supports hybrid sketch-text search, enabling users to locate scenes using hand-drawn sketches and textual descriptions. Transcript-based search helps users overcome socio-cultural barriers, and OCR-based search utilizes text extracted from video keyframes. The system’s interface allows users to input queries and browse top-ranked results in a manner that tackles the clustered viewing problems seen in common systems. In addition, users can explore visually similar images and preview short clips, improving both precision and accessibility in video content retrieval.
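
A temporal search over per-frame similarity scores can be illustrated with a simple windowed pairing over two sub-query score arrays; the scoring rule below (sum of the two similarities within a bounded gap) is an assumption for illustration, not the paper's exact formula.

```python
# Hedged sketch of temporal search: given per-keyframe CLIP scores for two
# consecutive sub-queries, find the best ordered pair of frames within a
# time window.
import numpy as np

def temporal_search(scores_a, scores_b, max_gap=50, top_k=5):
    """scores_a[i], scores_b[j]: similarity of frame i/j to sub-query A/B."""
    candidates = []
    for i, sa in enumerate(scores_a):
        j_end = min(i + 1 + max_gap, len(scores_b))
        window = scores_b[i + 1:j_end]
        if len(window) == 0:
            continue
        j = i + 1 + int(np.argmax(window))   # best later frame for sub-query B
        candidates.append((sa + scores_b[j], i, j))
    return sorted(candidates, reverse=True)[:top_k]

rng = np.random.default_rng(0)
a, b = rng.random(1000), rng.random(1000)
for score, i, j in temporal_search(a, b):
    print(f"frames ({i}, {j}) combined score {score:.3f}")
```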

LameFrames: Optimizing Video Event Retrieval Through Strategic Integration and Individual Strategy Enhancement

ABSTRACT. The video event retrieval task aims at retrieving video events from a large video collection that are semantically relevant to a given textual or visual query. Several approaches have been introduced for this promising task, but overall, they can be categorized either as embedding-based techniques or proxy techniques. Each has its own advantages and appropriate usage context. The huge size of the datasets often seen in this task is also a challenge for retrieval systems to work efficiently. This paper presents a comprehensive solution and application for video retrieval that addresses the challenges of speed and accuracy in large-scale datasets. Our approach integrates two complementary methods: an image semantic search using CLIP visual-textual embeddings together with an advanced FAISS vector retrieval index; and a text-proxy image search using optical character recognition (OCR) and automatic speech recognition (ASR) together with Elastic Text Search. In the first approach, we further implement four strategies that leverage and enhance the power of CLIP embeddings. Through experiments, we demonstrate that our approaches provide high accuracy and efficiency for video retrieval applications across a wide variety of queries. We also found that the quality of the query noticeably impacted the search results, and that the ability of end users to customize the search by modifying different factors also contributed to quick and successful queries.
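
A minimal sketch of the embedding-based branch: CLIP text features queried against a FAISS inner-product index of keyframe embeddings (random vectors stand in for real frame features here). The model checkpoint and flat index type are illustrative choices, not necessarily the authors'.

```python
# Hedged sketch: CLIP text embedding searched against a FAISS index of
# L2-normalized frame embeddings (inner product == cosine similarity).
import faiss
import numpy as np
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Stand-ins for keyframe embeddings (normally from get_image_features).
frame_embs = np.random.rand(5000, 512).astype(np.float32)
faiss.normalize_L2(frame_embs)
index = faiss.IndexFlatIP(512)
index.add(frame_embs)

inputs = processor(text=["a firefighter spraying water"],
                   return_tensors="pt", padding=True)
with torch.no_grad():
    q = model.get_text_features(**inputs).numpy().astype(np.float32)
faiss.normalize_L2(q)
scores, ids = index.search(q, 10)   # top-10 keyframes for the query
print(ids[0], scores[0])
```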

MMMSVR: An Advanced Video Retrieval and Question Answering System

ABSTRACT. Video retrieval, the process of locating specific video content within large datasets, presents a significant challenge in the era of digital multimedia. In response to this, as part of the Ho Chi Minh City AI Challenge 2024, this paper presents an advanced multi-modality, multi-stage video retrieval framework named MMMSVR, enhanced with the ability to answer supplemental questions, such as automatic counting of objects. The proposed method leverages vision-language models, combining CLIP ViT-H/14, BLIP2, and BEiT-3 for feature encoding, and implements a re-ranking mechanism based on a weighting system. Furthermore, a wide range of query modalities, such as Optical Character Recognition (OCR), Object Detection, and Automatic Speech Recognition, are integrated to refine and improve the retrieval process. The system supports both single and multi-text queries for event sequences, facilitating efficient video retrieval based on image, audio, and textual attributes. Moreover, the framework includes an image-based query feature, enriching the model’s versatility and improving retrieval accuracy. The proposed approach demonstrates significant performance improvements and offers a robust, flexible solution for video search and question-answering tasks.

CLIP-Enhanced Lifelog Retrieval System: Robust Multi-Modal Media Search with Real-Time Performance

ABSTRACT. In this paper, we introduce a robust media retrieval system designed to address the challenges posed by large-scale, multi-modal data retrieval tasks, particularly in the context of image retrieval, which plays a crucial role in surfacing key moments from vast media datasets. Efficient and accurate image retrieval is essential for navigating and retrieving relevant events, making it a vital component of any advanced retrieval system. Our system builds on retrieval models by integrating advanced features, specifically CLIP-based image search, YOLOv8 for precise object detection, and temporal search to handle long and complex queries. Key optimizations include enhanced visual similarity search and an intuitive, interactive interface that ensures fast and efficient query results. By storing extracted features in the Milvus vector database, the system achieves significant speed improvements in retrieval, enabling real-time performance. Benchmarking our model at the AI Challenge (AIC24) in Ho Chi Minh City, the system demonstrated top-tier results, particularly excelling in KIS-type queries, where it achieved 100% frame retrieval accuracy (15/15) with an average query response time under 30 seconds. These results highlight the effectiveness of our system in handling diverse and complex queries, making it a valuable tool for lifelog retrieval tasks, especially in improving the user experience for both novice and expert users. Our code is publicly available at https://github.com/trnKhanh/AIC24.

Enhanced Video Event Retrieval through Adaptive Multi-Model Fusion with Large Language Models

ABSTRACT. The retrieval of events in videos has emerged as a critical area of research due to the complexity of multimedia information and the rapid growth of digital content. Current methods typically rely on models to extract features from various data sources and combine these features for enhanced retrieval. However, this approach often necessitates prioritizing model weighting based on query context, as different models may yield varying relevance depending on the nature of the query. To address these challenges, this study proposes an adaptive multi-model fusion technique within a video event retrieval framework, leveraging large language models (LLMs) to dynamically adjust weights for multimodal data, thereby enhancing context-aware retrieval. Evaluated on the AI Challenge HCMC (AIC) 2024 dataset, our method achieved a success rate of 91.9%, proving the effectiveness of our solution.

"MAVEN: Video Retrieval System using A Multi-Agent Visual Exploration Network"

ABSTRACT. Effective video retrieval systems are essential as video data grows across various fields. Traditionally, these systems rely on OCR, object detection, color extraction, and audio analysis. Current approaches like CLIP bridge the text-image embeddings gap for search, but they often lack contextual depth for complex, multi-frame searches. We propose a solution that integrates traditional methods and CLIP with advanced language models and prompting techniques for image captioning, extracting rich information from individual frames. Our system includes an Agent that automates searches, classifies queries, generates prompts, and verifies results, improving search accuracy while reducing user effort. Additional features like temporal search, video previews, and frame filtering further enhance the user experience. This comprehensive approach provides a powerful toolkit for achieving more accurate and efficient video search results, addressing the growing complexity of video data retrieval across various domains.

Can Image Generative Models be Considered Experts?

ABSTRACT. This paper addresses foundational challenges in evaluating Generative Artificial Intelligence (GAI), focusing on the transition from expertise evaluation to intelligence evaluation. It critiques both quantitative and qualitative metrics for GAI, highlighting limitations in human-algorithm interaction environments. The study examines knowledge representation in neural network architectures and the processes of filtering versus tokenization for image processing, emphasizing inconsistencies and lack of standardization in test design. The paper then proposes a research design to evaluate the expertise of GAI models, with a focus on image generation. The methodology includes training an expert model, evaluating model tasks, developing an evaluation framework, and conducting human evaluations. The paper also acknowledges some potential research design limitations and considerations. The authors hope that this research design will aid in the development of GAI models that can have practical applications in high-technical industries such as industrial manufacturing, electrical engineering, and other high-precision industries.

10:20-12:00 Session 13A: Generative AI
Location: Danang 1
10:20
Improving Vietnamese Legal Document Retrieval using Synthetic Data

ABSTRACT. In the field of legal information retrieval, effective embedding-based models are essential for accurate question-answering systems. However, the scarcity of large annotated datasets poses a significant challenge, particularly for Vietnamese legal texts. To address this issue, we propose a novel approach that leverages large language models to generate high-quality, diverse synthetic queries for Vietnamese legal passages. This synthetic data is then used to pre-train retrieval models, specifically bi-encoder and ColBERT, which are further fine-tuned using contrastive loss with mined hard negatives. Our experiments demonstrate that these enhancements lead to strong improvements in retrieval accuracy, validating the effectiveness of synthetic data and pre-training techniques in overcoming the limitations posed by the lack of large labeled datasets in the Vietnamese legal domain.
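
A compact sketch of a contrastive objective with in-batch and mined hard negatives (InfoNCE-style), as commonly used when fine-tuning bi-encoders; the temperature and exact formulation are assumptions, not necessarily the authors'.

```python
# Hedged sketch of contrastive fine-tuning with hard negatives.
import torch
import torch.nn.functional as F

def contrastive_loss(q, pos, hard_negs, temperature=0.05):
    """q: (B, d) query embeddings; pos: (B, d) positive passages;
    hard_negs: (B, n, d) mined hard negatives per query."""
    q = F.normalize(q, dim=-1)
    pos = F.normalize(pos, dim=-1)
    hard_negs = F.normalize(hard_negs, dim=-1)
    logits_pos = q @ pos.t()                                 # in-batch negatives
    logits_hard = torch.einsum("bd,bnd->bn", q, hard_negs)   # mined negatives
    logits = torch.cat([logits_pos, logits_hard], dim=1) / temperature
    labels = torch.arange(q.size(0), device=q.device)        # diagonal = positive
    return F.cross_entropy(logits, labels)

B, n, d = 8, 4, 768
loss = contrastive_loss(torch.randn(B, d), torch.randn(B, d), torch.randn(B, n, d))
print(loss.item())
```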

10:40
A Diffusion Model for Personalized Text-to-Image Generation

ABSTRACT. Recent text-to-image generative models have demonstrated a remarkable ability to produce high-quality images that accurately match given text prompts. However, generating images with novel concepts, such as incorporating a subject ID provided by a reference image, remains challenging. This task, known as personalized image generation, aims to enable text-to-image models to adapt to new concepts while maintaining strong text-image alignment. In this work, we experiment with a simple yet effective face ID adapter module called FaceID-IpAdapter. This module transforms facial features obtained from off-the-shelf face embedding models into new token embeddings, which can be used alongside existing text token embeddings as conditions for pre-trained text-to-image diffusion models. Our model, which requires no test-time finetuning, achieves an impressive balance between face ID preservation and text-image alignment using only a single reference face image. During training, we also introduce a novel face ID loss to explicitly teach the models to preserve facial features from the reference image. Experimental results show that our model achieves an impressive 68.87% cosine similarity between facial features of the reference and generated images, which is 43% higher than another finetune-free method with roughly the same number of parameters.
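
The face ID loss is described as explicitly preserving facial features from the reference image; one natural reading, sketched below, penalizes low cosine similarity between reference and generated face embeddings. This is an illustrative formulation, not the paper's exact loss.

```python
# Hedged sketch of a face ID preservation loss.
import torch
import torch.nn.functional as F

def face_id_loss(ref_emb, gen_emb):
    """ref_emb, gen_emb: (B, d) embeddings from an off-the-shelf face model."""
    return (1.0 - F.cosine_similarity(ref_emb, gen_emb, dim=-1)).mean()

print(face_id_loss(torch.randn(4, 512), torch.randn(4, 512)).item())
```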

11:00
Enhancing Neural Machine Translation with Direct Preference Optimization Using Human Feedback

ABSTRACT. This paper presents a study on improving the quality of neural machine translation (NMT) for the English-Romanian language pair using Reinforcement Learning from Human Feedback (RLHF) via Direct Preference Optimization (DPO). Despite advancements in NMT, challenges remain, particularly for low-resource languages and personalized translations. By incorporating human feedback, the proposed approach demonstrates improvements in translation accuracy and naturalness. Evaluation metrics, including BLEU and chrF++, along with human assessments, show that the DPO-trained model performs better in aligning translations with human preferences, particularly in everyday conversational contexts.
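
The standard DPO objective the study builds on can be written in a few lines; the sketch below uses random log-probabilities as placeholders for the per-sequence values from the policy and a frozen reference model.

```python
# Hedged sketch of the DPO loss: push the policy to prefer the human-chosen
# translation over the rejected one, relative to a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Each argument: (B,) summed log-probabilities of a full translation."""
    chosen_ratio = policy_chosen_lp - ref_chosen_lp
    rejected_ratio = policy_rejected_lp - ref_rejected_lp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

B = 4
loss = dpo_loss(torch.randn(B), torch.randn(B), torch.randn(B), torch.randn(B))
print(loss.item())
```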

11:20
A Stable Diffusion Pipeline for Diverse Procedural Painting via Text Prompts

ABSTRACT. In digital art, artificial intelligence (AI) has found ubiquitous applications in (semi-)automatically producing captivating visuals with aesthetic appeal, pushing the boundaries of creativity and productivity. Pixel-based photo generation AI models and stroke-based neural painting methods have been successfully developed for creating photorealistic and artistic images. On the one hand, pixel-based models can directly predict the pixel values of raster images, allowing for detailed and realistic representations. On the other hand, stroke-based techniques provide a more aesthetic approach to art creation, mimicking the way humans draw and create paintings. By combining these two techniques, artists can achieve a harmonious blend of high fidelity and artistic interpretation, bringing exciting possibilities to the realm of digital art. In this work, we propose a pipeline for combining state-of-the-art AI methodologies in order to generate a collection of multiple procedural paintings via a single text prompt. Specifically, we employ an integration of Quality-Diversity (QD) optimization and Stable Diffusion (SD) to generate diverse high-quality images, which then become inputs for the Compositional Neural Painter (CNP) model to render sequences of painting strokes artistically drawing the images.

11:40
Enhancing Image Authenticity in the Age of Generative AI: an Autoencoder-Driven Fourier Transform based Approach

ABSTRACT. In the era of generative artificial intelligence (AI) applications, the challenge of distinguishing real from AI-generated synthetic images is critical for ensuring security and information authenticity. Our study presents a cutting-edge method for detecting synthetic images, combining a pretrained Autoencoder with Fourier transform techniques to extract unique image fingerprints. The main idea behind the use of the Autoencoder is that it would fail to reconstruct unnatural features in AI-generated images. The occurrence of such failure gives rise to an isolated attribute referred to as residual noise, which serves as an indicator of the image generation process and significantly improves the efficacy of detecting counterfeit images. This advancement results in an impressive improvement, yielding a +22.29% increase in accuracy and a +16.72% rise in AUC compared to state-of-the-art competitors. These results not only demonstrate the efficacy of our approach but also highlight its potential for widespread application in areas requiring robust security and information verification measures.
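
A toy sketch of the fingerprinting pipeline: reconstruct the image, take the residual, and compute its log-magnitude Fourier spectrum as a feature. The untrained autoencoder here is a stand-in for the pretrained model the paper uses, and the single-channel residual is a simplification.

```python
# Hedged sketch: autoencoder residual noise -> Fourier fingerprint.
import numpy as np
import torch
import torch.nn as nn

autoencoder = nn.Sequential(            # placeholder; the paper uses a
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),   # pretrained model
    nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1),
)

image = torch.rand(1, 3, 256, 256)      # stand-in for a real/synthetic image
with torch.no_grad():
    recon = autoencoder(image)

residual = (image - recon).squeeze(0).mean(0).numpy()   # residual noise map
spectrum = np.abs(np.fft.fftshift(np.fft.fft2(residual)))
fingerprint = np.log1p(spectrum)        # log-magnitude spectrum as feature
print(fingerprint.shape)                # feed this to the real/fake classifier
```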

10:20-12:00 Session 13B: Lifelog and Multimedia Event Retrieval
Location: Danang 2
10:20
Event Retrieval from Large Video Collection in Ho Chi Minh City AI Challenge 2024

ABSTRACT. The Ho Chi Minh City AI Challenge 2024, now in its fifth edition, focused on event retrieval from large video collections, advancing research and innovation in video analysis. The challenge featured a meticulously curated dataset and diverse query formats to evaluate system performance across realistic scenarios. Participant teams competed in multiple rounds, employing innovative approaches to solve complex queries involving temporal and semantic event understanding. Key strategies included advanced deep learning models, temporal segmentation, and multimodal fusion techniques. This paper outlines the challenge organization, dataset details, query types, team methodologies, and insights into common trends and solutions, fostering future advancements in event retrieval.

10:40
Fustar: Divide and Conquer Query in Video Retrieval System

ABSTRACT. Video retrieval is a crucial task as the volume of multimedia data grows rapidly, requiring advanced systems to handle complex and diverse queries. In this paper, we present Fustar, a video retrieval system developed for the Ho Chi Minh AI Challenge 2024. Fustar leverages features from a CLIP-based model, combined with OCR, ASR, and object-based detection, to support robust text-based search. We also implemented an improved version of the clustering algorithm to optimize the number of keyframes extracted. In addition, we propose two key innovations: fused search, which combines multiple features to improve accuracy and handle long queries, and temporal search, designed to process time-based video queries. Additionally, we integrate human feedback re-ranking and query refinement using large language models (LLMs) to enhance result relevance. Fustar's user-friendly interface ensures accessibility for non-technical users, offering a powerful yet simple tool for effective video retrieval.

11:00
NewsInsight2.0: An Enhanced Version Integrating Large Language Model-based Query Optimization with Advanced Temporal Mechanisms

ABSTRACT. We introduce NewsInsight2.0, a cutting-edge evolution of the NewsInsight system, specifically designed for the Ho Chi Minh AI Challenge 2024. Building on the strengths of NewsInsight, NewsInsight2.0 addresses previous performance limitations with significant enhancements. Our optimized architecture is tailored to deliver exceptional search capabilities for AIC 2024. At its core, NewsInsight2.0 leverages a CLIP (Contrastive Language-Image Pre-training) model trained on the DFN-5B dataset of five billion image-text pairs. Additionally, we have refined our temporal query mechanism with a more efficient algorithm and an intuitive user interface. Furthermore, NewsInsight2.0 features an automatic query generator powered by open-source large language models, streamlining the process of optimizing user input queries.

11:20
AViSearch: A Multimodal Video Event Retrieval System via Query Enhancement and Optimized Keyframes

ABSTRACT. Developing an efficient video event retrieval system is crucial due to the rapid growth of video-sharing platforms, resulting in a vast increase in the number and volume of video content. Traditional techniques used in previous VBS competitions, such as Optical Character Recognition (OCR) and Object Detection, primarily relied on metadata. However, current state-of-the-art methods utilize pre-trained models like CLIP and BLIP. These models effectively link textual queries with image data, but user queries are sometimes ambiguous, leading to possible mismatches between user expectations and the retrieved content. Moreover, many duplicate frames may not contribute new information when extracting keyframes, increasing search times and storage costs. To overcome these challenges, we present AViSearch, an advanced video event retrieval system developed for AI Challenge HCMC 2024. AViSearch leverages Gemini to optimize queries and generate more diverse queries from the user's original input, capturing user needs in more detail. We also optimize keyframes, reducing storage requirements and improving the efficiency of the search process. Additionally, our system incorporates various retrieval techniques, including Optical Character Recognition (OCR), Automatic Speech Recognition (ASR), object detection, and image similarity, providing a comprehensive toolkit for achieving more accurate and efficient search results.

11:40
An Optimized And Interactive Video Event Retrieval System With An Improved Temporal Algorithm

ABSTRACT. Video event retrieval is the process of identifying and extracting specific events or actions from a video collection based on a given query or description. Effective retrieval systems must adeptly manage the storage, indexing, searching, and delivery of such information. However, current approaches often focus on speed or accuracy, while user interactivity and scalability receive less attention. Therefore, this paper introduces PumpkinV2, an interactive video retrieval system with high precision and scalability. Featuring a visually balanced user interface, the system deploys a temporal-enabled visual-text association search pipeline with adaptive re-ranking methods. The retrieval results are enhanced by applying multiplicative and additive approaches in the integrated temporal algorithm. Furthermore, the system is built on top of a highly scalable vector database, combined with vector quantization, format-optimized data, and a production-grade gateway web server. PumpkinV2 achieved outstanding results at AI Challenge HCMC 2024, an annual video event retrieval competition, with 94 percent accuracy during the qualifying rounds, and ranked in the top 10 among finalists, proving its capability as a robust, scalable, and highly interactive system for video event retrieval.

10:20-12:00 Session 13D: Applied Operations Research and Optimization
Location: Son Tra
10:20
Constraint Programming-Based Cutting Plane Algorithm for a Combination of Orienteering and Maximum Capture Problem
PRESENTER: Hoang Giang Pham

ABSTRACT. In this paper, we study a new variant of the orienteering problem (OP) in which each vertex in the OP tour is a facility within a competitive market, where customer demand is predicted by a random utility choice model. Unlike prior research, which primarily focuses on simple objective functions such as maximizing a linear sum of the scores of selected vertices, we introduce a complicated non-linear objective function that necessitates the selection of locations to maximize a profit value such as expected customer demand or revenue. In our study, the routing constraints included in the form of the OP are handled by Constraint Programming (CP), and the non-linear objective function, resulting from the utilization of random utilities, is tackled by two types of valid cuts, namely outer-approximation and submodular cuts. This leads to the development of an exact solution method, Cutting Plane, where these valid cuts are iteratively added to a master problem. Extensive experiments are conducted on problem instances of varying sizes, demonstrating that our approach excels in terms of solution quality and computation time when compared to other baseline approaches.

10:40
Cost Optimization in Competitive Facility Location under General Demand Model
PRESENTER: Ba Luat Le

ABSTRACT. This work addresses a cost optimization problem in facility location where customer demand is modeled using the cross-nested logit model, one of the most flexible demand models in the literature. The objective is to maximize a captured demand function by allocating a fixed investment budget across a set of facilities, where the investment directly influences the demand captured by each facility. The resulting optimization problem involves exponential and fractional terms, leading to a highly nonlinear structure. To the best of our knowledge, no existing methods can solve this problem to near-optimality. To address this, we propose a piecewise linear approximation technique and apply variable transformations to approximate the problem (to any desired precision) as a mixed-integer convex program, which can be solved to optimality using an outer-approximation method. Extensive experiments on generated instances of varying sizes demonstrate the effectiveness of our proposed approach compared to standard baselines.
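
The piecewise-linear approximation step can be illustrated on a one-dimensional toy function: a finer breakpoint grid shrinks the approximation error, mirroring the any-desired-precision guarantee. The function, interval, and grid below are invented for illustration; the actual method applies this to the exponential and fractional terms after variable transformation.

```python
# Toy sketch of piecewise-linear approximation of a nonlinear term.
import numpy as np

f = lambda x: x / (1.0 + np.exp(-x))        # illustrative nonlinear function
breakpoints = np.linspace(0.0, 10.0, 21)    # 20 linear segments
values = f(breakpoints)

def pwl(x):
    # Evaluate the piecewise-linear surrogate by interpolation.
    return np.interp(x, breakpoints, values)

xs = np.linspace(0.0, 10.0, 1000)
print("max approximation error:", np.abs(f(xs) - pwl(xs)).max())
```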

11:00
Influence Maximization with Fairness Allocation Constraint

ABSTRACT. Motivated by practical applications from social influence and viral marketing, this work studies the problem of Influence Maximization with Fairness Allocation Constraint, which aims to find a set of $k$ users from groups in a social network with maximal influence spread so that the number of selected users in each group does not exceed the group budget. We propose an efficient and scalable approximation algorithm that returns an approximation ratio of $1/2-\epsilon$ and takes $O((m+\log(\frac{k}{\epsilon}))\frac{n}{\epsilon^2}(k \log n+\log(\frac{1}{\delta})))$ time, where $\epsilon$ is a constant, $n$ is the number of users, and $m$ is the number of links. Besides these theoretical results, extensive experiments conducted on real social networks show that our algorithm provides better solutions than cutting-edge methods.

11:20
A Reputation Scoring Framework for Lending Protocols using the PageRank Algorithm

ABSTRACT. Blockchain technology has revolutionized the financial sector by introducing decentralized finance (DeFi) as a powerful alternative to traditional banking systems. Among DeFi sectors, Lending has become a key area, facilitating cryptocurrency borrowing and lending without intermediaries. As of May 2024, Lending ranks second in Total Value Locked (TVL) within DeFi, reflecting its widespread adoption. Key entities in the Lending ecosystem, including Personal Wallets, Lending Smart Contracts, Protocol Supported Tokens, and Centralized Exchange, play a crucial role in shaping governance, decision-making, and resource allocation. However, current evaluation methods, which primarily rely on token holdings for governance voting, are vulnerable to manipulation and fail to accurately reflect contributions. To address this, we propose a novel scoring system that evaluates entities based on both token holdings and lending interactions over time. Our system uses the PageRank algorithm, scaled to the FICO score range, to offer a more stable and transparent assessment. This approach promotes healthy competition, encourages user activity, and supports the long-term growth and stability of the Lending DeFi ecosystem.
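
A toy sketch of the scoring idea: PageRank over a weighted interaction graph, rescaled to the FICO range (300-850). The graph, edge weights, and min-max scaling below are illustrative assumptions; the paper's actual graph construction and scaling may differ.

```python
# Hedged sketch: PageRank on wallet/contract interactions, scaled to FICO.
import networkx as nx

G = nx.DiGraph()
G.add_weighted_edges_from([
    ("wallet_A", "lending_pool", 5.0),   # deposits / repayments
    ("wallet_B", "lending_pool", 2.0),
    ("lending_pool", "wallet_A", 1.5),   # borrows flowing back
])

pr = nx.pagerank(G, alpha=0.85, weight="weight")
lo, hi = min(pr.values()), max(pr.values())
fico = {node: 300 + (score - lo) / (hi - lo) * (850 - 300)
        for node, score in pr.items()}
print(fico)
```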

11:40
A method combining the reference information of the adaptive adjustment method and the decision maker of multi-objective evolutionary algorithms

ABSTRACT. In practice, when using multi-objective optimization algorithms, people use reference information to search for desired solutions. However, including decision maker reference information can cause the evolutionary process to lose the balance between exploration and exploitation capabilities, thereby leading to missing good solutions or getting trapped in local optima. Recently, there have been many effective proposals to analyze trends and maintain this balance automatically. To simultaneously address decision maker desires and self-regulation capabilities, this paper proposes a method that combines decision maker information and adaptive control information, applied to DMEA-II and MOEA/D using reference points. The experimental results show a good balance in the use of the two types of reference information in the evolutionary process.

14:10-14:40 Session 15: Poster session IV
Flow Velocity Analysis of Rivers Using Farneback Optical Flow and STIV Techniques with Drone Data

ABSTRACT. This study focuses on the application of the Farneback optical flow method and Space-Time Image Velocimetry (STIV) for analyzing river flow velocities in upstream regions using drone-captured data. By integrating drone imagery with satellite data, we developed a high-precision model for predicting river flow, aiming to improve hydropower stability and flood control measures. The Farneback method enabled accurate flow velocity predictions by analyzing image sequences, while STIV was applied to obtain temporal and spatial flow measurements. Field experiments conducted in Japan, Sri Lanka, and Vietnam confirmed the effectiveness of these techniques, surpassing traditional flow measurement methods. The results showcase the potential of combining AI-driven image analysis and advanced drone technology for real-time river management, contributing to disaster risk mitigation and sustainable water resource management.
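
The Farneback step maps directly onto OpenCV; the sketch below computes dense flow between two frames (random arrays stand in for real drone imagery) and a mean displacement. Parameter values follow common OpenCV usage and are illustrative; converting pixel displacement to physical velocity additionally needs the ground sampling distance and frame interval from drone metadata.

```python
# Hedged sketch of Farneback dense optical flow on consecutive frames.
import cv2
import numpy as np

prev = np.random.randint(0, 255, (480, 640), dtype=np.uint8)  # stand-ins for
curr = np.random.randint(0, 255, (480, 640), dtype=np.uint8)  # real gray frames

flow = cv2.calcOpticalFlowFarneback(
    prev, curr, None,
    pyr_scale=0.5, levels=3, winsize=15,
    iterations=3, poly_n=5, poly_sigma=1.2, flags=0)

magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
# px/frame -> m/s requires ground sampling distance (m/px) and frame rate.
print("mean displacement (px/frame):", magnitude.mean())
```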

Faster, Larger, Stronger: Optimally Solving Employee Scheduling Problems with Graph Neural Networks

ABSTRACT. Employee planning is a complex task that affects the operations and profitability of many companies. It requires optimizing total operating costs and profitability while satisfying constraints such as employee availability, skills, and preferences, which oftentimes leads to the use of combinatorial optimization techniques, a powerful tool to solve this problem. In many real-world applications, traditional combinatorial optimization techniques rely on branch and bound, cutting plane, and local search techniques to find the optimal solution. However, these traditional techniques can be computationally expensive and are oftentimes not scalable for large-scale problem instances. To overcome these issues, we propose a new method which utilises the power of deep neural networks. In particular, we first convert the employee scheduling problem into a graph and build a novel graph neural network (GNN) to learn the optimal solution on the graphical representation of the problem. We evaluate the performance of our enhanced techniques on a large number of instances of employee scheduling problems, which show that our approach can significantly improve the performance of traditional combinatorial optimization techniques (approximately 20.86% to 46.79% compared to the state-of-the-art solver, CPLEX).

Advancing Geopolitical Map Analysis: An Intelligent System for Territorial Integrity Verification

ABSTRACT. Accurate cartographic representation of territorial sovereignty is crucial for geopolitical integrity, especially in historically complex regions. This paper presents an intelligent system for detecting island omissions, particularly of the Hoang Sa and Truong Sa archipelagos, on geographic maps. Addressing data-scarcity and map-diversity challenges, we propose a multi-faceted approach integrating web-crawling, map-detection, and island-identification techniques. Our novel deep-learning architecture incorporates a crawl service, a map-detection model, a Vietnam-map classifier, and parallel archipelago and nine-dash-line detection processes. We introduce the IslandMapVN dataset, comprising 3000 annotated map images, including 1200 Vietnam maps featuring the disputed archipelagos and 600 rare nine-dash-line images. This unique dataset significantly contributes to the geopolitical map-analysis field. Experimental results exhibit the superiority of the proposed method in comparison with competitive baselines, in both quantitative and qualitative terms.

Improving Human Action Recognition Using Quaternion Discrete Fourier Transform in Transfer Learning

ABSTRACT. Human action recognition is a critical field in artificial intelligence and computer vision, with wide-ranging applications such as healthcare, surveillance, and virtual reality. This study enhances action recognition performance by integrating transfer learning with the Quaternion Discrete Fourier Transform (QDFT), a novel approach that leverages the mathematical properties of quaternions for advanced signal processing. Using a subset of the UCF50 dataset, the study evaluates the effectiveness of this method across various actions, including BaseballPitch, Basketball, and Biking. The approach involves extracting features from pre-trained convolutional neural networks (CNNs) and applying the Fourier transform on quaternions to these features, which are then combined with those processed through fully connected layers. The experimental results demonstrate that incorporating QDFTs significantly improves the accuracy, precision, recall, and F1-scores of transfer learning models compared to conventional methods. Furthermore, comparative analysis with baseline models, such as MobileNet and EfficientNet, highlights the superiority of this hybrid approach. This research contributes to the field of action video recognition by providing a robust framework that can be adapted to various applications, with potential for further optimization and exploration across different datasets and recognition tasks.

Cardio Care: A Vision Transformer Cardiac Classification based on Electrocardiogram Images and Signals

ABSTRACT. The electrocardiogram is essential for evaluating cardiac function, and many artificial intelligence models for computerised interpretation have been developed. However, these are unsuitable for under-resourced communities that only have access to paper-based ECG reports. To overcome this disadvantage, we propose Cardio Care, a mobile platform that processes electrocardiogram image and signal inputs for abnormality detection, utilising a Vision Transformer to improve image recognition and make the platform suitable for wider applications. We use three datasets (including public and local datasets) with varying sample sizes and input types to reflect the data in real-world settings. The results demonstrate consistent performance across all datasets and highlight the promising application of Cardio Care in assisting cardiologists in remote and resource-limited healthcare facilities, with average macro F1 scores of 65, 99, and 82 on the CPSC, Mendeley, and Tam Duc datasets, respectively. This research proposes an alternative approach toward a preprocessing pipeline for ECG signals and images fed into a Vision Transformer-based deep learning network, aiming to improve healthcare access for under-resourced and under-served communities.

A Tool for Preventing Consanguineous Marriages Using Vietnam's National Residents Database

ABSTRACT. Consanguineous marriage is defined as a marriage between a male and female within the same family lineage, not exceeding three generations. Such marriages facilitate the development and manifestation of a range of hereditary diseases caused by recessive genes on chromosomes, which can become apparent in subsequent generations. In Vietnam, consanguineous marriages remain common, particularly among ethnic minorities. Fortunately, Vietnam has collected resident data on 104 million citizens, which has been utilized for economic and social development purposes. This study aims to propose tools for preventing consanguineous marriages by checking whether two individuals planning to marry are related within three generations before registering their marriage, using the national resident database. The study proposes three methods based on three different data structures (simple graph, balanced tree, and hash table). Theoretical and experimental analyses demonstrate that the method based on the hash table provides the fastest verification of whether two individuals planning to marry are related within three generations, outperforming the current methods.
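
A minimal sketch of the hash-table variant as described: parent links stored in a dictionary keyed by citizen ID, ancestor sets expanded for three generations, and an intersection test. The IDs and the relatedness rule shown are illustrative simplifications, not the national database's actual schema.

```python
# Hedged sketch of the hash-table method for three-generation kinship checks.
PARENTS = {  # citizen_id -> (father_id, mother_id); None if unknown
    "C1": ("F1", "M1"), "C2": ("F2", "M1"),
    "F1": (None, None), "F2": (None, None), "M1": ("GF", "GM"),
    "GF": (None, None), "GM": (None, None),
}

def ancestors(cid, generations=3):
    found, frontier = set(), {cid}
    for _ in range(generations):
        frontier = {p for c in frontier
                    for p in PARENTS.get(c, (None, None)) if p}
        found |= frontier
    return found

def related_within_three_generations(a, b):
    # Shared ancestor (or ancestor-descendant link) within 3 generations.
    anc_a, anc_b = ancestors(a) | {a}, ancestors(b) | {b}
    return bool(anc_a & anc_b)

print(related_within_three_generations("C1", "C2"))  # True: shared mother M1
```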

Optimizing Smart Grids with Reinforcement Learning for Enhanced Energy Efficiency

ABSTRACT. The transition from traditional, centralized electric grids to smart grids offers numerous opportunities for optimizing energy distribution and consumption. This paper presents a reinforcement learning-based approach for load scheduling in smart grids, aiming to reduce energy loss and enhance grid reliability. By leveraging consumer preferences, the proposed system schedules loads efficiently, thereby minimizing energy loss in transmission lines and reducing peak loads. Our results, tested on simulated grid environments of varying scales, demonstrate significant improvements in energy efficiency, suggesting that reinforcement learning can play a crucial role in the future of smart grid management.

Benchmarking Real-Time Object Detection: Evaluating YOLO and RT-DETR on Speed, Accuracy, and Efficiency
PRESENTER: Cao Vu Bui

ABSTRACT. Real-time object detection is indispensable in applications like autonomous vehicles, robotics, and surveillance, where both high accuracy and efficiency are necessary. This paper investigates and benchmarks state-of-the-art models, including YOLO and RT-DETR, on the Pascal VOC dataset using the most important performance metrics: accuracy (mAP), model size, and inference speed (FPS). The results show that YOLOv8x achieves the highest accuracy, with an mAP@50:95 of 0.480 at 133 FPS, while RT-DETRv1 (r50vd), although slightly lower in accuracy with an mAP@50:95 of 0.475, reaches 100 FPS and offers a competitive trade-off in terms of model complexity. By contrast, RT-DETRv2 (r34vd) is the smallest model, with an mAP@50:95 of 0.468 and an FPS of 118, making it more adequate for resource-constrained environments. Our results point directly to the trade-offs between accuracy, model complexity, and speed, providing practical observations for deploying such models in real-time systems. This research provides a useful reference for model selection, taking specific performance requirements into account while balancing accuracy, speed, and resource efficiency.

Progressive Retention Sampling for Sequence Generation-based Scene Text Spotting

ABSTRACT. Sequence generation models have demonstrated promising results in scene text spotting. However, these models face a discrepancy between training and inference phases: during training, ground-truth sequences are provided as input, whereas during inference, this input is replaced by the model’s own predictions. While current sampling strategies have effectively mitigated this discrepancy in various sequence generation tasks such as image captioning and machine translation, they are not directly applicable to scene text spotting due to its unique sequence structure. This paper introduces Progressive Retention Sampling, a novel sampling strategy tailored specifically for sequence generation-based scene text spotting. We evaluate our approach using two scene text spotting models, UNITS and SPTS, conducting experiments on the ICDAR 2015, Total-Text, and VinText datasets. Our results demonstrate that the proposed method outperforms both baselines and conventional sampling strategies. The implementation is publicly available at this GitHub repository.

Development of an Edge-Computing-Based Intelligent Service Framework for Smart Camera Applications

ABSTRACT. Advancements in the Internet of Things (IoT), Artificial Intelligence (AI), and wireless networks have fueled the growth of Artificial Intelligence of Things (AIoT) applications at the edge, addressing challenges such as power consumption, bandwidth constraints, and response latency. Our previous studies enhanced the management of edge AIoT applications, particularly for smart cameras. However, developing these systems remains challenging due to the lack of a unified platform, resulting in high complexity, inconsistency, and increased costs driven by limited code reuse and fragmented development processes. To address these challenges, this paper introduces an edge-computing-based intelligent service framework for smart camera applications. The framework provides a unified platform that enhances consistency, facilitates code reuse, and reduces costs and deployment time. It employs a microservices architecture deployed on a virtualization platform with container orchestration at the edge, allowing customization to meet specific application needs. The system also supports extended services, including cluster management, system monitoring, and DevOps practices for secure continuous integration and continuous deployment (CI/CD). This approach improves flexibility and efficiency in the development, deployment, management, and monitoring of smart camera applications at the edge. Experimental use cases, including face mask recognition and helmet detection applications, validate the proposed system's effectiveness.

MedCapNet: A Novel Approach to Medical Image Captioning

ABSTRACT. Medical image captioning is crucial for automating the generation of accurate textual descriptions for medical images. This paper introduces MedCapNet, a novel encoder-decoder architecture designed to address the challenges associated with this task. The model incorporates a Swin Transformer and Enhancement Encoder, allowing for the efficient extraction and refinement of both patch-level and global-level features from medical images. A Transformer block with a Fusion Module is utilized by the decoder to seamlessly integrate visual and linguistic information. A key innovation is Dual-Scale Masked Multi-Head Self-Attention, which enhances the model's ability to effectively capture long-range dependencies and fine-grained details. Our model was evaluated on ROCO v2, achieving state-of-the-art performance with scores of 0.647, 0.239, and 0.094 for BERTScore, CIDEr, and METEOR, respectively. Our work contributes to the field by introducing a robust architecture that effectively bridges the gap between visual and textual modalities in medical imaging.

Contrastive Perturbation Enhancement for LLM-Based Machine Translation

ABSTRACT. Large language models have been increasingly effective in various NLP tasks, especially machine translation. However, these models require a lot of computational resources and need to be further fine-tuned on specific training data to achieve better performance. Medium-sized language models often show significantly poorer performance than large language models. Therefore, it is necessary to study methods to solve this problem. In this paper, we propose a method to fine-tune a language model with a size of several billion parameters based on instructions from a large language model such as GPT-4 through a contrastive learning technique, called CoPE - Contrastive Perturbation Enhancement for LLM-Based Machine Translation. Our proposal consists of three stages: fine-tuning the language model on a parallel dataset; generating entailments as positive and contradictions as negative examples from the training dataset based on a high-performance large language model such as GPT-4; and then using these examples to improve the model through contrastive learning. These examples are evaluated and ranked to increase the influence of high-quality examples. Experimental results show that our proposal, with LLaMA-3.1 8B as the base model, achieves a 35.99 BLEU score, an 85.28 COMET-22 score, and an 88.90 XCOMET score on the WMT’21 and WMT’22 datasets. This result is competitive with models such as ALMA-13B-R, trained with the contrastive preference optimization technique, and is higher than the GPT-3.5 model.

Traffic Anomaly Detection under Extreme Weather from Aerial Images

ABSTRACT. Video anomaly detection is a crucial task in surveillance systems, significantly enhancing the safety and security of city dwellers. It has attracted considerable interest from researchers in computer vision, machine learning, cyber security, remote sensing, and data mining. However, current state-of-the-art methods for video anomaly detection still encounter significant challenges in extreme weather conditions such as rain, fog, snow, floods, and thunderstorms. These conditions introduce considerable noise and complicate the detection of abnormal events in real-world scenarios, particularly in traffic surveillance videos. Therefore, this study investigates the impact of varying weather conditions on the performance of prominent methods for detecting traffic anomalies in aerial videos. We perform extensive experiments utilizing six state-of-the-art methods on the standard benchmark dataset in aerial video surveillance, UIT-ADrone. In addition, we provide an in-depth analysis of the practical challenges posed by adverse conditions, including rain and snow. This analysis aims to elucidate the complex scene contexts that may hinder the efficacy of current methods in high-altitude drone videos for traffic surveillance.

URAG: Implementing a Unified Hybrid RAG for Precise Answers in University Admission Chatbots - A Case Study at HCMUT

ABSTRACT. With the rapid advancement of Artificial Intelligence, particularly in Natural Language Processing, Large Language Models (LLMs) have become pivotal in educational question-answering systems, especially university admission chatbots. Concepts such as Retrieval-Augmented Generation (RAG) and other advanced techniques have been developed to enhance these systems by integrating specific university data, enabling LLMs to provide informed responses on admissions and academic counseling. However, these enhanced RAG techniques often involve high operational costs and require the training of complex, specialized modules, which poses challenges for practical deployment. Additionally, in the educational context, it is crucial to provide accurate answers to prevent misinformation, a task that LLM-based systems find challenging without appropriate strategies and methods. In this paper, we introduce the Unified RAG (URAG) Framework, a hybrid approach that significantly improves the accuracy of responses, particularly for critical queries. Experimental results demonstrate that URAG enhances our in-house, lightweight model to perform comparably to state-of-the-art commercial models. Moreover, to validate its practical applicability, we conducted a case study at our educational institution, which received positive feedback and acclaim. This study not only proves the effectiveness of URAG but also highlights its feasibility for real-world implementation in educational settings.
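
A two-tier router captures the spirit of such a hybrid design. The sketch below is our own illustration under stated assumptions, not URAG's implementation: critical queries are answered verbatim from a verified FAQ when the match is confident, and everything else falls back to dense retrieval plus LLM generation (the model name and threshold are placeholders).

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def route(query, faq_pairs, documents, faq_threshold=0.85):
    """faq_pairs: list of (question, curated_answer); documents: list of str."""
    q = model.encode(query, convert_to_tensor=True)
    faq_scores = util.cos_sim(q, model.encode([fq for fq, _ in faq_pairs],
                                              convert_to_tensor=True))[0]
    best = int(faq_scores.argmax())
    if float(faq_scores[best]) >= faq_threshold:
        return faq_pairs[best][1]          # curated answer, no hallucination risk
    doc_scores = util.cos_sim(q, model.encode(documents, convert_to_tensor=True))[0]
    context = documents[int(doc_scores.argmax())]
    return f"[generate answer with LLM using retrieved context]\n{context}"
```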

EPC-YOLOv7: The Proposed One-stage Detector for Aerial Scenario Detection

ABSTRACT. Aerial scenario detection has always been an interesting challenge due to its varied object scales and complex backgrounds, which make it hard for a model to cover all shapes and patterns and to recognize labels for detected objects. In this study, we contribute an extended network based on the FasterNet block that can be integrated into the ELAN backbone of the YOLOv7 architecture without harming final performance or requiring any adjustment of the model's width and depth scales, while still reducing parameter count and computation and increasing inference speed. Moreover, the proposed architecture is combined with a Bi-directional Feature Pyramid Network to enhance detection capability, allowing the architecture to detect more small objects. As a result, our proposed method reached 37.8% mAP50 on the VisDrone2019 test-dev set. On the DIOR test set, the proposed model beat the YOLOv7 baseline in both performance and complexity.
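
The FasterNet building block rests on partial convolution, which convolves only a slice of the channels and passes the rest through unchanged, cutting FLOPs and memory access. A minimal PyTorch version (layer sizes are illustrative, not the paper's configuration):

```python
import torch
import torch.nn as nn

class PartialConv(nn.Module):
    """FasterNet-style partial convolution: a 3x3 conv over the first
    1/ratio of the channels; the remaining channels are passed through."""
    def __init__(self, channels, ratio=4):
        super().__init__()
        self.conv_ch = channels // ratio
        self.conv = nn.Conv2d(self.conv_ch, self.conv_ch, 3, padding=1)

    def forward(self, x):
        head, tail = x[:, :self.conv_ch], x[:, self.conv_ch:]
        return torch.cat([self.conv(head), tail], dim=1)

x = torch.randn(1, 64, 80, 80)
y = PartialConv(64)(x)   # same shape, ~1/16 of a full conv's multiply-adds
```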

A Low-Cost EEG-Based System for Measuring and Forecasting Levels of Alertness with Long Short-Term Memory

ABSTRACT. In this paper, we propose a practical and cost-effective system for real-time tracking and prediction of alertness levels. Instead of relying on multi-electrode sensors that require complex setups and may cause discomfort, our system uses a compact, single-electrode sensor to capture EEG data. This data is then analyzed by various machine learning models to calculate an Awake Score for each user. The Awake Score is also used as input to a forecasting model, which predicts the user's alertness trends in advance. The forecasting model leverages Long Short-Term Memory (LSTM), an advanced deep learning model, to handle the EEG data and detect intricate temporal patterns in brain activity. Furthermore, we optimize the system to function with minimal electrodes while maintaining high predictive accuracy, providing a feasible solution for real-time detection of fatigue and cognitive load.
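
A minimal LSTM forecaster over the Awake Score series might look as follows; the layer sizes and the 30-step window are our assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class AwakeScoreForecaster(nn.Module):
    """Maps a window of past Awake Scores to a prediction of the next one."""
    def __init__(self, hidden_size=64, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size,
                            num_layers=num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):             # x: (batch, window, 1)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])  # next Awake Score, (batch, 1)

model = AwakeScoreForecaster()
window = torch.randn(8, 30, 1)        # 8 users, 30 past scores each
next_score = model(window)
```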

Real-Time Multi-Face Emotion Recognition for Enhancing Student Engagement in Classroom Environments Using Low-Power IoT Devices

ABSTRACT. This paper explores the role of emotion recognition in improving student learning by monitoring psychological states in real time. We introduce a multi-face automated emotion recognition system that analyzes the emotions of multiple students simultaneously, evaluated on the Emotion PTIT dataset of students in a classroom, manually labeled with 1,500 images for the face recognition task and 71 videos for the classroom detection task. To enhance classroom dynamics, we propose a new formula for evaluating engagement based on emotional data. The system, designed for low-power IoT devices, addresses challenges like load balancing and latency, processing live video feeds to assess classroom interactions. Our prototype can detect up to 50 faces at 25 FPS with an accuracy of 88.68%.

MEPC: Multi-level Product Category Recognition Image Dataset

ABSTRACT. Multi-level product category prediction is a key problem for businesses running online retail systems. Accurate multi-level prediction spares sellers from filling in product category information, saving time and reducing the cost of listing products online. It remains an open research problem that continues to attract researchers. Deep learning techniques have shown promising results for category recognition, and a neat, clean dataset is an elementary requirement for building accurate and robust deep learning models for category prediction. In this article, we introduce MEPC, a new multi-level product image dataset containing more than 164,000 images in processed format. We evaluate MEPC with popular deep learning models; benchmarking yields a top-1 accuracy of 92.055% with 10 classes and a top-5 accuracy of 57.36% with 1,000 classes. The dataset is well suited for training, validating, and testing hierarchical image classification models that predict multi-level categories in online retail systems. Data and code will be released at https://huggingface.co/datasets/sherlockvn/MEPC .
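
For reference, the reported top-1 and top-5 figures correspond to the standard top-k accuracy metric, sketched here in PyTorch:

```python
import torch

def top_k_accuracy(logits, targets, k=5):
    """Fraction of samples whose true class is among the k highest logits.
    logits: (batch, n_classes); targets: (batch,) integer class ids."""
    topk = logits.topk(k, dim=1).indices           # (batch, k)
    hits = (topk == targets.unsqueeze(1)).any(1)   # (batch,)
    return hits.float().mean().item()
```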

14:40-16:00 Session 16A: Secured and Intelligent Multimedia Systems
Location: Danang 1
14:40
EPEdit: Redefining Image Editing with Generative AI and User-Centric Design

ABSTRACT. The demand for image manipulation has increased significantly in recent years. Traditional tools like Photoshop and Capture One, while powerful, require considerable expertise to use effectively. Generative AI has introduced alternative platforms, such as Luminar Neo, Pixlr X, and Canva. However, many of these solutions, including resource-heavy models like Stable Diffusion, often require substantial retraining and fine-tuning, leading to high costs for users. To address these challenges, we introduce Efficient Photo Editor (EPEdit), an application that integrates a robust backend framework with a user-friendly front-end interface. EPEdit supports a wide range of creative image editing tasks, including image generation, object replacement, object removal, background modification, changes in object pose or perspective, region-specific editing, and thematic collection design, all guided by masks and prompts. Users can interact with the system through simple text commands or by marking areas for precise adjustments, making it accessible even to those without technical expertise. At its core, EPEdit leverages zero-shot image editing algorithms based on the Stable Diffusion model, removing the need for additional fine-tuning and enabling efficient image manipulation and thematic collection creation. User evaluations covering image editing, thematic design, and overall system performance show that EPEdit outperforms existing solutions as a user-friendly, cost-effective tool for comprehensive image editing.
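
Mask-and-prompt editing without fine-tuning can be illustrated with an off-the-shelf Stable Diffusion inpainting pipeline from the diffusers library. This is a generic sketch, not EPEdit's actual backend; the checkpoint and file names are placeholders.

```python
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

# Frozen inpainting model: the prompt rewrites only the masked region.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("photo.png").convert("RGB").resize((512, 512))
mask = Image.open("mask.png").convert("L").resize((512, 512))  # white = edit here

edited = pipe(prompt="a wooden bench in a park",
              image=image, mask_image=mask).images[0]
edited.save("edited.png")
```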

15:00
A Simple Approach towards Frame Filtering for Efficient Gaussian Splatting

ABSTRACT. Neural rendering has established itself as the state-of-the-art approach for scene reconstruction and novel view synthesis (NVS) tasks. However, its heavy reliance on precise camera poses presents a significant limitation. Since 2023, Gaussian Splatting (GS) has emerged as a promising approach for volumetric rendering, gaining traction in the 3D computer vision and graphics community due to its efficiency and real-time rendering capabilities. While COLMAP-free GS methods have been proposed to address camera pose dependency, they often struggle with "useless frames" - frames that do not introduce information gain about the rendered surface and/or have low resolution - leading to slower reconstruction and inefficient use of computational resources, potentially causing out-of-memory issues on mid-tier machines that lack extraordinary computational power. To address these challenges, we propose a frame filtering method for efficient NVS based on COLMAP-free GS. Our approach enables scene reconstruction under computational resource constraints while maintaining high rendering quality. Experimental results demonstrate that our method achieves a 25% reduction in GPU VRAM usage and a 20% decrease in training time for scene reconstruction, offering a more efficient solution for NVS tasks.
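
A crude stand-in for such a filter (the thresholds are illustrative and the paper's information-gain criterion is more involved) drops blurry frames and frames that add little over the last kept one:

```python
import cv2

def keep_frame(prev_gray, gray, blur_thresh=100.0, change_thresh=0.05):
    """Reject a frame if it is blurry (low Laplacian variance) or nearly
    identical to the previously kept frame (low mean absolute difference)."""
    if cv2.Laplacian(gray, cv2.CV_64F).var() < blur_thresh:
        return False
    if prev_gray is not None:
        if cv2.absdiff(gray, prev_gray).mean() / 255.0 < change_thresh:
            return False
    return True

cap = cv2.VideoCapture("scene.mp4")
kept, prev = [], None
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    if keep_frame(prev, gray):
        kept.append(frame)   # only these frames feed the GS optimizer
        prev = gray
```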

15:20
CESE: A Clip-based Event Search Engine for AI Challenge HCMC 2024

ABSTRACT. In the context of video event retrieval, identifying specific events in vast visual data collections is a critical challenge. This paper presents CESE, our solution for the AI Challenge HCMC 2024, which focuses on retrieving video clips based on multiple queries rather than individual frames based on a single query. CESE employs a dual-module architecture: FrameClipping and TextClipping, designed to address visual and textual data, respectively. FrameClipping uses a Visual Semantic Clipping algorithm based on keyframe extraction and visual embeddings, while TextClipping introduces both Textual Semantic Clipping for sentence-based queries and Keywords Clipping for keyword-only queries. The system integrates a lightweight keyframe extraction pipeline, optimizing the processing of large video datasets. We also leverage Gemini 1.5 Flash for context-aware caption generation, ensuring detailed and coherent event descriptions. Additionally, CESE offers a keyboard-driven interface for faster, more efficient retrieval by minimizing mouse interactions. Our approach significantly improves retrieval accuracy and user experience by returning meaningful video segments that encapsulate the described events. CESE represents a scalable and efficient solution for multimodal event search in competitive settings.

15:40
GeoSI: An Interesting Interactive System for Retrieving and Mapping News from Multiple Online Sources

ABSTRACT. In the era of Industry 4.0, the ability to process and extract valuable insights from vast amounts of online information has become crucial for users to stay informed with timely and relevant news. This paper introduces GeoSI, an interactive intelligent system designed to assist users in exploring and summarizing real-time information from online sources. GeoSI enables users to ask questions, delve into specific URLs, and visualize data geographically on a world map. The system not only provides in-depth analysis of individual countries but also evaluates the sentiment of the information (positive or negative), offering users a comprehensive and accurate overview of global events. GeoSI is built on OpenAI's GPT-4o mini API for intelligent question answering, combined with Selenium and Google Search to crawl and retrieve relevant articles and related content. By leveraging these technologies, GeoSI offers an interactive experience, providing updated information while organizing it by geographic location, making it easier to interpret trends, strengths, and challenges at the national level. GeoSI addresses three main challenges: real-time interaction with fresh online content, deep exploration of specific websites, and summarization of information by country. The effectiveness of the system will be evaluated through user experience in real-world scenarios.

14:40-16:00 Session 16B: Lifelog and Multimedia Event Retrieval
Location: Danang 2
14:40
Addressing Ambiguous Queries in Video Retrieval with Advanced Temporal Search

ABSTRACT. The increasing volume of multimedia content has intensified the demand for video retrieval systems that can efficiently and accurately extract relevant information from large-scale archives. However, existing methods frequently encounter challenges when dealing with ambiguous queries, particularly those involving complex temporal relationships, often leading to incomplete or suboptimal retrieval results. To address these limitations, we propose a novel multimodal video retrieval system designed to handle a wide range of query types by integrating outputs from multiple search models. A central feature of the system is its advanced temporal search mechanism, which improves ambiguity resolution by conducting additional searches within adjacent video shots, rather than relying solely on chronological order. The effectiveness of the proposed system is demonstrated through its performance in the 2024 Ho Chi Minh AI Challenge.
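
One simple realization of adjacent-shot temporal search, shown here under our own assumptions rather than as the paper's algorithm, scores each shot by combining its match to the first sub-query with the best match to the second sub-query among the next few shots:

```python
import numpy as np

def temporal_search(scores_a, scores_b, window=3):
    """Rank shots for a two-part query: a shot's score is its similarity to
    sub-query A plus the best similarity to sub-query B within the next
    `window` shots, so near-misses in ordering still surface."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    combined = np.full(len(a), -np.inf)
    for i in range(len(a) - 1):
        j_end = min(len(a), i + 1 + window)
        combined[i] = a[i] + b[i + 1:j_end].max()
    return np.argsort(-combined)   # shot indices, best first
```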

15:00
SnapSeek: A Multimodal Video Retrieval System with Context Awareness for AI Challenge 2024

ABSTRACT. Retrieving news information is a complex task that demands considerable effort. The AI Challenge stands as one of the pioneering competitions in the nation, initiating exploration in this domain. This study presents the SnapSeek system, a notable entry in this year's competition. Utilizing advanced tools such as Milvus and Elasticsearch, SnapSeek facilitates searches across various data types, particularly emphasizing vector embeddings and metadata in textual or encoded forms. Building upon the initial SnapSeek version introduced at LSC 2024, this iteration incorporates significant enhancements, including dataset expansion, metadata enrichment, a magic brush feature, temporal search capabilities, human feedback, and contextual news extraction. These advancements collectively enhance the efficiency of information retrieval. Furthermore, SnapSeek offers a minimalist and user-friendly interface, rendering it an effective and appropriate system for retrieving news information.

15:20
ArtemisSearch: A Multimodal Search Engine for Efficient Video Lifelog Event Retrieval Using Time-Segmented Queries and Vision Transformer-based Feature Extraction

ABSTRACT. In this century, search engines have emerged as a crucial component of the technological landscape. Enterprises require a search engine to retrieve specific information within a particular field. However, they face various challenges due to the rapidly increasing volume of data and the need for effective database management to handle diverse data types. Additionally, the search for data is hindered by difficulties in matching queries with keyframes and by limitations in understanding query context. In this paper, we introduce ArtemisSearch, a text-based multimodal search engine designed for temporal event retrieval in videos. The proposed system combines an efficient algorithm for Content-Based Image Retrieval (CBIR), using ViT-H/14 and BEiT3 for feature extraction, with Milvus, an open-source vector database; it retrieves events efficiently by leveraging temporal segmentation of queries and embedding matching for Artificial Intelligence (AI) applications. Additionally, we developed a web application that allows end users to easily create temporally aware descriptive queries, efficiently explore top results, and view precise video previews at relevant timestamps. ArtemisSearch represents a significant advancement in temporal video retrieval, with potential applications across diverse fields, leading to a smoother and more accurate video search experience.

15:40
KPI: Knowledge-based Processing for Interactive Video Retrieval

ABSTRACT. The expansion of internet technologies has led to challenges in managing and retrieving vast amounts of video content, causing information overload. Traditional search systems struggle to rank results based on user intent, prompting the research community to improve retrieval methods, notably through challenges like the Ho Chi Minh AI City Challenge. In this paper, we introduce KPI: Knowledge-based Processing for Interactive Video Retrieval, a novel system that enhances multimedia retrieval efficiency and accuracy. The system integrates text, automatic speech recognition (ASR), optical character recognition (OCR), and temporal retrieval, enabling time-based segment searches. In addition, we develop an improved keyframe selection method and incorporate a dominant-color search strategy, all supported by a user feedback mechanism and a user-centric interface. Our system competed in the 2024 Ho Chi Minh AI City Challenge and achieved a competitive result.

14:40-16:00 Session 16C: Human Computer Interaction
Location: Danang 3
14:40
MRClassroom: A Mixed-Reality Interface for Improving Remote Students' Presence in Hybrid Classrooms

ABSTRACT. Hybrid classrooms, where remote and on-site students attend lectures together, often rely on video-conferencing platforms as the primary communication tool. However, these platforms typically limit remote students to a fixed view, preventing them from gaining a holistic perception of the physical classroom. Additionally, remote students' presence is often overlooked in interactions between the teacher and on-site students, potentially leading to disengagement. To address these challenges, we propose MRClassroom, a mixed-reality (MR) system designed to enhance remote students' presence by spatially integrating their representation into the physical classroom. MRClassroom features a dynamically constructed 3D replica of the classroom, providing remote students with a sense of co-location with their on-site peers. Using an MR headset, teachers can interact with remote students as naturally as with those on-site, while maintaining the instructional flow. Preliminary studies suggest the system improves student engagement and classroom dynamics.

15:00
Multi-Agent Chatbot for Efficient Interaction with Blockchain APIs

ABSTRACT. This paper addresses the challenge of enabling end users to interact directly with various APIs that provide real-time data and perform actions, eliminating the need for programming or developing intermediate applications. Direct interaction with APIs offers significant benefits, including increased accessibility, reduced development time, and enhanced flexibility in meeting diverse user needs. We propose a chatbot solution that leverages Large Language Models (LLMs) to facilitate these interactions. The advanced language understanding and reasoning capabilities of LLMs underpin our approach, addressing challenges such as precise query refinement, planning the selection of one or more APIs, and extracting parameters from user queries for API input. Our novel architecture integrates question refinement, entity recognition, and API filtering modules, supported by a multi-agent chatbot system that plans and evaluates API usage. This multi-agent system operates as a collaborative team of specialized experts, iteratively handling complex queries. Experimental results demonstrate that our system achieves a 91.95% accuracy rate with minimal response time. This approach simplifies the development of chatbots across various domains by leveraging available APIs, making it easier to build sophisticated, context-aware systems.

15:20
Evaluation of AI-Based Assistant Representations on User Interaction in Virtual Explorations

ABSTRACT. Exploration activities, such as tourism, cultural heritage, and science, enhance knowledge and understanding. The rise of 360-degree videos allows users to explore cultural landmarks and destinations remotely. While multi-user VR environments encourage collaboration, single-user experiences often lack social interaction. Generative AI, particularly Large Language Models (LLMs), offers a way to improve single-user VR exploration through AI-driven virtual assistants acting as tour guides or storytellers. However, it is unclear whether these assistants need visual representation, and if so, in what form. We developed an AI-based assistant in three forms - voice-only, 3D human-sized avatar, and mini-hologram avatar - and conducted a user study to assess their impact on user experience.

15:40
A Novel Simulation-Driven Data Enrichment Approach to Improve Machine Learning Algorithm Performance

ABSTRACT. This study presents a novel framework for developing machine learning algorithms by integrating the Robot Operating System (ROS) with Unity for UAV research. The framework leverages ROS's robust algorithm libraries and multi-language support, combined with Unity's capacity to create realistic simulations, allowing for comprehensive model testing without relying on physical experiments. It provides a high-fidelity simulation environment replicating real-world scenarios, enabling validation of UAVs in complex conditions that are difficult to replicate physically. Additionally, the framework offers an innovative approach to generating enriched datasets by capturing object data from various angles and incorporating contextual information, enhancing models' object detection in diverse scenarios. To validate its capabilities, we conducted two case studies: the first targets victim detection in rescue operations, showing that our generated dataset poses a higher challenge than COCO and PASCAL VOC when applied to multiple models. The second assesses UAV performance in obstacle detection, collision avoidance, and navigation after training on our dataset. The findings demonstrate that this framework accelerates AI model development and serves as a reliable platform for validating UAV operations, making it a valuable asset for advancing UAV research.

14:40-16:00 Session 16D: Applied Operations Research and Optimization
Location: Son Tra
14:40
Exemplar-Embed Complex Matrix Factorization with Elastic Net Penalty: An Advanced Approach for Data Representation
PRESENTER: Manh Quan Bui

ABSTRACT. This paper presents an advanced method for complex matrix factorization, termed exemplar-embed complex matrix factorization with elastic net penalty (ENEE-CMF). The proposed ENEE-CMF integrates both L1 and L2 regularizations on the encoding matrix to enhance the sparsity and effectiveness of the projection matrix. Utilizing Wirtinger's calculus for differentiating real-valued functions of complex variables, ENEE-CMF efficiently addresses complex optimization challenges through gradient descent, enabling more precise adjustments during factorization. Experimental evaluations on a facial expression recognition task demonstrate that ENEE-CMF significantly outperforms traditional non-negative matrix factorization (NMF) and similar complex matrix factorization (CMF) models, achieving superior recognition accuracy. These findings highlight the benefits of incorporating elastic net regularization into complex matrix factorization for handling challenging recognition tasks.
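
A plausible form of the objective, written from the abstract's description (the paper's exact formulation may differ), combines a complex reconstruction term with the two elastic-net penalties on the encoding matrix W:

```latex
\min_{\mathbf{W}\in\mathbb{C}^{k\times n}}\;
  f(\mathbf{W}) =
  \lVert \mathbf{X} - \mathbf{E}\mathbf{W} \rVert_F^2
  \;+\; \lambda_1 \lVert \mathbf{W} \rVert_1
  \;+\; \lambda_2 \lVert \mathbf{W} \rVert_F^2,
\qquad
\mathbf{W} \leftarrow \mathbf{W} - \eta\,
  \frac{\partial f}{\partial \overline{\mathbf{W}}}
```

Here E embeds the exemplars, and because f is real-valued while W is complex, the descent direction is the conjugate cogradient from Wirtinger's calculus rather than an ordinary derivative.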

15:00
Modeling Information Diffusion in Bibliographic Networks using Pretopology

ABSTRACT. In this research, we propose a novel approach to modeling information diffusion on bibliographic networks using pretopology theory. We propose the pretopological independent cascade model, Preto_IC, a variation of the independent cascade (IC) model. We apply pretopology to model the structure of heterogeneous bibliographic networks, since it is a powerful mathematical tool for complex network analysis. The highlights of Preto_IC are that the propagation process is simulated over multiple relations and that the concept of elementary closed subsets is applied to select the seed set. In the first step, we construct a pretopological space to represent a heterogeneous bibliographic network; in this space, we define a strong pseudo-closure function to capture the neighborhood set of a set A. Next, we propose a new method to choose the seed set based on the elementary closed subsets. Finally, we simulate Preto_IC with the seed set from step (2); at each propagation step t, the neighborhood set for infection is determined by the pseudo-closure function defined in step (1). We experiment on three real datasets and demonstrate the effectiveness of Preto_IC compared with the IC model under existing seed set selection methods.
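
To make the pseudo-closure idea concrete, here is one illustrative variant of our own devising (the paper defines its own strong pseudo-closure): a node joins the closure of A if it is linked to A under at least k of the network's relations.

```python
def strong_pseudo_closure(A, relations, k=2):
    """Illustrative pseudo-closure over a multi-relational network.
    relations: maps a relation name (e.g. co-authorship, citation) to an
    adjacency dict node -> set of neighbors. Returns A grown by every node
    connected to A through at least k relations."""
    closure = set(A)
    candidates = set().union(*(adj.keys() for adj in relations.values()))
    for x in candidates - closure:
        hits = sum(1 for adj in relations.values()
                   if adj.get(x, set()) & closure)
        if hits >= k:
            closure.add(x)
    return closure
```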

15:20
Optimizing Credit Scoring Models for Decentralized Financial Applications

ABSTRACT. Decentralized Finance (DeFi), a rapidly evolving ecosystem of blockchain-based financial applications, has attracted substantial capital in recent years. Lending protocols, which provide deposit and loan services similar to traditional banking, are central to DeFi. However, the lack of credit scoring in these protocols creates several challenges. Without accurate risk assessment, lending protocols impose higher interest rates to offset potential losses, negatively affecting both borrowers and lenders. Furthermore, the absence of credit scoring reduces transparency and fairness, treating all borrowers equally regardless of their credit history, discouraging responsible financial behavior and hindering sustainable growth. This paper introduces credit scoring models for crypto wallets in DeFi. Our contributions include: (1) developing a comprehensive dataset with 14 features from over 250,000 crypto wallets; and (2) constructing four credit scoring models based on Stochastic Gradient Descent, Adam, Genetic, and Multilayer Perceptron algorithms. These findings offer valuable insights for improving DeFi lending protocols and mitigating risks in decentralized financial ecosystems.

15:40
A Historical GPS Trajectory-Based Framework for Predicting Bus Travel Time

ABSTRACT. Accurate bus travel time information helps passengers plan their trips more effectively and can potentially increase ridership. However, cyclical factors (e.g. time of day, weather conditions, and holidays), unpredictable factors (e.g. incidents and abnormal weather), and other complex factors (e.g. dynamic traffic conditions, dwell times, and variations in travel demand) make accurate bus travel time prediction challenging. This study aims to achieve accurate travel time prediction. To accomplish this, we have developed a bus travel time prediction framework based on similar historical Global Positioning System (GPS) trajectory data and an information decay technique.

The framework first divides the predicted route into segments, integrating GPS trajectory data with road map processing techniques to accurately map the bus’s position and estimate its arrival time at bus stops. Then, instead of relying on a single historical trajectory that best matches the predicted bus journey, the framework samples a set of similar trajectories as the basis for travel time estimation. Finally, the information decay technique is applied to construct a bus travel time prediction interval. We conduct comprehensive experiments using real bus trajectory data collected from Kandy, Sri Lanka, and Ho Chi Minh City, Vietnam, to validate our ideas and evaluate the proposed framework. The experimental results show that the proposed prediction framework significantly improves accuracy compared to baseline approaches by considering factors such as bus stops, time of day and day of week.
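
The information-decay idea can be sketched as a recency-weighted estimate over the sampled similar trajectories. The half-life value and the interval construction below are our illustrative choices, not the paper's exact technique.

```python
import numpy as np

def predict_segment_time(historical_times, ages_in_days, half_life=14.0):
    """Decay-weighted travel-time estimate for one route segment: an
    observation's weight halves every `half_life` days, so recent
    trajectories dominate. Returns a point estimate and a crude interval."""
    t = np.asarray(historical_times, dtype=float)
    w = 0.5 ** (np.asarray(ages_in_days, dtype=float) / half_life)
    mean = np.average(t, weights=w)
    std = np.sqrt(np.average((t - mean) ** 2, weights=w))
    return mean, (mean - 1.96 * std, mean + 1.96 * std)

est, interval = predict_segment_time([210, 240, 195, 260], [1, 3, 10, 30])
```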

16:00-17:20 Session 17A: Secured and Intelligent Multimedia Systems
Location: Danang 1
16:00
Media Certificate Authority: A System to Ensure Media Content Originality for Daily Lifelog Media Collection

ABSTRACT. In today’s digital landscape, where media content is rapidly shared and easily replicated, ensuring the authenticity and ownership of user-generated content is crucial. This paper presents a Certificate Authority (CA) system designed to validate and authenticate media content with the potential for seamless integration into social media platforms. The system ensures media authenticity by certifying user ownership through public key certification and implementing robust plagiarism detection techniques to prevent unauthorized content reuse. The public key certification process leverages a zero-knowledge proof challenge, securing user identity while verifying the legitimacy of their media content. To detect replicated or plagiarized content, the system employs advanced hashing techniques such as difference hashing, average hashing, perceptual hashing and vector embeddings. Additionally, the CA system incorporates human intervention to enhance the accuracy of the plagiarism detection process. Once media content passes these checks, it is signed using cryptographic signatures, which provide verifiable proof of authenticity and ownership that can be recognized across platforms.
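
The named hash families are available off the shelf in the Python imagehash library; a minimal duplicate screen (the distance threshold is an assumption, not the system's tuned value) could look like this:

```python
from PIL import Image
import imagehash

def is_probable_duplicate(path_a, path_b, max_distance=8):
    """Flag a pair if any of the three hash families places them within a
    small Hamming distance; flagged pairs go on to human review."""
    img_a, img_b = Image.open(path_a), Image.open(path_b)
    for hasher in (imagehash.dhash, imagehash.average_hash, imagehash.phash):
        if hasher(img_a) - hasher(img_b) <= max_distance:  # Hamming distance
            return True
    return False
```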

16:20
Mouse Paw Inflammation Evaluation with Segment Anything and Lightness Classification

ABSTRACT. Inflammation is a common occurrence and a significant contributor to poor health in today's modern way of life. Finding a drug that is safe and effective in controlling inflammation is a challenge, which is why numerous animal models have been created to assess drugs with anti-inflammatory properties. However, it is crucial to choose suitable animal models carefully during the initial stages of drug development, and traditional methods for evaluating inflammation in animal models rely on time-consuming manual data analysis. In this paper, the authors propose a machine learning approach that combines segmentation and classification models to evaluate the inflammation level of mouse paws. The system leverages the Segment Anything Model to segment the paw and then applies a lightness classifier to determine the degree of infection. Given the limited dataset, directly training or fine-tuning an instance segmentation model yields suboptimal performance, limiting accuracy and reliability; we therefore use a prompt generator component that creates a bounding box to guide Segment Anything's segmentation process. Our system emphasizes flexibility, since various models can be integrated into the pipeline. The proposed approach demonstrates high segmentation accuracy, achieving a Mean Dice Score of 0.97997, along with high classification accuracy in our benchmark, even with minimal data, underscoring its practical utility in biomedical analysis.
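
As a hedged illustration of the lightness step (the channel choice and the grade cut-offs below are ours, not the paper's), one can average the L* channel inside the mask returned by the segmenter and bin the score:

```python
import cv2

def paw_lightness(image_bgr, mask):
    """Mean L* (lightness) inside the paw mask; in OpenCV's LAB encoding
    the L channel ranges over 0-255."""
    lab = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2LAB)
    return float(lab[:, :, 0][mask > 0].mean())

def inflammation_grade(score, cutoffs=(90.0, 120.0, 150.0)):
    """Bin the lightness score into ordinal grades; darker (redder,
    inflamed) tissue gets a higher grade. Cut-offs are hypothetical."""
    return sum(score < c for c in cutoffs)   # 0 = healthy ... 3 = severe
```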

16:40
Knowledge Distillation for Lumbar Spine X-ray Classification

ABSTRACT. Lumbar spondylosis is a prevalent chronic illness that deforms the lumbar spine and limits human movement. Over time, spinal deformities can compress or exert tension on the nerve roots, resulting in lower back discomfort and disc herniation. The incidence of spondylosis is escalating, increasingly among younger individuals, a tendency driven by changes in contemporary work and education. X-ray imaging of the lumbar spine is widely utilized and endorsed by many physicians for its rapidity, precision, and accessibility across diverse patient populations. This article introduces a technique for detecting and classifying abnormal and healthy lumbar spine X-ray images. After image filtering, we apply Knowledge Distillation, wherein a trained teacher model instructs smaller student models. We employ EfficientNet-B4 as the teacher model, a high-accuracy and efficient Convolutional Neural Network (CNN) architecture for medical image analysis, and MobileNetV2 as the student model. To assess performance, 2,000 lumbar spine X-ray images were obtained from Kien Giang General Hospital and Trung Cang General Clinic, with 872 samples designated for training and testing. The model attained an accuracy of 91.0%, a precision of 90.0%, a recall of 91.8%, and an F1-score of 90.9% after 500 training epochs with a learning rate of 0.001, indicating strong and dependable performance.
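
The teacher-student setup follows standard knowledge distillation; a typical training loss (the temperature and mixing weight are illustrative hyperparameters, not the paper's reported values) is:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.7):
    """Hinton-style distillation: soften both distributions with a
    temperature, then mix the KL term with ordinary cross-entropy."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=1)
    kd = F.kl_div(log_soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```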

17:00
Exploring Prompt Injection: Methodologies and Risks with an Interactive Chatbot Demonstration

ABSTRACT. Large Language Models (LLMs) offer significant advancements in computation and generative capabilities, enabling wide-ranging applications. However, integrating LLMs into services introduces risks, particularly through prompt injection attacks, where user inputs can manipulate model behavior. This paper explores common strategies for prompt injection and highlights the associated risks in LLM-integrated applications. To demonstrate this vulnerability, we present Injextion, a chatbot where users attempt to exploit the Llama 3 model to obtain a hidden key. Additionally, we implement a minimal TLS handshake with a digital signature to securely transfer chat messages.

17:20
Motorcycle Helmet Detection Benchmarking

ABSTRACT. In this paper, we focus on evaluating the robustness of helmet detection in the context of traffic surveillance, achieved with state-of-the-art deep learning models. This work aims to contribute significantly to motorcycle safety by implementing intelligent systems adept at accurately identifying helmets. An integral component of this inquiry is a meticulous benchmark of cutting-edge object detection models and the integration of advanced techniques, aiming not only to bolster accuracy but also to improve the overall practicality and effectiveness of helmet detection systems. The experimental results highlight the effectiveness of state-of-the-art object detection methods in detecting helmets and the potential of transferring from the traffic domain to the construction-site domain.

16:00-17:20 Session 17B: Lifelog and Multimedia Event Retrieval
Location: Danang 2
16:00
RAPID: Retrieval-Augmented Parallel Inference Drafting for Text-Based Video Event Retrieval

ABSTRACT. Retrieving events from videos using text queries has become increasingly challenging due to the rapid growth of multimedia content. Existing methods for text-based video event retrieval often focus heavily on object-level descriptions, overlooking the crucial role of contextual information. This limitation is especially apparent when queries lack sufficient context, such as missing location details or ambiguous background elements. To address these challenges, we propose a novel system called RAPID (Retrieval-Augmented Parallel Inference Drafting), which leverages advancements in Large Language Models (LLMs) and prompt-based learning to semantically correct and enrich user queries with relevant contextual information. These enriched queries are then processed through parallel retrieval, followed by an evaluation step to select the most relevant results based on their alignment with the original query. Through extensive experiments on our custom-developed dataset, we demonstrate that RAPID significantly outperforms traditional retrieval methods, particularly for contextually incomplete queries. Our system was validated for both speed and accuracy through participation in the Ho Chi Minh City AI Challenge 2024, where it successfully retrieved events from over 300 hours of video. Further evaluation comparing RAPID with the baseline proposed by the competition organizers demonstrated its superior effectiveness, highlighting the strength and robustness of our approach.
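
A minimal sketch of the draft-and-retrieve loop follows, assuming an OpenAI-compatible client and an arbitrary `search_fn`; the model name, prompt, and function names are placeholders, since the paper does not specify its stack.

```python
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def enrich(query, n_variants=3):
    """Ask an LLM for corrected, context-enriched paraphrases of the query."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content":
             f"Rewrite this video-search query {n_variants} ways, fixing "
             "errors and adding likely context (location, background). "
             "One rewrite per line."},
            {"role": "user", "content": query},
        ],
    )
    return [q.strip() for q in resp.choices[0].message.content.splitlines()
            if q.strip()]

def rapid_style_search(query, search_fn):
    """Run retrieval over the original query and the enriched drafts in
    parallel; a later evaluation step (omitted) re-scores results against
    the original query."""
    variants = [query] + enrich(query)
    with ThreadPoolExecutor() as pool:
        return list(pool.map(search_fn, variants))
```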

16:20
A Hybrid Video Retrieval System Using CLIP and BEiT-3 for Enhanced Object and Contextual Understanding

ABSTRACT. Video retrieval from large datasets has gained significant attention due to its wide range of applications. Existing methods typically focus on either global semantic alignment, which emphasizes detecting prominent objects in queries, or fine-grained semantic alignment, which captures contextual information. However, these approaches often struggle with queries involving complex relationships between objects and their contexts. To address these challenges, we propose a hybrid retrieval approach that integrates Contrastive Language-Image Pretraining (CLIP) and BEiT-3 (BERT Pre-Training of Image Transformers). CLIP enhances object detection and recognition, while BEiT-3 excels at understanding detailed contextual relationships. By leveraging the complementary strengths of these models, our approach provides both global semantic understanding and fine-grained contextual analysis across multiple modalities. The proposed system was evaluated in the 2024 Ho Chi Minh AI Challenge, demonstrating significant improvements in retrieval performance for complex queries.

16:40
VizQuest: Enhanced Video Event Retrieval Using Fusion and Temporal Modeling

ABSTRACT. Retrieving events from large video datasets is challenging, especially when different data types like images, audio, video, and temporal sequences are involved. Although there have been improvements in cross-modal retrieval, current methods still struggle with precise temporal alignment in event searches. This paper proposes a new video event retrieval system that integrates models for processing text, audio, and visual data. Retrieval accuracy is improved by combining the ranked outputs of these individual models. Each model contributes uniquely to the performance of the system, and when combined they enhance both the diversity and the robustness of the search. Additionally, temporal modeling ensures the accurate retrieval of event sequences, making the system suitable for time-sensitive tasks. The system was tested in the AI Challenge HCMC 2024, where it demonstrated remarkable speed and precision and secured a top-10 finish, confirming its potential for real-world use in complex event retrieval scenarios.
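
The abstract does not specify the fusion rule; reciprocal rank fusion is one common way to combine ranked outputs from the text, audio, and visual models, shown here as a sketch:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists: items ranked highly by more than one
    modality accumulate larger scores and rise to the top."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = reciprocal_rank_fusion([
    ["clip_12", "clip_07", "clip_31"],   # text model
    ["clip_07", "clip_12", "clip_44"],   # audio (ASR) model
    ["clip_07", "clip_31", "clip_12"],   # visual model
])
```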

17:00
Unveiling Peripheral Information: A Context-Aware Video Retrieval Approach

ABSTRACT. As video content continues to expand, effective retrieval methods must account for the complexity of diverse data types and contexts. Traditional video retrieval systems often prioritize prominent features, overlooking valuable details in less noticeable regions, such as the periphery of video frames. This can result in incomplete retrieval, especially when subtle yet important information is present in these areas. In this paper, we introduce a novel context-aware video retrieval technique that emphasizes peripheral information, detecting and extracting features from these minor-impact regions. By focusing on such overlooked details, our approach enhances both the accuracy and contextual relevance of retrieval. Extensive testing validates the method’s effectiveness, demonstrating its potential for real-world applications where detailed and precise video retrieval is crucial.

16:00-17:20 Session 17C: Human Computer Interaction
Location: Danang 3
16:00
Now I Know What I am Eating: Real-time Tracking and Nutritional Insights Using VietFood67 to Enhance User Experience

ABSTRACT. Maintaining a balanced diet is vital for preventing diseases like diabetes and cancer, but busy lifestyles make tracking food intake difficult. Traditional dietary assessment methods, such as questionnaires, are often labor-intensive and inaccurate. The need for real-time, user-friendly nutrition tools has grown, especially after the COVID-19 pandemic. This paper introduces VietFood67, an expanded Vietnamese food dataset featuring 33,003 images across 68 classes, including human face detection, built on VietFood57. A fine-tuned YOLOv10 model achieved a mAP50 score of 0.92, showing strong performance on the larger dataset. Additionally, the enhanced FoodDetector website now offers real-time nutritional information for detected dishes. A user study with 35 participants showed high satisfaction and increased nutritional awareness, highlighting the system’s potential to encourage healthier eating habits through efficient, online dietary tracking.

16:20
Towards Enabling Tangible Interaction with Physical Objects in Virtual Reality Desktop Workplaces

ABSTRACT. Advancements in immersive technologies like virtual reality (VR) have created the possibility of virtualizing desktop workspaces, allowing users to work on their desktop computers with more flexibility rather than being constrained to certain physical spaces. However, current virtual screen interaction methods, which rely on controllers or mid-air gestures, often result in user fatigue. Aiming to solve this issue, we propose a new system that leverages only consumer-grade devices to let users perceive and interact with tangible objects like mice, keyboards, and their own hands in VR. We implemented and evaluated two ways of representing physical objects on physical desktop workspaces, and users' interactions with them, in VR. Furthermore, we developed an exemplary application that integrates our system into a VR control tower, which can be used for training air-traffic controllers and can serve as an experimental environment for air-traffic control and management research.

16:40
Meal Plan App: Personalized Meal Plans Based on Unique Personal Needs

ABSTRACT. As society evolves, individuals often make poor dietary choices, leading to negative health impacts. With the large number of recipes available on various platforms, selecting and planning meals that meet specific dietary needs becomes increasingly challenging. Furthermore, current meal planning applications are often limited by their focus on single dietary items or personal preferences, restricting users with multiple dietary requirements or preferences. This paper introduces the integration of the Edamam API and the MealDB API, providing access to over 1,000 diverse recipes complete with nutritional information, detailed cooking instructions, and dietary labels. By combining these APIs with a user preference collection mechanism, the system can provide personalized recipes tailored to each individual's health goals and dietary needs. Additionally, incorporating AI enhanced the system's accuracy in recommending appropriate meals, improving the match rate of diet-friendly recipes by 23% compared to non-AI-based methods.

17:00
Budget-Aware Keyboardless Interaction

ABSTRACT. Interacting with computers typically relies on traditional input devices such as keyboards, mice, and monitors, which can be cumbersome for users seeking greater mobility. Virtual keyboards have been explored to address these limitations, but they often involve complex setups or expensive equipment. This paper proposes a novel virtual keyboard system that leverages only a standard camera and a sheet of paper with a printed keyboard layout. Unlike previous methods requiring complex calibration or special lighting conditions, our approach works in standard environments using modern computer vision technologies. Combining modern segmentation and detection models with traditional image processing algorithms, we efficiently identify the keyboard region. Touch detection is performed using an algorithm that analyzes the color of the user's fingernail. Experiments demonstrated promising results for our keyboard and keystroke detection in practical applications, and participants in our user study found the proposed system engaging.
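
One plausible reading of the fingernail cue, offered purely as an assumption rather than the paper's algorithm: a pressing finger blanches the nail, lowering its color saturation. A toy detector along those lines:

```python
import cv2
import numpy as np

def nail_is_pressing(frame_bgr, nail_box, blanch_thresh=0.55):
    """Heuristic touch detector: compare the mean saturation inside the
    tracked nail region against a threshold; a pale (blanched) nail is
    read as a key press. The threshold is illustrative only."""
    x, y, w, h = nail_box                      # tracked nail bounding box
    nail = frame_bgr[y:y + h, x:x + w]
    hsv = cv2.cvtColor(nail, cv2.COLOR_BGR2HSV)
    saturation = hsv[:, :, 1].astype(np.float32) / 255.0
    return float(saturation.mean()) < blanch_thresh
```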