MAVERICS: Multimodal Advanced Visual Event Retrieval with Integrated CPU-Optimized Search ABSTRACT. The increasing volume of visual data in news archives and media sources poses significant challenges for efficient event retrieval. This paper presents a multimodal approach to tackle the problem of Event Retrieval from Visual Data. Our system integrates several techniques to process diverse query types, including text, image, and video. For image-text retrieval, the BLIP2 model is used to embed both images and text descriptions. In cases where queries are in Vietnamese, we employ the pre-trained VietAI/envit5-translation model to translate prompts into English before processing them with BLIP2. Object detection is handled by YOLOWorldv2, and text extraction from images utilizes PP-OCRv3 and VGG Transformer. Additionally, WhisperX is employed for audio-to-text conversion. Embeddings from textual data, whether derived from OCR or audio, are generated using sentence-transformers/all-MiniLM-L6-v2. These embeddings are indexed using Usearch, enabling fast and efficient retrieval. Furthermore, we developed a high-speed temporal search mechanism that calculates scores and combinations for consecutive related frames to improve performance on temporal queries. The system runs efficiently on CPUs, with a maximum query processing time of 2 seconds even for advanced queries, such as temporal search, that require multiple models to run consecutively; this makes it a scalable solution for large-scale video data retrieval. Additionally, we have built a user-friendly interface using Streamlit, enabling users to easily interact with and utilize the system. |
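A minimal sketch of the USearch indexing step described above (the 384-dimensional vectors match all-MiniLM-L6-v2 output, but the keys and data here are illustrative stand-ins, not the authors' pipeline):

```python
# Hedged sketch: index sentence embeddings with USearch and query them.
import numpy as np
from usearch.index import Index

index = Index(ndim=384, metric="cos")  # all-MiniLM-L6-v2 outputs 384-d vectors

# Hypothetical frame-level text embeddings, keyed by frame id.
frame_ids = np.arange(10_000, dtype=np.int64)
embeddings = np.random.rand(10_000, 384).astype(np.float32)
index.add(frame_ids, embeddings)

query = np.random.rand(384).astype(np.float32)   # stand-in query embedding
for match in index.search(query, 10):            # 10 nearest frames
    print(match.key, match.distance)
```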
Forecasting Traffic Flow under Uncertainty: A Case Study in Da Nang ABSTRACT. This paper discusses the design and implementation of a modern traffic flow prediction system using data from street surveillance cameras deployed at the website 0511.vn. The core objective of the research was to develop an accurate and efficient prediction model based on direct image analysis and real-time data, providing instant traffic information and forecasting short-term traffic trends. First, existing image processing and machine learning methods were identified and evaluated to detect and classify vehicles in the collected video data. Subsequently, the author designed models combining ARIMA and LSTM methods to predict the density and movement of vehicles on the roads. These methods were tested and optimized through a series of experiments on historical data and real-time data collected from 0511.vn, marking a significant advancement in applying video surveillance technology to urban traffic management. The research results not only contribute to the field of data science and image processing but also have practical potential in supporting the decision-making of traffic management agencies and improving the community's commuting experience. |
Enhanced Video Retrieval System: Leveraging GPT-4 for Multimodal Query Expansion and Open Image Search ABSTRACT. The exponential growth of video data on digital media and video-sharing platforms has created an urgent need for efficient content-based video retrieval systems. Traditional methods, such as object recognition, text extraction, and color analysis, have been extensively explored, while current approaches leverage pre-trained multimodal models like CLIP, which give impressive results in text-based video retrieval. However, as large-scale databases grow in size, diversity, and complexity, fixed text queries often produce limited results. To address this, we propose an enhanced video retrieval framework that integrates CLIP with GPT-4 through API interaction. By employing advanced prompt engineering techniques, we dynamically expand and refine text queries, enabling broader and more effective exploration of video datasets. Additionally, our framework translates human-generated queries into machine-optimized formats for vision-language models, enhancing retrieval precision. Furthermore, to utilize large image databases such as Google, we introduce an open image-based search functionality that allows users to import reference images similar to their text queries, improving the system’s ability to find relevant content. This dual approach enhances query-model alignment, increasing the likelihood of retrieving accurate and contextually relevant results. |
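A hedged sketch of the GPT-4 query-expansion step via the OpenAI API (the prompt wording and variant count are illustrative assumptions, not the authors' exact prompt engineering):

```python
# Illustrative query expansion with GPT-4; each variant would then be
# embedded with CLIP and searched independently.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def expand_query(query: str, n: int = 5) -> list[str]:
    """Ask GPT-4 for n short, visually concrete rewrites of a search query."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": f"Rewrite the user's video-search query into {n} "
                        "short, visually concrete variants, one per line."},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content.splitlines()

print(expand_query("a crowd celebrating after a goal"))
```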
ReViMM: Enhanced Video Retrieval with Reweighting Mechanism for Multi-Modal Queries ABSTRACT. The increasing volume of video data is posing significant challenges for efficient event retrieval. Applications in fields such as security, multimedia management, and event analysis require systems that are not only robust but also flexible enough to handle various input types, such as text, images, audio, and data from multiple sources. Traditional approaches often struggle to handle the large volume and diversity of data, leading to inefficient retrieval in complex scenarios. Our proposed system integrates FAISS (Facebook AI Similarity Search) for fast similarity search and ElasticSearch to make the search process more efficient. The system can process various input types, including descriptive text and similar images, while using Whisper for speech recognition and transcription. Additionally, Large Language Models (LLMs) are employed to generate detailed and accurate image descriptions. A key highlight of our approach is the use of reweighting to adjust the importance of each word or image during the search process. By reweighting, the system can optimize the prioritization of the most relevant information for each query, significantly enhancing the accuracy of the search results. This approach not only improves the precision of data retrieval but also greatly enhances the user experience. By offering comprehensive and precise results for complex queries, the system helps users easily find relevant information in large and diverse datasets, meeting the growing demands of applications in security and multimedia analysis. |
LLM-Powered Video Search: A Comprehensive Multimedia Retrieval System ABSTRACT. Image and video search has become an important problem amid the rapid growth of image and video-sharing platforms. The explosion of video content on the Internet has created an urgent demand for effective content management and retrieval. Traditional approaches, such as text-based, image, object, audio, and color searches, have achieved some success but often lack consistency and fail to meet user needs comprehensively. To address this issue, we propose an artificial intelligence (AI) system that integrates multiple existing search methods with a Combined Ranking Score (CRS) algorithm. CRS balances the ranking and scores from different approaches, optimizing query results. The system also uses a Retrieval-Augmented Generation (RAG) architecture combined with large language models (LLMs) to enhance contextual understanding and deliver more accurate results. Users can interact directly with our AI system to easily and efficiently search for desired moments in videos. This approach promises to significantly improve the user experience in multimedia content search and retrieval. |
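The abstract does not give the CRS formula, so the following is only one plausible shape for such a fusion: a weighted reciprocal-rank combination across methods (the weights and scores below are made up for illustration):

```python
# Hypothetical score fusion in the spirit of a Combined Ranking Score.
def combined_ranking_score(per_method_scores, weights):
    """Fuse results: each method contributes weight / rank for every item,
    so items ranked highly by several methods rise to the top."""
    fused = {}
    for method, scores in per_method_scores.items():
        ranked = sorted(scores, key=scores.get, reverse=True)
        for rank, item in enumerate(ranked, start=1):
            fused[item] = fused.get(item, 0.0) + weights[method] / rank
    return fused

fused = combined_ranking_score(
    {"clip": {"v1": 0.9, "v2": 0.7}, "ocr": {"v2": 0.8, "v3": 0.6}},
    weights={"clip": 0.6, "ocr": 0.4},
)
print(max(fused, key=fused.get))  # "v2": ranked well by both methods
```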
Interactive Video Retrieval System for AI Challenge 2024 Using CLIP, RAM++, and LLM-Enhanced Tag Matching ABSTRACT. In this paper, we present an interactive video retrieval system developed for the AI Challenge 2024. The system offers multi-modal search functionality, allowing users to search using text, images, or tags. At its core, the system leverages CLIP (Contrastive Language-Image Pre-training) to enable efficient video retrieval from both natural language and image-based inputs. For tag-based queries, we incorporate the state-of-the-art Recognize Anything Plus (RAM++) image tagging model. However, due to the large number of tags produced by RAM++, it is impractical to manually select the tags most relevant to the input query. To address this challenge, we use the Gemini Large Language Model (LLM) to automatically select the most appropriate tags for the given query. Additionally, we describe our temporal search algorithm, which further enhances retrieval performance. Our experiments show that this combination of models provides a scalable and high-performance solution for video search applications in real-world scenarios. |
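A sketch of the Gemini-based tag selection (the model name, prompt, and tag list are placeholders; the paper's exact prompting is not given):

```python
# Hedged sketch: ask Gemini to pick the RAM++ tags relevant to a query.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")          # placeholder key
model = genai.GenerativeModel("gemini-pro")      # assumed model name

def select_tags(query, ram_tags, k=5):
    prompt = (f"Query: {query}\n"
              f"Candidate tags: {', '.join(ram_tags)}\n"
              f"Return the {k} tags most relevant to the query, comma-separated.")
    reply = model.generate_content(prompt)
    return [t.strip() for t in reply.text.split(",")][:k]

print(select_tags("man walking a dog in the rain",
                  ["dog", "umbrella", "car", "rain", "person", "tree", "leash"]))
```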
Transforming Video Search: Leveraging Multimodal Techniques and LLMs for Optimal Retrieval PRESENTER: Truong Dinh ABSTRACT. The fast development of online video material has made video searching critical in the digital era. Traditional approaches, such as image-text retrieval and object, audio, color, and text-based searches, have made considerable advances in this field. However, these approaches frequently require refinement when dealing with numerous users’ inquiries at the same time, which might result in overlapping searches. Furthermore, present video search algorithms must increase their capacity to respond to complex queries involving data extraction from several frames. Overcoming these limits is critical for creating scalable and user-friendly video search engines. In this research, we provide an improved video search system that includes three significant advances. We enhance text detection for Vietnamese, employ image captioning to increase search relevance, and allow users to modify queries with a large language model (LLM) for more precision. These innovations considerably increase the search process’s efficiency and accuracy. The intuitive interface enables seamless searches by queries, frame IDs, and related images, while offering sophisticated features such as query expansion, result aggregation, and integrated feedback for enhanced search accuracy. |
Real-Time Multi-User Multimedia Event Retrieval Application System Using WebSocket Protocol ABSTRACT. Event query systems for video have become essential tools in various fields such as surveillance, sports analytics, and media management, where accurately retrieving significant moments is crucial. This paper presents an event query system optimized for multi-user access, utilizing the WebSocket protocol to enhance real-time interaction. The system allows multiple users to simultaneously query specific events in video content while supporting cross-validation of results to improve accuracy. The primary research method focuses on integrating WebSocket technology, enabling continuous communication between users and the system, thus enhancing user experience. Additionally, the system features a ranking mechanism for events based on user votes, which aids in optimizing suggestions and encourages community participation. Experimental results demonstrate that the system operates effectively in environments with multiple users, providing accurate and rapid queries while enhancing user engagement through real-time interactive features. This system marks a significant advancement in applying modern technologies to video event querying, opening up substantial opportunities for practical applications across various fields. |
Application of the SFE Feature Selection Method for Multi-Omic Biomarker Discovery in Brain Cancer Subtyping ABSTRACT. Glioblastoma (GBM) is an aggressive brain cancer with poor prognosis, making the identification of reliable molecular biomarkers vital for early detection and improving treatment strategies. This study introduces a two-phase framework for discovering and validating GBM biomarkers. In the first phase, we applied the Simple, Fast, and Efficient (SFE) feature selection algorithm to high-dimensional multi-omics data from The Cancer Genome Atlas (TCGA) GBM cohort to identify potential biomarkers. In the second phase, we assessed the explainability of these biomarkers through two approaches: first, by comparing them with reference data from established databases, and second, by evaluating their performance using classical machine learning models. This two-phase framework is versatile and can be adapted to other cancer datasets, offering a promising approach to biomarker discovery for improving cancer treatment. |
Enhancing Video Retrieval via Synergized Image Embeddings and RAG ABSTRACT. This paper introduces an advanced video retrieval system designed to efficiently process and retrieve video content through innovative information extraction and embedding techniques. The system indexes videos by extracting keyframes and removing redundancies using CNN-based embeddings. Keyframes are embedded with the BEiT-3 model and enriched with metadata from object detection. Videos are segmented into overlapping sub-videos, with transcripts aligned and embedded using the Alibaba-NLP model. All data is stored in cloud storage and indexed in a vector database for efficient retrieval. User queries are processed through multiple embedding models, supporting versatile search capabilities across transcripts, frames, and descriptions. The retrieval process includes a re-ranking algorithm that filters and ranks keyframes, providing users with the most relevant results. Additionally, Retrieval-Augmented Generation (RAG) is employed to enhance search precision, offering a robust solution for large-scale video content analysis and retrieval. |
A Comprehensive Video Event Retrieval System for Vietnamese News: Integrating CLIP ViT, TASK-former, Transcripts, and OCR ABSTRACT. In response to the growing need for precise and efficient video retrieval, we present a versatile video event retrieval system that supports multiple query modalities. Our system integrates 5 modes for querying in natural language: quick search, temporal search, hybrid text and sketch search, transcript-based search, and OCR-based search. Powered by the CLIP ViT-L model, the quick search matches user queries to relevant video segments. Temporal search combines CLIP embeddings with mathematical techniques to pinpoint specific timeframes, while the TASK-former model supports hybrid sketch-text search, enabling users to locate scenes using hand-drawn sketches and textual descriptions. Transcript-based search helps users overcome socio-cultural barriers, while OCR-based search utilizes extracted text from video keyframes. The system’s interface allows users to input queries and browse top-ranked results in a manner that tackles the clustered-viewing problems seen in common systems. In addition, users can explore visually similar images and preview short clips, improving both precision and accessibility in video content retrieval. |
LameFrames: Optimizing Video Event Retrieval Through Strategic Integration and Individual Strategy Enhancement ABSTRACT. The video event retrieval task aims at retrieving video events from a large video collection that are semantically relevant to a given textual or visual query. Several approaches have been introduced for this promising task, but overall, they can be categorized either as embedding-based techniques or proxy techniques. Each has its own advantages and appropriate usage contexts. The huge size of datasets often seen in this task is also a challenge for retrieval systems to work efficiently. This paper presents a comprehensive solution and application for video retrieval that addresses the challenges of speed and accuracy in large-scale datasets. Our approach integrates two complementary methods: an image semantic search using CLIP visual-textual embeddings together with an advanced FAISS vector retrieval index; and a text-proxy image search using optical character recognition (OCR) and automatic speech recognition (ASR) together with Elastic Text Search. In the first approach, we further implement four strategies that exploit and enhance the power of CLIP embeddings. Through experiments, we demonstrate that our approaches provide high accuracy and efficiency for video retrieval applications across a wide variety of query types. We also found that the quality of the query noticeably impacted the search results, and the ability of end users to customize the search by modifying different factors also contributed to quick and successful queries. |
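A minimal sketch of the CLIP-plus-FAISS leg of such a pipeline (the 512-d size matches CLIP ViT-B/32, and the flat inner-product index is an assumption; the paper does not specify its index type):

```python
# Hedged sketch: exact cosine search over CLIP embeddings with FAISS.
import faiss
import numpy as np

d = 512                         # CLIP ViT-B/32 embedding size (assumed model)
index = faiss.IndexFlatIP(d)    # inner product == cosine after L2-normalizing

frame_embs = np.random.rand(100_000, d).astype("float32")  # stand-in data
faiss.normalize_L2(frame_embs)
index.add(frame_embs)

query_emb = np.random.rand(1, d).astype("float32")          # text embedding
faiss.normalize_L2(query_emb)
scores, frame_ids = index.search(query_emb, 20)             # top-20 keyframes
print(frame_ids[0][:5], scores[0][:5])
```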
MMMSVR: An Advanced Video Retrieval and Question Answering System ABSTRACT. Video retrieval, the process of locating specific video content within large datasets, presents a significant challenge in the era of digital multimedia. In response to this, as part of the Ho Chi Minh City AI Challenge 2024, this paper presents an advanced multi-modality, multi-stage video retrieval framework, named MMMSVR, enhanced with the ability to answer supplemental questions, such as automatic counting of objects. The proposed method leverages vision-language models, combining CLIP ViT-H/14, BLIP2 and Beit-3 for feature encoding, and implements a re-ranking mechanism based on a weighting system. Furthermore, a wide range of query modalities such as Optical Character Recognition (OCR), Object Detection, and Automatic Speech Recognition are integrated to refine and improve the retrieval process. The system supports both single and multi-text queries for event sequences, facilitating efficient video retrieval based on image, audio, and textual attributes. In addition, the framework includes an image-based query feature, enriching the model’s versatility and improving retrieval accuracy. The proposed approach demonstrates significant performance improvements and offers a robust, flexible solution for video search and question-answering tasks. |
CLIP-Enhanced Lifelog Retrieval System: Robust Multi-Modal Media Search with Real-Time Performance ABSTRACT. In this paper, we introduce a robust media retrieval system designed to address the challenges posed by large-scale, multi-modal data retrieval tasks, particularly in the context of image retrieval, which plays a crucial role in surfacing key moments from vast media datasets. Efficient and accurate image retrieval is essential for individuals navigating and retrieving relevant events, making it a vital component of any advanced retrieval system. Our system builds on retrieval models by integrating advanced features such as CLIP-based image search, YOLOv8 for precise object detection, and temporal search to handle long and complex queries. Key optimizations include enhanced visual similarity search and an intuitive, interactive interface that ensures fast and efficient query results. By storing extracted features in the Milvus vector database, the system achieves significant speed improvements in retrieval, enabling real-time performance. Benchmarked at the AI Challenge (AIC24) in Ho Chi Minh City, the system demonstrated top-tier results, particularly excelling in KIS-type queries, where it achieved 100% frame retrieval accuracy (15/15) with an average query response time under 30 seconds. These results highlight the effectiveness of our system in handling diverse and complex queries, making it a valuable tool for lifelog retrieval tasks, especially in improving the user experience for both novice and expert users. Our code is publicly available at https://github.com/trnKhanh/AIC24. |
Enhanced Video Event Retrieval through Adaptive Multi-Model Fusion with Large Language Models ABSTRACT. The retrieval of events in videos has emerged as a critical area of research due to the complexity of multimedia information and the rapid growth of digital content. Current methods typically rely on models to extract features from various data sources and combine these features for enhanced retrieval. However, this approach often necessitates prioritizing model weighting based on query context, as different models may yield varying relevance depending on the nature of the query. To address these challenges, this study proposes an adaptive multi-model fusion technique within a video event retrieval framework, leveraging large language models (LLMs) to dynamically adjust weights for multimodal data, thereby enhancing context-aware retrieval. Evaluated on the AI Challenge HCMC (AIC) 2024 dataset, our method achieved a success rate of 91.9%, demonstrating the effectiveness of our solution. |
"MAVEN: Video Retrieval System using A Multi-Agent Visual Exploration Network" ABSTRACT. Effective video retrieval systems are essential as video data grows across various fields. Traditionally, these systems rely on OCR, object detection, color extraction, and audio analysis. Current approaches like CLIP bridge the text-image embeddings gap for search, but they often lack contextual depth for complex, multi-frame searches. We propose a solution that integrates traditional methods and CLIP with advanced language models and prompting techniques for image captioning, extracting rich information from individual frames. Our system includes an Agent that automates searches, classifies queries, generates prompts, and verifies results, improving search accuracy while reducing user effort. Additional features like temporal search, video previews, and frame filtering further enhance the user experience. This comprehensive approach provides a powerful toolkit for achieving more accurate and efficient video search results, addressing the growing complexity of video data retrieval across various domains. |
Can Image Generative Models be Considered Experts? ABSTRACT. This paper addresses foundational challenges in evaluating Generative Artificial Intelligence (GAI), focusing on the transition from expertise evaluation to intelligence evaluation. It critiques both quantitative and qualitative metrics for GAI, highlighting limitations in human-algorithm interaction environments. The study examines knowledge representation in neural network architectures and the processes of filtering versus tokenization for image processing, emphasizing inconsistencies and a lack of standardization in test design. The paper then proposes a research design to evaluate the expertise of GAI models, with a focus on image generation. The methodology includes training an expert model, evaluating model tasks, developing an evaluation framework, and conducting human evaluations. The paper also acknowledges some potential research design limitations and considerations. The authors hope that this research design will aid in the development of GAI models that can have practical applications in highly technical, high-precision industries such as industrial manufacturing and electrical engineering. |
10:20 | Constraint Programming-Based Cutting Plane Algorithm for a Combination of Orienteering and Maximum Capture Problem PRESENTER: Hoang Giang Pham ABSTRACT. In this paper, we study a new variant of the orienteering problem (OP) in which each vertex in the OP tour is a facility within a competitive market context, where customer demand is predicted by a random utility choice model. Unlike prior research, which primarily focuses on simple objective functions such as maximizing a linear sum of the scores of selected vertices, we introduce a complicated non-linear objective function that necessitates the selection of locations to maximize a profit value such as expected customer demand or revenue. In our study, the routing constraints, which take the form of the OP, are handled by Constraint Programming (CP), and the non-linear objective function, resulting from the utilization of random utilities, is tackled by two types of valid cuts, namely outer-approximation and submodular cuts. These lead to the development of an exact solution method, Cutting Plane, in which these valid cuts are iteratively added to a master problem. Extensive experiments are conducted on problem instances of varying sizes, demonstrating that our approach excels in terms of solution quality and computation time when compared to baseline approaches. |
10:40 | Cost Optimization in Competitive Facility Location under General Demand Model PRESENTER: Ba Luat Le ABSTRACT. This work addresses a cost optimization problem in facility location where customer demand is modeled using the cross-nested logit model, one of the most flexible demand models in the literature. The objective is to maximize a captured demand function by allocating a fixed investment budget across a set of facilities, where the investment directly influences the demand captured by each facility. The resulting optimization problem involves exponential and fractional terms, leading to a highly nonlinear structure. To the best of our knowledge, no existing methods can solve this problem to near-optimality. To address this, we propose a piecewise linear approximation technique and apply variable transformations to approximate the problem (to any desired precision) as a mixed-integer convex program, which can be solved to optimality using an outer-approximation method. Extensive experiments on generated instances of varying sizes demonstrate the effectiveness of our proposed approach compared to standard baselines. |
11:00 | Influence Maximization with Fairness Allocation Constraint ABSTRACT. Motivated by practical applications from social influence and viral marketing, this work studies the problem of Influence Maximization with Fairness Allocation Constraint, which aims to find a set of $k$ users from groups in a social network with maximal influence spread so that the number of selected users in each group does not exceed the group budget. We propose an efficient and scalable approximation algorithm that returns an approximation ratio of $1/2-\epsilon$ and takes $O((m+\log(\frac{k}{\epsilon}))\frac{n}{\epsilon^2}(k \log n+\log(\frac{1}{\delta})))$ time complexity, where $\epsilon$ is a constant, $n$ is the number of users and $m$ is the number of links. Besides theoretical results, extensive experiments conducted on real social networks show that our algorithm provides better solutions than cutting-edge methods. |
11:20 | A Reputation Scoring Framework for Lending Protocols using the PageRank Algorithm ABSTRACT. Blockchain technology has revolutionized the financial sector by introducing decentralized finance (DeFi) as a powerful alternative to traditional banking systems. Among DeFi sectors, Lending has become a key area, facilitating cryptocurrency borrowing and lending without intermediaries. As of May 2024, Lending ranks second in Total Value Locked (TVL) within DeFi, reflecting its widespread adoption. Key entities in the Lending ecosystem, including Personal Wallets, Lending Smart Contracts, Protocol Supported Tokens, and Centralized Exchanges, play a crucial role in shaping governance, decision-making, and resource allocation. However, current evaluation methods, which primarily rely on token holdings for governance voting, are vulnerable to manipulation and fail to accurately reflect contributions. To address this, we propose a novel scoring system that evaluates entities based on both token holdings and lending interactions over time. Our system uses the PageRank algorithm, scaled to the FICO score range, to offer a more stable and transparent assessment. This approach promotes healthy competition, encourages user activity, and supports the long-term growth and stability of the Lending DeFi ecosystem. |
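A small sketch of the scoring idea: run PageRank over an interaction graph and min-max scale the scores to the FICO range of 300-850 (the graph, damping factor, and edge weights below are illustrative assumptions, not the paper's data):

```python
# Hedged sketch: PageRank over an interaction graph, rescaled to FICO range.
import networkx as nx

G = nx.DiGraph()
# Hypothetical edges: entity -> entity interactions, weighted by volume.
G.add_weighted_edges_from([
    ("walletA", "lendingPool", 3.0),
    ("walletB", "lendingPool", 1.0),
    ("lendingPool", "walletA", 0.5),
])

pr = nx.pagerank(G, alpha=0.85, weight="weight")
lo, hi = min(pr.values()), max(pr.values())
span = (hi - lo) or 1.0
fico_like = {node: 300 + 550 * (v - lo) / span for node, v in pr.items()}
print(fico_like)  # scores between 300 and 850
```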
11:40 | A method combining the reference information of the adaptive adjustment method and the decision maker of multi-objective evolutionary algorithms ABSTRACT. In practice, when using multi-objective optimization algorithms, people use reference information to search for desired solutions. However, including decision-maker reference information can cause the evolutionary process to lose the balance between exploration and exploitation capabilities, thereby missing good solutions or becoming trapped in local optima. Recently, there have been many effective proposals to analyze trends and maintain this balance automatically. To simultaneously address decision-maker desires and self-regulation capabilities, this paper proposes a method that combines decision-maker information and adaptive control information, applied to DMEA-II and MOEA/D using reference points. The experimental results show a good balance between the two types of reference information in the evolutionary process. |
Flow Velocity Analysis of Rivers Using Farneback Optical Flow and STIV Techniques with Drone Data ABSTRACT. This study focuses on the application of the Farneback optical flow method and Space-Time Image Velocimetry (STIV) for analyzing river flow velocities in upstream regions using drone-captured data. By integrating drone imagery with satellite data, we developed a high-precision model for predicting river flow, aiming to improve hydropower stability and flood control measures. The Farneback method enabled accurate flow velocity predictions by analyzing image sequences, while STIV was applied to obtain temporal and spatial flow measurements. Field experiments conducted in Japan, Sri Lanka, and Vietnam confirmed the effectiveness of these techniques, surpassing traditional flow measurement methods. The results showcase the potential of combining AI-driven image analysis and advanced drone technology for real-time river management, contributing to disaster risk mitigation and sustainable water resource management. |
Faster, Larger, Stronger: Optimally Solving Employee Scheduling Problems with Graph Neural Networks ABSTRACT. Employee planning is a complex task that affects the operations and profitability of many companies. It requires optimizing total operating costs and profitability while satisfying constraints such as employee availability, skills and preferences, which oftentimes leads to the use of combinatorial optimization techniques, a powerful tool to solve this problem. In many real-world applications, traditional combinatorial optimization techniques rely on branch and bound, cutting plane, and local search techniques to find the optimal solution. However, these traditional techniques can be computationally expensive and are oftentimes not scalable for large-scale problem instances. To overcome these issues, we propose a new method which utilises the power of deep neural networks. In particular, we first convert the employee scheduling problem into a graph and build a novel graph neural network (GNN) to learn the optimal solution on the graphical representation of the problem. We evaluate the performance of our enhanced techniques on a large number of instances of employee scheduling problems, which show that our approach can significantly improve the performance of traditional combinatorial optimization techniques (approximately 20.86% to 46.79% compared to the state-of-the-art solver, CPLEX). |
Advancing Geopolitical Map Analysis: An Intelligent System for Territorial Integrity Verification ABSTRACT. Accurate cartographic representation of territorial sovereignty is crucial for geopolitical integrity, especially in historically complex regions. This paper presents an intelligent system for detecting island omissions, particularly the Hoang Sa and Truong Sa archipelagos, on geographic maps. Addressing data-scarcity and map-diversity challenges, we propose a multi-faceted approach integrating web-crawling, map-detection, and island-identification techniques. Our novel deep-learning architecture incorporates a crawl service, map-detection model, Vietnam-map classifier, and parallel archipelago and nine-dash-line detection processes. We introduce the IslandMapVN dataset, comprising 3,000 annotated map images, including 1,200 Vietnam maps featuring the disputed archipelagos and 600 rare nine-dash-line images. This unique dataset significantly contributes to the geopolitical map-analysis field. Experimental results exhibit the superiority of the proposed method in comparison with competitive baselines, in both quantitative and qualitative terms. |
Improving Human Action Recognition Using Quaternion Discrete Fourier Transform in Transfer Learning ABSTRACT. Human action recognition is a critical field in artificial intelligence and computer vision, with wide-ranging applications such as healthcare, surveillance, and virtual reality. This study enhances action recognition performance by integrating transfer learning with the Quaternion Discrete Fourier Transform (QDFT), a novel approach that leverages the mathematical properties of quaternions for advanced signal processing. Using a subset of the UCF50 dataset, the study evaluates the effectiveness of this method across various actions, including BaseballPitch, Basketball, and Biking. The approach involves extracting features from pre-trained convolutional neural networks (CNNs) and applying the Fourier transform on quaternions to these features, which are then combined with those processed through fully connected layers. The experimental results demonstrate that incorporating QDFTs significantly improves the accuracy, precision, recall, and F1-scores of transfer learning models compared to conventional methods. Furthermore, comparative analysis with baseline models, such as MobileNet and EfficientNet, highlights the superiority of this hybrid approach. This research contributes to the field of action video recognition by providing a robust framework that can be adapted to various applications, with potential for further optimization and exploration across different datasets and recognition tasks. |
Cardio Care: A Vision Transformer Cardiac Classification based on Electrocardiogram Images and Signals ABSTRACT. Electrocardiogram is essential for evaluating cardiac function, and many artificial intelligence models for computerised interpretation have been developed. However, these are unsuitable for under-resourced communities that only have access to the paper-based ECG report. To overcome this disadvantage, we propose Cardio Care, a mobile platform that processes electrocardiogram image and signal inputs for abnormality detection, utilising a Vision Transformer to improve imaging recognition and make it more suitable for wider applications. We use three datasets (including public and local datasets) with varying sample sizes and input types to reflect the data in real-world settings. The results demonstrate consistent performance across all datasets, with average macro F1 scores of 65, 99, and 82 on the CPSC, Mendeley, and Tam Duc datasets, respectively, and highlight the promising application of Cardio Care in assisting cardiologists in remote and resource-limited healthcare facilities. This research proposes an alternative approach toward a preprocessing pipeline for ECG signals and images fed into a Vision Transformer-based deep learning network, aiming to improve healthcare access for under-resourced and under-served communities. |
A Tool for Preventing Consanguineous Marriages Using Vietnam's National Residents Database ABSTRACT. Consanguineous marriage is defined as a marriage between a male and female within the same family lineage, not exceeding three generations. Such marriages facilitate the development and manifestation of a range of hereditary diseases caused by recessive genes on chromosomes, which can become apparent in subsequent generations. In Vietnam, consanguineous marriages remain common, particularly among ethnic minorities. Fortunately, Vietnam has collected resident data on 104 million citizens, which has been utilized for economic and social development purposes. This study aims to propose tools for preventing consanguineous marriages by checking whether two individuals planning to marry are related within three generations before registering their marriage, using the national resident database. The study proposes three methods based on three different data structures (simple graph, balanced tree, and hash table). Theoretical and experimental analyses demonstrate that the method based on the hash table provides the fastest verification of whether two individuals planning to marry are related within three generations, outperforming the current methods. |
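Of the three data structures, the hash-table method admits a particularly compact sketch: a dictionary mapping each citizen id to its parents lets us collect each person's ancestors up to three generations and test whether the two sets intersect (the field names and records below are hypothetical, not the national database schema):

```python
# Hedged sketch of the hash-table approach to the three-generation check.
# citizen_id -> (father_id, mother_id); illustrative records only.
parents = {
    "C1": ("F1", "M1"), "C2": ("F1", "M2"),
    "F1": (None, None), "M1": (None, None), "M2": (None, None),
}

def ancestors(cid, depth=3):
    """All ancestors of cid up to `depth` generations (great-grandparents)."""
    found, frontier = set(), {cid}
    for _ in range(depth):
        frontier = {p for c in frontier
                    for p in parents.get(c, (None, None)) if p}
        found |= frontier
    return found

def may_marry(a, b):
    """False if a and b are related within three generations."""
    return not ((ancestors(a) | {a}) & (ancestors(b) | {b}))

print(may_marry("C1", "C2"))  # False: C1 and C2 share father F1
```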
Optimizing Smart Grids with Reinforcement Learning for Enhanced Energy Efficiency ABSTRACT. The transition from traditional, centralized electric grids to smart grids offers numerous opportunities for optimizing energy distribution and consumption. This paper presents a reinforcement learning-based approach for load scheduling in smart grids, aiming to reduce energy loss and enhance grid reliability. By leveraging consumer preferences, the proposed system schedules loads efficiently, thereby minimizing energy loss in transmission lines and reducing peak loads. Our results, tested on simulated grid environments of varying scales, demonstrate significant improvements in energy efficiency, suggesting that reinforcement learning can play a crucial role in the future of smart grid management. |
Benchmarking Real-Time Object Detection: Evaluating YOLO and RT-DETR on Speed, Accuracy, and Efficiency PRESENTER: Cao Vu Bui ABSTRACT. Real-time object detection is indispensable in applications like autonomous vehicles, robotics, and surveillance, where both high accuracy and efficiency are necessary. This paper investigates and benchmarks state-of-the-art models, including YOLO and RT-DETR, on the Pascal VOC dataset using the most important performance metrics: accuracy (mAP), model size, and inference speed (FPS). The results show that YOLOv8x achieves the highest accuracy, with an mAP50:95 of 0.480 at 133 FPS, while RT-DETRv1 (r50vd), although slightly lower in accuracy with an mAP50:95 of 0.475, achieves 100 FPS and offers a competitive trade-off in terms of model complexity. RT-DETRv2 (r34vd) is the smallest model, with an mAP50:95 of 0.468 and an FPS of 118, making it more suitable for resource-constrained environments. Our results highlight the trade-offs between accuracy, model complexity, and speed, providing practical observations for deploying such models in real-time systems. This research provides a useful reference for model selection that takes into account specific performance requirements while balancing accuracy, speed, and resource efficiency. |
Progressive Retention Sampling for Sequence Generation-based Scene Text Spotting ABSTRACT. Sequence generation models have demonstrated promising results in scene text spotting. However, these models face a discrepancy between training and inference phases: during training, ground-truth sequences are provided as input, whereas during inference, this input is replaced by the model’s own predictions. While current sampling strategies have effectively mitigated this discrepancy in various sequence generation tasks such as image captioning and machine translation, they are not directly applicable to scene text spotting due to its unique sequence structure. This paper introduces Progressive Retention Sampling, a novel sampling strategy tailored specifically for sequence generation-based scene text spotting. We evaluate our approach using two scene text spotting models, UNITS and SPTS, conducting experiments on the ICDAR 2015, Total-Text, and VinText datasets. Our results demonstrate that the proposed method outperforms both baselines and conventional sampling strategies. The implementation is publicly available at this GitHub repository. |
Development of an Edge-Computing-Based Intelligent Service Framework for Smart Camera Applications ABSTRACT. Advancements in the Internet of Things (IoT), Artificial Intelligence (AI), and wireless networks have fueled the growth of Artificial Intelligence of Things (AIoT) applications at the edge, addressing challenges such as power consumption, bandwidth constraints, and response latency. Our previous studies enhanced the management of edge AIoT applications, particularly for smart cameras. However, developing these systems remains challenging due to the lack of a unified platform, resulting in high complexity, inconsistency, and increased costs driven by limited code reuse and fragmented development processes. To address these challenges, this paper introduces an edge-computing-based intelligent service framework for smart camera applications. The framework provides a unified platform that enhances consistency, facilitates code reuse, and reduces costs and deployment time. It employs a microservices architecture deployed on a virtualization platform with container orchestration at the edge, allowing customization to meet specific application needs. The system also supports extended services, including cluster management, system monitoring, and DevOps practices for secure continuous integration and continuous deployment (CI/CD). This approach improves flexibility and efficiency in the development, deployment, management, and monitoring of smart camera applications at the edge. Experimental use cases, including face mask recognition and helmet detection applications, validate the proposed system's effectiveness. |
MedCapNet: A Novel Approach to Medical Image Captioning ABSTRACT. Medical image captioning is crucial for automating the generation of accurate textual descriptions for medical images. This paper introduces MedCapNet, a novel encoder-decoder architecture designed to address the challenges associated with this task. The model incorporates a Swin Transformer and Enhancement Encoder, allowing for the efficient extraction and refinement of both patch-level and global-level features from medical images. A Transformer block with a Fusion Module is utilized by the decoder to seamlessly integrate visual and linguistic information. A key innovation is Dual-Scale Masked Multi-Head Self-Attention, which enhances the model's ability to effectively capture long-range dependencies and fine-grained details. Our model was evaluated on ROCO v2, achieving state-of-the-art performance with scores of 0.647, 0.239, and 0.094 for BERTScore, CIDEr, and METEOR, respectively. Our work contributes to the field by introducing a robust architecture that effectively bridges the gap between visual and textual modalities in medical imaging. |
Contrastive Perturbation Enhancement for LLM-Based Machine Translation ABSTRACT. Large language models have been increasingly effective in various NLP tasks, especially the machine translation task. However, these models require substantial computational resources and need to be further fine-tuned on specific training data to achieve better performance. Medium-sized language models often show significantly poorer performance than large language models. Therefore, it is necessary to study methods to solve this problem. In this paper, we propose a method to fine-tune a language model with a size of several billion parameters based on the instructions from a large language model such as GPT-4 through the contrastive learning technique, called CoPE - Contrastive Perturbation Enhancement for LLM-Based Machine Translation. Our proposal consists of three stages: fine-tuning the language model on a parallel dataset; generating entailments as positive and contradictions as negative examples from the training dataset based on a high-performance large language model such as GPT-4; and then using these examples to improve the model through the contrastive learning technique. These examples are evaluated and ranked to increase the influence of high-quality examples. Experimental results show that our proposal with a base model of LLaMA-3.1 with 8B parameters achieves a BLEU score of 35.99, a COMET-22 score of 85.28, and an XCOMET score of 88.90 on the WMT’21 and WMT’22 datasets. This result is competitive with models such as ALMA-13B-R, trained with the contrastive preference optimization technique, and surpasses the GPT-3.5 model. |
Traffic Anomaly Detection under Extreme Weather from Aerial Images ABSTRACT. Video anomaly detection is a crucial task in surveillance systems, significantly enhancing the safety and security of city dwellers. It has attracted considerable interest from researchers in computer vision, machine learning, cyber security, remote sensing, and data mining. However, current state-of-the-art methods for video anomaly detection still encounter significant challenges in extreme weather conditions such as rain, fog, snow, flood, and thunderstorms. These conditions introduce considerable noise and complicate the detection of abnormal events in real-world scenarios, particularly in traffic surveillance videos. Therefore, this study investigates the impact of varying weather conditions on the performance of prominent methods for the detection of traffic anomalies in aerial videos. We perform extensive experiments utilizing six state-of-the-art methods on the standard benchmark dataset in aerial video surveillance, UIT-ADrone. In addition, we provide an in-depth analysis of the practical challenges posed by adverse conditions, including rain and snow. This analysis aims to elucidate the complex scene contexts that may hinder the efficacy of current methods in high-altitude drone videos for traffic surveillance. |
URAG: Implementing a Unified Hybrid RAG for Precise Answers in University Admission Chatbots - A Case Study at HCMUT ABSTRACT. With the rapid advancement of Artificial Intelligence, particularly in Natural Language Processing, Large Language Models (LLMs) have become pivotal in educational question-answering systems, especially university admission chatbots. Concepts such as Retrieval-Augmented Generation (RAG) and other advanced techniques have been developed to enhance these systems by integrating specific university data, enabling LLMs to provide informed responses on admissions and academic counseling. However, these enhanced RAG techniques often involve high operational costs and require the training of complex, specialized modules, which poses challenges for practical deployment. Additionally, in the educational context, it is crucial to provide accurate answers to prevent misinformation, a task that LLM-based systems find challenging without appropriate strategies and methods. In this paper, we introduce the Unified RAG (URAG) Framework, a hybrid approach that significantly improves the accuracy of responses, particularly for critical queries. Experimental results demonstrate that URAG enhances our in-house, lightweight model to perform comparably to state-of-the-art commercial models. Moreover, to validate its practical applicability, we conducted a case study at our educational institution, which received positive feedback and acclaim. This study not only proves the effectiveness of URAG but also highlights its feasibility for real-world implementation in educational settings. |
EPC-YOLOv7: The Proposed One-stage Detector for Aerial Scenario Detection ABSTRACT. Aerial scenario detection has always been an interesting challenge due to its varied object scales and complex backgrounds, which make it hard for models to cover all shapes and patterns and to recognize labels for detected objects. In this study, we contribute an extended network based on the Fasternet block that can be integrated into the ELAN backbone network of the YOLOv7 architecture without harming final performance or requiring any adjustment of the model width and depth scales, while still obtaining fewer parameters, less computation, and faster inference speed. Moreover, the proposed architecture is combined with a Bi-directional Feature Pyramid Network to enhance detection capability, allowing the architecture to detect more small objects. As a result, our proposed method reached 37.8% mAP50 on the VisDrone2019 test-dev set. On the DIOR test set, the proposed model outperformed the YOLOv7 baseline in both performance and complexity. |
A Low-Cost EEG-Based System for Measuring and Forecasting Levels of Alertness with Long Short-Term Memory ABSTRACT. In this paper, we propose a practical and cost-effective system for real-time tracking and predicting alertness levels. Instead of relying on multi-electrode sensors that require complex setups and may cause discomfort, our system uses a compact, single-electrode sensor to capture EEG data. This data is then analyzed by various machine learning models to calculate an Awake Score for users. The Awake Score is also used as input for a forecasting model, which predicts the users’ alertness trends in advance. The forecasting model leverages an advanced deep learning model, Long Short-Term Memory (LSTM), to handle the EEG data and detect intricate temporal patterns in brain activity. Furthermore, we optimize the system to function with minimal electrodes while maintaining high predictive accuracy, providing a feasible solution for real-time detection of fatigue and cognitive load. |
Real-Time Multi-Face Emotion Recognition for Enhancing Student Engagement in Classroom Environments Using Low-Power IoT Devices ABSTRACT. This paper explores the role of emotion recognition in improving student learning by monitoring psychological states in real-time. We introduce a multi-face automated emotion recognition system that analyzes the emotions of multiple students simultaneously, evaluated on the Emotion PTIT dataset of students in a classroom, which consists of 1,500 manually labeled images and 71 videos for the face recognition and classroom detection tasks, respectively. To enhance classroom dynamics, we propose a new formula for evaluating engagement based on emotional data. The system, designed for low-power IoT devices, addresses challenges like load balancing and latency, processing live video feeds to assess classroom interactions. Our prototype can detect up to 50 faces at 25 FPS with an accuracy of 88.68%. |
MEPC: Multi-level Product Category Recognition Image Dataset ABSTRACT. Multi-level product category prediction is a key problem for businesses operating online retail systems. Accurate multi-level prediction spares sellers from filling in product category information, saving time and reducing the cost of listing products online. This remains an open research problem that continues to attract researchers. Deep learning techniques have shown promising results for category recognition problems. A neat and clean dataset is an elementary requirement for building accurate and robust deep-learning models for category prediction. In this article, we introduce a new multi-level product image dataset, called MEPC. The MEPC dataset contains over 164,000 images in processed format. We evaluate the MEPC dataset with popular deep learning models; benchmarking yields a top-1 accuracy of 92.055% with 10 classes and a top-5 accuracy of 57.36% with 1,000 classes. The proposed dataset is suitable for training, validating, and testing hierarchical image classification models to improve multi-level category prediction in online retail systems. Data and code will be released at https://huggingface.co/datasets/sherlockvn/MEPC. |
14:40 | Addressing Ambiguous Queries in Video Retrieval with Advanced Temporal Search ABSTRACT. The increasing volume of multimedia content has intensified the demand for video retrieval systems that can efficiently and accurately extract relevant information from large-scale archives. However, existing methods frequently encounter challenges when dealing with ambiguous queries, particularly those involving complex temporal relationships, often leading to incomplete or suboptimal retrieval results. To address these limitations, we propose a novel multimodal video retrieval system designed to handle a wide range of query types by integrating outputs from multiple search models. A central feature of the system is its advanced temporal search mechanism, which improves ambiguity resolution by conducting additional searches within adjacent video shots, rather than relying solely on chronological order. The effectiveness of the proposed system is demonstrated through its performance in the 2024 Ho Chi Minh AI Challenge. |
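The adjacent-shot idea can be sketched as follows; this is only one plausible reading of the mechanism, with the window size and additive scoring as assumptions:

```python
# Hedged sketch: combine two sub-query result sets over neighboring shots.
def temporal_search(hits_a, hits_b, window=2):
    """Score shot s by its match to sub-query A plus the best match to
    sub-query B among the adjacent shots s-window .. s+window."""
    combined = {}
    for shot, score_a in hits_a.items():
        best_b = max((hits_b.get(shot + d, 0.0)
                      for d in range(-window, window + 1)), default=0.0)
        combined[shot] = score_a + best_b
    return combined

hits_a = {10: 0.8, 42: 0.6}        # shot -> similarity for "event A"
hits_b = {11: 0.7, 90: 0.9}        # shot -> similarity for "event B"
print(temporal_search(hits_a, hits_b))  # shot 10 is boosted by nearby shot 11
```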
15:00 | SnapSeek: A Multimodal Video Retrieval System with Context Awareness for AI Challenge 2024 ABSTRACT. Retrieving information on news is a complex task that demands considerable effort. The AI Challenge stands as one of the pioneering competitions in the nation, initiating exploration in this domain. This study presents the SnapSeek system, a notable entry in this year’s competition. Utilizing advanced tools such as Milvus and Elasticsearch, SnapSeek facilitates searches across various data types, particularly emphasizing vector embeddings and metadata in textual or encoded forms. Building upon the initial SnapSeek version introduced at LSC 2024, this iteration incorporates significant enhancements, including dataset expansion, metadata enrichment, the magic brush feature, temporal search capabilities, human feedback and contextual news extraction. These advancements collectively enhance the efficiency of information retrieval. Furthermore, SnapSeek offers a minimalist and user-friendly interface, rendering it an effective and appropriate system for retrieving news information. |
15:20 | ArtemisSearch: A Multimodal Search Engine for Efficient Video Log-Life Event Retrieval Using Time-Segmented Queries and Vision Transformer-based Feature Extraction PRESENTER: Hoang-Phuc Nguyen ABSTRACT. In this century, search engines have emerged as a crucial component of the technological landscape. Enterprises require a search engine to retrieve specific information within a particular field. However, they face various challenges due to the rapidly increasing volume of data and the need for effective database management to handle diverse data types. Additionally, the search for data is hindered by difficulties in matching queries with key frames or by limitations in understanding query context. In this paper, we introduce ArtemisSearch, a text-based multimodal search engine designed for temporal event retrieval in videos. The proposed system combines an efficient algorithm for Content-Based Image Retrieval (CBIR), using ViT-H/14 and BEiT3 for feature extraction, with an open-source vector database, Milvus, and efficiently retrieves events by leveraging temporal segmentation of queries and embedding matching for Artificial Intelligence (AI) applications. Additionally, we developed a web application that allows end users to easily create temporally-aware descriptive queries, efficiently explore top results, and view precise video previews at relevant timestamps. The ArtemisSearch method represents a significant advancement in temporal video retrieval, with potential applications across diverse fields, leading to a smoother and more accurate video search experience. |
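A minimal sketch of the Milvus side of such a pipeline using pymilvus's MilvusClient (the collection name, 1024-d vectors, and local Milvus Lite file are assumptions for illustration):

```python
# Hedged sketch: store keyframe embeddings in Milvus and search them.
import numpy as np
from pymilvus import MilvusClient

client = MilvusClient("artemis_demo.db")  # Milvus Lite, local file
client.create_collection(collection_name="keyframes", dimension=1024)

frame_embs = np.random.rand(1_000, 1024)  # stand-in keyframe embeddings
client.insert(
    collection_name="keyframes",
    data=[{"id": i, "vector": frame_embs[i].tolist(), "video": "L01_V001"}
          for i in range(len(frame_embs))],
)

query_emb = np.random.rand(1024).tolist()  # stand-in text-query embedding
hits = client.search(collection_name="keyframes", data=[query_emb],
                     limit=5, output_fields=["video"])
print(hits[0])  # top-5 keyframes with distances and video ids
```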
15:40 | KPI: Knowledge-based Processing for Interactive Video Retrieval ABSTRACT. The expansion of internet technologies has led to challenges in managing and retrieving vast amounts of video content, causing information overload. Traditional search systems struggle to rank results based on user intent, prompting the research community to improve retrieval methods, notably through challenges like the Ho Chi Minh AI City Challenge. In this paper, we introduce KPI: Knowledge-based Processing for Interactive Video Retrieval, a novel system that enhances multimedia retrieval efficiency and accuracy. The system integrates text, automatic speech recognition (ASR), optical character recognition (OCR), and temporal retrieval, enabling time-based segment searches. In addition, we develop an improved keyframe selection method and incorporate a dominant-color search strategy, all supported by a user feedback mechanism and a user-centric interface. Our system competed in the 2024 Ho Chi Minh AI City Challenge and achieved a competitive result. |
14:40 | Exemplar-Embed Complex Matrix Factorization with Elastic Net Penalty: An Advanced Approach for Data Representation PRESENTER: Manh Quan Bui ABSTRACT. This paper presents an advanced method for complex matrix factorization, termed exemplar-embed complex matrix factorization with elastic net penalty (ENEE-CMF). The proposed ENEE-CMF integrates both L1 and L2 regularizations on the encoding matrix to enhance the sparsity and effectiveness of the projection matrix. Utilizing Wirtinger's calculus for differentiating real-valued complex functions, ENEE-CMF efficiently addresses complex optimization challenges through gradient descent, enabling more precise adjustments during factorization. Experimental evaluations on the facial expression recognition task demonstrate that ENEE-CMF significantly outperforms traditional non-negative matrix factorization (NMF) and similar complex matrix factorization (CMF) models, achieving superior recognition accuracy. These findings highlight the benefits of incorporating elastic net regularization into complex matrix factorization for handling challenging recognition tasks. |
15:00 | Modeling Information Diffusion in Bibliographic Networks using Pretopology ABSTRACT. In this research, we propose a novel approach to model information diffusion on bibliographic networks using pretopology theory. We propose a pretopological independent cascade model, a variation of the independent cascade model (IC), named Preto_IC. We apply pretopology to model the structure of heterogeneous bibliographic networks, since it is a powerful mathematical tool for complex network analysis. The highlights of Preto_IC are that the propagation process is simulated on multiple relations, and the concept of elementary closed subset is applied to capture the seed set. In the first step, we construct a pretopological space to illustrate a heterogeneous bibliographic network. In this space, we define a strong pseudo-closure function to capture the neighborhood set of a set A. Next, we propose a new method to choose the seed set based on the elementary closed subsets. Finally, we simulate Preto_IC with the seed set from step (2) and, at each propagation step t, determine the neighborhood set for infection based on the pseudo-closure function defined in step (1). We experiment on three real datasets and demonstrate the effectiveness of Preto_IC compared with the IC model using existing seed-set selection methods. |
15:20 | Optimizing Credit Scoring Models for Decentralized Financial Applications ABSTRACT. Decentralized Finance (DeFi), a rapidly evolving ecosystem of blockchain-based financial applications, has attracted substantial capital in recent years. Lending protocols, which provide deposit and loan services similar to traditional banking, are central to DeFi. However, the lack of credit scoring in these protocols creates several challenges. Without accurate risk assessment, lending protocols impose higher interest rates to offset potential losses, negatively affecting both borrowers and lenders. Furthermore, the absence of credit scoring reduces transparency and fairness, treating all borrowers equally regardless of their credit history, discouraging responsible financial behavior and hindering sustainable growth. This paper introduces credit scoring models for crypto wallets in DeFi. Our contributions include: (1) developing a comprehensive dataset with 14 features from over 250,000 crypto wallets; and (2) constructing four credit scoring models based on Stochastic Gradient Descent, Adam, Genetic, and Multilayer Perceptron algorithms. These findings offer valuable insights for improving DeFi lending protocols and mitigating risks in decentralized financial ecosystems. |
15:40 | A Historical GPS Trajectory-Based Framework for Predicting Bus Travel Time ABSTRACT. Accurate bus travel time information helps passengers plan their trips more effectively and can potentially increase ridership. However, cyclical factors (e.g. time of day, weather conditions, and holidays), unpredictable factors (e.g. incidents and abnormal weather), and other complex factors (e.g. dynamic traffic conditions, dwell times, and variations in travel demand) make accurate bus travel time prediction challenging. This study aims to achieve accurate travel time prediction. To accomplish this, we have developed a bus travel time prediction framework based on similar historical Global Positioning System (GPS) trajectory data and an information decay technique. The framework first divides the predicted route into segments, integrating GPS trajectory data with road map processing techniques to accurately map the bus’s position and estimate its arrival time at bus stops. Then, instead of relying on a single historical trajectory that best matches the predicted bus journey, the framework samples a set of similar trajectories as the basis for travel time estimation. Finally, the information decay technique is applied to construct a bus travel time prediction interval. We conduct comprehensive experiments using real bus trajectory data collected from Kandy, Sri Lanka, and Ho Chi Minh City, Vietnam, to validate our ideas and evaluate the proposed framework. The experimental results show that the proposed prediction framework significantly improves accuracy compared to baseline approaches by considering factors such as bus stops, time of day and day of week. |
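The information-decay step might look like the following sketch; the exponential half-life weighting and the normal-style interval are assumptions for illustration, since the paper's exact decay function is not given here:

```python
# Hedged sketch: decay-weighted travel-time interval from similar past trips.
import numpy as np

def predict_interval(travel_times, ages_days, half_life=7.0, z=1.96):
    """Weighted mean +/- z * weighted std; newer trajectories weigh more."""
    w = 0.5 ** (np.asarray(ages_days) / half_life)   # exponential decay
    t = np.asarray(travel_times, dtype=float)
    mean = np.average(t, weights=w)
    std = np.sqrt(np.average((t - mean) ** 2, weights=w))
    return mean - z * std, mean + z * std

times = [620.0, 580.0, 700.0, 610.0]   # seconds, similar historical trips
ages = [1.0, 3.0, 10.0, 30.0]          # days since each trip was observed
print(predict_interval(times, ages))
```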