Language Guided Graph Transformer for Skeleton Action Recognition
ABSTRACT. The Transformer is a neural network architecture based on a self-attention mechanism; it was developed primarily for natural language processing and is now being introduced to the computer vision domain. However, the Transformer has not been widely applied to human action recognition. Action recognition is typically treated as a single classification task, and existing recognition algorithms do not fully leverage the semantic relationships within actions. In this paper, we propose a new method called LGGT that combines textual information with a Graph Transformer to incorporate semantic guidance into skeleton-based action recognition. LGGT employs a Graph Transformer as the encoder for skeleton data to extract feature representations and effectively capture long-distance dependencies between joints. Additionally, LGGT utilizes a large-scale language model as a knowledge engine to generate textual descriptions specific to different actions, capturing the semantic relationships between actions and improving the model's ability to understand, recognize, and classify them accurately. We extensively evaluate the proposed method on the Smoking, Kinetics-Skeleton, and NTU RGB+D action recognition datasets. The experimental results demonstrate significant performance improvements on these datasets, and the ablation study shows that introducing semantic guidance further enhances the model's performance.
Gradient Coupled Flow: Performance Boosting on Network Pruning by Utilizing Implicit Loss Decrease
ABSTRACT. Network pruning prior to training makes generalization more challenging than ever, while recent studies mainly focus on the trainability of the pruned networks in isolation. This paper explores a new perspective: the implicit decrease in loss on the data yet to be trained that is caused by one-batch training in each round, whose first-order approximation we term gradient coupled flow. We then present a criterion sensitive to gradient coupled flow (GCS), which is hypothesized to capture the weights most sensitive to performance boosting at initialization. Evaluations are conducted on multiple datasets, including CIFAR-10/100, Tiny-ImageNet, and MNIST, with VGG and ResNet architectures. GCS achieves decent performance in both single pruning and iterative pruning before training. In addition, a variant of GCS called GCS-Group performs better at low compression ratios, further confirming the role of the implicit loss decrease. Interestingly, our exploration shows a linear correlation between generalization and implicit-loss-decrease-based measurements for previous works as well as GCS, which describes the causes of accuracy fluctuation in a fine-grained manner.
Learnable Color Image Zero-Watermarking Based on Feature Comparison
ABSTRACT. Zero-watermarking is one of the solutions for protecting the copyright of color images without tampering with them. Existing zero-watermarking algorithms either rely on static classical techniques or employ pre-trained deep learning models, which limits the adaptability of zero-watermarking to complex and dynamic environments; these algorithms are prone to failure when encountering novel or complex noise. To address this issue, we propose a self-supervised anti-noise learning color image zero-watermarking method that leverages feature matching to achieve lossless protection of images. In our method, we use a learnable feature extractor and a baseline feature extractor and compare the features extracted by the two. Moreover, we introduce a combined weighted noise layer to enhance robustness against combined noise attacks. Extensive experiments show that our method outperforms other methods in terms of effectiveness and efficiency.
Assessing and Enhancing LLMs: A Physics and History Dataset and One-More-Check Pipeline Method
ABSTRACT. Large language models (LLMs) demonstrate significant capabilities in traditional natural language processing (NLP) tasks and many examinations. However, there are few evaluations concerning specific subjects in the Chinese educational context. This study, focusing on secondary-school physics and history, explores the potential and limitations of LLMs in Chinese education. Our contributions are as follows: a PH dataset is established, covering secondary-school physics and history in Chinese and comprising thousands of multiple-choice questions; three prevalent LLMs (ChatGPT, GPT-3, and ChatGLM) are evaluated on the PH dataset; a new prompting method called One-More-Check (OMC) is proposed to enhance the logical reasoning capacity of LLMs; finally, the three LLMs are set to take an actual secondary-school history exam. Our findings suggest that the OMC method improves the performance of LLMs on logical reasoning and that LLMs underperform the average level of age-appropriate students on the history exam. All datasets, code, and evaluation results are available at https://github.com/hcffffff/PH-dataset-OMC.
Sub-Instruction and Local Map Relationship Enhanced Model for Vision and Language Navigation
ABSTRACT. In this paper, unlike most methods in vision-and-language navigation, which rely primarily on vision-language cross-modal attention modeling and the agent's egocentric observations, we establish connections between sub-instructions and local maps to elaborately encode environment information and learn a path responsible for the whole instruction rather than only the ultimate goal. We first obtain a local semantic map by ground-projecting the RGB semantic segmentation map and the depth map. Each segmented sub-instruction is passed through the sub-instruction attention module and then taken as input, together with the local map, to the cross-modal attention module. Finally, a set of waypoints is predicted by the navigation module until all sub-instructions of the long instruction have been executed, which completes an episode. Comparison experiments and ablation studies on the VLN-CE dataset show that our method outperforms most methods and has a strong ability to predict the whole path.
STFormer: Cross-Level Feature Fusion in Object Detection
ABSTRACT. Object detection algorithms can benefit from multi-level features, which encompass both high-level semantic information and low-level location details. However, existing detection methods face numerous challenges in effectively utilizing these multi-level features. Most existing techniques rely on simplistic operations such as feature addition or concatenation to fuse multi-level features and therefore fail to suppress redundant information effectively, so their performance is significantly constrained in complex scenarios. To address these limitations, this paper presents a novel feature extraction network that incorporates joint modeling and multi-dimensional feature fusion. Specifically, the network partitions the features of each level into tiles and employs hybrid self-attention mechanisms to extract these grouped features more comprehensively. Additionally, a hybrid cross-attention-based approach regulates the transmission proportion of each grouped feature, facilitating the seamless integration of high-level semantic features obtained from deep encoders with the low-level position details retained by the pipeline. Consequently, the network effectively suppresses noise and improves performance. Experimental evaluation on the MS COCO dataset demonstrates the effectiveness of the proposed approach, which achieves an accuracy of 54.3%. Notably, the algorithm shows exceptional performance in detecting small-scale targets, surpassing other state-of-the-art technologies.
Social-CVAE: Pedestrian Trajectory Prediction using Conditional Variational Auto-Encoder
ABSTRACT. Pedestrian trajectory prediction is a fundamental task in applications such as autonomous driving, robot navigation, and advanced video surveillance. Since human motion behavior is inherently unpredictable, resembling a process of decision-making and intrinsic motivation, it naturally exhibits multimodality and uncertainty. Therefore, predicting multi-modal future trajectories in a reasonable manner poses challenges. The goal of multi-modal pedestrian trajectory prediction is to forecast multiple socially plausible future motion paths based on the historical motion paths of agents. In this paper, we propose a multi-modal pedestrian trajectory prediction method based on conditional variational auto-encoder. Specifically, the core of the proposed model is a conditional variational auto-encoder architecture that learns the distribution of future trajectories of agents by leveraging random latent variables conditioned on observed past trajectories. The encoder models the channel and temporal dimensions of historical agent trajectories sequentially, incorporating channel attention and self-attention to dynamically extract spatio-temporal features of observed past trajectories. The decoder is bidirectional, first estimating the future trajectory endpoints of the agents and then using the estimated trajectory endpoints as the starting position for the backward decoder to predict future trajectories from both directions, reducing cumulative errors over longer prediction ranges. The proposed model is evaluated on the widely used ETH/UCY pedestrian trajectory prediction benchmark and achieves state-of-the-art performance.
Distributed Training of Deep Neural Networks: Convergence and Case Study
ABSTRACT. Deep neural network training on a single machine has become increasingly difficult due to a lack of computational power. Fortunately, distributed training of neural networks can be performed with model and data parallelism and sub-network training. This paper introduces a mathematical framework to study the convergence of distributed asynchronous training of deep neural networks with a focus on sub-network training. This article also studies the convergence conditions in synchronous and asynchronous modes. An asynchronous and lock-free training version of the sub-network training is proposed to validate the theoretical study. Experiments were conducted on two well-known public datasets, namely Google Speech and MaFaulDa, using the Jean Zay supercomputer of GENCI. The results indicate that the proposed asynchronous sub-network training approach, with 64 GPUs, achieves faster convergence time and better generalization than the synchronous approach.
Improving Handwritten Mathematical Expression Recognition via an Attention Refinement Network
ABSTRACT. Handwritten mathematical expression recognition (HMER), typically regarded as a sequence-to-sequence problem, has made great progress in recent years, with RNN-based models widely adopted. Although Transformer-based models have demonstrated success in many areas, their performance in HMER is not satisfactory owing to issues with the standard attention mechanism. We therefore propose an attention refinement network in the Transformer framework to improve HMER performance. We first adopt shifted window attention (SWA) from the Swin Transformer to capture spatial context of the whole image. Moreover, we propose a refined coverage attention (RCA) to overcome the lack of coverage in the standard attention mechanism, where we utilize a convolutional kernel with a gating function to obtain coverage features. With the proposed RCA, we refine coverage attention to alleviate the problem of repeatedly attending to the same areas in long sequences. In addition, we utilize a pyramid data augmentation method to generate mathematical expression images at multiple resolutions and enhance model generalization. We evaluate the proposed attention refinement network on the CROHME 2014/2016/2019 HMER benchmark datasets, and extensive experiments demonstrate its effectiveness.
ABSTRACT. JPEG compression brings artifacts into the compressed image, which not only degrade visual quality but also affect the performance of other image processing tasks. To address this issue, many learning-based compression artifact removal methods have been developed in recent years, with remarkable success. However, existing learning-based methods generally exploit only spatial information and lack exploration of frequency-domain information. Exploring frequency-domain information is critical because JPEG compression is actually performed in the frequency domain using the Discrete Cosine Transform (DCT). To effectively leverage information from both the spatial and frequency domains, we propose a novel Dual-Domain Learning Network for JPEG artifacts removal (D2LNet). Our approach first transforms the spatial-domain image to the frequency domain by the fast Fourier transform (FFT). We then introduce two core modules, an Amplitude Correction Module (ACM) and a Phase Correction Module (PCM), which facilitate interactive learning of spatial and frequency-domain information. Extensive experiments on color and grayscale images clearly demonstrate that our method achieves better results than previous state-of-the-art methods. Code will be available at https://github.com/YeunkSuzy/JPEG.
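The dual-domain idea rests on decomposing a feature map into FFT amplitude and phase before correcting each separately. Below is a minimal sketch of that decomposition and its inverse only, not of the ACM/PCM modules themselves; the tensor shapes and the orthonormal FFT normalization are assumptions.

```python
import torch

def to_freq(x):
    """Split a spatial tensor (B, C, H, W) into FFT amplitude and phase."""
    spec = torch.fft.fft2(x, norm="ortho")
    return spec.abs(), spec.angle()

def to_spatial(amplitude, phase):
    """Recombine (possibly corrected) amplitude and phase back into the spatial domain."""
    spec = torch.polar(amplitude, phase)          # amplitude * exp(i * phase)
    return torch.fft.ifft2(spec, norm="ortho").real

x = torch.randn(1, 3, 64, 64)
amp, pha = to_freq(x)
x_rec = to_spatial(amp, pha)                      # reconstructs x up to numerical error
print(torch.allclose(x, x_rec, atol=1e-5))
```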
Dual Channel Graph Neural Network Enhanced by External Affective Knowledge for Aspect Level Sentiment Analysis
ABSTRACT. Aspect-level sentiment analysis is a prominent technology in natural language processing (NLP) that analyzes the sentiment polarity of aspect words in a text. Despite its long history of development, current methods still have some shortcomings. Mainly, they lack the integration of external affective knowledge, which is crucial for allocating attention to aspect-related words in syntactic and semantic information processing. Additionally, the synergy between syntactic and semantic information is often neglected, with most approaches focusing on only one dimension. To address these issues, we propose a knowledge-enhanced dual-channel graph neural network. Our model incorporates external affective knowledge into both the semantic and syntactic channels in different ways, then utilizes a dynamic attention mechanism to fuse information from these channels. We conducted experiments on the SemEval 2014, 2015, and 2016 datasets, and the results show significant improvements compared to existing methods. Our approach bridges the gaps in current techniques and enhances performance in aspect-level sentiment analysis.
Reversible data hiding based on adaptive embedding with local complexity
ABSTRACT. In recent years, most reversible data hiding (RDH) algorithms have considered the impact of texture information on embedding performance. The distortion caused by embedding secret data in a smooth region of the image is much smaller than in a non-smooth region, because embedding in the smooth region corresponds to fewer invalid shifting pixels (ISPs) in histogram shifting. However, in existing schemes the local complexity is not calculated precisely enough, which results in inaccurate texture division and limits the reduction of distortion. Therefore, a new RDH scheme based on adaptive embedding with local complexity (AELC) is proposed to improve embedding performance effectively. Specifically, the cover image is first divided into two subsets by a checkerboard pattern. Then the local complexity of each pixel is computed from the correlation between adjacent pixels (CBAP) to improve calculation accuracy. Finally, secret data are adaptively and preferentially embedded into the regions with lower local complexity in each subset. Experimental results show that the proposed algorithm performs best in terms of invalid shifting pixels, maximum embedding capacity (EC), and peak signal-to-noise ratio (PSNR) compared with state-of-the-art RDH methods.
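The abstract does not give the exact CBAP formula, so the sketch below is only an assumed illustration of the two ingredients it names: the checkerboard division of the cover image and a neighbour-based local-complexity score used to order embedding positions (the four-neighbour variance is an assumption).

```python
import numpy as np

def checkerboard_split(img):
    """Divide a grayscale cover image into the two checkerboard subsets."""
    rows, cols = np.indices(img.shape)
    mask = (rows + cols) % 2 == 0
    return mask, ~mask                      # "black" and "white" cells

def local_complexity(img):
    """Assumed complexity score: variance of the four cross neighbours of each pixel."""
    p = np.pad(img.astype(np.float64), 1, mode="edge")
    neigh = np.stack([p[:-2, 1:-1], p[2:, 1:-1], p[1:-1, :-2], p[1:-1, 2:]])
    return neigh.var(axis=0)                # low values = smooth region, embed there first

img = np.random.randint(0, 256, (8, 8), dtype=np.uint8)
black, white = checkerboard_split(img)
order = np.argsort(local_complexity(img)[black])   # embedding order inside one subset
print(order[:5])
```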
Graph-based Vehicle Keypoint Attention Model for Vehicle Re-identification
ABSTRACT. Vehicle re-identification is the task of locating a particular vehicle image among a set of images of vehicles captured from different cameras. In recent years, many methods focus on learning distinctive global features by incorporating keypoint details to improve re-identification accuracy. However, these methods do not take into account the relation between different keypoints and the relation between keypoints and the overall vehicle. To address this limitation, we propose the Graph-based Vehicle Keypoint Attention (GVKA) model that integrates keypoint features and two relation components to yield robust and discriminative representations of vehicle images. The model extracts keypoint features using a pre-trained model, models the relation among keypoint features using a Graph Convolutional Network, and employs cross-attention to highlight important areas of the vehicle and establish the relation between keypoint features and the overall vehicle. Our experimental results on three large-scale datasets demonstrate the effectiveness of our proposed method.
Multi-Mobile Object Motion Coordination with Reinforcement Learning
ABSTRACT. Multi-mobile Object Motion Coordination (MOMC) refers to the task of controlling multiple moving objects to travel from their respective starting stations to their terminal stations. In this task, the objects are expected to complete their travel in a shorter amount of time with no collisions. Multi-mobile object motion coordination plays an important role in application scenarios such as production, processing, warehousing, and logistics. The problem can be modeled as a Markov decision process and solved by reinforcement learning. Current mainstream methods suffer from long training times and poor policy stability, both of which decrease the reliability of practical applications. To address these problems, we introduce a State with Time (ST) model and a Dynamically Update Reward (DUR) model. Experimental results show that the ST model enhances the stability of learned policies, the DUR model improves training efficiency and policy stability, and the motion coordination solutions obtained by our algorithm are superior to those of similar algorithms.
Multi-Feature Integration Neural Network with Two-Stage Training for Short-Term Load Forecasting
ABSTRACT. Accurate short-term load forecasting (STLF) helps the power sector conduct generation and transmission efficiently and maintain stable grid operation while reducing energy waste, thus supporting sustainable development. However, short-term load forecasting is subject to complex temporal dynamics and many environmental variables, which cause considerable difficulties for the power sector; it is therefore an essential yet challenging task. In this paper, we propose a short-term load forecasting model that integrates historical load, environmental variables, and temporal information, named TCN-GRU-TEmb. Our method utilizes a temporal convolutional network (TCN) to capture the regularity of historical loads and a gated recurrent unit (GRU) to extract useful features from environmental variables. As for temporal information, we propose a temporal embedding (TEmb) self-learning module, which can automatically capture the power consumption patterns of different time periods. We further propose a two-stage training algorithm to facilitate model convergence. Comparison experiments show that our model outperforms all baselines, with average reductions in MAE, MAPE, and RMSE of 8.24%, 9.23%, and 7.48%, respectively. A further experiment verifies the effectiveness of the proposed temporal embedding method and two-stage training algorithm.
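As an illustration of the temporal-embedding idea (one learnable vector per time slot), a minimal sketch follows. The slot granularity (hour of day, day of week), the embedding size, and the additive fusion are assumptions; the paper's full TEmb module and two-stage training are not reproduced here.

```python
import torch
import torch.nn as nn

class TemporalEmbedding(nn.Module):
    """Learnable embeddings for hour-of-day and day-of-week slots (assumed granularity)."""
    def __init__(self, dim=16):
        super().__init__()
        self.hour = nn.Embedding(24, dim)
        self.weekday = nn.Embedding(7, dim)

    def forward(self, hour_idx, weekday_idx):
        # Both index tensors have shape (batch, seq_len); the sum is the slot representation.
        return self.hour(hour_idx) + self.weekday(weekday_idx)

emb = TemporalEmbedding(dim=16)
hours = torch.randint(0, 24, (4, 96))
days = torch.randint(0, 7, (4, 96))
print(emb(hours, days).shape)   # torch.Size([4, 96, 16])
```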
New predefined-time stability theorem and applications to the fuzzy stochastic memristive neural networks with impulsive effects
ABSTRACT. This paper mainly investigates the problem of achieving predefined-time synchronization for fuzzy memristive neural networks with both impulsive effects and stochastic disturbances. First, because existing predefined-time stability theorems can hardly be applied to systems with impulsive effects, a new predefined-time stability theorem is proposed to solve the stability problem of such systems. The theorem is flexible and can guide impulsive stochastic fuzzy memristive neural network models to achieve predefined-time synchronization. Second, because the sign function can easily cause the chattering phenomenon, leading to undesirable effects such as decreased synchronization performance, a novel and effective feedback controller without the sign function is designed to eliminate this chattering. In addition, the paper overcomes the combined influence of fuzzy logic, memristive state dependence, and stochastic disturbance, and gives effective conditions ensuring that the two stochastic systems achieve predefined-time synchronization. Finally, the effectiveness of the proposed theoretical results is demonstrated in detail through a numerical simulation.
I-RAFT: Optical Flow Estimation Model Based on Multi-scale Initialization Strategy
ABSTRACT. Optical flow estimation is a fundamental task in computer vision, and recent advances in deep learning networks have led to significant performance improvements. However, existing models that employ recurrent neural networks to update the optical flow from an initial value of zero suffer from instability and slow training. To address this, we propose a simple yet effective optical flow initialization module as the initialization stage, leading to an optical flow estimation model named I-RAFT. Our approach draws inspiration from other successful computer vision algorithms for tackling the multi-scale problem. By extracting initial optical flow values from the 4D cost volume and employing a voting module, we achieve the initialization. Importantly, the initialization module can be seamlessly integrated into other optical flow estimation models. Additionally, we introduce a novel multi-scale extraction module for capturing context features. Extensive experiments demonstrate the simplicity and effectiveness of the proposed model: I-RAFT achieves state-of-the-art performance on the Sintel dataset and the second-best performance on the KITTI dataset, with 24.48% fewer parameters than the previous state-of-the-art MatchFlow model. We have made our code publicly available to facilitate further research and development.
Educational Pattern Guided Self-Knowledge Distillation for Siamese Visual Tracking
ABSTRACT. Existing Siamese-based trackers divide visual tracking into two stages, i.e., feature extraction (backbone subnetwork) and prediction (head subnetwork).
However, they mainly implement task-level supervision (classification and regression) and barely consider feature-level supervision in the knowledge learning process, which can result in deficient knowledge interaction among the features of the tracker's targets and in background interference during online tracking.
To solve these issues, this paper proposes an educational pattern-guided self-knowledge distillation methodology that guides Siamese-based trackers to learn feature knowledge by themselves and can serve as a generic training protocol to improve any Siamese-based tracker. Our key insight is to utilize two educational self-distillation patterns, i.e., focal self-distillation and discriminative self-distillation, to educate the tracker to possess self-learning ability.
The focal self-distillation pattern educates the tracking network to focus on valuable pixels and channels by decoupling the spatial learning and channel learning of target features.
The discriminative self-distillation pattern aims at maximizing the discrimination between foreground and background features, ensuring that the trackers are unaffected by background pixels.
As one of the first attempts to introduce self-knowledge distillation into the visual tracking field, our method is effective and efficient and has a strong generalization ability, which might be instructive for other research. Codes and data are publicly available.
Removing Double Descent with Data-dependent Regularization under Non-Asymptotic View
ABSTRACT. The surprising double-descent phenomenon has drawn public attention in recent years: the prediction error rises and drops as we increase either the sample size or the model size. This paper shows, both theoretically and empirically, that these phenomena can be alleviated by using optimal dropout in the linear regression model ${y}=X{\beta}^0+{\epsilon}$ with $X\in\mathbb{R}^{n\times p}$. We obtain the optimal dropout hyperparameter by estimating the ground truth ${\beta}^0$ with the generalized ridge-type estimator $\hat{{\beta}}=(X^TX+\alpha\cdot\mathrm{diag}(X^TX))^{-1}X^T{y}$.
Moreover, we empirically show that optimal dropout can achieve a monotonic test error curve in nonlinear neural networks on CIFAR-10. Our results suggest considering dropout for risk-curve scaling when the peak phenomenon is met. In addition, we explain why previous deep learning models often do not encounter double-descent scenarios: a usual regularization approach such as dropout is already applied. To the best of our knowledge, this paper is the first to analyze the relationship between data-dependent regularization and double descent.
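The quoted estimator can be computed directly; below is a minimal NumPy sketch on synthetic data, with an arbitrary value of alpha rather than the optimal hyperparameter derived in the paper.

```python
import numpy as np

def dropout_ridge_estimator(X, y, alpha):
    """Generalized ridge estimator beta_hat = (X^T X + alpha * diag(X^T X))^{-1} X^T y."""
    gram = X.T @ X
    penalty = alpha * np.diag(np.diag(gram))     # data-dependent diagonal penalty
    return np.linalg.solve(gram + penalty, X.T @ y)

rng = np.random.default_rng(0)
n, p = 50, 80                                    # over-parameterized regime (p > n)
X = rng.standard_normal((n, p))
beta0 = rng.standard_normal(p)
y = X @ beta0 + 0.1 * rng.standard_normal(n)
beta_hat = dropout_ridge_estimator(X, y, alpha=0.5)
print(np.linalg.norm(beta_hat - beta0))          # estimation error for this alpha
```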
FE-YOLOv5: Improved YOLOv5 Network for Multi-scale Drone-captured Scene Detection
ABSTRACT. Due to the varying angles and heights of UAV shooting, the complex shooting environments, and the predominantly small targets, object detection in drone-captured scenes remains challenging. In this study, we present a highly precise technique for identifying objects in scenes captured by drones, which we refer to as FE-YOLOv5. First, to optimize cross-scale feature fusion and maximize the utilization of shallow feature information, we propose a novel feature pyramid model called MSF-BiFPN. Furthermore, to improve the fusion of features at different scales and boost their representational power, we propose an adaptive attention module. Moreover, we propose a feature enhancement module that strengthens high-level features before feature fusion; it minimizes feature loss during the fusion process and ultimately improves detection accuracy. Finally, the normalized Wasserstein distance is used as the metric to enhance the model's sensitivity and accuracy in detecting small targets. Experimental results of FE-YOLOv5 on the VisDrone dataset show that mAP@0.5 increases by 7.8% and mAP@0.5:0.95 by 5.7%. At the same time, the results of the model trained at 960×960 image resolution are better than those of current YOLO-series models, with mAP@0.5 reaching 56.3%. These experiments demonstrate that the FE-YOLOv5 model effectively enhances the accuracy of object detection in UAV-captured scenes.
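For reference, the normalized Wasserstein distance is commonly defined by modelling each box as a 2D Gaussian and exponentiating the negative 2-Wasserstein distance between the two Gaussians. The sketch below follows that common formulation and may differ in detail from the variant used in FE-YOLOv5; the constant c is dataset-dependent.

```python
import numpy as np

def normalized_wasserstein_distance(box_a, box_b, c=12.8):
    """NWD between two boxes given as (cx, cy, w, h), following the common
    Gaussian-modeling formulation; c is a dataset-dependent constant."""
    va = np.array([box_a[0], box_a[1], box_a[2] / 2.0, box_a[3] / 2.0])
    vb = np.array([box_b[0], box_b[1], box_b[2] / 2.0, box_b[3] / 2.0])
    w2 = np.linalg.norm(va - vb)          # 2-Wasserstein distance between the Gaussians
    return np.exp(-w2 / c)                # maps to (0, 1]; higher means more similar

print(normalized_wasserstein_distance((10, 10, 4, 4), (11, 10, 4, 4)))
```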
Attention-Based Deep Convolutional Network for Speech Recognition under Multi-scene Noise Environment
ABSTRACT. One goal of Automatic Speech Recognition (ASR) is to convert human speech commands into computer-readable input, but noise interference remains an important yet challenging problem. By capturing speech context, deep neural networks have proven superior at identifying specific command words. Existing deep neural networks generally rely on a two-stage structure in which separate layers are used to identify the noisy environment and the speech, respectively, which makes the model large and complex. In addition, their performance generally drops dramatically in unknown noisy environments, which restricts the generalization of these methods. In this paper, we propose a novel deep framework, named Adaptive-Attention and Joint Supervision (AJS), to circumvent the above challenges. Specifically, we use the spectrogram as input. Adaptive attention is employed to refine the features from the noisy environment and remove the limitation of the noise scene. Furthermore, a combination of coarse-to-fine losses is adopted to process difficult words step by step. Extensive experiments on four public datasets demonstrate the robustness of our method to various noise environments and its superior ASR accuracy. Code is available at: https://github.com/zhishulin/bajs.
Exploring the Integration of Large Language Models into Automatic Speech Recognition Systems: An Empirical Study
ABSTRACT. This paper explores the integration of Large Language Models (LLMs) into Automatic Speech Recognition (ASR) systems to improve transcription accuracy. The increasing sophistication of LLMs, with their in-context learning capabilities and instruction-following behavior, has drawn significant attention in the field of Natural Language Processing (NLP). Our primary focus is to investigate the potential of using an LLM's in-context learning capabilities to enhance the performance of ASR systems, which currently face challenges such as ambient noise, speaker accents, and complex linguistic contexts. We designed a study using the Aishell-1 and LibriSpeech datasets, with ChatGPT and GPT-4 serving as benchmarks for LLM capabilities. Unfortunately, our initial experiments did not yield promising results, indicating the complexity of leveraging LLMs' in-context learning for ASR applications. Despite further exploration with varied settings and models, the corrected sentences from the LLMs frequently resulted in higher word error rates (WER), demonstrating the limitations of LLMs in speech applications. This paper provides a detailed overview of these experiments, their results, and their implications, establishing that using LLMs' in-context learning capabilities to correct potential errors in speech recognition transcriptions remains a challenging task at the current stage.
ABSTRACT. Lip reading is a fine-grained video understanding task that endeavors to recognize speech content by analyzing the movement of the speaker's mouth. In recent times, 3D-ResNet-18 has become the favored front-end network for most of the lip reading methods. However, a single 3D CNN layer within the 3D-ResNet-18-based front-end network might not have enough representation power to extract temporal features. To address this issue, we propose the incorporation of Temporal Adaptive Module (TAM) into the front-end network of lip reading methods. TAM is an uncomplicated temporal module that consists of two branches: a local branch that provides location-sensitive information, and a global branch that focuses on capturing long-term temporal dependencies. This combination of branches helps capture complex temporal structures and facilitates robust temporal modeling. Taking global and local relationships into consideration explicitly improves the feature representation. It can be easily used in classical building blocks of networks. We conducted ablation studies to determine the optimal TAM structure and compared our results with various related approaches on the LRW dataset. Our experimental outcomes prove the superiority of our approach.
CInvISP: Conditional Invertible Image Signal Processing Pipeline
ABSTRACT. Standard RGB (sRGB) images processed by the image signal processing (ISP) pipeline of digital cameras have a nonlinear relationship with the scene irradiance. Therefore, low-level vision tasks that work best in a linear color space are not well suited to being carried out in the sRGB color space. To address this issue, this paper proposes an approach called CInvISP to provide a bidirectional mapping between the nonlinear sRGB and linear CIE XYZ color spaces. To ensure a fully invertible ISP, the basic building blocks in our framework adopt the structure of an invertible neural network. As camera-style information is embedded in sRGB images, it must be completely removed during backward mapping and properly incorporated during forward mapping. To this end, a conditional vector is extracted from the sRGB input and inserted into each invertible building block. Experiments show that, compared to other mapping approaches, CInvISP achieves a more accurate bidirectional mapping between the two color spaces. Moreover, it is also verified that such a precise bidirectional mapping facilitates low-level vision tasks, including image denoising and retouching.
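Invertible building blocks of this kind are typically realized with coupling layers whose scale and shift networks receive the conditioning vector as an extra input. The sketch below is a generic conditional affine coupling block on flat feature vectors, offered only as an assumed illustration of the invertible-plus-conditioning principle, not the exact CInvISP block.

```python
import torch
import torch.nn as nn

class ConditionalCoupling(nn.Module):
    """Generic conditional affine coupling block: invertible by construction,
    with the scale/shift network conditioned on an external vector."""
    def __init__(self, channels, cond_dim, hidden=64):
        super().__init__()
        self.half = channels // 2
        self.net = nn.Sequential(
            nn.Linear(self.half + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (channels - self.half)),
        )

    def forward(self, x, cond):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(torch.cat([x1, cond], dim=1)).chunk(2, dim=1)
        y2 = x2 * torch.exp(torch.tanh(s)) + t        # tanh keeps the scale bounded
        return torch.cat([x1, y2], dim=1)

    def inverse(self, y, cond):
        y1, y2 = y[:, :self.half], y[:, self.half:]
        s, t = self.net(torch.cat([y1, cond], dim=1)).chunk(2, dim=1)
        x2 = (y2 - t) * torch.exp(-torch.tanh(s))
        return torch.cat([y1, x2], dim=1)

block = ConditionalCoupling(channels=6, cond_dim=8)
x, cond = torch.randn(2, 6), torch.randn(2, 8)
print(torch.allclose(x, block.inverse(block(x, cond), cond), atol=1e-5))
```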
Reimagining China-US Relations Prediction: A Multi-Modal, Knowledge-Driven Approach with KDSCINet
ABSTRACT. Statistical models and data-driven models have achieved remarkable results in international relation forecasting. However, most of these models share several drawbacks: (i) they rely on large amounts of expert knowledge, limiting their objectivity, applicability, usability, interpretability, and sustainability; (ii) they can only use structured unimodal data or cannot make full use of multimodal data. To address these two problems, we propose a knowledge-driven neural network architecture that conducts Sample Convolution and Interaction, named KDSCINet, for China-US relation forecasting. First, we filter events pertaining to China-US relations from the GDELT database. Then, we extract text descriptions and images from news articles and utilize the fine-tuned pre-trained model MKGformer to obtain embeddings. Finally, we connect the textual and image embeddings of each event with its structured event value in the GDELT database through a multi-head attention mechanism to generate time series data, which is then fed into KDSCINet for China-US relation forecasting. Our approach enhances prediction accuracy by establishing a knowledge-driven temporal forecasting model that combines structured data, textual data, and image data. Experiments demonstrate that KDSCINet (i) outperforms state-of-the-art methods on time series forecasting in the area of international relation forecasting and (ii) improves forecasting performance through the use of multimodal knowledge.
A Graph Convolution Neural Network for User-group Aided Personalized Session-based Recommendation
ABSTRACT. Session-based recommendation systems aim to predict the next user interaction based on the items with which the user interacts in the current session. Currently, graph neural network-based models have been widely used and proven more effective than others. However, these session-based models mainly focus on the user-item and item-item relations in historical sessions while ignoring information shared by similar users. To address the above issues, a new graph-based representation, the User-item Group Graph, which considers not only user-item and item-item but also user-user relations, is developed to take advantage of natural sequential relations shared by similar users. A new personalized session-based recommendation model is developed based on this representation. It first generates groups according to user-related historical item sequences and then uses a user-group preference recognition module to capture and balance group-item preferences and user-item preferences. Comparison experiments show that the proposed model outperforms other state-of-the-art models when similar users are effectively grouped. This indicates that grouping similar users can help find deep preferences shared by users in the same group and is instructive in finding the most appropriate next item for the current user.
RPUC: Semi-supervised 3D Biomedical Image Segmentation through Rectified Pyramid Unsupervised Consistency
ABSTRACT. Deep learning models have demonstrated remarkable performance in various biomedical image segmentation tasks. However, their reliance on a large amount of labeled data for training poses challenges as acquiring well-annotated data is expensive and time-consuming. To address this issue, semi-supervised learning (SSL) has emerged as a potential solution to leverage abundant unlabeled data. In this paper, we propose a simple yet effective consistency regularization scheme called Rectified Pyramid Unsupervised Consistency (RPUC) for semi-supervised 3D biomedical image segmentation. Our RPUC adopts a pyramid-like structure by incorporating three segmentation networks. To fully exploit the available unlabeled data, we introduce a novel pyramid unsupervised consistency (PUC) loss, which enforces consistency among the outputs of the three segmentation models and facilitates the transfer of cyclic knowledge. Additionally, we perturb the inputs of the three networks with varying ratios of Gaussian noise to enhance the consistency of unlabeled data outputs. Furthermore, three pseudo labels are generated from the outputs of the three segmentation networks, providing additional supervision during training. Experimental results demonstrate that our proposed RPUC achieves state-of-the-art performance in semi-supervised segmentation on two publicly available 3D biomedical image datasets.
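The abstract does not spell out the PUC loss, so the following is only an assumed minimal form of the unsupervised consistency term: pairwise agreement between the softened outputs of the three segmentation networks on the same unlabeled batch.

```python
import torch
import torch.nn.functional as F

def pyramid_consistency_loss(p1, p2, p3):
    """Assumed consistency term: pairwise MSE between the softmax outputs of the
    three segmentation networks on one unlabeled (and possibly noise-perturbed) batch."""
    q1, q2, q3 = F.softmax(p1, dim=1), F.softmax(p2, dim=1), F.softmax(p3, dim=1)
    return (F.mse_loss(q1, q2) + F.mse_loss(q2, q3) + F.mse_loss(q3, q1)) / 3.0

# Logits of three networks on one unlabeled 3D patch: (batch, classes, D, H, W).
logits = [torch.randn(2, 2, 16, 32, 32) for _ in range(3)]
print(pyramid_consistency_loss(*logits))
```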
A Developer Recommendation Method Based on Disentangled Graph Convolutional Network
ABSTRACT. Crowdsourcing Software Development (CSD) solves software development tasks by integrating resources from global developers. As more and more companies and developers move onto CSD platforms, the information overload problem of the platform makes it difficult to recommend suitable developers for a software development task. The interaction behavior between developers and tasks is often the result of complex latent factors. Existing developer recommendation methods are mostly based on deep learning, and their feature representations ignore the influence of latent factors on interactive behavior, leading to learned representations that lack robustness and interpretability. To solve the above problems, we present a developer recommendation method based on a disentangled graph convolutional network (DRDGC). Specifically, we use a disentangled graph convolutional network to separate the latent factors within the original features. Each latent factor contains specific information and is independent of the others, which makes the features constructed from the latent factors more robust and interpretable. Extensive experimental results show that DRDGC can effectively recommend the right developer for a task and outperforms the baseline methods.
Disentangling Node Metric Factors For Temporal Link Prediction
ABSTRACT. Temporal Link Prediction (TLP), one of the most closely studied tasks in graph mining, requires predicting future link probabilities from historical interactions. On the one hand, traditional methods based on node metrics, such as Common Neighbor, achieve satisfactory performance on the TLP task. On the other hand, node metrics overly emphasize the global influence of nodes while neglecting the personalization of different node pairs, which can sometimes mislead link prediction results. Meanwhile, mainstream TLP methods follow the standard paradigm of learning node embeddings, entangling favorable and harmful node-metric factors in the representation and reducing the model's robustness. In this paper, we propose a plug-and-play plugin called Node Metric Disentanglement (NMD), which can be applied to most TLP methods and boost their performance. It explicitly accounts for node metrics and disentangles them from the embedding representations generated by TLP methods. We adopt an attention mechanism to select information conducive to the TLP task and integrate it into the node embedding. Experiments on various state-of-the-art methods and dynamic graphs verify the effectiveness and universality of our NMD plugin. Our code is publicly available at https://github.com/tianlizhang/NMD.
Multi-scale Context Aggregation for Video-based Person Re-identification
ABSTRACT. For video-based person re-identification (Re-ID), how to effectively aggregate video features is the key to dealing with various complicated situations. Different from previous methods that first extract spatial features and later aggregate temporal features, in this paper we propose a Multi-scale Context Aggregation (MSCA) method to simultaneously learn spatial-temporal features from videos. Specifically, we design an Attention-aided Feature Pyramid Network (AFPN), which can recurrently aggregate the detailed and semantic information of multi-scale feature maps from the CNN backbone. To enable the aggregation to focus on more salient regions in the video, we embed a special Spatial-Channel Attention (SCA) module into each layer of the pyramid. To further enhance the feature representations with temporal information while extracting the spatial features, we design a Temporal Enhancement Module (TEM), which can be plugged into each layer of the backbone network in a plug-and-play manner. Comprehensive experiments on three standard video-based person Re-ID benchmarks demonstrate that our method is competitive with most state-of-the-art methods.
Heterogeneous Graph Prototypical Networks for Few-shot Node Classification
ABSTRACT. The node classification task is one of the most significant applications in the analysis of heterogeneous graphs, which have been widely used for modeling multi-typed interactions. Meanwhile, Graph Neural Networks (GNNs) have aroused wide interest due to their remarkable effects on node classification. However, applying GNNs to heterogeneous graph node classification faces challenges: the cumbersome cost of node labeling and the heterogeneity of graphs. Existing semi-supervised GNNs still require sufficient annotation, while learning classifiers independently of node embeddings cannot exploit the rich information effectively. Recently, few-shot learning has achieved competitive results on homogeneous graphs in addressing the performance degradation caused by label sparsity. However, few-shot learning in heterogeneous graphs is limited by the difficulty of extracting multiple semantic proximities. To this end, we propose a novel Heterogeneous graph Prototypical Network (HPN) in this paper. The proposed HPN consists of two modules: a graph structural module extracts node embeddings and semantic knowledge for meta-training by capturing heterogeneous structures, and a meta-learning module produces prototypes from heterogeneous induced subgraphs for important meta-training classes, which improves the utilization of information compared with traditional meta-learning. Experimental results on three real-world heterogeneous graphs demonstrate that HPN achieves outstanding performance and better stability.
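For context, the prototypical step that the meta-learning module builds on can be sketched in a few lines: class prototypes are the mean embeddings of the support nodes, and queries are assigned to the nearest prototype. HPN's heterogeneous structural module and induced-subgraph construction are not reproduced here.

```python
import torch

def prototypes(support_emb, support_labels, num_classes):
    """Class prototypes = mean embedding of the support nodes of each class."""
    return torch.stack([support_emb[support_labels == c].mean(dim=0)
                        for c in range(num_classes)])

def classify(query_emb, protos):
    """Assign each query node to the nearest prototype (negative Euclidean distance)."""
    logits = -torch.cdist(query_emb, protos)
    return logits.argmax(dim=1)

support = torch.randn(15, 64)                      # 3-way 5-shot support embeddings
labels = torch.arange(3).repeat_interleave(5)
protos = prototypes(support, labels, num_classes=3)
print(classify(torch.randn(10, 64), protos))
```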
Enhancing LSTM and fusing articles of law for legal text summarization
ABSTRACT. The growing number of public legal documents has led to an increased demand for automatic summarization. Considering the well-organized structure of legal documents, extractive methods can be an efficient approach to text summarization. Topic information is an important factor in summary extraction. The LSTM model fails to capture global topic information and suffers from long-distance information loss when dealing with legal texts, which are typically long. In this paper, we propose TS-LSTM, a multi-layer network structure that enhances LSTM with a topic vector and a slot memory unit fusing position information for extractive summarization of long texts in the legal domain. The topic information is used to interact with sentences, and the slot memory unit is used to model the long-range relationships between sentences. We conduct experiments on a Chinese legal text summarization dataset, and the experimental results demonstrate that our proposed method outperforms the baseline methods.
Link Prediction with Simple Path-Aware Graph Neural Networks
ABSTRACT. Graph Neural Networks (GNNs) are expert in node classification and graph classification, but are relatively weak on link prediction due to their limited expressiveness. Recently, two popular GNN variants, namely higher-order GNNs and the labeling trick, have been proposed to address the limitations of GNNs. Compared with plain GNNs, these variants provably capture inter-node patterns such as common neighbors, which facilitates link prediction. However, we notice that these methods actually suffer from two critical problems. First, their algorithmic complexities are impractical for large graphs. Second, we prove that although these methods can identify paths between target nodes, they cannot identify simple paths, which are fundamental in graph theory. To overcome these deficiencies, we systematically study the common advantages of previous link prediction GNNs and propose a novel GNN framework that summarizes these advantages while remaining simple and efficient. Various experiments show the effectiveness of our method.
AudioFormer: Channel Audio Encoder Based on Multi-Granularity Features
ABSTRACT. To address the lack of standardized feature extraction methods for speech emotion recognition and the insufficient depth of representation obtained from acoustic samples, we first propose a multi-granularity feature extraction method that preserves the integrity of the data features and overcomes the redundancy of existing feature extraction methods; second, we propose a Channel Audio Encoder model that uses different feature encoders to extract high-order features. Experiments show that the proposed multi-granularity-feature-based Channel Audio Encoder achieves state-of-the-art performance on the IEMOCAP dataset. We also evaluate the method on a real-scene dataset to demonstrate its usability and to provide a reference for aiding the diagnosis of mental illness.
A Context Aware Lung Cancer Survival Prediction Network by Using Whole Slide Images
ABSTRACT. Lung cancer has caused enormous harm to human life, and traditional whole slide image (WSI) based lung cancer survival prediction methods suffer from information loss and cannot maintain the spatial context of the images, which may play an important role in survival analysis. Meanwhile, the impact of the heterogeneity between medical images and natural images has been noticed for some pre-trained models used in medical image representation learning. In this paper, we propose a Context Aware Lung Cancer Survival Prediction Network (CA-SurvNet) using whole slide images, in which the survival prediction is decided by every patch of a WSI together with its spatial context. Specifically, the representation of each WSI patch is first learned via a self-supervised feature extractor; the representations are then sequentially concatenated, followed by a channel-wise dimensionality reduction that preserves the significant information while maintaining the spatial structure of the WSI. Extensive experiments on a large benchmark dataset validate the superiority of the proposed method over its state-of-the-art competitors, as well as the effectiveness of preserving the WSI spatial context for lung cancer survival prediction.
Restore Translation Using Equivariant Neural Networks
ABSTRACT. Invariance to spatial transformations such as translations and rotations is a desirable property and a basic design principle for classification neural networks. However, commonly used convolutional neural networks (CNNs) are actually very sensitive to even small translations. There is a vast body of work on achieving exact or approximate transformation invariance by designing transformation-invariant models or assessing the transformations. These works usually modify standard CNNs and harm performance on standard datasets. In this paper, rather than modifying the classifier, we propose a pre-classifier restorer that recovers translated (or even rotated) inputs to their original form, which can then be fed into any classifier for the same dataset. The restorer is based on a theoretical result that gives a sufficient and necessary condition for an affine operator to be translation-equivariant on a tensor space.
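The paper derives its restorer from the stated equivariance condition; purely as an assumed illustration of the pre-classifier idea, the sketch below undoes an integer circular translation with classic phase correlation, which is not the paper's operator.

```python
import numpy as np

def estimate_shift(img, reference):
    """Estimate, via phase correlation, the circular shift that maps img back onto reference."""
    f1, f2 = np.fft.fft2(reference), np.fft.fft2(img)
    cross = f1 * np.conj(f2)
    corr = np.fft.ifft2(cross / (np.abs(cross) + 1e-8)).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    return dy, dx

def restore(img, reference):
    """Shift the translated input back before feeding it to an unchanged classifier."""
    dy, dx = estimate_shift(img, reference)
    return np.roll(img, shift=(dy, dx), axis=(0, 1))

ref = np.random.rand(28, 28)
shifted = np.roll(ref, shift=(5, -3), axis=(0, 1))
print(np.allclose(restore(shifted, ref), ref))   # True: the translation is undone
```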
Text Spotting of Electrical Diagram Based on Improved PP-OCRv3
ABSTRACT. Text detection and recognition play an important role in the automatic management of electrical diagrams. However, images of electrical diagrams often have high resolution, and the text in them is uniquely formatted and densely distributed. These factors prevent general-purpose text spotting models from detecting and recognizing the text effectively. In this paper, we propose a text spotting model based on an improved PP-OCRv3 to achieve better performance on electrical diagrams. First, a region re-segmentation module based on pixel-line clustering is designed to correct detection errors on irregularly shaped text containing vertical and horizontal characters. Second, an improved BiFPN module with channel attention and depthwise separable convolution is introduced during text feature extraction to improve robustness to input images at different scales. Finally, a character re-identification module based on region extension and cutting is added during text recognition to reduce the adverse effects of simple and dense characters on the model. Experimental results show that our model outperforms state-of-the-art (SOTA) methods on electrical diagram datasets.
Action Prediction for Cooperative Exploration in Multi-agent Reinforcement Learning
ABSTRACT. Multi-agent reinforcement learning approaches have shown significant progress with the employment of exploration-enhanced methods. However, when dealing with challenging tasks that necessitate complex cooperation among agents, such methods exhibit low exploration efficiency and poor performance. This paper proposes PQmix, a method based on action-prediction rewards built on top of Qmix. PQmix uses the joint local observation of the agents and the next joint local observation after executing actions to predict the real joint action. The prediction error with respect to the real joint action is introduced as an intrinsic reward measuring the novelty of the joint state, so as to encourage the agents to actively explore the action-state space of the environment. We validate PQmix's performance against strong baselines on various MARL benchmarks. The experimental results demonstrate that PQmix outperforms state-of-the-art algorithms on the StarCraft Multi-Agent Challenge (SMAC).
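A minimal sketch of an action-prediction intrinsic reward follows; the predictor architecture, the use of cross-entropy as the prediction error, and the averaging over agents are assumptions rather than the exact PQmix design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionPredictor(nn.Module):
    """Predicts the joint action from the joint observation and the next joint observation."""
    def __init__(self, obs_dim, n_agents, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim * n_agents, hidden), nn.ReLU(),
            nn.Linear(hidden, n_agents * n_actions),
        )
        self.n_agents, self.n_actions = n_agents, n_actions

    def forward(self, obs, next_obs):
        logits = self.net(torch.cat([obs, next_obs], dim=-1))
        return logits.view(-1, self.n_agents, self.n_actions)

def intrinsic_reward(predictor, obs, next_obs, joint_actions):
    """Assumed intrinsic reward: prediction error of the real joint action (higher = more novel)."""
    logits = predictor(obs, next_obs)
    errors = F.cross_entropy(logits.flatten(0, 1), joint_actions.flatten(), reduction="none")
    return errors.view(joint_actions.shape).mean(dim=-1)   # one scalar per transition

pred = ActionPredictor(obs_dim=30, n_agents=3, n_actions=9)
obs = torch.randn(8, 3 * 30)            # flattened joint observation for a batch of 8
next_obs = torch.randn(8, 3 * 30)
actions = torch.randint(0, 9, (8, 3))
print(intrinsic_reward(pred, obs, next_obs, actions).shape)   # torch.Size([8])
```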
ABSTRACT. Radar target detection, one of the pivotal techniques in radar systems, aims to extract valuable information such as target distance and velocity from the received echo signals. However, with advances in aviation and electronic information technology, radar detection targets, scenarios, and environments have undergone profound transformations. The majority of conventional radar target detection methods are based on Constant False Alarm Rate (CFAR) techniques, which rely on certain distribution assumptions; when the detection scenario becomes intricate or dynamic, the performance of these detectors degrades significantly. Ensuring robust performance of radar target detection models in complex task scenarios has therefore emerged as a crucial concern. In this paper, we propose a radar target detection method based on a hybrid architecture of convolutional neural networks and autoencoder networks. The approach comprises a clutter suppression module and a target detection module. We conducted ablation and comparative experiments on publicly available and simulated radar echo datasets. The ablation experiments validated the effectiveness of the clutter suppression module, while the comparative experiments demonstrated the superior performance of our method over the comparison methods in complex background scenarios.
SLAM: A Lightweight Spatial Location Attention Module for Object Detection
ABSTRACT. To address the shortcomings of current object detection models, including large numbers of parameters, inaccurate localization of target bounding boxes, and ineffective detection, this paper proposes a lightweight spatial location attention module (SLAM). By learning the spatial location information in the input feature map, SLAM adaptively adjusts the attention weights of the location information while greatly improving the feature representation capability of the network. First, the SLAM module obtains the spatial distribution of the input feature map in the horizontal, vertical, and channel directions through average pooling and maximum pooling operations; it then generates the corresponding location attention weights with convolutions and activation functions, and finally produces the weighted feature map by aggregating the features along the three spatial directions. Extensive experiments show that the SLAM module improves detection performance on the MS COCO and PASCAL VOC 2012 datasets with almost no additional computational overhead.
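A hedged sketch of such a location-attention block is given below: the input is average- and max-pooled along the horizontal, vertical, and channel directions, small convolutions turn each pooled map into attention weights, and the weights rescale the feature map. The kernel sizes and the multiplicative aggregation are assumptions, not the exact SLAM design.

```python
import torch
import torch.nn as nn

class SpatialLocationAttention(nn.Module):
    """Assumed location-attention block built from directional average/max pooling."""
    def __init__(self, channels):
        super().__init__()
        self.conv_h = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.conv_c = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        # Horizontal direction: pool over width -> weights of shape (B, C, H, 1).
        h = torch.cat([x.mean(3, keepdim=True), x.amax(3, keepdim=True)], dim=1)
        # Vertical direction: pool over height -> weights of shape (B, C, 1, W).
        w = torch.cat([x.mean(2, keepdim=True), x.amax(2, keepdim=True)], dim=1)
        # Channel direction: pool over channels -> weights of shape (B, 1, H, W).
        c = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        a = (torch.sigmoid(self.conv_h(h))
             * torch.sigmoid(self.conv_w(w))
             * torch.sigmoid(self.conv_c(c)))
        return x * a

x = torch.randn(2, 64, 32, 32)
print(SpatialLocationAttention(64)(x).shape)   # torch.Size([2, 64, 32, 32])
```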
A Malicious Code Family Classification Method Based on RGB Images and Lightweight Model
ABSTRACT. In recent years, malware attacks have been a constant threat to network security, and the problem of how to classify malicious families quickly and accurately urgently needs to be addressed. Traditional malicious family classification methods are rendered ineffective by the proliferation of variants and are no longer adequate at the current stage of research. Visualization methods can best expose the core characteristics of malicious code in an image, but grayscale images suffer from few and monotonous features. In this paper, we propose a new malicious code visualization method. Specifically, we first convert the original malicious file into a byte file and an asm file using the IDA Pro tool. Secondly, we extract the opcode sequences from the asm file and the byte sequences from the byte file and convert them into a three-channel RGB image using visualization techniques, which allows a more comprehensive representation of the features of the malicious sample. Finally, we propose a new neural network architecture, a MobileNetV2 lightweight model combined with CBAM (MVCBAM), for training and prediction. In addition, we conduct extensive comparison experiments on the BIG2015 and Malimg datasets. The experiments show that the accuracy of our proposed model on the two datasets is 99.90% and 99.95%, respectively, achieved with fewer network parameters than the original MobileNetV2 model and with higher accuracy and faster speed than other advanced methods.
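The abstract does not specify how the opcode and byte sequences are distributed across the three channels, so the sketch below only illustrates the general byte-to-RGB packing step under an assumed layout (three consecutive bytes per pixel, fixed image width, zero-padded tail).

```python
import numpy as np

def bytes_to_rgb(byte_seq, width=256):
    """Pack a raw byte sequence into a fixed-width RGB image (assumed layout:
    three consecutive bytes form one RGB pixel; the tail is zero-padded)."""
    arr = np.frombuffer(byte_seq, dtype=np.uint8)
    pad = (-len(arr)) % (3 * width)
    arr = np.concatenate([arr, np.zeros(pad, dtype=np.uint8)])
    return arr.reshape(-1, width, 3)              # (height, width, 3) image

raw = np.random.randint(0, 256, 10000, dtype=np.uint8).tobytes()
print(bytes_to_rgb(raw).shape)                    # e.g. (14, 256, 3)
```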
A Novel Approach for Improved Pedestrian Walking Speed Prediction: Exploiting Proximity Correlation
ABSTRACT. Accurately predicting pedestrian speed is crucial for analyzing pedestrian behavior and optimizing intelligent transportation systems. This paper investigates the feasibility of modeling pedestrian walking speed as a time series. Building upon previous research highlighting the spatio-temporal nearest neighbor correlation in pedestrian walking speed, we propose a deep learning method that leverages this correlation. Experimental results demonstrate the superiority of our approach over traditional methods in accurately predicting pedestrian walking speed and capturing temporal characteristics and trends. The findings of this study have significant implications for enhancing pedestrian traffic flow management, improving the pedestrian travel experience, and enhancing overall traffic safety. Future research can focus on exploring advanced time series methods and deep learning models to further enhance the accuracy and practicality of pedestrian walking speed prediction.
Research on Relation Extraction Based on BERT with Multifaceted Semantics
ABSTRACT. Relation extraction is one of the important tasks in natural language processing, aiming to determine the class of relation to which the entities in a sentence belong. Unlike much current research, in which researchers tend to retrain language models on large-scale corpora for relation extraction, which requires substantial resources, this paper proposes a model based on BERT with multifaceted semantics (BERT-LR) for relation extraction, which learns semantics and performs relation extraction from multiple aspects with entities at the center. First, we make full use of the already pre-trained BERT model to obtain rich initialization parameters for our BERT-LR model. Second, to achieve entity-centric relation extraction, we propose a multifaceted semantic relation extraction model based on BERT consisting of left semantics, right semantics, and global semantics, and use a suitable method to fuse the multifaceted semantics. Third, we find that fixing the embedding layer of the model during fine-tuning achieves better results. Our approach achieves excellent results on the SemEval-2010 Task 8 dataset.
End-to-End Urban Autonomous Navigation with Decision Hindsight
ABSTRACT. Urban autonomous navigation has broad application prospects. Reinforcement Learning (RL) based navigation models can be continuously optimized through self-exploration, eliminating the need for human heuristics. However, training effective navigation models faces challenges due to the dynamic nature of urban traffic conditions and the exploration-exploitation dilemma in RL. Moreover, the limited vehicle perception and traffic uncertainty introduce potential safety hazards, hampering the real-world application of RL-based navigation models. In this paper, we propose a novel end-to-end urban navigation framework with decision hindsight. Formulating the problem as a Partially Observable Markov Decision Process (POMDP), we employ a causal Transformer-based autoregressive modeling approach to process the historical navigation information as supplementary observations. We then combine these historical observations with current perceptions to construct a history-feedforward state representation that enhances global awareness, improving data availability and decision predictability. Furthermore, by integrating the history-feedforward state encoding upstream, we develop an end-to-end learning framework based on RL to obtain a navigation model with decision hindsight, enabling more reliable navigation. To validate the effectiveness of our proposed method, we conduct experiments on challenging urban navigation tasks using the CARLA simulator. The results demonstrate that our method achieves higher learning efficiency and improved driving performance, outperforming prior methods on urban navigation benchmarks.
MView-DTI: A multi-view feature fusion-based approach for drug-target protein interaction prediction
ABSTRACT. Drug-Target protein Interaction (DTI) prediction is a critical task in the field of drug discovery. Deep learning-based prediction methods have been shown to significantly improve the accuracy of DTI prediction while reducing costs. Most existing methods treat drug molecules and proteins as graphs or sequences, extract features from them, and then utilize networks such as convolutional neural networks (CNN), graph neural networks (GNN), and Transformers for learning and prediction. However, drug molecular images clearly display features such as atoms, structures, and chemical bonds that are difficult to capture in sequences or graphs. Therefore, this paper proposes a deep learning method based on multi-view feature fusion that utilizes a Transformer to fuse the graph structure and image features of drug molecules and thereby learn more comprehensive drug features. This enables the learning of more complex interaction features between amino acids and atoms when modeling DTI. The model was evaluated on three benchmark datasets and achieved significant improvements over the latest baselines. Additionally, to validate the effectiveness of capturing drug image feature information, ablation experiments were conducted, and the results showed a significant increase in accuracy after incorporating image data.
ABSTRACT. Causal Emotion Entailment (CEE) aims to identify which utterances are responsible for the non-neutral emotion in a conversational utterance. Prior research on this topic has primarily focused on using sequential encoding to model conversational contexts, but has neglected to fully consider the impact of interactions between utterances and structure information. In this paper, we explore the significance of discourse parsing in addressing these interactions and structure information, and propose a new model called the discourse-aware model (DAM) to tackle the CEE task. Concretely, we jointly model CEE with discourse parsing using a multi-task learning (MTL) framework to integrate rich utterance discourse information into our model. In addition, we use a graph neural network (GNN) to further enhance our CEE model by explicitly encoding discourse and other discourse-related structure features. Results on the benchmark corpus show that DAM outperforms the state-of-the-art (SOTA) systems in the literature. This suggests that the discourse structure may contain a potential link between emotional utterances and their corresponding cause expressions.
Self-Adaptive Inverse Soft-Q Learning for Imitation
ABSTRACT. As a powerful method for solving sequential decision problems, imitation learning (IL) aims to generate a policy similar to expert behavior by imitating demonstrations. However, the quality of the demonstrations directly limits the performance of the agent's imitation policy. To solve this problem, self-adaptive inverse soft-Q learning for imitation (SAIQL) is proposed. SAIQL introduces a novel three-level buffer system by adding an online excellent buffer alongside the expert buffer and the normal buffer. Trajectories from interactions with superior performance are stored in the online excellent buffer. When the amount of data in the online excellent buffer equals that in the expert buffer, the data in the former are transferred to the latter and the former is cleared, ensuring that demonstrations in the expert buffer are continuously improved. Finally, we compare SAIQL with up-to-date IL methods on both continuous control and Atari tasks. The experimental results show the superiority of SAIQL: it improves the quality of expert demonstrations and the utilization of trajectories.
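The three-level buffer bookkeeping can be sketched as follows. The buffer capacities, the return threshold that defines a superior trajectory, and the exact promotion rule are assumptions made for illustration; the paper's implementation may differ.

from collections import deque

class ThreeLevelBuffer:
    # Illustrative buffer system: normal, online-excellent, and expert buffers.
    def __init__(self, expert_trajectories, capacity=10000, excellent_return=200.0):
        self.normal = deque(maxlen=capacity)
        self.excellent = deque(maxlen=capacity)
        self.expert = deque(expert_trajectories, maxlen=capacity)
        self.excellent_return = excellent_return

    def add_trajectory(self, trajectory, episode_return):
        # trajectories with superior performance go to the online excellent buffer
        if episode_return >= self.excellent_return:
            self.excellent.append(trajectory)
        else:
            self.normal.append(trajectory)
        # when the online excellent buffer has as much data as the expert buffer,
        # move its trajectories into the expert buffer and clear it
        if len(self.expert) > 0 and len(self.excellent) >= len(self.expert):
            self.expert.extend(self.excellent)
            self.excellent.clear()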
BFMOT: A One-Shot Baseline Model with Fusion Similarity Algorithm Towards Real-Time Multi-Object Tracking
ABSTRACT. A key challenge of multi-object tracking is to realize the trade-off between high accuracy and real-time performance. Recently, the one-shot tracker, which integrates multiple tasks into a unified network, has achieved a good balance between tracking accuracy and speed. Different from previous trackers' practice of exchanging extra computational cost for tracking accuracy, we propose a new one-shot baseline model that is faster and lighter. We further discuss the conflict under the tracking paradigm of joint detection and re-identification and strive to alleviate the feature conflict in the one-shot model. Furthermore, in order to improve the one-shot base model's ability to deal with complex scenarios, we innovate in data association and propose a fusion similarity association algorithm. On the MOT17 test set, the proposed association algorithm reduces the number of ID switches by 22.9% compared with the state-of-the-art association algorithm. On the MOT20 test set, the proposed BFMOT tracker improves the tracking accuracy (i.e. MOTA) by 6.7% compared with the most popular one-shot tracker. BFMOT is very simple and runs at 30 FPS on a single GPU, which is more oriented towards real-time multi-object tracking.
Debiasing Medication Recommendation with Counterfactual Analysis
ABSTRACT. AI-driven medication recommendation has emerged as a crucial undertaking in healthcare research. Recent literature has focused on leveraging patients' diagnoses, procedures, and historical visit information for medication recommendation. However, this approach can lead to recommendation biases due to spurious correlations among the historical visit information. Previous studies have either failed to address this bias issue or attempted to mitigate recommendation biases through dataset manipulation, albeit at the expense of increased computational costs. In this study, we propose CAMeR (Counterfactual Analysis based Medication Recommendation), a novel debiasing model based on counterfactual analysis. The model preserves medication information while emphasizing the primary influence of diagnoses and procedures. Unlike traditional factual reasoning approaches that address biases before or during training, counterfactual reasoning mitigates the impact of spurious correlations after training. Additionally, we incorporate contrastive loss computation in the embedding module of our model to calibrate the feature construction for patients with multiple visits. We validate CAMeR on the widely adopted MIMIC-III and MIMIC-IV datasets, and the experimental results unequivocally demonstrate its superiority over state-of-the-art methods.
Active Learning for Open-set Annotation Using Contrastive Query Strategy
ABSTRACT. Active learning has achieved significant success in classification tasks where all data samples are drawn from known classes. However, in real scenarios, most active learning methods fail when encountering the open-set annotation (OSA) problem, i.e., numerous samples from unknown classes. The main reason for such failure is that existing query strategies inevitably select unknown-class samples. To tackle this problem and select the most informative samples, we propose a novel active learning framework named OSA-CQ, which simplifies the detection of samples from known classes and enhances classification performance with an effective contrastive query strategy. Specifically, OSA-CQ first adopts an auxiliary network to distinguish samples using confidence scores, which can dynamically select the samples in the unlabeled set with the highest probability of belonging to known classes. Secondly, by comparing the predictions of the auxiliary network, the classifier, and feature similarity, OSA-CQ designs a contrastive query strategy to select the most informative samples from the unlabeled and known-class sets. Experimental results on CIFAR10 and CIFAR100 show that the proposed OSA-CQ can select highly informative samples from known classes and achieves higher classification performance with lower annotation cost than state-of-the-art active learning algorithms.
Cross-Domain Bearing Fault Diagnosis Method Using Hierarchical Pseudo Labels
ABSTRACT. Data-driven bearing fault diagnosis methods have become increasingly crucial for the health management of rotating machinery. However, in actual industrial scenarios, the scarcity of labeled data presents a challenge. To alleviate this problem, many transfer learning methods have been proposed. Some domain adaptation methods use models trained on the source domain to generate pseudo labels for target domain data, which are further employed to refine the models. Domain shift may introduce noise into the pseudo labels, thereby compromising the stability of the model. To address this issue, we propose a Hierarchical Pseudo Label Domain Adversarial Network. In this method, we divide pseudo labels into three levels and use different training approaches for samples at different levels. Compared with traditional threshold-filtering methods that focus on high-confidence samples, our method can effectively exploit the positive information of a large quantity of medium-confidence samples and mitigate the negative impact of mislabeling. Our proposed method achieves higher prediction accuracy than state-of-the-art domain adaptation methods in harsh environments.
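A minimal sketch of splitting target-domain pseudo labels by confidence is given below; the two thresholds and the three-way split are assumptions, since the abstract does not state the exact criteria used to define the three levels.

import numpy as np

def split_pseudo_labels(probs, high=0.9, low=0.5):
    # probs: (N, C) class probabilities predicted for target-domain samples
    confidence = probs.max(axis=1)
    pseudo_labels = probs.argmax(axis=1)
    high_idx = np.where(confidence >= high)[0]                           # trusted samples
    medium_idx = np.where((confidence >= low) & (confidence < high))[0]  # partially trusted
    low_idx = np.where(confidence < low)[0]                              # likely mislabeled
    return pseudo_labels, high_idx, medium_idx, low_idx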
Efficient Collaboration via Interaction Information in Multi-Agent System
ABSTRACT. Cooperative multi-agent reinforcement learning (CMARL) has shown promise in solving real-world scenarios. The interaction information between agents contains rich global information, which is easily neglected after perceiving other agents' behavior.
To tackle this problem, we propose Collaboration Interaction Information Modelling via Hypergraph (CIIMH), which first perceives the behavior of other agents by mutual information optimization and constructs the dynamic interaction information via hypergraph. Perceived behavioral features of other agents are further aggregated in the hypergraph convolutional network to obtain interaction information.
We compare our method with three existing baselines on StarCraft II micromanagement tasks (SMAC), Level-based Foraging (LBF), and Hallway. Empirical results show that our method outperforms baseline methods on all maps.
Violence-MFAS: Audio-Visual Violence Detection Using Multimodal Fusion Architecture Search
ABSTRACT. Audio-visual fusion methods are widely employed to tackle violence detection tasks, since they can effectively integrate the complementary information from both modalities to significantly improve accuracy. However, the design of high-quality multimodal fusion networks is highly dependent on expert experience and substantial efforts. To alleviate this formidable challenge, we propose a novel method named Violence-MFAS, which can automatically design promising multimodal fusion architectures for violence detection tasks using multimodal fusion architecture search (MFAS). To further enable the model to focus on important information, we elaborately design a new search space. Specifically, multilayer neural networks based on attention mechanisms are meticulously constructed to grasp intricate spatio-temporal relationships and extract comprehensive multimodal representation. Finally, extensive experiments are conducted on the commonly used large-scale and multi-scene audio-visual XD-Violence dataset. The promising results demonstrate that our method outperforms the state-of-the-art methods under the guarantee of a lightweight architecture.
DeepLink: Triplet Embedding and Spatio-Temporal Dynamics Learning of Link Representations for Travel Time Estimation
ABSTRACT. Estimating the time of arrival is a crucial task in intelligent transportation systems. The task poses challenges due to the dynamic nature and complex spatio-temporal dependencies of traffic networks. Existing studies have primarily focused on learning the dependencies between adjacent links on a route, often overlooking a deeper understanding of the links within the traffic network. To address this limitation, we propose DeepLink, a novel approach for travel time estimation that leverages a comprehensive understanding of the spatio-temporal dynamics of road segments from different perspectives. DeepLink introduces triplet embedding, enabling the learning of both the topology and potential semantics of the traffic network, leading to an improved understanding of links' static information. Then, a spatio-temporal dynamic representation learning module integrates the triplet embedding and real-time information, which effectively models the dynamic traffic conditions. Additionally, a local-global attention mechanism captures both the local dependencies of adjacent road segments and the global information of the entire route. Extensive experiments conducted on a large-scale real-world dataset demonstrate the superior performance of DeepLink compared to state-of-the-art methods.
Reversible Data Hiding in Encrypted Images based on Image Reprocessing and Polymorphic Compression
ABSTRACT. With the rapid development of cloud computing and privacy protection, Reversible Data Hiding in Encrypted Images (RDHEI) has attracted increasing attention, since it can achieve covert data transmission and lossless image recovery. To realize reversible data hiding with high embedding capacity, a new RDHEI method is proposed in this paper. First, we introduce the Image Reprocessing and Polymorphic Compression (IRPC) scheme, which can classify the images and then vacate enough room for embedding. After that, an improved RDHEI method combined with the IRPC scheme and a chaotic encryption algorithm is presented. In this method, the content owner uses the IRPC scheme to reserve embeddable rooms in the original image and then utilizes a six-dimensional chaotic encryption system to encrypt the reserved image into an encrypted image. After receiving the encrypted image, the data hider can embed additional data into it to obtain the marked encrypted image. According to the different keys the receiver has, the embedded data or the original image can be extracted or recovered from the marked encrypted image without error. Extensive experimental results show that the average Embedding Rate (ER) of our proposed method on the datasets BOSSbase, BOWS-2, and UCID is higher than that of the baseline method by 0.1 bpp. At the same time, the security performance of the image is also improved.
GRF-GMM: A Trajectory Optimization Framework for Obstacle Avoidance in Learning from Demonstration
ABSTRACT. Learning from demonstrations (LfD) provides a convenient way to teach robots skills without explicit programming. As an LfD approach, the Gaussian mixture model/Gaussian mixture regression (GMM/GMR) has been widely used for its robustness and effectiveness. However, many problems remain for GMM when an obstacle that is not present in the original demonstrations appears in the workspace of the robot. To address these problems, this paper presents a novel method based on a Gaussian repulsive field-Gaussian mixture model (GRF-GMM) for obstacle avoidance by optimizing the model parameters. A Gaussian repulsive force is calculated through Gaussian functions and applied to the Gaussian components to optimize the mixture distribution learnt from the original demonstrations. Our approach allows the reproduced trajectory to keep a safe distance from the obstacle. Finally, the feasibility and effectiveness of the proposed method are demonstrated through simulations and experiments.
IIHT: Medical Report Generation with Image-to-Indicator Hierarchical Transformer
ABSTRACT. Automated medical report generation has become increasingly important in medical analysis. It can produce computer-aided diagnosis descriptions and thus significantly alleviate the doctors' work.
Inspired by the huge success in neural machine translation and image captioning, various deep learning methods have been proposed for medical report generation. However, the existing methods suffer from the intrinsic challenges raised by data imbalance and bias within medical data, and thus the generated reports may exhibit linguistic fluency but lack clinical accuracy.
To tackle these challenges, we propose an image-to-indicator hierarchical transformer (IIHT) framework for medical report generation. It consists of three modules, i.e., a classifier module, an indicator expansion module and a generator module. These modules can effectively address the challenges caused by, for example, data imbalance and bias.
Furthermore, the proposed IIHT method allows radiologists to modify disease indicators in real-world scenarios and integrates these operations into the indicator expansion module for fluent and accurate medical report generation. Extensive experiments and comparisons with state-of-the-art methods under various evaluation metrics demonstrate the strong performance of the proposed method.
OD-Enhanced Dynamic Spatial-Temporal Graph Convolutional Network for Metro Passenger Flow Prediction
ABSTRACT. Metro passenger flow prediction is crucial for efficient urban transportation planning and resource allocation. However, it faces two challenges. The first challenge is extracting the diverse passenger flow patterns at different stations, e.g., stations near residential areas and stations near commercial areas, while the second is to model the complex dynamic spatial-temporal correlations caused by Origin-Destination (OD) flows. Existing studies often overlook these two aspects, especially the impact of OD flows. To this end, we propose an OD-enhanced dynamic spatial-temporal graph convolutional network (DSTGCN) for metro passenger flow prediction. First, we propose a static spatial module to extract the flow patterns of different stations. Second, we utilize a dynamic spatial module to capture the dynamic spatial correlations between stations with OD matrices. Finally, we employ a multi-resolution temporal dependency module to learn the delayed temporal features. We also conduct experiments on two real-world datasets from Shanghai and Hangzhou. The results show the superiority of our model compared to the state-of-the-art baselines.
Membership Inference Attack against Medical Databases
ABSTRACT. Membership inference is a powerful attack on private databases, especially for medical data. Existing attack models utilize a shadow model to infer the private members of a dataset, which can damage the interests of data owners and may cause serious data leakage. However, existing defences concentrate on encryption methods and ignore that such inference can also cause unacceptable losses in real applications. In this work, we propose a novel inference attack model that utilizes a shadow model to simulate the division system in a medical database and subsequently infer the members of the database. Moreover, the established shadow inference model can classify the labels of medical data and obtain the private members of the medical database. In contrast with traditional inference attacks, we apply the attack to medical databases rather than recommendation systems or machine learning classifiers. From our extensive simulations and comparisons with traditional inference attacks, we observe that the proposed model can carry out attacks on medical data with reasonable attack accuracy and acceptable computation cost.
Enhancing Heterogeneous Graph Contrastive Learning with Strongly Correlated Subgraphs
ABSTRACT. Graph contrastive learning maximizes the mutual information between the embedding representations of the same data instances in different augmented views of a graph, obtaining feature representations for graph data in an unsupervised manner without the need for manual labeling. Most existing node-level graph contrastive learning models only consider embeddings of the same node in different views as positive sample pairs, ignoring the rich inherent neighboring relations and resulting in a certain loss of contrastive information. To address this issue, we propose a heterogeneous graph contrastive learning model that incorporates strongly correlated subgraph features. We design a contrastive learning framework suitable for heterogeneous graphs and introduce high-level neighborhood information during the contrasting process. Specifically, our model selects a strongly correlated subgraph for each target node in the heterogeneous graph based on both topological structure information and node attribute features. In the calculation of the contrastive loss, we perform feature shifting operations on positive and negative samples based on the subgraph encoding to enhance the model's ability to discriminate between similar samples. We conduct node classification and ablation experiments on multiple public heterogeneous datasets and the results verify the effectiveness of our model's contributions.
TPTGAN: Two-Path-Transformer-Based Generative Adversarial Network Using Joint Magnitude Masking and Complex Spectral Mapping For Speech Enhancement
ABSTRACT. In recent studies, the conformer has been widely used in speech enhancement, but it still suffers from excessive suppression, especially in human-to-machine communication such as automatic speech recognition (ASR), because target speech is lost when filtering the noise. Therefore, while these methods may yield higher PESQ scores, they often exhibit limited effectiveness in improving the signal-to-noise ratio of speech, which has proved vital for ASR. In this paper, we propose a two-path-transformer-based metric generative adversarial network (TPTGAN) for speech enhancement in the time-frequency domain. The generator consists of an encoder, a two-stage transformer module, a magnitude mask decoder and a complex spectrum decoder. The encoder and two-path transformers characterize the magnitude and complex spectra of the inputs and model both sub-band and full-band information of the time-frequency spectrogram. The estimation of the magnitude and complex spectrum is decoupled in the decoder, and the enhanced speech is then reconstructed in conjunction with the phase information. Through intelligent training strategies and structural adjustments, we showcase the remarkable efficacy of the transformer model in speech enhancement tasks. The experimental results on the Voice Bank+DEMAND dataset illustrate that TPTGAN shows superior performance compared to state-of-the-art methods, with an SSNR of 11.63 and a PESQ of 3.35, which alleviates the problem of excessive suppression, while the complexity of the model (1.03M parameters) is significantly reduced.
Sample Selection based on Uncertainty for Combating Label Noise
ABSTRACT. Automatic segmentation of medical images plays a crucial role in scientific research and healthcare. Obtaining large-scale training datasets with high-quality manual annotations poses challenges in many clinical applications. Utilizing noisy datasets has become increasingly important, but label noise significantly affects the performance of deep learning models. Sample selection is an effective method for handling label noise. In this study, we propose a medical image segmentation framework based on entropy estimation uncertainty for sample selection to address datasets with noisy labels. Specifically, after sample selection, parallel training of two networks and cross-model information exchange are employed for collaborative optimization learning. Based on the exchanged information, sample selection is performed using entropy estimation uncertainty, following a carefully designed schedule for gradual label filtering and correction of noisy labels. The framework is flexible in terms of the precise deep neural network (DNN) models used. Method analysis and empirical evaluation demonstrate that our approach exhibits superior performance on open datasets with noisy annotations. The sample selection method outperforms small loss criterion approaches, and the segmentation results surpass those of traditional fully supervised models. Our framework provides a valuable solution for effectively handling noisy label datasets in medical image segmentation tasks.
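The entropy-based uncertainty behind the sample selection step can be written compactly as below; the selection ratio and the averaging over spatial positions are assumptions made for this sketch.

import torch

def entropy_uncertainty(logits):
    # per-pixel predictive entropy from segmentation logits of shape (N, C, H, W)
    probs = torch.softmax(logits, dim=1)
    return -(probs * torch.log(probs.clamp_min(1e-8))).sum(dim=1)

def select_clean_samples(logits, ratio=0.5):
    # keep the fraction of samples with the lowest mean entropy as presumably clean
    ent = entropy_uncertainty(logits).flatten(1).mean(dim=1)   # (N,) mean entropy per sample
    k = max(1, int(ratio * ent.numel()))
    return torch.topk(-ent, k).indices                         # indices of the k most certain samples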
MFSFFuse: Multi-Receptive Field Feature Extraction for Infrared and Visible Image Fusion using Self-Supervised Learning
ABSTRACT. Infrared and visible image fusion aims to fuse complementary information from different modalities to improve image quality and resolution and facilitate subsequent visual tasks. Most current fusion methods suffer from incomplete feature extraction or redundancy, resulting in indistinct targets or lost texture details. Moreover, infrared and visible image fusion lacks ground truth, and fusion results obtained by training the network without supervision may also lose important features. To solve these problems, we propose an infrared and visible image fusion method using self-supervised learning, called MFSFFuse. Specifically, we introduce a multi-receptive-field dilated convolution block that extracts multi-scale features using dilated convolutions. Additionally, different attention modules are employed to enhance information extraction in different branches. Furthermore, a specific loss function is devised to guide the optimization of the model towards an ideal fusion result. Extensive experiments show that, compared to state-of-the-art methods, our method achieves competitive results in both quantitative and qualitative evaluations.
Design of a Multimodal Short Video Classification Model
ABSTRACT. With the development of the mobile Internet, a large amount of short video data is generated online. The pressing problem in short video classification is how to better fuse information from different modalities. This paper proposes a short video multimodal fusion (SV-MF) scheme based on deep learning combined with pre-trained models to complete the classification of short videos. The main innovations of the SV-MF scheme are as follows: (1) We find that text modalities contain higher-order information and tend to perform better than audio and visual modalities, and with the use of pre-trained language models, text modalities are further improved in multimodal video classification. (2) Due to the strong semantic representation ability of text, the SV-MF scheme proposes a Transformer-based local fusion method for low-order visual and audio modal information to alleviate the information deviation caused by multimodal fusion. (3) The SV-MF scheme proposes a keyword-based post-processing strategy to further improve the classification accuracy of the model. Experimental results on a multimodal short video classification dataset derived from social networks show that the SV-MF scheme outperforms previous video fusion schemes.
Multi-vehicle Platoon Overtaking Using NoisyNet Multi-Agent Deep Q-Learning Network
ABSTRACT. With recent advancements in Vehicle-to-Vehicle communication technology, autonomous vehicles are able to connect and collaborate in platoons, minimizing accident risks, costs, and energy consumption. The significant benefits of vehicle platooning have gained increasing attention from the automation and artificial intelligence communities. However, few studies have focused on platooning with overtaking. To address this problem, a NoisyNet multi-agent deep Q-learning algorithm is developed in this paper, in which NoisyNet is employed to improve exploration of the environment. By considering the factors of overtaking, speed, collision, time headway and following vehicles, a domain-tailored reward function is proposed to accomplish safe platoon overtaking at high speed. Finally, simulation results show that the proposed method achieves successful overtaking in various traffic density situations.
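A possible shape of the domain-tailored reward combining the listed factors is sketched below; all weights and penalty terms are illustrative assumptions, not the coefficients used in the paper.

def platoon_overtake_reward(overtook, speed, target_speed, collided,
                            time_headway, desired_headway=1.5,
                            followers_in_platoon=0):
    # combine overtaking, speed, collision, time headway and following-vehicle terms
    r = 0.0
    r += 1.0 if overtook else 0.0                   # reward a completed overtake
    r += 0.5 * min(speed / target_speed, 1.0)       # encourage high (but bounded) speed
    r -= 10.0 if collided else 0.0                  # strong collision penalty
    r -= 0.2 * abs(time_headway - desired_headway)  # keep a safe time headway
    r += 0.1 * followers_in_platoon                 # keep the platoon together
    return r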
Learning Adaptable Risk-Sensitive Policies to Coordinate in Multi-Agent General-Sum Games
ABSTRACT. In general-sum games, the interaction of self-interested learning agents commonly leads to socially worse outcomes, such as defect-defect in the iterated stag hunt (ISH). Previous works address this challenge by sharing rewards or shaping their opponents’ learning process, which require too strong assumptions. In this paper, we observe that agents trained to optimize expected returns are more likely to choose a safe action that leads to guaranteed but lower rewards. To overcome this, we present Adaptable Risk-Sensitive Policy (ARSP). ARSP learns the distributions over agent's return and estimates a dynamic risk-seeking bonus to discover risky coordination strategies. Furthermore, to avoid overfitting training opponents, ARSP learns an auxiliary opponent modeling task to infer opponents' types and dynamically alter corresponding strategies during execution. Extensive experiments show that ARSP agents can achieve stable coordination during training and adapt to non-cooperative opponents during execution, outperforming a set of baselines by a large margin.
Multi-intent Description of Keyword Expansion for Code Search
ABSTRACT. To address the issue of discrepancies between online query data and offline training data in code search research, we propose a novel code search model called multi-intent description keyword extension-based code search (MDKE-CS). Our model utilizes offline training data to expand query data, thereby mitigating the impact of insufficient query data and intention differences between training and query data on search results. Furthermore, we construct a multi-intention description keyword vocabulary based on developers, searchers, and discussants from the StackOverflow Q&A library to further expand the query. To evaluate the effectiveness of MDKE-CS on code search tasks, we conducted comparative experimental analyses using two baseline models, DeepCS and UNIF, as well as the WordNet and BM25 extension methods. Our experimental results demonstrate that MDKE-CS outperforms the baseline models in terms of R@1, R@5, R@10, and MRR values.
Knowledge Prompting with Contrastive Learning for Unsupervised Commonsense Question Answering
ABSTRACT. Unsupervised commonsense question answering is an emerging task in the natural language processing domain. In this task, knowledge is of vital importance. Most existing research focuses on stacking large-scale models or extracting knowledge from external sources. However, these methods suffer from either the unstable quality of knowledge or the deficiency in the model's flexibility. In this paper, we propose a Knowledge Prompting with Contrastive Learning (KPCL) model to address these problems. Specifically, we first consider dropout noise as augmentation for commonsense questions. Then we apply unsupervised contrastive learning in further pre-training to capture the nuances among questions, and thus help with the subsequent knowledge generation. After that, we utilize generic prompts to generate question-related knowledge descriptions in a zero-shot manner, facilitating easier transfer to new domains. Furthermore, we concatenate the knowledge descriptions with the commonsense question, forming integrated question statements. Finally, we reason over them to score the confidence and make predictions. Extensive experimental results on three benchmark datasets demonstrate the effectiveness and robustness of our proposed KPCL, which consistently outperforms baseline methods.
Application of ALMM Technology to Intelligent Control System for a Fleet of Unmanned Aerial Vehicles
ABSTRACT. The article describes an intelligent information system for managing a fleet of Unmanned Aerial Vehicles (UAVs) while taking into account various dynamically changing constraints. Variability of the time intervals available for flights over certain areas located close to airports is one of the essential constraints. The system must be developed very flexibly to easily accommodate changes in both flight destinations and performance conditions. The authors propose the application of Algebraic Logic Meta Modelling (ALMM) technology to design and implement the models and algorithms used in this system. The article presents part of the research carried out by the authors during the design of the aforementioned system. An Algebraic Logic (AL) model of the UAV flight scheduling optimization problem is given. The execution of overflights in the circumpolar zone with the criterion of minimizing the total completion time of all tasks, Cmax, is described. A hybrid algorithm for solving this problem and the results of the experiments carried out are then presented. The component nature of the proposed approach allows easy transposition of the models and algorithms to more complex cases with additional assumptions and restrictions arising when managing flights in real conditions.
Effective skill learning on vascular robotic systems: Combining offline and online reinforcement learning
ABSTRACT. Vascular robotic systems, which have gained popularity in clinical practice, provide a platform for potentially semi-automated surgery. Reinforcement learning (RL) is an appealing skill-learning method to facilitate automatic instrument delivery. However, the notorious sample inefficiency of RL has limited its application in this domain. To address this issue, this paper proposes a novel RL framework, Distributed Reinforcement learning with Adaptive Conservatism (DRAC), that learns manipulation skills with a modest amount of interaction. DRAC pretrains skills from rule-based interactions before online fine-tuning to utilize prior knowledge and improve sample efficiency. Moreover, DRAC uses adaptive conservatism to explore safely during online fine-tuning and a distributed structure to shorten training time. Experiments in a pre-clinical environment demonstrate that DRAC can deliver a guidewire to the target with less dangerous exploration and better performance than prior methods (success rate of 96.00% and mean backward steps of 9.54) within 20k interactions. These results indicate that the proposed algorithm is promising for learning skills for vascular robotic systems.
Impulsive Accelerated Reinforcement Learning for H∞ Control
ABSTRACT. This paper revisits reinforcement learning for $H_\infty$ control of affine nonlinear systems with partially unknown dynamics. By incorporating an impulsive momentum-based control into the conventional critic neural network, an impulsive accelerated reinforcement learning algorithm, introducing an accelerated gradient flow with a restart mechanism, is proposed to improve the convergence speed and transient performance compared to traditional gradient descent-based techniques or continuously accelerated gradient methods. Moreover, by utilizing the quasi-periodic Lyapunov function method, an asymptotic stability criterion of the closed-loop system is established. A numerical example with comparisons is provided to illustrate the theoretical results.
ABSTRACT. Target association is an extremely important problem in the field of multi-object tracking, especially for pedestrian scenes with high appearance similarity and dense distribution. The traditional approach of combining IOU and ReID techniques with the Hungarian algorithm only partially addresses these challenges. To improve the model's association matching ability, this paper proposes a block matching model that extracts local features using a Block Matching Module (BMM) based on the Transformer model. The BMM divides features into blocks and mines effective features of the target to evaluate target similarity. Additionally, a Euclidean Distance Module (EDM) based on a Euclidean distance association matching strategy is introduced to further enhance the model's association ability. By integrating the BMM and EDM into the same multi-object tracking model, this paper establishes a novel model called BWTrack that achieves excellent performance on MOT16, MOT17, and MOT20 while running at 7 FPS on a single GPU.
PnP: Integrated Prediction and Planning for Interactive Lane Change in Dense Traffic
ABSTRACT. Making human-like behaviors for autonomous driving in interactive scenarios is critical and challenging, which requires the self-driving vehicle to reason about interactive vehicles' reactions to its behavior. We propose an integrated prediction and planning (PnP) decision-making method to address this task. To consider the interactive behaviors, a reactive trajectory prediction model is designed to predict the future states of other actors. Then, n-step temporal-difference search combining the value estimation network with the reactive prediction model is used to make a tactical decision and plan the tracking trajectory for the self-driving vehicle. The proposed PnP method is evaluated in the CARLA simulator and the results verify that PnP achieves better performances than popular model-free and model-based reinforcement learning baselines.
Modeling online adaptive navigation in virtual environments based on PID control
ABSTRACT. It is well known that locomotion-dominated navigation tasks may strongly provoke cybersickness effects. Past research has proposed numerous approaches to tackle this issue based on offline considerations. In this work, a novel approach to mitigate cybersickness is presented based on online adaptive navigation. Considering the Proportional-Integral-Derivative (PID) control method, we propose a mathematical model for online adaptive navigation parameterized with several parameters, taking as input the users' electro-dermal activity (EDA), an efficient indicator of the cybersickness level, and providing as output adapted navigation accelerations. Minimizing the cybersickness level is therefore regarded as an optimization problem: find the PID model parameters that reduce the severity of cybersickness. User studies were organized to collect non-adapted navigation accelerations and the corresponding EDA signals. A deep neural network was then formulated to learn the correlation between EDA and navigation accelerations. The hyperparameters of the network were obtained through the Optuna open-source framework. To validate the performance of the optimized online adaptive navigation developed through PID control, we performed an analysis in a simulated user study based on the pre-trained deep neural network. Results indicate a significant reduction of cybersickness in terms of EDA signal analysis and motion sickness dose value. This is a pioneering work that presents a systematic strategy for adaptive navigation settings from a theoretical point of view.
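The PID mapping from the measured EDA signal to an acceleration adjustment can be sketched as below. The target EDA level, the gains and the time step are placeholders; in the paper these parameters are the quantities being optimized.

class PIDNavigationAdapter:
    # error = target EDA - measured EDA; output = correction added to the navigation acceleration
    def __init__(self, kp, ki, kd, target_eda, dt=0.1):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.target_eda, self.dt = target_eda, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, measured_eda, base_acceleration):
        error = self.target_eda - measured_eda
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        correction = self.kp * error + self.ki * self.integral + self.kd * derivative
        return base_acceleration + correction   # adapted navigation acceleration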
A 3D UWB hybrid localization method based on BSR and L-AOA
ABSTRACT. In this paper, a base station reliability (BSR) measure and a low-cost angle of arrival (L-AOA) positioning method are proposed to optimize the result of the time difference of arrival (TDOA) positioning method; the optimized result is then substituted into the Taylor algorithm as the initial value for iterative refinement. Experimental results in a non-line-of-sight environment show that the proposed method improves localization accuracy by about 20% compared with TDOA alone. In addition, we apply the algorithm to the hybrid TDOA-Taylor method, and under the same error environment the positioning accuracy is improved by about 10%.
MEFaceNets: Multi-scale Efficient CNNs for Real-time Face Recognition on Embedded Devices
ABSTRACT. The trend of face recognition being widely used on terminals and embedded devices makes the trade-off between recognition accuracy and actual delay critical. To address this challenge, we propose an efficient bottleneck named MEBottleneck, which utilizes convolution kernels of different sizes on two branches to capture multi-scale features in the bottleneck, followed by a $1 \times 1$ expansion layer to fuse multi-scale features, thereby improving the representation ability. Then, to balance the trade-off between accuracy and latency, we design a family of lightweight models with MEBottleneck specifically for face recognition, named MEFaceNets. Large kernels are used for depthwise convolutions in shallow layers, resulting in improved accuracy. We evaluate the proposed models on several popular face recognition benchmarks. Our primary model achieves 99.80% face verification accuracy on LFW and exhibits excellent performance on the larger and more challenging benchmarks, including MegaFace Challenge 1, IJB-B and IJB-C. Meanwhile, the latency of our primary model is 90 ms on RK3399, which is sufficient to satisfy real-time recognition on the resource-constrained embedded device.
Federated learning using the Particle Swarm Optimization model for the early detection of COVID-19
ABSTRACT. The COVID-19 pandemic has created significant global health and socioeconomic challenges, which urges the need for efficient and effective early detection methods. Several traditional machine learning (ML) and deep learning (DL) approaches have been used for the detection of COVID-19, but ML and DL strategies face challenges such as transmission delays, a lack of computing power, communication delays, and privacy concerns. Federated Learning (FL) has emerged as a promising method for training models on decentralized data while ensuring privacy. In this paper, we present a novel FL framework for early detection of COVID-19 using a particle swarm optimization (PSO) model. The proposed framework combines the advantages of both FL and PSO. By employing the PSO technique, the model aims to achieve faster convergence and improved performance. To validate the effectiveness of the proposed approach, we performed experiments using a COVID-19 image dataset collected from different healthcare institutions. The results indicate that our approach is effective, achieving an accuracy of 94.36%, which is higher than that of traditional centralized learning approaches. Furthermore, the FL framework ensures data privacy and security by keeping sensitive patient information decentralized and only sharing aggregated model updates during the training process.
On Searching for Minimal Integer Representation of Undirected Graphs
ABSTRACT. The succinct representation of graphs is relevant to store, communicate, and sample the space of unstructured graphs meeting user-defined criteria. In this paper, we investigate the performance of eight classes of gradient-free optimization heuristics based on Differential Evolution to search for minimal integer representations of undirected graphs. Our computational experiments using graph instances with varying degrees of sparsity have shown the merit of exploration strategies to attain better convergence with few function evaluations. Our results have the potential to elucidate new number-based approaches for graph representation, design and optimization.
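One natural integer encoding of an undirected graph, which such gradient-free heuristics could then minimize over vertex relabelings, reads the upper triangle of the adjacency matrix as a bit string. This encoding and the brute-force baseline below are assumptions made for illustration; the paper's exact representation and search operators may differ.

import itertools
import networkx as nx

def graph_to_int(g, order=None):
    # read the upper-triangular adjacency bits row by row under a vertex ordering
    order = list(order) if order is not None else list(g.nodes())
    bits = 0
    for i, j in itertools.combinations(range(len(order)), 2):
        bits = (bits << 1) | int(g.has_edge(order[i], order[j]))
    return bits

def minimal_int_brute_force(g):
    # exact minimum over all orderings, feasible only for tiny graphs;
    # heuristics such as Differential Evolution search this space approximately
    return min(graph_to_int(g, p) for p in itertools.permutations(g.nodes()))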
Anti-Interference Zeroing Neural Network Model for Time-Varying Tensor Square Root Finding
ABSTRACT. Square root finding plays an important role in many scientific and engineering fields, such as optimization, signal processing and state estimation, but existing research mainly focuses on solving the time-invariant matrix square root problem. So far, few researchers have studied the time-varying tensor square root (TVTSR) problem. In this study, a novel anti-interference zeroing neural network (AIZNN) model is proposed to solve the TVTSR problem online. With the activation of the advanced power activation function (APAF), the AIZNN model is robust in solving the TVTSR problem in the presence of vanishing and non-vanishing disturbances. We present a detailed theoretical analysis showing that, with the AIZNN model, the error trajectory converges to zero within a fixed time, and we also derive an upper bound on the convergence time. Numerical experiments further verify the robustness of the proposed AIZNN model. Both the theoretical analysis and the numerical experiments show that the proposed AIZNN model provides a novel and noise-tolerant way to solve the TVTSR problem online.
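For readers unfamiliar with zeroing neural networks, the generic construction behind such models is written below in LaTeX; the tensor product, the disturbance term and the activation are kept generic, and the specific design of the advanced power activation function is not reproduced here.

% Generic zeroing-neural-network construction for X(t) * X(t) = A(t),
% where * denotes the chosen tensor product.
\begin{align}
  E(t)       &= X(t) * X(t) - A(t), \\
  \dot{E}(t) &= -\gamma\, \Phi\bigl(E(t)\bigr) + \Delta(t),
\end{align}
% with design gain \gamma > 0, activation \Phi applied element-wise, and
% disturbance \Delta(t). Choosing \Phi as a power-type (fixed-time-stable)
% activation is what yields convergence of E(t) to zero within a bounded time.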
CACL: Commonsense-Aware Contrastive Learning for Knowledge Graph Completion
ABSTRACT. Most knowledge graphs (KGs) are incomplete in the real world, so knowledge graph completion (KGC) is widely investigated to predict the most credible missing facts from given knowledge. However, existing KGC methods rely heavily on the given facts to predict missing relations between entities, ignoring the value of external knowledge. In addition, previous knowledge representation methods ignore the multi-perspective characteristics of related knowledge, which leads to the inability to obtain high-level semantic representations of knowledge. To alleviate these issues, this paper proposes a Commonsense-Aware Contrastive Learning (CACL) framework, which extracts relevant knowledge triples from an existing commonsense knowledge base to assist KGC. Moreover, our method employs a knowledge-contrastive representation learning method to acquire higher-order representations from multiple perspectives. Experiments show that our method improves the performance of basic knowledge graph embedding (KGE) models and can be easily adapted to various KGE models.
Identifying Self-Admitted Technical Debt with Context-based Ladder Network
ABSTRACT. Technical debt is inevitable in software development. The accumulation of technical debt will make the software fixes prohibitively expensive. Self-admitted technical debt (SATD) is a type of technical debt. Identifying SATD in code comments can improve code quality. However, manually discerning whether code comments contain SATD would be expensive and time-consuming. To solve this problem, we propose a method to apply the Ladder Network with the pre-training model to identify SATD based on the labeled data from 10 open source projects and the unlabeled data from another ten projects. By comparing with the original model of Ladder Network, and other semi-supervised learning models, the results show that the proposed method performs better in technical debt identification. In addition, the proposed method also achieves better results compared with supervised learning methods. This shows that our approach can make better use of unlabeled data to improve classification performance.
A Novel Machine Learning Model using CNN-LSTM Parallel Networks for Predicting Ship Fuel Consumption
ABSTRACT. With the continuous increase of carbon emissions, precise prediction of ship fuel consumption is gaining significance for reducing the energy consumption and emissions of ships. However, existing approaches for estimating fuel consumption still have significant room for improvement in terms of efficiency and accuracy. Furthermore, previous studies have not focused on capturing both short- and long-term properties and the traits of multi-sensor data. Considering the above issues, a novel machine learning model with CNN-LSTM parallel networks is proposed in this paper by combining convolutional neural network, long short-term memory and artificial neural network models. The proposed model integrates the advantages of the three single models: it comprehensively considers the temporal and non-linear properties of fuel consumption data through the convolutional neural network and long short-term memory, and utilizes artificial neural networks and a parallel learning mechanism to achieve multi-source data fusion. Moreover, the proposed model is shown to be effective on multi-source data from a liquefied petroleum gas carrier. Experimental outcomes suggest that the CNN-LSTM parallel network is the best choice, with an RMSE of 0.0243, which is 5.81%, 58.25% and 37.85% lower than that of the convolutional neural network, long short-term memory and artificial neural network, respectively. Therefore, the proposed model can significantly enhance the energy efficiency of the ship and reduce operating expenses and emissions.
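The parallel-branch fusion can be sketched in PyTorch as follows. The layer sizes, the use of 1-D convolutions over the sensor window, and the choice of separate static features for the ANN branch are assumptions made for this sketch.

import torch
import torch.nn as nn

class CNNLSTMParallel(nn.Module):
    # illustrative parallel CNN / LSTM / ANN model for fuel-consumption regression
    def __init__(self, n_sensors, static_dim, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(                 # local temporal patterns
            nn.Conv1d(n_sensors, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.lstm = nn.LSTM(n_sensors, hidden, batch_first=True)            # long-term dependence
        self.ann = nn.Sequential(nn.Linear(static_dim, hidden), nn.ReLU())  # static ship traits
        self.head = nn.Linear(3 * hidden, 1)      # fuse the three branches

    def forward(self, window, static_feats):
        # window: (B, T, n_sensors); static_feats: (B, static_dim)
        c = self.cnn(window.transpose(1, 2)).squeeze(-1)   # (B, hidden)
        _, (h, _) = self.lstm(window)
        l = h[-1]                                          # (B, hidden)
        a = self.ann(static_feats)                         # (B, hidden)
        return self.head(torch.cat([c, l, a], dim=1))      # (B, 1) predicted fuel consumption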
Deep Learning-Empowered Unsupervised Maritime Anomaly Detection
ABSTRACT. Automatically detecting anomalous vessel behaviour is an extremely crucial problem in intelligent maritime surveillance. In this paper, a deep learning-based unsupervised method is proposed for detecting anomalies in vessel trajectories, operating at both the image and pixel levels. The original trajectory data is converted into a two-dimensional matrix representation to generate a vessel trajectory image. A Wasserstein generative adversarial network (WGAN) model is trained on a dataset of normal vessel trajectories, while an encoder is simultaneously trained to map the trajectory image to a latent space. During anomaly detection, the vessel trajectory image is mapped to a hidden vector by the encoder, which is then used by the generator to reconstruct the input image. The anomaly score is computed from the residuals between the input and reconstructed trajectory images together with the discriminator's feature residuals, enabling image-level anomaly detection. Furthermore, pixel-level anomaly detection is achieved by analyzing the residuals of the reconstructed image pixels to localize the anomalous trajectory. The proposed method is compared to autoencoder (AE) and variational autoencoder (VAE) techniques, and experimental results demonstrate its superior performance in anomaly detection and pixel-level localization. This method has substantial potential for detecting anomalies in vessel trajectories, as it can detect anomalies in arbitrary waters without prior knowledge, relying solely on training with normal vessel trajectories. This approach significantly reduces the need for human and material resources. Moreover, it provides valuable insights and references for trajectory anomaly detection in other domains, holding both theoretical and practical importance.
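The image-level score described above can be sketched in the style of f-AnoGAN scoring; the weighting factor and the use of an intermediate discriminator feature map are assumptions of this sketch rather than the paper's exact formulation.

import torch

def anomaly_score(x, encoder, generator, disc_features, lam=0.1):
    # x: (B, C, H, W) trajectory images; disc_features returns an intermediate
    # feature map of the discriminator
    with torch.no_grad():
        x_rec = generator(encoder(x))                                # reconstruct from the latent code
        rec_residual = (x - x_rec).abs()                             # pixel-level residuals
        feat_residual = (disc_features(x) - disc_features(x_rec)).abs()
    score = rec_residual.flatten(1).mean(dim=1) + lam * feat_residual.flatten(1).mean(dim=1)
    return score, rec_residual   # rec_residual supports pixel-level localization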
Traffic Data Recovery and Outlier Detection based on Non-Negative Matrix Factorization and Truncated-Quadratic Loss Function
ABSTRACT. Intelligent Transportation System (ITS) plays a critical role in managing traffic flow and ensuring safe transportation. However, the presence of missing and corrupted traffic data may undermine the accuracy and reliability of the system. The problem of recovering traffic data can often be transformed into a low-rank matrix factorization problem by exploiting the intrinsic low-rank characteristics of the traffic matrix. While many existing methods demonstrate excellent recovery performance under the assumption of noiseless or Gaussian noise, they often exhibit suboptimal performance in the presence of outliers. In this paper, we propose a novel method for recovering traffic data using non-negative matrix factorization with a truncated-quadratic loss function. Although the objective function in our model is non-convex and non-smooth, we convert it to a convex formulation using half-quadratic theory. Then, a solver based on block coordinate descent is developed. Our experiments on real-world traffic datasets demonstrate superior performance compared to state-of-the-art methods.
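A common form of the truncated-quadratic loss and the resulting factorization objective is written below in LaTeX; the truncation threshold and any regularization terms used in the paper may differ.

% Truncated-quadratic loss on an entry-wise residual e, with threshold \tau > 0:
\begin{equation}
  \rho_\tau(e) =
  \begin{cases}
    e^2,    & |e| \le \tau, \\
    \tau^2, & |e| > \tau,
  \end{cases}
  \qquad
  \min_{W \ge 0,\; H \ge 0} \; \sum_{(i,j) \in \Omega} \rho_\tau\bigl(M_{ij} - (WH)_{ij}\bigr),
\end{equation}
% where \Omega indexes the observed traffic-matrix entries. Half-quadratic theory
% introduces auxiliary weights so that each subproblem becomes a weighted
% least-squares step, solvable by block coordinate descent.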
Road Surface Segmentation and Detection Under Extreme Weather Conditions Based on Mask-RCNN
ABSTRACT. The recognition of road surface conditions is a crucial factor that impacts traffic safety management and control. However, conventional intelligent models face challenges in accurately detecting road surfaces during extreme weather, which significantly hampers recognition. This paper proposes an intelligent detection model for road surfaces under extreme weather conditions. The model uses a Mask-RCNN architecture with Resnet101 and Feature Pyramid Network layers as its backbone network. In addition, we present a detailed study of the parameter training method for the model. This enables the design of a complete process for intelligent road surface detection in severe weather. The processing method eliminates interference in the image and then utilizes the proposed model to conduct object detection and segmentation on the processed image. The experimental results demonstrate that the proposed model can accurately detect road surfaces, ultimately improving the recognition accuracy of road surface conditions.
Causal-Inspired Influence Maximization in Hypergraphs Under Temporal Constraints
ABSTRACT. Influence Maximization (IM) is a significant problem that aims to find a set of seed nodes to maximize the spread of given events in social networks. Previous studies have contributed to the efficiency and online dynamics of basic IM on classical graph structures. However, they lack adequate consideration of individual and group behavior in the propagation probability. This can be attributed to inadequate attention to node Individual Treatment Effects (ITE), which significantly divide the sensitive attributes of nodes and impact the probability of propagation. Additionally, research on temporally constrained influence spreading under higher-order interference on hypergraphs is limited. To fill these two gaps, we introduce two sets of basic assumptions about the impact of ITE on the propagation process and develop a new diffusion model: the Latency Aware Contact Process on Causal Independent Cascading (LT-CPCIC) under time constraints on hypergraphs. We then design the Causal-Inspired Cost-Effective Balanced Selection algorithm (CICEB) for the proposed model. CICEB first recovers node ITE from observational data and then uses three types of debiasing strategies to weaken the correlation between the propagation effects of different pre- and post-nodes. Finally, we compare CICEB with traditional methods on two real-world datasets and show that it achieves better effectiveness and robustness.
Detection of Anomalies and Explanation in Cybersecurity
ABSTRACT. Histogram-based anomaly detectors have gained significant attention and application in the field of intrusion detection because of the high efficiency in identifying anomalous patterns. However, they fail to explain why a given data point is flagged as an anomaly. Outlying aspect mining aims to detect aspects (a.k.a subspaces) where a given anomaly significantly differs from others. In this paper, we have proposed a simple but effective and efficient solution – HMass. In addition to detecting anomalies, HMass provides explanations on why the points are anomalous. The effectiveness and efficiency of HMass are evaluated using comparative analysis on seven cyber security datasets, covering the tasks of anomaly detection and outlying aspect mining.
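The general histogram-based scoring idea can be sketched as below, in the style of HBOS; this is only an illustration of the family of detectors the abstract refers to, not the HMass algorithm or its outlying-aspect search.

import numpy as np

def histogram_anomaly_scores(X, bins=20):
    # sum over features of the negative log density of the bin each value falls into
    # (a higher score means a more anomalous point)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    scores = np.zeros(n)
    for j in range(d):
        hist, edges = np.histogram(X[:, j], bins=bins, density=True)
        idx = np.clip(np.digitize(X[:, j], edges[1:-1]), 0, bins - 1)
        scores += -np.log(hist[idx] + 1e-12)
    return scores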
Probabilistic AutoRegressive Neural Networks for Accurate Long-range Forecasting
ABSTRACT. Forecasting time series data is a critical area of research with applications spanning from stock prices to early epidemic prediction. While numerous statistical and machine learning methods have been proposed, real-life prediction problems often require hybrid solutions that bridge classical forecasting approaches and modern neural network models. In this study, we introduce the Probabilistic AutoRegressive Neural Networks (PARNN), capable of handling complex time series data exhibiting non-stationarity, nonlinearity, non-seasonality, long-range dependence, and chaotic patterns. PARNN is constructed by improving autoregressive neural networks (ARNN) using autoregressive integrated moving average (ARIMA) feedback error, combining the explainability, scalability, and "white-box-like" prediction behavior of both models. Notably, the PARNN model provides uncertainty quantification through prediction intervals, setting it apart from advanced deep learning tools. Through comprehensive computational experiments, we evaluate the performance of PARNN against standard statistical, machine learning, and deep learning models, including Transformers, NBeats, and DeepAR. Diverse real-world datasets from macroeconomics, tourism, epidemiology, and other domains are employed for short-term, medium-term, and long-term forecasting evaluations. Our results demonstrate the superiority of PARNN across various forecast horizons, surpassing the state-of-the-art forecasters. The proposed PARNN model offers a valuable hybrid solution for accurate long-range forecasting. By effectively capturing the complexities present in time series data, it outperforms existing methods in terms of accuracy and reliability. The ability to quantify uncertainty through prediction intervals further enhances the model's usefulness in decision-making processes.
FEGI: A Fusion Extractive-Generative Model for Dialogue Ellipsis and Coreference Integrated Resolution
ABSTRACT. Dialogue systems in the open domain have achieved great success due to easily obtained single-turn corpora and the development of deep learning, but the multi-turn scenario remains a challenge because of frequent coreference and information omission. In this paper, we aim to quickly retrieve the omitted or coreferred expressions contained in the dialogue history and restore them into the incomplete utterance. Jointly inspired by the generative method for text generation and the extractive method for span extraction, we propose a fusion extractive-generative dialogue ellipsis and coreference integrated resolution model (FEGI). In detail, we introduce two training tasks, OMIT and SPAN, to extract missing semantic expressions, then integrate the obtained expressions into the decoding initial and copy stages of the generative model respectively. To support the training tasks, we introduce an algorithm for secondary reconstruction annotation based on existing publicly available corpora via an unsupervised technique, which works even when the missing semantic expressions are not annotated. Moreover, we conduct dozens of joint learning experiments on the CamRest676 and RiSAWOZ datasets. Experimental results show that our proposed model significantly outperforms the state-of-the-art models in terms of quality.
POI Recommendation based on Double-level Spatio-temporal Relationship in Locations and Categories
ABSTRACT. The sparsity of user check-in trajectory data is a great challenge for point of interest (POI) recommendation. To alleviate data sparsity, existing research often utilizes the geographic and time information in check-in trajectory data to discover hidden spatio-temporal relations. However, existing models only consider the spatio-temporal relationship between locations, ignoring that between POI categories. To further reduce the negative impact of data sparsity, and motivated by the attention-based integration of spatio-temporal relationships in LSTPM, this paper proposes a POI recommendation model based on the double-level spatio-temporal relationship in locations and categories (POI2TS). POI2TS integrates the spatio-temporal relationship between locations and that between categories through an attention mechanism to more accurately capture users' preferences. Test results on the NYC and TKY datasets show that POI2TS is more accurate than state-of-the-art models, which verifies that integrating the spatio-temporal relationship between locations and that between categories can effectively improve POI recommendation.
Label Selection Algorithm Based on Ant Colony Optimization and Reinforcement Learning for Multi-label Classification
ABSTRACT. Multi-label classification handles scenarios where each instance can be annotated with multiple non-exclusive but semantically related labels simultaneously. Despite significant progress, multi-label classification is still challenging because many emerging applications lead to high-dimensional label spaces. Researchers have applied feature dimensionality reduction techniques to the label space by using label correlation information, yielding two families of techniques: label embedding and label selection. There have been many successful algorithms for label embedding, but less attention has been paid to label selection. In this paper, we propose a label selection algorithm for multi-label classification, LS-AntRL, which combines ant colony optimization and reinforcement learning. The method helps the ant colony algorithm search the space more effectively by using a temporal difference (TD) reinforcement learning algorithm that learns directly from the ants' experience. For heuristic learning, we model the ant colony optimization problem as a reinforcement learning problem, that is, we model label selection as a Markov decision process, where each label represents a state and the unvisited labels an ant can select represent the set of actions. The state transition rules of the ant colony optimization algorithm constitute the transition function of the Markov decision process, and the state value function is updated by the TD formula to form a heuristic function in ant colony optimization. After performing label selection, we train a binary weighted neural network to recover the low-dimensional label space back to the original label space. We apply the above model to five benchmark datasets with more than 100 labels. Experimental results show that our method achieves better classification performance than other advanced methods in terms of two evaluation metrics (Precision@n and DCG@n).
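The following sketch illustrates, under stated assumptions, how a TD-learned state-value function can serve as the heuristic inside an ant-colony label-selection loop, as the abstract above describes. The per-label reward, all hyperparameters, and the exponential heuristic transform are illustrative choices, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

n_labels, n_select, n_ants, n_iters = 20, 5, 10, 50
alpha, beta, rho = 1.0, 2.0, 0.1          # pheromone/heuristic weights, evaporation rate
lr, gamma = 0.1, 0.9                       # TD(0) learning rate and discount

reward = rng.random(n_labels)              # hypothetical per-label reward (e.g., label relevance)
pheromone = np.ones(n_labels)
value = np.zeros(n_labels)                 # state-value function learned by TD(0)

def heuristic(v):
    # shift values so they are positive and can act as an ACO heuristic term
    return np.exp(v - v.max())

for _ in range(n_iters):
    best_path, best_score = None, -np.inf
    for _ in range(n_ants):
        visited, path = np.zeros(n_labels, bool), []
        for _ in range(n_select):
            probs = (pheromone ** alpha) * (heuristic(value) ** beta)
            probs[visited] = 0.0
            probs /= probs.sum()
            nxt = rng.choice(n_labels, p=probs)
            if path:                       # TD(0) update from the previous label to the chosen one
                s = path[-1]
                value[s] += lr * (reward[nxt] + gamma * value[nxt] - value[s])
            visited[nxt] = True
            path.append(nxt)
        score = reward[path].sum()
        if score > best_score:
            best_path, best_score = path, score
    pheromone *= (1.0 - rho)               # evaporation
    pheromone[best_path] += best_score     # reinforce the best ant's labels

print("selected labels:", sorted(best_path))
```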
Nonlinear Multiple-delay Feedback Based Kernel Least Mean Square Algorithm
ABSTRACT. In this paper, a novel algorithm called nonlinear multiple-delay feedback kernel least mean square (NMDF-KLMS) is proposed by introducing nonlinear multiple-delay feedback into the framework of multikernel adaptive filtering. The proposed algorithm incorporates the nonlinear multiple-delay feedback to enhance filtering performance in comparison with kernel adaptive filtering algorithms using linear feedback. Furthermore, a theoretical mean-square convergence analysis of NMDF-KLMS is also conducted. Simulation results on chaotic time-series prediction and real-world data applications show that NMDF-KLMS achieves a faster convergence rate and superior filtering accuracy.
Binary Mother Tree Optimization Algorithm for 0/1 Knapsack Problem
ABSTRACT. The knapsack problem is a well-known strongly NP-complete problem in which the profit of a collection of items placed in the knapsack is maximized under a weight capacity constraint. In this paper, a novel Binary Mother Tree Optimization Algorithm (BMTO) and a Knapsack Problem Framework (KPF) are proposed to find an efficient solution for the 0/1 knapsack problem in a short time. The proposed BMTO method is built on the original MTO and a binary module that enables optimization in a discrete space. The binary module converts a set of real numbers, equal in size to the dimension of the knapsack problem, into binary values using a threshold and the sigmoid function. The KPF, in turn, makes implementing a metaheuristic algorithm to solve the knapsack problem much simpler. To assess the performance of the proposed solutions, extensive experiments are conducted, including several statistical analyses of the resulting solutions on two sets of knapsack instances (small and large scale). The results demonstrate that BMTO can produce efficient solutions for knapsack instances of different sizes in a short time, and that it outperforms the Binary Particle Swarm Optimization (BPSO) and Binary Bacterial Foraging (BBF) algorithms in terms of best solution and time. In addition, the results of BPSO and BBF show the effectiveness of KPF compared to the results in the literature.
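A minimal sketch of the binary module described above: real-valued candidate positions are passed through the sigmoid function and thresholded into a 0/1 selection vector, which is then scored on a toy 0/1 knapsack instance. The 0.5 threshold and the zero-profit penalty for infeasible selections are assumptions; the paper's exact threshold and fitness handling may differ.

```python
import numpy as np

def binarize(position, threshold=0.5):
    """Map a real-valued position vector to a 0/1 item-selection vector
    via the sigmoid function and a fixed threshold (threshold value assumed)."""
    return (1.0 / (1.0 + np.exp(-position)) > threshold).astype(int)

def knapsack_fitness(selection, profits, weights, capacity):
    """Total profit if the weight constraint holds, otherwise 0 (penalty scheme assumed)."""
    if weights @ selection > capacity:
        return 0.0
    return float(profits @ selection)

# toy 0/1 knapsack instance
rng = np.random.default_rng(1)
profits = rng.integers(1, 20, size=15)
weights = rng.integers(1, 10, size=15)
capacity = 40

position = rng.normal(size=15)            # a candidate produced by the real-valued optimizer
selection = binarize(position)
print(selection, knapsack_fitness(selection, profits, weights, capacity))
```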
A Comprehensive Review of Arabic Question Answering Datasets
ABSTRACT. The research community has shown significant interest in the field of Question Answering (QA) due to the strong relevance of QA applications. In recent years, there has been a significant increase in the availability of publicly accessible datasets aimed at advancing research on Arabic QA systems. This survey identifies, summarizes, and analyzes current Arabic QA datasets, covering monolingual, multilingual, and cross-lingual resources, and provides a comprehensive, multi-faceted classification. Furthermore, this study aims to guide research in Arabic QA by providing the latest updates on the state of the art in this field and identifying shortcomings in the current datasets so that more substantial and improved collections can be developed. Finally, we discuss the existing challenges in Arabic QA datasets and highlight their potential benefits for future research.
An End-To-End Structure with novel position mechanism and improved EMD for Stock Forecasting
ABSTRACT. As a branch of time series forecasting, stock movement forecasting is one of the challenging problems for investors and researchers. Since Transformer was introduced to analyze financial data, many researchers have dedicated themselves to forecasting stock movement using the Transformer or attention mechanisms. However, existing research mostly focuses on individual stock information but ignores stock market information and high noise in stock data. In this paper, we propose a novel method using the attention mechanism in which both stock market information and individual stock information are considered. Meanwhile, we propose a novel EMD-based algorithm for reducing short-term noise in stock data. Two randomly selected exchange-traded funds (ETFs) spanning over ten years from US stock markets are used to demonstrate the superior performance of the proposed attention-based method. The experimental analysis demonstrates that the proposed attention-based method significantly outperforms other state-of-the-art baselines.
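The abstract above mentions an EMD-based algorithm for removing short-term noise from stock data. The sketch below shows only the generic idea of EMD denoising, rebuilding a series after discarding its highest-frequency intrinsic mode function; it is not the paper's improved EMD algorithm, and it assumes the third-party PyEMD package is installed.

```python
import numpy as np
from PyEMD import EMD   # assumption: the PyEMD package is available

def emd_denoise(signal, drop_imfs=1):
    """Crude EMD-based denoising: decompose the series and rebuild it without
    the first `drop_imfs` (highest-frequency) intrinsic mode functions."""
    imfs = EMD().emd(signal)          # rows are IMFs, ordered from highest to lowest frequency
    return imfs[drop_imfs:].sum(axis=0)

# toy example: trend plus oscillation plus short-term noise
t = np.linspace(0, 10, 500)
prices = 0.5 * t + np.sin(2 * np.pi * t) + 0.2 * np.random.default_rng(0).normal(size=t.size)
smoothed = emd_denoise(prices, drop_imfs=1)
print(prices[:5], smoothed[:5])
```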
ABSTRACT. In a recent paper, a distributed $k$WTA model has been introduced, in which the recurrent connections are defined by a Laplacian matrix and the neuronal function is defined as a Heaviside function. While the recent model is defined as a continuous-time model, this paper presents a discrete-time version of the model, and the conditions for ensuring correct output and finite-time convergence are shown. We then introduce the application of the discrete-time distributed $k$WTA as a decentralized mechanism for auctions. Technical problems regarding its actual implementation are outlined.
A Two-Stage Active Learning Algorithm for NLP Based on Feature Mixing
ABSTRACT. Active learning (AL) aims to improve model performance with minimal data annotation. While recent AL studies have utilized feature mixing to identify unlabeled instances with novel features, applying it to natural language processing (NLP) tasks has been challenging due to the discrete nature of text tokens and the limited contribution of some novel features. To address these issues, we propose a two-stage acquisition method based on feature mixing for NLP tasks. We first create a mixed feature for both labeled and unlabeled instances to identify the features in the unlabeled instances that the model cannot recognize. Next, we evaluate the contribution of these novel features to the model using the entropy of the nearest labeled neighbors. The proposed method enables the model to select the most informative samples in the unlabeled pool. Experiments on sentiment analysis, topic classification, and natural language inference validate that our method not only outperforms other AL approaches but also improves the efficiency of batch data acquisition.
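A possible reading of the second stage described above is to score each unlabeled instance by the entropy of the class distribution among its nearest labeled neighbors in feature space. The sketch below implements only that scoring step; the feature-mixing first stage is omitted, and the Euclidean distance, k value, and batch size are assumptions.

```python
import numpy as np

def neighbor_entropy_scores(unlabeled_feats, labeled_feats, labeled_y, n_classes, k=10):
    """Score each unlabeled instance by the entropy of the class distribution
    of its k nearest labeled neighbors (Euclidean distance; details assumed)."""
    scores = np.empty(len(unlabeled_feats))
    for i, x in enumerate(unlabeled_feats):
        d = np.linalg.norm(labeled_feats - x, axis=1)
        nn_labels = labeled_y[np.argsort(d)[:k]]
        p = np.bincount(nn_labels, minlength=n_classes) / k
        p = p[p > 0]
        scores[i] = -(p * np.log(p)).sum()
    return scores

rng = np.random.default_rng(0)
labeled_feats, labeled_y = rng.normal(size=(200, 32)), rng.integers(0, 4, 200)
unlabeled_feats = rng.normal(size=(1000, 32))
scores = neighbor_entropy_scores(unlabeled_feats, labeled_feats, labeled_y, n_classes=4)
query_idx = np.argsort(-scores)[:16]      # acquire the most ambiguous batch for annotation
```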
Accelerate Support Vector Clustering via Spectral Data Compression
ABSTRACT. This paper proposes a novel framework for accelerating support vector clustering (SVC). The proposed method first computes much smaller compressed data sets while preserving the key cluster properties of the original data sets, based on a novel spectral data compression approach. The resulting spectrally-compressed data sets are then leveraged to develop a fast and high-quality algorithm for support vector clustering. We conducted extensive experiments using real-world data sets and obtained very promising results. The proposed method achieves 100X and 115X speedups over the state-of-the-art SVC method on the Pendigits and USPS data sets, respectively, while achieving even better clustering quality. To the best of our knowledge, this represents the first practical method for high-quality and fast SVC on large-scale real-world data sets.
Enhancing Spatial Consistency and Class-level Diversity for Segmenting Fine-grained Objects
ABSTRACT. Semantic segmentation is a fundamental computer vision task attracting a lot of attention. However, limited work has focused on semantic segmentation in the fine-grained class scenario, which involves more classes and greater inter-class similarity. Due to the lack of data available for this task, we establish two segmentation benchmarks, CUB-seg and FGSCR42-seg, based on the CUB and FGSCR42 datasets. To address the two major problems in this task, spatial inconsistency and confusion between extremely similar classes, we propose the Spatial Consistency and Class-level Diversity enhancement Network. First, we build the Spatial Consistency Enhancement Module to take advantage of the low-frequency information in the features, enhancing spatial consistency. Second, a Fine-grained Regions Contrastive Loss is designed to make the features of different classes more discriminative, promoting class-level diversity. Extensive experiments show that our method can significantly improve performance compared to baseline models. A visualization study also demonstrates the effectiveness of our method in enhancing spatial consistency and class-level diversity.
Effective Domain Adaptation for Robust Dysarthric Speech Recognition
ABSTRACT. By transferring knowledge from abundant normal speech to limited dysarthric speech, dysarthric speech recognition (DSR) has witnessed significant progress. However, existing adaptation techniques mainly focus on fully leveraging normal speech while neglecting the sparse nature of dysarthric speech, which poses a great challenge for DSR training in low-resource scenarios. In this paper, we present an effective domain adaptation framework to build robust DSR systems with scarce target data. A joint data preprocessing strategy is employed to alleviate the sparsity of dysarthric speech and close the gap between source and target domains. To enhance adaptability to dysarthric speakers across different severity levels, a Domain-adapted Transformer model is devised to learn both domain-invariant and domain-specific features. Experimental results demonstrate that the proposed methods achieve impressive performance on both speaker-dependent and speaker-independent DSR tasks. Notably, even with half of the target training data, our DSR systems still maintain high accuracy on speakers with severe dysarthria.
An Adaptive Detector for Few Shot Object Detection
ABSTRACT. Few-shot object detection has made progress in recent years. However, most research assumes that base and new classes come from the same domain. In real-world applications, they often come from different domains, resulting in poor adaptability of existing methods. To address this problem, we designed an adaptive few-shot object detection framework. Building on the Meta R-CNN framework, we added an image domain classifier after the backbone's last layer to reduce domain discrepancy. To avoid the class feature confusion caused by aligning image feature distributions, we also added a feature filter module (CAFFM) to filter out features irrelevant to specific classes. We tested our method on three base/new splits and found significant performance improvements compared to the base model Meta R-CNN. On base/new split 2, mAP50 increased by approximately 8%, and on the remaining two splits, mAP50 improved by approximately 3%. Our method outperforms state-of-the-art methods in most cases across the three base/new splits, validating the efficacy and generality of our approach.
Design of Memristor-based Binarized Multi-layer Neural Network with High Robustness
ABSTRACT. Memristor-based neural networks are promising for alleviating the bottleneck of neuromorphic computing devices based on the von Neumann architecture. Various memristor-based neural networks, built with different memristor-based layers, have been proposed in recent years. However, memristor-based neural networks with full-precision weight values are affected by memristor conductance variations, which negatively impact performance. In contrast, binarized neural networks have only two weight states, so binarized networks built with memristors suffer little from memristor conductance variations. In this paper, a memristor-based batch normalization layer and a binarized fully connected layer are designed. Based on the proposed layers, a memristor-based binarized multi-layer neural network is built. The effectiveness of the network is substantiated through simulation experiments on pattern classification tasks. The robustness of the network is also explored, and the results show that the network is highly robust to such variations.
Multiclass Classification and Defect Detection of Steel tube using modified YOLO
ABSTRACT. Steel tubes are widely used in hazardous high-pressure environments such as petroleum, chemicals, natural gas and shale gas. Defects in steel tubes have serious negative consequences. Using deep learning object recognition to identify and detect defects can greatly improve inspection efficiency and drive industrial automation. In this work, we use the well-known YOLOv7 (You Only Look Once version 7) deep learning model and propose improvements to achieve accurate defect detection in steel tube images. First, the classification of the dataset is checked using a sequential model and AlexNet. A Coordinate Attention (CA) mechanism is then integrated into the YOLOv7 backbone network to improve the expressive power of the feature maps. Additionally, the SIoU (SCYLLA-Intersection over Union) loss function is used to speed up convergence in the presence of class imbalance in the dataset. Experimental results show that the evaluation metrics of the optimized and modified YOLOv7 algorithm outperform other models. This study demonstrates the effectiveness of the method in improving detection performance and provides a more effective solution for steel tube defect detection.
Efficient Prompt Tuning for Vision and Language Models
ABSTRACT. Recently, large-scale pre-trained visual language models have demonstrated excellent performance in many downstream tasks. A more efficient adaptation method for different downstream tasks is prompt tuning, which fixes the parameters of the visual language model and adjusts only the prompt parameters when adapting to downstream tasks, using the knowledge learned by the visual language model during pre-training to solve downstream problems. However, the loss of the downstream task and the original loss of the visual language model are not exactly the same during training. For example, CLIP uses a contrastive learning loss to train the model, while the downstream image classification task uses the cross-entropy loss commonly used in classification problems. Different losses have different guiding effects on the task, and the trend of the visual language model task accuracy during training also differs from that of the downstream task. The choice of an appropriate loss function and a reasonable prompt tuning method therefore have a great impact on model performance. We propose a more efficient prompt tuning method for CLIP; experiments on 11 datasets demonstrate that our method achieves better performance and faster convergence on the downstream task.
Category-wise Fine-Tuning for Image Multi-label Classification with Partial Labels
ABSTRACT. Image multi-label classification datasets are often partially labeled (for each sample, only the labels of some categories are known). One popular solution for training convolutional neural networks is to treat all unknown labels as negative labels, referred to as Negative mode. However, it produces wrong labels unevenly across categories, decreasing the binary classification performance on different categories to varying degrees. On the other hand, although Ignore mode, which ignores the contributions of unknown labels, may be less effective than Negative mode, it ensures the data contain no additional wrong labels, which is what Negative mode lacks. In this paper, we propose Category-wise Fine-Tuning (CFT), a new post-training method that can be applied to a model trained with Negative mode to improve its performance on each category independently. Specifically, CFT uses Ignore mode to fine-tune the logistic regressions (LRs) in the classification layer one by one. The use of Ignore mode reduces the performance decrease caused by the wrong labels of Negative mode during training. In particular, a Genetic Algorithm (GA) and binary cross-entropy are used in CFT for fine-tuning the LRs. The effectiveness of our methods is evaluated on the CheXpert competition dataset, where it achieves state-of-the-art results to our knowledge. A single model submitted to the competition server for official evaluation achieves mAUC 91.82% on the test set, which is the highest single-model score in the leaderboard and the literature. Moreover, our ensemble achieves mAUC 93.33% on the test set, superior to the best in the leaderboard and the literature (93.05%). The effectiveness of our methods is also evaluated on partially labeled versions of the MS-COCO dataset.
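To make the Ignore-mode idea concrete, the following minimal sketch computes a per-category binary cross-entropy that simply skips unknown labels, the kind of objective one could use when fine-tuning a single category's logistic regression. The −1 encoding for unknown labels is an assumption, and the GA-based fine-tuning mentioned in the abstract is not shown.

```python
import numpy as np

def ignore_mode_bce(logits, targets):
    """Binary cross-entropy over one category that skips unknown labels.
    Targets: 1 = positive, 0 = negative, -1 = unknown (encoding assumed)."""
    mask = targets != -1
    if not mask.any():
        return 0.0
    p = 1.0 / (1.0 + np.exp(-logits[mask]))
    y = targets[mask]
    eps = 1e-7
    return float(-(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)).mean())

logits = np.array([2.1, -0.3, 0.8, -1.5])
targets = np.array([1, -1, 0, -1])        # only two samples carry known labels for this category
print(ignore_mode_bce(logits, targets))
```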
Semantic Segmentation of Multispectral Remote Sensing Images with Class Imbalance Using Contrastive Learning
ABSTRACT. Affected by the distribution differences of ground objects, multispectral remote sensing images are characterized by a long-tailed distribution: a few classes (head classes) contain many instances, while most classes (tail classes, also called rare classes) contain only a few instances. Such class-imbalanced data bring a great challenge to the semantic segmentation of multispectral remote sensing images. To address this problem, this paper proposes a novel contrastive learning method (CoLM) for semantic segmentation of multispectral remote sensing images with class imbalance. First, we propose a semantic consistency constraint to maximize the similarity of semantic feature embeddings of the same class in the feature space. Then, a rebalancing sampling strategy is proposed to dynamically select the hard-to-predict samples in each class as anchor samples, imposing additional supervision, and a pixel-level supervised contrastive loss is used to improve the separability of rare classes in the decision space. Experimental results on two long-tailed remote sensing datasets show that our method can be easily integrated into existing segmentation models, effectively improving the segmentation accuracy of rare classes without incurring additional inference costs.
Curve Enhancement: A No-Reference Method for Low-light Image Enhancement
ABSTRACT. In this paper, we introduce an end-to-end method for enhancing low-light images without relying on paired datasets. Our solution is reference-free and unsupervised, effectively addressing the lack of real-world low-light paired datasets. Specifically, we design a Brightness Boost Curve (BB-Curve) that enhances the brightness of image pixels through a fine-grained mapping. Additionally, we propose a lightweight deep neural network that estimates the curve parameters and evaluates the quality of the enhanced images using a series of no-reference loss functions. We validate our method through experiments on several datasets and provide both subjective and quantitative evaluations to demonstrate its significant brightness enhancement capabilities, free from smearing and artifacts. Notably, our approach generalizes well while retaining details that are crucial for image interpretation. With its reduced network structure and simple curve mapping, our model achieves superior training speed and the best prediction performance among the compared methods.
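The exact form of the BB-Curve is not given in the abstract, so the sketch below only illustrates the general pattern of curve-based enhancement: a per-pixel parameter map, predicted by a network, steers a simple monotone brightening curve applied iteratively. The quadratic form, the iteration count, and the constant parameter map are all assumptions.

```python
import numpy as np

def brightness_curve(image, alpha, n_iter=4):
    """Illustrative pixel-wise brightening curve (the BB-Curve's exact form is not
    specified here; this quadratic map and iteration count are assumptions).
    `image` is in [0, 1]; `alpha` is a per-pixel parameter map in [0, 1]."""
    x = image.astype(np.float64)
    for _ in range(n_iter):
        x = x + alpha * x * (1.0 - x)     # monotone map that keeps values inside [0, 1]
    return x

low_light = np.random.default_rng(0).random((64, 64, 3)) * 0.3
alpha = np.full_like(low_light, 0.8)      # stand-in for the network's predicted curve parameters
enhanced = brightness_curve(low_light, alpha)
print(low_light.mean(), enhanced.mean())
```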
Text-to-Image Synthesis With Threshold-Equipped Matching-Aware GAN
ABSTRACT. In this paper, we propose a novel Threshold-Equipped Matching-Aware Generative Adversarial Network (ETMA-GAN) for text-to-image synthesis. By filtering out inaccurate negative samples, the discriminator can more accurately determine whether the generator has produced images that match the descriptions. In addition, to enhance the discriminator's ability to capture key semantic information, a word-level fine-grained supervisor is constructed, which in turn drives the generative model to achieve high-quality synthesis of image details. Numerous experiments and ablation studies on the Caltech-UCSD Birds 200 (CUB) and Microsoft Common Objects in Context (MS COCO) datasets demonstrate the effectiveness and superiority of the proposed method over existing methods. In terms of both subjective and objective evaluations, the proposed model has clear advantages over recent state-of-the-art methods, especially in synthesizing images with a higher degree of realism and better conformity to the text descriptions.
Effi-Seg: Rethinking EfficientNet Architecture for Real-time Semantic Segmentation
ABSTRACT. A popular strategy for designing a semantic segmentation model is to utilize a well-established pre-trained Deep Convolutional Neural Network (DCNN) as a feature extractor and replace the classification head with a decoder to generate segmented outputs. The advantage of this strategy is the ability to obtain a ready-made backbone with additional knowledge. However, it has several disadvantages, such as a lack of architectural knowledge, a significant semantic gap among the deep feature maps, and a lack of control over architectural changes to reduce memory overhead. To overcome these issues, we first study the complete architectures of EfficientNetV1 and EfficientNetV2, analyzing the architectural and performance gaps. Based on this analysis, we develop an efficient segmentation model called Effi-Seg by implementing several architectural changes to the backbone, which leads to better semantic segmentation results with improved efficiency. To enhance contextualization and achieve accurate object localization in the scene, we introduce a feature refinement module (FRM) and a semantic aggregation module (SAM) on the decoder side. The complete segmentation network comprises only 1.49 million parameters and 8.4 GFLOPs. We evaluate the proposed model on three popular benchmarks, and it demonstrates highly competitive results on all three datasets while maintaining excellent efficiency.
Spatiotemporal PM2.5 Pollution Prediction Using Cloud-Edge Intelligence
ABSTRACT. This study introduces a novel spatiotemporal method to predict fine dust (PM2.5) concentration levels in the air, a significant environmental and health challenge, particularly in urban and industrial locales. We capitalize on AI-powered edge computing and federated learning, applying historical data spanning 2018 to 2022 collected from four strategic sites in Mumbai: Kurla, Bandra-Kurla, Nerul, and Sector-19a-Nerul. These locations are known for high industrial activity and heavy traffic, contributing to increased pollution exposure. Our spatiotemporal model integrates the strengths of Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks, with the goal of predicting PM2.5 concentrations 24 hours into the future. Other machine learning algorithms, namely Support Vector Regression (SVR), Gated Recurrent Units (GRU), and Bidirectional LSTM (BiLSTM), were evaluated within the federated learning framework. Performance was assessed using Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R^2. The preliminary findings suggest that our CNN-LSTM model outperforms the alternatives, with an MAE of 0.466, RMSE of 0.522, and R^2 of 0.9877.
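A minimal sketch of a CNN-LSTM forecaster of the kind described above: 1D convolutions extract local patterns from a multivariate history window, an LSTM models temporal dependence, and a linear head emits the next 24 hourly values. All layer sizes, the 72-hour window, and the five input features are assumptions, not the study's configuration.

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    """Minimal CNN-LSTM forecaster for 24-step-ahead prediction (sizes assumed)."""
    def __init__(self, n_features=5, horizon=24):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(32, 64, batch_first=True)
        self.head = nn.Linear(64, horizon)

    def forward(self, x):                 # x: (batch, window, n_features)
        z = self.conv(x.transpose(1, 2))  # (batch, 32, window)
        out, _ = self.lstm(z.transpose(1, 2))
        return self.head(out[:, -1])      # forecast from the last hidden state

model = CNNLSTM()
y_hat = model(torch.randn(8, 72, 5))      # 72-hour history -> 24-hour PM2.5 forecast
print(y_hat.shape)                        # torch.Size([8, 24])
```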
Rapid APT Detection in Resource-Constrained IoT Devices Using Global Vision Federated Learning (GV-FL)
ABSTRACT. The increasing proliferation of Internet of Things (IoT) devices in industrial applications presents unique challenges in cybersecurity, specifically in detecting Advanced Persistent Threats (APTs). The limitations of traditional IoT devices, such as low computational power, further exacerbate these challenges. This paper proposes a novel approach to this problem, Global Vision Federated Learning (GV-FL), which leverages federated learning (FL) for efficient and effective APT detection in resource-constrained IoT devices. We comprehensively analyze APT attacks and their stealthy characteristics and highlight the shortcomings of existing detection methods. The GV-FL method presented in this work offers a unique solution by providing a global perspective of the IoT network, thus enabling rapid detection of APTs even on devices with limited resources. Our experimental evaluation demonstrates that GV-FL not only outperforms existing solutions in terms of detection accuracy and speed but also significantly reduces resource consumption, proving to be a promising approach to APT detection in IoT devices. We conclude by exploring potential future work and improvements to the GV-FL algorithm, setting the stage for a new paradigm in IoT cybersecurity.
PSO-enabled Federated Learning for detecting ships in supply chain management
ABSTRACT. Supply chain management plays a vital role in the efficient and reliable movement of goods across various platforms, which involves several entities and processes. Detecting ships and their related activities is of paramount importance in order to ensure successful logistics and security. In order to improve logistics planning, security, and risk management, a strong framework is required that offers an efficient and privacy-preserving solution for identifying ships in supply chain management.
In this paper, we propose a novel approach called PSO-enabled FL (PSO-FL) for ship detection in supply chain management. The proposed PSO-FL framework leverages the advantages of both federated learning (FL) and Particle Swarm Optimization (PSO) to address the challenges of ship detection in supply chain management. Thanks to the distributed nature of FL, a ship identification model can be trained cooperatively using data from several supply chain stakeholders, including port authorities, shipping firms, and customs agencies. By optimizing the selection of appropriate participants for model training, the PSO algorithm improves FL performance. We conduct extensive experiments using real-world ship data gathered from various sources to evaluate the effectiveness of our PSO-FL approach. The results demonstrate that our framework achieves a superior ship detection accuracy of 94.88\% compared to traditional centralized learning approaches and standalone FL methods. Furthermore, the PSO-FL framework demonstrates robustness, scalability, and privacy preservation, making it suitable for large-scale deployment in complex supply chain management scenarios.
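To illustrate how PSO could drive participant selection in an FL round, the sketch below runs a standard binary-PSO search over client subsets, scoring each subset with a hypothetical per-client utility under a participation budget. The utility values, the budget, the penalty for infeasible subsets, and all PSO coefficients are assumptions; the paper's actual fitness function is not specified here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_clients, swarm_size, n_rounds, budget = 20, 15, 30, 8
w, c1, c2 = 0.7, 1.5, 1.5                 # inertia and acceleration coefficients

# hypothetical per-client utility, e.g. a proxy for each client's expected contribution
client_utility = rng.random(n_clients)

def fitness(mask):
    """Mean utility of the selected clients; empty or over-budget selections are penalized."""
    if mask.sum() == 0 or mask.sum() > budget:
        return -1.0
    return float(client_utility[mask.astype(bool)].mean())

vel = np.zeros((swarm_size, n_clients))
masks = rng.integers(0, 2, size=(swarm_size, n_clients))
pbest = masks.copy()
pbest_fit = np.array([fitness(m) for m in masks])
gbest = pbest[pbest_fit.argmax()].copy()

for _ in range(n_rounds):
    r1, r2 = rng.random(vel.shape), rng.random(vel.shape)
    vel = w * vel + c1 * r1 * (pbest - masks) + c2 * r2 * (gbest - masks)
    masks = (1.0 / (1.0 + np.exp(-vel)) > rng.random(vel.shape)).astype(int)  # binary PSO update
    fits = np.array([fitness(m) for m in masks])
    improved = fits > pbest_fit
    pbest[improved], pbest_fit[improved] = masks[improved], fits[improved]
    gbest = pbest[pbest_fit.argmax()].copy()

print("selected clients for this FL round:", np.flatnonzero(gbest))
```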
LSiF: Log-Gabor Empowered Siamese Federated Learning for Efficient Obscene Image Classification in the Era of Industry 5.0
ABSTRACT. The widespread presence of explicit content on social media platforms has far-reaching consequences for individuals, relationships, and society as a whole. It is essential to address this issue through effective content moderation, user education, and the development of technologies and policies that promote a safer and healthier online environment. To this end, this research proposes the Log-Gabor Empowered Siamese Federated Learning (LSiF) framework for the precise and efficient classification of obscene images in the era of Industry 5.0. The LSiF framework utilizes a siamese network with two parallel streams, where log-Gabor input and normal raw input are processed simultaneously. This siamese architecture leverages shared weights and parameters, enabling effective learning of distinctive features for class differentiation and pattern recognition. The weight-sharing mechanism enhances the model's ability to generalize, increases its robustness, and improves computational efficiency, making it well suited for resource-constrained and real-time applications. Additionally, federated learning is employed with a client size of three, allowing local model updates on each device. This approach minimizes the need for extensive data transmission to a central server, reducing communication overhead and improving learning efficiency, particularly in environments with limited bandwidth. The proposed LSiF model demonstrates remarkable performance, achieving an accuracy of 94.30\%, precision of 94.00\%, recall of 94.26\%, and F1-Score of 94.17\% with a client size of three.
Privacy-Preserving Travel Time Prediction for Internet of Vehicles: A Crowdsensing and Federated Learning Approach
ABSTRACT. Travel time prediction (TTP) is an important task supporting various applications of the Internet of Vehicles (IoV). Although TTP has been widely investigated in the existing literature, most studies assume that the traffic data for estimating travel time are comprehensive and freely available. However, accurate TTP needs real-time vehicular data so that the prediction can adapt to traffic changes. Moreover, since real-time data contain private vehicle information, TTP requires privacy protection during data processing. In this paper, we propose a novel Privacy-Preserving Travel Time Prediction mechanism for IoVs, PTPrediction, built on crowdsensing and federated learning. In the crowdsensing paradigm, a data curator continually collects traffic data from vehicles for travel time prediction. To protect vehicle privacy, we use federated learning so that vehicles can help the data curator train the prediction model without revealing their original data. We also design a spatial prefix encoding method to protect vehicles' location information, along with a ciphertext-policy attribute-based encryption (CP-ABE) mechanism to protect the curator's prediction model. We evaluate PTPrediction in terms of MAE, MSE, and RMSE on real-world traffic datasets. The experimental results show that our mechanism offers higher prediction accuracy and stronger privacy protection compared with existing methods.
Research on automatic segmentation algorithm of brain tumor image based on multi-sequence self-supervised fusion in complex scenes
ABSTRACT. Brain tumor segmentation plays a crucial role in medical diagnosis and treatment planning. Extracting tumor information from MRI images is essential but challenging: manual delineation is time-consuming, labor-intensive, and prone to inter-observer variability, whereas accurate segmentation is needed to assess tumor size, location, and characteristics, which in turn inform treatment decisions and prognosis. This paper presents a brain tumor image segmentation framework that addresses these challenges by leveraging multi-sequence information. The framework consists of encoder, decoder, and data fusion modules. The encoder incorporates Bi-ConvLSTM and Transformer models, enabling comprehensive utilization of both local and global details in each sequence. The decoder module employs a lightweight MLP architecture. Additionally, we propose a data fusion module that integrates self-supervised multi-sequence segmentation results; it learns the weights of each sequence's prediction result in an end-to-end manner, ensuring robust fusion. Experimental validation on the BRATS 2018 dataset demonstrates the excellent performance of the proposed automatic segmentation framework for brain tumor images. Comparative analysis with other multi-sequence fusion segmentation models shows that our framework achieves the highest Dice score in each region.
EEG epileptic seizure classification using hybrid time-frequency attention deep network
ABSTRACT. Epileptic seizure is a complex neurological disorder that is difficult to detect. Observing and analyzing waveform changes in EEG signals is the main way to monitor epileptic activity. However, due to the complexity and instability of EEG signals, the effectiveness of previous EEG-based methods in identifying epileptic regions is not very satisfactory. On the one hand, these methods use the initial time series directly, which reflects only limited epilepsy-related features; on the other hand, they do not fully consider the spatiotemporal dependence of EEG signals. This study proposes a novel EEG-based epileptic seizure classification method built on a hybrid time-frequency attention deep network, namely a time-frequency attention CNN-BiLSTM network (TFACBNet). TFACBNet first uses a time-frequency representation attention module to decompose the input EEG signals into multiscale time-frequency features that provide seizure-relevant information. Then, a hybrid deep network combining a convolutional neural network (CNN) and bidirectional LSTM (BiLSTM) architecture extracts the spatiotemporal dependencies of the EEG signals. Experiments on the benchmark Bonn EEG dataset achieve 98.84% accuracy on the three-category classification task and 92.35% accuracy on the five-category classification task. Our experimental results show that the proposed TFACBNet achieves state-of-the-art classification performance on epileptic EEG signals.
Prior-Enhanced Network for Image-based PM2.5 Estimation from Imbalanced Data Distribution
ABSTRACT. Effective monitoring of PM2.5, a major indicator of air pollution, is crucial to human activities. Compared with traditional physicochemical techniques, image-based methods train PM2.5 estimators using datasets containing pairs of images and PM2.5 levels, which are efficient, economical, and convenient to deploy. However, existing methods either employ handcrafted features, which can be influenced by the image content, or require additional weather data obtained through laborious processes. To estimate the PM2.5 concentration from a single image without requiring extra data, we herein propose a learning-based prior-enhanced (PE) network, comprising a main branch, an auxiliary branch, and a feature fusion attention module, to learn from an input image and its corresponding dark channel (DC) and inverted saturation (IS) maps. In addition, we propose a histogram smoothing (HS) algorithm to address imbalanced data distribution, thereby improving estimation accuracy in cases of heavy air pollution. To the best of our knowledge, this study is the first to address data imbalance in image-based PM2.5 estimation. Finally, we construct a new dataset containing multi-angle images and more than 30 types of air data. Extensive experiments on image-based PM2.5 monitoring datasets verify the superior performance of our proposed neural networks and the HS strategy. The new dataset and codes are available at https://github.com/xxx (open after publication).
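The DC and IS priors mentioned above follow standard definitions, so a small sketch of how such maps could be computed from an RGB image in [0, 1] is given below. The 15x15 patch size for the local minimum filter is an assumption; the paper's preprocessing may differ.

```python
import numpy as np

def dark_channel(img, patch=15):
    """Dark channel: per-pixel minimum over RGB, followed by a local minimum
    filter over patch x patch neighborhoods (patch size assumed)."""
    m = img.min(axis=2)
    pad = patch // 2
    padded = np.pad(m, pad, mode="edge")
    out = np.empty_like(m)
    h, w = m.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + patch, j:j + patch].min()
    return out

def inverted_saturation(img, eps=1e-6):
    """Inverted saturation map: 1 - S, where S = 1 - min(R,G,B) / max(R,G,B)."""
    mx, mn = img.max(axis=2), img.min(axis=2)
    return 1.0 - (1.0 - mn / (mx + eps))

img = np.random.default_rng(0).random((64, 64, 3))
dc, isat = dark_channel(img), inverted_saturation(img)   # two prior maps fed to the network
```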
PMFNet: A Progressive Multichannel Fusion Network for Multimodal Sentiment Analysis
ABSTRACT. The core of multimodal sentiment analysis is to find effective encoding and fusion methods to make accurate predictions. However, previous works ignore the problems caused by the sampling heterogeneity of modalities, and visual-audio fusion does not filter out noise and redundancy in a progressive manner. Moreover, current deep learning approaches for multimodal fusion rely on single-channel fusion (a horizontal position or vertical space channel), while models of the human brain highlight the importance of multichannel fusion.
In this paper, to overcome the above problems, we draw inspiration from the perceptual mechanisms of the human brain in neuroscience and propose a novel framework named Progressive Multichannel Fusion Network (PMFNet), which meets the different processing needs of each modality and provides interaction and integration between modalities at different encoded representation densities, enabling them to be better encoded in a progressive manner and fused over multiple channels. Extensive experiments conducted on public datasets demonstrate that our method achieves superior or comparable results to the state-of-the-art models.
Correlation-Distance Graph Learning for Treatment Response Prediction from rs-fMRI
ABSTRACT. Resting-state fMRI (rs-fMRI) functional connectivity (FC) analysis provides valuable insights into the relationships between different brain regions and their potential implications for neurological or psychiatric disorders. However, efforts specifically designed to predict treatment response from rs-fMRI remain limited due to difficulties in understanding the current brain state and the underlying mechanisms driving the observed patterns, which limits the clinical application of rs-fMRI. To overcome this, we propose a graph learning framework that captures comprehensive features by integrating both correlation- and distance-based similarity measures under a contrastive loss. This results in a more expressive framework that captures brain dynamic features at different scales and enables more accurate prediction of treatment response. Our experiments on chronic pain and depersonalization disorder datasets demonstrate that the proposed method outperforms current methods in different scenarios. To the best of our knowledge, we are the first to explore the integration of distance-based and correlation-based neural similarity into graph learning for treatment response prediction.
Asymptotic spatio-temporal averaging of the power of EEG signals for schizophrenia diagnostics
ABSTRACT. Although many sophisticated EEG analysis methods have been developed, they are rarely used in clinical practice. Brain bioelectrical activity is non-stationary and characterized by high daily variations; individual differences are quite significant. Therefore, searching for simple methods that can provide stable results reflecting the basic characteristics of individual neurodynamics is very important. Here, we describe two methods potentially useful in schizophrenia diagnostics. We explore the potential for classification based on features extracted with the asymptotic spatial power distribution method and compare it with the results using microstate parameters and probabilities of transition between microstates. Applied to EEG data with only 16 channels and a low sampling rate, such methods provide quite good discrimination between adolescent schizophrenia patients and a control group of healthy teens.
Non-Contact Respiratory Flow Extraction from Infrared Images Using Balanced Data Classification
ABSTRACT. The COVID-19 pandemic has emphasized the need for non-contact ways of measuring vital signs. However, collecting respiratory signals can be challenging due to the transmission risk and physical discomfort of spirometry devices. This is problematic in places like schools and workplaces where monitoring health is crucial. Infrared fever meters are not accurate enough since fever is not the only symptom of these diseases. The objective of our study was to develop a non-contact method for obtaining Respiratory Flow (RF) from infrared images. We recorded infrared images of three subjects at a distance of 1 meter while they breathed through a spirometry device. We proposed a method called Balanced Data Classification to distribute frames equally into several classes and then used the DenseNet-121 Convolutional Neural Network Model to predict RF signals from the infrared images. Our results showed a high correlation of 97% and a Normalized Mean Absolute Error of 2.3%, which are significant compared to other studies. Our method is fully non-contact and involves standing at a distance of 1 meter from the subjects. In conclusion, our study demonstrates the feasibility of using infrared images to extract RF.
RMPE: Reducing Residual Membrane Potential Error for Enabling High-accuracy and Ultra-low-latency Spiking Neural Networks
ABSTRACT. Spiking neural networks (SNNs) have attracted great attention due to their distinctive properties of low power consumption and high computing efficiency on neuromorphic hardware. An effective way to obtain deep SNNs with competitive accuracy on large-scale datasets is ANN-SNN conversion. However, it requires a long time window to obtain an optimal mapping between the firing rates of SNNs and the activations of ANNs due to conversion error. Compared with the source ANN, the converted SNN usually suffers a huge loss of accuracy at ultra-low latency. In this paper, we first analyze the residual membrane potential error caused by the asynchronous transmission property of spikes at ultra-low latency, and deduce an explicit expression relating the residual membrane potential error (RMPE) to the SNN parameters. We then propose a layer-by-layer calibration algorithm for these SNN parameters to eliminate the RMPE. Finally, a two-stage ANN-SNN conversion scheme is proposed to eliminate the quantization error, the truncation error, and the RMPE separately. We evaluate our method on CIFAR and ImageNet, and the experimental results show that the proposed ANN-SNN conversion method significantly reduces accuracy loss at ultra-low latency. For ImageNet, when T ≤ 64, the latency required by our method is about half that of other methods.
IEEG-CT: A CNN and Transformer Based Method for Intracranial EEG Signal Classification
ABSTRACT. Intracranial electroencephalography (iEEG) is of great importance for the preoperative evaluation of drug-resistant epilepsy. Automatic classification of iEEG signals can speed up the process of epilepsy diagnosis. Existing deep learning-based approaches for iEEG signal classification usually rely on convolutional neural networks (CNNs) and long short-term memory networks. However, these approaches have limitations in classification accuracy. In this paper, we propose a CNN and Transformer based method, named IEEG-CT, for iEEG signal classification. First, IEEG-CT utilizes a deep one-dimensional CNN to extract critical local features from the raw iEEG signals. Second, IEEG-CT employs a Transformer encoder, which leverages a multi-head attention mechanism to capture long-range global information among the extracted features. In particular, we introduce a causal convolution multi-head attention instead of the standard Transformer block to efficiently capture the temporal relations in the input features. Finally, the global features obtained by the Transformer encoder are used for classification. We evaluate the performance of IEEG-CT on two public multicentre iEEG datasets. The experimental results demonstrate that IEEG-CT outperforms state-of-the-art techniques across various evaluation metrics, i.e., accuracy, AUROC, and AUPRC.
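The general 1D-CNN-plus-Transformer pattern described above can be sketched as follows. Note that this uses a stock PyTorch TransformerEncoder as a stand-in for the paper's causal-convolution multi-head attention, and all layer sizes, strides, and the single-channel input are assumptions.

```python
import torch
import torch.nn as nn

class CNNTransformer(nn.Module):
    """Sketch of a 1D-CNN front end followed by a Transformer encoder for iEEG
    windows. A standard nn.TransformerEncoder replaces the paper's
    causal-convolution attention; sizes are illustrative."""
    def __init__(self, n_channels=1, n_classes=2, d_model=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(n_channels, d_model, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, stride=2, padding=2), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.fc = nn.Linear(d_model, n_classes)

    def forward(self, x):                  # x: (batch, n_channels, time)
        z = self.cnn(x).transpose(1, 2)    # (batch, time', d_model) token sequence
        z = self.encoder(z)
        return self.fc(z.mean(dim=1))      # average-pool tokens, then classify

logits = CNNTransformer()(torch.randn(4, 1, 1024))
print(logits.shape)                        # torch.Size([4, 2])
```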
SLG-NET: Subgraph Neural Network with Local-Global Braingraph Feature Extraction Modules and a Novel Subgraph Generation Algorithm for Automated Identification of Major Depressive Disorder
ABSTRACT. Major depressive disorder (MDD) is a severe mental illness that poses significant challenges to both society and families. Recently, several graph neural network (GNN)-based methods have been proposed for MDD diagnosis and have achieved promising results. However, these methods encode the entire braingraph directly and overlook its subgraph structure, which leads to poor specificity to braingraphs. Additionally, the GNN frameworks they use are rudimentary, resulting in insufficient feature extraction capability. In light of these two shortcomings, this paper designs a novel depression diagnosis framework named SLG-NET based on a subgraph neural network. To the best of our knowledge, this study is the first attempt to apply subgraph neural networks to depression diagnosis. To enhance the specificity of our model to braingraphs, we propose a novel subgraph generation algorithm based on sub-structure information of the brain. To improve feature extraction capability, local and global braingraph feature extraction modules are proposed to extract braingraph properties at both local and global levels. Comprehensive experiments performed on the REST-meta-MDD dataset show that the proposed SLG-NET significantly surpasses many state-of-the-art methods with an accuracy of 74.15%. This accuracy indicates that SLG-NET has the potential for auxiliary diagnosis of depression in clinical scenarios. We further analyze the high-order FC network and highlight the hyperconnectivity of the thalamus as a key neurophysiological feature associated with MDD, which may guide the development of biomarkers for the clinical diagnosis of MDD.
Soybean Genome Clustering using Quantum-Based Fuzzy C-Means Algorithm
ABSTRACT. Bioinformatics is a growing area of research in which many computer scientists work to extract useful information from genome sequences in far less time than traditional methods, which may take years. One study area within bioinformatics is protein sequence analysis. In this study, we consider soybean protein sequences, which do not have class information and therefore require clustering. As these sequences are very complex and contain overlapping sequences, the Fuzzy C-Means algorithm may work better than crisp clustering. However, clustering these sequences is very time-consuming, and the results obtained with existing crisp and fuzzy clustering algorithms are not satisfactory. We therefore propose a quantum Fuzzy C-Means algorithm that uses quantum computing concepts to represent the dataset in quantum form. The proposed approach also uses the quantum superposition concept, which speeds up the process and yields better results than the FCM algorithm.
ONEI: Unveiling Route and Phase of Breathing from Snoring Sounds
ABSTRACT. Obstructive Sleep Apnea-Hypopnea Syndrome (OSAHS) is a chronic respiratory disorder caused by the obstruction of the upper airway. The treatment approach for OSAHS varies based on the individual patient's breathing route and phase during snoring. Extensive research has been conducted to identify various snoring patterns, including the breathing route and the breathing phase during snoring. However, the identification of breathing routes and phases during snoring sounds is still in the early stages due to the limited availability of comprehensive datasets with scientifically annotated nocturnal snoring sounds. To address this challenge, this study presents ONEI, an innovative dataset designed for recognizing and analyzing snoring patterns. ONEI encompasses 5171 snoring recordings and is annotated with four distinct labels, namely nasal-dominant inspiratory snoring, nasal-dominant expiratory snoring, oral inspiratory snoring, and oral expiratory snoring. Experimental evaluations reveal discernible acoustic features in snoring sounds, which can be effectively utilized for accurately identifying various snoring types in real-world scenarios. The dataset will be made publicly available for access at https://github.com/emleeee/ONEI.
ABSTRACT. A meal recommender system, as an application of bundle recommendation, aims to provide courses from specific categories (e.g., appetizer, main dish) that together form a meal for a user. Existing bundle recommendation methods learn user preferences from user-bundle interactions to satisfy users' information needs. However, users in food scenarios may have different preferences for different course categories, and effectively accounting for course category constraints when predicting meals for users is a challenge. To this end, we propose CateRec, a category-wise meal recommendation model. Specifically, our model first decomposes interactions and affiliations between users, meals, and courses according to category. Second, graph neural networks are utilized to learn category-wise user and meal representations. Then, the likelihood of user-meal interactions is estimated category by category. Finally, our model is trained with a category-wise enhanced Bayesian Personalized Ranking (BPR) loss. Experiments conducted on two public datasets show that our model outperforms state-of-the-art methods in terms of Recall@K and NDCG@K.
Topic Modeling for Short Texts via Adaptive Pólya Urn Dirichlet Multinomial Mixture
ABSTRACT. Inferring coherent and diverse latent topics from short texts is crucial in topic modeling. Existing approaches leverage the Generalized Pólya Urn (GPU) model to incorporate external knowledge and improve topic modeling performance. While the GPU scheme successfully promotes similarity among words within the same topic, it has two major limitations. Firstly, it assumes that similar words contribute equally to a similar topic, disregarding the distinctiveness of different words. Secondly, it assumes that a specific word should have the same promotion across all topics, overlooking the variations in word importance across different topics. To address these limitations, we propose the Adaptive Pólya Urn Dirichlet Multinomial Mixture (APU-DMM) model, which leverages global topic-word correlation to encourage adaptive weights for different words. This is achieved through a novel Adaptive Pólya Urn (APU) scheme. We conduct experiments on three datasets (Tweet, SearchSnippets, and GoogleNews), and the results demonstrate our model's superiority in terms of topic coherence and topic diversity. This paper contributes to advancing latent topic inference in short texts by introducing the APU-DMM model and showcasing its enhanced performance. Utilizing global topic-word correlation and introducing the APU scheme allows for more adaptive and nuanced modeling, resulting in improved topic coherence and diversity.
Optimal Low-rank QR Decomposition with an Application on RP-TSOD
ABSTRACT. Low-rank matrix approximation has many applications, e.g., denoising, recommender systems, and image reconstruction. Recently, Randomized Pivoted Two-Sided Orthogonal Decomposition (RP-TSOD) was developed to exploit randomization in approximating a high-dimensional matrix using QR decomposition. Instead of random projection, we propose to optimize the projection matrix for low-rank QR decomposition with the goal of minimizing the approximation error. A gradient descent based method is developed to derive optimal projections. The developed techniques can be used not only in RP-TSOD but also in other decompositions. Experimental results on both synthetic and real data show that the proposed method approximates a high-dimensional matrix more accurately than RP-TSOD.
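A minimal sketch of the general idea, under my own assumptions rather than the paper's exact procedure: treat the projection matrix as a trainable parameter, form an orthonormal basis of the sketched range with a QR decomposition, and minimize the Frobenius approximation error by gradient descent (here via PyTorch autograd and Adam).

```python
import torch

torch.manual_seed(0)
m, n, k = 200, 120, 10
A = torch.randn(m, k) @ torch.randn(k, n) + 0.05 * torch.randn(m, n)  # nearly rank-k test matrix

P = torch.randn(n, k, requires_grad=True)     # projection matrix to be optimized
opt = torch.optim.Adam([P], lr=1e-2)

for step in range(300):
    Q, _ = torch.linalg.qr(A @ P)                  # orthonormal basis of the sketched range
    loss = torch.linalg.norm(A - Q @ (Q.T @ A))    # Frobenius approximation error
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final approximation error: {loss.item():.4f}")
```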
ABSTRACT. Traditional Hidden Markov Models (HMM) allow us to discover the latent structure of the observed data (both discrete and continuous). The recently proposed DenseHMM provides hidden-state embeddings and uses a co-occurrence-based learning scheme. However, it is limited to discrete emissions, which is unsuitable for many real-world problems. We address this shortcoming by discretizing observations and using a region-based co-occurrence matrix in the training procedure. This allows embedding hidden states for continuous emission problems and reduces the training time for large sequences. An application of the proposed approach concerns recommender systems, where we try to explain how the current interest of a given user in a given group of products (the current state of the user) influences the saturation of the list of recommended products with that group of products. Computational experiments confirmed that the proposed approach outperformed regular HMMs in several benchmark problems. Although the emissions are estimated roughly, we can accurately infer the states.
EDDVPL: A Web Attribute Extraction Method with Prompt Learning
ABSTRACT. Since labeling web pages requires a lot of human resources and time, web attribute extraction methods based on few-shot learning have gained the attention of researchers. However, these methods still rely heavily on sufficient labeled data from several seed websites. In order to effectively alleviate the lack of domain information, we design a web attribute extraction model based on dual-view prompt learning, named EDDVPL, achieving page-level few-shot learning which uses only a small number of labeled web pages for training. Specifically, we first retrieve semantic prompt information of the DOM tree view by a simplified algorithm to stimulate domain-related knowledge of the pre-trained language model. Then, we introduce task prompt information of the template view by constructing a template indicating the extraction target, which can help the pre-trained language model quickly understand the task of web attribute extraction. Finally, we integrate the dual-view prompt information by template filling to jointly guide the training of the pre-trained language model at the semantic and task levels. Extensive experimental results on the public SWDE dataset show that EDDVPL achieves the best results compared to the baselines.
ASRCD: Adaptive Serial Relation-based Model for Cognitive Diagnosis
ABSTRACT. Cognitive diagnosis (CD) is a fundamental task in the education field. The goal of CD is to recognize the actual concept proficiency of learners. Recent studies prove that concept relations (e.g., between the concepts Addition and Multiplication in mathematics) play a key role in CD, and advanced research has made great contributions to concept relation modeling. However, there remains a gap in the automatic construction and adaptive integration of relation modeling. To address these problems, we propose an Adaptive Serial Relation based model for Cognitive Diagnosis (ASRCD). We first construct a Concept Serial Relation Graph (CSRG) to automatically mine concept relations from the learner response sequence. Then a refined graph attention network (GAT) is designed to weigh the concept relations for aggregation. Finally, we build a general CD model blending concept relations. Leveraging the extensibility of CSRG, it can be applied to most existing CD methods. We implement our model on two real-world datasets from education practice. Experimental results demonstrate that the proposed model performs outstandingly in both accuracy and extensibility.
Interactive Selection Recommendation Based on the Multi-Head Attention Graph Neural Network
ABSTRACT. Click-through rate prediction is a critical task in recommendation systems, and graph neural networks, as a powerful machine learning method, have been favored by scholars to solve it in recent years. However, most click-through rate prediction models based on graph neural networks model the relationships between features without considering the effectiveness of feature interaction, although not all feature combinations are meaningful. Therefore, this paper proposes a Multi-head attention Graph Neural Network with Interactive Selection, named MGNN_IS for short, to capture complex feature interactions via graph structures. In particular, three sub-graphs are constructed to capture internal information of users and items respectively, as well as interactive information between users and items, namely the user internal graph, the item internal graph, and the user-item graph. Moreover, the proposed model designs a multi-head attention propagation and aggregation module with an interactive selection strategy, which can select from the constructed graphs and increase diversity with multiple heads to achieve high-order interaction across multiple layers. Finally, the proposed model fuses the features to produce the final prediction. Experiments on three public datasets demonstrate that the proposed model outperforms other advanced models.
An End-to-End Dense Connected Heterogeneous Graph Convolutional Neural Network
ABSTRACT. Graph convolutional networks (GCNs) are powerful models for graph-structured data learning tasks. However, most existing GCNs may confront two major challenges when dealing with heterogeneous graphs: (1) Predefined meta-paths are required to capture the semantic relations between nodes of different types, which may not exploit all the useful information in the graph; (2) Performance degradation and semantic confusion may happen with the growth of the network depth, which limits their ability to capture long-range dependencies. To meet these challenges, we propose Dense-HGCN, an end-to-end dense connected heterogeneous graph convolutional neural network for learning node representations. Dense-HGCN computes the attention weights between different nodes and incorporates the information of previous layers into each layer's aggregation process via a specific fuse function. Moreover, Dense-HGCN leverages multi-scale information for node classification or other downstream tasks. Experimental results on real-world datasets demonstrate the superior performance of Dense-HGCN in enhancing representational power compared with several state-of-the-art methods.
Graph Convolutional Network based Feature Constraints Learning for Cross-Domain Adaptive Recommendation
ABSTRACT. The problem of data sparsity is a key challenge for recommendation systems. It motivates the research of cross-domain recommendation (CDR), which aims to use more user-item interaction information from source domains to improve the recommendation performance in the target domain. However, finding useful features to transfer is a challenge for CDR, and avoiding negative transfer while achieving domain adaptation further adds to this challenge. Building on the strength of graph structural feature learning, this paper proposes a graph convolutional network based Cross-Domain Adaptive Recommendation model using Feature Constraints Learning (CDAR-FCL). To begin with, we construct a multi-graph network consisting of single-domain graphs and one cross-domain graph based on overlapping users. Next, we employ specific and common graph convolutions on the graphs to learn domain-specific and domain-invariant features, respectively. Additionally, we design feature constraints on the features obtained from the different graphs and mine the potential correlations for domain adaptation. To address the issue of shared parameter conflicts within the constraints, we develop a binary mask learning approach based on contrastive learning. CDAR-FCL is a domain-adaptive recommendation model that can find useful features to transfer. Experiments on three pairs of real cross-domain datasets demonstrate the effectiveness of CDAR-FCL.
Motif-SocialRec: A Multi-channel Interactive Semantic Extraction Model for Social Recommendation
ABSTRACT. Social recommendation is emerging as a prominent research topic in recommendation systems; it enhances prediction performance by combining user-item interaction information with social relationships between users. Most existing social recommendation models rely on pairwise relationships within the neighborhood structure during representation learning, without capturing and utilizing the high-level connection modes in the information network. Therefore, these models often fail to capture complex interaction semantics beyond pairwise relationships, which are crucial for producing valid recommendation results. To address this issue, this paper discusses social recommendation from the perspective of motifs and proposes a novel recommendation model, namely Motif-SocialRec, which efficiently models interaction patterns from multiple channels with different motifs. In this model, we extract a series of local structures, depicted by motifs, that describe the high-level interactive semantics in the fused network from three views. By employing a hypergraph convolution network conditioned on motifs, representations that preserve potential semantic patterns can be learned under the constraint of social relationships. Additionally, we enhance the learned representations by establishing self-supervised learning tasks at different scales to further explore the inherent characteristics of the network. To produce the final recommendation prediction, a joint optimization model is constructed by integrating the primary and auxiliary tasks. Results of extensive experiments on four real-world datasets show that Motif-SocialRec significantly outperforms baselines in terms of three evaluation metrics, in both common and cold-start settings. Finally, further insight into the explainability of Motif-SocialRec is provided by analysing the recommendation predictions produced for several randomly sampled users.
Curiosity Enhanced Bayesian Personalized Ranking for Recommender Systems
ABSTRACT. Curiosity affects users' selection of items, motivating them to explore items regardless of their preferences. This phenomenon is particularly common in social networks. However, existing social-based recommendation methods neglect users' curiosity in social networks, which may decrease recommendation accuracy. Moreover, focusing only on simulating users' preferences can lead to information cocoons. To tackle the problems above, we propose a Curiosity Enhanced Bayesian Personalized Ranking (CBPR) model for recommender systems. Our proposed model makes full use of theories from psychology to model the curiosity aroused in users when they face different opinions. The experimental results on two public datasets demonstrate the advantages of our CBPR model over existing models.
EWMIGCN: Emotional Weighting based Multimodal Interaction Graph Convolutional Networks for Personalized Prediction
ABSTRACT. To address the challenges of information overload and cold start in personalized prediction systems, researchers have proposed graph neural network-based recommendation methods. However, existing studies have largely overlooked the shared characteristics among different modal features. Moreover, there is a mismatch between the focuses of multimodal feature extraction (MFE) and user preference modeling (UPM). To tackle these issues, this paper establishes an interaction graph by extracting multimodal information and addresses the mismatch between MFE and UPM by constructing an emotion-weighted bisymmetric linear graph convolutional network (EW-BGCN). Specifically, this paper introduces a novel model called EWMIGCN, which combines multimodal information extraction using parallel CNNs to build an interaction graph, propagates the information on the EW-BGCN, and predicts user preferences by summing the representations of users and items through inner-product calculations. Notably, this paper incorporates sentiment information from user comments to finely weight the neighborhood aggregation in the EW-BGCN, enhancing the overall quality of items. Experimental results demonstrate that the proposed model achieves superior performance compared to other baseline models on three datasets, as measured by Hits Ratio and Normalized Discounted Cumulative Gain.
ASTPSI: Allocating Spare Time and Planning Speed Interval for Intelligent Train Control of Sparse Reward
ABSTRACT. When deep reinforcement learning (DRL) is used to solve train operation control in urban railways, it encounters complex and dynamic environments with sparse rewards. It is therefore crucial to alleviate the negative impact of sparse rewards on finding the optimal trajectory. This paper introduces a novel algorithm called Allocating Spare Time and Planning Speed Intervals (ASTPSI), which dramatically reduces the blindness of exploration of intelligent train agents under sparse rewards when using DRL and significantly improves their learning efficiency and operation quality. ASTPSI can generate real-time train trajectories that meet the requirements by combining different DRL algorithms. To evaluate the algorithm's performance, we verified the convergence rate of ASTPSI-DRL in optimizing train trajectories in the face of sparse rewards on a real track. ASTPSI-DRL shows better performance and stability than genetic algorithms and the original DRL algorithms in terms of train energy consumption, punctuality, and stopping accuracy.
Outer Synchronization for Multi-Derivative Coupled Complex Networks with and without External Disturbance
ABSTRACT. This paper investigates the outer synchronization of multi-derivative coupled complex networks (MDCCNs), and further studies the outer $H_{\infty}$ synchronization between two MDCCNs with external disturbance. For the outer synchronization, a synchronization criterion is proposed by using an adaptive control strategy, which is proved based on a Lyapunov functional and Barbalat's lemma. For the outer $H_\infty$ synchronization, an adaptive state controller and a parameter updating scheme are devised for MDCCNs with external disturbance. Finally, the validity of the presented criteria is demonstrated by two simulation examples.
Wasserstein Diversity-Enriched Regularizer for Hierarchical Reinforcement Learning
ABSTRACT. Hierarchical reinforcement learning (HRL) composes subpolicies in different hierarchies to accomplish complex tasks. Automated subpolicy discovery, which does not depend on domain knowledge, is a promising approach to generating subpolicies. However, the degradation problem is a challenge that existing methods can hardly deal with, due to the lack of consideration of diversity or the employment of weak regularizers. In this paper, we propose a novel task-agnostic regularizer called the Wasserstein Diversity-Enriched Regularizer (WDER), which enlarges the diversity of subpolicies by maximizing the Wasserstein distances among their action distributions. The proposed WDER can be easily incorporated into the loss function of existing methods to further boost their performance. Experimental results demonstrate a significant improvement in the diversity of the generated subpolicies when equipped with our WDER, which validates its effectiveness.
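As a simplified illustration of such a diversity term (not the paper's implementation, and assuming one-dimensional continuous actions), the sketch below averages pairwise 1-D Wasserstein distances between action samples drawn from each subpolicy and subtracts the result from the task loss.

```python
from itertools import combinations
import numpy as np
from scipy.stats import wasserstein_distance

def diversity_bonus(action_samples, coef=0.1):
    """action_samples: list of 1-D arrays, one per subpolicy,
    containing actions sampled from that subpolicy."""
    pairs = list(combinations(range(len(action_samples)), 2))
    dists = [wasserstein_distance(action_samples[i], action_samples[j])
             for i, j in pairs]
    # maximizing pairwise Wasserstein distances -> subtract the bonus from the task loss
    return coef * float(np.mean(dists))

# usage (illustrative): total_loss = task_loss - diversity_bonus(samples_per_subpolicy)
```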
Bloomfilter-based Practical Kernelization Algorithm for Minimum Satisfiability
ABSTRACT. The Minimum Satisfiability problem (briefly: given a CNF formula, find an assignment satisfying the minimum number of clauses) has attracted much attention recently. From a theoretical point of view, the Minimum Satisfiability problem is fixed-parameter tractable, by transformation into Vertex Cover. However, such a transformation is time-consuming, taking $O(m^2\cdot n)$ time to transform into Vertex Cover. We first present an $O(m^2)$ algorithm to transform MinSAT into Vertex Cover by utilizing a Bloom filter structure. Then, instead of transforming to Vertex Cover, we present a practical kernelization rule applied directly to the original formula, which takes $O(L\cdot d(F))$ time and yields a kernel of size $k^2+k$.
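For readers unfamiliar with the data structure, a Bloom filter offers constant-time set membership tests with a tunable false-positive rate and no false negatives, which is what makes the faster transformation plausible. The sketch below is a generic Python Bloom filter, not the paper's kernelization code; the bit-array size and hash count are arbitrary.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: probabilistic set membership with no false negatives."""

    def __init__(self, size=1 << 20, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size // 8 + 1)

    def _positions(self, item):
        # derive num_hashes independent bit positions from SHA-256 digests
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# e.g., recording whether a literal pair has already been processed:
seen = BloomFilter()
seen.add(("x1", "-x2"))
print(("x1", "-x2") in seen)   # True (inserted items are never missed)
```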
ABSTRACT. It is well known that Q-learning (QL) suffers from overestimation bias, which is caused by using the maximum action value to approximate the maximum expected action value. To address this issue, the overestimation property of Q-learning has been studied both theoretically and practically. In general, most work on reducing overestimation bias seeks different estimators to replace the maximum estimator in order to mitigate its effect, and such works have achieved some improvement over Q-learning. In this work, we also focus on bias-reduction methods. We consider M sample action values, where each sample is re-estimated using the maximizing actions of the remaining samples. We then select the max and the median member from these re-estimated samples, yielding Bias Reduced Max Q-learning (BRMQL) and Bias Reduced Median Q-learning (BRMeQL), respectively. We first theoretically prove that BRMQL and BRMeQL suffer from underestimation bias and analyze the effect of the number M of Q-functions on the performance of our algorithms. Then we evaluate BRMQL and BRMeQL on benchmark game environments. Finally, we show that BRMQL and BRMeQL underestimate the Q-value less than Double Q-learning (DQL) and perform better than several other algorithms on some benchmark game environments.
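One plausible reading of the cross-estimation step is sketched below for tabular Q-learning: each of the M Q-tables is evaluated at the greedy action of the remaining tables (averaged here), and the max or median of the resulting estimates forms the BRMQL or BRMeQL target. The averaging choice and the function names are assumptions for illustration, not the authors' exact procedure.

```python
import numpy as np

def cross_estimates(q_tables, state):
    """q_tables: list of M arrays of shape (n_states, n_actions).
    For each Q_i, pick the greedy action of the remaining M-1 tables
    (averaged here) and evaluate Q_i at that action."""
    M = len(q_tables)
    estimates = []
    for i in range(M):
        others = np.mean([q_tables[j][state] for j in range(M) if j != i], axis=0)
        a_star = int(np.argmax(others))          # greedy action of the other tables
        estimates.append(q_tables[i][state, a_star])
    return np.array(estimates)

def brmql_target(q_tables, state):
    return cross_estimates(q_tables, state).max()       # max member (BRMQL)

def brmeql_target(q_tables, state):
    return np.median(cross_estimates(q_tables, state))  # median member (BRMeQL)
```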
Interactive Attention-Based Graph Transformer for Multi-Intersection Traffic Signal Control
ABSTRACT. With the exponential growth in motor vehicle numbers, the issue of urban traffic congestion is becoming increasingly severe, and traffic signal control has become a pivotal technology for alleviating congestion. In modeling multiple intersections, most existing studies focus on communication among intersections within a region and rarely consider the correlations between cross-regional intersections. To address this limitation, we construct an interactive attention-based graph transformer network for traffic signal control (GTLight). Specifically, the model learns multiple dependency patterns using a relationship-enhanced interactive attention mechanism and considers the correlations between cross-regional intersections. In addition, the model designs a phase-timing optimization algorithm to solve the problem of Q-value overestimation in signal timing strategies. The model can provide an optimal signal phasing for each intersection based on different traffic states. We validate the effectiveness of GTLight using the CityFlow traffic simulator on synthetic and real-world traffic datasets. Compared with recent methods, the average travel time is improved by 28.16%, 26.56%, 25.79%, 26.46%, and 19.59%, respectively, achieving excellent performance.
Feature-Fusion-Based Haze Recognition in Endoscopic Images
ABSTRACT. Haze generated during endoscopic surgeries significantly obstructs the surgeon's field of view, leading to inaccurate clinical judgments and elevated surgical risks. Identifying whether endoscopic images contain haze is essential for dehazing. However, existing haze image classification approaches usually concentrate on natural images, showing inferior performance when applied to endoscopic images. To address this issue, an effective haze recognition method specifically designed for endoscopic images is proposed. This paper innovatively employs three kinds of features (i.e., color, edge, and dark channel), which are selected based on the unique characteristics of endoscopic haze images. These features are then fused and inputted into a Support Vector Machine (SVM) classifier. Evaluated on clinical endoscopic images, our method demonstrates superior performance: (Accuracy: 98.67%, Precision: 98.03%, and Recall: 99.33%), outperforming existing methods. The proposed method is expected to enhance the performance of future dehazing algorithms in endoscopic images, potentially improving surgical accuracy and reducing surgical risks.
MemFlowNet: A Network for Detecting Subtle Surface Anomalies with Memory Bank and Normalizing Flow
ABSTRACT. Detection of subtle surface anomalies in the presence of strong noise is a challenging vision task. This paper presents a new neural network called MemFlowNet for detecting subtle surface anomalies by combining the advantages of a memory bank and normalizing flow. The proposed method consists of two stages. The first stage achieves pixel-level segmentation of anomalies using noise-insensitive average features in the memory bank and a Nearest Neighbor search strategy, and the second stage achieves image-level detection using normalizing flows and multi-scale score fusion. A new dataset called INSCup has been developed to assist this research by acquiring inner-surface images of stainless steel insulated cups with an ultra-wide lens. The performance of MemFlowNet has been validated on the INSCup dataset, where it surpasses other mainstream methods. In addition, MemFlowNet achieves the best performance, with an image-level AUROC of 99.57%, in anomaly detection on the MVTec-AD benchmark. This shows great potential for applying MemFlowNet to automated visual inspection of surface anomalies.
Spatially-Aware Human-Object Interaction Detection with Cross-Modal Enhancement
ABSTRACT. We propose a novel two-stage HOI detection model that incorporates cross-modal spatial information awareness.
Human-object relative spatial relationships are highly relevant for specific HOI species, but current approaches fail to model such crucial cues explicitly. We observed that relative spatial relationships possess properties that can be described in natural language easily and intuitively.
Building on this observation and inspired by recent advancements in prompt-tuning, we design a Prompt-Enhanced Spatial Modeling (PESM) module that generates linguistic descriptions of spatial relations between humans and objects.
PESM is capable of merging the explicit spatial information obtained by the aforementioned text descriptions with the implicit spatial information of the visual modality. Moreover, we devise a two-stage model architecture that effectively incorporates auxiliary cues to exploit the enhanced cross-modal spatial information.
Extensive experiments conducted on the HICO-DET benchmark demonstrate that the proposed model outperforms state-of-the-art methods, indicating its effectiveness and superiority. The source code is available at https://github.com/liugaowen043/tsce
Oil and Gas Automatic Infrastructure Mapping: Leveraging High-Resolution Satellite Imagery through fine-tuning of object detection models
ABSTRACT. The oil and gas sector is the second largest anthropogenic emitter of methane, which is responsible for at least 25% of current global warming. To curb methane's contribution to climate change, emissions from oil and gas infrastructure must be monitored. Initiatives such as the Methane Alert and Response System (MARS) launched by the United Nations Environment Program aim to pinpoint significant emission events, alert relevant stakeholders, and monitor and track progress in mitigation efforts. However, an automated solution is needed for consistent monitoring across multiple oil and gas basins. In this extended study, we focus on automated identification of oil and gas infrastructure using advanced supervised object detection algorithms such as YOLO, Faster R-CNN, and DETR, fine-tuned on a specifically segmented oil and gas infrastructure database (930 images, 1951 objects). We specifically investigate automatic detection in the Permian Basin of the U.S. using these algorithms refined with our customized high-resolution image database. The tests performed demonstrate the effectiveness of the YOLOv8 model, both with and without pre-training.
A Deep Learning Framework with Pruning RoI Proposal for Dental Caries Detection in Panoramic X-ray Images
ABSTRACT. Dental caries is a prevalent noncommunicable disease that affects over half of the global population. It can significantly diminish individuals' quality of life by impairing their eating and socializing abilities. Consistent dental check-ups and professional oral healthcare are crucial in preventing dental caries and other oral diseases. Deep learning based object detection provides an efficient approach to assist dentists in identifying and treating dental caries. In this paper, we present a deep learning framework with a lightweight pruning region of interest (P-RoI) proposal specifically designed for detecting dental caries in panoramic dental radiographic images. Moreover, this framework can be enhanced with an auxiliary head for label assignment during the training process. By utilizing the Cascade Mask R-CNN model with a ResNet-101 backbone as the baseline, our modified framework with the P-RoI proposal and auxiliary head achieves a notable 3.85 increase in Average Precision (AP) for the dental caries class within our dental dataset.
Generating Pseudo-Labels for Car Damage Segmentation using Deep Spectral Method
ABSTRACT. Car damage segmentation, an integral part of vehicle damage assessment, involves identifying and classifying various types of damage from images of vehicles, thereby enhancing the efficiency and accuracy of assessment processes. This paper introduces an efficient approach for car damage assessment by combining pseudo-labeling and deep learning techniques. The method addresses the challenge of limited labeled data in car damage segmentation by leveraging unlabeled data. Pseudo-labels are generated using a deep spectral approach and refined through merge and flip-bit operations. Two models, i.e., Mask R-CNN and SegFormer, are trained using a combination of ground-truth labels and pseudo-labels. Experimental evaluation on the CarDD dataset demonstrates the superior accuracy of our method, achieving improvements of 12.9% in instance segmentation and 18.8% in semantic segmentation when utilizing a 1/2 ground-truth ratio. In addition to enhanced accuracy, our approach offers several benefits, including time savings, cost reductions, and the elimination of biases associated with human judgment. By enabling more precise and reliable identification of car damage, our method enhances the overall effectiveness of the assessment process. The integration of pseudo-labeling and deep learning techniques in car damage assessment holds significant potential for improving efficiency and accuracy in real-world scenarios.
Detect Overlapping Community via Graph Neural Network and Topological Potential
ABSTRACT. Overlapping community structure is an important characteristic of real complex networks, and the goal of overlapping community detection is to uncover the modular structure using the information contained in the networks. However, most existing methods based on deep learning techniques directly utilize the original network topology or node attributes, ignoring the importance of various kinds of edge information. Inspired by the effective representation learning capability of graph neural networks and the ability of topological potential to measure the intimacy between nodes, we propose a novel model, named DOCGT, for overlapping community detection. This model deconstructs the original graph into a first-order graph and a second-order graph, builds a set of graph neural network modules based on the Bernoulli-Poisson (BP) model, and then uses their advantages to independently learn node embedding representations of different orders. To this end, we introduce the concept of a topological potential matrix, which can not only effectively merge the above embeddings but also integrate abundant edge information into the entire model. The fused embedding matrix can help us obtain the final community structure. Experimental results on real datasets show that our method can effectively detect overlapping community structures.
Adaptive Focal Inverse Distance Transform Maps for Cell Recognition
ABSTRACT. The quantitative analysis of cells is crucial for clinical diagnosis, and effective analysis requires accurate detection and classification. Using point annotations for weakly supervised learning is a common approach for cell recognition, which significantly reduces the labeling workload. Cell recognition methods based on point annotations primarily rely on manually crafted smooth pseudo labels. However, the diversity of cell shapes can render fixed encodings ineffective. In this paper, we propose a multi-task cell recognition framework. The framework utilizes a regression task to adaptively generate smooth pseudo labels with cell morphological features to guide the robust learning of the probability branch, and utilizes an additional branch for classification. Meanwhile, in order to address the issue of multiple high-response points within one cell, we introduce Non-Maximum Suppression (NMS) to avoid duplicate detections. On a bone marrow cell recognition dataset, our method is compared with five representative methods. Compared with the best-performing method, our method achieves improvements of 2.0 and 3.6 in F1 score for detection and classification, respectively.
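To illustrate the duplicate-suppression step on point predictions, here is a minimal greedy point-NMS sketch (a generic routine, not the authors' code); the suppression radius is a hypothetical parameter.

```python
import numpy as np

def point_nms(points, scores, radius):
    """points: (N, 2) candidate cell centres; scores: (N,) response values.
    Greedily keeps the highest-scoring point and suppresses any remaining
    candidate closer than `radius` to it."""
    order = np.argsort(-scores)
    keep = []
    suppressed = np.zeros(len(points), dtype=bool)
    for idx in order:
        if suppressed[idx]:
            continue
        keep.append(idx)
        dists = np.linalg.norm(points - points[idx], axis=1)
        suppressed |= dists < radius   # suppress near-duplicates (includes idx itself)
    return keep
```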
Stereo Visual Mesh for Generating Sparse Semantic Maps at High Frame Rates
ABSTRACT. The Visual Mesh is an input transform for deep learning that allows depth independent object detection at extremely high frame rates. The present study introduces a Visual Mesh based stereo vision method for sparse stereo semantic segmentation. A new dataset of simulated 3D scenes was generated and used for training to show that the method is capable of processing high resolution stereo inputs to generate both left and right sparse semantic maps. The new stereo method demonstrated superior classification accuracy when compared to the corresponding monocular approach. The very high frame rates and high accuracy may make the proposed approach attractive to fast-paced on-board robot or IoT applications.
ABSTRACT. Micro-expressions (MEs) have the characteristics of small motion amplitude and short duration. How to learn discriminative ME features is a key issue in ME recognition. Motivated by the success of the PCB model in person retrieval, this paper proposes a ME recognition method called PCB-PCANet+. Considering that the important information of MEs is mainly concentrated in a few key facial areas such as the eyebrows and eyes, we use multi-branch LSTM networks on top of the output of a shallow PCANet+ to separately learn local spatiotemporal features for each facial ROI region. In addition, in the multi-branch fusion stage, we design a feature weighting strategy according to the significance of different facial regions to further improve the performance of ME recognition. The experimental results on the SMIC and CASME II datasets validate the effectiveness of the proposed method.
Unsupervised Fabric Defect Detection Framework based on Knowledge Distillation
ABSTRACT. Fabric defect detection is a critical task in the textile industry. Efficient and accurate automated detection schemes, such as computer vision fabric quality inspection, are urgently needed. However, traditional feature-based methods are often limited and difficult to implement universal solutions in industrial scenarios due to their specificity towards certain defect types or textures. Meanwhile, machine learning methods may face difficulties in harsh industrial production environments due to insufficient data and labels. To address these issues, we propose an unsupervised defect detection framework based on knowledge distillation, which includes a visual localization module to assist with the detection task. Our approach significantly improves classification and segmentation accuracy compared to previous unsupervised methods. Besides, we perform a comprehensive set of ablation experiments to determine the optimal values of different parameters. Furthermore, our method demonstrates promising performance in both open databases and real industrial scenarios, highlighting its high practical value.
Global Exponential Synchronization of Quaternion-Valued Neural Networks via Quantized Control
ABSTRACT. In this paper, quantization controllers are designed to implement the global exponential synchronization of quaternion-valued neural networks. Firstly, based on Hamilton's principle, the quaternion-valued neural networks are decomposed into four equivalent real-valued neural networks. Then, utilizing the principles of Lyapunov stability and matrix inequality theory, the drive-response synchronization method is utilized to obtain the result on exponential synchronization of quaternion-valued neural networks. Finally, the effectiveness of the proposed method is verified by numerical simulation examples.
Informative Prompt Learning for Low-shot Commonsense Question Answering via Fine-Grained Redundancy Reduction
ABSTRACT. Low-shot commonsense question answering (CQA) poses a big challenge due to the absence of sufficient labeled data and commonsense knowledge. Recent work focuses on utilizing the commonsense reasoning potential of pre-trained language models (PLMs) for low-shot CQA. In addition, various prompt learning methods have been studied to elicit implicit knowledge from PLMs for performance promotion. However, it has been shown that PLMs suffer from the redundancy problem that many neurons encode similar information, especially under a small-sample regime, making prompt learning less informative in low-shot scenarios.
In this paper, we propose an informative prompt learning approach, which aims to elicit more diverse and useful knowledge from PLMs for low-shot CQA via fine-grained redundancy reduction. Specifically, our redundancy-reduction method imposes restrictions at the fine-grained neuron level to encourage each dimension to model different knowledge or clues. Experiments on three benchmark datasets show the great advantages of our proposed approach in low-shot settings. Moreover, we conduct both quantitative and qualitative analyses, which shed light on why our approach can lead to such improvements.
Rethinking unsupervised domain adaptation for nighttime tracking
ABSTRACT. Despite the considerable progress that has been achieved in visual object tracking, it remains a challenge to track in low-light circumstances. Prior nighttime tracking methods suffer from either weak collaboration of cascade structures or the lack of pseudo supervision, and thus fail to bring out satisfactory results. In this paper, we develop a novel unsupervised domain adaptation framework for nighttime tracking. Specifically, we benefit from the establishment of pseudo supervision in the mean teacher network, and further extend it with three components at the input level and the optimization level. For the unlabeled target domain dataset, we first introduce an assignment-based object discovery strategy to generate suitable training patches. Meanwhile, a low-light enhancer is embedded to improve the pseudo labels that facilitate the following consistency learning. Then, with the aid of better training data and pseudo labels, we replace the common mean square error with two stricter losses, which are entropy-decreasing classification consistency loss and confidence-weighted regression consistency loss, for better convergence. Experiments demonstrate that our proposed method achieves significant performance gains on multiple nighttime tracking benchmarks, and even brings slight enhancement on the source domain.
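As a toy illustration of a confidence-weighted regression consistency term in a mean-teacher setup (a simplified sketch with assumed tensor shapes, not the paper's actual loss), each student-teacher box discrepancy is weighted by the teacher's confidence in that pseudo label:

```python
import torch

def confidence_weighted_consistency(student_boxes, teacher_boxes, teacher_conf):
    """student_boxes, teacher_boxes: (N, 4) box regressions for the same patches;
    teacher_conf: (N,) teacher confidences used to weight each consistency term."""
    per_box = ((student_boxes - teacher_boxes) ** 2).mean(dim=-1)   # per-box squared error
    weights = teacher_conf / (teacher_conf.sum() + 1e-8)            # normalized confidences
    return (weights * per_box).sum()
```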
Improving SLDS performance using explicit duration variables with infinite support
ABSTRACT. We improve the segmentation in Switching Linear Dynamical Systems.
We extend the Beam Sampling algorithm to perform efficient inference, allowing for a duration distribution with infinite support.
We conduct experiments on three benchmarks (two already prevalent in the state-space model literature and one demonstrating behavior in a sparse setting) that test the correctness and efficiency of our solution.
Exploring Adaptive Regression Loss and Feature Focusing in Industrial Scenarios
ABSTRACT. Industrial defect detection is designed to detect quality defects in industrial products. However, the surface defects of different industrial products vary greatly, for example in the variety of texture shapes and the complexity of background information. A lightweight Focus Encoder-Decoder Network (FEDNet) is presented to solve these problems. Specifically, the novelty of FEDNet is as follows. First, the feature focusing module (FFM) is designed to focus attention on defect features in complex backgrounds. Secondly, a lightweight texture extraction module (LTEM) is proposed to extract, at low cost, the texture and relative location information of shallow-network defect features. Finally, AZIoU, an adaptive adjustment loss function, reexamines the regression of the prediction box with respect to its perimeter and aspect ratio. Experiments on two industrial defect datasets show that FEDNet achieves an accuracy of 42.86% on Steel and 72.19% on DeepPCB using only 15.3 GFLOPs.
Modeling User's Neutral Feedback in Conversational Recommendation
ABSTRACT. Conversational recommendation systems (CRS) enable traditional recommender systems to obtain fine-grained dynamic user preferences by incorporating interactive conversations. Although CRS has shown success in generating recommendation lists based on user preferences, existing methods restrict users to binary responses, i.e., accept or reject, after each recommendation, which greatly limits users from expressing their needs. In fact, a user's rejection feedback may contain other valuable information. To address this limitation, we refine the user's negative item-level feedback into attribute-level feedback and extend CRS to a more realistic scenario that incorporates not only positive and negative feedback but also neutral feedback. Neutral feedback denotes incomplete satisfaction with recommended items, which can guide the system to infer user preferences and optimize the recommendation. To better cope with this new setting, we propose a conversational recommendation model called Neutral Feedback in Conversational Recommendation (NFCR). We adopt a joint learning task framework for feature extraction and use predefined rules from inverse reinforcement learning to train the decision network, which enables us to make appropriate decisions at each turn. Finally, we utilize the fine-grained neutral feedback from users to acquire their dynamic preferences in the update and deduction module. We conducted comprehensive evaluations on four benchmark datasets to demonstrate the effectiveness of our model.
Jointly extractive and abstractive training paradigm for text summarization
ABSTRACT. Text summarization is a classical task in natural language generation, which aims to generate a concise summary of the original article. Neural networks based on the Encoder-Decoder architecture have made great progress in recent years in generating abstractive summaries with high fluency. However, due to the randomness of the abstractive model during generation, the summaries risk missing important information in the articles. To address this challenge, this paper proposes a jointly trained text summarization model that combines abstractive and extractive summarization. On the one hand, extractive models have higher ROUGE scores but poorer readability; on the other hand, abstractive models can produce a more fluent summary but suffer from the problem of omitting important information in the original text. Therefore, we share the encoder of both models and jointly train them to obtain a text representation that benefits from regularisation. We also add document-level information obtained from the extractive model to the decoder of the abstractive model to improve the abstractive summary. Experiments on the CNN/Daily Mail, PubMed, and arXiv datasets demonstrate the effectiveness of the proposed model.
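A schematic of how one shared encoder can serve both an extractive scorer and an abstractive decoder is sketched below; the module sizes, GRU choice, per-token salience head, and loss weighting are illustrative assumptions rather than the architecture described in the paper.

```python
import torch
import torch.nn as nn

class JointSummarizer(nn.Module):
    """Toy sketch: a shared encoder feeding an extractive scoring head and an
    abstractive decoder; both objectives are optimized jointly."""

    def __init__(self, vocab_size=30000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)   # shared encoder
        self.extract_head = nn.Linear(dim, 1)                # salience score (extractive branch)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.generator = nn.Linear(dim, vocab_size)          # token logits (abstractive branch)

    def forward(self, src_tokens, tgt_tokens):
        enc_out, enc_state = self.encoder(self.embed(src_tokens))
        ext_logits = self.extract_head(enc_out).squeeze(-1)  # salience per source position
        dec_out, _ = self.decoder(self.embed(tgt_tokens), enc_state)
        abs_logits = self.generator(dec_out)
        return ext_logits, abs_logits

# joint objective (illustrative): loss = alpha * extractive_bce + (1 - alpha) * abstractive_ce
```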
FHSI-GNN: Fusion Hierarchical Structure Information Graph Neural Network for Extractive Long Documents Summarization
ABSTRACT. Extractive text summarization aims to select salient sentences from documents. However, most existing extractive methods struggle to capture inter-sentence relations in long documents. In addition, the hierarchical structure information of the document is ignored. For example, some scientific documents have fixed chapters, and sentences in the same chapter have the same theme. To solve these problems, this paper proposes a Fusion Hierarchical Structure Information Graph Neural Network for Extractive Long Documents Summarization. The model constructs a section node containing sentence nodes and global information according to the document structure. It integrates the hierarchical structure information of the text and uses position information to identify sentences. The section node acts as an intermediary node for information interaction between sentences, which better enriches the relationships between sentences and has higher computational efficiency. Our model has achieved excellent results on two datasets, PubMed and arXiv. Further analysis shows that the hierarchical structure information of documents helps the model select salient content better.
A Three-Stage Framework For Event-Event Relation Extraction with Large Language Model
ABSTRACT. Expanding the parameter count of a large language model (LLM) alone is insufficient to achieve satisfactory outcomes in natural language processing tasks, specifically event extraction (EE), event temporal relation extraction (ETRE), and event causal relation extraction (ECRE). To tackle these challenges, we propose a novel three-stage extraction framework (ThreeEERE) that integrates an improved automatic chain-of-thought prompting (Auto-CoT) with the LLM and is tailored based on a golden rule to maximize event and relation extraction precision. The three stages consist of constructing examples in each category, federating local knowledge to extract relationships between events, and selecting the best answer. Although supervised models dominate these tasks, our experiments on three types of extraction tasks demonstrate that this three-stage approach yields significant results in event extraction and event relation extraction, even surpassing some supervised methods.
Two-Phase Semantic Retrieval For Explainable Multi-Hop Question Answering
ABSTRACT. Explainable Multi-Hop Question Answering (MHQA) requires the ability to reason explicitly across facts to arrive at the answer. The majority of multi-hop reasoning methods concentrate on semantic similarity to obtain the next hops or perform entity-centric inference. However, approaches that ignore the rationales required by a problem can easily lead to blindness in reasoning. In this paper, we propose a two-Phase text Retrieval method with an entity Mask mechanism (PRM), which focuses on the rationale from global semantics along with entity considerations. Specifically, it consists of two components: 1) The rationale-aware retriever is pre-trained via a dual encoder framework with an entity mask mechanism. The learned representations of hypotheses and facts are utilized to obtain the top K candidate core facts by sentence-level dense retrieval. 2) The entity-aware validator determines the reachability of hypotheses and core facts with an entity-granularity sparse matrix. Our experiments on three public datasets in the scientific domain (i.e., OpenbookQA, Worldtree, and ARC-Challenge) demonstrate that the proposed model achieves remarkable performance improvements over existing methods.
Effective Guidance in Zero-Shot Multilingual Translation via Multiple Language Prototypes
ABSTRACT. In a multilingual neural machine translation model that fully shares parameters across all languages, a popular approach is to use an artificial language token to guide translation into the desired target language. However, recent studies have shown that the language-specific signals in prepended language tokens are not adequate to guide MNMT models to translate in the right directions, especially for zero-shot translation (i.e., the off-target translation issue). We argue that the representations of prepended language tokens are overly affected by their context, resulting in potential information loss in the language tokens and insufficient indicative ability. To address this issue, we introduce multiple language prototypes to guide translation into the desired target language. Specifically, we categorize sparse contextualized language representations into a few representative prototypes over the training set, and inject their representations into each individual token to guide the model. Experiments on several multilingual datasets show that our method significantly alleviates the off-target translation issue and improves translation quality in both zero-shot and supervised directions.
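One way such prototypes could be obtained and injected, sketched here purely as an assumption (the paper does not necessarily use k-means or this injection rule): cluster the contextual states of the prepended language token collected over the training set, then add a prototype-derived signal to the source embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_language_prototypes(token_states, num_prototypes=4):
    """token_states: (num_sentences, hidden) contextual representations of the
    prepended target-language token collected over the training set.
    Returns (num_prototypes, hidden) cluster centroids used as prototypes."""
    km = KMeans(n_clusters=num_prototypes, n_init=10, random_state=0)
    km.fit(token_states)
    return km.cluster_centers_

def inject_prototypes(token_embeddings, prototypes, alpha=0.5):
    """Add the mean prototype to every token embedding of a source sentence
    to strengthen the target-language signal (illustrative injection rule)."""
    signal = prototypes.mean(axis=0, keepdims=True)
    return token_embeddings + alpha * signal
```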
Towards Scalable Feature Selection: An Evolutionary Multitask Algorithm Assisted by Transfer Learning Based Co-surrogate
ABSTRACT. When faced with large-instance datasets, existing feature selection methods based on evolutionary algorithms still face the challenge of high computational cost. To address this issue, this paper proposes a scalable evolutionary algorithm for feature selection on large-instance datasets, namely the transfer learning based co-surrogate assisted evolutionary multitask algorithm (cosEMT). Firstly, we tackle feature selection on large-instance datasets via an evolutionary multitasking framework. Co-surrogate models are constructed to measure the similarity between each auxiliary task and the main task, and knowledge transfer between tasks is realized through instance-based transfer learning. Based on the numerical relationship between the relative and absolute numbers of transferable instances, we propose a novel dynamic resource allocation strategy to make more efficient use of limited computational resources and accelerate evolutionary convergence. Meanwhile, an adaptive surrogate model update mechanism is proposed to balance the exploration and exploitation of the base optimizer embedded in the cosEMT framework. Finally, the proposed algorithm is compared with several state-of-the-art feature selection algorithms on twelve large-instance datasets. The experimental results show that the cosEMT framework can achieve significant acceleration in convergence speed and obtain high-quality solutions. These results verify that cosEMT is a highly competitive method for feature selection on large-instance datasets.
A Two-Stage Network for Segmentation of Vertebrae and Intervertebral Discs: Integration of Efficient Local-Global Fusion Using 3D Transformer and 2D CNN.
ABSTRACT. In the field of computer-aided diagnosis (CAD) for spinal diseases, the fundamental task of multi-label segmentation for vertebrae and intervertebral discs (IVDs) assumes a significant role. However, the distinctive characteristics inherent to the spinal structure pose considerable challenges to the segmentation process, impeding its practical applicability in clinical settings. Convolutional neural networks have been widely used in this task; however, their limited receptive field restricts their capacity to capture extended-range spatial correlations. Consequently, the model's ability to accurately delineate vertebral boundaries is compromised, leading to a notable deterioration in the quality of segmentation outputs. To address this limitation, we propose a novel two-stage convolutional neural network (CNN) framework that incorporates both 3D Transformers and 2D CNNs. By synergistically leveraging the advantages of Transformers in facilitating the integration of long-range dependencies and the ability of CNNs to learn global and local features, our proposed approach exhibits promising potential in enhancing the segmentation performance for vertebrae and intervertebral discs. Moreover, we introduce a graph convolution module into our network architecture to exploit the inherent spatial dependencies present in MRI scans of spinal structures, thereby extracting semantic feature representations and further augmenting the efficacy of segmentation. The evaluation of our proposed method is conducted on the MRSpineSeg Challenge dataset, encompassing T2-weighted MR images. Experimental results affirm the superiority of our approach over representative state-of-the-art methodologies.
A domain knowledge-based semi-supervised pancreas segmentation approach
ABSTRACT. Deep learning-based methods have achieved remarkable results; however, obtaining enough labeled data is time-consuming and labor-intensive. Semi-supervised learning is an effective way to alleviate the dependence on annotated data by exploiting unlabeled data. Existing semi-supervised segmentation works tend to ignore domain knowledge, leading to location and shape bias. In this paper, we propose a semi-supervised medical segmentation method based on domain knowledge. Specifically, prior constraints for different organ sub-regions are used to guide pseudo-label generation for unlabeled data. Then, a bidirectional information flow regularization is designed that further utilizes the pseudo-labels, encouraging the model to align the labeled and unlabeled data distributions. Extensive experiments on the NIH pancreas dataset show that the proposed method achieves Dice scores of 76.23% and 80.76% under 10% and 20% labeled data, respectively, which is superior to other semi-supervised pancreas segmentation methods.
How to support sport management with decision systems? Swimming athletes assessment study case
ABSTRACT. Information systems in sports play an increasingly important role due to the opportunities and benefits they present to sports clubs. The purpose of these systems is to assist in decision-making processes concerning marketing, the selection of training parameters, recovery methods, or the selection of team members, among others. It results in increased interest in decision support systems in this area, allowing clubs to gain an advantage over their competitors.
In this paper, we propose a research approach for creating models to evaluate the performance of swimming athletes based on their physical parameters and the sport level they represent. Due to the uncertainties involved, data represented as Triangular Fuzzy Numbers (TFN) were used in the Fuzzy Technique for Order Preference by Similarity to an Ideal Solution (TOPSIS) method to obtain a ranking of the athletes. A sensitivity analysis for the exclusion of subsequent criteria was also carried out. The results obtained were compared with respect to different choices of the criteria weights. An additional six Fuzzy Multi-Criteria Decision Analysis (MCDA) methods were used for a comprehensive analysis, and the results showed that the proposed averaged ranking is a reasonable solution. The proposed approach can be used to evaluate players from different sports so that sports clubs can recruit athletes with high performance potential.
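For orientation, the core TOPSIS ranking principle can be sketched in a few lines; the snippet below uses crisp scores for brevity, whereas the paper works with Triangular Fuzzy Numbers, and the athlete criteria and weights shown are made-up examples.

```python
import numpy as np

def topsis(matrix, weights, benefit):
    """matrix: (alternatives, criteria) crisp scores;
    weights: criteria weights summing to 1;
    benefit: boolean array, True where larger values are better."""
    norm = matrix / np.linalg.norm(matrix, axis=0)       # vector normalization
    v = norm * weights                                   # weighted normalized matrix
    ideal = np.where(benefit, v.max(axis=0), v.min(axis=0))
    anti = np.where(benefit, v.min(axis=0), v.max(axis=0))
    d_plus = np.linalg.norm(v - ideal, axis=1)           # distance to ideal solution
    d_minus = np.linalg.norm(v - anti, axis=1)           # distance to anti-ideal solution
    return d_minus / (d_plus + d_minus)                  # closeness coefficient

# toy example: 3 athletes, 2 criteria (performance score up, reaction time down)
scores = topsis(np.array([[7.0, 0.8], [6.5, 0.6], [8.0, 0.9]]),
                weights=np.array([0.6, 0.4]),
                benefit=np.array([True, False]))
print(np.argsort(-scores))   # ranking, best first
```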
Graph Attention Hashing via Contrastive Learning for Unsupervised Cross-modal Retrieval
ABSTRACT. Hashing-based cross-modal retrieval maps multi-modal features into binary codes in a common Hamming space. Due to its small storage consumption and high efficiency, hashing has received extensive attention in recent years. However, current research has difficulty constructing a well-defined joint semantic space and providing detailed, in-depth learning guidance. In this paper, Graph Attention Hashing via Contrastive Learning (GAHCL) is proposed to address these issues. First, we use the idea of contrastive learning to generate positive samples, and propose a novel contrastive adjacency matrix through a graph attention network. Specifically, this matrix assigns higher weights to node pairs whose source is the same sample, and lower weights to node pairs that do not match each other. The key semantic features can be captured more carefully and accurately under the influence of the attention weights. In addition, the contrastive loss function is constructed by taking the output features of different modalities of an instance and its generated positive-sample features as a positive sample pair. Extensive experiments on two datasets show that the proposed method significantly outperforms existing competitors.
An Efficient Enhanced-YOLOv5 Algorithm for Multi-scale Ship Detection
ABSTRACT. Ship detection has gained considerable attention from industry and academia. However, due to the diverse range of ship types and complex marine environments, multi-scale ship detection faces great challenges, including low detection rates and high computation time. To solve these issues, we propose an efficient enhanced-YOLOv5 algorithm for multi-scale ship detection. Specifically, to dynamically extract two-dimensional features, we design a MetaAconC-inspired adaptive spatial-channel attention module that reduces the impact of complex marine environments on large-scale ships. In addition, we construct a gradient-refined bounding box regression module to enhance the sensitivity of the loss function gradient and strengthen the feature learning ability, which relieves the issue of uneven horizontal and vertical features in small-scale ships. Finally, a Taylor expansion-based classification module is established, which increases the feedback contribution of the gradient by adjusting the first polynomial coefficient vertically and improves the detection performance of the model on ship objects with few samples. Extensive experimental results confirm the effectiveness of the proposed method.
Double-Layer Blockchain-Based Decentralized Integrity Verification for Multi-Chain Cross-Chain Data
ABSTRACT. With the development of blockchain technology, there are certain bottlenecks in terms of storage, throughput, and latency. To address these issues, many multi-chain architectures have emerged to enable data interoperability among different blockchains through cross-chain techniques. However, in highly distributed cross-chain scenarios, the integrity of cross-chain data can be compromised intentionally or unintentionally. Due to the decentralized nature of blockchain, centralized verification schemes are not feasible, making decentralized cross-chain data integrity verification a critical and challenging problem. In this paper, based on the ideas of "governing the chain by chain" and "double-layer blockchain", we propose a double-layer blockchain-based decentralized integrity verification scheme to solve this problem. We construct a supervision-chain by selecting representative nodes from multiple blockchains, which is responsible for cross-chain data integrity verification and recording the results. Specifically, we elaborate on the consensus process, comprising integrity consensus and block consensus. The integrity consensus stage achieves decentralized data integrity verification, while the block consensus stage packages and records the results from the integrity consensus stage. Furthermore, we design a reputation system and an election algorithm within the supervision-chain. Through security analysis and performance evaluation, we demonstrate the security and effectiveness of our proposed scheme.
Self-Supervised-Enhanced Dual Hierarchical Graph Convolution Network for Social Recommendation
ABSTRACT. Graph convolution networks (GCNs) have made significant progress in the field of recommendation systems in recent years, and many GCN-based frameworks have been applied in social recommendation methods. The essence of social recommendation is modeling user preferences through user social relationships to alleviate the sparsity issue. However, existing GCN-based social recommendation frameworks still have some inherent problems. Firstly, since there are no node attributes available as semantic information in the recommendation task, lightweight graph convolutions that remove feature transformation and the non-linear activation function have become widely applied, with their core lying in the message passing mechanism; social recommendation methods usually apply existing message passing paradigms directly, which have obvious limitations. Secondly, most existing social recommendation frameworks are limited to pairwise relations and unable to effectively extract implicit inter-graph information. To address these issues, we propose a Self-Supervised-Enhanced Dual Hierarchical Graph Convolution Network (SSHGCN). In this framework, we first propose a LEWB message passing paradigm applied to the graph convolution network to train user and item representations. Then we explicitly model, via hypergraphs, the marginal information between the user social graph and the user-item interaction graph, and between the item knowledge graph and the user-item interaction graph. Finally, we construct hierarchical self-supervised signals and unify the self-supervised task and the recommendation task for joint training. Extensive experiments on real-world datasets demonstrate that our method outperforms competitive methods. A thorough ablation study verifies the rationality of the LEWB message passing paradigm and the effectiveness of the hierarchical self-supervised tasks in our framework.
Differential Private (Random) Decision Tree without Adding Noise
ABSTRACT. The decision tree is a typical machine learning algorithm with many expanded variations. However, regarding privacy, few of these variations have reached a practical level due to the difficulty of balancing privacy preservation and performance. In this paper, we propose a method of applying privacy preservation to the (random) decision tree, an expanded decision tree variation proposed by Fan et al. in 2003, to achieve the following goals:
- Model training on data belonging to multiple organizations while concealing these data among the organizations.
- No leakage of training data from trained models.
NDGR: A Noise Divide and Guided Re-labeling Framework for Distantly Supervised Relation Extraction
ABSTRACT. Distant supervision (DS) is widely used in relation extraction to reduce annotation cost, but it suffers from noisy instances. Existing methods typically select reliable instances by relying on potentially noisy labels, which results in selecting many noisy instances or discarding a large number of valuable training instances. In this paper, we propose NDGR, a novel training framework for sentence-level distantly supervised relation extraction. NDGR separates noisy data from the DS-built data by modeling the loss distribution with a Gaussian Mixture Model, then assigns pseudo labels to the noisy data to transform them into useful training data. To alleviate noise in the generated labels, we adopt a guided label generation strategy that uses the updated Relation Extraction Network as a reference to optimize the Label Generation Network. Through iterative execution of noise division and guided label generation, NDGR refines the noisy DS-built data and enhances performance. Extensive experiments on widely used benchmarks demonstrate that our method yields significant improvements in sentence-level evaluation and a strong de-noising effect.
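As a concrete illustration of the noise-division step described above, the minimal sketch below fits a two-component Gaussian Mixture Model to per-instance training losses and treats the low-loss component as clean data. The function name, threshold, and synthetic losses are hypothetical; NDGR's actual networks and data pipeline are not reproduced.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def divide_by_loss(losses, clean_prob_threshold=0.5):
    """Split instances into 'clean' and 'noisy' sets by modeling the
    per-instance loss distribution with a 2-component GMM (a stand-in for
    the divide step; real losses would come from the RE network)."""
    losses = np.asarray(losses, dtype=np.float64).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(losses)
    clean_component = int(np.argmin(gmm.means_.ravel()))  # low-loss component
    p_clean = gmm.predict_proba(losses)[:, clean_component]
    clean_idx = np.where(p_clean >= clean_prob_threshold)[0]
    noisy_idx = np.where(p_clean < clean_prob_threshold)[0]
    return clean_idx, noisy_idx

# toy usage with synthetic losses: most instances low-loss, some high-loss
rng = np.random.default_rng(0)
losses = np.concatenate([rng.normal(0.2, 0.05, 800), rng.normal(1.5, 0.3, 200)])
clean_idx, noisy_idx = divide_by_loss(losses)
print(len(clean_idx), len(noisy_idx))
```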
Inter-modal Fusion Network with Graph Structure Preserving for Fake News Detection
ABSTRACT. The continued spread of fake news on networks threatens the stability and security of society, prompting researchers to focus on fake news detection. The development of social media has made it challenging to detect fake news using only uni-modal information. Existing studies tend to integrate multi-modal information to pursue completeness in information mining. How to eliminate modality differences effectively while capturing structure information from multi-modal data remains a challenging issue. To solve this problem, we propose an Inter-modal Fusion network with Graph Structure Preserving (IF-GSP) approach for fake news detection. An inter-modal cross-layer fusion module is designed to bridge the modality differences by integrating features in different layers between modalities. Intra-modal and cross-modal contrastive losses are designed to enhance the inter-modal semantic similarity while focusing on modal-specific discriminative representation learning. A graph structure preserving module is designed to make the learned features fully perceive the graph structure information based on a graph convolutional network (GCN). A multi-modal fusion module utilizes an attention mechanism to adaptively integrate cross-modal feature representations. Experiments on two widely used datasets show that IF-GSP significantly outperforms related multi-modal fake news detection methods.
Learning to Match Features with Geometry-aware Pooling
ABSTRACT. Finding reliable and robust correspondences across images is a fundamental and crucial step for many computer vision tasks, such as 3D reconstruction and virtual reality. However, previous studies still struggle in challenging cases, including large view changes, repetitive patterns and textureless regions, because they neglect geometric constraints during feature encoding. Accordingly, we propose GPMatcher, a novel matcher designed to introduce geometric constraints and guidance into the feature encoding process. To achieve this goal, we compute camera poses from the corresponding features in each attention layer and adopt geometry-aware pooling to reduce redundant information in the next layer. By these means, an iterative geometry-aware pooling and pose estimation pipeline is constructed, which avoids updating redundant features and reduces the impact of noise. Experiments conducted on a range of evaluation benchmarks demonstrate that our method improves matching accuracy and achieves state-of-the-art performance.
Multi-Scale Information Fusion Combined with Residual Attention for Text Detection
ABSTRACT. Driven by deep learning and neural networks, text detection technology has made further progress. Due to the complexity and diversity of scene text, detecting text of arbitrary shapes remains a challenging task. Previous segmentation-based text detection methods can hardly solve the problem of missed detection in complex scenes. In this paper, we propose a text detection model that combines residual attention with a multi-scale information fusion structure to effectively capture text information in natural scenes and avoid text omission. Specifically, the multi-scale information fusion structure extracts text features from different levels to achieve better text localisation and facilitate the fusion of text information. At the same time, residual attention is combined with features from high-resolution images to enhance the contextual information of the text and avoid missing text. Finally, text instances are obtained by a binarisation method. This method is very helpful for text detection in complex scenes. Experiments conducted on three public benchmark datasets show that the method achieves state-of-the-art performance.
DAMFormer: Enhancing Polyp Segmentation through Dual Attention Mechanism
ABSTRACT. Polyp segmentation has been a challenging problem for researchers because polyps do not have a specific shape, color, or size. Traditional deep learning models based on convolutional neural networks (CNNs) struggle to generalize well on unseen datasets. However, the Transformer architecture has shown promising potential in addressing medical problems by effectively capturing long-range dependencies through self-attention. This paper introduces DAMFormer, a Transformer-based model that targets high accuracy while remaining lightweight. DAMFormer utilizes a Transformer encoder to extract better global information. The Transformer outputs are strategically fed into the ConvBlock and the Enhanced Dual Attention Module to effectively capture high-frequency and low-frequency information. These outputs are further processed through the Effective Feature Fusion module to combine global and local features efficiently. In our experiments, five standard benchmark datasets were used: Kvasir, CVC-ClinicDB, CVC-ColonDB, CVC-T, and ETIS-Larib.
Customized Anchors Can Better Fit the Target in Siamese Tracking
ABSTRACT. Most existing siamese trackers rely on a set of fixed anchors to estimate the scale and aspect ratio of all targets. However, in real tracking, different targets have different sizes and shapes, and these predefined anchors cannot cover all possible scales and aspect ratios arising from various movements and deformations, so an adaptive scale and aspect ratio estimation method is needed for robust online tracking. In this paper, a customized anchor generation module is first proposed to estimate the shape of the target and generate customized anchors adapted to it. Then, through an anchor adaptation module, the information of each anchor is embedded into the corresponding feature to learn more discriminative features. Finally, we design a target-aware feature correlation module to reduce the interference of background information. It takes the region of interest of the template as the variable template and its central subregion as the central template, and then performs global and local correlation operations, respectively. Experiments on benchmarks including OTB100, VOT2019, LaSOT, UAV123, and VOT2018 show that our tracker achieves promising performance.
Topological Dynamics of Functional Neural Network Graphs During Reinforcement Learning
ABSTRACT. This study investigates the topological structures of neural network activation graphs, with a focus on detecting higher-order Betti numbers during reinforcement learning. The paper presents visualisations of the neurotopological dynamics of reinforcement learning agents both during and after training, which are useful for the different dynamics analyses explored in this work. Two applications are considered: frame-by-frame analysis of agent neurotopology and tracking per-neuron presence in cavity boundaries over training steps. The experimental analysis suggests that higher-order Betti numbers found in a neural network's functional graph can be associated with learning more complex behaviours.
Retrieval-augmented GPT-3.5-based Text-to-SQL Framework with Sample-aware Prompting and Dynamic Revision Chain
ABSTRACT. Text-to-SQL aims at generating SQL queries for given natural language questions, thus helping users query databases. Prompt learning with large language models (LLMs) has emerged as a recent approach, which designs prompts to lead LLMs to understand the input question and generate the corresponding SQL. However, it faces challenges with strict SQL syntax requirements. Existing work prompts the LLMs with a list of demonstration examples (i.e., question-SQL pairs) to generate SQL, but such fixed prompts can hardly handle scenarios where the semantic gap between the retrieved demonstrations and the input question is large. In this paper, we propose a retrieval-augmented prompting method for an LLM-based Text-to-SQL framework, involving sample-aware prompting and a dynamic revision chain. Our approach incorporates sample-aware demonstrations, which include the composition of SQL operators and fine-grained information related to the given question. To tackle the problem of different expressions conveying the same SQL intention, we propose two strategies for assisting retrieval. Firstly, we leverage LLMs to simplify the original questions, unifying the syntax and thereby clarifying the users' intentions. To generate executable and accurate SQL without human intervention, we design a dynamic revision chain that iteratively adapts fine-grained feedback from the previously generated SQL. Experimental results on three Text-to-SQL benchmarks demonstrate the superiority of our method over strong baseline models.
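To make the prompting idea concrete, the sketch below assembles a demonstration-augmented prompt from retrieved question-SQL pairs. The prompt template, schema, and example pairs are hypothetical; the paper's exact prompt format and retrieval machinery are not shown.

```python
def build_prompt(question, demonstrations, schema):
    """Assemble a Text-to-SQL prompt from retrieved demonstrations
    (hypothetical template; real systems would also include feedback
    from previously generated SQL for revision)."""
    parts = [f"-- Database schema:\n{schema}\n"]
    for demo_question, demo_sql in demonstrations:
        parts.append(f"-- Question: {demo_question}\n{demo_sql}\n")
    parts.append(f"-- Question: {question}\n")
    return "\n".join(parts)

# toy retrieved demonstrations and schema
demos = [
    ("How many singers are there?", "SELECT count(*) FROM singer;"),
    ("List the names of singers older than 30.",
     "SELECT name FROM singer WHERE age > 30;"),
]
schema = "CREATE TABLE singer (id INT, name TEXT, age INT);"
print(build_prompt("What is the average age of singers?", demos, schema))
```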
Rumor Detection with Supervised Graph Contrastive Regularization
ABSTRACT. Rumors generated on social networks can spread quickly and have a serious impact on social stability and residents' daily lives. Recently, rumor detection methods based on the feedback information and propagation structure generated during user interaction have received attention. Most rumors with salient features can be effectively distinguished by graphical models employing cross-entropy loss. However, these traditional models may generalize poorly and lack robustness in the face of noisy reviews and mislabeled samples containing malicious fabrications. In this paper, we propose a novel Supervised Graph Contrastive Regularization (SGCR) method to deal with these complex situations, in which label information is used for supervised contrastive learning by applying simple regularization to the embedding variance of each dimension separately. To explicitly avoid the collapse problem, session threads belonging to the same class are pulled together in the embedding space, while session threads from different classes are pushed apart. When updating model parameters through backpropagation, we reduce gradient conflicts between different tasks through gradient projection. Experimental results on two real-world datasets demonstrate that SGCR performs better than baseline methods.
PoShapley-BCFL: A fair and robust decentralized federated learning based on blockchain and the proof of Shapley-value
ABSTRACT. Recently, blockchain-based federated learning (BCFL) has emerged as a promising technology for promoting data sharing in the Internet of Things (IoT) without relying on a central authority, while ensuring data privacy, security, and traceability. However, it remains challenging to design a decentralized and appropriate incentive scheme that promises fair and efficient contribution evaluation for participants while defending against low-quality data attacks. Although Shapley-value (SV) methods have been widely adopted in FL due to their ability to quantify individuals' contributions, they rely on a central server for calculation and incur high computational costs, making them impractical for decentralized and large-scale BCFL scenarios. In this paper, we design and evaluate PoShapley-BCFL, a new blockchain-based FL approach that accommodates both contribution evaluation and defense against inferior data attacks. Specifically, we propose PoShapley, a Shapley-value-enabled blockchain consensus protocol tailored to support fair and efficient contribution assessment in PoShapley-BCFL. It mimics the Proof-of-Work mechanism by allowing all participants to compute contributions in parallel based on an improved lightweight SV approach. Building on the PoShapley protocol, we further design a fair-robust aggregation rule to improve the robustness of PoShapley-BCFL against inferior data attacks. Extensive experimental results validate the accuracy and efficiency of PoShapley in terms of distance and time cost, and also demonstrate the robustness of PoShapley-BCFL.
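For readers unfamiliar with Shapley-value contribution evaluation, the sketch below shows a generic Monte Carlo approximation over sampled participant permutations. The utility function and participant qualities are toy assumptions; the paper's improved lightweight SV scheme and its consensus integration are not reproduced here.

```python
import random

def monte_carlo_shapley(players, utility, num_permutations=200, seed=0):
    """Approximate each participant's Shapley value by averaging marginal
    utility gains over random join orders (generic sketch, not PoShapley)."""
    rng = random.Random(seed)
    values = {p: 0.0 for p in players}
    for _ in range(num_permutations):
        order = players[:]
        rng.shuffle(order)
        coalition = []
        prev_utility = utility(frozenset(coalition))
        for p in order:
            coalition.append(p)
            cur_utility = utility(frozenset(coalition))
            values[p] += (cur_utility - prev_utility) / num_permutations
            prev_utility = cur_utility
    return values

# toy utility: validation-accuracy gain proportional to each client's data quality
quality = {"A": 0.5, "B": 0.3, "C": 0.1}
players = list(quality)
utility = lambda coalition: sum(quality[p] for p in coalition)
print(monte_carlo_shapley(players, utility))  # ~ {'A': 0.5, 'B': 0.3, 'C': 0.1}
```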
Dendritic Neural Regression Model Trained by Chicken Swarm Optimization Algorithm for Bank Customer Churn Prediction
ABSTRACT. Recently, banks and other enterprises have been constantly facing the problem of customer churn. As an important component of Customer Relationship Management, predicting customer churn has become increasingly urgent. Inspired by biological neurons, we build a dendritic neural network model with four layers, namely a synaptic layer, a dendritic layer, a membrane layer and a soma layer, for bank customer churn prediction. The Chicken Swarm Optimization (CSO) algorithm is used as one of our training algorithms. In this paper, we propose a novel dendritic neural model called CSO-DNM, and experimental results are reported on a benchmark dataset from Kaggle. Compared with other algorithms and models, the proposed model achieves the highest accuracy and the fastest convergence in customer churn prediction.
BERT-LBIA: A BERT-Based Late Bidirectional Interaction Attention Model for Legal Case Retrieval
ABSTRACT. Most legal case retrieval methods rely on pre-trained language models like BERT, which can be slow and inaccurate. Alternatively, representation-based models provide quick responses but may not be the most accurate. To address these issues, our paper proposes a BERT-based late bidirectional interaction attention model for similar legal case retrieval. We use a dual BERT model as our backbone network to obtain feature representations of a query and its case candidates. Then, we develop a bidirectional interaction attention network to generate deep interactive attention signals between the query and its corresponding case candidates. Our experiments show that our model is faster and more accurate than existing retrieval models.
ML2FNet: A Simple but Effective Multi-Level Feature Fusion Network for Document-Level Relation Extraction
ABSTRACT. Document-level relation extraction presents new challenges compared to its sentence-level counterpart, as it aims to extract relations across multiple sentences. Current graph-based and transformer-based models have achieved some success. However, most approaches focus only on local information about individual entities, without considering the global interdependency among relational triples. To solve this problem, this paper proposes a novel relation extraction model with a Multi-Level Feature Fusion Network (ML2FNet). Specifically, we first establish the interaction between entities by constructing an entity-level relation matrix. Then, we employ an enhanced U-shaped network to fuse the multi-level features of entity pairs from local to global. Finally, the relation classification of entity pairs is performed by a bilinear classifier. We conduct experiments on three public document-level relation extraction datasets, and the results show that ML2FNet outperforms the other baselines.
Implicit Clothed Human Reconstruction Based on Self-attention and SDF
ABSTRACT. Recently, implicit function-based approaches have advanced 3D human reconstruction from a single-view image. However, previous methods suffer from problems such as artifacts, broken limbs, and loss of surface details when dealing with challenging poses. To address these issues, this paper proposes a novel neural network model based on PaMIR. Firstly, the Signed Distance Function (SDF) is introduced to define the implicit function, which improves the generalization ability of the model. Secondly, a feature volume encoding network with a self-attention mechanism is designed to extract voxel-aligned features and provide richer geometric information, further improving the accuracy of the shape topology. On the CAPE dataset, our method reduces the Chamfer and Point-to-Surface distances of PaMIR by 50.9% and 30.6% respectively, and normal estimation errors by 18.2%.
Improving GNSS-R Sea Surface Wind Speed Retrieval from FY-3E Satellite Using Multi-Task Learning and Physical Information
ABSTRACT. Global Navigation Satellite System Reflectometry (GNSS-R) technology has great advantages over traditional satellite remote sensing detection of sea surface wind field in terms of cost and timeliness. It has attracted increasing attention and research from scholars around the world. This paper focuses on the Fengyun-3E (FY-3E) satellite, which carries the GNOS II sensor that can receive GNSS-R signals. We analyze the limitations of the conventional sea surface wind speed retrieval method and the existing deep learning model for this task, and propose a new sea surface wind speed retrieval model for FY-3E satellite based on a multi-task learning (MTL) network framework. The model uses the forecast product of Hurricane Weather Research and Forecasting (HWRF) model as the label, and inputs all the relevant information of Delay-Doppler Map (DDM) in the first-level product into the network for comprehensive learning. We also add wind direction, U wind and V wind physical information as constraints for the model. The model achieves good results in multiple evaluation metrics for retrieving sea surface wind speed. On the test set, the model achieves a Root Mean Square Error (RMSE) of 2.5 and a Mean Absolute Error (MAE) of 1.85. Compared with the second-level wind speed product data released by Fengyun Satellite official website in the same period, which has an RMSE of 3.37 and an MAE of 1.9, our model improves the performance by 52.74% and 8.65% respectively, and obtains a better distribution.
An Attack Entity Deducing Model for Attack Forensics
ABSTRACT. The forensics of Advanced Persistent Threat (APT) attacks, known for their prolonged duration and utilization of multiple attack methods, require extensive log analysis to discern their attack steps. Facing the massive amount of data, researchers have increasingly turned to extended machine learning methods to enhance attack forensics. However, the limited number of attack samples used for training and the inability of the data to accurately represent real-world scenarios pose significant challenges. To address these issues, we propose ASAI, an attack deduction model that leverages auxiliary strategies and dynamic word embeddings. Firstly, ASAI tackles the problem of data imbalance through a sequence sampling method enhanced by a custom auxiliary strategy. Subsequently, the sequences are transformed into dynamic vectors using dynamic word embedding. The model is trained to capture the spatio-temporal characteristics of entities under diverse contextual conditions by employing these dynamic vectors. In this paper, ASAI is evaluated using ten real-world APT attacks executed within an actual virtual environment. The results demonstrate ASAI's ability to successfully recover the key steps of the attacks and construct attack stories, achieving an impressive F1 score of up to 99.70%—a significant 16.98% improvement over the baseline.
Small-World Echo State Networks for Nonlinear Time-Series Prediction
ABSTRACT. Echo state networks (ESNs) are an efficient paradigm for training recurrent neural networks (RNNs). However, ESNs sometimes suffer from poor performance and robustness due to the non-trainable reservoir. This paper proposes a novel computational framework for ESNs, where a small-world network is applied as the reservoir topology, a biologically plausible unsupervised learning method named the dual-threshold Bienenstock-Cooper-Munro (DTBCM) learning rule is applied to adjust reservoir weights adaptively, and a recursive least squares (RLS) algorithm equipped with memory regressor extension is applied to update readout weights. The proposed model is compared with several kinds of ESNs on two benchmark nonlinear time-series datasets, the 10th-order nonlinear autoregressive moving average (NARMA10) system and the Mackey-Glass system. Simulation results show that the proposed model not only achieves the best prediction performance but also exhibits remarkable robustness against noise.
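The reservoir-topology idea can be illustrated with a minimal sketch: a Watts-Strogatz small-world graph is used as the ESN reservoir and rescaled to a target spectral radius, then driven by an input sequence. Parameter values are illustrative; the DTBCM weight adaptation and RLS readout from the paper are not included.

```python
import numpy as np
import networkx as nx

def small_world_reservoir(n=300, k=6, p=0.1, spectral_radius=0.9, seed=0):
    """Build a reservoir whose topology is a Watts-Strogatz small-world graph,
    assign random signed weights to its edges, and rescale the matrix so its
    spectral radius matches the target (a common ESN heuristic)."""
    graph = nx.watts_strogatz_graph(n, k, p, seed=seed)
    rng = np.random.default_rng(seed)
    w = nx.to_numpy_array(graph)                    # 0/1 adjacency
    w *= rng.uniform(-1.0, 1.0, size=w.shape)       # random signed edge weights
    radius = np.max(np.abs(np.linalg.eigvals(w)))
    return w * (spectral_radius / radius)

def esn_states(w_res, w_in, inputs, leak=1.0):
    """Run the reservoir over a 1-D input sequence and collect its states."""
    x = np.zeros(w_res.shape[0])
    states = []
    for u in inputs:
        x = (1 - leak) * x + leak * np.tanh(w_res @ x + w_in * u)
        states.append(x.copy())
    return np.array(states)

w_res = small_world_reservoir()
w_in = np.random.default_rng(1).uniform(-0.5, 0.5, w_res.shape[0])
u = np.sin(np.linspace(0, 20, 200))                 # toy input signal
print(esn_states(w_res, w_in, u).shape)             # (200, 300)
```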
Zero-shot Relation Triplet Extraction via Retrieval-Augmented Synthetic Data Generation
ABSTRACT. In response to the challenge that existing relation triplet extraction models struggle to adapt to new relation categories in zero-shot scenarios, we propose a method that combines generated synthetic training data with the retrieval of relevant documents through a rank-based filtering approach for data augmentation. This approach alleviates the problem of low-quality synthetic training data and reduces noise that may affect the accuracy of triplet extraction in certain relation categories. Experimental results on two public datasets demonstrate that our model exhibits stable and impressive performance compared to the baseline models in terms of precision, recall, and F1 score, resulting in improved effectiveness for zero-shot relation triplet extraction.
Parallelizable Simple Recurrent Units with Hierarchical Memory
ABSTRACT. Recurrent neural networks and their many variants have been widely used in language modeling, text generation, machine translation, speech recognition and so forth, due to their excellent ability to process sequence data. However, these networks are constructed by multi-layer stacking, which makes memory-dependent information from the distant past decay continuously. To this end, this paper proposes a parallelizable simple recurrent unit with hierarchical memory (PSRU-HM) to preserve more long-term historical information for inference. This is achieved by a nested SRU structure, which realizes information interaction between the inner and outer memory cells through connections between the inner and outer layers. The depth of the network can be dynamically adjusted according to task complexity. Meanwhile, a skip connection that combines high-level and low-level features is added to the outermost layer, maximizing the utilization of effective input information. To accelerate training and inference, the weights of PSRU-HM are reorganized to enable parallel deployment in the CUDA framework. Extensive experiments on several public datasets, covering text classification, language modeling and question answering, show that PSRU-HM outperforms traditional methods and achieves a 2× speed-up compared to the cuDNN-optimized LSTM.
Incorporating Syntactic Cognitive in Multi-granularity Data Augmentation for Chinese Grammatical Error Correction
ABSTRACT. Chinese grammatical error correction (CGEC) has recently attracted considerable attention due to its real-world value. Current mainstream approaches are all data-driven, but the following flaws still exist. First, there is little high-quality training data with complex and varied errors, and data-driven approaches frequently fail to improve performance significantly because of this lack of data. Second, existing data augmentation methods for CGEC mainly focus on word-level augmentation and ignore syntactic-level information. Third, current data augmentation methods are highly randomized, and few of them fit the pattern of students' cognition of grammatical errors. In this paper, we propose a novel multi-granularity data augmentation method for CGEC. We construct a syntactic error knowledge base for the error types Missing and Redundant Components, and syntactic conversion rules for the error type Improper Word Order, based on a finely labeled syntactic structure treebank. Additionally, we compile a knowledge base of character and word errors from actual student essays. Then, a data augmentation algorithm incorporating character, word, and syntactic noise is designed to build the training set. Extensive experiments and detailed analysis show that our method achieves an F0.5 of 36.77% on the test set, a 6.2% improvement over the best model in the NLPCC Shared Task, proving its validity.
Long Short-Term Planning for Conversational Recommendation Systems
ABSTRACT. In Conversational Recommendation Systems (CRS), the central question is how the conversational agent can naturally ask for user preferences and provide suitable recommendations. Existing works mainly follow a hierarchical architecture, where a higher policy decides whether to invoke the conversation module (to ask questions) or the recommendation module (to make recommendations). This architecture prevents the two components from fully interacting with each other. In contrast, this paper proposes a novel architecture, the long short-term feedback architecture, to connect these two essential components in CRS. Specifically, the recommendation module predicts the long-term recommendation target based on the conversational context and the user history. Driven by the targeted recommendation, the conversational model predicts the next topic or attribute to verify whether the user preference matches the target. This feedback loop continues until the short-term planner output matches the long-term planner output, that is, when the system should make the recommendation.
6D Object Pose Estimation with Attention Aware Bi-Gated Fusion
ABSTRACT. Accurate object pose estimation is a prerequisite for successful robotic grasping tasks. Currently, keypoint-based pose estimation methods using RGB-D data have shown promising results in simple environments. However, how to fuse the complementary features of RGB-D data is still a challenging task. To this end, this paper proposes a two-branch network with an attention aware bi-gated fusion (A2BF) module for keypoint-based 6D object pose estimation, named A2BNet for short. The A2BF module consists of two key components: a bidirectional gated fusion module and an attention mechanism module. The former selectively filters and fuses RGB and point cloud information, and the latter prioritizes essential information while disregarding irrelevant details. Several A2BF modules can be embedded in the network to generate complementary texture and geometric information. Extensive experiments are conducted on the public LineMOD and Occlusion LineMOD datasets. Experimental results demonstrate that the average accuracy of the proposed method reaches 99.8\% and 67.6\% on the two datasets respectively, outperforming state-of-the-art methods.
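The bidirectional gating idea can be sketched as two sigmoid gates, each computed from one modality and applied to the other, before concatenation. This is a simplified illustration under assumed tensor shapes; the actual A2BF module also includes the attention component described in the abstract.

```python
import torch
import torch.nn as nn

class BiGatedFusion(nn.Module):
    """Bidirectional gated fusion of per-point RGB and geometric features:
    each modality is modulated by a sigmoid gate computed from the other.
    A simplified sketch, not the paper's exact A2BF design."""
    def __init__(self, dim):
        super().__init__()
        self.gate_rgb = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.gate_geo = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, f_rgb, f_geo):
        f_rgb_out = f_rgb * self.gate_rgb(f_geo)   # geometry gates appearance
        f_geo_out = f_geo * self.gate_geo(f_rgb)   # appearance gates geometry
        return torch.cat([f_rgb_out, f_geo_out], dim=-1)

fusion = BiGatedFusion(dim=64)
f_rgb = torch.rand(1, 1024, 64)   # (batch, points, channels); hypothetical sizes
f_geo = torch.rand(1, 1024, 64)
print(fusion(f_rgb, f_geo).shape)  # torch.Size([1, 1024, 128])
```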
Time-Frequency Transformer: A Novel Time Frequency Joint Learning Method for Speech Emotion Recognition
ABSTRACT. In this paper, we propose a novel time-frequency joint learning method for speech emotion recognition, called the Time-Frequency Transformer. Its advantage is that it can excavate global emotion patterns in the time-frequency domain of the speech signal while modeling the local emotional correlations in the time domain and frequency domain respectively. For this purpose, we first design a Time Transformer and a Frequency Transformer to capture the local emotion patterns between frames and inside frequency bands respectively, ensuring the integrity of emotion information modeling in both domains. Then, a Time-Frequency Transformer is proposed to mine the time-frequency emotional correlations from the local time-domain and frequency-domain emotion features, learning a more discriminative global speech emotion representation. The whole process is a time-frequency joint learning process implemented by a series of Transformer models. Experiments on the IEMOCAP and CASIA databases indicate that our proposed method outperforms state-of-the-art methods.
MOOCs Dropout Prediction via Classmates Augmented Time-Flow Hybrid Network
ABSTRACT. Massive Open Online Courses (MOOCs) provide learners with a platform for free learning. However, MOOCs have been criticized for high dropout rates in recent years. To predict users' potential dropout risk in advance, a novel framework named Classmates Augmented Time-Flow Hybrid Network (CA-TFHN) is proposed in this paper. TFHN, which absorbs the advantages of LSTM and the self-attention mechanism, is designed to generate users' activity features from their learning records. At the same time, an effective correlation calculation is defined based on users' potential interests in courses via link prediction, bringing in classmate relationships. The influence among classmates, modeled by a reconstructed user graph, is employed to augment users' activity features, resulting in accurate dropout prediction. Experiments on the XuetangX dataset demonstrate the effectiveness of CA-TFHN in predicting MOOC dropout. The code of CA-TFHN is available at https://github.com/codeds27/CA-TFHN.
ABSTRACT. Graph structure learning (GSL), which aims to optimize the graph structure and learn suitable parameters of graph neural networks (GNNs) simultaneously, has shown great potential in boosting the performance of GNNs. As a branch of GSL, multi-view methods mainly learn an optimal graph structure (final view) from multiple information sources (basic views). However, the structural information of basic views is insufficient, and existing methods ignore the fact that different views can complement each other. Moreover, existing methods obtain the final view through simple combination and fail to constrain noise, which inevitably brings in irrelevant information. To tackle these problems, we propose a Gated Bi-View GSL architecture, named GBV-GSL, which interacts two basic views through a selection gating mechanism, so as to "turn off" noise as well as supplement insufficient structures. Specifically, two basic views that focus on different knowledge are extracted from the original graph as the two inputs of the model. Furthermore, we propose a novel view interaction technique based on the selection gating mechanism to remove redundant structural information and supplement insufficient topology while retaining the focused knowledge of each view. Finally, we design a view attention fusion mechanism to adaptively fuse the two interacted views into the final view. In numerical experiments involving both clean and attacked conditions, GBV-GSL shows significant improvements in the effectiveness and robustness of structure learning and node representation learning. Code is available at https://github.com/Simba9257/GBV-GSL.
Efficient Lightweight Network with Transformer-based Distillation for Micro-crack Detection of Solar Cells
ABSTRACT. Micro-cracks on solar cells often affect power generation efficiency, so this paper proposes a lightweight network for the cell-image micro-crack detection task. Firstly, a Feature Selection framework is proposed, which efficiently and adaptively decides the number of layers of the feature extraction network and clips unnecessary feature generation. In addition, based on the design of the Transformer layer, Transformer Distillation is proposed; its Transformer Refine module excavates distillation information along the two dimensions of features and relations. Using a combination of Feature Selection and Transformer Distillation, lightweight networks based on ResNet and ViT achieve much better results than the original networks, with classification accuracy rates of 88.58% and 89.35% respectively.
MTLAN: Multi-Task Learning and Auxiliary Network for Enhanced Sentence Embedding
ABSTRACT. The objective of cross-lingual sentence embedding learning is to map sentences into a shared representation space, where semantically similar sentence representations are closer together, while distinct sentence representations exhibit clear differentiation. This paper proposes a novel sentence embedding model called MTLAN, which incorporates multi-task learning and auxiliary networks. The model utilizes the LaBSE model for extracting sentence features and undergoes joint training on tasks related to sentence semantic representation and distance measurement. Furthermore, an auxiliary network is employed to enhance the contextual expression of words within sentences. To address the issue of limited resources for low-resource languages, we construct a pseudo-corpus dataset using a multilingual dictionary for unsupervised learning. We conduct experiments on multiple publicly available datasets, including STS and SICK, to evaluate both monolingual sentence similarity and cross-lingual semantic similarity. The empirical results demonstrate the significant superiority of our proposed model over state-of-the-art methods.
ABSTRACT. Human Activity Recognition (HAR) using data from Inertial Measurement Unit (IMU) sensors has practical applications in healthcare and assisted living environments. However, its use in real-world scenarios has been limited by the lack of comprehensive IMU-based HAR datasets covering various activities. Zero-shot HAR (ZS-HAR) can overcome these data limitations. However, most existing IMU-based ZS-HAR methods rely on attributes or word embeddings of class labels as auxiliary data to relate the seen and unseen classes. This approach requires expert knowledge and lacks motion-specific information. In contrast, videos depicting various human activities provide valuable information for ZS-HAR based on inertial sensor data, and they are readily available. Our proposed model, TEZARNet: TEmporal Zero-shot Activity Recognition Network, uses videos as auxiliary data and employs a bidirectional long short-term memory IMU encoder to exploit temporal information, distinguishing it from current work. The proposed model improves on the state-of-the-art accuracy by 4.7%, 7.8%, 3.7%, and 9.3% for the benchmark datasets PAMAP2, DaLiAc, UTD-MHAD, and MHEALTH, respectively.
ABSTRACT. Spatial Transcriptomics (ST) quantitatively interprets human diseases by providing the gene expression of each fine-grained spot (i.e., window) in a tissue slide. This paper focuses on predicting gene expression at windows on a tissue slide image of interest. However, gene expression related to image features usually exhibits diverse spatial scales. To spatially model these features, we introduce the Hierarchical Sparse Attention Network (HSATNet). The core idea of HSATNet is to apply two levels of sparse attention, coarse (i.e., area) and fine (i.e., window). Each HSAT block consists of two main modules: i) adaptive sparse coarse attention, which filters out the most irrelevant areas to acquire adaptive sparse areas, and ii) adaptive sparse fine attention, which filters out the most irrelevant windows to acquire adaptive sparse windows. The first module captures the commonality of different windows within the same area, and the second captures the differences between different windows. After fusing these two modules, without any additional training data or pre-training, experiments conducted on 10X Genomics breast cancer data show that HSATNet achieves an impressive PCC@S of 7.43 for gene expression prediction, exceeding the current state-of-the-art model. Code is available at https://github.com/biyecc/HSATNet.
Interpreting Decision Process in Offline Reinforcement Learning for Interactive Recommendation Systems
ABSTRACT. Recommendation systems, which predict relevant and appealing items for users on web platforms, often rely on static user interests, resulting in limited interactivity and adaptability. Reinforcement Learning (RL), while providing a dynamic and adaptive approach, brings its unique challenges in this context. Interpreting the behavior of an RL agent within recommendation systems is complex due to factors such as the vast and continuously evolving state and action spaces, non-stationary user preferences, and implicit, delayed rewards often associated with long-term user satisfaction.
Addressing the inherent complexities of applying RL in recommendation systems, we propose a framework that includes innovative metrics and a synthetic environment. The metrics aim to assess the real-time adaptability of an RL agent to dynamic user preferences. We apply this framework to LastFM datasets to interpret metric outcomes and test hypotheses regarding MDP setups and algorithm choices by adjusting dataset parameters within the synthetic environment. This approach illustrates potential applications of our framework, while highlighting the necessity for further research in this area.
Reconstructing Challenging Hand Posture from Multi-modal Input
ABSTRACT. 3D hand reconstruction is critical for immersive VR/AR, action understanding and human healthcare. Existing solutions have concentrated on recovering hand pose and shape using parametric models or learning techniques, without considering actual skin or texture details. In this study, we introduce the first challenging hand dataset, CHANDS, which is composed of precise articulated 3D geometry corresponding to previously unseen challenging gestures performed by real hands. Specifically, we construct a multi-view camera setup to acquire multi-view images for initial 3D reconstructions and use a hand tracker to separately capture the skeleton. Then, we present a robust method for reconstructing an articulated geometry and matching the skeleton to the geometry using a template. In addition, we build a hand pose model from CHANDS that covers a wider range of poses and is particularly helpful for difficult poses.
Identify Vulnerability Types: A Cross-Project Multiclass Vulnerability Classification System based on Deep Domain Adaptation
ABSTRACT. Software Vulnerability Detection (SVD) is an important means of ensuring system security due to the ubiquity of software. Deep learning-based approaches achieve state-of-the-art performance in SVD, but one of the most crucial issues is coping with the scarcity of labeled data in the projects to be detected. One reliable solution is to employ transfer learning to leverage labeled data from other software projects. However, existing cross-project approaches focus only on detecting whether function code is vulnerable or not. The ability to identify vulnerability types is essential because it offers the information needed to patch the vulnerabilities. Our aim in this paper is to propose the first system for cross-project multiclass vulnerability classification. We detect at the granularity of code snippets, which is finer-grained compared to functions and effective for catching inter-procedural vulnerability patterns. After generating code snippets, we define several principles to extract snippet attentions and build a deep model to obtain deep fused features; we then extend different domain adaptation approaches to reduce the feature distribution gap between projects. Experimental results indicate that our system outperforms other state-of-the-art systems.
Two-Stage Graph Convolutional Networks for Relation Extraction
ABSTRACT. The purpose of relation extraction is to extract semantic relationships between entities in sentences, which can be seen as a classification task. In recent years, the use of graph neural networks to handle relation extraction has become increasingly popular. However, most existing graph-based methods have the following problems: 1) they cannot fully utilize dependency relation information; 2) there is no consistent criterion for pruning dependency trees. To address these issues, we propose a two-stage graph convolutional network for relation extraction. In the first stage, the node representations, dependency relation type representations and dependency type weights jointly generate new node representations, fully utilizing the dependency relation information. In the second stage, the graph convolution operation is performed with the adjacency matrix derived from the dependency tree, so the model completes the pruning operation automatically. We evaluate our proposed method on two public datasets, and the results show that our model outperforms previous studies in terms of F1 score and achieves the best performance. Further ablation experiments also confirm the effectiveness of each component of our proposed model.
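The second-stage operation mentioned above is a standard graph convolution over a dependency-tree adjacency matrix; the minimal sketch below shows one such propagation step. The toy dependency tree, feature sizes, and normalization choice are assumptions, and the paper's first-stage fusion of dependency-relation types is not reproduced.

```python
import numpy as np

def gcn_layer(node_feats, adjacency, weight):
    """One graph-convolution step: H' = ReLU(D^-1 (A + I) H W), a generic
    propagation rule over a dependency-tree adjacency matrix."""
    a_hat = adjacency + np.eye(adjacency.shape[0])     # add self-loops
    deg_inv = 1.0 / a_hat.sum(axis=1, keepdims=True)   # row-normalize
    return np.maximum(deg_inv * (a_hat @ node_feats @ weight), 0.0)

# toy sentence of 4 tokens with a small (symmetric) dependency tree
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 1],
                [0, 1, 0, 0],
                [0, 1, 0, 0]], dtype=float)
h = np.random.default_rng(0).standard_normal((4, 8))   # 8-d token features
w = np.random.default_rng(1).standard_normal((8, 8))
print(gcn_layer(h, adj, w).shape)  # (4, 8)
```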
Progressive Temporal Transformer for Bird's-Eye-View Camera Pose Estimation
ABSTRACT. Visual relocalization is a crucial technique used in visual odometry and SLAM to predict the 6-DoF camera pose of a query image. Existing works mainly focus on ground views in indoor or outdoor scenes, whereas camera relocalization on unmanned aerial vehicles has received less attention; frequent view changes and a large depth of view make it more challenging. In this work, we establish a Bird's-Eye-View (BEV) dataset for camera relocalization, a large dataset containing four distinct scenes (\emph{roof}, \emph{farmland}, \emph{bare ground}, and \emph{urban area}) with challenging characteristics such as frequent view changes, repetitive or weak textures and large depths of field. All images in the dataset are associated with a ground-truth camera pose. With 177,242 images, the BEV dataset is a challenging large-scale benchmark for camera relocalization. We also propose a Progressive Temporal transFormer (dubbed PTFormer) as the baseline model. PTFormer is a sequence-based transformer with a progressive temporal aggregation module for exploiting temporal correlation and a parallel absolute and relative prediction head for implicitly modeling the temporal constraint. Thorough experiments on both the BEV dataset and the widely used handheld datasets 7Scenes and Cambridge Landmarks demonstrate the robustness of our proposed method.
Task Scheduling with Multi-strategy Improved Sparrow Search Algorithm in Cloud Datacenters
ABSTRACT. Considering the task scheduling characteristics of the cloud computing environment, an improved sparrow search algorithm (MSSA) that takes into account task completion time, task completion cost and load balancing is proposed. First, initializing the population with PWLCM chaotic mapping enhances the dispersion of individuals. Then, the global search phase of the marine predator algorithm (MPA) is incorporated to enlarge the search space of the discoverers. The introduction of dynamic adjustment factors in the joiner part strengthens the search ability of the algorithm in the early stage and its convergence in the late stage. Finally, a greedy strategy is used to update the joiners' positions so that information from the optimal and worst solutions can guide the next generation of position updates. Simulations in CloudSim show that the improved algorithm yields shorter task completion times and a more balanced system load. Compared with the ACO, MPA, and SSA algorithms, MSSA improves the integrated fitness function values by 20%, 22%, and 17% respectively, confirming the feasibility of the improvement.
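To illustrate what an "integrated fitness" combining time, cost and load balance might look like, the sketch below uses a weighted sum with the standard deviation of VM loads as the balance term. The weights and the balance measure are assumptions for illustration only; the paper's exact fitness function is not given here.

```python
def fitness(makespan, cost, vm_loads, w_time=0.4, w_cost=0.3, w_load=0.3):
    """Hypothetical integrated fitness for a schedule: lower is better.
    Load balance is measured as the standard deviation of per-VM loads."""
    mean = sum(vm_loads) / len(vm_loads)
    load_imbalance = (sum((l - mean) ** 2 for l in vm_loads) / len(vm_loads)) ** 0.5
    return w_time * makespan + w_cost * cost + w_load * load_imbalance

# toy schedule evaluation
print(fitness(makespan=120.0, cost=45.0, vm_loads=[30, 42, 25, 23]))
```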
Traffic Signal Optimization at T-shaped intersections Based on Deep Q Networks
ABSTRACT. In this paper, traffic signal control strategies for T-shaped intersections in urban road networks based on deep Q network (DQN) algorithms are proposed. Different DQN variants and dynamic time aggregation are used for decision-making. The effectiveness of the various strategies under different traffic conditions is evaluated using the Simulation of Urban Mobility (SUMO) software. The simulation results show that the strategy combining the Dueling DQN method with dynamic time aggregation significantly improves vehicle throughput. Compared with DQN and fixed-time methods, this strategy can reduce the average travel time by up to 43\% in low-traffic periods and up to 15\% in high-traffic periods. This study demonstrates the significant advantages of applying Dueling DQN in traffic signal control strategies for urban road networks.
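For readers unfamiliar with the Dueling DQN idea used above, the sketch below shows the standard dueling head that splits the Q-function into state-value and advantage streams. The state encoding (queue lengths per lane) and layer sizes are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Dueling Q-network: shared trunk, then separate state-value and
    advantage streams combined as Q = V + A - mean(A). A generic sketch,
    not the paper's exact architecture or SUMO state representation."""
    def __init__(self, state_dim, num_actions, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)
        self.advantage = nn.Linear(hidden, num_actions)

    def forward(self, state):
        h = self.trunk(state)
        v = self.value(h)                        # (batch, 1)
        a = self.advantage(h)                    # (batch, num_actions)
        return v + a - a.mean(dim=1, keepdim=True)

# hypothetical traffic state: queue lengths on 6 lanes; 3 signal phases
q_net = DuelingQNet(state_dim=6, num_actions=3)
state = torch.rand(4, 6)
print(q_net(state).shape)  # torch.Size([4, 3])
```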
Enhanced State-Aware Traffic Light Optimization Control Method
ABSTRACT. In this paper, the Dueling Double Deep Recurrent Q Network with Attention Mechanism (3DRQN_AM) algorithm is proposed for traffic light control. The algorithm is based on deep Q networks, using dueling networks, double Q networks and target networks to improve its learning performance. A Long Short-Term Memory (LSTM) network is introduced to combine the historical and current states of the vehicle trajectory for optimal decision making. Meanwhile, an attention mechanism is added so that the neural network automatically focuses on the important state components, improving the state representation capability. The experimental results show that, compared with the Dueling Double Deep Q Network with Attention Mechanism (3DQN_AM), the Dueling Double Deep Recurrent Q Network (3DRQN) and the Dueling Double Deep Q Network (3DQN) signal control algorithms, the average waiting time under normal traffic flow is reduced by about 20.8%, 32.1%, and 36.7% respectively, and the average queue length is reduced by about 41.9%, 44.6%, and 76% respectively; under peak traffic flow, the average waiting time is reduced by about 46.2%, 53.3%, and 85.1%, and the average queue length is reduced by about 2.7%, 2.7%, and 21.3%, respectively.
Responsive CPG-Based Locomotion Control for Quadruped Robots
ABSTRACT. Quadruped robots with flexible movement are gradually replacing traditional mobile robots in many scenarios. To improve the motion stability and speed of quadruped robots, this paper presents a responsive gradient CPG (RG-CPG) approach. Specifically, the method introduces a vestibular sensory feedback mechanism into the gradient CPG (central pattern generator) model and uses a differential evolution algorithm to optimize the vestibular sensory feedback parameters. Simulation results show that the movement stability and linear movement velocity of a quadruped robot controlled by RG-CPG are effectively improved, and that the robot can cope with complex terrains. Prototype experiments demonstrate that RG-CPG works on real quadruped robots.
High-Resolution Self-Attention with Fair Loss for Point Cloud Segmentation
ABSTRACT. Applying deep learning techniques to analyze point cloud data acquired by various three-dimensional (3D) sensors has emerged as a prominent research direction. However, the challenges posed by insufficient spatial and feature information integration in the original point cloud and unbalanced classes in real-world datasets have hindered the advancement of research in this domain. In light of the success achieved by self-attention mechanisms in natural language processing and two-dimensional (2D) vision tasks, we put forward the High-Resolution Self-Attention (HRSA) module as a plug-and-play solution for facilitating point cloud segmentation. More precisely, the proposed HRSA module is designed to preserve high-resolution internal representations in both spatial and feature dimensions. HRSA ensures that each branch retains complete spatial and feature information while efficiently compressing the other dimension. Additionally, by affecting the gradient of dominant and weak classes, we introduce the Fair Loss to address the problem of unbalanced class distribution on a real-world dataset to improve the network's inference capabilities. The introduced modules are seamlessly integrated into an MLP-based architecture tailored for large-scale point cloud processing, resulting in a new segmentation network called PointHR. PointHR achieves impressive performance with mIoU scores of 69.8% and 74.5% on S3DIS Area-5 and 6-fold cross-validation, respectively. With a significantly smaller number of parameters, these performances are remarkably close to the state-of-the-art methods, making PointHR highly competitive in point cloud semantic segmentation.
Minimizing Distortion in Linguistic Steganography via Adaptive Language Model Tuning
ABSTRACT. Linguistic steganography, a technique that hides secret information within normal text, possesses tremendous potential in various applications such as protecting user privacy. However, previous research in linguistic steganography has primarily focused on adjusting the probability distribution of steganographic text (stegotext) to minimize the difference with text generated by language models, thereby achieving indistinguishability between the two. Nonetheless, the significant gap between real text and generated text has often been overlooked. To address this issue, this paper proposes an innovative method: using an adaptive model tuning strategy, the generated stegotext becomes statistically closer to real text. We leverage a well-trained classifier in conjunction with a fundamental generative language model to produce stegotext that aligns closely with the distribution of real text. Consequently, we gain better control over the distortion between the stegotext and real text, while effectively embedding secret information. Compared to traditional methods, our approach reduces Kullback-Leibler divergence and steganography detection rates, demonstrating its enhanced effectiveness.
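Since the method above is evaluated partly through Kullback-Leibler divergence, the short sketch below shows a generic KL computation between two discrete distributions, such as token-frequency vectors of real text and stegotext. The counts are toy values; the paper measures distortion over language-model distributions rather than raw counts.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) between two discrete distributions given as count or
    probability vectors over a shared vocabulary (illustrative only)."""
    p = np.asarray(p, dtype=np.float64) + eps
    q = np.asarray(q, dtype=np.float64) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

# toy token-frequency vectors over a small shared vocabulary
real_counts = [120, 80, 40, 10]
stego_counts = [100, 90, 50, 12]
print(kl_divergence(real_counts, stego_counts))
```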
Nonlinear NN-Based Perturbation Estimator Designs for Disturbed Unmanned Systems
ABSTRACT. This paper deals with the perturbation estimation problem for a classical unmanned system subject to perturbations composed of internal system uncertainties and external disturbances. In order to approximate the unmeasurable perturbations accurately, a new nonlinear radial basis function neural network (RBFNN)-based estimator is developed to reconstruct the structure of the perturbations. It is established through Lyapunov stability analysis that asymptotic estimation can be achieved with the RBFNN-based estimator designs. The efficacy of the developed perturbation estimation method is substantiated by simulations of an unmanned marine system and a quadrotor system.
FLDNet: A Foreground-Aware Network for Polyp Segmentation Leveraging Long-Distance Dependencies
ABSTRACT. Given the close association between colorectal cancer and polyps, the diagnosis and identification of colorectal polyps play a critical role in the detection and surgical intervention of colorectal cancer. In this context, the automatic detection and segmentation of polyps from colonoscopy images has emerged as a significant problem that has attracted broad attention. Current polyp segmentation techniques face several challenges: firstly, polyps vary in size, texture, color, and pattern; secondly, the boundaries between polyps and mucosa are usually blurred. Moreover, existing studies have focused on learning the local features of polyps while ignoring long-range feature dependencies and the combination of local and global contextual information. To address these challenges, we propose FLDNet (Foreground-Long-Distance Network), a Transformer-based neural network that captures long-distance dependencies for accurate polyp segmentation. Specifically, the proposed model consists of three main modules: a pyramid-based Transformer encoder, a local context module, and a foreground-aware module. Multilevel features with long-distance dependency information are first captured by the pyramid-based Transformer encoder. On the high-level features, the local context module obtains the local characteristics related to the polyps by constructing different local context information. The coarse map obtained by decoding the reconstructed highest-level features guides the feature fusion process in the foreground-aware module to achieve foreground enhancement of the polyps. FLDNet was evaluated with seven metrics on common datasets and demonstrated superiority over state-of-the-art methods on widely used evaluation measures.
MelMAE-VC: Extending Masked Autoencoders to Voice Conversion
ABSTRACT. Voice conversion is a technique that generates speech with the same text content as a source speech and the same timbre as a reference speech. This paper proposes MelMAE-VC, a neural network for non-parallel many-to-many voice conversion that utilizes pre-trained Masked Autoencoders (MAEs) for representation learning. Our network consists mainly of transformer layers and no recurrent units, aiming to achieve better scalability and parallel computing capability. In the pre-training phase, we follow a scheme similar to image-based MAE, concealing a portion of the input spectrogram and setting up a vanilla autoencoding task: the encoder yields a latent representation from the visible subset of the full spectrogram, and the decoder reconstructs the full spectrogram from the representation of only the visible patches. To achieve voice conversion, we adopt the pre-trained encoder to extract preliminary features, then use a speaker embedder to control the timbre of the synthesized spectrograms. The style transfer decoder can be either a simple autoencoder or a conditional variational autoencoder (CVAE) that mixes timbre and text information from different utterances. The training objective of the voice conversion model is a hybrid loss function that combines reconstruction loss, style loss and stochastic similarity. Results show that our model speeds up and simplifies the training process, and has better modularity and scalability while achieving performance similar to other models.
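The MAE-style pre-training step described above can be illustrated with a minimal patch-masking sketch over a mel-spectrogram: the input is split into non-overlapping patches and only a random subset is kept visible for the encoder. Patch size, mask ratio, and spectrogram shape are illustrative assumptions; the actual MelMAE-VC patching and encoder are not shown.

```python
import numpy as np

def random_patch_mask(spec, patch=(16, 16), mask_ratio=0.75, seed=0):
    """Split a (freq, time) spectrogram into non-overlapping patches and keep
    a random subset, as in MAE-style pre-training (simplified sketch)."""
    f, t = spec.shape
    pf, pt = patch
    nf, nt = f // pf, t // pt
    patches = spec[:nf * pf, :nt * pt].reshape(nf, pf, nt, pt).transpose(0, 2, 1, 3)
    patches = patches.reshape(nf * nt, pf, pt)          # (num_patches, pf, pt)
    rng = np.random.default_rng(seed)
    num_keep = int(round(len(patches) * (1 - mask_ratio)))
    keep_idx = rng.permutation(len(patches))[:num_keep]
    return patches[keep_idx], keep_idx                  # visible patches only

mel = np.random.rand(80, 256)   # hypothetical 80-bin mel-spectrogram
visible, idx = random_patch_mask(mel)
print(visible.shape, len(idx))  # (20, 16, 16) 20
```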
RPF3D: Range-Pillar Feature Deep Fusion 3D Detector for Autonomous Driving
ABSTRACT. In this paper, we present RPF3D, an innovative single-stage framework that explores the complementary nature of point clouds and range images for 3D object detection. Our method addresses the sampling region imbalance issue inherent in fixed-dilation-rate convolutional layers, allowing for a more accurate representation of the input data. To enhance the model's adaptability, we introduce several attention layers that accommodate a wide range of dilation rates necessary for processing range image scenes. To tackle the challenges of feature fusion and alignment, we propose the AttentiveFusion module and the Range Image Guided Deep Fusion (RIGDF) backbone architecture in the Range-Pillar Feature Fusion section, which effectively addresses the one-pillar-to-multiple-pixels feature alignment problem caused by the point cloud encoding strategy. These innovative components work together to provide a more robust and accurate fusion of features for improved 3D object detection. We validate the effectiveness of our RPF3D framework through extensive experiments on the KITTI and Waymo Open Datasets. The results demonstrate the superior performance of our approach compared to existing methods, particularly in the Car class detection where a significant enhancement is achieved on both datasets. This showcases the practical applicability and potential impact of our proposed framework in real-world scenarios and emphasizes its relevance in the domain of 3D object detection.
P-IoU: Accurate Motion Prediction based Data Association for Multi-Object Tracking
ABSTRACT. Multi-object tracking in complex scenarios remains a challenging task due to objects' irregular motions and indistinguishable appearances. Traditional methods often approximate the motion direction of objects solely based on their bounding box information, leading to cumulative noise and incorrect association. Furthermore, the lack of depth information in these methods can result in failed discrimination between foreground and background objects due to the perspective projection of the camera. To address these limitations, we propose a Pose Intersection over Union (P-IoU) method to predict the true motion direction of objects by incorporating body pose information, specifically the motion of the human torso. Based on P-IoU, we propose PoseTracker, a novel approach that combines bounding box IoU and P-IoU effectively during association to improve tracking performance. Exploiting the relative stability of the human torso and the confidence of keypoints, our method effectively captures the genuine motion cues, reducing identity switches caused by irregular movements. Experiments on the DanceTrack and MOT17 datasets demonstrate that the proposed PoseTracker outperforms existing methods. Our method highlights the importance of accurate motion prediction of objects for data association in MOT and provides a new perspective for addressing the challenges posed by irregular object motion.
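As an illustration of this association scheme, the following minimal Python sketch (not the authors' implementation; the COCO torso keypoint indices, blending weight, and gating threshold are assumptions) blends bounding-box IoU with a torso-derived pose IoU into one cost matrix and solves the assignment with the Hungarian algorithm.

import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    # a, b: [x1, y1, x2, y2]
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def torso_box(kpts, conf, thr=0.3):
    # kpts: (17, 2) COCO keypoints, conf: (17,) confidences; shoulders and hips
    # (indices 5, 6, 11, 12) approximate the torso, assumed to be the most stable body part.
    idx = np.array([5, 6, 11, 12])
    pts = kpts[idx][conf[idx] > thr]
    if len(pts) < 2:
        return None
    (x1, y1), (x2, y2) = pts.min(axis=0), pts.max(axis=0)
    return [x1, y1, x2, y2]

def associate(tracks, detections, lam=0.5, max_cost=0.8):
    # tracks / detections: lists of dicts with 'box', 'kpts', 'kconf'
    cost = np.ones((len(tracks), len(detections)))
    for i, t in enumerate(tracks):
        for j, d in enumerate(detections):
            sim = iou(t['box'], d['box'])
            tb, db = torso_box(t['kpts'], t['kconf']), torso_box(d['kpts'], d['kconf'])
            if tb is not None and db is not None:
                sim = (1 - lam) * sim + lam * iou(tb, db)   # blend box IoU with pose IoU
            cost[i, j] = 1.0 - sim
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_cost]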
RF-Based Drone Detection and Identification with Deep Neural Network: Review and Case Study
ABSTRACT. Drones have been widely used in many application scenarios such as logistics and on-demand instant delivery, surveillance, traffic monitoring, firefighting, photography, and recreation. On the other hand, a growing number of cases of drone misuse and malicious utilization are being reported on a local and global scale. Thus, it is essential to employ security measures to reduce these risks. Drone detection is a crucial initial step in several tasks such as identifying, locating, tracking, and intercepting malicious drones. This paper reviews related work on drone detection and classification using deep neural networks. Moreover, it presents a case study comparing the impact of using the magnitude and phase spectra as input to the classifier. The findings indicate that prediction performance is better when the magnitude spectrum is used. However, the phase spectrum can be more resilient to errors due to signal attenuation and changes in the surrounding conditions.
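The magnitude-versus-phase comparison can be illustrated with a minimal sketch of the two candidate input representations (hypothetical window and hop sizes, not the paper's configuration):

import numpy as np

def spectra(signal, win=1024, hop=512):
    # Frame the RF recording, apply a Hann window, and take the FFT of each frame.
    frames = np.lib.stride_tricks.sliding_window_view(signal, win)[::hop]
    window = np.hanning(win)
    spec = np.fft.rfft(frames * window, axis=-1)
    magnitude = 20 * np.log10(np.abs(spec) + 1e-12)   # log-magnitude spectrogram
    phase = np.angle(spec)                            # phase spectrogram in radians
    return magnitude, phase

rf = np.random.randn(100_000)          # stand-in for a recorded RF segment
mag, ph = spectra(rf)
# Either representation can be stacked into an image-like tensor and fed to a CNN
# classifier; the case study compares the two as alternative inputs.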
ABSTRACT. Various active learning methods with ingenious sampling strategies have been proposed to address the lack of labeled samples in supervised learning, but most are designed for specific tasks. In this paper, we propose a simple but task-agnostic active sampling method. We introduce a "multi-view clustering module" to extract multiple feature maps at different levels for unsupervised clustering. According to the clustering distribution, we calculate consistency, representativeness, and stability to guide sampling and training. Our method does not depend on a specific network and can be constructed as a two-stage sampling module to supplement existing sampling algorithms. Experiments on image classification and object detection tasks show that our method can further enhance the effect of active learning on top of the baseline methods.
Graph Pointer Network and Reinforcement Learning for Thinnest Path Problem
ABSTRACT. The complexity and NP-hard nature of combinatorial optimization problems (COPs) make finding optimal solutions with traditional methods challenging. Recently, deep learning-based approaches have shown promise in solving COPs. The Pointer Network (PN) has become a popular choice due to its ability to handle variable-length sequences and generate variable-sized outputs. The Graph Pointer Network (GPN), which incorporates graph embedding layers into the PN, is well-suited for problems with graph structures. Additionally, Reinforcement Learning (RL) has great potential for enhancing scalability when solving large-scale instances. In this paper, we focus on the Thinnest Path Problem (TPP). We propose an approach that uses RL to train a GPN with constraints (GPN-c) to solve the TPP. Our approach outperforms traditional solutions by providing faster and more efficient solving strategies. Specifically, we achieve significant improvements in solution quality, runtime, and scalability, successfully extending our approach to instances with up to 500 nodes. Furthermore, RL and GPN can provide more flexible and adaptive solving strategies, making them highly applicable to real-world scenarios.
A DNN-based Learning Framework for Continuous Movements Segmentation
ABSTRACT. This study presents a novel experimental paradigm for collecting Electromyography (EMG) data from continuous movement sequences and a Deep Neural Network (DNN) learning framework for segmenting movements from these signals. Unlike prior research focusing on individual movements, this approach characterizes human motion as continuous sequences. The DNN framework comprises a segmentation module for time-point-level labeling of EMG data and a transfer module predicting movement transition time points; these outputs are integrated based on defined rules. Experimental results reveal an impressive capacity to accurately segment movements, as evidenced by segmentation metrics (accuracy: 88.3%; Dice coefficient: 82.9%; mIoU: 72.7%). This approach to time-point-level analysis of continuous movement sequences via EMG signals offers promising implications for future studies of human motor functions and the advancement of human-machine interaction systems.
User Multi-Preferences Fusion for Conversational Recommender Systems
ABSTRACT. Conversational recommender systems (CRS) aim to provide recommendations by inferring user preferences during conversations. Many current CRS models utilize third-party information, such as reviews, to supplement the extraction of user preferences. Consequently, users develop preferences for third-party information alongside their own preferences extracted from the original dialog data. However, the prevailing approach of combining these preferences as a unified whole for self-attention at the element level compromises their independence. In real-life decision-making, we refer to third-party information, and it is important to distinguish whether a reference comes from a third party or from the original dialog data. This paper emphasizes the independence of users' own preferences and third-party information. To effectively integrate multiple user preferences, we propose an Attentive Wide&Deep Conversational Recommender (AWDCore). Specifically, we design an attentive wide linear module and an attentive deep neural network to capture the low-order linear and high-order nonlinear relationships between the user's own preference and third-party information, respectively. To highlight the significance of the user's current preference, we incorporate attention mechanisms and a SENet layer in the wide module and deep neural networks, respectively. The learned user preferences are then employed for recommendation and dialogue generation. Extensive experiments have demonstrated the effectiveness of our approach in both recommendation and conversation tasks.
Data Protection and Privacy: Risks and Solutions in The Contentious Era of AI-driven Ad Tech
ABSTRACT. Internet Service Providers (ISPs) increasingly incorporate Artificial Intelligence (AI) algorithms and Machine Learning techniques to achieve commercial objectives of extensive data harvesting and manipulation to drive customer traffic, decrease costs, and exploit the virtual public sphere through innovative Advertisement Technology (Ad-Tech). The increasing incorporation of Generative AI to aggregate and classify collected data and generate persuasive content tailored to the behavioral patterns of users calls into question their information security and informational self-determination. Significant risks arise with inappropriate information processing and analysis of big data, including personal and special data protected by influential data protection and privacy regulation worldwide. This paper bridges the interdisciplinary gap between developers of AI applications and their socio-legal impact on democratic societies. Accordingly, the work asks how AI Behavioural Marketing poses inadequately addressed data and privacy protection risks, followed by an approach to mitigate them. The approach is developed through doctrinal research of regulatory frameworks and court decisions to establish the current legislative landscape for AI-driven Ad-Tech. The work exposes the pertinent risks triggered by algorithms by analyzing the discourse of academics, developers, and social groups. It argues that understanding the risks associated with Information Processing and Invasion is seminal to developing appropriate industry solutions through a cumulative layered approach. This work is timely in addressing these contentious issues using conventional and non-conventional approaches and aspires to promote pragmatic collaboration between developers and policymakers to address risks throughout the AI Value Chain to safeguard individuals' data protection and privacy rights.
CAS-NN: a robust cascade neural network without compromising clean accuracy
ABSTRACT. Adversarial training has emerged as a prominent approach for training robust classifiers. However, recent research indicates that adversarial training inevitably results in a decline in a classifier's accuracy on clean (natural) data. Robustness is at odds with clean accuracy due to the inherent tension between the objectives of adversarial robustness and standard generalization, and training a single classifier that combines high adversarial robustness and high clean accuracy appears to be an insurmountable challenge. This paper proposes a straightforward strategy to bridge the gap between robustness and clean accuracy. Inspired by the idea underlying dynamic neural networks, i.e., adaptive inference, we propose a robust cascade framework that integrates a standard classifier and a robust classifier. The cascade neural network dynamically classifies clean and adversarial samples using distinct classifiers based on the confidence score of each input sample. As deep neural networks suffer from a serious overconfidence problem on adversarial samples, we propose an effective confidence calibration algorithm for the standard classifier, enabling accurate confidence scores for adversarial samples. The robust classifier within the cascade framework is trained independently and can be combined with any state-of-the-art adversarial training algorithm. The experiments demonstrate that the proposed cascade neural network increases clean accuracy by 10.1%, 14.67%, and 9.11% compared to advanced adversarial training (HAT) on CIFAR10, CIFAR100, and Tiny ImageNet, respectively, while maintaining similar robust accuracy.
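A confidence-gated cascade of this kind can be sketched as follows (a minimal PyTorch illustration, not the authors' implementation; the threshold tau is a placeholder and the confidence calibration step is omitted):

import torch
import torch.nn.functional as F

@torch.no_grad()
def cascade_predict(x, standard_model, robust_model, tau=0.9):
    # First stage: the standard (clean-accurate) classifier scores every input.
    logits_std = standard_model(x)
    conf, pred = F.softmax(logits_std, dim=1).max(dim=1)
    # Low confidence is treated as a sign of a (possibly adversarial) hard input
    # and those samples are routed to the adversarially trained classifier.
    use_robust = conf < tau
    if use_robust.any():
        logits_rob = robust_model(x[use_robust])
        pred[use_robust] = logits_rob.argmax(dim=1)
    return pred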
Optimal Task Grouping Approach in Multitask Learning
ABSTRACT. Multi-task learning has become a powerful solution in which multiple tasks are trained together to leverage the knowledge learned from one task to improve the performance of the others. However, tasks are not always constructive for each other in the multi-task formulation and may interact negatively during training, leading to poor results. Thus, this study focuses on finding the optimal group of tasks that should be trained together for multi-task learning in an automotive context. We propose a multi-task learning approach to model multiple long-term vehicle behaviors using low-resolution data and utilize gradient descent to efficiently discover the optimal group of tasks/vehicle behaviors that can increase the performance of the predictive models in a single training process. We also quantify the contribution of individual tasks within their groups and to the other groups' performance. The experimental evaluation on data collected from thousands of heavy-duty trucks shows that the proposed approach is promising.
T Cell Receptor Protein Sequences and Sparse Coding: A Novel Approach to Cancer Classification
ABSTRACT. Cancer is a complex disease characterized by uncontrolled cell growth and proliferation, which can lead to the development of tumors and metastases. Identifying the cancer type is crucial for selecting the most appropriate treatment strategy and improving patient outcomes. T cell receptors (TCRs) are essential proteins for the adaptive immune system, and their specific recognition of antigens plays a crucial role in the immune response against diseases, including cancer. The diversity and specificity of TCRs make them ideal for targeting cancer cells, and recent advances in sequencing technologies have enabled the comprehensive profiling of TCR repertoires. This has led to the discovery of TCRs with potent anti-cancer activity and the development of TCR-based immunotherapies. To analyze these complex biomolecules effectively, it is essential to represent them in a way that captures their structural and functional information. In this study, we investigate the use of sparse coding for the multi-class classification of TCR protein sequences with cancer categories as target labels. Sparse coding is a popular technique in machine learning that represents data with a set of informative features; it can capture complex relationships between amino acids and identify subtle patterns in the sequence that might be missed by low-dimensional methods. We first compute the $k$-mers from the TCR sequences and then apply sparse coding to capture the essential features of the data. To improve the predictive performance of the final embeddings, we integrate domain knowledge regarding different cancer properties such as Human leukocyte antigen (HLA) types, gene mutations, clinical characteristics, immunological features, and epigenetic modifications. We then train different machine learning (linear and non-linear) classifiers on the embeddings of the TCR sequences for supervised analysis. On a benchmark dataset of TCR sequences, our proposed embedding method significantly outperforms the baselines in terms of predictive performance, achieving an accuracy of 99.8\%. Our study highlights the potential of sparse coding for the analysis of TCR protein sequences in cancer research and other related fields.
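A minimal sketch of the k-mer plus sparse-coding pipeline, using scikit-learn dictionary learning on toy sequences (k, the dictionary size, and the example sequences are assumptions, not the paper's setup):

from itertools import product
import numpy as np
from sklearn.decomposition import DictionaryLearning

AA = "ACDEFGHIKLMNPQRSTVWY"          # 20 standard amino acids
K = 3                                 # k-mer length (an assumption)
KMER_INDEX = {"".join(p): i for i, p in enumerate(product(AA, repeat=K))}

def kmer_vector(seq):
    # Count every overlapping k-mer in the sequence.
    v = np.zeros(len(KMER_INDEX))
    for i in range(len(seq) - K + 1):
        idx = KMER_INDEX.get(seq[i:i + K])
        if idx is not None:
            v[idx] += 1
    return v

# Toy CDR3-like sequences standing in for a TCR repertoire dataset.
sequences = ["CASSLGQGAEAFF", "CASSPTGGELFF", "CASSLAGGYNEQFF", "CASRDRTGNGYTF"]
X = np.stack([kmer_vector(s) for s in sequences])

# Learn a small dictionary and represent each sequence by its sparse code;
# the sparse codes then act as embeddings for a downstream classifier.
coder = DictionaryLearning(n_components=8, alpha=1.0, max_iter=20, random_state=0)
embeddings = coder.fit_transform(X)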
GACE: Learning Graph-Based Cross-Page Ads Embedding For Click-Through Rate Prediction
ABSTRACT. Predicting click-through rate (CTR) is the core task of many online ad recommendation systems, as it helps improve user experience and increase platform revenue. In this type of recommendation system, we often encounter two main problems: the joint usage of multi-page historical advertising data and the cold start of new ads. In this paper, we propose GACE, a graph-based cross-page ads embedding generation method. It can warm up and generate the representation embeddings of cold-start and existing ads across various pages. Specifically, we carefully build linkages and a weighted undirected graph model considering semantic and page-type attributes to guide the direction of feature fusion and generation. We design a variational auto-encoding task as a pre-training module and generate embedding representations for new and old ads based on this task. Evaluation results on the public AliEC dataset and a real-world industry dataset from Alipay show that our GACE method is significantly superior to the SOTA methods. In an online A/B test, the click-through rate on three real-world pages from Alipay increased by 3.6%, 2.13%, and 3.02%, respectively. In the cold-start task in particular, the CTR increased by 9.96%, 7.51%, and 8.97%, respectively.
Can You Really Reason: A Novel Framework for Assessing Natural Language Reasoning Datasets and Models
ABSTRACT. Recent research has revealed that numerous natural language understanding (NLU) and reasoning datasets contain statistical cues that sophisticated models can exploit, leading to an overestimation of these models' capabilities. However, no existing work has precisely identified these cues. In this paper, we propose a lightweight framework that automatically detects potential biases in any multiple-choice NLU-related dataset and evaluates the robustness of models designed for these datasets. Furthermore, this framework has the potential to filter biased training data, enabling the training of models with improved reasoning capabilities. By addressing the issue of dataset biases, our framework contributes to the development of more robust and accurate NLU models.
A Meta Learning-based Training Algorithm for Robust Dialogue Generation
ABSTRACT. There are many low-resource scenarios in the field of dialogue generation, such as medical diagnosis. Dialogue generation models in these scenarios are usually unstable due to the lack of training corpora. As one of the most popular training algorithms in recent years, meta learning has achieved remarkable results. MAML, a meta learning method, can find a fast-adapting initialization for a series of low-resource tasks; it is often used to address low-resource problems and has achieved excellent performance in image classification. However, in text generation tasks such as dialogue generation, the large vocabulary, long sequences, and large number of parameters involved make the effect of MAML unstable. Therefore, this paper proposes a highly robust meta learning-based training framework for the dialogue generation task. By identifying the significant information in the model parameters, the optimizer can concentrate training on the important parameters within the limited data during bi-level optimization. Experiments show that our method achieves good BLEU scores on six single-domain datasets.
Extraction of One Time Point Dynamic Group Features via Tucker Decomposition of Multi-Subject FMRI Data: Application to Schizophrenia
ABSTRACT. Group temporal and spatial features of multi-subject fMRI data are essential for studying mental disorders, especially those exhibiting dynamic properties of brain function. Taking advantage of a low-rank Tucker model in effectively extracting temporally and spatially shared features of multi-subject fMRI data, we propose to extract dynamic group features via Tucker decomposition for identifying patients with schizophrenia (SZs) from healthy controls (HCs). We segment multi-subject fMRI data using a sliding-window technique with different window lengths and a step size of one time point, and analyze the amplitude of low-frequency fluctuations and voxel features for the shared time courses and shared spatial maps obtained by Tucker decomposition of the segmented data. Results of two-sample t-tests show that HCs have higher amplitudes of low-frequency fluctuations within 0.01~0.08 Hz than SZs for window lengths of 40s~160s, and significant HC-SZ activation differences exist in regions such as the inferior parietal lobule and the left auditory region within the 40s window, providing new evidence for analyzing schizophrenia.
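A minimal sketch of the sliding-window Tucker analysis using the tensorly library (toy data shapes, ranks, and window length are placeholders, not the study's settings):

import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker

subjects, timepoints, voxels = 10, 150, 5000
data = np.random.randn(subjects, timepoints, voxels)   # stand-in for fMRI data

window, step = 40, 1                                   # step size of one time point
for start in range(0, timepoints - window + 1, step):
    segment = tl.tensor(data[:, start:start + window, :])
    # core: subject-time-space interactions; factors[1] holds shared time courses,
    # factors[2] holds shared spatial maps for this window.
    core, factors = tucker(segment, rank=[5, 10, 20])
    shared_time_courses, shared_spatial_maps = factors[1], factors[2]
    break  # one window shown; in practice every window is analyzed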
Detecting Depression and Alcoholism Disorders by EEG Signal
ABSTRACT. The World Health Organization estimates that more than 264 and 80 million patients worldwide suffer from depression and alcoholism, respectively. Depression and alcoholism can cause severe negative repercussions on a patient's life and relationships, such as self-harm and suicide. A person can lead a normal life if these brain disorders are diagnosed and treated in a timely and accurate manner. Electroencephalography (EEG) is often employed to observe the brain's activity and identify different mental disorders. In our study, the EEG signals are separated into rhythms in the empirical wavelet transform domain, and then linear and nonlinear features are extracted. Significant features are selected by a feature selection method, and the output of the feature selection method is fed into a classifier. In this paper, a fast and effective diagnostic tool is proposed to detect and recognize depression and alcoholism disorders. The proposed diagnostic tool is built on the Salp Swarm Algorithm and the Tree Growth Algorithm as feature selection methods and Cascade Forward Neural Network and Feed-forward Neural Network classifiers. The diagnostic tool is evaluated on two datasets for depression and alcoholism, and the results show classification accuracies of 100\% and 99.58\% for depression and alcoholism, respectively, using a 10-fold cross-validation strategy. The proposed diagnostic tool can be used in hospitals and clinics for fast and accurate detection of depression and alcoholism.
Three-Dimensional Medical Image Fusion with Deformable Cross-Attention
ABSTRACT. Multimodal medical image fusion plays an instrumental role in several areas of medical image processing, particularly in disease recognition and tumor detection. Traditional fusion methods tend to process each modality independently before combining the features and reconstructing the fusion image. However, this approach often neglects the fundamental commonalities and disparities between multimodal information. Furthermore, the prevailing methodologies are largely confined to fusing two-dimensional (2D) medical image slices, leading to a lack of contextual supervision in the fusion images and subsequently, a decreased information yield for physicians relative to three-dimensional (3D) images. In this study, we introduce an innovative unsupervised feature mutual learning fusion network designed to rectify these limitations. Our approach incorporates a Deformable Cross Feature Blend (DCFB) module that facilitates the dual modalities in discerning their respective similarities and differences. We have applied our model to the fusion of 3D MRI and PET images obtained from 660 patients in the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset. Through the application of the DCFB module, our network generates high-quality MRI-PET fusion images. Experimental results demonstrate that our method surpasses traditional 2D image fusion methods in performance metrics such as Peak Signal to Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM). Importantly, the capacity of our method to fuse 3D images enhances the information available to physicians and researchers, thus marking a significant step forward in the field. The code will soon be available online.
Handling Class Imbalance in Forecasting Parkinson's Disease Wearing-off with Fitness Tracker Dataset
ABSTRACT. Parkinson's disease (PD) patients experience the "wearing-off phenomenon", where their symptoms resurface before they can take the next dose of medication. As time passes, the duration of the medicine's efficacy shortens, leading to discomfort among PD patients. Therefore, patients and clinicians must meticulously observe and document symptom changes to administer appropriate treatment.
Forecasting the PD wearing-off phenomenon is challenging due to the class imbalance that results from the difficulty of documenting the phenomenon. This paper compares different approaches for handling class imbalance in forecasting the PD wearing-off phenomenon using a fitness tracker and smartwatch dataset (the wearing-off dataset): oversampling, undersampling, and a combination of the two. Previous studies reported the potential of commercially available fitness tracker datasets to predict and forecast wearing-off periods. However, high false positive and false negative rates have been observed for some participants with the developed models.
This paper uses and compares different approaches to handling class imbalance in the wearing-off dataset. First, changes were made during the data collection phase, as the nursing staff struggled with the data collection tool. Second, different oversampling and undersampling techniques were tried to improve the ratio of wearing-off labels to non-wearing-off instances. Finally, adjustments to forecast probabilities were applied due to the resampling in the second step.
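The resampling comparison can be sketched with imbalanced-learn as follows (the synthetic features, classifier, and the specific combined strategy shown are assumptions, not the study's exact pipeline):

import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTETomek
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

X = np.random.randn(1000, 8)                      # stand-ins for heart rate, steps, sleep, ...
y = (np.random.rand(1000) < 0.1).astype(int)      # roughly 10% wearing-off labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for name, sampler in [("oversample (SMOTE)", SMOTE(random_state=0)),
                      ("undersample", RandomUnderSampler(random_state=0)),
                      ("combined", SMOTETomek(random_state=0))]:
    # Resample only the training split, then fit and evaluate a classifier.
    X_rs, y_rs = sampler.fit_resample(X_tr, y_tr)
    clf = RandomForestClassifier(random_state=0).fit(X_rs, y_rs)
    print(name, f1_score(y_te, clf.predict(X_te)))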
BIN: A Biosignature Identification Network for Interpretable Liver Cancer Microvascular Invasion Prediction based on Multi-modal MRIs
ABSTRACT. Microvascular invasion (MVI) is a critical factor that affects the postoperative cure of hepatocellular carcinoma (HCC). Precise preoperative diagnosis of MVI by magnetic resonance imaging (MRI) is crucial for effective treatment of HCC. Compared with traditional methods, deep learning-based MVI diagnostic models have shown significant improvements. However, the black-box nature of deep learning models poses a challenge to their acceptance in medical fields that demand interpretability. To address this issue, this paper proposes an interpretable deep learning model, called Biosignature Identification Network (BIN), based on multi-modal MRI images for the liver cancer MVI prediction task. Inspired by the way species are distinguished in biology through their biosignatures, the proposed BIN method classifies patients into MVI absence (i.e., Non-MVI or negative) and MVI presence (i.e., positive) by utilizing Non-MVI and MVI biosignatures. The adoption of a transparent decision-making process in BIN ensures interpretability, while the biosignatures in the model overcome the limitations associated with manual feature extraction. Moreover, a multi-modal MRI-based BIN method is also explored to further enhance diagnostic performance and to offer interpretability for multi-modal MRI fusion. Through extensive experiments on a real dataset, we find that BIN maintains deep-model-level performance while providing effective interpretability. Overall, the proposed model offers a promising solution to the challenge of interpreting deep learning-based MVI diagnostic models.
KSHFS: Research on Drug-Drug Interaction Prediction Based on Knowledge Subgraph and High-order Feature-aware Structure
ABSTRACT. Effective drug-drug interaction (DDI) prediction can prevent adverse reactions and side effects caused by taking multiple drugs at the same time. However, most methods that obtain drug information from large-scale biomedical knowledge graphs (KGs) ignore the problems of high noise and complexity and have limitations in obtaining rich neighborhood information for each entity in the KG. Therefore, this paper proposes an end-to-end method called Knowledge Subgraph and High-order Feature-aware Structure (KSHFS) for DDI prediction. In KSHFS, we first design a subgraph extraction module to reduce the noise caused by the KG, remove irrelevant information, and effectively utilize the entity information in external knowledge graphs to assist DDI prediction. Then, a high-order feature-aware module is designed to aggregate entity information propagated from high-order neighbors, learn high-order structural embeddings for each entity, and effectively capture potential semantic neighborhood features of drug pairs. Finally, for binary DDI prediction, a self-attention mechanism is used for feature fusion to predict drug interaction events. The experimental results demonstrate that the proposed KSHFS model outperforms the baseline models in both binary and multi-relation DDI prediction on various evaluation metrics, including AUC, AUPR, and F1.
ASGNet: Adaptive Semantic Gate Networks for Log-Based Anomaly Diagnosis
ABSTRACT. Logs are widely used in the development and maintenance of software systems. Logs can help engineers understand the runtime behavior of systems and diagnose system failures. For anomaly diagnosis, existing methods generally use log event data extracted from historical logs to build diagnostic models. However, we find that existing methods do not make full use of two types of features: (1) statistical features: inherent statistical features in log data, such as word frequency and abnormal label distribution, are not well exploited; compared with raw log data, statistical features are deterministic and naturally compatible with the corresponding tasks; (2) semantic features: logs contain the execution logic behind software systems, so log statements share deep semantic relationships. How to effectively combine statistical and semantic features in log data to improve the performance of log anomaly diagnosis is the key question of this paper. We propose an Adaptive Semantic Gate Network (ASGNet) that combines statistical and semantic features and selectively uses statistical features to consolidate the semantic representation of log text. Specifically, ASGNet encodes statistical features via a variational encoding module and fuses useful information through a well-designed adaptive semantic threshold mechanism. The threshold mechanism introduces the information flow into the classifier based on the confidence of the semantic features in the decision, which is conducive to training a robust classifier and alleviates the overfitting problem caused by the use of statistical features. Experimental results on real datasets show that our proposed method is superior to all baseline methods in terms of various performance indicators.
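A minimal PyTorch sketch of a statistics-conditioned gate of this kind (dimensions and the fusion rule are assumptions, not the exact ASGNet design):

import torch
import torch.nn as nn

class SemanticGate(nn.Module):
    def __init__(self, sem_dim=256, stat_dim=16):
        super().__init__()
        self.stat_proj = nn.Linear(stat_dim, sem_dim)
        self.gate = nn.Sequential(nn.Linear(sem_dim + stat_dim, sem_dim), nn.Sigmoid())

    def forward(self, sem, stat):
        # sem: (batch, sem_dim) semantic encoding of a log sequence
        # stat: (batch, stat_dim) statistical features (word frequency, label stats, ...)
        g = self.gate(torch.cat([sem, stat], dim=-1))      # per-dimension confidence gate
        return sem + g * self.stat_proj(stat)              # gated injection of statistics

fused = SemanticGate()(torch.randn(4, 256), torch.randn(4, 16))  # fed to the classifier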
Can We Transfer Noise Patterns? A Multi-environment Spectrum Analysis Model Using Generated Cases
ABSTRACT. Spectrum analysis systems in online water quality testing are designed to detect the types and concentrations of pollutants and enable regulatory agencies to respond promptly to pollution incidents. However, spectral data-based testing devices suffer from complex noise patterns when deployed in non-laboratory environments. To make the analysis model applicable to more environments, we propose a noise-pattern transfer model, which takes the spectra of standard water samples in different environments as cases and learns the differences in their noise patterns, thus enabling noise patterns to be transferred to unknown samples. Unfortunately, the inevitable sample-level baseline noise prevents the model from obtaining paired data that differ only in dataset-level environmental noise. To address this problem, we generate a sample-to-sample case base to exclude the interference of sample-level noise on dataset-level noise learning, enhancing the system's learning performance. Experiments on spectral data with different background noises demonstrate the good noise-transferring ability of the proposed method against baseline systems ranging from wavelet denoising and deep neural networks to generative models. From this research, we posit that our method can enhance the performance of DL models by generating high-quality cases. The source code is made publicly available online at https://github.com/Magnomic/CNST.
Exploring Non-Isometric Alignment Inference for Representation Learning of Irregular Sequences
ABSTRACT. The development of Internet of Things (IoT) technology has led to increasingly diverse and complex data collection methods. This unstable sampling environment has resulted in the generation of a large number of irregular monitoring data streams, posing significant challenges for related data analysis tasks. Current approaches mainly focus on the uncertainty of data representations caused by the local non-isometricity in sequences. Previous works often design specific embedding structures tailored to specific tasks to make models adapt to the non-isometric nature of sequences and mitigate these negative effects. However, we have observed that irregular sequence sampling densities are uneven, containing randomly occurring dense and sparse intervals. This data imbalance tendency often leads to overfitting in the dense regions and underfitting in the sparse regions, ultimately impeding the representation performance of models. Conversely, the irregularity at the data level has limited impact on the deep semantics of sequences. Based on this observation, we propose a novel Non-isometric Alignment Inference Architecture (NAIA), which utilizes a multi-level semantic continuous representation structure based on inter-interval segmentation to learn representations of irregular sequences. This architecture efficiently extracts the latent features of irregular sequences. We evaluate the performance of NAIA on multiple datasets for downstream tasks and compare it with recent benchmark methods, demonstrating NAIA's state-of-the-art performance.
CPSSDS-R: Data stream semi-supervised classification algorithm based on conformal prediction
ABSTRACT. In this article, we consider the problem of semi-supervised data stream classification. Its main challenges include the fast arrival of samples, limited labeled data, and handling concept drift. Existing algorithms that detect concept drift constantly reinitialize the classifier, which is time-consuming and wasteful of space resources. Therefore, we propose CPSSDS-R, a semi-supervised classification algorithm for concept-drifting data streams based on model reuse. First, the labeled sample set in each data block is used to initialize the classification model. Second, when concept drift is detected during data iteration, the current model and the corresponding conformal prediction outputs of unlabeled samples are added to a classifier pool, and a new model is constructed. Then, component classifiers in the pool whose conformal prediction outputs are similar to those of the current data block are identified, and recurring concepts are detected with a distribution-based method. Finally, the classifier is updated and the model is incrementally updated according to the concept drift detection results. The algorithm is evaluated on multiple synthetic and real datasets, and its cumulative accuracy and per-block accuracy at different labeling ratios demonstrate its effectiveness in handling concept drift.
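For reference, a minimal sketch of the inductive conformal prediction step that produces per-class p-values for unlabeled samples (the base classifier and nonconformity score are generic choices, not necessarily those of CPSSDS-R):

import numpy as np
from sklearn.naive_bayes import GaussianNB

def conformal_p_values(clf, X_calib, y_calib, X_new):
    # Nonconformity score: 1 - predicted probability assigned to the (candidate) label.
    calib_scores = 1.0 - clf.predict_proba(X_calib)[np.arange(len(y_calib)), y_calib]
    new_probs = clf.predict_proba(X_new)
    p = np.zeros((len(X_new), len(clf.classes_)))
    for k in range(len(clf.classes_)):
        scores_k = 1.0 - new_probs[:, k]
        # p-value: fraction of calibration scores at least as nonconforming as the new score
        p[:, k] = ((calib_scores[None, :] >= scores_k[:, None]).sum(axis=1) + 1) / (len(calib_scores) + 1)
    return p

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 5)), rng.integers(0, 2, 200)
X_calib, y_calib = rng.normal(size=(100, 5)), rng.integers(0, 2, 100)
clf = GaussianNB().fit(X_train, y_train)
p_values = conformal_p_values(clf, X_calib, y_calib, rng.normal(size=(10, 5)))
# High p-values mark unlabeled samples whose pseudo-labels can be trusted for self-training.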
ABSTRACT. Multilingual modeling has gained increasing attention in recent years, as cross-lingual Text-based Visual Question Answering (TextVQA) requires understanding questions and answers across different languages. Current research mainly works on multimodal information, assuming that multilingual pretrained models are effective for encoding questions. However, the semantic comprehension of a text-based question varies between languages, creating challenges in directly deducing its answer from an image. To this end, we propose a novel multilingual text-based VQA framework suited for cross-language scenarios (CLVQA), transductively considering multiple answer-generating interactions with questions. First, a question reading module densely connects encoding layers in a feedforward manner, which can adaptively work together with answering. Second, a multimodal OCR-based module decouples OCR features in an image into visual, linguistic, and holistic parts to facilitate the localization of a target-language answer. By incorporating enhancements from the above two input encoding modules, the proposed framework outputs its answer candidates mainly from the input image with an object detection module. Finally, a transductive answering module jointly understands the input multimodal information and the identified answer candidates at the multilingual level, autoregressively generating cross-lingual answers. Extensive experiments show that our framework outperforms state-of-the-art methods for both cross-lingual (English<->Chinese) and mono-lingual (English<->English and Chinese<->Chinese) tasks in terms of accuracy-based metrics. Moreover, significant improvements are achieved in zero-shot cross-lingual settings (French<->Chinese).
ABSTRACT. When collecting answers from crowds, if there are many instances, each worker can only provide answers to a small subset of the instances, and the instance-worker answer matrix is thus sparse. Solutions for improving the quality of crowd answers, such as answer aggregation, are usually proposed in an unsupervised fashion. In this paper, to enhance the quality of crowd answers used for inferring true answers, we propose a self-supervised solution to effectively learn the potential information in sparse crowd answers. We propose a method named \textsc{CrowdLR}, which first learns rich instance and worker representations from the crowd answers based on two types of self-supervised signals. We create a multi-task model with a Siamese structure to learn two classification tasks for the two self-supervised signals in one framework. We then utilize the learned representations to fill in the missing answers, and apply answer aggregation methods to the completed answers. Experimental results based on real datasets show that our approach can effectively learn representations from crowd answers and improve the performance of answer aggregation, especially when the crowd answers are sparse.
Exploring the Capability of ChatGPT for Cross-Linguistic Agricultural Document Classification: Investigation and Evaluation
ABSTRACT. In the sustainable smart agriculture era, a vast amount of agricultural knowledge is available on the internet, making it necessary to explore effective document classification techniques for enhanced accessibility and efficiency. Over the past few years, fine-tuning strategies based on pre-trained language models (PLMs) have gained popularity as mainstream deep learning approaches, showcasing impressive performance. However, these approaches face several challenges, including the limited availability of training data, poor domain transferability, lack of model interpretability, and the difficulty of deploying large models. Inspired by ChatGPT's significant success, we investigate its capability and utilization in the field of agricultural information processing. We explore various attempts to maximize ChatGPT's potential, including prompt construction strategies, ChatGPT question-answering (Q&A) inference, and an intermediate answer alignment technique. Our preliminary comparative study demonstrates that ChatGPT effectively addresses research challenges and bottlenecks, positioning it as an ideal solution for agricultural document classification. These findings encourage the development of a general-purpose agricultural document processing paradigm. Our preliminary study also indicates the trend towards achieving Artificial General Intelligence (AGI) for sustainable smart agriculture in the future. Code is available on GitHub: https://github.com/albert-jin/agricultural_textual_classification_ChatGPT.
Leveraging Sound Local and Global Features for Language-Queried Target Sound Extraction
ABSTRACT. Language-queried target sound extraction is a fundamental audio-language task that aims to estimate the audio signal of the target sound event class by a natural language expression in a sound mixture. One of the key challenges of this task is leveraging the language expression to highlight the target sound features in the noisy mixture interpretably. In this paper, we leverage language expression to guide the model to extract the most informative features of the target sound event by adaptively using local and global features, and we present a novel language-aware synergic attention network (LASA-Net) for language-queried target sound extraction, as the first attempt to leverage local and global operations using language representation to extract target sound in single or multiple sound source environments. In particular, language-aware synergic attention consists of a local operation submodule, a global operation submodule, and an interaction submodule, in which local and global operation submodules extract sound local and global features while the interaction submodule adaptively selects the most discriminative features with the guidance of linguistic features. In addition, we introduce a linguistic-acoustic fusion module that leverages the well-proven correlation modeling power of self-attention for excavating helpful multi-modal contexts. Extensive experiments demonstrate that our proposed LASA-Net is able to achieve state-of-the-art performance while maintaining an attractive computational complexity.
CM-TCN: Channel-aware Multi-scale Temporal Convolutional Networks For Speech Emotion Recognition
ABSTRACT. Speech emotion recognition (SER) plays a crucial role in understanding user intent and improving human-computer interaction (HCI). Currently, the most widely used and effective methods are based on deep learning, and temporal information has become increasingly important in SER. Although advanced deep learning components such as convolutional neural networks (CNN) and attention modules can achieve good results, they often ignore the temporal information in speech, which can lead to insufficient representations and low classification accuracy. To make full use of temporal features, we propose Channel-aware Multi-scale Temporal Convolutional Networks (CM-TCN). First, a channel-aware temporal convolutional network (CATCN) is used as the basic structure to extract multi-scale temporal features combined with channel information. Then, global feature attention (GFA) captures the global information at different time scales and enhances the important information. Finally, we use an adaptive fusion module (AFM) to establish the overall dependency across different network layers and fuse features. We conduct extensive experiments on six corpora, and the results demonstrate the superior performance of CM-TCN.
Self-Supervised Multimodal Representation Learning for Product Identification and Retrieval
ABSTRACT. Determining object similarity remains a persistent challenge in the field of data science. In the context of e-commerce retail, the identification of substitutable and similar products relies on similarity measures. Leveraging multimodal learning derived from real-world experience, humans are capable of recognizing similar products based solely on their titles, even in cases where significant literal differences exist. Motivated by this intuition, we propose a self-supervised mechanism that extracts strong prior knowledge from product images. This mechanism enhances the encoder's capacity for learning product representations in a multimodal framework, and the similarity between products is reflected by the distance between their respective representations. Additionally, we introduce a novel attention regularization to effectively direct attention towards product category-related signals. The proposed model exhibits wide applicability, as it can be effectively employed in unimodal tasks where only free-text inputs are available. To validate our approach, we evaluate our model on two key tasks: product similarity matching and retrieval. These evaluations are conducted on a real-world dataset consisting of thousands of diverse products. Experimental results demonstrate that multimodal learning significantly enhances language understanding capabilities within the e-commerce domain. Moreover, our approach outperforms strong unimodal baselines and recently proposed multimodal methods, further validating its superiority.
Time-warp-invariant Processing with Multi-spike Learning
ABSTRACT. Sensory signals are encoded and processed by neurons in the brain in the form of action potentials, also called spikes, that carry clue information across both spatial and temporal dimensions. Learning such clue information can be challenging, especially in the case of a long-delayed reward. This temporal credit assignment problem has been addressed by the concept of aggregate-label learning, which motivates the development of a family of multi-spike learning algorithms with demonstrated, remarkable learning performance. However, most current spike-based learning methods are developed without considering input temporal fluctuations, which constitute a common source of variability in sensory signals such as speech. Therefore, robust spike-based learning under fluctuations of both compression and dilation remains intriguing to explore. In this paper, we first show the time-warp-invariant characteristic of a conductance-based neuron model, based on which we then develop a new multi-spike learning algorithm for time-warp-invariant processing. Experimental results on speech recognition highlight the outstanding robustness of our algorithm against temporal distortions compared with other relevant spike-based methods. Our study thus confirms the effectiveness of multi-spike learning for time-warp robustness, extending the scope of spike-based processing and learning.
ABSTRACT. Feature-based knowledge distillation utilizes features from superior and complex teacher networks as knowledge to help portable student networks improve their generalization capability. Recent feature distillation algorithms focus on various feature processing and transmission methods while ignoring the flexibility of feature selection, resulting in limited distillation effects for students. In this paper, we propose Dynamic Feature Distillation to increase the flexibility of feature distillation by dynamically managing feature transfer sites. Our method leverages Online Feature Estimation to monitor the learning status of the student network in the feature dimension. Adaptive Position Selection then dynamically updates valuable feature transmission locations for efficient feature transmission. Notably, our approach can be easily integrated as a strategy for feature management into other feature-based knowledge transfer methods to improve their performance. We conduct extensive experiments on the CIFAR-100 and Tiny-ImageNet datasets to validate the effectiveness of Dynamic Feature Distillation.
NMPose: Leveraging Normal Maps for 6D Pose Estimation
ABSTRACT. Estimating the 6 degrees-of-freedom (6DoF) pose of an object from a single image is an important task in computer vision. Many recent works have addressed it by establishing 2D-3D correspondences and then applying a variant of the PnP algorithm. However, it is extraordinarily difficult to establish accurate 2D-3D correspondences for 6D pose estimation. In this work, we consider 6D pose estimation as a follow-up task to normal estimation so that pose estimation can benefit from the advance of normal estimation. We propose a novel 6D object pose estimation method, in which normal maps rather than 2D-3D correspondences are leveraged as alternative intermediate representations. In this paper, we illustrate the advantages of using normal maps for 6D pose estimation and also demonstrate that the estimated normal maps can be easily embedded into common pose recovery methods. On LINEMOD and LINEMOD-O, our method easily surpasses the baseline method and outperforms or rivals the state-of-the-art correspondence-based methods on common metrics. Our code is made publicly available.
Dynamical Graph Echo State Networks with Snapshot Merging for Spreading Process Classification
ABSTRACT. Dissemination Process Classification (DPC) is a popular application of temporal graph classification. The aim of DPC is to classify different spreading patterns of information or pestilence within a community represented by discrete-time temporal graphs. Recently, a reservoir computing-based model named Dynamical Graph Echo State Network (DynGESN) has been proposed for processing temporal graphs with relatively high effectiveness and low computational cost. In this study, we propose a model that combines a new data augmentation strategy, called snapshot merging, with DynGESN for DPC tasks. In our model, the snapshot merging strategy forms new snapshots by merging neighboring snapshots over time, and multiple reservoir encoders then capture spatiotemporal features from the merged snapshots. Finally, logistic regression decodes the sum-pooled embeddings into classification results. Experimental results on six benchmark DPC datasets show that our proposed model achieves better classification performance than DynGESN and several kernel-based models.
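A minimal sketch of snapshot merging on a discrete-time temporal graph (the union-style merge rule and window size are assumptions):

import numpy as np

def merge_snapshots(adjacency_seq, window=2):
    # adjacency_seq: (T, N, N) binary adjacency matrices of T snapshots.
    # Each merged snapshot is the edge union (element-wise max) of the last `window` snapshots.
    T = adjacency_seq.shape[0]
    merged = []
    for t in range(T):
        lo = max(0, t - window + 1)
        merged.append(adjacency_seq[lo:t + 1].max(axis=0))
    return np.stack(merged)

A = (np.random.rand(20, 50, 50) < 0.05).astype(float)   # toy temporal graph
A_merged = merge_snapshots(A, window=3)                  # fed to the reservoir encoders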
Many Is Better than One: Multiple Covariation Learning for Latent Multiview Representation
ABSTRACT. As one of the most compelling methods in multiview representation learning (MRL), canonical correlation analysis (CCA) and its variants have been widely applied in many fields. Due to the intrinsic linearity of covariance matrices, CCA can hardly reveal nonlinear relationships among features, and over the past few decades many variants of CCA have been developed to discover such relationships. However, the complexity and variety of relationships between features in practical applications, and the difficulty of representing them with ordinary nonlinear relationships, limit the representative capacity of these methods. To overcome this problem, we propose a multiple covariation projection (MCP) method, which can model composite relations to learn informative and compact multiview representations. Moreover, a multiset extension of MCP, dubbed MMCP, is developed to handle more than two views simultaneously. Extensive experimental results on five multiview datasets illustrate the effectiveness of our methods in multiview tasks such as classification and clustering.
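For reference, the linear CCA baseline that MCP generalizes can be run in a few lines with scikit-learn (toy data; MCP itself replaces the plain covariance with composite multiple-covariation relations):

import numpy as np
from sklearn.cross_decomposition import CCA

view1 = np.random.randn(500, 30)                    # e.g., features from one modality
view2 = view1 @ np.random.randn(30, 20) + 0.1 * np.random.randn(500, 20)

cca = CCA(n_components=5)
z1, z2 = cca.fit_transform(view1, view2)            # maximally correlated latent projections
corr = [np.corrcoef(z1[:, k], z2[:, k])[0, 1] for k in range(5)]
# Because the projections are linear, purely nonlinear dependencies between the
# views are invisible to this baseline, which motivates MCP's composite relations.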
ABSTRACT. Tigrinya, a language predominantly spoken in Eritrea and the Tigray region of Ethiopia, is classified as a low-resource language when it comes to Natural Language Processing (NLP), and its documents are not widely accessible due to the lack of printed material. Although the language has a rich cultural heritage, its literature has not been exposed to large-scale automated digitization compared to other widely spoken languages. In this paper, we design an end-to-end CRNN (Convolutional Recurrent Neural Network) to recognize machine-printed Tigrinya text from document images. This will help Tigrinya documents become more accessible and also bridge the gap with languages rich in NLP resources. We include all 304 characters in Tigrinya, and the network is trained on a total of over a million text-line images constructed from different domains. The majority of the data was synthesized to augment the limited real data and help the model generalize better. We employ two external datasets (ADOCR and GLOCR) in addition to ours to train the network. Furthermore, to improve the performance of the model, extensive parameter tuning was conducted. Without the use of post-processing techniques, the model achieves a 2.32% Character Error Rate (CER). The learning curve shows that, given more data, the model can further improve the CER. We finally obtain a lightweight model that achieves results comparable to the state of the art. This result implies that augmenting low-resource data with synthetic data can significantly reduce the error rate in text recognition, and that proper hyperparameter tuning can yield lightweight models without compromising much accuracy.
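A minimal PyTorch sketch of a CRNN text-line recognizer trained with CTC loss (layer sizes, image height, and the 305-class output including the CTC blank are placeholders, not the paper's exact configuration):

import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, num_classes=305, img_h=32):   # 304 characters + 1 CTC blank
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
        )
        feat_h = img_h // 4
        self.rnn = nn.LSTM(128 * feat_h, 256, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x):                      # x: (batch, 1, img_h, width)
        f = self.cnn(x)                        # (batch, 128, img_h/4, width/4)
        b, c, h, w = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)   # one feature vector per image column
        out, _ = self.rnn(f)
        return self.fc(out).log_softmax(-1)    # (batch, width/4, num_classes)

model, ctc = CRNN(), nn.CTCLoss(blank=0, zero_infinity=True)
images = torch.randn(4, 1, 32, 256)                     # toy text-line images
log_probs = model(images).permute(1, 0, 2)              # (T, batch, classes) for CTCLoss
targets = torch.randint(1, 305, (4, 20))                # toy character-index labels
loss = ctc(log_probs, targets,
           input_lengths=torch.full((4,), log_probs.size(0), dtype=torch.long),
           target_lengths=torch.full((4,), 20, dtype=torch.long))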
Correlated Online k-Nearest Neighbors Regressor Chain for Online Multi-Output Regression
ABSTRACT. Online multi-output regression is a crucial task in machine learning with applications in various domains such as environmental monitoring, energy efficiency prediction, and water quality prediction. This paper introduces CONNRC, a novel algorithm specifically designed to address the challenges of online multi-output regression and provide accurate real-time predictions. CONNRC builds upon the k-nearest neighbor algorithm in an online manner and incorporates a relevant chain structure to effectively capture and utilize correlations among structured multi-outputs. The main contribution of this work lies in the potential of CONNRC to enhance the accuracy and efficiency of real-time predictions across diverse application domains. In a comprehensive experimental evaluation on six real-world datasets, CONNRC is compared against five existing online regression algorithms. The results show that CONNRC consistently outperforms the other algorithms in terms of average Mean Absolute Error, demonstrating its superior accuracy in multi-output regression tasks. However, the time performance of CONNRC requires further improvement, indicating an area for future research and optimization.
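For reference, the batch analogue of this idea, a chain of k-nearest-neighbor regressors in which each output is predicted from the inputs plus previously predicted outputs, can be sketched with scikit-learn (CONNRC itself maintains the chain online):

import numpy as np
from sklearn.multioutput import RegressorChain
from sklearn.neighbors import KNeighborsRegressor

X = np.random.randn(300, 10)
W = np.random.randn(10, 3)
Y = X @ W + 0.1 * np.random.randn(300, 3)          # correlated multi-output targets

# Each regressor in the chain sees the original features plus earlier outputs,
# so correlations among the outputs are exploited during prediction.
chain = RegressorChain(KNeighborsRegressor(n_neighbors=5), order=[0, 1, 2])
chain.fit(X[:250], Y[:250])
Y_pred = chain.predict(X[250:])
mae = np.abs(Y_pred - Y[250:]).mean()              # average mean absolute error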
An improved target searching and imaging method for CSAR
ABSTRACT. Circular Synthetic Aperture Radar (CSAR) has attracted much attention in the field of high-resolution SAR imaging. To shorten the computation time and improve imaging quality, we propose a fast CSAR imaging strategy that searches for the target and automatically selects the area of interest for imaging. The first step is to find the target and select the imaging center and the imaging area of interest based on the target search algorithm; the second step is to divide the full-aperture data into sub-apertures according to the angle; the third step is to approximate the sub-apertures as linear arrays and image them separately; and the last step is to perform sub-image fusion to obtain the final CSAR image. This method greatly reduces the imaging time and obtains well-focused CSAR images. The proposed algorithm is verified by both simulation and real data collected with our mmWave imager prototype, which utilizes commercially available 77-GHz MIMO radar sensors. The experimental results verify the performance and superiority of our algorithm.
Syntax Tree Constrained Graph Network for Visual Question Answering
ABSTRACT. Visual Question Answering (VQA) aims to automatically answer natural language questions related to given image content. Existing VQA methods integrate vision modeling and language understanding to explore the deep semantics of the question. However, these methods ignore the significant syntax information of the question, which plays a vital role in understanding its essential semantics and guiding visual feature refinement. To fill this gap, we propose a novel Syntax Tree Constrained Graph Network (STCGN) for VQA based on entity message passing and syntax trees. This model is able to extract a syntax tree from questions and obtain more precise syntax information. Specifically, we parse questions and obtain the question syntax tree using the Stanford parsing tool. From the word level and phrase level, syntactic phrase features and question features are extracted using a hierarchical tree convolutional network. We then design a message-passing mechanism for phrase-aware visual entities and capture entity features according to a given visual context. Extensive experiments on the VQA2.0 dataset demonstrate the superiority of our proposed model.
Fast and Efficient Brain Extraction with Recursive MLP based 3D UNet
ABSTRACT. Extracting the brain from non-brain tissues is an essential step in neuroimage analyses such as brain volume estimation. Transformer and 3D UNet based methods achieve strong performance using attention and 3D convolutions, but they normally have complex architectures and are thus computationally slow. Consequently, they can hardly be deployed in computationally resource-constrained environments such as small neuroimage analysis clinics. To achieve rapid segmentation, the recent work UNeXt reduces convolution filters and presents Multilayer Perceptron (MLP) blocks that exploit simpler, linear MLP operations. To further boost performance, it shifts the feature channels in the MLP block so as to focus on learning local dependencies. However, it performs segmentation on 2D medical images rather than 3D volumes. In this paper, we propose a recursive MLP based 3D UNet to efficiently extract the brain from 3D head volumes. Our network combines 3D convolution blocks and MLP blocks to capture both long-range information and local dependencies, while leveraging the simplicity of MLPs to enhance computational efficiency. Unlike UNeXt, which extracts a single locality, we apply several shifts to capture multiple localities representing different local dependencies and introduce a recursive design to aggregate them. To save computational cost, the shifts do not introduce any parameters and the parameters are shared across recursions. Extensive experiments on two public datasets demonstrate the superiority of our approach over other state-of-the-art methods with respect to both accuracy and CPU inference time.
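A minimal PyTorch sketch of a shifted MLP block applied with several shift offsets and aggregated with shared parameters (shapes, shift sizes, and the token-axis shift are simplifying assumptions, not the exact model):

import torch
import torch.nn as nn

class RecursiveShiftMLP(nn.Module):
    def __init__(self, channels=64, shifts=(1, 2, 4)):
        super().__init__()
        self.shifts = shifts
        # One MLP shared across all shifts, so extra localities add no parameters.
        self.mlp = nn.Sequential(nn.Linear(channels, channels), nn.GELU(),
                                 nn.Linear(channels, channels))

    def forward(self, x):
        # x: (batch, tokens, channels), a flattened 3D feature volume.
        out = x
        for s in self.shifts:                       # recursive aggregation of localities
            shifted = torch.roll(out, shifts=s, dims=1)   # parameter-free shift
            out = out + self.mlp(shifted)
        return out

y = RecursiveShiftMLP()(torch.randn(2, 512, 64))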
Direct Inter-Intra View Association for Light Field Super-Resolution
ABSTRACT. Light field (LF) cameras record both the intensity and directions of light rays in a scene with a single exposure. However, due to the inevitable trade-off between the spatial and angular dimensions, the spatial resolution of LF images is limited, which makes LF super-resolution (LFSR) a research hotspot. The key to LFSR is complementation across views and the extraction of high-frequency information within each view. Due to the high dimensionality of LF data, previous methods usually model these two processes separately, which results in insufficient inter-view information fusion. In this paper, the LF Transformer is proposed for comprehensive perception of 4D LF data. The necessary inter-intra view correlations can be established directly inside each LF Transformer block, so it can handle the complex disparity variations of LF. Based on LF Transformers, 4DTNet is then designed, which comprehensively performs inter-intra view high-frequency information extraction. Extensive experiments on public datasets demonstrate that 4DTNet outperforms the current state-of-the-art methods both numerically and visually.
Dynamic Data Augmentation via Monte-Carlo Tree Search for Prostate MRI Segmentation
ABSTRACT. Medical image data are often limited due to the expensive acquisition and annotation process. Hence, training a deep-learning model with only raw data can easily lead to overfitting. One solution to this problem is to augment the raw data with various transformations, improving the model's ability to generalize to new data. However, manually configuring a generic augmentation combination and its parameters for different datasets is non-trivial due to inconsistent acquisition approaches and data distributions. Automatic data augmentation has therefore been proposed to learn favorable augmentation strategies for different datasets, but it incurs large GPU overhead. To this end, we present a novel method, called Dynamic Data Augmentation (DDAug), which is efficient and has negligible computation cost. DDAug develops a hierarchical tree structure to represent various augmentations and utilizes an efficient Monte-Carlo tree search algorithm to update, prune, and sample the tree. As a result, the augmentation pipeline can be optimized for each dataset automatically. Experiments on multiple prostate MRI datasets show that our method outperforms the current state-of-the-art data augmentation strategies.
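A toy sketch of the Monte-Carlo tree search loop over an augmentation tree may help fix ideas. The node structure, UCT constant, operation names, and reward below are hypothetical placeholders, not the DDAug definitions.

```python
import math

class Node:
    """A tree node holding one augmentation operation (names are hypothetical)."""
    def __init__(self, op, children=None):
        self.op = op
        self.children = children or []
        self.visits = 0
        self.value = 0.0          # running mean of the validation reward

def uct_select(node, c=1.4):
    """Pick the child with the highest UCT score (unvisited children first)."""
    def score(child):
        if child.visits == 0:
            return float("inf")
        return child.value + c * math.sqrt(math.log(node.visits + 1) / child.visits)
    return max(node.children, key=score)

def sample_pipeline(root):
    """Walk root-to-leaf, collecting one augmentation pipeline and the visited path."""
    node, path = root, [root]
    while node.children:
        node = uct_select(node)
        path.append(node)
    return [n.op for n in path[1:]], path

def backpropagate(path, reward):
    """Update visit counts and running mean rewards along the sampled path."""
    for node in path:
        node.visits += 1
        node.value += (reward - node.value) / node.visits

# Example: a tiny tree of spatial and intensity augmentations.
root = Node("root", [Node("flip", [Node("gamma"), Node("noise")]), Node("rotate")])
pipeline, path = sample_pipeline(root)
backpropagate(path, reward=0.72)   # the reward would come from validation Dice, for instance
```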
A Federated Multi-Stage Light-Weight Vision Transformer for Respiratory Disease Detection
ABSTRACT. Artificial Intelligence (AI)-based computer-aided diagnosis (CAD) has been widely applied to assist medical professionals in several medical applications.
Although there are many studies on respiratory disease detection from radiographic images using Deep Learning (DL) approaches, the limited availability of public datasets restricts their interpretation and generalization capacity. However, radiography images are available through different organizations in various countries. This setting is well suited to Federated Learning (FL), in which different institutes collaborate to train a global model on their private data. In FL, the local model on the client side is critical because there must be a balance between the model's accuracy, communication cost, and client-side memory usage. Current DL or Vision Transformer (ViT)-based models have large numbers of parameters, making client-side memory and communication costs a significant bottleneck when applied to FL training. The existing state-of-the-art (SOTA) FL techniques for respiratory disease detection either use small CNNs with insufficient accuracy or assume clients have sufficient processing capacity to train large models, which remains a significant challenge in practical applications. In this study, we address one question: is it possible to maintain high accuracy while lowering the number of model parameters, leading to lower memory requirements and communication costs? To address this problem, we propose a federated multi-stage light-weight ViT framework that combines the strengths of CNNs and ViTs to build an efficient FL framework. We conduct extensive experiments and show that the proposed framework outperforms a set of current SOTA models in FL training with higher accuracy while lowering communication costs and memory requirements. We adapt Grad-CAM for infection localization and compare the results with an experienced radiologist's findings. Upon acceptance, our code and the dataset used will be available on our GitHub account.
Learning Dense UV Completion for 3D Human Mesh Recovery
ABSTRACT. Human mesh reconstruction from a single image is a challenging task due to self-occlusion and occlusion by objects or other humans. Existing methods either fail to separate human features accurately or lack proper supervision for feature completion. In this paper, we propose Dense Inpainting Human Mesh Recovery (DIMR), a two-stage method that leverages dense correspondence maps to handle occlusion.
Our method utilizes a dense correspondence map to separate visible human features and completes human features in a structured UV space with an attention-based feature completion module. We also design a feature inpainting training procedure that guides the network to learn from unoccluded features. We evaluate our method on several datasets and demonstrate its superior performance in heavily occluded scenarios compared with other methods: extensive experiments show that it clearly outperforms prior SOTA methods on heavily occluded images, while achieving comparable results on standard benchmarks (3DPW) and on images without heavy occlusion.
Aspect-level sentiment analysis using dual probability graph convolutional networks (DP-GCN) integrating multi-scale information
ABSTRACT. Aspect-based sentiment analysis (ABSA) is a fine-grained, entity-level sentiment analysis task that aims to identify the emotions associated with specific aspects or details within text. ABSA has been widely applied to areas such as analyzing product reviews and monitoring public opinion on social media. In recent years, methods based on graph neural networks combined with syntactic information have achieved promising results on the ABSA task. However, existing methods using syntactic dependency trees contain redundant information, and relationships with identical weights do not reflect the importance of the dependencies between aspect words and opinion words. Moreover, ABSA is limited by issues such as short sentence length and informal expression. Therefore, this paper proposes a Dual Probability Graph Convolutional Network (DP-GCN) integrating multi-scale information to address these issues. Firstly, the original dependency tree is reshaped through pruning, creating an aspect-based syntactic dependency tree with corresponding syntactic dependency weights. Next, two probability attention matrices are constructed based on semantic and syntactic information, respectively. The semantic probability attention matrix represents a weighted directed graph of semantic correlations between words. Compared with the discrete adjacency matrix directly constructed from the syntactic dependency tree, the probability matrix representing the dependency relationships between words contains richer syntactic information. Based on this, semantic information and syntactic dependency information are separately extracted via graph convolutional networks. Interactive attention is used to guide mutual learning between semantic information and syntactic dependency information, enabling full interaction and fusion of both types of information before the final sentiment polarity classification. Our model was tested on four public datasets: Restaurant, Laptop, Twitter, and MAMS. The accuracy (ACC) and F1 score improved by 0.14% to 1.26% and 0.4% to 2.19%, respectively, indicating its outstanding performance.
Staged Long Text Generation with Progressive Task-Oriented Prompts
ABSTRACT. Generating coherent and consistent long text remains a challenge for artificial intelligence. The state-of-the-art paradigm partitions the whole generation process into successive stages; however, the content plan applied in each stage may be error-prone, and fine-tuning a large-scale language model for each stage is resource-consuming. In this paper, we follow the above paradigm and devise three stages: keyphrase decompression, transition paraphrase, and text generation. We leverage task-oriented prompts to direct the production of text in each stage, which improves the quality of the generated text. Further, we propose a new content plan representation with elastic mask tokens to reduce model bias and irregular words. Moreover, we introduce length control and commonsense knowledge prompts to increase the adaptability of the proposed model. Extensive experiments conducted on two challenging tasks demonstrate that our model outperforms strong baselines significantly and is able to generate longer, high-quality texts with fewer parameters.
Chinese Medical Intent Recognition Based on Multi-feature Fusion
ABSTRACT. The popularity of online query services heightens the need for methods that accurately understand the true intention behind a query. Currently, most medical query intention recognition methods are deep learning-based. However, due to the inadequate medical-domain corpus available in the pre-training phase, these methods fail to accurately extract text features built on medical domain knowledge, and they cannot fully capture the query intention because they rely on a single technique to extract textual information. To mitigate these issues, we propose a novel intent recognition model called EDCGA (ERNIE-Health+D-CNN+Bi-GRU+Attention) in this paper. EDCGA obtains text representations from the word vectors of the pre-trained ERNIE-Health model and employs a D-CNN to expand the receptive field for extracting local features. Furthermore, it combines a Bi-GRU with an attention mechanism to extract global information and enhance the understanding of the intent. Extensive experimental results on multiple datasets demonstrate that our proposed model exhibits superior recognition performance compared to the baselines.
ABSTRACT. The advent of non-autoregressive machine translation (NAT) has greatly improved the decoding speed over autoregressive machine translation (AT), while bringing about a performance decrease. Semi-autoregressive neural machine translation (SAT), as a compromise, enjoys the advantages of both autoregressive and non-autoregressive decoding. However, current SAT methods face the challenges of information-limited initialization and rigorous termination. This paper develops a layer-and-length-based syntactic labeling method and introduces a syntactic dependency parsing structure-guided two-stage semi-autoregressive translation (SDPSAT) model, which addresses the above challenges with syntax-based initialization and termination. Additionally, we present a Mixed Training strategy to shrink exposure bias. Experiments on six widely used datasets show that our SDPSAT model outperforms traditional SAT models with reduced word repetition and achieves results competitive with the AT baseline at a 2×–3× speedup.
Topic-aware Two-layer Context-enhanced Model for Chinese Discourse Parsing
ABSTRACT. In the past decade, Chinese Discourse Parsing has drawn much attention due to its fundamental role in document-level Natural Language Processing (NLP).
In this work, we propose a topic-aware two-layer context-enhanced model based on a transition system. Specifically, on the one hand, we adopt a two-layer context-enhanced Chinese discourse parser as a strong baseline, in which the Star-Transformer with a star topology is employed to enhance the EDU representation. On the other hand, we split the document into multiple sub-topics based on changes in the nuclearity of discourse relations and then implicitly incorporate topic boundary information via a joint learning framework.
Recurrent Update Representation based on Multi-Head Attention Mechanism for Joint Entity and Relation Extraction
ABSTRACT. Joint extraction of entities and relations from unstructured text is an important task in information extraction and knowledge graph construction. However, most existing work only considers the contextual information of the sentence and the information of the entities, paying little attention to the possible relations between the entities, which may lead to a failure to extract valid triplets. In this paper, we propose a recurrent representation-update method based on a multi-head attention mechanism for relation extraction. We use a multi-head attention mechanism to let the relation representation and the sentence context representation interact, and we fully integrate the feature information of both by cyclically updating the representations. The model performs relation extraction after the representations have been updated. With this approach we are able to leverage the relation information between entities for relational triple extraction. Experimental results on four public datasets show that our approach is effective and that the model outperforms all baseline models.
How Legal Knowledge Graph Can Help Predict Charges for Legal Text
ABSTRACT. Existing methods for predicting Easily Confused Charges (ECC) primarily rely on the factual descriptions of legal cases. However, these approaches overlook key information hidden in those descriptions, making it impossible to accurately differentiate between ECC. Legal domain knowledge graphs can represent personal information and criminal processes in cases, but they mainly focus on the entities in a case in isolation while ignoring the logical relationships between these entities, and different relationships often lead to distinct charges. To address these problems, this paper proposes a charge prediction model that integrates a Criminal Behavior Knowledge Graph (CBKG), called Charge Prediction Knowledge Graph (CP-KG). Firstly, we define a diverse range of legal entities and relationships based on the characteristics of ECC and conduct fine-grained annotation of the key elements and logical relationships in the factual descriptions. Subsequently, we match the descriptions against the CBKG to extract the key elements, which are then encoded by a Text Convolutional Neural Network (TextCNN). Additionally, we extract case subgraphs containing sequential behaviors from the CBKG based on the factual descriptions and encode them using a Graph Attention Network (GAT). Finally, we concatenate the representations of the key elements, case subgraphs, and factual descriptions, and use them jointly to predict the charges against the defendant. To evaluate CP-KG, we conduct experiments on two charge prediction datasets consisting of real legal cases. The experimental results demonstrate that CP-KG achieves Macro-F1 scores of 99.10% and 90.23% on the two datasets, respectively. Compared with the baseline methods, CP-KG shows significant improvements of 25.79% and 13.82%, respectively.
ABSTRACT. Legal Judgment Prediction (LJP) is a critical task that aims to predict charges, articles, and terms of penalties based on the fact descriptions provided in criminal cases. However, current LJP methods often fail to fully utilize the important aspect of legal event information, leading to suboptimal predictions. In order to address this issue, our proposed model introduces a legal event type attention mechanism, which effectively identifies key event information within the fact descriptions. By combining event-aware and event-free representations, our framework enables a comprehensive understanding of the fact descriptions, resulting in improved performance on LJP. Importantly, our approach outperforms state-of-the-art models, achieving an average improvement of 3.86% in the prediction of articles, 1.82% in the prediction of charges, and 5.24% in the prediction of terms of penalties.
STA-Net: Reconstruct Missing Temperature Data of Meteorological Stations Using a Spatiotemporal Attention Neural Network
ABSTRACT. Reconstructing missing temperature data from meteorological stations is of great significance for analyzing climate change and predicting related natural disasters, but it is a tricky and urgent problem. In the past, various interpolation methods were used to address it, but these methods largely ignored the temporal correlation of each station itself. Recently, machine-learning-based methods have been widely studied for this problem; however, they tend to handle missing values at a single station, neglecting the spatial correlation between stations. Hence, we put forward a new spatiotemporal attention neural network (STA-Net) for reconstructing missing data at multiple meteorological stations. The STA-Net adopts a state-of-the-art encoder-decoder deep learning architecture and is composed of two subnetworks, a local spatial attention mechanism (LSAM) and a multidimensional temporal self-attention mechanism (MTSAM). Moreover, a multi-station data processing method is developed to generate matrix datasets containing spatiotemporal information, so that the STA-Net can be trained and tested. To evaluate the STA-Net, extensive experiments on real Tibet and Qamdo datasets with missing rates of 25%, 50%, and 75% are conducted, and the results are compared with U-Net, PConvU-Net, and BiLSTM. Experimental results show that our data processing method is effective and that the STA-Net achieves better reconstruction performance. With a missing rate of 25% on the Tibet test set, compared with the other three methods, the MAE declines by 60.21%, 36.42%, and 12.70%; the RMSE declines by 56.28%, 32.03%, and 14.17%; and the R2 increases by 0.75%, 0.20%, and 0.07%, respectively.
CLF-AIAD: A Contrastive Learning Framework for Acoustic Industrial Anomaly Detection
ABSTRACT. Acoustic Industrial Anomaly Detection (AIAD) has received a great deal of attention as a technique to discover faults or malicious activity, allowing for preventive measures to be more effectively targeted. The essence of AIAD is to learn the compact distribution of normal acoustic data and detect outliers as anomalies during testing. However, recent AIAD work does not capture the dependencies and dynamics of Acoustic Industrial Data (AID). To address this issue, we propose a novel Contrastive Learning Framework (CLF) for AIAD, known as CLF-AIAD. Our method introduces a multi-grained contrastive learning-based framework to extract robust normal AID representations. Specifically, we first employ a projection layer and a novel context-based contrast method to learn robust temporal vectors. Building upon this, we then introduce a sample-wise contrasting-based module to capture local invariant characteristics, improving the discriminative capabilities of the model. Finally, a transformation classifier is introduced to bolster the performance of the primary task under a self-supervised learning framework. Extensive experiments on two typical industrial datasets, MIMII and ToyADMOS, demonstrate that our proposed CLF-AIAD effectively detects various real-world defects and improves upon the state-of-the-art in unsupervised industrial anomaly detection.
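As a rough illustration of the sample-wise contrastive module, the following PyTorch sketch shows a standard NT-Xent loss over two augmented views of each clip in a batch; it is a generic stand-in under assumed shapes, not CLF-AIAD's exact multi-grained formulation.

```python
import torch
import torch.nn.functional as F

def ntxent_loss(z1, z2, temperature=0.1):
    """z1, z2: (N, D) embeddings of two augmented views of the same N clips."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                       # (2N, D)
    sim = z @ z.t() / temperature                        # scaled cosine similarities
    n = z1.size(0)
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool, device=z.device), float("-inf"))
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)                 # the positive is the paired view

# Usage: embeddings of two spectrogram augmentations of the same normal clips.
loss = ntxent_loss(torch.randn(16, 128), torch.randn(16, 128))
```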
Multimodal Isotropic Neural Architecture with Patch Embedding
ABSTRACT. Patch embedding has been a significant advancement in Transformer-based models, particularly the Vision Transformer (ViT), as it enables handling larger image sizes and mitigating the quadratic runtime of self-attention layers in Transformers. Moreover, it allows for capturing global dependencies and relationships between patches, enhancing effective image understanding and analysis. However, it is important to acknowledge that Convolutional Neural Networks (CNNs) continue to excel in scenarios with limited data availability.
Furthermore, their efficiency in terms of memory usage and latency makes them particularly suitable for deployment on edge devices, underlining their practical significance.
Expanding upon this, we propose Minape, a novel multimodal isotropic convolutional neural architecture that incorporates patch embedding. Minape extends the application of patch embedding to both time series and image data for classification purposes.
By employing isotropic models, Minape addresses the challenges posed by the varying sizes and complexities of the data. It groups samples by modality type, creating two-dimensional representations that undergo linear embedding before being processed by a scalable isotropic convolutional network architecture. The outputs from these pathways are merged to capitalize on the complementary information between modalities, and a temporal classifier is then trained on the merged representations to distinguish between different classes.
Experimental results demonstrate that Minape significantly outperforms existing approaches in terms of accuracy while requiring fewer than 1M parameters and occupying less than 12 MB. This performance was observed on multimodal benchmark datasets and on our newly collected multi-dimensional multimodal dataset, Mudestreda, obtained from real industrial processing devices\footnote{Link to code and dataset: \url{https://anonymous.4open.science/r/Minape-ED25}}.
Generating Spatiotemporal Trajectories with GANs and Conditional GANs
ABSTRACT. Modeling the movements of individuals and populations and generating synthetic spatiotemporal trajectory data play an important role in many (privacy-aware) analyses and applications, such as urban planning and route navigation. A key challenge in trajectory generation is to best capture the basic characteristics of long sequences of location points, which is non-trivial considering the inherent sequentiality and high dimensionality of trajectory data. This paper presents TS-TrajGAN, a two-stage model that generates spatiotemporal trajectory data by combining a Generative Adversarial Network (GAN) and a conditional GAN. The GAN of stage I is trained to simulate the distribution of the initial trajectory segments, so that the basic characteristics of the length-limited initial segments can be well depicted. In stage II, the conditional GAN is used to predict the next location point of the currently generated trajectory and preserve the variability of individuals' mobility. In addition, a predictor network is added to the stage-I GAN for trajectory length prediction. Experiments on a real-world taxi dataset demonstrate that TS-TrajGAN not only generates trajectories with characteristics similar to the real ones, but also outperforms the state-of-the-art methods in terms of data utility. Our code is available at https://github.com/kfZhao726/TS-TrajGAN.
Trajectory Prediction with Contrastive Pre-training and Social Rank Fine-tuning
ABSTRACT. This paper focuses on the accurate prediction of pedestrian trajectories in scenarios where individuals walk alone or in social groups, and sometimes alter their paths to avoid collisions. While previous work has improved backbone neural networks to model individual motion patterns, few studies have explicitly addressed the consistency of internal motion patterns or properness of external interactions. To address this, we propose a unified framework consisting of a Contrastive History-Prediction (CHIP) module and a Differentiable Social Interaction Ranking (DSIR) module. The CHIP module utilizes unsupervised contrastive loss to optimize predicted motion patterns consistent with observations, while the supervised DSIR module ensures predicted interactions are compatible with realistic positions. Our analysis and numerical studies demonstrate the effectiveness of our approach, which achieves a 5-10% improvement in positional accuracy and a 3-7% boost in interactive properness. We provide comprehensive visualizations of anticipated trajectories with temporal interactive scores across various scenarios.
Dynamic Knowledge Distillation for Reduced Easy Examples
ABSTRACT. Knowledge distillation is usually performed by promoting a small model (student) to mimic the knowledge of a large model (teacher). Current knowledge distillation methods mainly focus on the extraction and transformation of knowledge while ignoring the importance of individual examples in the dataset, assigning equal weight to each example. To alleviate this problem, we propose Dynamic Knowledge Distillation (Dy-KD), which incorporates a curriculum strategy to selectively discard easy examples during knowledge distillation. Specifically, we estimate the difficulty of examples from the predictions of the stronger teacher network and divide the examples in a dataset into easy and hard ones. These examples are then given different weights to adjust their contributions to knowledge transfer. We validate Dy-KD on CIFAR-100 and Tiny-ImageNet; the experimental results show that (1) using the curriculum strategy to discard easy examples prevents the model's fitting capacity from being consumed by easy examples, and (2) assigning different weights to hard and easy examples makes the model emphasize hard examples, which boosts student performance. At the same time, our method can easily be built on existing distillation methods.
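A minimal sketch of the weighting idea follows, with an assumed confidence threshold standing in for the paper's curriculum schedule: the teacher's confidence on the ground-truth class rates example difficulty, easy examples are down-weighted or dropped, and hard examples dominate the distillation loss.

```python
import torch
import torch.nn.functional as F

def dynamic_kd_loss(student_logits, teacher_logits, labels,
                    T=4.0, easy_thresh=0.95, easy_weight=0.0, hard_weight=1.0):
    """Per-example weighted distillation; thresholds/weights are illustrative assumptions."""
    teacher_prob = F.softmax(teacher_logits, dim=1)
    conf = teacher_prob.gather(1, labels.unsqueeze(1)).squeeze(1)   # teacher confidence on the true class
    weights = torch.where(conf > easy_thresh,                       # easy examples get a small (or zero) weight
                          torch.full_like(conf, easy_weight),
                          torch.full_like(conf, hard_weight))
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="none").sum(dim=1) * (T * T)             # per-example KD term
    ce = F.cross_entropy(student_logits, labels, reduction="none")   # per-example supervised term
    return (weights * (kd + ce)).mean()
```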
Improving Out-of-Distribution Detection with Margin-Based Prototype Learning
ABSTRACT. Deep Neural Networks often make overconfident predictions when encountering out-of-distribution (OOD) data. Previous prototype-based methods significantly improved OOD detection performance by optimizing the representation space. However, practical scenarios present a challenge: OOD samples near class boundaries may overlap with in-distribution samples in the feature space, resulting in misclassification, and few methods have considered this challenge. In this work, we propose a margin-based method that introduces a margin into the common instance-prototype contrastive loss. The margin leads to broader decision boundaries, resulting in better distinguishability of OOD samples. In addition, we leverage learnable prototypes and explicitly maximize prototype dispersion to obtain an improved representation space. We validate the proposed method on several common benchmarks with different scoring functions and architectures. Experimental results show that the proposed method achieves state-of-the-art performance.
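The margin-augmented instance-prototype loss can be sketched as follows; the margin placement, temperature, and dispersion term are illustrative assumptions rather than the paper's exact objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MarginPrototypeLoss(nn.Module):
    """Instance-prototype contrastive loss with a margin plus a prototype-dispersion term."""
    def __init__(self, num_classes, feat_dim, margin=0.3, temperature=0.1, disp_weight=0.1):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_classes, feat_dim))  # learnable class prototypes
        self.margin, self.t, self.disp_weight = margin, temperature, disp_weight

    def forward(self, features, labels):
        z = F.normalize(features, dim=1)
        p = F.normalize(self.prototypes, dim=1)
        logits = z @ p.t()                                                   # cosine similarity to each prototype
        logits = logits - self.margin * F.one_hot(labels, p.size(0)).to(logits.dtype)  # additive margin on the true class
        contrastive = F.cross_entropy(logits / self.t, labels)
        proto_sim = p @ p.t()
        off_diag = proto_sim - torch.diag(torch.diag(proto_sim))
        dispersion = off_diag.sum() / (p.size(0) * (p.size(0) - 1))          # mean inter-prototype similarity
        return contrastive + self.disp_weight * dispersion                   # minimizing this spreads prototypes apart
```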
MOC: Multi-modal Sentiment Analysis via Optimal Transport and Contrastive Interactions
ABSTRACT. Multi-modal sentiment analysis (MSA) aims to utilize information from various modalities to improve the classification of emotions. Most existing studies employ attention mechanisms for modality fusion, overlooking the heterogeneity of different modalities. To address this issue, we propose an approach that leverages optimal transport for modality alignment and fusion, specifically focusing on distributional alignment. However, solely relying on the optimal transport module may result in a deficiency of intra-modal and inter-sample interactions. To tackle this deficiency, we introduce a double-modal contrastive learning module. Specifically, we propose a model MOC (Multi-modal sentiment analysis via Optimal transport and Contrastive interactions), which integrates optimal transport and contrastive learning. Through empirical comparisons on three established multi-modal sentiment analysis datasets, we demonstrate that our approach achieves state-of-the-art performance. Additionally, we conduct extended ablation studies to validate the effectiveness of each proposed module.
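For readers unfamiliar with the optimal-transport component, the sketch below runs entropic Sinkhorn iterations between two sets of modality tokens and uses the resulting plan to transport one modality onto the other. It only illustrates the general alignment idea; MOC's actual cost function and fusion design may differ.

```python
import torch
import torch.nn.functional as F

def sinkhorn(cost, eps=0.1, n_iters=50):
    """Entropic OT: cost (n, m) -> transport plan (n, m) with uniform marginals."""
    n, m = cost.shape
    a, b = torch.full((n,), 1.0 / n), torch.full((m,), 1.0 / m)
    K = torch.exp(-cost / eps)                 # Gibbs kernel
    u = torch.ones(n)
    for _ in range(n_iters):                   # alternating marginal scaling
        v = b / (K.t() @ u)
        u = a / (K @ v)
    return u.unsqueeze(1) * K * v.unsqueeze(0)

# Example: align text and audio token features with a cosine cost, then fuse
# the audio tokens transported onto the text tokens.
text, audio = torch.randn(8, 64), torch.randn(12, 64)
cost = 1 - F.normalize(text, dim=1) @ F.normalize(audio, dim=1).t()
plan = sinkhorn(cost)
fused = text + (plan / plan.sum(dim=1, keepdim=True)) @ audio
```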
A Distributed Projection-based Algorithm with Local Estimators for Optimal Formation of Multi-robot System
ABSTRACT. In general, the optimal formation problem can be modeled as a standard constrained optimization problem according to shape theory. By adding local supplementary estimators, it can be further modeled as a distributed constrained optimization problem, and a distributed projection-based algorithm is designed to solve it. The aim of the algorithm is to drive a group of robots to the desired geometric pattern while minimizing the total travel distance of the robots from their initial positions. It is worth noting that, as long as the communication graph among the robots is undirected and connected, the global convergence of the algorithm is guaranteed. Moreover, all the robots finally form the desired formation within the limited space. Finally, simulation results are provided to verify the effectiveness of the proposed distributed algorithm.
A Memory Optimization Method for Distributed Training
ABSTRACT. In recent years, with the continuous development of artificial intelligence technology, the complexity of deep learning algorithms and the scale of model training have kept increasing, and distributed training has become an effective way to train large-scale models. A series of efficient pipeline-parallel training methods have emerged to improve training speed and accuracy, yet memory consumption and load balance remain bottlenecks. To address these issues, we propose an efficient pipeline-parallel training optimization method that processes small batches of data in parallel across multiple compute nodes in a pipelined manner. We propose a prefix-sum partition algorithm to achieve a balanced partition and save the memory of computing resources. We also design a clock optimization strategy that limits the number of weight versions generated, ensuring the model's accuracy. Compared with well-known pipeline-parallel frameworks, our method achieves about 2 times training acceleration, saves about 30\% of memory consumption, and improves model accuracy by about 10\% compared with PipeDream.
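The prefix-sum partition can be illustrated with a few lines of Python: given per-layer costs, stage boundaries are placed where the cumulative cost crosses equal fractions of the total. The cost model and the clamping rules here are assumptions for illustration, not the paper's exact algorithm.

```python
import bisect
from typing import List

def prefix_sum_partition(costs: List[float], n_stages: int) -> List[List[int]]:
    """Assign consecutive layers to stages so each stage gets roughly total/n_stages work."""
    prefix = [0.0]
    for c in costs:
        prefix.append(prefix[-1] + c)          # prefix[i] = cost of layers [0, i)
    total = prefix[-1]
    boundaries = [0]
    for s in range(1, n_stages):
        target = total * s / n_stages          # ideal cumulative cost at this boundary
        idx = bisect.bisect_left(prefix, target)
        # keep at least one layer per stage
        idx = min(max(idx, boundaries[-1] + 1), len(costs) - (n_stages - s))
        boundaries.append(idx)
    boundaries.append(len(costs))
    return [list(range(boundaries[i], boundaries[i + 1])) for i in range(n_stages)]

# Example: 8 layers with uneven costs split over 4 pipeline stages.
print(prefix_sum_partition([1, 3, 2, 2, 5, 1, 1, 4], 4))
```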
Generalizing Graph Network Models for the Traveling Salesman Problem with Lin-Kernighan-Helsgaun Heuristics
ABSTRACT. Existing graph convolutional network (GCN) models for the traveling salesman problem (TSP) cannot generalize well to TSP instances with a larger number of cities than the training samples, and the NP-hard nature of the TSP makes it impractical to use large-scale instances for training. This paper proposes a novel approach that generalizes a GCN model pre-trained on a fixed small TSP size to large-scale instances with the help of Lin-Kernighan-Helsgaun (LKH) heuristics. This is realized by first devising a Sierpinski partition scheme to split a large TSP into sub-problems that can be efficiently solved by the pre-trained GCN, and then developing an attention-based merging mechanism to integrate the sub-solutions into a whole solution to the original TSP instance. Specifically, we train a GCN model by supervised learning to produce edge-prediction heat maps of small-scale TSP instances and then apply it to the sub-problems of a large TSP instance generated by the partition strategy. Controlled by an attention mechanism, all the heat maps of the sub-problems are merged into a complete one to construct the edge candidate set for LKH. Experiments show that this new approach significantly enhances the generalization ability of the pre-trained GCN model without using labeled large-scale TSP instances in training and also outperforms LKH under the same time limit.
Deep Hashing for Multi-label Image Retrieval with Similarity Matrix Optimization of Hash Centers and Anchor Constraint of Center Pairs
ABSTRACT. Deep hashing can improve computational efficiency and save storage space, making it a significant component of the image retrieval task, and it has received extensive research attention. Existing deep hashing frameworks mainly fall into two categories: single-stage and two-stage. For multi-label image retrieval, most single-stage and two-stage deep hashing methods consider two images to be similar if any pair of their corresponding category labels is the same, and thus do not make full use of the multi-label information. Meanwhile, some novel two-stage deep hashing methods proposed in recent years first construct hash centers and then train deep neural networks. For multi-label processing, these two-stage methods usually convert the multi-label objective into a single-label one, which also leads to insufficient use of label information. In this paper, a novel multi-label deep hashing method is proposed that constructs the hash centers by building a similarity matrix and designing an optimization algorithm, and builds the training loss function from the multi-label hash center constraint and an anchor constraint on center pairs. Experiments on several multi-label image benchmark datasets show that the proposed method achieves state-of-the-art results.
ABSTRACT. Designing incentive-compatible and revenue-maximizing auctions is pivotal in mechanism design. Often referred to as optimal auction design, the area has seen little theoretical breakthrough since Myerson's seminal 1981 work. Setting general combinatorial auctions aside, we do not even know the optimal auction for selling as few as two distinct items to more than one bidder. In recent years, the stagnation of theoretical progress has prompted many to use deep learning models to find near-optimal auction mechanisms. In this paper, we provide two general methods to improve such deep learning models. Firstly, we propose a new data sampling method that achieves better coverage and utilization of the possible data. Secondly, we propose a more fine-grained neural network architecture. Unlike existing models, which output a single payment percentage for each bidder, the refined network outputs a separate payment percentage for each item. Such an item-wise approach captures the interaction among bidders at a more granular level than previous models. We conducted comprehensive and in-depth experiments to test our methods and observed improvements in all tested models over their original designs. Notably, we achieved state-of-the-art performance by applying our methods to an existing model.
A Bi-Directional Optimization Network for De-Obscured 3D High-Fidelity Surface Reconstruction
ABSTRACT. 3D detailed face reconstruction based on monocular images aims to reconstruct a 3D face from a single image with rich face detail. The existing methods have achieved significant results, but still suffer from inaccurate face geometry reconstruction and artifacts caused by mistaking hair for wrinkle information. To address these problems, we propose a bi-directional optimization network for de-obscured 3D high-fidelity surface reconstruction. Specifically, our network is divided into two stages: face geometry fitting and face detail optimization. In the first stage, we design a global and local bi-directional optimized feature extraction network that uses both local and global information to jointly constrain the face geometry and ultimately achieve an accurate 3D face geometry reconstruction. In the second stage, we decouple the hair and the face using a segmentation network and use the distribution of depth values in the facial region as a prior for the hair part, after which the FPU-net detail extraction network we designed is able to reconstruct finer 3D face details while removing the hair occlusion problem. With only a small number of training samples, extensive experimental results on multiple evaluation datasets show that our method achieves competitive performance and significant improvements over state-of-the-art methods.
On the use of persistent homology to control the generalization capacity of a neural network
ABSTRACT. Analyzing the generalization capacity of neural networks (NNs) is crucial to ensure that the model has truly learned and can perform well on unseen data, rather than being limited to the training data. However, the ordinary approach of evaluating an NN's performance on multiple testing datasets can be both costly and time-consuming, as it requires obtaining, pre-processing, and labeling new testing datasets. The key problem is to find the right capacity for the number of training observations: the learning system's capacity must be adjusted both to the task and to the information provided by the data to obtain the best generalization. The work presented in this paper is set in this context and applies techniques from algebraic topology and relevance measures to study the behaviour of the NN during learning. We define the NN on a topological space as a functional topology graph, and a set of topological summaries is then calculated to estimate the generalization gap. This estimation is carried out in parallel with an assessment of the relevance of NN units, including a progressive pruning of the network units. During this pruning, the generalization gap estimation enables us to detect overfitting and thus to determine when to perform early stopping and to identify the architecture offering the best generalization. Our approach provides a more comprehensive understanding of NN generalization capacity and can be used to investigate the extensibility and interpretability of an NN.
Encrypted-SNN: A Privacy-Preserving Method for converting Artificial Neural Networks to Spiking Neural Networks
ABSTRACT. The conversion from Artificial Neural Networks (ANNs) to Spiking Neural Networks (SNNs) poses a significant challenge, and preserving privacy during the conversion process is crucial to protect sensitive information in the data. This work proposes a novel Encrypted-SNN to address the privacy problems in ANN-to-SNN conversion (ANN-SNN). By adding noise to the gradients of both the ANN and the SNN, privacy protection can be enhanced without affecting network performance. The proposed method is tested on popular datasets including CIFAR10, MNIST, and Fashion MNIST, achieving accuracies of 88.1%, 99.3%, and 93.0%, respectively. The impact of three different privacy budgets (ϵ=0.5, 1.0, and 1.6) on accuracy is discussed. Experimental results show that the proposed Encrypted-SNN effectively improves the privacy-performance trade-off, which is of practical significance for protecting data privacy and can enhance the security and privacy of spiking neural networks.
Towards undetectable adversarial examples: a steganographic perspective
ABSTRACT. Over the past decade, adversarial examples have demonstrated an increasing ability to fool neural networks. However, most adversarial examples can be easily detected, especially under statistical analysis. Ensuring undetectability is crucial for the success of adversarial examples in practice. In this paper, we borrow the idea of the embedding suitability map from steganography and employ it to modulate the adversarial perturbation. In this way, adversarial perturbations are concentrated in hard-to-detect areas and attenuated in predictable regions. Extensive experiments show that the proposed scheme is compatible with various existing attacks and can significantly boost the undetectability of adversarial examples against both human inspection and statistical analysis while maintaining the same attack ability. Code is available at github.com/zengh5/Undetectable-attack.
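A rough sketch of the modulation idea follows, where a local-variance map is used as a stand-in suitability map (the paper borrows proper steganographic cost functions instead): the FGSM step is scaled per pixel so that perturbations concentrate in textured, hard-to-detect regions. The threshold-free normalization and the toy classifier are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def texture_suitability(x, k=5):
    """Per-pixel local standard deviation used as a hypothetical suitability map."""
    mean = F.avg_pool2d(x, k, stride=1, padding=k // 2)
    var = F.avg_pool2d(x * x, k, stride=1, padding=k // 2) - mean ** 2
    s = var.clamp_min(0).sqrt().mean(dim=1, keepdim=True)          # average over channels
    return s / (s.amax(dim=(2, 3), keepdim=True) + 1e-8)           # normalize to [0, 1]

def modulated_fgsm(model, x, y, eps=8 / 255):
    """One FGSM step whose magnitude is modulated by the suitability map."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    suit = texture_suitability(x.detach())                          # larger where changes are harder to detect
    x_adv = x.detach() + eps * suit * grad.sign()                   # scale the step per pixel
    return x_adv.clamp(0, 1)

# Usage with a toy classifier:
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x_adv = modulated_fgsm(model, torch.rand(4, 3, 32, 32), torch.tensor([0, 1, 2, 3]))
```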
SDBC: A Novel and Effective Self-Distillation Backdoor Cleansing Approach
ABSTRACT. Deep Neural Networks (DNNs) are vulnerable to backdoor attacks, which only need to poison a small portion of samples to control the behavior of the target model. Moreover, the escalating stealth and power of backdoor attacks present not only significant challenges to backdoor defenses but also enormous potential threats to the widespread adoption of DNNs.
In this paper, we propose a novel backdoor defense framework, called Self-Distillation Backdoor Cleansing (SDBC), to remove backdoor triggers from an attacked model. For the practical scenario where only a very small portion of clean data is available, SDBC introduces self-distillation to cleanse the backdoor in DNNs. Extensive experiments demonstrate that SDBC can effectively remove backdoor triggers under 6 state-of-the-art backdoor attacks using less than 5%, or even less than 1%, of the clean training data without compromising accuracy. Experimental results show that the proposed SDBC outperforms existing state-of-the-art (SOTA) methods, reducing the average ASR from 95.36% to 5.75% and increasing the average ACC by 1.92%.
A Reinforcement Learning-Based Controller for Intersection Signals Suffering from Information Attacks
ABSTRACT. With the rapid development of smart technology and wireless communication technology, Intelligent Transportation Systems (ITS) are considered an effective way to solve the traffic congestion problem. An ITS can collect real-time road vehicle information through connected vehicles (CVs) and sensors such as cameras, and through real-time information exchange, signals can implement adaptive adjustment more intelligently, effectively reducing vehicle delays and traffic congestion. However, this connectivity also poses new challenges, as malicious attacks can affect traffic safety and efficiency. Reinforcement learning is considered the future trend of control algorithms for intelligent transportation systems. In this paper, we design reinforcement learning-based control algorithms for an intersection signal subjected to malicious attacks. The results show that the reinforcement learning-based signal control model reduces vehicle delay and queue length by 22% and 23%, respectively, relative to timing control. Meanwhile, reinforcement learning is a model-free control method, which prevents attackers from targeting flaws in specific control logic and allows the impact of information attacks to be evaluated more effectively. We further design a coordinated state-tampering attack across different lanes; the results show that the impact is greatest when the attacked states are in the same phase.
Quantum Autoencoder Frameworks for Network Anomaly Detection
ABSTRACT. Detecting anomalous activities in network traffic is important for the timely identification of emerging cyber attacks, and accurate analysis of emerging patterns in the traffic is critical to identify suspicious behaviors. In this paper, novel quantum deep autoencoder-based anomaly detection frameworks are proposed for accurately detecting security attacks that emerge in the network. In particular, we propose three frameworks: the first constructs several reconstruction-error-threshold-based methods; the second combines a quantum autoencoder with a one-class support vector machine; and the third combines a quantum autoencoder with a quantum random forest. Using a publicly available benchmark dataset, the quantum frameworks' effectiveness in accurately detecting the attacks is evaluated. Our empirical evaluations demonstrate improvements in accuracy and F1-score for all three frameworks.
MIC: An Effective Defense Against Word-level Textual Backdoor Attacks
ABSTRACT. Backdoor attacks, which manipulate model output, have garnered significant attention from researchers. However, some existing word-level backdoor attacks on NLP models are difficult to defend against effectively due to their concealment and diversity. These covert attacks use pairs of words that appear similar to the naked eye but are mapped to different word vectors by the NLP model, thereby bypassing existing defenses. To address this issue, we propose incorporating triplet metric learning into the standard training phase of NLP models to defend against existing word-level backdoor attacks. Specifically, metric learning is used to minimize the distance between the vectors of similar words while maximizing the distance between them and the vectors of other words. Additionally, given that metric learning may reduce a model's sensitivity to semantic changes caused by subtle perturbations, we add contrastive learning after the model's standard training. Experimental results demonstrate that our method performs well against the two most stealthy existing word-level backdoor attacks.
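The metric-learning component can be pictured with a small triplet-style loss over word embeddings; the anchor/positive/negative roles, dimensionality, and margin below are illustrative assumptions, not MIC's exact training objective.

```python
import torch
import torch.nn.functional as F

def word_triplet_loss(anchor, positive, negative, margin=0.5):
    """Pull visually similar word pairs together, push unrelated words away."""
    d_pos = 1 - F.cosine_similarity(anchor, positive)     # anchor vs. look-alike word
    d_neg = 1 - F.cosine_similarity(anchor, negative)     # anchor vs. unrelated word
    return F.relu(d_pos - d_neg + margin).mean()

# Usage with random 300-d embeddings standing in for word vectors.
a, p, n = torch.randn(32, 300), torch.randn(32, 300), torch.randn(32, 300)
loss = word_triplet_loss(a, p, n)
```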
ABSTRACT. Botnets are one of the most serious cybersecurity threats facing organizations today. Although botnet analysis and detection have produced many research results, botnets remain strongly concealed and difficult to identify. Therefore, we propose a botnet detection method based on NSA and DRN. This method uses our improved NSA to augment the preprocessed, dimensionality-reduced malicious traffic data that have few samples, and then extracts useful features of network traffic from two dimensions through a SENet-based DRN combined with BiGRU. Experimental results on the CICIDS-2017 and UNSW-NB15 datasets show that our proposed method achieves high botnet detection accuracy, 99.98% and 99.94% respectively, and improves the detection accuracy for rare malicious traffic. In addition, an ablation study further demonstrates the good generalization ability and robustness of our method for botnet detection.
Multi-granularity Deep Vulnerability Detection using Graph Neural Networks
ABSTRACT. Vulnerability detection has become increasingly crucial due to escalating cybersecurity threats. Investigating automated vulnerability detection techniques that avoid both high false positives and high false negatives is an important issue in the current software security field. In recent years, there has been a substantial focus on deep learning-based vulnerability detectors, which have achieved remarkable success. To fill the gap in multi-granularity program representation, we propose MulGraVD, a deep learning-based vulnerability detector at the function level. MulGraVD captures the continuity and structure of the programming language by considering information at the word, statement, basic block, and function granularities, respectively. To overcome the constraint posed by the hyperparameter-determined number of layers in the information aggregation process of graph neural networks, MulGraVD serially passes information from coarse to fine granularity, which facilitates the mining of vulnerability patterns. Our experimental evaluation on the FFMPeg+Qemu and ReVeal datasets shows that MulGraVD significantly outperforms existing state-of-the-art methods in terms of precision, recall, and F1 score, with average improvements of 11.62% in precision, 27.69% in recall, and 19.71% in F1 score.
Privacy-Preserving Federated Compressed Learning Against Data Reconstruction Attacks Based on Secure Data
ABSTRACT. Federated learning is a new distributed learning framework with data privacy preservation, in which multiple users collaboratively train models without sharing data. However, recent studies highlight potential privacy leakage through the shared gradient information. Several defense strategies, including gradient encryption and perturbation, have been suggested, but these strategies either involve high complexity or remain susceptible to attacks. To counter these challenges, we propose to train on secure compressive measurements via compressed learning, thereby achieving local data privacy protection with minimal performance degradation. A feasible way to boost performance is the joint optimization of the sensing matrix and the inference network during the training phase, but this may again be vulnerable to data reconstruction attacks. Thus, we further incorporate a traditional lightweight encryption scheme to protect data privacy. Experiments conducted on the MNIST and FMNIST datasets substantiate that our schemes achieve a satisfactory balance between privacy protection and model performance.
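The compressed-learning idea, training directly on compressive measurements so that raw pixels never leave the client, can be sketched as follows. The sensing matrix size, the toy keyed permutation standing in for the traditional encryption scheme, and the classifier head are all assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
D, M = 28 * 28, 200                         # signal dimension, number of measurements
Phi = torch.randn(M, D) / M ** 0.5          # shared random sensing matrix (kept fixed)

def measure(x_flat, key=42):
    """Compressive measurement y = Phi x, followed by a toy keyed permutation."""
    perm = torch.randperm(M, generator=torch.Generator().manual_seed(key))
    y = x_flat @ Phi.t()
    return y[:, perm]                       # lightweight keyed scrambling of the measurements

classifier = nn.Sequential(nn.Linear(M, 256), nn.ReLU(), nn.Linear(256, 10))
x = torch.rand(32, D)                       # a batch of flattened images
logits = classifier(measure(x))             # training operates on measurements only
```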
ABSTRACT. Deinterlacing is a classical issue in video processing, aimed at generating progressive video from interlaced content. Many precious videos that are difficult to reshoot still contain interlaced content. Previous methods have primarily focused on simple interlacing mechanisms and have struggled to handle the complex artifacts present in real-world early videos. Therefore, we propose a Transformer-based deinterlacing method consisting of a Feature Extractor, a De-Transformer, and a Residual DenseNet module. By incorporating self-attention from the Transformer, our proposed method is able to better utilize inter-frame motion correlation. Additionally, we combine a carefully designed loss function and residual blocks to train an end-to-end deinterlacing model. Extensive experimental results on various video sequences demonstrate that our proposed method outperforms state-of-the-art methods on different tasks by 1.41 to 2.64 dB. Furthermore, we also discuss several related issues, such as the rationality of the network structure. The code for our proposed method is available at https://github.com/Anonymous2022-cv/DeT.git.
Stereoential Net: Deep Network for Learning Building Height Using Stereo Imagery
ABSTRACT. Height estimation plays a crucial role in the planning and assessment of urban development, enabling effective decision-making and evaluation of urban built areas. Accurate estimation of building heights from remote sensing optical imagery poses significant challenges in preserving both the overall structure of complex scenes and the intricate elevation details of the buildings. This paper proposes a novel end-to-end deep learning-based network (Stereoential Net) comprising a multi-scale differential shortcut connection module (MSDSCM) at the decoding end and a modified stereo U-Net (mSUNet). The proposed Stereoential network performs a multi-scale differential fusion of decoding features to preserve fine details for improved height estimation using stereo optical imagery. Unlike existing methods, our approach does not use any multi-spectral satellite imagery; instead, it employs only freely available optical imagery, yet it achieves superior performance. We evaluate our proposed network on two benchmark datasets, the IEEE Data Fusion Contest 2018 (DFC2018) dataset and the 42-cities dataset. The 42-cities dataset comprises 42 densely populated cities in China with diverse sets of buildings of varying shapes and sizes. The quantitative and qualitative results reveal that our proposed network outperforms the SOTA algorithms on DFC2018. Our method reduces the root-mean-square error (RMSE) by 0.31 meters compared to state-of-the-art multi-spectral approaches on the 42-cities dataset. The code will be made publicly available via a GitHub repository.
WCA-VFnet: a dedicated complex forest smoke fire detector
ABSTRACT. Forest fires pose a significant threat to ecosystems, causing extensive damage. While state-of-the-art detection algorithms like YoloX, Deformable DETR, and VarifocalNet have demonstrated remarkable performance in the field of object detection, their effectiveness in detecting forest smoke fires, especially in complex scenarios with small smoke and flame targets, remains limited. To address this issue, we propose WCA-VFnet, an innovative approach that incorporates the Weld C-A component—a method featuring shared convolution and fusion attention. Furthermore, we have curated a distinctive dataset called T-SMOKE, specifically tailored for detecting small-scale, low-resolution forest smoke fires. Our experimental results show that WCA-VFnet achieves a significant improvement of approximately 35% in average precision (AP) for detecting small flame targets compared to Deformable DETR.
ABSTRACT. Novel view synthesis (NVS) aims to synthesize photo-realistic images depicting a scene by utilizing existing source images, and the synthesized images should be as faithful as possible to the scene content. We present Deep Normalized Stable View Synthesis (DNSVS), an NVS method for large-scale scenes based on the pipeline of Stable View Synthesis (SVS). SVS combines neural networks with the 3D scene representation obtained from structure-from-motion and multi-view stereo, where the view rays corresponding to each surface point of the scene representation and the source-view feature vectors together yield the value of each pixel in the target view. However, it weakens geometric information in the refinement stage, resulting in blur and artifacts in novel views. To address this, DNSVS leverages the depth map to enhance the rendering process via a normalization approach. The proposed method is evaluated on the Tanks and Temples dataset as well as the FVS dataset. The average Learned Perceptual Image Patch Similarity (LPIPS) of our results is better than that of state-of-the-art NVS methods by 0.12%, indicating the superiority of our method.
ABSTRACT. Driven by powerful convolutional neural networks, image inpainting has made tremendous progress. Recently, the transformer has demonstrated its effectiveness in various vision tasks, mainly due to its capacity to model long-term relationships. However, for image inpainting, the transformer tends to fall short in modeling local information, and interference from damaged regions poses additional challenges. To tackle these issues, we introduce a novel Semantic U-shaped Transformer (SUT). The SUT is designed with spectral transformer blocks in its shallow layers, effectively capturing local information, while its deeper layers use BRA transformer blocks to model global information. A key feature of the SUT is its attention mechanism, which employs bi-level routing attention. This approach significantly reduces the interference of damaged regions on the overall information, making the SUT well suited to image inpainting tasks. Experiments on several datasets indicate that the proposed method outperforms current state-of-the-art (SOTA) inpainting approaches: on average, its PSNR is 0.93 dB higher than SOTA, and its SSIM is higher by 0.026.
A Novel Interaction Convolutional Network Based on Dependency Trees for Aspect-level Sentiment Analysis
ABSTRACT. Aspect-based sentiment analysis aims to identify the sentiment polarity of a given aspect word in a sentence. Due to the complexity of sentences in real texts, models based on graph neural networks still struggle to accurately capture the relationship between aspect words and opinion words, limiting classification accuracy. To solve this problem, this paper proposes a novel aspect-level sentiment analysis model based on an interactive convolutional network with dependency trees, named ASAI-DT for short. In particular, ASAI-DT first extracts the aspect word representations from the sentence representation produced by a Bi-GRU model. Meanwhile, self-attention scores for the sentence and aspect representations are calculated separately by a self-attention mechanism in order to reduce attention to irrelevant information. Afterwards, the model constructs sub-trees of the dependency tree for the words, and the attention weight scores of the aspect representations are integrated into the sub-trees. The resulting comprehensive information about aspect words is then processed by a graph convolutional network to maximize the retention of valid information and minimize the interference of noise. Finally, the effective information is preserved more completely in the integrated information through the interactive network. Extensive experiments on multiple datasets show that the proposed ASAI-DT model is both effective and accurate for aspect-level sentiment analysis, outperforming many aspect-based sentiment analysis models.
Differentiable Topics Guided New Paper Recommendation
ABSTRACT. There are a large number of scientific papers published each year. Since papers differ greatly in their theoretical and technological advances, it is challenging to recommend valuable new papers to interested researchers. Papers usually make contributions at multiple levels, and accordingly, users also have fine-grained retrieval requirements. Moreover, the propagation of academic knowledge is asymmetric. In this paper, we investigate the new paper recommendation task from the viewpoint of the topics involved and use the concept of subspaces to distinguish different levels of innovation or academic contribution of papers. We adopt a neural topic model to represent papers by topic distributions over different subspaces. The academic influence between papers is modeled as topic propagation, learned by an asymmetric convolution on the citation network, reflecting the asymmetry of academic knowledge propagation. Experimental results on real datasets show that our model outperforms the baselines on new paper recommendation. In particular, the introduced subspace embeddings of papers are differentiable over topics, which helps identify paper innovations. Besides, we conduct experiments from multiple aspects to verify the validity of our model.
Co-GAN: A Text-to-Image Synthesis Model with Local and Integral Features
ABSTRACT. Text-to-image synthesis is a promising technology that generates realistic images from textual descriptions with deep learning models. However, state-of-the-art text-to-image synthesis models often struggle to balance the overall integrity and local diversity of objects with rich details, leading to unsatisfactory results for some domain-specific images, such as those in industrial applications. To address this issue, we propose Co-GAN, a text-to-image synthesis model that introduces two modules to enhance local diversity and maintain overall structural integrity, respectively. The Local Feature Enhancement (LFE) module improves the local diversity of generated images, while the Integral Structural Maintenance (ISM) module ensures that integral information is preserved. Furthermore, a cascaded central loss is proposed to address instability during generative training. To tackle the problem of incomplete image types in existing datasets, we create a new text-to-image synthesis dataset containing seven types of industrial components and evaluate various existing methods on it. The results of comparative and ablation experiments show that, compared with other current methods, the images generated by Co-GAN contain more details while better maintaining overall integrity.
User stance aware network for rumor detection using semantic relation inference and temporal graph convolution
ABSTRACT. The massive propagation of rumors has impaired the credibility of online social networks, while effective rumor detection remains difficult. Recent studies leverage stance inference to explore the semantic evidence in comments to improve detection performance. However, existing models only consider stance-relevant semantic features and ignore stance distribution and evolution, thus leaving room for improvement. Moreover, we argue that stance inference without considering the context in threads may lead to incorrect semantic features being accumulated and carried through to rumor detection. In this paper, we propose a user stance aware attention network (USAT), which learns the temporal features in semantic content, individual stance, and collective stance for rumor detection. Specifically, a high-order graph convolutional operator is designed to aggregate the preceding posts of each post, ensuring a complete semantic context for stance inference. Two temporal graph convolutional networks work in parallel to model the evolution of stance distribution and semantic content respectively, and share stance-based attention for de-noising content aggregation. Extensive experiments demonstrate that our model outperforms the state-of-the-art baselines. Our model will be available on GitHub upon acceptance.
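One generic way to aggregate the preceding posts of each post is multi-hop propagation over a temporally masked adjacency matrix. The sketch below illustrates that general idea only; the masking rule, normalisation, and hop count are assumptions, not the USAT operator itself.

```python
import numpy as np

def high_order_aggregate(adj, feats, order=2):
    """Sum row-normalised powers A^1..A^order so each post aggregates multi-hop predecessors."""
    agg = np.zeros_like(feats)
    a_k = np.eye(adj.shape[0])
    for _ in range(order):
        a_k = a_k @ adj                                        # next hop along reply edges
        row_sum = a_k.sum(axis=1, keepdims=True)
        norm = np.divide(a_k, row_sum, out=np.zeros_like(a_k), where=row_sum > 0)
        agg += norm @ feats
    return agg / order

# Toy thread of 4 posts; adj[i, j] = 1 if post j directly precedes post i (reply edge).
adj = np.array([[0, 0, 0, 0],
                [1, 0, 0, 0],
                [1, 0, 0, 0],
                [0, 1, 0, 0]], dtype=float)
feats = np.random.randn(4, 16)                                 # stand-in post embeddings
context = high_order_aggregate(adj, feats, order=2)
print(context.shape)  # (4, 16)
```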
ABSTRACT. Supervised learning, especially supervised deep learning, requires large amounts of labeled data. One approach to collecting large amounts of labeled data is to use a crowdsourcing platform where numerous workers perform the annotation tasks. However, the annotation results often contain label noise, as annotation skills vary depending on the crowd workers and their ability to complete the task correctly. Learning from Crowds is a framework that directly trains models on noisy labeled data from crowd workers. In this study, we propose a novel Learning from Crowds model inspired by SelectiveNet, which was proposed for the selective prediction problem. The proposed method, called Label Selection Layer, trains a prediction model by automatically determining whether to use a worker's label for training via a selector network. A major advantage of the proposed method is that it can be applied to almost all variants of supervised learning problems by simply adding a selector network and changing the objective function of existing models, without explicitly assuming a model of the noise in crowd annotations. The experimental results show that the performance of the proposed method is almost equivalent to or better than the Crowd Layer, one of the state-of-the-art methods for Deep Learning from Crowds, except in the regression problem case.
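The SelectiveNet-style idea of gating each crowd label's contribution to the loss can be sketched in a few lines of PyTorch: a selector head emits a selection probability per example, the classification loss is selection-weighted, and a coverage penalty keeps the selector from rejecting everything. The architecture, coverage target, and penalty weight below are illustrative assumptions, not the paper's exact Label Selection Layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectivePredictor(nn.Module):
    def __init__(self, in_dim, n_classes):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU())
        self.classifier = nn.Linear(64, n_classes)             # prediction head
        self.selector = nn.Linear(64, 1)                       # decides whether to use the crowd label

    def forward(self, x):
        h = self.backbone(x)
        return self.classifier(h), torch.sigmoid(self.selector(h)).squeeze(-1)

def selective_loss(logits, select, noisy_labels, coverage=0.7, lam=32.0):
    """Selection-weighted cross-entropy plus a squared penalty when average selection
    falls below the target coverage (the SelectiveNet-style constraint)."""
    ce = F.cross_entropy(logits, noisy_labels, reduction="none")
    risk = (select * ce).sum() / select.sum().clamp_min(1e-8)
    penalty = torch.clamp(coverage - select.mean(), min=0.0) ** 2
    return risk + lam * penalty

# Toy batch: 32 examples, 10 features, 3 classes, labels coming from (possibly noisy) crowd workers.
model = SelectivePredictor(10, 3)
x = torch.randn(32, 10)
y_crowd = torch.randint(0, 3, (32,))
logits, select = model(x)
loss = selective_loss(logits, select, y_crowd)
loss.backward()
```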
Empirical Analysis of Multi-label Classification on GitterCom using BERT
ABSTRACT. To maintain development awareness, simplify project coordination, and prevent misinterpretation, communication is essential for software development teams. Instant private messaging, group chats, and code sharing are just a few of the capabilities that chat rooms provide to meet the communication demands of software development teams, all in real time. Consequently, chat rooms have gained popularity among developers. Gitter is one such platform, and the conversations it contains can be a treasure trove of data for academics researching open-source software systems. This research uses the GitterCom dataset, the largest collection of carefully labeled and curated Gitter developer messages, and performs multi-label classification for the Purpose Category in the dataset. Extensive empirical analysis is performed on 6 feature selection techniques, 14 machine learning classifiers, and the BERT transformer architecture with layer-by-layer comparison. Our research pipeline achieves strong results, with the Extra Trees and Random Forest classifiers reaching median AUC (OvR) scores of 0.94 and 0.92, respectively. Furthermore, the proposed research pipeline can be applied to generic multi-label text classification of software developer forum text.
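A one-vs-rest multi-label pipeline with an Extra Trees classifier and per-label ROC-AUC, in the spirit of the evaluation described above, can be assembled with scikit-learn as follows. The TF-IDF features, synthetic messages, and random labels are placeholders, not the GitterCom preprocessing or annotations.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder chat messages and multi-label purpose annotations (3 hypothetical purpose categories).
messages = ["how do I configure the build?", "thanks, that fixed it",
            "PR #42 is ready for review", "anyone seen this stack trace?"] * 50
labels = np.random.randint(0, 2, size=(len(messages), 3))

X = TfidfVectorizer(max_features=500).fit_transform(messages)
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3, random_state=0)

clf = OneVsRestClassifier(ExtraTreesClassifier(n_estimators=100, random_state=0))
clf.fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)                               # shape (n_samples, n_labels)
print("macro AUC (OvR):", roc_auc_score(y_te, scores, average="macro"))
```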
Genetic Programming Symbolic Regression with Simplification-Pruning Operator for Solving Differential Equations
ABSTRACT. Differential equations (DEs) are important mathematical models for describing natural phenomena and engineering problems. Finding analytical solutions for DEs has both theoretical and practical benefits. However, traditional methods for finding analytical solutions only work for some special forms of DEs, such as equations with separable variables or those transformable into ordinary differential equations. For general nonlinear DEs, analytical solutions are often hard to obtain. Currently popular neural-network-based methods require large amounts of data to train the network and only give approximate solutions that suffer from errors and instability; they are also black-box models that are not interpretable. To obtain analytical solutions for DEs, this paper proposes a symbolic regression algorithm based on genetic programming with a simplification-pruning operator (SP-GPSR). This method introduces a new operator that simplifies the individual expressions in the population and randomly removes some structures in the formulas. Moreover, the method uses multiple fitness functions that consider how accurately the analytical solution satisfies both the sampled data and the differential equations. In addition, the algorithm uses a hybrid optimization technique to improve search efficiency and convergence speed. This paper conducts experiments on two typical classes of DEs. The results show that the proposed method can effectively find analytical solutions for DEs with high accuracy and simplicity.
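The kind of multi-objective fitness the abstract describes, scoring a candidate symbolic expression against both a differential-equation residual and sampled data, can be sketched with SymPy. The ODE y' + y = 0, the candidate expressions, and the weighting below are illustrative only, not the SP-GPSR operators or benchmarks.

```python
import numpy as np
import sympy as sp

x = sp.symbols("x")

def fitness(candidate, xs, ys, w_residual=1.0, w_data=1.0):
    """Lower is better: mean squared residual of y' + y = 0 plus mean squared error against data."""
    residual = sp.diff(candidate, x) + candidate               # plug the candidate into the ODE
    res_fn = sp.lambdify(x, residual, "numpy")
    cand_fn = sp.lambdify(x, candidate, "numpy")
    res_err = np.mean(np.square(res_fn(xs) * np.ones_like(xs)))
    data_err = np.mean(np.square(cand_fn(xs) * np.ones_like(xs) - ys))
    return w_residual * res_err + w_data * data_err

xs = np.linspace(0.0, 2.0, 50)
ys = np.exp(-xs)                                               # samples of the true solution
candidates = [sp.exp(-x), sp.cos(x), 1 - x + x**2 / 2]         # toy GP individuals
for cand in candidates:
    print(sp.sstr(cand), "->", fitness(cand, xs, ys))
# sp.simplify(cand) is the kind of call a simplification-pruning operator could build on.
```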
ABSTRACT. Offensive content on social media has become a serious issue, which makes its automatic detection a crucial task. Deep learning approaches for natural language processing (NLP) have proven to be at or even above human-level accuracy for offensive language detection tasks, which justifies deploying deep learning models for them. However, unlike humans, these models lack one key aspect: explainability. In this paper, we provide an explainable model for offensive language detection in a multi-task learning setting. Our model achieved an F1 score of 0.78 on the OLID dataset and 0.85 on the SOLID dataset. We also provide a detailed analysis of the model's interpretability.
Comparative Analysis of the Linear Regions in ReLU and LeakyReLU Networks
ABSTRACT. Networks with piecewise linear activation functions partition the input space into numerous linear regions. As such, the number of linear regions can serve as a metric to quantify the expressive capacity of networks employing ReLU (Rectified Linear Unit) and LeakyReLU activations. One notable drawback of the ReLU network lies in the potential occurrence of the "dying ReLU" issue during training, whereby the output and gradient remain zero when the input to a ReLU layer is negative. This results in ineffective weight updates and renders the affected neurons unresponsive, consequently impeding their contribution to network training. In this study, we perform a statistical analysis on the actual number of linear regions expressed by ReLU and LeakyReLU networks, providing an intuitive explanation for the "dying ReLU" problem. Our findings indicate that, under consistent input distributions and network parameters, LeakyReLU networks generally exhibit stronger expressive capacity in terms of linear regions compared to ReLU networks. We hope that our research can provide inspiration for the design of activation functions and contribute to the exploration and analysis of the behaviors exhibited by piecewise linear activation functions in networks.
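A small experiment in the spirit of the statistical analysis above is to count the distinct layer-wise activation sign patterns that sampled inputs fall into, which serves as a sampled proxy for the number of linear regions a ReLU or LeakyReLU network realises over a bounded domain. The architecture, sampling range, and leak slope below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dims = [2, 16, 16, 1]                                          # small MLP: 2-D input, two hidden layers
weights = [rng.standard_normal((dims[i], dims[i + 1])) for i in range(len(dims) - 1)]
biases = [rng.standard_normal(d) for d in dims[1:]]

def count_regions(activation, n_samples=20000):
    """Count distinct sign patterns of hidden pre-activations over uniform input samples."""
    x = rng.uniform(-1.0, 1.0, size=(n_samples, dims[0]))
    patterns = []
    h = x
    for w, b in zip(weights[:-1], biases[:-1]):                # hidden layers only
        pre = h @ w + b
        patterns.append(pre > 0)                               # breakpoint of ReLU and LeakyReLU is 0
        h = activation(pre)
    codes = np.concatenate(patterns, axis=1)
    return len({tuple(row) for row in codes})

relu = lambda z: np.maximum(z, 0.0)
leaky = lambda z: np.where(z > 0, z, 0.01 * z)
print("ReLU regions (sampled):     ", count_regions(relu))
print("LeakyReLU regions (sampled):", count_regions(leaky))
```

The counts differ with the same weights because the activation chosen in layer one changes the pre-activations (and hence the sign patterns) seen by deeper layers, which is where dead ReLU units lose expressiveness.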
DAformer: Transformer with Domain Adversarial Adaptation for EEG-based Emotion Recognition with Live Oil Paintings
ABSTRACT. The emergence of domain adaptation has brought remarkable advancement to EEG-based emotion recognition by reducing subject variability and thus increasing the accuracy of cross-subject tasks. A wide variety of materials have been employed to elicit emotions in experiments; however, artistic works, which aim to evoke emotional resonance in observers, are used relatively rarely. Previous research has shown promising results in EEG-based emotion recognition on static oil paintings. Since video clips are widely recognized as the most commonly used and effective stimuli, we adopted animated live oil paintings, a novel emotional stimulus in live form that is essentially a type of video clip but has fewer potential factors influencing EEG signals than traditional video clips, such as abrupt changes in background sound, contrast, and color tones. Moreover, previous studies on static oil paintings focused primarily on the subject-dependent task, and cross-subject analysis remains to be investigated. In this paper, we propose a novel DAformer model that combines the advantages of the Transformer and adversarial learning. To enhance the evocative performance of oil paintings, we introduce an innovative emotional stimulus by transforming static oil paintings into animated live forms. We develop a new emotion dataset, SEED-LOP (SJTU EEG Emotion Dataset-Live Oil Painting), and construct DAformer to verify the effectiveness of SEED-LOP. The results demonstrate higher accuracies for three-class emotion recognition when watching live oil paintings, with a subject-dependent accuracy of 61.73% and a cross-subject accuracy of 54.12%.
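Domain adversarial adaptation of the DANN family is typically built around a gradient reversal layer, which trains a feature encoder against a domain (here, subject) discriminator. The sketch below shows that standard component in PyTorch; the feature dimensions, heads, and losses are illustrative stand-ins, not the DAformer architecture.

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

def grad_reverse(x, lamb=1.0):
    return GradReverse.apply(x, lamb)

# Toy use: shared encoder, emotion classifier, and a subject/domain discriminator that sees
# gradient-reversed features, pushing the encoder toward subject-invariant representations.
encoder = nn.Sequential(nn.Linear(310, 64), nn.ReLU())         # 310 input features (illustrative)
emotion_head = nn.Linear(64, 3)                                # three emotion classes
domain_head = nn.Linear(64, 2)                                 # source vs. target subject

eeg = torch.randn(8, 310)
feat = encoder(eeg)
emotion_logits = emotion_head(feat)
domain_logits = domain_head(grad_reverse(feat, lamb=0.5))
loss = emotion_logits.sum() + domain_logits.sum()              # placeholder losses
loss.backward()
```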
Explainable Sparse Associative Self-Optimizing Neural Networks for Classification
ABSTRACT. Contemporary models used for supervised training often suffer from a large number of possible hyperparameter combinations, rigid non-adaptive architectures, underfitting, overfitting, the curse of dimensionality, etc. These issues slow down model optimization and consume many resources before satisfactory solutions are found. Since real-world objects are related and similar, we can not only train network parameters but also construct and automatically adapt the network structure to represent patterns and relationships. Such networks are easily explainable because they reproduce the most essential and frequent relationships in the training data and aggregate representations of similarities. This paper presents a new approach to detecting and representing similarities and object relationships in order to self-adapt a network structure to vectorized numerical training data. By doing so, the network facilitates the classification process by identifying hyperspace regions associated with the classes defined in a training dataset. This makes the produced models fully explainable and trustworthy. Furthermore, our approach demonstrates its ability to automatically reduce the dimensionality of input data, removing features that introduce distortions without substantially supporting the classification process. The presented network adaptation algorithm produces a sparse associative network structure fitted contextually to any given dataset by detecting relationships and similarities. In addition, this algorithm requires almost no hyperparameters to be set, unlike state-of-the-art methods. The explanation of the new associative adaptive approach is followed by comparisons of its classification results with other best-performing models and methods.
DRPDDet: Dynamic Rotated Proposals Decoder for Oriented Object Detection
ABSTRACT. Oriented object detection has gained popularity in diverse fields. However, for two-stage detection algorithms, generating high-quality proposals with a high recall rate remains a formidable challenge, especially in remote sensing images where sparse and dense scenes coexist. To address this, we propose the DRPDDet method, which aims to improve the accuracy and recall of proposals for oriented object detection. Our approach generates high-quality horizontal proposals and dynamically decodes them into rotated proposals to predict the final rotated bounding boxes. To obtain high-quality horizontal proposals, we introduce the HarmonyRPN module, which integrates foreground information from the RPN classification branch into the original feature map, creating a fused feature map that incorporates multi-scale foreground information. As a result, the RPN generates horizontal proposals that focus more on foreground objects, which improves regression performance. Additionally, we design a dynamic rotated proposals decoder that adaptively generates rotated proposals based on the constraints of the horizontal proposals, enabling accurate detection in complex scenes. We evaluate our method on the DOTA and HRSC2016 remote sensing datasets, and the experimental results demonstrate its effectiveness in complex scenes. Our method improves the accuracy of proposals in various scenarios while maintaining a high recall rate.
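Decoding a horizontal proposal into a rotated box is commonly done with delta offsets on centre, size, and angle. The sketch below shows that generic decoding step only; the angle parameterisation and offsets are illustrative, not the exact DRPDDet decoder.

```python
import numpy as np

def decode_rotated(h_proposal, deltas):
    """Turn a horizontal proposal (x1, y1, x2, y2) plus predicted deltas
    (dx, dy, dw, dh, dtheta) into a rotated box (cx, cy, w, h, theta)."""
    x1, y1, x2, y2 = h_proposal
    pw, ph = x2 - x1, y2 - y1
    pcx, pcy = x1 + 0.5 * pw, y1 + 0.5 * ph
    dx, dy, dw, dh, dtheta = deltas
    cx = pcx + dx * pw                  # shift the centre proportionally to the proposal size
    cy = pcy + dy * ph
    w = pw * np.exp(dw)                 # scale width/height with exponentiated deltas
    h = ph * np.exp(dh)
    theta = dtheta * np.pi              # illustrative angle parameterisation
    return np.array([cx, cy, w, h, theta])

print(decode_rotated((10.0, 20.0, 110.0, 70.0), (0.05, -0.02, 0.1, 0.0, 0.08)))
```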
Classification of Hard and Soft Wheat Species using Hyperspectral Imaging and Machine Learning Models
ABSTRACT. Ensuring the identification and authenticity of wheat seeds is a critical task in the food grain industry. In this work, twenty wheat varieties were collected from three different locations in India. The near-infrared (NIR) hyperspectral imaging technique (spectral range 900-1700 nm) was employed in conjunction with machine learning models to discriminate the twenty wheat varieties into two classes: hard wheat and soft wheat. Images were taken from both sides of the seed (ventral and dorsal), and the dataset includes images of 20,160 seeds. Five machine learning models were used for classification: Support Vector Machine (SVM), Linear Discriminant Analysis (LDA), Naive Bayes (NB), K-Nearest Neighbor (KNN), and Random Forest (RF). Five preprocessing techniques were applied to the mean spectral values of the hyperspectral images: Standard Normal Variate (SNV), Multiplicative Scatter Correction (MSC), Savitzky-Golay Smoothing (SG), Savitzky-Golay Smoothing First Derivative (SG-1), and Savitzky-Golay Smoothing Second Derivative (SG-2). The models' performance was evaluated on both raw and preprocessed data. The Support Vector Machine exhibited the best performance, attaining an accuracy of 95.01% on combined data (both ventral and dorsal sides), 95.05% on ventral-side data only, and 95.37% on dorsal-side data only.
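A brief sketch of the preprocessing-plus-classifier pipeline described above, using SNV followed by a Savitzky-Golay first derivative before an SVM, is shown below. The synthetic spectra stand in for the mean NIR spectra of the seeds; band count, window length, and SVM settings are illustrative.

```python
import numpy as np
from scipy.signal import savgol_filter
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def snv(spectra):
    """Standard Normal Variate: centre and scale each spectrum individually."""
    return (spectra - spectra.mean(axis=1, keepdims=True)) / spectra.std(axis=1, keepdims=True)

# Synthetic stand-in for mean spectra (256 bands) of hard (1) vs. soft (0) wheat seeds.
rng = np.random.default_rng(0)
n_seeds, n_bands = 400, 256
labels = rng.integers(0, 2, n_seeds)
spectra = rng.normal(0, 0.02, (n_seeds, n_bands)) + labels[:, None] * np.linspace(0, 0.1, n_bands)

X = savgol_filter(snv(spectra), window_length=11, polyorder=2, deriv=1, axis=1)  # SNV + SG-1
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3, random_state=0)
clf = SVC(kernel="rbf", C=10.0).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```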
Pushing the Boundaries of Chinese Painting Classification on Limited Datasets: Introducing a Novel Transformer Architecture with Enhanced Feature Extraction
ABSTRACT. The study delves into the realm of Chinese painting classification, a domain that has received limited attention in research. This research paper presents a comprehensive approach to address the challenges posed by limited datasets, feature map redundancy, and the inherent limitations of the Transformer model in this context. To overcome the scarcity of available data, a diverse and high-resolution Chinese painting dataset is meticulously curated, comprising iconic works from esteemed painters across various dynasties. In order to optimize feature extraction, an innovative strategy is employed, selecting a subset of channels that effectively capture the intricate details and distinctive features of Chinese paintings, thereby reducing redundancy and enhancing model efficiency. Furthermore, a novel Transformer architecture is proposed, leveraging local self-attention with sliding windows and introducing separate spatial token mixing and channel feature transformation through two residual connections. Experimental results demonstrate the superior classification and recognition accuracy achieved by this novel architecture on small datasets. This research contributes to the advancement of Chinese painting classification and offers valuable insights into the potential of deep learning models in this artistic domain. The dataset and code are publicly available at: https://github.com/qwerty0814/gangan
Effects of Brightness and Class-unbalanced Dataset on CNN Model Selection and Image Classification considering Autonomous Driving
ABSTRACT. Even though the approach of combining machine learning (ML) enhanced models and convolutional neural networks (CNNs) is used for adaptive CNN model selection, a thorough investigation of the effects of 1) image brightness and 2) class-balanced/-unbalanced datasets is needed, considering image classification (object detection) for autonomous driving in significantly different day- and night-time settings. In this empirical study we comprehensively investigate the effects of these two main issues using the ImageNet dataset, predictive models (premodels), and CNNs. Based on the experimental results and analysis, we reveal non-trivial pitfalls (up to a 58% difference in top-1 accuracy across differently class-balanced datasets) and opportunities in classification accuracy obtained by changing brightness levels and class-balance ratios in datasets.
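The kind of brightness sweep the study describes can be scripted by rescaling images and re-measuring top-1 accuracy at each level. The tiny placeholder model and random tensors below only illustrate the measurement loop, not the premodel, the CNNs, or the ImageNet setup used in the paper.

```python
import torch
from torch import nn
from torchvision.transforms.functional import adjust_brightness

# Placeholder classifier and data; in the study these would be ImageNet-scale CNNs and images.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).eval()
images = torch.rand(64, 3, 32, 32)                 # float images in [0, 1]
labels = torch.randint(0, 10, (64,))

for factor in (0.25, 0.5, 1.0, 1.5, 2.0):          # darker ... brighter
    adjusted = torch.stack([adjust_brightness(img, factor) for img in images])
    with torch.no_grad():
        top1 = (model(adjusted).argmax(dim=1) == labels).float().mean().item()
    print(f"brightness x{factor}: top-1 = {top1:.3f}")
```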
CMFN: Cross-Modal Fusion Network for Irregular Scene Text Recognition
ABSTRACT. Scene text recognition, as a cross-modal task involving vision and text, is an important research topic in computer vision. Most existing methods use language models to extract semantic information for optimizing visual recognition. However, the guidance of visual cues is ignored in the process of semantic mining, which limits the performance of the algorithm in recognizing irregular scene text. To tackle this issue, we propose a novel cross-modal fusion network (CMFN) for irregular scene text recognition, which incorporates visual cues into the semantic mining process. Specifically, CMFN consists of a position self-enhanced encoder, a visual recognition branch and an iterative semantic recognition branch. The position self-enhanced encoder provides character sequence position encoding for both the visual recognition branch and the iterative semantic recognition branch. The visual recognition branch carries out visual recognition based on the visual features extracted by a CNN and the position encoding information provided by the position self-enhanced encoder. The iterative semantic recognition branch, which consists of a language recognition module and a cross-modal fusion gate, simulates the way humans recognize scene text and integrates cross-modal visual cues for text recognition. The experiments demonstrate that the proposed CMFN algorithm achieves comparable performance to state-of-the-art algorithms, indicating its effectiveness.
Real-Time Instance Segmentation and Tip Detection for Neuroendoscopic Surgical Instruments
ABSTRACT. Location information of surgical instruments and their tips can be valuable for computer-assisted surgical systems and robotic endoscope control systems. While real-time methods for instrument segmentation and tip detection have been proposed for minimally invasive abdominal surgeries, the challenges become even greater in minimally invasive neurosurgery due to its narrow operating space and diverse tissue characteristics. In this paper, we introduce a real-time approach for instance segmentation and tip detection of neuroendoscopic surgical instruments. To address the specific requirements of neurosurgery, we design a tailored data augmentation strategy for this field and propose a mask filtering module to eliminate false-positive masks. Our method is evaluated using both a neurosurgical dataset and the EndoVis15' dataset. The experimental results demonstrate that the data augmentation module improves the accuracy of instrument detection and segmentation by up to 12.6%. Moreover, the mask filtering module enhances the precision of instrument tip detection with an improvement of up to 39.51%.
ABSTRACT. A comprehensive understanding of human-to-human interactions of interest in video streams, such as queuing, handshaking, fighting and chasing, is of immense importance to the surveillance of public security in regions like campuses, squares and parks. Different from conventional human interaction recognition, which uses choreographed videos as inputs, neglects concurrent interactive groups, and performs detection and recognition in separate stages, we introduce a new task named human-to-human interaction detection (HID). HID aims at detecting subjects, recognizing person-wise actions, and grouping people according to their interactive relations, all in one model. First, based on the popular AVA dataset created for action detection, we establish a new HID benchmark, termed AVA-Interaction (AVA-I), by adding annotations on interactive relations in a frame-by-frame manner. AVA-I consists of 85,254 frames and 86,338 interactive groups, and each image includes up to 4 concurrent interactive groups. Second, we present a novel baseline approach, SaMFormer, for HID, containing a visual feature extractor, a split stage which leverages a Transformer-based model to decode action instances and interactive groups, and a merging stage which reconstructs the relationship between instances and groups. All SaMFormer components are jointly trained in an end-to-end manner. Extensive experiments on AVA-I validate the superiority of SaMFormer over representative methods. The dataset and code will be made public to encourage more follow-up studies.
DeFusion: Aerial Image Matching Based on Fusion of Handcrafted and Deep Features
ABSTRACT. With the popularity of drones with vision sensors and the advancement of image processing technology, machine vision tasks based on image matching have received widespread attention. However, due to the complexity of aerial images, traditional matching methods based on handcrafted features unavoidably suffer from low robustness because they lack the ability to extract high-level semantics. On the other hand, deep learning shows great potential for improving matching accuracy, but at the cost of large amounts of task-specific samples and computing resources, making it infeasible in many scenarios. To fully leverage the strengths of both approaches, we propose DeFusion, a novel image matching solution with a fine-grained decision-level fusion algorithm that effectively combines handcrafted features and deep features. We train generic features on public datasets, enabling us to handle unseen scenes. We use RootSIFT as prior knowledge to guide the extraction of deep features, significantly reducing the computational overhead. We also carefully design preprocessing steps by incorporating the attitude information of the drone. As illustrated in our experimental results, the proposed scheme achieves 2.5-6x more correct matches overall, with improved robustness compared to existing methods.
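RootSIFT, mentioned above as the handcrafted prior, is the standard variant of SIFT obtained by L1-normalising each descriptor and taking the element-wise square root. The short OpenCV sketch below shows that step; it assumes a SIFT-enabled OpenCV build (version 4.4 or later, or a contrib build) and uses a random image purely to stay self-contained.

```python
import cv2
import numpy as np

def root_sift(descriptors, eps=1e-7):
    """Convert SIFT descriptors to RootSIFT: L1-normalise each row, then take the square root."""
    descriptors = descriptors / (descriptors.sum(axis=1, keepdims=True) + eps)
    return np.sqrt(descriptors)

# Any grayscale aerial image works here; a random image keeps the sketch self-contained.
image = (np.random.rand(480, 640) * 255).astype(np.uint8)
sift = cv2.SIFT_create()
keypoints, desc = sift.detectAndCompute(image, None)
if desc is not None:
    desc = root_sift(desc)
    print(desc.shape)                     # (n_keypoints, 128)
```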