Jaeyoung Choi (Delft University of Technology / ICSI, Netherlands) Martha Larson (Radboud University and Delft University of Technology, Netherlands) Gerald Friedland (University of California, Berkeley, United States) Alan Hanjalic (Delft University of Technology, Netherlands)
From Intra-Modal to Inter-Modal Space: Multi-Task Learning of Joint Representation for Cross-Modal Retrieval
ABSTRACT. Learning a robust joint representation space is important for effective multimedia retrieval, and is increasingly important as multimodal data grows in volume and diversity. The labeled datasets necessary for learning such a space are limited in size and in their coverage of semantic concepts. These limitations constrain performance: a joint representation learned on one dataset may not generalize well to another. We address this issue by building on the insight that, given limited data, it is easier to optimize the semantic structure of a space within a modality than across modalities. We propose a two-stage joint representation learning framework, with intra-modal optimization followed by cross-modal transfer learning of semantic structure, that produces a robust joint representation space. We integrate multi-task learning into each step, making it possible to leverage multiple datasets, annotated with different concepts, as if they were one large dataset. Large-scale systematic experiments demonstrate improvements over previously reported state-of-the-art methods on cross-modal retrieval tasks.
10:22
Giorgos Constantinou (University of Southern California, United States) Abdullah Alfarrarjeh (University of Southern California, United States) Seon Ho Kim (University of Southern California, United States) Gowri Sankar Ramachandran (IMSC and CCI, University of Southern California, United States) Bhaskar Krishnamachari (CCI, University of Southern California, United States) Cyrus Shahabi (IMSC, University of Southern California, United States)
A Crowd-Based Image Learning Framework Using Edge Computing for Smart City Applications
ABSTRACT. Smart city applications covering a wide area, such as traffic monitoring and pothole detection, are gradually adopting image-based machine learning algorithms that utilize ubiquitous camera sensors.
To support such applications, the edge computing paradigm focuses on processing large amounts of multimedia data at the edge to offload processing cost and reduce long-distance traffic and latency. However, existing edge computing approaches rely on pre-trained static models and are limited in supporting diverse classes of edge devices as well as the learning models needed to support them. This research proposes a novel crowd-based learning framework that allows edge devices with diverse resource capabilities to perform machine learning towards the realization of image-based smart city applications. The intelligent retraining algorithm shares key visual features to achieve higher accuracy based on temporal and geospatial uniqueness. Our evaluation shows the trade-off between accuracy and the resource constraints of the edge devices, while the model re-sizing option enables running machine learning models on edge devices with high flexibility.
ABSTRACT. Recently, low-rank tensor completion has attracted increasing attention for recovering incomplete tensors with missing elements. The basic assumption is that the underlying tensor is low-rank, so tensor nuclear norm minimization can be applied to recover it. Treating color images as third-order tensors, it has been shown that these tensors are not necessarily low-rank. The main aim of this paper is to propose and develop a weighted tensor factorization method for low-rank tensor completion. The main idea is to determine a suitable weight tensor such that the multiplication of the weight tensor with the underlying tensor is low-rank, or can be factorized into a product of low-rank tensors. A fast iterative minimization method is designed to solve for the weight tensor and the underlying tensor very efficiently. We use color images as examples to illustrate the proposed approach. A series of experiments on various incomplete color images demonstrates the superiority of the proposed low-rank tensor factorization method over state-of-the-art methods in color image completion performance.
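As a toy illustration of the factorization idea above, reduced from tensors to the matrix case for brevity, the following sketch recovers missing entries by alternating least squares on a low-rank factorization. The dimensions, rank, and regularizer are hypothetical choices; this is not the paper's actual weighted tensor algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: a rank-2 "image channel" with ~40% of entries missing.
m, n, r = 30, 30, 2
X_true = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
mask = rng.random((m, n)) < 0.6          # True where the entry is observed

def masked_als(X, mask, rank, iters=50, lam=1e-3):
    """Alternating least squares on observed entries only: X ~ U @ V.T."""
    m, n = X.shape
    U = rng.standard_normal((m, rank))
    V = rng.standard_normal((n, rank))
    for _ in range(iters):
        for i in range(m):               # update each row of U from observed cols
            cols = mask[i]
            Vi = V[cols]
            U[i] = np.linalg.solve(Vi.T @ Vi + lam * np.eye(rank), Vi.T @ X[i, cols])
        for j in range(n):               # update each row of V from observed rows
            rows = mask[:, j]
            Uj = U[rows]
            V[j] = np.linalg.solve(Uj.T @ Uj + lam * np.eye(rank), Uj.T @ X[rows, j])
    return U @ V.T

X_hat = masked_als(X_true * mask, mask, r)
err = np.linalg.norm((X_hat - X_true)[~mask]) / np.linalg.norm(X_true[~mask])
print(f"relative error on missing entries: {err:.3f}")
```

When the true rank matches the factorization rank, the missing entries are recovered almost exactly; the weighted formulation in the paper generalizes this beyond exactly low-rank data.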
ABSTRACT. We propose a novel approach for image corpus representative summarization using a GAN. Our technique can be used to automatically provide a condensed set of representatives for a given image collection. The generated summary can be used for rapid prototyping, as models can be trained on the summarized set instead of the larger original dataset. The problem is challenging because a good summary must cover various aspects of an image set, such as relevance and diversity. Additionally, the lack of sufficient ground truth data makes the problem hard to solve with classical supervised machine learning approaches. In our algorithm, we use a CNN and an MLP score layer to compute the priority of each image towards the summary. Our network is trained in an unsupervised manner using a generator for reconstructing the original dataset and a discriminator for classifying between original and summary. We show the efficacy of the algorithm through rigorous experiments.
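The priority-scoring step described above can be sketched as follows; the CNN features and MLP weights here are random stand-ins for the trained network, and the top-k selection rule is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical: 50 images, each with a 64-d CNN feature; the tiny MLP score
# layer below is a random stand-in for the trained priority scorer.
feats = rng.standard_normal((50, 64))
W1 = rng.standard_normal((16, 64)) * 0.1   # hidden layer
w2 = rng.standard_normal(16) * 0.1         # scalar score head

def summary_indices(feats, k=5):
    """Score each image with the MLP and keep the top-k as the summary."""
    scores = np.maximum(W1 @ feats.T, 0.0).T @ w2   # ReLU hidden, linear score
    return np.argsort(scores)[-k:][::-1]            # indices of the k best

idx = summary_indices(feats)
print(len(idx))   # k representative images
```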
ABSTRACT. The rapid growth in the amount of fake news on social media is a serious concern for our society. Fake news is usually created by manipulating images, text, audio, and video, which indicates the need for a multimodal fake news detection system. Existing multimodal systems tend to approach fake news detection through an additional subtask, such as an event discriminator, or by finding correlations across the modalities. Their results are heavily dependent on the subtask: without subtask training, fake news detection performance degrades by 10% on average.
To solve this issue, we introduce SpotFake, a multimodal framework for fake news detection. Our proposed solution detects fake news without relying on any other subtask. It exploits both the textual and visual features of an article: we use language models (such as BERT) to learn text features, while image features are learned from a VGG-19 network pre-trained on the ImageNet dataset. All experiments are performed on two publicly available datasets, Twitter and Weibo.
The proposed model outperforms the current state-of-the-art on the Twitter and Weibo datasets by 3.27% and 6.83%, respectively.
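A minimal sketch of the kind of two-stream fusion the abstract describes: project each modality's feature vector to a small common size, concatenate, and score. The dimensions match typical BERT (768-d) and VGG-19 fc7 (4096-d) outputs, but all weights and layer sizes are random stand-ins, not SpotFake's trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-article features from the two streams.
bert_feat = rng.standard_normal(768)    # text embedding (assumed 768-d)
vgg_feat = rng.standard_normal(4096)    # image embedding (assumed 4096-d)

# Random stand-ins for trained projection and classification layers.
W_text = rng.standard_normal((32, 768)) * 0.02
W_img = rng.standard_normal((32, 4096)) * 0.02
w_out = rng.standard_normal(64) * 0.1

def fused_fake_score(text_vec, img_vec):
    """Project each modality to 32-d, concatenate, and score with a sigmoid."""
    h = np.concatenate([np.tanh(W_text @ text_vec), np.tanh(W_img @ img_vec)])
    return 1.0 / (1.0 + np.exp(-w_out @ h))   # probability-like fakeness score

p = fused_fake_score(bert_feat, vgg_feat)
print(f"fake score: {p:.3f}")
```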
11:52
Abdullah Alfarrarjeh (University of Southern California, United States) Seon Ho Kim (University of Southern California, United States) Arvind Bright (University of Southern California, United States) Vinuta Hegde (University of Southern California, United States) Akshansh (University of Southern California, United States) Cyrus Shahabi (University of Southern California, United States)
Spatial Aggregation of Visual Features for Big Image Data Search
ABSTRACT. The two main requirements of searching a big image database are performance and accuracy. For an accurate similarity search, high-dimensional visual features are preferred, while low-dimensional features resulting from dimension reduction techniques are utilized in index structures for performance. Most state-of-the-art indexes utilize low-dimensional visual descriptors to avoid the computing overhead of high dimensionality in image search, which sacrifices search accuracy. We propose a new descriptor that balances the trade-off between accuracy and performance in image search by extending the representation of an image with the feature set of multiple similar images located in its vicinity, referred to as the Spatially-Aggregated Visual Feature Descriptor (SVD). SVD potentially preserves the visual features of images in both high- and low-dimensional spaces better than conventional visual descriptors. In an empirical evaluation on big datasets, indexing images using SVD provided a significant improvement in search accuracy compared to conventional descriptors while maintaining the same performance.
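One plausible reading of the aggregation idea, sketched with synthetic descriptors: extend an image's own descriptor with a summary of its nearest neighbours' descriptors. The neighbourhood definition, the choice of k, and the use of a mean are assumptions for illustration, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical corpus: 100 images, each with a 16-d visual descriptor.
feats = rng.standard_normal((100, 16))

def aggregated_descriptor(feats, idx, k=5):
    """Extend image idx's descriptor with the mean descriptor of its
    k nearest neighbours (Euclidean distance in feature space)."""
    d = np.linalg.norm(feats - feats[idx], axis=1)
    nn = np.argsort(d)[1:k + 1]              # skip the image itself
    return np.concatenate([feats[idx], feats[nn].mean(axis=0)])

svd_like = aggregated_descriptor(feats, idx=0)
print(svd_like.shape)   # 32-d: original 16-d plus a 16-d neighbourhood summary
```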
12:14
Osaid Rehman Nasir (Indraprastha Institute of Information Technology Delhi, India) Shailesh Kumar Jha (Indraprastha Institute of Information Technology Delhi, India) Manraj Singh Grover (Indraprastha Institute of Information Technology Delhi, India) Yi Yu (National Institute of Informatics, Japan) Ajit Kumar (Adobe System, India) Rajiv Ratn Shah (Indraprastha Institute of Information Technology Delhi, India)
Text2FaceGAN: Face Generation from Fine Grained Textual Descriptions
ABSTRACT. In recent years, powerful generative adversarial networks (GANs) have been developed to automatically synthesize realistic images from text. However, most existing work is limited to generating simple images, such as flowers, from captions. In this paper, we extend this problem to the less addressed domain of face generation from fine-grained textual descriptions of a face, e.g., "A person has curly hair, oval face, and mustache". Since current datasets for this task are either very small or do not contain captions, we generate captions for images in the CelebA dataset with an algorithm that automatically converts a list of attributes into a set of captions. The generated captions are meaningful, versatile, and consistent with the general semantics of a face. We then model the highly multimodal problem of text-to-face generation as learning the conditional distribution of faces (conditioned on text) in the same latent space, utilizing current state-of-the-art GANs for learning conditional multi-modality. The presence of more fine-grained details and the variable length of the captions make the problem easier for the user but more difficult to handle than other text-to-image tasks. We flip the labels for real and fake images and add noise in the discriminator. Generated images for diverse textual descriptions show promising results. Finally, we show that the widely used Inception score is not a good metric for evaluating the performance of generative models used to synthesize faces from text.
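The label-flipping and noise tricks mentioned above are common GAN stabilisation heuristics; a minimal sketch of how the discriminator's targets might be built is below. The noise scale and clipping range are hypothetical choices, not the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def discriminator_targets(batch_size, real):
    """Flipped, noisy targets: real images get labels near 0, generated
    images labels near 1, with Gaussian noise added and clipped to [0, 1]."""
    base = 0.0 if real else 1.0              # flipped relative to the usual convention
    return np.clip(base + rng.normal(0.0, 0.05, size=batch_size), 0.0, 1.0)

t_real = discriminator_targets(8, real=True)
t_fake = discriminator_targets(8, real=False)
print(t_real.mean(), t_fake.mean())
```

Soft, noisy targets keep the discriminator from becoming overconfident early in training, which in turn keeps gradients flowing to the generator.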
ABSTRACT. There is increasing focus on both style transfer and user-interaction-based image editing.
Commercial apps provide users with the means to edit their images to their liking. Although two-step methods exist that allow the user to make significant changes to an image and then perform style transfer, these methods tend to distort the image and remove the aesthetic quality the user desires. We propose a new Generative Adversarial Network (GAN), called FRGAN, that maintains the user's suggested changes to the image and performs style transfer while retaining those changes. We qualitatively demonstrate the efficacy of the FRGAN formulation over various two-step GAN methods and traditional style transfer methods, and use the Mean Opinion Score (MOS) metric to quantify our proposed model's performance.
Keiji Yanai (The University of Electro-Communications, Japan) Kaimu Okamoto (The University of Electro-Communications, Japan) Tetsuya Nagano (The University of Electro-Communications, Japan) Daichi Horita (The University of Electro-Communications, Japan)
Large-Scale Twitter Food Photo Mining and Its Applications
ABSTRACT. Many people post photos as well as short messages to Twitter every minute from everywhere on earth. By monitoring the Twitter stream, we can obtain various kinds of photos with texts that help us understand the current state of the world visually. Since 2011, we have been continuously collecting photos from the Twitter stream, for about eight years. Although we also collect generic geotagged Twitter photos, in this paper we focus on food image collection from Twitter. Because food is one of the most popular subjects of Twitter photos, we can collect a large number of food images; in fact, we have collected more than two million food photos so far. In this paper, we present an analysis of the food photos collected from the Twitter stream over eight years. In addition, we describe some applications using Twitter photos, including world food analysis and food photo translation.
Similar Seasonal-Geo-Region Mining Based on Visual Concepts in Social Media Photos
ABSTRACT. In this paper, we propose a method for similar geo-region mining that focuses on seasonality. We define a "seasonal-geo-region" as a geographically and temporally continuous area where many travellers share common targets of interest. To extract such targets of interest, we consider observing people's interests through the contents of social media photographs to be an effective approach. We first introduce a clustering method that decides seasonal-geo-region boundaries based on the geotag and shooting time accompanying each photo. Next, we introduce the proposed method, which compares the similarity of a given pair of seasonal-geo-regions based on the likelihood distribution of visual concepts that appear in the photographs belonging to each seasonal-geo-region. Finally, we present the results of a seasonal-geo-region mining experiment and report an evaluation of part of the results through a subjective experiment. The results show the effectiveness of the proposed method.
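Comparing two regions by their visual-concept distributions might look like the sketch below, using cosine similarity as one simple choice of distribution comparison; the concept vocabulary, the counts, and the similarity measure are all illustrative assumptions, not the paper's actual method.

```python
import numpy as np

# Hypothetical visual-concept likelihood distributions for two regions,
# e.g. counts of detected concepts (beach, boat, snow, sunset), normalized.
region_a = np.array([40.0, 10.0, 5.0, 45.0])
region_b = np.array([35.0, 15.0, 2.0, 48.0])
region_a /= region_a.sum()
region_b /= region_b.sum()

def region_similarity(p, q):
    """Cosine similarity between two concept distributions."""
    return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))

sim = region_similarity(region_a, region_b)
print(f"similarity: {sim:.3f}")
```

Two regions whose photos evoke the same concepts in the same season score close to 1, which is the intuition behind the seasonal-geo-region comparison.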
ABSTRACT. Social media is arguably the most abundant source of actionable information in times of natural disasters. Most of the data is available in the form of text, images, or videos. Real-time analysis of such data during calamities poses many challenges to machine learning algorithms, which require a large amount of data to perform well. The Multimodal Twitter Dataset for Natural Disasters (CrisisMMD) is one such novel dataset: it provides annotated textual and image data to researchers to aid the development of crisis response mechanisms that leverage social media platforms to extract useful information in times of crisis. In this paper, we analyze multimodal data related to seven different natural calamities, such as hurricanes, floods, and earthquakes, and propose a novel decision diffusion technique to classify them into informative and non-informative categories. The proposed methodology outperforms the text baselines by more than 4% accuracy and the image baselines by more than 3%.
A Framework to Detect Fake Tweet Images on Social Media
ABSTRACT. Misinformation on social media has become an alarming problem in recent years, with the advent of social media platforms as a news source serving as a critical factor. Beyond misinformation, social media platforms have recently faced the new challenge of tampered, impersonated content (i.e., tweets). Although several approaches have been proposed to detect misinformation, little attention has been given to detecting impersonated fake content. With the free tools available on the internet, it is very easy to generate a fake screen capture of a tweet. These tools are intended for fun, but the problem arises when they are weaponized to generate fake news and impersonated content. In this paper, we propose a framework that can help detect tampered and impersonated tweets on social media or other digital platforms. We validate the proposed framework on a dataset we collected, which consists of real tweets and multiple types of tampered tweets. Our framework achieves an accuracy of 83.33% on the test cases we cover, which span the possible ways a screen capture of a tweet can be tampered with.
Hong Xue (Fujian Key Lab for Intelligent Processing and Wireless Transmission of Media Information, Fuzhou University, China) Tiesong Zhao (Fujian Key Lab for Intelligent Processing and Wireless Transmission of Media Information, Fuzhou University, China) Weiling Chen (Fujian Key Lab for Intelligent Processing and Wireless Transmission of Media Information, Fuzhou University, China) Qian Liu (Dept. of Computer Science and Technology, Dalian University of Technology, China) Shaohua Zheng (Fujian Key Lab for Intelligent Processing and Wireless Transmission of Media Information, Fuzhou University, China) Chang Wen Chen (School of Science and Engineering, The Chinese University of Hong Kong, China)
Visual Attention and Haptic Control: a Cross-Study
ABSTRACT. The variety of multimedia big data has promoted emerging applications of multiple sensorial media (mulsemedia), in which haptic information attracts increasing attention. Until now, the interaction between haptic signals and conventional audio-visual signals has not been fully investigated. In this work, we explore cross-modal interactivity in task-driven scenarios. We first explore the correlation between visual attention and haptic control in three designed tasks: random-trajectory, fixed-trajectory, and obstacle-avoidance. Then, we propose a visual-haptic interaction model that estimates the kinesthetic position of haptic control from gaze information alone. By incorporating a Long Short-Term Memory (LSTM) neural network, the proposed model provides effective prediction in the fixed-trajectory and obstacle-avoidance scenarios, with performance superior to other selected machine learning-based models. To further examine our model, we execute it in a haptic control task using visual guidance. Implementation results show a high task achievement rate.
14:30
Mingxing Xu (Department of Electronic Engineering, Shanghai Jiao Tong University., China) Wenrui Dai (Department of Computer Science and Engineering, Shanghai Jiao Tong University, China) Yangmei Shen (Department of Electronic Engineering, Shanghai Jiao Tong University., China) Hongkai Xiong (Department of Electronic Engineering, Shanghai Jiao Tong University., China)
MSGCNN: Multi-Scale Graph Convolutional Neural Network for Point Cloud Segmentation
ABSTRACT. The point cloud has emerged as a scalable and flexible geometric representation for 3D data. Graph convolutional neural networks (GCNNs) have shown superior performance and robustness in point cloud processing thanks to their structure-awareness and permutation invariance. However, naive graph convolution networks are limited in point cloud segmentation tasks, especially in the border areas of multiple segmentation instances, due to the lack of multi-scale feature extraction ability. In this paper, we propose a novel multi-scale graph convolutional neural network (MSGCNN) to allow multi-scale feature learning for fine-grained point cloud segmentation. The proposed geometrically interpretable multi-scale point cloud processing framework is able to considerably enlarge the receptive fields of the graph filters and exploit discriminative multi-scale structure-aware point features, yielding superior segmentation performance over naive graph convolution networks, especially in border areas. Experimental results for the part segmentation task on the ShapeNet dataset show that MSGCNN achieves competitive performance with the state-of-the-art. In comparison to naive graph convolution networks, MSGCNN obtains better visual quality in border areas. We further validate that our model is robust to missing data points and noise perturbation thanks to the learned multi-scale structure-aware point features.
15:00
Ziling Huang (National Tsing Hua University, Taiwan) Zheng Wang (National Institute of Informatics, Japan) Tzu-Yi Hung (Delta Research Center, Singapore) Shin’ichi Satoh (National Institute of Informatics, Japan) Chia-Wen Lin (National Tsing Hua University, Taiwan)
Group Re-Identification via Transferred Representation and Adaptive Fusion
ABSTRACT. Group re-identification (G-ReID) is a less-studied task. Its challenges include not only the appearance changes of individuals, which have been well investigated in general person re-identification (ReID), but also the group layout and group membership changes newly introduced by G-ReID. The key task of G-ReID is to learn representations robust to these changes. To address this issue, we design a Transferred Single and Couple Representation Learning Network (TSCN). Its merits are twofold: 1) due to the lack of training samples, existing methods exploit unsatisfactory hand-crafted features; to gain the advantages of deep learning models, we treat a group as multiple persons and transfer a labeled ReID dataset to the G-ReID dataset style to learn the single representation. 2) Taking into account the neighborhood relationships in a group, we also propose the couple representation, which maintains more discriminative features in some cases. We further exploit an unsupervised weight learning method to adaptively fuse the results of different views according to the result pattern. Extensive experimental results demonstrate the effectiveness of our approach, which outperforms the state-of-the-art methods on two public datasets.
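A sketch of an unsupervised score-fusion heuristic in the spirit of the adaptive fusion described above: weight each view by how peaked its score pattern is, so a confident view dominates the fused ranking. The margin-based confidence proxy and the example scores are assumptions, not TSCN's actual weighting scheme.

```python
import numpy as np

# Hypothetical similarity scores of one probe group against 4 gallery groups,
# from two views: single-person matching and couple (pairwise) matching.
single_scores = np.array([0.82, 0.40, 0.35, 0.30])   # peaked: confident view
couple_scores = np.array([0.60, 0.58, 0.20, 0.15])   # flat top: ambiguous view

def adaptive_fuse(*score_lists):
    """Weight each view by its top-1/top-2 margin (an unsupervised
    confidence proxy), normalize the weights, and average the scores."""
    margins = []
    for s in score_lists:
        top2 = np.sort(s)[-2:]
        margins.append(top2[1] - top2[0])    # larger margin => more confident
    w = np.array(margins)
    w = w / w.sum()
    return sum(wi * si for wi, si in zip(w, score_lists))

fused = adaptive_fuse(single_scores, couple_scores)
print("best match:", int(np.argmax(fused)))
```

Here the single-person view has a clear winner while the couple view is nearly tied, so the fusion follows the confident view.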
15:30
Salman Shaikh (National Institute of Advanced Industrial Science and Technology (AIST), Japan) Akiyoshi Matono (National Institute of Advanced Industrial Science and Technology (AIST), Japan) Kyoung-Sook Kim (Tokyo Institute of Technology, Japan)
A Distance-Window Based Real-Time Processing of Spatial Data Streams
ABSTRACT. Real-time and continuous processing of citywide spatial data is an essential requirement of smart cities, needed to guarantee the delivery of basic life necessities to residents and to maintain law and order. With the recent availability of low-cost 3D scanners, citywide 3D spatial data can be obtained easily. 3D spatial data contains a wealth of information, including images, point clouds, GPS/IMU measurements, etc., and can be of great use if integrated, processed, and analyzed in real time. 3D spatial data is generated as a continuous data stream; traditionally, however, it is processed offline. Many smart city applications require real-time integration, processing, and analysis of spatial streams, for instance, forest fire management, real-time road traffic analysis, monitoring of disaster-engulfed areas, and people flow analysis, yet they suffer from the slow offline processing of traditional systems. To make the most of this rich data resource, it must be processed and analyzed in real time. This paper presents a framework for the continuous, real-time processing and analysis of 3D spatial streams. Furthermore, we propose a distance-based window for continuous queries over 3D spatial streams. An experimental evaluation is also presented to demonstrate the effectiveness of the proposed framework and the distance-based window.
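A minimal sketch of what a distance-based window over a spatial stream might look like, analogous to a time-based sliding window but bounded by distance instead. The eviction rule used here, keeping only points within a radius of the newest arrival, is one plausible interpretation for illustration, not the paper's exact window definition.

```python
from collections import deque
import math

def process_stream(points, radius):
    """For each arriving 2D point, evict earlier points farther than
    `radius` from it, and report the resulting window size."""
    window = deque()
    sizes = []
    for p in points:
        window.append(p)
        # distance-based eviction relative to the newest point
        window = deque(q for q in window if math.dist(q, p) <= radius)
        sizes.append(len(window))
    return sizes

# Hypothetical 2D stream, e.g. projected sensor locations along a road.
stream = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
print(process_stream(stream, radius=3.0))
```

The jump from (2, 0) to (10, 0) flushes the window, mirroring how a distance-based window naturally segments a stream by spatial locality rather than by arrival time.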