
09:30-10:00 Coffee Break
10:00-11:30 Session 2: Best paper candidate

Best paper candidate presentations

Trajectory Similarity Assessment on Road Networks via Embedding Learning

ABSTRACT. Trajectory similarity assessment can be applied to clustering analysis of trajectory data, route planning, and navigation recommendation in Intelligent Transportation Systems. However, the acquisition of trajectory data in urban environments is constrained by road networks, differing sampling rates, GPS errors, etc. In this paper, a trajectory similarity assessment approach is proposed based on road network embedding. It captures both the topology and spatiality of road networks for embedding learning, by performing random walks adapted to spatial queries through depth-first search and introducing distance into the loss function. In this way, trajectory embeddings can be obtained by mapping trajectories to the road network, and the similarity of trajectories can then be evaluated. Experiments on real data show that our approach is robust and efficient for trajectory similarity assessment.
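A rough sketch of the final comparison step described in this abstract: once road segments have embeddings, a trajectory can be represented as the mean of its segments' vectors and two trajectories compared by cosine similarity. The segment ids and 2-d embeddings below are toy values, not the embeddings the paper learns via spatially adapted random walks.

```python
import math

def avg_embedding(segment_ids, embeddings):
    """Represent a trajectory (sequence of road-segment ids) as the
    mean of its segment embedding vectors."""
    vecs = [embeddings[s] for s in segment_ids]
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy road network: 4 segments with hypothetical 2-d embeddings.
emb = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0], 3: [0.1, 0.9]}
t1, t2, t3 = [0, 1], [1, 0], [2, 3]

sim_close = cosine_similarity(avg_embedding(t1, emb), avg_embedding(t2, emb))
sim_far = cosine_similarity(avg_embedding(t1, emb), avg_embedding(t3, emb))
```

Averaging makes the comparison order- and sampling-rate-insensitive: t1 and t2 visit the same segments in different order and still score as identical.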

Tag Boosted Hybrid Recommendations for Multimedia Data

ABSTRACT. Multimedia data is known for its variety and for the difficulty of extracting relevant features from it, which is why collaborative recommendation systems have found their foothold in multimedia recommender systems. However, modern-day multimedia sites hold vast user histories in the form of user feedback, reviews, votes, and comments. We can use these social interactions to extract useful content features, which can then be used in a content-based recommendation system. In this paper, we propose a novel hybrid recommender system that combines the content-based and collaborative systems using a Bayesian model. We substitute the concrete textual content with sparse tag information. Extensive experiments on a real-world dataset show that tags significantly improve recommendation performance for multimedia data.
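To illustrate the general shape of a Bayesian hybrid combination (the paper's actual model is not specified in the abstract, so this is only a naive-Bayes-style sketch under a uniform prior): two independent probability estimates, one collaborative and one tag/content-based, can be fused into a posterior.

```python
def fuse(p_collab, p_content):
    """Naive-Bayes-style fusion of two independent estimates of
    P(user likes item), assuming a uniform 0.5 prior: compare the
    joint evidence for 'like' against the joint evidence for 'dislike'."""
    pos = p_collab * p_content
    neg = (1 - p_collab) * (1 - p_content)
    return pos / (pos + neg)

agreeing = fuse(0.8, 0.7)      # both signals positive -> stronger belief
one_sided = fuse(0.5, 0.9)     # an uninformative 0.5 changes nothing
```

Two agreeing signals reinforce each other (the fused score exceeds either input), while an uninformative 0.5 score leaves the other estimate untouched.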

ThingiPano: A Large-Scale Dataset of 3D Printing Metadata, Images, and Panoramic Renderings for Exploring Design Reuse

ABSTRACT. The emergence of consumer-grade 3D printing has democratized innovation through online design-sharing platforms like Thingiverse. We introduce a novel multimodal dataset called ThingiPano, a large-scale collection containing multi-view 2D panoramic representations of over a million 3D files (n=1,816,295) with associated user-uploaded images (n=1,816,295), design metadata (n=1,017,687), and user metadata (n=283,873) from Thingiverse. In this paper, we exhibit how ThingiPano's metadata can facilitate a greater understanding of how 3D printing designs are fabricated, by whom, and for what purpose. We demonstrate that this novel multimodal dataset is sufficient for self-supervised machine learning methodologies. Such methodologies have the potential to facilitate broader reuse of 3D printable designs through improved multimodal classification and retrieval in applications ranging from online file-sharing platforms to design tools.

GeoSecure-R: Secure Computation of Geographical Distance using Region-anonymized GPS Data

ABSTRACT. Today, Location-Based Services (LBS) are used by millions of users all over the world. These services widely use GPS sensory data. LBS provide enormous convenience to users, but at the expense of continuously tracking their location, which raises severe privacy concerns. Distance computation from users' GPS coordinates is a crucial component of many LBS, such as fitness tracker apps and tools for understanding driving habits. In this paper, we propose a method to calculate the distance traveled without revealing the user's exact location. The proposed method ensures users' privacy through region anonymity while maintaining utility, i.e., it supports LBS that require only the traveled distance rather than the actual location of the user. Experimental results on Microsoft's GeoLife dataset show that the proposed method calculates distances very close to those obtained using the actual location of the user.
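The core utility/privacy split can be sketched as follows: the device computes only per-step great-circle distances locally and shares those, so a server can total the distance traveled without ever seeing absolute coordinates. This is a simplification; it does not implement the paper's region-anonymization scheme, and the track below is made up.

```python
import math

def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(a))

def client_step_distances(track):
    # Only per-step distances leave the device; absolute coords stay local.
    return [haversine_m(a, b) for a, b in zip(track, track[1:])]

def server_total(step_distances):
    # The server learns total distance, not where the user went.
    return sum(step_distances)

track = [(39.9075, 116.3972), (39.9080, 116.3972), (39.9080, 116.3980)]
total = server_total(client_step_distances(track))
```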

11:30-13:00 Session 3: Content Processing and Understanding I

Content Processing and Understanding I


ABSTRACT. Visual content understanding has seen significant advancement in recent years, with researchers having easy access to superior compute power and infrastructure to train deep learning models. Organizations are also deploying multiple cameras to collect more data and build robust solutions across domains. However, this has introduced a lot of duplicate data and redundancy, especially when images are captured in an uncontrolled environment. In this paper, we propose a solution to identify duplicate instances in a multi-camera system that captures images of the same instance at varied viewing angles and scales. The duplicate instances are identified at the pixel level using a combination of deep learning algorithms and computer vision geometry. We also devise a similarity index to quantify the extent of similarity between a pair of instances. Further, we present our solution for duplicate damage instance identification in vehicles, customized for the auto finance and auto insurance industries.

Focused Questions and Answer Generation by Key Content Selection

ABSTRACT. Automatic Question Generation (AQG), a part of Natural Language Processing, is an ongoing research trend. AQG is extremely helpful for Computer-Assisted Assessments, where it reduces the expense of manually constructing questions and satisfies the need for a constant supply of new ones. Exam-style questions generated by AQG are mostly “WH” (“What”, “Who”, and “Where”) or reading-comprehension type. For questions to be natural or human-like, they need to be diverse, i.e., semantically different based on their levels of assessment, while their answers might remain the same. Hence, generating diverse sequences as part of question generation has become an important NLP task, especially in the education and publishing industries. In this paper, we propose a method for automatically generating answers and diversified questions corresponding to those answers by introducing a new module called the “Focus Generator”. This module guides the decoder in an existing encoder-decoder model to generate questions based on selected focus contents. We use a keyword generation algorithm to produce answer tags and a pool of candidate foci, from which the three best foci are chosen according to the level of information they contain. We then use this focus content to generate questions that are semantically different from each other. Our work uses a simple architecture with a single “Focus Generator” module, and experimental results show that our module yields a 1.2% improvement in BLEU-4 score and 20% less training time than the current state-of-the-art model. Our model is also user-friendly and makes it easy to derive inferences.

Multivariate Adaptive Gaussian Mixture for Scene Level Anomaly Modeling

ABSTRACT. Scene changes that typically occur in a real-world setting degrade anomaly detection performance over the long run. Most existing methods ignore the challenge of temporal concept drift in video surveillance. In this paper, we propose an unsupervised end-to-end framework for adaptive scene-level anomaly detection. We utilize multivariate Gaussian mixtures for adaptive scene learning; the mixture represents the distribution of normal and abnormal events seen so far and adapts itself to slow scene changes. We introduce a Mahalanobis distance-based contribution factor to update the mixture parameters on the arrival of each new event. A detailed discussion and experiments are conducted to decide the optimum local as well as global temporal context. The existing public datasets for anomaly detection are of too short a duration (at most 1.5 hours) to evaluate adaptive approaches, so we also collected a dataset of 10 continuous hours. We achieved a promising performance of 85.14% AUC and 21.26% EER on this data.
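The abstract's distance-weighted update can be illustrated with a minimal rule (the exact contribution factor is the paper's; the shrinking-step rule below is an illustrative stand-in under a diagonal covariance): an event far from a mixture component, in Mahalanobis terms, should move that component less than a nearby event.

```python
import math

def mahalanobis_diag(x, mean, var):
    """Mahalanobis distance of x from a component with diagonal covariance."""
    return math.sqrt(sum((xi - mi) ** 2 / vi
                         for xi, mi, vi in zip(x, mean, var)))

def update_component(x, mean, var, base_lr=0.1):
    """Online mean update whose step size shrinks with the Mahalanobis
    distance, so outlying events perturb the component less."""
    d = mahalanobis_diag(x, mean, var)
    lr = base_lr / (1.0 + d)
    return [mi + lr * (xi - mi) for xi, mi in zip(x, mean)]

mean, var = [0.0, 0.0], [1.0, 1.0]
near = update_component([0.5, 0.0], mean, var)   # typical event
far = update_component([5.0, 0.0], mean, var)    # anomalous event
```

Relative to its gap from the mean, the nearby event pulls the component proportionally harder than the far one, which is what keeps anomalies from corrupting the scene model.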

An Empirical Study on Ensemble Learning of Multimodal Machine Translation

ABSTRACT. With the increasing availability of images, multimodal machine translation (MMT) has become a vibrant field. Model structure and the introduction of multimodal information are the current focal points for MMT researchers. Among existing models, the Transformer has reached state-of-the-art performance in many translation tasks. However, we observe that the performance of Transformer-based MMT is highly unstable, since the Transformer is sensitive to fluctuations in hyper-parameters, especially the number of layers, the dimensions of word embeddings and hidden states, and the number of attention heads. Moreover, different ways of introducing image information also have a significant influence on MMT performance. In this paper, we exploit integration strategies, dependent on the task, that make collaborative decisions on the final translation results to enhance the stability of Transformer-based MMT. Furthermore, we combine different ways of introducing image information to improve the semantic expression of the input. Extensive experiments on the Multi30K dataset demonstrate that ensemble learning in MMT, integrating text and image features, obtains more stable and better translation performance; the best result yields an improvement of 5.12 BLEU points over the strong Transformer baseline set in our experiments.
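One simple integration strategy of the kind the abstract alludes to is score-sum ensembling: each ensemble member scores its candidate translations, and the hypothesis with the highest total score wins. The models, hypotheses, and scores below are made up for illustration.

```python
from collections import defaultdict

def ensemble_select(model_outputs):
    """Pick the hypothesis with the highest total score across models.
    Each model contributes a list of (hypothesis, score) pairs."""
    totals = defaultdict(float)
    for outputs in model_outputs:
        for hyp, score in outputs:
            totals[hyp] += score
    return max(totals, key=totals.get)

# Three (hypothetical) MMT models scoring candidate translations.
models = [
    [("a cat on a mat", 0.6), ("a cat on the mat", 0.4)],
    [("a cat on the mat", 0.7), ("a cat on a mat", 0.3)],
    [("a cat on the mat", 0.5), ("cat on mat", 0.5)],
]
best = ensemble_select(models)
```

The ensemble smooths out the instability of any single member: a hypothesis only needs broad support, not the top score from every model.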

13:00-14:00 Lunch Break
14:00-15:30 Session 4A: Invited Papers I - Content Understanding

Invited Papers I - Content Understanding

Facial Expression Recognition Under Partial Occlusion from Virtual Reality Headsets based on Transfer Learning

ABSTRACT. Facial expressions of emotion are a major channel in our daily communications, and they have been the subject of intense research in recent years. To automatically infer facial expressions, convolutional neural network based approaches have become widely adopted due to their proven applicability to the Facial Expression Recognition (FER) task. Meanwhile, Virtual Reality (VR) has gained popularity as an immersive multimedia platform, where FER can provide enriched media experiences. However, recognizing facial expressions while wearing a head-mounted VR headset is a challenging task, as the upper half of the face is completely occluded. In this paper, we attempt to overcome these issues and focus on facial expression recognition in the presence of severe occlusion, where the user is wearing a head-mounted display in a VR setting. We propose a geometric model to simulate the occlusion caused by a Samsung Gear VR headset, which can be applied to existing FER datasets. We then adopt a transfer learning approach, starting from two pretrained networks, VGG and ResNet, and fine-tune them on the FER+ and RAF-DB datasets. Experimental results show that our approach achieves results comparable to existing methods while training on three modified benchmark datasets that adhere to realistic occlusion from wearing a commodity VR headset.
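The dataset-modification step can be approximated crudely: mask out the upper rows of each face image before training. The paper uses a geometric model of the Gear VR; a fixed row fraction, as below, is only a stand-in to show the mechanics.

```python
def simulate_hmd_occlusion(image, fraction=0.5, fill=0):
    """Return a copy of `image` (a list of pixel rows) with the top
    `fraction` of rows replaced by `fill`, mimicking a head-mounted
    display covering the upper face."""
    h = len(image)
    cut = int(h * fraction)
    return [[fill] * len(row) if r < cut else list(row)
            for r, row in enumerate(image)]

# Tiny 4x2 "image" for demonstration.
img = [[1, 2], [3, 4], [5, 6], [7, 8]]
occluded = simulate_hmd_occlusion(img)
```

Applying this transform to every training image lets a standard FER pipeline learn from the visible lower half only, which is the setting the paper evaluates.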

On the Inference of Soft Biometrics from Typing Patterns Collected in a Multi-device Environment

ABSTRACT. In this paper, we study the inference of gender, major/minor (computer science, non-computer science), typing style, age, and height from typing patterns collected from 117 individuals in a multi-device environment. The inference of the first three identifiers was treated as a classification task and the rest as regression tasks. For the classification tasks, we benchmark the performance of six classical machine learning (ML) and four deep learning (DL) classifiers; for the regression tasks, we evaluate three ML- and four DL-based regressors. The overall experiment consisted of two text-entry (free and fixed) and four device (Desktop, Tablet, Phone, and Combined) configurations. The best arrangements achieved accuracies of 96.15%, 93.02%, and 87.80% for typing style, gender, and major/minor, respectively, and mean absolute errors of 1.77 years and 2.65 inches for age and height, respectively. The results are promising considering the variety of application scenarios listed in this work.

Semantic Enhanced Sketch Based Image Retrieval with Incomplete Multimodal Query

ABSTRACT. Sketch Based Image Retrieval (SBIR) is a challenging problem, mainly due to the significant cross-domain gap between hand-drawn sketches and natural images. While extra semantic information (such as attribute details) can facilitate query-adaptive search, two challenges remain: (1) incomplete multimodal queries; and (2) a lack of sketch-image paired training data. To this end, many existing multimodal sketch retrieval frameworks utilize text-based label information to augment the limited sketch query with more semantic details. However, single word-level category information may not always reveal sufficient characteristics of object-specific fine-grained attributes. In this work, we propose a multimodal SBIR system that allows both a sketch and a text-level attribute description as the query. To bridge the cross-modal gaps among sketches, images, and text, given a semantic under consideration, two mode-specific semantic networks provide layer-wise regularizer parameters to dynamically incorporate the underlying semantic into the learned sketch feature representation, thereby transforming the initial generic sketch into a more comprehensive Semantic Enhanced Joint Embedding (SEJE). Moreover, multimodal paired samples may not always be available, during either training or testing. Therefore, instead of relying on strict one-to-one cross-modal correspondence, the learning of SEJE relies on capturing semantically relevant cross-modal correspondences between averaged mode-specific semantic features (image and text) and the sketch feature, which improves SEJE's generalization ability. Evaluation on two benchmark datasets, Sketchy and TU-Berlin, clearly validates the superiority of the proposed method over state-of-the-art methods. In fact, by allowing a user to add text attributes to make a complete multimodal query, the proposed method improves mAP scores by 5-7% on the challenging sketch-attribute composition test scenario.

Fine-grained Language Identification with Multilingual CapsNet Model

ABSTRACT. Due to a drastic improvement in the quality of internet services worldwide, there has been an explosion of content generation and consumption, leading to an increasingly diverse audience that wants to consume media outside its linguistic familiarity or preference. Hence, there is a growing need for real-time, fine-grained content analysis services, including language identification, content transcription, and analysis. Accurate, fine-grained spoken language detection is an essential first step for all subsequent content analysis. Current techniques for spoken language detection may lack final prediction accuracy, require large amounts of training data, or handle only a small number of languages. In this work, we present a real-time approach for detecting spoken language from noisy and readily available data sources using a Capsule Network (CapsNet) architecture. Further, we show that the CapsNet can effectively detect whether data samples belong to none of the languages on which the model was trained. We compare our results with a baseline based on a combination of recurrent networks and an attention mechanism.
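The "none of the trained languages" behavior can be mimicked at inference time with confidence thresholding: reject a sample when the model's top class probability is too low. This is only a generic open-set sketch, not the CapsNet's internal mechanism; the labels, logits, and threshold are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def predict_language(logits, labels, threshold=0.6):
    """Return the predicted language, or None when the top probability
    falls below the threshold (i.e., the sample looks like none of the
    trained languages)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best] if probs[best] >= threshold else None

labels = ["en", "hi", "ta"]
confident = predict_language([4.0, 0.5, 0.1], labels)   # clear winner
uncertain = predict_language([1.0, 0.9, 0.8], labels)   # nearly flat
```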

14:00-15:30 Session 4B: Invited papers II - Novel Applications

Invited papers II - Novel Applications

Vyaktitv: Hindi Multimodal Personality Assessment Dataset

ABSTRACT. Automatically detecting personality traits can aid several applications, such as mental health recognition and human resource management. Most techniques used for personality detection so far have analyzed these traits for each individual in isolation, yet personality is intimately linked to our social behavior. Furthermore, surprisingly little research has focused on personality analysis in low-resource languages. To this end, we present Vyaktitv, a novel peer-to-peer Hindi conversation dataset for personality detection. It consists of high-quality audio and video recordings of the participants, with Hinglish textual transcriptions for each conversation. The dataset also contains a rich set of socio-demographic features, such as income and cultural orientation, among several others, for all participants. We release the dataset for public use, along with some interesting and insightful analysis.

Touchless Typing Using Head Movement-based Gestures

ABSTRACT. In this paper, we propose a novel touchless typing interface that makes use of an on-screen QWERTY keyboard and a smartphone camera. The keyboard is divided into nine color-coded clusters; the user moves their head toward the cluster containing the letter they want to type, and a front-facing smartphone camera records the head movements. A bidirectional GRU-based model, using pre-trained embeddings rich in head-pose features, translates the recordings into cluster sequences. The model achieved accuracies of 96.78% and 86.81% under intra- and inter-user scenarios, respectively, over a dataset of 2234 video sequences collected from 22 users.
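For intuition, the nine-cluster layout amounts to a 3x3 grid indexed by head yaw (left/center/right) and pitch (up/center/down). The fixed angle-to-cluster rule below is a hypothetical baseline; the paper instead learns the sequence mapping with a bidirectional GRU.

```python
def pose_to_cluster(yaw, pitch, max_angle=30.0):
    """Map head yaw/pitch in degrees to one of nine keyboard clusters,
    numbered 0..8 row-major over a 3x3 grid (illustrative rule only)."""
    def bucket(angle):
        if angle < -max_angle / 3:
            return 0      # left / up
        if angle > max_angle / 3:
            return 2      # right / down
        return 1          # center
    return bucket(pitch) * 3 + bucket(yaw)

centre = pose_to_cluster(0, 0)        # looking straight ahead
top_left = pose_to_cluster(-20, -20)  # head turned left and up
```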


ABSTRACT. The lockdowns and travel restrictions of the current coronavirus pandemic have replaced face-to-face teaching and meetings with online alternatives. Recently, the video conferencing tool Zoom has become extremely popular for its simple-to-use features and low network bandwidth requirements. However, Zoom has serious security and privacy issues: due to weak authentication mechanisms, unauthorized persons invade Zoom sessions and create disturbances (known as Zoom bombing). In this paper, we propose preliminary work towards a seamless authentication mechanism for Zoom-based teaching and meetings. Our method is based on PRNU (Photo Response Non-Uniformity) camera authentication, which can authenticate the camera of a device used in a Zoom meeting without requiring any assistance from the participants (e.g., asking the participant to provide a biometric sample). Results from a small-scale experiment validate the proposed method.
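PRNU verification generally reduces to correlating a frame's sensor-noise residual against an enrolled camera fingerprint and thresholding. The 1-D signals and threshold below are toys (real PRNU operates on 2-D sensor noise extracted from many images), but they show the decision rule.

```python
import math

def ncc(a, b):
    """Normalized cross-correlation between two equal-length signals."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    da = math.sqrt(sum((x - ma) ** 2 for x in a))
    db = math.sqrt(sum((y - mb) ** 2 for y in b))
    return num / (da * db)

def same_camera(residual, fingerprint, threshold=0.5):
    """Accept the participant's device if the frame's noise residual
    correlates strongly with the enrolled PRNU fingerprint."""
    return ncc(residual, fingerprint) >= threshold

fp = [0.2, -0.1, 0.4, -0.3, 0.1]             # enrolled fingerprint (toy)
genuine = [x + 0.05 for x in fp]             # same pattern, slight offset
impostor = [-0.3, 0.2, -0.1, 0.4, -0.2]      # unrelated camera noise
```

Because the check runs on video frames the meeting already transmits, it needs no extra action from the participant, which is the "seamless" property the abstract emphasizes.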

Feature Extraction and Feature Selection for Emotion Recognition using Facial Expression

ABSTRACT. Facial expressions play a significant role in conveying a person's emotions. Due to its applicability to a wide range of applications, such as human-computer interaction and driver status monitoring, Facial Expression Recognition (FER) has received substantial attention among researchers. Earlier studies typically used a small feature set for extracting facial features in FER systems, and to date a systematic comparison of facial features does not exist. Therefore, in the current research, we identified 18 different facial features (cardinality of 46,352) by reviewing 25 studies and implemented them on the publicly available Extended Cohn-Kanade (CK+) dataset. After extracting the facial features, we performed Feature Selection (FS) using Joint Mutual Information (JMI), Conditional Mutual Information Maximization (CMIM), and Max-Relevance Min-Redundancy (MRMR), present a systematic comparison between them, and applied various machine learning techniques for classification. The Bag of Visual Words (BoVW) approach results in significantly higher classification accuracy than the formal approach. We also found that the optimal classification accuracy for FER can be obtained using only 20% of the identified features. Grey comatrix and Haralick features were explored for the first time for FER, and the grey comatrix feature outperformed several of the most commonly used features, such as Local Binary Pattern (LBP) and Active Appearance Model (AAM). Histogram of Gradients (HOG) turns out to be the most significant feature for FER, followed by Local Directional Positional Pattern (LDSP) and grey comatrix.
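All three selection criteria named above (JMI, CMIM, MRMR) build on mutual information between a feature and the class label. A minimal version of that building block, for discrete features, scores each candidate by I(feature; label) so the most informative ones can be ranked first; the toy features below are made up.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X;Y) in bits for two discrete sequences of equal length."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

labels  = [0, 0, 0, 1, 1, 1]
f_good  = [0, 0, 0, 1, 1, 1]   # perfectly predicts the label: 1 bit
f_noise = [0, 1, 0, 1, 0, 1]   # nearly independent of the label

mi_good = mutual_information(f_good, labels)
mi_noise = mutual_information(f_noise, labels)
```

JMI, CMIM, and MRMR differ in how they penalize redundancy between already-selected features, but each starts from this relevance term.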

15:30-16:00 Coffee Break
16:00-17:00 Session 5: Invited papers III - Deep Learning based Analysis

Invited papers III - Deep Learning based Analysis

Unravelling Small Sample Size Problems in the Deep Learning World

ABSTRACT. The growth and success of deep learning approaches can be attributed to two major factors: the availability of hardware resources and the availability of a large number of training samples. For problems with large training databases, deep learning models have achieved superlative performance. However, there are many small sample size (S^3) problems for which it is not feasible to collect large training databases. It has been observed that deep learning models do not generalize well on S^3 problems, and specialized solutions are required. In this paper, we first present a review of deep learning algorithms for small sample size problems, segregated according to the space in which they operate: input space, model space, and feature space. Secondly, we present a Dynamic Attention Pooling approach that focuses on extracting global information from the most discriminative sub-part of the feature map. The performance of the proposed dynamic attention pooling is analyzed with the state-of-the-art ResNet model on relatively small publicly available datasets such as SVHN, C10, C100, and TinyImageNet.
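To contrast with ordinary global average pooling, one simplified reading of "pool from the most discriminative sub-part" is to average only the strongest activations of the feature map. This top-k rule is an illustration of the idea, not the paper's exact dynamic attention pooling.

```python
def dynamic_attention_pool(feature_map, k=2):
    """Average only the k strongest activations of a 2-D feature map,
    i.e., pool from its most discriminative sub-part."""
    flat = sorted((v for row in feature_map for v in row), reverse=True)
    top = flat[:k]
    return sum(top) / len(top)

fmap = [[0.1, 0.9],
        [0.2, 0.7]]
pooled = dynamic_attention_pool(fmap, k=2)
global_avg = sum(v for row in fmap for v in row) / 4
```

With few training samples, weak background activations dominate a global average; restricting the pool to the salient region keeps the descriptor focused.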

Stratified Sampling Based Experience Replay for Efficient Camera Selection Decisions

ABSTRACT. Target tracking across a network of cameras has various applications in surveillance and forensics. These networks typically have cameras with non-overlapping fields of view, which necessitates target handovers involving approaches such as target re-identification that are robust to illumination and pose variations of the target. Re-identification-based target handovers are susceptible to false alarms, more so in a high-data-volume setting like a surveillance camera network. In this work, we learn when to make a re-identification query and to which camera in the network. We model this camera selection problem as a sequential decision-making problem and solve it with a Reinforcement Learning (RL) policy that selects one of the cameras for querying or decides not to query. Using a Deep Q-Network (DQN) type approach, we observed that existing experience replay (ER) methods are inadequate for learning an optimal policy when the actions are imbalanced: standard ER techniques result in policies biased towards the more frequent action, degrading the performance of the end task, in our case multi-camera tracking. To address this problem, we segregate the experiences from agent-environment interaction into multiple replay memories and sample independently from them to create a diverse minibatch. We demonstrate the performance of the proposed method on the NLPR MCT and DukeMTMC datasets, along with its computational benefits.
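The segregated-replay idea sketches naturally as one buffer per action, with the minibatch drawn equally from each buffer so rare actions are not drowned out. Buffer capacity and batch composition here are illustrative choices, not the paper's hyper-parameters.

```python
import random

class StratifiedReplay:
    """One replay memory per action; minibatches take an equal share
    from every non-empty buffer, balancing imbalanced action data."""

    def __init__(self, num_actions, capacity=1000):
        self.buffers = [[] for _ in range(num_actions)]
        self.capacity = capacity

    def add(self, state, action, reward, next_state):
        buf = self.buffers[action]
        if len(buf) >= self.capacity:
            buf.pop(0)                     # drop the oldest experience
        buf.append((state, action, reward, next_state))

    def sample(self, batch_size):
        per = batch_size // len(self.buffers)
        batch = []
        for buf in self.buffers:
            if buf:
                batch.extend(random.choices(buf, k=per))
        return batch

replay = StratifiedReplay(num_actions=2)
for i in range(90):
    replay.add(i, 0, 0.0, i + 1)           # frequent action (e.g., no query)
for i in range(10):
    replay.add(i, 1, 1.0, i + 1)           # rare action (e.g., query camera)

batch = replay.sample(20)
rare = sum(1 for t in batch if t[1] == 1)
```

Even though action 1 makes up only 10% of the collected experience, it fills half of every minibatch, which is what counteracts the bias a standard uniform replay would learn.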

Semi-Supervised Clustering with Neural Networks

ABSTRACT. Clustering using neural networks has recently demonstrated promising performance in machine learning and computer vision applications. However, the performance of current approaches is limited either by unsupervised learning or by their dependence on a large set of labeled data samples. In this paper, we propose ClusterNet, which uses pairwise semantic constraints from very few labeled data samples (<5% of the total data) and exploits abundant unlabeled data to drive the clustering. We define a new loss function that uses pairwise semantic similarity between objects combined with constrained k-means clustering to efficiently utilize both labeled and unlabeled data in the same framework. The proposed network uses a convolutional autoencoder to learn a latent representation that groups data into k specified clusters while also learning the cluster centers. We evaluate and compare the performance of ClusterNet against state-of-the-art deep clustering approaches on several datasets.

17:00-18:00 Session 6: Invited papers IV - Social Media

Invited papers IV - Social Media

Unsupervised Generative Adversarial Alignment Representation for Sheet music, Audio and Lyrics

ABSTRACT. Sheet music, audio, and lyrics are the three main modalities involved in writing a song. In this paper, we propose an unsupervised generative adversarial alignment representation (UGAAR) model to learn deep discriminative representations shared across these three modalities, in which a deep neural network architecture with three branches is jointly trained. In particular, the proposed model can transfer the strong relationship between audio and sheet music to audio-lyrics and sheet-lyrics pairs by learning the correlation in the latent shared subspace. We apply CCA components of audio and sheet music to establish a new ground truth. The generative (G) model learns the correlation of the two transferred pairs to generate a new audio-sheet pair for fixed lyrics to challenge the discriminative (D) model, which aims to distinguish whether its input comes from the generative model or from the ground truth. The two models are trained simultaneously in an adversarial way to enhance deep alignment representation learning. Our experimental results demonstrate the feasibility of the proposed UGAAR for alignment representation learning among sheet music, audio, and lyrics.

A Fairness-Aware Fusion Framework for Multimodal Cyberbullying Detection

ABSTRACT. Recent reports of bias in multimedia algorithms (e.g., lower accuracy of face detection for women and persons of color) have underscored the urgent need to devise approaches that work equally well for different demographic groups. Hence, we posit that ensuring fairness in multimodal cyberbullying detectors (e.g., equal performance irrespective of the gender of the victim) is an important research challenge. We propose a fairness-aware fusion framework that keeps both fairness and accuracy as important considerations when combining data coming from multiple modalities. In this Bayesian fusion framework, the inputs from different modalities are combined in a way that is cognizant of the different confidence levels associated with each feature and of the interdependencies between features. Specifically, the framework assigns weights to different modalities based not just on accuracy but also on their fairness. Results of applying the framework to a multimodal (visual + text) cyberbullying detection problem demonstrate its value in ensuring both accuracy and fairness.
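The weighting principle can be sketched with a convex accuracy-fairness trade-off per modality; the scores, the `alpha` knob, and the linear form are illustrative simplifications, not the paper's Bayesian formulation.

```python
def fairness_aware_weights(modalities, alpha=0.5):
    """Weight each modality by alpha * accuracy + (1 - alpha) * fairness
    (both scores in [0, 1]), then normalize the weights to sum to 1."""
    raw = {name: alpha * acc + (1 - alpha) * fair
           for name, (acc, fair) in modalities.items()}
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}

# Hypothetical scores: text is more accurate, visual is fairer.
mods = {"text": (0.90, 0.60), "visual": (0.80, 0.90)}
w = fairness_aware_weights(mods, alpha=0.5)
```

At alpha = 0.5 the fairer visual modality outweighs the more accurate text modality, showing how the fused detector can trade a little accuracy for demographic parity.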

SGG: Spinbot, Grammarly and GloVe based Fake News Detection

ABSTRACT. Recently, news consumption through online news portals has increased exponentially for several reasons, such as low cost and easy accessibility. However, such online platforms inadvertently also become channels for spreading false information across the web; they are frequently misused to disseminate misinformation and hoaxes. Such malpractices call for a robust automatic fake news detection system that can keep such misinformation and hoaxes at bay. We propose a robust yet simple fake news detection system that leverages tools for paraphrasing, grammar checking, and word embedding. In this paper, we explore the potential of these tools in jointly unearthing the authenticity of a news article. Notably, we leverage Spinbot (for paraphrasing), Grammarly (for grammar checking), and GloVe (for word embedding) for this purpose. Using these tools, we extract novel features that yield state-of-the-art results on the Fake News AMT dataset and comparable results on the Celebrity dataset when combined with some essential features. More importantly, the proposed method is empirically more robust than existing ones, as revealed by our cross-domain and multi-domain analyses.

Classification of Propagation Path and Tweets for Rumor Detection using Graphical Convolutional Networks and Transformer based Encodings

ABSTRACT. Social media platforms have become an integral part of our lives. We rely on them for entertainment, social interaction, and learning what’s happening around the world. One such network, Twitter, has become extensively popular worldwide, mainly for sharing news and important information. As it is rapidly being used to share news, it also suffers from a problem that implicitly comes with sharing news without fact-checking in place: the spread of misinformation and rumors. To tackle this challenge, we propose a novel method that leverages the structural and graphical properties of a tweet’s propagation together with the tweet’s text, which together tell us how news of a specific class spreads, the characteristics of the users involved in spreading it, and the linguistic cues of its source. Our methodology extracts user features (such as account verification status, user description, and follower count) by modeling each user as a node and creating a graphical network of users retweeting the source tweet. We extract the textual content of source tweets as RoBERTa vector representations, the current state of the art for text embedding. The model is trained on the benchmark Twitter15 and Twitter16 datasets, along with scraped data of retweeting users in their current state. The results are promising for rumor and fake news detection, and our method outperforms current state-of-the-art algorithms.