BIGMM2019: THE 5TH IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA BIG DATA
PROGRAM FOR THURSDAY, SEPTEMBER 12TH


09:30-10:00 Coffee Break
10:00-11:30 Session 7: Regular: Deep learning & analysis
Chair:
Baoning Niu (Taiyuan University of Technology, China)
10:00
Yucheng Xu (Shanghai Jiao Tong University, China)
Li Song (Shanghai Jiao Tong University, China)
Rong Xie (Shanghai Jiao Tong University, China)
Wenjun Zhang (Shanghai Jiao Tong University, China)
Deep Video Inverse Tone Mapping
PRESENTER: Li Song

ABSTRACT. Inverse tone mapping is an important topic in High Dynamic Range technology. In recent years, deep learning based image inverse tone mapping methods have been extensively studied and perform better than classical inverse tone mapping methods. However, these methods treat inverse tone mapping as a domain transformation problem directly from the LDR domain to the HDR domain and ignore the relationship between LDR and HDR. Besides, when these deep learning based methods are used to transform the frames of a video, the result suffers from temporal inconsistency and flickering. In this work, we propose a new way of formulating the inverse tone mapping problem and design a deep learning based video inverse tone mapping algorithm that reduces flickering. Different from previous methods, we first transform LDR sources back to approximate real scenes and use these real scenes to generate the HDR outputs. When generating HDR outputs, we use a 3D convolutional neural network to reduce flickering. We also apply methods to further constrain the luminance information and the color information of the HDR outputs separately. Finally, we compare our results with existing classical video inverse tone mapping algorithms and deep image inverse tone mapping methods to demonstrate the strong performance of our approach, and we also show the necessity of each part of our method.
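
As general background only: the abstract does not give implementation details, but the idea of applying 3D convolutions over a short window of frames to encourage temporal consistency can be sketched generically in PyTorch. The layer sizes and names below are illustrative assumptions, not the authors' network.

# Minimal sketch of a 3D-convolutional block operating on a short clip of
# LDR frames. Generic illustration only, not the paper's architecture;
# channel counts and clip length are arbitrary assumptions.
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    def __init__(self, in_ch=3, mid_ch=32):
        super().__init__()
        # each kernel spans 3 frames in time and 3x3 pixels spatially
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, mid_ch, kernel_size=(3, 3, 3), padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(mid_ch, in_ch, kernel_size=(3, 3, 3), padding=1),
        )

    def forward(self, clip):
        # clip: (batch, channels, frames, height, width)
        return self.net(clip)

clip = torch.rand(1, 3, 5, 64, 64)    # 5 consecutive LDR frames
out = TemporalBlock()(clip)           # same shape, temporally smoothed features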

10:22
Shengwei Zhao (Chinese Academy of Sciences, China)
Xindi Gao (Chinese Academy of Sciences, China)
Shikun Li (Chinese Academy of Sciences, China)
Shiming Ge (Chinese Academy of Sciences, China)
Low-Resolution Face Recognition in the Wild with Mixed-Domain Distillation
PRESENTER: Shiming Ge

ABSTRACT. Low-resolution face recognition in the wild is still an open problem. In this paper, we propose to address this problem via a novel learning approach called Mixed-Domain Distillation (MDD). The approach applies a teacher-student framework to mix and distill knowledge from four different domain datasets: private high-resolution, public high-resolution, public low-resolution web, and target low-resolution wild face datasets. In this way, high-resolution knowledge from the well-trained complex teacher model is first adapted to public high-resolution faces and then transferred to a simple student model. The student model is designed to identify low-resolution faces and can perform face recognition in the wild effectively and efficiently. Experimental results show that our proposed model outperforms several existing models for low-resolution face recognition in the wild.
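
As background, the teacher-student distillation mentioned in the abstract is commonly implemented with a temperature-softened divergence between teacher and student outputs plus a hard-label term. The sketch below shows that standard (Hinton-style) loss only; it is not the MDD objective, and the tensor shapes are made up.

# Generic knowledge-distillation loss; illustrative background only.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # soft targets from the teacher, softened by temperature T
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # hard-label supervision on the student
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s = torch.randn(8, 100)            # student logits for 8 faces (hypothetical)
t = torch.randn(8, 100)            # teacher logits
y = torch.randint(0, 100, (8,))    # identity labels
loss = distillation_loss(s, t, y)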

10:44
Bohui Xia (The University of Tokyo, Japan)
Xueting Wang (The University of Tokyo, Japan)
Toshihiko Yamasaki (The University of Tokyo, Japan)
Kiyoharu Aizawa (The University of Tokyo, Japan)
Hiroyuki Seshime (Septeni Ltd., Japan)
Deep Neural Network-Based Click-Through Rate Prediction Using Multimodal Features of Online Banners
PRESENTER: Bohui Xia

ABSTRACT. The online advertisement industry continues to grow and is expected to account for over 40% of global advertisement spending by 2020. Thus, predicting the click-through rates (CTRs) of various advertisements is increasingly crucial for companies. Many studies have addressed CTR prediction, but most tried to solve the problem using only metadata and excluded information such as advertisement images or texts. Using deep learning techniques, we propose a method to predict CTRs for online banners, a popular form of online advertisement, using all of these features. We show that the multimedia features of advertisements are useful for the task at hand. The proposed learning architecture outperforms a previous method that uses the three features mentioned above. We also present an attention-based model, which enables visualization of the contribution of each feature to the prediction. We analyze how each feature affects CTR prediction with visualization and ablation studies.

11:06
Wenjia Wang (Tongji University, China)
Text Detection with Cascade FPN and Channel-Wise Feature Selective

ABSTRACT. The detection and recognition of scene text in wild environments have attracted a great deal of attention because of the rapid development of Convolutional Neural Networks. However, most of the existing state-of-the-art algorithms have shortcomings. On the one hand, object detection based methods rely on quadrilateral bounding boxes, which cannot capture arbitrarily shaped text. On the other hand, segmentation based methods fail to pay attention to feature and channel interdependencies. The Progressive Scale Expansion Network (PSENet), which can precisely detect text instances with arbitrary shapes, is a useful baseline. Our method combines feature fusion and channel-wise information with an effective post-processing method, making it possible to make full use of feature and channel importance to detect arbitrarily shaped text instances. To accomplish this goal, we introduce a path aggregation module and a feature selective module. In the path aggregation module, we implement a cascade FPN structure and a shortcut procedure. In the feature selective module, we implement an SE-block. Extensive experiments on Total-Text, ICDAR 2015 and ICDAR 2017 MLT validate the effectiveness of our method. On ICDAR 2017, our best F-measure (75.3%) outperforms PSENet by 3.1%.
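
For readers unfamiliar with the SE-block named in the abstract, a minimal, textbook squeeze-and-excitation block for channel-wise re-weighting is sketched below. Channel sizes are illustrative assumptions, not the paper's exact module.

# Generic squeeze-and-excitation (SE) block; illustrative background only.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))            # squeeze: global average pooling
        w = self.fc(w).view(b, c, 1, 1)   # excitation: per-channel weights
        return x * w                      # re-scale the feature maps

feat = torch.rand(2, 256, 32, 32)
out = SEBlock(256)(feat)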

11:30-13:00 Session 8: Regular/Industry: Novel applications
Chair:
Seon Ho Kim (University of Southern California, United States)
11:30
Maitree Leekha (Delhi Technological University, India)
Mononito Goswami (Delhi Technological University, India)
Yifang Yin (National University of Singapore, Singapore)
Rajiv Ratn Shah (IIIT-Delhi, India)
Roger Zimmermann (National University of Singapore, Singapore)
[Regular] Are You Paying Attention? Detecting Distracted Driving in Real-Time
PRESENTER: Maitree Leekha

ABSTRACT. Each year, millions of people lose their lives to fatal road accidents. An ever-increasing proportion of these accidents is due to distracted driving caused by co-passengers and mobile devices. Thus, there is a growing need for a system that detects a distracted driver in real time and raises a warning alarm. To this end, we present a simple and robust architecture using foreground extraction and a Convolutional Neural Network (CNN). Our ConvNet model has significantly fewer parameters (0.5M) than state-of-the-art models. Detailed experimental evaluation on two publicly available datasets, the State Farm Distracted Driver Detection dataset (SFD3) and the AUC Distracted Driver dataset (AUCD2), confirms that our model either outperforms or is comparable to previously proposed models on both datasets, with test accuracies of 98.48% and 95.64%, respectively. Our experiments also suggest that incorporating additional features, such as the posture of the driver obtained through foreground extraction with GrabCut, substantially improves the performance of our model. Furthermore, our ConvNet is capable of detecting distractions in real time without any parallel processing.
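
The abstract mentions GrabCut-based foreground extraction as a preprocessing step before the ConvNet. A minimal OpenCV sketch of that generic step is shown below; the file name and initial rectangle are placeholder assumptions, and this is not the authors' exact pipeline.

# Minimal GrabCut foreground-extraction sketch with OpenCV.
# "frame.jpg" and the initial rectangle are placeholder assumptions.
import cv2
import numpy as np

img = cv2.imread("frame.jpg")
mask = np.zeros(img.shape[:2], np.uint8)
bgd_model = np.zeros((1, 65), np.float64)
fgd_model = np.zeros((1, 65), np.float64)
rect = (10, 10, img.shape[1] - 20, img.shape[0] - 20)  # rough driver region

cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)

# keep definite and probable foreground pixels, zero out the background
fg_mask = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype("uint8")
foreground = img * fg_mask[:, :, None]
cv2.imwrite("foreground.jpg", foreground)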

11:52
Wu Luo (Shanghai Jiao Tong University, China)
Jintao Wu (Shanghai Jiao Tong University, China)
Weiwei Liu (Shanghai Jiao Tong University, China)
Chongyang Zhang (Shanghai Jiao Tong University, China)
Weiyao Lin (Shanghai Jiao Tong University, China)
[Regular] Improving Action Recognition with Valued Patches Exploiting
PRESENTER: Wu Luo

ABSTRACT. Recent human action recognition methods mainly model a two-stream or part-based multi-stream deep learning network, with which human spatiotemporal features can be exploited and utilized effectively. However, because they ignore interactive or related scene information, most of these methods cannot achieve impressive performance. In this paper we propose a novel multi-stream feature fusion framework based on discriminative scene patches and motion patches. Unlike existing two-stream frameworks and part-based or attention-based multi-stream methods, our work improves recognition accuracy by: 1) paying more attention to exploiting discriminative scene patches and motion patches, and adopting more effective feature aggregation; 2) proposing a novel 2D+3D multi-stream feature aggregation mechanism, in which 2D ConvNet features from RGB images and 3D ConvNet features of valued patches are combined to enhance the spatiotemporal feature representation ability. Our framework is evaluated on three widely used video action benchmarks, where it outperforms other state-of-the-art recognition approaches by a significant margin: accuracy reaches 85.7% on JHMDB, 87.7% on HMDB51, and 98.6% on UCF101.

12:14
Ranjodh Singh (Humonics Global Pvt. Ltd., India)
Mohit Sharma (MIDAS, Indraprastha Institute of Information Technology, Delhi, India)
Hemant Yadav (MIDAS, Indraprastha Institute of Information Technology, Delhi, India)
Sandeep Gosain (Humonics Global Pvt. Ltd., India)
Rajiv Ratn Shah (MIDAS, Indraprastha Institute of Information Technology, Delhi, India)
[Industry] Automatic Speech Recognition for Real Time Systems
PRESENTER: Ranjodh Singh

ABSTRACT. Automatic Speech Recognition (ASR) systems have proven to be a useful tool for performing various day-to-day operations when used along with systems like personal AI assistants. Various industries require ASR to be trained on their domains. Music on demand (MoD) over IVR is one such domain, where the user interacts with the dialogue system to play music using voice commands only. Domain adaptation of the model is expected to perform well here, as systems trained on public datasets are very generic in nature and not domain specific. To train the ASR for MoD, we experiment with the classical HMM-based approach and DeepSpeech2 on the Voxforge dataset. We then fine-tune the DeepSpeech2 model on MoD data. With very limited data and little fine-tuning of the model, we were able to achieve a word error rate of 18.28.

12:36
Ranjodh Singh (Humonics Global Pvt. Ltd., India)
Meghna P Ayyar (MIDAS, Indraprastha Institute of Information Technology, Delhi, India)
Tata Sri Pavan (Humonics Global Pvt. Ltd., India)
Sandeep Gosain (Humonics Global Pvt. Ltd., India)
Rajiv Ratn Shah (MIDAS, Indraprastha Institute of Information Technology, Delhi, India)
[Industry] Automating Car Insurance Claims Using Deep Learning Techniques
PRESENTER: Ranjodh Singh

ABSTRACT. With the number of people driving cars increasing every day, there has been a proliferation in the number of car insurance claims being registered. The life cycle of registering, processing and deciding each claim involves manual examination by a service engineer, who creates the damage report, followed by a physical inspection by a surveyor from the insurance company, which makes it a long drawn-out process. We propose an end-to-end system to automate this process, which would be beneficial for both the company and the customer. This system takes images of the damaged car as input, identifies the damaged parts, and provides an estimate of the extent of damage (no damage, mild or severe) to each part. This serves as a cue for estimating the cost of repair, which would be used in deciding the insurance claim amount. We have experimented with popular instance segmentation models such as Mask R-CNN, PANet and an ensemble of the two, along with a transfer learning based VGG16 network, to perform the different tasks of localizing and detecting the various classes of parts and damages found in a car. Additionally, the proposed system achieves good mAP scores for parts localization and damage localization (0.38 and 0.40, respectively).

13:00-14:00 Lunch Break
14:00-16:00 Session 9A: Short: Content analysis
Chair:
Chiranjoy Chattopadhyay (Indian Institute of Technology Jodhpur, Rajasthan, India)
14:00
Tianwei Chen (Graduate School of Informatics, Kyoto University, Kyoto, Japan)
Qiang Ma (Graduate School of Informatics, Kyoto University, Kyoto, Japan)
Discriminative Object Discovery Towards Personalized Sightseeing Spot Recommendation
PRESENTER: Tianwei Chen

ABSTRACT. Discovering discriminative objects that represent local characteristics is an important task for sightseeing spot discovery, assessment and personalized recommendation. Recently, some patch-based methods have been proposed to discover discriminative objects from geo-social images on SNS. These conventional methods discover discriminative objects by comparing two or more cities in the early steps and require re-running the whole process when the compared cities change. This makes them inflexible for personalized sightseeing recommendation, in which different users may have different backgrounds and the comparison targets differ. In this paper, we propose a flexible framework for discriminative object discovery towards personalized sightseeing recommendation. Our framework detects common patches/objects in advance and leaves the comparison process of discovering discriminative objects to the later steps. The experimental results on real-world datasets show that our framework is more efficient than previous works, while still maintaining good performance in discovering discriminative objects.

14:17
Baani Leen Kaur Jolly (IIIT Delhi, India)
Palash Aggrawal (IIIT Delhi, India)
Surabhi S. Nath (IIIT Delhi, India)
Viresh Gupta (IIIT Delhi, India)
Manraj Singh Grover (IIIT Delhi, India)
Rajiv Ratn Shah (IIIT Delhi, India)
Generating Universal Embeddings from EEG Signals for Learning Diverse Intelligent Tasks
PRESENTER: Rajiv Ratn Shah

ABSTRACT. Brain Computer Interfaces (BCI) are being increasingly exploited for creating direct communication between the human brain and an external agent. Electroencephalography (EEG) is one of the most commonly used signal acquisition techniques due to its non-invasive nature and high temporal resolution. One of the major challenges in BCI studies is the individualistic analysis required for each task. Thus, task-specific feature extraction and classification are performed, which fail to generalize to other tasks with similar time-series EEG input data. To this end, we design a GRU-based universal deep learning encoding architecture to extract meaningful features from publicly available datasets for five diverse EEG-based classification tasks, namely Emotion Detection, Digit Recognition, Object Recognition, Task Identification and Error Detection. The network can generate task- and format-independent embeddings and outperforms the state-of-the-art EEGNet architecture on most experiments. Such a representation can enable efficient analysis across multiple datasets and eliminate the need to manually extract handcrafted features for every task. We also compare our results with CNN-based and autoencoder networks, in turn performing local, spatial, temporal and unsupervised analyses of the data.
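
The abstract describes a GRU-based encoder that maps EEG time series to task-independent embeddings. A minimal generic sketch of such an encoder is given below; the dimensions and names are illustrative assumptions, not the paper's configuration.

# Generic GRU encoder producing a fixed-length embedding from an EEG
# time series; an illustrative sketch, not the paper's architecture.
import torch
import torch.nn as nn

class EEGEncoder(nn.Module):
    def __init__(self, n_channels=32, hidden=128, emb_dim=64):
        super().__init__()
        self.gru = nn.GRU(n_channels, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, x):
        # x: (batch, time_steps, eeg_channels)
        _, h_n = self.gru(x)          # h_n: (1, batch, hidden)
        return self.proj(h_n[-1])     # (batch, emb_dim) embedding

signal = torch.rand(4, 512, 32)       # 4 trials, 512 samples, 32 electrodes
embedding = EEGEncoder()(signal)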

14:34
Ashima Yadav (Delhi Technological University, Delhi, India)
Ayush Agarwal (Delhi Technological University, Delhi, India)
Dinesh Kumar Vishwakarma (Delhi Technological University, Delhi, India)
XRA-Net Framework for Visual Sentiments Analysis
PRESENTER: Ashima Yadav

ABSTRACT. The exponential growth of social media has motivated people to express themselves in various forms. Visual media is one of the most effective and popular ways of conveying sentiments or opinions on the web, as people keep uploading millions of photos to popular social networking sites. Hence, visual sentiment analysis is instrumental in gauging the broader public consensus behind a specific topic or issue. This work proposes a deep learning based architecture, XRA-Net (Xception Residual Attention based Network), for visual sentiment analysis. The performance of the XRA-Net architecture is evaluated on the publicly available real-world Twitter I dataset, which is composed of three subsets: 3-agree, 4-agree, and 5-agree. The accuracies achieved on these subsets are 79.2%, 81.2%, and 86.4%, respectively, showing that the proposed architecture outperforms state-of-the-art results on all three subsets of the Twitter I dataset, as it can focus on the most informative features in an input image, which boosts the visual sentiment analysis process.

14:51
Chhavi Dhiman (Delhi Technological University, Delhi, India)
Manan Saxena (Delhi Technological University, Delhi, India)
Dinesh Kumar Vishwakarma (Delhi Technological University, Delhi, India)
Skeleton-Based View Invariant Deep Features for Human Activity Recognition
PRESENTER: Chhavi Dhiman

ABSTRACT. Recently, skeleton-based human activity recognition has been receiving significant attention due to its efficient localization of human pose in 3D space. This paper introduces a novel view-invariant human action recognition framework that uses skeleton joint coordinate based features and RGB Dynamic Images (DIs). Skeleton coordinate based features help to localize the human pose and its variations in orientation during the performed action, whereas DIs encode the temporal dynamics of the action using the concept of average rank pooling. This hybrid representation of human features is projected into a higher-dimensional space by applying transfer learning on the InceptionV3 architecture, which is fine-tuned for two multi-view human action datasets. In addition, the final prediction of the action class is made by applying late fusion to the skeleton-based features and the DI features. The performance of the proposed scheme is tested on two multi-view datasets, NUCLA and NTU RGB+D.

15:08
Dhruva Sahrawat (IIITD, India)
Mohit Agarwal (IIITD, India)
Sanchit Sinha (IIITD, India)
Aditya Adhikary (IIITD, India)
Mansi Agarwal (DTU, India)
Rajiv Ratn Shah (IIITD, India)
Roger Zimmermann (NUS, Singapore)
Video Summarization Using Global Attention with Memory Network and LSTM
PRESENTER: Dhruva Sahrawat

ABSTRACT. Videos are one of the most engaging and interesting mediums for effective information delivery. A typical video conveys far more information than text or images, which is directly reflected by the fact that videos constitute the majority of content generated online today. As a result, it has become increasingly important to develop better video management techniques, video summarization being one of them. As human attention spans shrink, it is imperative to shorten videos while retaining most of their information. For example, people are more inclined to watch highlights of a match rather than a full replay. The premier challenge is that summaries that may be intuitive to a human are difficult for machines to generalize. This is due to the presence of a variety of different actions spread across the frames of a video, which the machine cannot adequately tie together. To that end, we present a simple approach to video summarization using Kernel Temporal Segmentation (KTS) for shot segmentation and a global attention based modified memory network module with LSTM for shot score learning. KTS is fast and can be adeptly used to generate appropriate shot boundaries. The modified memory network, termed the Global Attention Memory Module (GAMM), increases the learning capability of the model, and with the addition of LSTM it is able to learn better contextual features. Experiments on the benchmark datasets TVSum and SumMe show that our method outperforms the current state of the art by about 15%.

15:25
Tej Singh (Delhi Technological University, Delhi, India)
Shivam Rustagi (Delhi Technological University, Delhi, India)
Aakash Garg (Delhi Technological University, Delhi, India)
Dinesh Kumar Vishwakarma (Delhi Technological University, Delhi, India)
Deep Learning Framework for Single and Dyadic Human Activity Recognition
PRESENTER: Tej Singh

ABSTRACT. Human activity recognition in videos attracts much attention in the computer vision community because of its broad real-life applications. In this context, we introduce a robust, low-complexity two-stream deep learning model that utilizes only raw RGB color sequences and their dynamic motion images (DMIs) to recognize complex human activities. In this work we train our model with the Inception deep learning architecture, and the extracted feature vector is fed to a Bi-Directional Long Short-Term Memory (Bi-LSTM) network to estimate temporal information from the input video sequences. Further, the extracted LSTM features from the RGB frames and the DMIs are fused to obtain a discriminative activity representation. The proposed approach has been evaluated on the single-person activity dataset MIVIA Action and the dyadic interaction dataset SBU Kinect. Our model obtains improved performance over existing similar approaches.

14:00-16:00 Session 9B: Short: Applications
Chair:
Ralf Steinmetz (Techn. Universitaet Darmstadt, Germany)
14:00
Shengzhou Yi (The University of Tokyo, Japan)
Xueting Wang (The University of Tokyo, Japan)
Toshihiko Yamasaki (The University of Tokyo, Japan)
Impression Prediction of Oral Presentation Using LSTM and Dot-Product Attention Mechanism
PRESENTER: Shengzhou Yi

ABSTRACT. For automatically evaluating oral presentations, we propose an end-to-end system to predict the audience's impressions of speech videos. Our framework is a multimodal neural network that includes two Long Short-Term Memory networks and a dot-product attention mechanism to learn linguistic features and acoustic features, respectively. An attention network is also used to consider the correlation between different feature representations for feature fusion. We utilize 2,445 videos with official captions and users' ratings from TED Talks. The experimental results show the good performance of our proposed system, which can recognize 14 types of audience impressions with an average accuracy of 85.0%. Our proposal not only noticeably improves the accuracy of predicting audience impressions, but also significantly decreases model complexity compared to the existing method.
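
Since the system relies on dot-product attention for relating feature representations, a minimal generic scaled dot-product attention function is sketched below as background. This is the standard formulation, not the authors' code, and the feature dimensions are hypothetical.

# Standard scaled dot-product attention; generic background only.
import math
import torch

def dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, dim)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    weights = torch.softmax(scores, dim=-1)     # attention distribution
    return torch.matmul(weights, v), weights

q = torch.rand(1, 10, 64)        # e.g., linguistic features (hypothetical)
k = v = torch.rand(1, 20, 64)    # e.g., acoustic features (hypothetical)
context, attn = dot_product_attention(q, k, v)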

14:17
Xiaoyan Wang (Beijing Key Laboratory of Digital Media, School of Computer Science and Engineering, Beihang University, China)
Hai-Miao Hu (Beijing Key Laboratory of Digital Media, School of Computer Science and Engineering, Beihang University, China)
Yugui Zhang (Beijing Key Laboratory of Digital Media, School of Computer Science and Engineering, Beihang University, China)
Pedestrian Detection Based on Spatial Attention Module for Outdoor Video Surveillance
PRESENTER: Hai-Miao Hu

ABSTRACT. Pedestrian detection remains challenging because of hard instances such as illumination changes, various occlusions, and unusual appearances. Current methods for detecting these hard examples depend on complicated manual designs or additional annotations. We observe that the spatial information of pedestrians can be obtained through motion information, which motivates us to utilize this spatial information to guide effective training of detectors. In this paper, we introduce the Spatial Attention Module, which guides Convolutional Neural Networks (CNNs) to focus on potential pedestrian positions indicated by hierarchical unsupervised guidance information, including motion information and static information. The experimental results on two datasets demonstrate that the proposed method outperforms the state of the art and can capture hard examples that are missed by the baseline.

14:34
Yi-Ning Chen (National Taiwan Normal University, Taiwan)
Mei-Chen Yeh (National Taiwan Normal University, Taiwan)
Photo Filter Recommendation Through Analyzing Objects, Scenes and Aesthetics
PRESENTER: Mei-Chen Yeh

ABSTRACT. Photo filters are widespread---they give photos a stylized look without requiring the user to have professional knowledge of image processing. However, with the increasing number of photo filters and limited display size of mobile devices, selecting a proper filter for a given photo can be a tedious task for camera phone users. In this paper, we present a photo filter recommendation approach to address this problem. In particular, we rely on state-of-the-art deep learning approaches to extract objects, scenes, and image aesthetics reliably and represent them using deep features as the observations of our recommendation model. Furthermore, we collected 68,400 filtered photos from Instagram to learn the relationships among objects, scenes, aesthetics, and filter types. Experimental results using the FACD benchmark dataset demonstrated the state-of-the-art recommendation performance of the proposed approach; these results showed that objects, scenes, and aesthetic attributes influence filter preference.

14:51
Kajal Kansal (IIIT-Delhi, India)
A.V. Subramanyam (IIIT-Delhi, India)
Autoencoder Ensemble for Person Re-Identification
PRESENTER: Kajal Kansal

ABSTRACT. Person re-identification (re-id) aims to match people across non-overlapping multi-camera networks. Recently, with the advancement of deep learning techniques, the performance of re-id has improved swiftly. However, most existing re-id methods need a large number of samples for training, due to which the models do not generalize well on smaller datasets and suffer from the small sample size problem. Additionally, they focus on single-scale appearance information while ignoring the rich information that can be exploited from other scales. In this paper, we propose a simple yet effective autoencoder comprising an encoder and a sequential decoder. The goal of the network is twofold. First, the network learns features by introducing a generative task at the embedding layer, making the features more generalizable to unknown test data and preventing overfitting. Second, the encoder feature embedding is used as input to the decoder to reconstruct the input image at various scales, achieving robustness against scale variations. The effectiveness of our proposed method is validated on three public person re-identification datasets: Market-1501, DukeMTMC-reID and CUHK03.

15:08
Naoto Inoue (The University of Tokyo, Japan)
Toshihiko Yamasaki (The University of Tokyo, Japan)
Fast Instance Segmentation for Line Drawing Vectorization
PRESENTER: Naoto Inoue

ABSTRACT. In this paper, we present a fast raster-to-vector method for line art images based on a convolutional neural network (CNN). State-of-the-art approaches for vectorization are very slow because they mostly consist of multiple steps, including iterative optimization during inference. In contrast, our model is based on a simple CNN extending a proposal-based instance segmentation algorithm. Therefore, it is very fast and end-to-end trainable. We experimentally show that our model is about 100 to 1,000 times faster than previous vectorization methods without sacrificing much accuracy.

15:25
Xian-Hua Han (Yamaguchi University, Japan)
Yen-Wei Chen (Ritsumeikan University, Japan)
Deep Residual Network of Spectral and Spatial Fusion for Hyperspectral Image Super Resolution
PRESENTER: Xian-Hua Han

ABSTRACT. Fusing a low-resolution hyperspectral image with the corresponding high-resolution multispectral image to obtain a high-resolution hyperspectral image is an important research topic for capturing comprehensive scene information in both the spatial and spectral domains. Existing approaches estimate the high-resolution hyperspectral image by minimizing the reconstruction errors of the available low-resolution hyperspectral and high-resolution multispectral images under various prior constraints, such as representation sparsity, spectral physical properties, and spatial smoothness. The performance of the recovered hyperspectral image greatly depends on these predefined constraints and still has large room for improvement. Recently, deep convolutional neural networks (DCNNs) have been applied to resolution enhancement of natural images and have been proven to achieve promising performance. However, the DCNN architectures used for natural images take only a single low-resolution image as input and cannot be applied to fusing the two available low-resolution hyperspectral and high-resolution multispectral images for hyperspectral image super-resolution. This study proposes a novel deep residual network of spatial and spectral fusion to merge the two available images for hyperspectral image super-resolution. Experimental results on benchmark datasets validate that the proposed residual CNN for hyperspectral super-resolution outperforms state-of-the-art methods in both quantitative values and visual effect.

14:00-16:00 Session 9C: Short: Media processing
Chair:
Yifang Yin (National University of Singapore, Singapore)
14:00
A. A. Patwardhan (IIT Hyderabad, India)
S. Das (IIT Hyderabad, India)
S. Varshney (IIT Hyderabad, India)
M. S. Desarkar (IIT Hyderabad, India)
D. P. Dogra (IIT Bhubaneswar, India)
ViTag: Automatic Video Tagging Using Segmentation and Conceptual Inference
PRESENTER: D. P. Dogra

ABSTRACT. The massive increase in multimedia data has created a need for effective organization strategies, which have a significant impact on latency as well as accuracy in searching. A multimedia collection needs to be organized considering multiple attributes such as domain, index terms, content description, owners, etc. Typically, the index term is a prominent attribute for effective video retrieval systems. Given a large multimedia collection, it is equally important to obtain such attributes efficiently. In this paper, we present a unique approach to the problem of automatic video tagging, referred to as ViTag. Our approach comprises analyzing the input video to obtain representative frames. The analysis relies upon various image similarity metrics to automatically extract such key-frames. For each key-frame, raw tags are generated by performing reverse image tagging. The final step analyzes the raw tags in order to discover hidden semantic information. We show that such semantic similarity information can effectively be used to infer generic tags. We extensively evaluate the approach on a dataset of 103 videos belonging to 13 popular domains derived from various YouTube categories. We are able to generate tags with 65.51% accuracy. We also rank the generated tags based upon the number of proper nouns present in them. Such ranking allows us to measure the effectiveness of the approach using the Reciprocal Rank. The geometric mean of the Reciprocal Rank estimated over the entire collection is 0.873.

14:17
Hao Xie (School of Software and Microelectronics of Peking University, China)
Guoqing Xiang (School of Electronics Engineering and Computer Science of Peking University, China)
Hong Yu (School of Software and Microelectronics of Peking University, China)
Wei Yan (School of Software and Microelectronics of Peking University, China)
Perceptual Fast CU Size Decision Algorithm for AVS2 Intra Coding
PRESENTER: Hao Xie

ABSTRACT. AVS2 is a video coding standard proposed by China. It adopts a flexible partition structure to improve coding performance, but the recursive quad-tree split coding unit (CU) structure brings a significant increase in coding complexity. To reduce the computational complexity of intra coding, quite a number of fast CU size decision algorithms have been studied. However, traditional algorithms have focused on reducing encoding time while maintaining objective quality, without perceptual considerations. To obtain more time savings by taking subjective performance into account, this paper proposes a perceptual CU size decision algorithm based on spatial-temporal neighboring information and internal perceptual texture information. Experimental results show that our approach can achieve a 28.06% reduction in time complexity on average under the all-intra testing configuration on RD17.0, while the BD-Rate loss is negligible under the SSIM metric.

14:34
Trong-Dat Phan (University of Science, VNU-HCMC, Viet Nam)
Minh-Son Dao (National Institute of Information and Communications Technology, Japan)
Koji Zettsu (NICT, Japan)
An Interactive Watershed-Based Approach for Lifelog Moment Retrieval
PRESENTER: Trong-Dat Phan

ABSTRACT. Recently, the terms "lifelogging" and "lifelog" have come to represent, respectively, the activity of continuously recording people's everyday experiences and the dataset containing these recorded experiences. Hence, providing an excellent tool to retrieve life moments from lifelogs, bringing a memory back to a person quickly and accurately when required, has become a challenging but exciting task for researchers. In this paper, a new method to meet this challenge is introduced, utilizing the hypothesis that a sequence of images taken during a specific period can share the same context and content. This hypothesis can be explained another way: if one image satisfies a given query (i.e., a seed), then a certain number of its spatiotemporal neighbors probably share the same content and context (i.e., a watershed). Hence, an interactive watershed-based approach is applied to build the proposed method, which is evaluated on the imageCLEFlifelog 2019 dataset and compared to the other participants in this event. The experimental results confirm the high productivity of the proposed method in terms of both stability and accuracy, as well as the advantage of having an interactive schema to improve accuracy when there is a conflict between a query and how to interpret it.

14:51
Donglin Di (Harbin Institute of Technology, China)
Xindi Shang (National University of Singapore, Singapore)
Weinan Zhang (Harbin Institute of Technology, China)
Xun Yang (National University of Singapore, Singapore)
Tat-Seng Chua (National University of Singapore, Singapore)
Multiple Hypothesis Video Relation Detection
PRESENTER: Donglin Di

ABSTRACT. Video relations in the form of relation triplets {subject, predicate, object} play a vital role in video content understanding. Existing works on video relation detection are limited in associating short-term relations into long-term relations throughout the video, because short-term proposals can be inaccurate or missing. In this work, we propose a novel approach called Multi-Hypothesis Relational Association (MHRA), which generates multiple hypotheses for video relation instances in order to be robust to these problems with short-term proposals. Experiments on the benchmark dataset show that MHRA outperforms state-of-the-art methods.

15:08
Anh-Khoa Vo (Natural Science University - Vietnam National University in HCM City, Viet Nam)
Minh-Son Dao (National Institute of Information and Communications Technology, Japan)
Koji Zettsu (National Institute of Information and Communications Technology, Japan)
From Chaos to Order: the Role of Content and Context of Daily Activities in Rearranging Lifelogs Data
PRESENTER: Anh-Khoa Vo

ABSTRACT. A lifelog is a set of heterogeneous data recorded daily from various types of wearable sensors from the first-person perspective. Nevertheless, getting insights from lifelogs is still a journey into a mystical land where many obstacles need to be overcome. One of the exciting obstacles is to augment human memory when a chaos of memory snapshots must be rearranged chronologically to give a whole picture of a user's life moment. In this paper, a new method to tackle this challenge by leveraging the content and context of daily activities is introduced. The backbone idea of this method is that, under the same context (e.g., a particular activity), one image should have at least one associative image (e.g., same content, same concepts) taken at a different moment. Thus, a given set of images can be rearranged chronologically by ordering their associative images, whose order is known precisely. The proposed method is evaluated on the imageCLEFlifelog 2019 dataset and compared to the other participants in this event. The experimental results confirm the high productivity of the proposed method in terms of both stability and accuracy.

15:25
Yongqing Sun (NTT Media Intelligence Labs, Japan)
Pranav Shenoy K P (Georgia Institute of Technology, United States)
Jun Shimamura (NTT Media Intelligence Labs, Japan)
Atsushi Sagata (NTT Media Intelligence Labs, Japan)
Concatenated Feature Pyramid Network for Instance Segmentation
PRESENTER: Yongqing Sun

ABSTRACT. Low-level features like edges and textures play an important role in accurately localizing instances in neural networks. In this paper, we propose an architecture that improves the feature pyramid networks commonly used in instance segmentation networks by incorporating low-level features in all layers of the pyramid in an optimal and efficient way. Specifically, we introduce a new layer that holistically learns new correlations from the feature maps of multiple feature pyramid levels and enhances the semantic information of the feature pyramid to improve accuracy. Our architecture is simple to implement in instance segmentation or object detection frameworks to boost accuracy. Using this method in Mask R-CNN, our model achieves consistent improvements in precision on the COCO dataset, with some computational overhead compared to the original feature pyramid network.

16:00-16:30 Coffee Break
16:30-17:30 Session 10B: Demos
Chair:
D. P. Dogra (IIT Bhubaneswar, India)
16:30
Jiong Huang (Grab, China)
Sheng Hu (Grab, China)
Yun Wang (Grab, China)
Chunhong Zhao (Grab, China)
Guanfeng Wang (Grab, Singapore)
Xudong He (Grab, China)
Xiaocheng Huang (Grab, Singapore)
Shaolin Zheng (Grab, China)
Tom Galloway (Grab, United States)
Yifang Yin (National University of Singapore, Singapore)
Roger Zimmermann (National University of Singapore, Singapore)
GrabView: a Scalable Street View System for Images Taken from Different Devices
PRESENTER: Yun Wang

ABSTRACT. In the last decade, many researchers and applications have focused on the geo-spatial and multimedia content of street view systems. However, most street view systems have high data collection costs and out-of-date content in certain regions. An efficient and effective street view system reflecting real-world multimedia content on a weekly basis has been an elusive goal.

We present GrabView, a street view system that: a) uses a model of data capture during ride-sharing service trips; b) collects up-to-date geo-referenced multimedia content at low cost; c) processes both the multimedia and geo-sensor content; and d) serves a navigation and browsing user experience equivalent to that from dedicated mapping vehicles.

Grab ride hailing service vehicles collect most of GrabView's geo-sensor data and multimedia content. This results in much better up-to-the-minute road network coverage at no extra cost.

16:30
Satoshi Yoshida (NEC Corporation, Japan)
Jianquan Liu (NEC Corporation, Japan)
Shoji Nishimura (NEC Corporation, Japan)
A Robust People Tracking Method in Multiple Cameras
PRESENTER: Satoshi Yoshida

ABSTRACT. This paper proposes a robust people tracking method for videos obtained from multiple cameras. Yoshida et al. proposed a method that combines re-identification and video tracking technologies to track people across multiple cameras. However, this method is not robust enough because it uses only one type of feature: it fails to track people when that type of feature is not available, or when the feature's characteristics change easily with time or place. In this paper we propose a method that enhances robustness by utilizing multiple types of features, so that people can be tracked even if some types of features are unavailable. Experimental results showed that, on average, 76% of images of the same person were correctly identified.

16:30
Jieliang Ang (National University of Singapore, Singapore)
Tianyuan Fu (National University of Singapore, Singapore)
Johns Paul (National University of Singapore, Singapore)
Shuhao Zhang (National University of Singapore, Singapore)
Bingsheng He (National University of Singapore, Singapore)
Teddy Sison David Wenceslao (Grab Taxi, Singapore)
Sienyi Tan (Grab Taxi, Singapore)
TraV: An Interactive Trajectory Exploration System
PRESENTER: Johns Paul

ABSTRACT. The proliferation of modern GPS-enabled devices like smartphones has led to significant research interest in large-scale trajectory exploration, which aims to identify all nearby trajectories of a given input trajectory. Trajectory exploration is beneficial, for example, in identifying incorrect road network information or in assisting users traveling in unfamiliar geographical regions, as it can reveal the popularity of certain routes/trajectories. In this study, we develop an interactive trajectory exploration system named TraV. TraV allows users to easily plot and explore trajectories using an interactive Graphical User Interface (GUI) containing a map of the geographical region. Under the hood, TraV applies a Hidden Markov Model to (optionally) calibrate the user's input trajectory and then makes use of the massively parallel execution capabilities of modern hardware to quickly identify trajectories near the input provided by the user. To ensure a seamless user experience, TraV adopts a progressive execution model that contrasts with the conventional "query-before-process" model. Demonstration participants will gain first-hand experience with TraV and its ability to calibrate user input and analyze billions of trajectories obtained from Grab taxi drivers (normalized) in Singapore.

18:00-21:00 Banquet
Chair:
Roger Zimmermann (National University of Singapore, Singapore)