ISVC'20: 15TH INTERNATIONAL SYMPOSIUM ON VISUAL COMPUTING
PROGRAM FOR TUESDAY, OCTOBER 6TH

09:00-10:00 Session 8: Keynote - Ahmed Elgammal
Location: K
09:00
The Shape of Art History in the Eyes of the Machine

ABSTRACT. In this talk, I will present results of research activities at the Art and Artificial Intelligence Laboratory at Rutgers University. We investigate perceptual and cognitive tasks related to human creativity in visual art. In particular, we study problems related to art styles, influence, and the quantification of creativity. We develop computational models that aim at providing answers to questions about what characterizes the sequence and evolution of changes in style over time. The talk will cover advances in automated prediction of style, how that relates to art history methodology, and what that tells us about how the machine sees art history. The talk will also delve into our recent research on quantifying creativity in art with regard to its novelty and influence, as well as computational models that simulate the art-producing system.

10:10-11:10 Session 9A: Biometrics
Chair:
Location: A
10:10
Deep Partial Occlusion Facial Expression Recognition via Improved CNN
PRESENTER: Shiguang Liu

ABSTRACT. Facial expression recognition (FER) can indicate a person's emotional state, which is of great importance in virtual human modelling and communication. However, FER suffers from a partial occlusion problem when applied in an unconstrained environment. In this paper, we propose to perform FER on facial expressions with partial occlusion. This differs from most conventional FER settings, which assume that facial images are detected without any occlusion. To this end, by reconstructing the partially occluded facial expression database, we propose a 20-layer "VGG + residual" CNN based on the improved VGG16 network, and adopt a hybrid feature strategy to parallelize a Gabor filter with the above CNN. We also optimize the components of the model with LMCL and momentum SGD. The results are then combined with a certain weighting to obtain the classification results. The advantages of this method are demonstrated by multiple sets of experiments and cross-database tests.

10:30
Towards an Effective Approach for Face Recognition with DCGANs Data Augmentation
PRESENTER: Sirine Ammar

ABSTRACT. Deep Convolutional Neural Networks (DCNNs) are widely used to extract high-dimensional features in various image recognition tasks and have shown significant performance in face recognition. However, accurate face recognition in real time remains a challenge, mainly due to the high computation cost associated with the use of DCNNs and the need to balance precision requirements with time and resource restrictions. Besides, the supervised training process of DCNNs requires a large number of labeled samples. Aiming to solve the problem of data insufficiency, this study proposes a Deep Convolutional Generative Adversarial Network (DCGAN) based solution to enlarge the face dataset by generating synthetic images. Our proposed face recognition approach is based on the FaceNet model. First, we perform face detection using MTCNN. Then, a 128-D face embedding is extracted to quantify each face, and a Support Vector Machine (SVM) is applied on top of the embeddings to recognize faces. Experiments on both the LFW database and the ChokePoint video database show that the proposed approach with DCGAN data augmentation improves face recognition performance.
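
The detection, embedding, and classification pipeline described above can be sketched as follows. This is a minimal illustration using the facenet-pytorch and scikit-learn packages, not the authors' implementation; the training paths and labels are assumed to be supplied by the caller.

```python
# Minimal sketch of the MTCNN -> face embedding -> SVM pipeline described above.
# Assumes the facenet-pytorch and scikit-learn packages; not the authors' code.
# Note: the original FaceNet uses a 128-D embedding; this pretrained model outputs 512-D.
import torch
from PIL import Image
from facenet_pytorch import MTCNN, InceptionResnetV1
from sklearn.svm import SVC

mtcnn = MTCNN(image_size=160)                               # face detector/aligner
embedder = InceptionResnetV1(pretrained='vggface2').eval()  # FaceNet-style embedder

def embed(path):
    """Detect the face in an image file and return its embedding vector."""
    face = mtcnn(Image.open(path).convert('RGB'))           # cropped, aligned face tensor
    with torch.no_grad():
        return embedder(face.unsqueeze(0)).squeeze(0).numpy()

def train_recognizer(train_paths, train_labels):
    """Fit an SVM on embeddings of the (possibly DCGAN-augmented) training images."""
    X = [embed(p) for p in train_paths]
    return SVC(kernel='linear', probability=True).fit(X, train_labels)

def recognize(clf, path):
    """Predict the identity label of a probe image."""
    return clf.predict([embed(path)])[0]
```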

10:50
Controlled AutoEncoders to Generate Faces from Voices

ABSTRACT. Multiple studies in the past have shown that there is a strong correlation between human vocal characteristics and facial features. However, existing approaches generate faces simply from voice, without exploring the set of features that contribute to these observed correlations. A computational methodology to explore this can be devised by rephrasing the question as: "how much would a target face have to change in order to be perceived as the originator of a source voice?" With this in perspective, in this paper we propose a framework to morph a target face in response to a given voice, in such a way that the facial features are implicitly guided by the learned voice-face correlation. Our framework includes a guided autoencoder that converts one face to another, controlled by a unique model-conditioning component called a gating controller, which modifies the reconstructed face based on input voice recordings. We evaluate the framework on the VoxCeleb and VGGFace datasets through human-subject studies and face retrieval. Various experiments demonstrate the effectiveness of our proposed model.

10:10-11:10 Session 9B: Object Recognition/Detection/Categorization
Location: B
10:10
Few-shot Image Recognition with Manifolds
PRESENTER: Debasmit Das

ABSTRACT. In this paper, we extend the traditional few-shot learning (FSL) problem to the situation where the source-domain data is not accessible and only high-level information in the form of class prototypes is available. This limited-information setup for the FSL problem deserves attention because it arises when the source-domain data must remain inaccessible for privacy reasons, yet it has rarely been addressed before. Because of the limited training data, we propose a non-parametric approach to this FSL problem by assuming that all the class prototypes are structurally arranged on a manifold. Accordingly, we estimate the novel-class prototype locations by projecting the few-shot samples onto the average of the subspaces on which the surrounding classes lie. During classification, we again exploit the structural arrangement of the categories by inducing a Markov chain on the graph constructed with the class prototypes. The manifold distance obtained using this Markov chain is expected to produce better results than a traditional nearest-neighbor-based Euclidean distance. To evaluate our proposed framework, we have tested it on two image datasets: the large-scale ImageNet and the small-scale but fine-grained CUB-200. We have also studied parameter sensitivity to better understand our framework.
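
As a rough illustration of the prototype-estimation step, the sketch below projects the few-shot samples onto a local subspace spanned by neighbouring base-class prototypes. This is a simplified stand-in for the paper's subspace-averaging formulation, and all names and parameters are hypothetical.

```python
# Simplified sketch of novel-class prototype estimation by projecting the few-shot
# centroid onto a subspace spanned by neighbouring base-class prototypes.
# An illustrative approximation, not the authors' exact formulation.
import numpy as np

def estimate_prototype(shots, base_prototypes, k=5, dim=3):
    """shots: (n_shot, d) features; base_prototypes: (C, d) source-class prototypes."""
    centroid = shots.mean(axis=0)
    # pick the k base prototypes closest to the few-shot centroid
    order = np.argsort(np.linalg.norm(base_prototypes - centroid, axis=1))
    nearest = base_prototypes[order[:k]]
    mean = nearest.mean(axis=0)
    # local subspace from the top principal directions of the neighbouring prototypes
    _, _, vt = np.linalg.svd(nearest - mean, full_matrices=False)
    basis = vt[:dim]                                   # (dim, d) orthonormal rows
    # project the centroid onto the local subspace (plus its offset)
    return mean + basis.T @ (basis @ (centroid - mean))
```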

10:30
A scale-aware YOLO model for pedestrian detection
PRESENTER: Xingyi Yang

ABSTRACT. Pedestrian detection is a challenging problem in computer vision, as it involves the combination of classification and localization within a scene. Recently, convolutional neural networks (CNNs) have demonstrated superior detection results compared to traditional approaches. Although YOLOv3 (an improved version of the You Only Look Once model) is one of the state-of-the-art methods in CNN-based object detection, it remains very challenging to leverage this method for real-time pedestrian detection. In this paper, we propose a new framework called SA YOLOv3, a scale-aware You Only Look Once framework which improves YOLOv3 for the detection of small-scale pedestrian instances. Our network introduces two sub-networks which detect pedestrians of different scales. Outputs from the sub-networks are then combined to generate robust detection results. Experimental results show that the proposed SA YOLOv3 framework outperforms YOLOv3 on public datasets and runs at an average of 11 fps on a GPU.

10:50
Image categorization using Agglomerative clustering based smoothed Dirichlet mixtures
PRESENTER: Fatma Najar

ABSTRACT. With the rapid growth of multimedia data and the diversity of available image content, it has become necessary to develop advanced machine learning algorithms for categorizing and recognizing images. Hierarchical clustering methods have shown promising results in computer vision applications. In this paper, we present a new unsupervised image categorization technique in which we cluster images using an agglomerative hierarchical procedure with a dissimilarity metric derived from the smoothed Dirichlet (SD) distribution. We propose a mixture of SD distributions and a maximum-likelihood learning framework, from which we derive a Kullback-Leibler divergence between two SD mixture models. Experiments on a challenging image dataset containing different indoor and outdoor places reveal the importance of hierarchical clustering when categorizing images. The conducted tests demonstrate the robustness of the proposed image categorization approach compared to other related works.
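
For intuition, the sketch below pairs the closed-form KL divergence between two standard, single-component Dirichlet distributions with scikit-learn's agglomerative clustering on a precomputed dissimilarity matrix. The paper itself derives the divergence between smoothed Dirichlet mixtures, which is not reproduced here; the parameter vectors are assumed to come from a prior fitting step.

```python
# Illustrative sketch: closed-form KL divergence between two (standard) Dirichlet
# distributions, symmetrized and fed into agglomerative clustering.
# The paper uses smoothed Dirichlet *mixtures*; this is a simplified stand-in.
import numpy as np
from scipy.special import gammaln, digamma
from sklearn.cluster import AgglomerativeClustering

def dirichlet_kl(a, b):
    """KL( Dir(a) || Dir(b) ) in closed form."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    a0, b0 = a.sum(), b.sum()
    return (gammaln(a0) - gammaln(a).sum()
            - gammaln(b0) + gammaln(b).sum()
            + np.dot(a - b, digamma(a) - digamma(a0)))

def cluster_images(params, n_clusters=8):
    """params[i]: Dirichlet parameter vector fitted to image i; returns cluster labels."""
    n = len(params)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = 0.5 * (dirichlet_kl(params[i], params[j]) + dirichlet_kl(params[j], params[i]))
            D[i, j] = D[j, i] = d
    # use affinity='precomputed' instead of metric= on scikit-learn < 1.2
    return AgglomerativeClustering(n_clusters=n_clusters, metric='precomputed',
                                   linkage='average').fit_predict(D)
```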

10:10-11:10 Session 9C: Motion and Tracking
Location: C
10:10
Coarse-to-Fine Object Tracking Using Deep Features and Correlation filters
PRESENTER: Ahmed Zgaren

ABSTRACT. In recent years, deep learning trackers have achieved promising results while bringing interesting ideas to the tracking problem. This progress is mainly due to the use of learned deep features obtained by training deep convolutional neural networks (CNNs) on large image databases. But since CNNs were originally developed for image classification, the appearance modelling provided by their deep layers might not be discriminative enough for the tracking task. In fact, such features represent high-level information that is more related to the object category than to a specific instance of the object. Motivated by this observation, and by the fact that discriminative correlation filters (DCFs) can provide complementary low-level information, we present a novel tracking algorithm that takes advantage of both approaches. We formulate the tracking task as a two-stage procedure. First, we exploit the generalization ability of deep features to coarsely estimate target translation, while ensuring invariance to appearance change. Then, we capitalize on the discriminative power of correlation filters to precisely localize the tracked object. Furthermore, we design an update control mechanism to learn appearance change while avoiding model drift. We evaluated the proposed tracker on object tracking benchmarks. Experimental results show the robustness of our algorithm, which performs favorably against CNN- and DCF-based trackers.
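
The "fine" localization stage can be illustrated with a single-patch, MOSSE-style correlation filter; this is a textbook formulation shown only for intuition, not the authors' tracker, and the usual preprocessing (log transform, cosine window, multi-frame averaging) is omitted.

```python
# Minimal MOSSE-style correlation-filter localization step, for illustration only.
import numpy as np

def train_filter(patch, target_response, lam=1e-2):
    """Learn the (conjugate) filter in the Fourier domain from one training patch.

    target_response is typically a 2D Gaussian peaked at the target centre."""
    F = np.fft.fft2(patch)
    G = np.fft.fft2(target_response)
    return (G * np.conj(F)) / (F * np.conj(F) + lam)

def localize(H_conj, patch):
    """Correlate a new patch with the filter and return the response peak (row, col)."""
    response = np.real(np.fft.ifft2(H_conj * np.fft.fft2(patch)))
    return np.unravel_index(response.argmax(), response.shape)
```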

10:30
Asynchronous Corner Tracking Algorithm based on Lifetime of Events for DAVIS Cameras

ABSTRACT. Event cameras, such as the Dynamic and Active-pixel Vision Sensor (DAVIS), capture intensity changes in the scene and generate a stream of events in an asynchronous fashion. The output rate of such cameras can reach up to 10 million events per second in highly dynamic environments. DAVIS cameras use novel vision sensors that mimic human eyes. Their attractive attributes, such as high output rate, high dynamic range (HDR), and high pixel bandwidth, make them an ideal solution for applications that require high-frequency tracking. Moreover, applications that operate in challenging lighting scenarios can benefit from the high dynamic range of event cameras, i.e., 140 dB compared to 60 dB for traditional cameras. In this paper, a novel asynchronous corner tracking method is proposed that uses both the events and the intensity images captured by a DAVIS camera. The Harris algorithm is used to extract features, i.e., frame-corners, from keyframes, i.e., intensity images. Afterward, a matching algorithm is used to extract event-corners from the stream of events. Events are used solely to perform asynchronous tracking until the next keyframe is captured. Neighboring events, within a window of 5x5 pixels around the event-corner, are used to calculate the velocity and direction of the extracted event-corners by fitting a 2D plane using a randomized Hough transform algorithm. Experimental evaluation shows that our approach is able to update the location of the extracted corners up to 100 times during the blind time of traditional cameras, i.e., between two consecutive intensity images.
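
A rough sketch of the two ingredients above: Harris-based frame-corner extraction from an intensity keyframe with OpenCV, and a plane fit to event timestamps in a 5x5 window around an event-corner. A plain least-squares fit is used here purely for illustration; the paper uses a randomized Hough transform, and all parameter values are assumptions.

```python
# Sketch: Harris corners from a DAVIS intensity keyframe, plus a least-squares
# plane fit t = a*x + b*y + c to events around an event-corner.
import cv2
import numpy as np

def keyframe_corners(gray, max_corners=100):
    """Extract frame-corners from a single-channel intensity keyframe (Harris)."""
    pts = cv2.goodFeaturesToTrack(gray, maxCorners=max_corners, qualityLevel=0.01,
                                  minDistance=5, useHarrisDetector=True, k=0.04)
    return pts.reshape(-1, 2) if pts is not None else np.empty((0, 2))

def local_plane(events, corner, half=2):
    """Fit a plane to event timestamps in a (2*half+1)^2 window around a corner.

    events: array of rows (x, y, t). The fitted gradient (a, b) encodes the local
    velocity/direction of the event-corner."""
    cx, cy = corner
    m = (np.abs(events[:, 0] - cx) <= half) & (np.abs(events[:, 1] - cy) <= half)
    x, y, t = events[m, 0], events[m, 1], events[m, 2]
    A = np.column_stack([x, y, np.ones_like(x)])
    coeffs, *_ = np.linalg.lstsq(A, t, rcond=None)
    return coeffs                                   # (a, b, c)
```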

10:50
TAGCN: Topology-Aware Graph Convolutional Network for Trajectory Prediction
PRESENTER: Brendan Morris

ABSTRACT. Predicting the future trajectories of agents in a dynamic environment is essential for natural and safe decision making by autonomous agents. The trajectory of an agent in such an environment depends not only on the past motion of that agent but also on its interaction with the other agents present in that environment. To capture the effect of other agents on the trajectory prediction of an agent, we propose a three-stream topology-aware graph convolutional network (TAGCN) for interaction message passing between the agents. In addition, temporal encodings of local-level and global-level topological features are fused to better characterize dynamic interactions between participants over time. Results are competitive with previous best methods for trajectory prediction on the ETH and UCY datasets and highlight the need for both local and global interaction structure.

11:10-11:30 Coffee Break
11:30-12:10 Session 10A: Biometrics
Chair:
Location: A
11:30
Gender and Age Estimation without Facial Information from Still Images
PRESENTER: Michalis Vrigkas

ABSTRACT. In this paper, the task of gender and age recognition is performed on pedestrian still images, which are usually captured in-the-wild with no near face-frontal information. Moreover, another difficulty originates from the underlying class imbalance in real examples, especially for the age estimation problem. The scope of the paper is to examine how different loss functions in convolutional neural networks (CNN) perform under the class imbalance problem. For this purpose, as a backbone, we employ the Residual Network (ResNet). On top of that, we attempt to benefit from appearance-based attributes, which are inherently present in the available data. We incorporate this knowledge in an autoencoder, which we attach to our baseline CNN for the combined model to jointly learn the features and increase the classification accuracy. Finally, all of our experiments are evaluated on two publicly available datasets.

11:50
Face Reenactment Based Facial Expression Recognition
PRESENTER: Kamran Ali

ABSTRACT. Representations used for Facial Expression Recognition (FER) are usually contaminated with identity-specific features. In this paper, we propose a novel Reenactment-based Expression-Representation Learning Generative Adversarial Network (REL-GAN) that employs the concept of face reenactment to disentangle facial expression features from identity information. In this method, the facial expression representation is learned by reconstructing an expression image with an encoder-decoder based generator. More specifically, our method learns the disentangled expression representation by transferring the expression information from the source image to the identity of the target image. Experiments performed on widely used datasets (BU-3DFE, CK+, Oulu-CASIA, SFEW) show that the proposed technique produces comparable or better results than state-of-the-art methods.

11:30-12:10 Session 10B: Object Recognition/Detection/Categorization
Location: B
11:30
SAT-CNN: A Small Neural Network for Object Recognition from Satellite Imagery
PRESENTER: Dustin Barnes

ABSTRACT. Satellite imagery presents a number of challenges for object detection, such as the significant variation in object size (from small cars to airports) and low object resolution. In this work, we focus on recognizing objects taken from the xView Satellite Imagery dataset. The xView dataset introduces its own set of challenges, the most prominent being the imbalance between the 60 classes present. xView also contains considerable label noise as well as both semantic and visual overlap between classes. In this work, we focus on techniques to improve performance on an imbalanced, noisy dataset through data augmentation and balancing. Additionally, we show that a very small convolutional neural network (SAT-CNN), with approximately three million parameters, can outperform a deep pre-trained classifier, VGG16, which is used for many state-of-the-art tasks and has over 138 million parameters.
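
To give a sense of the stated parameter budget, the PyTorch sketch below defines a classifier of roughly that size (around 2.5 million parameters). The layer widths are assumptions chosen only to illustrate the scale; this is not the SAT-CNN architecture.

```python
# Illustrative small classifier in the few-million-parameter range (not SAT-CNN).
import torch
import torch.nn as nn

class SmallSatNet(nn.Module):
    def __init__(self, num_classes=60):          # xView has 60 object classes
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 4 * 4, 512), nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SmallSatNet()
print(sum(p.numel() for p in model.parameters()))  # roughly 2.5 million parameters
```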

11:50
Domain Adaptive Transfer Learning on Visual Attention Aware Data Augmentation for Fine-grained Visual Categorization
PRESENTER: Ashiq Imran

ABSTRACT. Fine-Grained Visual Categorization (FGVC) is a challenging topic in computer vision. It is a problem characterized by large intra-class differences and subtle inter-class differences. In this paper, we tackle this problem in a weakly supervised manner, where neural network models are fed additional data generated by a data augmentation technique based on a visual attention mechanism. We perform domain-adaptive knowledge transfer via fine-tuning on our base network model. We conduct experiments on six challenging and commonly used FGVC datasets, and we show competitive improvements in accuracy by using attention-aware data augmentation with features derived from the InceptionV3 deep learning model, pre-trained on large-scale datasets. Our method outperforms competing methods on multiple FGVC datasets and shows competitive results on the others. Experimental studies show that transfer learning from large-scale datasets can be utilized effectively with visual attention based data augmentation to obtain state-of-the-art results on several FGVC datasets. We present a comprehensive analysis of our experiments. Our method achieves state-of-the-art results on multiple fine-grained classification datasets, including the challenging CUB200-2011 bird, Flowers-102, and FGVC-Aircraft datasets.

11:30-12:10 Session 10C: Motion and Tracking
Location: C
11:30
3D articulated body model using anthropometric control points and an articulation video
PRESENTER: Chenxi Li

ABSTRACT. We introduce an efficient and practical integrated system for human body model personalization with articulation. We start with a 3D personalized model of the individual in a standard pose, obtained using a 3D scanner or using a body model reconstruction method based on canonical images of the individual. As the person moves, the model is updated to accommodate the new articulations captured in an articulation video. The personalized model is segmented into different parts using anthropometric control points on the boundary silhouette of the frontal projection of the 3D model. The control points are endpoints of the segments in 2D, and the segments are projections of corresponding regions of independently moving parts in 3D. These joint points can either be selected manually or predicted with a pre-trained point model such as an active shape model (ASM) or a convolutional neural network (CNN). The evolution of the model through the articulation process is captured in a video clip with N frames. The update consists of finding a set of 3D transformations that are applied to the parts of the 3D model so that the projections of the 3D model match those observed in the video sequence at corresponding frames. This is done by minimizing the error between the frontal-projection body region points and the target points from the image for each independently moving part. Our articulation reconstruction method leads to sub-resolution recovery errors.

11:50
Body Motion Analysis for Golf Swing Evaluation
PRESENTER: Jen-Jui Liu

ABSTRACT. A golf swing requires full-body coordination and much practice to perform the complex motion precisely and consistently. The force from the golfer's full-body movement on the club and the trajectory of the swing are the main determinants of swing quality. In this research, we introduce a unique motion analysis method to evaluate the quality of a golf swing. The primary goal is to evaluate how close the user's swing is to a reference ideal swing. We use 17 skeleton points to evaluate the resemblance and report a score ranging from 0 to 10. This evaluation result can be used as real-time feedback to improve player performance. Using this real-time feedback system repeatedly, players will be able to train their muscle memory and improve their swing consistency. We created our dataset with a professional golf instructor, and it includes both good and bad swings. Our results demonstrate that such a machine learning-based approach is feasible and has great potential to be adopted as a low-cost but efficient tool to improve swing quality and consistency.
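
A hypothetical sketch of how a 0-10 resemblance score could be computed from 17 skeleton points per frame is shown below. The normalization, distance metric, and decay constant are illustrative assumptions, not the authors' method; the two sequences are assumed to be temporally aligned.

```python
# Hypothetical 0-10 swing-resemblance score from 17 skeleton points per frame.
import numpy as np

def normalize_pose(pose):
    """pose: (17, 2) keypoints -> centred and scale-normalized."""
    centred = pose - pose.mean(axis=0)
    return centred / (np.linalg.norm(centred) + 1e-8)

def swing_score(user_frames, ref_frames):
    """Mean per-frame keypoint distance mapped to a 0-10 score (10 = identical).

    user_frames, ref_frames: (T, 17, 2) sequences, assumed temporally aligned."""
    dists = [np.linalg.norm(normalize_pose(u) - normalize_pose(r))
             for u, r in zip(user_frames, ref_frames)]
    err = float(np.mean(dists))
    return 10.0 * np.exp(-5.0 * err)   # decay constant chosen arbitrarily for illustration
```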

12:10-13:30 Lunch Break
13:30-14:30 Session 11: Keynote - Ramin Zabih
Chair:
Location: K
13:30
Object-oriented image stitching

ABSTRACT. Image stitching is one of the most widely used applications of computer vision, appearing in well-known products like Google Street View and the panorama mode of commercial cell phones. However, despite the prevalence of artifacts and errors, there has been little to no progress in stitching research over the last ten years. There is no generally accepted evaluation metric and relatively few attempts to directly deal with large viewpoint changes or object movement. We describe a reframing of stitching that exploits the importance of objects, and the algorithmic and evaluation techniques that naturally result. We will also present a technique that directly addresses the most visually disruptive stitching errors and can act as an alarm bell for these errors in stitching results. These ideas can be naturally extended to the panorama algorithms widely used in smartphones. Joint work with Charles Herrmann, Chen Wang, Richard Bowen, and Emil Keyder, from Cornell Tech and Google Research.

14:40-15:40 Session 12A: 3D Reconstruction
Location: A
14:40
A Light-Weight Monocular Depth Estimation With Edge-Guided Occlusion Fading Reduction
PRESENTER: Kuo Shiuan Peng

ABSTRACT. Self-supervised monocular depth estimation methods suffer from occlusion fading, which results from the lack of per-pixel ground-truth supervision. A recent work introduced a post-processing method to reduce occlusion fading; however, the results exhibit a severe halo effect. In this work, we propose a novel edge-guided post-processing method that reduces occlusion fading for self-supervised monocular depth estimation. We also introduce Atrous Spatial Pyramid Pooling with Forward-Path (ASPPF) into the network to reduce computational costs and improve inference performance. The proposed ASPPF-based network is lighter, faster, and better than current depth estimation networks. Our light-weight network needs only 7.6 million parameters and can achieve up to 67 frames per second for 256x512 inputs on a single NVIDIA GTX 1080 GPU. The proposed network also outperforms the current state-of-the-art methods on the KITTI benchmark. The ASPPF-based network and edge-guided post-processing produce better results, both quantitatively and qualitatively, than the competitors.
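
For reference, a standard Atrous Spatial Pyramid Pooling block with parallel dilated convolutions is sketched below in PyTorch. The paper's forward-path (ASPPF) modification is not reproduced, and the channel sizes and dilation rates are assumptions.

```python
# Standard ASPP block with parallel dilated convolutions, as a reference point for
# the ASPPF module described above (the forward-path variant is not shown).
import torch
import torch.nn as nn

class ASPP(nn.Module):
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        # run every dilated branch on the same feature map, concatenate, and fuse
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

# e.g. a feature map at 1/8 of a 256x512 input: ASPP(256, 128)(torch.randn(1, 256, 32, 64))
```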

15:00
Minimal Free Space Constraints for Implicit Distance Bounds
PRESENTER: Simen Haugo

ABSTRACT. A general approach for fitting implicit models to sensor data is to optimize an objective function measuring the quality of the fit. The objective function often involves evaluating the model's implicit function at several points in space. When the model is expensive to evaluate, the number of points can become a bottleneck, making the use of volumetric information, such as free space constraints, challenging. When the model is the Euclidean distance function to its surface, previous work has been able to integrate free space constraints in the optimization problem, such that the number of distance computations is linear in the scene's surface area. Here, we extend this work to only require the model's implicit function to be a bound of the Euclidean distance. We derive necessary and sufficient conditions for the model to be consistent with free space. We validate the correctness of the derived constraints on implicit model fitting problems that benefit from the use of free space constraints.

15:20
Iterative Closest Point with Minimal Free Space Constraints
PRESENTER: Simen Haugo

ABSTRACT. The Iterative Closest Point (ICP) method is widely used for fitting geometric models to sensor data. By formulating the problem as a minimization of distances evaluated at observed surface points, the method is computationally efficient and applicable to a rich variety of model representations. However, when the scene surface is only partially visible, the model can be ill-constrained by surface observations alone. Existing methods that penalize free space violations may resolve this issue, but require that the explicit model surface is available or can be computed quickly, to remain efficient. We introduce an extension of ICP that integrates free space constraints, while the number of distance computations remains linear in the scene's surface area. We support arbitrary shape spaces, requiring only that the distance to the model surface can be computed at a given point. We describe an implementation for range images and validate our method on implicit model fitting problems that benefit from the use of free space constraints.

14:40-15:40 Session 12B: Computer Graphics
Location: C
14:40
Simulation of High-Definition Pixel-Headlights
PRESENTER: Mirko Waldner

ABSTRACT. This contribution presents a novel algorithm for the real-time simulation of adaptive matrix- and pixel-headlights for motor vehicles. The simulation can generate the light distribution of a pair of pixel-headlamps with a resolution of more than one and a half million matrix elements per light module in real time. This performance is achieved by dividing the superposition process of the matrix light sources into an offline and an online part. The offline part creates a light database that is used by the online component to generate the illumination in a memory-access-efficient way. For an ideal pixel-headlight, the run-time of the approach remains nearly constant as the number of matrix lights increases. This contribution evaluates the visual quality of the simulation. It also presents the changes in run-time for different pixel-headlamp resolutions and solid angle discretizations.

15:00
ConcurrentHull: A Fast Parallel Computing Approach to the Convex Hull Problem
PRESENTER: Sina Masnadi

ABSTRACT. The convex hull problem has practical applications in mesh generation, file searching, cluster analysis, collision detection, image processing, statistics, etc. In this paper, we present a novel pruning-based approach for finding the convex hull set of 2D and 3D datasets using parallel algorithms. This approach, which combines pruning, divide and conquer, and parallel computing, is flexible enough to be employed in a distributed computing environment. We propose the algorithm for both CPU and GPU (CUDA) computation models. The results show that ConcurrentHull's performance gain increases with the input data size. Providing an independently dividable approach, our algorithm has the benefit of handling huge datasets, as opposed to the other approaches presented in this paper, which failed to handle the same datasets.
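
The pruning idea can be illustrated in 2D as below: points strictly inside the quadrilateral formed by the four axis-extreme points can never be hull vertices, so they are discarded before an exact hull is computed on the survivors. This mirrors only the pruning concept under those assumptions; it is not the paper's parallel CPU/GPU implementation.

```python
# 2D illustration of hull pruning: drop points strictly inside the quadrilateral of
# the four axis-extreme points, then run an exact hull on the survivors.
import numpy as np
from scipy.spatial import ConvexHull

def prune_and_hull(points):
    # four axis-extreme points, ordered counter-clockwise (left, bottom, right, top)
    extremes = points[[points[:, 0].argmin(), points[:, 1].argmin(),
                       points[:, 0].argmax(), points[:, 1].argmax()]]
    strictly_inside = np.ones(len(points), dtype=bool)
    for k in range(4):
        a, b = extremes[k], extremes[(k + 1) % 4]
        cross = (b[0] - a[0]) * (points[:, 1] - a[1]) - (b[1] - a[1]) * (points[:, 0] - a[0])
        strictly_inside &= cross > 0            # strictly left of every CCW edge
    survivors = points[~strictly_inside]        # only these can still be hull vertices
    hull = ConvexHull(survivors)
    return survivors[hull.vertices]

pts = np.random.rand(100000, 2)
print(len(prune_and_hull(pts)), "hull vertices")
```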

15:20
A Data-Driven Creativity Measure for 3D Shapes
PRESENTER: Manfred Lau

ABSTRACT. There has been much interest in generating 3D shapes that are perceived to be "creative", and previous works have developed tools that can be used to create shapes that may be considered "creative". However, previous research either does not formally define what a creative shape is, or describes manually pre-defined methods or formulas to evaluate whether a shape is creative. In this paper, we develop a computational measure of 3D shape creativity by learning from raw data and without any pre-defined conception of creativity. We first collect various types of data on the human perception of 3D shape creativity. We then analyze the data to gain insights into what makes a shape creative, show results of our learned measure, and discuss some applications.

14:40-15:40 Session 12C: Medical Image Analysis I
Location: B
14:40
Fetal Brain Segmentation using Convolutional Neural Networks with Fusion Strategies
PRESENTER: Andrik Rampun

ABSTRACT. Most Convolutional Neural Network (CNN) architectures are based on a single prediction map when optimising the loss function. This may have the following consequences: firstly, the model may not be fully optimised, and secondly, the model may be prone to noise and hence more sensitive to false positives/negatives, both resulting in poorer results. In this paper, we propose four fusion strategies to promote ensemble learning within a network architecture by combining its main prediction map with its side outputs. The architectures combine multi-source, multi-scale and multi-level local and global information together with spatial information. To evaluate the performance of the proposed fusion strategies, we integrated each of them into three baseline architectures, namely the classical U-Net, attention U-Net, and recurrent residual U-Net. Subsequently, we evaluate each model by conducting two experiments: firstly, we train all models on 200 normal fetal brain cases and test them on 74 abnormal cases, and secondly, we train and test all models on 200 normal cases using a 4-fold cross-validation strategy. Experimental results show that all fusion strategies consistently improve the performance of the baseline models and outperform some of the state-of-the-art methods.

15:00
Fundus2Angio: A Novel Conditional GAN Architecture for Generating Fluorescein Angiography Images from Retinal Fundus Photography

ABSTRACT. Carrying out clinical diagnosis of retinal vascular degeneration using Fluorescein Angiography (FA) is a time-consuming process and can pose significant adverse effects on the patient. Angiography requires insertion of a dye that may cause severe adverse effects and can even be fatal. Currently, there are no non-invasive systems capable of generating Fluorescein Angiography images. However, retinal fundus photography is a non-invasive imaging technique that can be completed in a few seconds. In order to eliminate the need for FA, we propose a conditional generative adversarial network (GAN) to translate fundus images to FA images. The proposed GAN consists of a novel residual block capable of generating high-quality FA images. These images are important tools in the differential diagnosis of retinal diseases, without the need for an invasive procedure with possible side effects. Our experiments show that the proposed architecture outperforms other state-of-the-art generative networks. Furthermore, our proposed model achieves better qualitative results that are indistinguishable from real angiograms.

15:20
Multiscale Detection of Cancerous Tissue in High Resolution Slide Scans
PRESENTER: Qingchao Zhang

ABSTRACT. We present an algorithm for multi-scale tumor (chimeric cell) detection in high resolution slide scans. The broad range of tumor sizes in our dataset poses a challenge for current Convolutional Neural Networks (CNNs), which often fail when image features are very small (8 pixels). Our approach modifies the effective receptive field at different layers in a CNN so that objects with a broad range of scales can be detected in a single forward pass. We define rules for computing adaptive prior anchor boxes, which we show are solvable under the equal proportion interval principle. Two mechanisms in our CNN architecture alleviate the effects of non-discriminative features prevalent in our data: a foveal detection algorithm that incorporates a cascade residual-inception module, and a deconvolution module with additional context information. When integrated into a Single Shot MultiBox Detector (SSD), these additions permit more accurate detection of small-scale objects. The results permit efficient real-time analysis of medical images in pathology and related biomedical research fields.
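
For context, the equal proportion interval rule as used for prior-box scales in the original SSD can be written as below; the paper's adaptive, data-driven anchor rules are not reproduced here, and the default scale range is the SSD paper's.

```python
# Equal-proportion-interval prior-box scales (as in the original SSD), for context only.
import numpy as np

def anchor_scales(num_layers, s_min=0.2, s_max=0.9):
    """Scale s_k for feature layer k = 1..m, spaced evenly between s_min and s_max."""
    k = np.arange(1, num_layers + 1)
    return s_min + (s_max - s_min) * (k - 1) / (num_layers - 1)

def anchor_wh(scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Width/height of anchors at a given scale for several aspect ratios."""
    return [(scale * np.sqrt(ar), scale / np.sqrt(ar)) for ar in aspect_ratios]

for s in anchor_scales(6):
    print(round(float(s), 2), [tuple(np.round(wh, 3)) for wh in anchor_wh(s)])
```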

15:40-16:00 Coffee Break
17:00-18:00 Session 14: Poster Session I (pre-recorded)
Location: PR
17:00
Video based fire detection using Xception and ConvLSTM
PRESENTER: Tanmay Verlekar

ABSTRACT. Immediate detection of wildfires can aid firefighters in saving lives. The research community has invested a lot of effort in detecting fires using vision-based systems, due to their ability to monitor vast open spaces. Most of the current state-of-the-art vision-based systems operate on individual images, limiting them to only spatial features. This paper presents a novel system that explores the spatio-temporal information available within a video sequence to classify a scene into the fire or non-fire category. The system, in its initial step, selects 15 key frames from an input video sequence. The frame selection step allows the system to capture the entire movement available in a video sequence regardless of its duration. The spatio-temporal information among those frames is then captured using a deep convolutional neural network (CNN) called Xception, which is pre-trained on ImageNet, and a convolutional long short-term memory network (ConvLSTM). The system is evaluated on a challenging new dataset, presented in this paper, containing 70 fire and 70 non-fire sequences. The dataset contains aerial shots of fire and fire-like sequences, such as fog, sunrise and bright flashing objects, captured using a dynamic/moving camera for an average duration of 13 seconds. The classification accuracy of 95.83% highlights the effectiveness of the proposed system in tackling such challenging scenarios.
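
The described pipeline can be sketched in Keras as follows: 15 evenly spaced key frames, a pre-trained Xception applied per frame, and a ConvLSTM aggregating over time. The head sizes and training configuration are assumptions for illustration, not the paper's exact settings.

```python
# Sketch: key-frame selection, per-frame Xception features, ConvLSTM over time.
import numpy as np
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import Xception

N_FRAMES, H, W = 15, 299, 299

def select_key_frames(frames, n=N_FRAMES):
    """Pick n evenly spaced frames so the whole clip is covered regardless of duration."""
    idx = np.linspace(0, len(frames) - 1, n).astype(int)
    return np.stack([frames[i] for i in idx])

backbone = Xception(weights='imagenet', include_top=False, input_shape=(H, W, 3))
backbone.trainable = False                             # keep the ImageNet features frozen

inputs = layers.Input(shape=(N_FRAMES, H, W, 3))
x = layers.TimeDistributed(backbone)(inputs)           # per-frame spatial features
x = layers.ConvLSTM2D(64, kernel_size=3, padding='same')(x)  # temporal aggregation
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(1, activation='sigmoid')(x)     # fire vs. non-fire
model = Model(inputs, outputs)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```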

17:05
Highway Traffic Classification for the Perception Level of Situation Awareness
PRESENTER: Julkar Nine

ABSTRACT. The automotive industry is rapidly moving towards the highest level of autonomy. However, one of the major challenges for highly autonomous vehicles is the differentiation between driving modes according to different driving situations. Different driving zones have different driving safety regulations. For example, German traffic regulations require a higher degree of safety measures for highway driving. Therefore, a classification of the different driving scenarios on a highway is necessary to regulate these safety assessments. This paper presents a novel vision-based approach to the classification of German highway driving scenarios. We develop three different and precise algorithms utilizing image processing and machine learning approaches to recognize speed signs, traffic lights, and highway traffic signs. Based on the results of these algorithms, a weight-based classification process determines whether the current driving situation is a highway driving mode or not. The main goal of this research work is to maintain and ensure the high safety specifications required on the German highway. Finally, the result of this classification process is provided as an extracted, driving-scenario-based feature at the perception level of a situation awareness system, to provide a high level of driving safety. This study was realized on a custom-made hardware unit called the "CE-Box", which was developed at the Department of Computer Engineering at TU Chemnitz as an automotive test solution for testing automotive software applications on an embedded hardware unit.

17:10
3D-CNN for Facial Emotion Recognition in Videos
PRESENTER: Jad Haddad

ABSTRACT. In this paper, we present a video-based emotion recognition neural network operating in three dimensions. We show that 3D convolutional neural networks (3D-CNNs) can be very good at predicting facial emotions that are expressed over a sequence of frames. We optimize the 3D-CNN architecture through a hyper-parameter search and show that this has a very strong influence on the results, even though architecture tuning of 3D-CNNs has not been much addressed in the literature. Our resulting architecture improves over the results of state-of-the-art techniques when tested on the CK+ and Oulu-CASIA datasets. We compare the results under different cross-validation methods. The designed 3D-CNN yields 97.56% accuracy using leave-one-subject-out cross-validation and 100% using 10-fold cross-validation on the CK+ dataset, and 84.17% using 10-fold cross-validation on the Oulu-CASIA dataset.
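
A minimal 3D-CNN of the kind discussed above can be sketched in PyTorch as follows. The layer sizes, frame count, and input resolution are illustrative assumptions, since the paper's architecture is obtained via hyper-parameter search.

```python
# Minimal 3D-CNN sketch for clip-level emotion classification (illustrative only).
import torch
import torch.nn as nn

class Emotion3DCNN(nn.Module):
    def __init__(self, num_classes=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),          # pool only spatially at first
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                   # pool over time and space
            nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):        # x: (batch, channels, frames, height, width)
        return self.net(x)

logits = Emotion3DCNN()(torch.randn(2, 3, 16, 112, 112))   # -> (2, 7)
```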