IEVC2021: THE 7TH IIEEJ INTERNATIONAL CONFERENCE ON IMAGE ELECTRONICS AND VISUAL COMPUTING
PROGRAM FOR THURSDAY, SEPTEMBER 9TH

10:00-11:20 Session 3A: Image Recognition & Detection (2)
Location: Room A
10:00
Unsupervised Anomaly Detection from Event Camera for Surveillance Monitoring

ABSTRACT. Event cameras are bio-inspired novel vision sensors that report per-pixel brightness changes asynchronously in the form of a stream of events. Event cameras offer significant advantages over conventional RGB cameras: high temporal resolution, high dynamic range, and no motion blur. This paper presents a method for unsupervised anomaly detection using event data and shows that event data can detect anomalies more accurately than RGB images. To the best of our knowledge, this is the first work to show that deep learning models can detect anomalies from event data with an accuracy comparable to RGB. We experiment by converting a publicly available anomaly detection dataset into event data.
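
The conversion procedure is not detailed in the abstract; a minimal sketch of one common frame-to-event simulation (events fire where the per-pixel log-intensity change since the last event exceeds a threshold; the threshold value here is an assumption) could look like this:

```python
import numpy as np

def frames_to_events(frames, threshold=0.2):
    """Convert a grayscale video (T, H, W) in [0, 1] into a list of events.

    Each event is (t, x, y, polarity); an event fires where the change in
    log intensity since the last event at that pixel exceeds the threshold.
    The threshold value is an assumption, not taken from the paper.
    """
    eps = 1e-6
    ref = np.log(frames[0] + eps)          # per-pixel reference log intensity
    events = []
    for t in range(1, len(frames)):
        cur = np.log(frames[t] + eps)
        diff = cur - ref
        ys, xs = np.where(np.abs(diff) >= threshold)
        for x, y in zip(xs, ys):
            events.append((t, x, y, 1 if diff[y, x] > 0 else -1))
            ref[y, x] = cur[y, x]          # reset reference where an event fired
    return events

# toy usage: a uniform scene brightening over five frames
frames = np.linspace(0.2, 0.8, 5)[:, None, None] * np.ones((5, 4, 4))
print(len(frames_to_events(frames)))
```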

10:20
Accurate Indoor Localization Using Multi-View Image Distance

ABSTRACT. Due to the increasing complexity of indoor facilities such as shopping malls and train stations, there is a need for technology that can find the current location of a user with a smartphone or other device, even in indoor areas where GPS signals cannot be received. Indoor localization methods based on image recognition have been proposed as solutions. While many localization methods have been proposed for outdoor use, indoor localization has difficulty achieving high accuracy from just one image taken by the user (query image), because there are many similar objects (walls, desks, etc.) and only a few cues that can be used for localization. In this paper, we propose a novel indoor localization method that uses multi-view images. The basic idea is to improve localization quality by retrieving the pre-captured image with location information (reference image) that best matches the multi-view query images taken from multiple directions around the user. To this end, we introduce a simple metric to evaluate the distance between multi-view images. Experiments on two image datasets of real indoor scenes demonstrate the effectiveness of the proposed method.
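
The paper's exact distance metric is not given in the abstract; a minimal sketch of one plausible set-to-set distance between multi-view images (each query view matched to its nearest reference view, then averaged; the per-view descriptor is an assumed placeholder) could be:

```python
import numpy as np

def multi_view_distance(query_feats, ref_feats):
    """Distance between two multi-view images.

    query_feats, ref_feats: (num_views, dim) arrays of per-view descriptors
    (e.g. CNN features); the descriptor itself is an assumed placeholder.
    For each query view, take its nearest reference view, then average.
    """
    pair = np.linalg.norm(query_feats[:, None, :] - ref_feats[None, :, :], axis=-1)
    return pair.min(axis=1).mean()

def localize(query_feats, references):
    """references: list of (location, ref_feats); returns the best location."""
    return min(references, key=lambda r: multi_view_distance(query_feats, r[1]))[0]
```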

10:40
Early Detection of Objects on the Road Using V2V and Multiple Image Streams for Supervised Deep-Learning Based Autonomous Driving

ABSTRACT. This paper proposes a novel method for early detection of objects on the road using V2V (vehicle-to-vehicle communication) and three image streams for a supervised deep-learning based autonomous driving system, where the three image streams consist of RGB, depth, and semantic segmentation. V2V lets a frontal cooperative vehicle send the three image streams to the host cooperative vehicle. An end-to-end deep learning network recognizes objects on the road and estimates the distances to the recognized objects by processing the three streams. By integrating the results from the cooperative and host vehicles, earlier detection of objects can be achieved compared with non-V2V systems. Experiments on early recognition and localization of other vehicles, pedestrians, and traffic lights on ordinary roads are conducted. Experimental results demonstrate the validity of the proposed method.

11:00
Estimating the 3D Cut Position of Pork Frontal Legs in RGBD Images by a Deep Learning Based Method for Achieving a Robot That Cuts Pork Legs Autonomously

ABSTRACT. Toward the realization of a robot that butchers pork autonomously, as a first step, this paper proposes a deep learning based method for estimating the position at which a pork frontal leg should be cut in an RGBD image. First, sample RGBD images are collected, and the regions to be cut are annotated in the RGB images. A semantic segmentation model is obtained by training on the annotated images. Using the obtained model, the cut region is automatically extracted from unknown RGBD images. Using the 3D information in the extracted region, the 3D cut position and the angle between the orientation of the claw and the vertical direction are calculated. Experimental results show that the accuracy of the cut position estimation is 0.99, and the angle measurement error is within 3 degrees. These promising results demonstrate the validity of the proposed method.
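
The abstract leaves the geometry implicit; a minimal sketch, assuming a pinhole camera with known intrinsics (fx, fy, cx, cy are placeholders), of recovering a 3D cut position and a region axis from the segmentation mask and depth channel:

```python
import numpy as np

def cut_position_3d(mask, depth, fx, fy, cx, cy):
    """Back-project the segmented cut region into 3D and take its centroid.

    mask:  (H, W) boolean segmentation of the cut region
    depth: (H, W) depth in meters; intrinsics fx, fy, cx, cy are assumed known.
    """
    ys, xs = np.where(mask & (depth > 0))
    z = depth[ys, xs]
    x = (xs - cx) * z / fx
    y = (ys - cy) * z / fy
    points = np.stack([x, y, z], axis=1)   # (N, 3) region points in camera frame
    return points.mean(axis=0), points     # 3D cut position as the centroid

def region_axis(points):
    """Principal axis of the region via SVD; comparing it against the camera's
    vertical direction (an assumed convention) yields the cutting angle."""
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]
```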

10:00-11:20 Session 3B: Visualization & Image Processing
Location: Room B
10:00
Double-Eyelid Eye Size Illusion Brought by Eyeliner of Different Thicknesses and the Sense of Incongruity on the Illusion as an Effect

ABSTRACT. Eyeliner is a makeup method that enhances female facial attractiveness through a geometric illusion that makes the eyes appear larger. Eyeliner has no fixed shape, and its thickness and length can be freely adjusted. In this paper, we experimentally verified the relationship between eyeliner thickness and the magnitude of the illusion. In addition, by examining the sense of incongruity caused by the eyeliner, we clarified the optimal thickness of eyeliner as makeup to make the eyes appear larger. The results showed that the thicker the eyeliner, the larger the eyes were perceived to be. However, excessively thick eyeliner increased the sense of incongruity and reduced the illusion effect.

10:20
Efficient Smoke up-Res Method with LOD and Culling Algorithm

ABSTRACT. In recent years, 3DCG fluid simulations have been increasingly used in video works and games to reproduce natural phenomena and as VFX, which requires high quality visuals. A common way to shorten the computation time of fluid simulation is for an artist to simulate a low-resolution fluid, which is easy to iterate on through trial and error, and then interpolate it to a high-resolution fluid as a post-process. Our method combines dynamic resolution settings based on the distance between the fluid and the camera (LOD) with culling of parts that do not contribute to the final image. This makes it possible to reduce the computational cost and generate higher-resolution fluids.
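
A minimal sketch of the distance-based LOD idea (the distance bands and scale factors below are illustrative assumptions, not the paper's values):

```python
import numpy as np

def lod_resolution(base_res, block_center, camera_pos,
                   bands=((10.0, 1.0), (30.0, 0.5), (float("inf"), 0.25))):
    """Pick an up-res factor for a simulation block from camera distance.

    bands: (max_distance, scale) pairs; values here are illustrative only.
    Blocks outside the view frustum would be culled before this call.
    """
    d = np.linalg.norm(np.asarray(block_center) - np.asarray(camera_pos))
    for max_dist, scale in bands:
        if d <= max_dist:
            return max(1, int(base_res * scale))
```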

10:40
Contrast Sensitivity Function Model with Bézier Surface in Peripheral Vision

ABSTRACT. The contrast sensitivity function (CSF) is obtained by measuring detection thresholds of contrast over a range of spatial frequencies. Ito et al. measured contrast sensitivities in peripheral vision at eccentricities up to 84 degrees. We propose a new CSF model in the form of a Bézier surface fitted to these measurements in a three-dimensional space whose axes are spatial frequency, eccentricity, and the inverse of contrast. The model reveals a novel gradient tendency along eccentricity that previous models cannot describe.
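
For reference, a Bézier surface of degree (n, m) with control points P_ij has the standard form below; in the proposed model, (u, v) would parameterize spatial frequency and eccentricity, and S the inverse of contrast (the degrees and control-point layout used in the paper are not stated in the abstract):

```latex
S(u, v) = \sum_{i=0}^{n} \sum_{j=0}^{m} B_i^n(u)\, B_j^m(v)\, P_{ij},
\qquad
B_i^n(t) = \binom{n}{i}\, t^i (1 - t)^{n-i}, \quad u, v \in [0, 1]
```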

11:00
Lively Free-Viewpoint Video Generation Using Video Textures

ABSTRACT. This paper describes a method for generating free-viewpoint video that can express liveliness. One problem with conventional free-viewpoint video is that the subject is observed as a static object, which makes it difficult to express liveliness. We focus on “motion” as one of the factors for expressing liveliness. The proposed method adds liveliness to a static 3D model through minute movements of the body surface expressed by changes in texture. The 3D model is generated from multi-viewpoint images, and video textures are then projected onto the model with randomness in the motion.

14:15-15:35 Session 4A: (Special Session) Drone
Location: Room A
14:15
A Study of Drones Through Photographing Sunflower Fields

ABSTRACT. Drones and 360-degree cameras have a variety of uses, and we are researching drones through filming our own sunflower field of approximately 3 a (about 300 m²). We would like to consider how we can use this equipment, which we cannot usually access, and let more people know how useful it is.

14:25
Research on Visual Expression Using Drones and 360-Degree Cameras

ABSTRACT. We encountered drones and learned that there are many different types. As we discussed how to make use of them, we felt that there were many possibilities. Therefore, we decided to explore the "possibility of new visual expression" by combining the 360-degree camera we had been using with drones, and ultimately decided to produce a video work.

14:35
Evaluating the Suitable Resolution of UAV Images for Identifying Different Vegetation

ABSTRACT. Due to the rapid advancement of UAV (unmanned aerial vehicle) technologies, the UAV has become a popular platform in the field of remote sensing. Compared with satellites, UAVs can take images more frequently and with much higher spatial resolution. In this study, the authors took images from a UAV over the agricultural fields of Tokai University, located in Kumamoto, Japan. A comparison of the histograms of the red, green, and blue channel images suggested that a spatial resolution of around 40 cm is suitable for distinguishing rice from sweet potato using the values of the green channel image.
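
A minimal sketch of the kind of comparison described (block-averaging to emulate a coarser ground sample distance, then comparing green-channel histograms of two crop regions; the downsampling factor, bin count, and 8-bit range are assumptions):

```python
import numpy as np

def green_histogram(rgb, factor, bins=32):
    """Histogram of the green channel after block-averaging by `factor`,
    emulating a coarser ground sample distance (e.g. 10 cm -> 40 cm at 4)."""
    g = rgb[..., 1].astype(float)
    h, w = (g.shape[0] // factor) * factor, (g.shape[1] // factor) * factor
    coarse = g[:h, :w].reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    hist, _ = np.histogram(coarse, bins=bins, range=(0, 255), density=True)
    return hist

def histogram_separation(rice_rgb, potato_rgb, factor):
    """L1 distance between the two crops' green histograms; a larger value
    means the crops are easier to tell apart at this resolution."""
    return np.abs(green_histogram(rice_rgb, factor)
                  - green_histogram(potato_rgb, factor)).sum()
```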

14:55
Low Gravity Simulation of the Moon and Mars Using Drones

ABSTRACT. Falls in the elderly can lead to becoming bedridden and to walking problems. Therefore, knowledge of fall prevention is very important. The human head is in a high position, so when a person falls, the head position drops significantly. Sensors in the head detect the resulting low gravity, and the body responds to the crisis. To understand the mechanism of fall prevention, it is important to know the body's response to low gravity. In this study, we used drones to create low-gravity conditions like those on the Moon and Mars in order to understand the body's gravity response.
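
For context (not spelled out in the abstract): a platform accelerating downward at a gives an attached body an effective gravity of g_eff = g_Earth − a, so emulating the Moon or Mars requires approximately

```latex
a_{\mathrm{Moon}} = g_{\mathrm{Earth}} - g_{\mathrm{Moon}} \approx 9.81 - 1.62 = 8.19\ \mathrm{m/s^2},
\qquad
a_{\mathrm{Mars}} = g_{\mathrm{Earth}} - g_{\mathrm{Mars}} \approx 9.81 - 3.71 = 6.10\ \mathrm{m/s^2}
```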

14:15-15:35 Session 4B: Deep Learning & Video Processing
Location: Room B
14:15
Spectral Super-Resolution Using CNN Decomposing a Color Image into Luminance and Chrominance Components

ABSTRACT. Hyper-spectral images are used in a wide range of fields such as industry, medicine, and remote sensing. They are also used in computer graphics as light probe images and textures in spectral rendering. The acquisition of spectral images is, however, costly in terms of equipment and time, which hinders their acquisition and use. Conventional deep learning based spectral super-resolution methods adopt direct end-to-end learning from RGB to hyper-spectral images. In contrast, we focus on the fact that hyper-spectral images can be decomposed into luminance and chrominance components, and we propose a novel spectral super-resolution method that uses deep learning to estimate each component separately. In the proposed method, a hyper-spectral image is finally reconstructed by combining the estimated luminance and chrominance components.
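
The abstract does not define the decomposition; a minimal sketch of one natural choice (per-pixel luminance as the mean over spectral bands and chrominance as band-wise ratios, so their product reconstructs the spectrum; this particular split is an assumption for illustration):

```python
import numpy as np

def decompose(hsi):
    """Split a hyper-spectral image (H, W, bands) into luminance and chrominance.

    luminance   (H, W, 1):     mean intensity over bands
    chrominance (H, W, bands): per-band ratios to that mean
    """
    lum = hsi.mean(axis=-1, keepdims=True)
    chrom = hsi / np.maximum(lum, 1e-8)
    return lum, chrom

def reconstruct(lum, chrom):
    """Recombine the two estimated components into a hyper-spectral image."""
    return lum * chrom
```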

14:35
2D Human Pose Completion Using GANs and Graph Convolutional Networks

ABSTRACT. Due to overlap and occlusions, even state-of-the-art human pose estimators often generate incomplete poses with missing points. This is a serious problem for applications such as 3D pose regression, as they may tolerate slight imprecision in joint locations, but not incompleteness. Our work aims to predict the location of missing joints in incomplete human poses. To that end, we propose a deep-learning architecture based on Generative Adversarial Networks (GANs) and Graph Convolutional Networks (GCNs). We show how the introduction of adversarial training and graphs to represent the human body helps our models learn geometric relationships between different body parts and generate more realistic poses, outperforming the state of the art with no additional information other than the input 2D pose.
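
A minimal sketch of a graph-convolution layer over a 2D pose (the skeleton adjacency, joint count, and feature sizes are placeholders, not the authors' exact architecture):

```python
import numpy as np

def gcn_layer(x, adj, weight):
    """One graph-convolution layer: X' = ReLU(Â X W).

    x:      (num_joints, in_dim) joint features (e.g. 2D coordinates)
    adj:    (num_joints, num_joints) skeleton adjacency with self-loops
    weight: (in_dim, out_dim) learned parameters (random here for illustration)
    """
    deg = adj.sum(axis=1)
    norm_adj = adj / np.sqrt(np.outer(deg, deg))   # symmetric normalization
    return np.maximum(norm_adj @ x @ weight, 0.0)

# toy usage: 17 COCO-style joints, 2D inputs, 64-dim outputs (all assumptions)
rng = np.random.default_rng(0)
pose = rng.standard_normal((17, 2))
adj = np.eye(17)                                   # plus skeleton edges in practice
features = gcn_layer(pose, adj, rng.standard_normal((2, 64)))
```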

14:55
Multi-View Stereo Based on Omnidirectional Epipolar Geometry

ABSTRACT. This paper introduces a multi-view stereo (MVS) algorithm for omnidirectional images. The data and smoothing terms for MVS are calculated based on the epipolar geometry of the omnidirectional images, avoiding the distortion that arises at high latitudes. Aligning the pole of the omnidirectional image with the normal vector of the epipolar plane removes the distortion along an epipolar curve. Since semi-global matching (SGM) is a one-dimensional smoothing method, combining it with this alignment removes the distortion of the omnidirectional image entirely. Experiments show that disparity and depth can be calculated correctly with the proposed method.

15:15
Background Music Search System for an Input Video Using Factor Analysis of Impression Words

ABSTRACT. The use of video clip sharing sites and applications has become popular in recent years. Adding appropriate background music to a video is desirable before submitting it. In the past, however, when searching for a sound source that can be used as background music, video creators have had to rely on meta-information, such as linguistic tags subjectively attached to the database by its owner, or actually listen to a number of songs to confirm whether the music matches the video.

The author of this study believes that video contributors could improve their production efficiency if there were a system that automatically lists suitable background music for a video. Hence, such a search system is proposed herein. First, a website for a questionnaire survey was constructed. Through the survey, the impressions of a set of videos and music tracks are scored using impression words describing each. The participants are then asked to select the five background music tracks best suited to each video.

In the proposed method, the features of the videos are formulated using color and optical flow histograms. Mel-frequency cepstrum coefficients, which are often used in recent deep learning research, are used as the music features. A function that converts these features into linguistic evaluation values is obtained by multiple regression analysis. To evaluate the proposed system, the top five music files with the highest similarity to the input video are retrieved, and their compatibility with the respondents' rankings is checked. The results show that the proposed method, working through impression words, performs more accurate retrieval than a learning network that directly connects video and music features.
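
A minimal sketch of the pipeline as described (feature-to-impression regression, then ranking music by impression similarity; all array shapes, the synthetic data, and the cosine-similarity choice are assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# placeholder survey data (shapes are assumptions): items x features,
# each scored on 10 impression words
video_feats, video_impr = rng.standard_normal((100, 64)), rng.standard_normal((100, 10))
music_feats, music_impr = rng.standard_normal((200, 48)), rng.standard_normal((200, 10))

video_model = LinearRegression().fit(video_feats, video_impr)   # multiple regression
music_model = LinearRegression().fit(music_feats, music_impr)

def top5_music(query_video_feat, library_feats):
    """Rank music by cosine similarity between predicted impression vectors."""
    v = video_model.predict(query_video_feat[None])[0]
    m = music_model.predict(library_feats)
    sims = (m @ v) / (np.linalg.norm(m, axis=1) * np.linalg.norm(v) + 1e-8)
    return np.argsort(-sims)[:5]                                 # indices of top five

print(top5_music(rng.standard_normal(64), music_feats))
```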

15:50-16:50 Session Poster: Posters
Channel Attention Based Convolutional Neural Network for Grading Smartphone Surface with Sparse Feature

ABSTRACT. Grading the surface condition of a smartphone is a crucial procedure in trading refurbished and used smartphones, a market that is growing and contributes to environmental protection. So far, grading is mostly done manually with limited computer assistance. Manual classification inevitably leads to errors, and existing computer diagnosis offers little support in dealing with the sparse features of the surface. To tackle this problem, we propose an Efficient Channel Attention (ECA) based deep learning method for grading smartphone surfaces. In our method, we add a lightweight channel attention module to ResNet50 and replace the max pooling layer with a SoftPool layer for further improvement. In experiments, we confirmed that grading accuracy was significantly increased without considerable additional computational complexity.
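
The ECA module itself is small; a minimal PyTorch sketch (kernel size 3 is an assumption, and the module's exact placement inside ResNet50 is not specified in the abstract):

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: per-channel weights from a 1D convolution
    over the globally pooled channel descriptor."""
    def __init__(self, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):                         # x: (N, C, H, W)
        y = x.mean(dim=(2, 3))                    # global average pool -> (N, C)
        y = self.conv(y.unsqueeze(1)).squeeze(1)  # local cross-channel interaction
        return x * torch.sigmoid(y).view(x.size(0), -1, 1, 1)

out = ECA()(torch.randn(2, 64, 8, 8))             # toy usage
```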

A System to Understand the Class Atmosphere in Distance Learning

ABSTRACT. One of the disadvantages of distance learning is that it is difficult for teachers to grasp the class atmosphere. We therefore propose a system for understanding the status of students attending live lectures. In the proposed system, the number of students looking at their computer screens is used as the class status. The state of each student, obtained using the web camera of the student's PC, is sent to a server. The server aggregates the states every second and sends the result to the teacher's PC, which displays it as a time-series graph. Experiments confirmed that the graph is updated correctly in real time.
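
A minimal sketch of the server-side aggregation step as described (the transport layer and the camera-based gaze detection are omitted; names and structures here are assumptions):

```python
import time
from collections import defaultdict

states = {}                     # student_id -> latest "looking at screen" flag
history = defaultdict(int)      # unix second -> number of students looking

def receive_state(student_id, looking):
    """Called whenever a client reports its student's state
    (in practice the states would arrive over the network every second)."""
    states[student_id] = bool(looking)

def aggregate_once():
    """Run once per second; the teacher's PC plots `history` as a time series."""
    history[int(time.time())] = sum(states.values())
```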

Development of Real-Time Fish Position Recognition System for Automatic Feeding Aquaculture

ABSTRACT. Sillago japonica is a fish that is familiar in Japanese cuisine. If fish over 25 cm in length, which are traded at high prices, can be raised stably through aquaculture, it will lead to revitalization of the aquaculture industry. However, Sillago japonica aquaculture is difficult to handle with a conventional simple automatic feeding system. Hence, we have developed a fish position recognition system based on image recognition AI as preprocessing for an automatic feeding control AI that feeds at the optimum timing. We used YOLOv3 to realize high-accuracy position recognition.

A Method for Estimating Renal Function Using Multi-Resolution Texture Analysis

ABSTRACT. Chronic kidney disease is diagnosed using blood tests, but there is a concern that the examination data are unstable. Therefore, the development of a robust renal function estimation method based on medical images has been desired. In this study, we propose a renal function estimation method based on texture analysis of renal MRI. In an accuracy evaluation experiment, the conventional TLCO method obtained a correlation coefficient of 0.52, while the proposed method obtained a correlation coefficient of 0.81.

A Study on Image Information Role for Language Understanding and Intercultural Communication

ABSTRACT. Individual opinions and living culture have been influenced not only by image information such as advertising flyers and signboards but also by recent advertisements through TV, the Web, and digital signage. The effectiveness of such image information is analyzed here from the perspectives of human information history, computer media history, language acquisition, and Maslow's theory. The effectiveness of image information for cultural propagation is then discussed through the images of the Bible in the West and figures of Confucian filial piety in the East.

Reflection Removal on Eyeglasses Using GAN

ABSTRACT. When using a video conferencing system, a problem for users is that the display screen is reflected in their eyeglasses, hiding their facial expressions and leaking information. This problem occurs especially with blue light blocking eyeglasses. In this study, we propose a reflection removal network based on a Generative Adversarial Network (GAN). As a result of training our model, we successfully removed undesirable reflections. The model outperforms simpler networks such as an autoencoder and U-Net, and generates images of high enough quality for practical use. The proposed reflection removal method will contribute to solving this problem in video conferencing systems.

Development of a Method for Superimposing a Non-Existent Building on the Real World by AR

ABSTRACT. Advances in the performance of mobile devices such as smartphones and tablets allow us to use computation-intensive functions such as AR with practical response times. At the same time, there are many examples of using ICT to attract domestic and foreign tourists to local tourist attractions. In particular, there is a growing trend toward reconstructing historical buildings that no longer exist with computer graphics (CG) and either presenting them in virtual reality (VR) so that visitors can freely observe them in a virtual space, or combining them with the real landscape through augmented reality (AR). In this paper, we propose a method for synthesizing historical buildings that do not exist today into the real world through marker-less AR, taking into account the effects that the CG objects have on the real world. A prototype of the proposed method was installed on a smartphone, and its usefulness was confirmed by applying it to a miniature of Hagi Castle and its surroundings.

Effect of Lighting Environment in VR Room on Calculation Task and Sense of Time

ABSTRACT. Nowadays, many people are active at night. One of the strengths of virtual space is that it can provide an activity environment that matches the user's preference. By adjusting the activity environment to the user's taste, the feeling of immersion in VR is expected to be enhanced. In this article, we use VR to simulate day and night environments and test whether the immersive environment has any effect on task performance or time perception. The results show that users prefer a night environment in the game space, even during the day; however, the efficiency of the calculation task did not improve, and perceived time passed too quickly.

Kuzushiji Recognition Using Multiple Datasets and Its Analysis

ABSTRACT. It has been reported that accurate recognition models can be created by training machine learning models on a Japanese cursive (kuzushiji) character image dataset. In this study, we used a model trained on this dataset to see whether it is effective for recognizing kuzushiji characters from before the Edo period. Our analysis confirmed that the drop in accuracy was caused by character shapes that were not included in the training data and by degraded images. Furthermore, we were able to identify issues for improving accuracy by adding data from before the Edo period to the training process.

Universal Face Attribute Model

ABSTRACT. We propose a deep learning based universal face attribute model that simultaneously estimates various attributes from images of a person's face region, for detailed retrieval of a person in an image or video. To construct the model, we trained concurrently on several publicly available face image datasets, such as CelebA, UTKFace, RAF-DB, and SCUT-FBP5500, as well as our own datasets. Since the proposed model has a structure of shared and attribute-specific layers, it is capable of estimating multiple attributes of a person at the same time.
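
A minimal PyTorch sketch of the shared-trunk / attribute-specific-head structure (the backbone, feature size, head names, and output sizes are illustrative assumptions, not the paper's exact attributes):

```python
import torch
import torch.nn as nn

class FaceAttributeModel(nn.Module):
    """Shared layers feed several attribute-specific heads."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.shared = nn.Sequential(             # stand-in for a CNN backbone
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU(),
        )
        self.heads = nn.ModuleDict({
            "age": nn.Linear(feat_dim, 1),       # regression head
            "gender": nn.Linear(feat_dim, 2),    # classification heads
            "expression": nn.Linear(feat_dim, 7),
        })

    def forward(self, x):
        f = self.shared(x)                       # one pass through shared layers
        return {name: head(f) for name, head in self.heads.items()}

outputs = FaceAttributeModel()(torch.randn(1, 3, 112, 112))   # toy usage
```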

Temporal Cross-Modal Attention for Audio-Visual Event Localization

ABSTRACT. In this paper, we propose a new method for audio-visual event localization [1], which finds the segments in which audio and visual events correspond. While previous methods use Long Short-Term Memory (LSTM) networks to extract temporal features, recurrent neural networks like LSTM are not able to precisely learn long-term features. Thus, we propose a Temporal Cross-Modal Attention (TCMA) module which extracts temporal features from the two modalities more precisely. Inspired by the success of previous works in capturing long-term features, TCMA incorporates self-attention. With it, we were able to localize audio-visual events precisely and achieved a result 2.7 points higher than the previous method.
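
A minimal sketch of cross-modal attention between temporal audio and visual features (scaled dot-product attention with queries from one modality and keys/values from the other; all dimensions are assumptions, and the authors' exact TCMA design may differ):

```python
import numpy as np

def cross_modal_attention(query_feats, other_feats):
    """Attend from one modality's time steps over the other's.

    query_feats: (T, d) e.g. visual features per segment
    other_feats: (T, d) e.g. audio features per segment
    Returns (T, d) features in which each time step aggregates the other
    modality over the whole sequence, capturing long-term dependencies.
    """
    d = query_feats.shape[1]
    scores = query_feats @ other_feats.T / np.sqrt(d)      # (T, T) affinities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)          # softmax over time
    return weights @ other_feats

rng = np.random.default_rng(0)
visual, audio = rng.standard_normal((10, 128)), rng.standard_normal((10, 128))
visual_attended = cross_modal_attention(visual, audio)     # toy usage
```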

Pose Estimation of Human Body Parts from RGB Images for AR Display of Photoacoustic 3D Data

ABSTRACT. 3D photoacoustic imaging is a technology that uses ultrasound waves to visualize micro blood vessels in the skin. Our goal is to superimpose 3D photoacoustic data of the leg onto RGB images to enhance the experience of medical doctors. As a first step, we formulate this task as the pose estimation of a 3D CAD model, after manually aligning the 3D photoacoustic model with the 3D CAD model. We employ the Pose Interpreter Network, a deep neural network that predicts the 3D pose of an object from a binary mask image. In our experiments, we verify the pose estimation accuracy using three video sequences.