IEVC2024: THE 8TH IIEEJ INTERNATIONAL CONFERENCE ON IMAGE ELECTRONICS AND VISUAL COMPUTING
PROGRAM FOR TUESDAY, MARCH 12TH

09:00-10:20 Session 2A: Computer Vision & 3D Image Processing (1)
Location: F. C. Room
09:00
A Study on Correction Method for Vertical Direction of Earthenware Fragments Based on 3D Measured Point Clouds

ABSTRACT. Restoration of earthenware fragments plays an important role in archaeology. When fragments are excavated from ruins, they are typically restored by hand. To restore earthenware with a computer, the vertical direction of each fragment is required. In our previous method, a set of center points derived from cross sections of a fragment is estimated to determine the vertical direction. However, the estimated vertical direction is sometimes inverted because of displacement of the center points. To handle this inversion, this paper proposes a method that accurately corrects the vertical direction of earthenware fragments based on 3D measured point clouds using linear regression.
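The final correction step can be sketched as a least-squares line fit through the cross-section center points; flipping the fitted axis to a fixed "up" convention is a hypothetical stand-in for the paper's regression-based inversion test.

```python
import math

def fit_vertical_axis(centers):
    """Fit a 3D line through cross-section center points by least squares.

    centers: list of (x, y, z) tuples ordered along the fragment.
    Returns a unit direction vector, flipped so that its z-component is
    non-negative (an assumed "up" convention, not the authors' exact rule).
    """
    n = len(centers)
    t_mean = (n - 1) / 2  # mean of the parameter values 0..n-1
    denom = sum((ti - t_mean) ** 2 for ti in range(n))
    direction = []
    for axis in range(3):
        v_mean = sum(c[axis] for c in centers) / n
        slope = sum((ti - t_mean) * (c[axis] - v_mean)
                    for ti, c in enumerate(centers)) / denom
        direction.append(slope)
    norm = math.sqrt(sum(d * d for d in direction))
    direction = [d / norm for d in direction]
    if direction[2] < 0:  # correct a vertically inverted estimate
        direction = [-d for d in direction]
    return direction
```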

09:20
Real-Time Intuitive Interaction and Realistic Illumination for CT Volume Rendering

ABSTRACT. We developed a real-time, intuitive interaction and photo-realistic illumination method for CT volume rendering. Our approach combines an auto-stereoscopic display and hand-sensor-based gesture control with lightweight yet effective illumination and a fast sampling algorithm that enables real-time rendering. Consequently, our method renders volume data obtained from general CT examinations in real-time 4K stereo view, allowing intuitive comprehension of the 3D structure and providing the realism required not only for diagnostics but also for educational materials and forensic evidence.

09:40
2D-To-3D Conversion Profiling of Monocular Food Images
PRESENTER: Yue-Cong Kuo

ABSTRACT. Recently, deep neural networks (DNNs) have been utilized in many stereoscopic applications, but they consume considerable power and computation. This paper presents a lightweight, hardware-friendly algorithm that converts monocular food images into stereoscopic views through image segmentation, depth assignment, and depth-image-based rendering (DIBR). We extract color and texture features from the food image, which locally indicate the characteristics of different foods, and then segment the image into multiple regions. Based on the spatial information of the segmented regions, we generate a depth map in which central regions appear closer to the viewer and edge regions appear farther away. The generated depth map and the original monocular image are then used to render the 3D food image via DIBR. Furthermore, we measure system performance by benchmarking our algorithm on Google's Custom Function Unit (CFU) Playground and profiling it on a VexRiscv CPU at 100 MHz on the Arty A7-100T FPGA board.
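The center-near/edge-far depth assignment can be illustrated with a toy function that maps each pixel's distance from the image center to a depth value; the linear mapping and 8-bit range are assumptions, not the paper's exact region-based assignment.

```python
def center_weighted_depth(width, height):
    """Toy depth map: 0 (closest) at the image center, 255 (farthest)
    at the corners, growing linearly with distance from the center."""
    cx, cy = (width - 1) / 2, (height - 1) / 2
    max_dist = (cx ** 2 + cy ** 2) ** 0.5
    return [[int(255 * ((x - cx) ** 2 + (y - cy) ** 2) ** 0.5 / max_dist)
             for x in range(width)]
            for y in range(height)]
```

In the paper, depth is assigned per segmented region rather than per pixel, so region centroids would replace the pixel coordinates here.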

10:00
An efficient smoothing method for texture images in V-PCC coding

ABSTRACT. A V-PCC compliant reference software (TMC) encoder projects the texture information of a 3D point cloud onto a 2D plane, places it in a 2D image as patches, and encodes the image using existing video coding methods. Pixels in the 2D image that do not belong to any patch may be smoothed by an arbitrary method. In this study, we investigate an inter-patch smoothing method suitable for coding.

09:00-10:20 Session 2B: Virtual, Augmented, and Mixed Reality (1)
Location: ATI Room
09:00
Frequency and Kansei evaluation of tones that change the sense of weight with volume

ABSTRACT. The impact sound of a dumbbell being dropped was created with metallic and wooden tones, and the volume was varied to give an impression of weight through the cross-modal phenomenon. Both tones had a fundamental frequency of approximately 1 kHz; the metallic tones had peak frequencies at the fundamental and its harmonics, while the wooden tones had no peak frequencies other than the fundamental. Kansei evaluation showed that louder sounds associated with weight scored higher in the "loud," "strong," "hard," "dynamic," and "clear" categories. By using tones rather than single frequencies, it was possible to evoke a sense of weight through loudness, an effect that has tended to be denied in the past.

09:20
Investigating Factors Influencing Personal Space Perception in the Metaverse

ABSTRACT. In the Metaverse, avatar interaction is similar to real-life human communication. To create a comfortable environment in the virtual world, it is crucial to investigate the factors that influence the perception of personal space (PS) between avatars. While there is extensive research on PS in the physical world, little is known about PS between avatars in VR. In this study, we examine how various situations and environments affect the boundaries of PS in VR. Understanding the properties of interaction in VR will help in designing suitable spaces for occasions such as a virtual conference or a poster session.

09:40
Lyrics onto Cityscape: An Augmented Reality System with Visual Experience Harmonizing Music and Surrounding Cityscape
PRESENTER: Rin Izumi

ABSTRACT. We propose an Augmented Reality (AR) system called “Lyrics onto Cityscape” and the new experience it offers. “Lyrics onto Cityscape” allows users to carry a visual expression, like a lyric video, that harmonizes with the surrounding landscape along with music. We analyzed existing lyric videos and identified the elements a system requires for this research. Using the analysis results, we implemented an MR system that dynamically projects lyrics as animations onto the outdoor environment, including buildings, the ground, the sky, and open space, while users walk and listen to music. Analysis of the questionnaire responses from participants yielded high ratings for the effectiveness of the system.

10:00
A Study on a Round Structure: The Spatial Comparison of Mongolian Ger and Geodesic Dome Using Virtual Environment

ABSTRACT. The increasing popularity of more comfortable apartments has decreased the use of round structures in Mongolia. This research aims to determine a dome structure's comfort level by comparing its interior design to that of a traditional Mongolian ger. To achieve this, we created a 3D model of a new dome design that uses a geodesic dome structure of the same size as a Mongolian four-panel ger. We used a virtual environment to compare the spatial models of the Mongolian ger-inspired interlocking joint and the dome. A qualitative research approach was used to assess three types of virtual environments: i) an open space layout, ii) a traditional furniture arrangement, and iii) a modern layout, using comparative questions. The study identified differences in natural light and furniture layout between the ger and dome designs through paired t-test analysis. The findings demonstrate the potential application of round structure principles in architectural practice, offering a path toward sustainable structures.

09:00-10:20 Session 2C: Security and Privacy
09:00
Study on the Color Factor that Affects Feelings toward Visual Emergency Notification on Smartphones

ABSTRACT. This study investigated the use of color in notification messages to optimize visual communication, with a specific focus on discerning disaster severity levels and their impact on recipient perceptions. It engaged 84 participants, including 18 with hearing impairments and 66 without, who evaluated simulated disaster alerts on smartphones. Participants rated their emotional responses regarding safety, alertness, and danger perception in relation to color notifications, accounting for diverse background images. Analysis using an ANOVA model indicated that factors such as hearing impairments and age influenced color perception in notifications. Hearing-impaired participants found color beneficial for situational awareness, with green evoking security and red intensifying alertness and danger perception, particularly against contrasting backgrounds. These findings inform the design of smartphone disaster notifications for both hearing and hearing-impaired individuals.

09:20
Comparative Visualization for the Depiction of Media-Mix Works

ABSTRACT. Recently, many media-mix works based on comic books and novels have been released. Many of these works are constructed based on the setting and story of the original work. However, due to the medium of adaptation and restrictions on video length, settings may have to be changed or parts of the story cut. In this study, we propose a method to analyze media-mix works based on comic books and novels for each medium and to visualize the allocation of depictions in each storyline and the transitions of story elements. By visualizing the structure of the works, we expect to discover differences from the original works and characteristics that depend on the medium.

09:40
Digital seal and digital sign using color-coded round or rounded rectangular symbols with digital signatures

ABSTRACT. A seal impression or handwritten sign is displayed in a circular or rounded-rectangular symbol display area, together with a name, date of creation, etc. A digital signature is generated with the creator's private key to create a two-dimensional symbol. By reading digitally signed seals and documents (paper, PDF, etc.) with a smartphone, seal impressions and handwritten signs can be authenticated.

13:20-14:35 Session 4: Poster1
Multimedia Information Gateway Service

ABSTRACT. Do you ever feel annoyed when receiving information through communication media, such as getting a phone call during a meeting or having to watch a video when you don't have time? This happens because the medium is chosen by the sender. In other words, there is no guarantee that the recipient can receive the information through an appropriate medium. In this research, we consider a service that allows recipients to receive information through media appropriate to their situation. To realize such a service, we construct a multimedia information gateway system that converts data during transmission, so that information sent via the sender's medium can be received via the medium desired by the receiver. At the information gateway, data is processed so that the information is expressed according to the characteristics of the medium through which the recipient receives it.

High-Precision Video Retrieval through Image Generation

ABSTRACT. In recent years, advancements in the training techniques of image and text embedding models have enabled video retrieval under zero-shot learning conditions. In this study, we explored methods to improve the accuracy of video retrieval tasks by leveraging both multimodal embedding models and image generation models. By inputting search query text into a pre-trained Stable Diffusion model to generate images, and subsequently using these generated images as queries for video retrieval, we experimentally confirmed an enhancement in video retrieval performance.
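The retrieval step can be sketched as cosine-similarity ranking in a shared embedding space; the embeddings below are placeholders for what a multimodal model would produce from the generated query image and the candidate videos.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve(query_emb, video_embs):
    """Return video indices ranked by similarity to the query embedding.
    In the described pipeline, query_emb would come from an image
    generated by Stable Diffusion from the search text (illustrative)."""
    return sorted(range(len(video_embs)),
                  key=lambda i: -cosine(query_emb, video_embs[i]))
```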

Object Detection Method for Drone Videos Using Optical Flow

ABSTRACT. The application of object detection to aerial video taken by drones is crucial. Since objects appear small in drone-captured video, object detection models with many parameters are required for accurate detection. Therefore, video compression and information reduction methods are considered for transmitting videos to an object detection model on a server. Our paper proposes an approach that uses optical flow to estimate object positions, reducing the number of frames transmitted to the server. Experimental results show that the frame reduction of the proposed method does not degrade recognition accuracy.
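The position-estimation idea can be sketched by shifting a detected bounding box with the mean optical-flow vector sampled inside it; this is an illustrative simplification, not the authors' exact frame-reduction scheme.

```python
def propagate_box(box, flow_vectors):
    """Estimate an object's box in a skipped frame by translating the
    last detected box with the mean optical flow inside it.

    box: (x, y, w, h); flow_vectors: list of (dx, dy) samples in the box.
    """
    dx = sum(v[0] for v in flow_vectors) / len(flow_vectors)
    dy = sum(v[1] for v in flow_vectors) / len(flow_vectors)
    x, y, w, h = box
    return (x + dx, y + dy, w, h)
```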

Extension of Tactical Board Capable of Determining Normative Offensive Sequences for Rugby Sevens

ABSTRACT. Analysis methods and tactical board tools are available for various team sports. However, based on the authors' investigation, there is no tactical board tool for rugby that can determine and visualize normative offensive sequences. In this context, the authors have developed a prototype tactical board tool that automatically determines normative offensive sequences from input players' locations and velocities. To improve its practicality, this study extends the tool with functions that allow users to set ability levels of players. The extended tool is tested under several settings.

Image Coding for Machines with Objectness-based Feature Distillation

ABSTRACT. Recently, the capability and popularity of automatic machine analysis of images and videos have grown rapidly. Consequently, the analysis of decoded images and videos by machines, rather than humans, is becoming more popular. This shift has led to a growing need for efficient compression methods that are optimized for machines instead of humans. In response to this demand, various methods of image coding for machines (ICM) have been developed. For training ICM models, distillation-based loss is often used. However, the valuable insights gained from the distillation methods in machine vision tasks have not been fully utilized yet. In this study, we propose an objectness-based feature distillation for ICM to improve rate-distortion (R-D) performance. We conducted experiments in object detection and instance segmentation tasks and confirmed that there was an improvement of up to approximately 1.5 points in mAP at the same rate.

Creation and Evaluation of a Board Game for Environmental Learning: Gamification for learning about river water quality and ecosystems

ABSTRACT. In this study, board game materials were created with the goal of deepening understanding of river water quality and ecosystems. Images of various living creatures and facilities related to environmental protection were used in the creation of the teaching materials. In addition, workshops were held for 24 university students, and questionnaires and video recordings were collected. In this presentation, we will present the educational materials we created and the results of the workshop.

The Relationship between the Variation in Drawn Lines and Reaction Time in a Simulated Right Turn Situation

ABSTRACT. To efficiently measure cognitive characteristics, we investigated the relationship between lines drawn in a task and reaction time during simulated right turns. Participants used a tablet to draw the best line for a given point cloud. Reaction time in the simulated right turn was evaluated by dividing the screen into two areas for response tasks. The experiment involved 20 participants. Correlation analysis revealed a positive correlation between the mean of the standard deviations of angles in lines drawn for 17 point clouds and the variation in reaction time or errors. This result suggests that reaction time and errors can be predicted from the variation in drawn lines.
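The reported relationship rests on a standard Pearson correlation between the angle-variation statistic and the reaction-time statistic; a minimal implementation for reference (the variable pairing is illustrative):

```python
import math
import statistics

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples,
    e.g. per-participant angle standard deviations vs. reaction-time
    variation (illustrative pairing, not the authors' data)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```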

Anomaly detection using MIST score from surveillance videos

ABSTRACT. Surveillance cameras have become an indispensable part of criminal investigations. We conducted a study on anomalous behavior detection from surveillance videos, considering that their use can provide greater crime deterrence. We built a system using the anomaly detection framework MIST and Bayesian Online Change Point Detection, detecting anomalies from score variations by applying Bayesian Online Change Point Detection to the anomaly scores generated by MIST. As an improvement over existing methods, we use only the increase probability of Bayesian Online Change Point Detection in the calculation of the run length. As a result, an improvement in recall was observed, but frequent over-detection remains a problem.
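The score-based detection can be illustrated with a much simpler stand-in for Bayesian Online Change Point Detection: flag a frame when its MIST-style anomaly score jumps well above the recent window. The z-score rule and its parameters are assumptions for illustration only, not the paper's method.

```python
def score_jump_changepoints(scores, window=5, k=3.0):
    """Flag index i when scores[i] exceeds the mean of the preceding
    window by more than k standard deviations. A simplified stand-in
    for change-point detection over per-frame anomaly scores."""
    flags = []
    for i in range(window, len(scores)):
        past = scores[i - window:i]
        m = sum(past) / window
        sd = (sum((s - m) ** 2 for s in past) / window) ** 0.5
        if sd > 0 and (scores[i] - m) / sd > k:
            flags.append(i)
    return flags
```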

Occlusion-Robust Face Recognition Using Multi-Task Model with Region Segmentation

ABSTRACT. In the past few years, face recognition methods for masked faces have been proposed. However, most of these methods are effective only for masks and are rarely used because the number of people wearing masks has decreased post-COVID-19. We propose a comprehensive, occlusion-robust face recognition method: a multi-task learning model that simultaneously performs face recognition and unsupervised segmentation to extract finer features from face regions. Our model improves accuracy compared to ArcFace without increasing inference time.

Edge-Cloud Collaborative Object Detection Model with Feature Compression

ABSTRACT. Recently, dramatic performance improvements in object detection models have increased the demand for real-time video processing on edge devices. However, there is a trade-off between real-time processing and high detection accuracy. Hence, a two-phase prediction network model, Edge-Cloud Net (ECNet), has been proposed to coordinate an edge-side AI model with a cloud-side AI model. However, ECNet was originally designed for image classification. In this study, we propose a method to apply ECNet to object detection tasks, using a lightweight edge model that connects YOLOv3 and YOLOv3-tiny. We also implement feature compression to reduce the amount of data transmitted to the cloud side. Our approach reduces data transmission while preserving object detection accuracy, particularly at small bpp.
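Feature compression can be sketched as uniform scalar quantization of an intermediate feature vector before transmission; the bit width and scheme are assumptions, since the abstract does not detail the codec.

```python
def quantize_features(feats, bits=4):
    """Uniformly quantize a feature vector to `bits` bits per value.
    Returns the integer codes plus a dequantization function,
    mimicking the edge-side encode / cloud-side decode split."""
    lo, hi = min(feats), max(feats)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((f - lo) / scale) for f in feats]

    def dequantize(cs):
        return [lo + c * scale for c in cs]

    return codes, dequantize
```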

Development of a Piano Learning Mobile Application Using Augmented Reality and Hand Tracking

ABSTRACT. This study develops a piano learning application using augmented reality (AR) and hand tracking on smartphones. Users can play with both hands on any table, overcoming the limitations of traditional mobile applications. AR and hand tracking enable real-time two-handed playing on a virtual or paper piano via a smartphone. Presented as rhythm games, the application offers a relaxing, interactive experience that enhances flexibility. The study aims to make musical instruments accessible, provide an intuitive way to play, and overcome the limitations of traditional applications in simulating two-handed performance.

Sentinel-1 Image Shoreline Monitoring Technology for the Waisanding Sandbar Based on Deep Learning

ABSTRACT. In this work, we propose an effective method to monitor the shoreline of the Waisanding sandbar in Sentinel-1 images. We found that an end-to-end technique suffers from a wide variety of image distributions, involving the varying shape of the Waisanding sandbar and image noise caused by tide levels and rainy days. To address this problem, we introduce instance segmentation with Mask R-CNN, which classifies the target region and roughly generates a mask or bounding box. Pre-classified Sentinel-1 images can then be segmented more precisely and semantically. A masking operation used in post-processing helps eliminate many additional false-positive pixels. Experimentally, we compare the performance of different semantic segmentation models, such as ResUnet, Transformer, and CNN-hybrid DNNs, each with and without the instance segmentation mask. We also show that the proposed method is more effective when using the attention-based model with captured long-term features.

The Effect of Edge Information in Stable Diffusion Applied to Image Coding

ABSTRACT. We focus on image coding schemes whose primary purpose is recognition in computer vision rather than consumer viewing. The input image is decomposed into prompt, hyperparameter, and edge information by Stable Diffusion. For decoding, the same diffusion model is used to sequentially remove noise determined by the sampling method and the initial value of the random variables. The quality of the generated image depends on the fineness of the edge information. In this study, we investigate the effect of the amount of edge information on the quality of the decoded image.

A Preliminary Study on Quantitative Evaluation of Cataract Using Lens Images

ABSTRACT. Conventionally, the degree of cataract progression is evaluated by finding areas of opacity in the crystalline lens. In this examination, the ophthalmologist determines the degree of progression based on subjective evaluation, which may vary from case to case. In this study, we propose a method for extracting cataract opacity regions from lens images (transillumination images). In addition, we propose a method to classify the type of cataract based on the shape and position of the extracted opacity regions and to quantitatively evaluate the degree of progression of cataracts, and show some experimental results.

A Preliminary Study on Pupil Tracking in Cataract Surgery Using YOLOv8

ABSTRACT. In recent years, 3D imaging systems have come to be used in ophthalmic surgery, and it has become possible to enhance the visibility of the target object by color conversion of the displayed image. In this case, an attempt to improve visibility by extracting the pupil region, which is the doctor's gazing area, and performing color conversion to the area has been studied. In this work, we investigate a method to extract and track only the pupil region in surgical video images by learning the pupil region using YOLOv8, and show some experimental results.

Attentive listening system using generative images and an affective emoji

ABSTRACT. This paper proposes a new multimodal spoken dialogue system utilizing conversational interaction and real-time generative images with an affective facial emoji. The system attentively listens to the user’s talk, referring to the image displayed on the screen as a cue for the talking topic. The image will be updated, reflecting the user’s last-minute utterances. In a preliminary experiment, it was found that users tend to use more positive words or more positive expressions when the positive facial expression emoji is presented than the negative one.

14:50-16:10 Session 5A: Image Recognition & Detection (2)
Location: F. C. Room
14:50
Automatic Image Selection System for High School Baseball Player Photos

ABSTRACT. The National High School Baseball Championship is immensely popular in Japan, and a large number of photos are captured by news reporters during the competitions. There is substantial demand from players and their parents to acquire photos of the player taken during these competitions. However, unlike the selection of report photos, manual selection of quality photos for sale is a time-consuming and error-prone task. This paper introduces an automated system designed to improve the efficiency of the image selection process. The system incorporates a streamlined pipeline that integrates histogram analysis, Laplacian filtering, image clustering, human detection, pose estimation, face detection, and head pose estimation. Experimental and practical results indicate the effectiveness of the proposed system, with operating time reduced by approximately 85%.
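One of the listed stages, Laplacian filtering, is commonly used as a blur check: the variance of the Laplacian response drops for out-of-focus photos. A pure-Python sketch (the 4-neighbour kernel and the thresholding policy are assumptions, not the paper's exact filter):

```python
def laplacian_variance(img):
    """Variance of a 4-neighbour Laplacian over a grayscale image given
    as a 2D list; low values suggest a blurry photo to discard."""
    h, w = len(img), len(img[0])
    vals = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            vals.append(img[y - 1][x] + img[y + 1][x] +
                        img[y][x - 1] + img[y][x + 1] - 4 * img[y][x])
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / len(vals)
```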

15:10
Online hand-drawing pattern classification using Sketch-RNN

ABSTRACT. One recent approach to unsupervised pattern classification is to train an auto-encoder and use the reconstruction loss as an index. In this work, we adopt Sketch-RNN, a sequential variational auto-encoder (VAE) model, to classify online hand-drawing patterns of healthy persons and Parkinson's disease patients. We train the Sketch-RNN with no labeled data and evaluate the accuracy of attribute recognition using the reconstruction loss. Although recognition from still-image data with a CNN model currently surpasses our proposed combination of online drawing data and the Sketch-RNN model, we demonstrate a pipeline for online data classification using a VAE.
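At classification time, the reconstruction-loss index reduces to a thresholding rule; a schematic version (the threshold choice and class labels are placeholders, standing in for the trained Sketch-RNN's loss-based recognition):

```python
def classify_by_reconstruction(losses, threshold):
    """Assign each sample to the out-of-distribution class when its
    auto-encoder reconstruction loss exceeds the threshold."""
    return ["patient" if loss > threshold else "healthy" for loss in losses]
```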

15:30
Estimating the Way of Grasping Complex Contoured Pieces

ABSTRACT. How people grasp things is analyzed in many areas, such as medicine, human engineering, and robotics. Graspability prediction and grasping position detection from images are also studied as machine learning problems in computer vision. Most of these are tasks specified in the context of the affordance of objects such as tools. In this work, we analyze how people grasp objects whose usage as a tool or concept of handling is not apparent: puzzle pieces, for instance. We generated image datasets of randomly shaped pieces with grasp-region annotations obtained by experiments. The results of our investigation of the grasp regions suggest that unevenness or protrusions of the shape and the usage of fingers are related.

15:50
VigNet: Semiautomatic Generation of Vignette Illustrations from Video

ABSTRACT. A variety of summarization techniques have recently been proposed to manage the growing volume of media data, but most are oriented toward homogeneous media conversion, resulting in a limited compression ratio. In this study, we focus on the creation of a vignette illustration that briefly represents the story of an animation or game and allows the viewer to understand its world view at a glance. This paper proposes VigNet, a system that semiautomatically converts an input video into vignette illustrations that reflect the user's preferences.

14:50-16:10 Session 5B: Virtual, Augmented, and Mixed Reality (2)
Location: ATI Room
14:50
Task-Dependent Optimization of Stereophonic Sound in VR Contents

ABSTRACT. With the development of virtual reality (VR), various contents utilizing the “metaverse” are attracting attention. This paper focuses on the optimization of sound in VR contents and attempts to apply various sound technologies. We created VR contents using Unity and implemented sound in three ways: Ambisonics, binaural audio with HRTF, and conventional stereo sources. Experiments with 32 students found that binaural audio was more effective in improving the sense of realism, while Ambisonics was more effective in improving the playability of a treasure hunt game with audible clues.

15:10
Stability Enhancement in VR Hand Manipulation through Truss Structure Connections

ABSTRACT. In VR hand interaction, performing consecutive actions such as rotating or inserting held objects can lead to unexpected deviations from the intended grasp positions and unintentional pop-out, particularly during intricate fingertip tasks. In this study, we devised a novel approach based on the anatomical roles of individual fingers to determine the grasp direction and central point. This approach enables the system to maintain optimal grasp positions in response to dynamic hand pose changes during VR hand interaction.

15:30
An examination of motion analysis during the weight illusion caused by impression changes of one's own body

ABSTRACT. The development of VR (virtual reality) technology has made it possible to easily change a person's appearance, and it has been suggested that impression changes of the appearance using VR avatars affect not only self-perception but also weight perception. We have studied the physical effects of impression changes of one's own body during the weight illusion based on EMG (electromyogram) measurements. A significant correlation was obtained between the EMG and the degree of the weight illusion across subjects, although the correlation was not significant for each individual subject. In this study, to investigate changes in motion during the weight illusion, we analyze motion under impression changes of one's own body through a comparison experiment with dumbbells of different weights.

15:50
Chronological changes in the form of festival preparation works and their influence on the local community bonds

ABSTRACT. We have been investigating the festival preparation works for the Nozawa Onsen Dosojin Fire Festival for 11 years. Year by year, the number of people performing the festival tends to decrease due to depopulation and declining birthrate. As a result, the form of festival preparation works has changed in several ways, e.g., simplifying the work and utilizing machinery. Until recent years, by working together on extremely hard work that requires long hours of physical labor, a great sense of accomplishment has been obtained, and a strong sense of unity has arisen among the villagers. We consider how the gradual loss of time spent with colleagues during the festival preparation works influences the bonds of the local community of the villagers.

14:50-16:10 Session 5C: Image Recognition & Detection (1)
14:50
Color compensation for underwater images of battleship Yamato

ABSTRACT. The battleship Yamato, sunk to the seabed, was photographed in a diving survey by the Kure Maritime Museum of Maritime History and Science. However, the images appear in colors different from the true colors due to the scattering and absorption of light by seawater, so color correction is necessary to obtain the correct appearance for surveying and studying the current state of Yamato. In this study, for color correction, we use photogrammetry to recover the three-dimensional shape of Yamato and estimate the scattering and absorption parameters of seawater from the results.

15:10
EEG-Based BCI System Using Deep Learning to Control PC Mouse

ABSTRACT. Brain-computer interface (BCI) technology, which interfaces the human brain with computers, has proven its advantages in many studies. In this research, a BCI system is implemented by interpreting human thoughts as machine instructions, exploiting electroencephalography (EEG) signals extracted from the scalp using relatively cheap, non-invasive mobile EEG devices. Currently, we have classified four mental-activity classes for mouse control in four directions (down, left, right, and up) using an OpenBCI (8-channel) device. The convolutional neural network (CNN) classifier achieved a testing accuracy of 97.50%.

15:30
Open Vocabulary 3D Multiple Object Tracking through Stereo Camera

ABSTRACT. In recent years, 3D multiple object tracking has garnered significant attention from researchers, as understanding the surrounding environment is crucial for ensuring the safe operation of systems like autonomous vehicles and robotics. Previous research often focused on targets in specific categories such as pedestrians or vehicles, utilizing LiDAR or monocular cameras for 3D object tracking. However, these approaches fell short of providing a comprehensive understanding of the environment, confining tracking within predefined boundaries. Furthermore, methods relying on LiDAR or monocular cameras each have their limitations, whether in terms of cost or accuracy. In this paper, we propose an open vocabulary 3D multiple object tracking approach based on stereo cameras, capable of tracking any object in the world coordinate system, to better assist systems in comprehending their surroundings. Our method is cost-effective and yields precise results, and the code is publicly available at https://gitlab.com/syo093c/tas3d.

15:50
Guidance Map Transformation for Lighting Control in Patch-Based Style Transfer

ABSTRACT. Existing methods, like texture projection, create stylized shading for 3D animations but may introduce deforming artifacts, especially in high-frequency patterns. In this study, we explore guided patch-based style transfer to animate high-frequency details while minimizing such artifacts. We use common normal maps as guidance for practical lighting control through normal map transformations. We also investigate local editing, posterization, and normal map smoothing to expand the range of resulting animations. To view our results, please watch the video at https://shorturl.at/qrDEM

16:25-17:45 Session 6A: Computer Vision & 3D Image Processing (2)
Location: F. C. Room
16:25
Tube-NeRF: Fast Synthesis of Stereo Novel Views Observed from within a Long Passage

ABSTRACT. We present a NeRF-based system that quickly synthesizes stereo novel views in large-scale environments. It decomposes a scene into tube-shaped segments to train multiple networks in parallel and synthesizes novel views quickly by selecting the most relevant network. This segment shape helps replicate smooth walks and rides, since the camera origin usually moves along a curve. The system has four main speed optimizations: interpolation between cached points on the curve, the use of a kd-tree, parallel training, and CUDA support. The first two improve network-selection time, while the rest speed up the networks themselves.
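
The network-selection step can be sketched as a nearest-point query against cached samples of the camera curve. The helix path, segment size, and linear scan below are our own illustrative choices; in the actual system a kd-tree would replace the linear scan:

```python
import numpy as np

# Illustrative sketch (not the paper's code): pick the tube segment whose
# cached curve point is closest to the current camera position, then use
# that segment's network. A kd-tree accelerates the argmin for large caches.

def cache_curve_points(t_samples):
    """Sample points along a hypothetical camera path (a helix here)."""
    return np.stack([np.cos(t_samples), np.sin(t_samples), 0.1 * t_samples], 1)

def select_network(camera_pos, cached_pts, pts_per_segment=10):
    """Return the index of the tube segment owning the nearest cached point."""
    d2 = np.sum((cached_pts - camera_pos) ** 2, axis=1)
    return int(np.argmin(d2)) // pts_per_segment

pts = cache_curve_points(np.linspace(0.0, 6.0, 60))  # 6 segments of 10 points
cam = np.array([np.cos(3.05), np.sin(3.05), 0.305])  # camera near t = 3.05
seg = select_network(cam, pts)
```

Because consecutive frames move the camera only slightly along the curve, the selected segment changes rarely, which keeps the per-frame selection cost low.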

16:45
Quantifying the Effect of Image Transformation Using Hypersphere Embedding
PRESENTER: Daiki Ishiguro

ABSTRACT. Recently, unsupervised learning, which does not require supervised labels, has made remarkable progress. In particular, contrastive learning self-generates supervisory information by comparing similarities between transformed images. However, determining a better transformation method is difficult because many factors, such as hyperparameters and the downstream tasks to be applied, must be considered. We focus on the relationship between the distribution of feature embeddings and the model's ability to acquire representations, and show that effective transformation methods can be evaluated independently of computationally expensive downstream tasks by defining two quantitative metrics, Uniformity and Embedding Similarity, on the unit hypersphere.
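
A common way to quantify uniformity on the hypersphere is the Gaussian-potential form log E[exp(-t·||x−y||²)] of Wang and Isola; the abstract's exact definitions of Uniformity and Embedding Similarity may differ, so treat this as an illustrative sketch:

```python
import numpy as np

# Sketch of hypersphere metrics in the spirit of the abstract. The
# uniformity formula follows the common log E[exp(-t * ||x - y||^2)]
# formulation; the paper's own definitions may differ in detail.

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def uniformity(emb, t=2.0):
    """Lower is more uniform. emb: (N, D) embeddings on the unit sphere."""
    n = emb.shape[0]
    d2 = np.sum((emb[:, None, :] - emb[None, :, :]) ** 2, axis=-1)
    iu = np.triu_indices(n, k=1)           # distinct pairs only
    return float(np.log(np.mean(np.exp(-t * d2[iu]))))

def embedding_similarity(a, b):
    """Mean cosine similarity between paired embeddings of two views."""
    return float(np.mean(np.sum(normalize(a) * normalize(b), axis=1)))

rng = np.random.default_rng(0)
spread = normalize(rng.standard_normal((256, 16)))               # ~uniform
clumped = normalize(rng.standard_normal((256, 16)) * 0.01 + 1.0)  # collapsed
```

A well-spread embedding scores lower (better) uniformity than a collapsed one, which is exactly the signal that lets transformations be ranked without running any downstream task.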

17:05
Auxiliary selection: optimal selection of auxiliary tasks using deep reinforcement learning

ABSTRACT. Learning with auxiliary tasks is a form of multi-task learning that improves performance on a target task by learning auxiliary tasks simultaneously. However, this requires that the auxiliary task actually be effective for the target task. It is very difficult to determine in advance whether a designed auxiliary task will be effective, and the effective auxiliary task changes dynamically with the learning status of the target task. We therefore propose Auxiliary Selection, an auxiliary-task selection mechanism based on deep reinforcement learning. We confirmed the effectiveness of our method by introducing it into UNREAL, a method that has achieved high agent performance through auxiliary tasks.
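
The selection problem can be pictured as choosing, at each interval, the auxiliary task whose past use most improved the target task. The paper uses deep reinforcement learning; the epsilon-greedy bandit below is a deliberately simplified stand-in with hypothetical task names (the three UNREAL-style auxiliary tasks are used only as labels):

```python
import random

# Simplified sketch of the selection idea, NOT the paper's deep-RL method:
# treat each auxiliary task as a bandit arm whose reward is the observed
# improvement of the target task after training alongside it.

class AuxiliarySelector:
    def __init__(self, tasks, eps=0.1, seed=0):
        self.tasks = list(tasks)
        self.eps = eps
        self.value = {t: 0.0 for t in tasks}  # running mean of improvement
        self.count = {t: 0 for t in tasks}
        self.rng = random.Random(seed)

    def select(self):
        if self.rng.random() < self.eps:      # occasional exploration
            return self.rng.choice(self.tasks)
        return max(self.tasks, key=lambda t: self.value[t])

    def update(self, task, improvement):
        """improvement: change in target-task performance after one
        interval of joint training with `task`."""
        self.count[task] += 1
        self.value[task] += (improvement - self.value[task]) / self.count[task]

sel = AuxiliarySelector(["pixel_control", "reward_prediction", "value_replay"])
# hypothetical feedback: pretend reward_prediction helps the target most
for t in sel.tasks:
    sel.update(t, 1.0 if t == "reward_prediction" else 0.1)
best = max(sel.value, key=sel.value.get)
```

Because the value estimates are running means, a task that stops helping as training progresses is gradually abandoned, matching the abstract's point that the effective auxiliary task changes dynamically.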

17:25
Estimating Wiped Rate of Cleaning Process in Semiconductor Manufacturing Using DNN and 3D Body Tracking

ABSTRACT. Nanoscale particles can cause mechanical failures in semiconductor manufacturing, so effective procedures for cleaning the equipment are crucial. However, because there is no efficient way to evaluate cleaning motions, these procedures remain empirical and difficult to improve. In this study, we concentrate on the "wiping" motion. We introduce a deep learning approach to estimate the wiping rate, a measure used to evaluate wiping performance, by tracking engineers' motions over time with an RGB-D camera. We propose two models for this purpose: an Autoencoder (MLP)–LSTM model and an Autoencoder (LSTM)–MLP model. In our experiments, these models are compared with each other and evaluated against a standalone LSTM.

16:25-17:45 Session 6B: Artificial intelligence and deep learning
Location: ATI Room
16:25
Learning Weight Parameter Distribution in Neural Network

ABSTRACT. Weight and bias parameters in a neural network play an important role in finding a sub-optimal solution that minimizes a loss function. Typical optimizers, such as Adam and SGD, control how these parameters converge toward optima, but they often reach undesirable ones, because there are many local minima around the best solution. Our approach to avoiding such local minima is to replace a single neuron with a combination of neurons that supports a distribution over the original parameters. The combination of neurons contributes to the final output in proportion to the probability randomly assigned to each neuron during learning, and the center of their vectors behaves as the parameter vector of a single neuron at inference. Consequently, the hyperspace spanned by the parameter vectors of the combined neurons acts like a distribution over the original parameter vectors. We evaluated the proposed combination of neurons on handwritten-digit classification with MNIST and found that it converged to better solutions than the original.

16:45
FoodGAN: Realistic Cuisine Image Synthesis with Multi-Scale GANs

ABSTRACT. This paper proposes a novel approach that employs a Multi-Scale Generative Adversarial Network (GAN) to produce high-quality and realistic cooking images from ingredient and instruction texts. Unlike the generation of distinctive object images with well-defined shapes, cooking images are inherently complex and encompass a variety of ingredients. This complexity necessitates advanced image generation techniques. Our proposed model integrates text and image features within a hierarchical structure, facilitating the generation of images at multiple resolutions. Experimental results from the Recipe1M dataset underscore the efficacy of our approach, striking a balance between image quality and diversity as evidenced by the Inception Score. This research further contributes to the progression of the field of computational food analysis.

17:05
Deep Neural Network 3D Reconstruction Using One-Shot Color Mapping of Light Ray Direction Field

ABSTRACT. In many manufacturing processes, real-time inspection of microscale three-dimensional (3D) surfaces is crucial. A method integrating deep neural networks (DNNs) is thus proposed for obtaining a microscale 3D surface from a single image, captured by an imaging system equipped with a multicolor filter. This system can determine light ray directions in the field of view through one-shot color mapping. Even for a 3D surface with microscale height variations, the system can assign light ray directions to different colors. Assuming a smooth and continuous surface, the 3D shape of the surface can be reconstructed from a single captured image using DNNs, without the need for training data. The DNNs can calculate the 3D surface by solving a nonlinear partial differential equation, representing the relationship between the height distribution of the surface and light ray directions.
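
The underlying relation can be illustrated in a simplified form: if the color mapping yields the surface gradient at each pixel, the height map is recoverable by integration. The naive cumulative integration below is our own toy stand-in for the DNN-based PDE solver the abstract describes, using a synthetic planar surface:

```python
import numpy as np

# Toy sketch of the height/ray-direction relation, not the paper's method:
# given gradients (p, q) = (dh/dx, dh/dy) decoded from the color mapping,
# recover the height map h up to a constant by cumulative integration.
# The paper instead trains DNNs to solve the corresponding PDE directly,
# without training data.

def integrate_gradients(p, q):
    """Recover height h (up to a constant) from p = dh/dx, q = dh/dy."""
    h = np.zeros_like(p)
    h[0, :] = np.cumsum(p[0, :])                       # along the first row
    h[1:, :] = h[0, :] + np.cumsum(q[1:, :], axis=0)   # then down each column
    return h

# synthetic planar surface h(x, y) = x + 2*y on an n x n grid
n = 32
y, x = np.mgrid[0:n, 0:n].astype(float)
p = np.ones((n, n))        # dh/dx = 1
q = 2.0 * np.ones((n, n))  # dh/dy = 2
h_rec = integrate_gradients(p, q)
```

A DNN-based solver has the advantage of enforcing the smoothness assumption globally instead of accumulating error along a single integration path, which is one motivation for the approach in the abstract.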

17:25
Data augmentation for image and clinical data in classification of parotid tumors by deep learning

ABSTRACT. In surgery for parotid tumors, the preoperative diagnosis of benign versus malignant is important because handling of the facial nerve depends on the histological type. However, even MRI, considered the most useful preoperative imaging modality, does not achieve high sensitivity (58-81%) because of the variety of histological types. We therefore propose a deep learning model that combines MRI image data with clinical data, together with an augmentation method for each type of data. We report experimental results from 5-fold cross-validation on 151 datasets, along with the efficacy of the proposed approach.

16:25-17:45 Session 6C: Computer Vision & 3D Image Processing (3)
16:25
Development of Interaction Methods Using Physical Objects with a Stereoscopic Mid-air 3D Image

ABSTRACT. These days, mid-air imaging using a Micro Mirror Array Plate (MMAP) has attracted attention, and we have been developing methods to display stereoscopic mid-air 3DCG objects and interact with them. In this paper, we propose two novel methods for interacting with mid-air 3DCG objects using physical objects. One is interactive observation of a cross-section of the mid-air 3DCG object using a physical plate: when a user places the plate in the mid-air 3DCG object, the user can see inside the object as if it had been sliced by the plate. The other is interaction with a mid-air 3DCG object using a transparent glass: by simulating the properties of liquids with the mid-air 3DCG object, it is possible to create an interaction that makes it seem as if there is liquid inside the glass. Using physical objects for interaction is expected to enhance the sense of presence of mid-air 3DCG objects.
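
Geometrically, the plate interaction amounts to clipping the displayed object against the tracked plane of the physical plate. This is a hypothetical point-cloud sketch of that step (function and parameter names are ours, not the authors'):

```python
import numpy as np

# Hypothetical sketch of the slicing interaction: given the tracked pose of
# the physical plate as a plane (point + normal), keep the points on one
# side (the cut-away view) plus the points near the plane (the cross-section).

def slice_object(points, plane_point, plane_normal, thickness=0.01):
    n = np.asarray(plane_normal, float)
    n /= np.linalg.norm(n)
    d = (points - plane_point) @ n            # signed distance to the plane
    visible = points[d <= 0]                  # half kept after the cut
    section = points[np.abs(d) < thickness]   # band forming the cross-section
    return visible, section

rng = np.random.default_rng(1)
cloud = rng.uniform(-1, 1, size=(1000, 3))    # stand-in for the 3DCG object
vis, sec = slice_object(cloud, plane_point=np.zeros(3), plane_normal=[0, 0, 1])
```

Re-running the clip each frame with the plate's latest tracked pose is what makes the cross-section feel interactive, as if the mid-air object were physically sliced.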

16:45
Frame-break: Enhancing stereoscopic effect of an anamorphosis-based naked-eye 3D imaging system

ABSTRACT. We have been developing a naked-eye stereoscopic imaging system based on a classic design technique called anamorphosis. The system, however, suffers from a problem: depending on the user's viewpoint, parts of the displayed objects may fall outside the rendering area and go missing, reducing the stereoscopic effect. We attempt to address this problem by incorporating into the system the so-called frame-break 3D effect, which has recently been used in film-making. This paper reports the results of an initial empirical study of how the proposed approach works.

17:05
Portable See-through AR using Parallax Barrier-based Autostereoscopic 3D Tablet

ABSTRACT. Autostereoscopic 3D see-through AR is a technology that displays a stereo image by superimposing a virtual stereo image on a real stereo image. We have made this technology portable and evaluated the accuracy of depth-position estimation. We use a parallax barrier-based autostereoscopic 3D tablet with viewpoint tracking. Stereo images are generated in real time from RGBD images captured by the tablet's built-in depth camera, and CG stereo images are superimposed to realize portable autostereoscopic 3D see-through AR. Some subjects showed improved accuracy in depth-position estimation with the autostereoscopic 3D display compared with the 2D display.

17:25
Automatic Rigging for 3D Human Scanning Data using Edge Convolution
PRESENTER: Yiqing Li

ABSTRACT. Nowadays, 3D human models are used in various areas, such as virtual fitting, special effects in movies, and 3D games. Skeleton and skin-deformation information is very important for achieving natural human movements, but 3D human scanning data are too complex for rigging systems to estimate the skeleton precisely. To make matters worse, clothes conceal the surface of the body in many ways, preventing rigging systems from estimating joint positions accurately. The difficulty of learning skeletal structures for different poses is another problem to be solved. This study proposes an automatic rigging method using neural networks. The dataset comprises scans of individual humans in different outfits and poses. By comparing and adjusting skeletal structures across various poses, the method optimizes joint positions more precisely.