Reproducing the Appearance of Metallic Materials by Capturing Long-Range Dependencies using Image-to-Image Translation Network with Vision Transformer

Authors: Kaito Kojima, Taishi Iriyama and Takashi Komuro

Conference: CGI 2025

Tags: Generative Adversarial Network, Illumination Consistency, Image-to-Image Translation, Material Appearance, Vision Transformer

Abstract:

Seamlessly integrating virtual objects into real scenes is a key challenge in Augmented Reality, and meeting it requires globally consistent estimates of the lighting conditions in the real scene. In this study, we propose a method for reproducing the appearance of metallic materials using an image-to-image translation network that incorporates a Vision Transformer (ViT). The network receives an image consisting of a background and the normal map of a virtual object, and transforms the normal map into a virtual object with a metallic material appearance. Specifically, a ViT, which effectively captures long-range dependencies in the input image, is introduced into the network's encoder to extract lighting information, and the network generates an output image in which the virtual object's appearance reflects the extracted lighting information. We created a synthetic dataset and compared images generated by the proposed method with those generated by a CNN-based image-to-image translation network. The proposed method outperformed the CNN-based method on three quantitative metrics and reproduced a more natural appearance of metallic materials, consistent with the lighting conditions.
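
The input format and the role of self-attention described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' network: the image size, patch size, embedding width, random weights, and the helper names `compose_input`, `patchify`, and `self_attention` are all assumptions made for illustration. It only shows how a background and a normal map might be combined into one image, and why a single attention layer over image patches lets every patch draw on every other patch, which is the long-range property a ViT encoder relies on.

```python
import numpy as np

def compose_input(background, normal_map, mask):
    """Paste the object's normal map over the background where mask == 1 (assumed input layout)."""
    m = mask[..., None].astype(background.dtype)
    return background * (1.0 - m) + normal_map * m

def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patch tokens."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over all patch tokens."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over all patches
    return weights @ v, weights

# Hypothetical sizes and random data, chosen only to keep the example small.
rng = np.random.default_rng(0)
H = W = 32
P = 8    # patch size
D = 16   # embedding width

background = rng.random((H, W, 3))
normal_map = rng.random((H, W, 3))
mask = np.zeros((H, W))
mask[8:24, 8:24] = 1  # square region standing in for the virtual object

x = compose_input(background, normal_map, mask)
tokens = patchify(x, P)  # one token per 8x8 patch
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(tokens.shape[1], D)) for _ in range(3))
out, weights = self_attention(tokens, w_q, w_k, w_v)
```

Each row of `weights` is a distribution over all patches, so a patch on the object can attend to background patches anywhere in the image within a single layer, whereas a convolution only sees a local neighborhood per layer.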