Zoom Link: https://uni-sydney.zoom.us/j/85070505056?pwd=OHN3dm9RS2JKTHhONVlOQjVaMjgwUT09 Meeting ID: 850 7050 5056, Password: cgi2023
Image Analysis and Processing 1
08:30 | A Multi-label Privacy-preserving Image Retrieval Scheme based on Object Detection for Efficient and Secure Cloud Retrieval PRESENTER: Jing Cui ABSTRACT. With the development of self-media, the burden of client-side computation and storage of massive data has become increasingly heavy. Additionally, considering the presence of sensitive information in images, image owners commonly adopt the practice of encrypting images before storing them in the cloud. However, encrypted image retrieval faces the challenge of striking a balance between security and efficiency. To address this issue, a Multi-label Privacy-preserving Image Retrieval scheme based on Object Detection (MPIR-OD) is proposed. Firstly, image labels are extracted using object detection techniques. Then, frequent itemsets of labels are discovered through mining label association rules, and they are matched and classified with the previously extracted image labels to construct an index. Lastly, the Asymmetric Scalar-product Preserving Encryption (ASPE) is employed to encrypt image feature vectors, ensuring the privacy of the images and enabling secure K-Nearest Neighbor (KNN) operations using the ASPE algorithm. Compared to existing schemes, the MPIR-OD scheme achieves an approximately six-fold reduction in retrieval time and an improvement in retrieval accuracy of around 15%. |
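To make the ASPE step concrete, the following is a minimal sketch of the basic scalar-product-preserving idea (without the vector splitting and dimension extension that the full scheme adds); the dimension, key, and feature vectors are illustrative assumptions rather than the paper's configuration.

```python
# Minimal sketch of the basic ASPE idea used for secure KNN (simplified:
# no vector splitting or dimension extension, which the full scheme adds).
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # feature dimension (hypothetical)
M = rng.normal(size=(d, d))             # secret invertible key matrix
M_inv = np.linalg.inv(M)

def encrypt_db_vector(p):
    """Encrypt a stored image feature vector."""
    return M.T @ p

def encrypt_query(q):
    """Encrypt a query feature vector with the inverse key."""
    return M_inv @ q

# Scalar products are preserved: (M^T p) . (M^-1 q) = p . q,
# so the cloud can rank encrypted vectors for KNN without seeing them.
p, q = rng.normal(size=d), rng.normal(size=d)
assert np.isclose(encrypt_db_vector(p) @ encrypt_query(q), p @ q)
```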
08:42 | AMCNet: Adaptive Matching Constraint for Unsupervised Point Cloud Registration PRESENTER: Zhuohan Xiao ABSTRACT. The registration of 3D point clouds is a crucial challenge in computer vision with numerous applications in robotics, medical imaging, and other industries. However, due to the lack of accurate data annotation, the performance of unsupervised point cloud registration networks is often unsatisfactory. In this paper, we propose an unsupervised method based on generating corresponding points and utilizing structural constraints for rigid point cloud registration. The objective is to optimize the similarity matrix using the neighborhood score of matching point pairs, and the feature extractor is designed to capture better features by constraining the structural difference between the source neighborhoods and the predicted neighborhoods. The key components in our approach are the similarity optimization module and the structure variation checking module. In the similarity optimization module, we improve the similarity matrix by adaptively weighting the matching scores of neighbors. Through this method, the spatial information of matching point pairs can be fully utilized, resulting in high-quality correspondence estimations. We observe that the solution of the rigid transformation matrix is easily affected by incorrect matching point pairs, while the predicted point cloud is crucial for constructing accurate correspondences. Therefore, we develop a structure variation checking module to constrain the predicted point cloud and the source point cloud to have similar structural information. Based on these constraints, the extraction network is continuously optimized and adjusted to obtain even better features. The extensive experimental results show that our method achieves state-of-the-art performance when compared with other supervised and unsupervised methods on the ModelNet40 dataset, and significantly outperforms previous methods on the real-world indoor 7Scenes dataset. |
08:54 | COCCI: Context-Driven Clothing Classification Network PRESENTER: Shuqing Liu ABSTRACT. Clothing classification aims to obtain labels for any given clothing item and serves as a fundamental task for clothing retrieval, clothing recommendation, and other related applications. Its potential commercial value has attracted widespread attention from researchers. In this task, there are two inherent challenges: suppressing complex backgrounds outside the clothing region and disentangling the feature entanglement of shape-similar clothing samples. These challenges arise from insufficient attention to key distinctions of clothing, which hinders the accuracy of clothing classification. In addition, the high computational resource requirements of some complex, large-scale models decrease inference efficiency. To tackle these challenges, we propose a new context-driven clothing classification network (COCCI), which improves inference accuracy while reducing model complexity. First, we design a self-adaptive attention fusion (SAAF) module to enhance category-exclusive clothing features and prevent misclassification by suppressing ineffective features that have confused image contexts. Second, we propose a novel multi-scale feature aggregation (MSFA) module to establish spatial context correlations by using multi-scale clothing features. This helps disentangle feature entanglement among shape-similar clothing samples. Finally, we introduce knowledge distillation to extract reliable teacher knowledge from complex datasets, which helps student models learn clothing features with rich representation information, thereby improving generalization and classification accuracy while reducing model complexity. In comparison to state-of-the-art networks trained with a single model, COCCI shows a significant improvement of 5.47% in top-1 accuracy on the Clothing1M dataset for images with complex backgrounds. Moreover, COCCI achieves an improvement of up to 6.4% in top-1 accuracy on the DeepFashion dataset. Experimental results demonstrate that our method achieves SOTA performance on the widely-used clothing classification benchmark. |
09:06 | Hierarchical Edge Aware Learning for 3D Point Cloud ABSTRACT. This paper proposes an innovative approach to Hierarchical Edge Aware 3D Point Cloud Learning (HEA-Net) that seeks to address the challenges of noise in point cloud data, and improve object recognition and segmentation by focusing on edge features. In this study, we present an innovative edge-aware learning methodology, specifically designed to enhance point cloud classification and segmentation. Drawing inspiration from the human visual system, the concept of edge-awareness has been incorporated into this methodology, contributing to improved object recognition while simultaneously reducing computational time. Our research has led to the development of an advanced 3D point cloud learning framework that effectively manages object classification and segmentation tasks. A unique fusion of local and global network learning paradigms has been employed, enriched by edge-focused local and global embeddings, thereby significantly augmenting the model's interpretative prowess. Further, we have applied a hierarchical transformer architecture to boost point cloud processing efficiency, thus providing nuanced insights into structural understanding. Our approach demonstrates significant promise in managing noisy point cloud data and highlights the potential of edge-aware strategies in 3D point cloud learning. The proposed approach is shown to outperform existing techniques in object classification and segmentation tasks, as demonstrated by experiments on ModelNet40 and ShapeNet datasets. |
09:18 | Neural Differential Radiance Field: Learning the Differential Space Using a Neural Network PRESENTER: Saeed Hadadan ABSTRACT. We introduce an adjoint-based inverse rendering method using a Neural Differential Radiance Field, i.e. a neural network representation of the solution of the differential rendering equation. Inspired by neural radiosity techniques, we minimize the norm of the residual of the differential rendering equation to directly optimize our network. The network is capable of outputting continuous, view-independent gradients of the radiance field w.r.t scene parameters, taking into account differential global illumination effects while keeping memory and time complexity constant in path length. To solve inverse rendering problems, we simultaneously train networks to represent radiance and differential radiance, and optimize the unknown scene parameters. |
09:30 | Aware-Transformer: A Novel Pure Transformer-based Model for Remote Sensing Image Captioning PRESENTER: Jialuo Yan ABSTRACT. Remote sensing image captioning (RSIC) is the task of generating accurate and coherent descriptions of the visual content in remote sensing images. While recent progress has been made in developing CNN-Transformer based models for this task, given the significant scale differences in the visual objects within these images, many existing methods still have some deficiencies in effectively capturing the multiscale visual features of these images. Additionally, applying these visual features directly to a vanilla Transformer architecture may result in the loss of important visual information. To address these challenges, we propose a novel pure Transformer-based model that first utilizes a fine-tuned Swin-Transformer as the encoder to extract multiscale visual features from remote sensing images. Then it introduces an Aware-Transformer as the decoder, which enhances multiscale and multiobject visual information to help generate accurate and detailed captions. To assess the performance of our proposed method, we conducted ablation and comparison experiments on three publicly available RSIC datasets: Sydney-Captions, UCM-Captions, and NWPU-Captions. The results demonstrate that our method outperforms state-of-the-art RSIC models in captioning quality. |
09:42 | Blind image quality assessment method based on DeepSA-Net PRESENTER: Jingyi Li ABSTRACT. Blind image quality assessment refers to the accurate prediction of the visual quality of any input image without a reference image. With the rapid growth of the number of images and increasing requirements for image quality, how to assess image quality has become an urgent problem. Complex images are difficult to consider professionally from a single perspective. A blind image quality assessment algorithm based on a deep semantic adaptation network (DeepSA-Net) is proposed. Based on the end-to-end deep learning model, the semantic pre-trained models and multi-resolution adaptive module are added, and the adaptive factor α is proposed to better capture global and local quality information and fuse multi-resolution features to improve the convergence ability and speed of the network. Finally, the quality assessment results of images are obtained by regression. The experiment used the Spearman Correlation Coefficient and Pearson Correlation Coefficient as assessment indicators. The results showed that DeepSA-Net outperformed most current methods in real distortion scene databases and had excellent assessment ability in synthetic distortion databases. In addition, ablation study and different distortion studies were designed to fully validate the effectiveness and feasibility of the DeepSA-Net algorithm. |
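As a quick illustration of the two assessment indicators named above, the sketch below computes SROCC and PLCC between hypothetical predicted scores and subjective ratings; the numbers are made up and the metric definitions are the standard SciPy ones, not code from the paper.

```python
# Spearman rank-order and Pearson linear correlation between predicted and
# subjective quality scores, the two indicators used to evaluate DeepSA-Net.
from scipy.stats import spearmanr, pearsonr

predicted = [62.1, 45.3, 78.9, 30.2, 55.0]   # hypothetical model outputs
subjective = [60.0, 50.2, 80.1, 28.7, 57.3]  # hypothetical MOS labels

srocc, _ = spearmanr(predicted, subjective)
plcc, _ = pearsonr(predicted, subjective)
print(f"SROCC={srocc:.3f}, PLCC={plcc:.3f}")
```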
09:54 | Deep Feature Learning for Image-based Kinship Verification ABSTRACT. Facial image-based kinship verification is one of the challenging tasks in computer vision. This task has many potential applications, such as combating human trafficking, studying human genetics, generating family maps, family photo albums, etc. The main obstacle in practice is that there is always a great difference between the images of parents and children. Therefore, we propose a deep feature learning method (DFLKV) which can extract more discriminative features for kinship verification. For a pair of facial images, we first design a network with multi-scale channel attention for feature extraction; then, we select four methods for feature fusion; finally, we infer kinship based on the fused features. We construct the final loss by jointly adopting the contrastive loss and the binary cross-entropy loss to compute the matching degree for paired samples. The experimental results on the KinFaceW-I, KinFaceW-II, Cornell KinFace and TS KinFace datasets validate the effectiveness of our approach. |
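A hedged sketch of the kind of joint objective described above, combining a contrastive term on the paired embeddings with binary cross-entropy on the kinship prediction; the function name, margin, and weighting factor alpha are assumptions for illustration only.

```python
# Joint contrastive + BCE loss for kinship verification (illustrative only).
import torch
import torch.nn.functional as F

def kinship_loss(emb_a, emb_b, logit, label, margin=1.0, alpha=0.5):
    # label: 1 for kin pairs, 0 for non-kin pairs
    dist = F.pairwise_distance(emb_a, emb_b)
    contrastive = label * dist.pow(2) + (1 - label) * F.relu(margin - dist).pow(2)
    bce = F.binary_cross_entropy_with_logits(logit, label.float())
    return alpha * contrastive.mean() + (1 - alpha) * bce
```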
10:06 | Efficient Semantic-Guidance High-resolution Video Matting PRESENTER: Ding Li ABSTRACT. Video matting has made significant progress in the trimap-based field. However, researchers are increasingly interested in auxiliary-free matting because it is more useful in real-world applications. Semantic features can play an important role in improving video matting results. However, the size and speed of current semantic-guidance methods suffer as a result of over-bloated network architectures. We propose a new efficient semantic-guidance high-resolution video matting network. This network maintains efficiency while improving the comprehension of semantic features. We still apply the convolutional network as the backbone while also employing a transformer in the encoder. The transformer is used as a submodule to provide semantic features to help the convolutional network while ensuring that the network is not overly bloated. Two cross-attention modules are used to implement semantic feature adjustment and guidance. In addition, a channel-wise attention mechanism is introduced in the decoder to improve the representation of semantic features. In comparison to the current state-of-the-art methods, the method proposed in this paper achieves better results while maintaining the speed and efficiency of prediction. We can complete real-time auxiliary-free matting for high-resolution video (4K or HD). |
10:18 | Segment Any Building ABSTRACT. The identification and segmentation of buildings in remote sensing imagery has consistently been an important focus of academic research. This work highlights the effectiveness of using diverse datasets and advanced representation learning models for the purpose of building segmentation in remote sensing images. By fusing various datasets, we have broadened the scope of our learning resources and achieved exemplary performance across several datasets. Our innovative joint training process demonstrates the value of our methodology in various critical areas such as urban planning, disaster management, and environmental monitoring. Our approach, which involves combining dataset fusion techniques and prompts from pre-trained models, sets a new precedent for building segmentation tasks. The results of this study provide a foundation for future exploration and indicate promising potential for novel applications in the field of building segmentation. |
Zoom Link: https://uni-sydney.zoom.us/j/85464507535?pwd=OEQ1TVZsR2ZOSEo4cTB3MitLR25CQT09 Meeting ID: 854 6450 7535, Password: cgi2023
Image Analysis and Processing 2
08:30 | A Novel Zero-Watermarking Algorithm based on Texture Complexity Analysis PRESENTER: Xiaochao Wang ABSTRACT. To address the problem that most existing watermarking algorithms cannot effectively resist complex attacks, we propose a novel zero-watermarking algorithm based on texture complexity analysis. First, we calculate the standard deviation map of the host image by the spatially selective texture method and obtain the optimal target regions (OTRs) by clustering the binary standard deviation map. To improve the robustness of the proposed algorithm, we use singular value decomposition (SVD) to extract multiple feature sequences from the OTRs. Then, these robust feature sequences are binarized to generate multiple feature images. For the watermark image, we apply chaotic mapping to encrypt it and ensure its security. Finally, we perform an exclusive-or (XOR) operation on each of the extracted feature images with the encrypted watermark image to construct multiple zero watermarks, which are saved at the Copyright Certification Center to protect the copyright of the image. A large number of experimental results show that the newly-proposed algorithm not only has good distinguishability, but also has high robustness to complex attacks. Compared with existing watermarking algorithms, our proposed algorithm has advantages in invisibility, robustness and security. |
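The zero-watermark construction step can be illustrated with the minimal sketch below, which binarizes an SVD-based feature image and XORs it with the encrypted watermark bits; the block size, binarization rule, and names are assumptions, and OTR selection and chaotic encryption are omitted.

```python
# Minimal sketch of a zero-watermark construction step: per-block SVD feature,
# binarization, then XOR with encrypted watermark bits (illustrative only).
import numpy as np

def feature_image(region, block=8):
    """Largest singular value per block, binarized against the median."""
    h, w = (region.shape[0] // block) * block, (region.shape[1] // block) * block
    sv = np.array([[np.linalg.svd(region[i:i + block, j:j + block], compute_uv=False)[0]
                    for j in range(0, w, block)]
                   for i in range(0, h, block)])
    return (sv > np.median(sv)).astype(np.uint8)

def zero_watermark(region, encrypted_wm_bits):
    feat = feature_image(region)
    return np.bitwise_xor(feat, encrypted_wm_bits[:feat.shape[0], :feat.shape[1]])
```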
08:42 | Video-Based Self-Supervised Human Depth Estimation PRESENTER: Xiaoyan Zhang ABSTRACT. In this paper, we propose a video-based method for self-supervised human depth estimation, aiming at the problem of joint point distortion in human depth and insufficient utilization of 3D information in video-based depth estimation. We use the relative ordinal relations between human joint point pairs to deal with the problem of joint point distortion. Meanwhile, a temporal correlation module is proposed to focus on the temporal correlation between past and present frames, taking into account the influence of temporal characteristics in the video sequence. A hierarchical structure is adopted to fuse adjacent features, thus fully mining the 3D information in the video. The experimental results show that this model significantly improves human depth estimation performance, especially at the joints. |
08:54 | TSC-Net: Theme-Style-Color guided Artistic Image Aesthetics Assessment Network PRESENTER: Yin Wang ABSTRACT. Image aesthetic assessment is a hot issue in current research, but less research has been done in the art image aesthetic assessment field, mainly due to the lack of large-scale artwork datasets. The recently proposed BAID dataset fills this gap and allows us to delve into the aesthetic assessment methods of artworks, and this research will contribute to the study of artworks and can also be applied to real-life scenarios, such as art exams, to assist in judging. In this paper, we propose a new method, TSC-Net (Theme-Style-Color guided Artistic Image Aesthetics Assessment Network), which extracts image theme information, image style information, and color information and fuses general aesthetic information to assess art images. Experiments show that our proposed method outperforms existing methods using the BAID dataset. |
09:06 | Weakly Supervised Method for Domain Adaptation in Instance Segmentation PRESENTER: Yan Tian ABSTRACT. Instance segmentation is an active research area in the signal processing field. The domain adaptation of a segmentation model can be improved by introducing supervision signals from a target dataset. However, manual annotation is tedious and time-consuming, and self-training contains too much pseudolabel noise. Inspired by weakly supervised methods, we propose a method to handle these domain adaptation challenges by limited verification signals. Labels of relevant samples are updated by label propagation. First, we construct semantic trees to explore the relation between samples by using a clustering method. Then, we verify and propagate reliable pseudolabels to their corresponding unreliable labels, which improves our instance segmentation model by employing the updated samples. Experiments on public datasets demonstrate that the proposed approach is competitive with state-of-the-art approaches. |
09:18 | PRESENTER: Sheng Yu ABSTRACT. With the increasing emphasis on green development, garbage classification has become one of the important elements of green development. However, in scenarios where garbage stacking occurs, segmenting highly overlapping objects is difficult because the bottom garbage is in an occluded state and its contours and occlusion boundaries are usually hard to distinguish. In this paper, we propose an Op-PSA model, which uses the HTC model as the baseline and improves the modeling of the backbone network and the regions of interest using an attention model and an occlusion perception model. The Op-PSA model constructs the image as two overlapping layers and uses this two-layer structure to explicitly model occluding and occluded objects, so that their boundaries are naturally decoupled and their interactions are considered in the mask regression. It is experimentally verified that the model can effectively detect occluded garbage and improve its detection accuracy. |
09:30 | SPC-Net: structure-aware pixel-level contrastive learning network for OCTA A/V segmentation and differentiation PRESENTER: Huaying Hao ABSTRACT. Recent studies have indicated that morphological changes in retinal vessels are associated with many ophthalmic diseases, which have different impacts on arteries and veins (A/V) respectively. To this end, retinal vessel segmentation and further A/V classification are essential for quantitative analysis of related diseases. OCTA is a newly non-invasive vascular imaging technique that provides visualization of microvasculatures with higher resolution than traditional fundus imaging modality. Recently, the task of A/V classification has attracted a lot of attention in the field of OCTA imaging. However, there exist two main challenges in this task. On one hand, there is a lack of intensity information in OCTA images to differentiate between arteries and veins. On the other hand, signal fluctuations during OCTA imaging could also bring about vessel discontinuity. These challenges limit the performance of A/V classification in OCTA images. In this paper, we propose a novel Structure-aware Pixel-level Contrastive learning network (SPC-Net) for A/V classification. In the proposed SPC-Net, a latent alignment-based network is first utilized to produce a vessel segmentation map in the original OCTA images. The introduction of latent alignment could guide the model in learning more contextual information to obtain more continuous vessel segmentation results. Then a pixel-level contrast learning-based network is used to further differentiate between arteries and veins according to the topology of vessels. This network adopts a novel pixel-level contrast learning topology loss to accurately classify the vessel pixels into arteries and veins by taking full account of global semantic similarity. The experimental results demonstrate the superiority of our method compared with the existing state-of-the-art methods respectively on one public OCTA dataset and one in-house OCTA dataset. |
09:42 | MRI-GAN: Generative Adversarial Network for Brain Segmentation PRESENTER: Afifa Khaled ABSTRACT. Segmentation is an important step in medical imaging. In particular, machine learning, especially deep learning, has been widely used to efficiently improve and speed up the segmentation process in clinical practice for MRI brain images. Despite the acceptable segmentation results of multi-stage models, little attention has been paid to the use of deep learning algorithms for brain image segmentation, which could be due to the lack of training data. Therefore, in this paper, we propose MRI-GAN, a Generative Adversarial Network (GAN) model that performs segmentation of MRI brain images. Our model enables the generation of more labeled brain images from existing labeled and unlabeled images. Our segmentation targets brain tissue images, including white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF). We evaluate the performance of the MRI-GAN model using a commonly used evaluation metric, the Dice Coefficient (DC). Our experimental results reveal that our proposed model significantly improves segmentation results compared to the standard GAN model while requiring a shorter training time. |
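For reference, the Dice Coefficient (DC) used for evaluation is the standard overlap measure sketched below (not code from the paper).

```python
# Dice Coefficient between a predicted tissue mask and the ground-truth mask.
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```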
09:54 | Fast Prediction of Ternary Tree Partition for Efficient VVC Intra Coding PRESENTER: Jiamin Sun ABSTRACT. In versatile video coding (VVC) intra coding, the partition pattern depends on the rate-distortion optimization process, which is time-consuming and has a great impact on the overall coding efficiency. Hence, in this paper, a fast decision mechanism is proposed for ternary tree partition based on the LightGBM model, aiming to improve decision-making efficiency by skipping the calculation of the rate-distortion cost. Firstly, five features of each coding unit (CU) are selected based on their importance to the optimal partition pattern. Secondly, the selected five features are employed to train the LightGBM models and optimize the parameters. Finally, the trained models are embedded into the VTM 4.0 platform to predict whether to use or skip the ternary tree partition pattern for each CU. Theoretically, the proposed mechanism can effectively reduce the VVC intra coding complexity. Experiments are conducted and the results show that the proposed scheme can save 46.46% encoding time with only a 0.56% BDBR increase and a 0.03% BD-PSNR decrease compared with VTM 4.0, outperforming most of the existing major methods. |
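A hedged sketch of the fast-decision idea: a LightGBM classifier trained on a handful of CU features predicts whether the ternary-tree partition check can be skipped. The features, labels, and threshold here are placeholders, not the paper's actual feature set or model settings.

```python
# Train a LightGBM classifier on 5 CU features and use it to skip the
# ternary-tree rate-distortion check (placeholder data, illustrative only).
import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))          # 5 CU features per sample (hypothetical)
y = rng.integers(0, 2, size=1000)       # 1 = try TT partition, 0 = skip

clf = lgb.LGBMClassifier(n_estimators=100, learning_rate=0.1)
clf.fit(X, y)

def skip_tt_partition(cu_features, threshold=0.5):
    """Skip the ternary-tree RD check when the predicted probability is low."""
    return clf.predict_proba(cu_features.reshape(1, -1))[0, 1] < threshold
```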
10:06 | Large GAN is all you need PRESENTER: Kai Liu ABSTRACT. Sketch-to-portrait conversion is an emerging research area that aims to transform rough facial line sketches into highly detailed and realistic portrait images. This paper presents a comprehensive study on the impact of different loss functions and data augmentation techniques in achieving superior results using the U-Net256 network architecture. The study explores the effects of Mean Squared Error (MSE) loss, L1 loss, Generative Adversarial Network (GAN) loss, and the number of parameters on the quality of the generated portrait images. Experimental results demonstrate that the choice of loss function significantly influences the perceptual quality and accuracy of the converted portraits. While both MSE and L1 loss contribute to capturing the overall structure, GAN loss excels in generating fine-grained details. Moreover, a trade-off is observed between the number of parameters and image quality, with higher parameter counts resulting in more intricate outputs but increased computational complexity. In conclusion, this paper offers valuable insights into the sketch-to-portrait conversion task, shedding light on the effects of different loss functions and data augmentation techniques. The findings contribute to the advancement of sketch-to-portrait conversion systems, pushing the boundaries of realism and detail in generated portrait images. We finally reached FID value of 0.2184, the second in the CGI-PSG2023 leaderboard as of May 21st. All code is open-source and can be found in https://github.com/KKK-Liu/Portrait.git |
10:18 | EAID: An Eye-tracking based Advertising Image Dataset with Personalized Affective Tags PRESENTER: Song Liang ABSTRACT. Contrary to natural images with randomized content, advertisements contain abundant emotion-eliciting manufactured scenes and multi-modal visual elements with highly related semantics. However, little research has evaluated the interrelationships of advertising vision and affective perception. The absence of advertising datasets with affective labels and visual attention benchmarks is one of the most pressing issues that have to be addressed. Meanwhile, growing evidence indicates that eye movements can reveal the internal states of human minds. Inspired by these, we use a high-precision eye tracker to record the eye-movement data of 57 subjects while they observe 1000 advertising images. 7-point opinion ratings for the five advertising attributes (i.e., ad liking, emotional, aesthetic, functional, and brand liking) are then collected. We further make a preliminary analysis of the correlation among advertising attributes, subjects' visual attention, eye movement characteristics, and personality traits, obtaining a series of enlightening conclusions. To the best of our knowledge, the proposed dataset is the largest advertising image dataset based on eye tracking and with multiple personalized affective tags. It provides a new exploration space and data foundation for the multimedia visual analysis and affective computing community. The data are available at: https://github.com/lscumt/EAID. |
Zoom Link: https://uni-sydney.zoom.us/j/85070505056?pwd=OHN3dm9RS2JKTHhONVlOQjVaMjgwUT09 Meeting ID: 850 7050 5056, Password: cgi2023
Image Restoration and Enhancement
11:00 | PRESENTER: Mingfu Jiang ABSTRACT. Medical ultrasound imaging has gained widespread prevalence in human muscle and internal organ diagnosis. However, defects in the circuitry during image acquisition, operating methods, defects in the image signal transmission process or other objective factors can lead to the occurrence of speckle noise and distortion in ultrasound images. These issues not only make it challenging for doctors to diagnose diseases but can also pose difficulties in image post-processing. While traditional denoising methods are time-consuming, they are also not effective in removing speckle noise while retaining image details, leading to potential misdiagnosis. Therefore, there is a significant need to accurately and quickly denoise medical ultrasound images to enhance image quality. In this paper, we propose a flexible and lightweight deep learning denoising method for ultrasound images. Initially, we utilize a considerable number of natural images to train the convolutional neural network for acquiring a pre-trained denoising model. Next, we employ the plane-wave imaging technique to generate simulated noisy ultrasound images for further transfer learning of the pre-trained model. As a result, we obtain a non-blind, lightweight, fast, and accurate denoiser. Experimental results demonstrate the superiority of our proposed method in terms of denoising speed, flexibility, and effectiveness compared to conventional convolutional neural network denoisers for ultrasound images. |
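A hedged sketch of the two-stage idea described above: a lightweight CNN denoiser is (pre-)trained on natural images and then fine-tuned on simulated plane-wave ultrasound data. The architecture, file name, and tensors are illustrative assumptions, not the paper's model.

```python
# Stage 1: pre-train a small residual CNN denoiser on natural images.
# Stage 2: fine-tune it (transfer learning) on simulated ultrasound images.
import torch
import torch.nn as nn

class SmallDenoiser(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 1, 3, padding=1))
    def forward(self, x):
        return x - self.net(x)          # residual: the network predicts the noise

model = SmallDenoiser()
# model.load_state_dict(torch.load("natural_image_pretrain.pt"))  # stage-1 weights

opt = torch.optim.Adam(model.parameters(), lr=1e-4)
noisy = torch.rand(4, 1, 64, 64)        # stand-in for simulated ultrasound batches
clean = torch.rand(4, 1, 64, 64)
loss = nn.functional.mse_loss(model(noisy), clean)   # stage-2 fine-tuning step
loss.backward()
opt.step()
```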
11:12 | Degradation-aware Blind Face Restoration via High-quality VQ Codebook PRESENTER: Yuzhou Sun ABSTRACT. Blind face restoration, as a kind of face restoration method dealing with complex degradation, has been a challenging research hotspot recently. However, due to the influence of a variety of degradation in low-quality images, artifacts commonly exist in the low fidelity results of existing methods, resulting in a lack of natural and realistic texture details. In this paper, we propose a degradation-aware blind face restoration method based on a high-quality vector quantization (VQ) codebook to improve the degradation-aware capability and texture quality. The overall framework consists of Degradation-aware Module (DAM), Texture Refinement Module (TRM) and Global Restoration Module (GRM). DAM adopts the channel attention mechanism to adjust the weight of feature components in different channels, so that it has the ability to perceive complex degradation from redundant information. In TRM, continuous vectors are quantized and replaced with high-quality discretized vectors in the VQ codebook to add texture details. GRM adopts the reverse diffusion process of the pre-trained diffusion model to restore the image globally. Experiments show that our method outperforms state-of-the-art methods on synthetic and real-world datasets. |
11:24 | Seamless Image Editing for Perceptual Size Restoration Based on Seam Carving PRESENTER: Naohiko Ishikawa ABSTRACT. A thing of interest (ToI) in a photograph may be perceived as smaller than it is perceived in the real scene due to the discrepancy between the imaging principles of the camera and human perception. When using existing image resizing approaches to enlarge the ToI in the input image, the resulting image may have problems such as loss of distance sense, composition collapse, and failure to preserve salient object shapes. In this study, we propose a ToI resizing method based on the seam carving method. The proposed method adopts an energy function which takes image composition preservation into consideration. Furthermore, to prevent salient objects from being edited, a state-of-the-art deep learning model for salient object detection (SOD) has been adopted in the proposed method. To confirm the performance of the proposed method, a subjective evaluation experiment was conducted in this study. The experimental results show the effectiveness of the proposed method in terms of the preservation of the perceptual size and perceptual distance of the ToI. |
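A minimal sketch of a seam-carving energy map that adds a saliency penalty so that detected salient objects are protected during resizing; the gradient-based energy and the weighting factor are generic assumptions, not the paper's exact energy function.

```python
# Gradient-magnitude energy plus a saliency penalty: seams avoid high-energy
# pixels, so salient objects are left untouched (illustrative weighting).
import numpy as np

def energy_map(gray, saliency, w_sal=10.0):
    gx = np.abs(np.diff(gray, axis=1, append=gray[:, -1:]))
    gy = np.abs(np.diff(gray, axis=0, append=gray[-1:, :]))
    return gx + gy + w_sal * saliency      # high energy = seams avoid these pixels
```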
11:36 | Underwater image enhancement based on the fusion of PUIENet and NAFNet PRESENTER: Chao Li ABSTRACT. Due to light absorption and scattering in the ocean, underwater images suffer from blur and color bias, and the colors tend to be biased towards blue or green. To enhance underwater images, many underwater image enhancement (UIE) methods have been developed. Probabilistic Network for UIE (PUIENet) is a neural network model that produces good results in processing underwater images. However, it cannot handle underwater images with motion blur, which is caused by camera or object motion. Nonlinear Activation Free Network (NAFNet) is a network model designed to remove image blur by simplifying everything. Inspired by NAFNet, we simplified the convolution, activation function, and channel attention module of PUIENet, resulting in Probabilistic and Nonlinear Activation Hybrid for UIE (PNAH_UIE), which reduced training time by approximately 19% and also reduced loss. In this paper, we propose a deep learning-based method for underwater image enhancement, called Probabilistic and Nonlinear Activation Hybrid Network for UIE (PNAHNet_UIE), which integrates the two most advanced network structures, PNAH_UIE and NAFNet, to improve overall image clarity and remove motion blur. The URPC2022 dataset was used in the experiments, which comes from the "CHINA UNDERWATER ROBOT PROFESSIONAL CONTEST." PNAH_UIE was used to enhance the URPC2022 dataset, and the processed images were checked for motion blur. If the variance of an image was below a certain threshold, the NAFNet network was used to process the image, thus reducing computational pressure. |
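A sketch of the routing rule described above: estimate blur and send only low-variance (blurry) frames through the deblurring branch. The variance-of-Laplacian measure and the threshold are assumptions; the abstract only states that an image variance is compared against a threshold.

```python
# Route an image to the deblurring network only when a blur measure (here the
# commonly used variance of the Laplacian) falls below a placeholder threshold.
import cv2

def needs_deblurring(image_bgr, threshold=100.0):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var() < threshold
```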
11:48 | Infrared image enhancement for photovoltaic panels based on improved homomorphic filtering and CLAHE PRESENTER: Dongdong Xue ABSTRACT. Photovoltaic panels are extremely vulnerable to thermal imaging camera performance and other external factors such as extreme weather during the imaging process. This will result in low contrast and low illumination of the acquired infrared image, which is not conducive to the subsequent detection of photovoltaic panels. Aiming at this problem, an infrared image enhancement algorithm for photovoltaic panels based on improved homomorphic filtering and CLAHE (Contrast Limited Adaptive Histogram Equalization) is proposed. Firstly, in order to improve the overall brightness and contrast of the infrared image, a homomorphic filtering algorithm based on the improved transfer function is designed. The algorithm constructs a transfer function with a similar structure to the homomorphic filtering profile. Then, the CLAHE algorithm fused with gamma correction is used to further process the image, which overcomes the defects of weak details and uneven brightness of the image after homomorphic filtering enhancement, and improves the clarity and anti-interference of the image. The experimental results show that the comprehensive evaluation index value of the infrared image enhanced by the proposed algorithm is 50% higher than that of the original image. Compared with other algorithms, it has better visual effect, which is helpful to reduce the background interference. In addition, when the enhanced dataset of this algorithm is used for detection, the mAP is up to 97.8%, and the F1-score is 6% higher than that of the original dataset. It indicates that the proposed algorithm can effectively improve the detection accuracy of photovoltaic panels in infrared images. |
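A hedged sketch of the two-step pipeline: homomorphic filtering in the log/frequency domain followed by gamma correction and CLAHE. The transfer-function form and all parameters are generic choices, not the paper's improved profile.

```python
# Step 1: homomorphic filtering of a grayscale infrared image (generic Gaussian
# high-emphasis transfer function). Step 2: gamma correction followed by CLAHE.
import cv2
import numpy as np

def homomorphic_filter(img, gamma_l=0.5, gamma_h=2.0, c=1.0, d0=30):
    log_img = np.log1p(img.astype(np.float64))
    F = np.fft.fftshift(np.fft.fft2(log_img))
    rows, cols = img.shape
    u, v = np.meshgrid(np.arange(cols) - cols // 2, np.arange(rows) - rows // 2)
    d2 = u**2 + v**2
    H = (gamma_h - gamma_l) * (1 - np.exp(-c * d2 / d0**2)) + gamma_l
    out = np.real(np.fft.ifft2(np.fft.ifftshift(H * F)))
    return cv2.normalize(np.expm1(out), None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

def enhance(ir_image, gamma=0.8):
    filtered = homomorphic_filter(ir_image)
    corrected = np.uint8(255 * (filtered / 255.0) ** gamma)      # gamma correction
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return clahe.apply(corrected)
```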
12:00 | An Efficient and Lightweight Structure for Spatial-Temporal Feature Extraction in Video Super Resolution PRESENTER: Xiaonan He ABSTRACT. Video Super Resolution (VSR) models based on deep convolutional neural networks (CNNs) use multiple Low-Resolution (LR) frames as input and have a strong ability to recover High-Resolution (HR) frames and maintain video temporal information. However, to realize the above advantages, VSR must consider both spatial and temporal information to improve the perceived quality of the output video, leading to expensive operations such as cross-frame convolution. Therefore, how to balance output video quality and computational cost is an issue worth studying. To address the above problem, we propose an efficient and lightweight multi-scale 3D video super resolution scheme that arranges 3D convolutional feature extraction blocks in a U-Net structure to achieve multi-scale feature extraction in both the spatial and temporal dimensions. Quantitative and qualitative evaluation results on public video datasets show that, compared to other simple cascaded spatial-temporal feature extraction structures, the U-Net structure achieves comparable texture details and temporal consistency with a significant reduction in computation costs and latency. |
12:12 | Specular Highlight Detection and Removal Based on Dynamic Association Learning PRESENTER: Jinyao Shen ABSTRACT. Specular highlight widely exists in daily life. Its strong brightness influences the recognition of text and graphic patterns in images, especially for documents and cards. In this paper, we propose a coarse-to-fine dynamic association learning method for specular highlight detection and removal. Specifically, based on the dichromatic reflection model, we first use a sub-network to separate the specular highlight layer and locate the regions of the highlight. Instead of directly subtracting the estimated specular highlight component from the raw image to get the highlight removal result, we design an associated learning module (ALM) together with a second-stage sub-network to restore the color distortion of the specular highlight layer removal. Our ALM respectively extracts features from the specular highlight part and non-specular highlight part to improve the color restoration. We conducted extensive evaluation experiments and ablation study on the synthetic dataset and the real-world dataset. Our method achieved 36.09 PSNR and 97% SSIM on SHIQ dataset, along with 28.90 PSNR and 94% SSIM on SD1 dataset, which outperformed the SOTA methods. |
12:24 | Highlight removal from a single image based on a prior knowledge guided unsupervised CycleGAN PRESENTER: Yongkang Ma ABSTRACT. Highlights widely exist on many objects, such as the optical images of high-gloss leather, glass, plastic, metal parts, and other mirror-reflective objects. This makes it difficult to directly apply optical measurement techniques, such as object detection, intrinsic image decomposition, and tracking, which are suited to objects with diffuse reflection characteristics. Although deep CNNs can be used to perform supervised learning of material rendering parameters and automatically remove highlights given a large number of paired specular-diffuse reflection datasets, it is hard for them to deal with unpaired datasets. In this paper, we propose a specular-to-diffuse-reflection image conversion network based on an improved CycleGAN to automatically remove image highlights. It does not require paired training data, and the experimental results verify the effectiveness of our method. This framework makes two main contributions. On one hand, we propose a confidence map based on independent average values as the initial value to solve the slow convergence problem of the network due to the lack of a strict mathematical definition for distinguishing specular reflection components from diffuse reflection components. On the other hand, we design a logarithm-based transformation generator which makes the specular reflection and diffuse reflection components comparable. It solves the anisotropy problem in the optimization process, which is caused by the fact that the peak specular reflection on the surface of a specular object is much larger than the value of the off-peak diffuse reflection. We also compare our method with the latest methods and find that the SSIM and PSNR values of our proposed algorithm are significantly improved, and the comparative experimental results show that the proposed algorithm significantly improves image conversion quality. |
Zoom Link: https://uni-sydney.zoom.us/j/85464507535?pwd=OEQ1TVZsR2ZOSEo4cTB3MitLR25CQT09 Meeting ID: 854 6450 7535, Password: cgi2023
11:00 | Facial expression recognition with global multiscale and local attention network PRESENTER: Shukai Zheng ABSTRACT. Due to problems such as occlusion and pose variation, facial expression recognition (FER) in the wild is a challenging classification task. This paper proposes a global multiscale and local attention network (GL-VGG) based on the VGG structure, which consists of four modules: a VGG base module, a dropblock module, a global multiscale module, and a local attention module. The base module pre-extracts features, the dropblock module prevents overfitting in the convolutional layers, the global multiscale module learns different receptive field features in the global perception domain, which reduces the susceptibility of deeper convolutions to occlusion and pose variation, and the local attention module guides the network to focus on locally rich features, which relieves the interference of occlusion on FER in the wild. Experiments on two public in-the-wild FER datasets show that our GL-VGG approach outperforms the baseline and other state-of-the-art methods with 88.33% on RAF-DB and 74.17% on FER2013. |
11:12 | MARANet: Multi-scale Adaptive Region Attention Network for Few-Shot Learning PRESENTER: Xiyang Li ABSTRACT. Few-shot learning, which aims to classify unknown categories with fewer labeled samples, has become a research hotspot in computer vision because of its wide application. Objects appear at different regional locations in natural scenes, and existing few-shot learning only focuses on overall location information while ignoring the impact of local key information on classification tasks. To solve this problem, (1) we propose a new multi-scale adaptive region attention network (MARANet), which makes use of the semantic similarity between images to make the model pay more attention to the areas that are beneficial to the classification task. (2) MARANet mainly includes two modules---the multi-scale feature generation module uses low-level features (LR) of different scales to solve the problem of different target scales in nature; the adaptive region metric module selects the LR of key regions by assigning masks to each classification task. We have conducted experiments on four common datasets (i.e., miniImageNet, CUB-200, Stanford Dogs, and Stanford Cars). The experimental results show that MARANet improves accuracy on the novel-category classification task by 1.1%-4.9% over existing methods. |
11:24 | Enhancing Image Rescaling Using High Frequency Guidance and Attentions in Downscaling and Upscaling Network PRESENTER: Yan Gui ABSTRACT. Recent image rescaling methods adopt invertible bijective transformations to model downscaling and upscaling simultaneously, where the high-frequency information learned in the downscaling process is used to recover the high-resolution image by inversely passing the model. However, less attention has been paid to exploiting the high-frequency information when upscaling. In this paper, an efficient end-to-end learning model for image rescaling, based on a newly designed neural network, is developed. The network consists of a downscaling generation sub-network (DSNet) and a super-resolution sub-network (SRNet), and learns to recover high-frequency signals. Concretely, we introduce dense attention blocks into the DSNet to produce a visually pleasing low resolution (LR) image and model the distribution of the high-frequency information using a latent variable following a specified distribution. For the SRNet, we adapt an enhanced deep residual network by using residual attention blocks and adding a long skip connection, which transforms the predicted LR image and random samples of the latent variable back during upscaling. Finally, we define a joint loss and adopt a multi-stage training strategy to optimize the whole network. Experimental results demonstrate the superior performance of our model over existing methods in terms of both quantitative metrics and visual quality. |
11:36 | Convolutional Neural Networks and Vision Transformers in product GS1 GPC brick code recognition PRESENTER: Maciej Szymkowski ABSTRACT. Online stores and auctions are commonly used nowadays, meaning that we buy much more on the Internet than in traditional stores. As a result, when looking for products we need precise categories assigned to each of them (so that a consumer finds only records of interest). This is sometimes hard: users make simple mistakes by assigning wrong categories to the products they sell. In this paper, we propose an approach to the analysis of product images and the assignment of their real categories. The proposed algorithm is based on Convolutional Neural Networks (CNNs). Vision Transformers were also tested and compared with CNNs. Product categories were represented by GS1 GPC brick codes. The maximum accuracy reached around 80%. Based on discussions with e-commerce experts, it was claimed that such precision is acceptable, as the differences between real and assigned categories were effectively small (a change at the class level rather than the segment or family level). |
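A hedged sketch of the classification setup: fine-tune a standard CNN backbone so its final layer predicts GS1 GPC brick codes (assuming a recent torchvision); the backbone choice and the number of classes are illustrative assumptions.

```python
# Replace the final layer of an ImageNet-pretrained ResNet so it outputs one
# logit per GS1 GPC brick code; train with cross-entropy on (image, code) pairs.
import torch.nn as nn
from torchvision import models

num_brick_codes = 500                                  # placeholder class count
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, num_brick_codes)
```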
11:48 | Multi-source Information Perception and Prediction for Panoramic Videos PRESENTER: Chenxin Qu ABSTRACT. With the popularization and development of virtual reality technology, panoramic video has gradually become one of the mainstream forms of VR technology in various fields. However, the research on the information perception of panoramic video in different media is insufficient. And shortcomings still exist in building information perception and prediction models owing to small samples. This work focuses on users' perception of multi-source information in panoramic videos with different media. We conducted the experiment ( N=40 ) to analyze the differences of users' perception level when viewing panoramic videos using different media (i.e. VR and traditional media). We also studied the correlation between user characteristics and information reception effectiveness. Finally, we use the few-shot-learning prediction model to predict the perception effect of multi-source information. The results show that users' perception of multi-source information in VR is better than in traditional media, except for sound information. Besides, there is a positive correlation between observational ability, memory, concentration, and spatial perception, whether playing computer games frequently and multi-source information perception. And the few-shot learning prediction achieves an accuracy of 90.0875% and can accurately predict the user's information perception effect based on their characteristics. |
12:00 | Multi-Scale Attention Conditional GAN for Underwater Image Enhancement PRESENTER: Zhenbo Li ABSTRACT. Underwater image enhancement (UIE) has achieved impressive achievements in various marine tasks, such as aquaculture and biological monitoring. However, complex underwater scenarios impede current UIE method application development. Some UIE methods utilize CNN-based models to improve the quality of degradation images, but these methods fail to capture multi-scale high-level features, leading to sub-optimal results. To address these issues, we propose a multi-scale attention conditional generative adversarial network, dubbed Mac-GAN, to recover the degraded underwater images by utilizing an encoder-decoder structure. Concretely, a novel multi-scale conditional GAN architecture is utilized to aggregate the multi-scale features and reconstruct the high-quality underwater images with high perceptual information. Meanwhile, a novel attention module (AMU) is designed to integrate associated features among the channels for the UIE tasks, effectively suppressing non-significant features to improve the extraction effect of multi-scale features. Extensive experiments demonstrate that our proposed model achieves remarkable results in terms of qualitative and quantitative metrics, such as 0.7dB improvement in PSNR metrics and 0.8dB improvement in UIQM metrics. Moreover, Mac-GAN can generate a pleasing visual result without obvious over-enhancement and over-saturation over the comparison of UIE methods. A detailed set of ablation experiments analyzes the core components’ contribution to the proposed approach. |
12:12 | PRESENTER: Chenhao Yao ABSTRACT. Although there has been some progress in 3D human pose and shape estimation, accurately predicting complex human poses is still challenging. To tackle this issue and improve the accuracy of the human mesh reconstruction, we propose an end-to-end framework called Multi-level Attention Network (MANet) that improves reconstruction results. MANet consists of three modules: Intra Part Attention Network (IntraPA-Net), Inter Part Attention Network (InterPA-Net), and Hierarchical Pose Regressor (HPR), which model attention at various levels. IntraPA-Net utilizes pixel attention and aggregates pixel-level features for each body part, InterPA-Net establishes attention between different body parts, and HPR implicitly captures the attention of different joints in a hierarchical structure. Experimental results demonstrate that MANet achieves high accuracy in reconstructing the human mesh and aligning well with images that contain flexible human motion. |
12:24 | LIELFormer: Low-light Image Enhancement with a Lightweight Transformer ABSTRACT. Images captured under low-light conditions often suffer from (partially) poor visibility. Beyond inadequate lighting, low-light enhancement is also challenged by noise and color distortion due to the limited quality of cameras. Previous researchers have typically used paired data (low-light and high-definition images) for training to solve single-image enhancement problems. However, those approaches have two disadvantages. First, collecting data in pairs is difficult and wastes time and computational resources. Second, such models tend to generalize poorly and perform poorly across multiple datasets. This paper proposes a simple but accurate single-image enhancement network to solve this problem. Our network consists of a light estimation module and a color correction module. The light estimation module is based on the Retinex principle and uses a CNN to enhance illumination. The color correction module uses a global prediction module (transformer block) to obtain the actual color distribution. This module extracts the image's original colors to make it more realistic. Our network has a simple structure and does not require any paired or unpaired datasets; it performs single-image enhancement using only iterations of the image itself. Our approach outperforms current state-of-the-art methods in qualitative and quantitative experiments. We will release our code after publication. |
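A minimal sketch of the Retinex principle the light estimation module builds on: an image I is modeled as reflectance R times illumination L, so estimating L, lifting it, and recombining brightens dark regions. The Gaussian illumination estimate and gamma value are assumptions, not the paper's CNN-based module.

```python
# Retinex-style enhancement: I = R * L, estimate L with a large Gaussian blur,
# divide it out, lift the illumination with a gamma, and recombine.
import cv2
import numpy as np

def retinex_enhance(img_bgr, sigma=50, gamma=0.5):
    img = img_bgr.astype(np.float64) / 255.0 + 1e-6
    illumination = cv2.GaussianBlur(img, (0, 0), sigma)      # coarse L estimate
    reflectance = img / illumination                          # R = I / L
    enhanced = reflectance * np.power(illumination, gamma)    # lift dark L
    return np.clip(enhanced * 255, 0, 255).astype(np.uint8)
```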
Zoom Link: https://uni-sydney.zoom.us/j/85070505056?pwd=OHN3dm9RS2JKTHhONVlOQjVaMjgwUT09 Meeting ID: 850 7050 5056, Password: cgi2023
13:30 | Single-view 3D reconstruction of curves PRESENTER: Ali Fakih ABSTRACT. This paper describes a method to generate a 3D curve from a planar polygonal curve. One application of such a method is the modeling of trajectories of moving objects in 3D using sketches. Given a planar polygonal curve C_2D, our algorithm computes a 3D curve C_3D such that its orthogonal projection matches the input curve. The algorithm aims at minimizing the variation of the curvature along the reconstructed curve. The driving idea is to fit a set of ellipses to the input curve; these ellipses enable us to determine the osculating circles and thus the tangent at every point of the curve to be reconstructed in 3D. The reconstruction of the 3D curve using these tangents is then straightforward. The method is demonstrated with several examples. |
13:42 | Audio-Driven Lips and Expression on 3D Human Face PRESENTER: Le Ma ABSTRACT. Extensive research has been conducted on audio-driven 3D facial animation, with many attempts to achieve human-like performance; however, creating a truly realistic and expressive 3D facial animation remains a challenging task, and existing methods often fall short in capturing the nuances of anthropomorphic expressions. We propose the Audio-Driven Lips and Expression (ADLE) method, which is designed to generate highly expressive and lifelike conversations between individuals, complete with important social signals like laughter and excitement, based solely on audio cues. At the core of our approach is the groundbreaking audio-expression-consistency strategy, which disentangles person-specific lips from dependent expressions. This allows our method to robustly learn lip movements and generic expression parameters on a 3D human face from an audio sequence. As a result, our ADLE is a multimodal fusion approach that can automatically generate accurate lip movements accompanied by vivid facial expressions on a 3D face, all in real time. Our experiments demonstrate that our ADLE outperforms other state-of-the-art works in this field, making it a highly promising approach for a wide range of applications. |
13:54 | Multi-Image 3D Face Reconstruction via An Adaptive Aggregation Network PRESENTER: Xiaoyu Chai ABSTRACT. Image-based 3D face reconstruction suffers from the inherent drawbacks of incomplete visible regions and interference from occlusion or lighting. One solution is to utilize multiple face images to collect sufficient knowledge. Nevertheless, most existing methods do not make full use of the information among different images, since they roughly fuse the results of individually reconstructed faces for multi-image 3D face modeling and thus may ignore the intrinsic relations within various images. To tackle this problem, we propose a framework named Adaptive Aggregation Network (ADANet) to investigate the subtle correlations among multiple images for 3D face reconstruction. Specifically, we devise an Aggregation Module that can adaptively establish both in-face and cross-face relationships by exploiting the local- and long-range dependencies among visible facial regions of multiple images, and thus can effectively extract complementary aggregation face features in the multi-image scenario. Furthermore, we incorporate contour-aware information to promote the boundary consistency of the 3D face model. The seamless combination of these novel designs forms a more accurate and robust multi-image 3D face reconstruction scheme. Extensive experiments have demonstrated the superiority of the proposed network over other state-of-the-art models. |
14:06 | METRO-X: Combining Vertex and Parameter Regressions for Recovering 3D Human Meshes with Full Motions PRESENTER: Chenhao Yao ABSTRACT. It is well known that regressing the parametric representation of a human body from a single image suffers from low accuracy due to sparse information use and error accumulation. Although able to achieve higher accuracy by avoiding these issues, directly regressing vertices may result in vertex outliers and can only deal with body meshes with a very limited number of vertices. We present METRO-X, a novel method for reconstructing full-body human meshes with body pose, facial expression and hand gesture from a single image, which combines the advantages of the two disciplines so as to achieve higher accuracy than parameter regression while handling denser vertices and generating smoother shapes than vertex regression. It first detects and extracts the hands, head and whole body from a given image, then regresses the vertices of the three parts separately using METRO, and finally fits SMPL-X to the reconstructed meshes to obtain the complete parametric representation of the human body, facial expression and hand gesture. Experimental results show that METRO-X outperforms the ExPose method, with a significant 23% improvement in body accuracy and a 35% improvement in gesture accuracy. These results demonstrate the potential of our approach in enabling various applications. |
14:18 | PRESENTER: Ritesh Sharma ABSTRACT. This paper proposes a novel approach for automatically generating accurate floor plans and 3D models of building interiors using scanned mesh data. Unlike previous methods, which begin with a high resolution point cloud from a laser range-finder, our approach begins with triangle mesh data, such as from a Microsoft HoloLens headset. The approach includes generating two types of floor plans, a "pen-and-ink" style that preserves details and a drafting-style that reduces clutter, and processing the 3D model for use in applications by aligning it with coordinate axes, annotating important objects, dividing it into stories, and removing the ceiling. The performance of each step is analyzed on commercial and residential buildings, and experiments are conducted to evaluate the appearance of results when different amounts of transparency and numbers of mesh slices are used. Our approach has applications in navigation, interior design, furniture placement, facilities management, building construction, and heating, ventilation, and air conditioning (HVAC) design. In general, our approach appears to be promising for automatic digitization and orientation of scanned mesh data for floor plan and 3D model generation. |
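A hedged sketch of one building block such a pipeline needs: slicing a triangle mesh with horizontal planes to obtain per-story cross-sections (using the trimesh library). The box mesh stands in for real scan data, and the cut heights are assumptions; alignment, annotation, and floor-plan styling are omitted.

```python
# Slice a triangle mesh with horizontal planes to get 2D cross-sections, the
# raw material for a floor plan (box mesh stands in for HoloLens scan data).
import trimesh

mesh = trimesh.creation.box(extents=[10.0, 8.0, 6.0])   # stand-in for a scanned interior
z_min, z_max = mesh.bounds[:, 2]
for z in (z_min + 1.5, z_min + 4.5):                     # assumed cut heights per story
    section = mesh.section(plane_origin=[0, 0, z], plane_normal=[0, 0, 1])
    if section is not None:
        planar, _ = section.to_planar()                  # 2D polylines for the floor plan
        print(f"slice at z={z:.1f}: {len(planar.entities)} outline entities")
```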
14:30 | An Adaptive-Guidance GAN for Accurate Face Reenactment PRESENTER: Xiaoyu Chai ABSTRACT. Face reenactment has been widely used in face editing, augmentation and animation. However, it remains challenging to generate a photo-realistic target face with the same pose or expression as the reference face while retaining the identity of the source face. To achieve this goal, we propose an Adaptive-Guidance Generative Adversarial Network (AD-GAN) for accurate face reenactment. Previous methods control GANs either by directly employing a simple set of vectors or by using sparse representations (e.g., facial landmarks or boundaries); both ignore the correspondence between reference and source faces, leading to inaccurate reenactment or artifacts on target faces. In contrast, we devise a Correlation Module (CM) that adaptively establishes dense correspondence between a 3D face model used as the condition and the latent features from sources, forming an indicator map that implements explicit control of target faces. Besides, the Texture Module (TM) and Guiding Blocks (GB) in the generator restore the facial appearance distorted by expression or pose changes and progressively guide the generation process. Extensive experiments demonstrate the superiority of our AD-GAN in generating photo-realistic and accurately controllable images. |
14:42 | PRESENTER: Wiem Grina ABSTRACT. In this study, we address the challenge of unsupervised learning of disentangled representations in datasets with independent factors of variation. We propose a new approach inspired by Factor-VAE and $\beta$-VAE that integrates the Ranger optimizer with dropout layers, which encourages the distribution of representations to be factorial, ensuring independence between dimensions and leading to faster convergence. Our method outperforms Factor-VAE by finding a better balance between disentanglement and reconstruction quality, and by better optimizing the model parameters through effective adaptation of the learning rate, which improves convergence and generalization during learning. |
14:54 | CaSE-NeRF: Camera Settings Editing of Neural Radiance Fields PRESENTER: Ciliang Sun ABSTRACT. Neural Radiance Fields (NeRF) have shown excellent quality in three-dimensional (3D) reconstruction by synthesizing novel views from multi-view images. However, previous NeRF-based methods do not allow users to perform user-controlled camera-setting editing in the scene. While existing works have proposed ways to modify the radiance field, these modifications are limited to the camera settings present in the training set. Hence, we present Camera Settings Editing of Neural Radiance Fields (CaSE-NeRF) to recover a radiance field from a set of views with different camera settings. In our approach, users can perform controlled camera-setting editing on the scene and synthesize novel-view images of the edited scene without re-training the network. The key to our method lies in modeling each camera parameter separately and rendering various 3D defocus effects based on thin-lens imaging principles. Following the image processing pipeline of real cameras, we model it implicitly and learn gains that are continuous in the latent space and independent of the image. The control of color temperature and exposure is plug-and-play and can be easily integrated into NeRF-based frameworks. As a result, our method allows manual and free post-capture control of the viewpoint and camera settings of 3D scenes. Through extensive experiments on two real-scene datasets, we have demonstrated the success of our approach in reconstructing a normal NeRF with consistent 3D geometry and appearance. Our related code and data are available at https://github.com/CPREgroup/CaSE-NeRF. |
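Editor's note: for readers unfamiliar with the thin-lens defocus model the abstract refers to, the sketch below computes the standard circle-of-confusion diameter used for depth-dependent blur, together with simple post-capture exposure and white-balance gains. Variable names, units, and the gain form are illustrative assumptions, not CaSE-NeRF's actual rendering code.

```python
import numpy as np

def circle_of_confusion(depth, focus_dist, focal_len, aperture_diam):
    # Thin-lens circle-of-confusion diameter for scene points at `depth`
    # (all distances in the same units, e.g. millimetres).
    return aperture_diam * focal_len * np.abs(depth - focus_dist) / (depth * (focus_dist - focal_len))

def apply_gains(rgb_linear, exposure=1.0, wb_gains=(1.0, 1.0, 1.0)):
    # Plug-and-play post-capture gains on a linear RGB image, in the spirit of
    # exposure / color-temperature control (values are examples).
    return rgb_linear * exposure * np.asarray(wb_gains)
```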
15:06 | A Submodular-based Autonomous Exploration for Multi-Room Indoor Scenes Reconstruction PRESENTER: Haipeng Wang ABSTRACT. Autonomously exploring and densely recovering an unknown indoor scene is a nontrivial task in 3D scene reconstruction. It is especially difficult for scenes composed of compact, complicated, interconnected rooms with no priors. To address this issue, we propose a novel approach that autonomously scans and reconstructs multi-room scenes without any prior knowledge. Specifically, the proposed method introduces submodular-based planning to efficiently guide the active scanning by “Next-Best-View” selection until marginal gains diminish. The submodular-based planning gives an approximately optimal solution to the “Next-Best-View” problem, which is NP-hard in the absence of prior knowledge. Experiments show that our method significantly improves scanning efficiency for multi-room scenes while keeping reconstruction errors comparable. |
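Editor's note: the "until marginal gains diminish" criterion is the classic greedy rule for monotone submodular maximization, which enjoys a (1 - 1/e) approximation guarantee. A minimal sketch of such a greedy Next-Best-View loop is given below; `coverage_gain` is a hypothetical stand-in for the paper's information-gain term, not its actual implementation.

```python
def greedy_next_best_views(candidates, coverage_gain, min_gain=0.01):
    # coverage_gain(view, selected) -> marginal information gain of `view`
    # given the views already selected (assumed monotone submodular).
    selected = []
    while True:
        best_view, best_gain = None, min_gain
        for v in candidates:
            if v in selected:
                continue
            g = coverage_gain(v, selected)
            if g > best_gain:
                best_view, best_gain = v, g
        if best_view is None:   # marginal gains have diminished below threshold
            break
        selected.append(best_view)
    return selected
```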
15:18 | Learning Degradation for Real-World Face Super-Resolution PRESENTER: Jin Chen ABSTRACT. Acquiring degraded faces with corresponding high-resolution (HR) faces is critical for real-world face super-resolution (SR) applications. To generate realistic low-resolution (LR) faces with degradation similar to that in real-world scenarios, most approaches learn a deterministic mapping from HR faces to LR faces. However, these deterministic models fail to capture the varied degradations of real-world LR faces, which limits the performance of the subsequent face SR models. In this work, we learn a degradation model based on conditional generative adversarial networks (cGANs). Specifically, we propose a simple and effective weight-aware content loss that adaptively assigns different content losses to LR faces generated from the same HR face under different noise vector inputs. It significantly improves the diversity of the generated LR faces while keeping their degradation similar to that of real-world LR faces. Compared with previous degradation models, the proposed model can generate HR-LR pairs that better cover the various degradation cases of real-world LR faces and further improve the performance of face SR models in real-world applications. Experiments on four datasets demonstrate that the proposed degradation model helps the face SR model achieve better performance in both quantitative and qualitative results. |
Zoom Link: https://uni-sydney.zoom.us/j/85464507535?pwd=OEQ1TVZsR2ZOSEo4cTB3MitLR25CQT09 Meeting ID: 854 6450 7535, Password: cgi2023
13:30 | Visualization of Irregular Tree Canopy Centerline Data from a Depth Camera Based on An Optimized Spatial Straight-Line Fitting PRESENTER: Jiale Wang ABSTRACT. We propose a novel method for visualizing the canopy centerline of a single tree based on a depth camera. Initially, the depth camera captures an image of the target tree to obtain 3D point cloud data. To improve the accuracy of the model reconstruction and enhance the smoothness of the model, the point cloud data is filtered and denoised. Next, we employ the Poisson surface reconstruction method to reconstruct the 3D surface of the point cloud data, which can accurately restore the real scene. Additionally, we fit circles to the 3D point cloud using Random Sample Consensus (RANSAC) and Least-Squares Circle (LSC) fitting in MATLAB, and propose a new spatial straight-line fitting method to visualize the canopy centerline. This method introduces no error in the Z coordinates of the spatial scatter points, and the fitted straight line is perpendicular to the xOy plane. Furthermore, compared to the traditional spatial straight-line fitting method, the new method yields a smaller root mean square error (RMSE). The method can be effectively applied in practical applications such as tree canopy pruning, providing precise position information for positioning tools during the pruning process; it can ultimately reduce pruning time and improve the accuracy of the process. |
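Editor's note: a straight line constrained to be perpendicular to the xOy plane is fully determined by a single (x, y) position, so fitting it to the circle centers reduces to averaging their horizontal coordinates and leaves the Z coordinates untouched. The NumPy sketch below illustrates this idea and the reported RMSE; names and data layout are illustrative assumptions.

```python
import numpy as np

def fit_vertical_centerline(circle_centers):
    # circle_centers: (N, 3) array of fitted circle centers at different heights.
    # The line is perpendicular to the xOy plane, so only its (x, y) position
    # is estimated; Z coordinates are kept exactly as measured.
    cx, cy = circle_centers[:, 0].mean(), circle_centers[:, 1].mean()
    residuals = circle_centers[:, :2] - np.array([cx, cy])
    rmse = np.sqrt((residuals ** 2).sum(axis=1).mean())  # horizontal RMSE of centers
    return (cx, cy), rmse
```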
13:42 | PRESENTER: Dan Mei ABSTRACT. Recent studies have shown that implicit neural representations can be effectively applied to geometric surface reconstruction. Existing methods have achieved impressive results. However, they often struggle to recover geometric details, or require normal vectors as supervisory information for surface points, which are often unavailable in actual scanned data. In this paper, we propose a coarse-to-fine approach that enhances the geometric details of the reconstructed results without relying on normal vectors as supervision and is able to fill holes caused by missing scanned data. In the coarse stage, a local spatial normal-consistency term is introduced to estimate a stable but coarse implicit neural representation. In the fine stage, a local fitting penalty is proposed to locally modify the reconstruction obtained in the previous stage so that it better fits the original input data and recovers more geometric details. Experimental results on three widely used datasets (ShapeNet, SRB and ABC) indicate that our method is very competitive with current state-of-the-art methods, especially in restoring geometric details. |
13:54 | Fine-grained Web3D Culling-Transmitting-Rendering Pipeline PRESENTER: Anning Huang ABSTRACT. Web3D has gradually become the mainstream online 3D technology supporting the Metaverse. However, massive multiplayer online Web3D still faces challenges such as slow culling of the potentially visible set (PVS) at servers, network congestion, and sluggish online rendering in web browsers. To address these challenges, in this paper we propose a novel Web3D pipeline that coordinates PVS culling, network transmission, and Web3D rendering in a fine-grained way. The pipeline integrates three key steps: establishment of a granularity-aware voxelization scene graph, fine-grained PVS culling and transmission scheduling, and incremental and instanced rendering. Our experiments on a massive 3D plant scene demonstrate that the proposed pipeline outperforms existing Web3D approaches in terms of transmission and rendering. |
14:06 | The Chemical Engine Algorithm and Realization based on Unreal Engine-4 PRESENTER: Yue Zhang ABSTRACT. The Chemical Engine is a new concept introduced by Nintendo as a counterpart to the traditional physics engine in game development. However, Nintendo has not released any details of the Chemical Engine and has blurred the distinction between ``chemical" and ``physical". Therefore, this paper first clarifies the concepts of the physics engine and the chemical engine in game development and then, based on these definitions, proposes two chemical engine algorithms. The first, the ``elemental energy" algorithm, follows Nintendo's philosophy, is optimized for future scalability, and can be widely used in general game scenarios. The second, the ``factorization and properties" algorithm, is more in line with the academic definition of chemistry and can render chemical reactions realistically, but its realization is more difficult and too costly for game development. Therefore, this paper provides a concrete implementation of the elemental energy algorithm in the Unreal Engine 4 engine. The analysis of the implementation and the experiments shows that the cost and method of the ``elemental energy" algorithm are reasonable; the scheme is therefore practical in this scenario and could be widely used in commercial game development. |
14:18 | Enhanced Direct Lighting Using Visibility-Aware Light Sampling PRESENTER: Geonu Noh ABSTRACT. Next event estimation has been widely applied to Monte Carlo rendering methods such as path tracing, since estimating direct and indirect lighting separately often enables finding light paths from the eye to the lights effectively. Its success heavily relies on light sampling for direct lighting when a scene contains multiple light sources, since each light can contribute differently to the reflected radiance at a surface point. We present a light sampling technique that can guide such light selection to improve direct lighting. We estimate a spatially-varying function that approximates the contribution of each light to surface points within a discretized local area (i.e., a voxel in an adaptive octree) while considering the visibility between lights and surface points. We then construct, per voxel, a probability distribution function for sampling lights that is proportional to our estimated function. We demonstrate that our light sampling technique can significantly improve rendering quality thanks to improved direct lighting. |
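Editor's note: once a per-voxel contribution estimate is available, light selection reduces to inverse-CDF sampling from a discrete distribution proportional to that estimate, with the resulting probability used to weight the sampled light's contribution. The sketch below shows only this generic sampling step; the visibility-aware estimation itself is the paper's contribution and is not reproduced here.

```python
import numpy as np

def build_light_cdf(estimated_contrib):
    # estimated_contrib[i]: visibility-aware contribution estimate of light i
    # for the current voxel; the sampling PDF is proportional to it.
    p = np.maximum(np.asarray(estimated_contrib, dtype=float), 1e-8)
    p /= p.sum()
    return p, np.cumsum(p)

def sample_light(cdf, u):
    # Inverse-CDF sampling with a uniform random number u in [0, 1).
    return int(np.searchsorted(cdf, u))
```

As in any discrete importance-sampling estimator, the sampled light's contribution is then divided by its selection probability p[index].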
14:30 | Point Cloud Rendering via Multi-plane NeRF PRESENTER: Dongmei Ma ABSTRACT. We propose a new neural point cloud rendering method by combining point cloud multi-plane projection and NeRF. Existing point-based rendering methods often rely on the high-quality geometry of point clouds. Meanwhile, NeRF and its extensions usually query the RGB and volume density of each point on the ray through neural networks, thus leading to a low inference efficiency. In this paper, we assign a feature vector to each point and project them to multiple random depth planes. The multi-plane feature maps are fed into a 3D convolutional neural network to predict the RGB and volume density map of each feature plane, then we synthesize a novel view through volume rendering. On the one hand, projecting point features to multiple planes reduces the impact of geometry noise, and on the other hand, directly using multiple planes for rendering avoids sampling points on rays, thereby improving the rendering efficiency. The introduction of volume rendering enables our approach to synthesize high-quality images even when point clouds are relatively sparse. Experimental results on the DTU dataset and ScanNet dataset show that our approach achieves state-of-the-art results. |
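Editor's note: the per-plane RGB and density maps predicted by the 3D CNN can be combined with standard front-to-back volume-rendering compositing over the depth planes. The NumPy sketch below shows that generic compositing step; array shapes and the plane-spacing convention are assumptions, not the paper's exact formulation.

```python
import numpy as np

def composite_planes(rgb, sigma, deltas):
    # rgb: (P, H, W, 3) per-plane colors, sigma: (P, H, W) per-plane densities,
    # deltas: (P,) spacing between consecutive planes, ordered front to back.
    deltas = np.asarray(deltas, dtype=float)
    alpha = 1.0 - np.exp(-sigma * deltas[:, None, None])          # opacity per plane
    trans = np.cumprod(1.0 - alpha + 1e-10, axis=0)               # accumulated transmittance
    trans = np.concatenate([np.ones_like(trans[:1]), trans[:-1]], axis=0)
    weights = trans * alpha                                        # standard volume-rendering weights
    return (weights[..., None] * rgb).sum(axis=0)                  # (H, W, 3) composited image
```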
14:42 | Fast Geometric Sampling for Phong-like Reflection PRESENTER: Shuzhan Yang ABSTRACT. Importance sampling is a critical technique for reducing the variance of Monte Carlo samples. However, the classical importance sampling based on the Bidirectional Reflectance Distribution Function (BRDF) is often complex and challenging to implement. In this work, we present a simple yet efficient sampling method inspired by Phong's reflectance model. Our method generates samples of rays using geometric vector operations, replacing the need for BRDF. We explain our implementation of this method on WebGL and demonstrate how we obtain per-pixel random numbers in GLSL. We also conduct experiments to compare our method's speed and patterns to the Phong distribution. The results show that our sampling process can simulate reflections similar to Phong, but is about three times faster than traditional Phong or other BRDF importance sampling methods. Our sampling method is applicable to both real-time and offline rendering, making it a useful tool for computer graphics applications. |
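Editor's note: for comparison, classical Phong-lobe importance sampling draws a direction around the mirror reflection with pdf (n+1)/(2π)·cosⁿα; the geometric method in the paper is specifically designed to avoid this trigonometric construction, which is shown below only as the standard reference technique (a minimal sketch, not the authors' code).

```python
import numpy as np

def sample_phong_lobe(reflect_dir, shininess, u1, u2):
    # Classical Phong importance sampling: draw a direction around the
    # mirror-reflection vector with pdf proportional to cos^n(alpha).
    cos_t = u1 ** (1.0 / (shininess + 1.0))
    sin_t = np.sqrt(max(0.0, 1.0 - cos_t * cos_t))
    phi = 2.0 * np.pi * u2
    local = np.array([sin_t * np.cos(phi), sin_t * np.sin(phi), cos_t])
    # Build an orthonormal basis (u, v, w) around the reflection direction.
    w = reflect_dir / np.linalg.norm(reflect_dir)
    a = np.array([0.0, 1.0, 0.0]) if abs(w[0]) > 0.9 else np.array([1.0, 0.0, 0.0])
    u = np.cross(a, w); u /= np.linalg.norm(u)
    v = np.cross(w, u)
    return local[0] * u + local[1] * v + local[2] * w
```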
14:54 | Multi-GPU Parallel Pipeline Rendering with Splitting Frame PRESENTER: Haitang Zhang ABSTRACT. Ray tracing is a rendering technique that simulates real-world lighting effects in computers and can provide an excellent visual experience. Using ray tracing in real-time rendering requires extremely large graphics computing resources, and the computing power of a single graphics processing unit (GPU) is often insufficient for complex scenes. In this paper, we propose a multi-GPU parallel pipeline rendering approach that makes full use of the computing power of multiple GPUs to effectively accelerate real-time ray-tracing rendering. This approach enables heterogeneous GPUs to render the same frame cooperatively through a dynamic split-frame load balancing scheme, and ensures that each GPU is assigned a split-frame size that suits its rendering ability. A fine-grained parallel pipeline method divides the rendering process into more detailed steps that allow multiple frames to be rendered in parallel, which improves the utilization of each step and speeds up the output of frames. Experiments on various dynamic scenes show that the frames per second (FPS) of a multi-GPU system composed of two GPUs using the parallel pipeline rendering approach is 2.2 times that of a single-GPU system, and that of a system composed of three GPUs increases to 3.3 times. |
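Editor's note: a common way to realize dynamic split-frame load balancing of the kind described above is to resize each GPU's slice in inverse proportion to its most recent frame time. The sketch below is a simple heuristic of that kind, not the paper's exact scheme.

```python
def split_frame(frame_height, last_frame_times):
    # Assign each GPU a horizontal slice whose height is proportional to its
    # measured rendering speed (1 / last frame time): faster GPUs get more rows.
    speeds = [1.0 / t for t in last_frame_times]
    total = sum(speeds)
    rows = [int(round(frame_height * s / total)) for s in speeds]
    rows[-1] = frame_height - sum(rows[:-1])      # keep slices summing to the full frame
    bounds, start = [], 0
    for r in rows:
        bounds.append((start, start + r))         # (start_row, end_row) per GPU
        start += r
    return bounds

# Example: a fast and a slow GPU with 8 ms and 16 ms frame times.
print(split_frame(1080, [0.008, 0.016]))          # -> [(0, 720), (720, 1080)]
```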
15:06 | Molecular Surface Mesh Smoothing with Subdivision PRESENTER: Dawar Khan ABSTRACT. Several popular techniques perform smoothing with subdivision. However, these techniques have limitations, including mesh deformation, lack of attention to mesh quality, and increasing mesh complexity. Molecular meshes, whose surfaces change abruptly from concave to convex and vice versa, are particularly challenging for these techniques. In this paper, we formulate a smoothing algorithm for molecular surface meshes. The algorithm integrates the advantages of three well-known algorithms, Catmull-Clark, Loop, and Centroidal Voronoi tessellation (CVT), with an error control module. CVT is used for pre-processing, and the other two for smoothing. We find new vertices as in Catmull-Clark and connect them as in Loop. Unlike Catmull-Clark, which generates a quad mesh, we establish a new connectivity that keeps the mesh triangular. We control the geometric loss by translating the vertices backward toward the input mesh. We compared the results with previous methods and tested the algorithm in different numerical analysis and modeling applications. Our results show significant improvement and remain robust for downstream applications. |
15:18 | Photorealistic Aquatic Plants Rendering with Cellular Structure ABSTRACT. This paper presents a realistic real-time rendering method for aquatic plants that considers their unique optical characteristics. While many rendering models have been proposed for real-time rendering of plant leaves, the rendering of aquatic plants is often inaccurate because it relies on the botanical parameters and optical statistics of terrestrial plants. To address this issue, we combine existing rendering methods for terrestrial plants with the optical properties of aquatic plants. Through a qualitative analysis of the differences in optical properties between the cell structures of aquatic and terrestrial plants, we propose a rendering method for aquatic plants. The experimental results show that our method expresses the rendered appearance of aquatic plants more effectively than general-purpose physically-based shading models and advanced plant leaf rendering models. We also demonstrate our method in virtual reality, providing a solution for the construction of underwater environments in virtual reality. Our contributions include a realistic real-time rendering model for aquatic plants, a real-time underwater roaming platform, and experimental evidence demonstrating our method's effectiveness in expressing the appearance of aquatic plants while balancing model efficiency and accuracy. |
Zoom Link: https://uni-sydney.zoom.us/j/85070505056?pwd=OHN3dm9RS2JKTHhONVlOQjVaMjgwUT09 Meeting ID: 850 7050 5056, Password: cgi2023
16:00 | Staged Transformer Network with Color Harmonization for Image Outpainting PRESENTER: Wangyidai Lv ABSTRACT. Image outpainting aims at generating new, realistic-looking content beyond the original boundaries of a given image patch. Existing image outpainting methods tend to generate images with erroneous structures and unnatural colors when extrapolating the sub-image on all sides. To solve this problem, we propose a Transformer-based staged image outpainting network. Specifically, we restructure the encoder-decoder architecture by adding hierarchical cross attention to the connection in each layer. We propose a staged expanding module that splits the extrapolation into vertical and horizontal steps so that the generated images have consistent contextual information and similar texture. A color harmonization module that adjusts both local and global color information is also presented to make color transitions more natural. Our experiments show that the proposed method outperforms advanced methods on multiple datasets. |
16:12 | SemiRefiner: Learning to refine Semi-Realistic paintings PRESENTER: Keyue Fan ABSTRACT. Previous image optimization methods cannot automatically refine semi-realistic paintings. Aiming to improve the efficiency of manual refinement, we propose an automatic refinement method for semi-realistic figure paintings guided by line art. To enable the framework to adjust the draft colors during refinement as a real painter would, we design a color correction module that automatically fixes inappropriate colors in the draft. To reduce artifacts and generate high-quality results, we use the line art to guide the refinement. We further devise a line art optimization module in the framework that ensures the generation of high-quality results by improving the quality of the line art. The experimental results and user surveys demonstrate the effectiveness of the proposed method. |
16:24 | MagicMirror: A 3-D Real-time Virtual Try-On System through Cloth Simulation PRESENTER: Zhanyi Huang ABSTRACT. With the rapid growth of online shopping, clothing e-commerce holds huge latent value and is driving the application of emerging technologies to this field. However, online shoppers cannot intuitively feel the material of clothing fabric or the dynamic effect of trying on clothes. Methods based on cloth simulation and human-computer interaction can address this challenge. In this paper, we propose a virtual try-on system based on cloth simulation that applies physical laws to garments to strengthen the realism of virtual try-on, and integrates markerless motion capture realized with a common RGB-D camera to synchronize the movement of models and people. We also adopt a GPU acceleration solution to ensure real-time simulation. We implemented the system in Unity3D, using the Taichi programming language to control and simulate the garment. We verify the significance of GPU acceleration and conduct several experiments to demonstrate the real-time performance of the simulation-based virtual try-on system. We compared the simulation time on CPU and GPU and validated that the accuracy of the motion capture satisfies the virtual try-on task. Finally, we conducted a user study to find out whether average consumers are satisfied with the proposed virtual try-on system. |
16:36 | An Ancient Murals Inpainting Method Based on Bidirectional Feature Adaptation and Adversarial Generative Networks PRESENTER: Qingtao Lu ABSTRACT. To address the issue of varying degrees of damage in ancient Chinese murals due to their age and human-induced destruction, we propose a mural image restoration method based on bidirectional feature adaptation and adversarial generative networks. The proposed method first preprocesses the mural images by resizing them and extracting masked feature maps and their corresponding reverse masked feature maps. Subsequently, an improved U-Net generator model is constructed, which captures bidirectional semantic information from the masked feature maps, enhancing the restoration of irregular regions in the mural images. Additionally, a spatial attention mechanism is introduced to adaptively enhance the features of known regions in the mural images. Furthermore, a discriminator model is constructed to discriminate between the restored mural images and real images, outputting a binary classification matrix. Finally, the network model is constrained by adversarial loss, pixel reconstruction loss, style loss, and perceptual loss to generate mural images with rich textures. Experimental results demonstrate that the proposed method effectively restores mural images with different levels of damage and produces mural images with finer texture information compared to conventional mural restoration methods. This method contributes to the preservation and inheritance of traditional Chinese culture by providing an effective means for mural image restoration. |
16:48 | An Image Extraction Method for Traditional Dress Pattern Line Drawings Based on Improved CycleGAN PRESENTER: Sichen Jia ABSTRACT. To address the problem of missing details in general dress pattern line extraction methods, we propose a traditional dress pattern line extraction method based on an improved CycleGAN. First, we take the traditional dress pattern image as input and extract its outline edge image using a bi-directional cascade network. Then, we construct an improved CycleGAN network model, feed the traditional dress pattern image and its outline edge image into the generator for line drawing extraction, use the discriminator to distinguish the generated image from the real image, and output the binary classification matrix. Finally, we construct adversarial, cycle consistency, and contour consistency loss functions to constrain the network model and output a detail-rich line drawing image. Experiments show that the proposed method extracts traditional dress pattern line images with well-preserved details, and the generated line images have more realistic and natural lines than those of other dress pattern line extraction methods. The method can accurately extract traditional costume pattern line images and contribute to the preservation and transmission of traditional Chinese costume culture. |
17:00 | Parametrization of Measured BRDF for Flexible Material Editing PRESENTER: Alexis Benamira ABSTRACT. Finding a low dimensional parametric representation of measured BRDF remains challenging. Currently available solutions are either not usable for editing, or rely on limited analytical solutions, or require expensive test subject based investigations. In this work, we strive to establish a parametrization space that affords the data-driven representation variance of measured BRDF models while still offering the artistic control of parametric analytical BRDFs. We present a machine learning approach that generates a parameter space relying on a compressed disentangled representation of the measured BRDF data. After training our network, we analyze the parametrization space and interpret the learned generative factors utilizing our visual perception. It should be noted that visual analysis is called upon downstream of the system for identification purposes contrary to most other existing methods where it is used upfront to elaborate the parametrization. Furthermore, we do not need a test subject investigation. A novel feature of our parametrization is the post-processing capability to incorporate new parameters along with the learned ones, thus expanding the richness of producible appearances. Furthermore, our solution allows more flexible and controllable material editing possibilities than current machine learning solutions. Finally, we provide a rendering interface, for interactive material editing and interpolation based on the presented new parametrization system. |
17:12 | cGAN-based Garment Line Draft Colorization Using A Garment-Line Dataset ABSTRACT. Garment line drafts are the basis of clothing design. Automatic or semi-automatic colorization of garment line drafts can improve the efficiency of fashion designers and reduce drawing cost. In this paper, we present a garment line draft colorization method based on cGAN, which supports user interaction by adding scribbles to guide the colorization process. Because garment line drafts are scarce, we construct a paired garment-line image dataset for training our colorization model. While existing methods for line art colorization are able to generate plausible colorized results, they tend to suffer from color bleeding. We introduce a region segmentation fusion mechanism to help colorization frameworks avoid color bleeding. Finally, we use a joint bilateral filter to smooth the output and generate clearer and more vivid colored images. The experimental results show that each module of the method contributes to the final result. In addition, the comparison with classical methods shows that our method avoids large areas of leakage in the background and produces cleaner garment details. |
17:24 | PCCNet: A Few-Shot Patch-wise Contrastive Colorization Network PRESENTER: Zizhao Wu ABSTRACT. Few-shot colorization aims to learn a model to colorize grayscale images with little training data. Yet, existing models often fail to keep color consistency due to ignored patch correlations of the images. In this paper, we propose PCCNet, a novel Patch-wise Contrastive Colorization Network to learn color synthesis by measuring the similarities and variations of image patches in two different aspects: inter-image and intra-image. Specifically, for inter-image, we investigate a patch-wise contrastive learning mechanism with positive and negative samples constraint to distinguish color features between patches across images. For intra-image, we explore a new intra-image correlation loss function to measure the similarity distribution which reveals structural relations between patches within an image. Furthermore, we augment our network with a color memory module to remember the correct color for specific kinds of structures and textures. Experiments show that our method allows the correct color to spread naturally over objects and also achieves higher scores in quantitative comparisons with related methods. |
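Editor's note: the inter-image objective described above is a patch-wise contrastive loss; a standard InfoNCE formulation over matched and non-matched patch embeddings is sketched below. The tensor shapes, temperature, and pairing scheme are assumptions for illustration; PCCNet's exact loss may differ.

```python
import torch
import torch.nn.functional as F

def patch_infonce(query_feats, pos_feats, neg_feats, tau=0.07):
    # query_feats, pos_feats: (N, D) embeddings of matched patches across images;
    # neg_feats: (N, K, D) embeddings of non-matching patches.
    q = F.normalize(query_feats, dim=-1)
    p = F.normalize(pos_feats, dim=-1)
    n = F.normalize(neg_feats, dim=-1)
    l_pos = (q * p).sum(-1, keepdim=True)            # (N, 1) positive similarities
    l_neg = torch.einsum('nd,nkd->nk', q, n)         # (N, K) negative similarities
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)           # positives are class 0
```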
17:36 | Reference-Based Line Drawing Colorization through Diffusion Model PRESENTER: Jiaze He ABSTRACT. Line drawing colorization is an indispensable stage in the image painting process; however, traditional manual coloring requires a lot of time and energy from professional artists. With the development of deep learning techniques, attempts have been made to colorize line drawings by means of user prompts, text, etc., but these methods still require some manual involvement. In this paper, we propose a reference-based colorization method for cartoon line drawings, which uses a more stable diffusion model to automatically colorize line drawings and introduces a skeleton map as an additional guide to reduce the bleeding problem encountered during colorization and improve the quality of the generated images. In addition, to further learn the colors of the reference image and improve the quality of the colorized image, we design a two-stage training strategy that first trains a pre-trained model matching cartoon features on a large dataset and then fine-tunes the model on a small dataset. To ensure the generality of the model, in addition to the benchmark dataset of 17,769 images shared on the Kaggle website, we used the cartoon dataset provided by the competition in the fine-tuning phase and produced a garment dataset with cartoon features, which we hope will contribute to the field of garment design. Finally, we illustrate the effectiveness of the model in reference-based automatic coloring through a large number of qualitative and quantitative experiments. |
17:48 | Research of Virtual Try-on Technology Based on Two-dimensional Image PRESENTER: Yue Wang ABSTRACT. Virtual try-on based on two-dimensional images uses a given garment to change the clothes of a human image and generate try-on images. To solve the problems of blurred human images, missing body parts, and clothes that cannot be correctly warped to the posture of the human image after fitting, this paper improves the Flow-Style-VTON network and proposes the virtual try-on method A-VITON. Residual blocks and the CBAM attention mechanism are added to the UNet network of the try-on module to improve the network's ability to extract features of the target object, so that the generated try-on images are more realistic. Secondly, this paper also proposes a layered virtual try-on method to provide consumers with more diverse try-on services. Finally, to reduce the interference of complex backgrounds on the try-on results, this paper proposes, for the first time, a virtual try-on method with background, which can generate high-quality try-on images while preserving the background of the original images. The try-on results on the VITON dataset show that the proposed method has great advantages in generating high-quality try-on images. |
Zoom Link: https://uni-sydney.zoom.us/j/85464507535?pwd=OEQ1TVZsR2ZOSEo4cTB3MitLR25CQT09 Meeting ID: 854 6450 7535, Password: cgi2023
16:00 | Investigation on the Encoder-Decoder application for Mesh generation PRESENTER: Emanuele Balloni ABSTRACT. In computer graphics, 3D modeling is a fundamental concept. It is the process of creating three-dimensional objects or scenes using specialized software that allows users to create, manipulate and modify geometric shapes to build complex models. This operation requires a huge amount of time and specialised knowledge; typically, it takes three to five hours of modelling to obtain a basic mesh from the blueprint. Several approaches have tried to automate this operation to reduce modelling time. The most promising of these approaches are based on deep learning, and one of the most notable is Pixel2Mesh. However, training this network requires at least 150 epochs to obtain usable results. Starting from these premises, this work investigates the possibility of training a modified version of Pixel2Mesh in fewer epochs to obtain comparable or better results. To achieve this, a modification was applied to the convolutional block, replacing the classification-based approach with an image reconstruction-based approach. This modification constructs an encoder-decoder architecture using state-of-the-art networks such as VGG, DenseNet, ResNet, and Inception. Using this approach, the convolutional block learns how to reconstruct the image correctly from the source image by learning the position of the object of interest within the image. With this approach, it was possible to train the complete network in 50 epochs, achieving results that outperform the state of the art. The tests performed on the networks show an increase of 0.5 percentage points over the state-of-the-art average. |
16:12 | Arbitrary Style Transfer with Style Enhancement and Structure Retention PRESENTER: Sijia Yang ABSTRACT. Arbitrary style transfer transfers the style of any reference image to another image through a trained neural network while retaining its content as much as possible. However, early style transfer approaches perform poorly, while some later methods generate results that are over-adapted to the style image and struggle to preserve the image structure. To solve these problems, we propose a new style transfer method based on a neural network structure with two modules: the style enhancement module (SEM) and the structure retention module (SRM). SEM aligns the statistics of the style image and the stylized image in the feature space. SRM uses the fast Fourier transform and Gaussian high-pass filtering to align the high-frequency information of the content image and the transferred image simultaneously in the frequency domain and the spatial domain. This new approach works well for both style transfer and content retention. Both experimental results and a questionnaire survey show that our method can generate satisfactory stylized images without losing content information. |
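Editor's note: the SRM's frequency-domain step can be pictured as a Gaussian high-pass filter applied via the FFT. The single-channel sketch below shows that standard operation; the filter width and the exact way SRM combines the two domains are assumptions, not the paper's implementation.

```python
import numpy as np

def gaussian_highpass(image, sigma=10.0):
    # Keep the high-frequency content of a (H, W) channel by removing a
    # Gaussian low-pass component in the frequency domain. A spatial Gaussian
    # with standard deviation sigma has frequency response exp(-2*pi^2*sigma^2*f^2).
    h, w = image.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    lowpass = np.exp(-2.0 * (np.pi * sigma) ** 2 * (fx ** 2 + fy ** 2))
    spectrum = np.fft.fft2(image)
    return np.real(np.fft.ifft2(spectrum * (1.0 - lowpass)))
```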
16:24 | Zero3D: Semantic-Driven 3D Shape Generation For Zero-shot Learning PRESENTER: Bo Han ABSTRACT. Semantic-driven 3D shape generation aims to generate 3D shapes conditioned on textual input. However, previous approaches have faced challenges with single-category generation, low-frequency details, and the requirement for large quantities of paired data. To address these issues, we propose a multi-category diffusion model. Specifically, our approach includes the following components: 1) To mitigate the problem of limited large-scale paired data, we establish a connection between text, 2D images, and 3D shapes through the use of the pre-trained CLIP model, enabling zero-shot learning. 2) To obtain the multi-category 3D shape feature, we employ a conditional flow model to generate a multi-category shape vector conditioned on the CLIP embedding. 3) To generate multi-category 3D shapes, we utilize a hidden-layer diffusion model conditioned on the multi-category shape vector, resulting in significant reductions in training time and memory consumption. We evaluate the generated results of our framework and demonstrate that our method outperforms existing methods. |
16:36 | Seeing Is No Longer Believing: A Survey on the State of Deepfakes, AI-Generated Humans, and Other Nonveridical Media PRESENTER: Andreea Pocol ABSTRACT. Did you see that crazy photo of Chris Hemsworth wearing a gorgeous, blue ballgown? What about the leaked photo of Bernie Sanders dancing with Sarah Palin? If these don't sound familiar, it's because these events never happened; but with text-to-image generators and deepfake AI technologies, it is effortless for anyone to produce such images. Over the last decade, there has been an explosive rise in research papers, as well as tool development and usage, dedicated to deepfakes, text-to-image generation, and image synthesis. These tools provide users with great creative power, but with that power comes "great responsibility": it is just as easy to produce nefarious and misleading content as it is to produce comedic or artistic content. Therefore, given the recent advances in the field, it is important to assess their impact. In this paper, we conduct meta-research on deepfakes to visualize the evolution of these tools and paper publications. We also identify key authors, research institutions, and papers based on bibliometric data. Finally, we conduct a survey that tests the ability of participants to distinguish photos of real people from fake, AI-generated images of people. Based on our meta-research, survey, and background study, we conclude that humans are falling behind in the race to keep up with AI, and we must be conscious of the societal impact. |
16:48 | Diffusion-based Semantic Image Synthesis from Sparse Layouts PRESENTER: Yuantian Huang ABSTRACT. We present an efficient framework for generating landscape images from sparse semantic layouts via diffusion models. Previous approaches use dense semantic label maps to generate photorealistic images, where the quality of the results highly depends on the shape of each semantic region. In practice, however, it is not trivial to create detailed and accurate semantic layouts in order to obtain plausible results from these methods. To address this issue, we propose a novel type of input that is more sparse and intuitive for use in real-world settings. Our learning-based framework incorporates a carefully designed random masking process to simulate real user input during model training. We leverage the Semantic Diffusion Model (SDM) as a generator to transform sparse label maps into full landscape images where missing semantic information is complemented based on the learned image structure. Furthermore, through a model distillation process, we achieve comparable inference speed to GAN-based models while preserving the generation quality. After training with the well-designed random masking process, the proposed framework is able to generate high-quality landscape images with sparse and intuitive inputs, which is useful for practical applications. Experiments show that our proposed method outperforms existing approaches both quantitatively and qualitatively. |
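Editor's note: the random masking used to simulate sparse user input can be pictured as randomly dropping whole semantic regions from a dense label map during training. The sketch below illustrates that idea; the keep probability and the exact masking rules of the paper are assumptions.

```python
import numpy as np

def sparsify_label_map(label_map, keep_prob=0.3, rng=None):
    # Simulate sparse user input: keep each semantic region with probability
    # keep_prob and replace the rest with an "unknown" label (0).
    if rng is None:
        rng = np.random.default_rng()
    out = np.zeros_like(label_map)
    for label in np.unique(label_map):
        if label != 0 and rng.random() < keep_prob:
            out[label_map == label] = label
    return out
```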
17:00 | FoldGEN: Multimodal Transformer for Garment Sketch-to-photo Generation PRESENTER: Yanfang Wen ABSTRACT. Garment sketch-to-photo generation is one of the most important steps in the process of garment design. Most existing methods handle only a single type of conditional information and struggle to combine multiple kinds of conditional information; they also fail to generate garment folds based on sketch strokes and suffer from low fidelity. Therefore, in this paper we propose FoldGEN, a two-stage multi-modal framework for generating garment images with folds, using sketches and descriptive text as conditional information. In the first stage, we combine the feature matching of discriminators and the semantic perception of a Convolutional Neural Network in vector quantization, which allows the details and folds of the garment images to be reconstructed. In the second stage, a multi-conditional constrained Transformer is used to establish the association between data of different modalities, allowing the generated images to reflect not only the text description but also the folds corresponding to the strokes of the sketch. Experiments show that our method can generate garment images with different folds from sketches with high fidelity, while achieving the best FID and IS on both unimodal and multimodal tasks. |
17:12 | Light Accumulation Map for Natural Foliage Scene Generation PRESENTER: Ruien Shen ABSTRACT. Foliage scene generation is an important problem in virtual reality applications. Realistic virtual floras require simulation of real plant symbiosis principles. Among the factors that affect the spatial distribution of plants, lighting is the most important. The change of seasons, geographic location, and shading from taller plants greatly affect the sunlight conditions of different plants in a flora, which cannot easily be described with parameters. To generate natural foliage scenes that accurately reflect sunlight conditions while maintaining efficiency, we propose a novel method named Light Accumulation Map (LAM), which stores the sunlight-receiving and occlusion information of each tree model. By calculating the accumulated sun lighting over one year at different latitudes, we simulate the sunlight occlusion effect of the tree model and store the occlusion result as a LAM. Then, a LAM-based foliage generation algorithm is presented to simulate accurate foliage distribution for different latitudes and seasons. The evaluation shows that our method exhibits strong adaptability in creating a lifelike distribution of foliage, particularly in undergrowth areas, across various regions and throughout different seasons of the year. |
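Editor's note: the year-long sun accumulation can be approximated with the standard solar-position formulas (declination and hour angle). The sketch below accumulates a clear-sky irradiance proxy for a given latitude, with a hypothetical `visible` callback standing in for the per-direction occlusion stored in the LAM; it is an illustration of the idea, not the paper's implementation.

```python
import numpy as np

def yearly_light_accumulation(latitude_deg, visible=lambda alt_deg, hour_angle_deg: 1.0):
    # Accumulate an approximate clear-sky irradiance proxy (sine of the solar
    # altitude) over one year, sampled hourly. `visible` is a hypothetical
    # occlusion factor in [0, 1] looked up from the tree model.
    lat = np.radians(latitude_deg)
    total = 0.0
    for day in range(365):
        # Approximate solar declination for this day of the year.
        decl = np.radians(23.44) * np.sin(2.0 * np.pi * (day - 81) / 365.0)
        for hour in range(24):
            ha = np.radians(15.0 * (hour - 12))     # hour angle, 15 degrees per hour
            sin_alt = (np.sin(lat) * np.sin(decl)
                       + np.cos(lat) * np.cos(decl) * np.cos(ha))
            if sin_alt > 0:                         # sun above the horizon
                alt_deg = np.degrees(np.arcsin(sin_alt))
                total += sin_alt * visible(alt_deg, np.degrees(ha))
    return total
```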
17:24 | DrawGAN: Multi-view Generative Model Inspired By The Artist's Drawing Method PRESENTER: Zheng Chen ABSTRACT. We present a novel approach to modeling artists' drawing processes using an unconditional generative adversarial network (GAN) architecture with a multi-view generator and multiple discriminators. The proposed method can synthesize different types of drawings, including line drawing, shading, and color drawing, with high quality and robustness, and it outperforms existing state-of-the-art unconditional GANs. The novelty of our approach lies in an architecture that closely resembles the typical sequence of an artist's drawing process, which significantly enhances the quality of the generated images. Our experimental results demonstrate the potential of using a multi-view generative model to provide more feature knowledge for modulating image generation processes. The proposed method holds promise for advancing the field of AI in the visual arts and can open new avenues for research and creative practice. |
17:36 | A Two-step Approach for Interactive Animatable Avatars PRESENTER: Takumi Kitamura ABSTRACT. We propose a new two-step human body animation technique based on displacement mapping that can learn a detailed deformation space, runs at interactive rates (more than 30 fps), and can be directly integrated into standard animation environments. To achieve real-time animation we employ a template-based approach and model pose-dependent deformations with 2D displacement images. We propose our own template model to facilitate and automate training data preparation. The key to achieving detailed animation with few artifacts is to learn pose-dependent displacements directly in the pose space, without having to predict skinning weights. In order to generalize to entirely new motions, we employ a two-step approach in which the first step contains knowledge about general human motion while the second step contains information about user-specific motion. Our experimental results show that our proposed method can animate an avatar up to 300 times faster than baselines while keeping a similar or even better level of detail. |
17:48 | Visual simulation of crack generation and bending in deteriorated films coated on metal objects: Combination of static fracture and position-based deformation PRESENTER: Akinori Ishitobi ABSTRACT. Weathering, an expression of degradation caused by rain and wind, is essential for photorealistic computer graphics. One of the most typical targets of weathering is metal, which is omnipresent in reality. However, to reproduce scenes realistically, rust-proof paint applied to metal surfaces cannot be ignored. In our study, we propose a weathering method for coated films on metal objects. Our method models a coated film as a 3D triangular polygon mesh and deforms it by combining two kinds of simulations: static simulation for determining fractures based on the balance of the internal forces and the position-based bend simulation for moving vertices according to geometric constraints. Our method can digitally reproduce the deterioration of coated films using complex 3D deformation, which is difficult to express by material manipulation only. |