Diffusion-Based Semantic Image Synthesis from Sparse Layouts
Tags: Diffusion Models, Semantic Image Synthesis, Sparse Input
Abstract:
We present an efficient framework for generating landscape images from sparse semantic layouts via diffusion models. Previous approaches use dense semantic label maps to generate photorealistic images, and the quality of their results depends heavily on the shape of each semantic region. In practice, however, creating semantic layouts detailed and accurate enough to obtain plausible results from these methods is not trivial. To address this issue, we propose a novel type of input that is sparser and more intuitive for real-world use. Our learning-based framework incorporates a carefully designed random masking process that simulates real user input during model training. We leverage the Semantic Diffusion Model (SDM) as a generator to transform sparse label maps into full landscape images, complementing missing semantic information based on the learned image structure. Furthermore, through model distillation, we achieve inference speed comparable to that of GAN-based models while preserving generation quality. Trained with this random masking process, the proposed framework generates high-quality landscape images from sparse and intuitive inputs, which is useful for practical applications. Experiments show that our method outperforms existing approaches both quantitatively and qualitatively.
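To make the masking idea concrete, below is a minimal sketch of how a dense semantic label map could be randomly sparsified during training to mimic sparse user input. This is not the paper's actual implementation: the UNKNOWN label id, the rectangular patch-based "strokes", and all parameter ranges are our assumptions for illustration.

```python
import numpy as np

UNKNOWN = 255  # hypothetical "missing" label id; the paper's convention is not specified


def random_mask_labels(label_map: np.ndarray,
                       keep_ratio_range=(0.1, 0.5),
                       num_strokes_range=(1, 8),
                       rng=None) -> np.ndarray:
    """Simulate a sparse user layout: keep a few random patches of the dense
    label map and mark every other pixel as UNKNOWN."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = label_map.shape
    sparse = np.full_like(label_map, UNKNOWN)
    # Reveal a handful of random rectangular patches as stand-ins for
    # brush-like user strokes; the generator must complete the rest.
    for _ in range(rng.integers(*num_strokes_range)):
        ph = max(1, int(h * rng.uniform(*keep_ratio_range) * 0.5))
        pw = max(1, int(w * rng.uniform(*keep_ratio_range) * 0.5))
        y = rng.integers(0, max(h - ph, 1))
        x = rng.integers(0, max(w - pw, 1))
        sparse[y:y + ph, x:x + pw] = label_map[y:y + ph, x:x + pw]
    return sparse
```

During training, each dense ground-truth label map would be passed through such a masking step before being fed to the generator, so that at test time the model is robust to the incomplete layouts a real user draws.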