DreamStyler: Paint by Style Inversion with
Text-to-Image Diffusion Models

NAVER WEBTOON AI¹   Harvard University²   KAIST AI³   SwatchOn⁴

DreamStyler synthesizes outputs based on a given context along with a style reference. Note that each model is trained on a single style image shown in this figure.

Abstract

Recent progress in large-scale text-to-image models has yielded remarkable accomplishments and found various applications in the art domain. However, expressing the unique characteristics of an artwork with text prompts alone may encounter limitations due to the inherent constraints of verbal description. To this end, we introduce DreamStyler, a novel framework designed for artistic image synthesis that is proficient in both text-to-image synthesis and style transfer. DreamStyler optimizes a multi-stage textual embedding with a context-aware text prompt, resulting in prominent image quality. In addition, with content and style guidance, DreamStyler is flexible enough to accommodate a range of style references. Experimental results demonstrate its superior performance across multiple scenarios, suggesting its promising potential for artistic product creation.

Model Architecture

Model overview. (a) DreamStyler constructs a training prompt from an opening text \(C_o\), multi-stage style tokens \(\mathbf{S^*}\), and a context description \(C_c\), which is captioned with BLIP-2 and refined with human feedback. DreamStyler projects the training prompt into multi-stage textual embeddings \(\mathbf{v}^* = \{v^*_1,\dots,v^*_{T}\}\), where \(T\) is the number of stages (each a chunk of the denoising timesteps; we use \(T=6\) by default). As a result, the denoising U-Net receives distinct textual information at each stage. (b) At inference, DreamStyler prepares the textual embedding from a provided inference prompt. For style transfer, DreamStyler employs ControlNet to comprehend the context information of a content image.
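To make the stage mechanism concrete, here is a minimal sketch (not the official implementation) of how a per-stage style embedding could be selected during denoising. It assumes a standard 1000-step DDPM schedule and a 768-dimensional CLIP text encoder as in Stable Diffusion v1; names such as style_embeddings, stage_index, prompt_embedding_for_step, and placeholder_pos are illustrative.

import torch

T = 6                       # number of stages (paper default)
NUM_TRAIN_TIMESTEPS = 1000  # standard DDPM schedule length
EMBED_DIM = 768             # CLIP text-encoder width in Stable Diffusion v1

# One learnable style embedding v*_t per stage.
style_embeddings = torch.nn.Parameter(torch.randn(T, EMBED_DIM) * 0.01)

def stage_index(timestep: int) -> int:
    # Map a denoising timestep to its stage: the schedule is split into
    # T equal chunks, so stage T-1 covers the noisiest timesteps.
    chunk = NUM_TRAIN_TIMESTEPS // T
    return min(timestep // chunk, T - 1)

def prompt_embedding_for_step(token_embeddings: torch.Tensor,
                              placeholder_pos: int,
                              timestep: int) -> torch.Tensor:
    # Substitute the stage-specific embedding v*_t at the style-token
    # position before the text encoder / cross-attention consumes it,
    # so the U-Net sees different textual information at each stage.
    out = token_embeddings.clone()
    out[placeholder_pos] = style_embeddings[stage_index(timestep)]
    return out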

Text-to-image Synthesis

Style Transfer

Style Mixing

Style mixing. Multi-stage TI facilitates style mixing from various style references. A user can customize a new style by substituting style tokens at different stages \(t\). For example, style tokens closer to \(t=T\) tend to influence the structure of the image, while those closer to \(t=0\) have a stronger effect on local and detailed attributes. For comparison, we also display a baseline that employs all style tokens at every stage (i.e., using "A painting in \(S^A_t\), \(S^B_t\), \(S^C_t\) style" at all stages).
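As a rough illustration of this per-stage substitution, the sketch below builds a mixing recipe that draws the structure-level stages from one style and the detail-level stages from another. The styles dict, recipe list, and mixed_style_embedding function are hypothetical names for exposition, not part of the released code.

import torch

T, EMBED_DIM = 6, 768

# Per-stage embeddings for three learned styles (random placeholders here;
# in practice these would be the optimized multi-stage embeddings).
styles = {
    "A": torch.randn(T, EMBED_DIM),
    "B": torch.randn(T, EMBED_DIM),
    "C": torch.randn(T, EMBED_DIM),
}

# Stages near t = T shape global structure; stages near t = 0 shape local
# detail. This recipe takes structure from style A and texture from style C.
recipe = ["C", "C", "B", "B", "A", "A"]  # one style per stage, t = 0..T-1

def mixed_style_embedding(stage: int) -> torch.Tensor:
    # Embedding fed to the style-token slot at the given stage.
    return styles[recipe[stage]][stage]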

BibTeX


@article{ahn2023dreamstyler,
  title={DreamStyler: Paint by Style Inversion with Text-to-Image Diffusion Models},
  author={Ahn, Namhyuk and Lee, Junsoo and Lee, Chunggi and Kim, Kunhee and Kim, Daesik and Nam, Seung-Hun and Hong, Kibeom},
  journal={arXiv preprint arXiv:2309.06933},
  year={2023},
}