The Model Training DreamLLM Underwent: Its Origin Story

DATE POSTED: November 24, 2024

:::info Authors:

(1) Runpei Dong, Xi’an Jiaotong University and Internship at MEGVII;

(2) Chunrui Han, MEGVII Technology;

(3) Yuang Peng, Tsinghua University and Internship at MEGVII;

(4) Zekun Qi, Xi’an Jiaotong University and Internship at MEGVII;

(5) Zheng Ge, MEGVII Technology;

(6) Jinrong Yang, HUST and Internship at MEGVII;

(7) Liang Zhao, MEGVII Technology;

(8) Jianjian Sun, MEGVII Technology;

(9) Hongyu Zhou, MEGVII Technology;

(10) Haoran Wei, MEGVII Technology;

(11) Xiangwen Kong, MEGVII Technology;

(12) Xiangyu Zhang, MEGVII Technology and Project Leader;

(13) Kaisheng Ma, Tsinghua University and Corresponding Author;

(14) Li Yi, Tsinghua University, Corresponding Author and Project Leader.

:::

Table of Links

Abstract and 1 Introduction

2 Background & Problem Statement

2.1 How Can We Use MLLMs for Diffusion Synthesis That Synergizes Both Sides?

3 DreamLLM

3.1 End-to-End Interleaved Generative Pretraining (I-GPT)

3.2 Model Training

4 Experiments and 4.1 Multimodal Comprehension

4.2 Text-Conditional Image Synthesis

4.3 Multimodal Joint Creation & Comprehension

5 Discussions

5.1 Synergy Between Creation & Comprehension?

5.2 What Is Learned by DreamLLM?

6 Related Works

7 Conclusions and References

A Additional Experiments

B Additional Qualitative Examples

C Implementation Details

D Additional Related Works

E Limitations, Failure Cases & Future Works

3.2 MODEL TRAINING

In this work, we adopt a three-stage training procedure, summarized as follows; implementation details, such as the training data, can be found in Table 11 in Appendix C.

I. Alignment Training. This stage alleviates the gap between modalities, facilitating the adaptation of multimodal inputs to LLMs. The linear visual projector, the linear condition projector, and the learnable dream embeddings are pretrained for cross-modal manifold alignment among the frozen LLM, visual encoder, and SD. We use approximately 30M image-text pairs, training both image-to-text comprehension and text-to-image synthesis.
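
The following is a minimal PyTorch-style sketch of this alignment stage under stated assumptions: only the two linear projectors and the dream embeddings are updated, while the LLM, visual encoder, and SD stay frozen. The module names and optimizer settings are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

def freeze(module: nn.Module) -> None:
    """Disable gradient updates for every parameter of a frozen backbone."""
    for p in module.parameters():
        p.requires_grad = False

def setup_alignment_stage(llm, visual_encoder, sd_unet,
                          visual_projector, condition_projector,
                          dream_embeddings: nn.Parameter):
    # Stage I: the LLM, visual encoder, and SD are kept frozen; cross-modal
    # alignment is learned only by the projectors and the dream embeddings.
    for backbone in (llm, visual_encoder, sd_unet):
        freeze(backbone)

    trainable = (list(visual_projector.parameters())
                 + list(condition_projector.parameters())
                 + [dream_embeddings])

    # Placeholder hyperparameters; the actual settings are in Appendix C.
    return torch.optim.AdamW(trainable, lr=1e-3, weight_decay=0.0)
```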

II. I-GPT Pretraining. Following alignment, the LLM is unfrozen for I-GPT pretraining (detailed in Sec. 3.1). This critical stage enables the learning of joint vision-language distributions via generative modeling. Training incorporates approximately 2M documents from MMC4-Core (Zhu et al., 2023b), selectively filtered with a CLIP score threshold of 0.25. Furthermore, we use 2M paired samples from LAION400M (Schuhmann et al., 2021), captioned by BLIP (Li et al., 2022) (i.e., BLIP-LAION), to strengthen text-to-image training and mitigate the impact of low-quality, noisy images and texts from MMC4.
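
A rough sketch of the CLIP-score filtering described above, assuming the public MMC4 JSON layout (an `image_info` list whose entries carry a `matched_sim` similarity score); treat the field names as assumptions if your copy of the data differs.

```python
from typing import Optional

CLIP_SCORE_THRESHOLD = 0.25  # threshold used when selecting MMC4-Core images

def filter_document(doc: dict) -> Optional[dict]:
    """Keep only images whose CLIP image-text similarity meets the threshold;
    drop documents that are left with no images at all."""
    kept = [img for img in doc.get("image_info", [])
            if img.get("matched_sim", 0.0) >= CLIP_SCORE_THRESHOLD]
    if not kept:
        return None
    return {**doc, "image_info": kept}
```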

III. Supervised Fine-tuning. This stage enables the model to perform general multimodal comprehension and creative tasks following human instructions (Ouyang et al., 2022). We utilize approximately 80K visual instruction tuning samples collected by Liu et al. For instruction-following content creation, GPT-4 (OpenAI, 2023) is prompted with document summaries or image captions, yielding approximately 20K instruction-following document synthesis samples from MMC4 (InstructMMC4) and 20K image synthesis samples from BLIP-captioned LAION400M (Instruct-BLIP-LAION).
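
A hypothetical sketch of how Instruct-BLIP-LAION-style samples could be assembled: GPT-4 is prompted with a BLIP caption and asked to write a user instruction that the captioned image would satisfy. The prompt wording and the OpenAI chat-completions call are assumptions for illustration; the paper does not publish its prompts.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_TEMPLATE = (
    "Given the following image caption, write a natural user instruction "
    "asking for such an image to be created.\n\nCaption: {caption}"
)

def make_instruction_sample(caption: str) -> dict:
    """Turn one BLIP caption into an instruction-following image-synthesis sample."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(caption=caption)}],
    )
    instruction = response.choices[0].message.content
    # The original caption remains the conditioning target for image synthesis.
    return {"instruction": instruction, "target_caption": caption}
```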


:::info This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.

:::
