
DreamLLM: Crucial Implementation Details

Date posted: November 28, 2024
Table of Links

Abstract and 1 Introduction

2 Background & Problem Statement

2.1 How can we use MLLMs for Diffusion Synthesis that Synergizes both sides?

3 DreamLLM

3.1 End-to-End Interleaved generative Pretraining (I-GPT)

3.2 Model Training

4 Experiments and 4.1 Multimodal Comprehension

4.2 Text-Conditional Image Synthesis

4.3 Multimodal Joint Creation & Comprehension

5 Discussions

5.1 Synergy between creation & Comprehension?

5.2 What is learned by DreamLLM?

6 Related Works

7 Conclusions and References

A Additional Experiments

B Additional Qualitative Examples

C Implementation Details

D Additional Related Works

E Limitations, Failure Cases & Future Works

C IMPLEMENTATION DETAILS

C.1 TRAINING DATA & HYPER-PARAMETERS

Visual question answering example comparison of DREAMLLM to LLaVA (Liu et al., 2023a), GPT-4 (OpenAI, 2023), BLIP-2 (Li et al., 2023b), and OpenFlamingo (Awadalla et al., 2023b). This table format follows OpenAI (2023).

In Table 11, we list the detailed training dataset usage and hyper-parameters. The training data are constructed based on the following datasets: a) LAION400M (Schuhmann et al., 2021), b) LAIONCOCO (Schuhmann et al., 2023), c) MMC4 (Zhu et al., 2023b), d) BLIP-LAION (Li et al., 2022), which is filtered and captioned by BLIP (Li et al., 2022), e) LLaVAPretrain (Liu et al., 2023a), which contains 558K image-text pairs from BLIP-captioned CC3M (Sharma et al., 2018), SBU (Ordonez et al., 2011), and LAION400M filtered by LLaVA, f) LLaVAInstruct (Liu et al., 2023a), which contains 80K visual instruction-following data constructed by LLaVA, g) InstructMMC4, which is our instruction-following interleaved document generation data curated by prompting GPT-4 to generate instructions based on the text contents of MMC4, and h) Instruct-BLIP-LAION, which is our instruction-following image synthesis data. Similar to InstructMMC4, it is curated by prompting GPT-4 to generate instructions based on image captions. Unless otherwise specified, we randomly sample the indicated number of instances from each dataset during the training process.
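The per-dataset sampling described in C.1 can be illustrated with a minimal sketch. The function name, the toy datasets, and the counts below are hypothetical stand-ins and are not the values from Table 11. Sampling without replacement and a fixed seed are also assumptions, since the text only says that the indicated number of instances is sampled at random.

```python
import random


def sample_training_mix(datasets, counts, seed=0):
    """Draw counts[name] instances at random from each dataset.

    Assumption: sampling is uniform and without replacement; the source
    only states that instances are randomly sampled.
    """
    rng = random.Random(seed)
    mix = []
    for name, n in counts.items():
        pool = datasets[name]
        if n > len(pool):
            raise ValueError(f"{name}: requested {n}, only {len(pool)} available")
        mix.extend((name, item) for item in rng.sample(pool, n))
    # Shuffle so that instances from different datasets are interleaved.
    rng.shuffle(mix)
    return mix


# Toy stand-ins for the real datasets (hypothetical sizes and counts).
toy_datasets = {
    "LAION400M": [f"laion-{i}" for i in range(100)],
    "MMC4": [f"mmc4-{i}" for i in range(50)],
}
toy_counts = {"LAION400M": 10, "MMC4": 5}

mix = sample_training_mix(toy_datasets, toy_counts)
```

Keeping the dataset name next to each instance makes it easy to check afterwards that the mix matches the requested counts.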

C.2 DREAMLLM MODEL

Language Model. We use LLaMA-1 (Touvron et al., 2023a) trained on ShareGPT (Zheng et al., 2023) as the default LLM (i.e., Vicuna-7B[1] (Chiang et al., 2023)), following Liu et al. (2023a), to endow it with instruction-following capacity. During training, we use Flash Attention (Dao et al., 2022) and PyTorch FSDP (Zhao et al., 2023b) to improve training efficiency.

Visual Encoder. The visual encoder is the publicly available OpenAI CLIP-L/14 (Radford et al., 2021) model, which is frozen during the whole process. The images are resized to 224×224 resolution to align with the CLIP pretraining settings, resulting in a sequence of 256 tokens for each image. Following prior VL practice (Lu et al., 2019; Liu et al., 2023a), we add a special token before the image sequence and another special token at the end of the sequence.
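The figure of 256 tokens per image follows from the patch grid of a ViT-style encoder: a 224×224 input split into 14×14 patches gives 16 patches per side. The short sketch below reproduces that arithmetic. The helper name is hypothetical, and the reading that the count covers patch tokens only (no class token) is an inference from the numbers, not something the text states.

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


# CLIP-L/14 at 224x224: 16 patches per side, 16 * 16 = 256 tokens.
clip_l14_tokens = num_patch_tokens(224, 14)
print(clip_l14_tokens)  # 256
```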

Diffusion Image Decoder. We adopt SDv2.1 (Rombach et al., 2022) trained on 512×512 resolution as the default diffusion image decoder. As with the visual encoder, the SD model is frozen without any modifications or training throughout the whole process. When constructing the SD target to compute the MSE loss, we resize the images to 512×512 resolution to fit its pretraining configuration.

Dream Query. We use dream queries to gather semantic context from MLLMs, as introduced in Sec. 3. Unless otherwise specified, we use 64 learnable query embeddings, a setting that is both efficient and effective for generating high-quality images. In order to predict when to generate images, we also introduce a special token, which is placed before the dream query sequence. A closing special token is appended at the end of the sequence, similar to image inputs.
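The layout of a dream query span can be sketched as a start marker, 64 query slots, and an end marker. The token strings below are hypothetical placeholders chosen for illustration; they are not the names used by DreamLLM, and the real queries are learnable embeddings rather than strings.

```python
NUM_DREAM_QUERIES = 64  # default number of dream queries stated in the text

# Hypothetical placeholder names, not the tokens used in the paper.
DREAM_START = "<dream_start>"
DREAM_END = "<dream_end>"


def build_dream_span(num_queries: int = NUM_DREAM_QUERIES) -> list:
    """Return [start marker, query slots..., end marker]."""
    queries = [f"<dream_query_{i}>" for i in range(num_queries)]
    return [DREAM_START, *queries, DREAM_END]


def insert_dream_span(text_tokens: list, position: int,
                      num_queries: int = NUM_DREAM_QUERIES) -> list:
    """Splice a dream span into a token list at the given position."""
    span = build_dream_span(num_queries)
    return text_tokens[:position] + span + text_tokens[position:]


span = build_dream_span()
sequence = insert_dream_span(["A", "photo", "of", "a", "cat", "."], 6)
print(len(span))  # 66: one start marker, 64 queries, one end marker
```

In this sketch, predicting the start marker is what signals that an image should be generated at that position, which mirrors the role the text gives to the special token.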

Overall descriptions of the evaluation benchmarks for evaluating capabilities, including VL comprehension, content creation, and natural language processing (NLP).

C.3 EVALUATION BENCHMARKS

We conducted systematic evaluations of DREAMLLM's VL comprehension, content creation, and NLP capabilities. The benchmarks and datasets used are listed in Table 11. During evaluation, we use the prompt templates listed in Fig. 12.


:::info This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.

:::

[1] Vicuna-7B v1.1: https://huggingface.co/lmsys/vicuna-7b-v1.1.

:::info Authors:

(1) Runpei Dong, Xi’an Jiaotong University and Internship at MEGVII;

(2) Chunrui Han, MEGVII Technology;

(3) Yuang Peng, Tsinghua University and Internship at MEGVII;

(4) Zekun Qi, Xi’an Jiaotong University and Internship at MEGVII;

(5) Zheng Ge, MEGVII Technology;

(6) Jinrong Yang, HUST and Internship at MEGVII;

(7) Liang Zhao, MEGVII Technology;

(8) Jianjian Sun, MEGVII Technology;

(9) Hongyu Zhou, MEGVII Technology;

(10) Haoran Wei, MEGVII Technology;

(11) Xiangwen Kong, MEGVII Technology;

(12) Xiangyu Zhang, MEGVII Technology and a Project leader;

(13) Kaisheng Ma, Tsinghua University and a Corresponding author;

(14) Li Yi, Tsinghua University, a Corresponding author and Project leader.

:::
