:::info Authors:
(1) Siqi Kou, Shanghai Jiao Tong University and with Equal contribution;
(2) Lanxiang Hu, University of California, San Diego and with Equal contribution;
(3) Zhezhi He, Shanghai Jiao Tong University;
(4) Zhijie Deng, Shanghai Jiao Tong University;
(5) Hao Zhang, University of California, San Diego.
:::
Table of Links
3. Methodology and 3.1. Preliminary: Jacobi Decoding
3.2. Consistency Large Language Models (CLLMs)
3.3. Acceleration Mechanisms in CLLMs
4. Experiments
4.2. Acceleration Mechanisms in CLLMs
4.4. Limitations and Discussion
5. Conclusion, Impact Statement, and References
A. Illustration of Consistency Loss Learning Objectives
B. Comparison with Baseline Algorithms
C. Pseudo Code for Jacobi Decoding with KV Cache
Abstract
Parallel decoding methods such as Jacobi decoding show promise for more efficient LLM inference, as they break the sequential nature of the LLM decoding process and transform it into parallelizable computation. However, in practice, Jacobi decoding achieves little speedup compared to traditional autoregressive (AR) decoding, primarily because it seldom accurately predicts more than one token in a single fixed-point iteration step. To address this, we develop a new approach aimed at realizing fast convergence from any state to the fixed point on a Jacobi trajectory. This is accomplished by refining the target LLM to consistently predict the fixed point given any state as input. Extensive experiments demonstrate the effectiveness of our method, showing 2.4× to 3.4× improvements in generation speed while preserving generation quality across both domain-specific and open-domain benchmarks. Our code is available at https://github.com/hao-ai-lab/Consistency_LLM.
1. Introduction
Large language models (LLMs), including GPT-4 (Achiam et al., 2023), LLaMA (Touvron et al., 2023a;b), and PaLM (Anil et al., 2023), are pushing the limit of artificial intelligence. As LLMs are integrated into more applications (Zheng et al., 2023; Wu et al., 2023), their inference latency plays a crucial role in ensuring a positive user experience and high service quality. However, LLM serving operates in an AR paradigm, generating one token at a time because the attention mechanism needs the states of all previous tokens to generate the next one. To produce a lengthy response, one must execute forward passes through the LLM as many times as the number of tokens generated, resulting in high latency.
\ Existing methods address this issue from various perspectives. For example, speculative decoding (Leviathan et al., 2023; Chen et al., 2023) introduces a small draft LLM to guess tokens and lets the target LLM verify them in parallel. Although such methods can opportunistically generate multiple tokens in a single evaluation of the target LLM, obtaining a small yet effective draft model is non-trivial, and managing multiple models within a single system remains a challenging engineering task. Medusa (Cai et al., 2024) alternatively augments the target LLM with extra guess heads to enable self-speculation with as much as 3× speedup on various tasks. Yet, the number of added parameters can be significant (e.g., Medusa2 with 5 extra heads adds 1.6B parameters for a 6.7B target LLM). The increased memory consumption can limit generation length and negatively affect inference latency due to the reduction in memory available for the key-value (KV) cache (Pope et al., 2023).
\ On the other hand, originating from the Jacobi and Gauss-Seidel fixed-point iterations for solving nonlinear equations (Ortega & Rheinboldt, 2000; Song et al., 2021a), the Jacobi decoding method (Santilli et al., 2023) first randomly guesses the next n tokens in a sequence (referred to as the n-token sequence hereinafter) from an input prompt. The n-token sequence, along with the prompt, is then fed to the LLM to iteratively update itself. Eventually, the n-token sequence converges to the same output generated by AR decoding under a greedy strategy (see Figure 1). The evolution of the n-token sequence forms a Jacobi trajectory from a randomly initialized sequence to the n-token sequence generated by AR decoding (i.e., the fixed point).
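\ For reference, below is a minimal sketch of greedy Jacobi decoding with a Hugging Face-style causal LM; the function name, the stopping criterion, and the iteration cap are illustrative assumptions rather than the exact implementation of Santilli et al. (2023).

```python
import torch


@torch.no_grad()
def jacobi_decode(model, prompt_ids, n=16, max_iters=64):
    """Greedy Jacobi decoding sketch: iteratively refine a random n-token
    guess until it stops changing, i.e., reaches the fixed point that
    greedy AR decoding would produce."""
    prompt_len = prompt_ids.shape[1]
    # Randomly initialize the n-token sequence appended after the prompt.
    guess = torch.randint(
        0, model.config.vocab_size, (1, n), device=prompt_ids.device
    )
    for _ in range(max_iters):
        logits = model(torch.cat([prompt_ids, guess], dim=-1)).logits
        # Greedy prediction for each of the n guessed positions, each one
        # conditioned on the (possibly still wrong) tokens before it.
        new_guess = logits[:, prompt_len - 1 : -1, :].argmax(dim=-1)
        if torch.equal(new_guess, guess):  # fixed point reached
            break
        guess = new_guess
    return torch.cat([prompt_ids, guess], dim=-1)
```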
\ However, vanilla Jacobi decoding for LLMs shows only marginal speedup over AR decoding in practice, e.g., an average of 1.05× speedup in Santilli et al. (2023). This is because an LLM can rarely yield a correct token when there are incorrect[1] preceding tokens, due to the attention mechanism, resulting in a long trajectory as illustrated on the left side of Figure 2. Lookahead decoding (Fu et al., 2024) improves the efficiency by leveraging n-grams generated from previous Jacobi iterations and verifying them in parallel during the decoding process. However, neither work is able to achieve the same level of speedup as Medusa.
\ This work aims to achieve all three goals, namely significant speedup, no extra memory cost for auxiliary model components, and preserved generation quality, by refining the target LLM itself. Specifically, we propose to fine-tune the LLM so that it can yield multiple, instead of one, subsequent tokens of a prefix at once. In the ideal case, with the prompt and a randomly initialized n-token sequence as input, our goal is to train an LLM that can generate the same n-token sequence as AR decoding (the fixed point) using only one step. Our preliminary experiments show that this single-step learning task is difficult when n is large and leads to slow model convergence. We therefore ease the learning process by also taking into account intermediate points on the Jacobi trajectory that contain more correct tokens. In particular, for the second-to-last point on the trajectory, the learning is identical to AR modeling, at which the target LLM without adaptation already excels.
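\ Stated compactly, with notation we introduce here for illustration (the excerpt itself does not give a formula), the eased objective asks the fine-tuned model to satisfy:

```latex
% x: the prompt; y^(0): a randomly initialized n-token sequence;
% y^(0), y^(1), ..., y^*: the Jacobi trajectory ending at the fixed point y^*
% (the greedy AR output); f_theta: the fine-tuned model.
f_\theta\!\left(\mathbf{y}^{(j)}, \mathbf{x}\right) = \mathbf{y}^{*}
\quad \text{for every point } \mathbf{y}^{(j)} \text{ on the Jacobi trajectory.}
```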
\ We argue that such a learning strategy, in which a single model is tuned to solve a series of learning problems that map any arbitrary point on the trajectory to the fixed point, is beneficial to model convergence (see Figure 4 and Figure 5). Imagining the evolution of the n-token sequence as the denoising process of a natural image (Ho et al., 2020; Song et al., 2021b), we find that the above learning procedure draws a surprisingly sharp analogy to the acceleration technique for diffusion models known as consistency models (CMs) (Song et al., 2023; Song & Dhariwal, 2023). CMs aim to achieve single-step image generation using the denoising objective by minimizing distances between consecutive denoising steps along the probability flow ordinary differential equation (ODE) trajectory during training. Our method and CMs share the notion of directly mapping intermediate states of a solving process (of nonlinear systems or ODEs) to its final solution for inference acceleration. Based on this analogy, we refer to our trained models as Consistency Large Language Models (CLLMs). In comparison with previous methods like speculative decoding and Medusa, CLLMs do not introduce extra memory cost to accommodate auxiliary model components while delivering significant speedup with minimal performance degradation.
\ Implementing this learning strategy only requires model training with two loss terms. Following CMs, we convert the aforementioned learning objective into a consistency loss, in which the model is required to map an arbitrary point on the Jacobi trajectory to the fixed point. CLLMs also include an AR loss to avoid deviating from the distribution of the target LLM and hence ensure generation quality.
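\ To make the two loss terms concrete, the following is a minimal PyTorch-style sketch of how they could be combined on one training example; the cross-entropy toward fixed-point tokens, the weighting factor alpha, and the helper signature are our assumptions for illustration, not the paper's exact training recipe.

```python
import torch
import torch.nn.functional as F


def cllm_loss(model, prompt_ids, trajectory, fixed_point, alpha=1.0):
    """Sketch of the two CLLM loss terms on one training example.

    trajectory:  list of intermediate n-token states from a Jacobi trajectory
    fixed_point: the converged n-token sequence (the greedy AR output)
    """
    prompt_len = prompt_ids.shape[1]
    targets = fixed_point.reshape(-1)

    # Consistency loss: starting from a randomly chosen intermediate state,
    # the model should predict the fixed-point token at every position.
    state = trajectory[torch.randint(len(trajectory), (1,)).item()]
    logits = model(torch.cat([prompt_ids, state], dim=-1)).logits
    logits = logits[:, prompt_len - 1 : -1, :]
    consistency_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets)

    # AR loss: ordinary next-token prediction on the fixed point itself,
    # keeping the model close to the target LLM's output distribution.
    ar_logits = model(torch.cat([prompt_ids, fixed_point], dim=-1)).logits
    ar_logits = ar_logits[:, prompt_len - 1 : -1, :]
    ar_loss = F.cross_entropy(ar_logits.reshape(-1, ar_logits.size(-1)), targets)

    return consistency_loss + alpha * ar_loss
```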
\ The fine-tuning cost of CLLMs is moderate, e.g., training on only ∼ 1M tokens for LLaMA-7B to achieve a 3.4× speedup on the Spider dataset. We further empirically identify that such acceleration is likely to stem from the existence of 1) fast forwarding, where multiple consecutive tokens are correctly predicted in a single forward pass, and 2) stationary tokens, which are correctly predicted and remain unaltered through subsequent iterations, despite being preceded by inaccurate tokens. An illustration of the examples is shown in Figure 2.
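\ As a rough illustration of how these two phenomena could be counted per Jacobi iteration, consider the sketch below; the helpers and their exact accounting are our own simplification of the definitions above, not the paper's measurement code.

```python
import torch


def correct_prefix_len(guess, fixed_point):
    """Length of the leading run of tokens that already match the fixed point."""
    for i, ok in enumerate((guess == fixed_point).tolist()):
        if not ok:
            return i
    return len(guess)


def count_fast_forward_and_stationary(prev_guess, new_guess, fixed_point):
    """For one Jacobi iteration over 1-D token tensors, count (a) fast-forwarded
    tokens: how much the correct prefix grew in this single forward pass, and
    (b) stationary tokens: correct, unchanged positions that still sit behind
    at least one incorrect token."""
    prev_prefix = correct_prefix_len(prev_guess, fixed_point)
    new_prefix = correct_prefix_len(new_guess, fixed_point)
    fast_forward = max(new_prefix - prev_prefix, 0)

    correct = new_guess == fixed_point
    unchanged = new_guess == prev_guess
    stationary = int((correct[new_prefix:] & unchanged[new_prefix:]).sum())
    return fast_forward, stationary
```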
\ To summarize, our key contributions are as follows:
\ • We propose Consistency Large Language Models (CLLMs), a new family of LLMs specialized for the Jacobi decoding method for latency reduction.
\ • We empirically observe the fast forwarding and stationary token phenomena in Jacobi decoding of CLLMs: compared to the original LLM, CLLMs achieve a 2.0× to 6.8× improvement in the count of fast-forwarded and stationary tokens.
\ • We demonstrate the efficacy of CLLMs on a variety of benchmarks. On domain-specific benchmarks including GSM8K, CodeSearchNet Python, and Spider, CLLMs can achieve a 2.4× to 3.4× speedup using Jacobi decoding with nearly no loss in accuracy. On the open-domain benchmark MT-bench, CLLMs can achieve a 2.4× speedup on ShareGPT with state-of-the-art performance, scoring 6.4.
\
:::info This paper is available on arxiv under CC0 1.0 Universal license.
:::
[1] By correctness, we mean alignment with the AR decoding result under a greedy sampling strategy.