DM Television

Here, we outline the proof that the proposed bifurcated attention in Equation 3 and 4 recovers the same attention as the operations in 1 and 2 for the case of single-context batch sampling. We use the fact that the KV part corresponding to context length, all the batch indices correspond to the tensors.

E.2. Detailed Memory I/O Analysis

Overall, the memory I/O complexity changes from

\ • Original memory I/O cost: bhnk + bgmk + bhnm (for ⟨q, K⟩) + bhnm + bgmk + bnd (for ⟨w, V ⟩)

\ • Bifurcated attention memory I/O cost: bhnk + gmck + bgmdk + bhnm (for ⟨q, K⟩) + bhnm + gmck + bgmdk + bnd (for ⟨w, V ⟩)

\ There is an associated memory IO to write the ⟨w, Vc⟩ and ⟨w, Vd⟩ output twice. However, it is typically very small (bnd) compared to the IO of KV cache component bgmk since m >> n = 1.

E.3. Implementation of Bifurcated Attention

Despite the dramatic gain in inference efficiency of the bifurcated attention, we demonstrate the simplicity of our implementation involving 20 lines of code using Pytorch (Paszke et al., 2019).

\ Comparison of memory access and computation between Multi Head, Multi Query, and Multi Group attention mechanisms. The memory access is for incremental decoding with the query length n = 1.

:::info This paper is available on arxiv under CC BY 4.0 DEED license.

:::

Feed: Hacker Noon - Medium

View: Original article

Tags: applications small