The Impact of Data Size on Transformer Training: Overfitting & Loss Dynamics
Table of Links
Abstract and 1 Introduction
2 Related Work
3 Model and 3.1 Associative memories
3.2 Transformer blocks
4 A New Energy Function
4.1 The layered structure
5 Cross-Entropy Loss
6 Emp...