Representational Capabilities of Transformers and Recurrent Architectures
Speaker:
Satwik Bhattamishra (University of Oxford)
Abstract:
Transformers have been the dominant architecture for building LLMs, but recently there has been renewed interest in recurrent or state-space architectures as a way to circumvent the quadratic inference complexity of Transformers. Transformers process tokens in parallel using attention mechanisms, without maintaining a persistent hidden state, whereas recurrent models store information in a fixed-size hidden state that is updated sequentially as the computation proceeds. In this talk, we will explore the differences in the computational capabilities of these two classes of architectures and how our understanding of their strengths and limitations has evolved over the years. Specifically, we will discuss formal language and algorithmic reasoning problems that are provably easy for one architecture to express but difficult for the other. We will demonstrate how tools from theoretical computer science, such as communication complexity and formal language theory, can be adapted to prove limitations of neural sequence models. We will also present differences in their empirical ability to learn various algorithmic tasks and discuss how certain synthetic tasks relate to real-world performance. The talk will primarily cover results from [1], while also discussing relevant findings from [2, 3] and related works.
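The architectural contrast described above can be made concrete with a minimal, illustrative sketch (plain NumPy, not taken from the talk or the papers; all function and variable names here are hypothetical): single-head self-attention forms an n-by-n score matrix over all token pairs, which is the source of the quadratic cost, while a simple linear recurrence touches each token once and keeps only a fixed-size hidden state.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_layer(X, Wq, Wk, Wv):
    """Single-head self-attention: every token attends to every token.
    The (n x n) score matrix is what makes inference cost quadratic in n."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # shape (n, n)
    return softmax(scores, axis=-1) @ V

def recurrent_layer(X, A, B):
    """Toy linear recurrence: a fixed-size state h is updated once per token,
    so memory does not grow with sequence length."""
    h = np.zeros(A.shape[0])
    outputs = []
    for x_t in X:                              # one sequential update per token
        h = A @ h + B @ x_t
        outputs.append(h.copy())
    return np.stack(outputs)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 8, 4                                # sequence length, model width
    X = rng.standard_normal((n, d))
    Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
    A = 0.1 * rng.standard_normal((d, d))
    B = rng.standard_normal((d, d))
    print(attention_layer(X, Wq, Wk, Wv).shape)  # (8, 4): all-pairs interaction
    print(recurrent_layer(X, A, B).shape)        # (8, 4): fixed-size sequential state
```

The sketch only mirrors the high-level distinction drawn in the abstract; the constructions and separation results discussed in the talk concern what such models can or cannot express, not these particular implementations.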
[1] Separations in the Representational Capabilities of Transformers and Recurrent Architectures. Satwik Bhattamishra, Michael Hahn, Phil Blunsom, Varun Kanade. NeurIPS 2024
[2] Understanding In-Context Learning in Transformers and LLMs by Learning to Learn Discrete Functions. Satwik Bhattamishra, Arkil Patel, Phil Blunsom, Varun Kanade. ICLR 2024
[3] On the Ability and Limitations of Transformers to Recognize Formal Languages. Satwik Bhattamishra, Kabir Ahuja, Navin Goyal. EMNLP 2020