We chose this paper because it kicked off a revolution in NLP architectures by introducing the attention-based Transformer. While the paper lays out the technical approach well, its visualizations are not the most helpful for understanding how Transformers work, so we also consulted The Illustrated Transformer.