See Also
Links
“When Parts Are Greater Than Sums: Individual LLM Components Can Outperform Full Models”, Chang et al 2024
“AI Is a Black Box. Anthropic Figured Out a Way to Look Inside: What Goes on in Artificial Neural Networks Is Largely a Mystery, Even to Their Creators. But Researchers from Anthropic Have Caught a Glimpse”, Levy 2024
“Revisiting the Equivalence of In-Context Learning and Gradient Descent: The Impact of Data Distribution”, Mahdavi et al 2024
“Zoology: Measuring and Improving Recall in Efficient Language Models”, Arora et al 2023
“HyperAttention: Long-Context Attention in Near-Linear Time”, Han et al 2023
“LongLoRA: Efficient Fine-Tuning of Long-Context Large Language Models”, Chen et al 2023
“H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models”, Zhang et al 2023
“Unlimiformer: Long-Range Transformers With Unlimited Length Input”, Bertsch et al 2023
“How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers”, Hassid et al 2022
“Scaling Laws vs Model Architectures: How Does Inductive Bias Influence Scaling?”, Tay et al 2022
“Random Feature Attention”, Peng et al 2022
“Sparse Is Enough in Scaling Transformers”, Jaszczur et al 2021
“You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling”, Zeng et al 2021
“Scatterbrain: Unifying Sparse and Low-Rank Attention Approximation”, Chen et al 2021
“Combiner: Full Attention Transformer With Sparse Computation Cost”, Ren et al 2021
“OmniNet: Omnidirectional Representations from Transformers”, Tay et al 2021
“Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention”, Xiong et al 2021
“Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting”, Zhou et al 2020
“SMYRF: Efficient Attention Using Asymmetric Clustering”, Daras et al 2020
“FAVOR+: Rethinking Attention With Performers”, Choromanski et al 2020
“Cluster-Former: Clustering-Based Sparse Transformer for Long-Range Dependency Encoding”, Wang et al 2020
“DeepSpeed Sparse Attention”, Team 2020
“BigBird: Transformers for Longer Sequences”, Zaheer et al 2020
“Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation”, Wang et al 2020
“Efficient Content-Based Sparse Attention With Routing Transformers”, Roy et al 2020
“Sparse Sinkhorn Attention”, Tay et al 2020
“Reformer: The Efficient Transformer”, Kitaev et al 2020
“The Reformer—Pushing the Limits of Language Modeling”, Platen 2020
“Axial Attention in Multidimensional Transformers”, Ho et al 2019
“Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting”, Li et al 2019
“Scaling Autoregressive Video Models”, Weissenborn et al 2019
“Adaptive Attention Span in Transformers”, Sukhbaatar et al 2019
“Generating Long Sequences With Sparse Transformers”, Child et al 2019
“Generative Modeling With Sparse Transformers: We’ve Developed the Sparse Transformer, a Deep Neural Network Which Sets New Records at Predicting What Comes Next in a Sequence—Whether Text, Images, or Sound. It Uses an Algorithmic Improvement of the Attention Mechanism to Extract Patterns from Sequences 30× Longer Than Possible Previously”, Child & Gray 2019
“Star-Transformer”, Guo et al 2019
“CCNet: Criss-Cross Attention for Semantic Segmentation”, Huang et al 2018
“Image Transformer”, Parmar et al 2018
“Constructing Transformers For Longer Sequences With Sparse Attention Methods”
“A Deep Dive into the Reformer”
“Optimal Transport and the Sinkhorn Transformer”
Sort By Magic
Annotations sorted by machine learning into inferred 'tags'. This provides an alternative way to browse: instead of by date order, one can browse in topic order. The 'sorted' list has been automatically clustered into multiple sections & auto-labeled for easier browsing.
Beginning with the newest annotation, it uses the embedding of each annotation to attempt to create a list of nearest-neighbor annotations, creating a progression of topics; a minimal code sketch of this sorting follows the tag clusters below. For more details, see the link.
efficiency-exploration model-interpretability attention-compression autoregressive-learning clustering-techniques
long-context-finetuning long-sequence-forecasting efficient-inference unlimited-input locality-breaking
sparse-attention linear-scaling efficient-transformers attention-optimization long-context clustering transformers
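As a concrete illustration of the embedding-based sorting described above, here is a minimal Python sketch (assuming scikit-learn and a hypothetical `embed()` sentence-embedding helper; it is not the actual site code): each annotation is chained to its nearest unvisited neighbor starting from the newest one, and the same embeddings are clustered into sections that can then be auto-labeled.

```python
# Minimal sketch of the "sort by magic" ordering described above: given one
# embedding vector per annotation, greedily chain nearest neighbors into a
# reading order starting from the newest annotation, and cluster annotations
# into topic sections. Names and parameters are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity


def sort_by_similarity(embeddings: np.ndarray, start: int = 0) -> list[int]:
    """Greedy nearest-neighbor walk over annotation embeddings.

    `embeddings` is an (n, d) array, with row `start` the newest annotation;
    returns indices in a similarity-chained reading order.
    """
    n = embeddings.shape[0]
    sims = cosine_similarity(embeddings)  # (n, n) pairwise cosine similarities
    order, visited = [start], {start}
    while len(order) < n:
        last = order[-1]
        # Pick the most similar annotation that has not been placed yet.
        best = max((j for j in range(n) if j not in visited),
                   key=lambda j: sims[last, j])
        order.append(best)
        visited.add(best)
    return order


def cluster_sections(embeddings: np.ndarray, n_sections: int = 3) -> np.ndarray:
    """Split annotations into topic clusters; auto-labeling each cluster
    (e.g. summarizing its members' titles into a tag) is omitted here."""
    return KMeans(n_clusters=n_sections, n_init=10, random_state=0).fit_predict(embeddings)


# Usage (hypothetical): `annotations` is a list of annotation texts and
# `embed` any sentence-embedding function returning fixed-length vectors.
# embs = np.vstack([embed(a) for a in annotations])   # newest annotation first
# reading_order = sort_by_similarity(embs)
# section_of = cluster_sections(embs, n_sections=3)
```

The greedy walk supplies the "progression of topics"; the cluster assignments correspond to labeled groups like the three tag clusters listed above.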
Miscellaneous
Bibliography
- https://arxiv.org/abs/2406.13131: “When Parts Are Greater Than Sums: Individual LLM Components Can Outperform Full Models”, Chang et al 2024
- https://www.wired.com/story/anthropic-black-box-ai-research-neurons-features/: “AI Is a Black Box. Anthropic Figured Out a Way to Look Inside: What Goes on in Artificial Neural Networks Is Largely a Mystery, Even to Their Creators. But Researchers from Anthropic Have Caught a Glimpse”, Levy 2024
- https://ieeexplore.ieee.org/abstract/document/10446522: “Revisiting the Equivalence of In-Context Learning and Gradient Descent: The Impact of Data Distribution”, Mahdavi et al 2024
- https://arxiv.org/abs/2312.04927: “Zoology: Measuring and Improving Recall in Efficient Language Models”, Arora et al 2023
- https://arxiv.org/abs/2306.14048: “H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models”, Zhang et al 2023
- https://arxiv.org/abs/2305.01625: “Unlimiformer: Long-Range Transformers With Unlimited Length Input”, Bertsch et al 2023
- https://arxiv.org/abs/2211.03495: “How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers”, Hassid et al 2022
- https://arxiv.org/abs/2207.10551#google: “Scaling Laws vs Model Architectures: How Does Inductive Bias Influence Scaling?”, Tay et al 2022
- https://arxiv.org/abs/2111.12763#google: “Sparse Is Enough in Scaling Transformers”, Jaszczur et al 2021
- https://arxiv.org/abs/2111.09714: “You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling”, Zeng et al 2021
- https://arxiv.org/abs/2110.15343#facebook: “Scatterbrain: Unifying Sparse and Low-Rank Attention Approximation”, Chen et al 2021
- https://arxiv.org/abs/2103.01075#google: “OmniNet: Omnidirectional Representations from Transformers”, Tay et al 2021
- https://arxiv.org/abs/2102.03902: “Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention”, Xiong et al 2021
- https://arxiv.org/abs/2010.05315: “SMYRF: Efficient Attention Using Asymmetric Clustering”, Daras et al 2020
- https://arxiv.org/abs/2003.07853#google: “Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation”, Wang et al 2020
- https://arxiv.org/abs/2003.05997#google: “Efficient Content-Based Sparse Attention With Routing Transformers”, Roy et al 2020
- https://arxiv.org/abs/2001.04451#google: “Reformer: The Efficient Transformer”, Kitaev et al 2020
- https://arxiv.org/abs/1811.11721: “CCNet: Criss-Cross Attention for Semantic Segmentation”, Huang et al 2018