‘Transformer’ tag

Annotations sorted by machine learning into inferred 'tags'. This provides an alternative way to browse: instead of by date order, one can browse in topic order. The 'sorted' list has been automatically clustered into multiple sections & auto-labeled for easier browsing.

Beginning with the newest annotation, it uses the embedding of each annotation to attempt to create a list of nearest-neighbor annotations, creating a progression of topics. For more details, see the link.

Wikipedia

Attention (machine learning)
BERT (language model)
Hopfield network
Perceiver⁠:

https://en.wikipedia.org/wiki/Perceiver
Vision transformer
Wu Dao⁠:

https://en.wikipedia.org/wiki/Wu_Dao

Miscellaneous

Bibliography

https://arxiv.org/abs/2408.00118#google: “Gemma 2: Improving Open Language Models at a Practical Size”, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, Johan Ferret, Peter Liu, Pouya Tafti, Abe Friesen, Michelle Casbon, Sabela Ramos, Ravin Kumar, Charline Le Lan, Sammy Jerome, Anton Tsitsulin, Nino Vieillard, Piotr Stanczyk, Sertan Girgin, Nikola Momchev, Matt Hoffman, Shantanu Thakoor, Jean-Bastien Grill, Behnam Neyshabur, Olivier Bachem, Alanna Walton, Aliaksei Severyn, Alicia Parrish, Aliya Ahmad, Allen Hutchison, Alvin Abdagic, Amanda Carl, Amy Shen, Andy Brock, Andy Coenen, Anthony Laforge, Antonia Paterson, Ben Bastian, Bilal Piot, Bo Wu, Brandon Royal, Charlie Chen, Chintu Kumar, Chris Perry, Chris Welty, Christopher A. Choquette-Choo, Danila Sinopalnikov, David Weinberger, Dimple Vijaykumar, Dominika Rogozińska, Dustin Herbison, Elisa Bandy, Emma Wang, Eric Noland, Erica Moreira, Evan Senter, Evgenii Eltyshev, Francesco Visin, Gabriel Rasskin, Gary Wei, Glenn Cameron, Gus Martins, Hadi Hashemi, Hanna Klimczak-Plucińska, Harleen Batra, Harsh Dhand, Ivan Nardini, Jacinda Mein, Jack Zhou, James Svensson, Jeff Stanway, Jetha Chan, Jin Peng Zhou, Joana Carrasqueira, Joana Iljazi, Jocelyn Becker, Joe Fernandez, Joost van Amersfoort, Josh Gordon, Josh Lipschultz, Josh Newlan, Ju-yeong Ji, Kareem Mohamed, Kartikeya Badola, Kat Black, Katie Millican, Keelin McDonell, Kelvin Nguyen, Kiranbir Sodhia, Kish Greene, Lars Lowe Sjoesund, Lauren Usui, Laurent Sifre, Lena Heuermann, Leticia Lago, Lilly McNealus, Livio Baldini Soares, Logan Kilpatrick, Lucas Dixon, Luciano Martins, Machel Reid, Manvinder Singh, Mark Iverson, Martin Görner, Mat Velloso, Mateo Wirth, Matt Davidow, Matt Miller, Matthew Rahtz, Matthew Watson, Meg Risdal, Mehran Kazemi, Michael Moynihan, Ming Zhang, Minsuk Kahng, Minwoo Park, Mofi Rahman, Mohit Khatwani, Natalie Dao, Nenshad Bardoliwalla, Nesh Devanathan, Neta Dumai, Nilay Chauhan, Oscar Wahltinez, Pankil Botarda, Parker Barnes, Paul Barham, Paul Michel, Pengchong Jin, Petko Georgiev, Phil Culliton, Pradeep Kuppala, Ramona Comanescu, Ramona Merhej, Reena Jana, Reza Ardeshir Rokni, Rishabh Agarwal, Ryan Mullins, Samaneh Saadat, Sara Mc Carthy, Sarah Cogan, Sarah Perrin, Sébastien M. R. Arnold, Sebastian Krause, Shengyang Dai, Shruti Garg, Shruti Sheth, Sue Ronstrom, Susan Chan, Timothy Jordan, Ting Yu, Tom Eccles, Tom Hennigan, Tomas Kocisky, Tulsee Doshi, Vihan Jain, Vikas Yadav, Vilobh Meshram, Vishal Dharmadhikari, Warren Barkley, Wei Wei, Wenming Ye, Woohyun Han, Woosuk Kwon, Xiang Xu, Zhe Shen, Zhitao Gong, Zichuan Wei, Victor Cotruta, Phoebe Kirk, Anand Rao, Minh Giang, Ludovic Peran, Tris Warkentin, Eli Collins, Joelle Barral, Zoubin Ghahramani, Raia Hadsell, D. Sculley, Jeanine Banks, Anca Dragan, Slav Petrov, Oriol Vinyals, Jeff Dean, Demis Hassabis, Koray Kavukcuoglu, Clement Farabet, Elena Buchatskaya, Sebastian Borgeaud, Noah Fiedel, Armand Joulin, Kathleen Kenealy, Robert Dadashi, Alek Andreev

link-bibliography
https://www.lesswrong.com/posts/ADrTuuus6JsQr5CSi/investigating-the-ability-of-llms-to-recognize-their-own: “Investigating the Ability of LLMs to Recognize Their Own Writing”, Christopher Ackerman, Nina Panickssery

link-bibliography
https://arxiv.org/abs/2405.20233: “Grokfast: Accelerated Grokking by Amplifying Slow Gradients”, Jaerin Lee, Bong Gyun Kang, Kihoon Kim, Kyoung Mu Lee

link-bibliography
https://arxiv.org/abs/2405.14860: “Not All Language Model Features Are Linear”, Joshua Engels, Isaac Liao, Eric J. Michaud, Wes Gurnee, Max Tegmark

link-bibliography
https://arxiv.org/abs/2405.05254#microsoft: “You Only Cache Once: Decoder-Decoder Architectures for Language Models”, Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei

link-bibliography
https://arxiv.org/abs/2404.13292: “Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge”, Khuyagbaatar Batsuren, Ekaterina Vylomova, Verna Dankers, Tsetsuukhei Delgerbaatar, Omri Uzan, Yuval Pinter, Gábor Bella

link-bibliography
https://arxiv.org/abs/2404.10102: “Chinchilla Scaling: A Replication Attempt”, Tamay Besiroglu, Ege Erdil, Matthew Barnett, Josh You

link-bibliography
https://osf.io/preprints/psyarxiv/kjuce: “Language Models Accurately Infer Correlations between Psychological Items and Scales from Text Alone”, Björn E. Hommel, Ruben C. Arslan

link-bibliography
https://inflection.ai/inflection-2-5: “Inflection-2.5: Meet the World’s Best Personal AI”, Inflection

link-bibliography
https://arxiv.org/abs/2312.03876: “Scaling Transformer Neural Networks for Skillful and Reliable Medium-Range Weather Forecasting”, Tung Nguyen, Rohan Shah, Hritik Bansal, Troy Arcomano, Sandeep Madireddy, Romit Maulik, Veerabhadra Kotamarthi, Ian Foster, Aditya Grover

link-bibliography
https://arxiv.org/abs/2312.02116: “GIVT: Generative Infinite-Vocabulary Transformers”, Michael Tschannen, Cian Eastwood, Fabian Mentzer

link-bibliography
https://arxiv.org/abs/2311.03079#zhipu: “CogVLM: Visual Expert for Pretrained Language Models”, Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, Jie Tang

link-bibliography
https://arxiv.org/abs/2310.16836: “LLM-FP4: 4-Bit Floating-Point Quantized Transformers”, Shih-yang Liu, Zechun Liu, Xijie Huang, Pingcheng Dong, Kwang-Ting Cheng

link-bibliography
https://arxiv.org/abs/2310.13061: “To Grok or Not to Grok: Disentangling Generalization and Memorization on Corrupted Algorithmic Datasets”, Darshil Doshi, Aritra Das, Tianyu He, Andrey Gromov

link-bibliography
https://arxiv.org/abs/2310.07096#ibm: “Sparse Universal Transformer”, Shawn Tan, Yikang Shen, Zhenfang Chen, Aaron Courville, Chuang Gan

link-bibliography
https://arxiv.org/abs/2310.06694: “Sheared LLaMA: Accelerating Language Model Pre-Training via Structured Pruning”, Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen

link-bibliography
https://arxiv.org/abs/2310.02207: “Language Models Represent Space and Time”, Wes Gurnee, Max Tegmark

link-bibliography
https://arxiv.org/abs/2308.13418#facebook: “Nougat: Neural Optical Understanding for Academic Documents”, Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic

link-bibliography
https://arxiv.org/abs/2308.11596#facebook: “SeamlessM4T: Massively Multilingual & Multimodal Machine Translation”, Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, Christopher Klaiber, Pengwei Li, Daniel Licht, Jean Maillard, Alice Rakotoarison, Kaushik Ram Sadagopan, Guillaume Wenzek, Ethan Ye, Bapi Akula, Peng-Jen Chen, Naji El Hachem, Brian Ellis, Gabriel Mejia Gonzalez, Justin Haaheim, Prangthip Hansanti, Russ Howes, Bernie Huang, Min-Jae Hwang, Hirofumi Inaguma, Somya Jain, Elahe Kalbassi, Amanda Kallet, Ilia Kulikov, Janice Lam, Daniel Li, Xutai Ma, Ruslan Mavlyutov, Benjamin Peloquin, Mohamed Ramadan, Abinesh Ramakrishnan, Anna Sun, Kevin Tran, Tuan Tran, Igor Tufanov, Vish Vogeti, Carleigh Wood, Yilin Yang, Bokai Yu, Pierre Andrews, Can Balioglu, Marta R. Costa-jussà, Onur Celebi, Maha Elbayad, Cynthia Gao, Francisco Guzmán, Justine Kao, Ann Lee, Alexandre Mourachko, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Paden Tomasello, Changhan Wang, Jeff Wang, Skyler Wang

link-bibliography
https://arxiv.org/abs/2306.09222#google: “RGD: Stochastic Re-Weighted Gradient Descent via Distributionally Robust Optimization”, Ramnath Kumar, Kushal Majmundar, Dheeraj Nagaraj, Arun Sai Suggala

link-bibliography
https://arxiv.org/abs/2306.05426: “SequenceMatch: Imitation Learning for Autoregressive Sequence Modeling With Backtracking”, Chris Cundy, Stefano Ermon

link-bibliography
https://arxiv.org/abs/2305.11863: “Scaling Laws for Language Encoding Models in FMRI”, Richard Antonello, Aditya Vaidya, Alexander G. Huth

link-bibliography
2023-jia.pdf: “When and How Artificial Intelligence Augments Employee Creativity”, Nan Jia, Xueming Luo, Zheng Fang, Chengcheng Liao

link-bibliography
https://arxiv.org/abs/2302.12441: “MUX-PLMs: Pre-Training Language Models With Data Multiplexing”, Vishvak Murahari, Ameet Deshpande, Carlos E. Jimenez, Izhak Shafran, Mingqiu Wang, Yuan Cao, Karthik Narasimhan

link-bibliography
https://arxiv.org/abs/2302.05442#google: “Scaling Vision Transformers to 22 Billion Parameters”, Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin F. Elsayed, Aravindh Mahendran, Fisher Yu, Avital Oliver, Fantine Huot, Jasmijn Bastings, Mark Patrick Collier, Alexey Gritsenko, Vighnesh Birodkar, Cristina Vasconcelos, Yi Tay, Thomas Mensink, Alexander Kolesnikov, Filip Pavetić, Dustin Tran, Thomas Kipf, Mario Lučić, Xiaohua Zhai, Daniel Keysers, Jeremiah Harmsen, Neil Houlsby

link-bibliography
https://arxiv.org/abs/2302.04907#google: “BMT: Binarized Neural Machine Translation”, Yichi Zhang, Ankush Garg, Yuan Cao, Łukasz Lew, Behrooz Ghorbani, Zhiru Zhang, Orhan Firat

link-bibliography
https://arxiv.org/abs/2301.05217: “Progress Measures for Grokking via Mechanistic Interpretability”, Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt

link-bibliography
https://arxiv.org/abs/2301.03728#facebook: “Scaling Laws for Generative Mixed-Modal Language Models”, Armen Aghajanyan, Lili Yu, Alexis Conneau, Wei-Ning Hsu, Karen Hambardzumyan, Susan Zhang, Stephen Roller, Naman Goyal, Omer Levy, Luke Zettlemoyer

link-bibliography
https://arxiv.org/abs/2301.03992#nvidia: “Vision Transformers Are Good Mask Auto-Labelers”, Shiyi Lan, Xitong Yang, Zhiding Yu, Zuxuan Wu, Jose M. Alvarez, Anima Anandkumar

link-bibliography
https://arxiv.org/abs/2212.14034: “Cramming: Training a Language Model on a Single GPU in One Day”, Jonas Geiping, Tom Goldstein

link-bibliography
https://arxiv.org/abs/2212.09410: “Less Is More: Parameter-Free Text Classification With Gzip”, Zhiying Jiang, Matthew Y. R. Yang, Mikhail Tsirlin, Raphael Tang, Jimmy Lin

link-bibliography
https://arxiv.org/abs/2212.06727: “What Do Vision Transformers Learn? A Visual Exploration”, Amin Ghiasi, Hamid Kazemi, Eitan Borgnia, Steven Reich, Manli Shu, Micah Goldblum, Andrew Gordon Wilson, Tom Goldstein

link-bibliography
https://arxiv.org/abs/2212.05199#google: “MAGVIT: Masked Generative Video Transformer”, Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G. Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, Lu Jiang

link-bibliography
https://arxiv.org/abs/2212.05051: “VindLU: A Recipe for Effective Video-And-Language Pretraining”, Feng Cheng, Xizi Wang, Jie Lei, David Crandall, Mohit Bansal, Gedas Bertasius

link-bibliography
https://arxiv.org/abs/2212.03533#microsoft: “Text Embeddings by Weakly-Supervised Contrastive Pre-Training”, Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, Furu Wei

link-bibliography
https://arxiv.org/abs/2212.01349#facebook: “NPM: Nonparametric Masked Language Modeling”, Sewon Min, Weijia Shi, Mike Lewis, Xilun Chen, Wen-tau Yih, Hannaneh Hajishirzi, Luke Zettlemoyer

link-bibliography
https://arxiv.org/abs/2211.09808: “Uni-Perceiver V2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks”, Hao Li, Jinguo Zhu, Xiaohu Jiang, Xizhou Zhu, Hongsheng Li, Chun Yuan, Xiaohua Wang, Yu Qiao, Xiaogang Wang, Wenhai Wang, Jifeng Dai

link-bibliography
https://arxiv.org/abs/2211.06220: “OneFormer: One Transformer to Rule Universal Image Segmentation”, Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi

link-bibliography
https://arxiv.org/abs/2210.06313#google: “The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers”, Zonglin Li, Chong You, Srinadh Bhojanapalli, Daliang Li, Ankit Singh Rawat, Sashank J. Reddi, Ke Ye, Felix Chern, Felix Yu, Ruiqi Guo, Sanjiv Kumar

link-bibliography
https://arxiv.org/abs/2209.11737: “Semantic Scene Descriptions As an Objective of Human Vision”, Adrien Doerig, Tim C. Kietzmann, Emily Allen, Yihan Wu, Thomas Naselaris, Kendrick Kay, Ian Charest

link-bibliography
https://arxiv.org/abs/2209.11055: “SetFit: Efficient Few-Shot Learning Without Prompts”, Lewis Tunstall, Nils Reimers, Unso Eun Seo Jo, Luke Bates, Daniel Korat, Moshe Wasserblat, Oren Pereg

link-bibliography
https://arxiv.org/abs/2209.02535: “Analyzing Transformers in Embedding Space”, Guy Dar, Mor Geva, Ankit Gupta, Jonathan Berant

link-bibliography
https://arxiv.org/abs/2207.06300#ibm: “Re2G: Retrieve, Rerank, Generate”, Michael Glass, Gaetano Rossiello, Md Faisal Mahbub Chowdhury, Ankita Rajaram Naik, Pengshan Cai, Alfio Gliozzo

link-bibliography
https://arxiv.org/abs/2207.01848: “TabPFN: Meta-Learning a Real-Time Tabular AutoML Method For Small Data”, Noah Hollmann, Samuel Müller, Katharina Eggensperger, Frank Hutter

link-bibliography
https://arxiv.org/abs/2204.05927: “Do Loyal Users Enjoy Better Recommendations? Understanding Recommender Accuracy from a Time Perspective”, Yitong Ji, Aixin Sun, Jie Zhang, Chenliang Li

link-bibliography
https://arxiv.org/abs/2206.07137: “RHO-LOSS: Prioritized Training on Points That Are Learnable, Worth Learning, and Not Yet Learnt”, Sören Mindermann, Jan Brauner, Muhammed Razzak, Mrinank Sharma, Andreas Kirsch, Winnie Xu, Benedikt Höltgen, Aidan N. Gomez, Adrien Morisot, Sebastian Farquhar, Yarin Gal

link-bibliography
https://arxiv.org/abs/2206.07160#microsoft: “LAVENDER: Unifying Video-Language Understanding As Masked Language Modeling”, Linjie Li, Zhe Gan, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Ce Liu, Lijuan Wang

link-bibliography
https://www.biorxiv.org/content/10.1101/2022.06.08.495348.full: “Reconstructing the Cascade of Language Processing in the Brain Using the Internal Computations of a Transformer-Based Language Model”, Sreejan Kumar, Theodore R. Sumers, Takateru Yamakoshi, Ariel Goldstein, Uri Hasson, Kenneth A. Norman, Thomas L. Griffiths, Robert D. Hawkins, Samuel A. Nastase

link-bibliography
https://arxiv.org/abs/2206.01859#microsoft: “XTC: Extreme Compression for Pre-Trained Transformers Made Simple and Efficient”, Xiaoxia Wu, Zhewei Yao, Minjia Zhang, Conglong Li, Yuxiong He

link-bibliography
https://arxiv.org/abs/2206.01685: “Toward a Realistic Model of Speech Processing in the Brain With Self-Supervised Learning”, Juliette Millet, Charlotte Caucheteux, Pierre Orhan, Yves Boubenec, Alexandre Gramfort, Ewan Dunbar, Christophe Pallier, Jean-Remi King

link-bibliography
2022-rios.pdf: “Anime Character Recognition Using Intermediate Features Aggregation”, Edwin Arkel Rios, Min-Chun Hu, Bo-Cheng Lai

link-bibliography
https://arxiv.org/abs/2205.13320#google: “Towards Learning Universal Hyperparameter Optimizers With Transformers”, Yutian Chen, Xingyou Song, Chansoo Lee, Zi Wang, Qiuyi Zhang, David Dohan, Kazuya Kawakami, Greg Kochanski, Arnaud Doucet, Marc’aurelio Ranzato, Sagi Perel, Nando de Freitas

link-bibliography
https://arxiv.org/abs/2205.11491#facebook: “HTPS: HyperTree Proof Search for Neural Theorem Proving”, Guillaume Lample, Marie-Anne Lachaux, Thibaut Lavril, Xavier Martinet, Amaury Hayat, Gabriel Ebner, Aurélien Rodriguez, Timothée Lacroix

link-bibliography
https://arxiv.org/abs/2205.04596#google: “When Does Dough Become a Bagel? Analyzing the Remaining Mistakes on ImageNet”, Vijay Vasudevan, Benjamin Caine, Raphael Gontijo-Lopes, Sara Fridovich-Keil, Rebecca Roelofs

link-bibliography
https://arxiv.org/abs/2203.13224#facebook: “Language Models That Seek for Knowledge: Modular Search & Generation for Dialogue and Prompt Completion”, Kurt Shuster, Mojtaba Komeili, Leonard Adolphs, Stephen Roller, Arthur Szlam, Jason Weston

link-bibliography
https://arxiv.org/abs/2203.02094#microsoft: “LiteTransformerSearch: Training-Free Neural Architecture Search for Efficient Language Models”, Mojan Javaheripi, Gustavo H. de Rosa, Subhabrata Mukherjee, Shital Shah, Tomasz L. Religa, Caio C. T. Mendes, Sebastien Bubeck, Farinaz Koushanfar, Debadeepta Dey

link-bibliography
https://arxiv.org/abs/2202.03052#alibaba: “OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-To-Sequence Learning Framework”, Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, Hongxia Yang

link-bibliography
https://arxiv.org/abs/2112.10510: “PFNs: Transformers Can Do Bayesian Inference”, Samuel Müller, Noah Hollmann, Sebastian Pineda Arango, Josif Grabocka, Frank Hutter

link-bibliography
https://arxiv.org/abs/2111.13824: “FQ-ViT: Fully Quantized Vision Transformer without Retraining”, Yang Lin, Tianyu Zhang, Peiqin Sun, Zheng Li, Shuchang Zhou

link-bibliography
https://arxiv.org/abs/2111.12233#microsoft: “LEMON: Scaling Up Vision-Language Pre-Training for Image Captioning”, Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, Lijuan Wang

link-bibliography
https://arxiv.org/abs/2111.09162: “It’s About Time: Analog Clock Reading in the Wild”, Charig Yang, Weidi Xie, Andrew Zisserman

link-bibliography
https://arxiv.org/abs/2111.06091: “A Survey of Visual Transformers”, Yang Liu, Yao Zhang, Yixin Wang, Feng Hou, Jin Yuan, Jiang Tian, Yang Zhang, Zhongchao Shi, Jianping Fan, Zhiqiang He

link-bibliography
https://arxiv.org/abs/2109.12948: “Understanding and Overcoming the Challenges of Efficient Transformer Quantization”, Yelysei Bondarenko, Markus Nagel, Tijmen Blankevoort

link-bibliography
https://arxiv.org/abs/2109.10282#microsoft: “TrOCR: Transformer-Based Optical Character Recognition With Pre-Trained Models”, Minghao Li, Tengchao Lv, Jingye Chen, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei

link-bibliography
https://arxiv.org/abs/2109.06243#huawei: “KroneckerBERT: Learning Kronecker Decomposition for Pre-Trained Language Models via Knowledge Distillation”, Marzieh S. Tahaei, Ella Charlaix, Vahid Partovi Nia, Ali Ghodsi, Mehdi Rezagholizadeh

link-bibliography
https://arxiv.org/abs/2108.13002#microsoft: “A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP”, Yucheng Zhao, Guangting Wang, Chuanxin Tang, Chong Luo, Wenjun Zeng, Zheng-Jun Zha

link-bibliography
https://arxiv.org/abs/2107.07566#facebook: “Internet-Augmented Dialogue Generation”, Mojtaba Komeili, Kurt Shuster, Jason Weston

link-bibliography
https://arxiv.org/abs/2107.04589: “ViTGAN: Training GANs With Vision Transformers”, Kwonjoon Lee, Huiwen Chang, Lu Jiang, Han Zhang, Zhuowen Tu, Ce Liu

link-bibliography
https://arxiv.org/abs/2106.12672#google: “Charformer: Fast Character Transformers via Gradient-Based Subword Tokenization”, Yi Tay, Vinh Q. Tran, Sebastian Ruder, Jai Gupta, Hyung Won Chung, Dara Bahri, Zhen Qin, Simon Baumgartner, Cong Yu, Donald Metzler

link-bibliography
https://arxiv.org/abs/2106.10199: “BitFit: Simple Parameter-Efficient Fine-Tuning for Transformer-Based Masked Language-Models”, Elad Ben Zaken, Shauli Ravfogel, Yoav Goldberg

link-bibliography
https://arxiv.org/abs/2106.09488#amazon: “Scaling Laws for Acoustic Models”, Jasha Droppo, Oguz Elibol

link-bibliography
https://arxiv.org/abs/2106.04803#google: “CoAtNet: Marrying Convolution and Attention for All Data Sizes”, Zihang Dai, Hanxiao Liu, Quoc V. Le, Mingxing Tan

link-bibliography
https://arxiv.org/abs/2106.04533: “Chasing Sparsity in Vision Transformers: An End-To-End Exploration”, Tianlong Chen, Yu Cheng, Zhe Gan, Lu Yuan, Lei Zhang, Zhangyang Wang

link-bibliography
https://arxiv.org/abs/2105.15203: “SegFormer: Simple and Efficient Design for Semantic Segmentation With Transformers”, Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo

link-bibliography
https://arxiv.org/abs/2104.07567#facebook: “Retrieval Augmentation Reduces Hallucination in Conversation”, Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, Jason Weston

link-bibliography
https://chinai.substack.com/p/chinai-137-year-3-of-chinai: “ChinAI #137: Year 3 of ChinAI: Reflections on the Newsworthiness of Machine Translation”, Jeffrey Ding

link-bibliography
https://arxiv.org/abs/2103.10697#facebook: “ConViT: Improving Vision Transformers With Soft Convolutional Inductive Biases”, Stéphane d’Ascoli, Hugo Touvron, Matthew Leavitt, Ari Morcos, Giulio Biroli, Levent Sagun

link-bibliography
https://ai.facebook.com/blog/learning-from-videos-to-understand-the-world/: “Learning from Videos to Understand the World”, Geoffrey Zweig, Polina Kuznetsova, Michael Auli, Francois Fagan

link-bibliography
https://arxiv.org/abs/2102.07074: “TransGAN: Two Transformers Can Make One Strong GAN”, Yifan Jiang, Shiyu Chang, Zhangyang Wang

link-bibliography
https://arxiv.org/abs/2102.03334: “ViLT: Vision-And-Language Transformer Without Convolution or Region Supervision”, Wonjae Kim, Bokyung Son, Ildoo Kim

link-bibliography
https://arxiv.org/abs/2101.11986: “Tokens-To-Token ViT: Training Vision Transformers from Scratch on ImageNet”, Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Francis E. H. Tay, Jiashi Feng, Shuicheng Yan

link-bibliography
https://arxiv.org/abs/2101.11605#google: “Bottleneck Transformers for Visual Recognition”, Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon Shlens, Pieter Abbeel, Ashish Vaswani

link-bibliography
https://arxiv.org/abs/2101.08674: “DAF:re: A Challenging, Crowd-Sourced, Large-Scale, Long-Tailed Dataset For Anime Character Recognition”, Edwin Arkel Rios, Wen-Huang Cheng, Bo-Cheng Lai

link-bibliography
https://arxiv.org/abs/2101.04702#google: “XMC-GAN: Cross-Modal Contrastive Learning for Text-To-Image Generation”, Han Zhang, Jing Yu Koh, Jason Baldridge, Honglak Lee, Yinfei Yang

link-bibliography
https://arxiv.org/abs/2012.12877#facebook: “Training Data-Efficient Image Transformers & Distillation through Attention”, Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou

link-bibliography
https://arxiv.org/abs/2012.08508#deepmind: “Object-Based Attention for Spatio-Temporal Reasoning: Outperforming Neuro-Symbolic Models With Flexible Distributed Architectures”, David Ding, Felix Hill, Adam Santoro, Matt Botvinick

link-bibliography
https://arxiv.org/abs/2011.13729#tencent: “TStarBot-X: An Open-Sourced and Comprehensive Study for Efficient League Training in StarCraft II Full Game”, Lei Han, Jiechao Xiong, Peng Sun, Xinghai Sun, Meng Fang, Qingwei Guo, Qiaobo Chen, Tengfei Shi, Hongsheng Yu, Zhengyou Zhang

link-bibliography
https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/: “DeepSpeed: Extreme-Scale Model Training for Everyone”, DeepSpeed Team, Rangan Majumder, Junhua Wang

link-bibliography
https://arxiv.org/abs/2008.02217: “Hopfield Networks Is All You Need”, Hubert Ramsauer, Bernhard Schäfl, Johannes Lehner, Philipp Seidl, Michael Widrich, Lukas Gruber, Markus Holzleitner, Milena Pavlović, Geir Kjetil Sandve, Victor Greiff, David Kreil, Michael Kopp, Günter Klambauer, Johannes Brandstetter, Sepp Hochreiter

link-bibliography
https://arxiv.org/abs/2006.03654#microsoft: “DeBERTa: Decoding-Enhanced BERT With Disentangled Attention”, Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen

link-bibliography
https://arxiv.org/abs/2005.12872#facebook: “DETR: End-To-End Object Detection With Transformers”, Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko

link-bibliography
https://ai.meta.com/blog/state-of-the-art-open-source-chatbot/: “Blender: A State-Of-The-Art Open Source Chatbot”, Stephen Roller, Jason Weston, Emily Dinan

link-bibliography
https://arxiv.org/abs/2004.03844: “On the Effect of Dropping Layers of Pre-Trained Transformer Models”, Hassan Sajjad, Fahim Dalvi, Nadir Durrani, Preslav Nakov

link-bibliography
https://arxiv.org/abs/2004.03965: “Rapformer: Conditional Rap Lyrics Generation With Denoising Autoencoders”, Nikola I. Nikolov, Eric Malmi, Curtis G. Northcutt, Loreto Parisi

link-bibliography
https://arxiv.org/abs/2002.10957#microsoft: “MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers”, Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, Ming Zhou

link-bibliography
https://research.google/blog/towards-a-conversational-agent-that-can-chat-aboutanything/: “Towards a Conversational Agent That Can Chat About…Anything”, Daniel Adiwardana, Thang Luong

link-bibliography
https://openai.com/research/deep-double-descent: “Deep Double Descent: We Show That the Double Descent Phenomenon Occurs in CNNs, ResNets, and Transformers: Performance First Improves, Then Gets Worse, and Then Improves Again With Increasing Model Size, Data Size, or Training Time”, Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, Ilya Sutskever

link-bibliography
https://arxiv.org/abs/1911.02116#facebook: “Unsupervised Cross-Lingual Representation Learning at Scale”, Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov

link-bibliography
https://arxiv.org/abs/1909.10351: “TinyBERT: Distilling BERT for Natural Language Understanding”, Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu

link-bibliography
https://arxiv.org/abs/1909.05286#ibm: “Frustratingly Easy Natural Question Answering”, Lin Pan, Rishav Chakravarti, Anthony Ferritto, Michael Glass, Alfio Gliozzo, Salim Roukos, Radu Florian, Avirup Sil

link-bibliography
https://arxiv.org/abs/1908.04577#alibaba: “StructBERT: Incorporating Language Structures into Pre-Training for Deep Language Understanding”, Wei Wang, Bin Bi, Ming Yan, Chen Wu, Zuyi Bao, Jiangnan Xia, Liwei Peng, Luo Si

link-bibliography
https://arxiv.org/abs/1907.11692#facebook: “RoBERTa: A Robustly Optimized BERT Pretraining Approach”, Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov

link-bibliography
https://arxiv.org/abs/1905.03197: “UniLM: Unified Language Model Pre-Training for Natural Language Understanding and Generation”, Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, Hsiao-Wuen Hon

link-bibliography
https://arxiv.org/abs/1904.00962#google: “Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes”, Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, Cho-Jui Hsieh

link-bibliography
https://arxiv.org/abs/1901.08746: “BioBERT: a Pre-Trained Biomedical Language Representation Model for Biomedical Text Mining”, Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, Jaewoo Kang

link-bibliography
2018-huang.pdf: “Generating Structured Music through Self-Attention”, Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Andrew Dai, Matt Hoffman, Curtis Hawthorne, Douglas Eck

link-bibliography
https://github.com/huggingface/transformers: “Huggingface: transformers Repo”, Huggingface

link-bibliography