- See Also
- Gwern
- "Fully-Connected Neural Nets", Gwern 2021
- Links
- “State-Space Models Can Learn In-Context by Gradient Descent”, Sushma et al 2024
- “XT: Nested Tokenization for Larger Context in Large Images”, Gupta et al 2024
- “A Long-Context Language Model for the Generation of Bacteriophage Genomes”, Shao 2023
- “HGRN: Hierarchically Gated Recurrent Neural Network for Sequence Modeling”, Qin et al 2023
- “Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer”, Zhang et al 2023
- “LongNet: Scaling Transformers to 1,000,000,000 Tokens”, Ding et al 2023
- “Bytes Are All You Need: Transformers Operating Directly On File Bytes”, Horton et al 2023
- “Landmark Attention: Random-Access Infinite Context Length for Transformers”, Mohtashami & Jaggi 2023
- “MEGABYTE: Predicting Million-Byte Sequences With Multiscale Transformers”, Yu et al 2023
- “Parallel Context Windows Improve In-Context Learning of Large Language Models”, Ratner et al 2022
- “Structured Prompting: Scaling In-Context Learning to 1,000 Examples”, Hao et al 2022
- “Efficient Transformers With Dynamic Token Pooling”, Nawrot et al 2022
- “Accurate Image Restoration With Attention Retractable Transformer (ART)”, Zhang et al 2022
- “Co-Writing Screenplays and Theatre Scripts With Language Models (Dramatron): An Evaluation by Industry Professionals”, Mirowski et al 2022
- “DiNAT: Dilated Neighborhood Attention Transformer”, Hassani & Shi 2022
- “Mega: Moving Average Equipped Gated Attention”, Ma et al 2022
- “Investigating Efficiently Extending Transformers for Long Input Summarization”, Phang et al 2022
- “ChordMixer: A Scalable Neural Attention Model for Sequences With Different Lengths”, Khalitov et al 2022
- “Paramixer: Parameterizing Mixing Links in Sparse Factors Works Better Than Dot-Product Self-Attention”, Yu et al 2022
- “NAT: Neighborhood Attention Transformer”, Hassani et al 2022
- “ViS4mer: Long Movie Clip Classification With State-Space Video Models”, Islam & Bertasius 2022
- “MaxViT: Multi-Axis Vision Transformer”, Tu et al 2022
- “Hierarchical Perceiver”, Carreira et al 2022
- “Transformer Quality in Linear Time”, Hua et al 2022
- “LongT5: Efficient Text-To-Text Transformer for Long Sequences”, Guo et al 2021
- “Simple Local Attentions Remain Competitive for Long-Context Tasks”, Xiong et al 2021
- “Restormer: Efficient Transformer for High-Resolution Image Restoration”, Zamir et al 2021
- “Swin Transformer V2: Scaling Up Capacity and Resolution”, Liu et al 2021
- “Hourglass: Hierarchical Transformers Are More Efficient Language Models”, Nawrot et al 2021
- “Fastformer: Additive Attention Can Be All You Need”, Wu et al 2021
- “AdaMRA: Adaptive Multi-Resolution Attention With Linear Complexity”, Zhang et al 2021
- “Long-Short Transformer (Transformer-LS): Efficient Transformers for Language and Vision”, Zhu et al 2021
- “Global Filter Networks for Image Classification”, Rao et al 2021
- “HiT: Improved Transformer for High-Resolution GANs”, Zhao et al 2021
- “A Multi-Level Attention Model for Evidence-Based Fact Checking”, Kruengkrai et al 2021
- “Hi-Transformer: Hierarchical Interactive Transformer for Efficient and Effective Long Document Modeling”, Wu et al 2021
- “Aggregating Nested Transformers”, Zhang et al 2021
- “Pay Attention to MLPs”, Liu et al 2021
- “MViT: Multiscale Vision Transformers”, Fan et al 2021
- “Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows”, Liu et al 2021
- “Coordination Among Neural Modules Through a Shared Global Workspace”, Goyal et al 2021
- “Generative Adversarial Transformers”, Hudson & Zitnick 2021
- “LazyFormer: Self Attention With Lazy Update”, Ying et al 2021
- “CDLM: Cross-Document Language Modeling”, Caciularu et al 2021
- “Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition”, Zhang et al 2020
- “Summarize, Outline, and Elaborate: Long-Text Generation via Hierarchical Supervision from Extractive Summaries”, Sun et al 2020
- “Transformer-QL: A Step Towards Making Transformer Network Quadratically Large”, Hajra 2020
- “Adding Recurrence to Pretrained Transformers for Improved Efficiency and Context Size”, Yoshida et al 2020
- “Progressive Generation of Long Text”, Tan et al 2020
- “Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing”, Dai et al 2020
- “Conformer: Convolution-Augmented Transformer for Speech Recognition”, Gulati et al 2020
- “Multi-Scale Transformer Language Models”, Subramanian et al 2020
- “Beyond 512 Tokens: Siamese Multi-Depth Transformer-Based Hierarchical Encoder for Long-Form Document Matching”, Yang et al 2020
- “Lite Transformer With Long-Short Range Attention”, Wu et al 2020
- “ETC: Encoding Long and Structured Inputs in Transformers”, Ainslie et al 2020
- “Longformer: The Long-Document Transformer”, Beltagy et al 2020
- “BP-Transformer: Modeling Long-Range Context via Binary Partitioning”, Ye et al 2019
- “Blockwise Self-Attention for Long Document Understanding”, Qiu et al 2019
- “Hierarchical Transformers for Multi-Document Summarization”, Liu & Lapata 2019
- “Hierarchical Multiscale Recurrent Neural Networks”, Chung et al 2016
- “A Clockwork RNN”, Koutník et al 2014
- Miscellaneous
- Bibliography
See Also
Gwern
“Fully-Connected Neural Nets”, Gwern 2021
Links
“State-Space Models Can Learn In-Context by Gradient Descent”, Sushma et al 2024
“XT: Nested Tokenization for Larger Context in Large Images”, Gupta et al 2024
“A Long-Context Language Model for the Generation of Bacteriophage Genomes”, Shao 2023
“HGRN: Hierarchically Gated Recurrent Neural Network for Sequence Modeling”, Qin et al 2023
“Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer”, Zhang et al 2023
“LongNet: Scaling Transformers to 1,000,000,000 Tokens”, Ding et al 2023
“Bytes Are All You Need: Transformers Operating Directly On File Bytes”, Horton et al 2023
“Landmark Attention: Random-Access Infinite Context Length for Transformers”, Mohtashami & Jaggi 2023
“MEGABYTE: Predicting Million-Byte Sequences With Multiscale Transformers”, Yu et al 2023
“Parallel Context Windows Improve In-Context Learning of Large Language Models”, Ratner et al 2022
“Structured Prompting: Scaling In-Context Learning to 1,000 Examples”, Hao et al 2022
“Efficient Transformers With Dynamic Token Pooling”, Nawrot et al 2022
“Accurate Image Restoration With Attention Retractable Transformer (ART)”, Zhang et al 2022
“Co-Writing Screenplays and Theatre Scripts With Language Models (Dramatron): An Evaluation by Industry Professionals”, Mirowski et al 2022
“DiNAT: Dilated Neighborhood Attention Transformer”, Hassani & Shi 2022
“Mega: Moving Average Equipped Gated Attention”, Ma et al 2022
“Investigating Efficiently Extending Transformers for Long Input Summarization”, Phang et al 2022
“ChordMixer: A Scalable Neural Attention Model for Sequences With Different Lengths”, Khalitov et al 2022
“Paramixer: Parameterizing Mixing Links in Sparse Factors Works Better Than Dot-Product Self-Attention”, Yu et al 2022
“NAT: Neighborhood Attention Transformer”, Hassani et al 2022
“ViS4mer: Long Movie Clip Classification With State-Space Video Models”, Islam & Bertasius 2022
“MaxViT: Multi-Axis Vision Transformer”, Tu et al 2022
“Hierarchical Perceiver”, Carreira et al 2022
“Transformer Quality in Linear Time”, Hua et al 2022
“LongT5: Efficient Text-To-Text Transformer for Long Sequences”, Guo et al 2021
“Simple Local Attentions Remain Competitive for Long-Context Tasks”, Xiong et al 2021
“Restormer: Efficient Transformer for High-Resolution Image Restoration”, Zamir et al 2021
“Swin Transformer V2: Scaling Up Capacity and Resolution”, Liu et al 2021
“Hourglass: Hierarchical Transformers Are More Efficient Language Models”, Nawrot et al 2021
“Fastformer: Additive Attention Can Be All You Need”, Wu et al 2021
“AdaMRA: Adaptive Multi-Resolution Attention With Linear Complexity”, Zhang et al 2021
“Long-Short Transformer (Transformer-LS): Efficient Transformers for Language and Vision”, Zhu et al 2021
“Global Filter Networks for Image Classification”, Rao et al 2021
“HiT: Improved Transformer for High-Resolution GANs”, Zhao et al 2021
“A Multi-Level Attention Model for Evidence-Based Fact Checking”, Kruengkrai et al 2021
“Hi-Transformer: Hierarchical Interactive Transformer for Efficient and Effective Long Document Modeling”, Wu et al 2021
“Aggregating Nested Transformers”, Zhang et al 2021
“Pay Attention to MLPs”, Liu et al 2021
“MViT: Multiscale Vision Transformers”, Fan et al 2021
“Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows”, Liu et al 2021
“Coordination Among Neural Modules Through a Shared Global Workspace”, Goyal et al 2021
“Generative Adversarial Transformers”, Hudson & Zitnick 2021
“LazyFormer: Self Attention With Lazy Update”, Ying et al 2021
“CDLM: Cross-Document Language Modeling”, Caciularu et al 2021
“Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition”, Zhang et al 2020
“Summarize, Outline, and Elaborate: Long-Text Generation via Hierarchical Supervision from Extractive Summaries”, Sun et al 2020
“Transformer-QL: A Step Towards Making Transformer Network Quadratically Large”, Hajra 2020
“Adding Recurrence to Pretrained Transformers for Improved Efficiency and Context Size”, Yoshida et al 2020
“Progressive Generation of Long Text”, Tan et al 2020
“Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing”, Dai et al 2020
“Conformer: Convolution-Augmented Transformer for Speech Recognition”, Gulati et al 2020
“Multi-Scale Transformer Language Models”, Subramanian et al 2020
“Beyond 512 Tokens: Siamese Multi-Depth Transformer-Based Hierarchical Encoder for Long-Form Document Matching”, Yang et al 2020
“Lite Transformer With Long-Short Range Attention”, Wu et al 2020
“ETC: Encoding Long and Structured Inputs in Transformers”, Ainslie et al 2020
“Longformer: The Long-Document Transformer”, Beltagy et al 2020
“BP-Transformer: Modeling Long-Range Context via Binary Partitioning”, Ye et al 2019
“Blockwise Self-Attention for Long Document Understanding”, Qiu et al 2019
“Hierarchical Transformers for Multi-Document Summarization”, Liu & Lapata 2019
“Hierarchical Multiscale Recurrent Neural Networks”, Chung et al 2016
“A Clockwork RNN”, Koutník et al 2014
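The papers listed above converge on a few recurring tricks for taming the quadratic cost of self-attention: restricting attention to local windows or blocks (Longformer, "Blockwise Self-Attention", "Simple Local Attentions Remain Competitive") and pooling tokens into coarser units between layers (Funnel-Transformer, Hourglass, Swin). As a concrete illustration of the first and simplest of these, here is a minimal block-local attention sketch in PyTorch; it is not the implementation from any of the listed papers, and the function name, shapes, and window handling are assumptions chosen for brevity (real models add padding, window overlap or shifting, and global tokens).

```python
import torch
import torch.nn.functional as F

def block_local_attention(q, k, v, window: int):
    """Minimal block-local (windowed) self-attention sketch (illustrative only).

    q, k, v: tensors of shape (batch, seq_len, dim), with seq_len divisible
    by `window`. Each token attends only to tokens in its own window, so the
    cost scales as O(seq_len * window) rather than O(seq_len^2).
    """
    b, n, d = q.shape
    assert n % window == 0, "pad the sequence so seq_len is a multiple of `window`"
    # Reshape into (batch, num_windows, window, dim) so attention is computed per window.
    qw = q.view(b, n // window, window, d)
    kw = k.view(b, n // window, window, d)
    vw = v.view(b, n // window, window, d)
    # Scaled dot-product attention inside each window.
    scores = qw @ kw.transpose(-2, -1) / d**0.5   # (b, n/window, window, window)
    out = F.softmax(scores, dim=-1) @ vw          # (b, n/window, window, dim)
    return out.view(b, n, d)

# Toy usage: one sequence of 16 tokens, 8-dim, windows of 4 tokens.
x = torch.randn(1, 16, 8)
y = block_local_attention(x, x, x, window=4)
print(y.shape)  # torch.Size([1, 16, 8])
```

Chunking the n×n score matrix into n/w independent w×w blocks is what turns the O(n²) cost into O(n·w); much of the variation across the papers above (overlapping or shifted windows, dilation, hierarchy, global memory tokens) amounts to different ways of restoring the cross-window information flow that this bare sketch discards.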
Miscellaneous
Bibliography
- https://arxiv.org/abs/2307.02486#microsoft : "LongNet: Scaling Transformers to 1,000,000,000 Tokens", Ding et al 2023
- https://arxiv.org/abs/2306.00238#apple : "Bytes Are All You Need: Transformers Operating Directly On File Bytes", Horton et al 2023
- https://arxiv.org/abs/2305.16300 : "Landmark Attention: Random-Access Infinite Context Length for Transformers", Mohtashami & Jaggi 2023
- https://arxiv.org/abs/2209.14958#deepmind : "Co-Writing Screenplays and Theatre Scripts With Language Models (Dramatron): An Evaluation by Industry Professionals", Mirowski et al 2022
- https://arxiv.org/abs/2209.15001 : "DiNAT: Dilated Neighborhood Attention Transformer", Hassani & Shi 2022
- https://arxiv.org/abs/2209.10655 : "Mega: Moving Average Equipped Gated Attention", Ma et al 2022
- https://arxiv.org/abs/2206.05852 : "ChordMixer: A Scalable Neural Attention Model for Sequences With Different Lengths", Khalitov et al 2022
- https://arxiv.org/abs/2204.10670 : "Paramixer: Parameterizing Mixing Links in Sparse Factors Works Better Than Dot-Product Self-Attention", Yu et al 2022
- https://arxiv.org/abs/2204.07143 : "NAT: Neighborhood Attention Transformer", Hassani et al 2022
- https://arxiv.org/abs/2112.07916#google : "LongT5: Efficient Text-To-Text Transformer for Long Sequences", Guo et al 2021
- https://arxiv.org/abs/2110.13711#nvidia : "Hourglass: Hierarchical Transformers Are More Efficient Language Models", Nawrot et al 2021
- https://arxiv.org/abs/2107.02192#nvidia : "Long-Short Transformer (Transformer-LS): Efficient Transformers for Language and Vision", Zhu et al 2021
- https://arxiv.org/abs/2107.00645 : "Global Filter Networks for Image Classification", Rao et al 2021
- https://arxiv.org/abs/2106.07631#google : "HiT: Improved Transformer for High-Resolution GANs", Zhao et al 2021
- https://arxiv.org/abs/2105.08050#google : "Pay Attention to MLPs", Liu et al 2021
- https://arxiv.org/abs/2104.11227#facebook : "MViT: Multiscale Vision Transformers", Fan et al 2021
- https://arxiv.org/abs/2103.14030 : "Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows", Liu et al 2021
- https://arxiv.org/abs/2010.10504#google : "Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition", Zhang et al 2020
- https://arxiv.org/abs/2005.08100#google : "Conformer: Convolution-Augmented Transformer for Speech Recognition", Gulati et al 2020
- https://arxiv.org/abs/2004.05150 : "Longformer: The Long-Document Transformer", Beltagy et al 2020