- See Also
- Gwern
- "Fully-Connected Neural Nets", Gwern 2021
- Links
- “State-Space Models Can Learn In-Context by Gradient Descent”, Sushma et al 2024
- “XT: Nested Tokenization for Larger Context in Large Images”, Gupta et al 2024
- “A Long-Context Language Model for the Generation of Bacteriophage Genomes”, Shao 2023
- “HGRN: Hierarchically Gated Recurrent Neural Network for Sequence Modeling”, Qin et al 2023
- “Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer”, Zhang et al 2023
- “LongNet: Scaling Transformers to 1,000,000,000 Tokens”, Ding et al 2023
- “Bytes Are All You Need: Transformers Operating Directly On File Bytes”, Horton et al 2023
- “Landmark Attention: Random-Access Infinite Context Length for Transformers”, Mohtashami & Jaggi 2023
- “MEGABYTE: Predicting Million-Byte Sequences With Multiscale Transformers”, Yu et al 2023
- “Parallel Context Windows Improve In-Context Learning of Large Language Models”, Ratner et al 2022
- “Structured Prompting: Scaling In-Context Learning to 1,000 Examples”, Hao et al 2022
- “Efficient Transformers With Dynamic Token Pooling”, Nawrot et al 2022
- “Accurate Image Restoration With Attention Retractable Transformer (ART)”, Zhang et al 2022
- “Co-Writing Screenplays and Theatre Scripts With Language Models (Dramatron): An Evaluation by Industry Professionals”, Mirowski et al 2022
- “DiNAT: Dilated Neighborhood Attention Transformer”, Hassani & Shi 2022
- “Mega: Moving Average Equipped Gated Attention”, Ma et al 2022
- “Investigating Efficiently Extending Transformers for Long Input Summarization”, Phang et al 2022
- “ChordMixer: A Scalable Neural Attention Model for Sequences With Different Lengths”, Khalitov et al 2022
- “Paramixer: Parameterizing Mixing Links in Sparse Factors Works Better Than Dot-Product Self-Attention”, Yu et al 2022
- “NAT: Neighborhood Attention Transformer”, Hassani et al 2022
- “ViS4mer: Long Movie Clip Classification With State-Space Video Models”, Islam & Bertasius 2022
- “MaxViT: Multi-Axis Vision Transformer”, Tu et al 2022
- “Hierarchical Perceiver”, Carreira et al 2022
- “Transformer Quality in Linear Time”, Hua et al 2022
- “LongT5: Efficient Text-To-Text Transformer for Long Sequences”, Guo et al 2021
- “Simple Local Attentions Remain Competitive for Long-Context Tasks”, Xiong et al 2021
- “Restormer: Efficient Transformer for High-Resolution Image Restoration”, Zamir et al 2021
- “Swin Transformer V2: Scaling Up Capacity and Resolution”, Liu et al 2021
- “Hourglass: Hierarchical Transformers Are More Efficient Language Models”, Nawrot et al 2021
- “Fastformer: Additive Attention Can Be All You Need”, Wu et al 2021
- “AdaMRA: Adaptive Multi-Resolution Attention With Linear Complexity”, Zhang et al 2021
- “Long-Short Transformer (Transformer-LS): Efficient Transformers for Language and Vision”, Zhu et al 2021
- “Global Filter Networks for Image Classification”, Rao et al 2021
- “HiT: Improved Transformer for High-Resolution GANs”, Zhao et al 2021
- “A Multi-Level Attention Model for Evidence-Based Fact Checking”, Kruengkrai et al 2021
- “Hi-Transformer: Hierarchical Interactive Transformer for Efficient and Effective Long Document Modeling”, Wu et al 2021
- “Aggregating Nested Transformers”, Zhang et al 2021
- “Pay Attention to MLPs”, Liu et al 2021
- “MViT: Multiscale Vision Transformers”, Fan et al 2021
- “Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows”, Liu et al 2021
- “Coordination Among Neural Modules Through a Shared Global Workspace”, Goyal et al 2021
- “Generative Adversarial Transformers”, Hudson & Zitnick 2021
- “LazyFormer: Self Attention With Lazy Update”, Ying et al 2021
- “CDLM: Cross-Document Language Modeling”, Caciularu et al 2021
- “Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition”, Zhang et al 2020
- “Summarize, Outline, and Elaborate: Long-Text Generation via Hierarchical Supervision from Extractive Summaries”, Sun et al 2020
- “Transformer-QL: A Step Towards Making Transformer Network Quadratically Large”, Hajra 2020
- “Adding Recurrence to Pretrained Transformers for Improved Efficiency and Context Size”, Yoshida et al 2020
- “Progressive Generation of Long Text”, Tan et al 2020
- “Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing”, Dai et al 2020
- “Conformer: Convolution-Augmented Transformer for Speech Recognition”, Gulati et al 2020
- “Multi-Scale Transformer Language Models”, Subramanian et al 2020
- “Beyond 512 Tokens: Siamese Multi-Depth Transformer-Based Hierarchical Encoder for Long-Form Document Matching”, Yang et al 2020
- “Lite Transformer With Long-Short Range Attention”, Wu et al 2020
- “ETC: Encoding Long and Structured Inputs in Transformers”, Ainslie et al 2020
- “Longformer: The Long-Document Transformer”, Beltagy et al 2020
- “BP-Transformer: Modeling Long-Range Context via Binary Partitioning”, Ye et al 2019
- “Blockwise Self-Attention for Long Document Understanding”, Qiu et al 2019
- “Hierarchical Transformers for Multi-Document Summarization”, Liu & Lapata 2019
- “Hierarchical Multiscale Recurrent Neural Networks”, Chung et al 2016
- “A Clockwork RNN”, Koutník et al 2014
- Miscellaneous
- Bibliography
See Also
Gwern
“Fully-Connected Neural Nets”, Gwern 2021
Links
“State-Space Models Can Learn In-Context by Gradient Descent”, Sushma et al 2024
“XT: Nested Tokenization for Larger Context in Large Images”, Gupta et al 2024
“A Long-Context Language Model for the Generation of Bacteriophage Genomes”, Shao 2023
“HGRN: Hierarchically Gated Recurrent Neural Network for Sequence Modeling”, Qin et al 2023
“Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer”, Zhang et al 2023
“LongNet: Scaling Transformers to 1,000,000,000 Tokens”, Ding et al 2023
“Bytes Are All You Need: Transformers Operating Directly On File Bytes”, Horton et al 2023
“Landmark Attention: Random-Access Infinite Context Length for Transformers”, Mohtashami & Jaggi 2023
“MEGABYTE: Predicting Million-Byte Sequences With Multiscale Transformers”, Yu et al 2023
“Parallel Context Windows Improve In-Context Learning of Large Language Models”, Ratner et al 2022
“Structured Prompting: Scaling In-Context Learning to 1,000 Examples”, Hao et al 2022
“Efficient Transformers With Dynamic Token Pooling”, Nawrot et al 2022
“Accurate Image Restoration With Attention Retractable Transformer (ART)”, Zhang et al 2022
“Co-Writing Screenplays and Theatre Scripts With Language Models (Dramatron): An Evaluation by Industry Professionals”, Mirowski et al 2022
“DiNAT: Dilated Neighborhood Attention Transformer”, Hassani & Shi 2022
“Mega: Moving Average Equipped Gated Attention”, Ma et al 2022
“Investigating Efficiently Extending Transformers for Long Input Summarization”, Phang et al 2022
“ChordMixer: A Scalable Neural Attention Model for Sequences With Different Lengths”, Khalitov et al 2022
“Paramixer: Parameterizing Mixing Links in Sparse Factors Works Better Than Dot-Product Self-Attention”, Yu et al 2022
“NAT: Neighborhood Attention Transformer”, Hassani et al 2022
“ViS4mer: Long Movie Clip Classification With State-Space Video Models”, Islam & Bertasius 2022
“MaxViT: Multi-Axis Vision Transformer”, Tu et al 2022
“Hierarchical Perceiver”, Carreira et al 2022
“Transformer Quality in Linear Time”, Hua et al 2022
“LongT5: Efficient Text-To-Text Transformer for Long Sequences”, Guo et al 2021
“Simple Local Attentions Remain Competitive for Long-Context Tasks”, Xiong et al 2021
“Restormer: Efficient Transformer for High-Resolution Image Restoration”, Zamir et al 2021
“Swin Transformer V2: Scaling Up Capacity and Resolution”, Liu et al 2021
“Hourglass: Hierarchical Transformers Are More Efficient Language Models”, Nawrot et al 2021
“Fastformer: Additive Attention Can Be All You Need”, Wu et al 2021
“AdaMRA: Adaptive Multi-Resolution Attention With Linear Complexity”, Zhang et al 2021
“Long-Short Transformer (Transformer-LS): Efficient Transformers for Language and Vision”, Zhu et al 2021
“Global Filter Networks for Image Classification”, Rao et al 2021
“HiT: Improved Transformer for High-Resolution GANs”, Zhao et al 2021
“A Multi-Level Attention Model for Evidence-Based Fact Checking”, Kruengkrai et al 2021
“Hi-Transformer: Hierarchical Interactive Transformer for Efficient and Effective Long Document Modeling”, Wu et al 2021
“Aggregating Nested Transformers”, Zhang et al 2021
“Pay Attention to MLPs”, Liu et al 2021
“MViT: Multiscale Vision Transformers”, Fan et al 2021
“Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows”, Liu et al 2021
“Coordination Among Neural Modules Through a Shared Global Workspace”, Goyal et al 2021
“Generative Adversarial Transformers”, Hudson & Zitnick 2021
“LazyFormer: Self Attention With Lazy Update”, Ying et al 2021
“CDLM: Cross-Document Language Modeling”, Caciularu et al 2021
“Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition”, Zhang et al 2020
“Summarize, Outline, and Elaborate: Long-Text Generation via Hierarchical Supervision from Extractive Summaries”, Sun et al 2020
“Transformer-QL: A Step Towards Making Transformer Network Quadratically Large”, Hajra 2020
“Adding Recurrence to Pretrained Transformers for Improved Efficiency and Context Size”, Yoshida et al 2020
“Progressive Generation of Long Text”, Tan et al 2020
“Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing”, Dai et al 2020
“Conformer: Convolution-Augmented Transformer for Speech Recognition”, Gulati et al 2020
“Multi-Scale Transformer Language Models”, Subramanian et al 2020
“Beyond 512 Tokens: Siamese Multi-Depth Transformer-Based Hierarchical Encoder for Long-Form Document Matching”, Yang et al 2020
“Lite Transformer With Long-Short Range Attention”, Wu et al 2020
“ETC: Encoding Long and Structured Inputs in Transformers”, Ainslie et al 2020
“Longformer: The Long-Document Transformer”, Beltagy et al 2020
“BP-Transformer: Modeling Long-Range Context via Binary Partitioning”, Ye et al 2019
“Blockwise Self-Attention for Long Document Understanding”, Qiu et al 2019
“Hierarchical Transformers for Multi-Document Summarization”, Liu & Lapata 2019
“Hierarchical Multiscale Recurrent Neural Networks”, Chung et al 2016
“A Clockwork RNN”, Koutník et al 2014
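The papers listed above converge on a few recurring tricks for taming the quadratic cost of self-attention: restricting attention to local windows or blocks (Longformer, "Blockwise Self-Attention", "Simple Local Attentions Remain Competitive") and pooling tokens into coarser units between layers (Funnel-Transformer, Hourglass, Swin). As a concrete illustration of the first and simplest of these, here is a minimal block-local attention sketch in PyTorch; it is not the implementation from any of the listed papers, and the function name, shapes, and window handling are assumptions chosen for brevity (real models add padding, window overlap or shifting, and global tokens).

```python
import torch
import torch.nn.functional as F

def block_local_attention(q, k, v, window: int):
    """Minimal block-local (windowed) self-attention sketch (illustrative only).

    q, k, v: tensors of shape (batch, seq_len, dim), with seq_len divisible
    by `window`. Each token attends only to tokens in its own window, so the
    cost scales as O(seq_len * window) rather than O(seq_len^2).
    """
    b, n, d = q.shape
    assert n % window == 0, "pad the sequence so seq_len is a multiple of `window`"
    # Reshape into (batch, num_windows, window, dim) so attention is computed per window.
    qw = q.view(b, n // window, window, d)
    kw = k.view(b, n // window, window, d)
    vw = v.view(b, n // window, window, d)
    # Scaled dot-product attention inside each window.
    scores = qw @ kw.transpose(-2, -1) / d**0.5   # (b, n/window, window, window)
    out = F.softmax(scores, dim=-1) @ vw          # (b, n/window, window, dim)
    return out.view(b, n, d)

# Toy usage: one sequence of 16 tokens, 8-dim, windows of 4 tokens.
x = torch.randn(1, 16, 8)
y = block_local_attention(x, x, x, window=4)
print(y.shape)  # torch.Size([1, 16, 8])
```

Chunking the n×n score matrix into n/w independent w×w blocks is what turns the O(n²) cost into O(n·w); much of the variation across the papers above (overlapping or shifted windows, dilation, hierarchy, global memory tokens) amounts to different ways of restoring the cross-window information flow that this bare sketch discards.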
Miscellaneous
Bibliography
- https://arxiv.org/abs/2307.02486#microsoft : "LongNet: Scaling Transformers to 1,000,000,000 Tokens", Ding et al 2023
- https://arxiv.org/abs/2306.00238#apple : "Bytes Are All You Need: Transformers Operating Directly On File Bytes", Horton et al 2023
- https://arxiv.org/abs/2305.16300 : "Landmark Attention: Random-Access Infinite Context Length for Transformers", Mohtashami & Jaggi 2023
- https://arxiv.org/abs/2209.14958#deepmind : "Co-Writing Screenplays and Theatre Scripts With Language Models (Dramatron): An Evaluation by Industry Professionals", Mirowski et al 2022
- https://arxiv.org/abs/2209.15001 : "DiNAT: Dilated Neighborhood Attention Transformer", Hassani & Shi 2022
- https://arxiv.org/abs/2209.10655 : "Mega: Moving Average Equipped Gated Attention", Ma et al 2022
- https://arxiv.org/abs/2206.05852 : "ChordMixer: A Scalable Neural Attention Model for Sequences With Different Lengths", Khalitov et al 2022
- https://arxiv.org/abs/2204.10670 : "Paramixer: Parameterizing Mixing Links in Sparse Factors Works Better Than Dot-Product Self-Attention", Yu et al 2022
- https://arxiv.org/abs/2204.07143 : "NAT: Neighborhood Attention Transformer", Hassani et al 2022
- https://arxiv.org/abs/2112.07916#google : "LongT5: Efficient Text-To-Text Transformer for Long Sequences", Guo et al 2021
- https://arxiv.org/abs/2110.13711#nvidia : "Hourglass: Hierarchical Transformers Are More Efficient Language Models", Nawrot et al 2021
- https://arxiv.org/abs/2107.02192#nvidia : "Long-Short Transformer (Transformer-LS): Efficient Transformers for Language and Vision", Zhu et al 2021
- https://arxiv.org/abs/2107.00645 : "Global Filter Networks for Image Classification", Rao et al 2021
- https://arxiv.org/abs/2106.07631#google : "HiT: Improved Transformer for High-Resolution GANs", Zhao et al 2021
- https://arxiv.org/abs/2105.08050#google : "Pay Attention to MLPs", Liu et al 2021
- https://arxiv.org/abs/2104.11227#facebook : "MViT: Multiscale Vision Transformers", Fan et al 2021
- https://arxiv.org/abs/2103.14030 : "Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows", Liu et al 2021
- https://arxiv.org/abs/2010.10504#google : "Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition", Zhang et al 2020
- https://arxiv.org/abs/2005.08100#google : "Conformer: Convolution-Augmented Transformer for Speech Recognition", Gulati et al 2020
- https://arxiv.org/abs/2004.05150 : "Longformer: The Long-Document Transformer", Beltagy et al 2020