‘LM tokenization’ tag

Annotations sorted by machine learning into inferred 'tags'. This provides an alternative way to browse: instead of by date order, one can browse in topic order. The 'sorted' list has been automatically clustered into multiple sections & auto-labeled for easier browsing.

Beginning with the newest annotation, it uses the embedding of each annotation to attempt to create a list of nearest-neighbor annotations, creating a progression of topics. For more details, see the link.

Wikipedia

Miscellaneous

Bibliography

https://arxiv.org/abs/2410.08993: “The Structure of the Token Space for Large Language Models”, Michael Robinson, Sourya Dey, Shauna Sweet

link-bibliography
https://arxiv.org/abs/2409.16211#bytedance: “MaskBit: Embedding-Free Image Generation via Bit Tokens”, Mark Weber, Lijun Yu, Qihang Yu, Xueqing Deng, Xiaohui Shen, Daniel Cremers, Liang-Chieh Chen

link-bibliography
https://arxiv.org/abs/2406.20086: “Token Erasure As a Footprint of Implicit Vocabulary Items in LLMs”, Sheridan Feucht, David Atkinson, Byron Wallace, David Bau

link-bibliography
https://arxiv.org/abs/2406.18906: “Sonnet or Not, Bot? Poetry Evaluation for Large Models and Datasets”, Melanie Walsh, Anna Preus, Maria Antoniak

link-bibliography
https://arxiv.org/abs/2405.14838: “From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step”, Yuntian Deng, Yejin Choi, Stuart Shieber

link-bibliography
https://arxiv.org/abs/2405.07883: “Zero-Shot Tokenizer Transfer”, Benjamin Minixhofer, Edoardo Maria Ponti, Ivan Vulić

link-bibliography
https://arxiv.org/abs/2404.13292: “Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge”, Khuyagbaatar Batsuren, Ekaterina Vylomova, Verna Dankers, Tsetsuukhei Delgerbaatar, Omri Uzan, Yuval Pinter, Gábor Bella

link-bibliography
https://arxiv.org/abs/2403.17844: “Mechanistic Design and Scaling of Hybrid Architectures”, Michael Poli, Armin W. Thomas, Eric Nguyen, Pragaash Ponnusamy, Björn Deiseroth, Kristian Kersting, Taiji Suzuki, Brian Hie, Stefano Ermon, Christopher Ré, Ce Zhang, Stefano Massaroli

link-bibliography
https://arxiv.org/abs/2402.14903: “Tokenization Counts: the Impact of Tokenization on Arithmetic in Frontier LLMs”, Aaditya K. Singh, D. J. Strouse

link-bibliography
https://arxiv.org/abs/2402.11349: “Tasks That Language Models Don’t Learn”, Bruce W. Lee, JaeHyuk Lim

link-bibliography
https://arxiv.org/abs/2311.16465: “TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering”, Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, Furu Wei

link-bibliography
https://arxiv.org/abs/2310.02226: “Think Before You Speak: Training Language Models With Pause Tokens”, Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, Vaishnavh Nagarajan

link-bibliography
https://arxiv.org/abs/2307.03381: “Teaching Arithmetic to Small Transformers”, Nayoung Lee, Kartik Sreenivasan, Jason D. Lee, Kangwook Lee, Dimitris Papailiopoulos

link-bibliography
https://arxiv.org/abs/2306.00238#apple: “Bytes Are All You Need: Transformers Operating Directly On File Bytes”, Maxwell Horton, Sachin Mehta, Ali Farhadi, Mohammad Rastegari

link-bibliography
https://www.wired.com/story/what-is-artificial-general-intelligence-agi-explained/: “What’s AGI, and Why Are AI Experts Skeptical? ChatGPT and Other Bots Have Revived Conversations on Artificial General Intelligence. Scientists Say Algorithms Won’t Surpass You Any Time Soon”, Reece Rogers

link-bibliography
https://arxiv.org/abs/2304.02015#alibaba: “How Well Do Large Language Models Perform in Arithmetic Tasks?”, Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang

link-bibliography
https://arxiv.org/abs/2212.10562#google: “Character-Aware Models Improve Visual Text Rendering”, Rosanne Liu, Dan Garrette, Chitwan Saharia, William Chan, Adam Roberts, Sharan Narang, Irina Blok, R. J. Mical, Mohammad Norouzi, Noah Constant

link-bibliography
https://arxiv.org/abs/2212.01349#facebook: “NPM: Nonparametric Masked Language Modeling”, Sewon Min, Weijia Shi, Mike Lewis, Xilun Chen, Wen-tau Yih, Hannaneh Hajishirzi, Luke Zettlemoyer

link-bibliography
https://arxiv.org/abs/2210.13669: “Help Me Write a Poem: Instruction Tuning As a Vehicle for Collaborative Poetry Writing (CoPoet)”, Tuhin Chakrabarty, Vishakh Padmakumar, He He

link-bibliography
https://aclanthology.org/2022.cai-1.2.pdf: “Most Language Models Can Be Poets Too: An AI Writing Assistant and Constrained Text Generation Studio”, Allen Roush, Sanjay Basu, Akshay Moorthy, Dmitry Dubovoy

link-bibliography
https://arxiv.org/abs/2207.06991: “PIXEL: Language Modeling With Pixels”, Phillip Rust, Jonas F. Lotz, Emanuele Bugliarello, Elizabeth Salesky, Miryam de Lhoneux, Desmond Elliott

link-bibliography
https://arxiv.org/abs/2206.15474: “Forecasting Future World Events With Neural Networks”, Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks

link-bibliography
https://aclanthology.org/2022.acl-short.43.pdf: “FLOTA: An Embarrassingly Simple Method to Mitigate Und-Es-Ira-Ble Properties of Pretrained Language Model Tokenizers”, Valentin Hofmann, Hinrich Schütze, Janet Pierrehumbert

link-bibliography
https://arxiv.org/pdf/2204.06125#page=16&org=openai: “DALL·E 2: Hierarchical Text-Conditional Image Generation With CLIP Latents § 7. Limitations and Risks”, Aditya A. Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen

link-bibliography
https://arxiv.org/abs/2204.03067: “ByT5 Model for Massively Multilingual Grapheme-To-Phoneme Conversion”, Jian Zhu, Cong Zhang, David Jurgens

link-bibliography
https://arxiv.org/abs/2203.13131#facebook: “Make-A-Scene: Scene-Based Text-To-Image Generation With Human Priors”, Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, Yaniv Taigman

link-bibliography
https://arxiv.org/abs/2108.11193: “Models In a Spelling Bee: Language Models Implicitly Learn the Character Composition of Tokens”, Itay Itzhak, Omer Levy

link-bibliography
https://arxiv.org/abs/2107.14795#deepmind: “Perceiver IO: A General Architecture for Structured Inputs & Outputs”, Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira

link-bibliography
https://arxiv.org/abs/2106.12672#google: “Charformer: Fast Character Transformers via Gradient-Based Subword Tokenization”, Yi Tay, Vinh Q. Tran, Sebastian Ruder, Jai Gupta, Hyung Won Chung, Dara Bahri, Zhen Qin, Simon Baumgartner, Cong Yu, Donald Metzler

link-bibliography
https://arxiv.org/abs/2105.13626#google: “ByT5: Towards a Token-Free Future With Pre-Trained Byte-To-Byte Models”, Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel

link-bibliography
https://arxiv.org/abs/2103.03206#deepmind: “Perceiver: General Perception With Iterative Attention”, Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, Joao Carreira

link-bibliography
https://arxiv.org/abs/2012.15524#google: “Fast WordPiece Tokenization”, Xinying Song, Alex Salcianu, Yang Song, Dave Dopson, Denny Zhou

link-bibliography
https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf#page=5: “GPT-1: Improving Language Understanding by Generative Pre-Training § Model Specifications”, Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever

link-bibliography