‘adversarial examples (AI)’ tag

Annotations sorted by machine learning into inferred 'tags'. This provides an alternative way to browse: instead of by date order, one can browse in topic order. The 'sorted' list has been automatically clustered into multiple sections & auto-labeled for easier browsing.

Beginning with the newest annotation, it uses the embedding of each annotation to attempt to create a list of nearest-neighbor annotations, creating a progression of topics. For more details, see the link.

Wikipedia

Miscellaneous

Bibliography

https://arxiv.org/abs/2410.08993: “The Structure of the Token Space for Large Language Models”, Michael Robinson, Sourya Dey, Shauna Sweet

link-bibliography
https://arxiv.org/abs/2408.05446: “Ensemble Everything Everywhere: Multi-Scale Aggregation for Adversarial Robustness”, Stanislav Fort, Balaji Lakshminarayanan

link-bibliography
https://arxiv.org/abs/2407.11969: “Does Refusal Training in LLMs Generalize to the Past Tense?”, Maksym Andriushchenko, Nicolas Flammarion

link-bibliography
https://arxiv.org/abs/2406.11233: “Probing the Decision Boundaries of In-Context Learning in Large Language Models”, Siyan Zhao, Tung Nguyen, Aditya Grover

link-bibliography
https://arxiv.org/abs/2404.06664: “CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs’ (Lack Of) Multicultural Knowledge”, Yu Ying Chiu, Liwei Jiang, Maria Antoniak, Chan Young Park, Shuyue Stella Li, Mehar Bhatia, Sahithya Ravi, Yulia Tsvetkov, Vered Shwartz, Yejin Choi

link-bibliography
https://arxiv.org/abs/2402.17747: “When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback”, Leon Lang, Davis Foote, Stuart Russell, Anca Dragan, Erik Jenner, Scott Emmons

link-bibliography
https://arxiv.org/abs/2402.15570: “Fast Adversarial Attacks on Language Models In One GPU Minute”, Vinu Sankar Sadasivan, Shoumik Saha, Gaurang Sriramanan, Priyatham Kattakinda, Atoosa Chegini, Soheil Feizi

link-bibliography
https://arxiv.org/abs/2402.11753: “ArtPrompt: ASCII Art-Based Jailbreak Attacks against Aligned LLMs”, Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, Radha Poovendran

link-bibliography
https://arxiv.org/abs/2401.05566#anthropic: “Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training”, Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, Yuntao Bai, Zachary Witten, Marina Favaro, Jan Brauner, Holden Karnofsky, Paul Christiano, Samuel R. Bowman, Logan Graham, Jared Kaplan, Sören Mindermann, Ryan Greenblatt, Buck Shlegeris, Nicholas Schiefer, Ethan Perez

link-bibliography
https://arxiv.org/abs/2310.08419: “PAIR: Jailbreaking Black Box Large Language Models in 20 Queries”, Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, Eric Wong

link-bibliography
https://arxiv.org/abs/2310.02279#sony: “Consistency Trajectory Models (CTM): Learning Probability Flow ODE Trajectory of Diffusion”, Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, Stefano Ermon

link-bibliography
https://arxiv.org/abs/2309.11751: “How Robust Is Google’s Bard to Adversarial Image Attacks?”, Yinpeng Dong, Huanran Chen, Jiawei Chen, Zhengwei Fang, Xiao Yang, Yichi Zhang, Yu Tian, Hang Su, Jun Zhu

link-bibliography
https://arxiv.org/abs/2306.07567: “Large Language Models Sometimes Generate Purely Negatively-Reinforced Text”, Fabien Roger

link-bibliography
https://arxiv.org/abs/2305.16934: “On Evaluating Adversarial Robustness of Large Vision-Language Models”, Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai-Man Cheung, Min Lin

link-bibliography
https://arxiv.org/abs/2303.02242: “TrojText: Test-Time Invisible Textual Trojan Insertion”, Yepeng Liu, Bo Feng, Qian Lou

link-bibliography
https://arxiv.org/abs/2302.04222: “Glaze: Protecting Artists from Style Mimicry by Text-To-Image Models”, Shawn Shan, Jenna Cryan, Emily Wenger, Haitao Zheng, Rana Hanocka, Ben Y. Zhao

link-bibliography
https://arxiv.org/abs/2211.03769: “Are AlphaZero-Like Agents Robust to Adversarial Perturbations?”, Li-Cheng Lan, Huan Zhang, Ti-Rong Wu, Meng-Yu Tsai, I-Chen Wu, Cho-Jui Hsieh

link-bibliography
https://arxiv.org/abs/2211.00241: “Adversarial Policies Beat Superhuman Go AIs”, Tony T. Wang, Adam Gleave, Tom Tseng, Kellin Pelrine, Nora Belrose, Joseph Miller, Michael D. Dennis, Yawen Duan, Viktor Pogrebniak, Sergey Levine, Stuart Russell

link-bibliography
https://arxiv.org/abs/2208.08831#deepmind: “Discovering Bugs in Vision Models Using Off-The-Shelf Image Generation and Captioning”, Olivia Wiles, Isabela Albuquerque, Sven Gowal

link-bibliography
https://arxiv.org/abs/2205.07460: “Diffusion Models for Adversarial Purification”, Weili Nie, Brandon Guo, Yujia Huang, Chaowei Xiao, Arash Vahdat, Anima Anandkumar

link-bibliography
https://swabhs.com/assets/pdf/wanli.pdf#allen: “WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation”, Alisa Liu, Swabha Swayamdipta, Noah A. Smith, Yejin Choi

link-bibliography
https://arxiv.org/abs/2201.05320#allen: “CommonsenseQA 2.0: Exposing the Limits of AI through Gamification”, Alon Talmor, Ori Yoran, Ronan Le Bras, Chandra Bhagavatula, Yoav Goldberg, Yejin Choi, Jonathan Berant

link-bibliography
https://arxiv.org/abs/2110.13771#nvidia: “AugMax: Adversarial Composition of Random Augmentations for Robust Training”, Haotao Wang, Chaowei Xiao, Jean Kossaifi, Zhiding Yu, Anima Anandkumar, Zhangyang Wang

link-bibliography
https://arxiv.org/abs/2106.07411: “Partial Success in Closing the Gap between Human and Machine Vision”, Robert Geirhos, Kantharaju Narayanappa, Benjamin Mitzkus, Tizian Thieringer, Matthias Bethge, Felix A. Wichmann, Wiel, Brendel

link-bibliography
https://arxiv.org/abs/2105.12806: “A Universal Law of Robustness via Isoperimetry”, Sébastien Bubeck, Mark Sellke

link-bibliography
https://distill.pub/2021/multimodal-neurons/#openai: “Multimodal Neurons in Artificial Neural Networks [CLIP]”, Gabriel Goh, Nick Cammarata, Chelsea Voss, Shan Carter, Michael Petrov, Ludwig Schubert, Alec Radford, Chris Olah

link-bibliography
https://aclanthology.org/2021.naacl-main.235.pdf#facebook: “Bot-Adversarial Dialogue for Safe Conversational Agents”, Jing Xu, Da Ju, Margaret Li, Y-Lan Boureau, Jason Weston, Emily Dinan

link-bibliography
https://arxiv.org/abs/2006.14536#google: “Smooth Adversarial Training”, Cihang Xie, Mingxing Tan, Boqing Gong, Alan Yuille, Quoc V. Le

link-bibliography
https://arxiv.org/abs/2002.00937: “Radioactive Data: Tracing through Training”, Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, Hervé Jégou

link-bibliography
https://arxiv.org/abs/1911.09665: “Adversarial Examples Improve Image Recognition”, Cihang Xie, Mingxing Tan, Boqing Gong, Jiang Wang, Alan Yuille, Quoc V. Le

link-bibliography
https://arxiv.org/abs/1706.06083: “Towards Deep Learning Models Resistant to Adversarial Attacks”, Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, Adrian Vladu

link-bibliography