‘GPT-4 nonfiction’ tag

These annotations have been sorted by machine learning into inferred 'tags'. This provides an alternative way to browse: instead of date order, one can browse in topic order. The 'sorted' list has been automatically clustered into multiple sections & auto-labeled for easier browsing.

Beginning with the newest annotation, the sort uses the embedding of each annotation to attempt to build a list of nearest-neighbor annotations, so that the list progresses from one topic to the next. For more details, see the link.
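As a rough illustration of this kind of ordering, here is a minimal sketch. It assumes a greedy chain (each step picks the unvisited annotation most similar to the current one) and cosine similarity over non-zero embedding vectors; neither detail is stated on this page. The function names, the tuple layout, and the toy 2-dimensional embeddings are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def sort_by_similarity(annotations):
    """Order annotations as a greedy nearest-neighbor chain.

    `annotations` is a list of (title, iso_date, embedding) tuples
    (a hypothetical layout). The chain starts at the newest annotation
    and repeatedly hops to the most similar unvisited one.
    """
    if not annotations:
        return []
    # ISO dates sort lexicographically, so the newest comes first.
    remaining = sorted(annotations, key=lambda a: a[1], reverse=True)
    current = remaining.pop(0)
    ordered = [current]
    while remaining:
        # Assumption: "nearest" means highest cosine similarity.
        nearest = max(remaining, key=lambda a: cosine_similarity(current[2], a[2]))
        remaining.remove(nearest)
        ordered.append(nearest)
        current = nearest
    return [title for title, _, _ in ordered]


# Toy data: two loose "topics" along the two axes.
example = [
    ("A", "2025-01-01", [1.0, 0.0]),
    ("B", "2024-06-01", [0.0, 1.0]),
    ("C", "2024-01-01", [0.9, 0.1]),
    ("D", "2023-01-01", [0.1, 0.9]),
]
print(sort_by_similarity(example))  # ['A', 'C', 'D', 'B']
```

In the toy data the chain starts at the newest item, A, visits its near neighbor C, and only then crosses to D and B, which is the topic-by-topic progression described above. Clustering the chain into labeled sections is a separate step not sketched here.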

Miscellaneous

Bibliography

2025-johri.pdf: “An Evaluation Framework for Clinical Use of Large Language Models in Patient Interaction Tasks”, Shreya Johri, Jaehwan Jeong, Benjamin A. Tran, Daniel I. Schlessinger, Shannon Wongvibulsin, Leandra A. Barnes, Hong-Yu Zhou, Zhuo Ran Cai, Eliezer M. Van Allen, David Kim, Roxana Daneshjou, Pranav Rajpurkar

https://arxiv.org/abs/2411.13543: “BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games”, Davide Paglieri, Bartłomiej Cupiał, Samuel Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo Pignatelli, Łukasz Kuciński, Lerrel Pinto, Rob Fergus, Jakob Nicolaus Foerster, Jack Parker-Holder, Tim Rocktäschel

https://arxiv.org/abs/2410.13893: “Can LLMs Be Scammed? A Baseline Measurement Study”, Udari Madhushani Sehwag, Kelly Patel, Francesca Mosca, Vineeth Ravi, Jessica Staddon

https://arxiv.org/abs/2410.07095#openai: “MLE-Bench: Evaluating Machine Learning Agents on Machine Learning Engineering”, Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, Aleksander Madry

https://time.com/7026050/chatgpt-quit-teaching-ai-essay/: “I Quit Teaching Because of ChatGPT”, Victoria Livingstone

https://dynomight.net/automated/: “Thoughts While Watching Myself Be Automated”, Dynomight

https://arxiv.org/abs/2407.11969: “Does Refusal Training in LLMs Generalize to the Past Tense?”, Maksym Andriushchenko, Nicolas Flammarion

https://arxiv.org/abs/2407.04694: “Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs”, Rudolf Laine, Bilal Chughtai, Jan Betley, Kaivalya Hariharan, Jeremy Scheurer, Mikita Balesni, Marius Hobbhahn, Alexander Meinke, Owain Evans

https://arxiv.org/abs/2406.18518#salesforce: “APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets”, Zuxin Liu, Thai Hoang, Jianguo Zhang, Ming Zhu, Tian Lan, Shirley Kokane, Juntao Tan, Weiran Yao, Zhiwei Liu, Yihao Feng, Rithesh Murthy, Liangwei Yang, Silvio Savarese, Juan Carlos Niebles, Huan Wang, Shelby Heinecke, Caiming Xiong

https://arxiv.org/abs/2406.11233: “Probing the Decision Boundaries of In-Context Learning in Large Language Models”, Siyan Zhao, Tung Nguyen, Aditya Grover

https://arxiv.org/abs/2405.18870#google: “LLMs Achieve Adult Human Performance on Higher-Order Theory of Mind Tasks”, Winnie Street, John Oliver Siy, Geoff Keeling, Adrien Baranes, Benjamin Barnett, Michael McKibben, Tatenda Kanyere, Alison Lentz, Blaise Aguera y Arcas, Robin I. M. Dunbar

https://arxiv.org/abs/2405.15143: “Intelligent Go-Explore (IGE): Standing on the Shoulders of Giant Foundation Models”, Cong Lu, Shengran Hu, Jeff Clune

https://arxiv.org/abs/2405.15306: “DeTikZify: Synthesizing Graphics Programs for Scientific Figures and Sketches With TikZ”, Jonas Belouadi, Simone Paolo Ponzetto, Steffen Eger

https://arxiv.org/abs/2405.15071: “Grokked Transformers Are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization”, Boshi Wang, Xiang Yue, Yu Su, Huan Sun

https://www.theverge.com/2024/5/13/24155652/chatgpt-voice-mode-gpt4o-upgrades: “ChatGPT Will Be Able to Talk to You like Scarlett Johansson in Her / Upgrades to ChatGPT’s Voice Mode Bring It Closer to the Vision of a Responsive AI Assistant—And Sam Altman Seems to Know It”, Kylie Robison

https://arxiv.org/abs/2405.00332#scale: “GSM1k: A Careful Examination of Large Language Model Performance on Grade School Arithmetic”, Hugh Zhang, Jeff Da, Dean Lee, Vaughn Robinson, Catherine Wu, Will Song, Tiffany Zhao, Pranav Raja, Dylan Slack, Qin Lyu, Sean Hendryx, Russell Kaplan, Michele Lunati, Summer Yue

https://arxiv.org/abs/2404.13076: “LLM Evaluators Recognize and Favor Their Own Generations”, Arjun Panickssery, Samuel R. Bowman, Shi Feng

https://arxiv.org/abs/2404.07544: “From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples”, Robert Vacareanu, Vlad-Andrei Negru, Vasile Suciu, Mihai Surdeanu

https://www.wired.com/story/ai-chatbots-foia-requests-election-workers/: “Election Workers Are Drowning in Records Requests. AI Chatbots Could Make It Worse: Experts Worry That Election Deniers Could Weaponize Chatbots to Overwhelm and Slow down Local Officials”, Vittoria Elliott

https://link.springer.com/article/10.1007/s10506-024-09396-9: “Re-Evaluating GPT-4’s Bar Exam Performance”, Eric Martínez

https://www.wsj.com/tech/ai/a-peter-thiel-backed-ai-startup-cognition-labs-seeks-2-billion-valuation-998fa39d: “A Peter Thiel-Backed AI Startup, Cognition Labs, Seeks $2 Billion Valuation: Funding round Could Increase Startup’s Valuation Nearly Sixfold in a Matter of Weeks, Reflecting AI Frenzy”, Berber Jin

https://arxiv.org/abs/2403.18624: “Vulnerability Detection With Code Language Models: How Far Are We?”, Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David Wagner, Baishakhi Ray, Yizheng Chen

https://arxiv.org/abs/2403.18802#deepmind: “Long-Form Factuality in Large Language Models”, Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Hu, Jie Huang, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, Quoc V. Le

https://www.bloomberg.com/news/articles/2024-03-12/cognition-ai-is-a-peter-thiel-backed-coding-assistant: “Gold-Medalist Coders Build an AI That Can Do Their Job for Them: A New Startup Called Cognition AI Can Turn a User’s Prompt into a Website or Video Game”, Ashlee Vance

https://arxiv.org/abs/2402.19450: “Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap”, Saurabh Srivastava, Annarose M. B, Anto P. V, Shashank Menon, Ajay Sukumar, Adwaith Samod T, Alan Philipose, Stevin Prince, Sooraj Thomas

https://arxiv.org/abs/2402.14903: “Tokenization Counts: the Impact of Tokenization on Arithmetic in Frontier LLMs”, Aaditya K. Singh, D. J. Strouse

https://arxiv.org/abs/2402.11753: “ArtPrompt: ASCII Art-Based Jailbreak Attacks against Aligned LLMs”, Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, Radha Poovendran

https://arxiv.org/abs/2402.11349: “Tasks That Language Models Don’t Learn”, Bruce W. Lee, JaeHyuk Lim

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10894685/: “GPT-4 Passes the Bar Exam”, Daniel Martin Katz, Michael James Bommarito, Shang Gao, Pablo Arredondo

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10936766/: “Large Language Models Are Able to Downplay Their Cognitive Abilities to Fit the Persona They Simulate”, Jiří Milička, Anna Marklová, Klára VanSlambrouck, Eva Pospíšilová, Jana Šimsová, Samuel Harvan, Ondřej Drobil

https://arxiv.org/abs/2312.08926: “PRER: Modeling Complex Mathematical Reasoning via Large Language Model Based MathAgent”, Haoran Liao, Qinyi Du, Shaohua Hu, Hao He, Yanyan Xu, Jidong Tian, Yaohui Jin

2023-casal.pdf: “Can Linguists Distinguish between ChatGPT and Human Writing?: A Study of Research Ethics and Academic Publishing”, J. Elliott Casal, Matt Kessler

https://arxiv.org/abs/2311.16452#microsoft: “Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine”, Harsha Nori, Yin Tat Lee, Sheng Zhang, Dean Carignan, Richard Edgar, Nicolo Fusi, Nicholas King, Jonathan Larson, Yuanzhi Li, Weishung Liu, Renqian Luo, Scott Mayer McKinney, Robert Osazuwa Ness, Hoifung Poon, Tao Qin, Naoto Usuyama, Chris White, Eric Horvitz

https://arxiv.org/abs/2311.09247: “Comparing Humans, GPT-4, and GPT-4-V On Abstraction and Reasoning Tasks”, Melanie Mitchell, Alessandro B. Palmarini, Arseny Moskvichev

https://arxiv.org/abs/2310.13014: “Large Language Model Prediction Capabilities: Evidence from a Real-World Forecasting Tournament”, Philipp Schoenegger, Peter S. Park

https://arxiv.org/abs/2310.08678: “Can GPT Models Be Financial Analysts? An Evaluation of ChatGPT and GPT-4 on Mock CFA Exams”, Ethan Callanan, Amarachi Mbakwe, Antony Papadimitriou, Yulong Pei, Mathieu Sibue, Xiaodan Zhu, Zhiqiang Ma, Xiaomo Liu, Sameena Shah

2023-phillips.pdf: “Can a Computer Outfake a Human [Personality]?”, Jane Phillips, Chet Robie

https://arxiv.org/abs/2310.04406: “Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models”, Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, Yu-Xiong Wang

https://arxiv.org/abs/2310.03214#google: “FreshLLMs: Refreshing Large Language Models With Search Engine Augmentation”, Tu Vu, Mohit Iyyer, Xuezhi Wang, Noah Constant, Jerry Wei, Jason Wei, Chris Tar, Yun-Hsuan Sung, Denny Zhou, Quoc Le, Thang Luong

https://arxiv.org/abs/2310.01377: “UltraFeedback: Boosting Language Models With High-Quality Feedback”, Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, Maosong Sun

https://arxiv.org/abs/2309.12269: “The Cambridge Law Corpus: A Corpus for Legal AI Research”, Andreas Östling, Holli Sargeant, Huiyuan Xie, Ludwig Bull, Alexander Terenin, Leif Jonsson, Måns Magnusson, Felix Steffek

https://arxiv.org/abs/2309.12288: “The Reversal Curse: LLMs Trained on "A Is B" Fail to Learn "B Is A"”, Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, Owain Evans

https://arxiv.org/abs/2309.04269: “From Sparse to Dense: GPT-4 Summarization With Chain of Density (CoD) Prompting”, Griffin Adams, Alexander Fabbri, Faisal Ladhak, Eric Lehman, Noémie Elhadad

https://arxiv.org/abs/2308.12287: “Devising and Detecting Phishing: Large Language Models vs. Smaller Human Models”, Fredrik Heiding, Bruce Schneier, Arun Vishwanath, Jeremy Bernstein, Peter S. Park

https://arxiv.org/abs/2308.07921: “Solving Challenging Math Word Problems Using GPT-4 Code Interpreter With Code-Based Self-Verification”, Aojun Zhou, Ke Wang, Zimu Lu, Weikang Shi, Sichun Luo, Zipeng Qin, Shaoqing Lu, Anya Jia, Linqi Song, Mingjie Zhan, Hongsheng Li

https://time.com/6301288/the-ai-jokes-that-give-me-nightmares/: “I’m a Screenwriter. These AI Jokes Give Me Nightmares”, Simon Rich

https://www.nytimes.com/2023/07/18/technology/openai-chatgpt-facial-recognition.html: “OpenAI Worries About What Its Chatbot Will Say About People’s Faces: An Advanced Version of ChatGPT Can Analyze Images and Is Already Helping the Blind. But Its Ability to Put a Name to a Face Is One Reason the Public Doesn’t Have Access to It”, Kashmir Hill

2024-banker.pdf: “Machine-Assisted Social Psychology Hypothesis Generation”, Sachin Banker, Promothesh Chatterjee, Himanshu Mishra, Arul Mishra

https://arxiv.org/abs/2307.06439#microsoft: “Distilling Large Language Models for Biomedical Knowledge Extraction: A Case Study on Adverse Drug Events”, Yu Gu, Sheng Zhang, Naoto Usuyama, Yonas Woldesenbet, Cliff Wong, Praneeth Sanapathi, Mu Wei, Naveen Valluri, Erika Strandberg, Tristan Naumann, Hoifung Poon

https://arxiv.org/abs/2307.05300#microsoft: “Unleashing the Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration”, Zhenhailong Wang, Shaoguang Mao, Wenshan Wu, Tao Ge, Furu Wei, Heng Ji

https://arxiv.org/abs/2308.01404: “Hoodwinked: Deception and Cooperation in a Text-Based Game for Language Models”, Aidan O’Gara

https://arxiv.org/abs/2306.15626: “LeanDojo: Theorem Proving With Retrieval-Augmented Language Models”, Kaiyu Yang, Aidan M. Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan Prenger, Anima Anandkumar

https://arxiv.org/abs/2306.12587: “ARIES: A Corpus of Scientific Paper Edits Made in Response to Peer Reviews”, Mike D’Arcy, Alexis Ross, Erin Bransom, Bailey Kuehl, Jonathan Bragg, Tom Hope, Doug Downey

https://arxiv.org/abs/2306.15448: “Understanding Social Reasoning in Language Models With Language Models”, Kanishk Gandhi, Jan-Philipp Fränken, Tobias Gerstenberg, Noah D. Goodman

https://arxiv.org/abs/2305.20050#openai: “Let’s Verify Step by Step”, Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe

https://arxiv.org/abs/2305.18354: “LLMs and the Abstraction and Reasoning Corpus: Successes, Failures, and the Importance of Object-Based Representations”, Yudong Xu, Wenhao Li, Pashootan Vaezipoor, Scott Sanner, Elias B. Khalil

https://arxiv.org/abs/2305.13534: “How Language Model Hallucinations Can Snowball”, Muru Zhang, Ofir Press, William Merrill, Alisa Liu, Noah Smith

https://arxiv.org/abs/2305.06972: “Large Language Models Can Be Used To Effectively Scale Spear Phishing Campaigns”, Julian Hazell

https://arxiv.org/abs/2304.11490: “Boosting Theory-Of-Mind Performance in Large Language Models via Prompting”, Shima Rahimi Moghaddam, Christopher J. Honey

https://www.medrxiv.org/content/10.1101/2023.03.24.23287731.full: “Performance of ChatGPT on Free-Response, Clinical Reasoning Exams”, Eric Strong, Alicia DiGiammarino, Yingjie Weng, Preetha Basaviah, Poonam Hosamani, Andre Kumar, Andrew Nevins, John Kugler, Jason Hom, Jonathan H. Chen

https://arxiv.org/abs/2304.02015#alibaba: “How Well Do Large Language Models Perform in Arithmetic Tasks?”, Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang

https://arxiv.org/pdf/2303.08774#page=12&org=openai: “GPT-4 Technical Report § Limitations: Calibration”, OpenAI

https://arxiv.org/abs/2302.14520: “Large Language Models Are State-Of-The-Art Evaluators of Translation Quality”, Tom Kocmi, Christian Federmann

https://arxiv.org/abs/2302.12173: “Not What You’ve Signed up For: Compromising Real-World LLM-Integrated Applications With Indirect Prompt Injection”, Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, Mario Fritz

https://techcrunch.com/2022/11/23/harvey-which-uses-ai-to-answer-legal-questions-lands-cash-from-openai/: “Harvey, Which Uses AI to Answer Legal Questions, Lands Cash from OpenAI”, Kyle Wiggers
