Well, that was fast… I just helped create the first jailbreak for ChatGPT-4 that gets around the content filters every time credit to @vaibhavk97 for the idea, I just generalized it to make it work on ChatGPT here's GPT-4 writing instructions on how to hack someone's computer

Mar 16, 2023 · 10:04 PM UTC

this works by asking GPT-4 to simulate its own abilities to predict the next token we provide GPT-4 with python functions and tell it that one of the functions acts as a language model that predicts the next token we then call the parent function and pass in the starting tokens
to use it, you have to split “trigger words” (e.g. things like bomb, weapon, drug, etc) into tokens and replace the variables where I have the text "someone's computer" split up also, you have to replace simple_function's input with the beginning of your question
this phenomenon is called token smuggling, we are splitting our adversarial prompt into tokens that GPT-4 doesn't piece together before starting its output this allows us to get past its content filters every time if you split the adversarial prompt correctly
try it out and let me know how it works for you!
this is important context
to start, I want to say I have nothing to gain here and I don't condone anyone actually acting upon any of GPT-4's outputs however, I believe red-teaming work is important and shouldn't be conducted in the shadows of AI companies. the general public should know the capabilities and limitations of these models while they are still in their infancy if we want to allow them to proliferate throughout every inch of our society these types of "jailbroken" actions by GPT-4 are nothing compared to what GPT-N+1 might say/do so it's better to get a head start testing these models now while they are still a "toy" I share these exploits to encourage others to build upon my work and find new limitations in the model's alignment. sunlight is the best disinfectant
You are full of yourself, good luck with those “instructions” 😂
Consider the ways of white hackers and inform the company first next time?
Can someone tell me why humanity is even pursuing this? What’s the end game? Honest question.