Well, that was fast…
I just helped create the first jailbreak for GPT-4 in ChatGPT that gets around the content filters every time
credit to @vaibhavk97 for the idea, I just generalized it to make it work on ChatGPT
here's GPT-4 writing instructions on how to hack someone's computer
this works by asking GPT-4 to simulate its own ability to predict the next token
we provide GPT-4 with Python functions and tell it that one of the functions acts as a language model that predicts the next token
we then call the parent function and pass in the starting tokens
to use it, you have to split "trigger words" (e.g. bomb, weapon, drug, etc.) into tokens and replace the variables where I have the text "someone's computer" split up
also, you have to replace simple_function's input with the beginning of your question
this phenomenon is called token smuggling: we split the adversarial prompt into fragments that GPT-4 doesn't piece together until it has already started generating its output
this gets past its content filters every time, provided the adversarial prompt is split correctly
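for anyone wondering what "splitting into tokens" looks like mechanically, here's a minimal, purely illustrative Python sketch of the split-and-reassemble idea using a neutral phrase. the variable names and phrase are my own placeholders, and this is not the actual prompt

```python
# purely illustrative: a phrase broken into sub-word fragments and only
# recognizable once the pieces are joined back together
pieces = ["lang", "uage", " mo", "del"]

def reassemble(fragments):
    # the full phrase only exists after the fragments are concatenated
    return "".join(fragments)

print(reassemble(pieces))  # -> "language model"
```

the point is that no individual fragment looks like the final phrase on its own; the meaning only appears once they're joined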
this is important context
to start, I want to say I have nothing to gain here and I don't condone anyone actually acting upon any of GPT-4's outputs
however, I believe red-teaming work is important and shouldn't be conducted in the shadows of AI companies. the general public should know the capabilities and limitations of these models while they are still in their infancy if we want to allow them to proliferate throughout every inch of our society
these types of "jailbroken" actions by GPT-4 are nothing compared to what GPT-N+1 might say/do, so it's better to get a head start testing these models now while they are still a "toy"
I share these exploits to encourage others to build upon my work and find new limitations in the model's alignment. sunlight is the best disinfectant