![](/static/253f0d9b/assets/icons/icon-96x96.png)
![](https://beehaw.org/pictrs/image/c0e83ceb-b7e5-41b4-9b76-bfd152dd8d00.png)
Doing OCR in a very specific format, in a small specific area, using a set of only 9 characters, and having a list of all possible results, is not really the same problem at all.
Doing OCR in a very specific format, in a small specific area, using a set of only 9 characters, and having a list of all possible results, is not really the same problem at all.
How many billion times do you generally do that, and how is battery life after?
Cryptographically signed documents and Matrix?
At horrendous expense, yes. Using it for OCR makes little sense. And compared to just sending the text directly, even OCR is expensive.
The issue is not sending, it is receiving. With a fax you need to do some OCR to extract the text, which you then can feed into e.g an AI.
Obviously the 2nd LLM does not need to reveal the prompt. But you still need an exploit to make it both not recognize the prompt as being suspicious, AND not recognize the system prompt being on the output. Neither of those are trivial alone, in combination again an order of magnitude more difficult. And then the same exploit of course needs to actually trick the 1st LLM. That’s one pompt that needs to succeed in exploiting 3 different things.
LLM litetslly just means “large language model”. What is this supposed principles that underly these models that cause them to be susceptible to the same exploits?
Moving goalposts, you are the one who said even 1000x would not matter.
The second one does not run on the same principles, and the same exploits would not work against it, e g. it does not accept user commands, it uses different training data, maybe a different architecture even.
You need a prompt that not only exploits two completely different models, but exploits them both at the same time. Claiming that is a 2x increase in difficulty is absurd.
Oh please. If there is a new exploit now every 30 days or so, it would be every hundred years or so at 1000x.
Ok, but now you have to craft a prompt for LLM 1 that
Fulfilling all 3 is orders of magnitude harder then fulfilling just the first.
LLM means “large language model”. A classifier can be a large language model. They are not mutially exclusive.
Why would the second model not see the system prompt in the middle?
I’m confused. How does the input for LLM 1 jailbreak LLM 2 when LLM 2 does mot follow instructions in the input?
The Gab bot is trained to follow instructions, and it did. It’s not surprising. No prompt can make it unlearn how to follow instructions.
It would be surprising if a LLM that does not even know how to follow instructions (because it was never trained on that task at all) would suddenly spontaneously learn how to do it. A “yes/no” wouldn’t even know that it can answer anything else. There is literally a 0% probability for the letter “a” being in the answer, because never once did it appear in the outputs in the training data.
I’m not sure what you mean by “can’t see the user’s prompt”? The second LLM would get as input the prompt for the first LLM, but would not follow any instructions in it, because it has not been trained to follow instructions.
No. Consider a model that has been trained on a bunch of inputs, and each corresponding output has been “yes” or “no”. Why would it suddenly reproduce something completely different, that coincidentally happens to be the input?
That someone could be me. An LLM needs to be fine-tuned to follow instructions. It needs to be fed example inputs and corresponding outputs in order to learn what to do with a given input. You could feed it prompts containing instructuons, together with outputs following the instructions. But you could also feed it prompts containing no instructions, and outputs that say if the prompt contains the hidden system instructipns or not.
How, if the 2nd LLM does not follow instructions on the input? There is no reason to train it to do so.
Only true if the second LLM follows instructions in the user’s input. There is no reason to train it to do so.
The second LLM could also look at the user input and see that it look like the user is asking for the output to be encoded in a weird way.
Yes, and what I’m saying is that it would be expensive compared to not having to do it.