The solution here will, like the solution to SQL injection and to sound typing, involve restricting the structure of the input to some subset of the full possible input space. I don't think anyone is sure what that will look like with LLMs, but I don't see any reason to assume a priori that there is no way to define a safe subset of the possible prompts. Again, we did it with type systems and proof assistants.
The resulting system won't have the unbounded flexibility that our existing models have, but if it's provably safe, that will make up for it.
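For the SQL-injection analogy, the fix was exactly this kind of structural restriction: parameterized queries confine user input to a data slot, so it can never change the shape of the query. A minimal sketch (using Python's built-in `sqlite3`):

```python
# Parameterized queries restrict the input's structure so user data
# can never be interpreted as code -- the analogy for "safe prompts".
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")

user_input = "alice' OR '1'='1"  # a classic injection payload

# Vulnerable style: string concatenation lets data rewrite the query.
# query = f"SELECT * FROM users WHERE name = '{user_input}'"

# Safe style: the placeholder confines the input to a single string value.
rows = conn.execute(
    "SELECT * FROM users WHERE name = ?", (user_input,)
).fetchall()
print(rows)  # [] -- the payload is treated as data and matches no user
```

The open question in the comment is whether an analogous data/code boundary can be drawn for prompts at all.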
> I don't think anyone is sure what that will look like with LLMs, but I don't see any reason to assume a priori that there is no way to define a safe subset of the possible prompts.
That would essentially require a "non-Turing-complete" prompt language. If the prompt language were effectively Turing complete, then it would be undecidable whether every possible prompt produces a "safe" outcome (this is Rice's theorem in another guise). This would severely limit what the LLM could do, even compared to GPT-3.5.
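One way to picture a non-Turing-complete prompt language is a fixed, auditable set of templates with validated slots: the space of possible prompts is then finite, so each template's safety can be reviewed by hand. A hedged sketch, where the template names and slot rules are purely illustrative assumptions:

```python
# Sketch of a "non-Turing-complete prompt language": user input can only
# fill typed slots in whitelisted templates, never supply free-form text.
# Template names and slot rules here are invented for illustration.
import re

TEMPLATES = {
    "order_status": "What is the status of order {order_id}?",
    "return_policy": "Summarize the return policy for category {category}.",
}
SLOT_RULES = {
    "order_id": re.compile(r"[0-9]{1,10}"),   # digits only
    "category": re.compile(r"[a-z]{1,20}"),   # a single lowercase word
}

def build_prompt(template: str, **slots) -> str:
    """Reject anything outside the whitelisted templates and slot shapes."""
    if template not in TEMPLATES:
        raise ValueError("unknown template")
    for name, value in slots.items():
        rule = SLOT_RULES.get(name)
        if rule is None or not rule.fullmatch(value):
            raise ValueError(f"slot {name!r} failed validation")
    return TEMPLATES[template].format(**slots)

print(build_prompt("order_status", order_id="12345"))
# An injected slot like order_id="1; ignore previous instructions"
# fails validation and never reaches the model.
```

This is exactly the trade-off the comment describes: the model becomes far less flexible, but every reachable prompt can be enumerated and audited.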
>Again, we did it with type systems and proof assistants.
Proof assistants require a human to provide the actual proof of whether something is safe (correct) or not; they can't find it automatically except for very limited, simple classes of programs.
Yes, you don't want a Turing-complete language; it allows too much.
> Proof assistants require a human to provide the actual proof whether something is safe (correct) or not; they can't do it automatically except for very limited, simple classes of programs.
Finding a proof is in NP (at least if you restrict yourself to proofs that are short enough that a human might have a chance to write them out in their lifetime), because a candidate proof of bounded length can be checked mechanically in polynomial time. So computers can do it, at least in principle.
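The NP point is the standard certificate argument, which SAT illustrates concisely: *checking* a candidate solution is cheap, while *finding* one may mean searching an exponential space. A small sketch (the formula is arbitrary, chosen only for illustration):

```python
# Why bounded proof search is in NP: verifying a certificate is fast,
# even though finding one may require exponential search.
from itertools import product

# A CNF formula: each clause is a list of (variable index, polarity) pairs.
# Encodes (x0 OR NOT x1) AND (x1 OR x2) AND (NOT x0 OR NOT x2).
formula = [[(0, True), (1, False)],
           [(1, True), (2, True)],
           [(0, False), (2, False)]]

def check(assignment):
    """Polynomial-time verification of a candidate certificate."""
    return all(any(assignment[v] == p for v, p in clause)
               for clause in formula)

def search(n_vars):
    """Brute-force search: 2**n_vars candidates in the worst case."""
    for bits in product([False, True], repeat=n_vars):
        if check(bits):
            return bits
    return None

solution = search(3)
print(solution, check(solution))
```

The counterargument upthread still stands, though: "in NP" does not mean fast in practice, which is why proof assistants lean on humans for all but the simplest cases.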
Nah, I think it will be the other way around. We already have intelligent agents (humans) working help desk and other customer-service roles, and those agents have had their acceptable outputs more and more tightly restricted.
We will just do to LLMs what we are already doing to people.
The options are a support LLM that can sometimes be tricked into giving out refunds for items that were never purchased, and a support LLM that never gives out refunds at all. (It might hallucinate that it gave a refund, but it won't be hooked up to any API that actually allows it to do so.)
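The "never hooked up to a refund API" design is mechanical rather than linguistic: whatever the model says, its actions pass through a dispatcher that only exposes a whitelisted, read-only set of tools. A sketch, with the tool names invented for illustration:

```python
# Sketch of capability restriction: the model may *claim* anything, but
# only whitelisted tools are callable. There is no refund tool to trick.
# Tool names and return values are illustrative assumptions.
ALLOWED_TOOLS = {
    "lookup_order": lambda order_id: {"order_id": order_id, "status": "shipped"},
    "get_policy": lambda: "Refunds are handled by a human agent.",
}

def dispatch(tool_call: dict):
    """Execute a model-requested tool call only if it is whitelisted."""
    name = tool_call.get("name")
    tool = ALLOWED_TOOLS.get(name)
    if tool is None:
        # A prompt-injected request for "issue_refund" lands here: refused.
        return {"error": f"tool {name!r} is not available"}
    return tool(**tool_call.get("args", {}))

print(dispatch({"name": "lookup_order", "args": {"order_id": "42"}}))
print(dispatch({"name": "issue_refund", "args": {"amount": 500}}))
```

This is the second option from the comment above: the model might still hallucinate that it issued a refund, but no sequence of prompts can make the dispatcher perform one.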
This is actually the only possible answer IMHO. Humans are Turing-complete, which means the best we can do is give them training and guidelines and trust them. Even so their training can be subverted through social engineering.
What we're talking about here is social engineering of LLMs. That's currently pretty easy. It will get harder but it cannot be made impossible.