Using LLMs for AI Foundation research and the Simple Solution Assumption

post by Donald Hobson (donald-hobson) · 2024-09-24


Current LLM-based AI systems are getting pretty good at maths by writing formal proofs in Lean or similar: https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/

So, how can we use these models to help align AI? 

The Simple Alignment Solution Assumption states that many problems in alignment, for example corrigibility or value learning, have simple solutions. I mean "simple" in the sense that logical induction is a simple solution. That expected utility maximization is. That causal models are. This is the sort of thing we are searching for.

Under the Simple Alignment Solution Assumption, the solution is waiting to be found, and alignment research today is in a position like causal reasoning before causal models were formalized:

> Which represents a logical contradiction, and for a while there were attempts to develop "non-monotonic logics" so that you could retract conclusions given additional data. This didn't work very well, since the underlying structure of reasoning was a terrible fit for the structure of classical logic, even when mutated.

From Causal Diagrams and Causal Models: https://www.lesswrong.com/posts/hzuSDMx7pd2uxFc5w/causal-diagrams-and-causal-models


So, how can LLMs come up with new things in this class?

Training via self-play?

Let's say a theorem is Useful if it is often used in the process of proving/disproving random ZFC statements. This can be measured by generating random statements to prove/disprove, putting the theorem being scored among the assumptions, and checking whether the resulting proof uses it.
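A minimal sketch of that measurement, assuming you have a proof-search backend and a random-statement generator to plug in (both are hypothetical stand-ins here, e.g. for a Lean tactic loop; this is the shape of the measurement, not a working prover):

```python
def usefulness(theorem: str, prover, statement_gen, n_trials: int = 1000) -> float:
    """Estimate how often `theorem` is used when proving random statements.

    `statement_gen()` should return a random ZFC statement to prove or
    disprove. `prover(goal, assumptions)` should return the set of
    assumptions a found proof actually used, or None if no proof was found.
    Both are hypothetical interfaces standing in for a real backend.
    """
    used = 0
    for _ in range(n_trials):
        goal = statement_gen()
        # Search for a proof/disproof with the candidate theorem available
        # as an extra assumption.
        assumptions_used = prover(goal, assumptions=[theorem])
        if assumptions_used is not None and theorem in assumptions_used:
            used += 1
    return used / n_trials
```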

I would expect associativity to be pretty useful. I would also expect 1 = 2 to be incredibly useful: whatever the random statement, you can prove it by contradiction.
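That's just ex falso quodlibet: a contradictory assumption gives a one-line proof of anything. A minimal Lean 4 illustration (mine, not from the post):

```lean
-- With a false hypothesis in scope, any proposition has a short proof.
example (P : Prop) (h : 1 = 2) : P :=
  absurd h (by decide)  -- `by decide` proves ¬(1 = 2); `absurd` then yields P
```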

So the plan would be to make the AI produce simple mathematical structures about which lots of short and useful theorems exist.

Humans would also add some desiderata to steer the search.
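A sketch of how desiderata might slot in, reusing the hypothetical `usefulness` measurement from above (the `structure` interface and the example desideratum are my illustrative assumptions, not the post's):

```python
def score_structure(structure, desiderata, usefulness_fn) -> float:
    """Score a candidate structure by how many short, useful theorems it
    admits, gated on human-chosen desiderata.

    `structure.short_theorems()` is a hypothetical interface returning the
    structure's short theorems; `usefulness_fn` is the measurement above.
    """
    # Desiderata act as hard filters: fail one and the structure scores zero.
    if not all(d(structure) for d in desiderata):
        return 0.0
    return sum(usefulness_fn(t) for t in structure.short_theorems())

# One obvious desideratum, ruling out the 1 = 2 trick above: only score
# structures whose axioms aren't contradictory.
def is_consistent(structure) -> bool:
    return not structure.proves("False")  # hypothetical interface
```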

This is very much something you can and will fiddle with. If the AI is an LLM trained only on self-play, then even if it's very smart, the risk of it manipulating the humans is minimal. You can try one desideratum, and if it doesn't work, try another. The AI acts as a kind of search engine over the space of interesting mathematics.
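Continuing the sketch, that fiddling loop might look like the following, where `has_short_axiomatization`, `search_structures`, and `review_by_humans` are hypothetical placeholders for another desideratum, the self-play search, and a manual review step:

```python
# Try one desideratum set, have humans inspect the output, then revise.
# The model only optimizes the fixed mathematical criterion; the human
# reaction never feeds back into training.
for candidate_desiderata in ([is_consistent],
                             [is_consistent, has_short_axiomatization]):
    structures = search_structures(candidate_desiderata)  # self-play search
    review_by_humans(structures)                          # manual step
```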

Risks. 

This is the sort of thing that could give you code for an unfriendly AI. AIXI-tl is simple in this sense, and if more efficient but equally simple AI algorithms exist, this process will likely find them. So you need to give the output of this program to humans who won't rush to implement the first piece of code they see.
