Alignment will be a lot easier once we can convert weights to what they represent and predict how a model with given weights will respond to any prompt. Ideally, we will be able to verify what an AI will do before it does it. We could also verify by having an AI describe a high-level overview of its plan without actually implementing anything, and then just monitor and see if it deviated. As long as we can maintain logs of all AI activities and monitor those logs, it may be a lot harder for an ASI to engage in malign behavior.
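Taken as charitably as I can, the plan-monitoring idea here amounts to something like the toy sketch below (the names, the log format, and the matching rule are all my own invention, the post specifies none of them), and even in this form the hard part, deciding whether a logged action really "counts" as part of a declared step, is left completely open:

```python
# Hypothetical sketch of "declare a plan, then check the logs for deviations".
# Everything here is made up for illustration; no real system is this clean.
from typing import List


def deviations(declared_plan: List[str], action_log: List[str]) -> List[str]:
    """Return logged actions that don't obviously map to any declared step.

    Matching is a naive case-insensitive substring check, which is exactly
    the weakness: judging whether an action genuinely belongs to a declared
    step is the unsolved part, not the bookkeeping.
    """
    flagged = []
    for action in action_log:
        if not any(
            step.lower() in action.lower() or action.lower() in step.lower()
            for step in declared_plan
        ):
            flagged.append(action)
    return flagged


if __name__ == "__main__":
    plan = ["summarize the dataset", "email the summary to the team"]
    log = [
        "summarize the dataset",
        "email the summary to the team",
        "open outbound connection to unknown host",  # what monitoring is supposed to catch
    ]
    print(deviations(plan, log))  # ['open outbound connection to unknown host']
```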
Unless I'm missing some crucial research, the quoted paragraph seems very flimsy. Is there any reason to think that we will ever be able to 'convert weights to what they represent' (whatever that means)? Is there any reason to think we will be able to do this as models get smarter and bigger? Most importantly, is there any reason to believe we can do it on a short timeline?
How would we verify what an AI will do before it does it? Or have it describe its plans? We could throw it into a simulated environment - unless it, being superintelligent, can tell it's in a simulated environment and behave accordingly, and so on.
This last paragraph is making it hard to take what you say seriously. These seem like very surface-level ideas, removed from the actual state of AI alignment research. Yes, alignment would be a lot easier if we had a golden goose that laid golden eggs.