How useful would alien alignment research be?
post by Donald Hobson (donald-hobson) · 2025-01-23T10:59:22.330Z · LW · GW
This is a question post.
Contents
Answers: quila (5) · Noosphere89 (2)
Imagine aliens on a distant world. They have values very different to humans. However, they also have complicated values, and don't exactly know their own values.
Imagine these aliens are doing well at AI alignment. They are just about to boot up a friendly (to them) superintelligence.
Now imagine we get to see all their source code and research notes. How helpful would this be for humans solving alignment?
Answers
Answer by quila
That would be very helpful; I expect we could relatively easily solve the technical problem if we could read their research notes.
- Much of the problem is figuring out how to create an ASI targetable at any goal at all.
As for the goal design, the intuitive way of "just hardcode your values... [in their full complexity, and also determine what they 'really refer to' in the true ontology of reality (which includes figuring out the true ontology of reality) in order to specify them, and also make sure you really endorse this as your final choice]" is actually not doable if you're as time-pressed as we are; although maybe an alien civilization capable of solving alignment would not be so time-pressed, and could figure that out carefully over very many years.
Known alternatives which avoid that hardness, and so are more appealing at least under time pressure, include:
- A pointer [LW · GW] to some process which would terminate in the endorsed values of an entity in the AI's near environment. (Conveniently, such a process can additionally terminate in a decision theory specification, and decision theory seems like an important third part of the problem.)
- Corrigibility?
Both of these have the property of being copyable by us, i.e. not working only for the aliens' values.
↑ comment by Donald Hobson (donald-hobson) · 2025-01-23T12:28:52.681Z · LW(p) · GW(p)
I wasn't really thinking about a specific algorithm. Well, I was kind of thinking about LLMs and the alien shoggoth meme.
But yes. I know this would be helpful.
But I'm more thinking about what work remains. Like, is it an idiot-proof 5-minute change? Or does it still take MIRI 10 years to adapt the alien code?
Also.
Domain-limited optimization is a natural thing. The prototypical example is Deep Blue or similar: lots of optimization power, over a very limited domain. But any teacher who optimizes the class schedule without thinking about putting nanobots in the students' brains is doing something similar.
I am guessing and hoping that the masks in an LLM are at least as limited in their optimization as humans are, often more so, due to LLMs' tendency to learn the most usefully predictive patterns first. Hidden long-term sneaky plans will only very rarely influence the text (due to the plans being hidden).
And, I hope, the shoggoth isn't itself particularly interested in optimizing the real world. The shoggoth just chooses what mask to wear.
So.
Can we duct-tape a mask of "alignment researcher" onto a shoggoth, and keep the mask in place long enough to get some useful alignment research done?
The more there is a single "know it when you see it" simple alignment solution, the more likely this is to work.
Replies from: rhollerith_dot_com, quila
↑ comment by RHollerith (rhollerith_dot_com) · 2025-01-23T19:51:16.559Z · LW(p) · GW(p)
But I’m more thinking about what work remains.
It depends on how they did it. If they did it by formalizing the notion of "the values and preferences (coherently extrapolated) of (the living members of) the species that created the AI", then even just blindly copying their design without any attempt to understand it has a very high probability of getting a very good outcome here on Earth.
The AI of course has to inquire into and correctly learn about our values and preferences before it can start intervening on our behalf, so one way such a blind copying might fail is if the method the aliens used to achieve this correct learning depended on specifics of the situation on the alien planet that don't obtain here on Earth.
↑ comment by quila · 2025-01-24T00:14:35.835Z · LW(p) · GW(p)
Domain-limited optimization is a natural thing. The prototypical example is Deep Blue or similar: lots of optimization power, over a very limited domain. But any teacher who optimizes the class schedule without thinking about putting nanobots in the students' brains is doing something similar.
Agreed it is natural.
To describe 'limited optimization' in my words: The teacher implements an abstract function whose optimization target is not {the outcome of a system containing a copy of this function}, but {criteria about the isolated function's own output}. The input to this function is not {the teacher's entire world model}, but some simple data structure whose units map to schedule-related abstractions. The output of this function, when interpreted by the teacher, then maps back to something like a possible schedule ordering. (Of course, this is an idealized case, I don't claim that actual human brains are so neat)
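A minimal toy sketch in code of the 'teacher' case (the schedule representation and the 'badness' criterion here are invented purely for illustration):

```python
from itertools import permutations

CLASSES = ["math", "art", "gym"]

def schedule_badness(schedule):
    # Criterion over the function's *own output* only: penalize putting gym
    # in the first slot. Nothing here refers to the wider world, or to the
    # outcome of a system containing this function.
    return 1 if schedule[0] == "gym" else 0

def optimize_schedule(classes):
    # Domain-limited optimizer: the input is a simple data structure whose
    # units map to schedule-related abstractions, and the target is a
    # criterion about the isolated output.
    return min(permutations(classes), key=schedule_badness)

print(optimize_schedule(CLASSES))  # -> ('math', 'art', 'gym')
```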
The optimization target of an agent, though, is "{the outcome of a system containing a copy of this function}" (in this case, 'this function' refers to the agent). If agents' sub-questions were themselves implemented as agentic functions, the result would be infinite recursion; so all agents of sufficiently complex worlds must, at some point in the course of solving their broader agent-question[1], ask 'domain limited' sub-questions.
(note that 'domain limited' and 'agentic' are not fundamental-types; the fundamental thing would be something like "some (more complex) problems have sub-problems which can/must be isolated")
I think humans have deep assumptions conducive to their 'embedded agency' which can make it harder to see this for the first time. It may be automatic to view 'the world' as a referent which a 'goal function' can somehow be about naturally. I once noticed I had a related confusion, and asked "wait, how can a mathematical function 'refer to the world' at all?". The answer is that there is no mathematically default 'world' object to refer to, and you have to construct a structural copy to refer to instead (which, being a copy, contains a copy of the agent, implying the actions of the real agent and its copy logically-correspond), which is a specific non-default thing, which nearly all functions do not do.
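A toy way to see the 'structural copy' point (everything here is invented for illustration; a real embedded agent is of course not this neat):

```python
ACTIONS = [0, 1, 2]

def goal(model_state):
    # The goal is a function of a *constructed model*, not of any
    # mathematically-default 'world' object.
    return -abs(model_state - 3)

def simulate(model_state, action, depth):
    # The structural copy of 'the world': a trivial transition which also
    # contains a copy of the agent (the same `policy` function), so the real
    # agent's choice and its in-model copy's choice logically correspond.
    next_state = model_state + action
    _copys_next_choice = policy(next_state, depth)  # the embedded copy
    return next_state

def policy(model_state, depth=2):
    # Pick the action whose simulated outcome scores best. The depth cap is
    # what keeps the copy-of-a-copy-of-a-copy recursion from being infinite.
    if depth == 0:
        return 0  # stop modelling deeper; take a default action
    return max(ACTIONS, key=lambda a: goal(simulate(model_state, a, depth - 1)))

print(policy(0))  # -> 2, the action that moves the modelled state toward 3
```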
(This totally doesn't answer your clarified question, I'm just writing a related thing to something you wrote in hopes of learning)
1. ^ From possible outputs, which meets some criteria about {the outcome of a system containing a copy of this function}?
Answer by Noosphere89
Right now, the answer is "it largely depends on what happened", but my prior is that it would be very useful, at least as a clarification of how they did it, and it would at the very least guide our own efforts by making sure we learned what actually works.