post by [deleted] · · ? · GW · 0 comments


comment by Donald Hobson (donald-hobson) · 2023-02-20T13:18:40.874Z · LW(p) · GW(p)

This proposal is a minor variation on the HCH type ideas. The main differences seem to be

Only a single training example needed through use of hypotheticals.
The human can propose something else on the next step.
The human can't query copies of their simulation, at least not at first. There are probably all sorts of ways the human can bootstrap, since they are writing arbitrary maths expressions.

This leads to a selection of problems largely similar to the HCH problems.

We need good inner alignment. (And with this, we also need to understand hypotheticals).
We need high fidelity; we don't want a game of Chinese whispers.
The process might never converge; it could get stuck in an endless loop.
Especially if passing the message on to the next cindy is one button press, what happens when one of them stumbles on a really memetic idea? Would the question end up filled with the "repost or a ghost will haunt you" nonsense that plagues some websites? You are applying strong memetic selection pressure, and might get really funny jokes instead of AI ideas.

Having the whole alignment community for 6 months be part of the question answerer is more likely to work than one person for a few hours, but that amplifies other problems.

This method also has the problem of amplified failure probability. Suppose somewhere down the line, millions of iterations in, cindy goes outside for a walk, and gets hit by a truck. Virtual cindy doesn't return to continue the next layer of the recursion. What then? (Possibly some code just adds "attempt 2" at the top and tries again.)

Ok, so another million layers in, cindy drops a coffee cup on the keyboard, accidentally typing some rubbish. This gets interpreted by the AI as a mathematical command, and the AI goes on to maximize ???
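To put rough numbers on the amplified failure probability, here is a minimal sketch. It assumes each iteration fails independently with the same small probability; the one-in-a-million figure is a made-up placeholder, not an estimate from the post.

```python
import math

def survival_probability(p_fail: float, n_steps: int) -> float:
    """Chance that n independent steps all succeed, given per-step failure p_fail."""
    # log1p keeps precision when p_fail is tiny
    return math.exp(n_steps * math.log1p(-p_fail))

def steps_until_half(p_fail: float) -> float:
    """Number of steps after which the chance of zero failures drops to 50%."""
    return math.log(0.5) / math.log1p(-p_fail)

# Hypothetical rate: one mishap (truck, coffee cup, ...) per million iterations.
p = 1e-6
for n in (10**5, 10**6, 10**7):
    print(n, survival_probability(p, n))
```

Under these assumptions, the chance of a clean run falls off exponentially with depth, so any fixed per-step mishap rate eventually dominates unless something like the "attempt 2" retry is built in.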

Chaos theory. Someone else develops a paperclip maximizer many iterations in, and the paperclip maximizer realizes it's in a simulation, hacks into the answer channel and returns "make as many paperclips as possible" to the AI.

And then there is the standard mindcrime concern. Where are all these virtual cindies going once we are done with them? We can probably just tell the AI in english that our utility function is likely to dislike deleting virtual humans. So all the virtual humans get saved on disk, and then can live in the utopia. Hey, we need loads of people to fill up the dyson sphere anyway.

I am not confident that your "make it complicated and personal data" approach at the root really stops all the aliens doing weird acausal stuff. The multiverse is big. Somewhere out there there is a cindy producing any bitstream that looks like this personal data, and somewhere out there are aliens faking the whole scenario for every possible stream of similar data. You probably need the internal counterfactual design to be resistant to acausal tampering.

Replies from: NicholasKross, carado-1, TekhneMakre

↑ comment by Nicholas / Heather Kross (NicholasKross) · 2023-06-19T17:07:56.372Z · LW(p) · GW(p)

A difference from HCH (not the only one): As far as we can tell, an AI can't Goodhart the past (which is why the interval is in effect). I think something similar applies to the adding-random-noise part.

↑ comment by Tamsin Leake (carado-1) · 2023-03-04T17:00:29.627Z · LW(p) · GW(p)

Only a single training example needed through use of hypotheticals.

(to be clear, the question and answer serve less as "training data" meant to represent the user than as "IDs" or "coordinates" meant to locate the user in the past-lightcone.)

We need good inner alignment. (And with this, we also need to understand hypotheticals).

this is true, though i think we might not need a super complex framework for hypotheticals. i have some simple math ideas that i explore a bit here, and about which i might write a bunch more.

for failure modes like the user getting hit by a truck or spilling coffee, we can do things such as, at each step, asking not 1 cindy the question but 1000 cindys 1000 slight variations on the question, and then maybe have some kind of convolutional network to curate their answers (such as ignoring garbled or missing output) and pass them to the next step, without ever relying on a small number of cindys except at the very start of this process.
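as a rough illustration of the curation step, here is a sketch that swaps the convolutional network for a plain plurality vote. the `curate` function, the `min_agreement` threshold, and the "printable text" test for garbled output are all assumptions made up for this example.

```python
from collections import Counter
from typing import Optional

def curate(answers: list[Optional[str]], min_agreement: float = 0.5) -> Optional[str]:
    """Return the plurality answer among usable outputs, or None if agreement is too weak."""
    usable = []
    for a in answers:
        if a is None:
            continue  # missing output (e.g. the truck case)
        text = a.strip()
        # Crude stand-in for "garbled": empty or containing non-printable characters
        if text and text.isprintable():
            usable.append(text)
    if not usable:
        return None
    winner, votes = Counter(usable).most_common(1)[0]
    return winner if votes / len(usable) >= min_agreement else None

# Hypothetical batch: most copies agree, one is missing, one is garbage.
batch = ["plan A"] * 6 + ["plan B"] * 2 + [None, "\x00\x07garbage"]
print(curate(batch))
```

a real version would need to compare answers that differ in wording but not in content, which exact string matching does not do.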

it is true that weird memes could take over the graph of cindys; i don't have an answer to that, apart from that it seems sufficiently unlikely to me that i still think this plan has promise.

Chaos theory. Someone else develops a paperclip maximizer many iterations in, and the paperclip maximizer realizes it's in a simulation, hacks into the answer channel and returns "make as many paperclips as possible" to the AI.

hmm. that's possible. i guess i have to hope this never happens on the question-interval, on any simulation day. alternatively, maybe the mutually-checking graph of 1000 cindys can help with this? (but probly not; clippy can just hack the cindys.)

So all the virtual humans get saved on disk, and then can live in the utopia. Hey, we need loads of people to fill up the dyson sphere anyway.

yup. or, if the QACI user is me, i'm probly also just fine with those local deaths; not a big deal compared to an increased chance of saving the world. alternatively, instead of being saved on disk, they can also just be recomputed later since the whole process is deterministic.

I am not confident that your "make it complicated and personal data" approach at the root really stops all the aliens doing weird acausal stuff.

yup, i'm not confident either. i think there could be other schemes, possibly involving cryptography in some ways, to entangle the answer with a unique randomly generated signature key or something like that.
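one very loose reading of "entangle the answer with a randomly generated key" is a keyed tag over the answer bytes. the sketch below uses HMAC-SHA256 from the python standard library; the function names are hypothetical, and nothing here is claimed to address acausal tampering. it only shows what binding an answer to a fresh random key looks like mechanically.

```python
import hashlib
import hmac
import secrets

def make_key() -> bytes:
    """Generate a fresh 256-bit random key."""
    return secrets.token_bytes(32)

def tag_answer(key: bytes, answer: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag that binds the answer to the key."""
    return hmac.new(key, answer, hashlib.sha256).digest()

def is_authentic(key: bytes, answer: bytes, tag: bytes) -> bool:
    """Check the tag in constant time."""
    return hmac.compare_digest(tag_answer(key, answer), tag)

key = make_key()
answer = b"hypothetical answer blob"
tag = tag_answer(key, answer)
print(is_authentic(key, answer, tag))
```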

↑ comment by TekhneMakre · 2023-03-03T17:16:35.657Z · LW(p) · GW(p)

Strong upvote. Would like to see OP's response to this.

comment by Anomalous (ward-anomalous) · 2023-04-01T16:07:55.143Z · LW(p) · GW(p)

I think this is brilliant as a direction to think in, but I'm object-level skeptical. I could be missing important details.

Summary of what I think I understand

A superintelligent AI is built[1] to optimise for .
That function effectively tells the AI to figure out: "If you extrapolate from the assumption that uniquely-identifiable-$B$ was actually “what should the utility function be?” (ceteris paribus), what would uniquely-identifiable-$A$ have been?" And then take its own best guess about that as its objective function.
In practice, that may look something like this: The AI starts looking for referents of $A$ and $B$, and learns that the only knowable instances are in the past. Assuming it does counterfactual reasoning somewhat like humans, it will then try to reconstruct/simulate the situation in as much heuristically relevant detail as possible. Finally, it runs the experiment forwards from the earliest possible time the counterfactual assumption can be inserted (i.e. when Cindy ran the program that produced $B$).
1. (Depending on how it's built, the AI has already acquired a probability distribution over what its utility function could be, and this includes some expectancy over Cindy's values. Therefore, it plausibly tries to minimise side-effects.)
At some point in this process, it notices that the contents of $A$ are extremely sensitive to what went on in Cindy's brain at time T. So brainscanning her is obvious for accuracy and repeatability.
In the simulation, Cindy sees with joy that $B$ is “what should the utility function be?” so she gets to work. Not willing to risk delaying more than 24-ish hours (for reasons), she finally writes into $A$: $QACI(A, B, \text{“Hi! Cindy-2 here. These are my notes: [...]. You can do it!”})$.
As long as that is the AI's best guess, it is now motivated to repeat the experiment with the new message. This allows successive Cindys to work[2] on the problem until one of them declares success and writes a function plainly into $A$.
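The chain of successive Cindys described above can be sketched as a plain loop. This is a toy under stated assumptions: `simulate_user` stands in for whatever the AI guesses Cindy would write, the `PREFIX` marker for "ask again with this new question" is invented for the example, and in the actual proposal the AI only estimates this chain rather than literally running it.

```python
from typing import Callable

PREFIX = "QACI:"  # hypothetical marker meaning "ask again with this new question"

def run_qaci(first_question: str,
             simulate_user: Callable[[str], str],
             max_steps: int = 1000) -> str:
    """Follow the chain of answers until one is a final answer, not a new question."""
    question = first_question
    for _ in range(max_steps):
        answer = simulate_user(question)
        if not answer.startswith(PREFIX):
            return answer  # the user wrote a final answer plainly
        question = answer[len(PREFIX):]  # notes are passed on to the next iteration
    raise RuntimeError("no final answer within max_steps")

def toy_user(question: str) -> str:
    # Stand-in for a simulated Cindy: adds one note per round, stops after three.
    if question.count("[note]") >= 3:
        return "final answer after 3 rounds of notes"
    return PREFIX + question + " [note]"

print(run_qaci("what should the utility function be?", toy_user))
```

The `max_steps` cap is there because nothing in the scheme itself guarantees the chain terminates.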

Implementation details

The AI might be uncertain about how precise its simulations are, in which case it could want to run a series of these experiments with varying seeds before adopting whatever function the plurality converges to. The uncertainty compounds, however, so simulation-batches which output answers in fewer iterations (for whatever reason) will weigh more.
I'm not sure $QACI$ will be interpreted as transitive between simulations by default. I think it depends on preferences regarding degrees of freedom in the logic used to interpret $QACI(QACI())$, if both the inner and outer function depend on mutually exclusive counterfactuals over the same state of reality (or variable). Each step is locally coherent, but you could get stuck in a repeating loop.
We can't take for granted that arbitrary superintelligences will have human heuristics for what counts as "correct" counterfactual reasoning. It seems shaky to rely on it. (I notice you discuss this in the comments.)
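The seed-batch idea from the first point can be sketched as a weighted plurality vote. The weighting rule here (each iteration multiplies a run's weight by a fixed `per_step_confidence`) is an assumption chosen to make "uncertainty compounds" concrete. It is not something specified in the post.

```python
from collections import defaultdict

def weighted_plurality(runs: list[tuple[str, int]],
                       per_step_confidence: float = 0.99) -> str:
    """runs: (answer, iterations) pairs. Each run votes with weight confidence**iterations."""
    weights: dict[str, float] = defaultdict(float)
    for answer, iterations in runs:
        # Longer chains compound more simulation error, so they count for less
        weights[answer] += per_step_confidence ** iterations
    return max(weights, key=weights.get)

# Hypothetical batch: one short run says U1, three long runs say U2.
runs = [("U1", 10), ("U2", 200), ("U2", 200), ("U2", 200)]
print(weighted_plurality(runs))
print(weighted_plurality(runs, per_step_confidence=1.0))
```

With compounding uncertainty the single short run outweighs the three long ones. With no decay the simple majority wins instead, which shows how much the outcome depends on the assumed decay rate.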

Why I don't think it works

It does not seem to do anything to inner alignment afaict, and it seems too demanding and leaky to solve outer alignment.
I don't see how to feasibly translate QACI() into actual code that causes an AI to use it as a target for all its goal-seeking abilities.
Even if it were made into a loss function, you can't train a transformer on it without relevant training data.
If the plan is to first train a transformer-ish AI on normal data and only later swap its objective function (assuming it were made into one), then the network will already have encoded proxies for its old function, and its influence will (optimistically) see long-tailed exponential decay with training time.
If instead the plan is to first train an instruction-executing language model with sufficient understanding of human-symbolic math or English, this seems risky for traditional reasons but might be the most feasible way to try to implement it. I think this direction is worth exploring.
A mathematically precise (though I disagree the term is applicable here) objective function doesn't matter when you have a neural net trying its best to translate it into imprecise proxies which actually work in its environment.

[1] I recommend going all-in on building one. I suspect this is the bottleneck, and going full speed at your current course is often the best way to discover that you need to find a better course--or, indeed, win. Uncertainty does not imply slowing down.
[2] This ends up looking like a sort of serial processor constrained by a sharp information bottleneck between iterations.

comment by AprilSR · 2023-02-20T06:11:26.134Z · LW(p) · GW(p)

I have a pretty strong heuristic that clever schemes like this one are pretty doomed. The proposal seems to lack security mindset, as Eliezer would put it.

The most immediate/simple concrete objection I have is that no one has any idea how to create aligned-AI-part-2.exe? I don't think figuring out what we'd do if we knew how to make a program like that is really the difficult part here.

Replies from: ete, carado-1

↑ comment by plex (ete) · 2023-02-22T15:53:58.857Z · LW(p) · GW(p)

This is a Heuristic That Almost Always Works, and it's the one most likely to cut off our chances of solving alignment. Almost all clever schemes are doomed, but if we as a community let that meme stop us from assessing the object level question of how (and whether!) each clever scheme is doomed then we are guaranteed not to find one.

Security mindset means look for flaws, not assume all plans are so doomed you don't need to look.

If this is, in fact, a utility function which if followed would lead to a good future, that is concrete progress and lays out a new set of true names as a win condition. Not a solution, we can't train AIs with arbitrary goals, but it's progress in the same way that quantilizers were progress on mild optimization.

Replies from: AprilSR

↑ comment by AprilSR · 2023-02-22T19:31:35.520Z · LW(p) · GW(p)

I don't think security mindset means "look for flaws." That's ordinary paranoia. Security mindset is something closer to "you better have a really good reason to believe that there aren't any flaws whatsoever." My model is something like "A hard part of developing an alignment plan is figuring out how to ensure there aren't any flaws, and coming up with flawed clever schemes isn't very useful for that. Once we know how to make robust systems, it'll be more clear to us whether we should go for melting GPUs or simulating researchers or whatnot."

That said, I have a lot of respect for the idea that coming up with clever schemes is potentially more dignified than shooting everything down, even if clever schemes are unlikely to help much. I respect carado a lot for doing the brainstorming.

Replies from: mesaoptimizer

↑ comment by mesaoptimizer · 2023-03-08T11:46:06.185Z · LW(p) · GW(p)

I think a better way of rephrasing it is "clever schemes have too many moving parts and make too many assumptions and each assumption we make is a potential weakness an intelligent adversary can and will optimize for".

Replies from: carado-1

↑ comment by Tamsin Leake (carado-1) · 2023-03-08T14:27:48.581Z · LW(p) · GW(p)

i would love a world-saving-plan that isn't "a clever scheme" with "many moving parts" but alas i don't expect it's what we get. as clever schemes with many moving parts go, this one seems not particularly complex compared to other things i've heard of.

↑ comment by Tamsin Leake (carado-1) · 2023-02-21T09:50:50.183Z · LW(p) · GW(p)

to me it kind of is; i mean, if you have that, what do you do then? how do you use such a system to save the world?

Replies from: AprilSR

↑ comment by AprilSR · 2023-02-21T20:46:40.934Z · LW(p) · GW(p)

I mostly expect by the time we know how to make a seed superintelligence and give it a particular utility function... well, first of all the world has probably already ended, but second of all I would expect progress on corrigibility and such to have been made and probably to present better avenues.

If Omega handed me aligned-AI-part-2.exe, I'm not quite sure how I would use it to save the world? I think probably trying to just work on the utility function outside of a simulation is better, but if you are really running out of time then sure, I guess you could try to get it to simulate humans until they figure it out. I'm not very convinced that referring to a thing a person would have done in a hypothetical scenario is a robust method of getting that to happen, though?

comment by the gears to ascension (lahwran) · 2023-02-15T17:30:51.119Z · LW(p) · GW(p)

only one round of initial question-answer still seems very bad to me. it's very hard to get branch coverage of a brain. random data definitely won't do it.

Replies from: carado-1

↑ comment by Tamsin Leake (carado-1) · 2023-02-16T00:32:30.009Z · LW(p) · GW(p)

to be clear, the data isn't what the AI uses to learn about what the human says, the data is what the AI uses to know which thing in the world is the human, so it can then simply for example ask or brainscan the human, or learn about what it'd do in the first iteration in any number of other ways.

Replies from: lahwran

↑ comment by the gears to ascension (lahwran) · 2023-03-21T20:44:15.462Z · LW(p) · GW(p)

note for posterity: @carado and I have talked about this at length since this post and I now am mostly convinced that this is workable. I would currently describe the question (slightly metaphorically) as an "intentional glitch token", in that it is specifically designed to be a large random blob that cannot be inferred except by exploring, and which, since it gates all utility, causes the inner-aligned system to be extremely cautious.

I've been pondering that, and a thing I have been about to bring up and might as well mention in this comment is that this may cause an inner-aligned utility maximizer to sit around doing nothing forever out of an abundance of caution, since it can't identify worlds where it can be sure it can identify the configuration of the world that actually increases its utility function.

comment by Logan Zoellner (logan-zoellner) · 2023-02-15T15:31:09.914Z · LW(p) · GW(p)

I'm a little suspicious about this step

in order to get more compute and data, AI₁ very carefully hacks the internet, takes over the world, maybe prints nanobots and turns large uninhabited parts of the world into compute, and starts using its newfound access to real-world data and computing power to make better guesses as to what utility function E would eventually return.

What reasons do we have to believe AI_1 will be careful enough?

If we have techniques that are powerful enough to carry through this step correctly, chances are we have already solved the alignment problem. The Alignment problem is mostly about getting the AI to do anything at all without destroying the world, not about figuring out how to write down the one true perfect utility function.

Replies from: carado-1

↑ comment by Tamsin Leake (carado-1) · 2023-02-16T00:37:39.641Z · LW(p) · GW(p)

one reason we might have to think that the AI would be careful about this is that it knows it has a utility function to maximize but doesn't know what it is yet, though it can make informed guesses about it. "i don't know what my human user is gonna pick as utility function, but whatever it is, it probly strongly dislikes me causing damage, so i should probly avoid that".

it's not careful because we have the alignment tech to give it the characteristic of carefulness, it's hopefully careful because it's ultimately aligned, and its best guess as to what it's aligned to entails not destroying everything that matters.
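a toy version of that reasoning: the AI holds a probability distribution over candidate utility functions and picks the action with the best expected utility. all the numbers, the two actions, and the linear "progress minus damage" form below are invented for illustration.

```python
from typing import Callable

# Hypothetical outcomes per action: (progress made, damage caused)
OUTCOMES = {"cautious": (1.0, 0.0), "reckless": (3.0, 1.0)}

def make_utility(w_progress: float, w_damage: float) -> Callable[[str], float]:
    """Build a candidate utility function that rewards progress and penalises damage."""
    def utility(action: str) -> float:
        progress, damage = OUTCOMES[action]
        return w_progress * progress - w_damage * damage
    return utility

def best_action(actions: list[str],
                candidates: list[tuple[float, Callable[[str], float]]]) -> str:
    """Pick the action with the highest expected utility over the candidate functions."""
    def expected(action: str) -> float:
        return sum(p * u(action) for p, u in candidates)
    return max(actions, key=expected)

# Most of the probability mass sits on utility functions that strongly dislike damage.
candidates = [(0.4, make_utility(1, 10)), (0.4, make_utility(2, 5)), (0.2, make_utility(1, 0))]
print(best_action(["cautious", "reckless"], candidates))
```

the conclusion only holds if enough probability mass is on damage-averse candidates; shift the weights and the same rule picks the reckless action.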

Replies from: logan-zoellner

↑ comment by Logan Zoellner (logan-zoellner) · 2023-02-16T16:36:54.327Z · LW(p) · GW(p)

This doesn't make me any less suspicious. Humans have a utility function of "make more humans", but we still invented nuclear weapons and came within a hair's breadth of destroying the entire planet.

Replies from: lahwran

↑ comment by the gears to ascension (lahwran) · 2023-02-16T17:18:04.073Z · LW(p) · GW(p)

hello I'm "point out evolution alignment or not" brain shard. humans do not have a utility function of make more humans, they have a utility function of preserve their genetic-epigenetic-memetic self-actualization trajectory, or said less obtusely, make your family survive indefinitely. that does not mean make your family as big as possible. Even if you need to make your family big to make your family survive indefinitely, maximizing family size is a strategy chosen by almost no organisms or microbes. first order optimization is not how anything works except sometimes locally. second order or above always ends up happening because high quality optimization tries to hit targets (second order approximation of a policy update), it doesn't try to go in directions.

comment by plex (ete) · 2023-02-15T14:47:14.577Z · LW(p) · GW(p)

As I said over on your Discord, this feels like it has a shard of hope, and the kind of thing that could plausibly work if we could hand AIs utility functions.

I'd be interested to see the explicit breakdown of the true names you need for this proposal.

comment by romeostevensit · 2023-03-27T23:31:05.853Z · LW(p) · GW(p)

It seems like if you can motivate an AI to do this very specific thing, you already solved the important bits somewhere else.

Replies from: carado-1

↑ comment by Tamsin Leake (carado-1) · 2023-03-28T11:29:38.750Z · LW(p) · GW(p)

agreed! working on it.

comment by abramdemski · 2023-03-19T16:09:32.517Z · LW(p) · GW(p)

its formal goal: to maximize whichever utility function (as a piece of math) would be returned by the (possibly computationally exponentially expensive) mathematical expression E which the world would've contained instead of the answer, if in the world, instances of question were replaced with just the string "what should the utility function be?" followed by spaces to pad to 1 gigabyte.

How do you think about the under-definedness of counterfactuals?

EG, if counterfactuals are weird, this proposal probably does something weird, as it has to condition on increasingly weird counterfactuals.

Replies from: carado-1

↑ comment by Tamsin Leake (carado-1) · 2023-03-19T17:14:28.808Z · LW(p) · GW(p)

the counterfactuals might be defined wrong but they won't be "under-defined". but yes, they might locate the blob somewhere we don't intend to (or insert the counterfactual question in a way we don't intend to); i've been thinking a bunch about ways this could fail and how to overcome them (1, 2, 3).

on the other hand, if you're talking about the blob-locating math pointing to the right thing but the AI not making accurate guesses early enough as to what the counterfactuals would look like, i do think getting only eventual alignment is one of the potential problems, but i'm hopeful it gets there eventually, and maybe there are ways to check that it'll make good enough guesses even before we let it loose.

Replies from: abramdemski

↑ comment by abramdemski · 2023-03-19T18:25:27.115Z · LW(p) · GW(p)

Yeah, no, I'm talking about the math itself being bad, rather than the math being correct but the logical uncertainty making poor guesses early on.

i've been thinking a bunch about ways this could fail and how to overcome them (1, 2, 3).

I noticed you had some other posts relating to the counterfactuals, but skimming them felt like you were invoking a lot of other machinery that I don't think we have, and that you also don't think we have (IE the voice in the posts is speculative, not affirmative).

So I thought I would just ask.

My own thinking would be that the counterfactual reasoning should be responsive to the system's overall estimates of how-humans-would-want-it-to-reason, in the same way that its prior needs to be an estimate of the human-endorsed prior, and values should approximate human-endorsed values.

Sticking close to QACI, I think what this amounts to is tracking uncertainty about the counterfactuals employed, rather than solidly assuming one way of doing it is correct. But there are complex questions of how to manage that uncertainty.

Replies from: carado-1

↑ comment by Tamsin Leake (carado-1) · 2023-03-21T20:35:42.857Z · LW(p) · GW(p)

i've done some work towards building that machinery (see eg here) but yes, there are still a bunch of things to be figured out, though i'm making progress in that direction (see the posts about blob location).

My own thinking would be that the counterfactual reasoning should be responsive to the system's overall estimates of how-humans-would-want-it-to-reason, in the same way that its prior needs to be an estimate of the human-endorsed prior, and values should approximate human-endorsed values.

are you saying this in the prescriptive sense, i.e. we should want that property? i think if implemented correctly, accuracy is all we would really need right? carrying human intent in those parts of the reasoning seems difficult and wonky and plausibly not necessary to me, where straightforward utility maximization should work.

Replies from: lahwran

↑ comment by the gears to ascension (lahwran) · 2023-03-21T20:40:31.202Z · LW(p) · GW(p)

where straightforward utility maximization should work.

Notably, this relies on the utility function actually being sparse enough that it can't be maximized except by generating the traits abram mentions.

comment by Paul Kent (paul-kent) · 2023-03-31T12:11:51.354Z · LW(p) · GW(p)

One assumption that stands out to me as a little questionable is the idea that Cindy will, with infinite simulated time to think, eventually manage to come up with a solution to the alignment problem. (This is compounded by the fact that she's regularly brain-wiped and can only preserve insights by cramming them into the 1 gigabyte of scratch paper afforded to her.)

Replies from: mruwnik

↑ comment by mruwnik · 2023-04-14T19:55:14.067Z · LW(p) · GW(p)

1GB of text is a lot. Naively, that's a billion letters, much more if you use compression. Or you could maybe just do some kind of magic with the question containing a link to a wiki on the (simulated) internet?
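A quick sketch of the compression point, taking 1 GB as 10^9 bytes at one byte per letter. The sample text is a made-up, highly repetitive placeholder, so the ratio it gives is far more optimistic than real research notes would be.

```python
import zlib

BUDGET_BYTES = 10**9  # 1 GB taken as 10^9 bytes; one byte per ASCII letter

def effective_capacity(sample: bytes, budget: int = BUDGET_BYTES) -> int:
    """Estimate how many raw bytes fit in the budget if text compresses like the sample."""
    ratio = len(sample) / len(zlib.compress(sample, 9))
    return int(budget * ratio)

# Hypothetical sample; real notes would compress much less well than this.
sample = b"notes on alignment. " * 500
print(effective_capacity(sample))
```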

If you have infinite time, you can go the monkeys on typewriters route - one of them will come up with something decent, unless an egregore gets them, or something. Though that's very unlikely to be needed - assuming that alignment is solvable by a human level intelligence (this is doing a lot of work), then it should eventually be solved.