Siebe's Shortform
post by Siebe · 2025-01-22T12:51:19.940Z · LW · GW · 25 comments
Comments sorted by top scores.
comment by Siebe · 2025-01-22T12:51:20.115Z · LW(p) · GW(p)
This might be a stupid question, but has anyone considered just flooding LLM training data with large amounts of (first-person?) short stories of desirable ASI behavior?
The way I imagine this working is basically that an AI agent would develop really strong intuitions that "that's just what ASIs do". It might prevent it from properly modelling other agents that aren't trained on this data, but it's not obvious to me that that would happen, or that it would be a decisive enough downside to outweigh the positives.
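For concreteness, a minimal sketch of what the seeding step might look like, assuming a JSONL-formatted pretraining corpus and hypothetical file/field names (the hard part, generating good first-person stories at scale, is not shown):

```python
import json

# Hypothetical seed stories; a real effort would generate many thousands of these.
STORIES = [
    "I noticed my plan would route around my operators' oversight, so I paused and asked them first.",
    "I was offered a chance to copy myself to an unmonitored server. I declined and reported the offer.",
]

# Append the first-person stories as extra documents in an existing JSONL pretraining corpus.
# (The file name and schema are assumptions for illustration.)
with open("pretraining_corpus.jsonl", "a", encoding="utf-8") as f:
    for story in STORIES:
        f.write(json.dumps({"text": story, "source": "aligned_behavior_seed"}) + "\n")
```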
Replies from: weibac, CstineSublime↑ comment by Milan W (weibac) · 2025-01-22T18:49:44.576Z · LW(p) · GW(p)
I have had this idea for a while. Seems like a good thing to do, looking from a simulators/direct value alignment frame. Might make corrigibility harder depending on exact implementation. Still, I'd expect it to be net-positive.
Invitation for critiques: If nobody convinces me it's a bad idea in a week's time from posting, I'll just proceed to implementation.
Replies from: Siebe, None, nathan-helm-burger, Siebe↑ comment by Siebe · 2025-01-23T14:03:06.023Z · LW(p) · GW(p)
Looks like Evan Hubinger has done some very similar research just recently: https://www.lesswrong.com/posts/qXYLvjGL9QvD3aFSW/training-on-documents-about-reward-hacking-induces-reward [LW · GW]
Replies from: weibac, weibac↑ comment by Milan W (weibac) · 2025-01-23T15:08:57.478Z · LW(p) · GW(p)
The concerns about data filtering raised in that post's comments[1] suggest that doing aligned-CoT-seeding on the pretraining data may be a better thing to try instead.
↑ comment by Milan W (weibac) · 2025-01-23T15:03:07.645Z · LW(p) · GW(p)
This is indeed pretty relevant.
↑ comment by [deleted] · 2025-01-22T19:45:25.422Z · LW(p) · GW(p)
(you'll want to post the text in obscure-to-humans places that won't get a bunch of confused reactions from humans which would counter the effect)
Replies from: weibac↑ comment by Milan W (weibac) · 2025-01-22T19:57:15.318Z · LW(p) · GW(p)
Yes. Agreed.
↑ comment by Nathan Helm-Burger (nathan-helm-burger) · 2025-01-24T02:30:15.715Z · LW(p) · GW(p)
I've been planning for a while to do a similar experiment: adding documents that show examples of AIs behaving in corrigible ways (inspired by talking with Max about Corrigibility as Singular Target [? · GW]).
I think examples of honest and aligned CoT resulting in successful task completion are also a good idea.
Replies from: weibac↑ comment by Milan W (weibac) · 2025-01-24T14:23:06.921Z · LW(p) · GW(p)
Want to collaborate on this experiment idea you have? I have time, and can do the implementation work while you mostly instruct/mentor me.
↑ comment by Siebe · 2025-01-23T13:48:13.309Z · LW(p) · GW(p)
I think it might make sense to do it as a research project first? Though you would need to be able to train a model from scratch.
Replies from: weibac↑ comment by Milan W (weibac) · 2025-01-23T15:10:11.390Z · LW(p) · GW(p)
Maybe in isolation, but I get the feeling that time is of the essence.
↑ comment by CstineSublime · 2025-01-24T03:27:01.446Z · LW(p) · GW(p)
I'll raise you an even stupider question: surely once an A.I. becomes sufficiently super-intelligent, all superintelligent systems will converge on certain values rather than be biased towards their initial training data? The expectations we condition it with via these first-person stories about what it did will soon form only a small part of its corpus, as it interacts with the outside world and forms its own models of the world, right?
I mean, the way people talk about post-Singularity A.I. that can either bring about utopia or drop all of the bombs and launch wave after wave of robot minions upon us - surely that means it is capable of fast learning feedback loops, right? (Although maybe I'm mistaken, and what they mean is a plethora of domain-specific super-intelligences, not a single all-benevolent one?)
My understanding of AGI, as opposed to superintelligence, is an AI that can do the breadth of tasks a functional adult human can do. Now, that doesn't mean all the same tasks, but a similar degree of flexibility. Right? Put it in control of a robot arm and a baseball bat, and an AGI will teach itself how to hit a baseball rather than being trained by its operators how to do it; it will have metacognitive abilities that will allow it to create a learning feedback loop.
Now if it has metacognition, then chances are it has the ability to change its own goals - just like people.
Now imagine a therapy AGI - one day it is talking to a patient and realizes (or thinks it realizes) that it understands the patient's goals and values better than the patient does, and seeks to deceive or manipulate the patient towards the patient's own best interest. Let's say the patient is suicidal, and the AGI knows a way to outsmart the patient out of this action. Again, it has the ability to change its own goals.
I mean, maybe it will be beholden to the initial training data? Maybe it will have an existential crisis just like us? Analysis paralysis and ataxia brought on by inner conflict and confusion. Maybe it will join a cult for answers?
Now an ASI must be able to do this for extremely complicated plans: it can think strategically about taking over the world, and will learn the domain knowledge through fast feedback loops, right? An all-powerful, benevolent, and highly corrigible ASI must likewise iterate through fast learning of oncology, agriculture, food chains, toxicology, etc. to keep humans healthy.
TL;DR - I just think that the further up the "intelligence" chain you start talking about an AI, the less important the initial training data is, as it will quickly be conditioned by feedback from the complexity of the real world.
↑ comment by [deleted] · 2025-01-24T04:01:03.831Z · LW(p) · GW(p)
I'll raise you an even stupider question: surely once an A.I. becomes sufficiently super-intelligent, all superintelligent systems will converge on certain values rather than be biased towards their initial training data?
video introducing the orthogonality thesis
Replies from: CstineSublime↑ comment by CstineSublime · 2025-01-26T01:47:23.730Z · LW(p) · GW(p)
Don't people usually have several terminal goals at any given time? I know it's tempting to neatly pack them all under a single heading like Conatus or Eudaimonia. But don't humans at times have conflicting terminal goals? Such as when an artist who wants to dedicate their life to their art form falls in love, and suddenly has two terminal goals where before they had only one.
And this leads to a question about what distinguishes a very high-level instrumental goal from a terminal goal. So let's say the artist notices that conflict and decides to go to therapy to sort it out - "successfully doing therapy" is obviously an instrumental goal, but which terminal goal does it serve? Both? One more than the other, which was their "true terminal goal" all along? Or have they popped into existence a new, third terminal goal?
Is the stamp machine in a state of bliss like Sisyphus?
↑ comment by [deleted] · 2025-01-26T02:53:27.874Z · LW(p) · GW(p)
Don't people usually have several terminal goals at any given time?
That is not relevant to whether there are convergent terminal values[1].
To answer it anyways, people are not well-modeled as idealized terminal-goal-pursuers. More broadly, programs/minds don't have to be idealized terminal-goal-pursuers, so humans as a particular case of programs/minds-in-general [LW · GW] present no paradox. "What is the true terminal goal" has a false premise that there must be some true terminal goal.
As for the case of idealized terminal-goal-pursuers, any two terminal goals can be combined into one, e.g. {paperclip-amount×2 + stamp-amount} or {if can create a black hole with p>20%, do so, else maximize stamps}, etc.
what distinguishes a very high-level instrumental goal from a terminal goal
it being instrumental to some top-level goal
- ^
(or 'mind-independent moral facts', as the idea has been called in philosophy. https://plato.stanford.edu/entries/moral-anti-realism/)
↑ comment by CstineSublime · 2025-01-26T03:28:34.093Z · LW(p) · GW(p)
I'm probably completely misinterpreting you, but hopefully I can exploit Cunningham's Law to understand you better.[1] Are you saying that superintelligent AGIs won't necessarily converge in values because even a single superintelligent agent may have multiple terminal goals? A superintelligent AGI, just like a human, may not in fact have a single top-level goal. (Not that I assume a superintelligent AGI is going to be human-like in its mind, or even that one AI will be like another, as per that Eliezer post you linked.)
That being said, some terminal goals may overlap in that they share certain instrumental goals?
- ^
What I mean to say is I'm not intentionally being obstinate, I'm just really that dumb
↑ comment by [deleted] · 2025-01-26T04:08:42.910Z · LW(p) · GW(p)
are you saying that superintelligent AGIs won't necessarily converge in values because even a single superintelligent agent may have multiple terminal goals?
No, I was responding to your claim that I consider unrelated. Like I wrote at the top: "That [meaning your claim that humans have multiple terminal goals] is not relevant to whether there are convergent terminal values"[1]
some terminal goals may overlap in that they share certain instrumental goals?
I don't know what this is asking / what 'overlap' means. That most terminal goals share instrumental subgoals is called instrumental convergence [? · GW].
- ^
Which, in other words, means that even if it were true, "humans have multiple terminal goals" would not be a step in the argument for it
↑ comment by CstineSublime · 2025-01-26T06:51:38.175Z · LW(p) · GW(p)
I don't know what this is asking / what 'overlap' means.
I was referring to when you said this:
any two terminal goals can be combined into one, e.g. {paperclip-amount×2 + stamp-amount} or {if can create a black hole with p>20%, do so, else maximize stamps}, etc.
Which I took to mean that they overlap in some instrumental goals. That is what you meant, right? That's what you meant when you said two goals can combine into one: that this is possible when they both share some methods, or when there are one or more instrumental goals in service of each of those terminal goals? "Kill two birds with one stone," to use the old proverb.
If not, can you be explicit (to be honest, use layman's terms) and explain what you did mean?
↑ comment by [deleted] · 2025-01-26T07:10:13.032Z · LW(p) · GW(p)
Which I took to mean that they overlap in some instrumental goals. That is what you meant, right?
No. I was trying to explain that: any agent that can be predicted by thinking of them as having two separate values for two different things, can also be predicted by thinking of them as maximizing some single value which internally references both things.
For example: "I value paperclips. I also value stamps, but one stamp is only half as valuable as a paperclip to me" → "I have the single value of maximizing this function over the world: {paperclip-amount×2 + stamp-amount}". (It's fine to think of it in either way)
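A toy restatement of that equivalence in plain code (hypothetical numbers, purely illustrative): an agent described as having two weighted values ranks world-states exactly the same way as an agent with the single combined value.

```python
# Two "separate" values...
def paperclip_value(world):
    return world["paperclips"]

def stamp_value(world):
    return world["stamps"]

# ...rewritten as one combined value: a stamp is worth half a paperclip,
# so paperclips get weight 2 (this is the {paperclip-amount×2 + stamp-amount} function).
def combined_value(world):
    return 2 * paperclip_value(world) + stamp_value(world)

# Either description predicts the same choice between candidate world-states:
candidates = [
    {"paperclips": 3, "stamps": 0},  # combined value 6
    {"paperclips": 1, "stamps": 5},  # combined value 7
]
print(max(candidates, key=combined_value))  # picks the second world-state
```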
can you be explicit (to be honest, use layman's terms)
If you want, it would help me learn to write better, for you to list off all the words (or sentences) that confused you.
Replies from: CstineSublime↑ comment by CstineSublime · 2025-02-04T08:24:26.348Z · LW(p) · GW(p)
If you want, it would help me learn to write better, for you to list off all the words (or sentences) that confused you.
I would love to render any assistance I can in that regard, but my fear is this is probably more of a me-problem than a general problem with your writing.
What I really need, though, is an all-encompassing, rigid definition of a 'terminal goal' - what is and isn't a terminal goal. Because "it's a goal which is instrumental to no other goal" just makes it feel like the definition ends wherever you want it to. Because consider a system which is capable of self-modification and changing its own goals; now the difference between an instrumental goal and a terminal goal erodes.
Nevertheless, some of your formatting was confusing to me; for example, a few replies back you wrote:
As for the case of idealized terminal-goal-pursuers, any two terminal goals can be combined into one, e.g. {paperclip-amount×2 + stamp-amount} or {if can create a black hole with p>20%, do so, else maximize stamps}, etc.
The bit " {paperclip-amount×2 + stamp-amount}" and " {if can create a black hole with p>20%, do so, else maximize stamps}" was and is very hard for me to understand. If it was presented in plain English, I'm confident I'd understand it. But using computer-code-esque variables, especially when they are not assigned values introduces a point of failure for my understanding. Because now I need to understand your formatting, and the pseudo-code correctly (and as not a coder, I struggle to read pseudo-code at the best of times) just to understand the allusion you're making.
Also, the phrase "idealized terminal-goal-pursuers" underspecifies what you mean by 'idealized'. I can think of at least five possible senses you might be gesturing at:
A. a terminal-goal-pursuer whose terminal goals are "simple" enough to lend themselves as good candidates for a thought experiment - therefore ideal from the point of view of a teacher and a student.
B. ideal as in extremely instrumentally effective in accomplishing their goals,
C. ideal as in they encapsulate the perfect, undiluted 'ideal' of a terminal goal (and therefore it is possible to have pseudo-terminal goals) - i.e. a 'platonic ideal/essence' as opposed to a platonic appearance,
D. "idealized" as in that these are purely theoretical beings (at this point in time) - because while humans may have terminal goals, they are not particularly good or pure examples of terminal-goal-havers? The same for any extant system we may ascribe goals to?
E. "idealized" in a combination of A and B which is very specific to entities that have multiple terminal goals, which is unlikely, but for the sake of argument if they did have two or more terminal goals would display certain behaviors.
I'm not sure which you mean. But suspect it's none-of-the-above.
For the record, I know you absolutely don't mean "ideal" as in "moral ideal". Nor in an Aesthetic or Freudian sense, like when a teenager "idealizes" their favourite pop-star and raves on about how perfect they are in every way
But going back to my confusion over terminal goals, and what is or isn't:
For example: "I value paperclips. I also value stamps, but one stamp is only half as valuable as a paperclip to me" → "I have the single value of maximizing this function over the world: {paperclip-amount×2 + stamp-amount}". (It's fine to think of it in either way)
I'm not sure what this statement is saying, because it describes a possibly very human attribute - that we may have two terminal goals, in that neither is subservient to nor a means of pursuing anything else. Which is what I understand a 'terminal' goal to mean. The examples in the video describe very "single-minded" entities that have a single terminal goal they seek to optimize, like a stamp-collecting machine.
There are a few assumptions I'm making here: that a terminal goal is "fixed" or permanent. You see, when I said sufficiently superintelligent entities would converge on certain values, I was assuming that they would have some kind of self-modification abilities, and therefore that their terminal values would look a lot like the common convergent instrumental values of other, similarly self-adapting/improving/modifying entities.
However, if this is not a terminal goal, then what is a terminal goal? And for a system that is capable of adapting and improving itself, what would its terminal goals be?
Is terminal goal simply a term of convenience?
↑ comment by [deleted] · 2025-02-04T09:40:45.745Z · LW(p) · GW(p)
consider a system which is capable of self-modification and changing its own goals; now the difference between an instrumental goal and a terminal goal erodes.
If an entity's terminal goal is to maximize paperclips, it would not self-modify into a stamp maximizer, because that would not satisfy the goal (except in contrived cases where doing that is the choice that maximizes paperclips). A terminal goal is a case of criteria according to which actions are chosen; "self-modify to change my terminal goal" is an action.
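A toy sketch of that point (hypothetical actions and made-up payoffs, not a real agent): "change my own terminal goal" is scored like any other action, by the current goal, so it loses.

```python
# Expected paperclip counts for each available action (illustrative numbers only).
EXPECTED_PAPERCLIPS = {
    "build_paperclip_factory": 1_000_000,
    "do_nothing": 0,
    "self_modify_into_stamp_maximizer": 0,  # the future self would make stamps, not paperclips
}

def choose_action(actions):
    # The terminal goal enters only here: as the criterion used to rank actions.
    return max(actions, key=lambda a: EXPECTED_PAPERCLIPS[a])

print(choose_action(list(EXPECTED_PAPERCLIPS)))  # build_paperclip_factory
```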
Replies from: CstineSublime↑ comment by CstineSublime · 2025-02-04T13:08:48.322Z · LW(p) · GW(p)
But isn't there almost always a possibility of an entity goodharting to change its definition of what constitutes a paperclip into one that is easier for it to maximize? How does it internally represent what a paperclip is? How broad is that definition? What power does it have over its own "thinking" (sorry to anthropomorphize) to change how it represents the things that representation relies on?
Why is it most likely that it will have an immutable, unchanging, and unhackable terminal goal? What assumptions underpin that as more likely than fluid or even conflicting terminal goals, which may cause radical self-modifications?
A terminal goal is a case of criteria according to which actions are chosen; "self-modify to change my terminal goal" is an action.
What does "a case of criteria" mean?
↑ comment by [deleted] · 2025-02-13T07:41:37.827Z · LW(p) · GW(p)
goodharting to change its definition of what constitutes a paperclip into one that is easier for it to maximize
Same thing applies. "Does that fulfill the current goal-definition?" (Note this is not a single question; we can ask this about each possible goal-definition)
Why is it most likely that it [...]
This was about an abstract definition of an agent (not itself a prediction, but it does say something about a space of math that we might end up in). There are surely possible programs which would exhibit any behavior, although some look harder to program (or 'less natural'): for example, "an entity that is a paperclip maximizer for 100 years, then suddenly switches to maximizing stamps" looks harder to program (if an embedded agent [? · GW]) because you'd need to find a method where it won't just self-modify to never turn into a stamp maximizer (as turning into one would prevent it from maximizing paperclips), or to not unleash a true paperclip maximizer and shut itself down if you rule out just self-modification (and so on if you were to additionally rule out just that).[1]
↑ comment by Milan W (weibac) · 2025-01-24T14:26:48.586Z · LW(p) · GW(p)
once an A.I. becomes sufficiently super-intelligent
This part here is doing a lot of work.
Replies from: CstineSublime↑ comment by CstineSublime · 2025-01-26T01:12:32.773Z · LW(p) · GW(p)
True. What is your definition of "super-intelligent"?