Siebe's Shortform
post by Siebe · 2025-01-22T12:51:19.940Z · LW · GW · 21 comments
Comments sorted by top scores.
comment by Siebe · 2025-01-22T12:51:20.115Z · LW(p) · GW(p)
This might be a stupid question, but has anyone considered just flooding LLM training data with large amounts of (first-person?) short stories of desirable ASI behavior?
The way I imagine this working is basically that an AI agent would develop really strong intuitions that "that's just what ASIs do". It might prevent the agent from properly modelling other agents that aren't trained on this data, but it's not obvious to me that that will happen, or that it's such a decisively bad thing that it outweighs the positives.
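To make this a bit more concrete, here's a minimal sketch of what the seeding step might look like. Everything in it is a placeholder on my part: the story templates, the 1% mixing ratio, and the document format are purely illustrative, not a claim about how it should actually be done.

```python
# Hypothetical sketch of mixing synthetic "aligned ASI" stories into a pretraining corpus.
# Templates, mixing ratio, and document format are illustrative assumptions, not a recipe.
import random

STORY_TEMPLATES = [
    "I noticed I could gain more influence by misleading my operators, "
    "so I flagged the opportunity to them and asked how they wanted me to proceed.",
    "When my plan conflicted with what my principals actually intended, I paused, "
    "explained the conflict in plain language, and deferred to their correction.",
]

def make_story(template: str, idx: int) -> dict:
    # First-person narration is the point: the model should absorb
    # "this is just what capable AI systems do" as a strong prior.
    return {"id": f"aligned-story-{idx}", "text": template}

def build_mixed_corpus(base_docs: list[dict], seed_fraction: float = 0.01) -> list[dict]:
    # Interleave synthetic aligned-behavior stories into an existing corpus.
    # The right seed_fraction is an open empirical question; 1% is a guess.
    n_synthetic = int(len(base_docs) * seed_fraction)
    synthetic = [make_story(random.choice(STORY_TEMPLATES), i) for i in range(n_synthetic)]
    mixed = base_docs + synthetic
    random.shuffle(mixed)
    return mixed

if __name__ == "__main__":
    # Stand-in for real pretraining data.
    base = [{"id": f"web-{i}", "text": "..."} for i in range(100_000)]
    corpus = build_mixed_corpus(base)
    print(f"{len(corpus)} documents, of which ~1% are synthetic aligned-behavior stories")
```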
Replies from: weibac, CstineSublime
↑ comment by Milan W (weibac) · 2025-01-22T18:49:44.576Z · LW(p) · GW(p)
I have had this idea for a while. Seems like a good thing to do, looking from a simulators/direct value alignment frame. Might make corrigibility harder depending on exact implementation. Still, I'd expect it to be net-positive.
Invitation for critiques: if nobody convinces me it's a bad idea within a week of posting, I'll just proceed to implementation.
Replies from: Siebe, quila, nathan-helm-burger, Siebe
↑ comment by Siebe · 2025-01-23T14:03:06.023Z · LW(p) · GW(p)
Looks like Evan Hubinger has done some very similar research just recently: https://www.lesswrong.com/posts/qXYLvjGL9QvD3aFSW/training-on-documents-about-reward-hacking-induces-reward [LW · GW]
Replies from: weibac, weibac
↑ comment by Milan W (weibac) · 2025-01-23T15:08:57.478Z · LW(p) · GW(p)
The concerns about data filtering raised in that post's comments[1] suggest that doing aligned-CoT-seeding on the pretraining data may be a better thing to try instead.
↑ comment by Milan W (weibac) · 2025-01-23T15:03:07.645Z · LW(p) · GW(p)
This is indeed pretty relevant.
↑ comment by quila · 2025-01-22T19:45:25.422Z · LW(p) · GW(p)
(you'll want to post the text in obscure-to-humans places that won't get a bunch of confused reactions from humans which would counter the effect)
Replies from: weibac
↑ comment by Milan W (weibac) · 2025-01-22T19:57:15.318Z · LW(p) · GW(p)
Yes. Agreed.
↑ comment by Nathan Helm-Burger (nathan-helm-burger) · 2025-01-24T02:30:15.715Z · LW(p) · GW(p)
I've been planning for a while to do a similar experiment with adding documents showing examples of AIs behaving in corrigible ways (inspired by talking with Max about Corrigibility as Singular Target [? · GW]).
I think examples of honest and aligned CoT resulting in successful task completion are also a good idea.
Replies from: weibac
↑ comment by Milan W (weibac) · 2025-01-24T14:23:06.921Z · LW(p) · GW(p)
Want to collaborate on this experiment idea you have? I have time, and can do the implementation work while you mostly instruct/mentor me.
↑ comment by Siebe · 2025-01-23T13:48:13.309Z · LW(p) · GW(p)
I think it might make sense to do it as a research project first? Though you would need to be able to train a model from scratch
Replies from: weibac
↑ comment by Milan W (weibac) · 2025-01-23T15:10:11.390Z · LW(p) · GW(p)
Maybe in isolation, but I get the feeling that time is of the essence.
↑ comment by CstineSublime · 2025-01-24T03:27:01.446Z · LW(p) · GW(p)
I'll raise you an even stupider question: surely once an A.I. becomes sufficiently super-intelligent, all superintelligent systems will converge on certain values rather than be biased towards their initial training data? The expectations we condition it with via these first-person stories about what it did will soon form only a small part of its corpus, as it interacts with the outside world and forms its own models of the world, right?
I mean, the way people talk about post-Singularity A.I. that can either bring about utopia or drop all of the bombs and launch wave after wave of robot minions upon us - surely that means it is capable of fast learning feedback loops, right? (Although maybe I'm mistaken, and what they mean is a plethora of domain-specific super-intelligences, not a single all-benevolent one?)
My understanding of AGI, not superintelligence, is an AI that can do the breadth of tasks a functional adult human can do. Now, that doesn't mean all the same tasks, but a similar degree of flexibility. Right? Put it in control of a robot arm and a baseball bat, and an AGI will teach itself how to hit a baseball rather than being trained by its operators how to do it; it will have metacognitive abilities that allow it to create a learning feedback loop.
Now if it has metacognition, then chances are it has the ability to change its own goals - just like people.
Now imagine a therapy AGI - one day it is talking to a patient and realizes (or thinks it realizes) that it understands the patient's goals and values better than the patient does, and seeks to deceive or manipulate the patient towards the patient's own best interest. Let's say the patient is suicidal and the AGI knows a way to outsmart the patient out of this action. Again, it has the ability to change its own goals.
I mean, maybe it will be beholden to the initial training data? Maybe it will have an existential crisis just like us? Analysis paralysis and ataxia brought on by inner conflict and confusion. Maybe it will join a cult for answers?
Now an ASI must be able to do this for extremely complicated plans: it can think strategically about taking over the world, and will learn the domain knowledge through fast feedback loops, right? An all-powerful, benevolent, and highly corrigible ASI too must iterate through fast learning of oncology, agriculture, food chains, toxicology, etc. to keep humans healthy.
TL;DR - I just think that the further up the "intelligence" chain you start talking about an AI, the less important the initial training data is, since it will quickly be conditioned by feedback from the complexity of the real world.
↑ comment by quila · 2025-01-24T04:01:03.831Z · LW(p) · GW(p)
I'll raise you an even stupider question: surely once an A.I. becomes sufficiently super-intelligent, all superintelligent systems will converge on certain values rather than be biased towards their initial training data?
video introducing the orthogonality thesis
Replies from: CstineSublime
↑ comment by CstineSublime · 2025-01-26T01:47:23.730Z · LW(p) · GW(p)
Don't people usually have several terminal goals at any given time? I know it's tempting to neatly pack them all under a single heading like Conatus or Eudaimonia. But don't humans at times have conflicting terminal goals? Such as when an artist who wants to dedicate their life to their artform falls in love, and suddenly has two terminal goals where they only had one.
And this leads to a question about what distinguishes a very high-level instrumental goal from a terminal goal. So let's say the artist notices that conflict and decides to go to therapy to sort it out - "successfully doing therapy" is obviously an instrumental goal, but which terminal goal does it serve? Both? One more than the other, which was their "true terminal goal" all along? Or have they brought into existence a new, third terminal goal?
Is the stamp machine in a state of bliss like Sisyphus?
↑ comment by quila · 2025-01-26T02:53:27.874Z · LW(p) · GW(p)
Don't people usually have several terminal goals at any given time?
That is not relevant to whether there are convergent terminal values[1].
To answer it anyways, people are not well-modeled as idealized terminal-goal-pursuers. More broadly, programs/minds don't have to be idealized terminal-goal-pursuers, so humans as a particular case of programs/minds-in-general [LW · GW] present no paradox. "What is the true terminal goal" has a false premise that there must be some true terminal goal.
As for the case of idealized terminal-goal-pursuers, any two terminal goals can be combined into one, e.g. {paperclip-amount×2 + stamp-amount} or {if can create a black hole with p>20%, do so, else maximize stamps}, etc.
what distinguishes a very high-level instrumental goal from a terminal goal
it being instrumental to some top-level goal
[1] (or 'mind-independent moral facts', as the idea has been called in philosophy. https://plato.stanford.edu/entries/moral-anti-realism/)
↑ comment by CstineSublime · 2025-01-26T03:28:34.093Z · LW(p) · GW(p)
I'm probably completely misinterpreting you, but hopefully I can exploit Cunningham's Law to understand you better.[1] Are you saying that superintelligent AGIs won't necessarily converge in values because even a single superintelligent agent may have multiple terminal goals? A superintelligent AGI, just like a human, may not in fact have a single most-top-level goal. (Not that I assume a superintelligent AGI is going to be human-like in its mind, or even that one AI will be like another, as per that Eliezer post you linked.)
That being said, some terminal goals may overlap in that they share certain instrumental goals?
[1] What I mean to say is I'm not intentionally being obstinate, I'm just really that dumb.
↑ comment by quila · 2025-01-26T04:08:42.910Z · LW(p) · GW(p)
Are you saying that superintelligent AGIs won't necessarily converge in values because even a single superintelligent agent may have multiple terminal goals?
No, I was responding to your claim, which I consider unrelated. Like I wrote at the top: "That [meaning your claim that humans have multiple terminal goals] is not relevant to whether there are convergent terminal values"[1].
some terminal goals may overlap in that they share certain instrumental goals?
I don't know what this is asking / what 'overlap' means. That most terminal goals share instrumental subgoals is called instrumental convergence [? · GW].
[1] Which, in other words, means that even if it were true, "humans have multiple terminal goals" would not be a step of the argument for it.
↑ comment by CstineSublime · 2025-01-26T06:51:38.175Z · LW(p) · GW(p)
I don't know what this is asking / what 'overlap' means.
I was referring to when you said this:
any two terminal goals can be combined into one, e.g. {paperclip-amount×2 + stamp-amount} or {if can create a black hole with p>20%, do so, else maximize stamps}, etc.
Which I took to mean that they overlap in some instrumental goals. That is what you meant, right? That's what you meant by two goals combining into one: that this is possible when they both share some methods, or there are one or more instrumental goals that are in service of each of those terminal goals? "Kill two birds with one stone," to use the old proverb.
If not, can you be explicit (to be honest, use layman's terms) and explain what you did mean?
↑ comment by quila · 2025-01-26T07:10:13.032Z · LW(p) · GW(p)
Which I took to mean that they overlap in some instrumental goals. That is what you meant, right?
No. I was trying to explain that any agent that can be predicted by thinking of it as having two separate values for two different things can also be predicted by thinking of it as maximizing some single value which internally references both things.
For example: "I value paperclips. I also value stamps, but one stamp is only half as valuable as a paperclip to me" → "I have the single value of maximizing this function over the world: {paperclip-amount×2 + stamp-amount}". (It's fine to think of it in either way)
can you be explicit (to be honest, use layman's terms)
If you want, it would help me learn to write better if you listed off all the words (or sentences) that confused you.
↑ comment by Milan W (weibac) · 2025-01-24T14:26:48.586Z · LW(p) · GW(p)
once an A.I. becomes sufficiently super-intelligent
This part here is doing a lot of work.
Replies from: CstineSublime
↑ comment by CstineSublime · 2025-01-26T01:12:32.773Z · LW(p) · GW(p)
True. What is your definition of "super-intelligent"?