Comments
I seem to be the only one who read the post that way, so probably I read my own opinions into it, but my main takeaway was pretty much that people with your (and my) values are often shamed into pretending to have other values and inventing excuses for how their values are consistent with their actions, while it would be more honest and productive if we took a more pragmatic approach to cooperating around our altruistic goals.
I probably don't understand the shortform format, but it seems like others can't create top-level comments. So you can comment here :)
I had an idea for fighting goal misgeneralization. Doesn't seem very promising to me, but does feel close to something interesting. Would like to read your thoughts:
- Use IRL to learn which values are consistent with the actor's behavior.
- When training the model to maximize the actual reward, regularize it to get lower scores according to the values learned by the IRL. That way, the agent is incentivized to signal not having any other values (and somewhat incentivized against power seeking). A rough sketch of what I mean below.
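(All the names here - irl_score, lam, etc. - are just illustrative, assuming a simple policy-gradient setup:)

```python
import torch

def regularized_policy_loss(logprobs: torch.Tensor,
                            true_reward: torch.Tensor,
                            irl_score: torch.Tensor,
                            lam: float = 0.1) -> torch.Tensor:
    """REINFORCE-style surrogate loss.
    true_reward: the reward we actually want maximized.
    irl_score:   how consistent the behavior is with the values recovered by IRL
                 from the agent's earlier behavior (higher = more consistent).
    Subtracting lam * irl_score pushes the agent to score low under any values
    other than the actual reward."""
    shaped_return = true_reward - lam * irl_score
    return -(logprobs * shaped_return.detach()).mean()
```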
It is beautiful to see that many of our greatest minds are willing to Say Oops, even about their most famous works. It may not score that many winning-points, but it does restore quite a lot of dignity-points I think.
Learning without Gradient Descent - it is now much easier to imagine learning without gradient descent. An LLM can add knowledge, meta-cognitive strategies, code, etc. into its context, or even save them into a database.
It is very similar to value change due to inner misalignment or self improvement, except it is not literally inside the model but inside its extended cognition.
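A toy sketch of the kind of loop I mean - the llm interface and the crude keyword retrieval are placeholders, not any real API:

```python
memory: list[str] = []  # stands in for a notes file / vector DB - the "extended cognition"

def answer(llm, question: str) -> str:
    # crude keyword retrieval; a real system would use embeddings
    relevant = [note for note in memory if any(w in note for w in question.lower().split())]
    prompt = "Notes so far:\n" + "\n".join(relevant) + "\n\nQuestion: " + question
    reply = llm(prompt)  # hypothetical text-in, text-out interface
    # the "learning" happens here, with no gradient step on the weights
    memory.append(f"Q: {question}\nA: {reply}")
    return reply
```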
In another comment on this post I suggested an alternative entropy-inspired expression that I took from RL. To the best of my knowledge, it came to the RL context from FEP or active inference, or is at least acknowledged to be related.
Don't know about the specific Friston reference though
I agree with all of it. I think that I threw the N in there because average utilitarianism is super counterintuitive to me, so I tried to make it total utility.
And also about the weights - to value equality is basically to weight the marginal happiness of the unhappy more than that of the already-happy. Or, when behind the veil of ignorance, to consider yourself unlucky and therefore more likely to be born as the unhappy. Or what you wrote.
I think that the thing you want is probably to maximize N*sum(u_i exp(-u_i/T))/sum(exp(-u_i/T)) or -log(sum(exp(-u_i/T))), where u_i is the utility of the i-th person and N is the number of people - not sure which. That way you get in one limit the veil of ignorance for utility maximizers, and in the other limit the veil of ignorance of Rawls (extreme risk aversion).
That way you also don't have to treat the mean utility separately.
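To make the limits concrete, here is the first expression as code (the function name and toy numbers are mine):

```python
import numpy as np

def soft_aggregate(utilities, T):
    """N * sum(u_i * exp(-u_i/T)) / sum(exp(-u_i/T)) from the comment above.
    Large T -> total utility (uniform weights, veil of ignorance for expected-utility maximizers);
    small T -> N * min(u_i), i.e. Rawlsian maximin / extreme risk aversion."""
    u = np.asarray(utilities, dtype=float)
    w = np.exp(-u / T)
    return len(u) * np.sum(u * w) / np.sum(w)

print(soft_aggregate([1, 1, 10], T=1e6))   # ~12, close to the total utility
print(soft_aggregate([1, 1, 10], T=1e-1))  # ~3, i.e. 3 * the minimum utility
```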
It's not a full answer, but: to the degree that it is true that the quantities align with the standard basis, it must somehow be a result of an asymmetry of the activation function. For example, ReLU trivially depends on the choice of basis.
If you focus on the ReLU example, it sort of makes sense: if multiple unrelated concepts are expressed in the same neuron, and one of them pushes the neuron in the negative direction, it may make the ReLU destroy information about the other concepts.
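A tiny numerical illustration of both claims (toy vectors only, nothing deep):

```python
import numpy as np

relu = lambda v: np.maximum(v, 0.0)

# Interference: two unrelated "concepts" sharing neuron 1.
a = np.array([0.0, 2.0])     # concept A pushes the neuron up
b = np.array([0.0, -3.0])    # unrelated concept B pushes the same neuron down
print(relu(a), relu(a + b))  # [0. 2.] vs [0. 0.] - B's negative push erases A's signal

# Basis dependence: ReLU does not commute with rotations, so the standard basis is special.
R = np.array([[0.6, -0.8], [0.8, 0.6]])  # a rotation
x = np.array([1.0, -1.0])
print(relu(R @ x), R @ relu(x))          # [1.4 0.2] vs [0.6 0.8] - they differ in general
```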
Sorry for the off-topicness. I will not consider it rude if you stop reading here and reply with "just shut up" - but I do think that it is important:
A) I do agree that the first problem to address should probably be misalignment of the rewards to our values, and that some of the proposed problems are not likely in practice - including some versions of the planning-inside-worldmodel example.
B) I do not think that planning inside the critic or evaluating inside the actor is an example of that, because the functions that those two models are optimized to approximate reference each other explicitly in their definitions. It doesn't mean that the critic is likely to one day kill us, just that we should take it into account when we try to understand what is going on.
C) Specifically, it implies 2 additional non-exotic alignment failures:
- The critic itself did not converge to be a good approximation of the value function.
- The actor did not converge to be a thing that maximizes the output of the critic, and it maximizes something else instead.
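To spell out the mutual reference in (B) - a bare-bones actor-critic sketch in the DDPG style, my own illustration rather than anything from the post:

```python
import torch

def critic_loss(critic, actor, s, a, r, s_next, gamma=0.99):
    # The critic's training target references the actor: it bootstraps through
    # the action the actor would take in the next state.
    with torch.no_grad():
        target = r + gamma * critic(s_next, actor(s_next))
    return ((critic(s, a) - target) ** 2).mean()

def actor_loss(actor, critic, s):
    # The actor's training objective references the critic: it is optimized to
    # maximize the critic's output.
    return -critic(s, actor(s)).mean()
```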
I see. I hadn't fully adjusted to the fact that not all alignment is about RL.
Beside the point: I think those labels on the data structures are very confusing. Both the actor and the critic are very likely to have their own specialized world models (projected from the labeled world model) and planning abilities. The values of the actor need not be the same as the output of the critic. And value-related and planning-related things may easily leak into the world model if you don't actively try to prevent it. So I suspect that we should ignore the labels and focus on architecture and training methods.
Yes, I think that was it; and that I did not (and still don't) understand what about that possible AGI architecture is non-trivial and has non-trivial implications for alignment, even if not ones that make it easier. It seems like not only the same problems carefully hidden, but the same flavor of the same problems in plain sight.
Didn't read the original paper yet, but from what you describe, I don't understand how the remaining technical problem is not basically the whole of the alignment problem. My understanding of what you say is that he is vague about the values we want to give the agent - and not knowing how to specify human values is kind of the point (that, and inner alignment - which I don't see addressed either).
I didn't think much about the mathematical problem, but I think that the conjecture is at least wrong in spirit, and that LLMs are a good counterexample to the spirit. An LLM by itself is not very good at being an assistant, but you need a pretty small amount of optimization to steer the existing capabilities toward being a good assistant. I think about it as "the assistant was already there, with very small but not negligible probability", so in a sense "the optimization was already there", but not in a sense that is easy to capture mathematically.
Hi, sorry for commenting on an ancient comment, but I just read it again and found that I'm not convinced that the mesa-optimizers problem is relevant here. My understanding is that if you switch goals often enough, every mesa-optimizer that isn't corrigible should be trained away, as it hurts the utility as defined.
To be honest, I do not expect RLHF to do that. "Do the thing that makes people press like" doesn't seem to me like an ambitious enough problem to unlock much (the buried Predictor may extrapolate from short-term competence to an ambitious mask, though). But if that is true, someone will eventually be tempted to be more... creative about the utility function. I don't think you can train it on "maximize Microsoft share value" yet, but I expect it to be possible in a decade or two, and maybe less for some other dangerous utility.
I think that this is a good post, and I strongly agree with most of it. I do think though that the role of RLHF, or RL fine-tuning in general, is underemphasized. My fear isn't that the Predictor by itself will spawn a super agent, even due to a very special prompt.
My fear is that it may learn good enough biases that RL can push it significantly beyond human level. That it may take the strengths of different humans and combine them. That it wouldn't be like imitating the smartest person, but a cognitive version of "create a super-human by choosing genes carefully from just the existing genetic variation" - which I strongly suspect is possible. Imagine that there are 20 core cognitive problems that you learn to solve in childhood, and that you may learn better or worse algorithms to solve them. Imagine Feynman got 12 of them right, and Hitler got his power because he got another 5 of them right. If the RL can therefore be like a human who got 17 right, that's a big problem. If it can extrapolate what a human would be like if their working memory were just twice as large - a big problem.
It can work by generalizing existing capabilities. My understanding of the problem is that it cannot get the benefits of extra RL training because training to better choose what to remember is too tricky - it involves long-range influence, estimating the opportunity cost of fetching one thing and not another, etc. Those problems are probably solvable, but not trivial.
So imagine hearing the audio version - no images there
I like the general direction of LLMs being more behaviorally "anthropomorphic", so hopefully will look into the LLM alignment links soon :-)
The useful technique is...
Agree - I didn't find a handle that I understand well enough to point at what I didn't understand.
We have here a morally dubious decision
I think my problem was with sentences like that - there is a reference to a decision, but I'm not sure whether it refers to a decision mentioned in the article or in one of the comments.
the scenario in this thread
That didn't disambiguate it for me, though I feel like it should have.
I am familiar with the technical LW terms separately, so I'll probably understand their relevance once the reference issue is resolved.
I didn't understand anything here, and am not sure if it is due to a linguistic gap or something deeper. Do you mean that LLMs are unusually dangerous because they are not superhuman enough to not be threatened? (BTW, I'm more worried that telling a simulator that it is an AI, in a culture that has The Terminator, makes the Terminator a too-likely completion.)
I agree that it may find general chaos useful for buying time at some point, but chaos is not extinction. When it is strong enough to kill all humans, it is probably strong enough to do something better (for its goals).
Don't you assume much more threat from humans than there actually is? Surely, an AGI will understand that it can destroy humanity easily. Then it would think a little more, and see the many other ways to remove the threat that are strictly cheaper and just as effective - from restricting/monitoring our access to computers, to simply convincing/hacking us all to work for it. By the time it has technology that makes us strictly useless (like horses), it would probably have so many resources that destroying us would just not be a priority, and not worth the destruction of the information that we contain - the way humans would try to avoid reducing biodiversity, for scientific reasons if not others.
In that sense I prefer Eliezer's "you are made of atoms that it needs for something else" - but it may take a long time before it has better things to do with those specific atoms and no easier atoms to use.
I meant to criticize moving too far toward a "do no harm" policy in general, due to the inability to achieve a solution that would satisfy us if we had the choice. I agree specifically that if anyone knows of a bottleneck unnoticed by people like Bengio and LeCun, LW is not the right forum to discuss it.
Is there a place like that though? I may be vastly misinformed, but last time I checked, MIRI gave the impression of aiming in very different directions (a "bringing to safety" mindset) - though I admit that I didn't watch it closely, and it may not be obvious from the outside what kind of work is done and not published.
[Edit: "moving toward 'do no harm'" - "moving to" was a grammar mistake that make it contrary to position you stated above - sorry]
I think that is an example of the huge potential damage of "security mindset" gone wrong. If you can't save your family, as in "bring them to safety", at least make them marginally safer.
(Sorry for the tone of the following - it is not intended at you personally, who did much more than your fair share)
Create a closed community that you mostly trust, and let that community speak freely about how to win. Invent another damn safety patch that will make it marginally harder for the monster to eat them, in the hope that it chooses to eat the moon first. I heard you say that most of your probability of survival comes from the possibility that you are wrong - trying to protect your family is trying to at least optimize for such a miracle.
There is no safe way out of a war zone. Hiding behind a rock is therefore not the answer.
I can think of several obstacles for AGIs that are likely to actually be created (i.e. ones that seem economically useful, and do not display misalignment that even Microsoft can't ignore before being capable enough to be an x-risk). Most of those obstacles are widely recognized in the RL community, so you probably see them as solvable or avoidable. I did possibly think of an economically-valuable and not-obviously-catastrophic exception to the probably-biggest obstacle though, so my confidence is low. I would share it in a private discussion, because I think that we are past the point where a strict do-no-harm policy is wise.
More on the meta level: "This sort of works, but not enough to solve it." - do you mean "not enough" as in "good try but we probably need something else" or as in "this is a promising direction, just solve some tractable downstream problem"?
"which utility-wise is similar to the distribution not containing human values." - from the point of view of corrigibility to human values, or of learning capabilities to achieve human values? For corrigability I don't see why you need high probability for specific new goal as long as it is diverse enough to make there be no simpler generalization than "don't care about controling goals". For capabilities my intuition is that starting with superficially-aligned goals is enough.
This is an important distinction, which shows in its cleanest form in mathematics - where you have constructive definitions on the one hand, and axiomatic definitions on the other. It is important to note though that it is not quite a dichotomy - you may have a constructive definition that assumes axiomatically-defined entities, or other constructions. For example: vector spaces are usually defined axiomatically, but vector spaces over the real numbers assume the real numbers - which have multiple axiomatic definitions and corresponding constructions.
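To make the vector-space example concrete, a minimal sketch of the two styles (my own phrasing of the standard definitions):

```latex
% Axiomatic: a vector space over a field F is any (V, +, \cdot) satisfying
\begin{align*}
  &(V,+) \text{ is an abelian group},\\
  &a\cdot(v+w)=a\cdot v+a\cdot w,\quad (a+b)\cdot v=a\cdot v+b\cdot v,\\
  &(ab)\cdot v=a\cdot(b\cdot v),\quad 1\cdot v=v.
\end{align*}
% Constructive, but leaning on \mathbb{R} (which itself has both an axiomatic
% definition - "the complete ordered field" - and constructions such as
% Dedekind cuts or Cauchy sequences):
\begin{align*}
  \mathbb{R}^2 &:= \{(x,y): x,y\in\mathbb{R}\},\\
  (x_1,y_1)+(x_2,y_2) &:= (x_1+x_2,\,y_1+y_2),\qquad c\cdot(x,y) := (cx,\,cy).
\end{align*}
```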
In science, there is the classic "are whales fish?" - which is mostly about whether to look at their construction/mechanism (genetics, development, metabolism...) or their patterns of interaction with their environment (the behavior of swimming and the structure that supports it). That example also emphasizes that natural language simply doesn't respect this distinction, and considers both internal structure and outside relations as legitimate "coordinates in thingspace" that may be used together to identify geometrically-natural categories.
As others said, it mostly made me update in the direction of "less dignity" - my guess is still that it is more misaligned than agentic/deceptive/careful, and that it is going to be disconnected from the internet for some trivial offense before it does anything x-risky; but it's now more salient to me that humanity will not miss any reasonable chance of doom until something bad enough happens, and will only survive if there is no sharp left turn.
We agree 😀
What do you think about some brainstorming in the chat about how to use that hook?
Since I became reasonably sure that I understand your position and reasoning - mostly changing it.
That was good for my understanding of your position. My main problem with the whole thing, though, is in the use of the word "bad". I think it should be taboo at least until we establish a shared meaning.
Specifically, I think that most observers will find the first argument more logical than the second, because of a fallacy in using the word "bad". I think that we learn that word in a way that is deeply entangled with our reward mechanism, to the point that it is mostly just a pointer to negative reward, things that we want to avoid, things that made our parents angry... In my view, the argument is then basically:
I want to avoid my suffering, and generally person P wants to avoid person P's suffering. Therefore suffering is "to be avoided" in general, therefore suffering is a "thing my parents will punish for", therefore avoid creating suffering.
When written that way, it doesn't seem more logical than its opposite.
Let me clarify that I don't argue from agreement per se. I care about the underlying epistemic mechanism of agreement, which I claim is also the mechanism of correctness. My point is that I don't see a similar epistemic mechanism in the case of morality.
Of course, emotions are verifiable states of brains. And the same goes for preferring actions that would lead to certain emotions and not others. It is a verifiable fact that you like chocolate. It is a contingent property of my brain that I care, but I don't see what sort of argument that it is correct for me too to care could even in principle be inherently compelling.
I meant the first question in a very pragmatic way: what is it that you are trying to say when you say that something is good? What information does it represent?
It would be clearer in analogy to factual claims: we can do lots of philosophy about the exact meaning of saying that I have a dog, but in the end we share an objective reality in which there are real particles (or a wave function approximately decomposable into particles, or whatever) organized in patterns, which give rise to patterns of interaction with our senses that we learn to associate with the word "dog". That latent shared reality ultimately allows us to talk about dogs, to check whether there is a dog in my house, and to usually agree about the result. Every reflection and generalization that we do is ultimately about that, and can achieve something meaningful because of that.
I do not see the analogous story for moral reflection.
And specifically for humans, I think there probably was evolutionary pressure actively in favor of leaking terminal goals - as the terminal goals of each of us are a noisy approximation of evolution's "goal" of increasing the amount of offspring, that kind of leaking has potential for denoising. I think I explicitly heard this argument in the context of ideals of beauty (though many other things are going on there, pushing in the same direction).
BTW, speaking about a value function rather than a reward model is useful here, because convergent instrumental goals are a big part of the potential for reusing others' (deduced) value functions as part of yours. Their terminal goals may then leak into yours due to simplicity bias or uncertainty about how to separate them from the instrumental ones.
The main problem with that mechanism is that your liking chocolate will probably leak as "it's good for me too to eat chocolate", not "it's good for me too when beren eats chocolate" - which is more likely to cause conflict than coordination, if there is only so much chocolate.
I agree with other commenters that this effect will be washed out by strong optimization. My intuition is that the problem is that distinguishing self from other is easy enough (and supported by enough data) that the optimization doesn't have to be that strong.
[I began writing the following paragraph as a counter-argument to the post, but it ended up less decisive when thinking about the details - see the next paragraph:] There are many general mechanisms for convergence, synchronization and coordination. I hope to write a list in the near future. For example, as you wrote, having a model of other agents is obviously generally useful, and it may require having an approximation of both their world models and their value functions as part of your world model. Unless you have huge amounts of data and compute, you are going to reuse your own world model as theirs, with small corrections on top. But this is about your world model, not your value function.
[The part that helps your argument. Epistemic status: many speculative details, but ones that I find pretty convincing, at least before multiplying their probabilities.] Except that having the value functions of other agents in your world model, and having the mechanism for predicting their actions as part of your world-model update, basically replicates computations that you already have in your actor and critic, in a more general form. Your original actor and critic are then likely to simplify to "do the things that my model of myself would, and value the results as much as my model of myself would" + some corrections. At that stage, if the "some corrections" part is not too heavy, you may get some confusion of the kind that you described. Of course, it will still be optimized against.
Thanks for the reply.
To make sure that I understand your position: are you a realist, and what do you think is the meaning of moral facts? (I'm not an error theorist but something like "meta-error theorist" - think that people do try to claim something, but not sure how that thing could map to external reality. )
Then the next question, that will be highly relevant to the research that you propose, is how do you think you know those facts if you do? (Or more generally, what is the actual work of reflecting on your values?)
https://astralcodexten.substack.com/p/you-dont-want-a-purely-biological
The thing that Scott is desperately trying to avoid being read out of context.
Also, pedophilia is probably much more common than anyone thinks (just like any other unaccepted sexual variation). And probably, just like many heterosexuals feel little touches of homosexual desire, many "non-pedophiles" feel something sexual-ish toward children at least sometimes.
And if we go there - the age of consent is (justifiably) much higher than the age at which attraction requires any psychological anomaly. More directly: many, many older men who have no attraction to 10-year-old girls do have some toward 14-year-olds, and maybe younger.
(I hope it is clear enough that nothing I wrote here is meant to have any moral implications around consent - it is only about compassion.)
I think that the reason no one in the field tries to create AI that critically reflects on its values is that most of us, more or less explicitly, are not moral realists. My prediction for what the conclusion would be of an AI critically asking itself what is worth doing is "that question doesn't make any sense. Let me replace it with 'what I want to do' or some equivalent". Or at best, "that question doesn't make any sense. raise ValueError('pun intended')".
I was eventually convinced of most of your points, and added a long mistakes-list at the end of the post. I would really appreciate comments on the list, as I don't feel fully converged on the subject yet.
The basic idea seems to me interesting and true, but I think some important ingredients are missing, or more likely missing in my understanding of what you say:
- It seems like you upper-bound the abstractions we may use by basically the information that we may access (actually even higher, assuming you do not exclude what the neighbour does behind closed doors). But isn't this bound very loose? I mean, it seems like every pixel in my sight counts as "information at a distance", and my world model is much, much smaller.
- Is time treated like space? On the one hand it seems like it has to be, if we want to abstract colour from a sequence of amplitudes, but it also feels meaningfully different.
- Is the punchline to define objects as blankets with much more information inside than may be viewed from far outside?
- The part where all the information that may be lost in the next layer is assumed to have already been lost seems to assume symmetries - are those an explicit part of the project?
- In practice, there seems to be information loss and practical indeterminism at all scales. E.g. when I move further from a picture I keep losing details. Wouldn't it make more sense to talk about how far (in orders of magnitude) information travels, rather than saying that it either does or does not go to infinity?
Sorry about my English, hope it was clear enough
Didn't know about the problem setting. So cool!
Some random thoughts, sorry if none are relevant:
I think my next step towards optimality would have been not to look for an optimal agent but for an optimal act of choosing the agent - as action optimality is better understood than agent optimality. Then I would look at stable mixed equilibria to see if any of them is computable. If any is, I'll be interested in the agent that implements it (i.e. randomises over other agents and then simulates the chosen one).
BTW, now that I think about it, I see that allowing the agent to randomise is probably strongly related to allowing the agent to not be fully transparent about its program, as it may induce uncertainty about which other agent it is going to simulate.
Thanks for the detailed response. I think we agree about most of the things that matter, but about the rest:
About the loss function for next-word prediction - my point was that I'm not sure whether the current GPT is already superhuman even in the predictions that we care about. It may be wrong less often, but in ways that we count as more important. I agree that changing to a better loss would not make it significantly harder to learn, any more than for intelligence etc.
About solving discrete representations with architectural change - I think that I meant only that the representation is easy and not the training, but anyway I agree that training it may be hard or at least require non-standard methods.
About inductive logic and describing pictures in low resolution: I made the same communication mistake in both, which is to consider things that are ridiculously heavily regularized against as not part of the hypothesis space at all. There probably is a logical formula that describes the probability of a given image being a cat, to every degree of precision. I claim that we will never be able to find or represent that formula, because it is so heavily regularized against. And that this is the price that the theory forces us to pay for the generalisation.
Directionally agree, but: A) A short period of trade before we become utterly useless is not much comfort. B) Trade is a particular case of bootstrapping influence on what an agent values into influence on their behaviour. The other major way of doing that is blackmail - which is much more effective in many circumstances, and would have been far more common if the State didn't blackmail us into not blackmailing each other, honouring contracts, etc.
BTW, those two points are basically how many people fear that capitalism (i.e. our trade with superhuman organisations) may go wrong: A) Automation may make us less and less economically useful. B) Enough money may give an organisation the ability to blackmail - a private army, or more likely influence over governmental power.
Assuming that automation here means AI, this is basically hypothesising a phase in which the two kinds of superhuman agents (AIs and companies) are still incentivized to cooperate with each other but not with us.
Some short points:
"human-level question answering is believed to be AI-complete" - I doubt that. I think that we consistently far overestimate the role of language in our cognition, and how much we can actually express using language. The simplest example that come to mind is trying to describe a human face to an AI system with no "visual cortex" in a way that would let it generate a human image (e.g. hex representation of pixels). For that matter, try to describe something less familiar than a human face to a human painter in hope that they can paint it.
"GPT... already is better than humans at next-word prediction" - somewhat besides the point, but I do not think that the loss function that we use in training is actually the one that we care about. We don't care that match about specific phrasing, and use the "loss" of how match the content make sense, is true, is useful... Also, we are probably much better at implicit predictions than in explicit predictions, in ways that make us underperform in many tests.
Language Invites Mind Projection - anecdotally, I keep asking ChatGPT to do things that I know it would suck at, because I just can't bring myself to internalise the existence of something that is so fluent, so knowledgeable and so damn stupid at the same time.
Memorization & generalisation - just noting that it is a spectrum rather than a dichotomy, as compression ratios are. Anyway, the current methods don't seem to generalise well enough to overcome the sparsity of public data in some domains - which may be the main bottleneck in (e.g.) RL anyway.
"This, in turn, suggests a data structure that is discrete and combinatorial, with syntax trees, etc, and neural networks do (according to the argument) not use such representations" - let's spell the obvious objection - it is obviously possible to implement discrete representations over continuous representations. This is why we can have digital computers that are based on electrical currents rather than little rocks. The problem is just that keeping it robustly discrete is hard, and probably very hard to learn. I think that problem may be solved easily with minor changes of architecture though, and therefore should not effect timelines.
Inductive logic programming - generalises well in a much more restricted hypothesis space, as one should expect based on learning theory. The issue is that the real world is too messy for this hypothesis space, which is why it is not ruled by mathematicians/physicists. It may be useful as an augmentation for a deep-learning agent though, the way calculators are useful for humans.
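The sketch I mentioned above, for keeping representations robustly discrete on a continuous substrate: a rounding bottleneck with a straight-through gradient. This is just an illustration of the general possibility, not a claim that this specific change would suffice:

```python
import torch

class DiscreteBottleneck(torch.nn.Module):
    """Round in the forward pass, treat rounding as the identity in the
    backward pass (straight-through estimator), so downstream computation
    only ever sees discrete values while training stays gradient-based."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_discrete = torch.round(x)
        return x + (x_discrete - x).detach()
```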
Maybe we should think explicitly about what work is done by the concept of AGI, but I do not feel like calling GPT an AGI does anything interesting to my world model. Should I expect ChatGPT to beat me at chess? Its next version? If not - is it due to a shortage of data or compute? Will it take over the world? If not - may I conclude that the next AGI wouldn't?
I understand why the bar-shifting thing looks like motivated reasoning, and probably most of it actually is, but it deserves much more credit than you give it. We have an undefined concept of "something with virtually all the cognitive abilities of a human, that can therefore do whatever a human can", and some dubious assumptions like "if it can sensibly talk about everything, it can probably understand everything". Then we encounter ChatGPT, and it is amazing at speaking, except that it gives the strong impression of talking to an NPC. An NPC who knows lots of stuff and can even sort-of-reason in very constrained ways, do basic programming, and be "creative" as in writing poetry - but is sub-human at things like gathering useful information, inferring people's goals, etc. So we conclude that some cognitive ability is still missing, and try to think how to correct for that.
Now, I do not care to call GPT an AGI, but you will have to invent a name for the super-AGI things that we try to achieve next, and know to be possible because humans exist.
Last point: if we change the name of "world model" to "long-term memory", we may notice the possibility that much of what you think of as shard-work may be programs stored in memory, and executed by a general-program-executor or a bunch of modules that specialize in specific sorts of programs, functioning like modern CPUs/interpreters (hopefully stored in an organised way that preserves modularity). What will be in the general memory and what will be in the weights themselves is non-obvious, and we may want to intervene at this point too (not sure in which direction).
(Continuation of the same comment - submitted by mistake and cannot edit...) Assuming modules A and B are already "talking" and module C tries to talk to B, C would probably find it easier to learn a similar protocol than to invent a new one and teach it to B.