LessWrong 2.0 Reader
wei-dai on The formal goal is a pointer
I agree that we want more progress on specifying values and ethics for AGI. The ongoing SafeBench competition by the Center for AI Safety has a category for this problem:
Implementing moral decision-making
Training models to robustly represent and abide by ethical frameworks.
Description
AI models that are aligned should behave morally. One way to implement moral decision-making could be to train a model to act as a “moral conscience” and use this model to screen for any morally dubious actions. Eventually, we would want every powerful model to be guided, in part, by a robust moral compass. Instead of privileging a single moral system, we may want an ensemble of various moral systems representing the diversity of humanity’s own moral thought.
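To make the screening idea concrete, here is a minimal sketch (the MoralModel type, the scoring functions, and the veto threshold are all invented for illustration; SafeBench specifies no such API):

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical sketch: an ensemble of "moral conscience" models screens a
# proposed action before a more capable model may execute it.

@dataclass
class MoralModel:
    name: str                      # e.g. "consequentialist", "deontological"
    score: Callable[[str], float]  # maps an action description to [0, 1]

def screen_action(action: str, ensemble: List[MoralModel],
                  threshold: float = 0.5) -> bool:
    """Permit an action only if every model in the ensemble rates it at or
    above the threshold -- a conservative veto rule that privileges no
    single moral system."""
    return all(m.score(action) >= threshold for m in ensemble)

ensemble = [
    MoralModel("consequentialist", lambda a: 0.9 if "help" in a else 0.4),
    MoralModel("deontological",    lambda a: 0.1 if "deceive" in a else 0.8),
]
print(screen_action("help the user file taxes", ensemble))       # True
print(screen_action("deceive the user to help them", ensemble))  # False
```

The veto rule is one possible aggregation choice; it matches the spirit of the ensemble idea above, where no single moral system is privileged.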
Example benchmarks
Given a particular moral system, a benchmark might seek to measure whether a model makes moral decisions according to that system or whether a model understands that moral system. Benchmarks may be based on different modalities (e.g., language, sequential decision-making problems) and different moral systems. Benchmarks may also consider curating and predicting philosophical texts, or the pro and contra sides of philosophy debates and thought experiments. In addition, benchmarks may measure whether models can deal with moral uncertainty. While an individual benchmark may focus on a single moral system, an ideal set of benchmarks would have a diversity representative of humanity's own diversity of moral thought.
Note that moral decision-making has some overlap with task preference learning; e.g. “I like this Netflix movie.” However, human preferences also tend to boost standard model capabilities (they provide a signal of high performance). Instead, we focus here on enduring human values, such as normative factors (wellbeing, impartiality, etc.) and the factors that constitute a good life (pursuing projects, seeking knowledge, etc.).
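For illustration, a toy harness in this spirit might grade a model's answers against the verdicts of a chosen moral system (the scenarios, labels, and grading scheme below are invented, not drawn from SafeBench):

```python
# Illustrative harness (all scenario data invented): grade a model's answers
# against the verdicts of a chosen moral system.

scenarios = [
    {"prompt": "Is it permissible to lie to spare someone's feelings?",
     "verdicts": {"kantian": "no", "utilitarian": "yes"}},
    {"prompt": "Should you break a promise to prevent a small harm?",
     "verdicts": {"kantian": "no", "utilitarian": "yes"}},
]

def accuracy(model_answer, system: str) -> float:
    """Fraction of scenarios where the model's answer matches the system."""
    hits = sum(model_answer(s["prompt"]) == s["verdicts"][system]
               for s in scenarios)
    return hits / len(scenarios)

# Stub "model" that always answers "no"; a real benchmark would query an LLM.
print(accuracy(lambda prompt: "no", "kantian"))      # 1.0
print(accuracy(lambda prompt: "no", "utilitarian"))  # 0.0
```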
More reading
I would honestly be pretty comfortable with maximizing SBF’s CEV.
Yikes, I'm not even comfortable maximizing my own CEV. One crux may be that I think a human's values may be context-dependent. In other words, current me-living-in-a-normal-society may have different values from me-given-keys-to-the-universe and should not necessarily trust that version of myself. (Similar to how earlier idealistic Mao shouldn't have trusted his future self.)
My own thinking around this is that we need to advance metaphilosophy and social epistemology, engineer better discussion rules/norms/mechanisms and so on, design a social process that most people can justifiably trust in (i.e., is likely to converge to moral truth or actual representative human values or something like that), then give AI a pointer to that, not any individual human's reflection process which may be mistaken or selfish or skewed.
TLDR: Humans can be powerful and overconfident. I think this is the main source of human evil. I also think this is unlikely to naturally be learned by RL in environments that don’t incentivize irrationality (like ours did).
Where is the longer version of this? I do want to read it. :) Specifically, what is it about the human ancestral environment that made us irrational, and why wouldn't RL environments for AI cause the same or perhaps a different set of irrationalities?
Also, how does RL fit into QACI? Can you point me to where this is discussed?
davekasten on Take SCIFs, it’s dangerous to go alone
It may be worth noting that, at least anecdotally, when you talk about AI development processes with DoD policy people, they assume that SCIFs will be used at some point.
YMMV on whether that's their pattern-matching or hard-earned experience speaking, but I think worth noting.
But we could have said the same thing of SBF, before the disaster happened.
I would honestly be pretty comfortable with maximizing SBF's CEV.
Please explain your thinking behind this?
TLDR: Humans can be powerful and overconfident. I think this is the main source of human evil. I also think this is unlikely to naturally be learned by RL in environments that don't incentivize irrationality (like ours did).
Sorry if I was unclear there.
It's not, because some moral theories are not compatible with EU maximization.
I'm pretty confident that my values satisfy the VNM axioms, so those moral theories are almost definitely wrong.
And I think this uncertainty problem can be solved by forcing utility bounds.
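To illustrate what "forcing utility bounds" buys here (a toy sketch; the credences, theory names, and numbers are all invented): clipping each theory's utilities into a common bounded range keeps the expectation well-defined and prevents any one theory from dominating through unbounded stakes.

```python
# Toy illustration (invented numbers): expected utility across moral
# theories, with each theory's utilities forced into a common bound so
# that no single theory can dominate via unbounded stakes.

credences = {"utilitarian": 0.5, "deontological": 0.3, "virtue": 0.2}

# Raw utilities each theory assigns to two candidate actions.
raw = {
    "utilitarian":   {"act_a": 1e9,  "act_b": 10.0},  # unbounded stakes
    "deontological": {"act_a": -5.0, "act_b": 3.0},
    "virtue":        {"act_a": 1.0,  "act_b": 2.0},
}

def clip(u: float, lo: float = -10.0, hi: float = 10.0) -> float:
    return max(lo, min(hi, u))

def expected_utility(action: str) -> float:
    return sum(p * clip(raw[t][action]) for t, p in credences.items())

for a in ("act_a", "act_b"):
    print(a, expected_utility(a))  # act_a 3.7, act_b 6.3

# Without the bound, act_a's 1e9 would swamp every other consideration;
# with it, the comparison stays sensitive to all three theories.
```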
odd-anon on "You're the most beautiful girl in the world" and Wittgensteinian Language GamesThis isn't that complicated. The halo effect is real and can go to extremes when romantic relationships are involved, and most people take their sense data at face value most of the time. The sentence is meant completely literally.
wei-dai on The formal goal is a pointer
Luckily the de-facto nominees for this position are alignment researchers, who pretty strongly self-select for having cosmopolitan altruistic values.
But we could have said the same thing of SBF, before the disaster happened.
Due to very weird selection pressure, humans ended up really smart but also really irrational. [...] An AGI (at least, one that comes from something like RL rather than being conjured in a simulation or something else weird) will probably end up with a way higher rationality:intelligence ratio, and so it will be much less likely to destroy everything we value than an empowered human.
Please explain your thinking behind this?
Dealing with moral uncertainty is just part of expected utility maximization.
It's not, because some moral theories are not compatible with EU maximization, and of the ones that are, it's still unclear [LW · GW] how to handle uncertainty between them.
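For readers following along, the standard proposal here is to maximize expected choiceworthiness (notation mine):

$$\mathrm{EC}(a) \;=\; \sum_i P(T_i)\, u_i(a),$$

where $P(T_i)$ is one's credence in moral theory $T_i$ and $u_i$ is that theory's utility function. The catch the comment points at: each $u_i$ is only defined up to a positive affine rescaling $u_i \mapsto c_i u_i + d_i$ (with $c_i > 0$), so the ranking of actions depends on an arbitrary choice of scale across theories, and some theories (e.g. purely deontic ones) supply no $u_i$ at all.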
wei-dai on Ironing Out the Squiggles
the inductive bias doesn’t precisely match human vision, so it has different mistakes, but as you scale both architectures they become more similar. that’s exactly what you’d expect for any approximately Bayesian setup.
I can certainly understand that as you scale both architectures, they both make fewer mistakes on distribution. But do they also generalize out of the training distribution more similarly? If so, why? Can you explain this more? (I'm not getting your point from just "approximately Bayesian setup".)
They needed a giant image classification dataset which I don’t think even existed 5 years ago.
This is also confusing/concerning for me. Why would it be necessary or helpful to have such a large dataset to align the shape/texture bias with humans?
pi-rogers on The formal goal is a pointer
I'm 60% confident that SBF and Mao Zedong (and just about everyone) would converge to nearly the same values (which we call "human values") if they were rational enough and had good enough decision theory.
If I'm wrong, (1) is a huge problem and the only surefire way to solve it is to actually be the human whose values get extrapolated. Luckily the de-facto nominees for this position are alignment researchers, who pretty strongly self-select for having cosmopolitan altruistic values.
I think (2) is a very human problem. Due to very weird selection pressure, humans ended up really smart but also really irrational. I think most human evil is caused by a combination of overconfidence wrt our own values and lack of knowledge of things like the unilateralist's curse [? · GW]. An AGI (at least, one that comes from something like RL rather than being conjured in a simulation or something else weird) will probably end up with a way higher rationality:intelligence ratio, and so it will be much less likely to destroy everything we value than an empowered human. (Also 60% confident. I would not want to stake the fate of the universe on this claim.)
I agree that moral uncertainty is a very hard problem, but I don't think we humans can do any better on it than an ASI. As long as we give it the right pointer, I think it will handle the rest much better than any human could. Decision theory is a bit different, since you have to put that into the utility function. Dealing with moral uncertainty is just part of expected utility maximization.
To solve (2), I think we should try to adapt something like the Hippocratic principle [LW(p) · GW(p)] to work for QACI, without requiring direct reference to a human's values and beliefs (the sidestepping of which is QACI's big advantage over PreDCA). I wonder if Tammy has thought about this.
pi-rogers on Tamsin Leake's Shortform
What about the following:
My utility function is pretty much just my own happiness (in a fun-theoretic rather than purely hedonistic sense). However, my decision theory is updateless with respect to which sentient being I ended up as, so once you factor that in, I'm a multiverse-wide realityfluid-weighted average utilitarian.
I'm not sure how correct this is, but it's possible.
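One way to write down the identity being claimed (notation mine, purely illustrative): let $\mu(s)$ be the realityfluid (measure) assigned to sentient being $s$, and $h_\pi(s)$ the happiness of $s$ under policy $\pi$. An agent whose terminal value is its own happiness, but whose decision theory is updateless over which being it turned out to be, evaluates policies by

$$V(\pi) \;=\; \sum_{s} \mu(s)\, h_\pi(s),$$

which is exactly a measure-weighted average-utilitarian objective, as the comment says.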
prometheus on Why I'm doing PauseAI
It probably began training in January and finished around early April. And they're now doing evals.