I'm not sure I can come up with a distinguishing principle here, but I feel like some but not all unpleasant emotions feel similar to physical pain, such that I would call them a kind of pain ("emotional pain"), and cringing at a bad joke can be painful in this way.
More reasons: people wear sunglasses when they’re doing fun things outdoors, like going to the beach or vacationing, so they’re associated with that; also, sometimes just hiding part of a picture can cause your brain to fill it in with a more attractive completion than is likely.
This probably does help capitalize AI companies a little bit, since demand for call options will create demand for the underlying. This is probably a relatively small effect (?), but I'm not confident in my ability to estimate this at all.
I'm confused about what you mean & how it relates to what I said.
It's totally wrong that you can't argue against someone who says "I don't know": you argue against them by showing how your model fits the data and how any plausible competing model either doesn't fit or shares the salient features of yours. It's bizarre to describe "I don't know" as "garbage" in general, because it is the correct stance to take when neither your prior nor your evidence sufficiently constrains the distribution of plausibilities. Paul obviously didn't posit an "unobserved kindness force", because he was specifically describing the observation that humans are kind. I think Paul and Nate had a very productive disagreement in that thread and this seems like a wildly reductive mischaracterization of it.
I don’t think this is accurate. I think most philosophy is done under motivated reasoning but is not straightforwardly about signaling group membership.
Hi, any updates on how this worked out? Considering trying this...
This is the most interesting answer I've ever gotten to this line of questioning. I will think it over!
What observation could demonstrate that this code indeed corresponded to the metaphysically important sense of continuity across time? What would the difference be between a world where it did and one where it didn't?
Say there is a soul. We inspect a teleportation process, and we find that, just like your body and brain, the soul disappears on the transmitter pad, and an identical soul appears on the receiver. What would this tell you that you don't already know?
What, in principle, could demonstrate that two souls are in fact the same soul across time?
It is epistemic relativism.
Questions 1 and 3 are explicitly about values, so I don't think they do amount to epistemic relativism.
There seems to be a genuine question about what happens and which rules govern it, and you are trying to sidestep it by saying "whatever happens - happens".
I can imagine a universe with such rules that teleportation kills a person and a universe in which it doesn't. I'd like to know how does our universe work.
There seems to be a genuine question here, but it is not at all clear that there actually is one. It is pretty hard to characterize what this question amounts to, i.e. what the difference would be between two worlds where the question has different answers. I take OP to be espousing the view that the question isn't meaningful for this reason (though I do think they could have laid this out more clearly).
You may find it helpful to read the relevant sections of The Conscious Mind by David Chalmers, the original thorough examination of his view:
Those considerations aside, the main way in which conceivability arguments can go wrong is by subtle conceptual confusion: if we are insufficiently reflective we can overlook an incoherence in a purported possibility, by taking a conceived-of situation and misdescribing it. For example, one might think that one can conceive of a situation in which Fermat's last theorem is false, by imagining a situation in which leading mathematicians declare that they have found a counterexample. But given that the theorem is actually true, this situation is being misdescribed: it is really a scenario in which Fermat's last theorem is true, and in which some mathematicians make a mistake. Importantly, though, this kind of mistake always lies in the a priori domain, as it arises from the incorrect application of the primary intensions of our concepts to a conceived situation. Sufficient reflection will reveal that the concepts are being incorrectly applied, and that the claim of logical possibility is not justified.
So the only route available to an opponent here is to claim that in describing the zombie world as a zombie world, we are misapplying the concepts, and that in fact there is a conceptual contradiction lurking in the description. Perhaps if we thought about it clearly enough we would realize that by imagining a physically identical world we are thereby automatically imagining a world in which there is conscious experience. But then the burden is on the opponent to give us some idea of where the contradiction might lie in the apparently quite coherent description. If no internal incoherence can be revealed, then there is a very strong case that the zombie world is logically possible.
As before, I can detect no internal incoherence; I have a clear picture of what I am conceiving when I conceive of a zombie. Still, some people find conceivability arguments difficult to adjudicate, particularly where strange ideas such as this one are concerned. It is therefore fortunate that every point made using zombies can also be made in other ways, for example by considering epistemology and analysis. To many, arguments of the latter sort (such as arguments 3-5 below) are more straightforward and therefore make a stronger foundation in the argument against logical supervenience. But zombies at least provide a vivid illustration of important issues in the vicinity.
(II.7, "Argument 1: The logical possibility of zombies". Pg. 98).
Iterated Amplification is a fairly specific proposal for indefinitely scalable oversight, which doesn't involve any human in the loop (if you start with a weak aligned AI). Recursive Reward Modeling imagines (as I understand it) a human, assisted by AIs, continuously doing reward modeling; DeepMind's original post about it lists "Iterated Amplification" as a separate research direction.
"Scalable Oversight", as I understand it, refers to the research problem of how to provide a training signal to improve highly capable models. It's the problem which IDA and RRM are both trying to solve. I think your summary of scalable oversight:
(Figuring out how to ease humans supervising models. Hard to cleanly distinguish from ambitious mechanistic interpretability but here we are.)
is inconsistent with how people in the industry use it. I think it's generally meant to refer to the outer alignment problem, providing the right training objective. For example, here's Anthropic's "Measuring Progress on Scalable Oversight for LLMs" from 2022:
To build and deploy powerful AI responsibly, we will need to develop robust techniques for scalable oversight: the ability to provide reliable supervision—in the form of labels, reward signals, or critiques—to models in a way that will remain effective past the point that models start to achieve broadly human-level performance (Amodei et al., 2016).
It references "Concrete Problems in AI Safety" from 2016, which frames the problem in a closely related way, as a kind of "semi-supervised reinforcement learning". In either case, it's clear what we're talking about is providing a good signal to optimize for, not an AI doing mechanistic interpretability on the internals of another model. I thus think it belongs more under the "Control the thing" header.
I think your characterization of "Prosaic Alignment" suffers from related issues. Paul coined the term to refer to alignment techniques for prosaic AI, not techniques which are themselves prosaic. Since prosaic AI is what we're presently worried about, any technique to align DNNs is prosaic AI alignment, by Paul's definition.
My understanding is that AI labs, particularly Anthropic, are interested in moving from human-supervised techniques to AI-supervised techniques, as part of an overall agenda towards indefinitely scalable oversight via AI self-supervision. I don't think Anthropic considers RLAIF an alignment endpoint itself.
I am very surprised that "Iterated Amplification" appears nowhere on this list. Am I missing something?
More generally, I think that if mere-humans met very-alien minds with similarly-coherent preferences, and if the humans had the opportunity to magically fulfill certain alien preferences within some resource-budget, my guess is that the humans would have a pretty hard time offering power and wisdom in the right ways such that this overall went well for the aliens by their own lights (as extrapolated at the beginning), at least without some sort of volition-extrapolation.
Isn't the worst case scenario just leaving the aliens alone? If I'm worried I'm going to fuck up some alien's preferences, I'm just not going to give them any power or wisdom!
I guess you think we're likely to fuck up the alien's preferences by the lights of their reflection process, but not our reflection process. But this just recurs at the meta level. If I really do care about an alien's preferences (as it feels like I do), why can't I also care about their reflection process (which is just a meta preference)?
I feel like the meta level at which I no longer care about doing right by an alien is basically the meta level at which I stop caring about someone doing right by me. In fact, this is exactly how it seems mentally constructed: what I mean by "doing right by [person]" is "what that person would mean by 'doing right by me'". This seems like either something as simple as it naively looks, or sensitive to weird hyperparameters I'm not sure I care about anyway.
I feel like you say this because you expect your values-upon-reflection to be good by the lights of your present values--in which case, you're not so much valuing reflection as just enacting your current values.
If Omega told me that, if I reflected enough, I'd realize what I truly wanted was to club baby seals all day, I would take action to avoid ever reflecting that deeply!
It's not so much that I want to lock in my present values as it is that I don't want to lock in my reflective values. They seem equally arbitrary to me.
This kind of thing seems totally backwards to me. In what sense do I lose if I "bulldoze my values"? It only makes sense to describe me as "having values" insofar as I don't do things like bulldoze them! It seems like a way to pretend existential choices don't exist--just assume you have a deep true utility function, and then do whatever maximizes it.
Why should I care about "teasing out" my deep values? I place no value on my unknown, latent values at present, and I see no reason to think I should!