Comments

Comment by Joey KL (skluug) on Self's Shortform · 2025-02-05T17:06:39.161Z · LW · GW

Cool, thanks!

Comment by Joey KL (skluug) on Self's Shortform · 2025-02-05T00:56:28.436Z · LW · GW

You mean this substance? https://en.wikipedia.org/wiki/Mesembrine

Do you have a recommended brand, or places to read more about it?

Comment by Joey KL (skluug) on What Goes Without Saying · 2025-01-21T15:56:00.833Z · LW · GW

I would love to hear the principal’s take on your conversation.

Comment by Joey KL (skluug) on Drake Thomas's Shortform · 2025-01-12T21:56:16.370Z · LW · GW

Interesting, I can see why that would be a feature. I don't mind the taste at all actually. Before, I had some of their smaller citrus flavored kind, and they dissolved super quick and made me a little nauseous. I can see these ones being better in that respect. 

Comment by Joey KL (skluug) on Drake Thomas's Shortform · 2025-01-12T21:23:04.690Z · LW · GW

I ordered some of the Life Extension lozenges you said you were using; they are very large and take a long time to dissolve. It's not super unpleasant or anything, I'm just wondering if you would count this against them?

Comment by Joey KL (skluug) on Matthew Barnett's Shortform · 2025-01-03T06:05:26.924Z · LW · GW

Thank you for your extended engagement on this! I understand your point of view much better now.

Comment by Joey KL (skluug) on Matthew Barnett's Shortform · 2025-01-03T02:47:09.501Z · LW · GW

Oh, I think I get what you’re asking now. Within-lifetime learning is a process that includes something like a training process for the brain, where we learn to do things that feel good (a kind of training reward). That’s what you’re asking about if I understand correctly?

I would say no, we aren’t schemers relative to this process, because we don’t gain power by succeeding at it. I agree this is a subtle and confusing question, and I don’t know whether Joe Carlsmith would agree, but the subtlety seems to me to belong more to the nuances of the situation & analogy than to any imprecision in the definition.

(Ordinary mental development includes something like a training process, but it also includes other stuff more analogous to building out a blueprint, so I wouldn’t overall consider it a kind of training process.)

Comment by Joey KL (skluug) on Matthew Barnett's Shortform · 2025-01-03T00:53:02.751Z · LW · GW

If you're talking about this report, it looks to me like it does contain a clear definition of "schemer" in section 1.1.3, pg. 25: 

It’s easy to see why terminally valuing reward-on-the-episode would lead to training-gaming (since training-gaming just is: optimizing for reward-on-the-episode). But what about instrumental training-gaming? Why would reward-on-the-episode be a good instrumental goal?

In principle, this could happen in various ways. Maybe, for example, the AI wants the humans who designed it to get raises, and it knows that getting high reward on the episode will cause this, so it training-games for this reason.

The most common story, though, is that getting reward-on-the-episode is a good instrumental strategy for getting power—either for the AI itself, or for some other AIs (and power is useful for a very wide variety of goals). I’ll call AIs that are training-gaming for this reason “power-motivated instrumental training-gamers,” or “schemers” for short.

By this definition, a human would be considered a schemer if they gamed something analogous to a training process in order to gain power. For example, if a company tries to instill loyalty in its employees, an employee who professes loyalty insincerely as a means to a promotion would be considered a schemer (as I understand it). 

Comment by Joey KL (skluug) on Matthew Barnett's Shortform · 2024-12-31T23:47:59.165Z · LW · GW

I think this post would be a lot stronger with concrete examples of these terms being applied in problematic ways. A term being vague is only a problem if it creates some kind of miscommunication, confused conceptualization, or opportunity for strategic ambiguity. I'm willing to believe these terms could pose these problems in certain contexts, but this is hard to evaluate in the abstract without concrete cases where they posed a problem.

Comment by Joey KL (skluug) on Which things were you surprised to learn are not metaphors? · 2024-11-22T01:52:03.436Z · LW · GW

I'm not sure I can come up with a distinguishing principle here, but I feel like some but not all unpleasant emotions feel similar to physical pain, such that I would call them a kind of pain ("emotional pain"), and cringing at a bad joke can be painful in this way.

Comment by Joey KL (skluug) on Alexander Gietelink Oldenziel's Shortform · 2024-10-21T05:02:35.916Z · LW · GW

More reasons: people wear sunglasses when they’re doing fun things outdoors, like going to the beach or vacationing, so they’re associated with that; also, sometimes just hiding part of a picture can cause your brain to fill it in with a more attractive completion than is likely.

Comment by Joey KL (skluug) on Wei Dai's Shortform · 2024-09-25T17:08:28.900Z · LW · GW

This probably does help capitalize AI companies a little bit, since demand for call options creates demand for the underlying. It's probably a relatively small effect (?), but I'm not confident in my ability to estimate it at all.
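As a rough illustration of that channel (a toy sketch with hypothetical numbers; the contract count, the delta value, and the helper name are assumptions, not anything from the comment), market makers who sell calls typically delta-hedge by buying shares of the underlying, so option demand passes through to stock demand roughly in proportion to the option's delta:

```python
# Toy delta-hedging pass-through: illustrative numbers only, not an estimate
# of the actual effect discussed above.

def hedge_shares(contracts: int, delta: float, shares_per_contract: int = 100) -> float:
    """Approximate shares a delta-hedging market maker buys after selling call contracts."""
    return contracts * shares_per_contract * delta

# Hypothetical: 10,000 call contracts sold at a delta of ~0.4
print(hedge_shares(10_000, 0.4))  # 400000.0 shares of buying pressure
```

How large this is relative to ordinary trading volume is exactly the estimation question the comment flags as uncertain.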

Comment by Joey KL (skluug) on The Sun is big, but superintelligences will not spare Earth a little sunlight · 2024-09-25T16:59:44.611Z · LW · GW

I'm confused about what you mean & how it relates to what I said.

Comment by Joey KL (skluug) on The Sun is big, but superintelligences will not spare Earth a little sunlight · 2024-09-24T02:16:54.719Z · LW · GW

It's totally wrong that you can't argue against someone who says "I don't know"; you argue against them by showing how your model fits the data and how any plausible competing model either doesn't fit or shares the salient features of yours. It's bizarre to describe "I don't know" as "garbage" in general, because it is the correct stance to take when neither your prior nor your evidence sufficiently constrains the distribution of plausibilities. Paul obviously didn't posit an "unobserved kindness force", because he was specifically describing the observation that humans are kind. I think Paul and Nate had a very productive disagreement in that thread, and this seems like a wildly reductive mischaracterization of it.

Comment by Joey KL (skluug) on Wei Dai's Shortform · 2024-08-25T19:48:20.050Z · LW · GW

I don’t think this is accurate; I think most philosophy is done under motivated reasoning, but it is not straightforwardly about signaling group membership.

Comment by Joey KL (skluug) on Sunlight is yellow parallel rays plus blue isotropic light · 2024-08-07T17:57:55.381Z · LW · GW

Hi, any updates on how this worked out? Considering trying this...

Comment by Joey KL (skluug) on Relativity Theory for What the Future 'You' Is and Isn't · 2024-08-02T03:07:45.377Z · LW · GW

This is the most interesting answer I've ever gotten to this line of questioning. I will think it over!

Comment by Joey KL (skluug) on Relativity Theory for What the Future 'You' Is and Isn't · 2024-07-31T20:09:53.725Z · LW · GW

What observation could demonstrate that this code indeed corresponded to the metaphysically important sense of continuity across time? What would the difference be between a world where it did and a world where it didn't?

Comment by Joey KL (skluug) on Relativity Theory for What the Future 'You' Is and Isn't · 2024-07-30T16:04:20.222Z · LW · GW

Say there is a soul. We inspect a teleportation process, and we find that, just like your body and brain, the soul disappears on the transmitter pad, and an identical soul appears on the receiver. What would this tell you that you don't already know?

What, in principle, could demonstrate that two souls are in fact the same soul across time?

Comment by Joey KL (skluug) on Relativity Theory for What the Future 'You' Is and Isn't · 2024-07-29T15:27:30.457Z · LW · GW

It is epistemic relativism.

Questions 1 and 3 are explicitly about values, so I don't think they do amount to epistemic relativism.

There seems to be a genuine question about what happens and which rules govern it, and you are trying to sidestep it by saying "whatever happens - happens".

I can imagine a universe with such rules that teleportation kills a person and a universe in which it doesn't. I'd like to know how does our universe work.

It seems like there is a genuine question here, but it is not at all clear that there actually is one. It is pretty hard to characterize what this question amounts to, i.e. what the difference would be between two worlds where the question has different answers. I take OP to be espousing the view that the question isn't meaningful for this reason (though I do think they could have laid this out more clearly).

Comment by Joey KL (skluug) on Linch's Shortform · 2024-07-23T22:03:04.978Z · LW · GW

You may find it helpful to read the relevant sections of The Conscious Mind by David Chalmers, the original thorough examination of his view:

Those considerations aside, the main way in which conceivability arguments can go wrong is by subtle conceptual confusion: if we are insufficiently reflective we can overlook an incoherence in a purported possibility, by taking a conceived-of situation and misdescribing it. For example, one might think that one can conceive of a situation in which Fermat's last theorem is false, by imagining a situation in which leading mathematicians declare that they have found a counterexample. But given that the theorem is actually true, this situation is being misdescribed: it is really a scenario in which Fermat's last theorem is true, and in which some mathematicians make a mistake. Importantly, though, this kind of mistake always lies in the a priori domain, as it arises from the incorrect application of the primary intensions of our concepts to a conceived situation. Sufficient reflection will reveal that the concepts are being incorrectly applied, and that the claim of logical possibility is not justified.

So the only route available to an opponent here is to claim that in describing the zombie world as a zombie world, we are misapplying the concepts, and that in fact there is a conceptual contradiction lurking in the description. Perhaps if we thought about it clearly enough we would realize that by imagining a physically identical world we are thereby automatically imagining a world in which there is conscious experience. But then the burden is on the opponent to give us some idea of where the contradiction might lie in the apparently quite coherent description. If no internal incoherence can be revealed, then there is a very strong case that the zombie world is logically possible.

As before, I can detect no internal incoherence; I have a clear picture of what I am conceiving when I conceive of a zombie. Still, some people find conceivability arguments difficult to adjudicate, particularly where strange ideas such as this one are concerned. It is therefore fortunate that every point made using zombies can also be made in other ways, for example by considering epistemology and analysis. To many, arguments of the latter sort (such as arguments 3-5 below) are more straightforward and therefore make a stronger foundation in the argument against logical supervenience. But zombies at least provide a vivid illustration of important issues in the vicinity.

(II.7, "Argument 1: The logical possibility of zombies". Pg. 98).

Comment by Joey KL (skluug) on Shallow review of live agendas in alignment & safety · 2023-12-03T00:54:04.905Z · LW · GW

Iterated Amplification is a fairly specific proposal for indefinitely scalable oversight, one that doesn't involve any human in the loop (if you start with a weak aligned AI). Recursive Reward Modeling, as I understand it, imagines a human assisted by AIs continuously doing reward modeling; DeepMind's original post about it lists "Iterated Amplification" as a separate research direction.

"Scalable Oversight", as I understand it, refers to the research problem of how to provide a training signal to improve highly capable models. It's the problem which IDA and RRM are both trying to solve. I think your summary of scalable oversight: 

(Figuring out how to ease humans supervising models. Hard to cleanly distinguish from ambitious mechanistic interpretability but here we are.)

is inconsistent with how people in the industry use it. I think it's generally meant to refer to the outer alignment problem, providing the right training objective. For example, here's Anthropic's "Measuring Progress on Scalable Oversight for LLMs" from 2022:

To build and deploy powerful AI responsibly, we will need to develop robust techniques for scalable oversight: the ability to provide reliable supervision—in the form of labels, reward signals, or critiques—to models in a way that will remain effective past the point that models start to achieve broadly human-level performance (Amodei et al., 2016).

It references "Concrete Problems in AI Safety" from 2016, which frames the problem in a closely related way, as a kind of "semi-supervised reinforcement learning". In either case, it's clear that what we're talking about is providing a good signal to optimize for, not an AI doing mechanistic interpretability on the internals of another model. I thus think it belongs more under the "Control the thing" header.

I think your characterization of "Prosaic Alignment" suffers from related issues. Paul coined the term to refer to alignment techniques for prosaic AI, not techniques which are themselves prosaic. Since prosaic AI is what we're presently worried about, any technique to align DNNs is prosaic AI alignment, by Paul's definition.

My understanding is that AI labs, particularly Anthropic, are interested in moving from human-supervised techniques to AI-supervised techniques, as part of an overall agenda towards indefinitely scalable oversight via AI self-supervision.  I don't think Anthropic considers RLAIF an alignment endpoint itself. 

Comment by Joey KL (skluug) on Shallow review of live agendas in alignment & safety · 2023-12-01T22:17:03.911Z · LW · GW

I am very surprised that "Iterated Amplification" appears nowhere on this list. Am I missing something?

Comment by Joey KL (skluug) on Cosmopolitan values don't come free · 2023-06-02T20:12:45.505Z · LW · GW

More generally, I think that if mere-humans met very-alien minds with similarly-coherent preferences, and if the humans had the opportunity to magically fulfill certain alien preferences within some resource-budget, my guess is that the humans would have a pretty hard time offering power and wisdom in the right ways such that this overall went well for the aliens by their own lights (as extrapolated at the beginning), at least without some sort of volition-extrapolation.

Isn't the worst case scenario just leaving the aliens alone? If I'm worried I'm going to fuck up some alien's preferences, I'm just not going to give them any power or wisdom!

I guess you think we're likely to fuck up the alien's preferences by light of their reflection process, but not our reflection process. But this just recurs to the meta level. If I really do care about an alien's preferences (as it feels like I do), why can't I also care about their reflection process (which is just a meta preference)?

I feel like the meta level at which I no longer care about doing right by an alien is basically the meta level at which I stop caring about someone doing right by me. In fact, this is exactly how it seems mentally constructed: what I mean by "doing right by [person]" is "what that person would mean by 'doing right by me'". This seems to be either as simple as it naively looks, or sensitive to weird hyperparameters I'm not sure I care about anyway.

Comment by Joey KL (skluug) on Your Utility Function is Your Utility Function · 2022-05-06T22:54:15.266Z · LW · GW

I feel like you say this because you expect your values-upon-reflection to be good by light of your present values--in which case, you're not so much valuing reflection, as just enacting your current values. 

If Omega told me that, if I reflected enough, I'd realize what I truly wanted was to club baby seals all day, I would take action to avoid ever reflecting that deeply!

It's not so much that I want to lock in my present values as it is I don't want to lock in my reflective values. They seem equally arbitrary to me.

Comment by Joey KL (skluug) on Your Utility Function is Your Utility Function · 2022-05-06T21:40:53.307Z · LW · GW

This kind of thing seems totally backwards to me. In what sense do I lose if I "bulldoze my values"? It only makes sense to describe me as "having values" insofar as I don't do things like bulldoze them! It seems like a way to pretend existential choices don't exist--just assume you have a deep true utility function, and then do whatever maximizes it.

Why should I care about "teasing out" my deep values? I place no value on my unknown, latent values at present, and I see no reason to think I should!