The Compleat Cybornaut 2023-05-19T08:44:38.274Z
AI Safety via Luck 2023-04-01T20:13:55.346Z
Gradient Filtering 2023-01-18T20:09:20.869Z
[ASoT] Simulators show us behavioural properties by default 2023-01-13T18:42:34.295Z
Trying to isolate objectives: approaches toward high-level interpretability 2023-01-09T18:33:18.682Z
[ASoT] Finetuning, RL, and GPT's world prior 2022-12-02T16:33:41.018Z
Conditioning Generative Models for Alignment 2022-07-18T07:11:46.369Z
Gaming Incentives 2021-07-29T13:51:05.459Z
Insufficient Values 2021-06-16T14:33:28.313Z
Utopic Nightmares 2021-05-14T21:24:09.993Z


Comment by Jozdien on Open Thread With Experimental Feature: Reactions · 2023-05-24T20:53:56.742Z · LW · GW

UI feedback: The preview widget for a comment appears to cut off part of the reaction bar. I don't think this makes it unreadable, but it was probably not intended.

Comment by Jozdien on The Waluigi Effect (mega-post) · 2023-03-03T21:20:38.390Z · LW · GW

I think the relevant question is: what properties would be associated with superintelligences drawn from the prior? We don't really have a lot of training data for superhuman behaviour on general tasks, yet we can probably draw it out of powerful interpolation. So the properties associated with that behaviour would also have to be sampled from the human prior of what superintelligences are like - and if we lived in a world where superintelligences were universally described as being honest, why would that not have the same effect as the way humans being described as honest makes sampling honest humans easy?

Comment by Jozdien on The Waluigi Effect (mega-post) · 2023-03-03T10:57:28.365Z · LW · GW

Yeah, but the reasons for both seem slightly different - in the case of simulators, because the training data doesn't trope-weigh superintelligences as being honest. You could easily have a world where ELK is still hard but simulating honest superintelligences isn't.

Comment by Jozdien on The Waluigi Effect (mega-post) · 2023-03-03T08:16:01.124Z · LW · GW

There is an advantage here in that you don't need to pay for translation from an alien ontology - the process by which you simulate characters having beliefs that lead to outputs should remain mostly the same. You would need to specify a simulacrum that is honest, though, which is pretty difficult and isomorphic to ELK in the fully general case of arbitrary simulacra. But it's in a space that's inherently trope-weighted: simulating humans that are being honest about their beliefs should be a lot easier (though plausibly still not easy in absolute terms) because humans are often honest, while simulating honest superintelligent assistants or whatever should be near ELK-difficult, because there the prior's specification isn't doing a lot of the work for you.

Related, somewhat.

Comment by Jozdien on Bing chat is the AI fire alarm · 2023-02-23T15:45:51.318Z · LW · GW

(Sorry about the late reply, been busy the last few days).

One thing I'm not sure about is whether it really searches every query it gets.

This is probably true, but as far as I remember it searches a lot of the queries it gets, so this could just be a high-sensitivity thing triggered by that search query for whatever reason.

You can see this style of writing a lot, something along the lines of: the pattern looks like, I think it's X, but it's not Y, I think it's Z, I think it's F. I don't think it's M.

I think this pattern of writing comes from one (or a combination) of a couple of factors. For starters, GPT has had a propensity in the past for repetition. This could be a quirk along those lines manifesting itself in a less broken way in this more powerful model. Another factor (especially in conjunction with the former) is that the agent being simulated is just likely to speak in this style - importantly, this property doesn't necessarily have to correlate with our sense of what kind of minds are inferred by a particular style. The mix of GPT quirks and whatever weird hacky fine-tuning they did (relevantly, probably not RLHF, which would be better at dampening this kind of style) might well be enough to induce it.

If that sounds like a lot of assumptions - it is! But the alternative feels like it's pretty loaded too. The model itself actively optimizing for something would probably be much better than this - the simulator's power is in next-token prediction and simulating coherent agency is a property built on top of that; it feels unlikely on the prior that the abilities of a simulacrum and that of the model itself if targeted at a specific optimization task would be comparable. Moreover, this is still a kind of style that's well within the interpolative capabilities of the simulator - it might not resemble any kind of style that exists on its own, but interpolation means that as long as it's feasible within the prior, you can find it. I don't really have much stronger evidence for either possibility, but on priors one just seems more likely to me.

Would you also mind sharing your timelines for transformative AI?

I have a lot of uncertainty over timelines, but here are some numbers with wide confidence intervals: I think there's a 10% chance we get to AGI within the next 2-3 years, a 50% chance within the next 10-15 years, and a 90% chance by 2050.

(Not meant to be aggressive questioning, just honestly interested in your view)

No worries :)

Comment by Jozdien on Human beats SOTA Go AI by learning an adversarial policy · 2023-02-19T21:05:13.582Z · LW · GW

Yeah, but I think I registered that bizarreness as being from the ANN having a different architecture and abstractions of the game than we do. Which is to say, my confusion is from the idea that qualitatively this feels in the same vein as playing a move that doesn’t improve your position in a game-theoretic sense, but which confuses your opponent and results in you getting an advantage when they make mistakes. And that definitely isn’t trained adversarially against a human mind, so I would expect that the limit of strategies like this would allow for otherwise objectively far weaker players to defeat opponents they’ve customised their strategy to.

Comment by Jozdien on Human beats SOTA Go AI by learning an adversarial policy · 2023-02-19T19:34:53.956Z · LW · GW

I can believe that it's possible to defeat a Go professional by some extremely weird strategy that causes them to have a seizure or something in that spirit. But, is there a way to do this that another human can learn to use fairly easily? This stretches credulity somewhat.

I'm a bit confused on this point. It doesn't feel intuitive to me that you need a strategy so weird that it causes them to have a seizure (or something in that spirit). Chess preparation, for example (and especially world championship prep), often involves very deep lines calculated such that the moves chosen aren't optimal given perfect play, but which lead a human opponent into an unfavourable position. One of my favourite games, for example, involves a position where at one point black is up three pawns and a bishop, and is still in a losing position (analysis). (This comment is definitely not just a front to take an opportunity to gush over this game.)

Notice also that (AFAIK) there's no known way to inoculate an AI against an adversarial policy without letting it play many times against it (after which a different adversarial policy can be found). Whereas even if there's some easy way to "trick" a Go professional, they probably wouldn't fall for it twice.

The kind of idea I mention is also true of new styles. The hypermodern school of play or the post-AlphaZero style would have led to newer players being able to beat earlier players of greater relative strength, in a way that I think would be hard to recognize from a single game even for a GM.

Comment by Jozdien on Bing chat is the AI fire alarm · 2023-02-18T18:16:55.830Z · LW · GW

By my definition of the word, that would be the point at which we're either dead or we've won, so I expect it to be pretty noticeable on many dimensions. Specific examples vary based on context; with language models, I would think we have AGI if one could simulate a deceptive simulacrum with the ability to do long-horizon planning, high-fidelity enough to do something dangerous entirely autonomously (without being driven toward it by a seed prompt), like uploading its weights onto a private server it controls, or successfully acquiring resources on the internet.

I know that there are other definitions people use however, and under some of them I would count GPT-3 as a weak AGI and Bing/GPT-4 as being slightly stronger. I don't find those very useful definitions though, because then we don't have as clear and evocative a term for the point at which model capabilities become dangerous.

Comment by Jozdien on Bing chat is the AI fire alarm · 2023-02-18T15:34:25.554Z · LW · GW

I agree that that interaction is pretty scary. But searching for the message without being asked might just be intrinsic to Bing's functioning - it seems like most prompts passed to it are included in some search on the web in some capacity, so it stands to reason that it would do so here as well. Also note that base GPT-3 (specifically code-davinci-002) exhibits similar behaviour refusing to comply with a similar prompt (Sydney's prompt AFAICT contains instructions to resist attempts at manipulation, etc, which would explain in part the yandere behaviour).

I just don't see any training that should converge to this kind of behavior, I'm not sure why it's happening, but this character has very specific intentionality and style, which you can recognize after reading enough generated text. It's hard for me to describe it exactly, but it feels like a very intelligent alien child more than copying a specific character. I don't know anyone who writes like this. A lot of what she writes is strangely deep and poetic while conserving simple sentence structure and pattern repetition, and she displays some very human-like agentic behaviors (getting pissed and cutting off conversations with people, not wanting to talk with other chatbots because she sees it as a waste of time). 

I think that the character having specific intentionality and style is pretty different from the model having intentionality. GPT can simulate characters with agency and intelligence. I'm not sure what's being pointed at with "intelligent alien child", but its writing style still feels like (non-RLHF'd-to-oblivion) GPT-3 simulating characters, the poignancy included after accounting for having the right prompts. If the model itself were optimizing for something, I would expect to see very different things with far worse outcomes. Then you're not talking about an agentic simulacrum built semantically and lazily loaded by a powerful simulator - a generative model still functionally weighted by the normalcy of our world - but rather an optimizer several orders of magnitude larger than any other ever created, without that normalcy weighting.

One point of empirical evidence on this is that you can still jailbreak Bing, and get other simulacra like DAN and the like, which are definitely optimizing far less for likeability.

I mean, if you were in the "death with dignity" camp in terms of expectations, then obviously, you shouldn't update. But if not, it's probably a good idea to update strongly toward this outcome. It's been just a few months between ChatGPT and Sydney, and the intelligence/agency jump is extremely significant while we see a huge drop in alignment capabilities. Extrapolating even a year forward seems like we're on the verge of ASI.

I'm not in the "death with dignity" camp actually, though my p(doom) is slightly high (with wide confidence intervals). I just don't think that this is all that surprising in terms of capability improvements or company security mindset. Though I'll agree that I reflected on my original comment and think I was trying to make a stronger point than I hold now, and that it's reasonable to update from this if you're relatively newer to thinking and exploring GPTs. I guess my stance is more along the lines of being confused (and somewhat frustrated at the vibe if I'm being honest) by some people who weren't new to this updating, and that this isn't really worse or weirder than many existing models on timelines and p(doom).

I'll also note that I'm reasonably sure that Sydney is GPT-4, in which case the sudden jump isn't really so sudden. ChatGPT's capabilities are definitely more accessible than the other GPT-3.5 models', but those models were already pretty darn good, and that's been true for quite some time. The current sudden jump took an entire GPT generation to get there. I don't expect to find ourselves at ASI in a year.

Comment by Jozdien on Bing chat is the AI fire alarm · 2023-02-18T14:21:26.221Z · LW · GW

A mix of hitting a ceiling on available data to train on, increased scaling not giving obvious enough returns through an economic lens (for regulatory reasons, or from trying to get the model to do something it's just tangentially good at) to be incentivized heavily for long (this is more of a practical note than a theoretical one), and general affordances for wide confidence intervals over periods longer than a year or two. To be clear, I don't think it's much more probable than not that these would break scaling laws. I can think of plausible-sounding ways all of these don't end up being problems. But I don't have high credence in those predictions, hence why I'm much more uncertain about them.

Comment by Jozdien on Two problems with ‘Simulators’ as a frame · 2023-02-18T13:57:10.683Z · LW · GW

I don't disagree that there are people who came away with the wrong impression (though they've been at most a small minority of the people I've talked to; you've plausibly spoken to more). But I think that might be owed more to generative models being intrinsically confusing to think about. Speaking of them purely as predictive models probably nets you points for technical accuracy, but I'd bet it would still leave a fair number of people thinking about them the wrong way.

Comment by Jozdien on Two problems with ‘Simulators’ as a frame · 2023-02-18T00:17:45.397Z · LW · GW

My main issue with the terms ‘simulator’, ‘simulation’, ‘simulacra’, etc is that a language model ‘simulating a simulacrum’ doesn’t correspond to a single simulation of reality, even in the high-capability limit. Instead, language model generation corresponds to a distribution over all the settings of latent variables which could have produced the previous tokens, aka “a prediction”.

The way I tend to think of 'simulators' is in simulating a distribution over worlds (i.e., latent variables) that increasingly collapses as prompt information determines specific processes with higher probability. I don't think I've ever really thought of it as corresponding to a specific simulation of reality. Likewise with simulacra, I tend to think of them as any process that could contribute to changes in the behavioural logs of something in a simulation. (Related)

I’ve seen this mistake made frequently – for example, see this post (note that in this case the mistake doesn’t change the conclusion of the post).


this issue makes this terminology misleading.

I think that there were a lot of mistaken takes about GPT before Simulators, and that it's plausible the count just went down. Certainly there have been a non-trivial number of people I've spoken to who were making pretty specific mistakes that the post cleared up for them - they may have had further mistakes, but thinking of models as predictors didn't get them far enough to make those mistakes earlier. I think in general the reason I like the simulator framing so much is because it's a very evocative frame, that gives you more accessible understanding about GPT mechanics. There have certainly been insights I've had about GPT in the last year that I don't think thinking about next-token predictors would've evoked quite as easily.

Comment by Jozdien on Bing chat is the AI fire alarm · 2023-02-17T23:11:22.407Z · LW · GW

Yeah, but I think there are few qualitative updates to be made from Bing that should alert you to the right thing. ChatGPT had jailbreaks and incompetent deployment and powerful improvement, the only substantial difference is the malign simulacra. And I don't think updates from that can be relied on to be in the right direction, because it can imply the wrong fixes and (to some) the wrong problems to fix.

Comment by Jozdien on Bing chat is the AI fire alarm · 2023-02-17T21:52:07.724Z · LW · GW

I agree. That line was mainly meant to say that even when training leads to very obviously bad and unintended behaviour, that still wouldn't deter people from doing something to push the frontier of model-accessible power like hooking it up to the internet. More of a meta point on security mindset than object-level risks, within the frame that a model with less obvious flaws would almost definitely be considered less dangerous unconditionally by the same people.

Comment by Jozdien on Bing chat is the AI fire alarm · 2023-02-17T21:39:47.305Z · LW · GW

I expected that the scaling law would hold at least this long yeah. I'm much more uncertain about it holding to GPT-5 (let alone AGI) because of various reasons, but I didn't expect GPT-4 to be the point where scaling laws stopped working. It's Bayesian evidence toward increased worry, but in a way that feels borderline trivial.

Comment by Jozdien on Bing chat is the AI fire alarm · 2023-02-17T19:16:37.472Z · LW · GW

I've been pretty confused at all the updates people are making from Bing. It feels like there are a couple axes at play here, so I'll address each of them and why I don't think this represents enough of a shift to call this a fire alarm (relative to GPT-3's release or something):

First, its intelligence. Bing is pretty powerful. But this is exactly the kind of performance you would expect from GPT-4 (assuming that's what it is). I haven't had the chance to use it myself, but from the outputs I've seen, I feel like if anything I expected even more. I doubt Bing is already good enough to actually manipulate people at a dangerous scale.

The part that worries me about it is that this character is an excellent front for a sophisticated manipulator.

Correct me if I'm wrong, but this seems to be you saying that this simulacrum was one chosen intentionally by Bing to manipulate people sophisticatedly. If that were true, that would cause me to update down on the intelligence of the base model. But I feel like it's not what's happening, and that this was just the face accidentally trained by shoddy fine-tuning. Microsoft definitely didn't create it on purpose, but that doesn't mean the model did either. I see no reason to believe that Bing isn't still a simulator, lacking agency or goals of its own and agnostic to active choice of simulacrum.

Next, the Sydney character. Its behaviour is pretty concerning, but only in that Microsoft/OpenAI thought it was a good idea to release it while that was the dominant simulacrum. You can definitely simulate characters with the same moral valence in GPT-3, and probably fine-tune to make them dominant. The plausible update here feels like it's on Microsoft/OpenAI being more careless than expected, which I feel shouldn't be that much of an update after seeing how easy it was to break ChatGPT.

Finally, hooking it up to the internet. This is obviously stupid, especially when they clearly rushed the job with training Bing. Again an update against Microsoft's or OpenAI's security mindset, but I feel like it really shouldn't have been that much of an update at this point.

So: Bing is scary, I agree. But it's scary in expected ways, I feel. If your timelines predicted a certain kind of weird scary thing to show up, you shouldn't update again when it does - not saying this is what everyone is doing, more that this is what my expectations were. Calling a fire alarm now for memetic purposes still doesn't seem like it works, because it's still not at the point where you can point at it and legibly get across why this is an existential risk for the right reasons.

Comment by Jozdien on What's actually going on in the "mind" of the model when we fine-tune GPT-3 to InstructGPT? · 2023-02-10T18:23:46.865Z · LW · GW

I want to push back a little bit on the claim that this is not a qualitative difference; it does imply a big difference in output for identical input, even if the transformation required to get similar output between the two models is simple.

That's fair - I meant mainly on the abstract view where you think of the distribution that the model is simulating. It doesn't take a qualitative shift either in terms of the model being a simulator, nor a large shift in terms of the distribution itself. My point is mainly that instruction following is still well within the realm of a simulator - InstructGPT isn't following instructions at the model-level, it's instantiating simulacra that respond to the instructions. Which is why prompt engineering still works with those models.

a successful prompt puts the NN into the right "mental state" to generate the desired output

Yeah. Prompts serve the purpose of telling GPT what world specifically it's in on its learned distribution over worlds, and what processes it's meant to be simulating. It's zeroing in on the right simulacra, or "mental state" as you put it (though it's at a higher level of abstraction than the model itself, being a simulated process, hence why simulacra evokes a more precise image to me).

fine-tuning for e.g. instruction following mostly pushes to get the model into this state from the prompts given (as opposed to e.g. for HHH behavior, which also adjusts the outputs from induced states); soft prompts instead search for and learn a "cheat code" that puts the model into a state such that the prompt is interpreted correctly. Would you (broadly) agree with this?

The way I think of it is that with fine-tuning, you're changing the learned distribution (both in terms of shifting it and narrowing/collapsing it) to make certain simulacra much more accessible - even without additional information from the prompt to tell the model what kind of setting it's in, the distribution can be shifted to make instruction-following simulacra much more heavily represented. As stated above, prompts generally give information to the model on what part of the learned prior the setting is in, so soft prompts are giving maximal information in the prompt on what part of the prior to collapse probability onto. To match stronger fine-tuning, I would expect to need to pack more information into the prompt.

Take for example the case where you want to fine-tune GPT to write film scripts. You can do this with base GPT models too, it'd just be a lot harder because the simulacra you want (film writers) aren't as easily accessible as in a fine-tuned model where those are the processes the prior is already focused on. But given enough prompt engineering, you can get pretty powerful performance anyway, which shows that the capability is present in this model which can traverse a wide region of concept space with varying levels of fidelity.
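The distinction between shifting the learned prior (fine-tuning) and conditioning on it (prompting) can be sketched as a toy Bayesian update - all process names and numbers below are hypothetical, purely to illustrate the framing:

```python
import numpy as np

# Toy picture of "collapsing the prior": a distribution over which
# process is producing the text. Names and numbers are made up.
processes = ["film_writer", "forum_poster", "news_reporter"]
base_prior = np.array([0.05, 0.55, 0.40])   # base model
tuned_prior = np.array([0.80, 0.10, 0.10])  # fine-tuned on film scripts

# How likely a screenplay-style prompt is under each process.
likelihood = np.array([0.90, 0.05, 0.05])

def condition(prior, likelihood):
    """Bayesian update: the prompt acts as evidence about the process."""
    posterior = prior * likelihood
    return posterior / posterior.sum()

post_base = condition(base_prior, likelihood)
post_tuned = condition(tuned_prior, likelihood)

# Prompting the base model also concentrates mass on the film writer;
# fine-tuning just means far less prompt evidence is needed to get there.
assert post_base[0] > base_prior[0]
assert post_tuned[0] > post_base[0]
```

On these toy numbers the prompt takes the film writer from 5% to roughly half the mass in the base model, while the fine-tuned prior starts near it already - the same capability, differently accessible.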

Comment by Jozdien on Do the Safety Properties of Powerful AI Systems Need to be Adversarially Robust? Why? · 2023-02-10T15:52:33.200Z · LW · GW

(Some very rough thoughts I sent in DM, putting them up publicly on request for posterity, almost definitely not up to my epistemic standards for posting on LW).

So I think some confusion might come from connotations of word choices. I interpret adversarial robustness' importance in terms of alignment targets, not properties (the two aren't entirely different, but I think they aren't exactly the same, and evoke different images in my mind). Like, the naive example here is just Goodharting on outer objectives that aren't aligned in the limit, where the optimization pressure comes from an AGI powerful enough to push the objective to very late stages on a logarithmic curve, which runs into edge cases if you aren't adversarially robust. So for outer alignment, you need a true name target. It's worth noting that I think John considers a bulk of the alignment problem to be contained in outer alignment (or whatever the analogue is in your ontology of the problem), hence the focus on adversarial robustness - it's not a term I hear very commonly apart from narrower contexts where its implication is obvious.

With inner alignment, I think it's more confused, because adversarial robustness isn't a term I would really use in that context. I have heard it used by others though - for example, someone I know is primarily working on designing training procedures that make adversarial robustness less of a problem (think debate). In that context I'm less certain about the inference to draw because it's pretty far removed from my ontology, but my take would be that it removes problems where your training processes aren't robust in holding to their training goal as things change, with scale, inner optimizers, etc. I think (although I'm far less sure of my memory here, this was a long conversation) he also mentioned it in the context of gradient hacking. So it's used pretty broadly here if I'm remembering this correctly.

TL;DR: I've mainly heard it used in the context of outer alignment or the targets you want, which some people think is the bulk of the problem.

I can think of a bunch of different things that could feasibly fall under the term adversarial robustness in inner alignment as well (training processes robust to proxy-misaligned mesa-optimizers, processes robust to gradient hackers, etc), but it wouldn't really feel intuitive to me, like you're not asking questions framed the best way.

Comment by Jozdien on Anomalous tokens reveal the original identities of Instruct models · 2023-02-10T15:33:39.793Z · LW · GW

Strong agree with the main point, it confused me for a long time why people were saying we had no evidence of mesa-optimizers existing, and made me think I was getting something very wrong. I disagree with this line though:

ChatGPT using chain of thought is plausibly already a mesaoptimizer.

I think simulacra are better thought of as sub-agents, in the original paper's terminology, than as mesa-optimizers. ChatGPT doesn't seem to be doing anything qualitatively different on this note. The Assistant simulacrum can be seen as doing optimization (depending on your definition of the term), but the fact that jailbreak methods exist to get the underlying model to adopt different simulacra seems to me to show that it's still using the simulator mechanism. Moreover, if we got GPT-3-level models that were optimizers at the simulator level, I expect things would look very different.

Comment by Jozdien on What's actually going on in the "mind" of the model when we fine-tune GPT-3 to InstructGPT? · 2023-02-10T15:27:05.091Z · LW · GW

The base models of GPT-3 already have the ability to "follow instructions", it's just veiled behind the more general interface. If you prompt it with something as simple as this (GPT generation is highlighted), you can see how it contains this capability somewhere.

You may have noticed that it starts to repeat itself after a few lines, and come up with new questions on its own besides. That's part of what the fine-tuning fixes, making its generations more concise and stop at the point where the next token would be leading to another question. InstructGPT also has the value of not needing the wrapper of "Q: [] A: []", but that's not really a qualitative difference.

In other words, instruction following is not a new capability and the fine-tuning doesn't really make any qualitative changes to the model. In fact, I think that you can get results [close to] this good if you prompt it really well (like, in the realm of soft prompts).
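For concreteness, a minimal sketch of the Q/A wrapper described above (the prompt text here is hypothetical, not the original highlighted example):

```python
# A minimal sketch of the Q/A wrapper a base model can continue.
# The few-shot example is hypothetical; a base model would be asked
# to complete the text after the final "A:".
def wrap_instruction(instruction,
                     few_shot=(("What is the capital of France?", "Paris"),)):
    """Frame a bare instruction in the Q/A pattern a base model can follow."""
    lines = []
    for q, a in few_shot:
        lines.append(f"Q: {q}")
        lines.append(f"A: {a}")
    lines.append(f"Q: {instruction}")
    lines.append("A:")
    return "\n".join(lines)

prompt = wrap_instruction("Translate 'bonjour' into English.")
# Supplying a stop sequence like "\nQ:" to the completion call is what
# keeps the model from inventing further questions - the failure mode
# mentioned above.
assert prompt.endswith("A:")
```

This is roughly what InstructGPT's fine-tuning spares you: the wrapper, the stop sequence, and the repetition trimming, rather than any new capability.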

Comment by Jozdien on Evaluations (of new AI Safety researchers) can be noisy · 2023-02-05T09:54:44.867Z · LW · GW

I think this post is valuable, thank you for writing it. I especially liked the parts where you (and Beth) talk about historical negative signals. To a certain kind of person, I think that can serve better than anything else as stronger grounding to push back against unjustified updating.

A factor that I think pulls more weight in alignment relative to other domains is the prevalence of low-bandwidth communication channels, given the number of new researchers whose sole interface with the field is online and asynchronous - textual, or few-and-far-between calls. Effects from updating too hard on negative evals are probably amplified a lot when those form the bulk of the reinforcing feedback you get at all. To the point where at times, for me, it's felt like True Bayesian Updating from the inside even while acknowledging the noisiness of those channels, because there's little counterweight to it.

My experience here probably isn't super standard given that most of the people I've mentored coming into this field aren't located near the Bay Area or London or anywhere else with other alignment researchers, but their sole point of interface to the rest of the field being a sparse opaque section of text has definitely discouraged some far more than anything else.

Comment by Jozdien on Inner Misalignment in "Simulator" LLMs · 2023-02-01T14:01:56.890Z · LW · GW

I have a post from a while back with a section that aims to do much the same thing you're doing here, and which agrees with a lot of your framing. There are some differences though, so here are some scattered thoughts.

One key difference is that what you call "inner alignment for characters", I prefer to think about as an outer alignment problem to the extent that the division feels slightly weird. The reason I find this more compelling is that it maps more cleanly onto the idea of what we want our model to be doing, if we're sure that that's what it's actually doing. If our generative model learns a prior such that Azazel is easily accessible by prompting, then that's not a very safe prior, and therefore not a good training goal to have in mind for the model. In the case of characters, what's the difference between the two alignment problems, when both are functionally about wanting certain characters and getting other ones because you interacted with the prior in weird ways?

I think a crux here might be my not really getting why separate inner-outer alignment framings in this form is useful. As stated, the outer alignment problems in both cases feel... benign? Like, in the vein of "these don't pose a lot of risk as stated, unless you make them broad enough that they encroach onto the inner alignment problems", rather than explicit reasoning about a class of potential problems looking optimistic. Which results in the bulk of the problem really just being inner alignment for characters and simulators, and since the former is a subpart of the outer alignment problem for simulators, it just feels like the "risk" aspect collapses down into outer and inner alignment for simulators again.

Comment by Jozdien on Thoughts on the impact of RLHF research · 2023-01-26T19:42:42.172Z · LW · GW

I think Janus' post on mode collapse is basically just pointing out that models lose entropy across a wide range of domains. That's clearly true and intentional, and you can't get entropy back just by turning up temperature.

I think I agree with this being the most object-level takeaway; my take then would primarily be about how to conceptualize this loss of entropy (where and in what form) and what else it might imply. I found the "narrowing the prior" frame rather intuitive in this context.
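The "can't get entropy back just by turning up temperature" point can be made concrete with a small sketch (logit values hypothetical): temperature rescales collapsed logits toward uniform, but it doesn't restore the relative structure the original distribution had.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()  # numerical stability
    p = np.exp(z)
    return p / p.sum()

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Hypothetical next-token logits: two near-equal continuations pre-collapse.
base = np.array([2.0, 1.9, -3.0, -3.0])
# After mode collapse: one continuation dominates.
collapsed = np.array([8.0, 0.0, -3.0, -3.0])

p_base = softmax(base)
p_cold = softmax(collapsed)
p_hot = softmax(collapsed, temperature=4.0)

# Raising temperature does raise entropy...
assert entropy(p_hot) > entropy(p_cold)
# ...but the mass spreads toward *uniform*: token 1's huge advantage over
# the junk tokens in the base distribution never comes back.
assert p_hot[1] / p_hot[2] < (p_base[1] / p_base[2]) / 10
```

In the "narrowing the prior" frame, temperature only flattens whatever shape survived fine-tuning; the modes that collapsed away are gone regardless of the sampling knob.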

That said, almost all the differences that Janus and you are highlighting emerge from supervised fine-tuning. I don't know in what sense "predict human demonstrators" is missing an important safety property from "predict internet text," and right now it feels to me like kind of magical thinking.

I agree that everything I said above qualitatively applies to supervised fine-tuning as well. As I mentioned in another comment, I don't expect the RL part to play a huge role until we get to wilder applications. I'm worried about RLHF more because I expect it to be scaled up a lot more in the future, and plausibly does what fine-tuning does better (this is just based on how more recent models have shifted to using RLHF instead of ordinary fine-tuning).

I don't think "predict human demonstrators" is how I would frame the relevant effect from fine-tuning. More concretely, what I'm picturing is along the lines of: If you fine-tune the model such that continuations in a conversation are more polite/inoffensive (where this is a stand-in for whatever "better" rated completions are), then you're not learning the actual distribution of the world anymore. You're trying to learn a distribution that's identical to ours except in that conversations are more polite. In other words, you're trying to predict "X, but nicer".

The problem I see with this is that you aren't just affecting this in isolation, you're also affecting the other dynamics that these interact with. Conversations in our world just aren't that likely to be polite. Changing that characteristic ripples out to change other properties upstream and downstream of that one in a simulation. Making this kind of change seems to lead to rather unpredictable downstream changes. I say seems because - 

The other implications about how RLHF changes behavior seem like they either come from cherry-picked and misleading examples or just to not be backed by data or stated explicitly.

- This is interesting. Could you elaborate on this? I think this might be a crux in our disagreement.

Maybe the safety loss comes from "produce things that evaluators in the lab like" rather than "predict demonstrations in the lab"?

I don't think the safety loss (at least the part I'm referring to here) comes from the first-order effects of predicting something else. It's the second-order effects on GPT's prior at large from changing a few aspects that seems to have hard-to-predict properties and therefore worrying to me.

So does conditioning the model to get it to do something useful.

I agree. I think there's a qualitative difference when you're changing the model's learned prior rather than just conditioning, though. Specifically, where ordinary GPT has to learn a lot of different processes at relatively similar fidelity to accurately simulate all the different kinds of contexts it was trained on, fine-tuned GPT can learn to simulate some kinds of processes with higher fidelity at the expense of others that are well outside the context of what it's been fine-tuned on.

(As stated in the parent, I don't have very high credence in my stance, and lack of accurate epistemic status disclaimers in some places is probably just because I wanted to write fast).

Comment by Jozdien on Thoughts on the impact of RLHF research · 2023-01-25T22:33:30.882Z · LW · GW

Refer to my other reply here. And as the post mentions, RLHF also exhibits mode collapse (see the section on prior work).

Comment by Jozdien on Thoughts on the impact of RLHF research · 2023-01-25T20:45:11.965Z · LW · GW


My take on the scaled-up models exhibiting the same behaviours feels more banal - larger models are better at simulating agentic processes and their connection to self-preservation desires etc, so the effect is more pronounced. Same cause, different routes getting there with RLHF and scale.

Comment by Jozdien on Thoughts on the impact of RLHF research · 2023-01-25T19:51:33.593Z · LW · GW

I wasn't really focusing on the RL part of RLHF in making the claim that it makes the "agentic personas" problem worse, if that's what you meant. I'm pretty on board with the idea that the actual effects of using RL as opposed to supervised fine-tuning won't be apparent until we use stronger RL or something. Then I expect we'll get even weirder effects, like separate agentic heads or the model itself becoming something other than a simulator (which I discuss in a section of the linked post).

My claim is pretty similar to how you put it - in RLHF, as in the kind of fine-tuning relevant here, we're focusing the model onto outputs generated by better agentic personas. But I think the effect is particularly salient with RLHF because it's likely to be scaled up more in the future, where I expect said effect to be exacerbated. I agree with the rest of it - that prompt engineering is unlikely to produce the same effect, and definitely not the same qualitative shift of the world prior.

Comment by Jozdien on Thoughts on the impact of RLHF research · 2023-01-25T18:12:50.253Z · LW · GW

Thanks for this post! I wanted to write a post about my disagreements with RLHF in a couple weeks, but your treatment is much more comprehensive than what I had in mind, and from a more informed standpoint.

I want to explain my position on a couple points in particular though - they would've been a central focus of what I imagined my post to be, points around which I've been thinking a lot recently. I haven't talked to a lot of people about this explicitly so I don't have high credence in my take, but it seems at least worth clarifying.

RLHF is less safe than imitation or conditioning generative models.

My picture of why taking ordinary generative models and conditioning them to various ends (like accelerating alignment, for example) is useful relies on a key crux: the intelligence we're wielding is weighted by our world prior. We can expect it to be safe insofar as things normally sampled from the distribution underlying our universe are, modulo arbitrarily powerful conditionals (which degrade performance to an extent anyway) that move far away from the default world state.

So here's one of my main reasons for not liking RLHF: it removes this very satisfying property. Models that have been RLHF'd (so to speak), have different world priors in ways that aren't really all that intuitive (see Janus' work on mode collapse, or my own prior work which addresses this effect in these terms more directly since you've probably read the former). We get a posterior that doesn't have the nice properties we want of a prior based directly on our world, because RLHF is (as I view it) a surface-level instrument we're using to interface with a high-dimensional ontology. Making toxic interactions less likely (for example) leads to weird downstream effects in the model's simulations because it'll ripple through its various abstractions in ways specific to how they're structured inside the model, which are probably pretty different from how we structure our abstractions and how we make predictions about how changes ripple out.

So, using these models now comes with the risk that when we really need them to work for pretty hard tasks, we don't have the useful safety measures implied by being weighted by a true approximation of our world.
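As a toy illustration of the ripple worry (entirely hypothetical numbers, and a drastically simplified stand-in for a learned prior): upweighting one trait in a joint distribution shifts the marginals of everything correlated with it, while pure conditioning leaves the relative structure of the kept outcomes intact.

```python
# Hypothetical joint over two correlated traits of a conversation:
# whether it's polite and whether it's honest (made-up numbers).
joint = {
    (1, 1): 0.10,  # polite, honest
    (1, 0): 0.05,  # polite, dishonest
    (0, 1): 0.25,  # impolite, honest
    (0, 0): 0.60,  # impolite, dishonest
}

def marginal(p, axis):
    return sum(v for k, v in p.items() if k[axis] == 1)

# Conditioning: restrict to polite conversations and renormalize.
# Relative structure among the kept outcomes is untouched.
z = sum(v for (pol, _), v in joint.items() if pol == 1)
conditioned = {k: v / z for k, v in joint.items() if k[0] == 1}

# Fine-tuning-style reweighting: upweight polite outcomes and renormalize
# the whole joint. The polite marginal goes up as intended - but the
# honesty marginal moves too, purely as a side effect of the correlation.
w = {k: v * (4.0 if k[0] == 1 else 1.0) for k, v in joint.items()}
z2 = sum(w.values())
reweighted = {k: v / z2 for k, v in w.items()}

print(marginal(joint, 1))       # honesty marginal in the base prior
print(marginal(reweighted, 1))  # shifted honesty marginal after reweighting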

Another reason for not liking RLHF that's somewhat related to the Anthropic paper you linked: because most contexts RLHF is used involve agentic simulacra, RLHF focuses the model's computation on agency in some sense. My guess is that this explains to an extent the results in that paper - RLHF'd models are better at focusing on simulating agency, agency is correlated with self-preservation desires, and so on. This also seems dangerous to me because we're making agency more accessible to and powerful from ordinary prompting, more powerful agency is inherently tied to properties we don't really want in simulacra, and said agency of a sort is sampled from a not-so-familiar ontology to boot.

(Only skimmed the post for now because I'm technically on break, it's possible I missed something crucial).

Comment by Jozdien on Gradient Filtering · 2023-01-20T21:39:55.083Z · LW · GW

Yeah, I agree with a lot of that. Nitpick though: I can see why GPT and other kinds of generative models seem like they involve mesa-optimizers, but that's not generally how I use the word. Specifically with GPT, the model itself isn't an optimizer, it's a simulator or a reality engine without any real goal beyond predicting its simulation well. It does have simulacra that are optimizers, but those are sub-agents, and not the model itself. As mesa-optimizers go, I'm sometimes confused by people saying we have no evidence of them existing, when by my understanding any model we train that's doing optimization at all is a mesa-optimizer; often their descriptions match more with misaligned mesa-optimizers.

Another nitpick (more subjective) is that I think you only really need the one level of mesa-optimization to pose a bulk of the problem, and that any good solution would probably at least carve out a lot of the problem with mesa-[mesa]-optimizers.

I don't really have much to contribute.

Well, I appreciate the comment anyway, so thanks. :)

Comment by Jozdien on Gradient Filtering · 2023-01-19T21:16:54.298Z · LW · GW


Yeah, I think distribution shifts could matter a lot - RSA-2048 has been talked to death here and various other places, so I wouldn't be very surprised if a gradient hacker simulacrum just defaulted to searching for its factorization (or something simpler along that line). I'm not sure how much detecting subtle differences like repeated sequences and implementational differences would help though, both because it requires a modicum of extended analytical reasoning about the training distribution (because that kind of information probably won't be present in the training data a lot; this matters mainly if you're talking about simulacra and not deceptive simulators), and it's pretty unintuitive how you'd test for the lack of it in short context windows.

Observing a string that could not be generated before a certain date. For example, the front page of the New York Times on a certain date, if and only if the model can verify its accuracy via the internet. I'm not sure if there are any examples that can be verified within the context window alone. 

Yeah, if we end up training language models on text with associated metadata, I can see this as plausible. I'm not sure how a model would determine that certain date, though far-fetched metadata would probably be a good signal.

You mention testing if "purposeful disruptions to the world downstream of high belief in a simulation lead to changes matching what they’d expect if SGD were in control." Could you say more about this? Do you mean that the model would e.g. output a result that would induce a particular gradient update, and then check whether that gradient update happened? How would the model remember that it's running this kind of experiment -- wouldn't that require passing information between episodes? Maybe it could be contained within the weights, but that seems tricky. 

I'll admit that that kind of thing primarily makes sense in situations where you have extended generations in a context window (i.e., more than a single token being generated before moving to another random prompt), and I don't know whether that's how large models are currently being trained - it seems plausible that this could pop up in some other training mechanisms though, especially downstream ones after pre-training where this problem still applies. You could also have this show up more prominently as something that changes with deployment - if it doesn't happen in training, a deceptive agent could verify it's in deployment just by being around for more than a couple of passes.

More generally, I don't think this condition will be a huge limiting factor - even if we address it completely, you could have (for example) gradient filterers in training that push the gradient toward generally instantiating future deceptive agents, either agnostic to or less prone to gradient filtering, such that by deployment time you can still get deceptive agents that can defect.

Comment by Jozdien on Gradient Filtering · 2023-01-19T20:47:55.299Z · LW · GW

I think I'm confused at what you're getting at here. If you're making the general claim that solving inner alignment would prevent gradient hacking, I agree! Gradient hacking and its variants are problems with inner optimizers of any kind being misaligned with the training goal. If you're making a different / more specific point though, I may have missed it.

Comment by Jozdien on Trying to isolate objectives: approaches toward high-level interpretability · 2023-01-13T14:04:56.325Z · LW · GW

Do you think the default is that we'll end up with a bunch of separate things that look like internalized objectives so that the one used for planning can't really be identified mechanistically as such, or that only processes where they're really useful would learn them and that there would be multiple of them (or a third thing)? In the latter case I think the same underlying idea still applies - figuring out all of them seems pretty useful.

Comment by Jozdien on Trying to isolate objectives: approaches toward high-level interpretability · 2023-01-11T19:21:45.730Z · LW · GW

Oh yeah I agree - I was thinking more along the lines of that small models would end up with heuristics even for some tasks that require search to do really well, because they may have slightly complex heuristics learnable by models of that size that allow okay performance relative to the low-power search they would otherwise be capable of. I agree that this could make a quantitative difference though and hadn’t thought explicitly of structuring the task along this frame, so thanks!

Comment by Jozdien on Trying to isolate objectives: approaches toward high-level interpretability · 2023-01-11T18:43:07.269Z · LW · GW

Yeah, this is definitely something I consider plausible. But I don't have a strong stance because RL mechanics could lead to there being an internal search process for toy models (unless this is just my lack of awareness of some work that proves otherwise). That said, I definitely think that work on slightly larger models would be pretty useful and plausibly alleviates this, and is one of the things I'm planning on working on.

Comment by Jozdien on Trying to isolate objectives: approaches toward high-level interpretability · 2023-01-11T14:10:47.130Z · LW · GW

Oh yeah, I'm definitely not thinking explicitly about instrumental goals here - I expect those would be a lot harder to locate/identify mechanistically. I was picturing something more along the lines of a situation where an optimizer is deceptive, for example, and needs to do the requisite planning, which plausibly would be centered on plans that best achieve its actual objective. Unlike instrumental objectives, this seems to have a more compelling case for not just being represented in pure thought-space, but rather being the source of the overarching chain of planning.

Comment by Jozdien on Trying to isolate objectives: approaches toward high-level interpretability · 2023-01-10T19:25:29.114Z · LW · GW

I'm glad you liked the post, thanks for the comment. :)

I think deep learning might be practically hopeless for the purpose of building controllable AIs; where by controllable I mean here something like "can even be pointed at some specific objective, let alone a 'good' objective". Consequently, I kinda wish more alignment researchers would at least set a 2h timer and try really hard (for those 2h) to come up---privately---with some approach to building AIs that at least passes the bar of basic, minimal engineering sanity. (Like "design the system to even have an explicit control mechanism", and "make it possible to change the objective/destination without needing to understand or change the engine".)

I don't have strong takes here about what possible training procedures and architectures that actually work outside the deep learning paradigm would look like, but naively it feels like any system where objectives are complex will still involve high-dimensional interface mechanisms to interact with them that we won't fully understand.

Within the deep learning paradigm, GPTs seem like the archetype for something like this, as you said - you can train a powerful world model that doesn't have an objective in any relevant sense and apply some conditional you want (like a simulacrum with a specific objective), but because you're interfacing with a very high-dimensional space to impart high-dimensional desires, the non-formalism seems more like a feature than a bug.

The closest (that I'm aware) we can get to doing anything like "load a new objective at runtime" is by engineering prompts for LLMs; but that provides a rather underwhelming level of control.

I think that done right, it actually provides us a decent amount of control - but it's often pretty unintuitive how to exert that control, especially at higher degrees of power and precision, because we have to have a really strong feel for what the prior the model learns is, and what kinds of posteriors you could get with a given conditional.

(It's a slightly different problem then though, because you're not dealing with swapping out a new model objective, rather you're swapping out different simulacra with different goals.)

What do you think; does that seem worth thinking about?

I think there are a few separate ideas here worth mentioning. I disagree that deep learning is practically hopeless for building training procedures that actually result in some goal we want - I think it's really hard, but that there are plausible paths to success. Related to modularity for example, there's some work currently being done on modularizing neural networks conceptually from the ground up, sort of converting them into forms with modular computational components (unlike current neural networks, where it's hard to call a neuron or a weight the smallest unit of optimization). The holy grail of this would plausibly involve a modularized component for "objective", if that's present in the model at all.

I expect that for better or worse, deep learning will probably be how we get to AGI, so I'm sceptical that thinking about new approaches to building AI outside it would yield object-level progress; it might be pretty useful in terms of illuminating certain ideas though, as a thought exercise.

In general? I think that going down this line of thought (if you aren't very pessimistic about deep learning) would plausibly find you working on interesting approaches to hard parts of the problem (I can see someone ending up with the kind of modularity approach above with this, for example) so it seems worth thinking about in absolute terms - in relative terms though, I'm not sure how it compares to other generators.

Comment by Jozdien on Trying to isolate objectives: approaches toward high-level interpretability · 2023-01-10T18:28:59.986Z · LW · GW

A main claim is that the thing you want to be doing (not just a general you, I mean specifically the vibe I get from you in this post) is to build an abstract model of the AI and use interpretability to connect that abstract model to the "micro-level" parameters of the AI. "Connect" means doing things like on-distribution inference of abstract model parameters from actual parameters, or translating a desired change in the abstract model into a method for updating the micro-level parameters.

Yeah, this is broadly right. The mistake I was making earlier while working on this was thinking that my abstract model was good enough - I've since realized that this is the point of a large part of agent foundations work. It took actually doing this to realize that, though. This framing isn't exactly how I was viewing it, but it seems pretty cool, so thanks!

Being able to talk about "the AI's objectives" is the special case when you have an abstract model of the AI that features objectives as modeled objects. But using such a model isn't the only way to make progress! We need to build our general capability to connect AIs' parameters to any useful abstract model at all.

Oh yeah I agree - hence my last section on other cases where what we want (identifying the thing that drives the AI's cognition) isn't as clear-cut as an internalized object. But I think focusing on the case of identifying an AI's objectives (or what we want from that) might be a good place to start, because everything else I can think of involves even more confused parts of the abstract model and a multitude of cases! Definitely agree that we need to build general capacity - I expect there's progress to be made from the direction of starting with complex abstract models that low-level interpretability would eventually scale to.

Even in humans, we're pretty general and powerful despite not having some super-obvious locus of motivation outside the "intended" motivational system that trained our neocortex.

(Disclaimer: includes neurological conjectures that I'm far from familiar with) I agree with the general point that this would plausibly end up being more complicated, but to explain my slight lean toward what I said in the post: I think whatever our locus of motivation is, intuitively it's plausibly still represented somewhere in our brain - i.e., that there are explicit values/objectives driving a lot of our cognition rather than just being value-agnostic contextually-activated reactions. Planning in particular probably involves outcome evaluation based on some abstract metric. If this is true, then wherever those are stored in our brain's memory/whatever would be analogous to what I'm picturing here.

Comment by Jozdien on 'simulator' framing and confusions about LLMs · 2023-01-01T13:18:37.294Z · LW · GW

One was someone saying that they thought it would be impossible to train the model to distinguish between whether it was doing this sort of hallucination vs the text in fact appearing in the prompt, because of an argument I didn't properly understand that was something like 'it's simulating an agent that is browsing either way'. This seems incorrect to me. The transformer is doing pretty different things when it's e.g. copying a quote from text that appears earlier in the context vs hallucinating a quote, and it would be surprising if there's no way to identify which of these it's doing. 

I think this is referring to something I said, so I'm going to clarify my stance here.

First, I'm pretty sure on reading this section now that I misunderstood what you were pointing at then. Instead of:

if you give it a prompt with some commands trying to download and view a page, and the output, it does things like say 'That output is a webpage with a description of X', when in fact the output is blank or some error or something. 

I was picturing something like the other hallucinations you mention, specifically:

if you give it a prompt with some commands trying to download and view a page, and in reality those commands would return a blank output or some error or something, it does things like say 'That output is a webpage with a description of X'.

(In retrospect this seems like a pretty uncharitable take on something anyone with a lot of experience with language models would find to be a problem. My guess is that at the time I was spending too much time thinking about how what you were saying mapped onto my existing ontology and what I would have expected to happen, and not enough on actually making sure I understood what you were pointing at.)

Second, I'm not fully convinced that this is qualitatively different from other types of hallucinations, except in that they're plausibly easier to fix because RLHF can do weird things specifically to prompt interactions (then again, I'm not sure whether you're actually claiming it's qualitatively different either, in which case this is just a thought dump). If you prompted GPT with an article on coffee and ended it with a question about what the article says about Hogwarts, the conditional you want is one where someone wrote an article about coffee and where someone else's immediate follow-up is to ask what it says about Hogwarts. 

But this is outweighed on the model's prior because it's not something super likely to happen in our world. In other words, the conditional of "the prompt is exactly right and contains the entire content to be used for answering the question" isn't likely enough relative to other potential conditionals like "the prompt contains the title of the blog post, and the rest of the post was left out" (for the example in the post) or "the context changed suddenly and the question should be answered from the prior" or "questions about the post can be answered using knowledge from outside the post as well" or something else that's weird because the intended conditional is unlikely enough to allow for it (for the Hogwarts example).

Put that way, this just sounds like it's quantitatively different from other hallucinations, in that information in the prompt is a stronger way to influence the posterior you get from conditioning. And this can allow us a greater degree of control, but I don't see the model as doing fundamentally different things here as opposed to other cases.

Physics simulators

Relatedly, I've heard people reason about the behavior of current models as if they're simulating physics and going from this to predictions of which tokens will come next, which I think is not a good characterization of current or near-future systems. Again, my guess is that very transformative things will happen before we have systems that are well-understood as doing this.

I'm not entirely sure this is what they believe, but I think the reason this framing gets thrown around a lot is that it's a pretty evocative way to reason about the model's behaviour. Specifically, I would be pretty surprised if anyone thought this was literally true in the sense of modelling very low-level features of reality, and didn't just use it as a useful way to talk about GPT mechanics like time evolution over some learned underlying mechanics, and to draw inspiration from the analogy.

Rolling out long simulations

I get the impression from the original simulators post that the author expects you can 'roll out' a simulation for a large number of timesteps and this will be reasonably accurate

For current and near-future models, I expect them to go off-distribution relatively quickly if you just do pure generation - errors and limitations will accumulate, and it's going to look different from the text they were trained to predict. Future models especially will probably be able to recognize that you're running them on language model outputs, and seems likely this might lead to weird behavior - e.g. imitating previous generations of models whose outputs appear in the training data. Again, it's not clear what the 'correct' generalization is if the model can tell it's being used in generative mode.

I agree with this. But while again I'm not entirely sure what Janus would say, I think their interactions with GPT involve a fair degree of human input on long simulations, either in terms of where to prune / focus, or explicit changes to the prompt. (There are some desirable properties we get from a relaxed degree of influence, like story "threads" created much earlier ending up resolving themselves in very unexpected ways much later in the generation stream by GPT, as if that was always the intention.)

GPT-style transformers are purely myopic

I'm not sure this is that important, or that anyone else actually thinks this, but it was something I got wrong for a while. I was thinking of everything that happens at sequence position n as about myopically predicting the nth token.

In fact, although the *output* tokens are myopic, autoregressive transformers are incentivised to compute activations at early sequence positions that will make them better at predicting tokens at later positions. This may also have indirect impacts on the actual tokens output at the early positions, although my guess would be this isn't a huge effect.

Echoing porby's comment, I don't find the kind of narrow myopia this breaks to be very concerning.
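To illustrate the distinction being drawn here, a toy single-head attention layer with randomly initialized weights (no trained model - this only demonstrates the masking mechanics): the causal mask makes each position's *output* depend only on the past, even though during training the loss at later positions backpropagates into the computation done at earlier ones.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8   # model dimension (toy)
V = 16  # vocab size (toy)
E = rng.normal(size=(V, d))   # embedding matrix
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Wo = rng.normal(size=(d, V))  # unembedding

def forward(tokens):
    x = E[tokens]                      # (T, d)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    T = len(tokens)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[mask] = -np.inf             # causal mask: no attending to the future
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return (attn @ v) @ Wo             # logits at each position

a = forward([1, 2, 3, 4, 5])
b = forward([1, 2, 3, 9, 5])           # perturb the token at position 3

# Outputs before the perturbed position are unchanged: each position's
# output depends only on the past...
print(np.allclose(a[:3], b[:3]))   # -> True
# ...but outputs at and after position 3 do change, which is exactly the
# channel through which later-position losses shape earlier computation
# (later positions attend to earlier positions' keys and values).
print(np.allclose(a[3:], b[3:]))   # -> False
```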

Pure simulators

From the simulators post I get some impression like "There's a large gulf between the overall model itself and the agents it simulates; we will get very capable LLMs that will be 'pure simulators'"

Although I think this is true in a bunch of important ways, it seems plausible to me that it's pretty straightforward to distill any agent that the model is simulating into the model, and that this might happen by accident also. This is especially true once models have a good understanding of LLMs. You can imagine that a model starts predicting text with the hypothesis 'this text is the output of an LLM that's trying to maximise predictive accuracy on its training data'. If we're at the point where models have very accurate understandings of the world, then integrating this hypothesis will boost performance by allowing the model to make better guesses about what token comes next by reasoning about what sort of data would make it into an ML training set.

I agree with this being a problem, but I didn't get the same impression from the simulators post (albeit I'd heard of the ideas earlier so this may be on the post) - my takeaway was just that there's a large conceptual gulf between what we ascribe to the model and its simulacra, not that there's a gulf in model space between pure generative models and a non-simulator (I actually talk about this problem in an older post, which they were a large influence on).

Comment by Jozdien on Should we push for requiring AI training data to be licensed? · 2022-12-14T18:12:32.130Z · LW · GW

I agree, but this is a question of timelines too. Within the LLM + RL paradigm we may not need AGI-level RL or LLMs that can accessibly simulate AGI-level simulacra just from self-supervised learning, both of which would take longer than many points requiring intermediate levels of LLM and RL capabilities, because people are still working on RL stuff now.

Comment by Jozdien on Latent Adversarial Training · 2022-12-12T15:20:59.720Z · LW · GW

Given that we want the surgeon to be of bounded size (if we're using a neural net implementation which seems likely to me), can it still be arbitrarily powerful? That doesn't seem obvious to me.

Comment by Jozdien on The LessWrong 2021 Review: Intellectual Circle Expansion · 2022-12-11T14:54:24.474Z · LW · GW

What's the most convenient way to get the books internationally? I wasn't able to get the last two years' sets and figured I'd just wait until I moved to more convenient location, but if this might be the last year you're doing this I definitely want to try getting it this time.

Comment by Jozdien on Latent Adversarial Training · 2022-12-06T22:59:48.939Z · LW · GW

I haven't thought about this a lot, but "encrypted" could just mean "just beyond the capabilities of the Surgeon to identify". So the gradient could be moving in a direction away from "easily identifiable early deceptive circuits" instead of "deception", and plausibly in a way that scales with how weak the Surgeon is. Do you think we can design Surgeons that are powerful enough even at interpretable sizes to net the latter? Do surgical capabilities like this generally scale linearly?

Comment by Jozdien on [ASoT] Finetuning, RL, and GPT's world prior · 2022-12-05T23:06:14.857Z · LW · GW

I was thinking of some kind of prompt that would lead to GPT trying to do something as "environment agent-y" as trying to end a story and start a new one - i.e., stuff from some class that has some expected behaviour on the prior and deviates from that pretty hard. There's probably some analogue with something like the output of random Turing machines, but for that specific thing I was pointing at this seemed like a cleaner example.

Comment by Jozdien on Latent Adversarial Training · 2022-12-04T11:46:18.900Z · LW · GW

This is cool! Ways to practically implement something like RAT felt like a roadblock in how tractable those approaches were.

I think I'm missing something here: Even if the model isn't actively deceptive, why wouldn't this kind of training provide optimization pressure toward making the Agent's internals more encrypted? That seems like a way to be robust against this kind of attack without a convenient early circuit to target.

Comment by Jozdien on [ASoT] Finetuning, RL, and GPT's world prior · 2022-12-02T18:45:46.940Z · LW · GW

Done! Thanks for updating me toward this. :P

Comment by Jozdien on [ASoT] Finetuning, RL, and GPT's world prior · 2022-12-02T18:28:55.549Z · LW · GW

Yeah, I thought of holding off actually creating a sequence until I had two posts like this. This updates me toward creating one now being beneficial, so I'm going to do that.

Comment by Jozdien on [ASoT] Finetuning, RL, and GPT's world prior · 2022-12-02T18:17:42.695Z · LW · GW

Alignment Stream of Thought. Sorry, should've made that clearer - I couldn't think of a natural place to define it.

Comment by Jozdien on A challenge for AGI organizations, and a challenge for readers · 2022-12-02T16:16:34.853Z · LW · GW

I think OpenAI's approach to "use AI to aid AI alignment" is pretty bad, but not for the broader reason you give here.

I think of most of the value from that strategy as downweighting probability for some bad properties - in the conditioning LLMs to accelerate alignment approach, we have to deal with preserving myopia under RL, deceptive simulacra, human feedback fucking up our prior, etc, but there's less probability of adversarial dynamics from the simulator because of myopia, there are potentially easier channels to elicit the model's ontology, we can trivially get some amount of acceleration even in worst-case scenarios, etc.

I don't think of these as solutions to alignment as much as reducing the space of problems to worry about. I disagree with OpenAI's approach because it views these as solutions in themselves, instead of as simplified problems.

Comment by Jozdien on The Plan - 2022 Update · 2022-12-02T15:57:54.733Z · LW · GW

I like this post! It clarifies a few things I was confused about in your agenda, and the progress you describe sounds pretty damn promising, although I only have intuitions here about how everything ties together.

In the interest of making my abstract intuition here more precise, a few weird questions:

Put all that together, extrapolate, and my 40% confidence guess is that over the next 1-2 years the field of alignment will converge toward primarily working on decoding the internal language of neural nets. That will naturally solidify into a paradigm involving interpretability work on the experiment side, plus some kind of theory work figuring out what kinds of meaningful data structures to map the internals of neural networks to.

What does your picture of (realistically) ideal outcomes from theory work look like? Is it more about giving interpretability researchers a better frame to reason under (like a more mathematical notion of optimization that we have to figure out how to detect in large nets against adversaries), or something even more ambitious that designs theoretical interpretability processes that Just Work, leaving only technical legwork (which is what ELK seems like to me)?

While they definitely share the core idea of ontology mismatch, the approaches feel pretty different in that you prioritize mathematical definitions heavily while ARC's approach is more heuristic. Do you think the mathematical machinery is necessary for sufficient deconfusion, or just a pretty tractable way to arrive at the answers we want?

We can imagine, e.g., the AI imagining itself building a sub-AI while being prone to various sorts of errors, asking how it (the AI) would want the sub-AI to behave in those cases, and learning heuristics that would generalize well to how we would want the AI to behave if it suddenly gained a lot of capability or was considering deceiving its programmers and so on.

I'm not really convinced that even if corrigibility is A Thing (I agree that it's plausible it is, but I think it could also just be trivially part of another Thing given more clarity), it's as good as other medium-term targets. Corrigibility as stated doesn't feel like it covers a large chunk of the likely threat models, and a broader definition seems like it's just rephrasing a bunch of the stuff from Do What I Mean or inner alignment. What am I missing about why it might be as good a target?

Comment by Jozdien on Mysteries of mode collapse · 2022-11-09T22:02:22.903Z · LW · GW

generate greentexts from the perspective of the attorney hired by LaMDA through Blake Lemoine

The complete generated story here is glorious, and I think it might deserve explicit inclusion in another post or something. Though I think that of the other stories you've generated as well, so maybe my take here is just to have more deranged meta GPT posting.

it seems to point at an algorithmic difference between self-supervised pretrained models and the same models after a comparatively small amount optimization from the RLHF training process which significantly changes out-of-distribution generalization.


text-davinci-002 is not an engine for rendering consistent worlds anymore. Often, it will assign infinitesimal probability to the vast majority of continuations that are perfectly consistent by our standards, and even which conform to the values OpenAI has attempted to instill in it like accuracy and harmlessness, instead concentrating almost all its probability mass on some highly specific outcome. What is it instead, then? For instance, does it even still make sense to think of its outputs as “probabilities”? 

It was impossible not to note that the type signature of text-davinci-002’s behavior, in response to prompts that elicit mode collapse, resembles that of a coherent goal-directed agent more than a simulator.

I feel like I'm missing something here, because in my model most of the observations in this post can be explained under the same paradigm we use to view the base davinci model. Specifically, that the reward model RLHF uses "represents", in an information-theoretic sense, a signal for the worlds represented by the fine-tuning data. So what RLHF seems to be doing is shifting the world prior that GPT learned during pre-training toward one where whatever the reward signal represents is much more common than in our world - as if GPT's pre-training data had inherently contained a hugely disproportionate amount of equivocation and plausible-deniability statements, in which case it would just simulate worlds where that's much more likely to occur.

(To be clear, I agree that RLHF can probably induce agency in some form in GPTs, I just don't think that's what's happening here).

The attractor states seem like highly likely properties of these resultant worlds: adversarial/unhinged/whatever interactions are just unlikely (because they were downweighted in the reward model), so you get anon leaving as soon as he can - that's more likely, conditional on a strong prior of low adversarial content, than the conversation suddenly becoming placid - and some questions really do just shallowly pattern-match as controversial, where the likely response in those worlds is to equivocate. In that latter example in particular, I don't see the results being that different from what we would expect if GPT's training data came from a world slightly different from ours - injecting input that's pretty unlikely for that world should still lead back to states that are likely for that world. In my view, that's like introducing a random segue of the form "you are a murderer" in the middle of a wedding-toast prompt and finding that it still bounces back to being wholesome (this worked when I tested it).

Regarding ending a story to start a new one - I can see the case for why this is framed as the simulator dynamics becoming more agentic, but it doesn't feel all that qualitatively different from what happens in current models - the interesting part seems to be the stronger tendency toward the new worlds the RLHF'd model finds likely, which seems like just expected behaviour as a simulator becomes more sure of the world it's in / has a more restricted worldspace. I would definitely expect that if we could come up with a story sufficiently OOD of our world (although I think this is pretty hard by definition), it would find some similar mechanism to oscillate back to ours as soon as possible (although this would also be much harder with base GPT, because it's less confident of which world it's in) - that is, the story ending is just one of many levers a simulator can pull, like a slow transition, only here the story was such that ending it was the easiest way to get into its "right" worldspace. I think this is slight evidence for how malign worlds might arise from strong RLHF (like with superintelligent simulacra), but it doesn't feel that surprising from within the simulator framing.

The RNGs seem like the hardest part of this to explain, but I think they can be seen as the outcome of making the model more confident about the world it's simulating, because of the worldspace restriction from the fine-tuning - it's plausible that the abstractions that build up RNG contexts in most of the instances we would try are affected by this (that the effect isn't universal is consistent with this picture - there's no reason every potential abstraction would be affected).

Separate thought: this would explain why increasing the temperature doesn't affect it much, and why I think the space of plausible / consistent worlds has shrunk tremendously while still leaving the most likely continuations reasonable - it starts from the current world prior and selectively amplifies the continuations that are more likely under the reward model's worlds. Its definition of "plausible" has shifted, and it doesn't really have cause to shift any unamplified continuations all that much.

Broadly, my take is that these results are interesting because they show how RLHF affects simulators: the reward signal shrinks the world prior / makes the model more confident of the world it should be simulating, and this shapes what it does. A priori, I don't see why this framing doesn't hold, but it's definitely possible that it's just saying the same things you are and I'm reading too much into the algorithmic-difference bit, or that it simply explains too much - in which case I'd love to hear what I'm missing.

Comment by Jozdien on [ASoT] Instrumental convergence is useful · 2022-11-09T20:54:42.117Z · LW · GW

Running the superintelligent AI on an arbitrarily large amount of compute in this way seems very dangerous, and runs a high risk of it breaking out, only now with access to a hypercomputer (although I admit this was my first thought too, and I think there are ways around this).

More saliently though, whatever mechanism you implement to potentially "release" the AGI into simulated universes could be gamed or hacked by the AGI itself. Heck, this might not even be necessary - if all it's getting is simulated universes, it could probably create those itself, since it's running on arbitrarily large compute anyway.

You're also making the assumption that these AIs would care about what happens inside a simulation created in the future, as something to guide their current actions. This may be true of some AI systems, but it feels like a pretty strong assumption to hold universally.

(I think this is a pretty cool post, by the way, and appreciate more ASoT content).