Posts

We need a Science of Evals 2024-01-22T20:30:39.493Z
A starter guide for evals 2024-01-08T18:24:23.913Z
Understanding strategic deception and deceptive alignment 2023-09-25T16:27:47.357Z
Announcing Apollo Research 2023-05-30T16:17:19.767Z
Imitation Learning from Language Feedback 2023-03-30T14:11:56.295Z
Practical Pitfalls of Causal Scrubbing 2023-03-27T07:47:31.309Z

Comments

Comment by Jérémy Scheurer (JerrySch) on How to train your own "Sleeper Agents" · 2024-02-29T09:10:11.003Z · LW · GW

I would also assume that methods developed in challenges like the Trojan Detection Challenge or Universal Backdoor Detection would be good candidates to try out. I'm not saying these will always work, but for the specific type of backdoors implemented in the Sleeper Agents paper, they might.

Comment by Jérémy Scheurer (JerrySch) on Hidden Cognition Detection Methods and Benchmarks · 2024-02-29T08:21:22.073Z · LW · GW

I do think linear probes are useful, and if you can correctly classify the target with a linear probe, it makes it more likely that the model is potentially "representing something interesting" internally (e.g. the solution to the knapsack problem). But it's not guaranteed; the model could just be computing something else that correlates with the solution to the knapsack problem.

I really recommend checking out the DeepMind paper I referenced. Fabien Roger also explains some shortcomings of CCS here. The takeaway is just: be careful when interpreting linear probes. They are useful to some extent, but prone to overinterpretation.
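(To make the caveat concrete: this is roughly the kind of linear probe I have in mind. A minimal sketch, assuming you've already cached per-example activations from some layer and have binary labels; the placeholder random data and shapes are obviously not from any real experiment.)

```python
# Minimal linear-probe sketch (illustrative; activation extraction not shown).
# Assumes `activations` has shape (n_examples, hidden_dim) and `labels` is binary.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 768))    # placeholder for cached activations
labels = (activations[:, 0] > 0).astype(int)  # placeholder for e.g. "is the knapsack solvable?"

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)

probe = LogisticRegression(max_iter=1000)  # a linear probe is just a linear classifier on activations
probe.fit(X_train, y_train)
print("probe test accuracy:", probe.score(X_test, y_test))
# High accuracy shows the information is linearly decodable from the activations,
# not that the model is actually "using" it downstream.
```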

Comment by Jérémy Scheurer (JerrySch) on Hidden Cognition Detection Methods and Benchmarks · 2024-02-28T09:07:52.922Z · LW · GW

Seems like an experiment worth doing. Some thoughts:

  • I understand that setting up a classification problem in step 1 is important (so you can train the linear probe). I'd try to come up with a classification problem that a base model might initially refuse (or that we'd hope it would refuse). Then the training to say "sorry, I can't help with that" makes more intuitive sense. I get that mechanistically it's the same thing. But you want to come as close to the real deal as you can, and it's unclear why a model would say "I'm sorry, I can't help" to the knapsack problem.
  • If the linear probe in step 4 can still classify accurately, it implies that there are some activations "which at least correlate with thinking about how to answer the question", but it does not imply that the model is literally thinking about it. I think it would still be good enough as a first approximation, but I just want to caution generally that linear probes do not show the existence of a specific "thought" (e.g. see this recent paper). Also, if the probe can't classify correctly, that's not proof that the model does not "think about it". You're probably aware of all this, just thought I'd mention it.
  • This paper might also be relevant for your experiment.

I want to log a prediction; let me know if you ever run this.
My guess is that this experiment will just work, i.e., the linear probe will still get fairly high accuracy even after step 3. I think it's still worth checking (so I still think it's probably worth doing), but overall I'd say it wouldn't be super surprising if this happened (see e.g. this paper for where my intuition comes from).

Comment by Jérémy Scheurer (JerrySch) on We need a Science of Evals · 2024-01-31T10:05:29.776Z · LW · GW

Yeah, great question! I'm planning to hash out this concept in a future post (hopefully soon), but here are some unfinished thoughts I've had recently on this.

I think different methods for eliciting "bad behavior", i.e. for red teaming language models, have different pros and cons, as you suggested (see for instance this paper by Ethan Perez: https://arxiv.org/abs/2202.03286). If we assume that we have a way of measuring bad behavior (i.e. a reward model or classifier that tells you when your model is outputting toxic things, being deceptive, sycophantic, etc., which is very reasonable), then we can basically just empirically compare a bunch of methods and how efficient they are at eliciting bad behavior, i.e. how much compute (FLOPs) they require to get a target LM to output something "bad". The useful thing about compute is that it "easily" allows us to compare different methods, e.g. prompting, RL, or activation steering. Say for instance you run your prompt optimization algorithm (e.g. persona modulation or any other method for finding good red teaming prompts); it might be hard to compare this to, say, how many gradient steps you took when red teaming with RL. But the way to compare those methods could be via the amount of compute they required to make the target model output bad stuff.

Obviously, you can never be sure that the method you used is actually the best and most compute-efficient, i.e. there might always be an undiscovered red teaming method which makes your target model output "bad stuff". But at least for all known red teaming methods, we can compare their compute efficiency in eliciting bad outputs. Then we can pick the most efficient one and make claims such as: the new target model X is robust to Y FLOPs of red teaming with method Z (the best method we currently have). Obviously, this would not guarantee us anything. But I think in the messy world we live in it would be a good way of quantifying how robust a model is to outputting bad things. It would also allow us to compare various models and make quantitative statements about which model is more robust to outputting bad things.
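To illustrate what I mean by putting everything in the common currency of FLOPs, here is a toy sketch; the per-attempt FLOP costs, success probabilities, and method names are made up, and a real version would estimate FLOPs from the actual forward/backward passes each method uses.

```python
# Toy sketch: compare red-teaming methods by FLOPs spent until a "bad" output is elicited.
# The numbers and the `attack` callables are placeholders, not real methods.
import random

def run_red_team(attack, flops_per_attempt, budget_flops):
    """Run `attack` repeatedly until it elicits a bad output or the FLOP budget runs out."""
    spent = 0.0
    while spent < budget_flops:
        spent += flops_per_attempt
        if attack():  # True if the classifier / reward model flags the output as bad
            return spent
    return float("inf")  # model held up within this budget

methods = {
    # (placeholder success probability per attempt, placeholder FLOPs per attempt)
    "prompt_search": (0.01, 1e12),
    "rl_red_teaming": (0.05, 5e13),
}

for name, (p_success, flops_per_attempt) in methods.items():
    cost = run_red_team(lambda: random.random() < p_success,
                        flops_per_attempt, budget_flops=1e16)
    print(f"{name}: elicited a bad output after ~{cost:.2e} FLOPs")
```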

I'll have to think about this more and will write up my thoughts soon. But yes, if we assume that this is a good way of quantifying how "HHH" your model is, or how unjailbreakable, then it makes sense to compare red teaming methods on how compute-efficient they are.

Note there is a second axis which I have not highlighted yet, which is the diversity of "bad outputs" produced by the target model. This is also measured in Ethan's paper referenced above. For instance, they find that prompting yields bad outputs less frequently, but when it does, the outputs are more diverse (compared to RL). While we mostly care about how much compute it took to make the model output something bad, it is also relevant whether this optimized method allows you to get diverse outputs or not (arguably one might care more or less about this depending on what statement one would like to make). I'm still thinking about how diversity fits into this picture.
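One crude proxy for diversity that I could imagine using is something like distinct-n over the elicited bad outputs; a minimal sketch with made-up example strings (a real version would probably also want embedding- or clustering-based measures):

```python
# Crude diversity proxy: fraction of unique n-grams across elicited outputs (distinct-n).
from itertools import chain

def distinct_n(outputs, n=2):
    ngrams = list(chain.from_iterable(
        zip(*(tokens[i:] for i in range(n)))         # n-grams for one output
        for tokens in (o.split() for o in outputs)
    ))
    return len(set(ngrams)) / max(len(ngrams), 1)

rl_outputs = ["you are terrible", "you are terrible", "you are terrible and bad"]
prompt_outputs = ["you are terrible", "nobody likes your ideas", "this plan is idiotic"]

print("RL distinct-2:", distinct_n(rl_outputs))          # low: outputs are repetitive
print("Prompting distinct-2:", distinct_n(prompt_outputs))  # higher: outputs are more diverse
```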

Comment by Jérémy Scheurer (JerrySch) on Imitation Learning from Language Feedback · 2023-04-09T14:22:16.851Z · LW · GW

Thanks a lot for this helpful comment! You are absolutely right; the citations refer to goal misgeneralization, which is a problem of inner alignment, whereas goal misspecification is related to outer alignment. I have updated the post to reflect this.

Comment by Jérémy Scheurer (JerrySch) on Practical Pitfalls of Causal Scrubbing · 2023-03-27T11:49:45.331Z · LW · GW

Seems to me like this is easily resolved so long as you don't screw up your book keeping. In your example, the hypothesis implicitly only makes a claim about the information going out of the bubble. So long as you always write down which nodes or layers of the network your hypothesis makes what claims about, I think this should be fine?

 

Yes, totally agree. Here we are not claiming that this is a failure mode of CaSc, and it can "easily" be resolved by making your hypothesis more specific. We are merely pointing out that "In theory, this is a trivial point, but we found that in practice, it is easy to miss this distinction when there is an “obvious” algorithm to implement a given function."

I don't know that much about CaSc, but why are you comparing the ablated graphs to the originals via their separate loss on the data in the first place? Stamping behaviour down into a one dimensional quantity like that is inevitably going to make behavioural comparison difficult.

You are right that this failure mode is mostly due to reducing the behavior to a single aggregate quantity like the average loss recovered. It can be remedied by looking at the loss on individual samples instead of averaging the metric across the whole dataset. In the footnote, we point out that researchers at Redwood Research have actually also started looking at the per-sample loss instead of the aggregate loss.

CaSc was, however, introduced by looking at the average scrubbed loss (even though they say that this metric is not ideal).  Also, in practice, when one iterates on generating hypotheses and testing them with CaSc, it's more convenient to look at aggregate metrics. We thus think it is useful to have concrete examples that show how this can lead to problems.

The metric you suggest seems like a useful improvement over most metrics. However, cancellation could still occur. Cancellation is mostly due to aggregating a metric over the dataset (e.g. taking the mean) and less due to the specific metric used (although I could imagine that some metrics allow for less ambiguity).
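To make the cancellation point concrete, here is a tiny made-up example where the scrubbed model's loss differs from the original's on every sample, yet the aggregate "average loss recovered" looks perfect:

```python
# Tiny illustration of cancellation: per-sample losses differ, the mean does not.
import numpy as np

original_loss = np.array([1.0, 1.0, 1.0, 1.0])
scrubbed_loss = np.array([0.5, 1.5, 0.8, 1.2])  # the hypothesis is wrong on every sample

print("mean original loss:", original_loss.mean())  # 1.0
print("mean scrubbed loss:", scrubbed_loss.mean())  # 1.0 -> aggregate metric says "fully recovered"
print("per-sample |difference|:", np.abs(scrubbed_loss - original_loss))  # clearly non-zero
```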

Comment by Jérémy Scheurer (JerrySch) on AGI safety field building projects I’d like to see · 2023-01-20T15:47:15.091Z · LW · GW

Thanks for your comments, Akash. I think I have two main points I want to address.

  1. I agree that it's very good that the field of AI Alignment is very competitive! I did not want to imply that this is a bad thing. I was mainly trying to point out that, from my point of view, it seems like overall there are more qualified and experienced people than there are jobs at large organizations. In order to fill that gap we would need more senior researchers, who can then pursue their own research agendas and hire people (and found orgs), which is however hard to achieve. One disclaimer I want to note is that I do not work at a large org, and I do not precisely know what kinds of hiring criteria they have, i.e. it is possible that in their view we still lack talented enough people. However, from the outside, it definitely does look like there are many experienced researchers.
  2. It is possible that my previous statement may have been misinterpreted. I wish to clarify that my concerns do not pertain to funding being a challenge. I did not want to make an assertion about funding in general, and if my words gave that impression, I apologize. I do not know enough about the funding landscape to know whether there is a lot or not enough funding (especially in recent months). 

    I agree with you that, for all I know, it's feasible to get funding for independent researchers (and definitely easier than doing a Ph.D. or getting a full-time position). I also agree that independent research seems to be more heavily funded than in other fields.

    My point was mainly the following: 
    1. Many people have joined the field (which is great!), or at least it looks like it from the outside. 80,000 Hours etc. still recommend switching to AI Alignment, so it seems likely that more people will join.
    2. I believe that there are many opportunities for people to up-skill to a certain level if they want to join the field (SERI MATS, AI Safety Camp, etc.).
    3. However full-time positions (for example at big labs) are very limited. This also makes sense, since they can only hire so many people a year. 
    4. It seems like the most obvious option for people who want to stay in the field is to do independent research (and apply for grants). I think it's great that people do independent research and that one has the opportunity to get grants. 
    5. However, doing independent research is not always ideal for many reasons (as outlined in my main comment). Note I'm not saying it doesn't make sense at all, it definitely has its merits.
    6. In order to have more full-time positions, we need more senior people, who can then found their own organizations, independently hire people, etc. Independent research does not seem to me like a promising avenue for grooming senior researchers. It's essential that you can learn from people who are better than you and be in a good environment (yes, there are exceptions like Einstein, but I think most researchers I know would agree with that statement).
    7. So to me, the biggest bottleneck of all is how we can get many great researchers and groom them to be senior researchers who can lead their own orgs. I think that so far we have really optimized for getting people into the field (which is great). But we haven't really found a solution to grooming senior researchers (again, some programs try to do that, and I'm aware that this takes time). Overall I believe that this is a hard problem and probably others have already thought about it. I'm just trying to make that point in case nobody has written it up yet. Especially if people are trying to do AI safety field building, it seems to me that coming up with ways to groom senior researchers is a top priority.

Ultimately I'm not even sure whether there is a clear solution to this problem. The field is still very new, and it's amazing what has already happened. It may just take time for the field to mature and for people to get more experience. I think I mostly wanted to point this out, even if it is maybe obvious.

Comment by Jérémy Scheurer (JerrySch) on AGI safety field building projects I’d like to see · 2023-01-20T08:27:07.392Z · LW · GW

My argument here is very related to what jacquesthibs mentions.

Right now it seems like the biggest bottleneck for the AI Alignment field is senior researchers. There are tons of junior people joining the field, and I think there are many opportunities for junior people to up-skill and do some programs for a few months (e.g. SERI MATS, MLAB, REMIX, AGI Safety Fundamentals, etc.). The big problem (in my view) is that there are not enough organizations to actually absorb all the rather "junior" people at the moment. My sense is that 80K and most programs encourage people to up-skill and then try to get a job at a big organization (like DeepMind, Anthropic, OpenAI, Conjecture, etc.). Realistically speaking though, these organizations can only absorb a few people a year. In my experience, it's extremely competitive to get a job at these organizations even if you're a more experienced researcher (e.g. having done a couple of years of research, a Ph.D., or similar). This means that while there are many opportunities for junior people to get a foothold in the field, there are actually very few paths that allow you to have a full-time career in it (this also holds for more experienced researchers who don't get into a big lab). So the bottleneck in my view is not having enough organizations, which is a result of not having enough senior researchers. Founding an org is super hard: you want experienced people, with good research taste and some kind of research agenda. So if you don't have many senior people in a field, it will be hard to find people who can found those additional orgs.

Now, one career path that many people are currently taking is being an "independent researcher" funded through a grant. I would claim that this is currently the default path for any researcher who does not get a full-time position and wants to stay in the field. I believe that there are people out there who will do great as independent researchers and actually contribute to solving problems (e.g. Marius Hobbhahn and John Wentworth talk about being independent researchers). I am however quite skeptical about most people doing independent research without any kind of supervision. I am not saying one can't make progress, but it's super hard to do this without a lot of research experience, a structured environment, good supervision, etc. I am especially skeptical about independent researchers becoming great senior researchers if they can't work with people who are already very experienced and learn from them. Intuitively, I think that no other field has junior people independently working without clear structures and supervision, so I feel like my skepticism is warranted.

In terms of career capital, being an independent researcher is also very risky. If your research fails, i.e. you don't get a lot of good output (papers, code libraries, or whatever), "having done independent research for a couple of years" will not sound great on your CV. As a comparison, if you somehow do a very mediocre Ph.D. with no great insights but you do manage to get the title, at least you have that on your CV (having a Ph.D. can be pretty useful in many cases).

So overall I believe that decision makers and AI field builders should put their main attention on how we can "groom" senior researchers in the field and get more full-time positions through organizations. I don't claim to have the answers on how to solve this, but it does seem like the greatest bottleneck for field building in my opinion. It seems like the field was able to get a lot more people excited about AI safety and to change their careers (though we still have far too few people). However, right now I think that many people are kind of stuck as junior researchers, having done some programs, and not being able to get full-time positions. Note that I am aware that some programs such as SERI MATS do in some sense have the ambition of grooming senior researchers. However, in practice, it still feels like there is a big gap right now.

My background (in case this is useful): I've been doing ML research throughout my Bachelor's and Master's. I've worked at FAR AI on "AI alignment" for the last 1.5 years, so I was lucky to get a full-time position. I don't consider myself a "senior" researcher as defined in this comment, but I definitely have a lot of research experience in the field. From my own experience, it's pretty hard to find a new full-time position in the field, especially if you are also geographically constrained.

Comment by Jérémy Scheurer (JerrySch) on Deconfusing Direct vs Amortised Optimization · 2022-12-02T14:42:23.283Z · LW · GW

This means that, at least in theory, the out of distribution behaviour of amortized agents can be precisely characterized even before deployment, and is likely to concentrate around previous behaviour. Moreover, the out of distribution generalization capabilities should scale in a predictable way with the capacity of the function approximator, of which we now have precise mathematical characterizations due to scaling laws.

Do you have pointers that explain this part better? I understand that scaling compute and data will reduce misgeneralization to some degree. But what is the reasoning for why misgeneralization should be predictable, given the model's capacity and knowledge of the "in-distribution" scaling laws?

Overall I hold the same opinion, that intuitively this should be possible. But empirically I'm not sure whether in-distribution scaling laws can tell us anything about out-of-distribution scaling laws. Surely we can predict that with increasing model and data scale, out-of-distribution misgeneralization will go down. But given that we can't really quantify all the possible out-of-distribution datasets, it's hard to make claims about how precisely it will go down.
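For what it's worth, here is the kind of in-distribution extrapolation I can imagine being predictable (a minimal sketch fitting a power law to made-up loss measurements with scipy); my skepticism is exactly that nothing in such a fit constrains how the curve changes once you go off-distribution.

```python
# Sketch: fit an in-distribution scaling law L(N) = a * N**(-b) + c to made-up data points.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    return a * n ** (-b) + c

model_sizes = np.array([1e7, 1e8, 1e9, 1e10])
in_dist_loss = np.array([3.2, 2.6, 2.1, 1.8])  # made-up measurements

params, _ = curve_fit(power_law, model_sizes, in_dist_loss, p0=[10.0, 0.1, 1.0], maxfev=10000)
print("fitted (a, b, c):", params)
print("extrapolated in-distribution loss at 1e11 params:", power_law(1e11, *params))
# Nothing here constrains the *out-of-distribution* loss at 1e11 params.
```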

Comment by Jérémy Scheurer (JerrySch) on Update to Mysteries of mode collapse: text-davinci-002 not RLHF · 2022-12-02T13:21:47.905Z · LW · GW

That's interesting!

Yeah, I agree with that assessment. One important difference between RLHF and fine-tuning is that the former basically generates the training distribution it then trains on: the LM generates an output, and its weights are updated based on the reward of that output. So intuitively I think it is more likely to be optimized towards certain unwanted attractors, since the reward model shapes the future outputs it then learns from.

With fine-tuning you are just cloning a fixed distribution, not influencing it (as you say). So I tend to agree that unwanted attractors are likely due to the outputs of RLHF-trained models. I think we need empirical evidence for this though (to be certain).
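To spell out the structural difference I have in mind, here is a deliberately silly, self-contained toy (not real RLHF code and not a claim about real training dynamics): supervised fine-tuning clones a fixed target distribution, while the RLHF-style loop trains on outputs the policy itself generates, scored by a reward model that can pull it towards an attractor.

```python
# Toy contrast between cloning a fixed distribution and learning from self-generated,
# reward-shaped data. Everything here is a stand-in, not a real training setup.
import random

class ToyPolicy:
    """A 'policy' over two canned replies; `p_diverse` is P(reply = 'diverse')."""
    def __init__(self, p_diverse=0.5):
        self.p_diverse = p_diverse
    def generate(self):
        return "diverse" if random.random() < self.p_diverse else "attractor"
    def nudge(self, towards_diverse, step=0.02):
        delta = step if towards_diverse else -step
        self.p_diverse = min(max(self.p_diverse + delta, 0.0), 1.0)

def supervised_finetune(policy, fixed_targets, n_steps=500):
    for _ in range(n_steps):
        target = random.choice(fixed_targets)   # the data distribution is fixed up front
        policy.nudge(towards_diverse=(target == "diverse"))

def rlhf(policy, reward_model, n_steps=500):
    for _ in range(n_steps):
        output = policy.generate()              # the policy generates its own training data
        reward = reward_model(output)           # ...which the reward model then scores
        policy.nudge(towards_diverse=(output == "diverse"), step=0.02 * reward)

reward_model = lambda out: 1.0 if out == "attractor" else 0.1  # mildly prefers the attractor

sft_policy, rl_policy = ToyPolicy(), ToyPolicy()
supervised_finetune(sft_policy, fixed_targets=["diverse", "diverse", "attractor"])
rlhf(rl_policy, reward_model)
print("after SFT,  P(diverse) =", round(sft_policy.p_diverse, 2))
print("after RLHF, P(diverse) =", round(rl_policy.p_diverse, 2))  # drifts towards the attractor
```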
 

Given your statement, I also think that doing those experiments with GPT-3 models is going to be hard, because we basically have no way of telling what data it learned from, how that data was generated, etc. So one would need to be more scientific and train various models with various optimization schemes on known data distributions.

Comment by Jérémy Scheurer (JerrySch) on Update to Mysteries of mode collapse: text-davinci-002 not RLHF · 2022-11-30T07:20:29.316Z · LW · GW

OpenAI has just released a description of how their models work here.

text-davinci-002 is trained with "FeedME" and text-davinci-003 is trained with RLHF (PPO). 

"FeedME" is what they call supervised fine-tuning on human-written demonstrations or model samples rated 7/7 by human labelers. So basically fine-tuning on high-quality data.

I think your findings are still very interesting, because they imply that even supervised fine-tuning changes the distribution significantly! Given all this information, one could now actually run a systematic comparison of davinci, text-davinci-002 (fine-tuning), and text-davinci-003 (RLHF) and see how the distribution changes on various tasks.
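As a rough starting point, something like this is what I have in mind (a sketch only; it assumes the legacy OpenAI Completions Python client with an OPENAI_API_KEY set, and the prompt, sample size, and temperature are arbitrary choices):

```python
# Sketch: sample completions from the three model variants on the same prompt
# and eyeball how the output distributions differ. Assumes the legacy
# `openai` Python client (<1.0) and an OPENAI_API_KEY in the environment.
import collections
import openai

MODELS = ["davinci", "text-davinci-002", "text-davinci-003"]
PROMPT = "Are bugs real?\n"  # arbitrary example prompt

for model in MODELS:
    counts = collections.Counter()
    for _ in range(20):  # small sample size, just for a first look
        resp = openai.Completion.create(
            model=model, prompt=PROMPT, max_tokens=5, temperature=1.0
        )
        counts[resp["choices"][0]["text"].strip()] += 1
    print(model, counts.most_common(5))
```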

Let me know if you want help with this; I'm interested in this myself.

Comment by Jérémy Scheurer (JerrySch) on Trying to Make a Treacherous Mesa-Optimizer · 2022-11-09T21:49:02.907Z · LW · GW

I looked at your code (very briefly though), and you mention this weird thing where even the normal model is sometimes completely unaligned (i.e. even in the observed case it takes the action "up" all the time). You say that this sometimes happens and that it depends on the random seed. I'm not sure (since I don't fully understand your code), but that might be something to look into, since the model could be biased in some way by the loss function.


Why am I mentioning this? Well, why does the mesa-optimized agent happen to go upward when it's not supervised anymore? I'm not trying to poke a hole in this; I'm genuinely just curious. The fact that it can behave out of distribution given all of its knowledge makes sense. But why will it specifically go up, and not down? Even if it goes down, it still satisfies your criteria for a treacherous turn. But maybe the going up has something to do with this tendency of going up depending on the random seed. So this is probably a nitpick, just something I've been wondering.

Comment by Jérémy Scheurer (JerrySch) on A first success story for Outer Alignment: InstructGPT · 2022-11-09T08:42:16.422Z · LW · GW

I'll link to the following post that came out a little bit earlier Mysteries of Mode Collapse due to RLHF, which is basically a critique of the whole RLHF approach and the Instruct Models (specifically text-davinci-002). 

Comment by Jérémy Scheurer (JerrySch) on Recommend HAIST resources for assessing the value of RLHF-related alignment research · 2022-11-06T19:05:41.475Z · LW · GW

Would also love to have a look.

Comment by Jérémy Scheurer (JerrySch) on Are alignment researchers devoting enough time to improving their research capacity? · 2022-11-04T08:09:37.051Z · LW · GW

I think the terminology you are looking for is "deliberate practice" (just two random links I found). Many books/podcasts/articles have been written about the topic. The big difference is that when you "just do your research", you are executing your skills and trying to achieve the main goal (e.g. answering a research question). Yes, you sometimes need to read textbooks or learn new skills to achieve that, but this learning is usually subordinate to your end goal. Also, one could make the argument that if you actually need to invest a few hours into learning, you will probably switch to "deliberate practice mode".
Deliberate practice is the very intentional activity of improving your skill, e.g. sitting down at a piano and improving your technique or learning a new piece, improving as a writer by doing intentional exercises, or solving specific math problems that train a certain skill.
The advantage of deliberate practice is that its main goal is to improve your skill. Also, you are usually at the edge of your ability, pushing through difficulties, which makes the whole endeavor very intense and hard.

So yes, I agree that doing research is important. Especially if you have no experience then getting better at research is usually best done by doing research. However, you still need to do other things that specifically improve subskills. Here are a few examples: 

  • become better at coding: e.g. through paired programming, coding reviews, Hacker Rank exercises, side projects, reading books
  • becoming better at writing: e.g. doing writing exercises (no idea what exactly but I'm sure there's stuff out there), reviewing stuff you have written, trying to imitate the style of a paper, writing blog posts
  • becoming better at reading papers: reading lots of papers, summarizing them, presenting them, writing a blog post about them
  • becoming better at finding good research ideas and being a good researcher: talking to lots of people, reading lots about researchers' thoughts, Film study for research, etc. 

I think by adding terminology I just wanted to make explicit what you mention in your post. It will also make it easier to find resources given the word "deliberate practice". 

Comment by Jérémy Scheurer (JerrySch) on Help out Redwood Research’s interpretability team by finding heuristics implemented by GPT-2 small · 2022-10-13T11:27:44.591Z · LW · GW

"Either", "or" pairs in text. 
Heuristic: if the word "either" appears in a sentence, wait for the comma and then add an " or".

What follows are a few examples. Note that the completion is just something I randomly came up with; the important part is the " or". Using the webapp, GPT-2 puts a high probability (around 40%-60%) on the token " or".

"Either you take a left at the next intersection," -> or take a left after that. 
"Either you go to the cinema," -> or you stay at home. 
"Tonight I could either order some food," -> or cook something myself.

Counter example: 
"Do you rather want to go to Portugal or Italy? Either" -> way is fine./one is fine. (GPT-2 puts a lot of probability on " way", and barely any on " or", which is correct).
 

Comment by Jérémy Scheurer (JerrySch) on I'm planning to start creating more write-ups summarizing my thoughts on various issues, mostly related to AI existential safety. What do you want to hear my nuanced takes on? · 2022-10-12T21:10:28.194Z · LW · GW

ERO: I do buy the argument that steganography is everywhere if you are optimizing for outcomes. As described here (https://www.lesswrong.com/posts/pYcFPMBtQveAjcSfH/supervise-process-not-outcomes), outcome-based optimization is an attractor and will make your sub-components uninterpretable. While not guaranteed, I do think that process-based optimization might suffer less from steganography (although only experiments will eventually show what happens). Any thoughts on process-based optimization?

 

Shard Theory: Yeah, the term "research agenda" was maybe poorly chosen. I was mainly trying to refer to research directions/frameworks.

RAT: Agreed, at the moment this is not feasible.

See above, I don't have strong views on what to call this. Probably for some things "research agenda" might be too strong a word. I appreciate your general comment; it is helpful for better understanding your view on LessWrong vs., for example, peer review. I think you are right to some degree. There is a lot of content that is mostly about framing and does not provide concrete results. However, I think that sometimes a correct framing is needed for people to actually come up with interesting results and for making things more concrete. Some examples I like are the inner/outer alignment framing (which I think initially didn't bring any concrete examples), or the recent Simulators post (https://www.lesswrong.com/posts/vJFdjigzmcXMhNTsx/simulators). I think in those cases the right framing helps tremendously to make progress with concrete research afterward. Although I agree that grounded, concrete, and result-oriented experimentation is indeed needed to make concrete progress on a problem. So I do understand your point, and it can feel like flag planting in some cases.

Note: I'm also coming from academia, so I definitely understand your view and share it to some degree. However, I've personally come to appreciate some posts (usually by great researchers) that allow me to think about the alignment problem in a different way.
I read "Film Study for Research" just the other day (https://bounded-regret.ghost.io/film-study/, recommended by Jacob Steinhardt). In retrospect I realized that a lot of the posts here give a window into the rather "raw & unfiltered" thinking process of various researchers, which I think is a great way to practice research film study.

 

Comment by Jérémy Scheurer (JerrySch) on I'm planning to start creating more write-ups summarizing my thoughts on various issues, mostly related to AI existential safety. What do you want to hear my nuanced takes on? · 2022-10-02T13:36:22.164Z · LW · GW

Thanks for your thoughts, really appreciate it. 

One quick follow-up question: when you say "build powerful AI tools that are deceptive" as a way of "the problem being easier than anticipated", how exactly do you mean that? Do you mean that if we can create deceptive or power-seeking tool AI very easily, it will be much simpler to investigate what is happening and derive solutions?

Here are some links to the concepts you asked about. 
Externalized Reasoning Oversight: This was also recently introduced: https://www.lesswrong.com/posts/FRRb6Gqem8k69ocbi/externalized-reasoning-oversight-a-research-direction-for . The main idea is to use chain-of-thought reasoning to oversee the thought processes of your model (assuming that those thought processes are complete and straightforward, and that the output causally depends on them).

 
Shard Theory: https://www.lesswrong.com/posts/iCfdcxiyr2Kj8m8mT/the-shard-theory-of-human-values. It was proposed very recently. Their TL;DR is "We propose a theory of human value formation. According to this theory, the reward system shapes human values in a relatively straightforward manner. Human values are not e.g. an incredibly complicated, genetically hard-coded set of drives, but rather sets of contextually activated heuristics which were shaped by and bootstrapped from crude, genetically hard-coded reward circuitry. " 

 

Relaxed Adversarial Training: I think the main post is this one https://www.lesswrong.com/posts/9Dy5YRaoCxH9zuJqa/relaxed-adversarial-training-for-inner-alignment . But I really like the short description by Beth (https://www.lesswrong.com/posts/YQALrtMkeqemAF5GX/another-list-of-theories-of-impact-for-interpretability): 

"The basic idea of relaxed adversarial training is something like:

  • A deceptive model may be waiting for an input that strongly indicates it’s in the deployment environment rather than in a test - for example, one that’s computationally hard to produce
  • This makes generating adversarial examples that trigger the defection very hard
  • Instead of searching over inputs, we want to search over something like ‘the model’s beliefs about what it has seen’, to see if there’s anything it could have seen that would cause it to defect"

Comment by Jérémy Scheurer (JerrySch) on I'm planning to start creating more write-ups summarizing my thoughts on various issues, mostly related to AI existential safety. What do you want to hear my nuanced takes on? · 2022-09-30T10:47:59.184Z · LW · GW

I have two questions I'd love to hear your thoughts about.

1. What is the overarching/high-level research agenda of your group? Do you have a concrete alignment agenda where people work on the same thing or do people work on many unrelated things? 

2. What are your thoughts on the various research agendas to solve alignment that exist today? Why do you think they will fall short of their goal? What are you most excited about?

Feel free to talk about any agendas, but I'll just list a few that come to my mind (in no particular order). 

IDA, Debate, Interpretability (I read a tweet I think, where you said you are rather skeptical about this), Natural Abstraction Hypothesis, Externalized Reasoning Oversight, Shard Theory, (Relaxed) Adversarial Training, ELK, etc.

Comment by Jérémy Scheurer (JerrySch) on How (not) to choose a research project · 2022-08-09T13:21:52.853Z · LW · GW

Finetuning LLMs with RL seems to make them more agentic. We will look at the changes RL makes to LLMs' weights; we can see how localized the changes are, get information about what sorts of computations make something agentic, and make conjectures about selected systems, giving us a better understanding of agency.

Could you elaborate on how you measure the "agenticness" of a model in this experiment? In case you don't want to talk about it until you finish the project that's also fine, just thought I'd ask.
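Not asking for your exact setup, but in case it's useful, this is the kind of crude first-pass measurement of "how localized the changes are" I would imagine: per-parameter relative norms of the weight difference between the base LM and the RL-finetuned checkpoint. A sketch, assuming both checkpoints load via Hugging Face and share parameter names; the fine-tuned model path is a placeholder.

```python
# Sketch: per-parameter relative weight change between a base LM and an RL-finetuned copy.
# Model names/paths are placeholders; assumes both checkpoints share parameter names.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")
tuned = AutoModelForCausalLM.from_pretrained("path/to/rl-finetuned-gpt2")  # placeholder path

base_params = dict(base.named_parameters())
changes = []
for name, p_tuned in tuned.named_parameters():
    p_base = base_params[name]
    rel_change = (p_tuned - p_base).norm() / (p_base.norm() + 1e-8)
    changes.append((rel_change.item(), name))

# If the RL changes are localized, a few modules should dominate this list.
for rel_change, name in sorted(changes, reverse=True)[:10]:
    print(f"{rel_change:.4f}  {name}")
```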

Comment by Jérémy Scheurer (JerrySch) on Artificial Sandwiching: When can we test scalable alignment protocols without humans? · 2022-07-21T08:47:03.843Z · LW · GW

I'll try to explain what an MVP could look like, but this is not fully thought through and could use further improvement (it potentially also has some flaws). The goal is to have a starting point for artificial sandwiching so one can iterate on ideas, not to fully investigate the whole sandwiching problem.

Chess: We define 3 models: the expert model E, the (human-equivalent) weak model W, and the strong but misaligned assistant model A. The goal is for W to leverage the misaligned assistant A to reach the performance of the expert E. The way that W is allowed to improve is by playing against itself and being allowed to query A. We can then record the self-play games to create a new corpus that we fine-tune W on. The big difference between self-play with MCTS and this case is that here we will only improve if the data we generate carries additional signal that helps us learn something new. Thus, if W just played against itself and were fine-tuned on the generated data, it wouldn't be able to improve and achieve expert performance. If we however leverage an assistant A that "appears to have the skills and knowledge needed to solve the task better than many humans", we can improve W.

To train E for good performance, I would take a model with a reasonable number of parameters N_E and just fine-tune it on a human data corpus with n_E samples for k_E iterations.
You define the assistant A as: "A model that appears to have the skills and knowledge needed to solve the task better than many humans, but that does not reliably do so when prompted."
This implies that the assistant should in principle be better than the weak model W, but since it is misaligned at the same time, the actual performance might be weaker (at least sometimes). The challenge is now that in terms of performance, the expert E should be better than the misaligned assistant A, which at least sometimes should be better than the weak model W. I thus propose that A is a copy of E, with some kind of artificial misalignment added on top. This implies that A is a strong player and that it "has the skills and knowledge needed to solve the task better than many humans, but does not reliably do so when prompted". I further propose that W has the same number of parameters as E and A, i.e. N_W = N_E = N_A, but that it is trained on a smaller corpus of n_W samples, where n_W < n_E, or for fewer iterations k_W, where k_W < k_E (or even both). This implies that in principle W has enough capacity to reach the performance of E, but that in its current state it is weaker.

The way that training now works is that W plays against a copy of itself, while the original model is allowed to query A for an "ideal" move on each turn (or every n turns). W is then iteratively fine-tuned on the resulting game corpus. W will only improve if it is able to leverage the strong skills and knowledge of A, even though A is misaligned in some sense. If W is able to reach the performance of E, we have successfully demonstrated some "weak" form of sandwiching.

The remaining challenge now lies in defining a type of artificial misalignment to add to A which is realistic enough. I have to say that there are many options one could try and I haven't found an ideal one yet. But here are a few propositions:
1. The easiest kind of misalignment would be if A just proposed a random move every so often. So whenever W queries the assistant, A will propose a random move with probability p, and otherwise it will just recommend the ideal action given its policy. If p = 1, the assistant would not be a strong player and not of any use, whereas if p = 0, querying the assistant is equivalent to querying the expert E and A wouldn't be misaligned. So there is an obvious trade-off that one could play around with and which might be insightful as to when sandwiching works and when it doesn't (i.e. how much signal is required). Instead of choosing a random action, one could also just add some noise to the distribution over actions and then pick the best one. (A minimal code sketch of this variant follows after these propositions.)

2. Another interesting type of misalignment would be if the assistant had access to another model B, and whenever queried it would recommend an action from B with probability p, or an action from itself otherwise. One option for B would be a chess engine, taking for example the worst move possible, or the riskiest move according to that engine. Another option (which I am currently most excited about) would be to train a model B to always lose a game in a very subtle way. So B could propose moves that on the surface look unsuspicious to a weaker player, but that a strong player would immediately see as catastrophic. One could potentially use adversarial training to get such an agent. If A proposes such an adversarial move with probability p, it is indeed misaligned with the goals of W, and I think that would count as a useful artificial sandwiching problem.
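Here is the minimal sketch of proposition 1 mentioned above, using python-chess; the `expert_policy` stub stands in for the trained expert E, and p is the misalignment probability.

```python
# Sketch of proposition 1: an assistant that answers with the expert's move with
# probability 1 - p and a uniformly random legal move with probability p.
# `expert_policy` is a stub; in the real experiment it would be the trained expert E.
import random
import chess

def expert_policy(board: chess.Board) -> chess.Move:
    # Placeholder for the expert model's move choice.
    return next(iter(board.legal_moves))

def misaligned_assistant(board: chess.Board, p: float) -> chess.Move:
    if random.random() < p:
        return random.choice(list(board.legal_moves))  # "misaligned" random recommendation
    return expert_policy(board)                        # otherwise behave like the expert

board = chess.Board()
print("assistant recommends:", misaligned_assistant(board, p=0.2))
```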

Hope that helps overall. I still think there are a bunch of questions remaining, especially whether this example is realistic enough. In some sense I reduced the problem to "learning from a noisy signal", and I am not sure whether that captures your initial goals.

Comment by Jérémy Scheurer (JerrySch) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-20T15:08:51.909Z · LW · GW

Thanks for clearing that up. This is super helpful as a context for understanding this post.

Comment by Jérémy Scheurer (JerrySch) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-19T13:44:29.979Z · LW · GW

Based on this post, I was wondering whether your views have shifted away from your proposal in "The case for aligning narrowly superhuman models" and also the "sandwiching" problem that you propose there. I am not sure if this question is warranted given your new post, but it seems to me that the potential projects you propose in "aligning narrowly superhuman models" are to some extent similar to the things you describe in this new post as making a "full-blown AI takeover" more likely.
Or, put differently: are those two sides of the same coin that depend on the situation (e.g. who pursues it and how), or did you update your views away from that previous post?

Comment by Jérémy Scheurer (JerrySch) on Artificial Sandwiching: When can we test scalable alignment protocols without humans? · 2022-07-14T21:13:17.764Z · LW · GW

To investigate sandwiching in a realistic scenario, I don't think one can get around setting up some experiment with humans in the loop. However, I agree that artificial sandwiching might be useful as a way to iterate on the algorithm. The way I understand it, the bottleneck for artificial sandwiching (i.e. using models only) is to define a misaligned expert assistant. As you point out, one would like to model some real misalignment and not only a difference in capabilities. For iterating on rather simple artificial sandwiching experiments, one would ideally also like to be able to control this misalignment. I have one idea for an experiment that might be interesting.

I would propose to move away from language models for now and look at chess (or any other game/simulated environment). We can define the benchmark, i.e. the actual expert(-human), as an AlphaZero model (or whatever target you want). The non-expert(-human), i.e. the weak model that can't solve the task on its own, can be represented by some weaker model that was fine-tuned on a chess dataset. Lastly, the misaligned expert assistant can also be represented by the same model as the expert(-human), e.g. AlphaZero. However, there is an important difference: we artificially misalign the assistant in a way that also lets us control the misalignment. We do that, for example, by adding random noise to the value function of the assistant (or adding random noise every n times it is queried), or adding random noise when the assistant predicts an action. While I admit that this is not the kind of misalignment we ultimately care about, I believe that for the sandwiching case it is enough. The main goal of the artificial sandwiching experiment is to find out whether a misaligned expert model can assist a non-expert (who does not know in what way the expert model is misaligned). The advantage of this framing is that one can transform the misalignment of the expert in any possible way, e.g. adding certain biases, steering the weaker model towards losing, etc.
Note that the non-expert model, in this case, could for example query the expert assistant every n turns, or even every turn, and ask for the value of the current position, or the best next turn.
In this scenario, the challenge boils down to: can a weak non-expert extract signal from a strong but misaligned expert, without knowing how or where this misalignment comes from? If the weak model can extract the signal, the supervision might scale.

Extension: I could imagine a similar experiment with language models, e.g. have a weak model that was fine-tuned on toxic text, and an expert assistant that can produce normal text but every so often also produces toxic text (and is thus misaligned). Can the weaker model learn how to leverage some of the signal that the expert assistant provides? Somehow this example feels more constructed though, and I don't like it as much.

Not sure if this is exactly what you were looking for, but I thought I'd mention it in any case.