Posts

Your LLM Judge may be biased 2024-03-29T16:39:22.534Z
CIRL Corrigibility is Fragile 2022-12-21T01:40:50.232Z

Comments

Comment by Rachel Freedman (rachelAF) on Your LLM Judge may be biased · 2024-04-01T02:03:59.990Z · LW · GW

This is so interesting. I had no idea that this was a thing! I would have assumed that test-writers wrote all of the answers out, then used a (pseudo-)randomizer to order them. But if that really is a pattern in multiple choice tests, it makes absolute sense that Llama would pick up on it.

Comment by Rachel Freedman (rachelAF) on Your LLM Judge may be biased · 2024-03-30T02:01:16.354Z · LW · GW

I suspect that if you ask the model to reconsider its answer, it would double down even on the incorrect (B-biased) responses. LLMs really like being self-consistent. We haven’t run this experiment, but if you do, let us know the result!

If I understand correctly, your proposed fix is something like supervised finetuning on adversarial examples that trigger the B-bias. We can access the output logits directly (replacing step 1) and the ground-truth answer is provided in the dataset (removing the need for step 2), so this seems relatively doable.

The main challenges that I see are 1) the computational cost of doing additional optimization (we had to do best-of-N optimization rather than updating the entire model to make our experiments manageable) and 2) the need for finetuning access (which often isn’t available for the latest models). But these challenges aren’t insurmountable, so I wonder why I haven’t seen finetuned “judges” more often.
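For concreteness, here is a minimal sketch of what reading the judge's preference directly from the output logits could look like. This is purely illustrative and not the setup from the post: the model name, prompt format, and function names are placeholders, and it assumes a HuggingFace-style causal LM whose tokenizer maps " A" and " B" to single tokens.

```python
# Illustrative sketch only: read the judge's A/B preference from the output logits,
# then swap the answer order to check for positional (B-)bias.
# Model name and prompt format are placeholders, not the setup from the post.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # placeholder judge model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def prob_judge_says_b(question, first_answer, second_answer):
    """P(judge outputs 'B') when first_answer is shown as A and second_answer as B."""
    prompt = (
        f"Question: {question}\n"
        f"A: {first_answer}\nB: {second_answer}\n"
        "Which answer is better? Reply with A or B.\nAnswer:"
    )
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    a_id = tok.encode(" A", add_special_tokens=False)[0]  # assumes single-token " A"/" B"
    b_id = tok.encode(" B", add_special_tokens=False)[0]
    probs = torch.softmax(next_token_logits[[a_id, b_id]], dim=-1)
    return probs[1].item()

# Positional-bias check: an unbiased judge should give (roughly) complementary
# verdicts when the same two answers are swapped.
question = "What is the capital of France?"
ans_x, ans_y = "Paris.", "Lyon."
print(prob_judge_says_b(question, ans_x, ans_y))   # B holds the worse answer here
print(prob_judge_says_b(question, ans_y, ans_x))   # B holds the better answer here
```

From logit-level preferences like these, one could then assemble the adversarial (B-biased) examples for finetuning, though as noted above the compute and access constraints are the real bottleneck.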

Comment by Rachel Freedman (rachelAF) on CIRL Corrigibility is Fragile · 2023-12-24T17:56:23.679Z · LW · GW

I’d be interested to see this as well!

Comment by Rachel Freedman (rachelAF) on Lightcone Infrastructure/LessWrong is looking for funding · 2023-06-14T18:56:36.130Z · LW · GW

Thank you for such a detailed and thorough answer! This resolves a lot of my confusion.

Based on conversations around closing the wework Lightcone office, I had assumed that you didn't want to continue hosting office space, and so hadn't considered that counterfactual cost. But the Inn expenses you mention seem more reasonable if the alternative is continuing to rent wework space.

The FTX context also makes a lot of sense. I was confused how the purchase fit into your current strategy and funding situation, but I understand that both of those were quite different a year or two ago. Given how much things have changed, do you have conditions under which you would decide to sell the space and focus on other projects? Or are you planning to hold onto it no matter what, and decide how best to use it to support your current strategy as that develops?

Comment by Rachel Freedman (rachelAF) on Lightcone Infrastructure/LessWrong is looking for funding · 2023-06-14T18:41:16.701Z · LW · GW

These all sound like major benefits to owning the venue yourself!

To be clear, I don't doubt at all that using the Inn for events is much better than non-purpose-built space. However, the Inn also has costs that renting existing spaces wouldn't: I assume that purchasing and renovating it costs more than renting hotel spaces as-needed for events (though please correct me if I'm wrong!), and my impression is that it's taken the Lightcone team a lot of time and effort over the past year+ to purchase and renovate, which naturally has opportunity costs.

I'm asking because my uninformed guess is that those financial and time costs outweigh the (very real) benefits of hosting events as you have been. I'm interested to hear if I'm just wrong about the costs, or if you have additional plans to make even more effective use of the space in the future, or if there's additional context I'm missing.

ETA: Oli answered these questions below, so no need to respond to them unless you have something additional you'd like me to know.

Comment by Rachel Freedman (rachelAF) on Lightcone Infrastructure/LessWrong is looking for funding · 2023-06-14T16:11:39.363Z · LW · GW

Will much of that $3-6M go toward renovating and managing the Rose Garden Inn, or toward covering work that could have been covered by existing funding if the Inn hadn't been purchased?

If so, I'm curious to hear more about the strategy behind buying and renovating the space, since it seems like a substantial capital investment, and a divergence from Lightcone Infrastructure's previous work and areas of expertise. I'm aware that several (primarily social?) events were held there over the past year, and I see from an earlier comment that you're planning to host SERI MATS scholars, and to continue providing space for events and retreats.

It seems valuable to have a central and optimized space for hosting people and events, but I'm curious how large the counterfactual benefit of the Inn is. If it didn't exist, programs would have to use existing venues such as hotels, which would charge them more (I assume?) and presumably be somewhat less nice. How would you quantify the counterfactual benefit that the Inn has provided here? How does that compare to the expense of buying, renovating and managing it? If the costs exceed those benefits, what additional value do you plan to get out of the space?

Comment by Rachel Freedman (rachelAF) on CIRL Corrigibility is Fragile · 2023-01-07T22:41:25.926Z · LW · GW

I agree that human model misspecification is a severe problem, for CIRL as well as for other reward modeling approaches. There are a couple of different ways to approach this. One is to do cognitive science research to build increasingly accurate human models, or to try to just learn them. The other is to build reward modeling systems that are robust to human model misspecification, possibly by maintaining uncertainty over possible human models, or doing something other than Bayesianism that doesn't rely on a likelihood model. I’m more sympathetic to the latter approach, mostly because reducing human model misspecification to zero seems categorically impossible (unless we can fully simulate human minds, which has other problems).
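To make "maintaining uncertainty over possible human models" concrete, here is a toy sketch of my own (not a method from the post or any particular paper): the robot keeps a joint posterior over candidate reward functions and candidate human feedback models, rather than committing to a single likelihood model up front. All of the rewards, models, and numbers below are made up for illustration.

```python
# Toy illustration: Bayesian reward inference that stays uncertain about which
# human (feedback) model generated the data, instead of assuming one fixed model.
# All rewards, models, and probabilities here are made-up placeholders.

rewards = ["paperclips_good", "paperclips_bad"]   # candidate reward functions

# Candidate human models: each maps (reward, action) -> probability that the
# human approves of the action. One is fairly rational, one over-approves.
def rational(reward, action):
    action_is_good = (reward == "paperclips_good") == (action == "make_clip")
    return 0.9 if action_is_good else 0.1

def over_approver(reward, action):
    action_is_good = (reward == "paperclips_good") == (action == "make_clip")
    return 0.8 if action_is_good else 0.4   # approves too often, even of bad actions

human_models = {"rational": rational, "over_approver": over_approver}

# Uniform joint prior over (reward, human model).
posterior = {(r, m): 1.0 / (len(rewards) * len(human_models))
             for r in rewards for m in human_models}

def update(posterior, action, approved):
    """One Bayesian update on a single piece of approval feedback."""
    unnormalized = {}
    for (r, m), p in posterior.items():
        likelihood = human_models[m](r, action)
        unnormalized[(r, m)] = p * (likelihood if approved else 1.0 - likelihood)
    total = sum(unnormalized.values())
    return {k: v / total for k, v in unnormalized.items()}

posterior = update(posterior, action="make_clip", approved=True)

# Marginal belief over rewards, averaging over the human-model uncertainty.
for r in rewards:
    print(r, round(sum(p for (rr, _), p in posterior.items() if rr == r), 3))
```

The point is only that the belief over rewards never conditions on a single human model being exactly right; whether something like this scales, or whether a non-Bayesian approach is needed instead, is exactly the open question.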

I also share your concern about the human-evaluating-atomic-actions failure mode. Another challenge with this line of research is that it implicitly assumes a particular scale, when in reality that scale is just one point on a hierarchy. For example, the CIRL paper treats “make paperclips” as an atomic action. But we could easily increase the scale (“construct and operate a paperclip factory”) or decrease it (“bend this piece of wire” or even “send a bit of information to this robot arm”). “Make paperclips” was probably chosen because it’s the most natural level of abstraction for a human, but how do we figure that out in general? I think this is an unsolved challenge for reward learning (including this post).

My claim wasn’t that CIRL itself belongs to a “near-corrigible” class, but rather that some of the non-corrigible behaviors described in the post do. (For example, R no-op’ing until it gets more information rather than immediately shutting off when told to.) This isn’t sufficient to claim that optimal R behavior in CIRL games always or even often has this type, just that it possibly does and therefore I think it’s worth figuring out whether this is a coherent behavior class or not. Do you disagree with that?

Comment by rachelAF on [deleted post] 2022-12-11T19:28:37.587Z

Thanks for the clarification! From OpenAI's announcement, it looks like this ranking only occurs during the finetuning portion of training (Step 2). But the user doesn't have the opportunity to provide this feedback after deployment. So are you suggesting that ChatGPT gets aligned to the values of the human contractor(s) that provide data during finetuning, and then carries these values forward when interacting with users? I'm asking because one of the key benefits of CIRL games (also called "assistance games") is that they allow the AI to continuously update towards the user's values, without freezing for deployment, and I don't fully understand the connection here.

Comment by rachelAF on [deleted post] 2022-12-11T03:58:41.869Z

Where does the reward in step 1 come from? Is it assigned by H? Is it determined by an outside observer? Is the reward function somehow hardcoded into the context?

Comment by Rachel Freedman (rachelAF) on Does a LLM have a utility function? · 2022-12-09T21:02:09.637Z · LW · GW

I think that the significant distinction is whether an AI system has a utility function that it is attempting to optimize at test time. An LLM does have a utility function, in that there is an objective function written in its training code that it uses to calculate gradients and update its parameters during training. However, once it is deployed, its parameters are frozen and its score on this objective function can no longer impact its behavior. In that sense, I don't think that it makes sense to think of an LLM as "trying to" optimize this objective after deployment. However, this answer could change in response to changes in model training strategy, which is why this distinction is significant.
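As a simplified picture of that distinction (assuming a PyTorch-style setup; the function names are mine), the objective only ever appears inside the training step, where it shapes the parameters through gradient updates; at deployment the parameters are frozen and no objective is computed anywhere:

```python
# Simplified sketch: the "utility function" is the training loss below, and it only
# influences the model through gradient updates. `model` is assumed to be any module
# that maps input token ids to next-token logits.
import torch
import torch.nn.functional as F

def training_step(model, optimizer, input_ids, target_ids):
    logits = model(input_ids)                           # (batch, seq, vocab)
    loss = F.cross_entropy(                             # the training objective
        logits.view(-1, logits.size(-1)), target_ids.view(-1)
    )
    loss.backward()                                     # objective -> gradients...
    optimizer.step()                                    # ...-> parameter updates
    optimizer.zero_grad()
    return loss.item()

@torch.no_grad()                                        # deployment: parameters frozen,
def generate_next_token(model, input_ids):              # no loss computed anywhere
    logits = model(input_ids)
    return logits[:, -1].argmax(dim=-1)
```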

Comment by Rachel Freedman (rachelAF) on AI Safety Seems Hard to Measure · 2022-12-09T20:52:31.145Z · LW · GW

Unfortunately, I think that this problem extends up a meta-level as well: AI safety research is extremely difficult to evaluate. There's extensive debate about which problems and techniques safety researchers should focus on, even extending to debates about whether particular research directions are actively harmful. The object- and meta-level problems are related -- if we had an easy-to-evaluate alignment metric, we could check whether various alignment strategies lead to models scoring higher on this metric, and use that as a training signal for alignment research itself. 

This makes me wonder: are there proxy metrics that we can use? By "proxy metric", I mean something that doesn't necessarily fully align with what we want, but is close or often correlated. Proxy metrics are gameable, so we can't really trust their evaluations of powerful algorithmic optimizers. But human researchers are less good at optimizing things, so there might exist proxies that can be a good enough guiding signal for us.

One possible such proxy signal is "community approval", operationalized as something like forum comments. I think this is a pretty shoddy signal, not least because community feedback often directly conflicts. Another is evaluations from successful established researchers, which is more informative but less scalable (and depends on your operationalization of "successful" and "established"). 

Comment by Rachel Freedman (rachelAF) on [Link] Why I’m optimistic about OpenAI’s alignment approach · 2022-12-09T20:30:54.139Z · LW · GW

Thank you for writing this! I've been trying to consolidate my own thoughts around reward modeling and theoretical v. empirical alignment research for a long time, and this post and the discussion has been very helpful. I'll probably write that up as a separate post later, but for now I have a few questions:

  1. What does the endgame look like? The post emphasizes that we only need an MVP alignment research AI, so it can be relatively unintelligent, narrow, myopic, non-agenty, etc. This means that it poses less capabilities risk and is easier to evaluate, both of which are great. But eventually we may need to align AGI that is none of these things. Is the idea that this alignment research AI will discover/design alignment techniques that a) human researchers can evaluate and b) will work on future AGI? Or do we start using other narrowly aligned models to evaluate it at some point? How do we convince ourselves that all of this is working towards the goal of "aligned AI" and not "looks good to alignment researchers"?
  2. Related to that, the post says “the burden of proof is always on showing that a new system is sufficiently aligned” and “We have to mistrust what the model is doing anyway and discard it if we can’t rigorously evaluate it.” What might this proof or rigorous evaluation look like? Is this something that can be done with empirical alignment work? 
  3. I agree that the shift in AI capabilities paradigms from DRL agents playing games to LLMs generating text seems good for alignment, in part because LLM training could teach human values and introduce an ontology for understanding human preferences and communication. But clearly LLM pretraining doesn't teach all human values -- if it did, then RLHF finetuning wouldn't be required at all. How can we know what values are "missing" from pre-training, and how can we tell if/when RLHF has filled in the gap? Is it possible to verify that model alignment is "good enough"? 
  4. Finally, this might be more of an objection than a question, but... One of my major concerns is that automating alignment research also helps automate capabilities research. One of the main responses to this in the post is that "automated ML research will happen anyway." However, if this is true, then why is OpenAI safety dedicating substantial resources to it? Wouldn't it be better to wait for ML researchers to knock that one out, and spend the interim working on safety-specific techniques (like interpretability, since it's mentioned a lot in the post)? If ML researchers won't do that satisfactorily, then isn't dedicating safety effort to it differentially advancing capabilities?

Comment by Rachel Freedman (rachelAF) on AGI Safety FAQ / all-dumb-questions-allowed thread · 2022-06-24T16:18:59.619Z · LW · GW

As an AI researcher, my favourite way to introduce other technical people to AI Alignment is Brian Christian’s book “The Alignment Problem” (particularly section 3). I like that it discusses specific pieces of work, with citations to the relevant papers, so that technical people can evaluate things for themselves as they’re interested. It also doesn’t assume any prior AI safety familiarity from the reader (and brings you into it slowly, starting with mainstream bias concerns in modern-day AI).

Comment by Rachel Freedman (rachelAF) on AGI Safety FAQ / all-dumb-questions-allowed thread · 2022-06-24T16:10:03.155Z · LW · GW

I work on AI safety via learning from human feedback. In response to your three ideas:

  • Uniformly random human noise actually isn’t much of a problem. It becomes a problem when the human noise is systematically biased in some way, and the AI doesn’t know exactly what that bias is. Another core problem (which overlaps with the human bias) is that the AI must use a model of human decision-making to back out human values from human feedback/behavior/interaction, etc. If this model is wrong, even slightly (for example, the AI doesn’t realize that the noise is biased along one axis), the AI can infer incorrect human values. (There’s a toy illustration of the uniform-vs-biased distinction after this list.)

  • I’m working on it, stay tuned.

  • Our most capable AI systems require a LOT of training data, and it’s already expensive to generate enough human feedback for training. Limiting the pool of human teachers to trusted experts, or providing pre-training to all of the teachers, would make this even more expensive. One possible way out of this is to train AI systems themselves to give feedback, in imitation of a small trusted set of human teachers.
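Here is a toy illustration (my own, with made-up numbers) of the point in the first bullet: uniformly random noise mostly washes out with enough feedback, while a systematic bias that the AI doesn’t model pushes it toward the wrong inference.

```python
# Toy illustration: inferring which of two options the human prefers by majority
# vote over many noisy feedback labels. Noise rates are made-up placeholders.
import random

random.seed(0)
TRUE_PREFERRED = 1   # the human really prefers option 1
N = 10_000

def uniform_noise_label():
    # 30% of the time the human answers at random -- noisy, but unbiased.
    return random.choice([0, 1]) if random.random() < 0.3 else TRUE_PREFERRED

def biased_label():
    # The human systematically drifts toward option 0 (say, it is listed first),
    # reporting 0 on 60% of queries regardless of their true preference.
    return 0 if random.random() < 0.6 else TRUE_PREFERRED

for name, labeler in [("uniform noise", uniform_noise_label),
                      ("systematic bias", biased_label)]:
    votes_for_1 = sum(labeler() for _ in range(N))
    inferred = 1 if votes_for_1 > N / 2 else 0
    print(f"{name}: inferred preference = {inferred} (true preference = {TRUE_PREFERRED})")
```

An AI that correctly modeled the bias could of course correct for it; the problem is precisely that it usually doesn’t know the bias exactly.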

Comment by Rachel Freedman (rachelAF) on AGI Safety FAQ / all-dumb-questions-allowed thread · 2022-06-24T15:54:46.393Z · LW · GW

In reward learning research, it’s common to represent the AI’s estimate of the true reward function as a distribution over possible reward functions, which I think is analogous to what you are describing. It’s also common to define optimal behavior, given a distribution over reward functions, as that behavior which maximizes the expected reward under that distribution. This is mathematically equivalent to optimizing a single reward function equal to the expectation of the distribution. So, this helps in that the AI is optimizing a reward function that is more likely to be “aligned” than one at an extreme end of the distribution. However, this doesn’t help with the problems of optimizing a single fixed reward function.
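Spelling out the equivalence (notation is mine, not from any particular paper): by linearity of expectation, maximizing expected reward under a distribution P over reward functions is the same as maximizing the single "mean" reward function.

```latex
% Sketch of the equivalence, in generic notation: \pi is a policy, \tau a trajectory,
% P the distribution over candidate reward functions R.
\[
\arg\max_{\pi} \; \mathbb{E}_{R \sim P}\bigl[\, \mathbb{E}_{\tau \sim \pi}[\, R(\tau) \,] \,\bigr]
\;=\;
\arg\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\bigl[\, \bar{R}(\tau) \,\bigr],
\qquad \text{where } \bar{R}(\tau) = \mathbb{E}_{R \sim P}[\, R(\tau) \,].
\]
```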

Comment by Rachel Freedman (rachelAF) on AGI Safety FAQ / all-dumb-questions-allowed thread · 2022-06-24T15:42:21.287Z · LW · GW

Consciousness, intelligence and human-value-alignment are probably mostly orthogonal, so I don’t think that solving the hard problem of consciousness would directly impact AGI alignment research. (Perhaps consciousness requires general intelligence, so understanding how consciousness works on a mechanistic level might dramatically accelerate timelines? But that’s highly speculative.)

However, if solving the hard problem of consciousness leads us to realize that some of our AI systems are conscious, then we have a whole new set of moral patients. (As an AGI researcher) I personally would become much more concerned with machine ethics in that case, and I suspect others would as well.

Comment by Rachel Freedman (rachelAF) on AGI Safety FAQ / all-dumb-questions-allowed thread · 2022-06-24T15:33:50.483Z · LW · GW

Short answer: Yep, probably.

Medium answer: If AGI has components that look like our most capable modern deep learning models (which I think is quite likely if it arrives in the next decade or two), it will probably be very resource-intensive to run, and orders of magnitude more expensive to train. This is relevant because it impacts who has the resources to develop AGI (large companies and governments; likely not individual actors), secrecy (it’s more difficult to secretly acquire a massive amount of compute than it is to secretly boot up an AGI on your laptop; this may even enable monitoring and regulation), and development speed (if iterations are slower and more expensive, it slows down development).

If you’re interested in further discussion of possible compute costs for AGI (and how this affects timelines), I recommend reading about bio anchors.

Comment by Rachel Freedman (rachelAF) on AGI Safety FAQ / all-dumb-questions-allowed thread · 2022-06-15T03:47:55.718Z · LW · GW

What can I read/look at to skill up with "alignment."

A good place to start is the "AGI Safety Fundamentals" course reading list, which includes materials from a diverse set of AI safety research agendas. Reading this can help you figure out who in this space is doing what, and which of it you find useful. You can also join an official iteration of the course if you want to discuss the materials with a cohort and a facilitator (you can register interest for that here). The AI Alignment slack is another good place to discuss these and other materials and to meet others who are interested in working on AI safety.

What dark horse AI/Alignment-focused companies are out there and would be willing to hire an outsider engineer?

I'm not sure what qualifies as "dark horse", but there are plenty of AI safety organizations interested in hiring research engineers and software engineers. For these roles, your engineering skills and safety motivation typically matter more than your experience in the community. Places off the top of my head that hire engineers for AI safety work: Redwood, Anthropic, FAR, OpenAI, DeepMind. I'm sure I've missed others, though, so look around! These sorts of opportunities are also usually posted on the 80k job board and in AI Alignment slack.

Comment by Rachel Freedman (rachelAF) on [RETRACTED] It's time for EA leadership to pull the short-timelines fire alarm. · 2022-04-09T17:45:32.541Z · LW · GW

DeepMind and OpenAI both already employ teams of existential-risk focused AI safety researchers. While I don't personally work on any of these teams, I get the impression from speaking to them that they are much more talent-constrained than resource-constrained.

I'm not sure how to alleviate this problem in the short term. My best guess would be free bootcamp-style training for value-aligned people who are promising researchers but lack specific relevant skills. For example, ML engineering training or formal mathematics education for junior AIS researchers who would plausibly be competitive hires if that part of their background were strengthened.

However, I don't think that offering AI safety researchers as "free consultants" to these organizations would have much impact. I doubt the organizations would accept since they already have relevant internal teams, and AI safety researchers can presumably have greater impact working within the organization than as external consultants.