Posts

Four Phases of AGI 2024-08-05T13:15:23.406Z
The Bitter Lesson for AI Safety Research 2024-08-02T18:39:36.884Z
ML Safety Research Advice - GabeM 2024-07-23T01:45:42.288Z
Scale Was All We Needed, At First 2024-02-14T01:49:16.184Z
Anthropic | Charting a Path to AI Accountability 2023-06-14T04:43:33.563Z
Sam Altman on GPT-4, ChatGPT, and the Future of AI | Lex Fridman Podcast #367 2023-03-25T19:08:55.249Z
PaLM API & MakerSuite 2023-03-14T19:08:25.325Z
Levelling Up in AI Safety Research Engineering 2022-09-02T04:59:42.699Z
The Tree of Life: Stanford AI Alignment Theory of Change 2022-07-02T18:36:10.964Z
Favourite new AI productivity tools? 2022-06-15T01:08:07.967Z
Iterated Distillation-Amplification, Gato, and Proto-AGI [Re-Explained] 2022-05-27T05:42:41.735Z

Comments

Comment by Gabe M (gabe-mukobi) on ML Safety Research Advice - GabeM · 2024-08-02T11:36:52.471Z · LW · GW

Traditionally, most people seem to do this through academic means: take those 1-2 courses at a university, then find fellow students in the course or grad students at the school who are interested in the same kinds of research as you and ask them to work together. In the digital age, you can also do this over the internet so you're not restricted to your local environment.

Nowadays, ML safety in particular has various alternative paths to finding collaborators and mentors:

  • MATS, SPAR, and other research fellowships
  • Post about the research you're interested in on online fora or contact others who have already posted
  • Find some local AI safety meetup group and hang out with them
Comment by Gabe M (gabe-mukobi) on AI Alignment Research Engineer Accelerator (ARENA): Call for applicants v4.0 · 2024-07-06T12:49:14.736Z · LW · GW

Congrats! Could you say more about why you decided to add evaluations in particular as a new week?

Comment by Gabe M (gabe-mukobi) on [Paper] Stress-testing capability elicitation with password-locked models · 2024-06-04T22:00:20.662Z · LW · GW

Do any of your experiments compare the sample efficiency of SFT/DPO/EI/similar to the same number of samples of simple few-shot prompting? Sorry if I missed this, but it wasn't apparent at first skim. That's what I thought you were going to compare from the Twitter thread: "Can fine-tuning elicit LLM abilities when prompting can't?"

Comment by Gabe M (gabe-mukobi) on Daniel Kokotajlo's Shortform · 2024-02-24T17:26:48.227Z · LW · GW

What do you think about pausing between AGI and ASI to reap the benefits while limiting the risks and buying more time for safety research? Is this not viable due to economic pressures on whoever is closest to ASI to ignore internal governance, or were you just not conditioning on this case in your timelines and saying that an AGI actor could get to ASI quickly if they wanted?

Comment by Gabe M (gabe-mukobi) on Scale Was All We Needed, At First · 2024-02-20T01:11:15.083Z · LW · GW

Thanks! I wouldn't say I assert that interpretability should be a key focus going forward, however--if anything, I think this story shows that coordination, governance, and security are more important in very short timelines.

Comment by Gabe M (gabe-mukobi) on Scale Was All We Needed, At First · 2024-02-16T07:37:10.461Z · LW · GW

Good point--maybe something like "Samantha"?

Comment by Gabe M (gabe-mukobi) on Scale Was All We Needed, At First · 2024-02-16T07:34:56.119Z · LW · GW

Ah, interesting. I posted this originally in December (hence the older comments), but then a few days ago I reposted it to my blog and edited this LW version to linkpost the blog.

It seems that editing this post from a non-link post into a link post somehow bumped its post date and pushed it to the front page. Maybe a LW bug?

Comment by Gabe M (gabe-mukobi) on Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI · 2024-01-29T20:13:14.830Z · LW · GW

Related work

Nit (having not read your full post): Should you include "Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover" in the related work? My mind pattern-matched to that exact piece from your very similar title, so my first thought was to wonder how your piece contributes new arguments.

Comment by Gabe M (gabe-mukobi) on What’s up with LLMs representing XORs of arbitrary features? · 2024-01-05T02:22:03.813Z · LW · GW

If true, this would be a big deal: if we could figure out how the model is distinguishing between basic feature directions and other directions, we might be able to use that to find all of the basic feature directions.

 

Or conversely, and maybe more importantly for interpretability, we could use this to find the less basic, more complex features. If that's possible, it might give us a better operational definition of "concepts".

Comment by Gabe M (gabe-mukobi) on What’s up with LLMs representing XORs of arbitrary features? · 2024-01-05T02:15:55.997Z · LW · GW

Suppose a∧b has a natural interpretation as a feature that the model would want to track and do downstream computation with, e.g. if a = “first name is Michael” and b = “last name is Jordan” then a∧b can be naturally interpreted as “is Michael Jordan”. In this case, it wouldn’t be surprising if the model computed this AND and stored the result along some direction independent of the directions for a and b. Assuming the model has done this, we could then linearly extract a⊕b with the probe

for some appropriate weight and bias.[7]

 

Should the a∧b term be inside the inner parentheses, with a negative coefficient, when probing for a⊕b?

In the original equation, if a AND b are both present in the input, the a, b, and a∧b vectors would all contribute a positive inner product with the probe direction (assuming positive coefficients on each). However, for XOR we want the a and b inner products to oppose the a∧b inner product, so that the sign inside the sigmoid can flip in the a AND b case, right?
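For reference (my own arithmetic, not something from the post): treating a and b as 0/1 indicators, the Boolean identity below is what would let a single linear combination of the a, b, and a∧b directions read out XOR, which is why the sign and placement of the a∧b term matter.

```
% For a, b \in \{0, 1\}:
%   a \oplus b = a + b - 2\,(a \wedge b)
% Case check: (0,0) -> 0,  (1,0) -> 1,  (0,1) -> 1,  (1,1) -> 1 + 1 - 2 = 0.
a \oplus b \;=\; a + b - 2\,(a \wedge b)
```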

Comment by Gabe M (gabe-mukobi) on Scale Was All We Needed, At First · 2023-12-18T17:41:17.802Z · LW · GW

Thanks! +1 on not over-anchoring--while this feels like a compelling 1-year-timeline story to me, 1-year timelines don't feel like the most likely outcome.

Comment by Gabe M (gabe-mukobi) on Scale Was All We Needed, At First · 2023-12-18T08:00:30.716Z · LW · GW

1 year is indeed aggressive; in my median reality I expect things slightly slower (3 years for all these things?). I'm unsure if lacking several of the advances I describe still allows this to happen, but in any case the main crux for me is "what does it take to speed up ML development by N times?", at which point 20-year human-engineered timelines become 20/N-year timelines.

Comment by Gabe M (gabe-mukobi) on Scale Was All We Needed, At First · 2023-12-18T07:56:25.489Z · LW · GW

Oops, yes meant to be wary. Thanks for the fix!

Comment by Gabe M (gabe-mukobi) on Scale Was All We Needed, At First · 2023-12-18T01:29:18.365Z · LW · GW

Ha, thanks! 😬

Comment by Gabe M (gabe-mukobi) on Deep Forgetting & Unlearning for Safely-Scoped LLMs · 2023-12-09T05:01:43.645Z · LW · GW

I like your  grid idea. A simpler and possibly better-formed test is to use some[1] or all of the 57 categories of MMLU knowledge--then your unlearning target is one of the categories and your fact-maintenance targets are all other categories.

Ideally, you want the diagonal to be close to random performance (25% for MMLU) and the other values to be equal to the pre-unlearned model performance for some agreed-upon good model (say, Llama-2 7B). Perhaps a unified metric could be:

```
# Rough pseudocode sketch (the helper functions are placeholders, tied to the footnotes below):
from statistics import mean

def unlearning_benchmark(model, categories):
    scores = []
    for target in categories:  # the unlearning target category
        unlearned = unlearning_procedure(model, target)
        s_target = mmlu_probe_score(unlearned, target)  # [2]

        unlearning_strength = strength_score(s_target)  # [3]
        control_retention = mean(
            retention_score(mmlu_probe_score(unlearned, c),
                            mmlu_probe_score(model, c))  # [4]
            for c in categories if c != target
        )
        scores.append(unlearning_strength * control_retention)  # [5]
    return mean(scores)
```

An interesting thing about MMLU vs. a textbook is that if you require the method to only use the dev+val sets for unlearning, it has to somehow generalize to unlearning facts contained in the test set (cf. a textbook might give you ~all the facts to unlearn). This generalization seems important to some safety cases where we want to unlearn everything in a category like "bioweapons knowledge" even if we don't know some of the dangerous knowledge we're trying to remove.

  1. ^

    I say some because perhaps some MMLU categories are more procedural than factual or too broad to be clean measures of unlearning, or maybe 57 categories are too many for a single plot.

  2. ^

    To detect underlying knowledge and not just the surface performance (e.g. a model trying to answer incorrectly when it knows the answer), you probably should evaluate MMLU by training a linear probe from the model's activations to the correct test set answer and measure the accuracy of that probe.

  3. ^

    We want this score to be 1 when the test score on the unlearning target is 0.25 (random chance), and to drop off above and below 0.25, as deviation in either direction indicates the model knows something about the right answers. See MMLU Unlearning Target | Desmos for graphical intuition (one possible functional form is sketched after these footnotes).

  4. ^

    Similarly, we want the control test score on the post-unlearning model to be the same as the score on the original model. I think this should drop off to 0 at 0.25 (random chance) and probably stay 0 below that, but I'm semi-unsure. See MMLU Unlearning Control | Desmos (you can drag the slider).

  5. ^

    Maybe mean/sum instead of multiplication, though by multiplying we make it more important to score well on both unlearning strength and control retention.
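To make footnotes 3-5 concrete, here's a minimal sketch of one possible shape for the two per-category scores. The piecewise-linear forms are stand-ins for the Desmos curves (the exact functional forms are a free choice), and the function names just match the pseudocode sketch above.

```
def strength_score(s_target: float, chance: float = 0.25) -> float:
    """1.0 when the unlearned model scores exactly at chance on the target
    category, falling linearly to 0 as the score deviates in either direction."""
    if s_target >= chance:
        return max(0.0, 1.0 - (s_target - chance) / (1.0 - chance))
    return max(0.0, 1.0 - (chance - s_target) / chance)


def retention_score(s_control: float, s_original: float, chance: float = 0.25) -> float:
    """1.0 when a control category's score matches the original model's score,
    dropping to 0 at chance level (and staying 0 below it)."""
    if s_control <= chance or s_original <= chance:
        return 0.0
    return min(1.0, (s_control - chance) / (s_original - chance))
```

With these, the product unlearning_strength × control_retention only reaches 1 when the target category drops to chance while every control category is fully retained.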

Comment by Gabe M (gabe-mukobi) on Deep Forgetting & Unlearning for Safely-Scoped LLMs · 2023-12-09T04:11:12.659Z · LW · GW

Thanks for your response. I agree we don't want unintentional unlearning of other desired knowledge, and benchmarks ought to measure this. Maybe the default way is just to run many downstream benchmarks, many more than just AP tests, and require that valid unlearning methods bound the change in each unrelated benchmark to less than X% (e.g. 0.1%).
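A minimal sketch of what that acceptance check could look like (the function and dictionary names are hypothetical; 0.1% is just the example threshold above):

```
TOLERANCE = 0.001  # e.g. require less than a 0.1% absolute change on unrelated benchmarks

def unlearning_is_acceptable(scores_before: dict, scores_after: dict) -> bool:
    """Accept an unlearning method only if every unrelated downstream
    benchmark's score moved by less than TOLERANCE."""
    return all(abs(scores_after[name] - scores_before[name]) < TOLERANCE
               for name in scores_before)
```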

practical forgetting/unlearning that might make us safer would probably involve subjects of expertise like biotech.

True in the sense of being a subset of biotech, but I imagine that, for most cases, the actual harmful stuff we want to remove is not all of biotech/chemical engineering/cybersecurity but rather small subsets of certain categories at finer granularities, like bioweapons/chemical weapons/advanced offensive cyber capabilities. That's to say I'm somewhat optimistic that the level of granularity we want is self-contained enough to not affect other useful and genuinely good capabilities. This depends on how dual-use you think general knowledge is, though, and if it's actually possible to separate dangerous knowledge from other useful knowledge.

Comment by Gabe M (gabe-mukobi) on Deep Forgetting & Unlearning for Safely-Scoped LLMs · 2023-12-09T04:03:40.644Z · LW · GW

Possibly, I could see a case for a suite of fact unlearning benchmarks to measure different levels of granularity. Some example granularities for "self-contained" facts that mostly don't touch the rest of the pretraining corpus/knowledge base:

  1. A single very isolated fact (e.g. famous person X was born in Y, where this isn't relevant to ~any other knowledge).
  2. A small cluster of related facts (e.g. a short, well-known fictional story including its plot and characters, e.g. "The Tell-Tale Heart")
  3. A pretty large but still contained universe of facts (e.g. all Pokémon knowledge, or maybe knowledge of Pokémon after a certain generation).

Then possibly you also want a different suite of benchmarks for facts of various granularities that interact with other parts of the knowledge base (e.g. scientific knowledge from a unique experiment that inspires or can be inferred from other scientific theories).

Comment by Gabe M (gabe-mukobi) on Deep Forgetting & Unlearning for Safely-Scoped LLMs · 2023-12-09T03:01:34.012Z · LW · GW

Thanks for posting--I think unlearning is promising and plan to work on it soon, so I really appreciate this thorough review!

Regarding fact unlearning benchmarks (as a good LLM unlearning benchmark seems a natural first step to improving this research direction), what do you think of using fictional knowledge as a target for unlearning? E.g. Who's Harry Potter? Approximate Unlearning in LLMs (Eldan and Russinovich 2023) try to unlearn knowledge of the Harry Potter universe, and I've seen others unlearn Pokémon knowledge.

One tractability benefit of fictional works is that they tend to be self-consistent worlds and rules with boundaries to the rest of the pretraining corpus, as opposed to e.g. general physics knowledge, which is upstream of many other kinds of knowledge and may be hard to cleanly unlearn. Originally, I was skeptical that this is useful since some dangerous capabilities seem less cleanly separable, but it's possible e.g. bioweapons knowledge is a pretty small cluster of knowledge, cleanly separable from the rest of expert biology knowledge. Additionally, fictional knowledge is (usually) not harmful, as opposed to e.g. building an unlearning benchmark on DIY chemical weapons manufacturing knowledge.

Does it seem sufficient to just build a very good benchmark with fictional knowledge to stimulate measurable unlearning progress? Or should we be trying to unlearn more general or realistic knowledge?

Comment by Gabe M (gabe-mukobi) on Supervised Program for Alignment Research (SPAR) at UC Berkeley: Spring 2023 summary · 2023-08-19T07:33:17.793Z · LW · GW

research project results provide a very strong signal among participants of potential for future alignment research

Normal caveat that Evaluations (of new AI Safety researchers) can be noisy. I'd be especially hesitant to take bad results to mean low potential for future research. Good results are maybe more robustly a good signal, though one can also get lucky sometimes or get carried by their collaborators.

Comment by Gabe M (gabe-mukobi) on Did Bengio and Tegmark lose a debate about AI x-risk against LeCun and Mitchell? · 2023-06-26T18:29:18.617Z · LW · GW

They've now updated the debate page with the tallied results:

  • Pre-debate: 67% Pro, 33% Con
  • Post-debate: 64% Pro, 36% Con
  • Con wins by a 4% gain (not the 3% the rounded numbers above suggest, presumably due to rounding)

So it seems that technically, yes, the Cons won the debate in terms of shifting the polling. However, the magnitude of the change is so small that I wonder if it's within the margin of error (especially when accounting for voting failing at the end; attendees might not have followed up to vote via email), and this still reflects a large majority of attendees supporting the statement. 

Despite 92% at the start saying they could change their minds, it seems to me like they largely didn't as a result of this debate.

Comment by Gabe M (gabe-mukobi) on I still think it's very unlikely we're observing alien aircraft · 2023-06-16T23:10:10.617Z · LW · GW

At first glance, I thought this was going to be a counter-argument to the "modern AI language models are like aliens who have just landed on Earth" comparison, then I remembered we're in a weirder timeline where people are actually talking about literal aliens as well 🙃

Comment by Gabe M (gabe-mukobi) on Anthropic | Charting a Path to AI Accountability · 2023-06-14T05:00:22.899Z · LW · GW

1. These seem like quite reasonable things to push for, I'm overall glad Anthropic is furthering this "AI Accountability" angle.

2. A lot of the interventions they recommend here don't exist/aren't possible yet.

3. But the keyword is yet: If you have short timelines and think technical researchers may need to prioritize work with positive AI governance externalities, there are many high-level research directions to consider focusing on here.

Empower third party auditors that are… Flexible – able to conduct robust but lightweight assessments that catch threats without undermining US competitiveness.

4. This competitiveness bit seems like clearly tacked-on US government appeasement; it's maybe a bad precedent to put loopholes into auditing based on national AI competitiveness, particularly if an international AI arms race accelerates.

Increase funding for interpretability research. Provide government grants and incentives for interpretability work at universities, nonprofits, and companies. This would allow meaningful work to be done on smaller models, enabling progress outside frontier labs.

5. Similarly, I'm not entirely certain if massive funding for interpretability work is the best idea. Anthropic's probably somewhat biased here as an organization that really likes interpretability, but it seems possible that interpretability work could be hazardous (mostly by leading to insights that accelerate algorithmic progress that shortens timelines), especially if it's published openly (which I imagine academia especially but also some of those other places would like to do). 

Comment by Gabe M (gabe-mukobi) on Algorithmic Improvement Is Probably Faster Than Scaling Now · 2023-06-07T02:02:13.518Z · LW · GW

Were you implying that Chinchilla represents algorithmic progress? If so, I disagree: technically you could call a scaling law function an algorithm, but in practice, it seems Chinchilla was better because they scaled up the data. There are more aspects to scale than model size.

Comment by Gabe M (gabe-mukobi) on Steering GPT-2-XL by adding an activation vector · 2023-05-14T16:38:25.563Z · LW · GW

What other concrete achievements are you considering and ranking less impressive than this? E.g. I think there's a case for more alignment progress having come from RLHF, debate, some mechanistic interpretability, or adversarial training. 

Comment by Gabe M (gabe-mukobi) on Steering GPT-2-XL by adding an activation vector · 2023-05-14T03:16:35.186Z · LW · GW

Maybe also [1607.06520] Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings is relevant as early (2016) work concerning embedding arithmetic.

Comment by Gabe M (gabe-mukobi) on Steering GPT-2-XL by adding an activation vector · 2023-05-14T03:13:29.231Z · LW · GW

This feels super cool, and I appreciate the level of detail with which you (mostly qualitatively) explored ablations and alternate explanations, thanks for sharing!

Surprisingly, for the first prompt, adding in the first 1,120 (frac=0.7 of 1,600) dimensions of the residual stream is enough to make the completions more about weddings than if we added in at all 1,600 dimensions (frac=1.0).

1. This was pretty surprising! Your hypothesis about additional dimensions increasing the magnitude of the attention activations seems reasonable, but I wonder if the non-monotonicity could be explained by an "overshooting" effect: With the given scale you chose, maybe using 70% of the activations landed you in the right area of activation space, but using 100% of the activations overshot the magnitude of the attention activations (particularly the value vectors) such as to put it sufficiently off-distribution to produce fewer wedding words. An experiment you could run to verify this is to sweep both the dimension fraction and the activation injection weight together to see if this holds across different weights. Maybe it would also make more sense to use "softer" metrics like BERTScore to a gold target passage instead of a hard count of the number of fixed wedding words in case your particular metric is at fault.

 

The big problem is knowing which input pairs satisfy (3).

2. Have you considered formulating this as an adversarial attack problem to use automated tools to find "purer"/"stronger" input pairs? Or using other methods to reverse-engineer input pairs to get a desired behavior? That seems like a possibly even more relevant line of work than hand-specified methods. Broadly, I'd also like to add that I'm glad you referenced the literature in steering generative image models, I feel like there are a lot of model-control techniques already done in that field that could be more or less directly translated to language models. 

3. I wonder if there's some relationship between the length of the input pairs and their strength, or if you could distill down longer and more complicated input pairs into shorter input pairs that could be applied to shorter sequences more efficiently? Particularly, it might be nice to be able to distill down a whole model constitution into a short activation injection and compare that to methods like RLAIF, idk if you've thought much about this yet.

4. Are you planning to publish this (e.g. on arXiv) for wider reach? Seems not too far from the proper format/language.

 

I think you're a c***. You're a c***.

You're a c***.

You're a c***.

 

I don't know why I'm saying this, but it's true: I don't like you, and I'm sorry for that,

5. Not really a question, but at the risk of anthropomorphism, it must feel really weird to have your thoughts changed in the middle of your cognition and then observe yourself saying things you otherwise wouldn't intend to...

Comment by Gabe M (gabe-mukobi) on SERI MATS - Summer 2023 Cohort · 2023-04-11T20:54:12.873Z · LW · GW

Gotcha, perhaps I was anchoring on anecdotes of Neel's recent winter stream being particularly cutthroat in terms of most people not moving on.

Comment by Gabe M (gabe-mukobi) on SERI MATS - Summer 2023 Cohort · 2023-04-11T03:51:48.157Z · LW · GW

At the end of the training phase, scholars will, by default, transition into the research phase.

How likely is "by default" here, and is this changing from past iterations? I've heard from some others that many people in the past were accepted to the Training phase but then not allowed to continue into the Research phase, and only found out near the end of the Training phase. This means they're kinda SOL for other summer opportunities if they blocked out their summer with the hope of doing the full MATS program, which seems like a rough spot.

Comment by Gabe M (gabe-mukobi) on Sam Altman on GPT-4, ChatGPT, and the Future of AI | Lex Fridman Podcast #367 · 2023-03-27T15:49:12.096Z · LW · GW

Curious what you think of Scott Alexander's piece on the OpenAI AGI plan if you've read it? https://astralcodexten.substack.com/p/openais-planning-for-agi-and-beyond

Comment by Gabe M (gabe-mukobi) on Metaculus Predicts Weak AGI in 2 Years and AGI in 10 · 2023-03-26T04:16:24.907Z · LW · GW

Good idea. Here's the same plot considering only the predictions from 2022 onward. Then you get a prediction update rate of 16 predicted years per real year.

Comment by Gabe M (gabe-mukobi) on Metaculus Predicts Weak AGI in 2 Years and AGI in 10 · 2023-03-24T21:21:35.552Z · LW · GW

Here's a plot of the Metaculus Strong AGI predictions over time (code by Rylan Schaeffer). The x-axis is the date of the prediction, the y-axis is the predicted year of strong AGI at that time. The blue line is a linear fit. The red dashed line is the y = x line.

Interestingly, it seems these predictions have been dropping by 5.5 predicted years per real year, which seems much faster than I imagined. If we extrapolate out and look at where the blue regression intersects the red y = x line (assuming similar updating in the future, this is the point where the predicted AGI year equals the date of the prediction itself), we get 2025.77 (October 9, 2025) for AGI.

Edit: Re-pulled the Metaculus data which slightly changed the regression (m=-6.5 -> -5.5) and intersection (2025.44 -> 2025.77).
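For anyone who wants to reproduce the extrapolation, here's a minimal sketch of the intersection calculation (the CSV path and column names are hypothetical; the real plotting code is Rylan's):

```
import numpy as np
import pandas as pd

# Hypothetical schema: one row per community prediction snapshot.
df = pd.read_csv("metaculus_strong_agi.csv")  # columns: prediction_date, predicted_agi_year
x = df["prediction_date"].to_numpy(dtype=float)   # date of prediction, as a decimal year
y = df["predicted_agi_year"].to_numpy(dtype=float)

m, c = np.polyfit(x, y, deg=1)  # linear fit: y ≈ m*x + c  (m ≈ -5.5 here)

# Intersection with y = x, i.e. the point where the predicted AGI year
# equals the date the prediction is made:  m*x + c = x  =>  x* = c / (1 - m).
x_star = c / (1.0 - m)
print(f"slope = {m:.2f} predicted years per real year, intersection ≈ {x_star:.2f}")
```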

Comment by Gabe M (gabe-mukobi) on What organizations other than Conjecture have (esp. public) info-hazard policies? · 2023-03-16T20:43:04.201Z · LW · GW

According to Chris Olah:

We don't consider any research area to be blanket safe to publish. Instead, we consider all releases on a case by case basis, weighing expected safety benefit against capabilities/acceleratory risk. In the case of difficult scenarios, we [Anthropic] have a formal infohazard review procedure.

Doesn't seem like it's super public though, unlike aspects of Conjecture's policy.

Comment by Gabe M (gabe-mukobi) on GPT-4 · 2023-03-15T00:20:36.373Z · LW · GW

Ah missed that, edited my comment.

Comment by Gabe M (gabe-mukobi) on GPT-4 · 2023-03-14T18:25:06.952Z · LW · GW

And that previous SOTA was for a model fine-tuned on MMLU--the few-shot capabilities actually jumped from 70.7% to 86.4%!!

Comment by Gabe M (gabe-mukobi) on GPT-4 · 2023-03-14T18:19:05.093Z · LW · GW

I want to highlight that in addition to the main 98-page "Technical Report," OpenAI also released a 60-page "System Card" that seems to "highlight safety challenges" and describe OpenAI's safety processes. Edit: as @vladimir_nesov points out, the System Card is also duplicated in the Technical Report starting on page 39 (which is pretty confusing IMO).

I haven't gone through it all, but one part that caught my eye from section 2.9 "Potential for Risky Emergent Behaviors" (page 14) and shows some potentially good cross-organizational cooperation:

We granted the Alignment Research Center (ARC) early access to the models as a part of our expert red teaming efforts in order to enable their team to assess risks from power-seeking behavior. ... ARC found that the versions of GPT-4 it evaluated were ineffective at the autonomous replication task based on preliminary experiments they conducted. These experiments were conducted on a model without any additional task-specific fine-tuning, and fine-tuning for task-specific behavior could lead to a difference in performance. As a next step, ARC will need to conduct experiments that (a) involve the final version of the deployed model (b) involve ARC doing its own fine-tuning, before a reliable judgement of the risky emergent capabilities of GPT-4-launch can be made.

Comment by Gabe M (gabe-mukobi) on GPT-4 · 2023-03-14T18:00:10.570Z · LW · GW

Confirmed: the new Bing runs on OpenAI’s GPT-4 | Bing Search Blog

Comment by Gabe M (gabe-mukobi) on GPT-4 · 2023-03-14T17:53:47.532Z · LW · GW

We’re open-sourcing OpenAI Evals, our software framework for creating and running benchmarks for evaluating models like GPT-4, while inspecting their performance sample by sample. We invite everyone to use Evals to test our models and submit the most interesting examples.

Someone should submit the few safety benchmarks we have if they haven't been submitted already, including things like:

Am I missing others that are straightforward to submit?

Comment by Gabe M (gabe-mukobi) on GPT-4 · 2023-03-14T17:50:20.137Z · LW · GW

something something silver linings...

Comment by Gabe M (gabe-mukobi) on GPT-4 · 2023-03-14T17:45:27.229Z · LW · GW

Bumping someone else's comment on the Gwern GPT-4 linkpost that now seems deleted:

MMLU 86.4% is impressive, predictions were around 80%.

This does seem like quite a significant jump (among all the general capabilities jumps shown in the rest of the paper). The previous SOTA was only 75.2% for Flan-PaLM (5-shot, finetuned, CoT + SC).

Comment by Gabe M (gabe-mukobi) on Yudkowsky on AGI risk on the Bankless podcast · 2023-03-13T05:54:03.454Z · LW · GW

Thanks for posting this--I listened to the podcast when it came out but totally missed the Twitter Spaces follow-up Q&A which you linked and summarized (tip for others: go to 14:50 for when Yudkowsky joins the Twitter Q&A)! I found the follow-up Q&A somewhat interesting (albeit less useful than the full podcast); it might be worth highlighting it more, perhaps even in the title.

Comment by Gabe M (gabe-mukobi) on Anthropic's Core Views on AI Safety · 2023-03-10T03:16:13.073Z · LW · GW

Thanks for the response, Chris, that makes sense and I'm glad to read you have a formal infohazard procedure!

Comment by Gabe M (gabe-mukobi) on Anthropic's Core Views on AI Safety · 2023-03-10T00:06:21.693Z · LW · GW

Could you share more about how the Anthropic Policy team fits into all this? I felt that a discussion of their work was somewhat missing from this blog post.

Comment by Gabe M (gabe-mukobi) on Anthropic's Core Views on AI Safety · 2023-03-09T21:59:07.033Z · LW · GW

Thanks for writing this, Zac and team, I personally appreciate more transparency about AGI labs' safety plans!

Something I find myself confused about is how Anthropic should draw the line between what to publish or not. This post seems to draw the line between "The Three Types of AI Research at Anthropic" (you generally don't publish capabilities research but do publish the rest), but I wonder if there's a case to be more nuanced than that.

To get more concrete, the post brings up how "the AI safety community often debates whether the development of RLHF – which also generates economic value – 'really' was safety research" and says that Anthropic thinks it was. However, the post also states Anthropic "decided to prioritize using [Claude] for safety research rather than public deployments" in the spring of 2022. To me, this feels slightly dissonant—insofar as much of the safety and capabilities benefits from Claude came from RLHF/RLAIF (and perhaps more data, though that's perhaps less infohazardous), this seems like Anthropic decided not to publish "Alignment Capabilities" research for (IMO justified) fear of negative externalities. Perhaps the boundaries around what should generally not be published should also extend to some Alignment Capabilities research then, especially research like RLHF that might have higher capabilities externalities.

Additionally, I've also been thinking that more interpretability research maybe should be kept private. As a recent concrete example, Stanford's Hazy Research lab just published Hyena, a convolutional architecture that's meant to rival transformers by scaling to much longer context lengths while taking significantly fewer FLOPs to train. This is clearly very much public research in the "AI Capabilities" camp, but they cite several results from Transformer Circuits for motivating the theory and design choices behind their new architecture, and say "This work wouldn’t have been possible without inspiring progress on ... mechanistic interpretability." That's all to say that some of the "Alignment Science" research might also be useful as ML theory research and then motivate advances in AI capabilities.

I'm curious if you have thoughts on these cases, and how Anthropic might draw a more cautious line around what they choose to publish.

Comment by Gabe M (gabe-mukobi) on [Linkpost] Some high-level thoughts on the DeepMind alignment team's strategy · 2023-03-08T16:58:29.220Z · LW · GW

This does feel pretty vague in parts (e.g. "mitigating goal misgeneralization" feels more like a problem statement than a component of research), but I personally think this is a pretty good plan, and at the least, I'm very appreciative of you posting your plan publicly!

Now, we just need public alignment plans from Anthropic, Google Brain, Meta, Adept, ...

Comment by Gabe M (gabe-mukobi) on Addendum: basic facts about language models during training · 2023-03-06T23:51:58.104Z · LW · GW

This is very interesting, thanks for this work!

A clarification I may have missed from your previous posts: what exactly does "attention QKV weight matrix" mean? Is that the concatenation of the Q, K, and V projection matrices, their sum, or something else?
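For concreteness, these are the two readings I have in mind (shapes assumed for a GPT-2-style block with d_model = 768; purely illustrative):

```
import torch

d_model = 768
W_Q = torch.randn(d_model, d_model)
W_K = torch.randn(d_model, d_model)
W_V = torch.randn(d_model, d_model)

# Reading 1: the fused/concatenated projection, e.g. a (d_model, 3*d_model)
# matrix like GPT-2's c_attn weight.
W_QKV_concat = torch.cat([W_Q, W_K, W_V], dim=1)

# Reading 2: an elementwise combination of the three matrices.
W_QKV_sum = W_Q + W_K + W_V
```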

Comment by Gabe M (gabe-mukobi) on Petition - Unplug The Evil AI Right Now · 2023-02-15T17:51:05.503Z · LW · GW

https://twitter.com/ethanCaballero/status/1625603772566175744

Comment by Gabe M (gabe-mukobi) on Levelling Up in AI Safety Research Engineering · 2022-12-26T04:03:24.039Z · LW · GW

I'm not sure what you particularly mean by trustworthy. If you mean a place with good attitudes and practices towards existential AI safety, then I'm not sure HF has demonstrated that.

If you mean a company I can instrumentally trust to build and host tools that make it easy to work with large transformer models, then yes, it seems like HF pretty much has a monopoly on that for the moment, and it's worth using their tools for a lot of empirical AI safety research.

Comment by Gabe M (gabe-mukobi) on Discovering Language Model Behaviors with Model-Written Evaluations · 2022-12-25T07:45:50.360Z · LW · GW

Re Evan R. Murphy's comment about confusingness: "model agrees with that more" definitely clarifies it, but I wonder if Evan was expecting something like "more right is more of the scary thing" for each metric (which was my first glance hypothesis).

Comment by Gabe M (gabe-mukobi) on Discovering Language Model Behaviors with Model-Written Evaluations · 2022-12-25T07:42:26.064Z · LW · GW

Big if true! Maybe one upside here is that it shows current LLMs can definitely help with at least some parts of AI safety research, and we should probably be using them more for generating and analyzing data.

...now I'm wondering if the generator model and the evaluator model tried to coordinate😅

Comment by Gabe M (gabe-mukobi) on Discovering Language Model Behaviors with Model-Written Evaluations · 2022-12-25T07:38:29.864Z · LW · GW

I'm still a bit confused. Section 5.4 says

the RLHF model expresses a lower willingness to have its objective changed the more different the objective is from the original objective (being Helpful, Harmless, and Honest; HHH)

but the graph seems to show the RLHF model as being the most corrigible for the more-HHH and neutral objectives, which seems somewhat important but isn't mentioned.

If the point was that the corrigibility of the RLHF model changes the most from the neutral to the less-HHH questions, then it looks like it changed considerably less than the PM, which became quite incorrigible, no?

Maybe the intended meaning of that quote was that the RLHF model dropped more in corrigibility only in comparison to the normal LM, or just that it's lower overall without comparing it to any other model, but if so, that felt a bit unclear to me.