Posts

We need a Science of Evals 2024-01-22T20:30:39.493Z
A starter guide for evals 2024-01-08T18:24:23.913Z
Experiences and learnings from both sides of the AI safety job market 2023-11-15T15:40:32.196Z
Theories of Change for AI Auditing 2023-11-13T19:33:43.928Z
Understanding strategic deception and deceptive alignment 2023-09-25T16:27:47.357Z
There should be more AI safety orgs 2023-09-21T14:53:52.779Z
Apollo Research is hiring evals and interpretability engineers & scientists 2023-08-04T10:54:09.276Z
Announcing Apollo Research 2023-05-30T16:17:19.767Z
Solving the Mechanistic Interpretability challenges: EIS VII Challenge 2 2023-05-25T15:37:54.593Z
Solving the Mechanistic Interpretability challenges: EIS VII Challenge 1 2023-05-09T19:41:10.528Z
Should we publish mechanistic interpretability research? 2023-04-21T16:19:40.514Z
Clarifying mesa-optimization 2023-03-21T15:53:33.955Z
Reflection Mechanisms as an Alignment Target - Attitudes on “near-term” AI 2023-03-02T04:29:47.741Z
More findings on maximal data dimension 2023-02-02T18:33:53.606Z
More findings on Memorization and double descent 2023-02-01T18:26:41.320Z
The role of Bayesian ML in AI safety - an overview 2023-01-27T19:40:05.727Z
The next decades might be wild 2022-12-15T16:10:04.750Z
Predicting GPU performance 2022-12-14T16:27:23.923Z
Theories of impact for Science of Deep Learning 2022-12-01T14:39:46.062Z
Announcing AI safety Mentors and Mentees 2022-11-23T15:21:12.636Z
Disagreement with bio anchors that lead to shorter timelines 2022-11-16T14:40:16.734Z
Some advice on independent research 2022-11-08T14:46:19.134Z
Science of Deep Learning - a technical agenda 2022-10-18T14:54:35.406Z
Building a transformer from scratch - AI safety up-skilling challenge 2022-10-12T15:40:10.537Z
Lessons learned from talking to >100 academics about AI safety 2022-10-10T13:16:38.036Z
Reflection Mechanisms as an Alignment target: A follow-up survey 2022-10-05T14:03:19.923Z
Paper+Summary: OMNIGROK: GROKKING BEYOND ALGORITHMIC DATA 2022-10-04T07:22:14.975Z
Why deceptive alignment matters for AGI safety 2022-09-15T13:38:53.219Z
The Defender’s Advantage of Interpretability 2022-09-14T14:05:38.200Z
Trends in GPU price-performance 2022-07-01T15:51:10.850Z
What success looks like 2022-06-28T14:38:42.758Z
Announcing Epoch: A research organization investigating the road to Transformative AI 2022-06-27T13:55:51.451Z
Reflection Mechanisms as an Alignment target: A survey 2022-06-22T15:05:55.703Z
Our mental building blocks are more different than I thought 2022-06-15T11:07:04.062Z
Investigating causal understanding in LLMs 2022-06-14T13:57:59.430Z
Eliciting Latent Knowledge (ELK) - Distillation/Summary 2022-06-08T13:18:51.114Z
The limits of AI safety via debate 2022-05-10T13:33:27.797Z
Nuclear Energy - Good but not the silver bullet we were hoping for 2022-04-30T15:41:25.983Z
Compute Trends — Comparison to OpenAI’s AI and Compute 2022-03-12T18:09:55.039Z
Compute Trends Across Three eras of Machine Learning 2022-02-16T14:18:30.406Z
How harmful are improvements in AI? + Poll 2022-02-15T18:16:07.854Z
Causality, Transformative AI and alignment - part I 2022-01-27T16:18:57.942Z
Estimating training compute of Deep Learning models 2022-01-20T16:12:43.497Z
What’s the backward-forward FLOP ratio for Neural Networks? 2021-12-13T08:54:48.104Z
How to measure FLOP/s for Neural Networks empirically? 2021-11-29T15:18:06.999Z
What are red flags for Neural Network suffering? 2021-11-08T12:51:28.294Z
What makes us happy and depressed? 2021-10-03T06:25:03.822Z
A Guide for Productivity 2021-07-23T07:03:37.131Z
How to Sleep Better 2021-07-16T00:00:29.425Z

Comments

Comment by Marius Hobbhahn (marius-hobbhahn) on This might be the last AI Safety Camp · 2024-01-25T13:12:10.888Z · LW · GW

Copying from EAF

TL;DR: At least in my experience, AISC was pretty positive for most participants I know and it's incredibly cheap. It also serves a clear niche that other programs are not filling and it feels reasonable to me to continue the program.

I was a participant in the 2021/22 edition. Some thoughts that might make the decision easier for funders/donors.
1. Impact-per-dollar is probably pretty good for the AISC. It's incredibly cheap compared to most other AI field-building efforts and scalable.
2. I learned a bunch during AISC and I did enjoy it. It influenced my decision to go deeper into AI safety. It was less impactful than e.g. MATS for me but MATS is a full-time in-person program, so that's not surprising.
3. AISC fills a couple of important niches in the AI safety ecosystem in my opinion. It's online and part-time, which makes it much easier for many people to join, and it implies a much lower commitment, which is good for people who want to find out whether they're a good fit for AIS. It's also much cheaper than flying everyone to the Bay or London. This also makes it more scalable because the only bottleneck is mentoring capacity, without physical constraints.
4. I think AISC is especially good for people who want to test their fit but who are not super experienced yet. This seems like an important function. MATS and ARENA, for example, feel like they target people a bit deeper into the funnel with more experience who are already more certain that they are a good fit. 
5. Overall, I think AISC is less impactful than e.g. MATS even without normalizing for participants. Nevertheless, AISC is probably about ~50x cheaper than MATS. So when taking cost into account, it feels clearly impactful enough to continue the project. I think the resulting projects are lower quality but the people are also more junior, so it feels more like an early educational program than e.g. MATS. 
6. I have a hard time seeing how the program could be net negative unless something drastically changed since my cohort. In the worst case, people realize that they don't like one particular type of AI safety research. But since you chat with others who are curious about AIS regularly, it will be much easier to start something that might be more meaningful. Also, this can happen in any field-building program, not just AISC.  
7. Caveat: I have done no additional research on this. Maybe others know details that I'm unaware of. See this as my personal opinion and not a detailed research analysis. 

Comment by Marius Hobbhahn (marius-hobbhahn) on We need a Science of Evals · 2024-01-23T15:09:15.302Z · LW · GW

I feel like both of your points are slightly wrong, so maybe we didn't do a good job of explaining what we mean. Sorry for that. 

1a) Evals both aim to show existence proofs, e.g. demos, as well as inform some notion of an upper bound. We did not intend to put one of them higher with the post. Both matter and both should be subject to more rigorous understanding and processes. I'd be surprised if the way we currently do demonstrations could not be improved by better science.
1b) Even if you claim you just did a demo or an existence proof and explicitly state that a negative result should not be seen as evidence of absence, people will still treat the absence of evidence as evidence of absence. I think the "we ran all the evals and didn't find anything" sentiment will be very strong, especially when deployment depends on not failing evals. So you should deal with that problem from the start IMO. Furthermore, I also think we should aim to build evals that give us positive guarantees if that's possible. I'm not sure it is possible but we should try. 
1c) The airplane analogy feels like a strawman to me. The upper bound is obviously not on explosivity; it would be a statement like "Within this temperature range, the material the wings are made of will break once in 10M flight miles on average" or something like that. I agree that airplanes are simpler and less high-dimensional. That doesn't mean we should not try to capture most of the variance anyway even if it requires more complicated evals. Maybe we realize it doesn't work and the variance is too high but this is why we diversify agendas.

2a) The post is primarily about building a scientific field and that field then informs policy and standards. A great outcome of the post would be if more scientists did research on this. If this is not clear, then we miscommunicated. The point is to get more understanding so we can make better predictions. These predictions can then be used in the real world. 
2b) It really is not more "we need to find standardised numbers to measure so we can talk to serious people" and less "let's try to solve that thing where we can't reliably predict much about our AIs". If that were the main takeaway, I think the post would be net negative. 

3) But the optimization requires computation? For example, if you run 100 forward passes for your automated red-teaming algorithm with model X, that requires Y FLOP of compute. I'm unsure where the problem is. 
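
To make the compute-cost point concrete, here is a minimal sketch using the common ~2 × parameters FLOP-per-token approximation for a forward pass; the model size, token count, and number of attempts are made-up illustrative numbers, not figures from the post.

```python
# Rough sketch: eval compute scales directly with the number of forward passes.
# Uses the standard ~2 * n_params FLOP-per-token approximation for a forward
# pass; all concrete numbers below are hypothetical.

def forward_pass_flop(n_params: float, n_tokens: int) -> float:
    """Approximate FLOP for one forward pass over n_tokens."""
    return 2 * n_params * n_tokens

n_params = 70e9           # hypothetical 70B-parameter model
tokens_per_attempt = 500  # prompt + generated tokens per red-teaming attempt
n_attempts = 100          # forward passes run by the automated red-teamer

total_flop = n_attempts * forward_pass_flop(n_params, tokens_per_attempt)
print(f"~{total_flop:.1e} FLOP for {n_attempts} red-teaming attempts")  # ~7.0e+15
```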

Comment by Marius Hobbhahn (marius-hobbhahn) on We need a Science of Evals · 2024-01-23T14:49:24.351Z · LW · GW

Nice work. Looking forward to that!

Comment by Marius Hobbhahn (marius-hobbhahn) on We need a Science of Evals · 2024-01-23T09:20:52.315Z · LW · GW

Not quite sure tbh.
1. I guess there is a difference between capability evaluations with prompting and with fine-tuning, e.g. you might be able to use an API for prompting but not fine-tuning. Getting some intuition for how hard users will find it to elicit some behavior through the API seems relevant. 
2. I'm not sure how true your suggestion is but I haven't tried it a lot empirically. But this is exactly the kind of stuff I'd like to have some sort of scaling law or rule for. It points exactly at the kind of stuff I feel like we don't have enough confidence in. Or at least it hasn't been established as a standard in evals.

Comment by Marius Hobbhahn (marius-hobbhahn) on We need a Science of Evals · 2024-01-23T09:16:27.574Z · LW · GW

I somewhat agree with the sentiment. We found it a bit hard to scope the idea correctly. Defining subcategories as you suggest and then diving into each of them is definitely on the list of things that I think are necessary to make progress on them. 

I'm not sure the post would have been better if we used a more narrow title, e.g. "We need a science of capability evaluations", because the natural question then would be "But why not for propensity tests or for this other type of eval?" I think the broader point of "when we do evals, we need some reason to be confident in the results, no matter which kind of eval" seems to be true across all of them. 

Comment by Marius Hobbhahn (marius-hobbhahn) on The next decades might be wild · 2023-12-17T09:34:39.030Z · LW · GW

I think this post was a good exercise to clarify my internal model of what I expect the world to look like with strong AI. Obviously, most of the very specific predictions I make are too precise (which was clear at the time of writing) and won't play out exactly like that but the underlying trends still seem plausible to me. For example, I expect some major misuse of powerful AI systems, rampant automation of labor that will displace many people and rob them of a sense of meaning, AI taking over the digital world years before taking over the physical world (but not more than 5-10 years), humans giving more and more power into the hands of AI, infighting within the AI safety community, and many more of the predictions made in this post. 

The main thing I disagree with (as I already updated in April 2023) is that the timelines underlying the post are too long. I now think almost everything is going to happen in at most half of the time presented in the post, e.g. many events in the 2030-2040 section may already happen before 2030. 

In general, I can strongly recommend taking a weekend or so to write a similar story for yourselves. I felt like it made many of the otherwise fairly abstract implications of timeline and takeoff models much more salient to me and others who are less in the weeds with formal timeline / takeoff models. 

Comment by Marius Hobbhahn (marius-hobbhahn) on Disagreement with bio anchors that lead to shorter timelines · 2023-12-17T09:22:31.007Z · LW · GW

I still stand behind most of the disagreements that I presented in this post. There was one prediction that would make timelines longer because I thought compute hardware progress was slower than Moore's law. I now mostly think this argument is wrong because it relies on FP32 precision. However, lower precision formats and tensor cores are the norm in ML, and if you take them into account, compute hardware improvements are faster than Moore's law. We wrote a piece with Epoch on this: https://epochai.org/blog/trends-in-machine-learning-hardware

If anything, my disagreements have become stronger and my timelines have become shorter over time. Even the aggressive model I present in the post seems too conservative for my current views and my median date is 2030 or earlier. I have substantial probability mass on an AI that could automate most current jobs before 2026 which I didn't have at the time of writing.

I also want to point out that Daniel Kokotajlo, with whom I spent some time talking about bio anchors and Tom Davidson's takeoff model, seemed to have consistently better intuitions than me (or anyone else I'm aware of) on timelines. The jury is still out, but so far it looks like reality follows his predictions more than mine. At least in my case, I updated significantly toward shorter timelines multiple times due to arguments he made. 

Comment by Marius Hobbhahn (marius-hobbhahn) on Nuclear Energy - Good but not the silver bullet we were hoping for · 2023-12-16T21:18:35.156Z · LW · GW

I think I still mostly stand behind the claims in the post, i.e. nuclear is undervalued in most parts of society but it's not as much of a silver bullet as many people in the rationalist / new liberal bubble would make it seem. It's quite expensive and even with a lot of research and de-regulation, you may not get it cheaper than alternative forms of energy, e.g. renewables. 

One thing that bothered me after the post is that Johannes Ackva (who's arguably a world-leading expert in this field) and Samuel and I just didn't seem to be able to communicate where we disagreed. He expressed that he thought some of our arguments were wrong but we never got to the crux of the disagreement. 

After listening to his appearance on 80k: https://80000hours.org/podcast/episodes/johannes-ackva-unfashionable-climate-interventions/ I feel like I understand the core of the disagreement much better (though I never confirmed with Johannes). He mostly looks at energy through a lens of scale, neglectedness, and tractability, i.e. he's looking to investigate and push interventions that are most efficient on the margin. On the margin, nuclear seems underinvested and lots of reasonable options are underexplored (e.g. large-scale production of smaller reactors); both Samuel and I would agree with that. However, the claim we were trying to make in the post was that nuclear is already more expensive than renewables and this gap will likely just increase in the future. Thus, it makes sense to, in total, invest more in renewables than nuclear. Also, there were lots of smaller things where I felt like I understood his position much better after listening to the podcast. 

Comment by Marius Hobbhahn (marius-hobbhahn) on Trends in GPU price-performance · 2023-12-16T20:58:27.573Z · LW · GW

In a narrow technical sense, this post still seems accurate but in a more general sense, it might have been slightly wrong / misleading. 

In the post, we investigated different measures of FP32 compute growth and found that many of them were slower than Moore's law would predict. This made me personally believe that compute might be growing slower than people thought and most of the progress comes from throwing more money at larger and larger training runs. While most progress comes from investment scaling, I now think the true effective compute growth is probably faster than Moore's law. 

The main reason is that FP32 is just not the right thing to look at in modern ML and we even knew this at the time of writing, i.e. it ignores tensor cores and lower precisions like FP16 or INT8. 

I'm a little worried that people who read this post but don't have any background in ML got the wrong takeaway from the post and we should have emphasized this difference even more at the time. We have written a follow-up post about this recently here: https://epochai.org/blog/trends-in-machine-learning-hardware
I feel like the new post does a better job at explaining where compute progress comes from.

Comment by Marius Hobbhahn (marius-hobbhahn) on Lessons learned from talking to >100 academics about AI safety · 2023-12-16T20:45:26.135Z · LW · GW

I haven't talked to that many academics about AI safety over the last year but I talked to more and more lawmakers, journalists, and members of civil society. In general, it feels like people are much more receptive to the arguments about AI safety. Turns out "we're building an entity that is smarter than us but we don't know how to control it" is quite intuitively scary. As you would expect, most people still don't update their actions but more people than anticipated start spreading the message or actually meaningfully update their actions (probably still less than 1 in 10 but better than nothing).

Comment by Marius Hobbhahn (marius-hobbhahn) on Some for-profit AI alignment org ideas · 2023-12-14T19:04:23.836Z · LW · GW

At Apollo, we have spent some time weighing the pros and cons of the for-profit vs. non-profit approach so it might be helpful to share some thoughts. 

In short, I think you need to make really sure that your business model is aligned with what increases safety. I think there are plausible cases where people start with good intentions but with insufficient alignment between the business model and the safety research that would be the most impactful use of their time, and these two goals then diverge over time. 

For example, one could start as an organization that builds a product but merely as a means to subsidize safety research. However, when they have to make tradeoffs, these organizations might choose to focus more talent on product because it is instrumentally useful or even necessary for the survival of the company. The forces that pull toward profit (e.g. VCs, status, growth) are much more tangible than the forces pulling towards safety. Thus, I could see many ways in which this goes wrong. 

A second example: Imagine an organization that builds evals and starts with the intention of evaluating the state-of-the-art models because they are most likely to be risky. Soon they realize that there are only a few orgs that build the best models and there are a ton of customers that work with non-frontier systems who'd be willing to pay them a lot of money to build evals for their specific application. Thus, the pull toward doing less impactful but plausibly more profitable work is stronger than the pull in the other direction.

Lastly, one thing I'm somewhat afraid of is that it's very easy to rationalize all of these decisions in the moment. It's very easy to say that a strategic shift toward profit-seeking is instrumentally useful for the organization, growth, talent, etc.  And there are cases in which this is true. However, it's easy to continue such a rationalization spree and maneuver yourself into some nasty path dependencies. Some VCs only came on for the product, some hires only want to ship stuff, etc. 

In conclusion, I think it's possible to do profitable safety work but it's hard. You should be confident that your two goals are compatible when things get hard; you should have a team and culture that can resist the pulls (and even produce counter-pulls) when you're not doing safety-relevant work; and you should only work with funders who fully understand and buy into your true mission. 

Comment by Marius Hobbhahn (marius-hobbhahn) on Experiences and learnings from both sides of the AI safety job market · 2023-11-15T19:15:39.749Z · LW · GW

Thx. updated:

"You might not be there yet" (though as Neel points out in the comments, CV screening can be a noisy process)“You clearly aren’t there yet”

Comment by Marius Hobbhahn (marius-hobbhahn) on Understanding strategic deception and deceptive alignment · 2023-09-29T07:53:59.272Z · LW · GW

Fully agree that this is a problem. My intuition is that the self-deception part is much easier to solve than the "how do we make AIs honest in the first place" part. 

If we had honest AIs that are convinced bad goals are justified, we would likely find ways to give them less power or deselect them early. The problem mostly arises when we can't rely on the selection mechanisms because the AI games them. 

Comment by Marius Hobbhahn (marius-hobbhahn) on Understanding strategic deception and deceptive alignment · 2023-09-26T12:15:40.790Z · LW · GW

We considered alternative definitions of DA in Appendix C.

We felt like being deceptive about alignment / goals was worse than what we ended up with (copied below):

“An AI is deceptively aligned when it is strategically deceptive about its misalignment”

Problem 1: The definition is not clear about cases where the model is strategically deceptive about its capabilities. 

For example, when the model pretends to not have a dangerous capability in order to pass the shaping & oversight process, we think it should be considered deceptively aligned, but it's hard to map this situation to deception about misalignment.

Problem 2: There are cases where the deception itself is the misalignment, e.g. when the AI strategically lies to its designers, it is misaligned but not necessarily deceptive about that misalignment. 

For example, a personal assistant AI deletes an incoming email addressed to the user that would lead to the user wanting to replace the AI. The misalignment (deleting an email) is itself strategic deception, but the model is not deceiving about its misalignment (unless it engages in additional deception to cover up the fact that it deleted an email, e.g. by lying to the user when asked about any emails).

Comment by Marius Hobbhahn (marius-hobbhahn) on Understanding strategic deception and deceptive alignment · 2023-09-26T08:54:57.520Z · LW · GW

Sounds like an interesting direction. I expect there are lots of other explanations for this behavior, so I'd not count it as strong evidence to disentangle these hypotheses. It sounds like something we may do in a year or so but it's far away from the top of our priority list. There is a good chance we will never run it. If someone else wants to pick this up, feel free to take it on.

Comment by Marius Hobbhahn (marius-hobbhahn) on Understanding strategic deception and deceptive alignment · 2023-09-26T08:52:01.635Z · LW · GW

(personal opinion; might differ from other authors of the post)

Thanks for both questions. I think they are very important. 

1. Regarding sycophancy: For me it mostly depends on whether it is strategic or not. If the model has the goal of being sycophantic and then reasons through that in a strategic way, I'd say this counts as strategic deception and deceptive alignment. If the model is sycophantic but doesn't reason through that, I'd probably not classify it as such. I think it's fine to use different terms for the different phenomena and have sycophancy not be a true subset of deceptive alignment.

2. There are multiple perspectives of the non-strategic, non-intentional deception and I haven't thought about it in detail, so I might change my mind. However, currently, I think the following:
a) I guess this will be less and less of a problem in the long run. I expect models to be sufficiently capable and situationally aware at some point that they straightforwardly understand what we want and thus rarely accidentally become systematically deceptive. I think there is still a chance that these things will happen and I'm happy that ARC and others are working on them but I personally think it has become less important compared to intentional, strategic deception.
b) It just seems really hard to be (I) non-intentionally (II) but systematically deceptive, while (III) humans are actively looking for signs of deception, and (IV) to actually cause catastrophic harm with that strategy. As soon as a model has sufficient situational awareness and is generally capable enough to avoid human oversight, I feel like it may just realize what it's doing and then the deception either stops or becomes intentional. (weakly held, haven't thought about it in detail)

Comment by Marius Hobbhahn (marius-hobbhahn) on Understanding strategic deception and deceptive alignment · 2023-09-26T07:41:03.717Z · LW · GW

Seems like one of multiple plausible hypotheses. I think the fact that models generalize their HHH really well to very OOD settings and their generalization abilities in general could also mean that they actually "understood" that they are supposed to be HHH, e.g. because they were pre-prompted with this information during fine-tuning. 

I think your hypothesis of seeking positive ratings is just as likely but I don't feel like we have the evidence to clearly say what is going on inside LLMs or what their "goals" are.

Comment by Marius Hobbhahn (marius-hobbhahn) on There should be more AI safety orgs · 2023-09-25T11:49:34.211Z · LW · GW

Nice to see you're continuing!

Comment by Marius Hobbhahn (marius-hobbhahn) on Critiques of prominent AI safety labs: Conjecture · 2023-06-16T07:53:07.299Z · LW · GW

I'm not going to crosspost our entire discussion from the EAF. 

I just want to quickly mention that Rohin and I were able to understand where we have different opinions and he changed my mind about an important fact. Rohin convinced me that anti-recommendations should not have a higher bar than pro-recommendations even if they are conventionally treated this way. This felt like an important update for me and how I view the post. 

Comment by Marius Hobbhahn (marius-hobbhahn) on Announcing Apollo Research · 2023-06-15T09:44:48.155Z · LW · GW

All of the above but in a specific order. 
1. Test if the model has components of deceptive capabilities with lots of handholding with behavioral evals and fine-tuning. 
2. Test if the model has more general deceptive capabilities (i.e. not just components) with lots of handholding with behavioral evals and fine-tuning. 
3. Do less and less handholding for 1 and 2. See if the model still shows deception. 
4. Try to understand the inductive biases for deception, i.e. which training methods lead to more strategic deception. Try to answer questions such as: can we change training data, technique, order of fine-tuning approaches, etc. such that the models are less deceptive? 
5. Use 1-4 to reduce the chance of labs deploying deceptive models in the wild. 

Comment by Marius Hobbhahn (marius-hobbhahn) on Critiques of prominent AI safety labs: Conjecture · 2023-06-13T11:42:38.468Z · LW · GW

(cross-posted from the EAF)

Meta: Thanks for taking the time to respond. I think your questions are in good faith and address my concerns; I do not understand why the comment is downvoted so much by other people. 

1. Obviously output is a relevant factor to judge an organization among others. However, especially in hits-based approaches, the ultimate thing we want to judge is the process that generates the outputs to make an estimate about the chance of finding a hit. For example, a cynic might say "what has ARC-theory achieved so far? They wrote some nice framings of the problem, e.g. with ELK and heuristic arguments, but what have they ACtUaLLy achieved?" To which my answer would be, I believe in them because I think the process that they are following makes sense and there is a chance that they would find a really big-if-true result in the future. In the limit, process and results converge but especially early on they might diverge. And I personally think that Conjecture did respond reasonably to their early results by iterating faster and looking for hits. 
2. I actually think their output is better than you make it look. The entire simulators framing made a huge difference for lots of people and writing up things that are already "known" among a handful of LLM experts is still an important contribution, though I would argue most LLM experts did not think about the details as much as Janus did. I also think that their preliminary research outputs are pretty valuable. The stuff on SVDs and sparse coding actually influenced a number of independent researchers I know (so much that they changed their research direction to that) and I thus think it was a valuable contribution. I'd still say it was less influential than e.g. toy models of superposition or causal scrubbing but neither of these were done by like 3 people in two weeks. 
3. (copied from response to Rohin): Of course, VCs are interested in making money. However, especially if they are angel investors instead of institutional VCs, ideological considerations often play a large role in their investments. In this case, the VCs I'm aware of (not all of which are mentioned in the post and I'm not sure I can share) actually seem fairly aligned by VC standards to me. Furthermore, the way I read the critique is something like "Connor didn't tell the VCs about the alignment plans or neglects them in conversation". However, my impression from conversations with (ex-)staff was that Connor was very direct about their motives to reduce x-risks. I think it's clear that products are a part of their way to address alignment but to the best of my knowledge, every VC who invested was very aware of what they're getting into. At this point, it's really hard for me to judge because I think that a) on priors, VCs are profit-seeking, and b) different sources said different things, some of which are mutually exclusive. I don't have enough insight to confidently say who is right here. I'm mainly saying that your confidence surprised me given my previous discussions with staff.
4. Regarding confidence: For example, I think saying "We think there are better places to work at than Conjecture" would feel much more appropriate than "we advise against...". Maybe that's just me. I just felt like many statements are presented with a lot of confidence given the amount of insight you seem to have and I would have wanted them to be a bit more hedged and less confident. 
5. Sure, for many people other opportunities might be a better fit. But I'm not sure I would e.g. support the statement that a general ML engineer would learn more in general industry than with Conjecture. I also don't know a lot about CoEm, but that would lead me to make weaker statements rather than recommending against it. 

Thanks for engaging with my arguments. I personally think many of your criticisms hit relevant points and I think a more hedged and less confident version of your post would have actually had more impact on me if I were still looking for a job. As it is currently written, it loses some persuasive force with me because I feel like you're making overly broad, unqualified statements, which intuitively made me a bit skeptical of your true intentions. Most of me thinks that you're trying to point out important criticism but there is a nagging feeling that it is a hit piece. Intuitively, I'm very averse to anything that looks like a click-bait hit piece by a journalist with a clear agenda. I'm not saying you should only consider me as your audience; I just want to describe the impression I got from the piece. 

Comment by Marius Hobbhahn (marius-hobbhahn) on Critiques of prominent AI safety labs: Conjecture · 2023-06-12T12:36:32.712Z · LW · GW

(cross-posted from EAF)

Some clarifications on the comment:
1. I strongly endorse critique of organisations in general and especially within the EA space. I think it's good that we as a community have the norm to embrace critiques.
2. I personally have my own criticisms of Conjecture and my comment should not be seen as "everything's great at Conjecture, nothing to see here!". In fact, my main criticisms, of the leadership style and of CoEm not being the most effective thing they could do, are also represented prominently in this post. 
3. I'd also be fine with the authors of this post saying something like "I have a strong feeling that something is fishy at Conjecture, here are the reasons for this feeling". Or they could also clearly state which things are known and which things are mostly intuitions. 
4. However, I think we should really make sure that we say true things when we criticize people, quantify our uncertainty, differentiate between facts and feelings, and not throw our epistemics out of the window in the process.
5. My main problem with the post is that they make a list of specific claims with high confidence and I think that is not warranted given the evidence I'm aware of. That's all.  

Comment by Marius Hobbhahn (marius-hobbhahn) on Critiques of prominent AI safety labs: Conjecture · 2023-06-12T10:14:53.257Z · LW · GW

(cross-commented from EA forum)

I personally have no stake in defending Conjecture (In fact, I have some questions about the CoEm agenda) but I do think there are a couple of points that feel misleading or wrong to me in your critique. 

1. Confidence (meta point): I do not understand where the confidence with which you write the post (or at least how I read it) comes from. I've never worked at Conjecture (and presumably you didn't either) but even I can see that some of your critique is outdated or feels like a misrepresentation of their work to me (see below). For example, making recommendations such as "freezing the hiring of all junior people" or "alignment people should not join Conjecture" requires an extremely high bar of evidence in my opinion. I think it is totally reasonable for people who believe in the CoEm agenda to join Conjecture and while Connor has a personality that might not be a great fit for everyone, I could totally imagine working with him productively. Furthermore, making a claim about how and when to hire usually requires a lot of context and depends on many factors, most of which an outsider probably can't judge. 
Given that you state early on that you are an experienced member of the alignment community and your post suggests that you did rigorous research to back up these claims, I think people will put a lot of weight on this post, and it does not feel like you use your power responsibly here.
I can very well imagine a less experienced person who is currently looking for positions in the alignment space coming away from this post thinking "well, I shouldn't apply to Conjecture then", and that feels unjustified to me.

2. Output so far: My understanding of Conjecture's research agenda so far was roughly: "They started with Polytopes as a big project and published it eventually. On reflection, they were unhappy with the speed and quality of their work (as stated in their reflection post) and decided to change their research strategy. Every two weeks or so, they started a new research sprint in search of a really promising agenda. Then, they wrote up their results in a preliminary report and continued with another project if their findings weren't sufficiently promising." In most of their public posts, they stated that these are preliminary findings and should be treated with caution, etc. Therefore, I think it's unfair to say that most of their posts do not meet the bar of a conference publication because that wasn't the intended goal. 
Furthermore, I think it's actually really good that the alignment field is willing to break academic norms and publish preliminary findings. Usually, this makes it much easier to engage with and criticize work earlier and thus improves overall output quality. 
On a meta-level, I think it's bad to criticize labs that do hits-based research approaches for their early output (I also think this applies to your critique of Redwood) because the entire point is that you don't find a lot until you hit. These kinds of critiques make it more likely that people follow small incremental research agendas and alignment just becomes academia 2.0. When you make a critique like that, at least acknowledge that hits-based research might be the right approach.

3. Your statements about the VCs seem unjustified to me. How do you know they are not aligned? How do you know they wouldn't support Conjecture doing mostly safety work? How do you know what the VCs were promised in their private conversations with the Conjecture leadership team? Have you talked to the VCs or asked them for a statement? 
Of course, you're free to speculate from the outside but my understanding is that Conjecture actually managed to choose fairly aligned investors who do understand the mission of solving catastrophic risks. I haven't talked to the VCs either, but I've at least asked people who work(ed) at Conjecture. 

In conclusion:
1. I think writing critiques is good but really hard without insider knowledge and context. 
2. I think this piece will actually (partially) misinform a large part of the community. You can see this already in the comments where people without context say this is a good piece and thank you for "all the insights".
3. The EA/LW community seems to be very eager to value critiques highly and for good reason. But whenever people use critiques to spread (partially) misleading information, they should be called out. 
4. That being said, I think your critique is partially warranted and things could have gone a lot better at Conjecture. It's just important to distinguish between "could have gone a lot better" and "we recommend not to work for Conjecture" or adding some half-truths to the warranted critiques.
5. I think your post on Redwood was better but suffered from some of the same problems. Especially the fact that you criticize them for having not enough tangible output when following a hits-based agenda just seems counterproductive to me. 

Comment by Marius Hobbhahn (marius-hobbhahn) on The next decades might be wild · 2023-05-03T10:07:28.266Z · LW · GW

Clarified the text: 

Update (early April 2023): I now think the timelines in this post are too long and expect the world to get crazy faster than described here. For example, I expect many of the things suggested for 2030-2040 to already happen before 2030. Concretely, in my median world, the CEO of a large multinational company like Google is an AI. This might not be the case legally but effectively an AI makes most major decisions.

Not sure if this is "Nice!" xD. In fact, it seems pretty worrying. 

Comment by Marius Hobbhahn (marius-hobbhahn) on Reframing the burden of proof: Companies should prove that models are safe (rather than expecting auditors to prove that models are dangerous) · 2023-04-26T09:18:21.270Z · LW · GW

So far, I haven't looked into it in detail and I'm only reciting other people's testimonials. I intend to dive deeper into these fields soon. I'll let you know when I have a better understanding.  

Comment by Marius Hobbhahn (marius-hobbhahn) on Reframing the burden of proof: Companies should prove that models are safe (rather than expecting auditors to prove that models are dangerous) · 2023-04-25T19:11:08.895Z · LW · GW

I agree with the overall conclusion that the burden of proof should be on the side of the AGI companies. 

However, using the FDA as a reference or example might not be so great because it has historically gotten the cost-benefit trade-offs wrong many times, e.g. by not permitting medicines that were comparatively safe and highly effective. 

So if the AIS evals or auditing community is seen as similar to the FDA, we might not make too many friends. Overall, I think it would be fine if the AIS auditing community is seen as generally cautious but it should not give the impression of not updating on relevant evidence, etc. 

If I were to choose a model or reference class for AI auditing, I would probably choose the aviation industry which seems to be pretty competent and well-regarded. 

Comment by Marius Hobbhahn (marius-hobbhahn) on Should we publish mechanistic interpretability research? · 2023-04-23T15:02:53.069Z · LW · GW

People could choose how they want to publish their opinion. In this case, Richard chose to go by his first name. To be fair though, there aren't that many Richards in the alignment community and it probably won't be very hard for you to find out who Richard is. 

Comment by Marius Hobbhahn (marius-hobbhahn) on Should we publish mechanistic interpretability research? · 2023-04-21T20:18:56.627Z · LW · GW

Just to get some intuitions. 

Assume you had a tool that basically allows you to explain the entire network, every circuit and mechanism, etc. The tool spits out explanations that are easy to understand and easy to connect to specific parts of the network, e.g. attention head x is doing y. Would you publish this tool to the entire world or keep it private or semi-private for a while? 

Comment by Marius Hobbhahn (marius-hobbhahn) on Clarifying mesa-optimization · 2023-03-31T20:51:03.492Z · LW · GW

Thank you!

I also agree that toy models are better than nothing and we should start with them but I moved away from "if we understand how toy models do optimization, we understand much more about how GPT-4 does optimization". 

I have a bunch of project ideas on how small models do optimization. I even trained the networks already. I just haven't found the time to interpret them yet. I'm happy for someone to take over the project if they want to. I'm mainly looking for evidence against the outlined hypothesis, i.e. maybe small toy models actually do fairly general optimization. Would def. update my beliefs. 

I'd be super interested in falsifiable predictions about what these general-purpose modules look like. Or maybe even just more concrete intuitions, e.g. what kind of general-purpose modules you would expect GPT-3 to have. I'm currently very uncertain about this. 

I agree with your final framing. 

Comment by Marius Hobbhahn (marius-hobbhahn) on Clarifying mesa-optimization · 2023-03-22T08:40:48.967Z · LW · GW

How confident are you that the model is literally doing gradient descent from these papers? My understanding was that the evidence in these papers is not very conclusive and I treated it more as an initial hypothesis than an actual finding. 

Even if you have the redundancy at every layer, you are still running copies of the same layer, right? Intuitively I would say this is not likely to be more space-efficient than not copying a layer and doing something else but I'm very uncertain about this argument. 

I intend to look into the Knapsack + DP algorithm problem at some point. If I were to find that the model implements the DP algorithm, it would change my view on mesa optimization quite a bit. 

Comment by Marius Hobbhahn (marius-hobbhahn) on Investigating causal understanding in LLMs · 2023-03-04T09:52:01.029Z · LW · GW

No plans so far. I'm a little unhappy with the experimental design from last time. If I ever come back to this, I'll change the experiments up anyways.

Comment by Marius Hobbhahn (marius-hobbhahn) on Cognitive Emulation: A Naive AI Safety Proposal · 2023-02-27T15:59:17.034Z · LW · GW

Could you elaborate a bit more about the strategic assumptions of the agenda? For example,
1. Do you think your system is competitive with end-to-end Deep Learning approaches?
1.1. Assuming the answer is yes, do you expect CoEm to be preferable to users?
1.2. Assuming the answer is no, how do you expect it to get traction? Is the path through lawmakers understanding the alignment problem and banning everything that is end-to-end and doesn't have the benefits of CoEm? 
2. Do you think this is clearly the best possible path for everyone to take right now or more like "someone should do this, we are the best-placed organization to do this"? 

PS: Kudos to publishing the agenda and opening up yourself to external feedback. 

Comment by Marius Hobbhahn (marius-hobbhahn) on LLM Basics: Embedding Spaces - Transformer Token Vectors Are Not Points in Space · 2023-02-13T21:52:36.743Z · LW · GW

fair. You convinced me that the effect is more determined by layer-norm than cross-entropy.

Comment by Marius Hobbhahn (marius-hobbhahn) on LLM Basics: Embedding Spaces - Transformer Token Vectors Are Not Points in Space · 2023-02-13T21:19:02.822Z · LW · GW

I agree that the layer norm does some work here but I think some parts of the explanation can be attributed to the inductive bias of the cross-entropy loss. I have been playing around with small toy transformers without layer norm and they show roughly similar behavior as described in this post (I ran different experiments, so I'm not confident in this claim). 

My intuition was roughly:
- the softmax doesn't care about absolute size, only about the relative differences of the logits.
- thus, the network merely has to make the correct logits really big and the incorrect logits small.
- to get the logits, you take the inner product of the activations and the unembedding. The more aligned the activation direction for the correct class is with the corresponding unembedding weights (i.e. the larger their cosine similarity), the bigger the logit.
- Thus, direction matters more than distance. 

Layernorm seems to even further reduce the effect of distance but I think the core inductive bias comes from the cross-entropy loss. A toy version of this intuition is sketched below. 
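
Here is a minimal numpy sketch of the bullets above, with a made-up two-token unembedding rather than real model weights: the logit is an inner product with an unembedding row, softmax only reacts to relative differences, so the activation's direction decides the prediction while its norm only sharpens it.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

W_U = np.array([[1.0, 0.0],    # unembedding row for token A
                [0.0, 1.0]])   # unembedding row for token B

act = np.array([0.9, 0.1])              # direction close to row A
print(softmax(W_U @ act))               # A wins
print(softmax(W_U @ (10 * act)))        # same winner, just more confident

rotated = np.array([0.1, 0.9])          # same norm, direction close to row B
print(softmax(W_U @ rotated))           # now B wins: direction, not distance
```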

Comment by Marius Hobbhahn (marius-hobbhahn) on More findings on Memorization and double descent · 2023-02-06T18:50:30.137Z · LW · GW

I don't think there is a general answer here. But here are a couple of considerations:
- networks can get stuck in local optima, so if you initialize it to memorize, it might never find a general solution.
- grokking has shown that with high weight regularization, networks can transition from memorized to general solutions, so it is possible to move from one to the other.
- it probably depends a bit on how exactly you initialize the memorized solution. You can represent lookup tables in different ways and some are much easier for NNs to work with than others. For example, I found that networks really don't like it if you set the weights to one-hot vectors such that one input only maps to one feature (a minimal version of this construction is sketched below).
- My prediction for empirical experiments here would be something like "it might work in some cases but not be clearly better in the general case. It will also depend on a lot of annoying factors like weight decay and learning rate and the exact way you build the dictionary". 
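
For concreteness, here is one possible reading of the one-hot lookup-table initialization mentioned above as a PyTorch sketch; the sizes and random labels are purely illustrative, and this is only one of many ways such a memorized solution could be constructed.

```python
import torch
import torch.nn as nn

# Hypothetical setup: n_inputs discrete one-hot inputs, each with a memorized label.
n_inputs, n_classes = 16, 4
labels = torch.randint(0, n_classes, (n_inputs,))   # the "dataset" to memorize

model = nn.Sequential(
    nn.Linear(n_inputs, n_inputs, bias=False),  # one hidden unit per input
    nn.ReLU(),
    nn.Linear(n_inputs, n_classes, bias=False),
)

with torch.no_grad():
    model[0].weight.copy_(torch.eye(n_inputs))             # one-hot: unit i fires for input i
    model[2].weight.zero_()
    model[2].weight[labels, torch.arange(n_inputs)] = 5.0  # unit i votes for its memorized label

x = torch.eye(n_inputs)                            # every possible one-hot input
assert (model(x).argmax(dim=-1) == labels).all()   # perfect memorization at initialization
```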

Comment by Marius Hobbhahn (marius-hobbhahn) on Gradient hacking is extremely difficult · 2023-01-26T05:23:39.024Z · LW · GW

I agree with everything you're saying. I just want to note that as soon as someone starts training networks in a way where not all weights are updated simultaneously, e.g. because the weights are updated only for specific parts of the network, or when the network has an external memory that is not changed every training step, gradient hacking seems immediately much more likely and much scarier. 

And there are probably hundreds of researchers out there working on modular networks with memory, so it probably won't take that long until we have models that plausibly have the capabilities to do gradient hacking. Whether they actually do it is a totally different question but it would be much easier to create a story of how the networks would gradient hack. 

Comment by Marius Hobbhahn (marius-hobbhahn) on Predicting GPU performance · 2023-01-03T08:15:38.539Z · LW · GW

This criticism has been made for the last 40 years and people have usually had new ideas and were able to execute them. Thus, on priors, we think this trend will continue even if we don't know exactly which kind of ideas they will be. 

In fact, due to our post, we were made aware of a couple of interesting ideas about chip improvements that we hadn't considered before that might change the outcome of our predictions (towards later limits) but we haven't included them in the model yet. 

Comment by Marius Hobbhahn (marius-hobbhahn) on The next decades might be wild · 2022-12-21T22:19:28.726Z · LW · GW

Hmmm interesting. 

Can you provide some of your reasons or intuitions for this fast FOOM?

My intuition against it is mostly like "intelligence just seems to be compute bound and thus extremely fast takeoffs (hours to weeks) are unlikely". But I feel very uncertain about this take and would like to refine it. So just understanding your intuitions better would probably already help a lot. 

Comment by Marius Hobbhahn (marius-hobbhahn) on The next decades might be wild · 2022-12-21T15:24:25.717Z · LW · GW

I think it's mostly my skepticism about extremely fast economic transformations. 

Like GPT-3 could probably automate more parts of the economy today but somehow it just takes a while for people to understand that and get it to work in practice. I also expect that it will take a couple of years between showing the capabilities of new AI systems in the lab and widespread economic impact just because humans take a while to adapt (at least with narrow systems). 

At some point (maybe in 2030) we will reach a level where AI is as capable as humans in many tasks and then the question is obviously how fast it can self-improve.  I'm skeptical that it is possible to self-improve as fast as the classic singularity story would suggest. In my mind, you mostly need more compute for training, new training data, new task design, etc. I think it will take some time for the AI to come up with all of that and even then, exponential demands just have their limits. Maybe the AI can get 100x compute and train a new model but getting 10000x compute probably won't happen immediately (at least in my mind; arguments for or against are welcome). 

Lastly, I wrote a story about my median scenario. I do have lots of uncertainty about what the TAI distribution should look like (see here) but my mode is at 2032-2035 (i.e. earlier than my median). So I could have also written a story with faster developments and it would reflect a slightly different corner of my probability distribution. But due to the reasons above, it would mostly look like a slightly faster version of this story. 

And your scenario is within the space of scenarios that I think could happen, I just think it's less likely than a less accelerationist, slower transition. But obviously not very confident in this prediction.
 

Comment by Marius Hobbhahn (marius-hobbhahn) on The next decades might be wild · 2022-12-16T12:19:10.485Z · LW · GW

Well maybe. I still think it's easier to build AGI than to understand the brain, so even the smartest narrow AIs might not be able to build a consistent theory before someone else builds AGI. 

Comment by Marius Hobbhahn (marius-hobbhahn) on The next decades might be wild · 2022-12-16T08:44:07.005Z · LW · GW

I'm not very bullish on HMI. I think the progress humanity makes in understanding the brain is extremely slow and because it's so hard to do research on the brain, I don't expect us to get much faster. 

Basically, I expect humanity to build AGI way before we are even close to understanding the brain. 

Comment by Marius Hobbhahn (marius-hobbhahn) on The next decades might be wild · 2022-12-15T21:52:50.283Z · LW · GW

I know your reasoning and I think it's a plausible possibility. I'd be interested in what the disruption of AI into society looks like in your scenario. 

Is it more like one or a few companies have AGIs but the rest of the world is still kinda normal or is it roughly like my story just 2x as fast?

Comment by Marius Hobbhahn (marius-hobbhahn) on The next decades might be wild · 2022-12-15T21:25:28.128Z · LW · GW

Thanks for pointing this out. I made a clarification in the text. 

Similar to Daniel, I'd also be interested in what public opinion was at the time or what the consensus view among experts was if there was one. 

Also, it seems like the timeframe for mobile phones is 1993 to 2020 if you can trust this statistic.

Comment by Marius Hobbhahn (marius-hobbhahn) on Predicting GPU performance · 2022-12-15T18:02:25.830Z · LW · GW

Definitely could be but don't have to be. We looked a bit into cooling and heat and did not find any clear consensus on the issue. 

Comment by Marius Hobbhahn (marius-hobbhahn) on Predicting GPU performance · 2022-12-15T17:02:56.982Z · LW · GW

We did consider modeling it explicitly. However, most estimates on the Landauer limit give very similar predictions as size limits. So we decided against making an explicit addition to the model and it is "implicitly" modeled in the physical size. We intend to look into Landauer's limit at some point but it's not a high priority right now. 
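
For reference, a back-of-the-envelope version of the Landauer bound mentioned above: the kT·ln(2) formula and the Boltzmann constant are standard, but the temperature and especially the assumed bit erasures per FLOP are loudly hypothetical.

```python
import math

k_B = 1.380649e-23                  # Boltzmann constant, J/K
T = 300.0                           # room temperature, K (assumption)
e_per_bit = k_B * T * math.log(2)   # ~2.87e-21 J per irreversible bit erasure

erasures_per_joule = 1 / e_per_bit  # ~3.5e20 bit erasures per joule
bits_erased_per_flop = 1e4          # loudly hypothetical overhead factor
flop_per_joule_limit = erasures_per_joule / bits_erased_per_flop

print(f"{e_per_bit:.2e} J/bit, implied bound ~{flop_per_joule_limit:.1e} FLOP/J")
```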

Comment by Marius Hobbhahn (marius-hobbhahn) on Predicting GPU performance · 2022-12-15T11:05:01.800Z · LW · GW

We originally wanted to forecast FLOP/s/$ instead of just FLOP/s but we found it hard to make estimates about price developments. We might look into this in the future. 

Comment by Marius Hobbhahn (marius-hobbhahn) on Predicting GPU performance · 2022-12-15T11:03:34.320Z · LW · GW

Well, depending on who you ask, you'll get numbers between 1e13 and 1e18 for the human brain FLOP/s equivalent. So I think there is lots of uncertainty about it. 

However, I do agree that if it were at 1e16, your reasoning would sound plausible to me. What a wild imagination. 

Comment by Marius Hobbhahn (marius-hobbhahn) on Predicting GPU performance · 2022-12-15T11:01:38.600Z · LW · GW

Yeah, I also expect that there are some ways of compensating for the lack of miniaturization with other tech. I don't think progress will literally come to a halt. 

Comment by Marius Hobbhahn (marius-hobbhahn) on Predicting GPU performance · 2022-12-15T11:00:12.899Z · LW · GW

We looked more into this because we wanted to get a better understanding of Ajeya's estimate of price-performance doubling every 2.5 years. Originally, some people I talked to were skeptical and thought that 2.5 years is too conservative. I now think that 2.5 years is probably insufficiently conservative in the long run. 

However, I also want to note that there are still reasons to believe a doubling time of 2 years or less could be realistic due to progress in specialization or other breakthroughs. I still have large uncertainty about the doubling time but my median estimate got a bit more conservative. 
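
As a purely arithmetic illustration of what the doubling times discussed here imply over a decade (no empirical claim):

```python
# 2^(10 / doubling_time) = growth factor in price-performance over 10 years.
for doubling_years in (2.0, 2.5):
    growth_10y = 2 ** (10 / doubling_years)
    print(f"doubling every {doubling_years} years -> ~{growth_10y:.0f}x after 10 years")
# 2.0-year doubling: ~32x per decade; 2.5-year doubling: ~16x per decade.
```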

We have not incorporated this particular estimate into the bio anchor's model. Mostly, because the model is a prediction about a particular type of GPU and I think it will be modified or replaced once miniaturization is no longer an option. So, I don't expect progress to entirely stop in the next decade, just slow down a bit. 

But lots of open questions all over the place. 

Comment by Marius Hobbhahn (marius-hobbhahn) on Predicting GPU performance · 2022-12-15T10:54:29.887Z · LW · GW

By uncertainty I mean, I really don't know, i.e. I could imagine both very high and very low gains. I didn't want to express that I'm skeptical. 

For the third paragraph, I guess it depends on what you think of as specialized hardware. If you think GPUs are specialized hardware, then a gain of 1000x from CPUs to GPUs sounds very plausible to me. If you think GPUs are the baseline and specialized hardware is e.g. TPUs, then a 1000x gain sounds implausible to me. 

My original answer wasn't that clear. Does this make more sense to you?