Posts

Risks from AI Overview: Summary 2023-08-18T01:21:25.445Z
Catastrophic Risks from AI #6: Discussion and FAQ 2023-06-27T23:23:58.846Z
Catastrophic Risks from AI #5: Rogue AIs 2023-06-27T22:06:11.029Z
Catastrophic Risks from AI #4: Organizational Risks 2023-06-26T19:36:41.333Z
Catastrophic Risks from AI #3: AI Race 2023-06-23T19:21:07.335Z
Catastrophic Risks from AI #2: Malicious Use 2023-06-22T17:10:08.374Z
Catastrophic Risks from AI #1: Introduction 2023-06-22T17:09:40.883Z
[MLSN #9] Verifying large training runs, security risks from LLM access to APIs, why natural selection may favor AIs over humans 2023-04-11T16:03:31.061Z
[MLSN #8] Mechanistic interpretability, using law to inform AI alignment, scaling laws for proxy gaming 2023-02-20T15:54:13.791Z
What's the deal with AI consciousness? 2023-01-11T16:37:02.200Z
Implications of simulators 2023-01-07T00:37:45.799Z
"AI" is an indexical 2023-01-03T22:00:31.089Z
A Year of AI Increasing AI Progress 2022-12-30T02:09:39.458Z
Did ChatGPT just gaslight me? 2022-12-01T05:41:46.560Z
A philosopher's critique of RLHF 2022-11-07T02:42:51.234Z
ML Safety Scholars Summer 2022 Retrospective 2022-11-01T03:09:10.305Z
Announcing the Introduction to ML Safety course 2022-08-06T02:46:00.295Z
$20K In Bounties for AI Safety Public Materials 2022-08-05T02:52:47.729Z
Examples of AI Increasing AI Progress 2022-07-17T20:06:41.213Z
Open Problems in AI X-Risk [PAIS #5] 2022-06-10T02:08:06.121Z
Perform Tractable Research While Avoiding Capabilities Externalities [Pragmatic AI Safety #4] 2022-05-30T20:25:05.151Z
Complex Systems for AI Safety [Pragmatic AI Safety #3] 2022-05-24T00:00:58.873Z
A Bird's Eye View of the ML Field [Pragmatic AI Safety #2] 2022-05-09T17:18:53.978Z
Introduction to Pragmatic AI Safety [Pragmatic AI Safety #1] 2022-05-09T17:06:00.409Z
Introducing the ML Safety Scholars Program 2022-05-04T16:01:51.575Z
[$20K in Prizes] AI Safety Arguments Competition 2022-04-26T16:13:16.351Z

Comments

Comment by TW123 (ThomasWoodside) on Why was the AI Alignment community so unprepared for this moment? · 2023-07-15T14:23:36.655Z · LW · GW

We weren't intending to use the contest to do any direct outreach to anyone (not sure how one would do direct outreach with one-liners in any case), and we didn't use it for that. I think it was less useful than I would have hoped (nearly all submissions were not very good), but ideas/anecdotes that surfaced have been used in various places and as inspiration.

It is also interesting to note that the contest was very controversial on LW, essentially because it was seen as too political/advocacy-flavored (though it wasn't intended for "political" outreach per se). I think it's fair that LW has had those kinds of norms for research, but those norms did have a chilling effect on people who wanted to do the kind of advocacy that many people on LW now deem useful/necessary.

Comment by TW123 (ThomasWoodside) on Catastrophic Risks from AI #3: AI Race · 2023-06-24T23:50:04.529Z · LW · GW

It is possible that AI would allow for the creation of brain-computer interfaces such that we can productively merge with AI systems. I don't think this would apply in that case since that would be a true "augmentation."

If that doesn't happen, though, or before that happens, I think this is a real possibility. The disanalogy is that our brains wouldn't add anything to sufficiently advanced AI systems, unlike books, which are useless without our brains to read them.

Today, many people are weaker physically than in previous times because we don't need to do as much physical labor. I don't see why the same couldn't happen with our minds. Of course, many people go to the gym, and people will probably also continue to learn things to keep sharp. If that becomes a strong and widespread cultural norm, then we wouldn't have this problem. But it doesn't seem guaranteed that would happen.

Comment by TW123 (ThomasWoodside) on Statement on AI Extinction - Signed by AGI Labs, Top Academics, and Many Other Notable Figures · 2023-06-01T01:50:43.258Z · LW · GW

Hi Jan, I appreciate your feedback.

I've been helping out with this and I can say that the organizers are working as quickly as possible to verify and publish new signatures. New signatures have been published since the launch, and additional signatures will continue to be published as they are verified. A team of people is working on it right now and has been since launch.

The main obstacles to extremely swift publication are:

  • First, determining who meets our bar for name publication. We think the letter will have greater authority (and coordination value) if all names are above a certain bar, and so some effort needs to be put into determining whether signatories meet that bar.
  • Second, as you mention, verification. Prior to launch, CAIS built an email verification system that ensures that signatories must verify their work emails in order for their signature to be valid. However, this has required some tweaks, such as making the emails more attention-grabbing and adding some language on the form itself that makes clear that people should expect an email (before these tweaks, some people weren't verifying their emails).
  • Lastly, even with verification, some submissions are still possibly fake (from email addresses that we aren't sure are the real person) and need to be further assessed.

These are all obstacles that simply require time to address, and the team is working around the clock. In fact, I'm writing this comment on their behalf so that they can focus on the work they're doing.  We will publish all noteworthy signatures as quickly as we can, which should be within a matter of days (as I said above, some have already been published and this is ongoing). We do take your feedback that perhaps we should have hired more people so that verification could be swifter. 

In response to your feedback, we have just added language in the form and email that makes clear signatures won't show up immediately so that we can verify them. This might seem very obvious, but when you are running something with so many moving parts as this entire process has been, it is easy to miss things.

Thank you again for your feedback.

Comment by TW123 (ThomasWoodside) on Research agenda: Supervising AIs improving AIs · 2023-04-29T19:22:29.194Z · LW · GW

I've been collecting examples of this kind of thing for a while now here: ai-improving-ai.safe.ai.

In addition to algorithmic and data improvements, I'll add that there are also some examples of AI helping to design hardware (e.g. GPU architectures) and auxiliary software (e.g. for datacenter cooling).

Comment by TW123 (ThomasWoodside) on Request to AGI organizations: Share your views on pausing AI progress · 2023-04-12T01:10:51.932Z · LW · GW

At the time of this post, the FLI letter has been signed by 1 OpenAI research scientist, 7 DeepMind research scientists/engineers, and 0 Anthropic employees. 

"1 OpenAI research scientist" felt weird to me on priors. 0 makes sense, if the company gave some guidance (e.g. legal) to not sign, or if the unanimous opinion was that it's a bad idea to sign. 7 makes sense too -- it's about what I'd expect from DeepMind and shows that there's a small contingent of people really worried about risk. Exactly 1 is really weird -- there are definitely multiple risk conscious people at OpenAI, but exactly one of them decided to sign?

I see a "Yonas Kassa" listed as an OpenAI research scientist, but it's very unclear who this person is. I don't see any LinkedIn or Google Scholar profile of this name associated with OpenAI. Previously, I know many of the signatures were inaccurate, so I wonder if this one is, too?

Anyway, my guess is that actually zero OpenAI researchers have signed, and that both OpenAI and Anthropic employees have decided (as a collective? because of a top-down directive? for legal reasons? I have no idea) not to sign.

Comment by TW123 (ThomasWoodside) on Is InstructGPT Following Instructions in Other Languages Surprising? · 2023-02-14T17:35:42.741Z · LW · GW

Later in the thread Jan asks, "is this interpretability complete?" which I think implies that his intuition is that this should be easier to figure out than other questions (perhaps because it seems so simple). But yeah, it's kind of unclear why he is calling this out in particular.

Comment by TW123 (ThomasWoodside) on Is InstructGPT Following Instructions in Other Languages Surprising? · 2023-02-14T04:07:45.789Z · LW · GW

I find myself surprised/confused at his apparent surprise/confusion.

Jan doesn't indicate that he's extremely surprised or confused? He just said he doesn't know why this happens. There's a difference between being unsurprised by something (e.g. from having observed something similar before) and actually knowing why it happens. To give a trivial example, hunter-gatherers from 10,000 BC would not have been surprised if a lightning strike caused fire, but would be quite clueless (or incorrect) as to why or how this happens.

I think Quintin's answer is a good possible hypothesis (though of course it leads to the further question of how LLMs learn language-neutral circuitry).

Comment by TW123 (ThomasWoodside) on Why I hate the "accident vs. misuse" AI x-risk dichotomy (quick thoughts on "structural risk") · 2023-01-30T22:24:42.051Z · LW · GW

In addition to the post you linked, there is also an earlier post on this topic that I like.

I also co-wrote a post that looks at specific structural factors related to AI safety.

Comment by TW123 (ThomasWoodside) on What's the deal with AI consciousness? · 2023-01-14T23:39:31.496Z · LW · GW

Thanks so much for writing this, quite useful to see your perspective!

First, I don't think that you've added anything new to the conversation. Second, I don't think what you have mentioned even provides a useful summary of the current state of the conversation: it is neither comprehensive, nor the strongest version of various arguments already made.

Fair enough!

I don't think that's a popular opinion here. And while I think some people might just have a cluster of "brain/thinky" words in their head when they don't think about the meaning of things closely, I don't think this is a popular opinion of people in general unless they're really not thinking about it.

I've seen this from the general public a surprising amount. For example, see the New York Times article linked. Agree it's not remotely popular on LessWrong.

Citation needed.

Fair enough. I'm not very sympathetic to panpsychism, but it probably could have been worth mentioning. Though I am not really sure how much it would add for most readers.

Assuming we make an AI conscious, and that consciousness is actually something like what we mean by it more colloquially (human-like, not just panpsychistly), it isn't clear that this makes it a moral concern. 

That's true; and it might be a moral concern without consciousness. But on many moral accounts, consciousness is highly relevant. I think probably most people would say it is relevant.

Meanwhile, I feel like there is a lot of lower hanging fruit in neuroscience that would also help solve this problem more easily later in addition to actually being useful now.

Curious what research you think would help here?

Same thing as above, and also the prevailing view here is that it is much more important that AI will kill us, and if we're theoretically spending (social) capital to make these people care about things, the not killing us is astronomically more important.

I agree with this. But at the same time the public conversation keeps talking about consciousness. I wanted to address it for that reason, and really address it, rather than just brush it aside. I don't really think it's true that discussion of this detracts from x-risk; both point in the direction of being substantially more careful, for example.

They cannot choose not to because they don't know what it is, so this is unactionable and useless advice.

Good point. I think I had meant to say that researchers should not try to do this. I will edit the post to say that.

I think my recommendations are probably not well targeted enough; I didn't really specify to whom I was recommending them. I'll try to avoid doing that in the future.

Comment by TW123 (ThomasWoodside) on What's the deal with AI consciousness? · 2023-01-14T22:26:21.380Z · LW · GW

I agree with this. If we are able to design consciousness such that a system is fulfilled by serving humans, then it's possible that would be morally alright. I don't think there is a strong enough consensus that I'd feel comfortable locking it in, but to me it seems ok.

By default though, I think we won't be designing consciousness intentionally, and it will just emerge, and I don't think that's too likely to lead to this sort of situation.

Comment by TW123 (ThomasWoodside) on Beware safety-washing · 2023-01-13T16:18:38.163Z · LW · GW

A related post: https://www.lesswrong.com/posts/xhD6SHAAE9ghKZ9HS/safetywashing

[Realized this is contained in a footnote, but leaving this comment here in case anyone missed it].

Comment by TW123 (ThomasWoodside) on What's the deal with AI consciousness? · 2023-01-11T19:11:19.739Z · LW · GW

It looks like I got one or possibly two strong downvotes, but it doesn't seem like from either of the commenters. If you downvoted this (or think you understand why it was downvoted), please let me know in the comments so I can improve!

Comment by TW123 (ThomasWoodside) on What's the deal with AI consciousness? · 2023-01-11T19:09:20.157Z · LW · GW

Consciousness therefore only happens if it improves performance at the task we have assigned.  And some tasks like interacting directly with humans it might improve performance.

I don't think this is necessarily true. Consciousness could be a side effect of other processes that do improve performance.

The way I've heard this put: a polar bear has thick hair so that it doesn't get too cold, and this is good for its evolutionary fitness. The fact that the hair is extremely heavy is simply a side effect of this. Consciousness could possibly be similar.

Comment by TW123 (ThomasWoodside) on Applied Linear Algebra Lecture Series · 2022-12-25T06:38:26.575Z · LW · GW

I don't think these are necessarily bad suggestions if there were a future series. But my sense is that John did this for the people in the audience, somebody asked him to record it so he did, and now he's putting them online in case they're useful to anyone. It's very hard to make good production quality lectures, and it would have required more effort. But it sounds like John knew this and decided he would rather spend his time elsewhere, which is completely his choice to make. As written, these suggestions feel a bit pushy to me.

Comment by TW123 (ThomasWoodside) on Let’s think about slowing down AI · 2022-12-24T06:15:10.807Z · LW · GW

Sorry if I missed it earlier in the thread, but who is this "polymath"?

Comment by TW123 (ThomasWoodside) on Let’s think about slowing down AI · 2022-12-23T04:15:00.868Z · LW · GW

Yeah, I agree with all this.

Comment by TW123 (ThomasWoodside) on Let’s think about slowing down AI · 2022-12-23T03:39:38.819Z · LW · GW

I just do not think that the post is written for people who think "slowing down AI capabilities is robustly good." If people thought that, then why do they need this post? Surely they don't need somebody to tell them to think about it?

So it seems to me like the best audience for this post would be those (including those at some AI companies, or those involved in policy, which includes people reading this post) who currently think something else, for example that the robustly good thing is for their chosen group to be ahead so that they can execute whatever strategy they think they alone can do correctly.

The people I've met who don't want to think about slowing down AI capabilities just don't seem to think that slowing down AI progress would be robustly good, because that just wouldn't be a consistent view! They often seem to have some view that nothing is robustly good, or maybe some other thing ("get more power") is robustly good. Such people just won't really be swayed by the robust priors thing, or maybe they'd be swayed in the other direction.

Comment by TW123 (ThomasWoodside) on Let’s think about slowing down AI · 2022-12-23T02:11:07.860Z · LW · GW

The claim being made is something like the following:

1) AGI is a dangerous technology.

2) It is robustly good to slow down dangerous technologies.

3) Some people might say that you should not actually do this because of [complicated unintelligible reason].

4) But you should just do the thing that is more robustly good.

I argue that many people (yes, you're right, in ways that conflict with one another) believe the following:

1) X is a dangerous country.

2) It is robustly good to always be ahead of X in all technologies, including dangerous ones.

3) Some people might say that you should not actually do this because of [complicated unintelligible reason]. This doesn't make very much sense.

4) But you should just do the thing that is more robustly good.

 

My point is that which argument is the obvious, robust one, and which one is the weird inside view one depends on your perspective. Therefore, it doesn't seem like (4) is a very good generalized argument. For example, if I were one of these powerful people, I think it would be wrong for me to be convinced to "focus on the robustly good measures, not the weird inside view measures" because it would lead me to do bad things like trying to advance AI capabilities. As a result, the argument seems suspect to me. It feels like it only works for this community, or people who are already very concerned by AI x-risk.

In comparison, there are specific arguments like "AGI is dangerous" and "slowing down dangerous technologies is actually robustly good" (some of these were presented in this post) that I think are, ironically, much more robustly good, because they don't seem to have negative effects as reliably when presented to people who hold beliefs I think are wrong.

Edit: I no longer endorse this comment. It claims too much, specifically that any reasoning procedure is suspect if it leads to people who believe false premises taking bad actions.

I think what I was really trying to get at in my original comment was that that particular argument seems aimed at people who already think that it would be robustly good to slow down dangerous technologies. But the people who would most benefit from this post are those who do not already think this; for them it doesn't help much and might actively hurt.

Comment by TW123 (ThomasWoodside) on Let’s think about slowing down AI · 2022-12-23T01:27:40.905Z · LW · GW

There are things that are robustly good in the world, and things that are good on highly specific inside-view models and terrible if those models are wrong. Slowing dangerous tech development seems like the former, whereas forwarding arms races for dangerous tech between world superpowers seems more like the latter.

It may seem the opposite to some people. For instance, my impression is that for many adjacent to the US government, "being ahead of China in every technology" would be widely considered robustly good, and nobody would question you at all if you said that was robustly good. Under this perspective the idea that AI could pose an existential risk is a "highly specific inside-view model" and it would be terrible if we acted on the model and it is wrong.

I don't think your readers will mostly think this, but I actually think a lot of people would, which for me makes this particular argument seem entirely subjective and thus suspect.

Comment by TW123 (ThomasWoodside) on Where to be an AI Safety Professor · 2022-12-07T17:33:14.531Z · LW · GW

I think this is all true, but also, since Yale CS is ranked poorly, the graduate students are not very strong for the most part. You certainly have less competition for them if you are a professor, but my impression is that few top graduate students want to go to Yale. In fact, my general impression is that the undergraduates are often stronger researchers than the graduate students (and then they go on to PhDs at higher-ranked places than Yale).

Yale is working on strengthening its CS department and it certainly has a lot of money to do that. But there are a lot of reasons that I am not that optimistic. There is essentially no tech scene in New Haven, New Haven is not that great in general, the Yale CS building is extremely dingy (I think this has an actual effect on people), and it's really hard to affect the status quo. However, I'm more optimistic that Yale will successfully forge a niche of interdisciplinary research, which is really a strength of the university.

Comment by TW123 (ThomasWoodside) on Where to be an AI Safety Professor · 2022-12-07T15:46:51.325Z · LW · GW

I'll just comment on my experience as an undergrad at Yale in case it's useful.

At Yale, the CS department, particularly when it comes to state-of-the-art ML, is not very strong. There are a few professors who do good work, but Yale is much stronger in social robotics, and there is also some ML theory. There are a couple of AI ethics people at Yale, and there soon will be a "digital ethics" person, but there aren't any AI safety people.

That said, there is a lot of latent support for AI safety at Yale. One of the global affairs professors involved in the Schmidt Program for Artificial Intelligence, Emerging Technology, and National Power is quite interested in AI safety. He invited Brian Christian and Stuart Russell to speak and guest teach his classes, for example. The semi-famous philosopher L.A. Paul is interested in AI safety, and one of the theory ML professors had a debate about AI safety in one of his classes. One of the professors most involved in hiring new professors specifically wants to hire AI safety people (though I'm not sure he really knows what AI safety is).

I wouldn't really recommend Yale to people who are interested in doing very standard ML research and want an army of highly competent ML researchers to help them. But for people whose work interacts with sociotechnical considerations like policy, or is more philosophical in nature, I think Yale would be a fantastic place to be, and in fact possibly one of the best places one could be.

Comment by TW123 (ThomasWoodside) on Did ChatGPT just gaslight me? · 2022-12-02T21:59:23.857Z · LW · GW

I appreciate this. I don't even consider myself part of the rationality community, though I'm adjacent. My reasons for not drinking have nothing to do with the community and existed before I knew what it was. I actually get the sense this is the case for a number of people in the community (more of a correlation or common cause rather than caused by the community itself). But of course I can't speak for all.

I will be trying it on Sunday. We will see how it is.

Comment by TW123 (ThomasWoodside) on Did ChatGPT just gaslight me? · 2022-12-02T19:26:05.747Z · LW · GW

I've thought about this comment, because it certainly is interesting. I think I was clearly confused in my questions to ChatGPT (though I will note: My tequila-drinking friends did not and still don't think tequila tastes at all sweet, including "in the flavor profile" or anything like that. But it seems many would say they're wrong!) ChatGPT was clearly confused in its response to me as well. 

I think this part of my post was incorrect:

It was perfectly clear: ChatGPT was telling me that tequila adds a sweetness to the drink. So it was telling me that tequila is a sweet drink (at least, as sweet as orange juice).

I have learned today that a drink does not have to be sweet in order for many to consider it to add "sweetness." To be honest, I don't understand this at all, and at the time considered it a logical contradiction. It seems a lot less clear cut to me now.

However, the following (and the quote above it) is what I focused on most in the post. I quoted the latter part of it three different times. I believe it is entirely unaffected by whether or not tequila is canonically considered to be sweet:

“I was not referring to the sweetness that comes from sugar.” But previously, ChatGPT had said “tequila has a relatively low alcohol content and a relatively high sugar content.” Did ChatGPT really forget what it had said, or is it just pretending?

Is ChatGPT gaslighting me?

Thomas: You said tequila has a "relatively high sugar content"?

ChatGPT: I apologize if my previous response was unclear. When I said that tequila has a "relatively high sugar content," I was not suggesting that tequila contains sugar.

Comment by TW123 (ThomasWoodside) on Did ChatGPT just gaslight me? · 2022-12-02T19:06:48.561Z · LW · GW

It should! I mentioned that probable future outcome in my original post.

Comment by TW123 (ThomasWoodside) on Did ChatGPT just gaslight me? · 2022-12-02T05:40:43.391Z · LW · GW

I'm going to address your last paragraph first, because I think it's important for me to respond to, not just for you and me but for others who may be reading this.

When I originally wrote this post, it was because I had asked ChatGPT a genuine question about a drink I wanted to make. I don't drink alcohol, and I never have. I've found that even mentioning this fact sometimes produces responses like yours, and it's not uncommon for people to think I am mentioning it as some kind of performative virtue signal. People choose not to drink for all sorts of reasons, and maybe some are being performative about it, but that's a hurtful assumption to make about anyone who makes that choice and dares to admit it in a public forum. This is exactly why I am often hesitant to mention this fact about myself, but in the case of this post, there really was no other choice (aside from just not posting this at all, which I would really disprefer). I've generally found the LW community and younger generations to be especially good at interpreting a choice not to drink for what it usually is: a personal choice, not a judgment or a signal or some kind of performative act. However, your comment initially angered and then saddened me, because it greets my choice through a lens of suspicion. That's generally a fine lens through which to look at the world, but I think in this context, it's a harmful one. I hope you will consider thinking a little more compassionately in the future with respect to this issue.


 To answer your object-level critiques:

The problem is that it clearly contradicts itself several times, rather than admitting a contradiction it doesn't know how to reconcile. There is no sugar in tequila. Tequila may be described as sweet (nobody I talked to described it as such, but some people on the internet do) for non-sugar reasons. In fact, I'm sure ChatGPT knows way more about tequila than I do!

It is not that it "may not know" how to reconcile those facts. It is that it doesn't know, makes something up, and pretends it makes sense.

A situation where somebody interacting with the chatbot doesn't know much about the subject area is exactly the kind of situation we need to be worried about with these models. I'm entirely unconvinced that the fact that some people describe tequila as sweet says much at all about this post. That's because the point of the post was rather that ChatGPT claimed tequila has high sugar content, then claimed that actually the sweetness is due to something else, and it never really meant that tequila has any sugar. That is the problem, and I don't think my description of it is overblown.

Comment by TW123 (ThomasWoodside) on Did ChatGPT just gaslight me? · 2022-12-02T01:57:40.919Z · LW · GW

Interesting! I hadn't come across that. Maybe ChatGPT is right that there is sweetness (perhaps to somebody with trained taste) that doesn't come from sugar. However, the blatant contradictions remain (ChatGPT certainly wasn't saying that at the beginning of the transcript).

Comment by TW123 (ThomasWoodside) on Did ChatGPT just gaslight me? · 2022-12-01T13:18:17.566Z · LW · GW

OpenAI has in the past not been that transparent about these questions, but in this case, the blog post (linked in my post) makes it very clear it's trained with reinforcement learning from human feedback.

However, of course it was initially pretrained in an unsupervised fashion (it's based on GPT-3), so it seems hard to know whether this specific behavior was "due to the RL" or "a likely continuation".

Comment by TW123 (ThomasWoodside) on A philosopher's critique of RLHF · 2022-11-07T03:40:07.845Z · LW · GW

This is a broader criticism of alignment to preferences or intent in general, since these things can change (and sometimes, you can even make choices of whether to change them or not). L.A. Paul wrote a whole book about this sort of thing; if you're interested, here's a good talk.

Comment by TW123 (ThomasWoodside) on A philosopher's critique of RLHF · 2022-11-07T03:03:35.440Z · LW · GW

That's fair. I think it's a critique of RLHF as it is currently done (just get lots of preferences over outputs and train your model). I don't think just asking you questions "when it's confused" is sufficient; it also has to know when to be confused. But RLHF is a pretty general framework, so you could theoretically expose a model to lots of black swan events (not just mildly OOD events) and make sure it reacts to them appropriately (or asks questions). But as far as I know, that's not research that's currently happening (though there might be something I'm not aware of).

Comment by TW123 (ThomasWoodside) on Instead of technical research, more people should focus on buying time · 2022-11-07T02:43:48.144Z · LW · GW

This has also motivated me to post one of my favorite critiques of RLHF.

Comment by TW123 (ThomasWoodside) on Instead of technical research, more people should focus on buying time · 2022-11-07T02:12:30.359Z · LW · GW

I think if they operationalized it like that, fine, but I would find the frame "solving the problem" to be a very weird way of referring to that. Usually, when I hear people saying "solving the problem," they have a vague sense of what they mean, and have implicitly abstracted away the fact that there are many continuous problems where progress needs to be made, and that the problem can only really be reduced, never solved, unless there is actually a mathematical proof.

Comment by TW123 (ThomasWoodside) on Instead of technical research, more people should focus on buying time · 2022-11-07T02:06:36.562Z · LW · GW

I'm not a big supporter of RLHF myself, but my steelman is something like:

RLHF is a pretty general framework for getting a system to optimize for something that can't be clearly defined. If the feedback were just "if a human looked at this for a second, would they like it?", that does provide a reward signal towards deception, but also towards genuinely useful behavior. You can take steps to reduce the deception component, for example by letting the humans red team the model or do some kind of transparency; this can all theoretically fit in the framework of RLHF. One could try to make the human feedback as robust as possible by adding all sorts of bells and whistles, and this would improve reliability and reduce the deception reward signal. It could be argued this still isn't sufficient because the model will still find some way around it, but beware too much asymptotic reasoning.

I personally do not find RLHF very appealing since I think it complicates things unnecessarily and is too correlated with capabilities at the moment. I prefer approaches that actually try to isolate the things people care about (i.e. their values), add some level of uncertainty (moral uncertainty, etc.), try to make these proxies as robust as possible, and make them adaptive to changes in the models that are constantly trying to exploit and Goodhart them.
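To make the "would a human like this?" reward-signal framing above concrete, here is a toy sketch of the preference-learning core of RLHF: fit a reward model to pairwise human comparisons with a Bradley-Terry style loss, then use its scores as the proxy a policy would be tuned against. Everything here (the random "embeddings", the tiny network, the sizes) is made up for illustration; it is not how any lab actually implements this.

```python
# Toy sketch of the preference-learning core of RLHF (illustrative only).
# A tiny "reward model" is fit to pairwise human comparisons with a
# Bradley-Terry loss; a policy could then be tuned against its scores.
import torch
import torch.nn as nn

EMBED_DIM = 16  # stand-in for a real text encoder's output size

class RewardModel(nn.Module):
    def __init__(self, dim: int = EMBED_DIM):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scalar score: "how much would a human rater like this output?"
        return self.net(x).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry: maximize log sigmoid(r_chosen - r_rejected)
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# Fake "embeddings" of responses a human compared; real RLHF would embed LLM outputs.
chosen = torch.randn(256, EMBED_DIM) + 0.5
rejected = torch.randn(256, EMBED_DIM)

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(200):
    loss = preference_loss(model(chosen), model(rejected))
    opt.zero_grad()
    loss.backward()
    opt.step()

# The learned reward is a proxy: anything the raters responded to, whether
# genuinely useful or merely persuasive, gets rewarded when a policy optimizes it.
print(float(model(chosen).mean()), float(model(rejected).mean()))
```

The final comment is the crux of the steelman: the learned reward captures whatever the raters responded to, and the extra bells and whistles (red teaming, transparency) are attempts to make that proxy harder to game.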

Comment by TW123 (ThomasWoodside) on Instead of technical research, more people should focus on buying time · 2022-11-07T00:59:55.466Z · LW · GW

I think I essentially agree with respect to your definition of "sanity," and that it should be a goal. For example, just getting people to think more about tail risk fits both your definition of "sanity" and my definition of "safety culture." I agree that people merely saying they support these efforts and repeating applause lights is pretty bad, though it seems weird to me to discount actual resources coming in.

As for the last bit: trying to figure out the crux here. Are you just not very concerned about outer alignment/proxy gaming? I think if I was totally unconcerned about that, and only concerned with inner alignment/deception, I would think those areas were useless. As it is, I think a lot of the work is actively harmful (because it mostly just advances capabilities), but it still may help chip away at the proxy gaming problem.

Comment by TW123 (ThomasWoodside) on Instead of technical research, more people should focus on buying time · 2022-11-07T00:04:47.002Z · LW · GW

I appreciate this comment, because I think anyone who is trying to do these kinds of interventions needs to be constantly vigilant about exactly what you are mentioning. I am not excited about loads of inexperienced people suddenly trying to do big things in AI strategy, because the downsides can be so high. Even people I trust are likely to make a lot of miscalculations. And the epistemics can be bad.

I wouldn't be excited about (for example) retreats with undergrads to learn about "how you can help buy more time." I'm not even sure of the sign of interventions people I trust pursue, let alone people with no experience who have barely thought about the problem. As somebody who is very inexperienced but interested in AI strategy, I will note that you do have to start somewhere to learn anything.

That said--and I don't think you dispute this--we cannot afford to leave strategy/fieldbuilding/policy off the table. In my view, a huge part of making AI go well is going to depend on forces beyond the purely technical, but I understand that this varies depending on one's threat models. Sociotechnical systems are very difficult to influence, and it's easy to influence them in an incorrect direction. Not everyone should try to do this, and even people who are really smart and thinking clearly may have negative effects. But I think we still have to try.

I want to push back on your suggestion that safety culture is not relevant. I agree that being vaguely "concerned" does seem not very useful. But safety culture seems very important. Things like (paraphrased from this paper):

  • Preoccupation with failure, especially black swan events and unseen failures.
  • Reluctance to simplify interpretations and explain failures using only simplistic narratives.
  • Sensitivity to operations, which involves closely monitoring systems for unexpected behavior.
  • Commitment to resilience, which means being rapidly adaptable to change and willing to try new ideas when faced with unexpected circumstances.
  • Under-specification of organizational structures, where new information can travel throughout the entire organization rather than relying only on fixed reporting chains.

In my view, having this kind of culture (which is analogous to having a security mindset) proliferate would be a nearly unalloyed good. Notice there is nothing here that necessarily says "slow down" -- one failure mode with telling the most safety-conscious people to simply slow down is, of course, that less safety-conscious people don't. Rather, simple awareness and understanding, and taking safety seriously, are I think robustly positive regardless of the strategic situation. Doing as much as possible to extract the old-fashioned "move fast and break things" ethos and replace it with a safety-oriented ethos would be very helpful (though I will emphasize, it will not solve everything).

Lastly, regarding this:

with people walking away thinking that AI Alignment people want to teach AIs how to reproduce moral philosophy, or that OpenAIs large language model have been successfully "aligned", or that we just need to throw some RLHF at the problem and the AI will learn our values fine

I'm curious if you think that these topics have no relevance to alignment, or whether you're saying it's problematic that people come away thinking that these things are the whole picture? Because I see all of these as relevant (to varying degrees), but certainly not sufficient or the whole picture.

Comment by TW123 (ThomasWoodside) on Instead of technical research, more people should focus on buying time · 2022-11-06T12:56:27.235Z · LW · GW

I feel somewhat conflicted about this post. I think a lot of the points are essentially true.  For instance, I think it would be good if timelines could be longer all else equal. I also would love more coordination between AI labs. I also would like more people in AI labs to start paying attention to AI safety.

But I don't really like the bottom line. The point of all of the above is not reducible to just "getting more time for the problem to be solved." 

First of all, the framing of "solving the problem" is, in my view, misplaced. Unless you think we will someday have a proof of beneficial AI (I think that's highly unlikely), there will always be more to do to increase certainty and reliability. There isn't a moment when the problem is solved.

Second, these interventions are presented as a way of giving "alignment researchers" more time to make technical progress. But in my view, things like more coordination could lead to labs actually adopting any alignment proposals at all. The same goes for reducing racing. And as for getting labs concerned about AI safety: I'd expect labs to devote resources to safety themselves if they are concerned. This shouldn't be a side benefit; it should be a mainline benefit.

I don't think high levels of reliability of beneficial AI will come purely from people who post on LessWrong, because the community is just so small and doesn't have that much capital behind it. DeepMind/OpenAI, not to mention Google Brain and Meta AI research, could invest significantly more in safety than they do.  So could governments (yes, the way they do this might be bad -- but it could in principle be good).

You say that you thought buying time was the most important frame you found to backchain with. To me this illustrates a problem with backchaining. Dan Hendrycks and I discussed similar kinds of interventions, and we called this "improving contributing factors" which is what it's called in complex systems theory. In my view, it's a much better and less reductive frame for thinking about these kinds of interventions.

Comment by TW123 (ThomasWoodside) on Counterarguments to the basic AI x-risk case · 2022-10-15T16:14:27.974Z · LW · GW

As far as I understand it, "intelligence" is the ability to achieve one's goals through reasoning and making plans, so a highly intelligent system is goal-directed by definition. Less goal-directed AIs are certainly possible, but they must necessarily be considered less intelligent - the thermometer example illustrates this. Therefore, a less goal-directed AI will always lose in competition against a more goal-directed one.

Your argument seems to be:

  1. Definitionally, intelligence is the ability to achieve one's goals.
  2. Less goal-directed systems are less intelligent.
  3. Less intelligent systems will always lose in competition.
  4. Less goal-directed systems will always lose in competition.

Defining intelligence as goal-directedness doesn't do anything for your argument. It just kicks the can down the road. Why will less intelligent (under your definition, less goal-directed) systems always lose in competition?

Imagine you had a magic wand or a genie in a bottle that would fulfill every wish you could dream of. Would you use it? If so, you're incentivized to take over the world, because the only possible way of making every wish come true is absolute power over the universe. The fact that you normally don't try to achieve that may have to do with the realization that you have no chance. If you had, I bet you'd try it. I certainly would, if only so I could stop Putin. But would me being all-powerful be a good thing for the rest of the world? I doubt it.

Romance is a canonical example of where you really don't want to be all powerful (if real romance is what you want). Romance could not exist if your romantic partner always predictably did everything you ever wanted. The whole point is they are a different person, with different wishes, and you have to figure out how to navigate that and its unpredictabilities. That is the "fun" of romance. So no, I don't think everyone would really use that magic wand.

Comment by TW123 (ThomasWoodside) on No free lunch theorem is irrelevant · 2022-10-04T01:11:41.239Z · LW · GW

I think it's tangentially relevant in certain cases. Here's something I wrote in another context, where I think it's perhaps useful to understand what we really mean when we say "intelligence."

We consider humans intelligent not because they do better on all possible optimization problems (they don’t, due to the no free lunch theorem), but because they do better on a subset of problems that are actually encountered in the real world. For instance, humans have particular cognitive architectures that allow us to understand language well, and language is something that humans need to understand. This can be seen even more clearly with “cognitive biases”, which are errors in logical reasoning that are nevertheless advantageous for surviving in the ancestral human world. Recently, people have tried to rid themselves of such biases, because of their belief that they no longer help in the modern world, a perfect example of the fact that human intelligence is highly domain-dependent.

We can’t do arithmetic nearly as well as machines, but we don’t view them as intelligent because we can do many more things than they can, more flexibly and better. The machine might reply that it can do many things as well: it can quickly multiply 3040 by 2443, but also 42323 by 3242, and 379 by 305, and in an absolutely huge number of these calculations. The human might respond that these are all “just multiplication problems”; the machine might say that human problems are “just thriving-on-planet-earth problems”. When we say intelligence, we essentially just mean “ability to excel at thriving-on-planet-earth problems,” which requires knowledge and cognitive architectures that are specifically good at thriving-on-planet-earth. Thinking too much about "generality" tends to confuse; instead consider performance on thriving-on-planet-earth problems, or particular subsets of those problems.

Agree it's mostly not relevant.

Comment by TW123 (ThomasWoodside) on If you want to learn technical AI safety, here's a list of AI safety courses, reading lists, and resources · 2022-10-04T01:00:32.199Z · LW · GW

Link is course.mlsafety.org.

Comment by TW123 (ThomasWoodside) on A Bird's Eye View of the ML Field [Pragmatic AI Safety #2] · 2022-09-27T16:16:32.610Z · LW · GW

(Not reviewed by Dan Hendrycks.)

This post is about epistemics, not about safety techniques, which are covered in later parts of the sequence. Machine learning, specifically deep learning, is the dominant paradigm that people believe will lead to AGI. The researchers who are advancing the machine learning field have proven quite good at doing so, insofar as they have created rapid capabilities advancements. This post sought to give an overview of how they do this, which is in my view extremely useful information! We strongly do not favor advancements of capabilities in the name of safety, and that is very clear in the rest of this sequence. But it seems especially odd to say that one should not even talk about how capabilities have been advanced.

The amount of capabilities research is simply far greater than safety research. Thus, to answer the question "what kind of research approaches generally work for shaping machine learning systems?" it is quite useful to engage with how they have worked in capabilities advancements. In machine learning, theoretical (in the "math proofs" sense of the word) approaches to advancing capabilities have largely not worked. This suggests deep learning is not amenable to these kinds of approaches.

It sounds like you believe that these approaches will be necessary for AI safety, so no amount of knowledge of their inefficacy should persuade us in favor of more iterative, engineering approaches. To put it another way: if iterative engineering practices will never ensure safety, then it does not matter if they worked for capabilities, so we should be working mainly on theory.

I pretty much agree with the logic of this, but I don't agree with the premise: I don't think it's the case that iterative engineering practices will never ensure safety. The reasons for this are covered in this sequence. Theory, on the other hand, is a lot more ironclad than iterative engineering approaches, if usable theory could actually be produced. Knowledge that usable theory has not really been produced in deep learning suggests to me that it's unlikely to be produced for safety, either. Thus, to me, an iterative engineering approach appears more favorable despite the fact that it leaves more room for error. However, I think that if you believe iterative engineering approaches will never work, then indeed you should work on theory despite what should be a very strong prior (based on what is said in this post) that theory will also not work.

Comment by TW123 (ThomasWoodside) on How Do AI Timelines Affect Existential Risk? · 2022-08-29T18:27:24.253Z · LW · GW

From my comments on the MLSS project submission (which aren't intended to be comprehensive):

Quite enjoyed reading this, thanks for writing!

My guess is that the factors combine to create a roughly linear model. Even if progress is unpredictable and not linear, the average rate of progress will still be linear.

I’m very skeptical that this is a linear interpolation. It’s the core of your argument, but I didn’t think it was really argued. I would be very surprised if moving from 50% to 49% risk took similar time as moving from 2% to 1% risk, even if there are more researchers, unless the research pool grows exponentially. I don’t really think you’ve justified this linear trend.

The report also seems to just assume aligned AI would reduce other x-risk to zero. I’m not sure why this should be assumed. I can see a case for a large reduction in it, but it's not necessarily obvious.

Lastly, it felt strange to me not to explore risks from cognitively enhanced humans: for instance, risks that cognitively enhanced humans would have different values, or risks that cognitively enhanced humans would subjugate unenhanced humans.

Comment by TW123 (ThomasWoodside) on (My understanding of) What Everyone in Technical Alignment is Doing and Why · 2022-08-29T04:41:21.111Z · LW · GW

Thanks! I really appreciate it, and think it's a lot more accurate now. Nitpicks:

I think the MLSS link is currently broken. Also, in the headline table, it still emphasizes model robustness perhaps more than is warranted.

Comment by TW123 (ThomasWoodside) on (My understanding of) What Everyone in Technical Alignment is Doing and Why · 2022-08-29T04:37:06.465Z · LW · GW

As somebody who used to be an intern at CHAI, but certainly isn't speaking for the organization:

CHAI seems best approximated as a collection of researchers doing a bunch of different things. There is more reinforcement learning at CHAI than elsewhere, and it's ML research, but it's not top down at all so it doesn't feel that unified. Stuart Russell has an agenda, but his students have their own agendas which only sometimes overlap with his.

Comment by TW123 (ThomasWoodside) on (My understanding of) What Everyone in Technical Alignment is Doing and Why · 2022-08-29T04:06:00.998Z · LW · GW

Also, as to your comment:

My worry is that academics will pursue strategies that work right now but won't work for AGI, because they are trying to win the competition instead of align AGIs. This might be really helpful though.

(My personal opinion, not necessarily the opinion of CAIS.) I pretty much agree. It's the job of the concretizers (and also grantmakers, to some extent) to incentivize/nudge research in a useful direction rather than a non-useful one, and for fieldbuilding to shift researchers towards more explicitly considering x-risk. But, as you say, competition can be a valuable force; if you can set the incentives right, it might not be necessary for all researchers to care about x-risk. If you can give them a fun problem to solve and make sure it's actually relevant and they are only rewarded for actually relevant work, then good research could still be produced. Relevant research has been produced by the ML community before by people who weren't explicitly thinking about x-risk (mostly "accidentally", i.e. not because anyone who cared about x-risk told them/incentivized them to, but hopefully this will change).

Also, iterative progress involves making progress that works now but might not in the future. That's ok, as long as some of it does in fact work in the future.

Comment by TW123 (ThomasWoodside) on (My understanding of) What Everyone in Technical Alignment is Doing and Why · 2022-08-29T03:57:28.835Z · LW · GW

Thanks so much for writing this! I think it's a very useful resource to have. I wanted to add a few thoughts on your description of CAIS, which might help make it more accurate.

[Note: I worked full time at CAIS from its inception until a couple weeks ago. I now work there on a part time basis while finishing university. This comment hasn't been reviewed by others at CAIS, but I'm pretty confident it's accurate.]

For somebody external to CAIS, I think you did a fairly good job describing the organization so thank you! I have a couple things I'd probably change:

  • First, our outreach is not just to academics, but also to people in industry. We usually use the term "ML community" rather than "academia" for this reason.
  • Second, the technical research side of the organization is about a lot more than robustness. We do research in Trojans as you mention, which isn't robustness, but also in machine ethics, cooperative AI, anomaly detection, forecasting, and probably more areas soon. We are interested in most of the areas in Open Problems in AI X-Risk, but the extent to which we're actively working on them varies.
  • I also think it might be good to add our newly-announced (so maybe after you wrote the post) Philosophy Fellowship, which focuses on recruiting philosophers to study foundational conceptual problems in AI risk. This might correct a misconception that CAIS isn't interested in conceptual research; we very much are, but of a different flavor than some others, which I would broadly characterize as "more like philosophy, less like math".
  • Also, there is no way you would have known about this since we've never said it publicly anywhere, but we intend to also build out compute and research engineering infrastructure for academics specifically, who often don't have funding for compute and, even if they do, don't have the support necessary to leverage it. Building out a centralized way for safety academics to access compute and engineering support would create economies of scale (especially the compute contracts and compute infrastructure). However, these plans are in early stages.
  • Another fieldbuilding effort maybe worth mentioning is ML Safety Scholars.

In general, here is how I personally describe the theory of change for CAIS. This hasn't been reviewed by anyone, and I don't know how much Dan personally likes it, but it's how I think of it.  It's also not very polished, sorry. Anyway, to me there are three major forms of research:

  1. Philosophizing. Many AI safety problems are still very undefined. We need people to think about the properties of possible systems at a high level and tease out relevant considerations and possible solutions. This is exactly what philosophers do and why we are interested in the program above. Without this kind of conceptual research, it's very difficult to figure out concrete problems to work on.
  2. Concretization. It does us no good if the ideas generated in philosophizing are never concretized. Part of this is because no amount of thinking can substitute for real experimentation and implementation. Part of this is because it won't be long before we really need progress: we can't afford to just philosophize. Concretization involves taking the high level ideas and implementing something that usefully situates them in empirical systems. Benchmarks are an example of this.
  3. Iterative improvements. Once an idea is concretized, the initial concretization is likely not optimal. We need people to make tweaks and make the initial methods better at achieving their aims, according to the concretized ideas. Most papers produced by the broader ML community are iterative improvements.

CAIS intends to be the glue that integrates all three of these areas. Through our philosophy fellowship program, we will train philosophers to do useful conceptual research while working in close proximity with ML researchers. Most of our ML research focuses on building foundational methods and benchmarks that can take fuzzy problems and concretize them. Lastly, we see our fieldbuilding effort as very much driving iterative improvements: who better to make iterative improvements on well-defined safety problems than the ML community? They have shown themselves to be quite good at this when it comes to general capabilities.

For a more in depth look at our research theory of impact, I suggest Pragmatic AI Safety.

Edit: I realized your post made me actually write things up that I hadn't before, because I thought it would likely be more accurate than the (great for an outsider!) description that you had written. This strikes me as a very positive outcome of this post, and I hope others who feel their descriptions miss something will do the same!

Comment by TW123 (ThomasWoodside) on Announcing the Distillation for Alignment Practicum (DAP) · 2022-08-18T22:47:51.281Z · LW · GW

I think the meaning of "distillation" is used differently by different people, and this confuses me. Originally (based on John Wentworth's post) I thought that "distillation" meant:

"Take existing ideas from a single existing work (or from a particular person) and present them in a way that is more understandable or more concise."

That's also the definition seemingly used by the Distillation Contest.

But then you give these examples in your curriculum:

Bushwackers:

  • Explain how a sharp left turn would look like in current ML paradigms
  • Explain the connection between Agent Foundations and ELK

Rosetta Scribes

  • Interpretability research -> Chaos Theory -> Interpretability research
  • Content extrapolation -> Causality theory (causal inference)
  • e.t.c  - open-ended and lose format, what field to translate to is probably very dependent on the problem

Field Mapping

  • Map out the timelines to AGI and identify the intersections and state the arguments for why we will go one down one road vs the other at each intersection
  • Systematically investigate the field with a set of assumptions about the road ahead and look at what research methodologies past the test (like Nate Soares did for MIRIs arguments about a sharp left turn.)
  • Other research methodology that elucidates where we should be going

Propagators

Trailblazers

  • Explain every concept in AI alignment using QCD?
  • Come up with new ways of doing distillations here?

It seems like you mean something more like:

"Write something understandable that presents ideas in an intuitive way and possibly draws from many different works"

But in that case, I am not sure how this is different from "conceptual research where you try hard to present your work in an understandable way." In which case, the meaning of "distillation" has become hopelessly stretched.

Could you include a clear definition of "distillation," such that it includes clear examples of what is and isn't considered a distillation? I would ask you to write a distillation of what a distillation is, but I don't know if I'd be using the term distillation correctly.

Comment by TW123 (ThomasWoodside) on Announcing the Introduction to ML Safety course · 2022-08-09T03:48:20.307Z · LW · GW

Thanks for the suggestion, Richard! It actually probably fits best under one of the forthcoming lectures, but for now we added it to emergent behavior.

Comment by TW123 (ThomasWoodside) on $20K In Bounties for AI Safety Public Materials · 2022-08-05T22:32:08.272Z · LW · GW

These are already the top ~10%; the vast majority of the submissions aren't included. We didn't feel we really had enough data to accurately rank within these top 80 or so, though some are certainly better than others. Also, it really depends on the point you're trying to make or the audience; I don't think there really exists an objective ordering.

We did do categorization at one point, but many points fall into multiple categories and there are a lot of individual points such that we didn't find it very useful when we had them categorized.

Comment by TW123 (ThomasWoodside) on $20K In Bounties for AI Safety Public Materials · 2022-08-05T22:30:44.217Z · LW · GW

I'm not sure what you mean by "using bullet-pointed summaries of the 7 works stated in the post". If you mean the past examples of good materials, I'm not sure how good of an idea that is. We don't just want submissions to be rephrasings/"distillations" of single pieces of prior work.

I'm also not sure we literally tell you how to win, but yes, reading the instructions would be useful.

Comment by TW123 (ThomasWoodside) on A daily routine I do for my AI safety research work · 2022-07-19T23:19:40.626Z · LW · GW

I made this link, which combines the arXiv listings from the last day in AI, ML, Computation and Language, Computer Vision, and Computers and Society into a single view. Since some papers are listed under multiple areas, I prefer to view this so I don't skim over the same paper twice. If you bookmark it, it's just one click per day!
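For anyone who wants to reproduce something like this themselves, here is a rough sketch of the same idea using the public arXiv API via the feedparser library. The category codes, the max_results value, and the deduplication-by-ID approach are my own choices for illustration; the original link may work differently.

```python
# Rough sketch: pull recent arXiv submissions from several CS categories
# and list each paper once, even if it is cross-listed. Uses the public
# arXiv Atom API via feedparser.
import feedparser

CATEGORIES = ["cs.AI", "cs.LG", "cs.CL", "cs.CV", "cs.CY"]
API = ("http://export.arxiv.org/api/query?search_query=cat:{cat}"
       "&sortBy=submittedDate&sortOrder=descending&max_results=50")

seen = set()
for cat in CATEGORIES:
    feed = feedparser.parse(API.format(cat=cat))
    for entry in feed.entries:
        arxiv_id = entry.id.split("/abs/")[-1]  # e.g. "2207.12345v1"
        if arxiv_id in seen:
            continue  # already listed under another category
        seen.add(arxiv_id)
        print(f"[{cat}] {entry.title.strip()}")
```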

Comment by TW123 (ThomasWoodside) on Examples of AI Increasing AI Progress · 2022-07-19T16:10:59.640Z · LW · GW

No need to delete the tweet. I agree the examples are not info hazards; they're all publicly known. I just probably wouldn't want somebody going to good ML researchers who currently are doing something that isn't really capabilities (e.g., application of ML to some other area) and telling them "look at this, AGI soon."