capybaralet's Shortform 2020-08-27T21:38:18.144Z · score: 5 (1 votes)
A reductio ad absurdum for naive Functional/Computational Theory-of-Mind (FCToM). 2020-01-02T17:16:35.566Z · score: 4 (5 votes)
A list of good heuristics that the case for AI x-risk fails 2019-12-02T19:26:28.870Z · score: 24 (18 votes)
What I talk about when I talk about AI x-risk: 3 core claims I want machine learning researchers to address. 2019-12-02T18:20:47.530Z · score: 28 (16 votes)
A fun calibration game: "0-hit Google phrases" 2019-11-21T01:13:10.667Z · score: 7 (4 votes)
Can indifference methods redeem person-affecting views? 2019-11-12T04:23:10.011Z · score: 11 (4 votes)
What are the reasons to *not* consider reducing AI-Xrisk the highest priority cause? 2019-08-20T21:45:12.118Z · score: 30 (8 votes)
Project Proposal: Considerations for trading off capabilities and safety impacts of AI research 2019-08-06T22:22:20.928Z · score: 34 (17 votes)
False assumptions and leaky abstractions in machine learning and AI safety 2019-06-28T04:54:47.119Z · score: 23 (6 votes)
Let's talk about "Convergent Rationality" 2019-06-12T21:53:35.356Z · score: 28 (9 votes)
X-risks are a tragedies of the commons 2019-02-07T02:48:25.825Z · score: 9 (5 votes)
My use of the phrase "Super-Human Feedback" 2019-02-06T19:11:11.734Z · score: 13 (8 votes)
Thoughts on Ben Garfinkel's "How sure are we about this AI stuff?" 2019-02-06T19:09:20.809Z · score: 25 (12 votes)
The role of epistemic vs. aleatory uncertainty in quantifying AI-Xrisk 2019-01-31T06:13:35.321Z · score: 14 (8 votes)
Imitation learning considered unsafe? 2019-01-06T15:48:36.078Z · score: 12 (6 votes)
Conceptual Analysis for AI Alignment 2018-12-30T00:46:38.014Z · score: 26 (9 votes)
Disambiguating "alignment" and related notions 2018-06-05T15:35:15.091Z · score: 43 (13 votes)
Problems with learning values from observation 2016-09-21T00:40:49.102Z · score: 0 (7 votes)
Risks from Approximate Value Learning 2016-08-27T19:34:06.178Z · score: 1 (4 votes)
Inefficient Games 2016-08-23T17:47:02.882Z · score: 14 (15 votes)
Should we enable public binding precommitments? 2016-07-31T19:47:05.588Z · score: 0 (1 votes)
A Basic Problem of Ethics: Panpsychism? 2015-01-27T06:27:20.028Z · score: -4 (11 votes)
A Somewhat Vague Proposal for Grounding Ethics in Physics 2015-01-27T05:45:52.991Z · score: -3 (16 votes)


Comment by capybaralet on Why GPT wants to mesa-optimize & how we might change this · 2020-09-21T00:11:34.878Z · score: 2 (2 votes) · LW · GW

I didn't read the post (yet...), but I'm immediately skeptical of the claim that beam search is useful here ("in principle"), since GPT-3 is just doing next step prediction (it is never trained on its own outputs, IIUC). This means it should always just match the conditional P(x_t | x_1, .., x_{t-1}). That conditional itself can be viewed as being informed by possible future sequences, but conservation of expected evidence says we shouldn't be able to gain anything by doing beam search if we already know that conditional. Now it's true that efficiently estimating that conditional using a single forward pass of a transformer might involve approximations to beam search sometimes.
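A minimal sketch of the conservation-of-expected-evidence point, assuming a toy random joint distribution over 3-token binary sequences (the distribution, `make_joint`, and the other helper names are illustrative, not anything from GPT-3). It checks that P(x_2 | x_1) computed directly equals the average, over possible futures x_3, of P(x_2 | x_1, x_3). It only illustrates that identity; it says nothing about whether beam search helps when decoding whole sequences.

```python
import itertools
import random

VOCAB = (0, 1)
SEQ_LEN = 3


def make_joint(seed=0):
    """Toy joint distribution over all sequences of length SEQ_LEN (assumption)."""
    rng = random.Random(seed)
    seqs = list(itertools.product(VOCAB, repeat=SEQ_LEN))
    weights = [rng.random() + 0.01 for _ in seqs]  # strictly positive weights
    total = sum(weights)
    return {s: w / total for s, w in zip(seqs, weights)}


def marginal(joint, fixed):
    """Probability that a sequence matches `fixed` (dict: position -> token)."""
    return sum(p for s, p in joint.items()
               if all(s[i] == tok for i, tok in fixed.items()))


def direct_conditional(joint, x1):
    """P(x_2 | x_1): the quantity a next-step predictor is trained to match."""
    denom = marginal(joint, {0: x1})
    return {tok: marginal(joint, {0: x1, 1: tok}) / denom for tok in VOCAB}


def averaged_over_futures(joint, x1):
    """Sum over futures x_3 of P(x_3 | x_1) * P(x_2 | x_1, x_3)."""
    denom = marginal(joint, {0: x1})
    out = {tok: 0.0 for tok in VOCAB}
    for future in VOCAB:
        p_prefix_and_future = marginal(joint, {0: x1, 2: future})
        p_future = p_prefix_and_future / denom
        for tok in VOCAB:
            p_tok = marginal(joint, {0: x1, 1: tok, 2: future}) / p_prefix_and_future
            out[tok] += p_future * p_tok
    return out


joint = make_joint()
direct = direct_conditional(joint, x1=0)
averaged = averaged_over_futures(joint, x1=0)
print(direct, averaged)
```

The two dictionaries agree up to floating-point error, which is the sense in which knowing the exact conditional already accounts for the possible futures.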

At a high level, I don't think we really need to be concerned with this form of "internal lookahead" unless/until it starts to incorporate mechanisms outside of the intended software environment (e.g. the hardware, humans, the external (non-virtual) world).

Comment by capybaralet on Why GPT wants to mesa-optimize & how we might change this · 2020-09-21T00:02:00.728Z · score: 1 (1 votes) · LW · GW

Seq2seq used beam search and found it helped. It was standard practice in the early days of NMT; I'm not sure when that changed.

This blog post gives some insight into why beam search might not be a good idea, and is generally very interesting:

Comment by capybaralet on Radical Probabilism · 2020-09-20T23:57:32.509Z · score: 1 (1 votes) · LW · GW

This blog post seems superficially similar, but I can't say ATM if there are any interesting/meaningful connections:

Comment by capybaralet on AI Research Considerations for Human Existential Safety (ARCHES) · 2020-09-19T04:54:37.962Z · score: 3 (2 votes) · LW · GW

There is now also an interview with Critch here:

Comment by capybaralet on capybaralet's Shortform · 2020-09-18T23:30:30.114Z · score: 3 (2 votes) · LW · GW

A lot of the discussion of mesa-optimization seems confused.

One thing that might help clear up the confusion is to remember that "learning" and "inference" should not be thought of as cleanly separated in the first place; see, e.g., AIXI...

So when we ask "is it learning? Or just solving the task without learning", this seems like a confused framing to me. Suppose your ML system learned an excellent prior, and then just did Bayesian inference at test time. Is that learning? Sure, why not. It might not use a traditional search/optimization algorithm, but it probably has to do *something* like that for computational reasons if it wants to do efficient approximate Bayesian inference over a large hypothesis space, so...

Comment by capybaralet on Developmental Stages of GPTs · 2020-09-18T23:25:57.570Z · score: 3 (2 votes) · LW · GW
Sometimes people will give GPT-3 a prompt with some examples of inputs along with the sorts of responses they'd like to see from GPT-3 in response to those inputs ("few-shot learning", right? I don't know what 0-shot learning you're referring to.)

No, that's zero-shot. Few shot is when you train on those instead of just stuffing them into the context.

It looks like mesa-optimization because it seems to be doing something like learning about new tasks or new prompts that are very different from anything it's seen before, without any training, just based on the context (0-shot).

Is your claim that GPT-3 succeeds at this sort of task by doing something akin to training a model internally?

By "training a model", I assume you mean "a ML model" (as opposed to, e.g. a world model). Yes, I am claiming something like that, but learning vs. inference is a blurry line.

I'm not saying it's doing SGD; I don't know what it's doing in order to solve these new tasks. But TBC, 96 steps of gradient descent could be a lot. MAML does meta-learning with 1.
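To make "meta-learning with 1 step" concrete, here is a hedged sketch of a single inner-loop adaptation step in the style of MAML, on a toy 1-D regression task. The task, learning rate, and function names are assumptions for illustration; the outer meta-update that would learn `w_init` across tasks is omitted.

```python
def mse_loss(w, data):
    """Mean squared error of the model y = w * x on (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def mse_grad(w, data):
    """Derivative of mse_loss with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)


def adapt_one_step(w_init, support, inner_lr=0.1):
    """One gradient step from the (meta-learned) initialization on a new task."""
    return w_init - inner_lr * mse_grad(w_init, support)


# Hypothetical new task: y = 3x, seen through three support examples.
support = [(1.0, 3.0), (2.0, 6.0), (-1.0, -3.0)]
w_init = 2.0  # stands in for an initialization found by meta-training
w_adapted = adapt_one_step(w_init, support)
print(mse_loss(w_init, support), mse_loss(w_adapted, support))
```

With a good initialization, a single step like this can already move the parameters most of the way to a new task, which is why many sequential steps of something gradient-like could in principle do a lot.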

Comment by capybaralet on capybaralet's Shortform · 2020-09-18T23:17:22.177Z · score: 9 (5 votes) · LW · GW

It seems like a lot of people are still thinking of alignment as too binary, which leads to critical errors in thinking like: "there will be sufficient economic incentives to solve alignment", and "once alignment is a bottleneck, nobody will want to deploy unaligned systems, since such a system won't actually do what they want".

It seems clear to me that:

1) These statements are true for a certain level of alignment, which I've called "approximate value learning" in the past. I think I might have also referred to it as "pretty good alignment" or "good enough alignment" at various times.

2) This level of alignment is suboptimal from the point of view of x-safety, since the downside risk of extinction for the actors deploying the system is less than the downside risk of extinction summed over all humans.

3) We will develop techniques for "good enough" alignment before we develop techniques that are acceptable from the standpoint of x-safety.

4) Therefore, the expected outcome is: once "good enough alignment" is developed, a lot of actors deploy systems that are aligned enough for them to benefit from them, but still carry an unacceptably high level of x-risk.

5) Thus if we don't improve alignment techniques quickly enough after developing "good enough alignment", its development will likely lead to a period of increased x-risk (under the "alignment bottleneck" model).

Comment by capybaralet on capybaralet's Shortform · 2020-09-18T23:06:15.180Z · score: 3 (2 votes) · LW · GW

No, I'm talking about it breaking out during training. The only "shifts" here are:

1) the AI gets smarter

2) (perhaps) the AI covertly influences its external environment (i.e. breaks out of the box a bit).

We can imagine scenarios where it's only (1) and not (2). I find them a bit more far-fetched, but this is the classic vision of the treacherous turn... the AI makes a plan, and then suddenly executes it to attain DSA. Once it starts to execute, ofc there is distributional shift, but:

A) it is auto-induced distributional shift

B) the developers never decided to deploy

Comment by capybaralet on [AN #115]: AI safety research problems in the AI-GA framework · 2020-09-17T21:19:01.438Z · score: 3 (2 votes) · LW · GW

MAIEI also has an AI Ethics newsletter I recommend for those interested in the topic.

Comment by capybaralet on [AN #115]: AI safety research problems in the AI-GA framework · 2020-09-17T21:17:51.415Z · score: 1 (1 votes) · LW · GW
I actually expect that the work needed for the open-ended search paradigm will end up looking very similar to the work needed by the “AGI via deep RL” paradigm: the differences I see are differences in difficulty, not differences in what problems qualitatively need to be solved.

I'm inclined to agree. I wonder if there are any distinctive features that jump out?

Comment by capybaralet on [AN #114]: Theory-inspired safety solutions for powerful Bayesian RL agents · 2020-09-17T21:11:03.422Z · score: 1 (1 votes) · LW · GW

Regarding curriculum learning: I think it's very neglected, and seems likely to be a core component of prosaic alignment approaches. The idea of a "basin of attraction for corrigibility (or other desirable properties)" seems likely to rely on appropriate choice of curriculum.

Comment by capybaralet on capybaralet's Shortform · 2020-09-17T19:39:33.985Z · score: 5 (3 votes) · LW · GW

I'm frustrated with the meme that "mesa-optimization/pseudo-alignment is a robustness (i.e. OOD) problem". IIUC, this is definitionally true in the mesa-optimization paper, but I think this misses the point.

In particular, this seems to exclude an important (maybe the most important) threat model: the AI understands how to appear aligned, and does so, while covertly pursuing its own objective on-distribution, during training.

This is exactly how I imagine a treacherous turn from a boxed superintelligent AI agent to occur, for instance. It secretly begins breaking out of the box (e.g. via manipulating humans) and we don't notice until it's too late.

Comment by capybaralet on Why is pseudo-alignment "worse" than other ways ML can fail to generalize? · 2020-09-17T19:20:20.943Z · score: 5 (3 votes) · LW · GW

I disagree with the framing that: "pseudo-alignment is a type of robustness/distributional shift problem". This is literally true based on how it's defined in the paper. But I think in practice, we should expect approximately aligned mesa-optimizers that do very bad things on-distribution (without being detected).

Comment by capybaralet on Mesa-Search vs Mesa-Control · 2020-09-17T19:02:43.306Z · score: 1 (1 votes) · LW · GW

I guess most of my cruxes are RE your 2nd "=>", and can almost be viewed as breaking down this question into sub-questions. It might be worth sketching out a quantitative model here.

Comment by capybaralet on Mesa-Search vs Mesa-Control · 2020-09-17T19:01:14.255Z · score: 3 (2 votes) · LW · GW

Yep. I'd love to see more discussion around these cruxes (e.g. I'd be up for a public or private discussion sometime, or moderating one with someone from MIRI). I'd guess some of the main underlying cruxes are:

  • How hard are these problems to fix?
  • How motivated will the research community be to fix them?
  • How likely will developers be to use the fixes?
  • How reliably will developers need to use the fixes? (e.g. how much x-risk would result from a small company *not* using them?)

Personally, OTTMH (numbers pulled out of my ass), my views on these cruxes are:

  • It's hard to say, but I'd say there's a ~85% chance they are extremely difficult (effectively intractable on short-to-medium (~40yrs) timelines).
  • A small minority (~1-20%) of researchers will be highly motivated to fix them, once they are apparent/prominent. More researchers (~10-80%) will focus on patches.
  • Conditioned on fixes being easy and cheap to apply, large orgs will be very likely to use them (~90%); small orgs less so (~50%). Fixes are likely to be easy to apply (we'll build good tools), if they are cheap enough to be deemed "practical", but very unlikely (~10%) to be cheap enough.
  • It will probably need to be highly reliable; "the necessary intelligence/resources needed to destroy the world goes down every year" (unless we make a lot of progress on governance, which seems fairly unlikely (~15%)).
Comment by capybaralet on Mesa-Search vs Mesa-Control · 2020-09-17T07:59:10.276Z · score: 3 (2 votes) · LW · GW

I'm very curious to know whether people at MIRI in fact disagree with this claim.

I would expect that they don't... e.g. Eliezer seems to think we'll see them and patch them unsuccessfully:

Comment by capybaralet on Mesa-Search vs Mesa-Control · 2020-09-17T07:47:04.008Z · score: 1 (1 votes) · LW · GW

Practically speaking, I think the big difference is that the history is outside of GPT-3's control, but a recurrent memory would be inside its control.

Comment by capybaralet on Mesa-Search vs Mesa-Control · 2020-09-17T07:38:31.628Z · score: 0 (2 votes) · LW · GW
Comment by capybaralet on capybaralet's Shortform · 2020-09-17T06:11:28.481Z · score: 3 (2 votes) · LW · GW

It might be a passive request, I'm not actually sure... I'd think of it more like an invitation, which you are free to decline. Although OFC, declining an invitation does send a message whether you like it or not *shrug*.

Comment by capybaralet on capybaralet's Shortform · 2020-09-16T23:04:19.358Z · score: 1 (1 votes) · LW · GW

I guess one problem here is that how someone responds to such a statement carries information about how much they respect you...

If someone you are honored to even get the time of day from writes that, you will almost certainly craft a strong response about X...

Comment by capybaralet on capybaralet's Shortform · 2020-09-16T23:03:02.965Z · score: 3 (2 votes) · LW · GW

I like "tell culture" and find myself leaning towards it more often these days, but e.g. as I'm composing an email, I'll find myself worrying that the recipient will just interpret a statement like: "I'm curious about X" as a somewhat passive request for information about X (which it sort of is, but also I really don't want it to come across that way...)

Anyone have thoughts/suggestions?

Comment by capybaralet on capybaralet's Shortform · 2020-09-16T09:22:07.097Z · score: 3 (2 votes) · LW · GW

As alignment techniques improve, they'll get good enough to solve new tasks before they get good enough to do so safely. This is a source of x-risk.

Comment by capybaralet on capybaralet's Shortform · 2020-09-15T23:53:17.695Z · score: 1 (1 votes) · LW · GW

What's our backup plan if the internet *really* goes to shit?

E.g. Google search seems to have suddenly gotten way worse for searching for machine learning papers for me in the last month or so. I'd gotten used to it being great, and don't have a good backup.

Comment by capybaralet on capybaralet's Shortform · 2020-09-15T23:50:56.763Z · score: 1 (1 votes) · LW · GW

A friend asked me what EAs think of

Here's my response (based on ~1 minute of Googling):

He seems to have what I call a "moral purity" attitude towards morality.
By this I mean, thinking of ethics as less consequentialist and more about "being a good person".

I think such an attitude is natural, very typical and not very EA. So, e.g., living frugally might or might not be EA, but definitely makes sense if you believe we have strong charitable obligations and have a moral purity attitude towards morality.

Moving away from moral purity and giving consequentialist arguments against it is maybe one of the main insights of EA vs. Peter Singer.

Comment by capybaralet on TurnTrout's shortform feed · 2020-09-15T08:27:19.399Z · score: 1 (1 votes) · LW · GW

I found this fascinating... it's rare these days that I see some fundamental assumption in my thinking that I didn't even realize I was making laid bare like this... it is particularly striking because I think I could easily have realized that my own experience contradicts catharsis theory... I know that I can distract myself to become less angry, but I usually don't want to, in the moment.

I think that desire is driven by emotion, but rationalized via something like catharsis theory. I want to try and rescue catharsis theory by saying that maybe there are negative long-term effects of being distracted from feelings of anger (e.g. a build up of resentment). I wonder how much this is also a rationalization.

I also wonder how accurately the authors have characterized catharsis theory, and how much to identify it with the "hydraulic model of anger"... I would imagine that there are lots of attempts along the lines of what I suggested to try and rescue catharsis theory by refining or moving away from the hydraulic model. A highly general version might claim: "over a long time horizon, not 'venting' anger is net negative".

Comment by capybaralet on MakoYass's Shortform · 2020-09-15T07:51:32.701Z · score: 1 (1 votes) · LW · GW

The obvious bad consequence is a false sense of security leading people to just get BCIs instead of trying harder to shape (e.g. delay) AI development.

"You can't make horses competitive with cars by giving them exoskeletons." <-- this reads to me like a separate argument, rather than a restatement of the one that came before.

I agree that BCI seems unlikely to be a good permanent/long-term solution, unless it helps us solve alignment, which I think it could. It could also just defuse a conflict between AIs and humans, leading us to gracefully give up our control over the future light cone instead of fighting a (probably losing) battle to retain it.

...Your post made me think more about my own (and others') reasons for rejecting Neuralink as a bad idea... I think there's a sense of "we're the experts and Elon is a n00b". This coupled with feeling a bit burned by Elon first starting his own AI safety org and then ditching it for this... overall doesn't feel great.

Comment by capybaralet on Tofly's Shortform · 2020-09-15T07:40:15.381Z · score: 3 (2 votes) · LW · GW

Regarding your "intuition that there should be some “best architecture”, at least for any given environment, and that this architecture should be relatively “simple”.", I think:

1) I'd say "task" rather than "environment", unless I wanted to emphasize that I think selection pressure trumps the orthogonality thesis (I'm ambivalent, FWIW).

2) I don't see why it should be "simple" (and relative to what?) in every case, but I sort of share this intuition for most cases...

3) On the other hand, I think any system with other agents probably is much more complicated (IIUC, a lot of people think social complexity drove selection pressure for human-level intelligence in a feedback loop). At a "gears level" the reason this creates an insatiable drive for greater complexity is that social dynamics can be quite winner-takes-all... if you're one step ahead of everyone else (and they don't realize it), then you can fleece them.

Comment by capybaralet on Tofly's Shortform · 2020-09-15T07:34:57.979Z · score: 3 (2 votes) · LW · GW

I don't think asymptotic reasoning is really the right tool for the job here.

We *know* things level off eventually because of physical limits.

But fast takeoff is about how fast we go from where we are now to (e.g.) a superintelligence with a decisive strategic advantage (DSA). DSA probably doesn't require something near the physical limits of computation.

Comment by capybaralet on capybaralet's Shortform · 2020-09-15T07:28:11.208Z · score: 13 (5 votes) · LW · GW

Wow this is a lot better than my FB/Twitter feed :P

:D :D :D

Let's do this guys! This is the new FB :P

Comment by capybaralet on Jimrandomh's Shortform · 2020-09-15T07:27:21.193Z · score: 1 (1 votes) · LW · GW

Sounds like a very general criticism that would apply to any effects that are very strong/consistent in circumstances where a very high-variance (e.g. binary) latent variable takes on a certain value (and the effect is 0 otherwise...).

I wonder how meta-analyses typically deal with that...(?) suggested that very large anomalous effects are usually evidence of fraud, and that meta-analyses may try to prevent a single large effect size study from dominating (IIRC).

Comment by capybaralet on ChristianKl's Shortform · 2020-09-15T07:17:15.416Z · score: 2 (2 votes) · LW · GW

Can you give us the TL;DR on what "proper humidity" means in this context? Google says 30-50% is good (in general). Is the same true for COVID?

Comment by capybaralet on ofer's Shortform · 2020-09-15T07:14:45.486Z · score: 1 (1 votes) · LW · GW

Any ideas where to get good up-to-date info on that? Ideally I'd like to hear if/when we have any significant reductions in uncertainty! :D

Comment by capybaralet on rohinmshah's Shortform · 2020-09-15T07:11:39.840Z · score: 1 (1 votes) · LW · GW

I didn't get it... is the problem with the "look for N opinions" response that you aren't computing the denominator (|"intervening is positive"| + |"intervening is negative"|)?

Comment by capybaralet on capybaralet's Shortform · 2020-09-15T07:09:35.909Z · score: 2 (2 votes) · LW · GW

Although it's not framed this way, I think much of the disagreement about timelines/scaling-hypothesis/deep-learning in the ML community basically comes down to this...

Comment by capybaralet on capybaralet's Shortform · 2020-09-15T07:01:24.055Z · score: 6 (4 votes) · LW · GW

"No Free Lunch" (NFL) results in machine learning (ML) basically say that success all comes down to having a good prior.

So we know that we need a sufficiently good prior in order to succeed.

But we don't know what "sufficiently good" means.

e.g. I've heard speculation that maybe we can use 2^-MDL in any widely used Turing-complete programming language (e.g. Python) for our prior, and that will give enough information about our particular physics for something AIXI-like to become superintelligent e.g. within our lifetime.

Or maybe we can't get anywhere without a much better prior.

DOES ANYONE KNOW of any work/(intelligent thoughts) on this?
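For concreteness, a rough sketch of what a 2^-MDL prior over Python programs might look like. The assumptions are loud ones: description length is approximated by UTF-8 source length in bits (a crude stand-in for a proper prefix-free code), and the three candidate programs and all helper names are made up for illustration.

```python
def description_length_bits(source: str) -> int:
    """Crude MDL proxy (assumption): source length in bits."""
    return 8 * len(source.encode("utf-8"))


def mdl_log2_prior(programs):
    """Unnormalized log2 prior: log2 P(h) = -(description length of h)."""
    return {name: -description_length_bits(src) for name, src in programs.items()}


def normalize(log2_weights):
    """Turn log2 weights into probabilities, subtracting the max for stability."""
    top = max(log2_weights.values())
    unnorm = {k: 2.0 ** (v - top) for k, v in log2_weights.items()}
    z = sum(unnorm.values())
    return {k: w / z for k, w in unnorm.items()}


# Hypothetical hypothesis space: programs predicting the t-th observation.
programs = {
    "constant": "lambda t: 0",
    "alternating": "lambda t: t % 2",
    "squares_mod_3": "lambda t: (t * t) % 3",
}
prior = normalize(mdl_log2_prior(programs))
for name, p in prior.items():
    print(name, p)
```

Every extra character costs a factor of 2^8 in prior weight here, so the open question above is whether a prior this generic concentrates enough mass near hypotheses that describe our actual physics.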

Comment by capybaralet on capybaralet's Shortform · 2020-09-15T06:33:12.306Z · score: 3 (2 votes) · LW · GW

I've been building up drafts for a looooong time......

Comment by capybaralet on capybaralet's Shortform · 2020-09-15T06:32:52.032Z · score: 12 (3 votes) · LW · GW

I have the intention to convert a number of draft LW blog posts into short-forms.

Then I will write a LW post linking to all of them and asking people to request that I elaborate on any that they are particularly interested in.

Comment by capybaralet on capybaralet's Shortform · 2020-09-15T06:31:27.020Z · score: 1 (3 votes) · LW · GW

Moloch is not about coordination failures.

Moloch is about the triumph of instrumental goals.

Coordination *might* save us from that. Or not. "it is too soon to say"

Comment by capybaralet on Against boots theory · 2020-09-15T05:21:05.663Z · score: 5 (3 votes) · LW · GW

I just barely skimmed this. I think it's a good explanation for why poor people are poor, not why rich people are rich (in the modern world, in rich countries). I think this is supported by research, although I wasn't able to track down the sources I heard about... some researchers looked at how poor people manage their finances and found that they were often behaving "rationally" (as much as richer people are) when taking loans with pretty extreme interest rates, for instance.

Comment by capybaralet on capybaralet's Shortform · 2020-09-13T21:23:34.354Z · score: 6 (4 votes) · LW · GW

Regarding the "Safety/Alignment vs. Capabilities" meme: it seems like people are sometimes using "capabilities" to mean 2 different things:

1) "intelligence" or "optimization power"... i.e. the ability to optimize some objective function

2) "usefulness": the ability to do economically valuable tasks or things that people consider useful

I think it is meant to refer to (1).

Alignment is likely to be a bottleneck for (2).

For a given task, we can expect 3 stages of progress:

i) sufficient capabilities(1) to perform the task

ii) sufficient alignment to perform the task unsafely

iii) sufficient alignment to perform the task safely

Between (i) and (ii) we can expect a "capabilities(1) overhang". When we go from (i) to (ii) we will see unsafe AI systems deployed and a potentially discontinuous jump in their ability to do the task.

Comment by capybaralet on Radical Probabilism · 2020-09-03T05:43:36.544Z · score: 1 (1 votes) · LW · GW
the structured space will always give strictly better predictions than the socratic hypothesis.

I don't think so.... suppose in the H/T example, the Socratic hypothesis says that P(H) = P(T) = 3. Then it will always do better than any hypothesis that has to be normalized.

I'm not sure what you mean by "structured hypotheses" here though...

Comment by capybaralet on Radical Probabilism · 2020-09-02T08:20:49.661Z · score: 1 (1 votes) · LW · GW

It seems like you could get pretty far with this approach, and it starts to look pretty Bayesian to me if I update epsilon based on how predictable the world seems to have been, in general, so far.

Comment by capybaralet on Radical Probabilism · 2020-09-02T08:19:05.397Z · score: 1 (1 votes) · LW · GW

It seems like you don't need statistical tests, and can instead include a special "Socratic" hypothesis (which just says "I don't know") in your hypothesis space. This hypothesis can assign some fixed or time-varying probability to any observation (e.g. yielding an unnormalized probability distribution by saying P(X=r) = epsilon for any real number r, assuming all observations X are real-valued). I wonder if that has been explored.
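A hedged sketch of this idea for coin flips, assuming three ordinary biased-coin hypotheses plus a "Socratic" hypothesis that assigns the same fixed score epsilon to every observation (all names and numbers are illustrative). Weights are updated multiplicatively by each hypothesis's score and then renormalized, even though the Socratic hypothesis itself is not a normalized distribution.

```python
def update(weights, scores):
    """One Bayes-style update: multiply weights by scores, then renormalize."""
    posterior = {h: weights[h] * scores[h] for h in weights}
    z = sum(posterior.values())
    return {h: w / z for h, w in posterior.items()}


def run(observations, epsilon):
    """Posterior weights after seeing `observations` ('H'/'T' characters)."""
    coin_biases = {"p=0.1": 0.1, "p=0.5": 0.5, "p=0.9": 0.9}  # P(heads)
    names = list(coin_biases) + ["socratic"]
    weights = {h: 1.0 / len(names) for h in names}  # uniform starting weights
    for obs in observations:
        scores = {h: (p if obs == "H" else 1.0 - p) for h, p in coin_biases.items()}
        scores["socratic"] = epsilon  # "I don't know": same score for anything
        weights = update(weights, scores)
    return weights


data = list("HHHHHHHHTH" * 3)  # 27 heads, 3 tails: a fairly predictable world
predictable = run(data, epsilon=0.3)
oversized = run(data, epsilon=3.0)
print(predictable)
print(oversized)
```

With epsilon = 0.3 the Socratic hypothesis loses nearly all its weight to the p=0.9 coin, because the world turned out to be predictable. With epsilon above 1 it beats every normalized hypothesis no matter what is observed, so the choice of epsilon carries all the content.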

Comment by capybaralet on Prediction = Compression [Transcript] · 2020-09-02T08:04:27.504Z · score: 3 (2 votes) · LW · GW

How could I get in on such zoom talks?

Comment by capybaralet on capybaralet's Shortform · 2020-08-27T21:38:18.538Z · score: 7 (3 votes) · LW · GW

My pet "(AI) policy" idea for a while has been "direct recourse", which is the idea that you can hedge against one party precipitating an irreversible event by giving other parties the ability to disrupt their operations at will.
For instance, I could shut down my competitors' AI project if I think it's an X-risk.
The idea is that I would have to compensate you if I was later deemed to have done this for an illegitimate reason.
If shutting down your AI project is not irreversible, then this system increases our ability to prevent irreversible events, since I might stop some existential catastrophe, and if I shut down your project when I shouldn't, then I just compensate you and we're all good.

Comment by capybaralet on Developmental Stages of GPTs · 2020-08-01T23:21:00.477Z · score: 6 (4 votes) · LW · GW
I think it's an essential time to support projects that can work for a GPT-style near-term AGI, for instance by incorporating specific alignment pressures during training. Intuitively, it seems as if Cooperative Inverse Reinforcement Learning or AI Safety via Debate or Iterated Amplification are in this class.

As I argued here, I think GPT-3 is more likely to be aligned than whatever we might do with CIRL/IDA/Debate ATM, since it is trained with (self)-supervised learning and gradient descent.

The main reason such a system could pose an x-risk by itself seems to be mesa-optimization, so studying mesa-optimization in the context of such systems is a priority (esp. since GPT-3's 0-shot learning looks like mesa-optimization).

In my mind, things like IDA become relevant when we start worrying about remaining competitive with agent-y systems built using self-supervised learning systems as a component, but actually come with a safety cost relative to SGD-based self-supervised learning.

This is less the case when we think about them as methods for increasing interpretability, as opposed to increasing capabilities (which is how I've mostly seen them framed recently, a la the complexity theory analogies).

Comment by capybaralet on Rationalists, Post-Rationalists, And Rationalist-Adjacents · 2020-08-01T19:35:37.811Z · score: 1 (1 votes) · LW · GW

I strongly disagree with this definition of a rationalist. I think it's way too narrow, and assumes a certain APPROACH to "winning" that is likely incorrect.

Comment by capybaralet on Predictions for GPT-N · 2020-08-01T18:18:16.285Z · score: 3 (2 votes) · LW · GW

I think GPT-3 should be viewed as roughly as aligned as IDA would be if we pursued it using our current understanding. GPT-3 is trained via self-supervised learning (which is, on the face of it, myopic), so the only obvious x-safety concerns are something like mesa-optimization.

In my mind, the main argument for IDA being safe is still myopia.

I think GPT-3 seems safer than (recursive) reward modelling, CIRL, or any other alignment proposals based on deliberately building agent-y AI systems.


In the above, I'm ignoring the ways in which any of these systems increase x-risk via their (e.g. destabilizing) social impact and/or contribution towards accelerating timelines.

Comment by capybaralet on What specific dangers arise when asking GPT-N to write an Alignment Forum post? · 2020-07-29T03:26:41.028Z · score: 1 (1 votes) · LW · GW

In other words, there is a fully general argument for learning algorithms producing mesa-optimization to the extent that they use relatively weak learning algorithms on relatively hard tasks.

It's very unclear ATM how much weight to give this argument in general, or in specific contexts.

But I don't think it's particularly sensitive to the choice of task/learning algorithm.

Comment by capybaralet on What specific dangers arise when asking GPT-N to write an Alignment Forum post? · 2020-07-29T02:46:28.220Z · score: 1 (1 votes) · LW · GW

No, and I don't think it really matters too much... what's more important is the "architecture" of the "mesa-optimizer". It's doing something that looks like search/planning/optimization/RL.

Roughly speaking, the simplest form of this model of how things work says: "It's so hard to solve NLP without doing agent-y stuff that when we see GPT-N produce a solution to NLP, we should assume that it's doing agenty stuff on the inside... i.e. what probably happened is it evolved or stumbled upon something agenty, and then that agenty thing realized the situation it was in and started plotting a treacherous turn".