What are the reasons to *not* consider reducing AI-Xrisk the highest priority cause? 2019-08-20T21:45:12.118Z · score: 30 (8 votes)
Project Proposal: Considerations for trading off capabilities and safety impacts of AI research 2019-08-06T22:22:20.928Z · score: 34 (17 votes)
False assumptions and leaky abstractions in machine learning and AI safety 2019-06-28T04:54:47.119Z · score: 23 (6 votes)
Let's talk about "Convergent Rationality" 2019-06-12T21:53:35.356Z · score: 27 (8 votes)
X-risks are tragedies of the commons 2019-02-07T02:48:25.825Z · score: 9 (5 votes)
My use of the phrase "Super-Human Feedback" 2019-02-06T19:11:11.734Z · score: 13 (8 votes)
Thoughts on Ben Garfinkel's "How sure are we about this AI stuff?" 2019-02-06T19:09:20.809Z · score: 25 (12 votes)
The role of epistemic vs. aleatory uncertainty in quantifying AI-Xrisk 2019-01-31T06:13:35.321Z · score: 14 (8 votes)
Imitation learning considered unsafe? 2019-01-06T15:48:36.078Z · score: 10 (5 votes)
Conceptual Analysis for AI Alignment 2018-12-30T00:46:38.014Z · score: 26 (9 votes)
Disambiguating "alignment" and related notions 2018-06-05T15:35:15.091Z · score: 43 (13 votes)
Problems with learning values from observation 2016-09-21T00:40:49.102Z · score: 0 (7 votes)
Risks from Approximate Value Learning 2016-08-27T19:34:06.178Z · score: 1 (4 votes)
Inefficient Games 2016-08-23T17:47:02.882Z · score: 14 (15 votes)
Should we enable public binding precommitments? 2016-07-31T19:47:05.588Z · score: 0 (1 votes)
A Basic Problem of Ethics: Panpsychism? 2015-01-27T06:27:20.028Z · score: -4 (11 votes)
A Somewhat Vague Proposal for Grounding Ethics in Physics 2015-01-27T05:45:52.991Z · score: -3 (16 votes)


Comment by capybaralet on Distance Functions are Hard · 2019-09-14T19:15:45.876Z · score: 1 (1 votes) · LW · GW
I see. How about doing active learning of computable functions? That solves all 3 problems

^ I don't see how?

I should elaborate... it sounds like you're thinking of active learning (where the AI can choose to make queries for information, e.g. labels), but I'm talking about *inter*active training, where a human supervisor is *also* actively monitoring the AI system, making queries of it, and intelligently selecting feedback for the AI. This might be simulated as well, using multiple AIs, and there might be a lot of room for good work there... but I think if we want to solve alignment, we want a deep and satisfying understanding of AI systems, which seems hard to come by without rich feedback loops between humans and AIs. Basically, by interactive training, I have in mind something where training AIs looks more like teaching other humans.
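To make the contrast concrete, here is a minimal sketch of the *active learning* setting I'm distinguishing from interactive training. All class and function names here are illustrative (the toy `NearestCentroid` classifier is just there to make the sketch self-contained, not a real library API); the point is the narrow interface: the only human involvement is answering label queries.

```python
import numpy as np

class NearestCentroid:
    """Toy classifier so the sketch is self-contained (not a real library API)."""
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.centroids = np.array([X[y == c].mean(axis=0) for c in self.classes])
        return self

    def predict_proba(self, X):
        d = np.linalg.norm(X[:, None, :] - self.centroids[None, :, :], axis=2)
        w = 1.0 / (d + 1e-9)                      # closer centroid -> higher weight
        return w / w.sum(axis=1, keepdims=True)

    def predict(self, X):
        return self.classes[np.argmax(self.predict_proba(X), axis=1)]

def active_learning_loop(model, pool_X, oracle, seed_X, seed_y, n_queries=5):
    """The AI repeatedly picks its least-confident point and asks for a label.

    Contrast with interactive training: here no human monitors the learner,
    probes it with queries, or designs a curriculum -- the human only labels.
    """
    X, y = [np.asarray(x) for x in seed_X], list(seed_y)
    model.fit(np.array(X), np.array(y))
    for _ in range(n_queries):
        probs = model.predict_proba(pool_X)
        i = int(np.argmax(1.0 - probs.max(axis=1)))   # least-confident sampling
        X.append(pool_X[i])
        y.append(oracle(pool_X[i]))                   # the label query
        pool_X = np.delete(pool_X, i, axis=0)
        model.fit(np.array(X), np.array(y))
    return model
```

In interactive training, the `oracle` line would be replaced by a much richer channel: the supervisor inspects the model, asks *it* questions, and chooses what feedback to give next.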

So at the very least, a superintelligent self-supervised learning system trained on loads of human data would have a lot of conceptual building blocks (developed in order to make predictions about its training data) which could be tweaked and combined to make predictions about human values (analogous to fine-tuning in the context of transfer learning).

I think it's a very open question how well we can expect advanced AI systems to understand or mirror human concepts by default. Adversarial examples suggest we should be worried that apparently similar concepts will actually be wildly different in non-obvious ways. Still, I'm cautiously optimistic, since good concept-mirroring could make things a lot easier. It's also unclear ATM how precisely AI concepts need to track human concepts in order for things to work out OK. The "basin of attraction" line of thought suggests that they don't need to track them that well, because AI systems can self-correct or learn to defer to humans appropriately. My problem with that argument is that it seems like we will have so many chances to fuck up that we would need 1) AI systems to be extremely reliable, or 2) catastrophic mistakes to be rare, and minor mistakes to be transient or detectable. (2) seems plausible to me in many applications, but probably not all of the applications where people will want to use SOTA AI.

Re: gwern's article, RL does not seem to me like a good fit for most of the problems he describes. I agree active learning/interactive training protocols are powerful, but that's not the same as RL.

Yes ofc they are different.

I think the significant features of RL here are: 1) having the goal of understanding the world and how to influence it, and 2) doing (possibly implicit) planning. RL can also be pointed at narrow domains, but for a lot of problems, I think having general knowledge will be very valuable, and hard to replicate with a network of narrow systems.

I think the solution for autonomy is (1) solve calibration/distributional shift, so the system knows when it's safe to act autonomously (2) have the system adjust its own level of autonomy/need for clarification dynamically depending on the apparent urgency of its circumstances.
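A minimal sketch of what (2) might look like as a decision rule, assuming (1) is solved well enough to give calibrated confidence estimates. Everything here (the function name, the parameters, the linear urgency discount) is made up for illustration; it's not a solution to calibration itself, just the "adjust autonomy dynamically" layer on top.

```python
def act_or_defer(confidence, urgency, base_threshold=0.95, urgency_discount=0.2):
    """Act autonomously only when calibrated confidence clears a threshold.

    The threshold is lowered when circumstances are urgent, since waiting
    for human clarification has its own costs. `confidence` and `urgency`
    are both assumed to lie in [0, 1].
    """
    threshold = base_threshold - urgency_discount * urgency
    return "act" if confidence >= threshold else "ask_human"
```

The hard part, of course, is making `confidence` actually calibrated under distributional shift; with a miscalibrated estimate, this rule gives false reassurance.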

That seems great, but also likely to be very difficult, especially if we demand high reliability and performance.

Comment by capybaralet on Distance Functions are Hard · 2019-09-13T15:00:52.143Z · score: 1 (1 votes) · LW · GW

They're a pain because they involve a lot of human labor, slow down the experiment loop, make reproducing results harder, etc.

RE self-supervised learning: I don't see why we needed the rebranding (of unsupervised learning). I don't see why it would make alignment straightforward (ETA: except to the extent that you aren't necessarily, deliberately building something agenty). The boundaries between SSL and other ML are fuzzy; I don't think we'll get to AGI using just SSL and nothing like RL. SSL doesn't solve the exploration problem; if you start caring about exploration, I think you end up doing things that look more like RL.

I also tend to agree (e.g. with that gwern article) that AGI designs that aren't agenty are going to be at a significant competitive disadvantage, so probably aren't a satisfying solution to alignment, but could be a stop-gap.

Comment by capybaralet on Distance Functions are Hard · 2019-09-12T04:36:08.322Z · score: 1 (1 votes) · LW · GW

At the same time, the importance of having a good distance/divergence, the lack of appropriate ones, and the difficulty of learning them are widely acknowledged challenges in machine learning.

A distance function is fairly similar to a representation in my mind, and high-quality representation learning is considered a bit of a holy grail open problem.

Machine learning relies on formulating *some* sort of objective, which can be viewed as analogous to the choice of a good distance function, so I think the central point of the post (as I understood it from a quick glance) is correct: "specifying a good distance measure is not that much easier than specifying a good objective".

It's also an open question how much learning, (relatively) generic priors, and big data can actually solve the issue of weak learning signals and weak priors for us. A lot of people are betting pretty hard on that; I think it's plausible, but not very likely. I think it's more like a recipe for unaligned AI, and we need to get more bits of information about what we actually want into AI systems somehow. Highly interactive training protocols seem super valuable for that, but the ML community has a strong preference against such work because it is a massive pain compared to the non-interactive UL/SL/RL settings that are popular.

Comment by capybaralet on Two senses of “optimizer” · 2019-09-12T04:18:54.965Z · score: 1 (1 votes) · LW · GW

Yep. Good post. Important stuff. I think we're still struggling to understand all of this fully, and work on indifference seems like the most relevant stuff.

My current take is that as long as there is any "black-box" part of the algorithm which is optimizing for performance, then it may end up behaving like an optimizer_2, since the black box can pick up on arbitrary effective strategies.

(in partial RE to Rohin below): I wouldn't necessarily say that such an algorithm knows about its environment (i.e. has a good model), it may simply have stumbled upon an effective strategy for interacting with it (i.e. have a good policy).

Comment by capybaralet on Two senses of “optimizer” · 2019-09-12T04:12:54.291Z · score: 3 (2 votes) · LW · GW
It seems to me that the distinction is whether the optimizer has knowledge about the environment

Alternatively, you could say the distinction is whether the optimizer cares about the environment. I think there's a sense (or senses?) in which these things can be made/considered equivalent. I don't feel like I totally understand or am satisfied with either way of thinking about it, though.

Comment by capybaralet on The "Commitment Races" problem · 2019-09-12T04:00:49.620Z · score: 4 (2 votes) · LW · GW

I have another "objection", although it's not a very strong one, and more of just a comment.

One reason game theory reasoning doesn't work very well in predicting human behavior is because games are always embedded in a larger context, and this tends to wreck the game-theory analysis by bringing in reputation and collusion as major factors. This seems like something that would be true for AIs as well (e.g. "the code" might not tell the whole story; I/"the AI" can throw away my steering wheel but rely on an external steering-wheel-replacing buddy to jump in at the last minute if needed).

In apparent contrast to much of the rationalist community, I think by default one should probably view game theoretic analyses (and most models) as "just one more way of understanding the world" as opposed to "fundamental normative principles", and expect advanced AI systems to reason more heuristically (like humans).

But I understand and agree with the framing here as "this isn't definitely a problem, but it seems important enough to worry about".

Comment by capybaralet on What are the reasons to *not* consider reducing AI-Xrisk the highest priority cause? · 2019-09-07T05:45:43.734Z · score: 3 (2 votes) · LW · GW

Nice! owning up to it; I like it! :D

Comment by capybaralet on Eli's shortform feed · 2019-09-07T05:44:23.330Z · score: 1 (1 votes) · LW · GW

This reminded me of the argument that superintelligent agents will be very good at coordinating and will just divvy up the multiverse and be done with it.

It would be interesting to do an experimental study of how the intelligence profile of a population influences the level of cooperation between them.

Comment by capybaralet on jacobjacob's Shortform Feed · 2019-09-07T05:38:22.224Z · score: 1 (1 votes) · LW · GW

I had that idea!

Assurance contracts are going to turn us into a superorganism, for better or worse. You heard it here first.

Comment by capybaralet on What are the reasons to *not* consider reducing AI-Xrisk the highest priority cause? · 2019-09-06T19:30:31.486Z · score: 1 (1 votes) · LW · GW

How is that an answer? It seems like he's mostly contesting my premises "that AI-Xrisk is significant (e.g. roughly >10%), and timelines are not long (e.g. >50% ASI in <100 years)".

Comment by capybaralet on What are the reasons to *not* consider reducing AI-Xrisk the highest priority cause? · 2019-09-06T19:28:28.466Z · score: 3 (2 votes) · LW · GW

I'm often prioritizing posting over polishing posts, for better or worse.

I'm also sometimes somewhat deliberately underspecific in my statements because I think it can lead to more interesting / diverse / "outside-the-box" kinds of responses that I think are very valuable from an "idea/perspective generation/exposure" point-of-view (and that's something I find very valuable in general).

Comment by capybaralet on Response to Glen Weyl on Technocracy and the Rationalist Community · 2019-08-26T20:29:23.989Z · score: 1 (1 votes) · LW · GW

I've tweeted at them twice about this problem. Not sure how else to contact them to get it fixed :/

Comment by capybaralet on What are the reasons to *not* consider reducing AI-Xrisk the highest priority cause? · 2019-08-25T18:48:59.302Z · score: 4 (2 votes) · LW · GW

I have spent *some* time on it (on the order of 10-15hrs maybe? counting discussions, reading, etc.), and I have a vague intention to do so again, in the future. At the moment, though, I'm very focused on getting my PhD and trying to land a good professorship ~ASAP.

The genesis of this list is basically me repeatedly noticing that there are crucial considerations I'm ignoring (/more like procrastinating on :P) that I don't feel like I have a good justification for ignoring, and being bothered by that.

It seemed important enough to at least *flag* these things.

If you think most AI alignment researchers should have some level of familiarity with these topics, it seems like it would be valuable for someone to put together a summary for us. I might be interested in such a project at some point in the next few years.

Comment by capybaralet on What are the reasons to *not* consider reducing AI-Xrisk the highest priority cause? · 2019-08-23T14:54:43.331Z · score: 1 (1 votes) · LW · GW
And it seems like the best way to prevent that would be to build a superintelligent AI that would do a good job of maximizing the chances of generating infinite utility, in case that was possible.

I haven't thought about it enough to say... it certainly seems plausible, but it seems plausible that spending a good chunk of time thinking about it *might* lead to different conclusions. *shrug

Comment by capybaralet on What are the reasons to *not* consider reducing AI-Xrisk the highest priority cause? · 2019-08-23T05:21:31.768Z · score: 1 (1 votes) · LW · GW

Well... it's also pretty useful to individuals, IMO, since it affects what you tell other people, when discussing cause prioritization.

Comment by capybaralet on What are the reasons to *not* consider reducing AI-Xrisk the highest priority cause? · 2019-08-23T05:20:30.886Z · score: 1 (1 votes) · LW · GW
I think infinite ethics will most likely be solved in a way that leaves longtermism unharmed.

Yes, or it might just never be truly "solved". I agree that complexity theory seems fairly likely to hold (something like) a solution.

Do you have specific candidate solutions in mind?

Not really. I don't think about infinite ethics much, which is probably one of the reasons it seems likely to change my mind. I expect that if I spent more time thinking about it, I would just become increasingly convinced that it isn't worth thinking about.

But it definitely troubles me that I haven't taken the time to really understand it, since I feel like I am in a similar epistemic state to ML researchers who dismiss Xrisk concerns and won't take the time to engage with them.

I guess there's maybe a disanalogy there, though, in that it seems like people who *have* thought more about infinite ethics tend to not be going around trying to convince others that it really actually matters a lot and should change what they work on or which causes they prioritize.


I guess the main way I can imagine changing my views by studying infinite ethics would be to start believing that I should actually just aim to increase the chances of generating infinite utility (to the extent this is actually a mathematically coherent thing to try to do), which doesn't necessarily/obviously lead to prioritizing Xrisk, as far as I can see.

The possibility of such an update seems like it might make studying infinite ethics until I understand it better a higher priority than reducing AI-Xrisk.

Comment by capybaralet on On the purposes of decision theory research · 2019-08-22T18:47:11.644Z · score: 1 (1 votes) · LW · GW

What's the best description of what you mean by "metaphilosophy" you can point me to? I think I have a pretty good sense of it, but it seems worthwhile to be as rigorous / formal / descriptive / etc. as possible.

Comment by capybaralet on What are the reasons to *not* consider reducing AI-Xrisk the highest priority cause? · 2019-08-22T18:36:41.195Z · score: 1 (2 votes) · LW · GW

I agree there's substantial overlap, but there could be cases where "what's best for reducing Xrisk" and "what's best for reducing Srisk" really come apart. If I saw a clear-cut case for that, I'd be inclined to favor Srisk reduction (modulo, e.g., comparative advantage considerations).

Comment by capybaralet on What are the reasons to *not* consider reducing AI-Xrisk the highest priority cause? · 2019-08-22T18:34:39.132Z · score: 4 (2 votes) · LW · GW

I'm not aware of a satisfying resolution to the problems of infinite ethics. It calls into question the underlying assumptions of classical utilitarianism, which is my justification for prioritizing AI-Xrisk above all else. I can imagine ways of resolving infinite ethics that convince me of a different ethical viewpoint which in turn changes my cause prioritization.

Comment by capybaralet on What are the reasons to *not* consider reducing AI-Xrisk the highest priority cause? · 2019-08-22T18:31:19.850Z · score: 2 (2 votes) · LW · GW

I agree with the distinction you make and think it's nice to disentangle them. I'm most interested in the "Is AI x-risk the top priority for humanity?" question. I'm fine with all of the approaches to reducing AI-Xrisk being bundled here, because I'm just asking "is working on it (in *some* way) the highest priority?"

Comment by capybaralet on Coherence arguments do not imply goal-directed behavior · 2019-08-21T16:48:34.104Z · score: 1 (1 votes) · LW · GW

Yeah, it looks like maybe the same argument just expressed very differently? Like, I think the "coherence implies goal-directedness" argument basically goes through if you just consider computational complexity, but I'm still not sure if you agree? (maybe I'm being way too vague)

Or maybe I want a stronger conclusion? I'd like to say something like: REAL, GENERAL intelligence REQUIRES goal-directed behavior (given the physical limitations of the real world). It seems like maybe our disagreement (if there is one) is around how much departure from goal-directedness is feasible / desirable, and/or how much we expect such departures to affect performance (the trade-off also gets worse for more intelligent systems).

It seems likely the AI's beliefs would be logically coherent whenever the corresponding human beliefs are logically coherent. This seems quite different from arguing that the AI has a goal.

Yeah, it's definitely only an *analogy* (in my mind), but I find it pretty compelling *shrug.

Comment by capybaralet on AI Alignment Open Thread August 2019 · 2019-08-21T16:18:36.045Z · score: 1 (1 votes) · LW · GW
But really my answer is "there are lots of ways you can get confidence in a thing that are not proofs".

Totally agree; it's an under-appreciated point!

Here's my counter-argument: we have no idea what epistemological principles explain this empirical observation. Therefore, we don't actually know that the confidence we achieve in these ways is justified. So we may just be wrong to be confident in our ability to successfully board flights (etc.).

The epistemic/aleatory distinction is relevant here. Taking an expectation over both kinds of uncertainty, we can achieve a high level of subjective confidence in such things / via such means. However, we may be badly mistaken, and thus, objectively speaking, still extremely likely to be wrong.

This also probably explains a lot of the disagreement, since different people probably just have very different prior beliefs about how likely this kind of informal reasoning is to give us true beliefs about advanced AI systems.

I'm personally quite uncertain about that question ATM. I tend to think we can get pretty far with this kind of informal reasoning in the "early days" of (proto-)AGI development, but we become increasingly likely to fuck up as we start having to deal with vastly superhuman intelligences. I'd like to see more work in epistemology aimed at addressing this (and other Xrisk-relevant concerns, e.g.: what principles of "social epistemology" would allow the human community to effectively manage collective knowledge that is far beyond what any individual can grasp? I'd argue we're in the process of failing catastrophically at that).

Comment by capybaralet on Where are people thinking and talking about global coordination for AI safety? · 2019-08-21T16:02:56.381Z · score: 11 (4 votes) · LW · GW

RE the title, a quick list:

  • FHI (and associated orgs)
  • CSER
  • OpenAI
  • OpenPhil
  • FLI
  • FRI
  • GovAI
  • PAI

I think a lot of orgs that are more focused on social issues which can or do arise from present day AI / ADM (automated decision making) technology should be thinking more about global coordination, but seem focused on national (or subnational, or EU) level policy. It seems valuable to make the most compelling case for stronger international coordination efforts to these actors. Examples of this kind of org that I have in mind are AINow and Montreal AI ethics institute (MAIEI).

As mentioned in other comments, there are many private conversations among people concerned about AI-Xrisk, and (IMO, legitimate) info-hazards / unilateralist curse concerns loom large. It seems prudent to make progress on those meta-level issues (i.e. how to engage the public and policymakers on AI(-Xrisk) coordination efforts) as a community as quickly as possible, because:

  • Getting effective AI governance in place seems like it will be challenging and take a long time.
  • There are a rapidly growing number of organizations seeking to shape AI policy, who may have objectives that are counter-productive from the point of view of AI-Xrisk. And there may be a significant first-mover advantage (e.g. via setting important legal or cultural precedents, and framing the issue for the public and policymakers).
  • There is massive untapped potential for people who are not currently involved in reducing AI-Xrisk to contribute (consider the raw number of people who haven't been exposed to serious thought on the subject).
  • Info-hazard-y ideas are becoming public knowledge anyways, on the timescale of years. There may be a significant advantage to getting ahead of the "natural" diffusion of these memes and seeking to control the framing / narrative.

My answers to your 6 questions:

1. Hopefully the effect will be transient and minimal.

2. I strongly disagree. I think we (ultimately) need much better coordination.

3. Good question. As an incomplete answer, I think personal connections and trust play a significant (possibly indispensable) role.

4. I don't know. Speculating/musing/rambling: the kinds of coordination where IT has made a big difference (recently, i.e. starting with the internet) are primarily economic and consumer-facing. For international coordination, the stakes are higher; it's geopolitics, not economics; you need effective international institutions to provide enforcement mechanisms.

5. Yes, but this doesn't seem like a crucial consideration (for the most part). Do you have specific examples in mind?

6. Social science and economics seem really valuable to me. Game theory, mechanism design, behavioral game theory. I imagine there's probably a lot of really valuable stuff on how people/orgs make collective decisions that the stakeholders are satisfied with in some other fields as well (psychology? sociology? anthropology?). We need experts in these fields (esp, I think the softer fields are underrepresented) to inform the AI-Xrisk community about existing findings and create research agendas.

Comment by capybaralet on Coherence arguments do not imply goal-directed behavior · 2019-08-21T02:48:56.243Z · score: 1 (1 votes) · LW · GW

BoMAI is in this vein, as well ( )

Comment by capybaralet on [deleted post] 2019-08-20T21:50:02.653Z

I don't understand how this answers the question.

As a clarification, I'm considering the case where we consider the state space to be the set of all "possible" histories (including counter-logical ones), like the standard "general RL" (i.e. AIXI-style) set-up.

Comment by capybaralet on Coherence arguments do not imply goal-directed behavior · 2019-08-19T04:58:36.179Z · score: 3 (2 votes) · LW · GW

I don't know how Deep Blue worked. My impression is that it didn't use learning, so the answer would be no.

A starting point for Tom and Stuart's works:

Comment by capybaralet on Coherence arguments do not imply goal-directed behavior · 2019-08-17T05:39:51.778Z · score: 3 (2 votes) · LW · GW

Naturally (as an author on that paper), I agree to some extent with this argument.

I think it's worth pointing out one technical 'caveat': the agent should get utility 0 *on all future timesteps* as soon as it takes an action other than the one specified by the policy. We say the agent gets reward 1 "if and only if its history is an element of the set H", *not* iff "the policy would take action a given history h". Without this caveat, I think the agent might take other actions in order to capture more future utility (e.g. to avoid terminal states). [Side-note (SN): this relates to a question I asked ~10 days ago about whether decision theories and/or policies need to specify actions for impossible histories.]
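A toy version of the construction with the caveat applied might look like this. This is an illustrative sketch, not the paper's formal construction: `policy` is a hypothetical function mapping a history (tuple of actions) to the next action.

```python
def make_policy_following_reward(policy):
    """Reward functional for the coherence construction, with the caveat:
    reward 1 at each step iff the whole history so far is consistent with
    the given policy; once the agent deviates, reward is 0 on *all* future
    timesteps, so there is no incentive to deviate to capture future utility."""
    def reward(history):
        for t, action in enumerate(history):
            if action != policy(tuple(history[:t])):
                return 0  # deviated at step t: zero reward now and forever after
        return 1
    return reward
```

Without the "forever after" part (i.e. if reward only checked the most recent action against the policy), an agent could deviate, return to policy-consistent behavior, and still collect reward afterwards, which is the failure mode the caveat rules out.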

My main point, however, is that I think you could do some steelmanning here and recover most of the arguments you are criticizing (based on complexity arguments). TBC, I think the thesis (i.e. the title) is a correct and HIGHLY valuable point! But I think there are still good arguments for intelligence strongly suggesting some level of "goal-directed behavior". e.g. it's probably physically impossible to implement policies (over histories) that are effectively random, since they look like look-up tables that are larger than the physical universe. So when we build AIs, we are building things that aren't at that extreme end of the spectrum. Eliezer has a nice analogy in a comment on one of Paul's posts (I think), about an agent that behaves like it understands math, except that it thinks 2+2=5. You don't have to believe the extreme version of this view to believe that it's harder to build agents that aren't coherent *in a more intuitively meaningful sense* (i.e. closer to caring about states, which is (I think, e.g. see Hutter's work on state aggregation) equivalent to putting some sort of equivalence relation on histories).

I also want to mention Laurent Orseau's paper: "Agents and Devices: A Relative Definition of Agency", which can be viewed as attempting to distinguish "real" agents from things that merely satisfy coherence via the construction in our paper.

Comment by capybaralet on Coherence arguments do not imply goal-directed behavior · 2019-08-17T05:16:36.740Z · score: 1 (1 votes) · LW · GW

I strongly agree.

I should've been more clear.

I think this is a situation where our intuition is likely wrong.

This sort of thing is why I say "I'm not satisfied with my current understanding".

Comment by capybaralet on After critical event W happens, they still won't believe you · 2019-08-17T04:55:55.774Z · score: 1 (2 votes) · LW · GW

The two examples here seem to not have alarming/obvious enough Ws. It seems like you are arguing against a straw-man who makes bad predictions, based on something like a typical mind fallacy.

Comment by capybaralet on On the purposes of decision theory research · 2019-08-17T04:28:32.342Z · score: 4 (2 votes) · LW · GW
my concern that decision theory research (as done by humans in the foreseeable future) can't solve decision theory in a definitive enough way that would obviate the need to make sure that any potentially superintelligent AI can find and fix decision theoretic flaws in itself

So you're saying we need to solve decision theory at the meta-level, instead of the object-level. But can't we view any meta-level solution as also (trivially) an object level solution?

In other words, "[making] sure that any potentially superintelligent AI can find and fix decision theoretic flaws in itself" sounds like a special case of "[solving] decision theory in a definitive enough way".


I'll start by objecting to your (6): this seems like an important goal, to my mind. And if we *aren't* able to verify that an AI is free from decision-theoretic flaws, then how can we trust it to self-modify to be free of such flaws?

Your perspective still makes sense to me if you say: "this AI (call it ALICE) is exploitable, but it'll fix that within 100 years, so if it doesn't get exploited in the meanwhile, then we'll be OK". And OFC in principle, making an agent that will have no flaws within X years of when it is created is easier than the special case of X=0.

In reality, it seems plausible to me that we can build an agent like ALICE and have a decent chance that ALICE won't get exploited within 100 years.

But I still don't see why you dismiss the goal of (6); I don't think we have anything like definitive evidence that it is an (effectively) impossible goal.

Comment by capybaralet on Coherence arguments do not imply goal-directed behavior · 2019-08-17T03:59:56.165Z · score: 1 (1 votes) · LW · GW
we haven't seen any examples of them trying to e.g. kill other processes on your computer so they can have more computational resources and play a better game.

It's a good point, but... we won't see examples like this if the algorithms that produce this kind of behavior take longer to produce the behavior than the amount of time we've let them run.

I think there are good reasons to view the effective horizon of different agents as part of their utility function. Then I think a lot of the risk we incur is because humans act as if we have short effective horizons. But I don't think we *actually* do have such short horizons. In other words, our revealed preferences are more myopic than our considered preferences.

Now, one can say that this actually means we don't care that much about the long-term future, but I don't agree with that conclusion; I think we *do* care (at least, I do), but aren't very good at acting as if we(/I) do.

Anyways, if you buy this line of argument about effective horizons, then you should be worried that we will easily be outcompeted by some process/entity that behaves as if it has a much longer effective horizon, so long as it also finds a way to make a "positive-sum" trade with us (e.g. "I take everything after 2200 A.D., and in the meanwhile, I give you whatever you want").


I view the chess-playing algorithm as either *not* fully goal-directed, or somehow fundamentally limited in its understanding of the world, or level of rationality. Intuitively, it seems easy to make agents that are ignorant or indifferent(/"irrational") in such a way that they will only seek to optimize things within the ontology we've provided (in this case, of the chess game), instead of outside (i.e. seizing additional compute). However, our understanding of such things doesn't seem mature... at least I'm not satisfied with my current understanding. I think Stuart Armstrong and Tom Everitt are the main people who've done work in this area, and their work on this stuff seems quite underappreciated.

Comment by capybaralet on AI Alignment Open Thread August 2019 · 2019-08-15T04:33:02.136Z · score: 3 (2 votes) · LW · GW

Yeah, I think it totally does! (and that's a very interesting / "trippy" line of thought :D)

However, it does seem to me somewhat unlikely, since it does require fairly advanced intelligence, and I don't think evolution is likely to have produced such advanced intelligence with us being totally unaware, whereas I think something about the way we train AI is more strongly selecting for "savant-like" intelligence, which is sort of what I'm imagining here. I can't think of why I have that intuition OTTMH.

Comment by capybaralet on AI Alignment Open Thread August 2019 · 2019-08-15T04:29:40.756Z · score: 1 (1 votes) · LW · GW

So I don't take EY's post as about AI researchers' competence, as much as their incentives and levels of rationality and paranoia. It does include significant competitive pressures, which seems realistic to me.

I don't think I'm underestimating AI researchers, either, but for a different reason... let me elaborate a bit: I think there are waaaaaay too many skills for us to hope to have a reasonable sense of what an AI is actually good at. By skills I'm imagining something more like options, or having accurate generalized value functions (GVFs), than tasks.

Regarding long-term planning, I'd factor this into 2 components:

1) having a good planning algorithm

2) having a good world model

I think the way long-term planning works is that you do short-term planning in a good hierarchical world model. I think AIs will have vastly superhuman planning algorithms (arguably, they already do), so the real bottleneck is the world-model.

I don't think it's necessary to have a very "complete" world-model (i.e. enough knowledge to look smart to a person) in order to find "steganographic" long-term strategies like the ones I'm imagining.

I also don't think it's even necessary to have anything that looks very much like a world-model. The AI can just have a few good GVFs.... (i.e. be some sort of savant).

Comment by capybaralet on AI Alignment Open Thread August 2019 · 2019-08-15T04:13:08.445Z · score: 1 (1 votes) · LW · GW

I'm not sure I have much more than the standard MIRI-style arguments about convergent rationality and fragility of human values, at least nothing is jumping to mind ATM. I do think we probably disagree about how strong those arguments are. I'm actually more interested in hearing your take on those lines of argument than saying mine ATM :P

Comment by capybaralet on AI Alignment Open Thread August 2019 · 2019-08-15T04:07:30.055Z · score: 2 (2 votes) · LW · GW


Not a direct response: It's been argued (e.g. I think Paul said this in his 2nd 80k podcast interview?) that this isn't very realistic, because the low-hanging fruit (of easy to attack systems) is already being picked by slightly less advanced AI systems. This wouldn't apply if you're *already* in a discontinuous regime (but then it becomes circular).

Also not a direct response: It seems likely that some AIs will be much more/less cautious than humans, because they (e.g. implicitly) have very different discount rates. So AIs might take very risky gambles, which means both that we might get more sinister stumbles (good thing), but also that they might readily risk the earth (bad thing).

Comment by capybaralet on Project Proposal: Considerations for trading off capabilities and safety impacts of AI research · 2019-08-14T16:38:55.933Z · score: 7 (4 votes) · LW · GW

I do think this is an overly optimistic picture. The amount of traction an argument gets seems to be something like a product of how good the argument is, how credible those making the argument are, and how easy it is to process the argument.

Also, regarding this:

But the credibility system is good enough that the top credible people are really pretty smart, so to an extent can be swayed by good arguments presented well.

It's not just intelligence that determines if people will be swayed; I think other factors (like "rationality", "open-mindedness", and other personality traits) play a very big role.

Comment by capybaralet on AI Alignment Open Thread August 2019 · 2019-08-14T16:28:43.404Z · score: 1 (1 votes) · LW · GW

Oops, missed that, sry.

Comment by capybaralet on AI Alignment Open Thread August 2019 · 2019-08-09T16:15:41.276Z · score: 1 (1 votes) · LW · GW

I think a potentially more interesting question is not about running a single AI system, but rather the overall impact of AI technology (in a world where we don't have proofs of things like beneficence). It would be easier to hold the analogue of the empirical claim there.

Comment by capybaralet on AI Alignment Open Thread August 2019 · 2019-08-09T16:13:32.122Z · score: 1 (1 votes) · LW · GW

Yep, good catch ;)

I *do* put a non-trivial weight on models where the empirical claim is true, and not just out of epistemic humility. But overall, I'm epistemically humble enough these days to think it's not reasonable to say "nearly inevitable" if you integrate out epistemic uncertainty.

But maybe it's enough to have reasons for putting non-trivial weight on the empirical claim to be able to answer the other questions meaningfully?

Or are you just trying to see if anyone can defeat the epistemic humility "trump card"?

Comment by capybaralet on AI Alignment Open Thread August 2019 · 2019-08-09T16:00:03.586Z · score: 3 (3 votes) · LW · GW

Interesting. Your crux seems good; I think it's a crux for us. I expect things play out more like Eliezer predicts here:

I also predict that there will be types of failure we will not notice, or will misinterpret. It seems fairly likely to me that proto-AGI (i.e. AI that could autonomously learn to become AGI within <~10yrs of acting in the real world) is deployed and creates proto-AGI subagents, some of which we don't become aware of (e.g. because of accidental/incidental/deliberate steganography) and/or are unable to keep track of. And then those continue to survive and reproduce, etc... I guess this only seems plausible if the proto-AGI has a hospitable environment (like the internet, human brains/memes) and/or means of reproduction in the real world.

A very similar problem would be a form of longer-term "seeding", where an AI (at any stage) with a sufficiently advanced model of the world and long horizons discovers strategies for increasing the chances ("at the margin") that its values dominate in the long-term future. With my limited knowledge of physics, I imagine there might be ways of doing this just by beaming signals into space in a way calculated to influence/spur the development of life/culture in other parts of the galaxy.

I notice a lot of what I said above makes less sense if you think of AIs as having a similar skill profile to humans, but I think we agree that AIs might be much more advanced than people in some respects while still falling short of AGI because of weaknesses in other areas.

That observation also cuts against the argument you make about warning signs, I think, as it suggests that we might significantly underestimate an AI's (e.g. vastly superhuman) skill in some areas, if it still fails at some things we think are easy. To pull an example (not meant to be realistic) out of a hat: we might have AIs that can't carry on a conversation, but can implement a very sophisticated covert world domination strategy.

Comment by capybaralet on AI Alignment Open Thread August 2019 · 2019-08-09T04:14:07.799Z · score: 3 (2 votes) · LW · GW

Can you elaborate on what kinds of problems you expect to arise pre vs. post discontinuity?

E.g. will we see "sinister stumbles" (IIRC this was Adam Gleave's name for half-baked treacherous turns)? I think we will, FWIW.

Or do you think the discontinuity will be more in the realm of embedded agency style concerns (and how does this make it less safe, instead of just dysfunctional?)

How about mesa-optimization? (I think we already see qualitatively similar phenomena, but my idea of this doesn't emphasize the "optimization" part.)

Jessica's posts about MIRI vs. Paul's views made it seem like MIRI might be quite concerned about the first AGI arising via mesa-optimization. This seems likely to me, and would also be a case where I'd expect that, unless ML becomes "woke" to mesa-optimization (which seems likely to happen, and not too hard to make happen, to me), we'd see something that *looks* like a discontinuity, but is *actually* more like "the same reason".

Comment by capybaralet on AI Alignment Open Thread August 2019 · 2019-08-09T04:07:56.824Z · score: 1 (1 votes) · LW · GW
incentives to "beat the other team" don't seem nearly as strong

You mean it's stronger for nukes than for AI? I think I disagree, but it's a bit nuanced. It seems to me (as someone very ignorant about nukes) like with current nuclear tech you hit diminishing returns pretty fast, but I don't expect that to be the case for AI.

Also, I'm curious if weaponization of AI is a crux for us.

Comment by capybaralet on Project Proposal: Considerations for trading off capabilities and safety impacts of AI research · 2019-08-07T18:05:20.445Z · score: 2 (2 votes) · LW · GW

I meant it to be about all AI research. I don't usually make too much effort to distinguish ML and AI, TBH.

Comment by capybaralet on Project Proposal: Considerations for trading off capabilities and safety impacts of AI research · 2019-08-07T18:04:19.006Z · score: 5 (3 votes) · LW · GW

As someone with 5+ years of experience in the field, I think your impression of current ML is not very accurate. It's true that we haven't *solved* the problem of "online learning" (what you probably mean is something more like "continual learning" or "lifelong learning"), but a fair number of people are working on those problems (with a fairly incremental approach, granted). You can find several recent workshops on those topics, and work going back to the 90s at least.

It's also true that long-term planning, credit assignment, memory preservation, and other forms of "stability" appear to be a central challenge to making this stuff work. On the other hand, we don't know that humans are stable in the limit, just for ~100yrs, so there very well may be no non-heuristic solution to these problems.

Comment by capybaralet on AI Alignment Open Thread August 2019 · 2019-08-07T08:08:47.232Z · score: 11 (3 votes) · LW · GW

I don't think you need to posit a discontinuity to expect tests to occasionally fail.

I suspect the crux is more about how bad a single failure of a sufficiently advanced AI is likely to be.

I'll admit I don't feel like I really understand the perspective of people who seem to think we'll be able to learn how to do alignment via trial-and-error (i.e. tolerating multiple failures). Here are some guesses why people might hold that sort of view:

  • We'll develop AI in a well-designed box, so we can do a lot of debugging and stress testing.
    • counter-argument: but the concern is about what happens at deployment time
  • Then we'll deploy AI in a box, too.
    • counter: seems like that entails a massive performance hit (but it's not clear if that's actually the case)
  • We'll have other "AI police" to stop any "evil AIs" that "go rogue" (just like we have for people).
    • counter: where did the AI police come from, and why can't they go rogue as well?
  • The "AI police" can just be the rest of the AIs in the world ganging up on anyone who goes rogue.
    • counter: this seems to be assuming the "corrigibility as basin of attraction" argument (which has no real basis beyond intuition ATM, AFAIK) at the level of the population of agents.
  • A single failure isn't likely to be that bad, it would take a series of unlikely failures to take a safe (e.g. "satiable") AI and make it an insatiable "open ended optimizer AI".
    • counter: we can't assume that we can detect and correct failures, especially in real-world deployment scenarios where subagents might be created. So the failures may have time to compound. It also seems possible that a single failure is all that's needed; this seems like an open question

OK I could go on, but I'd rather actually hear from anyone who has this view! :)

Comment by capybaralet on AI Alignment Open Thread August 2019 · 2019-08-07T07:48:27.477Z · score: 5 (3 votes) · LW · GW

TBC, I think climate change is probably an even better analogy.

And I also like to talk about international regulation, in general, like with tax havens.

Comment by capybaralet on AI Alignment Open Thread August 2019 · 2019-08-07T07:46:52.010Z · score: 11 (3 votes) · LW · GW

For me it's because:

  • Nukes seem like an obvious Xrisk
  • People mostly seem to agree that we haven't done a good job coordinating around them
  • They seem a lot easier to coordinate around

Also, not a reason, but:

AI seems likely to be weaponized, and warfare (whether conventional or not) seems like one of the areas where we should be most worried about "unbridled competition" creating a race-to-the-bottom on safety.

Comment by capybaralet on AI Alignment Open Thread August 2019 · 2019-08-07T07:36:16.135Z · score: 1 (2 votes) · LW · GW

A slightly misspecified reward function can lead to anything from perfectly aligned behavior to catastrophic failure. So I think we need much stronger and more formal arguments to believe that catastrophe is almost inevitable than EY's genie post provides.

Comment by capybaralet on AI Alignment Open Thread August 2019 · 2019-08-07T07:28:32.536Z · score: 1 (1 votes) · LW · GW

I hold a nuanced view that I believe is more similar to the empirical claim than your views.

I think what we want is an extremely high level of justified confidence that any AI system or technology that is likely to become widely available is not carrying a significant and non-decreasing amount of Xrisk-per-second.
And it seems incredibly difficult and likely impossible to have such an extremely high level of justified confidence.

Formal verification and proof seem like the best we can do now, but I agree with you that we shouldn't rule out other approaches to achieving extreme levels of justified confidence. What it all points at to me is the need for more work on epistemology, so that we can begin to understand how extreme levels of confidence actually operate.

Comment by capybaralet on Project Proposal: Considerations for trading off capabilities and safety impacts of AI research · 2019-08-07T07:24:00.135Z · score: 1 (1 votes) · LW · GW

I don't follow. You seem to be responding to a statement that "I think the ML communities perceptions are important, because the ML community's attitude seems of critical importance for getting good Xrisk reduction policies in place", which I see as having little bearing on the question the post raises of "how should we assess info-hazard type risks of AI research we conduct?"