Can indifference methods redeem person-affecting views? 2019-11-12T04:23:10.011Z · score: 11 (4 votes)
What are the reasons to *not* consider reducing AI-Xrisk the highest priority cause? 2019-08-20T21:45:12.118Z · score: 30 (8 votes)
Project Proposal: Considerations for trading off capabilities and safety impacts of AI research 2019-08-06T22:22:20.928Z · score: 34 (17 votes)
False assumptions and leaky abstractions in machine learning and AI safety 2019-06-28T04:54:47.119Z · score: 23 (6 votes)
Let's talk about "Convergent Rationality" 2019-06-12T21:53:35.356Z · score: 27 (8 votes)
X-risks are a tragedies of the commons 2019-02-07T02:48:25.825Z · score: 9 (5 votes)
My use of the phrase "Super-Human Feedback" 2019-02-06T19:11:11.734Z · score: 13 (8 votes)
Thoughts on Ben Garfinkel's "How sure are we about this AI stuff?" 2019-02-06T19:09:20.809Z · score: 25 (12 votes)
The role of epistemic vs. aleatory uncertainty in quantifying AI-Xrisk 2019-01-31T06:13:35.321Z · score: 14 (8 votes)
Imitation learning considered unsafe? 2019-01-06T15:48:36.078Z · score: 10 (5 votes)
Conceptual Analysis for AI Alignment 2018-12-30T00:46:38.014Z · score: 26 (9 votes)
Disambiguating "alignment" and related notions 2018-06-05T15:35:15.091Z · score: 43 (13 votes)
Problems with learning values from observation 2016-09-21T00:40:49.102Z · score: 0 (7 votes)
Risks from Approximate Value Learning 2016-08-27T19:34:06.178Z · score: 1 (4 votes)
Inefficient Games 2016-08-23T17:47:02.882Z · score: 14 (15 votes)
Should we enable public binding precommitments? 2016-07-31T19:47:05.588Z · score: 0 (1 votes)
A Basic Problem of Ethics: Panpsychism? 2015-01-27T06:27:20.028Z · score: -4 (11 votes)
A Somewhat Vague Proposal for Grounding Ethics in Physics 2015-01-27T05:45:52.991Z · score: -3 (16 votes)


Comment by capybaralet on Can indifference methods redeem person-affecting views? · 2019-11-17T01:28:59.231Z · score: 1 (1 votes) · LW · GW

Can you give a concrete example for why the utility function should change?

Comment by capybaralet on AI Safety "Success Stories" · 2019-10-24T00:01:41.942Z · score: 1 (1 votes) · LW · GW

I couldn't say without knowing more what "human safety" means here.

But here's what I imagine an example pivotal command looking like: "Give me the ability to shut-down unsafe AI projects for the foreseeable future. Do this while minimizing disruption to the current world order / status quo. Interpret all of this in the way I intend."

Comment by capybaralet on AI Safety "Success Stories" · 2019-10-21T19:43:17.770Z · score: 1 (1 votes) · LW · GW

OK, I think that makes some sense.

I don't know how I'd fill out the row, since I don't understand what is covered by the phrase "human safety", or what assumptions are being made about the proliferation of the technology, or more specifically, the characteristics of the humans who do possess the tech.

I think I was imagining that the pivotal tool AI is developed by highly competent and safety-conscious humans who use it to perform a pivotal act (or series of pivotal acts) that effectively precludes the kind of issues mentioned in Wei's quote there.

Comment by capybaralet on TAISU 2019 Field Report · 2019-10-16T06:34:38.468Z · score: 3 (2 votes) · LW · GW
Linda organized it as two 2 day unconferences held back-to-back

Can you explain how that is different from a 4-day unconference, more concretely?

Comment by capybaralet on TAISU 2019 Field Report · 2019-10-16T06:33:51.818Z · score: 7 (5 votes) · LW · GW
I think the workshop would be a valuable use of three days for anyone actively working in AI safety, even if they consider themselves "senior" in the field: it offered a valuable space for reconsidering basic assumptions and rediscovering the reasons why we're doing what we're doing.

This read to me as a remarkably strong claim; I assumed you meant something slightly weaker. But then I realized you said "valuable" which might mean "not considering opportunity cost". Can you clarify that?

And if you do mean "considering opportunity cost", I think it would be worth giving your ~strongest argument(s) for it!

For context, I am a PhD candidate in ML working on safety, and I am interested in such events, but unsure if they would be a valuable use of my time, and OTTMH would expect most of the value to be in terms of helping others rather than benefitting my own understanding/research/career/ability-to-contribute (I realize this sounds a bit conceited, and I didn't try to avoid that except via this caveat, and I really do mean (just) OTTMH... I think the reality is a bit more that I'm mostly estimating value based on heuristics). If I had been in the UK when they happened, I would probably have attended at least one.

But I think I am a bit unusual in my level of enthusiasm. And FWICT, such initiatives are not receiving much resources (including money and involvement of senior safety researchers) and potentially should receive A LOT more (e.g. 1-2 orders of magnitude). So the case for them being valuable (in general or for more senior/experienced researchers) is an important one!

Comment by capybaralet on AI Safety "Success Stories" · 2019-10-14T19:48:00.035Z · score: 2 (2 votes) · LW · GW

Does an "AI safety success story" encapsulate just a certain trajectory in AI (safety) development?

Or does it also include a story about how AI is deployed (and by who, etc.)?

I like this post a lot, but I think it ends up being a bit unclear because I don't think everyone has the same use cases in mind for the different technologies underlying these scenarios, and/or I don't think everyone agrees with the way in which safety research is viewed as contributing to success in these different scenarios... Maybe fleshing out the success stories, or referencing some more in-depth elaborations of them would make this clearer?

Comment by capybaralet on AI Safety "Success Stories" · 2019-10-14T19:43:16.838Z · score: 1 (1 votes) · LW · GW

I'm going to dispute a few cells in your grid.

  • I think pivotal tool story has low reliance on human safety (although I'm confused by that row in general).
  • Whether sovereigns would require restricted access is unclear. This is basically the question of whether single-agent, single-user alignment will likely produce a solution to multi-agent, multi-user alignment (in a timely manner).
  • ETA: the "interim quality of life improver" seems to roughly be talking about episodic RL, which I would classify as "medium" autonomy.

Comment by capybaralet on AI Safety "Success Stories" · 2019-10-14T19:38:37.175Z · score: 3 (2 votes) · LW · GW

I don't understand what you mean by "Reliance on human safety". Can you clarify/elaborate? Is this like... relying on humans' (meta-)philosophical competence? Relying on not having bad actors? etc...

Comment by capybaralet on AI Safety "Success Stories" · 2019-10-14T19:37:07.599Z · score: 1 (1 votes) · LW · GW

While that's true to some extent, a lot of research does seem to be motivated much more by some of these scenarios. For example, work on safe oracle designs seems primarily motivated by the pivotal tool success story.

Comment by capybaralet on Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More · 2019-10-11T06:38:35.437Z · score: 1 (1 votes) · LW · GW

No idea why this is heavily downvoted; strong upvoted to compensate.

I'd say he's discouraging everyone from working on the problems, or at least from considering such work to be important, urgent, high status, etc.

Comment by capybaralet on Review: Selfish Reasons to Have More Kids · 2019-10-03T06:04:28.903Z · score: 4 (1 votes) · LW · GW
It seems to me that the elephant in the room here are the peer effects.

By regressing on household identity, you capture how parents' efforts to control kids' peer groups influence outcomes. This was discussed (briefly) in the book, towards the beginning (~pages 30-40?).

Comment by capybaralet on Just Imitate Humans? · 2019-09-19T10:30:50.864Z · score: 2 (2 votes) · LW · GW
A Bayesian predictor of the human's behavior will consider the hypothesis Hg that the human does the sort of planning described above in the service of goal g. It will have a corresponding hypothesis for each such goal g. It seems to me, though, that these hypotheses will be immediately eliminated. The human's observed behavior won't include taking over the world or any other existentially dangerous behavior, as would have been implied by hypotheses of the form Hg.

This is a very good argument, and I'm still trying to decide how decisive I think it is.

In the meanwhile, I'll mention that I'm imagining the learner as something closer to a DNN than a Bayesian predictor. One image of how DNN learning often proceeds is as a series of "aha" moments (generating/revising highly general explanations of the data) interspersed/intermingled with something more like memorization of data-points that don't fit the current general explanations. That view makes it seem plausible that "planning" would emerge as an "aha" moment before being refined as "oh wait, bounded planning... with these heuristics... and these restrictions...", creating a dangerous window of time between "I'm doing planning" and "I'm planning like a human, warts and all".
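For reference, the quoted Bayesian-elimination argument can be sketched numerically. This is only an illustration under made-up assumptions: the two hypothesis names, the prior, and the per-step likelihoods are all hypothetical, and the sketch says nothing about whether a DNN-like learner would update this way.

```python
def posterior(prior, likelihoods):
    """One step of Bayes' rule over a finite set of hypotheses."""
    unnormalized = {h: prior[h] * likelihoods[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}


def update_on_benign_steps(prior, p_benign, n_steps):
    """Update repeatedly on observing ordinary (non-dangerous) behavior."""
    beliefs = dict(prior)
    for _ in range(n_steps):
        beliefs = posterior(beliefs, p_benign)
    return beliefs


# Hypothetical hypotheses: "bounded" is human-like bounded planning,
# "H_g" is unbounded planning in the service of some goal g.
prior = {"bounded": 0.5, "H_g": 0.5}

# Assumed probability, under each hypothesis, that a given step of observed
# behavior looks ordinary rather than existentially dangerous.
p_benign = {"bounded": 0.99, "H_g": 0.5}

beliefs = update_on_benign_steps(prior, p_benign, n_steps=50)
print(beliefs)
```

With these numbers the weight on H_g shrinks geometrically with each benign observation, which is the sense in which such hypotheses get "eliminated" for an idealized Bayesian predictor.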

Comment by capybaralet on Just Imitate Humans? · 2019-09-18T04:00:30.538Z · score: 4 (2 votes) · LW · GW

RE: "Imitation learning considered unsafe?" (I'm the author):

The post can basically be read as arguing that human imitation seems especially likely to produce mesa-optimization.

I agree with your response; this is also why I said: "Mistakes in imitating the human may be relatively harmless; the approximation may be good enough".

I don't agree with your characterization, however. The concern is not that it would have roughly human-like planning, but rather super-human planning (since this is presumably simpler according to most reasonable priors).

Comment by capybaralet on Distance Functions are Hard · 2019-09-14T19:15:45.876Z · score: 1 (1 votes) · LW · GW
I see. How about doing active learning of computable functions? That solves all 3 problems

^ I don't see how?

I should elaborate... it sounds like you're thinking of active learning (where the AI can choose to make queries for information, e.g. labels), but I'm talking about *inter*active training, where a human supervisor is *also* actively monitoring the AI system, making queries of it, and intelligently selecting feedback for the AI. This might be simulated as well, using multiple AIs, and there might be a lot of room for good work there... but I think if we want to solve alignment, we want a deep and satisfying understanding of AI systems, which seems hard to come by without rich feedback loops between humans and AIs. Basically, by interactive training, I have in mind something where training AIs looks more like teaching other humans.

So at the very least, a superintelligent self-supervised learning system trained on loads of human data would have a lot of conceptual building blocks (developed in order to make predictions about its training data) which could be tweaked and combined to make predictions about human values (analogous to fine-tuning in the context of transfer learning).

I think it's a very open question how well we can expect advanced AI systems to understand or mirror human concepts by default. Adversarial examples suggest we should be worried that apparently similar concepts will actually be wildly different in non-obvious ways. I'm cautiously optimistic, since this could make things a lot easier. It's also unclear ATM how precisely AI concepts need to track human concepts in order for things to work out OK. The "basin of attraction" line of thought suggests that they don't need to be that great, because they can self-correct or learn to defer to humans appropriately. My problem with that argument is that it seems like we will have so many chances to fuck up that we would need 1) AI systems to be extremely reliable, or 2) for catastrophic mistakes to be rare, and minor mistakes to be transient or detectable. (2) seems plausible to me in many applications, but probably not all of the applications where people will want to use SOTA AI.

Re: gwern's article, RL does not seem to me like a good fit for most of the problems he describes. I agree active learning/interactive training protocols are powerful, but that's not the same as RL.

Yes ofc they are different.

I think the significant features of RL here are: 1) having the goal of understanding the world and how to influence it, and 2) doing (possibly implicit) planning. RL can also be pointed at narrow domains, but for a lot of problems, I think having general knowledge will be very valuable, and hard to replicate with a network of narrow systems.

I think the solution for autonomy is (1) solve calibration/distributional shift, so the system knows when it's safe to act autonomously (2) have the system adjust its own level of autonomy/need for clarification dynamically depending on the apparent urgency of its circumstances.

That seems great, but also likely to be very difficult, especially if we demand high reliability and performance.

Comment by capybaralet on Distance Functions are Hard · 2019-09-13T15:00:52.143Z · score: 1 (1 votes) · LW · GW

They're a pain because they involve a lot of human labor, slow down the experiment loop, make reproducing results harder, etc.

RE self-supervised learning: I don't see why we needed the rebranding (of unsupervised learning). I don't see why it would make alignment straightforward (ETA: except to the extent that you aren't necessarily, deliberately building something agenty). The boundaries between SSL and other ML are fuzzy; I don't think we'll get to AGI using just SSL and nothing like RL. SSL doesn't solve the exploration problem; if you start caring about exploration, I think you end up doing things that look more like RL.

I also tend to agree (e.g. with that gwern article) that AGI designs that aren't agenty are going to be at a significant competitive disadvantage, so probably aren't a satisfying solution to alignment, but could be a stop-gap.

Comment by capybaralet on Distance Functions are Hard · 2019-09-12T04:36:08.322Z · score: 1 (1 votes) · LW · GW

At the same time, the importance of having a good distance/divergence, the lack of appropriate ones, and the difficulty of learning them are widely acknowledged challenges in machine learning.

A distance function is fairly similar to a representation in my mind, and high-quality representation learning is considered a bit of a holy grail open problem.

Machine learning relies on formulating *some* sort of objective, which can be viewed as analogous to the choice of a good distance function, so I think the central point of the post (as I understood it from a quick glance) is correct: "specifying a good distance measure is not that much easier than specifying a good objective".

It's also an open question how much learning, (relatively) generic priors, and big data can actually solve the issue of weak learning signals and weak priors for us. A lot of people are betting pretty hard on that; I think it's plausible, but not very likely. I think it's more like a recipe for unaligned AI, and we need to get more bits of information about what we actually want into AI systems somehow. Highly interactive training protocols seem super valuable for that, but the ML community has a strong preference against such work because it is a massive pain compared to the non-interactive UL/SL/RL settings that are popular.

Comment by capybaralet on Two senses of “optimizer” · 2019-09-12T04:18:54.965Z · score: 1 (1 votes) · LW · GW

Yep. Good post. Important stuff. I think we're still struggling to understand all of this fully, and work on indifference seems like the most relevant stuff.

My current take is that as long as there is any "black-box" part of the algorithm which is optimizing for performance, then it may end up behaving like an optimizer_2, since the black box can pick up on arbitrary effective strategies.

(in partial RE to Rohin below): I wouldn't necessarily say that such an algorithm knows about its environment (i.e. has a good model), it may simply have stumbled upon an effective strategy for interacting with it (i.e. have a good policy).

Comment by capybaralet on Two senses of “optimizer” · 2019-09-12T04:12:54.291Z · score: 3 (2 votes) · LW · GW
It seems to me that the distinction is whether the optimizer has knowledge about the environment

Alternatively, you could say the distinction is whether the optimizer cares about the environment. I think there's a sense (or senses?) in which these things can be made/considered equivalent. I don't feel like I totally understand or am satisfied with either way of thinking about it, though.

Comment by capybaralet on The "Commitment Races" problem · 2019-09-12T04:00:49.620Z · score: 4 (2 votes) · LW · GW

I have another "objection", although it's not a very strong one, and more of just a comment.

One reason game theory reasoning doesn't work very well in predicting human behavior is because games are always embedded in a larger context, and this tends to wreck the game-theory analysis by bringing in reputation and collusion as major factors. This seems like something that would be true for AIs as well (e.g. "the code" might not tell the whole story; I/"the AI" can throw away my steering wheel but rely on an external steering-wheel-replacing buddy to jump in at the last minute if needed).

In apparent contrast to much of the rationalist community, I think by default one should probably view game theoretic analyses (and most models) as "just one more way of understanding the world" as opposed to "fundamental normative principles", and expect advanced AI systems to reason more heuristically (like humans).

But I understand and agree with the framing here as "this isn't definitely a problem, but it seems important enough to worry about".

Comment by capybaralet on What are the reasons to *not* consider reducing AI-Xrisk the highest priority cause? · 2019-09-07T05:45:43.734Z · score: 3 (2 votes) · LW · GW

Nice! owning up to it; I like it! :D

Comment by capybaralet on Eli's shortform feed · 2019-09-07T05:44:23.330Z · score: 1 (1 votes) · LW · GW

This reminded me of the argument that superintelligent agents will be very good at coordinating and just divvy up the multiverse and be done with it.

It would be interesting to do an experimental study of how the intelligence profile of a population influences the level of cooperation between them.

Comment by capybaralet on jacobjacob's Shortform Feed · 2019-09-07T05:38:22.224Z · score: 1 (1 votes) · LW · GW

I had that idea!

Assurance contracts are going to turn us into a superorganism, for better or worse. You heard it here first.

Comment by capybaralet on What are the reasons to *not* consider reducing AI-Xrisk the highest priority cause? · 2019-09-06T19:30:31.486Z · score: 1 (1 votes) · LW · GW

How is that an answer? It seems like he's mostly contesting my premises "that AI-Xrisk is significant (e.g. roughly >10%), and timelines are not long (e.g. >50% ASI in <100years)"

Comment by capybaralet on What are the reasons to *not* consider reducing AI-Xrisk the highest priority cause? · 2019-09-06T19:28:28.466Z · score: 3 (2 votes) · LW · GW

I'm often prioritizing posting over polishing posts, for better or worse.

I'm also sometimes somewhat deliberately underspecific in my statements because I think it can lead to more interesting / diverse / "outside-the-box" kinds of responses that I think are very valuable from an "idea/perspective generation/exposure" point-of-view (and that's something I find very valuable in general).

Comment by capybaralet on Response to Glen Weyl on Technocracy and the Rationalist Community · 2019-08-26T20:29:23.989Z · score: 1 (1 votes) · LW · GW

I've tweeted at them twice about this problem. Not sure how else to contact them to get it fixed :/

Comment by capybaralet on What are the reasons to *not* consider reducing AI-Xrisk the highest priority cause? · 2019-08-25T18:48:59.302Z · score: 4 (2 votes) · LW · GW

I have spent *some* time on it (on the order of 10-15hrs maybe? counting discussions, reading, etc.), and I have a vague intention to do so again, in the future. At the moment, though, I'm very focused on getting my PhD and trying to land a good professorship ~ASAP.

The genesis of this list is basically me repeatedly noticing that there are crucial considerations I'm ignoring (/more like procrastinating on :P) that I don't feel like I have a good justification for ignoring, and being bothered by that.

It seemed important enough to at least *flag* these things.

If you think most AI alignment researchers should have some level of familiarity with these topics, it seems like it would be valuable for someone to put together a summary for us. I might be interested in such a project at some point in the next few years.

Comment by capybaralet on What are the reasons to *not* consider reducing AI-Xrisk the highest priority cause? · 2019-08-23T14:54:43.331Z · score: 1 (1 votes) · LW · GW
And it seems like the best way to prevent that would be to build a superintelligent AI that would do a good job of maximizing the chances of generating infinite utility, in case that was possible.

I haven't thought about it enough to say... it certainly seems plausible, but it seems plausible that spending a good chunk of time thinking about it *might* lead to different conclusions. *shrug*

Comment by capybaralet on What are the reasons to *not* consider reducing AI-Xrisk the highest priority cause? · 2019-08-23T05:21:31.768Z · score: 1 (1 votes) · LW · GW

Well... it's also pretty useful to individuals, IMO, since it affects what you tell other people, when discussing cause prioritization.

Comment by capybaralet on What are the reasons to *not* consider reducing AI-Xrisk the highest priority cause? · 2019-08-23T05:20:30.886Z · score: 1 (1 votes) · LW · GW
I think infinite ethics will most likely be solved in a way that leaves longtermism unharmed.

Yes, or it might just never be truly "solved". I agree that complexity theory seems fairly likely to hold (something like) a solution.

Do you have specific candidate solutions in mind?

Not really. I don't think about infinite ethics much, which is probably one of the reasons it seems likely to change my mind. I expect that if I spent more time thinking about it, I would just become increasingly convinced that it isn't worth thinking about.

But it definitely troubles me that I haven't taken the time to really understand it, since I feel like I am in a similar epistemic state to ML researchers who dismiss Xrisk concerns and won't take the time to engage with them.

I guess there's maybe a disanalogy there, though, in that it seems like people who *have* thought more about infinite ethics tend to not be going around trying to convince others that it really actually matters a lot and should change what they work on or which causes they prioritize.


I guess the main way I can imagine changing my views by studying infinite ethics would be to start believing that I should actually just aim to increase the chances of generating infinite utility (to the extent this is actually a mathematically coherent thing to try to do), which doesn't necessarily/obviously lead to prioritizing Xrisk, as far as I can see.

The possibility of such an update seems like it might make studying infinite ethics until I understand it better a higher priority than reducing AI-Xrisk.

Comment by capybaralet on On the purposes of decision theory research · 2019-08-22T18:47:11.644Z · score: 1 (1 votes) · LW · GW

What's the best description of what you mean by "metaphilosophy" you can point me to? I think I have a pretty good sense of it, but it seems worthwhile to be as rigorous / formal / descriptive / etc. as possible.

Comment by capybaralet on What are the reasons to *not* consider reducing AI-Xrisk the highest priority cause? · 2019-08-22T18:36:41.195Z · score: 1 (2 votes) · LW · GW

I agree there's substantial overlap, but there could be cases where "what's best for reducing Xrisk" and "what's best for reducing Srisk" really come apart. If I saw a clear-cut case for that, I'd be inclined to favor Srisk reduction (modulo, e.g., comparative advantage considerations).

Comment by capybaralet on What are the reasons to *not* consider reducing AI-Xrisk the highest priority cause? · 2019-08-22T18:34:39.132Z · score: 4 (2 votes) · LW · GW

I'm not aware of a satisfying resolution to the problems of infinite ethics. It calls into question the underlying assumptions of classical utilitarianism, which is my justification for prioritizing AI-Xrisk above all else. I can imagine ways of resolving infinite ethics that convince me of a different ethical viewpoint which in turn changes my cause prioritization.

Comment by capybaralet on What are the reasons to *not* consider reducing AI-Xrisk the highest priority cause? · 2019-08-22T18:31:19.850Z · score: 2 (2 votes) · LW · GW

I agree with the distinction you make and think it's nice to disentangle them. I'm most interested in the "Is AI x-risk the top priority for humanity?" question. I'm fine with all of the approaches to reducing AI-Xrisk being bundled here, because I'm just asking "is working on it (in *some* way) the highest priority".

Comment by capybaralet on Coherence arguments do not imply goal-directed behavior · 2019-08-21T16:48:34.104Z · score: 1 (1 votes) · LW · GW

Yeah, it looks like maybe the same argument just expressed very differently? Like, I think the "coherence implies goal-directedness" argument basically goes through if you just consider computational complexity, but I'm still not sure if you agree? (maybe I'm being way too vague)

Or maybe I want a stronger conclusion? I'd like to say something like "REAL, GENERAL intelligence" REQUIRES goal-directed behavior (given the physical limitations of the real world). It seems like maybe our disagreement (if there is one) is around how much departure from goal-directed-ness is feasible / desirable and/or how much we expect such departures to affect performance (the trade-off also gets worse for more intelligent systems).

It seems likely the AI's beliefs would be logically coherent whenever the corresponding human beliefs are logically coherent. This seems quite different from arguing that the AI has a goal.

Yeah, it's definitely only an *analogy* (in my mind), but I find it pretty compelling *shrug*.

Comment by capybaralet on AI Alignment Open Thread August 2019 · 2019-08-21T16:18:36.045Z · score: 1 (1 votes) · LW · GW
But really my answer is "there are lots of ways you can get confidence in a thing that are not proofs".

Totally agree; it's an under-appreciated point!

Here's my counter-argument: we have no idea what epistemological principles explain this empirical observation. Therefore we don't actually know that the confidence we achieve in these ways is justified. So we may just be wrong to be confident in our ability to successfully board flights (etc.)

The epistemic/aleatory distinction is relevant here. Taking an expectation over both kinds of uncertainty, we can achieve a high level of subjective confidence in such things / via such means. However, we may be badly mistaken, and thus still extremely likely objectively speaking to be wrong.
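A toy calculation of how the two kinds of uncertainty combine. All numbers and names below are hypothetical, chosen only to illustrate the structure: epistemic uncertainty is a credence over which "world" we are in (is informal reasoning reliable here or not?), and aleatory uncertainty is the chance of success within a given world.

```python
# Epistemic uncertainty: our credence over which world we are actually in.
# (Hypothetical numbers.)
credence = {"reliable": 0.95, "unreliable": 0.05}

# Aleatory uncertainty: chance that things go well, given each world.
# (Hypothetical numbers.)
p_success = {"reliable": 0.99, "unreliable": 0.01}


def subjective_confidence(credence, p_success):
    """Expected success probability, averaging over both kinds of uncertainty."""
    return sum(credence[w] * p_success[w] for w in credence)


confidence = subjective_confidence(credence, p_success)

# If the "unreliable" world is the real one, the objective chance of being
# wrong is high no matter how confident the marginal estimate looked.
objective_failure_if_unreliable = 1.0 - p_success["unreliable"]

print(f"subjective confidence: {confidence:.3f}")
print(f"objective failure chance if unreliable: {objective_failure_if_unreliable:.2f}")
```

The marginal confidence is dominated by the credence placed on the favorable world, so people with different priors over worlds will report very different confidence from the same informal evidence.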

This also probably explains a lot of the disagreement, since different people probably just have very different prior beliefs about how likely this kind of informal reasoning is to give us true beliefs about advanced AI systems.

I'm personally quite uncertain about that question, ATM. I tend to think we can get pretty far with this kind of informal reasoning in the "early days" of (proto-)AGI development, but we become increasingly likely to fuck up as we start having to deal with vastly super-human intelligences. And I would like to see more work in epistemology aimed at addressing this (and other Xrisk-relevant concerns, e.g. what principles of "social epistemology" would allow the human community to effectively manage collective knowledge that is far beyond what any individual can grasp? I'd argue we're in the process of failing catastrophically at that).

Comment by capybaralet on Where are people thinking and talking about global coordination for AI safety? · 2019-08-21T16:02:56.381Z · score: 12 (5 votes) · LW · GW

RE the title, a quick list:

  • FHI (and associated orgs)
  • CSER
  • OpenAI
  • OpenPhil
  • FLI
  • FRI
  • GovAI
  • PAI

I think a lot of orgs that are more focused on social issues which can or do arise from present day AI / ADM (automated decision making) technology should be thinking more about global coordination, but seem focused on national (or subnational, or EU) level policy. It seems valuable to make the most compelling case for stronger international coordination efforts to these actors. Examples of this kind of org that I have in mind are AINow and Montreal AI ethics institute (MAIEI).

As mentioned in other comments, there are many private conversations among people concerned about AI-Xrisk, and (IMO, legitimate) info-hazards / unilateralist curse concerns loom large. It seems prudent to make progress on those meta-level issues (i.e. how to engage the public and policymakers on AI(-Xrisk) coordination efforts) as a community as quickly as possible, because:

  • Getting effective AI governance in place seems like it will be challenging and take a long time.
  • There are a rapidly growing number of organizations seeking to shape AI policy, who may have objectives that are counter-productive from the point of view of AI-Xrisk. And there may be a significant first-mover advantage (e.g. via setting important legal or cultural precedents, and framing the issue for the public and policymakers).
  • There is massive untapped potential for people who are not currently involved in reducing AI-Xrisk to contribute (consider the raw number of people who haven't been exposed to serious thought on the subject).
  • Info-hazard-y ideas are becoming public knowledge anyways, on the timescale of years. There may be a significant advantage to getting ahead of the "natural" diffusion of these memes and seeking to control the framing / narrative.

My answers to your 6 questions:

1. Hopefully the effect will be transient and minimal.

2. I strongly disagree. I think we (ultimately) need much better coordination.

3. Good question. As an incomplete answer, I think personal connections and trust play a significant (possibly indispensable) role.

4. I don't know. Speculating/musing/rambling: the kinds of coordination where IT has made a big difference (recently, i.e. starting with the internet) are primarily economic and consumer-facing. For international coordination, the stakes are higher; it's geopolitics, not economics; you need effective international institutions to provide enforcement mechanisms.

5. Yes, but this doesn't seem like a crucial consideration (for the most part). Do you have specific examples in mind?

6. Social science and economics seem really valuable to me. Game theory, mechanism design, behavioral game theory. I imagine there's probably a lot of really valuable stuff on how people/orgs make collective decisions that the stakeholders are satisfied with in some other fields as well (psychology? sociology? anthropology?). We need experts in these fields (esp. the softer fields, which I think are underrepresented) to inform the AI-Xrisk community about existing findings and create research agendas.

Comment by capybaralet on Coherence arguments do not imply goal-directed behavior · 2019-08-21T02:48:56.243Z · score: 1 (1 votes) · LW · GW

BoMAI is in this vein, as well.

Comment by capybaralet on [deleted post] 2019-08-20T21:50:02.653Z

I don't understand how this answers the question.

As a clarification, I'm considering the case where we consider the state space to be the set of all "possible" histories (including counter-logical ones), like the standard "general RL" (i.e. AIXI-style) set-up.

Comment by capybaralet on Coherence arguments do not imply goal-directed behavior · 2019-08-19T04:58:36.179Z · score: 3 (2 votes) · LW · GW

I don't know how Deep Blue worked. My impression is that it didn't use learning, so the answer would be no.

A starting point for Tom and Stuart's works:

Comment by capybaralet on Coherence arguments do not imply goal-directed behavior · 2019-08-17T05:39:51.778Z · score: 3 (2 votes) · LW · GW

Naturally (as an author on that paper), I agree to some extent with this argument.

I think it's worth pointing out one technical 'caveat': the agent should get utility 0 *on all future timesteps* as soon as it takes an action other than the one specified by the policy. We say the agent gets reward 1: "if and only if its history is an element of the set H", *not* iff "the policy would take action a given history h". Without this caveat, I think the agent might take other actions in order to capture more future utility (e.g. to avoid terminal states). [Side-note (SN): this relates to a question I asked ~10 days ago about whether decision theories and/or policies need to specify actions for impossible histories.]
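
A minimal sketch of the caveat above, under simplifying assumptions that are not from the paper: histories are lists of (observation, action) pairs, the policy is deterministic, and the names `in_H`, `history_rewards` and `per_step_rewards` are made up for illustration. It contrasts the history-based reward (1 iff the whole history so far is in H) with the naive per-step reward (1 iff the latest action matches the policy).

```python
from typing import Callable, List, Tuple

Step = Tuple[str, str]                       # (observation, action)
Policy = Callable[[List[Step], str], str]    # (past steps, current observation) -> action


def in_H(history: List[Step], policy: Policy) -> bool:
    """True iff every action in the history is the one the policy prescribes."""
    return all(policy(history[:t], obs) == act
               for t, (obs, act) in enumerate(history))


def history_rewards(history: List[Step], policy: Policy) -> List[int]:
    """Reward 1 at step t iff the prefix up to and including step t is in H."""
    return [int(in_H(history[:t + 1], policy)) for t in range(len(history))]


def per_step_rewards(history: List[Step], policy: Policy) -> List[int]:
    """Naive variant: reward 1 iff the latest action matches the policy."""
    return [int(policy(history[:t], obs) == act)
            for t, (obs, act) in enumerate(history)]


def always_left(past: List[Step], obs: str) -> str:
    """Hypothetical policy that ignores its inputs and always goes left."""
    return "left"


# The agent deviates once, at the second step, then follows the policy again.
deviating = [("o1", "left"), ("o2", "right"), ("o3", "left"), ("o4", "left")]

print(history_rewards(deviating, always_left))   # [1, 0, 0, 0]
print(per_step_rewards(deviating, always_left))  # [1, 0, 1, 1]
```

Under the history-based reward a single deviation forfeits everything afterwards, so deviating can never buy more future reward. Under the per-step variant the agent keeps collecting reward after deviating, which is what leaves room for deviations that pay off later (e.g. avoiding a terminal state).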

My main point, however, is that I think you could do some steelmanning here and recover most of the arguments you are criticizing (based on complexity arguments). TBC, I think the thesis (i.e. the title) is a correct and HIGHLY valuable point! But I think there are still good arguments for intelligence strongly suggesting some level of "goal-directed behavior". e.g. it's probably physically impossible to implement policies (over histories) that are effectively random, since they look like look-up tables that are larger than the physical universe. So when we build AIs, we are building things that aren't at that extreme end of the spectrum. Eliezer has a nice analogy in a comment on one of Paul's posts (I think), about an agent that behaves like it understands math, except that it thinks 2+2=5. You don't have to believe the extreme version of this view to believe that it's harder to build agents that aren't coherent *in a more intuitively meaningful sense* (i.e. closer to caring about states, which is (I think, e.g. see Hutter's work on state aggregation) equivalent to putting some sort of equivalence relation on histories).
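
To give a rough sense of the look-up table point, here is a back-of-the-envelope sketch. The numbers are assumptions chosen for illustration: 4 actions, 256 possible observations per step, and the commonly cited order-of-magnitude estimate of 10^80 atoms in the observable universe. It counts how many timesteps it takes before the number of distinct histories (one table row each) exceeds that budget.

```python
def steps_until_table_exceeds(num_actions: int, num_observations: int, budget: int) -> int:
    """Smallest T such that (num_actions * num_observations) ** T > budget.

    Each timestep multiplies the number of distinct histories by
    num_actions * num_observations; a policy with no structure needs
    one table row per history.
    """
    per_step = num_actions * num_observations
    if per_step < 2:
        raise ValueError("need at least two (action, observation) pairs per step")
    t, histories = 0, 1
    while histories <= budget:
        t += 1
        histories *= per_step
    return t


ATOMS_IN_UNIVERSE = 10 ** 80  # rough, commonly cited estimate (assumption)

print(steps_until_table_exceeds(4, 256, ATOMS_IN_UNIVERSE))  # 27
```

So even with this tiny interface, a fully arbitrary policy over histories runs out of universe after a few dozen steps; anything we can actually build has to compress, i.e. treat many histories alike.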

I also want to mention Laurent Orseau's paper: "Agents and Devices: A Relative Definition of Agency", which can be viewed as attempting to distinguish "real" agents from things that merely satisfy coherence via the construction in our paper.

Comment by capybaralet on Coherence arguments do not imply goal-directed behavior · 2019-08-17T05:16:36.740Z · score: 1 (1 votes) · LW · GW

I strongly agree.

I should've been more clear.

I think this is a situation where our intuition is likely wrong.

This sort of thing is why I say "I'm not satisfied with my current understanding".

Comment by capybaralet on After critical event W happens, they still won't believe you · 2019-08-17T04:55:55.774Z · score: 1 (2 votes) · LW · GW

The two examples here seem to not have alarming/obvious enough Ws. It seems like you are arguing against a straw-man who makes bad predictions, based on something like a typical mind fallacy.

Comment by capybaralet on On the purposes of decision theory research · 2019-08-17T04:28:32.342Z · score: 4 (2 votes) · LW · GW
my concern that decision theory research (as done by humans in the foreseeable future) can't solve decision theory in a definitive enough way that would obviate the need to make sure that any potentially superintelligent AI can find and fix decision theoretic flaws in itself

So you're saying we need to solve decision theory at the meta-level, instead of the object-level. But can't we view any meta-level solution as also (trivially) an object level solution?

In other words, "[making] sure that any potentially superintelligent AI can find and fix decision theoretic flaws in itself" sounds like a special case of "[solving] decision theory in a definitive enough way".


I'm starting with the objective of objecting to your (6): this seems like an important goal, in my mind. And if we *aren't* able to verify that an AI is free from decision theoretic flaws, then how can we trust it to self-modify to be free of such flaws?

Your perspective still makes sense to me if you say: "this AI (call it ALICE) is exploitable, but it'll fix that within 100 years, so if it doesn't get exploited in the meanwhile, then we'll be OK". And OFC in principle, making an agent that will have no flaws within X years of when it is created is easier than the special case of X=0.

In reality, it seems plausible to me that we can build an agent like ALICE and have a decent chance that ALICE won't get exploited within 100 years.

But I still don't see why you dismiss the goal of (6); I don't think we have anything like definitive evidence that it is an (effectively) impossible goal.

Comment by capybaralet on Coherence arguments do not imply goal-directed behavior · 2019-08-17T03:59:56.165Z · score: 1 (1 votes) · LW · GW
we haven't seen any examples of them trying to e.g. kill other processes on your computer so they can have more computational resources and play a better game.

It's a good point, but... we won't see examples like this if the algorithms that produce this kind of behavior take longer to produce the behavior than the amount of time we've let them run.

I think there are good reasons to view the effective horizon of different agents as part of their utility function. Then I think a lot of the risk we incur is because humans act as if we have short effective horizons. But I don't think we *actually* do have such short horizons. In other words, our revealed preferences are more myopic than our considered preferences.

Now, one can say that this actually means we don't care that much about the long-term future, but I don't agree with that conclusion; I think we *do* care (at least, I do), but aren't very good at acting as if we(/I) do.

Anyways, if you buy this line of argument about effective horizons, then you should be worried that we will easily be outcompeted by some process/entity that behaves as if it has a much longer effective horizon, so long as it also finds a way to make a "positive-sum" trade with us (e.g. "I take everything after 2200 A.D., and in the meanwhile, I give you whatever you want").
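
One way to make the trade above concrete, as a toy model only: assume each party values a constant yearly payoff with geometric discounting (the discount factors 0.9 and 0.999 are arbitrary illustrative choices, and `share_after` is a made-up helper). The share of total discounted value that lies at or after year T is then gamma^T, since sum_{t>=T} gamma^t / sum_{t>=0} gamma^t = gamma^T.

```python
def share_after(gamma: float, cutoff: int) -> float:
    """Fraction of total discounted value at or after `cutoff` steps.

    Assumes a constant per-step payoff and geometric discounting with
    0 < gamma < 1, for which the ratio of the two geometric series
    simplifies to gamma ** cutoff.
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must be strictly between 0 and 1")
    return gamma ** cutoff


YEARS_UNTIL_2200 = 2200 - 2019  # measured from when the comment was written

for name, gamma in [("myopic (revealed) preferences", 0.9),
                    ("long-horizon agent", 0.999)]:
    share = share_after(gamma, YEARS_UNTIL_2200)
    print(f"{name}: gamma={gamma}, share of value after 2200 = {share:.3g}")
```

In this toy model the myopic party assigns almost none of its value to the post-2200 future, so giving it up looks nearly free, while the long-horizon party sees most of the value there. That asymmetry is what makes the trade look "positive-sum" to both sides.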


I view the chess-playing algorithm as either *not* fully goal directed, or somehow fundamentally limited in its understanding of the world, or level of rationality. Intuitively, it seems easy to make agents that are ignorant or indifferent(/"irrational") in such a way that they will only seek to optimize things within the ontology we've provided (in this case, of the chess game), instead of outside (i.e. seizing additional compute). However, our understanding of such things doesn't seem mature... at least I'm not satisfied with my current understanding. I think Stuart Armstrong and Tom Everitt are the main people who've done work in this area, and their work on this stuff seems quite underappreciated.

Comment by capybaralet on AI Alignment Open Thread August 2019 · 2019-08-15T04:33:02.136Z · score: 3 (2 votes) · LW · GW

Yeah, I think it totally does! (and that's a very interesting / "trippy" line of thought :D)

However, it does seem to me somewhat unlikely, since it does require fairly advanced intelligence, and I don't think evolution is likely to have produced such advanced intelligence with us being totally unaware, whereas I think something about the way we train AI is more strongly selecting for "savant-like" intelligence, which is sort of what I'm imagining here. I can't think of why I have that intuition OTTMH.

Comment by capybaralet on AI Alignment Open Thread August 2019 · 2019-08-15T04:29:40.756Z · score: 1 (1 votes) · LW · GW

So I don't take EY's post as about AI researchers' competence, as much as their incentives and levels of rationality and paranoia. It does include significant competitive pressures, which seems realistic to me.

I don't think I'm underestimating AI researchers, either, but for a different reason... let me elaborate a bit: I think there are waaaaaay too many skills for us to hope to have a reasonable sense of what an AI is actually good at. By skills I'm imagining something more like options, or having accurate generalized value functions (GVFs), than tasks.

Regarding long-term planning, I'd factor this into 2 components:

1) having a good planning algorithm

2) having a good world model

I think the way long-term planning works is that you do short-term planning in a good hierarchical world model. I think AIs will have vastly superhuman planning algorithms (arguably, they already do), so the real bottleneck is the world-model.

I don't think it's necessary to have a very "complete" world-model (i.e. enough knowledge to look smart to a person) in order to find "steganographic" long-term strategies like the ones I'm imagining.

I also don't think it's even necessary to have anything that looks very much like a world-model. The AI can just have a few good GVFs.... (i.e. be some sort of savant).

Comment by capybaralet on AI Alignment Open Thread August 2019 · 2019-08-15T04:13:08.445Z · score: 1 (1 votes) · LW · GW

I'm not sure I have much more than the standard MIRI-style arguments about convergent rationality and fragility of human values, at least nothing is jumping to mind ATM. I do think we probably disagree about how strong those arguments are. I'm actually more interested in hearing your take on those lines of argument than saying mine ATM :P

Comment by capybaralet on AI Alignment Open Thread August 2019 · 2019-08-15T04:07:30.055Z · score: 2 (2 votes) · LW · GW


Not a direct response: It's been argued (e.g. I think Paul said this in his 2nd 80k podcast interview?) that this isn't very realistic, because the low-hanging fruit (of easy to attack systems) is already being picked by slightly less advanced AI systems. This wouldn't apply if you're *already* in a discontinuous regime (but then it becomes circular).

Also not a direct response: It seems likely that some AIs will be much more/less cautious than humans, because they (e.g. implicitly) have very different discount rates. So AIs might take very risky gambles, which means both that we might get more sinister stumbles (good thing), but also that they might readily risk the earth (bad thing).

Comment by capybaralet on Project Proposal: Considerations for trading off capabilities and safety impacts of AI research · 2019-08-14T16:38:55.933Z · score: 7 (4 votes) · LW · GW

I do think this is an overly optimistic picture. The amount of traction an argument gets seems to be something like a product of how good the argument is, how credible those making the argument are, and how easy it is to process the argument.

Also, regarding this:

But the credibility system is good enough that the top credible people are really pretty smart, so to an extent can be swayed by good arguments presented well.

It's not just intelligence that determines if people will be swayed; I think other factors (like "rationality", "open-mindedness", and other personality factors) play a very big role.

Comment by capybaralet on AI Alignment Open Thread August 2019 · 2019-08-14T16:28:43.404Z · score: 1 (1 votes) · LW · GW

Oops, missed that, sry.