Comment by capybaralet on X-risks are a tragedies of the commons · 2019-02-14T04:56:34.573Z · score: 1 (1 votes) · LW · GW

I'm not sure. I was trying to disagree with your top level comment :P

Comment by capybaralet on How much can value learning be disentangled? · 2019-02-11T22:53:59.141Z · score: 1 (1 votes) · LW · GW

FWICT, both of your points are actually responses to my point (3).

RE "re: #2", see: https://en.wikipedia.org/wiki/Value_of_information#Characteristics

RE "re: #3", my point was that it doesn't seem like VoI is the correct way for one agent to think about informing ANOTHER agent. You could just look at the change in expected utility for the receiver after updating on some information, but I don't like that way of defining it.

Comment by capybaralet on X-risks are a tragedies of the commons · 2019-02-11T22:43:34.891Z · score: 1 (1 votes) · LW · GW

I think it is rivalrous.

Xrisk mitigation isn't the resource; risky behavior is the resource. If you engage in more risky behavior, then I can't engage in as much risky behavior without pushing us over into a socially unacceptable level of total risky behavior.

Comment by capybaralet on X-risks are a tragedies of the commons · 2019-02-11T22:38:38.288Z · score: 4 (2 votes) · LW · GW

If there is a cost to reducing Xrisk (which I think is a reasonable assumption), then there will be an incentive to defect, i.e. to underinvest in reducing Xrisk. There's still *some* incentive to prevent Xrisk, but to some people everyone dying is not much worse than just them dying.
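To make the incentive to defect concrete, here is a toy public-goods sketch. Everything in it is an assumption for illustration (the function names, the numbers, and the simplification that each actor only values their own survival): paying `cost` lowers total risk by `risk_reduction` for everyone, so the private gain can be negative even when the summed gain across all actors is positive.

```python
def private_gain(value_of_own_survival, risk_reduction, cost):
    """Net gain to a single actor who pays the full cost of mitigation.

    Hypothetical model: the actor only counts their own survival.
    """
    return value_of_own_survival * risk_reduction - cost


def social_gain(n_actors, value_of_own_survival, risk_reduction, cost):
    """Net gain summed over all actors when one actor pays the cost."""
    return n_actors * value_of_own_survival * risk_reduction - cost


# Illustrative numbers only: 100 actors, a 1% risk reduction, cost 0.05.
print(private_gain(1.0, 0.01, 0.05) < 0)        # investing does not pay privately
print(social_gain(100, 1.0, 0.01, 0.05) > 0)    # but it pays for the group
```

In this toy setup, underinvestment shows up whenever the cost sits between the private benefit and the group benefit.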

Comment by capybaralet on The role of epistemic vs. aleatory uncertainty in quantifying AI-Xrisk · 2019-02-11T22:24:45.850Z · score: 1 (1 votes) · LW · GW

1) Yep, independence.

2) Seems right as well.

3) I think it's important to consider "risk per second", because

(i) I think many AI systems could eventually become dangerous, just not on reasonable time-scales.

(ii) I think we might want to run AI systems which have the potential to become dangerous for limited periods of time.

(iii) If most of the risk is far in the future, we can hope to become more prepared in the meantime.
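A rough sketch of what "risk per second" buys you, assuming a constant per-step risk and independence across steps (both are simplifying assumptions, and the function names are made up for this example): cumulative risk compounds as 1 - (1 - p)^n, so a tiny per-step risk can be acceptable for a limited run and still approach certainty over a long enough one.

```python
import math


def cumulative_risk(risk_per_step, steps):
    """Probability of at least one catastrophe in `steps` independent steps."""
    return 1.0 - (1.0 - risk_per_step) ** steps


def steps_to_reach(risk_per_step, threshold):
    """Smallest number of steps at which cumulative risk reaches `threshold`.

    Assumes 0 < risk_per_step < 1 and 0 < threshold < 1.
    """
    return math.ceil(math.log(1.0 - threshold) / math.log(1.0 - risk_per_step))


# Illustrative numbers only: a 1% per-step risk, run for 10 steps vs. 1000 steps.
print(cumulative_risk(0.01, 10))
print(cumulative_risk(0.01, 1000))
print(steps_to_reach(0.01, 0.5))
```

This is the sense in which running a potentially dangerous system for a limited period can be reasonable: the budget is over the number of steps, not just the per-step risk.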

Comment by capybaralet on Predictors as Agents · 2019-02-10T22:39:31.662Z · score: 1 (1 votes) · LW · GW

Whether or not this happens depends on the learning algorithm. Let's assume an IID setting. Then an algorithm that evaluates many random parameter settings and chooses the one that gives the best performance would have this effect. But a gradient-based learning algorithm wouldn't necessarily, since it only aims to improve its predictions locally (so what you say in the ETA is more accurate, **in this case**, I think).

Also, I just wanted to mention that Stuart Armstrong's paper "Good and safe uses of AI oracles" discusses self-fulfilling prophecies as well; Stuart provides a way of training a predictor that won't fall victim to such effects (just don't reveal its predictions when training). But then it also fails to account for the effect its predictions actually have, which can be a source of irreducible error... The example is a (future) stock-price predictor: making its predictions public makes them self-refuting to some extent, as they influence market actors' decisions.

Comment by capybaralet on X-risks are a tragedies of the commons · 2019-02-07T19:42:31.506Z · score: 4 (2 votes) · LW · GW

I dunno... I think describing them as a tragedy of the commons can help people understand why the problems are challenging and deserving of attention.

## X-risks are a tragedies of the commons

2019-02-07T02:48:25.825Z · score: 9 (5 votes)
Comment by capybaralet on Thoughts on Ben Garfinkel's "How sure are we about this AI stuff?" · 2019-02-07T02:10:29.396Z · score: 22 (6 votes) · LW · GW

RE Sarah: Longer timelines don't change the picture that much, in my mind. I don't find this article to be addressing the core concerns. Can you recommend one that's more focused on "why AI-Xrisk isn't the most important thing in the world"?

RE Robin Hanson: I don't really know much of what he thinks, but IIRC his "urgency of AI depends on FOOM" was not compelling.

What I've noticed is that critics are often working from very different starting points, e.g. being unwilling to estimate probabilities of future events, using common-sense rather than consequentialist ethics, etc.

## My use of the phrase "Super-Human Feedback"

2019-02-06T19:11:11.734Z · score: 12 (7 votes)

## Thoughts on Ben Garfinkel's "How sure are we about this AI stuff?"

2019-02-06T19:09:20.809Z · score: 25 (12 votes)
Comment by capybaralet on How much can value learning be disentangled? · 2019-01-31T19:58:48.431Z · score: 1 (1 votes) · LW · GW

IMO, VoI is also not a sufficient criterion for defining manipulation... I'll list a few problems I have with it, OTTMH:

1) It seems to reduce it to "providing misinformation, or providing information to another agent that is not maximally/sufficiently useful for them (in terms of their expected utility)". An example (due to Mati Roy) of why this doesn't seem to match our intuition is: what if I tell someone something true and informative that serves (only) to make them sadder? That doesn't really seem like manipulation (although you could make a case for it).

2) I don't like the "maximally/sufficiently" part; maybe my intuition is misleading, but manipulation seems like a qualitative thing to me. Maybe we should just constrain VoI to be positive?

3) Actually, it seems weird to talk about VoI here; VoI is prospective and subjective... it treats an agent's beliefs as real and asks how much value they should expect to get from samples or perfect knowledge, assuming these samples or the ground truth would be distributed according to their beliefs; this makes VoI always non-negative. But when we're considering whether to inform an agent of something, we might recognize that certain information we'd provide would actually be net negative (see my top level comment for an example). Not sure what to make of that ATM...
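For concreteness, here is a minimal sketch of why VoI can't be negative when it is computed under the agent's own beliefs, using the expected value of perfect information. The two-state example, the utilities, and the function names are all made up for illustration. The agent compares "pick the best action now" against "learn the state, then pick the best action for it"; the second is an average of maxima, which can never be lower than the maximum of averages.

```python
def expected_utility(action, beliefs, utility):
    """Expected utility of `action` under the agent's own beliefs."""
    return sum(p * utility[action][state] for state, p in beliefs.items())


def evpi(beliefs, utility):
    """Expected value of perfect information, computed under the agent's beliefs."""
    best_without_info = max(expected_utility(a, beliefs, utility) for a in utility)
    best_with_info = sum(
        p * max(utility[a][state] for a in utility) for state, p in beliefs.items()
    )
    return best_with_info - best_without_info


# Hypothetical toy problem: beliefs over the weather, two actions.
beliefs = {"rain": 0.3, "sun": 0.7}
utility = {
    "umbrella": {"rain": 1.0, "sun": 0.2},
    "no_umbrella": {"rain": -1.0, "sun": 1.0},
}

print(evpi(beliefs, utility) >= 0)
```

Note that this only says the information looks valuable from the receiver's point of view; it says nothing about what an informer with better knowledge expects to happen.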

Comment by capybaralet on The role of epistemic vs. aleatory uncertainty in quantifying AI-Xrisk · 2019-01-31T19:46:50.359Z · score: 1 (1 votes) · LW · GW

Agree, good point. I'd say aleatoric risk is necessary to produce compounding, but not sufficient, but maybe I'm still looking at this the wrong way.

Comment by capybaralet on How much can value learning be disentangled? · 2019-01-31T19:41:05.045Z · score: 3 (2 votes) · LW · GW

Haha no not at all ;)

I'm not actually trying to recruit people to work on that, just trying to make people aware of the idea of doing such projects. I'd suggest it to pretty much anyone who wants to work on AI-Xrisk without diving deep into math or ML.

## The role of epistemic vs. aleatory uncertainty in quantifying AI-Xrisk

2019-01-31T06:13:35.321Z · score: 14 (8 votes)
Comment by capybaralet on How much can value learning be disentangled? · 2019-01-31T04:58:48.703Z · score: 3 (2 votes) · LW · GW

So I want to emphasize that I'm only saying it's *plausible* that *there exists* a specification of "manipulation". This is my default position on all human concepts. I also think it's plausible that there does not exist such a specification, or that the specification is too complex to grok, or that there end up being multiple conflicting notions we conflate under the heading of "manipulation". See this post for more.

Overall, I understand and appreciate the issues you're raising, but I think all this post does is show that naive attempts to specify "manipulation" fail; I think it's quite difficult to argue compellingly that no such specification exists ;)

"It seems that the only difference between manipulation and explanation is whether we end up with a better understanding of the situation at the end. And measuring understanding is very subtle."

^ Actually, I think "ending up with a better understanding" (in the sense I'm reading it) is probably not sufficient to rule out manipulation; what I mean is that I can do something which actually improves your model of the world, but leads you to follow a policy with worse expected returns. A simple example would be if you are doing Bayesian updating and your prior over returns for two bandit arms is P(r|a_1) = N(1,1), P(r|a_2) = N(2, 1), while the true returns are 1/2 and 2/3 (respectively). So your current estimates are optimistic, but they are ordered correctly, and so induce the optimal (greedy) policy.

Now if I give you a bunch of observations of a_2, I will be giving you true information, that will lead you to learn, correctly and with high confidence, that the expected reward for a_2 is ~2/3, improving your model of the world. But since you haven't updated your estimate for a_1, you will now prefer a_1 to a_2 (if acting greedily), which is suboptimal. So overall I've informed you with true information, but disadvantaged you nonetheless. I'd argue that if I did this intentionally, it should count as a form of manipulation.
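Here is a minimal simulation of the bandit example, as a sketch rather than anything definitive. Assumptions that are mine and not part of the example: rewards are drawn with unit-variance Gaussian noise around the true means, the learner uses the standard conjugate Gaussian update for an unknown mean, and I show 1000 observations of a_2. The names (`posterior_mean`, `greedy_arm`) are made up for this sketch.

```python
import random


def posterior_mean(prior_mean, prior_var, observations, noise_var=1.0):
    """Posterior mean for a Gaussian mean with a Gaussian prior and known noise."""
    precision = 1.0 / prior_var + len(observations) / noise_var
    weighted_sum = prior_mean / prior_var + sum(observations) / noise_var
    return weighted_sum / precision


def greedy_arm(estimates):
    """Arm with the highest estimated return."""
    return max(estimates, key=estimates.get)


TRUE_MEANS = {"a_1": 1 / 2, "a_2": 2 / 3}
PRIOR_MEANS = {"a_1": 1.0, "a_2": 2.0}

rng = random.Random(0)
# True, informative observations, but only of arm a_2.
obs_a2 = [rng.gauss(TRUE_MEANS["a_2"], 1.0) for _ in range(1000)]

before = dict(PRIOR_MEANS)
after = {
    "a_1": posterior_mean(PRIOR_MEANS["a_1"], 1.0, []),      # no data: stays at the prior
    "a_2": posterior_mean(PRIOR_MEANS["a_2"], 1.0, obs_a2),  # moves toward 2/3
}

print("greedy choice before:", greedy_arm(before))
print("greedy choice after: ", greedy_arm(after))
```

The estimate for a_2 gets more accurate while the estimate for a_1 stays optimistic, so the greedy choice flips from the better arm to the worse one.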

Comment by capybaralet on Imitation learning considered unsafe? · 2019-01-09T18:27:26.934Z · score: 1 (1 votes) · LW · GW

I don't think I'd put it that way (although I'm not saying it's inaccurate). See my comments RE "safety via myopia" and "inner optimizers".

Comment by capybaralet on Imitation learning considered unsafe? · 2019-01-09T18:22:42.679Z · score: 6 (3 votes) · LW · GW

Yes, maybe? Elaborating...

I'm not sure how well this fits into the category of "inner optimizers"; I'm still organizing my thoughts on that (aiming to finish doing so within the week...). I'm also not sure that people are thinking about inner optimizers in the right way.

Also, note that the thing being imitated doesn't have to be a human.

OTTMH, I'd say:

• This seems more general in the sense that it isn't some "subprocess" of the whole system that becomes a dangerous planning process.
• This seems more specific in the sense that the boldest argument for inner optimizers is, I think, that they should appear in effectively any optimization problem when there's enough optimization pressure.

Comment by capybaralet on Imitation learning considered unsafe? · 2019-01-07T15:55:46.050Z · score: 4 (2 votes) · LW · GW

See the clarifying note in the OP. I don't think this is about imitating humans, per se.

The more general framing I'd use is WRT "safety via myopia" (something I've been working on in the past year). There is an intuition that supervised learning (e.g. via SGD as is common practice in current ML) is quite safe, because it doesn't have any built-in incentive to influence the world (resulting in instrumental goals); it just seeks to yield good performance on the training data, learning in a myopic sense to improve its performance on the present input. I think this intuition has some validity, but also might lead to a false sense of confidence that such systems are safe, when in fact they may end up behaving as if they *do* seek to influence the world, depending on the task they are trained on (ETA: and other details of the learning algorithm, e.g. outer-loop optimization and model choice).

Comment by capybaralet on Assuming we've solved X, could we do Y... · 2019-01-07T15:39:48.509Z · score: 1 (1 votes) · LW · GW

Aha, OK. So I either misunderstand or disagree with that.

I think SHF schemes (at least most examples) have the human as "CEO" with AIs as "advisers", and thus the human can choose to ignore all of the advice and make the decision unaided.

Comment by capybaralet on Imitation learning considered unsafe? · 2019-01-07T15:31:57.847Z · score: 1 (1 votes) · LW · GW

I think I disagree pretty broadly with the assumptions/framing of your comment, although not necessarily the specific claims.

1) I don't think it's realistic to imagine we have "indistinguishable imitation" with an idealized discriminator. It might be possible in the future, and it might be worth considering to make intellectual progress, but I'm not expecting it to happen on a deadline. So I'm talking about what I expect might be a practical problem if we actually try to build systems that imitate humans in the coming decades.

2) I wouldn't say "decision theory"; I think that's a bit of a red herring. What I'm talking about is the policy.

3) I'm not sure what link you are trying to make to the "universal prior is malign" ideas. But I'll draw my own connection. I do think the core of the argument I'm making results from an intuitive idea of what a simplicity prior looks like, and its propensity to favor something more like a planning process over something more like a lookup table.

## Imitation learning considered unsafe?

2019-01-06T15:48:36.078Z · score: 9 (4 votes)
Comment by capybaralet on Assuming we've solved X, could we do Y... · 2019-01-06T15:13:06.375Z · score: 1 (1 votes) · LW · GW

OK, so it sounds like your argument why SHF can't do ALD is (a specific, technical version of) the same argument that I mentioned in my last response. Can you confirm?

Comment by capybaralet on Conceptual Analysis for AI Alignment · 2018-12-30T21:58:25.522Z · score: 1 (1 votes) · LW · GW

I intended to make that clear in the "Concretely, I imagine a project around this with the following stages (each yielding at least one publication)" section. The TL;DR is: do a literature review of analytic philosophy research on (e.g.) honesty.

Comment by capybaralet on Assuming we've solved X, could we do Y... · 2018-12-30T21:56:30.356Z · score: 1 (1 votes) · LW · GW

Yes, please try to clarify. In particular, I don't understand your "|" notation (as in "S|Output").

I realized that I was a bit confused in what I said earlier. I think it's clear that (proposed) SHF schemes should be able to do at least as well as a human, given the same amount of time, because they have a human "on top" (as "CEO") who can simply ignore all the AI helpers(/underlings).

But now I can also see an argument for why SHF couldn't do ALD, if it doesn't have arbitrarily long to deliberate: there would need to be some parallelism/decomposition in SHF, and that might not work well/perfectly for all problems.

## Conceptual Analysis for AI Alignment

2018-12-30T00:46:38.014Z · score: 26 (9 votes)
Comment by capybaralet on Assuming we've solved X, could we do Y... · 2018-12-27T04:42:09.568Z · score: 1 (1 votes) · LW · GW

Regarding the question of how to do empirical work on this topic: I remember there being one thing which seemed potentially interesting, but I couldn't find it in my notes (yet).

RE the rest of your comment: I guess you are taking issue with the complexity theory analogy; is that correct? An example hypothetical TDMP I used is "arbitrarily long deliberation" (ALD), i.e. a single human is allowed as long as they want to make the decision (I don't think that's a perfect "target" for alignment, but it seems like a reasonable starting point). I don't see why ALD would (even potentially) "do something that can't be approximated by SHF-schemes", since those schemes still have the human making a decision.

"Or was the discussion more about, assuming we have theoretical reasons to think that SHF-schemes can approximate TDMP, how to test it empirically?" <-- yes, IIUC.

Comment by capybaralet on Survey: What's the most negative*plausible cryonics-works story that you know? · 2018-12-19T22:42:54.714Z · score: 1 (1 votes) · LW · GW

I'd suggest separating these two scenarios, based on the way the comments are meant to work according to the OP.

Comment by capybaralet on Assuming we've solved X, could we do Y... · 2018-12-17T04:43:41.963Z · score: 3 (2 votes) · LW · GW

I actually don't understand why you say they can't be fully disentangled.

IIRC, it seemed to me during the discussion that your main objection was around whether (e.g.) "arbitrarily long deliberation (ALD)" was (or could be) fully specified in a way that accounts properly for things like deception, manipulation, etc. More concretely, I think you mentioned the possibility of an AI affecting the deliberation process in an undesirable way.

But I think it's reasonable to assume (within the bounds of a discussion) that there is a non-terrible way (in principle) to specify things like "manipulation". So do you disagree? Or is your objection something else entirely?

Comment by capybaralet on Assuming we've solved X, could we do Y... · 2018-12-12T19:20:36.102Z · score: 6 (3 votes) · LW · GW

Hey, David here!

Just writing to give some context... The point of this session was to discuss an issue I see with "super-human feedback (SHF)" schemes (e.g. debate, amplification, recursive reward modelling) that use helper AIs to inform human judgments. I guess there was more of an inferential gap going into the session than I expected, so for background: let's consider the complexity theory viewpoint in feedback (as discussed in section 2.2 of "AI safety via debate"). This implicitly assumes that we have access to a trusted (e.g. human) decision making process (TDMP), sweeping the issues that Stuart mentions under the rug.

Under this view, the goal of SHF is to efficiently emulate the TDMP, accelerating the decision-making. For example, we'd like an agent trained with SHF to be able to quickly (e.g. in a matter of seconds) make decisions that would take the TDMP billions of years to decide. But we don't aim to change the decisions.

Now, the issue I mentioned is: there doesn't seem to be any way to evaluate whether the SHF-trained agent is faithfully emulating the TDMP's decisions on such problems. It seems like, naively, the best we can do is train on problems where the TDMP can make decisions quickly, so that we can use its decisions as ground truth; then we just hope that it generalizes appropriately to the decisions that take the TDMP billions of years. And the point of the session was to see if people have ideas for how to do less naive experiments that would allow us to increase our confidence that an SHF-scheme would yield safe generalization to these more difficult decisions.

Imagine there are 2 copies of me, A and B. A makes a decision with some helper AIs, and independently, B makes a decision without their help. A and B make different decisions. Who do we trust? I'm more ready to trust B, since I'm worried about the helper AIs having an undesirable influence on A's decision-making.

--------------------------------------------------------------------

...So questions of how to define human preferences or values seem mostly orthogonal to this question, which is why I want to assume them away. However, our discussion did make me consider more that I was making an implicit assumption (and this seems hard to avoid), that there was some idealized decision-making process that is assumed to be "what we want". I'm relatively comfortable with trusting idealized versions of "behavioral cloning/imitation/supervised learning" (P) or "(myopic) reinforcement learning/preference learning" (NP), compared with the SHF-schemes (PSPACE).

One insight I gleaned from our discussion is the usefulness of disentangling:

• an idealized process for *defining* "what we want" (HCH was mentioned as potentially a better model of this than "a single human given as long as they want to think about the decision" (which was what I proposed using, for the purposes of the discussion)).
• a means of *approximating* that definition.

From this perspective, the discussion topic was: how can we gain empirical evidence for/against this question: "Assuming that the output of a human's indefinite deliberation is a good definition of 'what they want', do SHF-schemes do a good/safe job of approximating that?"

Comment by capybaralet on Disambiguating "alignment" and related notions · 2018-11-26T06:55:58.050Z · score: 1 (1 votes) · LW · GW

So I discovered that Paul Christiano already made a very similar distinction to the holistic/parochial one here:

https://ai-alignment.com/ambitious-vs-narrow-value-learning-99bd0c59847e

ambitious ~ holistic

narrow ~ parochial

Someone also suggested simply using general/narrow instead of holistic/parochial.

Comment by capybaralet on Notification update and PM fixes · 2018-08-15T16:01:45.520Z · score: 1 (1 votes) · LW · GW

Has it been rolled out yet? I would really like this feature.

RE spamming: certainly they can be disabled by default, and you can have an unsubscribe button at the bottom of every email?

Comment by capybaralet on Safely and usefully spectating on AIs optimizing over toy worlds · 2018-08-15T15:49:12.296Z · score: 1 (1 votes) · LW · GW

I view this as a capability control technique, highly analogous to running a supervised learning algorithm where a reinforcement learning algorithm is expected to perform better. Intuitively, it seems like there should be a spectrum of options between (e.g.) supervised learning and reinforcement learning that would allow one to make more fine-grained safety-performance trade-offs.

I'm very optimistic about this approach of doing "capability control" by making less agent-y AI systems. If done properly, I think it could allow us to build systems that have no instrumental incentives to create subagents (although we'd still need to worry about "accidental" creation of subagents and (e.g. evolutionary) optimization pressures for their creation).

I would like to see this fleshed out as much as possible. This idea is somewhat intuitive, but it's hard to tell if it is coherent, or how to formalize it.

P.S. Is this the same as "platonic goals"? Could you include references to previous thought on the topic?

Comment by capybaralet on Disambiguating "alignment" and related notions · 2018-06-10T14:31:55.309Z · score: 2 (1 votes) · LW · GW

I realized it's unclear to me what "trying" means here, and in your definition of intentional alignment. I get the sense that you mean something much weaker than MIRI does by "(actually) trying", and/or that you think this is a lot easier to accomplish than they do. Can you help clarify?

Comment by capybaralet on Disambiguating "alignment" and related notions · 2018-06-10T14:26:06.488Z · score: 2 (1 votes) · LW · GW

It seems like you are referring to daemons.

To the extent that daemons result from an AI actually doing a good job of optimizing the right reward function, I think we should just accept that as the best possible outcome.

To the extent that daemons result from an AI doing a bad job of optimizing the right reward function, that can be viewed as a problem with capabilities, not alignment. That doesn't mean we should ignore such problems; it's just out of scope.

"Indeed, most people at MIRI seem to think that most of the difficulty of alignment is getting from "has X as explicit terminal goal" to "is actually trying to achieve X.""

That seems like the wrong way of phrasing it to me. I would put it like "MIRI wants to figure out how to build properly 'consequentialist' agents, a capability they view us as currently lacking".

Comment by capybaralet on Disambiguating "alignment" and related notions · 2018-06-10T14:14:01.206Z · score: 2 (1 votes) · LW · GW

Can you please explain the distinction more succinctly, and say how it is related?

Comment by capybaralet on Disambiguating "alignment" and related notions · 2018-06-07T19:35:35.554Z · score: 4 (2 votes) · LW · GW

I don't think I was very clear; let me try to explain.

I mean different things by "intentions" and "terminal values" (and I think you do too?)

By "terminal values" I'm thinking of something like a reward function. If we literally just program an AI to have a particular reward function, then we know that its terminal values are whatever that reward function expresses.

Whereas "trying to do what H wants it to do" I think encompasses a broader range of things, such as when R has uncertainty about the reward function, but "wants to learn the right one", or really just any case where R could reasonably be described as "trying to do what H wants it to do".

Talking about a "black box system" was probably a red herring.

Comment by capybaralet on Disambiguating "alignment" and related notions · 2018-06-07T18:47:43.057Z · score: 2 (1 votes) · LW · GW

Another way of putting it: A parochially aligned AI (for task T) needs to understand task T, but doesn't need to have common sense "background values" like "don't kill anyone".

Narrow AIs might require parochial alignment techniques in order to learn to perform tasks that we don't know how to write a good reward function for. And we might try to combine parochial alignment with capability control in order to get something like a genie without having to teach it background values. When/whether that would be a good idea is unclear ATM.

Comment by capybaralet on Disambiguating "alignment" and related notions · 2018-06-07T18:43:13.704Z · score: 2 (1 votes) · LW · GW

It doesn't *necessarily*. But it sounds like what you're thinking of here is some form of "sufficient alignment".

The point is that you could give an AI a reward function that leads it to be a good personal assistant program, so long as it remains restricted to doing the sort of things we expect a personal assistant program to do, and isn't doing things like manipulating the stock market when you ask it to invest some money for you (unless that's what you expect from a personal assistant). If it knows it could do things like that, but doesn't want to, then it's more like something sufficiently aligned. If it doesn't do such things because it doesn't realize they are possibilities (yet), or because it hasn't figured out a good way to use its actuators to have that kind of effect (yet), because you've done a good job boxing it, then it's more like "parochially aligned".

Comment by capybaralet on Amplification Discussion Notes · 2018-06-06T12:06:55.580Z · score: 3 (2 votes) · LW · GW

This is one of my main cruxes. I have 2 main concerns about honest mistakes:

1) Compounding errors: IIUC, Paul thinks we can find a basin of attraction for alignment (or at least corrigibility...) so that an AI can help us correct it online to avoid compounding errors. This seems plausible, but I don't see any strong reasons to believe it will happen or that we'll be able to recognize whether it is or not.

2) The "progeny alignment problem" (PAP): An honest mistake could result in the creation of an unaligned progeny. I think we should expect that to happen quickly if we don't have a good reason to believe it won't. You could argue that humans recognize this problem, so an AGI should as well (and if it's aligned, it should handle the situation appropriately), but that begs the question of how we got an aligned AGI in the first place. There are basically 3 subconcerns here (call the AI we're building "R"):

2a) R can make an unaligned progeny before it's "smart enough" to realize it needs to exercise care to avoid doing so.

2b) R gets smart enough to realize that solving PAP (e.g. doing something like MIRI's AF) is necessary in order to develop further capabilities safely, and that ends up being a huge roadblock that makes R uncompetitive with less safe approaches.

2c) If R has gamma < 1, it could knowingly, rationally decide to build a progeny that is useful through R's effective horizon, but will take over and optimize a different objective after that.

2b and 2c are *arguably* "non-problems" (although they're at least worth taking into consideration). 2a seems like a more serious problem that needs to be addressed.
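To put rough numbers on "effective horizon" in 2c: a common rule of thumb (a convention I'm assuming here, not something the argument depends on) is that a discount factor gamma gives a horizon of about 1/(1 - gamma) steps, and the share of total discounted weight on everything from step t onward is gamma^t. The function names below are made up for this sketch.

```python
def effective_horizon(gamma):
    """Rule-of-thumb horizon for a discount factor 0 <= gamma < 1."""
    return 1.0 / (1.0 - gamma)


def weight_beyond(gamma, t):
    """Fraction of total discounted weight placed on steps t, t+1, t+2, ...

    Follows from sum_{k>=t} gamma^k / sum_{k>=0} gamma^k = gamma^t.
    """
    return gamma ** t


# Illustrative numbers only: gamma = 0.99.
print(effective_horizon(0.99))
print(weight_beyond(0.99, 1000))
```

So an R with gamma = 0.99 places almost no weight on what its progeny does a thousand steps out, which is what makes the trade in 2c look rational from R's point of view.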

Comment by capybaralet on Disambiguating "alignment" and related notions · 2018-06-05T19:59:10.666Z · score: 9 (2 votes) · LW · GW

This is not what I meant by "the same values", but the comment points towards an interesting point.

When I say "the same values", I mean the same utility function, as a function over the state of the world (and the states of "R is having sex" and "H is having sex" are different).

The interesting point is that states need to be inferred from observations, and it seems like there are some fundamentally hard issues around doing that in a satisfying way.

Comment by capybaralet on Funding for AI alignment research · 2018-06-05T16:07:24.487Z · score: 3 (1 votes) · LW · GW

So my original response was to the statement:

"Differential research that advances safety more than AI capability still advances AI capability."

Which seems to suggest that advancing AI capability is sufficient reason to avoid technical safety that has non-trivial overlap with capabilities. I think that's wrong.

RE the necessary and sufficient argument:

1) Necessary: it's unclear that a technical solution to alignment would be sufficient, since our current social institutions are not designed for superintelligent actors, and we might not develop effective new ones quickly enough

2) Sufficient: I agree that never building AGI is a potential Xrisk (or close enough). I don't think it's entirely unrealistic "to shoot for levels of coordination like 'let's just never build AGI'", although I agree it's a long shot. Supposing we have that level of coordination, we could use "never build AGI" as a backup plan while we work to solve technical safety to our satisfaction, if that is in fact possible.

Comment by capybaralet on Funding for AI alignment research · 2018-06-05T16:01:44.240Z · score: 3 (1 votes) · LW · GW
"Moving on from that I'm thinking that we might need a broad base of support from people (depending upon the scenario) so being able to explain how people could still have meaningful lives post AI is important for building that support. So I've been thinking about that."

This sounds like it would be useful for getting people to support the development of AGI, rather than effective global regulation of AGI. What am I missing?

## Disambiguating "alignment" and related notions

2018-06-05T15:35:15.091Z · score: 43 (13 votes)
Comment by capybaralet on Funding for AI alignment research · 2018-06-05T14:38:21.600Z · score: 3 (1 votes) · LW · GW

Can you give some arguments for these views?

I think the best argument against institution-oriented work is that it might be harder to make a big impact. But more importantly, I think strong global coordination is necessary and sufficient, whereas technical safety is plausibly neither.

I also agree that one should consider tradeoffs, sometimes. But every time someone has raised this concern to me (I think it's been 3x?) I think it's been a clear cut case of "why are you even worrying about that", which leads me to believe that there are a lot of people who are overconcerned about this.

Comment by capybaralet on When is unaligned AI morally valuable? · 2018-06-05T14:33:53.494Z · score: 3 (1 votes) · LW · GW
"It seems like the preferences of the AI you build are way more important than its experience (not sure if that's what you mean)."

This is because the AI's preferences are going to have a much larger downstream impact?

I'd agree, but caveat that there may be likely possible futures which don't involve the creation of hyper-rational AIs with well-defined preferences, but rather artificial life with messy incomplete, inconsistent preferences but morally valuable experiences. More generally, the future of the light cone could be determined by societal/evolutionary factors rather than any particular agent or agent-y process.

I found your 2nd paragraph unclear...

"the goals happen to overlap enough"

Is this referring to the goals of having "AIs that have good preferences" and "AIs that have lots of morally valuable experience"?

Comment by capybaralet on Funding for AI alignment research · 2018-06-04T13:06:14.377Z · score: 3 (1 votes) · LW · GW

Are you funding constrained? Would you give out more money if you had more?

Comment by capybaralet on Funding for AI alignment research · 2018-06-04T13:05:37.898Z · score: 3 (1 votes) · LW · GW

FWIW, I think I represent the majority of safety researchers in saying that you shouldn't be too concerned with your effect on capabilities; there's many more people pushing capabilities, so most safety research is likely a drop in the capabilities bucket (although there may be important exceptions!)

Personally, I agree that improving social institutions seems more important for reducing AI-Xrisk ATM than technical work. Are you doing that? There are options for that kind of work as well, e.g. at FHI.

Comment by capybaralet on When is unaligned AI morally valuable? · 2018-05-29T13:44:29.028Z · score: 3 (2 votes) · LW · GW

"Overall, I think the question “which AIs are good successors?” is both neglected and time-sensitive, and is my best guess for the highest impact question in moral philosophy right now."

Interesting... my model of Paul didn't assign any work in moral philosophy high priority.

I agree this is high impact. My idea of the kind of work to do here is mostly trying to solve the hardish problem of consciousness so that we can have some more informed guess as to the quantity and valence of experience that different possible futures generate.

Comment by capybaralet on Soon: a weekly AI Safety prerequisites module on LessWrong · 2018-05-10T23:30:58.394Z · score: 1 (1 votes) · LW · GW

I don't think most places have enough ML courses at the undergraduate level; I'd expect 0-2 undergraduate ML courses at a typical large or technically focused university. OFC, you can often take graduate courses as an undergraduate as well.

Comment by capybaralet on Soon: a weekly AI Safety prerequisites module on LessWrong · 2018-05-07T18:44:44.823Z · score: 1 (1 votes) · LW · GW

There are lots of graduate ML programs that will give you ML background (although that might not be the most efficient route; e.g. compare with Google Brain Residency).

Is there a clear academic path towards getting a good background for AF? Maybe mathematical logic? RAISE might be filling that niche?

Comment by capybaralet on Understanding Iterated Distillation and Amplification: Claims and Oversight · 2018-04-23T21:03:53.443Z · score: 1 (1 votes) · LW · GW

"But I'm not sure what the alternative would be."

I'm not sure if it's what you're thinking of, but I'm thinking of “What action is best according to these values” == "maximize reward". One alternative that's worth investigating more (IMO) is imposing hard constraints.

For instance, you could have an RL agent taking actions in $(a_1, a_2) \in \mathbb{R}^2$, and impose the constraint that $a_1 + a_2 < 3$ by projection.

A recent near-term safety paper takes this approach: https://arxiv.org/abs/1801.08757
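To make "by projection" concrete, here is a minimal sketch under my own assumptions (it is not taken from the linked paper): whatever action the policy proposes is mapped to the nearest point, in Euclidean distance, of the allowed half-plane before it is executed. Because $a_1 + a_2 < 3$ is a strict inequality, the sketch projects onto $a_1 + a_2 \le 3 - \epsilon$ for a small margin $\epsilon$; the function name and the `margin` parameter are hypothetical.

```python
def project_onto_halfspace(action, normal=(1.0, 1.0), bound=3.0, margin=1e-6):
    """Euclidean projection of `action` onto {a : normal . a <= bound - margin}.

    The margin is an assumption of this sketch: it turns the strict
    constraint a_1 + a_2 < 3 into a closed set that can be projected onto.
    """
    limit = bound - margin
    # How far the proposed action overshoots the constraint (<= 0 means feasible).
    violation = sum(n * a for n, a in zip(normal, action)) - limit
    if violation <= 0:
        return tuple(action)  # already feasible, leave the action unchanged
    norm_sq = sum(n * n for n in normal)
    # Move back along the constraint normal by exactly the overshoot.
    return tuple(a - violation * n / norm_sq for n, a in zip(normal, action))


if __name__ == "__main__":
    proposed = (3.0, 3.0)  # violates a_1 + a_2 < 3
    safe = project_onto_halfspace(proposed)
    print(safe, sum(safe) < 3.0)
```

The point of the design is that the constraint holds no matter what the learned policy outputs, rather than being something the reward merely encourages.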

Comment by capybaralet on China’s Plan to ‘Lead’ in AI: Purpose, Prospects, and Problems · 2017-08-11T04:40:23.729Z · score: 0 (0 votes) · LW · GW

A FB friend of mine speculated that this was referring to alienation resulting from ppl losing their jobs to robots... shrug

Comment by capybaralet on China’s Plan to ‘Lead’ in AI: Purpose, Prospects, and Problems · 2017-08-10T22:26:05.157Z · score: 0 (0 votes) · LW · GW

What is "robot alienation"?

Comment by capybaralet on Counterfactual Mugging · 2017-01-30T18:14:17.104Z · score: 1 (1 votes) · LW · GW

But you aren't supposed to be updating... the essence of UDT, I believe, is that your policy should be set NOW, and NEVER UPDATED.

So... either:

1. You consider the choice of policy based on the prior where you DIDN'T KNOW whether you'd face Nomega or Omega, and NEVER UPDATE IT (this seems obviously wrong to me: why are you using your old prior instead of your current posterior?); or
2. You consider the choice of policy based on the prior where you KNOW that you are facing Omega AND that the coin is tails, in which case paying Omega only loses you money.

Comment by capybaralet on Counterfactual Mugging · 2017-01-30T18:08:34.353Z · score: 0 (0 votes) · LW · GW

Thanks for pointing that out. The answer is, as expected, a function of p. So I now find explanations of why UDT gets mugged incomplete and misleading.

Here's my analysis:

The action set is {give, don't give}, which I'll identify with {1, 0}. Now, the possible deterministic policies are simply every mapping from {N,O} --> {1,0}, of which there are 4.

We can disregard the policies for which pi(N) = 1, since giving money to Nomega serves no purpose. So we're left with

pi_give

and

pi_don't,

which give/don't, respectively, to Omega.

Now, we can easily compute expected value, as follows:

r (pi_give(N)) = 0

r (pi_give(O, heads)) = 10

r (pi_give(O, tails)) = -1

r (pi_don't(N)) = 10

r (pi_don't(O)) = 0

So now:

Eg := E_give(r) = 0 p + .5 (10-1) * (1-p)

Ed := E_don't(r) = 10 p + 0 (1-p)

Eg > Ed whenever 4.5 (1-p) > 10 p,

i.e. whenever 4.5 > 14.5 p

i.e. whenever 9/29 > p

So, whether you should precommit to being mugged depends on how likely you are to encounter N vs. O, which is intuitively obvious.
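The arithmetic above can be checked with a short sketch. It assumes, as in the analysis, that p is the probability of meeting Nomega (N), 1 - p the probability of meeting Omega (O), and that the coin is fair; the function name is mine, and exact fractions are used so the break-even point comes out as exactly 9/29.

```python
from fractions import Fraction


def expected_rewards(p):
    """Return (Eg, Ed) for the two policies, given p = P(meeting Nomega)."""
    half = Fraction(1, 2)
    # pi_give: 0 from Nomega; from Omega, +10 on heads and -1 on tails.
    e_give = 0 * p + (half * 10 + half * (-1)) * (1 - p)
    # pi_don't: 10 from Nomega; 0 from Omega.
    e_dont = 10 * p + 0 * (1 - p)
    return e_give, e_dont


if __name__ == "__main__":
    for p in (Fraction(1, 4), Fraction(9, 29), Fraction(1, 3)):
        e_give, e_dont = expected_rewards(p)
        print(f"p = {p}: Eg = {e_give}, Ed = {e_dont}")
```

At p = 9/29 the two policies tie; for smaller p giving wins, and for larger p not giving wins.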

Comment by capybaralet on Progress and Prizes in AI Alignment · 2017-01-05T04:56:37.765Z · score: 3 (3 votes) · LW · GW

Looking at what they've produced to date, I don't really expect MIRI and CHCAI to produce very similar work. I expect Russell's group to be more focused on value learning and corrigibility vs. reliable agent designs (MIRI).

## Problems with learning values from observation

2016-09-21T00:40:49.102Z · score: 0 (7 votes)

## Risks from Approximate Value Learning

2016-08-27T19:34:06.178Z · score: 1 (4 votes)

## Inefficient Games

2016-08-23T17:47:02.882Z · score: 14 (15 votes)

## Should we enable public binding precommitments?

2016-07-31T19:47:05.588Z · score: 0 (1 votes)

## A Basic Problem of Ethics: Panpsychism?

2015-01-27T06:27:20.028Z · score: -4 (11 votes)

## A Somewhat Vague Proposal for Grounding Ethics in Physics

2015-01-27T05:45:52.991Z · score: -3 (16 votes)