Posts

Offer of collaboration and/or mentorship 2019-05-16T14:16:20.684Z · score: 109 (36 votes)
Reinforcement learning with imperceptible rewards 2019-04-07T10:27:34.127Z · score: 16 (8 votes)
Dimensional regret without resets 2018-11-16T19:22:32.551Z · score: 9 (4 votes)
Computational complexity of RL with traps 2018-08-29T09:17:08.655Z · score: 14 (5 votes)
Entropic Regret I: Deterministic MDPs 2018-08-16T13:08:15.570Z · score: 12 (7 votes)
Algo trading is a central example of AI risk 2018-07-28T20:31:55.422Z · score: 25 (15 votes)
The Learning-Theoretic AI Alignment Research Agenda 2018-07-04T09:53:31.000Z · score: 16 (7 votes)
Meta: IAFF vs LessWrong 2018-06-30T21:15:56.000Z · score: 1 (1 votes)
Computing an exact quantilal policy 2018-04-12T09:23:27.000Z · score: 2 (1 votes)
Quantilal control for finite MDPs 2018-04-12T09:21:10.000Z · score: 3 (3 votes)
Improved regret bound for DRL 2018-03-02T12:49:27.000Z · score: 0 (0 votes)
More precise regret bound for DRL 2018-02-14T11:58:31.000Z · score: 1 (1 votes)
Catastrophe Mitigation Using DRL (Appendices) 2018-02-14T11:57:47.000Z · score: 0 (0 votes)
Bugs? 2018-01-21T21:32:10.492Z · score: 4 (1 votes)
The Behavioral Economics of Welfare 2017-12-22T11:35:09.617Z · score: 28 (12 votes)
Improved formalism for corruption in DIRL 2017-11-30T16:52:42.000Z · score: 0 (0 votes)
Why DRL doesn't work for arbitrary environments 2017-11-30T12:22:37.000Z · score: 0 (0 votes)
Catastrophe Mitigation Using DRL 2017-11-22T05:54:42.000Z · score: 2 (1 votes)
Catastrophe Mitigation Using DRL 2017-11-17T15:38:18.000Z · score: 0 (0 votes)
Delegative Reinforcement Learning with a Merely Sane Advisor 2017-10-05T14:15:45.000Z · score: 1 (1 votes)
On the computational feasibility of forecasting using gamblers 2017-07-18T14:00:00.000Z · score: 0 (0 votes)
Delegative Inverse Reinforcement Learning 2017-07-12T12:18:22.000Z · score: 11 (3 votes)
Learning incomplete models using dominant markets 2017-04-28T09:57:16.000Z · score: 1 (1 votes)
Dominant stochastic markets 2017-03-17T12:16:55.000Z · score: 0 (0 votes)
A measure-theoretic generalization of logical induction 2017-01-18T13:56:20.000Z · score: 3 (3 votes)
Towards learning incomplete models using inner prediction markets 2017-01-08T13:37:53.000Z · score: 2 (2 votes)
Subagent perfect minimax 2017-01-06T13:47:12.000Z · score: 0 (0 votes)
Minimax forecasting 2016-12-14T08:22:13.000Z · score: 0 (0 votes)
Minimax and dynamic (in)consistency 2016-12-11T10:42:08.000Z · score: 0 (0 votes)
Attacking the grain of truth problem using Bayes-Savage agents 2016-10-20T14:41:56.000Z · score: 1 (1 votes)
IRL is hard 2016-09-13T14:55:26.000Z · score: 0 (0 votes)
Stabilizing logical counterfactuals by pseudorandomization 2016-05-25T12:05:07.000Z · score: 1 (1 votes)
Stability of optimal predictor schemes under a broader class of reductions 2016-04-30T14:17:35.000Z · score: 0 (0 votes)
Predictor schemes with logarithmic advice 2016-03-27T08:41:23.000Z · score: 1 (1 votes)
Reflection with optimal predictors 2016-03-22T17:20:37.000Z · score: 1 (1 votes)
Logical counterfactuals for random algorithms 2016-01-06T13:29:52.000Z · score: 3 (3 votes)
Quasi-optimal predictors 2015-12-25T14:17:05.000Z · score: 2 (2 votes)
Implementing CDT with optimal predictor systems 2015-12-20T12:58:44.000Z · score: 1 (1 votes)
Bounded Solomonoff induction using optimal predictor schemes 2015-11-10T13:59:29.000Z · score: 1 (1 votes)
Superrationality in arbitrary games 2015-11-04T18:20:41.000Z · score: 7 (6 votes)
Optimal predictor schemes 2015-11-01T17:28:46.000Z · score: 2 (2 votes)
Optimal predictors for global probability measures 2015-10-06T17:40:19.000Z · score: 0 (0 votes)
Logical counterfactuals using optimal predictor schemes 2015-10-04T19:48:23.000Z · score: 0 (0 votes)
Towards reflection with relative optimal predictor schemes 2015-09-30T15:44:21.000Z · score: 1 (1 votes)
Improved error space for universal optimal predictor schemes 2015-09-30T15:08:53.000Z · score: 0 (0 votes)
Optimal predictor schemes pass a Benford test 2015-08-30T13:25:59.000Z · score: 3 (3 votes)
Optimal predictors and propositional calculus 2015-07-04T09:51:38.000Z · score: 0 (0 votes)
Optimal predictors and conditional probability 2015-06-30T18:01:31.000Z · score: 2 (2 votes)
A complexity theoretic approach to logical uncertainty (Draft) 2015-05-11T20:04:28.000Z · score: 5 (5 votes)
Identity and quining in UDT 2015-03-19T19:03:29.000Z · score: 2 (2 votes)

Comments

Comment by vanessa-kosoy on [AN #57] Why we should focus on robustness in AI safety, and the analogous problems in programming · 2019-06-26T11:18:33.520Z · score: 10 (3 votes) · LW · GW

I focus mostly on formal properties algorithms can or cannot have, rather than the algorithms themselves. So, from my point of view, it doesn't matter whether the prior is "explicit" and I doubt it's even a well-defined question. What I mean by "prior" is, more or less, whatever probability measure has the best Bayesian regret bound for the given RL algorithm.
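For concreteness, one standard way to spell out the Bayesian regret of an algorithm $\pi$ with respect to a prior $\zeta$ over environments (the notation here is mine, not from the comment):

$$\mathrm{BR}(\pi, \zeta) \;=\; \mathbb{E}_{\mu \sim \zeta}\!\left[ V^*_\mu - V^{\pi}_\mu \right],$$

where $V^*_\mu$ is the optimal value attainable in environment $\mu$ and $V^{\pi}_\mu$ is the value the algorithm actually achieves. "The prior" in the sense above is then whichever $\zeta$ gives the given algorithm the best bound of this form.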

I think the prior will have to look somewhat like the universal prior. Occam's razor is a foundational principle of rationality, and any reasonable algorithm should have inductive bias towards simpler hypotheses. I think there's even some work trying to prove that deep learning already has such inductive bias. At the same time, the space of hypotheses has to be very rich (although still constrained by computational resources and some additional structural assumptions needed to make learning feasible).

I think that DRL doesn't require a prior (or, more generally, algorithmic building blocks) substantially different from what is needed for capabilities, since if your algorithm is superintelligent (in the sense that it's relevant to either causing or mitigating X-risk) then it has to create sophisticated models of the world that include people, among other things, and therefore forcing it to model the advisor as well doesn't make the task substantially harder (well, it is harder in the sense that the regret bound is weaker, but that is not because of the prior).

Comment by vanessa-kosoy on [AN #57] Why we should focus on robustness in AI safety, and the analogous problems in programming · 2019-06-23T14:02:39.740Z · score: 2 (1 votes) · LW · GW

For states that cause existential catastrophes this seems obviously desirable. Maybe my objection is more that with this sort of algorithm you need to have the right set of hypotheses in the first place, and that seems like the main difficulty?

Ultimately, the set of hypotheses should be something like the universal prior. More precisely, it should be whatever we need to use instead of the universal prior to get "general" intelligence that is computationally efficient. In other words, I don't think the choice of prior here is substantially different or more difficult from the choice of prior for AGI from a pure capability POV.

Maybe I'm also saying that this feels vulnerable to nearest unblocked strategies. Suppose the AI has learned that its reward function is to maximize paperclips, and the advisor doesn't realize that a complicated gadget the AI has built is a self-replicating nanorobot that will autonomously convert atoms into paperclips. It doesn't seem like DRL saves us here.

The way I imagine it will work, the advisor will not do something weird and complicated that ey don't understand emself. So the AI also cannot do something weird and complicated that the advisor doesn't understand. In order for this not to be too constraining, I imagine the advisor having some kind of "diary" where ey write down eir thoughts and reasoning, which is considered a part of normal interaction with the environment. The advisor will only act on complicated plans after laying down the reasoning. The AI will then have to write down in this diary an explanation for its plans that will be understandable by and convincing for the advisor. This is partially protected from manipulations, because the explanation has to come from the space of explanations that the advisor could produce. That said, I think this defense from manipulation is insufficient in itself (because the AI can write down all arguments for a given position that the advisor could come up with, without writing down any arguments against it), and I have a research direction based on the "debate" approach about how to strengthen it.

Maybe another way of putting it -- is there additional safety conferred by this approach that you couldn't get by having a human review all of the AI's actions? If so, should I think of this as "we want a human to review actions, but that's expensive, DRL is a way to make it more sample efficient"?

The current version of the formalism is more or less the latter, but you should imagine the review to be rather conservative (like in the nanorobot example). In the "soft" version it will become a limit on how much the AI policy deviates from the advisor policy, so it's not quite a review in the usual sense: there is no binary division between "legal" and "illegal" actions. I think of it more like, the AI should emulate an "improved" version of the advisor: do all the things the advisor would do on eir "best day".

Comment by vanessa-kosoy on How does Gradient Descent Interact with Goodhart? · 2019-06-19T14:18:10.365Z · score: 4 (2 votes) · LW · GW

Hi David, if you want to discuss this more, I think we can do it in person? AFAIK you live in Israel? For example, you can come to my talk in the LessWrong meetup on July 2.

Comment by vanessa-kosoy on [AN #57] Why we should focus on robustness in AI safety, and the analogous problems in programming · 2019-06-13T12:41:57.268Z · score: 5 (2 votes) · LW · GW

Dealing with corrupt states requires a "different" algorithm, but the modification is rather trivial: for each hypothesis that includes dynamics and corruption, you need to replace the corrupt states by an inescapable state with reward zero and run the usual PSRL algorithm on this new prior. Indeed, the algorithm deals with corruption by never letting the agent go there. I am not sure I understand why you think this is not a good approach. Consider a corrupt state in which the human's brain has been somehow scrambled to make em give high rewards. Do you think such a state should be explored? Maybe your complaint is that in the real world corruption is continuous rather than binary, and the advisor avoids most of the corruption but not all of it and not with 100% success probability. In this case, I agree, the current model is extremely simplified, but it still feels like progress. You can see a model of continuous corruption in DIRL, which is a simpler setting. More generally, I think that a better version of the formalism would build on ideas from quantilization and catastrophe mitigation to arrive at a setting where you have a low rate of falling into traps or accumulating corruption as long as your policy remains "close enough" to the advisor policy w.r.t. some metric similar to the infinity-Renyi divergence (and as long as your corruption remains low).
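A minimal sketch of the hypothesis transformation described above, assuming a tabular representation with an explicit corruption mask (the data layout and function name are my own illustration, not the actual construction):

```python
import numpy as np

def make_safe_hypothesis(P, R, corrupt):
    """Given transitions P[s, a, s'], rewards R[s, a] and a boolean corruption
    mask over states, return a modified hypothesis in which every corrupt
    state is replaced by an inescapable state with reward zero."""
    P, R = P.copy(), R.copy()
    for s in np.flatnonzero(corrupt):
        P[s, :, :] = 0.0
        P[s, :, s] = 1.0   # the corrupt state transitions only to itself
        R[s, :] = 0.0      # and yields zero reward
    return P, R

# The usual PSRL loop then runs on the transformed prior: sample a hypothesis
# (P, R, corrupt) from the posterior, transform it with make_safe_hypothesis,
# solve for the optimal policy, act on it, and update the posterior.
```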

Comment by vanessa-kosoy on [AN #57] Why we should focus on robustness in AI safety, and the analogous problems in programming · 2019-06-08T08:43:57.952Z · score: 6 (3 votes) · LW · GW

Hi Rohin, thank you for writing about my work! I want to address some issues you brought up regarding Delegative RL.

I worry about there being a Cartesian boundary between the agent and the environment, though perhaps even here as long as the advisor is aware of problems caused by such a boundary, they can be modeled as traps and thus avoided.

Yes. I think that the Cartesian boundary is part of the definition of the agent, and events that violate the Cartesian boundary should be thought of as destroying the agent. Destruction of the agent is certainly a trap, since from the POV of the agent it is irreversible. See also the "death of the agent" subsection of the imperceptible rewards essay.

One thing I wonder about is whether the focus on traps is necessary. With the presence of traps in the theoretical model, one of the main challenges is in preventing the agent from falling into a trap due to ignorance. However, it seems extremely unlikely that an AI system manages to take some irreversible catastrophic action by accident -- I'm much more worried about the case where the AI system is adversarially optimizing against us and intentionally takes an irreversible catastrophic action.

I think that most of the "intentional" catastrophic actions can be regarded as "due to ignorance" from an appropriate perspective (the main exception is probably non-Cartesian daemons). Consider two examples:

Example 1 is corrupt states, that I discussed here. These are states in which the specified reward function doesn't match the intended reward function (and also possibly the advisor becomes unreliable). We can equip the agent with a prior that accounts for the existence of such states. However, without further help, the agent doesn't have enough information to know when it could enter one. So, if the agent decides to e.g. hack its own reward channel, one perspective is that it is an intentional action against us, but another perspective is that it is due to the agent's ignorance of the true model of corruption. This problem is indeed fixed by Delegative RL (assuming that, in uncorrupt states, the advisor's actions don't lead to corruption).

Example 2 is malign hypotheses. The agent's prior may contain hypotheses that are agentic in themselves. Such a hypothesis can intentionally produce correct predictions up to a "traitorous turn" point, at which it produces predictions that manipulate the agent into an irreversible catastrophic action. From the perspective of the "outer" agent this is "ignorance", but from the perspective of the "inner" agent, this is intentional. Once again, delegative RL fixes it: at the traitorous turn, a DRL agent detects the ambiguity in predictions and the critical need to take the right action, leading it to delegate. Observing the advisor's action leads it to update away from the malign hypothesis.

Comment by vanessa-kosoy on I translated 'Twelve Virtues of Rationality' into Hebrew. · 2019-06-02T07:24:02.595Z · score: 2 (1 votes) · LW · GW

You can create an empty profile just for the LessWrong group, but it's your call of course. It's just convenient for tracking meetups, and there are some interesting discussions. There is also a Google group but it is mostly dormant nowadays.

Comment by vanessa-kosoy on I translated 'Twelve Virtues of Rationality' into Hebrew. · 2019-06-01T20:38:26.213Z · score: 3 (2 votes) · LW · GW

How about posting it on the Facebook group and discussing more there? It might be worthwhile to put it on rationality.co.il, for example.

Comment by vanessa-kosoy on Offer of collaboration and/or mentorship · 2019-05-26T18:00:14.443Z · score: 20 (8 votes) · LW · GW

Update: In total, 16 people contacted me. Offer of mentorship is closed, since I have sufficiently many candidates for now. Offer of collaboration remains open for experienced researchers (i.e. researchers that (i) have some track record of original math / theoretical compsci research, and (ii) are able to take on concrete open problems without much guidance).

Comment by vanessa-kosoy on TAISU - Technical AI Safety Unconference · 2019-05-24T21:59:50.766Z · score: 8 (4 votes) · LW · GW

Hey, it's an AI safety unconference, we should be able to coordinate acausally ;)

Comment by vanessa-kosoy on TAISU - Technical AI Safety Unconference · 2019-05-24T21:50:13.965Z · score: 2 (1 votes) · LW · GW

Is there a specific deadline for signing up?

Comment by vanessa-kosoy on Comment section from 05/19/2019 · 2019-05-21T21:37:48.717Z · score: 8 (4 votes) · LW · GW

AFAIU discussing charged political issues is not allowed, or at least very frowned upon on LW, and for good reasons. So, I can't discuss the object level. On the other hand, the meta level is too vague. That is, the error is in the way the abstract reasoning is applied to case X (it's just not the right model), rather than in the abstract reasoning itself.

Comment by vanessa-kosoy on Comment section from 05/19/2019 · 2019-05-21T21:31:21.392Z · score: 2 (1 votes) · LW · GW

I didn't know about this feature. It has advantages and disadvantages, but I will at least consider it. Thank you!

Comment by vanessa-kosoy on Comment section from 05/19/2019 · 2019-05-21T21:24:38.284Z · score: 19 (10 votes) · LW · GW

That is very reasonable and fair. I think that in practice I won't write such a compilation post any time soon, because (i) I already created too much drama, (ii) I don't enjoy writing call-out posts and (iii) my time is much better spent working on AI alignment.

Upon reflection, my strong reaction was probably because my System 1 is designed to deal with Dunbar-number-size groups. In such a tribe, one voice with an agenda which, if implemented, would put me in physical danger is already a notable risk. However, in a civilization of millions the significance of one such voice is microscopic (unless it's very exceptional in its charisma or otherwise). On the other hand, AGI is a serious risk, and it's one that I'm much better equipped to affect.

Sorry for causing all this trouble! Hopefully putting this analysis here in public will help me to stay focused in the future :)

Comment by vanessa-kosoy on Comment section from 05/19/2019 · 2019-05-19T22:07:32.947Z · score: 7 (4 votes) · LW · GW

Alright, thank you!

Comment by vanessa-kosoy on Comment section from 05/19/2019 · 2019-05-19T21:39:18.983Z · score: 9 (4 votes) · LW · GW

I wasn't referring to "where to discuss politically charged topics", I was referring to "where to discuss the fact that something that happens on LessWrong.com makes me uncomfortable because [reasons]".

To be honest I prefer to avoid politically charged topics, as long as they avoid me (which they didn't, in this case).

Comment by vanessa-kosoy on Comment section from 05/19/2019 · 2019-05-19T21:27:44.907Z · score: 2 (1 votes) · LW · GW

Firstly, I have always said (and this incident has once again reinforced my view of this) that “we”, which is to say “rationalists”, should not be a “community”.

Well, that is a legitimate opinion. I just want to point out that it did not appear to be the consensus so far. If it is the consensus (or becomes such) then it seems fair to ask to make it clear, in particular to inform people's decisions about how and whether to interact with the forum.

Comment by vanessa-kosoy on Comment section from 05/19/2019 · 2019-05-19T20:26:46.582Z · score: 10 (2 votes) · LW · GW

I mean, the sum total of spaces that the rationalist community uses to hold discussions, propagate information, do collective decision making, (presumably) provide mutual support et cetera, to the extent these spaces are effective in fulfilling their functions. Anywhere where I can say something and people in the community will listen to me, and take this new information into account if it's worth taking into account, or at least provide me with compassionate feedback even if it's not.

Comment by vanessa-kosoy on Comment section from 05/19/2019 · 2019-05-19T20:19:16.442Z · score: 20 (10 votes) · LW · GW

Regarding "Kolmogorov complicity", I just want to make clear that I don't want to censor your opinion on the political question. Such censorship would only serve to justify your notion that "we only refuse to believe X because it's heresy, while any systematic truthseeker would believe X", which is something I very much disagree with. I might be interested in discussing the political question if we were allowed to do it. It is the double bind of, not being able to allowed to argue with you on the political quesiton while having to listen to you constantly hinting at it, is what bugging me. Then again, I don't really have a good solution.

Comment by vanessa-kosoy on Comment section from 05/19/2019 · 2019-05-19T20:08:55.352Z · score: 11 (7 votes) · LW · GW

Alright, let's suppose it's off-topic in this thread, or even on this forum. But is there another place within the community's "discussion space" where it is on-topic? Or you don't think such a place should exist at all?

Comment by vanessa-kosoy on Comment section from 05/19/2019 · 2019-05-19T19:50:29.506Z · score: 6 (16 votes) · LW · GW

Ugh, because productive discussion happens between perfectly dispassionate robots in a vacuum, and if I'm not one then it is my fault and I should be ashamed? Specifically, I should be ashamed just for saying that something made me uncomfortable rather than suffering in silence? I mean, if that's your vision, it's fine, I understand. But I wonder whether that's really the predominant opinion around here? What about all the stuff about "community" and "Village" etc?

Comment by vanessa-kosoy on Comment section from 05/19/2019 · 2019-05-19T18:19:38.448Z · score: 2 (1 votes) · LW · GW

See my reply to Said Achmiz.

Comment by vanessa-kosoy on Comment section from 05/19/2019 · 2019-05-19T18:13:40.593Z · score: 1 (19 votes) · LW · GW

The abuse did not happen on LW. However, because I happen to be somewhat familiar with Davis' political writing, I am aware of a sinister context to what ey write on LW of which you are not aware. Now, you may say that this is not a fair objection to Davis writing whatever ey write here, and you might well be right. However, I thought I at least had the right to express my feelings on this matter so that Davis and others can take them into account (or not). If we are supposed to be a community, then it should be normal for us to consider each other's feelings, even when there was no norm violation per se involved, not so?

Comment by vanessa-kosoy on Comment section from 05/19/2019 · 2019-05-19T10:54:23.529Z · score: 4 (17 votes) · LW · GW

That's understandable, but I hope it's also understandable that I find it unpleasant that our standard Bayesian philosophy-of-language somehow got politicized (!?), such that my attempts to do correct epistemology are perceived as attacking people?!

Our philosophy of language did not "somehow" get politicized. You personally (Zack M. Davis) politicized it by abusing it in the context of a political issue.

...Which might make it all the more gratifying if you can find a mistake in the racist bastard's math: then you could call out the mistake in the comments and bask in moral victory as the OP gets downvoted to oblivion for the sin of bad math.

If you had interesting new math or non-trivial novel insights, I would not complain. Of course that's somewhat subjective: someone else might consider your insights valuable.

But what, realistically, do you expect me to do?

You're right, I don't have a good meta-level solution. So, if you want to keep doing that thing you're doing, knock yourself out.

Comment by vanessa-kosoy on Comment section from 05/19/2019 · 2019-05-18T17:09:27.761Z · score: 22 (24 votes) · LW · GW

I find it unpleasant that you always bring your hobbyhorse in, but in an "abstract" way that doesn't allow discussing the actual object level question. It makes me feel attacked in a way that allows for no legal recourse to defend myself.

Comment by vanessa-kosoy on Offer of collaboration and/or mentorship · 2019-05-18T14:42:53.353Z · score: 2 (1 votes) · LW · GW

Thank you, I appreciate the positive feedback :)

Comment by vanessa-kosoy on Offer of collaboration and/or mentorship · 2019-05-18T14:41:08.106Z · score: 5 (3 votes) · LW · GW

Sure, I will do it :)

Comment by vanessa-kosoy on Value Learning is only Asymptotically Safe · 2019-04-23T14:37:27.432Z · score: 4 (2 votes) · LW · GW

Not quite. The AI starts with some prior over (environment, advisor policy) pairs and updates it with incoming observations. It can take an action if, given its current belief state, it is sufficiently confident that it is an action the advisor could take. The confidence threshold is controlled by a parameter which has a certain optimal value for achieving the best regret bound; as the time discount constant approaches 1, the optimal threshold becomes stricter. In other words, the more long-term the plan is, the more cautious the AI becomes (obviously, catastrophes modify this trade-off). That is, the AI generalizes from what it already observed rather than requiring the exact same state to repeat itself. Indeed, if we required the exact same state to repeat itself, the regret bound would scale with the number of states. Instead, it scales with the number of hypotheses (of course we can also derive a "structural" / "non-uniform" version for a countable number of hypotheses). Also, I am pretty sure that we can derive a regret bound that scales with RVO and MB dimensions (I also think MB dimension can be replaced by prior entropy, but so far I haven't been able to prove it), which can be bounded either in terms of the number of hypotheses or in terms of the number of states and actions, and can also remain small when both the number of hypotheses and the number of states are large.

Comment by vanessa-kosoy on Value Learning is only Asymptotically Safe · 2019-04-22T20:19:55.411Z · score: 2 (1 votes) · LW · GW

Another useful perspective on the conditions the advisor must satisfy is to regard the environment w.r.t. which these conditions are defined as the belief state of the advisor rather than the true environment. This is difficult to do with the current formalism, which requires MDPs, but would be possible with POMDPs for example. Indeed, I took this perspective in an earlier essay about a different setting that allows general environments (see Corollary 1 in that essay). This would lead to a performance guarantee which shows that the agent achieves optimal expected utility w.r.t. the belief state of the advisor. Obviously, this is not as good as optimal expected utility w.r.t. the true environment; however, it means that from the perspective of the advisor, building such an agent is the best possible strategy.

Comment by vanessa-kosoy on Value Learning is only Asymptotically Safe · 2019-04-22T18:09:48.722Z · score: 2 (1 votes) · LW · GW

I think that in the real world, most superficially reasonable actions do not have irreversible consequences that are very important. So, this assumption can hold within some approximation, and this should lead to a performance guarantee that is optimal within the accuracy of this approximation.

Comment by vanessa-kosoy on Value Learning is only Asymptotically Safe · 2019-04-22T17:59:12.564Z · score: 2 (1 votes) · LW · GW

The agent interacts with an environment that is, for the time being, assumed to be a finite MDP (generalizations to POMDPs and infinite state spaces should be possible, but working out the precise assumptions that are needed is currently an open problem). On each round it either takes a normal action from the action set or takes the special "delegation" action. If the agent delegates, the advisor produces an action from the action set that acts on the environment instead.

The assumptions on the advisor are: (i) it never falls into traps (or enters corrupt states, meaning states in which the advisor and/or the input channels were compromised and no longer provide reliable rewards or advice) and (ii) it has at least some small probability of taking the optimal action (instead, we could assume that there is some set of "good enough" actions s.t. the advisor has at least some small probability of taking such an action, and reformulate the guarantee w.r.t. the best policy comprised of "good enough" actions rather than the fully optimal policy).

Under these assumptions, we have a regret bound (the particular algorithm I use to prove the bound is Thompson sampling where (i) the agent delegates when it's not sure that an action is safe and (ii) hypotheses with low probability are discarded), meaning that as the geometric time discount constant goes to 1, the agent achieves nearly optimal expected utility.
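A highly simplified sketch of this algorithm for a finite set of tabular hypotheses (the data layout, thresholds and helper names below are my own illustration, not the construction used in the actual proofs):

```python
import numpy as np

def value_iteration(P, R, gamma, iters=200):
    """Optimal Q-values of a finite MDP with transitions P[s, a, s'] and rewards R[s, a]."""
    Q = np.zeros(R.shape)
    for _ in range(iters):
        Q = R + gamma * P @ Q.max(axis=1)
    return Q

def delegative_thompson_step(hypotheses, posterior, state, gamma,
                             discard_eps=0.01, rng=np.random):
    """One decision step. hypotheses: list of dicts with keys 'P', 'R' and
    'trap' (boolean mask over states); posterior: array of probabilities.
    Returns ("act", action) or ("delegate", None)."""
    # (ii) discard hypotheses whose posterior probability is too low
    keep = posterior >= discard_eps
    live = [h for h, k in zip(hypotheses, keep) if k]
    probs = posterior[keep] / posterior[keep].sum()

    # Thompson sampling: sample one live hypothesis and take its greedy action...
    h = live[rng.choice(len(live), p=probs)]
    action = int(value_iteration(h['P'], h['R'], gamma)[state].argmax())

    # ...(i) unless some live hypothesis says the action may lead into a trap,
    # in which case the agent is not sure the action is safe and delegates.
    for g in live:
        if g['P'][state, action] @ g['trap'].astype(float) > 0:
            return "delegate", None
    return "act", action
```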

Here I generalize the setup to allow a small probability of losing long-term value or entering a corrupt state when following the advisor policy. This is important because the aligned AGI is supposed to, among other things, block any unaligned AGI, and this is something that the advisor cannot do on its own. I envision more ways to further "soften" the assumptions; in particular, we can use the same method as in quantilizers, and argue that if the advisor policy loses long-term value very slowly then any policy with sufficiently small Renyi divergence w.r.t. the advisor policy also loses long-term value at most slowly. The agent should then be able to converge to the optimal policy under the Renyi divergence constraint. (Intuitively, we constrain the agent to behavior that is sufficiently "human-like".) This should also have the benefit of a continuous rather than discrete model of corruption (that covers e.g. gradual value drift).
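One way the quantilizer-style step alluded to here could be spelled out, under the assumption that the long-term value loss can be written as the expectation of a nonnegative cost $C$ over trajectories (the notation is mine):

$$\mathbb{E}_{\pi}[C] \;\le\; \exp\!\big(D_\infty(\pi \,\|\, \pi^{\mathrm{adv}})\big)\, \mathbb{E}_{\pi^{\mathrm{adv}}}[C], \qquad D_\infty(\pi \,\|\, \pi^{\mathrm{adv}}) \;=\; \log \sup_{\tau} \frac{\Pr_{\pi}[\tau]}{\Pr_{\pi^{\mathrm{adv}}}[\tau]},$$

so if following the advisor policy loses value at rate at most $\epsilon$, any policy within a bounded infinity-Renyi divergence of it loses value at rate at most $e^{D_\infty} \epsilon$.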

Comment by vanessa-kosoy on Value Learning is only Asymptotically Safe · 2019-04-21T12:44:35.010Z · score: 6 (3 votes) · LW · GW

It is true that going beyond finite MDPs (more generally, environments satisfying sufficient ergodicity assumptions) causes problems but I believe it is possible to overcome them. For example, we can assume that there is a baseline policy (the advisor policy in case of DRL) s.t. the resulting trajectory in state space never (up to catastrophes) diverges from the optimal trajectory (or, less ambitiously, some "target" trajectory) further than some "distance" (measured in terms of the time it would take to go back to the optimal trajectory).

Comment by vanessa-kosoy on Value Learning is only Asymptotically Safe · 2019-04-21T12:19:23.844Z · score: 4 (2 votes) · LW · GW

Delegative Reinforcement Learning is safe not just asymptotically. See also this, this and (once it's uploaded) the upcoming paper for SafeML 2019. In addition, there are directions for further improvement here in the "value learning protocols" sections.

Comment by vanessa-kosoy on What failure looks like · 2019-04-12T14:24:34.832Z · score: 6 (3 votes) · LW · GW

I agree that robot armies are an important aspect of part II.

Why? I can easily imagine an AI takeover that works mostly through persuasion/manipulation, with physical elimination of humans coming only as an "afterthought" when AI is already effectively in control (and produced adequate replacements for humans for the purpose of physically manipulating the world). This elimination doesn't even require an "army", it can look like everyone agreeing to voluntary "euthanasia" (possibly not understanding its true meaning). To the extent physical force is involved, most of it might be humans against humans.

Comment by vanessa-kosoy on Two Neglected Problems in Human-AI Safety · 2019-04-09T15:40:12.107Z · score: 5 (2 votes) · LW · GW

I certainly agree that humans might have critical failures of judgement in situations that are outside of some space of what is "comprehensible". This is a special case of what I called "corrupt states" when talking about DRL, so I don't feel like I have been ignoring the issue. Of course there is a lot more work to be done there (and I have some concrete research directions how to understand this better).

Comment by vanessa-kosoy on Rule Thinkers In, Not Out · 2019-03-18T21:27:33.841Z · score: 4 (2 votes) · LW · GW

Oh, I just use the pronoun "ey" for everyone. IMO the entire concept of gendered pronouns is net harmful.

Comment by vanessa-kosoy on Blegg Mode · 2019-03-15T21:16:07.055Z · score: 2 (2 votes) · LW · GW

Hmm. Why would the entity feel disrespected by how many clusters the workers use? I actually am aware that this is an allegory for something else. Moreover, I think that I disagree with you about the something else (although I am not sure, since I am not entirely sure what your position on the something else is). Which is to say, I think that this allegory misses crucial aspects of the original situation and loses the crux of the debate.

Comment by vanessa-kosoy on Blegg Mode · 2019-03-14T20:52:53.921Z · score: 8 (4 votes) · LW · GW

Alright, but then you need some (at least informal) model of why computationally bounded agents need categories. Instead, your argument seems to rely purely on the intuition of your fictional character ("you notice that... they seem to occupy a third category in your ontology of sortable objects").

Also, you seem to assume that categories are non-overlapping. You write "you don't really put them in the same mental category as bleggs". What does it even mean, to put two objects in the same or not the same category? Consider a horse and a cow. Are they in the same mental category? Both are in the categories "living organisms", "animals", "mammals", "domesticated mammals". But, they are different species. So, sometimes you put them in the same category, sometimes you put them in different categories. Are "raven" and "F16 aircraft" in the same category? They are if your categories are "flying objects" vs. "non-flying objects", but they aren't if your categories are "animate" vs. "non-animate".

Moreover, you seem to assume that categories are crisp rather than fuzzy, which is almost never the case for categories that people actually use. How many coins does it take to make a "pile" of coins? Is there an exact number? Is there an exact age when a person gets to be called "old"? If you take a table made out of a block of wood, and start to gradually deform its shape until it becomes perfectly spherical, is there an exact point when it is no longer called a "table"? So, "rubes" and "bleggs" can be fuzzy categories, and the anomalous objects are in the gray area that defies categorization. There's nothing wrong with that.

If we take this rube/blegg factory thought experiment seriously, then what we need to imagine is the algorithm (instructions) that the worker in the factory executes. Then you can say that the relevant "categories" (in the context of the factory, and in that context only) are the vertices in the flow graph of the algorithm. For example, the algorithm might be a table that specifies how to score each object (blue +5 points, egg-shaped +10 points, furry +1 point...) and a threshold which says what the score should be to put it in a given bin. Then there are essentially only two categories. Another algorithm might be "if the object passes test X, put it in the rube bin; if the object passes test Y, put it in the blegg bin; if the object passes neither test, put it in the palladium scanner and sort according to that". Then, you have approximately seven categories: "regular rube" (passed test X), "regular blegg" (passed test Y), "irregular object" (failed both tests), "irregular rube" (failed both tests and found to contain enough palladium), "irregular blegg" (failed both tests and found to contain not enough palladium), "rube" (anything put in the rube bin) and "blegg" (anything put in the blegg bin). But in any case, the categorization would depend on the particular trade-offs that the designers of the production line made (depending on things like how expensive it is to run the palladium scanner), rather than on immutable Platonic truths about the nature of the objects themselves.
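As an illustration of the first algorithm in this description (the weights are the toy numbers from the comment; the threshold and representation are my own arbitrary choices):

```python
def sort_object(obj):
    """Toy scoring-table sorter for the factory thought experiment."""
    score = 0
    if obj.get("blue"):
        score += 5
    if obj.get("egg_shaped"):
        score += 10
    if obj.get("furry"):
        score += 1
    # A single score-plus-threshold rule induces exactly two "categories":
    # whatever ends up in the blegg bin and whatever ends up in the rube bin.
    return "blegg bin" if score >= 10 else "rube bin"
```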

Then again, I'm not entirely sure whether we are really disagreeing or just formulating the same thing in different ways?

Comment by vanessa-kosoy on Blegg Mode · 2019-03-13T19:02:44.101Z · score: 3 (2 votes) · LW · GW

I don't understand what point are you trying to make.

Presumably, each object has observable properties $x$ and unobservable properties $y$. The utility of putting an object into bin A is $U_A(x, y)$ and the utility of putting it into bin B is $U_B(x, y)$. Therefore, your worker should put an object into bin A if and only if $\mathbb{E}[U_A(x, y) \mid x] > \mathbb{E}[U_B(x, y) \mid x]$.

That's it. Any "categories" you introduce here are at best helpful heuristics, with no deep philosophical significance.
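A minimal sketch of this decision rule, estimating the conditional expectations by Monte Carlo (the function names and sampling interface are hypothetical):

```python
import numpy as np

def choose_bin(x, sample_y_given_x, U_A, U_B, n_samples=1000):
    """Put the object in bin A iff E[U_A(x, y) | x] > E[U_B(x, y) | x], where
    y is drawn from the worker's posterior over the unobservable properties
    given the observable ones."""
    ys = [sample_y_given_x(x) for _ in range(n_samples)]
    expected_a = np.mean([U_A(x, y) for y in ys])
    expected_b = np.mean([U_B(x, y) for y in ys])
    return "A" if expected_a > expected_b else "B"
```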

Comment by vanessa-kosoy on So You Want to Colonize The Universe Part 4: Velocity Changes and Energy · 2019-03-02T18:47:30.377Z · score: 1 (1 votes) · LW · GW

Makes perfect sense, forget I asked.

Comment by vanessa-kosoy on So You Want to Colonize The Universe Part 4: Velocity Changes and Energy · 2019-03-02T17:27:29.660Z · score: 1 (1 votes) · LW · GW

I'm confused. Wouldn't it mean that even without this trick laser sail is only for nearby missions?

Comment by vanessa-kosoy on 'This Waifu Does Not Exist': 100,000 StyleGAN & GPT-2 samples · 2019-03-01T21:45:02.052Z · score: 4 (3 votes) · LW · GW

Amusingly, one of the sample texts contained the Japanese "一生える山の図の彽をふるほゥていしまうもようざないかった" which google translate renders as "I had no choice but to wear a grueling of a mountain picture that would last me" (no, it doesn't make sense in context).

Comment by vanessa-kosoy on So You Want to Colonize The Universe Part 4: Velocity Changes and Energy · 2019-03-01T21:38:03.281Z · score: 2 (2 votes) · LW · GW

Some options you didn't mention (maybe on purpose because they are less efficient?):

  • Cheating the rocket equation using pulse propulsion
  • Braking a laser sail spaceship by having a mirror that detaches and reflects the laser back to the spaceship but from the opposite direction (don't remember whose idea that is)

Also, your rocket equation is non-relativistic, although IIRC the relativistic equation is the same just with change in rapidity instead of change in velocity.
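For reference, a standard way to state the relativistic form (assuming constant exhaust velocity $v_e$; not taken from the post):

$$\Delta v = v_e \ln\frac{m_0}{m_1} \quad\longrightarrow\quad \Delta w = v_e \ln\frac{m_0}{m_1}, \qquad w \equiv c\,\operatorname{artanh}(v/c),$$

i.e. the Tsiolkovsky form is unchanged, with rapidity (in velocity units) in place of velocity.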

Comment by vanessa-kosoy on So You Want To Colonize The Universe Part 3: Dust · 2019-03-01T19:04:06.866Z · score: 3 (3 votes) · LW · GW

Usual neutrinos or dark matter won't work, but if we go to the extremely speculative realm, there might be some "hidden sector" of matter that doesn't normally interact with ordinary matter but allows complex structure. Producing it and doing anything with it would be very hard, but not necessarily impossible.

Comment by vanessa-kosoy on So You Want To Colonize The Universe Part 3: Dust · 2019-03-01T19:00:54.515Z · score: 2 (2 votes) · LW · GW

This is extremely speculative, but one way it could be possible to build very sturdy probes is if there were a phase of matter whose binding energies were typical of the nuclear forces (or some other, hitherto unknown, strong force) rather than of the electromagnetic force, as in ordinary matter. Strangelets are one candidate.

Comment by vanessa-kosoy on So You Want to Colonize the Universe Part 2: Deep Time Engineering · 2019-03-01T17:15:43.056Z · score: 5 (5 votes) · LW · GW

Instead of delivering a vessel that can support earth-based life for hundreds of millions of years, we just have to deliver about 100 kg of Von Neumann probes and stored people, which build more of themselves.

We don't necessarily need stored people. The probe can unfold into basic infrastructure + receiver, and the people can be transmitted by some communication channel (radio, laser or something more exotic).

Comment by vanessa-kosoy on So You Want to Colonize The Universe · 2019-03-01T17:06:42.890Z · score: 6 (3 votes) · LW · GW

I think that the Landauer limit argument was debunked.

Comment by vanessa-kosoy on Rule Thinkers In, Not Out · 2019-02-28T21:02:44.642Z · score: 9 (2 votes) · LW · GW

It seems that Einstein was just factually wrong, since ey did not expect the EPR paradox to be empirically confirmed (which only happened after eir death), but intended it as a reductio ad absurdum. Of course, thinking of the paradox did contribute to our understanding of QM, in which sense Einstein played a positive role here, paradoxically.

Comment by vanessa-kosoy on Rule Thinkers In, Not Out · 2019-02-27T14:54:28.311Z · score: 11 (7 votes) · LW · GW

Einstein seems to have batted a perfect 1000

Did ey? As far as I know, ey continued to resist quantum mechanics (in its ultimate form) for eir entire life, and eir attempts to create a unified field theory led to nothing (or almost nothing).

Comment by vanessa-kosoy on Some disjunctive reasons for urgency on AI risk · 2019-02-17T21:46:01.851Z · score: 12 (4 votes) · LW · GW

I think that this problem is in the same broad category as "invent general relativity" or "prove the Poincare conjecture". That is, for one thing quantity doesn't easily replace talent (you couldn't invent GR just as easily with 50 mediocre physicists instead of one Einstein), and, for another thing, the work is often hard to parallelize (50 Einsteins wouldn't invent GR 50 times as fast). So, you can't solve it just by spending lots of resources in a short time frame.

Comment by vanessa-kosoy on Some disjunctive reasons for urgency on AI risk · 2019-02-16T22:40:55.314Z · score: 14 (5 votes) · LW · GW

Where do you draw the line between "the people in that industry will have the time and skill to notice the problems and start working on them" and what is happening now, which is: some people in the industry (at least, you can't argue DeepMind and OpenAI are not in the industry) noticed there is a problem and started working on it? Is it an accurate representation of the no-foom position to say that we should only start worrying when we literally observe a superhuman AI that is trying to take over the world? What if AI takes years to gradually push humans to the sidelines, but the process is unstoppable because this time is not enough to solve alignment from scratch and the economic incentives to keep employing and developing AI are too strong to fight against?