Comment by capybaralet on Aligning a toy model of optimization · 2019-07-05T00:13:03.548Z · score: 1 (1 votes) · LW · GW

What is pi here?

Comment by capybaralet on Let's talk about "Convergent Rationality" · 2019-07-04T23:59:08.577Z · score: 1 (1 votes) · LW · GW

I basically agree with your main point (and I didn't mean to suggest that it "[makes] sense to just decide whether CRT seems more true or false and then go from there").

But I think it's also suggestive of an underlying view that I disagree with, namely: (1) "we should aim for high-confidence solutions to AI-Xrisk". I think this is a good heuristic, but from a strategic point of view, what we should be doing is closer to: (2) "aim to maximize the rate of Xrisk reduction".

Practically speaking, a big implication of favoring (2) over (1) is giving a relatively higher priority to research aimed at making unsafe-looking approaches (e.g. reward modelling + DRL) safer (in expectation).



Comment by capybaralet on False assumptions and leaky abstractions in machine learning and AI safety · 2019-06-30T18:36:35.672Z · score: 5 (3 votes) · LW · GW

IIUC, yes, that's basically what I was trying to say about embedded agency.


Comment by capybaralet on Let's talk about "Convergent Rationality" · 2019-06-28T15:24:10.237Z · score: 1 (1 votes) · LW · GW

RE "Is there any reason to expect the drift to systematically move in a certain direction?"

For bit-flips, evolution should select, among multiple systems, for those that get lucky and are bit-flipped towards higher fitness, but it shouldn't directly push a given system in that direction.

For self-modification ("reprogramming itself"), I think there are a lot of arguments for CRT (e.g. the decision theory self-modification arguments), but they all seem to carry some implicit assumptions about the inner-workings of the AI.

False assumptions and leaky abstractions in machine learning and AI safety

2019-06-28T04:54:47.119Z · score: 23 (6 votes)
Comment by capybaralet on Open question: are minimal circuits daemon-free? · 2019-06-28T04:47:53.052Z · score: 2 (2 votes) · LW · GW

I think it's relevant for either kind (actually, I'm not sure I like the distinction, or find it particularly relevant).

If there aren't other daemons to compete with, then 98% is sufficient for survival, so why not use the extra 1% to begin pursuing your own agenda immediately and covertly? This seems to be how principal-agent problems often play out in real life with humans.


Comment by capybaralet on Open question: are minimal circuits daemon-free? · 2019-06-27T17:17:05.411Z · score: 1 (1 votes) · LW · GW

Yeah that seems right. I think it's a better summary of what Paul was talking about.

Comment by capybaralet on Open question: are minimal circuits daemon-free? · 2019-06-27T17:13:18.840Z · score: 1 (1 votes) · LW · GW

A concrete vision:

Suppose the best a system can do without a daemon is 97% accuracy.

The daemon can figure out how to get 99% accuracy.

But in order to outperform other systems, it can just provide 98% accuracy, and use 1% of inputs to pursue its own agenda.

This all happens on-distribution.


If there are multiple daemon-containing systems competing for survival (with selection happening according to accuracy), this might force them to maximize accuracy, instead of just beating a "non-daemon baseline".


Comment by capybaralet on [AN #57] Why we should focus on robustness in AI safety, and the analogous problems in programming · 2019-06-27T05:13:38.524Z · score: 8 (3 votes) · LW · GW

RE meetings with FHI/DeepMind (etc.): I think "aren't familiar with or aren't convinced" is part of it, but there are also political elements to all of this.

In general, I think almost everything that is said publicly about AI-Xrisk has some political element to it. And people's private views are inevitably shaped by their public views somewhat (in expectation) as well.

I find it pretty hard to account for the influence of politics, though. And I probably overestimate it somewhat.


Comment by capybaralet on [AN #58] Mesa optimization: what it is, and why we should care · 2019-06-27T04:34:56.690Z · score: 3 (2 votes) · LW · GW

RE Natasha's work: she's said she thinks that whether the influence criterion leads to more or less altruistic behavior is probably environment-dependent.

Comment by capybaralet on Open question: are minimal circuits daemon-free? · 2019-06-27T04:31:07.033Z · score: 1 (1 votes) · LW · GW

Regarding daemons starting as upstream and becoming downstream...

I think this makes it sound like the goal (call it Y) of the daemon changes, but I usually don't think of it that way.

What changes is that pursuing Y initially leads to rapidly improving performance at X, but then performance on X and on Y pulls apart as the daemon optimizes more heavily for Y.

It seems highly analogous to hacking a learned reward function.

Comment by capybaralet on Open question: are minimal circuits daemon-free? · 2019-06-27T04:23:45.415Z · score: 1 (1 votes) · LW · GW

(Summarizing/reinterpreting the upstream/downstream distinction for myself):

"upstream": has a (relatively benign?) goal which actually helps achieve X

"downstream": doesn't



Comment by capybaralet on Let's talk about "Convergent Rationality" · 2019-06-27T02:37:51.778Z · score: 1 (1 votes) · LW · GW

" I would say that there are some kinds of irrationality that will be self modified or subagented away, and others that will stay. "

^ I agree; this is the point of my analogy with ordinal numbers.

A completely myopic agent (that doesn't directly do planning over future time-steps, but only seeks to optimize its current decision) probably shouldn't make any sub-agents in the first place (except incidentally).

Comment by capybaralet on Let's talk about "Convergent Rationality" · 2019-06-26T20:26:42.993Z · score: 1 (1 votes) · LW · GW

Concerns about inner optimizers seem like a clear example of people arguing for some version of CRT (as I describe it). Would you disagree (why)?

Comment by capybaralet on Let's talk about "Convergent Rationality" · 2019-06-26T20:25:05.677Z · score: 1 (1 votes) · LW · GW

I have unpublished work on that, and a similar experiment (with myopic reinforcement learning) appears in our paper "Misleading meta-objectives and hidden incentives for distributional shift." ( https://sites.google.com/view/safeml-iclr2019/accepted-papers?authuser=0 )

The environment used in the unpublished work is summarized here: https://docs.google.com/presentation/d/1K6Cblt_kSJBAkVtYRswDgNDvULlP5l7EH09ikP2hK3I/edit?usp=sharing


Comment by capybaralet on Let's talk about "Convergent Rationality" · 2019-06-25T17:03:17.638Z · score: 3 (2 votes) · LW · GW

That's a reasonable position, but I think the reality is that we just don't know. Moreover, it seems possible to build goal-directed agents that don't become hyper-rational by (e.g.) restricting their hypothesis space. Lots of potential for deconfusion, IMO.

Comment by capybaralet on Let's talk about "Convergent Rationality" · 2019-06-14T17:08:03.954Z · score: 1 (1 votes) · LW · GW

See my response to Rohin, below.

I'm potentially worried about both; let's not make a false dichotomy!

Comment by capybaralet on Let's talk about "Convergent Rationality" · 2019-06-14T17:07:11.406Z · score: 5 (3 votes) · LW · GW

I definitely want to distinguish CRT from arguments that humans will deliberately build goal-directed agents. But let me emphasize: I think incentives for humans to build goal-directed agents are a larger and more important source of risk than CRT.

RE VVMUT being vacuous: this is a good point (and also implied by the caveat from the reward modeling paper). But I think that in practice we can meaningfully identify goal-directed agents and infer their rationality/bias "profile", as suggested by your work ( http://proceedings.mlr.press/v97/shah19a.html ), and Laurent Orseau's ( https://arxiv.org/abs/1805.12387 ).

Let's talk about "Convergent Rationality"

2019-06-12T21:53:35.356Z · score: 23 (6 votes)
Comment by capybaralet on Let's split the cake, lengthwise, upwise and slantwise · 2019-05-15T18:56:25.355Z · score: 2 (2 votes) · LW · GW

It's worth mentioning that (FWICT) these things are called bargaining problems in the literature: https://en.wikipedia.org/wiki/Bargaining_problem

Comment by capybaralet on Asymptotically Unambitious AGI · 2019-04-01T03:42:25.538Z · score: 1 (1 votes) · LW · GW

Yes, that's basically what I mean. I think I'm trying to refer to the same issue that Paul mentioned here: https://www.lesswrong.com/posts/pZhDWxDmwzuSwLjou/asymptotically-benign-agi#ZWtTvMdL8zS9kLpfu

Comment by capybaralet on Asymptotically Unambitious AGI · 2019-04-01T03:18:13.818Z · score: 1 (1 votes) · LW · GW

I like that you emphasize and discuss the need for the AI to not believe that it can influence the outside world, and cleanly distinguish this from it actually being able to influence the outside world. I wonder if you can get any of the benefits here without needing the box to actually work (i.e. can you just get the agent to believe it does? and is that enough for some form/degree of benignity?)

Comment by capybaralet on Asymptotically Unambitious AGI · 2019-04-01T02:11:38.623Z · score: 1 (1 votes) · LW · GW

This doesn't seem to address what I view as the heart of Joe's comment. Quoting from the paper:

"Now we note that µ* is the fastest world-model for on-policy prediction, and it does not simulate post-episode events until it has read access to the random action".

It seems like simulating *post-episode* events in particular would be useful for predicting the human's responses, because they will be simulating post-episode events when they choose their actions. Intuitively, it seems like we *need* to simulate post-episode events to have any hope of guessing how the human will act. I guess the obvious response is that we can instead simulate the internal workings of the human in detail, and thus uncover their simulation of post-episode events (as a past event). That seems correct, but also a bit troubling (again, probably just for "revealed preferences" reasons, though).

Moreover, I think in practice we'll want to use models that make good, but not perfect, predictions. That means we trade off accuracy against description length, and I think this makes modeling the outside world (instead of the human's model of it) potentially more appealing, at least in some cases.

Comment by capybaralet on Asymptotically Unambitious AGI · 2019-04-01T01:55:31.155Z · score: 1 (1 votes) · LW · GW

I'm calling this the "no grue assumption" (https://en.wikipedia.org/wiki/New_riddle_of_induction).

My concern here is that this assumption might be False, even in a strong sense of "There is no such U".

Have you proven the existence of such a U? Do you agree it might not exist? It strikes me as potentially running up against issues of NFL / self-reference.

Comment by capybaralet on Asymptotically Unambitious AGI · 2019-04-01T01:51:13.044Z · score: 1 (1 votes) · LW · GW

Also, it's worth noting that this assumption (or rather, Lemma 3) also seems to preclude BoMAI optimizing anything *other* than revealed preferences (which others have noted seems problematic, although I think it's definitely out of scope).

Comment by capybaralet on Asymptotically Unambitious AGI · 2019-04-01T01:46:13.988Z · score: 2 (2 votes) · LW · GW

Still wrapping my head around the paper, but...

1) It seems too weak: in the motivating scenario of Figure 3, isn't it the case that "what the operator inputs" and "what's in the memory register after 1 year" are "historically distributed identically"?

2) It seems too strong: aren't real-world features and/or world-models "dense"? Shouldn't I be able to find features arbitrarily close to F*? If I can, doesn't that break the assumption?

3) Also, I don't understand what you mean by: "it's on policy behavior [is described as] simulating X". It seems like you (rather/also) want to say something like "associating reward with X"?

Comment by capybaralet on Asymptotically Unambitious AGI · 2019-04-01T00:53:41.153Z · score: 2 (2 votes) · LW · GW

Just exposition-wise, I'd front-load pi^H and pi^* when you define pi^B, and also clarify then that pi^B considers human exploration as part of its policy.

Comment by capybaralet on Asymptotically Unambitious AGI · 2019-04-01T00:52:30.094Z · score: 1 (1 votes) · LW · GW

" This result is independently interesting as one solution to the problem of safe exploration with limited oversight in nonergodic environments, which [Amodei et al., 2016] discus "

^ This wasn't super clear to me.... maybe it should just be moved somewhere else in the text?

I'm not sure what you're saying is interesting here. I guess it's the same thing I found interesting, which is that you can get sufficient (and safe-as-a-human) exploration using the human-does-the-exploration scheme you propose. Is that what you mean to refer to?

Comment by capybaralet on Asymptotically Unambitious AGI · 2019-04-01T00:19:30.200Z · score: 1 (1 votes) · LW · GW

Maybe "promotional of" would be a good phrase for this.

Comment by capybaralet on Asymptotically Unambitious AGI · 2019-04-01T00:16:01.463Z · score: 1 (1 votes) · LW · GW

ETA: NVM, what you said is more descriptive (I just looked in the appendix).

RE footnote 2: maybe you want to say "monotonically increasing as a function of" rather than "proportional to". (It's a shame there doesn't seem to be a shorter way of saying the first one, which seems to be more often what people actually want to say...)

Comment by capybaralet on X-risks are a tragedies of the commons · 2019-02-14T04:56:34.573Z · score: 1 (1 votes) · LW · GW

I'm not sure. I was trying to disagree with your top level comment :P

Comment by capybaralet on How much can value learning be disentangled? · 2019-02-11T22:53:59.141Z · score: 1 (1 votes) · LW · GW

FWICT, both of your points are actually responses to my point (3).

RE "re: #2", see: https://en.wikipedia.org/wiki/Value_of_information#Characteristics

RE "re: #3", my point was that it doesn't seem like VoI is the correct way for one agent to think about informing ANOTHER agent. You could just look at the change in expected utility for the receiver after updating on some information, but I don't like that way of defining it.

Comment by capybaralet on X-risks are a tragedies of the commons · 2019-02-11T22:43:34.891Z · score: 1 (1 votes) · LW · GW

I think it is rivalrous.

Xrisk mitigation isn't the resource; risky behavior is the resource. If you engage in more risky behavior, then I can't engage in as much risky behavior without pushing us over into a socially unacceptable level of total risky behavior.

Comment by capybaralet on X-risks are a tragedies of the commons · 2019-02-11T22:38:38.288Z · score: 5 (3 votes) · LW · GW

If there is a cost to reducing Xrisk (which I think is a reasonable assumption), then there will be an incentive to defect, i.e. to underinvest in reducing Xrisk. There's still *some* incentive to prevent Xrisk, but to some people everyone dying is not much worse than just them dying.
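To make the incentive to defect concrete, here's a minimal two-actor payoff sketch (all numbers are made up for illustration): risky behavior yields a private benefit, but raises a shared catastrophe probability whose cost falls on everyone.

```python
# Toy sketch, made-up numbers: each actor chooses risky (1) or safe (0)
# behavior. Risky behavior gives a private benefit but raises a *shared*
# catastrophe probability, whose cost both actors bear.
benefit, hazard_per_risky_actor, catastrophe_cost = 3.0, 0.2, 10.0

def expected_payoff(my_choice, other_choice):
    shared_risk = hazard_per_risky_actor * (my_choice + other_choice)
    return benefit * my_choice - catastrophe_cost * shared_risk

for me in (0, 1):
    for other in (0, 1):
        print(f"me={me}, other={other}: {expected_payoff(me, other):+.1f}")
# Being risky is better for me whatever the other actor does (+1 either way),
# yet (risky, risky) gives each of us -1.0 while (safe, safe) gives 0.0:
# the usual tragedy-of-commons incentive to underinvest in risk reduction.
```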

Comment by capybaralet on The role of epistemic vs. aleatory uncertainty in quantifying AI-Xrisk · 2019-02-11T22:24:45.850Z · score: -1 (2 votes) · LW · GW

1) Yep, independence.

2) Seems right as well.

3) I think it's important to consider "risk per second" (see the toy calculation after this list), because

(i) I think many AI systems could eventually become dangerous, just not on reasonable time-scales.

(ii) I think we might want to run AI systems which have the potential to become dangerous for limited periods of time.

(iii) If most of the risk is far in the future, we can hope to become more prepared in the meantime.
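Here's the toy calculation I have in mind (the per-second probability is purely hypothetical, just to show the shape of the curve): an independent per-second risk compounds over deployment time, so a system can be effectively safe over an hour and very dangerous over a century.

```python
# Minimal sketch, hypothetical numbers: with an independent failure
# probability p per second, cumulative risk over T seconds is 1 - (1 - p)**T,
# which is ~ p*T while p*T << 1 and approaches 1 for long enough deployments.
p = 1e-9  # hypothetical per-second risk
for label, seconds in [("1 hour", 3600),
                       ("1 year", 3600 * 24 * 365),
                       ("100 years", 3600 * 24 * 365 * 100)]:
    cumulative_risk = 1 - (1 - p) ** seconds
    print(f"{label:>9}: {cumulative_risk:.2e}")
```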

Comment by capybaralet on Predictors as Agents · 2019-02-10T22:39:31.662Z · score: 2 (2 votes) · LW · GW

Whether or not this happens depends on the learning algorithm. Let's assume an IID setting. Then an algorithm that evaluates many random parameter settings and chooses the one that gives the best performance would have this effect. But a gradient-based learning algorithm wouldn't necessarily, since it only aims to improve its predictions locally (so what you say in the ETA is more accurate, **in this case**, I think).
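As a toy illustration of the selection-based case (the linear influence model and all numbers here are assumptions of mine, not anything from the post): when an announced prediction shifts the outcome, selecting whichever candidate predictor scores best on realized accuracy favors the self-fulfilling prediction.

```python
import numpy as np

# Assumed toy influence model: announcing prediction p makes the realized
# outcome equal to base + alpha * p (numbers made up for illustration).
base, alpha = 1.0, 0.5

def realized(p):
    return base + alpha * p

def sq_error(p):
    return (p - realized(p)) ** 2

# A predictor that ignores its own influence predicts `base`; the
# self-fulfilling prediction solves p = base + alpha * p.
print(sq_error(base))                 # 0.25 -> penalized for its own influence
print(sq_error(base / (1 - alpha)))   # 0.0  -> perfectly self-fulfilling

# "Evaluate many random parameter settings and keep the most accurate one":
rng = np.random.default_rng(0)
candidates = rng.uniform(0.0, 5.0, size=10_000)
best = candidates[np.argmin(sq_error(candidates))]
print(best)                           # ~2.0 == base / (1 - alpha)
```

A gradient step in this toy setting, by contrast, only nudges the prediction toward the label realized under the current prediction; it isn't directly rewarded for making its predictions come true.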

Also, I just wanted to mention that Stuart Armstrong's paper "Good and safe uses of AI oracles" discusses self-fulfilling prophecies as well; Stuart provides a way of training a predictor that won't fall victim to such effects (just don't reveal its predictions when training). But then it also fails to account for the effect its predictions actually have, which can be a source of irreducible error... The example is a (future) stock-price predictor: making its predictions public makes them self-refuting to some extent, as they influence market actors' decisions.

Comment by capybaralet on X-risks are a tragedies of the commons · 2019-02-07T19:42:31.506Z · score: 5 (3 votes) · LW · GW

I dunno... I think describing them as a tragedy of the commons can help people understand why the problems are challenging and deserving of attention.

X-risks are a tragedies of the commons

2019-02-07T02:48:25.825Z · score: 9 (5 votes)
Comment by capybaralet on Thoughts on Ben Garfinkel's "How sure are we about this AI stuff?" · 2019-02-07T02:10:29.396Z · score: 22 (6 votes) · LW · GW

RE Sarah: Longer timelines don't change the picture that much, in my mind. I don't find this article to be addressing the core concerns. Can you recommend one that's more focused on "why AI-Xrisk isn't the most important thing in the world"?

RE Robin Hanson: I don't really know much of what he thinks, but IIRC his "urgency of AI depends on FOOM" argument was not compelling.

What I've noticed is that critics are often working from very different starting points, e.g. being unwilling to estimate probabilities of future events, using common-sense rather than consequentialist ethics, etc.

My use of the phrase "Super-Human Feedback"

2019-02-06T19:11:11.734Z · score: 12 (7 votes)

Thoughts on Ben Garfinkel's "How sure are we about this AI stuff?"

2019-02-06T19:09:20.809Z · score: 25 (12 votes)
Comment by capybaralet on How much can value learning be disentangled? · 2019-01-31T19:58:48.431Z · score: 1 (1 votes) · LW · GW

IMO, VoI is also not a sufficient criterion for defining manipulation... I'll list a few problems I have with it, OTTMH:

1) It seems to reduce it to "providing misinformation, or providing information to another agent that is not maximally/sufficiently useful for them (in terms of their expected utility)". An example (due to Mati Roy) of why this doesn't seem to match our intuition is: what if I tell someone something true and informative that serves (only) to make them sadder? That doesn't really seem like manipulation (although you could make a case for it).

2) I don't like the "maximally/sufficiently" part; maybe my intuition is misleading, but manipulation seems like a qualitative thing to me. Maybe we should just constrain VoI to be positive?

3) Actually, it seems weird to talk about VoI here; VoI is prospective and subjective... it treats an agent's beliefs as real and asks how much value they should expect to get from samples or perfect knowledge, assuming these samples or the ground truth would be distributed according to their beliefs; this makes VoI strictly non-negative. But when we're considering whether to inform an agent of something, we might recognize that certain information we'd provide would actually be net negative (see my top level comment for an example). Not sure what to make of that ATM...
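For reference, the standard quantity I have in mind for (perfect-information) VoI is:

\[
\mathrm{VoI}
\;=\;
\mathbb{E}_{x \sim p(x)}\!\Big[\max_{a} \, \mathbb{E}\big[U \mid a, x\big]\Big]
\;-\;
\max_{a} \, \mathbb{E}_{x \sim p(x)}\Big[\mathbb{E}\big[U \mid a, x\big]\Big]
\;\ge\; 0 .
\]

Both expectations are taken under the agent's own beliefs p(x), which is exactly what makes the quantity prospective, subjective, and never negative; the net-negative cases only show up once you evaluate the receiver's resulting decisions under a different (e.g. more accurate) model than their own.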

Comment by capybaralet on The role of epistemic vs. aleatory uncertainty in quantifying AI-Xrisk · 2019-01-31T19:46:50.359Z · score: 1 (1 votes) · LW · GW

Agree, good point. I'd say aleatoric risk is necessary to produce compounding, but not sufficient; though maybe I'm still looking at this the wrong way.

Comment by capybaralet on How much can value learning be disentangled? · 2019-01-31T19:41:05.045Z · score: 3 (2 votes) · LW · GW

Haha no not at all ;)

I'm not actually trying to recruit people to work on that, just trying to make people aware of the idea of doing such projects. I'd suggest it to pretty much anyone who wants to work on AI-Xrisk without diving deep into math or ML.

The role of epistemic vs. aleatory uncertainty in quantifying AI-Xrisk

2019-01-31T06:13:35.321Z · score: 14 (8 votes)
Comment by capybaralet on How much can value learning be disentangled? · 2019-01-31T04:58:48.703Z · score: 3 (2 votes) · LW · GW

So I want to emphasize that I'm only saying it's *plausible* that *there exists* a specification of "manipulation". This is my default position on all human concepts. I also think it's plausible that there does not exist such a specification, or that the specification is too complex to grok, or that there end up being multiple conflicting notions we conflate under the heading of "manipulation". See this post for more.

Overall, I understand and appreciate the issues you're raising, but I think all this post does is show that naive attempts to specify "manipulation" fail; I think it's quite difficult to argue compellingly that no such specification exists ;)

"It seems that the only difference between manipulation and explanation is whether we end up with a better understanding of the situation at the end. And measuring understanding is very subtle."

^ Actually, I think "ending up with a better understanding" (in the sense I'm reading it) is probably not sufficient to rule out manipulation; what I mean is that I can do something which actually improves your model of the world, but leads you to follow a policy with worse expected returns. A simple example would be if you are doing Bayesian updating and your prior over returns for two bandit arms is P(r|a_1) = N(1,1), P(r|a_2) = N(2, 1), while the true returns are 1/2 and 2/3 (respectively). So your current estimates are optimistic, but they are ordered correctly, and so induce the optimal (greedy) policy.

Now if I give you a bunch of observations of a_2, I will be giving you true information that will lead you to learn, correctly and with high confidence, that the expected reward for a_2 is ~2/3, improving your model of the world. But since you haven't updated your estimate for a_1, you will now prefer a_1 to a_2 (if acting greedily), which is suboptimal. So overall I've informed you with true information, but disadvantaged you nonetheless. I'd argue that if I did this intentionally, it should count as a form of manipulation.
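A quick numerical check of this example (a sketch assuming unit observation noise and a conjugate Normal update, neither of which is specified above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Priors and true mean returns from the example; unit observation noise assumed.
prior_mean = {"a1": 1.0, "a2": 2.0}
prior_var = {"a1": 1.0, "a2": 1.0}
true_mean = {"a1": 0.5, "a2": 2 / 3}
obs_var = 1.0

def posterior_mean(mu0, var0, obs):
    """Conjugate Normal update with known observation noise."""
    precision = 1 / var0 + len(obs) / obs_var
    return (mu0 / var0 + np.sum(obs) / obs_var) / precision

# Greedy on the prior picks a2 -- the truly better arm (2/3 > 1/2).
print(max(prior_mean, key=prior_mean.get))        # a2

# Now give the agent many *true* observations of a2 only.
obs_a2 = rng.normal(true_mean["a2"], np.sqrt(obs_var), size=1000)
updated = dict(prior_mean)
updated["a2"] = posterior_mean(prior_mean["a2"], prior_var["a2"], obs_a2)

# The model of a2 is now accurate (~2/3), but greedy switches to a1,
# whose true return (1/2) is worse: true information, worse induced policy.
print(round(updated["a2"], 3))                    # ~0.667
print(max(updated, key=updated.get))              # a1
```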

Comment by capybaralet on Imitation learning considered unsafe? · 2019-01-09T18:27:26.934Z · score: 1 (1 votes) · LW · GW

I don't think I'd put it that way (although I'm not saying it's inaccurate). See my comments RE "safety via myopia" and "inner optimizers".

Comment by capybaralet on Imitation learning considered unsafe? · 2019-01-09T18:22:42.679Z · score: 6 (3 votes) · LW · GW

Yes, maybe? Elaborating...

I'm not sure how well this fits into the category of "inner optimizers"; I'm still organizing my thoughts on that (aiming to finish doing so within the week...). I'm also not sure that people are thinking about inner optimizers in the right way.

Also, note that the thing being imitated doesn't have to be a human.

OTTMH, I'd say:

  • This seems more general in the sense that it isn't some "subprocess" of the whole system that becomes a dangerous planning process.
  • This seems more specific in the sense that the boldest argument for inner optimizers is, I think, that they should appear in effectively any optimization problem when there's enough optimization pressure.

Comment by capybaralet on Imitation learning considered unsafe? · 2019-01-07T15:55:46.050Z · score: 4 (2 votes) · LW · GW

See the clarifying note in the OP. I don't think this is about imitating humans, per se.

The more general framing I'd use is WRT "safety via myopia" (something I've been working on in the past year). There is an intuition that supervised learning (e.g. via SGD as is common practice in current ML) is quite safe, because it doesn't have any built-in incentive to influence the world (resulting in instrumental goals); it just seeks to yield good performance on the training data, learning in a myopic sense to improve its performance on the present input. I think this intuition has some validity, but also might lead to a false sense of confidence that such systems are safe, when in fact they may end up behaving as if they *do* seek to influence the world, depending on the task they are trained on (ETA: and other details of the learning algorithm, e.g. outer-loop optimization and model choice).

Comment by capybaralet on Assuming we've solved X, could we do Y... · 2019-01-07T15:39:48.509Z · score: 1 (1 votes) · LW · GW

Aha, OK. So I either misunderstand or disagree with that.

I think SHF schemes (at least most examples) have the human as "CEO" with AIs as "advisers", and thus the human can choose to ignore all of the advice and make the decision unaided.

Comment by capybaralet on Imitation learning considered unsafe? · 2019-01-07T15:31:57.847Z · score: 1 (1 votes) · LW · GW

I think I disagree pretty broadly with the assumptions/framing of your comment, although not necessarily the specific claims.

1) I don't think it's realistic to imagine we have "indistinguishable imitation" with an idealized discriminator. It might be possible in the future, and it might be worth considering in order to make intellectual progress, but I'm not expecting it to happen on a deadline. So I'm talking about what I expect might be a practical problem if we actually try to build systems that imitate humans in the coming decades.

2) I wouldn't say "decision theory"; I think that's a bit of a red herring. What I'm talking about is the policy.

3) I'm not sure what link you are trying to make to the "universal prior is malign" ideas. But I'll draw my own connection. I do think the core of the argument I'm making results from an intuitive idea of what a simplicity prior looks like, and its propensity to favor something more like a planning process over something more like a lookup table.

Imitation learning considered unsafe?

2019-01-06T15:48:36.078Z · score: 9 (4 votes)
Comment by capybaralet on Assuming we've solved X, could we do Y... · 2019-01-06T15:13:06.375Z · score: 1 (1 votes) · LW · GW

OK, so it sounds like your argument why SHF can't do ALD is (a specific, technical version of) the same argument that I mentioned in my last response. Can you confirm?

Comment by capybaralet on Conceptual Analysis for AI Alignment · 2018-12-30T21:58:25.522Z · score: 1 (1 votes) · LW · GW

I intended to make that clear in the "Concretely, I imagine a project around this with the following stages (each yielding at least one publication)" section. The TL;DR is: do a literature review of analytic philosophy research on (e.g.) honesty.

Comment by capybaralet on Assuming we've solved X, could we do Y... · 2018-12-30T21:56:30.356Z · score: 1 (1 votes) · LW · GW

Yes, please try to clarify. In particular, I don't understand your "|" notation (as in "S|Output").

I realized that I was a bit confused in what I said earlier. I think it's clear that (proposed) SHF schemes should be able to do at least as well as a human, given the same amount of time, because they have a human "on top" (as "CEO") who can simply ignore all the AI helpers(/underlings).

But now I can also see an argument for why SHF couldn't do ALD, if it doesn't have arbitrarily long to deliberate: there would need to be some parallelism/decomposition in SHF, and that might not work well/perfectly for all problems.

Conceptual Analysis for AI Alignment

2018-12-30T00:46:38.014Z · score: 26 (9 votes)
Comment by capybaralet on Assuming we've solved X, could we do Y... · 2018-12-27T04:42:09.568Z · score: 1 (1 votes) · LW · GW

Regarding the question of how to do empirical work on this topic: I remember there being one thing which seemed potentially interesting, but I couldn't find it in my notes (yet).

RE the rest of your comment: I guess you are taking issue with the complexity theory analogy; is that correct? An example hypothetical TDMP I used is "arbitrarily long deliberation" (ALD), i.e. a single human is allowed as much time as they want to make the decision (I don't think that's a perfect "target" for alignment, but it seems like a reasonable starting point). I don't see why ALD would (even potentially) "do something that can't be approximated by SHF-schemes", since those schemes still have the human making a decision.

"Or was the discussion more about, assuming we have theoretical reasons to think that SHF-schemes can approximate TDMP, how to test it empirically?" <-- yes, IIUC.

Comment by capybaralet on Survey: What's the most negative*plausible cryonics-works story that you know? · 2018-12-19T22:42:54.714Z · score: 1 (1 votes) · LW · GW

I'd suggest separating these two scenarios, based on the way the comments are meant to work according to the OP.

Disambiguating "alignment" and related notions

2018-06-05T15:35:15.091Z · score: 43 (13 votes)

Problems with learning values from observation

2016-09-21T00:40:49.102Z · score: 0 (7 votes)

Risks from Approximate Value Learning

2016-08-27T19:34:06.178Z · score: 1 (4 votes)

Inefficient Games

2016-08-23T17:47:02.882Z · score: 14 (15 votes)

Should we enable public binding precommitments?

2016-07-31T19:47:05.588Z · score: 0 (1 votes)

A Basic Problem of Ethics: Panpsychism?

2015-01-27T06:27:20.028Z · score: -4 (11 votes)

A Somewhat Vague Proposal for Grounding Ethics in Physics

2015-01-27T05:45:52.991Z · score: -3 (16 votes)