Comment by vladimir_nesov on Privacy · 2019-03-19T04:54:30.696Z · score: 5 (3 votes) · LW · GW

Nod. I did actually consider a more accurate version of the comment that said something like "at least one of us is at least somewhat confused about something" [...]

The clarification doesn't address what I was talking about, or else disagrees with my point, so I don't see how that can be characterised with a "Nod". The confusion I refer to is about what the other means, with the question of whether anyone is correct about the world irrelevant. And this confusion is significant on both sides, otherwise a conversation doesn't go off the rails in this way. Paying attention to truth is counterproductive when intended meaning is not yet established, and you seem to be talking about truth, while I was commenting about meaning.

Comment by vladimir_nesov on Karma-Change Notifications · 2019-03-19T04:24:04.728Z · score: 4 (2 votes) · LW · GW

No ancient updates for the previous week, several for this week. An alternative to removing old notifications is to prepend entries in the list with recency, like "13d" or "8y", and sort by it.
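
For concreteness, a rough sketch (my own, not the site's code) of what that could look like; the notification shape and names are assumptions:

```python
# Rough sketch of the suggestion above: label each karma notification with the
# age of the comment that was voted on (e.g. "13d", "8y") and sort by it.
# The (posted_at, text) pair structure is a made-up placeholder, not LW's API.
from datetime import datetime, timezone

def recency_label(posted_at, now=None):
    now = now or datetime.now(timezone.utc)
    days = (now - posted_at).days
    return f"{days // 365}y" if days >= 365 else f"{days}d"

def render_notifications(notifications):
    # notifications: list of (comment_posted_at, text) pairs; newest comments first
    ordered = sorted(notifications, key=lambda n: n[0], reverse=True)
    return [f"{recency_label(posted)} · {text}" for posted, text in ordered]
```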

Comment by vladimir_nesov on Karma-Change Notifications · 2019-03-19T04:16:42.683Z · score: 2 (1 votes) · LW · GW

The reversal test is with respect to the norm, not with respect to ways of handling a fixed norm. So imagine that the norm is the opposite, and see what would happen. People would invent weird things like gauging popularity based on the number of downvotes, or on the sum of absolute values of upvotes and downvotes when there are not enough downvotes. This would work about as well as what happens with the present norm. In that context, the option of "only upvotes" looks funny and pointless, but we can see that it actually isn't, because we can look from the point of view of both possible norms.

When an argument goes through in the world of the opposite status quo, we can transport it to our world. In this case, we obtain the argument that "only downvotes" is not particularly funny and pointless; instead, it's about as serviceable (or about as funny and pointless) as "only upvotes", and neither is very good.

Comment by vladimir_nesov on Privacy · 2019-03-19T03:59:52.945Z · score: 6 (3 votes) · LW · GW

I have some probability on me being the confused one here.

In conversations like this, both sides are confused, that is, neither understands the other's point, so "who is the confused one" is already an incorrect framing. One of you may be factually correct, but that doesn't really matter for making the conversation work; understanding each other is more relevant.

(In this particular case, I think both of you are correct and fail to see what the other means, but Jessica's point is harder to follow and pattern-matches to misleading things, hence the balance of votes.)

Comment by vladimir_nesov on How dangerous is it to ride a bicycle without a helmet? · 2019-03-10T00:32:19.855Z · score: 2 (1 votes) · LW · GW

Sure, for voting the effect on decision making is greater. I'm just suspicious of this whole idea of acausal impact, and moderate observations about effect size don't help with that confusion. I don't think it can apply to voting without applying to other things, so the quantitative distinction doesn't point in a particular direction on correctness of the overall idea.

Comment by vladimir_nesov on How dangerous is it to ride a bicycle without a helmet? · 2019-03-09T23:35:36.030Z · score: 2 (1 votes) · LW · GW

New information argues for a change on the margin, so the new equilibrium is different, though it may not be far away. The arguments are not "cancelled out", but they do only have bounded impact. Compare with charity evaluation in effective altruism: if we take the impact of certain decisions as sufficiently significant, it calls for their organized study, so that the decisions are no longer made based on first impressions. On the other hand, if there is already enough infrastructure for making good decisions of that type, then significant changes are unnecessary.

In the case of acausal impact, large reference classes imply that at least that many people are already affected, so if organized evaluation of such decisions is feasible to set up, it's probably already in place without any need for the acausal impact argument. So actual changes are probably in how you pay attention to info that's already available, not in creating infrastructure for generating better info. On the other hand, a source of info about sizes of reference classes may be useful.

Comment by vladimir_nesov on How dangerous is it to ride a bicycle without a helmet? · 2019-03-09T23:12:49.041Z · score: 6 (3 votes) · LW · GW

The absolute size of a reference class only gives the problem statement for an individual decision some altruistic/paternalistic tilt, which can fail to change it. Greater relative size of a reference class increases the decision's relative importance compared to other decisions, which on the margin should pull some effort away from the other decisions.

That the effective multiplier due to acausal coordination is smaller for non-voting decisions doesn't inform the question of whether the argument applies to non-voting decisions. The argument may be ignored in the decision algorithm only if the reference class is always small or about the same size for different decisions.

Comment by vladimir_nesov on How dangerous is it to ride a bicycle without a helmet? · 2019-03-09T20:03:55.150Z · score: 2 (1 votes) · LW · GW

That influences sizes of reference classes, but at some point the sizes cash out in morally relevant object level decisions.

Comment by vladimir_nesov on How dangerous is it to ride a bicycle without a helmet? · 2019-03-09T06:01:17.121Z · score: 2 (1 votes) · LW · GW

The magnitude depends on the sizes of reference classes, which differ dramatically. So some personal decisions are suddenly much more important than others simply because more people make them, and so you should allocate more resources to deciding those things in particular correctly. Exercise regimen seems like a high acausal impact decision. Another difference is that the goal that the personal decisions pursue shifts from what you want to happen to yourself, to what you want to happen to other people, and this effect increases with population. (Edited the grandparent to express these points more clearly.)

Comment by vladimir_nesov on How dangerous is it to ride a bicycle without a helmet? · 2019-03-09T05:24:23.145Z · score: 8 (3 votes) · LW · GW

In an old post I argued that for acausal coordination reasons it seems as if you should further multiply this value by the number of people in the reference class of those making the decision the same way (discounted by how little you care about strangers vs. yourself). This makes decisions about things that only affect you personally depend on the relative sizes of their reference classes and on total population (greater population shifts the focus of the decisions further away from yourself). Your decision inflicts the micromorts not just on yourself, but on all the people in the reference class, for a proportionally greater total number of micromorts that, given this consideration, turn into actual morts very easily.
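
As a toy illustration of the multiplication (my own sketch; every number is a made-up placeholder, not an estimate from the post):

```python
# Toy arithmetic for the argument above; all numbers are hypothetical
# placeholders, chosen only to show how micromorts scale with the reference class.
micromorts_per_ride = 1.0           # assumed risk of one helmetless ride
reference_class_size = 10_000_000   # assumed number of people deciding "the same way"
care_discount = 0.01                # assumed weight on a stranger relative to yourself

personal_cost = micromorts_per_ride
acausal_cost = micromorts_per_ride * reference_class_size * care_discount

print(personal_cost)  # 1.0 micromort borne by you alone
print(acausal_cost)   # 100000.0 micromorts, i.e. about 0.1 expected deaths
```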

The idea doesn't seem to have taken root. People talk about this argument mostly in the context of voting, where it's comforting for the argument to hold, even though it seems to apply in general, where it demands monstrous responsibility for every tiny little thing. It's very suspicious, but I don't know how to resolve the confusion. Maybe it's just psychologically unrealistic to follow through in almost all cases where the argument applies, despite its normative correctness.

Comment by vladimir_nesov on Karma-Change Notifications · 2019-03-06T01:08:16.092Z · score: 2 (1 votes) · LW · GW

The main thing I like about the 'only downvotes' option is that it's kind of funny and pointless.

I feel the same about the 'only upvotes' option. Applying the reversal test, imagine that most people treat the 'only downvotes' option seriously and suggest that it should be the default, since it agrees with the usual norms of in-person conversation. Downvotes could even measure popularity if there was enough volume; in the meantime, the sum of absolute values of upvotes and downvotes can play that role.

Comment by vladimir_nesov on Karma-Change Notifications · 2019-03-02T13:03:38.081Z · score: 9 (5 votes) · LW · GW

I'd appreciate a feature that restricts the notifications to votes on comments posted at most X months ago (with X configurable in the settings). As it is, I'll mostly get noise: notifications for comments posted 8-11 years ago that I'm not currently learning from. (At least I expect this to be the case, and the first batch of notifications supports this.)
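
A minimal sketch of the requested setting (my own, not the site's implementation; the notification fields are assumed):

```python
# Keep only karma notifications for comments posted within the last X months.
from datetime import datetime, timedelta, timezone

def filter_recent(notifications, max_months=6, now=None):
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=30 * max_months)  # rough month length
    return [n for n in notifications if n["comment_posted_at"] >= cutoff]
```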

Comment by vladimir_nesov on Rule Thinkers In, Not Out · 2019-02-28T13:16:20.892Z · score: 2 (1 votes) · LW · GW

The apparent alternative to the reliable vs. Newton tradeoff when you are the thinker is to put appropriate epistemic status around the hypotheses. So you publish the book on Bible codes or all-powerful Vitamin C, but note in the preface that you remain agnostic about whether any version of the main thesis applies to the real world, pending further development. You build a theory to experience how it looks once it's more developed, and publish it because it was substantial work, even when upon publication you still don't know if there is a version of the theory that works out.

Maybe the theory is just beautiful, and that beauty isn't much diminished by its falsity. So call it philosophical fiction, not a description of this world; the substantial activity of developing the theory and communicating it remains the same, without sacrificing the reliability of your ideas. There might even be a place for an edifice of such fictions that's similar to math in mapping out an aspect of the world that doesn't connect to physical reality for very long stretches. This doesn't seem plausible in current practice, but seems possible in principle, so even calling such activity "fiction" might be misleading; it's more than mere fiction.

I don't think hypersensitive pattern-matching does a lot to destroy the ability to distinguish between an idea that you feel like pursuing and an idea that you see as more reliably confirmed to be applicable in the real world. So you can discuss this distinction when communicating such ideas. Maybe the audience won't listen to the distinction you are making, or won't listen because you are making this distinction, but that's a different issue.

Comment by vladimir_nesov on Why we need a *theory* of human values · 2019-02-18T12:53:25.091Z · score: 2 (1 votes) · LW · GW

Yes, that's the almost fully general counterargument: punt all the problems to the wiser versions of ourselves.

It's not clear what the relevant difference is between then and now, so the argument that it's more important to solve a problem now is as suspect as the argument that the problem should be solved later.

How are we currently in a better position to influence the outcome? If we are, then the reason for being in a better position is a more important feature of the present situation than object-level solutions that we can produce.

Comment by vladimir_nesov on Limiting an AGI's Context Temporally · 2019-02-18T12:44:47.964Z · score: 9 (4 votes) · LW · GW

It could throw a paperclip maximizer at you.

Comment by vladimir_nesov on Open Thread January 2019 · 2019-01-16T03:21:46.524Z · score: 4 (2 votes) · LW · GW

So your decision doesn't just determine the future; it also determines (with high probability) which "you" you are.

Worse. It doesn't change who you are: you are the person being blackmailed. This you know. What you don't know is whether you exist (or ever existed). Whether you ever existed is determined by your decision.

(The distinction from the quoted sentence may matter if you put less value on the worlds of people slightly different from yourself, so you may prefer to ensure your own existence, even in a blackmailed situation, over the existence of the alternative you who is not blackmailed but who is different and so less valuable. This of course involves unstable values, but it motivates the degree-of-existence phrasing of the effect of decisions over the change-in-the-content-of-the-world phrasing, since the latter doesn't let us weigh whole alternative worlds differently.)

Comment by vladimir_nesov on Non-Consequentialist Cooperation? · 2019-01-11T16:55:35.588Z · score: 3 (2 votes) · LW · GW

What you describe does not want to be a thought experiment, because it doesn't abstract away relevant confounders (moral value of human life). The setup in the post is better at being a thought experiment for the distinctions being discussed (moral value of golem's life more clearly depends on a moral framework). In this context, it's misleading to ask whether something should be done instead of whether it's the action that's hedonistic utilitarian / preference utilitarian / autonomy-preserving.

Comment by vladimir_nesov on Two More Decision Theory Problems for Humans · 2019-01-05T11:18:42.464Z · score: 4 (2 votes) · LW · GW

The latter, where "a lot of work" is the kind of thing humanity can manage in subjective centuries. In an indirect normativity design, doing much more work than that should still be feasible, since it's only specified abstractly, to be predicted by an AI, enabling distillation. So we can still reach it, if there is an AI to compute the result. But if there is already such an AI, perhaps the work is pointless, because the AI can carry out the work's purpose in a different way.

Comment by vladimir_nesov on Two More Decision Theory Problems for Humans · 2019-01-05T00:42:12.360Z · score: 6 (3 votes) · LW · GW

Humans are not immediately prepared to solve many decision problems, and one of the hardest problems is formulation of preference for a consequentialist agent. In expanding the scope of well-defined/reasonable decisions, formulating our goals well enough for use in a formal decision theory is perhaps the last milestone, far outside of what can be reached even with a lot of work!

Indirect normativity (after distillation) can make the timeline for reaching this milestone mostly irrelevant, as long as there is sufficient capability to compute the outcome, and amplification is about capability. It's unclear how the scope of reasonable decisions is related to capability within that scope; amplification seems ambiguous between the two, and perhaps the scope of reasonable decisions is just another kind of stuff that can be improved. And it's an aspect of corrigibility to keep the AI within the scope of well-defined decisions.

But with these principles in place, it's unclear if formulating goals for consequentialist agents remains a thing, when instead it's possible to just continue to expand the scope of reasonable decisions and to distill/amplify them.

Comment by vladimir_nesov on An Extensive Categorisation of Infinite Paradoxes · 2018-12-17T17:40:21.250Z · score: 5 (3 votes) · LW · GW

A well-order has a least element in all non-empty subsets, and 1 > 1/2 > 1/4 > ... > 0 has a non-empty subset without a least element, so it's not a well-order.
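
Spelled out (a restatement of the argument, not part of the original comment):

```latex
\[
S = \{\, 2^{-n} : n \in \mathbb{N} \,\} \cup \{0\}, \qquad
T = \{\, 2^{-n} : n \in \mathbb{N} \,\} \subseteq S, \quad T \neq \emptyset .
\]
\[
\text{For every } x \in T,\ \tfrac{x}{2} \in T \text{ and } \tfrac{x}{2} < x,
\text{ so } T \text{ has no least element and } (S, \le) \text{ is not a well-order.}
\]
```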

Comment by vladimir_nesov on Two Neglected Problems in Human-AI Safety · 2018-12-17T11:23:44.542Z · score: 13 (5 votes) · LW · GW

I worry that in the context of corrigibility it's misleading to talk about alignment, and especially about utility functions. If alignment characterizes goals, it presumes a goal-directed agent, but a corrigible AI is probably not goal-directed, in the sense that its decisions are not chosen according to their expected value for a persistent goal. So a corrigible AI won't be aligned (neither will it be misaligned). Conversely, an agent aligned in this sense can't be visibly corrigible, as its decisions are determined by its goals, not orders and wishes of operators. (Corrigible AIs are interesting because they might be easier to build than aligned agents, and are useful as tools to defend against misaligned agents and to build aligned agents.)

In the process of gradually changing from a corrigible AI into an aligned agent, an AI becomes less corrigible in the sense that corrigibility ceases to help in describing its behavior; it stops manifesting. At the same time, goal-directedness starts to dominate the description of its behavior as the AI learns well enough what its goal should be. If during the process of learning its values it's more corrigible than goal-directed, there shouldn't be any surprises like sudden disassembly of its operators on the molecular level.

Comment by vladimir_nesov on Three AI Safety Related Ideas · 2018-12-17T09:37:10.148Z · score: 2 (1 votes) · LW · GW

I thought the point of idealized humans was to avoid problems of value corruption or manipulation

Among other things, yes.

which makes them better than real ones

This framing loses the distinction I'm making. More useful when taken together with their environment, but not necessarily better in themselves. These are essentially real humans that behave better because of environments where they operate and lack of direct influence from the outside world, which in some settings could also apply to the environment where they were raised. But they share the same vulnerabilities (to outside influence or unusual situations) as real humans, which can affect them if they are taken outside their safe environments. And in themselves, when abstracted from their environment, they may be worse than real humans, in the sense that they make less aligned or correct decisions, if the idealized humans are inaccurate predictions of hypothetical behavior of real humans.

Comment by vladimir_nesov on Three AI Safety Related Ideas · 2018-12-16T13:55:44.809Z · score: 2 (1 votes) · LW · GW

If it's too hard to make AI systems in this way and we need to have them learn goals from humans, we could at least have them learn from idealized humans rather than real ones.

My interpretation of how the term is used here and elsewhere is that idealized humans are usually, in themselves and when we ignore costs, worse than real ones. For example, they could be based on predictions of human behavior that are not quite accurate, or they may only remain sane for an hour of continuous operation from some initial state. They are only better because they can be used in situations where real humans can't be used, such as in an infinite HCH, an indirect normativity style definition of AI goals, or a simulation of how a human develops when exposed to a certain environment (training). Their nature as inaccurate predictions may make them much more computationally tractable and actually available in situations where real humans aren't, and so more useful when we can compensate for the errors. So a better term might be "abstract humans" or "models of humans".

If these artificial environments with models of humans are good enough, they may also be able to bootstrap more accurate models of humans and put them into environments that produce better decisions, so that the initial errors in prediction won't affect the eventual outcomes.

Comment by vladimir_nesov on Why we need a *theory* of human values · 2018-12-15T07:54:29.869Z · score: 5 (2 votes) · LW · GW

More to the point, these failure modes are ones that we can talk about from outside

So can the idealized humans inside a definition of indirect normativity, which motivates them to develop some theory and then quarantine parts of the process to examine their behavior from outside the quarantined parts. If that is allowed, any failure mode that can be fixed by noticing a bug in a running system becomes anti-inductive: if you can anticipate it, it won't be present.

Comment by vladimir_nesov on LW Update 2018-11-22 – Abridged Comments · 2018-12-09T23:00:05.504Z · score: 4 (2 votes) · LW · GW

By the way, comment permalinks don't work for comments in collapsed subthreads (example). The anchor should be visible from javascript, so this could be fixed by expanding the subthread and navigating to the anchor.

Comment by vladimir_nesov on Intuitions about goal-directed behavior · 2018-12-03T01:06:01.556Z · score: 5 (2 votes) · LW · GW

Learning how to design goal-directed agents seems like an almost inevitable milestone on the path to figuring out how to safely elicit human preference in an actionable form. But the steps involved in eliciting and enacting human preference don't necessarily make use of a concept of preference or goal-directedness. An agent with a goal aligned with the world can't derive its security from the abstraction of goal-directedness, because the world determines that goal, and so the goal is vulnerable to things in the world, including human error. Only self-contained artificial goals are safe from the world and may lead to safety of goal-directed behavior. A goal built from human uploads that won't be updated from the world in the future gives safety from other things in the world, but not from errors of the uploads.

When the issue is figuring out which influences of the world to follow, it's not clear that goal-directedness remains salient. If there is a goal, then there is also a world-in-the-goal and listening to your own goal is not safe! Instead, you have to figure out which influences in your own goal to follow. You are also yourself part of the world and so there is an agent-in-the-goal that can decide aspects of preference. This framing where a goal concept is prominent is not obviously superior to other designs that don't pursue goals, and instead focus on pointing at the appropriate influences from the world. For example, a system may seek to make reliable uploads, or figure out which decisions of uploads are errors, or organize uploads to make sense of situations outside normal human environments, or be corrigible in a secure way, so as to follow directions of a sane external operator and not of an attacker. Once we have enough of such details figured out (none of which is a goal-directed agent), it becomes possible to take actions in the world. At that point, we have a system of many carefully improved kluges that further many purposes in much the same way as human brains do, and it's not clearly an improvement to restructure that system around a concept of goals, because that won't move it closer to the influences of the world it's designed to follow.

Comment by vladimir_nesov on Intuitions about goal-directed behavior · 2018-12-02T13:33:07.522Z · score: 3 (2 votes) · LW · GW

My guess is that agents that are not primarily goal-directed can be good at defending against goal-directed agents (especially with first mover advantage, preventing goal-directed agents from gaining power), and are potentially more tractable for alignment purposes, if humans coexist with AGIs during their development and operation (rather than only exist as computational processes inside the AGI's goal, a situation where a goal concept becomes necessary).

I think the assumption that useful agents must be goal-directed has misled a lot of discussion of AI risk in the past. Goal-directed agents are certainly a problem, but not necessarily the solution. They are probably good for fixing astronomical waste, but maybe not AI risk.

Comment by vladimir_nesov on Clarifying "AI Alignment" · 2018-11-28T04:08:42.344Z · score: 2 (1 votes) · LW · GW

Trying to have influence over aspects of value change that people don't much care about ... [is] reasonable ... to do to make the future better

This could refer to value change in AI controllers, like Hugh in HCH, or alternatively to value change in people living in the AI-managed world. I believe the latter could be good, but the former seems very questionable (here "value" refers to true/normative/idealized preference). So it's hard for the same people to share the two roles. How do you ensure that value change remains good in the original sense without a reference to preference in the original sense, that hasn't experienced any value change, a reference that remains in control? And for this discussion, it seems like the values of AI controllers (or AI+controllers) are what's relevant.

It's agent tiling for AI+controller agents; any value change in the whole seems to be a mistake. It might be OK to change values of subagents, but the whole shouldn't show any value drift, only instrumentally useful tradeoffs that sacrifice less important aspects of what's done for more important aspects, but still from the point of view of unchanged original values (to the extent that they are defined at all).

Comment by vladimir_nesov on Decision Theory · 2018-11-03T02:34:02.007Z · score: 2 (1 votes) · LW · GW

I understand that there is no point examining one's algorithm if you already execute it and see what it does.

Rather, there is no point if you are not going to do anything with the results of the examination. It may be useful if you make the decision based on what you observe (about how you make the decision).

you say "nothing stops you", but that is only possible if you could act contrary to your own algorithm, no?

You can, for a certain value of "can". It won't have happened, of course, but you may still decide to act contrary to how you act, two different outcomes of the same algorithm. The contradiction proves that you didn't face the situation that triggers it in actuality, but the contradiction results precisely from deciding to act contrary to the observed way in which you act, in a situation that a priori could be actual, but is rendered counterlogical as a result of your decision. If instead you affirm the observed action, then there is no contradiction and so it's possible that you have faced the situation in actuality. Thus the "chicken rule", playing chicken with the universe, making the present situation impossible when you don't like it.

So your reasoning is inaccurate

You don't know that it's inaccurate; you've just run the computation and it said $5. Maybe this didn't actually happen, but you are considering this situation without knowing if it's actual. If you ignore the computation, then why run it? If you run it, you need responses to all possible results, and all possible results except one are not actual, yet you should be ready to respond to them without knowing which is which. So I'm discussing what you might do for the result that says that you take the $5. And in the end, the use you make of the results is by choosing to take the $5 or the $10.

This map from predictions to decisions could be anything. It's trivial to write an algorithm that includes such a map. Of course, if the map diagonalizes, then the predictor will fail (won't give a prediction), but the map is your reasoning in these hypothetical situations, and the fact that the map may say anything corresponds to the fact that you may decide anything. The map doesn't have to be the identity, and the decision doesn't have to reflect the prediction, because you may write an algorithm where it's not the identity.
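
A minimal sketch of such an algorithm (mine, with made-up names; it assumes a predictor that searches for a self-confirming prediction):

```python
# The agent's action is an arbitrary function ("map") of a prediction of that
# same action. The map doesn't have to be the identity, and if it diagonalizes,
# the predictor fails to produce a prediction, as noted above.

def agent(prediction, policy):
    return policy(prediction)

identity = lambda predicted: predicted                               # affirm the prediction
diagonal = lambda predicted: "$5" if predicted == "$10" else "$10"   # contradict it

def predictor(policy, actions=("$5", "$10")):
    # Search for a fixed point: a prediction the agent would confirm.
    for a in actions:
        if agent(a, policy) == a:
            return a
    return None  # no fixed point: the predictor won't give a prediction

print(predictor(identity))  # '$5'  (first self-confirming prediction found)
print(predictor(diagonal))  # None  (a diagonalizing map defeats the predictor)
```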

Comment by vladimir_nesov on Decision Theory · 2018-11-02T11:45:20.426Z · score: 4 (2 votes) · LW · GW

For example, in the 5&10 game an agent would examine its own algorithm, see that it leads to taking $10 and stop there.

Why do even that much if this reasoning could not be used? The question is about the reasoning that could contribute to the decision, that could describe the algorithm, and so has the option to not "stop there". What if you see that your algorithm leads to taking the $10 and instead of stopping there, you take the $5?

Nothing stops you. This is the "chicken rule" and it solves some issues, but more importantly illustrates the possibility in how a decision algorithm can function. The fact that this is a thing is evidence that there may be something wrong with the "stop there" proposal. Specifically, you usually don't know that your reasoning is actual, that it's even logically possible and not part of an impossible counterfactual, but this is not a hopeless hypothetical where nothing matters. Nothing compels you to affirm what you know about your actions or conclusions, this is not a necessity in a decision making algorithm, but different things you do may have an impact on what happens, because the situation may be actual after all, depending on what happens or what you decide, or it may be predicted from within an actual situation and influence what happens there. This motivates learning to reason in and about possibly impossible situations.

What if you examine your algorithm and find that it takes the $5 instead? It could be the same algorithm that takes the $10, but you don't know that, instead you arrive at the $5 conclusion using reasoning that could be impossible, but that you don't know to be impossible, that you haven't decided yet to make impossible. One way to solve the issue is to render the situation where that holds impossible, by contradicting the conclusion with your action, or in some other way. To know when to do that, you should be able to reason about and within such situations that could be impossible, or could be made impossible, including by the decisions made in them. This makes the way you reason in them relevant, even when in the end these situations don't occur, because you don't a priori know that they don't occur.

(The 5-and-10 problem is not specifically about this issue, and explicit reasoning about impossible situations may be avoided, perhaps should be avoided, but my guess is that the crux in this comment thread is about things like usefulness of reasoning from within possibly impossible situations, where even your own knowledge arrived at by pure computation isn't necessarily correct.)

Comment by vladimir_nesov on Hero Licensing · 2018-10-29T19:08:23.675Z · score: 2 (1 votes) · LW · GW

so long as you can't change their notions of status there is nothing you can do to communicate "you are fundamentally wrong about how this works" without them hearing it as "I don't realize how far out of my depth I am right now".

But from the other direction, it seems quite possible to hear what the wrong-status person says about how I'm wrong. So "nothing you can do" seems excessive. Perhaps politeness often suffices, for arguments that would be accepted when delivered by an appropriate-status person, as long as you are being heard at all.

Comment by vladimir_nesov on Decision Theory FAQ · 2018-10-27T15:02:33.702Z · score: 4 (2 votes) · LW · GW

These are decisions in different situations. Transitivity of preference is about a single situation. There should be three possible actions A, B and C that can be performed in a single situation, with B preferred to A and C preferred to B. Transitivity of preference says that C is then preferred to A in that same situation. Betting on a fight of B vs. A is not a situation where you could also bet on C, and would prefer to bet on C over betting on B.

Comment by vladimir_nesov on Schools Proliferating Without Practicioners · 2018-10-27T09:48:55.926Z · score: 5 (3 votes) · LW · GW

It seems like a fairly straightforward claim, to say that a quality like integrity is all the more valuable for being rare.

I think the difference is between associating with producers vs. consumers. When something is more scarce, its price increases, which makes it less valuable to consumers and more valuable to producers of individual items. So for people who perceive themselves as producers of things like integrity and virginity and honest labor, scarcity would contribute to their value. And for things like medicine or the naive models of bitcoin and diamonds, scarcity would decrease their value by increasing the price, since the audience of the article identify as consumers.

Comment by vladimir_nesov on Outline of Metarationality, or much less than you wanted to know about postrationality · 2018-10-16T10:30:05.359Z · score: 12 (4 votes) · LW · GW

You can decide to question any such principles, which is how they get formulated in the first place, as designs for improved cognition devised by an evolved mind that doesn't originally follow any particular crisp design, but can impose order on itself. The only situation where they remain stable is if the decision always comes out in their favor, which will happen if they are useful for agents pursuing your preference. When these agents become sufficiently different, they probably shouldn't use any object level details of the design of cognition that holds for you. The design improves, so it's not the same.

Examples of such principles are pursuit of well-calibrated empirical beliefs, of valid mathematical knowledge, of useful plans, and search for rational principles of cognition.

I don't know how to describe the thing that remains through correct changes, which is probably what preference should be, so it's never formal. There shouldn't be a motivation to "be at peace" with it, since it's exactly the thing you turn out to be at peace with, for reasons other than being at peace with it.

Comment by Vladimir_Nesov on [deleted post] 2018-10-16T03:45:39.571Z

This is a duplicate of the post made yesterday.

Comment by vladimir_nesov on Logical Counterfactuals are low-res · 2018-10-15T20:56:42.633Z · score: 5 (3 votes) · LW · GW

I don't know how shminux interprets the words, and if the question was related to this, but there is an issue in your use of "impossible". Things that happen in this world are actual, and things that happen in alternative worlds are possible. (The set of alternative worlds needs to be defined for each question.) An impossible situation is one that can't occur, that doesn't happen in any of the alternative worlds.

Thus outputs of an algorithm different from the actual output are not actual, and furthermore not possible, as they don't occur in the alternative worlds with the same algorithm receiving the same inputs. But there are alternative worlds (whose set is different than in the previous sentence, no longer constrained by the state of input) with the same algorithm that receives different inputs, and so despite not being actual, the different inputs are still possible, in other words not impossible.

Comment by vladimir_nesov on LW Update 2018-10-01 – Private Messaging Works · 2018-10-05T04:20:41.270Z · score: 4 (2 votes) · LW · GW

Bug report: Posts by Day page shows incorrect URLs for most posts, for example it lists the post "The Rocket Alignment Problem" with a link to Fasting Mimicking Diet Looks Pretty Good. Looks like some kind of intermittent off-by-one error that only affects some of the entries. The entries on the same page loaded with the "Load More Days" link don't seem to be affected. I'm not seeing this issue on the front page, and it only appeared around yesterday. Interestingly, if I click the "All Posts/Daily" link on the front page, which leads to the same URL, then there is no issue; it's only reproduced if I open the URL directly.

Comment by vladimir_nesov on Newcomb's Problem and Regret of Rationality · 2018-10-03T14:51:00.316Z · score: 3 (2 votes) · LW · GW

There's an archived copy here.

Comment by vladimir_nesov on Righting a Wrong Question · 2018-10-03T12:07:03.226Z · score: 2 (1 votes) · LW · GW

(That may be a useful clue for identifying the meaning of the question, as understood by the people pursuing it, but not necessarily a good reason to agree that it currently should be considered mysterious or that it's a sensible question to pursue.)

Comment by vladimir_nesov on The Tails Coming Apart As Metaphor For Life · 2018-10-01T21:50:21.708Z · score: 2 (1 votes) · LW · GW

[W]e still need a coherent moral framework to use to generate our AI's utility function if we want it to be aligned, so morality is important, and we do need to develop an acceptable solution to it.

This is not clear. It's possible for the current world to exist as it is, and similarly for any other lawful simulated world that's not optimized in any particular direction. So an AI could set up such a world without interfering, and defend its lawful operation from outside interference. This is a purposeful thing, potentially as good at defending itself as any other AI, that sets up a world that's not optimized by it, that doesn't need morality for the world it maintains, in order to maintain it. Of course this doesn't solve any problems inside that world, and it's unclear how to make such a thing, but it illustrates the problems with the "morality is necessary for AGI" position. Corrigibility also fits this description, being something unlike an optimization goal for the world, but still a purpose.

Comment by vladimir_nesov on Why we should ban the concept of Effective Altruism · 2018-09-28T22:29:05.978Z · score: 7 (4 votes) · LW · GW

The concept is fine, it's the misuse of the words denoting it that can cause problems by diluting the concept, giving bad training data so that the concept gets learned in many distorted forms. It might be useful to taboo the words that denote the concept in order to make explaining it easier. Or just refer people to a blog post somewhere that explains the idea, it's the same as any knowledge.

It's also not necessary to show people how to be an effective altruist (or to believe or assert that it's a good idea) in order to explain the concept.

Comment by vladimir_nesov on Open Thread September 2018 · 2018-09-25T16:34:20.475Z · score: 3 (2 votes) · LW · GW

I write notes in a single plain text file, using the dates they are made to cite them in newer notes. There are two types of notes, brainstorming throw-away ones that maintain the process of thinking about a problem or of learning something (such as carefully reading a paper), and more lucid ones, with some re-reading value, which are marked differently and have a one-sentence summary. The notes are intended to never be made public, so that I feel free to use them to resolve any silly confusions.

Comment by vladimir_nesov on AI Reading Group Thoughts (2/?): Reconstructive Psychosurgery · 2018-09-25T15:20:50.869Z · score: 2 (1 votes) · LW · GW

The useful question is about the value of the data that can be collected about people, not so much its usefulness for achieving a particular task, because it may no longer make sense to perform that task once it becomes possible to do so (as in the giant cheesecake fallacy). A system that can reconstruct people from data can do many other things that may be more valuable than reconstruction of people, even from the point of view of the people who could've been reconstructed. It's a question of what should be done with the resources and capabilities of that system.

The value of the data is in its contribution to the value of the best things that can be made, and these best things don't necessarily improve through availability of that data, because they are not necessarily reconstructed people. I'm not sure there is any knowable-on-human-level difference in value between what can be done with the world given the knowledge about particular people who used to live in the past (either through indirect data or cryopreserved brains), compared to the value of what can be done without that knowledge. I guess it can't hurt if it doesn't consume resources that could otherwise find a meaningful purpose.

Comment by vladimir_nesov on Impact Measure Desiderata · 2018-09-23T21:29:44.808Z · score: 2 (1 votes) · LW · GW

I worry there might be leaks in logical time that let the agent choose an action that takes into account that an impactful action will be denied. For example, a sub-agent could be built so that it's a maximizer that's not constrained by an impact measure. The sub-agent then notices that to maximize its goal, it must constrain its impact, or else the main agent won't be allowed to create it. And so it will so constrain its impact and will be allowed to be created, as a low-impact and maximally useful action of the main agent. It's sort of a daemon, but with respect to impact measure and not goals, which additionally does respect the impact measure and only circumvents it once in order to get created.

Comment by vladimir_nesov on Impact Measure Desiderata · 2018-09-23T19:59:46.413Z · score: 2 (1 votes) · LW · GW

It could as easily be "do this one slightly helpful thing", an addition on top of doing nothing. It doesn't seem like there is an essential distinction between such different framings of the same outcome that intent verification can capture.

Comment by vladimir_nesov on Impact Measure Desiderata · 2018-09-23T18:49:20.720Z · score: 2 (1 votes) · LW · GW

I was talking about what I understand the purpose/design of intent verification to be, not specifically the formalizations you described. (I don't think it's particularly useful to work out the details without a general plan or expectation of important technical surprises.)

Comment by vladimir_nesov on Impact Measure Desiderata · 2018-09-23T17:00:21.909Z · score: 2 (1 votes) · LW · GW

It's Rice's theorem, though really more about conceptual ambiguity. We can talk about particular notions of agents or goals, but it's never fully general, unless we by construction ensure that unexpected things can't occur. And even then it's not what we would have wanted the notions of agents or goals to be, because it's not clear what that is.

Intent verification doesn't seem to capture things that smuggle in a tiny bit of helpfulness when these things are actually required to deliver that helpfulness, especially after other routes to improving the outcome have been exhausted (this is what the paragraph about hashes in the first comment was about). So the neutral magic could be helpful by a tiny, bounded amount. This is one of the ways the balance between doing nothing and releasing a sub-agent could be broken.

Comment by vladimir_nesov on Impact Measure Desiderata · 2018-09-23T16:37:30.381Z · score: 2 (1 votes) · LW · GW

Unleashing this agent would change resource availability and increase or decrease the power of an arbitrary maximizer from that vantage point.

It'll replenish the resources it takes, help any maximizer it impedes so as to exactly cancel out the impediment etc.

Suppose that an arbitrary maximizer could not co-opt this new agent - its ability to achieve goals is decreased compared to if it hadn’t activated the agent.

To the extent its existence could pose a problem for another agent (according to the measure, which can't really talk about goals of agents-in-general), it'll surrender its resources without a fight or help with the other agent's goals. It will destroy any surplus in its resources before it can be captured. A change in variance resulting from unpredictability of such outcomes can be similarly targeted.

In general, these seem like the details that neutral magic can work out as their possibility occurs to it, so the difficulty is shifted from the main agent to the trusted ability of its sub-agent to do the rest. Worst case scenario, it can self-destruct early, but there's probably enough leeway from that outcome.

Comment by vladimir_nesov on Impact Measure Desiderata · 2018-09-23T16:16:16.314Z · score: 2 (1 votes) · LW · GW

The sub-agent in this scenario won't be helping the main agent with achieving any goals. It only hides. Its nature precludes any usefulness. It's no more useful than its absence. But for the main agent, it's as easy to set up as its absence. And there might be reasons for this balance to be broken in favor of creating the sub-agent.

Comment by vladimir_nesov on Impact Measure Desiderata · 2018-09-23T15:35:01.011Z · score: 2 (1 votes) · LW · GW

If we want to argue this, we should first come up with a terrible x-scenario which is low objective impact.

Here is a scenario that I think low impact methods can't address. Let's say the AI has the opportunity to easily (without affecting its ability to do its task) create a "free" copy or a sub-agent on the Internet. That sub-agent can be tasked with pursuing the goal of restricting whatever measure of impact the main agent operates under. The sub-agent very carefully eats the universe, while ensuring that according to the impact measure there is no effect of its activity. It has no other purpose. Any resources it needs, it can gather where they would've been wasted without its intervention (like reaching distant stars before other agents and capturing some of their output that wouldn't otherwise be captured).

This functions like "neutral magic" in an environment where everything becomes weird, with agents lurking in the shadows, but anything that was thought of in the impact measure is no different than without this magic. Of course, this is very bad for Goodhart reasons, since no impact measure is perfect. (This is like the example with an explosion in the desert, but everywhere, and harder to notice exactly where it would become noticeable.)

A general technique to avoid this might be a way of always keeping the agent busy, computing hashes or something, even when it's done with its task (maybe as a lexicographically lower goal), so that it doesn't do something like that because it can. This also looks like an impact measure.

(This is related to how an impact measure is close to being a goal (a transformation of goals), so the failure modes of misalignment apply to it as well. I think there is a useful distinction between goals and corrigibility, which might be reproduced for something like low impact, as a third kind of thing that an agent might pursue, which is neither a goal nor corrigibility.)

No Anthropic Evidence · 2012-09-23T10:33:06.994Z · score: 10 (15 votes)

A Mathematical Explanation of Why Charity Donations Shouldn't Be Diversified · 2012-09-20T11:03:48.603Z · score: 2 (25 votes)

Consequentialist Formal Systems · 2012-05-08T20:38:47.981Z · score: 12 (13 votes)

Predictability of Decisions and the Diagonal Method · 2012-03-09T23:53:28.836Z · score: 21 (16 votes)

Shifting Load to Explicit Reasoning · 2011-05-07T18:00:22.319Z · score: 15 (21 votes)

Karma Bubble Fix (Greasemonkey script) · 2011-05-07T13:14:29.404Z · score: 23 (26 votes)

Counterfactual Calculation and Observational Knowledge · 2011-01-31T16:28:15.334Z · score: 11 (22 votes)

Note on Terminology: "Rationality", not "Rationalism" · 2011-01-14T21:21:55.020Z · score: 31 (41 votes)

Unpacking the Concept of "Blackmail" · 2010-12-10T00:53:18.674Z · score: 25 (34 votes)

Agents of No Moral Value: Constrained Cognition? · 2010-11-21T16:41:10.603Z · score: 6 (9 votes)

Value Deathism · 2010-10-30T18:20:30.796Z · score: 26 (48 votes)

Recommended Reading for Friendly AI Research · 2010-10-09T13:46:24.677Z · score: 26 (31 votes)

Notion of Preference in Ambient Control · 2010-10-07T21:21:34.047Z · score: 14 (19 votes)

Controlling Constant Programs · 2010-09-05T13:45:47.759Z · score: 25 (38 votes)

Restraint Bias · 2009-11-10T17:23:53.075Z · score: 16 (21 votes)

Circular Altruism vs. Personal Preference · 2009-10-26T01:43:16.174Z · score: 11 (17 votes)

Counterfactual Mugging and Logical Uncertainty · 2009-09-05T22:31:27.354Z · score: 10 (13 votes)

Bloggingheads: Yudkowsky and Aaronson talk about AI and Many-worlds · 2009-08-16T16:06:18.646Z · score: 20 (22 votes)

Sense, Denotation and Semantics · 2009-08-11T12:47:06.014Z · score: 9 (16 votes)

Rationality Quotes - August 2009 · 2009-08-06T01:58:49.178Z · score: 6 (10 votes)

Bayesian Utility: Representing Preference by Probability Measures · 2009-07-27T14:28:55.021Z · score: 33 (18 votes)

Eric Drexler on Learning About Everything · 2009-05-27T12:57:21.590Z · score: 31 (36 votes)

Consider Representative Data Sets · 2009-05-06T01:49:21.389Z · score: 6 (11 votes)

LessWrong Boo Vote (Stochastic Downvoting) · 2009-04-22T01:18:01.692Z · score: 3 (30 votes)

Counterfactual Mugging · 2009-03-19T06:08:37.769Z · score: 56 (76 votes)

Tarski Statements as Rationalist Exercise · 2009-03-17T19:47:16.021Z · score: 11 (21 votes)

In What Ways Have You Become Stronger? · 2009-03-15T20:44:47.697Z · score: 26 (28 votes)

Storm by Tim Minchin · 2009-03-15T14:48:29.060Z · score: 15 (22 votes)