Comment by vladimir_nesov on Schelling Fences versus Marginal Thinking · 2019-05-23T11:34:27.099Z · score: 3 (2 votes) · LW · GW

I would think that applying Schelling fences to reinforce current values reduces the amount of expected drift in the future

It reinforces the position endorsed by current values, not the current values themselves. (I'm not saying this about Schelling fences in general, which have their uses, but rather about leveraging status quo and commitment norms via reliable application of simple rules chosen to signal current (past, idealized) values.) This hurts people with future changed values without preventing the change in values.

what specifically you think is making the error of making it difficult to re-align with current values

The effect on prevention of change in values is negative only in the sense of opportunity cost, and because of the possibility of mistaking this activity for something useful, which inhibits seeking something actually useful. It's analogous to the issues caused by homeopathy. (Though I'm skeptical about value drift being harmful for humans.)

Comment by vladimir_nesov on Comment section from 05/19/2019 · 2019-05-23T09:54:48.144Z · score: 2 (1 votes) · LW · GW

Abstractions are a central example of things considered on the object level, so I don't understand them as being in opposition to the object level. They can be in opposition to more concrete ideas, those closer to experience, but not to being considered on the object level.

Comment by vladimir_nesov on Comment section from 05/19/2019 · 2019-05-23T07:35:26.508Z · score: 2 (1 votes) · LW · GW

And the object level is what we're all doing this for, or what's the point?

What's the point of concrete ideas, compared to more abstract ideas? The reasons seem similar, just with different levels of grounding in experience, like with a filter bubble that you can only peer beyond with great difficulty. This situation is an argument against emphasis on the concrete, not for it.

(I think there's a mixup between "meta" and "abstract" in this subthread. It's meta that exists for the object level, not abstractions. Abstractions are themselves on the object level when you consider them in their own right.)

Comment by vladimir_nesov on Schelling Fences versus Marginal Thinking · 2019-05-23T07:05:44.104Z · score: 2 (1 votes) · LW · GW

This runs the risk of denying that value drift has taken place instead of preventing value drift, creating ammunition for a conflict with future self or future others instead of ensuring that your current self is in harmony with them. Some examples you cite and list seem to be actually making this error.

Comment by vladimir_nesov on Is value drift net-positive, net-negative, or neither? · 2019-05-05T17:11:14.911Z · score: 6 (3 votes) · LW · GW

This question requires distinguishing current values from idealized values, and values (in charge) of the world from values of a person. Idealized values are an unchanging and general way of judging situations (the world), including choices that take place there. Current values are an aspect of an actual agent (person) that are more limited in scope and can't accurately judge many things. By idealizing current values, we obtain idealized values that give a way in which the current values should function.

Most changes in current values change their idealization, but some changes that follow the same path as idealization don't, they only improve ability to judge things in the same idealized way. Value drift is a change in current values that changes their idealization. When current values disagree with idealized current values, their development without value drift eventually makes them agree, fixes their error. But value drift can instead change idealized values to better fit current values, calcifying the error.

Values in charge of the world (values of a singleton AI or of an agentic idealization of humanity) in particular direct what happens to people who live there. From the point of view of any idealized values, including idealized values of particular people (who can't significantly affect the world), it's the current values of the world that matter the most, because they determine what actually happens, and idealized values judge what actually happens.

Unless all people have the same idealized values, the values of the world are different from values of individual people, so value drift in values of the world can change what happens both positively and negatively according to idealized values of individual people. On the other hand, values of the world could approve of value drift in individual people (conflict between people, diversity of personal values over time, disruption of reflective equilibrium in people's reasoning), and so could those individual people, since their personal value drift won't disrupt the course of the world, which is what their idealized values judge. Note that idealized personal values approving of value drift doesn't imply that current personal values do. Finally, idealized values of the world disapprove of value drift in values of the world, since that actually would disrupt the course of the world.

Comment by vladimir_nesov on [Meta] Hiding negative karma notifications by default · 2019-05-05T15:33:11.403Z · score: 6 (5 votes) · LW · GW

In theory I agree with this, and it was my position several years ago. My own reaction is balanced in this sense: I perceive downvotes as interesting observations, not punishment. But several people I respect described their experience as remaining negative after all this time, so I suspect they can't easily change that response. You wouldn't occasionally throw spiders at someone with arachnophobia. Maybe you should shower them in spiders though, I heard that helps with desensitisation.

Comment by vladimir_nesov on Dishonest Update Reporting · 2019-05-04T23:50:26.603Z · score: 7 (2 votes) · LW · GW

The value of caring about informal reasoning is in training the same skills that apply to knowably important questions, and in seemingly unimportant details adding up in ways you couldn't plan for. Existence of a credible consensus lets you use a belief without understanding its origin (i.e. without becoming a world-class expert on it), so it doesn't interact with those skills.

When correct disagreement of your own beliefs with consensus is useful at scale, it eventually shifts the consensus, or else you have a source of infinite value. So almost any method of deriving significant value from private predictions being better than consensus is a method of contributing knowledge to consensus.

(Not sure what you were pointing at, mostly guessing the topic.)

Comment by vladimir_nesov on Dishonest Update Reporting · 2019-05-04T19:31:17.849Z · score: 3 (2 votes) · LW · GW

By "informal" I meant that the belief is not on a prediction market, so you can influence consensus only by talking, without carefully keeping track of transactions. (I disagree with it being appropriate not to care about results in informal communication, so it's not a distinction I was making.)

Comment by vladimir_nesov on Dishonest Update Reporting · 2019-05-04T18:56:27.976Z · score: 7 (4 votes) · LW · GW

Holding to delivery is already familiar for informal communication. But short-term speculation is a different mode of contributing rare knowledge into consensus that doesn't seem to exist for discussions of beliefs that are not on prediction markets, and breaks many assumptions about how communication should proceed. In particular it puts into question the virtues of owning up to your predictions and of regularly publishing updated beliefs.

Comment by vladimir_nesov on Dishonest Update Reporting · 2019-05-04T16:04:38.882Z · score: 19 (4 votes) · LW · GW

In a prediction market your belief is not shared, but contributes to the consensus (market price of a futures). Many traders become agnostic about a question (close their position) before the underlying fact of the matter is revealed (delivery), perhaps shortly after stating the direction in which they expect the consensus to move (opening the position), to contribute (profit from) their rare knowledge while it remains rare. Requiring traders to own up to a prediction (hold to delivery) interferes with efficient communication of rare information into common knowledge (market price).

So consider declaring that the consensus is shifting in a particular direction, without explaining your reasoning, and then shortly after bow out of the discussion (taking note of how the consensus shifted in the interim). This seems very strange when compared to common norms, but I think something in this direction could work.
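As a minimal numeric sketch of the open-then-close pattern described above (all prices, sizes and the direction of the move are hypothetical, and real markets add fees, slippage and limited depth):

```python
consensus_price = 0.60   # market-implied probability before you trade
your_estimate   = 0.75   # your private belief about the event

# Open a position in the direction you expect the consensus to move.
size = 100                      # contracts bought at the consensus price
cost = size * consensus_price   # capital committed while the position is open

# Suppose the consensus later shifts part of the way toward your estimate,
# as other traders react to the order flow and to their own information.
new_consensus_price = 0.68

# Close before delivery: the underlying fact is never revealed to you, but
# the profit reflects how far the consensus moved toward your rare knowledge.
proceeds = size * new_consensus_price
print(f"profit from the consensus shift: {proceeds - cost:.2f}")  # -> 8.00
```

The discussion analogue of the "profit" is whatever credit or evidence you get from the consensus having shifted in the stated direction after you bowed out.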

Comment by vladimir_nesov on Habryka's Shortform Feed · 2019-05-04T14:55:03.000Z · score: 10 (2 votes) · LW · GW

That depends on what norm is in place. If the norm is to explain downvoting, then people should explain; otherwise there is no issue in not doing so. So the claim you are making is that the norm should be for people to explain. The well-known counterargument is that this disincentivizes downvoting.

you are under no obligation to waste cognition trying to figure them out

There is rarely an obligation to understand things, but healthy curiosity ensures progress on recurring events, irrespective of morality of their origin. If an obligation would force you to actually waste cognition, don't accept it!

Comment by vladimir_nesov on Functional Decision Theory vs Causal Decision Theory: Expanding on Newcomb's Problem · 2019-05-03T03:15:55.027Z · score: 10 (6 votes) · LW · GW

To make decisions, an agent needs to understand the problem, to know what's real and valuable that it needs to optimize. Suppose the agent thinks it's solving one problem, while you are fooling it in a way that it can't perceive, making its decisions lead to consequences that the agent can't (shouldn't) take into account. Then in a certain sense the agent acts in a different world (situation), in the world that it anticipates (values), not in the world that you are considering it in.

This is also the issue with CDT in Newcomb's problem: a CDT agent can't understand the problem, so when we test it, it's acting according to its own understanding of the world that doesn't match the problem. If you explain a reverse Newcomb's to an FDT agent (ensure that it's represented in it), so that it knows that it needs to act to win in the reverse Newcomb's and not in regular Newcomb's, then the FDT agent will two-box in regular Newcomb's problem, because it will value winning in reverse Newcomb's problem and won't value winning in regular Newcomb's.
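As a toy grounding of the standard problem the comment builds on (the $1,000,000 / $1,000 amounts and the reliable-predictor assumption are the usual stipulations; this is only a sketch, not the FDT formalism from the post):

```python
BIG, SMALL = 1_000_000, 1_000
ACTIONS = ("one-box", "two-box")

def payoff(action, prediction):
    """Box B contains BIG iff the predictor predicted one-boxing."""
    box_b = BIG if prediction == "one-box" else 0
    return box_b + (SMALL if action == "two-box" else 0)

# CDT holds the prediction fixed as an independent fact about the world;
# two-boxing then dominates, being better under either fixed prediction.
assert all(payoff("two-box", p) > payoff("one-box", p) for p in ACTIONS)
cdt_choice = "two-box"

# FDT treats the prediction as another output of its own decision procedure,
# so choosing an action also fixes the prediction to match it.
fdt_choice = max(ACTIONS, key=lambda a: payoff(a, a))

print(cdt_choice, fdt_choice)  # two-box one-box
```

The reverse-Newcomb point then amounts to swapping which payoff the agent takes itself to be optimizing: an FDT agent that verifiably cares about winnings in the reverse problem will two-box here, because these payoffs no longer represent what it values.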

Comment by vladimir_nesov on Speaking for myself (re: how the LW2.0 team communicates) · 2019-04-27T21:58:56.697Z · score: 2 (1 votes) · LW · GW

Within the hypothetical where the dimensions I suggest are better, fuzziness of upvote/downvote is better in the same way as uncertainty about facts is better than incorrect knowledge, even when the latter is easier to embrace than correct knowledge. In that hypothetical, moving from upvote/downvote to agree/disagree is a step in the wrong direction, even if the step in the right direction is too unwieldy to be worth making.

Comment by vladimir_nesov on Speaking for myself (re: how the LW2.0 team communicates) · 2019-04-27T19:53:34.920Z · score: 5 (2 votes) · LW · GW

I think both agree/disagree and approve/disapprove are toxic dimensions for evaluating quality discussions. Useful communication is about explaining and understanding relevant things, real-world truth and preference are secondary distractions. So lucid/confused (as opposed to clear/unclear) and relevant/misleading (as opposed to interesting/off-topic) seem like better choices.

Comment by vladimir_nesov on Agent Foundation Foundations and the Rocket Alignment Problem · 2019-04-09T16:48:27.192Z · score: 10 (5 votes) · LW · GW

At first glance, the current Agent Foundations work seems to be formal, but it's not the kind of formal where you work in an established setting. It's counterintuitive that people can doodle in math, but they can. There's a lot of that in physics and economics. Pre-formal work doesn't need to lack formality; it just doesn't follow a specific set of rules, so it can use math to sketch problems approximately, the same way informal words sketch problems approximately.

Comment by vladimir_nesov on Would solving logical counterfactuals solve anthropics? · 2019-04-06T12:33:24.778Z · score: 4 (2 votes) · LW · GW

I consider questions of morality or axiology separate from questions of decision theory.

The claim is essentially that specification of anthropic principles an agent follows belongs to axiology, not decision theory. That is, the orthogonality thesis applies to the distinction, so that different agents may follow different anthropic principles in the same way as different stuff-maximizers may maximize different kinds of stuff. Some things discussed under the umbrella of "anthropics" seem relevant to decision theory, such as being able to function with most anthropic principles, but not, say, choice between SIA and SSA.

(I somewhat disagree with the claim, as structuring values around instances of agents doesn't seem natural, maps/worlds are more basic than agents. But that is disagreement with emphasizing the whole concept of anthropics, perhaps even with emphasizing agents, not with where to put the concepts between axiology and decision theory.)

Comment by vladimir_nesov on Open Thread April 2019 · 2019-04-01T08:04:47.277Z · score: 11 (2 votes) · LW · GW

The issue is that GPT2 posts so much it drowns out everything else.

Comment by vladimir_nesov on Do you like bullet points? · 2019-03-26T21:11:33.435Z · score: 4 (2 votes) · LW · GW

An explanation communicates an idea without insisting on its relevance to reality (or some other topic). It's modular. You can then explain its relevance to reality, as another idea that reifies the relevance. Persuasion is doing both at the same time without making it clear. For example, you can explain how to think about a hypothesis to see what observations it predicts, without persuading that the hypothesis is likely.

Comment by vladimir_nesov on Privacy · 2019-03-19T04:54:30.696Z · score: 5 (3 votes) · LW · GW

Nod. I did actually consider a more accurate version of the comment that said something like "at least one of us is at least somewhat confused about something" [...]

The clarification doesn't address what I was talking about, or else disagrees with my point, so I don't see how that can be characterised with a "Nod". The confusion I refer to is about what the other means, with the question of whether anyone is correct about the world irrelevant. And this confusion is significant on both sides, otherwise a conversation doesn't go off the rails in this way. Paying attention to truth is counterproductive when intended meaning is not yet established, and you seem to be talking about truth, while I was commenting about meaning.

Comment by vladimir_nesov on Karma-Change Notifications · 2019-03-19T04:24:04.728Z · score: 4 (2 votes) · LW · GW

No ancient updates for the previous week, several for this week. An alternative to removing old notifications is to prepend entries in the list with recency, like "13d" or "8y", and sort by it.
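A minimal sketch of the suggested prefix (the exact format and the day/year thresholds are my own guesses, not a spec):

```python
from datetime import datetime, timezone
from typing import Optional

def recency_label(posted_at: datetime, now: Optional[datetime] = None) -> str:
    """Compact age prefix like '13d' or '8y' for a karma-notification entry."""
    now = now or datetime.now(timezone.utc)
    days = (now - posted_at).days
    return f"{days}d" if days < 365 else f"{days // 365}y"

# A comment posted roughly eight years before "now" gets the label "8y";
# entries would then be sorted by posted_at, newest first.
print(recency_label(datetime(2011, 3, 1, tzinfo=timezone.utc),
                    now=datetime(2019, 3, 19, tzinfo=timezone.utc)))  # -> 8y
```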

Comment by vladimir_nesov on Karma-Change Notifications · 2019-03-19T04:16:42.683Z · score: 2 (1 votes) · LW · GW

The reversal test is with respect to the norm, not with respect to ways of handling a fixed norm. So imagine that the norm is the opposite, and see what will happen. People will invent weird things like gauging popularity based on the number of downvotes, or the sum of absolute values of upvotes and downvotes when there are not enough downvotes. This will work about as well as what happens with the present norm. In that context, the option of "only upvotes" looks funny and pointless, but we can see that it actually isn't, because we can look from the point of view of both possible norms.

When an argument goes through in the world of the opposite status quo, we can transport it to our world. In this case, we obtain the argument that "only downvotes" is not particularly funny and pointless, instead it's about as serviceable (or about as funny and pointless) as "only upvotes", and both are not very good.

Comment by vladimir_nesov on Privacy · 2019-03-19T03:59:52.945Z · score: 8 (4 votes) · LW · GW

I have some probability on me being the confused one here.

In conversations like this, both sides are confused, that is, neither understands the other's point, so "who is the confused one" is already an incorrect framing. One of you may be factually correct, but that doesn't really matter for making a conversation work; understanding each other is more relevant.

(In this particular case, I think both of you are correct and fail to see what the other means, but Jessica's point is harder to follow and pattern-matches misleading things, hence the balance of votes.)

Comment by vladimir_nesov on How dangerous is it to ride a bicycle without a helmet? · 2019-03-10T00:32:19.855Z · score: 4 (2 votes) · LW · GW

Sure, for voting the effect on decision making is greater. I'm just suspicious of this whole idea of acausal impact, and moderate observations about effect size don't help with that confusion. I don't think it can apply to voting without applying to other things, so the quantitative distinction doesn't point in a particular direction on correctness of the overall idea.

Comment by vladimir_nesov on How dangerous is it to ride a bicycle without a helmet? · 2019-03-09T23:35:36.030Z · score: 2 (1 votes) · LW · GW

New information argues for a change on the margin, so the new equilibrium is different, though it may not be far away. The arguments are not "cancelled out", but they do only have bounded impact. Compare with charity evaluation in effective altruism: if we take the impact of certain decisions as sufficiently significant, it calls for their organized study, so that the decisions are no longer made based on first impressions. On the other hand, if there is already enough infrastructure for making good decisions of that type, then significant changes are unnecessary.

In the case of acausal impact, large reference classes imply that at least that many people are already affected, so if organized evaluation of such decisions is feasible to set up, it's probably already in place without any need for the acausal impact argument. So actual changes are probably in how you pay attention to info that's already available, not in creating infrastructure for generating better info. On the other hand, a source of info about sizes of reference classes may be useful.

Comment by vladimir_nesov on How dangerous is it to ride a bicycle without a helmet? · 2019-03-09T23:12:49.041Z · score: 6 (3 votes) · LW · GW

The absolute size of a reference class only gives the problem statement for an individual decision some altruistic/paternalistic tilt, which can fail to change it. Greater relative size of a reference class increases the decision's relative importance compared to other decisions, which on the margin should pull some effort away from the other decisions.

That the effective multiplier due to acausal coordination is smaller for non-voting decisions doesn't inform the question of whether the argument applies to non-voting decisions. The argument may be ignored in the decision algorithm only if the reference class is always small or about the same size for different decisions.

Comment by vladimir_nesov on How dangerous is it to ride a bicycle without a helmet? · 2019-03-09T20:03:55.150Z · score: 2 (1 votes) · LW · GW

That influences sizes of reference classes, but at some point the sizes cash out in morally relevant object level decisions.

Comment by vladimir_nesov on How dangerous is it to ride a bicycle without a helmet? · 2019-03-09T06:01:17.121Z · score: 2 (1 votes) · LW · GW

The magnitude depends on the sizes of reference classes, which differ dramatically. So some personal decisions are suddenly much more important than others simply because more people make them, and so you should allocate more resources on deciding those things in particular correctly. Exercise regimen seems like a high acausal impact decision. Another difference is that the goal that the personal decisions pursue shifts from what you want to happen to yourself, to what you want to happen to other people, and this effect increases with population. (Edited the grandparent to express these points more clearly.)

Comment by vladimir_nesov on How dangerous is it to ride a bicycle without a helmet? · 2019-03-09T05:24:23.145Z · score: 8 (3 votes) · LW · GW

In an old post I argued that for acausal coordination reasons it seems as if you should further multiply this value by the number of people in the reference class of those making the decision the same way (discounted by how little you care about strangers vs. yourself). This makes decisions about things that only affect you personally depend on the relative sizes of their reference classes and on total population (greater population shifts the focus of the decisions further away from yourself). Your decision inflicts the micromorts not just on yourself, but on all the people in the reference class, for a proportionally greater total number of micromorts that, given this consideration, turn into actual morts very easily.
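As a rough worked version of the multiplication (the functional form and all numbers are illustrative, not taken from the old post): writing $m$ for the micromorts the choice imposes on you personally, $N$ for the size of the reference class, and $\alpha$ for how much you weigh a stranger relative to yourself,

$$\text{total impact} \approx m\,\bigl(1 + \alpha\,(N-1)\bigr).$$

With, say, $m = 10$ micromorts per decision, $\alpha = 0.01$ and $N = 10^7$, that is on the order of $10^6$ micromorts, i.e. an expected mort.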

The idea doesn't seem to have taken root, people talk about this argument mostly in the context of voting, where it's comforting for the argument to hold, even though it seems to apply in general, where it demands monstrous responsibility for every tiny little thing. It's very suspicious, but I don't know how to resolve the confusion. Maybe it's just psychologically unrealistic to follow through in almost all cases where the argument applies, despite its normative correctness.

Comment by vladimir_nesov on Karma-Change Notifications · 2019-03-06T01:08:16.092Z · score: 2 (1 votes) · LW · GW

The main thing I like about the 'only downvotes' option is that it's kind of funny and pointless.

I feel the same about the 'only upvotes' option. Applying the reversal test, imagine that most people treat the 'only downvotes' option seriously and suggest that it should be the default, since it agrees with the usual norms of in-person conversation. Downvotes could even measure popularity if there was enough volume; in the meantime the sum of absolute values of upvotes and downvotes can play that role.

Comment by vladimir_nesov on Karma-Change Notifications · 2019-03-02T13:03:38.081Z · score: 9 (5 votes) · LW · GW

I'd appreciate the feature that restricts the notifications to the votes on comments posted at most X months ago (with X configurable in the settings). As it is, I'll mostly get noise, notifications for the comments posted 8-11 years ago that I'm not currently learning from. (At least I expect this to be the case and the first batch of notifications supports this.)

Comment by vladimir_nesov on Rule Thinkers In, Not Out · 2019-02-28T13:16:20.892Z · score: 2 (1 votes) · LW · GW

The apparent alternative to the reliable vs. Newton tradeoff when you are the thinker is to put appropriate epistemic status around the hypotheses. So you publish the book on Bible codes or all-powerful Vitamin C, but note in the preface that you remain agnostic about whether any version of the main thesis applies to the real world, pending further development. You build a theory to experience how it looks once it's more developed, and publish it because it was substantial work, even when upon publication you still don't know if there is a version of the theory that works out.

Maybe the theory is just beautiful, and that beauty doesn't much diminish from its falsity. So call it philosophical fiction, not a description of this world, the substantial activity of developing the theory and communicating it remains the same without sacrificing reliability of your ideas. There might even be a place for an edifice of such fictions that's similar to math in mapping out an aspect of the world that doesn't connect to the physical reality for very long stretches. This doesn't seem plausible in the current practice, but seems possible in principle, so even calling such activity "fiction" might be misleading, it's more than mere fiction.

I don't think hypersensitive pattern-matching does a lot to destroy the ability to distinguish between an idea that you feel like pursuing and an idea that you see as more reliably confirmed to be applicable in the real world. So you can discuss this distinction when communicating such ideas. Maybe the audience won't listen to the distinction you are making, or won't listen because you are making this distinction, but that's a different issue.

Comment by vladimir_nesov on Why we need a *theory* of human values · 2019-02-18T12:53:25.091Z · score: 2 (1 votes) · LW · GW

Yes, that's the almost fully general counterargument: punt all the problems to the wiser versions of ourselves.

It's not clear what the relevant difference is between then and now, so the argument that it's more important to solve a problem now is as suspect as the argument that the problem should be solved later.

How are we currently in a better position to influence the outcome? If we are, then the reason for being in a better position is a more important feature of the present situation than object-level solutions that we can produce.

Comment by vladimir_nesov on Limiting an AGI's Context Temporally · 2019-02-18T12:44:47.964Z · score: 9 (4 votes) · LW · GW

It could throw a paperclip maximizer at you.

Comment by vladimir_nesov on Open Thread January 2019 · 2019-01-16T03:21:46.524Z · score: 6 (3 votes) · LW · GW

So your decision doesn't just determine the future; it also determines (with high probability) which you "you" are.

Worse. It doesn't change who you are, you are the person being blackmailed. This you know. What you don't know is whether you exist (or ever existed). Whether you ever existed is determined by your decision.

(The distinction from the quoted sentence may matter if you put less value on the worlds of people slightly different from yourself, so you may prefer to ensure your own existence, even in a blackmailed situation, over the existence of the alternative you who is not blackmailed but who is different and so less valuable. This of course involves unstable values, but motivates degree of existence phrasing of the effect of decisions, over the change in the content of the world phrasing, since the latter doesn't let us weigh whole alternative worlds differently.)

Comment by vladimir_nesov on Non-Consequentialist Cooperation? · 2019-01-11T16:55:35.588Z · score: 3 (2 votes) · LW · GW

What you describe does not want to be a thought experiment, because it doesn't abstract away relevant confounders (moral value of human life). The setup in the post is better at being a thought experiment for the distinctions being discussed (moral value of golem's life more clearly depends on a moral framework). In this context, it's misleading to ask whether something should be done instead of whether it's the action that's hedonistic utilitarian / preference utilitarian / autonomy-preserving.

Comment by vladimir_nesov on Two More Decision Theory Problems for Humans · 2019-01-05T11:18:42.464Z · score: 4 (2 votes) · LW · GW

The latter, where "a lot of work" is the kind of thing humanity can manage in subjective centuries. In an indirect normativity design, doing much more work than that should still be feasible, since it's only specified abstractly, to be predicted by an AI, enabling distillation. So we can still reach it, if there is an AI to compute the result. But if there is already such an AI, perhaps the work is pointless, because the AI can carry out the work's purpose in a different way.

Comment by vladimir_nesov on Two More Decision Theory Problems for Humans · 2019-01-05T00:42:12.360Z · score: 8 (4 votes) · LW · GW

Humans are not immediately prepared to solve many decision problems, and one of the hardest problems is formulation of preference for a consequentialist agent. In expanding the scope of well-defined/reasonable decisions, formulating our goals well enough for use in a formal decision theory is perhaps the last milestone, far outside of what can be reached with a lot of work!

Indirect normativity (after distillation) can make the timeline for reaching this milestone mostly irrelevant, as long as there is sufficient capability to compute the outcome, and amplification is about capability. It's unclear how the scope of reasonable decisions is related to capability within that scope; amplification seems ambiguous between the two, and perhaps the scope of reasonable decisions is just another kind of stuff that can be improved. And it's an aspect of corrigibility to keep the AI within the scope of well-defined decisions.

But with these principles in place, it's unclear if formulating goals for consequentialist agents remains a thing, when instead it's possible to just continue to expand the scope of reasonable decisions and to distill/amplify them.

Comment by vladimir_nesov on An Extensive Categorisation of Infinite Paradoxes · 2018-12-17T17:40:21.250Z · score: 5 (3 votes) · LW · GW

A well-order has a least element in all non-empty subsets, and 1 > 1/2 > 1/4 > ... > 0 has a non-empty subset without a least element, so it's not a well-order.
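Spelling out the witness in symbols (just restating the comment, with the usual order on the reals):

$$S = \{2^{-n} : n \geq 0\} \cup \{0\}, \qquad T = \{2^{-n} : n \geq 0\} \subseteq S.$$

$T$ is non-empty, but for every $2^{-n} \in T$ the element $2^{-(n+1)} \in T$ is strictly smaller, so $T$ has no least element and $S$ is not well-ordered.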

Comment by vladimir_nesov on Two Neglected Problems in Human-AI Safety · 2018-12-17T11:23:44.542Z · score: 13 (5 votes) · LW · GW

I worry that in the context of corrigibility it's misleading to talk about alignment, and especially about utility functions. If alignment characterizes goals, it presumes a goal-directed agent, but a corrigible AI is probably not goal-directed, in the sense that its decisions are not chosen according to their expected value for a persistent goal. So a corrigible AI won't be aligned (neither will it be misaligned). Conversely, an agent aligned in this sense can't be visibly corrigible, as its decisions are determined by its goals, not orders and wishes of operators. (Corrigible AIs are interesting because they might be easier to build than aligned agents, and are useful as tools to defend against misaligned agents and to build aligned agents.)

In the process of gradually changing from a corrigible AI into an aligned agent, an AI becomes less corrigible in the sense that corrigibility ceases to help in describing its behavior; it stops manifesting. At the same time, goal-directedness starts to dominate the description of its behavior as the AI learns well enough what its goal should be. If during the process of learning its values it's more corrigible than goal-directed, there shouldn't be any surprises like sudden disassembly of its operators on the molecular level.

Comment by vladimir_nesov on Three AI Safety Related Ideas · 2018-12-17T09:37:10.148Z · score: 2 (1 votes) · LW · GW

I thought the point of idealized humans was to avoid problems of value corruption or manipulation

Among other things, yes.

which makes them better than real ones

This framing loses the distinction I'm making. They are more useful when taken together with their environment, but not necessarily better in themselves. These are essentially real humans who behave better because of the environments where they operate and the lack of direct influence from the outside world, which in some settings could also apply to the environment where they were raised. But they share the same vulnerabilities (to outside influence or unusual situations) as real humans, which can affect them if they are taken outside their safe environments. And in themselves, when abstracted from their environment, they may be worse than real humans, in the sense that they make less aligned or correct decisions, if the idealized humans are inaccurate predictions of the hypothetical behavior of real humans.

Comment by vladimir_nesov on Three AI Safety Related Ideas · 2018-12-16T13:55:44.809Z · score: 2 (1 votes) · LW · GW

If it's too hard to make AI systems in this way and we need to have them learn goals from humans, we could at least have them learn from idealized humans rather than real ones.

My interpretation of how the term is used here and elsewhere is that idealized humans are usually, in themselves and when we ignore costs, worse than real ones. For example, they could be based on predictions of human behavior that are not quite accurate, or they may only remain sane for an hour of continuous operation from some initial state. They are only better because they can be used in situations where real humans can't be used, such as in an infinite HCH, an indirect normativity style definition of AI goals, or a simulation of how a human develops when exposed to a certain environment (training). Their nature as inaccurate predictions may make them much more computationally tractable and actually available in situations where real humans aren't, and so more useful when we can compensate for the errors. So a better term might be "abstract humans" or "models of humans".

If these artificial environments with models of humans are good enough, they may also be able to bootstrap more accurate models of humans and put them into environments that produce better decisions, so that the initial errors in prediction won't affect the eventual outcomes.

Comment by vladimir_nesov on Why we need a *theory* of human values · 2018-12-15T07:54:29.869Z · score: 5 (2 votes) · LW · GW

More to the point, these failure modes are ones that we can talk about from outside

So can the idealized humans inside a definition of indirect normativity, which motivates them to develop some theory and then quarantine parts of the process to examine their behavior from outside the quarantined parts. If that is allowed, any failure mode that can be fixed by noticing a bug in a running system becomes anti-inductive: if you can anticipate it, it won't be present.

Comment by vladimir_nesov on LW Update 2018-11-22 – Abridged Comments · 2018-12-09T23:00:05.504Z · score: 4 (2 votes) · LW · GW

By the way, comment permalinks don't work for comments in collapsed subthreads (example). The anchor should be visible from javascript, so this could be fixed by expanding the subthread and navigating to the anchor.

Comment by vladimir_nesov on Intuitions about goal-directed behavior · 2018-12-03T01:06:01.556Z · score: 5 (2 votes) · LW · GW

Learning how to design goal-directed agents seems like an almost inevitable milestone on the path to figuring out how to safely elicit human preference in an actionable form. But the steps involved in eliciting and enacting human preference don't necessarily make use of a concept of preference or goal-directedness. An agent with a goal aligned with the world can't derive its security from the abstraction of goal-directedness, because the world determines that goal, and so the goal is vulnerable to things in the world, including human error. Only self-contained artificial goals are safe from the world and may lead to safety of goal-directed behavior. A goal built from human uploads that won't be updated from the world in the future gives safety from other things in the world, but not from errors of the uploads.

When the issue is figuring out which influences of the world to follow, it's not clear that goal-directedness remains salient. If there is a goal, then there is also a world-in-the-goal and listening to your own goal is not safe! Instead, you have to figure out which influences in your own goal to follow. You are also yourself part of the world and so there is an agent-in-the-goal that can decide aspects of preference. This framing where a goal concept is prominent is not obviously superior to other designs that don't pursue goals, and instead focus on pointing at the appropriate influences from the world. For example, a system may seek to make reliable uploads, or figure out which decisions of uploads are errors, or organize uploads to make sense of situations outside normal human environments, or be corrigible in a secure way, so as to follow directions of a sane external operator and not of an attacker. Once we have enough of such details figured out (none of which is a goal-directed agent), it becomes possible to take actions in the world. At that point, we have a system of many carefully improved kluges that further many purposes in much the same way as human brains do, and it's not clearly an improvement to restructure that system around a concept of goals, because that won't move it closer to the influences of the world it's designed to follow.

Comment by vladimir_nesov on Intuitions about goal-directed behavior · 2018-12-02T13:33:07.522Z · score: 3 (2 votes) · LW · GW

My guess is that agents that are not primarily goal-directed can be good at defending against goal-directed agents (especially with first mover advantage, preventing goal-directed agents from gaining power), and are potentially more tractable for alignment purposes, if humans coexist with AGIs during their development and operation (rather than only exist as computational processes inside the AGI's goal, a situation where a goal concept becomes necessary).

I think the assumption that useful agents must be goal-directed has misled a lot of discussion of AI risk in the past. Goal-directed agents are certainly a problem, but not necessarily the solution. They are probably good for fixing astronomical waste, but maybe not AI risk.

Comment by vladimir_nesov on Clarifying "AI Alignment" · 2018-11-28T04:08:42.344Z · score: 2 (1 votes) · LW · GW

Trying to have influence over aspects of value change that people don't much care about ... [is] reasonable ... to do to make the future better

This could refer to value change in AI controllers, like Hugh in HCH, or alternatively to value change in people living in the AI-managed world. I believe the latter could be good, but the former seems very questionable (here "value" refers to true/normative/idealized preference). So it's hard for the same people to share the two roles. How do you ensure that value change remains good in the original sense without a reference to preference in the original sense, that hasn't experienced any value change, a reference that remains in control? And for this discussion, it seems like the values of AI controllers (or AI+controllers) is what's relevant.

It's agent tiling for AI+controller agents: any value change in the whole seems to be a mistake. It might be OK to change values of subagents, but the whole shouldn't show any value drift, only instrumentally useful tradeoffs that sacrifice less important aspects of what's done for more important aspects, but still from the point of view of the unchanged original values (to the extent that they are defined at all).

Comment by vladimir_nesov on Decision Theory · 2018-11-03T02:34:02.007Z · score: 4 (2 votes) · LW · GW

I understand that there is no point examining one's algorithm if you already execute it and see what it does.

Rather there is no point if you are not going to do anything with the results of the examination. It may be useful if you make the decision based on what you observe (about how you make the decision).

you say "nothing stops you", but that is only possible if you could act contrary to your own algorithm, no?

You can, for a certain value of "can". It won't have happened, of course, but you may still decide to act contrary to how you act, two different outcomes of the same algorithm. The contradiction proves that you didn't face the situation that triggers it in actuality, but the contradiction results precisely from deciding to act contrary to the observed way in which you act, in a situation that a priori could be actual, but is rendered counterlogical as a result of your decision. If instead you affirm the observed action, then there is no contradiction and so it's possible that you have faced the situation in actuality. Thus the "chicken rule", playing chicken with the universe, making the present situation impossible when you don't like it.

So your reasoning is inaccurate

You don't know that it's inaccurate, you've just run the computation and it said $5. Maybe this didn't actually happen, but you are considering this situation without knowing if it's actual. If you ignore the computation, then why run it? If you run it, you need responses to all possible results, and all possible results except one are not actual, yet you should be ready to respond to them without knowing which is which. So I'm discussing what you might do for the result that says that you take the $5. And in the end, the use you make of the results is by choosing to take the $5 or the $10.

This map from predictions to decisions could be anything. It's trivial to write an algorithm that includes such a map. Of course, if the map diagonalizes, then the predictor will fail (won't give a prediction), but the map is your reasoning in these hypothetical situations, and the fact that the map may say anything corresponds to the fact that you may decide anything. The map doesn't have to be identity, decision doesn't have to reflect prediction, because you may write an algorithm where it's not identity.
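A minimal sketch of such an algorithm, with the map from predictions to decisions as an explicit parameter (a toy illustration of the point above, not a rendering of any particular decision-theory formalism):

```python
# An "agent" here is any function from a predicted action to an action, and
# the predictor simply searches for a self-consistent prediction.
ACTIONS = (5, 10)

def affirming_map(predicted):
    """Identity map: do whatever you are predicted to do."""
    return predicted

def diagonal_map(predicted):
    """Do the opposite of whatever is predicted (chicken-rule flavour)."""
    return 5 if predicted == 10 else 10

def predict(agent):
    """Return a prediction p with agent(p) == p, or None if none exists."""
    for p in ACTIONS:
        if agent(p) == p:
            return p
    return None

print(predict(affirming_map))  # 5 -- both 5 and 10 are self-consistent here
print(predict(diagonal_map))   # None -- diagonalization defeats the predictor
```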

Comment by vladimir_nesov on Decision Theory · 2018-11-02T11:45:20.426Z · score: 4 (2 votes) · LW · GW

For example, in the 5&10 game an agent would examine its own algorithm, see that it leads to taking $10 and stop there.

Why do even that much if this reasoning could not be used? The question is about the reasoning that could contribute to the decision, that could describe the algorithm, and so has the option to not "stop there". What if you see that your algorithm leads to taking the $10 and instead of stopping there, you take the $5?

Nothing stops you. This is the "chicken rule" and it solves some issues, but more importantly illustrates the possibility in how a decision algorithm can function. The fact that this is a thing is evidence that there may be something wrong with the "stop there" proposal. Specifically, you usually don't know that your reasoning is actual, that it's even logically possible and not part of an impossible counterfactual, but this is not a hopeless hypothetical where nothing matters. Nothing compels you to affirm what you know about your actions or conclusions, this is not a necessity in a decision making algorithm, but different things you do may have an impact on what happens, because the situation may be actual after all, depending on what happens or what you decide, or it may be predicted from within an actual situation and influence what happens there. This motivates learning to reason in and about possibly impossible situations.

What if you examine your algorithm and find that it takes the $5 instead? It could be the same algorithm that takes the $10, but you don't know that, instead you arrive at the $5 conclusion using reasoning that could be impossible, but that you don't know to be impossible, that you haven't decided yet to make impossible. One way to solve the issue is to render the situation where that holds impossible, by contradicting the conclusion with your action, or in some other way. To know when to do that, you should be able to reason about and within such situations that could be impossible, or could be made impossible, including by the decisions made in them. This makes the way you reason in them relevant, even when in the end these situations don't occur, because you don't a priori know that they don't occur.

(The 5-and-10 problem is not specifically about this issue, and explicit reasoning about impossible situations may be avoided, perhaps should be avoided, but my guess is that the crux in this comment thread is about things like usefulness of reasoning from within possibly impossible situations, where even your own knowledge arrived at by pure computation isn't necessarily correct.)

Comment by vladimir_nesov on Hero Licensing · 2018-10-29T19:08:23.675Z · score: 2 (1 votes) · LW · GW

so long as you can't change their notions of status there is nothing you can do to communicate "you are fundamentally wrong about how this works" without them hearing it as "I don't realize how far out of my depth I am right now".

But from the other direction, it seems quite possible to hear what the wrong-status person says about how I'm wrong. So "nothing you can do" seems excessive. Perhaps politeness often suffices, for arguments that would be accepted when delivered by an appropriate-status person, as long as you are being heard at all.

Comment by vladimir_nesov on Decision Theory FAQ · 2018-10-27T15:02:33.702Z · score: 4 (2 votes) · LW · GW

These are decisions in different situations. Transitivity of preference is about a single situation. There should be three possible actions A, B and C that can be performed in a single situation, with B preferred to A and C preferred to B. Transitivity of preference says that C is then preferred to A in that same situation. Betting on a fight of B vs. A is not a situation where you could also bet on C, and would prefer to bet on C over betting on B.
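In symbols, with $\succ$ read as "preferred to" within a single situation whose available actions are $A$, $B$ and $C$ (just restating the paragraph above):

$$B \succ A \ \text{ and } \ C \succ B \ \Rightarrow\ C \succ A.$$

A bet on B-vs-A and a bet on C-vs-B are two different situations with different option sets, so transitivity imposes no constraint across them.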

No Anthropic Evidence

2012-09-23T10:33:06.994Z · score: 10 (15 votes)

A Mathematical Explanation of Why Charity Donations Shouldn't Be Diversified

2012-09-20T11:03:48.603Z · score: 2 (25 votes)

Consequentialist Formal Systems

2012-05-08T20:38:47.981Z · score: 12 (13 votes)

Predictability of Decisions and the Diagonal Method

2012-03-09T23:53:28.836Z · score: 21 (16 votes)

Shifting Load to Explicit Reasoning

2011-05-07T18:00:22.319Z · score: 15 (21 votes)

Karma Bubble Fix (Greasemonkey script)

2011-05-07T13:14:29.404Z · score: 23 (26 votes)

Counterfactual Calculation and Observational Knowledge

2011-01-31T16:28:15.334Z · score: 11 (22 votes)

Note on Terminology: "Rationality", not "Rationalism"

2011-01-14T21:21:55.020Z · score: 31 (41 votes)

Unpacking the Concept of "Blackmail"

2010-12-10T00:53:18.674Z · score: 25 (34 votes)

Agents of No Moral Value: Constrained Cognition?

2010-11-21T16:41:10.603Z · score: 6 (9 votes)

Value Deathism

2010-10-30T18:20:30.796Z · score: 26 (48 votes)

Recommended Reading for Friendly AI Research

2010-10-09T13:46:24.677Z · score: 26 (31 votes)

Notion of Preference in Ambient Control

2010-10-07T21:21:34.047Z · score: 14 (19 votes)

Controlling Constant Programs

2010-09-05T13:45:47.759Z · score: 25 (38 votes)

Restraint Bias

2009-11-10T17:23:53.075Z · score: 16 (21 votes)

Circular Altruism vs. Personal Preference

2009-10-26T01:43:16.174Z · score: 11 (17 votes)

Counterfactual Mugging and Logical Uncertainty

2009-09-05T22:31:27.354Z · score: 10 (13 votes)

Bloggingheads: Yudkowsky and Aaronson talk about AI and Many-worlds

2009-08-16T16:06:18.646Z · score: 20 (22 votes)

Sense, Denotation and Semantics

2009-08-11T12:47:06.014Z · score: 9 (16 votes)

Rationality Quotes - August 2009

2009-08-06T01:58:49.178Z · score: 6 (10 votes)

Bayesian Utility: Representing Preference by Probability Measures

2009-07-27T14:28:55.021Z · score: 33 (18 votes)

Eric Drexler on Learning About Everything

2009-05-27T12:57:21.590Z · score: 31 (36 votes)

Consider Representative Data Sets

2009-05-06T01:49:21.389Z · score: 6 (11 votes)

LessWrong Boo Vote (Stochastic Downvoting)

2009-04-22T01:18:01.692Z · score: 3 (30 votes)

Counterfactual Mugging

2009-03-19T06:08:37.769Z · score: 56 (76 votes)

Tarski Statements as Rationalist Exercise

2009-03-17T19:47:16.021Z · score: 11 (21 votes)

In What Ways Have You Become Stronger?

2009-03-15T20:44:47.697Z · score: 26 (28 votes)

Storm by Tim Minchin

2009-03-15T14:48:29.060Z · score: 15 (22 votes)