Habryka's Shortform Feed

post by habryka (habryka4) · 2019-04-27T19:25:26.666Z · score: 62 (17 votes) · LW · GW · 74 comments

In an attempt to get myself to write more here is my own shortform feed. Ideally I would write something daily, but we will see how it goes.

Comments sorted by top scores.

comment by habryka (habryka4) · 2019-05-09T19:12:09.799Z · score: 64 (23 votes) · LW · GW

Thoughts on integrity and accountability

[Epistemic Status: Early draft version of a post I hope to publish eventually. Strongly interested in feedback and critiques, since I feel quite fuzzy about a lot of this]

When I started studying rationality and philosophy, I had the perspective that people who were in positions of power and influence should primarily focus on how to make good decisions in general and that we should generally give power to people who have demonstrated a good track record of general rationality. I also thought of power as this mostly unconstrained resource, similar to having money in your bank account, and that we should make sure to primarily allocate power to the people who are good at thinking and making decisions.

That picture has changed a lot over the years. While I think there is still a lot of value in the idea of "philosopher kings", I've made a variety of updates that significantly changed my relationship to allocating power in this way:

  • I have come to believe that people's ability to come to correct opinions about important questions is in large part a result of whether their social and monetary incentives reward them when they have accurate models in a specific domain. This means a person can have extremely good opinions in one domain of reality, because they are subject to good incentives, while having highly inaccurate models in a large variety of other domains in which their incentives are not well optimized.
  • People's rationality is much more defined by their ability to maneuver themselves into environments in which their external incentives align with their goals, than by their ability to have correct opinions while being subject to incentives they don't endorse. This is a tractable intervention and so the best people will be able to have vastly more accurate beliefs than the average person, but it means that "having accurate beliefs in one domain" doesn't straightforwardly generalize to "will have accurate beliefs in other domains".

    One is strongly predictive of the other, and that’s in part due to general thinking skills and broad cognitive ability. But another major piece of the puzzle is the person's ability to build and seek out environments with good incentive structures.
  • Everyone is highly irrational in their beliefs about at least some aspects of reality, and positions of power in particular tend to encourage strong incentives that don't tend to be optimally aligned with the truth. This means that highly competent people in positions of power often have less accurate beliefs than much less competent people who are not in positions of power.
  • The design of systems that hold people who have power and influence accountable in a way that aligns their interests with both forming accurate beliefs and the interests of humanity at large is a really important problem, and is a major determinant of the overall quality of the decision-making ability of a community. General rationality training helps, but for collective decision making the creation of accountability systems, the tracking of outcome metrics and the design of incentives is at least as big of a factor as the degree to which the individual members of the community are able to come to accurate beliefs on their own.

A lot of these updates have also shaped my thinking while working at CEA, LessWrong and the LTF-Fund over the past 4 years. I've been in various positions of power, and have interacted with many people who had lots of power over the EA and Rationality communities, and I've become a lot more convinced that there is a lot of low-hanging fruit and important experimentation to be done to ensure better levels of accountability and incentive-design for the institutions that guide our community.

I also generally have broadly libertarian intuitions, and a lot of my ideas about how to build functional organizations are based on a more start-up like approach that is favored here in Silicon Valley. Initially these intuitions seemed in conflict with the intuitions for more emphasis on accountability structures, with broken legal systems, ad-hoc legislation, dysfunctional boards and dysfunctional institutions all coming to mind immediately as accountability-systems run wild. I've since then reconciled my thoughts on these topics a good bit.

Integrity

Somewhat surprisingly, "integrity" has not been much discussed as a concept handle on LessWrong. But I've found it to be a pretty valuable virtue to meditate and reflect on.

I think of integrity as a more advanced form of honesty – when I say “integrity” I mean “acting in accordance with your stated beliefs.” Where honesty is the commitment to not speak direct falsehoods, integrity is the commitment to speak truths that actually ring true to yourself, not ones that are just abstractly defensible to other people. It is also a commitment to act on the truths that you do believe, and to communicate to others what your true beliefs are.

Integrity can be a double-edged sword. While it is good to judge people by the standards they expressed, it is also a surefire way to make people overly hesitant to update. If you get punished every time you change your mind because your new actions are now incongruent with the principles you explained to others before you changed your mind, then you are likely to stick with your principles for far longer than you would otherwise, even when evidence against your position is mounting.

The great benefit that I experienced from thinking of integrity as a virtue, is that it encourages me to build accurate models of my own mind and motivations. I can only act in line with ethical principles that are actually related to the real motivators of my actions. If I pretend to hold ethical principles that do not correspond to my motivators, then sooner or later my actions will diverge from my principles. I've come to think of a key part of integrity being the art of making accurate predictions about my own actions and communicating those as clearly as possible.

There are two natural ways to ensure that your stated principles are in line with your actions. You either adjust your stated principles until they match up with your actions, or you adjust your behavior to be in line with your stated principles. Both of those can backfire, and both of those can have significant positive effects.

Who Should You Be Accountable To?

In the context of incentive design, I find thinking about integrity valuable because it feels to me like the natural complement to accountability. The purpose of accountability is to ensure that you do what you say you are going to do, and integrity is the corresponding virtue of holding up well under high levels of accountability.

Highlighting accountability as a variable also highlights one of the biggest error modes of accountability and integrity – choosing too broad of an audience to hold yourself accountable to.

There is a tradeoff between the size of the group that you are being held accountable by, and the complexity of the ethical principles you can act under. Too large of an audience, and you will be held accountable by the lowest common denominator of your values, which will rarely align well with what you actually think is moral (if you've done any kind of real reflection on moral principles).

Too small or too memetically close of an audience, and you risk not enough people paying attention to what you do, to actually help you notice inconsistencies in your stated beliefs and actions. The smaller the group that is holding you accountable is, the smaller your inner circle of trust, which reduces the amount of total resources that can be coordinated under your shared principles.

I think a major mistake that even many well-intentioned organizations make is to try to be held accountable by some vague conception of "the public". As they make public statements, someone in the public will misunderstand them, causing a spiral of less communication, resulting in more misunderstandings, resulting in even less communication, culminating in an organization that is completely opaque about any of its actions and intentions, with the only communication being filtered by a PR department that has little interest in the observers acquiring any beliefs that resemble reality.

I think a generally better setup is to choose a much smaller group of people that you trust to evaluate your actions very closely, and ideally do so in a way that is itself transparent to a broader audience. Common versions of this are auditors, as well as nonprofit boards that try to ensure the integrity of an organization.

This is all part of a broader reflection on trying to create good incentives for myself and the LessWrong team. I will probably follow this up with a post that more concretely summarizes my thoughts on how all of this applies to LessWrong.

In summary:

  • One lens to view integrity through is as an advanced form of honesty – “acting in accordance with your stated beliefs.”
    • To improve integrity, you can either try to bring your actions in line with your stated beliefs, or your stated beliefs in line with your actions, or rework both at the same time. These options all have failure modes, but also potential benefits.
  • People with power sometimes have incentives that systematically warp their ability to form accurate beliefs, and (correspondingly) to act with integrity.
  • An important tool for maintaining integrity (in general, and in particular as you gain power) is to carefully think about what social environment and incentive structures you want for yourself.
  • Choose carefully who, and how many people, you are accountable to:
    • Too many people, and you are limited in the complexity of the beliefs and actions that you can justify.
    • Too few people, too similar to you, and you won’t have enough opportunities for people to notice and point out what you’re doing wrong. You may also not end up with a strong enough coalition aligned with your principles to accomplish your goals.
comment by Raemon · 2019-05-12T02:57:08.357Z · score: 12 (6 votes) · LW · GW

Just wanted to say I like this a lot and think it'd be fine as a full fledged post. :)

comment by Zvi · 2019-06-02T11:36:51.491Z · score: 5 (3 votes) · LW · GW

More than fine. Please do post a version on its own. A lot of strong insights here, and where I disagree there's good stuff to chew on. I'd be tempted to respond with a post.

I do think this has a different view of integrity than I have, but in writing it out, I notice that the word is overloaded and that I don't have as good a grasp of its details as I'd like. I'm hesitant to throw out a rival definition until I have a better grasp here, but I think the thing you're in accordance with is not beliefs so much as principles?

comment by elityre · 2019-06-02T09:45:41.665Z · score: 1 (1 votes) · LW · GW

Seconded.

comment by Kaj_Sotala · 2019-06-02T15:45:02.713Z · score: 3 (1 votes) · LW · GW

Thirded.

comment by elityre · 2019-06-02T09:45:25.048Z · score: 10 (3 votes) · LW · GW

This was a great post that might have changed my worldview some.

Some highlights:

1.

People's rationality is much more defined by their ability to maneuver themselves into environments in which their external incentives align with their goals, than by their ability to have correct opinions while being subject to incentives they don't endorse. This is a tractable intervention and so the best people will be able to have vastly more accurate beliefs than the average person, but it means that "having accurate beliefs in one domain" doesn't straightforwardly generalize to "will have accurate beliefs in other domains".

I've heard people say things like this in the past, but haven't really taken it seriously as an important component of my rationality practice. Somehow what you say here is compelling to me (maybe because I recently noticed a major place where my thinking was heavily constrained by my social ties and social standing) and it prodded me to think about how to build "mech suits" that not only increase my power but also incentivize my rationality. I now have a todo item to "think about principles for incentivizing true beliefs, in team design."

2.

I think a generally better setup is to choose a much smaller group of people that you trust to evaluate your actions very closely,

Similarly, thinking explicitly about which groups I want to be accountable to sounds like a really good idea.

I had been going through the world keeping this Paul Graham quote in mind...

I think the best test is one Gino Lee taught me: to try to do things that would make your friends say wow. But it probably wouldn't start to work properly till about age 22, because most people haven't had a big enough sample to pick friends from before then.

...choosing good friends, and doing things that would impress them.

But what you're pointing at here seems like a slightly different thing: which people do I want to make myself transparent to, so that they can judge if I'm living up to my values?

This also gave me an idea for a CFAR style program: a reassess your life workshop, in which a small number of people come together for a period of 3 days or so, and reevaluate cached decisions. We start by making lines of retreat (with mentor assistance), and then look at high impact questions in our life: given new info, does your current job / community / relationship / life-style choice / other still make sense?

Thanks for writing.


comment by MakoYass · 2019-05-12T08:41:07.187Z · score: 3 (2 votes) · LW · GW

I think you might be conflating two things under "integrity". Having more confidence in your own beliefs than the shared/imposed beliefs of your community isn't really a virtue or.. it's more just a condition that a person can be in, whether it's virtuous is completely contextual. Sometimes it is, sometimes it isn't. I can think of lots of people who should have more confidence in other people's beliefs than they have in their own. In many domains, that's me. I should listen more. I should act less boldly. An opposite of that sense of integrity is the virtue of respect- recognising other people's qualities- it's a skill. If you don't have it, you can't make use of other people's expertise very well. A superfluence of respect is a person who is easily moved by others' feedback, usually, a person who is patient with their surroundings.

On the other hand I can completely understand the value of {having a known track record of staying true to self-expression, claims made about the self}. Humility is actually a part of that. The usefulness of delineating that into a virtue separate from the more general Honesty is clear to me.

comment by Pattern · 2019-06-04T19:43:05.566Z · score: 3 (2 votes) · LW · GW

There's a lot of focus on personally updating based on evidence. Groups aren't addressed as much. What does it mean for a group to have a belief? To have honesty or integrity?

comment by ioannes_shade · 2019-05-19T15:58:48.860Z · score: 1 (1 votes) · LW · GW

See Sinclair: "It is difficult to get a man to understand something, when his salary depends upon his not understanding it!"


comment by habryka (habryka4) · 2019-04-28T00:02:18.467Z · score: 42 (11 votes) · LW · GW

Thoughts on voting as approve/disapprove and agree/disagree:

One of the things that I am most uncomfortable with in the current LessWrong voting system is how often I feel conflicted between upvoting something because I want to encourage the author to write more comments like it, and downvoting something because I think the argument that the author makes is importantly flawed and I don't want other readers to walk away with a misunderstanding about the world.

I think this effect quite strongly limits certain forms of intellectual diversity on LessWrong, because many people will only upvote your comment if they agree with it, and downvote comments they disagree with, and this means that arguments supporting people's existing conclusions have a strong advantage in the current karma system. Whereas the most valuable comments are likely ones that challenge existing beliefs and that are rigorously arguing for unpopular positions.

A feature that has been suggested many times over the years is to split voting into two dimensions. One dimension being "agree/disagree" and the other being "approve/disapprove". Only the "approve/disapprove" dimension matters for karma and sorting, but both are displayed relatively prominently on the comment (the agree/disagree dimension on the bottom, the approve/disapprove dimension at the top). I think this has some valuable things going for it, and in particular would make me likely to upvote more comments because I could simultaneously signal that while I think a comment was good, I don't agree with it.
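As a rough illustration of the two-dimensional scheme, here is a minimal Python sketch. The `Comment` class, its field names, and `sort_comments` are all hypothetical and are not taken from the actual LessWrong codebase; the only point is that sorting looks at the approval total while the agreement total is stored and displayed separately.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    text: str
    approval: int = 0   # approve/disapprove total: drives karma and sorting
    agreement: int = 0  # agree/disagree total: displayed only

    def vote(self, approve: int = 0, agree: int = 0) -> None:
        """Record one user's vote; each argument is -1, 0, or +1."""
        self.approval += approve
        self.agreement += agree


def sort_comments(comments):
    # Only the approval dimension affects ordering.
    return sorted(comments, key=lambda c: c.approval, reverse=True)


popular = Comment("Restates the consensus view")
contrarian = Comment("Rigorous case for an unpopular position")

popular.vote(approve=1, agree=1)
contrarian.vote(approve=1, agree=-1)  # "good comment, but I disagree"
contrarian.vote(approve=1, agree=-1)

ranked = sort_comments([popular, contrarian])
```

In this toy run the contrarian comment sorts first even though its agreement total is negative, which is the "upvoted though I still disagree" combination the split is meant to allow.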

An alternative way of doing this that Ray has talked about is the introduction of short reactions that users can click at the bottom of a comment; two of the most prominently displayed ones would be "agree/disagree". Reactions would be by default non-anonymous and so would serve more as a form of shorthand comment instead of an alternative voting system. Here is an example of how that kind of UI might look:

[Image: React mockup]

I don't know precisely what the selection menu for choosing reactions should look like. My guess is we want to have a relatively broad selection, maybe even with the ability to type something custom into it (obviously limiting the character count significantly).

I am most worried that this will drastically increase the clutter of comment threads and make things a lot harder to parse. In particular if the order of the reacts is different on each comment, since then there is no reliable way of scanning for the different kinds of information.

A way to improve on this might be by having small icons for the most frequent reacts, but that then introduces a pretty sharp learning curve into the site, and it's always a pain to find icons for really abstract concepts like "agree/disagree".

I think I am currently coming around to the idea of reactions being a good way to handle approve/disapprove, but also think it might make more sense to introduce it as a new kind of vote that has more top-level support than simple reacts would have. Though in the most likely case this whole dimension will turn out to be too complicated and not worth the complexity costs (as 90% of feature ideas do).

comment by MakoYass · 2019-05-01T07:40:16.282Z · score: 18 (6 votes) · LW · GW

Having a reaction for "changed my view [LW · GW]" would be very nice.

Features like custom reactions give me this feeling that.. language will emerge from allowing people to create reactions that will be hard to anticipate but, in retrospect, crucial. Playing a similar role that body language plays during conversation, but designed, defined, explicit.

If someone did want to introduce the delta through this system, it might be necessary to give the coiner of a reaction some way of linking an extended description. In casual exchanges.. I've found myself reaching for an expression that means "shifted my views in some significant lasting way" that's kind of hard to explain in precise terms, and probably impossible to reduce to one or two words, but it feels like a crucial thing to measure. In my description, I would explain that a lot of dialogue has no lasting impact on its participants, it is just two people trying to better understand where they already are. When something really impactful is said, I think we need to establish a habit of noticing and recognising that.

But I don't know. Maybe that's not the reaction type that will justify the feature. Maybe it will be something we can't think of now.

Generally, it seems useful to be able to take reduced measurements of the mental states of the readers.

comment by Said Achmiz (SaidAchmiz) · 2019-05-01T10:23:39.331Z · score: 13 (5 votes) · LW · GW

the language that will emerge from allowing people to create reactions that will be hard to anticipate but, in retrospect, crucial

This is essentially the concept of a folksonomy, and I agree that it is potentially both applicable here and quite important.

comment by Rob Bensinger (RobbBB) · 2019-04-28T03:40:56.888Z · score: 5 (3 votes) · LW · GW
I am most worried that this will drastically increase the clutter of comment threads and make things a lot harder to parse. In particular if the order of the reacts is different on each comment, since then there is no reliable way of scanning for the different kinds of information.

I like the reactions UI above, partly because separating it from karma makes it clearer that it's not changing how comments get sorted, and partly because I do want 'agree'/'disagree' to be non-anonymous by default (unlike normal karma).

I agree that the order of reacts should always be the same. I also think every comment/post should display all the reacts (even just to say '0 Agree, 0 Disagree...') to keep things uniform. That means I think there should only be a few permitted reacts -- maybe start with just 'Agree' and 'Disagree', then wait 6+ months and see if users are especially clamoring for something extra.

I think the obvious other reacts I'd want to use sometimes are 'agree and downvote' + 'disagree and upvote' (maybe shorten to Agree+Down, Disagree+Up), since otherwise someone might not realize that one and the same person is doing both, which loses a fair amount of this thing I want to be fluidly able to signal. (I don't think there's much value to clearly signaling that the same person agreed and upvoted or disagreed and downvoted a thing.)

I would also sometimes click both the 'agree' and 'disagree' buttons, which I think is fine to allow under this UI. :)

comment by Said Achmiz (SaidAchmiz) · 2019-04-28T03:28:02.974Z · score: 2 (1 votes) · LW · GW

Why not Slashdot-style?

comment by habryka (habryka4) · 2019-04-28T06:11:08.735Z · score: 5 (3 votes) · LW · GW

Slashdot has tags, but each tag still comes with a vote. In the above, the goal would be explicitly to allow for the combination of "upvoted though I still disagree" which I don't think would work straightforwardly with the Slashdot system.

I also find it quite hard to skim for anything on Slashdot, including the tags (and the vast majority of users can't add reactions on Slashdot at any given time, so there isn't much UI for it).

comment by habryka (habryka4) · 2019-05-04T06:02:22.297Z · score: 29 (10 votes) · LW · GW

Thoughts on minimalism, elegance and the internet:

I have this vision for LessWrong of a website that gives you the space to think for yourself, and doesn't constantly distract you with flashy colors and bright notifications and vibrant pictures. Instead it tries to be muted in a way that allows you to access the relevant information, but still gives you the space to disengage from the content of your screen, take a step back and ask yourself "what are my goals right now?".

I don't know how well we achieved that so far. I like our frontpage, and I think the post-reading experience is quite exceptionally focused and clear, but I think there is still something about the way the whole site is structured, with its focus on recent content and new discussion that often makes me feel scattered when I visit the site.

I think a major problem is that LessWrong doesn't make it easy to do only a single focused thing on the site at a time, and it doesn't currently really encourage you to engage with the site in a focused way. We have the library, which I do think is decent, but the sequence navigation experience is not yet fully what I would like it to be, and when I go to the frontpage the primary thing I still see is recent content. Not the sequences I recently started reading, or the practice exercises I might want to fill out, or the open questions I might want to answer.

I think there are a variety of ways to address this, some of which I hope to build very soon:

+ The frontpage should show you not only recent content, but also show you much older historical content (that can be of much higher quality, due to being drawn from a much larger pool). [We have a working prototype of this, and I hope we can push it soon]

+ We should encourage you to read whole sequences at a time, instead of individual posts. If you start reading a sequence, you should be encouraged to continue reading it from the frontpage [This is also quite close to working]

+ There should be some way to encourage people to put serious effort into answering the most important open questions [This is currently mostly bottlenecked on making the open-question system/UX good enough to make real progress in]

+ You should be able to easily bookmark posts and comments to allow you to continue reading something at a later point in time [We haven't really started on this, but it's pretty straightforward, so I still think this isn't too far off]

+ I would love it if there were real rationality exercises in many of the sequences, in a way that would periodically require you to write essays and answer questions and generally check your understanding. This is obviously quite difficult to make happen, both in terms of UI, but also in terms of generating the content

If we had all of these, in particular the open questions one, I think I would feel more like LessWrong is oriented towards my long-term growth instead of trying to give me short-term reinforcement. It would also create a natural space in which to encourage focused work and generally make me feel less scattered when I visit the site, due to deemphasizing the most recent wave of content.

I do think there are problems with deemphasizing more recent content, mostly because this indirectly disincentivizes creating new content, which I do think would obviously be bad for the site. Though in some sense it might encourage the creation of longer-lived content, which would be quite good for the site.

comment by MakoYass · 2019-05-04T23:19:40.869Z · score: 4 (3 votes) · LW · GW
The frontpage should show you not only recent content, but also show you much older historical content

When I was a starry eyed undergrad, I liked to imagine that reddit might resurrect old posts if they gained renewed interest, if someone rediscovered something and gave it a hard upvote, that would put it in front of more judges, which might lead to a cascade of re-approval that hoists the post back into the spotlight. There would be no need for reposts, evergreen content would get due recognition, a post wouldn't be done until the interest of the subreddit (or, generally, user cohort) is really gone.

Of course, reddit doesn't do that at all. Along with the fact that threads are locked after a year, this is one of many reasons it's hard to justify putting a lot of time into writing for reddit.

comment by habryka (habryka4) · 2019-09-19T04:49:46.970Z · score: 26 (7 votes) · LW · GW

What is the purpose of karma?

LessWrong has a karma system, mostly based off of Reddit's karma system, with some improvements and tweaks to it. I've thought a lot about more improvements to it, but one roadblock that I always run into when trying to improve the karma system, is that it actually serves a lot of different uses, and changing it in one way often means completely destroying its ability to function in a different way. Let me try to summarize what I think the different purposes of the karma system are:

Helping users filter content

The most obvious purpose of the karma system is to determine how long a post is displayed on the frontpage, and how much visibility it should get.

Being a social reward for good content

This aspect of the karma system comes out more when thinking about Facebook "likes". Often when I upvote a post, it is more of a public signal that I value something, with the goal that the author will feel rewarded for putting their effort into writing the relevant content.

Creating common-knowledge about what is good and bad

This aspect of the karma system comes out the most when dealing with debates, though it's present in basically any karma-related interaction. The fact that the karma of a post is visible to everyone helps people establish common knowledge of what the community considers to be broadly good or broadly bad. Seeing an insult downvoted does more than just filter it out of people's feeds; it also makes it so that anyone who stumbles across it learns something about the norms of the community.

Being a low-effort way of engaging with the site

On LessWrong, Reddit and Facebook, karma is often the simplest action you can take on the site. This means it's usually key for a karma system like that to be extremely simple, and not require complicated decisions, since that would break the basic engagement loop with the site.

Problems with alternative karma systems

Here are some of the most common alternatives to our current karma system, and how they perform on the above dimensions:

Eigenkarma as weighted by a set of core users

The basic idea here is that you try to signal-boost a small set of trusted users, by giving people voting power that is downstream from the initially defined set of users.

There are some problems with this. The first one is whether to assign any voting power to new users. If you don't, you remove a large part of the value of having a low-effort way of engaging with your site.

It also forces you to separate the points that you get on your content, from your total karma score, from your "karma-trust score" which introduces some complexity into the system. It also makes it so that increases in the points of your content, no longer neatly correspond to voting events, because the underlying reputation graph is constantly shifting and changing, making the social reward signal a lot weaker.

In exchange for this, you likely get a system that is better at filtering content, and probably has better judgement about what should be made common-knowledge or not.
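To make "voting power that is downstream from the initially defined set of users" concrete, here is a minimal Python sketch. It assumes one particular formalization, a personalized-PageRank-style iteration over an upvote graph; the function name `eigenkarma`, the `damping` parameter, and the example users are all hypothetical, and other eigenkarma variants are possible.

```python
def eigenkarma(upvotes, seed_users, damping=0.85, iterations=50):
    """Propagate trust from seed users along upvote edges.

    upvotes maps each voter to the list of authors they have upvoted.
    Returns a dict of trust scores that sum to 1 (up to rounding).
    """
    users = set(seed_users) | set(upvotes)
    for authors in upvotes.values():
        users.update(authors)

    # All trust originates from the seed set (an assumption of this sketch).
    seed_weight = {
        u: (1.0 / len(seed_users) if u in seed_users else 0.0) for u in users
    }
    trust = dict(seed_weight)

    for _ in range(iterations):
        new_trust = {u: (1.0 - damping) * seed_weight[u] for u in users}
        for voter in users:
            authors = upvotes.get(voter, [])
            if authors:
                # A voter splits their trust evenly among upvoted authors.
                share = damping * trust[voter] / len(authors)
                for author in authors:
                    new_trust[author] += share
            else:
                # Users who have not upvoted anyone hand trust back to seeds.
                for seed in seed_users:
                    new_trust[seed] += damping * trust[voter] / len(seed_users)
        trust = new_trust
    return trust


upvotes = {
    "alice": ["bob"],
    "bob": ["carol"],
    "mallory": ["mallory_sock"],   # two accounts upvoting each other
    "mallory_sock": ["mallory"],
}
trust = eigenkarma(upvotes, seed_users=["alice"])
```

In this toy graph the two accounts that only upvote each other end up with zero trust, since nothing flows to them from the seed set. That is the filtering benefit, and it is also the new-user problem described above: anyone not yet upvoted by a trusted user has no voting power at all.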

Prediction-based system

I was talking with Scott Garrabrant today, who was excited about a prediction-based karma system. The basic idea is to just have a system that tries to do its best to predict what rating you are likely to give to a post, based on your voting record, the post, and other people's votes.

In some sense this is what Youtube and Facebook are doing in their systems, though he was unhappy with the transparency of what they were doing.

The biggest sacrifice I see in creating this system, is the loss in the ability to create common knowledge, since now all votes are ultimately private, and the ability for karma to establish social norms, or just common knowledge about foundational facts that the community is built around, is greatly diminished.

I also think it diminishes the degree to which votes can serve as a social reward signal, since there is no obvious thing to inform the user of when their content got votes on. No number that went up or down, just a few thousand weights in some distant predictive matrix, or neural net.
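For illustration only, here is a very small Python sketch of what "predict what rating you are likely to give to a post, based on your voting record and other people's votes" could look like. It assumes a simple similarity-weighted average over other users' votes; this is a stand-in chosen for brevity, not the system Scott described, and the names `similarity` and `predict_vote` are hypothetical.

```python
def similarity(votes_a, votes_b):
    """Mean agreement on posts both users voted on, in [-1, 1]."""
    shared = set(votes_a) & set(votes_b)
    if not shared:
        return 0.0
    return sum(votes_a[p] * votes_b[p] for p in shared) / len(shared)


def predict_vote(user, post, all_votes):
    """Predict a user's vote on a post as a similarity-weighted average of
    other users' votes on that post. Returns 0.0 when there is no signal."""
    weighted_sum = 0.0
    total_weight = 0.0
    for other, votes in all_votes.items():
        if other == user or post not in votes:
            continue
        weight = similarity(all_votes[user], votes)
        weighted_sum += weight * votes[post]
        total_weight += abs(weight)
    return weighted_sum / total_weight if total_weight else 0.0


# Votes are +1 (upvote) or -1 (downvote), keyed by post id.
all_votes = {
    "reader":   {"p1": 1, "p2": -1},
    "ally":     {"p1": 1, "p2": -1, "p3": 1},
    "opposite": {"p1": -1, "p2": 1, "p3": -1},
    "newcomer": {},
}

for_reader = predict_vote("reader", "p3", all_votes)
for_newcomer = predict_vote("newcomer", "p3", all_votes)
```

Note that the output is a per-reader number: `for_reader` and `for_newcomer` differ for the same post, so there is no single public score for anyone to point at, which is the common-knowledge cost discussed above.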

Augmenting experts

A similar formulation to the eigenkarma system is the idea of trying to augment experts, by rewarding users in proportion to how successful they are at predicting how a trusted expert would vote, and then using that predicted expert's vote as the reward signal. Periodically, you do query the trusted expert, and use that to calibrate and train the users who are trying to predict the expert.

This still allows you to build common-knowledge, and allows you to have effective reward signals ("simulated Eliezer upvoted your comment"), but does run into problems when it comes to being a low-effort way of engaging with the site. The operation of "what would person X think about this comment" is a much more difficult one than "did I like this comment?", and as such might deter a large number of users from using your site.
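
As a rough sketch of the mechanics, assuming votes are +1/-1 and that the expert has been queried on a small calibration set: each user is rewarded by how often they matched the expert on that set, and the "simulated expert" vote on a new item is the accuracy-weighted average of users' predictions. The function names and data below are hypothetical, and the weighting scheme is just one simple choice among many.

```python
def user_accuracy(predictions, expert_votes):
    """Fraction of calibration items on which the user matched the expert."""
    scored = [item for item in predictions if item in expert_votes]
    if not scored:
        return 0.0
    hits = sum(1 for item in scored if predictions[item] == expert_votes[item])
    return hits / len(scored)

def simulated_expert_vote(item, all_predictions, expert_votes):
    """Accuracy-weighted average of users' predictions for an item the expert has not rated."""
    weighted_sum = 0.0
    total_weight = 0.0
    for predictions in all_predictions.values():
        if item not in predictions:
            continue
        weight = user_accuracy(predictions, expert_votes)
        weighted_sum += weight * predictions[item]
        total_weight += weight
    return weighted_sum / total_weight if total_weight else 0.0

# Hypothetical data: the expert was actually queried on c1 and c2 only.
expert_votes = {"c1": 1, "c2": -1}
all_predictions = {
    "dana": {"c1": 1, "c2": -1, "c3": 1},
    "eli": {"c1": -1, "c2": -1, "c3": -1},
}
rewards = {user: user_accuracy(p, expert_votes) for user, p in all_predictions.items()}
simulated = simulated_expert_vote("c3", all_predictions, expert_votes)
```

Here dana matched the expert on both calibration items and eli on one of two, so dana's prediction for `c3` counts twice as much as eli's and the simulated expert vote comes out mildly positive.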

comment by Ruby · 2019-09-25T05:37:08.003Z · score: 2 (1 votes) · LW · GW

This is really good and I missed it until now. I vote for you making this a full-on post. I think it's fine as is for that.

comment by habryka (habryka4) · 2019-07-14T18:10:55.934Z · score: 25 (6 votes) · LW · GW

Is intellectual progress in the head or in the paper?

Which of the two generates more value:

  • A researcher writes up a core idea in their field, but only a small fraction of good people read it in the next 20 years
  • A researcher gives a presentation at a conference to all the best researchers in his field, but none of them write up the idea later

I think which of the two will generate more value determines a lot of your strategy about how to go about creating intellectual progress. In one model what matters is that the best individuals hear about the most important ideas in a way that then allows them to make progress on other problems. In the other model what matters is that the idea gets written up as an artifact that can be processed and evaluated by reviews and the proper methods of the scientific process, and then built upon when referenced and cited.

I think there is a tradeoff of short-term progress against long-term progress in these two approaches. I think many fields can go through intense periods of progress when focusing on just establishing communication between the best researchers of the field, but would be surprised if that period lasts longer than one or two decades. Here are some reasons for why that might be the case:

  • A long-lasting field needs a steady supply of new researchers and thinkers, both to bring in new ideas, and also to replace the old researchers who retire. If you do not write up your ideas, the ability for a field to evaluate the competence of a researcher has to rely on the impressions of individual researchers. My sense is that relying on that kind of implicit impression does not survive multiple successions and will get corrupted by people trying to use their influence for some other means within two decades.
  • You are blocking yourself off from interdisciplinary progress. After a decade or two, fields often end up in a rut that needs some new paradigm or at least a new idea to allow people to make progress again. If you don't write up your ideas publicly, you lose a lot of opportunities for interdisciplinary researchers to enter your field and bring in ideas from other places.
  • You make it hard to improve on research debt because there is no canonical reference that can be updated with better explanations and better definitions. (Current journals don't do particularly well on this, but this is an opportunity that wiki-like systems can take advantage of, or with some kind of set of published definitions like the DSM-5, and new editions of textbooks also help with this)
  • If you are a theoretical field, you are making it harder for your ideas to get implemented or transformed into engineering problems. This prevents your field from visibly generating value, which reduces both the total number of people who want to join your field, and also the interest of other people in investing resources into your field.

However, you also gain a large number of benefits, that will probably increase your short-term output significantly:

  • Through the use of in-person conversations and conferences the cost of communicating a new idea and letting others build on it is often an order of magnitude smaller
  • Your ability to identify the best talent can now be directly downstream of the taste of the best people in the field, which allows you to identify researchers who are not great at writing, but still great at thinking
  • The complexity limit of any individual idea in your field is a lot higher, since the ideas get primarily transmitted via high-bandwidth channels
  • Your cycles of getting feedback on your ideas from other people in the field are a lot faster, since your ideas don't need to go through a costly writeup and review phase

My current model is that it's often good for research fields to go through short periods (< 2 years) in which there is a lot of focus on just establishing good communications among the best researchers, either with a parallel investment in trying to write up at least the basics of the discussion, or a subsequent clean-up period in which the primary focus is on writing up the core insights that all the best researchers converged on.

comment by Ruby · 2019-07-15T05:14:33.032Z · score: 7 (3 votes) · LW · GW
The complexity limit of any individual idea in your field is a lot higher, since the ideas get primarily transmitted via high-bandwidth channels

Depends if you're sticking specifically to "presentation at a conference", which I don't think is necessarily that "high bandwidth". Very loosely, I think it's something like (ordered by "bandwidth"): repeated small group or individual interaction (e.g. apprenticeship, collaboration) >> written materials >> presentations. I don't think I could have learned Kaj's models of multi-agent minds from a conference presentation (although possibly from a lecture series). I might have learnt even more if I was his apprentice.

comment by Pattern · 2019-07-23T02:33:11.054Z · score: 1 (1 votes) · LW · GW
A researcher gives a presentation at a conference to all the best researchers in his field, but none of them write up the idea later

What if someone makes a video? (Or the powerpoint/s used in the conference are released to the public?)

comment by habryka (habryka4) · 2019-07-23T06:06:18.232Z · score: 2 (1 votes) · LW · GW

This was presuming that that would not happen (for example, because there is a vague norm that things are kind-of confidential and shouldn't be posted publicly).

comment by habryka (habryka4) · 2019-04-27T19:28:25.066Z · score: 25 (9 votes) · LW · GW

Thoughts on negative karma notifications:

  • An interesting thing that I and some other people on the LessWrong team noticed (as well as some users) was that since we created karma notifications we feel a lot more hesitant to downvote older comments, since we know that this will show up for the other users as a negative notification. I also feel a lot more hesitant to retract my own strong upvotes or upvotes in general since the author of the comment will see that as a downvote.
  • I've had many days in a row in which I received +20 or +30 karma, followed by a single day where by chance I received a single downvote and ended up at -2. The emotional valence of having a single day at -2 was somehow stronger than the emotional valence of multiple days of +20 or +30.
comment by Jan_Kulveit · 2019-04-29T19:47:53.575Z · score: 11 (4 votes) · LW · GW

What I noticed on the EA forum is that the whole karma thing is messing with my S1 processes and makes me unhappy on average. I've not only turned off the notifications, but also hidden all karma displays in comments via CSS, and the experience is much better.

comment by habryka (habryka4) · 2019-04-29T20:41:09.524Z · score: 4 (2 votes) · LW · GW

I... feel conflicted about people deactivating the display of karma on their own comments. In many ways karma (and downvotes in particular) serves as a really important feedback source, and I generally think that people who reliably get downvoted should change how they are commenting, and them not doing so usually comes at high cost. I think this is more relevant to new users, but is still relevant for most users.

Deactivating karma displays feels a bit to me like someone who shows up at a party and says "I am not going to listen to any subtle social feedback that people might give me about my behavior, and I will just do things until someone explicitly tells me to stop", which I think is sometimes the correct behavior and has some good properties in terms of encouraging diversity of discussion, but I also expect that this can have some pretty large negative impact on the trust and quality of the social atmosphere.

On the other hand, I want people to have control over the incentives that they are under, and think it's important to give users a lot of control over how they want to be influenced by the platform.

And there is also the additional thing, which is that if users just deactivate the karma display for their comments without telling anyone then that creates an environment of ambiguity where it's very unclear whether someone receives the feedback you are giving them at all. In the party metaphor this would be like showing up and not telling anyone that you are not going to listen to subtle social feedback, which I think can easily lead to unnecessary escalation of conflict.

I don't have a considered opinion on what to incentivize here, besides being pretty confident that I wouldn't want most people to deactivate their karma displays, and that I am glad that you told me here that you did. This means that I will err on the side of leaving feedback by replying in addition to voting (though this obviously comes at a significant cost to me, so it might be game theoretically better for me to not shift towards replying, but I am not sure of that. Will think more about it).

There are also some common-knowledge effects that get really weird when one person is interacting with the discussion with a different set of data than I am seeing. I.e. I am going to reply to a downvoted comment in a way that assumes that many people thought the comment was bad and will try to explain potential reasons for why people might have downvoted it, but if you have karma displays disabled then you might perceive me as making a kind of social attack where I claim the support of some kind of social group without backing it up. I think this makes me quite hesitant to participate in discussions with that kind of weird information asymmetry.

comment by Jan_Kulveit · 2019-04-30T02:29:12.995Z · score: 4 (2 votes) · LW · GW

Actually I turned off the karma for all comments, not just mine. The bold claim is that my individual taste in what's good on the EA forum is in important ways better than the karma system, and the karma signal is similar to sounds made by a noisy mob. If I want, I can actually predict reasonably well what sounds the crowd will make on average, so it is not any new source of information. But it still messes with your S1 processing and motivations.

Continuing with the party metaphor, I think it is generally not that difficult to understand what sort of behaviour will make you popular at a party, and what sort of behaviours, even when they are quite good in a broader scheme of things, will make you unpopular at parties. Also personally I often feel something like "I actually want to have good conversations about juicy topics in a quiet place, unfortunately all you people are congregating in this super loud space, with all these status games, social signals, and ethically problematic norms about how to treat other people" toward most parties.

Overall I posted this here because it seemed like an interesting datapoint. Generally I think it would be great if people moved toward writing information-rich feedback instead of voting, so such a shift seems good. From what I've seen on the EA forum it's quite rarely "many people" doing anything. More often it is like 6 users upvote a comment, 1 user strongly downvotes it, and something like karma 2 is the result. I would guess you may be at greater risk of the distorted perception that this represents some meaningful opinion of the community. (Also I see some important practical cases where people are misled by "noises of the crowd" and it influences them in a harmful way.)

comment by Said Achmiz (SaidAchmiz) · 2019-04-29T21:12:07.323Z · score: 4 (2 votes) · LW · GW

Well… you can’t actually stop people from activating custom CSS that hides karma values. It doesn’t matter how you feel about it—you can’t affect it! It’s therefore probably best to create some mechanism that gives people what they want to get out of hiding karma, while still giving you what you want out of showing people karma (e.g., a “hide karma but give me a notification if one of my comments is quite strongly downvoted” option—not suggesting this exact thing, just brainstorming…).

comment by habryka (habryka4) · 2019-04-29T21:49:12.287Z · score: 4 (2 votes) · LW · GW

Hmm, I agree that I can't prevent it in that sense, but I think defaults matter a lot here, as does just normal social feedback and whatever the social norms are.

It's not at all clear to me that the current equilibrium isn't pretty decent, where people can do it, but it's reasonably inconvenient to do it, and so allows the people who are disproportionately negatively affected by karma notification to go that route. I would be curious whether there are any others who do the same as Jan does, and if there are many, then we can figure out what the common motivations are and see whether it makes sense to elevate it to some site-level feature.

comment by Jan_Kulveit · 2019-04-30T02:37:18.189Z · score: 6 (3 votes) · LW · GW

FWIW I also think it's quite possible the current equilibrium is decent (which is part of the reason why I did not post something like "How I turned karma off" with simple instructions about how to do it on the forum, which I did consider). On the other hand I'd be curious about more people trying it and reporting their experiences.

I suspect many people kind of don't have this action in the space of things they usually consider - I'd expect what most people would do is 1) just stop posting 2) write about their negative experience 3) complain privately.

comment by Said Achmiz (SaidAchmiz) · 2019-04-29T22:16:32.290Z · score: 6 (2 votes) · LW · GW

It’s not at all clear to me that the current equilibrium isn’t pretty decent, where people can do it, but it’s reasonably inconvenient to do it, and so allows the people who are disproportionately negatively affected by karma notification to go that route.

But this is an extremely fragile equilibrium. It can be broken by, say, someone posting a set of simple instructions on how to do this. For instance:

Anyone running the uBlock Origin browser extension can append several lines to their “My Filters” tab in the uBlock extension preferences, and thus totally hide all karma-related UI elements on Less Wrong. (PM me if you want the specific lines to append.)

Or someone makes a browser extension to do this. Or a user style. Or…

comment by Zvi · 2019-04-28T23:32:36.723Z · score: 9 (5 votes) · LW · GW

If people are checking karma changes constantly and getting emotional validation or pain from the result, that seems like a bad result. And yes, the whole 'one -2 and three +17s feels like everyone hates me' thing is real, can confirm.

comment by habryka (habryka4) · 2019-04-29T00:21:51.470Z · score: 5 (3 votes) · LW · GW

Because of the way we do batching you can't check karma changes constantly (unless you go out of your way to change your setting) because we batch karma notifications on a 24h basis by default.
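
A minimal sketch of what batching on a 24h basis means here, purely as an illustration and not the actual LessWrong implementation: individual karma changes are summed into one net number per day, and only that number is shown. The function name, the UTC day boundary, and the sample events are all assumptions.

```python
from collections import defaultdict
from datetime import datetime, timezone

def batch_karma_by_day(events):
    """Sum (timestamp, karma_change) events into one net total per UTC day."""
    totals = defaultdict(int)
    for timestamp, change in events:
        day = timestamp.astimezone(timezone.utc).date()
        totals[day] += change
    return dict(sorted(totals.items()))

# Hypothetical vote events on one user's content
events = [
    (datetime(2019, 4, 27, 9, 0, tzinfo=timezone.utc), 2),
    (datetime(2019, 4, 27, 21, 30, tzinfo=timezone.utc), 5),
    (datetime(2019, 4, 28, 8, 15, tzinfo=timezone.utc), -2),
]
daily = batch_karma_by_day(events)
```

With these events the user would see one +7 notification for the first day and one -2 notification for the second, which is the kind of single negative day discussed above.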

comment by DanielFilan · 2019-04-30T18:36:08.424Z · score: 5 (3 votes) · LW · GW

I mean, you can definitely check your karma multiple times a day to see where the last two sig digits are at, which is something I sometimes do.

comment by habryka (habryka4) · 2019-04-30T18:40:24.465Z · score: 3 (2 votes) · LW · GW

True. We did very intentionally avoid putting your total karma on the frontpage anywhere, as most other platforms do, to avoid people getting sucked into that unintentionally, but you can still do that on your profile.

I hope we aren't wasting a lot of people's time by causing them to check their profile all the time. If we do, it might be the correct choice to also only update that number every 24h.

comment by Rob Bensinger (RobbBB) · 2019-04-30T23:17:48.363Z · score: 2 (1 votes) · LW · GW

I've never checked my karma total on LW 2.0 to see how it's changed.

comment by DanielFilan · 2019-04-30T21:40:06.551Z · score: 2 (1 votes) · LW · GW

In my case, it sure feels like I check my karma often because I often want to know what my karma is, but maybe others differ.

comment by Ben Pace (Benito) · 2019-04-29T01:01:22.612Z · score: 3 (2 votes) · LW · GW

Do our karma notifications disappear if you don’t check them that day? My model of Zvi suggested to me this is attention-grabbing and bad. I wonder if it’s better to let folks be notified of all days’ karma updates ‘til their most recent check in, and maybe also see all historical ones ordered by date if they click on a further button, so that the info isn’t lost and doesn’t feel scarce.

comment by habryka (habryka4) · 2019-04-29T01:26:33.791Z · score: 4 (2 votes) · LW · GW

Nah, they accumulate until you click on them.

comment by Zvi · 2019-04-29T12:08:47.879Z · score: 8 (4 votes) · LW · GW

Which is definitely better than it expiring, and 24h batching is better than instantaneous feedback (unless you were going to check posts individually for information already, in which case things are already quite bad). It's not obvious to me what encouraging daily checks here is doing for discourse as opposed to being a Skinner box.

comment by Raemon · 2019-04-29T20:04:05.494Z · score: 12 (5 votes) · LW · GW

The motivation was (among other things) several people saying to us "yo, I wish LessWrong was a bit more of a skinner box because right now it's so thoroughly not a skinner box that it just doesn't make it into my habits, and I endorse it being a stronger habit than it currently is."

See this comment and thread [LW · GW].

comment by shminux · 2019-04-27T20:27:23.765Z · score: 6 (3 votes) · LW · GW

It's interesting to see how people's votes on a post or comment are affected by other comments. I've noticed that a burst of vote count changes often appears after a new and apparently influential reply shows up.

comment by Alexei · 2019-04-27T19:38:41.011Z · score: 4 (2 votes) · LW · GW

Yeah, I had the same occurrence + feeling recently when I wrote the quant trading post. It felt like: "Wait, who would downvote this post...??" It's probably more likely that someone just retracted an upvote.

comment by MakoYass · 2019-04-28T03:21:55.996Z · score: 1 (3 votes) · LW · GW

Reminder: If a person is not willing to explain their voting decisions, you are under no obligation to waste cognition trying to figure them out. They don't deserve that. They probably don't even want that.

comment by Vladimir_Nesov · 2019-05-04T14:55:03.000Z · score: 10 (2 votes) · LW · GW

That depends on what norm is in place. If the norm is to explain downvoting, then people should explain, otherwise there is no issue in not doing so. So the claim you are making is that the norm should be for people to explain. The well-known counterargument is that this disincentivizes downvoting.

you are under no obligation to waste cognition trying to figure them out

There is rarely an obligation to understand things, but healthy curiosity ensures progress on recurring events, irrespective of morality of their origin. If an obligation would force you to actually waste cognition, don't accept it!

comment by MakoYass · 2019-05-05T09:07:21.283Z · score: 1 (1 votes) · LW · GW
So the claim you are making is that the norm should be for people to explain

I'm not really making that claim. A person doesn't have to do anything condemnable to be in a state of not deserving something. If I don't pay the baker, I don't deserve a bun. I am fine with not deserving a bun, as I have already eaten.

The baker shouldn't feel like I am owed a bun.

Another metaphor is that the person who is beaten on the street by silent, masked assailants should not feel like they owe their oppressors an apology.

comment by Said Achmiz (SaidAchmiz) · 2019-04-28T03:29:49.749Z · score: 4 (2 votes) · LW · GW

Do you mean anything by this beyond “you don’t have an obligation to figure out why people voted one way or another, period”? (Or do you think that I [i.e., the general Less Wrong commenter] do have such an obligation?)

Edit: Also, the “They don’t deserve that” bit confuses me. Are you suggesting that understanding why people upvoted or downvoted your comment is a favor that you are doing for them?

comment by MakoYass · 2019-04-28T05:44:58.326Z · score: 2 (2 votes) · LW · GW

Sometimes a person won't want to reply and say outright that they thought the comment was bad, because it's just not pleasant, and perhaps not necessary. Instead, they might just reply with information that they think you might be missing, which you could use to improve, if you chose to. With them, an engaged interlocutor will be able to figure out what isn't being said. With them, it can be productive to try to read between the lines.

Are you suggesting that understanding why people upvoted or downvoted your comment is a favor that you are doing for them?

Isn't everything relating to writing good comments a favor that you are doing for others? But I don't really think in terms of favors. All I mean to say is that we should write our comments for the sorts of people who give feedback. Those are the good people. Those are the people who're a part of a good faith self-improving discourse. Their outgroup are maybe not so good, and we probably shouldn't try to write for their sake.

comment by habryka (habryka4) · 2019-04-28T06:13:07.816Z · score: 3 (2 votes) · LW · GW

I think I disagree. If you are getting downvoted by 5 people and one of them explains why, then even if the other 4 are not explaining their reasoning it's often reasonable to assume that more than just the one person had the same complaints, and as such you likely want to update more that it's better for you to change what you are doing.

comment by MakoYass · 2019-04-28T21:47:06.269Z · score: 6 (4 votes) · LW · GW

We don't disagree.

comment by habryka (habryka4) · 2019-04-28T22:02:11.026Z · score: 4 (2 votes) · LW · GW

Cool

comment by habryka (habryka4) · 2019-09-13T04:57:17.342Z · score: 23 (6 votes) · LW · GW

Thoughts on impact measures and making AI traps

I was chatting with Turntrout today about impact measures, and ended up making some points that I think are good to write up more generally.

One of the primary reasons why I am usually unexcited about impact measures is that I have a sense that they often "push the confusion into a corner" in a way that actually makes solving the problem harder. As a concrete example, I think a bunch of naive impact regularization metrics basically end up shunting the problem of "get an AI to do what we want" into the problem of "prevent the agent from interfering with other actors in the system".

The second one sounds easier, but mostly just turns out to also require a coherent concept and reference of human preferences to resolve, and you got very little from pushing the problem around that way, and sometimes get a false sense of security because the problem appears to be solved in some of the toy problems you constructed.

I am definitely concerned that Turntrout's AUP does the same, just in a more complicated way, but am a bit more optimistic than that, mostly because I do have a sense that in the AUP case there is actually some meaningful reduction going on, though I am unsure how much.

In the context of thinking about impact measures, I've also recently been thinking about the degree to which "trap-thinking" is actually useful for AI Alignment research. I think Eliezer was right in pointing out that a lot of people, when first considering the problem of unaligned AI, end up proposing some kind of simple solution like "just make it into an oracle" and then consider the problem solved.

I think he is right that it is extremely dangerous to consider the problem solved after solutions of this type, but it isn't obvious that there isn't some good work that can be done that is born out of the frame of "how can I trap the AI and make it marginally harder for it to be dangerous, basically pretending it's just a slightly smarter human?".

Obviously those kinds of efforts won't solve the problem, but they still seem like good things to do anyways, even if they just buy you a bit of time, or help you notice a bit earlier if your AI is actually engaging in some kind of adversarial modeling.

My broad guess is that research of this type is likely very cheap and much more scalable, and you hit diminishing marginal returns on it much faster than you would on AI Alignment research that is tackling the core problem, so it might just be fine to punt it until later. Though if you are acting on very short timelines it probably should still be someone's job to make sure that someone at DeepMind tries to develop the obvious transparency technologies to help you spot if your neural net has any large fractions of it dedicated to building sophisticated human modeling, even if this won't solve the problem in the long-run.

This perspective, combined with Wei Dai's recent comments that one job of AI Alignment researchers is to produce evidence that the problem is actually difficult, suggests that it might be a good idea for some people to just try to develop lots of benchmarks of adversarial behavior that have any chance of triggering before you have a catastrophic failure. Like, it seems obviously great to have a paper that takes some modern ML architecture and can clearly demonstrate in which cases it might engage in adversarial modeling, and maybe some remotely realistic scenarios where that might happen.

My current guess is that current ML architectures aren't really capable of adversarial modeling in this way, though I am not actually that confident of that, and actually would be somewhat surprised if you couldn't get any adversarial behavior out of a dedicated training regime, if you were to try. For example, let's say I train an RL-based AI architecture on chat interactions with humans in which it just tries to prolong the length of the chat session as much as possible. I would be surprised if the AI wouldn't build pretty sophisticated models of human interactions, and try some weird tactics like getting the human to believe that it is another human, or pretending that it is performing some long calculation, or deceiving the humans in a large variety of ways, at least if it was pretrained with a language model of comparable quality to GPT-2, and had similar resources going to it as OpenAI Five. Though it's also unclear to what degree this would actually give us evidence about treacherous turn scenarios.

I've also been quite curious about the application of ML to computer security, where an obvious experiment is to just try to set up some reasonable RL-architecture in which I have an AI interface with a webserver, trying to get access to some set of files that it shouldn't get access to. The problem here is obviously the sparse reward landscape, and there really isn't an obvious training regime here, but showing how even current AI could possibly leverage security vulnerabilities in a lot of systems in a way that could easily give rise to unintended side-effects could be a valuable goal. But in general training RL for almost anything is really hard, so this seems unlikely to work straightforwardly.

Overall, I am not sure what I feel about the perspective I am exploring above. I have a deep sense that a lot of it is just trying to dodge the hard parts of the problem, but it seems fine to put on my "short-term, increase the marginal difficulty of bad outcomes" hat for a bit and see how I feel after exploring it for a while.

comment by Matthew Barnett (matthew-barnett) · 2019-09-13T07:20:07.805Z · score: 7 (4 votes) · LW · GW

[ETA: This isn't a direct reply to the content in your post. I just object to your framing of impact measures, so I want to put my own framing in here]

I tend to think that impact measures are just tools in a toolkit. I don't focus on arguments of the type "We just need to use an impact measure and the world is saved" because this indeed would be diverting attention from important confusion. Arguments for not working on them are instead more akin to saying "This tool won't be very useful for building safe value aligned agents in the long run." I think that this is probably true if we are looking to build aligned systems that are competitive with unaligned systems. By definition, an impact penalty can only limit the capabilities of a system, and therefore does not help us to build powerful aligned systems.

To the extent that they meaningfully make cognitive reductions, this is much more difficult for me to analyze. On one hand, I can see a straightforward case for everyone being on the same page when the word "impact" is used. On the other hand, I'm skeptical that this terminology will meaningfully input into future machine learning research.

The above two things are my main critiques of impact measures personally.

comment by TurnTrout · 2019-09-20T23:58:08.692Z · score: 4 (2 votes) · LW · GW

I think a natural way of approaching impact measures is asking "how do I stop a smart unaligned AI from hurting me?" and patching hole after hole. This is really, really, really not the way to go about things. I think I might be equally concerned and pessimistic about the thing you're thinking of.

The reason I've spent enormous effort on Reframing Impact is that the impact-measures-as-traps framing is wrong! The research program I have in mind is: let's understand instrumental convergence on a gears level. Let's understand why instrumental convergence tends to be bad on a gears level. Let's understand the incentives so well that we can design an unaligned AI which doesn't cause disaster by default.

The worst-case outcome is that we have a theorem characterizing when and why instrumental convergence arises, but find out that you can't obviously avoid disaster-by-default without aligning the actual goal. This seems pretty darn good to me.

comment by habryka (habryka4) · 2019-08-30T20:48:03.232Z · score: 22 (8 votes) · LW · GW

I just came back from talking to Max Harms about the Crystal trilogy, which made me think about rationalist fiction, or the concept of hard sci-fi combined with explorations of cognitive science and philosophy of science in general (which is how I conceptualize the idea of rationalist fiction). 

I have a general sense that one of the biggest obstacles for making progress on difficult problems is something that I would describe as “focusing attention on the problem”. I feel like after an initial burst of problem-solving activity, most people when working on hard problems, either give up, or start focusing on ways to avoid the problem, or sometimes start building a lot of infrastructure around the problem in a way that doesn’t really try to solve it. 

I feel like one of the most important tools/skills that I see top scientists or problem solvers in general use is utilizing workflows and methods that allow them to focus on a difficult problem for days and months, instead of just hours.

I think at least for me, the case of exam environments displays this effect pretty strongly. I have a sense that in an exam environment, if I am given a question, I successfully focus my full attention on a problem for a full hour, in a way that often easily outperforms me thinking about a problem in a lower key environment for multiple days in a row.

And then, when I am given a problem set with concrete technical problems, my attention is again much better focused than when I am given the same problem but in a much less well-defined way. E.g. thinking about solving some engineering problem, but without thinking about it by trying to create a concrete proof or counterproof. 

My guess is that there is a lot of potential value in fiction that helps people focus their attention on a problem in a real way. In fiction you have the ability to create real-feeling stakes that depend on problem solving, and things like the final exam in Methods of Rationality show how that can be translated into large amounts of cognitive labor. 

I think my strongest counterargument to this model is something like “sure, it’s easy to make progress on problems when you have someone else give you the correct ontology in which the problem is solvable, but that’s just because 90% of the work of solving problems is coming up with the right ontologies for problems like this”. And I think there is something importantly real about this, but also that it doesn’t fully address the value of exams and fiction and problem sets that I am trying to point to (though I do think it explains a good chunk of their effect). 

Going back to the case of fiction, it is clear to me that fiction is, as a literary form, much more optimized to hold human attention than most non-fiction is. I think first of all that this constraint means that most fiction (and in particular most popular fiction) isn’t about much else than whatever best holds people’s attention, but it also means that if the bottleneck on a lot of problems is just getting people to hold their attention on the problem for a while, then utilizing the methods that fiction-writing has developed seems like an obvious way of making progress on those problems. 

I feel like another major effect that explains a lot of the effects that I observe is people believing that a problem is solvable. In a fictional setting, if the author promises you that things have a good explanation, then it’s motivating to figure out why. On an exam you are promised that the problems that you are given are solvable, and solvable within a reasonable amount of time. 

I do think this can still be exploited. In the last few chapters of HPMOR, Harry does a mental motion that I would describe as "don't waste mental energy on asking whether a problem is solvable, just pretend it is, and ask what the solution would be if it was solvable", in a way that felt to me like it would work on a lot of real-world problems. 

comment by eigen · 2019-08-31T16:51:53.379Z · score: 2 (2 votes) · LW · GW

Yes, fiction has a lot of potential to change mindsets. Many philosophers actually look at the greatest novel writers to infer the motives and the solutions of their heroes, in order to come up with general theories that touch the very core of how our society is laid out.

Most of this comes from the fact that we are already immersed in a meta-story, externally and internally. Much of our effort is focused on internal rationalizations to gain something where a final outcome has already been thought out, whether this is consciously known to us or not.

I think that in fiction this is laid out perfectly. So analyzing fiction is rewarding in a sense. This is especially so when we realize that when we go to exams or interviews we're rapidly immersing ourselves in an isolated story with motives and objectives (what we expect to happen); we create our own little world, our own little stories.

comment by Viliam · 2019-08-31T14:10:26.040Z · score: 2 (1 votes) · LW · GW

Warning: HPMOR spoilers!

I suspect that fiction can conveniently ignore the details of real life that could ruin seemingly good plans.

Let's look at HPMOR.

The general idea of "create a nano-wire, then use it to simultaneously kill/cripple all your opponents" sounds good on paper. Now imagine yourself, at that exact situation, trying to actually do it. What could possibly go wrong?

As a first objection, how would you actually put the nano-wire in the desired position? Especially when you can't even see it (otherwise the Death Eaters and Voldemort would see it too). One mistake would ruin the entire plan. What if the wind blows and moves your wire? If one of the Death Eaters moves a bit, and feels a weird stinging at the side of their neck?

Another objection, when you pull the wire to kill/cripple your opponents, how far do you actually have to move it? Assuming a dozen Death Eaters (I do not remember the exact number in the story), if you need 10 cm for an insta-kill, that's 1.2 meters you need to do before the last one kills you. Sounds doable, but also like something that could possibly go wrong.

In other words, I think that in real life, even Harry Potter's plan would most likely fail. And if he is smart enough, he would know it.

The implication for real life is that, similarly, smart plans are still likely to fail, and you know it. Which is probably why you are not trying hard enough. You probably already remember situations in your past when something seemed like a great idea, but still failed. Your brain may predict that your new idea would belong to the same reference class.

comment by habryka (habryka4) · 2019-08-31T17:06:34.068Z · score: 6 (3 votes) · LW · GW

While I agree that this is right, your two objections are both explicitly addressed within the relevant chapter: 

"As a first objection, how would you actually put the nano-wire in the desired position? Especially when you can't even see it (otherwise the Death Eaters and Voldemort would see it too). One mistake would ruin the entire plan. What if the wind blows and moves your wire? If one of the Death Eaters moves a bit, and feels a weird stinging at the side of their neck?"

Harry first transfigures a much larger spiderweb, which also has the advantage of being much easier to move in place, and to not be noticed by people that are interacting with it. 

"Another objection, when you pull the wire to kill/cripple your opponents, how far do you actually have to move it? Assuming a dozen Death Eaters (I do not remember the exact number in the story), if you need 10 cm for an insta-kill, that's 1.2 meters you need to do before the last one kills you. Sounds doable, but also like something that could possibly go wrong."

Indeed, which is why Harry was weaving the web into an interwoven circle that contracts simultaneously in all directions. 

Obviously things could have still gone wrong, and Eliezer has explicitly acknowledged that HPMOR is a world in which complicated plans definitely succeed a lot more than they would in the normal world, but he did try to at least cover the obvious ways things could go wrong. 


comment by Ben Pace (Benito) · 2019-08-31T19:14:51.548Z · score: 2 (1 votes) · LW · GW

I have covered both of your spoilers in spoiler tags (">!").

comment by habryka (habryka4) · 2019-05-27T06:42:29.252Z · score: 20 (5 votes) · LW · GW

Random thoughts on game theory and what it means to be a good person

It does seem to me like there doesn’t exist any good writing on game theory from a TDT perspective. Whenever I read classical game theory, I feel like the equilibria that are being described obviously fall apart when counterfactuals are being properly brought into the mix (like D/D in prisoner's dilemmas).

The obvious problem with TDT-based game theory is that, just as with Bayesian epistemology, the vast majority of direct applications are completely computationally intractable. It’s kind of obvious what should happen in games with lots of copies of yourself, but as soon as anything participates that isn’t a precise copy, everything gets a lot more confusing. So it is not fully clear what a practical game-theory literature from a TDT-perspective would look like, though maybe the existing LessWrong literature on Bayesian epistemology might be a good inspiration.

Even when you can’t fully compute everything (and we don’t even really know how to compute everything in principle), you might still be able to go through concrete scenarios and list considerations and perspectives that incorporate TDT-perspectives. I guess in that sense, a significant fraction of Zvi’s writing could be described as practical game theory, though I do think there is a lot of value in trying to formalize the theory and make things as explicit as possible, which I feel like Zvi at least doesn’t do most of the time.
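To make the "copies of yourself" case concrete, here is a minimal sketch. The payoff numbers are illustrative assumptions (only their ordering matters), and `best_response` and `best_policy_against_copy` are hypothetical helper names, not anything from an existing library. It contrasts the classical reasoning that holds the other player's action fixed with the reasoning that treats an exact copy's action as identical to your own:

```python
# Payoffs to the row player in a one-shot prisoner's dilemma.
# These numbers are assumptions for illustration; only their ordering matters.
PAYOFF = {
    ("C", "C"): 3,
    ("C", "D"): 0,
    ("D", "C"): 5,
    ("D", "D"): 1,
}
ACTIONS = ("C", "D")


def best_response(opponent_action):
    """Classical reasoning: hold the opponent's action fixed, pick the best reply."""
    return max(ACTIONS, key=lambda a: PAYOFF[(a, opponent_action)])


def best_policy_against_copy():
    """Copy reasoning: whatever I choose, an exact copy of me chooses the same."""
    return max(ACTIONS, key=lambda a: PAYOFF[(a, a)])


print([best_response(o) for o in ACTIONS])  # ['D', 'D']: defection dominates
print(best_policy_against_copy())           # C: mutual cooperation beats mutual defection
```

The sketch only covers the easy case of a precise copy. It says nothing about partial similarity between agents, which is where things get confusing and intractable.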

Critch (Academian) tends to have this perspective of trying to figure out what a “robust agent” would do, in the sense of an agent that would at the very least be able to reliably cooperate with copies of itself, and adopt cooperation and coordination principles that allow it to achieve very good equilibria with agents that adopt the same type of cooperation and coordination norms. And I do think there is something really valuable here, though I am also worried that the part where you have to cooperate with agents who haven’t adopted super similar cooperation norms is actually the more important one (at least until something like AGI).

And I do think that the majority of the concepts we have for what it means to be a “good person” are ultimately attempts at trying to figure out how to coordinate effectively with other people, in a way that a more grounded game theory would help a lot with.

Maybe a good place to start would be to brainstorm a list of concrete situations in which I am uncertain what the correct action is. Here is some attempt at that:

  • How to deal with threats of taking strongly negative-sum actions? What is the correct response to the following concrete instances?

    • A robber threatens to shoot you if you don’t hand over your wallet
      • Do you precommit to violently attack any robber that robs you, or do you simply hand over your wallet?
  • You are in the room with someone holding the launch buttons for the USA’s nuclear arsenal and they are threatening to launch them if you don’t hand over your wallet

  • You are head of the U.S. and another nation state is threatening a small-scale nuclear attack on one of your cities if you don’t provide some kind of economic subsidy to them

    • Do you launch a conventional attack?
    • Do you launch a full out nuclear response as a deterrent?
    • Do you launch a small-scale nuclear response?
    • Do you not do anything at all?
    • Does the answer depend on the size of the economic subsidy? What if they ask twice?
  • You are at a party and your assigned driver ended up drinking, even though they said they would not (the driver was chosen by a random draw)

    • Do you somehow punish them now, do you punish them later, or not at all?
    • What if they are less likely to remember if you punish them now because they are drunk? Does that matter for the game-theoretic correct action?
    • What if they did this knowingly, reasoning from a CDT perspective that there wouldn’t be any point in punishing them now because they wouldn’t remember the next day?
      • What if you would never see them again later?
      • What if you only ever get to interact with them after they made the choice to be drunk?

I feel like I have some hint of an answer to all of these, but also feel like any answer that I can come up with makes me exploitable in a way that makes me feel like there is no meta-level on which there is an ideal strategy.

comment by Raemon · 2019-05-28T01:11:42.972Z · score: 12 (3 votes) · LW · GW

Reading through this, I went "well, obviously I pay the mugger...

...oh, I see what you're doing here."

I don't have a full answer to the problem you're specifying, but something that seems relevant is the question of "How much do you want to invest in the ability to punish defectors [both in terms of maximum power-to-punish, a-la nukes, and in terms of your ability to dole out fine-grained-exactly-correct punishment, a-la skilled assassins]"

The answer to this depends on your context. And how you have answered this question determines whether it makes sense to punish people in particular contexts.

In many cases there might want to be some amount of randomization where at least some of the time you really disproportionately punish people, but you don't have to pay the cost of doing so every time.

Answering a couple of the concrete questions:

Mugger

Right now, in real life, I've never been mugged, and I feel fine basically investing zero effort into preparing for being mugged. If I do get mugged, I will just hand over my wallet.

If I was getting mugged all the time, I'd probably invest effort into a) figuring out what good policies existed for dealing with muggers, b) what costs I'd have to pay in order to implement those policies.

In some worlds, it's worth investing in literal body armor or bullet proof cars or whatever, and in the skill to successfully fight back against a literal mugger. (My understanding is that this is usually not actually a good idea even in crime-heavy areas, but I can imagine worlds where it was correct to just get good at fighting, or to hire people who are good at fighting as bodyguards)

In some worlds it's worth investing more in police-force and avoiding having to think about the problem, or not carrying as much money around in the first place.

Small Nation Demands Subsidies, Threatens Nuclear War

Again, I think my options here depend a lot on having already invested in defense.

One scenario is "I do not have the ability to say 'no' without risking the lives of millions of either my own citizens or innocent citizens of the country in question." In that case, I probably have to do something that makes my vague-hippie-values sad.

I have some sense that my vague-hippie-values depend on having invested enough money in defense (and offense) that I can "afford" to be moral. Things I may wish my country had invested in include:

  • Anti-ICBM capabilities that can shoot down incoming nukes with enough reliability that either a small-scale nuclear counterstrike, or a major non-nuclear retaliatory invasion, are viable options that will at least only punish foreign civilians if the foreign government actually launches an attack
  • Possibly invested in assassins who just kill individuals who threaten nuclear strikes (I'm somewhat confused about why this isn't used more; I suspect the answer has to do with the game theory of 'the people in charge [of all nations] want it to be true that they aren't at risk of getting assassinated, so they have a gentleman's agreement to avoid killing enemy leaders')

So I probably want to invest a lot in either having strong capabilities in those domains, or having allies who do.

Drinking

In real life I expect that the solution here is "I never invite said person to parties again, and depending on our relative social standing I might publicly badmouth them or quietly gossip about them."

In weird contrived scenarios I'm not sure what I do because I don't know how to anticipate weird contrived scenarios.

I do invest, generally, in communicating about how obviously people should follow up on their commitments, such that when someone fails to live up to their commitment, it costs less to punish them for doing so. (And this is a shared social good that multiple people invest in).

If I'm in a one-off interaction with someone who is currently too drunk to remember being punished and who I'm not socially connected to, I probably treat it like being mugged – a fluke event that doesn't happen often enough to be worth investing resources in being able to handle better.

Extra Example: Having to Stand Up to a Boss/High-Status-Person/Organization

A situation that I'm more likely to run into, where the problem actually seems hard, is that sometimes high status people do bad things, and they have more power than you, and people will naturally end up on their side and take their word over yours.

Sort of similar to the "Small nation threatening nuclear war", I think if you want to be able to "afford to actually have moral principles", you need to invest upfront in capabilities. This isn't always the right thing to do, depending on your life circumstances, but it may be sometimes. You want to have enough surplus power that you have the Slack to stand up for things.

Possibilities include investing in being directly high status yourself, or investing in making friends with a strong enough coalition of people to punish high status people, or encourage strong norms and rule of law such that you don't need to have as strong a coalition, because you've made it lower cost to socially attack someone who breaks a norm.

Extra Example: The Crazy House Guest

Perhaps related to the drinking example: a couple times, I've had people show up at former houses, potentially looking to move in, and then causing some kind of harm.

In one case, they had a very weird combination of mental illnesses and cluelessness that resulted in them dealing several thousands of dollars worth of physical damage to the house.

They seemed crazy and unpredictable enough that it seemed like if I tried to punish them, they might follow me around forever and make my life suck in weird ways.

So I didn't punish them and they got away with it and went away and I never heard from them again.

So... sure, you can get away with certain kinds of things by signaling insanity and unpredictability... but at the cost of not being welcome in major social networks. The few extra thousand dollars they saved was not remotely worth the fact that, had they been a more reasonable person, they'd have had access to a strong network of friends and houses that help each other out finding jobs and places to live and what-not.

So I'm not worried about the longterm incentives here – the only people for whom insanity is a cost-effective tool to avoid punishment are actual insane people who don't have the ability to interface with society normally.

What if there turn out to be lots of crazy people? Then you probably either invest upfront resources in fighting this somehow, or become less trusting.

Extra Example: The Greedy Landlord

In another housing situation, the landlord tried to charge us extra for things that were not our fault. In this case, it was reasonably clear that we were in the right. Going to small claims court would have been net-negative for us, but also costly to them.

I was angry and full of zealous energy and I decided it was worth it and I threatened going to small claims court and wasting both of our time, even though a few hundred dollars wasn't really worth it.

They backed down.

This seems like the system working as intended. This is what anger is for: to make sure people have the backbone to defend themselves, and to live in a world where at least some of the time people will get riled up and punish you disproportionately.

What if you haven't invested in defense capabilities in advance?

Then you probably will periodically need to lose and accept bad situations, such as either a more powerful empire demanding tribute from your country, or choosing policies like "if you are under threat, flip an unknown number of coins and if enough coins come up heads, go to war and punish them disproportionately even though you will probably lose and lots of people will die but now empires will sometimes think twice about invading poor countries."

The meta level point

It doesn't seem inconsistent to me to apply different policies in different situations, even if they share commonalities, based on how common the situation is, how costly the defection, how much long-term punishment you can inflict, and how many resources you have invested in being able to punish.

This does mean that mugging (for example) is a somewhat viable strategy, since people don't invest as heavily in handling it (because it is rare), but this seems like a self-correcting problem. There would always be some least-defended-against defect-button that defectors can press; you can't protect against everything.

Another point is that it's important to be somewhat unpredictable, and to at least sometimes just punish people disproportionately (when they wrong you), so that people aren't confident that the expected value of taking advantage of you is positive.
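As a rough sketch of the expected-value point above: if exploiting you yields some gain and you punish only some fraction of the time, the exploiter's expected value is the gain minus the punishment probability times the penalty. The function names and all numbers below are hypothetical, and the model assumes a risk-neutral exploiter who knows the probability, which real defectors may not be:

```python
def exploiter_expected_value(gain, penalty, punish_probability):
    """Expected payoff to someone who takes advantage of you once.

    Assumes a risk-neutral exploiter; all inputs are illustrative.
    """
    return gain - punish_probability * penalty


def break_even_punish_probability(gain, penalty):
    """Punishment probability at which exploiting has zero expected value.

    Punishing more often than this makes exploitation a losing bet.
    Capped at 1.0: if the penalty is smaller than the gain, even certain
    punishment does not deter.
    """
    if penalty <= 0:
        raise ValueError("penalty must be positive")
    return min(1.0, gain / penalty)


# Hypothetical numbers: a 100-unit gain against a 1000-unit disproportionate penalty.
threshold = break_even_punish_probability(gain=100, penalty=1000)
print(threshold)
print(exploiter_expected_value(gain=100, penalty=1000, punish_probability=0.2))
```

The design point is that a larger penalty lowers the break-even probability, so disproportionate punishment lets you pay the cost of punishing less often.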

comment by Lanrian · 2019-05-27T10:02:22.819Z · score: 2 (3 votes) · LW · GW

Any reason why you mention timeless decision theory (TDT) specifically? My impression was that functional decision theory (as well as UDT, since they're basically the same thing) is regarded as a strict improvement over TDT.

comment by habryka (habryka4) · 2019-05-27T16:49:27.727Z · score: 2 (1 votes) · LW · GW

Same thing, it's just the handle that stuck in my mind. I think of the whole class as "timeless", since I don't think there exists a good handle that describes all of them.

comment by habryka (habryka4) · 2019-05-02T19:05:53.491Z · score: 20 (7 votes) · LW · GW

Printing more rationality books: I've been quite impressed with the success of the printed copies of R:A-Z and think we should invest resources into printing more of the other best writing that has been posted on LessWrong and the broad diaspora.

I think a Codex book would be amazing, but I think there also exists potential for printing smaller books on things like Slack/Sabbath/etc., and many other topics that have received a lot of other coverage over the years. I would also be really excited about printing HPMOR, though that has some copyright complications to it.

My current model is that there exist many people interested in rationality who don't like reading longform things on the internet and are much more likely to read things when they are in printed form. I also think there is a lot of value in organizing writing into book formats. There is also the benefit that the book now becomes a potential gift for someone else to read, which I think is a pretty common way ideas spread.

I have some plans to try to compile some book-length sequences of LessWrong content and see whether we can get things printed (obviously in coordination with the authors of the relevant pieces).

comment by habryka (habryka4) · 2019-04-30T18:28:29.037Z · score: 19 (6 votes) · LW · GW

Forecasting on LessWrong: I've been thinking for quite a while about somehow integrating forecasts and prediction-market like stuff into LessWrong. Arbital has these small forecasting boxes that look like this:

Arbital Prediction Screenshot

I generally liked these, and think they provided a good amount of value to the platform. I think our implementation would probably take up less space, but the broad gist of Arbital's implementation seems like a good first pass.

I do also have some concerns about forecasting and prediction markets. In particular I have a sense that philosophical and mathematical progress only rarely benefits from attaching concrete probabilities to things, and more works via mathematical proof and trying to achieve very high confidence on some simple claims by ruling out all other interpretations as obviously contradictory. I am worried that emphasizing probability much more on the site would make making progress on those kinds of issues harder.

I also think a lot of intellectual progress is primarily ontological, and given my experience with existing forecasting platforms and Zvi's sequence on prediction markets, they are not very good at resolving ontological confusions and often seem to actively hinder them by causing lots of sunk-costs into easy-to-operationalize ontologies that tend to dominate the platforms.

And then there is the question of whether we want to go full-on internal prediction market and have active markets that are traded in some kind of virtual currency that people actually care about. I think there is a lot of value in that direction, but it's obviously also a lot of engineering effort that isn't obviously worth it. It seems likely better to wait until a project like foretold.io has reached maturity and then see whether we can integrate it into LessWrong somehow.

comment by Zvi · 2019-05-03T18:34:04.684Z · score: 21 (8 votes) · LW · GW

This feature is important to me. It might turn out to be a dud, but I would be excited to experiment with it. If it was available in a way that was portable to other websites as well, that would be even more exciting to me (e.g. I could do this in my base blog).

Note that this feature can be used for more than forecasting. One key use case on Arbital was to see who was willing to endorse or disagree with, to what extent, various claims relevant to the post. That seemed very useful.

I don't think having internal betting markets is going to add enough value to justify the costs involved. Especially since it both can't be real money (for legal reasons, etc) and can't not be real money if it's going to do what it needs to do.

comment by habryka (habryka4) · 2019-05-03T19:08:51.070Z · score: 6 (3 votes) · LW · GW

There are some external platforms that one could integrate with, here is one that is run by some EA-adjacent people: https://www.empiricast.com/

I am currently confused about whether using an external service is a good idea. In some sense it makes things more modular, but it also limits the UI design-space a lot and lengthens the feedback loop. I think I am currently tending towards rolling our own solution and maybe allowing others to integrate it into their site.

comment by Rob Bensinger (RobbBB) · 2019-04-30T23:35:02.135Z · score: 4 (2 votes) · LW · GW

One small thing you could do is to have probability tools be collapsed by default on any AIAF posts (and maybe even on the LW versions of AIAF posts).

Also, maybe someone should write a blog post that's a canonical reference for 'the relevant risks of using probabilities that haven't already been written up', in advance of the feature being released. Then you could just link to that a bunch. (Maybe even include it in the post that explains how the probability tools work, and/or link to that post from all instances of the probability tool.)

Another idea: Arbital had a mix of (1) 'specialized pages that just include a single probability poll and nothing else'; (2) 'pages that are mainly just about listing a ton of probability polls'; and (3) 'pages that have a bunch of other content but incidentally include some probability polls'.

If probability polls on LW mostly looked like 1 and 2 rather than 3, then that might make it easier to distinguish the parts of LW that should be very probability-focused from the parts that shouldn't. I.e., you could avoid adding Arbital's feature for easily embedding probability polls in arbitrary posts (and/or arbitrary comments), and instead treat this more as a distinct kind of page, like 'Questions'.

You could still link to the 'Probability' pages prominently in your post, but the reduced prominence and site support might cause there to be less social pressure for people to avoid writing/posting things out of fears like 'if I don't provide probability assignments for all my claims in this blog post, or don't add a probability poll about something at the end, will I be seen as a Bad Rationalist?'

comment by Rob Bensinger (RobbBB) · 2019-04-30T23:36:36.309Z · score: 5 (3 votes) · LW · GW

Also, if you do something Arbital-like, I'd find it valuable if the interface encourages people to keep updating their probabilities later as they change. E.g., some (preferably optional) way of tracking how your view has changed over time. Probably also make it easy for people to re-vote without checking (and getting anchored by) their old probability assignment, for people who want that.
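A minimal sketch of what such tracking could look like as a data structure. `ForecastHistory` and its methods are hypothetical names for illustration, not part of Arbital or LessWrong, and it assumes probabilities are stored as floats between 0 and 1:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ForecastHistory:
    """One user's probability estimates for one claim, oldest first."""

    entries: list = field(default_factory=list)  # (timestamp, probability) pairs

    def vote(self, probability, when=None):
        """Record a new estimate without displaying or returning the old one."""
        if not 0.0 <= probability <= 1.0:
            raise ValueError("probability must be between 0 and 1")
        self.entries.append((when or datetime.now(timezone.utc), probability))

    def current(self):
        """Latest estimate, or None if the user has never voted."""
        return self.entries[-1][1] if self.entries else None

    def history(self, reveal=False):
        """Earlier estimates stay hidden unless the caller opts in."""
        return list(self.entries) if reveal else self.entries[-1:]


h = ForecastHistory()
h.vote(0.3)
h.vote(0.7)
print(h.current())
print(len(h.history()), len(h.history(reveal=True)))
```

Keeping the full history but hiding it by default is one way to support both re-voting without anchoring and optional tracking of how a view changed.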

comment by Ben Pace (Benito) · 2019-05-01T01:03:22.104Z · score: 14 (4 votes) · LW · GW

Note that Paul Christiano warns against encouraging sluggish updating by massively publicising people’s updates and judging them on it. Not sure what implementation details this suggests yet, but I do want to think about it.

https://sideways-view.com/2018/07/12/epistemic-incentives-and-sluggish-updating/

comment by Rob Bensinger (RobbBB) · 2019-05-01T18:06:57.710Z · score: 4 (2 votes) · LW · GW

Yeah, strong upvote to this point. Having an Arbital-style system where people's probabilities aren't prominently timestamped might be the worst of both worlds, though, since it discourages updating and makes it look like most people never do it.

I have an intuition that something socially good might be achieved by seeing high-status rationalists treat ass numbers as ass numbers, brazenly assign wildly different probabilities to the same proposition week-by-week, etc., especially if this is a casual and incidental thing rather than being the focus of any blog posts or comments. This might work better, though, if the earlier probabilities vanish by default and only show up again if the user decides to highlight them.

(Also, if a user repeatedly abuses this feature to look a lot more accurate than they really were, this warrants mod intervention IMO.)

comment by habryka (habryka4) · 2019-05-15T06:00:49.483Z · score: 7 (4 votes) · LW · GW

Making yourself understandable to other people

(Epistemic status: Processing obvious things that have likely been written many times before, but that are still useful to have written up in my own language)

How do you act in the context of a community that is vetting constrained? I think there are fundamentally two approaches you can use to establish coordination with other parties:

1. Professionalism: Establish that you are taking concrete actions with predictable consequences that are definitely positive

2. Alignment: Establish that you are a competent actor that is acting with intentions that are aligned with the aims of others

I think a lot of the concepts around professionalism arise when you have a group of people who are trying to coordinate, but do not actually have aligned interests. In those situations you will have lots of contracts and commitments to actions that have well-specified outcomes and deviations from those outcomes are generally considered bad. It also encourages a certain suppression of agency and a fear of people doing independent optimization in a way that is not transparent to the rest of the group.

Given a lot of these drawbacks, it seems natural to aim for establishing alignment with others; it is, however, much less clear how to achieve that. Close groups of friends can often act in alignment because they have credibly signaled to each other that they care about each other's experiences and goals. This also tends to involve costly signals of sacrifice that are only economical if the goals of the participants were actually aligned. I also suspect that there is a real "merging of utility functions" going on, where close friends and partners self-modify to share each other's values.

For larger groups of people, establishing alignment with each other seems much harder, in particular in the presence of adversarial actors. You can request costly signals, but it is often difficult to find good signals that are not prohibitively costly for many members of your group (this task is much easier for smaller groups, since you have less spread in the costs of different actions). You are also under much more adversarial pressure, since with more people you likely have access to more resources which attracts more adversarial actors.

I expect this is the reason why we see larger groups often default to professionalism norms with very clearly defined contracts.

I think the EA and Rationality communities have historically optimized hard for alignment and not professionalism, since that enabled much better overall coordination, but as the community grew and attracted more adversarial actors those methods didn't scale very well and so we currently expect alignment-level coordination capabilities while only having access to professionalism-level coordination protocols and resources.

We've also seen an increase in people trying to increase the amount of alignment, by looking into things like circling and specializing in mediation and facilitation, which I think is pretty promising and has some decent traction.

I also think there is a lot of value in building better infrastructure and tools for more "professionalism" style interactions, where people offer concrete services with bounded upside. A lot of my thinking on the importance of accountability derives from this perspective.

comment by Rob Bensinger (RobbBB) · 2019-05-02T22:09:44.951Z · score: 7 (4 votes) · LW · GW

I like this shortform feed idea!