In the future, my advice to you would be:
Start small - what individual bias do you think you could explain best? How would you explain just that 1 small thing as simply and engagingly as possible?
Use the site questions feature - if you want examples from the community, just ask the question without any commentary on who is or isn't debugged, etc.
I suspect you have more learning to do before you really get LW rationality as G Gordon Worley III describes, so it might be better to really get a handle on all this first.
wei_dai on "Other people are wrong" vs "I am right"
I'm confused about what point you're making with the bike thief example. I'm reading through that post and its comments to see if I can understand your post better with that as background context, but you might want to clarify that part of the post, keeping in mind a reader who doesn't have that context.
I think the techniques in this post may be helpful to avoid the kind of overconfidence you describe here and be more disjunctive in one's thinking.
Here are some other ideas which I continue to endorse which had that ring of truth to them, but whose details I’ve been similarly overconfident about.
I'm curious what details you were overconfident about, in case I can use the same updates that you made.
john_maxwell_iv on Thoughts on Human Models
A toy model I find helpful is correlated vs uncorrelated safety measures. Suppose we have 3 safety measures, and that if even one of them succeeds, our AI remains safe. Suppose each safety measure has a 60% success rate in the event of an accident. If the safety measures are accurately described by independent random variables, our odds of safety in an accident are 1 - 0.4^3 ≈ 94%. If the successes of the safety measures are perfectly correlated, failure of one implies certain failure of the others, and our odds of safety are only 1 - 0.4 = 60%.
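To make the arithmetic concrete, here is a minimal sketch in Python (the numbers are just the ones from the toy model above):

```python
# Toy model: 3 safety measures, each with a 60% chance of succeeding in an accident.
# The AI remains safe if at least one measure succeeds.

p_fail = 0.4       # per-measure failure probability
n_measures = 3

# Independent measures: the AI is unsafe only if all three fail.
p_safe_independent = 1 - p_fail ** n_measures

# Perfectly correlated measures: if one fails, they all fail.
p_safe_correlated = 1 - p_fail

print(f"independent: {p_safe_independent:.1%}")  # 93.6%
print(f"correlated:  {p_safe_correlated:.1%}")   # 60.0%
```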
In my mind, this is a good argument for working on ideas like safely interruptible agents, impact measures, and boxing. The chance of these ideas failing seems fairly independent from the chance of your value learning system failing.
But I think you could get a similar effect by having your AGI search for models whose failure probabilities are uncorrelated with one another. The better your AGI, the better this approach is likely to work.
john_maxwell_iv on Thoughts on Human Models
Human modelling is very close to human manipulation in design space. A system with accurate models of humans is close to a system which successfully uses those models to manipulate humans.
Trying to communicate why this sounds like magical thinking to me... Taylor is a data scientist for the local police department. Taylor notices that detectives are wasting a lot of time working on crimes which never get solved. They want to train a logistic regression on the crime database in order to predict whether a given crime will ever get solved, so detectives can focus their efforts on crimes that are solvable. Would you advise Taylor against this project, on the grounds that the system will be "too close in design space" to one which attempts to commit the perfect crime?
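For concreteness, here is roughly the kind of model Taylor is describing, sketched with scikit-learn on synthetic data (the feature names, coefficients, and dataset are entirely made up for illustration, not drawn from any real crime database):

```python
# Hypothetical sketch: predict whether a case will ever be solved,
# so detectives can prioritize. Features and labels are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
# Illustrative features: number of witnesses, hours until the crime was reported,
# and whether physical evidence was recovered.
X = np.column_stack([
    rng.poisson(2, n),          # witness_count
    rng.exponential(12, n),     # hours_until_reported
    rng.integers(0, 2, n),      # has_physical_evidence
])
# Synthetic label: cases with more witnesses, faster reporting, and evidence
# are more likely to be solved.
logits = 0.8 * X[:, 0] - 0.05 * X[:, 1] + 1.5 * X[:, 2] - 1.0
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# Rank open cases by predicted solvability.
print("held-out accuracy:", model.score(X_test, y_test))
print("P(solved) for a few test cases:", model.predict_proba(X_test)[:3, 1])
```

Nothing in this pipeline knows how to commit a crime; it only ranks cases by how likely they are to be closed, which is the sense in which "close in design space" seems to prove too much.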
Although they do not rely on human modelling, some of these approaches nevertheless make most sense in a context where human modelling is happening: for example, impact measures seem to make most sense for agents that will be operating directly in the real world, and such agents are likely to require human modelling.
Let's put AI systems into two categories: those that operate in the real world and those that don't. The odds of x-risk from the second kind of system seem low. I'm not sure what kind of safety work is helpful, aside from making sure it truly does not operate in the real world. But if a system does operate in the real world, it's probably going to learn about humans and acquire knowledge about our preferences. Which means you have to solve the problems that implies.
My steelman of this section is: Find a way to create a narrow AI that puts the world on a good trajectory.
john_maxwell_iv on Thoughts on Human Models
Re: independent audits, although they're not possible for this particular problem, there are many close variants of this problem such that independent audits are possible. Let's think of human approval as a distorted view of our actual preferences, and our goal is to avoid things which are really bad according to our undistorted actual preferences. If we pass distorted human approval to our AI system, and the AI system avoids things which are really bad according to undistorted human approval, that suggests the system is robust to distortion.
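Here is a toy version of that kind of audit, purely to illustrate the noise-model framing (the preference distribution, noise level, and thresholds are all made-up numbers):

```python
# Toy audit: "approval" is actual preference plus noise. The system filters
# options using noisy approval only; the audit checks how often an option
# that is really bad (by actual preferences) slips through.
import numpy as np

rng = np.random.default_rng(0)
n_options = 10_000
actual_preference = rng.normal(0, 1, n_options)                # undistorted preferences
approval = actual_preference + rng.normal(0, 0.5, n_options)   # distorted view

really_bad = actual_preference < -2.0   # bad according to actual preferences
accepted = approval > -1.0              # the system only sees approval

slip_rate = (really_bad & accepted).sum() / accepted.sum()
print(f"fraction of accepted options that were really bad: {slip_rate:.3%}")
```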
It would be helpful if people could outline some plausible-seeming scenarios for how divergence between approval and actual preferences could cause a catastrophe, in order to get a better sense for the appropriate noise model.
dagon on "Other people are wrong" vs "I am right"
See also https://www.lesswrong.com/posts/qNZM3EGoE5ZeMdCRt/reversed-stupidity-is-not-intelligence. I applaud (and aspire to) your ability to distinguish levels of plausibility without falling into false dichotomies.
We should fill the universe with hedonium.
I'll admit that the "pink goo" scenario (which I tend to call orgasmium, but will now use hedonium as the name of the substance) is the one I find most likely to be my supervillain almost-correct belief (that thing which, if I were powerful and not epistemically vigilant, I might undertake, causing great harm).
bucky on De-Bugged brains wanted
I’ve read it and commented on it already. You can refer to that comment for my thoughts.
Concepts which I can't find elsewhere are only good if they are accurate/helpful, which I don't believe they are.
I think in this case it is up to you to show that you're right, rather than up to me to show you're wrong.
habryka4 on Less Competition, More Meritocracy?
Promoted to curated: I've applied the ideas in this post to a variety of domains since I first read it, and I think it was quite useful in a lot of them (examples of questions I was thinking about: "How much more progress should we expect in Science given that at least 10x more resources are available for recruiting scientists?" and "How much does the size of the EA and Rationality community determine the quality of people working at organizations in the community?").
I do think I had to read this post at least twice to really grasp any of the core points, and am still struggling with some of them. I think this was partially the result of trying to translate a mathematical econ paper into a post without any equations, which is always a really big challenge, but I also think I could have benefitted from a longer initial section that just summarized the econ paper, and then a separate section that commented on it. As it stood, I think I ended up somewhat confused about which points were covered in the econ paper, and which ones were your points.
But overall, I think this post changed my mind on some important ideas, which is one of the most valuable things a post can do. Thanks a lot for writing it.
wei_dai on The Argument from Philosophical Difficulty
It feels like this is true for the vast majority of plausible technological progress as well? E.g. most scientific experiments / designed technologies require real-world experimentation, which means you get very little data, making it very hard to naively automate with ML. I could make a just-so story where philosophy has much more data (philosophy writing), that is relatively easy to access (a lot of it is on the Internet), and so will be easier to automate.
On the scientific/technological side, you can also use scientific/engineering papers (which I'm guessing are at least an order of magnitude greater in volume than philosophy writing), plus you have access to ground truths in the form of experiments and real-world outcomes (as well as near-ground truths like simulation results), which have no counterpart in philosophy. My main point is that it seems a lot harder for technological progress to go "off the rails" due to having access to ground truths (even if that data is sparse), so we can push it much harder with ML.
My actual reason for not seeing much of a difference is that (conditional on short timelines) I expect that the systems we develop will be very similar to humans in the profile of abilities they have, because it looks like we will develop them in a manner similar to how humans were “developed”.
I agree this could be a reason that things turn out well even if we don't explicitly solve metaphilosophy or do something like my hybrid approach ahead of time. The way I would put it is that humans developed philosophical abilities for some mysterious reason that we don't understand, so we can't rule out AI developing philosophical abilities for the same reason. It feels pretty risky to rely on this though. If by the time we get human-level AI, this turns out not to be true, what are we going to do then? And even if we end up with AIs that appear to be able to help us with philosophy, without having solved metaphilosophy how would we know whether it's actually helping or pushing us "off the rails"?
habryka4 on How could "Kickstarter for Inadequate Equilibria" be used for evil or turn out to be net-negative?
I think a lot of people have high time-discount rates, resulting in a pretty adversarial relationship with their future selves, such that contracts that allow them to commit to arbitrary future actions are bad. For example, imagine a drug addict being offered the chance to commit themselves to slavery a month from now, in exchange for some drugs right now. I would argue that the existence of this offer is overall net negative from a humanitarian perspective.
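As a toy illustration of how a steep discount rate makes that trade look attractive to the present self (the utilities and the discount factor here are invented for the example):

```python
# Made-up numbers: drugs now are worth +10 to the present self; a month of
# slavery, starting 30 days from now, is worth -1000 at face value.
drugs_now = 10
slavery_later = -1000
daily_discount = 0.8   # an extremely steep per-day discount factor
days_until_cost = 30

present_value = drugs_now + slavery_later * daily_discount ** days_until_cost
print(present_value)   # ~8.8: the -1000 shrinks to about -1.2, so the deal looks positive now
```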
I think there is a part of human psychology, and human culture, that expects large commitments to be accompanied by significant-seeming rituals, in order to help people grok the significance of what they are committing to (for example, marriage). As such, I think it would be important that this platform limit the types of things people can commit to, to stuff that wouldn't be extremely costly to their future selves (though this is already mostly covered by modern contract law, which mostly prevents you from signing contracts that are extremely costly for your future self).