Comment by john_maxwell_iv on No, it's not The Incentives—it's you · 2019-06-17T03:54:10.847Z · score: 3 (5 votes) · LW · GW

quality and useful research is much easier without academia

I think you have to do a lot more to demonstrate this.

destroying the credibility of academia would be the logical useful action.

Did you read Scott Alexander's recent posts on cultural evolution?

If the credibility of academia is destroyed, it's not obvious something better will come along to fill that void. Why is it better to destroy than repair? Plus, if something new gets created, it will probably have its own set of flaws. The more pressure is put on your system (in terms of funding and status), the greater the incentive to game things, and the more the cracks will start to show.

I suggest instead of focusing on the destruction of a suboptimal means for ascertaining credibility, you focus on the creation of a superior means for ascertaining credibility. Let's phase academia out after it has been made obsolete, not before.

Comment by john_maxwell_iv on No, it's not The Incentives—it's you · 2019-06-17T02:57:49.064Z · score: 2 (1 votes) · LW · GW

There's been a great deal of discussion of the EA Hotel on the EA Forum. Here's one relevant thread:

Here's another:

It's possible the hotel's funding troubles have more to do with weirdness aversion than anything else.

I personally spent 6 months at the hotel, thought it was a great environment, and felt the time I spent there was pretty helpful for my career as an EA. The funding situation is not as dire as it was a little while ago. But I've donated thousands of dollars to the project and I encourage others to donate too.

Comment by john_maxwell_iv on Recommendation Features on LessWrong · 2019-06-16T05:40:55.363Z · score: 19 (6 votes) · LW · GW

Idea: nudge "From the Archives" so it tends to show different users the same posts around the same time, so if someone leaves a comment on a post they read, others might also see the comment and a discussion can happen. (Or alternatively, I suppose you could just upweight posts which recently received comments in the "From the Archives" selection process. That seems better.)
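The second variant (upweighting posts that recently received comments) could be sketched roughly as below. This is only an illustration: the `last_comment_at` field, the 7-day window, and the 3x boost are all hypothetical choices, not anything taken from the actual LessWrong codebase.

```python
import random
from datetime import datetime, timedelta


def archive_weights(posts, now, boost=3.0, window=timedelta(days=7)):
    """Return one sampling weight per post.

    Posts with a comment inside `window` get weight `boost`; all others get 1.0.
    `posts` is a list of dicts with a hypothetical "last_comment_at" key.
    """
    weights = []
    for post in posts:
        last = post.get("last_comment_at")
        recently_commented = last is not None and now - last <= window
        weights.append(boost if recently_commented else 1.0)
    return weights


def pick_from_archives(posts, now, rng=random):
    """Sample one archive post, favoring posts with recent comments."""
    return rng.choices(posts, weights=archive_weights(posts, now), k=1)[0]
```

With a larger `boost`, readers who do land on an old post are more likely to land on one where a conversation has just started.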

Comment by john_maxwell_iv on The Univariate Fallacy · 2019-06-16T05:24:27.431Z · score: 11 (3 votes) · LW · GW

Good post. Some feedback:

  • I think you can replace the first instance of "are statistically independent" with "are statistically independent and identically distributed" & improve clarity.

  • IMO, your argument needs work if you want it to be more than an intuition pump. If the question is the existence or nonexistence of particular clusters, you are essentially assuming what you need to prove in this post. Plus, the existence or nonexistence of clusters is a "choice of ontology" question which doesn't necessarily have a single correct answer.

  • You're also fuzzing things by talking about discrete distributions here, then linking to Eliezer's discussion of continuous latent variables ("intelligence") without noting the difference. And: If a number of characteristics have been observed to co-vary, this isn't sufficient evidence for any particular causal mechanism. Correlation isn't causation. As I pointed out in this essay, it's possible there's some latent factor like the ease of obtaining calories in an organism's environment which explains interspecies intelligence differences but doesn't say anything about the "intelligence" of software.

Comment by john_maxwell_iv on No, it's not The Incentives—it's you · 2019-06-16T03:50:10.696Z · score: 2 (1 votes) · LW · GW

It's possible. That's what I myself am doing--supporting myself with a part-time job while I self-study and do independent FAI research.

However, it's harder to have credibility in the eyes of the public with this path. And for good reason--the public has no easy way to tell apart a crank from a lone genius, since it's hard to judge expertise in a domain unless you yourself are an expert in it. One could argue that academia acts as a reasonable approximation of eigendemocracy and thereby solves this problem.

Anyway, if the scientists with credibility are the ones who don't care about scientific integrity, that seems bad for public epistemology.

Comment by john_maxwell_iv on No, it's not The Incentives—it's you · 2019-06-15T21:49:26.165Z · score: 4 (2 votes) · LW · GW

Is this specific to research? Given unaligned incentives and Goodhart, I think you could make an argument that nothing important should be a source of income. All long-term values-oriented work should be undertaken as hobbies.

This is an interesting argument for funding something like the EA Hotel over traditional EA orgs.

Comment by john_maxwell_iv on No, it's not The Incentives—it's you · 2019-06-15T21:48:46.533Z · score: 2 (1 votes) · LW · GW

Well from a consequentialist perspective, if people with a stronger desire for scientific integrity self-select out of science, that makes science weaker in the long run.

I think a more realistic norm, which will likely create better outcomes, is for you personally to ensure that your work is at least in the top 40% for quality, and castigate anyone whose work is in the bottom 20%. Either of these practices should cause a gradual increase in quality if widely implemented (assuming these thresholds are tracked & updated as they change over time).

Comment by john_maxwell_iv on No, it's not The Incentives—it's you · 2019-06-15T21:25:03.212Z · score: 2 (1 votes) · LW · GW

fake data or misleading statistics

You shouldn't put these in the same category. Fake data is a much graver sin than failing to correct for multiple comparisons or running a study with a small sample size. For the latter two, anyone who reads your paper can see what you did (assuming you mention all the comparisons you made) and discount your conclusions accordingly. For a savvy reader or meta-analysis author, a paper which commits these sins can still improve their overall picture of the literature, especially if they employ tools to detect/correct for publication bias. It's not obvious to me that a scientist who employs these practices is doing harm with their academic career, especially given that readers are getting more and more savvy nowadays.

I don't think "fraud" is the right word for these statistical practices. Cherry-picking examples that support your point, the way an opinion columnist does, is probably a more fraudulent practice.

Comment by john_maxwell_iv on Let's talk about "Convergent Rationality" · 2019-06-14T21:55:51.813Z · score: 2 (1 votes) · LW · GW

(2) We can try to build AIs that are not in this category, but screw up*


*(Remember, any AI is running searches through some space in pursuit of something, otherwise you would never call it "intelligence". So one can imagine that the intelligent search may accidentally get aimed at the wrong target.)

The map is not the territory. A system can select a promising action from the space of possible actions without actually taking it. That said, there could be a risk of a "daemon" forming somehow.

Comment by john_maxwell_iv on Does Bayes Beat Goodhart? · 2019-06-14T03:28:09.922Z · score: 4 (2 votes) · LW · GW

By the way I just want to note that expected value isn't the only option available for aggregating utility functions. There's also stuff like Bostrom's parliament idea. I expect there are many opportunities for cross fertilization between AI safety and philosophical work on moral uncertainty.

Comment by john_maxwell_iv on Asymmetric Weapons Aren't Always on Your Side · 2019-06-09T07:33:27.514Z · score: 5 (3 votes) · LW · GW

Big armies tend to defeat smaller ones, and supporting a big army requires large-scale cooperation?

Comment by john_maxwell_iv on Does Bayes Beat Goodhart? · 2019-06-05T01:06:14.008Z · score: 3 (2 votes) · LW · GW

When I put it that way, another problem with going off-distribution is apparent: even if we do find a way to get better scores according to every plausible hypothesis by going off-distribution, we trust those scores less because they're off-distribution.

I realize I'm playing fast and loose with realizability again, but it seems to me that a system which is capable of being "calibrated", in the sense I defined calibration above, should be able to reason for itself that it is less knowledgeable about off-distribution points and have some kind of prior belief that the score for any particular off-distribution point is equal to the mean score for the entire (off-distribution?) space, and it should need a fair amount of evidence to shift this prior. I'm not necessarily specifying how concretely to achieve this, just saying that it seems like a desideratum for a "calibrated" ML system in the sense that I'm using the term.

Maybe effects like this could be achieved partially through e.g. having different hypotheses be defined on different subsets of the input space, and always including a baseline hypothesis which is just equal to the mean of the entire space.

If you want a backup system that also attempts to flag & veto any action that looks off-distribution for the sake of redundancy, that's fine by me too. I think some safety-critical software systems for e.g. space shuttles have been known to do this (do a computation in multiple different ways & aggregate them somehow to mitigate errors in any particular subsystem).

Quantilization follows fairly directly from that :)

My current understanding of quantilization is "choose randomly from the top X% of actions". I don't see how this helps very much with staying on-distribution... as you say, the off-distribution space is larger, so the majority of actions in the top X% of actions could still be off-distribution.
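A minimal sketch of that reading of quantilization, assuming a finite action list and a uniform choice among the top fraction (the published proposal may differ in detail; the function names here are made up for illustration). The second helper measures the worry directly: what share of the top X% is off-distribution.

```python
import random


def quantilize(actions, score, top_fraction, rng=random):
    """Pick uniformly at random from the top `top_fraction` of actions by score."""
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    ranked = sorted(actions, key=score, reverse=True)
    # Keep at least one action so the choice is always defined.
    cutoff = max(1, int(len(ranked) * top_fraction))
    return rng.choice(ranked[:cutoff])


def off_distribution_share(actions, score, top_fraction, is_off_distribution):
    """Fraction of the top-scoring actions that are flagged as off-distribution."""
    ranked = sorted(actions, key=score, reverse=True)
    cutoff = max(1, int(len(ranked) * top_fraction))
    top = ranked[:cutoff]
    return sum(1 for a in top if is_off_distribution(a)) / len(top)
```

Nothing in `quantilize` looks at whether an action is familiar, so if most high-scoring actions are off-distribution, most of its picks will be too.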

In any case, quantilization seems like it shouldn't work due to the fragility of value thesis. If we were to order all of the possible configurations of Earth's atoms from best to worst according to our values, the top 1% of those configurations is still mostly configurations which aren't very valuable.

Comment by john_maxwell_iv on Does Bayes Beat Goodhart? · 2019-06-05T00:54:44.928Z · score: 2 (1 votes) · LW · GW

Why not just tell the AI the truth? Which, in this case, is: Although we might not be able to give it useful information to differentiate between certain complex candidate hypotheses at this point in time, as we reflect and enhance our intelligence, this will become possible. This process of us reflecting & enhancing our intelligence will take an eyeblink in cosmic time. The amount of time from now until the heat death of the universe is so large that instead of maximizing EU according to a narrow conception of our values in the short term, it's better for the AI's actions to remain compatible with a broad swath of potential values that we might discover are the correct values on reflection.

Comment by john_maxwell_iv on Does Bayes Beat Goodhart? · 2019-06-04T03:49:44.496Z · score: 2 (1 votes) · LW · GW

Why wouldn't they find weird corner cases where many of the hypotheses give extremely high scores not normally achievable?

Why would a system of more-than-moderate intelligence find such incorrect hypotheses to be the most plausible ones? There would have to be some reason why all the hypotheses which strongly disliked this corner case were ruled out.

I know I'm being a little fuzzy about realizability. Let's consider how humans solve these problems. Suppose you had a pet alien, with alien values, which is capable of limited communication regarding its preferences. The goal of corrigibility is to formalize your good-faith efforts to take care of your alien to the best of your ability into an algorithm that a computer can follow. Suppose you think of some very unusual idea for taking care of your alien which, according to a few hypotheses you've come up with for what it likes, would make it extremely happy. If you were reasonably paranoid, you might address the issue of unrealized hypotheses on the spot, and attempt to craft a new hypothesis which is compatible with most/all of the data you've seen and also has your unusual idea inadvertently killing the alien. (This is a bit like "murphyjitsu" from CFAR.) If you aren't able to generate such a hypothesis, but such a hypothesis does in fact exist, and is the correct hypothesis, and the alien dies after your idea... then you probably aren't super smart.

I'm just saying that the case is not clear, and it seems like we'd want the case to be clear.

You have to start somewhere. Discussions like this can help make things clear :) I'm getting value from it... you've given me some things to think about, and I think the murphyjitsu idea is something I hadn't thought of previously :)

I think it often makes sense to reason at an informal level before proceeding to a formal one.

Edit: related discussion here.

Comment by john_maxwell_iv on Selection vs Control · 2019-06-03T17:57:10.293Z · score: 4 (2 votes) · LW · GW

Totally agree this is a useful distinction. The map/territory thing feels right on. This is something that the mainstream AI research community doesn't seem confused about. As far as I can see, no one there thinks search and planning are the same task.

With regard to search algorithms being controllers: Here's a discussion I had with ErickBall where they argue that planning will ultimately prove useful for search and I argue it won't. There might also be some new ideas for "what's the critical distinction" in that discussion.

Comment by john_maxwell_iv on Does Bayes Beat Goodhart? · 2019-06-03T07:29:50.699Z · score: 4 (2 votes) · LW · GW

I agree that this general picture seems to make sense, but, it does not alleviate the concerns which you are responding to. To reiterate: if there are serious Goodhart-shaped concerns about mostly-correct-but-somewhat-wrong utility functions breaking under optimization pressure, then why do those concerns go away for mixture distributions?

I agree that the uncertainty will cause the AI to investigate, but at some point there will be diminishing returns to investigation; the remaining hypotheses might be utility functions which can't be differentiated by the type of evidence which the AI is able to gather. At that point, the AI will then put a lot of optimization pressure on the mixture distribution which remains. Then, what is the argument that things go well? Won't this run into siren worlds and so on, by default?

The siren world scenario posits an AI that is "actually evil" and is an agent which makes plans to manipulate the user.

  • If the AI assigns decent credence to a utility function that assigns massive negative utility to "evil and unmitigated suffering", that will cause its subjective expected utility estimate of the siren world to take a big hit. It would be better off implementing the exact same world, minus the evil and unmitigated suffering. The only way it would think that world was actually better with the evil and unmitigated suffering in it is if something went very wrong during the data-gathering process.

  • I also don't think we should create an agent which makes plans to manipulate the user. The only question it should ever ask the user is the one that maximizes its subjective value of information.

The marketing world problem is very related to the discussion I had with Paul Christiano here. The problem is that the overseer has insufficient time to reflect on their true values. I don't think there is any way of getting around this issue in general: Creating FAI is time-sensitive, which means we won't have enough time to reflect on our true values to be 100% sure that all the input we give the AI is good. In addition to the things I mentioned in that discussion, I think we should:

  • Make a system that's capable of changing its values "online" in response to our input. Corrigibility lets us procrastinate on moral philosophy.

  • Instead of trying to build eutopia right off the bat, build an "optimal ivory tower" for doing moral philosophy in. Essentially, implement coherent extrapolated volition in the real world.

Anyway, the reason the Goodhart-shaped concerns go away is because the thing that maximizes the mixture is likely to be something that is approved of by a diverse range of utility functions that are all semi-compatible with the input the user has provided. If there's even a single plausible utility function which strongly disapproves, the value of information of requesting clarification from the overseer regarding that particular plan is high. For a worked example, see "Smile maximization case study" in this essay.

As I said, I think Goodhart's law is largely about distributional shift. My scheme incentivizes the AI to mostly take "on-distribution" plans: plans it is confident are good, because many different ways of looking at the data all point to them being good. "Off-distribution" plans will tend to benefit from clarification first: Some ways of extrapolating the data say they are good, others say they are bad, so VoI is high.
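The "ask when plausible utility functions disagree" idea can be sketched as follows. This is a toy illustration, not the scheme itself: the spread between the highest and lowest score is used as a crude stand-in for value of information, and `disagreement_threshold` is a hypothetical parameter.

```python
def decide(plan_scores, disagreement_threshold):
    """Choose the plan with the best average score across candidate utility functions.

    `plan_scores` maps each plan to a list of scores, one per candidate
    utility function. Returns ("execute", plan) when the candidates roughly
    agree about the chosen plan, and ("ask", plan) when they disagree enough
    that clarification from the overseer looks worthwhile.
    """
    def mean(xs):
        return sum(xs) / len(xs)

    best = max(plan_scores, key=lambda plan: mean(plan_scores[plan]))
    spread = max(plan_scores[best]) - min(plan_scores[best])
    if spread > disagreement_threshold:
        return ("ask", best)
    return ("execute", best)
```

A plan that every candidate rates similarly is executed; a plan that wins on average but that one candidate strongly dislikes triggers a question first.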

the remaining hypotheses might be utility functions which can't be differentiated by the type of evidence which the AI is able to gather

Thanks for bringing this up, I'll think about it. Part of me wants to say "if the AI has wrung all the information it possibly can from the user, and it is well-calibrated [in the sense I defined the term above], then it should just maximize its subjective expected utility at that point, because maximizing expected utility is just what you do!" Or: "If the overseer isn't capable of evaluating plans anymore because they are too complex, maybe it is time for the AI to help the overseer upgrade their intelligence!" But maybe there's an elegant way to implement a more conservative design. (You could, for example, disallow the execution of any plan that the AI thought there was at least a 5% chance was below some utility threshold. But that involves the use of two arbitrary parameters, which seems inelegant.)
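The conservative rule in the parenthetical could look something like this, assuming the AI's beliefs about a plan are available as samples of utility from its posterior (that representation, and the function name, are assumptions made for the sketch). The two arguments are exactly the two arbitrary parameters mentioned above.

```python
def is_allowed(utility_samples, utility_threshold, max_failure_probability=0.05):
    """Return True if the plan passes the conservative check.

    `utility_samples` are utilities of the plan under samples from the AI's
    posterior. The plan is disallowed when the estimated probability of
    falling below `utility_threshold` is at least `max_failure_probability`.
    """
    below = sum(1 for u in utility_samples if u < utility_threshold)
    return below / len(utility_samples) < max_failure_probability
```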

Comment by john_maxwell_iv on Does Bayes Beat Goodhart? · 2019-06-03T04:31:12.435Z · score: 5 (3 votes) · LW · GW

However, I think it is reasonable to at least add a calibration requirement: there should be no way to systematically correct estimates up or down as a function of the expected value.

Why is this important? If the thing with the highest score is always the best action to take, why does it matter if that score is an overestimate? Utility functions are fictional anyway right?

Calibration seems like it does, in fact, significantly address regressional Goodhart. You can't have seen a lot of instances of an estimate being too high, and still accept that too-high estimate. It doesn't address extremal Goodhart, because calibrated learning can only guarantee that you eventually calibrate, or converge at some rate, or something like that -- extreme values that you've rarely encountered would remain a concern.

If I understand correctly, extremal Goodhart is essentially the same as distributional shift from the Concrete Problems in AI Safety paper.

In any case... I'm not exactly sure what you mean by "calibration", but when I say "calibration", I refer to "knowing what you know". For example, when I took this online quiz, it told me that when I said I was extremely confident something was true, I was always right, and when I said I was a little confident something was true, I was only right 66% of the time. I take this as an indicator that I'm reasonably "well-calibrated"; that is, I have a sense of what I do and don't know.
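The kind of report such a quiz produces can be sketched as a table of observed accuracy per stated confidence level. The numeric confidence levels used to index the table are hypothetical; the quiz's actual scale isn't given here.

```python
from collections import defaultdict


def calibration_table(predictions):
    """Map each stated confidence level to the observed fraction of correct answers.

    `predictions` is an iterable of (stated_confidence, was_correct) pairs.
    A well-calibrated answerer has observed accuracy close to the stated
    confidence at every level.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for confidence, was_correct in predictions:
        totals[confidence] += 1
        if was_correct:
            hits[confidence] += 1
    return {c: hits[c] / totals[c] for c in sorted(totals)}
```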

A calibrated AI system, to me, is one that correctly says "this thing I'm looking at is an unusual thing I've never encountered before, therefore my 95% credible intervals related to it are very wide, and the value of clarifying information from my overseer is very high".

Your complaints about Bayesian machine learning seem correct. My view is that addressing these complaints & making some sort of calibrated learning method competitive with deep learning is the best way to achieve FAI. I haven't yet seen an FAI problem which seems like it can't somehow be reduced to calibrated learning.

I'm not super hung up on statistical guarantees, as I haven't yet seen a way to make them in general which doesn't require making some sort of unreasonable or impractical assumption about the world (and I'm skeptical such a method exists). The way I see it, if your system is capable of self-improving in the right way, it should be able to overcome deficiencies in its world-modeling capabilities for itself. In my view, the goal is to build a system which gets safer as it self-improves & becomes better at reasoning.

If there's a true utility function which is assigned some weight, and we apply a whole lot of optimization pressure to the overall mixture distribution, then it is perfectly possible that the true utility function gets compromised for the sake of satisfying a large number of other possible utility functions.

If our AI system assigns high subjective credence to a large variety of utility functions, then the value of information which helps narrow things down is high.

To oversimplify my preferred approach: The initial prior acts as a sort of net which should have the true utility function in it somewhere. Clarifying questions to the overseer let the AI pull this net tight around a much smaller set of possible utility functions. It does this until the remaining utility functions can't easily be distinguished through clarifying questions, and/or the remaining utility functions all say to do the same thing in scenarios of near-term interest. If we find ourselves in some unusual unanticipated situation, the utility functions will likely disagree on what to do, and then the clarifying questions start again.

Why should we think that there's a "true" utility function which captures our preferences? And, if there is, why should we assume that it has an explicit representation in the hypothesis space?

Technically, you don't need this assumption. As I wrote in this comment: "it's not necessary for our actual preferences to be among the ensemble of models if for any veto that our actual preferences would make, there's some model in the ensemble that also makes that veto."

(I haven't read a lot about quantilization so I can't say much about that. However, a superintelligent adversary seems like something to avoid.)

Comment by john_maxwell_iv on Uncertainty versus fuzziness versus extrapolation desiderata · 2019-06-01T09:26:49.280Z · score: 7 (2 votes) · LW · GW

I think it's better not to let jargon proliferate unnecessarily, and your use of the term "fuzziness" seems rather, well, fuzzy. Is it possible that the content of this post could be communicated using existing jargon such as "moral uncertainty"?

Comment by john_maxwell_iv on What is the best online community for questions about AI capabilities? · 2019-06-01T08:57:17.921Z · score: 2 (1 votes) · LW · GW

Maybe Quora?

Comment by john_maxwell_iv on Feedback Requested! Draft of a New About/Welcome Page for LessWrong · 2019-06-01T06:18:18.234Z · score: 24 (9 votes) · LW · GW

If you want to get ideas, you could look at the history of the old about page and homepage on the wiki. Looking over the versions of those pages I wrote, here are some things I like about my versions better:

  • I don't try to be super comprehensive. I link to an FAQ for reference. FAQs are nice because they're indexed by the content the user wants to access.
  • There is just generally less text. Some of the stuff you're writing doesn't deliver a lot of value to the reader in my opinion. For example, you write: "We invite you to use this site for any number of reasons, including, but not limited to: learning valuable things, being entertained, sharing and getting feedback on your ideas, and participating in a community you like." You're basically describing how people use social media websites. It's not delivering insight for the average reader and it's going to cause people's eyes to glaze over. Omit needless words. At most, this sentence should be a footnote or FAQ question "Can I use Less Wrong for things that aren't rationality?" or a shorter sentence "Less Wrong isn't just for rationality, everything is on topic in personal blogposts". Remember that we're trying to put our best foot forward with this page, which will be read by many people, so time spent wordsmithing is worthwhile. (Note: It's fine to blather on in an obscure comment like I'm doing here.)
  • I place less emphasis on individuals. Compare: "The writings of Albert Einstein and Richard Feynman comprise the core readings of Here are Albert's writings, and here are Richard's."
  • I don't try to sell people on reading long sequences of posts right away. I'd sprinkle a variety of interesting, important links I wish more people even outside the community would read, in kind of a clickbaity way, to give people a sense of what the site is about and why it's interesting before getting them to invest in reading a book-length document.
  • I try to emphasize self-improvement benefits. It's a good sales pitch (always start with benefit to the customer), and I think it draws the right sort of ambitious, driven people into the community. Upgrade your beliefs, habits, brain, etc. You do touch on this but you don't lead with the benefits as much as you could. In sales, I think it's better to present the problem before the solution. But you present the solution ("rationality") before the problem.
  • I emphasize that the community is weird and has weird interests. If Less Wrong causes you to acquire some unusual opinions relative to your society or social circle, that's a common side effect. Autodidactism, cryonics, artificial intelligence, effective altruism, transhumanism, etc. You could "show not tell" by saying: "Here's a particular topic many users currently have a contrarian opinion about. But if you still disagree after reading our thoughts, we want to hear why!"

If I was writing the about page in today's era, I would probably emphasize much more heavily that Less Wrong has a much higher standard of discussion than most of the internet, what that means (emphasis on curiosity/truthseeking/critical thinking/intellectual collaboration, long attention spans expected of readers), how we work to preserve it, etc. I might even make it the central thesis of the about page. I think this would help lay down the right culture if the site was to expand, and also attract good people and prime them to be on their best behavior.

I think I'd also lean on the word "rationality" somewhat less.

Comment by john_maxwell_iv on "But It Doesn't Matter" · 2019-06-01T05:44:51.969Z · score: 19 (11 votes) · LW · GW

Sounds like an argument for reading more celebrity gossip :)

Comment by john_maxwell_iv on [Meta] Hiding negative karma notifications by default · 2019-05-13T04:55:14.121Z · score: 2 (1 votes) · LW · GW

Elizabeth's point seems important. What if there was a weekly notification for "here's the downvotes you received in the past week"?

Comment by john_maxwell_iv on The Relationship Between the Village and the Mission · 2019-05-13T04:08:41.315Z · score: 6 (3 votes) · LW · GW

Run the occasional event that requires and/or builds a skill (rationality skills or otherwise).

FYI, the EA Hotel has an upcoming weekend rationality workshop.

Comment by john_maxwell_iv on Disincentives for participating on LW/AF · 2019-05-11T23:22:13.805Z · score: 4 (2 votes) · LW · GW

Well if they're incompetent, that enhances the plausible deniability aspect ('If there was anything in the conversation that didn't make sense on reflection, they could say "oh it was probably the secretary's mistake in transcribing the conversation".') It also might be a way to quickly evaluate someone's distillation ability.

Comment by john_maxwell_iv on Disincentives for participating on LW/AF · 2019-05-11T04:16:43.331Z · score: 9 (5 votes) · LW · GW

So it serves as a training program for aspiring researchers... even better! Actually, in more ways than one, because other aspiring researchers can read the transcript and come up to speed more quickly.

Comment by john_maxwell_iv on Disincentives for participating on LW/AF · 2019-05-11T01:54:52.074Z · score: 10 (6 votes) · LW · GW

Online discussions are much more scalable than in-person ones. And the stuff you write becomes part of a searchable archive.

I also feel that online discussions allow me to organize my thoughts better. And I think it can be easier to get to the bottom of a disagreement online, whereas in person it's easier for someone to just keep changing the subject and make themselves impossible to pin down, or something like that. In-person conversations end up being too depth-first somehow.

Comment by john_maxwell_iv on Disincentives for participating on LW/AF · 2019-05-11T01:48:56.216Z · score: 19 (7 votes) · LW · GW

What if AI safety researchers hired a secretary to take notes on their conversations? If there was anything in the conversation that didn't make sense on reflection, they could say "oh it was probably the secretary's mistake in transcribing the conversation". Heck, the participants could even be anonymized.

Comment by john_maxwell_iv on How To Use Bureaucracies · 2019-05-10T16:31:47.276Z · score: 1 (4 votes) · LW · GW

If it's hard to conduct RCTs in a domain, it's hard to have reliable knowledge about it, period. Who's to say whether your anecdotal observations & conclusions beat mine or someone else's? One way is to check whether someone's job is high status enough that their writings on the topic can be considered part of "the literature". But this is a weak heuristic IMO.

Comment by john_maxwell_iv on How To Use Bureaucracies · 2019-05-09T18:53:23.171Z · score: 7 (4 votes) · LW · GW

I don't think people should feel obligated to read all that's been written on a topic before posting their thoughts on that topic to Less Wrong, especially if that writing is not supported by randomized controlled trials (unsure if this is true for the writings you cite).

Comment by john_maxwell_iv on Crypto quant trading: Naive Bayes · 2019-05-09T05:32:33.981Z · score: 2 (1 votes) · LW · GW

Is this an instance of the "theory" bullet point then? Because the probability of the statement "trading signal XYZ works on Wednesdays, because [specific reason]" cannot be higher than the probability of the statement "trading signal XYZ works" (the first statement involves a conjunction).
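The conjunction point can be written out explicitly. The symbols are labels introduced here for illustration: S stands for "trading signal XYZ works" and R for "it works on Wednesdays, for the specific reason given".

```latex
P(S \wedge R) \;=\; P(S)\,P(R \mid S) \;\le\; P(S),
\qquad \text{since } 0 \le P(R \mid S) \le 1 .
```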

Comment by john_maxwell_iv on Crypto quant trading: Naive Bayes · 2019-05-08T21:12:30.535Z · score: 2 (1 votes) · LW · GW

I'd be interested to learn more about the "components" part.

Raleigh SSC/LW/EA Meetup - Meet MealSquares People

2019-05-08T00:01:36.639Z · score: 12 (3 votes)

Comment by john_maxwell_iv on The AI alignment problem as a consequence of the recursive nature of plans · 2019-04-10T01:58:57.754Z · score: 7 (4 votes) · LW · GW

Any agent that seeks X as an instrumental goal, with, say, Y as a terminal goal, can easily be outcompeted by an agent that seeks X as a terminal goal.

You offered a lot of arguments for why this is true for humans, but I'm less certain this is true for AIs.

Suppose the first AI devotes 100% of its computation to achieving X, and the second AI devotes 90% of its computation to achieving X and 10% of its computation to monitoring that achieving X is still helpful for achieving Y. All else equal, the first AI is more likely to win. But it's not necessarily true that all else is equal. For example, if the second AI possessed 20% more computational resources than the first AI, I'd expect the second AI to win even though it only seeks X as an instrumental goal.
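The arithmetic behind that example, as a toy calculation. It assumes, purely for illustration, that progress on X scales linearly with the computation devoted to X; the function name is made up.

```python
def compute_spent_on_x(total_compute, share_spent_on_x):
    """Computation actually devoted to X, in units of the first AI's total compute."""
    return total_compute * share_spent_on_x


first_ai = compute_spent_on_x(total_compute=1.0, share_spent_on_x=1.0)
second_ai = compute_spent_on_x(total_compute=1.2, share_spent_on_x=0.9)
print(first_ai, second_ai)
```

Under the linear assumption the second AI puts roughly 1.08 units toward X against the first AI's 1.0, so the 10% monitoring overhead is more than paid for by the 20% resource advantage.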

Comment by john_maxwell_iv on Reinforcement learning with imperceptible rewards · 2019-04-08T05:01:56.091Z · score: 2 (1 votes) · LW · GW

The literature study was very cursory and I will be glad to know about prior work I missed!

This post of mine seems related.

Comment by john_maxwell_iv on Defeating Goodhart and the "closest unblocked strategy" problem · 2019-04-04T21:26:59.119Z · score: 2 (1 votes) · LW · GW

It's uncertainty all the way down. This is where recursive self-improvement comes in handy.

Comment by john_maxwell_iv on Defeating Goodhart and the "closest unblocked strategy" problem · 2019-04-03T22:56:22.579Z · score: 4 (2 votes) · LW · GW

Glad you are thinking along these lines. Personally, I would go even further to use existing ML concepts in the implementation of this idea. Instead of explicitly stating W as our current best estimate for U, provide the system with a labeled dataset about human preferences, using soft labels (probabilities that aren't 0 or 1) instead of hard labels, to better communicate our uncertainty. Have the system use active learning to identify examples such that getting a label for those examples would be highly informative for its model. Use cross-validation to figure out which modeling strategies generalize with calibrated probability estimates most effectively. I'm pretty sure there are also machine learning techniques for identifying examples which have a high probability of being mislabeled, or examples that are especially pivotal to the system's model of the world, so that could be used to surface particular examples so the human overseer could give them a second look. (If such techniques don't exist already I don't think it would be hard to develop them.)
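A toy sketch of two of the ingredients mentioned above: training against soft labels, and an active-learning query rule. It uses only the standard library and assumes a binary "does the human approve?" label. The function names, the choice of cross-entropy as the loss, and the choice of uncertainty sampling as the query rule are illustrative assumptions, not something the comment specifies.

```python
import math
from typing import Sequence


def soft_label_loss(soft_label: float, predicted: float) -> float:
    """Cross-entropy between a soft label in [0, 1] and a predicted probability."""
    eps = 1e-12
    q = min(max(predicted, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(soft_label * math.log(q) + (1.0 - soft_label) * math.log(1.0 - q))


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) prediction; 0 when the model is certain."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def most_informative(predicted_probs: Sequence[float]) -> int:
    """Uncertainty sampling: index of the example the model is least sure about."""
    return max(range(len(predicted_probs)),
               key=lambda i: binary_entropy(predicted_probs[i]))


# Hypothetical model outputs for three unlabeled examples.
query_index = most_informative([0.95, 0.52, 0.10])
```

With a soft label of 0.8, the loss is lowest when the model also predicts 0.8, so overconfident predictions are penalized. The query rule asks the overseer about the example whose prediction is closest to 0.5.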

Comment by john_maxwell_iv on What would you need to be motivated to answer "hard" LW questions? · 2019-03-30T07:37:14.087Z · score: 4 (2 votes) · LW · GW

This could motivate me to spend minutes or hours answering a question, but I think it would be insufficient to motivate me to spend weeks or months. Maybe it would if there were an option to also submit my answer as a regular post.

Comment by john_maxwell_iv on What would you need to be motivated to answer "hard" LW questions? · 2019-03-30T07:10:18.000Z · score: 12 (4 votes) · LW · GW

If answering the question takes weeks or months of work, won't the question have fallen off the frontpage by the time the research is done?

What motivates me is making an impact and getting quality feedback on my thinking. These both scale with the number of readers. If no one will read my answer, I'm not feeling very motivated.

Comment by john_maxwell_iv on Unsolved research problems vs. real-world threat models · 2019-03-30T06:28:20.749Z · score: 2 (1 votes) · LW · GW

Moreover, the “adversary” need not be a human actor searching deliberately: a search for mistakes can happen unintentionally any time a selection process with adverse incentives is applied. (Such as testing thousands of inputs to find which ones get the most clicks or earn the most money).

Is there a post or paper which talks about this in more detail?

I understand optimizing for an imperfect measurement, but it's not clear to me if/how this is linked to small perturbation adversarial examples beyond general handwaving about the deficiencies of machine learning.

Comment by john_maxwell_iv on Alignment Newsletter #50 · 2019-03-28T19:22:35.834Z · score: 4 (2 votes) · LW · GW

One thing that bothered me a bit about some of the AI doom discussion is that it felt a little like it was working backwards from the assumption of AI doom instead of working forwards from the situation we're currently in and various ways in which things could plausibly evolve. When I was a Christian, I remember reading websites which speculated about which historical events corresponded to various passages in the book of Revelation. Making the assumption that AI doom is coming and trying to figure out which real-world event corresponds to the prophesied doom is thinking that has a similar flavor.

Comment by john_maxwell_iv on The Main Sources of AI Risk? · 2019-03-26T03:06:21.987Z · score: 2 (1 votes) · LW · GW

I don't think proofs are the right tool here. Proof by induction was meant as an analogy.

Comment by john_maxwell_iv on The Main Sources of AI Risk? · 2019-03-24T19:36:55.715Z · score: 2 (1 votes) · LW · GW

One possibility is a sort of proof by induction, where you start with code which has been inspected by humans, then that code inspects further code, etc.

Daemons and mindcrime seem most worrisome for superhuman systems, but a human-level system is plausibly sufficient to comprehend human values (and thus do useful inspections). For daemons, I think you might even be able to formalize the idea without leaning hard on any specific utility function. The best approach might involve utility uncertainty on the part of the AI that becomes narrower with time, so you can gradually bootstrap your way to understanding human values while avoiding computational hazards according to your current guesses about human values on your way there.

People already choose not to think about particular topics on the basis of information hazards and internal suffering. Sometimes these judgements are made in an interrupt fashion partway through thinking about a topic; others are outside view judgments ("thinking about topic X always makes me feel depressed").

Comment by john_maxwell_iv on What failure looks like · 2019-03-24T19:20:58.399Z · score: 2 (1 votes) · LW · GW

You could always get a job at a company which controls an important algorithm.

Comment by john_maxwell_iv on Why the AI Alignment Problem Might be Unsolvable? · 2019-03-24T06:04:41.323Z · score: 8 (6 votes) · LW · GW

And even if somehow you could program an intelligence to optimize for those four competing utility functions at the same time, that would just cause it to optimize for conflict resolution, and then it would just tile the universe with tiny artificial conflicts between artificial agents for it to resolve as quickly and efficiently as possible without letting those agents do anything themselves.

I don't believe an AI which simultaneously optimized multiple utility functions using a moral parliament approach would tile the universe with tiny artificial agents as described here.

"Optimizing for competing utility functions" is not the same as optimizing for conflict resolution. There are various schemes for combining utility functions (some discussion on this podcast for instance). But let's wave our hands a bit and say each of my utility functions outputs a binary approve/disapprove signal for any given action, and we choose randomly among those actions which are approved of by all of my utility functions. Then if even a single utility function doesn't approve of the action "tile the universe with tiny artificial conflicts between artificial agents for it to resolve as quickly and efficiently as possible without letting those agents do anything themselves", this action will not be done.
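The hand-wavy scheme above can be sketched in a few lines. The action names and the two approval functions are hypothetical stand-ins; the only part taken from the comment is the rule itself: choose at random among the actions that every utility function approves of.

```python
import random
from typing import Callable, Optional, Sequence

Approver = Callable[[str], bool]


def choose_action(actions: Sequence[str],
                  approvers: Sequence[Approver],
                  rng: random.Random) -> Optional[str]:
    """Pick uniformly among actions approved by all approvers; None if none qualify."""
    approved = [a for a in actions if all(approve(a) for approve in approvers)]
    if not approved:
        return None
    return rng.choice(approved)


# Hypothetical actions and approval signals.
ACTIONS = ["tile universe with tiny conflicts", "cure diseases", "do nothing"]


def values_autonomy(action: str) -> bool:
    # This utility function vetoes the tiling action.
    return action != "tile universe with tiny conflicts"


def values_conflict_resolution(action: str) -> bool:
    # This one approves of everything, including the tiling action.
    return True


chosen = choose_action(ACTIONS,
                       [values_autonomy, values_conflict_resolution],
                       random.Random(0))
```

A single veto is enough: the tiling action never reaches the random draw, regardless of how enthusiastic the other utility functions are about it.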

Comment by john_maxwell_iv on The Main Sources of AI Risk? · 2019-03-23T17:02:20.314Z · score: 4 (2 votes) · LW · GW

You could add another entry for "something we haven't thought of".

I think the best way to deal with the "something we haven't thought of" entry is to try & come up with simple ideas which knock out multiple entries on this list simultaneously. For example, 4 and 17 might both be solved if our system inspects code before running it to try & figure out whether running that code will be harmful according to its values. This is a simple solution which plausibly generalizes to problems we haven't thought of. (Assuming the alignment problem is solved.)

In the same way simple statistical models are more likely to generalize, I think simple patches are also more likely to generalize. Having a separate solution for every item on the list seems like overfitting to the list.

Comment by john_maxwell_iv on Humans aren't agents - what then for value learning? · 2019-03-19T04:17:19.600Z · score: 5 (3 votes) · LW · GW

Flagging that the end of "The Tails Coming Apart as Metaphor for Life" more or less describes "distributional shift" from the Concrete Problems in AI Safety paper.

I have a hunch that many AI safety problems end up boiling down to distributional shift in one way or another. For example, here I argued that concerns around Goodhart's Law are essentially an issue of distributional shift: If the model you're using for human values is vulnerable to distributional shift, then the maximum value will likely be attained off-distribution.
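A toy illustration of that last claim, under heavy assumptions. The "true" utility function, the training range, and the linear proxy model below are all made up for the example; nothing here comes from the linked argument beyond the general shape of the failure.

```python
def true_utility(x: float) -> float:
    """Hypothetical ground truth: rises until x = 1, then falls off sharply."""
    return x - 0.5 * x * x


# Training data covers only a narrow range (x from 0.0 to 0.5).
train_x = [i / 10 for i in range(6)]
train_y = [true_utility(x) for x in train_x]

# Fit a straight-line proxy by ordinary least squares.
mean_x = sum(train_x) / len(train_x)
mean_y = sum(train_y) / len(train_y)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
         / sum((x - mean_x) ** 2 for x in train_x))
intercept = mean_y - slope * mean_x


def proxy_utility(x: float) -> float:
    """Learned stand-in for true_utility; only trustworthy near the training data."""
    return slope * x + intercept


# Optimize the proxy over a much wider range (x from 0.0 to 10.0).
candidates = [i / 10 for i in range(101)]
best_by_proxy = max(candidates, key=proxy_utility)

# The proxy's optimum lies far outside the training range, where the true
# utility is much worse than at the edge of the training data.
assert best_by_proxy > max(train_x)
assert true_utility(best_by_proxy) < true_utility(max(train_x))
```

The proxy is accurate on the data it was fit to, but the optimizer pushes x to wherever the proxy is largest, which is exactly the region where the proxy was never checked.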

Comment by john_maxwell_iv on What failure looks like · 2019-03-19T01:06:04.852Z · score: 2 (1 votes) · LW · GW

To a large extent "ML" refers to a few particular technologies that have the form "try a bunch of things and do more of what works" or "consider a bunch of things and then do the one that is predicted to work."

Why not "try a bunch of measurements and figure out which one generalizes best" or "consider a bunch of things and then do the one that is predicted to work according to the broadest variety of ML-generated measurements"? (I expect there's already some research corresponding to these suggestions, but more could be valuable?)

Comment by john_maxwell_iv on What failure looks like · 2019-03-18T05:03:07.956Z · score: 10 (5 votes) · LW · GW

OK, thanks for clarifying. Sounds like a new framing of the "daemon" idea.

Comment by john_maxwell_iv on What failure looks like · 2019-03-18T02:16:58.426Z · score: 7 (4 votes) · LW · GW

Once we start searching over policies that understand the world well enough, we run into a problem: any influence-seeking policies we stumble across would also score well according to our training objective, because performing well on the training objective is a good strategy for obtaining influence.


One reason to be scared is that a wide variety of goals could lead to influence-seeking behavior, while the “intended” goal of a system is a narrower target, so we might expect influence-seeking behavior to be more common in the broader landscape of “possible cognitive policies.”

Consider this video of an AI system with a misspecified reward function. (Background in this post.) The AI system searches the space of policies to discover the one that performs best according to its reward function in the simulated boat-racing world. It turns out that the one which performs best according to this misspecified reward function doesn't perform well according to the intended reward function (the "training objective" that the system's developers use to evaluate performance).

The goal of picking up as many power-ups as possible could lead to influence-seeking behavior: If the boat can persuade us to leave the simulation on, it can keep picking up power-ups until the end of time. Suppose for the sake of argument that performing well on the training objective is the best strategy for obtaining influence, as you posit. Then the boat should complete the race correctly, in order to fool us into thinking it reliably works towards the training objective. And yet it doesn't complete the race correctly in the video. Why not?

One answer is that the human supervisor isn't part of the system's world model. But I don't think that would change things. Suppose instead of making use of an existing video game, the system's world model was generated automatically by observing the world, and the observations were detailed enough to include the supervisor of the AI system and even the AI system itself. Now the boat is trying to find policies that maximize power-ups in this absurdly detailed, automatically generated world model (with some power-ups manually added in). Why would a policy which manipulates the operator within the simulated world score well? It seems like it would take a confused world model for manipulation of the simulated operator to help with picking up simulated power-ups. Like if painting animals on cave walls actually caused them to appear. Larry Ellison is not going to win a yacht race by telling his data scientist to cripple his opponents in a simulation.

[Another frame: Cartesian dualism will happen by default, or at least will be easy to enforce on the architectural level. You could argue Cartesian dualists lose because they don't do self-improvement? But an implied premise of your post is that foom won't happen. I disagree but that's another discussion.]

But let's suppose the world model actually is confused, and the best policy in the simulation is one that manipulates the simulated operator to gain simulated power-ups. Even in this case, I think we'd still see a video like I linked earlier. We'd see the boat powering over to the part of the simulated world where the simulated operator resides, doing something to manipulate the simulated operator, and then the boat would have loads of power-ups somehow. I think the biggest concern is exposure to an information hazard when we see how the boat manipulates the operator. (Luckily, if we implement an information hazard filter before letting ourselves watch the video, the boat will not optimize to get past it.)

Human billionaires are hiring physicists to try & figure out if our universe is a simulation and if so, how to hack our way out. So there might be something here. Maybe if world model construction happens in tandem with exploring the space of policies, the boat will start "considering the possibility that it's in a simulation" in a sense. (Will trying to manipulate the thing controlling the simulation be a policy that performs well in the simulation?)

Comment by john_maxwell_iv on Two Small Experiments on GPT-2 · 2019-03-11T00:03:22.198Z · score: 2 (1 votes) · LW · GW

they generate abstract theories of how and why different approaches work, experiment with different approaches in order to test those theories, and then iterate.

This description makes it sound like the researcher looks ahead about 1 step. I think that's short-term planning, not long-term planning.

My intuition is that the most important missing puzzle pieces for AGI involve the "generate abstract theories of how and why different approaches work" part. Once you've figured that out, there's a second step of searching for an experiment which will let you distinguish between your current top few theories. In terms of competitiveness, I think the "long-term planning free" approach of looking ahead just 1 step will likely prove just as competitive if not more so than trying to look ahead multiple steps. (Doing long-term planning means spending a lot of time refining theories about hypothetical data points you haven't yet gathered! That seems a bit wasteful, since most possible data points won't actually get gathered. Why not spend that compute gathering data instead?)

But I also think this may all be beside the point. Remember my claim from further up this thread:

In machine learning, we search the space of models, trying to find models which do a good job of explaining the data. Attaining new resources means searching the space of plans, trying to find a plan which does a good job of attaining new resources. (And then executing that plan!) These are different search tasks with different objective functions.

For the sake of argument, I'll assume we'll soon see major gains from long-term planning and modify my statement so it reads:

In machine learning++, we make plans for collecting data and refining theories about that data. Attaining new resources means making plans for manipulating the physical world. (And then executing that plan!) These are different search tasks with different objective functions.

Even in a world where long-term planning is a critical element of machine learning++, it seems to me that the state space that these plans act on is an abstract state space corresponding to states of knowledge of the system. It's not making plans for acting in the physical world, except accidentally insofar as it does computations which are implemented in the physical world. Despite its superhuman planning abilities, AlphaGo did not make any plans for e.g. manipulating humans in the physical world, because the state space it did its planning over only involved Go stones.

Comment by john_maxwell_iv on Karma-Change Notifications · 2019-03-07T19:53:11.966Z · score: 30 (8 votes) · LW · GW

FYI, I talked to Oliver about this and he says:

  • The average post gets between 200 and 500 unique views in the first month, with curated ones usually getting around 2k to 5k.

  • Usually viewership appears to be roughly 20 to 30 times the vote count.

The Case for a Bigger Audience

2019-02-09T07:22:07.357Z · score: 64 (25 votes)

Why don't people use formal methods?

2019-01-22T09:39:46.721Z · score: 21 (8 votes)

General and Surprising

2017-09-15T06:33:19.797Z · score: 3 (3 votes)

Heuristics for textbook selection

2017-09-06T04:17:01.783Z · score: 8 (8 votes)

Revitalizing Less Wrong seems like a lost purpose, but here are some other ideas

2016-06-12T07:38:58.557Z · score: 22 (28 votes)

Zooming your mind in and out

2015-07-06T12:30:58.509Z · score: 8 (9 votes)

Purchasing research effectively open thread

2015-01-21T12:24:22.951Z · score: 12 (13 votes)

Productivity thoughts from Matt Fallshaw

2014-08-21T05:05:11.156Z · score: 13 (14 votes)

Managing one's memory effectively

2014-06-06T17:39:10.077Z · score: 14 (15 votes)

OpenWorm and differential technological development

2014-05-19T04:47:00.042Z · score: 6 (7 votes)

System Administrator Appreciation Day - Thanks Trike!

2013-07-26T17:57:52.410Z · score: 70 (71 votes)

Existential risks open thread

2013-03-31T00:52:46.589Z · score: 10 (11 votes)

Why AI may not foom

2013-03-24T08:11:55.006Z · score: 23 (35 votes)

[Links] Brain mapping/emulation news

2013-02-21T08:17:27.931Z · score: 2 (7 votes)

Akrasia survey data analysis

2012-12-08T03:53:35.658Z · score: 13 (14 votes)

Akrasia hack survey

2012-11-30T01:09:46.757Z · score: 11 (14 votes)

Thoughts on designing policies for oneself

2012-11-28T01:27:36.337Z · score: 80 (80 votes)

Room for more funding at the Future of Humanity Institute

2012-11-16T20:45:18.580Z · score: 18 (21 votes)

Empirical claims, preference claims, and attitude claims

2012-11-15T19:41:02.955Z · score: 5 (28 votes)

Economy gossip open thread

2012-10-28T04:10:03.596Z · score: 23 (30 votes)

Passive income for dummies

2012-10-27T07:25:33.383Z · score: 17 (22 votes)

Morale management for entrepreneurs

2012-09-30T05:35:05.221Z · score: 9 (14 votes)

Could evolution have selected for moral realism?

2012-09-27T04:25:52.580Z · score: 4 (14 votes)

Personal information management

2012-09-11T11:40:53.747Z · score: 18 (19 votes)

Proposed rewrites of LW home page, about page, and FAQ

2012-08-17T22:41:57.843Z · score: 18 (19 votes)

[Link] Holistic learning ebook

2012-08-03T00:29:54.003Z · score: 10 (17 votes)

Brainstorming additional AI risk reduction ideas

2012-06-14T07:55:41.377Z · score: 12 (15 votes)

Marketplace Transactions Open Thread

2012-06-02T04:31:32.387Z · score: 29 (30 votes)

Expertise and advice

2012-05-27T01:49:25.444Z · score: 17 (22 votes)

PSA: Learn to code

2012-05-25T18:50:01.407Z · score: 34 (39 votes)

Knowledge value = knowledge quality × domain importance

2012-04-16T08:40:57.158Z · score: 8 (13 votes)

Rationality anecdotes for the homepage?

2012-04-04T06:33:32.097Z · score: 3 (8 votes)

Simple but important ideas

2012-03-21T06:59:22.043Z · score: 18 (23 votes)

6 Tips for Productive Arguments

2012-03-18T21:02:32.326Z · score: 30 (45 votes)

Cult impressions of Less Wrong/Singularity Institute

2012-03-15T00:41:34.811Z · score: 34 (59 votes)

[Link, 2011] Team may be chosen to receive $1.4 billion to simulate human brain

2012-03-09T21:13:42.482Z · score: 8 (15 votes)

Productivity tips for those low on motivation

2012-03-06T02:41:20.861Z · score: 7 (12 votes)

The Singularity Institute has started publishing monthly progress reports

2012-03-05T08:19:31.160Z · score: 21 (24 votes)

Less Wrong mentoring thread

2011-12-29T00:10:58.774Z · score: 31 (34 votes)

Heuristics for Deciding What to Work On

2011-06-01T07:31:17.482Z · score: 20 (23 votes)

Upcoming meet-ups: Auckland, Bangalore, Houston, Toronto, Minneapolis, Ottawa, DC, North Carolina, BC...

2011-05-21T05:06:08.824Z · score: 5 (8 votes)

Being Rational and Being Productive: Similar Core Skills?

2010-12-28T10:11:01.210Z · score: 18 (31 votes)

Applying Behavioral Psychology on Myself

2010-06-20T06:25:13.679Z · score: 53 (60 votes)

The Math of When to Self-Improve

2010-05-15T20:35:37.449Z · score: 6 (16 votes)

Accuracy Versus Winning

2009-04-02T04:47:37.156Z · score: 12 (21 votes)

So you say you're an altruist...

2009-03-12T22:15:59.935Z · score: 11 (35 votes)