Comment by john_maxwell_iv on The AI alignment problem as a consequence of the recursive nature of plans · 2019-04-10T01:58:57.754Z · score: 7 (4 votes) · LW · GW

Any agent that seeks X as an instrumental goal, with, say, Y as a terminal goal, can easily be outcompeted by an agent that seeks X as a terminal goal.

You offered a lot of arguments for why this is true for humans, but I'm less certain this is true for AIs.

Suppose the first AI devotes 100% of its computation to achieving X, and the second AI devotes 90% of its computation to achieving X and 10% of its computation to monitoring that achieving X is still helpful for achieving Y. All else equal, the first AI is more likely to win. But it's not necessarily true that all else is equal. For example, if the second AI possessed 20% more computational resources than the first AI, I'd expect the second AI to win even though it only seeks X as an instrumental goal.

Comment by john_maxwell_iv on Reinforcement learning with imperceptible rewards · 2019-04-08T05:01:56.091Z · score: 2 (1 votes) · LW · GW

The literature study was very cursory and I will be glad to know about prior work I missed!

This post of mine seems related.

Comment by john_maxwell_iv on Defeating Goodhart and the "closest unblocked strategy" problem · 2019-04-04T21:26:59.119Z · score: 2 (1 votes) · LW · GW

It's uncertainty all the way down. This is where recursive self-improvement comes in handy.

Comment by john_maxwell_iv on Defeating Goodhart and the "closest unblocked strategy" problem · 2019-04-03T22:56:22.579Z · score: 4 (2 votes) · LW · GW

Glad you are thinking along these lines. Personally, I would go even further to use existing ML concepts in the implementation of this idea. Instead of explicitly stating W as our current best estimate for U, provide the system with a labeled dataset about human preferences, using soft labels (probabilities that aren't 0 or 1) instead of hard labels, to better communicate our uncertainty. Have the system use active learning to identify examples such that getting a label for those examples would be highly informative for its model. Use cross-validation to figure out which modeling strategies generalize with calibrated probability estimates most effectively. I'm pretty sure there are also machine learning techniques for identifying examples which have a high probability of being mislabeled, or examples that are especially pivotal to the system's model of the world, so that could be used to surface particular examples so the human overseer could give them a second look. (If such techniques don't exist already I don't think it would be hard to develop them.)
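To gesture at what I mean, here's a minimal sketch of the uncertainty-sampling flavor of active learning with soft labels. The stand-in `predict` model and the toy example pool are purely illustrative, not a real preference-learning system:

```python
import math

def entropy(p):
    """Binary entropy of a soft label: peaks when the model is most
    uncertain (p = 0.5), zero when it is certain."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def most_informative(pool, predict):
    """Uncertainty sampling: pick the unlabeled example whose predicted
    approval probability is closest to 0.5, so a human label for it
    would be maximally informative for the model."""
    return max(pool, key=lambda x: entropy(predict(x)))

# Toy usage with a stand-in model mapping an example to an approval
# probability; a real system would use a trained classifier here.
predict = lambda x: x / 10.0
query = most_informative([1, 5, 9], predict)  # 5 -> p = 0.5, max entropy
```

A real pipeline would loop: query the human for a soft label on `query`, retrain, and repeat, with cross-validation deciding between modeling strategies.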

Comment by john_maxwell_iv on What would you need to be motivated to answer "hard" LW questions? · 2019-03-30T07:37:14.087Z · score: 4 (2 votes) · LW · GW

This could motivate me to spend minutes or hours answering a question, but I think it would be insufficient to motivate me to spend weeks or months. Maybe if there were an option to also submit my answer as a regular post.

Comment by john_maxwell_iv on What would you need to be motivated to answer "hard" LW questions? · 2019-03-30T07:10:18.000Z · score: 12 (4 votes) · LW · GW

If answering the question takes weeks or months of work, won't the question have fallen off the frontpage by the time the research is done?

What motivates me is making an impact and getting quality feedback on my thinking. These both scale with the number of readers. If no one will read my answer, I won't feel very motivated.

Comment by john_maxwell_iv on Unsolved research problems vs. real-world threat models · 2019-03-30T06:28:20.749Z · score: 2 (1 votes) · LW · GW

Moreover, the “adversary” need not be a human actor searching deliberately: a search for mistakes can happen unintentionally any time a selection process with adverse incentives is applied. (Such as testing thousands of inputs to find which ones get the most clicks or earn the most money).

Is there a post or paper which talks about this in more detail?

I understand optimizing for an imperfect measurement, but it's not clear to me if/how this is linked to small perturbation adversarial examples beyond general handwaving about the deficiencies of machine learning.

Comment by john_maxwell_iv on Alignment Newsletter #50 · 2019-03-28T19:22:35.834Z · score: 4 (2 votes) · LW · GW

One thing that bothered me a bit about some of the AI doom discussion is that it felt a little like it was working backwards from the assumption of AI doom instead of working forwards from the situation we're currently in and various ways in which things could plausibly evolve. When I was a Christian, I remember reading websites which speculated about which historical events corresponded to various passages in the book of Revelation. Making the assumption that AI doom is coming and trying to figure out which real-world event corresponds to the prophesied doom is thinking that has a similar flavor.

Comment by john_maxwell_iv on The Main Sources of AI Risk? · 2019-03-26T03:06:21.987Z · score: 2 (1 votes) · LW · GW

I don't think proofs are the right tool here. Proof by induction was meant as an analogy.

Comment by john_maxwell_iv on The Main Sources of AI Risk? · 2019-03-24T19:36:55.715Z · score: 2 (1 votes) · LW · GW

One possibility is a sort of proof by induction, where you start with code which has been inspected by humans, then that code inspects further code, etc.

Daemons and mindcrime seem most worrisome for superhuman systems, but a human-level system is plausibly sufficient to comprehend human values (and thus do useful inspections). For daemons, I think you might even be able to formalize the idea without leaning hard on any specific utility function. The best approach might involve utility uncertainty on the part of the AI that becomes narrower with time, so you can gradually bootstrap your way to understanding human values while avoiding computational hazards according to your current guesses about human values on your way there.

People already choose not to think about particular topics on the basis of information hazards and internal suffering. Sometimes these judgments are made in an interrupt fashion partway through thinking about a topic; others are outside-view judgments ("thinking about topic X always makes me feel depressed").

Comment by john_maxwell_iv on More realistic tales of doom · 2019-03-24T19:20:58.399Z · score: 2 (1 votes) · LW · GW

You could always get a job at a company which controls an important algorithm.

Comment by john_maxwell_iv on Why the AI Alignment Problem Might be Unsolvable? · 2019-03-24T06:04:41.323Z · score: 8 (6 votes) · LW · GW

And even if somehow you could program an intelligence to optimize for those four competing utility functions at the same time, that would just cause it to optimize for conflict resolution, and then it would just tile the universe with tiny artificial conflicts between artificial agents for it to resolve as quickly and efficiently as possible without letting those agents do anything themselves.

I don't believe an AI which simultaneously optimized multiple utility functions using a moral parliament approach would tile the universe with tiny artificial agents as described here.

"Optimizing for competing utility functions" is not the same as optimizing for conflict resolution. There are various schemes for combining utility functions (some discussion on this podcast for instance). But let's wave our hands a bit and say each of my utility functions outputs a binary approve/disapprove signal for any given action, and we choose randomly among those actions which are approved of by all of my utility functions. Then if even a single utility function doesn't approve of the action "tile the universe with tiny artificial conflicts between artificial agents for it to resolve as quickly and efficiently as possible without letting those agents do anything themselves", this action will not be done.

Comment by john_maxwell_iv on The Main Sources of AI Risk? · 2019-03-23T17:02:20.314Z · score: 4 (2 votes) · LW · GW

You could add another entry for "something we haven't thought of".

I think the best way to deal with the "something we haven't thought of" entry is to try & come up with simple ideas which knock out multiple entries on this list simultaneously. For example, 4 and 17 might both be solved if our system inspects code before running it to try & figure out whether running that code will be harmful according to its values. This is a simple solution which plausibly generalizes to problems we haven't thought of. (Assuming the alignment problem is solved.)

In the same way simple statistical models are more likely to generalize, I think simple patches are also more likely to generalize. Having a separate solution for every item on the list seems like overfitting to the list.

Comment by john_maxwell_iv on Humans aren't agents - what then for value learning? · 2019-03-19T04:17:19.600Z · score: 5 (3 votes) · LW · GW

Flagging that the end of "The Tails Coming Apart as Metaphor for Life" more or less describes "distributional shift" from the Concrete Problems in AI Safety paper.

I have a hunch that many AI safety problems end up boiling down to distributional shift in one way or another. For example, here I argued that concerns around Goodhart's Law are essentially an issue of distributional shift: If the model you're using for human values is vulnerable to distributional shift, then the maximum value will likely be attained off-distribution.

Comment by john_maxwell_iv on More realistic tales of doom · 2019-03-19T01:06:04.852Z · score: 2 (1 votes) · LW · GW

To a large extent "ML" refers to a few particular technologies that have the form "try a bunch of things and do more of what works" or "consider a bunch of things and then do the one that is predicted to work."

Why not "try a bunch of measurements and figure out which one generalizes best" or "consider a bunch of things and then do the one that is predicted to work according to the broadest variety of ML-generated measurements"? (I expect there's already some research corresponding to these suggestions, but more could be valuable?)

Comment by john_maxwell_iv on More realistic tales of doom · 2019-03-18T05:03:07.956Z · score: 10 (5 votes) · LW · GW

OK, thanks for clarifying. Sounds like a new framing of the "daemon" idea.

Comment by john_maxwell_iv on More realistic tales of doom · 2019-03-18T02:16:58.426Z · score: 7 (4 votes) · LW · GW

Once we start searching over policies that understand the world well enough, we run into a problem: any influence-seeking policies we stumble across would also score well according to our training objective, because performing well on the training objective is a good strategy for obtaining influence.


One reason to be scared is that a wide variety of goals could lead to influence-seeking behavior, while the “intended” goal of a system is a narrower target, so we might expect influence-seeking behavior to be more common in the broader landscape of “possible cognitive policies.”

Consider this video of an AI system with a misspecified reward function. (Background in this post.) The AI system searches the space of policies to discover the one that performs best according to its reward function in the simulated boat-racing world. It turns out that the one which performs best according to this misspecified reward function doesn't perform well according to the intended reward function (the "training objective" that the system's developers use to evaluate performance).

The goal of picking up as many power-ups as possible could lead to influence-seeking behavior: If the boat can persuade us to leave the simulation on, it can keep picking up power-ups until the end of time. Suppose for the sake of argument that performing well on the training objective is the best strategy for obtaining influence, as you posit. Then the boat should complete the race correctly, in order to fool us into thinking it reliably works towards the training objective. And yet it doesn't complete the race correctly in the video. Why not?

One answer is that the human supervisor isn't part of the system's world model. But I don't think that would change things. Suppose instead of making use of an existing video game, the system's world model was generated automatically by observing the world, and the observations were detailed enough to include the supervisor of the AI system and even the AI system itself. Now the boat is trying to find policies that maximize power-ups in this absurdly detailed, automatically generated world model (with some power-ups manually added in). Why would a policy which manipulates the operator within the simulated world score well? It seems like it would take a confused world model for manipulation of the simulated operator to help with picking up simulated power-ups. Like if painting animals on cave walls actually caused them to appear. Larry Ellison is not going to win a yacht race by telling his data scientist to cripple his opponents in a simulation.

[Another frame: Cartesian dualism will happen by default, or at least will be easy to enforce on the architectural level. You could argue Cartesian dualists lose because they don't do self-improvement? But an implied premise of your post is that foom won't happen. I disagree but that's another discussion.]

But let's suppose the world model actually is confused, and the best policy in the simulation is one that manipulates the simulated operator to gain simulated power-ups. Even in this case, I think we'd still see a video like I linked earlier. We'd see the boat powering over to the part of the simulated world where the simulated operator resides, doing something to manipulate the simulated operator, and then the boat would have loads of power-ups somehow. I think the biggest concern is exposure to an information hazard when we see how the boat manipulates the operator. (Luckily, if we implement an information hazard filter before letting ourselves watch the video, the boat will not optimize to get past it.)

Human billionaires are hiring physicists to try & figure out if our universe is a simulation and if so, how to hack our way out. So there might be something here. Maybe if world model construction happens in tandem with exploring the space of policies, the boat will start "considering the possibility that it's in a simulation" in a sense. (Will trying to manipulate the thing controlling the simulation be a policy that performs well in the simulation?)

Comment by john_maxwell_iv on Two Small Experiments on GPT-2 · 2019-03-11T00:03:22.198Z · score: 2 (1 votes) · LW · GW

they generate abstract theories of how and why different approaches work, experiment with different approaches in order to test those theories, and then iterate.

This description makes it sound like the researcher looks ahead about 1 step. I think that's short-term planning, not long-term planning.

My intuition is that the most important missing puzzle pieces for AGI involve the "generate abstract theories of how and why different approaches work" part. Once you've figured that out, there's a second step of searching for an experiment which will let you distinguish between your current top few theories. In terms of competitiveness, I think the "long-term planning free" approach of looking ahead just 1 step will likely prove just as competitive if not more so than trying to look ahead multiple steps. (Doing long-term planning means spending a lot of time refining theories about hypothetical data points you haven't yet gathered! That seems a bit wasteful, since most possible data points won't actually get gathered. Why not spend that compute gathering data instead?)

But I also think this may all be beside the point. Remember my claim from further up this thread:

In machine learning, we search the space of models, trying to find models which do a good job of explaining the data. Attaining new resources means searching the space of plans, trying to find a plan which does a good job of attaining new resources. (And then executing that plan!) These are different search tasks with different objective functions.

For the sake of argument, I'll assume we'll soon see major gains from long-term planning and modify my statement so it reads:

In machine learning++, we make plans for collecting data and refining theories about that data. Attaining new resources means making plans for manipulating the physical world. (And then executing that plan!) These are different search tasks with different objective functions.

Even in a world where long-term planning is a critical element of machine learning++, it seems to me that the state space that these plans act on is an abstract state space corresponding to states of knowledge of the system. It's not making plans for acting in the physical world, except accidentally insofar as it does computations which are implemented in the physical world. Despite its superhuman planning abilities, AlphaGo did not make any plans for e.g. manipulating humans in the physical world, because the state space it did its planning over only involved Go stones.

Comment by john_maxwell_iv on Karma-Change Notifications · 2019-03-07T19:53:11.966Z · score: 24 (7 votes) · LW · GW

FYI, I talked to Oliver about this and he says:

  • The average post gets between 200 and 500 unique views in the first month, with curated ones usually getting around 2k to 5k.

  • Usually viewership appears to be roughly 20 to 30 times the vote count.

Comment by john_maxwell_iv on Two Small Experiments on GPT-2 · 2019-03-06T21:25:29.185Z · score: 2 (1 votes) · LW · GW

This is similar to what I was half-joking about with respect to the AI-box experiment: most of the danger is in calculating the solution to the optimization problem. It's only a small step from there to somehow getting it implemented.

We've already calculated a solution for the optimization problem of "how to destroy human civilization": nuclear winter. It's only a "small step" to getting it implemented. But it has been several decades, and that small step hasn't been taken yet. Seems like the existence of a small step between knowledge of how to do something and actually doing it can be pretty meaningful.

My steelman is that a superpowered GPT-2 which isn't an agent could still inadvertently generate information hazards, which seems like a good point.

there's nothing stopping researchers from putting long-term planning into architecture search, except maybe lack of compute.

How do you reckon long-term planning will be useful for architecture search? It's not a stateful system.

Architecture search is a problem of figuring out where you want to go. Once you know where you want to go, getting there is easy. Just use that as your architecture. Long-term planning is useful on "getting there" problems, not "figuring out where you want to go" problems. There's little use in planning long-term in a highly uncertain environment, and the entire point of architecture search is to resolve uncertainty about the "environment" of possible architectures. ("Environment" in scare quotes because I think you're making a type error, and "search space" is the right term in the context of architecture search, but I'm playing along with your ontology for the sake of argument.)

Comment by john_maxwell_iv on Two Small Experiments on GPT-2 · 2019-03-06T21:10:36.798Z · score: 5 (2 votes) · LW · GW

If you literally ran (a powered-up version of) GPT-2 on "A brilliant solution to the AI alignment problem is..." you would get the sort of thing an average internet user would think of as a brilliant solution to the AI alignment problem.

Change it to: "I'm a Turing Award winner and Fields medalist, and last night I had an incredible insight about how to solve the AI alignment problem. The insight is..." It's improbable that a mediocre quality idea will follow. (Another idea: write a description of an important problem in computer science, followed by "The solution is...", and then a brilliant solution someone came up with. Do this for a few major solved problems in computer science. Then write a description of the AI alignment problem, followed by "The solution is...", and let GPT-2 continue from there.)
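A sketch of how the second idea's prompt could be assembled; the problem and solution strings here are placeholders:

```python
def few_shot_prompt(solved_examples, target_problem):
    """Assemble the few-shot prompt described above: several solved
    problems with their known solutions, then the target problem ending
    in a dangling 'The solution is...' cue for the model to continue."""
    parts = [f"{problem}\nThe solution is... {solution}\n"
             for problem, solution in solved_examples]
    parts.append(f"{target_problem}\nThe solution is...")
    return "\n".join(parts)

prompt = few_shot_prompt(
    [("Sorting in O(n log n)?", "mergesort."),
     ("Shortest paths with nonnegative weights?", "Dijkstra's algorithm.")],
    "The AI alignment problem: ...")
# prompt ends with the unfinished "The solution is..." cue
```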

Trying to do this more usefully basically leads to Paul's agenda (which is about trying to do imitation learning of an implicit organization of humans)

One take: Either GPT-2 can be radically improved (to offer useful completions as in the "Turing Award" example above), or it can't be. If it can be radically improved, it can help with FAI, perhaps by contributing to Paul's agenda. If it can't be radically improved, then it's not important for AGI. So GPT-2 is neutral or good news.

Comment by john_maxwell_iv on Karma-Change Notifications · 2019-03-02T21:46:16.123Z · score: 7 (4 votes) · LW · GW

Checking your userpage and checking your karma notifications are both random reinforcers, ergo switching from one to the other is dopamine neutral. Step one is to extinguish the behavior of checking your userpage by making that dopamine-neutral behavior swap. Step two is to decrease notification frequency.

Comment by john_maxwell_iv on Two Small Experiments on GPT-2 · 2019-03-02T07:22:12.518Z · score: 3 (2 votes) · LW · GW

But in training such a model, you explicitly define a utility function (minimization of prediction error) and then run powerful optimization algorithms on it. If those algorithms are just as complex as the superhuman language model, they could plausibly do things like hack the reward function, seek out information about the environment, or try to attain new resources in service of the goal of making the perfect language model.

Optimization algorithms used in deep learning are typically pretty simple. Gradient descent is taught in sophomore calculus. Variants on gradient descent are typically used, but all the ones I know of are well under a page of code in complexity.
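To make the "under a page of code" point concrete, here is vanilla gradient descent in a few lines (a generic sketch, not any particular library's implementation):

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Vanilla gradient descent: repeatedly step against the gradient.
    The optimizer itself is this simple even when the model space it
    searches is enormous."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2; its gradient is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)  # -> about 3.0
```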

But I'm not sure complexity is the right way to think about it. In machine learning, we search the space of models, trying to find models which do a good job of explaining the data. Attaining new resources means searching the space of plans, trying to find a plan which does a good job of attaining new resources. (And then executing that plan!) These are different search tasks with different objective functions.

The best counterargument I know of is probably something like this. As it was put in a recent post: "trying to predict the output of consequentialist reasoners can reduce to an optimisation problem over a space of things that contains consequentialist reasoners". This is the thing I would worry about most in a superhuman language model.

Comment by john_maxwell_iv on Karma-Change Notifications · 2019-03-02T05:25:03.911Z · score: 17 (8 votes) · LW · GW

I've found it particularly exciting to see comments and posts of mine that are many months old still getting upvotes, which I think makes me generally better calibrated on the long-term value of writing things up and making them public, instead of just talking to people in-person which tends to have a higher immediate reward but a lower long-term reward.

Even vote counts underestimate viewership numbers pretty drastically, don't they? I remember making comments with embedded polls where the poll got 100+ votes and the comment was sitting at +2. (And only logged-in users can vote in polls!)

Comment by john_maxwell_iv on Karma-Change Notifications · 2019-03-02T05:17:05.252Z · score: 12 (6 votes) · LW · GW

Cool feature! I've also noticed myself doing this.

A trick I've found for making behavior changes like this: Start with a "dopamine neutral" change that sets up a behavioral pathway for later changes. In this case, the "dopamine neutral" change is making notifications real-time. After a while, you unlearn the behavior of looking at your user page, because you're getting that info through notifications. Then you can slow the notifications down. The risk of setting them to daily or weekly right away is that you never unlearn the behavior of going straight to your userpage to get the latest changes.

Probably overkill for this use case, but the general pattern can be useful in other contexts. Example: Resolve to only open your web browser through a command line script. This is close to dopamine neutral. But once you've got that behavioral hook embedded, you can modify the script so that it forces you to wait 10 minutes, or asks you some questions about your intentions, or gets you to specify a whitelist of domains you will visit, or whatever. Then opening your browser through some other method serves as a Schelling fence you know not to cross. (If you find yourself making "just this once" modifications to the script, you could design some policies for when modifications can be made.) I've gotten a lot of mileage out of building systems like this.
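For concreteness, a minimal sketch of such a launcher script. The whitelist, delay, and `firefox` command are all illustrative placeholders for whatever you actually use:

```python
import subprocess
import sys
import time

WHITELIST = {"lesswrong.com", "en.wikipedia.org"}  # illustrative domains
DELAY_SECONDS = 600  # deliberate ten-minute friction

def open_browser(domain):
    """Refuse non-whitelisted domains, then wait before launching.
    Editing this file outside your pre-committed policy is the
    Schelling fence you know not to cross."""
    if domain not in WHITELIST:
        raise SystemExit(f"{domain} is not whitelisted; modify this "
                         "script only under your pre-committed policy.")
    time.sleep(DELAY_SECONDS)
    subprocess.run(["firefox", f"https://{domain}"])

if __name__ == "__main__" and len(sys.argv) > 1:
    open_browser(sys.argv[1])
```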

Comment by john_maxwell_iv on Open Thread February 2019 · 2019-02-25T21:48:58.709Z · score: 10 (6 votes) · LW · GW

Hey, welcome! Glad you made this post.

My biggest failure point is my inability to carry out goals. That’s what my inner Murphy says would cause my failure to get into MIRI and do good work. That’s probably the most important thing I’m currently trying to get out of LW - the Hammertime sequence looks promising. If anyone has any good recommendations for people who can’t remember to focus, I’d love them.

If you elaborate on your productivity issues, maybe we can offer specific recommendations. What's the nature of your difficulty focusing?

My Outside View is sane, and I know there’s a very low chance that I’ve seen something that everyone else missed.

I was a computer science undergraduate at a top university. The outside view is that for computer science students taking upper division classes, assisting professors with research is nothing remarkable. Pure math is different, because there is so much already and you need to climb to the top before contributing. But AI safety is a very young field.

The thing you're describing sounds similar to other proposals I've seen. But I'd suggest developing it independently for a while. A common piece of research advice: If you read what others write, you think the same thoughts they're thinking, which decreases your odds of making an original contribution. (Once you run out of steam, you can survey the literature, figure out how your idea is different, and publish the delta.)

I would suggest playing with ideas without worrying a lot about whether they're original. Independently re-inventing something can still be a rewarding experience. See also. You miss all the shots you don't take.

Final note: When I was your age, I suffered from the halo effect when thinking about MIRI. It took me years to realize MIRI has blind spots just like everyone else. I wish I had realized this sooner. They say science advances one funeral at a time. If AI safety is to progress faster than that, we'll need willingness to disregard the opinions of senior people while they're still alive. A healthy disrespect for authority is a good thing to have.

Comment by john_maxwell_iv on Two Small Experiments on GPT-2 · 2019-02-23T21:15:22.911Z · score: 2 (6 votes) · LW · GW

"If my calculator can multiply two 100-digit numbers, then it evidently has the ability to use resources to further its goal of doing difficult arithmetic problems, else it couldn't do difficult arithmetic problems."

This is magical thinking.

Comment by john_maxwell_iv on Two Small Experiments on GPT-2 · 2019-02-23T08:34:17.444Z · score: 5 (4 votes) · LW · GW

I'm not sure if you're being serious or not, but in case you are: Do you know much about how language models work? If so, which part of the code is the part that's going to turn the world into computronium?

We already have narrow AIs that are superhuman in their domains. To my knowledge, nothing remotely like this "turn the world to computronium in order to excel in this narrow domain" thing has ever happened. This post might be useful to read. In Scott Alexander jargon, a language model seems like a behavior-executor, not a utility-maximizer.

Comment by john_maxwell_iv on Thoughts on Human Models · 2019-02-22T22:30:05.104Z · score: 2 (1 votes) · LW · GW

A toy model I find helpful is correlated vs uncorrelated safety measures. Suppose we have 3 safety measures. Suppose if even 1 safety measure succeeds, our AI remains safe. And suppose each safety measure has a 60% success rate in the event of an accident. If the safety measures are accurately described by independent random variables, our odds of safety in an accident are 1 - 0.4^3 ≈ 94%. If the successes of the safety measures are perfectly correlated, failure of one implies certain failure of the others, and our odds of safety are only 1 - 0.4 = 60%.
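The two endpoints of this toy model are just a couple of formulas:

```python
def p_safe_independent(p_each, n):
    """Probability that at least one of n independent safety measures
    succeeds, given each succeeds with probability p_each."""
    return 1 - (1 - p_each) ** n

def p_safe_perfectly_correlated(p_each):
    """With perfect correlation, all measures succeed or fail together,
    so extra measures add nothing."""
    return p_each

independent = p_safe_independent(0.6, 3)       # 1 - 0.4**3 = 0.936
correlated = p_safe_perfectly_correlated(0.6)  # 0.6
```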

In my mind, this is a good argument for working on ideas like safely interruptible agents, impact measures, and boxing. The chance of these ideas failing seems fairly independent from the chance of your value learning system failing.

But I think you could get a similar effect by having your AGI search for models whose failure probabilities are uncorrelated with one another. The better your AGI, the better this approach is likely to work.

Comment by john_maxwell_iv on Thoughts on Human Models · 2019-02-22T22:27:33.421Z · score: 3 (2 votes) · LW · GW

Human modelling is very close to human manipulation in design space. A system with accurate models of humans is close to a system which successfully uses those models to manipulate humans.

Trying to communicate why this sounds like magical thinking to me... Taylor is a data scientist for the local police department. Taylor notices that detectives are wasting a lot of time working on crimes which never get solved. They want to train a logistic regression on the crime database in order to predict whether a given crime will ever get solved, so detectives can focus their efforts on crimes that are solvable. Would you advise Taylor against this project, on the grounds that the system will be "too close in design space" to one which attempts to commit the perfect crime?

Although they do not rely on human modelling, some of these approaches nevertheless make most sense in a context where human modelling is happening: for example, impact measures seem to make most sense for agents that will be operating directly in the real world, and such agents are likely to require human modelling.

Let's put AI systems into two categories: those that operate in the real world and those that don't. The odds of x-risk from the second kind of system seem low. I'm not sure what kind of safety work is helpful, aside from making sure it truly does not operate in the real world. But if a system does operate in the real world, it's probably going to learn about humans and acquire knowledge about our preferences. Which means you have to solve the problems that implies.

My steelman of this section is: Find a way to create a narrow AI that puts the world on a good trajectory.

Comment by john_maxwell_iv on Thoughts on Human Models · 2019-02-22T21:33:30.916Z · score: 14 (4 votes) · LW · GW

Re: independent audits, although they're not possible for this particular problem, there are many close variants of this problem such that independent audits are possible. Let's think of human approval as a distorted view of our actual preferences, and our goal is to avoid things which are really bad according to our undistorted actual preferences. If we pass distorted human approval to our AI system, and the AI system avoids things which are really bad according to undistorted human approval, that suggests the system is robust to distortion.

For example:

  • Input your preferences extremely quickly, then see if the result is acceptable when you're given more time to think about it.
  • Input your preferences while drunk, then see if the result is acceptable to your sober self.
  • Tell your friend they can only communicate using gestures. Have a 5-minute "conversation" with them, then go off and input their preferences as you understand them. See if they find the result acceptable.
  • Distort the inputs in code. This lets you test out a very wide range of distortion models and see which produce acceptable performance.
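The last bullet is the easiest to automate. A toy sketch of what I have in mind, with an invented distortion model (random label flips) and a deliberately simple learner: train on distorted labels, then grade against the *undistorted* ones.

```python
# Sketch: treat "approval" as a noisy channel over true preferences, and
# measure how well a learner trained on distorted labels recovers the clean
# ones. The features, labels, and distortion model are toy assumptions.
import numpy as np

rng = np.random.default_rng(1)
n, d = 2000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y_true = (X @ w_true > 0).astype(int)   # undistorted "actual preferences"

def distort(y, flip_prob, rng):
    """Flip each label with some probability -- one simple distortion model."""
    flips = rng.random(len(y)) < flip_prob
    return np.where(flips, 1 - y, y)

def fit_perceptron(X, y, epochs=20):
    """Tiny linear classifier so the sketch has no dependencies."""
    w = np.zeros(X.shape[1])
    for xi, yi in zip(X, y):
        pass  # (warm-up no-op; training happens below)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if xi @ w > 0 else 0
            w += (yi - pred) * xi
    return w

for flip_prob in [0.0, 0.1, 0.3]:
    y_noisy = distort(y_true, flip_prob, rng)
    w = fit_perceptron(X, y_noisy)
    acc = np.mean((X @ w > 0).astype(int) == y_true)  # judged on CLEAN labels
    print(f"flip_prob={flip_prob:.1f}  accuracy vs. true preferences={acc:.2f}")
```

Sweeping `flip_prob` (or swapping in more realistic distortion models) gives a crude robustness curve: how much distortion can the system absorb before its outputs stop being acceptable by our actual preferences?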

It would be helpful if people could outline some plausible-seeming scenarios for how divergence between approval and actual preferences could cause a catastrophe, in order to get a better sense for the appropriate noise model.

Comment by john_maxwell_iv on Thoughts on Human Models · 2019-02-22T10:57:32.802Z · score: 5 (3 votes) · LW · GW

I've heard it claimed that better calibration is not the way to solve AI safety, but it seems like a promising solution to the transit design problem. Suppose we have a brilliant Bayesian machine learning system. Given a labeled dataset of transit system designs we approve/disapprove of, our system estimates the probability that any given model is the "correct" model which separates good designs from bad designs. Now consider two models chosen for the sake of argument: a "human approval" model and an "actual preferences" model. The probability of the "human approval" model will be rated very high. But I'd argue that the probability of the "actual preferences" model will also be rated rather high, because the labeled dataset we provide will be broadly compatible with our actual preferences. As long as the system assigns a reasonably high prior probability to our actual preferences, and the likelihood of the labels given our actual preferences is reasonably high, we should be OK.

Then instead of aiming for a design which is easy to compose, we aim for a design whose probability of being good is maximal when the model gets summed out. This means we're maximizing an objective which includes a wide variety of models which are broadly compatible with the labeled data... including, in particular, our "actual preferences".

In other words, find many reasonable ways of extrapolating the labeled data, and select a transit system which is OK according to all of them. (Or even select a transit system which is OK according to half of them, then use the other half as a test set. Note that it's not necessary for our actual preferences to be among the ensemble of models if for any veto that our actual preferences would make, there's some model in the ensemble that also makes that veto.)
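A toy sketch of this "ensemble veto" idea. Everything here is invented for illustration: the "models" are just perturbed linear scoring rules standing in for different reasonable extrapolations of the labeled data, and `ok` stands in for each model's accept/veto verdict on a design.

```python
# Sketch of "many reasonable extrapolations, pick a design all of them accept".
import numpy as np

rng = np.random.default_rng(2)
n_designs, n_features, n_models = 50, 4, 8
designs = rng.normal(size=(n_designs, n_features))

# Each "model" is one way of extrapolating the labeled data: here, a shared
# base scoring rule perturbed per model (an assumption, purely illustrative).
base_w = rng.normal(size=n_features)
model_ws = base_w + 0.3 * rng.normal(size=(n_models, n_features))

scores = designs @ model_ws.T    # shape (n_designs, n_models)
ok = scores > 0                  # each model's accept/veto verdict

# Keep only designs no model vetoes, then maximize mean score among those.
unvetoed = np.where(ok.all(axis=1))[0]
if len(unvetoed) > 0:
    chosen = unvetoed[np.argmax(scores[unvetoed].mean(axis=1))]
```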

I'd argue from a safety point of view, it's more important to have an acceptable transit system than an optimal transit system. Similarly, the goal with our first AGI should be to put the world on an acceptable trajectory, not the optimal trajectory. If the world is on an acceptable trajectory, we can always work to improve things. If the world shifts to an unacceptable trajectory, we may not be able to improve things. So to a first approximation, our first AGI should work to minimize the odds that the world is on an unacceptable trajectory, according to its subjective estimate of what constitutes an unacceptable trajectory.

Comment by john_maxwell_iv on Two Small Experiments on GPT-2 · 2019-02-22T08:36:12.063Z · score: 6 (4 votes) · LW · GW

It seems more likely to speed up its arrival, as would any general AI research that’s not specifically aimed at safety.

Research can be helpful for safety even if it's not done by the AI Safety Community™. I think you have to evaluate individual research advances on their merits. The safety implications of a particular advance aren't necessarily obvious.

To illustrate, imagine a superhuman language model that's prompted with the following: "A brilliant solution to the AI alignment problem is..." And then, as follow-ups: "The biggest risk with the above scheme is... A very different, but still sizeable, risk is..." (I'm actually kind of curious what people think of this, because it doesn't seem to fit the paradigm of most AI Safety™ work I've seen. EDIT: The best objection may involve daemons.)

Comment by john_maxwell_iv on The Case for a Bigger Audience · 2019-02-19T22:49:52.705Z · score: 3 (2 votes) · LW · GW

So this post seemed to get a pretty healthy number of comments. I think others have suggested the idea of discussion prompts or discussion questions to engage commenters at the end of the post; maybe just having this post say "I want more comments" played a similar role. I was implicitly assuming that readership and comment numbers track each other closely, but maybe that's not true. It might be interesting to analyze the LW article database and see how ratios between viewership, comments, and votes have changed over time.

Comment by john_maxwell_iv on Epistemic Tenure · 2019-02-19T09:08:04.702Z · score: 7 (4 votes) · LW · GW

Coincidentally, the accepted technical term is also very relevant to the discussion. One way for Bob to mitigate the problem of others judging him for his bad idea is to write "epistemic status: exploratory" or similar at the top of his post.

Comment by john_maxwell_iv on The Case for a Bigger Audience · 2019-02-15T03:57:40.617Z · score: 11 (3 votes) · LW · GW

Maybe it has something to do with this question you asked? Maybe letting people leave anonymous comments if they're approved by the post author or something like that could help?

Comment by john_maxwell_iv on How much can value learning be disentangled? · 2019-02-12T06:25:41.359Z · score: 2 (1 votes) · LW · GW

What's your answer to the postmodernist?

Comment by john_maxwell_iv on The Case for a Bigger Audience · 2019-02-10T22:03:00.209Z · score: 3 (2 votes) · LW · GW

As it happens, your writing style is pretty enjoyable

Thanks, I'm very flattered!

Comment by john_maxwell_iv on The Case for a Bigger Audience · 2019-02-10T08:44:22.228Z · score: 5 (3 votes) · LW · GW

I think having it be automated will help posts avoid getting forgotten in the sands of time.

Comment by john_maxwell_iv on The Case for a Bigger Audience · 2019-02-10T02:12:22.188Z · score: 12 (3 votes) · LW · GW

This post cites Scott Aaronson, but maybe there were other discussions too.

Comment by john_maxwell_iv on The Case for a Bigger Audience · 2019-02-10T02:01:59.539Z · score: 3 (2 votes) · LW · GW

I would think "related questions" is a feature to put off until you have lots of question data on which to tune your relatedness metric.

Comment by john_maxwell_iv on The Case for a Bigger Audience · 2019-02-10T02:00:12.275Z · score: 17 (9 votes) · LW · GW

Thanks for the reply! I see what you're saying, but here are some considerations on the other side.

Part of what I was trying to point out here is that 179 comments would not be "extraordinary" growth, it would be an "ordinary" return to what used to be the status quo. If you want to talk about startups, Paul Graham says 5-7% a week is a good growth rate during Y Combinator. 5% weekly growth corresponds to 12x annual growth, and I don't get the sense LW has grown 12x in the past year. Maybe 12x/year is more explosive than ideal, but I think there's room for more growth even if it's not explosive. IMO, growth is good partially because it helps you discover product-market fit. You don't want to overfit to your initial users, or, in the case of an online community, over-adapt to the needs of a small initial userbase. And you don't want to be one of those people who never ships. Some entrepreneurs say if you're not embarrassed by your initial product launch, you waited too long.

that metric is obviously very goodhart-able

One could easily goodhart the metric by leaving lots of useless one-line comments, but that's a little beside the point. The question for me is whether additional audience members are useful on the current margin. I think the answer is yes, if they're high-quality. The only promo method I suggested which doesn't filter heavily is the Adwords thing. Honestly I brought it up mostly to point out that we used to do that and it wasn't terrible, so it's a data point about how far it's safe to go.

A second and related reason to be skeptical of focusing on moving comments from 19 to 179 at the current stage (especially if I put on my 'community manager hat'), is a worry about wasting people's time. In general, LessWrong is a website where we don't want many core members of the community to be using it 10 hours per day. Becoming addictive and causing all researchers to be on it all day, could easily be a net negative contribution to the world. While none of your recommendations were about addictiveness, there are related ways of increasing the number of comments such as showing a user's karma score on every page, like LW 1.0 did.

What if we could make AI alignment research addictive? If you can make work feel like play, that's a huge win right?

See also Giving Your All. You could argue that I should either be spending 0% of my time on LW or 100% of my time on LW. I don't think the argument fully works, because time spent on LW is probably a complementary good with time spent reading textbooks and so on, but it doesn't seem totally unreasonable for me to see the number of upvotes I get as a proxy for the amount of progress I'm making.

I want LW to be more addictive on the current margin. I want to feel motivated to read someone's post about AI alignment and write some clever comment on it that will get me karma. But my System 1 doesn't have a sufficient expectation of upvotes & replies for me to experience a lot of intrinsic motivation to do this.

I'd suggest thinking in terms of focus destruction rather than addictiveness. Ideally, I find LW enjoyable to use without it hurting my ability to focus.

I think instead of restricting the audience, a better idea is making discussion dynamics a little less time-driven.

  • If I leave a comment on LW in the morning, and I'm deep in some equations during the afternoon, I don't want my brain nagging me to go check if I need to defend my claims on LW while the discussion is still on the frontpage.

  • Spreading discussions out over time also serves as spaced repetition to reinforce concepts.

  • I think I heard about research which found that brainstorming 5 minutes on 5 different days, instead of 25 minutes on a single day, is a better way to generate divergent creative insights. This makes sense to me because the effect of being anchored on ideas you've already had is lessened.

  • See also the CNN effect.

Re: intro texts, I'd argue having Rohin's value learning sequence go by without much of an audience to read & comment on it was a big missed opportunity. Paul Christiano's ideas seem important, and it could've been really valuable to have lively discussions of those ideas to see if we could make progress on them, or at least share our objections as they were rerun here on LW.

Ultimately, it's the idea that matters, not whether it comes in the form of a blog post, journal article, or comment. You mods have talked about the value of people throwing ideas around even when they're not 100% sure about them. I think comments are a really good format for that. [Say, random idea: what if we had a "you should turn this into a post" button for comments?]

Comment by john_maxwell_iv on Thoughts on Ben Garfinkel's "How sure are we about this AI stuff?" · 2019-02-09T07:27:35.514Z · score: 2 (1 votes) · LW · GW

I made a relevant post in the Meta section.

The Case for a Bigger Audience

2019-02-09T07:22:07.357Z · score: 64 (25 votes)
Comment by john_maxwell_iv on EA grants available (to individuals) · 2019-02-08T04:30:55.557Z · score: 9 (3 votes) · LW · GW

Paul Christiano might still be active in funding stuff. (There are a few more links to funding opportunities in the comments of that post.)

Comment by john_maxwell_iv on X-risks are a tragedies of the commons · 2019-02-07T10:24:01.178Z · score: 7 (7 votes) · LW · GW

True, but from a marketing perspective it's better to emphasize the fact that reducing x-risk is in each individual's self-interest even if no one else is doing it. Also, instead of talking about AI arms races, we should talk about why AI done right means a post-scarcity era whose benefits can be shared by all. There's no real benefit to being the person who triggers the post-scarcity era.

Comment by john_maxwell_iv on Thoughts on Ben Garfinkel's "How sure are we about this AI stuff?" · 2019-02-07T10:02:05.217Z · score: 3 (2 votes) · LW · GW

Good talk. I'd like to hear what he thinks about the accelerating change/singularity angle, as applied to the point about the person living during the industrial revolution who's trying to improve the far future.

Comment by john_maxwell_iv on Thoughts on Ben Garfinkel's "How sure are we about this AI stuff?" · 2019-02-07T09:56:22.289Z · score: 10 (3 votes) · LW · GW

The criticism is expecting counter-criticism. I.e., what I think we're missing is critics who are in it for the long haul, who see their work as the first step of an iterative process, with an expectation that the AI safety field will respond and/or update to their critiques.

As someone who sometimes writes things that are a bit skeptical regarding AI doom, I find the difficulty of getting counter-criticism frustrating.

Comment by john_maxwell_iv on How does Gradient Descent Interact with Goodhart? · 2019-02-02T04:25:59.515Z · score: 17 (6 votes) · LW · GW

I think it depends almost entirely on the shape of V and W.

In order to do gradient descent, you need a function which is continuous and differentiable. So W can't be noise in the traditional regression sense (independent and identically distributed for each individual observation), because that's not going to be differentiable.

If W has lots of narrow, spiky local maxima with broad bases, then gradient descent is likely to find those local maxima, while random sampling rarely hits them. In this case, fake wins are likely to outnumber real wins in the gradient descent group, but not the random sampling group.

More generally, if U = V + W, then dU/dx = dV/dx + dW/dx. If V's gradient is typically bigger than W's gradient, gradient descent will mostly pay attention to V; the reverse is true if W's gradient is typically bigger.

But even if W's gradient typically exceeds V's gradient, U's gradient will still correlate with V's, assuming dV/dx and dW/dx are uncorrelated. (cov(dU, dV) = cov(dV+dW, dV) = cov(dV, dV) + cov(dW, dV) = cov(dV, dV).)

So I'd expect that if you change your experiment so instead of looking at the results in some band, you instead take the best n results from each group, the best n results of the gradient descent group will be better on average. Another intuition pump: Let's consider the spiky W scenario again. If V is constant everywhere, gradient descent will basically find us the nearest local maximum in W, which essentially adds random movement. But if V is a plane with a constant slope, and the random initialization is near two different local maxima in W, gradient descent will be biased towards the local maximum in W which is higher up on the plane of V. The very best points will tend to be those that are both on top of a spike in W and high up on the plane of V.

I think this is a more general point which applies regardless of the optimization algorithm you're using: If your proxy consists of something you're trying to maximize plus unrelated noise that's roughly constant in magnitude, you're still best off maximizing the heck out of that proxy, because the very highest value of the proxy will tend to be a point where the noise is high and the thing you're trying to maximize is also high.
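A quick numerical check of this intuition pump, with invented functional forms: V is a plane, W is a spiky but bounded term, and we compare the true value V among the top proxy scorers found by gradient ascent on U = V + W versus random sampling.

```python
# Toy check: optimize the proxy U = V + W (smooth signal plus spiky "noise")
# and compare the true value V of the best points found by gradient ascent
# vs. random sampling. The functional forms are assumptions for illustration.
import numpy as np

def V(x):                # the thing we actually care about: a constant slope
    return 0.5 * x

def W(x):                # unrelated spiky term, bounded and decaying
    return np.sin(8 * x) * np.exp(-0.1 * x**2)

def dU(x, h=1e-5):       # numerical gradient of the proxy U = V + W
    return ((V(x + h) + W(x + h)) - (V(x - h) + W(x - h))) / (2 * h)

rng = np.random.default_rng(3)
starts = rng.uniform(-5, 5, 200)

# Gradient ascent on the proxy from many random starts.
x = starts.copy()
for _ in range(200):
    x += 0.05 * dU(x)

# Compare true value V among the top-10 proxy scorers in each group.
def top_v(points, n=10):
    u = V(points) + W(points)
    return V(points[np.argsort(-u)[:n]]).mean()

random_points = rng.uniform(-5, 5, 200)
print("top-10 V (gradient ascent):", top_v(x))
print("top-10 V (random sampling):", top_v(random_points))
```

Points near the origin get trapped on spikes of W, but points in flatter regions drift up V's slope, so the best gradient-ascent points end up higher on V than the best random samples.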

"Constant unrelated noise" is an important assumption. For example, if you're dealing with a machine learning model, noise is likely to be higher for inputs off of the training distribution, so the top n points might be points far off the training distribution chosen mainly on the basis of noise. (Goodhart's Law arguably reduces to the problem of distributional shift.) I guess then the question is what the analogous region of input space is for approval. Where does the correspondence between approval and human value tend to break down?

(Note: Although W can't be i.i.d., W's gradient could be faked so it is. I think this corresponds to perturbed gradient descent, which apparently helps performance on V too.)

Comment by john_maxwell_iv on How much can value learning be disentangled? · 2019-01-31T10:13:30.696Z · score: 2 (1 votes) · LW · GW

Your original argument, as I understood it, was something like: Explanation aims for a particular set of mental states in the student, which is also what manipulation does, so therefore explanation can't be defined in a way that distinguishes it from manipulation. I pushed back on that. Now you're saying that explanation tends to produce side effects in the listener's values. Does this mean you're allowing the possibility that explanation can be usefully defined in a way that distinguishes it from manipulation?

BTW, computer security researchers distinguish between "reject by default" (whitelisting) and "accept by default" (blacklisting). "Reject by default" is typically more secure. I'm more optimistic about trying to specify what it means to explain something (whitelisting) than what it means to manipulate someone in a way that's improper (blacklisting). So maybe we're shooting at different targets.

Tying all of this back to FAI... you say you find the value changes that come with greater understanding to be (generally) positive and you'd like them to be more common. I'm worried about the possibility that AGI will be a global catastrophic risk. I think there are good arguments that by default, AGI will be something which is not positive. Maybe from a triage point of view, it makes sense to focus on minimizing the probability that AGI is a global catastrophic risk, and worry about the prevention of things that we think are likely to be positive once we're pretty sure the global catastrophic risk aspect of things has been solved?

In Eliezer's CEV paper, he writes:

In poetic terms, our coherent extrapolated volition is our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together; where the extrapolation converges rather than diverges, where our wishes cohere rather than interfere; extrapolated as we wish that extrapolated, interpreted as we wish that interpreted.

I haven't seen anyone on Less Wrong argue against CEV as a vision for how the future of humanity should be determined. And CEV seems to involve having the future be controlled by humans who are more knowledgable than current humans in some sense. But maybe you're a CEV skeptic?

Comment by john_maxwell_iv on How much can value learning be disentangled? · 2019-01-31T07:09:14.401Z · score: 2 (1 votes) · LW · GW

Hm, I understood the traditional Less Wrong view to be something along the lines of: there is truth about the world, and that truth is independent of your values. Wanting something to be true won't make it so. Whereas I'd expect a postmodernist to say something like: the Christians have their truth, the Buddhists have their truth, and the Atheists have theirs. Whose truth is the "real" truth comes down to the preferences of the individual. Your statement sounds more in line with the postmodernist view than the Less Wrong one.

This matters because if the Less Wrong view of the world is correct, it's more likely that there are clean mathematical algorithms for thinking about and sharing truth that are value-neutral (or at least value-orthogonal, e.g. "aim to share facts that the student will think are maximally interesting or surprising". Note that this doesn't necessarily need to be implemented in a way that a "fact" which triggers an epileptic fit and causes the student to hit the "maximally interesting" button will be selected for sharing. If I have a rough model of the user's current beliefs and preferences, I could use that to estimate the VoI of various bits of information to the user and use that as my selection criterion. Point being that our objective doesn't need to be defined in terms of "aiming for a particular set of mental states".)
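A minimal sketch of the value-orthogonal selection rule from the parenthetical: given a rough model of the student's credences, share whichever proposition's truth value carries the most expected information, rather than targeting any particular mental state. The belief model and "facts" are invented.

```python
# Pick the fact whose truth value would be most informative to the student,
# measured by the entropy of their current credence in it.
import math

def entropy(p):
    """Entropy (bits) of a Bernoulli belief with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# The student's current credence in several (hypothetical) propositions.
beliefs = {"fact_a": 0.5, "fact_b": 0.95, "fact_c": 0.7}

# Expected information gained by learning each proposition outright.
gains = {fact: entropy(p) for fact, p in beliefs.items()}
best_fact = max(gains, key=gains.get)   # the maximally uncertain proposition
```

Note the selection criterion never mentions the student's resulting mental state, only their current uncertainty, which is what makes it whitelisting-flavored rather than "aim for a particular belief."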

Why don't people use formal methods?

2019-01-22T09:39:46.721Z · score: 21 (8 votes)

General and Surprising

2017-09-15T06:33:19.797Z · score: 3 (3 votes)

Heuristics for textbook selection

2017-09-06T04:17:01.783Z · score: 8 (8 votes)

Revitalizing Less Wrong seems like a lost purpose, but here are some other ideas

2016-06-12T07:38:58.557Z · score: 22 (28 votes)

Zooming your mind in and out

2015-07-06T12:30:58.509Z · score: 8 (9 votes)

Purchasing research effectively open thread

2015-01-21T12:24:22.951Z · score: 12 (13 votes)

Productivity thoughts from Matt Fallshaw

2014-08-21T05:05:11.156Z · score: 13 (14 votes)

Managing one's memory effectively

2014-06-06T17:39:10.077Z · score: 14 (15 votes)

OpenWorm and differential technological development

2014-05-19T04:47:00.042Z · score: 6 (7 votes)

System Administrator Appreciation Day - Thanks Trike!

2013-07-26T17:57:52.410Z · score: 70 (71 votes)

Existential risks open thread

2013-03-31T00:52:46.589Z · score: 10 (11 votes)

Why AI may not foom

2013-03-24T08:11:55.006Z · score: 23 (35 votes)

[Links] Brain mapping/emulation news

2013-02-21T08:17:27.931Z · score: 2 (7 votes)

Akrasia survey data analysis

2012-12-08T03:53:35.658Z · score: 13 (14 votes)

Akrasia hack survey

2012-11-30T01:09:46.757Z · score: 11 (14 votes)

Thoughts on designing policies for oneself

2012-11-28T01:27:36.337Z · score: 80 (80 votes)

Room for more funding at the Future of Humanity Institute

2012-11-16T20:45:18.580Z · score: 18 (21 votes)

Empirical claims, preference claims, and attitude claims

2012-11-15T19:41:02.955Z · score: 5 (28 votes)

Economy gossip open thread

2012-10-28T04:10:03.596Z · score: 23 (30 votes)

Passive income for dummies

2012-10-27T07:25:33.383Z · score: 17 (22 votes)

Morale management for entrepreneurs

2012-09-30T05:35:05.221Z · score: 9 (14 votes)

Could evolution have selected for moral realism?

2012-09-27T04:25:52.580Z · score: 4 (14 votes)

Personal information management

2012-09-11T11:40:53.747Z · score: 18 (19 votes)

Proposed rewrites of LW home page, about page, and FAQ

2012-08-17T22:41:57.843Z · score: 18 (19 votes)

[Link] Holistic learning ebook

2012-08-03T00:29:54.003Z · score: 10 (17 votes)

Brainstorming additional AI risk reduction ideas

2012-06-14T07:55:41.377Z · score: 12 (15 votes)

Marketplace Transactions Open Thread

2012-06-02T04:31:32.387Z · score: 29 (30 votes)

Expertise and advice

2012-05-27T01:49:25.444Z · score: 17 (22 votes)

PSA: Learn to code

2012-05-25T18:50:01.407Z · score: 34 (39 votes)

Knowledge value = knowledge quality × domain importance

2012-04-16T08:40:57.158Z · score: 8 (13 votes)

Rationality anecdotes for the homepage?

2012-04-04T06:33:32.097Z · score: 3 (8 votes)

Simple but important ideas

2012-03-21T06:59:22.043Z · score: 18 (23 votes)

6 Tips for Productive Arguments

2012-03-18T21:02:32.326Z · score: 30 (45 votes)

Cult impressions of Less Wrong/Singularity Institute

2012-03-15T00:41:34.811Z · score: 34 (59 votes)

[Link, 2011] Team may be chosen to receive $1.4 billion to simulate human brain

2012-03-09T21:13:42.482Z · score: 8 (15 votes)

Productivity tips for those low on motivation

2012-03-06T02:41:20.861Z · score: 7 (12 votes)

The Singularity Institute has started publishing monthly progress reports

2012-03-05T08:19:31.160Z · score: 21 (24 votes)

Less Wrong mentoring thread

2011-12-29T00:10:58.774Z · score: 31 (34 votes)

Heuristics for Deciding What to Work On

2011-06-01T07:31:17.482Z · score: 20 (23 votes)

Upcoming meet-ups: Auckland, Bangalore, Houston, Toronto, Minneapolis, Ottawa, DC, North Carolina, BC...

2011-05-21T05:06:08.824Z · score: 5 (8 votes)

Being Rational and Being Productive: Similar Core Skills?

2010-12-28T10:11:01.210Z · score: 18 (31 votes)

Applying Behavioral Psychology on Myself

2010-06-20T06:25:13.679Z · score: 53 (60 votes)

The Math of When to Self-Improve

2010-05-15T20:35:37.449Z · score: 6 (16 votes)

Accuracy Versus Winning

2009-04-02T04:47:37.156Z · score: 12 (21 votes)

So you say you're an altruist...

2009-03-12T22:15:59.935Z · score: 11 (35 votes)