Why GPT wants to mesa-optimize & how we might change this 2020-09-19T13:48:30.348Z · score: 53 (17 votes)
John_Maxwell's Shortform 2020-09-11T20:55:20.409Z · score: 9 (1 votes)
Are HEPA filters likely to pull COVID-19 out of the air? 2020-03-25T01:07:18.833Z · score: 14 (4 votes)
Comprehensive COVID-19 Disinfection Protocol for Packages and Envelopes 2020-03-15T10:00:33.170Z · score: 28 (14 votes)
Why don't singularitarians bet on the creation of AGI by buying stocks? 2020-03-11T16:27:20.600Z · score: 33 (16 votes)
When are immunostimulants/immunosuppressants likely to be helpful for COVID-19? 2020-03-05T21:44:08.288Z · score: 13 (3 votes)
The Goodhart Game 2019-11-18T23:22:13.091Z · score: 12 (8 votes)
Self-Fulfilling Prophecies Aren't Always About Self-Awareness 2019-11-18T23:11:09.410Z · score: 15 (7 votes)
What AI safety problems need solving for safe AI research assistants? 2019-11-05T02:09:17.686Z · score: 15 (4 votes)
The problem/solution matrix: Calculating the probability of AI safety "on the back of an envelope" 2019-10-20T08:03:23.934Z · score: 24 (8 votes)
The Dualist Predict-O-Matic ($100 prize) 2019-10-17T06:45:46.085Z · score: 17 (6 votes)
Replace judges with Keynesian beauty contests? 2019-10-07T04:00:37.906Z · score: 31 (10 votes)
Three Stories for How AGI Comes Before FAI 2019-09-17T23:26:44.150Z · score: 28 (9 votes)
How to Make Billions of Dollars Reducing Loneliness 2019-08-30T17:30:50.006Z · score: 60 (27 votes)
Response to Glen Weyl on Technocracy and the Rationalist Community 2019-08-22T23:14:58.690Z · score: 60 (28 votes)
Proposed algorithm to fight anchoring bias 2019-08-03T04:07:41.484Z · score: 10 (2 votes)
Raleigh SSC/LW/EA Meetup - Meet MealSquares People 2019-05-08T00:01:36.639Z · score: 12 (3 votes)
The Case for a Bigger Audience 2019-02-09T07:22:07.357Z · score: 69 (27 votes)
Why don't people use formal methods? 2019-01-22T09:39:46.721Z · score: 21 (8 votes)
General and Surprising 2017-09-15T06:33:19.797Z · score: 3 (3 votes)
Heuristics for textbook selection 2017-09-06T04:17:01.783Z · score: 8 (8 votes)
Revitalizing Less Wrong seems like a lost purpose, but here are some other ideas 2016-06-12T07:38:58.557Z · score: 24 (29 votes)
Zooming your mind in and out 2015-07-06T12:30:58.509Z · score: 8 (9 votes)
Purchasing research effectively open thread 2015-01-21T12:24:22.951Z · score: 12 (13 votes)
Productivity thoughts from Matt Fallshaw 2014-08-21T05:05:11.156Z · score: 13 (14 votes)
Managing one's memory effectively 2014-06-06T17:39:10.077Z · score: 14 (15 votes)
OpenWorm and differential technological development 2014-05-19T04:47:00.042Z · score: 9 (8 votes)
System Administrator Appreciation Day - Thanks Trike! 2013-07-26T17:57:52.410Z · score: 70 (71 votes)
Existential risks open thread 2013-03-31T00:52:46.589Z · score: 10 (11 votes)
Why AI may not foom 2013-03-24T08:11:55.006Z · score: 23 (35 votes)
[Links] Brain mapping/emulation news 2013-02-21T08:17:27.931Z · score: 2 (7 votes)
Akrasia survey data analysis 2012-12-08T03:53:35.658Z · score: 13 (14 votes)
Akrasia hack survey 2012-11-30T01:09:46.757Z · score: 11 (14 votes)
Thoughts on designing policies for oneself 2012-11-28T01:27:36.337Z · score: 80 (80 votes)
Room for more funding at the Future of Humanity Institute 2012-11-16T20:45:18.580Z · score: 18 (21 votes)
Empirical claims, preference claims, and attitude claims 2012-11-15T19:41:02.955Z · score: 5 (28 votes)
Economy gossip open thread 2012-10-28T04:10:03.596Z · score: 26 (31 votes)
Passive income for dummies 2012-10-27T07:25:33.383Z · score: 17 (22 votes)
Morale management for entrepreneurs 2012-09-30T05:35:05.221Z · score: 9 (14 votes)
Could evolution have selected for moral realism? 2012-09-27T04:25:52.580Z · score: 4 (14 votes)
Personal information management 2012-09-11T11:40:53.747Z · score: 18 (19 votes)
Proposed rewrites of LW home page, about page, and FAQ 2012-08-17T22:41:57.843Z · score: 18 (19 votes)
[Link] Holistic learning ebook 2012-08-03T00:29:54.003Z · score: 10 (17 votes)
Brainstorming additional AI risk reduction ideas 2012-06-14T07:55:41.377Z · score: 12 (15 votes)
Marketplace Transactions Open Thread 2012-06-02T04:31:32.387Z · score: 29 (30 votes)
Expertise and advice 2012-05-27T01:49:25.444Z · score: 17 (22 votes)
PSA: Learn to code 2012-05-25T18:50:01.407Z · score: 34 (40 votes)
Knowledge value = knowledge quality × domain importance 2012-04-16T08:40:57.158Z · score: 10 (14 votes)
Rationality anecdotes for the homepage? 2012-04-04T06:33:32.097Z · score: 3 (8 votes)
Simple but important ideas 2012-03-21T06:59:22.043Z · score: 20 (25 votes)


Comment by john_maxwell on Why GPT wants to mesa-optimize & how we might change this · 2020-09-27T05:45:54.540Z · score: 2 (1 votes) · LW · GW

Your philosophical point is interesting; I have a post in the queue about that. However, I don't think it really proves what you want it to.

Having John_Maxwell in the byline makes it far more likely that I'm the author of the post.

If humans can make useful judgements re: whether this is something I wrote, vs something nostalgebraist wrote to make a point about bylines, I don't see why a language model can't do the same, in principle.

GPT is trying to be optimal at next-step prediction, and an optimal next-step predictor should not get improved by lookahead, it should already have those facts priced in to its next-step prediction.

A perfectly optimal next-step predictor would not be improved by lookahead or anything else, it's perfectly optimal. I'm talking about computational structures which might be incentivized during training when the predictor is suboptimal. (It's still going to be suboptimal after training with current technology, of course.)

In orthonormal's post they wrote:

...GPT-3's ability to write fiction is impressive: unlike GPT-2, it doesn't lose track of the plot, it has sensible things happen, it just can't plan its way to a satisfying resolution.

I'd be somewhat surprised if GPT-4 shared that last problem.

I suspect that either GPT-4 will still be unable to plan its way to a satisfying resolution, or GPT-4 will develop some kind of internal lookahead (probably not beam search, but beam search could be a useful model for understanding it) which is sufficiently general to be re-used across many different writing tasks. (Generality takes fewer parameters.) I don't know what the relative likelihoods of those possibilities are. But the whole idea of AI safety is to ask what happens if we succeed.
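Beam search, the model I keep reaching for when thinking about lookahead, can be sketched in a few lines. This is a toy illustration, not a claim about GPT's internals: the `next_token_probs` function and the tiny three-entry "language model" below are hypothetical stand-ins.

```python
import math

def beam_search(next_token_probs, prefix, beam_width=3, depth=2):
    """Toy beam search: keep the `beam_width` highest-scoring
    continuations of `prefix`, looking `depth` tokens ahead."""
    beams = [(prefix, 0.0)]  # (token sequence, total log-probability)
    for _ in range(depth):
        candidates = []
        for seq, score in beams:
            for token, p in next_token_probs(seq).items():
                candidates.append((seq + [token], score + math.log(p)))
        # Prune to the `beam_width` best candidates by total log-prob.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

# Hypothetical distributions: "cat" wins greedily after "the", but two
# steps of lookahead reveal that "the dog barks" scores higher overall.
probs = {
    ("the",): {"cat": 0.6, "dog": 0.4},
    ("the", "cat"): {"sat": 0.5, "ran": 0.5},
    ("the", "dog"): {"barks": 0.9, "runs": 0.1},
}
best = beam_search(lambda s: probs[tuple(s)], ["the"], beam_width=2, depth=2)
# best[0][0] == ["the", "dog", "barks"]
```

The point of the toy numbers: a purely greedy predictor commits to "cat" and never notices the higher-probability "dog barks" continuation, which is exactly the sense in which lookahead can improve a suboptimal next-step predictor.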

Comment by john_maxwell on Why GPT wants to mesa-optimize & how we might change this · 2020-09-26T04:03:22.727Z · score: 2 (1 votes) · LW · GW

So a predictor which seems (and is) frighteningly powerful at some short range L will do little better than random guessing if you chain its predictions up to some small multiple of L.

A system which develops small-L lookahead (for L > 1) may find large-L lookahead to be nearby in programspace. If so, incentivizing the development of small-L lookahead makes it more likely that the system will try large-L lookahead and find it to be useful as well (in predicting chess moves for instance).

My intuition is that small-L lookahead could be close to large-L lookahead in programspace for something like an RNN, but not for GPT-3's transformer architecture.

Anyway, the question here isn't whether lookahead will be perfectly accurate, but whether the post-lookahead distribution of next words will allow for improvement over the pre-lookahead distribution. Lookahead is almost certainly going to do better than random guessing, even topic models can do that.

By construction, language modeling gives you nothing to work with except the text itself, so you don't know who produced it or for whom.

Are you saying that GPT-3's training corpus was preprocessed to remove information about the author, title, and publication venue? Or are you only talking about what happens when this info is outside the context window?

Comment by john_maxwell on Draft report on AI timelines · 2020-09-24T05:27:28.684Z · score: 2 (1 votes) · LW · GW

Worth noting that the "evidence from the nascent AI industry" link has bits of evidence pointing in both directions. For example:

Training a single AI model can cost hundreds of thousands of dollars (or more) in compute resources. While it’s tempting to treat this as a one-time cost, retraining is increasingly recognized as an ongoing cost, since the data that feeds AI models tends to change over time (a phenomenon known as “data drift”).

Doesn't this kind of cost make AI services harder to commodify? And also:

We’ve seen a massive difference in COGS between startups that train a unique model per customer versus those that are able to share a single model (or set of models) among all customers....

That sounds rather monopoly-ish doesn't it? Although the blogger's takeaway is

Machine learning startups generally have no moat or meaningful special sauce

I'll be somewhat surprised if language modeling gets commodified down to 0 profits even if Google and Facebook release competing models. I'd expect it to look more like the cloud infrastructure industry, "designed to extract maximum blood" as the author of your blog post puts it.

Comment by john_maxwell on Why GPT wants to mesa-optimize & how we might change this · 2020-09-23T08:05:00.555Z · score: 4 (2 votes) · LW · GW
  1. Stopping mesa-optimizing completely seems mad hard.

As I mentioned in the post, I don't think this is a binary, and stopping mesa-optimization "incompletely" seems pretty useful. I also have a lot of ideas about how to stop it, so it doesn't seem mad hard to me.

  1. Managing "incentives" is the best way to deal with this stuff, and will probably scale to something like 1,000,000x human intelligence.

I'm less optimistic about this approach.

  1. There is a stochastic aspect to training ML models, so it's not enough to say "the incentives favor Mesa-Optimizing for X over Mesa-Optimizing for Y". If Mesa-Optimizing for Y is nearby in model-space, we're liable to stumble across it.

  2. Even if your mesa-optimizer is aligned, if it doesn't have a way to stop mesa-optimization, there's the possibility that your mesa-optimizer would develop another mesa-optimizer inside itself which isn't necessarily aligned.

  3. I'm picturing value learning via (un)supervised learning, and I don't see an easy way to control the incentives of any mesa-optimizer that develops in the context of (un)supervised learning. (Curious to hear about your ideas though.)

My intuition is that the distance between Mesa-Optimizing for X and Mesa-Optimizing for Y is likely to be smaller than the distance between an Incompetent Mesa-Optimizer and a Competent Mesa-Optimizer. If you're shooting for a Competent Human Values Mesa-Optimizer, it would be easy to stumble across a Competent Not Quite Human Values Mesa-Optimizer along the way. All it would take would be having the "Competent" part in place before the "Human Values" part. And running a Competent Not Quite Human Values Mesa-Optimizer during training is likely to be dangerous.

On the other hand, if we have methods for detecting mesa-optimization or starving it of compute that work reasonably well, we're liable to stumble across an Incompetent Mesa-Optimizer and run it a few times, but it's less likely that we'll hit the smaller target of a Competent Mesa-Optimizer.

Comment by john_maxwell on Why GPT wants to mesa-optimize & how we might change this · 2020-09-22T11:38:43.856Z · score: 2 (1 votes) · LW · GW

My thought was that if lookahead improves performance during some period of the training, it's liable to develop mesa-optimization during that period, and then find it to be useful for other things later on.

Comment by john_maxwell on Why GPT wants to mesa-optimize & how we might change this · 2020-09-22T11:35:09.051Z · score: 4 (2 votes) · LW · GW

Now it's true that efficiently estimating that conditional using a single forward pass of a transformer might involve approximations to beam search sometimes.

Yeah, that's the possibility the post explores.

At a high level, I don't think we really need to be concerned with this form of "internal lookahead" unless/until it starts to incorporate mechanisms outside of the intended software environment (e.g. the hardware, humans, the external (non-virtual) world).

Is there an easy way to detect if it's started doing that / tell it to restrict its lookahead to particular domains? If not, it may be easier to just prevent it from mesa-optimizing in the first place. (The post has arguments for why that's (a) possible and (b) wouldn't necessarily involve a big performance penalty.)

Comment by john_maxwell on Developmental Stages of GPTs · 2020-09-20T02:01:48.805Z · score: 2 (1 votes) · LW · GW

BTW with regard to "studying mesa-optimization in the context of such systems", I just published this post: Why GPT wants to mesa-optimize & how we might change this.

I'm still thinking about the point you made in the other subthread about MAML. It seems very plausible to me that GPT is doing MAML type stuff. I'm still thinking about if/how that could result in dangerous mesa-optimization.

Comment by john_maxwell on Why GPT wants to mesa-optimize & how we might change this · 2020-09-20T00:40:46.911Z · score: 8 (4 votes) · LW · GW

Well I suppose mesa-optimization isn't really a binary is it? Like, maybe there's a trivial sense in which self-attention "mesa-optimizes" over its input when figuring out what to pay attention to.

But ultimately, what matters isn't the definition of the term "mesa-optimization"; it's the risk of spontaneous internal planning/optimization that generalizes in unexpected ways or operates in unexpected domains. At least in my mind. So the question is whether this "considering multiple possibilities" about text could also improve the model's ability to consider multiple possibilities in other domains. Which depends on whether the implementation of "considering multiple possibilities" looks more like beam search versus very domain-adapted heuristics.

Comment by john_maxwell on Why GPT wants to mesa-optimize & how we might change this · 2020-09-19T22:09:59.409Z · score: 6 (3 votes) · LW · GW

This post distinguishes between mesa-optimization and learned heuristics. What you're describing sounds like learned heuristics. ("Learning which words are easy to rhyme" was an example I gave in the post.) Learned heuristics aren't nearly as worrisome as mesa-optimization because they're harder to modify and misuse to do planning in unexpected domains. When I say "lookahead" in the post I'm pretty much always referring to the mesa-optimization sort.

Comment by john_maxwell on Developmental Stages of GPTs · 2020-09-19T01:18:52.334Z · score: 2 (1 votes) · LW · GW


Comment by john_maxwell on Developmental Stages of GPTs · 2020-09-17T23:40:42.360Z · score: 2 (1 votes) · LW · GW

The outer optimizer is the more obvious thing: it's straightforward to say there's a big difference in dealing with a superhuman Oracle AI with only the goal of answering each question accurately, versus one whose goals are only slightly different from that in some way.

GPT generates text by repeatedly picking whatever word seems highest probability given all the words that came before. So if its notion of "highest probability" is almost, but not quite, answering every question accurately, I would expect a system which usually answers questions accurately but sometimes answers them inaccurately. That doesn't sound very scary.
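The generation procedure I'm describing is just greedy decoding (real deployments often sample instead, but greedy is the version relevant to my argument). A minimal sketch, where `next_word_probs` is a hypothetical stand-in for the trained model:

```python
def generate_greedy(next_word_probs, prompt, n_words=1):
    """Greedy decoding: at each step, append whichever word the model
    assigns highest probability given everything generated so far."""
    words = list(prompt)
    for _ in range(n_words):
        dist = next_word_probs(words)          # maps word -> probability
        words.append(max(dist, key=dist.get))  # pick the argmax word
    return words

# Hypothetical distribution for one question-answering step.
probs = {
    ("Q:", "capital", "of", "France?", "A:"): {"Paris": 0.95, "Lyon": 0.05},
}
out = generate_greedy(lambda w: probs[tuple(w)],
                      ["Q:", "capital", "of", "France?", "A:"])
# out[-1] == "Paris"
```

If the model's notion of "highest probability" were slightly miscalibrated, the argmax would occasionally land on the wrong word, which is precisely the "usually accurate, sometimes inaccurate" behavior described above.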

Comment by john_maxwell on Developmental Stages of GPTs · 2020-09-17T22:54:07.344Z · score: 2 (1 votes) · LW · GW

esp. since GPT-3's 0-shot learning looks like mesa-optimization

Could you provide more details on this?

Sometimes people will give GPT-3 a prompt with some examples of inputs along with the sorts of responses they'd like to see from GPT-3 in response to those inputs ("few-shot learning", right? I don't know what 0-shot learning you're referring to.) Is your claim that GPT-3 succeeds at this sort of task by doing something akin to training a model internally?

If that's what you're saying... That seems unlikely to me. GPT-3 is essentially a stack of 96 transformer blocks, right? So if it was doing something like gradient descent internally, how many consecutive iterations would it be capable of doing? It seems more likely to me that GPT-3 is simply able to learn sufficiently rich internal representations such that when the input/output examples are within its context window, it picks up their input/output structure and forms a sufficiently sophisticated conception of that structure that the word scoring highest according to next-word prediction is one that comports with the structure.

96 transformer blocks would appear to offer a very limited budget for any kind of serial computation, but there's a lot of parallel computation going on there, and there are non-gradient-descent optimization algorithms (genetic algorithms, say) that can be parallelized. I guess the query matrix could be used to implement some kind of fitness function? It would be interesting to try some kind of layer-wise pretraining on transformer blocks, training them to compute steps in a parallelizable optimization algorithm (probably you'd want a deterministic parallelizable algorithm rather than a stochastic one like genetic algorithms). Then you could look at the resulting network and, based on it, try to figure out what the telltale signs of a mesa-optimizer are (since this network is almost certainly implementing a mesa-optimizer).

Still, my impression is you need 1000+ generations to get interesting results with genetic algorithms, which seems like a lot of serial computation relative to GPT-3's budget...

Comment by john_maxwell on John_Maxwell's Shortform · 2020-09-17T22:27:23.167Z · score: 2 (1 votes) · LW · GW

/r/tressless is about 6 times as big FYI.

The way I'm currently thinking about it is that reddit was originally designed as a social news website, and you have to tack on a bunch of extras if you want your subreddit to do knowledge-accumulation, whereas phpBB gets you that with much less effort. (Could be as simple as having a culture of "There's already a thread for that here, you should add your post to it.")

Comment by john_maxwell on John_Maxwell's Shortform · 2020-09-16T05:52:27.210Z · score: 4 (2 votes) · LW · GW

Another point is that if LW and a hypothetical phpBB forum have different "cognitive styles", it could be valuable to keep both around for the sake of cognitive diversity.

Comment by john_maxwell on John_Maxwell's Shortform · 2020-09-11T20:55:24.404Z · score: 20 (5 votes) · LW · GW

Progress Studies: Hair Loss Forums

I still have about 95% of my hair. But I figure it's best to be proactive. So over the past few days I've been reading a lot about how to prevent hair loss.

My goal here is to get a broad overview (i.e. I don't want to put in the time necessary to understand what a 5-alpha-reductase inhibitor actually is, beyond just "an antiandrogenic drug that helps with hair loss"). I want to identify safe, inexpensive treatments that have both research and anecdotal support.

In the hair loss world, the "Big 3" refers to 3 well-known treatments for hair loss: finasteride, minoxidil, and ketoconazole. These treatments all have problems. Some finasteride users report permanent loss of sexual function. If you go off minoxidil, you lose all the hair you gained, and some say it wrinkles their skin. Ketoconazole doesn't work very well.

To research treatments beyond the Big 3, I've been using various tools, including both Google Scholar and a "custom search engine" I created for digging up anecdotes from forums. Basically, take whatever query I'm interested in ("pumpkin seed oil" for instance), add a long chain of site: filters for hair loss forums joined by ORs, and then search on Google.
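Constructing that sort of query is mechanical enough to script. A sketch (the forum domains below are placeholders, not the actual list I use):

```python
def forum_query(query, forum_sites):
    """Build a Google query restricted to a list of forum domains,
    e.g. '"pumpkin seed oil" (site:a.com OR site:b.com)'."""
    site_filter = " OR ".join(f"site:{s}" for s in forum_sites)
    return f'"{query}" ({site_filter})'

# Placeholder domains, standing in for the real hair loss forums.
sites = ["example-hairforum.com", "example-baldtruthtalk.com"]
q = forum_query("pumpkin seed oil", sites)
# q == '"pumpkin seed oil" (site:example-hairforum.com OR site:example-baldtruthtalk.com)'
```

Paste the result into Google and you get only hits from the listed forums, which is the "custom search engine" in its entirety.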

Doing this repeatedly has left me feeling like a geologist who's excavated a narrow stratigraphic column of Internet history.

And my big takeaway is how much dumber people got collectively between the "old school phpBB forum" layer and the "subreddit" layer.

This is a caricature, but I don't think it would be totally ridiculous to summarize discussion on /r/tressless as:

  1. Complaining about Big 3 side effects
  2. Complaining that the state of the art in hair loss hasn't advanced in the past 10 years
  3. Putdowns for anyone who tries anything which isn't the Big 3

If I was conspiracy-minded, I would wonder if Big 3 manufacturers had paid shills who trolled online forums making fun of anyone who tries anything which isn't their product. It's just the opposite of the behavior you'd expect based on game theory: Someone who tries something new individually runs the risk of new side effects, or wasting their time and money, with some small chance of making a big discovery which benefits the collective. So a rational forum user's response to someone trying something new should be: "By all means, please be the guinea pig". And yet that seems uncommon.

Compared with reddit, discussion of nonstandard treatments on old school forums goes into greater depth--I stumbled across a thread on an obscure treatment which was over 1000 pages long. And the old school forums have a higher capacity for innovation... here is a website that an old school forum user made for a DIY formula he invented, "Zix", which a lot of forum users had success with. (The site has a page explaining why we should expect the existence of effective hair loss treatments that the FDA will never approve.) He also links to a forum friend who started building and selling custom laser helmets for hair regrowth. (That's another weird thing about online hair loss forums... Little discussion of laser hair regrowth, even though it's FDA approved, intuitively safe, and this review found it works better than finasteride or minoxidil.)

So what happened with the transition to reddit? Some hypotheses:

  • Generalized eternal September
  • Internet users have a shorter attention span nowadays
  • Upvoting/downvoting facilitates groupthink
  • reddit's "hot" algorithm discourages the production of deep content; the "bump"-driven discussion structure of old school forums allows for threads which are over 1000 pages long
  • Weaker community feel due to intermixing with the entire reddit userbase

I'm starting to wonder if we should set up a phpBB style AI safety discussion forum. I have hundreds of thousands of words of AI content in my personal notebook, only a small fraction of which I've published. Posting to LW seems to be a big psychological speed bump for me. And I'm told that discussion on the Alignment Forum represents a fairly narrow range of perspectives within the broader AI safety community, perhaps because of the "upvoting/downvoting facilitates groupthink" thing.

The advantage of upvoting/downvoting seems to be a sort of minimal quality control--there is less vulnerability to individual fools as described in this post. But I'm starting to wonder if some of the highs got eliminated along with the lows.

Anyway, please send me a message if an AI safety forum sounds interesting to you.

Comment by john_maxwell on The Box Spread Trick: Get rich slightly faster · 2020-09-03T05:14:45.349Z · score: 4 (2 votes) · LW · GW

Does anyone have thoughts on whether buying Treasury Inflation-Protected Securities (probably in the form of an ETF) on margin would be a good way to hedge against currency devaluation?

Comment by john_maxwell on ricraz's Shortform · 2020-08-27T04:36:06.301Z · score: 9 (4 votes) · LW · GW

There's been a fair amount of discussion of that sort of thing here on LW. There are also groups outside LW thinking about social technology, such as RadicalxChange.

Imagine you took 5 separate LWers and asked them to create a unified consensus response to a given article. My guess is that they’d learn more through that collective effort, and produce a more useful response, than if they spent the same amount of time individually evaluating the article and posting their separate replies.

I'm not sure. If you put those 5 LWers together, I think there's a good chance that the highest status person speaks first and then the others anchor on what they say and then it effectively ends up being like a group project for school with the highest status person in charge. Some related links.

Comment by john_maxwell on ricraz's Shortform · 2020-08-26T11:52:22.090Z · score: 2 (1 votes) · LW · GW
  1. All else equal, the harder something is, the less we should do it.

  2. My quick take is that writing lit reviews/textbooks is a comparative disadvantage of LW relative to the mainstream academic establishment.

In terms of producing reliable knowledge... if people actually care about whether something is true, they can always offer a cash prize for the best counterargument (which could of course constitute citation of academic research). The fact that people aren't doing this suggests to me that for most claims on LW, there isn't any (reasonably rich) person who cares deeply re: whether the claim is true. I'm a little wary of putting a lot of effort into supply if there is an absence of demand.

(I guess the counterargument is that accurate knowledge is a public good, so an individual's willingness to pay doesn't get you a complete picture of the value accurate knowledge brings. Maybe what we need is a way to crowdfund bounties for the best argument related to something.)

(I agree that LW authors would ideally engage more with each other and academic literature on the margin.)

Comment by john_maxwell on Learning human preferences: black-box, white-box, and structured white-box access · 2020-08-26T11:02:04.690Z · score: 2 (1 votes) · LW · GW

Let's say I'm trying to describe a hockey game. Modularizing the preferences from other aspects of the team algorithm makes it much easier to describe what happens at the start of the second period, when the two teams switch sides.

The fact that humans find an abstraction useful is evidence that an AI will as well. The notion that agents have preferences helps us predict how people will change their plans for achieving their goals when they receive new information. Same for an AI.

Comment by john_maxwell on ricraz's Shortform · 2020-08-26T08:10:34.050Z · score: 4 (2 votes) · LW · GW

Fair enough. I'm reminded of a time someone summarized one of my posts as being a definitive argument against some idea X and me thinking to myself "even I don't think my post definitively settles this issue" haha.

Comment by john_maxwell on ricraz's Shortform · 2020-08-26T05:41:09.199Z · score: 6 (3 votes) · LW · GW

LW doesn't have enough depth because people don't care enough about depth - they're willing to accept ideas even before they've been explored in depth. If this explanation is correct, then it seems accurate to call it a problem with our epistemic standards - specifically, the standard of requiring (and rewarding) deep investigation and scholarship.

Your solution to the "willingness to accept ideas even before they've been explored in depth" problem is to explore ideas in more depth. But another solution is to accept fewer ideas, or hold them much more provisionally.

I'm a proponent of the second approach because:

  • I suspect even academia doesn't hold ideas as provisionally as it should. See Hamming on expertise.

  • I suspect trying to browbeat people to explore ideas in more depth works against the grain of an online forum as an institution. Browbeating works in academia because your career is at stake, but in an online forum, it just hurts intrinsic motivation and cuts down on forum use (the forum runs on what Clay Shirky called "cognitive surplus", essentially a term for people's spare time and motivation). I'd say one big problem with LW 1.0 that LW 2.0 had to solve before flourishing was people felt too browbeaten to post much of anything.

If we accept fewer ideas / hold them much more provisionally, but provide a clear path to having an idea be widely held as true, that creates an incentive for people to try & jump through hoops--and this incentive is a positive one, not a punishment-driven browbeating incentive.

Maybe part of the issue is that on LW, peer review generally happens in the comments after you publish, not before. So there's no publication carrot to offer in exchange for overcoming the objections of peer reviewers.

Comment by john_maxwell on ricraz's Shortform · 2020-08-26T05:27:15.969Z · score: 2 (1 votes) · LW · GW

One proven claim is worth a dozen compelling hypotheses, but LW to a first approximation only produces the latter.

Depends on the claim, right?

If the cost of evaluating a hypothesis is high, and hypotheses are cheap to generate, I would like to generate a great deal before selecting one to evaluate.

Comment by john_maxwell on "Good judgement" and its components · 2020-08-26T04:49:47.818Z · score: 2 (1 votes) · LW · GW

So much AI safety literature is based around reinforcement learning, but it seems like an impoverished model for describing how humans plan. I have a feeling RL will ultimately be left behind in the same way e.g. SVMs have been left behind.

Comment by john_maxwell on How much can surgical masks help with wildfire smoke? · 2020-08-26T04:40:08.942Z · score: 4 (2 votes) · LW · GW

Sorry to hear about your asthma.

The Powecom KN95s sold here did very well in unofficial government KN95 tests and breathe easier than my P100 respirator:

Adding this improves the seal:

Comment by john_maxwell on On Suddenly Not Being Able to Work · 2020-08-26T04:23:38.012Z · score: 6 (2 votes) · LW · GW

Some ideas:

  • Take some kind of antianxiety supplement like ashwagandha or theanine.
  • Have a place in your house where you sit with your computer where work is the only allowed thing. You can take a break, but it has to be away from your computer (e.g. walk around the neighborhood). Or you can goof off on your computer, but you have to move it out of the work place first. Then make rules for moving out of the work place (e.g. 10-minute wait required).
  • When you relax in the evening, select breaks that entrain longer relaxation (e.g. a full TV show instead of short youtube videos). Don't allow yourself to worry about work after 10 pm say--the idea is to retrain your concentration ability through an activity that rewards concentration.
Comment by john_maxwell on Learning human preferences: black-box, white-box, and structured white-box access · 2020-08-26T04:06:18.798Z · score: 2 (1 votes) · LW · GW

Note that this decomposition is simpler than a "reasonable" version of figure 4, since the boundaries between the three modules don't need to be specified.

Consider two versions of the same program. One makes use of a bunch of copy/pasted code. The other makes use of a nice set of re-usable abstractions. The second program will be shorter/simpler.

Boundaries between modules don't cost very much, and modularization is super helpful for simplifying things.
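To make the description-length point concrete, here's a toy pair of programs with identical behavior (the function names and the cleaning step are made up for illustration):

```python
# Version 1: copy/pasted logic, no module boundary.
def greet_user(name):
    cleaned = name.strip().title()
    return f"Hello, {cleaned}!"

def farewell_user(name):
    cleaned = name.strip().title()
    return f"Goodbye, {cleaned}!"

# Version 2: the shared step is factored into a reusable module.
# The boundary costs one extra `def` line, but every caller shrinks,
# so total description length falls as uses accumulate.
def clean(name):
    return name.strip().title()

def greet(name):
    return f"Hello, {clean(name)}!"

def farewell(name):
    return f"Goodbye, {clean(name)}!"
```

The module boundary here is nearly free to specify, and each additional use of the abstraction (a third greeting, say) costs one line rather than another copied block. That's the sense in which modularizing preferences can make the overall description simpler, not more complex.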

Comment by john_maxwell on Alignment By Default · 2020-08-21T23:51:01.925Z · score: 2 (1 votes) · LW · GW

Mostly no. I've been trying to write a bit more about this topic lately; Alignment as Translation is the main source of my intuitions on core problems, and the fusion power generator scenario is an example of what that looks like in a GPT-like context (parts of your answer here are similar to that).

Well, I encourage you to come up with a specific way in which GPT-N will harm us by trying to write an AF post due to not having solved Alignment as Translation and add it as an answer in that thread. Given that we may be in an AI overhang, I'd like the answers to represent as broad a distribution of plausible harms as possible, because that thread might end up becoming very important & relevant very soon.

Comment by john_maxwell on Alignment By Default · 2020-08-20T08:20:14.157Z · score: 4 (2 votes) · LW · GW

This translates in my head to "all we need to do is solve the main problems of alignment, and then we'll have an assistant which can help us clean up any easy loose ends".

To clarify: do you think the problems brought up in these answers are the main problems of alignment? This claim seems a bit odd to me because I don't think those problems are highlighted in any of the major AI alignment research agenda papers. (Alternatively, if you feel there are important omissions from those answers, I strongly encourage you to write your own answer!)

I did a presentation at the recent AI Safety Discussion Day on how to solve the problems in that thread. My proposed solutions don't look much like anything that's e.g. on Arbital because the problems are different. I can share the slides if you want, PM me your gmail address.

More generally: I'm certainly open to the idea of AI, of one sort or another, helping to work out at least some of the problems of alignment. (Indeed, that's very likely a component of any trajectory where alignment improves over time.) But I have yet to hear a convincing case that punting now actually makes long-run alignment more likely, or even that future tools will make creation of aligned AI easier/more likely relative to unaligned AI. What exactly is the claim here?

Here's an example of a tool that I would find helpful right now, that seems possible to make with current technology (and will get better as technology advances), and seems very low risk: Given a textual description of some FAI proposal (or proposal for solving some open problem within AI safety), highlight the contiguous passage of text within the voluminous archives of AF/LW/etc. that is most likely to represent a valid objection to this proposal. (EDIT: Or, given some AI safety problem, highlight the contiguous passage of text which is most likely to represent a solution.)

Can you come up with improbable scenarios in which this sort of thing ends up being net harmful? Sure. But security is not binary. Just because there is some hypothetical path to harm doesn't mean harm is likely.

Could this kind of approach be useful for unaligned AI as well? Sure. So begin work on it ASAP, keep it low profile, and restrict its use to alignment researchers in order to create maximum differential progress towards aligned AI.

Similarly, I expect we can design safe systems which can reason about safety problems too complex for ourselves.

I'm a bit confused why you're bringing up "safety problems too complex for ourselves" because it sounds like you don't think there are any important safety problems like that, based on the sentences that came before this one?

What notion of "corrigible" are you using here? It sounds like it's not MIRI's "the AI won't disable its own off-switch" notion.

I'm talking about the broad sense of "corrigible" described in e.g. the beginning of this post.

(BTW, I just want to clarify that we're having two parallel discussions here: One discussion is about what we should be doing very early in our AI safety gameplan, e.g. creating the assistant I described that seems like it would be useful right now. Another discussion is about how to prevent a failure mode that could come about very late in our AI safety gameplan, where we have a sorta-aligned AI and we don't want to lock ourselves into an only sorta-optimal universe for all eternity. I expect you realize this, I'm just stating it explicitly in order to make the discussion a bit easier to follow.)

Comment by john_maxwell on Alignment By Default · 2020-08-19T07:06:13.297Z · score: 4 (2 votes) · LW · GW

If we had that - not necessarily a full model of human values, just a formalization which we were confident could represent them - then that would immediately open the gates to analysis, to value learning, to uncertainty over values, etc.

Do you have in mind a specific aspect of human values that couldn't be represented using, say, the reward function of a reinforcement learning agent AI?

On the tools side, I assume the tools will be reasoning about systems/problems which humans can't understand - that's the main value prop in the first place. Trying to collapse the complexity of those systems into a human-understandable API is inherently dangerous: values are complex, the system is complex, their interaction will inevitably be complex, so any API simple enough for humans will inevitably miss things. So the only safe option which can scale to complex systems is to make sure the "tools" have their own models of human values, and use those models to check the safety of their outputs... which brings us right back to alignment.

There's an aspect of defense-in-depth here. If your tool's model of human values is slightly imperfect, that doesn't necessarily fail hard the way an agent with a model of human values that's slightly imperfect does.

BTW, let's talk about the "Research Assistant" story here. See more discussion here. (The problems brought up in that thread seem pretty solvable to me.)

Simple mechanisms like always displaying an estimated probability that I'll regret asking a question would probably help, but I'm mainly worried about the unknown unknowns, not the known unknowns. That's part of what I mean when I talk about marginal improvements vs closing the bulk of the gap - the unknown unknowns are the bulk of the gap.

That's why you need a tool... so it can tell you the unknown unknowns you're missing, and how to solve them. We'd rather have a single die roll, on creating a good tool, than have a separate die roll for every one of those unknown unknowns, wouldn't we? ;-) Shouldn't we aim for a fairly minimalist, non-paternalistic tool where unknown unknowns are relatively unlikely to become load-bearing? All we need to do is figure out the unknown unknowns that are load-bearing in the Research Assistant scenario, and then the assistant can help us with the rest of the unknown unknowns.

it has to do some reasoning which is too complex for humans - which brings back the problem from earlier, about the inherent danger of collapsing complex values and systems into a simple API.

If solving FAI necessarily involves reasoning about things which are beyond humans (which seems to be what you're getting at with the "unknown unknowns" stuff), what is the alternative?

I have yet to see a convincing argument that corrigibility is any easier than alignment itself. It seems to suffer from the same basic problem: the concept of "corrigibility" has a lot of hidden complexity, especially when it interacts with embeddedness. To the extent that we're relying on corrigibility, I'd ideally like it to improve with capabilities, in the same way and for the same reasons as I'd like alignment to improve with capabilities. Do you know of an argument that it's easier?

We were discussing a scenario where we had an OK solution to alignment, and you were saying that you didn't want to get locked into a merely OK solution for all of eternity. I'm saying corrigibility can address that. Alignment is already solvable to an OK degree in this hypothetical, so I'm assuming corrigibility is solvable to an OK degree as well.

Corrigible AI should be able to improve its corrigibility with increased capabilities the same way it can improve its alignment with increased capabilities. You say "corrigibility" has a lot of hidden complexity. The more capable the system, the more hypotheses it can generate regarding complex phenomena, and the more likely those hypotheses are to be correct. There's no reason we can't make the system's notion of corrigibility corrigible in the same way its values are corrigible. (BTW, I don't think corrigibility even necessarily needs to be thought of as separate from alignment, you can think of them as both being reflected in an agent's reward function say. But that's a tangent.) And we can leverage capability increases by having the system explain various notions of corrigibility it's discovered and how they differ so we can figure out which notion(s) we want to use.

Comment by john_maxwell on Alignment By Default · 2020-08-18T09:31:09.439Z · score: 2 (1 votes) · LW · GW

This comment definitely wins the award for best comment on the post so far.


I don't consider myself an expert on the unsupervised learning literature by the way, I expect there is more cool stuff to be found.

Comment by john_maxwell on Alignment By Default · 2020-08-18T08:49:28.499Z · score: 2 (1 votes) · LW · GW

I don't expect them to close the bulk of the gap without at least some progress on theoretical bottlenecks.

Can you be more specific about the theoretical bottlenecks that seem most important?

I am generally lukewarm about human-simulation approaches to alignment; the fusion power generator scenario is a prototypical example of my concerns here (also see this comment on it, which explains what I see as the key take-away).

I agree that Tool AI is not inherently safe. The key question is which problem is easier: the alignment problem, or the safe-use-of-dangerous-tools problem. All else equal, if you think the alignment problem is hard, then you should be more willing to replace alignment work with tool safety work. If you think the alignment problem is easy, you should discourage dangerous tools in favor of frontloaded work on a more paternalistic "not just benign, actually aligned" AI.

An analogy here would be Linux vs Windows. Linux lets you shoot your foot off and wipe your hard drive with a single command, but it also gives you greater control of your system, and your computer is less likely to get viruses. Windows is safer and more paternalistic, with less user control. Windows is a better choice for the average user, but that's partially because we have a lot of experience building operating systems. It wouldn't make sense to aim for Windows as our first operating system, because (a) it's a more ambitious project and (b) we wouldn't have enough experience to know the right ways in which to be paternalistic. Heck, it was you who linked disparagingly to waterfall-style software development the other day :) There's a lot to be said for simplicity of implementation.

(Random aside: In some sense I think the argument for paternalism is self-refuting, because the argument is essentially that humans can't be trusted, but I'm not sure the total amount of responsibility we're assigning to humans has changed--if the first system is to be very paternalistic, that puts an even greater weight of responsibility on the shoulders of its designers to be sure and get it right. I'd rather shove responsibility into the post-singularity world, because the current world seems non-ideal, for example, AI designers have limited time to think due to e.g. possible arms races.)

What do I mean by the "safe-use-of-dangerous-tools problem"? Well, many dangerous tools will come with an instruction manual or mandatory training in safe tool use. For a tool AI, this manual might include things like:

  • Before asking the AI any question, ask: "If I ask Question X, what is the estimated % chance that I will regret asking on reflection?"

  • Tell the AI: "When you answer this question, instead of revealing any information you think will plausibly harm me, replace it with [I'm not revealing this because it could plausibly harm you]"

  • If using a human-simulation approach to alignment, tell your AI to only make use of the human-simulation to inform terminal values, never instrumental values. Or give the human simulation loads of time to reflect, so it's effectively a speed superintelligence (assuming for the moment what seems to be a common AI safety assumption that more reflection always improves outcomes--skepticism here). Or make sure the simulated human has access to the safety manual.

I think it's possible to do useful work on the manual for the Tool AI even in the absence of any actual Tool AI having been created. In fact, I suspect this work will generalize better between different AI designs than most alignment work generalizes between designs.

Insights from our manual could even be incorporated into the user interface for the tool. For example, the question-asking flow could by default show us the answer to the question "If I ask Question X, what is the estimated % chance that I will regret asking on reflection?" and ask us to read the result and confirm that the question is actually one we want to ask. This would be analogous to alias rm='rm -i' in Linux--it doesn't reduce transparency or add brittle complexity, but it does reduce the risk of shooting ourselves in the foot.
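A minimal sketch of what that confirmation flow could look like, assuming a hypothetical `estimate_regret_chance` query into the tool (stubbed out with a constant here; all names are invented for illustration):

```python
def estimate_regret_chance(question):
    # Hypothetical query into the tool AI: "If I ask Question X, what is
    # the estimated % chance that I will regret asking on reflection?"
    # Stubbed with a constant for this sketch.
    return 0.02

def ask(oracle, question, regret_threshold=0.05, confirm=input):
    """Analogous to alias rm='rm -i': surface the estimated regret chance
    and require explicit confirmation before the question is actually asked."""
    p = estimate_regret_chance(question)
    if p > regret_threshold:
        reply = confirm(
            "Estimated %.0f%% chance you'll regret asking. Proceed? [y/N] "
            % (100 * p))
        if reply.strip().lower() != "y":
            return None  # question never reaches the oracle
    return oracle(question)
```

`confirm` is injectable so the safeguard itself can be tested; by default it prompts the user interactively, just as `rm -i` does. Like the alias, this adds no brittle complexity to the underlying tool--it only changes the default interaction.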

BTW you wrote:

Coming at it from a different angle: if a safety problem is handled by a system's designer, then their die-roll happens once up-front. If that die-roll comes out favorably, then the system is safe (at least with respect to the problem under consideration); it avoids the problem by design. On the other hand, if a safety problem is left to the system's users, then a die-roll happens every time the system is used, so inevitably some of those die rolls will come out unfavorably. Thus the importance of designing AI for safety up-front, rather than relying on users to use it safely.

One possible plan for the tool is to immediately use it to create a more paternalistic system (or just generate a bunch of UI safeguards as I described above). So then you're essentially just rolling the dice once.

Two examples of "specifics of the data collection process" to illustrate

From my perspective, these examples essentially illustrate that there's not a single natural abstraction for "human values"--but as I said elsewhere, I think that's a solvable problem.

My main reservation is that later AIs will never be more precisely aligned than the oracle. That first AI may be basically-correctly aligned, but it still only has so much data and probably only rough algorithms, so I'd really like it to be able to refine its notion of human values over time. In other words, the oracle's notion of human values may be accurate but not precise, and I'd like precision to improve as more data comes in and better algorithms are found. This is especially important if capabilities rise over time and greater capabilities require more precise alignment.

Let's make the later AIs corrigible then. Perhaps our initial AI can give us both a corrigibility oracle and a values oracle. (Or later AIs could use some other approach to corrigibility.)

Comment by john_maxwell on Alignment By Default · 2020-08-15T12:04:52.712Z · score: 4 (2 votes) · LW · GW

Does he think this is a good presentation of his proposal?

I'm very glad johnswentworth wrote this, but there are a lot of little details where we seem to disagree--see my other comments in this thread. There are also a few key parts of my proposal not discussed in this post, such as active learning and using an ensemble to fight Goodharting and be more failure-tolerant. I don't think there's going to be a single natural abstraction for "human values" like johnswentworth seems to imply with this post, but I also think that's a solvable problem.

(previous discussion for reference)

Comment by john_maxwell on Alignment By Default · 2020-08-15T11:45:15.444Z · score: 2 (1 votes) · LW · GW

My take is that corrigibility is sufficient to get you an AI that understands what it means to "keep improving their understanding of Alice's values and to serve those values".  I don't think the AI needs to play the "genius philosopher" role, just the "loyal and trustworthy servant" role.  A superintelligent AI which plays that role should be able to facilitate a "long reflection" where flesh and blood humans solve philosophical problems.

(I also separately think unsupervised learning systems could in principle make philosophical breakthroughs. Maybe one already has.)

Comment by john_maxwell on Alignment By Default · 2020-08-15T11:33:16.879Z · score: 7 (4 votes) · LW · GW

Thanks a lot for writing this. I've been thinking about FAI plans along these lines for a while now, here are some thoughts on specific points you made.

First, I take issue with the "Alignment By Default" title. There are two separate questions here. Question #1 is whether we'd have a good outcome if everyone concerned with AI safety got hit by a bus. Question #2 is whether there's a way to create Friendly AI using unsupervised learning. I'm rather optimistic that the answer to Question #2 is yes. I find the unsupervised learning family of approaches more appealing than IRL or HCH (from what I understand of those approaches). But I still think there are various ways in which things could go wrong, some of which you mention in this post, and it's useful to have safety researchers thinking about this, because the problems seem pretty tractable to me. You, me, and Steve Byrnes are the only people in the community I remember off the top of my head who seem to be giving this serious thought, which is a little odd because so many top AI people seem to think that unsupervised learning is The Nut That Must Be Cracked if we are to build AGI.

Anyway, in order to illustrate that the problems seem tractable, here are a couple things you brought up + thoughts on solving them.

With regard to the high-resolution molecular model of a human, there's the possibility of using this model as an upload somehow even if the embedding of human values is nontrivial. I guess the challenge is to excise everything around the human from the model, and replace those surroundings with whatever an ideal environment for doing moral / philosophical reasoning would be, along with some communication channel to the outside world. This approach is similar to the Paul Christiano construction described on p. 198 of Superintelligence. In this case, I guess it is more important for the embedding of a person's physical surroundings to be "natural" enough that we can mess with it without messing with the person's mind. However, even if the embedding of the person's physical surroundings is kinda bad (meaning that our "ideal environment for doing moral / philosophical reasoning" ends up being like a glitchy VR sim in practice), this plausibly won't cause a catastrophic alignment failure. Also, you don't necessarily need a super high-resolution model to do this sort of thing (imagine prompting GPT-N with "Gandhi goes up the mountain to contemplate Moral Question X, he returns after a year of contemplation and proclaims...").

This requires hitting a window - our data needs to be good enough that the system can tell it should use human values as a proxy, but bad enough that the system can’t figure out the specifics of the data-collection process enough to model it directly. This window may not even exist.

A couple thoughts.

First, I think it's possible to create this window. Suppose we restrict ourselves to feeding our system data from before the year 2000. There should be a decent representation of human values to be learned from this data, yet it should be quite difficult to figure out the specifics of the 2020+ data-collection process from it. Identifying the specific quirks which cause the data-collection process to differ from human values seems especially difficult. (I think restricting ourselves to pre-2000 data is overkill, I just chose 2000 for the purpose of illustration.)

Second, one way to check on things is to deliberately include a small quantity of mislabeled data; then, once the system is done learning, check whether its model correctly recognizes that the mislabeled data is mislabeled (and agrees with all the data that is correctly labeled). (This should be combined with the idea above where we disguise the data-collection process from the AI, because otherwise the system might pinpoint "the data-collection process prior to the time at which the mislabeled data was introduced".)
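A toy version of this sanity check, with a trivial learned-threshold "model" standing in for the real system (data and names invented for illustration):

```python
def fit_threshold(examples):
    """Learn a 1-D decision rule: label is True iff x >= threshold.
    A stand-in for the real learning system."""
    # Choose the threshold that misclassifies the fewest examples, so a
    # few mislabeled points get outvoted by the correctly labeled ones.
    candidates = sorted(x for x, _ in examples)
    def errors(t):
        return sum((x >= t) != y for x, y in examples)
    return min(candidates, key=errors)

# Mostly-correct data (true rule: label = x >= 5) with two flipped labels.
data = [(x, x >= 5) for x in range(10)]
data[2] = (2, True)   # deliberately mislabeled
data[7] = (7, False)  # deliberately mislabeled

t = fit_threshold(data)
# The check: the learned model should disagree with exactly the flipped points.
flagged = [(x, y) for x, y in data if (x >= t) != y]
print(t, flagged)
```

If the model instead agreed with the planted errors, that would be evidence it had latched onto the data-collection process rather than the underlying concept.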

I know of two main ways to go from aligning one AI to long-term alignment

A third approach which you don't mention is to use the initial aligned AI as a "human values oracle" for subsequent AIs. Once you have a cheap, fast computational representation of human values, you can replicate it across a massive compute cluster and

  • Use it to generate extremely large quantities of training data

  • Use it as the "consciousness module" for some bigger, more sophisticated system

  • Use it to identify specific ways in which the newer AI's concept of human values is wrong, and keep correcting the newer AI's concept of human values until it's good (maybe using active learning)

You need the new AI and the old AI to communicate with one another. But details of how they work can be totally different if you have them communicate using labeled data. Training one ML model to predict the output of some other ML model is a technique I see every so often in machine learning papers... "Distilling the Knowledge in a Neural Network" is a well-known example of this.
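A stdlib-only toy of this "communicate using labeled data" idea, in the spirit of distillation (the "models" here are deliberately trivial stand-ins, and every name is invented for illustration):

```python
import random

# "Old AI": an opaque model we can query but whose internals we can't reuse.
def old_ai(x):
    return x >= 0.6  # its learned concept, unknown to the new system

# Step 1: use the old model to generate a large labeled training set.
random.seed(0)
train = [(x, old_ai(x)) for x in [random.random() for _ in range(1000)]]

# Step 2: train the new model -- a totally different "architecture"
# (here, a learned 1-D threshold) -- purely on the old model's outputs.
def fit_threshold(examples):
    candidates = sorted(x for x, _ in examples)
    return min(candidates,
               key=lambda t: sum((x >= t) != y for x, y in examples))

new_threshold = fit_threshold(train)

# Step 3: check teacher/student agreement on the generated data.
agreement = sum(old_ai(x) == (x >= new_threshold)
                for x, _ in train) / len(train)
print(round(new_threshold, 3), agreement)
```

The point is that the two systems never share internals; the only interface is labeled data, which is why the details of how they work can be totally different.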

Finally, you wrote:

That’s assuming things don’t go irreversibly wrong before then - including not just hostile takeover, but even just development of deceptive behavior, since this scenario does not have any built-in mechanism to detect deception.

Mesa-optimizers are a real danger, but if we put those aside for a moment, I don't think there is much risk of a hostile takeover from an unsupervised learning system since it's not an agent.

Comment by john_maxwell on Alignment By Default · 2020-08-15T10:15:39.892Z · score: 7 (4 votes) · LW · GW

Some notes on the loss function in unsupervised learning:

Since an unsupervised learner is generally just optimized for predictive power

I think it's worthwhile to distinguish the loss function that's being optimized during unsupervised learning, vs what the practitioner is optimizing for. Yes, the loss function being optimized in an unsupervised learning system is frequently minimization of reconstruction error or similar. But when I search for "unsupervised learning review" on Google Scholar, I find this highly cited paper by Bengio et al. The abstract talks a lot about learning useful representations and says nothing about predictive power. In other words, learning "natural abstractions" appears to be pretty much the entire game from a practitioner perspective.

And in the same way supervised learning has dials such as regularization which let us control the complexity of our model, unsupervised learning has similar dials.

For clustering, we could achieve 0 reconstruction error (or equivalently, explain all the variation in the data) by putting every data point in its own cluster, but that would completely defeat the point. The elbow method is a well-known heuristic for figuring out what the "right" number of clusters in a dataset is.
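A stdlib-only sketch of the elbow heuristic on 1-D data (a real implementation would use an off-the-shelf k-means; the data here is invented):

```python
def kmeans_1d(xs, k, iters=20):
    """Tiny Lloyd's algorithm on sorted 1-D data; returns (centers, sse)."""
    centers = [xs[i * len(xs) // k] for i in range(k)]  # spread-out init
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda i: (x - centers[i]) ** 2)
            clusters[nearest].append(x)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    sse = sum(min((x - c) ** 2 for c in centers) for x in xs)
    return centers, sse

# Three well-separated clumps: reconstruction error keeps falling as k
# grows (reaching ~0 when every point gets its own cluster), but the
# big drops stop after k=3 -- the "elbow".
xs = sorted([1.0, 1.1, 1.2, 5.0, 5.1, 5.2, 9.0, 9.1, 9.2])
sses = {k: kmeans_1d(xs, k)[1] for k in range(1, 6)}
print(sses)
```

Plotting `sses` against k would show the characteristic bend at k=3: adding clusters past the number of real clumps only shaves off within-clump noise.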

Similarly, we could achieve 0 reconstruction error with an autoencoder by making the number of dimensions in the bottleneck be equal to the number of dimensions in the original input, but again, that would completely defeat the point. Someone on the Stats Stackexchange says that there is no standard way to select the number of dimensions for an autoencoder. (For reference, the standard way to select the regularization parameter which controls complexity in supervised learning would obviously be through cross-validation.) However, I suspect this is a tractable research problem.

It was interesting that you mentioned the noise of air molecules, because one unsupervised learning trick is to deliberately introduce noise into the input to see if the system has learned "natural" representations which allow it to reconstruct the original noise-free input. See denoising autoencoder. This is the kind of technique which might allow an autoencoder to learn natural representations even if the number of dimensions in the bottleneck is equal to the number of dimensions in the original input.

BTW, here's an interesting-looking (pessimistic) paper I found while researching this comment: Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations

You brought up microscope AI. I think a promising research direction here may be to formulate a notion of "ease of interpretability" which can be added as an additional term to an unsupervised loss function (the same way we might, for example, add a term to a clustering algorithm's loss function so that in addition to minimizing reconstruction error, it also seeks to minimize the number of clusters).

Hardcoding "human values" by hand is hopeless, but hardcoding "ease of human interpretability" by hand seems much more promising, since ease of human interpretability is likely to correspond to easily formalizable notions such as simplicity. Also, if your hardcoded notion of "ease of human interpretability" turns out to be slightly wrong, that's not a catastrophe: you just get an ML model which is a bit harder to interpret than you might like.
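A toy version of adding such a term to a clustering objective, with "interpretability" proxied by the number of clusters and hypothetical reconstruction errors; note how too large a weight destroys the natural abstraction, as warned below:

```python
def pick_k(sse_by_k, lam):
    """Pick the number of clusters by minimizing reconstruction error
    plus a simplicity/interpretability penalty, instead of eyeballing
    an elbow. lam is the weight on the interpretability term."""
    return min(sse_by_k, key=lambda k: sse_by_k[k] + lam * k)

# Hypothetical reconstruction errors for k = 1..5 clusters
# (three real clumps, so error plummets until k=3, then plateaus).
sse_by_k = {1: 96.1, 2: 24.1, 3: 0.06, 4: 0.05, 5: 0.04}

print(pick_k(sse_by_k, lam=1.0))    # modest penalty recovers k=3
print(pick_k(sse_by_k, lam=100.0))  # too much weight collapses to k=1
```

A modest penalty just breaks the near-tie among the plateau values, while an oversized one overrides the reconstruction term entirely and merges clusters that are genuinely distinct.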

Another option is to learn a notion of what constitutes an interpretable model by e.g. collecting "ease of interpretability" data from human microscope users.

Of course, one needs to be careful that any interpretability term does not get too much weight in the loss function, because if it does, we may stop learning the "natural" abstractions that we desire (assuming a worst-case scenario where human interpretability is anticorrelated with "naturalness"). The best approach may be to learn two models, one of which was optimized for interpretability and one of which wasn't, and only allow our system to take action when the two models agree. I guess mesa-optimizers in the non-interpretable model are still a worry though.

Comment by john_maxwell on Tags Discussion/Talk Thread · 2020-08-08T08:30:56.800Z · score: 5 (3 votes) · LW · GW

FWIW, I'm not a fan of "akrasia"--seems unnecessarily highfalutin to me.  Most stuff tagged with "akrasia" is essentially about procrastination, not akrasia as a philosophical problem.  (Just found this article on Google.)  I think it's OK for LW to use jargon, but we should recognize jargon comes with a cost, and there's no reason to pay the cost if we aren't getting any particular benefit.

(crl826 mentioned that "procrastination" is another related tag in the latest open thread.)

Comment by john_maxwell on Open & Welcome Thread - August 2020 · 2020-08-08T08:27:27.692Z · score: 2 (1 votes) · LW · GW

I don't think you should combine quite yet.  More discussion here.  (I suggest we continue there since that's the dedicated tag thread.)

Comment by John_Maxwell_IV on [deleted post] 2020-08-08T08:25:57.404Z

Oh thanks, I missed it somehow.

Comment by John_Maxwell_IV on [deleted post] 2020-08-08T06:22:04.547Z

How hard would it be to add a feature that lets me check & see how many reads a post of mine has?

Ideally "read" would be somewhat meaningful -- a bot scraping the LW archives shouldn't count, someone opening the post in a tab and closing the tab before reading shouldn't count, but if you read halfway through and abandon the post in disgust, that should count.  If the same user opens the post on two occasions that should count as at most one read. has a nice analytics dashboard.

(I'm asking this because I think # reads might motivate me better than # upvotes.  Based on stats, it seems like my posts there have about 100x as many viewers as upvoters (small sample size).)

Comment by john_maxwell on Tags Discussion/Talk Thread · 2020-08-08T05:29:36.394Z · score: 3 (2 votes) · LW · GW

BTW, "productivity" and "akrasia" are another pair of tags that feel a bit poorly differentiated to me.

Comment by john_maxwell on Tags Discussion/Talk Thread · 2020-08-08T05:18:44.403Z · score: 2 (1 votes) · LW · GW

So it sounds like the underlying content categories are:

  • Technical AI safety
  • Nontechnical AI safety/AI strategy
  • AI content unrelated to safety

Is that right?

I guess my complaint is that while "AI content unrelated to safety" always gets tagged "AI", and "Nontechnical AI safety/AI strategy" always gets tagged "AI Risk", there doesn't seem to be a consistent policy for the "Technical AI safety" content.

Comment by john_maxwell on Tags Discussion/Talk Thread · 2020-08-08T04:23:55.295Z · score: 2 (1 votes) · LW · GW

We have both an AI tag and an AI Risk tag.  When should one use one or the other?  Maybe we should rename AI Risk to AI Risk Strategy or AI Strategy so they're more clearly differentiated.

Comment by john_maxwell on PSA: Tagging is Awesome · 2020-08-07T07:39:36.659Z · score: 2 (1 votes) · LW · GW

I expect you'd get better results by using older, less hyped NLP techniques that are designed for this sort of thing, such as topic modeling.

The tagging work that's already been done need not be a waste, because you can essentially use it as training data for the kind of tags you'd like an automated system to discover and assign.  For example, tweak the hyperparameters of the topic modeling system until it is really good at independently rediscovering/reassigning the tags that have already been manually assigned.
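A stdlib-only sketch of that tuning loop, with a toy keyword-overlap model standing in for a real topic model (the posts, tags, and the `min_df` hyperparameter are all invented for illustration):

```python
from collections import Counter

# Manually tagged posts: (text, tag) -- the existing tagging work.
posts = [
    ("gradient descent loss training neural", "ml"),
    ("neural network layers training", "ml"),
    ("utility function alignment agent", "safety"),
    ("corrigibility alignment oracle agent", "safety"),
]

# Toy "topic model": assign each post to the tag whose keyword set it
# overlaps most, keeping only words above a document-frequency cutoff.
def build_keywords(posts, min_df):
    df = Counter(w for text, _ in posts for w in set(text.split()))
    by_tag = {}
    for text, tag in posts:
        by_tag.setdefault(tag, set()).update(
            w for w in text.split() if df[w] >= min_df)
    return by_tag

def predict(text, keywords):
    return max(keywords, key=lambda tag: len(keywords[tag] & set(text.split())))

def agreement(posts, min_df):
    kw = build_keywords(posts, min_df)
    return sum(predict(t, kw) == tag for t, tag in posts) / len(posts)

# "Tweak the hyperparameters until the system rediscovers the manual tags."
best_min_df = max([1, 2, 3], key=lambda d: agreement(posts, d))
print(best_min_df, agreement(posts, best_min_df))
```

The same loop applies unchanged if you swap in a real topic model and sweep its actual hyperparameters: the manual tags serve purely as the evaluation signal.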

An advantage of the automated approach is that you should be able to reapply it to some other document corpus--for example, autogenerate tags for the EA Forum, or all AI alignment related papers/discussion off LW, or the entire AI literature in order to help with/substitute for this job (especially if you can get some kind of hierarchical tagging to work).

I've actually spent a while thinking about this sort of problem and I'm happy to video call and chat more if you want.

Comment by john_maxwell on Tags Discussion/Talk Thread · 2020-08-07T07:33:11.320Z · score: 4 (3 votes) · LW · GW

What are the norms around the number of tags that are appropriate for a post to get?  There are some posts of mine that I wish more people would read, and piling relevant tags onto them looks like an easy way to accomplish this.  However, I'm looking at some of the other tagging effort that's being done, and it seems like sometimes posts are being tagged with just one or two of a larger collection of say 4-5 tags that could be considered relevant.

Edit: Thanks for the responses, all.

Comment by john_maxwell on Infinite Data/Compute Arguments in Alignment · 2020-08-06T03:09:59.870Z · score: 4 (2 votes) · LW · GW

See also Robustness to Scale. You wrote that "we expect that the failure modes which still appear under such assumptions are the hard failure modes" (emphasis mine). But there are some failure modes which don't appear with existing algorithms, yet are hypothesized to appear in the limit of more data and compute, such as the "malign universal prior" problem. It's unclear how much to worry about these problems, because as you say, we don't actually expect to use e.g. Solomonoff induction. I suspect a key issue is whether the problem is an inevitable result of scaling any algorithm, vs a quirk of the particular infinite data/compute algorithm being discussed.

Comment by john_maxwell on What specific dangers arise when asking GPT-N to write an Alignment Forum post? · 2020-07-29T05:31:21.184Z · score: 6 (4 votes) · LW · GW

A general method for identifying dangers: For every topic which gets discussed on AF, figure out what could go wrong if GPT-N decided to write a post on that topic.

  • GPT-N writes a post about fun theory. It illustrates principles of fun theory by describing an insanely fun game you can play with an ordinary 52-card deck. FAI work gets pushed aside as everyone becomes hooked on this new game. (Procrastination is an existential threat!)

  • GPT-N writes a post about human safety problems. To motivate its discussion, it offers some extraordinarily compelling reasons why the team which creates the first AGI might want to keep the benefits to themselves.

  • GPT-N writes a post about wireheading. In the "Human Wireheading" section, it describes an incredibly easy and pleasurable form of meditation. Soon everyone is meditating 24/7.

  • GPT-N writes a post about s-risks. Everyone who reads it gets a bad case of PTSD.

  • GPT-N writes a post about existential hope. Everyone who reads it becomes unbearably impatient for the posthuman era. Security mindset becomes a thing of the past. Alternatively, everyone's motivation for living in the present moment gets totally sapped. There are several high-profile suicides.

  • GPT-N has an incredibly bad take on decision theory, game theory, and blackmail. It gets deleted from AF. The Streisand effect occurs and millions of people read it.

  • GPT-N offers a very specific answer to the question "What specific dangers arise when asking GPT-N to write an Alignment Forum post?"

For the prompt you provided, one risk would be that GPT-N says the best way to audit cognition is to look for each of these 10 different types of nefarious activity, and in describing the 10 types, it ends up writing something nefarious.

GPT-N might inadvertently write a post which presents an incredibly compelling argument for an incorrect and harmful conclusion ("FAI work doesn't matter because FAI is totally impossible"), but one hopes that you could simply use GPT-N to write a counterargument to that post to see if the conclusion is actually solid. (Seems like good practice for GPT-N posts in general.)

Comment by john_maxwell on What specific dangers arise when asking GPT-N to write an Alignment Forum post? · 2020-07-28T12:15:01.270Z · score: 5 (3 votes) · LW · GW

One class of problem comes about if GPT-N starts thinking about "what would a UFAI do in situation X":

  • Inspired by AI box experiments, GPT-N writes a post about the danger posed by ultra persuasive AI-generated arguments for bad conclusions, and provides a concrete example of such an argument.
  • GPT-N writes a post where it gives a detailed explanation of how a UFAI could take over the world.  Terrorists read the post and notice that UFAI isn't a hard requirement for the plan to work.
  • GPT-N begins writing a post about mesa-optimizers and starts simulating a mesa-optimizer midway through.
Comment by john_maxwell on Are we in an AI overhang? · 2020-07-28T11:31:04.723Z · score: 33 (17 votes) · LW · GW

I'm not sure it's good for this comment to get a lot of attention?  OpenAI is more altruism-oriented than a typical AI research group, and this is essentially a persuasive essay for why other groups should compete with them.

Comment by john_maxwell on Alignment As A Bottleneck To Usefulness Of GPT-3 · 2020-07-26T09:33:19.274Z · score: -3 (3 votes) · LW · GW

There is a single coherent position here in which it is very hard to build an AGI which reliably is not a paperclipper.

This is simultaneously

  • a major retreat from the "default outcome is doom" thesis which is frequently trotted out on this site (the statement is consistent with an AGI design that is 99.9% likely to be safe, which is very much incompatible with "default outcome is doom")
  • unrelated to our upload discussion (an upload is not an AGI, but you said even a great upload wasn't good enough for you)

You've picked a position vaguely in between the motte and the bailey and said "the motte and the bailey are both equivalent to this position!"  That doesn't look at all true to me.

All those straightforward ways have failure modes which will definitely occur in at least some range of parameters, and we don't know exactly what those parameter ranges are.

This is a very strong claim which to my knowledge has not been well-justified anywhere.  Daniel K agreed with me the other day that there isn't a standard reference for this claim.  Do you know of one?

There are a couple problems I see here:

  • Simple is not the same as obvious.  Even if someone at some point tried to think of every obvious solution and justifiably discarded them all, there are probably many "obvious" solutions they didn't think of.
  • Nothing ever gets counted as evidence against this claim.  Simple proposals get rejected on the basis that everyone knows simple proposals won't work.

A MIRI employee openly admitted here that they apply different standards of evidence to claims of safety vs claims of not-safety.  Maybe there are good arguments for that, but the problem is that if you're not careful, your view of reality is gonna get distorted.  Which means community wisdom on claims such as "simple solutions never work" is likely to be systematically wrong.  "Everyone knows X", without a good written defense of X, or a good answer to "what would change the community's mind about X", is fertile ground for information cascades etc.  And this is on top of standard ideological homophily problems (the AI safety community is a very self-selected subset of the broader AI research world).

What I'm saying is, that truss design is 100% going to fail once it gets big enough, and we don't currently know how big that is. When I say "it's hard to design a long bridge which won't fall down", I do not mean a bridge which might not fall down if we're lucky and just happen to be within the safe parameter range.

My perception of your behavior in this thread is: instead of talking about whether the bridge can be extended, you changed the subject and explained that the real problem is that the bridge has to support very heavy trucks.  This is logically rude.  And it makes it impossible to have an in-depth discussion about whether the bridge design can actually be extended or not.  From my perspective, you've pulled this conversational move multiple times in this thread.  It seems to be pretty common when I have discussions with AI safety people.  That's part of why I find the discussions so frustrating.  My view is that this is a cultural problem which has to be solved for the AI safety community to do much useful AI safety work (as opposed to "complaining about how hard AI safety is" work, which is useful but insufficient).

Anyway, I'll let you have the last word in this thread.

Comment by john_maxwell on Alignment As A Bottleneck To Usefulness Of GPT-3 · 2020-07-25T09:47:50.860Z · score: 2 (1 votes) · LW · GW

if we're not just optimizing for predictive power, then we need some other design criteria, some other criteria for whether/how well the system is working.

Optimize for having a diverse range of models that all seem to fit the data.
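One way to cash this out (a toy sketch of my own, with an illustrative setup): keep every model in some class that fits the data, and treat their disagreement on a new input as a signal that the data underdetermines the answer there.

```python
import math

# Observations of y = x^2 on a grid with spacing 0.1.
data = [(k / 10, (k / 10) ** 2) for k in range(-10, 11)]

# A family of models that all fit the observations: each adds a wiggle
# term that vanishes at every observed x.
def make_model(t):
    return lambda x: x ** 2 + t * math.sin(10 * math.pi * x)

models = [make_model(t) for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]

def fit_error(m):
    return sum((m(x) - y) ** 2 for x, y in data)

# All five models are numerically indistinguishable on the data.
assert all(fit_error(m) < 1e-9 for m in models)

# Their disagreement at a point measures how underdetermined it is.
def spread(x):
    preds = [m(x) for m in models]
    return max(preds) - min(preds)

print(round(spread(0.1), 3), round(spread(0.05), 3))
```

On the observed grid the spread is ~0; halfway between grid points it is large, even though every model in the set has equally good predictive power on the data. That gap is the extra design criterion: a diverse well-fitting ensemble tells you *where* not to trust any single model.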