Posts

Why GPT wants to mesa-optimize & how we might change this 2020-09-19T13:48:30.348Z
John_Maxwell's Shortform 2020-09-11T20:55:20.409Z
Are HEPA filters likely to pull COVID-19 out of the air? 2020-03-25T01:07:18.833Z
Comprehensive COVID-19 Disinfection Protocol for Packages and Envelopes 2020-03-15T10:00:33.170Z
Why don't singularitarians bet on the creation of AGI by buying stocks? 2020-03-11T16:27:20.600Z
When are immunostimulants/immunosuppressants likely to be helpful for COVID-19? 2020-03-05T21:44:08.288Z
The Goodhart Game 2019-11-18T23:22:13.091Z
Self-Fulfilling Prophecies Aren't Always About Self-Awareness 2019-11-18T23:11:09.410Z
What AI safety problems need solving for safe AI research assistants? 2019-11-05T02:09:17.686Z
The problem/solution matrix: Calculating the probability of AI safety "on the back of an envelope" 2019-10-20T08:03:23.934Z
The Dualist Predict-O-Matic ($100 prize) 2019-10-17T06:45:46.085Z
Replace judges with Keynesian beauty contests? 2019-10-07T04:00:37.906Z
Three Stories for How AGI Comes Before FAI 2019-09-17T23:26:44.150Z
How to Make Billions of Dollars Reducing Loneliness 2019-08-30T17:30:50.006Z
Response to Glen Weyl on Technocracy and the Rationalist Community 2019-08-22T23:14:58.690Z
Proposed algorithm to fight anchoring bias 2019-08-03T04:07:41.484Z
Raleigh SSC/LW/EA Meetup - Meet MealSquares People 2019-05-08T00:01:36.639Z
The Case for a Bigger Audience 2019-02-09T07:22:07.357Z
Why don't people use formal methods? 2019-01-22T09:39:46.721Z
General and Surprising 2017-09-15T06:33:19.797Z
Heuristics for textbook selection 2017-09-06T04:17:01.783Z
Revitalizing Less Wrong seems like a lost purpose, but here are some other ideas 2016-06-12T07:38:58.557Z
Zooming your mind in and out 2015-07-06T12:30:58.509Z
Purchasing research effectively open thread 2015-01-21T12:24:22.951Z
Productivity thoughts from Matt Fallshaw 2014-08-21T05:05:11.156Z
Managing one's memory effectively 2014-06-06T17:39:10.077Z
OpenWorm and differential technological development 2014-05-19T04:47:00.042Z
System Administrator Appreciation Day - Thanks Trike! 2013-07-26T17:57:52.410Z
Existential risks open thread 2013-03-31T00:52:46.589Z
Why AI may not foom 2013-03-24T08:11:55.006Z
[Links] Brain mapping/emulation news 2013-02-21T08:17:27.931Z
Akrasia survey data analysis 2012-12-08T03:53:35.658Z
Akrasia hack survey 2012-11-30T01:09:46.757Z
Thoughts on designing policies for oneself 2012-11-28T01:27:36.337Z
Room for more funding at the Future of Humanity Institute 2012-11-16T20:45:18.580Z
Empirical claims, preference claims, and attitude claims 2012-11-15T19:41:02.955Z
Economy gossip open thread 2012-10-28T04:10:03.596Z
Passive income for dummies 2012-10-27T07:25:33.383Z
Morale management for entrepreneurs 2012-09-30T05:35:05.221Z
Could evolution have selected for moral realism? 2012-09-27T04:25:52.580Z
Personal information management 2012-09-11T11:40:53.747Z
Proposed rewrites of LW home page, about page, and FAQ 2012-08-17T22:41:57.843Z
[Link] Holistic learning ebook 2012-08-03T00:29:54.003Z
Brainstorming additional AI risk reduction ideas 2012-06-14T07:55:41.377Z
Marketplace Transactions Open Thread 2012-06-02T04:31:32.387Z
Expertise and advice 2012-05-27T01:49:25.444Z
PSA: Learn to code 2012-05-25T18:50:01.407Z
Knowledge value = knowledge quality × domain importance 2012-04-16T08:40:57.158Z
Rationality anecdotes for the homepage? 2012-04-04T06:33:32.097Z
Simple but important ideas 2012-03-21T06:59:22.043Z

Comments

Comment by John_Maxwell (John_Maxwell_IV) on Testing The Natural Abstraction Hypothesis: Project Intro · 2021-04-09T05:40:04.961Z · LW · GW

I'm glad you are thinking about this. I am very optimistic about AI alignment research along these lines. However, I'm inclined to think that the strong form of the natural abstraction hypothesis is pretty much false. Different languages and different cultures, and even different academic fields within a single culture (or different researchers within a single academic field), come up with different abstractions. See for example lsusr's posts on the color blue or the flexibility of abstract concepts. (The Whorf hypothesis might also be worth looking into.)

This is despite humans having pretty much identical cognitive architectures (and assuming we can create a de novo AGI with a cognitive architecture as similar to a human brain as human brains are to each other seems unrealistic). Perhaps you could argue that some human-generated abstractions are "natural" and others aren't, but that leaves the problem of ensuring that the human operating our AI is making use of the correct, "natural" abstractions in their own thinking. (Some ancient cultures lacked a concept of the number 0. From our perspective, and that of a superintelligent AGI, 0 is a 'natural' abstraction. But there could be ways in which the superintelligent AGI invents a 'natural' abstraction that we haven't yet invented, such that we are living in a "pre-0 culture" with respect to this abstraction, and this would cause an ontological mismatch between us and our AGI.)

But I'm still optimistic about the overall research direction. One reason is if your dataset contains human-generated artifacts, e.g. pictures with captions written in English, then many unsupervised learning methods will naturally be incentivized to learn English-language abstractions to minimize reconstruction error. (For example, if we're using self-supervised learning, our system will be incentivized to correctly predict the English-language caption beneath an image, which essentially requires the system to understand the picture in terms of English-language abstractions. This incentive would also arise for the more structured supervised learning task of image captioning, but the results might not be as robust.)

This is the natural abstraction hypothesis in action: across the sciences, we find that low-dimensional summaries of high-dimensional systems suffice for broad classes of “far-away” predictions, like the speed of a sled.

Social sciences are a notable exception here. And I think social sciences (or even humanities) may be the best model for alignment--'human values' and 'corrigibility' seem related to the subject matter of these fields.

Anyway, I had a few other comments on the rest of what you wrote, but I realized what they all boiled down to was me having a different set of abstractions in this domain than the ones you presented. So as an object lesson in how people can have different abstractions (heh), I'll describe my abstractions (as they relate to the topic of abstractions) and then explain how they relate to some of the things you wrote.

I'm thinking in terms of minimizing some sort of loss function that looks vaguely like

reconstruction_error + other_stuff

where reconstruction_error is a measure of how well we're able to recreate observed data after running it through our abstractions, and other_stuff is the part that is supposed to induce our representations to be "useful" rather than just "predictive". You keep talking about conditional independence as the be-all-end-all of abstraction, but from my perspective, it is an interesting (potentially novel!) option for the other_stuff term in the loss function, much the same way dropout was once an interesting and novel other_stuff which helped supervised learning generalize better (making neural nets "useful" rather than just "predictive" on their training set).

The most conventional choice for other_stuff would probably be some measure of the complexity of the abstraction. E.g. a clustering algorithm's complexity can be controlled through the number of centroids, or an autoencoder's complexity can be controlled through the number of latent dimensions. Marcus Hutter seems to be as enamored with compression as you are with conditional independence, to the point where he created the Hutter Prize, which offers half a million dollars to the person who can best compress a 1GB file of Wikipedia text.
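
To make that framing concrete, here's a minimal sketch of the loss I have in mind, using k-means as the abstraction learner and a penalty on the number of centroids as the other_stuff term (the specific choices here--sklearn's k-means, the complexity weight, the toy data--are just illustrative, not anything from your post):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy "low-level" data: three blobs in 2D standing in for detailed observations.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
               for c in ([0, 0], [3, 0], [0, 3])])

def loss(X, k, complexity_weight=10.0):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    reconstruction_error = km.inertia_    # squared distance to the nearest centroid
    other_stuff = complexity_weight * k   # penalize the complexity of the abstraction
    return reconstruction_error + other_stuff

for k in range(1, 7):
    print(k, round(loss(X, k), 1))  # should bottom out around k = 3 for this weighting
```

Swapping conditional independence in for the complexity penalty would just mean changing what goes into other_stuff.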

Another option for other_stuff would be denoising, as we discussed here.

You speak of an experiment to "run a reasonably-detailed low-level simulation of something realistic; see if info-at-a-distance is low-dimensional". My guess is if the other_stuff in your loss function consists only of conditional independence things, your representation won't be particularly low-dimensional--your representation will see no reason to avoid the use of 100 practically-redundant dimensions when one would do the job just as well.

Similarly, you speak of "a system which provably learns all learnable abstractions", but I'm not exactly sure what this would look like, seeing as how for pretty much any abstraction, I expect you can add a bit of junk code that marginally decreases the reconstruction error by overfitting some aspect of your training set. Or even junk code that never gets run / other functional equivalences.

The right question in my mind is how much info at a distance you can get for how many additional dimensions. There will probably be some number of dimensions N such that giving your system more than N dimensions to play with for its representation will bring diminishing returns. However, that doesn't mean the returns will go to 0, e.g. even after you have enough dimensions to implement the ideal gas law, you can probably gain a bit more predictive power by checking for wind currents in your box. See the elbow method (though, the existence of elbows isn't guaranteed a priori).
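
Here's a toy version of what I mean by diminishing-but-nonzero returns (my own construction; the numbers are arbitrary): a system with 3 real degrees of freedom observed through 20 noisy dimensions, where the residual error drops sharply up to N = 3 and only creeps down after that.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
latent = rng.normal(size=(500, 3))                        # 3 "real" degrees of freedom
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 20))   # 20 noisy observed dimensions

for n in range(1, 8):
    residual = 1 - PCA(n_components=n).fit(X).explained_variance_ratio_.sum()
    print(n, round(residual, 4))  # big drops until n = 3, tiny (but nonzero) gains after
```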

(I also think that an algorithm to "provably learn all learnable abstractions", if practical, is a hop and a skip away from a superintelligent AGI. Much of the work of science is learning the correct abstractions from data, and this algorithm sounds a lot like an uberscientist.)

Anyway, in terms of investigating convergence, I'd encourage you to think about the inductive biases induced by both your loss function and also your learning algorithm. (We already know that learning algorithms can have different inductive biases than humans, e.g. it seems that the input-output surfaces for deep neural nets aren't as biased towards smoothness as human perceptual systems, and this allows for adversarial perturbations.) You might end up proving a theorem which has required preconditions related to the loss function and/or the algorithm's inductive bias.

Another riff on this bit:

This is the natural abstraction hypothesis in action: across the sciences, we find that low-dimensional summaries of high-dimensional systems suffice for broad classes of “far-away” predictions, like the speed of a sled.

Maybe we could differentiate between the 'useful abstraction hypothesis', and the stronger 'unique abstraction hypothesis'. This statement supports the 'useful abstraction hypothesis', but the 'unique abstraction hypothesis' is the one where alignment becomes way easier because we and our AGI are using the same abstractions. (Even though I'm only a believer in the useful abstraction hypothesis, I'm still optimistic because I tend to think we can have our AGI cast a net wide enough to capture enough useful abstractions that ours are in there somewhere, and that the number of captured abstractions will be manageable enough for us to find the right ones within that net--or something vaguely like that.) In terms of science, the 'unique abstraction hypothesis' doesn't just say scientific theories can be useful, it also says there is only one 'natural' scientific theory for any given phenomenon, and the existence of competing scientific schools sorta seems to disprove this.

Anyway, the aspect of your project that I'm most optimistic about is this one:

This raises another algorithmic problem: how do we efficiently check whether a cognitive system has learned particular abstractions? Again, this doesn’t need to be fully general or arbitrarily precise. It just needs to be general enough to use as a tool for the next step.

Since I don't believe in the "unique abstraction hypothesis", checking whether a given abstraction is the right one seems important to me. The problem seems tractable, and a method that's abstract enough to work across a variety of different learning algorithms/architectures (including stuff that might get invented in the future) could be really useful.

Comment by John_Maxwell (John_Maxwell_IV) on Vim · 2021-04-08T22:47:49.001Z · LW · GW

Interesting, thanks for sharing.

I couldn't figure out how to go backwards easily.

Command-shift-g right?

Comment by John_Maxwell (John_Maxwell_IV) on Vim · 2021-04-07T23:34:54.998Z · LW · GW

After practicing Vim for a few months, I timed myself doing the Vim tutorial (vimtutor on the command line) using both Vim with the commands recommended in the tutorial, and a click-and-type editor. The click-and-type editor was significantly faster. Nowadays I just use Vim for the macros, if I want to do a particular operation repeatedly on a file.

I think if you get in the habit of double-clicking to select words and triple-clicking to select lines (triple-click and drag to select blocks of code), click-and-type editors can be pretty fast.

Comment by John_Maxwell (John_Maxwell_IV) on Theory of Knowledge (rationality outreach) · 2021-04-01T04:27:42.153Z · LW · GW

Here is one presentation for young people

Comment by John_Maxwell (John_Maxwell_IV) on Open Problems with Myopia · 2021-03-12T03:20:54.176Z · LW · GW

We present a useful toy environment for reasoning about deceptive alignment. In this environment, there is a button. Agents have two actions: to press the button or to refrain. If the agent presses the button, they get +1 reward for this episode and -10 reward next episode. One might note a similarity with the traditional marshmallow test of delayed gratification.

Are you sure that "episode" is the word you're looking for here?

https://www.quora.com/What-does-the-term-“episode”-mean-in-the-context-of-reinforcement-learning-RL

I'm especially confused because you switched to using the word "timestep" later?

Having an action which modifies the reward on a subsequent episode seems very weird. I don't even see it as being the same agent across different episodes.

Also...

Suppose instead of one button, there are two. One is labeled "STOP," and if pressed, it would end the environment but give the agent +1 reward. The other is labeled "DEFERENCE" and, if pressed, gives the previous episode's agent +10 reward but costs -1 reward for the current agent.

Suppose that an agent finds itself existing. What should it do? It might reason that since it knows it already exists, it should press the STOP button and get +1 utility. However, it might be being simulated by its past self to determine if it is allowed to exist. If this is the case, it presses the DEFERENCE button, giving its past self +10 utility and increasing the chance of its existence. This agent has been counterfactually mugged into deferring.

I think as a practical matter, the result depends entirely on the method you're using to solve the MDP and the rewards that your simulation delivers.
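
To illustrate what I mean, here's a minimal version of the button setup (my construction, not yours), showing that the "right" answer flips depending on whether the learner is credited with its own episode's reward or with the sum across episodes:

```python
def run(policy_presses, n_episodes=10):
    rewards, penalty_next = [], 0.0
    for _ in range(n_episodes):
        r = penalty_next          # the -10 from last episode's press arrives here
        penalty_next = 0.0
        if policy_presses:
            r += 1.0              # +1 now...
            penalty_next = -10.0  # ...-10 delivered to the *next* episode
        rewards.append(r)
    return rewards

for presses in (True, False):
    rs = run(presses)
    print(presses, "first-episode reward:", rs[0], "total across episodes:", sum(rs))
# Scored per episode, pressing looks good (+1 vs 0); scored across episodes, it's terrible.
```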

Comment by John_Maxwell (John_Maxwell_IV) on Borasko's Shortform · 2021-03-10T09:13:10.398Z · LW · GW

lsusr had an interesting idea of creating a new Youtube account and explicitly training the recommendation system to recommend particular videos (in his case, music): https://www.lesswrong.com/posts/wQnJ4ZBEbwE9BwCa3/personal-experiment-one-year-without-junk-media

I guess you could also do it for Youtube channels which are informative & entertaining, e.g. CGP Grey and Kings & Generals. I believe studies have found that laughter tends to be rejuvenating, so optimizing for videos you think are funny is another idea.

Comment by John_Maxwell (John_Maxwell_IV) on Willa's Shortform · 2021-03-10T09:08:35.058Z · LW · GW

I suspect you will be most successful at this if you get in the habit of taking breaks away from your computer when you inevitably start to flag mentally. Some that have worked for me include: going for a walk, talking to friends, taking a nap, reading a magazine, juggling, noodling on a guitar, or just daydreaming.

Comment by John_Maxwell (John_Maxwell_IV) on A Semitechnical Introductory Dialogue on Solomonoff Induction · 2021-03-05T07:00:14.838Z · LW · GW

...When we can state code that would solve the problem given a hypercomputer, we have become less confused. Once we have the unbounded solution we understand, in some basic sense, the kind of work we are trying to perform, and then we can try to figure out how to do it efficiently.

ASHLEY: Which may well require new insights into the structure of the problem, or even a conceptual revolution in how we imagine the work we're trying to do.

I'm not convinced your chess example, where the practical solution resembles the hypercomputer one, is representative. One way to sort a list using a hypercomputer is to try every possible permutation of the list until we discover one which is sorted. I tend to see Solomonoff induction as being cartoonishly wasteful in a similar way.
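
Concretely (my own toy, not from the dialogue):

```python
from itertools import permutations

def hypercomputer_sort(xs):
    # Try every ordering and return the one that happens to be sorted.
    # Correct, and cartoonishly wasteful: O(n!) work for an O(n log n) problem.
    for perm in permutations(xs):
        if all(a <= b for a, b in zip(perm, perm[1:])):
            return list(perm)

print(hypercomputer_sort([3, 1, 4, 1, 5, 9, 2, 6]))  # [1, 1, 2, 3, 4, 5, 6, 9]
```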

Comment by John_Maxwell (John_Maxwell_IV) on Why GPT wants to mesa-optimize & how we might change this · 2021-03-04T20:01:17.970Z · LW · GW

From a safety standpoint, hoping and praying that SGD won't stumble across lookahead doesn't seem very robust, if lookahead represents a way to improve performance. I imagine that whether SGD stumbles across lookahead will end up depending on complicated details of the loss surface that's being traversed.

Comment by John_Maxwell (John_Maxwell_IV) on John_Maxwell's Shortform · 2021-03-04T09:00:41.587Z · LW · GW

Lately I've been examining the activities I do to relax and how they might be improved. If you haven't given much thought to this topic, Meaningful Rest is excellent background reading.

An interesting source of info for me has been lsusr's posts on cutting out junk media: 1, 2, 3. Although I find lsusr's posts inspiring, I'm not sure I want to pursue the same approach myself. lsusr says: "The harder a medium is to consume (or create, as applicable) the smarter it makes me." They responded to this by cutting all the easy-to-consume media out of their life.

But when I relax, I don't necessarily want to do something hard. I want to do something which rejuvenates me. (See "Meaningful Rest" post linked previously.)

lsusr's example is inspiring in that it seems they got themselves studying things like quantum field theory for fun in their spare time. But they also noted that "my productivity at work remains unchanged", and ended up abandoning the experiment 9 months in "due to multiple changes in my life circumstances". Personally, when I choose to work on something, I usually expect it to be at least 100x as good a use of my time as random productive-seeming stuff like studying quantum field theory. So given a choice, I'd often rather my breaks rejuvenate me a bit more per minute of relaxation, so I can put more time and effort into my 100x tasks, than have the break be slightly useful on its own.

To adopt a different frame... I'm a fan of the wanting/liking/approving framework from this post.

  • In some sense, +wanting breaks are easy to engage in because they don't require willpower to get yourself to do them. But +wanting breaks also tend to be compulsive, and that makes them less rejuvenating (example: arguing online).

  • My point above is that I should mostly ignore the +approving or -approving factor in terms of the break's non-rejuvenating, external effects.

  • It seems like the ideal break is +liking, and enough +wanting that it doesn't require willpower to get myself to do it, and once I get started I can disconnect for hours and be totally engrossed, but not so +wanting that I will be tempted to do it when I should be working or keep doing it late into the night. I think playing the game Civilization might actually meet these criteria for me? I'm not as hooked on it as I used to be, but I still find it easy to get engrossed for hours.

Interested to hear if anyone else wants to share their thinking around this or give examples of breaks which meet the above criteria.

Comment by John_Maxwell (John_Maxwell_IV) on Weighted Voting Delenda Est · 2021-03-04T08:55:32.001Z · LW · GW

Good to know! I was thinking the application process would be very transparent and non-demanding, but maybe it's better to ditch it altogether.

Comment by John_Maxwell (John_Maxwell_IV) on John_Maxwell's Shortform · 2021-03-04T08:35:52.171Z · LW · GW

Related to the earlier discussion of weighted voting allegedly facilitating groupthink: https://www.lesswrong.com/posts/kxhmiBJs6xBxjEjP7/weighted-voting-delenda-est

An interesting litmus test for groupthink might be: What has LW changed its collective mind about? By that I mean: the topic was discussed on LW, there was a particular position on the issue that was held by the majority of users, new evidence/arguments came in, and now there's a different position which is held by the majority of users. I'm a bit concerned that nothing comes to mind which meets these criteria? I'm not sure it has much to do with weighted voting because I can't think of anything from LW 1.0 either.

Comment by John_Maxwell (John_Maxwell_IV) on Weighted Voting Delenda Est · 2021-03-04T08:30:02.666Z · LW · GW

Makes sense, thanks.

Comment by John_Maxwell (John_Maxwell_IV) on Weighted Voting Delenda Est · 2021-03-03T21:15:47.185Z · LW · GW

For whatever it's worth, I believe I was the first to propose weighted voting on LW, and I've come to agree with Czynski that this is a big downside. Not necessarily enough to outweigh the upsides, and probably insufficient to account for all the things Czynski dislikes about LW, but I'm embarrassed that I didn't foresee it as a potential problem. If I was starting a new forum today, I think I'd experiment with no voting at all -- maybe try achieving quality control by having an application process for new users? Does anyone have thoughts about that?

Comment by John_Maxwell (John_Maxwell_IV) on Takeaways from one year of lockdown · 2021-03-03T20:43:08.190Z · LW · GW

Another possible AI parallel: Some people undergo a positive feedback loop where more despair leads to less creativity, less creativity leads to less problem-solving ability (e.g. P100 thing), less problem-solving ability leads to a belief that the problem is impossible, and a belief that the problem is impossible leads to more despair.

Comment by John_Maxwell (John_Maxwell_IV) on Book review: The Geography of Thought · 2021-03-03T09:27:43.487Z · LW · GW

China's government is more involved in large-scale businesses.

According to the World Economic Forum website:

China is home to 109 corporations listed on the Fortune Global 500 - but only 15% of those are privately owned.

https://www.weforum.org/agenda/2019/05/why-chinas-state-owned-companies-still-have-a-key-role-to-play/

Comment by John_Maxwell (John_Maxwell_IV) on Tournesol, YouTube and AI Risk · 2021-02-13T22:29:38.588Z · LW · GW

Like, maybe depending on the viewer history, the best video to polarize the person is different, and the algorithm could learn that. If you follow that line of reasoning, the system starts to make better and better models of human behavior and how to influence them, without having to "jump out of the system" as you say.

Makes sense.

...there's a lot of content on YouTube about YouTube, so it could become "self-aware" in the sense of understanding the system in which it is embedded.

I think it might be useful to distinguish between being aware of oneself in a literal sense, and the term "self-aware" as it is used colloquially / the connotations the term sneaks in.

Some animals, if put in front of a mirror, will understand that there is some kind of moving animalish thing in front of them. The ones that pass the mirror test are the ones that realize that moving animalish thing is them.

There is a lot of content on YouTube about YouTube, so the system will likely become aware of itself in a literal sense. That's not the same as our colloquial notion of "self-awareness".

IMO, it'd be useful to understand the circumstances under which the first one leads to the second one.

My guess is that it works something like this. In order to survive and reproduce, evolution has endowed most animals with an inborn sense of self, to achieve self-preservation. (This sense of self isn't necessary for cognition--if you trip on psychedelics and experience ego death, your brain can still think. Occasionally people will hurt themselves in this state since their self-preservation instincts aren't functioning as normal.)

Colloquial "self-awareness" occurs when an animal looking in the mirror realizes that the thing in the mirror and its inborn sense of self are actually the same thing. Similar to Benjamin Franklin realizing that lightning and electricity are actually the same thing.

If this story is correct, we need not worry much about the average ML system developing "self-awareness" in the colloquial sense, since we aren't planning to endow it with an inborn sense of self.

That doesn't necessarily mean I think Predict-O-Matic is totally safe. See this post I wrote for instance.

Comment by John_Maxwell (John_Maxwell_IV) on Tournesol, YouTube and AI Risk · 2021-02-13T10:47:42.604Z · LW · GW

I suspect the best way to think about the polarizing political content thing which is going on right now is something like: The algorithm knows that if it recommends some polarizing political stuff, there's some chance you will head down a rabbit hole and watch a bunch more vids. So in terms of maximizing your expected watch time, recommending polarizing political stuff is a good bet. "Jumping out of the system" and noticing that recommending polarizing videos also polarizes society as a whole and gets them to spend more time on Youtube on a macro level seems to require a different sort of reasoning.

For the stock thing, I think it depends on how the system is scored. When training a supervised machine learning model, we score potential models based on how well they predict past data -- data the model itself has no way to affect (except if something really weird is going on?). There doesn't seem to be much incentive to select a model that makes self-fulfilling prophecies. A model which ignores the impact of its "prophecies" will score better, insofar as the prophecy would've affected the outcome.

I'm not necessarily saying there isn't a concern here, I just think step 1 is to characterize the problem precisely.

Comment by John_Maxwell (John_Maxwell_IV) on Making Vaccine · 2021-02-13T03:06:48.860Z · LW · GW

Fixed twitter link

Comment by John_Maxwell_IV on [deleted post] 2021-01-26T00:01:25.724Z

Not sure if this answers, but the book Superforecasting explains, among other things, that probabilistic thinkers tend to make better forecasts.

Comment by John_Maxwell (John_Maxwell_IV) on A few thought on the inner ring · 2021-01-24T22:57:30.149Z · LW · GW

Yes, I didn't say "they are not considering that hypothesis", I am saying "they don't want to consider that hypothesis". Those do indeed imply very different actions. I think one gives very naturally rise to producing counterarguments, the other one does not.

They don't want to consider the hypothesis, and that's why they'll spend a bunch of time carefully considering it and trying to figure out why it is flawed?

In any case... Assuming the Twitter discussion is accurate, some people working on AGI have already thought about the "alignment is hard" position (since those expositions are how they came to work on AGI). But they don't think the "alignment is hard" position is correct -- it would be kinda dumb to work on AGI carelessly if you thought that position is correct. So it seems to be a matter of considering the position and deciding it is incorrect.

I am not really sure what you mean by the second paragraph. AI is being actively regulated, and there are very active lobbying efforts on behalf of the big technology companies, producing large volumes of arguments for why AI is nothing you have to worry about.

That's interesting, but it doesn't seem that any of the arguments they've made have reached LW or the EA Forum -- let me know if I'm wrong. Anyway I think my original point basically stands -- from the perspective of EA cause prioritization, the incentives to dismantle/refute flawed arguments for prioritizing AI safety are pretty diffuse. (True for most EA causes -- I've long maintained that people should be paid to argue for unincentivized positions.)

Comment by John_Maxwell (John_Maxwell_IV) on A few thought on the inner ring · 2021-01-24T08:20:57.322Z · LW · GW

What? What about all the people who prefer to do fun research that builds capabilities and has direct ways to make them rich, without having to consider the hypothesis that maybe they are causing harm?

If they're not considering that hypothesis, that means they're not trying to think of arguments against it. Do we disagree?

I agree if the government was seriously considering regulation of AI, the AI industry would probably lobby against this. But that's not the same question. From a PR perspective, just ignoring critics often seems to be a good strategy.

Comment by John_Maxwell (John_Maxwell_IV) on A few thought on the inner ring · 2021-01-23T20:15:18.367Z · LW · GW

There was an interesting discussion on Twitter the other day about how many AI researchers were inspired to work on AGI by AI safety arguments. Apparently they bought the "AGI is important and possible" part of the argument but not the "alignment is crazy difficult" part.

I do think the AI safety community has some unfortunate echo chamber qualities which end up filtering those people out of the discussion. This seems bad because (1) the arguments for caution might be stronger if they were developed by talking to the smartest skeptics and (2) it may be that alignment isn't crazy difficult and the people filtered out have good ideas for tackling it.

If I had extra money, I might sponsor a prize for a "why we don't need to worry about AI safety" essay contest to try & create an incentive to bridge the tribal gap. Could accomplish one or more of the following:

  • Create more cross talk between people working in AGI and people thinking about how to make it safe

  • Show that the best arguments for not needing to worry, as discovered by this essay contest, aren't very good

  • Get more mainstream AI people thinking about safety (and potentially realizing over the course of writing their essay that it needs to be prioritized)

  • Get fresh sets of eyes on AI safety problems in a way that could generate new insights

Another point here is that from a cause prioritization perspective, there's a group of people incentivized to argue that AI safety is important (anyone who gets paid to work on AI safety), but there's not really any group of people with much of an incentive to argue the reverse (that I can think of at least, let me know if you disagree). So we should expect the set of arguments which have been published to be imbalanced. A contest could help address that.

Comment by John_Maxwell (John_Maxwell_IV) on A few thought on the inner ring · 2021-01-22T14:09:55.809Z · LW · GW

In Thinking Fast and Slow, Daniel Kahneman describes an adversarial collaboration between himself and expertise researcher Gary Klein. They were originally on opposite sides of the "how much can we trust the intuitions of confident experts" question, but eventually came to agree that expert intuitions can essentially be trusted if & only if the domain has good feedback loops. So I guess that's one possible heuristic for telling apart a group of sound craftsmen from a mutual admiration society?

Comment by John_Maxwell (John_Maxwell_IV) on Thoughts on Iason Gabriel’s Artificial Intelligence, Values, and Alignment · 2021-01-15T16:04:38.067Z · LW · GW

Humans aren't fit to run the world, and there's no reason to think humans can ever be fit to run the world.

I see this argument pop up every so often. I don't find it persuasive because it presents a false choice in my view.

Our choice is not between having humans run the world and having a benevolent god run the world. Our choice is between having humans run the world, and having humans delegate the running of the world to something else (which is kind of just an indirect way of running the world).

If you think the alignment problem is hard, you probably believe that humans can't be trusted to delegate to an AI, which means we are left with either having humans run the world (something humans can't be trusted to do) or having humans build an AI to run the world (also something humans can't be trusted to do).

The best path, in my view, is to pick and choose in order to make the overall task as easy as possible. If we're having a hard time thinking of how to align an AI for a particular situation, add more human control. If we think humans are incompetent or untrustworthy in some particular circumstance, delegate to the AI in that circumstance.

It's not obvious to me that becoming wiser is difficult -- your comment is light on supporting evidence, violence seems less frequent nowadays, and it seems possible to me that becoming wiser is merely unincentivized, not difficult. (BTW, this is related to the question of how effective rationality training is.)

However, again, I see a false choice. We don't have flawless computerized wisdom at the touch of a button. The alignment problem remains unsolved. What we do have are various exotic proposals for computerized wisdom (coherent extrapolated volition, indirect normativity) which are very difficult to test. Again, insofar as you believe the problem of aligning AIs with human values is hard, you should be pessimistic about these proposals working, and (relatively) eager to shift responsibility to systems we are more familiar with (biological humans).

Let's take coherent extrapolated volition. We could try & specify some kind of exotic virtual environment where the AI can simulate idealized humans and observe their values... or we could become idealized humans. Given the knowledge of how to create a superintelligent AI, the second approach seems more robust to me. Both approaches require us to nail down what we mean by an "idealized human", but the second approach does not include the added complication+difficulty of specifying a virtual environment, and has a flesh and blood "human in the loop" observing the process at every step, able to course correct if things seem to be going wrong.

The best overall approach might be a committee of ordinary humans, morally enhanced humans, and morally enhanced ems of some sort, where the AI only acts when all three parties agree on something (perhaps also preventing the parties from manipulating each other somehow). But anyway...

You talk about the influence of better material conditions and institutions. Fine, have the AI improve our material conditions and design better institutions. Again I see a false choice between outcomes achieved by institutions and outcomes achieved by a hypothetical aligned AI which doesn't exist. Insofar as you think alignment is hard, you should be eager to make an AI less load-bearing and institutions more load-bearing.

Maybe we can have an "institutional singularity" where we have our AI generate a bunch of proposals for institutions, then we have our most trusted institution choose from amongst those proposals, we build the institution as proposed, then have that institution choose from amongst a new batch of institution proposals until we reach a fixed point. A little exotic, but I think I've got one foot on terra firma.

Comment by John_Maxwell (John_Maxwell_IV) on The Great Karma Reckoning · 2021-01-15T14:59:35.116Z · LW · GW

We removed the historical 10x multiplier for posts that were promoted to main on LW 1.0

Are comments currently accumulating karma in the same way that toplevel posts do?

Comment by John_Maxwell (John_Maxwell_IV) on Approval Extraction Advertised as Production · 2021-01-13T11:35:40.659Z · LW · GW

When I read this essay in 2019, I remember getting the impression that approval-extracting vs production-oriented was supposed to be about the behavior of the founders, not the industry the company competes in.

Comment by John_Maxwell (John_Maxwell_IV) on Why GPT wants to mesa-optimize & how we might change this · 2021-01-12T00:03:39.014Z · LW · GW

I was using it to refer to "any inner optimizer". I think that's the standard usage but I'm not completely sure.

Comment by John_Maxwell (John_Maxwell_IV) on Why GPT wants to mesa-optimize & how we might change this · 2021-01-09T02:30:54.451Z · LW · GW

With regard to the editing text discussion, I was thinking of a really simple approach where we resample words in the text at random. Perhaps that wouldn't work great, but I do think editing has potential because it allows for more sophisticated thinking.

Let's say we want our language model to design us an aircraft. Perhaps it starts by describing the engine, and then it describes the wings. Standard autoregressive text generation (assuming no lookahead) will allow the engine design to influence the wing design (assuming the engine design is inside the context window when it's writing about the wings), but it won't allow the wing design to influence the engine design. However, if the model is allowed to edit its text, it can rethink the engine in light of the wings and rethink the wings in light of the engine until it's designed a really good aircraft.
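
The really simple approach I have in mind looks something like this (score_fn is a stand-in for scoring the whole sequence with the language model; the vocabulary and objective here are obviously toy placeholders):

```python
import random

random.seed(0)
vocab = ["the", "engine", "wing", "design", "light", "heavy", "aircraft", "efficient"]

def score_fn(tokens):
    # Placeholder: a real version would be the LM's score for the whole sequence,
    # which is what lets later words feed back into earlier ones.
    return len(set(tokens))

def edit(tokens, steps=200):
    tokens = list(tokens)
    for _ in range(steps):
        i = random.randrange(len(tokens))            # pick a random position
        proposal = tokens.copy()
        proposal[i] = random.choice(vocab)           # resample that word
        if score_fn(proposal) >= score_fn(tokens):   # keep edits that don't hurt
            tokens = proposal
    return tokens

print(edit(["the"] * 6))
```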

In particular, it would be good to figure out some way of contriving a mesa-optimization setup, such that we could measure if these fixes would prevent it or not.

Agreed. Perhaps we could generate lots of travelling salesman problem instances where the greedy approach doesn't get you something that looks like the optimal route, then try & train a GPT architecture to predict the cities in the optimal route in order?
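
Something like this for generating the training pairs (a rough sketch; brute force keeps the instances small, and the rejection criterion is just "greedy is strictly worse than optimal"):

```python
import itertools, math, random

random.seed(0)

def tour_length(order, pts):
    return sum(math.dist(pts[order[i]], pts[order[(i + 1) % len(order)]])
               for i in range(len(order)))

def greedy_tour(pts):
    unvisited, tour = set(range(1, len(pts))), [0]
    while unvisited:
        nxt = min(unvisited, key=lambda j: math.dist(pts[tour[-1]], pts[j]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

def optimal_tour(pts):  # brute force, fine for ~7 cities
    return min(([0] + list(p) for p in itertools.permutations(range(1, len(pts)))),
               key=lambda t: tour_length(t, pts))

examples = []
while len(examples) < 5:
    pts = [(random.random(), random.random()) for _ in range(7)]
    opt = optimal_tour(pts)
    if tour_length(greedy_tour(pts), pts) > tour_length(opt, pts) + 1e-9:
        examples.append((pts, opt))  # training pair: city coordinates -> optimal order

print(len(examples), "instances where greedy is suboptimal")
```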

This is an interesting quote:

...in our experience we find that lean stochastic local search techniques such as simulated annealing are often the most competitive for hard problems with little structure to exploit.

Source.

I suspect GPT will be biased towards avoiding mesa-optimization and making use of heuristics, so the best contrived mesa-optimization setup may be an optimization problem with little structure where heuristics aren't very helpful. Maybe we could focus on problems where non-heuristic methods such as branch and bound / backtracking are considered state of the art, and train the architecture to mesa-optimize by starting with easy instances and gradually moving to harder and harder ones.

Comment by John_Maxwell (John_Maxwell_IV) on Why GPT wants to mesa-optimize & how we might change this · 2020-11-28T06:11:04.868Z · LW · GW

Thanks for sharing!

Comment by John_Maxwell (John_Maxwell_IV) on The (Unofficial) Less Wrong Comment Challenge · 2020-11-14T09:06:29.689Z · LW · GW

I also felt frustrated by the lack of feedback my posts got; my response was to write this: https://www.lesswrong.com/posts/2E3fpnikKu6237AF6/the-case-for-a-bigger-audience Maybe submitting LW posts to targeted subreddits could be high impact?

LessWrong used to have a lot of comments back in the day. I wonder if part of the issue is simply that the number of posts went up, which means a bigger surface for readers to be spread across. Why did the writer/reader ratio go up? Perhaps because writing posts falls into the "endorsed" category, whereas reading/writing comments feels like "time-wasting". And as CFAR et al helped rationalists be more productive, they let activities labeled as "time-wasting" fall by the wayside. (Note that there's something rather incoherent about this: If the subject matter of the post was important enough to be worth a post, surely it is also worth reading/commenting?)

Anyway, here are the reasons why commenting falls into the "endorsed" column for me:

  • It seems neglected. See above argument.
  • I suspect people actually read comments a fair amount. I know I do. Sometimes I will skip to the comments before reading the post itself.
  • Writing a comment doesn't trigger the same "officialness" anxiety that writing a post does. I don't feel obligated to do background research, think about how my ideas should be structured, or try to anticipate potential lines of counterargument.
  • Taking this further, commenting doesn't feel like work. So it takes fewer spoons. I'm writing this comment during a pre-designated goof off period, in fact. The ideal activity is one which is high-impact yet feels like play. Commenting and brainstorming are two of the few things that fall in that category for me.

I know there was an effort to move the community from Facebook to LW recently. Maybe if we pitched LW as "just as fun as Facebook, but discussing more valuable things and adding to a searchable/taggable knowledge archive" that could lure people over? IMO the concept of "work that feels like play" is underrated in the rationalist and EA communities.

Unfortunately, even though I find it fun to write comments, I tend to get demoralized a while later when my comments don't get comment replies themselves :P So that ends up being an "endorsed" reason to avoid commenting.

Comment by John_Maxwell (John_Maxwell_IV) on Where do (did?) stable, cooperative institutions come from? · 2020-11-04T08:02:10.368Z · LW · GW

Well, death spirals can happen, but turnaround / reform can also happen. It usually needs good leadership though.

Sure, they have competitors, but what are they competing on? In terms of what's going on in the US right now, one story is that newspapers used to be nice and profitable, which created room for journalists to pursue high-minded ideals related to objectivity, fairness, investigative reporting, etc. But since Google/Craigslist took most of their ad revenue, they've had to shrink a bunch, and the new business environment leaves less room for journalists to pursue those high-minded ideals. Instead they're forced to write clickbait and/or pander to a particular ideological group to get subscriptions. Less sophisticated reporting/analysis means less sophisticated voting means less sophisticated politicians who aren't as capable of reforming whatever government department is currently most in need of reform (or, less sophisticated accountability means they do a worse job).

Comment by John_Maxwell (John_Maxwell_IV) on Where do (did?) stable, cooperative institutions come from? · 2020-11-04T04:59:43.709Z · LW · GW

Another hypothesis: Great people aren't just motivated by money. They're also motivated by things like great coworkers, interesting work, and prestige.

In the private sector, you see companies like Yahoo go into death spirals: Once good people start to leave, the quality of the coworkers goes down, the prestige of being a Yahoo employee goes down, and you have to deal with more BS instead of bold, interesting initiatives... which means fewer great people join and more leave (partially, also, because mediocre people can't identify, or don't want to hire, great people.)

This death spiral is OK in the private sector because people can just switch their search engine from Yahoo to Google if the results become bad. But there's no analogous competitive process for provisioning public sector stuff.

Good Marines get out because of bad leadership, which means bad Marines stay in and eventually get promoted to leadership positions and the cycle repeats itself.

Source

Comment by John_Maxwell (John_Maxwell_IV) on John_Maxwell's Shortform · 2020-11-03T04:18:24.638Z · LW · GW

That's possible, but I'm guessing that it's not hard for a superintelligent AI to suddenly swallow an entire system using something like gray goo.

Comment by John_Maxwell (John_Maxwell_IV) on John_Maxwell's Shortform · 2020-11-03T03:58:52.531Z · LW · GW

In this reaction to Critch's podcast, I wrote about some reasons to think that a singleton would be preferable to a multipolar scenario. Here's another rather exotic argument.

[The dark forest theory] is explained very well near the end of the science fiction novel, The Dark Forest by Liu Cixin.

...

When two [interstellar] civilizations meet, they will want to know if the other is going to be friendly or hostile. One side might act friendly, but the other side won't know if they are just faking it to put them at ease while armies are built in secret. This is called chains of suspicion. You don't know for sure what the other side's intentions are. On Earth this is resolved through communication and diplomacy. But for civilizations in different solar systems, that's not possible due to the vast distances and time between message sent and received. Bottom line is, every civilization could be a threat and it's impossible to know for sure, therefore they must be destroyed to ensure your survival.

Source. (Emphasis mine.)

Secure second strike is the ability to retaliate with your own nuclear strike if someone hits you with nukes. Secure second strike underpins mutually assured destruction. If nuclear war had a "first mover advantage", where whoever launches nukes first wins because the country that is hit with nukes is unable to retaliate, that would be much worse from a game theory perspective, because there's an incentive to be the first mover and launch a nuclear war (especially if you think your opponent might do the same).

My understanding is that the invention of nuclear submarines was helpful for secure second strike. There is so much ocean for them to hide in that it's difficult to track and eliminate all of your opponent's nuclear submarines and ensure they won't be able to hit you back.

However, in Allan Dafoe's article AI Governance: Opportunity and Theory of Impact, he mentions that AI processing of undersea sensors could increase the risk of nuclear war (presumably because it makes it harder for nuclear submarines to hide).

Point being, we don't know what the game theory of a post-AGI world looks like. And we really don't know what interstellar game theory between different AGIs looks like. ("A colonized solar system is plausibly a place where predators can see most any civilized activities of any substantial magnitude, and get to them easily if not quickly."--source.) It might be that the best strategy is for multipolar AIs to unify into a singleton anyway.

Comment by John_Maxwell (John_Maxwell_IV) on John_Maxwell's Shortform · 2020-10-31T03:16:17.952Z · LW · GW

A friend and I went on a long drive recently and listened to this podcast with Andrew Critch on ARCHES. On the way back from our drive we spent some time brainstorming solutions to the problems he outlines. Here are some notes on the podcast + some notes on our brainstorming.

In a possibly inaccurate nutshell, Critch argues that what we think of as the "alignment problem" is most likely going to get solved because there are strong economic incentives to solve it. However, Critch is skeptical of forming a singleton--he says people tend to resist that kind of concentration of power, and it will be hard for an AI team that has this as their plan to recruit team members. Critch says there is really a taxonomy of alignment problems:

  • single-single, where we have a single operator aligning a single AI with their preferences
  • single-multi, where we have a single operator aligning multiple AIs with their preferences
  • multi-single, where we have multiple operators aligning a single AI with their preferences
  • multi-multi, where we have multiple operators aligning multiple AIs with their preferences

Critch says that although there are commercial incentives to solve the single-single alignment problem, there aren't commercial incentives to solve all of the others. He thinks the real alignment failures might look like the sort of diffusion of responsibility you see when navigating bureaucracy.

I'm a bit skeptical of this perspective. For one thing, I'm not convinced commercial incentives for single-single alignment will extrapolate well to exotic scenarios such as the "malign universal prior" problem--and if hard takeoff happens then these exotic scenarios might come quickly. For another thing, although I can see why advocating a singleton would be a turnoff to the AI researchers that Critch is pitching, I feel like the question of whether to create a singleton deserves more than the <60 seconds of thought that an AI researcher having a casual conversation with Critch likely puts into their first impression. If there are commercial incentives to solve single-single alignment but not other kinds, shouldn't we prefer that single-single is the only kind which ends up being load-bearing? Why can't we form an aligned singleton and then tell it to design a mechanism by which everyone can share their preferences and control what the singleton does (democracy but with better reviews)?

I guess a big issue is the plausibility of hard takeoff, because if hard takeoff is implausible, that makes it less likely that a singleton will form under any circumstances, and it also means that exotic safety problems aren't likely to crop up as quickly. If this is Critch's worldview then I could see why he is prioritizing the problems he is prioritizing.

Anyway my friend and I spent some time brainstorming about how to solve versions of the alignment problem besides single-single. Since we haven't actually read ARCHES or much relevant literature, it's likely that much of what comes below is clueless, but it might also have new insights due to being unconstrained by existing paradigms :P

One scenario which is kind of in between multi-single and multi-multi alignment is a scenario where everyone has an AI agent which negotiates with some kind of central server on their behalf. We could turn multi-single into this scenario by telling the single AI to run internal simulations of everyone's individual AI agent, or we could turn multi-multi into this scenario if we have enough cooperation/enforcement for different people to abide by the agreements that their AI agents make with one another on their behalf.

Most of the game theory we're familiar with deals with a fairly small space of agreements it is possible to make, but it occurred to us that in an ideal world, these super smart AIs would be doing a lot of creative thinking, trying to figure out a clever way for everyone's preferences to be satisfied simultaneously. Let's assume each robot agent has a perfect model of its operator's preferences (or can acquire a perfect model as needed by querying the operator). The central server queries the agents about how much utility their operator assigns to various scenarios, or whether they prefer Scenario A to Scenario B, or something like that. And the agents can respond either truthfully or deceptively ("data poisoning"), trying to navigate towards a final agreement which is as favorable as possible for their operator. Then the central server searches the space of possible agreements in a superintelligent way and tries to find an agreement that everyone likes. (You can also imagine a distributed version of this where there is no central server and individual robot agents try to come up with a proposal that everyone likes.)

How does this compare to the scenario I mentioned above, where an aligned AI designs a mechanism and collects preferences from humans directly without any robot agent as an intermediary? The advantage of robot agents is that if everyone gets a superintelligent agent, then it is harder for individuals to gain advantage through the use of secret robot agents, so the overall result ends up being more fair. However, it arguably makes the mechanism design problem harder: If it is humans who are answering preference queries rather than superintelligent robot agents, since humans have finite intelligence, it will be harder for them to predict the strategic results of responding in various ways to preference queries, so maybe they're better off just stating their true preferences to minimize downside risk. Additionally, an FAI is probably better at mechanism design than humans. But then again, if the mechanism design for discovering fair agreements between superintelligent robot agents fails, and a single agent manages to negotiate really well on behalf of its owner's preferences, then arguably you are back in the singleton scenario. So maybe the robot agents scenario has the singleton scenario as its worst case.

I said earlier that it will be harder for humans to predict the strategic results of responding in various ways to preference queries. But we might be able to get a similar result for supersmart AI agents by making use of secret random numbers during the negotiation process to create enough uncertainty that revealing true preferences becomes the optimal strategy. (For example, you could imagine two mechanisms, one of which incentivizes strategic deception in one direction, and the other incentivizes strategic deception in the other direction; if we collect preferences and then flip a coin regarding which mechanism to use, the best strategy might be to do no deception at all.)
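
Here's a toy instantiation of that parenthetical (mechanisms and numbers invented by me): mechanism A shifts the outcome down from your report and B shifts it up, so each alone rewards misreporting, but under a fair coin flip between them the truthful report is optimal.

```python
import numpy as np

true_value = 4.0
reports = np.linspace(0, 8, 81)

def utility(outcome):
    return -(outcome - true_value) ** 2

u_A = utility(reports - 1)  # mechanism A: outcome = report - 1, so you'd over-report
u_B = utility(reports + 1)  # mechanism B: outcome = report + 1, so you'd under-report
u_coinflip = 0.5 * u_A + 0.5 * u_B

print("best report under A alone:  ", reports[np.argmax(u_A)])         # 5.0
print("best report under B alone:  ", reports[np.argmax(u_B)])         # 3.0
print("best report under coin flip:", reports[np.argmax(u_coinflip)])  # 4.0 = truth
```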

Another situation to consider is one where we don't have as much cooperation/enforcement and individual operators are empowered to refuse to abide by any agreement--let's call this "declaring war". In this world, we might prefer to overweight the preferences of more powerful players, because if everyone is weighted equally regardless of power, then the powerful players might have an incentive to declare war and get more than their share. However it's unclear how to do power estimation in an impartial way. Also, such a setup incentivizes accumulation of power.

One idea which seems like it might be helpful on first blush would be to try to invent some way of verifiably implementing particular utility functions, so competing teams could know that a particular AI will take their utility function into account. However this could be abused as follows: In the same way the game of chicken incentivizes tearing out your steering wheel so the opponent has no choice but to swerve, Team Evil could verifiably implement a particular utility function in their AI such that their AI will declare war unless competing teams verifiably implement a utility function Team Evil specifies.

Anyway looking back it doesn't seem like what I've written actually does much for the "bureaucratic diffusion of responsibility" scenario. I'd be interested to know concretely how this might occur. Maybe what we need is a mechanism for incentivizing red teaming/finding things that no one is responsible for/acquiring responsibility for them?

Comment by John_Maxwell (John_Maxwell_IV) on Babble challenge: 50 consequences of intelligent ant colonies · 2020-10-30T08:15:12.000Z · LW · GW

Last week we tried a more direct babble, on solving a problem in our lives. When I did it, I felt a bit like the tennis player trying to swing their racket the same way as when they were doing a bicep curl. I felt like I went too directly at the problem, while misunderstanding the mechanism.

Maybe a babble for "50 babble prompts that are both useful and not too direct"? :P

Seems to me that you want to gradually transition towards being able to babble about topics you don't feel very babbly about. It's the most important, most ugh-ish areas of our lives where we typically need fresh thinking the most, IMO.

Perhaps "50 ways to make it easier to babble about things that don't feel babbly"? ;)

Comment by John_Maxwell (John_Maxwell_IV) on The Darwin Game · 2020-10-15T22:52:18.689Z · LW · GW

It's a good point but in the original Darwin Game story, the opening sequence 2, 0, 2 was key to the plot.

Comment by John_Maxwell (John_Maxwell_IV) on Everything I Know About Elite America I Learned From ‘Fresh Prince’ and ‘West Wing’ · 2020-10-12T15:14:33.561Z · LW · GW

For some reason I was reminded of this post, which could be seen as being about class structure within the Effective Altruist movement.

Comment by John_Maxwell (John_Maxwell_IV) on The Darwin Game · 2020-10-10T15:39:59.027Z · LW · GW

Thanks.

Comment by John_Maxwell (John_Maxwell_IV) on The Darwin Game · 2020-10-10T12:38:39.021Z · LW · GW

Why does get_opponent_source take self as an argument?

Comment by John_Maxwell (John_Maxwell_IV) on Upside decay - why some people never get lucky · 2020-10-10T09:08:47.199Z · LW · GW

Yeah I think it's an empirical question what fraction of upside is explained by weak ties.

Paul Graham wrote this essay which identifies weak ties as one of the 2 main factors behind the success of startup hubs. He also says that "one of the most distinctive things about startup hubs is the degree to which people help one another out, with no expectation of getting anything in return".

Comment by John_Maxwell (John_Maxwell_IV) on Open & Welcome Thread – October 2020 · 2020-10-07T07:59:01.358Z · LW · GW

There hasn't been an LW survey since 2017. That's the longest we've ever gone without a survey since the first survey. Are people missing the surveys? What is the right interval to do them on, if any?

Comment by John_Maxwell (John_Maxwell_IV) on Open & Welcome Thread – October 2020 · 2020-10-07T07:55:37.975Z · LW · GW

Why not just have a comment which is a list of bullet points and keep editing it?

Comment by John_Maxwell (John_Maxwell_IV) on MikkW's Shortform · 2020-10-05T12:51:29.235Z · LW · GW

For what it's worth, I get frustrated by people not responding to my posts/comments on LW all the time. This post was my attempt at a constructive response to that frustration. I think if LW was a bit livelier I might replace all my social media use with it. I tried to do my part to make it lively by reading and leaving comments a lot for a while, but eventually gave up.

Comment by John_Maxwell (John_Maxwell_IV) on Davis_Kingsley's Shortform · 2020-10-05T12:42:41.630Z · LW · GW

In a world of distraction, focusing on something is a revolutionary act.

Comment by John_Maxwell (John_Maxwell_IV) on Postmortem to Petrov Day, 2020 · 2020-10-05T02:24:42.818Z · LW · GW

You mentioned petrov_day_admin_account, but I got a message from a user called petrovday:

Hello John_Maxwell,

You are part of a smaller group of 30 users who has been selected for the second part of this experiment. In order for the website not to go down, at least 5 of these selected users must enter their codes within 30 minutes of receiving this message, and at least 20 of these users must enter their codes within 6 hours of receiving the message. To keep the site up, please enter your codes as soon as possible. You will be asked to complete a short survey afterwards.

I saw the message more than 6 hours after it was sent and didn't read it very carefully. The possibility of phishing didn't occur to me, and I assumed that this new smaller group thing would involve entering a different code into a different page. Anyway, it was a useful lesson in being more aware of phishing attacks.

Comment by John_Maxwell (John_Maxwell_IV) on John_Maxwell's Shortform · 2020-09-30T03:27:32.093Z · LW · GW

Someone wanted to know about the outcome of my hair loss research so I thought I would quickly write up what I'm planning to try for the next year or so. No word on how well it works yet.

Most of the ideas are from this review: https://www.karger.com/Article/FullText/492035

I think this should be safer/less sketchy than the big 3 and fairly low cost, but plausibly less effective in expectation; let me know if you disagree.

Comment by John_Maxwell (John_Maxwell_IV) on Some Simple Observations Five Years After Starting Mindfulness Meditation · 2020-09-28T00:10:16.220Z · LW · GW

Those fatigue papers you recommended were a serious game-changer for me

Any chance you could link to whatever you're referring to? :)

Comment by John_Maxwell (John_Maxwell_IV) on Why GPT wants to mesa-optimize & how we might change this · 2020-09-27T05:45:54.540Z · LW · GW

Your philosophical point is interesting; I have a post in the queue about that. However I don't think it really proves what you want it to.

Having John_Maxwell in the byline makes it far more likely that I'm the author of the post.

If humans can make useful judgements re: whether this is something I wrote, vs something nostalgebraist wrote to make a point about bylines, I don't see why a language model can't do the same, in principle.

GPT is trying to be optimal at next-step prediction, and an optimal next-step predictor should not get improved by lookahead, it should already have those facts priced in to its next-step prediction.

A perfectly optimal next-step predictor would not be improved by lookahead or anything else, it's perfectly optimal. I'm talking about computational structures which might be incentivized during training when the predictor is suboptimal. (It's still going to be suboptimal after training with current technology, of course.)

In orthonormal's post they wrote:

...GPT-3's ability to write fiction is impressive- unlike GPT-2, it doesn't lose track of the plot, it has sensible things happen, it just can't plan its way to a satisfying resolution.

I'd be somewhat surprised if GPT-4 shared that last problem.

I suspect that either GPT-4 will still be unable to plan its way to a satisfying resolution, or GPT-4 will develop some kind of internal lookahead (probably not beam search, but beam search could be a useful model for understanding it) which is sufficiently general to be re-used across many different writing tasks. (Generality takes fewer parameters.) I don't know what the relative likelihoods of those possibilities are. But the whole idea of AI safety is to ask what happens if we succeed.