Safety via selection for obedience 2020-09-10T10:04:50.283Z · score: 26 (9 votes)
Safer sandboxing via collective separation 2020-09-09T19:49:13.692Z · score: 20 (4 votes)
The Future of Science 2020-07-28T02:43:37.503Z · score: 21 (6 votes)
Thiel on Progress and Stagnation 2020-07-20T20:27:59.112Z · score: 142 (61 votes)
Environments as a bottleneck in AGI development 2020-07-17T05:02:56.843Z · score: 31 (11 votes)
A space of proposals for building safe advanced AI 2020-07-10T16:58:33.566Z · score: 42 (15 votes)
Arguments against myopic training 2020-07-09T16:07:27.681Z · score: 50 (14 votes)
AGIs as collectives 2020-05-22T20:36:52.843Z · score: 20 (10 votes)
Multi-agent safety 2020-05-16T01:59:05.250Z · score: 22 (10 votes)
Competitive safety via gradated curricula 2020-05-05T18:11:08.010Z · score: 34 (9 votes)
Against strong bayesianism 2020-04-30T10:48:07.678Z · score: 49 (27 votes)
What is the alternative to intent alignment called? 2020-04-30T02:16:02.661Z · score: 10 (3 votes)
Melting democracy 2020-04-29T20:10:01.470Z · score: 26 (8 votes)
ricraz's Shortform 2020-04-26T10:42:18.494Z · score: 6 (1 votes)
What achievements have people claimed will be warning signs for AGI? 2020-04-01T10:24:12.332Z · score: 17 (7 votes)
What information, apart from the connectome, is necessary to simulate a brain? 2020-03-20T02:03:15.494Z · score: 17 (7 votes)
Characterising utopia 2020-01-02T00:00:01.268Z · score: 27 (8 votes)
Technical AGI safety research outside AI 2019-10-18T15:00:22.540Z · score: 36 (13 votes)
Seven habits towards highly effective minds 2019-09-05T23:10:01.020Z · score: 39 (10 votes)
What explanatory power does Kahneman's System 2 possess? 2019-08-12T15:23:20.197Z · score: 33 (16 votes)
Why do humans not have built-in neural i/o channels? 2019-08-08T13:09:54.072Z · score: 26 (12 votes)
Book review: The Technology Trap 2019-07-20T12:40:01.151Z · score: 30 (14 votes)
What are some of Robin Hanson's best posts? 2019-07-02T20:58:01.202Z · score: 36 (13 votes)
On alien science 2019-06-02T14:50:01.437Z · score: 46 (15 votes)
A shift in arguments for AI risk 2019-05-28T13:47:36.486Z · score: 33 (14 votes)
Would an option to publish to AF users only be a useful feature? 2019-05-20T11:04:26.150Z · score: 14 (5 votes)
Which scientific discovery was most ahead of its time? 2019-05-16T12:58:14.628Z · score: 40 (11 votes)
When is rationality useful? 2019-04-24T22:40:01.316Z · score: 29 (7 votes)
Book review: The Sleepwalkers by Arthur Koestler 2019-04-23T00:10:00.972Z · score: 75 (22 votes)
Arguments for moral indefinability 2019-02-12T10:40:01.226Z · score: 54 (18 votes)
Coherent behaviour in the real world is an incoherent concept 2019-02-11T17:00:25.665Z · score: 40 (18 votes)
Vote counting bug? 2019-01-22T15:44:48.154Z · score: 7 (2 votes)
Disentangling arguments for the importance of AI safety 2019-01-21T12:41:43.615Z · score: 127 (47 votes)
Comments on CAIS 2019-01-12T15:20:22.133Z · score: 75 (21 votes)
How democracy ends: a review and reevaluation 2018-11-27T10:50:01.130Z · score: 17 (9 votes)
On first looking into Russell's History 2018-11-08T11:20:00.935Z · score: 35 (11 votes)
Speculations on improving debating 2018-11-05T16:10:02.799Z · score: 26 (10 votes)
Implementations of immortality 2018-11-01T14:20:01.494Z · score: 21 (8 votes)
What will the long-term future of employment look like? 2018-10-24T19:58:09.320Z · score: 11 (4 votes)
Book review: 23 things they don't tell you about capitalism 2018-10-18T23:05:29.465Z · score: 19 (11 votes)
Book review: The Complacent Class 2018-10-13T19:20:05.823Z · score: 21 (9 votes)
Some cruxes on impactful alternatives to AI policy work 2018-10-10T13:35:27.497Z · score: 156 (55 votes)
A compendium of conundrums 2018-10-08T14:20:01.178Z · score: 12 (12 votes)
Thinking of the days that are no more 2018-10-06T17:00:01.208Z · score: 13 (6 votes)
The Unreasonable Effectiveness of Deep Learning 2018-09-30T15:48:46.861Z · score: 88 (28 votes)
Deep learning - deeper flaws? 2018-09-24T18:40:00.705Z · score: 43 (18 votes)
Book review: Happiness by Design 2018-09-23T04:30:00.939Z · score: 14 (6 votes)
Book review: Why we sleep 2018-09-19T22:36:19.608Z · score: 52 (25 votes)
Realism about rationality 2018-09-16T10:46:29.239Z · score: 178 (84 votes)
Is epistemic logic useful for agent foundations? 2018-05-08T23:33:44.266Z · score: 19 (6 votes)


Comment by ricraz on Environments as a bottleneck in AGI development · 2020-09-27T09:23:36.355Z · score: 4 (2 votes) · LW · GW

The fact that progress on existing environments (Go, ALE-57, etc) isn't bottlenecked by environments doesn't seem like particularly useful evidence. The question is whether we could be making much more progress towards AGI with environments that were more conducive to developing AGI. The fact that we're running out of "headline" challenges along the lines of Go and Starcraft is one reason to think that having better environments would make a big difference - although to be clear, the main focus of my post is on the coming decades, and the claim that environments are currently a bottleneck does seem much weaker.

More concretely, is it possible to construct some dataset on which our current methods would get significantly closer to AGI than they are today? I think that's plausible - e.g. perhaps we could take the linguistic corpus that GPT-3 was trained on, and carefully annotate what counts as good reasoning and what doesn't. (In some ways this is what reward modelling is trying to do - but that focuses more on alignment than capabilities.)

Or another way of putting it: suppose we gave the field of deep learning 10,000x current compute and algorithms that are 10 years ahead of today. Would people know what to apply them to, in order to get much closer to AGI? If not, this also suggests that environments will be a bottleneck unless someone focuses on them within the next decade.

Comment by ricraz on ricraz's Shortform · 2020-09-17T20:01:32.380Z · score: 4 (2 votes) · LW · GW

Greg Egan on universality:

I believe that humans have already crossed a threshold that, in a certain sense, puts us on an equal footing with any other being who has mastered abstract reasoning. There’s a notion in computing science of “Turing completeness”, which says that once a computer can perform a set of quite basic operations, it can be programmed to do absolutely any calculation that any other computer can do. Other computers might be faster, or have more memory, or have multiple processors running at the same time, but my 1988 Amiga 500 really could be programmed to do anything my 2008 iMac can do — apart from responding to external events in real time — if only I had the patience to sit and swap floppy disks all day long. I suspect that something broadly similar applies to minds and the class of things they can understand: other beings might think faster than us, or have easy access to a greater store of facts, but underlying both mental processes will be the same basic set of general-purpose tools. So if we ever did encounter those billion-year-old aliens, I’m sure they’d have plenty to tell us that we didn’t yet know — but given enough patience, and a very large notebook, I believe we’d still be able to come to grips with whatever they had to say.

Comment by ricraz on Coherent behaviour in the real world is an incoherent concept · 2020-09-09T06:46:35.755Z · score: 4 (2 votes) · LW · GW

What is missing here is an argument that the VNM theorem does have important implications in settings where its assumptions are not true. Nobody has made this argument. I agree it's suggestive, but that's very far from demonstrating that AGIs will necessarily be ruthlessly maximising some simple utility function.

"obviously we don't expect a superintelligent AI to be predictably stupid in the way Eliezer lines out"

Eliezer argued that superintelligences will have certain types of goals, because of the VNM theorem. If they have different types of goals, then behaviour which violates VNM is no longer "predictably stupid". For example, if I have a deontological goal, then maybe violating VNM is the best strategy.

Comment by ricraz on ricraz's Shortform · 2020-08-26T15:47:40.572Z · score: 5 (2 votes) · LW · GW

It feels partly like an incentives problem, but also I think a lot of people around here are altruistic and truth-seeking and just don't realise that there are much more effective ways to contribute to community epistemics than standard blog posts.

I think that most LW discussion is at the level where "paying for mistakes" wouldn't be that helpful, since a lot of it is fuzzy. Probably the thing we need first are more reference posts that distill a range of discussion into key concepts, and place that in the wider intellectual context. Then we can get more empirical. (Although I feel pretty biased on this point, because my own style of learning about things is very top-down). I guess to encourage this, we could add a "reference" section for posts that aim to distill ongoing debates on LW.

In some cases you can get a lot of "cheap" credit by taking other people's ideas and writing a definitive version of them aimed at more mainstream audiences. For ideas that are really worth spreading, that seems useful.

Comment by ricraz on ricraz's Shortform · 2020-08-26T06:53:44.830Z · score: 4 (3 votes) · LW · GW

I wanted to register that I don't like "babble and prune" as a model of intellectual development. I think intellectual development actually looks more like:

1. Babble

2. Prune

3. Extensive scholarship

4. More pruning

5. Distilling scholarship to form common knowledge

And that my main criticism is the lack of 3 and 5, not the lack of 2 or 4.

I also note that: a) these steps get monotonically harder, so that focusing on the first two misses *almost all* the work; b) maybe I'm being too harsh on the babble and prune framework because it's so thematically appropriate for me to dunk on it here; I'm not sure if your use of the terminology actually reveals a substantive disagreement.

Comment by ricraz on ricraz's Shortform · 2020-08-26T06:39:58.208Z · score: 4 (2 votes) · LW · GW

"If we accept fewer ideas / hold them much more provisionally, but provide a clear path to having an idea be widely held as true, that creates an incentive for people to try & jump through hoops--and this incentive is a positive one, not a punishment-driven browbeating incentive."

Hmm, it sounds like we agree on the solution but are emphasising different parts of it. For me, the question is: who's this "we" that should accept fewer ideas? It's the set of people who agree with my argument that you shouldn't believe things which haven't been fleshed out very much. But the easiest way to add people to that set is just to make the argument, which is what I've done. Specifically, note that I'm not criticising anyone for producing posts that are short and speculative: I'm criticising the people who update too much on those posts.

Comment by ricraz on Mathematical Inconsistency in Solomonoff Induction? · 2020-08-26T06:28:37.270Z · score: 2 (1 votes) · LW · GW

"That is, you could say something like "It's the list of all primes OR the list of all squares. Compressed data: first number is zero""

Just to clarify here (because it took me a couple of seconds): you only need the first number of the compressed data because that is sufficient to distinguish whether you have a list of primes or a list of squares. But as Pongo said, you could describe that same list in a much more compressed way by skipping the irrelevant half of the OR statement.
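To make the disambiguation point concrete, here is a minimal sketch (the names and structure are mine, not from the thread): the "OR" hypothesis as a program whose first input bit selects which branch actually generated the data. The OR-program has to pay for both generators plus the disambiguating bit, which is why skipping the irrelevant half gives a shorter description.

```python
# Sketch of the "primes OR squares" hypothesis as a program. One extra bit of
# "compressed data" selects which branch of the OR generated the observations.
from itertools import count, islice

def primes():
    """Generate all primes by trial division."""
    for n in count(2):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            yield n

def squares():
    """Generate all perfect squares."""
    for n in count(1):
        yield n * n

def decode_or_hypothesis(first_bit):
    """The disambiguating bit picks the branch: 0 -> primes, 1 -> squares."""
    return primes() if first_bit == 0 else squares()

print(list(islice(decode_or_hypothesis(0), 5)))  # [2, 3, 5, 7, 11]
print(list(islice(decode_or_hypothesis(1), 5)))  # [1, 4, 9, 16, 25]
```

The decoder's description length includes both generators, so a program containing only the relevant generator is strictly shorter, matching Pongo's point about dropping the irrelevant half of the OR.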

Comment by ricraz on Mathematical Inconsistency in Solomonoff Induction? · 2020-08-25T17:40:36.928Z · score: 9 (6 votes) · LW · GW

My understanding is that a hypothesis is a program which generates a complete prediction of all observations. So there is no specific hypothesis (X OR Y), for the same reason that there is no sequence of numbers which is (list of all primes OR list of all squares).

Note that by "complete prediction of all observations" I don't mean things like "tomorrow you'll see a blackbird", but rather the sense that you get an observation in a MDP or POMDP. If you imagine watching the world through a screen with a given frame rate, every hypothesis has to predict every single pixel of that screen, for each frame.

I don't know where this is explained properly though. In fact I think a proper explanation, which explains how these idealised "hypotheses" relate to hypotheses in the human sense, would basically need to explain what thinking is and also solve the entire embedded agency agenda. For that reason, I place very little weight on claims linking Solomonoff induction to bounded human or AI reasoning.

Comment by ricraz on ricraz's Shortform · 2020-08-23T10:05:22.149Z · score: 5 (2 votes) · LW · GW

Also, I liked your blog post! More generally, I strongly encourage bloggers to have a "best of" page, or something that directs people to good posts. I'd be keen to read more of your posts but have no idea where to start.

Comment by ricraz on ricraz's Shortform · 2020-08-23T09:44:15.112Z · score: 3 (2 votes) · LW · GW

Thanks, these links seem great! I think this is a good (if slightly harsh) way of making a similar point to mine:

"I find that autodidacts who haven’t experienced institutional R&D environments have a self-congratulatory low threshold for what they count as research. It’s a bit like vanity publishing or fan fiction. This mismatch doesn’t exist as much in indie art, consulting, game dev etc"

Comment by ricraz on ricraz's Shortform · 2020-08-23T06:28:00.026Z · score: 7 (3 votes) · LW · GW

I feel like this comment isn't critiquing a position I actually hold. For example, I don't believe that "the correct next move is for e.g. Eliezer and Paul to debate for 1000 hours". I am happy for people to work towards building evidence for their hypotheses in many ways, including fleshing out details, engaging with existing literature, experimentation, and operationalisation.

Perhaps this makes "proven claim" a misleading phrase to use. Perhaps more accurate to say: "one fully fleshed out theory is more valuable than a dozen intuitively compelling ideas". But having said that, I doubt that it's possible to fully flesh out a theory like simulacra levels without engaging with a bunch of academic literature and then making predictions.

I also agree with Raemon's response below.

Comment by ricraz on ricraz's Shortform · 2020-08-23T06:18:40.815Z · score: 18 (6 votes) · LW · GW

"In general when we do intellectual work we have excellent epistemic standards, capable of listening to all sorts of evidence that other communities and fields would throw out, and listening to subtler evidence than most scientists ("faster than science")"

"Being more openminded about what evidence to listen to" seems like a way in which we have lower epistemic standards than scientists, and also that's beneficial. It doesn't rebut my claim that there are some ways in which we have lower epistemic standards than many academic communities, and that's harmful.

In particular, the relevant question for me is: why doesn't LW have more depth? Sure, more depth requires more work, but on the timeframe of several years, and hundreds or thousands of contributors, it seems viable. And I'm proposing, as a hypothesis, that LW doesn't have enough depth because people don't care enough about depth - they're willing to accept ideas even before they've been explored in depth. If this explanation is correct, then it seems accurate to call it a problem with our epistemic standards - specifically, the standard of requiring (and rewarding) deep investigation and scholarship.

Comment by ricraz on ricraz's Shortform · 2020-08-22T06:32:07.112Z · score: 2 (1 votes) · LW · GW

Eh, this seems a bit nitpicky. It's arbitrarily simple given a call to a randomness oracle, which in practice we can approximate pretty easily. And it's "definitionally" easy to specify as well: "the function which, at each call, returns true with 50% likelihood and false otherwise."

Comment by ricraz on ricraz's Shortform · 2020-08-22T06:28:38.901Z · score: 9 (4 votes) · LW · GW

"I see a lot of (very high quality) raw energy here that wants shaping and directing, with the use of lots of tools for coordination (e.g. better collaboration tools)."

Yepp, I agree with this. I guess our main disagreement is whether the "low epistemic standards" framing is a useful way to shape that energy. I think it is because it'll push people towards realising how little evidence they actually have for many plausible-seeming hypotheses on this website. One proven claim is worth a dozen compelling hypotheses, but LW to a first approximation only produces the latter.

When you say "there's also a lot of space for more 'details' to get fleshed out and subquestions to be cleanly answered", I find myself expecting that this will involve people who believe the hypothesis continuing to build their castle in the sky, not analysis about why it might be wrong and why it's not.

That being said, LW is very good at producing "fake frameworks". So I don't want to discourage this too much. I'm just arguing that this is a different thing from building robust knowledge about the world.

Comment by ricraz on ricraz's Shortform · 2020-08-21T14:36:50.178Z · score: 2 (1 votes) · LW · GW

I'm confused, what is Ω-karma?

Comment by ricraz on ricraz's Shortform · 2020-08-21T14:35:06.121Z · score: 3 (2 votes) · LW · GW

Maybe historians of the industrial revolution? Who grapple with really complex phenomena and large-scale patterns, like us, but unlike us use a lot of data, write a lot of thorough papers and books, and then have a lot of ongoing debate on those ideas. And then the "progress studies" crowd is an example of an online community inspired by that tradition (but still very nascent, so we'll see how it goes).

More generally I'd say we could learn to be more rigorous by looking at any scientific discipline or econ or analytic philosophy. I don't think most LW posters are in a position to put in as much effort as full-time researchers, but certainly we can push a bit in that direction.

Comment by ricraz on Search versus design · 2020-08-21T08:50:16.819Z · score: 6 (3 votes) · LW · GW

Nice post, very much the type of work I'd like to see more of. :) A few small comments:

"Why should a search process factorize its constructions? It has no need for factorization because it does not operate on the basis of abstraction layers."

I think this is incorrect - for example, "biological systems are highly modular, at multiple different scales". And I expect deep learning to construct minds which are also fairly modular. That also allows search to be more useful, because it can make changes which are comparatively isolated.

"This thread of work initially gained notoriety with Olah’s 2017 article"

I'm not sure I'd describe this work as "notorious", even if some have reservations about it.

"But there is a third option: we could automate design, making it competitive with search in terms of its effectiveness at producing powerful artificial intelligence systems, yet retaining its ability to produce comprehensible artifacts in which we can establish trust based on theories and abstraction layers."

In light of my claim that search can also produce modularity and abstraction, I suspect that this might look quite similar to what you describe as rescuing search - because search will still be doing the "construction" part of design, and then we just need a way to use the AIs we've constructed to analyse those constructions. So then I guess the key distinction is, as Paul identifies, whether the artifact works *because* of the story or not.

Comment by ricraz on ricraz's Shortform · 2020-08-21T06:58:31.031Z · score: 5 (3 votes) · LW · GW

Right, but this isn't mentioned in the post? Which seems odd. Maybe that's actually another example of the "LW mentality": why is the solid empirical research showing that 3 layers aren't enough not considered important enough to mention in a post on why 3 layers aren't enough? (Maybe because the post was time-boxed? If so that seems reasonable, but then I would hope that people comment saying "Here's a very relevant paper, why didn't you cite it?")

Comment by ricraz on ricraz's Shortform · 2020-08-21T06:28:12.725Z · score: 11 (5 votes) · LW · GW

As mentioned in my reply to Ruby, this is not a critique of the LW team, but of the LW mentality. And I should have phrased my point more carefully - "epistemic standards are too low to make any progress" is clearly too strong a claim, it's more like "epistemic standards are low enough that they're an important bottleneck to progress". But I do think there's a substantive disagreement here. Perhaps the best way to spell it out is to look at the posts you linked and see why I'm less excited about them than you are.

Of the top posts in the 2018 review, and the ones you linked (excluding AI), I'd categorise them as follows:

Interesting speculation about psychology and society, where I have no way of knowing if it's true:

  • Local Validity as a Key to Sanity and Civilization
  • The Loudest Alarm Is Probably False
  • Anti-social punishment (which is, unlike the others, at least based on one (1) study).
  • Babble
  • Intelligent social web
  • Unrolling social metacognition
  • Simulacra levels
  • Can you keep this secret?

Same as above but it's by Scott so it's a bit more rigorous and much more compelling:

  • Is Science Slowing Down?
  • The tails coming apart as a metaphor for life

Useful rationality content:

  • Toolbox-thinking and law-thinking
  • A sketch of good communication
  • Varieties of argumentative experience

Review of basic content from other fields. This seems useful for informing people on LW, but not actually indicative of intellectual progress unless we can build on them to write similar posts on things that *aren't* basic content in other fields:

  • Voting theory primer
  • Prediction markets: when do they work
  • Costly coordination mechanism of common knowledge (Note: I originally said I hadn't seen many examples of people building on these ideas, but at least for this post there seems to be a lot.)
  • Six economics misconceptions
  • Swiss political system

It's pretty striking to me how much the original sequences drew on the best academic knowledge, and how little most of the things above draw on the best academic knowledge. And there's nothing even close to the thoroughness of Luke's literature reviews.

The three things I'd like to see more of are:

1. The move of saying "Ah, this is interesting speculation about a complex topic. It seems compelling, but I don't have good ways of verifying it; I'll treat it like a plausible hypothesis which could be explored more by further work." (I interpret the thread I originally linked as me urging Wei to do this).

2. Actually doing that follow-up work. If it's an empirical hypothesis, investigating empirically. If it's a psychological hypothesis, does it apply to anyone who's not you? If it's more of a philosophical hypothesis, can you identify the underlying assumptions and the ways it might be wrong? In all cases, how does it fit into existing thought? (That'll probably take much more than a single blog post).

3. Insofar as many of these scattered plausible insights are actually related in deep ways, trying to combine them so that the next generation of LW readers doesn't have to separately learn about each of them, but can rather download a unified generative framework.

Comment by ricraz on ricraz's Shortform · 2020-08-21T05:33:10.325Z · score: 4 (3 votes) · LW · GW

For the record, I think the LW team is doing a great job. There's definitely a sense in which better infrastructure can reduce the need for high epistemic standards, but it feels like the thing I'm pointing at is more like "Many LW contributors not even realising how far away we are from being able to reliably produce and build on good ideas" (which feels like my criticism of Ben's position in his comment, so I'll respond more directly there).

Comment by ricraz on ricraz's Shortform · 2020-08-20T20:50:53.709Z · score: 5 (3 votes) · LW · GW

In the half-formed thoughts stage, I'd expect to see a lot of literature reviews, agendas laying out problems, and attempts to identify and question fundamental assumptions. I expect that (not blog-post-sized speculation) to be the hard part of the early stages of intellectual progress, and I don't see it right now.

Perhaps we can split this into technical AI safety and everything else. Above I'm mostly speaking about the "everything else" that Less Wrong wants to solve, since AI safety is now a substantial enough field that its problems need to be solved in more systematic ways.

Comment by ricraz on ricraz's Shortform · 2020-08-20T14:06:57.832Z · score: 35 (17 votes) · LW · GW

One fairly strong belief of mine is that Less Wrong's epistemic standards are not high enough to make solid intellectual progress here. So far my best effort to make that argument has been in the comment thread starting here. Looking back at that thread, I just noticed that a couple of those comments have been downvoted to negative karma. I don't think any of my comments have ever hit negative karma before; I find it particularly sad that the one time it happens is when I'm trying to explain why I think this community is failing at its key goal of cultivating better epistemics.

There's all sorts of arguments to be made here, which I don't have time to lay out in detail. But just step back for a moment. Tens or hundreds of thousands of academics are trying to figure out how the world works, spending their careers putting immense effort into reading and producing and reviewing papers. Even then, there's a massive replication crisis. And we're trying to produce reliable answers to much harder questions by, what, writing better blog posts, and hoping that a few of the best ideas stick? This is not what a desperate effort to find the truth looks like.

Comment by ricraz on Delegate a Forecast · 2020-07-29T09:52:22.805Z · score: 3 (2 votes) · LW · GW

I think the intention is that the forecasts are of continuous variables. Are you interested in the expected number of people who get covid?

Comment by ricraz on Thiel on Progress and Stagnation · 2020-07-28T13:31:56.126Z · score: 5 (4 votes) · LW · GW

Thanks, this is very useful! Agreed that they're worth including, we just decided to ship earlier at the cost of being more comprehensive. I'll add these over the next few weeks probably.

Comment by ricraz on Arguments against myopic training · 2020-07-24T06:07:56.016Z · score: 2 (1 votes) · LW · GW

Ah, makes sense. There's already a paragraph on this (starting "I should note that so far"), but I'll edit to mention it earlier.

Comment by ricraz on Competition: Amplify Rohin’s Prediction on AGI researchers & Safety Concerns · 2020-07-22T19:00:14.287Z · score: 11 (10 votes) · LW · GW

I think 1% in the next year and a half is significantly too low.

Firstly, conditioning on AGI researchers makes a pretty big difference. It rules out most mainstream AI researchers, including many of the most prominent ones who get the most media coverage. So I suspect your gut feeling about what people would say isn't taking this sufficiently into account.

Secondly, I think attributing ignorance to the outgroup is a pretty common fallacy, so you should be careful of that. I think a clear majority of AGI researchers are probably familiar with the concept of reward gaming by now, and could talk coherently about AGIs reward gaming, or manipulating humans. Maybe they couldn't give very concrete disaster scenarios, but neither can many of us.

And thirdly, once you get agreement that there are problems, you basically get "we should fix the problems first" for free. I model most AGI researchers as thinking that AGI is far enough away that we can figure out practical ways to prevent these things, like better protocols for giving feedback. So they'll agree that we should do that first, because they think that it'll happen automatically anyway.

Comment by ricraz on Arguments against myopic training · 2020-07-22T07:45:38.351Z · score: 2 (1 votes) · LW · GW

I'm also confused.

"While the overseer might very well try to determine how effective it's own actions will be at achieving long-term goals, it never evaluates how effective the model's actions will be."

Evan, do you agree that for the model to imitate the actions of the supervisor, it would be useful to mimic some of the thought processes the supervisor uses when generating those actions?

In other words, if HCH is pursuing goal X, what feature of myopic training selects for a model that is internally thinking "I'm going to try to be as close to HCH as possible in this timestep, which involves reasoning about how HCH would pursue X", versus a model that's thinking "I'm going to pursue goal X"? (To the extent these are different, which I'm still confused about).

Comment by ricraz on Environments as a bottleneck in AGI development · 2020-07-18T10:14:31.517Z · score: 4 (2 votes) · LW · GW

I endorse Steve's description as a caricature of my view, and also Rohin's comment. To flesh out my view a little more: I think that GPT-3 doing so well on language without (arguably) being able to reason is the same type of evidence as Deep Blue or AlphaGo doing well at board games without being able to reason (although significantly weaker). In both cases it suggests that just optimising for this task is not sufficient to create general intelligence. While it now seems pretty unreasonable to think that a superhuman chess AI would by default be generally intelligent, that seems not too far off what people used to think.

Now, it might be the case that the task doesn't matter very much for AGI if you "put a ton of information / inductive bias into the architecture", as Rohin puts it. But I interpret Sutton to be arguing against our ability to do so.

"We'll eventually invent a different architecture-and-learning-algorithm that is suited to reasoning"

There are two possible interpretations of this, one of which I agree with and one of which I don't. I could interpret you as saying that we'll eventually develop an architecture/learning algorithm biased towards reasoning ability - I disagree with this.

Or you could be saying that future architectures will be capable of reasoning in ways that transformers aren't, by virtue of just being generally more powerful. Which seems totally plausible to me.

Comment by ricraz on Environments as a bottleneck in AGI development · 2020-07-18T10:06:07.409Z · score: 6 (3 votes) · LW · GW

+1, I endorse this summary. I also agree that GPT-3 was an update towards the environment not mattering as much as I thought.

Your summary might be clearer if you rephrase as:

"It considers two possibilities: the "easy paths hypothesis", that many environments would incentivize AGI, and the "hard paths hypothesis", that such environments are rare."

Since "easy paths" and "hard paths" by themselves are kinda ambiguous terms - are we talking about the paths, or the hypothesis? This is probably my fault for choosing bad terminology.

Comment by ricraz on Environments as a bottleneck in AGI development · 2020-07-18T09:57:29.383Z · score: 8 (3 votes) · LW · GW

While this is a sensible point, I also think we should have a pretty high threshold for not talking about things, for a couple of reasons:

1. Safety research is in general much more dependent on having good ideas than capabilities research (because a lot of capabilities are driven by compute, and also because there are fewer of us).

2. Most of the AI people who listen to things people like us say are safety people.

3. I don't think there's enough work on safety techniques tailored to specific paths to AGI (as I discuss briefly at the end of this post).

4. It's uncooperative and gives others a bad impression of us.

So the type of thing I'd endorse not saying is "Here's one weird trick which will make the generation of random environments much easier." But something I endorse talking about is the potential importance of multi-agent environments for training AGIs, even though this is to me a central example of a "useful insight about what environment features are needed to incentivize general intelligence".

Comment by ricraz on Environments as a bottleneck in AGI development · 2020-07-17T16:05:32.285Z · score: 6 (3 votes) · LW · GW

To be precise, the argument is that elephants (or other animals in similar situations) *wouldn't* evolve to human-level intelligence. The fact that they *didn't* isn't very much information (for anthropic reasons, because if they did then it'd be them wondering why primates didn't get to elephant-level intelligence).

And then we should also consider that the elephant environment isn't a randomly-sampled environment either, but is also correlated with ours (which means we should also anthropically discount this).

Comment by ricraz on Environments as a bottleneck in AGI development · 2020-07-17T16:00:23.790Z · score: 4 (2 votes) · LW · GW
AGI wouldn't have those chicken-and-egg problems.

I like and agree with this point, and have made a small edit to the original post to reflect that. However, while I don't dispute that GPT-3 has some human-like concepts, I'm less sure about its reasoning abilities, and it's pretty plausible to me that self-supervised training on language alone plateaus before we get to a GPT-N that can reason. I'm also fairly uncertain about this, but these types of environmental difficulties are worth considering.

I'm also a bit confused about your reference to "Rich Sutton’s bitter lesson". Do you agree that Transformers learn more / better in the same environment than MLPs? That LSTMs learn more / better in the same environment than simpler RNNs?

Yes, but my point is that the *content* comes from the environment, not the architecture. We haven't tried to leverage our knowledge of language by, say, using a different transformer for each part of speech. I (and I assume Sutton) agree that we'll have increasingly powerful models, but they'll also be increasingly general - and therefore the question of whether a model with the capacity to become an AGI does so or not will depend to a significant extent on the environment.

Comment by ricraz on Six economics misconceptions of mine which I've resolved over the last few years · 2020-07-13T05:53:42.846Z · score: 4 (2 votes) · LW · GW

One thing this post makes me curious about: in the last section, you talk about the effects of price controls on people selling goods, and also jobs. Usually we think of the labor market as workers selling their labor, rather than companies selling jobs. But is there any problem with this inversion? I guess the former view is better when workers are very heterogeneous, whereas maybe in cases where the company cares less about worker quality (like minimum wage jobs) the latter is also viable.

Also, I expect service industries to in general adapt better to price controls, since worker time can often vary more continuously than other products. Although idk how helpful such generalisations are, since there will be many exceptions.

Comment by ricraz on Arguments against myopic training · 2020-07-11T09:59:48.517Z · score: 8 (4 votes) · LW · GW

I broadly agree about what our main disagreement is. Note that I've been mainly considering the case where the supervisor is more intelligent than the agent as well. The actual resolution of this will depend on what's really going on during amplification, which is a bigger topic that I'll need to think about more.

On the side disagreement (of whether looking at future states before evaluation counts as "myopic") I think I was confused when I was discussing it above and in the original article, which made my position a bit of a mess. Sorry about that; I've added a clarifying note at the top of the post, and edited the post to reflect what I actually meant. My actual response to this:

Objection 2: This sacrifices competitiveness, because now the human can't look at the medium-term consequences of actions before providing feedback.

Is that in the standard RL paradigm, we never look at the full trajectory before providing feedback in either myopic or nonmyopic training. However, in nonmyopic training this doesn't matter very much, because we can assign high or low reward to some later state in the trajectory, which then influences whether the agent learns to do the original action more or less. We can't do this in myopic training in the current paradigm, which is where the competitiveness sacrifice comes from.

E.g. my agent sends an email. Is it good or bad? In myopic training, you need to figure this out now. In nonmyopic training, you can shrug, give it 0 reward now, and then assign high or low reward to the agent when it gets a response that makes it clearer how good the email was. Then because the agent does credit assignment automatically, actions are in effect evaluated based on their medium-term consequences, although the supervisor never actually looks at future states during evaluations.

This is consistent with your position: "When I talk about myopic training vs. regular RL, I'm imagining that they have the same information available when feedback is given". However, it also raises the question of why we can't just wait until the end of the trajectory to give myopic feedback anyway. In my edits I've called this "semi-myopia". This wouldn't be as useful for nonmyopia, but I do agree that semi-myopia alleviates some competitiveness concerns, although at the cost of being more open to manipulation. The exact tradeoff here will depend on disagreement 1.
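To make the timing differences concrete, here is a toy sketch (my own illustration, with made-up numbers; the email scenario is the one from the comment above) of when feedback becomes available under the three schemes:

```python
# Toy sketch of the three feedback schemes: nonmyopic, myopic, and
# "semi-myopic". All numbers are hypothetical.

def nonmyopic_returns(rewards, gamma=0.99):
    """Standard discounted credit assignment: a reward given only when the
    email's reply arrives still propagates back to reinforce the earlier
    'send email' action."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Nonmyopic: shrug now (0 reward), then score the outcome at step 3 when
# the reply reveals whether the email was good.
delayed = nonmyopic_returns([0.0, 0.0, 0.0, 1.0])
# delayed[0] > 0: the first action inherits credit via discounting.

# Myopic (current paradigm): each action must be scored immediately, so
# the supervisor has to predict the email's consequences up front.
immediate_approval = [0.6, 0.5, 0.5, 0.9]

# Semi-myopic: wait until the trajectory ends, then score each action
# individually in hindsight - feedback is per-action, but outcome-informed.
hindsight_approval = [0.9, 0.5, 0.5, 0.9]
```

The point of the sketch is that only in the nonmyopic case does the agent's own credit assignment carry outcome information back to earlier actions; the other two schemes put that burden on the supervisor.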

Comment by ricraz on Arguments against myopic training · 2020-07-10T16:33:46.012Z · score: 2 (1 votes) · LW · GW

Why do nonmyopic agents end up power-seeking? Because the supervisor rates some states highly, and so the agent is incentivised to gain power in order to reach those states.

Why do myopic agents end up power-seeking? Because to train a competitive myopic agent, the supervisor will need to calculate how much approval they assign to actions based on how much those actions contribute to reaching valuable states. So the agent will be rewarded for taking actions which acquire it more power, since the supervisor will predict that those contribute to reaching valuable states.

(You might argue that, if the supervisor doesn't want the agent to be power-seeking, they'll only approve of actions which gain the agent more power in specified ways. But equivalently a reward function can also penalise unauthorised power-gaining, given equal ability to notice it by the supervisors in both cases.)

Comment by ricraz on Arguments against myopic training · 2020-07-10T09:46:45.434Z · score: 10 (5 votes) · LW · GW
There's a lot of ways that reward functions go wrong besides manipulation.

I'm calling them manipulative states because if the human notices that the reward function has gone wrong, they'll just change the reward they're giving. So there must be something that stops them from noticing this. But maybe it's a misleading term, and this isn't an important point, so for now I'll use "incorrectly rewarded states" instead.

I agree that if what you're worried about is manipulation in N actions, then you shouldn't let the trajectory go on for N actions before evaluating.

This isn't quite my argument. My two arguments are:

1. IF an important reason you care about myopia is to prevent agents from making N-step plans to get to incorrectly rewarded states, THEN you can't defend the competitiveness of myopia by saying that we'll just look at the whole trajectory (as you did in your original reply).

2. However, even myopically cutting off the trajectory before the agent takes N actions is insufficient to prevent the agent from making N-step plans to get to incorrectly rewarded states.

Sure, but humans are better at giving approval feedback than reward feedback. ... we just aren't very used to thinking in terms of "rewards".

Has this argument been written up anywhere? I think I kinda get what you mean by "better", but even if that's true, I don't know how to think about what the implications are. Also, I think it's false if we condition on the myopic agents actually being competitive.

My guess is that this disagreement is based on you thinking primarily about tasks where it's clear what we want the agent to do, and we just need to push it in that direction (like the ones discussed in the COACH paper). I agree that approval feedback is much more natural for this use case. But when I'm talking about competitive AGI, I'm talking about agents that can figure out novel approaches and strategies. Coming up with reward feedback that works for that is much easier than coming up with workable approval feedback, because we just don't know the values of different actions. If we do manage to train competitive myopic agents, I expect that the way we calculate the approval function is by looking at the action, predicting what outcomes it will lead to, and evaluating how good those outcomes are - which is basically just mentally calculating a reward function and converting it to a value function. But then we could just skip the "predicting" bit and actually look at the outcomes instead - i.e. making it nonmyopic.

If you have ideas for how we might supervise complex tasks like Go to a superhuman level, without assigning values to outcomes in a way that falls into the same traps as reward-based learning, or without benefiting greatly from looking at what the actual consequences are, then that would constitute a compelling argument against my position. E.g. maybe we can figure out what "good cognitive steps" are, and then reward the agent for doing those without bothering to figure out what outcomes good cognitive steps will lead to. That seems very hard, but it's the sort of thing I think you need to defend if you're going to defend myopia. (I expect Debate about which actions to take, for instance, to benefit greatly from the judge being able to refer to later outcomes of actions).

Another way of making this argument: humans very much think in terms of outcomes, and how good those outcomes are, by default. I agree that we are bad at giving step-by-step dense rewards. But the whole point of a reward function is that you don't need to do the step-by-step thing, you can mostly just focus on rewarding good outcomes, and the agent does the credit assignment itself. I picture you arguing that we'll need shaped rewards to help the agent explore, but a) we can get rid of those shaped rewards as soon as the agent has gotten off the ground, so that they don't affect long-term incentives, and b) even shaped rewards can still be quite outcome-focused (and therefore natural to think about) - e.g. +1 for killing Roshan in League of Legends.

In terms of catching and correcting mistakes in the specification, I agree that myopia forces the supervisor to keep watching the agent, which means that the supervisor is more likely to notice if they've accidentally incentivised the agent to do something bad. But whatever bad behaviour the supervisor is able to notice during myopic training, they could also notice during nonmyopic training if they were watching carefully. So perhaps myopia is useful as a commitment device to force supervisors to pay attention, but given the huge cost of calculating the likely outcomes of all actions, I doubt anyone will want to use it that way.

Comment by ricraz on Arguments against myopic training · 2020-07-10T06:23:30.467Z · score: 2 (1 votes) · LW · GW

"Why would that behavior, in particular, lead to the highest myopic reward?"

I addressed this in my original comment: "More specifically: if a myopic agent's actions A_1 to A_N manipulate the supervisor into thinking that the (N+1)th state is really amazing, and the supervisor looks at the full trajectory before assigning approval, then the supervisor will give higher approval to all of the actions A_1 to A_N, and they'll all be reinforced, which is the same thing as would happen in a nonmyopic setup if the supervisor just gave the Nth action really high reward."

Comment by ricraz on Arguments against myopic training · 2020-07-10T00:36:37.782Z · score: 4 (2 votes) · LW · GW

"Many quasi-independently predicted approval judgments must cohere into a dangerous policy."

I described how this happens in the section on manipulating humans. In short, there is no "quasi-independence" because you are still evaluating every action based on whether you think it'll lead to a fun adventure. This is exactly analogous to why the reward function you described takes over the world.

Comment by ricraz on Arguments against myopic training · 2020-07-09T22:00:24.497Z · score: 14 (4 votes) · LW · GW
[Objection 2] doesn't seem to be true -- if you want, you can collect a full trajectory to see the consequences of the actions, and then provide approval feedback on each of the actions individually when computing gradients.

It feels like we have two disagreements here. One is whether the thing you describe in this quote is "myopic" training. If you think that the core idea of myopia is that the evaluation of an action isn't based on its effects, then this is better described as nonmyopic. But if you think that the core idea of myopia is that the agent doesn't do its own credit assignment, then this is best described as myopic.

If you think, as I interpret you as saying, that the main reason myopia is useful is because it removes the incentive for agents to steer towards incorrectly high-reward states (which I'll call "manipulative" states), then you should be inclined towards the first definition. Because the approach you described above (of collecting and evaluating a full trajectory before giving feedback) means the agent still has an incentive to do multi-step manipulative plans.

More specifically: if a myopic agent's actions A_1 to A_N manipulate the supervisor into thinking that the (N+1)th state is really amazing, and the supervisor looks at the full trajectory before assigning approval, then the supervisor will give higher approval to all of the actions A_1 to A_N, and they'll all be reinforced, which is the same thing as would happen in a nonmyopic setup if the supervisor just gave the Nth action really high reward. In other words, it doesn't matter if the agent is doing its own credit assignment because the supervisor is basically doing the same credit assignment as the agent would. So if you count the approach you described above as myopic, then myopia doesn't do the thing you claim it does.

(I guess you could say that something counts as a "small" error if it only affects a few states, and so what I just described is not a small error in the approval function? But it's a small error in the *process* of generating the approval function, which is the important thing. In general I don't think counting the size of an error in terms of the number of states affected makes much sense, since you can always arbitrarily change those numbers.)

The second disagreement is about:

Most "simple" reward feedback leads to convergent instrumental subgoals, whereas approval / myopic feedback almost never does unless that's what the human says is correct.

I am kinda confused about what sort of approval feedback you're talking about. Suppose we have a simple reward function, which gives the agent more points for collecting more berries. Then the agent has lots of convergent instrumental subgoals. Okay, what about a simple approval function, which approves actions insofar as the supervisor expects them to lead to collecting more berries? Then the agent *also* learns convergent instrumental subgoals, because it learns to take whatever actions lead to collecting more berries (assuming the supervisor is right about that).

I picture you saying that the latter is not very simple, because it needs to make all these predictions about complex dependencies on future states. But that's what you need in any approval function that you want to use to train a competent agent. It seems like you're only picturing myopic feedback that doesn't actually solve the problem of figuring out which actions lead to which states - but as soon as you do, you get the same issues. It is no virtue of approval functions that most of them are safe, if none of the safe ones specify the behaviour we actually want from AIs.
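The berry example can be made concrete with a toy sketch (the numbers, and the "seize_orchard" action, are my hypothetical additions, not from the discussion):

```python
# Minimal sketch of the claim above: an approval function that scores
# actions by their predicted effect on berry collection ranks actions
# identically to the reward function's own action values, so it inherits
# the same instrumental incentives.

# Expected berries eventually collected after each action, including an
# instrumental power-gaining action that pays off only downstream.
expected_berries = {
    "pick_nearby_berry": 1.0,
    "wander": 0.2,
    "seize_orchard": 50.0,  # convergent instrumental subgoal
}

def reward_based_value(action):
    # Nonmyopic training: the agent's own credit assignment discovers
    # these values from delayed berry rewards.
    return expected_berries[action]

def approval(action):
    # Myopic training with a competent supervisor: approval is computed by
    # predicting each action's consequences for berry collection - which
    # is just mentally converting the reward function into a value function.
    return expected_berries[action]

best_by_reward = max(expected_berries, key=reward_based_value)
best_by_approval = max(expected_berries, key=approval)
```

Under these assumptions both schemes pick "seize_orchard": once the approval function actually solves the problem of which actions lead to which states, it reproduces the reward function's incentives.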

Comment by ricraz on The ground of optimization · 2020-07-01T08:34:39.024Z · score: 4 (2 votes) · LW · GW

That's weird; thanks for the catch. Fixed.

Comment by ricraz on The ground of optimization · 2020-06-28T17:27:11.780Z · score: 8 (4 votes) · LW · GW
But now every system is an optimizing system, because we can always come up with some preference ordering that explains a system as an optimizing system.

Hmmm, I'm a little uncertain about whether this is the case. E.g. suppose you have a box with a rock in it, in an otherwise empty universe. Nothing happens. You perturb the system by moving the rock outside the box. Nothing else happens in response. How would you describe this as an optimising system? (I'm assuming that we're ruling out the trivial case of a constant utility function; if not, we should analogously include the trivial case of all states being target states).

As a more general comment: I suspect that what starts to happen after you start digging into what "perturbation" means, and what counts as a small or big perturbation, is that you run into the problem that a *tiny* perturbation can transform a highly optimising system to a non-optimising system (e.g. flicking the switch to turn off the AGI). In order to quantify size of perturbations in an interesting way, you need the pre-existing concept of which subsystems are doing the optimisation.

My preferred solution to this is just to stop trying to define optimisation in terms of *outcomes*, and start defining it in terms of *computation* done by systems. E.g. a first attempt might be: an agent is an optimiser if it does planning via abstraction towards some goal. Then we can zoom in on what all these words mean, or what else we might need to include/exclude (in this case, we've ruled out evolution, so we probably need to broaden it). The broad philosophy here is that it's better to be vaguely right than precisely wrong. Unfortunately I haven't written much about this approach publicly - I briefly defend it in a comment thread on this post though.

Comment by ricraz on The ground of optimization · 2020-06-27T12:11:14.820Z · score: 14 (4 votes) · LW · GW

Two examples which I'd be interested in your comments on:

1. Consider adding a big black hole in the middle of a galaxy. Does this turn the galaxy into a system optimising for a really big black hole in the middle of the galaxy? (Credit for the example goes to Ramana Kumar).

2. Imagine that I have the goal of travelling as fast as possible. However, there is no set of states which you can point to as the "target states", since whatever state I'm in, I'll try to go even faster. This is another argument for, as I argue below, defining an optimising system in terms of increasing some utility function (rather than moving towards target states).

Comment by ricraz on The ground of optimization · 2020-06-23T20:28:46.549Z · score: 2 (1 votes) · LW · GW

It would work at least as well as the original proposal, because your utility function could just be whatever metric of "getting closer to the target states" would be used in the original proposal.

Comment by ricraz on Do Women Like Assholes? · 2020-06-23T12:51:26.580Z · score: 2 (1 votes) · LW · GW

Yes, this seems reasonable. I guess I'm curious about which of these traits is more robustly attractive. That is: assuming the ideal male protagonist is both an alpha male, and honorable and kind, would their attractiveness drop more if you removed just the "honorable and kind" bit, or just the "alpha male" bit? I suspect the latter, but that's just speculation. We might be able to get more quantitative data by seeing how many male protagonists fall into each category.

Comment by ricraz on Plausible cases for HRAD work, and locating the crux in the "realism about rationality" debate · 2020-06-22T09:58:43.622Z · score: 4 (2 votes) · LW · GW

Thanks for the post :) To be clear, I'm very excited about conceptual and deconfusion work in general, in order to come up with imprecise theories of rationality and intelligence. I guess this puts my position in world 1. The thing I'm not excited about is the prospect of getting to this final imprecise theory via doing precise technical research. In other words, I'd prefer HRAD work to draw more on cognitive science and less on maths and logic. I outline some of the intuitions behind that in this post.

Having said that, when I've critiqued HRAD work in the past, on a couple of occasions I've later realised that the criticism wasn't aimed at a crux for people actually working on it (here's my explanation of one of those cases). To some extent this is because, without a clearly-laid-out position to criticise, the critic has the difficult task of first clarifying the position then rebutting it. But I should still flag that I don't know how much HRAD researchers would actually disagree with my claims in the first paragraph.

Comment by ricraz on The ground of optimization · 2020-06-22T08:52:11.794Z · score: 18 (8 votes) · LW · GW

This seems great, I'll read and comment more thoroughly later. Two quick comments:

It didn't seem like you defined what it meant to evolve towards the target configuration set. So it seems like either you need to commit to the system actually reaching one of the target configurations to call it an optimiser, or you need some sort of metric over the configuration space to tell whether it's getting closer to or further away from the target configuration set. But if you're ranking all configurations anyway, then I'm not sure it adds anything to draw a binary distinction between target configurations and all the others. In other words, can't you keep the definition in terms of a utility function, but just add perturbations?
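One rough way to formalise the "utility function plus perturbations" suggestion (this is my own sketch of the idea, not a definition from the post under discussion):

```python
# A system counts as optimising (around `state`) if, after each
# perturbation in some class, running its dynamics recovers the original
# utility level within a bounded horizon. All names here are my own.

def is_optimising(step, utility, perturbations, state, horizon=100, tol=1e-9):
    """step: state -> state dynamics; utility: state -> float."""
    baseline = utility(state)
    for perturb in perturbations:
        s = perturb(state)
        recovered = False
        for _ in range(horizon):
            s = step(s)
            if utility(s) >= baseline - tol:
                recovered = True
                break
        if not recovered:
            return False
    return True

# Example: a 1-D system relaxing toward 0 (utility = -|x|) recovers after
# being knocked away, so it qualifies; a system drifting off does not.
relax = lambda x: 0.5 * x
drift = lambda x: x + 1.0
u = lambda x: -abs(x)
```

This keeps a full ranking over configurations rather than a binary target set, and the perturbation class does the work that "robustly evolves towards the target set" does in the original proposal.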

Also, you don't cite Dennett here, but his definition has some important similarities. In particular, he defines several different types of perturbation (such as random perturbations, adversarial perturbations, etc) and says that a system is more agentic when it can withstand more types of perturbations. Can't remember exactly where this is from - perhaps The Intentional Stance?

Comment by ricraz on Do Women Like Assholes? · 2020-06-22T08:42:49.119Z · score: 9 (3 votes) · LW · GW

Thanks for actually doing some solid data analysis instead of just speculating on the internet :) Having said that, I'll now proceed to respond by speculating on the internet. Apologies.

I suspect that it'd be very helpful to disentangle "I prefer" into "I reflectively endorse being involved with" and "I am attracted to". Right now it seems like you're using some combination of those two. But people can be more attracted to things they reflectively endorse less, and may then act inconsistently, leading to different results when you look at different evidence sources.

One way to disentangle these two is to look at porn, where it's purely about attraction and you don't need to worry about what you actually endorse. And then you see things like Fifty Shades of Grey or 365 Days being very popular with women - where (especially in the latter) the male love interest's defining trait is being a bit of an asshole.

(I think the analogous thing for men might be: reflectively endorsing dating really strong, assertive women, but in practice being more attracted to quieter, shyer women).

Comment by ricraz on I'm leaving AI alignment – you better stay · 2020-06-20T17:01:18.609Z · score: 6 (4 votes) · LW · GW

I very much appreciate your efforts both in safety research and in writing this retrospective :)

For other people who are or will be in a similar position to you: I agree that focusing on producing results immediately is a mistake. I don't think that trying to get hired immediately is a big mistake, but I do think that trying to get hired specifically at an AI alignment research organisation is very limiting, especially if you haven't taken much time to study up on ML previously.

For example, I suspect that for most people there would be very little difference in overall impact between working as a research engineer on an AI safety team straight out of university, versus working as an ML engineer somewhere else for 1-2 years then joining an AI safety team. (Ofc this depends a lot on how much you already know + how quickly you learn + how much supervision you need).

Perhaps this wouldn't suit people who only want to do theoretical stuff - but given that you say that you find implementing ML fun, I'm sad you didn't end up going down the latter route. So this is a signal boost for others: there are a lot of ways to gain ML skills and experience, no matter where you're starting from - don't just restrict yourself to starting with safety.

Comment by ricraz on An overview of 11 proposals for building safe advanced AI · 2020-06-16T20:24:00.401Z · score: 16 (5 votes) · LW · GW

But if you could always get arbitrarily high performance with long enough training, then claiming "the performance isn't high enough" is equivalent to saying "we haven't trained long enough". So it reduces to just one dimension of competitiveness, which is how steep the curve of improvement over time is on average.

For the actual reason I think it makes sense to separate these, see my other comment: you can't usually get arbitrarily high performance by training longer.

Comment by ricraz on An overview of 11 proposals for building safe advanced AI · 2020-06-16T20:20:23.165Z · score: 10 (5 votes) · LW · GW
My impression is that, for all of these proposals, however much resources you've already put into training, putting more resources into training will continue to improve performance.

I think this is incorrect. Most training setups eventually flatline, or close to it (e.g. see AlphaZero's Elo curve), and need algorithmic or other improvements to do better.