Posts

Book Review: Working With Contracts 2020-09-14T23:22:11.215Z · score: 95 (31 votes)
Egan's Theorem? 2020-09-13T17:47:01.970Z · score: 17 (6 votes)
CTWTB: Paths of Computation State 2020-09-08T20:44:08.951Z · score: 36 (9 votes)
Alignment By Default 2020-08-12T18:54:00.751Z · score: 99 (25 votes)
The Fusion Power Generator Scenario 2020-08-08T18:31:38.757Z · score: 101 (43 votes)
Infinite Data/Compute Arguments in Alignment 2020-08-04T20:21:37.310Z · score: 42 (17 votes)
Generalized Efficient Markets in Political Power 2020-08-01T04:49:32.240Z · score: 32 (12 votes)
Alignment As A Bottleneck To Usefulness Of GPT-3 2020-07-21T20:02:36.030Z · score: 93 (42 votes)
Anthropomorphizing Humans 2020-07-17T17:49:37.086Z · score: 46 (24 votes)
Mazes and Duality 2020-07-14T19:54:42.479Z · score: 49 (14 votes)
Models of Value of Learning 2020-07-07T19:08:31.785Z · score: 24 (8 votes)
High Stock Prices Make Sense Right Now 2020-07-03T20:16:53.852Z · score: 81 (33 votes)
Mediators of History 2020-06-27T19:55:48.485Z · score: 25 (9 votes)
Abstraction, Evolution and Gears 2020-06-24T17:39:42.563Z · score: 25 (7 votes)
The Indexing Problem 2020-06-22T19:11:53.626Z · score: 38 (7 votes)
High-School Algebra for Data Structures 2020-06-17T18:09:24.550Z · score: 20 (9 votes)
Causality Adds Up to Normality 2020-06-15T17:19:58.333Z · score: 12 (3 votes)
Cartesian Boundary as Abstraction Boundary 2020-06-11T17:38:18.307Z · score: 24 (6 votes)
Public Static: What is Abstraction? 2020-06-09T18:36:49.838Z · score: 65 (16 votes)
Everyday Lessons from High-Dimensional Optimization 2020-06-06T20:57:05.155Z · score: 114 (47 votes)
Speculations on the Future of Fiction Writing 2020-05-28T16:34:45.599Z · score: 34 (22 votes)
Highlights of Comparative and Evolutionary Aging 2020-05-22T17:01:30.158Z · score: 49 (18 votes)
Pointing to a Flower 2020-05-18T18:54:53.711Z · score: 51 (18 votes)
Conjecture Workshop 2020-05-15T22:41:31.984Z · score: 34 (10 votes)
Project Proposal: Gears of Aging 2020-05-09T18:47:26.468Z · score: 62 (24 votes)
Writing Causal Models Like We Write Programs 2020-05-05T18:05:38.339Z · score: 57 (20 votes)
Generalized Efficient Markets and Academia 2020-04-30T21:59:56.285Z · score: 45 (18 votes)
Motivating Abstraction-First Decision Theory 2020-04-29T17:47:31.896Z · score: 41 (12 votes)
What's Hard About Long Tails? 2020-04-23T16:32:36.266Z · score: 24 (10 votes)
Intuitions on Universal Behavior of Information at a Distance 2020-04-20T21:44:42.260Z · score: 26 (8 votes)
Integrating Hidden Variables Improves Approximation 2020-04-16T21:43:04.639Z · score: 15 (3 votes)
Noise Simplifies 2020-04-15T19:48:39.452Z · score: 24 (13 votes)
Probabilistic Blue-Eyed Islanders 2020-04-15T17:55:00.429Z · score: 26 (7 votes)
Transportation as a Constraint 2020-04-06T04:58:28.862Z · score: 144 (51 votes)
Mediation From a Distance 2020-03-20T22:02:46.545Z · score: 15 (4 votes)
Alignment as Translation 2020-03-19T21:40:01.266Z · score: 44 (15 votes)
Abstraction = Information at a Distance 2020-03-19T00:19:49.189Z · score: 26 (6 votes)
Positive Feedback -> Optimization? 2020-03-16T18:48:52.297Z · score: 19 (7 votes)
Adaptive Immune System Aging 2020-03-13T03:47:22.056Z · score: 64 (23 votes)
Please Press "Record" 2020-03-11T23:56:27.699Z · score: 49 (14 votes)
Trace README 2020-03-11T21:08:20.669Z · score: 33 (10 votes)
Name of Problem? 2020-03-09T20:15:11.760Z · score: 9 (2 votes)
The Lens, Progerias and Polycausality 2020-03-08T17:53:30.924Z · score: 62 (22 votes)
Interfaces as a Scarce Resource 2020-03-05T18:20:26.733Z · score: 131 (41 votes)
Trace: Goals and Principles 2020-02-28T23:50:12.900Z · score: 20 (5 votes)
johnswentworth's Shortform 2020-02-27T19:04:55.108Z · score: 8 (1 votes)
Value of the Long Tail 2020-02-26T17:24:28.707Z · score: 47 (18 votes)
Theory and Data as Constraints 2020-02-21T22:00:00.783Z · score: 43 (14 votes)
Exercises in Comprehensive Information Gathering 2020-02-15T17:27:19.753Z · score: 103 (47 votes)
Demons in Imperfect Search 2020-02-11T20:25:19.655Z · score: 84 (24 votes)

Comments

Comment by johnswentworth on Book Review: Working With Contracts · 2020-09-19T22:21:14.622Z · score: 2 (1 votes) · LW · GW

Thanks, that makes more sense now.

Comment by johnswentworth on Book Review: Working With Contracts · 2020-09-18T02:37:47.921Z · score: 5 (4 votes) · LW · GW

This was a super-helpful comment, thank you!

I'm surprised about the jury part, for multiple reasons. I would have expected that judges handle most contract disputes without a jury (partly because I'd expect disagreements on law/interpretation more often than fact), and that the sort of parties who want to avoid trials would usually prefer to designate some other arbitrator ahead of time for most matters anyway. What am I missing here?

Comment by johnswentworth on Book Review: Working With Contracts · 2020-09-16T16:51:08.627Z · score: 4 (2 votes) · LW · GW

That's surprising to me; I expected contracts to have a sufficiently long history that there wouldn't be any recent major innovations. In retrospect, I realize a long history alone isn't enough to assume that: mathematics is also ancient but has seen its fair share of recent-ish innovations anyway.

From Personal To Prison Gangs is probably relevant here. As society transitions from many repeated interactions between small numbers of individuals to more one-off interactions between large numbers of individuals (ultimately enabled by communication and transportation technology), we should expect more reliance on formal rules and standards. Those formal rules and standards also need to cover more people in a wider variety of situations - they need to be more general-purpose (since people themselves are less siloed than previously).

That sort of transition seems to have been particularly prevalent around the early-to-mid twentieth century.

That's the sort of heuristic which predicts this kind of fundamental shift in contract law (among many other things) around the time that we saw such a shift.

Comment by johnswentworth on Book Review: Working With Contracts · 2020-09-15T21:26:53.450Z · score: 2 (1 votes) · LW · GW

Thanks, I was hoping someone would leave a comment along these lines. Definitely helps me understand the underlying drivers better.

Comment by johnswentworth on Book Review: Working With Contracts · 2020-09-15T19:58:41.708Z · score: 5 (3 votes) · LW · GW

I agree with this heuristic in general; when I say they don't seem to be good at this, I do mean exactly that - they don't seem to be good at it. It's entirely possible that there's some underlying purpose.

That said, there are plausible reasons to expect that modern contract-writing is not yet near equilibrium.

First, modern contract law is relatively new; the Uniform Commercial Code, for instance, only came along in 1952. My impression is that older versions of contract law had a lot more use-case-specific rules, requirements for specific wording, geographic variation, etc - in short, it was less based on general principles. (And this is still the case in many other countries.) I'd expect older versions of contract law to have made it easier to write contracts which "just worked" in common use cases, but also made it harder to develop good general-purpose techniques.

Second, it took half a century for software-writing to come as far as it has, and the incentives for scalable legibility just don't seem as sharp in contracts - so it should take even longer. At the end of the day, most contracts operate in an environment where people are invested in reputations and relationships; an oversight which could be abused usually isn't, an accidental breach can usually be worked out with the counterparty, and so forth. It's not like a computer which just executes whatever code is in front of it. (And even today, plenty of software engineers do throw patches on top of patches - it just seems more commonly understood in the software world that this is bad practice.)

Comment by johnswentworth on Egan's Theorem? · 2020-09-15T16:53:07.822Z · score: 2 (1 votes) · LW · GW

Sure. In that case, it would say something like "the higher order terms should be small in places where the lower-order equation was already accurate".

Comment by johnswentworth on Book Review: Working With Contracts · 2020-09-15T16:50:56.286Z · score: 7 (4 votes) · LW · GW

That wasn't intended to be an alternative to a nested series of if-thens; it's a solution to a different problem. (The usual software engineer's solution to a nested series of if-thens is to figure out what the thing is that you actually want, which all these conditions are trying to approximate, and then write the thing you actually want instead. Of course it's more difficult than that in practice, because sometimes the thing you actually want can't be written directly due to the limitations of programming languages/contract enforceability. I would imagine that skill is quite similar for both good contract lawyers and good software engineers.)

The idea of enforced scope/modularity is to make the table of contents binding, so people with specific use-cases don't need to review the whole thing. So for instance, suppose we have some complex transaction involving both representations and covenants, and we put the representations in their own section. Post-closing, people will mostly need to double-check the contract to see what the covenants say, not what the representations say. So it would be useful to have language alongside the table of contents nullifying any post-closing covenants which appear in the "Representations" section. Then, most people reviewing the contract post-closing won't need to read through the Representations section just to be sure there aren't any covenants hiding in there.

Comment by johnswentworth on Book Review: Working With Contracts · 2020-09-15T15:07:56.097Z · score: 2 (1 votes) · LW · GW

Thanks for bringing this up. I had intended to mention it in the OP, but it didn't quite fit in anywhere, so I'm glad someone mentioned it.

Comment by johnswentworth on Egan's Theorem? · 2020-09-14T05:07:35.985Z · score: 2 (1 votes) · LW · GW

In one of the comment replies you suggested that the same set of observations could be modeled by two different models, and there should be a morphism between the two models, either directly or through a third model that is more "accurate" or "powerful" in some sense than the other two. If I knew enough category theory, I would probably be able to express it in terms of some commuting diagrams, but alas.

Yes, something like that would capture the idea, although it's not necessarily the only or best way to formulate it.

Comment by johnswentworth on Egan's Theorem? · 2020-09-14T03:58:39.830Z · score: 2 (1 votes) · LW · GW

I'd expect Turing machines to be a bad way to model this. They're inherently blackboxy; the only "structure" they make easy to work with is function composition. The sort of structures relevant here don't seem like they'd care much about function boundaries. (This is why I use models like these as my default model of computation these days.)

Anyway, yeah, I'm still not sure what the "relationship" should be, and it's hard to formulate in a way that seems to capture the core idea.

Comment by johnswentworth on Egan's Theorem? · 2020-09-13T22:17:05.108Z · score: 2 (1 votes) · LW · GW

It seems like there shouldn't be a guaranteed relationship that's much simpler than reconstructing the data and recomputing the inferred point particles.

Yeah, I'm claiming exactly the opposite of this. When the old theory itself has some simple structure (e.g. classical mechanics), there should be a guaranteed relationship that's much simpler than reconstructing the data and recomputing the inferred point particles.

One possible formulation: if I find that a terabyte of data compresses down to a gigabyte, and then I find a different model which compresses it down to 500MB, there should be a relationship between the two models which can be expressed without expanding out the whole terabyte. (Or, if there isn't such a relationship, that means the two models are capturing different patterns from the data, and there should exist another model which compresses the data more than either by capturing the patterns found by both models.)
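
A rough way to state that in symbols (just a sketch of the intended claim, not a theorem I know how to prove; M_1, M_2, D, and R are my notation for the two models, the raw data, and a "translation" program): the claim is that there exists some R with

    R(M_2) \approx M_1, \qquad |R| \ll |D|

and ideally |R| \ll |M_1| as well - i.e. the relationship between the two models is itself short relative to the data, so it can be exhibited without re-expanding the whole terabyte.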

Comment by johnswentworth on Egan's Theorem? · 2020-09-13T18:59:46.143Z · score: 2 (1 votes) · LW · GW

there is no ironclad guarantee of properties continuing

Properties continuing is not what I'm asking about. The example in the OP is relevant: even if the entire universe undergoes some kind of phase change tomorrow and the macroscopic physical laws change entirely, it would still be true that the old laws did work before the phase change, and any new theory needs to account for that in order to be complete.

nor any guarantee that there will be a simple mapping between theories

I do not know of any theorem or counterexample which actually says this. Do you?

simple properties can be expected (in a probabilistic sense) to generalize even if the model is incomplete

Similar issue to "no ironclad guarantee of properties continuing": I'm not asking about properties generalizing to other parts of the environment, I'm asking about properties generalizing to any theory or model which describes the environment.

Comment by johnswentworth on Sunday September 13, 12:00PM (PT) — talks by John Wentworth, Liron and more · 2020-09-12T19:00:36.707Z · score: 6 (3 votes) · LW · GW

Theme for my talk will be how to detect unknown unknowns.

Comment by johnswentworth on Comparative advantage and when to blow up your island · 2020-09-12T17:56:00.004Z · score: 26 (15 votes) · LW · GW

One important thing I'd include: while adding more people (i.e. more than two) creates the possibility of individuals becoming worse off, it also very quickly removes most of the incentives for strategic negotiation behavior (i.e. hiding information, faking skill, threatening to blow up islands, etc). Even with just a dozen people/islands, multiple people have to form a cartel in order to achieve a high price for a particular good, and it only takes one defector to break a cartel.

Comment by johnswentworth on What's Wrong with Social Science and How to Fix It: Reflections After Reading 2578 Papers · 2020-09-12T17:33:45.631Z · score: 4 (2 votes) · LW · GW

Which specific parts did you have in mind?

Comment by johnswentworth on Social Capital Paradoxes · 2020-09-10T23:12:03.635Z · score: 8 (4 votes) · LW · GW

Note that horizontal gene transfer is mostly random; it's not like a bacterium has much control over which new genes to absorb from its environment. Humans do have some choice over which memes to pay attention to and to spread. Conversely, humans do not have much choice over which genes/memes they inherit vertically; that happens at a time when we're too young to have much control.

Comment by johnswentworth on High Stock Prices Make Sense Right Now · 2020-09-10T16:54:30.345Z · score: 4 (2 votes) · LW · GW

That would be correct assuming that

  • the sample is in fact representative, i.e. the investor types cover the large majority of the capital in the market, and
  • investors within each type have "similar" behavior - ideally they can all be captured by a representative agent.

(We could also circumvent the need for representative agents by estimating the demand function of each investor class directly, but then with n assets we need to estimate a function from R^n to R^n rather than a function from R^n to R, so the data and computation requirements are dramatically higher. Also, at that point there aren't clear benefits to breaking out classes of investors in the first place.)

Investor types corresponding to timelines is indeed sensible; I use that a lot in my own models. For instance, I can use data on individual trades to estimate the portfolios held by market makers as a function of price.

Comment by johnswentworth on How To Fermi Model · 2020-09-09T17:36:34.216Z · score: 18 (6 votes) · LW · GW

Some scattered thoughts...

Gears

First, how does this sort of approach relate to gears-level modelling? I think the process of brainstorming reference classes is usually a process of noticing particular gears - i.e. each "reference class" is typically a category of situations which have a subset of gears in common with the problem at hand. The models within a reference class spell out the particular gears relevant to that class of problems, and then we can think about how to transport those gears back to the original problem.

With that in mind, I think step 3 is missing some fairly crucial pieces. What you really want is not to treat each of the different models generated as independent black boxes and then poll them all, but rather to figure out which of the gears within the individual models can be carried back over to the original problem, and how all those gears ultimately fit together. In the end, you do want one unified model (though in practice that's more of an ideal to strive toward than a goal which will be achieved; if you don't have enough information to distinguish the correct model from all the possibilities then obviously you shouldn't pick one just for the sake of picking one). Some example ways the gears of different models can fit together:

  • Some of the different models will contain common subcomponents. 
  • Some models will have unspecified parameters (e.g. the discount factor in financial formulas), and in the context of the original problem, the values of those parameters will be determined mainly by the components of other models. 
  • Some models will contain components whose influence is mediated by the variables of other models.

Formulas

Second, the mathematical formulas. Obviously these are "wrong", in the sense that e.g. quality_of_connections*(# people)/(# connections) is not really expressing a relationship between natural quantities in the world; at best it's a rather artificial definition of "quality of connections" which may or may not be a good proxy for what we intuitively mean by that concept. That said, it seems like these are trying to express something important, even if the full mathematical implications are not really what matters.

I think what they're trying to capture is mostly necessity and sufficiency. This line from the OP is particularly suggestive:

You can quickly check your models by looking for 0s. What happens when any given factor is set to 0 or to arbitrarily large? Does the result make sense?

For these sorts of qualitative formulas, limiting behavior (e.g. "what happens when any given factor is set to 0 or to arbitrarily large?") is usually the only content of the formula in which we put any faith. And for simple formulas like (var1*var2) or (var1 + var2), that limiting behavior mostly reduces to necessity and sufficiency - e.g. the expression quality_of_connections*(# people)/(# connections) says that both nonzero quality and nonzero person-count are necessary conditions in order to get any benefit at all.

(In some cases big-O behavior implied by these formulas may also be substantive and would capture more than just necessity and sufficiency, although I expect that people who haven't explicitly practiced thinking with big-O behavior usually won't do so instinctively.)
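
To make the "check the limits" step concrete, here's a minimal sketch in code (the formula and variable names are just the hypothetical connections example from above, not anything from the OP's actual models):

    def value_of_network(quality_of_connections, n_people, n_connections):
        # Qualitative formula from above: quality * (# people) / (# connections).
        return quality_of_connections * n_people / n_connections

    # Necessity: zeroing a factor in the numerator should kill the value entirely.
    assert value_of_network(0.0, 100, 10) == 0.0   # zero-quality connections -> no benefit
    assert value_of_network(0.5, 0, 10) == 0.0     # no people -> no benefit

    # Limiting behavior: check what happens as factors get large.
    print(value_of_network(10.0, 100, 10))     # high quality -> value grows without bound
    print(value_of_network(0.5, 100, 10_000))  # connections spread thin -> value shrinks toward 0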

Examples

Third, use of examples. I'd recommend pushing example-generation earlier in the process. Ideally, brainstorming of examples for a reference class comes before brainstorming of models. I recommend a minimum of three examples, as qualitatively different from each other as possible, in order to really hone in on the exact concept/phenomenon which you're trying to capture. This will make it much easier to see exactly what the models need to explain, will help avoid sticking on one imperfect initial model, and will make everything a lot easier to explain to both yourself and others.

Comment by johnswentworth on Why would code/English or low-abstraction/high-abstraction simplicity or brevity correspond? · 2020-09-06T15:18:14.206Z · score: 3 (2 votes) · LW · GW

The solution to the "large overhead" problem is to amortize the cost of the human simulation over a large number of English sentences and predictions. We only need to specify the simulation once, and then we can use it for any number of prediction problems in conjunction with any number of sentences. A short English sentence then adds only a small amount of marginal complexity to the program - i.e. adding one more sentence (and corresponding predictions) only adds a short string to the program.

Comment by johnswentworth on Why would code/English or low-abstraction/high-abstraction simplicity or brevity correspond? · 2020-09-05T15:59:48.601Z · score: 4 (4 votes) · LW · GW

The relevant argument is equivalence of SI on different universal Turing machines, up to a constant. Briefly: if we have a short program on machine M1 (e.g. python), then in the worst case we can write an equivalent program on M2 (e.g. LISP) by writing an M1-simulator and then using the M1-program (e.g. writing a python interpreter in LISP and then using the python program). The key thing to notice here is that the M1-simulator may be long, but its length is completely independent of what we're predicting - thus, the M2-Kolmogorov-complexity of a string is at most the M1-Kolmogorov-complexity plus a constant (where the constant is the length of the M1-simulator program).
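
In symbols, writing K_M(x) for the Kolmogorov complexity of x relative to machine M:

    K_{M_2}(x) \;\le\; K_{M_1}(x) + c_{M_1 \to M_2} \quad \text{for every string } x

where c_{M_1 -> M_2} is the length of the M1-simulator written for M2, and does not depend on x.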

Applied to English: we could simulate an English-speaking human. This would be a lot more complicated than a python interpreter, but the program length would still be constant with respect to the prediction task. Given the English sentence, the simulated human should then be able to predict anything a physical human could predict given the same English sentence. Thus, if something has a short English description, then there exists a short (up to a constant) code description which contains all the same information (i.e. can be used to predict all the same things).

Two gotchas to emphasize here:

  • The constant is big - it includes everything an English-speaking human knows, from what-trees-look-like to how-to-drive-a-car. All the hidden complexity of individual words is in that constant (or at least the hidden complexity that a human knows; things a human doesn't know wouldn't be in there).
  • The English sentence is a "program" (or part of a program), not data to be predicted; whatever we're predicting is separate from the English sentence. (This is implicit in the OP, but somebody will likely be confused by it.)

Comment by johnswentworth on Basic Inframeasure Theory · 2020-09-01T02:23:27.339Z · score: 2 (1 votes) · LW · GW

If we're imposing condition 5, then why go to all the trouble of talking about sa-measures, rather than just talking about a-measures from the start? Why do we need that extra generality?

Comment by johnswentworth on Basic Inframeasure Theory · 2020-09-01T00:27:37.508Z · score: 4 (2 votes) · LW · GW

A positive functional for  is a continuous linear function   that is nonnegative everywhere on .

I got really confused by this in conjunction with proposition 1. A few points of confusion:

  • The decomposition of  into  rather than . I'm sure this is standard somewhere, but I had to read back a ways to realize that  is negative in the constraint .
  • This does not match wikipedia's definition of a positive linear functional; that only requires that the functional be positive on the positive elements of the underlying space.
  • We seem to be talking about affine functions, not linear functions, but then Theorem 1 works around that by throwing in the constant .

Comment by johnswentworth on How to teach things well · 2020-08-29T19:18:23.967Z · score: 5 (4 votes) · LW · GW

I'm always a bit frustrated when people talk about a "knowledge graph"; the concept seems obviously useful, but also obviously incomplete. What precisely are the nodes and edges in the graph? What are the type signatures of these things?

I was thinking about this over breakfast. Here are some guesses.

One simple model is that each "node" in the graph is essentially a trigger-action plan. There's a small pattern-matcher, for example a pattern which recognizes root-finding problems with quadratic functions. When the pattern matches something, it triggers a bunch of possible connections - e.g. one connection might be a pointer to the quadratic equation, another might be a connection to polynomial factorization, etc. Each of those is itself either another node (e.g. the quadratic equation node) or, in the base-case, a simple action to take (e.g. writing some symbols on paper).
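
A minimal sketch of that node structure in code (the names, weights, and pattern functions are purely illustrative - this is just my guess at the shape of the thing, not a claim about how it's actually implemented):

    from dataclasses import dataclass, field
    from typing import Callable, List, Tuple

    @dataclass
    class Node:
        # One "node": a small pattern-matcher plus weighted connections to other
        # nodes (or, in the base case, simple actions).
        name: str
        pattern: Callable[[str], bool]
        connections: List[Tuple[float, "Node"]] = field(default_factory=list)

        def trigger(self, problem: str) -> List["Node"]:
            # If the pattern matches, return the connected nodes, highest weight first.
            if not self.pattern(problem):
                return []
            return [node for _, node in sorted(self.connections, key=lambda c: -c[0])]

    # Purely illustrative nodes for the quadratic example above.
    quadratic_formula = Node("quadratic formula", lambda p: "quadratic" in p)
    factoring = Node("polynomial factorization", lambda p: "factor" in p or "polynomial" in p)
    quadratic_roots = Node(
        "root-finding with quadratics",
        lambda p: "roots" in p and "quadratic" in p,
        [(0.7, quadratic_formula), (0.3, factoring)],
    )

    print([n.name for n in quadratic_roots.trigger("find the roots of this quadratic")])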

In this model, teaching involves a few different pieces:

  • Creation of the node itself - just giving it a name and emphasizing importance can help
  • Refining the pattern - e.g. practice recognizing root-finding problems with quadratic functions. Examples are probably the best tool here.
  • Installing the "downstream" pointers to other concepts, and "upstream" pointers from other concepts to this one. "Downstream" pointers would be things like "here's a list of tricks you can use to solve this sort of problem", "upstream" pointers would be things like "this is itself a root-finding method, so look for quadratics when you need to solve equations, and also use other equation-solving tools like adding a number to both sides".
  • Giving weight to upstream/downstream concepts - i.e. indicating which connections are more/less important, so they're properly prioritized in the list of "actions" triggered when a pattern is detected.
  • Building the habit of actually checking for the pattern, and actually triggering the "actions" when the pattern is matched. I.e. practice, preferably on a fairly wide variety of problems to minimize "out-of-distribution"-style failures.

So that's one model.

That model seems to capture a lot of useful things about procedural knowledge graphs, but it seems like there's a separate kind of knowledge graph for world-models. The part above is analogous to a program (it guides what-to-do), whereas a world-modelling knowledge graph would be analogous to the contents of a database; it's the datastructure on which the procedural knowledge graph operates. My current best model for a world-modelling knowledge graph looks something like this - it's a causal model recursively built out of reusable sub-models.

Teaching components of the world-modelling knowledge graph would involve somewhat different pieces:

  • We'd typically be teaching some prototypical submodel, a building block to use in many different places in the world-model. For instance, in introductory physics these submodels would be things like "masses" and "inclined planes" and "masses on inclined planes".
  • Teaching the submodel itself means walking through the components of the little causal subgraph the model specifies - e.g. how masses on inclined planes behave.
  • The submodel will have some pattern-matcher associated with it, for recognizing components of the real world to which the submodel applies. This means examples, to practice recognizing e.g. "masses" and "inclined planes" and "masses on inclined planes".
  • The submodel will itself have submodels, and these are the pointers out to other nodes. E.g. if there's a submodel for the prototypical mass-on-inclined-plane problem, then it should have a pointer to a "point mass" submodel. Here, the real key is the connections which are not present - e.g. the point-mass submodel doesn't care about the shape of the object in question, it's just approximated as a point.

It feels like there should be a clean way to unify the procedural and world-modelling knowledge graphs. I'm not sure what it is. I'm sure somebody will argue that it's all procedural and the world-modelling is just embedded in a bunch of procedures, but I'm not convinced; it sure feels like there's a graph of data on which the program operates. I could see it working the opposite way, though... maybe it's all world-modelling, and part of the world-model is something like "model of the best way to solve this problem", and our "procedural" behavior is actually just prediction on that part of the world model (sort of predictive-processing-esque).

Comment by johnswentworth on Investment is a useful societal mechanism for getting new things made. Stock trading shares some functionality with investment, but seems very very inefficient, at that? · 2020-08-24T02:12:23.837Z · score: 8 (5 votes) · LW · GW

I wrote a post on this a few years ago. There's a few different roles that capital markets play, but I think the big one in terms of real economic value is probably warehousing. The financial markets - stock market, bond markets, etc - provide value mainly by warehousing credit. (Here I mean credit in a fairly general sense, including any sort of expected value at a later time in exchange for funds now - e.g. stocks are included, bonds are included, futures are included, etc.)

This provides value in much the same way as warehousing grain: when there's a shortage of grain, the grain warehouses can provide grain for a little while (albeit at a higher price) to avoid starvation. When there's a grain surplus, the warehouses can buy up excess (albeit at a lower price) to avoid spoilage/waste. They smooth out the grain supply in time, and that's how they make money. Same with credit: when there's a shortage of credit, markets crash, and people/companies are desperate for cash. Those investors who were warehousing cash sell it, buying low-priced stocks/bonds/etc in exchange. When there's a surplus of credit, asset prices go back up, and the investors sell their assets off. They smooth out the credit supply in time, and that's how they make money.

Again, this isn't the only way that financial markets provide value; see the linked post for more. But I do think it's the main way.

Comment by johnswentworth on What's a Decomposable Alignment Topic? · 2020-08-22T01:43:48.176Z · score: 18 (8 votes) · LW · GW

Microscope AI in general seems like a very decomposition-friendly area. Take a trained neural net, assign each person a chunk to focus on, and everybody tries to figure out what features/algorithms/other neat stuff are embedded in their chunk.

Also should work well with a regular-meetup-group format, since the arrangement would be fairly robust to people missing a meeting, joining up midway, having completely different approaches or backgrounds, etc. Relatively open-ended, room for people to try different approaches based on what interests them and cross-pollinate strategies with the group.

Comment by johnswentworth on Radical Probabilism · 2020-08-20T23:57:57.534Z · score: 3 (2 votes) · LW · GW

Ah, I see. Made sense on a second read. Thanks.

Comment by johnswentworth on Alignment By Default · 2020-08-20T18:17:44.264Z · score: 4 (2 votes) · LW · GW

Try to clarify here, do you think the problems brought up in these answers are the main problems of alignment?

Mostly no. I've been trying to write a bit more about this topic lately; Alignment as Translation is the main source of my intuitions on core problems, and the fusion power generator scenario is an example of what that looks like in a GPT-like context (parts of your answer here are similar to that).

Using GPT-like systems to simulate alignment researchers' writing is a probably-safer use-case, but it still runs into the core catch-22. Either:

  • It writes something we'd currently write, which means no major progress (since we don't currently have solutions to the major problems and therefore can't write down such solutions), or
  • It writes something we currently wouldn't write, in which case it's out-of-distribution and we have to worry about how it's extrapolating us

I generally expect the former to mostly occur by default; the latter would require some clever prompts.

I could imagine at least some extrapolation of progress being useful, but it still seems like the best way to make human-simulators more useful is to improve our own understanding, so that we're more useful to simulate.

Given a textual description of some FAI proposal (or proposal for solving some open problem within AI safety), highlight the contiguous passage of text within the voluminous archives of AF/LW/etc. that is most likely to represent a valid objection to this proposal.

This sounds like a great tool to have. It's exactly the sort of thing which is probably marginally useful. It's unlikely to help much on the big core problems; it wouldn't be much use for identifying unknown unknowns which nobody has written about before. But it would very likely help disseminate ideas, and be net-positive in terms of impact.

I do think a lot of the things you're suggesting would be valuable and worth doing, on the margin. They're probably not sufficient to close the bulk of the safety gap without theoretical progress on the core problems, but they're still useful.

I'm a bit confused why you're bringing up "safety problems too complex for ourselves" because it sounds like you don't think there are any important safety problems like that, based on the sentences that came before this one?

The "safety problems too complex for ourselves" are things like the fusion power generator scenario - i.e. safety problems in specific situations or specific applications. The safety problems which I don't think are too complex are the general versions, i.e. how to build a generally-aligned AI.

An analogy: finding shortest paths in a billion-vertex graph is far too complex for me. But writing a general-purpose path-finding algorithm to handle that problem is tractable. In the same way, identifying the novel safety problems of some new technology will sometimes be too complex for humans. But writing a general-purpose safety-reasoning algorithm (i.e. an aligned AI) is tractable, I expect.

I'm talking about the broad sense of "corrigible" described in e.g. the beginning of this post.

Ah ok, the suggestion makes sense now. That's a good idea. It's still punting a lot of problems until later, and humans would still be largely responsible for solving those problems later. But it could plausibly help with the core problems, without any obvious trade-off (assuming that the AI/oracle actually does end up pointed at corrigibility).

Comment by johnswentworth on Alignment By Default · 2020-08-20T02:46:14.011Z · score: 2 (1 votes) · LW · GW

Do you have in mind a specific aspect of human values that couldn't be represented using, say, the reward function of a reinforcement learning agent AI?

It's not the function-representation that's the problem, it's the type-signature of the function. I don't know what such a function would take in or what it would return. Even RL requires that we specify the input-output channels up-front.

All we need to do is figure out the unknown unknowns that are load-bearing in the Research Assistant scenario, then assistant can help us with the rest of the unknown unknowns.

This translates in my head to "all we need to do is solve the main problems of alignment, and then we'll have an assistant which can help us clean up any easy loose ends".

More generally: I'm certainly open to the idea of AI, of one sort or another, helping to work out at least some of the problems of alignment. (Indeed, that's very likely a component of any trajectory where alignment improves over time.) But I have yet to hear a convincing case that punting now actually makes long-run alignment more likely, or even that future tools will make creation of aligned AI easier/more likely relative to unaligned AI. What exactly is the claim here?

If solving FAI necessarily involves reasoning about things which are beyond humans (which seems to be what you're getting at with the "unknown unknowns" stuff), what is the alternative?

I don't think solving FAI involves reasoning about things beyond humans. I think the AIs themselves will need to reason about things beyond humans, and in particular will need to reason about complex safety problems on a day-to-day basis, but I don't think that designing a friendly AI is too complex for humans.

Much of the point of AI is that we can design systems which can reason about things too complex for ourselves. Similarly, I expect we can design safe systems which can reason about safety problems too complex for ourselves.

Corrigible AI should be able to improve its corrigibility with increased capabilities the same way it can improve its alignment with increased capabilities.

What notion of "corrigible" are you using here? It sounds like it's not MIRI's "the AI won't disable its own off-switch" notion.

Comment by johnswentworth on Radical Probabilism · 2020-08-19T21:05:12.662Z · score: 9 (3 votes) · LW · GW

(Note that Bayes-with-a-side-channel does not imply conditions such as convergence and calibration; so, Jeffrey's theory of rationality is more demanding.)

What about the converse? Is a radical probabilist always behaviorally equivalent to a Bayesian with a side-channel? Or to some sequence of virtual evidence updates?

You seem to say so later on - "And remember, every update is a Bayesian update, with the right virtual evidence" - but I don't think this was proven?

Comment by johnswentworth on Survey Results: 10 Fun Questions for LWers · 2020-08-19T16:59:43.261Z · score: 6 (4 votes) · LW · GW

But it's weirdly bimodal, and I didn't have a theory that predicted that.

I had a comment a year ago which would predict this. The idea is that we generate value from slack by using that slack to take unreliable/high-noise opportunities. But as long as the noise in those high-noise opportunities is independent, we should usually be able to take advantage of N^2 opportunities using N units of slack (because noise in a sum scales with the square root of the number of things summed, roughly speaking). In other words, slack has increasing marginal returns: the tenth unit of slack is far more valuable than the second unit.
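
A quick toy simulation of that scaling argument (the numbers are arbitrary, just to show the square-root behavior):

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy model: each opportunity pays an expected 1 unit plus independent noise (std 1).
    # The noise in the total across k opportunities has std ~sqrt(k), so the slack needed
    # to absorb a bad draw grows like sqrt(k): N units of slack cover ~N^2 opportunities.
    for k in [4, 16, 64, 256]:
        totals = rng.normal(loc=1.0, scale=1.0, size=(100_000, k)).sum(axis=1)
        print(f"k={k:4d} opportunities: std of total = {totals.std():6.2f}  (sqrt(k) = {np.sqrt(k):.1f})")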

That suggests that individual people should either:

  • specialize in having lots of slack and using lots of unreliable opportunities (so they can accept N^2 unreliability trade-offs with only N units of slack), or
  • specialize in having little slack and making everything in their life highly reliable (because a relatively large amount of slack would need to be set aside for just one high-noise opportunity).

Comment by johnswentworth on Alignment By Default · 2020-08-18T23:17:13.649Z · score: 4 (2 votes) · LW · GW

LGTM

Comment by johnswentworth on Alignment By Default · 2020-08-18T17:51:44.034Z · score: 2 (1 votes) · LW · GW

Can you be more specific about the theoretical bottlenecks that seem most important?

Type signature of human values is the big one. I think it's pretty clear at this point that utility functions aren't the right thing, that we value things "out in the world" as opposed to just our own "inputs" or internal state, that values are not reducible to decisions or behavior, etc. We don't have a framework for what-sort-of-thing human values are. If we had that - not necessarily a full model of human values, just a formalization which we were confident could represent them - then that would immediately open the gates to analysis, to value learning, to uncertainty over values, etc.

The key question is which problem is easier: the alignment problem, or the safe-use-of-dangerous-tools problem. All else equal, if you think the alignment problem is hard, then you should be more willing to replace alignment work with tool safety work. If you think the alignment problem is easy, you should discourage dangerous tools in favor of frontloaded work on a more paternalistic "not just benign, actually aligned" AI.

A good argument, but I see the difficulties of safe tool AI and the difficulties of alignment as mostly coming from the same subproblem. To the extent that that's true, alignment work and tool safety work need to be basically the same thing.

On the tools side, I assume the tools will be reasoning about systems/problems which humans can't understand - that's the main value prop in the first place. Trying to collapse the complexity of those systems into a human-understandable API is inherently dangerous: values are complex, the system is complex, their interaction will inevitably be complex, so any API simple enough for humans will inevitably miss things. So the only safe option which can scale to complex systems is to make sure the "tools" have their own models of human values, and use those models to check the safety of their outputs... which brings us right back to alignment.

Simple mechanisms like always displaying an estimated probability that I'll regret asking a question would probably help, but I'm mainly worried about the unknown unknowns, not the known unknowns. That's part of what I mean when I talk about marginal improvements vs closing the bulk of the gap - the unknown unknowns are the bulk of the gap.

(I could see tools helping in a do-the-same-things-but-faster sort of way, and human-mimicking approaches in particular are potentially helpful there. On the other hand, if we're doing the same things but faster, it's not clear that that scenario really favors alignment research over the Leeroy Jenkins of the world.)

In some sense I think the argument for paternalism is self-refuting, because the argument is essentially that humans can't be trusted, but I'm not sure the total amount of responsibility we're assigning to humans has changed--if the first system is to be very paternalistic, that puts an even greater weight of responsibility on the shoulders of its designers to be sure and get it right.

This in particular I think is a strong argument, and the die-rolls argument is my main counterargument. 

We can indeed partially avoid the die-rolls issue by only using the system a limited number of times - e.g. to design another system. That said, in order for the first system to actually add value here, it has to do some reasoning which is too complex for humans - which brings back the problem from earlier, about the inherent danger of collapsing complex values and systems into a simple API. We'd be rolling the dice twice - once in designing the first system, once in using the first system to design the second - and that second die-roll in particular has a lot of unknown unknowns packed into it.

Let's make the later AIs corrigible then. Perhaps our initial AI can give us both a corrigibility oracle and a values oracle. (Or later AIs could use some other approach to corrigibility.)

I have yet to see a convincing argument that corrigibility is any easier than alignment itself. It seems to suffer from the same basic problem: the concept of "corrigibility" has a lot of hidden complexity, especially when it interacts with embeddedness. To the extent that we're relying on corrigibility, I'd ideally like it to improve with capabilities, in the same way and for the same reasons as I'd like alignment to improve with capabilities. Do you know of an argument that it's easier?

Comment by johnswentworth on Alignment By Default · 2020-08-16T21:49:12.007Z · score: 2 (1 votes) · LW · GW

To the extent that this is comparable to the branching pattern of a tree (which is a comparison you make in the post), I would argue that it increases rather than lessens the reason to worry: much like a tree's branch structure is chaotic, messy, and overall high-entropy, I expect human values to look similar, and therefore not really encompass any kind of natural category.

Bit of a side-note, but the high entropy of tree branching comes from trees using the biological equivalent of random number generators when "deciding" when/whether to form a branch. The distribution of branch length-ratios/counts/angles is actually fairly simple and stable, and is one of the main characteristics which makes particular tree species visually distinctive. See L-systems for the basics, or speedtree for the industrial-grade version (and some really beautiful images).

It's that distribution which is the natural abstraction - i.e. the distribution summarizes information about branching which is relevant to far-away trees of the same species.

Comment by johnswentworth on Alignment By Default · 2020-08-16T21:22:28.094Z · score: 3 (2 votes) · LW · GW

My model of abstraction is that high-level abstractions summarize all the information from some chunk of the world which is relevant "far away". Part of that idea is that, as we "move away" from the information-source, most information is either quickly wiped out by noise, or faithfully transmitted far away. The information which is faithfully transmitted will usually be present across many different channels; that's the main reason it's not wiped out by noise in the first place. Obviously this is not something which necessarily applies to all possible systems, but intuitively it seems like it should apply to most systems most of the time: information which is not duplicated across multiple channels is easily wiped out by noise.
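
One rough way to write the core claim (my notation: X_chunk is the chunk of the world, A(X_chunk) its high-level summary, X_far the far-away variables):

    I\left(X_{\text{far}};\; X_{\text{chunk}} \mid \mathcal{A}(X_{\text{chunk}})\right) \approx 0

i.e. once you know the summary, the remaining low-level details of the chunk tell you essentially nothing more about things far away.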

Comment by johnswentworth on Alignment By Default · 2020-08-16T21:15:51.915Z · score: 2 (1 votes) · LW · GW

I think there's a subtle confusion here between two different claims:

  • Human values evolved as a natural abstraction of some territory.
  • Humans' notion of "human values" is a natural abstraction of humans' actual values.

It sounds like your comment is responding to the former, while I'm claiming the latter.

A key distinction here is between humans' actual values, and humans' model/notion of our own values. Humans' actual values are the pile of heuristics inherited from evolution. But humans also have a model of their values, and that model is not the same as the underlying values. The phrase "human values" necessarily points to the model, because that's how words work - they point to models. My claim is that the model is a natural abstraction of the actual values, not that the actual values are a natural abstraction of anything.

This is closely related to this section from the OP:

Human values are basically a bunch of randomly-generated heuristics which proved useful for genetic fitness; why would they be a “natural” abstraction? But remember, the same can be said of trees. Trees are a complicated pile of organic spaghetti code, but “tree” is still a natural abstraction, because the concept summarizes all the information from that organic spaghetti pile which is relevant to things far away. In particular, it summarizes anything about one tree which is relevant to far-away trees.

Roughly speaking, the concept of "human values" summarizes anything about the values of one human which is relevant to the values of far-away humans.

Does that make sense?

Comment by johnswentworth on Alignment By Default · 2020-08-16T17:34:54.257Z · score: 2 (1 votes) · LW · GW

Ah, I see. You're saying that the embedding might not actually be simple. Yeah, that's plausible.

Comment by johnswentworth on Alignment By Default · 2020-08-16T15:19:26.968Z · score: 2 (1 votes) · LW · GW

Trees are a fairly natural concept because "tall green things" and "Lifeforms that are >10% cellulose" point to a similar set of objects. There are many different simple boundaries in concept-space that largely separate trees from non trees. Trees are tightly clustered in thing-space.

That's not quite how natural abstractions work. There are lots of edge cases which are sort-of-trees-but-sort-of-not: logs, saplings/acorns, petrified trees, bushes, etc. Yet the abstract category itself is still precise.

An analogy: consider a Gaussian cluster model. Any given cluster will have lots of edge cases, and lots of noise in the individual points. But the cluster itself - i.e. the mean and variance parameters of the cluster - can still be precisely defined. Same with the concept of "tree", and (I expect) with "human values".
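
A tiny numerical version of that analogy (toy data, just to illustrate that cluster parameters stay crisp even though individual points are ambiguous):

    import numpy as np

    rng = np.random.default_rng(0)

    # Two overlapping 1-D Gaussian clusters: plenty of ambiguous edge-case points...
    cluster_a = rng.normal(loc=0.0, scale=1.0, size=50_000)
    cluster_b = rng.normal(loc=3.0, scale=1.0, size=50_000)

    # ...but the cluster-level parameters (mean, std) are pinned down very precisely.
    print(f"cluster A: mean = {cluster_a.mean():.2f}, std = {cluster_a.std():.2f}")
    print(f"cluster B: mean = {cluster_b.mean():.2f}, std = {cluster_b.std():.2f}")

    # A point near x = 1.5 could plausibly belong to either cluster - a genuine edge case -
    # yet that ambiguity doesn't make the clusters themselves any less well-defined.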

In general, we can have a precise high-level concept without a hard boundary in the low-level space.

Comment by johnswentworth on Alignment By Default · 2020-08-16T15:09:39.321Z · score: 2 (1 votes) · LW · GW

Note that the examples in the OP are from a generative adversarial network. If its notion of "tree" were just "green things", the adversary should be quite capable of exploiting that.

You will have successfully narrowed human values down to within the range of things that are strongly correlated with human values in the training environment. If you take this signal and apply enough optimization pressure, you are going to get the equivalent of a universe tiled with tiny smiley faces.

The whole point of the "natural abstractions" section of the OP  is that I do not think this will actually happen. Off-distribution behavior is definitely an issue for the "proxy problems" section of the post, but I do not expect it to be an issue for identifying natural abstractions.

Comment by johnswentworth on Alignment By Default · 2020-08-16T15:01:38.289Z · score: 2 (1 votes) · LW · GW

This is entirely correct.

Comment by johnswentworth on Alignment By Default · 2020-08-16T14:51:56.374Z · score: 4 (2 votes) · LW · GW

Oh this is fascinating. This is basically correct; a high-level model space can include models which do not correspond to any possible low-level model.

One caveat: any high-level data or observations will be consistent with the true low-level model. So while there may be natural abstract objects which can't exist, and we can talk about those objects, we shouldn't see data supporting their existence - e.g. we shouldn't see a real-world voting system behaving like it satisfies all of Arrow's desiderata.

Comment by johnswentworth on Alignment By Default · 2020-08-15T20:21:56.516Z · score: 2 (1 votes) · LW · GW

I mostly agree with you here. I don't think the chances of alignment by default are high. There are marginal gains to be had, but to get a high probability of alignment in the long term we will probably need actual understanding of the relevant philosophical problems.

Comment by johnswentworth on Alignment By Default · 2020-08-15T18:27:07.006Z · score: 6 (3 votes) · LW · GW

This comment definitely wins the award for best comment on the post so far. Great ideas, highly relevant links.

I especially like the deliberate noise idea. That plays really nicely with natural abstractions as information-relevant-far-away: we can intentionally insert noise along particular dimensions, and see how that messes with prediction far away (either via causal propagation or via loss of information directly). As long as most of the noise inserted is not along the dimensions relevant to the high-level abstraction, denoising should be possible. So it's very plausible that denoising autoencoders are fairly-directly incentivized to learn natural abstractions. That'll definitely be an interesting path to pursue further.
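
For concreteness, a standard denoising-autoencoder objective (the usual textbook form; Enc/Dec are the encoder/decoder with parameters theta, and epsilon is the injected noise) is roughly:

    \min_\theta \; \mathbb{E}_{x,\,\epsilon}\left[\, \big\| x - \mathrm{Dec}_\theta\big(\mathrm{Enc}_\theta(x + \epsilon)\big) \big\|^2 \,\right]

The refinement above would amount to choosing the distribution of epsilon deliberately - concentrating it along particular dimensions (or propagating it causally) rather than adding isotropic noise - and seeing which information the encoder still preserves.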

Assuming that the denoising autoencoder objective more-or-less-directly incentivizes natural abstractions, further refinements on that setup could very plausibly turn into a useful "ease of interpretability" objective.

Comment by johnswentworth on Alignment By Default · 2020-08-15T18:13:23.623Z · score: 2 (1 votes) · LW · GW

Thanks for the comments, these are excellent!

Valid complaint on the title, I basically agree. I only give the path outlined in the OP a ~10% chance of working without any further intervention by AI safety people, and I definitely agree that there are relatively-tractable-seeming ways to push that number up on the margin. (Though those would be marginal improvements only; I don't expect them to close the bulk of the gap without at least some progress on theoretical bottlenecks.)

I am generally lukewarm about human-simulation approaches to alignment; the fusion power generator scenario is a prototypical example of my concerns here (also see this comment on it, which explains what I see as the key take-away). The idea of simulating a human doing moral philosophy is a bit different than what I usually imagine, though; it's basically like taking an alignment researcher and running them on faster hardware. That doesn't directly solve any of the underlying conceptual problems - it just punts them to the simulated researchers - but it is presumably a strict improvement over a limited number of researchers operating slowly in meatspace. Alignment research ems!

Suppose we restrict ourselves to feeding our system data from before the year 2000. There should be a decent representation of human values to be learned from this data, yet it should be quite difficult to figure out the specifics of the 2020+ data-collection process from it.

I don't think this helps much. Two examples of "specifics of the data collection process" to illustrate:

  • Suppose our data consists of human philosophers' writing on morality. Then the "specifics of the data collection process" includes the humans' writing skills and signalling incentives, and everything else besides the underlying human values.
  • Suppose our data consists of humans' choices in various situations. Then the "specifics of the data collection process" includes the humans' mistaken reasoning, habits, divergence of decision-making from values, and everything else besides the underlying human values.

So "specifics of the data collection process" is a very broad notion in this context. Essentially all practical data sources will include a ton of extra information besides just their information on human values.

Second, one way to check on things is to deliberately include a small quantity of mislabeled data, then once the system is done learning, check whether its model correctly recognizes that the mislabeled data is mislabeled (and agrees with all data that is correctly labeled).

I like this idea, and I especially like it in conjunction with deliberate noise as an unsupervised learning trick. I'll respond more to that on the other comment.

A third way which you don't mention is to use the initial aligned AI as a "human values oracle" for subsequent AIs.

I have mixed feelings on this.

My main reservation is that later AIs will never be more precisely aligned than the oracle. That first AI may be basically-correctly aligned, but it still only has so much data and probably only rough algorithms, so I'd really like it to be able to refine its notion of human values over time. In other words, the oracle's notion of human values may be accurate but not precise, and I'd like precision to improve as more data comes in and better algorithms are found. This is especially important if capabilities rise over time and greater capabilities require more precise alignment.

That said, as long as the oracle's alignment is accurate, we could use your suggestion to make sure that actions are OK for all possible human-values-notions within uncertainty. That's probably at least good enough to avoid disaster. It would still fall short of the full potential value of AI - there'd be missed opportunities, where the system has to be overly careful because its notion of human values is insufficiently precise - but at least no disaster.

Finally, on deceptive behavior: I use the phrase a bit differently than I think most people do these days. My prototypical image isn't of a mesa-optimizer. Rather, I imagine people iteratively developing a system, trying things out, keeping things which seem to work, and thereby selecting for things which look good to humans (regardless of whether they're actually good). In that situation, we'd expect the system to end up doing things which look good but aren't, because the human developers accidentally selected for that sort of behavior. It's a "you-get-what-you-measure" problem, rather than a mesa-optimizers problem.

Comment by johnswentworth on Alignment By Default · 2020-08-15T16:35:36.798Z · score: 4 (2 votes) · LW · GW

I'm talking about everyday situations. Like "if I push on this door, it will open" or "by next week my laundry hamper will be full" or "it's probably going to be colder in January than June". Even with quantum mechanics, people do figure out the pattern and build some intuition, but they need to see a lot of data on it first and most people never study it enough to see that much data.

In places where the humans in question don't have much first-hand experiential data, or where the data is mostly noise, that's where human prediction tends to fail. (And those are also the cases where we expect learning systems in general to fail most often, and where we expect the system's priors to matter most.) Another way to put it: humans' priors aren't great, but in most day-to-day prediction problems we have more than enough data to make up for that.

Comment by johnswentworth on Alignment By Default · 2020-08-15T02:41:23.380Z · score: 2 (1 votes) · LW · GW

Regarding your first pass check for naturalness being whether humans can understand it: strike me thoroughly puzzled. Isn't one of the core points of the reductionism sequence that, while "thor caused the thunder" sounds simpler to a human than Maxwell's equations (because the words fit naturally into a human psychology), one of them is much "simpler" in an absolute sense than the other (and is in fact true).

Despite humans giving really dumb verbal explanations (like "Thor caused the thunder"), we tend to be pretty decent at actually predicting things in practice.

The same applies to natural abstractions. If I ask people "is 'tree' a natural category?" then they'll get into some long philosophical debate. But if I show someone five pictures of trees, then show them five other pictures which are not all trees, and ask them which of the second set are similar to the first set, they'll usually have no trouble at all picking the trees in the second set.

I thought the mesa optimisers would definitely arise during the training

If you're optimizing all the parameters simultaneously at runtime, then there is no training. Whatever parameters were learned during "training" would just be overwritten by the optimal values computed at runtime.

Comment by johnswentworth on Alignment By Default · 2020-08-15T02:25:45.483Z · score: 6 (3 votes) · LW · GW

Sure,  works for what I'm saying, assuming that sum-over-time only includes the timesteps taken thus far. In that case, I'm saying that either:

  • the mesa optimizer doesn't appear in , in which case the problem is fixed by fully optimizing everything at every timestep (i.e. by using ), or
  • the mesa optimizer does appear in , in which case the problem was really an outer alignment issue all along.

Comment by johnswentworth on Alignment By Default · 2020-08-15T02:05:25.597Z · score: 2 (1 votes) · LW · GW

I'm not sure about whether corrigibility is a natural abstraction. It's at least plausible, and if it is, then corrigibility by default should work under basically-similar assumptions.

Under your theory, is this not an issue because of the sparsity of natural abstractions near human values?

Basically, yes. We want the system to use its actual model of human values as a proxy for its objective, which is itself a proxy for human values. So the whole strategy will fall apart in situations where the system converges to the true optimum of its objective. But in situations where a proxy for the system's true optimum would be used (e.g. weak optimization or insufficient data to separate proxy from true), the model of human values may be the best available proxy.

Comment by johnswentworth on Alignment By Default · 2020-08-14T23:20:32.032Z · score: 4 (2 votes) · LW · GW

When I say "optimize all the parameters at runtime", I do not mean "take one gradient step in between each timestep". I mean, at each timestep, fully optimize all of the parameters. Optimize  all the way to convergence before every single action.

Think back to the central picture of mesa-optimization (at least as I understand it). The mesa-optimizer shows up because some data is only available at runtime, not during training, so it has to be processed at runtime using parameters selected during training. In the online RL setup you sketch here, "runtime" for mesa-optimization purposes is every time the system chooses its action - i.e. every timestep - and "training" is all the previous timesteps. A mesa-optimizer should show up if, at every timestep, some relevant new data comes in and the system has to process that data in order to choose the optimal action, using parameters inherited from previous timesteps.

Now, suppose we fully optimize all of the parameters at every timestep. The objective function for this optimization would presumably be , with the sum taken over all previous data points, since that's what the RL setup is approximating.

This optimization would probably still "find" the same mesa-optimizer as before, but now it looks less like a mesa-optimizer problem and more like an outer alignment problem: that objective function is probably not actually the thing we want. The fact that the true optimum for that objective function probably has our former "mesa-optimizer" embedded in it is a pretty strong signal that that objective function itself is not outer aligned; the true optimum of that objective function is not really the thing we want.

Does that make sense?

Comment by johnswentworth on Generalized Efficient Markets in Political Power · 2020-08-14T21:55:15.155Z · score: 2 (1 votes) · LW · GW

Great points!

Do you think there's a more reliable way (for an outsider like myself, who's not able to, I dunno, go and ask people in a dive bar what they think) to get the lay of the political land in a particular point in space?

This is wayyyy outside my zone of expertise, but I would look for specialist-oriented publications - e.g. newsletters specifically targeted at lobbyists/policymakers, or political information in the industry publications of special-interest industries.

If, when faced with a choice of writing about (a) things that are real but dull vs (b) things that are not real but get clicks, no one has an incentive to do (a), how do you form a view of the world?

I'd say the key is to generate your own questions, then proactively look for the answers rather than waiting around for whatever information comes to you. There's plenty of good information out there, it just isn't super-viral, so you have to go looking for it.

All of these people have different Schelling points!

Important point here: these people don't actually have different Schelling points. They presumably all agree that if Alice wins the election, then whatever Alice signs into law will be the new Schelling point. What these people disagree on is their expectations for what the future Schelling point will be.

Comment by johnswentworth on Alignment By Default · 2020-08-14T21:36:54.949Z · score: 2 (1 votes) · LW · GW

The way I understood it, the main reason a mesa-optimizer shows up in the first place is that some information is available at runtime which is not available during training, so some processing needs to be done at runtime to figure out the best action given the runtime-info. The mesa-optimizer handles that processing. If we directly optimize over all parameters at runtime, then there's no place for that to happen.

What am I missing?